feat(lambda-runtime): log non-2xx Lambda Runtime API responses with status and body by darklight3it · Pull Request #1105 · aws/aws-lambda-rust-runtime

darklight3it · 2026-02-23T23:23:37Z

Summary

When the Lambda Runtime API returns a non-2xx response (e.g. HTTP 410 with {"errorMessage":"Invoke timeout","errorType":"InvokeTimeout"}), the previous implementation silently discarded the response body. Customers had no visibility into why their handler response was rejected.

Solution

Read and log the response body for non-2xx responses from the Lambda Runtime API.

The main challenge is that RuntimeApiClientFuture is a pub enum without #[non_exhaustive], so adding a new variant would be a breaking change for any downstream code that exhaustively matches on it. To avoid this, we instead spawn a background task via tokio::task::spawn to read the body asynchronously without modifying the enum's public surface.

Changes

RuntimeApiClientService now captures a tokio::runtime::Handle at construction time via Handle::try_current(), with a fallback minimal current-thread runtime for non-tokio contexts (e.g. tests)
On non-2xx responses, a background task is spawned to collect and log the response body including the HTTP status code
Added a log_or_print! macro in lib.rs that dispatches to tracing::error! when a tracing dispatcher is set, and falls back to eprintln! otherwise — consistent with the existing codebase pattern
Added unit tests covering 2xx success, 410 InvokeTimeout, and inner service errors, using httpmock

Example log output

ERROR lambda_runtime::layers::api_client: Lambda Runtime API returned non-200 response status=410 body={"errorMessage":"Invoke timeout","errorType":"InvokeTimeout"}

jlizen · 2026-02-24T00:18:29Z

lambda-runtime/src/layers/api_client.rs

-    First(#[pin] F, Arc<Client>),
-    Second(#[pin] BoxFuture<'static, Result<http::Response<Incoming>, BoxError>>),
+#[pin_project(project = RuntimeApiClientFutureStateProj)]
+pub enum RuntimeApiClientFutureState<F> {


This will be a breaking change. It is an internal type, that probably nobody is using, but it is part of our public API.

If we did want to go this route, I'd suggest making a new one, and marking the old one as deprecated. And cutting over all usage.

Given that probably nobody is using it, the new type is fairly painless. I don't mind that option.

That would only require a minor version bump (the deprecated old enum).

It wouldn't be breaking if we had tagged it #[non_exhaustive], but we missed this one.

Definitely if we make a new type, we should tag it non exhaustive :)

sorry I was already fixing that. It's still RuntimeApiClientFutureState. The CR was still a draft.

we changed the internals ase we said in the other CR. Cannot make it pub (crate) from api_client but still it isn't exposed in mod.rs

I discussed this with you and rcoh@ offline -

Technically this is a breaking change, whether we change the innards of RuntimeApiClientFuture or substitute it for a different enum. The type is accessible as an associated type on the RuntimeApiClientService which is publicly visible.

None of our internal machinery relies on this specific type surface - we are just polling the service that the runtime api client exposes. I'm pretty sure whoever implemented this just didn't realize that pub enum exposed all of its variants and field contents.

It would generally be risky to try to deploy something in the runtime api client, that the lambda-runtime needs to be upgraded in lockstep to, else they will have compile errors. And then that's fragile to do except across a breaking version boundary.

BUT - we aren't in that situation, the only people that would be impacted would be people that built their own higher level runtime on top of our own, or otherwise are doing things with the innards of the runtime client.

I've confirmed via code search that the only cases of RuntimeApiClientFuture being used are forks of the api client.

So, with that in mind, I think this is probably a fairly low risk change to make.

And then, the lowest risk change available would be to just keep the current future name, since then anybody that is naming the future as part of a tower service stack, wouldn't be broken.

And with the future, it is safer to add a new field to the existing variant, rather than add a new variant. Since that won't break match statements.

Unfortunately, that doesn't get us out of this mess for next time, but it is the least risky.

That said - what field to add - well, we already have the contents we want inside the RuntimeApiClientFuture, the issue is that we can't poll it to completion from inside the runtime (specifically, the body field of the http::Response future stored inside Second)

For that, we really would prefer a third state transition which is riskier than a new field .

I guess, we could also just change the second variant's contents to an enum (which is clearly sillier than a third variant, but less breaking-prone maybe?)

Maybe instead we could do something hacky like, check the status code, and bubble up the inner future inside of the BoxError, and downcast the BoxError from inside our managing runtime to decide whether to finish polling the body?

And then, another option we discussed was just storing a client current thread runtime or something to drive the future, but that's pretty awful... but we don't currently require that this be run inside a tokio runtime, so i don't think we can rely on tokio::spawn.

I guess we could also do something with a sync thread and block_on, but that is pretty terrible :)

I went with the tokio::spawn idea, the bubbling up of the error required a changing in then RuntimeApiClientFuture signature that while respecting the current enum and variant structure would have changed the signature constituting a breaking change.

This seems to be the only way for this to work without any signature change. The integration tests are working and the test I added are working too.

…useful for the customer for timeouts

jlizen

I think we have something workable here. I'll ping my colleague to take a quick peek as well for landmine avoidance.

Some changes needed (assuming he doesn't suggest a different route):

we need to pass the tokio runtime handle through to the the future so that it doesn't panic on being awaited in non-tokio (one option is a static oncelock holding it)
probably we should reflect against the logged output in our test

jlizen · 2026-02-24T16:49:53Z

lambda-runtime/src/layers/api_client.rs

+            Err(_) => {
+                // Fallback: build a minimal current-thread runtime and keep it alive alongside
+                // the service. This should only happen outside of a tokio context (e.g. some tests).
+                let rt = Arc::new(


Is there a reason that the rt needs to be inside an arc? It is stored inside a struct that doesn't have Clone anyway

Arc is necessary because:

RuntimeApiClientService does implement Clone, and it needs to be cloneable for concurrent workers (line 222 in runtime.rs).

The Handle becomes invalid if the Runtime is dropped. Arc ensures the runtime stays alive as long as any service instance exists.

Multiple cloned services share the same fallback runtime via reference counting instead of creating separate runtimes.

jlizen · 2026-02-24T16:54:10Z

lambda-runtime/src/layers/api_client.rs

+                        let status = resp.status();
+                        // Spawn a background task to read and log the error body without
+                        // blocking the current poll or requiring a new enum variant.
+                        Handle::current().spawn(async move {


This will panic if the future is awaited outside of a tokio runtime. It isn't actually using the handle we store on the client.

That said, there isn't a GREAT way to pass through the runtime handle to the future without changing the public signature, since we are stuck with an enum with wide open variants and inner fields. We would essentially need to add the handle to both the first and second, and have first plumb it through to second. IE add a field to both variants.

The main non breaking way I can think of would be to store the runtime in a static OnceLock (which you initialize inside RuntimeApiClientService::new(), and it stores the runtime and returns the handle via get_or_init. So nothing stored on the client itself anymore.

At that point, you won't need to carry any data in the future. And the runtime is something of a global resource, so it's not too strange..?

One thing we might need to do would be, add a check to see if a caller-provided tokio runtime is still alive, since it might shut down early and cause a panic? Not sure how much we need to guard against that... Will get a second opinion

is the only thing you need to do here log the message? Rather than doing it this way, can you send the resp over a channel to someone else to read it out and print it?

One "dumb" option is to use the response.extensions_mut() to get the channel (or runtime handle) in there.

That avoids the hazard of attempting to read it here from the thread local.

Yes we only need to log the message.

One concern is, the place we would want to store a runtime handle or channel receiver in order to write into response extensions, is the underlying runtime api client, which is also wide open as pub / non-exhaustive.

So it wouldn't necessarily sidestep the need for static state.

A nicer option might be, instead use a tokio::task_local! at the point where we are spawning the tokio tasks, to inject a mpsc::Sender. And then, we can listen on it in a tokio::select! loop alongside its work loop.

That reduces the hazard of of the thread local reads, right @rcoh ? It also has the side benefit of allowing us to enrich the logs with per-worker context, since we have a worker id span instrumented there.

And then, inside the RuntimeApiClientFuture, where we read the response from second, we can (perhaps behind cfg(feature="concurrency-tokio") check if LocalKey::try_get returns something, and if yes, write the response to the channel to let the worker-managed channel receive loop await the body and write out its log?

I guess we could also just use tokio::runtime::Handle::current() inside the lambda_runtime_api_client gated behind tokio_concurrency and drill that feature down? IE use .map_ok on the request future, get the handle, and shove it in response extensions, and then have the RuntimeApiClientFuture check extensions and spawn the task.

Feels more infectious though than just task_locals, and it still has some weird edge cases if anybody is instantiating the lambda_runtime_api_client::Client directly, with the feature enabled, but not on a tokio runtime. (probably fine, but technically breaking if somebody has --all-features enabled)

jlizen · 2026-02-24T16:54:52Z

lambda-runtime/src/layers/api_client.rs

+                                }
+                            }
+                        });
+                        break Ok(());


why are these breaks needed? i would have expected we can just have eg Ok(_) => Ok(())

jlizen · 2026-02-24T16:56:03Z

lambda-runtime/src/layers/api_client.rs

+                        break Ok(());
+                    }
+                    Ok(_) => break Ok(()),
+                    Err(err) => break Err(err),


just to confirm, we are already logging the connection related error higher up? It looked like the C++ impl was logging them all at the same site (though it's fine if we handle differently, with the same ultimate logs)

jlizen · 2026-02-24T16:56:49Z

lambda-runtime/src/lib.rs

 pub use tower::{self, service_fn, Service};

+/// Logs using tracing `error!` if a dispatcher is set, otherwise falls back to `eprintln!`.
+macro_rules! log_or_print {


Cute, can we cut over this usage as well?

jlizen · 2026-02-24T16:57:34Z

lambda-runtime/src/layers/api_client.rs

+    }
+
+    #[tokio::test]
+    async fn test_invoke_timeout_410_is_logged_and_succeeds() {


This isn't actually asserting that anything is logged, we probably want to use #[traced_test] to reflect against log output

I know. I can add this.

FullyTyped · 2026-02-24T23:03:34Z

Why not do this in the actual client a-la the PR I was working on? That will be effectively the same pattern as the Managed Runtimes do it, by delegating to the C++ client, which will then log.

That means we won't be breaking anyone along the way.

jlizen · 2026-02-24T23:18:51Z

That will be effectively the same pattern as the Managed Runtimes do it, by delegating to the C++ client, which will then log.

We could accomplish this, but it would be a bit awkward since it would require that we use tokio::runtime::Handle::current() directly, which in turn probably needs to be behind a tokio_concurrency feature gate since we don't force tokio usage in general. And even then, we might want to guard against non-tokio-runtime-contexts calling the client with --all-features enabled.

I'm not against it though, necessarily.

Interested in your thoughts about this thread @FullyTyped : #1105 (comment)

FullyTyped · 2026-02-24T23:40:15Z

Why would we need to do that?

I am under the impression this is universal and doesn't need us to hide it under concurrency.

IIRC the client impl doesn't depend on the tolio runtime at all.

darklight3it · 2026-02-25T00:05:45Z

Why would we need to do that?

I am under the impression this is universal and doesn't need us to hide it under concurrency.
on this I agree with @FullyTyped . The logging of the error should happen also in the non multiconcurrent path.

However I want to point out that tokio::spawn in this context is a workaround because we cannot change in any way RuntimeApiClientFuture without breaking existing code.

jlizen · 2026-02-25T01:14:35Z

Why would we need to do that?

This is attempting to sidestep a breaking change in the RuntimeApiClientFuture by adding a new variant.

The logging of the error should happen also in the non multiconcurrent path

Ah, good to know... in that case tokio::spawn based solutions will be pretty awkward since we don't want to assume usage of tokio runtime for non-LMI path. Really we would prefer modeling things in a single future state machine, like Quinn is pointing out.

That said, I think we could possibly just move this entire check downwards into the runtime api client, like you might have been alluding to Quinn? Since we are already working with a BoxFuture on the Client's side. So as long as we cast whatever information we want to transmit upwards into a BoxError, we could use future combinators?

It would go here

And look something like:

        self.client.request(req).map_err(Into::into)
            .then(|res| async move { 
                match res {
                    Ok(resp) if !resp.status().is_success() => {
                        let status = resp.status();
                        match resp.into_body().await {
                            // convert into a BoxError containing the status and optional body
                           // which is nicely loggable in the upper layer, or you could also log it here.
                           // If you need to specifically log ONLY this case rather than other BoxErrors, you could newtype
                          // a marker type that we downcast the stderror to
                           Err(BoxError::new(format!("my message with status code + body")))
                        }
                    } else {
                        res
                    }

                }
            })
            .boxed()

Does that seem like it might work?

FullyTyped · 2026-02-25T07:59:45Z

I think that this should be within the state machine.

Because all we do here is logging and apparently dataplane gives us a status code, we don't need to change the flow of the state machine. We will know that we time'd out by the status code, and therefore we don't need to consume more of the response (and thus add another state).

jlizen · 2026-02-25T08:01:30Z

If we can get away with checking the status only, then awesome, that makes this whole problem much simpler! I was assuming there was something important in the body. Much nicer to just have a log on non-2xx status when consuming the http:: Response inside the existing second state.

darklight3it · 2026-02-25T08:26:02Z

We can definitely get away reading only the status code for now and logging a specific message. But this is not the functional equivalent of what we are doing in other runtimes where we log also the body.

If you agree to this I would go on with the simple solution and open a new issue about the visibility about RuntimeApiErrorFuture.

At some point we would need to set things straight deprecating that and fixing the issue.

jlizen · 2026-02-25T08:28:10Z

Would the then combinator suggestion that smuggles the body in a BoxError be that much uglier? It seems like a relatively simple way to propagate the body upwards without needing manage a spawned task. The main downside is that it would await inside the foreground task (which probably is fine...?)

No objections to going status only though here, y'all have more context about expectations for cross language runtime parity.

darklight3it · 2026-02-25T09:18:21Z

I think the real solution here is to fix the original sin here deprecating RuntimeApiErrorFuture. Everything on top of it starts to feel an hack.

I would go with the easy fix and open an issue for that.

darklight3it force-pushed the dmelfi/timeout-fixup branch from b17ead3 to 0120ef0 Compare February 24, 2026 00:17

jlizen reviewed Feb 24, 2026

View reviewed changes

darklight3it force-pushed the dmelfi/timeout-fixup branch from 0120ef0 to ef4ded0 Compare February 24, 2026 12:15

darklight3it marked this pull request as ready for review February 24, 2026 12:29

darklight3it force-pushed the dmelfi/timeout-fixup branch from ef4ded0 to 0b88c4b Compare February 24, 2026 12:38

feat: ⚡ Improved Logging when Lambda Runtime Api fails, this will be …

8e6dc81

…useful for the customer for timeouts

darklight3it force-pushed the dmelfi/timeout-fixup branch from 0b88c4b to 8e6dc81 Compare February 24, 2026 12:51

darklight3it mentioned this pull request Feb 24, 2026

Log timeouts when DP responds with timedout #1091

Closed

4 tasks

darklight3it requested review from jlizen and maxday February 24, 2026 12:55

darklight3it self-assigned this Feb 24, 2026

jlizen requested changes Feb 24, 2026

View reviewed changes

Comments

Conversation

darklight3it commented Feb 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Solution

Changes

Example log output

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

darklight3it Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jlizen Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jlizen Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jlizen left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jlizen Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jlizen Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jlizen Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jlizen Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

FullyTyped commented Feb 24, 2026

Uh oh!

jlizen commented Feb 24, 2026

Uh oh!

FullyTyped commented Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

darklight3it commented Feb 25, 2026

Uh oh!

jlizen commented Feb 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

darklight3it commented Feb 23, 2026 •

edited

Loading

darklight3it Feb 24, 2026 •

edited

Loading

jlizen Feb 24, 2026 •

edited

Loading

jlizen Feb 24, 2026 •

edited

Loading

jlizen Feb 24, 2026 •

edited

Loading

jlizen Feb 24, 2026 •

edited

Loading

jlizen Feb 24, 2026 •

edited

Loading

jlizen Feb 24, 2026 •

edited

Loading

FullyTyped commented Feb 24, 2026 •

edited

Loading

jlizen commented Feb 25, 2026 •

edited

Loading

jlizen commented Feb 25, 2026 •

edited

Loading

jlizen commented Feb 25, 2026 •

edited

Loading

darklight3it commented Feb 25, 2026 •

edited

Loading