Skip to content

Comments

feat(lambda-runtime): log non-2xx Lambda Runtime API responses with status and body#1105

Open
darklight3it wants to merge 1 commit intomainfrom
dmelfi/timeout-fixup
Open

feat(lambda-runtime): log non-2xx Lambda Runtime API responses with status and body#1105
darklight3it wants to merge 1 commit intomainfrom
dmelfi/timeout-fixup

Conversation

@darklight3it
Copy link

@darklight3it darklight3it commented Feb 23, 2026

Summary

When the Lambda Runtime API returns a non-2xx response (e.g. HTTP 410 with {"errorMessage":"Invoke timeout","errorType":"InvokeTimeout"}), the previous implementation silently discarded the response body. Customers had no visibility into why their handler response was rejected.

Solution

Read and log the response body for non-2xx responses from the Lambda Runtime API.

The main challenge is that RuntimeApiClientFuture is a pub enum without #[non_exhaustive], so adding a new variant would be a breaking change for any downstream code that exhaustively matches on it. To avoid this, we instead spawn a background task via tokio::task::spawn to read the body asynchronously without modifying the enum's public surface.

Changes

  • RuntimeApiClientService now captures a tokio::runtime::Handle at construction time via Handle::try_current(), with a fallback minimal current-thread runtime for non-tokio contexts (e.g. tests)
  • On non-2xx responses, a background task is spawned to collect and log the response body including the HTTP status code
  • Added a log_or_print! macro in lib.rs that dispatches to tracing::error! when a tracing dispatcher is set, and falls back to eprintln! otherwise — consistent with the existing codebase pattern
  • Added unit tests covering 2xx success, 410 InvokeTimeout, and inner service errors, using httpmock

Example log output

ERROR lambda_runtime::layers::api_client: Lambda Runtime API returned non-200 response status=410 body={"errorMessage":"Invoke timeout","errorType":"InvokeTimeout"}

First(#[pin] F, Arc<Client>),
Second(#[pin] BoxFuture<'static, Result<http::Response<Incoming>, BoxError>>),
#[pin_project(project = RuntimeApiClientFutureStateProj)]
pub enum RuntimeApiClientFutureState<F> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be a breaking change. It is an internal type, that probably nobody is using, but it is part of our public API.

If we did want to go this route, I'd suggest making a new one, and marking the old one as deprecated. And cutting over all usage.

Given that probably nobody is using it, the new type is fairly painless. I don't mind that option.

That would only require a minor version bump (the deprecated old enum).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wouldn't be breaking if we had tagged it #[non_exhaustive], but we missed this one.

Definitely if we make a new type, we should tag it non exhaustive :)

Copy link
Author

@darklight3it darklight3it Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry I was already fixing that. It's still RuntimeApiClientFutureState. The CR was still a draft.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we changed the internals ase we said in the other CR. Cannot make it pub (crate) from api_client but still it isn't exposed in mod.rs

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I discussed this with you and rcoh@ offline -

Technically this is a breaking change, whether we change the innards of RuntimeApiClientFuture or substitute it for a different enum. The type is accessible as an associated type on the RuntimeApiClientService which is publicly visible.

None of our internal machinery relies on this specific type surface - we are just polling the service that the runtime api client exposes. I'm pretty sure whoever implemented this just didn't realize that pub enum exposed all of its variants and field contents.

It would generally be risky to try to deploy something in the runtime api client, that the lambda-runtime needs to be upgraded in lockstep to, else they will have compile errors. And then that's fragile to do except across a breaking version boundary.

BUT - we aren't in that situation, the only people that would be impacted would be people that built their own higher level runtime on top of our own, or otherwise are doing things with the innards of the runtime client.

I've confirmed via code search that the only cases of RuntimeApiClientFuture being used are forks of the api client.

So, with that in mind, I think this is probably a fairly low risk change to make.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And then, the lowest risk change available would be to just keep the current future name, since then anybody that is naming the future as part of a tower service stack, wouldn't be broken.

And with the future, it is safer to add a new field to the existing variant, rather than add a new variant. Since that won't break match statements.

Unfortunately, that doesn't get us out of this mess for next time, but it is the least risky.

Copy link
Collaborator

@jlizen jlizen Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That said - what field to add - well, we already have the contents we want inside the RuntimeApiClientFuture, the issue is that we can't poll it to completion from inside the runtime (specifically, the body field of the http::Response future stored inside Second)

For that, we really would prefer a third state transition which is riskier than a new field .

I guess, we could also just change the second variant's contents to an enum (which is clearly sillier than a third variant, but less breaking-prone maybe?)

Copy link
Collaborator

@jlizen jlizen Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead we could do something hacky like, check the status code, and bubble up the inner future inside of the BoxError, and downcast the BoxError from inside our managing runtime to decide whether to finish polling the body?

And then, another option we discussed was just storing a client current thread runtime or something to drive the future, but that's pretty awful... but we don't currently require that this be run inside a tokio runtime, so i don't think we can rely on tokio::spawn.

I guess we could also do something with a sync thread and block_on, but that is pretty terrible :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with the tokio::spawn idea, the bubbling up of the error required a changing in then RuntimeApiClientFuture signature that while respecting the current enum and variant structure would have changed the signature constituting a breaking change.

This seems to be the only way for this to work without any signature change. The integration tests are working and the test I added are working too.

@darklight3it darklight3it marked this pull request as ready for review February 24, 2026 12:29
Copy link
Collaborator

@jlizen jlizen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we have something workable here. I'll ping my colleague to take a quick peek as well for landmine avoidance.

Some changes needed (assuming he doesn't suggest a different route):

  • we need to pass the tokio runtime handle through to the the future so that it doesn't panic on being awaited in non-tokio (one option is a static oncelock holding it)
  • probably we should reflect against the logged output in our test

Err(_) => {
// Fallback: build a minimal current-thread runtime and keep it alive alongside
// the service. This should only happen outside of a tokio context (e.g. some tests).
let rt = Arc::new(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason that the rt needs to be inside an arc? It is stored inside a struct that doesn't have Clone anyway

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arc is necessary because:

  1. RuntimeApiClientService does implement Clone, and it needs to be cloneable for concurrent workers (line 222 in runtime.rs).

  2. The Handle becomes invalid if the Runtime is dropped. Arc ensures the runtime stays alive as long as any service instance exists.

  3. Multiple cloned services share the same fallback runtime via reference counting instead of creating separate runtimes.

let status = resp.status();
// Spawn a background task to read and log the error body without
// blocking the current poll or requiring a new enum variant.
Handle::current().spawn(async move {
Copy link
Collaborator

@jlizen jlizen Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will panic if the future is awaited outside of a tokio runtime. It isn't actually using the handle we store on the client.

That said, there isn't a GREAT way to pass through the runtime handle to the future without changing the public signature, since we are stuck with an enum with wide open variants and inner fields. We would essentially need to add the handle to both the first and second, and have first plumb it through to second. IE add a field to both variants.

The main non breaking way I can think of would be to store the runtime in a static OnceLock (which you initialize inside RuntimeApiClientService::new(), and it stores the runtime and returns the handle via get_or_init. So nothing stored on the client itself anymore.

At that point, you won't need to carry any data in the future. And the runtime is something of a global resource, so it's not too strange..?

Copy link
Collaborator

@jlizen jlizen Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One thing we might need to do would be, add a check to see if a caller-provided tokio runtime is still alive, since it might shut down early and cause a panic? Not sure how much we need to guard against that... Will get a second opinion

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the only thing you need to do here log the message? Rather than doing it this way, can you send the resp over a channel to someone else to read it out and print it?

One "dumb" option is to use the response.extensions_mut() to get the channel (or runtime handle) in there.

That avoids the hazard of attempting to read it here from the thread local.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes we only need to log the message.

Copy link
Collaborator

@jlizen jlizen Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern is, the place we would want to store a runtime handle or channel receiver in order to write into response extensions, is the underlying runtime api client, which is also wide open as pub / non-exhaustive.

So it wouldn't necessarily sidestep the need for static state.

A nicer option might be, instead use a tokio::task_local! at the point where we are spawning the tokio tasks, to inject a mpsc::Sender. And then, we can listen on it in a tokio::select! loop alongside its work loop.

That reduces the hazard of of the thread local reads, right @rcoh ? It also has the side benefit of allowing us to enrich the logs with per-worker context, since we have a worker id span instrumented there.

And then, inside the RuntimeApiClientFuture, where we read the response from second, we can (perhaps behind cfg(feature="concurrency-tokio") check if LocalKey::try_get returns something, and if yes, write the response to the channel to let the worker-managed channel receive loop await the body and write out its log?

Copy link
Collaborator

@jlizen jlizen Feb 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we could also just use tokio::runtime::Handle::current() inside the lambda_runtime_api_client gated behind tokio_concurrency and drill that feature down? IE use .map_ok on the request future, get the handle, and shove it in response extensions, and then have the RuntimeApiClientFuture check extensions and spawn the task.

Feels more infectious though than just task_locals, and it still has some weird edge cases if anybody is instantiating the lambda_runtime_api_client::Client directly, with the feature enabled, but not on a tokio runtime. (probably fine, but technically breaking if somebody has --all-features enabled)

}
}
});
break Ok(());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why are these breaks needed? i would have expected we can just have eg Ok(_) => Ok(())

break Ok(());
}
Ok(_) => break Ok(()),
Err(err) => break Err(err),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just to confirm, we are already logging the connection related error higher up? It looked like the C++ impl was logging them all at the same site (though it's fine if we handle differently, with the same ultimate logs)

pub use tower::{self, service_fn, Service};

/// Logs using tracing `error!` if a dispatcher is set, otherwise falls back to `eprintln!`.
macro_rules! log_or_print {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cute, can we cut over this usage as well?

}

#[tokio::test]
async fn test_invoke_timeout_410_is_logged_and_succeeds() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't actually asserting that anything is logged, we probably want to use #[traced_test] to reflect against log output

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know. I can add this.

@FullyTyped
Copy link
Collaborator

Why not do this in the actual client a-la the PR I was working on? That will be effectively the same pattern as the Managed Runtimes do it, by delegating to the C++ client, which will then log.

That means we won't be breaking anyone along the way.

@jlizen
Copy link
Collaborator

jlizen commented Feb 24, 2026

That will be effectively the same pattern as the Managed Runtimes do it, by delegating to the C++ client, which will then log.

We could accomplish this, but it would be a bit awkward since it would require that we use tokio::runtime::Handle::current() directly, which in turn probably needs to be behind a tokio_concurrency feature gate since we don't force tokio usage in general. And even then, we might want to guard against non-tokio-runtime-contexts calling the client with --all-features enabled.

I'm not against it though, necessarily.

Interested in your thoughts about this thread @FullyTyped : #1105 (comment)

@FullyTyped
Copy link
Collaborator

FullyTyped commented Feb 24, 2026

Why would we need to do that?

I am under the impression this is universal and doesn't need us to hide it under concurrency.

IIRC the client impl doesn't depend on the tolio runtime at all.

@darklight3it
Copy link
Author

Why would we need to do that?

I am under the impression this is universal and doesn't need us to hide it under concurrency.
on this I agree with @FullyTyped . The logging of the error should happen also in the non multiconcurrent path.

However I want to point out that tokio::spawn in this context is a workaround because we cannot change in any way RuntimeApiClientFuture without breaking existing code.

@jlizen
Copy link
Collaborator

jlizen commented Feb 25, 2026

Why would we need to do that?

This is attempting to sidestep a breaking change in the RuntimeApiClientFuture by adding a new variant.

The logging of the error should happen also in the non multiconcurrent path

Ah, good to know... in that case tokio::spawn based solutions will be pretty awkward since we don't want to assume usage of tokio runtime for non-LMI path. Really we would prefer modeling things in a single future state machine, like Quinn is pointing out.

That said, I think we could possibly just move this entire check downwards into the runtime api client, like you might have been alluding to Quinn? Since we are already working with a BoxFuture on the Client's side. So as long as we cast whatever information we want to transmit upwards into a BoxError, we could use future combinators?

It would go here

And look something like:

        self.client.request(req).map_err(Into::into)
            .then(|res| async move { 
                match res {
                    Ok(resp) if !resp.status().is_success() => {
                        let status = resp.status();
                        match resp.into_body().await {
                            // convert into a BoxError containing the status and optional body
                           // which is nicely loggable in the upper layer, or you could also log it here.
                           // If you need to specifically log ONLY this case rather than other BoxErrors, you could newtype
                          // a marker type that we downcast the stderror to
                           Err(BoxError::new(format!("my message with status code + body")))
                        }
                    } else {
                        res
                    }

                }
            })
            .boxed()

Does that seem like it might work?

@FullyTyped
Copy link
Collaborator

I think that this should be within the state machine.

Because all we do here is logging and apparently dataplane gives us a status code, we don't need to change the flow of the state machine. We will know that we time'd out by the status code, and therefore we don't need to consume more of the response (and thus add another state).

@jlizen
Copy link
Collaborator

jlizen commented Feb 25, 2026

If we can get away with checking the status only, then awesome, that makes this whole problem much simpler! I was assuming there was something important in the body. Much nicer to just have a log on non-2xx status when consuming the http:: Response inside the existing second state.

@darklight3it
Copy link
Author

We can definitely get away reading only the status code for now and logging a specific message. But this is not the functional equivalent of what we are doing in other runtimes where we log also the body.

If you agree to this I would go on with the simple solution and open a new issue about the visibility about RuntimeApiErrorFuture.

At some point we would need to set things straight deprecating that and fixing the issue.

@jlizen
Copy link
Collaborator

jlizen commented Feb 25, 2026

Would the then combinator suggestion that smuggles the body in a BoxError be that much uglier? It seems like a relatively simple way to propagate the body upwards without needing manage a spawned task. The main downside is that it would await inside the foreground task (which probably is fine...?)

No objections to going status only though here, y'all have more context about expectations for cross language runtime parity.

@darklight3it
Copy link
Author

darklight3it commented Feb 25, 2026

I think the real solution here is to fix the original sin here deprecating RuntimeApiErrorFuture. Everything on top of it starts to feel an hack.

I would go with the easy fix and open an issue for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants