Really interesting approach here, especially binding the circuit breaker at the tool call layer. I’m running into similar agent execution issues and trying to understand where the real failure typically show up in production. What specifically broke that made you decide you needed this? And now that it’s live, what’s still hard to reason about or stabilize under load?
Trying to figure out whether this layer ends up being core to own project, or just reliability plumbing every team eventually rebuilds.