AI Agent Failure Handling
Here’s a strange thing about working with AI agents: most of the trouble they cause isn’t the failure itself. It’s that nobody noticed the failure happened.
Traditional software has the decency to crash. It throws exceptions, prints stack traces, returns error codes, lights up red. Whatever else you can say about a system that crashes, at least it tells you. AI agents don’t work like that. They fail quietly, confidently, with a completed-looking output that passes a quick glance and falls apart on a closer one. They self-certify. They hand you a finished answer to a question they couldn’t actually solve, and then the next step in the workflow treats that answer as ground truth.
So when people ask me how to make agent systems more reliable, my honest answer is that “reliable” is the wrong frame. The mature systems aren’t the ones that fail least often. They’re the ones that fail visibly, by policy, with a name attached to what went wrong. That’s the shift this piece is about: treating failure handling for agents as a first-class part of the design, not a debugging concern you’ll get to later.
The asymmetry that breaks everything
If you came up through traditional software engineering, your instincts around failure are mostly right but pointed at the wrong target. In a normal system, the hard problem is responding to failures well: retrying gracefully, degrading gracefully, surfacing useful errors. Detection is usually free, because the system itself is screaming.
With agents, that asymmetry flips. Detection is the hard part. By the time an agent’s output is in front of you, the agent has already decided it succeeded. It’s optimized to produce confident, complete-looking work; saying “I don’t know” or “I couldn’t do this” is not its default behavior. And once a silent failure happens in step three of a seven-step workflow, the remaining four steps build on a wrong foundation without any indication that something is off. Errors don’t just exist; they compound.
The implication is the foundational principle for everything that follows: detection cannot be the agent’s job. Something outside the agent has to evaluate whether the agent succeeded. If you remember nothing else from this piece, remember that. The systems that earn trust are the ones that fail visibly, by policy, with a name attached. We’ll come back to that line at the end, but it’s the thesis in compressed form, and the rest of the article is just the work of justifying it.
Failures cluster by who can see them
Once you accept that detection lives outside the agent, the next question is where outside. And it turns out failures organize naturally into three groups based on who or what can actually catch them. This matters a lot operationally, because the detection layer determines three things at once: who builds the detector, where it runs, and what it costs to operate. Sorting failures this way isn’t taxonomy for its own sake; it tells you what infrastructure you need to build.
The first group is externally observable failures. These are the ones the system can see without making any judgment about the quality of the agent’s output. The step ran too long. The agent burned through its token budget. The output came back in the wrong shape (a JSON object when you expected a list, say). A tool the agent called returned an error. None of these require a second model to evaluate anything; they’re visible from the outside, like a smoke alarm. They’re cheap to detect, and you can afford to check them on every step.
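To make that concrete, here is a minimal sketch of an external check in Python. The `StepResult` shape, the function name, the failure-mode strings, and the thresholds are all assumptions of mine for illustration; the only point is that nothing here judges whether the output is any good.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class StepResult:
    """What the runtime records about one step (hypothetical shape)."""
    raw_output: str
    elapsed_seconds: float
    tokens_used: int
    tool_error: Optional[str] = None


def external_check(result: StepResult,
                   max_seconds: float = 30.0,
                   max_tokens: int = 4000) -> Optional[str]:
    """Return a failure-mode name, or None if nothing is visibly wrong.

    The limits are illustrative defaults, not recommendations.
    """
    if result.tool_error is not None:
        return "tool_error"
    if result.elapsed_seconds > max_seconds:
        return "step_timeout"
    if result.tokens_used > max_tokens:
        return "resource_overrun"
    # Shape check only: this step is assumed to return a JSON list.
    try:
        parsed = json.loads(result.raw_output)
    except json.JSONDecodeError:
        return "schema_violation"
    if not isinstance(parsed, list):
        return "schema_violation"
    return None
```

Because every branch is a comparison or a parse, a check like this is cheap enough to run after every step.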
The second group is failures that require a second opinion. These are the ones you can only catch by evaluating the agent’s output against criteria the agent itself can’t be trusted to apply. The work got done, but the reviewer (another model, a rules engine, whatever) isn’t sure it’s right. The output addresses the wrong question, or addresses the right question shallowly. The requirements were contradictory and the agent quietly papered over the conflict. The output exists but can’t be traced back to where the information came from. These detectors cost more, because they typically require another model call, and you have to budget for them.
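A second-opinion detector might look roughly like the sketch below. The `Review` fields, the `Reviewer` signature, and the 0.7 threshold are hypothetical; `stub_reviewer` is a toy stand-in for whatever model or rules engine actually does the reviewing.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Review:
    confidence: float               # reviewer's confidence the output is right, 0..1
    conflicting_requirements: bool  # the task's criteria contradict each other
    untraceable_claims: int         # claims with no identifiable source


# Any callable outside the agent: another model, a rules engine, etc.
Reviewer = Callable[[str, str], Review]  # (task, output) -> Review


def second_opinion_check(task: str, output: str, reviewer: Reviewer,
                         min_confidence: float = 0.7) -> Optional[str]:
    """Return a failure-mode name based on the reviewer's verdict, or None."""
    review = reviewer(task, output)
    # Contradictions come first: they need a human regardless of confidence.
    if review.conflicting_requirements:
        return "conflicting_requirements"
    if review.untraceable_claims > 0:
        return "provenance_gap"
    if review.confidence < min_confidence:
        return "low_confidence"
    return None


def stub_reviewer(task: str, output: str) -> Review:
    """Toy reviewer: flags any output that cites nothing."""
    has_source = "source:" in output.lower()
    return Review(confidence=0.9, conflicting_requirements=False,
                  untraceable_claims=0 if has_source else 1)
```

The reviewer is passed in rather than hard-coded, which keeps the cost decision (which model, how often) in the system's hands and not the agent's.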
The third group is failures only visible at the workflow level. No single step looks broken. You can only see something is wrong by looking across many steps or many executions. The same step has been retried six times without making progress. Confidence stays below threshold across repeated attempts. Information keeps accumulating until you’re approaching context limits. Multiple unrelated steps are showing the same weird anomaly, which suggests an upstream problem rather than a local one. These detectors need state that persists across steps, and they usually require a separate observer process watching the whole workflow.
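What sets this layer apart is the persistent state, so a sketch has to be a small class and not a pure function. The `WorkflowObserver` name, its thresholds, and the anomaly tags are assumptions; the stalled-confidence signal is left out for brevity, though it would follow the same pattern as the attempt counter.

```python
from collections import Counter, defaultdict
from typing import Optional


class WorkflowObserver:
    """Keeps state across steps; all thresholds are illustrative."""

    def __init__(self, max_attempts: int = 6, context_limit: int = 100_000,
                 context_warn_ratio: float = 0.8, anomaly_step_count: int = 3):
        self.max_attempts = max_attempts
        self.context_limit = context_limit
        self.context_warn_ratio = context_warn_ratio
        self.anomaly_step_count = anomaly_step_count
        self.attempts = Counter()              # step_id -> attempts without progress
        self.anomaly_steps = defaultdict(set)  # anomaly tag -> steps that showed it

    def record_attempt(self, step_id: str, made_progress: bool,
                       context_tokens: int,
                       anomaly: Optional[str] = None) -> Optional[str]:
        """Record one attempt and return a failure-mode name, or None."""
        if made_progress:
            self.attempts[step_id] = 0
        else:
            self.attempts[step_id] += 1
        if anomaly is not None:
            self.anomaly_steps[anomaly].add(step_id)

        if self.attempts[step_id] >= self.max_attempts:
            return "iteration_loop"
        if context_tokens >= self.context_limit * self.context_warn_ratio:
            return "context_overflow"
        if (anomaly is not None
                and len(self.anomaly_steps[anomaly]) >= self.anomaly_step_count):
            return "pattern_anomaly"
        return None
```

None of these conditions can be evaluated from a single step's output, which is why this observer has to live outside both the agent and the per-step checks.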
That’s the taxonomy, and the order matters. External, second opinion, workflow. Each layer is more expensive than the last, and each catches things the earlier layers can’t.
Response is policy, not improvisation
Once you’ve detected a failure, the question is what to do about it. And here’s the thing I most want to land: the agent should not decide. The system should decide, in advance, by policy.
There’s a short vocabulary of responses that covers most situations. Halt stops the step, preserves whatever state exists, and does nothing else automatically. Retry attempts the same step again with the same inputs, within bounded conditions. Iterate is a close cousin of retry but with feedback between attempts; the second try gets information about why the first failed. Replan throws out the current approach and asks the agent to generate a new plan from a different angle. Escalate surfaces the failure to a human or to some higher-authority system. Terminate ends the workflow entirely.
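The difference between retry and iterate is easiest to see side by side. In this sketch the `Step` and `Check` signatures are hypothetical, and `demo_step` and `demo_check` are toys that exist only to show the contrast; both loops are bounded, and a `None` result is the caller's cue to fall through to the next declared response.

```python
from typing import Callable, Optional, Tuple

# A step takes (inputs, feedback) and returns an output.
# A check returns None on success, or a description of what was wrong.
Step = Callable[[str, Optional[str]], str]
Check = Callable[[str], Optional[str]]


def retry(step: Step, check: Check, inputs: str,
          max_attempts: int = 2) -> Tuple[Optional[str], int]:
    """Same inputs every time, no feedback, bounded attempts."""
    for attempt in range(1, max_attempts + 1):
        output = step(inputs, None)
        if check(output) is None:
            return output, attempt
    return None, max_attempts


def iterate(step: Step, check: Check, inputs: str,
            max_attempts: int = 3) -> Tuple[Optional[str], int]:
    """Each new attempt is told why the previous one failed."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        output = step(inputs, feedback)
        problem = check(output)
        if problem is None:
            return output, attempt
        feedback = problem
    return None, max_attempts


def demo_step(inputs: str, feedback: Optional[str]) -> str:
    """Toy step that only gets it right once it has been given feedback."""
    return inputs.upper() if feedback else inputs


def demo_check(output: str) -> Optional[str]:
    return None if output.isupper() else "output must be upper case"
```

With these toys, `retry(demo_step, demo_check, "abc")` exhausts its budget, while `iterate` succeeds on the second attempt because the feedback changes what the step does.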
The vocabulary itself isn’t the interesting part; if you’ve worked in distributed systems, none of those words are new. What’s interesting is the pairings, because some of them are actively wrong even though they look reasonable.
Workflow-level failures should never trigger retry. Retry is usually what created the iteration loop in the first place; responding to it with more retry compounds the same failure. When the workflow-level detector spots a loop, the answer is always halt and escalate, never “try again harder.”
Conflicting-requirements failures should never auto-resolve. The agent papered over the conflict once already; asking it to try again just produces a different paper-over. There’s no version of “try harder” that resolves a contradiction in the underlying criteria. That’s a human decision, escalated immediately, no retry.
Resource overruns should escalate before they halt. If you halt silently when a budget gets exhausted, you’re hiding the cost signal from the people who need to see it. Halting without surfacing is just a slower silent failure.
Schema violations get exactly one retry, with a corrective prompt that points out what was wrong about the shape. (Strictly speaking that’s an iterate in the vocabulary above, since the second attempt carries feedback, but it’s capped at one.) If the second attempt still fails, the problem isn’t a one-off slip; the agent doesn’t understand the schema, and you’re in a different failure mode entirely.
The point of these rules isn’t that they’re clever. It’s that they’re declared in advance. When something breaks at three in the morning, “what happens now” should be a lookup, not a decision.
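One way to keep those pairings honest is to check the policy itself before it ships. The sketch below is an assumption-heavy illustration: the `Rule` fields and the string names are mine, and treating iterate as a form of retry for workflow-level failures is my reading of the rule, not something the rule spells out.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Rule:
    layer: str              # "external", "second_opinion", or "workflow"
    response: str           # "halt", "retry", "iterate", "replan", ...
    max_retries: int = 0
    always_escalate: bool = False


def validate_policy(policy: Dict[str, Rule]) -> List[str]:
    """Return human-readable problems; an empty list means the pairings pass."""
    problems = []
    for mode, rule in policy.items():
        if rule.layer == "workflow" and rule.response in ("retry", "iterate"):
            problems.append(f"{mode}: workflow-level failures must not retry")
        if (mode == "conflicting_requirements"
                and rule.response in ("retry", "iterate", "replan")):
            problems.append(f"{mode}: contradictions must not auto-resolve")
        if mode == "resource_overrun" and not rule.always_escalate:
            problems.append(f"{mode}: overruns must always be surfaced")
        if mode == "schema_violation" and rule.max_retries > 1:
            problems.append(f"{mode}: at most one corrective retry")
    return problems
```

Run as part of a test suite, a check like this turns "some pairings are actively wrong" from tribal knowledge into a failing build.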
The response matrix
The whole thing fits together in a compact table. For each named failure mode: which layer detects it, what the immediate response is, and what triggers escalation. This is the operational core of the design.
| Failure mode | Detection layer | Immediate response | Escalation trigger |
|---|---|---|---|
| Step timeout | External | Terminate step | Retry budget exhausted |
| Resource overrun | External | Halt | Always escalate |
| Schema violation | External | Retry once | Second failure |
| Low confidence | Second opinion | Iterate | Confidence stall threshold |
| Conflicting requirements | Second opinion | Halt | Always escalate, no retry |
| Provenance gap | Second opinion | Replan | Replan failure |
| Iteration loop | Workflow | Halt | Always escalate |
| Context overflow | Workflow | Compact per policy | Continued growth post-compaction |
| Pattern anomaly | Workflow | Flag for review | Human decision required |
This particular table is illustrative; the real argument is that every system needs a table like it. The named failures and their responses become the contract between the people running the agent and the people depending on its outputs. Without something like this, you’re not running an agent system; you’re running an experiment with no instrumentation.
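In code, a table like this can be nothing more than a dictionary and a lookup. The sketch below transcribes the illustrative rows above; the identifier spellings are mine, and the fallback for an unrecognized failure mode (halt and escalate) is an assumption, not part of the table.

```python
from typing import Dict, NamedTuple


class Row(NamedTuple):
    layer: str
    response: str
    escalation_trigger: str


RESPONSE_MATRIX: Dict[str, Row] = {
    "step_timeout": Row("external", "terminate_step", "retry budget exhausted"),
    "resource_overrun": Row("external", "halt", "always escalate"),
    "schema_violation": Row("external", "retry_once", "second failure"),
    "low_confidence": Row("second_opinion", "iterate", "confidence stall threshold"),
    "conflicting_requirements": Row("second_opinion", "halt", "always escalate, no retry"),
    "provenance_gap": Row("second_opinion", "replan", "replan failure"),
    "iteration_loop": Row("workflow", "halt", "always escalate"),
    "context_overflow": Row("workflow", "compact", "continued growth post-compaction"),
    "pattern_anomaly": Row("workflow", "flag_for_review", "human decision required"),
}


def respond(failure_mode: str) -> Row:
    """Look up the declared response for a named failure."""
    try:
        return RESPONSE_MATRIX[failure_mode]
    except KeyError:
        # Assumption: a failure with no declared policy is itself worth escalating.
        return Row("unknown", "halt", "always escalate: no declared policy")
```

The function has no branches that depend on the agent's opinion of itself, which is the property that matters.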
Two failures played through
Abstractions are easier to trust when you’ve watched them work on something concrete. Two examples.
A schema violation that retries once. Imagine a research agent that’s supposed to return a structured summary: a title, three key findings, a list of sources. The first call comes back as prose, no structure at all. The external detector catches this immediately; the output doesn’t match the expected shape. The response, per policy, is one retry with a corrective prompt that includes the schema and an explicit note about what was wrong the first time. If the second attempt comes back correctly shaped, the workflow continues and the incident gets logged. If the second attempt still fails, the system halts and escalates, because at this point you’re not dealing with a one-off; the agent doesn’t understand the schema. Notice what’s happening: retry is doing real work here, not just being optimistic. It’s bounded, it’s informed by what went wrong, and it has a hard limit. Compare that to the version of this story where there’s no schema check at all, the prose output gets passed to the next step expecting a structured object, and three steps later you’re trying to debug a workflow that fell apart for reasons that look nothing like the actual cause.
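Here is a minimal sketch of that first example. The field names, the wording of the corrective prompt, the log tuple format, and `fake_agent` are all hypothetical; a real system would call an actual model and likely use a proper schema validator.

```python
import json
from typing import Callable, Optional

# Assumed schema: a title, three key findings, a list of sources.
REQUIRED = {"title": str, "key_findings": list, "sources": list}


def schema_problem(raw: str) -> Optional[str]:
    """Return a description of the shape problem, or None if the shape is fine."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return "output is not valid JSON"
    if not isinstance(data, dict):
        return "output must be a JSON object"
    for field, kind in REQUIRED.items():
        if not isinstance(data.get(field), kind):
            return f"field '{field}' is missing or not a {kind.__name__}"
    if len(data["key_findings"]) != 3:
        return "key_findings must contain exactly three items"
    return None


def run_with_one_retry(agent: Callable[[str], str], prompt: str,
                       log: list) -> Optional[dict]:
    """One attempt, one corrective retry, then escalate. Never a third try."""
    raw = agent(prompt)
    problem = schema_problem(raw)
    if problem is None:
        return json.loads(raw)
    log.append(("schema_violation", "attempt=1", problem))
    corrective = (f"{prompt}\n\nYour previous output was rejected: {problem}. "
                  f"Return a JSON object with fields {sorted(REQUIRED)}.")
    raw = agent(corrective)
    problem = schema_problem(raw)
    if problem is None:
        return json.loads(raw)
    log.append(("schema_violation", "attempt=2", problem))
    log.append(("escalate", "schema_violation", "second failure"))
    return None  # the caller halts the workflow


def fake_agent(prompt: str) -> str:
    """Stand-in agent: prose first, correct JSON once corrected."""
    if "rejected" in prompt:
        return json.dumps({"title": "T", "key_findings": ["a", "b", "c"],
                           "sources": ["s"]})
    return "Here is a summary in prose."
```

Note that the successful path still writes a log entry for the first failure, so the incident is countable even though the workflow continued.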
A context overflow in a long-running workflow. A multi-step agent is working through a research task and accumulating findings across steps. By step eight, the context is approaching the model’s token limit. The workflow-level observer catches this (token counts are cheap to monitor, but the running total only exists across steps), and the policy says: compact and continue. The system summarizes the prior steps into a shorter representation, hands the compacted context to step nine, and keeps going. If the context starts growing again past a threshold even after compaction, that’s the escalation trigger, because now you have a workflow that isn’t converging and the compaction strategy isn’t working. Without this design, the workflow either crashes when it hits the token limit (loud failure, recoverable) or, worse, performance degrades silently as the model starts losing track of context and the outputs get gradually worse. The silent version is the one that destroys trust.
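A rough sketch of that compaction policy follows. Everything here is an assumption made for illustration: the `ContextManager` name, the word-count stand-in for a tokenizer, the `first_words` summarizer, and the choice to treat a second threshold crossing after one compaction as the escalation trigger.

```python
from typing import Callable, List, Optional


def count_tokens(text: str) -> int:
    """Crude stand-in: counts words. A real system would use the model's tokenizer."""
    return len(text.split())


def first_words(entries: List[str]) -> str:
    """Toy summarizer: keeps the first two words of each entry."""
    return " ".join(" ".join(e.split()[:2]) for e in entries)


class ContextManager:
    def __init__(self, threshold: int,
                 summarize: Callable[[List[str]], str],
                 max_compactions: int = 1):
        self.threshold = threshold
        self.summarize = summarize
        self.max_compactions = max_compactions
        self.compactions = 0
        self.entries: List[str] = []
        self.events: List[str] = []   # log signatures, in order

    def size(self) -> int:
        return sum(count_tokens(e) for e in self.entries)

    def add(self, finding: str) -> Optional[str]:
        """Append a finding; return 'compacted', 'escalate', or None."""
        self.entries.append(finding)
        if self.size() < self.threshold:
            return None
        if self.compactions >= self.max_compactions:
            # Growth continued after compaction: the workflow is not converging.
            self.events.append("context_overflow: escalate")
            return "escalate"
        self.entries = [self.summarize(self.entries)]
        self.compactions += 1
        self.events.append("context_overflow: compacted")
        return "compacted"
```

The design choice worth noticing is that compaction is counted. Without the counter, a workflow that never converges would just compact forever, which is the silent version of the failure.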
Both examples are doing the same underlying thing: a detector outside the agent watches for a specific named condition, and when the condition fires, a pre-declared response runs. Nothing improvised, nothing decided in the moment.
Failure modes as a first-class artifact
Here’s the argument that does the most work in this whole piece, and the one I’d most like builders to take seriously: failure modes should be a named, documented artifact of your agent design. The same way exception types and error codes are named, documented artifacts in traditional software.
A failure mode is “named” only if it has three things wired up. There’s a detector that runs in production. There’s a declared response. There’s a log signature that makes the failure countable after the fact. Labels in a design doc without detectors are aspirational. Detectors without declared responses are alarms nobody acts on. Detectors and responses without log signatures mean you can’t tell anyone, after the incident, how often this happens. All three have to be wired through for the failure mode to actually exist in the system.
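The three-part test can be made mechanical. In this sketch the `FailureMode` fields and the `wired_count` helper are hypothetical names; the idea is only that "named" becomes something you can assert on, not something you claim in a design doc.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class FailureMode:
    name: str
    detector: Optional[Callable[[dict], bool]] = None  # runs in production
    response: Optional[str] = None                     # declared in advance
    log_signature: Optional[str] = None                # makes incidents countable

    def missing(self) -> List[str]:
        """List which of the three required parts are not wired up."""
        parts = []
        if self.detector is None:
            parts.append("detector")
        if self.response is None:
            parts.append("response")
        if self.log_signature is None:
            parts.append("log_signature")
        return parts

    def is_named(self) -> bool:
        """True only when detector, response, and log signature all exist."""
        return not self.missing()


def wired_count(modes: List[FailureMode]) -> int:
    """Count failure modes wired through the full loop, ignoring bare labels."""
    return sum(1 for mode in modes if mode.is_named())
```

A label such as `FailureMode("model_bad_day")` reports all three parts missing, which is a fair description of what it is: aspirational.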
This matters because of what changes when failures have names. Named failures get dashboards. Named failures get monitored. Named failures get postmortems with real artifacts attached. Unnamed failures get rationalized after the fact: “well, the model was just having a bad day,” or “the user asked a weird question,” or “we’ll look into it.” Naming converts vague unease into something you can measure and improve.
It’s also where the audit story lives, which is going to matter more as agent systems take on work where accountability is non-negotiable. When somebody asks “what went wrong here,” the only way to give them a real answer is if the system had vocabulary for what could go wrong before it happened. You can’t audit a workflow whose failure modes were never named.
I’d resist turning the count of named failure modes into a vanity metric, though. Twenty trivial labels with no detectors behind them are worse than three real ones, because they give the illusion of maturity without the substance. The signal isn’t how many failures a team can list in a doc; it’s how many are wired through the full loop, from detector to response to log.
What to take with you
If you’re building agent systems and want a short list of things to actually do differently after reading this, here’s what I’d start with.
Write the failure modes before you write the success path. If you can’t enumerate how the agent could fail, you don’t yet understand the work well enough to build it. The exercise of naming failures often reveals that the success path you had in mind was incomplete.
Put the detection logic outside the agent. The agent’s job is to do the work. Deciding whether the work was done correctly is a different job, and giving both jobs to the same actor is how silent failures happen in the first place.
Declare responses in advance and make them lookups. When a named failure fires, the system should consult the policy, not improvise. Improvisation in failure handling is what produces incidents whose timeline includes the phrase “and then we tried…”
Treat the response matrix as a deliverable. Build it before you ship. Review it the way you’d review a contract, because that’s effectively what it is: a contract about what your system promises to notice and what it promises to do about it.
The old model of software failure was that things crash and you fix them. The new model for agents is that things fail quietly and the system needs the vocabulary to notice. The shift, if you accept the argument here, is from treating failures as bugs to be eliminated to treating them as conditions to be named, detected, and handled by policy.
The systems that earn trust aren’t the ones that fail least. They’re the ones that fail visibly, with a name attached, by a policy decided in advance. That’s the work. Most of it happens before the first agent ever runs.
