Agents demo beautifully and break quietly. A tool-using agent that nails a scripted path on stage will, in production, encounter the input no one anticipated, take six steps when two would do, call the wrong tool with confidence, and report success regardless. The distance between that demo and a system you can put in front of a customer is almost entirely unglamorous engineering — and it is where we spend most of our time.
The first lesson is that observability is not optional; it is the substrate. Every run of every agent we ship is traced and replayable: each step, each tool call, and each intermediate decision is logged with its inputs and outputs so an engineer can inspect it after the fact. Without that, debugging an agent is archaeology. With it, a failure becomes a test case. We have never regretted over-investing in tracing, and we have always regretted under-investing in it.
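To make "traced and replayable" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of our production tracing stack: the names `Trace`, `TraceEvent`, `Replayer`, `lookup_order`, and `agent` are all hypothetical, and a real system would add storage, sampling, and redaction. The idea is only that every tool call goes through one recording point, and that a recorded run can be fed back in place of the live tools.

```python
import json
import time
from dataclasses import dataclass, field, asdict
from typing import Any, Callable, Optional


@dataclass
class TraceEvent:
    """One recorded tool call (hypothetical schema)."""
    step: int
    tool: str
    inputs: dict
    output: Any = None
    error: Optional[str] = None
    duration_s: float = 0.0


@dataclass
class Trace:
    """Records every tool call made during a single agent run."""
    run_id: str
    events: list = field(default_factory=list)

    def call(self, tool: str, fn: Callable[..., Any], **inputs: Any) -> Any:
        start = time.perf_counter()
        event = TraceEvent(step=len(self.events), tool=tool, inputs=inputs)
        self.events.append(event)
        try:
            event.output = fn(**inputs)
            return event.output
        except Exception as exc:
            # Record the failure, then let it propagate unchanged.
            event.error = f"{type(exc).__name__}: {exc}"
            raise
        finally:
            event.duration_s = time.perf_counter() - start

    def to_json(self) -> str:
        # default=str keeps the sketch simple for non-JSON outputs.
        return json.dumps(asdict(self), default=str)

    @classmethod
    def from_json(cls, raw: str) -> "Trace":
        data = json.loads(raw)
        events = [TraceEvent(**e) for e in data["events"]]
        return cls(run_id=data["run_id"], events=events)


class Replayer:
    """Same call() interface as Trace, but serves recorded results."""

    def __init__(self, trace: Trace) -> None:
        self._events = iter(trace.events)

    def call(self, tool: str, fn: Callable[..., Any], **inputs: Any) -> Any:
        event = next(self._events)
        if (event.tool, event.inputs) != (tool, inputs):
            raise AssertionError(f"replay diverged at step {event.step}")
        if event.error is not None:
            raise RuntimeError(event.error)
        return event.output  # the live tool `fn` is never invoked


def lookup_order(order_id: str) -> dict:
    # Stand-in for a real tool; the data is made up.
    return {"order_id": order_id, "status": "shipped"}


def agent(runner) -> str:
    order = runner.call("lookup_order", lookup_order, order_id="A-17")
    return f"Order {order['order_id']} is {order['status']}"


live = Trace(run_id="run-001")
live_answer = agent(live)

# Persist the trace, load it back, and replay without touching live tools.
replayed_answer = agent(Replayer(Trace.from_json(live.to_json())))
```

Because `Replayer` exposes the same `call` method as `Trace`, the agent code does not change between a live run and a replay. That is what lets a stored failing trace be rerun as a deterministic regression test.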
The second lesson is that agents must be scoped to fail safely. The question is never whether an agent will be wrong — it will — but what happens when it is. We design tight, permissioned tool surfaces so an agent can only act within auditable boundaries, and we build explicit escalation paths so that when confidence is low, the system hands off to a human instead of bluffing. An agent that knows when to stop is worth more than one that is occasionally brilliant and occasionally catastrophic.
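A permissioned tool surface with an escalation path might look roughly like the sketch below. Everything here is an assumption made for illustration: the `ToolSurface` and `Escalation` classes, the `refund` and `delete_account` tools, and especially the `CONFIDENCE_FLOOR` value of 0.8. Where the confidence score comes from (a verifier, a heuristic, a model's self-report) is left open; the sketch simply takes it as an input.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Set


class ToolNotPermitted(Exception):
    """Raised when the agent requests a tool outside its allowlist."""


@dataclass(frozen=True)
class Escalation:
    """A hand-off to a human instead of an action (hypothetical shape)."""
    reason: str
    proposed_action: str


class ToolSurface:
    """Exposes only allowlisted tools and records every attempt."""

    def __init__(self, tools: Dict[str, Callable[..., Any]], allowed: Set[str]):
        self._tools = tools
        self._allowed = frozenset(allowed)
        self.audit_log: list = []  # (tool name, kwargs, permitted) tuples

    def invoke(self, name: str, **kwargs: Any) -> Any:
        permitted = name in self._allowed and name in self._tools
        self.audit_log.append((name, kwargs, permitted))
        if not permitted:
            raise ToolNotPermitted(name)
        return self._tools[name](**kwargs)


CONFIDENCE_FLOOR = 0.8  # assumed threshold; tune per deployment


def act_or_escalate(surface: ToolSurface, tool: str, confidence: float, **kwargs: Any):
    """Act only when confident and permitted; otherwise hand off."""
    if confidence < CONFIDENCE_FLOOR:
        reason = f"confidence {confidence:.2f} below {CONFIDENCE_FLOOR:.2f}"
        return Escalation(reason=reason, proposed_action=tool)
    try:
        return surface.invoke(tool, **kwargs)
    except ToolNotPermitted:
        return Escalation(reason="tool not permitted", proposed_action=tool)


def refund(order_id: str, amount: float) -> dict:
    return {"order_id": order_id, "refunded": amount}


def delete_account(user_id: str) -> dict:
    return {"deleted": user_id}


surface = ToolSurface(
    tools={"refund": refund, "delete_account": delete_account},
    allowed={"refund"},  # delete_account exists but is out of bounds
)

ok = act_or_escalate(surface, "refund", confidence=0.95, order_id="A-17", amount=12.5)
low = act_or_escalate(surface, "refund", confidence=0.40, order_id="A-17", amount=12.5)
blocked = act_or_escalate(surface, "delete_account", confidence=0.99, user_id="u-1")
```

In this sketch only the first call executes a tool. The second returns an `Escalation` because confidence is below the floor, and the third returns one because the tool is not on the allowlist, however confident the agent claims to be. Both invocation attempts land in `audit_log`, including the denied one.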
The third lesson is that reliability comes from decomposition. The agents that hold up in production are rarely a single heroic loop; they are a set of smaller, well-bounded steps with clear success criteria and recovery behavior at each one. This is less impressive to watch and far more dependable to run. It also makes the eval problem tractable, because you can score each step rather than only the end-to-end outcome.
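One way to picture decomposition is as a list of small steps, each carrying its own success check and optional recovery. The following is a hedged sketch rather than a framework recommendation: `Step`, `StepResult`, `run_pipeline`, and the two example steps are hypothetical names, and the retry policy (a fixed attempt count followed by a single recovery call) is just one simple choice among many.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class Step:
    name: str
    run: Callable[[dict], Any]            # does the work, reads shared state
    check: Callable[[Any], bool]          # explicit success criterion
    recover: Optional[Callable[[dict], Any]] = None  # fallback behavior
    max_attempts: int = 2


@dataclass
class StepResult:
    name: str
    ok: bool
    attempts: int
    recovered: bool = False


def run_pipeline(steps: List[Step], state: dict) -> List[StepResult]:
    """Run steps in order; stop at the first step that cannot succeed."""
    results = []
    for step in steps:
        ok, attempts, recovered, output = False, 0, False, None
        while attempts < step.max_attempts and not ok:
            attempts += 1
            try:
                output = step.run(state)
                ok = step.check(output)
            except Exception:
                ok = False  # treat an exception as a failed attempt
        if not ok and step.recover is not None:
            output = step.recover(state)
            ok = step.check(output)
            recovered = ok
        results.append(StepResult(step.name, ok, attempts, recovered))
        if not ok:
            break
        state[step.name] = output  # later steps can read earlier outputs
    return results


calls = {"fetch": 0}


def flaky_fetch(state: dict) -> dict:
    # Fails on the first attempt to simulate a transient error.
    calls["fetch"] += 1
    if calls["fetch"] == 1:
        raise TimeoutError("simulated timeout")
    return {"total": 42}


steps = [
    Step("fetch", run=flaky_fetch, check=lambda out: "total" in out),
    Step(
        "summarize",
        run=lambda s: "",                      # primary path yields nothing
        check=lambda out: len(out) > 0,
        recover=lambda s: f"Total is {s['fetch']['total']}",
    ),
]

state: dict = {}
results = run_pipeline(steps, state)
```

The returned `results` list is the per-step score: it records which step passed, how many attempts it took, and whether recovery was needed, so an eval can grade each step instead of only the final answer.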
None of this is exotic. It is the same discipline that turns any prototype into a system: instrument it, bound it, test it, and make its failures legible. The teams that internalize that ship agents that quietly do real work for years. The teams that chase the demo ship something that wows the boardroom and gets switched off within a quarter.