musi.live
Apr 16, 2026

Production agents and the substrate problem

Production agents are usually described in terms of the harness running them, the skills they use, and the evals that test them. I’ve run agents in production across different use cases, and I’ve seen them fail despite a high-quality harness, well-specified skills, and rigorous evals.

Agents that worked fine against a curated set of questions start to look different the moment they have to act against the actual operation - escalation criteria, exception handling, current policies, how a particular team actually decides what to do in a specific scenario.

The reason agents break when running against actual operations sits beneath all three: the operational substrate the agent has to act against. That substrate is scattered, contested, and constantly evolving. The agent has no good way to keep its skills current as the team’s operations evolve, and no automated way to learn from its own failures.

Scattered sources of information

Operational knowledge in companies tends to be split across surfaces that were never designed to be a substrate for agents - Slack threads or DMs, Notion runbooks, or the tacit knowledge of experienced employees. Those surfaces exist to help humans coordinate with each other, not for an agent to read. There is no single artifact a new joiner could read to learn how the company operates, and there is certainly no single artifact an agent can read.

No canonical artifact

Even when teams collect those sources, there typically isn’t anything the operations owner has signed off on as “this is what the agent should do”. Approval in human systems is implicit: a Slack thread is approved by being widely read, a runbook by being maintained, a Zendesk macro by being used. An agent has none of those signals. It needs an artifact someone has explicitly approved, and it needs to know which version of that artifact is current.
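
To make that concrete, here is a minimal sketch, in Python, of what such an artifact could look like as data. Every name here is hypothetical; the point is only that approval is an explicit field and the current version is something the agent can look up rather than infer.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class PolicyArtifact:
    """One explicitly approved statement of what the agent should do."""
    policy_id: str            # e.g. "refund-escalation"
    version: int              # monotonically increasing
    body: str                 # the policy text the agent actually reads
    status: str               # "draft" or "approved"; only approved versions reach the agent
    approved_by: str | None   # the operations owner who signed off, if anyone has
    approved_at: datetime | None

def current_policy(versions: list[PolicyArtifact], policy_id: str) -> PolicyArtifact:
    """Return the newest approved version of a policy, or fail loudly if none exists."""
    approved = [v for v in versions
                if v.policy_id == policy_id and v.status == "approved"]
    if not approved:
        raise LookupError(f"no approved version of {policy_id!r}")
    return max(approved, key=lambda v: v.version)
```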

Skill drifts from policy

The first two issues are typically resolved by hand-coding the agent’s behaviour into prompts and tool descriptions, which works as long as the policy does not change. When it does change - which is often - the engineer has to translate the new policy into a new prompt. That translation step is where drift opens up: the team thinks the agent is following the new policy, but the agent is actually following the engineer’s interpretation of it, and the two are not the same thing.
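
A rough sketch of the two shapes this can take, reusing the hypothetical PolicyArtifact from above. In the hand-coded version the prompt is the engineer’s paraphrase, frozen at the moment it was written; in the substrate-driven version the prompt is assembled from the approved artifact at run time, so there is no translation step to drift.

```python
# Hand-coded: the engineer's paraphrase of the policy, frozen when it was written.
HAND_CODED_PROMPT = (
    "Refunds under $50 can be approved automatically; "
    "everything else goes to a human."
)

# Substrate-driven: the prompt is built from the approved artifact each time,
# so updating the policy means approving a new version, not re-translating it.
def build_system_prompt(policy: PolicyArtifact) -> str:
    return (
        f"Follow this policy exactly (version {policy.version}, "
        f"approved by {policy.approved_by}):\n\n{policy.body}"
    )
```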

No feedback loop

This is the most expensive of the four over time. When the agent gets something wrong, there is no path for that outcome to update the substrate. The wrong behaviour gets logged, possibly reviewed, possibly patched in a prompt - but the original sources stay unchanged. The next time someone touches the policy, they touch it without knowing what the agent got wrong. The feedback loop never closes, and each agent failure becomes a one-off rather than an input to the system.
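
One hedged sketch of what closing that loop could look like: a failure record that points at the exact substrate version the agent was following, so whoever revises the policy next sees what the agent got wrong under it. Again, the names are assumptions, not an existing system.

```python
@dataclass(frozen=True)
class AgentFailure:
    """A bad outcome, recorded against the substrate version the agent was following."""
    policy_id: str
    policy_version: int
    transcript_ref: str    # pointer to the logged conversation or action trace
    what_went_wrong: str   # reviewer's note, written when the failure is triaged
    reported_at: datetime

def failures_under(policy: PolicyArtifact, failures: list[AgentFailure]) -> list[AgentFailure]:
    """Everything the agent got wrong under this policy, to surface before the next revision."""
    return [f for f in failures if f.policy_id == policy.policy_id]
```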

What looks like an agent failure usually traces to a substrate failure. The agent did its best with what it had, and what it had was incomplete, contested, or stale. Improving the harness or expanding the skills will not fix this on its own, because the substrate problem is still there. The work that actually moves things forward is upstream of the agent: in whatever lets a team approve, audit, and revise the substrate while still letting an agent run against it.
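
Pulling the sketches above together, the loop below is one possible shape for that upstream work: the agent always acts on the current approved version, and a bad outcome is written back against that exact version. It assumes the hypothetical types from the earlier sketches and takes the agent harness as a plain callable, since nothing here depends on which harness it is.

```python
from typing import Callable

@dataclass(frozen=True)
class Outcome:
    ok: bool
    transcript_ref: str
    reviewer_note: str = ""

def run_against_substrate(
    versions: list[PolicyArtifact],
    failures: list[AgentFailure],
    policy_id: str,
    run_agent: Callable[[str], Outcome],  # the harness, whatever it happens to be
) -> Outcome:
    policy = current_policy(versions, policy_id)     # always the current approved version
    outcome = run_agent(build_system_prompt(policy))
    if not outcome.ok:
        # The failure updates the substrate's record instead of dying in a log.
        failures.append(AgentFailure(
            policy_id=policy.policy_id,
            policy_version=policy.version,
            transcript_ref=outcome.transcript_ref,
            what_went_wrong=outcome.reviewer_note,
            reported_at=datetime.now(),
        ))
    return outcome
```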
