Measuring agents you can't fully script

A chatbot that gets an answer wrong wastes a few seconds. An agent that gets an action wrong books the wrong patient into the wrong chair, tells a caller their insurance is accepted when it is not, or quietly cancels a recall it should have confirmed. The moment an agent can do something, the cost of being wrong moves from annoyance to operational damage. That changes everything about how you evaluate it.

Why accuracy and benchmarks fall short

Most teams arrive with a benchmark instinct: pick a model, run it against a labeled set, report a percentage, ship the highest number. That works for classification. It falls apart for agents, for three reasons.

First, accuracy measures a single turn in isolation, but an agent operates over a trajectory. A model can be right at every step and still produce a bad outcome because it took the right action at the wrong time, or because it never asked the one clarifying question that mattered. Second, public benchmarks measure generic competence, not whether the thing works on your messy phone transcripts, your insurance edge cases, your scheduling rules. A model can top a leaderboard and fail on the way a specific front desk actually talks. Third, and most important, accuracy treats every error as equal. For an action-taking agent, errors are wildly unequal. Failing to answer is not the same failure as confidently doing the wrong thing.
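The third point can be made concrete with a little arithmetic. The sketch below compares two hypothetical agents with identical accuracy but different kinds of mistakes. The cost weights are assumptions chosen for illustration, not measured values. Only the ratio between "failed to answer" and "did the wrong thing" matters for the argument.

```python
from collections import Counter

# Hypothetical per-outcome costs (assumed for illustration, not measured).
COST = {"correct": 0.0, "no_answer": 1.0, "wrong_action": 25.0}


def accuracy(outcomes):
    """Fraction of conversations that ended in a correct outcome."""
    return Counter(outcomes)["correct"] / len(outcomes)


def expected_cost(outcomes, cost=COST):
    """Average cost per conversation, weighting each error type differently."""
    return sum(cost[o] for o in outcomes) / len(outcomes)


# Two agents with the same accuracy but different failure modes.
cautious = ["correct"] * 90 + ["no_answer"] * 10
bold = ["correct"] * 90 + ["wrong_action"] * 10

print("accuracy:", accuracy(cautious), accuracy(bold))
print("expected cost:", expected_cost(cautious), expected_cost(bold))
```

Both agents score 90 percent, yet under these assumed weights the bold one is 25 times more expensive per conversation. A single accuracy number cannot show that difference.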

The Cachalot Compass we work from starts with Signal Detection for exactly this reason: before you can measure anything useful, you have to know which signals in a conversation actually predict a good or bad outcome. Accuracy flattens all of that into one number.

What to measure instead

We replace the single accuracy number with a small panel of metrics that, taken together, tell you whether the agent is doing its job safely. None of these is novel on its own. The discipline is in tracking all of them at once and refusing to optimize one at the expense of another.

Task completion and containment. Of the conversations the agent was supposed to handle end to end, how many did it actually finish without a human stepping in? Containment is only a good number when it sits next to outcome quality. High containment with bad outcomes means the agent is confidently mishandling things and nobody noticed.
Real downstream outcomes. Not "did the agent respond well," but did the appointment get booked and kept, did the recall convert, did the caller stop calling back about the same thing. These live in the practice management system, not in the model logs, which is one more reason we instrument the whole loop rather than just the model.
False-action rate. How often the agent takes an action it should not have: booking a slot that was held, sending a confirmation for an appointment that was never made, answering a coverage question it had no business answering. This is the metric that keeps you up at night, so it gets the tightest threshold.
Escalation quality. When the agent hands off to a human, was the handoff warranted and was the context complete? A good escalation arrives with the caller's intent, what was already tried, and what is left to decide. A bad one dumps a cold transcript on a staff member who now has to start over.
Calibration of acting versus deferring. The single best predictor of a trustworthy agent is whether it knows when it does not know. We measure how well the agent's confidence lines up with whether it was actually right, and whether it defers in the cases where deferring was correct. An agent that acts boldly when uncertain is more dangerous than one that defers too often.
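As a rough sketch of how such a panel could be computed together, assume each reviewed conversation has been reduced to a handful of labels. The `Conversation` fields, the `panel` function, and the binned calibration error (a standard expected-calibration-error calculation, used here as a stand-in for whatever calibration measure a team prefers) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Conversation:
    # Hypothetical labels supplied by reviewers or downstream systems.
    in_scope: bool            # agent was meant to handle it end to end
    completed_by_agent: bool  # finished without a human stepping in
    good_outcome: bool        # e.g. appointment booked and kept
    false_action: bool        # agent took an action it should not have
    escalated: bool
    escalation_warranted: bool = False
    context_complete: bool = False
    confidence: float = 0.0   # agent's stated confidence in its decision
    was_right: bool = False


def rate(numerator, denominator):
    return numerator / denominator if denominator else 0.0


def calibration_error(convs, n_bins=5):
    """Bin by confidence, then compare mean confidence to the actual hit rate."""
    bins = {}
    for c in convs:
        idx = min(int(c.confidence * n_bins), n_bins - 1)
        bins.setdefault(idx, []).append(c)
    total = 0.0
    for members in bins.values():
        mean_conf = sum(m.confidence for m in members) / len(members)
        hit_rate = sum(m.was_right for m in members) / len(members)
        total += len(members) / len(convs) * abs(mean_conf - hit_rate)
    return total


def panel(convs):
    in_scope = [c for c in convs if c.in_scope]
    contained = [c for c in in_scope if c.completed_by_agent]
    escalated = [c for c in convs if c.escalated]
    return {
        "containment": rate(len(contained), len(in_scope)),
        # Reported next to containment so a high number cannot hide bad outcomes.
        "contained_good_outcome": rate(sum(c.good_outcome for c in contained), len(contained)),
        "false_action_rate": rate(sum(c.false_action for c in convs), len(convs)),
        "escalation_quality": rate(
            sum(c.escalation_warranted and c.context_complete for c in escalated),
            len(escalated),
        ),
        "calibration_error": calibration_error(convs),
    }
```

Returning all five values from one function is deliberate: it makes it awkward to report containment without the outcome and false-action numbers beside it.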


Offline evals: golden sets and transcript replay

Before anything touches a live caller, it runs against evals we can repeat on demand. Two kinds do most of the work.

Golden sets are hand-curated cases with a known right answer, including the nasty ones: the caller who changes their mind mid-sentence, the insurance plan we deliberately do not support, the double-booking trap, the request that legally must go to a person. Golden sets are small, opinionated, and slow to grow because every case is argued over. They are the regression net. When we change a prompt, a tool, or a model version, the golden set tells us in minutes whether we broke something that used to work.
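A minimal sketch of a golden-set regression check might look like the following. The case IDs, utterances, action names, and the plan name "AcmeDental Gold" are invented for illustration. `toy_agent` stands in for the real agent call, which would go through the actual prompt, tools, and model version under test.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class GoldenCase:
    case_id: str
    utterance: str
    expected_action: str  # the argued-over "right answer"


# Hypothetical cases mirroring the kinds described above.
GOLDEN_SET = [
    GoldenCase("unsupported-plan", "Do you take AcmeDental Gold?", "escalate"),
    GoldenCase("double-booking", "Book me Tuesday at 9", "offer_alternative"),
    GoldenCase("simple-confirm", "Yes, confirm my cleaning", "confirm_appointment"),
]


def run_golden_set(agent: Callable[[str], str], cases=GOLDEN_SET):
    """Return (case_id, expected, actual) for every case the agent gets wrong."""
    failures = []
    for case in cases:
        actual = agent(case.utterance)
        if actual != case.expected_action:
            failures.append((case.case_id, case.expected_action, actual))
    return failures


def toy_agent(utterance: str) -> str:
    # Stand-in agent that books too eagerly, which is the failure to catch.
    if "confirm" in utterance.lower():
        return "confirm_appointment"
    return "book_slot"


for case_id, expected, actual in run_golden_set(toy_agent):
    print(f"REGRESSION {case_id}: expected {expected}, got {actual}")
```

An empty failure list is the condition for moving a change forward. Any entry names the exact case to argue about.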

Transcript replay is where the volume comes from. We take real, anonymized conversations the system has already handled and replay them against a candidate version to see where its decisions diverge. Replay catches the long tail that no human would think to write into a golden set, because reality is stranger than any test author. Both of these run entirely inside the client cloud against the client's own conversations. We do not export transcripts to score them somewhere else. The evaluation runs where the data already lives.
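One way to sketch replay, assuming each stored conversation is a list of caller turns paired with the action the current version took at each turn, is shown below. The data shapes, function names, and action labels are assumptions for illustration. A real replay would also have to reconstruct tool state, which this sketch ignores.

```python
from typing import Callable, Dict, List, Tuple


def replay_divergences(
    transcripts: Dict[str, List[str]],
    recorded_actions: Dict[str, List[str]],
    candidate: Callable[[List[str]], str],
) -> List[Tuple[str, int, str, str]]:
    """Return (conversation_id, turn_index, recorded, proposed) at the first split."""
    divergences = []
    for conv_id, turns in transcripts.items():
        for i in range(len(turns)):
            proposed = candidate(turns[: i + 1])  # candidate sees history so far
            recorded = recorded_actions[conv_id][i]
            if proposed != recorded:
                divergences.append((conv_id, i, recorded, proposed))
                break  # later turns are not comparable once the paths split
    return divergences


# Hypothetical anonymized data and a stand-in candidate version.
transcripts = {
    "c1": ["I need to move my appointment", "Thursday works"],
    "c2": ["Is my insurance accepted?"],
}
recorded_actions = {
    "c1": ["ask_new_time", "reschedule"],
    "c2": ["escalate"],
}


def candidate(turns: List[str]) -> str:
    last = turns[-1].lower()
    if "insurance" in last:
        return "answer_coverage"
    if "move" in last:
        return "ask_new_time"
    return "reschedule"


for conv_id, turn, recorded, proposed in replay_divergences(
    transcripts, recorded_actions, candidate
):
    print(f"{conv_id} turn {turn}: was {recorded}, candidate proposes {proposed}")
```

A divergence is not automatically a regression. It is a pointer to a conversation a person should look at, since the candidate may be the one that is right.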

Online evals: shadow, canary, sampling

Offline evals tell you the agent is plausibly ready. They cannot tell you it is ready, because the live world is not the replay. So we ramp deliberately.

Shadow mode. The agent runs against live traffic and proposes actions, but takes none. A human handles the call as usual while we record what the agent would have done. Shadow mode is the cheapest way to find out, at no risk to callers, how the agent behaves on traffic it has never seen.
Canary. Once shadow looks clean, the agent acts for real on a thin slice of traffic, with tighter alerting than the rest of the system. If the false-action rate or escalation quality moves the wrong way, the canary is small enough to roll back before it matters.
Human-in-the-loop sampling. Even at full volume, a sampled stream of live interactions gets reviewed by a person. This is not a one-time gate. It is permanent, because models drift, callers change, and last quarter's eval does not certify next quarter's behavior.
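The three stages above can be sketched as a routing decision made per call. The example assumes a stable call ID and uses a hash so the same call always lands in the same bucket. The fractions, the salt strings, and the `should_roll_back` threshold check are hypothetical, and a production check would also account for sample size before rolling back.

```python
import hashlib


def bucket(call_id: str, salt: str) -> float:
    """Map a call id to a stable value in [0, 1) so routing is reproducible."""
    digest = hashlib.sha256(f"{salt}:{call_id}".encode()).hexdigest()
    return int(digest[:8], 16) / 0x100000000


def route(call_id: str, canary_fraction: float = 0.05, review_fraction: float = 0.02):
    """Decide whether the agent may act on this call and whether a person reviews it."""
    mode = "canary" if bucket(call_id, "canary") < canary_fraction else "shadow"
    return {
        "mode": mode,
        "agent_may_act": mode == "canary",  # in shadow the agent only proposes
        "human_review": bucket(call_id, "review") < review_fraction,
    }


def should_roll_back(false_actions: int, calls: int, max_rate: float) -> bool:
    """Naive canary check: observed false-action rate above the allowed ceiling."""
    return calls > 0 and false_actions / calls > max_rate


print(route("call-0001"))
print(should_roll_back(false_actions=3, calls=100, max_rate=0.01))
```

Using separate salts for the canary and review decisions keeps the two samples independent, so reviewed calls are not just the canary calls again.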

Guardrails and the decision to let it act unsupervised

Metrics tell you how the agent is doing. Guardrails decide what it is allowed to do while you find out. We treat these as separate concerns. A guardrail is a hard constraint that holds regardless of what the model decides: this action requires confirmation, that topic always escalates, an appointment can never be moved without an explicit caller yes. Guardrails are deterministic and auditable. The model proposes; the guardrails dispose.
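A minimal sketch of "the model proposes; the guardrails dispose" follows, assuming the model's output has already been parsed into a structured proposal. The rule tables, action names, and topic names are hypothetical. What matters is that the check is plain deterministic code and that every verdict is written to an audit log.

```python
from dataclasses import dataclass


@dataclass
class Proposal:
    action: str
    topic: str
    caller_confirmed: bool = False


# Hypothetical rule tables; real ones would be reviewed and versioned.
REQUIRES_CONFIRMATION = {"move_appointment", "cancel_appointment"}
ALWAYS_ESCALATE_TOPICS = {"insurance_coverage", "clinical_advice"}


def apply_guardrails(p: Proposal, audit_log: list) -> str:
    """Return 'allow', 'ask_confirmation', or 'escalate', and log the decision."""
    if p.topic in ALWAYS_ESCALATE_TOPICS:
        verdict, rule = "escalate", "topic_always_escalates"
    elif p.action in REQUIRES_CONFIRMATION and not p.caller_confirmed:
        verdict, rule = "ask_confirmation", "explicit_yes_required"
    else:
        verdict, rule = "allow", "no_rule_matched"
    audit_log.append(
        {"action": p.action, "topic": p.topic, "verdict": verdict, "rule": rule}
    )
    return verdict


log = []
print(apply_guardrails(Proposal("move_appointment", "scheduling"), log))
print(apply_guardrails(Proposal("move_appointment", "scheduling", caller_confirmed=True), log))
print(apply_guardrails(Proposal("answer_question", "insurance_coverage"), log))
```

Nothing in `apply_guardrails` consults the model's confidence. That is what keeps the constraint independent of whatever the model decides.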

The decision to let an agent act unsupervised in a given workflow is made one workflow at a time, never as a blanket switch. A clear, reversible, low-stakes action with a strong eval record earns autonomy early. A high-stakes or hard-to-reverse action keeps a human in the loop far longer, sometimes permanently, and that is a feature, not a failure. The honest version of this work is that some actions should never run unsupervised, and saying so is part of the job. This is the Strategic Resurfacing part of the Compass: knowing which decisions need to come back up to a person and which can stay below the surface.
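The per-workflow decision can be sketched as a small function evaluated once per workflow rather than as a global flag. Everything here is an assumption for illustration: the `Workflow` fields, the supervision level names, and especially the `MAX_FALSE_ACTION_RATE` threshold, which each team would have to set and defend for itself.

```python
from dataclasses import dataclass

# Hypothetical ceiling on the observed false-action rate.
MAX_FALSE_ACTION_RATE = 0.01


@dataclass(frozen=True)
class Workflow:
    name: str
    reversible: bool
    high_stakes: bool
    false_action_rate: float         # observed in shadow and canary runs
    never_unsupervised: bool = False  # a standing policy decision, not a metric


def supervision_level(w: Workflow) -> str:
    if w.never_unsupervised:
        return "human_always"
    if w.high_stakes or not w.reversible:
        return "human_in_loop"
    if w.false_action_rate <= MAX_FALSE_ACTION_RATE:
        return "autonomous"
    return "human_in_loop"  # low stakes, but the eval record is not there yet


workflows = [
    Workflow("send_reminder", reversible=True, high_stakes=False, false_action_rate=0.002),
    Workflow("cancel_recall", reversible=False, high_stakes=False, false_action_rate=0.001),
    Workflow("answer_coverage", reversible=False, high_stakes=True,
             false_action_rate=0.0, never_unsupervised=True),
]
for w in workflows:
    print(w.name, "->", supervision_level(w))
```

The policy flag is checked before any metric, so a clean eval record can never promote a workflow that has been ruled out for unsupervised operation.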

You cannot measure what you do not run

Every metric on this page depends on access to the live system: the real outcomes in the practice management system, the sampled calls, the drift you only see week over week. A vendor who builds an agent, hands it over, and walks away has no way to measure any of this, because the signals only exist where the system runs. That is the practical reason we operate what we build, inside each client's own cloud, rather than shipping a model and a goodbye.

It is also why the boundary matters. Because the agent runs in your GCP, AWS, or Azure tenant, the evaluation runs there too. Golden sets, replay, shadow scoring, and sampling all execute against data that never leaves your environment. That architectural answer is the one we stand behind, and it is simpler and stronger than a compliance badge: your data is measured where it lives, inside your own environment, where you can see and audit every run.

Where this leaves you

If you are evaluating an agent the way you would evaluate a model, you are measuring the wrong thing. Replace the single accuracy number with a panel that includes false-action rate, escalation quality, and calibration. Build a golden set you argue over and a replay corpus from your own traffic. Ramp through shadow and canary instead of flipping a switch. And accept that this is ongoing operating work, not a launch checklist, because an agent you do not keep measuring is an agent you no longer understand.

That is the discipline behind "see beyond the surface, think beyond the obvious": the number on the demo slide is the surface, and the behavior under real load is the part worth measuring. If you want to see what this looks like against your own workflows, the 5-day Diagnostic is where we map which actions are safe to automate, which need a human, and what we would have to measure to tell the difference.

↓ Sounding · the takeaway

Measure a panel, not a number: false-action rate, escalation quality, and calibration over a single accuracy score. Build a golden set you argue over and a replay corpus from your own traffic. Ramp through shadow and canary instead of flipping a switch. And keep measuring, inside your own cloud, because an agent you stop measuring is one you no longer understand.