"It worked in the demo and broke in prod" is the most common PhnomShip horror story. The pattern is always the same: no eval set, no regression check, no way to know whether the last prompt edit made things better or worse. You ship. Users complain. You revert. Nobody learns anything.
The fix is unglamorous: treat your eval set as the real code and your prompt as a candidate solution. Here's what that looks like in practice.
What "drift" actually is
An agent's behaviour changes when any of four things change: the prompt, the model version, the tool descriptions, or the distribution of user inputs. The first three are entirely under your control — and entirely your fault when they go wrong silently. The fourth is continuous, and the only defence is watching production.
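One way to stop the first three from changing silently is to snapshot them with every eval run and compare against the previous snapshot. The sketch below is an illustration only: `AgentConfig` and `changedFields` are hypothetical names, and it assumes tool descriptions can be represented as a simple name-to-text map.

```typescript
// Hypothetical snapshot of the three inputs you control.
type AgentConfig = {
  prompt: string;
  modelVersion: string; // assumed to be a pinned model identifier
  toolDescriptions: Record<string, string>; // tool name -> description text
};

// Serialises tool descriptions with sorted keys so key order does not matter.
function normaliseTools(tools: Record<string, string>): string {
  return JSON.stringify(Object.keys(tools).sort().map(k => [k, tools[k]]));
}

// Returns the names of the fields that differ between two snapshots.
export function changedFields(prev: AgentConfig, next: AgentConfig): string[] {
  const changed: string[] = [];
  if (prev.prompt !== next.prompt) changed.push("prompt");
  if (prev.modelVersion !== next.modelVersion) changed.push("modelVersion");
  if (normaliseTools(prev.toolDescriptions) !== normaliseTools(next.toolDescriptions)) {
    changed.push("toolDescriptions");
  }
  return changed;
}
```

Logging the result next to each eval score means a drop in pass rate can be traced to a specific change, or to none of the three, which points at the input distribution.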
Drift doesn't look like a 500 error. It looks like quality dropping from 94% to 87% over a week, with no deploy in between. By the time one user opens a support ticket, you've already lost ten others who never bothered to complain.
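A slow slide like that only becomes visible if you compare recent quality against a baseline. The sketch below assumes you already score a sample of production outputs and record a daily pass rate between 0 and 1. The name `detectDrift`, the three-day window, and the 0.05 tolerance are illustrative assumptions, not recommendations.

```typescript
type DriftReport = { baseline: number; recent: number; drifted: boolean };

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

// Compares the mean of the most recent days against the mean of all earlier days.
export function detectDrift(
  dailyPassRates: number[], // oldest first, each value in [0, 1]
  recentDays = 3,           // assumed window size
  tolerance = 0.05,         // assumed acceptable drop
): DriftReport | null {
  // Not enough history to split into a baseline and a recent window.
  if (dailyPassRates.length <= recentDays) return null;

  const baseline = mean(dailyPassRates.slice(0, -recentDays));
  const recent = mean(dailyPassRates.slice(-recentDays));
  return { baseline, recent, drifted: baseline - recent > tolerance };
}
```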
Write the eval first
Flip the TDD mantra. Before you add a new capability to your agent, write five to ten input/output pairs that demonstrate what success looks like. Then write the prompt that passes them.
This is the discipline our companion piece on shipping your first AI feature hints at with the "ten real inputs" step. Eval-driven development is just that, done every day, by every person on the team, on every change.
A minimal eval harness
You don't need a framework. You need a list and a runner.
type Eval = {
  name: string;
  input: string;
  expectedContains: string[]; // substrings the output must include
};

const evals: Eval[] = [
  {
    name: "summarises a short blog post",
    input: "We shipped a new feature today...",
    expectedContains: ["feature", "shipped"],
  },
];

// Runs every eval against the agent and returns the pass rate (0 to 1).
export async function runEvals(agent: (s: string) => Promise<string>) {
  let pass = 0;
  for (const e of evals) {
    const out = (await agent(e.input)).toLowerCase();
    // Lowercase both sides so the comparison is case-insensitive.
    const ok = e.expectedContains.every(s => out.includes(s.toLowerCase()));
    console.log(`${ok ? "PASS" : "FAIL"}: ${e.name}`);
    if (ok) pass++;
  }
  return pass / evals.length;
}
About twenty lines. It runs in your CI. It's ugly. It works. You'll outgrow it once you hit a hundred evals, and that's fine: by then you'll know exactly what you need.
Run evals on every change
Wire the eval run into your PR pipeline. Below an 80% pass rate, block the merge. Below 60%, page the author. Pick whatever thresholds fit your risk tolerance; what matters is that the check is automatic and the number is visible.
GitHub Actions or Vercel preview-deploy hooks are both fine. The eval run is cheap compared to the incident you avoid.
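As a sketch of that gate, the function below maps a pass rate to an action, using the 80% and 60% figures above as defaults. `gateDecision` and `Thresholds` are hypothetical names. How a merge gets blocked or an author gets paged depends on your CI system, so that part is left to the caller.

```typescript
type Thresholds = { blockBelow: number; pageBelow: number };
type Decision = "merge" | "block" | "block-and-page";

export function gateDecision(
  passRate: number,
  t: Thresholds = { blockBelow: 0.8, pageBelow: 0.6 },
): Decision {
  // An empty eval set yields NaN; treat that as a failure, not a pass.
  if (isNaN(passRate)) return "block";
  if (passRate < t.pageBelow) return "block-and-page";
  if (passRate < t.blockBelow) return "block";
  return "merge";
}
```

In a CI script, any result other than "merge" would typically translate into a non-zero exit status so the pipeline marks the check as failed.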
The three eval types you'll need
Deterministic — exact match, regex, schema validation. Free to run, high signal when they apply. Use wherever the output shape is strict.
Soft — contains, BLEU-style overlap, cosine similarity against a reference. Cheap, fuzzy, good for summaries and rewrites.
Judge (LLM-as-judge) — a second model scores the output against a rubric. Expensive. Often the only way to evaluate taste, nuance, or correctness-under-ambiguity. Use sparingly and pin the judge model's version.
Default to deterministic. Reach for soft when the output is open-ended. Reach for judge only when neither works — and be honest that judge scores have their own noise floor.
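To make the three categories concrete, here is one hedged sketch of each. The function names, the expected `title` field, the word-overlap metric, and the 0.7 minimum score are all assumptions made for illustration. The judge is passed in as a plain function so that no particular model API is implied.

```typescript
// Deterministic: the output must be a JSON object with a string "title" field.
export function checkJsonShape(out: string): boolean {
  try {
    const parsed = JSON.parse(out);
    return typeof parsed === "object" && parsed !== null
      && typeof parsed.title === "string";
  } catch {
    return false; // not valid JSON at all
  }
}

// Soft: fraction of reference words that also appear in the output.
function words(s: string): string[] {
  return s.toLowerCase().split(/[^a-z0-9]+/).filter(w => w.length > 0);
}

export function wordOverlap(out: string, reference: string): number {
  const outWords = words(out);
  const refWords = words(reference);
  if (refWords.length === 0) return 0;
  const hits = refWords.filter(w => outWords.indexOf(w) !== -1).length;
  return hits / refWords.length;
}

// Judge: a second model scores the output against a rubric.
// `Judge` is a hypothetical signature; wire it to whatever model client you use.
type Judge = (rubric: string, output: string) => Promise<number>; // score in [0, 1]

export async function checkWithJudge(
  judge: Judge,
  rubric: string,
  out: string,
  minScore = 0.7, // assumed cut-off
): Promise<boolean> {
  const score = await judge(rubric, out);
  return score >= minScore;
}
```

The first two are pure functions and cost nothing to run on every change. The third makes a model call per eval, which is why it belongs last in the order of preference.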
Eval maintenance is the hidden cost
Your eval set will drift too. Queries users asked six months ago aren't what they ask now. Tests that were important in v1 are obsolete in v3. A stale eval set gives you false confidence, which is worse than no eval set.
Review quarterly. Delete tests that no longer represent real usage. Add a test every time a user reports a bug that your suite missed. The eval set is a living artifact, not a one-time deliverable.
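One way to support that review is to store a little metadata with each eval. In the sketch below, the `source` and `lastConfirmed` fields and the `staleEvals` helper are hypothetical additions, and the 90-day default stands in for "quarterly". The function only flags candidates; deciding whether to delete or refresh them stays a human judgement.

```typescript
type TrackedEval = {
  name: string;
  input: string;
  expectedContains: string[];
  source: "designed" | "bug-report"; // hypothetical provenance tag
  lastConfirmed: string; // ISO date when someone last checked it reflects real usage
};

// Returns the evals that were last confirmed more than `maxAgeDays` before `today`.
export function staleEvals(
  evals: TrackedEval[],
  today: Date,
  maxAgeDays = 90,
): TrackedEval[] {
  const cutoff = today.getTime() - maxAgeDays * 24 * 60 * 60 * 1000;
  return evals.filter(e => new Date(e.lastConfirmed).getTime() < cutoff);
}
```

Tagging an eval with `source: "bug-report"` when it is added after a user complaint makes it easy to see how much of the suite came from real failures.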
The shift in mindset
Stop treating the prompt as "the code". Start treating the eval set as the code. The prompt is just the current best-known solution to the eval. When a new model comes out, or a bug report lands, the eval is what survives — the prompt is disposable.
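If the prompt is a candidate solution, then choosing one is a search over candidates scored by the eval set. The sketch below runs each candidate prompt against the same evals and keeps the best scorer. It reuses the `Eval` shape from the harness earlier; `AgentFactory` and `bestPrompt` are hypothetical names, and how a prompt becomes a callable agent is left to the caller.

```typescript
type Eval = { name: string; input: string; expectedContains: string[] };
type Agent = (input: string) => Promise<string>;
// Hypothetical: builds a callable agent from a prompt (and, implicitly, a model).
type AgentFactory = (prompt: string) => Agent;

// Fraction of evals whose expected substrings all appear in the output.
async function passRate(agent: Agent, evals: Eval[]): Promise<number> {
  let pass = 0;
  for (const e of evals) {
    const out = (await agent(e.input)).toLowerCase();
    if (e.expectedContains.every(s => out.indexOf(s.toLowerCase()) !== -1)) pass++;
  }
  return evals.length === 0 ? 0 : pass / evals.length;
}

// Scores every candidate on the same eval set; ties keep the earlier candidate.
// With no candidates, returns an empty prompt and a score of -1.
export async function bestPrompt(
  candidates: string[],
  makeAgent: AgentFactory,
  evals: Eval[],
): Promise<{ prompt: string; score: number }> {
  let best = { prompt: "", score: -1 };
  for (const prompt of candidates) {
    const score = await passRate(makeAgent(prompt), evals);
    if (score > best.score) best = { prompt, score };
  }
  return best;
}
```

The same loop works when the thing being swapped is the model version and not the prompt: the eval set stays fixed and only the factory changes.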
Agents that don't drift aren't magic. They're agents whose owners wrote down what "working" means and run that check on every single deploy.