
LLM Prompt Eval Harness in CI: Catch Regressions Pre-Merge
Prompts in git with semantic versions, a JSONL eval set per prompt, graders that block merge on regression, and canary rollouts gated by the same signal.
The problem
The bug shows up like this. A product manager pings the on-call to say the support-triage bot is suddenly classifying refund requests as feature requests. You pull recent traffic, you confirm the regression, you start digging. It turns out three weeks ago someone added a sentence to the system prompt to handle a new edge case. The change was reviewed for grammar. Nobody re-ran any test cases against the old behavior because there were no test cases. The model is doing exactly what it was asked to do; the ask just quietly broke a behavior the team had stopped checking.
Silent regression on prompt edits
This is the dominant failure mode for production LLM apps and it is not solvable with monitoring. By the time the dashboard shows a misclassification rate creeping up, the bad prompt has been live for days and the regression is mixed in with normal variance. By the time a customer complains, the broken behavior has shipped into their workflow. The root cause is usually the same shape: a prompt change merged without anyone proving it did not break the behaviors that were working before.
Vibes-based model upgrades
The flip side is just as expensive. The team is on Sonnet 4.5 and a new model drops. Someone wants to try it. They do a few hand-tested examples, the outputs look fine, the upgrade ships. Two weeks later, a customer notices the JSON output is now occasionally missing a required field. The model is better in aggregate but worse on a behavior nobody tested. The upgrade gets rolled back, the team loses a week of velocity, and the next model upgrade gets delayed by six months because nobody trusts the process.
Compounding cost of zero coverage
The business consequence is real money. Every prompt regression that ships is either a refund, a churn, or an engineering week spent rebuilding trust with one customer. Every model upgrade that gets blocked by lack of confidence is months of cheaper or better inference left on the table. In an LLM-heavy product, the cost of not having an eval harness compounds — every new prompt adds a new surface area for silent regression, every model upgrade adds risk, and the team gets slower over time instead of faster.
Non-determinism breaks naive assertions
What makes this hard is that LLM outputs are not deterministic in the way unit tests assume. You cannot write assert(output === "expected") and expect it to pass twice in a row. The fix is not to give up on testing — it is to write graders that score the output the way a human reviewer would, run them on a representative dataset every time the prompt or model changes, and block the merge when scores drop below a threshold. The pattern below is what that looks like in production.

What changes for your business
The architecture has four pieces that have to work together: a prompts/ directory in git with semantic versions, a JSONL test dataset per prompt, a grader stack that scores model outputs against the dataset, and a CI runner that fails the build when scores regress. Layer a canary rollout on top for prompts that pass CI but you want extra confidence on in production.
Versioned prompts directory in git
Start with the prompts directory. Each prompt lives in its own folder with the prompt text, the test dataset, and the grader configuration co-located. The semantic version is in the filename or the metadata — pick one and be consistent. A reasonable layout looks like this:
prompts/
  support-triage/
    v3.2.1/
      prompt.md
      tests.jsonl
      graders.ts
    v3.2.2/
      prompt.md
      tests.jsonl
      graders.ts
The reason for versioning at the directory level rather than relying on git history alone is that production needs to be able to pin to a specific version while you iterate on the next one. The router code reads the version from config; rolling back a prompt is a config change, not a code revert. This also means that the eval harness can run the same test set against multiple versions in parallel and produce a side-by-side report — exactly the workflow Anthropic's Console evaluation tool documents for prompt versioning, just expressed in git.
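As a minimal sketch of what "rolling back is a config change" could mean in code, assume a hypothetical config map from prompt name to pinned version. The names PromptConfig and resolvePromptDir and the prompts root are illustrative, not part of any library:

```typescript
// Hypothetical config shape: prompt name -> pinned semantic version.
type PromptConfig = Record<string, string>;

const SEMVER = /^\d+\.\d+\.\d+$/;

// Resolve the directory the router (and the eval harness) should read from.
function resolvePromptDir(config: PromptConfig, name: string, root = "prompts"): string {
  const version = config[name];
  if (version === undefined) {
    throw new Error(`no pinned version for prompt "${name}"`);
  }
  if (!SEMVER.test(version)) {
    throw new Error(`"${version}" is not a MAJOR.MINOR.PATCH version`);
  }
  return `${root}/${name}/v${version}`;
}

// Rolling back is a one-line config change, not a code revert.
const production: PromptConfig = { "support-triage": "3.2.2" };
const rolledBack: PromptConfig = { ...production, "support-triage": "3.2.1" };

console.log(resolvePromptDir(production, "support-triage")); // prompts/support-triage/v3.2.2
console.log(resolvePromptDir(rolledBack, "support-triage")); // prompts/support-triage/v3.2.1
```

Because the harness can call the same resolver with any version, running one test set against two versions side by side only needs two config objects.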
JSONL dataset seeded from real traffic
The JSONL test dataset is one case per line. The fields you need are the input variables the prompt expects, plus an expected value for exact-match or similarity graders (expected_label in the rows below), a rubric field for LLM-as-judge graders, or both. The dataset is the source of truth for what the prompt is supposed to do. Treat it like a regression suite for application code: every production bug that reaches a customer becomes a new test case, and the dataset grows under real pressure rather than speculative coverage.
{"input": {"ticket": "I want my money back, this is broken"}, "expected_label": "refund_request", "rubric": "Output must classify as refund_request and explain the trigger phrase."}
{"input": {"ticket": "Can you add dark mode?"}, "expected_label": "feature_request", "rubric": "Output must classify as feature_request without confusing intent."}
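A malformed row is easier to catch at load time than halfway through a paid eval run. The loader below is a sketch that assumes the row shape shown above; parseCases and EvalCase are illustrative names, and the validation rules (an input object plus at least one of expected_label or rubric) are an assumption about what the graders need:

```typescript
type EvalCase = {
  input: Record<string, string>;
  expected_label?: string;
  rubric?: string;
};

// Parse a JSONL string and fail loudly on rows the graders could not score.
function parseCases(jsonl: string): EvalCase[] {
  const cases: EvalCase[] = [];
  jsonl.split("\n").forEach((line, i) => {
    if (line.trim() === "") return; // tolerate blank lines and a trailing newline
    let row: unknown;
    try {
      row = JSON.parse(line);
    } catch (e) {
      throw new Error(`line ${i + 1}: not valid JSON`);
    }
    if (typeof row !== "object" || row === null) {
      throw new Error(`line ${i + 1}: expected a JSON object`);
    }
    const candidate = row as Partial<EvalCase>;
    if (typeof candidate.input !== "object" || candidate.input === null) {
      throw new Error(`line ${i + 1}: missing "input" object`);
    }
    if (candidate.expected_label === undefined && candidate.rubric === undefined) {
      throw new Error(`line ${i + 1}: needs "expected_label" or "rubric"`);
    }
    cases.push(candidate as EvalCase);
  });
  return cases;
}
```

The line number in each error message points at the offending row, which matters once the file holds a few hundred cases.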
Two-layer grader stack
The grader stack runs the prompt against each row, captures the output, and scores it. The cheap graders run first because they are fast and free: exact match for classification labels, regex for required JSON keys, schema validation for structured outputs. The expensive grader — semantic similarity or LLM-as-judge — runs second on the rows that passed the cheap layer, because there is no point judging the quality of an output that already failed the structural check.
OpenAI's graders documentation describes several grader types, four of which map cleanly to this stack: string_check for exact and contains-style match, text_similarity for fuzzy and embedding-based scoring (with fuzzy_match, BLEU, GLEU, METEOR, cosine, and ROUGE variants against a reference string with a configurable pass_threshold), score_model for LLM-as-judge with a numeric range and a threshold, and python for arbitrary logic. The vitest-style runner below uses the same conceptual split: cheap graders are TypeScript functions and the judge grader is a Claude or GPT call, with the same contract of output in, pass or score out.
import { describe, it, expect } from "vitest";
import fs from "node:fs";
import path from "node:path";
import { Anthropic } from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment.
const anthropic = new Anthropic();

type Case = {
  input: Record<string, string>;
  expected_label?: string;
  rubric?: string;
};

const PROMPT_DIR = "prompts/support-triage/v3.2.2";
const PROMPT = fs.readFileSync(path.join(PROMPT_DIR, "prompt.md"), "utf8");
const TESTS: Case[] = fs
  .readFileSync(path.join(PROMPT_DIR, "tests.jsonl"), "utf8")
  .split("\n")
  .filter((line) => line.trim() !== "")
  .map((line) => JSON.parse(line));

// Per-criterion accumulators. The CI report reads these.
const results = {
  exact_label: { passed: 0, total: 0 },
  rubric: { passed: 0, total: 0 },
};

// Pass thresholds. Failing under threshold fails CI.
const THRESHOLDS = { exact_label: 0.95, rubric: 0.9 };

// Each case makes up to two API calls, so vitest's default 5 s timeout is too tight.
const TEST_TIMEOUT_MS = 60_000;

async function runPrompt(input: Record<string, string>): Promise<string> {
  // Cache the system prompt so repeated test rows hit the cache.
  // cache_read_input_tokens in the response confirms the hit.
  const resp = await anthropic.messages.create({
    model: "claude-sonnet-4-6",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: PROMPT,
        cache_control: { type: "ephemeral" },
      },
    ],
    messages: [{ role: "user", content: JSON.stringify(input) }],
  });
  return resp.content[0].type === "text" ? resp.content[0].text : "";
}

async function judgeRubric(output: string, rubric: string): Promise<number> {
  // LLM-as-judge. Returns 0-1; an unparseable reply scores 0.
  const judge = await anthropic.messages.create({
    model: "claude-haiku-4-5",
    max_tokens: 16,
    messages: [
      {
        role: "user",
        content:
          `Rubric: ${rubric}\n\nOutput to grade:\n${output}\n\n` +
          `Respond with a single number 0 to 1. No prose.`,
      },
    ],
  });
  const text = judge.content[0].type === "text" ? judge.content[0].text : "0";
  const n = parseFloat(text.trim());
  return Number.isFinite(n) ? Math.max(0, Math.min(1, n)) : 0;
}

describe("support-triage v3.2.2", () => {
  for (const tc of TESTS) {
    it(JSON.stringify(tc.input).slice(0, 60), async () => {
      const output = await runPrompt(tc.input);

      // Cheap grader: the expected label must appear in the output
      // (a substring check, since the output also carries an explanation).
      let cheapPassed = true;
      if (tc.expected_label) {
        results.exact_label.total += 1;
        cheapPassed = output.includes(tc.expected_label);
        if (cheapPassed) results.exact_label.passed += 1;
      }

      // Expensive grader: LLM-as-judge against the rubric, only for rows
      // that passed the cheap layer. A row that failed it counts as a
      // rubric failure without spending a judge call.
      if (tc.rubric) {
        results.rubric.total += 1;
        if (cheapPassed) {
          const score = await judgeRubric(output, tc.rubric);
          if (score >= 0.7) results.rubric.passed += 1;
        }
      }
    }, TEST_TIMEOUT_MS);
  }

  // Tests in a file run in order, so the accumulators are complete here.
  it("per-criterion pass rates meet thresholds", () => {
    const rates = {
      exact_label: results.exact_label.passed / results.exact_label.total,
      rubric: results.rubric.passed / results.rubric.total,
    };
    // Surface the rates in the test output so the CI logs show them.
    console.log("pass rates:", rates);
    expect(rates.exact_label).toBeGreaterThanOrEqual(THRESHOLDS.exact_label);
    expect(rates.rubric).toBeGreaterThanOrEqual(THRESHOLDS.rubric);
  });
});
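The runner above implements only the label check from the cheap layer. For prompts with structured output, the other cheap graders mentioned earlier (required JSON keys, then an exact label match) could look like the sketch below. It assumes, hypothetically, that the prompt returns a JSON object with label and reason keys; that output contract and the function names are illustrative, not part of the runner above:

```typescript
type GradeResult = { pass: boolean; reason: string };

// Exact-match grader: only surrounding whitespace and letter case are normalized.
function gradeExactLabel(actual: string, expected: string): GradeResult {
  const pass = actual.trim().toLowerCase() === expected.trim().toLowerCase();
  return { pass, reason: pass ? "label matches" : `expected "${expected}", got "${actual.trim()}"` };
}

// Structural grader: output must be a JSON object containing every required key.
function gradeRequiredKeys(output: string, requiredKeys: string[]): GradeResult {
  let parsed: unknown;
  try {
    parsed = JSON.parse(output);
  } catch (e) {
    return { pass: false, reason: "output is not valid JSON" };
  }
  if (typeof parsed !== "object" || parsed === null || Array.isArray(parsed)) {
    return { pass: false, reason: "output is not a JSON object" };
  }
  const obj = parsed as Record<string, unknown>;
  const missing = requiredKeys.filter((key) => !(key in obj));
  return missing.length === 0
    ? { pass: true, reason: "all required keys present" }
    : { pass: false, reason: `missing keys: ${missing.join(", ")}` };
}

// Cheap layer in order: structure first, then the label inside the structure.
function gradeCheap(output: string, expectedLabel: string): GradeResult {
  const shape = gradeRequiredKeys(output, ["label", "reason"]);
  if (!shape.pass) return shape;
  const label = String((JSON.parse(output) as { label: unknown }).label);
  return gradeExactLabel(label, expectedLabel);
}
```

Returning a reason alongside the pass flag keeps failures readable in the CI log without re-running the case.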
CI runner that blocks merge on regression
The CI runner wires this into a GitHub Actions step (or your equivalent) that runs on every PR touching prompts/** or the calling code. It runs the full suite, captures the per-criterion pass rates, and posts them as a PR comment so the reviewer sees the numbers next to the diff. If any criterion falls below its threshold, the job exits non-zero and the merge is blocked. The job also runs nightly on main with the current production prompts and the current model, which catches the case where the provider quietly updated the model under you and your prompts now behave differently with no commit on your side.
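The gate itself is small. As a sketch, assuming the runner hands over per-criterion tallies in the same shape as its accumulators, a function like the hypothetical evaluateGate below could produce both the PR comment body and the pass or fail decision. How the comment gets posted and how the exit code is set depend on your CI system and are left out:

```typescript
type Tally = { passed: number; total: number };
type GateResult = { ok: boolean; markdown: string };

// Compare each criterion's pass rate to its threshold and build a report table.
function evaluateGate(
  results: Record<string, Tally>,
  thresholds: Record<string, number>,
): GateResult {
  const rows = ["| criterion | pass rate | threshold | status |", "| --- | --- | --- | --- |"];
  let ok = true;
  for (const criterion of Object.keys(thresholds)) {
    const tally = results[criterion];
    const graded = tally !== undefined && tally.total > 0;
    // A criterion with no graded rows counts as a failure, not a silent pass.
    const rate = graded ? tally.passed / tally.total : 0;
    const pass = graded && rate >= thresholds[criterion];
    if (!pass) ok = false;
    rows.push(
      `| ${criterion} | ${(rate * 100).toFixed(1)}% | ` +
        `${(thresholds[criterion] * 100).toFixed(1)}% | ${pass ? "pass" : "FAIL"} |`,
    );
  }
  return { ok, markdown: rows.join("\n") };
}

// In CI: post gate.markdown as the PR comment, then exit non-zero when gate.ok is false.
const gate = evaluateGate(
  { exact_label: { passed: 19, total: 20 }, rubric: { passed: 17, total: 20 } },
  { exact_label: 0.95, rubric: 0.9 },
);
console.log(gate.markdown);
```

In this example the label criterion sits exactly on its threshold and passes, while the rubric criterion at 17 of 20 falls short, so the gate blocks the merge.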
Canary rollout gated by the same graders
The canary layer is the production-side counterpart. When a prompt change merges, the router reads a feature flag and sends 10% of traffic to the new version while 90% continues on the old. The same grader stack runs against the live samples in the background — same graders, same thresholds, just scored against real user inputs instead of the regression set. If pass rates stay within the v1 envelope for the canary window (usually 2-24 hours depending on traffic), the flag flips to 100% automatically. If they drop, the flag rolls back and pages a human. The point is that the same signal that gates the PR also gates the rollout — there is one definition of "this prompt is good enough," and it applies everywhere.
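One way to implement the 10/90 split is to hash a stable identifier into a bucket, so a given user always sees the same prompt version during the canary. This is a sketch under assumptions: the Rollout shape stands in for whatever your feature-flag system returns, and FNV-1a is used only because it is short enough to inline; any stable hash would do.

```typescript
// FNV-1a 32-bit hash over UTF-16 code units.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force an unsigned 32-bit result
}

// Hypothetical flag payload: which versions are live and how much traffic the canary gets.
type Rollout = { stable: string; canary: string; canaryPercent: number };

// Deterministic routing: the same user id always maps to the same bucket (0..99).
function pickVersion(userId: string, rollout: Rollout): string {
  const bucket = fnv1a(userId) % 100;
  return bucket < rollout.canaryPercent ? rollout.canary : rollout.stable;
}

const rollout: Rollout = { stable: "3.2.1", canary: "3.2.2", canaryPercent: 10 };
console.log(pickVersion("customer-1234", rollout));
```

Promotion and rollback then become edits to canaryPercent (100 or 0) rather than deploys.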

What gets shipped
A typical engagement on this pattern ships seven pieces. First, the prompts/ directory layout with semantic versions, a README that names the conventions, and at least one fully populated prompt folder as the reference example. Second, the JSONL test dataset seeded from 50-200 real production examples per prompt, anonymized and labeled. Third, the grader implementations — cheap graders as TypeScript functions, judge graders as Claude or GPT calls with prompt caching configured so repeated runs are cheap. Fourth, the vitest (or jest, or Python equivalent) runner that orchestrates the above and produces a per-criterion report. Fifth, the CI workflow that runs the suite on every PR touching prompts, posts the pass-rate report as a PR comment, and blocks merge on regression. Sixth, the nightly run against main that catches model-side drift. Seventh, the canary router that reads a feature flag, splits traffic, and runs the same graders against live samples to auto-promote or auto-rollback.
The non-obvious piece is the prompt-caching configuration. Anthropic charges cache reads at 0.1x the base input token price, which makes the 199 cases after the first one in a 200-case eval suite roughly 90% cheaper on the cached portion. The cache_creation_input_tokens and cache_read_input_tokens fields in the API response are what you verify against in the runner: if you expect cache hits and the response shows zero reads, your cache_control breakpoint is in the wrong place and the cached portion of the input will cost roughly 10x what it should. Bake that check into the runner so a misconfiguration shows up as a failed test, not a surprise bill.
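The arithmetic is worth writing down once. The sketch below uses the 0.1x read multiplier from the text; the 1.25x multiplier for the first (cache-writing) call, the 2,000-token system prompt, and the $3 per million input tokens are assumptions chosen for illustration, so check current pricing before relying on the numbers. It also assumes every call after the first is a cache hit.

```typescript
type CacheCostInput = {
  cases: number; // number of eval rows sharing one cached prefix
  cachedTokens: number; // tokens in the cached system prompt
  pricePerMTok: number; // base input price, dollars per million tokens
  writeMultiplier: number; // price multiplier for the call that writes the cache
  readMultiplier: number; // price multiplier for calls that read the cache
};

// Cost of the cached portion only; per-row user input and output tokens are excluded.
function cachedPortionCost(p: CacheCostInput): { uncached: number; cached: number; savings: number } {
  const perCall = (p.cachedTokens * p.pricePerMTok) / 1e6;
  const uncached = p.cases * perCall;
  const cached = perCall * p.writeMultiplier + (p.cases - 1) * perCall * p.readMultiplier;
  return { uncached, cached, savings: 1 - cached / uncached };
}

const estimate = cachedPortionCost({
  cases: 200,
  cachedTokens: 2000, // assumed prompt size
  pricePerMTok: 3, // assumed base price
  writeMultiplier: 1.25, // assumed cache-write surcharge
  readMultiplier: 0.1, // from the text
});
console.log(estimate);
```

Under these assumptions the cached portion drops from about $1.20 to about $0.13 per run, a saving of about 89%, slightly under 90% because the first call pays the write surcharge.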
Common failure modes
The first sharp edge is the dataset that stops growing. A team writes 20 test cases on day one, ships the harness, declares victory, and does not add another case for months. Six months later the suite is passing on yesterday's behaviors and missing all the new ones. The fix is a workflow rule: every production incident that traces back to a prompt requires a new test case in the same PR as the fix, and CI checks that the case count in tests.jsonl is monotonically non-decreasing on main. The dataset has to grow under the same pressure that produces bugs, or it stops representing reality.
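The case-count rule can be a few lines in CI. This sketch assumes the job has already obtained both versions of the file as strings, for example the main-branch copy via git show origin/main:path and the PR copy from the working tree; countCases and checkCaseCount are illustrative names:

```typescript
// Count non-blank lines, which is the number of cases in a JSONL file.
function countCases(jsonl: string): number {
  return jsonl.split("\n").filter((line) => line.trim() !== "").length;
}

// Fail when the PR leaves the dataset with fewer cases than main has.
function checkCaseCount(baseJsonl: string, headJsonl: string): { ok: boolean; message: string } {
  const before = countCases(baseJsonl);
  const after = countCases(headJsonl);
  return after >= before
    ? { ok: true, message: `case count ${before} -> ${after}` }
    : { ok: false, message: `case count dropped from ${before} to ${after}` };
}
```

A count check does not stop someone from swapping a hard case for an easy one, so it complements review of the tests.jsonl diff rather than replacing it.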
The second is the LLM-as-judge that grades its own output. If the judge model is the same as the model under test, the suite scores high on a kind of self-consistency rather than actual quality. Cross-grade the suite where you can: Claude judges GPT outputs, GPT judges Claude outputs, or a smaller model from a third provider acts as the judge. The cost difference is small and the bias reduction is meaningful.
The third is the threshold that drifts down. Someone ships a change that drops the rubric pass rate from 0.91 to 0.89. They lower the threshold from 0.9 to 0.88 to unblock the PR. Next quarter, the threshold is 0.7 and the suite is no longer catching anything. The fix is to treat threshold changes as a separate PR that requires a different reviewer than the prompt change, and to track the threshold values in a dashboard so a downward trend is visible. The harness only works if the bar holds.
The fourth is the canary that promotes too fast. A prompt change ships, the canary runs for 30 minutes against low-traffic hours, the graders all pass, and the flag flips to 100%. Two hours later, peak traffic exposes a class of inputs the canary window did not cover, and the regression hits everyone. The fix is to define the canary window in terms of traffic volume (e.g. 1,000 graded samples or 50 distinct customers) rather than wall-clock time, so a low-traffic canary stays in canary longer.
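A volume-based promotion rule might be sketched as follows. The policy fields and the decideCanary name are hypothetical, and this version requires both volume conditions (graded samples and distinct customers) before it will decide anything, which is one reading of the rule above rather than the only one:

```typescript
type Rates = Record<string, number>;

type CanaryState = {
  gradedSamples: number;
  distinctCustomers: number;
  canaryRates: Rates; // per-criterion pass rates on the new version
  baselineRates: Rates; // per-criterion pass rates on the current version
};

type CanaryPolicy = { minSamples: number; minCustomers: number; tolerance: number };
type Decision = "promote" | "hold" | "rollback";

function decideCanary(state: CanaryState, policy: CanaryPolicy): Decision {
  const enoughVolume =
    state.gradedSamples >= policy.minSamples && state.distinctCustomers >= policy.minCustomers;
  // Stay in canary until the volume is reached, however much wall-clock time has passed.
  if (!enoughVolume) return "hold";
  for (const criterion of Object.keys(state.baselineRates)) {
    const baseline = state.baselineRates[criterion];
    const canary = state.canaryRates[criterion] ?? 0; // an unscored criterion counts as failing
    if (canary < baseline - policy.tolerance) return "rollback";
  }
  return "promote";
}

const policy: CanaryPolicy = { minSamples: 1000, minCustomers: 50, tolerance: 0.02 };
```

With this policy, 30 minutes of overnight traffic that yields 120 graded samples returns hold even if every grader passed. A production version would likely also want an early-abort rule for severe drops before the volume is reached.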
The fifth is the prompt that was not wired into the harness in the first place. A team has six production prompts; the harness covers four; the two it does not cover are the ones that regress. The fix is a lint rule that scans the codebase for messages.create or chat.completions.create calls and asserts that the system prompt loaded into the call resolves to a path under prompts/ that has a tests.jsonl alongside it. If a developer adds a new prompt and forgets the eval set, the build fails on the lint, not in production.
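The second half of that lint rule, checking that every prompt directory has an eval set next to it, is easy to sketch. The function below assumes the directory layout shown earlier and takes a list of repo-relative file paths (however you obtain them, for instance from git ls-files); findUntestedPrompts is an illustrative name. Tracing each messages.create call site back to a prompt path is the harder half and is not shown here.

```typescript
// Return prompt version directories that contain prompt.md but no tests.jsonl.
function findUntestedPrompts(paths: string[]): string[] {
  const files = new Set(paths);
  const suffix = "/prompt.md";
  return paths
    .filter((p) => p.startsWith("prompts/") && p.endsWith(suffix))
    .map((p) => p.slice(0, p.length - suffix.length))
    .filter((dir) => !files.has(`${dir}/tests.jsonl`))
    .sort();
}

const untested = findUntestedPrompts([
  "prompts/support-triage/v3.2.2/prompt.md",
  "prompts/support-triage/v3.2.2/tests.jsonl",
  "prompts/summarizer/v1.0.0/prompt.md", // hypothetical prompt with no eval set
  "src/router.ts",
]);
console.log(untested); // [ 'prompts/summarizer/v1.0.0' ]
```

Failing the build when the returned list is non-empty covers the "forgot the eval set" case, while inlined prompts still need the call-site scan.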
Model upgrades without the vibes test
The clearest payoff of the harness is what it does to model upgrades. Without it, "should we move from Sonnet 4.5 to 4.6" is a week of hand-testing examples in the Anthropic Console, plus a tense conversation about whether the cost savings are worth the risk. With it, the answer is two PRs: one that changes the model name in config and runs the full suite, one that ships the upgrade if the suite passes within tolerance. Same workflow for GPT-4o to o3, or for swapping providers entirely for cost reasons.
The per-criterion report is what makes this work. The headline number — "94% pass rate on the new model vs 96% on the old" — is not enough to decide on. What you need is the breakdown: which criteria dropped, by how much, on which kinds of inputs. The harness gives you that for free because the cases and graders are already structured. If the regression is on a criterion you do not care about (e.g. the new model is slightly worse at formatting bullet lists), the upgrade ships. If the regression is on a criterion that maps to a customer-visible behavior, you keep the old model on the affected prompts via a fallback chain at the router and upgrade the rest. Either way, the decision is grounded in measurement, not impressions.
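The breakdown is a small diff over two sets of pass rates. In this sketch the criterion names come from the three-criteria setup described later in the article, while the rates, the 0.01 tolerance, and the compareModels name are made up for illustration:

```typescript
type CriterionDelta = { criterion: string; oldRate: number; newRate: number; delta: number };

// List criteria where the new model is worse than the old one by more than the tolerance.
function compareModels(
  oldRates: Record<string, number>,
  newRates: Record<string, number>,
  tolerance = 0.01,
): CriterionDelta[] {
  const regressions: CriterionDelta[] = [];
  for (const criterion of Object.keys(oldRates)) {
    const newRate = newRates[criterion];
    if (newRate === undefined) {
      throw new Error(`criterion "${criterion}" missing from the new run`);
    }
    const oldRate = oldRates[criterion];
    const delta = newRate - oldRate;
    if (delta < -tolerance) regressions.push({ criterion, oldRate, newRate, delta });
  }
  // Largest drop first, so the reviewer sees the worst regression at the top.
  return regressions.sort((a, b) => a.delta - b.delta);
}

const regressions = compareModels(
  { label_match: 0.97, json_shape: 0.99, tone: 0.92 }, // illustrative old-model rates
  { label_match: 0.98, json_shape: 0.91, tone: 0.915 }, // illustrative new-model rates
);
console.log(regressions.map((r) => r.criterion)); // [ 'json_shape' ]
```

Here the headline average barely moves, but the report isolates json_shape as the one criterion that dropped beyond tolerance, which is the customer-visible one.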
This also unlocks the "evaluate the model before it is GA" workflow. When a provider ships a new model in preview, the same harness runs against it and tells you in an hour whether your prompts work with it. By the time the model is GA and your competitors are starting their evaluation, your team is already shipping the upgrade because the decision is already made.
What this looks like in production
At BFEAI we run the harness against every prompt change that touches the agentic loop. The dataset for the main support-triage prompt is around 180 cases, scored against three criteria (label match, JSON shape, rubric-based tone). A full eval run takes about 4 minutes against Sonnet 4.6 with prompt caching configured, and costs roughly $2 per run when the cache is hitting cleanly. The PR comment shows three numbers and a link to the full report; reviewers can see at a glance whether the change is a regression, an improvement, or a wash, and the dataset of "rejected because of an eval regression" PRs is the audit trail finance and compliance look at when they want evidence that prompt changes are not shipping unreviewed.
The nightly run against main is what surfaced the one production-relevant model behavior change we have observed in the last six months — a new model version handled a specific edge case differently than the prior version, the nightly run flagged the rubric criterion drop, and the team had a fix in place before the change rolled out to enough traffic to matter. Without the nightly run, the first signal would have been a customer ticket.
The dashboard a CTO wants on this pattern has four rows: the latest pass rate per prompt against current production model (this is the "is the suite green" number), the trend of that number over the last 30 days (a slow drift down is a leading indicator that the dataset is stale), the cost per eval run (a sudden jump means a caching regression), and the count of test cases per prompt (a flat line means the team has stopped feeding incidents back into the suite). Those four numbers are the entire health view; everything else is detail.
What to watch in your own implementation
Open your repo and search for every anthropic.messages.create, openai.chat.completions.create, or equivalent call. For each one, answer three questions. First: does the system prompt live in a file under prompts/ with a semantic version, or is it inlined in code? Inlined prompts are unversioned and untestable; move them. Second: is there a tests.jsonl alongside the prompt file, and does it have at least 20 cases drawn from real production inputs? If not, that prompt is shipping blind. Third: does CI block the merge when the harness fails, or is it advisory? Advisory eval suites get ignored within a month; gating ones get maintained.
Then run a one-shot survey of your last quarter's prompt-related incidents. For each one, ask whether the harness as currently configured would have caught it pre-merge. The cases where the answer is no are the ones to add to the dataset this week, and the cases where the answer is "we did not have a grader for that criterion" are the ones that tell you which grader to write next. The harness improves under real failure pressure, not under speculative test design.
Finally, verify the prompt cache is actually hitting. Add an assertion in the runner that cache_read_input_tokens on the second test case is greater than zero, and fail the run if it is not. A silently broken cache is the single biggest cost surprise on this pattern, and it is the easiest to catch with one line of code in the right place.
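That assertion could look like the sketch below. It assumes the runner collects the usage object from each response in call order and that cases run sequentially; CacheUsage lists only the two fields named above, and assertCacheHitting is an illustrative name:

```typescript
// The two cache-related usage fields; other usage fields are omitted here.
type CacheUsage = {
  cache_creation_input_tokens?: number | null;
  cache_read_input_tokens?: number | null;
};

// Throw when the second call in a run did not read from the prompt cache.
function assertCacheHitting(usages: CacheUsage[]): void {
  if (usages.length < 2) return; // nothing to verify with fewer than two calls
  const reads = usages[1]?.cache_read_input_tokens ?? 0;
  if (reads <= 0) {
    const created = usages[0]?.cache_creation_input_tokens ?? 0;
    throw new Error(
      `prompt cache is not hitting: second call read ${reads} cached tokens ` +
        `(first call wrote ${created}); check the cache_control placement`,
    );
  }
}
```

Two caveats apply: calls fired in parallel can miss the cache because the first response has not finished writing it, and a system prompt shorter than the model's minimum cacheable length is not cached at all, so either situation will trip this check by design.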
Outcomes you should expect
What this delivers
- Prompt regressions are caught at PR time, not by a customer pasting bad output into a support ticket two weeks after the merge.
- A model upgrade (Sonnet 4.5 to 4.6, GPT-4o to o3) becomes a one-command eval run with a green or red answer, not a week of vibes-based testing.
- Per-criterion pass rates are visible on every PR, so reviewers can see exactly which behavior changed and decide whether the trade-off is acceptable.
- Canary deploys for prompts ship 10% of traffic to v2, watch the same evals against live samples, and auto-promote or auto-rollback without a human in the loop on the happy path.
Primary sources
By the numbers
OpenAI's evals framework follows three steps: describe the task to be done as an eval, run your eval with test inputs (a prompt and input data), and analyze the results to iterate and improve on your prompt.
OpenAI eval runs return a result_counts object containing total, errored, failed, and passed, plus a per_testing_criteria_results breakdown that gives you a pass rate per grader.
OpenAI's text_similarity grader supports fuzzy_match, BLEU, GLEU, METEOR, cosine, and ROUGE variants (1-5, L) against a reference string with a configurable pass_threshold.
OpenAI's score_model grader (LLM-as-judge) takes a model, an input message array, a numeric range, and a pass_threshold, and is restricted to a supported set of grader models including gpt-4o, o1, o3, and o4 variants.
Anthropic's prompt caching charges cache reads at 0.1x the base input token price, which makes repeated eval runs against the same system prompt roughly 90% cheaper on the cached portion within the 5-minute default cache lifetime.
Anthropic returns cache_creation_input_tokens and cache_read_input_tokens in the API usage object on every response, which lets a CI eval runner verify the cache is actually hitting before reporting cost numbers.
Anthropic's Console evaluation tool supports prompt versioning by letting you create new versions of your prompt and re-run the test suite to compare side-by-side outputs and quality grades.
Live in production today
The same engineering, shipped in production at BFEAI.
I'm co-founder & CTO of Be Found Everywhere (BFEAI), a 7-app AI SaaS platform running today. The work I deliver for clients is the work I do every week on my own platform.
- 7 production apps
- 200K+ keywords generated
- 1,500+ AI scans run
- 7,000+ sites automated
Common questions
What buyers ask before reaching out
Why version prompts in git instead of in a database or a prompt-management SaaS?
Git is the only place where the prompt, the code that calls the prompt, and the eval cases for the prompt all live under the same commit hash. When something regresses, you can git bisect across all three. When you ship, you ship them together. A SaaS prompt store gives you a nice UI but introduces a second source of truth, and the day those two disagree is the day production breaks in a way that is hard to debug. Use the SaaS for the human-readable view if you want; keep git as the source of truth.
What goes in the JSONL test dataset for a prompt?
Each line is one test case: the input the prompt will see, and either an expected output (for exact match or similarity graders) or a rubric the LLM-as-judge will score against. Real production cases are better than synthetic ones — pull 50-200 anonymized examples from logs, label them once, and use that as the regression set forever. Add new cases whenever a bug ships.
When should I use LLM-as-judge vs exact-match graders?
Exact match and regex are fast, cheap, and deterministic, but they only work for outputs that have a single correct shape: JSON keys, classification labels, structured fields. LLM-as-judge is slower and more expensive, but it is the only grader that can score things like 'is this summary faithful to the source' or 'does this reply sound on-brand'. Most production prompts need both layers: cheap graders catch the structural failures fast, and judge graders catch the quality failures the cheap ones miss.
Will running evals on every PR get expensive?
It can if the eval set is big and you ignore caching. The fix is Anthropic-style prompt caching for the system prompt and few-shot block: cache reads cost about 10% of the base input token price within the default 5-minute window, so a 200-case eval that re-uses the same system prompt pays the full input price once and the cached price 199 times. On OpenAI, batch the runs and use the cheaper grader models for the LLM-as-judge pass. Most teams I work with land under $5 per full eval run.
What does a canary rollout for a prompt actually look like?
Same shape as a code canary. The router reads a feature flag, routes 10% of traffic to prompt v2 and 90% to v1, and runs the same eval graders against the live samples in the background. If v2's per-criterion pass rates stay within the v1 envelope for the canary window, the router promotes v2 to 100%. If they drop, it auto-rolls back. The point is that the same graders that gate the PR also gate the rollout — no separate 'production quality' signal that disagrees with the pre-merge one.
How do I evaluate a model upgrade without re-writing every prompt?
Point the eval runner at the new model, hold the prompts constant, and run the full suite. The per-criterion report tells you exactly which prompts regressed and on which criteria. From there you have three choices per prompt: keep the old model for it (fallback chain), tune the prompt for the new model, or accept the trade-off if the regression is on a criterion you do not care about. The eval harness turns a model upgrade from a vibes decision into a data decision.
What's the minimum eval coverage to start blocking merges?
Pick the three to five behaviors that would be a real incident if they regressed — the JSON contract, the brand voice, the refusal behavior, the citation format, whatever your top-of-mind concerns are. Write 10-20 cases per behavior. That is enough to gate PRs and start catching the obvious regressions. Expand from there as bugs ship — every production incident becomes a new test case, and the suite grows under real pressure instead of speculative coverage.
Where should the eval runner live in the CI pipeline?
After unit tests and before deploy. Treat it like an integration test: it can be slow (minutes, not seconds), it talks to a real model API, and it blocks merge on failure. Run it on every PR that touches the prompts directory or the calling code, and run the full suite nightly against main so you catch model-side drift even when your code is unchanged. The nightly run is the one that catches 'the provider quietly updated the model and now our prompts behave differently'.