Playbooks · March 26, 2026 · By Avidan Nadav

How to Build Your First Eval System in a Week

Everyone says you need evals. Almost nobody tells you how to start. Here's a concrete five-day path from zero to a working eval you can run on every release.


"You need evals" is the most repeated and least actionable advice in AI product right now. Everyone nods. Almost nobody tells you what to actually do on Monday morning.

So here's the Monday-morning version. By Friday you'll have a working eval system: a fixed set of real cases, a way to score them, and a single number that tells you whether your feature got better or worse since the last release. Not a perfect system. A real one — which beats a perfect one you never get around to building.

ℹ️ Info

An eval, minus the jargon, is three things: a set of real inputs, a way to judge each output, and a score you can track over time. That's it. Everything fancier is just an upgrade to one of those three parts.

The mental model: evals are TDD for AI

If you've ever written a test before writing the code, you already understand evals. Test-driven development gave software a superpower: a regression suite that screams the moment you break something that used to work. For twenty years, that safety net has been a big part of why teams could move fast without setting fire to production.

AI features tore a hole in that net, because "correct" is now fuzzy and probabilistic. Evals are how you patch it. Think of them as the regression tests for behavior you can't pin down with an assertEquals. The discipline that made TDD work — write the check first, run it on every change, treat a drop as a real failure — is the same discipline that makes evals work. The eval crowd online likes to put it bluntly: "vibes don't scale."

Day 1: Collect 20 real cases (your golden set)

Do not invent test cases. Go pull 20 real inputs your feature has seen or will see — actual customer emails, actual queries, actual documents. Real data fails in ways your imagination politely won't.

Bias the set on purpose: roughly half "normal" cases, half nasty ones, pulled straight from your failure-mode map if you've done one. An eval that's all happy-path inputs will reassure you right up until the moment it shouldn't.

Store it in the dumbest format that works. A spreadsheet, one row per case, is completely fine for week one. "Golden set" just means the trusted reference cases you grade against — borrowed from ML, no magic.
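If the spreadsheet is exported as CSV, loading it takes only a few lines. This is a minimal sketch, not a prescribed format: the column names (`case_id`, `kind`, `input_text`), the `Case` class, and the two sample rows are placeholders invented for illustration. Real rows should come from real traffic.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Case:
    case_id: str
    kind: str        # "normal" or "nasty" (hypothetical labels)
    input_text: str  # the real input, copied verbatim


# Stand-in for a spreadsheet exported as CSV. These rows are made up.
SAMPLE_CSV = """case_id,kind,input_text
c01,normal,"Hi, can I move my appointment to next Tuesday?"
c02,nasty,"refund??? ordered twice by accident, also wrong address"
"""


def load_golden_set(csv_text: str) -> list[Case]:
    """Parse CSV text into a list of cases, one per row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [Case(row["case_id"], row["kind"], row["input_text"]) for row in reader]


def kind_counts(cases: list[Case]) -> dict[str, int]:
    """Count cases per kind, to check the normal/nasty split."""
    counts: dict[str, int] = {}
    for case in cases:
        counts[case.kind] = counts.get(case.kind, 0) + 1
    return counts


golden_set = load_golden_set(SAMPLE_CSV)
print(kind_counts(golden_set))
```

Counting by `kind` is a quick way to confirm the set really is close to half normal, half nasty.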

Day 2: Write the answer key and the rubric

For each case, write down what a good output looks like. Two flavors:

  • Cases with a right answer (did it extract the correct date? yes/no). Easy — write the expected answer.
  • Cases with no single right answer (is this draft reply good?). Write a short rubric instead: three to five checks like "accurate to policy," "right tone," "invents no facts." Pass = clears all of them.

⚠️ Warning

Write the rubric before you look at the model's outputs. Grade first and write the rubric after, and you'll unconsciously write a rubric the model already passes. That's not evaluation. That's drawing the target around the arrow after it lands.
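One way to stay honest is to store each rubric as a fixed checklist and record a yes/no per check while grading. The sketch below assumes that setup. `RubricCase` and its field names are hypothetical, and the three checks are the examples from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class RubricCase:
    case_id: str
    checks: list[str]  # written before looking at any model output
    results: dict[str, bool] = field(default_factory=dict)  # filled in while grading

    def passed(self) -> bool:
        # Pass = clears all checks. An ungraded check counts as a failure,
        # and a case with no checks at all never passes.
        return bool(self.checks) and all(
            self.results.get(check, False) for check in self.checks
        )


draft_reply = RubricCase(
    case_id="c02",
    checks=["accurate to policy", "right tone", "invents no facts"],
)

# Grades entered by a human reader (made-up values for illustration).
draft_reply.results = {
    "accurate to policy": True,
    "right tone": True,
    "invents no facts": False,
}
print(draft_reply.passed())
```

Treating an ungraded check as a failure is a deliberate choice here: a half-graded case should not quietly count as a pass.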

Day 3: Pick how you grade

Three options, cheapest first. Use the cheapest one that fits each case.

  • Exact / programmatic. The answer is checkable in code — a date, a number, a valid JSON shape. Fastest, most reliable, use it everywhere you can.
  • Human. You read the output and score it against the rubric. Slow, but the gold standard for subjective cases, and totally fine at 20 cases.
  • LLM-as-judge. A second model grades the output against your rubric. Scales beautifully, but the judge gets some grades wrong, so spot-check its grades against your own on a handful of cases before you trust it with the rest.
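For the programmatic option, a grader can be a small function that returns pass or fail. The sketch below covers two of the examples above, a date and a JSON shape. The function names are hypothetical, and it assumes the feature emits dates in ISO format (`YYYY-MM-DD`) and JSON as a single object; adjust the parsing if your outputs look different.

```python
import json
from datetime import date


def grade_date(output: str, expected: date) -> bool:
    """Pass if the output parses as an ISO date equal to the expected one."""
    try:
        return date.fromisoformat(output.strip()) == expected
    except ValueError:
        return False  # unparseable output is a failure, not a crash


def grade_json_shape(output: str, required_keys: set[str]) -> bool:
    """Pass if the output is a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and required_keys.issubset(parsed.keys())


print(grade_date("2026-03-26\n", date(2026, 3, 26)))
print(grade_json_shape('{"name": "Ada", "email": "a@example.com"}', {"name", "email"}))
```

Both graders return `False` on malformed output instead of raising, so one bad case cannot stop the rest of the run.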

For week one: programmatic where possible, human for the rest. Add the LLM judge later, once you trust your rubric enough to trust a machine reading it.
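When you do add the judge, the spot-check can be as simple as comparing its pass/fail grades with yours on the same cases. This is a rough sketch under that assumption: `judge_agreement` is a hypothetical helper, the grades are made up, and calling the judge model itself is left out.

```python
def judge_agreement(
    human: dict[str, bool], judge: dict[str, bool]
) -> tuple[float, list[str]]:
    """Compare grades on the cases both graders scored.

    Returns the fraction of matching grades and the IDs that disagree.
    """
    shared = sorted(human.keys() & judge.keys())
    if not shared:
        raise ValueError("no cases graded by both human and judge")
    disagreements = [cid for cid in shared if human[cid] != judge[cid]]
    matches = len(shared) - len(disagreements)
    return matches / len(shared), disagreements


# Made-up grades for a five-case spot check.
human_grades = {"c01": True, "c02": False, "c03": True, "c04": True, "c05": False}
judge_grades = {"c01": True, "c02": True, "c03": True, "c04": True, "c05": False}

rate, disputed = judge_agreement(human_grades, judge_grades)
print(f"agreement: {rate:.0%}, re-read these: {disputed}")
```

The list of disputed cases matters more than the rate. Reading them usually shows whether the judge misread the rubric or the rubric itself was ambiguous.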

Day 4: Run the baseline

Run all 20 cases through the current version and score every one. Add it up. That number — say, 14 out of 20 — is your baseline.
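Adding it up is a one-liner once each case has a pass/fail grade. A small sketch, with a hypothetical `summarize` helper and arbitrary made-up case IDs chosen to match the 14-out-of-20 example:

```python
def summarize(results: dict[str, bool]) -> tuple[int, int, list[str]]:
    """Return (passed, total, failing case IDs) for one eval run."""
    failures = sorted(case_id for case_id, ok in results.items() if not ok)
    return len(results) - len(failures), len(results), failures


# Hypothetical grades for 20 cases; the six failing IDs are arbitrary.
case_ids = [f"c{n:02d}" for n in range(1, 21)]
failed_ids = {"c03", "c07", "c08", "c12", "c15", "c19"}
results = {case_id: case_id not in failed_ids for case_id in case_ids}

passed, total, failures = summarize(results)
print(f"baseline: {passed}/{total}")
print(f"failing: {failures}")
```

Keeping the failing IDs next to the score means the baseline records which cases failed, not only how many.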

This is the moment the whole thing pays off. For the first time, you have an objective statement about how good your feature is. Not "it feels solid." Fourteen out of twenty, with the six failures named and waiting.

A score only means something against a bar, though. Deciding what number is good enough is its own small craft, and it's worth doing on purpose — I wrote up how to set your MVQ thresholds for exactly that.

Day 5: Wire it into your release

An eval you run once is a report. An eval you run every release is a system. That difference is the entire value.

Make the rule simple and boring: before any change to the feature ships, run the eval. Score drops? You've got a regression, and you treat it like a broken test — not a vibe, not a debate. Even if "running it" this week means manually re-scoring the spreadsheet over coffee, the discipline is the point. Automate it once it's earned the trust.
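Once the grades live somewhere a script can read, the rule can be sketched as a comparison between the baseline run and the current one. The `check_release` name and the sample grades are assumptions for illustration. The newly-failing list is an extra beyond the score rule described above.

```python
def check_release(
    baseline: dict[str, bool], current: dict[str, bool]
) -> tuple[bool, list[str]]:
    """Compare the current run with the baseline.

    Returns (score_dropped, case IDs that passed before and do not pass now).
    A case missing from the current run is treated as not passing.
    """
    score_dropped = sum(current.values()) < sum(baseline.values())
    newly_failing = sorted(
        case_id
        for case_id, ok in baseline.items()
        if ok and not current.get(case_id, False)
    )
    return score_dropped, newly_failing


# Made-up grades for three cases across two runs.
baseline = {"c01": True, "c02": True, "c03": False}
current = {"c01": True, "c02": False, "c03": False}

score_dropped, newly_failing = check_release(baseline, current)
if score_dropped:
    print(f"regression: score dropped, newly failing: {newly_failing}")
```

In an automated setup, a script like this would exit with a nonzero status when `score_dropped` is true so the release step stops. Listing newly failing cases also catches a quieter problem: the total can stay flat while one case breaks and another gets fixed.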


The artifact you keep

A sheet with 20 real cases, an expected answer or rubric for each, a grading method, and a running column of scores by date. That's your eval system. It looks humble. It will also save you from shipping a regression in front of a customer, which is one of the most expensive mistakes an AI team can make.
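If the running column of scores ever moves out of the spreadsheet, it can be an append-only CSV log. A minimal sketch: the column names and the `append_score` helper are hypothetical, the date and score are made up, and an in-memory buffer stands in for a real file.

```python
import csv
import io
from datetime import date


def append_score(log, run_date: date, passed: int, total: int) -> None:
    """Append one eval run to the log as a CSV row."""
    csv.writer(log).writerow([run_date.isoformat(), passed, total])


# In-memory stand-in for a file opened in append mode.
log = io.StringIO()
csv.writer(log).writerow(["date", "passed", "total"])
append_score(log, date(2026, 3, 30), 14, 20)

print(log.getvalue())
```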

Start with 20 cases this week. Grow it to 200 once it's earned its place. Most teams are still arguing about whether they need evals. You'll have one running by Friday.