[{"data":1,"prerenderedAt":1521},["ShallowReactive",2],{"docs:\u002Fevals":3},{"id":4,"title":5,"accent":6,"body":7,"description":1486,"estReadTime":1487,"extension":1488,"eyebrow":1489,"icon":1490,"intro":1491,"lastUpdated":1491,"meta":1492,"navigation":242,"next":1493,"path":1496,"prev":1497,"review":1491,"seo":1500,"stem":6,"tocItems":1512,"__hash__":1520},"docs\u002Fevals.md","Evals","evals",{"type":8,"value":9,"toc":1463},"minimark",[10,19,92,209,527,825,947,1310,1459],[11,12,13,14,18],"p",{},"Every other section of this guide is about making Claude ",[15,16,17],"em",{},"do"," something. This one is about knowing whether what it did was any good — and catching the moment it stops being good. If you're building anything that customers touch, that runs on a schedule, or that you'd be embarrassed to ship broken, evals stop being optional. They're the difference between \"the demo worked once\" and \"this ships on Fridays.\"",[20,21,24,31,34,42,47,50,89],"docs-section",{"id":22,"title":23},"why","Why Evals Matter",[11,25,26,27,30],{},"Traditional tests assume determinism. Same input, same output, pass or fail. Agents don't work like that. Same prompt, same model, same tools — you'll get three slightly different answers across three runs, and ",[15,28,29],{},"all three can be acceptable",". Unit tests can't express \"acceptable.\" Evals can.",[11,32,33],{},"The failure mode evals catch isn't \"does it crash.\" It's \"does it still do the thing I hired it for.\" A refactoring subagent that used to produce clean diffs and now sprinkles console.logs doesn't throw an error. CI is green. Production is quietly getting worse. Only an eval suite notices.",[35,36,39],"docs-callout",{"title":37,"variant":38},"The silent regression","warning",[11,40,41],{},"Prompt changes are the most common cause. You edit a system prompt to fix one edge case; three other cases you weren't watching get 5% worse. 
Without evals, you'll ship the trade and only learn about it from user complaints — at which point you won't remember which prompt change caused it.",[43,44,46],"h3",{"id":45},"what-evals-actually-measure","What evals actually measure",[11,48,49],{},"Four axes cover most of what you'll care about:",[51,52,53,61,67,83],"ul",{},[54,55,56,60],"li",{},[57,58,59],"strong",{},"Correctness"," — did it produce the right answer? (Deterministic check: compare to a known-good output.)",[54,62,63,66],{},[57,64,65],{},"Adherence"," — did it follow the rules? (Heuristic check: \"did it cite sources,\" \"did it stay under 500 tokens,\" \"did it refuse the off-topic prompt.\")",[54,68,69,72,73,76,77,82],{},[57,70,71],{},"Quality"," — is the output ",[15,74,75],{},"good","? (Subjective, best handled by ",[78,79,81],"a",{"href":80},"#judge","LLM-as-judge",".)",[54,84,85,88],{},[57,86,87],{},"Cost & latency"," — did it do this efficiently? (Metric check: tokens in\u002Fout, wall time.)",[11,90,91],{},"A full eval suite touches all four. Most teams start with correctness and adherence, add quality once a judge is trustworthy, and layer cost\u002Flatency in last.",[20,93,96,107,111,135,139,176,179,183,186,196,202],{"id":94,"title":95},"golden","Golden Datasets",[11,97,98,99,102,103,106],{},"A ",[57,100,101],{},"golden dataset"," is a set of inputs paired with outputs (or rubrics) you trust. It's the ground truth your agent gets graded against. The single biggest mistake people make is treating this like a massive research undertaking — \"we need 10,000 labelled examples.\" You don't. You need ",[57,104,105],{},"50 examples you actually trust",", and you should build them in an afternoon.",[43,108,110],{"id":109},"where-the-examples-come-from","Where the examples come from",[51,112,113,123,129],{},[54,114,115,118,119,122],{},[57,116,117],{},"Real production logs."," Export 200 recent runs, sample 50, hand-label what the right output ",[15,120,121],{},"should have been",". 
These are the most valuable examples you'll ever write because they reflect real distribution.",[54,124,125,128],{},[57,126,127],{},"Bug reports."," Every time Claude screws up a real task, the input + the correct output goes straight into the golden set. Your eval suite now regression-tests every bug you've ever fixed.",[54,130,131,134],{},[57,132,133],{},"Edge cases you invented."," \"What if the user types in all caps?\" \"What if the input is empty?\" Useful, but don't let them outnumber production-derived cases.",[43,136,138],{"id":137},"the-minimum-viable-schema","The minimum-viable schema",[140,141,147],"pre",{"className":142,"code":143,"filename":144,"language":145,"meta":146,"style":146},"language-jsonl shiki shiki-themes github-light","{\"id\":\"001\",\"input\":\"cancel my subscription\",\"expected\":{\"intent\":\"cancel\"},\"notes\":\"clean case\"}\n{\"id\":\"002\",\"input\":\"change my card\",\"expected\":{\"intent\":\"update_payment\"},\"notes\":\"from bug #412\"}\n{\"id\":\"003\",\"input\":\"i want out\",\"expected\":{\"intent\":\"cancel\"},\"notes\":\"ambiguous phrasing\"}\n{\"id\":\"004\",\"input\":\"UPGRADE ME PLEASE\",\"expected\":{\"intent\":\"upgrade\"},\"notes\":\"caps edge case\"}\n","evals\u002Fgolden\u002Fclassify-intent.jsonl","jsonl","",[148,149,150,158,164,170],"code",{"__ignoreMap":146},[151,152,155],"span",{"class":153,"line":154},"line",1,[151,156,157],{},"{\"id\":\"001\",\"input\":\"cancel my subscription\",\"expected\":{\"intent\":\"cancel\"},\"notes\":\"clean case\"}\n",[151,159,161],{"class":153,"line":160},2,[151,162,163],{},"{\"id\":\"002\",\"input\":\"change my card\",\"expected\":{\"intent\":\"update_payment\"},\"notes\":\"from bug #412\"}\n",[151,165,167],{"class":153,"line":166},3,[151,168,169],{},"{\"id\":\"003\",\"input\":\"i want out\",\"expected\":{\"intent\":\"cancel\"},\"notes\":\"ambiguous phrasing\"}\n",[151,171,173],{"class":153,"line":172},4,[151,174,175],{},"{\"id\":\"004\",\"input\":\"UPGRADE ME 
PLEASE\",\"expected\":{\"intent\":\"upgrade\"},\"notes\":\"caps edge case\"}\n",[11,177,178],{},"JSONL (one JSON object per line) is the right format: streamable, diff-friendly, every line is valid on its own. Check it into the repo next to the code it tests.",[43,180,182],{"id":181},"when-expected-is-a-rubric-not-a-value","When \"expected\" is a rubric, not a value",[11,184,185],{},"For open-ended tasks (code review, summarization, planning), you can't write the literal right answer. Instead, store the criteria:",[140,187,190],{"className":142,"code":188,"filename":189,"language":145,"meta":146,"style":146},"{\"id\":\"r01\",\"input\":{\"diff_file\":\"fixtures\u002Fr01.diff\"},\"rubric\":[\"must flag the SQL injection\",\"must not flag the unrelated logging change\",\"verdict must be one of LGTM|NeedsChanges\"]}\n","evals\u002Fgolden\u002Fcode-review.jsonl",[148,191,192],{"__ignoreMap":146},[151,193,194],{"class":153,"line":154},[151,195,188],{},[11,197,198,199,201],{},"The rubric becomes the prompt to your ",[78,200,81],{"href":80},". Now your golden set is small, high-signal, and grows naturally over time.",[35,203,206],{"title":204,"variant":205},"Start with 20 examples, not 200","tip",[11,207,208],{},"Twenty hand-curated examples catch most regressions. The marginal example above 50 buys you less than the time it takes to label. Grow the set only when you find a failure class the current set doesn't cover.",[20,210,213,216,220,459,462,466,486,490,520],{"id":211,"title":212},"judge","LLM-as-Judge",[11,214,215],{},"For anything subjective — did the summary capture the key points, did the review catch the real issues, did the refactor stay faithful to intent — you need a judge. 
The judge is another Claude call whose only job is to grade the output under test against a rubric.",[43,217,219],{"id":218},"the-minimal-judge","The minimal judge",[140,221,226],{"className":222,"code":223,"filename":224,"language":225,"meta":146,"style":146},"language-python shiki shiki-themes github-light","import anthropic\nclient = anthropic.Anthropic()\n\nJUDGE_PROMPT = \"\"\"You are a strict grader. Given the task, the rubric, and\nthe candidate output, return a JSON object:\n\n{\n  \"pass\": true | false,\n  \"score\": 0..10,\n  \"reasoning\": \"\u003Cone-sentence>\",\n  \"failed_criteria\": [\"\u003Crubric item>\", ...]\n}\n\nDo not wrap the JSON in prose. Output only the JSON.\n\nTASK:\n{task}\n\nRUBRIC:\n{rubric}\n\nCANDIDATE OUTPUT:\n{output}\n\"\"\"\n\ndef judge(task: str, rubric: list[str], output: str) -> dict:\n    msg = client.messages.create(\n        model=\"claude-haiku-4-5-20251001\",\n        max_tokens=500,\n        messages=[{\n            \"role\": \"user\",\n            \"content\": JUDGE_PROMPT.format(\n                task=task,\n                rubric=\"\\n\".join(f\"- {r}\" for r in rubric),\n                output=output,\n            ),\n        }],\n    )\n    import json\n    return json.loads(msg.content[0].text)\n","evals\u002Fjudge.py","python",[148,227,228,233,238,244,249,255,260,266,272,278,284,290,296,301,307,312,318,324,329,335,341,346,352,358,364,369,375,381,387,393,399,405,411,417,423,429,435,441,447,453],{"__ignoreMap":146},[151,229,230],{"class":153,"line":154},[151,231,232],{},"import anthropic\n",[151,234,235],{"class":153,"line":160},[151,236,237],{},"client = anthropic.Anthropic()\n",[151,239,240],{"class":153,"line":166},[151,241,243],{"emptyLinePlaceholder":242},true,"\n",[151,245,246],{"class":153,"line":172},[151,247,248],{},"JUDGE_PROMPT = \"\"\"You are a strict grader. 
Given the task, the rubric, and\n",[151,250,252],{"class":153,"line":251},5,[151,253,254],{},"the candidate output, return a JSON object:\n",[151,256,258],{"class":153,"line":257},6,[151,259,243],{"emptyLinePlaceholder":242},[151,261,263],{"class":153,"line":262},7,[151,264,265],{},"{\n",[151,267,269],{"class":153,"line":268},8,[151,270,271],{},"  \"pass\": true | false,\n",[151,273,275],{"class":153,"line":274},9,[151,276,277],{},"  \"score\": 0..10,\n",[151,279,281],{"class":153,"line":280},10,[151,282,283],{},"  \"reasoning\": \"\u003Cone-sentence>\",\n",[151,285,287],{"class":153,"line":286},11,[151,288,289],{},"  \"failed_criteria\": [\"\u003Crubric item>\", ...]\n",[151,291,293],{"class":153,"line":292},12,[151,294,295],{},"}\n",[151,297,299],{"class":153,"line":298},13,[151,300,243],{"emptyLinePlaceholder":242},[151,302,304],{"class":153,"line":303},14,[151,305,306],{},"Do not wrap the JSON in prose. Output only the JSON.\n",[151,308,310],{"class":153,"line":309},15,[151,311,243],{"emptyLinePlaceholder":242},[151,313,315],{"class":153,"line":314},16,[151,316,317],{},"TASK:\n",[151,319,321],{"class":153,"line":320},17,[151,322,323],{},"{task}\n",[151,325,327],{"class":153,"line":326},18,[151,328,243],{"emptyLinePlaceholder":242},[151,330,332],{"class":153,"line":331},19,[151,333,334],{},"RUBRIC:\n",[151,336,338],{"class":153,"line":337},20,[151,339,340],{},"{rubric}\n",[151,342,344],{"class":153,"line":343},21,[151,345,243],{"emptyLinePlaceholder":242},[151,347,349],{"class":153,"line":348},22,[151,350,351],{},"CANDIDATE OUTPUT:\n",[151,353,355],{"class":153,"line":354},23,[151,356,357],{},"{output}\n",[151,359,361],{"class":153,"line":360},24,[151,362,363],{},"\"\"\"\n",[151,365,367],{"class":153,"line":366},25,[151,368,243],{"emptyLinePlaceholder":242},[151,370,372],{"class":153,"line":371},26,[151,373,374],{},"def judge(task: str, rubric: list[str], output: str) -> dict:\n",[151,376,378],{"class":153,"line":377},27,[151,379,380],{},"    msg = 
client.messages.create(\n",[151,382,384],{"class":153,"line":383},28,[151,385,386],{},"        model=\"claude-haiku-4-5-20251001\",\n",[151,388,390],{"class":153,"line":389},29,[151,391,392],{},"        max_tokens=500,\n",[151,394,396],{"class":153,"line":395},30,[151,397,398],{},"        messages=[{\n",[151,400,402],{"class":153,"line":401},31,[151,403,404],{},"            \"role\": \"user\",\n",[151,406,408],{"class":153,"line":407},32,[151,409,410],{},"            \"content\": JUDGE_PROMPT.format(\n",[151,412,414],{"class":153,"line":413},33,[151,415,416],{},"                task=task,\n",[151,418,420],{"class":153,"line":419},34,[151,421,422],{},"                rubric=\"\\n\".join(f\"- {r}\" for r in rubric),\n",[151,424,426],{"class":153,"line":425},35,[151,427,428],{},"                output=output,\n",[151,430,432],{"class":153,"line":431},36,[151,433,434],{},"            ),\n",[151,436,438],{"class":153,"line":437},37,[151,439,440],{},"        }],\n",[151,442,444],{"class":153,"line":443},38,[151,445,446],{},"    )\n",[151,448,450],{"class":153,"line":449},39,[151,451,452],{},"    import json\n",[151,454,456],{"class":153,"line":455},40,[151,457,458],{},"    return json.loads(msg.content[0].text)\n",[11,460,461],{},"Haiku is plenty for most judging — the judge isn't writing, it's matching output against a rubric. Use Sonnet only when the subject matter is technical enough that Haiku's mistakes dominate your signal.",[43,463,465],{"id":464},"three-patterns-that-actually-work","Three patterns that actually work",[51,467,468,474,480],{},[54,469,470,473],{},[57,471,472],{},"Pointwise grading"," — \"does this output meet the rubric, yes\u002Fno, why?\" Simple, cheap, noisy on its own. Run each example 3× and vote.",[54,475,476,479],{},[57,477,478],{},"Pairwise comparison"," — \"given two candidate outputs, which is better and why?\" Lower variance than pointwise, great for A\u002FB-ing prompt changes. 
Slightly more expensive.",[54,481,482,485],{},[57,483,484],{},"Reference comparison"," — \"how does the candidate compare to this known-good reference?\" Best for cases where you have a golden output and want to measure drift.",[43,487,489],{"id":488},"the-pitfalls","The pitfalls",[51,491,492,498,504,510],{},[54,493,494,497],{},[57,495,496],{},"Verbosity bias."," Judges default to preferring longer, more detailed answers even when concise is correct. Mitigate: explicitly tell the judge \"prefer concise answers that meet the rubric; do not reward extra detail.\"",[54,499,500,503],{},[57,501,502],{},"Position bias."," In pairwise, the first option wins more often than chance. Mitigate: flip the order on half your runs and average.",[54,505,506,509],{},[57,507,508],{},"Same-family blindness."," A Claude judge can be lenient on Claude-flavored mistakes (hedging, preamble). Mitigate: vary judge models across runs, or add explicit rubric items for the failure modes you know.",[54,511,512,515,516,519],{},[57,513,514],{},"Calibration drift."," A judge's \"7\u002F10\" in March isn't the same as \"7\u002F10\" in June — model updates shift the scale. Always re-score the ",[15,517,518],{},"baseline"," every run; compare deltas, not absolute scores.",[35,521,524],{"title":522,"variant":523},"Trust the judge only after it agrees with you","info",[11,525,526],{},"Before you rely on an LLM-as-judge, hand-grade 20 examples yourself. Run the judge on the same 20. If it agrees with you 90%+, ship it. If it's below 80%, fix the rubric — don't just average over more runs. A bad rubric doesn't improve with scale.",[20,528,531,534,538,723,726,730,737,763,767,774,819],{"id":529,"title":530},"regression","Regression Suites",[11,532,533],{},"A regression suite is a golden dataset + a runner + a baseline. On every prompt change, subagent edit, or model bump, you replay the whole set and compare scores to the baseline. 
If anything drops past a threshold, the change is blocked.",[43,535,537],{"id":536},"the-simplest-runner-that-works","The simplest runner that works",[140,539,542],{"className":222,"code":540,"filename":541,"language":225,"meta":146,"style":146},"import json, pathlib, statistics\nfrom judge import judge\nfrom your_agent import run_agent   # the thing you're evaluating\n\nDATASET = pathlib.Path(\"evals\u002Fgolden\u002Fcode-review.jsonl\")\nBASELINE = pathlib.Path(\"evals\u002Fbaseline.json\")\n\ndef main():\n    results = []\n    for line in DATASET.read_text().splitlines():\n        example = json.loads(line)\n        output = run_agent(example[\"input\"])\n        verdict = judge(\n            task=\"review the diff and return LGTM or NeedsChanges\",\n            rubric=example[\"rubric\"],\n            output=output,\n        )\n        results.append({\n            \"id\": example[\"id\"],\n            \"pass\": verdict[\"pass\"],\n            \"score\": verdict[\"score\"],\n            \"failed_criteria\": verdict[\"failed_criteria\"],\n        })\n\n    pass_rate = sum(1 for r in results if r[\"pass\"]) \u002F len(results)\n    avg_score = statistics.mean(r[\"score\"] for r in results)\n\n    baseline = json.loads(BASELINE.read_text())\n    print(f\"pass rate: {pass_rate:.2%} (baseline {baseline['pass_rate']:.2%})\")\n    print(f\"avg score: {avg_score:.2f} (baseline {baseline['avg_score']:.2f})\")\n\n    # Fail the run if we regressed more than 3pp on pass rate\n    if pass_rate \u003C baseline[\"pass_rate\"] - 0.03:\n        raise SystemExit(1)\n\nif __name__ == \"__main__\":\n    main()\n","evals\u002Frun.py",[148,543,544,549,554,559,563,568,573,577,582,587,592,597,602,607,612,617,622,627,632,637,642,647,652,657,661,666,671,675,680,685,690,694,699,704,709,713,718],{"__ignoreMap":146},[151,545,546],{"class":153,"line":154},[151,547,548],{},"import json, pathlib, statistics\n",[151,550,551],{"class":153,"line":160},[151,552,553],{},"from judge import 
judge\n",[151,555,556],{"class":153,"line":166},[151,557,558],{},"from your_agent import run_agent   # the thing you're evaluating\n",[151,560,561],{"class":153,"line":172},[151,562,243],{"emptyLinePlaceholder":242},[151,564,565],{"class":153,"line":251},[151,566,567],{},"DATASET = pathlib.Path(\"evals\u002Fgolden\u002Fcode-review.jsonl\")\n",[151,569,570],{"class":153,"line":257},[151,571,572],{},"BASELINE = pathlib.Path(\"evals\u002Fbaseline.json\")\n",[151,574,575],{"class":153,"line":262},[151,576,243],{"emptyLinePlaceholder":242},[151,578,579],{"class":153,"line":268},[151,580,581],{},"def main():\n",[151,583,584],{"class":153,"line":274},[151,585,586],{},"    results = []\n",[151,588,589],{"class":153,"line":280},[151,590,591],{},"    for line in DATASET.read_text().splitlines():\n",[151,593,594],{"class":153,"line":286},[151,595,596],{},"        example = json.loads(line)\n",[151,598,599],{"class":153,"line":292},[151,600,601],{},"        output = run_agent(example[\"input\"])\n",[151,603,604],{"class":153,"line":298},[151,605,606],{},"        verdict = judge(\n",[151,608,609],{"class":153,"line":303},[151,610,611],{},"            task=\"review the diff and return LGTM or NeedsChanges\",\n",[151,613,614],{"class":153,"line":309},[151,615,616],{},"            rubric=example[\"rubric\"],\n",[151,618,619],{"class":153,"line":314},[151,620,621],{},"            output=output,\n",[151,623,624],{"class":153,"line":320},[151,625,626],{},"        )\n",[151,628,629],{"class":153,"line":326},[151,630,631],{},"        results.append({\n",[151,633,634],{"class":153,"line":331},[151,635,636],{},"            \"id\": example[\"id\"],\n",[151,638,639],{"class":153,"line":337},[151,640,641],{},"            \"pass\": verdict[\"pass\"],\n",[151,643,644],{"class":153,"line":343},[151,645,646],{},"            \"score\": verdict[\"score\"],\n",[151,648,649],{"class":153,"line":348},[151,650,651],{},"            \"failed_criteria\": 
verdict[\"failed_criteria\"],\n",[151,653,654],{"class":153,"line":354},[151,655,656],{},"        })\n",[151,658,659],{"class":153,"line":360},[151,660,243],{"emptyLinePlaceholder":242},[151,662,663],{"class":153,"line":366},[151,664,665],{},"    pass_rate = sum(1 for r in results if r[\"pass\"]) \u002F len(results)\n",[151,667,668],{"class":153,"line":371},[151,669,670],{},"    avg_score = statistics.mean(r[\"score\"] for r in results)\n",[151,672,673],{"class":153,"line":377},[151,674,243],{"emptyLinePlaceholder":242},[151,676,677],{"class":153,"line":383},[151,678,679],{},"    baseline = json.loads(BASELINE.read_text())\n",[151,681,682],{"class":153,"line":389},[151,683,684],{},"    print(f\"pass rate: {pass_rate:.2%} (baseline {baseline['pass_rate']:.2%})\")\n",[151,686,687],{"class":153,"line":395},[151,688,689],{},"    print(f\"avg score: {avg_score:.2f} (baseline {baseline['avg_score']:.2f})\")\n",[151,691,692],{"class":153,"line":401},[151,693,243],{"emptyLinePlaceholder":242},[151,695,696],{"class":153,"line":407},[151,697,698],{},"    # Fail the run if we regressed more than 3pp on pass rate\n",[151,700,701],{"class":153,"line":413},[151,702,703],{},"    if pass_rate \u003C baseline[\"pass_rate\"] - 0.03:\n",[151,705,706],{"class":153,"line":419},[151,707,708],{},"        raise SystemExit(1)\n",[151,710,711],{"class":153,"line":425},[151,712,243],{"emptyLinePlaceholder":242},[151,714,715],{"class":153,"line":431},[151,716,717],{},"if __name__ == \"__main__\":\n",[151,719,720],{"class":153,"line":437},[151,721,722],{},"    main()\n",[11,724,725],{},"Fewer than 50 lines of Python handles 80% of what a \"framework\" would give you. Reach for a framework only when you've outgrown this.",[43,727,729],{"id":728},"non-determinism-noise-vs-signal","Non-determinism: noise vs signal",[11,731,732,733,736],{},"Run every example ",[57,734,735],{},"N"," times (N=3 is a reasonable default). Take the majority verdict. 
A single-run failure on a flaky example isn't a regression — a consistent failure on three of three is. This is the single most important rule in regression suites, and skipping it will make your CI scream at you every week for no reason.",[140,738,741],{"className":222,"code":739,"filename":740,"language":225,"meta":146,"style":146},"def run_with_votes(example, n=3):\n    verdicts = [judge(...) for _ in range(n)]\n    passes = sum(1 for v in verdicts if v[\"pass\"])\n    return passes >= (n \u002F\u002F 2 + 1)  # majority\n","evals\u002Frun.py — with N reruns",[148,742,743,748,753,758],{"__ignoreMap":146},[151,744,745],{"class":153,"line":154},[151,746,747],{},"def run_with_votes(example, n=3):\n",[151,749,750],{"class":153,"line":160},[151,751,752],{},"    verdicts = [judge(...) for _ in range(n)]\n",[151,754,755],{"class":153,"line":166},[151,756,757],{},"    passes = sum(1 for v in verdicts if v[\"pass\"])\n",[151,759,760],{"class":153,"line":172},[151,761,762],{},"    return passes >= (n \u002F\u002F 2 + 1)  # majority\n",[43,764,766],{"id":765},"managing-the-baseline","Managing the baseline",[11,768,769,770,773],{},"The ",[148,771,772],{},"baseline.json"," file is checked into the repo. You regenerate it deliberately — not on every commit, but when you accept a new scoring regime (new rubric, new judge model, intended improvement). 
The regeneration is a PR of its own, reviewed like any other.",[140,775,780],{"className":776,"code":777,"filename":778,"language":779,"meta":146,"style":146},"language-bash shiki shiki-themes github-light","python evals\u002Frun.py --save-baseline\ngit add evals\u002Fbaseline.json\ngit commit -m \"evals: rebaseline after rubric v2\"\n","regenerate the baseline","bash",[148,781,782,795,806],{"__ignoreMap":146},[151,783,784,787,791],{"class":153,"line":154},[151,785,225],{"class":786},"s7eDp",[151,788,790],{"class":789},"sYBdl"," evals\u002Frun.py",[151,792,794],{"class":793},"sYu0t"," --save-baseline\n",[151,796,797,800,803],{"class":153,"line":160},[151,798,799],{"class":786},"git",[151,801,802],{"class":789}," add",[151,804,805],{"class":789}," evals\u002Fbaseline.json\n",[151,807,808,810,813,816],{"class":153,"line":166},[151,809,799],{"class":786},[151,811,812],{"class":789}," commit",[151,814,815],{"class":793}," -m",[151,817,818],{"class":789}," \"evals: rebaseline after rubric v2\"\n",[35,820,822],{"title":821,"variant":38},"Don't rebaseline to hide a regression",[11,823,824],{},"The temptation is real: the eval drops 4pp, the deadline is today, you rebaseline and ship. Do this twice and your suite stops being a regression suite — it's just a rubber stamp. Require a one-paragraph justification in every rebaseline commit. Future-you will thank you.",[20,826,829,832,836,839,842,845,849,858,862,871,930,937],{"id":827,"title":828},"sdk","Anthropic's Eval Tooling",[11,830,831],{},"Anthropic publishes two things that slot into this workflow, plus one you can build on.",[43,833,835],{"id":834},"the-console-evaluation-feature","The Console evaluation feature",[11,837,838],{},"The Anthropic Console has a built-in eval workbench — upload a dataset, define a prompt, run variants, compare scores side-by-side. Good for prompt iteration before you've written any code. 
Less good once your eval is 200 examples and needs to run on every PR: at that point you want the logic in your repo, not a web UI.",[11,840,841],{},"Use the Console for: exploring new prompts, showing a PM what a change does, teaching a teammate the eval pattern.",[11,843,844],{},"Use your own runner for: CI gates, long-running suites, anything that needs to live next to your code.",[43,846,848],{"id":847},"the-cookbook-patterns","The cookbook patterns",[11,850,769,851,857],{},[78,852,856],{"href":853,"rel":854},"https:\u002F\u002Fgithub.com\u002Fanthropics\u002Fanthropic-cookbook",[855],"nofollow","anthropic-cookbook"," repo has working examples for LLM-as-judge, pairwise comparison, and multi-criteria rubrics. These are reference implementations, not a library — copy the pattern into your repo and adapt it. Fighting a framework-ified version of this is rarely worth it.",[43,859,861],{"id":860},"the-agent-sdk-for-agent-level-evals","The Agent SDK for agent-level evals",[11,863,864,865,870],{},"If what you're evaluating is a full agent (tools, loops, sub-agents), the ",[78,866,869],{"href":867,"rel":868},"https:\u002F\u002Fdocs.claude.com",[855],"Claude Agent SDK"," lets you script end-to-end runs in Python or TypeScript. 
Your eval runner becomes: spin up an Agent SDK instance, feed it the input, let it run to completion, score the final output (and optionally the trace).",[140,872,875],{"className":222,"code":873,"filename":874,"language":225,"meta":146,"style":146},"from claude_agent_sdk import Agent\n\nasync def run_agent(example):\n    agent = Agent(\n        system_prompt=open(\"prompts\u002Freviewer.md\").read(),\n        tools=[\"Read\", \"Grep\", \"Bash(git diff*)\"],\n        model=\"claude-sonnet-4-6\",\n    )\n    async with agent as a:\n        result = await a.query(example[\"input\"])\n    return result.text\n","evals\u002Fagent_run.py — sketch",[148,876,877,882,886,891,896,901,906,911,915,920,925],{"__ignoreMap":146},[151,878,879],{"class":153,"line":154},[151,880,881],{},"from claude_agent_sdk import Agent\n",[151,883,884],{"class":153,"line":160},[151,885,243],{"emptyLinePlaceholder":242},[151,887,888],{"class":153,"line":166},[151,889,890],{},"async def run_agent(example):\n",[151,892,893],{"class":153,"line":172},[151,894,895],{},"    agent = Agent(\n",[151,897,898],{"class":153,"line":251},[151,899,900],{},"        system_prompt=open(\"prompts\u002Freviewer.md\").read(),\n",[151,902,903],{"class":153,"line":257},[151,904,905],{},"        tools=[\"Read\", \"Grep\", \"Bash(git diff*)\"],\n",[151,907,908],{"class":153,"line":262},[151,909,910],{},"        model=\"claude-sonnet-4-6\",\n",[151,912,913],{"class":153,"line":268},[151,914,446],{},[151,916,917],{"class":153,"line":274},[151,918,919],{},"    async with agent as a:\n",[151,921,922],{"class":153,"line":280},[151,923,924],{},"        result = await a.query(example[\"input\"])\n",[151,926,927],{"class":153,"line":286},[151,928,929],{},"    return result.text\n",[11,931,932,933,936],{},"This is what you reach for when \"run my agent\" is more than a single ",[148,934,935],{},"messages.create"," call.",[35,938,940],{"title":939,"variant":523},"The honest state in 2026",[11,941,942,943,946],{},"Anthropic has not 
shipped an all-in-one \"eval SDK\" as a first-class product — what's available is the Console feature, the cookbook patterns, and the Agent SDK. The 50-line Python runner in ",[78,944,530],{"href":945},"#regression"," plus the Agent SDK for agent-level tests is the combo most teams converge on.",[20,948,951,954,958,1228,1232,1235,1249,1285,1289,1296,1303],{"id":949,"title":950},"ci","Running in CI",[11,952,953],{},"Local eval runs catch obvious regressions. CI eval runs catch the subtle ones — the ones where a teammate's PR looks fine in isolation and quietly drops pass rate by 4pp.",[43,955,957],{"id":956},"a-minimal-github-actions-workflow","A minimal GitHub Actions workflow",[140,959,964],{"className":960,"code":961,"filename":962,"language":963,"meta":146,"style":146},"language-yaml shiki shiki-themes github-light","name: evals\non:\n  pull_request:\n    paths:\n      - \"prompts\u002F**\"\n      - \".claude\u002Fagents\u002F**\"\n      - \"evals\u002F**\"\n      - \"src\u002Fagent\u002F**\"\n  schedule:\n    - cron: \"0 14 * * 1\"  # Mondays, 14:00 UTC — full nightly-style run\n\njobs:\n  run:\n    runs-on: ubuntu-latest\n    timeout-minutes: 20\n    steps:\n      - uses: actions\u002Fcheckout@v4\n      - uses: actions\u002Fsetup-python@v5\n        with: { python-version: \"3.12\" }\n      - run: pip install -r evals\u002Frequirements.txt\n      - run: python evals\u002Frun.py --mode=ci\n        env:\n          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}\n      - uses: actions\u002Fupload-artifact@v4\n        if: always()\n        with:\n          name: eval-report\n          path: evals\u002Freport.json\n",".github\u002Fworkflows\u002Fevals.yml","yaml",[148,965,966,979,987,994,1001,1009,1016,1023,1030,1037,1054,1058,1065,1072,1082,1092,1099,1111,1122,1141,1153,1164,1171,1181,1192,1202,1208,1218],{"__ignoreMap":146},[151,967,968,972,976],{"class":153,"line":154},[151,969,971],{"class":970},"shJU0","name",[151,973,975],{"class":974},"sgsFI",": 
",[151,977,978],{"class":789},"evals\n",[151,980,981,984],{"class":153,"line":160},[151,982,983],{"class":793},"on",[151,985,986],{"class":974},":\n",[151,988,989,992],{"class":153,"line":166},[151,990,991],{"class":970},"  pull_request",[151,993,986],{"class":974},[151,995,996,999],{"class":153,"line":172},[151,997,998],{"class":970},"    paths",[151,1000,986],{"class":974},[151,1002,1003,1006],{"class":153,"line":251},[151,1004,1005],{"class":974},"      - ",[151,1007,1008],{"class":789},"\"prompts\u002F**\"\n",[151,1010,1011,1013],{"class":153,"line":257},[151,1012,1005],{"class":974},[151,1014,1015],{"class":789},"\".claude\u002Fagents\u002F**\"\n",[151,1017,1018,1020],{"class":153,"line":262},[151,1019,1005],{"class":974},[151,1021,1022],{"class":789},"\"evals\u002F**\"\n",[151,1024,1025,1027],{"class":153,"line":268},[151,1026,1005],{"class":974},[151,1028,1029],{"class":789},"\"src\u002Fagent\u002F**\"\n",[151,1031,1032,1035],{"class":153,"line":274},[151,1033,1034],{"class":970},"  schedule",[151,1036,986],{"class":974},[151,1038,1039,1042,1045,1047,1050],{"class":153,"line":280},[151,1040,1041],{"class":974},"    - ",[151,1043,1044],{"class":970},"cron",[151,1046,975],{"class":974},[151,1048,1049],{"class":789},"\"0 14 * * 1\"",[151,1051,1053],{"class":1052},"sAwPA","  # Mondays, 14:00 UTC — full nightly-style run\n",[151,1055,1056],{"class":153,"line":286},[151,1057,243],{"emptyLinePlaceholder":242},[151,1059,1060,1063],{"class":153,"line":292},[151,1061,1062],{"class":970},"jobs",[151,1064,986],{"class":974},[151,1066,1067,1070],{"class":153,"line":298},[151,1068,1069],{"class":970},"  run",[151,1071,986],{"class":974},[151,1073,1074,1077,1079],{"class":153,"line":303},[151,1075,1076],{"class":970},"    runs-on",[151,1078,975],{"class":974},[151,1080,1081],{"class":789},"ubuntu-latest\n",[151,1083,1084,1087,1089],{"class":153,"line":309},[151,1085,1086],{"class":970},"    
timeout-minutes",[151,1088,975],{"class":974},[151,1090,1091],{"class":793},"20\n",[151,1093,1094,1097],{"class":153,"line":314},[151,1095,1096],{"class":970},"    steps",[151,1098,986],{"class":974},[151,1100,1101,1103,1106,1108],{"class":153,"line":320},[151,1102,1005],{"class":974},[151,1104,1105],{"class":970},"uses",[151,1107,975],{"class":974},[151,1109,1110],{"class":789},"actions\u002Fcheckout@v4\n",[151,1112,1113,1115,1117,1119],{"class":153,"line":326},[151,1114,1005],{"class":974},[151,1116,1105],{"class":970},[151,1118,975],{"class":974},[151,1120,1121],{"class":789},"actions\u002Fsetup-python@v5\n",[151,1123,1124,1127,1130,1133,1135,1138],{"class":153,"line":331},[151,1125,1126],{"class":970},"        with",[151,1128,1129],{"class":974},": { ",[151,1131,1132],{"class":970},"python-version",[151,1134,975],{"class":974},[151,1136,1137],{"class":789},"\"3.12\"",[151,1139,1140],{"class":974}," }\n",[151,1142,1143,1145,1148,1150],{"class":153,"line":337},[151,1144,1005],{"class":974},[151,1146,1147],{"class":970},"run",[151,1149,975],{"class":974},[151,1151,1152],{"class":789},"pip install -r evals\u002Frequirements.txt\n",[151,1154,1155,1157,1159,1161],{"class":153,"line":343},[151,1156,1005],{"class":974},[151,1158,1147],{"class":970},[151,1160,975],{"class":974},[151,1162,1163],{"class":789},"python evals\u002Frun.py --mode=ci\n",[151,1165,1166,1169],{"class":153,"line":348},[151,1167,1168],{"class":970},"        env",[151,1170,986],{"class":974},[151,1172,1173,1176,1178],{"class":153,"line":354},[151,1174,1175],{"class":970},"          ANTHROPIC_API_KEY",[151,1177,975],{"class":974},[151,1179,1180],{"class":789},"${{ secrets.ANTHROPIC_API_KEY }}\n",[151,1182,1183,1185,1187,1189],{"class":153,"line":360},[151,1184,1005],{"class":974},[151,1186,1105],{"class":970},[151,1188,975],{"class":974},[151,1190,1191],{"class":789},"actions\u002Fupload-artifact@v4\n",[151,1193,1194,1197,1199],{"class":153,"line":366},[151,1195,1196],{"class":970},"        
if",[151,1198,975],{"class":974},[151,1200,1201],{"class":789},"always()\n",[151,1203,1204,1206],{"class":153,"line":371},[151,1205,1126],{"class":970},[151,1207,986],{"class":974},[151,1209,1210,1213,1215],{"class":153,"line":377},[151,1211,1212],{"class":970},"          name",[151,1214,975],{"class":974},[151,1216,1217],{"class":789},"eval-report\n",[151,1219,1220,1223,1225],{"class":153,"line":383},[151,1221,1222],{"class":970},"          path",[151,1224,975],{"class":974},[151,1226,1227],{"class":789},"evals\u002Freport.json\n",[43,1229,1231],{"id":1230},"two-modes-pr-vs-nightly","Two modes: PR vs nightly",[11,1233,1234],{},"Run different slices depending on when the job fires:",[51,1236,1237,1243],{},[54,1238,1239,1242],{},[57,1240,1241],{},"PR mode"," — a fast subset (20–30 examples, 1 run each). Blocks merge, needs to finish in under 5 minutes.",[54,1244,1245,1248],{},[57,1246,1247],{},"Nightly mode"," — the full golden set (200+ examples, 3 runs each for majority vote). Posts results to a channel, doesn't block anything.",[140,1250,1253],{"className":222,"code":1251,"filename":1252,"language":225,"meta":146,"style":146},"if args.mode == \"ci\":\n    examples = random.sample(all_examples, 30)\n    reruns = 1\nelse:  # nightly\n    examples = all_examples\n    reruns = 3\n","evals\u002Frun.py — the mode flag",[148,1254,1255,1260,1265,1270,1275,1280],{"__ignoreMap":146},[151,1256,1257],{"class":153,"line":154},[151,1258,1259],{},"if args.mode == \"ci\":\n",[151,1261,1262],{"class":153,"line":160},[151,1263,1264],{},"    examples = random.sample(all_examples, 30)\n",[151,1266,1267],{"class":153,"line":166},[151,1268,1269],{},"    reruns = 1\n",[151,1271,1272],{"class":153,"line":172},[151,1273,1274],{},"else:  # nightly\n",[151,1276,1277],{"class":153,"line":251},[151,1278,1279],{},"    examples = all_examples\n",[151,1281,1282],{"class":153,"line":257},[151,1283,1284],{},"    reruns = 3\n",[43,1286,1288],{"id":1287},"handling-flakes","Handling 
flakes",[11,1290,1291,1292,1295],{},"Even with majority vote, evals flake. Build a ",[57,1293,1294],{},"flake budget",": if a known-flaky example fails three weeks in a row, investigate; fewer failures than that, tolerate. The worst anti-pattern is making the CI pass by lowering the threshold every time it fails. That's not a passing CI — that's a disabled CI.",[43,1297,769,1299,1302],{"id":1298},"the-evals-label",[148,1300,1301],{},"evals:"," label",[11,1304,1305,1306,1309],{},"Consider gating the full-suite run on a PR label. Day-to-day PRs run the fast subset; PRs that touch prompts or agents get ",[148,1307,1308],{},"evals:full"," and run everything. This keeps CI time reasonable and focuses tokens where regressions are likely.",[20,1311,1314,1321,1325,1392,1395,1399,1445,1449,1452],{"id":1312,"title":1313},"cost","The Cost of Evals",[11,1315,1316,1317,1320],{},"Evals burn tokens. A 50-example suite with 3 reruns and a judge call per run is 50 × 3 × 2 = ",[57,1318,1319],{},"300 API calls per eval run",". 
Multiply by every PR and every nightly, and the bill adds up.",[43,1322,1324],{"id":1323},"a-concrete-estimate","A concrete estimate",[1326,1327,1328,1344],"table",{},[1329,1330,1331],"thead",{},[1332,1333,1334,1338,1341],"tr",{},[1335,1336,1337],"th",{},"Setup",[1335,1339,1340],{},"Calls\u002Frun",[1335,1342,1343],{},"Approx cost\u002Frun (Sonnet subject + Haiku judge)",[1345,1346,1347,1359,1370,1381],"tbody",{},[1332,1348,1349,1353,1356],{},[1350,1351,1352],"td",{},"20 examples × 1 rerun × judge",[1350,1354,1355],{},"40",[1350,1357,1358],{},"~$0.15",[1332,1360,1361,1364,1367],{},[1350,1362,1363],{},"50 examples × 3 reruns × judge",[1350,1365,1366],{},"300",[1350,1368,1369],{},"~$1.00",[1332,1371,1372,1375,1378],{},[1350,1373,1374],{},"200 examples × 3 reruns × judge",[1350,1376,1377],{},"1,200",[1350,1379,1380],{},"~$4.00",[1332,1382,1383,1386,1389],{},[1350,1384,1385],{},"200 examples × 3 reruns × judge + agent (5 tool turns)",[1350,1387,1388],{},"~6,000",[1350,1390,1391],{},"~$20",[11,1393,1394],{},"The agent-level number is what shocks people. End-to-end agent evals aren't cheap — they're sessions, and sessions cost real money. Budget accordingly.",[43,1396,1398],{"id":1397},"six-ways-to-keep-the-bill-sane","Six ways to keep the bill sane",[51,1400,1401,1407,1413,1419,1429,1435],{},[54,1402,1403,1406],{},[57,1404,1405],{},"Haiku for the judge."," Sonnet judges are rarely worth the 4× price. Verify once that Haiku agrees with you on 20 examples and then commit.",[54,1408,1409,1412],{},[57,1410,1411],{},"Prompt caching on the judge."," The rubric and instructions are identical across hundreds of calls. Enable caching on the judge prompt — you'll save 40–60% on judge tokens.",[54,1414,1415,1418],{},[57,1416,1417],{},"Sampling in CI."," Run 30 examples on PRs, 200 on nightly. 
You catch most regressions in the subset and only pay the full cost once a day.",[54,1420,1421,1424,1425,1428],{},[57,1422,1423],{},"Deterministic checks first."," Filter easy examples through a regex or JSON-schema check before spending a judge call. If the output is ",[15,1426,1427],{},"obviously"," wrong, don't ask the judge.",[54,1430,1431,1434],{},[57,1432,1433],{},"Skip passing examples on reruns."," If an example passes on run 1, don't bother with runs 2 and 3. Only rerun the ones near the boundary.",[54,1436,1437,1440,1441,1444],{},[57,1438,1439],{},"Parallelize."," The 50-example suite that runs serially in 10 minutes runs in 1 minute with a ",[148,1442,1443],{},"ThreadPoolExecutor(max_workers=20)",". Same tokens, 10× the wall-clock speed.",[43,1446,1448],{"id":1447},"the-compounding-argument","The compounding argument",[11,1450,1451],{},"Evals feel expensive until you compare them to the cost of shipping a regression. A single prod regression caught by an eval — versus caught by a customer ticket two days later — pays for months of eval compute. The math rarely favors skipping evals once the thing you're building has users.",[35,1453,1456],{"title":1454,"variant":1455},"The $5 rule","success",[11,1457,1458],{},"If your eval suite costs less than $5 per run and catches one real regression a month, it's the highest-ROI infrastructure you own. 
Keep it cheap enough that no one argues about running it.",[1460,1461,1462],"style",{},"html .default .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html .shiki span {color: var(--shiki-default);background: var(--shiki-default-bg);font-style: var(--shiki-default-font-style);font-weight: var(--shiki-default-font-weight);text-decoration: var(--shiki-default-text-decoration);}html pre.shiki code .s7eDp, html code.shiki .s7eDp{--shiki-default:#6F42C1}html pre.shiki code .sYBdl, html code.shiki .sYBdl{--shiki-default:#032F62}html pre.shiki code .sYu0t, html code.shiki .sYu0t{--shiki-default:#005CC5}html pre.shiki code .shJU0, html code.shiki .shJU0{--shiki-default:#22863A}html pre.shiki code .sgsFI, html code.shiki .sgsFI{--shiki-default:#24292E}html pre.shiki code .sAwPA, html code.shiki .sAwPA{--shiki-default:#6A737D}",{"title":146,"searchDepth":160,"depth":160,"links":1464},[1465,1466,1467,1468,1469,1470,1471,1472,1473,1474,1475,1476,1477,1478,1479,1480,1481,1483,1484,1485],{"id":45,"depth":166,"text":46},{"id":109,"depth":166,"text":110},{"id":137,"depth":166,"text":138},{"id":181,"depth":166,"text":182},{"id":218,"depth":166,"text":219},{"id":464,"depth":166,"text":465},{"id":488,"depth":166,"text":489},{"id":536,"depth":166,"text":537},{"id":728,"depth":166,"text":729},{"id":765,"depth":166,"text":766},{"id":834,"depth":166,"text":835},{"id":847,"depth":166,"text":848},{"id":860,"depth":166,"text":861},{"id":956,"depth":166,"text":957},{"id":1230,"depth":166,"text":1231},{"id":1287,"depth":166,"text":1288},{"id":1298,"depth":166,"text":1482},"The evals: label",{"id":1323,"depth":166,"text":1324},{"id":1397,"depth":166,"text":1398},{"id":1447,"depth":166,"text":1448},"Why evals matter, golden datasets, LLM-as-judge, regression suites, Anthropic's eval tooling, CI integration, and the real cost 
of testing agent output.","20 min","md","Test your agents","LucideFlaskConical",null,{},{"title":1494,"path":1495},"Recipes","\u002Frecipes","\u002Fevals",{"title":1498,"path":1499},"Orchestration","\u002Forchestration",{"title":1501,"description":1502,"keywords":1503,"proficiencyLevel":1510,"timeRequired":1511},"Evals — Testing Claude Code Agents","Why evals matter for agentic workflows: golden datasets, LLM-as-judge patterns, regression suites, Anthropic's eval tooling, CI integration, and the real token cost of testing agents in production.",[1504,1505,101,1506,1507,1508,1509],"claude code evals","llm as judge","agent regression tests","anthropic eval sdk","eval ci","agent testing","Advanced","PT20M",[1513,1514,1515,1516,1517,1518,1519],{"id":22,"title":23,"level":160},{"id":94,"title":95,"level":160},{"id":211,"title":212,"level":160},{"id":529,"title":530,"level":160},{"id":827,"title":828,"level":160},{"id":949,"title":950,"level":160},{"id":1312,"title":1313,"level":160},"6luIL6AvwMVZ0C1_3yUJa-3IbIyV2i2GexawqZ73vQ8",1777109528508]