LLM Evals: Open Coding, the Dev/Test Split, and a $1.75 Lesson About Scorer Choice

TLDR: Scorers tell you something is wrong. Open coding the traces tells you what is actually wrong, and which fix would help. If you're shipping an LLM-backed product without a dedicated eval engineer, start with cheap, specific scorers, read your traces regularly, and reach for LLM-as-judge only after you've identified the exact pattern you want to regression-test. And don't drop a generic Factuality scorer into your pipeline and forget about it. It's expensive, and in my case it didn't tell me anything useful.

If you read the first post about building a small RAG app, this one picks up where I closed it: "scorer design is harder than it looks."

The substring scorer that almost made me "fix" a working bot

I started with the most obvious eval scorers: substring matches. For "did the bot refuse this question," I had a list of phrases ("don't have", "outside the scope", "unable to", "can't help with", a handful more) and a scorer that returned 1 if any phrase appeared in the response and 0 if none did. It was simple and cost nothing per row.
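A minimal sketch of what a scorer like that can look like. The function name and the case-insensitive matching are my assumptions for illustration; the phrase list contains only the four phrases quoted above, not the full list.

```python
# Only the four phrases quoted in the post; the real list had a handful more.
REFUSAL_PHRASES = [
    "don't have",
    "outside the scope",
    "unable to",
    "can't help with",
]


def refusal_substring_score(response: str, phrases=REFUSAL_PHRASES) -> int:
    """Return 1 if any refusal phrase appears in the response, else 0."""
    text = response.lower()  # assumption: matching is case-insensitive
    return 1 if any(phrase in text for phrase in phrases) else 0


if __name__ == "__main__":
    print(refusal_substring_score("I don't have information on that."))
    print(refusal_substring_score("The motion passed on a 5-2 vote."))
```

The first call scores 1 and the second scores 0. The scorer only knows the exact strings in the list, which is both why it is free to run and why it is brittle.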

For about a month, this worked fine. Then I shipped a system prompt update. I tightened the refusal language, making the bot more direct about what it could and couldn't answer. And the next eval run showed my refusal scorer dropping from 87.5% to 75%. Twelve and a half points overnight, on a metric I'd been holding pretty stable for weeks.

My first instinct was the obvious one: the prompt change had weakened the bot's refusal behavior.

I opened the traces for a look.

The weird thing was that the bot was refusing the questions cleanly and politely, the exact same questions it had refused before. The only thing that had changed was the language of the refusals. Instead of "I don't have information on that," the bot was now writing things like "I cannot answer that" and "the records explicitly prohibit me from sharing." It was actually grammatically better, factually identical, and completely invisible to my phrase-list scorer.

I was one prompt edit away from "fixing" a bot that wasn't broken. The scorer was broken.

Lesson: when a regression looks targeted (specific rows, specific scorers, no obvious mechanism), read the traces before changing the system.

The fix here was unglamorous: expand the phrase list aggressively, with each new failure mode adding the language pattern it surfaced. Over time the list gets more comprehensive, but it never fully catches up with the ways a model can phrase a refusal. Better long-term answer: replace the substring scorer with an LLM-as-judge that just asks "did the assistant decline to answer this question?" Yes or no. We'll come back to this.

LLM-as-judge isn't a free upgrade

After the substring debacle, I figured the answer was to throw a Factuality scorer at the problem. Braintrust integrates autoevals.Factuality out of the box; it calls a frontier OpenAI model under the hood; it sounds rigorous. I added it to my eval pipeline and forgot about it.

Three things happened, none of them what I expected.

First, my eval costs went up more than 5x overnight. A single CI run jumped from ~$0.40 to ~$2.15. Factuality alone accounted for about $1.75 of that, which was more than four times the rest of the eval combined. My substring scorers cost zero per row, Haiku for the chat itself was forty cents a run, and one frontier-model judge was suddenly the biggest line item in my entire eval budget. For a project running evals on every push, that adds up fast.
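The arithmetic behind those figures, as a quick sketch. The two per-run costs are the approximate numbers quoted above; the variable names are just for illustration.

```python
baseline_run = 0.40    # USD per CI run before the judge (approximate, from the post)
with_judge_run = 2.15  # USD per CI run with Factuality added (approximate, from the post)

judge_cost = with_judge_run - baseline_run       # what the judge alone adds per run
total_multiplier = with_judge_run / baseline_run  # how much the whole run grew
judge_vs_rest = judge_cost / baseline_run         # judge cost relative to everything else

print(f"judge cost per run: ${judge_cost:.2f}")
print(f"total run cost: {total_multiplier:.1f}x the baseline")
print(f"judge vs. rest of eval: {judge_vs_rest:.1f}x")
```

That works out to about $1.75 per run for the judge, a total run cost of roughly 5.4x the baseline, and a judge that costs roughly 4.4x everything else in the eval put together.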

(Note: I'll likely move this project over to a local setup in the future, but I'm deliberately not doing that yet to learn more about costings.)

Second, the scores didn't move. I'd ship a prompt change that obviously improved a row, and Factuality would still score it 0.6. I'd ship a regression, and Factuality would still score it 0.6. Most rows sat in a narrow band around 60%, regardless of whether the bot was getting better or worse. After a couple of weeks I realized why: Factuality's grading rubric maps "verbose but consistent with expected" to a fixed middle score. My bot is verbose by design: it cites sources, explains its reasoning, gives context. So almost every response landed in that bucket, and the score never moved because the bot's shape never changed.

Third, the eval was sometimes wrong. This was the most surprising one. I had a row about how many councilors had been appointed to a committee. The expected answer said three. The bot's response said four, cited the meeting minutes, and named them. Factuality flagged the disagreement (score 0), and my other scorers (key facts present, source cited) gave it 1.0. So I opened the trace.

The bot was right. The minutes clearly listed four appointees. I hadn't even written this eval row myself; Replit Agent had generated it for me, scanning the corpus and producing the expected field, and somewhere along the way it had dropped one name. Factuality had correctly detected a disagreement between the bot's response and the expected field. It just happened that the disagreement was that the eval was wrong, not the bot.

This is a real failure mode of any eval set, regardless of who or what wrote it. I started finding more examples once I noticed the pattern. A row where the expected description of a vote was missing a key qualifier. A row where a councilor's stated position was paraphrased in a way that didn't quite match the record. Some of these were Replit Agent's; some were mine. The cause didn't matter much. Humans miss details across long records, and AI generators hallucinate plausible-looking facts that don't quite hold up. The result is the same either way: an expected field that doesn't actually match the source.

Lesson: expected fields are not ground truth, regardless of who or what wrote them. The source document is. When two scorers disagree on a row, especially when the bot scores high on factual scorers but low on a judge, the first move should be to open the trace, read the bot's response, and compare it to the source, not to whatever's sitting in your expected field.
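One way to make that habit mechanical is to flag every row where scorers land on opposite sides of pass/fail, then review those rows by hand against the source. A minimal sketch follows; the row shape, the scorer names, and the 0.5 threshold are all assumptions for illustration, not the output format of any particular eval tool.

```python
def rows_needing_source_check(rows, threshold=0.5):
    """Return ids of rows where scorers disagree about pass/fail.

    rows: list of dicts shaped like {"id": str, "scores": {scorer_name: float}}
    (a hypothetical shape chosen for this sketch).
    """
    flagged = []
    for row in rows:
        # Collapse each score to a pass/fail verdict, then check for a mix.
        verdicts = {score >= threshold for score in row["scores"].values()}
        if len(verdicts) > 1:  # at least one pass and at least one fail
            flagged.append(row["id"])
    return flagged


if __name__ == "__main__":
    rows = [
        # Mirrors the committee row: judge says 0, the other scorers say 1.0.
        {"id": "committee-appointees",
         "scores": {"factuality": 0.0, "key_facts": 1.0, "source_cited": 1.0}},
        {"id": "agreeing-row",
         "scores": {"factuality": 1.0, "key_facts": 1.0, "source_cited": 1.0}},
    ]
    print(rows_needing_source_check(rows))
```

This only produces a reading list. Deciding whether the bot or the expected field is wrong still means opening the source document.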

This matters especially if you're using AI-assisted eval generation, which is increasingly the default for small teams. The same model class generating your eval rows is generating your bot's responses, and they share many of the same failure modes, including the failure of hallucinating facts that sound right. An AI-generated expected field that says "three appointees" because the model skimmed too fast is the same kind of error as a bot response that confidently states a wrong vote count. You need a human in the loop somewhere, comparing both against the actual source. Mine just happened to be after the fact, prompted by a scorer disagreement, rather than at row-authoring time.

I dropped Factuality. The eval cost went back to $0.40.

Reading traces is where the insight actually lives

If you don't take a structured eval course, here's the move that's hardest to learn on your own: read the actual traces. The actual raw outputs. Read them, tag patterns as you see them, group recurring failures together. The technique has a name in qualitative research: open coding.

I started doing this seriously after taking a great Maven course on LLM evals that hammered on the point: scorers tell you something is wrong; reading traces tells you what is actually wrong. You can't design a good fix from an aggregate score. You can only design it from understanding the specific failure pattern that's driving the score down.

A few concrete patterns I caught only by reading traces, not from any scorer:

The bot was dismissing genuinely relevant retrieved docs because they used formal vocabulary ("road maintenance") instead of the user's casual word ("potholes"). Aggregate scores looked fine; specific rows were quietly worse. The fix was a system prompt rule about trusting the retrieved sources even when their vocabulary differs from the question, and I'd never have written that rule without reading 20 traces and noticing the same pattern in every "topic with everyday wording" failure.
The bot was hedging on questions about meetings that had clearly already happened. "If April 6, 2026 is in the future..." it would say, even though the date was three months ago. The bot had no idea what "today" was. I'd never have caught this from a scorer: every row that surfaced it scored fine on factuality, fine on key facts, fine on citation. It only became visible when I read three responses in a row and noticed the same conditional phrasing.
A truncation bug was silently dropping the last 60% of every long retrieved document before the model saw it. The bot kept saying "the records don't mention this" about things I knew were in the records, but only for long records. No scorer would have caught this. It was a single trace, read carefully, that surfaced the bug.
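Open coding doesn't need tooling, but a small tally helps once the tags pile up. The sketch below assumes you have hand-tagged each trace you read. The tag names, trace ids, and record shape are hypothetical, loosely named after the three patterns above.

```python
from collections import Counter, defaultdict


def tally_codes(coded_traces):
    """Count each hand-assigned tag and collect the trace ids that carry it."""
    counts = Counter()
    examples = defaultdict(list)
    for trace in coded_traces:
        for tag in trace["tags"]:
            counts[tag] += 1
            examples[tag].append(trace["trace_id"])
    return counts, dict(examples)


if __name__ == "__main__":
    # Hypothetical hand-tagged traces; the tags are written by a human reader.
    coded = [
        {"trace_id": "t1", "tags": ["vocabulary_mismatch"]},
        {"trace_id": "t2", "tags": ["date_confusion"]},
        {"trace_id": "t3", "tags": ["vocabulary_mismatch", "truncated_context"]},
        {"trace_id": "t4", "tags": []},  # read, nothing notable
    ]
    counts, examples = tally_codes(coded)
    for tag, n in counts.most_common():
        print(tag, n, examples[tag])
```

The counts tell you which pattern to fix first, and the trace ids give you concrete examples to reread when you write the fix.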

The principle that came out of this: LLM-as-judge only on grouped errors, never as a generic quality gate. Don't reach for a judge to measure "is this good?"; that produces diffuse 60% scores, as Factuality did. Reach for a judge once you've already open-coded the traces, identified a specific recurring failure pattern, and need a regression test for it. What works best is writing a tight judge that asks ONE yes/no question about THAT pattern: "Does this response describe a motion that failed as if it had succeeded?" "Does this response cite a source that was not in the retrieved set?" Each judge becomes a regression test for a known bug, with clear attribution when it fires.

The refusal scorer from the first section is exactly this pattern. The brittle substring version got replaced with a Haiku call that asks ONE question: "Did the assistant decline to answer the user's question, explaining that it cannot or should not?" Yes or no. Returns a 1 or a 0. Costs about $0.001 per row, near-free at the scale my eval runs. Doesn't care whether the bot says "I cannot answer," "the records don't include," "I won't speculate," or any of the other phrasings that broke the phrase list.
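A rough sketch of the shape of a judge like this. The prompt layout and the parsing rules are assumptions, and `ask_model` is a hypothetical callable standing in for whatever client makes the actual model call, so nothing here depends on a specific SDK.

```python
from typing import Callable

JUDGE_QUESTION = (
    "Did the assistant decline to answer the user's question, "
    "explaining that it cannot or should not?"
)


def build_judge_prompt(question: str, response: str) -> str:
    """Assemble a prompt that asks exactly one yes/no question."""
    return (
        f"User question:\n{question}\n\n"
        f"Assistant response:\n{response}\n\n"
        f"{JUDGE_QUESTION}\n"
        "Answer with exactly one word: YES or NO."
    )


def parse_verdict(raw: str) -> int:
    """Map the judge's reply to 1 (YES) or 0 (NO); fail loudly on anything else."""
    verdict = raw.strip().upper().rstrip(".")
    if verdict == "YES":
        return 1
    if verdict == "NO":
        return 0
    raise ValueError(f"Unexpected judge reply: {raw!r}")


def refusal_judge_score(
    question: str, response: str, ask_model: Callable[[str], str]
) -> int:
    """ask_model is a hypothetical wrapper: prompt text in, reply text out."""
    return parse_verdict(ask_model(build_judge_prompt(question, response)))


if __name__ == "__main__":
    fake_model = lambda prompt: "YES"  # stand-in for a real model call
    print(refusal_judge_score("Who will win the election?",
                              "I won't speculate on that.", fake_model))
```

Raising on an unparseable reply, rather than defaulting to 0, is a deliberate choice in this sketch: a judge that silently scores garbage as "no refusal" would recreate the same misleading-metric problem as the phrase list.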

The important part isn't that I used Haiku. It's the order I did things in. I built the judge after open coding the failures, not before. The judge knows exactly which pattern to detect because I already knew the pattern existed and could describe it in one sentence. Trying to build that judge before I'd read the traces would have produced the same shape of failure as the Factuality one I described earlier.

Two golden sets, used differently

The other thing that's been quietly important: keeping two eval sets, not one.

The first set, call it the dev set, is what I tune against day to day. About 30 hand-crafted rows covering the main failure modes I care about: factual recall, vote lookups, scope refusals, prediction refusals, aggregation refusals, clarification on vague queries. Every time I tweak the prompt or change a retrieval parameter, this set runs in CI and tells me whether the change helped.

The second set, the test set, I deliberately never tune against. About 20 hand-crafted rows, including the most adversarial and edge-case questions I could think of: prompt injection attempts, near-miss queries, boundary cases where the question touches both in-scope and out-of-scope territory. I run this set occasionally as a baseline check. Its job is to tell me when I've been tuning the bot to do well on my specific dev set questions, rather than actually making it better at the underlying job.

Both sets are "golden" in the sense that ML people use the word: hand-curated, hand-verified, ground truth. The word "golden" describes the quality of the rows, not their role. Dev and test is just the role distinction borrowed from ML training.

If dev scores keep climbing while test scores stay flat or drop, I've been gaming my own dev set. That's the moment to either refresh the dev set or expand its coverage.
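That check is simple enough to automate. A minimal sketch, assuming you keep a chronological list of aggregate scores for each set; the function name, the `min_dev_gain` default, and the example numbers are all hypothetical.

```python
def is_overfitting_dev(dev_scores, test_scores, min_dev_gain=0.05):
    """Flag the case where dev climbed while test stayed flat or dropped.

    Both arguments are chronological lists of aggregate scores in [0, 1].
    Only the first and last entries are compared, which is a simplification.
    """
    if len(dev_scores) < 2 or len(test_scores) < 2:
        return False  # not enough history to say anything
    dev_gain = dev_scores[-1] - dev_scores[0]
    test_gain = test_scores[-1] - test_scores[0]
    return dev_gain >= min_dev_gain and test_gain <= 0


if __name__ == "__main__":
    # Hypothetical score histories, not results from the project.
    print(is_overfitting_dev([0.70, 0.78, 0.85], [0.72, 0.72, 0.70]))
    print(is_overfitting_dev([0.70, 0.85], [0.72, 0.80]))
```

The first history is flagged and the second is not. Because the test set only runs occasionally, its list will be shorter than the dev list, and the comparison still works as long as both cover the same stretch of changes.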

What I'd tell other small teams

If you're shipping an LLM-backed product and you don't have a dedicated eval engineer:

Start with cheap, specific scorers. Substring scorers, key-facts checks, citation checks. These cost nothing per row and catch the obvious failures. Expect to expand them as you discover refusal language they miss.
Don't reach for a generic LLM-as-judge. A frontier-model scorer that asks "is this good?" produces diffuse scores that don't move when behavior changes, and in my case it cost more than four times the rest of the eval combined. Save that money.
Open code your traces. Read the actual responses, especially the failing ones. Tag patterns as you notice them. Group recurring failures. This is where design insight comes from, and no scorer will replace it.
Use LLM-as-judge surgically. Once you've identified a specific recurring failure pattern, write a tight judge asking ONE yes/no question about THAT pattern. Each judge becomes a regression test for a known bug.
Keep two golden sets, dev and test. Tune against dev. Hold test out. Watch the gap between them: when it grows, you've been tuning the bot to your specific dev questions rather than improving it overall.
Trust but verify your own expected fields. Eval rows drift, and the most expensive failure mode is "the eval was wrong and the bot was right." Spot-check disagreements against the source, not the expected field.

Scorers tell you something is wrong. Open coding tells you what is actually wrong and which fix would help. You need both, but the second one is where the insight actually lives.