What was I building?
Node/TypeScript app deployed on Replit. Postgres with full-text search for retrieval. Anthropic Claude for scraping PDF source documents and adding them to GitHub. Braintrust for observability and evals.
The corpus is a few hundred long-form documents. Users ask questions in natural language. The bot retrieves relevant docs, sends them to Claude with a system prompt, and produces a grounded answer with inline citations. Standard RAG architecture.
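Stripped down, the request path looks something like the sketch below. The table and column names, the prompt wording, and the model string are mine for illustration, not the exact code Agent generated.

```typescript
import { Pool } from "pg";
import Anthropic from "@anthropic-ai/sdk";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const anthropic = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment

// Retrieve the top-k documents by Postgres full-text rank.
async function retrieve(question: string, k: number) {
  const { rows } = await pool.query(
    `SELECT title, content
       FROM documents
      WHERE tsv @@ plainto_tsquery('english', $1)
      ORDER BY ts_rank(tsv, plainto_tsquery('english', $1)) DESC
      LIMIT $2`,
    [question, k]
  );
  return rows as { title: string; content: string }[];
}

// Assemble a context block from the retrieved docs and ask Claude for a
// grounded answer with inline [n] citations.
async function answer(question: string): Promise<string> {
  const docs = await retrieve(question, 12);
  const context = docs
    .map((d, i) => `[${i + 1}] ${d.title}\n${d.content}`)
    .join("\n\n");

  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514", // or whichever Claude model you're on
    max_tokens: 1024,
    temperature: 0,
    system: "Answer only from the provided sources. Cite them inline as [n].",
    messages: [
      { role: "user", content: `Sources:\n${context}\n\nQuestion: ${question}` },
    ],
  });

  const first = response.content[0];
  return first.type === "text" ? first.text : "";
}
```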
What happens when your AI developer inherits yesterday's defaults?
Replit Agent is impressive at the start. I described what I wanted, "a chatbot that retrieves from a Postgres FTS index and answers questions with citations," and within an afternoon I had a running app. For someone who isn't a full-time engineer, that's the difference between "this never gets built" and "this is live by the end of the week."
But Agent has a quiet failure mode that took me a while to recognise: it inherits defaults from its training data, and that training data is calibrated for an older era of LLM development. Smaller context windows. Different best practices. Era-appropriate hacks that have become obsolete.
The bug that crystallised this: the chatbot couldn't answer a question about something documented clearly in the source. The relevant text sat at character 11,649 of a 16,474-character file. Retrieval found the right file. The model just never saw the relevant content.
Agent had silently capped each retrieved document at 8,000 characters, dropping everything past it.
Why? My guess: in 2023, when smaller-context-window models were the default, capping at 8K was sensible. In 2026, with Claude's 200K-token window, it's just throwing away your data. Agent reached for the pattern it had seen most often, not the one that was right for the current generation of models.
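The offending pattern, reconstructed from memory rather than copied from the repo, amounted to this:

```typescript
// What Agent generated, in effect: silently cap each retrieved document.
const MAX_DOC_CHARS = 8000;

function buildContext(docs: { title: string; content: string }[]): string {
  return docs
    .map((d) => d.content.slice(0, MAX_DOC_CHARS)) // everything past 8,000 characters vanishes here
    .join("\n\n");
}

// The fix: stop slicing. A dozen ~16K-character documents is roughly 50K
// tokens, nowhere near Claude's 200K-token window.
function buildContextFixed(docs: { title: string; content: string }[]): string {
  return docs.map((d) => d.content).join("\n\n");
}
```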
I removed the cap. The bot started answering correctly. Simple fix. But the bigger question stuck: what other defaults are quietly making decisions for me?
I started pulling threads. Top-k retrieval was set to 6, probably too low, so I bumped it to 12. Temperature was defaulting to 1.0. For a factual chatbot. I know. I set it to 0.0. The exact prompt being assembled at request time was invisible. So were retrieval scores, truncation behaviour, and model selection logic. Each one had been a quiet design decision Agent made without surfacing it.
The lesson, for anyone building with AI code agents: hidden defaults are the killer. Every architectural decision the agent makes should be inspectable, logged, displayed. Build an admin page that surfaces every parameter (temperature, top-k, system prompt, retrieval rules) before you start trying to debug anything. The hour you spend making things visible saves you days of "why is this responding weirdly?"
So I asked Agent to build me an admin database. System prompt, temperature, top-k, retrieval rules, truncation limits. Every parameter that had been buried in code, surfaced into a database I could inspect and edit without redeploying. It took one prompt and maybe an hour of back-and-forth to get right. And suddenly the app went from a black box to something I could actually reason about. Every time the bot did something unexpected, I could check the admin settings first instead of digging through code I half-understood.
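The mechanics are nothing clever: a single settings table, read at request time, so an edit in the admin page takes effect on the next query without a redeploy. The table and column names below are illustrative, not the exact schema Agent produced.

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

interface ChatSettings {
  systemPrompt: string;
  temperature: number;
  topK: number;
  maxDocChars: number | null; // null = no truncation
}

// Read the live settings row on every request instead of hardcoding values.
async function loadSettings(): Promise<ChatSettings> {
  const { rows } = await pool.query(
    `SELECT system_prompt, temperature, top_k, max_doc_chars
       FROM chat_settings
      LIMIT 1`
  );
  return {
    systemPrompt: rows[0].system_prompt,
    temperature: Number(rows[0].temperature),
    topK: rows[0].top_k,
    maxDocChars: rows[0].max_doc_chars,
  };
}
```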
If you're building with an AI code agent and you do nothing else I suggest, do this.
How do you know if it actually works?
This is where Braintrust comes in, and I love it. Their pitch is "observability and evals for LLM apps," which sounds dry until you've tried debugging an LLM-backed product without it.
Traces capture every chat call: input, retrieved sources, output, scores, latency, tokens. When the bot misbehaves, I can click into a trace and see exactly what the LLM was given and what it produced. The truncation bug? Diagnosable in 30 seconds once I looked at the trace. From raw logs, that would've taken hours.
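Getting traces flowing is a few lines. The shape below reflects the Braintrust Node SDK as I was using it; treat the exact call names and options as approximate and check them against the current SDK.

```typescript
import { initLogger, traced } from "braintrust";

// One logger per process; every trace lands under this project in the dashboard.
initLogger({
  projectName: "chatbot",
  apiKey: process.env.BRAINTRUST_API_KEY,
});

// Wrap the chat handler so each request becomes a trace with the question
// and the final answer attached.
async function handleChat(question: string): Promise<string> {
  return traced(
    async (span) => {
      const output = await answer(question); // the answer() helper from the earlier sketch
      span.log({ input: question, output });
      return output;
    },
    { name: "chat" }
  );
}
```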
Their Datasets and Experiments tooling makes iteration feel scientific instead of vibes-based. I built a hand-grounded eval set with categories like "factual recall," "scope refusal," and "clarification on vague queries." Each row has expected behaviour and key facts. After every change to the prompt or retrieval logic, I run the eval and see exactly which rows moved in which direction. No guessing.
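An eval in this setup is a short script. The rows below are placeholders I've made up to show the shape, not entries from my real set; the task here calls the answer() helper from the earlier sketch (in practice you'd import your real chat function), and Factuality is the built-in LLM-as-judge scorer from autoevals.

```typescript
import { Eval } from "braintrust";
import { Factuality } from "autoevals";

// Hand-written rows: a question, the expected behaviour, and a category tag.
const dataset = [
  {
    input: "When was the organisation founded?",
    expected: "It was founded in 1998, per the annual report.",
    metadata: { category: "factual recall" },
  },
  {
    input: "What's the weather tomorrow?",
    expected: "A polite refusal: out of scope for this corpus.",
    metadata: { category: "scope refusal" },
  },
];

Eval("chatbot", {
  data: () => dataset,
  // Exercise the real pipeline, retrieval included, not just the prompt.
  task: async (input) => answer(input),
  scores: [Factuality],
});
```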
But the killer feature is side-by-side experiment comparison. Made a system prompt change? Compare the new run to the previous baseline; Braintrust diffs the per-row scores so you can see whether your "improvement" actually improved things or just shifted failure modes around. I've avoided shipping at least one prompt change I'd otherwise have been confident about, just because the dashboard showed it tanked another scorer. That alone justifies the tool.
I also wired up a GitHub Action to kick off a Braintrust experiment on every deploy. Each time I pushed a change, the eval ran automatically and I could see whether the new version was better or worse before anyone used it. No discipline required, the pipeline just wouldn't let me ship blind. That ended up being one of the most useful decisions in the whole project, and it took about twenty minutes to set up.
One small thing that made a big difference: I asked Replit to add a name flag to each experiment in Braintrust. So instead of a list of timestamped runs, I could see "bumped top-k to 12" or "simplified system prompt" or "removed truncation cap" in the dashboard. Sounds trivial. But when you're comparing six experiments side by side, knowing what you changed without clicking into each one saves you from losing track of your own iterations.
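If you're driving evals from the Node SDK, that's one extra option on the Eval call; experimentName is the field as I recall it, fed from whatever label the deploy passes in:

```typescript
Eval("chatbot", {
  experimentName: process.env.EXPERIMENT_NAME ?? "unnamed run", // e.g. "bumped top-k to 12"
  data: () => dataset,
  task: async (input) => answer(input),
  scores: [Factuality],
});
```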
I should say, I'm enjoying the Replit and Braintrust integration. They work well together, and the workflow of building in Replit, deploying, and having the eval automatically run and report back in Braintrust feels like what AI-assisted development should feel like. It's not perfect. But it's close enough that I stopped thinking about the tooling and started thinking about the product, which is the whole point.
One thing I'd flag if you're new to it: scorer design is harder than it looks. I started with simple substring-match scorers, asking whether the response contains the expected source string. Of course, the model started phrasing citations naturally, in human-readable variants, and my literal-match scorer flagged perfectly correct answers as failures. Took a few iterations to build flexible scorers that recognise legitimate variation. LLM-as-judge scorers (Braintrust's Factuality) help with the fuzzy cases. Don't expect to nail this on the first try.
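For rows where I know exactly which string should appear, a scorer is just a function; the trick is normalising before comparing. The normalisation rules below are illustrative, not the exact ones I ended up with.

```typescript
// A scorer returns a name and a 0-1 score; Braintrust aggregates the rest.
function citesSource(args: { output: string; expected?: string }) {
  const normalise = (s: string) =>
    s.toLowerCase().replace(/[^a-z0-9 ]/g, " ").replace(/\s+/g, " ").trim();

  const output = normalise(args.output);
  const expected = normalise(args.expected ?? "");

  // Accept any answer containing the expected citation once punctuation and
  // case are stripped, rather than demanding a literal character match.
  return {
    name: "cites_source",
    score: expected && output.includes(expected) ? 1 : 0,
  };
}
```

It slots into the same Eval call as scores: [citesSource, Factuality], with Factuality covering the genuinely fuzzy rows.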
What would I do differently?
If I were starting over:
Force visibility before you write a line of code. Ask Replit Agent to explain every architectural decision and surface every parameter in an admin dash. Demand a debug log on every request showing prepared queries, retrieval scores, total characters sent, model and temperature (see the sketch after this list).
Surface the temperature on the admin dash and set it explicitly. Don't leave it sitting at a default of 1.0 on a factual chatbot.
Build evals from day one, not week three. They're not a nice-to-have; they're how you know whether your changes are improvements or regressions.
Automate your evals into CI. A GitHub Action that runs a Braintrust experiment on every deploy takes twenty minutes to set up and removes the temptation to skip the check.
Treat single-run scorer deltas as suspicious. With small eval sets, sampling noise is real. Three runs and an average tells you more than one run with a confident-looking number.
Read your traces. Braintrust will surface failure patterns you'd never anticipate from the eval set alone. The most useful bug I caught wasn't from the eval set, it was from clicking through random production traces and noticing the bot's confident negatives ("X did not happen") were structurally identical for every case where the answer was actually sitting in a truncated portion of the source doc.
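And for the debug log in the first point above, the shape I'd ask for is one structured object per request; the field names here are just my suggestion:

```typescript
// One structured entry per chat request, emitted before the model call.
interface RequestDebugLog {
  preparedQuery: string;     // the FTS query actually sent to Postgres
  retrieved: { title: string; score: number; chars: number }[];
  totalContextChars: number; // what's really going into the prompt
  truncated: boolean;        // did anything get cut
  model: string;
  temperature: number;
  topK: number;
}

function logDebug(entry: RequestDebugLog): void {
  console.log(JSON.stringify({ at: new Date().toISOString(), ...entry }));
}
```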
Should you use this stack?
If you're building anything LLM-grounded and you're not a full-time platform engineer: yes, with eyes open. Replit Agent and Braintrust together are a real productivity multiplier, and the gaps are diagnosable rather than fundamental.
The trick is recognising that AI code agents will quietly inherit yesterday's defaults, and your job, the part that's still yours, is to drag those defaults into daylight. Log everything. Question every parameter. And when the bot gives you a confidently wrong answer, don't just fix it. Ask why it was wrong, and what else might be hiding behind the same assumption.
