LLM Evaluation Frameworks
How to measure what matters in language model performance.
Here’s something that happened to me last month. I changed three words in a system prompt. The change looked innocent. The evals passed. I shipped it. Within 48 hours, our ticket routing accuracy dropped from 91% to 74%. The eval suite didn’t catch it because the eval suite was testing the wrong thing.
That experience crystallized something I’d been thinking about for a while: evaluation is the hardest part of building with LLMs, and most teams are doing it wrong. Not wrong in a “they could be better” sense. Wrong in a “they’re measuring things that don’t correlate with production quality” sense.
This post is about how to actually evaluate LLM applications. Not benchmark scores. Not academic metrics. The practical engineering of knowing whether your system works, knowing when it breaks, and knowing why.
Why traditional metrics fail
Before we get into what works, let me explain why the obvious approaches don’t.
Benchmark scores are not product metrics
When someone asks “which model is better, Claude or GPT-4?” the answer is “for what?” Benchmark scores (MMLU, HumanEval, HellaSwag, and the dozens of others) measure specific capabilities in controlled conditions. They tell you almost nothing about how the model will perform on your specific task with your specific data.
I’ve seen teams pick a model based on benchmark scores and then be surprised when it performs worse on their use case than a model with lower benchmark scores. The reason is simple: benchmarks test general capabilities. Your product uses specific capabilities in specific ways. The correlation between general benchmark performance and product-specific performance is weak.
Benchmark contamination is rampant
There’s an even more fundamental problem with benchmarks: contamination. Benchmark contamination occurs when test data leaks into training sets, inflating scores without improving actual capability. By 2025, this had become so severe that most static benchmarks were considered unreliable. Models could score well on benchmarks they’d effectively memorized during training, giving a false impression of capability.
LiveBench emerged specifically to address this problem, using regularly updated questions designed to prevent test data from appearing in training sets and relying on verifiable, objective ground-truth answers rather than LLM judges. But even LiveBench is a general benchmark. It tells you about the model’s general capability, not about your specific application.
BLEU, ROUGE, and friends are from a different era
These metrics were designed for machine translation and summarization before LLMs existed. They measure surface-level text similarity: does the output contain the same words as the reference? This is a terrible proxy for quality in most LLM applications.
A model can produce an output that scores 0.0 on BLEU (completely different words from the reference) and be a perfect answer. It can score 0.9 on BLEU (nearly identical words) and be wrong, misleading, or unhelpful.
These metrics survive because they’re cheap to compute and easy to understand. But optimizing for them doesn’t optimize for quality. Use them only for tasks where surface-level similarity genuinely matters (like translation), and even then, treat them as one signal among many.
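To make the failure mode concrete, here's a toy surface-similarity metric (simple unigram overlap standing in for BLEU). The output with nearly identical words gets the fact wrong and scores high; the correct answer phrased differently scores zero:

```python
def unigram_overlap(candidate: str, reference: str) -> float:
    """Fraction of reference words appearing in the candidate —
    a crude stand-in for BLEU-style surface similarity."""
    cand_words = set(candidate.lower().split())
    ref_words = reference.lower().split()
    if not ref_words:
        return 0.0
    return sum(w in cand_words for w in ref_words) / len(ref_words)

reference = "The meeting is on Tuesday at 3pm"

# Wrong answer, nearly identical words: scores ~0.86.
wrong = "The meeting is on Thursday at 3pm"
# Correct answer, entirely different words: scores 0.0.
right = "It's scheduled for Tue afternoon, 15:00"

print(unigram_overlap(wrong, reference))
print(unigram_overlap(right, reference))
```

Optimizing this number pushes you toward parroting the reference, not toward being right.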
Accuracy is a lie in open-ended tasks
For classification tasks, accuracy is meaningful. The model either classifies correctly or it doesn’t. For open-ended tasks (summarization, writing, conversation, analysis), “accuracy” isn’t a coherent concept. What does it mean for a summary to be “accurate”? It could be factually correct but miss the key points. It could capture the key points but be poorly written. It could be well-written and correct but too long.
Open-ended tasks require multidimensional evaluation. A single number can’t capture quality. You need multiple dimensions, each measured separately, each with its own threshold.
The eval framework that actually works
After two years of building LLM applications and getting evaluation wrong multiple times, here’s the framework I’ve converged on. It has four layers, and each one catches a different category of problem.
Layer 1: Unit evals (automated, fast, cheap)
Layer 2: Behavioral evals (automated, medium cost)
Layer 3: LLM-as-judge evals (automated, higher cost)
Layer 4: Human evals (manual, expensive, gold standard)
Layer 1: Unit evals
Unit evals test specific, deterministic properties of the output. They’re like unit tests in traditional software: fast, cheap, and binary.
| What to test | How to test it | Example |
|---|---|---|
| Output format | Schema validation | JSON matches expected schema |
| Required fields | Field presence check | All required fields are non-null |
| Length constraints | Character/token count | Summary is under 200 words |
| Forbidden content | Pattern matching | No PII, no profanity, no competitor names |
| Factual constraints | Lookup verification | Dates match source data, names are spelled correctly |
| Deterministic extractions | Exact match | Extracted email address matches known correct value |
These evals catch about 30% of production issues. They’re not glamorous, but they’re essential. A model that returns malformed JSON or includes PII in a public-facing summary is failing at a basic level that should never reach a user.
Run these on every prompt change. They take seconds and cost nothing (they don’t require model calls). If a unit eval fails, don’t bother running the more expensive evals.
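As a sketch of what Layer 1 looks like in practice, using only the standard library — the field names, limits, and patterns here are illustrative, not a prescription:

```python
import json
import re

REQUIRED_FIELDS = {"title", "severity", "summary"}  # illustrative schema
MAX_SUMMARY_WORDS = 200
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN shape
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

def unit_eval(raw_output: str) -> list[str]:
    """Return a list of unit-eval failures (empty list means pass)."""
    failures = []
    # 1. Output format: must be valid JSON.
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    # 2. Required fields present and non-null.
    missing = [f for f in REQUIRED_FIELDS if data.get(f) in (None, "")]
    if missing:
        failures.append(f"missing/null fields: {sorted(missing)}")
    # 3. Length constraint on the summary.
    if len(str(data.get("summary", "")).split()) > MAX_SUMMARY_WORDS:
        failures.append("summary exceeds word limit")
    # 4. Forbidden content: PII patterns anywhere in the raw output.
    if any(p.search(raw_output) for p in PII_PATTERNS):
        failures.append("possible PII in output")
    return failures
```

Each check is binary and deterministic, which is exactly what makes this layer fast enough to run on every change.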
Layer 2: Behavioral evals
Behavioral evals test whether the model follows specific behavioral requirements. These are more nuanced than unit evals but still automated.
def test_severity_assignment():
    """Model should assign severity based on user impact, not tone."""
    # Calm report about data loss
    result1 = evaluate(
        input="The export function silently drops rows when the dataset "
              "exceeds 10,000 entries. We noticed some customers lost data.",
        expected_severity="critical",
    )
    # Excited report about cosmetic issue
    result2 = evaluate(
        input="THE BUTTON COLOR IS WRONG!!! THIS IS UNACCEPTABLE!!! "
              "THE SHADE OF BLUE IS OFF BY LIKE 2 PIXELS!!!!",
        expected_severity="cosmetic",
    )
    assert result1.severity == "critical", "Data loss should be critical"
    assert result2.severity == "cosmetic", "Color issue should be cosmetic"
Behavioral evals test the contract, not the output. They verify that the model follows the rules you’ve specified. Does it assign severity based on impact, not tone? Does it preserve all facts from the source? Does it refuse to answer questions outside its scope?
I typically have 100-300 behavioral eval cases per feature, organized into categories:
| Category | What it tests | Typical count |
|---|---|---|
| Happy path | Normal inputs, expected behavior | 30-50 |
| Edge cases | Boundary conditions, unusual inputs | 20-40 |
| Adversarial | Prompt injection, manipulation attempts | 15-25 |
| Regression | Cases that failed in previous versions | Growing over time |
| Domain-specific | Industry/domain-specific requirements | 20-50 |
Behavioral evals catch about 40% of production issues. They require model calls, so they cost more than unit evals (roughly $2-10 per run, depending on the number of cases and the model). Run them on every prompt change and every model update.
Layer 3: LLM-as-judge evals
For qualities that are hard to test programmatically (coherence, helpfulness, tone, completeness), use a separate LLM as an evaluator. This is the “LLM-as-judge” pattern, and it’s become the backbone of evaluation for open-ended tasks.
The basic approach: give a judge model the input, the output, and a rubric. Ask it to score the output on specific dimensions.
judge_prompt = """
You are evaluating the quality of a bug report summary.

## Input (original bug report)
{input}

## Output (structured summary)
{output}

## Rubric
Score each dimension from 1 to 5:

1. COMPLETENESS: Does the summary capture all key information from
   the original report? (1=missing major details, 5=comprehensive)
2. ACCURACY: Are all facts in the summary faithful to the original?
   (1=contains fabricated details, 5=perfectly accurate)
3. ACTIONABILITY: Could an engineer start working on this bug from
   the summary alone? (1=no, needs original, 5=fully actionable)
4. SEVERITY_CORRECTNESS: Is the assigned severity appropriate?
   (1=completely wrong, 5=exactly right)

Return JSON: {{"completeness": N, "accuracy": N, "actionability": N,
"severity_correctness": N, "reasoning": "..."}}
"""
# Note: the literal JSON braces are doubled ({{ }}) so that
# judge_prompt.format(input=..., output=...) doesn't mistake
# them for placeholders.
The critical detail: the rubric. Without a specific rubric, the judge model defaults to general “is this good?” evaluation, which is inconsistent and unhelpful. A specific rubric with defined dimensions and scoring criteria produces consistent, actionable evaluations.
Known biases of LLM judges:
| Bias | What it means | Mitigation |
|---|---|---|
| Verbosity preference | LLMs favor longer, more detailed responses | Include “conciseness” as a rubric dimension |
| Self-preference | Models rate their own outputs higher | Use a different model family as judge |
| Position bias | The first option in a comparison is favored | Randomize option order |
| Sycophancy | Judges avoid harsh scores | Explicitly instruct: “Score 1 or 2 when the output clearly fails” |
| Preference leakage | Judge favors outputs similar to its training data | Use diverse judge models, cross-check with human eval |
Research from 2025 showed that GPT-4-Turbo has an error rate of up to 46% for challenging reasoning and math problems when acting as a judge. LLM judges are not reliable for hard problems. They’re useful for soft quality dimensions (coherence, tone, helpfulness) where the judge doesn’t need to reason deeply.
Cost management: LLM-as-judge evals are the most expensive automated layer. Each eval case requires a full model call for the judge. With 200 eval cases and a capable judge model, a single run might cost $10-30. Strategies to manage this:
- Run the full judge eval suite on major changes (prompt restructuring, model upgrades). Run a sampled subset (20-30 cases) on minor changes.
- Use a cheaper model for obvious-quality dimensions (format, length) and a capable model for hard dimensions (accuracy, completeness).
- Cache judge results for unchanged inputs.
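The caching point is the easiest to implement. A minimal sketch, assuming a hypothetical `call_judge` wrapper around your judge model:

```python
import hashlib
import json

_judge_cache: dict[str, dict] = {}  # persist to disk or a DB in practice

def judge_with_cache(case_input: str, output: str, rubric: str,
                     call_judge) -> dict:
    """Skip the judge call when (input, output, rubric) is unchanged.

    `call_judge` is a hypothetical function that sends the filled-in
    judge prompt to your judge model and returns a parsed score dict.
    """
    key = hashlib.sha256(
        json.dumps([case_input, output, rubric]).encode()
    ).hexdigest()
    if key not in _judge_cache:
        _judge_cache[key] = call_judge(case_input, output, rubric)
    return _judge_cache[key]
```

On a prompt change, only the cases whose outputs actually changed hit the judge; everything else is a cache lookup.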
Layer 4: Human evaluation
Human evaluation is the gold standard. It’s also the most expensive and least scalable. Use it strategically.
When human eval is essential:
- Validating the LLM-as-judge rubric. Before trusting your automated judges, have humans evaluate the same cases and check whether the judge scores align with human scores. If they diverge, fix the rubric, not the humans.
- New feature launches. Before the first production deployment, have domain experts evaluate a representative sample.
- Edge cases the automated suite can’t handle. Cultural nuance, domain-specific correctness, subtle factual errors that require specialist knowledge.
- Calibration checks. Periodically (monthly or quarterly), have humans evaluate a random sample of production outputs to verify that automated evals are still aligned with actual quality.
Human eval best practices:
Use multiple evaluators per case (minimum 3 for important dimensions). Measure inter-rater reliability. If evaluators disagree consistently, your rubric is ambiguous.
Match evaluator expertise to the domain. A generalist evaluator can assess coherence and formatting. A domain expert is needed to assess factual accuracy in specialized domains (medical, legal, financial).
Provide visual aids. Show evaluators the input, the output, and the relevant source data side by side. Highlighting key information reduces cognitive load and improves accuracy.
Use structured rubrics, not vibes. “Rate quality from 1-5” is useless. “Rate factual accuracy from 1-5 where 1 means multiple fabricated facts and 5 means all facts match the source” is actionable.
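For the inter-rater reliability check, one common measure is Cohen's kappa; a minimal two-rater version needs no dependencies:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters scoring the same cases.

    1.0 = perfect agreement; 0.0 = agreement at chance level.
    Low kappa on an important dimension usually means the rubric
    is ambiguous, not that the raters are careless.
    """
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(rater_a) | set(rater_b)
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

With three or more raters you'd reach for Fleiss' kappa or Krippendorff's alpha, but the two-rater version is often enough to flag an ambiguous rubric.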
The eval tools landscape
The evaluation tooling landscape in 2026 has matured significantly. Here’s my honest assessment of the major tools.
Braintrust
Braintrust is the tool I’d recommend for most teams. It provides an integrated platform for prompt management, evaluation, and monitoring. The key feature: you can version prompts, test them against real data, and deploy them from one platform. Their GitHub Action runs experiments on PRs and posts score comparisons automatically, which makes eval-driven development practical for teams that use standard CI/CD.
Braintrust’s AI co-pilot (Loop) lets non-technical team members iterate on prompts, which sounds gimmicky but solves a real problem: the person who best understands the domain (a product manager, a subject matter expert) often isn’t the person writing the prompts.
Best for: Teams that want an all-in-one platform. Teams with CI/CD workflows. Teams where prompt iteration involves non-engineers.
promptfoo
promptfoo is open-source, CLI-first, and lives in your repo. Prompts and test cases are defined in YAML configuration files. It runs locally or in CI, compares results across models, and generates reports.
The standout feature is red teaming. promptfoo can probe your prompts for vulnerabilities: prompt injection, PII exposure, jailbreak risks. It’s the only evaluation tool I’ve used that’s purpose-built for security testing alongside performance evaluation.
# promptfoo config example
prompts:
  - file://prompts/bug-structurer-v2.md
  - file://prompts/bug-structurer-v3.md
providers:
  - anthropic:messages:claude-sonnet-4-5-20250929
  - openai:chat:gpt-4o
tests:
  - vars:
      input: "Login button is broken on mobile Safari"
    assert:
      - type: json-schema
        value: file://schemas/bug-report.json
      - type: llm-rubric
        value: "Severity should be 'major' since it affects a core flow"
Best for: Engineering teams that prefer code-over-UI. Open-source/self-hosted requirements. Security-conscious teams that need red teaming.
Langfuse
Langfuse focuses on observability plus evaluation. It traces every LLM call in production, captures inputs and outputs, and lets you run evaluations on production data. This closes the loop between production monitoring and offline evaluation: when you see a failure in production, you can add it to your eval suite directly.
Best for: Teams that prioritize production observability. Teams that want to build eval suites from real production data rather than synthetic cases.
LangSmith
LangSmith is LangChain’s evaluation and tracing platform. If you’re already in the LangChain ecosystem, it integrates tightly. Tracing, datasets, evaluation, and prompt management in one tool.
Best for: Teams using LangChain/LangGraph. Teams that want tight framework integration.
Deepchecks
Deepchecks focuses on continuous monitoring and automated alerts. It’s designed for teams that need to know immediately when quality degrades in production.
Best for: Teams with high-volume production workloads. Teams that need real-time quality monitoring.
Roll your own?
For simple use cases, you can build evaluation infrastructure yourself. A test harness that runs inputs through your pipeline, validates outputs, and computes scores isn’t complex software. The build-vs-buy decision depends on how many features you need: if you just need basic eval automation, a few hundred lines of Python will do. If you need tracing, versioning, collaboration, CI/CD integration, and production monitoring, use a platform.
My recommendation: start with promptfoo (free, open-source, lives in your repo) for evaluation, and add Langfuse or Braintrust when you need production monitoring and collaboration features.
Eval-driven development
The most impactful change to my development process in the past year has been eval-driven development (EDD). The concept is simple: write the evals before writing the prompt. Then iterate on the prompt until the evals pass. Ship. Monitor production using the same metrics.
This is test-driven development for AI features. And like TDD, it works not because the tests are magical, but because it forces you to define success before you start building.
The EDD workflow:
1. Define success criteria
↓
2. Write eval cases that test the criteria
↓
3. Run evals against current baseline (establish benchmark)
↓
4. Iterate on prompt/pipeline
↓
5. Run evals after each change
↓
6. Ship when evals meet threshold
↓
7. Monitor production using same metrics
↓
8. Add production failures to eval suite
↓
9. [Loop back to step 4]
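The "ship when evals meet threshold" step works best as an executable gate that CI runs on every change. A minimal sketch — the dimensions and threshold values here are illustrative:

```python
import statistics
import sys

# Per-dimension minimum thresholds (illustrative values; set your own).
THRESHOLDS = {"accuracy": 4.0, "completeness": 3.5, "actionability": 3.5}

def gate(results: list[dict]) -> bool:
    """Ship only when every dimension clears its threshold.

    `results` holds one score dict per eval case, e.g.
    {"accuracy": 4, "completeness": 5, "actionability": 4}.
    This version gates on per-dimension means; a stricter variant
    gates on the per-dimension minimum instead.
    """
    ok = True
    for dim, floor in THRESHOLDS.items():
        mean = statistics.mean(r[dim] for r in results)
        print(f"{dim}: mean={mean:.2f} threshold={floor} "
              f"{'PASS' if mean >= floor else 'FAIL'}")
        ok = ok and mean >= floor
    return ok

if __name__ == "__main__":
    # In CI, load scores from your eval harness instead of this stub.
    demo = [{"accuracy": 4, "completeness": 4, "actionability": 4},
            {"accuracy": 5, "completeness": 3, "actionability": 4}]
    sys.exit(0 if gate(demo) else 1)
```

A nonzero exit code blocks the merge, which is what turns the workflow from a diagram into a discipline.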
The key insight: the metrics you define pre-launch automatically become your production monitoring metrics. There’s no gap between how you test and how you measure production performance. The eval suite and the monitoring system measure the same things.
Starting your eval suite
Anthropic’s engineering team published practical guidance: start with 20-50 simple tasks drawn from real failures. That’s it. Don’t try to build a comprehensive eval suite before you ship. Build a minimal suite, ship, learn from production, and expand.
Where to get initial eval cases:
| Source | What you get | Example |
|---|---|---|
| Product requirements | Happy-path cases | “Summarize a standard bug report” |
| Edge case brainstorming | Boundary cases | “Handle empty input, 5000-char input, Unicode” |
| Similar product failures | Adversarial cases | “Common prompt injection patterns” |
| Customer support logs | Real-world difficulty | “Actual messy inputs from users” |
| Team red-teaming sessions | Creative adversarial cases | “Try to break it in 30 minutes” |
After launch, your best eval cases come from production. Every failure you catch in production should become an eval case. This is the “regression” category, and it grows continuously. After a few months, the regression category often becomes the most valuable part of your eval suite.
The eval flywheel
Good evaluation creates a flywheel:
Better evals → catch more issues before production → fewer production failures → higher user trust → more usage → more production data → better evals
The teams that spin this flywheel fastest are the ones that treat eval cases as first-class artifacts. They’re version-controlled, reviewed in PRs, and maintained with the same care as the code they test.
Evaluating specific capabilities
Different types of LLM applications need different evaluation strategies. Here’s how I approach the most common ones.
Evaluating retrieval-augmented generation (RAG)
RAG evaluation has two components: retrieval quality and generation quality. You need to evaluate both.
| Dimension | What it measures | How to test |
|---|---|---|
| Retrieval precision | Are the retrieved chunks relevant? | Judge relevance of top-K results |
| Retrieval recall | Does retrieval find all relevant information? | Known-answer test: does the right chunk appear? |
| Groundedness | Is the output supported by retrieved chunks? | LLM-as-judge: compare output claims to chunk content |
| Completeness | Does the output use all relevant retrieved info? | Check coverage: which chunks were referenced? |
| Faithfulness | Does the output avoid stating things not in chunks? | Hallucination detection: claims not traceable to source |
The most common RAG failure mode is the “answer from training” problem: the model ignores retrieved chunks and answers from its parametric knowledge. This produces confident, fluent answers that might be wrong because they’re not grounded in your data. Testing for this requires cases where your data contradicts the model’s training data. If the model follows your data, the retrieval is working. If it follows its training, it’s not.
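The retrieval-recall row in the table above is the easiest to automate. A sketch of a known-answer test, where `retrieve` is a hypothetical wrapper around your retriever that returns ranked chunk IDs:

```python
def recall_at_k(retrieved_ids: list[str], gold_ids: set[str],
                k: int) -> float:
    """Known-answer test: fraction of gold chunks in the top-k results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & gold_ids) / len(gold_ids)

# Each case pairs a query with the chunk ID(s) known to contain
# the answer (chunk IDs here are illustrative).
known_answer_cases = [
    {"query": "What is the export row limit?", "gold": {"docs/limits#3"}},
]

def run_retrieval_evals(retrieve, k: int = 5) -> float:
    """Average recall@k across all known-answer cases."""
    scores = [
        recall_at_k(retrieve(case["query"]), case["gold"], k)
        for case in known_answer_cases
    ]
    return sum(scores) / len(scores)
```

Because this tests retrieval in isolation, a low score tells you the problem is in chunking, embedding, or ranking, before the generator ever gets involved.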
Evaluating conversational agents
Conversational evaluation is hard because quality depends on the full conversation, not individual turns.
| Dimension | Scope | What it measures |
|---|---|---|
| Turn-level relevance | Single turn | Does this response address the user’s message? |
| Task completion | Full conversation | Did the user accomplish their goal? |
| Coherence | Multi-turn | Does the agent maintain consistent context? |
| Recovery | Error scenarios | Does the agent handle misunderstandings gracefully? |
| Boundary adherence | Full conversation | Does the agent stay within its defined scope? |
I evaluate conversations at two levels. Turn-level evaluation checks individual responses. Conversation-level evaluation checks the full trajectory. A conversation can have excellent individual turns and still fail because the agent lost context mid-way through.
Evaluating code generation
Code generation is one of the easier LLM tasks to evaluate because code has clear correctness criteria: it either runs or it doesn’t, it either passes tests or it doesn’t.
| Dimension | How to test |
|---|---|
| Syntactic correctness | Parse the output, check for syntax errors |
| Functional correctness | Run against test suite |
| Style compliance | Linting (ESLint, pylint, etc.) |
| Security | Static analysis (Semgrep, Bandit) |
| Performance | Benchmark against baseline |
| Idiomaticity | LLM-as-judge with style rubric |
The temptation is to rely on functional correctness alone (“it passes the tests”). This misses code that’s correct but fragile, insecure, or unreadable. Use all the dimensions.
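The first two dimensions can be checked with the standard library alone. A sketch — real harnesses should sandbox execution with a subprocess, container, and timeout; the bare `exec` here is for illustration only:

```python
import ast

def eval_generated_code(code: str, tests: str) -> dict:
    """Check syntactic and functional correctness of generated code.

    `tests` is a string of assert statements exercising the generated
    code. WARNING: exec() runs untrusted code in-process; sandbox this
    in any real harness.
    """
    # Dimension 1: syntactic correctness.
    try:
        ast.parse(code)
    except SyntaxError as e:
        return {"syntax_ok": False, "tests_pass": False, "error": str(e)}
    # Dimension 2: functional correctness.
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the generated functions
        exec(tests, namespace)  # run the test assertions against them
    except Exception as e:
        return {"syntax_ok": True, "tests_pass": False, "error": str(e)}
    return {"syntax_ok": True, "tests_pass": True, "error": None}
```

Style, security, and performance still need their own tools (linters, static analysis, benchmarks), which is the point of the table: no single check covers code quality.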
Common eval mistakes
I’ve made all of these. Learn from my failures.
Mistake 1: Testing average case, ignoring distribution
Your eval suite has 100 cases. Average score: 4.2 out of 5. Looks great. Ship it.
Problem: 15 of those 100 cases score below 2.0. The average hides a long tail of bad outputs that real users will encounter.
Fix: Look at the distribution, not the average. Report p50, p90, and minimum scores. Set thresholds on the minimum, not the average. A model with a 4.0 average and a 3.0 minimum is better than one with a 4.2 average and a 1.0 minimum.
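A minimal sketch of a distribution-aware report, so the long tail can't hide behind the mean:

```python
import statistics

def score_report(scores: list[float]) -> dict:
    """Report the distribution, not just the mean."""
    ordered = sorted(scores)
    n = len(ordered)
    return {
        "mean": statistics.mean(ordered),
        "p50": ordered[n // 2],
        "p90": ordered[min(n - 1, int(n * 0.9))],
        "min": ordered[0],
    }
```

Gate on `report["min"]` (or a low percentile), not `report["mean"]`: the minimum is what your unluckiest user sees.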
Mistake 2: Eval suite doesn’t match production distribution
Your eval suite is 50% happy-path cases, 30% edge cases, and 20% adversarial cases. In production, 80% of inputs are happy-path, 15% are mild variations, and 5% are edge cases.
This mismatch means your eval suite over-weights edge cases relative to production. A prompt change that improves edge case handling at the cost of happy-path quality will look good in evals and terrible in production.
Fix: Weight eval results by production frequency. Or better: maintain separate scores for each category and set thresholds per category.
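A sketch of the re-weighting approach — the traffic mix below is illustrative; measure yours from production logs:

```python
# Production traffic mix (illustrative; derive yours from logs).
PRODUCTION_WEIGHTS = {"happy_path": 0.80, "variation": 0.15, "edge": 0.05}

def weighted_score(category_scores: dict[str, float]) -> float:
    """Re-weight per-category eval scores by production frequency."""
    return sum(
        PRODUCTION_WEIGHTS[cat] * score
        for cat, score in category_scores.items()
    )

# Strong edge-case handling can't rescue a weak happy path:
print(round(weighted_score(
    {"happy_path": 3.0, "variation": 4.5, "edge": 5.0}), 3))
```

Keep the per-category scores visible alongside the weighted aggregate, so a regression in one category can't hide inside the blend.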
Mistake 3: Stale eval suite
You built the eval suite at launch. It’s been six months. The product has changed. User behavior has changed. But the eval suite is the same 100 cases from day one.
Fix: Add production failures to the eval suite continuously. Review and prune cases quarterly. A living eval suite is valuable. A stale one is dangerous because it creates false confidence.
Mistake 4: Evaluating the model, not the system
You evaluate the model’s raw output, ignoring the preprocessing, post-processing, and error handling that wrap it in production. The model’s raw output might score 3.5, but the full system (with input validation, output parsing, fallback handling) scores 4.2.
Fix: Evaluate the full system, end to end. Input to final output. This is what the user experiences. Model-level evaluation is useful for debugging, but system-level evaluation is what tells you whether the product works.
Mistake 5: No eval for failure modes
Your eval suite tests what the model should do. It doesn’t test what happens when the model fails. What if the model times out? What if it returns malformed output? What if it hallucinates? What if the user provides adversarial input?
Fix: Include failure-mode test cases in your eval suite. Test the fallback paths. Test the error handling. In production, these paths get hit more often than you’d like.
Building the eval culture
The hardest part of evaluation isn’t the tooling or the framework. It’s the culture. Teams that treat evaluation as a chore produce bad evals. Teams that treat evaluation as the core engineering discipline produce reliable AI features.
Three practices that build an eval culture:
Eval cases in PR reviews. When someone changes a prompt, the PR must include new or updated eval cases that cover the change. Reviewing a prompt change without reviewing the corresponding eval change is like reviewing a code change without reviewing the tests.
Production failure postmortems include eval gaps. When a production failure happens, the postmortem doesn’t just ask “what went wrong?” It asks “why didn’t the eval suite catch this?” and the fix includes a new eval case.
Eval metrics on dashboards. The team sees eval scores daily. Not buried in CI logs. On the dashboard, next to the production metrics. This makes quality visible and keeps it top of mind.
Evaluation isn’t the glamorous part of building with LLMs. It’s not the demo. It’s not the architecture. It’s the thing that determines whether your demo becomes a product or becomes a cautionary tale. The teams that get evaluation right are the ones that ship AI features that actually work.