LLM Evaluation Frameworks
How to measure what matters in language model performance.
Here’s something that happened to me last month. I changed three words in a system prompt. The change looked innocent. The evals passed. I shipped it. Within 48 hours, our ticket routing accuracy dropped from 91% to 74%. The eval suite didn’t catch it because the eval suite was testing the wrong thing.
That experience crystallized something I’d been thinking about for a while: evaluation is the hardest part of building with LLMs, and most teams are doing it wrong. Not wrong in a “they could be better” sense. Wrong in a “they’re measuring things that don’t correlate with production quality” sense.
This post is about how to actually evaluate LLM applications. Not benchmark scores. Not academic metrics. The practical engineering of knowing whether your system works, knowing when it breaks, and knowing why.
Why traditional metrics fail
Before we get into what works, let me explain why the obvious approaches don’t.
Benchmark scores are not product metrics
When someone asks “which model is better, Claude or GPT-4?” the answer is “for what?” Benchmark scores (MMLU, HumanEval, HellaSwag, and the dozens of others) measure specific capabilities in controlled conditions. They tell you almost nothing about how the model will perform on your specific task with your specific data.
I’ve seen teams pick a model based on benchmark scores and then be surprised when it performs worse on their use case than a model with lower benchmark scores. The reason is simple: benchmarks test general capabilities. Your product uses specific capabilities in specific ways. The correlation between general benchmark performance and product-specific performance is weak.
Benchmark contamination is rampant
There’s an even more fundamental problem with benchmarks: contamination. Benchmark contamination occurs when test data leaks into training sets, inflating scores without improving actual capability. By 2025, this had become so severe that most static benchmarks were considered unreliable. Models could score well on benchmarks they’d effectively memorized during training, giving a false impression of capability.
LiveBench emerged specifically to address this problem, using regularly updated questions designed to prevent test data from appearing in training sets and relying on verifiable, objective ground-truth answers rather than LLM judges. But even LiveBench is a general benchmark. It tells you about the model’s general capability, not about your specific application.
BLEU, ROUGE, and friends are from a different era
These metrics were designed for machine translation and summarization before LLMs existed. They measure surface-level text similarity: does the output contain the same words as the reference? This is a terrible proxy for quality in most LLM applications.
A model can produce an output that scores 0.0 on BLEU (completely different words from the reference) and be a perfect answer. It can score 0.9 on BLEU (nearly identical words) and be wrong, misleading, or unhelpful.
These metrics survive because they’re cheap to compute and easy to understand. But optimizing for them doesn’t optimize for quality. Use them only for tasks where surface-level similarity genuinely matters (like translation), and even then, treat them as one signal among many.
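To make the failure mode concrete, here's a toy surface-similarity metric (simple unigram overlap standing in for BLEU). The output with nearly identical words gets the fact wrong and scores high; the correct answer phrased differently scores zero:

```python
def unigram_overlap(candidate: str, reference: str) -> float:
    """Fraction of reference words appearing in the candidate —
    a crude stand-in for BLEU-style surface similarity."""
    cand_words = set(candidate.lower().split())
    ref_words = reference.lower().split()
    if not ref_words:
        return 0.0
    return sum(w in cand_words for w in ref_words) / len(ref_words)

reference = "The meeting is on Tuesday at 3pm"

# Wrong answer, nearly identical words: scores ~0.86.
wrong = "The meeting is on Thursday at 3pm"
# Correct answer, entirely different words: scores 0.0.
right = "It's scheduled for Tue afternoon, 15:00"

print(unigram_overlap(wrong, reference))
print(unigram_overlap(right, reference))
```

Optimizing this number pushes you toward parroting the reference, not toward being right.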
Accuracy is a lie in open-ended tasks
For classification tasks, accuracy is meaningful. The model either classifies correctly or it doesn’t. For open-ended tasks (summarization, writing, conversation, analysis), “accuracy” isn’t a coherent concept. What does it mean for a summary to be “accurate”? It could be factually correct but miss the key points. It could capture the key points but be poorly written. It could be well-written and correct but too long.
Open-ended tasks require multidimensional evaluation. A single number can’t capture quality. You need multiple dimensions, each measured separately, each with its own threshold.
The eval framework that actually works
After two years of building LLM applications and getting evaluation wrong multiple times, here’s the framework I’ve converged on. It has four layers, and each one catches a different category of problem.
Layer 1: Unit evals (automated, fast, cheap)
Layer 2: Behavioral evals (automated, medium cost)
Layer 3: LLM-as-judge evals (automated, higher cost)
Layer 4: Human evals (manual, expensive, gold standard)
Layer 1: Unit evals
Unit evals test specific, deterministic properties of the output. They’re like unit tests in traditional software: fast, cheap, and binary.
| What to test | How to test it | Example |
|---|---|---|
| Output format | Schema validation | JSON matches expected schema |
| Required fields | Field presence check | All required fields are non-null |
| Length constraints | Character/token count | Summary is under 200 words |
| Forbidden content | Pattern matching | No PII, no profanity, no competitor names |
| Factual constraints | Lookup verification | Dates match source data, names are spelled correctly |
| Deterministic extractions | Exact match | Extracted email address matches known correct value |
These evals catch about 30% of production issues. They’re not glamorous, but they’re essential. A model that returns malformed JSON or includes PII in a public-facing summary is failing at a basic level that should never reach a user.
Run these on every prompt change. They take seconds and cost nothing (they don’t require model calls). If a unit eval fails, don’t bother running the more expensive evals.
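As a sketch of what Layer 1 looks like in practice, using only the standard library — the field names, limits, and patterns here are illustrative, not a prescription:

```python
import json
import re

REQUIRED_FIELDS = {"title", "severity", "summary"}  # illustrative schema
MAX_SUMMARY_WORDS = 200
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN shape
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email address
]

def unit_eval(raw_output: str) -> list[str]:
    """Return a list of unit-eval failures (empty list means pass)."""
    failures = []
    # 1. Output format: must be valid JSON.
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    # 2. Required fields present and non-null.
    missing = [f for f in REQUIRED_FIELDS if data.get(f) in (None, "")]
    if missing:
        failures.append(f"missing/null fields: {sorted(missing)}")
    # 3. Length constraint on the summary.
    if len(str(data.get("summary", "")).split()) > MAX_SUMMARY_WORDS:
        failures.append("summary exceeds word limit")
    # 4. Forbidden content: PII patterns anywhere in the raw output.
    if any(p.search(raw_output) for p in PII_PATTERNS):
        failures.append("possible PII in output")
    return failures
```

Each check is binary and deterministic, which is exactly what makes this layer fast enough to run on every change.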
Layer 2: Behavioral evals
Behavioral evals test whether the model follows specific behavioral requirements. These are more nuanced than unit evals but still automated.
def test_severity_assignment():
    """Model should assign severity based on user impact, not tone."""
    # Calm report about data loss
    result1 = evaluate(
        input="The export function silently drops rows when the dataset "
              "exceeds 10,000 entries. We noticed some customers lost data.",
        expected_severity="critical",
    )
    # Excited report about cosmetic issue
    result2 = evaluate(
        input="THE BUTTON COLOR IS WRONG!!! THIS IS UNACCEPTABLE!!! "
              "THE SHADE OF BLUE IS OFF BY LIKE 2 PIXELS!!!!",
        expected_severity="cosmetic",
    )
    assert result1.severity == "critical", "Data loss should be critical"
    assert result2.severity == "cosmetic", "Color issue should be cosmetic"
Behavioral evals test the contract, not the output. They verify that the model follows the rules you’ve specified. Does it assign severity based on impact, not tone? Does it preserve all facts from the source? Does it refuse to answer questions outside its scope?
I typically have 100-300 behavioral eval cases per feature, organized into categories:
| Category | What it tests | Typical count |
|---|---|---|
| Happy path | Normal inputs, expected behavior | 30-50 |
| Edge cases | Boundary conditions, unusual inputs | 20-40 |
| Adversarial | Prompt injection, manipulation attempts | 15-25 |
| Regression | Cases that failed in previous versions | Growing over time |
| Domain-specific | Industry/domain-specific requirements | 20-50 |
Behavioral evals catch about 40% of production issues. They require model calls, so they cost more than unit evals (roughly $2-10 per run, depending on the number of cases and the model). Run them on every prompt change and every model update.
Layer 3: LLM-as-judge evals
For qualities that are hard to test programmatically (coherence, helpfulness, tone, completeness), use a separate LLM as an evaluator. This is the “LLM-as-judge” pattern, and it’s become the backbone of evaluation for open-ended tasks.
The basic approach: give a judge model the input, the output, and a rubric. Ask it to score the output on specific dimensions.
judge_prompt = """
You are evaluating the quality of a bug report summary.

## Input (original bug report)
{input}

## Output (structured summary)
{output}

## Rubric
Score each dimension from 1 to 5:

1. COMPLETENESS: Does the summary capture all key information from
   the original report? (1=missing major details, 5=comprehensive)
2. ACCURACY: Are all facts in the summary faithful to the original?
   (1=contains fabricated details, 5=perfectly accurate)
3. ACTIONABILITY: Could an engineer start working on this bug from
   the summary alone? (1=no, needs original, 5=fully actionable)
4. SEVERITY_CORRECTNESS: Is the assigned severity appropriate?
   (1=completely wrong, 5=exactly right)

Return JSON: {{"completeness": N, "accuracy": N, "actionability": N,
"severity_correctness": N, "reasoning": "..."}}
"""
# Note: the literal JSON braces are doubled ({{ }}) so that
# judge_prompt.format(input=..., output=...) doesn't mistake
# them for placeholders.
The critical detail: the rubric. Without a specific rubric, the judge model defaults to general “is this good?” evaluation, which is inconsistent and unhelpful. A specific rubric with defined dimensions and scoring criteria produces consistent, actionable evaluations.
Known biases of LLM judges:
| Bias | What it means | Mitigation |
|---|---|---|
| Verbosity preference | LLMs favor longer, more detailed responses | Include “conciseness” as a rubric dimension |
| Self-preference | Models rate their own outputs higher | Use a different model family as judge |
| Position bias | The first option in a comparison is favored | Randomize option order |
| Sycophancy | Judges avoid harsh scores | Explicitly instruct: “Score 1 or 2 when the output clearly fails” |
| Preference leakage | Judge favors outputs similar to its training data | Use diverse judge models, cross-check with human eval |
Research from 2025 showed that GPT-4-Turbo has an error rate of up to 46% for challenging reasoning and math problems when acting as a judge. LLM judges are not reliable for hard problems. They’re useful for soft quality dimensions (coherence, tone, helpfulness) where the judge doesn’t need to reason deeply.
Cost management: LLM-as-judge evals are the most expensive automated layer. Each eval case requires a full model call for the judge. With 200 eval cases and a capable judge model, a single run might cost $10-30. Strategies to manage this:
- Run the full judge eval suite on major changes (prompt restructuring, model upgrades). Run a sampled subset (20-30 cases) on minor changes.
- Use a cheaper model for obvious-quality dimensions (format, length) and a capable model for hard dimensions (accuracy, completeness).
- Cache judge results for unchanged inputs.
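The caching point is the easiest to implement. A minimal sketch, assuming a hypothetical `call_judge` wrapper around your judge model:

```python
import hashlib
import json

_judge_cache: dict[str, dict] = {}  # persist to disk or a DB in practice

def judge_with_cache(case_input: str, output: str, rubric: str,
                     call_judge) -> dict:
    """Skip the judge call when (input, output, rubric) is unchanged.

    `call_judge` is a hypothetical function that sends the filled-in
    judge prompt to your judge model and returns a parsed score dict.
    """
    key = hashlib.sha256(
        json.dumps([case_input, output, rubric]).encode()
    ).hexdigest()
    if key not in _judge_cache:
        _judge_cache[key] = call_judge(case_input, output, rubric)
    return _judge_cache[key]
```

On a prompt change, only the cases whose outputs actually changed hit the judge; everything else is a cache lookup.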
Layer 4: Human evaluation
Human evaluation is the gold standard. It’s also the most expensive and least scalable. Use it strategically.
When human eval is essential:
- Validating the LLM-as-judge rubric. Before trusting your automated judges, have humans evaluate the same cases and check whether the judge scores align with human scores. If they diverge, fix the rubric, not the humans.
- New feature launches. Before the first production deployment, have domain experts evaluate a representative sample.
- Edge cases the automated suite can’t handle. Cultural nuance, domain-specific correctness, subtle factual errors that require specialist knowledge.
- Calibration checks. Periodically (monthly or quarterly), have humans evaluate a random sample of production outputs to verify that automated evals are still aligned with actual quality.
Human eval best practices:
Use multiple evaluators per case (minimum 3 for important dimensions). Measure inter-rater reliability. If evaluators disagree consistently, your rubric is ambiguous.
Match evaluator expertise to the domain. A generalist evaluator can assess coherence and formatting. A domain expert is needed to assess factual accuracy in specialized domains (medical, legal, financial).
Provide visual aids. Show evaluators the input, the output, and the relevant source data side by side. Highlighting key information reduces cognitive load and improves accuracy.
Use structured rubrics, not vibes. “Rate quality from 1-5” is useless. “Rate factual accuracy from 1-5 where 1 means multiple fabricated facts and 5 means all facts match the source” is actionable.
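For the inter-rater reliability check, one common measure is Cohen's kappa; a minimal two-rater version needs no dependencies:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters scoring the same cases.

    1.0 = perfect agreement; 0.0 = agreement at chance level.
    Low kappa on an important dimension usually means the rubric
    is ambiguous, not that the raters are careless.
    """
    assert rater_a and len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(
        (freq_a[label] / n) * (freq_b[label] / n)
        for label in set(rater_a) | set(rater_b)
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

With three or more raters you'd reach for Fleiss' kappa or Krippendorff's alpha, but the two-rater version is often enough to flag an ambiguous rubric.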
The eval tools landscape
The evaluation tooling landscape in 2026 has matured significantly. Here’s my honest assessment of the major tools.
Braintrust
Braintrust is the tool I’d recommend for most teams. It provides an integrated platform for prompt management, evaluation, and monitoring. The key feature: you can version prompts, test them against real data, and deploy them from one platform. Their GitHub Action runs experiments on PRs and posts score comparisons automatically, which makes eval-driven development practical for teams that use standard CI/CD.
Braintrust’s AI co-pilot (Loop) lets non-technical team members iterate on prompts, which sounds gimmicky but solves a real problem: the person who best understands the domain (a product manager, a subject matter expert) often isn’t the person writing the prompts.
Best for: Teams that want an all-in-one platform. Teams with CI/CD workflows. Teams where prompt iteration involves non-engineers.
promptfoo
promptfoo is open-source, CLI-first, and lives in your repo. Prompts and test cases are defined in YAML configuration files. It runs locally or in CI, compares results across models, and generates reports.
The standout feature is red teaming. promptfoo can probe your prompts for vulnerabilities: prompt injection, PII exposure, jailbreak risks. It’s the only evaluation tool I’ve used that’s purpose-built for security testing alongside performance evaluation.
# promptfoo config example
prompts:
  - file://prompts/bug-structurer-v2.md
  - file://prompts/bug-structurer-v3.md
providers:
  - anthropic:messages:claude-sonnet-4-5-20250929
  - openai:chat:gpt-4o
tests:
  - vars:
      input: "Login button is broken on mobile Safari"
    assert:
      - type: json-schema
        value: file://schemas/bug-report.json
      - type: llm-rubric
        value: "Severity should be 'major' since it affects a core flow"
Best for: Engineering teams that prefer code-over-UI. Open-source/self-hosted requirements. Security-conscious teams that need red teaming.
Langfuse
Langfuse focuses on observability plus evaluation. It traces every LLM call in production, captures inputs and outputs, and lets you run evaluations on production data. This closes the loop between production monitoring and offline evaluation: when you see a failure in production, you can add it to your eval suite directly.
Best for: Teams that prioritize production observability. Teams that want to build eval suites from real production data rather than synthetic cases.
LangSmith
LangSmith is LangChain’s evaluation and tracing platform. If you’re already in the LangChain ecosystem, it integrates tightly. Tracing, datasets, evaluation, and prompt management in one tool.
Best for: Teams using LangChain/LangGraph. Teams that want tight framework integration.
Deepchecks
Deepchecks focuses on continuous monitoring and automated alerts. It’s designed for teams that need to know immediately when quality degrades in production.
Best for: Teams with high-volume production workloads. Teams that need real-time quality monitoring.
Roll your own?
For simple use cases, you can build evaluation infrastructure yourself. A test harness that runs inputs through your pipeline, validates outputs, and computes scores isn’t complex software. The build-vs-buy decision depends on how many features you need: if you just need basic eval automation, a few hundred lines of Python will do. If you need tracing, versioning, collaboration, CI/CD integration, and production monitoring, use a platform.
My recommendation: start with promptfoo (free, open-source, lives in your repo) for evaluation, and add Langfuse or Braintrust when you need production monitoring and collaboration features.
Eval-driven development
The most impactful change to my development process in the past year has been eval-driven development (EDD). The concept is simple: write the evals before writing the prompt. Then iterate on the prompt until the evals pass. Ship. Monitor production using the same metrics.
This is test-driven development for AI features. And like TDD, it works not because the tests are magical, but because it forces you to define success before you start building.
The EDD workflow:
1. Define success criteria
↓
2. Write eval cases that test the criteria
↓
3. Run evals against current baseline (establish benchmark)
↓
4. Iterate on prompt/pipeline
↓
5. Run evals after each change
↓
6. Ship when evals meet threshold
↓
7. Monitor production using same metrics
↓
8. Add production failures to eval suite
↓
9. [Loop back to step 4]
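The "ship when evals meet threshold" step works best as an executable gate that CI runs on every change. A minimal sketch — the dimensions and threshold values here are illustrative:

```python
import statistics
import sys

# Per-dimension minimum thresholds (illustrative values; set your own).
THRESHOLDS = {"accuracy": 4.0, "completeness": 3.5, "actionability": 3.5}

def gate(results: list[dict]) -> bool:
    """Ship only when every dimension clears its threshold.

    `results` holds one score dict per eval case, e.g.
    {"accuracy": 4, "completeness": 5, "actionability": 4}.
    This version gates on per-dimension means; a stricter variant
    gates on the per-dimension minimum instead.
    """
    ok = True
    for dim, floor in THRESHOLDS.items():
        mean = statistics.mean(r[dim] for r in results)
        print(f"{dim}: mean={mean:.2f} threshold={floor} "
              f"{'PASS' if mean >= floor else 'FAIL'}")
        ok = ok and mean >= floor
    return ok

if __name__ == "__main__":
    # In CI, load scores from your eval harness instead of this stub.
    demo = [{"accuracy": 4, "completeness": 4, "actionability": 4},
            {"accuracy": 5, "completeness": 3, "actionability": 4}]
    sys.exit(0 if gate(demo) else 1)
```

A nonzero exit code blocks the merge, which is what turns the workflow from a diagram into a discipline.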
The key insight: the metrics you define pre-launch automatically become your production monitoring metrics. There’s no gap between how you test and how you measure production performance. The eval suite and the monitoring system measure the same things.
Starting your eval suite
Anthropic’s engineering team published practical guidance: start with 20-50 simple tasks drawn from real failures. That’s it. Don’t try to build a comprehensive eval suite before you ship. Build a minimal suite, ship, learn from production, and expand.
Where to get initial eval cases:
| Source | What you get | Example |
|---|---|---|
| Product requirements | Happy-path cases | “Summarize a standard bug report” |
| Edge case brainstorming | Boundary cases | “Handle empty input, 5000-char input, Unicode” |
| Similar product failures | Adversarial cases | “Common prompt injection patterns” |
| Customer support logs | Real-world difficulty | “Actual messy inputs from users” |
| Team red-teaming sessions | Creative adversarial cases | “Try to break it in 30 minutes” |
After launch, your best eval cases come from production. Every failure you catch in production should become an eval case. This is the “regression” category, and it grows continuously. After a few months, the regression category often becomes the most valuable part of your eval suite.
The eval flywheel
Good evaluation creates a flywheel:
Better evals → catch more issues before production → fewer production failures → higher user trust → more usage → more production data → better evals
The teams that spin this flywheel fastest are the ones that treat eval cases as first-class artifacts. They’re version-controlled, reviewed in PRs, and maintained with the same care as the code they test.
Evaluating specific capabilities
Different types of LLM applications need different evaluation strategies. Here’s how I approach the most common ones.
Evaluating retrieval-augmented generation (RAG)
RAG evaluation has two components: retrieval quality and generation quality. You need to evaluate both.
| Dimension | What it measures | How to test |
|---|---|---|
| Retrieval precision | Are the retrieved chunks relevant? | Judge relevance of top-K results |
| Retrieval recall | Does retrieval find all relevant information? | Known-answer test: does the right chunk appear? |
| Groundedness | Is the output supported by retrieved chunks? | LLM-as-judge: compare output claims to chunk content |
| Completeness | Does the output use all relevant retrieved info? | Check coverage: which chunks were referenced? |
| Faithfulness | Does the output avoid stating things not in chunks? | Hallucination detection: claims not traceable to source |
The most common RAG failure mode is the “answer from training” problem: the model ignores retrieved chunks and answers from its parametric knowledge. This produces confident, fluent answers that might be wrong because they’re not grounded in your data. Testing for this requires cases where your data contradicts the model’s training data. If the model follows your data, the retrieval is working. If it follows its training, it’s not.
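The retrieval-recall row in the table above is the easiest to automate. A sketch of a known-answer test, where `retrieve` is a hypothetical wrapper around your retriever that returns ranked chunk IDs:

```python
def recall_at_k(retrieved_ids: list[str], gold_ids: set[str],
                k: int) -> float:
    """Known-answer test: fraction of gold chunks in the top-k results."""
    top_k = set(retrieved_ids[:k])
    return len(top_k & gold_ids) / len(gold_ids)

# Each case pairs a query with the chunk ID(s) known to contain
# the answer (chunk IDs here are illustrative).
known_answer_cases = [
    {"query": "What is the export row limit?", "gold": {"docs/limits#3"}},
]

def run_retrieval_evals(retrieve, k: int = 5) -> float:
    """Average recall@k across all known-answer cases."""
    scores = [
        recall_at_k(retrieve(case["query"]), case["gold"], k)
        for case in known_answer_cases
    ]
    return sum(scores) / len(scores)
```

Because this tests retrieval in isolation, a low score tells you the problem is in chunking, embedding, or ranking, before the generator ever gets involved.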
Evaluating conversational agents
Conversational evaluation is hard because quality depends on the full conversation, not individual turns.
| Dimension | Scope | What it measures |
|---|---|---|
| Turn-level relevance | Single turn | Does this response address the user’s message? |
| Task completion | Full conversation | Did the user accomplish their goal? |
| Coherence | Multi-turn | Does the agent maintain consistent context? |
| Recovery | Error scenarios | Does the agent handle misunderstandings gracefully? |
| Boundary adherence | Full conversation | Does the agent stay within its defined scope? |
I evaluate conversations at two levels. Turn-level evaluation checks individual responses. Conversation-level evaluation checks the full trajectory. A conversation can have excellent individual turns and still fail because the agent lost context mid-way through.
Evaluating code generation
Code generation is one of the easier LLM tasks to evaluate because code has clear correctness criteria: it either runs or it doesn’t, it either passes tests or it doesn’t.
| Dimension | How to test |
|---|---|
| Syntactic correctness | Parse the output, check for syntax errors |
| Functional correctness | Run against test suite |
| Style compliance | Linting (ESLint, pylint, etc.) |
| Security | Static analysis (Semgrep, Bandit) |
| Performance | Benchmark against baseline |
| Idiomaticity | LLM-as-judge with style rubric |
The temptation is to rely on functional correctness alone (“it passes the tests”). This misses code that’s correct but fragile, insecure, or unreadable. Use all the dimensions.
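The first two dimensions can be checked with the standard library alone. A sketch — real harnesses should sandbox execution with a subprocess, container, and timeout; the bare `exec` here is for illustration only:

```python
import ast

def eval_generated_code(code: str, tests: str) -> dict:
    """Check syntactic and functional correctness of generated code.

    `tests` is a string of assert statements exercising the generated
    code. WARNING: exec() runs untrusted code in-process; sandbox this
    in any real harness.
    """
    # Dimension 1: syntactic correctness.
    try:
        ast.parse(code)
    except SyntaxError as e:
        return {"syntax_ok": False, "tests_pass": False, "error": str(e)}
    # Dimension 2: functional correctness.
    namespace: dict = {}
    try:
        exec(code, namespace)   # define the generated functions
        exec(tests, namespace)  # run the test assertions against them
    except Exception as e:
        return {"syntax_ok": True, "tests_pass": False, "error": str(e)}
    return {"syntax_ok": True, "tests_pass": True, "error": None}
```

Style, security, and performance still need their own tools (linters, static analysis, benchmarks), which is the point of the table: no single check covers code quality.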
Common eval mistakes
I’ve made all of these. Learn from my failures.
Mistake 1: Testing average case, ignoring distribution
Your eval suite has 100 cases. Average score: 4.2 out of 5. Looks great. Ship it.
Problem: 15 of those 100 cases score below 2.0. The average hides a long tail of bad outputs that real users will encounter.
Fix: Look at the distribution, not the average. Report p50, p90, and minimum scores. Set thresholds on the minimum, not the average. A model with a 4.0 average and a 3.0 minimum is better than one with a 4.2 average and a 1.0 minimum.
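A minimal sketch of a distribution-aware report, so the long tail can't hide behind the mean:

```python
import statistics

def score_report(scores: list[float]) -> dict:
    """Report the distribution, not just the mean."""
    ordered = sorted(scores)
    n = len(ordered)
    return {
        "mean": statistics.mean(ordered),
        "p50": ordered[n // 2],
        "p90": ordered[min(n - 1, int(n * 0.9))],
        "min": ordered[0],
    }
```

Gate on `report["min"]` (or a low percentile), not `report["mean"]`: the minimum is what your unluckiest user sees.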
Mistake 2: Eval suite doesn’t match production distribution
Your eval suite is 50% happy-path cases, 30% edge cases, and 20% adversarial cases. In production, 80% of inputs are happy-path, 15% are mild variations, and 5% are edge cases.
This mismatch means your eval suite over-weights edge cases relative to production. A prompt change that improves edge case handling at the cost of happy-path quality will look good in evals and terrible in production.
Fix: Weight eval results by production frequency. Or better: maintain separate scores for each category and set thresholds per category.
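A sketch of the re-weighting approach — the traffic mix below is illustrative; measure yours from production logs:

```python
# Production traffic mix (illustrative; derive yours from logs).
PRODUCTION_WEIGHTS = {"happy_path": 0.80, "variation": 0.15, "edge": 0.05}

def weighted_score(category_scores: dict[str, float]) -> float:
    """Re-weight per-category eval scores by production frequency."""
    return sum(
        PRODUCTION_WEIGHTS[cat] * score
        for cat, score in category_scores.items()
    )

# Strong edge-case handling can't rescue a weak happy path:
print(round(weighted_score(
    {"happy_path": 3.0, "variation": 4.5, "edge": 5.0}), 3))
```

Keep the per-category scores visible alongside the weighted aggregate, so a regression in one category can't hide inside the blend.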
Mistake 3: Stale eval suite
You built the eval suite at launch. It’s been six months. The product has changed. User behavior has changed. But the eval suite is the same 100 cases from day one.
Fix: Add production failures to the eval suite continuously. Review and prune cases quarterly. A living eval suite is valuable. A stale one is dangerous because it creates false confidence.
Mistake 4: Evaluating the model, not the system
You evaluate the model’s raw output, ignoring the preprocessing, post-processing, and error handling that wrap it in production. The model’s raw output might score 3.5, but the full system (with input validation, output parsing, fallback handling) scores 4.2.
Fix: Evaluate the full system, end to end. Input to final output. This is what the user experiences. Model-level evaluation is useful for debugging, but system-level evaluation is what tells you whether the product works.
Mistake 5: No eval for failure modes
Your eval suite tests what the model should do. It doesn’t test what happens when the model fails. What if the model times out? What if it returns malformed output? What if it hallucinates? What if the user provides adversarial input?
Fix: Include failure-mode test cases in your eval suite. Test the fallback paths. Test the error handling. In production, these paths get hit more often than you’d like.
Building the eval culture
The hardest part of evaluation isn’t the tooling or the framework. It’s the culture. Teams that treat evaluation as a chore produce bad evals. Teams that treat evaluation as the core engineering discipline produce reliable AI features.
Three practices that build an eval culture:
Eval cases in PR reviews. When someone changes a prompt, the PR must include new or updated eval cases that cover the change. Reviewing a prompt change without reviewing the corresponding eval change is like reviewing a code change without reviewing the tests.
Production failure postmortems include eval gaps. When a production failure happens, the postmortem doesn’t just ask “what went wrong?” It asks “why didn’t the eval suite catch this?” and the fix includes a new eval case.
Eval metrics on dashboards. The team sees eval scores daily. Not buried in CI logs. On the dashboard, next to the production metrics. This makes quality visible and keeps it top of mind.
Evaluation isn’t the glamorous part of building with LLMs. It’s not the demo. It’s not the architecture. It’s the thing that determines whether your demo becomes a product or becomes a cautionary tale. The teams that get evaluation right are the ones that ship AI features that actually work.