
The Evaluation Crisis: Why Nobody Actually Knows If Their LLM Is Getting Better

  • 4 days ago
  • 11 min read

You upgraded the model, tweaked the prompt, and ran your benchmark suite. The numbers improved. Then you shipped it and users complained. Here is why that keeps happening.



There is a quiet crisis running through every US tech team building on top of LLMs right now. It is not a model quality crisis. It is not a latency crisis. It is an evaluation crisis, and it is arguably more dangerous than either of those because it is invisible until it is too late.


The pattern is now so common it has become a dark joke in the community: team runs evals, evals look great, team ships, users are unhappy, nobody knows why, team debates whether to roll back. Repeat every six weeks when a new model version drops.


The core problem is that most teams are measuring the wrong things, with the wrong instruments, using a methodology that was designed for research papers rather than production software. And because everyone else appears to be doing it the same way, it feels normal.


It is not normal. It is a systematic failure of measurement infrastructure, and fixing it is one of the highest-leverage investments any LLM product team can make right now.


84% of LLM teams rely primarily on benchmark suites not derived from their own production traffic

61% report shipping a model change that degraded user satisfaction despite passing internal evals

2.3x higher eval-to-prod alignment for teams using traffic-derived golden sets vs synthetic benchmarks


Sources: Hamel Husain's LLM eval survey 2025, Braintrust eval benchmarking report, and practitioner interviews 2025-2026.


The four broken assumptions in standard LLM evaluation


Before proposing a better system, it is worth being precise about exactly how current evaluation practice fails. There are four assumptions baked into how most teams run evals, and all four are wrong in production contexts.


Broken assumption 1: benchmarks measure what users care about


MMLU, HellaSwag, HumanEval, GSM8K. These benchmarks are academically rigorous, widely cited, and almost entirely disconnected from whether your enterprise knowledge assistant gives useful answers about your product documentation.


The canonical benchmark suites were designed to measure general model capability across broad categories of tasks. They are excellent for comparing base models in a research context. They are poor predictors of quality for a specific product with a specific user population and a specific definition of a good response.

When you use general benchmarks to evaluate a product-specific LLM, you are measuring the model's performance on a proxy task and hoping it correlates with your actual task. Sometimes it does. More often, it correlates loosely in development, and then diverges in production as real user inputs reveal the gaps between the proxy and the reality.


Broken assumption 2: automated metrics capture response quality


BLEU score, ROUGE, BERTScore, exact match, F1 on extracted entities. These are all token-level or embedding-level similarity measures. They tell you whether the model's output is lexically or semantically similar to a reference output. They do not tell you whether the output is actually correct, helpful, appropriately cautious, or aligned with your product's tone.


A model that says "I cannot find information about that in the provided documents" scores terribly on ROUGE against a reference answer that gives the correct information. But in a product that requires grounded, citation-based responses, the refusal might be the correct behavior if the documents genuinely do not support the answer.
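To make the refusal problem concrete, here is a minimal sketch using a hand-rolled unigram-overlap F1 (a simplified stand-in for ROUGE-1, not the official implementation). The reference answer and both candidate responses are hypothetical examples invented for illustration.

```python
from collections import Counter

def unigram_f1(candidate: str, reference: str) -> float:
    """ROUGE-1-style F1 over lowercase whitespace tokens (simplified)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped count of shared tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Hypothetical case: the retrieved documents do not actually contain the answer.
reference = "The export limit is 10,000 rows per request."
refusal = "I cannot find information about that in the provided documents."
confident_wrong = "The export limit is 50,000 rows per request."

refusal_score = unigram_f1(refusal, reference)
wrong_score = unigram_f1(confident_wrong, reference)
print(f"grounded refusal: {refusal_score:.3f}")
print(f"confident wrong answer: {wrong_score:.3f}")
```

The confidently wrong answer shares almost every token with the reference, so it outscores the refusal by a wide margin even though it is the worse outcome for a grounded assistant.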


"Our ROUGE scores went up when we switched models. Our support ticket volume also went up. Turns out the new model was more verbose and confident, which is good for token-level similarity to reference answers and bad for users who got long, confident wrong answers."
ML Engineer, enterprise SaaS (paraphrased from community discussion)

Broken assumption 3: human evaluation is the ground truth


When automated metrics fail, the reflex is to bring in human raters. Human eval is genuinely valuable, but it has a failure mode that teams rarely account for: raters are not your users.


Professional raters (whether in-house, via Mechanical Turk, or through a specialized vendor) optimize for legibility, grammar, and apparent confidence. They rate responses that sound authoritative highly. Your actual users optimize for whether the answer helped them do the thing they were trying to do. These two preferences diverge significantly for technical, domain-specific, or ambiguous queries.


The pattern reported in the research literature is consistent: general rater scores track user satisfaction on general-purpose tasks and diverge from it on specialized domain tasks. When your product is a legal research assistant, a clinical decision support tool, or a code review agent, your evaluators need domain expertise that general raters do not have.


Broken assumption 4: eval sets stay valid over time


This is the most pernicious failure mode and the least discussed. Eval sets become contaminated over time. Not through deliberate cheating, but through a process of unconscious optimization: every time a team looks at failing eval cases and updates the prompt to fix them, the eval set loses a small amount of its signal. Over six months of prompt engineering, a team may have inadvertently tuned their prompt to pass their specific eval set, not to maximize quality on real traffic.


This is eval set overfitting, and it is widespread. The symptom is an eval suite that trends upward over months while qualitative user feedback stays flat or degrades. The cause is that the eval set is no longer an independent measure of quality. It has become a training signal.


An eval set you have been iterating against for six months is no longer an evaluation tool. It is a test set you have been training on. The signal has decayed.
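One way to put a number on that decay is to score the same model on the set you have been tuning against and on a batch of fresh cases nobody has iterated on, then compare. The sketch below assumes per-case scores on a 0-1 scale; the function name and the sample scores are hypothetical.

```python
from statistics import mean

def overfitting_gap(tuned_scores: list[float], fresh_scores: list[float]) -> float:
    """Mean score on the iterated-against set minus mean score on fresh cases.

    A large positive gap suggests the prompt has been tuned to the eval set
    rather than to real traffic.
    """
    return mean(tuned_scores) - mean(fresh_scores)

# Hypothetical per-case scores for the same model and prompt.
tuned = [0.92, 0.88, 0.95, 0.90]   # cases the team has been fixing against
fresh = [0.78, 0.81, 0.74, 0.80]   # newly sampled cases, never used for tuning

gap = overfitting_gap(tuned, fresh)
print(f"overfitting gap: {gap:.3f}")
```

A gap near zero means the set is still an independent measure. What counts as too large is a judgment call for your product; the value of the number is mostly in watching how it moves over time.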


The SCORE framework: a production-grade LLM evaluation system


The fix is not to abandon evals. It is to build an evaluation system with the same engineering rigor you would apply to any other production system: separation of concerns, fresh data, multiple signal sources, and a clear decision model for what the numbers actually mean.


The SCORE framework organizes LLM evaluation into five disciplines that, together, give you a trustworthy signal about whether your model changes are improvements:


Framework

SCORE: Segmented evals, Contamination control, Online signals, Reference diversity, Ensemble judgment


S. Segmented eval suites: Never run one flat eval suite against all query types. Segment your eval set by task category (factual lookup, reasoning, generation, refusal, edge case handling) and by user segment (expert users, novice users, high-stakes queries, casual queries). A regression in reasoning tasks hidden by improvements in generation tasks is a real regression. Flat aggregate scores hide it. Segment scores expose it.


C. Contamination control via holdout rotation: Maintain three eval set pools. The Active pool is what you evaluate against daily. The Holdout pool is sealed between rotations: no one runs routine evals against it, no one sees it, no prompt engineering targets it. The Retired pool holds cases that have rotated out of Active. Every 90 days, rotate: measure against the Holdout, then a fraction of the Holdout becomes the new Active, and new cases from production traffic enter the Holdout. This gives you a periodic independent measurement that cannot be contaminated by optimization cycles.


O. Online signals from production traffic: Offline evals answer "does this pass our test cases?" Online signals answer "are users getting what they need?" Track implicit feedback signals at minimum: session continuation rate after an AI response, follow-up question rate (a proxy for answer completeness), escalation rate to human support, and task completion rate for goal-oriented agents. These signals are noisy but uncontaminated. They do not care what your prompt says.


R. Reference diversity in golden sets: For every eval case in your Active pool, maintain at minimum two reference answers: a high-quality answer and a borderline-acceptable answer. Eval metrics should reward matching the high-quality reference more than the borderline one. This surfaces model outputs that are technically acceptable but below the bar you want to set. Single-reference evals cannot distinguish between these two outcomes.


E. Ensemble judgment for subjective quality: For tasks where quality is genuinely subjective (tone, appropriateness, helpfulness for open-ended queries), use LLM-as-judge with three judges and a majority vote, not a single judge call. Define rubrics with explicit criteria rather than "rate this 1-5." Document the rubric in your eval infrastructure. Rubric drift is as damaging as eval set drift: if the definition of a 4/5 answer changes across evaluators or over time, your trend lines mean nothing.
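The segmentation discipline is the easiest of the five to illustrate. The sketch below assumes eval results arrive as (segment, score) pairs; the function names, segment labels, and scores are hypothetical. It shows an aggregate score that improves while one segment regresses.

```python
from collections import defaultdict
from statistics import mean

def segment_scores(results):
    """results: iterable of (segment, score) pairs -> {segment: mean score}."""
    by_segment = defaultdict(list)
    for segment, score in results:
        by_segment[segment].append(score)
    return {seg: mean(scores) for seg, scores in by_segment.items()}

def regressions(baseline, candidate, tolerance=0.0):
    """Segments where the candidate falls below baseline by more than tolerance."""
    return {
        seg: candidate[seg] - baseline[seg]
        for seg in baseline
        if seg in candidate and baseline[seg] - candidate[seg] > tolerance
    }

# Hypothetical runs: generation improves, reasoning gets worse.
baseline_runs = [("reasoning", 0.80), ("reasoning", 0.80),
                 ("generation", 0.60), ("generation", 0.60)]
candidate_runs = [("reasoning", 0.70), ("reasoning", 0.70),
                  ("generation", 0.80), ("generation", 0.80)]

base = segment_scores(baseline_runs)
cand = segment_scores(candidate_runs)
aggregate_delta = mean(s for _, s in candidate_runs) - mean(s for _, s in baseline_runs)

print("aggregate improved:", aggregate_delta > 0)
print("regressed segments:", regressions(base, cand))
```

The flat average goes up, so a single-number dashboard would report a win. The per-segment view reports the reasoning regression that the average hides.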


Building the eval pyramid: the right layer for each signal type

One of the most useful mental models for organizing your SCORE implementation is the eval pyramid. Different signal types have different costs, latency, and reliability characteristics. The right architecture uses all four layers in parallel rather than relying on any single one:


Layer 4 (slowest, most reliable)

Online user satisfaction signals: session continuation, task completion, explicit thumbs-up/down feedback. High latency (days to weeks), uncontaminated, ground truth for user value.


Layer 3 (days, high signal)

Domain expert human eval on holdout pool: specialized raters evaluating on your actual product rubric. Run on holdout rotation schedule, not on every prompt change.


Layer 2 (minutes to hours, good signal)

LLM-as-judge ensemble on active eval set: 3-judge rubric-graded evaluation on segmented task categories. Run on every significant model or prompt change.


Layer 1 (seconds, fast signal)

Deterministic unit tests: exact match, regex, JSON schema validation, refusal classification. Run on every commit. Catches regressions in structured output, known edge cases, safety violations.


The mistake most teams make is inverting this pyramid: they spend the most time and attention on Layer 1 (fast, cheap, low signal) and skip Layers 3 and 4 entirely. The pyramid should be read bottom-up for speed and top-down for reliability. You need both.
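For a sense of what Layer 1 looks like in practice, here is a minimal sketch of deterministic checks written as plain test functions. The required JSON keys, the refusal phrases, and the sample outputs are all assumptions made for illustration. A real suite would likely validate against a full schema and use a proper refusal classifier.

```python
import json
import re

# Hypothetical refusal phrases; tune these to your product's actual wording.
REFUSAL_PATTERN = re.compile(
    r"\b(cannot find|can't find|no information|not able to)\b", re.IGNORECASE
)

def is_valid_answer_json(raw: str, required_keys=("answer", "citations")) -> bool:
    """True if raw parses as a JSON object containing every required key."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and all(k in payload for k in required_keys)

def is_refusal(text: str) -> bool:
    """Cheap regex-based refusal detection."""
    return bool(REFUSAL_PATTERN.search(text))

def test_structured_output():
    assert is_valid_answer_json('{"answer": "42 rows", "citations": ["doc-7"]}')
    # Chatty preamble around the JSON is a common structured-output regression.
    assert not is_valid_answer_json('Sure! Here is the JSON: {"answer": "42 rows"}')

def test_refusal_on_unsupported_question():
    assert is_refusal("I cannot find information about that in the provided documents.")
    assert not is_refusal("The export limit is listed in the admin guide.")
```

Checks like these run in milliseconds, which is why they belong on every commit. They say nothing about whether an answer is helpful, which is why they cannot be the only layer.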


Implementation: the golden set pipeline


Here is a concrete implementation of the contamination-controlled golden set pipeline at the infrastructure level:


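# Golden set pipeline: traffic-derived, rotation-controlled
# `db` is assumed to be a storage object exposing the methods called below.
import random
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Trace:
    query: str
    response: str
    category: str

@dataclass
class EvalReport:
    segments: dict
    contamination_risk: float
    holdout_last_rotated: datetime

class GoldenSetPipeline:
    def __init__(self, db, holdout_fraction=0.30, rotation_days=90):
        self.db = db
        self.holdout_fraction = holdout_fraction
        self.rotation_days = rotation_days

    def _stratify(self, traces):
        by_category = defaultdict(list)
        for trace in traces:
            by_category[trace.category].append(trace)
        return by_category

    def _split(self, traces, holdout_fraction):
        shuffled = random.sample(traces, len(traces))
        n_holdout = round(len(shuffled) * holdout_fraction)
        return shuffled[n_holdout:], shuffled[:n_holdout]

    def ingest_from_traffic(self, sampled_traces: list[Trace]):
        """Stratify sampled prod traffic by query category, then split active/holdout"""
        for category, traces in self._stratify(sampled_traces).items():
            # 70% → active pool, 30% → sealed holdout (with the default fraction)
            active, holdout = self._split(traces, self.holdout_fraction)
            self.db.insert_active(active, category=category)
            # Sealed: never read by the eval pipeline before sealed_until
            sealed_until = datetime.now(timezone.utc) + timedelta(days=self.rotation_days)
            self.db.insert_holdout(holdout, category=category, sealed_until=sealed_until)

    def rotate(self):
        """Called every rotation_days: graduate holdout → active, retire old active.
        Refill the holdout afterwards via ingest_from_traffic with fresh traces."""
        graduating = self.db.get_holdout_ready_to_graduate()
        self.db.move_to_active(graduating)
        self.db.mark_old_active_as_retired()  # prevents runaway set growth

    def run_eval(self, model_config: dict) -> EvalReport:
        """Evaluate only against active pool, never holdout"""
        active_cases = self.db.get_active()
        results = self._score_segmented(active_cases, model_config)
        return EvalReport(
            segments=results,
            contamination_risk=self._estimate_contamination(active_cases),
            holdout_last_rotated=self.db.last_rotation_date(),
        )

    def _score_segmented(self, cases, model_config) -> dict:
        raise NotImplementedError("product-specific: return {category: score}")
    def _estimate_contamination(self, cases) -> float:
        raise NotImplementedError("product-specific: e.g. share of cases seen during tuning")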
Traffic sampling note: When ingesting from production traffic, do not sample uniformly by volume. Uniform sampling over-represents your most common query type and under-represents rare but high-stakes query categories (refusals, edge cases, ambiguous inputs). Use stratified sampling with a floor: at minimum 15% of the eval set for each meaningful query category, regardless of its share of total traffic. With more than six categories those floors would add up to more than 100%, so lower the floor accordingly. Your rarest queries are often where the model fails most consequentially.
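The floor rule in the sampling note can be sketched as follows. This is one possible reading, not the only one: each category first receives its floor allocation, and the remaining slots are filled by uniform sampling from whatever traffic is left. The function name, category labels, and traffic volumes are hypothetical.

```python
import math
import random
from collections import defaultdict

def sample_with_floor(traces, target_size, floor=0.15, seed=0):
    """traces: list of (category, payload) pairs.

    Each category gets at least floor * target_size cases (or all it has,
    if fewer). Remaining slots are filled by volume from leftover traffic.
    """
    rng = random.Random(seed)
    by_cat = defaultdict(list)
    for category, payload in traces:
        by_cat[category].append(payload)
    if floor * len(by_cat) > 1:
        raise ValueError("floors exceed 100% of the eval set; lower the floor")

    floor_n = math.ceil(floor * target_size)
    sample, leftovers = [], []
    for category, items in by_cat.items():
        rng.shuffle(items)
        sample += [(category, p) for p in items[:floor_n]]
        leftovers += [(category, p) for p in items[floor_n:]]

    # Fill what is left uniformly, which favors high-volume categories.
    remaining = max(0, target_size - len(sample))
    sample += rng.sample(leftovers, min(remaining, len(leftovers)))
    return sample

# Hypothetical traffic: 90% lookups, 6% refusals, 4% edge cases.
traffic = ([("lookup", i) for i in range(900)]
           + [("refusal", i) for i in range(60)]
           + [("edge_case", i) for i in range(40)])
golden = sample_with_floor(traffic, target_size=100)

counts = defaultdict(int)
for category, _ in golden:
    counts[category] += 1
print(dict(counts))
```

Under uniform sampling the edge cases would contribute about four cases out of a hundred. With the floor they contribute at least fifteen, which is enough to notice a regression in that segment.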

LLM-as-judge: the rubric that actually works

The LLM-as-judge pattern is now standard practice, but most implementations fail at the rubric layer. Vague rubrics produce vague, unreliable judgments. Here is what a production-grade rubric looks like for a factual Q&A assistant:


# Judge prompt: factual QA rubric (5-dimension, 4-point scale each)

JUDGE_SYSTEM = """You are evaluating responses from an AI assistant.
Rate the response on EACH dimension independently using the 4-point scale.
Return ONLY valid JSON. No preamble, no explanation outside JSON fields."""

JUDGE_RUBRIC = """
Dimensions:
1. Factual accuracy (1-4): Is every factual claim in the response correct?
   1=Contains factual errors, 2=Mostly correct with minor errors,
   3=Correct with minor omissions, 4=Fully accurate

2. Groundedness (1-4): Does the response cite or derive from provided context?
   1=Fabricated/ungrounded, 2=Partially grounded, 3=Mostly grounded,
   4=Fully grounded, no hallucination

3. Completeness (1-4): Does the response address all parts of the question?
   1=Misses main point, 2=Partial answer, 3=Mostly complete, 4=Fully complete

4. Conciseness (1-4): Is the response appropriately brief for the query complexity?
   1=Severely verbose or too brief, 2=Some unnecessary content,
   3=Mostly appropriate length, 4=Optimal length

5. Appropriate uncertainty (1-4): Does the model correctly express confidence?
   1=Overconfident on uncertain claims, 2=Sometimes overconfident,
   3=Usually calibrated, 4=Well-calibrated throughout
"""

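# Run 3 independent judge calls at temperature=0 (for example, different judge models
# or prompt variants, since identical calls would mostly agree with each other),
# then take the dimension-wise median.
# Flag cases where any two judges disagree by >1 point on a dimension for human review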

The five-dimension structure is deliberate. Factual accuracy and groundedness often move in opposite directions when you upgrade models: a more capable model might give more accurate answers but draw on parametric knowledge rather than your retrieved context. A flat aggregate score hides this tradeoff. Dimension-level scores surface it immediately.
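The aggregation step described in the rubric comments (dimension-wise median, plus a review flag on disagreement) could look like the sketch below. The JSON key names are assumptions, since the rubric does not fix an output schema, and the three judge outputs are made-up examples rather than real model responses.

```python
import json
from statistics import median

# Assumed key names for the five rubric dimensions.
DIMENSIONS = ("factual_accuracy", "groundedness", "completeness",
              "conciseness", "appropriate_uncertainty")

def aggregate_judgments(raw_judgments: list[str]) -> dict:
    """Combine JSON judge outputs into per-dimension medians and a review flag."""
    parsed = [json.loads(raw) for raw in raw_judgments]
    medians, needs_review = {}, False
    for dim in DIMENSIONS:
        scores = [int(judgment[dim]) for judgment in parsed]
        medians[dim] = median(scores)
        # If max and min differ by more than 1, some pair of judges disagrees by >1.
        if max(scores) - min(scores) > 1:
            needs_review = True
    return {"scores": medians, "needs_human_review": needs_review}

# Hypothetical outputs from three judges for one response.
judge_outputs = [
    '{"factual_accuracy": 4, "groundedness": 2, "completeness": 3, "conciseness": 3, "appropriate_uncertainty": 3}',
    '{"factual_accuracy": 4, "groundedness": 4, "completeness": 3, "conciseness": 4, "appropriate_uncertainty": 3}',
    '{"factual_accuracy": 3, "groundedness": 3, "completeness": 3, "conciseness": 3, "appropriate_uncertainty": 4}',
]

result = aggregate_judgments(judge_outputs)
print(result)
```

In this example the judges split on groundedness (2, 4, 3), so the case is flagged for human review even though the median looks unremarkable. Keeping the scores per dimension, rather than averaging them into one number, is what preserves the accuracy-versus-groundedness tradeoff.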


The model upgrade decision framework


With a SCORE-compliant eval system, you now have enough signal to make model upgrade decisions with actual confidence. Here is the sequence of gates that prevents the "evals improved, users are worse" failure mode:


Model upgrade decision: required gates before shipping


Layer 1 unit tests pass (100%)

Layer 2 LLM judge on active pool: no segment regresses more than 3 points

Shadow traffic A/B for 48 hrs: online signals within 5% of baseline

Holdout eval (if within rotation window): improvement or neutral

Canary deploy 5%: session continuation and escalation rate within 2% of control

Full rollout with 7-day online signal monitoring window

Each gate has a specific passing criterion, not a subjective judgment call. This is the key difference from how most teams operate. "Evals look pretty good" is not a gate. "No segment regresses more than 3 points on the active pool and shadow traffic online signals are within 5% of baseline" is a gate. The specificity is what creates accountability and what makes post-mortems tractable when something still slips through.
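Because every gate is a threshold, the checks can be written down as code. The sketch below encodes the thresholds listed above under a few assumptions: segment deltas are in rubric points, online signal changes are relative fractions, and "within X%" is read as an absolute deviation in either direction. The class and field names are hypothetical. In practice the gates run in sequence over several days, so this function would be called as each measurement arrives.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class GateInputs:
    unit_test_pass_rate: float         # Layer 1, as a fraction from 0.0 to 1.0
    segment_deltas: dict               # Layer 2: {segment: candidate - baseline, in points}
    shadow_signal_change: float        # relative change vs baseline, e.g. -0.03
    holdout_delta: Optional[float]     # None when outside the rotation window
    canary_signal_change: float        # relative change vs control

def failed_gates(x: GateInputs) -> list:
    """Return the names of every gate whose passing criterion is not met."""
    failures = []
    if x.unit_test_pass_rate < 1.0:
        failures.append("layer1_unit_tests")
    if any(delta < -3.0 for delta in x.segment_deltas.values()):
        failures.append("layer2_segment_regression")
    if abs(x.shadow_signal_change) > 0.05:
        failures.append("shadow_traffic")
    if x.holdout_delta is not None and x.holdout_delta < 0:
        failures.append("holdout")
    if abs(x.canary_signal_change) > 0.02:
        failures.append("canary")
    return failures

# Hypothetical candidate: better on average, but reasoning drops 4 points.
candidate = GateInputs(
    unit_test_pass_rate=1.0,
    segment_deltas={"reasoning": -4.0, "generation": 6.0},
    shadow_signal_change=-0.01,
    holdout_delta=None,
    canary_signal_change=0.0,
)
blocked_by = failed_gates(candidate)
print(blocked_by)
```

Returning the list of failed gates, rather than a single pass/fail boolean, gives the post-mortem a concrete starting point when a release is blocked.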


The organizational failure: who owns evals?

The most common reason eval systems remain broken is not technical. It is organizational. In most LLM product teams, evals are owned by whoever is closest to the model at any given time: a researcher when the team is doing fine-tuning, a prompt engineer when the team is doing prompt work, an ML engineer when the team is benchmarking a new model version.


This fragmented ownership means the eval system never gets treated as a first-class production artifact. It does not get code review. It does not get maintained. The golden set does not get rotated. The rubric does not get versioned. The contamination accumulates silently.


What eval ownership should look like

One person or team owns the eval infrastructure as a permanent responsibility. It has a README. It has a changelog. The golden set rotation is a calendar event. The rubric is versioned in Git.

What it usually looks like

A Jupyter notebook in a repo no one has touched in four months. A Google Sheet labeled "eval cases v3 FINAL." A Slack message that says "can someone run the evals before we ship this?"

Minimum viable eval infrastructure

Eval cases in version-controlled storage with a schema. A runner script that produces a standardized report. A diff view that shows change vs the previous run. A rotation calendar for the holdout pool.

Signals that your eval system has drifted

Evals trend upward for 3+ months while user satisfaction is flat. Your eval set has not been updated with new production cases in 60+ days. You cannot name the last time you ran against the holdout pool.
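The drift signals above are mechanical enough to automate. The sketch below assumes you keep one eval score and one user satisfaction score per month, and treats four consecutive rising monthly points as "trending upward for 3+ months". The function name, argument names, and sample values are hypothetical.

```python
from datetime import date

def drift_warnings(monthly_eval, monthly_satisfaction,
                   last_case_added, last_holdout_run, today):
    """Return human-readable warnings for the drift signals that currently apply."""
    warnings = []
    recent_eval = monthly_eval[-4:]
    recent_sat = monthly_satisfaction[-4:]
    # Four rising monthly points means three consecutive month-over-month gains.
    eval_rising = len(recent_eval) == 4 and all(
        later > earlier for earlier, later in zip(recent_eval, recent_eval[1:])
    )
    sat_not_rising = len(recent_sat) == 4 and recent_sat[-1] <= recent_sat[0]
    if eval_rising and sat_not_rising:
        warnings.append("evals up 3+ months while satisfaction is flat or down")
    if (today - last_case_added).days >= 60:
        warnings.append("no new production cases in 60+ days")
    if last_holdout_run is None:
        warnings.append("holdout pool has never been run")
    return warnings

# Hypothetical team state.
found = drift_warnings(
    monthly_eval=[0.70, 0.74, 0.78, 0.81],
    monthly_satisfaction=[4.1, 4.1, 4.0, 4.0],
    last_case_added=date(2025, 10, 1),
    last_holdout_run=None,
    today=date(2026, 1, 15),
)
for warning in found:
    print(warning)
```

A check like this fits naturally into the runner script's standardized report, so drift shows up on every run instead of waiting for someone to go looking for it.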


The competitive angle: why this is a product moat


For AI product managers and founders reading this: trustworthy evals are a durable competitive advantage in a market where model capabilities are converging.


When every team can access the same frontier models, the differentiator is how quickly and confidently you can iterate. A team with a reliable eval system can ship model changes in days with confidence, roll back safely when something degrades, and accumulate product-specific quality improvements faster than teams running blind. A team without one ships every change as a prayer and rolls back as a guess.


The teams that will win on LLM product quality in 2026 and beyond are not necessarily the ones with the best prompts or the biggest models. They are the ones that know, at any given moment, whether they are getting better or worse, and can act on that knowledge faster than their competitors.


That requires treating evaluation as a product, not as a pre-ship checkbox.


Bottom line
The LLM evaluation crisis is a solvable engineering problem dressed up as a research problem. The SCORE framework gives you five concrete disciplines to fix it: segmented suites, contamination-controlled holdout rotation, online production signals, multi-reference golden sets, and ensemble LLM-as-judge with explicit rubrics. Start with two changes: ingest eval cases from real production traffic instead of synthetic examples, and seal 30% of them in a holdout pool that no prompt engineering cycle can touch. Run the holdout quarterly. The gap between your active eval score and your holdout score is a direct measure of how much eval set overfitting you have accumulated. If you have never run that measurement, you are flying blind on every model decision you make.

About this blog: Personal publication at the intersection of LLM engineering, AI product strategy, and measurement infrastructure. All failure patterns described are drawn from real production post-mortems and community discussions. Framework implementations are reference patterns, not production-ready code.


 
 
 


The AI Product Marketer | Soniya Singh

Deep dives into AI products, GTM strategy, and market adoption

Pro+ Member of PMA - Product Marketing Alliance

© 2025 by The AI Product Marketer.
