AI Evaluation - Metrics & Human in the Loop
Measuring What Matters, With People in the Loop
Anje Barnard
Jan, 2026
AI often feels like magic. It can generate a photorealistic picture of your dogs having a snowball fight (give or take a few toe beans) or compose a catchy blues song from a simple text prompt. But anyone building AI systems also knows that it is just as capable of spectacular failure.
AI operates on probability, not certainty. The non-deterministic nature of modern AI models means you might get the 'perfect' result ten times in a row, only to see it break down on the eleventh try (even with the same input). A transcription model might randomly insert "Here's to many more ears!" instead of "years", or hallucinate a "Thanks for watching" sign-off where none exists.
With isolated model calls, this is frustrating, but you can simply try again. In a chained AI pipeline, these individual errors can snowball. A degraded output becomes a messy input for the next AI component, compounding until you are left with an incoherent outcome.
Consider a dubbing pipeline: A source separation model leaves a faint dog bark and music in the background of what is supposed to be a speech-only track. The transcription model interprets the dog bark as a word and hallucinates words for the musical instruments. The translation model then translates this poor transcription, and by the time we reach the text-to-speech model, we are left with a confidently spoken sentence that makes absolutely no sense.
Stopping this snowball effect starts with systematic evaluation and human-in-the-loop (HITL) review.
Evaluation isn't just a final stamp of approval (or disapproval) on your output; it is a continuous process. To prevent errors from snowballing, you need evaluation checkpoints at three critical phases of your AI system's lifecycle:
R&D: This is the foundational phase. Here you define your task and understand its domain, gather data, select your metrics, and build the system components. This could mean vetting third-party models, training or fine-tuning your own, or crafting prompt instructions and examples.
Production: This is the runtime quality control phase. You run evaluations against real-world outputs to decide whether to pass the data downstream, trigger an automated retry, or flag bad output for human review.
Continuous Optimization: This is the refinement phase. You leverage the data collected in production (approvals, fixes, ratings and feedback) to improve the system: raise output quality and reduce the chance of repeated mistakes.
Evaluation During R&D
The R&D phase is where you define success. This isn't just about picking metrics to use in production; it is about using evaluation insights, whether from your own experiments or trusted industry benchmarks, to guide your solution strategy. These results will dictate the AI components you choose:
Does a pre-trained model exist that already aligns with our data and quality standards?
Do we need to train a model from scratch?
Should we fine-tune an existing model?
Is a generic LLM sufficient if we optimize the prompts and examples?
Do we have enough high-quality data to support these decisions?
Defining Your Metrics
To answer these questions, you must first define what "success" actually looks like for your specific use case. This requires you to balance industry standards with your own domain requirements. You should address several key areas:
Self-Evaluate: Before peeking at what's out there, ask: How would I evaluate this? What makes an output "good", "average" or "bad"? Even if you don't have a formal answer yet, building a personal intuition is valuable. If you lack the domain expertise, this is the time to lean on subject matter experts (SMEs) in your team to help define this initial understanding.
Available Metrics: Research the standard metrics for your task. Do they require "Gold Standard" references?
Look for cutting-edge metrics: Don't rely on legacy metrics simply because everyone uses them. Newer metrics often correlate better with human perception than older, more rigid ones.
Take a hybrid approach: Oftentimes a single metric isn't enough. If one metric captures exact matches and another captures meaning, consider combining them into a weighted score to cover more nuances of your output (a minimal sketch follows this list).
Interpretability: Does the metric actually align with human judgement? If a metric gives a high score, but the output reads poorly to a human, the metric is misleading. You need metrics that map to reality, so prioritize those that:
Are explainable: You can clearly derive why the score is high or low.
Target your priorities: They evaluate the qualities that matter most to your use case.
Data Availability: Your choice of metrics can also be dictated by your available data. Do you have the necessary "ground truth" to support your ideal metrics or must you curate it?
Attribute Gap: A dataset might look perfect (it covers your use case) but lack specific attributes (like "reasoning steps" or "reference translations") that your metric requires to function. Or the attribute exists but needs reformatting before the metric can be applied.
Golden Set Recommendation: Always build a custom Golden Set. This acts as your anchor, a curated list of inputs and their expected outputs covering all the scenarios and edge cases your system will face. It should be kept as an evaluation-only set and never used to improve the system.
Client Data: Sometimes clients have years of raw data. The challenge in this case is to process this data into the structured format your metrics (and model) require.
The Dirty Data Trap: Always audit public datasets. They are often polluted.
Recall the "Thanks for watching" hallucination mentioned earlier. This happens because the models are often trained on barely cleaned datasets (like YouTube subtitles) that included non-verbal sign-offs or sound event cues, teaching the model to predict text that wasn't actually spoken.
Establishing Your Thresholds
Now that you have your metrics, you must calibrate them to reality. A raw score of 0.7 means only so much in isolation.
Manually review a subset of your data to see how the automated scores align with the human eye. If you find that scores below 0.7 mostly correspond to average or poor results while your system requires excellent quality, use 0.7 as your threshold for human review and for performance reporting (e.g. your system's pass/fail rate).
However, quality is subjective and expectations vary between use cases and clients. Instead of hard-coding this number, make it configurable. This empowers users to define their own risk tolerance, allowing them to balance strict quality control against higher throughput based on their specific needs.
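As a rough illustration, a configurable threshold might look like the sketch below; the dataclass, the default value and the 0.7 calibration are assumptions for the example, not a prescribed design.

```python
from dataclasses import dataclass

@dataclass
class QualityConfig:
    # 0.7 chosen here because, in a manual review, scores below it mostly
    # corresponded to average or poor outputs. Calibrate against your own data.
    review_threshold: float = 0.7

def needs_human_review(score: float, config: QualityConfig) -> bool:
    """Flag any output whose evaluation score falls below the configured bar."""
    return score < config.review_threshold

# A client with a higher risk tolerance can trade quality for throughput
# simply by loosening the config, without any code changes.
lenient = QualityConfig(review_threshold=0.5)
print(needs_human_review(0.65, QualityConfig()))  # True  -> flag for review
print(needs_human_review(0.65, lenient))          # False -> let it through
```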
Leveraging AI as a Judge
Sometimes standard metrics fall short. They might miss semantic nuances or demand more human effort than you can afford. Leveraging an LLM as a Judge of AI output is a powerful way to evaluate subjective qualities like "tone", "context appropriateness" or "completeness".
Start Simple: For straightforward tasks, an LLM's general world knowledge might work perfectly out of the box, without handcrafted examples or painstakingly tuned instructions. Feel free to start here, but verify the output. Even if the prompt is simple, audit a representative sample of input-output pairs to ensure the judge's verdicts align with intuition before you rely on it in the larger AI system.
Optimize if needed: However, an LLM Judge is only as good as its prompt (and its training data). For more complex evaluation tasks, treat this component as its own optimization problem (a minimal judge sketch follows these steps):
Craft a Prompt: Create instructions and few-shot examples.
Tip: You can also chain or aggregate multiple prompts together for complex reasoning or step-wise evaluation.
Construct the Dataset: Use existing datasets where available or curate your own, and make sure to include the expected judge output.
Define the Metrics: How do you know the Judge is accurate?
Ground Truth Alignment: The standard approach is to compare the Judge's output (e.g. 1-5 stars) against the expected outputs in the dataset.
Industry Metrics: If your Judge evaluates objective properties (like tense or point of view) and NLP models or standard metrics already exist for those tasks, leverage them.
Evaluate the Judge: Run your Judge against the dataset and calculate the metrics.
Refine the Prompt: Review the Judge's output alongside the input-output pairs to understand where the prompt might be falling short or if your metrics need tweaking. Update your prompt(s) or metrics accordingly.
Iterate: Repeat the evaluation and prompt refinement until the AI Judge aligns well with your Golden Set.
Tip: You can automate this loop. Frameworks like DSPy can programmatically iterate on prompts based on your dataset (and program), saving you from hand-crafting instructions and few-shot examples.
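To ground the idea, here is a minimal sketch of a simple judge plus an agreement check against a golden set. The call_llm placeholder, the rubric, the 1-5 scale and the human_score field are all assumptions; swap in your own client and criteria.

```python
import re

JUDGE_PROMPT = """You are evaluating a translated subtitle.
Rate the translation from 1 (unusable) to 5 (publication-ready) for
faithfulness to the source and natural tone. Answer with the number only.

Source: {source}
Translation: {translation}
"""

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your actual LLM client call here."""
    raise NotImplementedError

def judge(source: str, translation: str) -> int:
    """Ask the LLM for a 1-5 rating and parse the first digit it returns."""
    reply = call_llm(JUDGE_PROMPT.format(source=source, translation=translation))
    match = re.search(r"[1-5]", reply)
    return int(match.group()) if match else 1  # be conservative on parse failure

def judge_agreement(golden_set: list[dict], tolerance: int = 1) -> float:
    """Fraction of golden examples where the judge lands within `tolerance`
    of the human-assigned score."""
    hits = 0
    for example in golden_set:
        predicted = judge(example["source"], example["translation"])
        hits += abs(predicted - example["human_score"]) <= tolerance
    return hits / len(golden_set)
```

Agreement within a one-point tolerance is just one option; exact agreement, correlation, or per-criterion accuracy may fit your task better.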
Evaluation During Production
Once your AI system is live, evaluation becomes a gatekeeper for output quality. It now forms part of the decision engine that determines whether to pass an output to the next step, trigger an automatic improvement attempt, or flag it for human review.
The Ground Truth Problem
In production, you rarely have a "correct answer" waiting for you. So, how do you use your R&D metrics?
Cycle Consistency: If you can't verify the output directly, try reversing the process.
Example: In translation tasks, use Back-Translation. Say you are translating English to Spanish; now translate the Spanish output back into English. By comparing this back-translated text with the original English source, you can verify, for example, that the semantic meaning was preserved.
Self-consistency: Models are non-deterministic: they hallucinate and sometimes miss important points. You can exploit this by running the same input through a model multiple times, or running the input through multiple models.
If the outputs are similar, confidence is high. If the model gives very different outputs, it is probably guessing. You can discard outliers and keep the best-matching option(s) for human review (both checks are sketched after this list).
AI as a Judge: Utilize your optimized AI Judge from the R&D phase.
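As a rough sketch of the first two checks, assuming placeholder translate and transcribe_once model calls and a crude similarity stand-in for a proper semantic metric:

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

def similarity(a: str, b: str) -> float:
    """Crude stand-in for a semantic similarity metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def translate(text: str, source: str, target: str) -> str:
    """Placeholder: call your translation model here."""
    raise NotImplementedError

def cycle_consistency(source_text: str) -> float:
    """Back-translation: EN -> ES -> EN, then compare against the source."""
    spanish = translate(source_text, source="en", target="es")
    back = translate(spanish, source="es", target="en")
    return similarity(source_text, back)

def self_consistency(transcribe_once, audio_path: str, runs: int = 3) -> float:
    """Run the same input several times; low pairwise agreement means low confidence."""
    outputs = [transcribe_once(audio_path) for _ in range(runs)]
    pairwise = [similarity(a, b) for a, b in combinations(outputs, 2)]
    return mean(pairwise)
```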
When to Leverage Evaluation Metrics
Once you have your metrics, you have to decide how your AI pipeline reacts. Broadly, there are two strategies: Fully Automated (for speed) or Human in the Loop (for quality).
Fully Automated: The system calculates metrics but does not stop the pipeline. It logs metrics for later analysis and perhaps asks the user for feedback on the final output.
This is fast and most likely cheaper than human intervention, but risky. In a snowballing pipeline, errors will slip through and infect downstream tasks. This is best for low-stakes features where occasional weirdness is acceptable.
Human in the Loop: Instead of relying solely on AI outputs for quality, you have quality checkpoints where humans can review (approve, edit, rate) the output before it moves to the next critical step in the pipeline. You can dial this human intervention up or down as needed:
The Exception Approach: This is the cost and efficiency sweet spot. You use your thresholds to trigger human review only for outputs that fall below the quality bar (see the sketch after this list).
The Human Gatekeeper Approach: For use cases that require absolute accuracy and quality output (think medical or legal industry), automation isn't enough. At every critical step, the output is routed for human review. The system continues to run metrics to assist the human (and gain insights on system quality), but the human must give the final stamp of approval to proceed.
This is slower, but it yields higher quality output and is less risky. In the snowballing pipeline, errors now have a chance to be fixed at each critical step, meaning we hand cleaner, better quality input to the next AI component. This is the better option when accuracy and output quality are required to meet industry standards.
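A minimal sketch of how these strategies might be wired into a single routing decision; the mode names and the 0.7 default are assumptions for the example, not a prescribed design.

```python
from enum import Enum

class ReviewMode(Enum):
    FULLY_AUTOMATED = "fully_automated"   # log metrics, never block the pipeline
    EXCEPTION = "exception"               # block only below-threshold outputs
    GATEKEEPER = "gatekeeper"             # block every output for human sign-off

def next_action(score: float, mode: ReviewMode, threshold: float = 0.7) -> str:
    """Decide what happens to an output; scores are always logged for analysis."""
    if mode is ReviewMode.GATEKEEPER:
        return "human_review"
    if mode is ReviewMode.EXCEPTION and score < threshold:
        return "human_review"
    return "pass_downstream"

print(next_action(0.55, ReviewMode.FULLY_AUTOMATED))  # pass_downstream
print(next_action(0.55, ReviewMode.EXCEPTION))        # human_review
print(next_action(0.95, ReviewMode.GATEKEEPER))       # human_review
```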
Improving the AI System
After your system has been live for a few months, you will have accumulated valuable data: evaluation metrics, user ratings, and most importantly, human fixes.
You could simply review this data manually to spot issues (and you should), but the real win comes from Automated Optimization. This is where you turn the pipeline into a Data Flywheel: a system that gets smarter with every mistake it makes.
Harvest the Signals: Each human interaction tells you how to improve.
User Edits: Every time a user fixes an output, they generate a perfect "before vs. after" training example.
Approvals & Ratings: These tell you if your internal metrics align with reality. If users consistently rate outputs as "Bad" while your metrics rate them as "Excellent", your metrics are broken and need calibration.
Feedback: Subjective comments reveal why the system failed (e.g. "too formal", "missed the context"), revealing trends that raw numbers miss.
Feed the System: Feed this real-world data back into your pipeline to close the loop (a minimal sketch follows this list).
Fine-tune Models and Refine Prompts: Inject those user edits back into your datasets. Use them to fine-tune your model or optimize prompt instructions and few-shot examples.
Metric Calibration: Adjust your thresholds or fine-tune your evaluation models based on the additional real-world data.
Drift Detection: R&D data is often cleaner than production data, and user interactions change over time. Continuously compare your real-world inputs, outputs and their relationships against the R&D baseline.
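As a rough sketch of closing the loop, assuming a hypothetical review log with input, model_output and edited_output fields, and a deliberately crude drift statistic:

```python
import json
from statistics import mean

def edits_to_training_pairs(review_log: list[dict], path: str) -> None:
    """Turn each human fix into a 'before vs. after' example (JSONL) that can
    feed fine-tuning or prompt/few-shot refinement."""
    with open(path, "w", encoding="utf-8") as f:
        for record in review_log:
            edited = record.get("edited_output")
            if edited and edited != record["model_output"]:
                f.write(json.dumps({
                    "input": record["input"],
                    "rejected": record["model_output"],
                    "accepted": edited,
                }) + "\n")

def score_drift(baseline_scores: list[float], production_scores: list[float]) -> float:
    """Crude drift signal: shift in mean evaluation score versus the R&D baseline.
    A proper distribution test would be a sturdier choice in practice."""
    return mean(production_scores) - mean(baseline_scores)
```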
Conclusion: The Self-Improving System
The ultimate goal is to build a system that closes its own loop. By tracking user fixes and feedback, you move from guessing how to improve to knowing.
By automatically feeding this data back into the system, via fine-tuning or prompt refinement, you transition from a static system that requires constant manual maintenance to a dynamic one that evolves with its users. While human oversight remains the final gatekeeper, the heavy lifting is done by the system itself.