Master LLM evaluation by implementing LLM-as-a-judge pipelines. Use LangSmith and Pydantic for automated testing that catches regressions and ensures quality.

Last month, we pushed a prompt update that "improved" our RAG system’s summarization capabilities on paper but nuked our downstream data parsing in production. We spent about two days manually verifying test cases before realizing we were flying blind. That was the last time we relied on "gut-check" testing for our AI features.
If you’re building production LLM apps, you already know that evaluating LLM features is the only way to move beyond prototypes. To stop the cycle of fixing one bug and introducing two more, you need an automated pipeline that treats LLM outputs like any other software dependency.
The core of our new pipeline is "LLM-as-a-judge." Instead of writing brittle regex or exact-match assertions, we use a stronger model (usually GPT-4o) to grade the output of our production model (often a faster, cheaper model like GPT-4o-mini or Haiku).
The pipeline follows three simple steps:
The biggest challenge isn't the logic; it's the structure. When your judge LLM returns a raw string, parsing it is a nightmare. This is where Pydantic structured output becomes mandatory.
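To make the parsing problem concrete, here is a minimal sketch of what regex-based score extraction looks like against free-text judge replies. The replies and the `parse_score` helper are made up for illustration; they are not taken from our pipeline.

```python
import re
from typing import Optional

# Hypothetical free-text judge replies, invented for this example.
replies = [
    "I think this answer is a 4/5.",
    "Solid summary. Four out of five.",
    "Score: 4 (out of 5), though 2/3 of the cited facts are unsupported.",
]

# Matches a single digit followed by "/5", e.g. "4/5" or "4 / 5".
SCORE_PATTERN = re.compile(r"(\d)\s*/\s*5")


def parse_score(reply: str) -> Optional[int]:
    """Return the first 'N/5' score found in the reply, or None if absent."""
    match = SCORE_PATTERN.search(reply)
    return int(match.group(1)) if match else None


parsed = [parse_score(reply) for reply in replies]
print(parsed)  # [4, None, None]
```

All three replies express the same score, but only one survives the regex. Every new phrasing means another pattern, which is the maintenance burden a fixed schema avoids.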

We don't want a judge to return "I think this answer is a 4/5." We want a structured JSON object that we can aggregate into meaningful metrics. By using Pydantic, we enforce a schema that the judge must follow.
Here is how we define our judge's schema in Python:
```python
from pydantic import BaseModel, Field


class EvaluationResult(BaseModel):
    """Schema the judge model must fill in for each graded output."""

    score: int = Field(..., ge=1, le=5, description="Score from 1 to 5")
    reasoning: str = Field(..., description="Short explanation of the score")
    is_hallucination: bool = Field(..., description="True if the model invented facts")


# Then we pass this to our LLM client
# (e.g., using OpenAI's structured outputs with this model as the response format)
```
By enforcing this schema, we turn malformed judge output into an explicit validation error instead of a silent parsing bug. If the judge returns a score outside the 1 to 5 range, or a non-numeric string where an integer is expected, Pydantic raises a validation error immediately. This is the same pattern we use for getting reliable structured output from an LLM in production, and it’s just as effective for evaluation.
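As a rough illustration of that validation step, the sketch below feeds a few hand-written JSON payloads through the schema. It assumes Pydantic v2 (for `model_validate_json`), and the `parse_judge_reply` helper is a hypothetical name, not part of any library.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class EvaluationResult(BaseModel):
    score: int = Field(..., ge=1, le=5, description="Score from 1 to 5")
    reasoning: str = Field(..., description="Short explanation of the score")
    is_hallucination: bool = Field(..., description="True if the model invented facts")


def parse_judge_reply(raw_json: str) -> Optional[EvaluationResult]:
    """Validate the judge's raw JSON; return None when it breaks the schema."""
    try:
        return EvaluationResult.model_validate_json(raw_json)
    except ValidationError:
        return None


# Hand-written payloads standing in for real judge responses.
good = parse_judge_reply(
    '{"score": 4, "reasoning": "Covers the key points.", "is_hallucination": false}'
)
out_of_range = parse_judge_reply(
    '{"score": 7, "reasoning": "Great.", "is_hallucination": false}'
)
not_a_number = parse_judge_reply(
    '{"score": "four", "reasoning": "Good.", "is_hallucination": false}'
)
```

Here `good` is a populated `EvaluationResult`, while the other two come back as `None`. In a real pipeline you would more likely log or retry on the error than swallow it.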
Once we have our structured evaluation results, we need a place to store them. We use LangSmith to track these runs over time. This gives us the AI observability we need to see if our average score is drifting downward after a prompt change.
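One way to act on that drift signal is a simple regression gate that compares mean judge scores before and after a prompt change. This is a sketch under assumptions: the `score_regressed` helper, the `max_drop` threshold of 0.3, and the score lists are all illustrative, not values from our setup.

```python
from statistics import mean
from typing import Sequence


def score_regressed(
    baseline: Sequence[int],
    candidate: Sequence[int],
    max_drop: float = 0.3,  # assumed tolerance; tune for your own test set
) -> bool:
    """Return True if the candidate's mean score fell by more than max_drop."""
    return mean(baseline) - mean(candidate) > max_drop


# Illustrative judge scores on the same test cases.
baseline_scores = [4, 5, 4, 4, 5]   # mean 4.4, before the prompt change
candidate_scores = [4, 3, 4, 3, 4]  # mean 3.6, after the prompt change

regressed = score_regressed(baseline_scores, candidate_scores)
print(regressed)  # True: the mean dropped by 0.8
```

A check like this can fail a CI job outright, with the stored runs available afterwards for digging into which cases dropped.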
Here is a simplified flow of our evaluation loop:
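- Parse the judge's response into the EvaluationResult Pydantic model.
- Use the langsmith client to log the run and the evaluation score.

```python
from langsmith import Client

client = Client()

# test_input, model_output, and evaluation come from the earlier steps of the loop.
# Log the evaluation to LangSmith
client.create_run(
    name="summary_evaluation",
    run_type="chain",  # one of LangSmith's documented run types
    inputs={"input": test_input, "output": model_output},
    outputs=evaluation.model_dump(),
)
```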
We initially tried using a simple "correct/incorrect" boolean for our judges. It was fast, but it didn't give us enough nuance. When we switched to a 1–5 scale, our costs increased by roughly 1.8x because the judge model needed more tokens to provide the reasoning field.
Was it worth it? Absolutely. That extra context allowed us to identify that our model was struggling with specific document types, not just failing randomly. We learned that automated testing isn't just about passing or failing; it's about debugging the "why" behind the failure.
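A breakdown by document type is the kind of analysis that surfaces this pattern. The sketch below assumes each stored result carries a `doc_type` tag alongside the judge's score; that field and the sample rows are hypothetical and not part of the schema shown earlier.

```python
from collections import defaultdict
from statistics import mean

# Illustrative evaluation rows; "doc_type" is an assumed metadata field.
results = [
    {"doc_type": "contract", "score": 2},
    {"doc_type": "contract", "score": 3},
    {"doc_type": "email", "score": 5},
    {"doc_type": "email", "score": 4},
    {"doc_type": "report", "score": 4},
]


def mean_score_by_type(rows):
    """Group judge scores by document type and average each group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["doc_type"]].append(row["score"])
    return {doc_type: mean(scores) for doc_type, scores in grouped.items()}


by_type = mean_score_by_type(results)
worst = min(by_type, key=by_type.get)  # document type with the lowest mean score
print(by_type, worst)
```

With this made-up data, contracts average 2.5 while emails average 4.5, so the failures are concentrated rather than random. The judge's `reasoning` text for the low-scoring group is the next place to look.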
If I were starting this from scratch today, I’d focus on these three things:
score, reasoning, and is_hallucination.

I’m still not 100% satisfied with our latency. Running the evaluation pipeline takes about 45 seconds for a batch of 20 tests. We could speed this up by running evaluations in parallel, but for now, it’s fast enough that it doesn't block our CI/CD pipeline.
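For reference, running the judge calls in parallel could look roughly like the sketch below. Because judge calls are network-bound, a thread pool is usually enough. The `judge_case` function is a stand-in that sleeps instead of calling a real API, and `max_workers=5` is an arbitrary assumption; real limits depend on your provider's rate limits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def judge_case(case: dict) -> dict:
    """Stand-in for a network-bound judge call; sleeps instead of hitting an API."""
    time.sleep(0.05)
    return {"id": case["id"], "score": 4}


def run_parallel(cases, max_workers=5):
    """Evaluate all cases concurrently, returning results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(judge_case, cases))


cases = [{"id": i} for i in range(20)]
results = run_parallel(cases)
print(len(results))  # 20
```

With five workers, 20 cases finish in about four rounds of waiting instead of twenty. The actual speedup depends on how much of the batch time is spent waiting on the judge model.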
Automated LLM evaluation is a moving target. As models get better, your evaluation criteria will need to evolve. By anchoring your tests in Pydantic schemas and using LLM-as-a-judge to quantify performance, you’ll spend less time debugging and more time shipping features that actually work.

How do you handle the "judge" being wrong? We occasionally audit the judge's scores manually. If the judge is consistently mislabeling, we tweak the system prompt for the judge, not the target model.
Does this increase production latency? No. This is an offline process. We run these evaluations in our CI/CD pipeline, not during the user's request/response cycle.
Can I use this for non-text tasks? Yes, as long as you can define a Pydantic schema that represents the success criteria for your specific output type.
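For the manual judge audits mentioned above, one simple number to track is how often the judge agrees with a human reviewer on the same cases. This is a minimal sketch; the `agreement_rate` helper, the `tolerance` parameter, and the score lists are illustrative assumptions.

```python
from typing import Sequence


def agreement_rate(
    judge_scores: Sequence[int],
    human_scores: Sequence[int],
    tolerance: int = 0,
) -> float:
    """Fraction of cases where judge and human scores differ by at most tolerance."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("judge_scores and human_scores must have the same length")
    if not judge_scores:
        raise ValueError("at least one scored case is required")
    matches = sum(
        abs(judge - human) <= tolerance
        for judge, human in zip(judge_scores, human_scores)
    )
    return matches / len(judge_scores)


# Illustrative audit sample: same five cases scored by the judge and by a person.
judge = [4, 5, 2, 3, 4]
human = [4, 4, 2, 1, 4]

exact = agreement_rate(judge, human)                   # 3 of 5 match exactly
within_one = agreement_rate(judge, human, tolerance=1)  # 4 of 5 within one point
print(exact, within_one)
```

A falling agreement rate is a signal to revisit the judge's system prompt before trusting any change in the target model's scores.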