AI/ML · June 21, 2026 · 5 min read

LLM evaluation pipelines: Building automated tests with LangSmith

Master LLM evaluation by implementing LLM-as-a-judge pipelines. Use LangSmith and Pydantic for automated testing that catches regressions and ensures quality.

Tags: LLM, AI, LangSmith, Python, Pydantic, Evaluation, RAG, Prompt Engineering

Last month, we pushed a prompt update that "improved" our RAG system’s summarization capabilities on paper but nuked our downstream data parsing in production. We spent about two days manually verifying test cases before realizing we were flying blind. That was the last time we relied on "gut-check" testing for our AI features.

If you’re building production LLM apps, you already know that evaluating LLM features systematically is the only way to move beyond prototypes (we cover the basics in "Evaluating LLM features: a practical guide for engineers"). To stop the cycle of fixing one bug and introducing two more, you need an automated pipeline that treats LLM outputs like any other software dependency.

The LLM-as-a-Judge Architecture

The core of our new pipeline is "LLM-as-a-judge." Instead of writing brittle regex or exact-match assertions, we use a stronger model (usually GPT-4o) to grade the output of our production model (often a faster, cheaper model like GPT-4o-mini or Haiku).

The pipeline follows three simple steps:

1. Generate: Run the target LLM over a static dataset of inputs.
2. Evaluate: Pass the output and the input to an "evaluator" LLM.
3. Log: Send the results to a platform like LangSmith for observability and historical tracking.
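The three steps above can be sketched as a small loop. This is a minimal illustration, not our production code: `run_pipeline`, `EvalRecord`, and the `fake_*` functions are hypothetical names, and the target model, judge, and logger are passed in as plain callables so the skeleton runs without any API keys.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalRecord:
    """One row of an evaluation run: input, target output, judge verdict."""
    test_input: str
    model_output: str
    verdict: dict


def run_pipeline(
    dataset: list[str],
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], dict],
    log: Callable[[EvalRecord], None],
) -> list[EvalRecord]:
    """Run generate -> evaluate -> log for every input in a static dataset."""
    records = []
    for test_input in dataset:
        model_output = generate(test_input)           # 1. Generate
        verdict = evaluate(test_input, model_output)  # 2. Evaluate
        record = EvalRecord(test_input, model_output, verdict)
        log(record)                                   # 3. Log
        records.append(record)
    return records


# Stand-ins for the real target model, judge model, and logging backend.
def fake_generate(text: str) -> str:
    return text.upper()


def fake_evaluate(test_input: str, model_output: str) -> dict:
    return {"score": 5 if model_output == test_input.upper() else 1}


logged: list[EvalRecord] = []
results = run_pipeline(["hello", "world"], fake_generate, fake_evaluate, logged.append)
```

Keeping the three stages as injected functions makes it easy to swap the fake judge for a real one later without touching the loop.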

The biggest challenge isn't the logic; it's the structure. When your judge LLM returns a raw string, parsing it is a nightmare. This is where Pydantic structured output becomes mandatory.
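To make the parsing problem concrete, here is a small sketch of what goes wrong with free-form judge replies. The regex, the `parse_score` helper, and the sample replies are all made up for illustration.

```python
import re


def parse_score(raw: str):
    """Naive parser: grab the first 'N/5' pattern from free-form judge text."""
    match = re.search(r"(\d)\s*/\s*5", raw)
    return int(match.group(1)) if match else None


replies = [
    "I think this answer is a 4/5.",
    "Score: four out of five",
    "I'd rate it 4 out of 5, maybe 3/5 on accuracy.",
]
parsed = [parse_score(r) for r in replies]
print(parsed)  # [4, None, 3]
```

The second reply is dropped entirely, and the third silently yields the wrong number because the parser latches onto the sub-score. Every new phrasing the judge invents needs another patch to the regex.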

Implementing Pydantic for Structured Evaluation


We don't want a judge to return "I think this answer is a 4/5." We want a structured JSON object that we can aggregate into meaningful metrics. By using Pydantic, we enforce a schema that the judge must follow.

Here is how we define our judge's schema in Python:


PYTHON
from pydantic import BaseModel, Field

class EvaluationResult(BaseModel):
    score: int = Field(..., ge=1, le=5, description="Score from 1 to 5")
    reasoning: str = Field(..., description="Short explanation of the score")
    is_hallucination: bool = Field(..., description="True if the model invented facts")

# Then we pass this schema to our LLM client
# (e.g., via the OpenAI SDK's structured-output support, with EvaluationResult as the response format)

By enforcing this schema, we replace fragile string parsing with validation that fails loudly. If the judge returns a non-numeric string where an integer belongs, or a score outside the 1 to 5 range, Pydantic raises a ValidationError immediately. This is the same pattern we use for getting reliable structured output from an LLM in production, and it’s just as effective for evaluation.
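A quick way to see the validator at work is to feed it raw JSON by hand. The sketch below assumes Pydantic v2 (it uses `model_validate_json`); `parse_verdict` and the sample payloads are hypothetical, and in a real pipeline the rejected outputs would be logged or retried rather than printed.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class EvaluationResult(BaseModel):
    score: int = Field(..., ge=1, le=5, description="Score from 1 to 5")
    reasoning: str = Field(..., description="Short explanation of the score")
    is_hallucination: bool = Field(..., description="True if the model invented facts")


def parse_verdict(raw_json: str) -> Optional[EvaluationResult]:
    """Validate raw judge output; return None when it breaks the schema."""
    try:
        return EvaluationResult.model_validate_json(raw_json)
    except ValidationError as exc:
        # A real pipeline would count this as a judge failure instead of printing.
        print(f"Rejected judge output: {exc.error_count()} validation error(s)")
        return None


good = parse_verdict('{"score": 4, "reasoning": "Accurate but terse", "is_hallucination": false}')
out_of_range = parse_verdict('{"score": 7, "reasoning": "Great", "is_hallucination": false}')
not_a_number = parse_verdict('{"score": "four", "reasoning": "Fine", "is_hallucination": false}')
missing_field = parse_verdict('{"score": 3, "reasoning": "No flag given"}')
```

Only the first payload survives. The other three fail for different reasons (range, type, missing field), and each failure is explicit instead of showing up later as a skewed average.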

Integrating with LangSmith

Once we have our structured evaluation results, we need a place to store them. We use LangSmith to track these runs over time. This gives us the AI observability we need to see if our average score is drifting downward after a prompt change.
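Watching for a downward drift can be reduced to a simple comparison between two runs. The following is a rough sketch rather than anything LangSmith provides: `has_regressed`, the sample scores, and the 0.25 tolerance are illustrative assumptions you would tune to your own noise level.

```python
from statistics import mean


def has_regressed(
    baseline_scores: list[int],
    candidate_scores: list[int],
    tolerance: float = 0.25,
) -> bool:
    """Return True when the candidate's mean judge score drops by more than `tolerance`."""
    return mean(baseline_scores) - mean(candidate_scores) > tolerance


# Hypothetical 1-5 judge scores for the same dataset under two prompt versions.
baseline = [5, 4, 4, 5, 4]   # mean 4.4
candidate = [4, 3, 4, 4, 3]  # mean 3.6

if has_regressed(baseline, candidate):
    print("Average judge score dropped; hold the prompt change for review.")
```

A tolerance is worth having because judge scores vary a little from run to run even when nothing has changed.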

Here is a simplified flow of our evaluation loop:

1. Dataset Loading: We pull a CSV of 50 "golden" input/output pairs from LangSmith.
2. Execution: The target LLM processes these inputs.
3. Judgment: The judge LLM processes the target output and returns our EvaluationResult Pydantic model.
4. Logging: We use the langsmith client to log the run and the evaluation score.


PYTHON
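from langsmith import Client

client = Client()  # reads the LangSmith API key from the environment

# test_input, model_output, and evaluation (an EvaluationResult instance)
# come from the earlier steps of the loop.
client.create_run(
    name="summary_evaluation",
    run_type="chain",  # "evaluator" is not one of LangSmith's run types
    inputs={"input": test_input, "output": model_output},
    outputs=evaluation.model_dump(),
)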

The Trade-offs We Faced

We initially tried using a simple "correct/incorrect" boolean for our judges. It was fast, but it didn't give us enough nuance. When we switched to a 1–5 scale, our costs increased by roughly 1.8x because the judge model needed more tokens to provide the reasoning field.

Was it worth it? Absolutely. That extra context allowed us to identify that our model was struggling with specific document types, not just failing randomly. We learned that automated testing isn't just about passing or failing; it's about debugging the "why" behind the failure.
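Spotting a weak document type comes down to slicing the judge scores by a tag. Here is a minimal sketch of that breakdown; the `doc_type` tag, the `mean_score_by_type` helper, and the sample rows are hypothetical and not part of the schema shown earlier.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_type(results: list[dict]) -> dict[str, float]:
    """Group judge scores by document type and average each group."""
    buckets: dict[str, list[int]] = defaultdict(list)
    for row in results:
        buckets[row["doc_type"]].append(row["score"])
    return {doc_type: mean(scores) for doc_type, scores in buckets.items()}


# Hypothetical results; `doc_type` is an illustrative tag attached to each test case.
results = [
    {"doc_type": "invoice", "score": 2},
    {"doc_type": "invoice", "score": 1},
    {"doc_type": "contract", "score": 5},
    {"doc_type": "contract", "score": 4},
]
by_type = mean_score_by_type(results)
weakest = min(by_type, key=by_type.get)
print(by_type, weakest)
```

With the reasoning field stored alongside each score, the next step is to read the judge's explanations for the weakest group rather than for the whole dataset.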

Lessons Learned

If I were starting this from scratch today, I’d focus on these three things:

1. Start with a small, high-quality dataset. Don't try to evaluate 1,000 cases. 20 well-curated examples that cover your edge cases are worth more than 500 noisy ones.
2. Use a stronger model for the judge. Don't try to use a cheap model to evaluate a cheap model. You need the reasoning capability of a frontier model to act as a reliable judge.
3. Keep your schema simple. If you ask for too many fields, the judge LLM will start hallucinating its own reasoning. Stick to score, reasoning, and is_hallucination.

I’m still not 100% satisfied with our latency. Running the evaluation pipeline takes about 45 seconds for a batch of 20 tests. We could speed this up by running evaluations in parallel, but for now, it’s fast enough that it doesn't block our CI/CD pipeline.
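Since each evaluation is mostly time spent waiting on an API, a thread pool is one plausible way to parallelize it. This sketch is an assumption about how that could look, not something we run today: `evaluate_case` just sleeps to simulate latency, and `max_workers=8` is an arbitrary value you would set according to your provider's rate limits.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def evaluate_case(test_input: str) -> dict:
    """Stand-in for one generate-then-judge round trip (network-bound in practice)."""
    time.sleep(0.05)  # simulate API latency
    return {"input": test_input, "score": 5}


def evaluate_batch(test_inputs: list[str], max_workers: int = 8) -> list[dict]:
    """Evaluate cases concurrently; pool.map returns results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate_case, test_inputs))


cases = [f"case-{i}" for i in range(20)]
start = time.perf_counter()
results = evaluate_batch(cases)
elapsed = time.perf_counter() - start
print(f"Evaluated {len(results)} cases in {elapsed:.2f}s")
```

Because `pool.map` preserves input order, the results can still be matched against the dataset rows by position.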

Automated LLM evaluation is a moving target. As models get better, your evaluation criteria will need to evolve. By anchoring your tests in Pydantic schemas and using LLM-as-a-judge to quantify performance, you’ll spend less time debugging and more time shipping features that actually work.

Frequently Asked Questions


How do you handle the "judge" being wrong?
We occasionally audit the judge's scores manually. If the judge is consistently mislabeling, we tweak the system prompt for the judge, not the target model.
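One simple way to put a number on a manual audit is to measure how often the judge lands close to a human reviewer. The helper below is a hypothetical sketch; the one-point tolerance and the sample scores are assumptions for illustration.

```python
def judge_agreement(
    judge_scores: list[int],
    human_scores: list[int],
    tolerance: int = 1,
) -> float:
    """Fraction of audited cases where the judge is within `tolerance` points of the human."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("Score lists must be the same length")
    if not judge_scores:
        raise ValueError("Nothing to audit")
    hits = sum(abs(j - h) <= tolerance for j, h in zip(judge_scores, human_scores))
    return hits / len(judge_scores)


# Hypothetical audit sample: judge scores next to a reviewer's scores for the same outputs.
judge_scores = [5, 4, 2, 5, 1]
human_scores = [5, 3, 4, 5, 1]
agreement = judge_agreement(judge_scores, human_scores)
print(f"Judge agreement: {agreement:.0%}")  # Judge agreement: 80%
```

If the agreement rate falls after a change to the judge's system prompt, that change is the first thing to revert.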

Does this increase production latency?
No. This is an offline process. We run these evaluations in our CI/CD pipeline, not during the user's request/response cycle.

Can I use this for non-text tasks?
Yes, as long as you can define a Pydantic schema that represents the success criteria for your specific output type.
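As an illustration of what such a schema might look like, here is a hypothetical verdict for a structured-extraction task. The `ExtractionVerdict` model and its field names are invented for this example; the idea is only that success criteria other than a 1 to 5 score fit the same pattern.

```python
from pydantic import BaseModel, Field


class ExtractionVerdict(BaseModel):
    """Hypothetical judge verdict for a structured-extraction task."""
    fields_correct: int = Field(..., ge=0, description="Extracted fields matching the reference")
    fields_total: int = Field(..., gt=0, description="Fields present in the reference")
    schema_valid: bool = Field(..., description="True if the output parsed against the target schema")

    @property
    def accuracy(self) -> float:
        """Share of reference fields that were extracted correctly."""
        return self.fields_correct / self.fields_total


verdict = ExtractionVerdict(fields_correct=9, fields_total=12, schema_valid=True)
print(verdict.accuracy)  # 0.75
```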
