AI/ML | June 22, 2026 | 4 min read

LLM Data Enrichment: Building Robust Asynchronous Pipelines

LLM data enrichment pipelines require asynchronous processing to scale. Learn how to handle batch inference and enforce strict schemas for reliable results.

Tags: LLM, data-engineering, python, backend, architecture, AI-pipelines, AI, RAG, Prompt Engineering

Last month, I spent about three days debugging a production pipeline that was failing every time we tried to enrich a batch of 5,000 customer records. We were hitting API rate limits, our database connections were timing out, and the LLM occasionally decided that "null" was a valid response for a non-nullable field.

If you're trying to move beyond simple chat interfaces, you quickly realize that LLM data enrichment isn't just about prompt engineering. It’s about building a resilient system that treats LLM outputs as unreliable inputs until proven otherwise.

Why Synchronous LLM Pipelines Fail

When you start, it’s tempting to fire off an LLM request inside a request-response cycle. Don't. Even with a fast model like GPT-4o-mini or Claude 3.5 Haiku, you’re looking at hundreds of milliseconds to multiple seconds per inference. If your user is waiting for that page to load, you’ve already lost them.

We initially tried wrapping our enrichment logic in a standard API endpoint. It worked for two records. By the time we hit ten, the latency spikes triggered our load balancer's timeout threshold, resulting in a cascade of 504 errors. We needed to move to an asynchronous architecture where the LLM inference happens in the background, away from the user’s critical path.
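
To make the shift concrete, here is a minimal sketch of "enqueue and return immediately" using only `asyncio`. It is not our production code: `submit`, `worker`, `fake_llm_enrich`, and the in-memory `results` dict are hypothetical stand-ins for the request handler, the background consumer, the LLM call, and the database.

```python
import asyncio
import uuid

results: dict[str, str] = {}  # job_id -> enriched output (stand-in for a database table)


async def fake_llm_enrich(record: str) -> str:
    """Placeholder for the slow LLM call."""
    await asyncio.sleep(0.01)
    return record.upper()


async def submit(queue: asyncio.Queue, record: str) -> str:
    """What the request handler does: enqueue and return a job id immediately."""
    job_id = str(uuid.uuid4())
    await queue.put((job_id, record))
    return job_id


async def worker(queue: asyncio.Queue) -> None:
    """Background consumer: runs outside the request/response cycle."""
    while True:
        job_id, record = await queue.get()
        try:
            results[job_id] = await fake_llm_enrich(record)
        finally:
            queue.task_done()


async def main() -> list[str]:
    queue: asyncio.Queue = asyncio.Queue()
    workers = [asyncio.create_task(worker(queue)) for _ in range(3)]
    job_ids = [await submit(queue, f"record-{i}") for i in range(10)]
    await queue.join()  # wait until every queued job has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return job_ids
```

The caller gets a job id back without waiting on inference, and can poll or be notified once the result lands. In a real deployment the queue would live in a broker rather than in process memory.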

Designing for Asynchronous LLM Pipelines

For production-grade LLM data enrichment, I’ve found that a producer-consumer model is the only way to keep the system stable. Here is the pattern I settled on:

Queueing: Push the raw data records into a message broker (we use Redis with BullMQ, but RabbitMQ works fine too).
Worker Pool: Spin up a cluster of workers that pull from the queue.
Throttling: Implement a token bucket or fixed-window rate limiter to ensure you don't exceed your LLM provider's RPM (requests per minute) limits.
Persistence: Write the results to a "pending" table in your database before marking the job as complete.
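
The throttling step can be sketched as a small token bucket. This is an illustrative, single-process version; the class name, its parameters, and the injectable `clock` are assumptions made for the example, not part of any provider SDK.

```python
import time


class TokenBucket:
    """Allow roughly `rate_per_minute` requests per minute, with bursts up to `capacity`."""

    def __init__(self, rate_per_minute: float, capacity: int, clock=time.monotonic):
        self.refill_per_second = rate_per_minute / 60.0
        self.capacity = capacity
        self.tokens = float(capacity)
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.last_refill = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last check, up to capacity.
        now = self.clock()
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last_refill = now

    def try_acquire(self) -> bool:
        """Take one token if available; return False if the caller should wait."""
        self._refill()
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

    def seconds_until_available(self) -> float:
        """How long a worker should sleep before the next token is ready."""
        self._refill()
        if self.tokens >= 1.0:
            return 0.0
        return (1.0 - self.tokens) / self.refill_per_second
```

A worker would call `try_acquire()` before each LLM request and sleep for `seconds_until_available()` when it returns `False`. With several worker processes, the bucket state would need to live somewhere shared (Redis, for example) rather than in each process.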

This setup allows you to process thousands of records without blocking your main application. If the LLM provider experiences a transient dip, your workers can simply retry the job with exponential backoff.
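
As a rough illustration of retrying with exponential backoff, the sketch below doubles the delay bound on each attempt and adds jitter. `TransientError` is a hypothetical exception that stands in for whatever your client raises on rate limits or timeouts, and the default values are placeholders.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (rate limits, timeouts, 5xx)."""


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Upper bound for the delay after failed attempt number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def call_with_retry(fn, max_attempts: int = 3, base: float = 1.0, cap: float = 60.0,
                    sleep=time.sleep, rng=random.random):
    """Call `fn` until it succeeds or `max_attempts` is exhausted."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller route the job elsewhere
            # "Full jitter": sleep a random fraction of the exponential bound so
            # workers that failed together do not retry in lockstep.
            sleep(rng() * backoff_delay(attempt, base, cap))
```

Most brokers offer retry and backoff settings of their own; a hand-rolled loop like this is mainly useful when the retry has to happen inside a single job.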

Enforcing Structured Data Extraction

The biggest headache in batch LLM inference is output variance. Even when you ask for JSON, models occasionally wrap the output in markdown code fences or include conversational filler like "Here is the data you requested."

You must treat the LLM output as untrusted and enforce a strict schema. We use Pydantic models in Python or Zod in TypeScript to validate every single response. If the validation fails, the worker logs the error, captures the raw output for debugging, and moves to the next item. As I’ve discussed in "Structured output: Implementing Deterministic JSON Schema Validation", this layer of validation is non-negotiable for production stability.

Here is a simplified example of how we handle the enrichment task:


PYTHON
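import json

from openai import OpenAI
from pydantic import BaseModel, ValidationError

# Reads OPENAI_API_KEY from the environment.
client = OpenAI()

class EnrichmentError(Exception):
    """Raised when the model output is not valid JSON or fails schema validation."""

class UserProfile(BaseModel):
    sentiment: str
    category: str
    summary: str

def enrich_record(raw_data: str) -> UserProfile:
    try:
        response = client.chat.completions.create(
            model="gpt-4o",
            # JSON mode requires the word "JSON" to appear in the messages.
            messages=[{
                "role": "user",
                "content": "Extract sentiment, category, and summary as a JSON "
                           f"object from: {raw_data}",
            }],
            # JSON mode yields syntactically valid JSON; it does not enforce the schema.
            response_format={"type": "json_object"},
        )
        data = json.loads(response.choices[0].message.content)
        return UserProfile(**data)
    except (ValidationError, json.JSONDecodeError) as e:
        # Surface the failure so the worker can log the raw output and retry.
        raise EnrichmentError(f"Schema mismatch: {e}") from e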
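
When a provider or model has no JSON mode, the markdown wrappers and conversational filler mentioned earlier can be stripped before validation. The helper below is a best-effort sketch using only the standard library; `extract_json_object` is a hypothetical name, and it only handles a single top-level object.

```python
import json
import re

FENCE = "`" * 3  # a markdown code fence, built indirectly to keep this snippet readable
_FENCE_RE = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_json_object(raw: str) -> dict:
    """Best-effort recovery of a single JSON object from a chatty model reply."""
    text = raw.strip()
    fenced = _FENCE_RE.search(text)
    if fenced:
        # Prefer the contents of a fenced block when the model emitted one.
        text = fenced.group(1).strip()
    else:
        # Fall back to the outermost braces, dropping text such as
        # "Here is the data you requested:" before or after the object.
        start, end = text.find("{"), text.rfind("}")
        if start == -1 or end <= start:
            raise ValueError("no JSON object found in model output")
        text = text[start:end + 1]
    parsed = json.loads(text)  # raises json.JSONDecodeError on malformed JSON
    if not isinstance(parsed, dict):
        raise ValueError("expected a JSON object at the top level")
    return parsed
```

The returned dict still has to go through schema validation; this step only removes the wrapping, it does not make the content trustworthy.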

Managing Batch LLM Inference Costs and Latency

When you scale up to millions of rows, cost becomes the primary constraint. We found that by caching common prompts and using smaller, cheaper models for "easy" classification tasks, we saved roughly 40% on our monthly bill.
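
Here is one way those two savings could look in code, reading "caching common prompts" as reusing the response for an identical (model, prompt) pair. That reading is an assumption, as are the `PromptCache` class and the routing rule in `pick_model`.

```python
import hashlib
import json


class PromptCache:
    """In-memory cache keyed on a hash of (model, prompt).

    A dict is used for illustration; a shared store would be needed across workers.
    """

    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(model: str, prompt: str) -> str:
        payload = json.dumps({"model": model, "prompt": prompt}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_call(self, model: str, prompt: str, call_llm):
        """Return the cached response, or call `call_llm(model, prompt)` and store it."""
        k = self.key(model, prompt)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        self.misses += 1
        result = call_llm(model, prompt)
        self._store[k] = result
        return result


def pick_model(task: str) -> str:
    """Hypothetical routing rule: cheaper model for simple classification tasks."""
    easy_tasks = {"sentiment", "category"}
    return "gpt-4o-mini" if task in easy_tasks else "gpt-4o"
```

Tracking `hits` and `misses` makes it easy to check whether the cache is actually earning its keep on your data.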

Another trick is to batch multiple records into a single prompt when possible. Instead of one call per record, we sometimes send five, asking the model to return a list of JSON objects. Just be careful: as the context length increases, the likelihood of a schema validation error rises. If a single object in the list fails validation, you might have to discard the entire batch, which is a trade-off you need to account for.
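
One way to soften that trade-off is to validate each item separately and retry only the records that failed. The sketch below is an alternative to discarding the whole batch, not a description of our pipeline. It uses a hand-rolled check as a stand-in for Pydantic or Zod, and the `id` field is an assumption added so results can be matched back to inputs.

```python
import json

# Assumed output shape: the three UserProfile fields plus an "id" field
# (an addition for this sketch) so results can be matched back to inputs.
REQUIRED_FIELDS = ("id", "sentiment", "category", "summary")


def build_batch_prompt(records: dict[str, str]) -> str:
    """Render several records into one prompt that asks for a JSON list."""
    lines = [f"[{record_id}] {text}" for record_id, text in records.items()]
    return (
        "For each record below, return one JSON object with the keys "
        + ", ".join(REQUIRED_FIELDS)
        + ". Respond with a JSON list only.\n"
        + "\n".join(lines)
    )


def is_valid_item(item) -> bool:
    """Stand-in for a schema check: every required key is a non-empty string."""
    return isinstance(item, dict) and all(
        isinstance(item.get(field), str) and item[field] for field in REQUIRED_FIELDS
    )


def split_batch_result(raw_output: str, expected_ids: list[str]):
    """Return (valid items keyed by id, ids that need a retry)."""
    try:
        items = json.loads(raw_output)
    except json.JSONDecodeError:
        return {}, list(expected_ids)  # nothing is salvageable
    if not isinstance(items, list):
        return {}, list(expected_ids)
    valid = {}
    for item in items:
        if is_valid_item(item) and item["id"] in expected_ids:
            valid[item["id"]] = item
    retry = [record_id for record_id in expected_ids if record_id not in valid]
    return valid, retry
```

Records that come back in `retry` can be re-queued individually, so one malformed object costs a single extra call instead of five.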

If you’re doing this at scale, "LLM Streaming Structured Data: Real-Time Parsing Guide" might seem like overkill, but the lessons on parsing partial structures apply surprisingly well to large-scale batch processing, where you might want to extract data as it arrives.

What I’d Do Differently

Looking back, I wish we had implemented a "dead letter queue" (DLQ) from day one. When a job fails three times, it shouldn't just vanish into the logs. It should go to a DLQ where we can inspect it, tweak the prompt, and replay the jobs.
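
The idea can be sketched with two in-memory queues. Real brokers usually provide their own mechanism for failed jobs, so treat `Job`, `JobQueues`, and `MAX_ATTEMPTS` as hypothetical names that only illustrate the flow: fail, count the attempt, park the job after the third failure, replay later.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_ATTEMPTS = 3  # matches the "fails three times" threshold described above


@dataclass
class Job:
    record_id: str
    payload: str
    attempts: int = 0
    errors: list = field(default_factory=list)


class JobQueues:
    """In-memory stand-in for a broker's main queue and dead letter queue."""

    def __init__(self):
        self.main = deque()
        self.dead_letter = deque()

    def process_next(self, handler) -> None:
        job = self.main.popleft()
        try:
            handler(job)
        except Exception as exc:  # broad on purpose: any failure counts as an attempt
            job.attempts += 1
            job.errors.append(repr(exc))
            if job.attempts >= MAX_ATTEMPTS:
                self.dead_letter.append(job)  # keep it for inspection
            else:
                self.main.append(job)  # back of the queue for another attempt

    def replay_dead_letters(self) -> int:
        """Move every dead-lettered job back to the main queue with a fresh attempt count."""
        count = len(self.dead_letter)
        while self.dead_letter:
            job = self.dead_letter.popleft()
            job.attempts = 0
            self.main.append(job)
        return count
```

Keeping the error history on the job itself means that whoever inspects the DLQ can see why each attempt failed before deciding whether a prompt tweak is enough to replay it.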

I’m also still struggling with the "drift" problem. When we update a prompt, the output shape we validated against six months ago can shift slightly. We’re currently exploring automated evaluation pipelines to catch these regressions before they hit production, as outlined in "LLM evaluation pipelines: Building automated tests with LangSmith".

LLM data enrichment is inherently messy. You’ll never get 100% accuracy, but by moving to an asynchronous, schema-validated pipeline, you turn a chaotic process into an observable, manageable engineering task.
