AI/ML · June 23, 2026 · 4 min read

LLM Agents: Implementing Reflection Patterns for Better Reasoning

LLM agents need reflection patterns to catch errors before they reach your users. Learn how to implement self-correction loops for more reliable AI workflows.

Tags: LLM agents, chain of thought, recursive reasoning, agentic workflows, self-correction, AI engineering, AI, LLM, RAG, Prompt Engineering

Last month, I spent about three days debugging a customer-facing agent that kept hallucinating SQL queries. It would generate a syntactically correct query, but one that referenced non-existent columns because it "assumed" the schema. I realized that my single-pass prompt was doomed; I needed to move from simple generation to a system that actually thinks.

Implementing LLM agents that rely on a single inference call is fine for chatbots, but for complex data extraction or code generation, it’s a recipe for production fires. You need to introduce a feedback loop.

Why You Need Reflection Patterns

When we talk about agentic workflows, we’re really talking about moving away from "write once, hope it works" and toward "write, critique, refine." The goal isn't to get the model to be perfect on the first try—it's to give the model the tools to identify its own failure modes.

We first tried adding a "be accurate" instruction to the system prompt, which resulted in a massive 0% improvement in accuracy. It turns out that telling a model to be smarter doesn't change its underlying reasoning process. We needed to explicitly implement chain-of-thought patterns, where the model is forced to output its reasoning before the final result, and then verify that reasoning against the available context.
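As a minimal sketch of "reason first, then verify," the snippet below asks for the reasoning and the columns relied on as JSON, then checks those columns against the schema in plain code. The prompt wording, the JSON keys, and the table and column names are all assumptions made for illustration, and the canned reply stands in for a real model call.

```python
import json

# Hypothetical prompt template: the reasoning comes before the final SQL.
COT_PROMPT = (
    "Answer in JSON with keys 'reasoning', 'columns_used', and 'sql'.\n"
    "Write the reasoning first and list every column you rely on.\n"
    "Question: {question}\nSchema columns: {columns}"
)


def verify_reasoning(raw_reply, schema_columns):
    """Return (sql, unknown_columns) after checking the model's stated columns."""
    reply = json.loads(raw_reply)
    unknown = sorted(set(reply["columns_used"]) - set(schema_columns))
    return reply["sql"], unknown


# Canned reply standing in for a real model call (illustrative data only).
fake_reply = json.dumps({
    "reasoning": "Net revenue per user needs orders.total minus orders.discount.",
    "columns_used": ["users.id", "orders.user_id", "orders.total", "orders.discount"],
    "sql": (
        "SELECT users.id, SUM(orders.total - orders.discount) "
        "FROM users JOIN orders ON orders.user_id = users.id GROUP BY users.id"
    ),
})

schema_columns = ["users.id", "users.email", "orders.user_id", "orders.total"]
sql, unknown = verify_reasoning(fake_reply, schema_columns)
print(unknown)  # columns the model assumed but the schema does not have
```

This check trusts the model's own list of columns. A stricter variant would parse the SQL itself, but even this version turns "the model assumed the schema" into something code can detect.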

Building a Basic Reflection Loop

To fix my SQL hallucination problem, I refactored the pipeline into a two-step process. First, the model generates the draft. Second, a "critic" pass inspects that draft against the schema. If the critic finds a mismatch, it triggers a retry with the error message.

Here’s a simplified version of what that looks like in Python, using a single critique-and-fix pass:


PYTHON
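def generate_with_reflection(llm, user_query, schema):
    """Draft SQL, critique it against the schema, and repair it once if needed.

    `llm` is assumed to be any client whose invoke(prompt) returns a string.
    """
    # Step 1: Draft the response
    draft = llm.invoke(f"Generate SQL for: {user_query}\nSchema: {schema}")

    # Step 2: Self-correction. Ask for an explicit verdict so the check is reliable.
    critique = llm.invoke(
        f"Critique this SQL: {draft}\nDoes it match this schema? {schema}\n"
        "Start your reply with ERROR if anything is wrong, otherwise with OK."
    )

    if critique.strip().upper().startswith("ERROR"):
        # Step 3: Repair pass. Send the draft and schema too, not just the critique.
        return llm.invoke(
            f"Fix this SQL: {draft}\nCritique: {critique}\nSchema: {schema}"
        )

    return draft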

This pattern is the foundation for the more advanced approach in "LLM agents self-correction: Building Recursive Feedback Loops," where you chain these evaluations to ensure the output meets specific constraints.

Scaling to Complex Tasks

As your requirements grow, simple loops aren't enough. You’ll find yourself needing to handle state. I’ve found that using a library like LangGraph or just simple structured state machines helps keep the recursive reasoning manageable.
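To make "simple structured state machine" concrete, here is a sketch in plain Python rather than LangGraph. The `ReflectionState` fields, the step names, and the stub steps are assumptions for illustration. Each step mutates the shared state and returns the name of the next step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReflectionState:
    """Everything the loop needs to carry between steps."""
    query: str
    draft: str = ""
    critique: str = ""
    attempts: int = 0
    history: List[str] = field(default_factory=list)


def run_state_machine(state, steps, start="draft", max_steps=10):
    """Run step functions until one returns 'done' or max_steps is reached."""
    current = start
    for _ in range(max_steps):
        state.history.append(current)
        current = steps[current](state)
        if current == "done":
            break
    return state


def draft_step(state):
    state.attempts += 1
    # Stub: a real step would call the model here.
    state.draft = f"draft v{state.attempts} for {state.query}"
    return "critique"


def critique_step(state):
    # Stub critic: rejects the first attempt, accepts the second.
    state.critique = "" if state.attempts >= 2 else "unknown column"
    return "draft" if state.critique else "done"


steps = {"draft": draft_step, "critique": critique_step}
final = run_state_machine(ReflectionState(query="top customers"), steps)
print(final.history)
```

Keeping the transitions in a dictionary makes it cheap to add steps later, such as a validation or escalation step, without rewriting the loop.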

When you move into these more complex flows, remember that you’re essentially trading latency for reliability. My SQL agent went from ~400ms per request to about 1.2s, but the error rate dropped from 15% to under 2%. That’s a trade-off I’ll take every day in production.

If the task is truly high-stakes, you shouldn't rely solely on the model to catch its own mistakes. Consider the approach in "Implementing LLM Human-in-the-Loop for High-Stakes Workflows" to handle the edge cases where the AI’s confidence score is suspiciously low.
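One way to wire in that escalation is a small routing function. This is a sketch under assumptions: the `AgentResult` shape, the idea that the critic produces a confidence value between 0 and 1, and both thresholds are illustrative rather than taken from the pipeline described here.

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    output: str
    confidence: float  # assumed to come from your critic, in [0, 1]
    retries: int       # correction cycles already spent


def route(result, min_confidence=0.8, max_retries=2):
    """Return 'auto' to ship the output or 'human_review' to escalate."""
    if result.confidence < min_confidence:
        return "human_review"
    if result.retries >= max_retries:
        # The agent used up its retry budget, so treat the output as shaky.
        return "human_review"
    return "auto"


print(route(AgentResult("SELECT 1", confidence=0.95, retries=0)))
print(route(AgentResult("SELECT 1", confidence=0.40, retries=0)))
```

Routing on retries as well as confidence catches the case where the critic eventually said "fine" only after several failed attempts.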

The Pitfalls of Over-Engineering

A common mistake I see developers make is building a "reflexive loop" that never ends. If your model gets stuck in a loop of "I made a mistake, let me fix it" -> "That's still wrong" -> "Let me try again," you’re going to burn through your token budget in seconds.

Always implement:

Hard limits: Never allow more than 3 correction cycles.
Context pruning: Don't feed the entire conversation history back into the reflection step; just send the draft and the specific error.
Structured outputs: Force the model to return the critique as JSON that conforms to a fixed schema so you can parse it programmatically.
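The three safeguards above fit into one bounded loop. The sketch below assumes `llm` is a plain callable that takes a prompt string and returns a string. The prompt wording, the `ok`/`error` keys, and the `ScriptedLLM` test double are hypothetical.

```python
import json

MAX_CYCLES = 3  # hard limit on correction cycles


def parse_verdict(raw):
    """Parse the critic's JSON reply; treat anything unparseable as a failure."""
    try:
        verdict = json.loads(raw)
    except json.JSONDecodeError:
        verdict = None
    if not isinstance(verdict, dict):
        return {"ok": False, "error": "critique was not a JSON object"}
    return verdict


def reflect(llm, task, context):
    """Draft once, then critique and fix at most MAX_CYCLES times.

    Returns (draft, passed). When passed is False the last fix was never
    re-checked, so the caller should treat the draft as unverified.
    """
    draft = llm(f"Task: {task}\nContext: {context}")
    for _ in range(MAX_CYCLES):
        raw = llm(
            "Review the draft against the context. Reply with JSON only: "
            '{"ok": true|false, "error": "<one sentence>"}\n'
            f"Context: {context}\nDraft: {draft}"
        )
        verdict = parse_verdict(raw)
        if verdict.get("ok") is True:
            return draft, True
        # Context pruning: the fix prompt carries only the draft and the error.
        draft = llm(
            f"Fix this draft.\nDraft: {draft}\n"
            f"Error: {verdict.get('error', '')}\nContext: {context}"
        )
    return draft, False


class ScriptedLLM:
    """Test double that replays canned replies and counts calls."""

    def __init__(self, replies):
        self.replies = list(replies)
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return self.replies.pop(0)


fake = ScriptedLLM([
    "SELECT nme FROM users",
    '{"ok": false, "error": "column nme does not exist"}',
    "SELECT name FROM users",
    '{"ok": true, "error": ""}',
])
result, passed = reflect(fake, "list user names", "users(id, name)")
print(result, passed, fake.calls)
```

A critic that never approves costs at most one draft call plus two calls per cycle, so the worst-case spend per request is known up front.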

When you're building these systems, also keep an eye on how you handle tool selection. If your agent is reflecting on tool output, make sure you're using strict schema validation, as described in my guide "LLM Function Calling: A Guide to Dynamic Tool Selection."
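As a rough illustration of strict validation, the sketch below checks a tool payload against a hand-written spec using only the standard library. The `run_sql` tool and its fields are hypothetical. A production system would more likely use a JSON Schema validator or typed models instead.

```python
# Hypothetical tool registry: tool name -> expected field names and types.
TOOL_SPECS = {
    "run_sql": {"query": str, "limit": int},
}


def validate_tool_call(name, args):
    """Return a list of problems; an empty list means the payload is valid."""
    spec = TOOL_SPECS.get(name)
    if spec is None:
        return [f"unknown tool: {name}"]
    problems = [f"missing argument: {key}" for key in spec if key not in args]
    # Strict mode: reject fields the spec does not mention.
    problems += [f"unexpected argument: {key}" for key in args if key not in spec]
    problems += [
        f"{key} should be {expected.__name__}"
        for key, expected in spec.items()
        if key in args and not isinstance(args[key], expected)
    ]
    return problems


print(validate_tool_call("run_sql", {"query": "SELECT 1", "limit": "10", "db": "x"}))
```

Returning the problems as a list, rather than raising on the first one, gives the reflection step a complete error message to feed back to the model.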

What I'm Still Figuring Out

I’m currently experimenting with "multi-agent reflection," where one model acts as the architect and another as the code reviewer. It’s significantly more expensive, but the reasoning depth is impressive.

However, I'm still not convinced that the added complexity is worth it for every feature. Sometimes a well-engineered prompt and a single validation step are all you need. Don't fall into the trap of adding agentic layers just because they're trendy. Start with one reflection step, measure the impact on your error rate, and only add more if the data justifies the cost.
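Measuring the impact can be as simple as running both variants over the same labeled cases. The harness below is a sketch: the stub pipelines and the four test cases are made up to show the mechanics, not real results.

```python
def error_rate(pipeline, cases, is_correct):
    """Fraction of cases where pipeline(question) fails the is_correct check."""
    if not cases:
        raise ValueError("need at least one test case")
    failures = sum(
        1 for question, expected in cases
        if not is_correct(pipeline(question), expected)
    )
    return failures / len(cases)


# Illustrative cases and stub pipelines standing in for real agents.
CASES = [("q1", "a1"), ("q2", "a2"), ("q3", "a3"), ("q4", "a4")]


def single_pass(question):
    return {"q1": "a1", "q2": "a2"}.get(question, "wrong")


def with_reflection(question):
    return {"q1": "a1", "q2": "a2", "q3": "a3"}.get(question, "wrong")


def exact(actual, expected):
    return actual == expected


baseline = error_rate(single_pass, CASES, exact)
reflected = error_rate(with_reflection, CASES, exact)
print(f"single pass: {baseline:.0%}, with reflection: {reflected:.0%}")
```

Run it before and after each added layer; if the error rate doesn't move, the extra latency and tokens aren't paying for themselves.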
