RAG pipelines often fail when user queries don't match index terminology. Learn how LLM-powered semantic query rewriting fixes this to boost recall.
Last month, I spent three days debugging a search feature that kept returning empty results for common user questions. Our users were asking about "billing cycles," but our internal documentation used the term "subscription cadence." The vector embeddings weren't capturing the synonymy, and our basic keyword fallback was too rigid. We needed a way to bridge this gap without re-indexing our entire knowledge base.
That’s where LLM-powered semantic query rewriting comes in. By using a small, fast model to transform a raw user input into a search-optimized query, you can significantly improve the recall of your RAG pipelines.
In most vector-based systems, the bottleneck isn't the embedding model itself—it’s the mismatch between how a human asks a question and how a document is indexed. If your user asks, "Why is my account locked?", but your technical logs say "User authentication suspended due to policy violation," the cosine similarity between those two vectors might be low.
We first tried Hybrid Search in RAG Pipelines: Boosting Retrieval Accuracy to solve this, which helped, but it didn't solve the semantic intent gap. We needed to translate the user’s intent into the language of the documents before the retrieval step.
I recommend using a lightweight model like gpt-4o-mini or llama-3.1-8b-instruct for this task. You don't need a massive model to rewrite a query; you need speed.
Here’s a simple prompt pattern that works well:
TEXTSystem: You are an expert search assistant. Your goal is to rewrite the user's query to be more descriptive and use terminology found in technical documentation. Keep the rewrite concise. Do not add conversational filler. User: "Why is my account locked?" Rewrite: "Reason for user authentication suspension and account access restriction."
By injecting this step into your retrieval chain, you’re essentially performing a real-time translation between natural language and domain-specific jargon.
When building this, keep your latency budget in mind. Adding an LLM call adds roughly 200ms to 400ms to your retrieval time. If that's too much, consider Semantic caching for RAG pipelines: Cut latency and costs to avoid re-running the rewrite for common queries.
Here is the basic flow of the pipeline:
Don't over-engineer the prompt. I’ve seen developers try to force the LLM to output JSON or specific search operators. Keep it simple. If the model gets too creative, it might hallucinate terms that don't exist in your index, leading to "ghost" search results that return nothing.
Another issue is "query drift." If your rewrite is too far from the original intent, your recall will actually suffer. Always log both the original and rewritten queries. If you notice the rewrite is consistently ignoring key entities (like specific product names or error codes), adjust your system prompt to explicitly include those entities.
Not every query needs a rewrite. If the user query is already highly specific—like a product SKU or a specific error ID—the rewrite step is just wasted compute. We implemented a simple heuristic: if the query is under 5 words and contains no complex nouns, we skip the LLM call entirely.
Does query rewriting replace the need for fine-tuned embeddings? No. It complements them. Embeddings handle the vector space geometry; rewriting handles the vocabulary mismatch.
How do I measure if this is working? Track your "Recall@K." If your retrieval success rate at the top 5 results increases after adding the rewrite step, you’re on the right track.
What happens if the LLM rewrite is wrong? This is the biggest risk. If the rewrite is bad, the search is bad. Always keep the original user query in your retrieval set as a fallback, or use a "multi-query" approach where you search for both the original and the rewritten version simultaneously.
I'm still experimenting with whether it’s better to generate one high-quality rewrite or three slightly different variations to capture multiple angles. Sometimes, a single, highly accurate rewrite is better than a noisy set of three. For now, I'm sticking to the single-rewrite approach to keep latency predictable, but I suspect that as we scale, a more nuanced multi-query strategy will be necessary.
Building a small RAG pipeline is the fastest way to ground LLMs in your data. Learn the end-to-end process of indexing, retrieval, and generation.