The Reranker and the Power of the Prompt
Welcome to the fourth installment of our full RAG (Retrieval-Augmented Generation) project. If you’ve followed the series so far, you already know how to prepare documents, build the index, and retrieve relevant results using multiple strategies.
Now it’s time to go one level deeper: filtering, refining, and shaping the final answer. This is where one of the most important pieces of the RAG puzzle comes into play: the reranker.
What Is a Reranker, and Why Do We Need It?
In the retrieval phase, especially when using a combined retriever like get_mergeRetriever(), you get a list of documents that might be relevant. But like any similarity-based system, there's often a lot of "noise": documents that match on surface-level keywords but don't really answer the question.
This is where the reranker steps in. More specifically, a CrossEncoder model performs a second, much more accurate pass. Unlike typical embedding-based retrievers, which encode the question and each document separately and then compare vectors, a CrossEncoder reads the question and the full document text together and assigns a much more precise relevance score.
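To make that concrete, here is a minimal standalone sketch of cross-encoder scoring, outside the project code (using sentence-transformers directly; the example documents and printed scores are purely illustrative):

from sentence_transformers import CrossEncoder

# Load the same model used later in the article.
cross_encoder = CrossEncoder("mixedbread-ai/mxbai-rerank-xsmall-v1")

question = "What impact do electric vehicles have on the power grid?"
docs = [
    "EV charging at peak hours increases load on local distribution grids.",
    "Electric vehicles use lithium-ion batteries with high energy density.",
]

# Each (question, document) pair is scored jointly; higher means more relevant.
scores = cross_encoder.predict([(question, doc) for doc in docs])
print(scores)  # the grid-load document should score higher than the battery one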
In our implementation:
from langchain_community.cross_encoders import HuggingFaceCrossEncoder

model = HuggingFaceCrossEncoder(model_name="mixedbread-ai/mxbai-rerank-xsmall-v1")
This small but capable model is trained to rank question-document pairs. We wrap it inside a CrossEncoderReranker, and then use a ContextualCompressionRetriever, which:
- Retrieves documents using our base retriever.
- Filters by score (filter_by_score=True).
- Automatically discards low-quality results using a defined threshold (score_threshold=1.5).
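A sketch of that wiring follows. Note that the score-filtering parameters above (filter_by_score, score_threshold) appear to be part of the project's own implementation; the stock LangChain CrossEncoderReranker exposes model and top_n, so treat this as an approximation rather than the exact code:

from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import CrossEncoderReranker

# model is the HuggingFaceCrossEncoder defined above.
# Keep only the top-ranked documents after the cross-encoder pass.
reranker = CrossEncoderReranker(model=model, top_n=3)

# base_retriever is assumed to be the combined retriever from the previous
# post (e.g., the result of get_mergeRetriever()).
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever,
)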
The result: only the most relevant documents survive. This boosts the precision of the final answer and reduces hallucination risk from irrelevant content.
The Prompt: Your Secret Weapon
Once you’ve got the right documents, the next step is to properly frame the question. This is where the prompt becomes essential. The prompt acts as a bridge between the retrieved knowledge and the generation engine.
Here’s how we define the QA prompt in our pipeline:
qa_template = """You are an assistant answering questions based on the provided document.
Reproduce exactly the part of the document where the answer appears.
{context}
Question: {question}
Answer:"""
Why This Prompt Format?
- Full Control: It tells the model to stick strictly to the provided content—no guessing.
- Transparency: It encourages reproducing the exact section where the answer lives. That builds trust.
- Focused Context: Because the reranker has already trimmed the noise, the prompt remains concise and effective.
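To connect this prompt to the pipeline, a minimal sketch might look like the following (the RetrievalQA wiring and the llm variable are assumptions for illustration; see the repository for the project's actual chain):

from langchain.chains import RetrievalQA
from langchain_core.prompts import PromptTemplate

qa_prompt = PromptTemplate(
    template=qa_template,
    input_variables=["context", "question"],
)

# llm is assumed to be any LangChain-compatible model; compression_retriever
# is the reranking retriever sketched earlier.
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=compression_retriever,
    chain_type="stuff",
    chain_type_kwargs={"prompt": qa_prompt},
)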
What If Multiple Documents Are Relevant?
For that, we use a second prompt to combine and summarize content when needed:
combine_custom_prompt = '''
You are an assistant combining answers found in the retrieved documents.
Summarize the relevant information separately for each document in a bullet-point list.
If a document contains no relevant info, skip it.
Order the information to be clear and concise.
Text:`{context}`
'''
This prompt is designed to:
- Present the response in a structured way.
- Respect document separation (for traceability).
- Automatically discard irrelevant content.
Again: clarity, control, and relevance.
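One possible way to apply this second prompt, sketched under the assumption that the reranked documents are simply concatenated into {context} (the repository may route it through a combine-documents chain instead):

from langchain_core.prompts import PromptTemplate

combine_prompt = PromptTemplate.from_template(combine_custom_prompt)

# question is the user's query; number each document so the model can keep
# sources separate in its bullet list.
docs = compression_retriever.invoke(question)
context = "\n\n".join(
    f"Document {i + 1}:\n{doc.page_content}" for i, doc in enumerate(docs)
)
summary = llm.invoke(combine_prompt.format(context=context))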
End-to-End Flow (So Far)
Let’s recap the full flow with everything in place:
- User question → e.g., "What impact do electric vehicles have on the power grid?"
- Mixed Retriever → Retrieves potentially relevant documents.
- CrossEncoder Reranker → Ranks and filters based on real semantic relevance.
- QA Prompt → Generates answer using the cleaned and focused context.
- Combine Prompt (optional) → Structures a clear summary from multiple sources.
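In code, once the chain is assembled, the whole flow reduces to a single call (again, assumed wiring based on the sketches above):

question = "What impact do electric vehicles have on the power grid?"

# Retrieval, reranking, and generation all happen inside the chain.
result = qa_chain.invoke({"query": question})
print(result["result"])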
Conclusion: Precision First, Generation Second
Many people focus heavily on tuning the language model itself. But in RAG, the quality of the retrieved documents and the prompt are what truly drive useful answers.
The reranker ensures that only meaningful content reaches the LLM. The prompt makes sure that the model understands the task, the context, and its limits.
This isn’t about beautiful language. It’s about correct, grounded answers.
See the full code here: https://github.com/dorapps/RAG_Project