Retrieval Augmented Generation (RAG) is a technique for enhancing the accuracy and reliability of your large language model (LLM), and building a proof of concept (POC) with RAG is fast and simple. As our own introductory article and additional research have shown, RAG is well worth pursuing.
The key to success, though, is building a reliable and sustainable RAG-based solution—and evaluation is the way.
Evaluation ensures your architectural choices are data-driven and that your parameters are set for optimal RAG performance. You can expect long-term benefits as well, since you'll end up with well-fitting prompts, accurate contexts, and confidence in future releases.
First things first: Let’s define our challenge
RAG solutions are a combination of retrieval and generation systems. The retrieval system matches queries against indexed data and extracts relevant information. Those results augment the query handed off to the generation system, which uses them as context to generate a response.
Popular RAG implementations use an embedding model like ada or davinci, a vector database like Pinecone, Chroma, FAISS, or Milvus, and similarity scores for retrieval. Retrieval can also be done with lexical models, or with a combination of the two, known as hybrid retrieval. For generation, RAG implementations use LLMs such as GPT-3.5, GPT-4, and Claude.
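To make the moving parts concrete, here is a minimal sketch of that flow in Python. The `embed`, `search`, and `generate` callables, as well as the default `top_k` and `min_score` values, are illustrative assumptions standing in for your embedding model, vector store, and LLM of choice, not any specific vendor API.

```python
from typing import Callable, List, Tuple

def answer_query(
    query: str,
    embed: Callable[[str], List[float]],                     # embedding model (e.g. ada)
    search: Callable[[List[float], int], List[Tuple[str, float]]],  # vector store lookup
    generate: Callable[[str], str],                          # generation LLM (e.g. GPT-4, Claude)
    top_k: int = 5,
    min_score: float = 0.75,
) -> str:
    """Retrieve relevant chunks for the query, then generate an answer using them as context."""
    hits = search(embed(query), top_k)                                   # retrieval step
    context = "\n---\n".join(text for text, score in hits if score >= min_score)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)                                              # generation step
```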
With so many variables, we must make some choices up front, namely a number of architectural decisions and parameters for both retrieval and generation.
Retrieval
- Input data preprocessing: Anything other than plain text needs to be parsed and preprocessed before being indexed. For example, parsing a PDF requires heuristics for word and paragraph splitting, which many parsing libraries or engines can handle, plus decisions on how to deal with tables and images. This step is the most critical, yet it is often underexplored. Any change here can influence the results and every remaining step.
- Chunking: Documents are usually too large to fit comfortably into the context of a query, especially since multiple search results may be needed to fully answer it. So we split them into chunks. First, we need to decide the right chunking strategy: chunk size, how to split documents, and how to handle overlap (a minimal chunking sketch follows this list).
- Architecture: To choose between semantic (embedding-based), lexical, or hybrid retrieval, we have to weigh considerations such as which embedding model (OpenAI's ada or davinci, Llama, and so on) gives the best scores. Other factors include the minimum similarity score for a chunk to count as relevant, and "top k", the maximum number of relevant chunks to retrieve; without these, our retrieval choices are incomplete.
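As an illustration of the chunking decisions above, here is a minimal sketch of fixed-size chunking with overlap. Splitting on raw characters and the default sizes are assumptions for the sake of the example; in practice you would often split on sentence or paragraph boundaries, which is exactly the kind of parameter evaluation helps you choose.

```python
from typing import List

def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character chunks with a sliding overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap   # step forward, keeping `overlap` characters of context
    return chunks
```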
Generation
- Prompt choice: What prompt best combines system preferences, extracted relevant document chunks, and user query?
- LLM choice: Which LLM generates the best responses? GPT-3.5, GPT-4, Claude, or Llama?
- Temperature: What is the optimal temperature for response generation? Temperature controls the diversity and style of the generated text (a short sketch combining prompt and temperature follows this list).
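The sketch below shows one way to combine a system prompt, retrieved chunks, and the user query into a single generation call, assuming the OpenAI Python SDK (v1.x). The prompt wording, model name, and temperature value are illustrative choices, which is precisely what the evaluation process should help you tune.

```python
from typing import List
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_answer(system_prompt: str, context_chunks: List[str], question: str,
                    model: str = "gpt-4", temperature: float = 0.2) -> str:
    """Combine system preferences, retrieved chunks, and the user query into one prompt."""
    user_prompt = (
        "Context:\n" + "\n---\n".join(context_chunks) +
        f"\n\nQuestion: {question}\nAnswer using only the context above."
    )
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,  # lower = more deterministic, higher = more diverse
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    return response.choices[0].message.content
```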
Every additional choice and parameter multiplies the number of experiments to run. For example, an optimal prompt for GPT-3.5 will not be the same as an optimal prompt for GPT-4.
Evaluating LLM systems: Define metrics
Users expect quality, meaningful responses when they make a query, and we need metrics to ensure we're delivering. Such metrics may include accuracy, fluency, coherence, relevance, bias, toxicity, timing, and token counts.
Since RAG is a two-part system, however, it’s important to measure both generation and retrieval quality. If retrieval fails, no LLM can answer the question correctly.
Retrieval Metrics
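As a rough sketch, common retrieval metrics such as precision@k, recall@k, and mean reciprocal rank (MRR) can be computed directly from the retrieved document ids and the known relevant ids; the specific metrics and function signatures below are examples, not an exhaustive list.

```python
from typing import List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved document ids that are actually relevant."""
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / k if k else 0.0

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant document ids that appear in the top-k results."""
    top_k = retrieved[:k]
    return sum(doc in relevant for doc in top_k) / len(relevant) if relevant else 0.0

def mrr(retrieved: List[str], relevant: Set[str]) -> float:
    """Reciprocal rank of the first relevant document (0 if none is retrieved)."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0
```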
Generation Metrics
Operational Metrics
Operational metrics are a separate group of useful metrics to track in order to understand expected costs and latency. These can include response timing (latency) and token counts, both of which translate directly into cost.
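A minimal way to capture these is to wrap each generation call and record latency and token counts, as sketched below. Using the tiktoken library for token counting and the `track_call` wrapper are assumptions; your LLM provider may already return token usage in its responses.

```python
import time
from typing import Callable
import tiktoken  # OpenAI's tokenizer library, used here only to count tokens

encoding = tiktoken.get_encoding("cl100k_base")

def track_call(prompt: str, generate: Callable[[str], str]) -> dict:
    """Wrap a single generation call and record latency and token counts."""
    start = time.perf_counter()
    answer = generate(prompt)
    latency_seconds = time.perf_counter() - start
    return {
        "latency_seconds": latency_seconds,
        "prompt_tokens": len(encoding.encode(prompt)),
        "completion_tokens": len(encoding.encode(answer)),
        "answer": answer,
    }
```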
Gather data
To evaluate RAG, we need to consider real use cases and determine data that measures both retrieval and generation.
Start by gathering documents for the use case: the context documents the user has access to, and the relevant documents needed to provide a complete answer to each query.
Here’s an example. A legal firm knows the answer to its query is within its file of PDF contracts signed by clients.
An example data point might be:
– Question: “Who signed the contract with company A?”
– Answer: “The contract with company A was signed by CEO of company A John Doe and our CEO Jane Doe.”
– Context documents: PDF collection of contracts (indices of documents in a data store)
– Relevant documents: signed contract with company A (index of document in a data store)
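In code, such a data point could be stored as a simple structured record; the field names and document ids below are an assumption about how you might organize your evaluation dataset, not a required schema.

```python
# One entry of an evaluation dataset (field names and document ids are illustrative).
evaluation_example = {
    "question": "Who signed the contract with company A?",
    "gold_answer": (
        "The contract with company A was signed by CEO of company A John Doe "
        "and our CEO Jane Doe."
    ),
    "context_document_ids": ["doc-001", "doc-002", "doc-003"],  # the PDF contract collection
    "relevant_document_ids": ["doc-002"],                       # the signed contract with company A
}
```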
Important note: data and documents should be private (especially for factual queries), to keep the LLM from answering questions out of its own implicit knowledge instead of from the retrieved context. Otherwise you can end up with overly positive generation metrics even when retrieval fails.
Now it’s time to evaluate
While computing retrieval metrics is straightforward, computing generation metrics is not. Assessing text for accuracy, fluency, coherence, relevance, bias, and toxicity requires a deep understanding of semantics, the context around the user query, and factual knowledge.
Language is ambiguous, there can be multiple equally accurate responses to the same question, and there is no formula or algorithm that computes those metrics automatically.
The usual metrics for evaluating dialogue systems, such as BLEU (Bilingual Evaluation Understudy) and ROUGE (Recall-Oriented Understudy for Gisting Evaluation), do not correlate strongly with human judgment. The most reliable way to evaluate is to get a team of annotators, preferably system users, and let them score system outputs. This is expensive and time-consuming; can we approximate human judgment instead?
One answer is LLM-as-a-judge: an LLM is asked to compare the generated responses to human annotations (the gold standard) and score them against the generation metrics. Unlike traditional dialogue-system metrics, these scores mostly agree with human ratings and are a reliable approximation of how people judge text accuracy, bias, and other criteria.
The approach is not perfect: LLMs have their own biases, notably a self-enhancement bias that leads them to score their own generations higher, but in practice the effect is negligible. The time and operational cost savings from using LLM-as-a-judge heavily outweigh these disadvantages.
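As a rough sketch of how LLM-as-a-judge can work, the prompt below asks a judge model to score a generated answer against the gold answer on a 1 to 5 accuracy scale. The prompt wording, the scale, and the `call_llm` callable are all assumptions you would adapt to your own judge model and metrics.

```python
from typing import Callable

JUDGE_PROMPT = """You are grading a RAG system's answer against a reference answer.
Question: {question}
Reference answer: {gold_answer}
Generated answer: {generated_answer}

Score the generated answer for accuracy on a scale from 1 (wrong) to 5 (fully correct).
Reply with the number only."""

def judge_accuracy(question: str, gold_answer: str, generated_answer: str,
                   call_llm: Callable[[str], str]) -> int:
    """Ask a judge LLM to score a generated answer against the gold answer."""
    prompt = JUDGE_PROMPT.format(
        question=question, gold_answer=gold_answer, generated_answer=generated_answer
    )
    reply = call_llm(prompt)
    return int(reply.strip())  # assumes the judge follows the "number only" instruction
```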
Now that we know how to compute the metrics, we can move on to standard scientific experimentation (a minimal sketch follows the steps below). Ideally, you'll repeat this with each change to your RAG:
1. Define the experiments (different combinations of RAG parameters)
2. Run the evaluation dataset through the experimental version of the system
3. Compute the evaluation metrics
4. Log the results in an experiment tracking system
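Put together, an experiment run can be as simple as the loop below. The parameter grid values, the `build_rag` and `evaluate` callables, and logging via print are placeholders for your own system and experiment tracker.

```python
import itertools
import json
from typing import Callable, List

# Hypothetical parameter grid: each combination is one experiment.
param_grid = {
    "chunk_size": [300, 500],
    "top_k": [3, 5],
    "model": ["gpt-3.5-turbo", "gpt-4"],
    "temperature": [0.0, 0.7],
}

def run_experiments(evaluation_dataset: List[dict],
                    build_rag: Callable[[dict], object],
                    evaluate: Callable[[object, List[dict]], dict]) -> List[dict]:
    """build_rag(params) returns a configured RAG system; evaluate(rag, dataset) returns metrics."""
    results = []
    keys = list(param_grid)
    for values in itertools.product(*(param_grid[k] for k in keys)):
        params = dict(zip(keys, values))
        rag = build_rag(params)                       # 1. define the experiment
        metrics = evaluate(rag, evaluation_dataset)   # 2-3. run the dataset and compute metrics
        results.append({"params": params, "metrics": metrics})
        print(json.dumps(results[-1]))                # 4. log it (swap in your tracking system)
    return results
```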
Finally, choose the system variant that scores best according to your business needs.
Conclusion
Evaluation ensures your architectural choices are data-driven and that your parameters are set for optimal RAG performance. New RAG architectures, ideas, and methods are constantly being designed, and with evaluation in place you can rapidly test whether they work for your application.
Evaluation is the way to ensure you're getting a reliable, sustainable, and less costly RAG system. With the proliferation of new LLMs and architectural choices, establishing an evaluation pipeline early helps you make the best data-driven decisions over the long term.