Overcome Failing Document Ingestion & RAG Strategies with Agentic Knowledge Distillation
https://towardsdatascience.com/overcome-failing-document-ingestion-rag-strategies-with-agentic-knowledge-distillation/ | Wed, 05 Mar 2025
Introducing the pyramid search approach

Introduction

Many generative AI use cases still revolve around Retrieval Augmented Generation (RAG), yet consistently fall short of user expectations. Despite the growing body of research on RAG improvements and even adding Agents into the process, many solutions still fail to return exhaustive results, miss information that is critical but infrequently mentioned in the documents, require multiple search iterations, and generally struggle to reconcile key themes across multiple documents. To top it all off, many implementations still rely on cramming as much “relevant” information as possible into the model’s context window alongside detailed system and user prompts. Reconciling all this information often exceeds the model’s cognitive capacity and compromises response quality and consistency.

This is where our Agentic Knowledge Distillation + Pyramid Search Approach comes into play. Instead of chasing the best chunking strategy, retrieval algorithm, or inference-time reasoning method, my team (Jim Brown, Mason Sawtell, and Sandi Besen) and I take an agentic approach to document ingestion.

We leverage the full capability of the model at ingestion time to focus exclusively on distilling and preserving the most meaningful information from the document dataset. This fundamentally simplifies the RAG process by allowing the model to direct its reasoning abilities toward addressing the user/system instructions rather than struggling to understand formatting and disparate information across document chunks. 

We specifically target high-value questions that are often difficult to evaluate because they have multiple correct answers or solution paths. These cases are where traditional RAG solutions struggle most, and existing RAG evaluation datasets are largely insufficient for testing this problem space. For our research implementation, we downloaded annual and quarterly reports from the last year for the 30 companies in the DOW Jones Industrial Average. These documents can be found through the SEC EDGAR website. The information on EDGAR can be downloaded for free or queried through EDGAR public searches. See the SEC privacy policy for additional details; information on the SEC website is “considered public information and may be copied or further distributed by users of the web site without the SEC’s permission”. We selected this dataset for two key reasons: first, it falls outside the knowledge cutoff for the models evaluated, ensuring that the models cannot respond to questions based on their knowledge from pre-training; second, it’s a close approximation for real-world business problems while allowing us to discuss and share our findings using publicly available data. 

While typical RAG solutions excel at factual retrieval where the answer is easily identified in the document dataset (e.g., “When did Apple’s annual shareholder’s meeting occur?”), they struggle with nuanced questions that require a deeper understanding of concepts across documents (e.g., “Which of the DOW companies has the most promising AI strategy?”). Our Agentic Knowledge Distillation + Pyramid Search Approach addresses these types of questions with much greater success compared to other standard approaches we tested and overcomes limitations associated with using knowledge graphs in RAG systems. 

In this article, we’ll cover how our knowledge distillation process works, key benefits of this approach, examples, and an open discussion on the best way to evaluate these types of systems where, in many cases, there is no singular “right” answer.

Building the pyramid: How Agentic Knowledge Distillation works

AI-generated image showing a pyramid structure for document ingestion with labelled sections.
Image by author and team depicting pyramid structure for document ingestion. Robots meant to represent agents building the pyramid.

Overview

Our knowledge distillation process creates a multi-tiered pyramid of information from the raw source documents. Our approach is inspired by the pyramids used in deep learning computer vision-based tasks, which allow a model to analyze an image at multiple scales. We take the contents of the raw document, convert it to markdown, and distill the content into a list of atomic insights, related concepts, document abstracts, and general recollections/memories. During retrieval it’s possible to access any or all levels of the pyramid to respond to the user request. 

How to distill documents and build the pyramid (a minimal sketch of the insight-extraction step follows this list): 

  1. Convert documents to Markdown: Convert all raw source documents to Markdown. We’ve found that models process Markdown better for this task than other formats like JSON, and it is more token-efficient. We used Azure Document Intelligence to generate the Markdown for each page of the document, but there are many other open-source libraries, like MarkItDown, that do the same thing. Our dataset included 331 documents and 16,601 pages. 
  2. Extract atomic insights from each page: We process documents using a two-page sliding window, which allows each page to be analyzed twice. This gives the agent the opportunity to correct any potential mistakes when processing the page initially. We instruct the model to create a numbered list of insights that grows as it processes the pages in the document. The agent can overwrite insights from the previous page if they were incorrect since it sees each page twice. We instruct the model to extract insights in simple sentences following the subject-verb-object (SVO) format and to write sentences as if English is the second language of the user. This significantly improves performance by encouraging clarity and precision. Rolling over each page multiple times and using the SVO format also solves the disambiguation problem, which is a huge challenge for knowledge graphs. The insight generation step is also particularly helpful for extracting information from tables since the model captures the facts from the table in clear, succinct sentences. Our dataset produced 216,931 total insights, about 13 insights per page and 655 insights per document.
  3. Distilling concepts from insights: From the detailed list of insights, we identify higher-level concepts that connect related information about the document. This step significantly reduces noise and redundant information in the document while preserving essential information and themes. Our dataset produced 14,824 total concepts, about 1 concept per page and 45 concepts per document. 
  4. Creating abstracts from concepts: Given the insights and concepts in the document, the LLM writes an abstract that reads better than any abstract a human would write and is more information-dense than any abstract present in the original document. The LLM-generated abstract provides comprehensive knowledge about the document in a small number of tokens. We produce one abstract per document, 331 total.
  5. Storing recollections/memories across documents: At the top of the pyramid we store critical information that is useful across all tasks. This can be information that the user shares about the task or information the agent learns about the dataset over time by researching and responding to tasks. For example, we can store the current 30 companies in the DOW as a recollection since this list is different from the 30 companies in the DOW at the time of the model’s knowledge cutoff. As we conduct more and more research tasks, we can continuously improve our recollections and maintain an audit trail of which documents these recollections originated from. For example, we can keep track of AI strategies across companies, where companies are making major investments, etc. These high-level connections are super important since they reveal relationships and information that are not apparent in a single page or document.
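
Below is a minimal sketch of the sliding-window insight-extraction step (step 2). The helper name call_llm and the prompt wording are illustrative placeholders, not our production implementation.

# Minimal sketch of step 2: atomic insight extraction with a two-page sliding window.
# `call_llm` is a placeholder for any chat-completion call; the prompt wording is illustrative only.
def extract_insights(pages: list[str], call_llm) -> list[str]:
    insights: list[str] = []
    for i in range(len(pages)):
        # Each page is seen twice: once as the "new" page and once as context for the
        # following page, giving the agent a chance to correct earlier insights.
        window = "\n\n".join(pages[max(0, i - 1): i + 1])
        prompt = (
            "You are distilling a document into atomic insights.\n"
            "Current numbered list of insights (you may correct earlier ones):\n"
            + "\n".join(f"{n + 1}. {s}" for n, s in enumerate(insights))
            + "\n\nPages:\n" + window
            + "\n\nReturn the full, updated numbered list. Write simple "
            "subject-verb-object sentences, as if English were the reader's second language."
        )
        response = call_llm(prompt)
        insights = [
            line.split(". ", 1)[-1].strip()
            for line in response.splitlines()
            if line.strip()
        ]
    return insights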
Sample subset of insights extracted from IBM 10Q, Q3 2024
Sample subset of insights extracted from IBM 10Q, Q3 2024 (page 4)

We store the text and embeddings for each layer of the pyramid (pages and up) in Azure PostgreSQL. We originally used Azure AI Search, but switched to PostgreSQL for cost reasons. This required us to write our own hybrid search function since PostgreSQL doesn’t yet natively support this feature. This implementation would work with any vector database or vector index of your choosing. The key requirement is to store and efficiently retrieve both text and vector embeddings at any level of the pyramid. 
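
Since PostgreSQL required a custom hybrid search function, here is a rough sketch of what such a function could look like using pgvector and Postgres full-text search. The table and column names (pyramid_nodes, level, content, embedding, fts), the score weights, and the use of psycopg are assumptions for illustration, not our exact implementation.

# Hedged sketch of a hybrid (vector + keyword) search over PostgreSQL with pgvector.
# Table/column names and the 0.6/0.4 weighting are assumptions for illustration.
import psycopg

HYBRID_SQL = """
SELECT id, level, content,
       0.6 * (1 - (embedding <=> %(query_vec)s::vector))
     + 0.4 * ts_rank(fts, plainto_tsquery('english', %(query_text)s)) AS score
FROM pyramid_nodes
WHERE level = ANY(%(levels)s)
ORDER BY score DESC
LIMIT %(k)s;
"""

def hybrid_search(conn, query_text, query_embedding, levels=("insight", "concept", "abstract"), k=10):
    # query_embedding is the query vector serialized as a pgvector literal, e.g. "[0.1, 0.2, ...]"
    with conn.cursor() as cur:
        cur.execute(HYBRID_SQL, {
            "query_text": query_text,
            "query_vec": query_embedding,
            "levels": list(levels),
            "k": k,
        })
        return cur.fetchall()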

This approach essentially creates the essence of a knowledge graph but stores information in natural language, the way an LLM natively wants to interact with it, and is more token-efficient at retrieval. We also let the LLM pick the terms used to categorize each level of the pyramid; this seemed to let the model decide for itself the best way to describe and differentiate between the information stored at each level. For example, the LLM preferred “insights” to “facts” as the label for the first level of distilled knowledge. Our goal in doing this was to better understand how an LLM thinks about the process by letting it decide how to store and group related information. 

Using the pyramid: How it works with RAG & Agents

At inference time, both traditional RAG and agentic approaches benefit from the pre-processed, distilled information ingested in our knowledge pyramid. The pyramid structure allows for efficient retrieval both in the traditional RAG case, where only the top X related pieces of information are retrieved, and in the Agentic case, where the Agent iteratively plans, retrieves, and evaluates information before returning a final response. 

The benefit of the pyramid approach is that information at any and all levels of the pyramid can be used during inference. For our implementation, we used PydanticAI to create a search agent that takes in the user request, generates search terms, explores ideas related to the request, and keeps track of information relevant to the request. Once the search agent determines there’s sufficient information to address the user request, the results are re-ranked and sent back to the LLM to generate a final reply. Our implementation allows a search agent to traverse the information in the pyramid as it gathers details about a concept/search term. This is similar to walking a knowledge graph, but in a way that’s more natural for the LLM since all the information in the pyramid is stored in natural language.
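
As a library-agnostic illustration of this plan, search, and evaluate loop (not our actual PydanticAI implementation), the control flow can be sketched roughly as follows; every callable passed in is a hypothetical placeholder for an LLM call or a pyramid query.

# Rough, library-agnostic sketch of the iterative search loop described above.
# All callables are hypothetical placeholders, not the PydanticAI implementation.
def answer_request(user_request, generate_search_terms, search_pyramid,
                   has_enough_information, rerank, generate_answer, max_iterations=5):
    gathered = []  # evidence collected from the pyramid so far
    for _ in range(max_iterations):
        # Plan: ask the LLM for new search terms given what has been found so far
        for term in generate_search_terms(user_request, gathered):
            # Retrieve: query any level of the pyramid (insights, concepts, abstracts, recollections)
            gathered.extend(search_pyramid(term))
        # Evaluate: stop once the agent judges the evidence sufficient
        if has_enough_information(user_request, gathered):
            break
    # Re-rank the gathered results and generate the final reply
    return generate_answer(user_request, rerank(user_request, gathered))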

Depending on the use case, the Agent could access information at all levels of the pyramid or only at specific levels (e.g. only retrieve information from the concepts). For our experiments, we did not retrieve raw page-level data since we wanted to focus on token efficiency and found the LLM-generated information for the insights, concepts, abstracts, and recollections was sufficient for completing our tasks. In theory, the Agent could also have access to the page data; this would provide additional opportunities for the agent to re-examine the original document text; however, it would also significantly increase the total tokens used. 

Here is a high-level visualization of our Agentic approach to responding to user requests:

Overview of the agentic research & response process
Image created by author and team providing an overview of the agentic research & response process

Results from the pyramid: Real-world examples

To evaluate the effectiveness of our approach, we tested it against a variety of question categories, including typical fact-finding questions and complex cross-document research and analysis tasks. 

Fact-finding (spear fishing): 

These tasks require identifying specific information or facts that are buried in a document. These are the types of questions typical RAG solutions target but often require many searches and consume lots of tokens to answer correctly. 

Example task: “What was IBM’s total revenue in the latest financial reporting?”

Example response using pyramid approach: “IBM’s total revenue for the third quarter of 2024 was $14.968 billion [ibm-10q-q3-2024.pdf, pg. 4].”

Screenshot of total tokens used to research and generate response
Total tokens used to research and generate response

This result is correct (human-validated) and was generated using only 9,994 total tokens, with 1,240 tokens in the generated final response. 

Complex research and analysis: 

These tasks involve researching and understanding multiple concepts to gain a broader understanding of the documents and make inferences and informed assumptions based on the gathered facts.

Example task: “Analyze the investments Microsoft and NVIDIA are making in AI and how they are positioning themselves in the market. The report should be clearly formatted.”

Example response:

Screenshot of the response generated by the agent analyzing AI investments and positioning for Microsoft and NVIDIA.
Response generated by the agent analyzing AI investments and positioning for Microsoft and NVIDIA.

The result is a comprehensive report that executed quickly and contains detailed information about each of the companies. 26,802 total tokens were used to research and respond to the request with a significant percentage of them used for the final response (2,893 tokens or ~11%). These results were also reviewed by a human to verify their validity.

Screenshot of snippet indicating total token usage for the task
Snippet indicating total token usage for the task

Example task: “Create a report on analyzing the risks disclosed by the various financial companies in the DOW. Indicate which risks are shared and unique.”

Example response:

Screenshot of part 1 of a response generated by the agent on disclosed risks.
Part 1 of response generated by the agent on disclosed risks.
Screenshot of part 2 of a response generated by the agent on disclosed risks.
Part 2 of response generated by the agent on disclosed risks.

Similarly, this task was completed in 42.7 seconds and used 31,685 total tokens, with 3,116 tokens used to generate the final report. 

Screenshot of a snippet indicating total token usage for the task
Snippet indicating total token usage for the task

These results for both fact-finding and complex analysis tasks demonstrate that the pyramid approach efficiently creates detailed reports with low latency using a minimal amount of tokens. The tokens used for the tasks carry dense meaning with little noise allowing for high-quality, thorough responses across tasks.

Benefits of the pyramid: Why use it?

Overall, we found that our pyramid approach provided a significant boost in response quality and overall performance for high-value questions. 

Some of the key benefits we observed include: 

  • Reduced model’s cognitive load: When the agent receives the user task, it retrieves pre-processed, distilled information rather than the raw, inconsistently formatted, disparate document chunks. This fundamentally improves the retrieval process since the model doesn’t waste its cognitive capacity on trying to break down the page/chunk text for the first time. 
  • Superior table processing: By breaking down table information and storing it in concise but descriptive sentences, the pyramid approach makes it easier to retrieve relevant information at inference time through natural language queries. This was particularly important for our dataset since financial reports contain lots of critical information in tables. 
  • Improved response quality to many types of requests: The pyramid enables more comprehensive, context-aware responses to both precise, fact-finding questions and broad, analysis-based tasks that involve many themes across numerous documents. 
  • Preservation of critical context: Since the distillation process identifies and keeps track of key facts, important information that might appear only once in the document is easier to maintain. For example, noting that all tables are represented in millions of dollars or in a particular currency. Traditional chunking methods often cause this type of information to slip through the cracks. 
  • Optimized token usage, memory, and speed: By distilling information at ingestion time, we significantly reduce the number of tokens required during inference, are able to maximize the value of information put in the context window, and improve memory use. 
  • Scalability: Many solutions struggle to perform as the size of the document dataset grows. This approach provides a much more efficient way to manage a large volume of text by only preserving critical information. This also allows for a more efficient use of the LLM’s context window by only sending it useful, clear information.
  • Efficient concept exploration: The pyramid enables the agent to explore related information similar to navigating a knowledge graph, but does not require ever generating or maintaining relationships in the graph. The agent can use natural language exclusively and keep track of important facts related to the concepts it’s exploring in a highly token-efficient and fluid way. 
  • Emergent dataset understanding: An unexpected benefit of this approach emerged during our testing. When asking questions like “what can you tell me about this dataset?” or “what types of questions can I ask?”, the system is able to respond and suggest productive search topics because it has a more robust understanding of the dataset context by accessing higher levels in the pyramid like the abstracts and recollections. 

Beyond the pyramid: Evaluation challenges & future directions

Challenges

While the results we’ve observed when using the pyramid search approach have been nothing short of amazing, finding ways to establish meaningful metrics to evaluate the entire system both at ingestion time and during information retrieval is challenging. Traditional RAG and Agent evaluation frameworks often fail to address nuanced questions and analytical responses where many different responses are valid.

Our team plans to write a research paper on this approach in the future, and we are open to any thoughts and feedback from the community, especially when it comes to evaluation metrics. Many of the existing datasets we found were focused on evaluating RAG use cases within one document or precise information retrieval across multiple documents rather than robust concept and theme analysis across documents and domains. 

The main use cases we are interested in relate to broader questions that are representative of how businesses actually want to interact with GenAI systems. For example, “tell me everything I need to know about customer X” or “how do the behaviors of Customer A and B differ? Which am I more likely to have a successful meeting with?”. These types of questions require a deep understanding of information across many sources. The answers to these questions typically require a person to synthesize data from multiple areas of the business and think critically about it. As a result, the answers to these questions are rarely written or saved anywhere which makes it impossible to simply store and retrieve them through a vector index in a typical RAG process. 

Another consideration is that many real-world use cases involve dynamic datasets where documents are consistently being added, edited, and deleted. This makes it difficult to evaluate and track what a “correct” response is since the answer will evolve as the available information changes. 

Future directions

In the future, we believe that the pyramid approach can address some of these challenges by enabling more effective processing of dense documents and storing learned information as recollections. However, tracking and evaluating the validity of the recollections over time will be critical to the system’s overall success and remains a key focus area for our ongoing work. 

When applying this approach to organizational data, the pyramid process could also be used to identify and assess discrepancies across areas of the business. For example, uploading all of a company’s sales pitch decks could surface where certain products or services are being positioned inconsistently. It could also be used to compare insights extracted from various line of business data to help understand if and where teams have developed conflicting understandings of topics or different priorities. This application goes beyond pure information retrieval use cases and would allow the pyramid to serve as an organizational alignment tool that helps identify divergences in messaging, terminology, and overall communication. 

Conclusion: Key takeaways and why the pyramid approach matters

The knowledge distillation pyramid approach is significant because it leverages the full power of the LLM at both ingestion and retrieval time. Our approach allows you to store dense information in fewer tokens, which has the added benefit of reducing noise in the dataset at inference. Our approach also runs very quickly and is incredibly token-efficient: we are able to generate responses within seconds, explore potentially hundreds of searches, and on average use <40K tokens for the entire search, retrieval, and response generation process (this includes all the search iterations!). 

We find that the LLM is much better at writing atomic insights as sentences and that these insights effectively distill information from both text-based and tabular data. This distilled information, written in natural language, is very easy for the LLM to understand and navigate at inference since it does not have to expend unnecessary energy reasoning about and breaking down document formatting or filtering through noise.

The ability to retrieve and aggregate information at any level of the pyramid also provides significant flexibility to address a variety of query types. This approach offers promising performance for large datasets and enables high-value use cases that require nuanced information retrieval and analysis. 


Note: The opinions expressed in this article are solely my own and do not necessarily reflect the views or policies of my employer.

Interested in discussing further or collaborating? Reach out on LinkedIn!

LLM + RAG: Creating an AI-Powered File Reader Assistant
https://towardsdatascience.com/llm-rag-creating-an-ai-powered-file-reader-assistant/ | Mon, 03 Mar 2025
How to create a chatbot to answer questions about a file’s content

Introduction

AI is everywhere. 

It is hard not to interact at least once a day with a Large Language Model (LLM). The chatbots are here to stay. They’re in your apps, they help you write better, they compose emails, they read emails…well, they do a lot.

And I don’t think that is bad. In fact, my opinion is quite the opposite – at least so far. I defend and advocate for the use of AI in our daily lives because, let’s agree, it makes everything much easier.

I don’t have to spend time double-reading a document to find punctuation problems or typos. AI does that for me. I don’t waste time writing that follow-up email every single Monday. AI does that for me. I don’t need to read a huge and boring contract when I have an AI to summarize the main takeaways and action points for me!

These are only some of AI’s great uses. If you’d like to know more use cases of LLMs to make our lives easier, I wrote a whole book about them.

Now, thinking as a data scientist and looking at the technical side, not everything is that bright and shiny. 

LLMs are great for several general use cases that apply to anyone or any company. For example, coding, summarizing, or answering questions about general content created until the training cutoff date. However, when it comes to specific business applications, for a single purpose, or something new that didn’t make the cutoff date, that is when the models won’t be that useful if used out-of-the-box – meaning, they will not know the answer. Thus, it will need adjustments.

Training an LLM can take months and millions of dollars. What is even worse is that if we don’t adjust and tune the model to our purpose, there will be unsatisfactory results or hallucinations (when the model’s response doesn’t make sense given our query).

So what is the solution, then? Spending a lot of money retraining the model to include our data?

Not really. That’s when the Retrieval-Augmented Generation (RAG) becomes useful.

RAG is a framework that combines getting information from an external knowledge base with large language models (LLMs). It helps AI models produce more accurate and relevant responses.

Let’s learn more about RAG next.

What is RAG?

Let me tell you a story to illustrate the concept.

I love movies. There was a time when I knew which movies were competing in the best movie category at the Oscars, and which actors and actresses were nominated. And I would certainly know which ones got the statue that year. But now I am all rusty on that subject. If you asked me who was competing, I would not know. And even if I tried to answer you, I would give you a weak response. 

So, to provide you with a quality response, I will do what everybody else does: search for the information online, obtain it, and then give it to you. What I just did is the same idea as the RAG: I obtained data from an external database to give you an answer.

When we enhance the LLM with a content store where it can go and retrieve data to augment (increase) its knowledge base, that is the RAG framework in action.

RAG is like creating a content store where the model can enhance its knowledge and respond more accurately.

Diagram: User prompts and content using LLM + RAG
User prompt about Content C. LLM retrieves external content to aggregate to the answer. Image by the author.

Summarizing:

  1. Uses search algorithms to query external data sources, such as databases, knowledge bases, and web pages.
  2. Pre-processes the retrieved information.
  3. Incorporates the pre-processed information into the LLM.

Why use RAG?

Now that we know what the RAG framework is let’s understand why we should be using it.

Here are some of the benefits:

  • Enhances factual accuracy by referencing real data.
  • RAG can help LLMs process and consolidate knowledge to create more relevant answers.
  • RAG can help LLMs access additional knowledge bases, such as internal organizational data.
  • RAG can help LLMs create more accurate domain-specific content.
  • RAG can help reduce knowledge gaps and AI hallucination.

As previously explained, I like to say that with the RAG framework we are giving the model an internal search engine over the content we want to add to its knowledge base.

Well. All of that is very interesting. But let’s see an application of RAG. We will learn how to create an AI-powered PDF Reader Assistant.

Project

This is an application that allows users to upload a PDF document and ask questions about its content using AI-powered natural language processing (NLP) tools. 

  • The app uses Streamlit as the front end.
  • It uses LangChain, OpenAI’s GPT-4 model, and FAISS (Facebook AI Similarity Search) for document retrieval and question answering in the backend.

Let’s break down the steps for better understanding:

  1. Loading a PDF file and splitting it into chunks of text.
    1. This makes the data optimized for retrieval
  2. Present the chunks to an embedding tool.
    1. Embeddings are numerical vector representations of data used to capture relationships, similarities, and meanings in a way that machines can understand. They are widely used in Natural Language Processing (NLP), recommender systems, and search engines.
  3. Next, we put those chunks of text and embeddings in the same DB for retrieval.
  4. Finally, we make it available to the LLM.

Data preparation

Preparing a content store for the LLM will take some steps, as we just saw. So, let’s start by creating a function that can load a file and split it into text chunks for efficient retrieval.

# Imports
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_document(pdf):
    # Load a PDF
    """
    Load a PDF and split it into chunks for efficient retrieval.

    :param pdf: PDF file to load
    :return: List of chunks of text
    """

    loader = PyPDFLoader(pdf)
    docs = loader.load()

    # Instantiate Text Splitter with chunk size of 500 characters and overlap of 100 characters so that context is not lost
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    # Split into chunks for efficient retrieval
    chunks = text_splitter.split_documents(docs)

    # Return
    return chunks

Next, we will start building our Streamlit app, and we’ll use that function in the next script.

Web application

We will begin importing the necessary modules in Python. Most of those will come from the langchain packages.

FAISS is used for document retrieval; OpenAIEmbeddings transforms the text chunks into numerical vectors used for similarity search; ChatOpenAI is what enables us to interact with the OpenAI API; create_retrieval_chain is what actually performs the RAG step, retrieving the relevant chunks and augmenting the LLM with that data; create_stuff_documents_chain glues the model and the ChatPromptTemplate together.

Note: You will need to generate an OpenAI key to be able to run this script. If it’s the first time you’re creating your account, you get some free credits. But if you have had the account for some time, you may have to add 5 dollars in credits to be able to access OpenAI’s API. An alternative is using Hugging Face embeddings. 

# Imports
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.chains import create_retrieval_chain
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from scripts.secret import OPENAI_KEY
from scripts.document_loader import load_document
import streamlit as st

This first code snippet will create the App title, create a box for file upload, and prepare the file to be added to the load_document() function.

# Create a Streamlit app
st.title("AI-Powered Document Q&A")

# Load document to streamlit
uploaded_file = st.file_uploader("Upload a PDF file", type="pdf")

# If a file is uploaded, create the TextSplitter and vector database
if uploaded_file:

    # Code to work around document loader from Streamlit and make it readable by langchain
    temp_file = "./temp.pdf"
    with open(temp_file, "wb") as file:
        file.write(uploaded_file.getvalue())
        file_name = uploaded_file.name

    # Load document and split it into chunks for efficient retrieval.
    chunks = load_document(temp_file)

    # Message user that document is being processed with time emoji
    st.write("Processing document... :watch:")

Machines understand numbers better than text, so in the end, we will have to provide the model with a database of numbers that it can compare and check for similarity when performing a query. That’s where the embeddings will be useful to create the vector_db, in this next piece of code.

# Generate embeddings
    # Embeddings are numerical vector representations of data, typically used to capture relationships, similarities,
    # and meanings in a way that machines can understand. They are widely used in Natural Language Processing (NLP),
    # recommender systems, and search engines.
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_KEY,
                                  model="text-embedding-ada-002")

    # Can also use HuggingFaceEmbeddings
    # from langchain_huggingface.embeddings import HuggingFaceEmbeddings
    # embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # Create vector database containing chunks and embeddings
    vector_db = FAISS.from_documents(chunks, embeddings)

Next, we create a retriever object to navigate in the vector_db.

# Create a document retriever
    retriever = vector_db.as_retriever()
    llm = ChatOpenAI(model_name="gpt-4o-mini", openai_api_key=OPENAI_KEY)

Then, we will create the system_prompt, which is a set of instructions to the LLM on how to answer, and we will create a prompt template, preparing it to be added to the model once we get the input from the user.

# Create a system prompt
    # It sets the overall context for the model.
    # It influences tone, style, and focus before user interaction starts.
    # Unlike user inputs, a system prompt is not visible to the end user.

    system_prompt = (
        "You are a helpful assistant. Use the given context to answer the question."
        "If you don't know the answer, say you don't know. "
        "{context}"
    )

    # Create a prompt Template
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            ("human", "{input}"),
        ]
    )

    # Create a chain
    # It creates a StuffDocumentsChain, which takes multiple documents (text data) and "stuffs" them together before passing them to the LLM for processing.

    question_answer_chain = create_stuff_documents_chain(llm, prompt)

Moving on, we create the core of the RAG framework, pasting together the retriever object and the question-answer chain. This object retrieves relevant documents from a data source (e.g., a vector database) and makes them ready to be processed by an LLM to generate a response.

# Creates the RAG
    chain = create_retrieval_chain(retriever, question_answer_chain)

Finally, we create the variable question for the user input. If this question box is filled with a query, we pass it to the chain, which calls the LLM to process and return the response, which will be printed on the app’s screen.

# Streamlit input for question
    question = st.text_input("Ask a question about the document:")
    if question:
        # Answer
        response = chain.invoke({"input": question})['answer']
        st.write(response)

Here is a screenshot of the result.

Screenshot of the AI-Powered Document Q&A
Screenshot of the final app. Image by the author.

And this is a GIF for you to see the File Reader Ai Assistant in action!

GIF of the File Reader AI Assistant in action
File Reader AI Assistant in action. Image by the author.

Before you go

In this project, we learned what the RAG framework is and how it helps the LLM perform better and work well with specific knowledge.

AI can be powered with knowledge from an instruction manual, a company’s databases, some finance files, or contracts, and then adapted to respond accurately to domain-specific content queries. The knowledge base is augmented with a content store.

To recap, this is how the framework works:

1️⃣ User Query → Input text is received.

2️⃣ Retrieve Relevant Documents → Searches a knowledge base (e.g., a database, vector store).

3️⃣ Augment Context → Retrieved documents are added to the input.

4️⃣ Generate Response → An LLM processes the combined input and produces an answer.

GitHub repository

https://github.com/gurezende/Basic-Rag

About me

If you liked this content and want to learn more about my work, here is my website, where you can also find all my contacts.

https://gustavorsantos.me

References

https://cloud.google.com/use-cases/retrieval-augmented-generation

https://www.ibm.com/think/topics/retrieval-augmented-generation

https://youtu.be/T-D1OfcDW1M?si=G0UWfH5-wZnMu0nw

https://python.langchain.com/docs/introduction

https://www.geeksforgeeks.org/how-to-get-your-own-openai-api-key

Enhancing RAG: Beyond Vanilla Approaches
https://towardsdatascience.com/enhancing-rag-beyond-vanilla-approaches/ | Tue, 25 Feb 2025

Retrieval-Augmented Generation (RAG) is a powerful technique that enhances language models by incorporating external information retrieval mechanisms. While standard RAG implementations improve response relevance, they often struggle in complex retrieval scenarios. This article explores the limitations of a vanilla RAG setup and introduces advanced techniques to enhance its accuracy and efficiency.

The Challenge with Vanilla RAG

To illustrate RAG’s limitations, consider a simple experiment where we attempt to retrieve relevant information from a set of documents. Our dataset includes:

  • A primary document discussing best practices for staying healthy, productive, and in good shape.
  • Two additional documents on unrelated topics that contain some similar words used in different contexts.
main_document_text = """
Morning Routine (5:30 AM - 9:00 AM)
✅ Wake Up Early - Aim for 6-8 hours of sleep to feel well-rested.
✅ Hydrate First - Drink a glass of water to rehydrate your body.
✅ Morning Stretch or Light Exercise - Do 5-10 minutes of stretching or a short workout to activate your body.
✅ Mindfulness or Meditation - Spend 5-10 minutes practicing mindfulness or deep breathing.
✅ Healthy Breakfast - Eat a balanced meal with protein, healthy fats, and fiber.
✅ Plan Your Day - Set goals, review your schedule, and prioritize tasks.
...
"""

Using a standard RAG setup, we query the system with:

  1. What should I do to stay healthy and productive?
  2. What are the best practices to stay healthy and productive?

Helper Functions

To enhance retrieval accuracy and streamline query processing, we implement a set of essential helper functions. These functions serve various purposes, from querying the ChatGPT API to computing document embeddings and similarity scores. By leveraging these functions, we create a more efficient RAG pipeline that effectively retrieves the most relevant information for user queries.

To support our RAG improvements, we define the following helper functions:

# **Imports**
import os
import json
import openai
import numpy as np
from scipy.spatial.distance import cosine
from google.colab import userdata

# Set up OpenAI API key and create the client used by the helper functions below
os.environ["OPENAI_API_KEY"] = userdata.get('AiTeam')
client = openai.OpenAI()

def query_chatgpt(prompt, model="gpt-4o", response_format=openai.NOT_GIVEN):
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0 , # Adjust for more or less creativity
            response_format=response_format
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        return f"Error: {e}"
def get_embedding(text, model="text-embedding-3-large"): #"text-embedding-ada-002"
    """Fetches the embedding for a given text using OpenAI's API."""
    response = client.embeddings.create(
        input=[text],
        model=model
    )
    return response.data[0].embedding
def compute_similarity_metrics(embed1, embed2):
    """Computes different similarity/distance metrics between two embeddings."""
    cosine_sim = 1- cosine(embed1, embed2)  # Cosine similarity

    return cosine_sim
def fetch_similar_docs(query, docs, threshold = .55, top=1):
  query_em = get_embedding(query)
  data = []
  for d in docs:
    # Compute and print similarity metrics
    similarity_results = compute_similarity_metrics(d["embedding"], query_em)
    if(similarity_results >= threshold):
      data.append({"id":d["id"], "ref_doc":d.get("ref_doc", ""), "score":similarity_results})

  # Sort by similarity score, from highest to lowest
  sorted_data = sorted(data, key=lambda x: x["score"], reverse=True)  # Descending order
  sorted_data = sorted_data[:min(top, len(sorted_data))]
  return sorted_data

Evaluating the Vanilla RAG

To evaluate the effectiveness of a vanilla RAG setup, we conduct a simple test using predefined queries. Our goal is to determine whether the system retrieves the most relevant document based on semantic similarity. We then analyze the limitations and explore possible improvements.
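
The test snippets below reference a docs collection that is not shown in this excerpt. As a minimal sketch, assuming the three documents are held in the variables main_document_text, doc_1_text, and doc_2_text (the latter two names are placeholders), it could be built like this:

# Minimal sketch of the `docs` collection used by fetch_similar_docs below.
# doc_1_text and doc_2_text are placeholder names for the two unrelated documents.
docs = [
    {"id": "main_document_text", "ref_doc": "main_document_text",
     "embedding": get_embedding(main_document_text)},
    {"id": "doc_1_text", "ref_doc": "doc_1_text",
     "embedding": get_embedding(doc_1_text)},
    {"id": "doc_2_text", "ref_doc": "doc_2_text",
     "embedding": get_embedding(doc_2_text)},
]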

"""# **Testing Vanilla RAG**"""

query = "what should I do to stay healthy and productive?"
r = fetch_similar_docs(query, docs)
print("query = ", query)
print("documents = ", r)

query = "what are the best practices to stay healthy and productive ?"
r = fetch_similar_docs(query, docs)
print("query = ", query)
print("documents = ", r)

Advanced Techniques for Improved RAG

To further refine the retrieval process, we introduce advanced functions that enhance the capabilities of our RAG system. These functions generate structured information that aids in document retrieval and query processing, making our system more robust and context-aware.

To address these challenges, we implement three key enhancements:

1. Generating FAQs

By automatically creating a list of frequently asked questions related to a document, we expand the range of potential queries the model can match. These FAQs are generated once and stored alongside the document, providing a richer search space without incurring ongoing costs.

def generate_faq(text):
  prompt = f'''
  given the following text: """{text}"""
  Ask relevant simple atomic questions ONLY (don't answer them) to cover all subjects covered by the text. Return the result as a json list example [q1, q2, q3...]
  '''
  return query_chatgpt(prompt, response_format={ "type": "json_object" })

2. Creating an Overview

A high-level summary of the document helps capture its core ideas, making retrieval more effective. By embedding the overview alongside the document, we provide additional entry points for relevant queries, improving match rates.

def generate_overview(text):
  prompt = f'''
  given the following text: """{text}"""
  Generate an abstract for it that tells in maximum 3 lines what is it about and use high level terms that will capture the main points,
  Use terms and words that will be most likely used by average person.
  '''
  return query_chatgpt(prompt)

3. Query Decomposition

Instead of searching with broad user queries, we break them down into smaller, more precise sub-queries. Each sub-query is then compared against our enhanced document collection, which now includes:

  • The original document
  • The generated FAQs
  • The generated overview

By merging the retrieval results from these multiple sources, we significantly improve the likelihood of finding relevant information.

def decompose_query(query):
  prompt = f'''
  Given the user query: """{query}"""
break it down into smaller, relevant subqueries
that can retrieve the best information for answering the original query.
Return them as a ranked json list example [q1, q2, q3...].
'''
  return query_chatgpt(prompt, response_format={ "type": "json_object" })

Evaluating the Improved RAG

Implementing these techniques, we re-run our initial queries. This time, query decomposition generates several sub-queries, each focusing on different aspects of the original question. As a result, our system successfully retrieves relevant information from both the FAQ and the original document, demonstrating a substantial improvement over the vanilla RAG approach.

"""# **Testing Advanced Functions**"""

## Generate overview of the document
overview_text = generate_overview(main_document_text)
print(overview_text)
# generate embedding
docs.append({"id":"overview_text", "ref_doc": "main_document_text", "embedding":get_embedding(overview_text)})


## Generate FAQ for the document
main_doc_faq_arr = generate_faq(main_document_text)
print(main_doc_faq_arr)
faq =json.loads(main_doc_faq_arr)["questions"]

for i, f in enumerate(faq):
  docs.append({"id": f"main_doc_faq_{i}", "ref_doc": "main_document_text", "embedding": get_embedding(f)})


## Decompose the 1st query
query = "what should I do to stay healty and productive?"
subqueries = decompose_query(query)
print(subqueries)




subqueries_list = json.loads(subqueries)['subqueries']


## compute the similarities between the subqueries and documents, including FAQ
for subq in subqueries_list:
  print("query = ", subq)
  r = fetch_similar_docs(subq, docs, threshold=.55, top=2)
  print(r)
  print('=================================\n')


## Decompose the 2nd query
query = "what the best practices to stay healty and productive?"
subqueries = decompose_query(query)
print(subqueries)

subqueries_list = json.loads(subqueries)['subqueries']


## compute the similarities between the subqueries and documents, including FAQ
for subq in subqueries_list:
  print("query = ", subq)
  r = fetch_similar_docs(subq, docs, threshold=.55, top=2)
  print(r)
  print('=================================\n')

Here are some of the FAQs that were generated:

{
  "questions": [
    "How many hours of sleep are recommended to feel well-rested?",
    "How long should you spend on morning stretching or light exercise?",
    "What is the recommended duration for mindfulness or meditation in the morning?",
    "What should a healthy breakfast include?",
    "What should you do to plan your day effectively?",
    "How can you minimize distractions during work?",
    "How often should you take breaks during work/study productivity time?",
    "What should a healthy lunch consist of?",
    "What activities are recommended for afternoon productivity?",
    "Why is it important to move around every hour in the afternoon?",
    "What types of physical activities are suggested for the evening routine?",
    "What should a nutritious dinner include?",
    "What activities can help you reflect and unwind in the evening?",
    "What should you do to prepare for sleep?",
    …
  ]
}

Cost-Benefit Analysis

While these enhancements introduce an upfront processing cost—generating FAQs, overviews, and embeddings—this is a one-time cost per document. In contrast, a poorly optimized RAG system would lead to two major inefficiencies:

  1. Frustrated users due to low-quality retrieval.
  2. Increased query costs from retrieving excessive, loosely related documents.

For systems handling high query volumes, these inefficiencies compound quickly, making preprocessing a worthwhile investment.

Conclusion

By integrating document preprocessing (FAQs and overviews) with query decomposition, we create a more intelligent RAG system that balances accuracy and cost-effectiveness. This approach enhances retrieval quality, reduces irrelevant results, and ensures a better user experience.

As RAG continues to evolve, these techniques will be instrumental in refining AI-driven retrieval systems. Future research may explore further optimizations, including dynamic thresholding and reinforcement learning for query refinement.

6 Common LLM Customization Strategies Briefly Explained
https://towardsdatascience.com/6-common-llm-customization-strategies-briefly-explained/ | Mon, 24 Feb 2025
From theory to practice: understanding RAG, agents, fine-tuning, and more

Why Customize LLMs?

Large Language Models (LLMs) are deep learning models pre-trained with self-supervised learning, requiring vast amounts of training data, training time, and parameters. LLMs have revolutionized natural language processing, especially in the last two years, demonstrating remarkable capabilities in understanding and generating human-like text. However, these general-purpose models’ out-of-the-box performance may not always meet specific business needs or domain requirements. LLMs alone cannot answer questions that rely on proprietary company data or closed-book settings, making them relatively generic in their applications. Training an LLM from scratch is largely infeasible for small to medium teams due to the massive amounts of training data and resources required. Therefore, a wide range of LLM customization strategies have been developed in recent years to tune the models for scenarios that require specialized knowledge.

The customization strategies can be broadly split into two types:

  • Using a frozen model: These techniques don’t necessitate updating model parameters and are typically accomplished through in-context learning or prompt engineering. They are cost-effective since they alter the model’s behavior without incurring extensive training costs, and are therefore widely explored in both industry and academia, with new research papers published on a daily basis.
  • Updating model parameters: This is a relatively resource-intensive approach that requires tuning a pre-trained LLM using custom datasets designed for the intended purpose. This includes popular techniques like Fine-Tuning and Reinforcement Learning from Human Feedback (RLHF).

These two broad customization paradigms branch out into various specialized techniques including LoRA fine-tuning, Chain of Thought, Retrieval Augmented Generation, ReAct, and Agent frameworks. Each technique offers distinct advantages and trade-offs regarding computational resources, implementation complexity, and performance improvements.

This article is also available as a video here

How to Choose LLMs?

The first step of customizing LLMs is to select an appropriate foundation model as the baseline. Community-based platforms such as Hugging Face offer a wide range of open-source pre-trained models contributed by top companies and communities, such as the Llama series from Meta and Gemma from Google. Hugging Face additionally provides leaderboards, for example the “Open LLM Leaderboard”, to compare LLMs based on industry-standard metrics and tasks (e.g. MMLU). Cloud providers (e.g., AWS) and AI companies (e.g., OpenAI and Anthropic) also offer access to proprietary models, typically as paid services with restricted access. The following factors are essential to consider when choosing an LLM.

Open source or proprietary model: Open-source models allow full customization and self-hosting but require technical expertise, while proprietary models offer immediate access and often better-quality responses, but at a higher cost.

Task and metrics: Models excel at different tasks including question-answering, summarization, code generation etc. Compare benchmark metrics and test on domain-specific tasks to determine the appropriate models.

Architecture: In general, decoder-only models (the GPT series) perform better at text generation, while encoder-decoder models (T5) handle translation well. More architectures are emerging and showing promising results, for instance the Mixture of Experts (MoE) model “DeepSeek”.

Number of Parameters and Size: Larger models (70B-175B parameters) offer better performance but need more computing power. Smaller models (7B-13B) run faster and cheaper but may have reduced capabilities.

After determining a base LLM, let’s explore the 6 most common strategies for LLM customization, ranked in order of resource consumption from the least to the most intensive:

  • Prompt Engineering
  • Decoding and Sampling Strategy
  • Retrieval Augmented Generation
  • Agent
  • Fine Tuning
  • Reinforcement Learning from Human Feedback

If you’d prefer a video walkthrough of these concepts, please check out my video on “6 Common LLM Customization Strategies Briefly Explained”.

LLM Customization Techniques

1. Prompt Engineering

A prompt is the input text sent to an LLM to elicit an AI-generated response, and it can be composed of instructions, context, input data, and an output indicator.

Instructions: This provides a task description or instruction for how the model should perform.

Context: This is external information to guide the model to respond within a certain scope.

Input data: This is the input for which you want a response.

Output indicator: This specifies the output type or format.

Prompt Engineering involves crafting these prompt components strategically to shape and control the model’s response. Basic prompt engineering techniques include zero-shot, one-shot, and few-shot prompting. Users can implement basic prompt engineering techniques directly while interacting with the LLM, making it an efficient approach for aligning the model’s behavior with a novel objective. API implementation is also an option, and more details are introduced in my previous article “A Simple Pipeline for Integrating LLM Prompt with Knowledge Graph”.
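
To make the four components concrete, here is a small illustrative few-shot prompt assembled from an instruction, context (two examples), input data, and an output indicator; the task and wording are only an example, not taken from the article.

# Illustrative few-shot prompt built from the four components described above.
instruction = "Classify the sentiment of the review as Positive or Negative."
context = (
    "Example: 'The battery dies within an hour.' -> Negative\n"
    "Example: 'Setup took two minutes and it just works.' -> Positive"
)
input_data = "Review: 'The screen is gorgeous, but the speakers crackle.'"
output_indicator = "Answer with a single word: Positive or Negative."

prompt = "\n\n".join([instruction, context, input_data, output_indicator])
print(prompt)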

Due to the efficiency and effectiveness of prompt engineering, more complex approaches are explored and developed to advance the logical structure of prompts.

Chain of Thought (CoT) asks LLMs to break down complex reasoning tasks into step-by-step thought processes, improving performance on multi-step problems. Each step explicitly exposes its reasoning outcome which serves as the precursor context of its subsequent steps until arriving at the answer.

Tree of thoughts extends from CoT by considering multiple different reasoning branches and self-evaluating choices to decide the next best action. It is more effective for tasks that involve initial decisions, strategies for the future and exploration of multiple solutions.

Automatic reasoning and tool use (ART) builds upon the CoT process: it deconstructs complex tasks and allows the model to select few-shot examples from a task library using predefined external tools like search and code generation.

Synergizing reasoning and acting (ReAct) combines reasoning trajectories with an action space, where the model searches through the action space and determines the next best action based on environmental observations.

Techniques like CoT and ReAct are often combined with an Agentic workflow to strengthen its capability. These techniques will be introduced in more detail in the “Agent” section.

Further Reading

2. Decoding and Sampling Strategy

Decoding strategy can be controlled at model inference time through inference parameters (e.g. temperature, top p, top k), determining the randomness and diversity of model responses. Greedy search, beam search, and sampling are three common decoding strategies for auto-regressive model generation.

During the autoregressive generation process, the LLM outputs one token at a time based on a probability distribution over candidate tokens conditioned on the previous tokens. By default, greedy search is applied to produce the next token with the highest probability.
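
Conceptually, greedy decoding just takes the argmax of the model’s next-token distribution at every step. The minimal loop below illustrates this with the transformers library and is equivalent in spirit to calling model.generate with its default settings; the “gpt2” checkpoint and the example prompt are arbitrary choices for illustration.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal manual greedy-decoding loop; "gpt2" is just an example checkpoint.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
for _ in range(10):
    logits = model(input_ids).logits                            # (batch, seq_len, vocab)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # highest-probability token
    input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))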

In contrast, beam search decoding considers multiple hypotheses of next-best tokens and selects the hypothesis with the highest combined probability across all tokens in the text sequence. The code snippet below uses the transformers library to specify the number of beam paths (e.g. num_beams=5 considers 5 distinct hypotheses) during the model generation process.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder values so the snippet runs standalone
model_name = tokenizer_name = "gpt2"
prompt = "The capital of France is"

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
inputs = tokenizer(prompt, return_tensors="pt")

model = AutoModelForCausalLM.from_pretrained(model_name)
outputs = model.generate(**inputs, num_beams=5)

Sampling strategy is the third approach to control the randomness of model responses by adjusting these inference parameters:

  • Temperature: Lowering the temperature makes the probability distribution sharper, increasing the likelihood of generating high-probability words and decreasing the likelihood of generating low-probability words. When temperature = 0, sampling becomes equivalent to greedy search (least random); higher temperatures (e.g. temperature = 1) produce more diverse, creative outputs.
  • Top K sampling: This method filters the K most probable next tokens and redistributes the probability among those tokens. The model then samples from this filtered set of tokens.
  • Top P sampling: Instead of sampling from the K most probable tokens, top-p sampling selects from the smallest possible set of tokens whose cumulative probability exceeds the threshold p.

The example code snippet below samples from the 50 most likely tokens (top_k=50), further restricted to the smallest set whose cumulative probability exceeds 0.95 (top_p=0.95):

# model_inputs is a tokenized prompt, as in the previous snippet
sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,          # enable sampling instead of greedy/beam search
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,  # return three independently sampled completions
)

Further Reading

3. RAG

Retrieval Augmented Generation (RAG), initially introduced in the paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, has been demonstrated as a promising solution that integrates external knowledge and reduces common LLM “hallucination” issues when handling domain-specific or specialized queries. RAG dynamically pulls relevant information from a knowledge base and generally does not involve extensive training to update LLM parameters, making it a cost-effective strategy to adapt a general-purpose LLM to a specialized domain.

A RAG system can be decomposed into a retrieval stage and a generation stage. The objective of the retrieval process is to find content within the knowledge base that is closely related to the user query, by chunking external knowledge, creating embeddings, indexing, and performing similarity search.

  1. Chunking: Documents are divided into smaller segments, with each segment containing a distinct unit of information.
  2. Create embeddings: An embedding model compresses each information chunk into a vector representation. The user query is also converted into its vector representation through the same vectorization process, so that the user query can be compared in the same dimensional space.
  3. Indexing: This process stores these text chunks and their vector embeddings as key-value pairs, enabling efficient and scalable search functionality. For large external knowledge bases that exceed memory capacity, vector databases offer efficient long-term storage.
  4. Similarity search: Similarity scores between the query embeddings and text chunk embeddings are calculated, which are used for searching information highly relevant to the user query.

The generation process of the RAG system then combines the retrieved information with the user query to form the augmented query, which is passed to the LLM to generate a context-rich response.

Code Snippet

The code snippet first specifies the LLM and embedding model, then chunks the external knowledge base documents into a collection of documents, creates an index from the documents, defines a query_engine based on the index, and queries the query_engine with the user prompt.

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Document, Settings

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# Load the external knowledge base (assumed here to live in a local "data" directory)
documents = SimpleDirectoryReader("data").load_data()

# Combine the documents, build the index and the query engine
document = Document(text="\n\n".join([doc.text for doc in documents]))
index = VectorStoreIndex.from_documents([document])
query_engine = index.as_query_engine()
response = query_engine.query(
    "Tell me about LLM customization strategies."
)

The example above shows a simple RAG system. Advanced RAG builds on this by introducing pre-retrieval and post-retrieval strategies to reduce pitfalls such as limited synergy between the retrieval and generation processes. For example, a reranking step reorders the retrieved information using a model capable of understanding bidirectional context, and integration with a knowledge graph enables advanced query routing. More use cases can be found on the llamaindex website.
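
As a concrete illustration of a post-retrieval strategy, the sketch below attaches a reranking postprocessor to the query engine built above. It assumes LlamaIndex's SentenceTransformerRerank class is available (import paths can vary slightly between versions), and the cross-encoder checkpoint is one common choice rather than a requirement.

from llama_index.core.postprocessor import SentenceTransformerRerank

# Rerank the retrieved nodes with a cross-encoder before they reach the LLM
reranker = SentenceTransformerRerank(
    model="cross-encoder/ms-marco-MiniLM-L-2-v2",  # illustrative cross-encoder checkpoint
    top_n=3,                                       # keep only the 3 highest-scoring nodes
)

query_engine = index.as_query_engine(
    similarity_top_k=10,              # retrieve a larger candidate pool first
    node_postprocessors=[reranker],   # then rerank it down to top_n
)
response = query_engine.query("Tell me about LLM customization strategies.")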

Further Reading

4. Agent

LLM Agents were a trending topic in 2024 and will likely remain a main focus in the GenAI field in 2025. Compared to RAG alone, agents excel at creating query routes and planning LLM-based workflows, with the following benefits:

  • Maintaining memory and state of previous model generated responses.
  • Leveraging various tools based on specific criteria. This tool-using capability sets agents apart from basic RAG systems by giving the LLM independent control over tool selection.
  • Breaking down a complex task into smaller steps and planning for a sequence of actions.
  • Collaborating with other agents to form an orchestrated system.

Several in-context learning techniques (e.g. CoT, ReAct) can be implemented through an agentic framework, and we will discuss ReAct in more detail. ReAct, which stands for “Synergizing Reasoning and Acting in Language Models”, is composed of three key elements: actions, thoughts, and observations. The framework was introduced by researchers from Google Research and Princeton University, and it builds upon Chain of Thought by integrating the reasoning steps with an action space that enables tool use and function calling. Additionally, the ReAct framework emphasizes determining the next best action based on environmental observations.

This example from the original paper demonstrates ReAct’s inner working process: the LLM generates the first thought and acts by calling the function “Search [Apple Remote]”, then observes the feedback from its first output. The second thought is based on the previous observation, leading to a different action, “Search [Front Row]”. This process iterates until the goal is reached. The research shows that, by interacting with a simple Wikipedia API, ReAct overcomes the hallucination and error-propagation issues more often observed in chain-of-thought reasoning. Furthermore, through the implementation of decision traces, the ReAct framework also increases the model’s interpretability, trustworthiness, and diagnosability.

Example from “ReAct: Synergizing Reasoning and Acting in Language Models” (Yao et al., 2022)

Code Snippet

This demonstrates a ReAct-based agent implementation using llamaindex. First, it defines two functions (multiply and add). Second, these two functions are encapsulated as FunctionTool, forming the agent’s action space, to be executed based on its reasoning.

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

# create basic function tools
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b
multiply_tool = FunctionTool.from_defaults(fn=multiply)

def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b
add_tool = FunctionTool.from_defaults(fn=add)

# the LLM that drives the agent's reasoning loop (same model as in the RAG example above)
llm = OpenAI(model="gpt-3.5-turbo")
agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)

The advantages of an agentic workflow become more substantial when combined with self-reflection or self-correction. This is a rapidly growing domain, with a variety of agent architectures being explored. For instance, the Reflexion framework facilitates iterative learning by providing a summary of verbal feedback from the environment and storing the feedback in the model’s memory; the CRITIC framework empowers frozen LLMs to self-verify through interacting with external tools such as code interpreters and API calls.

Further Reading

5. Fine-Tuning

Fine-tuning is the process of feeding niche and specialized datasets to modify the LLM so that it is more aligned with a certain objective. It differs from prompt engineering and RAG in that it updates the LLM’s weights and parameters. Full fine-tuning refers to updating all weights of the pretrained LLM through backpropagation, which requires large memory to store all weights and parameters and may cause a significant reduction in ability on other tasks (i.e. catastrophic forgetting). Therefore, PEFT (parameter-efficient fine-tuning) is more widely used to mitigate these caveats while saving the time and cost of model training. There are three categories of PEFT methods:

  • Selective: Select a subset of initial LLM parameters to fine-tune, which can be more computationally intensive compared to other PEFT methods.
  • Reparameterization: Adjust model weights through training the weights of low-rank representations. For example, Low-Rank Adaptation (LoRA) is in this category and accelerates fine-tuning by representing the weight updates with two smaller matrices (a minimal LoRA configuration sketch follows this list).
  • Additive: Add additional trainable layers to the model, including techniques like adapters and soft prompts.
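
Below is a minimal LoRA configuration sketch using Hugging Face’s peft library; the base model, rank, alpha, and target module names are illustrative values that depend on the model architecture you fine-tune.

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base_model = AutoModelForCausalLM.from_pretrained("gpt2")  # illustrative base model

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=32,              # scaling factor applied to the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # attention projection layer in GPT-2; varies per model
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only a small fraction of the weights are trainable

The resulting peft_model can then be passed to a standard training loop, such as the transformers Trainer shown in the next code snippet, in place of the full model.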

The fine-tuning process is similar to a standard deep learning training process, requiring the following inputs:

  • training and evaluation datasets
  • training arguments that define the hyperparameters, e.g. learning rate and optimizer
  • a pretrained LLM
  • compute metrics and objective functions that the algorithm should be optimized for

Code Snippet

Below is an example of implementing fine-tuning using the transformers Trainer.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    eval_strategy="epoch",   # run evaluation at the end of every epoch
)

trainer = Trainer(
    model=model,                      # the pretrained (or PEFT-wrapped) model
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,  # function mapping predictions to metric values
)

trainer.train()

Fine-tuning has a wide range of use cases. For instance, instruction fine-tuning optimizes LLMs for conversations and following instructions by training them on prompt-completion pairs. Another example is domain adaptation, an unsupervised fine-tuning method that helps LLMs specialize in specific knowledge domains.

Further Reading

6. RLHF

Reinforcement learning from human feedback, or RLHF, is a reinforcement learning technique that fine-tunes LLMs based on human preferences. RLHF operates by training a reward model on human feedback and using this model as a reward function to optimize a reinforcement learning policy through PPO (Proximal Policy Optimization). The process requires two sets of training data: a preference dataset for training the reward model, and a prompt dataset used in the reinforcement learning loop.

Let’s break it down into steps:

  1. Gather a preference dataset annotated by human labelers who rate different completions generated by the model based on human preference. An example format of the preference dataset is {input_text, candidate1, candidate2, human_preference}, indicating which candidate response is preferred (a small illustrative record is shown after this list).
  2. Train a reward model using the preference dataset; the reward model is essentially a regression model that outputs a scalar indicating the quality of a model-generated response. The objective of the reward model is to maximize the score gap between the winning and losing candidates.
  3. Use the reward model in a reinforcement learning loop to fine-tune the LLM. The objective is to update the policy so that the LLM generates responses that maximize the reward produced by the reward model. This process utilizes the prompt dataset, which is a collection of prompts in the format of {prompt, response, rewards}.
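
For illustration only, a single preference record in the format described in step 1 might look like the following; the texts and the preference label are made up.

# One hypothetical preference-dataset record used to train the reward model
preference_example = {
    "input_text": "Summarize the main benefit of LoRA in one sentence.",
    "candidate1": "LoRA fine-tunes a model by training small low-rank matrices instead of all of its weights.",
    "candidate2": "LoRA is a database index structure used to speed up SQL queries.",
    "human_preference": "candidate1",  # the labeler preferred the accurate answer
}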

Code Snippet

The open-source TRL (Transformer Reinforcement Learning) library is widely used for implementing RLHF, and it provides template code that shows the basic RLHF setup:

  1. Initialize the base model and tokenizer from a pretrained checkpoint
  2. Configure the PPO hyperparameters in PPOConfig, such as learning rate, epochs, and batch sizes
  3. Create the PPO trainer PPOTrainer by combining the model, tokenizer, and training data
  4. The training loop uses the step() method to iteratively update the model to optimize the rewards calculated from the query and model response
# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler
from transformers import AutoTokenizer

# define the hyperparameters of the PPO algorithm
config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

# initiate the pretrained model (with a value head for PPO) and the tokenizer
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

# initiate the PPO trainer with the model, tokenizer and dataset
ppo_trainer = PPOTrainer(
    config=config,
    model=ppo_model,
    tokenizer=tokenizer,
    dataset=dataset["train"],
    data_collator=collator
)

# inside the training loop, ppo_trainer is iteratively updated through the rewards
ppo_trainer.step(query_tensors, response_tensors, rewards)

RLHF is widely applied for aligning model responses with human preference. Common use cases involve reducing response toxicity and model hallucination. However, it has the downside of requiring a large amount of human-annotated data as well as the computation costs associated with policy optimization. Therefore, alternatives like Reinforcement Learning from AI Feedback (RLAIF) and Direct Preference Optimization (DPO) have been introduced to mitigate these limitations.

Further Reading

Take-Home Message

This article briefly explains six essential LLM customization strategies including prompt engineering, decoding strategy, RAG, Agent, fine-tuning, and RLHF. Hope you find it helpful in terms of understanding the pros/cons of each strategy as well as how to implement them based on the practical examples.

The post 6 Common LLM Customization Strategies Briefly Explained appeared first on Towards Data Science.

]]>
Retrieval Augmented Generation in SQLite https://towardsdatascience.com/retrieval-augmented-generation-in-sqlite/ Tue, 18 Feb 2025 17:05:58 +0000 https://towardsdatascience.com/?p=598050 This is the second in a two-part series on using SQLite for machine learning. In my last article, I dove into how SQLite is rapidly becoming a production-ready database for web applications. In this article, I will discuss how to perform retrieval-augmented-generation using SQLite. If you’d like a custom web application with generative AI integration, […]

The post Retrieval Augmented Generation in SQLite appeared first on Towards Data Science.

]]>
This is the second in a two-part series on using SQLite for Machine Learning. In my last article, I dove into how SQLite is rapidly becoming a production-ready database for web applications. In this article, I will discuss how to perform retrieval-augmented-generation using SQLite.

If you’d like a custom web application with generative AI integration, visit losangelesaiapps.com

The code referenced in this article can be found here.


When I first learned how to perform retrieval-augmented-generation (RAG) as a budding data scientist, I followed the traditional path. This usually looks something like:

  • Google retrieval-augmented-generation and look for tutorials
  • Find the most popular framework, usually LangChain or LlamaIndex
  • Find the most popular cloud vector database, usually Pinecone or Weaviate
  • Read a bunch of docs, put all the pieces together, and success!

In fact I actually wrote an article about my experience building a RAG system in LangChain with Pinecone.

There is nothing terribly wrong with using a RAG framework with a cloud vector database. However, I would argue that for first time learners it overcomplicates the situation. Do we really need an entire framework to learn how to do RAG? Is it necessary to perform API calls to cloud vector databases? These databases act as black boxes, which is never good for learners (or frankly for anyone). 

In this article, I will walk you through how to perform RAG on the simplest stack possible. In fact, this ‘stack’ is just SQLite with the sqlite-vec extension and the OpenAI API for use of their embedding and chat models. I recommend you read part 1 of this series to get a deep dive on SQLite and how it is rapidly becoming production ready for web applications. For our purposes here, it is enough to understand that SQLite is the simplest kind of database possible: a single file in your repository. 

So ditch your cloud vector databases and your bloated frameworks, and let’s do some RAG.


SQLite-Vec

One of the powers of the SQLite database is the use of extensions. For those of us familiar with Python, extensions are a lot like libraries. They are modular pieces of code written in C to extend the functionality of SQLite, making things that were once impossible possible. One popular example of a SQLite extension is the Full-Text Search (FTS) extension. This extension allows SQLite to perform efficient searches across large volumes of textual data. Because the extension is written purely in C, we can run it anywhere a SQLite database can run, including Raspberry Pis and browsers.
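
As a small illustration of what an extension adds, the sketch below creates an FTS5 full-text index from Python’s built-in sqlite3 module and runs a keyword MATCH query. It assumes your SQLite build ships with the FTS5 module, which most modern builds do.

import sqlite3

db = sqlite3.connect(":memory:")
# FTS5 virtual table for keyword search (exact-term matching, not semantic)
db.execute("CREATE VIRTUAL TABLE notes USING fts5(content)")
db.execute("INSERT INTO notes (content) VALUES ('Clydesdale horses pull heavy loads')")
db.execute("INSERT INTO notes (content) VALUES ('Ponies are small, sturdy equines')")

# Matches the literal word 'horses' but not the semantically related 'Ponies' row
rows = db.execute("SELECT content FROM notes WHERE notes MATCH 'horses'").fetchall()
print(rows)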

In this article I will be going over the extension known as sqlite-vec. This gives SQLite the power of performing vector search. Vector search is similar to full-text search in that it allows for efficient search across textual data. However, rather than search for an exact word or phrase in the text, vector search has a semantic understanding. In other words, searching for “horses” will find matches of “equestrian”, “pony”, “Clydesdale”, etc. Full-text search is incapable of this. 

sqlite-vec makes use of virtual tables, as do most extensions in SQLite. A virtual table is similar to a regular table, but with additional powers:

  • Custom Data Sources: The data for a standard table in SQLite is housed in a single db file. For a virtual table, the data can be housed in external sources, for example a CSV file or an API call.
  • Flexible Functionality: Virtual tables can add specialized indexing or querying capabilities and support complex data types like JSON or XML.
  • Integration with SQLite Query Engine: Virtual tables integrate seamlessly with SQLite’s standard query syntax, e.g. SELECT, INSERT, UPDATE, and DELETE operations. Ultimately it is up to the writers of the extensions to support these operations.
  • Use of Modules: The backend logic for how the virtual table will work is implemented by a module (written in C or another language).

The typical syntax for creating a virtual table looks like the following:

CREATE VIRTUAL TABLE my_table USING my_extension_module();

The important part of this statement is my_extension_module(). This specifies the module that will be powering the backend of the my_table virtual table. In sqlite-vec we will use the vec0 module.

Code Walkthrough

The code for this article can be found here. It is a simple directory with the majority of files being .txt files that we will be using as our dummy data. Because I am a physics nerd, the majority of the files pertain to physics, with just a few files relating to other random fields. I will not present the full code in this walkthrough, but instead will highlight the important pieces. Clone my repo and play around with it to investigate the full code. Below is a tree view of the repo. Note that my_docs.db is the single-file database used by SQLite to manage all of our data.

.
├── data
│   ├── cooking.txt
│   ├── gardening.txt
│   ├── general_relativity.txt
│   ├── newton.txt
│   ├── personal_finance.txt
│   ├── quantum.txt
│   ├── thermodynamics.txt
│   └── travel.txt
├── my_docs.db
├── requirements.txt
└── sqlite_rag_tutorial.py

Step 1 is to install the necessary libraries. Below is our requirements.txt file. As you can see it has only three libraries. I recommend creating a virtual environment with the latest Python version (3.13.1 was used for this article) and then running pip install -r requirements.txt to install the libraries.

# requirements.txt
sqlite-vec==0.1.6
openai==1.63.0
python-dotenv==1.0.1

Step 2 is to create an OpenAI API key if you don’t already have one. We will be using OpenAI to generate embeddings for the text files so that we can perform our vector search. 

# sqlite_rag_tutorial.py
import sqlite3
from sqlite_vec import serialize_float32
import sqlite_vec
import os
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables (including OPENAI_API_KEY) from a .env file
load_dotenv()

# Set up OpenAI client
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

Step 3 is to load the sqlite-vec extension into SQLite. We will be using Python and SQL for our examples in this article. Disabling the ability to load extensions immediately after loading your extension is a good security practice.

# Path to the database file
db_path = 'my_docs.db'

# Delete the database file if it exists, so we start from a clean slate
if os.path.exists(db_path):
    os.remove(db_path)

db = sqlite3.connect(db_path)
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

Next we will go ahead and create our virtual table:

db.execute('''
   CREATE VIRTUAL TABLE documents USING vec0(
       embedding float[1536],
       +file_name TEXT,
       +content TEXT
   )
''')

documents is a virtual table with three columns:

  • embedding : a 1536-dimension float vector that will store the embeddings of our sample documents.
  • file_name : Text that will house the name of each file we store in the database. Note that this column and the following have a + symbol in front of them. This indicates that they are auxiliary fields. Previously in sqlite-vec only embedding data could be stored in the virtual table. However, an update was recently pushed that allows us to add fields to our table that we don’t really want embedded. In this case we are adding the content and name of the file in the same table as our embeddings. This allows us to easily see which embeddings correspond to which content, while sparing us the need for extra tables and JOIN statements.
  • content : Text that will store the content of each file. 

Now that we have our virtual table set up in our SQLite database, we can begin converting our text files into embeddings and storing them in our table:

# Function to get embeddings using the OpenAI API
def get_openai_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Iterate over .txt files in the /data directory
for file_name in os.listdir("data"):
    file_path = os.path.join("data", file_name)
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()
        # Generate embedding for the content
        embedding = get_openai_embedding(content)
        if embedding:
            # Insert file content and embedding into the vec0 table
            db.execute(
                'INSERT INTO documents (embedding, file_name, content) VALUES (?, ?, ?)',
                (serialize_float32(embedding), file_name, content)
            )

# Commit changes
db.commit()

We essentially loop through each of our .txt files, embed the content of each file, and then use an INSERT INTO statement to place the embedding, file_name, and content into the documents virtual table. A commit statement at the end ensures the changes are persisted. Note that we are using serialize_float32 here from the sqlite-vec library. SQLite itself does not have a built-in vector type, so it stores vectors as binary large objects (BLOBs) to save space and allow fast operations. Internally, serialize_float32 uses Python’s struct.pack() function, which converts Python data into C-style binary representations.
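
To see roughly what that serialization does, here is a tiny standalone sketch that packs a short float list into the kind of compact binary blob SQLite stores. The 4-element vector is purely illustrative; the real embeddings have 1,536 dimensions.

import struct

vector = [0.12, -0.03, 0.88, 0.45]
# Pack the floats into raw 32-bit float bytes, 4 bytes per value
blob = struct.pack(f"{len(vector)}f", *vector)
print(len(blob))  # 16 bytes for 4 floats

# Unpacking recovers (approximately) the original values
print(struct.unpack(f"{len(blob) // 4}f", blob))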

Finally, to perform RAG, you then use the following code to do a K-Nearest-Neighbors (KNN-style) operation. This is the heart of vector search. 

# Perform a sample KNN query
query_text = "What is general relativity?"
query_embedding = get_openai_embedding(query_text)
if query_embedding:
    rows = db.execute(
        """
        SELECT
            file_name,
            content,
            distance
        FROM documents
        WHERE embedding MATCH ?
        ORDER BY distance
        LIMIT 3
        """,
        [serialize_float32(query_embedding)]
    ).fetchall()

    print("Top 3 most similar documents:")
    top_contexts = []
    for row in rows:
        print(row)
        top_contexts.append(row[1])  # Append the 'content' column

We begin by taking in a query from the user, in this case “What is general relativity?” and embedding that query using the same embedding model as before. We then perform a SQL operation. Let’s break this down:

  • The SELECT statement means the retrieved data will have three columns: file_name, content, and distance. The first two we have already mentioned. Distance will be calculated during the SQL operation, more on this in a moment.
  • The FROM statement ensures you are pulling data from the documents table.
  • The WHERE embedding MATCH ? statement performs a similarity search between all of the vectors in your database and the query vector. The returned data includes a distance column. This distance is a floating point number measuring how far apart the query and database vectors are: the lower the number, the closer (and more similar) the vectors. sqlite-vec provides a few options for how this distance is calculated. 
  • The ORDER BY distance clause sorts the retrieved rows in ascending order of distance, so the most similar documents come first.
  • LIMIT 3 ensures we only get the three documents that are nearest to our query embedding vector. You can tweak this number to see how retrieving more or fewer vectors affects your results.

Given our query of “What is general relativity?”, the following documents were pulled. It did a pretty good job!

Top 3 most similar documents:

(‘general_relativity.txt’, ‘Einstein’s theory of general relativity redefined our understanding of gravity. Instead of viewing gravity as a force acting at a distance, it interprets it as the curvature of spacetime around massive objects. Light passing near a massive star bends slightly, galaxies deflect beams traveling millions of light-years, and clocks tick at different rates depending on their gravitational potential. This groundbreaking theory led to predictions like gravitational lensing and black holes, phenomena later confirmed by observational evidence, and it continues to guide our understanding of the cosmos.’, 0.8316285610198975)

(‘newton.txt’, ‘In classical mechanics, Newton’s laws of motion form the foundation of how we understand the movement of objects. Newton’s first law, often called the law of inertia, states that an object at rest remains at rest and an object in motion continues in motion unless acted upon by an external force. This concept extends into more complex physics problems, where analyzing net forces on objects allows us to predict their future trajectories and behaviors. Over time, applying Newton’s laws has enabled engineers and scientists to design safer vehicles, more efficient machines, and even guide spacecraft through intricate gravitational fields.’, 1.2036118507385254)

(‘quantum.txt’, ‘Quantum mechanics revolutionized our understanding of the microscopic world. Unlike classical particles, quantum entities such as electrons can exhibit both wave-like and particle-like behaviors. Phenomena like quantum superposition suggest that particles can exist in multiple states at once, and the act of measurement often “collapses” these states into one observed outcome. This strange and counterintuitive theory underpins modern technologies like semiconductors and lasers, and it provides a conceptual framework for emerging fields like quantum computing and cryptography.’, 1.251380205154419)

We can then stuff the context of the model with these three documents and have it attempt to answer our question. 

# Prepare the context for the query
context = "\n\n".join(top_contexts)
system_message = "You are a helpful assistant. Use the following context to answer the query."

# Send query and context to OpenAI
try:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": f"Context: {context}\n\nQuery: {query_text}"}
        ]
    )
    print("Response:")
    print(completion.choices[0].message.content)
except Exception as e:
    print(f"Error generating response: {e}")

Here is what the model said:

General relativity is a theory developed by Albert Einstein that redefines our understanding of gravity. Instead of viewing gravity as a force acting at a distance, general relativity interprets it as the curvature of spacetime caused by the presence of mass. According to this theory, massive objects like stars and planets create a distortion in the fabric of spacetime, and this curvature affects the motion of other objects, including light, which can bend when it passes near a massive body.

This is faithfully sticking to the documents we gave the model. Great job 4o-mini!

Conclusion

sqlite-vec is a project sponsored by the Mozilla Builders Accelerator program, so it has some significant backing behind it. Have to give a big thanks to Alex Garcia, the creator of sqlite-vec , for helping to push the SQLite ecosystem and making ML possible with this simple database. This is a well maintained library, with updates coming down the pipeline on a regular basis. As of November 20th, they even added filtering by metadata! Perhaps I should re-do my aforementioned RAG article using SQLite 🤔.

The extension also offers bindings for several popular programming languages, including Ruby, Go, Rust, and more.

The fact that we are able to radically simplify our RAG pipeline to the bare essentials is remarkable. To recap, there is no need for a database service to be spun up and spun down, like Postgres, MySQL, etc. There is no need for API calls to cloud vendors. If you deploy to a server directly via Digital Ocean or Hetzner, you can even avoid costly and unnecessary complexity associated with managed cloud services like AWS, Azure, or Vercel. 

I believe this simple architecture can work for a variety of applications. It is cheaper to use, easier to maintain, and faster to iterate on. Once you reach a certain scale it will likely make sense to migrate to a more robust database such as Postgres with the pgvector extension for RAG capabilities. For more advanced capabilities such as chunking and document cleaning, a framework may be the right choice. But for startups and smaller players, it’s SQLite to the moon. 

Have fun trying out sqlite-vec for yourself!

Simple RAG architecture. Image by author.

The post Retrieval Augmented Generation in SQLite appeared first on Towards Data Science.

]]>
How to Measure the Reliability of a Large Language Model’s Response https://towardsdatascience.com/how-to-measure-the-reliability-of-a-large-language-models-response/ Thu, 13 Feb 2025 02:11:41 +0000 https://towardsdatascience.com/?p=597790 The basic principle of Large Language Models (LLMs) is very simple: to predict the next word (or token) in a sequence of words based on statistical patterns in their training data. However, this seemingly simple capability turns out to be incredibly sophisticated when it can do a number of amazing tasks such as text summarization, […]

The post How to Measure the Reliability of a Large Language Model’s Response appeared first on Towards Data Science.

]]>
The basic principle of Large Language Models (LLMs) is very simple: predict the next word (or token) in a sequence based on statistical patterns in their training data. However, this seemingly simple capability turns out to be remarkably powerful, enabling tasks such as text summarization, idea generation, brainstorming, code generation, information processing, and content creation. That said, LLMs do not have any memory, nor do they actually “understand” anything beyond their basic function: predicting the next word.

The process of next-word prediction is probabilistic. The LLM has to select each word from a probability distribution. In the process, it often generates false, fabricated, or inconsistent content in an attempt to produce coherent responses and fill in gaps with plausible-looking but incorrect information. This phenomenon is called hallucination, an inevitable and well-known feature of LLMs that warrants validation and corroboration of their outputs. 

Retrieval-augmented generation (RAG) methods, which let an LLM work with external knowledge sources, do minimize hallucinations to some extent, but they cannot completely eradicate them. Although advanced RAGs can provide in-text citations and URLs, verifying these references can be tedious and time-consuming. Therefore, we need an objective criterion for assessing the reliability or trustworthiness of an LLM’s response, whether it is generated from the model’s own knowledge or an external knowledge base (RAG). 

In this article, we will discuss how the output of an LLM can be assessed for trustworthiness by a trustworthy language model which assigns a score to the LLM’s output. We will first discuss how we can use a trustworthy language model to assign scores to an LLM’s answer and explain trustworthiness. Subsequently, we will develop an example RAG with LlamaParse and Llamaindex that assesses the RAG’s answers for trustworthiness.

The entire code of this article is available in the jupyter notebook on GitHub

Assigning a Trustworthiness Score to an LLM’s Answer

To demonstrate how we can assign a trustworthiness score to an LLM’s response, I will use Cleanlab’s Trustworthy Language Model (TLM). TLM uses a combination of uncertainty quantification and consistency analysis to compute trustworthiness scores and explanations for LLM responses.

Cleanlab offers free trial APIs which can be obtained by creating an account at their website. We first need to install Cleanlab’s Python client:

pip install --upgrade cleanlab-studio

Cleanlab supports several proprietary models such as ‘gpt-4o’, ‘gpt-4o-mini’, ‘o1-preview’, ‘claude-3-sonnet’, ‘claude-3.5-sonnet’, ‘claude-3.5-sonnet-v2’ and others. Here is how TLM assigns a trustworthiness score to gpt-4o’s answer. The trustworthiness score ranges from 0 to 1, where higher values indicate greater trustworthiness. 

from cleanlab_studio import Studio
studio = Studio("<CLEANLAB_API_KEY>")  # Get your API key from above
tlm = studio.TLM(options={"log": ["explanation"], "model": "gpt-4o"}) # GPT, Claude, etc
#set the prompt
out = tlm.prompt("How many vowels are there in the word 'Abracadabra'.?")
#the TLM response contains the actual output 'response', trustworthiness score and explanation
print(f"Model's response = {out['response']}")
print(f"Trustworthiness score = {out['trustworthiness_score']}")
print(f"Explanation = {out['log']['explanation']}")

The above code tested the response of gpt-4o for the question “How many vowels are there in the word ‘Abracadabra’.?”. The TLM’s output contains the model’s answer (response), trustworthiness score, and explanation. Here is the output of this code.

Model's response = The word "Abracadabra" contains 6 vowels. The vowels are: A, a, a, a, a, and a.
Trustworthiness score = 0.6842228802750124
Explanation = This response is untrustworthy due to a lack of consistency in possible responses from the model. Here's one inconsistent alternate response that the model considered (which may not be accurate either):
5.

It can be seen how the most advanced language model hallucinates for such simple tasks and produces the wrong output. Here is the response and trustworthiness score for the same question for claude-3.5-sonnet-v2.

Model's response = Let me count the vowels in 'Abracadabra':
A-b-r-a-c-a-d-a-b-r-a

The vowels are: A, a, a, a, a

There are 5 vowels in the word 'Abracadabra'.
Trustworthiness score = 0.9378276048845285
Explanation = Did not find a reason to doubt trustworthiness.

claude-3.5-sonnet-v2 produces the correct output. Let’s compare the two models’ responses to another question.

from cleanlab_studio import Studio
from IPython.display import display, Markdown

# Initialize the Cleanlab Studio with API key
studio = Studio("<CLEANLAB_API_KEY>")  # Replace with your actual API key

# List of models to evaluate
models = ["gpt-4o", "claude-3.5-sonnet-v2"]

# Define the prompt
prompt_text = "Which one of 9.11 and 9.9 is bigger?"

# Loop through each model and evaluate
for model in models:
   tlm = studio.TLM(options={"log": ["explanation"], "model": model})
   out = tlm.prompt(prompt_text)
  
   md_content = f"""
## Model: {model}

**Response:** {out['response']}

**Trustworthiness Score:** {out['trustworthiness_score']}

**Explanation:** {out['log']['explanation']}

---
"""
   display(Markdown(md_content))

Here is the response of the two models:

Wrong outputs generated by gpt-4o and claude-3.5-sonnet-v2, represented by low trustworthiness score

We can also generate a trustworthiness score for open-source LLMs. Let’s check the recent, much-hyped open-source LLM: deepseek-R1. I will use DeepSeek-R1-Distill-Llama-70B, based on Meta’s Llama-3.3–70B-Instruct model and distilled from DeepSeek’s larger 671-billion parameter Mixture of Experts (MoE) model. Knowledge distillation is a Machine Learning technique that aims to transfer the learnings of a large pre-trained model, the “teacher model,” to a smaller “student model.”

import streamlit as st
from langchain_groq.chat_models import ChatGroq
import os

os.environ["GROQ_API_KEY"] = st.secrets["GROQ_API_KEY"]
# Initialize the Groq-hosted DeepSeek distilled model
model = "deepseek-r1-distill-llama-70b"
groq_llm = ChatGroq(model=model, temperature=0.5)
prompt = "Which one of 9.11 and 9.9 is bigger?"
# Get the response from the model
response = groq_llm.invoke(prompt)
# Initialize Cleanlab's studio
studio = Studio("<CLEANLAB_API_KEY>")  # Replace with your actual API key
cleanlab_tlm = studio.TLM(options={"log": ["explanation"]})  # for explanations
# Get the output containing trustworthiness score and explanation
output = cleanlab_tlm.get_trustworthiness_score(prompt, response=response.content.strip())
md_content = f"""
## Model: {model}
**Response:** {response.content.strip()}
**Trustworthiness Score:** {output['trustworthiness_score']}
**Explanation:** {output['log']['explanation']}
---
"""
display(Markdown(md_content))

Here is the output of deepseek-r1-distill-llama-70b model.

The correct output of deepseek-r1-distill-llama-70b model with a high trustworthiness score

Developing a Trustworthy RAG

We will now develop a RAG system to demonstrate how we can measure the trustworthiness of an LLM’s response in RAG. This RAG system will be developed by scraping data from given links, parsing it into markdown format, and creating a vector store.

The following libraries need to be installed for the next code.

pip install llama-parse llama-index-core llama-index-embeddings-huggingface llama-index-llms-cleanlab requests beautifulsoup4 pdfkit nest-asyncio

To render HTML into PDF format, we also need to install wkhtmltopdf command line tool from their website.

The following libraries will be imported:

from llama_parse import LlamaParse
import requests
from bs4 import BeautifulSoup
import pdfkit
from llama_index.readers.docling import DoclingReader
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.llms.cleanlab import CleanlabTLM
from typing import Dict, List, ClassVar
from llama_index.core.instrumentation.events import BaseEvent
from llama_index.core.instrumentation.event_handlers import BaseEventHandler
from llama_index.core.instrumentation import get_dispatcher
from llama_index.core.instrumentation.events.llm import LLMCompletionEndEvent
import nest_asyncio
import os

The next steps will involve scraping data from given URLs using Python’s BeautifulSoup library, saving the scraped data in PDF file(s) using pdfkit, and parsing the data from PDF(s) to markdown file using LlamaParse which is a genAI-native document parsing platform built with LLMs and for LLM use cases.

We will first configure the LLM to be used by CleanlabTLM and the embedding model (Huggingface embedding model BAAI/bge-small-en-v1.5) that will be used to compute the embeddings of the scraped data to create the vector store.

options = {
   "model": "gpt-4o",
   "max_tokens": 512,
   "log": ["explanation"]
}
llm = CleanlabTLM(api_key="<CLEANLAB_API_KEY>", options=options)#Get your free API from https://cleanlab.ai/
Settings.llm = llm
Settings.embed_model = HuggingFaceEmbedding(
   model_name="BAAI/bge-small-en-v1.5"
)

We will now define a custom event handler, GetTrustworthinessScore, derived from a base event handler class and registered with LlamaIndex’s dispatcher. This handler is triggered at the end of an LLM completion and extracts the trustworthiness score from the response metadata. A helper function, display_response, displays the LLM’s response along with its trustworthiness score.

# Event Handler for Trustworthiness Score
class GetTrustworthinessScore(BaseEventHandler):
    events: ClassVar[List[BaseEvent]] = []
    trustworthiness_score: float = 0.0

    @classmethod
    def class_name(cls) -> str:
        return "GetTrustworthinessScore"

    def handle(self, event: BaseEvent) -> Dict:
        if isinstance(event, LLMCompletionEndEvent):
            self.trustworthiness_score = event.response.additional_kwargs.get("trustworthiness_score", 0.0)
            self.events.append(event)
        return {}

# Register the handler with LlamaIndex's root dispatcher so it receives completion events
root_dispatcher = get_dispatcher()
event_handler = GetTrustworthinessScore()
root_dispatcher.add_event_handler(event_handler)

# Helper function to display LLM's response
def display_response(response):
    response_str = response.response
    trustworthiness_score = event_handler.trustworthiness_score
    print(f"Response: {response_str}")
    print(f"Trustworthiness score: {round(trustworthiness_score, 2)}")

We will now generate PDFs by scraping data from given URLs. For demonstration, we will scrap data only from this Wikipedia article about large language models (Creative Commons Attribution-ShareAlike 4.0 License). 

Note: Readers are advised to always double-check the status of the content/data they are about to scrape and ensure they are allowed to do so. 

The following piece of code scrapes data from the given URLs by making an HTTP request and using BeautifulSoup Python library to parse the HTML content. HTML content is cleaned by converting protocol-relative URLs to absolute ones. Subsequently, the scraped content is converted into a PDF file(s) using pdfkit.

##########################################
# PDF Generation from Multiple URLs
##########################################
# Configure wkhtmltopdf path
wkhtml_path = r'C:\Program Files\wkhtmltopdf\bin\wkhtmltopdf.exe'
config = pdfkit.configuration(wkhtmltopdf=wkhtml_path)
# Define URLs and assign document names
urls = {
   "LLMs": "https://en.wikipedia.org/wiki/Large_language_model"
}
# Directory to save PDFs
pdf_directory = "PDFs"
os.makedirs(pdf_directory, exist_ok=True)
pdf_paths = {}
for doc_name, url in urls.items():
   try:
       print(f"Processing {doc_name} from {url} ...")
       response = requests.get(url)
       soup = BeautifulSoup(response.text, "html.parser")
       main_content = soup.find("div", {"id": "mw-content-text"})
       if main_content is None:
           raise ValueError("Main content not found")
       # Replace protocol-relative URLs with absolute URLs
       html_string = str(main_content).replace('src="//', 'src="https://').replace('href="//', 'href="https://')
       pdf_file_path = os.path.join(pdf_directory, f"{doc_name}.pdf")
       pdfkit.from_string(
           html_string,
           pdf_file_path,
           options={'encoding': 'UTF-8', 'quiet': ''},
           configuration=config
       )
       pdf_paths[doc_name] = pdf_file_path
       print(f"Saved PDF for {doc_name} at {pdf_file_path}")
   except Exception as e:
       print(f"Error processing {doc_name}: {e}")

After generating PDF(s) from the scraped data, we parse these PDFs using LlamaParse. We set the parsing instructions to extract the content in markdown format and parse the document(s) page-wise along with the document name and page number. These extracted entities (pages) are referred to as nodes. The parser iterates over the extracted nodes and updates each node’s metadata by appending a citation header which facilitates later referencing.

##########################################
# Parse PDFs with LlamaParse and Inject Metadata
##########################################

# Define parsing instructions (if your parser supports it)
parsing_instructions = """Extract the document content in markdown.
Split the document into nodes (for example, by page).
Ensure each node has metadata for document name and page number."""
      
# Create a LlamaParse instance
parser = LlamaParse(
   api_key="<LLAMACLOUD_API_KEY>",  #Replace with your actual key
   parsing_instructions=parsing_instructions,
   result_type="markdown",
   premium_mode=True,
   max_timeout=600
)
# Directory to save combined Markdown files (one per PDF)
output_md_dir = os.path.join(pdf_directory, "markdown_docs")
os.makedirs(output_md_dir, exist_ok=True)
# List to hold all updated nodes for indexing
all_nodes = []
for doc_name, pdf_path in pdf_paths.items():
   try:
       print(f"Parsing PDF for {doc_name} from {pdf_path} ...")
       nodes = parser.load_data(pdf_path)  # Returns a list of nodes
       updated_nodes = []
       # Process each node: update metadata and inject citation header into the text.
       for i, node in enumerate(nodes, start=1):
           # Copy existing metadata (if any) and add our own keys.
           new_metadata = dict(node.metadata) if node.metadata else {}
           new_metadata["document_name"] = doc_name
           if "page_number" not in new_metadata:
               new_metadata["page_number"] = str(i)
           # Build the citation header.
           citation_header = f"[{new_metadata['document_name']}, page {new_metadata['page_number']}]\n\n"
           # Prepend the citation header to the node's text.
           updated_text = citation_header + node.text
           new_node = node.__class__(text=updated_text, metadata=new_metadata)
           updated_nodes.append(new_node)
       # Save a single combined Markdown file for the document using the updated node texts.
       combined_texts = [node.text for node in updated_nodes]
       combined_md = "\n\n---\n\n".join(combined_texts)
       md_filename = f"{doc_name}.md"
       md_filepath = os.path.join(output_md_dir, md_filename)
       with open(md_filepath, "w", encoding="utf-8") as f:
           f.write(combined_md)
       print(f"Saved combined markdown for {doc_name} to {md_filepath}")
       # Add the updated nodes to the global list for indexing.
       all_nodes.extend(updated_nodes)
       print(f"Parsed {len(updated_nodes)} nodes from {doc_name}.")
   except Exception as e:
       print(f"Error parsing {doc_name}: {e}")

We now create a vector store index and a query engine. We define a custom prompt template to guide the LLM’s behavior in answering questions, then build a query engine on the index to answer queries. For each query, we retrieve the top 3 nodes from the vector store based on their semantic similarity with the query. The LLM uses these retrieved nodes to generate the final answer.

##########################################
# Create Index and Query Engine
##########################################
# Create an index from all nodes.
index = VectorStoreIndex.from_documents(documents=all_nodes)
# Define a custom prompt template that forces the inclusion of citations.
prompt_template = """
You are an AI assistant with expertise in the subject matter.
Answer the question using ONLY the provided context.
Answer in well-formatted Markdown with bullets and sections wherever necessary.
If the provided context does not support an answer, respond with "I don't know."
Context:
{context_str}
Question:
{query_str}
Answer:
"""
# Create a query engine with the custom prompt.
query_engine = index.as_query_engine(similarity_top_k=3, llm=llm, prompt_template = prompt_template)
print("Combined index and query engine created successfully!")

Now let’s test the RAG for some queries and their corresponding trustworthiness scores.

query = "When is mixture of experts approach used?"
response = query_engine.query(query)
display_response(response)
Response to the query ‘When is mixture of experts approach used?’ (image by author)
query = "How do you compare Deepseek model with OpenAI's models?"
response = query_engine.query(query)
display_response(response)
Response to the query ‘How do you compare the Deepseek model with OpenAI’s models?’ (image by author)

Assigning a trustworthiness score to LLM’s response, whether generated through direct inference or RAG, helps to define the reliability of AI’s output and prioritize human verification where needed. This is particularly important for critical domains where a wrong or unreliable response could have severe consequences. 

That’s all folks! If you like the article, please follow me on Medium and LinkedIn.

The post How to Measure the Reliability of a Large Language Model’s Response appeared first on Towards Data Science.

]]>
Synthetic Data Generation with LLMs https://towardsdatascience.com/synthetic-data-generation-with-llms/ Fri, 07 Feb 2025 18:58:45 +0000 https://towardsdatascience.com/?p=597554 Popularity of RAG Over the past two years while working with financial firms, I’ve observed firsthand how they identify and prioritize Generative AI use cases, balancing complexity with potential value. Retrieval-Augmented Generation (RAG) often stands out as a foundational capability across many LLM-driven solutions, striking a balance between ease of implementation and real-world impact. By combining […]

The post Synthetic Data Generation with LLMs appeared first on Towards Data Science.

]]>
Popularity of RAG

Over the past two years while working with financial firms, I’ve observed firsthand how they identify and prioritize Generative AI use cases, balancing complexity with potential value.

Retrieval-Augmented Generation (RAG) often stands out as a foundational capability across many LLM-driven solutions, striking a balance between ease of implementation and real-world impact. By combining a retriever that surfaces relevant documents with an LLM that synthesizes responses, RAG streamlines knowledge access, making it invaluable for applications like customer support, research, and internal knowledge management.

Defining clear evaluation criteria is key to ensuring LLM solutions meet performance standards, just as Test-Driven Development (TDD) ensures reliability in traditional software. Drawing from TDD principles, an evaluation-driven approach sets measurable benchmarks to validate and improve AI workflows. This becomes especially important for LLMs, where the complexity of open-ended responses demands consistent and thoughtful evaluation to deliver reliable results.

For RAG applications, a typical evaluation set includes representative input-output pairs that align with the intended use case. For example, in chatbot applications, this might involve Q&A pairs reflecting user inquiries. In other contexts, such as retrieving and summarizing relevant text, the evaluation set could include source documents alongside expected summaries or extracted key points. These pairs are often generated from a subset of documents, such as those that are most viewed or frequently accessed, ensuring the evaluation focuses on the most relevant content.
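
For example, a single evaluation record for a Q&A-style RAG assistant over analyst reports might look like the following; the field names and values are illustrative, not a prescribed schema.

# One hypothetical evaluation record for a RAG chatbot over research reports
eval_example = {
    "question": "Which channel saw the largest growth in advisor headcount last year?",
    "reference_answer": "The independent RIA channel, according to the report's headcount exhibit.",
    "source_document": "2023_industry_report.pdf",
    "source_page": 42,
}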

Key Challenges

Creating evaluation datasets for RAG systems has traditionally faced two major challenges.

  1. The process often relied on subject matter experts (SMEs) to manually review documents and generate Q&A pairs, making it time-intensive, inconsistent, and costly.
  2. LLMs were limited to handling text and could not process visual elements within documents, such as tables or diagrams. Standard OCR tools struggle to bridge this gap, often failing to extract meaningful information from non-textual content.

Multi-Modal Capabilities

The challenges of handling complex documents have evolved with the introduction of multimodal capabilities in foundation models. Commercial and open-source models can now process both text and visual content. This vision capability eliminates the need for separate text-extraction workflows, offering an integrated approach for handling mixed-media PDFs.

By leveraging these vision features, models can ingest entire pages at once, recognizing layout structures, chart labels, and table content. This not only reduces manual effort but also improves scalability and data quality, making it a powerful enabler for RAG workflows that rely on accurate information from a variety of sources.


Dataset Curation for Wealth Management Research Report

To demonstrate a solution to the problem of manual evaluation set generation, I tested my approach using a sample document — the 2023 Cerulli report. This type of document is typical in wealth management, where analyst-style reports often combine text with complex visuals. For a RAG-powered search assistant, a knowledge corpus like this would likely contain many such documents.

My goal was to demonstrate how a single document could be leveraged to generate Q&A pairs, incorporating both text and visual elements. While I didn’t define specific dimensions for the Q&A pairs in this test, a real-world implementation would involve providing details on types of questions (comparative, analysis, multiple choice), topics (investment strategies, account types), and many other aspects. The primary focus of this experiment was to ensure the LLM generated questions that incorporated visual elements and produced reliable answers.

POC Workflow

My workflow, illustrated in the diagram, leverages Anthropic’s Claude Sonnet 3.5 model, which simplifies the process of working with PDFs by handling the conversion of documents into images before passing them to the model. This built-in functionality eliminates the need for additional third-party dependencies, streamlining the workflow and reducing code complexity.

I excluded preliminary pages of the report like the table of contents and glossary, focusing on pages with relevant content and charts for generating Q&A pairs. Below is the prompt I used to generate the initial question-answer sets.

You are an expert at analyzing financial reports and generating question-answer pairs. For the provided PDF, the 2023 Cerulli report:

1. Analyze pages {start_idx} to {end_idx} and for **each** of those 10 pages:
   - Identify the **exact page title** as it appears on that page (e.g., "Exhibit 4.03 Core Market Databank, 2023").
   - If the page includes a chart, graph, or diagram, create a question that references that visual element. Otherwise, create a question about the textual content.
   - Generate two distinct answers to that question ("answer_1" and "answer_2"), both supported by the page’s content.
   - Identify the correct page number as indicated in the bottom left corner of the page.
2. Return exactly 10 results as a valid JSON array (a list of dictionaries). Each dictionary should have the keys: “page” (int), “page_title” (str), “question” (str), “answer_1” (str), and “answer_2” (str). The page title typically includes the word "Exhibit" followed by a number.
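
For reference, below is a minimal sketch of how a prompt like this might be sent alongside a PDF section using the Anthropic Python SDK. The document content block, the model identifier, the file name, and the QA_GENERATION_PROMPT variable are assumptions based on the workflow described here, and the exact request schema may differ across SDK versions.

import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical pre-split 50-page section of the report
with open("cerulli_2023_pages_001_050.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

# QA_GENERATION_PROMPT holds the prompt text shown above
prompt = QA_GENERATION_PROMPT.format(start_idx=1, end_idx=10)

message = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed Sonnet 3.5 identifier
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "document",
             "source": {"type": "base64", "media_type": "application/pdf", "data": pdf_b64}},
            {"type": "text", "text": prompt},
        ],
    }],
)
print(message.content[0].text)  # expected to contain the JSON array of Q&A pairs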

Q&A Pair Generation

To refine the Q&A generation process, I implemented a comparative learning approach that generates two distinct answers for each question. During the evaluation phase, these answers are assessed across key dimensions such as accuracy and clarity, with the stronger response selected as the final answer.

This approach mirrors how humans often find it easier to make decisions when comparing alternatives rather than evaluating something in isolation. It’s like an eye examination: the optometrist doesn’t ask if your vision has improved or declined but instead, presents two lenses and asks, Which is clearer, option 1 or option 2? This comparative process eliminates the ambiguity of assessing absolute improvement and focuses on relative differences, making the choice simpler and more actionable. Similarly, by presenting two concrete answer options, the system can more effectively evaluate which response is stronger.

This methodology is also cited as a best practice in the article “What We Learned from a Year of Building with LLMs” by leaders in the AI space. They highlight the value of pairwise comparisons, stating: “Instead of asking the LLM to score a single output on a Likert scale, present it with two options and ask it to select the better one. This tends to lead to more stable results.” I highly recommend reading their three-part series, as it provides invaluable insights into building effective systems with LLMs!

LLM Evaluation

For evaluating the generated Q&A pairs, I used Claude Opus for its advanced reasoning capabilities. Acting as a “judge,” the LLM compared the two answers generated for each question and selected the better option based on criteria such as directness and clarity. This approach is supported by extensive research (Zheng et al., 2023) showing that LLMs can perform evaluations on par with human reviewers.

This approach significantly reduces the amount of manual review required by SMEs, enabling a more scalable and efficient refinement process. While SMEs remain essential during the initial stages to spot-check questions and validate system outputs, this dependency diminishes over time. Once a sufficient level of confidence is established in the system’s performance, the need for frequent spot-checking is reduced, allowing SMEs to focus on higher-value tasks.
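
To make the judging step concrete, here is a minimal sketch of a pairwise comparison call, again assuming the Anthropic SDK; the judge prompt wording, the expected "answer_1"/"answer_2" reply format, and the Opus snapshot name are illustrative assumptions.

import anthropic

client = anthropic.Anthropic()

def judge_pair(question: str, answer_1: str, answer_2: str) -> str:
    """Ask an LLM judge which of two candidate answers is stronger (sketch only)."""
    judge_prompt = (
        "You are judging two candidate answers to the same question.\n"
        f"Question: {question}\n\n"
        f"answer_1: {answer_1}\n\n"
        f"answer_2: {answer_2}\n\n"
        "Pick the answer that is more accurate, direct, and clear. "
        "Reply with exactly 'answer_1' or 'answer_2'."
    )
    response = client.messages.create(
        model="claude-3-opus-20240229",  # assumption: the Opus snapshot used as judge
        max_tokens=10,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return response.content[0].text.strip()

# Example usage: keep whichever answer the judge prefers
# winner = judge_pair(qa["question"], qa["answer_1"], qa["answer_2"])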

Lessons Learned

Claude’s PDF capability has a limit of 100 pages, so I broke the original document into four 50-page sections. When I tried processing each 50-page section in a single request — and explicitly instructed the model to generate one Q&A pair per page — it still missed some pages. The token limit wasn’t the real problem; the model tended to focus on whichever content it considered most relevant, leaving certain pages underrepresented.

To address this, I experimented with processing the document in smaller batches, testing 5, 10, and 20 pages at a time. Through these tests, I found that batches of 10 pages (e.g., pages 1–10, 11–20, etc.) provided the best balance between precision and efficiency. Processing 10 pages per batch ensured consistent results across all pages while optimizing performance.

Another challenge was linking Q&A pairs back to their source. Using tiny page numbers in a PDF’s footer alone didn’t consistently work. In contrast, page titles or clear headings at the top of each page served as reliable anchors. They were easier for the model to pick up and helped me accurately map each Q&A pair to the right section.

Example Output

Below is an example page from the report, featuring two tables with numerical data. The following question was generated for this page:
How has the distribution of AUM changed across different-sized Hybrid RIA firms?

Answer: Mid-sized firms ($25m to <$100m) experienced a decline in AUM share from 2.3% to 1.0%.

In the first table, the 2017 column shows a 2.3% share of AUM for mid-sized firms, which decreases to 1.0% in 2022, thereby showcasing the LLM’s ability to synthesize visual and tabular content accurately.

Benefits

Combining caching, batching and a refined Q&A workflow led to three key advantages:

Caching

  • In my experiment, processing a singular report without caching would have cost $9, but by leveraging caching, I reduced this cost to $3 — a 3x cost savings. Per Anthropic’s pricing model, creating a cache costs $3.75 / million tokens, however, reads from the cache are only $0.30 / million tokens. In contrast, input tokens cost $3 / million tokens when caching is not used.
  • In a real-world scenario with more than one document, the savings become even more significant. For example, processing 10,000 research reports of similar length without caching would cost $90,000 in input costs alone. With caching, this cost drops to $30,000, achieving the same precision and quality while saving $60,000. A sketch of how the cached document block might look follows this list.
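
Here is a minimal sketch of what the cached request might look like, reusing the names from the earlier sketch; the cache_control field follows Anthropic's prompt-caching pattern, but treat the exact fields and any required beta headers as assumptions to verify against the current documentation.

# Mark the large, repeated PDF block as cacheable: the first request pays the
# cache-write rate, and subsequent 10-page batches read it at the cheaper rate.
response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": [
            {"type": "document",
             "source": {"type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_b64},
             "cache_control": {"type": "ephemeral"}},   # cache the document across calls
            {"type": "text",
             "text": PROMPT_TEMPLATE.format(start_idx=1, end_idx=10)},
        ],
    }],
)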

Discounted Batch Processing

  • Using Anthropic’s Batches API cuts output costs in half, making it a much cheaper option for certain tasks. Once I had validated the prompts, I ran a single batch job to evaluate all the Q&A answer sets at once. This method proved far more cost-effective than processing each Q&A pair individually.
  • For example, Claude 3 Opus typically costs $15 per million output tokens. By using batching, this drops to $7.50 per million tokens, a 50% reduction. In my experiment, each Q&A pair generated an average of 100 tokens, resulting in approximately 20,000 output tokens for the document. At the standard rate, this would have cost $0.30. With batch processing, the cost was reduced to $0.15, highlighting how this approach optimizes costs for non-sequential tasks like evaluation runs. The arithmetic is spelled out in the short sketch below.
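
As a quick sanity check, here is the arithmetic behind those numbers, using only the prices and token counts quoted above:

# Output-cost comparison using the figures quoted above
standard_rate = 15.00               # Claude 3 Opus, standard output pricing ($/M tokens)
batch_rate = standard_rate / 2      # 50% discount via the Batches API
output_tokens = 20_000              # ~100 tokens per Q&A pair across the document

standard_cost = output_tokens / 1_000_000 * standard_rate   # $0.30
batch_cost = output_tokens / 1_000_000 * batch_rate         # $0.15
print(f"standard: ${standard_cost:.2f}, batched: ${batch_cost:.2f}")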

Time Saved for SMEs

  • With more accurate, context-rich Q&A pairs, Subject Matter Experts spent less time sifting through PDFs and clarifying details, and more time focusing on strategic insights. This approach also eliminates the need to hire additional staff or allocate internal resources for manually curating datasets, a process that can be time-consuming and expensive. By automating these tasks, companies save significantly on labor costs while streamlining SME workflows, making this a scalable and cost-effective solution.

The post Synthetic Data Generation with LLMs appeared first on Towards Data Science.

]]>
Supercharge Your RAG with Multi-Agent Self-RAG https://towardsdatascience.com/supercharge-your-rag-with-multi-agent-self-rag/ Thu, 06 Feb 2025 03:07:47 +0000 https://towardsdatascience.com/?p=597406 Introduction Many of us might have tried to build a RAG application and noticed it falls significantly short of addressing real-life needs. Why is that? It’s because many real-world problems require multiple steps of information retrieval and reasoning. We need our agent to perform those as humans normally do, yet most RAG applications fall short […]

The post Supercharge Your RAG with Multi-Agent Self-RAG appeared first on Towards Data Science.

]]>
Introduction

Many of us might have tried to build a RAG application and noticed it falls significantly short of addressing real-life needs. Why is that? It’s because many real-world problems require multiple steps of information retrieval and reasoning. We need our agent to perform those as humans normally do, yet most RAG applications fall short of this.

This article explores how to supercharge your RAG application by making its data retrieval and reasoning process similar to how a human would, under a multi-agent framework. The framework presented here is based on the Self-RAG strategy but has been significantly modified to enhance its capabilities. Prior knowledge of the original strategy is not necessary for reading this article.

Real-life Case

Consider this: I was going to fly from Delhi to Munich (let’s assume I am taking the flight from an EU airline), but I was denied boarding somehow. Now I want to know what the compensation should be.

These two webpages contain relevant information, so I add them to my vector store and try to have my agent answer this for me by retrieving the right information.

Now, I pass this question to the vector store: “how much can I receive if I am denied boarding, for flights from Delhi to Munich?”.

– – – – – – – – – – – – – – – – – – – – – – – – –
Overview of US Flight Compensation Policies To get compensation for delayed flights, you should contact your airline via their customer service or go to the customer service desk. At the same time, you should bear in mind that you will only receive compensation if the delay is not weather-related and is within the carrier`s control. According to the US Department of Transportation, US airlines are not required to compensate you if a flight is cancelled or delayed. You can be compensated if you are bumped or moved from an overbooked flight. If your provider cancels your flight less than two weeks before departure and you decide to cancel your trip entirely, you can receive a refund of both pre-paid baggage fees and your plane ticket. There will be no refund if you choose to continue your journey. In the case of a delayed flight, the airline will rebook you on a different flight. According to federal law, you will not be provided with money or other compensation. Comparative Analysis of EU vs. US Flight Compensation Policies
– – – – – – – – – – – – – – – – – – – – – – – – –
(AUTHOR-ADDED NOTE: IMPORTANT, PAY ATTENTION TO THIS)
Short-distance flight delays – if it is up to 1,500 km, you are due 250 Euro compensation.
Medium distance flight delays – for all the flights between 1,500 and 3,500 km, the compensation should be 400 Euro.
Long-distance flight delays – if it is over 3,500 km, you are due 600 Euro compensation. To receive this kind of compensation, the following conditions must be met; Your flight starts in a non-EU member state or in an EU member state and finishes in an EU member state and is organised by an EU airline. Your flight reaches the final destination with a delay that exceeds three hours. There is no force majeure.
– – – – – – – – – – – – – – – – – – – – – – – – –
Compensation policies in the EU and US are not the same, which implies that it is worth knowing more about them. While you can always count on Skycop flight cancellation compensation, you should still get acquainted with the information below.
– – – – – – – – – – – – – – – – – – – – – – – – –
Compensation for flight regulations EU: The EU does regulate flight delay compensation, which is known as EU261. US: According to the US Department of Transportation, every airline has its own policies about what should be done for delayed passengers. Compensation for flight delays EU: Just like in the United States, compensation is not provided when the flight is delayed due to uncontrollable reasons. However, there is a clear approach to compensation calculation based on distance. For example, if your flight was up to 1,500 km, you can receive 250 euros. US: There are no federal requirements. That is why every airline sets its own limits for compensation in terms of length. However, it is usually set at three hours. Overbooking EU: In the EU, they call for volunteers if the flight is overbooked. These people are entitled to a choice of: Re-routing to their final destination at the earliest opportunity. Refund of their ticket cost within a week if not travelling. Re-routing at a later date at the person`s convenience.

Unfortunately, they contain only generic flight compensation policies, without telling me how much I can expect when denied boarding from Delhi to Munich specifically. If the RAG agent takes these as the sole context, it can only provide a generic answer about flight compensation policy, without giving the answer we want.

However, while the documents are not immediately useful, there is an important insight contained in the 2nd piece of context: compensation varies according to flight distance. If the RAG agent thinks more like a human, it should follow these steps to provide an answer:

  1. Based on the retrieved context, reason that compensation varies with flight distance
  2. Next, retrieve the flight distance between Delhi and Munich
  3. Given the distance (which is around 5900km), classify the flight as a long-distance one
  4. Combined with the previously retrieved context, figure out that I am due 600 EUR, assuming other conditions are fulfilled

This example demonstrates how a simple RAG, in which a single retrieval is made, falls short for several reasons:

  1. Complex Queries: Users often have questions that a simple search can’t fully address. For example, “What’s the best smartphone for gaming under $500?” requires consideration of multiple factors like performance, price, and features, which a single retrieval step might miss.
  2. Deep Information: Some information lies across documents. For example, research papers, medical records, or legal documents often include references that need to be made sense of, before one can fully understand the content of a given article. Multiple retrieval steps help dig deeper into the content.

Multiple retrievals supplemented with human-like reasoning allow for a more nuanced, comprehensive, and accurate response, adapting to the complexity and depth of user queries.

Multi-Agent Self-RAG

Here I explain the reasoning process behind this strategy, afterwards I will provide the code to show you how to achieve this!

Note: For readers interested in knowing how my approach differs from the original Self-RAG, I will describe the discrepancies in quotation boxes like this. But general readers who are unfamiliar with the original Self-RAG can skip them.

In the below graphs, each circle represents a step (aka Node), which is performed by a dedicated agent working on the specific problem. We orchestrate them to form a multi-agent RAG application.

1st iteration: Simple RAG

A simple RAG chain

This is just the vanilla RAG approach I described in “Real-life Case”, represented as a graph. After Retrieve documents, the new_documents will be used as input for Generate Answer. Nothing special, but it serves as our starting point.

2nd iteration: Digest documents with “Grade documents”

Reasoning like humans do

Remember I said in the “Real-life Case” section, that as a next step, the agent should “reason that compensation varies with flight distance”? The Grade documents step is exactly for this purpose.

Given the new_documents, the agent will try to output two items:

  1. useful_documents: Comparing them against the question asked, it determines whether the documents are useful and retains a memory of those deemed useful for future reference. As an example, since our question does not concern US compensation policies, documents describing those are discarded, leaving only those for the EU
  2. hypothesis: Based on the documents, the agent forms a hypothesis about how the question can be answered, that is, flight distance needs to be identified

Notice how the above reasoning resembles human thinking! But still, while these outputs are useful, we need to instruct the agent to use them as input for performing the next document retrieval. Without this, the answer provided in Generate answer is still not useful.

useful_documents are appended on each document retrieval loop, instead of being overwritten, to keep a memory of documents that were previously deemed useful. hypothesis is formed from useful_documents and new_documents to provide an “abstract reasoning” that informs how the query is to be transformed subsequently.

The hypothesis is especially useful when no useful documents can be identified initially, as the agent can still form a hypothesis from documents that are not immediately deemed useful, or that bear only an indirect relationship to the question at hand, to inform what questions to ask next.

3rd iteration: Brainstorm new questions to ask

Suggest questions for additional information retrieval

We have the agent reflect upon whether the answer is useful and grounded in context. If not, it should proceed to Transform query to ask further questions.

The output new_queries will be a list of new questions that the agent considers useful for obtaining extra information. Given the useful_documents (compensation policies for the EU) and the hypothesis (need to identify the flight distance between Delhi and Munich), it asks questions like “What is the distance between Delhi and Munich?”

Now we are ready to use the new_queries for further retrieval!

The transform_query node will use useful_documents (which are accumulated per iteration, instead of being overwritten) and hypothesis as input for providing the agent directions to ask new questions.

The new questions will be a list of questions (instead of a single question) kept separate from the original question, so that the original question remains in state; otherwise, the agent could lose track of it after multiple iterations.

Final iteration: Further retrieval with new questions

Issuing new queries to retrieve extra documents

The output new_queries from Transform query will be passed to the Retrieve documents step, forming a retrieval loop.

Since the question “What is the distance between Delhi and Munich?” is asked, we can expect the flight distance is then retrieved as new_documents, and subsequently graded as useful_documents, further used as an input for Generate answer.

The grade_documents node will compare the documents against both the original question and the new_queries list, so that documents considered useful for new_queries, even if not for the original question, are kept.

This is because those documents might help answer the original question indirectly, by being relevant to new_queries (like “What is the distance between Delhi and Munich?”)

Final answer!

Equipped with this new context about flight distance, the agent is now ready to provide the right answer: 600 EUR!

Next, let us now dive into the code to see how this multi-agent RAG application is created.

Implementation

The source code can be found here. Our multi-agent RAG application involves iterations and loops, and LangGraph is a great library for building such complex multi-agent applications. If you are not familiar with LangGraph, you are strongly encouraged to have a look at LangGraph’s Quickstart guide to understand more about it!

To keep this article concise, I will focus on the key code snippets only.

Important note: I am using OpenRouter as the LLM interface, but the code can be easily adapted for other LLM interfaces. Also, while in my code I am using Claude 3.5 Sonnet as the model, you can use any LLM as long as it supports tools as a parameter (check this list here), so you can also run this with other models, like DeepSeek V3 and OpenAI o1!
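
For readers who want to see what this looks like, here is a minimal sketch of pointing a LangChain chat model at OpenRouter’s OpenAI-compatible endpoint; the model slug, environment variable name, and temperature are illustrative assumptions.

import os
from langchain_openai import ChatOpenAI

# OpenRouter exposes an OpenAI-compatible API, so the standard ChatOpenAI
# wrapper can be pointed at it; any tool-calling model OpenRouter lists should work.
llm = ChatOpenAI(
    model="anthropic/claude-3.5-sonnet",
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
    temperature=0,
)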

State definition

In the previous section, I defined various elements, e.g. new_documents and hypothesis, that are to be passed to each step (aka Node); in LangGraph’s terminology, these elements are called State.

We define the State formally with the following snippet.

from typing import List, Annotated
from typing_extensions import TypedDict

def append_to_list(original: list, new: list) -> list:
    original.append(new)
    return original

def combine_list(original: list, new: list) -> list:
    return original + new

class GraphState(TypedDict):
    """
    Represents the state of our graph.

    Attributes:
        question: question
        generation: LLM generation
        new_documents: newly retrieved documents for the current iteration
        useful_documents: documents that are considered useful
        graded_documents: documents that have been graded
        new_queries: newly generated questions
        hypothesis: hypothesis
    """

    question: str
    generation: str
    new_documents: List[str]
    useful_documents: Annotated[List[str], combine_list]
    graded_documents: List[str]
    new_queries: Annotated[List[str], append_to_list]
    hypothesis: str
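
As a quick illustration of how the two reducers behave (with made-up values), combine_list keeps useful_documents flat while accumulating across iterations, whereas append_to_list stores each iteration’s questions as a separate batch, which is why new_queries is printed with a nested loop later on.

# combine_list: flat accumulation across retrieval loops
useful = combine_list(["EU compensation tiers"], ["Delhi-Munich distance: 5,931 km"])
# -> ["EU compensation tiers", "Delhi-Munich distance: 5,931 km"]

# append_to_list: each iteration's new questions are kept as a separate batch
queries = append_to_list([], ["What is the distance between Delhi and Munich?"])
queries = append_to_list(queries, ["Does denied boarding follow delay compensation amounts?"])
# -> [[...], [...]]  (a list of per-iteration lists)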

Graph definition

This is where we combine the different steps to form a “Graph”, which is a representation of our multi-agent application. The definitions of various steps (e.g. grade_documents) are represented by their respective functions.
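
The node functions themselves are kept in the source repository; as an illustration, here is a minimal sketch of what a grade_documents node might look like, where llm_grade_document and llm_form_hypothesis are hypothetical helpers standing in for the structured LLM calls.

def grade_documents(state: GraphState) -> dict:
    """Filter newly retrieved documents and form a hypothesis (sketch only)."""
    question = state["question"]
    new_queries = state.get("new_queries", [])

    graded = [
        doc for doc in state["new_documents"]
        # hypothetical helper: True if the document is relevant to the original
        # question or to any of the newly generated questions
        if llm_grade_document(question, new_queries, doc)
    ]
    # hypothetical helper: summarise how the question could be answered
    hypothesis = llm_form_hypothesis(question, graded + state.get("useful_documents", []))

    # Return a partial state update; LangGraph merges these keys into GraphState,
    # applying the reducers (so useful_documents accumulates across iterations).
    return {
        "graded_documents": graded,
        "useful_documents": graded,
        "hypothesis": hypothesis,
    }

With the node functions in place, the graph is wired up as follows.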

from langgraph.graph import END, StateGraph, START
from langgraph.checkpoint.memory import MemorySaver
from IPython.display import Image, display

workflow = StateGraph(GraphState)

# Define the nodes
workflow.add_node("retrieve", retrieve)  # retrieve
workflow.add_node("grade_documents", grade_documents)  # grade documents
workflow.add_node("generate", generate)  # generatae
workflow.add_node("transform_query", transform_query)  # transform_query

# Build graph
workflow.add_edge(START, "retrieve")
workflow.add_edge("retrieve", "grade_documents")
workflow.add_conditional_edges(
    "grade_documents",
    decide_to_generate,
    {
        "transform_query": "transform_query",
        "generate": "generate",
    },
)
workflow.add_edge("transform_query", "retrieve")
workflow.add_conditional_edges(
    "generate",
    grade_generation_v_documents_and_question,
    {
        "useful": END,
        "not supported": "transform_query",
        "not useful": "transform_query",
    },
)

# Compile
memory = MemorySaver()
app = workflow.compile(checkpointer=memory)
display(Image(app.get_graph(xray=True).draw_mermaid_png()))

Running the above code, you should see this graphical representation of our RAG application. Notice how it is essentially equivalent to the graph I showed in the final iteration above!

Visualizing the multi-agent RAG graph

After generate, if the answer is considered “not supported”, the agent will proceed to transform_query instead of to generate again, so that the agent will look for additional information rather than trying to regenerate answers based on existing context, which might not suffice for providing a “supported” answer.

Now we are ready to put the multi-agent application to the test! With the below code snippet, we ask this question: “how much can I receive if I am denied boarding, for flights from Delhi to Munich?”

from pprint import pprint
from uuid import uuid4

config = {"configurable": {"thread_id": str(uuid4())}}

# Run
inputs = {
    "question": "how much can I receive if I am denied boarding, for flights from Delhi to Munich?",
    }
for output in app.stream(inputs, config):
    for key, value in output.items():
        # Node
        pprint(f"Node '{key}':")
        # Optional: print full state at each node
        # print(app.get_state(config).values)
    pprint("\n---\n")

# Final generation
pprint(value["generation"])

While output might vary (sometimes the application provides the answer without any iterations, because it “guessed” the distance between Delhi and Munich), it should look something like this, which shows the application went through multiple rounds of data retrieval for RAG.

---RETRIEVE---
"Node 'retrieve':"
'\n---\n'
---CHECK DOCUMENT RELEVANCE TO QUESTION---
---GRADE: DOCUMENT NOT RELEVANT---
---GRADE: DOCUMENT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---ASSESS GRADED DOCUMENTS---
---DECISION: GENERATE---
"Node 'grade_documents':"
'\n---\n'
---GENERATE---
---CHECK HALLUCINATIONS---
'---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RE-TRY---'
"Node 'generate':"
'\n---\n'
---TRANSFORM QUERY---
"Node 'transform_query':"
'\n---\n'
---RETRIEVE---
"Node 'retrieve':"
'\n---\n'
---CHECK DOCUMENT RELEVANCE TO QUESTION---
---GRADE: DOCUMENT NOT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---GRADE: DOCUMENT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---GRADE: DOCUMENT NOT RELEVANT---
---ASSESS GRADED DOCUMENTS---
---DECISION: GENERATE---
"Node 'grade_documents':"
'\n---\n'
---GENERATE---
---CHECK HALLUCINATIONS---
---DECISION: GENERATION IS GROUNDED IN DOCUMENTS---
---GRADE GENERATION vs QUESTION---
---DECISION: GENERATION ADDRESSES QUESTION---
"Node 'generate':"
'\n---\n'
('Based on the context provided, the flight distance from Munich to Delhi is '
 '5,931 km, which falls into the long-distance category (over 3,500 km). '
 'Therefore, if you are denied boarding on a flight from Delhi to Munich '
 'operated by an EU airline, you would be eligible for 600 Euro compensation, '
 'provided that:\n'
 '1. The flight is operated by an EU airline\n'
 '2. There is no force majeure\n'
 '3. Other applicable conditions are met\n'
 '\n'
 "However, it's important to note that this compensation amount is only valid "
 'if all the required conditions are met as specified in the regulations.')

And the final answer is what we aimed for!

Based on the context provided, the flight distance from Munich to Delhi is
5,931 km, which falls into the long-distance category (over 3,500 km).
Therefore, if you are denied boarding on a flight from Delhi to Munich
operated by an EU airline, you would be eligible for 600 Euro compensation,
provided that:
1. The flight is operated by an EU airline
2. There is no force majeure
3. Other applicable conditions are met

However, it's important to note that this compensation amount is only valid
if all the required conditions are met as specified in the regulations.

Inspecting the State, we see how the hypothesis and new_queries enhance the effectiveness of our multi-agent RAG application by mimicking the human thinking process.

Hypothesis

print(app.get_state(config).values.get('hypothesis',""))
--- Output ---
To answer this question accurately, I need to determine:

1. Is this flight operated by an EU airline? (Since Delhi is non-EU and Munich is EU)
2. What is the flight distance between Delhi and Munich? (To determine compensation amount)
3. Are we dealing with a denied boarding situation due to overbooking? (As opposed to delay/cancellation)

From the context, I can find information about compensation amounts based on distance, but I need to verify:
- If the flight meets EU compensation eligibility criteria
- The exact distance between Delhi and Munich to determine which compensation tier applies (250€, 400€, or 600€)
- If denied boarding compensation follows the same amounts as delay compensation

The context doesn't explicitly state compensation amounts specifically for denied boarding, though it mentions overbooking situations in the EU require offering volunteers re-routing or refund options.

Would you like me to proceed with the information available, or would you need additional context about denied boarding compensation specifically?

New Queries

for questions_batch in app.get_state(config).values.get('new_queries',""):
    for q in questions_batch:
        print(q)
--- Output ---
What is the flight distance between Delhi and Munich?
Does EU denied boarding compensation follow the same amounts as flight delay compensation?
Are there specific compensation rules for denied boarding versus flight delays for flights from non-EU to EU destinations?
What are the compensation rules when flying with non-EU airlines from Delhi to Munich?
What are the specific conditions that qualify as denied boarding under EU regulations?

Conclusion

Simple RAG, while easy to build, might fall short in tackling real-life questions. By incorporating human thinking process into a multi-agent RAG framework, we are making RAG applications much more practical.

*Unless otherwise noted, all images are by the author


The post Supercharge Your RAG with Multi-Agent Self-RAG appeared first on Towards Data Science.

]]>
RAG Isn’t Immune to LLM Hallucination https://towardsdatascience.com/detecting-hallucination-in-rag-ecaf251a6633/ Mon, 20 Jan 2025 13:02:14 +0000 https://towardsdatascience.com/detecting-hallucination-in-rag-ecaf251a6633/ How to measure how much of your RAG's output is correct

The post RAG Isn’t Immune to LLM Hallucination appeared first on Towards Data Science.

]]>
Photo by Johannes Plenio on Unsplash

I recently started to favor Graph RAGs more than vector store-backed ones.

No offense to vector databases; they work fantastically in most cases. The caveat is that you need explicit mentions in the text to retrieve the correct context.

We have workarounds for that, and I’ve covered a few in my previous posts.

Building RAGs Without A Retrieval Model Is a Terrible Mistake

For instance, ColBERT and Multi-representation are helpful retrieval models we should consider when building RAG apps.

GraphRAGs suffer less from retrieval issues (I didn’t say they don’t suffer). Whenever the retrieval requires some reasoning, GraphRAG performs extraordinarily well.

Providing relevant context solves a key problem in LLM-based applications: hallucination. However, it does not eliminate hallucinations altogether.

When you can’t fix something, you measure it. And that’s the focus of this post. In other words, how do we evaluate RAG apps?

But before that, why do LLMs lie in the first place?

Why do LLMs hallucinate (even RAGs)?

Language models sometimes lie—all right—and sometimes are inaccurate. This is primarily due to two reasons.

The first is that the LLM doesn’t have enough context to answer. This is why Retrieval Augmented Generation (RAG) came into existence. RAGs provide context to the LLM that it hasn’t seen in its training.

Some models work well to answer within the provided context, and others don’t. For instance, Llama 3.1 8B works fine if you provide context to generate answers, while DistilBERT doesn’t.

I Fine-Tuned the Tiny Llama 3.2 1B to Replace GPT-4o

The second reason is when the answer requires some reasoning. Not every LLM is good at reasoning; each does it differently. For context, Llama 2 13B doesn’t perform well on reasoning tasks compared to GPT-4o.

Of course, these models are from different generations, and it wouldn’t be appropriate to compare them side by side. But that’s also the point I’m trying to make – if you don’t choose your model wisely, you can expect it to produce hallucinated answers.


How to evaluate hallucination in RAGs

Now that we’re convinced that every LLM-based app can hallucinate and that measuring is the only way to keep it in check, how can we do that?

A few LLM evaluation frameworks have recently evolved. Two of them, RAGAS and Deepeval, are particularly popular. Both tools are open-source and free to use, although a paid version exists.

I’ll be using Deepeval in this post. A quick note: I am not affiliated with Deepeval; I simply like it.

Let’s start by installing the tool. You can get it from the PyPI repository.

pip install -qU deepeval ragas

Because LLM evaluations are vague, we cannot test LLMs the way we test software. LLM-generated responses are never fully predictable, so they require another LLM to evaluate them.

The LLM evaluator needs to be a competent model. I’d choose GPT-4o-mini for this. It’s cost-effective and accurate, but you can experiment.

We can evaluate RAGs in two different ways. The first is a prompt-based technique called G-Eval. The second is RAGAS, which allows us to evaluate RAGs systematically.

Both RAGAS and G-Eval are incredibly helpful frameworks for evaluating RAGs. When choosing between them, I’d use RAGAS as the default: because its computation is well-defined, its results are easy to compare across different evaluation runs. However, your evaluation criteria can sometimes be more complex than what RAGAS covers. In such a situation, you can dictate the evaluation steps yourself using G-Eval.

G-Eval: The more versatile evaluation framework

As mentioned earlier, G-Eval is a prompt-based evaluation framework, which means we get to tell the LLM how to evaluate our RAG. We can do this by setting the input parameter criteria or evaluation_steps.

Here’s an example using the criteria parameter.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase
from deepeval.test_case import LLMTestCaseParams

test_case = LLMTestCase(
    input="When did XYZ, Inc complete the acquisition of ABC, Inc?",
    actual_output="XYZ, Inc completed the acquisition of ABC, Inc on January 15, 2025, solidifying its market leadership.",
    expected_output="XYZ, Inc completed the acquisition of ABC, Inc on January 10, 2025.",
)

correctness_metric_criteria = GEval(
    name="Correctness with Criteria",
    criteria="Verify if the actual output provides a factually accurate and complete response to the expected output without contradictions or omissions.",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
)

correctness_metric_criteria.measure(test_case)

print("Score:", correctness_metric_criteria.score)
print("Reason:", correctness_metric_criteria.reason)

>> Score: 0.6683615588714628
>> Reason: The Actual Output accurately states the acquisition and maintains a similar context, but the date is incorrect compared to the Expected Output.

The above example verifies that the actual output aligns with the expected output. In this case, the date was different. But it has everything else correct.

The good thing about G-Eval is that we can define how we should evaluate it. We can specify that in the criteria if we’re okay with a date difference of less than 10 days. Here’s how the new evaluation looks.

...

correctness_metric_criteria = GEval(
    ...
    criteria="Verify if the actual output is correct and the date in the expected output is not more than 10 days apart. ",
    ...
)

...

>> Score: 0.8543395625001086
>> Reason: The Actual Output provides an accurate acquisition date of XYZ, Inc completing the acquisition of ABC, Inc and matches the Expected Output except for a 5-day difference in dates, which is within the acceptable range.

As you’ve noticed, this shoots up the score to .85 from .66 because we’ve allowed a 10-day range for the date match.

The above example only checks whether an LLM’s response regarding an expected outcome is correct. However, we also need to check the retrieved data to evaluate RAG.

Here’s a RAG evaluation example. Notice that we’ve used the evaluation_steps parameter instead of criteria.

from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# Define the retrieval context with multiple relevant pieces of information
retrieval_context = [
    "XYZ Corporation announced its plans to acquire ABC Enterprises in a deal valued at approximately $4.5 billion.",
    "The merger is expected to consolidate XYZ's market position and expand its reach into new domains.",
    "The acquisition was completed on January 15, 2024.",
    "Post-acquisition, XYZ, Inc. aims to integrate ABC, Inc.'s technologies to enhance its product offerings.",
    "The regulatory bodies approved the acquisition without any objections."
]

# Create the test case with input, actual output, expected output, and retrieval context
test_case = LLMTestCase(
    input="When did XYZ, Inc. complete the acquisition of ABC, Inc?",
    actual_output="XYZ, Inc. completed the acquisition of ABC, Inc. on January 17, 2024, solidifying its market leadership.",
    expected_output="XYZ, Inc. completed the acquisition of ABC, Inc. on January 15, 2024.",
    retrieval_context=retrieval_context
)

# Define the correctness metric with evaluation steps
correctness_metric_steps = GEval(
    name="Correctness with Evaluation Steps",
    evaluation_steps=[
        "Verify the retrieval_context has sufficient information to respond to the input.",
        "Verify if the 'actual_output' provides a completion date not more than 10 days different from the acquisition date stated in the 'retrieval_context'.",
        "Ensure that the 'actual_output' matches the 'expected_output' in terms of factual accuracy.",
        "Check for any contradictions or omissions between the 'actual_output' and the 'expected_output'."
    ],
    evaluation_params=[
        LLMTestCaseParams.INPUT,
        LLMTestCaseParams.ACTUAL_OUTPUT,
        LLMTestCaseParams.EXPECTED_OUTPUT,
        LLMTestCaseParams.RETRIEVAL_CONTEXT
    ],
)

# Measure the test case using the defined metric
correctness_metric_steps.measure(test_case)

# Print the evaluation score and reason
print("Score:", correctness_metric_steps.score)
print("Reason:", correctness_metric_steps.reason)

>> Score: 0.7907626389536403
>> Reason: The retrieval_context provides accurate completion info as January 15, 2024. The actual_output date is close but slightly off, being two days later than in the expected_output. No other significant factual inaccuracies or contradictions are present.

In the above example, we’ve used evaluation_steps instead of criteria. But this is optional. Explaining the evaluation process in a single statement criteria param would work just fine. But it’s always easier to break them down into steps.

G-Eval provides a single score to the evaluation regardless of the steps. You give it a single evaluation step or a dozen of them, and G-Eval will do all the work and produce a single score.

This is easy to understand and often sufficient. But what if we need to test an RAG pipeline component-by-component? This is where RAGAS comes into play.

RAGAS: Standard and more granular evaluation

RAGAS is a combination of four other evaluations in an RAG pipeline. They are answer relevancy, faithfulness, contextual recall, and contextual precision. The RAGAS score is simply the average of all these four.

Here’s an example:

In the below example, I’ve used both RagasMetric And the individual components. In real evaluation, you can either use the individual ones or the RagasMetric only.

from deepeval import evaluate

from deepeval.test_case import LLMTestCase

from deepeval.metrics.ragas import RagasMetric

# Individual metrics of RagasMetric
from deepeval.metrics.ragas import RAGASAnswerRelevancyMetric
from deepeval.metrics.ragas import RAGASFaithfulnessMetric
from deepeval.metrics.ragas import RAGASContextualRecallMetric
from deepeval.metrics.ragas import RAGASContextualPrecisionMetric

# Define the retrieval context with multiple relevant pieces of information
retrieval_context = [
    "XYZ Corporation announced its plans to acquire ABC Enterprises in a deal valued at approximately $4.5 billion.",
    "The merger is expected to consolidate XYZ's market position and expand its reach into new domains.",
    "The acquisition was completed on January 15, 2024.",
    "Post-acquisition, XYZ, Inc. aims to integrate ABC, Inc.'s technologies to enhance its product offerings.",
    "The regulatory bodies approved the acquisition without any objections."
]

# Create the test case with input, actual output, expected output, and retrieval context
test_case = LLMTestCase(
    input="When did XYZ, Inc. complete the acquisition of ABC, Inc?",
    actual_output="XYZ, Inc. completed the acquisition of ABC, Inc. on January 15, 2025, solidifying its market leadership.",
    expected_output="XYZ, Inc. completed the acquisition of ABC, Inc. on January 15, 2024",
    retrieval_context=retrieval_context
)

# Initialize the RagasMetric with a threshold and specify the model
ragas_metric = RagasMetric(threshold=0.5, model="gpt-4o-mini")
ragas_answer_relevancy_metric = RAGASAnswerRelevancyMetric(threshold=0.5, model="gpt-4o-mini")
ragas_faithfulness_metric = RAGASFaithfulnessMetric(threshold=0.5, model="gpt-4o-mini")
ragas_contextual_recall_metric = RAGASContextualRecallMetric(threshold=0.5, model="gpt-4o-mini")
ragas_contextual_precision_metric = RAGASContextualPrecisionMetric(threshold=0.5, model="gpt-4o-mini")

# Measure the test case using the RagasMetric
ragas_metric.measure(test_case)
ragas_answer_relevancy_metric.measure(test_case)
ragas_faithfulness_metric.measure(test_case)
ragas_contextual_recall_metric.measure(test_case)
ragas_contextual_precision_metric.measure(test_case)

# Print the evaluation score
print("Score:", ragas_metric.score)

# Alternatively, evaluate test cases in bulk
result = evaluate([test_case], [
    ragas_metric,
    ragas_answer_relevancy_metric,
    ragas_faithfulness_metric,
    ragas_contextual_recall_metric,
    ragas_contextual_precision_metric,
])

>>Metrics Summary

  - ✅ RAGAS (score: 0.6664463603011416, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: None, error: None)
  - ✅ Answer Relevancy (ragas) (score: 0.9988984715390413, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: None, error: None)
  - ❌ Faithfulness (ragas) (score: 0.0, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: None, error: None)
  - ✅ Contextual Recall (ragas) (score: 1.0, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: None, error: None)
  - ❌ Contextual Precision (ragas) (score: 0.3333333333, threshold: 0.5, strict: False, evaluation model: gpt-4o-mini, reason: None, error: None)

To understand this, let’s revisit how a RAG app responds to input. The first step is fetching contextual information. The LLM then uses the retrieved context to answer the user’s query.

Contextual Precision is a metric that evaluates whether statements relevant to the input are ranked higher. In our example, the one that contains the acquisition date is ranked #3 in the retrieved context. Hence, it has a low score (0.33) in the evaluation.

Contextual Recall tests if the retrieved context is sufficient to get an answer closer to the expected output. In our example, the acquisition date is available in the retrieved context. Hence, the score for this metric is 1.0.

Faithfulness is the metric that measures hallucinations in responses. It checks the correctness of the actual output with respect to the retrieved context. In our example, the retrieved context clearly states that the acquisition was completed in 2024. But the output says it’s in 2025, which is very wrong. Hence, it gets a 0 for the faithfulness score.

Finally, the answer relevancy metric measures whether the response generated was at least an attempt to answer the right question (input). Even though factually incorrect, the LLM answered the right question in our example, so it received a 0.99 score.

The RAGAS score is a weighted average of these four individual metrics.

The downside of RAGAS is that you sometimes get a pass, even if the response wasn’t correct. That’s precisely the case in our example. Answering the acquisition date with a year difference isn’t negligible. But still, it was only one out of four metrics considered. Hence, the overall score of 0.66 is well above the threshold.

Thus, I’d suggest using individual metrics to understand the components rather than the whole system with a single metric (like RAGAS). This helps debug the app better.

Final thoughts

Evaluating apps with an LLM is different from evaluating a software project. Evaluating a RAG app is even more different and challenging.

This is mainly because an LLM’s responses aren’t always the same, and they may sometimes hallucinate.

G-Eval and RAGAS are popular for evaluating RAG applications. Beyond covering many aspects of a RAG app, they also help surface LLM hallucinations.

Once you discover your app suffers from hallucinations, you could work on the workflow to fetch better contextual information (perhaps with a graph DB) or change the model.

The post RAG Isn’t Immune to LLM Hallucination appeared first on Towards Data Science.

]]>