Large Language Models | Towards Data Science https://towardsdatascience.com/category/artificial-intelligence/large-language-models/

How to Train LLMs to “Think” (o1 & DeepSeek-R1) https://towardsdatascience.com/how-to-train-llms-to-think-o1-deepseek-r1/
Advanced reasoning models explained

In September 2024, OpenAI released its o1 model, trained on large-scale reinforcement learning, giving it “advanced reasoning” capabilities. Unfortunately, the details of how they pulled this off were never shared publicly. Today, however, DeepSeek (an AI research lab) has replicated this reasoning behavior and published the full technical details of their approach. In this article, I will discuss the key ideas behind this innovation and describe how they work under the hood.

OpenAI’s o1 model marked a new paradigm for training large language models (LLMs). It introduced so-called “thinking” tokens, which enable a sort of scratch pad that the model can use to think through problems and user queries.

The major insight from o1 was that performance improved with increased test-time compute. This is just a fancy way of saying that the more tokens a model generates, the better its response. The figure below, reproduced from OpenAI’s blog, captures this point nicely.

AIME accuracy scaling with train-time and test-time compute, respectively. Plots reillustrated from [1].

In the plots above, the y-axes are model performance on AIME (math problems), while the x-axes are various compute times. The left plot depicts the well-known neural scaling laws that kicked off the LLM rush of 2023. In other words, the longer a model is trained (i.e. train-time compute), the better its performance.

On the right, however, we see a new type of scaling law. Here, the more tokens a model generates (i.e., test-time compute), the better its performance.

“Thinking” tokens

A key feature of o1 is its so-called “thinking” tokens. These are special tokens introduced during post-training, which delimit the model’s chain of thought (CoT) reasoning (i.e., thinking through the problem). These special tokens are important for two reasons.

One, they clearly demarcate where the model’s “thinking” starts and stops, so it can be easily parsed when spinning up a UI. And two, they produce a human-interpretable readout of how the model “thinks” through the problem.

Although OpenAI disclosed that they used reinforcement learning to produce this ability, the exact details of how they did it were not shared. Today, however, we have a pretty good idea thanks to a recent publication from DeepSeek.

DeepSeek’s paper

In January 2025, DeepSeek published “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning” [2]. While this paper caused its fair share of pandemonium, its central contribution was unveiling the secrets behind o1.

It introduces two models: DeepSeek-R1-Zero and DeepSeek-R1. The former was trained exclusively on reinforcement learning (RL), and the latter was a mixture of Supervised Fine-tuning (SFT) and RL.

Although the headlines (and title of the paper) were about DeepSeek-R1, the former model is important because, one, it generated training data for R1, and two, it demonstrates striking emergent reasoning abilities that were not taught to the model.

In other words, R1-Zero discovers CoT and test-time compute scaling through RL alone! Let’s discuss how it works.

DeepSeek-R1-Zero (RL only)

Reinforcement learning (RL) is a Machine Learning approach in which, rather than training models on explicit examples, models learn through trial and error [3]. It works by passing the model a reward signal that has no explicit functional relationship with the model’s parameters.

This is similar to how we often learn in the real world. For example, if I apply for a job and don’t get a response, I have to figure out what I did wrong and how to improve. This is in contrast to supervised learning, which, in this analogy, would be like the recruiter giving me specific feedback on what I did wrong and how to improve.

While using RL to train R1-Zero involves many technical details, I want to highlight 3 key ones: the prompt template, reward signal, and GRPO (Group Relative Policy Optimization).

1) Prompt template

The template used for training is given below, where {prompt} is replaced with a question from a dataset of (presumably) complex math, coding, and logic problems. Notice the inclusion of <answer> and <think> tags via simple prompting.

A conversation between User and Assistant. The user asks a question, and the 
Assistant solves it. The assistant first thinks about the reasoning process in 
the mind and then provides the user with the answer. The reasoning process and 
answer are enclosed within <think> </think> and <answer> </answer> tags, 
respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: {prompt}. Assistant:

Something that stands out here is the minimal and relaxed prompting strategy. This was an intentional choice by DeepSeek to avoid biasing model responses and to observe its natural evolution during RL.

2) Reward signal

The RL reward has two components: accuracy and format rewards. Since the training dataset consists of questions with clear right answers, a simple rule-based strategy is used to evaluate response accuracy. Similarly, a rule-based formatting reward is used to ensure reasoning tokens are generated in between the thinking tags.

It’s noted by the authors that a neural reward model isn’t used (i.e. rewards are not computed by a neural net), because these may be prone to reward hacking. In other words, the LLM learns how to trick the reward model into maximizing rewards while decreasing downstream performance.

This is just like how humans find ways to exploit any incentive structure to maximize their personal gains while forsaking the original intent of the incentives. This highlights the difficulty of producing good rewards (whether for humans or computers).
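
To make this concrete, here is a minimal sketch (not from the paper) of what a rule-based reward of this kind could look like. The tag-checking mirrors the prompt template above, but the exact matching logic and the 0.5/1.0 reward values are illustrative assumptions.

import re

def compute_reward(response: str, ground_truth: str) -> float:
    """Toy rule-based reward: format reward + accuracy reward, no neural reward model."""
    reward = 0.0

    # Format reward: reasoning must sit between <think> tags and the final
    # answer between <answer> tags, matching the prompt template.
    has_think = re.search(r"<think>.+?</think>", response, re.DOTALL)
    has_answer = re.search(r"<answer>(.+?)</answer>", response, re.DOTALL)
    if has_think and has_answer:
        reward += 0.5  # illustrative value

    # Accuracy reward: the extracted answer must match the known solution.
    if has_answer and has_answer.group(1).strip() == ground_truth.strip():
        reward += 1.0  # illustrative value

    return reward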

3) GRPO (Group Relative Policy Optimization)

The final detail is how rewards are translated into model parameter updates. This section is quite technical, so the enlightened reader can feel free to skip ahead.

GRPO is an RL approach that combines a collection of responses to update model parameters. To encourage stable training, the authors also incorporate clipping and KL-divergence regularization terms into the loss function. Clipping ensures optimization steps are not too big, and regularization ensures the model predictions do not change too abruptly.

Here is the complete loss function with some (hopefully) helpful annotations.

Annotated GRPO loss function. Image by author.
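
For readers who want to see the moving parts in code, below is a simplified sketch of the objective the figure annotates. It works at the sequence level rather than the token level, and the clipping and KL coefficients are illustrative defaults, not DeepSeek’s actual hyperparameters.

import numpy as np

def grpo_loss(new_logprobs, old_logprobs, rewards, kl_to_ref, eps=0.2, beta=0.04):
    """Simplified GRPO loss for one group of sampled responses to the same prompt."""
    rewards = np.asarray(rewards, dtype=float)

    # Group-relative advantage: each reward is normalized against its own group.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Probability ratio between the new and old policies for each response.
    ratios = np.exp(np.asarray(new_logprobs) - np.asarray(old_logprobs))

    # Clipped surrogate objective: keeps optimization steps from being too big.
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1 - eps, 1 + eps) * advantages
    surrogate = np.minimum(unclipped, clipped)

    # KL regularization keeps predictions from changing too abruptly
    # relative to a reference model.
    return -(surrogate - beta * np.asarray(kl_to_ref)).mean()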

Results (emergent abilities)

The most striking result of R1-Zero is that, despite its minimal guidance, it develops effective reasoning strategies that we might recognize.

For example, it implicitly learns through RL to improve responses via test-time compute (recall the earlier insight from o1). This is depicted in the plot below from the R1 paper [2].

Average response length increases with additional training (and improved performance). Screenshot from DeepSeek R1 paper [2].

Another notable ability is reflecting on past responses with an uncanny human-like internal monologue. Here is an example from R1-Zero. Notice the long CoT and additional verification step before generating the final answer (I omitted some parts because the response was very long).

User prompt: If it takes me 1.5 hours to write a codebase and my intern 8 
hours. How long will it take us both to write 3 codebases?

<think> To solve for how long it will take both you and your intern to write 
three codebases together, we need to first find out how many codebases each of 
you can write per hour individually.

(long CoT) ...

So, it will take approximately 3.7895 hours for you and your intern to write 
three codebases together.

To make sure everything makes sense, let us round the answer to a reasonable 
number of decimal places (e.g., round to two decimal places):

t ≈ 3.79 hours

However, if you need a more "human-friendly" answer, you could convert the 
decimal part into minutes. Since 0.7895 of an hour is:

0.7895 × 60 minutes ≈ 47.37 minutes

So, 3.7895 hours is approximately 3 hours and 47.37 minutes (or roughly 3 
hours and 47 minutes if rounding to the nearest minute).

## Verification (optional step for verification)
To verify, let us check how many codebases you can individually write in t = 
72/19 hours:

Your work in 72/19 hours
= (2/3 codebases per hour) × (72/19 hours)
= (2 × 72) / (3 × 19)
= 144 / 57
= 48/19 codebases.

(verification process) ...

However, a more "exact" answer should be kept in fractional form unless 
specified otherwise. Thus, the answer is: 72/19 hours which is approximately
$3.7895$ hours or about $3.79$ hours when rounded to two decimal places).
</think>

72/19 hours (or approximately 3.79 hours).

Problems with R1-Zero

Although the thinking tokens from R1-Zero give a human-readable window into the model’s “thought process,” the authors report some issues. Namely, the learned CoT sometimes suffers from readability issues and language mixing, suggesting (perhaps) that its reasoning begins to veer away from something easily interpretable by humans.

DeepSeek-R1 (SFT + RL)

To mitigate R1-Zero’s interpretability issues, the authors explore a multi-step training strategy that utilizes both supervised fine-tuning (SFT) and RL. This strategy results in DeepSeek-R1, a better-performing model that is getting more attention today. The entire training process can be broken down into 4 steps.

Step 1: SFT with reasoning data

To help get the model on the right track when it comes to learning how to reason, the authors start with SFT. This leverages 1000s of long CoT examples from various sources, including few-shot prompting (i.e., showing examples of how to think through problems), directly prompting the model to use reflection and verification, and refining synthetic data from R1-Zero [2].

The two key advantages of this are, one, the desired response format can be explicitly shown to the model, and two, seeing curated reasoning examples unlocks better performance for the final model.

Step 2: R1-Zero style RL (+ language consistency reward)

Next, an RL training step is applied to the model after SFT. This is done in the same way as for R1-Zero, with an added component to the reward signal that incentivizes language consistency. This was added to the reward because R1-Zero tended to mix languages, making it difficult to read its generations.

Step 3: SFT with mixed data

At this point, the model likely has performance on par with (or better than) R1-Zero on reasoning tasks. However, this intermediate model wouldn’t be very practical because it wants to reason about any input it receives (e.g., “hi there”), which is unnecessary for factual Q&A, translation, and creative writing. That’s why another SFT round is performed with both reasoning (600k examples) and non-reasoning (200k examples) data.

The reasoning data here is generated from the resulting model from Step 2. Additionally, examples are included which use an LLM judge to compare model predictions to ground truth answers.

The non-reasoning data comes from two places. First, the SFT dataset used to train DeepSeek-V3 (the base model). Second, synthetic data generated by DeepSeek-V3. Note that examples are included that do not use CoT so that the model doesn’t use thinking tokens for every response.

Step 4: RL + RLHF

Finally, another RL round is done, which includes (again) R1-Zero style reasoning training and RL on human feedback. This latter component helps improve the model’s helpfulness and harmlessness.

The result of this entire pipeline is DeepSeek-R1, which excels at reasoning tasks and is an AI assistant you can chat with normally.

Accessing R1-Zero and R1

Another key contribution from DeepSeek is that the weights of the two models described above (and many other distilled versions of R1) were made publicly available. This means there are many ways to access these models, whether using an inference provider or running them locally.

Here are a few places that I’ve seen these models.

  • DeepSeek (DeepSeek-V3 and DeepSeek-R1)
  • Together (DeepSeek-V3, DeepSeek-R1, and distillations)
  • Hyperbolic (DeepSeek-V3, DeepSeek-R1-Zero, and DeepSeek-R1)
  • Ollama (local) (DeepSeek-V3, DeepSeek-R1, and distillations)
  • Hugging Face (local) (all of the above)
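
For local experimentation, here is a minimal sketch of loading one of the distilled checkpoints with Hugging Face Transformers; the exact model name, hardware settings, and generation length are assumptions you should adapt to your setup.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name: one of the smaller R1 distillations.
model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "How many prime numbers are there between 1 and 20?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# The model emits its chain of thought between <think> tags before the answer.
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))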

Conclusions

The release of o1 introduced a new dimension by which LLMs can be improved: test-time compute. Although OpenAI did not release its secret sauce for doing this, 5 months later, DeepSeek was able to replicate this reasoning behavior and publish the technical details of its approach.

While current reasoning models have limitations, this is a promising research direction because it has demonstrated that reinforcement learning (without humans) can produce models that learn independently. This (potentially) breaks the implicit limitations of current models, which can only recall and remix information previously seen on the internet (i.e., existing human knowledge).

The promise of this new RL approach is that models can surpass human understanding (on their own), leading to new scientific and technological breakthroughs that might take us decades to discover (on our own).

🗞 Get exclusive access to AI resources and project ideas: https://the-data-entrepreneurs.kit.com/shaw

🧑‍🎓 Learn AI in 6 weeks by building it: https://maven.com/shaw-talebi/ai-builders-bootcamp

References

[1] Learning to reason with LLMs

[2] arXiv:2501.12948 [cs.CL]

[3] Deep Dive into LLMs Like ChatGPT

LLM + RAG: Creating an AI-Powered File Reader Assistant https://towardsdatascience.com/llm-rag-creating-an-ai-powered-file-reader-assistant/
How to create a chatbot to answer questions about a file’s content

Introduction

AI is everywhere. 

It is hard not to interact at least once a day with a Large Language Model (LLM). The chatbots are here to stay. They’re in your apps, they help you write better, they compose emails, they read emails…well, they do a lot.

And I don’t think that that is bad. In fact, my opinion is the other way – at least so far. I defend and advocate for the use of AI in our daily lives because, let’s agree, it makes everything much easier.

I don’t have to spend time double-reading a document to find punctuation problems or typos. AI does that for me. I don’t waste time writing that follow-up email every single Monday. AI does that for me. I don’t need to read a huge and boring contract when I have an AI to summarize the main takeaways and action points for me!

These are only some of AI’s great uses. If you’d like to know more use cases of LLMs to make our lives easier, I wrote a whole book about them.

Now, thinking as a data scientist and looking at the technical side, not everything is that bright and shiny. 

LLMs are great for several general use cases that apply to anyone or any company: for example, coding, summarizing, or answering questions about general content created before the training cutoff date. However, when it comes to specific business applications, a single purpose, or something new that didn’t make the cutoff date, the models won’t be that useful if used out-of-the-box – meaning, they will not know the answer. Thus, they will need adjustments.

Training an LLM model can take months and millions of dollars. What is even worse is that if we don’t adjust and tune the model to our purpose, there will be unsatisfactory results or hallucinations (when the model’s response doesn’t make sense given our query).

So what is the solution, then? Spending a lot of money retraining the model to include our data?

Not really. That’s when the Retrieval-Augmented Generation (RAG) becomes useful.

RAG is a framework that combines getting information from an external knowledge base with large language models (LLMs). It helps AI models produce more accurate and relevant responses.

Let’s learn more about RAG next.

What is RAG?

Let me tell you a story to illustrate the concept.

I love movies. For some time in the past, I knew which movies were competing for the best movie category at the Oscars or the best actors and actresses. And I would certainly know which ones got the statue for that year. But now I am all rusty on that subject. If you asked me who was competing, I would not know. And even if I tried to answer you, I would give you a weak response. 

So, to provide you with a quality response, I will do what everybody else does: search for the information online, obtain it, and then give it to you. What I just did is the same idea as the RAG: I obtained data from an external database to give you an answer.

When we enhance the LLM with a content store where it can go and retrieve data to augment (increase) its knowledge base, that is the RAG framework in action.

RAG is like creating a content store where the model can enhance its knowledge and respond more accurately.

User prompt about Content C. LLM retrieves external content to aggregate to the answer. Image by the author.

Summarizing, the RAG framework:

  1. Uses search algorithms to query external data sources, such as databases, knowledge bases, and web pages.
  2. Pre-processes the retrieved information.
  3. Incorporates the pre-processed information into the LLM.

Why use RAG?

Now that we know what the RAG framework is, let’s understand why we should be using it.

Here are some of the benefits:

  • Enhances factual accuracy by referencing real data.
  • RAG can help LLMs process and consolidate knowledge to create more relevant answers.
  • RAG can help LLMs access additional knowledge bases, such as internal organizational data.
  • RAG can help LLMs create more accurate domain-specific content.
  • RAG can help reduce knowledge gaps and AI hallucination.

As previously explained, I like to say that with the RAG framework, we are giving the LLM an internal search engine over the content we want to add to its knowledge base.

Well. All of that is very interesting. But let’s see an application of RAG. We will learn how to create an AI-powered PDF Reader Assistant.

Project

This is an application that allows users to upload a PDF document and ask questions about its content using AI-powered natural language processing (NLP) tools. 

  • The app uses Streamlit as the front end.
  • LangChain, an OpenAI GPT model (gpt-4o-mini in the code below), and FAISS (Facebook AI Similarity Search) for document retrieval and question answering in the backend.

Let’s break down the steps for better understanding:

  1. Loading a PDF file and splitting it into chunks of text.
    1. This makes the data optimized for retrieval
  2. Present the chunks to an embedding tool.
    1. Embeddings are numerical vector representations of data used to capture relationships, similarities, and meanings in a way that machines can understand. They are widely used in Natural Language Processing (NLP), recommender systems, and search engines.
  3. Next, we put those chunks of text and embeddings in the same DB for retrieval.
  4. Finally, we make it available to the LLM.

Data preparation

Preparing a content store for the LLM will take some steps, as we just saw. So, let’s start by creating a function that can load a file and split it into text chunks for efficient retrieval.

# Imports
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_document(pdf):
    # Load a PDF
    """
    Load a PDF and split it into chunks for efficient retrieval.

    :param pdf: PDF file to load
    :return: List of chunks of text
    """

    loader = PyPDFLoader(pdf)
    docs = loader.load()

    # Instantiate the text splitter with a chunk size of 500 characters and an overlap of 100 characters so that context is not lost
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    # Split into chunks for efficient retrieval
    chunks = text_splitter.split_documents(docs)

    # Return
    return chunks

Next, we will start building our Streamlit app, and we’ll use that function in the next script.

Web application

We will begin importing the necessary modules in Python. Most of those will come from the langchain packages.

FAISS is used for document retrieval; OpenAIEmbeddings transforms the text chunks into numerical vectors so that similarity can be calculated; ChatOpenAI is what enables us to interact with the OpenAI API; create_retrieval_chain is what actually performs the RAG step, retrieving the relevant chunks and augmenting the LLM with that data; create_stuff_documents_chain glues the model and the ChatPromptTemplate together.

Note: You will need to generate an OpenAI key to be able to run this script. If it’s the first time you’re creating your account, you get some free credits. But if you have had it for some time, it is possible that you will have to add 5 dollars in credits to be able to access OpenAI’s API. An alternative is using Hugging Face’s embeddings. 

# Imports
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.chains import create_retrieval_chain
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from scripts.secret import OPENAI_KEY
from scripts.document_loader import load_document
import streamlit as st

This first code snippet will create the App title, create a box for file upload, and prepare the file to be added to the load_document() function.

# Create a Streamlit app
st.title("AI-Powered Document Q&A")

# Load document to streamlit
uploaded_file = st.file_uploader("Upload a PDF file", type="pdf")

# If a file is uploaded, create the TextSplitter and vector database
if uploaded_file:

    # Code to work around document loader from Streamlit and make it readable by langchain
    temp_file = "./temp.pdf"
    with open(temp_file, "wb") as file:
        file.write(uploaded_file.getvalue())
        file_name = uploaded_file.name

    # Load document and split it into chunks for efficient retrieval.
    chunks = load_document(temp_file)

    # Message user that document is being processed with time emoji
    st.write("Processing document... :watch:")

Machines understand numbers better than text, so in the end, we will have to provide the model with a database of numbers that it can compare and check for similarity when performing a query. That’s where the embeddings will be useful to create the vector_db, in this next piece of code.

# Generate embeddings
    # Embeddings are numerical vector representations of data, typically used to capture relationships, similarities,
    # and meanings in a way that machines can understand. They are widely used in Natural Language Processing (NLP),
    # recommender systems, and search engines.
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_KEY,
                                  model="text-embedding-ada-002")

    # Can also use HuggingFaceEmbeddings
    # from langchain_huggingface.embeddings import HuggingFaceEmbeddings
    # embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # Create vector database containing chunks and embeddings
    vector_db = FAISS.from_documents(chunks, embeddings)

Next, we create a retriever object to navigate in the vector_db.

# Create a document retriever
    retriever = vector_db.as_retriever()
    llm = ChatOpenAI(model_name="gpt-4o-mini", openai_api_key=OPENAI_KEY)

Then, we will create the system_prompt, which is a set of instructions to the LLM on how to answer, and we will create a prompt template, preparing it to be added to the model once we get the input from the user.

# Create a system prompt
    # It sets the overall context for the model.
    # It influences tone, style, and focus before user interaction starts.
    # Unlike user inputs, a system prompt is not visible to the end user.

    system_prompt = (
        "You are a helpful assistant. Use the given context to answer the question."
        "If you don't know the answer, say you don't know. "
        "{context}"
    )

    # Create a prompt Template
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            ("human", "{input}"),
        ]
    )

    # Create a chain
    # It creates a StuffDocumentsChain, which takes multiple documents (text data) and "stuffs" them together before passing them to the LLM for processing.

    question_answer_chain = create_stuff_documents_chain(llm, prompt)

Moving on, we create the core of the RAG framework, pasting together the retriever object and the prompt. This object adds relevant documents from a data source (e.g., a vector database) and makes it ready to be processed using an LLM to generate a response.

    # Create the RAG chain
    chain = create_retrieval_chain(retriever, question_answer_chain)

Finally, we create the variable question for the user input. If this question box is filled with a query, we pass it to the chain, which calls the LLM to process and return the response, which will be printed on the app’s screen.

# Streamlit input for question
    question = st.text_input("Ask a question about the document:")
    if question:
        # Answer
        response = chain.invoke({"input": question})['answer']
        st.write(response)

Here is a screenshot of the result.

Screenshot of the final app. Image by the author.

And this is a GIF for you to see the File Reader Ai Assistant in action!

File Reader AI Assistant in action. Image by the author.

Before you go

In this project, we learned what the RAG framework is and how it helps the LLM perform better and handle domain-specific knowledge.

AI can be powered with knowledge from an instruction manual, a company’s databases, finance files, or contracts, and then respond accurately to domain-specific content queries; the knowledge base is augmented with a content store instead of the model being retrained.

To recap, this is how the framework works:

1️⃣ User Query → Input text is received.

2️⃣ Retrieve Relevant Documents → Searches a knowledge base (e.g., a database, vector store).

3️⃣ Augment Context → Retrieved documents are added to the input.

4️⃣ Generate Response → An LLM processes the combined input and produces an answer.

GitHub repository

https://github.com/gurezende/Basic-Rag

About me

If you liked this content and want to learn more about my work, here is my website, where you can also find all my contacts.

https://gustavorsantos.me


How LLMs Work: Reinforcement Learning, RLHF, DeepSeek R1, OpenAI o1, AlphaGo https://towardsdatascience.com/how-llms-work-reinforcement-learning-rlhf-deepseek-r1-openai-o1-alphago/
Part 2 of the LLM deep dive

Welcome to part 2 of my LLM deep dive. If you’ve not read Part 1, I highly encourage you to check it out first.

Previously, we covered the first two major stages of training an LLM:

  1. Pre-training — Learning from massive datasets to form a base model.
  2. Supervised fine-tuning (SFT) — Refining the model with curated examples to make it useful.

Now, we’re diving into the next major stage: Reinforcement Learning (RL). While pre-training and SFT are well-established, RL is still evolving but has become a critical part of the training pipeline.

I’ve drawn heavily on Andrej Karpathy’s widely popular 3.5-hour YouTube video. Andrej is a founding member of OpenAI, and his insights are gold — you get the idea.

Let’s go 🚀

What’s the purpose of reinforcement learning (RL)?

Humans and LLMs process information differently. What’s intuitive for us — like basic arithmetic — may not be for an LLM, which only sees text as sequences of tokens. Conversely, an LLM can generate expert-level responses on complex topics simply because it has seen enough examples during training.

This difference in cognition makes it challenging for human annotators to provide the “perfect” set of labels that consistently guide an LLM toward the right answer.

RL bridges this gap by allowing the model to learn from its own experience.

Instead of relying solely on explicit labels, the model explores different token sequences and receives feedback — reward signals — on which outputs are most useful. Over time, it learns to align better with human intent.

Intuition behind RL

LLMs are stochastic — meaning their responses aren’t fixed. Even with the same prompt, the output varies because it’s sampled from a probability distribution.

We can harness this randomness by generating thousands or even millions of possible responses in parallel. Think of it as the model exploring different paths — some good, some bad. Our goal is to encourage it to take the better paths more often.

To do this, we train the model on the sequences of tokens that lead to better outcomes. Unlike supervised fine-tuning, where human experts provide labeled data, reinforcement learning allows the model to learn from itself.

The model discovers which responses work best, and after each training step, we update its parameters. Over time, this makes the model more likely to produce high-quality answers when given similar prompts in the future.
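
Here is a toy, stateless sketch (not from the article) of that loop: sample many candidate answers, reward the good ones, and shift probability mass toward them. A real LLM update would be a gradient step on token probabilities rather than this direct table update, and the candidate answers below are placeholders.

import random

# Toy "policy": a probability distribution over candidate answers to one prompt.
policy = {"4": 0.25, "5": 0.25, "22": 0.25, "3": 0.25}
correct_answer = "4"  # stand-in for whatever the reward signal checks
learning_rate = 0.1

def sample_response():
    return random.choices(list(policy), weights=list(policy.values()), k=1)[0]

def reward(response):
    return 1.0 if response == correct_answer else 0.0

for step in range(200):
    # 1. Explore: sample a batch of candidate responses for the same prompt.
    batch = [sample_response() for _ in range(16)]
    # 2. Score each response and keep the ones that earned a reward.
    rewarded = [r for r in batch if reward(r) > 0]
    # 3. Reinforce: shift probability mass toward the rewarded responses.
    for r in rewarded:
        policy[r] += learning_rate
    total = sum(policy.values())
    policy = {answer: p / total for answer, p in policy.items()}

print(policy)  # probability mass concentrates on the rewarded answer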

But how do we determine which responses are best? And how much RL should we do? The details are tricky, and getting them right is not trivial.

RL is not “new” — It can surpass human expertise (AlphaGo, 2016)

A great example of RL’s power is DeepMind’s AlphaGo, the first AI to defeat a professional Go player and later surpass human-level play.

In the 2016 Nature paper (graph below), when a model was trained purely by SFT (giving the model tons of good examples to imitate), the model was able to reach human-level performance, but never surpass it.

The dotted line represents Lee Sedol’s performance — the best Go player in the world.

This is because SFT is about replication, not innovation — it doesn’t allow the model to discover new strategies beyond human knowledge.

However, RL enabled AlphaGo to play against itself, refine its strategies, and ultimately exceed human expertise (blue line).

Image taken from AlphaGo 2016 paper

RL represents an exciting frontier in AI — where models can explore strategies beyond human imagination when we train them on a diverse and challenging pool of problems to refine their thinking strategies.

RL foundations recap

Let’s quickly recap the key components of a typical RL setup:

Image by author
  • Agent — The learner or decision maker. It observes the current situation (state), chooses an action, and then updates its behaviour based on the outcome (reward).
  • Environment — The external system in which the agent operates.
  • State — A snapshot of the environment at a given step t.

At each timestep, the agent performs an action in the environment, which changes the environment’s state to a new one. The agent also receives feedback indicating how good or bad the action was.

This feedback is called a reward, and is represented in a numerical form. A positive reward encourages that behaviour, and a negative reward discourages it.

By using feedback from different states and actions, the agent gradually learns the optimal strategy to maximise the total reward over time.

Policy

The policy is the agent’s strategy. If the agent follows a good policy, it will consistently make good decisions, leading to higher rewards over many steps.

In mathematical terms, it is a function that determines the probability of different outputs for a given state — (πθ(a|s)).

Value function

An estimate of how good it is to be in a certain state, considering the long-term expected reward. For an LLM, the reward might come from human feedback or a reward model. 

Actor-Critic architecture

It is a popular RL setup that combines two components:

  1. Actor — Learns and updates the policy (πθ), deciding which action to take in each state.
  2. Critic — Evaluates the value function (V(s)) to give feedback to the actor on whether its chosen actions are leading to good outcomes. 

How it works:

  • The actor picks an action based on its current policy.
  • The critic evaluates the outcome (reward + next state) and updates its value estimate.
  • The critic’s feedback helps the actor refine its policy so that future actions lead to higher rewards.

Putting it all together for LLMs

The state can be the current text (prompt or conversation), and the action can be the next token to generate. A reward model (e.g., human feedback) tells the model how good or bad its generated text is. 

The policy is the model’s strategy for picking the next token, while the value function estimates how beneficial the current text context is, in terms of eventually producing high quality responses.

DeepSeek-R1 (published 22 Jan 2025)

To highlight RL’s importance, let’s explore DeepSeek-R1, a reasoning model achieving top-tier performance while remaining open-source. The paper introduced two models: DeepSeek-R1-Zero and DeepSeek-R1.

  • DeepSeek-R1-Zero was trained solely via large-scale RL, skipping supervised fine-tuning (SFT).
  • DeepSeek-R1 builds on it, addressing encountered challenges.

Let’s dive into some of these key points. 

1. RL algo: Group Relative Policy Optimisation (GRPO)

One key game-changing RL algorithm is Group Relative Policy Optimisation (GRPO), a variant of the widely popular Proximal Policy Optimisation (PPO). GRPO was introduced in the DeepSeekMath paper in Feb 2024. 

Why GRPO over PPO?

PPO struggles with reasoning tasks due to:

  1. Dependency on a critic model.
    PPO needs a separate critic model, effectively doubling memory and compute.
    Training the critic can be complex for nuanced or subjective tasks.
  2. High computational cost as RL pipelines demand substantial resources to evaluate and optimise responses. 
  3. Absolute reward evaluations
    When you rely on an absolute reward — meaning there’s a single standard or metric to judge whether an answer is “good” or “bad” — it can be hard to capture the nuances of open-ended, diverse tasks across different reasoning domains. 

How GRPO addressed these challenges:

GRPO eliminates the critic model by using relative evaluation — responses are compared within a group rather than judged by a fixed standard.

Imagine students solving a problem. Instead of a teacher grading them individually, they compare answers, learning from each other. Over time, performance converges toward higher quality.

How does GRPO fit into the whole training process?

GRPO modifies how loss is calculated while keeping other training steps unchanged:

  1. Gather data (queries + responses)
    – For LLMs, queries are like questions
    – The old policy (older snapshot of the model) generates several candidate answers for each query
  2. Assign rewards — each response in the group is scored (the “reward”).
  3. Compute the GRPO loss
    Traditionally, you’ll compute a loss — which shows the deviation between the model prediction and the true label.
    In GRPO, however, you measure:
    a) How likely is the new policy to produce past responses?
    b) Are those responses relatively better or worse?
    c) Apply clipping to prevent extreme updates.
    This yields a scalar loss.
  4. Back propagation + gradient descent
    – Back propagation calculates how each parameter contributed to loss
    – Gradient descent updates those parameters to reduce the loss
    – Over many iterations, this gradually shifts the new policy to prefer higher reward responses
  5. Update the old policy occasionally to match the new policy.
    This refreshes the baseline for the next round of comparisons.

2. Chain of thought (CoT)

Traditional LLM training follows pre-training → SFT → RL. However, DeepSeek-R1-Zero skipped SFT, allowing the model to directly explore CoT reasoning.

Like humans thinking through a tough question, CoT enables models to break problems into intermediate steps, boosting complex reasoning capabilities. OpenAI’s o1 model also leverages this, as noted in its September 2024 report: o1’s performance improves with more RL (train-time compute) and more reasoning time (test-time compute).

DeepSeek-R1-Zero exhibited reflective tendencies, autonomously refining its reasoning. 

A key graph (below) in the paper showed increased thinking during training, leading to longer (more tokens), more detailed and better responses.

Image taken from DeepSeek-R1 paper

Without explicit programming, it began revisiting past reasoning steps, improving accuracy. This highlights chain-of-thought reasoning as an emergent property of RL training.

The model also had an “aha moment” (below) — a fascinating example of how RL can lead to unexpected and sophisticated outcomes.

Image taken from DeepSeek-R1 paper

Note: Unlike DeepSeek-R1, OpenAI does not show full exact reasoning chains of thought in o1 as they are concerned about a distillation risk — where someone comes in and tries to imitate those reasoning traces and recover a lot of the reasoning performance by just imitating. Instead, o1 shows just summaries of these chains of thought.

Reinforcement learning with Human Feedback (RLHF)

For tasks with verifiable outputs (e.g., math problems, factual Q&A), AI responses can be easily evaluated. But what about areas like summarisation or creative writing, where there’s no single “correct” answer? 

This is where human feedback comes in — but naïve RL approaches are unscalable.

Image by author

Let’s look at the naive approach with some arbitrary numbers: say 1,000 prompts, 1,000 candidate responses per prompt, and 1,000 training updates, with every response needing a human score.

Image by author

That’s one billion human evaluations needed! This is too costly, slow and unscalable. Hence, a smarter solution is to train an AI “reward model” to learn human preferences, dramatically reducing human effort. 

Ranking responses is also easier and more intuitive than absolute scoring.

Image by author
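
As a concrete illustration (not from the article), reward models are typically trained on pairs of ranked responses with a pairwise loss that pushes the score of the preferred response above the rejected one. The scores below are toy stand-ins for real reward-model outputs:

import torch

# Toy stand-ins: scalar scores the reward model assigned to each response in a
# batch of (preferred, rejected) pairs labelled by human rankers.
reward_chosen = torch.tensor([1.2, 0.3, 2.1])
reward_rejected = torch.tensor([0.4, 0.5, -0.7])

# Pairwise ranking loss: maximise the probability that the preferred response
# scores higher than the rejected one.
loss = -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()
print(loss)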

Upsides of RLHF

  • Can be applied to any domain, including creative writing, poetry, summarisation, and other open-ended tasks.
  • Ranking outputs is much easier for human labellers than generating creative outputs themselves.

Downsides of RLHF

  • The reward model is an approximation — it may not perfectly reflect human preferences.
  • RL is good at gaming the reward model — if run for too long, the model might exploit loopholes, generating nonsensical outputs that still get high scores.

Do note that RLHF is not the same as traditional RL.

For empirical, verifiable domains (e.g. math, coding), RL can run indefinitely and discover novel strategies. RLHF, on the other hand, is more like a fine-tuning step to align models with human preferences.

Conclusion

And that’s a wrap! I hope you enjoyed Part 2 🙂 If you haven’t already read Part 1 — do check it out here.

Got questions or ideas for what I should cover next? Drop them in the comments — I’d love to hear your thoughts. See you in the next article!

Enhancing RAG: Beyond Vanilla Approaches https://towardsdatascience.com/enhancing-rag-beyond-vanilla-approaches/

Retrieval-Augmented Generation (RAG) is a powerful technique that enhances language models by incorporating external information retrieval mechanisms. While standard RAG implementations improve response relevance, they often struggle in complex retrieval scenarios. This article explores the limitations of a vanilla RAG setup and introduces advanced techniques to enhance its accuracy and efficiency.

The Challenge with Vanilla RAG

To illustrate RAG’s limitations, consider a simple experiment where we attempt to retrieve relevant information from a set of documents. Our dataset includes:

  • A primary document discussing best practices for staying healthy, productive, and in good shape.
  • Two additional documents on unrelated topics that contain some similar words used in different contexts.

main_document_text = """
Morning Routine (5:30 AM - 9:00 AM)
✅ Wake Up Early - Aim for 6-8 hours of sleep to feel well-rested.
✅ Hydrate First - Drink a glass of water to rehydrate your body.
✅ Morning Stretch or Light Exercise - Do 5-10 minutes of stretching or a short workout to activate your body.
✅ Mindfulness or Meditation - Spend 5-10 minutes practicing mindfulness or deep breathing.
✅ Healthy Breakfast - Eat a balanced meal with protein, healthy fats, and fiber.
✅ Plan Your Day - Set goals, review your schedule, and prioritize tasks.
...
"""

Using a standard RAG setup, we query the system with:

  1. What should I do to stay healthy and productive?
  2. What are the best practices to stay healthy and productive?

Helper Functions

To enhance retrieval accuracy and streamline query processing, we implement a set of essential helper functions. These functions serve various purposes, from querying the ChatGPT API to computing document embeddings and similarity scores. By leveraging these functions, we create a more efficient RAG pipeline that effectively retrieves the most relevant information for user queries.

To support our RAG improvements, we define the following helper functions:

# Imports
import os
import json
import openai
import numpy as np
from scipy.spatial.distance import cosine
from google.colab import userdata

# Set up OpenAI API key and create the client used by the helper functions below
os.environ["OPENAI_API_KEY"] = userdata.get('AiTeam')
client = openai.OpenAI()

def query_chatgpt(prompt, model="gpt-4o", response_format=openai.NOT_GIVEN):
    try:
        response = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.0,  # Adjust for more or less creativity
            response_format=response_format
        )
        return response.choices[0].message.content.strip()
    except Exception as e:
        return f"Error: {e}"
def get_embedding(text, model="text-embedding-3-large"): #"text-embedding-ada-002"
    """Fetches the embedding for a given text using OpenAI's API."""
    response = client.embeddings.create(
        input=[text],
        model=model
    )
    return response.data[0].embedding
def compute_similarity_metrics(embed1, embed2):
    """Computes the cosine similarity between two embeddings."""
    cosine_sim = 1 - cosine(embed1, embed2)  # Cosine similarity

    return cosine_sim
def fetch_similar_docs(query, docs, threshold = .55, top=1):
  query_em = get_embedding(query)
  data = []
  for d in docs:
    # Compute and print similarity metrics
    similarity_results = compute_similarity_metrics(d["embedding"], query_em)
    if(similarity_results >= threshold):
      data.append({"id":d["id"], "ref_doc":d.get("ref_doc", ""), "score":similarity_results})

  # Sort by similarity score in descending order
  sorted_data = sorted(data, key=lambda x: x["score"], reverse=True)
  sorted_data = sorted_data[:min(top, len(sorted_data))]
  return sorted_data

Evaluating the Vanilla RAG

To evaluate the effectiveness of a vanilla RAG setup, we conduct a simple test using predefined queries. Our goal is to determine whether the system retrieves the most relevant document based on semantic similarity. We then analyze the limitations and explore possible improvements.
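
The construction of the docs collection itself is not shown in the article; here is a minimal sketch, assuming each entry is a dict with an id, a reference to its source document, and a precomputed embedding. The two unrelated documents are placeholders:

# Hypothetical unrelated documents sharing some vocabulary with the main one.
doc2_text = """A productive workout schedule: healthy routines for your dog..."""
doc3_text = """Healthy codebases stay productive: morning stand-ups, code reviews..."""

raw_docs = {
    "main_document_text": main_document_text,
    "doc2_text": doc2_text,
    "doc3_text": doc3_text,
}

# Embed each document once and store it for retrieval.
docs = [
    {"id": doc_id, "ref_doc": doc_id, "embedding": get_embedding(text)}
    for doc_id, text in raw_docs.items()
]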

"""# **Testing Vanilla RAG**"""

query = "what should I do to stay healthy and productive?"
r = fetch_similar_docs(query, docs)
print("query = ", query)
print("documents = ", r)

query = "what are the best practices to stay healthy and productive ?"
r = fetch_similar_docs(query, docs)
print("query = ", query)
print("documents = ", r)

Advanced Techniques for Improved RAG

To further refine the retrieval process, we introduce advanced functions that enhance the capabilities of our RAG system. These functions generate structured information that aids in document retrieval and query processing, making our system more robust and context-aware.

To address these challenges, we implement three key enhancements:

1. Generating FAQs

By automatically creating a list of frequently asked questions related to a document, we expand the range of potential queries the model can match. These FAQs are generated once and stored alongside the document, providing a richer search space without incurring ongoing costs.

def generate_faq(text):
  prompt = f'''
  given the following text: """{text}"""
  Ask relevant simple atomic questions ONLY (don't answer them) to cover all subjects covered by the text. Return the result as a json list example [q1, q2, q3...]
  '''
  return query_chatgpt(prompt, response_format={ "type": "json_object" })

2. Creating an Overview

A high-level summary of the document helps capture its core ideas, making retrieval more effective. By embedding the overview alongside the document, we provide additional entry points for relevant queries, improving match rates.

def generate_overview(text):
  prompt = f'''
  given the following text: """{text}"""
  Generate an abstract for it that tells in maximum 3 lines what it is about and use high level terms that will capture the main points,
  Use terms and words that will be most likely used by average person.
  '''
  return query_chatgpt(prompt)

3. Query Decomposition

Instead of searching with broad user queries, we break them down into smaller, more precise sub-queries. Each sub-query is then compared against our enhanced document collection, which now includes:

  • The original document
  • The generated FAQs
  • The generated overview

By merging the retrieval results from these multiple sources, we significantly improve the likelihood of finding relevant information.

def decompose_query(query):
  prompt = f'''
  Given the user query: """{query}"""
break it down into smaller, relevant subqueries
that can retrieve the best information for answering the original query.
Return them as a ranked json list example [q1, q2, q3...].
'''
  return query_chatgpt(prompt, response_format={ "type": "json_object" })

Evaluating the Improved RAG

After implementing these techniques, we re-run our initial queries. This time, query decomposition generates several sub-queries, each focusing on different aspects of the original question. As a result, our system successfully retrieves relevant information from both the FAQ and the original document, demonstrating a substantial improvement over the vanilla RAG approach.

"""# **Testing Advanced Functions**"""

## Generate overview of the document
overview_text = generate_overview(main_document_text)
print(overview_text)
# generate embedding
docs.append({"id":"overview_text", "ref_doc": "main_document_text", "embedding":get_embedding(overview_text)})


## Generate FAQ for the document
main_doc_faq_arr = generate_faq(main_document_text)
print(main_doc_faq_arr)
faq = json.loads(main_doc_faq_arr)["questions"]

for i, f in enumerate(faq):
  docs.append({"id": f"main_doc_faq_{i}", "ref_doc": "main_document_text", "embedding":  get_embedding(f)})


## Decompose the 1st query
query = "what should I do to stay healty and productive?"
subqueries = decompose_query(query)
print(subqueries)




subqueries_list = json.loads(subqueries)['subqueries']


## compute the similarities between the subqueries and documents, including FAQ
for subq in subqueries_list:
  print("query = ", subq)
  r = fetch_similar_docs(subq, docs, threshold=.55, top=2)
  print(r)
  print('=================================\n')


## Decompose the 2nd query
query = "what the best practices to stay healty and productive?"
subqueries = decompose_query(query)
print(subqueries)

subqueries_list = json.loads(subqueries)['subqueries']


## compute the similarities between the subqueries and documents, including FAQ
for subq in subqueries_list:
  print("query = ", subq)
  r = fetch_similar_docs(subq, docs, threshold=.55, top=2)
  print(r)
  print('=================================\n')

Here are some of the FAQ that were generated:

{
  "questions": [
    "How many hours of sleep are recommended to feel well-rested?",
    "How long should you spend on morning stretching or light exercise?",
    "What is the recommended duration for mindfulness or meditation in the morning?",
    "What should a healthy breakfast include?",
    "What should you do to plan your day effectively?",
    "How can you minimize distractions during work?",
    "How often should you take breaks during work/study productivity time?",
    "What should a healthy lunch consist of?",
    "What activities are recommended for afternoon productivity?",
    "Why is it important to move around every hour in the afternoon?",
    "What types of physical activities are suggested for the evening routine?",
    "What should a nutritious dinner include?",
    "What activities can help you reflect and unwind in the evening?",
    "What should you do to prepare for sleep?",
    …
  ]
}

Cost-Benefit Analysis

While these enhancements introduce an upfront processing cost—generating FAQs, overviews, and embeddings—this is a one-time cost per document. In contrast, a poorly optimized RAG system would lead to two major inefficiencies:

  1. Frustrated users due to low-quality retrieval.
  2. Increased query costs from retrieving excessive, loosely related documents.

For systems handling high query volumes, these inefficiencies compound quickly, making preprocessing a worthwhile investment.

Conclusion

By integrating document preprocessing (FAQs and overviews) with query decomposition, we create a more intelligent RAG system that balances accuracy and cost-effectiveness. This approach enhances retrieval quality, reduces irrelevant results, and ensures a better user experience.

As RAG continues to evolve, these techniques will be instrumental in refining AI-driven retrieval systems. Future research may explore further optimizations, including dynamic thresholding and reinforcement learning for query refinement.

6 Common LLM Customization Strategies Briefly Explained https://towardsdatascience.com/6-common-llm-customization-strategies-briefly-explained/
From theory to practice: understanding RAG, agents, fine-tuning, and more

Why Customize LLMs?

Large Language Models (LLMs) are deep learning models pre-trained with self-supervised learning, requiring vast amounts of training data, training time, and a large number of parameters. LLMs have revolutionized natural language processing, especially in the last 2 years, demonstrating remarkable capabilities in understanding and generating human-like text. However, these general-purpose models’ out-of-the-box performance may not always meet specific business needs or domain requirements. LLMs alone cannot answer questions that rely on proprietary company data or closed-book settings, making them relatively generic in their applications. Training an LLM from scratch is largely infeasible for small to medium teams due to the demand for massive amounts of training data and resources. Therefore, a wide range of LLM customization strategies have been developed in recent years to tune the models for various scenarios that require specialized knowledge.

The customization strategies can be broadly split into two types:

  • Using a frozen model: These techniques don’t necessitate updating model parameters and are typically accomplished through in-context learning or prompt engineering. They are cost-effective since they alter the model’s behavior without incurring extensive training costs, and are therefore widely explored in both industry and academia, with new research papers published on a daily basis.
  • Updating model parameters: This is a relatively resource-intensive approach that requires tuning a pre-trained LLM using custom datasets designed for the intended purpose. This includes popular techniques like Fine-Tuning and Reinforcement Learning from Human Feedback (RLHF).

These two broad customization paradigms branch out into various specialized techniques including LoRA fine-tuning, Chain of Thought, Retrieval Augmented Generation, ReAct, and Agent frameworks. Each technique offers distinct advantages and trade-offs regarding computational resources, implementation complexity, and performance improvements.

This article is also available as a video here

How to Choose LLMs?

The first step of customizing LLMs is to select an appropriate foundation model as the baseline. Community-based platforms, e.g., Hugging Face, offer a wide range of open-source pre-trained models contributed by top companies or communities, such as the Llama series from Meta and Gemma from Google. Hugging Face additionally provides leaderboards, for example the “Open LLM Leaderboard”, to compare LLMs based on industry-standard metrics and tasks (e.g., MMLU). Cloud providers (e.g., AWS) and AI companies (e.g., OpenAI and Anthropic) also offer access to proprietary models that are typically paid services with restricted access. The following factors are essential to consider when choosing LLMs.

Open source or proprietary model: Open-source models allow full customization and self-hosting but require technical expertise, while proprietary models offer immediate access and often better-quality responses but with higher costs.

Task and metrics: Models excel at different tasks including question-answering, summarization, code generation etc. Compare benchmark metrics and test on domain-specific tasks to determine the appropriate models.

Architecture: In general, decoder-only models (GPT series) perform better at text generation, while encoder-decoder models (T5) handle translation well. More architectures are emerging and showing promising results, for instance the Mixture of Experts (MoE) model “DeepSeek”.

Number of Parameters and Size: Larger models (70B-175B parameters) offer better performance but need more computing power. Smaller models (7B-13B) run faster and cheaper but may have reduced capabilities.

After determining a base LLM, let’s explore the 6 most common strategies for LLM customization, ranked in order of resource consumption from the least to the most intensive:

  • Prompt Engineering
  • Decoding and Sampling Strategy
  • Retrieval Augmented Generation
  • Agent
  • Fine Tuning
  • Reinforcement Learning from Human Feedback

If you’d prefer a video walkthrough of these concepts, please check out my video on “6 Common LLM Customization Strategies Briefly Explained”.

LLM Customization Techniques

1. Prompt Engineering

A prompt is the input text sent to an LLM to elicit an AI-generated response, and it can be composed of instructions, context, input data, and an output indicator.

Instructions: This provides a task description or instruction for how the model should perform.

Context: This is external information to guide the model to respond within a certain scope.

Input data: This is the input for which you want a response.

Output indicator: This specifies the output type or format.

Prompt Engineering involves crafting these prompt components strategically to shape and control the model's response. Basic prompt engineering techniques include zero-shot, one-shot, and few-shot prompting. Users can apply these techniques directly while interacting with the LLM, making it an efficient approach to align the model's behavior with a novel objective. API implementation is also an option, and more details are introduced in my previous article "A Simple Pipeline for Integrating LLM Prompt with Knowledge Graph".
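
To make these components concrete, here is a minimal sketch (the task, wording, and examples are all illustrative, not taken from any particular system) showing how an instruction, context, a few examples, the input data, and an output indicator can be assembled into a single few-shot prompt string:

# Illustrative prompt assembly; the task and examples are made up for demonstration.
instruction = "Classify the sentiment of the customer review as Positive or Negative."
context = "Reviews come from an online electronics store."
few_shot_examples = [
    ("The headphones broke after two days.", "Negative"),
    ("Battery life is fantastic and setup was easy.", "Positive"),
]
input_data = "The screen is bright but the speakers sound muffled."
output_indicator = "Sentiment:"

examples_block = "\n".join(f"Review: {r}\nSentiment: {s}" for r, s in few_shot_examples)
prompt = f"{instruction}\n{context}\n\n{examples_block}\n\nReview: {input_data}\n{output_indicator}"
print(prompt)  # send this string to the LLM of your choice

Dropping the examples_block turns the same template into a zero-shot prompt.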

Due to the efficiency and effectiveness of prompt engineering, more complex approaches are explored and developed to advance the logical structure of prompts.

Chain of Thought (CoT) asks LLMs to break down complex reasoning tasks into step-by-step thought processes, improving performance on multi-step problems. Each step explicitly exposes its reasoning outcome which serves as the precursor context of its subsequent steps until arriving at the answer.

Tree of thoughts extends from CoT by considering multiple different reasoning branches and self-evaluating choices to decide the next best action. It is more effective for tasks that involve initial decisions, strategies for the future and exploration of multiple solutions.

Automatic reasoning and tool use (ART) builds upon the CoT process: it deconstructs complex tasks and allows the model to select few-shot examples from a task library while using predefined external tools like search and code generation.

Synergizing reasoning and acting (ReAct) combines reasoning trajectories with an action space, where the model searches through the action space and determines the next best action based on environmental observations.

Techniques like CoT and ReAct are often combined with an Agentic workflow to strengthen its capability. These techniques will be introduced in more detail in the “Agent” section.

Further Reading

2. Decoding and Sampling Strategy

Decoding strategy can be controlled at model inference time through inference parameters (e.g. temperature, top p, top k), determining the randomness and diversity of model responses. Greedy search, beam search, and sampling are three common decoding strategies for auto-regressive generation.

During the autoregressive generation process, the LLM outputs one token at a time based on a probability distribution over candidate tokens conditioned on the previous tokens. By default, greedy search is applied to produce the next token with the highest probability.
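
To make this concrete, below is a minimal from-scratch sketch of greedy decoding with the transformers library (the model name and prompt are placeholders); instead of calling model.generate(), it repeatedly picks the single highest-probability next token:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model used for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

input_ids = tokenizer("The weather today is", return_tensors="pt").input_ids
for _ in range(20):
    with torch.no_grad():
        logits = model(input_ids).logits              # shape: (batch, seq_len, vocab_size)
    next_id = logits[:, -1, :].argmax(dim=-1)         # greedy: take the highest-probability token
    input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)

print(tokenizer.decode(input_ids[0]))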

In contrast, beam search decoding considers multiple hypotheses of next-best tokens and selects the hypothesis with the highest combined probability across all tokens in the text sequence. The code snippet below uses the transformers library to specify the number of beam paths (e.g. num_beams=5 considers 5 distinct hypotheses) during the model generation process.

from transformers import AutoModelForCausalLM, AutoTokenizer

# model_name and prompt are placeholders, e.g. model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
inputs = tokenizer(prompt, return_tensors="pt")

model = AutoModelForCausalLM.from_pretrained(model_name)
# num_beams=5 keeps five candidate sequences (beams) alive at each generation step
outputs = model.generate(**inputs, num_beams=5)

Sampling strategy is the third approach to control the randomness of model responses by adjusting these inference parameters:

  • Temperature: Lowering the temperature makes the probability distribution sharper by increasing the likelihood of generating high-probability words and decreasing the likelihood of generating low-probability words. When temperature = 0, sampling becomes equivalent to greedy search (least creative); higher temperatures (e.g. 1 and above) flatten the distribution and produce more diverse, creative outputs. A minimal sketch after this list shows how temperature and top-k act on the raw logits.
  • Top K sampling: This method filters the K most probable next tokens and redistributes the probability among those tokens. The model then samples from this filtered set of tokens.
  • Top P sampling: Instead of sampling from the K most probable tokens, top-p sampling selects from the smallest possible set of tokens whose cumulative probability exceeds the threshold p.
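
As a rough illustration of how these parameters act on the model's raw scores, here is a minimal NumPy sketch (not how any particular library implements it) that applies temperature scaling and top-k filtering before sampling:

import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=50):
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-8)  # sharpen or flatten the scores
    top_idx = np.argsort(logits)[-top_k:]          # keep only the k highest-scoring tokens
    masked = np.full_like(logits, -np.inf)
    masked[top_idx] = logits[top_idx]
    probs = np.exp(masked - masked.max())          # softmax over the surviving tokens
    probs /= probs.sum()
    return np.random.choice(len(logits), p=probs)  # sample one token ID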

The example code snippet below samples from the top 50 most likely tokens (top_k=50), restricted to the smallest set whose cumulative probability exceeds 0.95 (top_p=0.95).

sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
) 

Further Reading

3. RAG

Retrieval Augmented Generation (or RAG), initially introduced in the paper "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", has been demonstrated as a promising solution that integrates external knowledge and reduces common LLM "hallucination" issues when handling domain-specific or specialized queries. RAG allows dynamically pulling relevant information from the knowledge base and generally does not involve extensive training to update LLM parameters, making it a cost-effective strategy to adapt a general-purpose LLM to a specialized domain.

A RAG system can be decomposed into a retrieval stage and a generation stage. The objective of the retrieval process is to find contents within the knowledge base that are closely related to the user query, by chunking the external knowledge, creating embeddings, indexing, and performing similarity search.

  1. Chunking: Documents are divided into smaller segments, with each segment containing a distinct unit of information.
  2. Create embeddings: An embedding model compresses each information chunk into a vector representation. The user query is also converted into its vector representation through the same vectorization process, so that the user query can be compared in the same dimensional space.
  3. Indexing: This process stores these text chunks and their vector embeddings as key-value pairs, enabling efficient and scalable search functionality. For large external knowledge bases that exceed memory capacity, vector databases offer efficient long-term storage.
  4. Similarity search: Similarity scores between the query embeddings and text chunk embeddings are calculated, which are used for searching information highly relevant to the user query.

The generation stage of the RAG system then combines the retrieved information with the user query to form the augmented query, which is passed to the LLM to generate the context-rich response.
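
Before turning to the llamaindex snippet in the next section, here is a minimal from-scratch sketch of the retrieval and augmentation steps (the embedding model and the chunk store are assumed to exist elsewhere); it simply ranks chunks by cosine similarity and prepends the winners to the user query:

import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# chunks is assumed to be a list of (chunk_text, embedding_vector) pairs
def retrieve(query_embedding, chunks, top_k=3):
    scored = sorted(chunks, key=lambda c: cosine_similarity(query_embedding, c[1]), reverse=True)
    return [text for text, _ in scored[:top_k]]

def build_augmented_query(user_query, retrieved_chunks):
    context = "\n\n".join(retrieved_chunks)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {user_query}"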

Code Snippet

The code snippet below first specifies the LLM and embedding model, then loads the external knowledge base documents and merges them into a single document. It creates an index from that document, defines a query_engine based on the index, and queries the query_engine with the user prompt.

from llama_index.core import VectorStoreIndex, Document, Settings, SimpleDirectoryReader
from llama_index.llms.openai import OpenAI

# Configure the LLM and the embedding model
Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model = "local:BAAI/bge-small-en-v1.5"  # resolved to a local HuggingFace embedding model

# Load the external knowledge base (the "./data" folder is illustrative) and merge it into one document
documents = SimpleDirectoryReader("./data").load_data()
document = Document(text="\n\n".join([doc.text for doc in documents]))

index = VectorStoreIndex.from_documents([document])
query_engine = index.as_query_engine()
response = query_engine.query(
    "Tell me about LLM customization strategies."
)

The example above shows a simple RAG system. Advanced RAG improves on this by introducing pre-retrieval and post-retrieval strategies to reduce pitfalls such as limited synergy between the retrieval and generation processes. For example, a rerank technique reorders the retrieved information using a model capable of understanding bidirectional context, and integration with a knowledge graph enables advanced query routing. More use cases can be found on the llamaindex website.

Further Reading

4. Agent

LLM agents were a trending topic in 2024 and will likely remain a main focus of the GenAI field in 2025. Compared to RAG, agents excel at creating query routes and planning LLM-based workflows, with the following benefits:

  • Maintaining memory and state of previous model generated responses.
  • Leveraging various tools based on specific criteria. This tool-using capability sets agents apart from basic RAG systems by giving the LLM independent control over tool selection.
  • Breaking down a complex task into smaller steps and planning for a sequence of actions.
  • Collaborating with other agents to form an orchestrated system.

Several in-context learning techniques (e.g. CoT, ReAct) can be implemented through an agentic framework, and we will discuss ReAct in more detail. ReAct, which stands for "Synergizing Reasoning and Acting in Language Models", is composed of three key elements: actions, thoughts, and observations. This framework was introduced by Google Research and Princeton University and builds upon Chain of Thought by integrating the reasoning steps with an action space that enables tool use and function calling. Additionally, the ReAct framework emphasizes determining the next best action based on environmental observations.

This example from the original paper demonstrates ReAct's inner working process, where the LLM generates the first thought and acts by calling the function "Search [Apple Remote]", then observes the feedback from its first output. The second thought is then based on the previous observation, leading to a different action, "Search [Front Row]". This process iterates until the goal is reached. The research shows that ReAct overcomes the prevalent issues of hallucination and error propagation that are more often observed in chain-of-thought reasoning, by interacting with a simple Wikipedia API. Furthermore, through the implementation of decision traces, the ReAct framework also increases the model's interpretability, trustworthiness, and diagnosability.

Example from "ReAct: Synergizing Reasoning and Acting in Language Models" (Yao et al., 2022)
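
Before the llamaindex implementation in the next section, here is a highly simplified sketch of the thought/action/observation loop (the llm callable and the tools dictionary are placeholders, and real implementations parse the model output far more carefully):

# llm(prompt) -> str and tools: dict[str, callable] are assumed placeholders.
def react_loop(question, llm, tools, max_steps=5):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        thought = llm(transcript + "Thought:")            # reason about what to do next
        transcript += f"Thought: {thought}\n"
        action = llm(transcript + "Action:")              # e.g. "Search[Apple Remote]" or "Finish[answer]"
        transcript += f"Action: {action}\n"
        if action.startswith("Finish"):
            return action                                  # goal reached
        tool_name, arg = action.split("[", 1)
        observation = tools[tool_name](arg.rstrip("]"))    # execute the chosen tool
        transcript += f"Observation: {observation}\n"      # feed the result back as context
    return transcript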

Code Snippet

This demonstrates a ReAct-based agent implementation using llamaindex. First, it defines two functions (multiply and add). Second, these two functions are encapsulated as FunctionTool objects, forming the agent's action space, to be executed based on its reasoning.

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI

# create basic function tools
def multiply(a: float, b: float) -> float:
    """Multiply two numbers."""
    return a * b
multiply_tool = FunctionTool.from_defaults(fn=multiply)

def add(a: float, b: float) -> float:
    """Add two numbers."""
    return a + b
add_tool = FunctionTool.from_defaults(fn=add)

# the LLM that drives the agent's reasoning loop
llm = OpenAI(model="gpt-3.5-turbo")
agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)

The advantages of an agentic workflow are more substantial when combined with self-reflection or self-correction. This is a rapidly growing domain, with a variety of agent architectures being explored. For instance, the Reflexion framework facilitates iterative learning by providing a summary of verbal feedback from the environment and storing the feedback in the model's memory; the CRITIC framework empowers frozen LLMs to self-verify through interacting with external tools such as code interpreters and API calls.

Further Reading

5. Fine-Tuning

Fine-tuning is the process of feeding niche and specialized datasets to modify the LLM so that it is more aligned with a certain objective. It differs from prompt engineering and RAG as it updates the LLM weights and parameters. Full fine-tuning refers to updating all weights of the pretrained LLM through backpropagation, which requires large memory to store all weights and parameters and may suffer from a significant reduction in ability on other tasks (i.e. catastrophic forgetting). Therefore, PEFT (parameter-efficient fine-tuning) is more widely used to mitigate these caveats while saving the time and cost of model training. There are three categories of PEFT methods:

  • Selective: Select a subset of the initial LLM parameters to fine-tune, which can be more computationally intensive compared to other PEFT methods.
  • Reparameterization: Adjust model weights through training the weights of low-rank representations. For example, Low-Rank Adaptation (LoRA) falls into this category; it accelerates fine-tuning by representing the weight updates with two smaller matrices (a minimal sketch follows this list).
  • Additive: Add additional trainable layers to the model, including techniques like adapters and soft prompts.
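
To make the reparameterization idea concrete, below is a minimal LoRA-style sketch in PyTorch (the dimensions, rank, and scaling are illustrative; in practice libraries such as PEFT handle this for you): the pretrained weight is frozen and only the two small matrices A and B are trained.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update: W x + scale * (B A) x."""
    def __init__(self, base_layer: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base_layer
        for p in self.base.parameters():
            p.requires_grad = False                               # freeze the pretrained weights
        in_dim, out_dim = base_layer.in_features, base_layer.out_features
        self.A = nn.Parameter(torch.randn(rank, in_dim) * 0.01)   # small random init
        self.B = nn.Parameter(torch.zeros(out_dim, rank))         # zero init, so training starts at W
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)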

The fine-tuning process is similar to a standard deep learning training process, requiring the following inputs:

  • training and evaluation datasets
  • training arguments that define the hyperparameters, e.g. learning rate and optimizer
  • pretrained LLM model
  • compute metrics and objective functions that the algorithm should be optimized for

Code Snippet

Below is an example of implementing fine-tuning using the transformers Trainer.

from transformers import TrainingArguments, Trainer

# model, output_dir, train_dataset, eval_dataset, and compute_metrics are assumed
# to be prepared beforehand (pretrained model, output path, tokenized datasets, metric function)
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    eval_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Fine-tuning has a wide range of use cases. For instance, instruction fine-tuning optimizes LLMs for conversations and following instructions by training them on prompt-completion pairs. Another example is domain adaptation, an unsupervised fine-tuning method that helps LLMs specialize in specific knowledge domains.
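
As an illustration of what instruction fine-tuning data can look like, here is a hypothetical prompt-completion pair in an Alpaca-style layout (the exact template varies between projects); during training the prompt and completion are concatenated, and the loss is typically computed only on the completion tokens:

# A made-up training example; real datasets contain many thousands of such pairs.
example = {
    "prompt": (
        "### Instruction:\nSummarize the following review in one sentence.\n\n"
        "### Input:\nThe battery lasts two days and the screen is sharp, "
        "but the speakers are disappointing.\n\n"
        "### Response:\n"
    ),
    "completion": "A mostly positive review praising battery life and display but criticizing the speakers.",
}

training_text = example["prompt"] + example["completion"]  # the text the model is trained on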

Further Reading

6. RLHF

Reinforcement learning from human feedback, or RLHF, is a reinforcement learning technique that fine-tunes LLMs based on human preferences. RLHF operates by training a reward model on human feedback and using this model as a reward function to optimize a reinforcement learning policy through PPO (Proximal Policy Optimization). The process requires two sets of training data: a preference dataset for training the reward model, and a prompt dataset used in the reinforcement learning loop.

Let’s break it down into steps:

  1. Gather preference dataset annotated by human labelers who rate different completions generated by the model based on human preference. An example format of the preference dataset is {input_text, candidate1, candidate2, human_preference}, indicating which candidate response is preferred.
  2. Train a reward model using the preference dataset. The reward model is essentially a regression model that outputs a scalar indicating the quality of a model-generated response. Its objective is to maximize the score gap between the winning candidate and the losing candidate (a minimal sketch of this pairwise objective follows this list).
  3. Use the reward model in a reinforcement learning loop to fine-tune the LLM. The objective is to update the policy so that the LLM generates responses that maximize the reward produced by the reward model. This process utilizes the prompt dataset, a collection of prompts that are expanded during training into {prompt, response, reward} tuples.
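
The following is a minimal sketch of the pairwise objective mentioned in step 2 (the reward_model callable is a placeholder that returns one scalar score per sequence); it follows the commonly used Bradley-Terry style formulation rather than any specific library's implementation:

import torch.nn.functional as F

# reward_model(input_ids) is assumed to return a scalar score per sequence.
def pairwise_reward_loss(reward_model, winning_ids, losing_ids):
    r_w = reward_model(winning_ids)   # score of the human-preferred completion
    r_l = reward_model(losing_ids)    # score of the rejected completion
    # Minimizing this loss widens the margin between winner and loser scores
    return -F.logsigmoid(r_w - r_l).mean()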

Code Snippet

The open-source library TRL (Transformer Reinforcement Learning) is widely used for implementing RLHF, and it provides template code that shows the basic RLHF setup:

  1. Initialize the base model and tokenizer from a pretrained checkpoint
  2. Configure the PPO hyperparameters in PPOConfig, such as learning rate, epochs, and batch sizes
  3. Create the PPO trainer (PPOTrainer) by combining the model, tokenizer, and training data
  4. Run the training loop, which uses the step() method to iteratively update the model to optimize the rewards calculated from the query and model response

# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from trl import create_reference_model
from transformers import AutoTokenizer

# initiate the pretrained model (with a value head for PPO) and tokenizer;
# model_name, learning_rate, batch sizes, dataset, and collator are assumed
# to be defined elsewhere in the template
ppo_model = AutoModelForCausalLMWithValueHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# define the hyperparameters of the PPO algorithm
config = PPOConfig(
    model_name=model_name,
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

# a frozen copy of the model serves as the reference for the KL penalty
ref_model = create_reference_model(ppo_model)

# initiate the PPO trainer with the model, reference model, tokenizer, and training data
ppo_trainer = PPOTrainer(
    config=config,
    model=ppo_model,
    ref_model=ref_model,
    tokenizer=tokenizer,
    dataset=dataset["train"],
    data_collator=collator
)

# ppo_trainer is iteratively updated through the rewards
ppo_trainer.step(query_tensors, response_tensors, rewards)

RLHF is widely applied for aligning model responses with human preference. Common use cases involve reducing response toxicity and model hallucination. However, it does have the downside of requiring a large amount of human annotated data as well as computation costs associated with policy optimization. Therefore, alternatives like Reinforcement Learning from AI feedback and Direct Preference Optimization (DPO) are introduced to mitigate these limitations.

Further Reading

Take-Home Message

This article briefly explains six essential LLM customization strategies: prompt engineering, decoding strategy, RAG, agents, fine-tuning, and RLHF. I hope you find it helpful for understanding the pros and cons of each strategy, as well as how to implement them based on the practical examples.

The post 6 Common LLM Customization Strategies Briefly Explained appeared first on Towards Data Science.

]]>
How to Use an LLM-Powered Boilerplate for Building Your Own Node.js API https://towardsdatascience.com/how-to-use-an-llm-powered-boilerplate-for-building-your-own-node-js-api/ Fri, 21 Feb 2025 00:15:23 +0000 https://towardsdatascience.com/?p=598248 For a long time, one of the common ways to start new Node.js projects was using boilerplate templates. These templates help developers reuse familiar code structures and implement standard features, such as access to cloud file storage. With the latest developments in LLM, project boilerplates appear to be more useful than ever. Building on this […]

The post How to Use an LLM-Powered Boilerplate for Building Your Own Node.js API appeared first on Towards Data Science.

]]>
For a long time, one of the common ways to start new Node.js projects was using boilerplate templates. These templates help developers reuse familiar code structures and implement standard features, such as access to cloud file storage. With the latest developments in LLM, project boilerplates appear to be more useful than ever.

Building on this progress, I’ve extended my existing Node.js API boilerplate with a new tool LLM Codegen. This standalone feature enables the boilerplate to automatically generate module code for any purpose based on text descriptions. The generated module comes complete with E2E tests, database migrations, seed data, and necessary business logic.

History

I initially created a GitHub repository for a Node.js API boilerplate to consolidate the best practices I’ve developed over the years. Much of the implementation is based on code from a real Node.js API running in production on AWS.

I am passionate about vertical slicing architecture and Clean Code principles to keep the codebase maintainable and clean. With recent advancements in LLMs, particularly their support for large contexts and their ability to generate high-quality code, I decided to experiment with generating clean TypeScript code based on my boilerplate. This boilerplate follows specific structures and patterns that I believe are of high quality. The key question was whether the generated code would follow the same patterns and structure. Based on my findings, it does.

To recap, here’s a quick highlight of the Node.js API boilerplate’s key features:

  • Vertical slicing architecture based on DDD & MVC principles
  • Service input validation using Zod
  • Decoupling application components with dependency injection (InversifyJS)
  • Integration and E2E testing with Supertest
  • Multi-service setup using Docker Compose

Over the past month, I’ve spent my weekends formalizing the solution and implementing the necessary code-generation logic. Below, I’ll share the details.

Implementation Overview

Let’s explore the specifics of the implementation. All Code Generation logic is organized at the project root level, inside the llm-codegen folder, ensuring easy navigation. The Node.js boilerplate code has no dependency on llm-codegen, so it can be used as a regular template without modification.

LLM-Codegen folder structure

It covers the following use cases:

  • Generating clean, well-structured code for a new module based on an input description. The generated module becomes part of the Node.js REST API application.
  • Creating database migrations and extending seed scripts with basic data for the new module.
  • Generating and fixing E2E tests for the new code and ensuring all tests pass.

The generated code after the first stage is clean and adheres to vertical slicing architecture principles. It includes only the necessary business logic for CRUD operations. Compared to other code generation approaches, it produces clean, maintainable, and compilable code with valid E2E tests.

The second use case involves generating DB migration with the appropriate schema and updating the seed script with the necessary data. This task is particularly well-suited for LLM, which handles it exceptionally well.

The final use case is generating E2E tests, which help confirm that the generated code works correctly. During the running of E2E tests, an SQLite3 database is used for migrations and seeds.

The main supported LLM clients are OpenAI and Claude.

How to Use It

To get started, navigate to the root folder llm-codegen and install all dependencies by running:

npm i

llm-codegen does not rely on Docker or any other heavy third-party dependencies, making setup and execution straightforward. Before running the tool, ensure that you set at least one *_API_KEY environment variable in the .env file with the appropriate API key for your chosen LLM provider. All supported environment variables are listed in the .env.sample file (OPENAI_API_KEY, CLAUDE_API_KEY, etc.). You can use OpenAI, Anthropic Claude, or OpenRouter LLaMA. As of mid-December, OpenRouter LLaMA is surprisingly free to use. It's possible to register here and obtain a token for free usage. However, the output quality of this free LLaMA model could be improved, as most of the generated code fails to pass the compilation stage.

To start llm-codegen, run the following command:

npm run start

Next, you'll be asked to input the module description and name. In the module description, you can specify all necessary requirements, such as entity attributes and required operations. The core remaining work is performed by micro-agents: Developer, Troubleshooter, and TestsFixer.

Here is an example of a successful code generation:

Successful code generation

Below is another example demonstrating how a compilation error was fixed:

The following is an example of a generated orders module code:

A key detail is that you can generate code step by step, starting with one module and adding others until all required APIs are complete. This approach allows you to generate code for all required modules in just a few command runs.

How It Works

As mentioned earlier, all work is performed by the micro-agents Developer, Troubleshooter, and TestsFixer, controlled by the Orchestrator. They run in the listed order, with the Developer generating most of the codebase. After each code generation step, a check is performed for missing files based on their roles (e.g., routes, controllers, services). If any files are missing, a new code generation attempt is made, including instructions in the prompt about the missing files and examples for each role. Once the Developer completes its work, TypeScript compilation begins. If any errors are found, the Troubleshooter takes over, passing the errors to the prompt and waiting for the corrected code. Finally, when the compilation succeeds, E2E tests are run. Whenever a test fails, the TestsFixer steps in with specific prompt instructions, ensuring all tests pass and the code stays clean.

All micro-agents are derived from the BaseAgent class and actively reuse its base method implementations. Here is the Developer implementation for reference:

Each agent utilizes its specific prompt. Check out this GitHub link for the prompt used by the Developer.

After dedicating significant effort to research and testing, I refined the prompts for all micro-agents, resulting in clean, well-structured code with very few issues.

During the development and testing, it was used with various module descriptions, ranging from simple to highly detailed. Here are a few examples:

- The module responsible for library book management must handle endpoints for CRUD operations on books.
- The module responsible for the orders management. It must provide CRUD operations for handling customer orders. Users can create new orders, read order details, update order statuses or information, and delete orders that are canceled or completed. Order must have next attributes: name, status, placed source, description, image url
- Asset Management System with an "Assets" module offering CRUD operations for company assets. Users can add new assets to the inventory, read asset details, update information such as maintenance schedules or asset locations, and delete records of disposed or sold assets.

Testing with gpt-4o-mini and claude-3-5-sonnet-20241022 showed comparable output code quality, although Sonnet is more expensive. Claude Haiku (claude-3-5-haiku-20241022), while cheaper and similar in price to gpt-4o-mini, often produces non-compilable code. Overall, with gpt-4o-mini, a single code generation session consumes an average of around 11k input tokens and 15k output tokens. This amounts to a cost of approximately 2 cents per session, based on token pricing of 15 cents per 1M input tokens and 60 cents per 1M output tokens (as of December 2024).

Below are Anthropic usage logs showing token consumption:

Based on my experimentation over the past few weeks, I conclude that while there may still be some issues with passing generated tests, 95% of the time generated code is compilable and runnable.

I hope you found some inspiration here and that it serves as a starting point for your next Node.js API or an upgrade to your current project. Should you have suggestions for improvements, feel free to contribute by submitting PR for code or prompt updates.

If you enjoyed this article, feel free to clap or share your thoughts in the comments, whether ideas or questions. Thank you for reading, and happy experimenting!

UPDATE [February 9, 2025]: The LLM-Codegen GitHub repository was updated with DeepSeek API support. It’s cheaper than gpt-4o-mini and offers nearly the same output quality, but it has a longer response time and sometimes struggles with API request errors.

Unless otherwise noted, all images are by the author

The post How to Use an LLM-Powered Boilerplate for Building Your Own Node.js API appeared first on Towards Data Science.

]]>
Formulation of Feature Circuits with Sparse Autoencoders in LLM https://towardsdatascience.com/formulation-of-feature-circuits-with-sparse-autoencoders-in-llm/ Wed, 19 Feb 2025 20:58:35 +0000 https://towardsdatascience.com/?p=598138 Large Language models (LLMs) have witnessed impressive progress and these large models can do a variety of tasks, from generating human-like text to answering questions. However, understanding how these models work still remains challenging, especially due a phenomenon called superposition where features are mixed into one neuron, making it very difficult to extract human understandable […]

The post Formulation of Feature Circuits with Sparse Autoencoders in LLM appeared first on Towards Data Science.

]]>
Large Language Models (LLMs) have witnessed impressive progress, and these large models can do a variety of tasks, from generating human-like text to answering questions. However, understanding how these models work still remains challenging, especially due to a phenomenon called superposition, where features are mixed into one neuron, making it very difficult to extract human-understandable representations from the original model structure. This is where methods like the sparse autoencoder come in to disentangle features for interpretability.

In this blog post, we will use a sparse autoencoder to find some feature circuits for a particularly interesting case of subject-verb agreement, and understand how the model components contribute to the task.

Key concepts 

Feature circuits 

In the context of neural networks, feature circuits describe how networks learn to combine input features to form complex patterns at higher levels. We use the metaphor of "circuits" to describe how features are processed along the layers of a neural network, because such processing reminds us of electronic circuits processing and combining signals.

These feature circuits form gradually through the connections between neurons and layers, where each neuron or layer is responsible for transforming input features, and their interactions lead to useful feature combinations that play together to make the final predictions.

Here is one example of feature circuits: in lots of vision neural networks, we can find “a circuit as a family of units detecting curves in different angular orientations. Curve detectors are primarily implemented from earlier, less sophisticated curve detectors and line detectors. These curve detectors are used in the next layer to create 3D geometry and complex shape detectors” [1]. 

In the coming chapter, we will work on one feature circuit in LLMs for a subject-verb agreement task. 

Superposition and Sparse AutoEncoder 

In the context of Machine Learning, we have sometimes observed superposition, referring to the phenomenon that one neuron in a model represents multiple overlapping features rather than a single, distinct one. For example, InceptionV1 contains one neuron that responds to cat faces, fronts of cars, and cat legs. 

This is where the Sparse Autoencoder (SAE) comes in.

The SAE helps us disentangle the network's activations into a set of sparse features. These sparse features are normally human-understandable, allowing us to get a better understanding of the model. By applying an SAE to the hidden-layer activations of an LLM, we can isolate the features that contribute to the model's output.

You can find the details of how the SAE works in my previous blog post.

Case study: Subject-Verb Agreement

Subject-Verb Agreement 

Subject-verb agreement is a fundamental grammar rule in English. The subject and the verb in a sentence must be consistent in numbers, aka singular or plural. For example:

  • “The cat runs.” (Singular subject, singular verb)
  • “The cats run.” (Plural subject, plural verb)

This rule, simple as it is for humans, is important for tasks like text generation, translation, and question answering. But how do we know if an LLM has actually learned this rule? 

We will now explore in this chapter how the LLM forms a feature circuit for such a task. 

Building the Feature Circuit

Let’s now build the process of creating the feature circuit. We would do it in 4 steps:

  1. We start by inputting sentences into the model. For this case study, we consider sentences like:
    • "The cat runs." (singular subject)
    • "The cats run." (plural subject)
  2. We run the model on these sentences to get hidden activations. These activations stand for how the model processes the sentences at each layer.
  3. We pass the activations to an SAE to "decompress" the features.
  4. We construct a feature circuit as a computational graph:
    • The input nodes represent the singular and plural sentences.
    • The hidden nodes represent the model layers that process the input.
    • The sparse nodes represent the features obtained from the SAE.
    • The output node represents the final decision, in this case: runs or run.

Toy Model 

We start by building a toy language model, which may not make much linguistic sense, with the following code. This is a network with two simple layers.

For the subject-verb agreement, the model is supposed to: 

  • Input a sentence with either singular or plural verbs. 
  • The hidden layer transforms such information into an abstract representation. 
  • The model selects the correct verb form as output.
import torch.nn as nn

# ====== Define Base Model (Simulating Subject-Verb Agreement) ======
class SubjectVerbAgreementNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(2, 4)  # 2 inputs → 4 hidden activations
        self.output = nn.Linear(4, 2)  # 4 hidden → 2 outputs (runs/run)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.hidden(x))  # Compute hidden activations
        return self.output(x)          # Predict verb

It is unclear what happens inside the hidden layer, so we introduce the following sparse autoencoder:

# ====== Define Sparse Autoencoder (SAE) ======
class SparseAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)  # Decompress into sparse features
        self.decoder = nn.Linear(hidden_dim, input_dim)  # Reconstruct the original activations
        self.relu = nn.ReLU()

    def forward(self, x):
        encoded = self.relu(self.encoder(x))  # Sparse activations
        decoded = self.decoder(encoded)       # Reconstructed activations
        return encoded, decoded

We train the SubjectVerbAgreementNN model and the sparse autoencoder with sentences designed to represent different singular and plural forms of verbs, such as "The cat runs" and "The babies run". However, just like before, for the toy model these sentences may not have actual meanings.
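
For reference, here is a minimal training-loop sketch for the SAE (collect_hidden_activations is a placeholder for whatever code gathers hidden-layer activations from the base model); the loss combines reconstruction error with an L1 penalty that encourages sparse features:

import torch
import torch.nn.functional as F

sae = SparseAutoencoder(input_dim=4, hidden_dim=8)   # dimensions match the toy model's hidden layer
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_weight = 1e-3                                     # strength of the sparsity penalty

for _ in range(1000):
    activations = collect_hidden_activations()       # placeholder: a (batch, 4) tensor of hidden activations
    encoded, decoded = sae(activations)
    loss = F.mse_loss(decoded, activations) + l1_weight * encoded.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()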

Now we visualise the feature circuit. As introduced before, a feature circuit is a unit of neurons for processing specific features. In our model, the feature circuit consists of:

  1. The hidden layer transforming language properties into an abstract representation.
  2. The SAE with independent features that contribute directly to the subject-verb agreement task.
Trained Feature Circuit: Singular vs. Plural (Dog/Dogs)

You can see in the plot that we visualize the feature circuit as a graph: 

  • Hidden activations and the encoder’s outputs are all nodes of the graph.
  • We also have the output nodes as the correct verb.
  • Edges in the graph are weighted by activation strength, showing which pathways are most important in the subject-verb agreement decision. For example, you can see that the path from H3 to F2 plays an important role. 

GPT2-Small 

For a real case, we run similar code on GPT2-small. We show the graph of a feature circuit representing the decision to choose the singular verb.

Feature Circuit for Subject-Verb agreement (run/runs). For code details and a larger version of the above, please refer to my notebook.

Conclusion 

Feature circuits help us understand how different parts of a complex LLM lead to a final output. We show that it is possible to use an SAE to form a feature circuit for a subject-verb agreement task.

However, we have to admit this method still needs some human-level intervention in the sense that we don’t always know if a circuit can really form without a proper design.

Reference 

[1] Zoom In: An Introduction to Circuits

The post Formulation of Feature Circuits with Sparse Autoencoders in LLM appeared first on Towards Data Science.

]]>
How LLMs Work: Pre-Training to Post-Training, Neural Networks, Hallucinations, and Inference https://towardsdatascience.com/how-llms-work-pre-training-to-post-training-neural-networks-hallucinations-and-inference/ Tue, 18 Feb 2025 17:39:29 +0000 https://towardsdatascience.com/?p=598060 With the recent explosion of interest in large language models (LLMs), they often seem almost magical. But let’s demystify them. I wanted to step back and unpack the fundamentals — breaking down how LLMs are built, trained, and fine-tuned to become the AI systems we interact with today. This two-part deep dive is something I’ve been meaning […]

The post How LLMs Work: Pre-Training to Post-Training, Neural Networks, Hallucinations, and Inference appeared first on Towards Data Science.

]]>
With the recent explosion of interest in large language models (LLMs), they often seem almost magical. But let’s demystify them.

I wanted to step back and unpack the fundamentals — breaking down how LLMs are built, trained, and fine-tuned to become the AI systems we interact with today.

This two-part deep dive is something I've been meaning to do for a while and was also inspired by Andrej Karpathy's widely popular 3.5-hour YouTube video, which has racked up 800,000+ views in just 10 days. Andrej is a founding member of OpenAI, and his insights are gold. You get the idea.

If you have the time, his video is definitely worth watching. But let’s be real — 3.5 hours is a long watch. So, for all the busy folks who don’t want to miss out, I’ve distilled the key concepts from the first 1.5 hours into this 10-minute read, adding my own breakdowns to help you build a solid intuition.

What you’ll get

Part 1 (this article): Covers the fundamentals of LLMs, including pre-training to post-training, neural networks, Hallucinations, and inference.

Part 2: Reinforcement learning with human/AI feedback, investigating o1 models, DeepSeek R1, AlphaGo

Let’s go! I’ll start with looking at how LLMs are being built.

At a high level, there are 2 key phases: pre-training and post-training.

1. Pre-training

Before an LLM can generate text, it must first learn how language works. This happens through pre-training, a highly computationally intensive task.

Step 1: Data collection and preprocessing

The first step in training an LLM is gathering as much high-quality text as possible. The goal is to create a massive and diverse dataset containing a wide range of human knowledge.

One source is Common Crawl, which is a free, open repository of web crawl data containing about 250 billion web pages collected over 18 years. However, raw web data is noisy — containing spam, duplicates, and low-quality content — so preprocessing is essential. If you're interested in preprocessed datasets, FineWeb offers a curated version of Common Crawl and is made available on Hugging Face.

Once cleaned, the text corpus is ready for tokenization.

Step 2: Tokenization

Before a neural network can process text, it must be converted into numerical form. This is done through tokenization, where words, subwords, or characters are mapped to unique numerical tokens.

Think of tokens as the fundamental building blocks of all language models. In GPT-4, there are 100,277 possible tokens. A popular tokenizer, Tiktokenizer, allows you to experiment with tokenization and see how text is broken down into tokens. Try entering a sentence, and you'll see each word or subword assigned a series of numerical IDs.
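
If you want to reproduce this programmatically, the tiktoken library exposes the same encodings; the snippet below uses cl100k_base (the GPT-4-era encoding) on an example sentence:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")     # encoding used by GPT-4-era models

tokens = enc.encode("We are cooking dinner tonight.")
print(tokens)                                  # a list of integer token IDs
print([enc.decode([t]) for t in tokens])       # the text piece each ID maps back to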

Step 3: Neural network training

Once the text is tokenized, the neural network learns to predict the next token based on its context. As shown above, the model takes an input sequence of tokens (e.g., “we are cook ing”) and processes it through a giant mathematical expression — which represents the model’s architecture — to predict the next token.

A neural network consists of 2 key parts:

  1. Parameters (weights) — the learned numerical values from training.
  2. Architecture (mathematical expression) — the structure defining how the input tokens are processed to produce outputs.

Initially, the model’s predictions are random, but as training progresses, it learns to assign probabilities to possible next tokens.

When the correct token (e.g. “food”) is identified, the model adjusts its billions of parameters (weights) through backpropagation — an optimization process that reinforces correct predictions by increasing their probabilities while reducing the likelihood of incorrect ones.
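
As a rough sketch of what this objective looks like in code (the model here is a stand-in that maps token IDs to logits; real training adds batching, masking, and many other details), the target at each position is simply the next token in the sequence:

import torch.nn.functional as F

def next_token_loss(model, token_ids):
    inputs  = token_ids[:, :-1]                 # e.g. "we are cook"
    targets = token_ids[:, 1:]                  # e.g. "are cook ing" (shifted by one position)
    logits = model(inputs)                      # (batch, seq_len, vocab_size)
    # Cross-entropy raises the probability of each correct next token;
    # backpropagation then nudges the parameters in that direction.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))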

This process is repeated billions of times across massive datasets.

Base model — the output of pre-training

At this stage, the base model has learned:

  • How words, phrases and sentences relate to each other
  • Statistical patterns in your training data

However, base models are not yet optimised for real-world tasks. You can think of them as an advanced autocomplete system — they predict the next token based on probability, but with limited instruction-following ability.

A base model can sometimes recite training data verbatim and can be used for certain applications through in-context learning, where you guide its responses by providing examples in your prompt. However, to make the model truly useful and reliable, it requires further training.

2. Post training — Making the model useful

Base models are raw and unrefined. To make them helpful, reliable, and safe, they go through post-training, where they are fine-tuned on smaller, specialised datasets.

Because the model is a neural network, it cannot be explicitly programmed like traditional software. Instead, we “program” it implicitly by training it on structured labeled datasets that represent examples of desired interactions.

How post training works

Specialised datasets are created, consisting of structured examples on how the model should respond in different situations. 

Some types of post training include:

  1. Instruction/conversation fine tuning
    Goal: To teach the model to follow instructions, be task oriented, engage in multi-turn conversations, follow safety guidelines and refuse malicious requests, etc.
    Eg: InstructGPT (2022): OpenAI hired some 40 contractors to create these labelled datasets. These human annotators wrote prompts and provided ideal responses based on safety guidelines. Today, many datasets are generated automatically, with humans reviewing and editing them for quality.
  2. Domain specific fine tuning
    Goal: Adapt the model for specialised fields like medicine, law and programming.

Post training also introduces special tokens — symbols that were not used during pre-training — to help the model understand the structure of interactions. These tokens signal where a user’s input starts and ends and where the AI’s response begins, ensuring that the model correctly distinguishes between prompts and replies.
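
As an illustration only (the exact token strings differ between model families), a ChatML-style layout marks the boundaries of each turn with special tokens like this:

# Purely illustrative; check your model's documentation for its actual special tokens.
conversation = (
    "<|im_start|>user\n"
    "What is the capital of France?<|im_end|>\n"
    "<|im_start|>assistant\n"
    "The capital of France is Paris.<|im_end|>\n"
)
print(conversation)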

Now, we’ll move on to some other key concepts.

Inference — how the model generates new text

Inference can be performed at any stage, even midway through pre-training, to evaluate how well the model has learned.

When given an input sequence of tokens, the model assigns probabilities to all possible next tokens based on patterns it has learned during training.

Instead of always choosing the most likely token, it samples from this probability distribution — similar to flipping a biased coin, where higher-probability tokens are more likely to be selected.

This process repeats iteratively, with each newly generated token becoming part of the input for the next prediction. 

Token selection is stochastic and the same input can produce different outputs. Over time, the model generates text that wasn’t explicitly in its training data but follows the same statistical patterns.
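
The "biased coin" analogy can be written down in a couple of lines (probs here stands in for the model's probability distribution over the vocabulary at one generation step):

import numpy as np

rng = np.random.default_rng()

def sample_token(probs):
    # One biased "coin flip" over the whole vocabulary; repeated calls on the
    # same distribution can return different tokens, which is why identical
    # prompts can yield different completions.
    return int(rng.choice(len(probs), p=probs))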

Hallucinations — when LLMs generate false info

Why do hallucinations occur?

Hallucinations happen because LLMs do not “know” facts — they simply predict the most statistically likely sequence of words based on their training data.

Early models struggled significantly with hallucinations.

For instance, in the example below, if the training data contains many “Who is…” questions with definitive answers, the model learns that such queries should always have confident responses, even when it lacks the necessary knowledge.

When asked about an unknown person, the model does not default to “I don’t know” because this pattern was not reinforced during training. Instead, it generates its best guess, often leading to fabricated information.

How do you reduce hallucinations?

Method 1: Saying “I don’t know”

Improving factual accuracy requires explicitly training the model to recognise what it does not know — a task that is more complex than it seems.

This is done via self interrogation, a process that helps define the model’s knowledge boundaries.

Self interrogation can be automated using another AI model, which generates questions to probe knowledge gaps. If it produces a false answer, new training examples are added, where the correct response is: “I’m not sure. Could you provide more context?”

If a model has seen a question many times in training, it will assign a high probability to the correct answer.

If the model has not encountered the question before, it distributes probability more evenly across multiple possible tokens, making the output more randomised. No single token stands out as the most likely choice.

Fine tuning explicitly trains the model to handle low-confidence outputs with predefined responses. 

For example, when I asked ChatGPT-4o, “Who is asdja rkjgklfj?”, it correctly responded: “I’m not sure who that is. Could you provide more context?”

Method 2: Doing a web search

A more advanced method is to extend the model’s knowledge beyond its training data by giving it access to external search tools.

At a high level, when a model detects uncertainty, it can trigger a web search. The search results are then inserted into the model's context window — essentially allowing this new data to be part of its working memory. The model references this new information while generating a response.

Vague recollections vs working memory

Generally speaking, LLMs have two types of knowledge access.

  1. Vague recollections — the knowledge stored in the model’s parameters from pre-training. This is based on patterns it learned from vast amounts of internet data but is not precise nor searchable.
  2. Working memory — the information that is available in the model’s context window, which is directly accessible during inference. Any text provided in the prompt acts as a short term memory, allowing the model to recall details while generating responses.

Adding relevant facts within the context window significantly improves response quality.

Knowledge of self 

When asked questions like “Who are you?” or “What built you?”, an LLM will generate a statistical best guess based on its training data, unless explicitly programmed to respond accurately. 

LLMs do not have true self-awareness, their responses depend on patterns seen during training.

One way to provide the model with a consistent identity is by using a system prompt, which sets predefined instructions about how it should describe itself, its capabilities, and its limitations.

To end off

That’s a wrap for Part 1! I hope this has helped you build intuition on how LLMs work. In Part 2, we’ll dive deeper into reinforcement learning and some of the latest models.

Got questions or ideas for what I should cover next? Drop them in the comments — I’d love to hear your thoughts. See you in Part 2! 🙂

The post How LLMs Work: Pre-Training to Post-Training, Neural Networks, Hallucinations, and Inference appeared first on Towards Data Science.

]]>
Retrieval Augmented Generation in SQLite https://towardsdatascience.com/retrieval-augmented-generation-in-sqlite/ Tue, 18 Feb 2025 17:05:58 +0000 https://towardsdatascience.com/?p=598050 This is the second in a two-part series on using SQLite for machine learning. In my last article, I dove into how SQLite is rapidly becoming a production-ready database for web applications. In this article, I will discuss how to perform retrieval-augmented-generation using SQLite. If you’d like a custom web application with generative AI integration, […]

The post Retrieval Augmented Generation in SQLite appeared first on Towards Data Science.

]]>
This is the second in a two-part series on using SQLite for Machine Learning. In my last article, I dove into how SQLite is rapidly becoming a production-ready database for web applications. In this article, I will discuss how to perform retrieval-augmented-generation using SQLite.

If you’d like a custom web application with generative AI integration, visit losangelesaiapps.com

The code referenced in this article can be found here.


When I first learned how to perform retrieval-augmented-generation (RAG) as a budding data scientist, I followed the traditional path. This usually looks something like:

  • Google retrieval-augmented-generation and look for tutorials
  • Find the most popular framework, usually LangChain or LlamaIndex
  • Find the most popular cloud vector database, usually Pinecone or Weaviate
  • Read a bunch of docs, put all the pieces together, and success!

In fact I actually wrote an article about my experience building a RAG system in LangChain with Pinecone.

There is nothing terribly wrong with using a RAG framework with a cloud vector database. However, I would argue that for first time learners it overcomplicates the situation. Do we really need an entire framework to learn how to do RAG? Is it necessary to perform API calls to cloud vector databases? These databases act as black boxes, which is never good for learners (or frankly for anyone). 

In this article, I will walk you through how to perform RAG on the simplest stack possible. In fact, this ‘stack’ is just Sqlite with the sqlite-vec extension and the OpenAI API for use of their embedding and chat models. I recommend you read part 1 of this series to get a deep dive on SQLite and how it is rapidly becoming production ready for web applications. For our purposes here, it is enough to understand that SQLite is the simplest kind of database possible: a single file in your repository. 

So ditch your cloud vector databases and your bloated frameworks, and let’s do some RAG.


SQLite-Vec

One of the powers of the SQLite database is the use of extensions. For those of us familiar with Python, extensions are a lot like libraries. They are modular pieces of code written in C to extend the functionality of SQLite, making things that were once impossible possible. One popular example of a SQLite extension is the Full-Text Search (FTS) extension. This extension allows SQLite to perform efficient searches across large volumes of textual data in SQLite. Because the extension is written purely in C, we can run it anywhere a SQLite database can be run, including Raspberry Pis and browsers.

In this article I will be going over the extension known as sqlite-vec. This gives SQLite the power of performing vector search. Vector search is similar to full-text search in that it allows for efficient search across textual data. However, rather than search for an exact word or phrase in the text, vector search has a semantic understanding. In other words, searching for “horses” will find matches of “equestrian”, “pony”, “Clydesdale”, etc. Full-text search is incapable of this. 

sqlite-vec makes use of virtual tables, as do most extensions in SQLite. A virtual table is similar to a regular table, but with additional powers:

  • Custom Data Sources: The data for a standard table in SQLite is housed in a single db file. For a virtual table, the data can be housed in external sources, for example a CSV file or an API call.
  • Flexible Functionality: Virtual tables can add specialized indexing or querying capabilities and support complex data types like JSON or XML.
  • Integration with SQLite Query Engine: Virtual tables integrate seamlessly with SQLite’s standard query syntax e.g. SELECT , INSERT, UPDATE, and DELETE options. Ultimately it is up to the writers of the extensions to support these operations.
  • Use of Modules: The backend logic for how the virtual table will work is implemented by a module (written in C or another language).

The typical syntax for creating a virtual table looks like the following:

CREATE VIRTUAL TABLE my_table USING my_extension_module();

The important part of this statement is my_extension_module(). This specifies the module that will be powering the backend of the my_table virtual table. In sqlite-vec we will use the vec0 module.

Code Walkthrough

The code for this article can be found here. It is a simple directory with the majority of files being .txt files that we will be using as our dummy data. Because I am a physics nerd, the majority of the files pertain to physics, with just a few files relating to other random fields. I will not present the full code in this walkthrough, but instead will highlight the important pieces. Clone my repo and play around with it to investigate the full code. Below is a tree view of the repo. Note that my_docs.db is the single-file database used by SQLite to manage all of our data.

.
├── data
│   ├── cooking.txt
│   ├── gardening.txt
│   ├── general_relativity.txt
│   ├── newton.txt
│   ├── personal_finance.txt
│   ├── quantum.txt
│   ├── thermodynamics.txt
│   └── travel.txt
├── my_docs.db
├── requirements.txt
└── sqlite_rag_tutorial.py

Step 1 is to install the necessary libraries. Below is our requirements.txt file. As you can see it has only three libraries. I recommend creating a virtual environment with the latest Python version (3.13.1 was used for this article) and then running pip install -r requirements.txt to install the libraries.

# requirements.txt
sqlite-vec==0.1.6
openai==1.63.0
python-dotenv==1.0.1

Step 2 is to create an OpenAI API key if you don’t already have one. We will be using OpenAI to generate embeddings for the text files so that we can perform our vector search. 

# sqlite_rag_tutorial.py
import sqlite3
from sqlite_vec import serialize_float32
import sqlite_vec
import os
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables (e.g. OPENAI_API_KEY) from the .env file
load_dotenv()

# Set up OpenAI client
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

Step 3 is to load the sqlite-vec extension into SQLite. We will be using Python and SQL for our examples in this article. Disabling the ability to load extensions immediately after loading your extension is a good security practice.

# Path to the database file
db_path = 'my_docs.db'

# Delete the database file if it exists, so we start from a clean slate
if os.path.exists(db_path):
    os.remove(db_path)

# Connect and load the sqlite-vec extension
db = sqlite3.connect(db_path)
db.enable_load_extension(True)
sqlite_vec.load(db)
db.enable_load_extension(False)

Next we will go ahead and create our virtual table:

db.execute('''
   CREATE VIRTUAL TABLE documents USING vec0(
       embedding float[1536],
       +file_name TEXT,
       +content TEXT
   )
''')

documents is a virtual table with three columns:

  • embedding : A 1536-dimensional float vector that will store the embeddings of our sample documents.
  • file_name : Text that will house the name of each file we store in the database. Note that this column and the following have a + symbol in front of them. This indicates that they are auxiliary fields. Previously in sqlite-vec only embedding data could be stored in the virtual table. However, recently an update was pushed that allows us to add fields to our table that we don’t really want embedded. In this case we are adding the content and name of the file in the same table as our embeddings. This will allow us to easily see what embeddings correspond to what content easily while sparing us the need for extra tables and JOIN statements.
  • content : Text that will store the content of each file. 

Now that we have our virtual table set up in our SQLite database, we can begin converting our text files into embeddings and storing them in our table:

# Function to get embeddings using the OpenAI API
def get_openai_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Iterate over .txt files in the /data directory
for file_name in os.listdir("data"):
    file_path = os.path.join("data", file_name)
    with open(file_path, 'r', encoding='utf-8') as file:
        content = file.read()

        # Generate embedding for the content
        embedding = get_openai_embedding(content)
        if embedding:
            # Insert file content and embedding into the vec0 table
            db.execute(
                'INSERT INTO documents (embedding, file_name, content) VALUES (?, ?, ?)',
                (serialize_float32(embedding), file_name, content)
            )

# Commit changes
db.commit()

We essentially loop through each of our .txt files, embed the content of each file, and then use an INSERT INTO statement to insert the embedding, file_name, and content into the documents virtual table. A commit statement at the end ensures the changes are persisted. Note that we are using serialize_float32 from the sqlite-vec library here. SQLite itself does not have a built-in vector type, so vectors are stored as binary large objects (BLOBs) to save space and allow fast operations. Internally, serialize_float32 uses Python's struct.pack() function, which converts Python data into C-style binary representations.
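
To make that concrete, serialize_float32 behaves roughly like the small helper below (a sketch of the idea, not the library's actual source): it packs a list of Python floats into a contiguous array of 32-bit C floats that SQLite can store as a BLOB.

import struct

def serialize_float32_sketch(vector: list[float]) -> bytes:
    # Pack the floats into a contiguous array of 32-bit C floats (a BLOB)
    return struct.pack(f"{len(vector)}f", *vector)

# A 1536-dimensional embedding becomes a 1536 * 4 = 6144-byte BLOB
blob = serialize_float32_sketch([0.1, 0.2, 0.3])
print(len(blob))  # 12 bytes: 3 floats * 4 bytes each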

Finally, to perform RAG, we use the following code to run a K-Nearest-Neighbors (KNN-style) query. This is the heart of vector search.

# Perform a sample KNN query
query_text = "What is general relativity?"
query_embedding = get_openai_embedding(query_text)

if query_embedding:
    rows = db.execute(
        """
        SELECT
            file_name,
            content,
            distance
        FROM documents
        WHERE embedding MATCH ?
        ORDER BY distance
        LIMIT 3
        """,
        [serialize_float32(query_embedding)]
    ).fetchall()

    print("Top 3 most similar documents:")
    top_contexts = []
    for row in rows:
        print(row)
        top_contexts.append(row[1])  # Append the 'content' column

We begin by taking in a query from the user, in this case “What is general relativity?”, and embedding it with the same embedding model as before. We then run a SQL query. Let’s break it down:

  • The SELECT statement means the retrieved data will have three columns: file_name, content, and distance. The first two we have already mentioned; distance will be calculated during the query, more on this in a moment.
  • The FROM statement ensures you are pulling data from the documents table.
  • The WHERE embedding MATCH ? statement performs a similarity search between all of the vectors in your database and the query vector. The returned data includes a distance column, a floating-point number measuring how far apart the query and database vectors are: the lower the number, the closer (more similar) the vectors. sqlite-vec provides a few options for how this distance is calculated (see the short sketch after this list).
  • The ORDER BY distance clause sorts the results by ascending distance, so the most similar documents come first.
  • LIMIT 3 ensures we only get the three documents nearest to our query embedding vector. You can tweak this number to see how retrieving more or fewer vectors affects your results.
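
As a quick illustration of those distance options, sqlite-vec documents standalone distance functions such as vec_distance_L2() and vec_distance_cosine() that can be called directly on serialized vectors. The snippet below is a rough sketch, assuming those functions are available in your installed version of the extension; the toy vectors are made up for the example.

# Compare two toy vectors under different distance metrics (sketch)
a = serialize_float32([1.0, 0.0, 0.0])
b = serialize_float32([0.0, 1.0, 0.0])

l2, = db.execute("SELECT vec_distance_L2(?, ?)", [a, b]).fetchone()
cos, = db.execute("SELECT vec_distance_cosine(?, ?)", [a, b]).fetchone()
print(f"L2 distance: {l2}, cosine distance: {cos}")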

Given our query of “What is general relativity?”, the following documents were pulled. It did a pretty good job!

Top 3 most similar documents:
('general_relativity.txt', 'Einstein’s theory of general relativity redefined our understanding of gravity. Instead of viewing gravity as a force acting at a distance, it interprets it as the curvature of spacetime around massive objects. Light passing near a massive star bends slightly, galaxies deflect beams traveling millions of light-years, and clocks tick at different rates depending on their gravitational potential. This groundbreaking theory led to predictions like gravitational lensing and black holes, phenomena later confirmed by observational evidence, and it continues to guide our understanding of the cosmos.', 0.8316285610198975)
('newton.txt', 'In classical mechanics, Newton’s laws of motion form the foundation of how we understand the movement of objects. Newton’s first law, often called the law of inertia, states that an object at rest remains at rest and an object in motion continues in motion unless acted upon by an external force. This concept extends into more complex physics problems, where analyzing net forces on objects allows us to predict their future trajectories and behaviors. Over time, applying Newton’s laws has enabled engineers and scientists to design safer vehicles, more efficient machines, and even guide spacecraft through intricate gravitational fields.', 1.2036118507385254)
('quantum.txt', 'Quantum mechanics revolutionized our understanding of the microscopic world. Unlike classical particles, quantum entities such as electrons can exhibit both wave-like and particle-like behaviors. Phenomena like quantum superposition suggest that particles can exist in multiple states at once, and the act of measurement often “collapses” these states into one observed outcome. This strange and counterintuitive theory underpins modern technologies like semiconductors and lasers, and it provides a conceptual framework for emerging fields like quantum computing and cryptography.', 1.251380205154419)

We can then stuff the model’s context with these three documents and have it attempt to answer our question.

# Prepare the context for the query
context = "\n\n".join(top_contexts)
system_message = "You are a helpful assistant. Use the following context to answer the query."

# Send query and context to OpenAI
try:
    completion = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": f"Context: {context}\n\nQuery: {query_text}"}
        ]
    )
    print("Response:")
    print(completion.choices[0].message.content)
except Exception as e:
    print(f"Error generating response: {e}")

Here is what the model said:

General relativity is a theory developed by Albert Einstein that redefines our understanding of gravity. Instead of viewing gravity as a force acting at a distance, general relativity interprets it as the curvature of spacetime caused by the presence of mass. According to this theory, massive objects like stars and planets create a distortion in the fabric of spacetime, and this curvature affects the motion of other objects, including light, which can bend when it passes near a massive body.

The model faithfully sticks to the documents we gave it. Great job, 4o-mini!

Conclusion

sqlite-vec is a project sponsored by the Mozilla Builders Accelerator program, so it has some significant backing behind it. I have to give a big thanks to Alex Garcia, the creator of sqlite-vec, for helping to push the SQLite ecosystem forward and making ML possible with this simple database. This is a well-maintained library, with updates coming down the pipeline on a regular basis. As of November 20th, they even added filtering by metadata! Perhaps I should re-do my aforementioned RAG article using SQLite 🤔.

The extension also offers bindings for several other popular programming languages, including Ruby, Go, and Rust.

The fact that we are able to radically simplify our RAG pipeline to the bare essentials is remarkable. To recap, there is no need to spin up and manage a separate database service like Postgres or MySQL, and no need for API calls to managed vector-database vendors. If you deploy directly to a server via Digital Ocean or Hetzner, you can even avoid the cost and complexity associated with managed cloud services like AWS, Azure, or Vercel.

I believe this simple architecture can work for a variety of applications. It is cheaper to use, easier to maintain, and faster to iterate on. Once you reach a certain scale, it will likely make sense to migrate to a more robust database such as Postgres with the pgvector extension for RAG capabilities. For more advanced capabilities such as chunking and document cleaning, a framework may be the right choice. But for startups and smaller players, it’s SQLite to the moon.

Have fun trying out sqlite-vec for yourself!

Simple RAG architecture. Image by author.
