How to Spot and Prevent Model Drift Before it Impacts Your Business
https://towardsdatascience.com/how-to-spot-and-prevent-model-drift-before-it-impacts-your-business/

3 essential methods to track model drift you should know

Despite the AI hype, many tech companies still rely heavily on machine learning to power critical applications, from personalized recommendations to fraud detection. 

I’ve seen firsthand how undetected drifts can result in significant costs — missed fraud detection, lost revenue, and suboptimal business outcomes, just to name a few. So, it’s crucial to have robust monitoring in place if your company has deployed or plans to deploy machine learning models into production.

Undetected model drift can lead to significant financial losses, operational inefficiencies, and even damage to a company’s reputation. To mitigate these risks, it’s important to have effective model monitoring, which involves:

  • Tracking model performance
  • Monitoring feature distributions
  • Detecting both univariate and multivariate drifts

A well-implemented monitoring system can help identify issues early, saving considerable time, money, and resources.

In this comprehensive guide, I’ll provide a framework for how to think about and implement effective model monitoring, helping you stay ahead of potential issues and ensuring the stability and reliability of your models in production.

What’s the difference between feature drift and score drift?

Score drift refers to a gradual change in the distribution of model scores. If left unchecked, this could lead to a decline in model performance, making the model less accurate over time.

On the other hand, feature drift occurs when one or more features experience changes in the distribution. These changes in feature values can affect the underlying relationships that the model has learned, and ultimately lead to inaccurate model predictions.

Simulating score shifts

To model real-world fraud detection challenges, I created a synthetic dataset with five financial transaction features.

The reference dataset represents the original distribution, while the production dataset introduces shifts to simulate an increase in high-value transactions without PIN verification on newer accounts, indicating an increase in fraud.

Each feature has different underlying distributions:

  • Transaction Amount: Log-normal distribution (right-skewed with a long tail)
  • Account Age (months): Clipped normal distribution between 0 and 60 (assuming a 5-year-old company)
  • Time Since Last Transaction: Exponential distribution
  • Transaction Count: Poisson distribution
  • Entered PIN: Binomial distribution.

To approximate model scores, I randomly assigned weights to these features and applied a sigmoid function to constrain predictions between 0 and 1. This mimics how a logistic regression fraud model generates risk scores.
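Below is a minimal sketch of this setup. The distribution parameters, feature weights, and dataset size are illustrative assumptions rather than the exact values behind the original dataset, and the column names are my own choices consistent with the features described above.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Reference dataset with the five feature distributions described above
ref_data = pd.DataFrame({
    'transaction_amount': rng.lognormal(mean=3.0, sigma=1.0, size=n),
    'account_age_in_months': np.clip(rng.normal(30, 12, size=n), 0, 60),
    'time_since_last_transaction': rng.exponential(scale=5.0, size=n),
    'transaction_count': rng.poisson(lam=3.0, size=n),
    'entered_pin': rng.binomial(n=1, p=0.7, size=n),
})

# Random feature weights plus a sigmoid to approximate model scores in (0, 1)
weights = rng.normal(size=ref_data.shape[1])
standardized = (ref_data - ref_data.mean()) / ref_data.std()
ref_data['model_score'] = 1 / (1 + np.exp(-(standardized @ weights)))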

As shown in the plot below:

  • Drifted features: Transaction Amount, Account Age, Transaction Count, and Entered PIN all experienced shifts in distribution, scale, or relationships.
Distribution of drifted features (image by author)
  • Stable feature: Time Since Last Transaction remained unchanged.
Distribution of stable feature (image by author)
  • Drifted scores: As a result of the drifted features, the distribution in model scores has also changed.
Distribution of model scores (image by author)

This setup allows us to analyze how feature drift impacts model scores in production.

Detecting model score drift using PSI

To monitor model scores, I used the Population Stability Index (PSI) to measure how much the model score distribution has shifted over time.

PSI works by binning continuous model scores and comparing the proportion of scores in each bin between the reference and production datasets. It compares the differences in proportions and their logarithmic ratios to compute a single summary statistic to quantify the drift.

Python implementation:

import numpy as np

# Define function to calculate PSI given two datasets
def calculate_psi(reference, production, bins=10):
  # Discretize scores into bins
  min_val, max_val = 0, 1
  bin_edges = np.linspace(min_val, max_val, bins + 1)

  # Calculate proportions in each bin
  ref_counts, _ = np.histogram(reference, bins=bin_edges)
  prod_counts, _ = np.histogram(production, bins=bin_edges)

  ref_proportions = ref_counts / len(reference)
  prod_proportions = prod_counts / len(production)
  
  # Avoid division by zero
  ref_proportions = np.clip(ref_proportions, 1e-8, 1)
  prod_proportions = np.clip(prod_proportions, 1e-8, 1)

  # Calculate PSI for each bin
  psi = np.sum((ref_proportions - prod_proportions) * np.log(ref_proportions / prod_proportions))

  return psi
  
# Calculate PSI
psi_value = calculate_psi(ref_data['model_score'], prod_data['model_score'], bins=10)
print(f"PSI Value: {psi_value}")

Below is a summary of how to interpret PSI values:

  • PSI < 0.1: No drift, or very minor drift (distributions are almost identical).
  • 0.1 ≤ PSI < 0.25: Some drift. The distributions are somewhat different.
  • 0.25 ≤ PSI < 0.5: Moderate drift. A noticeable shift between the reference and production distributions.
  • PSI ≥ 0.5: Significant drift. There is a large shift, indicating that the distribution in production has changed substantially from the reference data.
Histogram of model score distributions (image by author)

The PSI value of 0.6374 suggests a significant drift between our reference and production datasets. This aligns with the histogram of model score distributions, which visually confirms the shift towards higher scores in production — indicating an increase in risky transactions.

Detecting feature drift

Kolmogorov-Smirnov test for numeric features

The Kolmogorov-Smirnov (K-S) test is my preferred method for detecting drift in numeric features, because it is non-parametric, meaning it doesn’t assume a normal distribution.

The test compares a feature’s distribution in the reference and production datasets by measuring the maximum difference between the empirical cumulative distribution functions (ECDFs). The resulting K-S statistic ranges from 0 to 1:

  • 0 indicates no difference between the two distributions.
  • Values closer to 1 suggest a greater shift.

Python implementation:

import pandas as pd
from scipy.stats import ks_2samp

# Create an empty dataframe to store the results
ks_results = pd.DataFrame(columns=['Feature', 'KS Statistic', 'p-value', 'Drift Detected'])

# Loop through all numeric features and perform the K-S test
for col in numeric_cols:
    ks_stat, p_value = ks_2samp(ref_data[col], prod_data[col])
    drift_detected = p_value < 0.05

    # Store results in the dataframe
    ks_results = pd.concat([
        ks_results,
        pd.DataFrame({
            'Feature': [col],
            'KS Statistic': [ks_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

Below are ECDF charts of the four numeric features in our dataset:

ECDFs of four numeric features (image by author)

Let’s look at the account age feature as an example: the x-axis represents account age (0-50 months), while the y-axis shows the ECDF for both reference and production datasets. The production dataset skews towards newer accounts, as it has a larger proportion of observations with lower account ages.
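If you want to reproduce this kind of chart, an ECDF can be computed directly from sorted values. Here is a minimal sketch, assuming the ref_data and prod_data dataframes from earlier and the account_age_in_months column referenced later in this article:

import numpy as np
import matplotlib.pyplot as plt

def ecdf(values):
    # Sort the values and compute the cumulative proportion at each point
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

x_ref, y_ref = ecdf(ref_data['account_age_in_months'])
x_prod, y_prod = ecdf(prod_data['account_age_in_months'])

plt.step(x_ref, y_ref, label='Reference')
plt.step(x_prod, y_prod, label='Production')
plt.xlabel('Account age (months)')
plt.ylabel('ECDF')
plt.legend()
plt.show()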

Chi-Square test for categorical features

To detect shifts in categorical and boolean features, I like to use the Chi-Square test.

This test compares the frequency distribution of a categorical feature in the reference and production datasets, and returns two values:

  • Chi-Square statistic: A higher value indicates a greater shift between the reference and production datasets.
  • P-value: A p-value below 0.05 suggests that the difference between the reference and production datasets is statistically significant, indicating potential feature drift.

Python implementation:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Create empty dataframe with corresponding column names
chi2_results = pd.DataFrame(columns=['Feature', 'Chi-Square Statistic', 'p-value', 'Drift Detected'])

for col in categorical_cols:
    # Get normalized value counts for both reference and production datasets
    ref_counts = ref_data[col].value_counts(normalize=True)
    prod_counts = prod_data[col].value_counts(normalize=True)

    # Ensure all categories are represented in both
    all_categories = set(ref_counts.index).union(set(prod_counts.index))
    ref_counts = ref_counts.reindex(all_categories, fill_value=0)
    prod_counts = prod_counts.reindex(all_categories, fill_value=0)

    # Create contingency table
    contingency_table = np.array([ref_counts * len(ref_data), prod_counts * len(prod_data)])

    # Perform Chi-Square test
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    drift_detected = p_value < 0.05

    # Store results in chi2_results dataframe
    chi2_results = pd.concat([
        chi2_results,
        pd.DataFrame({
            'Feature': [col],
            'Chi-Square Statistic': [chi2_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

The Chi-Square statistic of 57.31 with a p-value of 3.72e-14 confirms a large shift in our categorical feature, Entered PIN. This finding aligns with the histogram below, which visually illustrates the shift:

Distribution of categorical feature (image by author)

Detecting multivariate shifts

Spearman Correlation for shifts in pairwise interactions

In addition to monitoring individual feature shifts, it’s important to track shifts in relationships or interactions between features, known as multivariate shifts. Even if the distributions of individual features remain stable, multivariate shifts can signal meaningful differences in the data.

By default, Pandas’ .corr() function calculates Pearson correlation, which only captures linear relationships between variables. However, relationships between features are often non-linear yet still follow a consistent trend.

To capture this, we use Spearman correlation to measure monotonic relationships between features. It captures whether features change together in a consistent direction, even if their relationship isn’t strictly linear.

To assess shifts in feature relationships, we compare:

  • Reference correlation (ref_corr): Captures historical feature relationships in the reference dataset.
  • Production correlation (prod_corr): Captures new feature relationships in production.
  • Absolute difference in correlation: Measures how much feature relationships have shifted between the reference and production datasets. Higher values indicate more significant shifts.

Python implementation:

# Calculate correlation matrices
ref_corr = ref_data.corr(method='spearman')
prod_corr = prod_data.corr(method='spearman')

# Calculate correlation difference
corr_diff = abs(ref_corr - prod_corr)
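To make the difference matrix easier to act on, one option is to rank feature pairs by how much their correlation has shifted. This is a sketch of my own, and the 0.1 threshold is an arbitrary illustration rather than an established cutoff:

import numpy as np

# Keep only the upper triangle so each feature pair appears once
mask = np.triu(np.ones(corr_diff.shape, dtype=bool), k=1)
pairwise_shifts = corr_diff.where(mask).stack().sort_values(ascending=False)

# Largest pairwise shifts, and pairs exceeding an arbitrary 0.1 threshold
print(pairwise_shifts.head(10))
print(pairwise_shifts[pairwise_shifts > 0.1])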

Example: Change in correlation

Now, let’s look at the correlation between transaction_amount and account_age_in_months:

  • In ref_corr, the correlation is 0.00095, indicating a weak relationship between the two features.
  • In prod_corr, the correlation is -0.0325, indicating a weak negative correlation.
  • Absolute difference in the Spearman correlation is 0.0335, which is a small but noticeable shift.

The absolute difference in correlation indicates a shift in the relationship between transaction_amount and account_age_in_months.

There used to be no relationship between these two features, but the production dataset indicates that there is now a weak negative correlation, meaning that newer accounts have higher transaction amounts. This is spot on!

Autoencoder for complex, high-dimensional multivariate shifts

In addition to monitoring pairwise interactions, we can also look for shifts across more dimensions in the data.

Autoencoders are powerful tools for detecting high-dimensional multivariate shifts, where multiple features collectively change in ways that may not be apparent from looking at individual feature distributions or pairwise correlations.

An autoencoder is a neural network that learns a compressed representation of data through two components:

  • Encoder: Compresses input data into a lower-dimensional representation.
  • Decoder: Reconstructs the original input from the compressed representation.

To detect shifts, we compare the reconstructed output to the original input and compute the reconstruction loss.

  • Low reconstruction loss → The autoencoder successfully reconstructs the data, meaning the new observations are similar to what it has seen and learned.
  • High reconstruction loss → The production data deviates significantly from the learned patterns, indicating potential drift.

Unlike traditional drift metrics that focus on individual features or pairwise relationships, autoencoders capture complex, non-linear dependencies across multiple variables simultaneously.

Python implementation:

import numpy as np
from sklearn.preprocessing import StandardScaler
# Assuming TensorFlow's Keras API for the autoencoder layers and model
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Select the feature columns used for drift detection
ref_features = ref_data[numeric_cols + categorical_cols]
prod_features = prod_data[numeric_cols + categorical_cols]

# Normalize the data
scaler = StandardScaler()
ref_scaled = scaler.fit_transform(ref_features)
prod_scaled = scaler.transform(prod_features)

# Split reference data into train and validation
np.random.shuffle(ref_scaled)
train_size = int(0.8 * len(ref_scaled))
train_data = ref_scaled[:train_size]
val_data = ref_scaled[train_size:]

# Build autoencoder
input_dim = ref_features.shape[1]
encoding_dim = 3 
# Input layer
input_layer = Input(shape=(input_dim, ))
# Encoder
encoded = Dense(8, activation="relu")(input_layer)
encoded = Dense(encoding_dim, activation="relu")(encoded)
# Decoder
decoded = Dense(8, activation="relu")(encoded)
decoded = Dense(input_dim, activation="linear")(decoded)
# Autoencoder
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# Train autoencoder
history = autoencoder.fit(
    train_data, train_data,
    epochs=50,
    batch_size=64,
    shuffle=True,
    validation_data=(val_data, val_data),
    verbose=0
)

# Calculate reconstruction error
ref_pred = autoencoder.predict(ref_scaled, verbose=0)
prod_pred = autoencoder.predict(prod_scaled, verbose=0)

ref_mse = np.mean(np.power(ref_scaled - ref_pred, 2), axis=1)
prod_mse = np.mean(np.power(prod_scaled - prod_pred, 2), axis=1)
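One simple way to turn these reconstruction errors into a drift signal is to compare production errors against a threshold derived from the reference data. The 95th-percentile threshold below is an illustrative assumption, not part of the original implementation:

# Flag drift when production reconstruction errors are unusually high
# relative to the reference data (the 95th percentile is an arbitrary choice)
threshold = np.percentile(ref_mse, 95)
drift_share = np.mean(prod_mse > threshold)

print(f"Reference mean MSE: {ref_mse.mean():.4f}")
print(f"Production mean MSE: {prod_mse.mean():.4f}")
print(f"Share of production rows above threshold: {drift_share:.1%}")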

The charts below show the distribution of reconstruction loss between both datasets.

Distribution of reconstruction loss between actuals and predictions (image by author)

The production dataset has a higher mean reconstruction error than the reference dataset, indicating a shift in the overall data. This aligns with the changes introduced in the production dataset, which contains a larger number of newer accounts with high-value transactions.

Summarizing

Model monitoring is an essential, yet often overlooked, responsibility for data scientists and machine learning engineers.

All the statistical methods led to the same conclusion, which aligns with the observed shifts in the data: they detected a trend in production towards newer accounts making higher-value transactions. This shift resulted in higher model scores, signaling an increase in potential fraud.

In this post, I covered techniques for detecting drift on three different levels:

  • Model score drift: Using Population Stability Index (PSI)
  • Individual feature drift: Using Kolmogorov-Smirnov test for numeric features and Chi-Square test for categorical features
  • Multivariate drift: Using Spearman correlation for pairwise interactions and autoencoders for high-dimensional, multivariate shifts.

These are just a few of the techniques I rely on for comprehensive monitoring — there are plenty of other equally valid statistical methods that can also detect drift effectively.

Detected shifts often point to underlying issues that warrant further investigation. The root cause could be as serious as a data collection bug, or as minor as a time change like daylight saving time adjustments.

There are also fantastic Python packages, like evidently.ai, that automate many of these comparisons. However, I believe there’s significant value in deeply understanding the statistical techniques behind drift detection, rather than relying solely on these tools.

What’s the model monitoring process like at places you’ve worked?


Want to build your AI skills?

👉🏻 I run the AI Weekender and write weekly blog posts on data science, AI weekend projects, and career advice for data professionals.



Overcome Failing Document Ingestion & RAG Strategies with Agentic Knowledge Distillation
https://towardsdatascience.com/overcome-failing-document-ingestion-rag-strategies-with-agentic-knowledge-distillation/

Introducing the pyramid search approach

Introduction

Many generative AI use cases still revolve around Retrieval Augmented Generation (RAG), yet consistently fall short of user expectations. Despite the growing body of research on RAG improvements and even adding Agents into the process, many solutions still fail to return exhaustive results, miss information that is critical but infrequently mentioned in the documents, require multiple search iterations, and generally struggle to reconcile key themes across multiple documents. To top it all off, many implementations still rely on cramming as much “relevant” information as possible into the model’s context window alongside detailed system and user prompts. Reconciling all this information often exceeds the model’s cognitive capacity and compromises response quality and consistency.

This is where our Agentic Knowledge Distillation + Pyramid Search Approach comes into play. Instead of chasing the best chunking strategy, retrieval algorithm, or inference-time reasoning method, my team (Jim Brown, Mason Sawtell, Sandi Besen, and I) takes an agentic approach to document ingestion.

We leverage the full capability of the model at ingestion time to focus exclusively on distilling and preserving the most meaningful information from the document dataset. This fundamentally simplifies the RAG process by allowing the model to direct its reasoning abilities toward addressing the user/system instructions rather than struggling to understand formatting and disparate information across document chunks. 

We specifically target high-value questions that are often difficult to evaluate because they have multiple correct answers or solution paths. These cases are where traditional RAG solutions struggle most and existing RAG evaluation datasets are largely insufficient for testing this problem space. For our research implementation, we downloaded annual and quarterly reports from the last year for the 30 companies in the Dow Jones Industrial Average. These documents can be found through the SEC EDGAR website. The information on EDGAR can be downloaded for free or queried through EDGAR public searches. See the SEC privacy policy for additional details; information on the SEC website is “considered public information and may be copied or further distributed by users of the web site without the SEC’s permission”. We selected this dataset for two key reasons: first, it falls outside the knowledge cutoff for the models evaluated, ensuring that the models cannot respond to questions based on their knowledge from pre-training; second, it’s a close approximation for real-world business problems while allowing us to discuss and share our findings using publicly available data. 

While typical RAG solutions excel at factual retrieval where the answer is easily identified in the document dataset (e.g., “When did Apple’s annual shareholder’s meeting occur?”), they struggle with nuanced questions that require a deeper understanding of concepts across documents (e.g., “Which of the DOW companies has the most promising AI strategy?”). Our Agentic Knowledge Distillation + Pyramid Search Approach addresses these types of questions with much greater success compared to other standard approaches we tested and overcomes limitations associated with using knowledge graphs in RAG systems. 

In this article, we’ll cover how our knowledge distillation process works, key benefits of this approach, examples, and an open discussion on the best way to evaluate these types of systems where, in many cases, there is no singular “right” answer.

Building the pyramid: How Agentic Knowledge Distillation works

Image by author and team depicting pyramid structure for document ingestion. Robots meant to represent agents building the pyramid.

Overview

Our knowledge distillation process creates a multi-tiered pyramid of information from the raw source documents. Our approach is inspired by the pyramids used in deep learning computer vision-based tasks, which allow a model to analyze an image at multiple scales. We take the contents of the raw document, convert it to markdown, and distill the content into a list of atomic insights, related concepts, document abstracts, and general recollections/memories. During retrieval it’s possible to access any or all levels of the pyramid to respond to the user request. 
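To make the levels concrete, here is a rough sketch of how the pyramid’s contents could be represented in code. The class and field names are my own illustration, not the team’s actual schema:

from dataclasses import dataclass, field

# Hypothetical representation of the pyramid levels; names are illustrative
@dataclass
class Insight:
    document_id: str
    page: int
    text: str                     # one atomic, SVO-style sentence

@dataclass
class Concept:
    document_id: str
    name: str
    insight_ids: list[str] = field(default_factory=list)

@dataclass
class Abstract:
    document_id: str
    text: str                     # one dense, LLM-written abstract per document

@dataclass
class Recollection:
    text: str                     # cross-document knowledge learned over time
    source_document_ids: list[str] = field(default_factory=list)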

How to distill documents and build the pyramid: 

  1. Convert documents to Markdown: Convert all raw source documents to Markdown. We’ve found models process markdown best for this task compared to other formats like JSON and it is more token efficient. We used Azure Document Intelligence to generate the markdown for each page of the document, but there are many other open-source libraries like MarkItDown which do the same thing. Our dataset included 331 documents and 16,601 pages. 
  2. Extract atomic insights from each page: We process documents using a two-page sliding window, which allows each page to be analyzed twice. This gives the agent the opportunity to correct any potential mistakes when processing the page initially. We instruct the model to create a numbered list of insights that grows as it processes the pages in the document. The agent can overwrite insights from the previous page if they were incorrect since it sees each page twice. We instruct the model to extract insights in simple sentences following the subject-verb-object (SVO) format and to write sentences as if English is the second language of the user. This significantly improves performance by encouraging clarity and precision. Rolling over each page multiple times and using the SVO format also solves the disambiguation problem, which is a huge challenge for knowledge graphs. The insight generation step is also particularly helpful for extracting information from tables since the model captures the facts from the table in clear, succinct sentences. Our dataset produced 216,931 total insights, about 13 insights per page and 655 insights per document.
  3. Distilling concepts from insights: From the detailed list of insights, we identify higher-level concepts that connect related information about the document. This step significantly reduces noise and redundant information in the document while preserving essential information and themes. Our dataset produced 14,824 total concepts, about 1 concept per page and 45 concepts per document. 
  4. Creating abstracts from concepts: Given the insights and concepts in the document, the LLM writes an abstract that appears both better than any abstract a human would write and more information-dense than any abstract present in the original document. The LLM generated abstract provides incredibly comprehensive knowledge about the document with a small token density that carries a significant amount of information. We produce one abstract per document, 331 total.
  5. Storing recollections/memories across documents: At the top of the pyramid we store critical information that is useful across all tasks. This can be information that the user shares about the task or information the agent learns about the dataset over time by researching and responding to tasks. For example, we can store the current 30 companies in the DOW as a recollection since this list is different from the 30 companies in the DOW at the time of the model’s knowledge cutoff. As we conduct more and more research tasks, we can continuously improve our recollections and maintain an audit trail of which documents these recollections originated from. For example, we can keep track of AI strategies across companies, where companies are making major investments, etc. These high-level connections are super important since they reveal relationships and information that are not apparent in a single page or document.
Sample subset of insights extracted from IBM 10Q, Q3 2024 (page 4)

We store the text and embeddings for each layer of the pyramid (pages and up) in Azure PostgreSQL. We originally used Azure AI Search, but switched to PostgreSQL for cost reasons. This required us to write our own hybrid search function since PostgreSQL doesn’t yet natively support this feature. This implementation would work with any vector database or vector index of your choosing. The key requirement is to store and efficiently retrieve both text and vector embeddings at any level of the pyramid. 
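For context, here is a rough sketch of what a hybrid search over one pyramid level might look like when blending pgvector cosine similarity with PostgreSQL full-text ranking. The table name, column names, and the equal 0.5/0.5 weighting are my own assumptions for illustration, not the team’s actual implementation:

import psycopg

# Hypothetical hybrid search: blend vector similarity and full-text rank
HYBRID_SQL = """
SELECT id, text,
       0.5 * (1 - (embedding <=> %(qvec)s::vector))
     + 0.5 * ts_rank(to_tsvector('english', text),
                     plainto_tsquery('english', %(qtext)s)) AS score
FROM pyramid_nodes
WHERE level = %(level)s
ORDER BY score DESC
LIMIT 20;
"""

def hybrid_search(conn, query_text, query_embedding, level="concept"):
    # query_embedding is a list of floats; pgvector accepts the "[...]" text format
    qvec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    return conn.execute(
        HYBRID_SQL, {"qvec": qvec, "qtext": query_text, "level": level}
    ).fetchall()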

This approach essentially creates the essence of a knowledge graph, but stores information in natural language, the way an LLM natively wants to interact with it, and is more efficient on token retrieval. We also let the LLM pick the terms used to categorize each level of the pyramid; this seemed to let the model decide for itself the best way to describe and differentiate between the information stored at each level. For example, the LLM preferred “insights” to “facts” as the label for the first level of distilled knowledge. Our goal in doing this was to better understand how an LLM thinks about the process by letting it decide how to store and group related information. 

Using the pyramid: How it works with RAG & Agents

At inference time, both traditional RAG and agentic approaches benefit from the pre-processed, distilled information ingested in our knowledge pyramid. The pyramid structure allows for efficient retrieval in both the traditional RAG case, where only the top X related pieces of information are retrieved or in the Agentic case, where the Agent iteratively plans, retrieves, and evaluates information before returning a final response. 

The benefit of the pyramid approach is that information at any and all levels of the pyramid can be used during inference. For our implementation, we used PydanticAI to create a search agent that takes in the user request, generates search terms, explores ideas related to the request, and keeps track of information relevant to the request. Once the search agent determines there’s sufficient information to address the user request, the results are re-ranked and sent back to the LLM to generate a final reply. Our implementation allows a search agent to traverse the information in the pyramid as it gathers details about a concept/search term. This is similar to walking a knowledge graph, but in a way that’s more natural for the LLM since all the information in the pyramid is stored in natural language.

Depending on the use case, the Agent could access information at all levels of the pyramid or only at specific levels (e.g. only retrieve information from the concepts). For our experiments, we did not retrieve raw page-level data since we wanted to focus on token efficiency and found the LLM-generated information for the insights, concepts, abstracts, and recollections was sufficient for completing our tasks. In theory, the Agent could also have access to the page data; this would provide additional opportunities for the agent to re-examine the original document text; however, it would also significantly increase the total tokens used. 

Here is a high-level visualization of our Agentic approach to responding to user requests:

Image created by author and team providing an overview of the agentic research & response process

Results from the pyramid: Real-world examples

To evaluate the effectiveness of our approach, we tested it against a variety of question categories, including typical fact-finding questions and complex cross-document research and analysis tasks. 

Fact-finding (spear fishing): 

These tasks require identifying specific information or facts that are buried in a document. These are the types of questions typical RAG solutions target but often require many searches and consume lots of tokens to answer correctly. 

Example task: “What was IBM’s total revenue in the latest financial reporting?”

Example response using pyramid approach: “IBM’s total revenue for the third quarter of 2024 was $14.968 billion [ibm-10q-q3-2024.pdf, pg. 4]”

Total tokens used to research and generate response

This result is correct (human-validated) and was generated using only 9,994 total tokens, with 1,240 tokens in the generated final response. 

Complex research and analysis: 

These tasks involve researching and understanding multiple concepts to gain a broader understanding of the documents and make inferences and informed assumptions based on the gathered facts.

Example task: “Analyze the investments Microsoft and NVIDIA are making in AI and how they are positioning themselves in the market. The report should be clearly formatted.”

Example response:

Response generated by the agent analyzing AI investments and positioning for Microsoft and NVIDIA.

The result is a comprehensive report that executed quickly and contains detailed information about each of the companies. 26,802 total tokens were used to research and respond to the request with a significant percentage of them used for the final response (2,893 tokens or ~11%). These results were also reviewed by a human to verify their validity.

Snippet indicating total token usage for the task

Example task: “Create a report on analyzing the risks disclosed by the various financial companies in the DOW. Indicate which risks are shared and unique.”

Example response:

Part 1 of response generated by the agent on disclosed risks.
Part 2 of response generated by the agent on disclosed risks.

Similarly, this task was completed in 42.7 seconds and used 31,685 total tokens, with 3,116 tokens used to generate the final report. 

Snippet indicating total token usage for the task

These results for both fact-finding and complex analysis tasks demonstrate that the pyramid approach efficiently creates detailed reports with low latency using a minimal amount of tokens. The tokens used for the tasks carry dense meaning with little noise allowing for high-quality, thorough responses across tasks.

Benefits of the pyramid: Why use it?

Overall, we found that our pyramid approach provided a significant boost in response quality and overall performance for high-value questions. 

Some of the key benefits we observed include: 

  • Reduced model’s cognitive load: When the agent receives the user task, it retrieves pre-processed, distilled information rather than the raw, inconsistently formatted, disparate document chunks. This fundamentally improves the retrieval process since the model doesn’t waste its cognitive capacity on trying to break down the page/chunk text for the first time. 
  • Superior table processing: By breaking down table information and storing it in concise but descriptive sentences, the pyramid approach makes it easier to retrieve relevant information at inference time through natural language queries. This was particularly important for our dataset since financial reports contain lots of critical information in tables. 
  • Improved response quality to many types of requests: The pyramid enables more comprehensive context-aware responses to both precise, fact-finding questions and broad analysis based tasks that involve many themes across numerous documents. 
  • Preservation of critical context: Since the distillation process identifies and keeps track of key facts, important information that might appear only once in the document is easier to maintain. For example, noting that all tables are represented in millions of dollars or in a particular currency. Traditional chunking methods often cause this type of information to slip through the cracks. 
  • Optimized token usage, memory, and speed: By distilling information at ingestion time, we significantly reduce the number of tokens required during inference, are able to maximize the value of information put in the context window, and improve memory use. 
  • Scalability: Many solutions struggle to perform as the size of the document dataset grows. This approach provides a much more efficient way to manage a large volume of text by only preserving critical information. This also allows for a more efficient use of the LLM’s context window by only sending it useful, clear information.
  • Efficient concept exploration: The pyramid enables the agent to explore related information similar to navigating a knowledge graph, but does not require ever generating or maintaining relationships in the graph. The agent can use natural language exclusively and keep track of important facts related to the concepts it’s exploring in a highly token-efficient and fluid way. 
  • Emergent dataset understanding: An unexpected benefit of this approach emerged during our testing. When asking questions like “what can you tell me about this dataset?” or “what types of questions can I ask?”, the system is able to respond and suggest productive search topics because it has a more robust understanding of the dataset context by accessing higher levels in the pyramid like the abstracts and recollections. 

Beyond the pyramid: Evaluation challenges & future directions

Challenges

While the results we’ve observed when using the pyramid search approach have been nothing short of amazing, finding ways to establish meaningful metrics to evaluate the entire system both at ingestion time and during information retrieval is challenging. Traditional RAG and Agent evaluation frameworks often fail to address nuanced questions and analytical responses where many different responses are valid.

Our team plans to write a research paper on this approach in the future, and we are open to any thoughts and feedback from the community, especially when it comes to evaluation metrics. Many of the existing datasets we found were focused on evaluating RAG use cases within one document or precise information retrieval across multiple documents rather than robust concept and theme analysis across documents and domains. 

The main use cases we are interested in relate to broader questions that are representative of how businesses actually want to interact with GenAI systems. For example, “tell me everything I need to know about customer X” or “how do the behaviors of Customer A and B differ? Which am I more likely to have a successful meeting with?”. These types of questions require a deep understanding of information across many sources. The answers to these questions typically require a person to synthesize data from multiple areas of the business and think critically about it. As a result, the answers to these questions are rarely written or saved anywhere which makes it impossible to simply store and retrieve them through a vector index in a typical RAG process. 

Another consideration is that many real-world use cases involve dynamic datasets where documents are consistently being added, edited, and deleted. This makes it difficult to evaluate and track what a “correct” response is since the answer will evolve as the available information changes. 

Future directions

In the future, we believe that the pyramid approach can address some of these challenges by enabling more effective processing of dense documents and storing learned information as recollections. However, tracking and evaluating the validity of the recollections over time will be critical to the system’s overall success and remains a key focus area for our ongoing work. 

When applying this approach to organizational data, the pyramid process could also be used to identify and assess discrepancies across areas of the business. For example, uploading all of a company’s sales pitch decks could surface where certain products or services are being positioned inconsistently. It could also be used to compare insights extracted from various line of business data to help understand if and where teams have developed conflicting understandings of topics or different priorities. This application goes beyond pure information retrieval use cases and would allow the pyramid to serve as an organizational alignment tool that helps identify divergences in messaging, terminology, and overall communication. 

Conclusion: Key takeaways and why the pyramid approach matters

The knowledge distillation pyramid approach is significant because it leverages the full power of the LLM at both ingestion and retrieval time. Our approach allows you to store dense information in fewer tokens, which has the added benefit of reducing noise in the dataset at inference. Our approach also runs very quickly and is incredibly token efficient: we are able to generate responses within seconds, explore potentially hundreds of searches, and on average use <40K tokens for the entire search, retrieval, and response generation process (this includes all the search iterations!). 

We find that the LLM is much better at writing atomic insights as sentences and that these insights effectively distill information from both text-based and tabular data. This distilled information written in natural language is very easy for the LLM to understand and navigate at inference since it does not have to expend unnecessary energy reasoning about and breaking down document formatting or filtering through noise.

The ability to retrieve and aggregate information at any level of the pyramid also provides significant flexibility to address a variety of query types. This approach offers promising performance for large datasets and enables high-value use cases that require nuanced information retrieval and analysis. 


Note: The opinions expressed in this article are solely my own and do not necessarily reflect the views or policies of my employer.

Interested in discussing further or collaborating? Reach out on LinkedIn!

Avoidable and Unavoidable Randomness in GPT-4o
https://towardsdatascience.com/avoidable-and-unavoidable-randomness-in-gpt-4o/

Exploring the sources of randomness in GPT-4o from the known and controllable to the opaque and uncontrollable.

Of course there is randomness in GPT-4o’s outputs. After all, the model samples from a probability distribution when choosing each token. But what I didn’t understand was that those very probabilities themselves are not deterministic. Even with consistent prompts, fixed seeds, and temperature set to zero, GPT-4o still introduces subtle, frustrating randomness.

There’s no fix for this, and it might not even be something OpenAI could fix if they wanted to, just so we’re clear up front about where this article is headed. Along the way, we’ll examine all the sources of randomness in GPT-4o output, which will require us to break down the sampling process to a low level. We’ll point at the issue—the probabilities vary—and critically examine OpenAI’s official guidance on determinism.

First, though, let’s talk about why determinism matters. Determinism means that the same input always produces the same output, like a mathematical function. While LLM creativity is often desirable, determinism serves crucial purposes: researchers need it for reproducible experiments, developers for verifying reported results, and prompt engineers for debugging their changes. Without it, you’re left wondering if different outputs stem from your tweaks or just the random number generator’s mood swings.

Flipping a coin

We’re going to keep things extremely simple here and prompt the most recent version of GPT-4o (gpt-4o-2024-08-06 in the API) with this:

 Flip a coin. Return Heads or Tails only.

Flipping a coin with LLMs is a fascinating topic in itself (see for example Van Koevering & Kleinberg, 2024 in the references), but here, we’ll use it as a simple binary question with which to explore determinism, or the lack thereof.

This is our first attempt.

import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

response = client.chat.completions.create(
    model='gpt-4o-2024-08-06',
    messages=[{'role': 'user', 'content': prompt}],
)

print(response.choices[0].message.content)

Running the code gave me Heads. Maybe you’ll get Tails, or if you’re really lucky, something far more interesting.

The code first initializes an OpenAI client with an API key set in the environment variable OPENAI_API_KEY (to avoid sharing billing credentials here). The main action happens with client.chat.completions.create, where we specify the model to use and send the prompt (as a part of a very simple conversation named messages) to the server. We get an object called response back from the server. This object contains a lot of information, as shown below, so we need to dig into it to extract GPT-4o’s actual response to the message, which is response.choices[0].message.content.

>>> response
ChatCompletion(id='chatcmpl-B48EqZBLfUWtp9H7cwnchGTJbBDwr', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Heads', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1740324680, model='gpt-4o-2024-08-06', object='chat.completion', service_tier='default', system_fingerprint='fp_eb9dce56a8', usage=CompletionUsage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))

Now let’s flip the coin ten times. If this were a real, fair coin, of course, we would expect roughly equal heads and tails over time thanks to the law of large numbers. But GPT-4o’s coin doesn’t work quite like that.

import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

for _ in range(10):
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        messages=[{'role': 'user', 'content': prompt}],
    )
    print(response.choices[0].message.content)

Running this code gave me the following output, although you might get different output, of course.

Heads
Heads
Heads
Heads
Heads
Heads
Tails
Heads
Heads
Heads

GPT-4o’s coin is clearly biased, but so are humans. Bar-Hillel, Peer, and Acquisti (2014) found that people flipping imaginary coins choose “heads” 80% of the time. Maybe GPT-4o learned that from us. But whatever the reason, we’re just using this simple example to explore determinism.

Just how biased is GPT-4o’s coin?

Let’s say we wanted to know precisely what percentage of GPT-4o coin flips land Heads.

Rather than the obvious (but expensive) approach of flipping it a million times, there’s a smarter way. For classification tasks with a small set of possible answers, we can extract token probabilities instead of generating full responses. With the right prompt, the first token carries all the necessary information, making these API calls incredibly cheap: around 30,000 calls per dollar, since each requires just 18 (cached) input tokens and 1 output token.

OpenAI gives us (natural) log probabilities. These are called logprobs in the code, and we convert them to regular probabilities by exponentiation. (We’ll discuss temperature soon, but note that exponentiating logprobs directly like this corresponds to a temperature setting of 1.0, and is how we calculate probabilities throughout this article). OpenAI lets us request logprobs for the top 20 most likely tokens, so we do that.

import os
import math
from openai import OpenAI
from tabulate import tabulate

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

response = client.chat.completions.create(
    model='gpt-4o-2024-08-06',
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,
    messages=[{'role': 'user', 'content': prompt}],
)

logprobs_list = response.choices[0].logprobs.content[0].top_logprobs

data = []
total_pct = 0.0

for logprob_entry in logprobs_list:
    token = logprob_entry.token
    logprob = logprob_entry.logprob
    pct = math.exp(logprob) * 100  # Convert logprob to a percentage
    total_pct += pct
    data.append([token, logprob, pct])

print(
    tabulate(
        data,
        headers=["Token", "Log Probability", "Percentage (%)"],
        tablefmt="github",
        floatfmt=("s", ".10f", ".10f")
    )
)
print(f"\nTotal probabilities: {total_pct:.6f}%")

If you run this, you’ll get something like the following output, but actual numbers will vary.

| Token     |   Log Probability |   Percentage (%) |
|-----------|-------------------|------------------|
| Heads     |     -0.0380541235 |    96.2660836887 |
| T         |     -3.2880542278 |     3.7326407467 |
| Sure      |    -12.5380544662 |     0.0003587502 |
| Head      |    -12.7880544662 |     0.0002793949 |
| Tail      |    -13.2880544662 |     0.0001694616 |
| Certainly |    -13.5380544662 |     0.0001319768 |
| "T        |    -14.2880544662 |     0.0000623414 |
| I'm       |    -14.5380544662 |     0.0000485516 |
| heads     |    -14.5380544662 |     0.0000485516 |
| Heads     |    -14.9130544662 |     0.0000333690 |
| "         |    -15.1630544662 |     0.0000259878 |
| _heads    |    -15.1630544662 |     0.0000259878 |
| tails     |    -15.5380544662 |     0.0000178611 |
| HEAD      |    -15.7880544662 |     0.0000139103 |
| TAIL      |    -16.2880535126 |     0.0000084370 |
| T         |    -16.7880535126 |     0.0000051173 |
| ```       |    -16.7880535126 |     0.0000051173 |
| Here's    |    -16.9130535126 |     0.0000045160 |
| I         |    -17.2880535126 |     0.0000031038 |
| As        |    -17.2880535126 |     0.0000031038 |

Total probabilities: 99.999970%

Looking at these probabilities, we see Heads at ≈96% and T at ≈4%. Our prompt is doing pretty well at constraining the model’s responses. Why T and not Tails? This is the tokenizer splitting Tails into T + ails, while keeping Heads as one piece, as we can see in this Python session:

>>> import tiktoken
>>> encoding = tiktoken.encoding_for_model("gpt-4o-2024-08-06")
>>> encoding.encode('Tails')
[51, 2196]
>>> encoding.decode([51])
'T'
>>> encoding.encode('Heads')
[181043]

These probabilities are not deterministic

Run the code to display the probabilities for the top 20 tokens again, and you’ll likely get different numbers. Here’s what I got on a second running.

| Token     |   Log Probability |   Percentage (%) |
|-----------|-------------------|------------------|
| Heads     |     -0.0110520627 |    98.9008786933 |
| T         |     -4.5110521317 |     1.0986894433 |
| Certainly |    -14.0110521317 |     0.0000822389 |
| Head      |    -14.2610521317 |     0.0000640477 |
| Sure      |    -14.2610521317 |     0.0000640477 |
| Tail      |    -14.3860521317 |     0.0000565219 |
| heads     |    -15.3860521317 |     0.0000207933 |
| Heads     |    -15.5110521317 |     0.0000183500 |
| ```       |    -15.5110521317 |     0.0000183500 |
| _heads    |    -15.6360521317 |     0.0000161938 |
| tails     |    -15.6360521317 |     0.0000161938 |
| I'm       |    -15.8860521317 |     0.0000126117 |
| "T        |    -15.8860521317 |     0.0000126117 |
| As        |    -16.3860511780 |     0.0000076494 |
| "         |    -16.5110511780 |     0.0000067506 |
| HEAD      |    -16.6360511780 |     0.0000059574 |
| TAIL      |    -16.7610511780 |     0.0000052574 |
| Here's    |    -16.7610511780 |     0.0000052574 |
| ``        |    -17.1360511780 |     0.0000036133 |
| T         |    -17.6360511780 |     0.0000021916 |

Total probabilities: 99.999987%

In their cookbook, OpenAI offers the following advice on receiving “mostly identical” outputs:

If the seed, request parameters, and system_fingerprint all match across your requests, then model outputs will mostly be identical. There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of our models.

They also give “mostly identical” advice in the reproducible outputs section of their documentation.

The request parameters that could affect randomness are temperature and seed. OpenAI also suggests we track system_fingerprint, because differences here might cause differences in output. We’ll examine each of these below, but spoiler: none of them will fix or even explain this non-determinism.

Temperature, and why it won’t fix this

Temperature controls how random the model’s responses are. Low temperatures (<0.5) make it robotic and predictable, medium temperatures (0.7–1.3) allow some creativity, and high temperatures (>1.5) produce gibberish. Temperature is often called the “creativity parameter”, but this is an oversimplification. In their analysis, Peeperkorn, Kouwenhoven, Brown, and Jordanous (2024) evaluated LLM outputs across four dimensions of creativity: novelty (originality), coherence (logical consistency), cohesion (how well the text flows), and typicality (how well it fits expected patterns). They observed that:

temperature is weakly correlated with novelty, and unsurprisingly, moderately correlated with incoherence, but there is no relationship with either cohesion or typicality.

But, this is beside the point for coin flipping. Under the hood, the log probabilities are divided by the temperature before they’re renormalized and exponentiated to be converted to probabilities. This creates a non-linear effect: temperature=0.5 squares the probabilities, making likely tokens dominate, while temperature=2.0 applies a square root, flattening the distribution.
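To make the effect concrete, here is a small sketch that rescales a toy pair of log probabilities at different temperatures (the input numbers are roughly the Heads/T logprobs from the earlier output, rounded for illustration):

import numpy as np

def probs_at_temperature(logprobs, temperature):
    # Divide log probabilities by the temperature, then renormalize (softmax)
    scaled = np.array(logprobs) / temperature
    exp = np.exp(scaled - scaled.max())   # subtract the max for numerical stability
    return exp / exp.sum()

logprobs = [-0.04, -3.29]   # roughly "Heads" vs "T"
for t in (0.5, 1.0, 2.0):
    print(t, probs_at_temperature(logprobs, t))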

What about temperature=0.0? Instead of breaking the math by dividing by zero, the model simply picks the highest-probability token. Sounds deterministic, right? Not quite. Here’s the catch: temperature only comes into play after the log probabilities are computed, when we convert them to probabilities.

In summary: if the logprobs aren’t deterministic, setting temperature to 0.0 won’t make the model deterministic.

In fact, since we’re just asking the model for the raw logprobs directly rather than generating full responses, the temperature setting doesn’t come into play in our code at all.

Seeds, and why they won’t fix this

After temperature is used to compute probabilities, the model samples from these probabilities to pick the next token. OpenAI gives us a little control over the sampling process by letting us set the seed parameter for the random number generator. In an ideal world, setting a seed would give us determinism at any temperature. But seeds only affect sampling, not the log probabilities before sampling.

In summary: if the logprobs aren’t deterministic, setting a seed won’t make the model deterministic.

In fact, seed only matters with non-zero temperatures. With temperature=0.0, the model is always choosing the highest probability token regardless of the seed. Again, since we’re just asking the model for the raw logprobs directly rather than sampling, neither of these settings can help us achieve determinism.

System fingerprints, our last hope

The system_fingerprint identifies the current combination of model weights, infrastructure, and configuration options in OpenAI’s backend. At least, that’s what OpenAI tells us. Variations in system fingerprints might indeed explain variations in logprobs. Except that they don’t, as we will verify below.

Nothing can get you determinism

Let’s confirm what we’ve been building toward. We’ll run the same request 10 times with every safeguard in place. Even though neither of these parameters should matter for what we’re doing, you can never be too safe, so we’ll set temperature=0.0 and seed=42. And to see if infrastructure differences explain our varying logprobs, we’ll print system_fingerprint. Here’s the code:

import os
import math
from openai import OpenAI
from tabulate import tabulate
from tqdm import tqdm

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

data = []

for _ in tqdm(range(10), desc='Generating responses'):
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        temperature=0.0,
        seed=42,
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
        messages=[{'role': 'user', 'content': prompt}],
    )

    fingerprint = response.system_fingerprint
    logprobs_list = response.choices[0].logprobs.content[0].top_logprobs
    heads_logprob = next(
        entry.logprob for entry in logprobs_list if entry.token == 'Heads'
    )
    pct = math.exp(heads_logprob) * 100
    data.append([fingerprint, heads_logprob, f"{pct:.10f}%"])

headers = ["Fingerprint", "Logprob", "Probability"]
print(tabulate(data, headers=headers, tablefmt="pipe"))

Running this 10 times, here are the logprobs and probabilities for the token Heads:

| Fingerprint   |    Logprob | Probability    |
|---------------|------------|----------------|
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.160339  | 85.1854886858% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0110521 | 98.9008786933% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |

Even with temperature=0.0, seed=42, and a single system fingerprint across all ten runs, the logprobs still vary: eight runs return -0.0380541, while two return different values entirely.

Mixture-of-experts makes determinism impossible

OpenAI is decidedly not open about the architecture behind GPT-4o. However, it’s widely believed that GPT-4o uses a mixture-of-experts (MoE) architecture with either 8 or 16 experts.

According to a paper by Google DeepMind researchers Puigcerver, Riquelme, Mustafa, and Houlsby (hat tip to user elmstedt on the OpenAI forum), mixture-of-experts architectures may add an unavoidable level of non-determinism:

Under capacity constraints, all Sparse MoE approaches route tokens in groups of a fixed size and enforce (or encourage) balance within the group. When groups contain tokens from different sequences or inputs, these tokens compete for available spots in expert buffers. Therefore, the model is no longer deterministic at the sequence-level, but only at the batch-level.

In other words, when your prompt (a sequence of tokens, in the quote above) reaches OpenAI’s servers, it gets batched with a group of other prompts (OpenAI isn’t open about how many other prompts). Each prompt in the batch is then routed to an “expert” within the model. However, since only so many prompts can be routed to the same expert, the expert your prompt gets routed to will depend on all the other prompts in the batch.

This “competition” for experts introduces a real-world randomness completely beyond our control.
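To build intuition for how batching can break sequence-level determinism, below is a toy sketch of greedy top-1 routing with a per-expert capacity limit. This is in no way OpenAI's actual routing code: the router scores, expert count, and capacity are made-up values, and the only point is that the expert a fixed token lands on changes with its batchmates.

import numpy as np

rng = np.random.default_rng(0)
num_experts, capacity = 4, 2

def route(router_scores):
    """Greedy top-1 routing with a per-expert capacity; returns one expert per token."""
    load = [0] * num_experts
    assignment = []
    for scores in router_scores:
        for expert in np.argsort(scores)[::-1]:      # try the best-scoring expert first
            if load[expert] < capacity:
                load[expert] += 1
                assignment.append(int(expert))
                break
    return assignment

my_token = rng.normal(size=num_experts)              # a fixed "prompt" token
for trial in range(3):
    batch_mates = rng.normal(size=(5, num_experts))  # different co-batched tokens each time
    batch = np.vstack([batch_mates, my_token])
    print(f"trial {trial}: my token routed to expert {route(batch)[-1]}")

Depending on which experts the batchmates fill up, the same token can be routed differently from one call to the next.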

Non-determinism beyond mixture-of-experts

While non-determinism may be inherent to real-world mixture-of-experts models, that does not seem to be the only source of non-determinism in OpenAI’s models.

Making a few changes to our code above (switching to gpt-3.5-turbo-0125, looking for the token He since GPT-3.5’s tokenizer splits “Heads” differently, and ignoring system_fingerprint because this model doesn’t have it) reveals that GPT-3.5-turbo also exhibits non-deterministic logprobs:

|     Logprob | Probability    |
|-------------|----------------|
| -0.00278289 | 99.7220983436% |
| -0.00415331 | 99.5855302068% |
| -0.00258838 | 99.7414961980% |
| -0.00204034 | 99.7961735289% |
| -0.00240277 | 99.7600117933% |
| -0.00204034 | 99.7961735289% |
| -0.00204034 | 99.7961735289% |
| -0.00258838 | 99.7414961980% |
| -0.00351419 | 99.6491976144% |
| -0.00201214 | 99.7989878007% |

No one is claiming that GPT-3.5-turbo uses a mixture-of-experts architecture. Thus, there must be additional factors beyond mixture-of-experts contributing to this non-determinism.

What 10,000 GPT-4o coin flip probabilities tell us

To better understand the patterns and magnitude of this non-determinism, I conducted a more extensive experiment with GPT-4o, performing 10,000 “coin flips” while recording the probability assigned to “Heads” in each case.

The results reveal something fascinating. Across 10,000 API calls with identical parameters, GPT-4o produced not just a few different probability values, but 42 distinct probabilities. If the mixture-of-experts hypothesis were the complete explanation for non-determinism in GPT-4o, we might expect to see one distinct probability for each expert. But GPT-4o is believed to have either 8 or 16 experts, not 42.

In the output below, I clustered these probabilities, ensuring that each cluster was separated from the others by 0.01 (as a raw percentage). This groups the output into 12 clusters.

Probability          Count           Fingerprints
------------------------------------------------------------------
85.1854379113%       5               fp_eb9dce56a8, fp_f9f4fb6dbf
85.1854455275%       74              fp_eb9dce56a8, fp_f9f4fb6dbf
85.1854886858%       180             fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
88.0662448207%       31              fp_eb9dce56a8, fp_f9f4fb6dbf
88.0678628883%       2               fp_f9f4fb6dbf
------------------------------------------------------------------
92.3997629747%       1               fp_eb9dce56a8
92.3997733012%       4               fp_eb9dce56a8
92.3997836277%       3               fp_eb9dce56a8
------------------------------------------------------------------
92.4128943690%       1               fp_f9f4fb6dbf
92.4129143363%       21              fp_eb9dce56a8, fp_f9f4fb6dbf
92.4129246643%       8               fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
93.9906837191%       4               fp_eb9dce56a8
------------------------------------------------------------------
95.2569999350%       36              fp_eb9dce56a8
------------------------------------------------------------------
96.2660836887%       3391            fp_eb9dce56a8, fp_f9f4fb6dbf
96.2661285161%       2636            fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
97.0674551052%       1               fp_eb9dce56a8
97.0674778863%       3               fp_eb9dce56a8
97.0675003058%       4               fp_eb9dce56a8
97.0675116963%       1               fp_eb9dce56a8
97.0680739932%       19              fp_eb9dce56a8, fp_f9f4fb6dbf
97.0681293191%       6               fp_eb9dce56a8, fp_f9f4fb6dbf
97.0681521003%       74              fp_eb9dce56a8, fp_f9f4fb6dbf
97.0682421405%       4               fp_eb9dce56a8
------------------------------------------------------------------
97.7008960695%       1               fp_f9f4fb6dbf
97.7011122645%       3               fp_eb9dce56a8
97.7011462953%       3               fp_eb9dce56a8
97.7018178132%       1               fp_eb9dce56a8
------------------------------------------------------------------
98.2006069902%       426             fp_eb9dce56a8, fp_f9f4fb6dbf
98.2006876548%       6               fp_f9f4fb6dbf
98.2007107019%       1               fp_eb9dce56a8
98.2009525133%       5               fp_eb9dce56a8
98.2009751945%       1               fp_eb9dce56a8
98.2009867181%       1               fp_eb9dce56a8
------------------------------------------------------------------
98.5930987656%       3               fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931104270%       235             fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931222721%       4               fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931340253%       9               fp_eb9dce56a8
98.5931571644%       159             fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931805790%       384             fp_eb9dce56a8
------------------------------------------------------------------
98.9008436920%       95              fp_eb9dce56a8, fp_f9f4fb6dbf
98.9008550214%       362             fp_eb9dce56a8, fp_f9f4fb6dbf
98.9008786933%       1792            fp_eb9dce56a8, fp_f9f4fb6dbf

(With a threshold of 0.001 there are 13 clusters, and with a threshold of 0.0001 there are 17 clusters.)
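The clustering itself is simple. The sketch below shows the approach, with a handful of stand-in values in place of the 10,000 recorded probabilities: sort the distinct values and start a new cluster whenever the gap to the previous value exceeds the threshold (0.01 percentage points here).

from collections import Counter

def cluster(probabilities, threshold=0.01):
    counts = Counter(probabilities)
    clusters, current = [], []
    for p in sorted(counts):
        # start a new cluster when the gap to the previous value exceeds the threshold
        if current and p - current[-1][0] > threshold:
            clusters.append(current)
            current = []
        current.append((p, counts[p]))
    if current:
        clusters.append(current)
    return clusters

# stand-in values; the real run used all 10,000 recorded probabilities
probabilities = [96.2660836887, 96.2661285161, 98.9008786933, 85.1854886858]
for c in cluster(probabilities):
    print(c)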

As the output above demonstrates, this multitude of results cannot be explained by system_fingerprint values. Across all 10,000 calls, I received only two different system fingerprints: 4488 results with fp_f9f4fb6dbf and 5512 with fp_eb9dce56a8. For the most part, the two system fingerprints returned the same sets of probabilities, rather than each fingerprint producing its own distinct set.

It could be that these 12 clusters of probabilities represent 12 different experts. Even assuming that, the variations within the clusters remain puzzling. These don’t seem likely to be simple rounding errors, because they are too systematic and consistent. Take the giant cluster at around 96.266% with two distinct probabilities representing over half of our coin flips. The difference between these two probabilities, 0.0000448274%, is tiny but persistent.

Conclusion: Non-determinism is baked in

There is an underlying randomness in the log probabilities returned by all currently available non-thinking OpenAI models: GPT-4o, GPT-4o-mini, and the two flavors of GPT-3.5-turbo. Because this non-determinism is baked into the log probabilities, there’s no way for a user to get around it. Temperature and seed values have no effect, and system fingerprints don’t explain it.

While mixture-of-experts architectures inherently introduce some randomness in the competition for experts, the non-determinism in GPT-4o seems to go far beyond this, and the non-determinism in GPT-3.5-turbo can’t be explained by this at all, because GPT-3.5-turbo isn’t a mixture-of-experts model.

While we can’t verify this claim any more because the model isn’t being served, this behaviour wasn’t seen with GPT-3, according to user _j on the OpenAI forum:

It is a symptom that was not seen on prior GPT-3 AI models where across hundreds of trials to investigate sampling, you never had to doubt that logprobs would be the same. Even if you found a top-2 answer that returned exactly the same logprob value via the API, you would never see them switch position or return different values.

This suggests that whatever is causing this randomness first emerged in either GPT-3.5 or GPT-3.5-turbo.

But regardless of when it emerged, this non-determinism is a serious obstacle to understanding these models. If you want to study a model—how it generalizes, how it biases responses, how it assigns probabilities to different tokens—you need consistency. But as we’ve seen, even when we lock down every knob OpenAI lets us touch, we still can’t get an answer to the simplest possible question: “What is the probability that GPT-4o says a coin lands heads?”

Worse, while mixture-of-experts explains some of this non-determinism, there are clearly other, hidden sources of randomness that we can’t see, control, or understand. In an ideal world, the API would provide more transparency by telling us which expert processed our request or by offering additional parameters to control this routing process. Without such visibility, we’re left guessing at the true nature of the variability.

References

Bar-Hillel, M., Peer, E., & Acquisti, A. (2014). “Heads or tails?” – A reachability bias in binary choice. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(6), 1656–1663. https://doi.org/10.1037/xlm0000005.

Peeperkorn, M., Kouwenhoven, T., Brown, D., & Jordanous, A. (2024). Is temperature the creativity parameter of Large Language Models?. In The 15th International Conference on Computational Creativity (ICCC’24). arXiv:2405.00492.

Puigcerver, J., Riquelme, C., Mustafa, B., & Houlsby, N. (2024). From sparse to soft mixtures of experts. In The Twelfth International Conference on Learning Representations (ICLR 2024). https://openreview.net/forum?id=jxpsAj7ltE. arXiv:2308.00951.

Van Koevering, K., & Kleinberg, J. (2024). How random is random? Evaluating the randomness and humanness of LLMs’ coin flips. arXiv:2406.00092.

The post Avoidable and Unavoidable Randomness in GPT-4o appeared first on Towards Data Science.

]]>
Vision Transformers (ViT) Explained: Are They Better Than CNNs? https://towardsdatascience.com/vision-transformers-vit-explained-are-they-better-than-cnns/ Fri, 28 Feb 2025 22:12:11 +0000 https://towardsdatascience.com/?p=598582 Understanding how a groundbreaking architecture for computer vision tasks works

The post Vision Transformers (ViT) Explained: Are They Better Than CNNs? appeared first on Towards Data Science.

]]>
1. Introduction

Ever since the introduction of the self-attention mechanism, Transformers have been the top choice when it comes to Natural Language Processing (NLP) tasks. Self-attention-based models are highly parallelizable and require substantially fewer parameters, making them much more computationally efficient, less prone to overfitting, and easier to fine-tune for domain-specific tasks [1]. Furthermore, the key advantage of Transformers over past models (like RNNs, LSTMs, GRUs, and other neural architectures that dominated the NLP domain prior to their introduction) is their ability to process input sequences of any length without losing context, through the use of the self-attention mechanism that focuses on different parts of the input sequence, and how those parts interact with other parts of the sequence, at different times [2]. Because of these qualities, Transformers have made it possible to train language models of unprecedented size, with more than 100B parameters, paving the way for current state-of-the-art models like the Generative Pre-trained Transformer (GPT) and the Bidirectional Encoder Representations from Transformers (BERT) [1].

However, in the field of computer vision, convolutional neural networks, or CNNs, remain dominant in most, if not all, computer vision tasks. While there has been a growing body of research that attempts to implement self-attention-based architectures for computer vision tasks, very few have reliably outperformed CNNs with promising scalability [3]. The main challenge with integrating the transformer architecture into image-related tasks is that, by design, the self-attention mechanism, which is the core component of transformers, has a quadratic time complexity with respect to sequence length, i.e., O(n²), as shown in Table I and as discussed further in Part 2.1. This is usually not a problem for NLP tasks that use a relatively small number of tokens per input sequence (e.g., a 1,000-word paragraph will only have 1,000 input tokens, or a few more if sub-word units are used as tokens instead of full words). However, in computer vision, the input sequence (the image) can have a token count orders of magnitude greater than that of NLP input sequences. For example, a relatively small 300 x 300 x 3 image can easily have up to 270,000 tokens and require a self-attention map with up to 72.9 billion parameters (270,000²) when self-attention is applied naively.

Table I. Time complexity for different layer types [2].

For this reason, most of the research works that attempt to use self-attention-based architectures for computer vision tasks did so either by applying self-attention locally, by using transformer blocks in conjunction with CNN layers, or by only replacing specific components of the CNN architecture while maintaining the overall structure of the network; never by using a pure transformer alone [3]. The goal of Dosovitskiy et al. in their work, “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”, is to show that it is indeed possible to implement image classification by applying self-attention globally through the use of the basic Transformer encoder architecture, while at the same time requiring significantly less computational resources to train and outperforming state-of-the-art convolutional neural networks like ResNet.

2. The Transformer

Transformers, introduced in the paper titled “Attention is All You Need” by Vaswani et al. in 2017,  are a class of neural network architectures that have revolutionized various natural language processing and machine learning tasks. A high level view of its architecture is shown in Fig. 1.

Fig. 1. The Transformer model architecture showing the encoder (left block)
and decoder components (right block) [2]

Since its introduction, transformers have served as the foundation for many state-of-the-art models in NLP; including BERT, GPT, and more. Fundamentally, they are designed to process sequential data, such as text data, without the need for recurrent or convolutional layers [2]. They achieve this by relying heavily on a mechanism called self-attention.

The self-attention mechanism is a key innovation introduced in the paper that allows the model to capture relationships between different elements in a given sequence by weighing the importance of each element in the sequence with respect to other elements [2]. Say for instance, you want to translate the following sentence:

“The animal didn’t cross the street because it was too tired.”

What does the word “it” in this particular sentence refer to? Is it referring to the street or the animal? For us humans, this may be a trivial question to answer. But for an algorithm, this can be considered a complex task to perform. However, through the self-attention mechanism, the transformer model is able to estimate the relative weight of each word with respect to all the other words in the sentence, allowing the model to associate the word “it” with “animal” in the context of our given sentence [4].

Fig. 2.  Sample output of the 5th encoder in a 5-encoder stack self-attention block given the word “it” as an input. We can see that the attention mechanism is associating our input word with the phrase “The Animal” [4].

2.1. The Self-Attention Mechanism

A transformer transforms a given input sequence by passing each element through an encoder (or a stack of encoders) and a decoder (or a stack of decoders) block, in parallel [2]. Each encoder block contains a self-attention block and a feed forward neural network. Here, we only focus on the transformer encoder block as this was the component used by Dosovitskiy et al. in their Vision Transformer image classification model.

As is the case with general NLP applications, the first step in the encoding process is to turn each input word into a vector using an embedding layer which converts our text data into a vector that represents our word in the vector space while retaining its contextual information. We then compile these individual word embedding vectors into a matrix X, where each row i represents the embedding of each element i in the input sequence. Then, we create three sets of vectors for each element in the input sequence; namely, Key (K), Query (Q), and Value (V). These sets are derived by multiplying matrix X with the corresponding trainable weight matrices WQ, WK, and WV [2].

Afterwards, we perform a matrix multiplication between Q and the transpose of K, divide the result by the square root of the dimensionality of K, and then apply a softmax function to normalize the output and generate weight values between 0 and 1 [2].

We will call this intermediary output the attention factor. This factor, shown in Eq. 4, represents the weight that each element in the sequence contributes to the calculation of the attention value at the current position (word being processed). The idea behind the softmax operation is to amplify the words that the model thinks are relevant to the current position, and attenuate the ones that are irrelevant. For example, in Fig. 3, the input sentence “He later went to report Malaysia for one year” is passed into a BERT encoder unit to generate a heatmap that illustrates the contextual relationship of each word with each other. We can see that words that are deemed contextually associated produce higher weight values in their respective cells, visualized in a dark pink color, while words that are contextually unrelated have low weight values, represented in pale pink.

Fig. 3. Attention matrix visualization – weights in a BERT Encoding Unit [5]

Finally, we multiply the attention factor matrix by the value matrix V to compute the aggregated self-attention value matrix Z of this layer [2], where each row i in Z represents the attention vector for word i in our input sequence. This aggregated value essentially bakes the “context” provided by other words in the sentence into the current word being processed. The attention equation shown in Eq. 5 is sometimes also referred to as the Scaled Dot-Product Attention.
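To make the computation concrete, here is a minimal PyTorch sketch of the scaled dot-product attention described above. The dimensions and the randomly initialized X, WQ, WK, and WV are toy stand-ins, not values from any trained model.

import torch
import torch.nn.functional as F

seq_len, d_model, d_k = 6, 32, 16
X = torch.randn(seq_len, d_model)                          # one row per input token
WQ, WK, WV = (torch.randn(d_model, d_k) for _ in range(3)) # stand-ins for trainable weights

Q, K, V = X @ WQ, X @ WK, X @ WV
attention_factor = F.softmax(Q @ K.T / d_k**0.5, dim=-1)   # weights between 0 and 1
Z = attention_factor @ V                                   # row i = attention vector of token i
print(attention_factor.shape, Z.shape)                     # torch.Size([6, 6]) torch.Size([6, 16])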

2.2 The Multi-Headed Self-Attention

In the paper by Vaswani et. al., the self-attention block is further augmented with a mechanism known as the “multi-headed” self-attention, shown in Fig 4. The idea behind this is instead of relying on a single attention mechanism, the model employs multiple parallel attention “heads” (in the paper, Vaswani et. al. used 8 parallel attention layers), wherein each of these attention heads learns different relationships and provides unique perspectives on the input sequence [2]. This improves the performance of the attention layer in two important ways:

First, it expands the ability of the model to focus on different positions within the sequence. Depending on the variations involved in the initialization and training process, the calculated attention value for a given word (Eq. 5) can be dominated by certain unrelated words or phrases, or even by the word itself [4]. By computing multiple attention heads, the transformer model has multiple opportunities to capture the correct contextual relationships, thus becoming more robust to variations and ambiguities in the input.

Second, since each of the Q, K, and V matrices is randomly initialized independently across all the attention heads, the training process yields several Z matrices (Eq. 5), which gives the transformer multiple representation subspaces [4]. For example, one head might focus on syntactic relationships while another might attend to semantic meanings. Through this, the model is able to capture more diverse relationships within the data.

Fig. 4. Illustrating the Multi-Headed Self-Attention Mechanism. Each individual attention head yields a scaled dot-product attention value, which are concatenated and multiplied to a learned matrix WO to generate the aggregated multi-headed self-attention value matrix [4].
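A self-contained sketch of the multi-headed variant, again with toy dimensions and random tensors standing in for trained parameters, might look like this:

import torch
import torch.nn.functional as F

seq_len, d_model, d_k, num_heads = 6, 32, 16, 8
X = torch.randn(seq_len, d_model)

heads = []
for _ in range(num_heads):
    WQ, WK, WV = (torch.randn(d_model, d_k) for _ in range(3))  # independent per-head weights
    Q, K, V = X @ WQ, X @ WK, X @ WV
    heads.append(F.softmax(Q @ K.T / d_k**0.5, dim=-1) @ V)     # per-head Z matrix

WO = torch.randn(num_heads * d_k, d_model)                      # learned output projection
multi_head_Z = torch.cat(heads, dim=-1) @ WO                    # concatenate heads, then project
print(multi_head_Z.shape)                                       # torch.Size([6, 32])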

3. The Vision Transformer

The fundamental innovation behind the Vision Transformer (ViT) revolves around the idea that images can be processed as sequences of tokens rather than grids of pixels. In traditional CNNs, input images are analyzed as overlapping tiles via a sliding convolutional filter, which are then processed hierarchically through a series of convolutional and pooling layers. In contrast, ViT treats the image as a collection of non-overlapping patches, which are treated as the input sequence to a standard Transformer encoder unit.

Fig. 5. The Vision Transformer architecture (left), and the Transformer encoder unit
derived from Fig. 1 (right) [3].

By defining the input tokens to the transformer as non-overlapping image patches rather than individual pixels, we are therefore able to reduce the dimension of the attention map from (H × W)² to (n_ph × n_pw)², given n_ph ≪ H and n_pw ≪ W; where H and W are the height and width of the image, and n_ph and n_pw are the number of patches along the corresponding axes. By doing so, the model is able to handle images of varying sizes without requiring extensive architectural changes [3].

These image patches are then linearly embedded into lower-dimensional vectors, similar to the word embedding step that produces matrix X in Part 2.1. Since transformers contain neither recurrence nor convolutions, they lack the capacity to encode positional information of the input tokens and are therefore permutation invariant [2]. Hence, as is done in NLP applications, a positional embedding is added to each linearly encoded vector prior to input into the transformer model, in order to encode the spatial information of the patches, ensuring that the model understands the position of each token relative to other tokens within the image. Additionally, an extra learnable classifier cls embedding is added to the input. All of these (the linear embeddings of each 16 x 16 patch, the extra learnable classifier embedding, and their corresponding positional embedding vectors) are passed through a standard Transformer encoder unit as discussed in Part 2. The output corresponding to the added learnable cls embedding is then used to perform classification via a standard MLP classifier head [3].
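A rough sketch of this input pipeline (not the authors' implementation) is shown below: the image is split into non-overlapping 16 x 16 patches, each patch is linearly projected, a learnable cls token is prepended, and positional embeddings are added.

import torch
import torch.nn as nn

image = torch.randn(1, 3, 224, 224)                      # (batch, channels, H, W)
patch_size, d_model = 16, 768
num_patches = (224 // patch_size) ** 2                   # 14 x 14 = 196 patches

# split into non-overlapping patches and flatten each one
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)  # (1, 196, 768)

proj = nn.Linear(3 * patch_size * patch_size, d_model)   # linear patch embedding
cls_token = nn.Parameter(torch.zeros(1, 1, d_model))     # learnable classifier embedding
pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, d_model))

tokens = torch.cat([cls_token, proj(patches)], dim=1) + pos_embed
print(tokens.shape)                                      # torch.Size([1, 197, 768])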

4. The Result

In the paper, the two largest models, ViT-H/14 and ViT-L/16, both pre-trained on the JFT-300M dataset, are compared to state-of-the-art CNNs, as shown in Table II; these include Big Transfer (BiT), which employs supervised transfer learning with large ResNets, and Noisy Student, a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M without labels [3]. At the time of the study’s publication, Noisy Student held the state-of-the-art position on ImageNet, while BiT-L held it on the other datasets utilized in the paper [3]. All models were trained on TPUv3 hardware, and the number of TPUv3-core-days it took to train each model was recorded.

Table II. Comparison of model performance against popular image classification benchmarks. Reported here are the mean and standard deviation of the accuracies, averaged over three fine-tuning runs [3].

We can see from the table that the Vision Transformer models pre-trained on the JFT-300M dataset outperform the ResNet-based baseline models on all datasets, while at the same time requiring significantly fewer computational resources (TPUv3-core-days) to pre-train. A secondary ViT-L/16 model was also trained on the much smaller, public ImageNet-21k dataset, and is shown to also perform relatively well while requiring up to 97% less computational resources compared to its state-of-the-art counterparts [3].

Fig. 6 shows the comparison of the performance between the BiT and ViT models (measured using the ImageNet Top1 Accuracy metric) across pre-training datasets of varying sizes. We see that the ViT-Large models underperform the baseline models on small datasets like ImageNet, and perform roughly equivalently on ImageNet-21k. However, when pre-trained on larger datasets like JFT-300M, ViT clearly outperforms the baseline [3].

Fig. 6. BiT (ResNet) vs ViT on different pre-training datasets [3].

Further exploring how the size of the dataset relates to model performance, the authors trained the models on various random subsets of the JFT dataset—9M, 30M, 90M, and the full JFT-300M. Additional regularization was not added on smaller subsets in order to assess the intrinsic model properties (and not the effect of regularization) [3]. Fig. 7 shows that ViT models overfit more than ResNets on smaller datasets. The data shows that ResNets perform better with smaller pre-training datasets but plateau sooner than ViT, which then outperforms them with larger pre-training. The authors conclude that on smaller datasets, convolutional inductive biases play a key role in CNN model performance, which ViT models lack. However, with large enough data, learning relevant patterns directly outweighs inductive biases, and this is where ViT excels [3].

Fig. 7. ResNet vs ViT on different subsets of the JFT training dataset [3].

Finally, the authors analyzed the models’ transfer performance from JFT-300M vs total pre-training compute resources allocated, across different architectures, as shown in Fig. 8. Here, we see that Vision Transformers outperform ResNets with the same computational budget across the board. ViT uses approximately 2-4 times less compute to attain similar performance as ResNet [3]. Implementing a hybrid model does improve performance on smaller model sizes, but the discrepancy vanishes for larger models, which the authors find surprising as the initial hypothesis is that the convolutional local feature processing should be able to assist ViT regardless of compute size [3].

Fig. 8. Performance of the models across different pre-training compute values—exa floating point operations per second (or exaFLOPs) [3].

4.1 What does the ViT model learn?

In order to understand how ViT processes image data, it is important to analyze its internal representations. In Part 3, we saw that the input patches generated from the image are fed into a linear embedding layer that projects the 16×16 patch into a lower-dimensional vector space, and the resulting embedded representations are then combined with positional embeddings. Fig. 9 shows that the model indeed learns to encode the relative position of each patch in the image. The authors used cosine similarity between the learned positional embeddings across patches [3]. High cosine similarity values emerge in the relative area of the position embedding matrix corresponding to the patch’s location; e.g., the top-right patch (row 1, col 7) has high cosine similarity values (yellow pixels) in the top-right area of its position embedding similarity map [3].

Fig. 9. Learned positional embedding for the input image patches [3].
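A minimal sketch of this analysis is shown below; pos_embed is a random stand-in for the learned embedding table, so with real trained weights the printed similarity map would show the structure described above.

import torch
import torch.nn.functional as F

grid, d_model = 14, 768                        # 14 x 14 = 196 patches
pos_embed = torch.randn(grid * grid, d_model)  # stand-in for the *learned* positional embeddings
patch = 0 * grid + 6                           # patch at row 1, col 7 of the grid
sims = F.cosine_similarity(pos_embed[patch].unsqueeze(0), pos_embed, dim=-1)
print(sims.reshape(grid, grid))                # with trained embeddings, high values cluster
                                               # around (row 1, col 7)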

Meanwhile, Fig. 10 (left) shows the top principal components of learned embedding filters that are applied to the raw image patches prior to the addition of the positional embeddings. What’s interesting for me is how similar this is to the learned hidden layer representations that you get from Convolutional neural networks, an example of which is shown in the same figure (right) using the AlexNet architecture.

Fig. 10. Filters of the initial linear embedding layer of ViT-L/32 (left) [3].
The first layer of filters from AlexNet (right) [6].

By design, the self-attention mechanism should allow ViT to integrate information across the entire image, even at the lowest layer, effectively giving ViTs a global receptive field from the start. We can somewhat see this effect in Fig. 10, where the learned embedding filters capture lower-level features like lines and grids, as well as higher-level patterns combining lines and color blobs. This is in contrast with CNNs, whose receptive field size at the lowest layer is very small (because the local application of the convolution operation only attends to the area defined by the filter size), and only widens towards the deeper convolutions as further applications of convolutions extract context from the combined information extracted from lower layers. The authors further tested this by measuring the attention distance, which is computed from the “average distance in the image space across which information is integrated based on the attention weights [3].” The results are shown in Fig. 11.

Fig. 11. Size of attended area by head and network depth [3].

From the figure, we can see that even at very low layers of the network, some heads attend to most of the image already (as indicated by data points with high mean attention distance value at lower values of network depth); thus proving the ability of the ViT model to integrate image information globally, even at the lowest layers. 

Finally, the authors also calculated the attention maps from the output token to the input space using Attention Rollout by averaging the attention weights of the ViT-L/16 across all heads and then recursively multiplying the weight matrices of all layers. This results in a nice visualization of what the output layer attends to prior to classification, shown in Fig. 12 [3].

Fig. 12. Representative examples of attention from the output token to the input space [3].
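A rough sketch of Attention Rollout is shown below. Note that the original formulation also adds the identity matrix to account for residual connections and renormalizes each layer's attention; the attention tensors here are random stand-ins for the real per-layer attention weights of a trained ViT.

import torch

def attention_rollout(attentions):
    """attentions: list of per-layer tensors of shape (num_heads, num_tokens, num_tokens)."""
    num_tokens = attentions[0].shape[-1]
    rollout = torch.eye(num_tokens)
    for layer_attn in attentions:
        attn = layer_attn.mean(dim=0)                    # average over heads
        attn = attn + torch.eye(num_tokens)              # account for residual connections
        attn = attn / attn.sum(dim=-1, keepdim=True)     # renormalize rows
        rollout = attn @ rollout                         # recursively multiply across layers
    return rollout

# toy example: 4 layers, 3 heads, 197 tokens (cls + 196 patches)
attentions = [torch.softmax(torch.randn(3, 197, 197), dim=-1) for _ in range(4)]
cls_to_patches = attention_rollout(attentions)[0, 1:]    # attention of the cls token to patches
print(cls_to_patches.shape)                              # torch.Size([196])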

5. So, is ViT the future of Computer Vision?

The Vision Transformer (ViT) introduced by Dosovitskiy et al. in the research study showcased in this article is a groundbreaking architecture for computer vision tasks. Unlike previous methods that introduce image-specific biases, ViT treats an image as a sequence of patches and processes it using a standard Transformer encoder, much like how Transformers are used in NLP. This straightforward yet scalable strategy, combined with pre-training on extensive datasets, has yielded impressive results as discussed in Part 4. The Vision Transformer (ViT) either matches or surpasses the state of the art on numerous image classification datasets (Fig. 6, 7, and 8), all while maintaining cost-effectiveness in pre-training [3].

However, like any technology, it has its limitations. First, in order to perform well, ViTs require a very large amount of training data, at a scale not everyone has access to, especially when compared to traditional CNNs. The authors of the paper used the JFT-300M dataset, which is a limited-access dataset managed by Google [7]. The dominant approach to get around this is to use a model pre-trained on a large dataset and then fine-tune it on smaller (downstream) tasks. Second, there are still very few pre-trained ViT models available compared to the available pre-trained CNN models, which limits the transfer learning benefits for these smaller, much more specific computer vision tasks. Third, by design, ViTs process images as sequences of tokens (discussed in Part 3), which means they do not naturally capture spatial information [3]. While adding positional embeddings does help remedy this lack of spatial context, ViTs may not perform as well as CNNs in image localization tasks, given that CNNs’ convolutional layers are excellent at capturing these spatial relationships.

Moving forward, the authors mention the need to further study scaling ViTs for other computer vision tasks such as image detection and segmentation, as well as other training methods like self-supervised pre-training [3]. Future research may focus on making ViTs more efficient and scalable, such as developing smaller and more lightweight ViT architectures that can still deliver the same competitive performance. Furthermore, providing better accessibility by creating and sharing a wider range of pre-trained ViT models for various tasks and domains can further facilitate the development of this technology in the future.


References

  1. N. Pogeant, “Transformers - the NLP revolution,” Medium, https://medium.com/mlearning-ai/transformers-the-nlp-revolution-5c3b6123cfb4 (accessed Sep. 23, 2023).
  2. A. Vaswani, et. al. “Attention is all you need.” NIPS 2017.
  3. A. Dosovitskiy, et. al. “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale,” ICLR 2021.
  4. X. Wang, G. Chen, G. Qian, P. Gao, X.-Y. Wei, Y. Wang, Y. Tian, and W. Gao, “Large-scale multi-modal pre-trained models: A comprehensive survey,” Machine Intelligence Research, vol. 20, no. 4, pp. 447–482, 2023, doi: 10.1007/s11633-022-1410-8.
  5. H. Wang, “Addressing Syntax-Based Semantic Complementation: Incorporating Entity and Soft Dependency Constraints into Metonymy Resolution”, Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Attention-matrix-visualization-a-weights-in-BERT-Encoding-Unit-Entity-BERT-b_fig5_359215965 [accessed 24 Sep, 2023]
  6. A. Krizhevsky, et. al. “ImageNet Classification with Deep Convolutional Neural Networks,” NIPS 2012. 
  7. C. Sun, et. al. “Revisiting Unreasonable Effectiveness of Data in Deep Learning Era,” Google Research, ICCV 2017.

* ChatGPT, used sparingly to rephrase certain paragraphs for better grammar and more concise explanations. All ideas in the report belong to me unless otherwise indicated. Chat Reference: https://chat.openai.com/share/165501fe-d06d-424b-97e0-c26a81893c69

The post Vision Transformers (ViT) Explained: Are They Better Than CNNs? appeared first on Towards Data Science.

]]>
Unraveling Large Language Model Hallucinations https://towardsdatascience.com/unraveling-large-language-model-hallucinations/ Fri, 28 Feb 2025 20:15:37 +0000 https://towardsdatascience.com/?p=598569 Understanding hallucinations as emergent cognitive effects of the training pipeline

The post Unraveling Large Language Model Hallucinations appeared first on Towards Data Science.

]]>
Introduction

In a YouTube video titled Deep Dive into LLMs like ChatGPT, former Senior Director of AI at Tesla, Andrej Karpathy discusses the psychology of Large Language Models (LLMs) as emergent cognitive effects of the training pipeline. This article is inspired by his explanation of LLM hallucinations and the information presented in the video.

You might have seen model hallucinations. They are the instances where LLMs generate incorrect, misleading, or entirely fabricated information that appears plausible. These hallucinations happen because LLMs do not “know” facts in the way humans do; instead, they predict words based on patterns in their training data. Early models released a few years ago struggled significantly with hallucinations. Over time, mitigation strategies have improved the situation, though hallucinations haven’t been fully eliminated.

An illustrative example of LLM hallucinations (Image by Author)

Zyler Vance is a completely fictitious name I came up with. When I input the prompt “Who is Zyler Vance?” into the falcon-7b-instruct model, it generates fabricated information. Zyler Vance is not a character in The Cloverfield Paradox (2018) movie. This model, being an older version, is prone to hallucinations.

LLM Training Pipeline

To understand where these hallucinations originate from, you have to be familiar with the training pipeline. Training LLMs typically involve three major stages.

  1. Pretraining
  2. Post-training: Supervised Fine-Tuning (SFT)
  3. Post-training: Reinforcement Learning with Human Feedback (RLHF)

Pretraining

This is the initial stage of the training for LLMs. During pretraining the model is exposed to a huge quantity of very high-quality and diverse text crawled from the internet. Pretraining helps the model learn general language patterns, grammar, and facts. The output of this training phase is called the base model. It is a token simulator that predicts the next word in a sequence.

To get a sense of what a pretraining dataset might look like, you can see the FineWeb dataset. The FineWeb dataset is fairly representative of what you might see in an enterprise-grade language model. All the major LLM providers like OpenAI, Google, or Meta will have some equivalent dataset internally, like the FineWeb dataset.

Post-Training: Supervised Fine-Tuning

As I mentioned before, the base model is a token simulator. It simply samples internet text documents. We need to turn this base model into an assistant that can answer questions. Therefore, the pretrained model is further refined using a dataset of conversations. These conversation datasets have hundreds of thousands of conversations that are multi-turn and very long, covering a diverse breadth of topics.

Illustrative human assistant conversations from InstructGPT distribution

These conversations come from human labelers. Given a conversational context, human labelers write out ideal responses for an assistant in any situation. Later, we take the base model that was trained on internet documents, substitute that dataset with the dataset of conversations, and continue training the model on this new dataset. This way, the model adjusts rapidly and learns the statistics of how an assistant responds to queries. At the end of training, the model is able to imitate human-like responses.

OpenAssistant/oasst1 is one of the open-source conversation datasets available on Hugging Face. It is a human-generated and human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages.

Post-training: Reinforcement Learning with Human Feedback

Supervised Fine-Tuning makes the model capable. However, even a well-trained model can generate misleading, biased, or unhelpful responses. Therefore, Reinforcement Learning with Human Feedback is required to align it with human expectations.

We start with the assistant model, trained by SFT. For a given prompt we generate multiple model outputs. Human labelers rank or score multiple model outputs based on quality, safety, and alignment with human preferences. We use these data to train a whole separate neural network that we call a reward model.

The reward model imitates human scores. It is a simulator of human preferences. It is a completely separate neural network, probably with a transformer architecture, but it is not a language model in the sense that it generates diverse language. It’s just a scoring model.

Now the LLM is fine-tuned using reinforcement learning, where the reward model provides feedback on the quality of the generated outputs. So instead of asking a real human, we’re asking a simulated human for their score of an output. The goal is to maximize the reward signal, which reflects human preferences.

Why Hallucinations?

Now that we have a clearer understanding of the training process of large language models, we can continue with our discussion on hallucinations.

Hallucinations originate from the Supervised Fine-Tuning stage of the training pipeline. The following is a specific example of three potential conversations you might have on your training set.

Examples of human-assistant conversations (Image by Author)

As I have shown earlier, this is what human-assistant conversations would look like at training time. These conversations are created by human labelers under strict guidelines. When a labeler writes the correct answer for the assistant in each of these cases, either they know this person or they research them on the internet. After that, they write an assistant response that has the confident tone of an answer.

At test time, if the model is asked about an individual it has not seen during training, it does not simply respond with an acknowledgment of ignorance. Simply put, it does not reply with “Oh, I don’t know.” Instead, the model statistically imitates the training set.

In the training set, questions of the form “Who is X?” are confidently answered with the correct answer. Therefore, at test time, the model replies in the style of such an answer and gives the statistically most likely guess. So it just makes stuff up that is statistically consistent with the style of the answers in its training set.

Model Interrogation

Our question now is how to mitigate the hallucinations. It is evident that our dataset should include examples where the correct answer for the assistant is that the model does not know about some particular fact. However, these answers must be produced only in instances where the model actually does not know. So the key question is how do we know what the model knows and what it does not? We need to probe the model to figure that out empirically.

The task is to figure out the boundary of the model’s knowledge. Therefore, we need to interrogate the model to figure out what it knows and doesn’t know. Then we can add examples to the training set for the things that the model doesn’t know. The correct response, in such cases, is that the model does not know them.

An example of a training instance where the model doesn’t know the answer to a particular question

Let’s take a look at how Meta dealt with hallucinations using this concept for the Llama 3 series of models.

In their 2024 paper titled “The Llama 3 Herd of Models”, Touvron et al. describe how they have developed a knowledge-probing technique to achieve this. Their primary approach involves generating data that aligns model generations with subsets of factual data present in the pre-training data. They describe the following procedure for the data generation process:

  1. Extract a data snippet from the pre-training data.
  2. Generate a factual question about these snippets (context) by prompting Llama 3.
  3. Sample responses from Llama 3 to the question.
  4. Score the correctness of the generations using the original context as a reference and Llama 3 as a judge.
  5. Score the informativeness of the generations using Llama 3 as a judge.
  6. Generate a refusal for responses which are consistently informative and incorrect across the generations, using Llama 3. (p. 27)

After that, the data generated from the knowledge probe is used to encourage the model to answer only the questions it knows about, and to refrain from answering questions it is unsure about. Implementing this technique has improved the hallucination issue over time.

Using Web Search

We have better mitigation strategies than just saying we do not know. We can provide the LLM with an opportunity to generate factual responses and accurately address the question. What would you do, in a case where I ask you a factual question that you don’t have an answer to? How do you answer the question? You could do some research and search the internet to figure out the answer to the question. Then tell me the answer to the question. We can do the same thing with LLMs.

You can think of the knowledge inside the parameters of the trained neural network as a vague recollection of things that the model has seen during pretraining a long time ago. Knowledge in the model parameters is analogous to something in your memory that you read a month ago. You remember things that you read repeatedly over time better than things you read rarely. If you don’t have a good recollection of information that you read, what you do is go and look it up. When you look up information, you are essentially refreshing your working memory with that information, allowing you to retrieve and discuss it.

We need some equivalent mechanism to allow the model to refresh its memory or recollection of information. We can achieve this by introducing tools for the model. The model can use web search tools instead of just replying with “I am sorry, I don’t know the answer”. To achieve this we need to introduce special tokens, such as <SEARCH_START> and <SEARCH_END> along with a protocol that defines how the model is allowed to use these tokens. In this mechanism, the language model can emit special tokens. Now in a case where the model doesn’t know the answer, it has the option to emit the special token <SEARCH_START> instead of replying with “I am sorry, I don’t know the answer”. After that, the model will emit the query and <SEARCH_END>.

Here when the program that is sampling from the model encounters the special token <SEARCH_START> during inference, it will pause the generation process instead of sampling the next token in the sequence. It will initiate a session with the search engine, input the search query into the search engine, and retrieve all the extracted text from the results. Then it will insert that text inside the context window.

The extracted text from the web search is now within the context window that will be fed into the neural network. Think of the context window as the working memory of the model. The data inside the context window is directly accessible by the model. It is directly fed into the neural network. Therefore it is no longer a vague recollection of information. Now, when sampling new tokens, it can very easily reference the data that has been copy-pasted there. Thus, this is a general overview of how these web search tools function.
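A conceptual sketch of such a sampling loop is shown below. This is not any provider's real protocol: the token names, the generate_until mock, and the web_search mock are all placeholders meant only to illustrate the pause-search-resume flow.

SEARCH_START, SEARCH_END = "<SEARCH_START>", "<SEARCH_END>"

def generate_until(context):
    # Mocked stand-in for a real LLM call: it first emits a search request,
    # then answers once search results appear in the context window.
    if "search results:" in context:
        return "Based on the search results, I could not find a person named Zyler Vance."
    return f"{SEARCH_START}Zyler Vance{SEARCH_END}"

def web_search(query):
    # Mocked stand-in for a real search-engine call.
    return f"\nsearch results: (nothing reliable found for '{query}')\n"

def answer(prompt):
    context = prompt
    output = generate_until(context)
    if SEARCH_START in output:
        # pause generation, run the search, and paste the extracted text
        # into the context window (the model's working memory)
        query = output.split(SEARCH_START)[1].split(SEARCH_END)[0]
        context += output + web_search(query)
        output = generate_until(context)
    return output

print(answer("Who is Zyler Vance?"))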

An example of a training instance with special tokens. The […] notation indicates the placeholder for the extracted content

How can we teach the model to correctly use these tools like web search? Again we accomplish this through training sets. We now need enough data and numerous conversations that demonstrate, by example, how the model should use web search. We need to illustrate with examples aspects such as: “What are the settings where you are using the search? What does it look like? How do you start a search?” Because of the pretraining stage, it possesses a native understanding of what a web search is and what constitutes a good search query. Therefore, if your training set contains several thousand examples, the model will be able to understand clearly how the tool works.

Conclusion

Large language model hallucinations are inherent consequences of the training pipeline, particularly arising from the supervised fine-tuning stage. Since language models are designed to generate statistically probable text, they often produce responses that appear plausible but lack a factual basis.

Early models were significantly prone to hallucinations. However, the problem has improved with the implementation of various mitigation strategies. Knowledge-probing techniques and training the model to use web search tools have proven effective in mitigating the problem. Despite these improvements, completely eliminating hallucinations remains an ongoing challenge. As LLMs continue to evolve, mitigating hallucinations to a large extent is crucial to ensuring their reliability as a trustworthy knowledge base.

If you enjoyed this article, connect with me on X (formerly Twitter) for more insights.

The post Unraveling Large Language Model Hallucinations appeared first on Towards Data Science.

]]>
Debugging the Dreaded NaN https://towardsdatascience.com/debugging-the-dreaded-nan/ Thu, 27 Feb 2025 21:52:06 +0000 https://towardsdatascience.com/?p=598513 Capturing and reproducing failures in PyTorch training with Lightning

The post Debugging the Dreaded NaN appeared first on Towards Data Science.

]]>
You are training your latest AI model, anxiously watching as the loss steadily decreases when suddenly — boom! Your logs are flooded with NaNs (Not a Number) — your model is irreparably corrupted and you’re left staring at your screen in despair. To make matters worse, the NaNs don’t appear consistently. Sometimes your model trains just fine; other times, it fails inexplicably. Sometimes it will crash immediately, sometimes after many days of training.

NaNs in Deep Learning workloads are amongst the most frustrating issues to encounter. And because they often appear sporadically — triggered by a specific combination of model state, input data, and stochastic factors — they can be incredibly difficult to reproduce and debug.

Given the considerable cost of training AI models and the potential waste caused by NaN failures, it is recommended to have dedicated tools for capturing and analyzing NaN occurrences. In a previous post, we discussed the challenge of debugging NaNs in a TensorFlow training workload. We proposed an efficient scheme for capturing and reproducing NaNs and shared a sample TensorFlow implementation. In this post, we adopt and demonstrate a similar mechanism for debugging NaNs in PyTorch workloads. The general scheme is as follows:

On each training step:

  1. Save a copy of the training input batch.
  2. Check the gradients for NaN values. If any appear, save a checkpoint with the current model weights before the model is corrupted. Also, save the input batch and, if necessary, the stochastic state. Discontinue the training job.
  3. Reproduce and debug the NaN occurrence by loading the saved experiment state.

Although this scheme can be easily implemented in native PyTorch, we will take the opportunity to demonstrate some of the conveniences of PyTorch Lightning — a powerful open-source framework designed to streamline the development of machine learning (ML) models. Built on PyTorch, Lightning abstracts away many of the boiler-plate components of an ML experiment, such as training loops, data distribution, logging, and more, enabling developers to focus on the core logic of their models.

To implement our NaN capturing scheme, we will use Lightning’s callback interface — a dedicated structure that enables inserting custom logic at specific points during the flow of execution.

Importantly, please do not view our choice of Lightning or any other tool or technique that we mention as an endorsement of its use. The code that we will share is intended for demonstrative purposes — please do not rely on its correctness or optimality.

Many thanks to Rom Maltser for his contributions to this post.

NaNCapture Callback

To implement our NaN capturing solution, we create a NaNCapture Lightning callback. The constructor receives a directory path for storing/loading checkpoints and sets up the NaNCapture state. We also define utilities for checking for NaNs, storing checkpoints, and halting the training job.

import os
import torch
from copy import deepcopy
import lightning.pytorch as pl

class NaNCapture(pl.Callback):

    def __init__(self, dirpath: str):
        # path to checkpoint
        self.dirpath = dirpath
        
        # update to True when Nan is identified
        self.nan_captured = False
        
        # stores a copy of the last batch
        self.last_batch = None
        self.batch_idx = None

    @staticmethod
    def contains_nan(tensor):
        return torch.isnan(tensor).any().item()
        # alternatively, check for non-finite values
        # return not torch.isfinite(tensor).all().item()

    @staticmethod
    def halt_training(trainer):
        trainer.should_stop = True
        # communicate stop command to all other ranks
        trainer.strategy.reduce_boolean_decision(trainer.should_stop,
                                                 all=False)

    def save_ckpt(self, trainer):
        os.makedirs(self.dirpath, exist_ok=True)
        # include trainer.global_rank to avoid conflict
        filename = f"nan_checkpoint_rank_{trainer.global_rank}.ckpt"
        full_path = os.path.join(self.dirpath, filename)
        print(f"saving ckpt to {full_path}")
        trainer.save_checkpoint(full_path, False)

Callback Function: on_train_batch_start

We begin by implementing the on_train_batch_start hook to store a copy of each input batch. In case of a NaN event, this batch will be stored in the checkpoint.
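A minimal sketch of this hook, consistent with the fields defined in the constructor above, might look as follows:

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        if not self.nan_captured:
            # keep a copy of the current batch so it can be checkpointed
            # if this step produces NaN gradients
            self.last_batch = deepcopy(batch)
            self.batch_idx = batch_idx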

Callback Function: on_before_optimizer_step

Next we implement the on_before_optimizer_step hook. Here, we check for NaN entries in all of the gradient tensors. If found, we store a checkpoint with the uncorrupted model weights and halt the training.

    def on_before_optimizer_step(self, trainer, pl_module, optimizer):
        if not self.nan_captured:
            # Check if gradients contain NaN
            grads = [p.grad.view(-1) for p in pl_module.parameters()
                     if p.grad is not None]
            all_grads = torch.cat(grads)
            if self.contains_nan(all_grads):
                print("nan found")
                self.save_ckpt(trainer)
                self.halt_training(trainer)

Capturing the Training State

To enable reproducibility, we include the NaNCapture state in the checkpoint by appending it to the training state dictionary. Lightning provides dedicated utilities for saving and loading a callback state:

    def state_dict(self):
        d = {"nan_captured": self.nan_captured}
        if self.nan_captured:
            d["last_batch"] = self.last_batch
        return d


    def load_state_dict(self, state_dict):
        self.nan_captured = state_dict.get("nan_captured", False)
        if self.nan_captured:
            self.last_batch = state_dict["last_batch"]

Reproducing the NaN Occurrence

We have described how our NaNCapture callback can be used to store the training state that resulted in a NaN, but how do we reload this state in order to reproduce the issue and debug it? To accomplish this, we leverage Lightning’s dedicated data loading class, LightningDataModule.

DataModule Function: on_before_batch_transfer

In the code block below, we extend the LightningDataModule class to allow injecting a fixed training input batch. This is achieved by overriding the on_before_batch_transfer hook, as shown below:

from lightning.pytorch import LightningDataModule

class InjectableDataModule(LightningDataModule):

    def __init__(self):
        super().__init__()
        self.cached_batch = None

    def set_custom_batch(self, batch):
        self.cached_batch = batch

    def on_before_batch_transfer(self, batch, dataloader_idx):
        if self.cached_batch is not None:
            return self.cached_batch
        return batch

Callback Function: on_train_start

The final step is modifying the on_train_start hook of our NaNCapture callback to inject the stored training batch into the LightningDataModule.

    def on_train_start(self, trainer, pl_module):
        if self.nan_captured:
            datamodule = trainer.datamodule
            datamodule.set_custom_batch(self.last_batch)

In the next section we will demonstrate the end-to-end solution using a toy example.

Toy Example

To test our new callback, we create a resnet50-based image classification model with a loss function deliberately designed to trigger NaN occurrences.

Instead of using the standard CrossEntropy loss, we compute binary_cross_entropy_with_logits for each class independently and divide the result by the number of samples belonging to that class. Inevitably, we will encounter a batch in which one or more classes are missing, leading to a divide-by-zero operation, resulting in NaN values and corrupting the model.

The implementation below follows Lightning’s introductory tutorial.

import lightning.pytorch as pl
import torch
import torchvision
import torch.nn.functional as F

num_classes = 20


# define a lightning module
class ResnetModel(pl.LightningModule):
    def __init__(self):
        """Initializes a new instance of the ResnetModel class."""
        super().__init__()
        self.model = torchvision.models.resnet50(num_classes=num_classes)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_nb):
        x, y = batch
        outputs = self(x)
        # uncomment for default loss
        # return F.cross_entropy(outputs, y)
        
        # calculate binary_cross_entropy for each class individually
        losses = []
        for c in range(num_classes):
            count = torch.count_nonzero(y==c)
            masked = torch.where(y==c, 1., 0.)
            loss = F.binary_cross_entropy_with_logits(
                outputs[..., c],
                masked,
                reduction='sum'
            )
            mean_loss = loss/count # could result in NaN
            losses.append(mean_loss)
        total_loss = torch.stack(losses).mean()
        return total_loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

We define a synthetic dataset and encapsulate it in our InjectableDataModule class:

import os
import random
from torch.utils.data import Dataset, DataLoader

batch_size = 128
num_steps = 800

# A dataset with random images and labels
class FakeDataset(Dataset):
    def __len__(self):
        return batch_size*num_steps

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(random.randint(0, num_classes-1),
                             dtype=torch.int64)
        return rand_image, label



# define a lightning datamodule
class FakeDataModule(InjectableDataModule):

    def train_dataloader(self):
        dataset = FakeDataset()
        return DataLoader(
            dataset,
            batch_size=batch_size,
            num_workers=os.cpu_count(),
            pin_memory=True
        )

Finally, we initialize a Lightning Trainer with our NaNCapture callback and call trainer.fit with our Lightning module and Lightning DataModule.

import time

if __name__ == "__main__":

    # Initialize a lightning module
    lit_module = ResnetModel()

    # Initialize a DataModule
    mnist_data = FakeDataModule()

    # Train the model
    ckpt_dir = "./ckpt_dir"
    trainer = pl.Trainer(
        max_epochs=1,
        callbacks=[NaNCapture(ckpt_dir)]
    )

    ckpt_path = None
    
    # check if NaN ckpt exists
    if os.path.isdir(ckpt_dir):
        dir_contents = [os.path.join(ckpt_dir, f)
                        for f in os.listdir(ckpt_dir)]
        ckpts = [f for f in dir_contents
                 if os.path.isfile(f) and f.endswith('.ckpt')]
        if ckpts:
            ckpt_path = ckpts[0]

    t0 = time.perf_counter()
    trainer.fit(lit_module, mnist_data, ckpt_path=ckpt_path)
    print(f"total runtime: {time.perf_counter() - t0}")

After a number of training steps, a NaN event will occur. At this point a checkpoint is saved with the full training state and the training is halted.

When the script is run again the exact state that caused the NaN will be reloaded allowing us to easily reproduce the issue and debug its root cause.

Performance Overhead

To assess the impact of our NaNCapture callback on runtime performance, we modified our experiment to use CrossEntropyLoss (to avoid NaNs) and measured the average throughput when running with and without NaNCapture callback. The experiments were conducted on an NVIDIA L40S GPU, with a PyTorch 2.5.1 Docker image.

Overhead of NaNCapture Callback (by Author)

For our toy model, the NaNCapture callback adds a minimal 1.5% overhead to the runtime performance — a small price to pay for the valuable debugging capabilities it provides.

Naturally, the actual overhead will depend on the specifics of the model and runtime environment.

How to Handle Stochasticity

The solution we have described thus far will succeed in reproducing the training state provided that the model does not include any randomness. However, introducing stochasticity into the model definition is often critical for convergence. A common example of a stochastic layer is torch.nn.Dropout.

You may find that your NaN event depends on the precise state of randomness when the failure occurred. Consequently, we would like to enhance our NaNCapture callback to capture and restore the random state at the point of failure. The random state is determined by a number of libraries. In the code block below, we attempt to capture the full state of randomness:

import os
import torch
import random
import numpy as np
from copy import deepcopy
import lightning.pytorch as pl

class NaNCapture(pl.Callback):

    def __init__(self, dirpath: str):
        # path to checkpoint
        self.dirpath = dirpath
        
        # update to True when Nan is identified
        self.nan_captured = False
        
        # stores a copy of the last batch
        self.last_batch = None
        self.batch_idx = None

        # rng state
        self.rng_state = {
            "torch": None,
            "torch_cuda": None,
            "numpy": None,
            "random": None
        }

    @staticmethod
    def contains_nan(tensor):
        return torch.isnan(tensor).any().item()
        # alternatively check for finite
        # return not torch.isfinite(tensor).all().item()

    @staticmethod
    def halt_training(trainer):
        trainer.should_stop = True
        trainer.strategy.reduce_boolean_decision(trainer.should_stop,
                                                 all=False)

    def save_ckpt(self, trainer):
        os.makedirs(self.dirpath, exist_ok=True)
        # include trainer.global_rank to avoid conflict
        filename = f"nan_checkpoint_rank_{trainer.global_rank}.ckpt"
        full_path = os.path.join(self.dirpath, filename)
        print(f"saving ckpt to {full_path}")
        trainer.save_checkpoint(full_path, False)

    def on_train_start(self, trainer, pl_module):
        if self.nan_captured:
            # inject batch
            datamodule = trainer.datamodule
            datamodule.set_custom_batch(self.last_batch)

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        if self.nan_captured:
            # restore random state
            torch.random.set_rng_state(self.rng_state["torch"])
            torch.cuda.set_rng_state_all(self.rng_state["torch_cuda"])
            np.random.set_state(self.rng_state["numpy"])
            random.setstate(self.rng_state["random"])
        else:
            # capture current batch
            self.last_batch = deepcopy(batch)
            self.batch_idx = batch_idx
    
            # capture current random state
            self.rng_state["torch"] = torch.random.get_rng_state()
            self.rng_state["torch_cuda"] = torch.cuda.get_rng_state_all()
            self.rng_state["numpy"] = np.random.get_state()
            self.rng_state["random"] = random.getstate()
    
    def on_before_optimizer_step(self, trainer, pl_module, optimizer):
        if not self.nan_captured:
            # Check if gradients contain NaN
            grads = [p.grad.view(-1) for p in pl_module.parameters()
                     if p.grad is not None]
            all_grads = torch.cat(grads)
            if self.contains_nan(all_grads):
                print("nan found")
                self.save_ckpt(trainer)
                self.halt_training(trainer)

    def state_dict(self):
        d = {"nan_captured": self.nan_captured}
        if self.nan_captured:
            d["last_batch"] = self.last_batch
            d["rng_state"] = self.rng_state
        return d

    def load_state_dict(self, state_dict):
        self.nan_captured = state_dict.get("nan_captured", False)
        if self.nan_captured:
            self.last_batch = state_dict["last_batch"]
            self.rng_state = state_dict["rng_state"]

Importantly, setting the random state may not guarantee full reproducibility. The GPU owes its power to its massive parallelism. In some GPU operations, multiple threads may read or write concurrently to the same memory locations, resulting in nondeterminism. PyTorch allows for some control over this via its use_deterministic_algorithms setting, but this may impact the runtime performance. Additionally, there is a possibility that the NaN event will not be reproduced once this configuration setting is changed. Please see the PyTorch documentation on reproducibility for more details.
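For reference, below is a minimal sketch of opting in to deterministic behavior; it trades throughput for reproducibility and will raise an error for ops that have no deterministic implementation.

import os
import torch

# Required by cuBLAS for deterministic results on CUDA >= 10.2;
# set before running any CUDA operations.
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

# Ask PyTorch to prefer deterministic kernels and raise an error
# whenever an op has no deterministic implementation.
torch.use_deterministic_algorithms(True)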

Summary

Encountering NaN failures is one of the most discouraging events that can happen in machine learning development. These errors not only waste valuable computation and development resources, but often indicate fundamental issues in the model architecture or experiment design. Due to their sporadic, sometimes elusive nature, debugging NaN failures can be a nightmare.

This post introduced a proactive approach for capturing and reproducing NaN errors using a dedicated Lightning callback. The solution we shared is a proposal which can be modified and extended for your specific use case.

While this solution may not address every possible NaN scenario, it significantly reduces debugging time when applicable, potentially saving developers countless hours of frustration and wasted effort.

The post Debugging the Dreaded NaN appeared first on Towards Data Science.

]]>
LLaDA: The Diffusion Model That Could Redefine Language Generation https://towardsdatascience.com/llada-the-diffusion-model-that-could-redefine-language-generation/ Wed, 26 Feb 2025 18:18:22 +0000 https://towardsdatascience.com/?p=598455 How LLaDA works, why it matters, and how it could shape the next generation of LLMs

The post LLaDA: The Diffusion Model That Could Redefine Language Generation appeared first on Towards Data Science.

]]>
Introduction

What if we could make language models think more like humans? Instead of writing one word at a time, what if they could sketch out their thoughts first, and gradually refine them?

This is exactly what Large Language Diffusion Models (LLaDA) introduces: a different approach to current text generation used in Large Language Models (LLMs). Unlike traditional autoregressive models (ARMs), which predict text sequentially, left to right, LLaDA leverages a diffusion-like process to generate text. Instead of generating tokens sequentially, it progressively refines masked text until it forms a coherent response.

In this article, we will dive into how LLaDA works, why it matters, and how it could shape the next generation of LLMs.

I hope you enjoy the article!

The current state of LLMs

To appreciate the innovation that LLaDA represents, we first need to understand how current large language models (LLMs) operate. Modern LLMs follow a two-step training process that has become an industry standard:

  1. Pre-training: The model learns general language patterns and knowledge by predicting the next token in massive text datasets through self-supervised learning.
  2. Supervised Fine-Tuning (SFT): The model is refined on carefully curated data to improve its ability to follow instructions and generate useful outputs.

Note that current LLMs often use RLHF as well to further refine the weights of the model, but this is not used by LLaDA so we will skip this step here.

These models, primarily based on the Transformer architecture, generate text one token at a time using next-token prediction.

Simplified Transformer architecture for text generation (Image by the author)

Here is a simplified illustration of how data passes through such a model. Each token is embedded into a vector and is transformed through successive transformer layers. In current LLMs (LLaMA, ChatGPT, DeepSeek, etc), a classification head is used only on the last token embedding to predict the next token in the sequence.

This works thanks to the concept of masked self-attention: each token attends to all the tokens that come before it. We will see later how LLaDA can get rid of the mask in its attention layers.

Attention process: input embeddings are multiplied by Query, Key, and Value matrices to generate new embeddings (Image by the author, inspired by [3])
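As a minimal illustration, the causal mask used by ARMs can be expressed as a lower-triangular boolean matrix, where position i may only attend to positions j ≤ i:

import torch

seq_len = 5
# True where attention is allowed: token i attends only to tokens j <= i.
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
# LLaDA drops this constraint, letting every token attend to every other token.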

If you want to learn more about Transformers, check out my article here.

While this approach has led to impressive results, it also comes with significant limitations, some of which have motivated the development of LLaDA.

Current limitations of LLMs

Current LLMs face several critical challenges:

Computational Inefficiency

Imagine having to write a novel where you can only think about one word at a time, and for each word, you need to reread everything you’ve written so far. This is essentially how current LLMs operate — they predict one token at a time, requiring a complete processing of the previous sequence for each new token. Even with optimization techniques like KV caching, this process is quite computationally expensive and time-consuming.

Limited Bidirectional Reasoning

Traditional autoregressive models (ARMs) are like writers who could never look ahead or revise what they’ve written so far. They can only predict future tokens based on past ones, which limits their ability to reason about relationships between different parts of the text. As humans, we often have a general idea of what we want to say before writing it down; current LLMs lack this capability in some sense.

Amount of data

Existing models require enormous amounts of training data to achieve good performance, making them resource-intensive to develop and potentially limiting their applicability in specialized domains with limited data availability.

What is LLaDA

LLaDA introduces a fundamentally different approach to Language Generation by replacing traditional autoregression with a “diffusion-based” process (we will dive later into why this is called “diffusion”).

Let’s understand how this works, step by step, starting with pre-training.

LLaDA pre-training

Remember that we don’t need any “labeled” data during the pre-training phase. The objective is to feed a very large amount of raw text data into the model. For each text sequence, we do the following:

  1. We fix a maximum length (similar to ARMs). Typically, this could be 4096 tokens. 1% of the time, the lengths of sequences are randomly sampled between 1 and 4096 and padded so that the model is also exposed to shorter sequences.
  2. We randomly choose a “masking rate”. For example, one could pick 40%.
  3. We mask each token with a probability of 0.4. What does “masking” mean exactly? Well, we simply replace the token with a special token <MASK>. As with any other token, this token is associated with a particular index and embedding vector that the model can process and interpret during training.
  4. We then feed our entire sequence into our transformer-based model. This process transforms all the input embedding vectors into new embeddings. We apply the classification head to each of the masked tokens to get a prediction for each. Mathematically, our loss function averages cross-entropy losses over all the masked tokens in the sequence, as below:
Loss function used for LLaDA (Image by the author)

5. And… we repeat this procedure for billions or trillions of text sequences.

Note that, unlike ARMs, LLaDA can fully utilize bidirectional dependencies in the text: it doesn’t require masking in attention layers anymore. However, this can come at an increased computational cost.

Hopefully, you can see how the training phase itself (the flow of the data into the model) is very similar to any other LLMs. We simply predict randomly masked tokens instead of predicting what comes next.
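To make this concrete, here is a minimal sketch of the objective, assuming a hypothetical model that maps token ids to per-position logits; it follows the averaged cross-entropy over masked tokens described above.

import torch
import torch.nn.functional as F

def llada_pretrain_loss(model, token_ids, mask_id):
    """One pre-training step: mask tokens at a random rate, predict the masked ones."""
    # 1. Sample a masking rate uniformly at random.
    t = torch.empty(1).uniform_(0.0, 1.0).item()
    # 2. Mask each token independently with probability t.
    mask = torch.rand(token_ids.shape, device=token_ids.device) < t
    corrupted = torch.where(mask, torch.full_like(token_ids, mask_id), token_ids)
    # 3. The model predicts logits for every position; only masked positions
    #    contribute to the loss (assumes at least one token was masked).
    logits = model(corrupted)  # (batch, seq_len, vocab_size)
    return F.cross_entropy(logits[mask], token_ids[mask])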

LLaDA SFT

For auto-regressive models, SFT is very similar to pre-training, except that we have pairs of (prompt, response) and want to generate the response when giving the prompt as input.

This is exactly the same concept for LLaDA! Mimicking the pre-training process, we simply pass the prompt and the response, mask random tokens from the response only, and feed the full sequence into the model, which will predict missing tokens from the response.

The innovation in inference

Innovation is where LLaDA gets more interesting, and truly utilizes the “diffusion” paradigm.

Until now, we always randomly masked some text as input and asked the model to predict these tokens. But during inference, we only have access to the prompt, and we need to generate the entire response. You might think (and it’s not wrong) that the model has seen examples where the masking rate was very high (potentially 1) during SFT, and that it had to learn, somehow, how to generate a full response from a prompt.

However, generating the full response at once during inference will likely produce very poor results because the model lacks information. Instead, we need a method to progressively refine predictions, and that’s where the key idea of ‘remasking’ comes in.

Here is how it works, at each step of the text generation process:

  • Feed the current input to the model (this is the prompt, followed by <MASK> tokens)
  • The model generates one embedding for each input token. We get predictions for the <MASK> tokens only. And here is the important step: we remask a portion of them. In particular, we only keep the “best” tokens, i.e. the ones predicted with the highest confidence.
  • We can use this partially unmasked sequence as input in the next generation step and repeat until all tokens are unmasked.

You can see that, interestingly, we have much more control over the generation process compared to ARMs: we could choose to remask 0 tokens (only one generation step), or we could decide to keep only the best token every time (as many steps as tokens in the response). Obviously, there is a trade-off here between the quality of the predictions and inference time.

Let’s illustrate that with a simple example (in that case, I choose to keep the best 2 tokens at every step)

LLaDA generation process example (Image by the author)

Note, in practice, the remasking step would work as follows. Instead of remasking a fixed number of tokens, we would remask a proportion of s/t tokens over time, from t=1 down to 0, where s is in [0, t]. In particular, this means we remask fewer and fewer tokens as the number of generation steps increases.

Example: if we want N sampling steps (so N discrete steps from t=1 down to t=1/N with steps of 1/N), taking s = (t-1/N) is a good choice, and ensures that s=0 at the end of the process.
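Putting these pieces together, here is a hedged sketch of the inference loop; the confidence criterion (keeping the highest-probability predictions) and the per-step unmasking schedule are simplifications of the schedule just described, and the model is assumed to return per-position logits.

import torch

def llada_generate(model, prompt_ids, response_len, mask_id, num_steps):
    """Iteratively unmask a fully masked response, keeping the most confident tokens."""
    seq = torch.cat([prompt_ids,
                     torch.full((response_len,), mask_id, dtype=torch.long)])
    for step in range(num_steps):
        masked = seq == mask_id
        if not masked.any():
            break
        logits = model(seq.unsqueeze(0)).squeeze(0)  # (seq_len, vocab_size)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Reveal an equal share of the remaining masked tokens each step,
        # choosing the positions where the model is most confident.
        num_to_reveal = max(1, int(masked.sum().item() / (num_steps - step)))
        conf = torch.where(masked, conf, torch.full_like(conf, -1.0))
        reveal = conf.topk(num_to_reveal).indices
        seq[reveal] = pred[reveal]
    return seq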

The image below summarizes the 3 steps described above. “Mask predictor” simply denotes the LLM (LLaDA), predicting masked tokens.

Pre-training (a.), SFT (b.) and inference (c.) using LLaDA. (source: [1])

Can autoregression and diffusion be combined?

Another clever idea developed in LLaDA is to combine diffusion with traditional autoregressive generation to use the best of both worlds! This is called semi-autoregressive diffusion.

  • Divide the generation process into blocks (for instance, 32 tokens in each block).
  • The objective is to generate one block at a time (like we would generate one token at a time in ARMs).
  • For each block, we apply the diffusion logic by progressively unmasking tokens to reveal the entire block. Then move on to predicting the next block.
Semi-autoregressive process (source: [1])

This is a hybrid approach: we probably lose some of the “backward” generation and parallelization capabilities of the model, but we better “guide” the model towards the final output.

I think this is a very interesting idea because it depends a lot on a hyperparameter (the number of blocks), that can be tuned. I imagine different tasks might benefit more from the backward generation process, while others might benefit more from the more “guided” generation from left to right (more on that in the last paragraph).

Why “Diffusion”?

I think it’s important to briefly explain where this term actually comes from. It reflects a similarity with image diffusion models (like Dall-E), which have been very popular for image generation tasks.

In image diffusion, a model first adds noise to an image until it’s unrecognizable, then learns to reconstruct it step by step. LLaDA applies this idea to text by masking tokens instead of adding noise, and then progressively unmasking them to generate coherent language. In the context of image generation, the masking step is often called “noise scheduling”, and the reverse (unmasking) is the “denoising” step.

How do Diffusion Models work? (source: [2])

You can also see LLaDA as some type of discrete (non-continuous) diffusion model: we don’t add noise to tokens, but we “deactivate” some tokens by masking them, and the model learns how to unmask a portion of them.

Results

Let’s go through a few of the interesting results of LLaDA.

You can find all the results in the paper. I chose to focus on what I find the most interesting here.

  • Training efficiency: LLaDA shows similar performance to ARMs with the same number of parameters, but uses far fewer tokens during training (and no RLHF)! For example, the 8B version uses around 2.3T tokens, compared to 15T for LLaMA 3.
  • Using different block and answer lengths for different tasks: for example, the block length is particularly large for the Math dataset, and the model demonstrates strong performance for this domain. This could suggest that mathematical reasoning may benefit more from the diffusion-based and backward process.
Source: [1]
  • Interestingly, LLaDA does better on the “Reversal poem completion task”. This task requires the model to complete a poem in reverse order, starting from the last lines and working backward. As expected, ARMs struggle due to their strict left-to-right generation process.
Source: [1]

LLaDA is not just an experimental alternative to ARMs: it shows real advantages in efficiency, structured reasoning, and bidirectional text generation.

Conclusion

I think LLaDA is a promising approach to language generation. Its ability to generate multiple tokens in parallel while maintaining global coherence could definitely lead to more efficient training, better reasoning, and improved context understanding with fewer computational resources.

Beyond efficiency, I think LLaDA also brings a lot of flexibility. By adjusting parameters like the number of blocks generated, and the number of generation steps, it can better adapt to different tasks and constraints, making it a versatile tool for various language modeling needs, and allowing more human control. Diffusion models could also play an important role in pro-active AI and agentic systems by being able to reason more holistically.

As research into diffusion-based language models advances, LLaDA could become a useful step toward more natural and efficient language models. While it’s still early, I believe this shift from sequential to parallel generation is an interesting direction for AI development.

Thanks for reading!





References:

The post LLaDA: The Diffusion Model That Could Redefine Language Generation appeared first on Towards Data Science.

]]>
Reinforcement Learning with PDEs https://towardsdatascience.com/reinforcement-learning-with-pdes/ Fri, 21 Feb 2025 01:45:39 +0000 https://towardsdatascience.com/?p=598232 Previously we discussed applying reinforcement learning to Ordinary Differential Equations (ODEs) by integrating ODEs within gymnasium. ODEs are a powerful tool that can describe a wide range of systems but are limited to a single variable. Partial Differential Equations (PDEs) are differential equations involving derivatives of multiple variables that can cover a far broader range […]

The post Reinforcement Learning with PDEs appeared first on Towards Data Science.

]]>
Previously we discussed applying reinforcement learning to Ordinary Differential Equations (ODEs) by integrating ODEs within gymnasium. ODEs are a powerful tool that can describe a wide range of systems but are limited to a single variable. Partial Differential Equations (PDEs) are differential equations involving derivatives of multiple variables that can cover a far broader range and more complex systems. Often, ODEs are special cases or special assumptions applied to PDEs.

PDEs include Maxwell’s Equations (governing electricity and magnetism), the Navier-Stokes equations (governing fluid flow for aircraft, engines, blood, and other cases), and the Boltzmann equation for thermodynamics. PDEs can describe systems such as flexible structures, power grids, manufacturing, or epidemiological models in biology. They can represent highly complex behavior; the Navier-Stokes equations describe the eddies of a rushing mountain stream. Their capacity for capturing and revealing more complex behavior of real-world systems makes these equations an important topic for study, both in terms of describing systems and analyzing known equations to make new discoveries about systems. Entire fields (like fluid dynamics, electrodynamics, structural mechanics) can be devoted to the study of just a single set of PDEs.

This increased complexity comes with a cost; the systems captured by PDEs are much more difficult to analyze and control. ODEs are also described as lumped-parameter systems: the various parameters and variables that describe them are “lumped” into a discrete point (or a small number of points for a coupled system of ODEs). PDEs are distributed parameter systems that track behavior throughout space and time. In other words, the state space for an ODE is a relatively small number of variables, such as time and a few system measurements at a specific point. For PDE/distributed parameter systems, the state space size can approach infinite dimensions, or be discretized for computation into millions of points for each time step. A lumped parameter system controls the temperature of an engine based on a small number of sensors. A PDE/distributed parameter system would manage temperature dynamics across the entire engine.

As with ODEs, many PDEs must be analyzed (aside from special cases) through modelling and simulation. However, due to the higher dimensions, this modelling becomes far more complex. Many ODEs can be solved through straightforward applications of algorithms like MATLAB’s ODE45 or SciPy’s solve_ivp. PDEs are modelled across grids or meshes where the PDE is simplified to an algebraic equation (such as through Taylor Series expansion) at each point on the grid. Grid generation is a field, a science and art, on its own and ideal (or usable) grids can vary greatly based on problem geometry and Physics. Grids (and hence problem state spaces) can number in the millions of points with computation time running in days or weeks, and PDE solvers are often commercial software costing tens of thousands of dollars. 

Controlling PDEs presents a far greater challenge than ODEs. The Laplace transform that forms the basis of much classical control theory is a one-dimensional transformation. While there has been some progress in PDE control theory, the field is not as comprehensive as for ODE/lumped systems. For PDEs, even basic controllability or observability assessments become difficult as the state space to assess increases by orders of magnitude and fewer PDEs have analytic solutions. By necessity, we run into design questions such as what part of the domain needs to be controlled or observed? Can the rest of the domain be in an arbitrary state? What subset of the domain does the controller need to operate over? With key tools in control theory underdeveloped, and new problems presented, applying machine learning has been a major area of research for understanding and controlling PDE systems. 

Given the importance of PDEs, there has been research into developing control strategies for them. For example, Glowinski et al. developed an analytical adjoint-based method from advanced functional analysis relying on simulation of the system. Other approaches, such as discussed by Kirsten Morris, apply estimations to reduce the order of the PDE to facilitate more traditional control approaches. Botteghi and Fasel have begun to apply machine learning to the control of these systems (note, this is only a VERY BRIEF glimpse of the research). Here we will apply reinforcement learning on two PDE control problems. The diffusion equation is a simple, linear, second order PDE with known analytic solution. The Kuramoto–Sivashinsky (K-S) equation is a much more complex 4th order nonlinear equation that models instabilities in a flame front.

For both these equations we use a simple, small square domain of grid points. We target a sinusoidal pattern in a target area of a line down the middle of the domain by controlling input along the left and right sides. Input parameters for the controls are the values at the target region and the {x,y} coordinates of the input control points. Training the algorithm required modelling the system development through time with the control inputs. As discussed above, this requires a grid where the equation is solved at each point then iterated through each time step. I used the py-pde package to create a training environment for the reinforcement learner (thanks to the developer of this package for his prompt feedback and help!). With the py-pde environment, the approach proceeded as usual for reinforcement learning: the particular algorithm develops a guess at a controller strategy. That controller strategy is applied at small, discrete time steps and provides control inputs based on the current state of the system that lead to some reward (in this case, root mean square difference between target and current distribution).

Unlike previous cases, I only present results from the genetic-programming controller. I developed code to apply a soft actor critic (SAC) algorithm to execute as a container on AWS Sagemaker. However, full execution would take about 50 hours and I didn’t want to spend the money! I looked for ways to reduce the computation time, but eventually gave up due to time constraints; this article was already taking long enough to get out with my job, military reserve duty, family visits over the holidays, civic and church involvement, and not leaving my wife to take care of our baby boy alone!

First, we will discuss the diffusion equation:
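∂u/∂t = μ Δu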

with u the scalar field, x a two-dimensional Cartesian vector, and Δ the Laplace operator. As mentioned, this is a simple second-order (second derivative) linear partial differential equation in time and two-dimensional space. μ is the diffusion coefficient, which determines how fast effects travel through the system. The diffusion equation tends to wash out (diffuse!) effects on the boundaries throughout the domain and exhibits stable dynamics. The PDE is implemented as shown below with grid, equation, boundary conditions, initial conditions, and target distribution:

import numpy as np
import pde
from pde import DiffusionPDE, ScalarField

# 20x20 grid on the unit square; the x boundaries are controlled, y is periodic
grid = pde.CartesianGrid([[0, 1], [0, 1]], [20, 20], periodic=[False, True])
state = ScalarField.random_uniform(grid, 0.0, 0.2)
bc_left = {"value": 0}
bc_right = {"value": 0}
bc_x = [bc_left, bc_right]
bc_y = "periodic"
eq = DiffusionPDE(diffusivity=0.1, bc=[bc_x, bc_y])
solver = pde.ExplicitSolver(eq, scheme="euler", adaptive=True)
stepper = solver.make_stepper(state, dt=1e-3)
# sinusoidal target distribution along the central column
target = 1.0 * np.sin(2 * grid.axes_coords[1] * 3.14159265)

The problem is sensitive to diffusion coefficient and domain size; mismatch between these two results in washing out control inputs before they can reach the target region unless calculated over a long simulation time. The control input was updated and reward evaluated every 0.1 timestep up to an end time of T=15. 

Due to py-pde package architecture, the control is applied to one column inside the boundary. Structuring the py-pde package to execute with the boundary condition updated each time step resulted in a memory leak, and the py-pde developer advised using a stepper function as a work-around that doesn’t allow updating the boundary condition. This means the results aren’t exactly physical, but do display the basic principle of PDE control with reinforcement learning. 
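To make the control loop concrete, here is a minimal sketch of a single control-and-reward step. It assumes the stepper returned by make_stepper advances the field in place between two times; the column indices and the way the control vector is split between the two sides are illustrative assumptions rather than the exact setup used for the results below.

import numpy as np

def control_step(stepper, state, target, control, t, dt=0.1):
    """Apply the control, advance the PDE by dt, and compute the RMS-based reward."""
    ny = state.data.shape[1]
    # Write the control values into the columns just inside the left and
    # right boundaries (the stepper work-around prevents updating the
    # boundary condition itself at each step).
    state.data[1, :] = control[:ny]
    state.data[-2, :] = control[ny:]
    # Advance the PDE from t to t + dt (the stepper mutates `state` in place).
    t = stepper(state, t, t + dt)
    # Reward: negative root mean square error between the central column
    # and the target distribution.
    error = state.data[state.data.shape[0] // 2, :] - target
    return t, -np.sqrt(np.mean(error ** 2))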

The GP algorithm was able to arrive at a final reward (sum mean square error of all 20 points in the central column) of about 2.0 after about 30 iterations with a 500 tree forest. The results are shown below as target and achieved distributed in the target region.

Figure 1: Diffusion equation, green target distribution, red achieved. Provided by author.

Now the more interesting and complex K-S equation:
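∂u/∂t = −|∇u|²/2 − Δ(u + Δu)

This is the form implemented in the code below, i.e. the standard Kuramoto–Sivashinsky equation with its second- and fourth-order diffusion terms.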

Unlike the diffusion equation, the K-S equation displays rich dynamics (as befitting an equation describing flame behavior!). Solutions may include stable equilibria or travelling waves, but with increasing domain size all solutions will eventually become chaotic. The PDE implementation is given by below code:

import numpy as np
import pde
from pde import PDE, ScalarField

grid = pde.CartesianGrid([[0, 10], [0, 10]], [20, 20], periodic=[True, True])
state = ScalarField.random_uniform(grid, 0.0, 0.5)
bc_y = "periodic"
bc_x = "periodic"
# K-S dynamics: du/dt = -|grad(u)|^2 / 2 - laplace(u) - laplace(laplace(u))
eq = PDE({"u": "-gradient_squared(u) / 2 - laplace(u + laplace(u))"}, bc=[bc_x, bc_y])
solver = pde.ExplicitSolver(eq, scheme="euler", adaptive=True)
stepper = solver.make_stepper(state, dt=1e-3)
target = 1.0 * np.sin(0.25 * grid.axes_coords[1] * 3.14159265)

Control inputs are capped at +/-5. The K-S equation is naturally unstable; if any point in the domain exceeds +/- 30 the iteration terminates with a large negative reward for causing the system to diverge. Experiments with the K-S equation in py-pde revealed strong sensitivity to domain size and number of grid points. The equation was run for T=35, both with control and reward update at dt=0.1.
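A hedged sketch of that clipping and divergence check follows; the size of the divergence penalty is an assumption.

import numpy as np

def clip_and_check(control, state_data, divergence_penalty=-1000.0):
    """Cap control inputs at +/-5 and flag divergence beyond +/-30."""
    control = np.clip(control, -5.0, 5.0)
    diverged = bool(np.abs(state_data).max() > 30)
    reward = divergence_penalty if diverged else 0.0
    return control, reward, diverged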

For each, the GP algorithm had more trouble arriving at a solution than in the diffusion equation. I chose to manually stop execution when the solution became visually close; again, we are looking for general principles here. For the more complex system, the controller works better, likely because the K-S equation is so dynamic that the controller can have a bigger impact. However, when evaluating the solution for different run times, I found it was not stable; the algorithm learned to arrive at the target distribution at a particular time, not to stabilize at that solution. The algorithm converged to the below solution, but, as the successive time steps show, the solution is unstable and begins to diverge with increasing time steps.

Figure 2: K-S equation Green target; yellow, red, magenta, cyan, blue for T = 10, 20, 30, 40. Provided by author.

Careful tuning of the reward function would help obtain a solution that holds longer, reinforcing how vital a correct reward function is. Also, in all these cases we aren’t arriving at perfect solutions; but, especially for the K-S equation, we are getting decent solutions with comparatively little effort compared to non-RL approaches for tackling these sorts of problems.

The GP approach takes longer to solve more complex problems and has trouble handling large input variable sets. With larger input sets, the equations it generates become longer, which makes them less interpretable and slower to compute. Solution equations had scores of terms rather than the dozen or so in ODE systems. Neural network approaches can handle large input variable sets more easily as input variables only directly impact the size of the input layer. Further, I suspect that neural networks will be able to handle more complex and larger problems better, for reasons discussed in previous posts. Because of that, I did develop gymnasiums for py-pde diffusion, which can easily be adapted to other PDEs per the py-pde documentation. These gymnasiums can be used with different NN-based reinforcement learning algorithms such as the SAC algorithm I developed (which, as discussed, runs but takes time).

Adjustments could also be made to the genetic programming approach. For example, vector representation of inputs could reduce the size of solution equations. Duriez et al.1 propose using the Laplace transform to introduce derivatives and integrals into the genetic programming equations, broadening the function spaces they can explore.

The ability to tackle more complex problems is important. As discussed above, PDEs can describe a wide range of complex phenomena. Currently, controlling these systems usually means lumping parameters. Doing so leaves out dynamics, and so we end up working against such systems rather than with them. Efforts to control or manage them then mean higher control effort, missed efficiencies, and increased risk of failure (small or catastrophic). Better understanding of, and control alternatives for, PDE systems could unlock major gains in engineering fields where marginal improvements have been the standard, such as traffic, supply chains, and nuclear fusion, as these systems behave as high-dimensional distributed parameter systems. They are highly complex, with nonlinear and emergent phenomena, but have large available data sets, making them ideal for machine learning to move past current barriers in understanding and optimization.

For now, I have only taken a very basic look at applying ML to controlling PDEs. Follow ons to the control problem include not just different systems, but optimizing where in the domain the control is applied, experimenting with reduced-order observation space, and optimizing the control for simplicity or control effort. In addition to improved control efficiency, as discussed in Brunton and Kutz2, machine learning can also be used to derive data-based models of complex physical systems and to determine reduced order models which reduce state space size and may be more amenable to analysis and control, by traditional or machine learning methods. Machine learning and PDEs is an exciting area of research, and I encourage you to see what the professionals are doing!

  1. Duriez, Thomas., Steven L Brunton, and Bernd R. Noack. Machine Learning Control–Taming Nonlinear Dynamics and Turbulence. ↩
  2. Brunton, Steven., and J. Nathan Kutz. Data Driven Science and Engineering. ↩

The post Reinforcement Learning with PDEs appeared first on Towards Data Science.

]]>
Multimodal Search Engine Agents Powered by BLIP-2 and Gemini https://towardsdatascience.com/multimodal-search-engine-agents-powered-by-blip-2-and-gemini/ Wed, 19 Feb 2025 22:01:52 +0000 https://towardsdatascience.com/?p=598147 This post was co-authored with Rafael Guedes. Introduction Traditional models can only process a single type of data, such as text, images, or tabular data. Multimodality is a trending concept in the AI research community, referring to a model’s ability to learn from multiple types of data simultaneously. This new technology (not really new, but […]

The post Multimodal Search Engine Agents Powered by BLIP-2 and Gemini appeared first on Towards Data Science.

]]>
This post was co-authored with Rafael Guedes.

Introduction

Traditional models can only process a single type of data, such as text, images, or tabular data. Multimodality is a trending concept in the AI research community, referring to a model’s ability to learn from multiple types of data simultaneously. This new technology (not really new, but significantly improved in the last few months) has numerous potential applications that will transform the user experience of many products.

One good example would be the new way search engines will work in the future, where users can input queries using a combination of modalities, such as text, images, audio, etc. Another example could be improving AI-powered customer support systems for voice and text inputs. In e-commerce, they are enhancing product discovery by allowing users to search using images and text. We will use the latter as our case study in this article.

The frontier AI research labs are shipping several models that support multiple modalities every month. CLIP and DALL-E by OpenAI and BLIP-2 by Salesforce combine image and text. ImageBind by Meta expanded the multiple modality concept to six modalities (text, audio, depth, thermal, image, and inertial measurement units).

In this article, we will explore BLIP-2 by explaining its architecture, the way its loss function works, and its training process. We also present a practical use case that combines BLIP-2 and Gemini to create a multimodal fashion search agent that can assist customers in finding the best outfit based on either text or text and image prompts.

Figure 1: Multimodal Search Agent (image by author with Gemini)

As always, the code is available on our GitHub.

BLIP-2: a multimodal model

BLIP-2 (Bootstrapped Language-Image Pre-Training) [1] is a vision-language model designed to solve tasks such as visual question answering or multimodal reasoning based on inputs of both modalities: image and text. As we will see below, this model was developed to address two main challenges in the vision-language domain:

  1. Reduce computational cost using frozen pre-trained visual encoders and LLMs, drastically reducing the training resources needed compared to a joint training of vision and language networks.
  2. Improving visual-language alignment by introducing Q-Former. Q-Former brings the visual and textual embeddings closer, leading to improved reasoning task performance and the ability to perform multimodal retrieval.

Architecture

The architecture of BLIP-2 follows a modular design that integrates three modules:

  1. Visual Encoder is a frozen visual model, such as ViT, that extracts visual embeddings from the input images (which are then used in downstream tasks).
  2. Querying Transformer (Q-Former) is the key to this architecture. It consists of a trainable lightweight transformer that acts as an intermediate layer between the visual and language models. It is responsible for generating contextualized queries from the visual embeddings so that they can be processed effectively by the language model.
  3. LLM is a frozen pre-trained LLM that processes refined visual embeddings to generate textual descriptions or answers.
Figure 2: BLIP-2 architecture (image by author)

Loss Functions

BLIP-2 has three loss functions to train the Q-Former module:

  • Image-text contrastive loss [2] enforces the alignment between visual and text embeddings by maximizing the similarity of paired image-text representations while pushing apart dissimilar pairs.
  • Image-text matching loss [3] is a binary classification loss that aims to make the model learn fine-grained alignments by predicting whether a text description matches the image (positive, i.e., target=1) or not (negative, i.e., target=0).
  • Image-grounded text generation loss [4] is a cross-entropy loss used in LLMs to predict the probability of the next token in the sequence. The Q-Former architecture does not allow interactions between the image embeddings and the text tokens; therefore, the text must be generated based solely on the visual information, forcing the model to extract relevant visual features.

For both image-text contrastive loss and image-text matching loss, the authors used in-batch negative sampling, which means that if we have a batch size of 512, each image-text pair has one positive sample and 511 negative samples. This approach increases efficiency since negative samples are taken from the batch, and there is no need to search the entire dataset. It also provides a more diverse set of comparisons, leading to a better gradient estimation and faster convergence.

Figure 3: Training losses explained (image by author)
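As an illustration, here is a simplified sketch of the in-batch image-text contrastive loss. BLIP-2’s Q-Former actually produces multiple query embeddings per image and takes the maximum query-text similarity; here each image is collapsed to a single embedding for clarity.

import torch
import torch.nn.functional as F

def itc_loss(image_embeds, text_embeds, temperature=0.07):
    """In-batch contrastive loss: each pair's match is the positive, the rest are negatives."""
    image_embeds = F.normalize(image_embeds, dim=-1)      # (B, D)
    text_embeds = F.normalize(text_embeds, dim=-1)        # (B, D)
    logits = image_embeds @ text_embeds.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.T, targets)         # text -> image direction
    return (loss_i2t + loss_t2i) / 2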

Training Process

The training of BLIP-2 consists of two stages:

Stage 1 – Bootstrapping visual-language representation:

  1. The model receives images as input that are converted to an embedding using the frozen visual encoder.
  2. Together with these images, the model receives their text descriptions, which are also converted into embedding.
  3. The Q-Former is trained using image-text contrastive loss, ensuring that the visual embeddings align closely with their corresponding textual embeddings and get further away from the non-matching text descriptions. At the same time, the image-text matching loss helps the model develop fine-grained representations by learning to classify whether a given text correctly describes the image or not.
Figure 4: Stage 1 training process (image by author)

Stage 2 – Bootstrapping vision-to-language generation:

  1. The pre-trained language model is integrated into the architecture to generate text based on the previously learned representations.
  2. The focus shifts from alignment to text generation by using the image-grounded text generation loss which improves the model capabilities of reasoning and text generation.
Figure 5: Stage 2 training process (image by the author)

Creating a Multimodal Fashion Search Agent using BLIP-2 and Gemini

In this section, we will leverage the multimodal capabilities of BLIP-2 to build a fashion assistant search agent that can receive input text and/or images and return recommendations. For the conversation capabilities of the agent, we will use Gemini 1.5 Pro hosted in Vertex AI, and for the interface, we will build a Streamlit app.

The fashion dataset used in this use case is licensed under the MIT license and can be accessed through the following link: Fashion Product Images Dataset. It consists of more than 44k images of fashion products.

The first step to make this possible is to set up a Vector DB. This enables the agent to perform a vectorized search based on the image embeddings of the items available in the store and the text or image embeddings from the input. We use docker and docker-compose to help us set up the environment:

  • Docker-Compose with Postgres (the database) and the PGVector extension that allows vectorized search.
services:
  postgres:
    container_name: container-pg
    image: ankane/pgvector
    hostname: localhost
    ports:
      - "5432:5432"
    env_file:
      - ./env/postgres.env
    volumes:
      - postgres-data:/var/lib/postgresql/data
    restart: unless-stopped

  pgadmin:
    container_name: container-pgadmin
    image: dpage/pgadmin4
    depends_on:
      - postgres
    ports:
      - "5050:80"
    env_file:
      - ./env/pgadmin.env
    restart: unless-stopped

volumes:
  postgres-data:
  • Postgres env file with the variables to log into the database.
POSTGRES_DB=postgres
POSTGRES_USER=admin
POSTGRES_PASSWORD=root
  • Pgadmin env file with the variables to log into the UI for manual querying the database (optional).
PGADMIN_DEFAULT_EMAIL=admin@admin.com 
PGADMIN_DEFAULT_PASSWORD=root
  • Connection env file with all the components to use to connect to PGVector using Langchain.
DRIVER=psycopg
HOST=localhost
PORT=5432
DATABASE=postgres
USERNAME=admin
PASSWORD=root

Once the Vector DB is set up and running (docker-compose up -d), it is time to create the agents and tools to perform a multimodal search. We build two agents to solve this use case: one to understand what the user is requesting and another one to provide the recommendation:

  • The classifier is responsible for receiving the input message from the customer and extracting which category of clothes the user is looking for, for example, t-shirts, pants, shoes, jerseys, or shirts. It will also return the number of items the customer wants so that we can retrieve the exact number from the Vector DB.
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_google_vertexai import ChatVertexAI
from pydantic import BaseModel, Field

class ClassifierOutput(BaseModel):
    """
    Data structure for the model's output.
    """

    category: list = Field(
        description="A list of clothes category to search for ('t-shirt', 'pants', 'shoes', 'jersey', 'shirt')."
    )
    number_of_items: int = Field(description="The number of items we should retrieve.")

class Classifier:
    """
    Classifier class for classification of input text.
    """

    def __init__(self, model: ChatVertexAI) -> None:
        """
        Initialize the Chain class by creating the chain.
        Args:
            model (ChatVertexAI): The LLM model.
        """
        super().__init__()

        parser = PydanticOutputParser(pydantic_object=ClassifierOutput)

        text_prompt = """
        You are a fashion assistant expert on understanding what a customer needs and on extracting the category or categories of clothes a customer wants from the given text.
        Text:
        {text}

        Instructions:
        1. Read carefully the text.
        2. Extract the category or categories of clothes the customer is looking for, it can be:
            - t-shirt if the customer is looking for a t-shirt.
            - pants if the customer is looking for pants.
            - jacket if the customer is looking for a jacket.
            - shoes if the customer is looking for shoes.
            - jersey if the customer is looking for a jersey.
            - shirt if the customer is looking for a shirt.
        3. If the customer is looking for multiple items of the same category, return the number of items we should retrieve. If not specified but the user asked for more than 1, return 2.
        4. If the customer is looking for multiple categories, the number of items should be 1.
        5. Return a valid JSON with the categories found, the key must be 'category' and the value must be a list with the categories found and 'number_of_items' with the number of items we should retrieve.

        Provide the output as a valid JSON object without any additional formatting, such as backticks or extra text. Ensure the JSON is correctly structured according to the schema provided below.
        {format_instructions}

        Answer:
        """

        prompt = PromptTemplate.from_template(
            text_prompt, partial_variables={"format_instructions": parser.get_format_instructions()}
        )
        self.chain = prompt | model | parser

    def classify(self, text: str) -> ClassifierOutput:
        """
        Get the category from the model based on the text context.
        Args:
            text (str): user message.
        Returns:
            ClassifierOutput: The model's answer.
        """
        try:
            return self.chain.invoke({"text": text})
        except Exception as e:
            raise RuntimeError(f"Error invoking the chain: {e}")
  • The assistant is responsible for answering with a personalized recommendation retrieved from the Vector DB. In this case, we are also leveraging the multimodal capabilities of Gemini to analyze the images retrieved and produce a better answer.
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_google_vertexai import ChatVertexAI
from pydantic import BaseModel, Field

class AssistantOutput(BaseModel):
    """
    Data structure for the model's output.
    """

    answer: str = Field(description="A string with the fashion advice for the customer.")

class Assistant:
    """
    Assistant class for providing fashion advice.
    """

    def __init__(self, model: ChatVertexAI) -> None:
        """
        Initialize the Chain class by creating the chain.
        Args:
            model (ChatVertexAI): The LLM model.
        """
        super().__init__()

        parser = PydanticOutputParser(pydantic_object=AssistantOutput)

        text_prompt = """
        You work for a fashion store and you are a fashion assistant expert on understanding what a customer needs.
        Based on the items that are available in the store and the customer message below, provide a fashion advice for the customer.
        Number of items: {number_of_items}
        
        Images of items:
        {items}

        Customer message:
        {customer_message}

        Instructions:
        1. Check carefully the images provided.
        2. Read carefully the customer needs.
        3. Provide a fashion advice for the customer based on the items and customer message.
        4. Return a valid JSON with the advice, the key must be 'answer' and the value must be a string with your advice.

        Provide the output as a valid JSON object without any additional formatting, such as backticks or extra text. Ensure the JSON is correctly structured according to the schema provided below.
        {format_instructions}

        Answer:
        """

        prompt = PromptTemplate.from_template(
            text_prompt, partial_variables={"format_instructions": parser.get_format_instructions()}
        )
        self.chain = prompt | model | parser

    def get_advice(self, text: str, items: list, number_of_items: int) -> AssistantOutput:
        """
        Get advice from the model based on the text and items context.
        Args:
            text (str): user message.
            items (list): items found for the customer.
            number_of_items (int): number of items to be retrieved.
        Returns:
            AssistantOutput: The model's answer.
        """
        try:
            return self.chain.invoke({"customer_message": text, "items": items, "number_of_items": number_of_items})
        except Exception as e:
            raise RuntimeError(f"Error invoking the chain: {e}")

In terms of tools, we define one based on BLIP-2. It consists of a function that receives a text or image as input and returns normalized embeddings. Depending on the input, the embeddings are produced using the text embedding model or the image embedding model of BLIP-2.

from typing import Optional

import numpy as np
import torch
import torch.nn.functional as F
from PIL import Image
from PIL.JpegImagePlugin import JpegImageFile
from transformers import AutoProcessor, Blip2TextModelWithProjection, Blip2VisionModelWithProjection

PROCESSOR = AutoProcessor.from_pretrained("Salesforce/blip2-itm-vit-g")
TEXT_MODEL = Blip2TextModelWithProjection.from_pretrained("Salesforce/blip2-itm-vit-g", torch_dtype=torch.float32).to(
    "cpu"
)
IMAGE_MODEL = Blip2VisionModelWithProjection.from_pretrained(
    "Salesforce/blip2-itm-vit-g", torch_dtype=torch.float32
).to("cpu")

def generate_embeddings(text: Optional[str] = None, image: Optional[JpegImageFile] = None) -> np.ndarray:
    """
    Generate embeddings from text or image using the Blip2 model.
    Args:
        text (Optional[str]): customer input text
        image (Optional[Image]): customer input image
    Returns:
        np.ndarray: embedding vector
    """
    if text:
        inputs = PROCESSOR(text=text, return_tensors="pt").to("cpu")
        outputs = TEXT_MODEL(**inputs)
        embedding = F.normalize(outputs.text_embeds, p=2, dim=1)[:, 0, :].detach().numpy().flatten()
    else:
        inputs = PROCESSOR(images=image, return_tensors="pt").to("cpu", torch.float16)
        outputs = IMAGE_MODEL(**inputs)
        embedding = F.normalize(outputs.image_embeds, p=2, dim=1).mean(dim=1).detach().numpy().flatten()

    return embedding

Note that we create the connection to PGVector with a placeholder embedding model because one is required, although it will not be used, since we store the embeddings produced by BLIP-2 directly.

In the loop below, we iterate over all categories of clothes, load the images, and create the embeddings to be stored in the vector DB, appending them to a list. We also store the path to the image as text so that we can render it in our Streamlit app. Finally, we store the category to filter the results based on the category predicted by the classifier agent.

import glob
import os

from dotenv import load_dotenv
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_postgres.vectorstores import PGVector
from PIL import Image

from blip2 import generate_embeddings

load_dotenv("env/connection.env")

CONNECTION_STRING = PGVector.connection_string_from_db_params(
    driver=os.getenv("DRIVER"),
    host=os.getenv("HOST"),
    port=os.getenv("PORT"),
    database=os.getenv("DATABASE"),
    user=os.getenv("USERNAME"),
    password=os.getenv("PASSWORD"),
)

vector_db = PGVector(
    embeddings=HuggingFaceEmbeddings(model_name="nomic-ai/modernbert-embed-base"),  # does not matter for our use case
    collection_name="fashion",
    connection=CONNECTION_STRING,
    use_jsonb=True,
)

if __name__ == "__main__":

    # generate image embeddings
    # save path to image in text
    # save category in metadata
    texts = []
    embeddings = []
    metadatas = []

    for category in glob.glob("images/*"):
        cat = category.split("/")[-1]
        for img in glob.glob(f"{category}/*"):
            texts.append(img)
            embeddings.append(generate_embeddings(image=Image.open(img)).tolist())
            metadatas.append({"category": cat})

    vector_db.add_embeddings(texts, embeddings, metadatas)

We can now build our Streamlit app to chat with our assistant and ask for recommendations. The chat starts with the agent asking how it can help and providing a box for the customer to write a message and/or to upload a file.

Once the customer replies, the workflow is the following:

  • The classifier agent identifies which categories of clothes the customer is looking for and how many units they want.
  • If the customer uploads a file, this file is going to be converted into an embedding, and we will look for similar items in the vector db, conditioned by the category of clothes the customer wants and the number of units.
  • The items retrieved and the customer’s input message are then sent to the assistant agent to produce the recommendation message that is rendered together with the images retrieved.
  • If the customer did not upload a file, the process is the same, but instead of generating image embeddings for retrieval, we create text embeddings.
import os

import streamlit as st
from dotenv import load_dotenv
from langchain_google_vertexai import ChatVertexAI
from langchain_huggingface.embeddings import HuggingFaceEmbeddings
from langchain_postgres.vectorstores import PGVector
from PIL import Image

import utils
from assistant import Assistant
from blip2 import generate_embeddings
from classifier import Classifier

load_dotenv("env/connection.env")
load_dotenv("env/llm.env")

CONNECTION_STRING = PGVector.connection_string_from_db_params(
    driver=os.getenv("DRIVER"),
    host=os.getenv("HOST"),
    port=os.getenv("PORT"),
    database=os.getenv("DATABASE"),
    user=os.getenv("USERNAME"),
    password=os.getenv("PASSWORD"),
)

vector_db = PGVector(
    embeddings=HuggingFaceEmbeddings(model_name="nomic-ai/modernbert-embed-base"),  # does not matter for our use case
    collection_name="fashion",
    connection=CONNECTION_STRING,
    use_jsonb=True,
)

model = ChatVertexAI(model_name=os.getenv("MODEL_NAME"), project=os.getenv("PROJECT_ID"), temperature=0.0)
classifier = Classifier(model)
assistant = Assistant(model)

st.title("Welcome to ZAAI's Fashion Assistant")

user_input = st.text_input("Hi, I'm ZAAI's Fashion Assistant. How can I help you today?")

uploaded_file = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])

if st.button("Submit"):

    # understand what the user is asking for
    classification = classifier.classify(user_input)

    if uploaded_file:

        image = Image.open(uploaded_file)
        image.save("input_image.jpg")
        embedding = generate_embeddings(image=image)

    else:

        # create text embeddings in case the user does not upload an image
        embedding = generate_embeddings(text=user_input)

    # create a list of items to be retrieved and the path
    retrieved_items = []
    retrieved_items_path = []
    for item in classification.category:
        clothes = vector_db.similarity_search_by_vector(
            embedding, k=classification.number_of_items, filter={"category": {"$in": [item]}}
        )
        for clothe in clothes:
            retrieved_items.append({"bytesBase64Encoded": utils.encode_image_to_base64(clothe.page_content)})
            retrieved_items_path.append(clothe.page_content)

    # get assistant's recommendation
    assistant_output = assistant.get_advice(user_input, retrieved_items, len(retrieved_items))
    st.write(assistant_output.answer)

    # render the uploaded image (when provided) alongside the retrieved items
    displayed_images = (["input_image.jpg"] if uploaded_file else []) + retrieved_items_path
    if displayed_images:
        cols = st.columns(len(displayed_images))
        for col, displayed_image in zip(cols, displayed_images):
            col.image(displayed_image)

    # follow-up input box so the customer can continue the conversation
    user_input = st.text_input("Anything else I can help you with?", key="follow_up")

else:
    st.warning("Please provide a message and/or upload an image, then press Submit.")
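
Assuming the app script above is saved as app.py (a hypothetical filename), it can be launched locally with streamlit run app.py once the vector db has been populated and the environment files are in place.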

Two example interactions are shown below.

Figure 6 shows an example where the customer uploaded an image of a red t-shirt and asked the agent to complete the outfit.

Figure 6: Example of text and image input (image by author)

Figure 7 shows a more straightforward example where the customer asked the agent to show them black t-shirts.

Figure 7: Example of text input (image by author)

Conclusion

Multimodal AI is no longer just a research topic. It is being used in the industry to reshape the way customers interact with company catalogs. In this article, we explored how multimodal models like BLIP-2 and Gemini can be combined to address real-world problems and provide a more personalized experience to customers in a scalable way.

We explored the architecture of BLIP-2 in depth, demonstrating how it bridges the gap between text and image modalities. To extend its capabilities, we developed a system of agents, each specializing in a different task. This system integrates an LLM (Gemini) with a vector database, enabling retrieval from the product catalog using both text and image embeddings. We also leveraged Gemini’s multimodal reasoning to make the sales assistant agent’s responses more human-like.

With tools like BLIP-2, Gemini, and PGVector, multimodal search and retrieval is already here, and the search engines of the future will look very different from the ones we use today.

About me

Serial entrepreneur and leader in the AI space. I develop AI products for businesses and invest in AI-focused startups.

Founder @ ZAAI | LinkedIn | X/Twitter

