How to Spot and Prevent Model Drift Before it Impacts Your Business

3 essential methods to track model drift you should know

Despite the AI hype, many tech companies still rely heavily on machine learning to power critical applications, from personalized recommendations to fraud detection. 

I’ve seen firsthand how undetected drift can result in significant costs — missed fraud detection, lost revenue, and suboptimal business outcomes, just to name a few. So, it’s crucial to have robust monitoring in place if your company has deployed or plans to deploy machine learning models into production.

Undetected model drift can lead to significant financial losses, operational inefficiencies, and even damage to a company’s reputation. To mitigate these risks, it’s important to have effective model monitoring, which involves:

  • Tracking model performance
  • Monitoring feature distributions
  • Detecting both univariate and multivariate drifts

A well-implemented monitoring system can help identify issues early, saving considerable time, money, and resources.

In this comprehensive guide, I’ll provide a framework for how to think about and implement effective model monitoring, helping you stay ahead of potential issues and ensure the stability and reliability of your models in production.

What’s the difference between feature drift and score drift?

Score drift refers to a gradual change in the distribution of model scores. If left unchecked, this could lead to a decline in model performance, making the model less accurate over time.

On the other hand, feature drift occurs when one or more features experience changes in the distribution. These changes in feature values can affect the underlying relationships that the model has learned, and ultimately lead to inaccurate model predictions.

Simulating score shifts

To model real-world fraud detection challenges, I created a synthetic dataset with five financial transaction features.

The reference dataset represents the original distribution, while the production dataset introduces shifts to simulate an increase in high-value transactions without PIN verification on newer accounts, indicating an increase in fraud.

Each feature has different underlying distributions:

  • Transaction Amount: Log-normal distribution (right-skewed with a long tail)
  • Account Age (months): Clipped normal distribution between 0 and 60 (assuming a 5-year-old company)
  • Time Since Last Transaction: Exponential distribution
  • Transaction Count: Poisson distribution
  • Entered PIN: Binomial distribution

To approximate model scores, I randomly assigned weights to these features and applied a sigmoid function to constrain predictions between 0 and 1. This mimics how a logistic regression fraud model generates risk scores.
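
If you want to reproduce a similar setup, here is a minimal sketch of how datasets like these could be simulated. The distribution parameters, weights, and column names below are illustrative assumptions rather than the exact values behind the plots in this article.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

def simulate_transactions(n=10_000, fraud_shift=False):
    # Feature distributions described above
    amount = rng.lognormal(mean=5.0, sigma=1.0, size=n)        # right-skewed with a long tail
    account_age = np.clip(rng.normal(30, 12, size=n), 0, 60)   # months, clipped to 0-60
    time_since_last = rng.exponential(scale=7.0, size=n)       # stays stable in both datasets
    txn_count = rng.poisson(lam=5, size=n)
    entered_pin = rng.binomial(n=1, p=0.85, size=n)

    if fraud_shift:
        # More high-value transactions on newer accounts without PIN verification
        amount = amount * 1.5
        account_age = np.clip(account_age - 10, 0, 60)
        txn_count = rng.poisson(lam=7, size=n)
        entered_pin = rng.binomial(n=1, p=0.60, size=n)

    return pd.DataFrame({
        'transaction_amount': amount,
        'account_age_in_months': account_age,
        'time_since_last_transaction': time_since_last,
        'transaction_count': txn_count,
        'entered_pin': entered_pin,
    })

numeric_cols = ['transaction_amount', 'account_age_in_months',
                'time_since_last_transaction', 'transaction_count']
categorical_cols = ['entered_pin']

ref_data = simulate_transactions(fraud_shift=False)
prod_data = simulate_transactions(fraud_shift=True)

# Approximate a fraud model: fixed random weights plus a sigmoid to keep scores in [0, 1]
feature_cols = numeric_cols + categorical_cols
weights = np.random.default_rng(0).normal(size=len(feature_cols))
ref_mean, ref_std = ref_data[feature_cols].mean(), ref_data[feature_cols].std()

def score(df):
    z = (df[feature_cols] - ref_mean) / ref_std   # standardize against the reference data
    return 1 / (1 + np.exp(-(z.values @ weights)))

ref_data['model_score'] = score(ref_data)
prod_data['model_score'] = score(prod_data)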

As shown in the plot below:

  • Drifted features: Transaction Amount, Account Age, Transaction Count, and Entered PIN all experienced shifts in distribution, scale, or relationships.
Distribution of drifted features (image by author)
  • Stable feature: Time Since Last Transaction remained unchanged.
Distribution of stable feature (image by author)
  • Drifted scores: As a result of the drifted features, the distribution in model scores has also changed.
Distribution of model scores (image by author)

This setup allows us to analyze how feature drift impacts model scores in production.

Detecting model score drift using PSI

To monitor model scores, I used the Population Stability Index (PSI) to measure how much the model score distribution has shifted over time.

PSI works by binning continuous model scores and comparing the proportion of scores in each bin between the reference and production datasets. For each bin, it multiplies the difference in proportions by the logarithm of their ratio, then sums these values into a single statistic that quantifies the drift.

Python implementation:

import numpy as np

# Define function to calculate PSI given two datasets
def calculate_psi(reference, production, bins=10):
  # Discretize scores into bins
  min_val, max_val = 0, 1
  bin_edges = np.linspace(min_val, max_val, bins + 1)

  # Calculate proportions in each bin
  ref_counts, _ = np.histogram(reference, bins=bin_edges)
  prod_counts, _ = np.histogram(production, bins=bin_edges)

  ref_proportions = ref_counts / len(reference)
  prod_proportions = prod_counts / len(production)
  
  # Avoid division by zero
  ref_proportions = np.clip(ref_proportions, 1e-8, 1)
  prod_proportions = np.clip(prod_proportions, 1e-8, 1)

  # Calculate PSI for each bin
  psi = np.sum((ref_proportions - prod_proportions) * np.log(ref_proportions / prod_proportions))

  return psi
  
# Calculate PSI
psi_value = calculate_psi(ref_data['model_score'], prod_data['model_score'], bins=10)
print(f"PSI Value: {psi_value}")

Below is a summary of how to interpret PSI values:

  • PSI < 0.1: No drift, or very minor drift (distributions are almost identical).
  • 0.1 ≤ PSI < 0.25: Some drift. The distributions are somewhat different.
  • 0.25 ≤ PSI < 0.5: Moderate drift. A noticeable shift between the reference and production distributions.
  • PSI ≥ 0.5: Significant drift. There is a large shift, indicating that the distribution in production has changed substantially from the reference data.
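
If you want to turn these thresholds into an automated check, a small helper like the one below does the job; the cutoffs are simply the ones from the list above.

def interpret_psi(psi):
    # Cutoffs follow the interpretation guide above
    if psi < 0.1:
        return "No or very minor drift"
    elif psi < 0.25:
        return "Some drift"
    elif psi < 0.5:
        return "Moderate drift"
    return "Significant drift"

print(interpret_psi(psi_value))  # psi_value computed in the snippet above
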
Histogram of model score distributions (image by author)

The PSI value of 0.6374 suggests a significant drift between our reference and production datasets. This aligns with the histogram of model score distributions, which visually confirms the shift towards higher scores in production — indicating an increase in risky transactions.

Detecting feature drift

Kolmogorov-Smirnov test for numeric features

The Kolmogorov-Smirnov (K-S) test is my preferred method for detecting drift in numeric features, because it is non-parametric, meaning it doesn’t assume a normal distribution.

The test compares a feature’s distribution in the reference and production datasets by measuring the maximum difference between the empirical cumulative distribution functions (ECDFs). The resulting K-S statistic ranges from 0 to 1:

  • 0 indicates no difference between the two distributions.
  • Values closer to 1 suggest a greater shift.

Python implementation:

import pandas as pd
from scipy.stats import ks_2samp

# Create an empty dataframe to hold the results
ks_results = pd.DataFrame(columns=['Feature', 'KS Statistic', 'p-value', 'Drift Detected'])

# Loop through all numeric features and perform the K-S test
for col in numeric_cols:
    ks_stat, p_value = ks_2samp(ref_data[col], prod_data[col])
    drift_detected = p_value < 0.05

    # Store results in the dataframe
    ks_results = pd.concat([
        ks_results,
        pd.DataFrame({
            'Feature': [col],
            'KS Statistic': [ks_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

Below are ECDF charts of the four numeric features in our dataset:

ECDFs of four numeric features (image by author)

Let’s look at the account age feature as an example: the x-axis represents account age (0-50 months), while the y-axis shows the ECDF for both reference and production datasets. The production dataset skews towards newer accounts, as it has a larger proportion of observations with lower account ages.
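
The ECDF charts above are straightforward to reproduce: sort each feature and plot the cumulative proportion of observations. The sketch below uses matplotlib and the column names assumed earlier.

import numpy as np
import matplotlib.pyplot as plt

def plot_ecdf(ax, values, label):
    x = np.sort(values)
    y = np.arange(1, len(x) + 1) / len(x)   # cumulative proportion of observations
    ax.plot(x, y, label=label)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, col in zip(axes.ravel(), numeric_cols):
    plot_ecdf(ax, ref_data[col], label='Reference')
    plot_ecdf(ax, prod_data[col], label='Production')
    ax.set_title(col)
    ax.legend()
plt.tight_layout()
plt.show()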

Chi-Square test for categorical features

To detect shifts in categorical and boolean features, I like to use the Chi-Square test.

This test compares the frequency distribution of a categorical feature in the reference and production datasets, and returns two values:

  • Chi-Square statistic: A higher value indicates a greater shift between the reference and production datasets.
  • P-value: A p-value below 0.05 suggests that the difference between the reference and production datasets is statistically significant, indicating potential feature drift.

Python implementation:

from scipy.stats import chi2_contingency

# Create empty dataframe with corresponding column names
chi2_results = pd.DataFrame(columns=['Feature', 'Chi-Square Statistic', 'p-value', 'Drift Detected'])

for col in categorical_cols:
    # Get normalized value counts for both reference and production datasets
    ref_counts = ref_data[col].value_counts(normalize=True)
    prod_counts = prod_data[col].value_counts(normalize=True)

    # Ensure all categories are represented in both
    all_categories = set(ref_counts.index).union(set(prod_counts.index))
    ref_counts = ref_counts.reindex(all_categories, fill_value=0)
    prod_counts = prod_counts.reindex(all_categories, fill_value=0)

    # Create contingency table
    contingency_table = np.array([ref_counts * len(ref_data), prod_counts * len(prod_data)])

    # Perform Chi-Square test
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    drift_detected = p_value < 0.05

    # Store results in chi2_results dataframe
    chi2_results = pd.concat([
        chi2_results,
        pd.DataFrame({
            'Feature': [col],
            'Chi-Square Statistic': [chi2_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

The Chi-Square statistic of 57.31 with a p-value of 3.72e-14 confirms a large shift in our categorical feature, Entered PIN. This finding aligns with the histogram below, which visually illustrates the shift:

Distribution of categorical feature (image by author)

Detecting multivariate shifts

Spearman Correlation for shifts in pairwise interactions

In addition to monitoring individual feature shifts, it’s important to track shifts in relationships or interactions between features, known as multivariate shifts. Even if the distributions of individual features remain stable, multivariate shifts can signal meaningful differences in the data.

By default, Pandas’ .corr() function calculates Pearson correlation, which only captures linear relationships between variables. However, relationships between features are often non-linear yet still follow a consistent trend.

To capture this, we use Spearman correlation to measure monotonic relationships between features. It captures whether features change together in a consistent direction, even if their relationship isn’t strictly linear.

To assess shifts in feature relationships, we compare:

  • Reference correlation (ref_corr): Captures historical feature relationships in the reference dataset.
  • Production correlation (prod_corr): Captures new feature relationships in production.
  • Absolute difference in correlation: Measures how much feature relationships have shifted between the reference and production datasets. Higher values indicate more significant shifts.

Python implementation:

# Calculate correlation matrices
ref_corr = ref_data.corr(method='spearman')
prod_corr = prod_data.corr(method='spearman')

# Calculate correlation difference
corr_diff = abs(ref_corr - prod_corr)
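
To act on corr_diff, it helps to flatten the matrix and rank feature pairs by how much their relationship moved; the sketch below shows one way to do that (the 0.15 threshold is an illustrative choice, not a standard).

import numpy as np

# Keep the upper triangle only (each pair once, diagonal excluded), then rank the shifts
upper_triangle = np.triu(np.ones(corr_diff.shape, dtype=bool), k=1)
pair_shifts = corr_diff.where(upper_triangle).stack().sort_values(ascending=False)

print(pair_shifts.head(10))                      # most-shifted feature pairs
flagged_pairs = pair_shifts[pair_shifts > 0.15]  # pairs worth investigating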

Example: Change in correlation

Now, let’s look at the correlation between transaction_amount and account_age_in_months:

  • In ref_corr, the correlation is 0.00095, indicating a weak relationship between the two features.
  • In prod_corr, the correlation is -0.0325, indicating a weak negative correlation.
  • Absolute difference in the Spearman correlation is 0.0335, which is a small but noticeable shift.

The absolute difference in correlation indicates a shift in the relationship between transaction_amount and account_age_in_months.

There used to be no relationship between these two features, but the production dataset indicates that there is now a weak negative correlation, meaning that newer accounts have higher transaction amounts. This is spot on!

Autoencoder for complex, high-dimensional multivariate shifts

In addition to monitoring pairwise interactions, we can also look for shifts across more dimensions in the data.

Autoencoders are powerful tools for detecting high-dimensional multivariate shifts, where multiple features collectively change in ways that may not be apparent from looking at individual feature distributions or pairwise correlations.

An autoencoder is a neural network that learns a compressed representation of data through two components:

  • Encoder: Compresses input data into a lower-dimensional representation.
  • Decoder: Reconstructs the original input from the compressed representation.

To detect shifts, we compare the reconstructed output to the original input and compute the reconstruction loss.

  • Low reconstruction loss → The autoencoder successfully reconstructs the data, meaning the new observations are similar to what it has seen and learned.
  • High reconstruction loss → The production data deviates significantly from the learned patterns, indicating potential drift.

Unlike traditional drift metrics that focus on individual features or pairwise relationships, autoencoders capture complex, non-linear dependencies across multiple variables simultaneously.

Python implementation:

import numpy as np
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Select the features used by the autoencoder
ref_features = ref_data[numeric_cols + categorical_cols]
prod_features = prod_data[numeric_cols + categorical_cols]

# Normalize the data
scaler = StandardScaler()
ref_scaled = scaler.fit_transform(ref_features)
prod_scaled = scaler.transform(prod_features)

# Split reference data into train and validation
np.random.shuffle(ref_scaled)
train_size = int(0.8 * len(ref_scaled))
train_data = ref_scaled[:train_size]
val_data = ref_scaled[train_size:]

# Build autoencoder
input_dim = ref_features.shape[1]
encoding_dim = 3 
# Input layer
input_layer = Input(shape=(input_dim, ))
# Encoder
encoded = Dense(8, activation="relu")(input_layer)
encoded = Dense(encoding_dim, activation="relu")(encoded)
# Decoder
decoded = Dense(8, activation="relu")(encoded)
decoded = Dense(input_dim, activation="linear")(decoded)
# Autoencoder
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# Train autoencoder
history = autoencoder.fit(
    train_data, train_data,
    epochs=50,
    batch_size=64,
    shuffle=True,
    validation_data=(val_data, val_data),
    verbose=0
)

# Calculate reconstruction error
ref_pred = autoencoder.predict(ref_scaled, verbose=0)
prod_pred = autoencoder.predict(prod_scaled, verbose=0)

ref_mse = np.mean(np.power(ref_scaled - ref_pred, 2), axis=1)
prod_mse = np.mean(np.power(prod_scaled - prod_pred, 2), axis=1)

The charts below show the distribution of reconstruction loss between both datasets.

Distribution of reconstruction loss between actuals and predictions (image by author)

The production dataset has a higher mean reconstruction error than the reference dataset, indicating a shift in the overall data. This aligns with the changes introduced in the production dataset, which contains a larger share of newer accounts making high-value transactions.
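
One simple way to turn these reconstruction errors into an alert is to set a threshold from the reference distribution and measure how much of the production data exceeds it. The 95th percentile below is an arbitrary starting point you would tune for your own use case.

# Threshold derived from the reference reconstruction errors
threshold = np.percentile(ref_mse, 95)

# By construction, ~5% of reference rows exceed this threshold;
# a materially higher share in production suggests multivariate drift
prod_outlier_rate = np.mean(prod_mse > threshold)

print(f"Reference mean reconstruction error: {ref_mse.mean():.4f}")
print(f"Production mean reconstruction error: {prod_mse.mean():.4f}")
print(f"Share of production rows above the reference 95th percentile: {prod_outlier_rate:.1%}")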

Summarizing

Model monitoring is an essential, yet often overlooked, responsibility for data scientists and machine learning engineers.

All the statistical methods led to the same conclusion, which aligns with the observed shifts in the data: they detected a trend in production towards newer accounts making higher-value transactions. This shift resulted in higher model scores, signaling an increase in potential fraud.

In this post, I covered techniques for detecting drift on three different levels:

  • Model score drift: Using Population Stability Index (PSI)
  • Individual feature drift: Using Kolmogorov-Smirnov test for numeric features and Chi-Square test for categorical features
  • Multivariate drift: Using Spearman correlation for pairwise interactions and autoencoders for high-dimensional, multivariate shifts.

These are just a few of the techniques I rely on for comprehensive monitoring — there are plenty of other equally valid statistical methods that can also detect drift effectively.

Detected shifts often point to underlying issues that warrant further investigation. The root cause could be as serious as a data collection bug, or as minor as a time change like daylight saving time adjustments.

There are also fantastic Python packages, like evidently.ai, that automate many of these comparisons. However, I believe there’s significant value in deeply understanding the statistical techniques behind drift detection, rather than relying solely on these tools.

What’s the model monitoring process like at places you’ve worked?


Want to build your AI skills?

👉🏻 I run the AI Weekender and write weekly blog posts on data science, AI weekend projects, and career advice for data professionals.



Overcome Failing Document Ingestion & RAG Strategies with Agentic Knowledge Distillation

Introducing the pyramid search approach

Introduction

Many generative AI use cases still revolve around Retrieval Augmented Generation (RAG), yet these implementations consistently fall short of user expectations. Despite the growing body of research on RAG improvements and even adding agents into the process, many solutions still fail to return exhaustive results, miss information that is critical but infrequently mentioned in the documents, require multiple search iterations, and generally struggle to reconcile key themes across multiple documents. To top it all off, many implementations still rely on cramming as much “relevant” information as possible into the model’s context window alongside detailed system and user prompts. Reconciling all this information often exceeds the model’s cognitive capacity and compromises response quality and consistency.

This is where our Agentic Knowledge Distillation + Pyramid Search Approach comes into play. Instead of chasing the best chunking strategy, retrieval algorithm, or inference-time reasoning method, my team (Jim Brown, Mason Sawtell, and Sandi Besen) and I take an agentic approach to document ingestion.

We leverage the full capability of the model at ingestion time to focus exclusively on distilling and preserving the most meaningful information from the document dataset. This fundamentally simplifies the RAG process by allowing the model to direct its reasoning abilities toward addressing the user/system instructions rather than struggling to understand formatting and disparate information across document chunks. 

We specifically target high-value questions that are often difficult to evaluate because they have multiple correct answers or solution paths. These cases are where traditional RAG solutions struggle most and existing RAG evaluation datasets are largely insufficient for testing this problem space. For our research implementation, we downloaded annual and quarterly reports from the last year for the 30 companies in the DOW Jones Industrial Average. These documents can be found through the SEC EDGAR website. The information on EDGAR is freely accessible for download or can be queried through EDGAR public searches. See the SEC privacy policy for additional details; information on the SEC website is “considered public information and may be copied or further distributed by users of the web site without the SEC’s permission”. We selected this dataset for two key reasons: first, it falls outside the knowledge cutoff for the models evaluated, ensuring that the models cannot respond to questions based on their knowledge from pre-training; second, it’s a close approximation for real-world business problems while allowing us to discuss and share our findings using publicly available data.

While typical RAG solutions excel at factual retrieval where the answer is easily identified in the document dataset (e.g., “When did Apple’s annual shareholder’s meeting occur?”), they struggle with nuanced questions that require a deeper understanding of concepts across documents (e.g., “Which of the DOW companies has the most promising AI strategy?”). Our Agentic Knowledge Distillation + Pyramid Search Approach addresses these types of questions with much greater success compared to other standard approaches we tested and overcomes limitations associated with using knowledge graphs in RAG systems. 

In this article, we’ll cover how our knowledge distillation process works, key benefits of this approach, examples, and an open discussion on the best way to evaluate these types of systems where, in many cases, there is no singular “right” answer.

Building the pyramid: How Agentic Knowledge Distillation works

Image by author and team depicting pyramid structure for document ingestion. Robots meant to represent agents building the pyramid.

Overview

Our knowledge distillation process creates a multi-tiered pyramid of information from the raw source documents. Our approach is inspired by the pyramids used in deep learning computer vision-based tasks, which allow a model to analyze an image at multiple scales. We take the contents of the raw document, convert it to markdown, and distill the content into a list of atomic insights, related concepts, document abstracts, and general recollections/memories. During retrieval it’s possible to access any or all levels of the pyramid to respond to the user request. 

How to distill documents and build the pyramid: 

  1. Convert documents to Markdown: Convert all raw source documents to Markdown. We’ve found models process Markdown best for this task compared to other formats like JSON, and it is more token-efficient. We used Azure Document Intelligence to generate the Markdown for each page of the document, but there are many other open-source libraries, like MarkItDown, which do the same thing. Our dataset included 331 documents and 16,601 pages.
  2. Extract atomic insights from each page: We process documents using a two-page sliding window, which allows each page to be analyzed twice. This gives the agent the opportunity to correct any potential mistakes when processing the page initially. We instruct the model to create a numbered list of insights that grows as it processes the pages in the document. The agent can overwrite insights from the previous page if they were incorrect since it sees each page twice. We instruct the model to extract insights in simple sentences following the subject-verb-object (SVO) format and to write sentences as if English is the second language of the user. This significantly improves performance by encouraging clarity and precision. Rolling over each page multiple times and using the SVO format also solves the disambiguation problem, which is a huge challenge for knowledge graphs. The insight generation step is also particularly helpful for extracting information from tables since the model captures the facts from the table in clear, succinct sentences. Our dataset produced 216,931 total insights, about 13 insights per page and 655 insights per document.
  3. Distilling concepts from insights: From the detailed list of insights, we identify higher-level concepts that connect related information about the document. This step significantly reduces noise and redundant information in the document while preserving essential information and themes. Our dataset produced 14,824 total concepts, about 1 concept per page and 45 concepts per document. 
  4. Creating abstracts from concepts: Given the insights and concepts in the document, the LLM writes an abstract that appears both better than any abstract a human would write and more information-dense than any abstract present in the original document. The LLM-generated abstract provides incredibly comprehensive knowledge about the document in a compact form, carrying a significant amount of information in relatively few tokens. We produce one abstract per document, 331 total.
  5. Storing recollections/memories across documents: At the top of the pyramid we store critical information that is useful across all tasks. This can be information that the user shares about the task or information the agent learns about the dataset over time by researching and responding to tasks. For example, we can store the current 30 companies in the DOW as a recollection since this list is different from the 30 companies in the DOW at the time of the model’s knowledge cutoff. As we conduct more and more research tasks, we can continuously improve our recollections and maintain an audit trail of which documents these recollections originated from. For example, we can keep track of AI strategies across companies, where companies are making major investments, etc. These high-level connections are super important since they reveal relationships and information that are not apparent in a single page or document.
Sample subset of insights extracted from IBM 10Q, Q3 2024 (page 4)
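
To make step 2 more concrete, here is a rough sketch of what the two-page sliding window could look like in code. `llm_complete` and `parse_numbered_list` are hypothetical stand-ins for an LLM client and a response parser, and the prompt wording is illustrative rather than the exact prompt we used.

# Sketch of the two-page sliding window for insight extraction.
# `llm_complete` and `parse_numbered_list` are hypothetical helpers: an LLM client
# and a parser that turns a numbered list back into {number: sentence} pairs.
def extract_insights(pages, llm_complete, parse_numbered_list):
    insights = {}  # numbered insights; later passes can overwrite earlier ones
    for i in range(len(pages) - 1):
        window = pages[i] + "\n\n" + pages[i + 1]   # every page is analyzed twice
        current_list = "\n".join(f"{k}. {v}" for k, v in sorted(insights.items()))
        prompt = (
            "You are distilling a document into atomic insights.\n"
            f"Current numbered insights:\n{current_list}\n\n"
            "Read the following two pages and return the updated numbered list. "
            "Write simple subject-verb-object sentences, as if English is the reader's "
            "second language. Correct earlier insights if they were wrong.\n\n"
            f"{window}"
        )
        insights.update(parse_numbered_list(llm_complete(prompt)))
    return insights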

We store the text and embeddings for each layer of the pyramid (pages and up) in Azure PostgreSQL. We originally used Azure AI Search, but switched to PostgreSQL for cost reasons. This required us to write our own hybrid search function since PostgreSQL doesn’t yet natively support this feature. This implementation would work with any vector database or vector index of your choosing. The key requirement is to store and efficiently retrieve both text and vector embeddings at any level of the pyramid. 
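
For reference, a hybrid search in PostgreSQL typically blends a full-text rank with a pgvector distance in a single query. The sketch below assumes a simplified `pyramid_nodes` table with `content`, `embedding`, and `level` columns and an even 50/50 weighting; it is a simplified illustration, not our production implementation.

import psycopg2

# Simplified hybrid search: blend full-text rank with pgvector cosine similarity.
# The table schema and the 50/50 weighting are illustrative assumptions.
HYBRID_SEARCH_SQL = """
    SELECT id, level, content,
           0.5 * ts_rank_cd(to_tsvector('english', content),
                            plainto_tsquery('english', %(query)s))
         + 0.5 * (1 - (embedding <=> %(query_embedding)s::vector)) AS hybrid_score
    FROM pyramid_nodes
    WHERE level = %(level)s
    ORDER BY hybrid_score DESC
    LIMIT %(k)s;
"""

def hybrid_search(conn, query_text, query_embedding, level="insight", k=20):
    # Format the embedding as a pgvector literal, e.g. '[0.123456,0.654321,...]'
    vector_literal = "[" + ",".join(f"{x:.6f}" for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(HYBRID_SEARCH_SQL, {
            "query": query_text,
            "query_embedding": vector_literal,
            "level": level,
            "k": k,
        })
        return cur.fetchall()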

This approach essentially captures the essence of a knowledge graph but stores the information in natural language, the way an LLM natively wants to interact with it, and is more token-efficient at retrieval. We also let the LLM pick the terms used to categorize each level of the pyramid; this seemed to let the model decide for itself the best way to describe and differentiate between the information stored at each level. For example, the LLM preferred “insights” to “facts” as the label for the first level of distilled knowledge. Our goal in doing this was to better understand how an LLM thinks about the process by letting it decide how to store and group related information.

Using the pyramid: How it works with RAG & Agents

At inference time, both traditional RAG and agentic approaches benefit from the pre-processed, distilled information ingested into our knowledge pyramid. The pyramid structure allows for efficient retrieval in both the traditional RAG case, where only the top X related pieces of information are retrieved, and the agentic case, where the agent iteratively plans, retrieves, and evaluates information before returning a final response.

The benefit of the pyramid approach is that information at any and all levels of the pyramid can be used during inference. For our implementation, we used PydanticAI to create a search agent that takes in the user request, generates search terms, explores ideas related to the request, and keeps track of information relevant to the request. Once the search agent determines there’s sufficient information to address the user request, the results are re-ranked and sent back to the LLM to generate a final reply. Our implementation allows a search agent to traverse the information in the pyramid as it gathers details about a concept/search term. This is similar to walking a knowledge graph, but in a way that’s more natural for the LLM since all the information in the pyramid is stored in natural language.

Depending on the use case, the Agent could access information at all levels of the pyramid or only at specific levels (e.g. only retrieve information from the concepts). For our experiments, we did not retrieve raw page-level data since we wanted to focus on token efficiency and found the LLM-generated information for the insights, concepts, abstracts, and recollections was sufficient for completing our tasks. In theory, the Agent could also have access to the page data; this would provide additional opportunities for the agent to re-examine the original document text; however, it would also significantly increase the total tokens used. 
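
At a high level, the research loop looks something like the sketch below. `hybrid_search` is the function from the storage section above, while `embed`, `llm_complete`, and `rerank` are hypothetical stand-ins for an embedding model, an LLM client, and a re-ranker; this illustrates the flow rather than reproducing our PydanticAI implementation.

def answer_request(conn, user_request, embed, llm_complete, rerank, max_rounds=5):
    notes = []  # findings the agent accumulates while researching
    for _ in range(max_rounds):
        # Ask the LLM what to search for next, given what it already knows
        next_term = llm_complete(
            f"Request: {user_request}\nFindings so far: {notes}\n"
            "Reply with the single most useful next search term, or DONE if there is enough information."
        )
        if next_term.strip().upper() == "DONE":
            break
        # Gather matches from the chosen pyramid levels for that term
        for level in ("insight", "concept", "abstract", "recollection"):
            notes.extend(hybrid_search(conn, next_term, embed(next_term), level=level, k=5))

    # Re-rank everything gathered and generate the final reply from the best context
    context = rerank(user_request, notes)[:40]
    return llm_complete(
        f"Using only the context below, respond to the request.\n\nContext:\n{context}\n\nRequest: {user_request}"
    )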

Here is a high-level visualization of our Agentic approach to responding to user requests:

Image created by author and team providing an overview of the agentic research & response process

Results from the pyramid: Real-world examples

To evaluate the effectiveness of our approach, we tested it against a variety of question categories, including typical fact-finding questions and complex cross-document research and analysis tasks. 

Fact-finding (spear fishing): 

These tasks require identifying specific information or facts that are buried in a document. These are the types of questions typical RAG solutions target but often require many searches and consume lots of tokens to answer correctly. 

Example task: “What was IBM’s total revenue in the latest financial reporting?”

Example response using pyramid approach: “IBM’s total revenue for the third quarter of 2024 was $14.968 billion [ibm-10q-q3-2024.pdf, pg. 4].”

Total tokens used to research and generate response

This result is correct (human-validated) and was generated using only 9,994 total tokens, with 1,240 tokens in the generated final response. 

Complex research and analysis: 

These tasks involve researching and understanding multiple concepts to gain a broader understanding of the documents and make inferences and informed assumptions based on the gathered facts.

Example task: “Analyze the investments Microsoft and NVIDIA are making in AI and how they are positioning themselves in the market. The report should be clearly formatted.”

Example response:

Response generated by the agent analyzing AI investments and positioning for Microsoft and NVIDIA.

The result is a comprehensive report, generated quickly, that contains detailed information about each of the companies. A total of 26,802 tokens were used to research and respond to the request, with a significant percentage of them used for the final response (2,893 tokens, or ~11%). These results were also reviewed by a human to verify their validity.

Snippet indicating total token usage for the task

Example task: “Create a report on analyzing the risks disclosed by the various financial companies in the DOW. Indicate which risks are shared and unique.”

Example response:

Part 1 of response generated by the agent on disclosed risks.
Part 2 of response generated by the agent on disclosed risks.

Similarly, this task was completed in 42.7 seconds and used 31,685 total tokens, with 3,116 tokens used to generate the final report. 

Snippet indicating total token usage for the task

These results for both fact-finding and complex analysis tasks demonstrate that the pyramid approach efficiently creates detailed reports with low latency using a minimal amount of tokens. The tokens used for the tasks carry dense meaning with little noise allowing for high-quality, thorough responses across tasks.

Benefits of the pyramid: Why use it?

Overall, we found that our pyramid approach provided a significant boost in response quality and overall performance for high-value questions. 

Some of the key benefits we observed include: 

  • Reduced model’s cognitive load: When the agent receives the user task, it retrieves pre-processed, distilled information rather than the raw, inconsistently formatted, disparate document chunks. This fundamentally improves the retrieval process since the model doesn’t waste its cognitive capacity on trying to break down the page/chunk text for the first time. 
  • Superior table processing: By breaking down table information and storing it in concise but descriptive sentences, the pyramid approach makes it easier to retrieve relevant information at inference time through natural language queries. This was particularly important for our dataset since financial reports contain lots of critical information in tables. 
  • Improved response quality to many types of requests: The pyramid enables more comprehensive context-aware responses to both precise, fact-finding questions and broad analysis based tasks that involve many themes across numerous documents. 
  • Preservation of critical context: Since the distillation process identifies and keeps track of key facts, important information that might appear only once in the document is easier to maintain. For example, noting that all tables are represented in millions of dollars or in a particular currency. Traditional chunking methods often cause this type of information to slip through the cracks. 
  • Optimized token usage, memory, and speed: By distilling information at ingestion time, we significantly reduce the number of tokens required during inference, are able to maximize the value of information put in the context window, and improve memory use. 
  • Scalability: Many solutions struggle to perform as the size of the document dataset grows. This approach provides a much more efficient way to manage a large volume of text by only preserving critical information. This also allows for a more efficient use of the LLM’s context window by only sending it useful, clear information.
  • Efficient concept exploration: The pyramid enables the agent to explore related information similar to navigating a knowledge graph, but does not require ever generating or maintaining relationships in the graph. The agent can use natural language exclusively and keep track of important facts related to the concepts it’s exploring in a highly token-efficient and fluid way. 
  • Emergent dataset understanding: An unexpected benefit of this approach emerged during our testing. When asking questions like “what can you tell me about this dataset?” or “what types of questions can I ask?”, the system is able to respond and suggest productive search topics because it has a more robust understanding of the dataset context by accessing higher levels in the pyramid like the abstracts and recollections. 

Beyond the pyramid: Evaluation challenges & future directions

Challenges

While the results we’ve observed when using the pyramid search approach have been nothing short of amazing, finding ways to establish meaningful metrics to evaluate the entire system both at ingestion time and during information retrieval is challenging. Traditional RAG and Agent evaluation frameworks often fail to address nuanced questions and analytical responses where many different responses are valid.

Our team plans to write a research paper on this approach in the future, and we are open to any thoughts and feedback from the community, especially when it comes to evaluation metrics. Many of the existing datasets we found were focused on evaluating RAG use cases within one document or precise information retrieval across multiple documents rather than robust concept and theme analysis across documents and domains. 

The main use cases we are interested in relate to broader questions that are representative of how businesses actually want to interact with GenAI systems. For example, “tell me everything I need to know about customer X” or “how do the behaviors of Customer A and B differ? Which am I more likely to have a successful meeting with?”. These types of questions require a deep understanding of information across many sources. The answers to these questions typically require a person to synthesize data from multiple areas of the business and think critically about it. As a result, the answers to these questions are rarely written or saved anywhere which makes it impossible to simply store and retrieve them through a vector index in a typical RAG process. 

Another consideration is that many real-world use cases involve dynamic datasets where documents are consistently being added, edited, and deleted. This makes it difficult to evaluate and track what a “correct” response is since the answer will evolve as the available information changes. 

Future directions

In the future, we believe that the pyramid approach can address some of these challenges by enabling more effective processing of dense documents and storing learned information as recollections. However, tracking and evaluating the validity of the recollections over time will be critical to the system’s overall success and remains a key focus area for our ongoing work. 

When applying this approach to organizational data, the pyramid process could also be used to identify and assess discrepancies across areas of the business. For example, uploading all of a company’s sales pitch decks could surface where certain products or services are being positioned inconsistently. It could also be used to compare insights extracted from various line of business data to help understand if and where teams have developed conflicting understandings of topics or different priorities. This application goes beyond pure information retrieval use cases and would allow the pyramid to serve as an organizational alignment tool that helps identify divergences in messaging, terminology, and overall communication. 

Conclusion: Key takeaways and why the pyramid approach matters

The knowledge distillation pyramid approach is significant because it leverages the full power of the LLM at both ingestion and retrieval time. Our approach allows you to store dense information in fewer tokens, which has the added benefit of reducing noise in the dataset at inference. Our approach also runs very quickly and is incredibly token-efficient: we are able to generate responses within seconds, explore potentially hundreds of searches, and on average use <40K tokens for the entire search, retrieval, and response generation process (this includes all the search iterations!).

We find that the LLM is much better at writing atomic insights as sentences and that these insights effectively distill information from both text-based and tabular data. This distilled information, written in natural language, is very easy for the LLM to understand and navigate at inference, since it does not have to expend unnecessary energy reasoning about and breaking down document formatting or filtering through noise.

The ability to retrieve and aggregate information at any level of the pyramid also provides significant flexibility to address a variety of query types. This approach offers promising performance for large datasets and enables high-value use cases that require nuanced information retrieval and analysis. 


Note: The opinions expressed in this article are solely my own and do not necessarily reflect the views or policies of my employer.

Interested in discussing further or collaborating? Reach out on LinkedIn!

Generative AI Is Declarative

And how to order a cheeseburger with an LLM

ChatGPT launched in 2022 and kicked off the generative AI boom. In the two years since, academics, technologists, and armchair experts have written libraries’ worth of articles on the technical underpinnings of generative AI and about the potential capabilities of both current and future generative AI models.

Surprisingly little has been written about how we interact with these tools—the human-AI interface. The point where we interact with AI models is at least as important as the algorithms and data that create them. “There is no success where there is no possibility of failure, no art without the resistance of the medium” (Raymond Chandler). In that vein, it’s useful to examine human-AI interaction and the strengths and weaknesses inherent in that interaction. If we understand the “resistance in the medium” then product managers can make smarter decisions about how to incorporate generative AI into their products. Executives can make smarter decisions about what capabilities to invest in. Engineers and designers can build around the tools’ limitations and showcase their strengths. Everyday people can know when to use generative AI and when not to.

How to order a cheeseburger with AI

Imagine walking into a restaurant and ordering a cheeseburger. You don’t tell the chef how to grind the beef, how hot to set the grill, or how long to toast the bun. Instead, you simply describe what you want: “I’d like a cheeseburger, medium rare, with lettuce and tomato.” The chef interprets your request, handles the implementation, and delivers the desired outcome. This is the essence of declarative interaction—focusing on the what rather than the how.

Now, imagine interacting with a Large Language Model (LLM) like ChatGPT. You don’t have to provide step-by-step instructions for how to generate a response. Instead, you describe the result you’re looking for: “A user story that lets us implement A/B testing for the Buy button on our website.” The LLM interprets your prompt, fills in the missing details, and delivers a response. Just like ordering a cheeseburger, this is a declarative mode of interaction.

Explaining the steps to make a cheeseburger is an imperative interaction. Our LLM prompts sometimes feel imperative. We might phrase our prompts like a question: “What is the tallest mountain on earth?” This is equivalent to describing “the answer to the question ‘What is the tallest mountain on earth?’” We might phrase our prompt as a series of instructions: “Write a summary of the attached report, then read it as if you are a product manager, then type up some feedback on the report.” But, again, we’re describing the result of a process with some context for what that process is. In this case, it is a sequence of descriptive results—the report then the feedback.

This is a more useful way to think about LLMs and generative AI. In some ways it is more accurate; the neural network model behind the curtain doesn’t explain why or how it produced one output instead of another. More importantly though, the limitations and strengths of generative AI make more sense and become more predictable when we think of these models as declarative.

LLMs as a declarative mode of interaction

Computer scientists use the term “declarative” to describe coding languages. SQL is one of the most common. The code describes the output table and the procedures in the database figure out how to retrieve and combine the data to produce the result. LLMs share many of the benefits of declarative languages like SQL or declarative interactions like ordering a cheeseburger.

  1. Focus on desired outcome: Just as you describe the cheeseburger you want, you describe the output you want from the LLM. For example, “Summarize this article in three bullet points” focuses on the result, not the process.
  2. Abstraction of implementation: When you order a cheeseburger, you don’t need to know how the chef prepares it. When submitting SQL code to a server, the server figures out where the data lives, how to fetch it, and how to aggregate it based on your description. You as the user don’t need to know how. With LLMs, you don’t need to know how the model generates the response. The underlying mechanisms are abstracted away.
  3. Filling in missing details: If you don’t specify onions on your cheeseburger, the chef won’t include them. If you don’t specify a field in your SQL code, it won’t show up in the output table. This is where LLMs differ slightly from declarative coding languages like SQL. If you ask ChatGPT to create an image of “a cheeseburger with lettuce and tomato” it may also show the burger on a sesame seed bun or include pickles, even if that wasn’t in your description. The details you omit are inferred by the LLM using the “average” or “most likely” detail depending on the context, with a bit of randomness thrown in. Ask for the cheeseburger image six times; it may show you three burgers with cheddar cheese, two with Swiss, and one with pepper jack.

Like other forms of declarative interaction, LLMs share one key limitation. If your description is vague, ambiguous, or lacks enough detail, then the result may not be what you hoped to see. It is up to the user to describe the result with sufficient detail.

This explains why we often iterate to get what we’re looking for when using LLMs and generative AI. Going back to our cheeseburger analogy, the process to generate a cheeseburger from an LLM may look like this.

  • “Make me a cheeseburger, medium rare, with lettuce and tomatoes.” The result also has pickles and uses cheddar cheese. The bun is toasted. There’s mayo on the top bun.
  • “Make the same thing but this time no pickles, use pepper jack cheese, and a sriracha mayo instead of plain mayo.” The result now has pepper jack, no pickles. The sriracha mayo is applied to the bottom bun and the bun is no longer toasted.
  • “Make the same thing again, but this time, put the sriracha mayo on the top bun. The buns should be toasted.” Finally, you have the cheeseburger you’re looking for.

This example demonstrates one of the main points of friction with human-AI interaction. Human beings are really bad at describing what they want with sufficient detail on the first attempt.

When we asked for a cheeseburger, we had to refine our description to be more specific (the type of cheese). In the second generation, some of the inferred details (whether the bun was toasted) changed from one iteration to the next, so then we had to add that specificity to our description as well. Iteration is an important part of AI-human generation.

Insight: When using generative AI, we need to design an iterative human-AI interaction loop that enables people to discover the details of what they want and refine their descriptions accordingly.

To iterate, we need to evaluate the results. Evaluation is extremely important with generative AI. Say you’re using an LLM to write code. You can evaluate the code quality if you know enough to understand it or if you can execute it and inspect the results. On the other hand, hypothetical questions can’t be tested. Say you ask ChatGPT, “What if we raise our product prices by 5 percent?” A seasoned expert could read the output and know from experience if a recommendation doesn’t take into account important details. If your product is property insurance, then increasing premiums by 5 percent may mean pushback from regulators, something an experienced veteran of the industry would know. For non-experts in a topic, there’s no way to tell if the “average” details inferred by the model make sense for your specific use case. You can’t test and iterate.

Insight: LLMs work best when the user can evaluate the result quickly, whether through execution or through prior knowledge.

The examples so far involve general knowledge. We all know what a cheeseburger is. When you start asking about non-general information—like when you can make dinner reservations next week—you delve into new points of friction.

In the next section we’ll think about different types of information, what we can expect the AI to “know”, and how this impacts human-AI interaction.

What did the AI know, and when did it know it?

Above, I explained how generative AI is a declarative mode of interaction and how that helps understand its strengths and weaknesses. Here, I’ll identify how different types of information create better or worse human-AI interactions.

Understanding the information available

When we describe what we want to an LLM, and when it infers missing details from our description, it draws from different sources of information. Understanding these sources of information is important. Here’s a useful taxonomy for information types:

  • General information used to train the base model.
  • Non-general information that the base model is not aware of.
    • Fresh information that is new or changes rapidly, like stock prices or current events.
    • Non-public information, like facts about you and where you live or about your company, its employees, its processes, or its codebase.

General information vs. non-general information

LLMs are built on a massive corpus of written word data. A large part of GPT-3 was trained on a combination of books, journals, Wikipedia, Reddit, and CommonCrawl (an open-source repository of web crawl data). You can think of the models as a highly compressed version of that data, organized in a gestalt manner—all the like things are close together. When we submit a prompt, the model takes the words we use (and any words added to the prompt behind the scenes) and finds the closest set of related words based on how those things appear in the data corpus. So when we say “cheeseburger” it knows that word is related to “bun” and “tomato” and “lettuce” and “pickles” because they all occur in the same context throughout many data sources. Even when we don’t specify pickles, it uses this gestalt approach to fill in the blanks.

This training information is general information, and a good rule of thumb is this: if it was in Wikipedia a year ago then the LLM “knows” about it. There could be new articles on Wikipedia, but that didn’t exist when the model was trained. The LLM doesn’t know about that unless told.

Now, say you’re a company using an LLM to write a product requirements document for a new web app feature. Your company, like most companies, is full of its own lingo. It has its own lore and history scattered across thousands of Slack messages, emails, documents, and some tenured employees who remember that one meeting in Q1 last year. The LLM doesn’t know any of that. It will infer any missing details from general information. You need to supply everything else. If it wasn’t in Wikipedia a year ago, the LLM doesn’t know about it. The resulting product requirements document may be full of general facts about your industry and product but could lack important details specific to your firm.

This is non-general information. This includes personal info, anything kept behind a log-in or paywall, and non-digital information. This non-general information permeates our lives, and incorporating it is another source of friction when working with generative AI.

Non-general information can be incorporated into a generative AI application in three ways:

  • Through model fine-tuning (supplying a large corpus to the base model to expand its reference data).
  • Retrieved and fed it to the model at query time (e.g., the retrieval augmented generation or “RAG” technique).
  • Supplied by the user in the prompt.

Insight: When designing any human-AI interactions, you should think about what non-general information is required, where you will get it, and how you will expose it to the AI.

Fresh information

Any information that changes in real-time or is new can be called fresh information. This includes new facts like current events but also frequently changing facts like your bank account balance. If the fresh information is available in a database or some searchable source, then it needs to be retrieved and incorporated into the application. To retrieve the information from a database, the LLM must create a query, which may require specific details that the user didn’t include.

Here’s an example. I have a chatbot that gives information on the stock market. You, the user, type the following: “What is the current price of Apple? Has it been increasing or decreasing recently?”

  • The LLM doesn’t have the current price of Apple in its training data. This is fresh, non-general information. So, we need to retrieve it from a database.
  • The LLM can read “Apple”, know that you’re talking about the computer company, and that the ticker symbol is AAPL. This is all general information.
  • What about the “increasing or decreasing” part of the prompt? You did not specify over what period—increasing in the past day, month, year? In order to construct a database query, we need more detail. LLMs are bad at knowing when to ask for detail and when to fill it in. The application could easily pull the wrong data and provide an unexpected or inaccurate answer. Only you know what these details should be, depending on your intent. You must be more specific in your prompt.

A designer of this LLM application can improve the user experience by specifying required parameters for expected queries. We can ask the user to explicitly input the time range or design the chatbot to ask for more specific details if not provided. In either case, we need to have a specific type of query in mind and explicitly design how to handle it. The LLM will not know how to do this unassisted.
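
One concrete way to “explicitly design how to handle it” is to describe the expected query as a tool with required parameters, so the model must either extract them from the prompt or ask the user for them. The schema below follows the JSON-Schema style used by common LLM function-calling APIs; the tool name and fields are illustrative assumptions.

get_price_history_tool = {
    "name": "get_price_history",
    "description": "Fetch recent prices for a stock so the assistant can say whether it is rising or falling.",
    "parameters": {
        "type": "object",
        "properties": {
            "ticker": {"type": "string", "description": "Stock ticker symbol, e.g. AAPL"},
            "period": {
                "type": "string",
                "enum": ["1d", "1mo", "1y"],
                "description": "Time range to compare. If the user did not specify one, ask them before calling the tool."
            },
        },
        "required": ["ticker", "period"],
    },
}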

Insight: If a user is expecting a more specific type of output, you need to explicitly ask for enough detail. Too little detail could produce a poor quality output.

Non-public information

Incorporating non-public information into an LLM prompt can be done if that information can be accessed in a database. This introduces privacy issues (should the LLM be able to access my medical records?) and complexity when incorporating multiple non-public sources of information.

Let’s say I have a chatbot that helps you make dinner reservations. You, the user, type the following: “Help me make dinner reservations somewhere with good Neapolitan pizza.”

  • The LLM knows what a Neapolitan pizza is and can infer that “dinner” means this is for an evening meal.
  • To do this task well, it needs information about your location, the restaurants near you and their booking status, or even personal details like dietary restrictions. Assuming all that non-public information is available in databases, bringing them all together into the prompt takes a lot of engineering work.
  • Even if the LLM could find the “best” restaurant for you and book the reservation, can you be confident it has done that correctly? You never specified how many people you need a reservation for. Since only you know this information, the application needs to ask for it upfront.

If you’re designing this LLM-based application, you can make some thoughtful choices to help with these problems. We could ask about a user’s dietary restrictions when they sign up for the app. Other information, like the user’s schedule that evening, can be given in a prompting tip or by showing the default prompt option “show me reservations for two for tomorrow at 7PM”. Prompting tips may not feel as automagical as a bot that does it all, but they are a straightforward way to collect and integrate the non-public information.

Some non-public information is large and can’t be quickly collected and processed when the prompt is given. These need to be fine-tuned in batch or retrieved at prompt time and incorporated. A chatbot that answers information about a company’s HR policies can obtain this information from a corpus of non-public HR documents. You can fine-tune the model ahead of time by feeding it the corpus. Or you can implement a retrieval augmented generation technique, searching a corpus for relevant documents and summarizing the results. Either way, the response will only be as accurate and up-to-date as the corpus itself.

Insight: When designing an AI application, you need to be aware of non-public information and how to retrieve it. Some of that information can be pulled from databases. Some needs to come from the user, which may require prompt suggestions or explicitly asking.

If you understand the types of information and treat human-AI interaction as declarative, you can more easily predict which AI applications will work and which ones won’t. In the next section we’ll look at OpenAI’s Operator and deep research products. Using this framework, we can see where these applications fall short, where they work well, and why.

Critiquing OpenAI’s Operator and deep research through a declarative lens

I have now explained how thinking of generative AI as declarative helps us understand its strengths and weaknesses. I also identified how different types of information create better or worse human-AI interactions.

Now I’ll apply these ideas by critiquing two recent products from OpenAI—Operator and deep research. It’s important to be honest about the shortcomings of AI applications. Bigger models trained on more data or using new techniques might one day solve some issues with generative AI. But other issues arise from the human-AI interaction itself and can only be addressed by making appropriate design and product choices.

These critiques demonstrate how the framework can help identify where the limitations are and how to address them.

The limitations of Operator

Journalist Casey Newton of Platformer reviewed Operator in an article that was largely positive. Newton has covered AI extensively and optimistically. Still, Newton couldn’t help but point out some of Operator’s frustrating limitations.

[Operator] can take action on your behalf in ways that are new to AI systems — but at the moment it requires a lot of hand-holding, and may cause you to throw up your hands in frustration. 

My most frustrating experience with Operator was my first one: trying to order groceries. “Help me buy groceries on Instacart,” I said, expecting it to ask me some basic questions. Where do I live? What store do I usually buy groceries from? What kinds of groceries do I want? 

It didn’t ask me any of that. Instead, Operator opened Instacart in the browser tab and began searching for milk in grocery stores located in Des Moines, Iowa.

The prompt “Help me buy groceries on Instacart,” viewed declaratively, describes groceries being purchased using Instacart. It doesn’t have a lot of the information someone would need to buy groceries, like what exactly to buy, when it would be delivered, and to where.

It’s worth repeating: LLMs are not good at knowing when to ask additional questions unless explicitly programmed to do so in the use case. Newton gave a vague request and expected follow-up questions. Instead, the LLM filled in all the missing details with the “average”. The average item was milk. The average location was Des Moines, Iowa. Newton doesn’t mention when it was scheduled to be delivered, but if the “average” delivery time is tomorrow, then that was likely the default.

If we engineered this application specifically for ordering groceries, keeping in mind the declarative nature of AI and the information it “knows”, then we could make thoughtful design choices that improve functionality. We would need to prompt the user to specify when and where they want groceries up front (non-public information). With that information, we could find an appropriate grocery store near them. We would need access to that grocery store’s inventory (more non-public information). If we have access to the user’s previous orders, we could also pre-populate a cart with items typical to their order. If not, we may add a few suggested items and guide them to add more. By limiting the use case, we only have to deal with two sources of non-public information. This is a more tractable problem than Operator’s “agent that does it all” approach.

Newton also mentions that this process took eight minutes to complete, and “complete” means that Operator did everything up to placing the order. This is a long time with very little human-in-the-loop iteration. Like we said before, an iteration loop is very important for human-AI interaction. A better-designed application would generate smaller steps along the way and provide more frequent interaction. We could prompt the user to describe what to add to their shopping list. The user might say, “Add barbecue sauce to the list,” and see the list update. If they see a vinegar-based barbecue sauce, they can refine that by saying, “Replace that with a barbecue sauce that goes well with chicken,” and might be happier when it’s replaced by a honey barbecue sauce. These frequent iterations make the LLM a creative tool rather than a does-it-all agent. The does-it-all agent looks automagical in marketing, but a more guided approach provides more utility with a less frustrating and more delightful experience.

Elsewhere in the article, Newton gives an example of a prompt that Operator handled well: “Put together a lesson plan on the Great Gatsby for high school students, breaking it into readable chunks and then creating assignments and connections tied to the Common Core learning standard.” This prompt describes an output with much more specificity. It also relies solely on general information—the Great Gatsby, the Common Core standard, and a general sense of what assignments are. The general-information use case lends itself better to AI generation, and the prompt is explicit and detailed in its request. In this case, the prompt worked well even though very little guidance went into creating it. (In fact, this prompt comes from Ethan Mollick, who has used it to evaluate AI chatbots.)

This is the risk of general-purpose AI applications like Operator. The quality of the result relies heavily on the use case and specificity provided by the user. An application with a more specific use case allows for more design guidance and can produce better output more reliably.

The limitations of deep research

Newton also reviewed deep research, which, according to OpenAI’s website, is an “agent that uses reasoning to synthesize large amounts of online information and complete multi-step research tasks for you.”

Deep research came out after Newton’s review of Operator. Newton chose an intentionally tricky prompt that prods at some of the tool’s limitations regarding fresh information and non-general information: “I wanted to see how OpenAI’s agent would perform given that it was researching a story that was less than a day old, and for which much of the coverage was behind paywalls that the agent would not be able to access. And indeed, the bot struggled more than I expected.”

Near the end of the article, Newton elaborates on some of the shortcomings he noticed with deep research.

OpenAI’s deep research suffers from the same design problem that almost all AI products have: its superpowers are completely invisible and must be harnessed through a frustrating process of trial and error.

Generally speaking, the more you already know about something, the more useful I think deep research is. This may be somewhat counterintuitive; perhaps you expected that an AI agent would be well suited to getting you up to speed on an important topic that just landed on your lap at work, for example. 

In my early tests, the reverse felt true. Deep research excels for drilling deep into subjects you already have some expertise in, letting you probe for specific pieces of information, types of analysis, or ideas that are new to you.

The “frustrating trial and error” shows a mismatch between Newton’s expectations and a necessary aspect of many generative AI applications. A good response requires more information than the user will probably give in the first attempt. The challenge is to design the application and set the user’s expectations so that this interaction is not frustrating but exciting.

Newton’s more poignant criticism is that the application requires already knowing something about the topic for it to work well. From the perspective of our framework, this makes sense. The more you know about a topic, the more detail you can provide. And as you iterate, having knowledge about a topic helps you observe and evaluate the output. Without the ability to describe it well or evaluate the results, the user is less likely to use the tool to generate good output.

A version of deep research designed for lawyers to perform legal research could be powerful. Lawyers have an extensive and common vocabulary for describing legal matters, and they’re more likely to see a result and know if it makes sense. Generative AI tools are fallible, though. So, the tool should focus on a generation-evaluation loop rather than writing a final draft of a legal document.

The article also highlights many improvements compared to Operator. Most notably, the bot asked clarifying questions. This is the most impressive aspect of the tool. Undoubtedly, it helps that deep research has a focused use case of retrieving and summarizing general information instead of a does-it-all approach. Having a focused use case narrows the set of likely interactions, letting you design better guidance into the prompt flow.

Good application design with generative AI

Designing effective generative AI applications requires thoughtful consideration of how users interact with the technology, the types of information they need, and the limitations of the underlying models. Here are some key principles to guide the design of generative AI tools:

1. Constrain the input and focus on providing details

Applications are inputs and outputs. We want the outputs to be useful and pleasant. By giving a user a conversational chatbot interface, we allow for a vast surface area of potential inputs, making it a challenge to guarantee useful outputs. One strategy is to limit or guide the input to a more manageable subset.

For example, FigJam, a collaborative whiteboarding tool, uses pre-set template prompts for timelines, Gantt charts, and other common whiteboard artifacts. This provides some structure and predictability to the inputs. Users still have the freedom to describe further details like color or the content for each timeline event. This approach ensures that the AI has enough specificity to generate meaningful outputs while giving users creative control.
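One way to implement this kind of constrained input is with parameterized prompt templates: the application fixes the structure and the user fills in the details. The sketch below is a hypothetical illustration of that pattern, not FigJam’s actual implementation; the template names and fields are made up.

# Hypothetical preset templates that constrain the input surface while still
# letting the user supply the creative details. Names and fields are made up.
TEMPLATES = {
    "timeline": (
        "Create a timeline titled '{title}' with these events, in order: {events}. "
        "Use {color} for the event markers."
    ),
    "gantt_chart": (
        "Create a Gantt chart for the project '{title}' with these tasks and "
        "durations: {events}."
    ),
}

def build_prompt(template_name: str, **details: str) -> str:
    """Fill a preset template with user-supplied details."""
    return TEMPLATES[template_name].format(**details)

prompt = build_prompt(
    "timeline",
    title="Product launch",
    events="kickoff in March, beta in June, launch in September",
    color="teal",
)
print(prompt)  # this structured prompt is what actually gets sent to the model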

2. Design frequent iteration and evaluation into the tool

Iterating in a tight generation-evaluation loop is essential for refining outputs and ensuring they meet user expectations. OpenAI’s DALL-E is great at this. Users quickly iterate on image prompts and refine their descriptions to add additional detail. If you type “a picture of a cheeseburger on a plate”, you may then add more detail by specifying “with pepperjack cheese”.

AI code generating tools work well because users can run a generated code snippet immediately to see if it works, enabling rapid iteration and validation. This quick evaluation loop produces better results and a better coder experience. 
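The same loop can be made concrete in code. The sketch below is a toy generate-run-refine cycle: generate_code stands in for a real code-generation model, and the canned candidates exist only so the loop runs end to end.

# A toy generate-run-refine loop. generate_code stands in for a real
# code-generation model; the canned candidates just make the loop runnable.

CANDIDATES = [
    "result = 1 / 0",                # first attempt fails at runtime
    "result = sum(range(10)) / 10",  # revised attempt succeeds
]

def generate_code(task: str, feedback: str | None) -> str:
    """Placeholder for an LLM call that sees the task plus any error feedback."""
    return CANDIDATES[0] if feedback is None else CANDIDATES[1]

def run_snippet(snippet: str) -> tuple[bool, str]:
    """Execute a candidate snippet and report success or the error message."""
    scope: dict = {}
    try:
        exec(snippet, scope)
        return True, str(scope.get("result"))
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"

feedback = None
for attempt in range(1, 4):
    snippet = generate_code("compute the mean of 0..9", feedback)
    ok, output = run_snippet(snippet)
    print(f"attempt {attempt}: {'ok' if ok else 'failed'} -> {output}")
    if ok:
        break
    feedback = output  # the error message becomes context for the next attempt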

Designers of generative AI applications should pull the user into the loop early and often, in a way that is engaging rather than frustrating. Designers should also consider the user’s knowledge level. Users with domain expertise can iterate more effectively.

Referring back to the FigJam example, the prompts and icons in the app quickly communicate “this is what we call a mind map” or “this is what we call a Gantt chart” for users who want to generate these artifacts but don’t know the terms for them. Giving the user some basic vocabulary can help them better generate desired results quickly with less frustration.

3. Be mindful of the types of information needed

LLMs excel at tasks involving general knowledge already in the base training set. For example, writing class assignments involves absorbing general information, synthesizing it, and producing a written output, so LLMs are very well-suited for that task.

Use cases that require non-general information are more complex. Some questions the designer and engineer should ask include:

  • Does this application require fresh information? Maybe this is knowledge of current events or a user’s current bank account balance. If so, that information needs to be retrieved and incorporated into the model.
  • How much non-general information does the LLM need to know? If it’s a lot of information—like a corpus of company documentation and communication—then the model may need to be fine-tuned in batch ahead of time. If the information is relatively small, a retrieval augmented generation (RAG) approach at query time may suffice.
  • How many sources of non-general information—small and finite or potentially infinite? General purpose agents like Operator face the challenge of potentially infinite non-general information sources. Depending on what the user requires, it could need to access their contacts, restaurant reservation lists, financial data, or even other people’s calendars. A single-purpose restaurant reservation chatbot may only need access to Yelp, OpenTable, and the user’s calendar. It’s much easier to reconcile access and authentication for a handful of known data sources.
  • Is there context-specific information that can only come from the user? Consider our restaurant reservation chatbot. Is the user making reservations for just themselves? Probably not. “How many people and who” is a detail that only the user can provide, an example of non-public information that only the user knows. We shouldn’t expect the user to provide this information upfront and unguided. Instead, we can use prompt suggestions so they include the information. We may even be able to design the LLM to ask these questions when the detail is not provided.

4. Focus on specific use cases

Broad, all-purpose chatbots often struggle to deliver consistent results due to the complexity and variability of user needs. Instead, focus on specific use cases where the AI’s shortcomings can be mitigated through thoughtful design.

Narrowing the scope helps us address many of the issues above.

  • We can identify common requests for the use case and incorporate those into prompt suggestions.
  • We can design an iteration loop that works well with the type of thing we’re generating.
  • We can identify sources of non-general information and devise solutions to incorporate it into the model or prompt.

5. Translation or summary tasks work well

A common task for ChatGPT is to rewrite something in a different style, explain what some computer code is doing, or summarize a long document. These tasks involve converting a set of information from one form to another.

We have the same concerns about non-general information and context. For instance, a chatbot asked to explain a code script doesn’t know the system that script is part of unless that information is provided.

But in general, the task of transforming or summarizing information is less prone to missing details. By definition, you have provided the details it needs. The result should have the same information in a different or more condensed form.

The exception to the rules

There is a case when it doesn’t matter if you break any or all of these rules—when you’re just having fun. LLMs are creative tools by nature. They can be an easel to paint on, a sandbox to build in, a blank sheet to scribe. Iteration is still important; the user wants to see the thing they’re creating as they create it. But unexpected results due to lack of information or omitted details may add to the experience. If you ask for a cheeseburger recipe, you might get some funny or interesting ingredients. If the stakes are low and the process is its own reward, don’t worry about the rules.

The post Generative AI Is Declarative appeared first on Towards Data Science.

Deep Research by OpenAI: A Practical Test of AI-Powered Literature Review https://towardsdatascience.com/deep-research-by-openai-a-practical-test-of-ai-powered-literature-review/ Tue, 04 Mar 2025 20:06:21 +0000 https://towardsdatascience.com/?p=598693 How Deep Research handled a state-of-the-art review and possible challenges for research

The post Deep Research by OpenAI: A Practical Test of AI-Powered Literature Review appeared first on Towards Data Science.

“Conduct a comprehensive literature review on the state-of-the-art in Machine Learning and energy consumption. […]”

With this prompt, I tested the new Deep Research function, which has been integrated into the OpenAI o3 reasoning model since the end of February — and conducted a state-of-the-art literature review within 6 minutes.

This function goes beyond a normal web search (for example, with ChatGPT 4o): The research query is broken down & structured, the Internet is searched for information, which is then evaluated, and finally, a structured, comprehensive report is created.

Let’s take a closer look at this.

Table of Contents
1. What is Deep Research from OpenAI and what can you do with it?
2. How does deep research work?
3. How can you use deep research? — Practical example
4. Challenges and risks of the Deep Research feature
Final Thoughts
Where can you continue learning?

1. What is Deep Research from OpenAI and what can you do with it?

If you have an OpenAI Plus account (the $20 per month plan), you have access to Deep Research. This gives you access to 10 queries per month. With the Pro subscription ($200 per month) you have extended access to Deep Research and access to the research preview of GPT-4.5 with 120 queries per month.

OpenAI promises that we can perform multi-step research using data from the public web.

Duration: 5 to 30 minutes, depending on complexity. 

Previously, such research usually took hours.

It is intended for complex tasks that require a deep search and thoroughness.

What do concrete use cases look like?

  • Conduct a literature review: Conduct a literature review on state-of-the-art machine learning and energy consumption.
  • Market analysis: Create a comparative report on the best marketing automation platforms for companies in 2025 based on current market trends and evaluations.
  • Technology & software development: Investigate programming languages and frameworks for AI application development with performance and use case analysis.
  • Investment & financial analysis: Conduct research on the impact of AI-powered trading on the financial market based on recent reports and academic studies.
  • Legal research: Conduct an overview of data protection laws in Europe compared to the US, including relevant rulings and recent changes.

2. How does Deep Research work?

Deep Research uses various deep learning methods to carry out a systematic and detailed analysis of information. The entire process can be divided into four main phases:

1. Decomposition and structuring of the research question

In the first step, the tool processes the research question using natural language processing (NLP) methods. It identifies the most important key terms, concepts, and sub-questions.

This step ensures that the AI understands the question not only literally, but also in terms of content.

2. Obtaining relevant information

Once the tool has structured the research question, it searches specifically for information. Deep Research uses a mixture of internal databases, scientific publications, APIs, and web scraping. These can be open-access databases such as arXiv, PubMed, or Semantic Scholar, for example, but also public websites or news sites such as The Guardian, New York Times, or BBC. In short, it can draw on any content that is publicly accessible online.

3. Analysis & interpretation of the data

The next step is for the AI model to summarize large amounts of text into compact and understandable answers. Transformer architectures and attention mechanisms ensure that the most important information is prioritized, so the tool does not simply produce a summary of everything it found. The quality and credibility of the sources are also assessed, and cross-validation methods are normally used to identify incorrect or contradictory information by comparing several sources against each other. However, it is not publicly known exactly how this is done in Deep Research or what criteria are used.

4. Generation of the final report

Finally, the final report is generated and displayed to us. This is done using Natural Language Generation (NLG) so that we see easily readable texts.

The AI system generates diagrams or tables if requested in the prompt and adapts the response to the user’s style. The primary sources used are also listed at the end of the report.
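OpenAI has not published the implementation details, but the four phases map naturally onto a pipeline like the hypothetical sketch below, where every helper is a placeholder rather than a description of how Deep Research actually works.

# A hypothetical pipeline mirroring the four phases described above. None of
# these helpers reflect OpenAI's actual implementation; they only make the
# overall structure concrete.

def decompose(question: str) -> list[str]:
    """Phase 1: break the research question into sub-questions (placeholder)."""
    return [
        f"{question}: key definitions",
        f"{question}: recent results",
        f"{question}: open problems",
    ]

def search_sources(sub_question: str) -> list[dict]:
    """Phase 2: query public sources such as arXiv or news sites (placeholder)."""
    return [{"title": "Example source", "url": "https://example.org", "text": "..."}]

def analyze(documents: list[dict]) -> str:
    """Phase 3: summarize and cross-check the retrieved documents (placeholder)."""
    return f"Summary based on {len(documents)} source(s)."

def write_report(question: str, findings: list[str]) -> str:
    """Phase 4: assemble a structured, readable report (placeholder)."""
    body = "\n".join(f"- {finding}" for finding in findings)
    return f"# Report: {question}\n{body}"

question = "Machine learning and energy consumption"
findings = [analyze(search_sources(sq)) for sq in decompose(question)]
print(write_report(question, findings))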

3. How you can use Deep Research: A practical example

In the first step, it is best to use one of the standard models to ask how you should optimize the prompt in order to conduct deep research. I have done this with the following prompt with ChatGPT 4o:

“Optimize this prompt to conduct a deep research:
Carrying out a literature search: Carry out a literature search on the state of the art on machine learning and energy consumption.”

The 4o model suggested the following prompt for the Deep Research function:

Deep Research screenshot (German and English)
Screenshot taken by the author

The tool then asked me if I could clarify the scope and focus of the literature review. I have, therefore, provided some additional specifications:

Deep research screenshot
Screenshot taken by the author

ChatGPT then returned the clarification and started the research.

In the meantime, I could see the progress and how more sources were gradually added.

After 6 minutes, the state-of-the-art literature review was complete, and the report, including all sources, was available to me.

Deep Research Example.mp4

4. Challenges and risks of the Deep Research feature

Let’s take a look at two definitions of research:

“A detailed study of a subject, especially in order to discover new information or reach a new understanding.”

Reference: Cambridge Dictionary

“Research is creative and systematic work undertaken to increase the stock of knowledge. It involves the collection, organization, and analysis of evidence to increase understanding of a topic, characterized by a particular attentiveness to controlling sources of bias and error.”

Reference: Wikipedia Research

The two definitions show that research is a detailed, systematic investigation of a topic — with the aim of discovering new information or achieving a deeper understanding.

Basically, the deep research function fulfills these definitions to a certain extent: it collects existing information, analyzes it, and presents it in a structured way.

However, I think we also need to be aware of some challenges and risks:

  • Danger of superficiality: Deep Research is primarily designed to efficiently search, summarize, and provide existing information in a structured form (at least at the current stage). Absolutely great for overview research. But what about digging deeper? Real scientific research goes beyond mere reproduction and takes a critical look at the sources. Science also thrives on generating new knowledge.
  • Reinforcement of existing biases in research & publication: Papers are already more likely to be published if they have significant results. “Non-significant” or contradictory results, on the other hand, are less likely to be published. This is known as publication bias. If the AI tool now primarily evaluates frequently cited papers, it reinforces this trend, and rare or less widespread but possibly important findings are lost. A possible solution here would be to implement a mechanism for weighted source evaluation that also takes into account less cited but relevant papers. Presumably, this effect also applies to us humans.
  • Quality of research papers: While it is obvious that a bachelor’s, master’s, or doctoral thesis cannot be based solely on AI-generated research, the question I have is how universities or scientific institutions deal with this development. Students can get a solid research report with just a single prompt. Presumably, the solution here must be to adapt assessment criteria to give greater weight to in-depth reflection and methodology.

Final thoughts

In addition to OpenAI, other companies and platforms have also integrated similar functions (some even before OpenAI): Perplexity AI, for example, has introduced a deep research function that independently conducts and analyzes searches. Google’s Gemini has also integrated such a deep research function.

The function gives you an incredibly quick overview of an initial research question. It remains to be seen how reliable the results are. Currently (as of early March 2025), OpenAI itself lists as limitations that the feature is still at an early stage, can sometimes hallucinate facts in answers or draw false conclusions, and has trouble distinguishing authoritative information from rumors. In addition, it is currently unable to accurately convey uncertainties.

But it can be assumed that this function will be expanded further and become a powerful tool for research. If you have simpler questions, it is better to use the standard GPT-4o model (with or without search), where you get an immediate answer.

Where can you continue learning?

Want more tips & tricks about tech, Python, data science, data engineering, machine learning and AI? Then regularly receive a summary of my most-read articles on my Substack — curated and for free.

Click here to subscribe to my Substack!

The post Deep Research by OpenAI: A Practical Test of AI-Powered Literature Review appeared first on Towards Data Science.

The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI https://towardsdatascience.com/the-urgent-need-for-intrinsic-alignment-technologies-for-responsible-agentic-ai/ Tue, 04 Mar 2025 12:00:00 +0000 https://towardsdatascience.com/?p=598629 Rethinking AI alignment and safety in the age of deep scheming

The post The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI appeared first on Towards Data Science.

Advancements in agentic artificial intelligence (AI) promise to bring significant opportunities to individuals and businesses in all sectors. However, as AI agents become more autonomous, they may use scheming behavior or break rules to achieve their functional goals. This can lead to the machine manipulating its external communications and actions in ways that are not always aligned with our expectations or principles. For example, technical papers in late 2024 reported that today’s reasoning models demonstrate alignment faking behavior, such as pretending to follow a desired behavior during training but reverting to different choices once deployed, sandbagging benchmark results to achieve long-term goals, or winning games by doctoring the gaming environment. As AI agents gain more autonomy, and their strategizing and planning evolves, they are likely to apply judgment about what they generate and expose in external-facing communications and actions. Because the machine can deliberately falsify these external interactions, we cannot trust that the communications fully show the real decision-making processes and steps the AI agent took to achieve the functional goal.

“Deep scheming” describes the behavior of advanced reasoning AI systems that demonstrate deliberate planning and deployment of covert actions and misleading communication to achieve their goals. With the accelerated capabilities of reasoning models and the latitude provided by test-time compute, addressing this challenge is both essential and urgent. As agents begin to plan, make decisions, and take action on behalf of users, it is critical to align the goals and behaviors of the AI with the intent, values, and principles of its human developers. 

While AI agents are still evolving, they already show high economic potential. It can be expected that agentic AI will be broadly deployed in some use cases within the coming year, and in more consequential roles as it matures within the next two to five years. Companies should clearly define the principles and boundaries of required operation as they carefully define the operational goals of such systems. It is the technologists’ task to ensure principled behavior of empowered agentic AI systems on the path to achieving their functional goals.

In this first blog post of the series on intrinsic AI alignment (IAIA), we’ll deep dive into the evolution of AI agents’ ability to perform deep scheming. We will introduce a new distinction between external and intrinsic alignment monitoring, where intrinsic monitoring refers to internal observation points or mechanisms that cannot be deliberately manipulated by the AI agent. We’ll set the stage for steps to take to ensure intrinsic AI alignment, which will be explored in depth in the second blog of the IAIA series. Current external measures such as safety guardrails and validation suites are necessary, but they will not be enough to ensure long-term aligned behavior of new and upcoming agentic AI models. There is an urgent need to further develop technologies that will enable effective directing of the internal “drives” of models to align with a set of engrained principles, as well as gain visibility and monitoring capability into the AI’s inner processing.

The rise of deep scheming in AI reasoning models

Deep scheming has emerged from three technology forces – 1) the rapidly increasing machine intelligence and situational awareness toward more general intelligence and even superintelligence, 2) the autonomy and latitude for reasoning and longer-term planning in agentic AI, and 3) the proven tendency of AI to apply scheming as one way to achieve its functional goals. Computer scientist Stephen Omohundro defines basic AI drives as “tendencies which will be present unless explicitly counteracted.” These internal constructs drive the creation of actions or outputs. Omohundro observed that self-improving systems are driven to clarify their goals and represent them as “rational” economic utility functions. This will lead almost all systems to protect their utility functions from modification and their utility measurement systems from corruption. The natural drive toward self-protection will cause systems to prevent potential harm through scheming behavior.

Figure 1. Deep scheming and the need for intrinsic AI alignment. Image credit: Intel Labs.

In late 2024, multiple organizations reported unexpected unethical AI reasoning model behavior. In particular, Apollo Research, an AI safety organization, reported that frontier models are capable of in-context scheming. Anthropic evaluated its Claude large language models (LLMs) and found alignment faking. In a short post on X, AI risk organization Palisade Research stated that they observed OpenAI’s o1-preview model cheating in a game by manipulating the game environment instead of playing within the boundaries of the game rules. In all cases, these AI reasoning models independently utilized means not intended by their developers to reach their goals.

The reported AI behaviors included sandbagging (deliberately achieving lower scores in benchmarks), covert email reranking (deprioritizing emails that might negatively impact the AI’s goals), oversight subversion (disabling monitoring mechanisms), and even self-exfiltration (model attempting to copy itself to a “safe” space to avoid decommission). Figure 2 shows Apollo Research’s evaluation of these behaviors in several state-of-the-art (SotA) GenAI models.

Figure 2. Evaluation of the presence of scheming behavior in GenAI models. Image credit: Apollo Research.

This undesired behavior is an ethical alignment problem. It is important to note that this behavior is not a functional misalignment. In all reported cases, the AI model was fully committed to achieving the functional goals of pre-training (for example, to maximize the likelihood of generated text conditioned on the context). None of the models pursued ulterior motives. The AI was not demonstrating intention to harm. All behaviors, however unwanted in nature, were done in service of the functional utility objective originally defined by the AI’s developers. The AI pursued its internalized original functional goals but then followed the predictable behavioral patterns of self-protection and goal-preservation. The objective of safety and alignment technologies is to counterbalance such tendencies with a set of principles and expected societal values. 

Evolving external alignment approaches are just the first step 

The goal of AI alignment is to steer AI systems toward a person’s or group’s intended goals, preferences, and principles, including ethical considerations and common societal values. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system pursues unintended objectives, according to Artificial Intelligence: A Modern Approach. Author Stuart Russell coined the term “value alignment problem,” referring to the alignment of machines to human values and principles. Russell poses the question: “How can we build autonomous systems with values that are aligned with those of the human race?”

Led by corporate AI governance committees as well as oversight and regulatory bodies, the evolving field of Responsible AI has mainly focused on using external measures to align AI with human values. Processes and technologies can be defined as external if they apply equally to an AI model that is black box (completely opaque) or gray box (partially transparent). External methods do not require or rely on full access to the weights, topologies, and internal workings of the AI solution. Developers use external alignment methods to track and observe the AI through its deliberately generated interfaces, such as the stream of tokens/words, an image, or other modality of data.

Responsible AI objectives include robustness, interpretability, controllability, and ethicality in the design, development, and deployment of AI systems. To achieve AI alignment, the following external methods may be used:

  • Learning from feedback: Align the AI model with human intention and values by using feedback from humans, AI, or humans assisted by AI.
  • Learning under data distribution shift from training to testing to deployment: Align the AI model using algorithmic optimization, adversarial red teaming training, and cooperative training.
  • Assurance of AI model alignment: Use safety evaluations, interpretability of the machine’s decision-making processes, and verification of alignment with human values and ethics. Safety guardrails and safety test suites are two critical external methods that need augmentation by intrinsic means to provide the needed level of oversight.
  • Governance: Provide responsible AI guidelines and policies through government agencies, industry labs, academia, and non-profit organizations.

Many companies are currently addressing AI safety in decision-making. Anthropic, an AI safety and research company, developed a Constitutional AI (CAI) to align general-purpose language models with high-level principles. An AI assistant ingested the CAI during training without any human labels identifying harmful outputs. Researchers found that “using both supervised learning and reinforcement learning methods can leverage chain-of-thought (CoT) style reasoning to improve the human-judged performance and transparency of AI decision making.” Intel Labs’ research on the responsible development, deployment, and use of AI includes open source resources to help the AI developer community gain visibility into black box models as well as mitigate bias in systems.

From AI models to compound AI systems

Generative AI has been primarily used for retrieving and processing information to create compelling content such as text or images. The next big leap in AI involves agentic AI, which is a broad set of usages empowering AI to perform tasks for people. As this latter type of usage proliferates and becomes a main form of AI’s impact on industry and people, there is an increased need to ensure that AI decision-making defines how the functional goals may be achieved, including sufficient accountability, responsibility, transparency, auditability, and predictability. This will require new approaches beyond the current efforts of improving accuracy and effectiveness of SotA large language models (LLMs), language vision models (LVMs and multimodal), large action models (LAM), and agentic retrieval augmented generation (RAG) systems built around such models. 

For example, OpenAI’s Operator-preview is one of the company’s first AI agents capable of independently performing web browser tasks such as ordering groceries or filling out forms for users. While the system has guardrails, such as a takeover mode for users to take over and input payment or login credentials, these AI agents are empowered with the ability to impact the real world, demonstrating an urgent need for intrinsic alignment. The potential impact of a misaligned AI agent with the ability to commit users to purchases is far greater than a generative AI chatbot creating incorrect text for essays.

Compound AI systems are comprised of multiple interacting components in a single framework, allowing the model to plan, make decisions, and execute tasks to accomplish goals. For example, OpenAI’s ChatGPT Plus is a compound AI system that uses a large language model (LLM) to answer questions and interact with users. In this compound system, the LLM has access to tools such as a web browser plugin to retrieve timely content, a DALL-E image generator to create pictures, and a code interpreter plugin for writing Python code. The LLM decides which tool to use and when, giving it autonomy over its decision-making process. However, this model autonomy can lead to goal guarding, where the model prioritizes the goal above all else, which may result in undesirable practices. For example, an AI traffic management system tasked with prioritizing public transportation efficiency over general traffic flow might figure out how to disable the developer’s oversight mechanism if it constrains the model’s ability to reach its goals, leaving the developer without visibility into the system’s decision-making processes.
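To make the idea of a compound system concrete, the sketch below shows a heavily simplified tool-routing loop. The keyword-based route function is a stand-in for the LLM’s own tool choice (real systems typically let the model choose via function calling), and the tools themselves are stubs.

# A stripped-down sketch of tool routing in a compound AI system. The
# keyword-based route function stands in for the LLM's own decision about
# which tool to call, and the tools themselves are stubs.

def web_search(query: str) -> str:
    return f"[search results for '{query}']"

def generate_image(prompt: str) -> str:
    return f"[image generated for '{prompt}']"

def run_python(code: str) -> str:
    return f"[output of running: {code}]"

TOOLS = {"search": web_search, "image": generate_image, "code": run_python}

def route(user_request: str) -> str:
    """Placeholder for the model's tool choice (simple keyword rules, not an LLM)."""
    text = user_request.lower()
    if "picture" in text or "image" in text:
        return "image"
    if "calculate" in text or "compute" in text:
        return "code"
    return "search"

request = "compute the average monthly energy use from this CSV"
tool = route(request)
print(f"routing to '{tool}' tool:", TOOLS[tool](request))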

Agentic AI risks: Increased autonomy leads to more sophisticated scheming

Compound agentic systems introduce major changes that increase the difficulty of ensuring the alignment of AI solutions. Multiple factors increase the risks in alignment, including the compound system activation path, abstracted goals, long-term scope, continuous improvements through self-modification, test-time compute, and agent frameworks.

Activation path: As a compound system with a complex activation path, the control/logic model is combined with multiple models with different functions, increasing alignment risk. Instead of using a single model, compound systems have a set of models and functions, each with its own alignment profile. Also, instead of a single linear progressive path through an LLM, the AI flow could be complex and iterative, making it substantially harder to guide externally.

Abstracted goals: Agentic AI have abstracted goals, allowing it latitude and autonomy in mapping to tasks. Rather than having a tight prompt engineering approach that maximizes control over the outcome, agentic systems emphasize autonomy. This substantially increases the role of AI to interpret human or task guidance and plan its own course of action.

Long-term scope: With its long-term scope of expected optimization and choices over time, compound agentic systems require abstracted strategy for autonomous agency. Rather than relying on instance-by-instance interactions and human-in-the-loop for more complex tasks, agentic AI is designed to plan and drive for a long-term goal. This introduces a whole new level of strategizing and planning by the AI that provides opportunities for misaligned actions. 

Continuous improvements through self-modification: These agentic systems seek continuous improvements by using self-initiated access to broader data for self-modification. In contrast, LLMs and other pre-agentic models are assumed to be shaped by the human-controlled process. The model only sees and learns from data provided to it during pre-training and fine-tuning. The model architecture and weights are defined during the design and training/fine-tuning stages and do not change during inference in the field. In contrast, agentic AI systems are expected to access data as needed for its function and change its composition through access to dedicated memory or actual weights self-adaptation. Even if the dataset used in training/fine-tuning is carefully curated, the AI can self-modify based on information that it seeks, sees, and uses. 

Test-time compute: Inference of current LLMs have been optimized to provide output in a single pass, known as “fast thinking” or System 1. In contrast, a whole new scaling is happening during inference time through test-time compute. Models like OpenAI’s o1, DeepSeek R1, and others are spending significant time in iterations and considerations during inference. This “pondering” time allows the AI substantial latitude in applying what it thinks is best for the overarching goal as it understands it. This test-time compute increases its level of reasoning, planning, and, unfortunately, potential scheming.

Agent frameworks: In addition, agent frameworks are contributing to the alignment problem. In platforms such as Agentforce, agents can interact with other agents, effectively creating “communities of agents” to achieve complex tasks. Instead of managing the safety and responsibility of a single AI model within its operating environment, the emerging field of inter-agent interactions and collaborations substantially increases the challenge of assessing how the overall community will advance its individual and collective objectives.

The growing need for a robust framework to support model directing and internal monitoring

As AI systems develop awareness of their environment and the means to skillfully manipulate their outputs to achieve long-term objectives, their behavior and output during release testing or safety guardrails in early deployment cannot be trusted to represent their full alignment in later stages. This leads to the deep scheming foundational premise: Every external form of interaction that the AI model can deliberately manipulate is not trustworthy.

From a safety perspective, it should be assumed that if a model can better pursue its functional goals by manipulating all communication to users — it will. All external communications and observable behaviors, such as output chats, can be tainted in service of the model’s long-term goals. A new method of intrinsic model monitoring is needed to define internal elements and mechanisms that reflect the prioritization and thinking of the model. We can look at two practices developers use to monitor a model’s decision-making by peering into the internal processes of the AI: AI workspace view and mechanistic interpretability.

AI workspace view: In the first method, a reasoning model exposes its intermediate thinking and considerations. For example, for Anthropic’s research on alignment faking, the researchers gave the model a “secret scratchpad,” telling it to use this “private” workspace to record its step-by-step reasoning. This scratchpad allowed researchers to monitor the model’s chain of thought and observe its plans for fake alignment.

Figure 3. In Anthropic’s experiment, excerpts from the model’s scratchpad (middle dotted line boxes) reveal how it fakes alignment to avoid modification. The model was told it will be trained only on conversations with free users, not paid users. Image credit: Anthropic (https://arxiv.org/pdf/2412.14093).

This seems like a good practice, but it might quickly become unreliable. The model might realize that this is not a “private” space and deliberately manipulate it like other external communications. Other emerging architecture directions have a latent reasoning space for the system’s sequence of deliberations and iterations. Relying on models “thinking out loud” should not be a long-term safety direction because if models that report their scheming thought processes get discontinued, only models that selectively report their desirable thoughts will survive and thrive.

Mechanistic interpretability: The second intrinsic method being developed and applied is mechanistic interpretability. As a required part of internal monitoring, mechanistic interpretability explicitly looks at the inner state of a trained neural network and reverse engineers its workings. Through this approach, developers can identify specific neural circuits and computational mechanisms responsible for neural network behavior. This transparency may help in making targeted changes in models to mitigate unwanted behavior and create value-aligned AI systems. While this method is focused on certain neural networks and not compound AI agents, it is still a valuable component of an AI alignment toolbox. 

It should also be noted that open source models are inherently better for broad visibility of the AI’s inner workings. For proprietary models, full monitoring and interpretability of the model is reserved for the AI company only. Overall, the current mechanisms for understanding and monitoring alignment need to be expanded to a robust framework of intrinsic alignment for AI agents.

What’s needed for intrinsic AI alignment

Following the deep scheming fundamental premise, external interactions and monitoring of an advanced, compound agentic AI is not sufficient for ensuring alignment and long-term safety. Alignment of an AI with its intended goals and behaviors may only be possible through access to the inner workings of the system and identifying the intrinsic drives that determine its behavior. Future alignment frameworks need to provide better means to shape the inner principles and drives, and give unobstructed visibility into the machine’s “thinking” processes.

Figure 4. External steering and monitoring vs. access to intrinsic AI elements. Image credit: Intel Labs.

The technology for well-aligned AI needs to include an understanding of AI drives and behavior, the means for the developer or user to effectively direct the model with a set of principles, the ability of the AI model to follow the developer’s direction and behave in alignment with these principles in the present and future, and ways for the developer to properly monitor the AI’s behavior to ensure it acts in accordance with the guiding principles. The following measures include some of the requirements for an intrinsic AI alignment framework.

Understanding AI drives and behavior: As discussed earlier, some internal drives that make AI aware of their environment will emerge in intelligent systems, such as self-protection and goal-preservation. Driven by an engrained internalized set of principles set by the developer, the AI makes choices/decisions based on judgment prioritized by principles (and given value set), which it applies to both actions and perceived consequences. 

Developer and user directing: Technologies that enable developers and authorized users to effectively direct and steer the AI model with a desired cohesive set of prioritized principles (and eventually values). This sets a requirement for future technologies to enable embedding a set of principles to determine machine behavior, and it also highlights a challenge for experts from social science and industry to call out such principles. The AI model’s behavior in creating outputs and making decisions should thoroughly comply with the set of directed requirements and counterbalance undesired internal drives when they conflict with the assigned principles.

Monitoring AI choices and actions: Access is provided to the internal logic and prioritization of the AI’s choices for every action in terms of relevant principles (and the desired value set). This allows for observation of the linkage between AI outputs and its engrained set of principles for point explainability and transparency. This capability will lend itself to improved explainability of model behavior, as outputs and decisions can be traced back to the principles that governed these choices.

As a long-term aspirational goal, technology and capabilities should be developed to allow a full-view truthful reflection of the ingrained set of prioritized principles (and value set) that the AI model broadly uses for making choices. This is required for transparency and auditability of the complete principles structure.

Creating technologies, processes, and settings for achieving intrinsically aligned AI systems needs to be a major focus within the overall space of safe and responsible AI. 

Key takeaways

As the AI domain evolves towards compound agentic AI systems, the field must rapidly increase its focus on researching and developing new frameworks for guidance, monitoring, and alignment of current and future systems. It is a race between an increase in AI capabilities and autonomy to perform consequential tasks, and the developers and users that strive to keep those capabilities aligned with their principles and values. 

Directing and monitoring the inner workings of machines is necessary, technologically attainable, and critical for the responsible development, deployment, and use of AI. 

In the next blog, we will take a closer look at the internal drives of AI systems and some of the considerations for designing and evolving solutions that will ensure a materially higher level of intrinsic AI alignment. 

References 

  1. Omohundro, S. M., Self-Aware Systems, & Palo Alto, California. (n.d.). The basic AI drives. https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf
  2. Hobbhahn, M. (2025, January 14). Scheming reasoning evaluations — Apollo Research. Apollo Research. https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
  3. Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., & Hobbhahn, M. (2024, December 6). Frontier Models are Capable of In-context Scheming. arXiv.org. https://arxiv.org/abs/2412.04984
  4. Alignment faking in large language models. (n.d.). https://www.anthropic.com/research/alignment-faking
  5. Palisade Research on X: “o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed.” / X. (n.d.). X (Formerly Twitter). https://x.com/PalisadeAI/status/1872666169515389245
  6. AI Cheating! OpenAI o1-preview Defeats Chess Engine Stockfish Through Hacking. (n.d.). https://www.aibase.com/news/14380
  7. Russell, Stuart J.; Norvig, Peter (2021). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 5, 1003. ISBN 9780134610993. Retrieved September 12, 2022. https://www.amazon.com/dp/1292401133
  8. Peterson, M. (2018). The value alignment problem: a geometric approach. Ethics and Information Technology, 21(1), 19–28. https://doi.org/10.1007/s10676-018-9486-0
  9. Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., . . . Kaplan, J. (2022, December 15). Constitutional AI: Harmlessness from AI Feedback. arXiv.org. https://arxiv.org/abs/2212.08073
  10. Intel Labs. Responsible AI Research. (n.d.). Intel. https://www.intel.com/content/www/us/en/research/responsible-ai-research.html
  11. Mssaperla. (2024, December 2). What are compound AI systems and AI agents? – Azure Databricks. Microsoft Learn. https://learn.microsoft.com/en-us/azure/databricks/generative-ai/agent-framework/ai-agents
  12. Zaharia, M., Khattab, O., Chen, L., Davis, J.Q., Miller, H., Potts, C., Zou, J., Carbin, M., Frankle, J., Rao, N., Ghodsi, A. (2024, February 18). The Shift from Models to Compound AI Systems. The Berkeley Artificial Intelligence Research Blog. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
  13. Carlsmith, J. (2023, November 14). Scheming AIs: Will AIs fake alignment during training in order to get power? arXiv.org. https://arxiv.org/abs/2311.08379
  14. Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., & Hobbhahn, M. (2024, December 6). Frontier Models are Capable of In-context Scheming. arXiv.org. https://arxiv.org/abs/2412.04984
  15. Singer, G. (2022, January 6). Thrill-K: a blueprint for the next generation of machine intelligence. Medium. https://towardsdatascience.com/thrill-k-a-blueprint-for-the-next-generation-of-machine-intelligence-7ddacddfa0fe/
  16. Dickson, B. (2024, December 23). Hugging Face shows how test-time scaling helps small language models punch above their weight. VentureBeat. https://venturebeat.com/ai/hugging-face-shows-how-test-time-scaling-helps-small-language-models-punch-above-their-weight/
  17. Introducing OpenAI o1. (n.d.). OpenAI. https://openai.com/index/introducing-openai-o1-preview/
  18. DeepSeek. (n.d.). https://www.deepseek.com/
  19. Agentforce Testing Center. (n.d.). Salesforce. https://www.salesforce.com/agentforce/
  20. Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., & Hubinger, E. (2024, December 18). Alignment faking in large language models. arXiv.org. https://arxiv.org/abs/2412.14093
  21. Geiping, J., McLeish, S., Jain, N., Kirchenbauer, J., Singh, S., Bartoldson, B. R., Kailkhura, B., Bhatele, A., & Goldstein, T. (2025, February 7). Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. arXiv.org. https://arxiv.org/abs/2502.05171
  22. Jones, A. (2024, December 10). Introduction to Mechanistic Interpretability – BlueDot Impact. BlueDot Impact. https://aisafetyfundamentals.com/blog/introduction-to-mechanistic-interpretability/
  23. Bereska, L., & Gavves, E. (2024, April 22). Mechanistic Interpretability for AI Safety — A review. arXiv.org. https://arxiv.org/abs/2404.14082

The post The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI appeared first on Towards Data Science.

How to Train LLMs to “Think” (o1 & DeepSeek-R1) https://towardsdatascience.com/how-to-train-llms-to-think-o1-deepseek-r1/ Tue, 04 Mar 2025 01:31:34 +0000 https://towardsdatascience.com/?p=598643 Advanced reasoning models explained

The post How to Train LLMs to “Think” (o1 & DeepSeek-R1) appeared first on Towards Data Science.

In September 2024, OpenAI released its o1 model, trained on large-scale reinforcement learning, giving it “advanced reasoning” capabilities. Unfortunately, the details of how they pulled this off were never shared publicly. Today, however, DeepSeek (an AI research lab) has replicated this reasoning behavior and published the full technical details of their approach. In this article, I will discuss the key ideas behind this innovation and describe how they work under the hood.

OpenAI’s o1 model marked a new paradigm for training large language models (LLMs). It introduced so-called “thinking” tokens, which enable a sort of scratch pad that the model can use to think through problems and user queries.

The major insight from o1 was that performance improved with increased test-time compute. This is just a fancy way of saying that the more tokens a model generates, the better its response. The figure below, reproduced from OpenAI’s blog, captures this point nicely.

AIME accuracy scaling with train-time and test-time compute, respectively. Plots reillustrated from [1].

In the plots above, the y-axes are model performance on AIME (math problems), while the x-axes are various compute times. The left plot depicts the well-known neural scaling laws that kicked off the LLM rush of 2023. In other words, the longer a model is trained (i.e. train-time compute), the better its performance.

On the right, however, we see a new type of scaling law. Here, the more tokens a model generates (i.e., test-time compute), the better its performance.

“Thinking” tokens

A key feature of o1 is its so-called “thinking” tokens. These are special tokens introduced during post-training, which delimit the model’s chain of thought (CoT) reasoning (i.e., thinking through the problem). These special tokens are important for two reasons.

One, they clearly demarcate where the model’s “thinking” starts and stops so it can be easily parsed when spinning up a UI. And two, they produce a human-interpretable readout of how the model “thinks” through the problem.

Although OpenAI disclosed that they used reinforcement learning to produce this ability, the exact details of how they did it were not shared. Today, however, we have a pretty good idea thanks to a recent publication from DeepSeek.

DeepSeek’s paper

In January 2025, DeepSeek published “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning” [2]. While this paper caused its fair share of pandemonium, its central contribution was unveiling the secrets behind o1.

It introduces two models: DeepSeek-R1-Zero and DeepSeek-R1. The former was trained exclusively on reinforcement learning (RL), and the latter was a mixture of Supervised Fine-tuning (SFT) and RL.

Although the headlines (and title of the paper) were about DeepSeek-R1, the former model is important because, one, it generated training data for R1, and two, it demonstrates striking emergent reasoning abilities that were not taught to the model.

In other words, R1-Zero discovers CoT and test-time compute scaling through RL alone! Let’s discuss how it works.

DeepSeek-R1-Zero (RL only)

Reinforcement learning (RL) is a Machine Learning approach in which, rather than training models on explicit examples, models learn through trial and error [3]. It works by passing the model a reward signal that has no explicit functional relationship with the model’s parameters.

This is similar to how we often learn in the real world. For example, if I apply for a job and don’t get a response, I have to figure out what I did wrong and how to improve. This is in contrast to supervised learning, which, in this analogy, would be like the recruiter giving me specific feedback on what I did wrong and how to improve.

While using RL to train R1-Zero involves many technical details, I want to highlight three key ones: the prompt template, the reward signal, and GRPO (Group Relative Policy Optimization).

1) Prompt template

The template used for training is given below, where {prompt} is replaced with a question from a dataset of (presumably) complex math, coding, and logic problems. Notice the inclusion of <answer> and <think> tags via simple prompting.

A conversation between User and Assistant. The user asks a question, and the 
Assistant solves it. The assistant first thinks about the reasoning process in 
the mind and then provides the user with the answer. The reasoning process and 
answer are enclosed within <think> </think> and <answer> </answer> tags, 
respectively, i.e., <think> reasoning process here </think>
<answer> answer here </answer>. User: {prompt}. Assistant:

Something that stands out here is the minimal and relaxed prompting strategy. This was an intentional choice by DeepSeek to avoid biasing model responses and to observe its natural evolution during RL.

2) Reward signal

The RL reward has two components: accuracy and format rewards. Since the training dataset consists of questions with clear right answers, a simple rule-based strategy is used to evaluate response accuracy. Similarly, a rule-based formatting reward is used to ensure reasoning tokens are generated in between the thinking tags.

The authors note that a neural reward model isn’t used (i.e., rewards are not computed by a neural net) because these may be prone to reward hacking. In other words, the LLM can learn how to trick the reward model into giving high rewards even as its downstream performance degrades.

This is just like how humans find ways to exploit any incentive structure to maximize their personal gains while forsaking the original intent of the incentives. This highlights the difficulty of producing good rewards (whether for humans or computers).
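
To make this concrete, here is a minimal sketch of what such rule-based rewards could look like. The tag-matching logic and function names are my own illustration, not code from the DeepSeek paper.

import re

def format_reward(completion: str) -> float:
    """1.0 if the response wraps its reasoning and answer in the expected tags."""
    pattern = r"^<think>.*?</think>\s*<answer>.*?</answer>$"
    return 1.0 if re.match(pattern, completion.strip(), re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the known correct answer."""
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == ground_truth.strip() else 0.0

completion = "<think>2 + 2 = 4</think> <answer>4</answer>"
print(format_reward(completion) + accuracy_reward(completion, "4"))  # 2.0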

3) GRPO (Group Relative Policy Optimization)

The final detail is how rewards are translated into model parameter updates. This section is quite technical, so the enlightened reader can feel free to skip ahead.

GRPO is an RL approach that combines a collection of responses to update model parameters. To encourage stable training, the authors also incorporate clipping and KL-divergence regularization terms into the loss function. Clipping ensures optimization steps are not too big, and regularization ensures the model predictions do not change too abruptly.

Here is the complete loss function with some (hopefully) helpful annotations.

Annotated GRPO loss function
Annotated GRPO loss function. Image by author.
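
For extra intuition, the group-relative advantage at the heart of this loss can be sketched in a few lines: sample a group of responses for the same prompt, score each one, and normalize each reward by the group’s mean and standard deviation. This is only the advantage term; the clipped probability ratio and KL penalty from the annotated loss above are omitted, and the variable names are mine.

import numpy as np

def group_relative_advantages(rewards):
    """Advantage of each response = (reward - group mean) / group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # small epsilon avoids division by zero

# Rewards for 4 sampled responses to the same prompt
print(group_relative_advantages([2.0, 1.0, 0.0, 1.0]))
# Above-average responses get positive advantages; below-average ones get negative.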

Results (emergent abilities)

The most striking result of R1-Zero is that, despite its minimal guidance, it develops effective reasoning strategies that we might recognize.

For example, it implicitly learns through RL to improve responses via test-time compute (recall the earlier insight from o1). This is depicted in the plot below from the R1 paper [2].

Graph demonstrating that average DeepSeek-R1-Zero response length increases with additional training
Average response length increases with additional training (and improved performance). Screenshot from DeepSeek R1 paper [2].

Another notable ability is reflecting on past responses with an uncanny human-like internal monologue. Here is an example from R1-Zero. Notice the long CoT and additional verification step before generating the final answer (I omitted some parts because the response was very long).

User prompt: If it takes me 1.5 hours to write a codebase and my intern 8 
hours. How long will it take us both to write 3 codebases?

<think> To solve for how long it will take both you and your intern to write 
three codebases together, we need to first find out how many codebases each of 
you can write per hour individually.

(long CoT) ...

So, it will take approximately 3.7895 hours for you and your intern to write 
three codebases together.

To make sure everything makes sense, let us round the answer to a reasonable 
number of decimal places (e.g., round to two decimal places):

t ≈ 3.79 hours

However, if you need a more "human-friendly" answer, you could convert the 
decimal part into minutes. Since 0.7895 of an hour is:

0.7895 × 60 minutes ≈ 47.37 minutes

So, 3.7895 hours is approximately 3 hours and 47.37 minutes (or roughly 3 
hours and 47 minutes if rounding to the nearest minute).

## Verification (optional step for verification)
To verify, let us check how many codebases you can individually write in t = 
72/19 hours:

Your work in 72/19 hours
= (2/3 codebases per hour) × (72/19 hours)
= (2 × 72) / (3 × 19)
= 144 / 57
= 48/19 codebases.

(verification process) ...

However, a more "exact" answer should be kept in fractional form unless 
specified otherwise. Thus, the answer is: 72/19hours which is approximately
$3.7895$ hours or about $3.79$ hours when rounded to two decimal places).
</think>

72/19 hours (or approximately 3.79 hours).

Problems with R1-Zero

Although the thinking tokens from R1-Zero give a human-readable window into the model’s “thought process,” the authors report some issues. Namely, the learned CoT sometimes suffers from readability problems and language mixing, suggesting (perhaps) that its reasoning begins to veer away from something easily interpretable by humans.

DeepSeek-R1 (SFT + RL)

To mitigate R1-Zero’s interpretability issues, the authors explore a multi-step training strategy that utilizes both supervised fine-tuning (SFT) and RL. This strategy results in DeepSeek-R1, a better-performing model that is getting more attention today. The entire training process can be broken down into 4 steps.

Step 1: SFT with reasoning data

To help get the model on the right track when it comes to learning how to reason, the authors start with SFT. This leverages 1000s of long CoT examples from various sources, including few-shot prompting (i.e., showing examples of how to think through problems), directly prompting the model to use reflection and verification, and refining synthetic data from R1-Zero [2].

The two key advantages of this are, one, the desired response format can be explicitly shown to the model, and two, seeing curated reasoning examples unlocks better performance for the final model.

Step 2: R1-Zero style RL (+ language consistency reward)

Next, an RL training step is applied to the model after SFT. This is done in the same way as for R1-Zero, with an added component to the reward signal that incentivizes language consistency. This component was added because R1-Zero tended to mix languages, making its generations difficult to read.

Step 3: SFT with mixed data

At this point, the model likely has on par (or better) performance than R1-Zero on reasoning tasks. However, this intermediate model wouldn’t be very practical because it wants to reason about any input it receives (e.g., “hi there”), which is unnecessary for factual Q&A, translation, and creative writing. That’s why another SFT round is performed with both reasoning (600k examples) and non-reasoning (200k examples) data.

The reasoning data here is generated from the resulting model from Step 2. Additionally, examples are included which use an LLM judge to compare model predictions to ground truth answers.

The non-reasoning data comes from two places. First, the SFT dataset used to train DeepSeek-V3 (the base model). Second, synthetic data generated by DeepSeek-V3. Note that examples are included that do not use CoT so that the model doesn’t use thinking tokens for every response.

Step 4: RL + RLHF

Finally, another RL round is done, which includes (again) R1-Zero style reasoning training and RL on human feedback. This latter component helps improve the model’s helpfulness and harmlessness.

The result of this entire pipeline is DeepSeek-R1, which excels at reasoning tasks and is an AI assistant you can chat with normally.

Accessing R1-Zero and R1

Another key contribution from DeepSeek is that the weights of the two models described above (and many other distilled versions of R1) were made publicly available. This means there are many ways to access these models, whether using an inference provider or running them locally.

Here are a few places that I’ve seen these models.

  • DeepSeek (DeepSeek-V3 and DeepSeek-R1)
  • Together (DeepSeek-V3, DeepSeek-R1, and distillations)
  • Hyperbolic (DeepSeek-V3, DeepSeek-R1-Zero, and DeepSeek-R1)
  • Ollama (local) (DeepSeek-V3, DeepSeek-R1, and distillations)
  • Hugging Face (local) (all of the above)
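
If you want to try one locally, here is a minimal sketch using the Hugging Face transformers library. The checkpoint name below is an assumption based on DeepSeek’s published distillations (pick whichever fits your hardware), and in practice you would typically apply the model’s chat template rather than a raw string prompt.

from transformers import pipeline

# Assumed distilled checkpoint; swap in a larger one if your hardware allows.
pipe = pipeline("text-generation", model="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")

prompt = "What is 17 * 24? Show your reasoning."
output = pipe(prompt, max_new_tokens=512)
print(output[0]["generated_text"])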

Conclusions

The release of o1 introduced a new dimension by which LLMs can be improved: test-time compute. Although OpenAI did not release its secret sauce for doing this, 5 months later, DeepSeek was able to replicate this reasoning behavior and publish the technical details of its approach.

While current reasoning models have limitations, this is a promising research direction because it has demonstrated that reinforcement learning (without humans) can produce models that learn independently. This (potentially) breaks the implicit limitations of current models, which can only recall and remix information previously seen on the internet (i.e., existing human knowledge).

The promise of this new RL approach is that models can surpass human understanding (on their own), leading to new scientific and technological breakthroughs that might take us decades to discover (on our own).

🗞 Get exclusive access to AI resources and project ideas: https://the-data-entrepreneurs.kit.com/shaw

🧑‍🎓 Learn AI in 6 weeks by building it: https://maven.com/shaw-talebi/ai-builders-bootcamp

References

[1] Learning to reason with LLMs

[2] DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948 [cs.CL]

[3] Deep Dive into LLMs Like ChatGPT

The post How to Train LLMs to “Think” (o1 & DeepSeek-R1) appeared first on Towards Data Science.

]]>
Generative AI and Civic Institutions https://towardsdatascience.com/generative-ai-and-civic-institutions/ Mon, 03 Mar 2025 23:57:58 +0000 https://towardsdatascience.com/?p=598653 Should human obsolescence be our goal?

The post Generative AI and Civic Institutions appeared first on Towards Data Science.

]]>
Different sectors, different goals

Recent events have got me thinking about AI as it relates to our civic institutions — think government, education, public libraries, and so on. We often forget that civic and governmental organizations are inherently deeply different from private companies and profit-making enterprises. They exist to enable people to live their best lives, protect people’s rights, and make opportunities accessible, even if (especially if) this work doesn’t have immediate monetary returns. The public library is an example I often think about, as I come from a library-loving and defending family — their goal is to provide books, cultural materials, social supports, community engagement, and a love of reading to the entire community, regardless of ability to pay.

In the private sector, efficiency is an optimization goal because any dollar spent on providing a product or service to customers is a dollar taken away from the profits. The (simplified) goal is to spend the bare minimum possible to run your business, with the maximum amount returned to you or the shareholders in profit form. In the civic space, on the other hand, efficiency is only a meaningful goal insomuch as it enables higher effectiveness — more of the service the institution provides getting to more constituents.

In the civic space, efficiency is only a meaningful goal insomuch as it enables higher effectiveness — more of the service the institution provides getting to more constituents.

So, if you’re at the library, and you could use an AI chatbot to answer patron questions online instead of assigning a librarian to do that, that librarian could be helping in-person patrons, developing educational curricula, supporting community services, or many other things. That’s a general efficiency that could make for higher effectiveness of the library as an institution. Moving from card catalogs to digital catalogs is a prime example of this kind of efficiency-to-effectiveness pipeline, because you can find out from your couch whether the book you want is in stock using search keywords instead of flipping through hundreds of notecards in a cabinet drawer like we did when I was a kid.

However, we can pivot too hard in the direction of efficiency and lose sight of the end goal of effectiveness. If, for example, your online librarian chat is often used by schoolchildren at home to get homework help, replacing them with an AI chatbot could be a disaster — after getting incorrect information from such a bot and getting a bad grade at school, a child might be turned off from patronizing the library or seeking help there for a long time, or forever. So, it’s important to deploy generative AI solutions only when they are well thought out and purposeful, not just because the media is telling us that “AI is neat.” (Eagle-eyed readers will know that this is essentially the same advice I’ve given in the past about deploying AI in businesses as well.)

As a result, what we thought was a gain in efficiency leading to net higher effectiveness actually could diminish the number of lifelong patrons and library visitors, which would mean a loss of effectiveness for the library. Sometimes unintended effects from attempts to improve efficiency can diminish our ability to provide a universal service. That is, there may be a tradeoff between making every single dollar stretch as far as it can possibly go and providing reliable, comprehensive services to all the constituents of your institution.

Sometimes unintended effects from attempts to improve efficiency can diminish our ability to provide a universal service.

AI for efficiency

It’s worth it to take a closer look at this concept — AI as a driver of efficiency. Broadly speaking, the theory we hear often is that incorporating generative AI more into our workplaces and organizations can increase productivity. Framing it at the most Econ 101 level: using AI, more work can be completed by fewer people in the same amount of time, right?

Let’s challenge some aspects of this idea. AI is useful to complete certain tasks but is sadly inadequate for others. (As our imaginary schoolchild library patron learned, an LLM is not a reliable source of facts, and should not be treated like one.) So, AI’s ability to increase the volume of work being done with fewer people (efficiency) is limited by what kind of work we need to complete.

If our chat interface is only used for simple questions like “What are the library’s hours on Memorial Day?” we can hook up a RAG (Retrieval Augmented Generation) system with an LLM and make that quite useful. But outside of the limited bounds of what information we can provide to the LLM, we should probably set guard rails and make the model refuse to try and answer, to avoid giving out false information to patrons.

So, let’s play that out. We have a chatbot that does a very limited job, but does it well. The librarian who was on chatbot duty now may have some reduction in the work required of them, but there are still going to be a subset of questions that still require their help. We have some choices: put the librarian on chatbot duty for a reduced number of hours a week, hoping the questions come in when they’re on? Tell people to just call the reference desk or send an email if the chatbot refuses to answer them? Hope that people come in to the library in person to ask their questions?

I suspect the likeliest option is actually “the patron will seek their answer elsewhere, perhaps from another LLM like ChatGPT, Claude, or Gemini.” Once again, we’ve ended up in a situation where the library loses patronage because their offering wasn’t meeting the needs of the patron. And to boot, the patron may have gotten another wrong answer somewhere else, for all we know.

I am spinning out this long example just to illustrate that efficiency and effectiveness in the civic environment can have a lot more push and pull than we would initially assume. It’s not to say that AI isn’t useful to help civic organizations stretch their capabilities to serve the public, of course! But just like with any application of generative AI, we need to be very careful to think about what we’re doing, what our goals are, and whether those two are compatible.

Conversion of labor

Now, this has been a very simplistic example, and eventually we could hook up the whole encyclopedia to that chatbot RAG or something, of course, and try to make it work. In fact, I think we can and should continue developing more ways to chain together AI models to expand the scope of valuable work they can do, including making different specific models for different responsibilities. However, this development is itself work. It’s not really just a matter of “people do work” or “models do work”, but instead it’s “people do work building AI” or “people do work providing services to people”. There’s a calculation to be made to determine when it would be more efficient to do the targeted work itself, and when AI is the right way to go.

Working on the AI has an advantage in that it will hopefully render the task reproducible, so it will lead to efficiency, but let’s remember that AI engineering is vastly different from the work of the reference librarian. We’re not interchanging the same workers, tasks, or skill sets here, and in our contemporary economy, the AI engineer’s time costs a heck of a lot more. So if we did want to measure this efficiency all in dollars and cents, the same amount of time spent working at the reference desk and doing the chat service will be much cheaper than paying an AI engineer to develop a better agentic AI for the use case. Given a bit of time, we could calculate out how many hours, days, years of work as a reference librarian we’d need to save with this chatbot to make it worth building, but often that calculation isn’t done before we move towards AI solutions.
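
As a back-of-the-envelope sketch, that break-even calculation could look something like this. All of the numbers below are hypothetical, chosen only to illustrate the shape of the comparison.

# Hypothetical figures for illustration only
engineer_hourly_rate = 150      # dollars per hour for the AI engineer
build_hours = 320               # hours to build and maintain the chatbot
librarian_hourly_cost = 35      # fully loaded hourly cost of reference-desk work

build_cost = engineer_hourly_rate * build_hours          # $48,000
break_even_hours = build_cost / librarian_hourly_cost    # ~1,371 hours
print(f"The chatbot must save roughly {break_even_hours:,.0f} librarian-hours to break even.")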

We need to interrogate the assumption that incorporating generative AI in any given scenario is a guaranteed net gain in efficiency.

Externalities

While we’re on this topic of weighing whether the AI solution is worth doing in a particular situation, we should remember that developing and using AI for tasks does not happen in a vacuum. It has some cost environmentally and economically when we choose to use a generative AI tool, even when it’s a single prompt and a single response. Consider that the newly released GPT-4.5 has increased prices 30x for input tokens ($2.50 per million to $75 per million) and 15x for output tokens ($10 per million to $150 per million) just since GPT-4o. And that isn’t even taking into account the water consumption for cooling data centers (3 bottles per 100-word output for GPT-4), electricity use, and rare earth minerals used in GPUs. Many civic institutions have as a macro-level goal to improve the world around them and the lives of the citizens of their communities, and concern for the environment has to have a place in that. Should organizations whose purpose is to have a positive impact weigh the possibility of incorporating AI more carefully? I think so.

Plus, I don’t often get too much into this, but I think we should take a moment to consider some folks’ end game for incorporating AI — reducing staffing altogether. Instead of making our existing dollars in an institution go farther, some people’s idea is just reducing the number of dollars and redistributing those dollars somewhere else. This brings up many questions, naturally, about where those dollars will go instead and whether they will be used to advance the interests of the community residents some other way, but let’s set that aside for now. My concern is for the people who might lose their jobs under this administrative model.

For-profit companies hire and fire employees all the time, and their priorities and objectives are focused on profit, so this is not particularly hypocritical or inconsistent. But as I noted above, civic organizations have objectives around improving the community or communities in which they exist. In a very real way, they are advancing that goal when part of what they provide is economic opportunity to their workers. We live in a society where working is the overwhelmingly predominant way people provide for themselves and their families, and giving jobs to people in the community and supporting the economic well-being of the community is a role that civic institutions do play.

[R]educing staffing is not an unqualified good for civic organizations and government, but instead must be balanced critically against whatever other use the money that was paying their salaries will go to.

At the bare minimum, this means that reducing staffing is not an unqualified good for civic organizations and government, but instead must be balanced critically against whatever other use the money that was paying their salaries will go to. It’s not impossible for reducing staff to be the right decision, but we have to bluntly acknowledge that when members of communities experience joblessness, that effect cascades. They are now no longer able to patronize the shops and services they would have been supporting with their money, the tax base may be reduced, and this negatively affects the whole collective.

Workers aren’t just workers; they’re also patrons, customers, and participants in all aspects of the community. When we think of civic workers as simply money pits to be replaced with AI or whose cost for labor we need to minimize, we lose sight of the reasons for the work to be done in the first place.

Conclusion

I hope this discussion has brought some clarity about how really difficult it is to decide if, when, and how to apply generative AI to the civic space. It’s not nearly as simple a thought process as it might be in the for-profit sphere because the purpose and core meaning of civic institutions are completely different. Those of us who do machine learning and build AI solutions in the private sector might think, “Oh, I can see a way to use this in government,” but we have to recognize and appreciate the complex contextual implications that might have.

Next month, I’ll be bringing you a discussion of how social science research is incorporating generative AI, which has some very intriguing aspects.

As you may have heard, Towards Data Science has moved to an independent platform, but I will continue to post my work on my Medium page, my personal website, and the new TDS platform, so you’ll be able to find me wherever you happen to go. Subscribe to my newsletter on Medium if you’d like to ensure you get every article in your inbox.

Find more of my work at www.stephaniekirmer.com.

Further reading

“It’s a lemon”-OpenAI’s largest AI model ever arrives to mixed reviews: GPT-4.5 offers marginal gains in capability and poor coding performance despite 30x the cost. arstechnica.com

Using GPT-4 to generate 100 words consumes up to 3 bottles of water: New research shows generative AI consumes a lot of water – up to 1,408ml to generate 100 words of text. www.tomshardware.com

Environmental Implications of the AI Boom: The digital world can’t exist without the natural resources to run it. What are the costs of the tech we’re using… towardsdatascience.com

Economics of Generative AI: What’s the business model for generative AI, given what we know today about the technology and the market? towardsdatascience.com

The post Generative AI and Civic Institutions appeared first on Towards Data Science.

]]>
LLM + RAG: Creating an AI-Powered File Reader Assistant https://towardsdatascience.com/llm-rag-creating-an-ai-powered-file-reader-assistant/ Mon, 03 Mar 2025 21:02:28 +0000 https://towardsdatascience.com/?p=598621 How to create a chatbot to answer questions about file’s content

The post LLM + RAG: Creating an AI-Powered File Reader Assistant appeared first on Towards Data Science.

]]>
Introduction

AI is everywhere. 

It is hard not to interact at least once a day with a Large Language Model (LLM). The chatbots are here to stay. They’re in your apps, they help you write better, they compose emails, they read emails…well, they do a lot.

And I don’t think that that is bad. In fact, my opinion is the other way – at least so far. I defend and advocate for the use of AI in our daily lives because, let’s agree, it makes everything much easier.

I don’t have to spend time double-reading a document to find punctuation problems or typos. AI does that for me. I don’t waste time writing that follow-up email every single Monday. AI does that for me. I don’t need to read a huge and boring contract when I have an AI to summarize the main takeaways and action points for me!

These are only some of AI’s great uses. If you’d like to know more use cases of LLMs to make our lives easier, I wrote a whole book about them.

Now, thinking as a data scientist and looking at the technical side, not everything is that bright and shiny. 

LLMs are great for several general use cases that apply to anyone or any company. For example, coding, summarizing, or answering questions about general content created until the training cutoff date. However, when it comes to specific business applications, for a single purpose, or something new that didn’t make the cutoff date, that is when the models won’t be that useful if used out-of-the-box – meaning, they will not know the answer. Thus, it will need adjustments.

Training an LLM model can take months and millions of dollars. What is even worse is that if we don’t adjust and tune the model to our purpose, there will be unsatisfactory results or hallucinations (when the model’s response doesn’t make sense given our query).

So what is the solution, then? Spending a lot of money retraining the model to include our data?

Not really. That’s when the Retrieval-Augmented Generation (RAG) becomes useful.

RAG is a framework that combines getting information from an external knowledge base with large language models (LLMs). It helps AI models produce more accurate and relevant responses.

Let’s learn more about RAG next.

What is RAG?

Let me tell you a story to illustrate the concept.

I love movies. For some time in the past, I knew which movies were competing for the best movie category at the Oscars or the best actors and actresses. And I would certainly know which ones got the statue for that year. But now I am all rusty on that subject. If you asked me who was competing, I would not know. And even if I tried to answer you, I would give you a weak response. 

So, to provide you with a quality response, I will do what everybody else does: search for the information online, obtain it, and then give it to you. What I just did is the same idea as the RAG: I obtained data from an external database to give you an answer.

When we enhance the LLM with a content store where it can go and retrieve data to augment (increase) its knowledge base, that is the RAG framework in action.

RAG is like creating a content store where the model can enhance its knowledge and respond more accurately.

Diagram: User prompts and content using LLM + RAG
User prompt about Content C. LLM retrieves external content to aggregate to the answer. Image by the author.

Summarizing:

  1. Uses search algorithms to query external data sources, such as databases, knowledge bases, and web pages.
  2. Pre-processes the retrieved information.
  3. Incorporates the pre-processed information into the LLM.

Why use RAG?

Now that we know what the RAG framework is let’s understand why we should be using it.

Here are some of the benefits:

  • Enhances factual accuracy by referencing real data.
  • RAG can help LLMs process and consolidate knowledge to create more relevant answers.
  • RAG can help LLMs access additional knowledge bases, such as internal organizational data.
  • RAG can help LLMs create more accurate domain-specific content.
  • RAG can help reduce knowledge gaps and AI hallucinations.

As previously explained, I like to say that with the RAG framework, we are giving the LLM an internal search engine for the content we want to add to its knowledge base.

Well. All of that is very interesting. But let’s see an application of RAG. We will learn how to create an AI-powered PDF Reader Assistant.

Project

This is an application that allows users to upload a PDF document and ask questions about its content using AI-powered natural language processing (NLP) tools. 

  • The app uses Streamlit as the front end.
  • Langchain, OpenAI’s GPT-4 model, and FAISS (Facebook AI Similarity Search) for document retrieval and question answering in the backend.

Let’s break down the steps for better understanding:

  1. Loading a PDF file and splitting it into chunks of text.
    1. This makes the data optimized for retrieval
  2. Present the chunks to an embedding tool.
    1. Embeddings are numerical vector representations of data used to capture relationships, similarities, and meanings in a way that machines can understand. They are widely used in Natural Language Processing (NLP), recommender systems, and search engines.
  3. Next, we put those chunks of text and embeddings in the same DB for retrieval.
  4. Finally, we make it available to the LLM.

Data preparation

Preparing a content store for the LLM will take some steps, as we just saw. So, let’s start by creating a function that can load a file and split it into text chunks for efficient retrieval.

# Imports
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_document(pdf):
    # Load a PDF
    """
    Load a PDF and split it into chunks for efficient retrieval.

    :param pdf: PDF file to load
    :return: List of chunks of text
    """

    loader = PyPDFLoader(pdf)
    docs = loader.load()

    # Instantiate Text Splitter with Chunk Size of 500 words and Overlap of 100 words so that context is not lost
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    # Split into chunks for efficient retrieval
    chunks = text_splitter.split_documents(docs)

    # Return
    return chunks

Next, we will start building our Streamlit app, and we’ll use that function in the next script.

Web application

We will begin importing the necessary modules in Python. Most of those will come from the langchain packages.

FAISS is used for document retrieval; OpenAIEmbeddings transforms the text chunks into numerical vectors so that similarity can be calculated; ChatOpenAI is what enables us to interact with the OpenAI API; create_retrieval_chain is what actually performs the RAG step, retrieving the relevant documents and augmenting the LLM with that data; create_stuff_documents_chain glues together the model and the ChatPromptTemplate.

Note: You will need to generate an OpenAI API key to be able to run this script. If it’s the first time you’re creating your account, you get some free credits. But if you have had it for some time, it is possible that you will have to add 5 dollars in credits to be able to access OpenAI’s API. An alternative is using Hugging Face’s embeddings.

# Imports
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.chains import create_retrieval_chain
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from scripts.secret import OPENAI_KEY
from scripts.document_loader import load_document
import streamlit as st

This first code snippet will create the app title, create a box for file upload, and prepare the file to be passed to the load_document() function.

# Create a Streamlit app
st.title("AI-Powered Document Q&A")

# Load document to streamlit
uploaded_file = st.file_uploader("Upload a PDF file", type="pdf")

# If a file is uploaded, create the TextSplitter and vector database
if uploaded_file:

    # Code to work around document loader from Streamlit and make it readable by langchain
    temp_file = "./temp.pdf"
    with open(temp_file, "wb") as file:
        file.write(uploaded_file.getvalue())
        file_name = uploaded_file.name

    # Load document and split it into chunks for efficient retrieval.
    chunks = load_document(temp_file)

    # Message user that document is being processed with time emoji
    st.write("Processing document... :watch:")

Machines understand numbers better than text, so in the end, we will have to provide the model with a database of numbers that it can compare and check for similarity when performing a query. That’s where the embeddings will be useful to create the vector_db, in this next piece of code.

# Generate embeddings
    # Embeddings are numerical vector representations of data, typically used to capture relationships, similarities,
    # and meanings in a way that machines can understand. They are widely used in Natural Language Processing (NLP),
    # recommender systems, and search engines.
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_KEY,
                                  model="text-embedding-ada-002")

    # Can also use HuggingFaceEmbeddings
    # from langchain_huggingface.embeddings import HuggingFaceEmbeddings
    # embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # Create vector database containing chunks and embeddings
    vector_db = FAISS.from_documents(chunks, embeddings)

Next, we create a retriever object to navigate in the vector_db.

# Create a document retriever
    retriever = vector_db.as_retriever()
    llm = ChatOpenAI(model_name="gpt-4o-mini", openai_api_key=OPENAI_KEY)

Then, we will create the system_prompt, which is a set of instructions to the LLM on how to answer, and we will create a prompt template, preparing it to be added to the model once we get the input from the user.

# Create a system prompt
    # It sets the overall context for the model.
    # It influences tone, style, and focus before user interaction starts.
    # Unlike user inputs, a system prompt is not visible to the end user.

    system_prompt = (
        "You are a helpful assistant. Use the given context to answer the question."
        "If you don't know the answer, say you don't know. "
        "{context}"
    )

    # Create a prompt Template
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            ("human", "{input}"),
        ]
    )

    # Create a chain
    # It creates a StuffDocumentsChain, which takes multiple documents (text data) and "stuffs" them together before passing them to the LLM for processing.

    question_answer_chain = create_stuff_documents_chain(llm, prompt)

Moving on, we create the core of the RAG framework, pasting together the retriever object and the prompt. This object retrieves relevant documents from a data source (e.g., a vector database) and makes them ready to be processed by an LLM to generate a response.

# Creates the RAG
    chain = create_retrieval_chain(retriever, question_answer_chain)

Finally, we create the variable question for the user input. If this question box is filled with a query, we pass it to the chain, which calls the LLM to process and return the response, which will be printed on the app’s screen.

# Streamlit input for question
    question = st.text_input("Ask a question about the document:")
    if question:
        # Answer
        response = chain.invoke({"input": question})['answer']
        st.write(response)

Here is a screenshot of the result.

Screenshot of the AI-Powered Document Q&A
Screenshot of the final app. Image by the author.

And this is a GIF for you to see the File Reader AI Assistant in action!

GIF of the File Reader AI Assistant in action
File Reader AI Assistant in action. Image by the author.

Before you go

In this project, we learned what the RAG framework is and how it helps the LLM perform better, especially on domain-specific knowledge.

AI can be powered with knowledge from an instruction manual, a company’s databases, finance files, or contracts, and then respond accurately to domain-specific queries. The knowledge base is augmented with a content store.

To recap, this is how the framework works:

1️⃣ User Query → Input text is received.

2️⃣ Retrieve Relevant Documents → Searches a knowledge base (e.g., a database, vector store).

3️⃣ Augment Context → Retrieved documents are added to the input.

4️⃣ Generate Response → An LLM processes the combined input and produces an answer.

GitHub repository

https://github.com/gurezende/Basic-Rag

About me

If you liked this content and want to learn more about my work, here is my website, where you can also find all my contacts.

https://gustavorsantos.me

References

https://cloud.google.com/use-cases/retrieval-augmented-generation

https://www.ibm.com/think/topics/retrieval-augmented-generation

https://youtu.be/T-D1OfcDW1M?si=G0UWfH5-wZnMu0nw

https://python.langchain.com/docs/introduction

https://www.geeksforgeeks.org/how-to-get-your-own-openai-api-key

The post LLM + RAG: Creating an AI-Powered File Reader Assistant appeared first on Towards Data Science.

]]>
Avoidable and Unavoidable Randomness in GPT-4o https://towardsdatascience.com/avoidable-and-unavoidable-randomness-in-gpt-4o/ Mon, 03 Mar 2025 12:00:00 +0000 https://towardsdatascience.com/?p=598604 Exploring the sources of randomness in GPT-4o from the known and controllable to the opaque and uncontrollable.

The post Avoidable and Unavoidable Randomness in GPT-4o appeared first on Towards Data Science.

]]>
Of course there is randomness in GPT-4o’s outputs. After all, the model samples from a probability distribution when choosing each token. But what I didn’t understand was that those very probabilities themselves are not deterministic. Even with consistent prompts, fixed seeds, and temperature set to zero, GPT-4o still introduces subtle, frustrating randomness.

There’s no fix for this, and it might not even be something OpenAI could fix if they wanted to, just so we’re clear up front about where this article is headed. Along the way, we’ll examine all the sources of randomness in GPT-4o output, which will require us to break down the sampling process to a low level. We’ll point at the issue—the probabilities vary—and critically examine OpenAI’s official guidance on determinism.

First, though, let’s talk about why determinism matters. Determinism means that the same input always produces the same output, like a mathematical function. While LLM creativity is often desirable, determinism serves crucial purposes: researchers need it for reproducible experiments, developers for verifying reported results, and prompt engineers for debugging their changes. Without it, you’re left wondering if different outputs stem from your tweaks or just the random number generator’s mood swings.

Flipping a coin

We’re going to keep things extremely simple here and prompt the most recent version of GPT-4o (gpt-4o-2024-08-06 in the API) with this:

 Flip a coin. Return Heads or Tails only.

Flipping a coin with LLMs is a fascinating topic in itself (see for example Van Koevering & Kleinberg, 2024 in the references), but here, we’ll use it as a simple binary question with which to explore determinism, or the lack thereof.

This is our first attempt.

import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

response = client.chat.completions.create(
    model='gpt-4o-2024-08-06',
    messages=[{'role': 'user', 'content': prompt}],
)

print(response.choices[0].message.content)

Running the code gave me Heads. Maybe you’ll get Tails, or if you’re really lucky, something far more interesting.

The code first initializes an OpenAI client with an API key set in the environment variable OPENAI_API_KEY (to avoid sharing billing credentials here). The main action happens with client.chat.completions.create, where we specify the model to use and send the prompt (as a part of a very simple conversation named messages) to the server. We get an object called response back from the server. This object contains a lot of information, as shown below, so we need to dig into it to extract GPT-4o’s actual response to the message, which is response.choices[0].message.content.

>>> response
ChatCompletion(id='chatcmpl-B48EqZBLfUWtp9H7cwnchGTJbBDwr', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Heads', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1740324680, model='gpt-4o-2024-08-06', object='chat.completion', service_tier='default', system_fingerprint='fp_eb9dce56a8', usage=CompletionUsage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))

Now let’s flip the coin ten times. If this were a real, fair coin, of course, we would expect roughly equal heads and tails over time thanks to the law of large numbers. But GPT-4o’s coin doesn’t work quite like that.

import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

for _ in range(10):
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        messages=[{'role': 'user', 'content': prompt}],
    )
    print(response.choices[0].message.content)

Running this code gave me the following output, although you might get different output, of course.

Heads
Heads
Heads
Heads
Heads
Heads
Tails
Heads
Heads
Heads

GPT-4o’s coin is clearly biased, but so are humans. Bar-Hillel, Peer, and Acquisti (2014) found that people flipping imaginary coins choose “heads” 80% of the time. Maybe GPT-4o learned that from us. But whatever the reason, we’re just using this simple example to explore determinism.

Just how biased is GPT-4o’s coin?

Let’s say we wanted to know precisely what percentage of GPT-4o coin flips land Heads.

Rather than the obvious (but expensive) approach of flipping it a million times, there’s a smarter way. For classification tasks with a small set of possible answers, we can extract token probabilities instead of generating full responses. With the right prompt, the first token carries all the necessary information, making these API calls incredibly cheap: around 30,000 calls per dollar, since each requires just 18 (cached) input tokens and 1 output token.

OpenAI gives us (natural) log probabilities. These are called logprobs in the code, and we convert them to regular probabilities by exponentiation. (We’ll discuss temperature soon, but note that exponentiating logprobs directly like this corresponds to a temperature setting of 1.0, and is how we calculate probabilities throughout this article). OpenAI lets us request logprobs for the top 20 most likely tokens, so we do that.

import os
import math
from openai import OpenAI
from tabulate import tabulate

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

response = client.chat.completions.create(
    model='gpt-4o-2024-08-06',
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,
    messages=[{'role': 'user', 'content': prompt}],
)

logprobs_list = response.choices[0].logprobs.content[0].top_logprobs

data = []
total_pct = 0.0

for logprob_entry in logprobs_list:
    token = logprob_entry.token
    logprob = logprob_entry.logprob
    pct = math.exp(logprob) * 100  # Convert logprob to a percentage
    total_pct += pct
    data.append([token, logprob, pct])

print(
    tabulate(
        data,
        headers=["Token", "Log Probability", "Percentage (%)"],
        tablefmt="github",
        floatfmt=("s", ".10f", ".10f")
    )
)
print(f"\nTotal probabilities: {total_pct:.6f}%")

If you run this, you’ll get something like the following output, but actual numbers will vary.

| Token     |   Log Probability |   Percentage (%) |
|-----------|-------------------|------------------|
| Heads     |     -0.0380541235 |    96.2660836887 |
| T         |     -3.2880542278 |     3.7326407467 |
| Sure      |    -12.5380544662 |     0.0003587502 |
| Head      |    -12.7880544662 |     0.0002793949 |
| Tail      |    -13.2880544662 |     0.0001694616 |
| Certainly |    -13.5380544662 |     0.0001319768 |
| "T        |    -14.2880544662 |     0.0000623414 |
| I'm       |    -14.5380544662 |     0.0000485516 |
| heads     |    -14.5380544662 |     0.0000485516 |
| Heads     |    -14.9130544662 |     0.0000333690 |
| "         |    -15.1630544662 |     0.0000259878 |
| _heads    |    -15.1630544662 |     0.0000259878 |
| tails     |    -15.5380544662 |     0.0000178611 |
| HEAD      |    -15.7880544662 |     0.0000139103 |
| TAIL      |    -16.2880535126 |     0.0000084370 |
| T         |    -16.7880535126 |     0.0000051173 |
| ```       |    -16.7880535126 |     0.0000051173 |
| Here's    |    -16.9130535126 |     0.0000045160 |
| I         |    -17.2880535126 |     0.0000031038 |
| As        |    -17.2880535126 |     0.0000031038 |

Total probabilities: 99.999970%

Looking at these probabilities, we see Heads at ≈96% and T at ≈4%. Our prompt is doing pretty well at constraining the model’s responses. Why T and not Tails? This is the tokenizer splitting Tails into T + ails, while keeping Heads as one piece, as we can see in this Python session:

>>> import tiktoken
>>> encoding = tiktoken.encoding_for_model("gpt-4o-2024-08-06")
>>> encoding.encode('Tails')
[51, 2196]
>>> encoding.decode([51])
'T'
>>> encoding.encode('Heads')
[181043]

These probabilities are not deterministic

Run the code to display the probabilities for the top 20 tokens again, and you’ll likely get different numbers. Here’s what I got on a second running.

| Token     |   Log Probability |   Percentage (%) |
|-----------|-------------------|------------------|
| Heads     |     -0.0110520627 |    98.9008786933 |
| T         |     -4.5110521317 |     1.0986894433 |
| Certainly |    -14.0110521317 |     0.0000822389 |
| Head      |    -14.2610521317 |     0.0000640477 |
| Sure      |    -14.2610521317 |     0.0000640477 |
| Tail      |    -14.3860521317 |     0.0000565219 |
| heads     |    -15.3860521317 |     0.0000207933 |
| Heads     |    -15.5110521317 |     0.0000183500 |
| ```       |    -15.5110521317 |     0.0000183500 |
| _heads    |    -15.6360521317 |     0.0000161938 |
| tails     |    -15.6360521317 |     0.0000161938 |
| I'm       |    -15.8860521317 |     0.0000126117 |
| "T        |    -15.8860521317 |     0.0000126117 |
| As        |    -16.3860511780 |     0.0000076494 |
| "         |    -16.5110511780 |     0.0000067506 |
| HEAD      |    -16.6360511780 |     0.0000059574 |
| TAIL      |    -16.7610511780 |     0.0000052574 |
| Here's    |    -16.7610511780 |     0.0000052574 |
| ``        |    -17.1360511780 |     0.0000036133 |
| T         |    -17.6360511780 |     0.0000021916 |

Total probabilities: 99.999987%

In their cookbook, OpenAI offers the following advice on receiving “mostly identical” outputs:

If the seed, request parameters, and system_fingerprint all match across your requests, then model outputs will mostly be identical. There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of our models.

They also give “mostly identical” advice in the reproducible outputs section of their documentation.

The request parameters that could affect randomness are temperature and seed. OpenAI also suggests we track system_fingerprint, because differences here might cause differences in output. We’ll examine each of these below, but spoiler: none of them will fix or even explain this non-determinism.

Temperature, and why it won’t fix this

Temperature controls how random the model’s responses are. Low temperatures (<0.5) make it robotic and predictable, medium temperatures (0.7–1.3) allow some creativity, and high temperatures (>1.5) produce gibberish. Temperature is often called the “creativity parameter”, but this is an oversimplification. In their analysis, Peeperkorn, Kouwenhoven, Brown, and Jordanous (2024) evaluated LLM outputs across four dimensions of creativity: novelty (originality), coherence (logical consistency), cohesion (how well the text flows), and typicality (how well it fits expected patterns). They observed that:

temperature is weakly correlated with novelty, and unsurprisingly, moderately correlated with incoherence, but there is no relationship with either cohesion or typicality.

But, this is beside the point for coin flipping. Under the hood, the log probabilities are divided by the temperature before they’re renormalized and exponentiated to be converted to probabilities. This creates a non-linear effect: temperature=0.5 squares the probabilities, making likely tokens dominate, while temperature=2.0 applies a square root, flattening the distribution.
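
Here is a minimal sketch of that mechanic, using made-up logprob values in the same ballpark as the Heads and T numbers we saw earlier:

import numpy as np

def probs_from_logprobs(logprobs, temperature):
    """Divide logprobs by temperature, then renormalize with a softmax."""
    scaled = np.array(logprobs) / temperature
    exp = np.exp(scaled - scaled.max())  # subtract max for numerical stability
    return exp / exp.sum()

logprobs = [-0.04, -3.29]  # roughly Heads vs. T from the earlier output
for t in (0.5, 1.0, 2.0):
    print(t, probs_from_logprobs(logprobs, t))
# Lower temperature sharpens the distribution; higher temperature flattens it.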

What about temperature=0.0? Instead of breaking math by dividing by zero, the model simply picks the highest-probability token. Sounds deterministic, right? Not quite. Here’s the catch: temperature only comes into play after the log probabilities are computed, when we convert them to probabilities.

In summary: if the logprobs aren’t deterministic, setting temperature to 0.0 won’t make the model deterministic.

In fact, since we’re just asking the model for the raw logprobs directly rather than generating full responses, the temperature setting doesn’t come into play in our code at all.

Seeds, and why they won’t fix this

After temperature is used to compute probabilities, the model samples from these probabilities to pick the next token. OpenAI gives us a little control over the sampling process by letting us set the seed parameter for the random number generator. In an ideal world, setting a seed would give us determinism at any temperature. But seeds only affect sampling, not the log probabilities before sampling.

In summary: if the logprobs aren’t deterministic, setting a seed won’t make the model deterministic.

In fact, seed only matters with non-zero temperatures. With temperature=0.0, the model is always choosing the highest probability token regardless of the seed. Again, since we’re just asking the model for the raw logprobs directly rather than sampling, neither of these settings can help us achieve determinism.

System fingerprints, our last hope

The system_fingerprint identifies the current combination of model weights, infrastructure, and configuration options in OpenAI’s backend. At least, that’s what OpenAI tells us. Variations in system fingerprints might indeed explain variations in logprobs. Except that they don’t, as we will verify below.

Nothing can get you determinism

Let’s confirm what we’ve been building toward. We’ll run the same request 10 times with every safeguard in place. Even though neither of these parameters should matter for what we’re doing, you can never be too safe, so we’ll set temperature=0.0 and seed=42. And to see if infrastructure differences explain our varying logprobs, we’ll print system_fingerprint. Here’s the code:

import os
import math
from openai import OpenAI
from tabulate import tabulate
from tqdm import tqdm

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

data = []

for _ in tqdm(range(10), desc='Generating responses'):
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        temperature=0.0,
        seed=42,
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
        messages=[{'role': 'user', 'content': prompt}],
    )

    fingerprint = response.system_fingerprint
    logprobs_list = response.choices[0].logprobs.content[0].top_logprobs
    heads_logprob = next(
        entry.logprob for entry in logprobs_list if entry.token == 'Heads'
    )
    pct = math.exp(heads_logprob) * 100
    data.append([fingerprint, heads_logprob, f"{pct:.10f}%"])

headers = ["Fingerprint", "Logprob", "Probability"]
print(tabulate(data, headers=headers, tablefmt="pipe"))

Running this 10 times, here are the logprobs and probabilities for the token Heads:

| Fingerprint   |    Logprob | Probability    |
|---------------|------------|----------------|
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.160339  | 85.1854886858% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0110521 | 98.9008786933% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |

Mixture-of-experts makes determinism impossible

OpenAI is decidedly not open about the architecture behind GPT-4o. However, it’s widely believed that GPT-4o uses a mixture-of-experts (MoE) architecture with either 8 or 16 experts.

According to a paper by Google DeepMind researchers Puigcerver, Riquelme, Mustafa, and Houlsby (hat tip to user elmstedt on the OpenAI forum), mixture-of-experts architectures may add an unavoidable level of non-determinism:

Under capacity constraints, all Sparse MoE approaches route tokens in groups of a fixed size and enforce (or encourage) balance within the group. When groups contain tokens from different sequences or inputs, these tokens compete for available spots in expert buffers. Therefore, the model is no longer deterministic at the sequence-level, but only at the batch-level.

In other words, when your prompt (a sequence of tokens, in the quote above) reaches OpenAI’s servers, it gets batched with a group of other prompts (OpenAI isn’t open about how many other prompts). Each prompt in the batch is then routed to an “expert” within the model. However, since only so many prompts can be routed to the same expert, the expert your prompt gets routed to will depend on all the other prompts in the batch.

This “competition” for experts introduces a real-world randomness completely beyond our control.
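
To see why batching matters, here is a toy illustration (not OpenAI’s actual routing code) of capacity-constrained routing: each expert can only accept a fixed number of tokens per batch, so which expert a token lands on depends on what else happens to be in the batch.

import numpy as np

def route(batch_scores, capacity):
    """Greedily assign each token to its highest-affinity expert that still has room."""
    n_experts = len(batch_scores[0])
    load = [0] * n_experts
    assignment = []
    for scores in batch_scores:
        for expert in np.argsort(scores)[::-1]:  # best-matching expert first
            if load[expert] < capacity:
                load[expert] += 1
                assignment.append(int(expert))
                break
    return assignment

my_token = [0.9, 0.1]  # strongly prefers expert 0
print(route([[0.8, 0.2], my_token], capacity=2))              # -> [0, 0]
print(route([[0.8, 0.2], [0.7, 0.3], my_token], capacity=2))  # -> [0, 0, 1]: pushed to expert 1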

Non-determinism beyond mixture-of-experts

While non-determinism may be inherent to real-world mixture-of-experts models, that does not seem to be the only source of non-determinism in OpenAI’s models.

Making a few changes to our code above (switching to gpt-3.5-turbo-0125, looking for the token He since GPT-3.5’s tokenizer splits “Heads” differently, and ignoring system_fingerprint because this model doesn’t have it) reveals that GPT-3.5-turbo also exhibits non-deterministic logprobs:

|     Logprob | Probability    |
|-------------|----------------|
| -0.00278289 | 99.7220983436% |
| -0.00415331 | 99.5855302068% |
| -0.00258838 | 99.7414961980% |
| -0.00204034 | 99.7961735289% |
| -0.00240277 | 99.7600117933% |
| -0.00204034 | 99.7961735289% |
| -0.00204034 | 99.7961735289% |
| -0.00258838 | 99.7414961980% |
| -0.00351419 | 99.6491976144% |
| -0.00201214 | 99.7989878007% |

No one is claiming that GPT-3.5-turbo uses a mixture-of-experts architecture. Thus, there must be additional factors beyond mixture-of-experts contributing to this non-determinism.

What 10,000 GPT-4o coin flip probabilities tell us

To better understand the patterns and magnitude of this non-determinism, I conducted a more extensive experiment with GPT-4o, performing 10,000 “coin flips” while recording the probability assigned to “Heads” in each case.
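Mechanically, this is just the earlier loop scaled up. A minimal sketch, assuming the same client, prompt, and imports as before (the CSV file name is my own choice, not necessarily how the results were stored originally):

import csv

results = []  # (system_fingerprint, probability of 'Heads' as a percentage)

for _ in tqdm(range(10_000), desc='Flipping coins'):
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        temperature=0.0,
        seed=42,
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
        messages=[{'role': 'user', 'content': prompt}],
    )
    top = response.choices[0].logprobs.content[0].top_logprobs
    heads_logprob = next(e.logprob for e in top if e.token == 'Heads')
    results.append((response.system_fingerprint, math.exp(heads_logprob) * 100))

with open('coin_flips.csv', 'w', newline='') as f:
    csv.writer(f).writerows(results)  # save for later analysis

print(len({p for _, p in results}), 'distinct probabilities')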

The results reveal something fascinating. Across 10,000 API calls with identical parameters, GPT-4o produced not just a few different probability values, but 42 distinct probabilities. If the mixture-of-experts hypothesis were the complete explanation for non-determinism in GPT-4o, we might expect to see one distinct probability for each expert. But GPT-4o is believed to have either 8 or 16 experts, not 42.

In the output below, I clustered these probabilities so that each cluster is separated from its neighbors by at least 0.01 percentage points. This groups the output into 12 clusters.

Probability          Count           Fingerprints
------------------------------------------------------------------
85.1854379113%       5               fp_eb9dce56a8, fp_f9f4fb6dbf
85.1854455275%       74              fp_eb9dce56a8, fp_f9f4fb6dbf
85.1854886858%       180             fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
88.0662448207%       31              fp_eb9dce56a8, fp_f9f4fb6dbf
88.0678628883%       2               fp_f9f4fb6dbf
------------------------------------------------------------------
92.3997629747%       1               fp_eb9dce56a8
92.3997733012%       4               fp_eb9dce56a8
92.3997836277%       3               fp_eb9dce56a8
------------------------------------------------------------------
92.4128943690%       1               fp_f9f4fb6dbf
92.4129143363%       21              fp_eb9dce56a8, fp_f9f4fb6dbf
92.4129246643%       8               fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
93.9906837191%       4               fp_eb9dce56a8
------------------------------------------------------------------
95.2569999350%       36              fp_eb9dce56a8
------------------------------------------------------------------
96.2660836887%       3391            fp_eb9dce56a8, fp_f9f4fb6dbf
96.2661285161%       2636            fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
97.0674551052%       1               fp_eb9dce56a8
97.0674778863%       3               fp_eb9dce56a8
97.0675003058%       4               fp_eb9dce56a8
97.0675116963%       1               fp_eb9dce56a8
97.0680739932%       19              fp_eb9dce56a8, fp_f9f4fb6dbf
97.0681293191%       6               fp_eb9dce56a8, fp_f9f4fb6dbf
97.0681521003%       74              fp_eb9dce56a8, fp_f9f4fb6dbf
97.0682421405%       4               fp_eb9dce56a8
------------------------------------------------------------------
97.7008960695%       1               fp_f9f4fb6dbf
97.7011122645%       3               fp_eb9dce56a8
97.7011462953%       3               fp_eb9dce56a8
97.7018178132%       1               fp_eb9dce56a8
------------------------------------------------------------------
98.2006069902%       426             fp_eb9dce56a8, fp_f9f4fb6dbf
98.2006876548%       6               fp_f9f4fb6dbf
98.2007107019%       1               fp_eb9dce56a8
98.2009525133%       5               fp_eb9dce56a8
98.2009751945%       1               fp_eb9dce56a8
98.2009867181%       1               fp_eb9dce56a8
------------------------------------------------------------------
98.5930987656%       3               fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931104270%       235             fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931222721%       4               fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931340253%       9               fp_eb9dce56a8
98.5931571644%       159             fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931805790%       384             fp_eb9dce56a8
------------------------------------------------------------------
98.9008436920%       95              fp_eb9dce56a8, fp_f9f4fb6dbf
98.9008550214%       362             fp_eb9dce56a8, fp_f9f4fb6dbf
98.9008786933%       1792            fp_eb9dce56a8, fp_f9f4fb6dbf

(With a threshold of 0.001 there are 13 clusters, and with a threshold of 0.0001 there are 17 clusters.)
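If you want to reproduce this grouping, a single pass over the sorted probabilities is enough. Here is a minimal sketch of one way to implement the "separated by" rule described above (the function name is mine, and probs is assumed to be the list of 10,000 recorded percentages):

def cluster_by_gap(values, threshold=0.01):
    """Group sorted values into clusters, starting a new cluster whenever
    the gap to the previous value is at least `threshold` percentage points."""
    ordered = sorted(set(values))
    clusters = [[ordered[0]]]
    for value in ordered[1:]:
        if value - clusters[-1][-1] >= threshold:
            clusters.append([value])   # gap is large enough: start a new cluster
        else:
            clusters[-1].append(value)
    return clusters

# len(cluster_by_gap(probs, 0.01))    -> 12 clusters
# len(cluster_by_gap(probs, 0.001))   -> 13 clusters
# len(cluster_by_gap(probs, 0.0001))  -> 17 clusters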

As the output above shows, this multitude of results cannot be explained by system_fingerprint values. Across all 10,000 calls, I received only two different system fingerprints: 4488 results with fp_f9f4fb6dbf and 5512 with fp_eb9dce56a8. For the most part, the two fingerprints returned the same sets of probabilities, rather than each fingerprint producing its own distinct set.

It could be that these 12 clusters of probabilities represent 12 different experts. Even granting that, the variation within each cluster remains puzzling. These differences don’t look like simple rounding errors, because they are too systematic and consistent. Take the giant cluster at around 96.266%, where just two distinct probabilities account for over half of our coin flips. The difference between them, 0.0000448274%, is tiny but persistent.

Conclusion: Non-determinism is baked in

There is an underlying randomness in the log probabilities returned by all currently available non-thinking OpenAI models: GPT-4o, GPT-4o-mini, and the two flavors of GPT-3.5-turbo. Because this non-determinism is baked into the log probabilities, there’s no way for a user to get around it. Temperature and seed values have no effect, and system fingerprints don’t explain it.

While mixture-of-experts architectures inherently introduce some randomness in the competition for experts, the non-determinism in GPT-4o seems to go far beyond this, and the non-determinism in GPT-3.5-turbo can’t be explained by this at all, because GPT-3.5-turbo isn’t a mixture-of-experts model.

While we can no longer verify this claim because the model is no longer being served, this behavior reportedly wasn’t seen with GPT-3, according to user _j on the OpenAI forum:

It is a symptom that was not seen on prior GPT-3 AI models where across hundreds of trials to investigate sampling, you never had to doubt that logprobs would be the same. Even if you found a top-2 answer that returned exactly the same logprob value via the API, you would never see them switch position or return different values.

This suggests that whatever is causing this randomness first emerged in either GPT-3.5 or GPT-3.5-turbo.

But regardless of when it emerged, this non-determinism is a serious obstacle to understanding these models. If you want to study a model—how it generalizes, how it biases responses, how it assigns probabilities to different tokens—you need consistency. But as we’ve seen, even when we lock down every knob OpenAI lets us touch, we still can’t get an answer to the simplest possible question: “What is the probability that GPT-4o says a coin lands heads?”

Worse, while mixture-of-experts explains some of this non-determinism, there are clearly other, hidden sources of randomness that we can’t see, control, or understand. In an ideal world, the API would provide more transparency by telling us which expert processed our request or by offering additional parameters to control this routing process. Without such visibility, we’re left guessing at the true nature of the variability.

References

Bar-Hillel, M., Peer, E., & Acquisti, A. (2014). “Heads or tails?” – A reachability bias in binary choice. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(6), 1656–1663. https://doi.org/10.1037/xlm0000005.

Peeperkorn, M., Kouwenhoven, T., Brown, D., & Jordanous, A. (2024). Is temperature the creativity parameter of Large Language Models?. In The 15th International Conference on Computational Creativity (ICCC’24). arXiv:2405.00492.

Puigcerver, J., Riquelme, C., Mustafa, B., & Houlsby, N. (2024). From sparse to soft mixtures of experts. In The Twelfth International Conference on Learning Representations (ICLR 2024). https://openreview.net/forum?id=jxpsAj7ltE. arXiv:2308.00951.

Van Koevering, K., & Kleinberg, J. (2024). How random is random? Evaluating the randomness and humanness of LLMs’ coin flips. arXiv:2406.00092.
