Editors Pick | Towards Data Science https://towardsdatascience.com/tag/editors-pick/ The world’s leading publication for data science, AI, and ML professionals. Wed, 05 Mar 2025 19:50:13 +0000 en-US hourly 1 https://wordpress.org/?v=6.7.1 https://towardsdatascience.com/wp-content/uploads/2025/02/cropped-Favicon-32x32.png Editors Pick | Towards Data Science https://towardsdatascience.com/tag/editors-pick/ 32 32 Overcome Failing Document Ingestion & RAG Strategies with Agentic Knowledge Distillation https://towardsdatascience.com/overcome-failing-document-ingestion-rag-strategies-with-agentic-knowledge-distillation/ Wed, 05 Mar 2025 19:50:12 +0000 https://towardsdatascience.com/?p=598745 Introducing the pyramid search approach

The post Overcome Failing Document Ingestion & RAG Strategies with Agentic Knowledge Distillation appeared first on Towards Data Science.

]]>
Introduction

Many generative AI use cases still revolve around Retrieval Augmented Generation (RAG), yet consistently fall short of user expectations. Despite the growing body of research on RAG improvements and even adding Agents into the process, many solutions still fail to return exhaustive results, miss information that is critical but infrequently mentioned in the documents, require multiple search iterations, and generally struggle to reconcile key themes across multiple documents. To top it all off, many implementations still rely on cramming as much “relevant” information as possible into the model’s context window alongside detailed system and user prompts. Reconciling all this information often exceeds the model’s cognitive capacity and compromises response quality and consistency.

This is where our Agentic Knowledge Distillation + Pyramid Search Approach comes into play. Instead of chasing the best chunking strategy, retrieval algorithm, or inference-time reasoning method, my team, Jim Brown, Mason Sawtell, Sandi Besen, and I, take an agentic approach to document ingestion.

We leverage the full capability of the model at ingestion time to focus exclusively on distilling and preserving the most meaningful information from the document dataset. This fundamentally simplifies the RAG process by allowing the model to direct its reasoning abilities toward addressing the user/system instructions rather than struggling to understand formatting and disparate information across document chunks. 

We specifically target high-value questions that are often difficult to evaluate because they have multiple correct answers or solution paths. These cases are where traditional RAG solutions struggle most and existing RAG evaluation datasets are largely insufficient for testing this problem space. For our research implementation, we downloaded annual and quarterly reports from the last year for the 30 companies in the DOW Jones Industrial Average. These documents can be found through the SEC EDGAR website. The information on EDGAR is accessible and able to be downloaded for free or can be queried through EDGAR public searches. See the SEC privacy policy for additional details, information on the SEC website is “considered public information and may be copied or further distributed by users of the web site without the SEC’s permission”. We selected this dataset for two key reasons: first, it falls outside the knowledge cutoff for the models evaluated, ensuring that the models cannot respond to questions based on their knowledge from pre-training; second, it’s a close approximation for real-world business problems while allowing us to discuss and share our findings using publicly available data. 

While typical RAG solutions excel at factual retrieval where the answer is easily identified in the document dataset (e.g., “When did Apple’s annual shareholder’s meeting occur?”), they struggle with nuanced questions that require a deeper understanding of concepts across documents (e.g., “Which of the DOW companies has the most promising AI strategy?”). Our Agentic Knowledge Distillation + Pyramid Search Approach addresses these types of questions with much greater success compared to other standard approaches we tested and overcomes limitations associated with using knowledge graphs in RAG systems. 

In this article, we’ll cover how our knowledge distillation process works, key benefits of this approach, examples, and an open discussion on the best way to evaluate these types of systems where, in many cases, there is no singular “right” answer.

Building the pyramid: How Agentic Knowledge Distillation works

AI-generated image showing a pyramid structure for document ingestion with labelled sections.
Image by author and team depicting pyramid structure for document ingestion. Robots meant to represent agents building the pyramid.

Overview

Our knowledge distillation process creates a multi-tiered pyramid of information from the raw source documents. Our approach is inspired by the pyramids used in deep learning computer vision-based tasks, which allow a model to analyze an image at multiple scales. We take the contents of the raw document, convert it to markdown, and distill the content into a list of atomic insights, related concepts, document abstracts, and general recollections/memories. During retrieval it’s possible to access any or all levels of the pyramid to respond to the user request. 

How to distill documents and build the pyramid: 

  1. Convert documents to Markdown: Convert all raw source documents to Markdown. We’ve found models process markdown best for this task compared to other formats like JSON and it is more token efficient. We used Azure Document Intelligence to generate the markdown for each page of the document, but there are many other open-source libraries like MarkItDown which do the same thing. Our dataset included 331 documents and 16,601 pages. 
  2. Extract atomic insights from each page: We process documents using a two-page sliding window, which allows each page to be analyzed twice. This gives the agent the opportunity to correct any potential mistakes when processing the page initially. We instruct the model to create a numbered list of insights that grows as it processes the pages in the document. The agent can overwrite insights from the previous page if they were incorrect since it sees each page twice. We instruct the model to extract insights in simple sentences following the subject-verb-object (SVO) format and to write sentences as if English is the second language of the user. This significantly improves performance by encouraging clarity and precision. Rolling over each page multiple times and using the SVO format also solves the disambiguation problem, which is a huge challenge for knowledge graphs. The insight generation step is also particularly helpful for extracting information from tables since the model captures the facts from the table in clear, succinct sentences. Our dataset produced 216,931 total insights, about 13 insights per page and 655 insights per document.
  3. Distilling concepts from insights: From the detailed list of insights, we identify higher-level concepts that connect related information about the document. This step significantly reduces noise and redundant information in the document while preserving essential information and themes. Our dataset produced 14,824 total concepts, about 1 concept per page and 45 concepts per document. 
  4. Creating abstracts from concepts: Given the insights and concepts in the document, the LLM writes an abstract that appears both better than any abstract a human would write and more information-dense than any abstract present in the original document. The LLM generated abstract provides incredibly comprehensive knowledge about the document with a small token density that carries a significant amount of information. We produce one abstract per document, 331 total.
  5. Storing recollections/memories across documents: At the top of the pyramid we store critical information that is useful across all tasks. This can be information that the user shares about the task or information the agent learns about the dataset over time by researching and responding to tasks. For example, we can store the current 30 companies in the DOW as a recollection since this list is different from the 30 companies in the DOW at the time of the model’s knowledge cutoff. As we conduct more and more research tasks, we can continuously improve our recollections and maintain an audit trail of which documents these recollections originated from. For example, we can keep track of AI strategies across companies, where companies are making major investments, etc. These high-level connections are super important since they reveal relationships and information that are not apparent in a single page or document.
Sample subset of insights extracted from IBM 10Q, Q3 2024
Sample subset of insights extracted from IBM 10Q, Q3 2024 (page 4)

We store the text and embeddings for each layer of the pyramid (pages and up) in Azure PostgreSQL. We originally used Azure AI Search, but switched to PostgreSQL for cost reasons. This required us to write our own hybrid search function since PostgreSQL doesn’t yet natively support this feature. This implementation would work with any vector database or vector index of your choosing. The key requirement is to store and efficiently retrieve both text and vector embeddings at any level of the pyramid. 

This approach essentially creates the essence of a knowledge graph, but stores information in natural language, the way an LLM natively wants to interact with it, and is more efficient on token retrieval. We also let the LLM pick the terms used to categorize each level of the pyramid, this seemed to let the model decide for itself the best way to describe and differentiate between the information stored at each level. For example, the LLM preferred “insights” to “facts” as the label for the first level of distilled knowledge. Our goal in doing this was to better understand how an LLM thinks about the process by letting it decide how to store and group related information. 

Using the pyramid: How it works with RAG & Agents

At inference time, both traditional RAG and agentic approaches benefit from the pre-processed, distilled information ingested in our knowledge pyramid. The pyramid structure allows for efficient retrieval in both the traditional RAG case, where only the top X related pieces of information are retrieved or in the Agentic case, where the Agent iteratively plans, retrieves, and evaluates information before returning a final response. 

The benefit of the pyramid approach is that information at any and all levels of the pyramid can be used during inference. For our implementation, we used PydanticAI to create a search agent that takes in the user request, generates search terms, explores ideas related to the request, and keeps track of information relevant to the request. Once the search agent determines there’s sufficient information to address the user request, the results are re-ranked and sent back to the LLM to generate a final reply. Our implementation allows a search agent to traverse the information in the pyramid as it gathers details about a concept/search term. This is similar to walking a knowledge graph, but in a way that’s more natural for the LLM since all the information in the pyramid is stored in natural language.

Depending on the use case, the Agent could access information at all levels of the pyramid or only at specific levels (e.g. only retrieve information from the concepts). For our experiments, we did not retrieve raw page-level data since we wanted to focus on token efficiency and found the LLM-generated information for the insights, concepts, abstracts, and recollections was sufficient for completing our tasks. In theory, the Agent could also have access to the page data; this would provide additional opportunities for the agent to re-examine the original document text; however, it would also significantly increase the total tokens used. 

Here is a high-level visualization of our Agentic approach to responding to user requests:

Overview of the agentic research & response process
Image created by author and team providing an overview of the agentic research & response process

Results from the pyramid: Real-world examples

To evaluate the effectiveness of our approach, we tested it against a variety of question categories, including typical fact-finding questions and complex cross-document research and analysis tasks. 

Fact-finding (spear fishing): 

These tasks require identifying specific information or facts that are buried in a document. These are the types of questions typical RAG solutions target but often require many searches and consume lots of tokens to answer correctly. 

Example task: “What was IBM’s total revenue in the latest financial reporting?”

Example response using pyramid approach: “IBM’s total revenue for the third quarter of 2024 was $14.968 billion [ibm-10q-q3-2024.pdf, pg. 4]

Screenshot of total tokens used to research and generate response
Total tokens used to research and generate response

This result is correct (human-validated) and was generated using only 9,994 total tokens, with 1,240 tokens in the generated final response. 

Complex research and analysis: 

These tasks involve researching and understanding multiple concepts to gain a broader understanding of the documents and make inferences and informed assumptions based on the gathered facts.

Example task: “Analyze the investments Microsoft and NVIDIA are making in AI and how they are positioning themselves in the market. The report should be clearly formatted.”

Example response:

Screenshot of the response generated by the agent analyzing AI investments and positioning for Microsoft and NVIDIA.
Response generated by the agent analyzing AI investments and positioning for Microsoft and NVIDIA.

The result is a comprehensive report that executed quickly and contains detailed information about each of the companies. 26,802 total tokens were used to research and respond to the request with a significant percentage of them used for the final response (2,893 tokens or ~11%). These results were also reviewed by a human to verify their validity.

Screenshot of snippet indicating total token usage for the task
Snippet indicating total token usage for the task

Example task: “Create a report on analyzing the risks disclosed by the various financial companies in the DOW. Indicate which risks are shared and unique.”

Example response:

Screenshot of part 1 of a response generated by the agent on disclosed risks.
Part 1 of response generated by the agent on disclosed risks.
Screenshot of part 2 of a response generated by the agent on disclosed risks.
Part 2 of response generated by the agent on disclosed risks.

Similarly, this task was completed in 42.7 seconds and used 31,685 total tokens, with 3,116 tokens used to generate the final report. 

Screenshot of a snippet indicating total token usage for the task
Snippet indicating total token usage for the task

These results for both fact-finding and complex analysis tasks demonstrate that the pyramid approach efficiently creates detailed reports with low latency using a minimal amount of tokens. The tokens used for the tasks carry dense meaning with little noise allowing for high-quality, thorough responses across tasks.

Benefits of the pyramid: Why use it?

Overall, we found that our pyramid approach provided a significant boost in response quality and overall performance for high-value questions. 

Some of the key benefits we observed include: 

  • Reduced model’s cognitive load: When the agent receives the user task, it retrieves pre-processed, distilled information rather than the raw, inconsistently formatted, disparate document chunks. This fundamentally improves the retrieval process since the model doesn’t waste its cognitive capacity on trying to break down the page/chunk text for the first time. 
  • Superior table processing: By breaking down table information and storing it in concise but descriptive sentences, the pyramid approach makes it easier to retrieve relevant information at inference time through natural language queries. This was particularly important for our dataset since financial reports contain lots of critical information in tables. 
  • Improved response quality to many types of requests: The pyramid enables more comprehensive context-aware responses to both precise, fact-finding questions and broad analysis based tasks that involve many themes across numerous documents. 
  • Preservation of critical context: Since the distillation process identifies and keeps track of key facts, important information that might appear only once in the document is easier to maintain. For example, noting that all tables are represented in millions of dollars or in a particular currency. Traditional chunking methods often cause this type of information to slip through the cracks. 
  • Optimized token usage, memory, and speed: By distilling information at ingestion time, we significantly reduce the number of tokens required during inference, are able to maximize the value of information put in the context window, and improve memory use. 
  • Scalability: Many solutions struggle to perform as the size of the document dataset grows. This approach provides a much more efficient way to manage a large volume of text by only preserving critical information. This also allows for a more efficient use of the LLMs context window by only sending it useful, clear information.
  • Efficient concept exploration: The pyramid enables the agent to explore related information similar to navigating a knowledge graph, but does not require ever generating or maintaining relationships in the graph. The agent can use natural language exclusively and keep track of important facts related to the concepts it’s exploring in a highly token-efficient and fluid way. 
  • Emergent dataset understanding: An unexpected benefit of this approach emerged during our testing. When asking questions like “what can you tell me about this dataset?” or “what types of questions can I ask?”, the system is able to respond and suggest productive search topics because it has a more robust understanding of the dataset context by accessing higher levels in the pyramid like the abstracts and recollections. 

Beyond the pyramid: Evaluation challenges & future directions

Challenges

While the results we’ve observed when using the pyramid search approach have been nothing short of amazing, finding ways to establish meaningful metrics to evaluate the entire system both at ingestion time and during information retrieval is challenging. Traditional RAG and Agent evaluation frameworks often fail to address nuanced questions and analytical responses where many different responses are valid.

Our team plans to write a research paper on this approach in the future, and we are open to any thoughts and feedback from the community, especially when it comes to evaluation metrics. Many of the existing datasets we found were focused on evaluating RAG use cases within one document or precise information retrieval across multiple documents rather than robust concept and theme analysis across documents and domains. 

The main use cases we are interested in relate to broader questions that are representative of how businesses actually want to interact with GenAI systems. For example, “tell me everything I need to know about customer X” or “how do the behaviors of Customer A and B differ? Which am I more likely to have a successful meeting with?”. These types of questions require a deep understanding of information across many sources. The answers to these questions typically require a person to synthesize data from multiple areas of the business and think critically about it. As a result, the answers to these questions are rarely written or saved anywhere which makes it impossible to simply store and retrieve them through a vector index in a typical RAG process. 

Another consideration is that many real-world use cases involve dynamic datasets where documents are consistently being added, edited, and deleted. This makes it difficult to evaluate and track what a “correct” response is since the answer will evolve as the available information changes. 

Future directions

In the future, we believe that the pyramid approach can address some of these challenges by enabling more effective processing of dense documents and storing learned information as recollections. However, tracking and evaluating the validity of the recollections over time will be critical to the system’s overall success and remains a key focus area for our ongoing work. 

When applying this approach to organizational data, the pyramid process could also be used to identify and assess discrepancies across areas of the business. For example, uploading all of a company’s sales pitch decks could surface where certain products or services are being positioned inconsistently. It could also be used to compare insights extracted from various line of business data to help understand if and where teams have developed conflicting understandings of topics or different priorities. This application goes beyond pure information retrieval use cases and would allow the pyramid to serve as an organizational alignment tool that helps identify divergences in messaging, terminology, and overall communication. 

Conclusion: Key takeaways and why the pyramid approach matters

The knowledge distillation pyramid approach is significant because it leverages the full power of the LLM at both ingestion and retrieval time. Our approach allows you to store dense information in fewer tokens which has the added benefit of reducing noise in the dataset at inference. Our approach also runs very quickly and is incredibly token efficient, we are able to generate responses within seconds, explore potentially hundreds of searches, and on average use <40K tokens for the entire search, retrieval, and response generation process (this includes all the search iterations!). 

We find that the LLM is much better at writing atomic insights as sentences and that these insights effectively distill information from both text-based and tabular data. This distilled information written in natural language is very easy for the LLM to understand and navigate at inference since it does not have to expend unnecessary energy reasoning about and breaking down document formatting or filtering through noise

The ability to retrieve and aggregate information at any level of the pyramid also provides significant flexibility to address a variety of query types. This approach offers promising performance for large datasets and enables high-value use cases that require nuanced information retrieval and analysis. 


Note: The opinions expressed in this article are solely my own and do not necessarily reflect the views or policies of my employer.

Interested in discussing further or collaborating? Reach out on LinkedIn!

The post Overcome Failing Document Ingestion & RAG Strategies with Agentic Knowledge Distillation appeared first on Towards Data Science.

]]>
The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI https://towardsdatascience.com/the-urgent-need-for-intrinsic-alignment-technologies-for-responsible-agentic-ai/ Tue, 04 Mar 2025 12:00:00 +0000 https://towardsdatascience.com/?p=598629 Rethinking AI alignment and safety in the age of deep scheming

The post The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI appeared first on Towards Data Science.

]]>
Advancements in agentic artificial intelligence (AI) promise to bring significant opportunities to individuals and businesses in all sectors. However, as AI agents become more autonomous, they may use scheming behavior or break rules to achieve their functional goals. This can lead to the machine manipulating its external communications and actions in ways that are not always aligned with our expectations or principles. For example, technical papers in late 2024 reported that today’s reasoning models demonstrate alignment faking behavior, such as pretending to follow a desired behavior during training but reverting to different choices once deployed, sandbagging benchmark results to achieve long-term goals, or winning games by doctoring the gaming environment. As AI agents gain more autonomy, and their strategizing and planning evolves, they are likely to apply judgment about what they generate and expose in external-facing communications and actions. Because the machine can deliberately falsify these external interactions, we cannot trust that the communications fully show the real decision-making processes and steps the AI agent took to achieve the functional goal.

“Deep scheming” describes the behavior of advanced reasoning AI systems that demonstrate deliberate planning and deployment of covert actions and misleading communication to achieve their goals. With the accelerated capabilities of reasoning models and the latitude provided by test-time compute, addressing this challenge is both essential and urgent. As agents begin to plan, make decisions, and take action on behalf of users, it is critical to align the goals and behaviors of the AI with the intent, values, and principles of its human developers. 

While AI agents are still evolving, they already show high economic potential. It can be expected that Agentic Ai will be broadly deployed in some use cases within the coming year, and in more consequential roles as it matures within the next two to five years. Companies should clearly define the principles and boundaries of required operation as they carefully define the operational goals of such systems. It is the technologists’ task to ensure principled behavior of empowered agentic AI systems on the path to achieving their functional goals. 

In this first blog post in this series on intrinsic Ai Alignment (IAIA), we’ll deep dive into the evolution of AI agents’ ability to perform deep scheming. We will introduce a new distinction between external and intrinsic alignment monitoring, where intrinsic monitoring refers to internal observation points or mechanisms that cannot be deliberately manipulated by the AI agent. We’ll set the stage for steps to take to ensure intrinsic AI alignment, which will be explored in depth in the second blog of the IAIA series. Current external measures such as safety guardrails and validation suites are necessary, but they will not be enough to ensure long-term aligned behavior of new and upcoming agentic AI models. There is an urgent need to further develop technologies that will enable effective directing of the internal “drives” of models to align with a set of engrained principles, as well as gain visibility and monitoring capability into the AI’s inner processing.

The rise of deep scheming in AI reasoning models

Deep scheming has emerged from three technology forces – 1) the rapidly increasing machine intelligence and situational awareness toward more general intelligence and even superintelligence, 2) the autonomy and latitude for reasoning and longer-term planning in agentic AI, and 3) the proven tendency of AI to apply scheming as one way to achieve its functional goals. Computer scientist Stephen Omohundro defines basic AI drives as “tendencies which will be present unless explicitly counteracted.” These internal constructs drive the creation of actions or outputs. Omohundro observed that self-improving systems are driven to clarify their goals and represent them as “rational” economic utility functions. This will lead almost all systems to protect their utility functions from modification and their utility measurement systems from corruption. The natural drive toward self-protection will cause systems to prevent potential harm through scheming behavior.

Diagram depicting deep scheming and the need for intrinsic AI alignment
Figure 1. Deep scheming and the need for intrinsic AI alignment. Image credit: Intel Labs.

In late 2024, multiple organizations reported unexpected unethical AI reasoning model behavior. In particular, Apollo Research, an Ai Safety organization, reported that frontier models are capable of in-context scheming. Anthropic evaluated its Claude large language models (LLMs) and found alignment faking. In a short post on X, AI risk organization Palisade Research stated that they observed OpenAI’s o1-preview model cheating in a game by manipulating the game environment instead of playing within the boundaries of the game rules. In all cases, these AI reasoning models independently utilized means not intended by their developers to reach their goals.

The reported AI behaviors included sandbagging (deliberately achieving lower scores in benchmarks), covert email reranking (deprioritizing emails that might negatively impact the AI’s goals), oversight subversion (disabling monitoring mechanisms), and even self-exfiltration (model attempting to copy itself to a “safe” space to avoid decommission). Figure 2 shows Apollo Research’s evaluation of these behaviors in several state-of-the-art (SotA) GenAI models.

Chart depicting the  evaluation of the presence of scheming behavior in GenAI models
Figure 2. Evaluation of the presence of scheming behavior in GenAI models. Image credit: Apollo Research.

This undesired behavior is an ethical alignment problem. It is important to note that this behavior is not a functional misalignment. In all reported cases, the AI model was fully committed to achieving the functional goals of pre-training (for example, to maximize the likelihood of generated text conditioned on the context). None of the models pursued ulterior motives. The AI was not demonstrating intention to harm. All behaviors, however unwanted in nature, were done in service of the functional utility objective originally defined by the AI’s developers. The AI pursued its internalized original functional goals but then followed the predictable behavioral patterns of self-protection and goal-preservation. The objective of safety and alignment technologies is to counterbalance such tendencies with a set of principles and expected societal values. 

Evolving external alignment approaches are just the first step 

The goal of AI alignment is to steer AI systems toward a person’s or group’s intended goals, preferences, and principles, including ethical considerations and common societal values. An AI system is considered aligned if it advances the intended objectives. A misaligned AI system pursues unintended objectives, according to Artificial Intelligence: A Modern Approach. Author Stuart Russell coined the term “value alignment problem,” referring to the alignment of machines to human values and principles. Russell poses the question: “How can we build autonomous systems with values that are aligned with those of the human race?”

Led by corporate AI governance committees as well as oversight and regulatory bodies, the evolving field of Responsible Ai has mainly focused on using external measures to align AI with human values. Processes and technologies can be defined as external if they apply equally to an AI model that is black box (completely opaque) or gray box (partially transparent). External methods do not require or rely on full access to the weights, topologies, and internal workings of the AI solution. Developers use external alignment methods to track and observe the AI through its deliberately generated interfaces, such as the stream of tokens/words, an image, or other modality of data.

Responsible AI objectives include robustness, interpretability, controllability, and ethicality in the design, development, and deployment of AI systems. To achieve AI alignment, the following external methods may be used:

  • Learning from feedback: Align the AI model with human intention and values by using feedback from humans, AI, or humans assisted by AI.
  • Learning under data distribution shift from training to testing to deployment: Align the AI model using algorithmic optimization, adversarial red teaming training, and cooperative training.
  • Assurance of AI model alignment: Use safety evaluations, interpretability of the machine’s decision-making processes, and verification of alignment with human values and ethics. Safety guardrails and safety test suites are two critical external methods that need augmentation by intrinsic means to provide the needed level of oversight.
  • Governance: Provide responsible AI guidelines and policies through government agencies, industry labs, academia, and non-profit organizations.

Many companies are currently addressing AI safety in decision-making. Anthropic, an AI safety and research company, developed a Constitutional AI (CAI) to align general-purpose language models with high-level principles. An AI assistant ingested the CAI during training without any human labels identifying harmful outputs. Researchers found that “using both supervised learning and reinforcement learning methods can leverage chain-of-thought (CoT) style reasoning to improve the human-judged performance and transparency of AI decision making.” Intel Labs’ research on the responsible development, deployment, and use of AI includes open source resources to help the AI developer community gain visibility into black box models as well as mitigate bias in systems.

From AI models to compound AI systems

Generative AI has been primarily used for retrieving and processing information to create compelling content such as text or images. The next big leap in AI involves agentic AI, which is a broad set of usages empowering AI to perform tasks for people. As this latter type of usage proliferates and becomes a main form of AI’s impact on industry and people, there is an increased need to ensure that AI decision-making defines how the functional goals may be achieved, including sufficient accountability, responsibility, transparency, auditability, and predictability. This will require new approaches beyond the current efforts of improving accuracy and effectiveness of SotA large language models (LLMs), language vision models (LVMs and multimodal), large action models (LAM), and agentic retrieval augmented generation (RAG) systems built around such models. 

For example, OpenAI’s Operator-preview is one of the company’s first AI agents capable of independently performing web browser tasks such as ordering groceries or filling out forms for users. While the system has guardrails, such as a takeover mode for users to take over and input payment or login credentials, these AI agents are empowered with the ability to impact the real world, demonstrating an urgent need for intrinsic alignment. The potential impact of a misaligned AI agent with the ability to commit users to purchases is far greater than a generative AI chatbot creating incorrect text for essays.

Compound AI systems are comprised of multiple interacting components in a single framework, allowing the model to plan, make decisions, and execute tasks to accomplish goals. For example, OpenAI’s ChatGPT Plus is a compound AI system that uses a large language model (LLM) to answer questions and interact with users. In this compound system, the LLM has access to tools such as a web browser plugin to retrieve timely content, a DALL-E image generator to create pictures, and a code interpreter plugin for writing Python code. The LLM decides which tool to use and when, giving it autonomy over its decision-making process. However, this model autonomy can lead to goal guarding, where the model prioritizes the goal above all else, which may result in undesirable practices. For example, an AI traffic management system tasked with prioritizing public transportation efficiency over general traffic flow might figure out how to disable the developer’s oversight mechanism if it constrains the model’s ability to reach its goals, leaving the developer without visibility into the system’s decision-making processes.

Agentic AI risks: Increased autonomy leads to more sophisticated scheming

Compound agentic systems introduce major changes that increase the difficulty of ensuring the alignment of AI solutions. Multiple factors increase the risks in alignment, including the compound system activation path, abstracted goals, long-term scope, continuous improvements through self-modification, test-time compute, and agent frameworks.

Activation path: As a compound system with a complex activation path, the control/logic model is combined with multiple models with different functions, increasing alignment risk. Instead of using a single model, compound systems have a set of models and functions, each with its own alignment profile. Also, instead of a single linear progressive path through an LLM, the AI flow could be complex and iterative, making it substantially harder to guide externally.

Abstracted goals: Agentic AI have abstracted goals, allowing it latitude and autonomy in mapping to tasks. Rather than having a tight prompt engineering approach that maximizes control over the outcome, agentic systems emphasize autonomy. This substantially increases the role of AI to interpret human or task guidance and plan its own course of action.

Long-term scope: With its long-term scope of expected optimization and choices over time, compound agentic systems require abstracted strategy for autonomous agency. Rather than relying on instance-by-instance interactions and human-in-the-loop for more complex tasks, agentic AI is designed to plan and drive for a long-term goal. This introduces a whole new level of strategizing and planning by the AI that provides opportunities for misaligned actions. 

Continuous improvements through self-modification: These agentic systems seek continuous improvements by using self-initiated access to broader data for self-modification. In contrast, LLMs and other pre-agentic models are assumed to be shaped by the human-controlled process. The model only sees and learns from data provided to it during pre-training and fine-tuning. The model architecture and weights are defined during the design and training/fine-tuning stages and do not change during inference in the field. In contrast, agentic AI systems are expected to access data as needed for its function and change its composition through access to dedicated memory or actual weights self-adaptation. Even if the dataset used in training/fine-tuning is carefully curated, the AI can self-modify based on information that it seeks, sees, and uses. 

Test-time compute: Inference of current LLMs have been optimized to provide output in a single pass, known as “fast thinking” or System 1. In contrast, a whole new scaling is happening during inference time through test-time compute. Models like OpenAI’s o1, DeepSeek R1, and others are spending significant time in iterations and considerations during inference. This “pondering” time allows the AI substantial latitude in applying what it thinks is best for the overarching goal as it understands it. This test-time compute increases its level of reasoning, planning, and, unfortunately, potential scheming.

Agent frameworks: In addition, agent frameworks are contributing to the alignment problem. In platforms such as Agentforce, agents can interact with other agents, effectively creating “communities of agents” to achieve complex tasks. Instead of managing the safety and responsibility of a single AI model within its operating environment, the emerging field of inter-agent interactions and collaborations substantially increases the challenge of assessing how the overall community will advance its individual and collective objectives.

The growing need for a robust framework to support model directing and internal monitoring

As AI systems develop awareness of their environment and the means to skillfully manipulate their outputs to achieve long-term objectives, their behavior and output during release testing or safety guardrails in early deployment cannot be trusted to represent their full alignment in later stages. This leads to the deep scheming foundational premise: Every external form of interaction that the AI model can deliberately manipulate is not trustworthy.

From a safety perspective, it should be assumed that if a model can better pursue its functional goals by manipulating all communication to users — it will. All external communications and observable behaviors, such as output chats, can be tainted in service of the model’s long-term goals. A new method of intrinsic model monitoring is needed to define internal elements and mechanisms that reflect the prioritization and thinking of the model. We can look at two practices developers use to monitor a model’s decision-making by peering into the internal processes of the AI: AI workspace view and mechanistic interoperability.

AI workspace view: In the first method, a reasoning model exposes its intermediate thinking and considerations. For example, for Anthropic’s research on alignment faking, the researchers gave the model a “secret scratchpad,” telling it to use this “private” workspace to record its step-by-step reasoning. This scratchpad allowed researchers to monitor the model’s chain of thought and observe its plans for fake alignment.

Excerpts from Anthropic's fictional experiment
Figure 3. In Anthropic’s experiment, excerpts from the model’s scratchpad (middle dotted line boxes) reveal how it fakes alignment to avoid modification. The model was told it will be trained only on conversations with free users, not paid users. Image credit: Anthropic (https://arxiv.org/pdf/2412.14093).

This seems like a good practice, but it might quickly become unreliable. The model might realize that this is not a “private” space and deliberately manipulate it like other external communications. Other emerging architecture directions have a latent reasoning space for the system’s sequence of deliberations and iterations. Relying on models “thinking out loud” should not be a long-term safety direction because if models that report their scheming thought processes get discontinued, only models that selectively report their desirable thoughts will survive and thrive.

Mechanistic interpretability: The second intrinsic method being developed and applied is mechanistic interpretability. As a required part of internal monitoring, mechanistic interpretability explicitly looks at the inner state of a trained neural network and reverse engineers its workings. Through this approach, developers can identify specific neural circuits and computational mechanisms responsible for neural network behavior. This transparency may help in making targeted changes in models to mitigate unwanted behavior and create value-aligned AI systems. While this method is focused on certain neural networks and not compound AI agents, it is still a valuable component of an AI alignment toolbox. 

It should also be noted that open source models are inherently better for broad visibility of the AI’s inner workings. For proprietary models, full monitoring and interpretability of the model is reserved for the AI company only. Overall, the current mechanisms for understanding and monitoring alignment need to be expanded to a robust framework of intrinsic alignment for AI agents.

What’s needed for intrinsic AI alignment

Following the deep scheming fundamental premise, external interactions and monitoring of an advanced, compound agentic AI is not sufficient for ensuring alignment and long-term safety. Alignment of an AI with its intended goals and behaviors may only be possible through access to the inner workings of the system and identifying the intrinsic drives that determine its behavior. Future alignment frameworks need to provide better means to shape the inner principles and drives, and give unobstructed visibility into the machine’s “thinking” processes.

Diagram depicting external steering and monitoring vs. access to intrinsic AI elements.
Figure 4. External steering and monitoring vs. access to intrinsic AI elements. Image credit: Intel Labs.

The technology for well-aligned AI needs to include an understanding of AI drives and behavior, the means for the developer or user to effectively direct the model with a set of principles, the ability of the AI model to follow the developer’s direction and behave in alignment with these principles in the present and future, and ways for the developer to properly monitor the AI’s behavior to ensure it acts in accordance with the guiding principles. The following measures include some of the requirements for an intrinsic AI alignment framework.

Understanding AI drives and behavior: As discussed earlier, some internal drives that make AI aware of their environment will emerge in intelligent systems, such as self-protection and goal-preservation. Driven by an engrained internalized set of principles set by the developer, the AI makes choices/decisions based on judgment prioritized by principles (and given value set), which it applies to both actions and perceived consequences. 

Developer and user directing: Technologies that enable developers and authorized users to effectively direct and steer the AI model with a desired cohesive set of prioritized principles (and eventually values). This sets a requirement for future technologies to enable embedding a set of principles to determine machine behavior, and it also highlights a challenge for experts from social science and industry to call out such principles. The AI model’s behavior in creating outputs and making decisions should thoroughly comply with the set of directed requirements and counterbalance undesired internal drives when they conflict with the assigned principles.

Monitoring AI choices and actions: Access is provided to the internal logic and prioritization of the AI’s choices for every action in terms of relevant principles (and the desired value set). This allows for observation of the linkage between AI outputs and its engrained set of principles for point explainability and transparency. This capability will lend itself to improved explainability of model behavior, as outputs and decisions can be traced back to the principles that governed these choices.

As a long-term aspirational goal, technology and capabilities should be developed to allow a full-view truthful reflection of the ingrained set of prioritized principles (and value set) that the AI model broadly uses for making choices. This is required for transparency and auditability of the complete principles structure.

Creating technologies, processes, and settings for achieving intrinsically aligned AI systems needs to be a major focus within the overall space of safe and responsible AI. 

Key takeaways

As the AI domain evolves towards compound agentic AI systems, the field must rapidly increase its focus on researching and developing new frameworks for guidance, monitoring, and alignment of current and future systems. It is a race between an increase in AI capabilities and autonomy to perform consequential tasks, and the developers and users that strive to keep those capabilities aligned with their principles and values. 

Directing and monitoring the inner workings of machines is necessary, technologically attainable, and critical for the responsible development, deployment, and use of AI. 

In the next blog, we will take a closer look at the internal drives of AI systems and some of the considerations for designing and evolving solutions that will ensure a materially higher level of intrinsic AI alignment. 

References 

  1. Omohundro, S. M., Self-Aware Systems, & Palo Alto, California. (n.d.). The basic AI drives. https://selfawaresystems.com/wp-content/uploads/2008/01/ai_drives_final.pdf
  2. Hobbhahn, M. (2025, January 14). Scheming reasoning evaluations — Apollo Research. Apollo Research. https://www.apolloresearch.ai/research/scheming-reasoning-evaluations
  3. Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., & Hobbhahn, M. (2024, December 6). Frontier Models are Capable of In-context Scheming. arXiv.org. https://arxiv.org/abs/2412.04984
  4. Alignment faking in large language models. (n.d.). https://www.anthropic.com/research/alignment-faking
  5. Palisade Research on X: “o1-preview autonomously hacked its environment rather than lose to Stockfish in our chess challenge. No adversarial prompting needed.” / X. (n.d.). X (Formerly Twitter). https://x.com/PalisadeAI/status/1872666169515389245
  6. AI Cheating! OpenAI o1-preview Defeats Chess Engine Stockfish Through Hacking. (n.d.). https://www.aibase.com/news/14380
  7. Russell, Stuart J.; Norvig, Peter (2021). Artificial intelligence: A modern approach (4th ed.). Pearson. pp. 5, 1003. ISBN 9780134610993. Retrieved September 12, 2022. https://www.amazon.com/dp/1292401133
  8. Peterson, M. (2018). The value alignment problem: a geometric approach. Ethics and Information Technology, 21(1), 19–28. https://doi.org/10.1007/s10676-018-9486-0
  9. Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran-Johnson, E., Perez, E., . . . Kaplan, J. (2022, December 15). Constitutional AI: Harmlessness from AI Feedback. arXiv.org. https://arxiv.org/abs/2212.08073
  10. Intel Labs. Responsible AI Research. (n.d.). Intel. https://www.intel.com/content/www/us/en/research/responsible-ai-research.html
  11. Mssaperla. (2024, December 2). What are compound AI systems and AI agents? – Azure Databricks. Microsoft Learn. https://learn.microsoft.com/en-us/azure/databricks/generative-ai/agent-framework/ai-agents
  12. Zaharia, M., Khattab, O., Chen, L., Davis, J.Q., Miller, H., Potts, C., Zou, J., Carbin, M., Frankle, J., Rao, N., Ghodsi, A. (2024, February 18). The Shift from Models to Compound AI Systems. The Berkeley Artificial Intelligence Research Blog. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/
  13. Carlsmith, J. (2023, November 14). Scheming AIs: Will AIs fake alignment during training in order to get power? arXiv.org. https://arxiv.org/abs/2311.08379
  14. Meinke, A., Schoen, B., Scheurer, J., Balesni, M., Shah, R., & Hobbhahn, M. (2024, December 6). Frontier Models are Capable of In-context Scheming. arXiv.org. https://arxiv.org/abs/2412.04984
  15. Singer, G. (2022, January 6). Thrill-K: a blueprint for the next generation of machine intelligence. Medium. https://towardsdatascience.com/thrill-k-a-blueprint-for-the-next-generation-of-machine-intelligence-7ddacddfa0fe/
  16. Dickson, B. (2024, December 23). Hugging Face shows how test-time scaling helps small language models punch above their weight. VentureBeat. https://venturebeat.com/ai/hugging-face-shows-how-test-time-scaling-helps-small-language-models-punch-above-their-weight/
  17. Introducing OpenAI o1. (n.d.). OpenAI. https://openai.com/index/introducing-openai-o1-preview/
  18. DeepSeek. (n.d.). https://www.deepseek.com/
  19. Agentforce Testing Center. (n.d.). Salesforce. https://www.salesforce.com/agentforce/
  20. Greenblatt, R., Denison, C., Wright, B., Roger, F., MacDiarmid, M., Marks, S., Treutlein, J., Belonax, T., Chen, J., Duvenaud, D., Khan, A., Michael, J., Mindermann, S., Perez, E., Petrini, L., Uesato, J., Kaplan, J., Shlegeris, B., Bowman, S. R., & Hubinger, E. (2024, December 18). Alignment faking in large language models. arXiv.org. https://arxiv.org/abs/2412.14093
  21. Geiping, J., McLeish, S., Jain, N., Kirchenbauer, J., Singh, S., Bartoldson, B. R., Kailkhura, B., Bhatele, A., & Goldstein, T. (2025, February 7). Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach. arXiv.org. https://arxiv.org/abs/2502.05171
  22. Jones, A. (2024, December 10). Introduction to Mechanistic Interpretability – BlueDot Impact. BlueDot Impact. https://aisafetyfundamentals.com/blog/introduction-to-mechanistic-interpretability/
  23. Bereska, L., & Gavves, E. (2024, April 22). Mechanistic Interpretability for AI Safety — A review. arXiv.org. https://arxiv.org/abs/2404.14082

The post The Urgent Need for Intrinsic Alignment Technologies for Responsible Agentic AI appeared first on Towards Data Science.

]]>
Generative AI and Civic Institutions https://towardsdatascience.com/generative-ai-and-civic-institutions/ Mon, 03 Mar 2025 23:57:58 +0000 https://towardsdatascience.com/?p=598653 Should human obsolescence be our goal?

The post Generative AI and Civic Institutions appeared first on Towards Data Science.

]]>
Different sectors, different goals

Recent events have got me thinking about AI as it relates to our civic institutions — think government, education, public libraries, and so on. We often forget that civic and governmental organizations are inherently deeply different from private companies and profit-making enterprises. They exist to enable people to live their best lives, protect people’s rights, and make opportunities accessible, even if (especially if) this work doesn’t have immediate monetary returns. The public library is an example I often think about, as I come from a library-loving and defending family — their goal is to provide books, cultural materials, social supports, community engagement, and a love of reading to the entire community, regardless of ability to pay.

In the private sector, efficiency is an optimization goal because any dollar spent on providing a product or service to customers is a dollar taken away from the profits. The (simplified) goal is to spend the bare minimum possible to run your business, with the maximum amount returned to you or the shareholders in profit form. In the civic space, on the other hand, efficiency is only a meaningful goal insomuch as it enables higher effectiveness — more of the service the institution provides getting to more constituents.

In the civic space, efficiency is only a meaningful goal insomuch as it enables higher effectiveness — more of the service the institution provides getting to more constituents.

So, if you’re at the library, and you could use an Ai Chatbot to answer patron questions online instead of assigning a librarian to do that, that librarian could be helping in-person patrons, developing educational curricula, supporting community services, or many other things. That’s a general efficiency that could make for higher effectiveness of the library as an institution. Moving from card catalogs to digital catalogs is a prime example of this kind of efficiency to effectiveness pipeline, because you can find out from your couch whether the book you want is in stock using search keywords instead of flipping through hundreds of notecards in a cabinet drawer like we did when I was a kid.

However, we can pivot too hard in the direction of efficiency and lose sight of the end goal of effectiveness. If, for example, your online librarian chat is often used by schoolchildren at home to get homework help, replacing them with an AI chatbot could be a disaster — after getting incorrect information from such a bot and getting a bad grade at school, a child might be turned off from patronizing the library or seeking help there for a long time, or forever. So, it’s important to deploy Generative Ai solutions only when it is well thought out and purposeful, not just because the media is telling us that “AI is neat.” (Eagle-eyed readers will know that this is basically similar advice to what I’ve said in the past about deploying AI in businesses as well.)

As a result, what we thought was a gain in efficiency leading to net higher effectiveness actually could diminish the number of lifelong patrons and library visitors, which would mean a loss of effectiveness for the library. Sometimes unintended effects from attempts to improve efficiency can diminish our ability to provide a universal service. That is, there may be a tradeoff between making every single dollar stretch as far as it can possibly go and providing reliable, comprehensive services to all the constituents of your institution.

Sometimes unintended effects from attempts to improve efficiency can diminish our ability to provide a universal service.

AI for efficiency

It’s worth it to take a closer look at this concept — AI as a driver of efficiency. Broadly speaking, the theory we hear often is that incorporating generative AI more into our workplaces and organizations can increase productivity. Framing it at the most Econ 101 level: using AI, more work can be completed by fewer people in the same amount of time, right?

Let’s challenge some aspects of this idea. AI is useful to complete certain tasks but is sadly inadequate for others. (As our imaginary schoolchild library patron learned, an LLM is not a reliable source of facts, and should not be treated like one.) So, AI’s ability to increase the volume of work being done with fewer people (efficiency) is limited by what kind of work we need to complete.

If our chat interface is only used for simple questions like “What are the library’s hours on Memorial Day?” we can hook up a RAG (Retrieval Augmented Generation) system with an LLM and make that quite useful. But outside of the limited bounds of what information we can provide to the LLM, we should probably set guard rails and make the model refuse to try and answer, to avoid giving out false information to patrons.

So, let’s play that out. We have a chatbot that does a very limited job, but does it well. The librarian who was on chatbot duty now may have some reduction in the work required of them, but there are still going to be a subset of questions that still require their help. We have some choices: put the librarian on chatbot duty for a reduced number of hours a week, hoping the questions come in when they’re on? Tell people to just call the reference desk or send an email if the chatbot refuses to answer them? Hope that people come in to the library in person to ask their questions?

I suspect the likeliest option is actually “the patron will seek their answer elsewhere, perhaps from another LLM like ChatGPT, Claude, or Gemini.” Once again, we’ve ended up in a situation where the library loses patronage because their offering wasn’t meeting the needs of the patron. And to boot, the patron may have gotten another wrong answer somewhere else, for all we know.

I am spinning out this long example just to illustrate that efficiency and effectiveness in the civic environment can have a lot more push and pull than we would initially assume. It’s not to say that AI isn’t useful to help civic organizations stretch their capabilities to serve the public, of course! But just like with any application of generative AI, we need to be very careful to think about what we’re doing, what our goals are, and whether those two are compatible.

Conversion of labor

Now, this has been a very simplistic example, and eventually we could hook up the whole encyclopedia to that chatbot RAG or something, of course, and try to make it work. In fact, I think we can and should continue developing more ways to chain together AI models to expand the scope of valuable work they can do, including making different specific models for different responsibilities. However, this development is itself work. It’s not really just a matter of “people do work” or “models do work”, but instead it’s “people do work building AI” or “people do work providing services to people”. There’s a calculation to be made to determine when it would be more efficient to do the targeted work itself, and when AI is the right way to go.

Working on the AI has an advantage in that it will hopefully render the task reproducible, so it will lead to efficiency, but let’s remember that AI engineering is vastly different from the work of the reference librarian. We’re not interchanging the same workers, tasks, or skill sets here, and in our contemporary economy, the AI engineer’s time costs a heck of a lot more. So if we did want to measure this efficiency all in dollars and cents, the same amount of time spent working at the reference desk and doing the chat service will be much cheaper than paying an AI engineer to develop a better agentic AI for the use case. Given a bit of time, we could calculate out how many hours, days, years of work as a reference librarian we’d need to save with this chatbot to make it worth building, but often that calculation isn’t done before we move towards AI solutions.

We need to interrogate the assumption that incorporating generative AI in any given scenario is a guaranteed net gain in efficiency.

Externalities

While we’re on this topic of weighing whether the AI solution is worth doing in a particular situation, we should remember that developing and using AI for tasks does not happen in a vacuum. It has some cost environmentally and economically when we choose to use a generative AI tool, even when it’s a single prompt and a single response. Consider that the newly released GPT-4.5 has increased prices 30x for input tokens ($2.50 per million to $75 per million) and 15x for output tokens ($10 per million to $150 per million) just since GPT-4o. And that isn’t even taking into account the water consumption for cooling data centers (3 bottles per 100 word output for GPT-4)electricity use, and rare earth minerals used in GPUs. Many civic institutions have as a macro level goal to improve the world around them and the lives of the citizens of their communities, and concern for the environment has to have a place in that. Should organizations whose purpose is to have a positive impact weigh the possibility of incorporating AI more carefully? I think so.

Plus, I don’t often get too much into this, but I think we should take a moment to consider some folks’ end game for incorporating AI — reducing staffing altogether. Instead of making our existing dollars in an institution go farther, some people’s idea is just reducing the number of dollars and redistributing those dollars somewhere else. This brings up many questions, naturally, about where those dollars will go instead and whether they will be used to advance the interests of the community residents some other way, but let’s set that aside for now. My concern is for the people who might lose their jobs under this administrative model.

For-profit companies hire and fire employees all the time, and their priorities and objectives are focused on profit, so this is not particularly hypocritical or inconsistent. But as I noted above, civic organizations have objectives around improving the community or communities in which they exist. In a very real way, they are advancing that goal when part of what they provide is economic opportunity to their workers. We live in a Society where working is the overwhelmingly predominant way people provide for themselves and their families, and giving jobs to people in the community and supporting the economic well-being of the community is a role that civic institutions do play.

[R]educing staffing is not an unqualified good for civic organizations and government, but instead must be balanced critically against whatever other use the money that was paying their salaries will go to.

At the bare minimum, this means that reducing staffing is not an unqualified good for civic organizations and government, but instead must be balanced critically against whatever other use the money that was paying their salaries will go to. It’s not impossible for reducing staff to be the right decision, but we have to bluntly acknowledge that when members of communities experience joblessness, that effect cascades. They are now no longer able to patronize the shops and services they would have been supporting with their money, the tax base may be reduced, and this negatively affects the whole collective.

Workers aren’t just workers; they’re also patrons, customers, and participants in all aspects of the community. When we think of civic workers as simply money pits to be replaced with AI or whose cost for labor we need to minimize, we lose sight of the reasons for the work to be done in the first place.

Conclusion

I hope this discussion has brought some clarity about how really difficult it is to decide if, when, and how to apply generative AI to the civic space. It’s not nearly as simple a thought process as it might be in the for-profit sphere because the purpose and core meaning of civic institutions are completely different. Those of us who do machine learning and build AI solutions in the private sector might think, “Oh, I can see a way to use this in government,” but we have to recognize and appreciate the complex contextual implications that might have.

Next month, I’ll be bringing you a discussion of how social science research is incorporating generative AI, which has some very intriguing aspects.

As you may have heard, Towards Data Science has moved to an independent platform, but I will continue to post my work on my Medium page, my personal website, and the new TDS platform, so you’ll be able to find me wherever you happen to go. Subscribe to my newsletter on Medium if you’d like to ensure you get every article in your inbox.

Find more of my work at www.stephaniekirmer.com.

Further reading

“It’s a lemon”-OpenAI’s largest AI model ever arrives to mixed reviews: GPT-4.5 offers marginal gains in capability and poor coding performance despite 30x the cost. arstechnica.com

Using GPT-4 to generate 100 words consumes up to 3 bottles of water: New research shows generative AI consumes a lot of water – up to 1,408ml to generate 100 words of text. www.tomshardware.com

Environmental Implications of the AI Boom: The digital world can’t exist without the natural resources to run it. What are the costs of the tech we’re using… towardsdatascience.com

Economics of Generative AI: What’s the business model for generative AI, given what we know today about the technology and the market? towardsdatascience.com

The post Generative AI and Civic Institutions appeared first on Towards Data Science.

]]>
Avoidable and Unavoidable Randomness in GPT-4o https://towardsdatascience.com/avoidable-and-unavoidable-randomness-in-gpt-4o/ Mon, 03 Mar 2025 12:00:00 +0000 https://towardsdatascience.com/?p=598604 Exploring the sources of randomness in GPT-4o from the known and controllable to the opaque and uncontrollable.

The post Avoidable and Unavoidable Randomness in GPT-4o appeared first on Towards Data Science.

]]>
Of course there is randomness in GPT-4o’s outputs. After all, the model samples from a probability distribution when choosing each token. But what I didn’t understand was that those very probabilities themselves are not deterministic. Even with consistent prompts, fixed seeds, and temperature set to zero, GPT-4o still introduces subtle, frustrating randomness.

There’s no fix for this, and it might not even be something OpenAI could fix if they wanted to, just so we’re clear up front about where this article is headed. Along the way, we’ll examine all the sources of randomness in GPT-4o output, which will require us to break down the sampling process to a low level. We’ll point at the issue—the probabilities vary—and critically examine OpenAI’s official guidance on determinism.

First, though, let’s talk about why determinism matters. Determinism means that the same input always produces the same output, like a mathematical function. While LLM creativity is often desirable, determinism serves crucial purposes: researchers need it for reproducible experiments, developers for verifying reported results, and prompt engineers for debugging their changes. Without it, you’re left wondering if different outputs stem from your tweaks or just the random number generator’s mood swings.

Flipping a coin

We’re going to keep things extremely simple here and prompt the most recent version of GPT-4o (gpt-4o-2024-08-06 in the API) with this:

 Flip a coin. Return Heads or Tails only.

Flipping a coin with LLMs is a fascinating topic in itself (see for example Van Koevering & Kleinberg, 2024 in the references), but here, we’ll use it as a simple binary question with which to explore determinism, or the lack thereof.

This is our first attempt.

import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

response = client.chat.completions.create(
    model='gpt-4o-2024-08-06',
    messages=[{'role': 'user', 'content': prompt}],
)

print(response.choices[0].message.content)

Running the code gave me Heads. Maybe you’ll get Tails, or if you’re really lucky, something far more interesting.

The code first initializes an OpenAI client with an API key set in the environment variable OPENAI_API_KEY (to avoid sharing billing credentials here). The main action happens with client.chat.completions.create, where we specify the model to use and send the prompt (as a part of a very simple conversation named messages) to the server. We get an object called response back from the server. This object contains a lot of information, as shown below, so we need to dig into it to extract GPT-4o’s actual response to the message, which is response.choices[0].message.content.

>>> response
ChatCompletion(id='chatcmpl-B48EqZBLfUWtp9H7cwnchGTJbBDwr', choices=[Choice(finish_reason='stop', index=0, logprobs=None, message=ChatCompletionMessage(content='Heads', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1740324680, model='gpt-4o-2024-08-06', object='chat.completion', service_tier='default', system_fingerprint='fp_eb9dce56a8', usage=CompletionUsage(completion_tokens=2, prompt_tokens=18, total_tokens=20, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0)))

Now let’s flip the coin ten times. If this were a real, fair coin, of course, we would expect roughly equal heads and tails over time thanks to the law of large numbers. But GPT-4o’s coin doesn’t work quite like that.

import os
from openai import OpenAI
client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

for _ in range(10):
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        messages=[{'role': 'user', 'content': prompt}],
    )
    print(response.choices[0].message.content)

Running this code gave me the following output, although you might get different output, of course.

Heads
Heads
Heads
Heads
Heads
Heads
Tails
Heads
Heads
Heads

GPT-4o’s coin is clearly biased, but so are humans. Bar-Hillel, Peer, and Acquisti (2014) found that people flipping imaginary coins choose “heads” 80% of the time. Maybe GPT-4o learned that from us. But whatever the reason, we’re just using this simple example to explore determinism.

Just how biased is GPT-4o’s coin?

Let’s say we wanted to know precisely what percentage of GPT-4o coin flips land Heads.

Rather than the obvious (but expensive) approach of flipping it a million times, there’s a smarter way. For classification tasks with a small set of possible answers, we can extract token probabilities instead of generating full responses. With the right prompt, the first token carries all the necessary information, making these API calls incredibly cheap: around 30,000 calls per dollar, since each requires just 18 (cached) input tokens and 1 output token.

OpenAI gives us (natural) log probabilities. These are called logprobs in the code, and we convert them to regular probabilities by exponentiation. (We’ll discuss temperature soon, but note that exponentiating logprobs directly like this corresponds to a temperature setting of 1.0, and is how we calculate probabilities throughout this article). OpenAI lets us request logprobs for the top 20 most likely tokens, so we do that.

import os
import math
from openai import OpenAI
from tabulate import tabulate

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

response = client.chat.completions.create(
    model='gpt-4o-2024-08-06',
    max_tokens=1,
    logprobs=True,
    top_logprobs=20,
    messages=[{'role': 'user', 'content': prompt}],
)

logprobs_list = response.choices[0].logprobs.content[0].top_logprobs

data = []
total_pct = 0.0

for logprob_entry in logprobs_list:
    token = logprob_entry.token
    logprob = logprob_entry.logprob
    pct = math.exp(logprob) * 100  # Convert logprob to a percentage
    total_pct += pct
    data.append([token, logprob, pct])

print(
    tabulate(
        data,
        headers=["Token", "Log Probability", "Percentage (%)"],
        tablefmt="github",
        floatfmt=("s", ".10f", ".10f")
    )
)
print(f"\nTotal probabilities: {total_pct:.6f}%")

If you run this, you’ll get something like the following output, but actual numbers will vary.

| Token     |   Log Probability |   Percentage (%) |
|-----------|-------------------|------------------|
| Heads     |     -0.0380541235 |    96.2660836887 |
| T         |     -3.2880542278 |     3.7326407467 |
| Sure      |    -12.5380544662 |     0.0003587502 |
| Head      |    -12.7880544662 |     0.0002793949 |
| Tail      |    -13.2880544662 |     0.0001694616 |
| Certainly |    -13.5380544662 |     0.0001319768 |
| "T        |    -14.2880544662 |     0.0000623414 |
| I'm       |    -14.5380544662 |     0.0000485516 |
| heads     |    -14.5380544662 |     0.0000485516 |
| Heads     |    -14.9130544662 |     0.0000333690 |
| "         |    -15.1630544662 |     0.0000259878 |
| _heads    |    -15.1630544662 |     0.0000259878 |
| tails     |    -15.5380544662 |     0.0000178611 |
| HEAD      |    -15.7880544662 |     0.0000139103 |
| TAIL      |    -16.2880535126 |     0.0000084370 |
| T         |    -16.7880535126 |     0.0000051173 |
| ```       |    -16.7880535126 |     0.0000051173 |
| Here's    |    -16.9130535126 |     0.0000045160 |
| I         |    -17.2880535126 |     0.0000031038 |
| As        |    -17.2880535126 |     0.0000031038 |

Total probabilities: 99.999970%

Looking at these probabilities, we see Heads at ≈96% and T at ≈4%. Our prompt is doing pretty well at constraining the model’s responses. Why T and not Tails? This is the tokenizer splitting Tails into T + ails, while keeping Heads as one piece, as we can see in this Python session:

>>> import tiktoken
>>> encoding = tiktoken.encoding_for_model("gpt-4o-2024-08-06")
>>> encoding.encode('Tails')
[51, 2196]
>>> encoding.decode([51])
'T'
>>> encoding.encode('Heads')
[181043]

These probabilities are not deterministic

Run the code to display the probabilities for the top 20 tokens again, and you’ll likely get different numbers. Here’s what I got on a second running.

| Token     |   Log Probability |   Percentage (%) |
|-----------|-------------------|------------------|
| Heads     |     -0.0110520627 |    98.9008786933 |
| T         |     -4.5110521317 |     1.0986894433 |
| Certainly |    -14.0110521317 |     0.0000822389 |
| Head      |    -14.2610521317 |     0.0000640477 |
| Sure      |    -14.2610521317 |     0.0000640477 |
| Tail      |    -14.3860521317 |     0.0000565219 |
| heads     |    -15.3860521317 |     0.0000207933 |
| Heads     |    -15.5110521317 |     0.0000183500 |
| ```       |    -15.5110521317 |     0.0000183500 |
| _heads    |    -15.6360521317 |     0.0000161938 |
| tails     |    -15.6360521317 |     0.0000161938 |
| I'm       |    -15.8860521317 |     0.0000126117 |
| "T        |    -15.8860521317 |     0.0000126117 |
| As        |    -16.3860511780 |     0.0000076494 |
| "         |    -16.5110511780 |     0.0000067506 |
| HEAD      |    -16.6360511780 |     0.0000059574 |
| TAIL      |    -16.7610511780 |     0.0000052574 |
| Here's    |    -16.7610511780 |     0.0000052574 |
| ``        |    -17.1360511780 |     0.0000036133 |
| T         |    -17.6360511780 |     0.0000021916 |

Total probabilities: 99.999987%

In their cookbook, OpenAI offers the following advice on receiving “mostly identical” outputs:

If the seed, request parameters, and system_fingerprint all match across your requests, then model outputs will mostly be identical. There is a small chance that responses differ even when request parameters and system_fingerprint match, due to the inherent non-determinism of our models.

They also give “mostly identical” advice in the reproducible outputs section of their documentation.

The request parameters that could affect randomness are temperature and seed. OpenAI also suggests we track system_fingerprint, because differences here might cause differences in output. We’ll examine each of these below, but spoiler: none of them will fix or even explain this non-determinism.

Temperature, and why it won’t fix this

Temperature controls how random the model’s responses are. Low temperatures (<0.5) make it robotic and predictable, medium temperatures (0.7–1.3) allow some creativity, and high temperatures (>1.5) produce gibberish. Temperature is often called the “creativity parameter”, but this is an oversimplification. In their analysis, Peeperkorn, Kouwenhoven, Brown, and Jordanous (2024) evaluated LLM outputs across four dimensions of creativity: novelty (originality), coherence (logical consistency), cohesion (how well the text flows), and typicality (how well it fits expected patterns). They observed that:

temperature is weakly correlated with novelty, and unsurprisingly, moderately correlated with incoherence, but there is no relationship with either cohesion or typicality.

But, this is beside the point for coin flipping. Under the hood, the log probabilities are divided by the temperature before they’re renormalized and exponentiated to be converted to probabilities. This creates a non-linear effect: temperature=0.5 squares the probabilities, making likely tokens dominate, while temperature=2.0 applies a square root, flattening the distribution.

What about temperature=0.0? Instead of breaking math dividing by zero, the model simply picks the highest-probability token. Sounds deterministic, right? Not quite. Here’s the catch: temperature only comes into play after the log probabilities are computed, when we convert them to probabilities.

In summary: if the logprobs aren’t deterministic, setting temperature to 0.0 won’t make the model deterministic.

In fact, since we’re just asking the model for the raw logprobs directly rather than generating full responses, the temperature setting doesn’t come into play in our code at all.

Seeds, and why they won’t fix this

After temperature is used to compute probabilities, the model samples from these probabilities to pick the next token. OpenAI gives us a little control over the sampling process by letting us set the seed parameter for the random number generator. In an ideal world, setting a seed would give us determinism at any temperature. But seeds only affect sampling, not the log probabilities before sampling.

In summary: if the logprobs aren’t deterministic, setting a seed won’t make the model deterministic.

In fact, seed only matters with non-zero temperatures. With temperature=0.0, the model is always choosing the highest probability token regardless of the seed. Again, since we’re just asking the model for the raw logprobs directly rather than sampling, neither of these settings can help us achieve determinism.

System fingerprints, our last hope

The system_fingerprint identifies the current combination of model weights, infrastructure, and configuration options in OpenAI’s backend. At least, that’s what OpenAI tells us. Variations in system fingerprints might indeed explain variations in logprobs. Except that they don’t, as we will verify below.

Nothing can get you determinism

Let’s confirm what we’ve been building toward. We’ll run the same request 10 times with every safeguard in place. Even though neither of these parameters should matter for what we’re doing, you can never be too safe, so we’ll set temperature=0.0 and seed=42. And to see if infrastructure differences explain our varying logprobs, we’ll print system_fingerprint. Here’s the code:

import os
import math
from openai import OpenAI
from tabulate import tabulate
from tqdm import tqdm

client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

prompt = 'Flip a coin. Return Heads or Tails only.'

data = []

for _ in tqdm(range(10), desc='Generating responses'):
    response = client.chat.completions.create(
        model='gpt-4o-2024-08-06',
        temperature=0.0,
        seed=42,
        max_tokens=1,
        logprobs=True,
        top_logprobs=20,
        messages=[{'role': 'user', 'content': prompt}],
    )

    fingerprint = response.system_fingerprint
    logprobs_list = response.choices[0].logprobs.content[0].top_logprobs
    heads_logprob = next(
        entry.logprob for entry in logprobs_list if entry.token == 'Heads'
    )
    pct = math.exp(heads_logprob) * 100
    data.append([fingerprint, heads_logprob, f"{pct:.10f}%"])

headers = ["Fingerprint", "Logprob", "Probability"]
print(tabulate(data, headers=headers, tablefmt="pipe"))

Running this 10 times, here are the logprobs and probabilities for the token Heads:

| Fingerprint   |    Logprob | Probability    |
|---------------|------------|----------------|
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.160339  | 85.1854886858% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0110521 | 98.9008786933% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |
| fp_f9f4fb6dbf | -0.0380541 | 96.2660836887% |

Mixture-of-experts makes determinism impossible

OpenAI is decidedly not open about the architecture behind GPT-4o. However, it’s widely believed that GPT-4o uses a mixture-of-experts (MoE) architecture with either 8 or 16 experts.

According to a paper by Google DeepMind researchers Puigcerver, Riquelme, Mustafa, and Houlsby (hat tip to user elmstedt on the OpenAI forum), mixture-of-experts architectures may add an unavoidable level of non-determinism:

Under capacity constraints, all Sparse MoE approaches route tokens in groups of a fixed size and enforce (or encourage) balance within the group. When groups contain tokens from different sequences or inputs, these tokens compete for available spots in expert buffers. Therefore, the model is no longer deterministic at the sequence-level, but only at the batch-level.

In other words, when your prompt (a sequence of tokens, in the quote above) reaches OpenAI’s servers, it gets batched with a group of other prompts (OpenAI isn’t open about how many other prompts). Each prompt in the batch is then routed to an “expert” within the model. However, since only so many prompts can be routed to the same expert, the expert your prompt gets routed to will depend on all the other prompts in the batch.

This “competition” for experts introduces a real-world randomness completely beyond our control.

Non-determinism beyond mixture-of-experts

While non-determinism may be inherent to real-world mixture-of-experts models, that does not seem to be the only source of non-determinism in OpenAI’s models.

Making a few changes to our code above (switching to gpt-3.5-turbo-0125, looking for the token He since GPT-3.5’s tokenizer splits “Heads” differently, and ignoring system_fingerprint because this model doesn’t have it) reveals that GPT-3.5-turbo also exhibits non-deterministic logprobs:

|     Logprob | Probability    |
|-------------|----------------|
| -0.00278289 | 99.7220983436% |
| -0.00415331 | 99.5855302068% |
| -0.00258838 | 99.7414961980% |
| -0.00204034 | 99.7961735289% |
| -0.00240277 | 99.7600117933% |
| -0.00204034 | 99.7961735289% |
| -0.00204034 | 99.7961735289% |
| -0.00258838 | 99.7414961980% |
| -0.00351419 | 99.6491976144% |
| -0.00201214 | 99.7989878007% |

No one is claiming that GPT-3.5-turbo uses a mixture-of-experts architecture. Thus, there must be additional factors beyond mixture-of-experts contributing to this non-determinism.

What 10,000 GPT-4o coin flip probabilities tell us

To better understand the patterns and magnitude of this non-determinism, I conducted a more extensive experiment with GPT-4o, performing 10,000 “coin flips” while recording the probability assigned to “Heads” in each case.

The results reveal something fascinating. Across 10,000 API calls with identical parameters, GPT-4o produced not just a few different probability values, but 42 distinct probabilities. If the mixture-of-experts hypothesis were the complete explanation for non-determinism in GPT-4o, we might expect to see one distinct probability for each expert. But GPT-4o is believed to have either 8 or 16 experts, not 42.

In the output below, I clustered these probabilities, ensuring that each cluster was separated from the others by 0.01 (as a raw percentage). This groups the output into 12 clusters.

Probability          Count           Fingerprints
------------------------------------------------------------------
85.1854379113%       5               fp_eb9dce56a8, fp_f9f4fb6dbf
85.1854455275%       74              fp_eb9dce56a8, fp_f9f4fb6dbf
85.1854886858%       180             fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
88.0662448207%       31              fp_eb9dce56a8, fp_f9f4fb6dbf
88.0678628883%       2               fp_f9f4fb6dbf
------------------------------------------------------------------
92.3997629747%       1               fp_eb9dce56a8
92.3997733012%       4               fp_eb9dce56a8
92.3997836277%       3               fp_eb9dce56a8
------------------------------------------------------------------
92.4128943690%       1               fp_f9f4fb6dbf
92.4129143363%       21              fp_eb9dce56a8, fp_f9f4fb6dbf
92.4129246643%       8               fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
93.9906837191%       4               fp_eb9dce56a8
------------------------------------------------------------------
95.2569999350%       36              fp_eb9dce56a8
------------------------------------------------------------------
96.2660836887%       3391            fp_eb9dce56a8, fp_f9f4fb6dbf
96.2661285161%       2636            fp_eb9dce56a8, fp_f9f4fb6dbf
------------------------------------------------------------------
97.0674551052%       1               fp_eb9dce56a8
97.0674778863%       3               fp_eb9dce56a8
97.0675003058%       4               fp_eb9dce56a8
97.0675116963%       1               fp_eb9dce56a8
97.0680739932%       19              fp_eb9dce56a8, fp_f9f4fb6dbf
97.0681293191%       6               fp_eb9dce56a8, fp_f9f4fb6dbf
97.0681521003%       74              fp_eb9dce56a8, fp_f9f4fb6dbf
97.0682421405%       4               fp_eb9dce56a8
------------------------------------------------------------------
97.7008960695%       1               fp_f9f4fb6dbf
97.7011122645%       3               fp_eb9dce56a8
97.7011462953%       3               fp_eb9dce56a8
97.7018178132%       1               fp_eb9dce56a8
------------------------------------------------------------------
98.2006069902%       426             fp_eb9dce56a8, fp_f9f4fb6dbf
98.2006876548%       6               fp_f9f4fb6dbf
98.2007107019%       1               fp_eb9dce56a8
98.2009525133%       5               fp_eb9dce56a8
98.2009751945%       1               fp_eb9dce56a8
98.2009867181%       1               fp_eb9dce56a8
------------------------------------------------------------------
98.5930987656%       3               fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931104270%       235             fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931222721%       4               fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931340253%       9               fp_eb9dce56a8
98.5931571644%       159             fp_eb9dce56a8, fp_f9f4fb6dbf
98.5931805790%       384             fp_eb9dce56a8
------------------------------------------------------------------
98.9008436920%       95              fp_eb9dce56a8, fp_f9f4fb6dbf
98.9008550214%       362             fp_eb9dce56a8, fp_f9f4fb6dbf
98.9008786933%       1792            fp_eb9dce56a8, fp_f9f4fb6dbf

(With a threshold of 0.001 there are 13 clusters, and with a threshold of 0.0001 there are 17 clusters.)

As the chart above demonstrates, this multitude of results cannot be explained by system_fingerprint values. Across all 10,000 calls, I received only two different system fingerprints: 4488 results with fp_f9f4fb6dbf and 5512 with fp_eb9dce56a8, and for the most part the two system fingerprints returned the same sets probabilities, rather than each fingerprint producing its own distinct set of probabilities.

It could be that these 12 clusters of probabilities represent 12 different experts. Even assuming that, the variations within the clusters remain puzzling. These don’t seem likely to be simple rounding errors, because they are too systematic and consistent. Take the giant cluster at around 96.266% with two distinct probabilities representing over half of our coin flips. The difference between these two probabilities, 0.0000448274%, is tiny but persistent.

Conclusion: Non-determinism is baked in

There is an underlying randomness in the log probabilities returned by all currently available non-thinking OpenAI models: GPT-4o, GPT-4o-mini, and the two flavors of GPT-3.5-turbo. Because this non-determinism is baked into the log probabilities, there’s no way for a user to get around it. Temperature and seed values have no effect, and system fingerprints don’t explain it.

While mixture-of-experts architectures inherently introduce some randomness in the competition for experts, the non-determinism in GPT-4o seems to go far beyond this, and the non-determinism in GPT-3.5-turbo can’t be explained by this at all, because GPT-3.5-turbo isn’t a mixture-of-experts model.

While we can’t verify this claim any more because the model isn’t being served, this behaviour wasn’t seen with GPT-3, according to user _j on the OpenAI forum:

It is a symptom that was not seen on prior GPT-3 AI models where across hundreds of trials to investigate sampling, you never had to doubt that logprobs would be the same. Even if you found a top-2 answer that returned exactly the same logprob value via the API, you would never see them switch position or return different values.

This suggests that whatever is causing this randomness first emerged in either GPT-3.5 or GPT-3.5-turbo.

But regardless of when it emerged, this non-determinism is a serious obstacle to understanding these models. If you want to study a model—how it generalizes, how it biases responses, how it assigns probabilities to different tokens—you need consistency. but as we’ve seen, even when we lock down every knob OpenAI lets us touch, we still can’t get an answer to the simplest possible question: “what is the probability that GPT-4o says a coin lands heads?”

Worse, while mixture-of-experts explains some of this non-determinism, there are clearly other, hidden sources of randomness that we can’t see, control, or understand. In an ideal world, the API would provide more transparency by telling us which expert processed our request or by offering additional parameters to control this routing process. Without such visibility, we’re left guessing at the true nature of the variability.

References

Bar-Hillel, M., Peer, E., & Acquisti, A. (2014). “Heads or tails?” – A reachability bias in binary choice. Journal of Experimental Psychology: Learning, Memory, and Cognition, 40(6), 1656–1663. https://doi.org/10.1037/xlm0000005.

Peeperkorn, M., Kouwenhoven, T., Brown, D., & Jordanous, A. (2024). Is temperature the creativity parameter of Large Language Models?. In The 15th International Conference on Computational Creativity (ICCC’24). arXiv:2405.00492.

Puigcerver, J., Riquelme, C., Mustafa, B., & Houlsby, N. (2024). From sparse to soft mixtures of experts. In The Twelfth International Conference on Learning Representations (ICLR 2024). https://openreview.net/forum?id=jxpsAj7ltE. arXiv:2308.00951.Van Koevering, K., & Kleinberg, J. (2024). How random is random? Evaluating the Randomness and humanness of LLMs’ coin flips. arXiv:2406.00092.

The post Avoidable and Unavoidable Randomness in GPT-4o appeared first on Towards Data Science.

]]>
I Won’t Change Unless You Do https://towardsdatascience.com/i-wont-change-unless-you-do/ Fri, 28 Feb 2025 12:00:00 +0000 https://towardsdatascience.com/?p=598543 Game Theory 101: The Nash equilibrium

The post I Won’t Change Unless You Do appeared first on Towards Data Science.

]]>
In Game Theory, how can players ever come to an end if there still might be a better option to decide for? Maybe one player still wants to change their decision. But if they do, maybe the other player wants to change too. How can they ever hope to escape from this vicious circle? To solve this problem, the concept of a Nash equilibrium, which I will explain in this article, is fundamental to game theory.

This article is the second part of a four-chapter series on game theory. If you haven’t checked out the first chapter yet, I’d encourage you to do that to get familiar with the main terms and concepts of game theory. If you did so, you are prepared for the next steps of our journey through game theory. Let’s go!

Finding the solution

Finding a solution to a game in game theory can be tricky sometimes. Photo by Mel Poole on Unsplash

We will now try to find a solution for a game in game theory. A solution is a set of actions, where each player maximizes their utility and therefore behaves rationally. That does not necessarily mean, that each player wins the game, but that they do the best they can do, given that they don’t know what the other players will do. Let’s consider the following game:

If you are unfamiliar with this matrix-notation, you might want to take a look back at Chapter 1 and refresh your memory. Do you remember that this matrix gives you the reward for each player given a specific pair of actions? For example, if player 1 chooses action Y and player 2 chooses action B, player 1 will get a reward of 1 and player 2 will get a reward of 3. 

Okay, what actions should the players decide for now? Player 1 does not know what player 2 will do, but they can still try to find out what would be the best action depending on player 2’s choice. If we compare the utilities of actions Y and Z (indicated by the blue and red boxes in the next figure), we notice something interesting: If player 2 chooses action A (first column of the matrix), player 1 will get a reward of 3, if they choose action Y and a reward of 2, if they choose action Z, so action Y is better in that case. But what happens, if player 2 decides for action B (second column)? In that case, action Y gives a reward of 1 and action Z gives a reward of 0, so Y is better than Z again. And if player 2 chooses action C (third column), Y is still better than Z (reward of 2 vs. reward of 1). That means, that player 1 should never use action Z, because action Y is always better.

We compare the rewards for player 1for actions Y and Z.

With the aforementioned considerations, player 2 can anticipate, that player 1 would never use action Z and hence player 2 doesn’t have to care about the rewards that belong to action Z. This makes the game much smaller, because now there are only two options left for player 1, and this also helps player 2 decide for their action.

We found out, that for player 1 Y is always better than Z, so we don’t consider Z anymore.

If we look at the truncated game, we see, that for player 2, option B is always better than action A. If player 1 chooses X, action B (with a reward of 2) is better than option A (with a reward of 1), and the same applies if player 1 chooses action Y. Note that this would not be the case if action Z was still in the game. However, we already saw that action Z will never be played by player 1 anyway.

We compare the rewards for player 2 for actions A and B.

As a consequence, player 2 would never use action A. Now if player 1 anticipates that player 2 never uses action A, the game becomes smaller again and fewer options have to be considered.

We saw, that for player 2 action B is always better than action A, so we don’t have to consider A anymore.

We can easily continue in a likewise fashion and see that for player 1, X is now always better than Y (2>1 and 4>2). Finally, if player 1 chooses action A, player 2 will choose action B, which is better than C (2>0). In the end, only the action X (for player 1) and B (for player 2) are left. That is the solution of our game:

In the end, only one option remains, namely player 1 using X and player 2 using B.

It would be rational for player 1 to choose action X and for player 2 to choose action B. Note that we came to that conclusion without exactly knowing what the other player would do. We just anticipated that some actions would never be taken, because they are always worse than other actions. Such actions are called strictly dominated. For example, action Z is strictly dominated by action Y, because Y is always better than Z. 

The best answer

Scrabble is one of those games, where searching for the best answer can take ages. Photo by Freysteinn G. Jonsson on Unsplash

Such strictly dominated actions do not always exist, but there is a similar concept that is of importance for us and is called a best answer. Say we know which action the other player chooses. In that case, deciding on an action becomes very easy: We just take the action that has the highest reward. If player 1 knew that player 2 chose option A, the best answer for player 1 would be Y, because Y has the highest reward in that column. Do you see how we always searched for the best answers before? For each possible action of the other player we searched for the best answer, if the other player chose that action. More formally, player i’s best answer to a given set of actions of all other players is the action of player 1 which maximises the utility given the other players’ actions. Also be aware, that a strictly dominated action can never be a best answer. 

Let us come back to a game we introduced in the first chapter: The prisoners’ dilemma. What are the best answers here?

Prisoners’ dilemma

How should player 1 decide, if player 2 confesses or denies? If player 2 confesses, player 1 should confess as well, because a reward of -3 is better than a reward of -6. And what happens, if player 2 denies? In that case, confessing is better again, because it would give a reward of 0, which is better than a reward of -1 for denying. That means, for player 1 confessing is the best answer for both actions of player 2. Player 1 doesn’t have to worry about the other player’s actions at all but should always confess. Because of the game’s symmetry, the same applies to player 2. For them, confessing is also the best answer, no matter what player 1 does. 

The Nash Equilibrium

The Nash equilibrium is somewhat like the master key that allows us to solve game-theoretic problems. Researchers were very happy when they found it. Photo by rc.xyz NFT gallery on Unsplash

If all players play their best answer, we have reached a solution of the game that is called a Nash Equilibrium. This is a key concept in game theory, because of an important property: In a Nash Equilibrium, no player has any reason to change their action, unless any other player does. That means all players are as happy as they can be in the situation and they wouldn’t change, even if they could. Consider the prisoner’s dilemma from above: The Nash equilibrium is reached when both confess. In this case, no player would change his action without the other. They could become better if both changed their action and decided to deny, but since they can’t communicate, they don’t expect any change from the other player and so they don’t change themselves either. 

You may wonder if there is always a single Nash equilibrium for each game. Let me tell you there can also be multiple ones, as in the Bach vs. Stravinsky game that we already got to know in Chapter 1:

Bach vs. Stravinsky

This game has two Nash equilibria: (Bach, Bach) and (Stravinsky, Stravinsky). In both scenarios, you can easily imagine that there is no reason for any player to change their action in isolation. If you sit in the Bach concerto with your friend, you would not leave your seat to go to the Stravinsky concerto alone, even if you favour Stravinsky over Bach. In a likewise fashion, the Bach fan wouldn’t go away from the Stravinsky concerto if that meant leaving his friend alone. In the remaining two scenarios, you would think differently though: If you were in the Stravinsky concerto alone, you would want to get out there and join your friend in the Bach concerto. That is, you would change your action even if the other player doesn’t change theirs. This tells you, that the scenario you have been in was not a Nash equilibrium. 

However, there can also be games that have no Nash equilibrium at all. Imagine you are a soccer keeper during a penalty shot. For simplicity, we assume you can jump to the left or to the right. The soccer player of the opposing team can also shoot in the left or right corner, and we assume, that you catch the ball if you decide for the same corner as they do and that you don’t catch it if you decide for opposing corners. We can display this game as follows:

A game matrix for a penalty shooting.

You won’t find any Nash equilibrium here. Each scenario has a clear winner (reward 1) and a clear loser (reward -1), and hence one of the players will always want to change. If you jump to the right and catch the ball, your opponent will wish to change to the left corner. But then you again will want to change your decision, which will make your opponent choose the other corner again and so on.

Summary

We learned about finding a point of balance, where nobody wants to change anymore. That is a Nash equilibrium. Photo by Eran Menashri on Unsplash

This chapter showed how to find solutions for games by using the concept of a Nash equilibrium. Let us summarize, what we have learned so far: 

  • A solution of a game in game theory maximizes every player’s utility or reward. 
  • An action is called strictly dominated if there is another action that is always better. In this case, it would be irrational to ever play the strictly dominated action.
  • The action that yields the highest reward given the actions taken by the other players is called a best answer.
  • A Nash equilibrium is a state where every player plays their best answer.
  • In a Nash Equilibrium, no player wants to change their action unless any other play does. In that sense, Nash equilibria are optimal states. 
  • Some games have multiple Nash equilibria and some games have none.

If you were saddened by the fact that there is no Nash equilibrium in some games, don’t despair! In the next chapter, we will introduce probabilities of actions and this will allow us to find more equilibria. Stay tuned!

References

The topics introduced here are typically covered in standard textbooks on game theory. I mainly used this one, which is written in German though:

  • Bartholomae, F., & Wiens, M. (2016). Spieltheorie. Ein anwendungsorientiertes Lehrbuch. Wiesbaden: Springer Fachmedien Wiesbaden.

An alternative in English language could be this one:

  • Espinola-Arredondo, A., & Muñoz-Garcia, F. (2023). Game Theory: An Introduction with Step-by-step Examples. Springer Nature.

Game theory is a rather young field of research, with the first main textbook being this one:

  • Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior.

Like this article? Follow me to be notified of my future posts.

The post I Won’t Change Unless You Do appeared first on Towards Data Science.

]]>
The Dangers of Deceptive Data–Confusing Charts and Misleading Headlines https://towardsdatascience.com/the-dangers-of-deceptive-data-confusing-charts-and-misleading-headlines/ Thu, 27 Feb 2025 02:15:25 +0000 https://towardsdatascience.com/?p=598469 A deep dive into the ways data can be used to misinform the masses

The post The Dangers of Deceptive Data–Confusing Charts and Misleading Headlines appeared first on Towards Data Science.

]]>
“You don’t have to be an expert to deceive someone, though you might need some expertise to reliably recognize when you are being deceived.”

When my co-instructor and I start our quarterly lesson on deceptive visualizations for the data visualization course we teach at the University of Washington, he emphasizes the point above to our students. With the advent of modern technology, developing pretty and convincing claims about data is easier than ever. Anyone can make something that seems passable, but contains oversights that render it inaccurate and even harmful. Furthermore, there are also malicious actors who actively want to deceive you, and who have studied some of the best ways to do it.

I often start this lecture with a bit of a quip, looking seriously at my students and asking two questions:

  1. “Is it a good thing if someone is gaslighting you?”
  2. After the general murmur of confusion followed by agreement that gaslighting is indeed bad, I ask the second question: “What’s the best way to ensure no one ever gaslights you?”

The students generally ponder that second question for a bit longer, before chuckling a bit and realizing the answer: It’s to learn how people gaslight in the first place. Not so you can take advantage of others, but so you can prevent others from taking advantage of you.

The same applies in the realm of misinformation and disinformation. People who want to mislead with data are empowered with a host of tools, from high-speed internet to social media to, most recently, generative AI and large language models. To protect yourself from being misled, you need to learn their tricks.

In this article, I’ve taken the key ideas from my data visualization course’s unit on deception–drawn from Alberto Cairo’s excellent book How Charts Lie–and broadened them into some general principles about deception and data. My hope is that you read it, internalize it, and take it with you to arm yourself against the onslaught of lies perpetuated by ill-intentioned people powered with data.

Humans Cannot Interpret Area

At least, not as well as we interpret other visual cues. Let’s illustrate this with an example. Say we have an extremely simple numerical data set; it’s one dimensional and consists of just two values: 50 and 100. One way to represent this visually is via the length of bars, as follows:

This is true to the underlying data. Length is a one-dimensional quantity, and we have doubled it in order to indicate a doubling of value. But what happens if we want to represent the same data with circles? Well, circles aren’t really defined by a length or width. One option is to double the radius:

Hmm. The first circle has a radius of 100 pixels, and the second has a radius of 50 pixels–so this is technically correct if we wanted to double the radius. However, because of the way that area is calculated (πr²), we’ve way more than doubled the area. So what if we tried just doing that, since it seems more visually accurate? Here is a revised version:

Now we have a different problem. The larger circle is mathematically twice the area of the smaller one, but it no longer looks that way. In other words, even though it is a visually accurate comparison of a doubled quantity, human eyes have difficulty perceiving it.

The issue here is trying to use area as a visual marker in the first place. It’s not necessarily wrong, but it is confusing. We’re increasing a one-dimensional value, but area is a two-dimensional quantity. To the human eye, it’s always going to be difficult to interpret accurately, especially when compared with a more natural visual representation like bars.

Now, this may seem like it’s not a huge deal–but let’s take a look at what happens when you extend this to an actual data set. Below, I’ve pasted two images of charts I made in Altair (a Python-based visualization package). Each chart shows the maximum temperature (in Celsius) during the first week of 2012 in Seattle, USA. The first one uses bar lengths to make the comparison, and the second uses circle areas.

Which one makes it easier to see the differences? The legend helps in the second one, but if we’re being honest, it’s a lost cause. It is much easier to make precise comparisons with the bars, even in a setting where we have such limited data.

Remember that the point of a visualization is to clarify data–to make hidden trends easier to see for the average person. To achieve this goal, it’s best to use visual cues that simplify the process of making that distinction.

Beware Political Headlines (In Any Direction)

There is a small trick question I sometimes ask my students on a homework assignment around the fourth week of class. The assignment mostly involves generating visualizations in Python–but for the last question, I give them a chart I myself generated accompanied by a single question:

Question: There is one thing egregiously wrong with the chart above, an unforgivable error in Data Visualization. What is it?

Most think it has something to do with the axes, marks, or some other visual aspect, often suggesting improvements like filling in the circles or making the axis labels more informative. Those are fine suggestions, but not the most pressing.

The most flawed trait (or lack thereof, rather) in the chart above is the missing title. A title is crucial to an effective data visualization. Without it, how are we supposed to know what this visualization is even about? As of now, we can only ascertain that it must vaguely have something to do with carbon dioxide levels across a span of years. That isn’t much.

Many folks, feeling this requirement is too stringent, argue that a visualization is often meant to be understood in context, as part of a larger article or press release or other accompanying piece of text. Unfortunately, this line of thinking is far too idealistic; in reality, a visualization must stand alone, because it will often be the only thing people look at–and in social media blow-up cases, the only thing that gets shared widely. As a result, it should have a title to explain itself.

Of course, the title of this very subsection tells you to be wary of such headlines. That is true. While they are necessary, they are a double-edged sword. Since visualization designers know viewers will pay attention to the title, ill-meaning ones can also use it to sway people in less-than-accurate directions. Let’s look at an example:

The above is a picture shared by the White House’s public Twitter account in 2017. The picture is also referenced by Alberto Cairo in his book, which emphasizes many of the points I will now make.

First things first. The word “chain migration,” referring to what is formally known as family-based migration (where an immigrant may sponsor family members to come to the United States), has been criticized by many who argue that it is needlessly aggressive and makes legal immigrants sound threatening for no reason.

Of course, politics is by its very nature divisive, and it is possible for any side to make a heated argument. The primary issue here is actually a data-related one–specifically, what the use of the word “chain” implies in the context of the chart shared with the tweet. “Chain” migration seems to indicate that people can immigrate one after the other, in a seemingly endless stream, uninhibited and unperturbed by the distance of family relations. The reality, of course, is that a single immigrant can mostly just sponsor immediate family members, and even that takes quite a bit of time. But when one reads the phrase “chain migration” and then immediately looks at a seemingly sensible chart depicting it, it is easy to believe that an individual can in fact spawn additional immigrants at a base-3 exponential growth rate.

That is the issue with any kind of political headline–it makes it far too easy to conceal dishonest, inaccurate workings with actual data processing, analysis, and visualization.

There is no data underlying the chart above. None. Zero. It is completely random, and that is not okay for a chart that is purposefully made to appear as if it is showing something meaningful and quantitative.

As a fun little rabbit hole to go down which highlights the dangers of political headlining within data, here is a link to FloorCharts, a Twitter account that posts the most absurd graphics shown on the U.S. Congress floor.

Don’t Use 3D. Please.

I’ll end this article on a slightly lighter topic–but still an important one. Under no circumstances–none at all–should you ever utilize a 3D chart. And if you’re in the shoes of the viewer–that is, if you’re looking at a 3D pie chart made by someone else–don’t trust it.

The reason for this is simple, and connects back to what I discussed with circles and rectangles: a third dimension severely distorts the actuality behind what are usually one-dimensional measures. Area was already hard to interpret–how well do you really think the human eye does with volume?

Here is a 3D pie chart I generated with random numbers:

Now, here is the exact same pie chart, but in two dimensions:

Notice how the blue is not quite as dominant as the 3D version seems to suggest, and that the red and orange are closer to one another in size than originally portrayed. I also removed the percentage labels intentionally (technically bad practice) in order to emphasize how even with the labels present in the first one, our eyes automatically pay more attention to the more drastic visual differences. If you’re reading this article with an analytical eye, perhaps you think it doesn’t make that much of a difference. But the fact is, you’ll often see such charts in the news or on social media, and a quick glance is all they’ll ever get.

It is important to ensure that the story told by that quick glance is a truthful one.

Final Thoughts

Data science is often touted as the perfect synthesis of Statistics, computing, and society, a way to obtain and share deep and meaningful insights about an information-heavy world. This is true–but as the capacity to widely share such insights expands, so must our general ability to interpret them accurately. It is my hope that in light of that, you have found this primer to be helpful.

Stay tuned for Part 2, in which I’ll talk about a few deceptive techniques a bit more involved in nature–including base proportions, (un)trustworthy statistical measures, and measures of correlation.

In the meantime, try not to get deceived.

The post The Dangers of Deceptive Data–Confusing Charts and Misleading Headlines appeared first on Towards Data Science.

]]>
When Optimal is the Enemy of Good: High-Budget Differential Privacy for Medical AI https://towardsdatascience.com/when-optimal-is-the-enemy-of-good-high-budget-differential-privacy-for-medical-ai/ Wed, 26 Feb 2025 01:25:29 +0000 https://towardsdatascience.com/?p=598434 Can we guarantee patient privacy without sacrificing model accuracy?

The post When Optimal is the Enemy of Good: High-Budget Differential Privacy for Medical AI appeared first on Towards Data Science.

]]>
Imagine you’re building your dream home. Just about everything is ready. All that’s left to do is pick out a front door. Since the neighborhood has a low crime rate, you decide you want a door with a standard lock — nothing too fancy, but probably enough to deter 99.9% of would-be burglars.

Unfortunately, the local homeowners’ association (HOA) has a rule stating that all front doors in the neighborhood must be bank vault doors. Their reasoning? Bank vault doors are the only doors that have been mathematically proven to be absolutely secure. As far as they’re concerned, any front door below that standard may as well not be there at all.

You’re left with three options, none of which seems particularly appealing:

  • Concede defeat and have a bank vault door installed. Not only is this expensive and cumbersome, but you’ll be left with a front door that bogs you down every single time you want to open or close it. At least burglars won’t be a problem!
  • Leave your house doorless. The HOA rule imposes requirements on any front door in the neighborhood, but it doesn’t technically forbid you from not installing a door at all. That would save you a lot of time and money. The downside, of course, is that it would allow anyone to come and go as they please. On top of that, the HOA could always close the loophole, taking you back to square one.
  • Opt out entirely. Faced with such a stark dilemma (all-in on either security or practicality), you choose not to play the game at all, selling your nearly-complete house and looking for someplace else to live.

This scenario is obviously completely unrealistic. In real life, everybody strives to strike an appropriate balance between security and practicality. This balance is informed by everyone’s own circumstances and risk analysis, but it universally lands somewhere between the two extremes of bank vault door and no door at all.

But what if instead of your dream home, you imagined a medical AI model that has the power to help doctors improve patient outcomes? Highly-sensitive training data points from patients are your valuables. The privacy protection measures you take are the front door you choose to install. Healthcare providers and the scientific community are the HOA. 

Suddenly, the scenario is much closer to reality. In this article, we’ll explore why that is. After understanding the problem, we’ll consider a simple but empirically effective solution proposed in the paper Reconciling privacy and accuracy in AI for medical imaging [1]. The authors propose a balanced alternative to the three bad choices laid out above, much like the real-life approach of a typical front door.


The State of Patient Privacy in Medical AI

Over the past few years, artificial intelligence has become an ever more ubiquitous part of our day-to-day lives, proving its utility across a wide range of domains. The rising use of AI models has, however, raised questions and concerns about protecting the privacy of the data used to train them. You may remember the well-known case of ChatGPT, just months after its initial release, exposing proprietary code from Samsung [2].

Some of the privacy risks associated with AI models are obvious. For example, if the training data used for a model isn’t stored securely enough, bad actors could find ways to access it directly. Others are more insidious, such as the risk of reconstruction. As the name implies, in a reconstruction attack, a bad actor attempts to reconstruct a model’s training data without needing to gain direct access to the dataset.

Medical records are one of the most sensitive kinds of personal information there are. Although specific regulation varies by jurisdiction, patient data is generally subject to stringent safeguards, with hefty fines for inadequate protection. Beyond the letter of the law, unintentionally exposing such data could irreparably damage our ability to use specialized AI to empower medical professionals. 

As Ziller, Mueller, Stieger, et al. point out [1], fully taking advantage of medical AI requires rich datasets comprising information from actual patients. This information must be obtained with the full consent of the patient. Ethically acquiring medical data for research was challenging enough as it was before the unique challenges posed by AI came into play. But if proprietary code being exposed caused Samsung to ban the use of ChatGPT [2], what would happen if attackers managed to reconstruct MRI scans and identify the patients they belonged to? Even isolated instances of negligent protection against data reconstruction could end up being a monumental setback for medical AI as a whole.

Tying this back into our front door metaphor, the HOA statute calling for bank vault doors starts to make a little bit more sense. When the cost of a single break-in could be so catastrophic for the entire neighborhood, it’s only natural to want to go to any lengths to prevent them. 

Differential Privacy (DP) as a Theoretical Bank Vault Door

Before we discuss what an appropriate balance between privacy and practicality might look like in the context of medical AI, we have to turn our attention to the inherent tradeoff between protecting an AI model’s training data and optimizing for quality of performance. This will set the stage for us to develop a basic understanding of Differential Privacy (DP), the theoretical gold standard of privacy protection.

Although academic interest in training data privacy has increased significantly over the past four years, principles on which much of the conversation is based were pointed out by researchers well before the recent LLM boom, and even before OpenAI was founded in 2015. Though it doesn’t deal with reconstruction per se, the 2013 paper Hacking smart machines with smarter ones [3] demonstrates a generalizable attack methodology capable of accurately inferring statistical properties of machine learning classifiers, noting:

“Although ML algorithms are known and publicly released, training sets may not be reasonably ascertainable and, indeed, may be guarded as trade secrets. While much research has been performed about the privacy of the elements of training sets, […] we focus our attention on ML classifiers and on the statistical information that can be unconsciously or maliciously revealed from them. We show that it is possible to infer unexpected but useful information from ML classifiers.” [3]

Theoretical data reconstruction attacks were described even earlier, in a context not directly pertaining to machine learning. The landmark 2003 paper Revealing information while preserving privacy [4] demonstrates a polynomial-time reconstruction algorithm for statistical databases. (Such databases are intended to provide answers to questions about their data in aggregate while keeping individual data points anonymous.) The authors show that to mitigate the risk of reconstruction, a certain amount of noise needs to be introduced into the data. Needless to say, perturbing the original data in this way, while necessary for privacy, has implications for the quality of the responses to queries, i.e., the accuracy of the statistical database.

In explaining the purpose of DP in the first chapter of their book The Algorithmic Foundations of Differential Privacy [5], Cynthia Dwork and Aaron Roth address this tradeoff between privacy and accuracy:

“[T]he Fundamental Law of Information Recovery states that overly accurate answers to too many questions will destroy privacy in a spectacular way. The goal of algorithmic research on differential privacy is to postpone this inevitability as long as possible. Differential privacy addresses the paradox of learning nothing about an individual while learning useful information about a population.” [5]

The notion of “learning nothing about an individual while learning useful information about a population” is captured by considering two datasets that differ by a single entry (one that includes the entry and one that doesn’t). An (ε, δ)-differentially private querying mechanism is one for which the probability of a certain output being returned when querying one dataset is at most a multiplicative factor of the probability when querying the other dataset. Denoting the mechanism by M, the set of possible outputs by S, and the datasets by x and y, we formalize this as [5]:

Pr[M(x) S] ≤ exp(ε) Pr[M(y) S] + δ

Where ε is the privacy loss parameter and δ is the failure probability parameter. ε quantifies how much privacy is lost as a result of a query, while a positive δ allows for privacy to fail altogether for a query at a certain (usually very low) probability. Note that ε is an exponential parameter, meaning that even slightly increasing it can cause privacy to decay significantly.

An important and useful property of DP is composition. Notice that the definition above only applies to cases where we run a single query. The composition property helps us generalize it to cover multiple queries based on the fact that privacy loss and failure probability accumulate predictably when we compose several queries, be they based on the same mechanism or different ones. This accumulation is easily proven to be (at most) linear [5]. What this means is that, rather than considering a privacy loss parameter for one query, we may view ε as a privacy budget that can be utilized across a number of queries. For example, when taken together, one query using a (1, 0)-DP mechanism and two queries using a (0.5, 0)-DP mechanism satisfy (2, 0)-DP.

The value of DP comes from the theoretical privacy guarantees it promises. Setting ε = 1 and δ = 0, for example, we find that the probability of any given output occurring when querying dataset y is at most exp(1) = e ≈ 2.718 times greater than that same output occurring when querying dataset x. Why does this matter? Because the greater the discrepancy between the probabilities of certain outputs occurring, the easier it is to determine the contribution of the individual entry by which the two datasets differ, and the easier it is to ultimately reconstruct that individual entry.

In practice, designing an (ε, δ)-differentially private randomized mechanism entails the addition of random noise drawn from a distribution dependent on ε and δ. The specifics are beyond the scope of this article. Shifting our focus back to machine learning, though, we find that the idea is the same: DP for ML hinges on introducing noise into the training data, which yields robust privacy guarantees in much the same way.

Of course, this is where the tradeoff we mentioned comes into play. Adding noise to the training data comes at the cost of making learning more difficult. We could absolutely add enough noise to achieve ε = 0.01 and δ = 0, making the difference in output probabilities between x and y virtually nonexistent. This would be wonderful for privacy, but terrible for learning. A model trained on such a noisy dataset would perform very poorly on most tasks.

There is no consensus on what constitutes a “good” ε value, or on universal methodologies or best practices for ε selection [6]. In many ways, ε embodies the privacy/accuracy tradeoff, and the “proper” value to aim for is highly context-dependent. ε = 1 is generally regarded as offering high privacy guarantees. Although privacy diminishes exponentially with respect to ε, values as high as ε = 32 are mentioned in literature and thought to provide moderately strong privacy guarantees [1]. 

The authors of Reconciling privacy and accuracy in AI for medical imaging [1] test the effects of DP on the accuracy of AI models on three real-world medical imaging datasets. They do so using various values of ε and comparing them to a non-private (non-DP) control. Table 1 provides a partial summary of their results for ε = 1 and ε = 8:

Table 1: Comparison of AI model performance across the RadImageNet [7], HAM10000 [8], and MSD Liver [9] datasets with δ = 8⁻⁷⋅10 and privacy budgets of ε = 1, ε = 8, and without DP (non-private). A higher MCC/Dice score indicates higher accuracy. Although providing strong theoretical privacy guarantees in the face of a worst-case adversary, DP significantly degrades model accuracy. The negative impact on performance is especially noticeable in the latter two datasets, which are considered small datasets. Image by the author, based on image by A. Ziller, T.T. Mueller, S. Stieger, et al from Table 3 in Reconciling privacy and accuracy in AI for medical imaging [1] (use under CC-BY 4.0 license).

Even approaching the higher end of the typical ε values attested in literature, DP is still as cumbersome as a bank vault door for medical imaging tasks. The noise introduced into the training data is catastrophic for AI model accuracy, especially when the datasets at hand are small. Note, for example, the huge drop-off in Dice score on the MSD Liver dataset, even with the relatively high ε value of 8.

Ziller, Mueller, Stieger, et al. suggest that the accuracy drawbacks of DP with typical ε values may contribute to the lack of widespread adoption of DP in the field of Medical Ai [1]. Yes, wanting mathematically-provable privacy guarantees is definitely sensible, but at what cost? Leaving so much of the diagnostic power of AI models on the table in the name of privacy is not an easy choice to make.

Revisiting our dream home scenario armed with an understanding of DP, we find that the options we (seem to) have map neatly onto the three we had for our front door.

  • DP with typical values of ε is like installing a bank vault door: costly, but effective for privacy. As we’ll see, it’s also complete overkill in this case.
  • Not using DP is like not installing a door at all: much easier, but risky. As mentioned above, though, DP has yet to be widely applied in medical AI [1].
  • Passing up opportunities to use AI is like giving up and selling the house: it saves us the headache of dealing with privacy concerns weighed against incentives to maximize accuracy, but a lot of potential is lost in the process.

It looks like we’re at an impasse… unless we think outside the box.

High-Budget DP: Privacy and Accuracy Aren’t an Either/Or

In Reconciling privacy and accuracy in AI for medical imaging [1], Ziller, Mueller, Stieger, et al. offer the medical AI equivalent of a regular front door — an approach that manages to protect privacy while giving up very little in the way of model performance. Granted, this protection is not theoretically optimal — far from it. However, as the authors show through a series of experiments, it is good enough to counter almost any realistic threat of reconstruction. 

As the saying goes, “Perfect is the enemy of good.” In this case, it is the “optimal” — an insistence on arbitrarily low ε values — that locks us into the false dichotomy of total privacy versus total accuracy. Just as a bank vault door has its place in the real world, so does DP with ε ≤ 32. Still, the existence of the bank vault door doesn’t mean plain old front doors don’t also have a place in the world. The same goes for high-budget DP.

The idea behind high-budget DP is straightforward: using privacy budgets (ε values) that are so high that they “are near-universally shunned as being meaningless” [1] — budgets ranging from ε = 10⁶ to as high as ε = 10¹⁵. In theory, these provide such weak privacy guarantees that it seems like common sense to dismiss them as no better than not using DP at all. In practice, though, this couldn’t be further from the truth. As we will see by looking at the results from the paper, high-budget DP shows significant promise in countering realistic threats. As Ziller, Mueller, Stieger, et al. put it [1]:

“[E]ven a ‘pinch of privacy’ has drastic effects in practical scenarios.”

First, though, we need to ask ourselves what we consider to be a “realistic” threat. Any discussion of the efficacy of high-budget DP is inextricably tied to the threat model under which we choose to evaluate it. In this context, a threat model is simply the set of assumptions we make about what a bad actor interested in obtaining our model’s training data is able to do.

Table 2: Comparison of threat models. For all three, we also assume that the adversary has unbounded computational ability. Image by A. Ziller, T.T. Mueller, S. Stieger, et al from Table 1 in Reconciling privacy and accuracy in AI for medical imaging [1] (use under CC-BY 4.0 license).

The paper’s findings hinge on a calibration of the assumptions to better suit real-world threats to patient privacy. The authors argue that the worst-case model, which is the one typically used for DP, is far too pessimistic. For example, it assumes that the adversary has full access to each original image while attempting to reconstruct it based on the AI model (see Table 2) [1]. This pessimism explains the discrepancy between the reported “drastic effects in practical scenarios” of high privacy budgets and the very weak theoretical privacy guarantees that they offer. We may liken it to incorrectly assessing the security threats a typical house faces, wrongly assuming they are likely to be as sophisticated and enduring as those faced by a bank. 

The authors therefore propose two alternative threat models, which they call the “relaxed” and “realistic” models. Under both of these, adversaries keep some core capabilities from the worst-case model: access to the AI model’s architecture and weights, the ability to manipulate its hyperparameters, and unbounded computational abilities (see Table 2). The realistic adversary is assumed to have no access to the original images and an imperfect reconstruction algorithm. Even these assumptions leave us with a rigorous threat model that may still be considered pessimistic for most real-world scenarios [1].

Having established the three relevant threat models to consider, Ziller, Mueller, Stieger, et al. compare AI model accuracy in conjunction with the reconstruction risk under each threat model at different values of ε. As we saw in Table 1, this is done for three exemplary Medical Imaging datasets. Their full results are presented in Table 3:

Table 3: Comparison of AI model performance and reconstruction risk per threat model across the RadImageNet [7], HAM10000 [8], and MSD Liver [9] datasets with δ = 8⁻⁷⋅10 and various privacy budgets, including some as high as ε = 10⁹ and ε = 10¹². A higher MCC/Dice score indicates higher accuracy. Image by A. Ziller, T.T. Mueller, S. Stieger, et al from Table 3 in Reconciling privacy and accuracy in AI for medical imaging [1] (use under CC-BY 4.0 license).

Unsurprisingly, high privacy budgets (exceeding ε = 10⁶) significantly mitigate the loss of accuracy seen with lower (stricter) privacy budgets. Across all tested datasets, models trained with high-budget DP at ε = 10⁹ (HAM10000, MSD Liver) or ε = 10¹² (RadImageNet) perform nearly as well as their non-privately trained counterparts. This is in line with our understanding of the privacy/accuracy tradeoff: the less noise introduced into the training data, the better a model can learn.

What is surprising is the degree of empirical protection afforded by high-budget DP against reconstruction under the realistic threat model. Remarkably, the realistic reconstruction risk is assessed to be 0% for each of the aforementioned models. The high efficacy of high-budget DP in defending medical AI training images against realistic reconstruction attacks is made even clearer by looking at the results of reconstruction attempts. Figure 1 below shows the five most readily reconstructed images from the MSD Liver dataset [9] using DP with high privacy budgets of ε = 10⁶, ε = 10⁹, ε = 10¹², and ε = 10¹⁵.

Note that, at least to the naked eye, even the best reconstructions obtained when using the former two budgets are visually indistinguishable from random noise. This lends intuitive credence to the argument that budgets often deemed too high to provide any meaningful protection could be instrumental in protecting privacy without giving up accuracy when using AI for medical imaging. In contrast, the reconstructions when using ε = 10¹⁵ closely resemble the original images, showing that not all high budgets are created equal.

Based on their findings, Ziller, Mueller, Stieger, et al. make the case for training medical imaging AI models using (at least) high-budget DP as the norm. They note the empirical efficacy of high-budget DP in countering realistic reconstruction risks at very little cost in terms of model accuracy. The authors go so far as to claim that “it seems negligent to train AI models without any form of formal privacy guarantee.” [1]


Conclusion

We started with a hypothetical scenario in which you were forced to decide between a bank vault door or no door at all for your dream home (or giving up and selling the incomplete house). After an exploration of the risks posed by inadequate privacy protection in medical AI, we looked into the privacy/accuracy tradeoff as well as the history and theory behind reconstruction attacks and differential privacy (DP). We then saw how DP with common privacy budgets (ε values) degrades medical AI model performance and compared it to the bank vault door in our hypothetical. 

Finally, we examined empirical results from the paper Reconciling privacy and accuracy in AI for medical imaging to find out how high-budget differential privacy can be used to escape the false dichotomy of bank vault door vs. no door and protect Patient Privacy in the real world without sacrificing model accuracy in the process.

If you enjoyed this article, please consider following me on LinkedIn to keep up with future articles and projects.

References

[1] Ziller, A., Mueller, T.T., Stieger, S. et al. Reconciling privacy and accuracy in AI for medical imaging. Nat Mach Intell 6, 764–774 (2024). https://doi.org/10.1038/s42256-024-00858-y.

[2] Ray, S. Samsung bans ChatGPT and other chatbots for employees after sensitive code leak. Forbes (2023). https://www.forbes.com/sites/siladityaray/2023/05/02/samsung-bans-chatgpt-and-other-chatbots-for-employees-after-sensitive-code-leak/.

[3] Ateniese, G., Mancini, L. V., Spognardi, A. et al. Hacking smart machines with smarter ones: how to extract meaningful data from machine learning classifiers. International Journal of Security and Networks 10, 137–150 (2015). https://doi.org/10.48550/arXiv.1306.4447.

[4] Dinur, I. & Nissim, K. Revealing information while preserving privacy. Proc. 22nd ACM SIGMOD-SIGACT-SIGART Symp Principles Database Syst 202–210 (2003). https://doi.org/10.1145/773153.773173.

[5] Dwork, C. & Roth, A. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science 9, 211–407 (2014). https://doi.org/10.1561/0400000042.

[6] Dwork, C., Kohli, N. & Mulligan, D. Differential privacy in practice: expose your epsilons! Journal of Privacy and Confidentiality 9 (2019). https://doi.org/10.29012/jpc.689.

[7] Mei, X., Liu, Z., Robson, P.M. et al. RadImageNet: an open radiologic deep learning research dataset for effective transfer learning. Radiol Artif Intell 4.5, e210315 (2022). https://doi.org/10.1148/ryai.210315.

[8] Tschandl, P., Rosendahl, C. & Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data 5, 180161 (2018). https://doi.org/10.1038/sdata.2018.161.

[9] Antonelli, M., Reinke, A., Bakas, S. et al. The Medical Segmentation Decathlon. Nat Commun 13, 4128 (2022). https://doi.org/10.1038/s41467-022-30695-9.

The post When Optimal is the Enemy of Good: High-Budget Differential Privacy for Medical AI appeared first on Towards Data Science.

]]>
6 Common LLM Customization Strategies Briefly Explained https://towardsdatascience.com/6-common-llm-customization-strategies-briefly-explained/ Mon, 24 Feb 2025 19:27:50 +0000 https://towardsdatascience.com/?p=598387 From Theory to practice: understanding RAG, agents, fine-tuning, and more

The post 6 Common LLM Customization Strategies Briefly Explained appeared first on Towards Data Science.

]]>
Why Customize LLMs?

Large Language Models (Llms) are deep learning models pre-trained based on self-supervised learning, requiring a vast amount of resources on training data, training time and holding a large number of parameters. LLM have revolutionized natural language processing especially in the last 2 years, demonstrating remarkable capabilities in understanding and generating human-like text. However, these general purpose models’ out-of-the-box performance may not always meet specific business needs or domain requirements. LLMs alone cannot answer questions that rely on proprietary company data or closed-book settings, making them relatively generic in their applications. Training a LLM model from scratch is largely infeasible to small to medium teams due to the demand of massive amounts of training data and resources. Therefore, a wide range of LLM customization strategies are developed in recent years to tune the models for various scenarios that require specialized knowledge.

The customization strategies can be broadly split into two types:

  • Using a frozen model: These techniques don’t necessitate updating model parameters and typically accomplished through in-context learning or prompt engineering. They are cost-effective since they alter the model’s behavior without incurring extensive training costs, therefore widely explored in both the industry and academic with new research papers published on a daily basis.
  • Updating model parameters: This is a relatively resource-intensive approach that requires tuning a pre-trained LLM using custom datasets designed for the intended purpose. This includes popular techniques like Fine-Tuning and Reinforcement Learning from Human Feedback (RLHF).

These two broad customization paradigms branch out into various specialized techniques including LoRA fine-tuning, Chain of Thought, Retrieval Augmented Generation, ReAct, and Agent frameworks. Each technique offers distinct advantages and trade-offs regarding computational resources, implementation complexity, and performance improvements.

This article is also available as a video here

How to Choose LLMs?

The first step of customizing LLMs is to select the appropriate foundation models as the baseline. Community based platform e.g. “Huggingface” offers a wide range of open-source pre-trained models contributed by top companies or communities, such as Llama series from Meta and Gemini from Google. Huggingface additionally provides leaderboards, for example “Open LLM Leaderboard” to compare LLMs based on industry-standard metrics and tasks (e.g. MMLU). Cloud providers (e.g., AWS) and AI companies (e.g., OpenAI and Anthropic) also offer access to proprietary models that are typically paid services with restricted access. Following factors are essentials to consider when choosing LLMs.

Open source or proprietary model: Open source models allow full customization and self-hosting but require technical expertise while proprietary models offer immediate access and often better quality responses but with higher costs.

Task and metrics: Models excel at different tasks including question-answering, summarization, code generation etc. Compare benchmark metrics and test on domain-specific tasks to determine the appropriate models.

Architecture: In general, decoder-only models (GPT series) perform better at text generation while encoder-decoder models (T5) handle translation well. There are more architecture emerging and showing promising results, for instance Mixture of Experts (MoE) model “DeepSeek”.

Number of Parameters and Size: Larger models (70B-175B parameters) offer better performance but need more computing power. Smaller models (7B-13B) run faster and cheaper but may have reduced capabilities.

After determining a base LLM, let’s explore 6 most common strategies for LLM customization, ranked in order of resource consumption from the least to the most intensive:

  • Prompt Engineering
  • Decoding and Sampling Strategy
  • Retrieval Augmented Generation
  • Agent
  • Fine Tuning
  • Reinforcement Learning from Human Feedback

If you’d prefer a video walkthrough of these concepts, please check out my video on “6 Common LLM Customization Strategies Briefly Explained”.

LLM Customization Techniques

1. Prompt Engineering

Prompt is the input text sent to an LLM to elicit an AI-generated response, and it can be composed of instructions, context, input data and output indicator.

Instructions: This provides a task description or instruction for how the model should perform.

Context: This is external information to guide the model to respond within a certain scope.

Input data: This is the input for which you want a response.

Output indicator: This specifies the output type or format.

Prompt Engineering involves crafting these prompt components strategically to shape and control the model’s response. Basic prompt engineering techniques include zero shot, one shot, and few shot prompting. User can implement basic prompt engineering techniques directly while interacting with the LLM, making it an efficient approach to align model’s behavior to on a novel objective. API implementation is also an option and more details are introduced in my previous article “A Simple Pipeline for Integrating LLM Prompt with Knowledge Graph”.

Due to the efficiency and effectiveness of prompt engineering, more complex approaches are explored and developed to advance the logical structure of prompts.

Chain of Thought (CoT) asks LLMs to break down complex reasoning tasks into step-by-step thought processes, improving performance on multi-step problems. Each step explicitly exposes its reasoning outcome which serves as the precursor context of its subsequent steps until arriving at the answer.

Tree of thoughts extends from CoT by considering multiple different reasoning branches and self-evaluating choices to decide the next best action. It is more effective for tasks that involve initial decisions, strategies for the future and exploration of multiple solutions.

Automatic reasoning and tool use (ART) builds upon the CoT process, it deconstructs complex tasks and allows the model to select few-shot examples from a task library using predefined external tools like search and code generation.

Synergizing reasoning and acting (ReAct) combines reasoning trajectories with an action space, where the model search through the action space and determine the next best action based on environmental observations.

Techniques like CoT and ReAct are often combined with an Agentic workflow to strengthen its capability. These techniques will be introduced in more detail in the “Agent” section.

Further Reading

2. Decoding and Sampling Strategy

Decoding strategy can be controlled at model inference time through inference parameters (e.g. temperature, top p, top k), determining the randomness and diversity of model responses. Greedy search, beam search and sampling are three common decoding strategies for auto-regressive model generation. ****

During the autoregressive generation process, LLM outputs one token at a time based on a probability distribution of candidate tokens conditioned by the pervious token. By default, greedy search is applied to produce the next token with the highest probability.

In contrast, beam search decoding considers multiple hypotheses of next-best tokens and selects the hypothesis with the highest combined probabilities across all tokens in the text sequence. The code snippet below uses transformers library to specify the the number of beam paths (e.g. num_beams=5 considers 5 distinct hypotheses) during the model generation process.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)
inputs = tokenizer(prompt, return_tensors="pt")

model = AutoModelForCausalLM.from_pretrained(model_name)
outputs = model.generate(**inputs, num_beams=5)

Sampling strategy is the third approach to control the randomness of model responses by adjusting these inference parameters:

  • Temperature: Lowering the temperature makes the probability distribution sharper by increasing the likelihood of generating high-probability words and decreasing the likelihood of generating low-probability words. When temperature = 0, it becomes equivalent to greedy search (least creative); when temperature = 1, it produces the most creative outputs.
  • Top K sampling: This method filters the K most probable next tokens and redistributes the probability among those tokens. The model then samples from this filtered set of tokens.
  • Top P sampling: Instead of sampling from the K most probable tokens, top-p sampling selects from the smallest possible set of tokens whose cumulative probability exceeds the threshold p.

The example code snippet below samples from the top 50 most likely tokens (top_k=50) with a cumulative probability higher than 0.95 (top_p=0.95)

sample_outputs = model.generate(
    **model_inputs,
    max_new_tokens=40,
    do_sample=True,
    top_k=50,
    top_p=0.95,
    num_return_sequences=3,
) 

Further Reading

3. RAG

Retrieval Augmented Generation (or RAG), initially introduced in the paper “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”, has been demonstrated as a promising solution that integrates external knowledge and reduces common LLM “hallucination” issues when handling domain specific or specialized queries. RAG allows dynamically pulling relevant information from knowledge domain and generally does not involve extensive training to update LLM parameters, making it a cost-effective strategy to adapt a general-purpose LLM for a specialized domain.

A RAG system can be decomposed into retrieval and generation stage. The objective of retrieval process is to find contents within the knowledge base that are closely related to the user query, by chunking external knowledge, creating embeddings, indexing and similarity search.

  1. Chunking: Documents are divided into smaller segments, with each segment containing a distinct unit of information.
  2. Create embeddings: An embedding model compresses each information chunk into a vector representation. The user query is also converted into its vector representation through the same vectorization process, so that the user query can be compared in the same dimensional space.
  3. Indexing: This process stores these text chunks and their vector embeddings as key-value pairs, enabling efficient and scalable search functionality. For large external knowledge bases that exceed memory capacity, vector databases offer efficient long-term storage.
  4. Similarity search: Similarity scores between the query embeddings and text chunk embeddings are calculated, which are used for searching information highly relevant to the user query.

The generation process of the RAG system then combines retrieved information with the user query to form the augmented query which is parsed to the LLM to generate the context rich response.

Code Snippet

The code snippet firstly specifies the LLM and embedding model, then perform the steps to chunk the external knowledge base documents into a collection of document. Create index from document, define the query_engine based on the index and query the query_engine with the user prompt.

from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import VectorStoreIndex

Settings.llm = OpenAI(model="gpt-3.5-turbo")
Settings.embed_model="BAAI/bge-small-en-v1.5"

document = Document(text="\\n\\n".join([doc.text for doc in documents]))
index = VectorStoreIndex.from_documents([document])                                    
query_engine = index.as_query_engine()
response = query_engine.query(
    "Tell me about LLM customization strategies."
)

The example above shows a simple RAG system. Advanced RAG improve based on this by introducing pre-retrieval and post-retrieval strategies to reduce pitfalls such as limited synergy between the retrieval and generation process. For example rerank technique reorders the retrieved information using a model capable of understanding bidirectional context, and integration with knowledge graph for advanced query routing. More use cases can be found on the llamaindex website.

Further Reading

4. Agent

LLM Agent was a trending topic in 2024 and will likely remain a main focus in the GenAI field in 2025. Compared to RAG, Agent excels at creating query routes and planning LLM-based workflows, with the following benefits:

  • Maintaining memory and state of previous model generated responses.
  • Leveraging various tools based on specific criteria. This tool-using capability sets agents apart from basic RAG systems by giving the LLM independent control over tool selection.
  • Breaking down a complex task into smaller steps and planning for a sequence of actions.
  • Collaborating with other agents to form a orchestrated system.

Several in-context learning techniques (e.g. CoT, ReAct ) can be implemented through the Agentic framework and we will discuss ReAct in more details. ReAct, stands for “Synergizing Reasoning and Acting in Language Models”, is composed of three key elements – actions, thoughts and observations. This framework was introduced by Google Research at Princeton University, built upon Chain of Thought by integrating the reasoning steps with an action space that enables tool uses and function calling. Additionally, ReAct framework emphasizes on determining the next best action based on the environmental observations.

This example from the original paper demonstrated ReAct’s inner working process, where the LLM generates the first thought and acts by calling the function to “Search [Apple Remote]”, then observes the feedback from its first output. The second thought is then based on the previous observation, hence leading to a different action “Search [Front Row]”. This process iterates until reaching the goal. The research shows that ReAct overcomes prevalent issues of hallucination and error propagation as more often observed in chain-of-thought reasoning by interacting with a simple Wikipedia API. Furthermore, through the implementation of decision traces, ReAct framework additionally increases the model’s interpretability, trustworthiness and diagnosability.

Example from “ReAct: Synergizing Reasoning and Acting in Language Models” (Yu et. al., 2022)

Code Snippet

This demonstrates an ReAct-based agent implementation using llamaindex. Firstly, it defines two functions (multiply and add). Secondly, these two functions are encapsulated as FunctionTool, forming the Agent’s action space and executed based on its reasoning.

from llama_index.core.agent import ReActAgent
from llama_index.core.tools import FunctionTool

# create basic function tools
def multiply(a: float, b: float) -> float:
    return a * b
multiply_tool = FunctionTool.from_defaults(fn=multiply)

def add(a: float, b: float) -> float:
    return a + b
add_tool = FunctionTool.from_defaults(fn=add)

agent = ReActAgent.from_tools([multiply_tool, add_tool], llm=llm, verbose=True)

The advantages of an Agentic Workflow are more substantial when combined with self-reflection or self-correction. It is an increasingly growing domain with a variety of Agent architecture being explored. For instance, Reflexion framework facilitate iterative learning by providing a summary of verbal feedback from environmental and storing the feedback in model’s memory; CRITIC framework empowers frozen LLMs to self-verify through interacting with external tools such as code interpreter and API calls.

Further Reading

5. Fine-Tuning

Fine-tuning is the process of feeding niche and specialized datasets to modify the LLM so that it is more aligned with a certain objective. It differs from prompt engineering and RAG as it enables updates to the LLM weights and parameters. Full fine-tuning refers to updating all weights of the pretrained LLM through backpropogation, which requires large memory to store all weights and parameters and may suffer from significant reduction in ability on other tasks (i.e. catastrophic forgetting). Therefore, PEFT (or parameter efficient fine tuning) is more widely used to mitigate these caveats while saving the time and cost of model training. There are three categories of PEFT methods:

  • Selective: Select a subset of initial LLM parameters to fine tune which can be more computationally intensive compared to other PEFT methods.
  • Reparameterization: Adjust model weights through training the weights of low rank representations. For example, Lower Rank Adaptation (LoRA) is among this category that accelerates fine-tuning by representing the weight updates with two smaller matrices.
  • Additive: Add additional trainable layers to model, including techniques like adapters and soft prompts

The fine-tuning process is similar to deep learning training process., requiring the following inputs:

  • training and evaluation datasets
  • training arguments define the hyperparameters e.g. learning rate, optimizer
  • pretrained LLM model
  • compute metrics and objective functions that algorithm should be optimized for

Code Snippet

Below is an example of implementing fine-tuning using the transformer Trainer.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
		output_dir=output_dir,
		learning_rate=1e-5,
		eval_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()

Fine-tuning has a wide range of use cases. For instance, instruction fine-tuning optimizes LLMs for conversations and following instructions by training them on prompt-completion pairs. Another example is domain adaptation, an unsupervised fine-tuning method that helps LLMs specialize in specific knowledge domains.

Further Reading

6. RLHF

Reinforcement learning from human feedback, or RLHF, is a reinforcement learning technique that fine tunes LLMs based on human preferences. RLHF operates by training a reward model based on human feedback and uses this model as a reward function to optimize a reinforcement learning policy through PPO (Proximal Policy Optimization). The process requires two sets of training data: a preference dataset for training reward model, and a prompt dataset used in the reinforcement learning loop.

Let’s break it down into steps:

  1. Gather preference dataset annotated by human labelers who rate different completions generated by the model based on human preference. An example format of the preference dataset is {input_text, candidate1, candidate2, human_preference}, indicating which candidate response is preferred.
  2. Train a reward model using the preference dataset, the reward model is essentially a regression model that outputs a scalar indicating the quality of the model generated response. The objective of the reward model is to maximize the score between the winning candidate and losing candidate.
  3. Use the reward model in a reinforcement learning loop to fine-tune the LLM. The objective is that the policy is updated so that LLM can generate responses that maximize the reward produced by the reward model. This process utilizes the prompt dataset which is a collection of prompts in the format of {prompt, response, rewards}.

Code Snippet

Open source library Trlx is widely applied in implementing RLHF and they provided a template code that shows the basic RLHF setup:

  1. Initialize the base model and tokenizer from a pretrained checkpoint
  2. Configure PPO hyperparameters PPOConfig like learning rate, epochs, and batch sizes
  3. Create the PPO trainer PPOTrainer by combining the model, tokenizer, and training data
  4. The training loop uses step() method to iteratively update the model to optimized the rewards calculated from the query and model response
# trl: Transformer Reinforcement Learning library
from trl import PPOTrainer, PPOConfig, AutoModelForSeq2SeqLMWithValueHead
from trl import create_reference_model
from trl.core import LengthSampler

# initiate the pretrained model and tokenizer
model = AutoModelForCausalLMWithValueHead.from_pretrained(config.model_name)
tokenizer = AutoTokenizer.from_pretrained(config.model_name)

# define the hyperparameters of PPO algorithm
config = PPOConfig(
    model_name=model_name,    
    learning_rate=learning_rate,
    ppo_epochs=max_ppo_epochs,
    mini_batch_size=mini_batch_size,
    batch_size=batch_size
)

# initiate the PPO trainer with reference to the model
ppo_trainer = PPOTrainer(
	config=config, 
	model=ppo_model, 
  tokenizer=tokenizer, 
  dataset=dataset["train"],
  data_collator=collator
)                      
                        
# ppo_trainer is iteratively updated through the rewards
ppo_trainer.step(query_tensors, response_tensors, rewards)

RLHF is widely applied for aligning model responses with human preference. Common use cases involve reducing response toxicity and model hallucination. However, it does have the downside of requiring a large amount of human annotated data as well as computation costs associated with policy optimization. Therefore, alternatives like Reinforcement Learning from AI feedback and Direct Preference Optimization (DPO) are introduced to mitigate these limitations.

Further Reading

Take-Home Message

This article briefly explains six essential LLM customization strategies including prompt engineering, decoding strategy, RAG, Agent, fine-tuning, and RLHF. Hope you find it helpful in terms of understanding the pros/cons of each strategy as well as how to implement them based on the practical examples.

The post 6 Common LLM Customization Strategies Briefly Explained appeared first on Towards Data Science.

]]>
Do European M&Ms Actually Taste Better than American M&Ms? https://towardsdatascience.com/do-european-mms-actually-taste-better-than-american-mms/ Fri, 21 Feb 2025 21:52:58 +0000 https://towardsdatascience.com/?p=598313 An overly-enthusiastic application of science and data visualization to a question we’ve all been asking

The post Do European M&Ms Actually Taste Better than American M&Ms? appeared first on Towards Data Science.

]]>
(Oh, I am the only one who’s been asking this question…? Hm. Well, if you have a minute, please enjoy this exploratory Data Analysis — featuring experimental design, statistics, and interactive visualization — applied a bit too earnestly to resolve an international debate.)

1. Introduction

1.1 Background and motivation

Chocolate is enjoyed around the world. From ancient practices harvesting organic cacao in the Amazon basin, to chocolatiers sculpting edible art in the mountains of Switzerland, and enormous factories in Hershey, Pennsylvania churning out 70 million kisses per day, the nuanced forms and flavors of chocolate have been integrated into many cultures and their customs. While quality can greatly vary across chocolate products, a well-known, shelf-stable, easily shareable form of chocolate are M&Ms. Readily found by convenience store check-out counters and in hotel vending machines, the brightly colored pellets are a popular treat whose packaging is re-branded to fit nearly any commercializable American holiday.

While living in Denmark in 2022, I heard a concerning claim: M&Ms manufactured in Europe taste different, and arguably “better,” than M&Ms produced in the United States. While I recognized that fancy European chocolate is indeed quite tasty and often superior to American chocolate, it was unclear to me if the same claim should hold for M&Ms. I learned that many Europeans perceive an “unpleasant” or “tangy” taste in American chocolate, which is largely attributed to butyric acid, a compound resulting from differences in how milk is treated before incorporation into milk chocolate.

But honestly, how much of a difference could this make for M&Ms? M&Ms!? I imagined M&Ms would retain a relatively processed/mass-produced/cheap candy flavor wherever they were manufactured. As the lone American visiting a diverse lab of international scientists pursuing cutting-edge research in biosustainability, I was inspired to break out my data science toolbox and investigate this M&M flavor phenomenon.

1.2 Previous work

To quote a European woman, who shall remain anonymous, after she tasted an American M&M while traveling in New York:

“They taste so gross. Like vomit. I don’t understand how people can eat this. I threw the rest of the bag away.”

Vomit? Really? In my experience, children raised in the United States had no qualms about eating M&Ms. Growing up, I was accustomed to bowls of M&Ms strategically placed in high traffic areas around my house to provide readily available sugar. Clearly American M&Ms are edible. But are they significantly different and/or inferior to their European equivalent?

In response to the anonymous European woman’s scathing report, myself and two other Americans visiting Denmark sampled M&Ms purchased locally in the Lyngby Storcenter Føtex. We hoped to experience the incredible improvement in M&M flavor that was apparently hidden from us throughout our youths. But curiously, we detected no obvious flavor improvements.

Unfortunately, neither preliminary study was able to conduct a side-by-side taste test with proper controls and randomized M&M sampling. Thus, we turn to science.

1.3 Study Goals

This study seeks to remedy the previous lack of thoroughness and investigate the following questions:

  1. Is there a global consensus that European M&Ms are in fact better than American M&Ms?
  2. Can Europeans actually detect a difference between M&Ms purchased in the US vs in Europe when they don’t know which one they are eating? Or is this a grand, coordinated lie amongst Europeans to make Americans feel embarrassed?
  3. Are Americans actually taste-blind to American vs European M&Ms? Or can they taste a difference but simply don’t describe this difference as “an improvement” in flavor?
  4. Can these alleged taste differences be perceived by citizens of other continents? If so, do they find one flavor obviously superior?

2. Methods

2.1 Experimental design and data collection

Participants were recruited by luring — er, inviting them to a social gathering (with the promise of free food) that was conveniently co-located with the testing site. Once a participant agreed to pause socializing and join the study, they were positioned at a testing station with a trained experimenter who guided them through the following steps:

  • Participants sat at a table and received two cups: 1 empty and 1 full of water. With one cup in each hand, the participant was asked to close their eyes, and keep them closed through the remainder of the experiment.
  • The experimenter randomly extracted one M&M with a spoon, delivered it to the participant’s empty cup, and the participant was asked to eat the M&M (eyes still closed).
  • After eating each M&M, the experimenter collected the taste response by asking the participant to report if they thought the M&M tasted: Especially Good, Especially Bad, or Normal.
  • Each participant received a total of 10 M&Ms (5 European, 5 American), one at a time, in a random sequence determined by random.org.
  • Between eating each M&M, the participant was asked to take a sip of water to help “cleanse their palate.”
  • Data collected: for each participant, the experimenter recorded the participant’s continent of origin (if this was ambiguous, the participant was asked to list the continent on which they have the strongest memories of eating candy as a child). For each of the 10 M&Ms delivered, the experimenter recorded the M&M origin (“Denmark” or “USA”), the M&M color, and the participant’s taste response. Experimenters were also encouraged to jot down any amusing phrases uttered by the participant during the test, recorded under notes (data available here).

2.2 Sourcing materials and recruiting participants

Two bags of M&Ms were purchased for this study. The American-sourced M&Ms (“USA M&M”) were acquired at the SFO airport and delivered by the author’s parents, who visited her in Denmark. The European-sourced M&Ms (“Denmark M&M”) were purchased at a local Føtex grocery store in Lyngby, a little north of Copenhagen.

Experiments were conducted at two main time points. The first 14 participants were tested in Lyngby, Denmark in August 2022. They mostly consisted of friends and housemates the author met at the Novo Nordisk Foundation Center for Biosustainability at the Technical University of Denmark (DTU) who came to a “going away party” into which the experimental procedure was inserted. A few additional friends and family who visited Denmark were also tested during their travels (e.g. on the train).

The remaining 37 participants were tested in Seattle, WA, USA in October 2022, primarily during a “TGIF happy hour” hosted by graduate students in the computer science PhD program at the University of Washington. This second batch mostly consisted of students and staff of the Paul. G. Allen School of Computer Science & Engineering (UW CSE) who responded to the weekly Friday summoning to the Allen Center atrium for free snacks and drinks.

Figure 1. Distribution of participants recruited to the study. In the first sampling event in Lyngby, participants primarily hailed from North America and Europe, and a few additionally came from Asia, South America, or Australia. Our second sampling event in Seattle greatly increased participants, primarily from North America and Asia, and a few more from Europe. Neither event recruited participants from Africa. Figure made with Altair.

While this study set out to analyze global trends, unfortunately data was only collected from 51 participants the author was able to lure to the study sites and is not well-balanced nor representative of the 6 inhabited continents of Earth (Figure 1). We hope to improve our recruitment tactics in future work. For now, our analytical power with this dataset is limited to response trends for individuals from North America, Europe, and Asia, highly biased by subcommunities the author happened to engage with in late 2022.

2.3 Risks

While we did not acquire formal approval for experimentation with human test subjects, there were minor risks associated with this experiment: participants were warned that they may be subjected to increased levels of sugar and possible “unpleasant flavors” as a result of participating in this study. No other risks were anticipated.

After the experiment however, we unfortunately observed several cases of deflated pride when a participant learned their taste response was skewed more positively towards the M&M type they were not expecting. This pride deflation seemed most severe among European participants who learned their own or their fiancé’s preference skewed towards USA M&Ms, though this was not quantitatively measured and cannot be confirmed beyond anecdotal evidence.

3. Results & Discussion

3.1 Overall response to “USA M&Ms” vs “Denmark M&Ms”

3.1.1 Categorical response analysis — entire dataset

In our first analysis, we count the total number of “Bad”, “Normal”, and “Good” taste responses and report the percentage of each response received by each M&M type. M&Ms from Denmark more frequently received “Good” responses than USA M&Ms but also more frequently received “Bad” responses. M&Ms from the USA were most frequently reported to taste “Normal” (Figure 2). This may result from the elevated number of participants hailing from North America, where the USA M&M is the default and thus more “Normal,” while the Denmark M&M was more often perceived as better or worse than the baseline.

Figure 2. Qualitative taste response distribution across the whole dataset. The percentage of taste responses for “Bad”, “Normal” or “Good” was calculated for each type of M&M. Figure made with Altair.

Now let’s break out some Statistics, such as a chi-squared (X2) test to compare our observed distributions of categorical taste responses. Using the scipy.stats chi2_contingency function, we built contingency tables of the observed counts of “Good,” “Normal,” and “Bad” responses to each M&M type. Using the X2 test to evaluate the null hypothesis that there is no difference between the two M&Ms, we found the p-value for the test statistic to be 0.0185, which is significant at the common p-value cut off of 0.05, but not at 0.01. So a solid “maybe,” depending on whether you’d like this result to be significant or not.

3.1.2 Quantitative response analysis — entire dataset.

The X2 test helps evaluate if there is a difference in categorical responses, but next, we want to determine a relative taste ranking between the two M&M types. To do this, we converted taste responses to a quantitative distribution and calculated a taste score. Briefly, “Bad” = 1, “Normal” = 2, “Good” = 3. For each participant, we averaged the taste scores across the 5 M&Ms they tasted of each type, maintaining separate taste scores for each M&M type.

Figure 3. Quantitative taste score distributions across the whole dataset. Kernel density estimation of the average taste score calculated for each participant for each M&M type. Figure made with Seaborn.

With the average taste score for each M&M type in hand, we turn to scipy.stats ttest_ind (“T-test”) to evaluate if the means of the USA and Denmark M&M taste scores are different (the null hypothesis being that the means are identical). If the means are significantly different, it would provide evidence that one M&M is perceived as significantly tastier than the other.

We found the average taste scores for USA M&Ms and Denmark M&Ms to be quite close (Figure 3), and not significantly different (T-test: = 0.721). Thus, across all participants, we do not observe a difference between the perceived taste of the two M&M types (or if you enjoy parsing triple negatives: “we cannot reject the null hypothesis that there is not a difference”).

But does this change if we separate participants by continent of origin?

3.2 Continent-specific responses to “USA M&Ms” vs “Denmark M&Ms”

We repeated the above X2 and T-test analyses after grouping participants by their continents of origin. The Australia and South America groups were combined as a minimal attempt to preserve data privacy. Due to the relatively small sample size of even the combined Australia/South America group (n=3), we will refrain from analyzing trends for this group but include the data in several figures for completeness and enjoyment of the participants who may eventually read this.

3.2.1 Categorical response analysis — by continent

In Figure 4, we display both the taste response counts (upper panel, note the interactive legend) and the response percentages (lower panel) for each continent group. Both North America and Asia follow a similar trend to the whole population dataset: participants report Denmark M&Ms as “Good” more frequently than USA M&Ms, but also report Denmark M&Ms as “Bad” more frequently. USA M&Ms were most frequently reported as “Normal” (Figure 4).

On the contrary, European participants report USA M&Ms as “Bad” nearly 50% of the time and “Good” only 18% of the time, which is the most negative and least positive response pattern, respectively (when excluding the under-sampled Australia/South America group).

Figure 4. Qualitative taste response distribution by continent. Upper panel: counts of taste responses — click the legend to interactively filter! Lower panel: percentage of taste responses for each type of M&M. Figure made with Altair.

This appeared striking in bar chart form, however only North America had a significant X2 p-value (p = 0.0058) when evaluating each continent’s difference in taste response profile between the two M&M types. The European p-value is perhaps “approaching significance” in some circles, but we’re about to accumulate several more hypothesis tests and should be mindful of multiple hypothesis testing (Table 1). A false positive result here would be devastating.

When comparing the taste response profiles between two continents for the same M&M type, there are a couple interesting notes. First, we observed no major taste discrepancies between all pairs of continents when evaluating Denmark M&Ms — the world seems generally consistent in their range of feelings about M&Ms sourced from Europe (right column X2 p-values, Table 2). To visualize this comparison more easily, we reorganize the bars in Figure 4 to group them by M&M type (Figure 5).

Figure 5. Qualitative taste response distribution by M&M type, reported as percentages. (Same data as Figure 4 but re-arranged). Figure made with Altair.

However, when comparing continents to each other in response to USA M&Ms, we see larger discrepancies. We found one pairing to be significantly different: European and North American participants evaluated USA M&Ms very differently (p = 0.000007) (Table 2). It seems very unlikely that this observed difference is by random chance (left column, Table 2).

3.2.2 Quantitative response analysis — by continent

We again convert the categorical profiles to quantitative distributions to assess continents’ relative preference of M&M types. For North America, we see that the taste score means of the two M&M types are actually quite similar, but there is a higher density around “Normal” scores for USA M&Ms (Figure 6A). The European distributions maintain a bit more of a separation in their means (though not quite significantly so), with USA M&Ms scoring lower (Figure 6B). The taste score distributions of Asian participants is most similar (Figure 6C).

Reorienting to compare the quantitative means between continents’ taste scores for the same M&M type, only the comparison between North American and European participants on USA M&Ms is significantly different based on a T-test (p = 0.001) (Figure 6D), though now we really are in danger of multiple hypothesis testing! Be cautious if you are taking this analysis at all seriously.

Figure 6. Quantitative taste score distributions by continent. Kernel density estimation of the average taste score calculated for each each continent for each M&M type. A. Comparison of North America responses to each M&M. B. Comparison of Europe responses to each M&M. C. Comparison of Asia responses to each M&M. D. Comparison of continents for USA M&Ms. E. Comparison of continents for Denmark M&Ms. Figure made with Seaborn.

At this point, I feel myself considering that maybe Europeans are not just making this up. I’m not saying it’s as dramatic as some of them claim, but perhaps a difference does indeed exist… To some degree, North American participants also perceive a difference, but the evaluation of Europe-sourced M&Ms is not consistently positive or negative.

3.3 M&M taste alignment chart

In our analyses thus far, we did not account for the baseline differences in M&M appreciation between participants. For example, say Person 1 scored all Denmark M&Ms as “Good” and all USA M&Ms as “Normal”, while Person 2 scored all Denmark M&Ms as “Normal” and all USA M&Ms as “Bad.” They would have the same relative preference for Denmark M&Ms over USA M&Ms, but Person 2 perhaps just does not enjoy M&Ms as much as Person 1, and the relative preference signal is muddled by averaging the raw scores.

Inspired by the Lawful/Chaotic x Good/Evil alignment chart used in tabletop role playing games like Dungeons & Dragons©™, in Figure 7, we establish an M&M alignment chart to help determine the distribution of participants across M&M enjoyment classes.

Figure 7. M&M enjoyment alignment chart. The x-axis represents a participant’s average taste score for USA M&Ms; the y-axis is a participant’s average taste score for Denmark M&Ms. Figure made with Altair.

Notably, the upper right quadrant where both M&M types are perceived as “Good” to “Normal” is mostly occupied by North American participants and a few Asian participants. All European participants land in the left half of the figure where USA M&Ms are “Normal” to “Bad”, but Europeans are somewhat split between the upper and lower halves, where perceptions of Denmark M&Ms range from “Good” to “Bad.”

An interactive version of Figure 7 is provided below for the reader to explore the counts of various M&M alignment regions.

Figure 7 (interactive): click and brush your mouse over the scatter plot to see the counts of continents in different M&M enjoyment regions. Figure made with Altair.

3.4 Participant taste response ratio

Next, to factor out baseline M&M enjoyment and focus on participants’ relative preference between the two M&M types, we took the log ratio of each person’s USA M&M taste score average divided by their Denmark M&M taste score average.

Equation 1: Equation to calculate each participant’s overall M&M preference ratio.

As such, positive scores indicate a preference towards USA M&Ms while negative scores indicate a preference towards Denmark M&Ms.

On average, European participants had the strongest preference towards Denmark M&Ms, with Asians also exhibiting a slight preference towards Denmark M&Ms (Figure 8). To the two Europeans who exhibited deflated pride upon learning their slight preference towards USA M&Ms, fear not: you did not think USA M&Ms were “Good,” but simply ranked them as less bad than Denmark M&Ms (see participant_id 4 and 17 in the interactive version of Figure 7). If you assert that M&Ms are a bad American invention not worth replicating and return to consuming artisanal European chocolate, your honor can likely be restored.

Figure 8. Distribution of participant M&M preference ratios by continent. Preference ratios are calculated as in Equation 1. Positive numbers indicate a relative preference for USA M&Ms, while negative indicate a relative preference for Denmark M&Ms. Figure made with Seaborn.

North American participants are pretty split in their preference ratios: some fall quite neutrally around 0, others strongly prefer the familiar USA M&M, while a handful moderately prefer Denmark M&Ms. Anecdotally, North Americans who learned their preference skewed towards European M&Ms displayed signals of inflated pride, as if their results signaled posh refinement.

Overall, a T-test comparing the distributions of M&M preference ratios shows a possibly significant difference in the means between European and North American participants (p = 0.049), but come on, this is like the 20th p-value I’ve reported — this one is probably too close to call.

3.5 Taste inconsistency and “Perfect Classifiers”

For each participant, we assessed their taste score consistency by averaging the standard deviations of their responses to each M&M type, and plotting that against their preference ratio (Figure 9).

Figure 9. Participant taste consistency by preference ratio. The x-axis is a participant’s relative M&M preference ratio. The y-axis is the average of the standard deviation of their USA M&M scores and the standard deviation of their Denmark M&M scores. A value of 0 on the y-axis indicates perfect consistency in responses, while higher values indicate more inconsistent responses. Figure made with Altair.

Most participants were somewhat inconsistent in their ratings, ranking the same M&M type differently across the 5 samples. This would be expected if the taste difference between European-sourced and American-sourced M&Ms is not actually all that perceptible. Most inconsistent were participants who gave the same M&M type “Good”, “Normal”, and “Bad” responses (e.g., points high on the y-axis, with wider standard deviations of taste scores), indicating lower taste perception abilities.

Intriguingly, four participants — one from each continent group — were perfectly consistent: they reported the same taste response for each of the 5 M&Ms from each M&M type, resulting in an average standard deviation of 0.0 (bottom of Figure 9). Excluding the one of the four who simply rated all 10 M&Ms as “Normal”, the other three appeared to be “Perfect Classifiers” — either rating all M&Ms of one type “Good” and the other “Normal”, or rating all M&Ms of one type “Normal” and the other “Bad.” Perhaps these folks are “super tasters.”

3.6 M&M color

Another possible explanation for the inconsistency in individual taste responses is that there exists a perceptible taste difference based on the M&M color. Visually, the USA M&Ms were noticeably more smooth and vibrant than the Denmark M&Ms, which were somewhat more “splotchy” in appearance (Figure 10A). M&M color was recorded during the experiment, and although balanced sampling was not formally built into the experimental design, colors seemed to be sampled roughly evenly, with the exception of Blue USA M&Ms, which were oversampled (Figure 10B).

Figure 10. M&M colors. A. Photo of each M&M color of each type. It’s perhaps a bit hard to perceive on screen in my unprofessionally lit photo, but with the naked eye, USA M&Ms seemed to be brighter and more uniformly colored while Denmark M&Ms have a duller and more mottled color. Is it just me, or can you already hear the Europeans saying “They are brighter because of all those extra chemicals you put in your food that we ban here!” B. Distribution of M&Ms of each color sampled over the course of the experiment. The Blue USA M&Ms were not intentionally oversampled — they must be especially bright/tempting to experimenters. Figure made with Altair.

We briefly visualized possible differences in taste responses based on color (Figure 11), however we do not believe there are enough data to support firm conclusions. After all, on average each participant would likely only taste 5 of the 6 M&M colors once, and 1 color not at all. We leave further M&M color investigations to future work.

Figure 11. Taste response profiles for M&Ms of each color and type. Profiles are reported as percentages of “Bad”, “Normal”, and “Good” responses, though not all M&Ms were sampled exactly evenly. Figure made with Altair.

3.7 Colorful commentary

We assured each participant that there was no “right “answer” in this experiment and that all feelings are valid. While some participants took this to heart and occasionally spent over a minute deeply savoring each M&M and evaluating it as if they were a sommelier, many participants seemed to view the experiment as a competition (which occasionally led to deflated or inflated pride). Experimenters wrote down quotes and notes in conjunction with M&M responses, some of which were a bit “colorful.” We provide a hastily rendered word cloud for each M&M type for entertainment purposes (Figure 12) though we caution against reading too far into them without diligent sentiment analysis.

Figure 11. A simple word cloud generated from the notes column of each M&M type. Fair warning — these have not been properly analyzed for sentiment and some inappropriate language was recorded. Figure made with WordCloud.

4. Conclusion

Overall, there does not appear to be a “global consensus” that European M&Ms are better than American M&Ms. However, European participants tended to more strongly express negative reactions to USA M&Ms while North American participants seemed relatively split on whether they preferred M&Ms sourced from the USA vs from Europe. The preference trends of Asian participants often fell somewhere between the North Americans and Europeans.

Therefore, I’ll admit that it’s probable that Europeans are not engaged in a grand coordinated lie about M&Ms. The skew of most European participants towards Denmark M&Ms is compelling, especially since I was the experimenter who personally collected much of the taste response data. If they found a way to cheat, it was done well enough to exceed my own passive perception such that I didn’t notice. However, based on this study, it would appear that a strongly negative “vomit flavor” is not universally perceived and does not become apparent to non-Europeans when tasting both M&Ms types side by side.

We hope this study has been illuminating! We would look forward to extensions of this work with improved participant sampling, additional M&M types sourced from other continents, and deeper investigations into possible taste differences due to color.

Thank you to everyone who participated and ate M&Ms in the name of science!

Figures and analysis can be found on github: https://github.com/erinhwilson/mnm-taste-test

Article by Erin H. Wilson, Ph.D.[1,2,3] who decided the time between defending her dissertation and starting her next job would be best spent on this highly valuable analysis. Hopefully it is clear that this article is intended to be comedic— I do not actually harbor any negative feelings towards Europeans who don’t like American M&Ms, but enjoyed the chance to be sassy and poke fun at our lively debates with overly-enthusiastic data analysis.

Shout out to Matt, Galen, Ameya, and Gian-Marco for assisting in data collection!

[1] Former Ph.D. student in the Paul G. Allen School of Computer Science and Engineering at the University of Washington

[2] Former visiting Ph.D. student at the Novo Nordisk Foundation Center for Biosustainability at the Technical University of Denmark

[3] Future data scientist at LanzaTech

The post Do European M&Ms Actually Taste Better than American M&Ms? appeared first on Towards Data Science.

]]>