Tula Masterman, Author at Towards Data Science https://towardsdatascience.com/author/tula-masterman/ The world’s leading publication for data science, AI, and ML professionals. Wed, 05 Mar 2025 19:50:13 +0000 en-US hourly 1 https://wordpress.org/?v=6.7.1 https://towardsdatascience.com/wp-content/uploads/2025/02/cropped-Favicon-32x32.png Tula Masterman, Author at Towards Data Science https://towardsdatascience.com/author/tula-masterman/ 32 32 Overcome Failing Document Ingestion & RAG Strategies with Agentic Knowledge Distillation https://towardsdatascience.com/overcome-failing-document-ingestion-rag-strategies-with-agentic-knowledge-distillation/ Wed, 05 Mar 2025 19:50:12 +0000 https://towardsdatascience.com/?p=598745 Introducing the pyramid search approach

The post Overcome Failing Document Ingestion & RAG Strategies with Agentic Knowledge Distillation appeared first on Towards Data Science.

]]>
Introduction

Many generative AI use cases still revolve around Retrieval Augmented Generation (RAG), yet consistently fall short of user expectations. Despite the growing body of research on RAG improvements and even adding Agents into the process, many solutions still fail to return exhaustive results, miss information that is critical but infrequently mentioned in the documents, require multiple search iterations, and generally struggle to reconcile key themes across multiple documents. To top it all off, many implementations still rely on cramming as much “relevant” information as possible into the model’s context window alongside detailed system and user prompts. Reconciling all this information often exceeds the model’s cognitive capacity and compromises response quality and consistency.

This is where our Agentic Knowledge Distillation + Pyramid Search Approach comes into play. Instead of chasing the best chunking strategy, retrieval algorithm, or inference-time reasoning method, my team, Jim Brown, Mason Sawtell, Sandi Besen, and I, take an agentic approach to document ingestion.

We leverage the full capability of the model at ingestion time to focus exclusively on distilling and preserving the most meaningful information from the document dataset. This fundamentally simplifies the RAG process by allowing the model to direct its reasoning abilities toward addressing the user/system instructions rather than struggling to understand formatting and disparate information across document chunks. 

We specifically target high-value questions that are often difficult to evaluate because they have multiple correct answers or solution paths. These cases are where traditional RAG solutions struggle most and existing RAG evaluation datasets are largely insufficient for testing this problem space. For our research implementation, we downloaded annual and quarterly reports from the last year for the 30 companies in the DOW Jones Industrial Average. These documents can be found through the SEC EDGAR website. The information on EDGAR is accessible and able to be downloaded for free or can be queried through EDGAR public searches. See the SEC privacy policy for additional details, information on the SEC website is “considered public information and may be copied or further distributed by users of the web site without the SEC’s permission”. We selected this dataset for two key reasons: first, it falls outside the knowledge cutoff for the models evaluated, ensuring that the models cannot respond to questions based on their knowledge from pre-training; second, it’s a close approximation for real-world business problems while allowing us to discuss and share our findings using publicly available data. 

While typical RAG solutions excel at factual retrieval where the answer is easily identified in the document dataset (e.g., “When did Apple’s annual shareholder’s meeting occur?”), they struggle with nuanced questions that require a deeper understanding of concepts across documents (e.g., “Which of the DOW companies has the most promising AI strategy?”). Our Agentic Knowledge Distillation + Pyramid Search Approach addresses these types of questions with much greater success compared to other standard approaches we tested and overcomes limitations associated with using knowledge graphs in RAG systems. 

In this article, we’ll cover how our knowledge distillation process works, key benefits of this approach, examples, and an open discussion on the best way to evaluate these types of systems where, in many cases, there is no singular “right” answer.

Building the pyramid: How Agentic Knowledge Distillation works

AI-generated image showing a pyramid structure for document ingestion with labelled sections.
Image by author and team depicting pyramid structure for document ingestion. Robots meant to represent agents building the pyramid.

Overview

Our knowledge distillation process creates a multi-tiered pyramid of information from the raw source documents. Our approach is inspired by the pyramids used in deep learning computer vision-based tasks, which allow a model to analyze an image at multiple scales. We take the contents of the raw document, convert it to markdown, and distill the content into a list of atomic insights, related concepts, document abstracts, and general recollections/memories. During retrieval it’s possible to access any or all levels of the pyramid to respond to the user request. 

How to distill documents and build the pyramid: 

  1. Convert documents to Markdown: Convert all raw source documents to Markdown. We’ve found models process markdown best for this task compared to other formats like JSON and it is more token efficient. We used Azure Document Intelligence to generate the markdown for each page of the document, but there are many other open-source libraries like MarkItDown which do the same thing. Our dataset included 331 documents and 16,601 pages. 
  2. Extract atomic insights from each page: We process documents using a two-page sliding window, which allows each page to be analyzed twice. This gives the agent the opportunity to correct any potential mistakes when processing the page initially. We instruct the model to create a numbered list of insights that grows as it processes the pages in the document. The agent can overwrite insights from the previous page if they were incorrect since it sees each page twice. We instruct the model to extract insights in simple sentences following the subject-verb-object (SVO) format and to write sentences as if English is the second language of the user. This significantly improves performance by encouraging clarity and precision. Rolling over each page multiple times and using the SVO format also solves the disambiguation problem, which is a huge challenge for knowledge graphs. The insight generation step is also particularly helpful for extracting information from tables since the model captures the facts from the table in clear, succinct sentences. Our dataset produced 216,931 total insights, about 13 insights per page and 655 insights per document.
  3. Distilling concepts from insights: From the detailed list of insights, we identify higher-level concepts that connect related information about the document. This step significantly reduces noise and redundant information in the document while preserving essential information and themes. Our dataset produced 14,824 total concepts, about 1 concept per page and 45 concepts per document. 
  4. Creating abstracts from concepts: Given the insights and concepts in the document, the LLM writes an abstract that appears both better than any abstract a human would write and more information-dense than any abstract present in the original document. The LLM generated abstract provides incredibly comprehensive knowledge about the document with a small token density that carries a significant amount of information. We produce one abstract per document, 331 total.
  5. Storing recollections/memories across documents: At the top of the pyramid we store critical information that is useful across all tasks. This can be information that the user shares about the task or information the agent learns about the dataset over time by researching and responding to tasks. For example, we can store the current 30 companies in the DOW as a recollection since this list is different from the 30 companies in the DOW at the time of the model’s knowledge cutoff. As we conduct more and more research tasks, we can continuously improve our recollections and maintain an audit trail of which documents these recollections originated from. For example, we can keep track of AI strategies across companies, where companies are making major investments, etc. These high-level connections are super important since they reveal relationships and information that are not apparent in a single page or document.
Sample subset of insights extracted from IBM 10Q, Q3 2024
Sample subset of insights extracted from IBM 10Q, Q3 2024 (page 4)

We store the text and embeddings for each layer of the pyramid (pages and up) in Azure PostgreSQL. We originally used Azure AI Search, but switched to PostgreSQL for cost reasons. This required us to write our own hybrid search function since PostgreSQL doesn’t yet natively support this feature. This implementation would work with any vector database or vector index of your choosing. The key requirement is to store and efficiently retrieve both text and vector embeddings at any level of the pyramid. 

This approach essentially creates the essence of a knowledge graph, but stores information in natural language, the way an LLM natively wants to interact with it, and is more efficient on token retrieval. We also let the LLM pick the terms used to categorize each level of the pyramid, this seemed to let the model decide for itself the best way to describe and differentiate between the information stored at each level. For example, the LLM preferred “insights” to “facts” as the label for the first level of distilled knowledge. Our goal in doing this was to better understand how an LLM thinks about the process by letting it decide how to store and group related information. 

Using the pyramid: How it works with RAG & Agents

At inference time, both traditional RAG and agentic approaches benefit from the pre-processed, distilled information ingested in our knowledge pyramid. The pyramid structure allows for efficient retrieval in both the traditional RAG case, where only the top X related pieces of information are retrieved or in the Agentic case, where the Agent iteratively plans, retrieves, and evaluates information before returning a final response. 

The benefit of the pyramid approach is that information at any and all levels of the pyramid can be used during inference. For our implementation, we used PydanticAI to create a search agent that takes in the user request, generates search terms, explores ideas related to the request, and keeps track of information relevant to the request. Once the search agent determines there’s sufficient information to address the user request, the results are re-ranked and sent back to the LLM to generate a final reply. Our implementation allows a search agent to traverse the information in the pyramid as it gathers details about a concept/search term. This is similar to walking a knowledge graph, but in a way that’s more natural for the LLM since all the information in the pyramid is stored in natural language.

Depending on the use case, the Agent could access information at all levels of the pyramid or only at specific levels (e.g. only retrieve information from the concepts). For our experiments, we did not retrieve raw page-level data since we wanted to focus on token efficiency and found the LLM-generated information for the insights, concepts, abstracts, and recollections was sufficient for completing our tasks. In theory, the Agent could also have access to the page data; this would provide additional opportunities for the agent to re-examine the original document text; however, it would also significantly increase the total tokens used. 

Here is a high-level visualization of our Agentic approach to responding to user requests:

Overview of the agentic research & response process
Image created by author and team providing an overview of the agentic research & response process

Results from the pyramid: Real-world examples

To evaluate the effectiveness of our approach, we tested it against a variety of question categories, including typical fact-finding questions and complex cross-document research and analysis tasks. 

Fact-finding (spear fishing): 

These tasks require identifying specific information or facts that are buried in a document. These are the types of questions typical RAG solutions target but often require many searches and consume lots of tokens to answer correctly. 

Example task: “What was IBM’s total revenue in the latest financial reporting?”

Example response using pyramid approach: “IBM’s total revenue for the third quarter of 2024 was $14.968 billion [ibm-10q-q3-2024.pdf, pg. 4]

Screenshot of total tokens used to research and generate response
Total tokens used to research and generate response

This result is correct (human-validated) and was generated using only 9,994 total tokens, with 1,240 tokens in the generated final response. 

Complex research and analysis: 

These tasks involve researching and understanding multiple concepts to gain a broader understanding of the documents and make inferences and informed assumptions based on the gathered facts.

Example task: “Analyze the investments Microsoft and NVIDIA are making in AI and how they are positioning themselves in the market. The report should be clearly formatted.”

Example response:

Screenshot of the response generated by the agent analyzing AI investments and positioning for Microsoft and NVIDIA.
Response generated by the agent analyzing AI investments and positioning for Microsoft and NVIDIA.

The result is a comprehensive report that executed quickly and contains detailed information about each of the companies. 26,802 total tokens were used to research and respond to the request with a significant percentage of them used for the final response (2,893 tokens or ~11%). These results were also reviewed by a human to verify their validity.

Screenshot of snippet indicating total token usage for the task
Snippet indicating total token usage for the task

Example task: “Create a report on analyzing the risks disclosed by the various financial companies in the DOW. Indicate which risks are shared and unique.”

Example response:

Screenshot of part 1 of a response generated by the agent on disclosed risks.
Part 1 of response generated by the agent on disclosed risks.
Screenshot of part 2 of a response generated by the agent on disclosed risks.
Part 2 of response generated by the agent on disclosed risks.

Similarly, this task was completed in 42.7 seconds and used 31,685 total tokens, with 3,116 tokens used to generate the final report. 

Screenshot of a snippet indicating total token usage for the task
Snippet indicating total token usage for the task

These results for both fact-finding and complex analysis tasks demonstrate that the pyramid approach efficiently creates detailed reports with low latency using a minimal amount of tokens. The tokens used for the tasks carry dense meaning with little noise allowing for high-quality, thorough responses across tasks.

Benefits of the pyramid: Why use it?

Overall, we found that our pyramid approach provided a significant boost in response quality and overall performance for high-value questions. 

Some of the key benefits we observed include: 

  • Reduced model’s cognitive load: When the agent receives the user task, it retrieves pre-processed, distilled information rather than the raw, inconsistently formatted, disparate document chunks. This fundamentally improves the retrieval process since the model doesn’t waste its cognitive capacity on trying to break down the page/chunk text for the first time. 
  • Superior table processing: By breaking down table information and storing it in concise but descriptive sentences, the pyramid approach makes it easier to retrieve relevant information at inference time through natural language queries. This was particularly important for our dataset since financial reports contain lots of critical information in tables. 
  • Improved response quality to many types of requests: The pyramid enables more comprehensive context-aware responses to both precise, fact-finding questions and broad analysis based tasks that involve many themes across numerous documents. 
  • Preservation of critical context: Since the distillation process identifies and keeps track of key facts, important information that might appear only once in the document is easier to maintain. For example, noting that all tables are represented in millions of dollars or in a particular currency. Traditional chunking methods often cause this type of information to slip through the cracks. 
  • Optimized token usage, memory, and speed: By distilling information at ingestion time, we significantly reduce the number of tokens required during inference, are able to maximize the value of information put in the context window, and improve memory use. 
  • Scalability: Many solutions struggle to perform as the size of the document dataset grows. This approach provides a much more efficient way to manage a large volume of text by only preserving critical information. This also allows for a more efficient use of the LLMs context window by only sending it useful, clear information.
  • Efficient concept exploration: The pyramid enables the agent to explore related information similar to navigating a knowledge graph, but does not require ever generating or maintaining relationships in the graph. The agent can use natural language exclusively and keep track of important facts related to the concepts it’s exploring in a highly token-efficient and fluid way. 
  • Emergent dataset understanding: An unexpected benefit of this approach emerged during our testing. When asking questions like “what can you tell me about this dataset?” or “what types of questions can I ask?”, the system is able to respond and suggest productive search topics because it has a more robust understanding of the dataset context by accessing higher levels in the pyramid like the abstracts and recollections. 

Beyond the pyramid: Evaluation challenges & future directions

Challenges

While the results we’ve observed when using the pyramid search approach have been nothing short of amazing, finding ways to establish meaningful metrics to evaluate the entire system both at ingestion time and during information retrieval is challenging. Traditional RAG and Agent evaluation frameworks often fail to address nuanced questions and analytical responses where many different responses are valid.

Our team plans to write a research paper on this approach in the future, and we are open to any thoughts and feedback from the community, especially when it comes to evaluation metrics. Many of the existing datasets we found were focused on evaluating RAG use cases within one document or precise information retrieval across multiple documents rather than robust concept and theme analysis across documents and domains. 

The main use cases we are interested in relate to broader questions that are representative of how businesses actually want to interact with GenAI systems. For example, “tell me everything I need to know about customer X” or “how do the behaviors of Customer A and B differ? Which am I more likely to have a successful meeting with?”. These types of questions require a deep understanding of information across many sources. The answers to these questions typically require a person to synthesize data from multiple areas of the business and think critically about it. As a result, the answers to these questions are rarely written or saved anywhere which makes it impossible to simply store and retrieve them through a vector index in a typical RAG process. 

Another consideration is that many real-world use cases involve dynamic datasets where documents are consistently being added, edited, and deleted. This makes it difficult to evaluate and track what a “correct” response is since the answer will evolve as the available information changes. 

Future directions

In the future, we believe that the pyramid approach can address some of these challenges by enabling more effective processing of dense documents and storing learned information as recollections. However, tracking and evaluating the validity of the recollections over time will be critical to the system’s overall success and remains a key focus area for our ongoing work. 

When applying this approach to organizational data, the pyramid process could also be used to identify and assess discrepancies across areas of the business. For example, uploading all of a company’s sales pitch decks could surface where certain products or services are being positioned inconsistently. It could also be used to compare insights extracted from various line of business data to help understand if and where teams have developed conflicting understandings of topics or different priorities. This application goes beyond pure information retrieval use cases and would allow the pyramid to serve as an organizational alignment tool that helps identify divergences in messaging, terminology, and overall communication. 

Conclusion: Key takeaways and why the pyramid approach matters

The knowledge distillation pyramid approach is significant because it leverages the full power of the LLM at both ingestion and retrieval time. Our approach allows you to store dense information in fewer tokens which has the added benefit of reducing noise in the dataset at inference. Our approach also runs very quickly and is incredibly token efficient, we are able to generate responses within seconds, explore potentially hundreds of searches, and on average use <40K tokens for the entire search, retrieval, and response generation process (this includes all the search iterations!). 

We find that the LLM is much better at writing atomic insights as sentences and that these insights effectively distill information from both text-based and tabular data. This distilled information written in natural language is very easy for the LLM to understand and navigate at inference since it does not have to expend unnecessary energy reasoning about and breaking down document formatting or filtering through noise

The ability to retrieve and aggregate information at any level of the pyramid also provides significant flexibility to address a variety of query types. This approach offers promising performance for large datasets and enables high-value use cases that require nuanced information retrieval and analysis. 


Note: The opinions expressed in this article are solely my own and do not necessarily reflect the views or policies of my employer.

Interested in discussing further or collaborating? Reach out on LinkedIn!

The post Overcome Failing Document Ingestion & RAG Strategies with Agentic Knowledge Distillation appeared first on Towards Data Science.

]]>
Improving Agent Systems & AI Reasoning https://towardsdatascience.com/improving-agent-systems-ai-reasoning-c2d91ecfdf77/ Sun, 02 Feb 2025 14:32:00 +0000 https://towardsdatascience.com/improving-agent-systems-ai-reasoning-c2d91ecfdf77/ DeepSeek-R1, OpenAI o1 & o3, Test-Time Compute Scaling, Model Post-Training and the Transition to Reasoning Language Models (RLMs)

The post Improving Agent Systems & AI Reasoning appeared first on Towards Data Science.

]]>
Image by author and GPT-4o meant to represent DeepSeek and other competitive GenAI model providers
Image by author and GPT-4o meant to represent DeepSeek and other competitive GenAI model providers

Introduction

Over the past year generative AI adoption and Ai Agent development have skyrocketed. Reports from LangChain show that 51% of respondents are using AI Agents in production, while reports from Deloitte predict that in 2025 at least 25% of companies using Generative AI will launch AI agent pilots or proof of concepts. Despite the popularity and growth of AI Agent frameworks, anyone building these systems quickly runs into limitations of working with large language models (LLMs), with model reasoning ability often at the top of the list. To overcome reasoning limitations researchers and developers have explored a variety of different techniques ranging from different prompting methods like ReAct or Chain of Thought (CoT) to building multi-agent systems with separate agents dedicated to planning and evaluation, and now companies are releasing new models trained specifically to improve the model’s built-in reasoning process.

DeepSeek’s R1 and OpenAI’s o1 and o3 announcements are shaking up the industry by providing more robust reasoning capabilities compared to traditional LLMs. These models are trained to "think" before answering and have a self-contained reasoning process allowing them to break down tasks into simpler steps, work iteratively on the steps, recognize and correct mistakes before returning a final answer. This differs from earlier models like GPT-4o which required users to build their own reasoning logic by prompting the model to think step-by-step and creating loops for the model to iteratively plan, work, and evaluate its progress on a task. One of the key differences in training Reasoning Language Models (RLMs) like o1, o3, and R1 lies in the focus on post-training and test-time compute scaling.

In this article we’ll cover the key differences between train and test time compute scaling, post-training and how to train a RLM like DeepSeek’s R1, and the impact of RLMs on AI Agent development.

Train-Time Compute vs Test-Time Compute

Overview

Compute-scaling relates to providing more resources, such as processing power and memory, for training and running AI models. In a nutshell, train-time compute scaling applies to both pre-training where a model learns general patterns and post-training where a base-model undergoes additional training like Reinforcement Learning (RL) or Supervised Fine-Tuning (SFT) to learn additional more specific behaviors. In contrast, test-time compute scaling applies at inference time, when making a prediction, and provides more computational power for the model to "think" by exploring multiple potential solutions before generating a final answer.

It’s important to understand that both test-time compute scaling and post-training can be used to help a model "think" before producing a final response but that these approaches are implemented in different ways.

While post-training involves updating or creating a new model, test-time compute scaling enables the exploration of multiple solutions at inference without changing the model itself. These approaches could be used together; in theory you could take a model that has undergone post-training for improved reasoning, like DeepSeek-R1, and allow it to further enhance it’s reasoning by performing additional searches at inference through test-time compute scaling.

Image by author. Depicts a very simple representation of pre-training and post-training. Note that there can be significant variations in post-training, but essentially the base model is modified in some way to create an updated model better suited to the task.
Image by author. Depicts a very simple representation of pre-training and post-training. Note that there can be significant variations in post-training, but essentially the base model is modified in some way to create an updated model better suited to the task.

Train-Time Compute: Pre-Training & Post-Training

Today, most LLMs & Foundation Models are pre-trained on a large amount of data from sources like the Common Crawl, which have a wide and varied representation of human-written text. This pre-training phase teaches the model to predict the next most likely word or token in a given context. Once pre-training is complete, most models undergo a form of Supervised Fine Tuning (SFT) to optimize them for instruction following or chat based use cases. For more information on these training processes check out one of my previous articles.

Overall, this training process is incredibly resource intensive and requires many training runs each costing millions of dollars before producing a model like Claude 3.5 Sonnet, GPT-4o, Llama 3.1–405B, etc. These models excel on general purpose tasks as measured on a variety of benchmarks across topics for logical reasoning, math, coding, reading comprehension and more.

However, despite their compelling performance on a myriad of problem types, getting a typical LLM to actually "think" before responding requires a lot of engineering from the user. Fundamentally, these models receive an input and then return an output as their final answer. You can think of this like the model generating it’s best guess in one step based on either learned information from pre-training or through in context learning from directions and information provided in a user’s prompt. This behavior is why Agent frameworks, Chain-of-Thought (CoT) prompting, and tool-calling have all taken off. These patterns allow people to build systems around LLMs which enable a more iterative, structured, and successful workflow for LLM application development.

Recently, models like DeepSeek-R1 have diverged from the typical pre-training and post-training patterns that optimize models for chat or instruction following. Instead DeepSeek-R1 used a multi-stage post-training pipeline to teach the model more specific behaviors like how to produce Chain-of-Thought sequences which in turn improve the model’s overall ability to "think" and reason. We’ll cover this in detail in the next section using the DeepSeek-R1 training process as an example.

Test-Time Compute Scaling: Enabling "Thinking" at Inference

What’s exciting about test-time compute scaling and post-training is that reasoning and iterative problem solving can be built into the models themselves or their inference pipelines. Instead of relying on the developer to guide the entire reasoning and iteration process, there’s opportunities to allow the model to explore multiple solution paths, reflect on it’s progress, rank the best solution paths, and generally refine the overall reasoning lifecycle before sending a response to the user.

Test-time compute scaling is specifically related to optimizing performance at inference and does not involve modifying the model’s parameters. What this means practically is that a smaller model like Llama 3.2–8b can compete with much larger models by spending more time "thinking" and working through numerous possible solutions at inference time.

Some of the common test-time scaling strategies include self-refinement where the model iteratively refines it’s own outputs and searching against a verifier where multiple possible answers are generated and a verifier selects the best path to move forward from. Common search against verifier strategies include:

  • Best-of-N where numerous responses are generated for each question, each answer is scored, and the answer with the highest score wins.
  • Beam Search which typically use a Process Reward Model (PRM) to score a multi-step reasoning process. This allows you to start by generating multiple solution paths (beams), determine which paths are the best to continue searching on, then generate a new set of sub-paths and evaluate these, continuing until a solution is reached.
  • Diverse Verifier Tree Search (DVTS) is related to Beam Search but creates a separate tree for each of the initial paths (beams) created. Each tree is then expanded and the branches of the tree are scored using PRM.
Image by author inspired by HuggingFace blog on Test Time Compute Scaling
Image by author inspired by HuggingFace blog on Test Time Compute Scaling

Determining which search strategy is best is still an active area of research, but there are a lot of great resources on HuggingFace which provide examples for how these search strategies can be implemented for your use case.

Training a Reasoning Language Model (RLM)

OpenAI’s o1 model announced in September 2024 was one of the first models designed to "think" before responding to users. Although it takes longer to get a response from o1 compared to models like GPT-4o, o1’s responses are typically better for more advanced tasks since it generates chain of thought sequences that help it break down and solve problems.

Working with o1 and o3 requires a different style of prompt engineering compared to earlier generations of models given that these new reasoning focused models operate quite differently than their predecessors. For example, telling o1 or o3 to "think step by step" will be less valuable than giving the same instructions to GPT-4o.

Given the closed-source nature of OpenAI’s o1 and o3 models it’s impossible to know exactly how the models were developed; this is a big reason why DeepSeek-R1 attracted so much attention. DeepSeek-R1 is the first open-source model to demonstrate comparable behavior and performance to OpenAI’s o1. This is amazing for the open-source community because it means developers can modify R1 to their needs and, compute power permitting, can replicate R1’s training methodology.

DeepSeek-R1 Training Process:

  1. DeepSeek-R1-Zero: First, DeepSeek performed Reinforcement Learning (RL) (post-training) on their base model DeepSeek-V3. This resulted in DeepSeek-R1-Zero, a model that learned how to reason, create chain-of-thought-sequences, and demonstrates capabilities like self-verification and reflection. The fact that a model could learn all these behaviors from RL alone is significant for the AI industry as a whole. However, despite DeepSeek-R1-Zero’s impressive ability to learn, the model had significant issues like language mixing and generally poor readability. This led the team to explore other paths to stabilize model performance and create a more production-ready model.
  2. DeepSeek-R1: Creating DeepSeek-R1 involved a multi-stage post training pipeline alternating between SFT and RL steps. Researchers first performed SFT on DeepSeek-V3 using cold start data in the form of thousands of example CoT sequences, the goal of this was to create a more stable starting point for RL and overcome the issues found with DeepSeek-R1-Zero. Second, researchers performed RL and included rewards to promote language consistency and enhance reasoning on tasks like science, coding, and math. Third, SFT is completed again, this time including non-reasoning focused training examples to help the model retain more general-purpose abilities like writing and role-playing. Finally, RL occurs again to help improve with alignment towards human preferences. This resulted in a highly capable model with 671B parameters.
  3. Distilled DeepSeek-R1 Models: The DeepSeek team further demonstrated that DeepSeek-R1’s reasoning can be distilled into open-source smaller models using SFT alone without RL. They fine-tuned smaller models ranging from 1.5B-70B parameters based on both Qwen and Llama architectures resulting in a set of lighter, more efficient models with better reasoning abilities. This significantly improves accessibility for developers since many of these distilled models can run quickly on their device.

Conclusion: The Impact of Improved Reasoning Models on AI Agents

As reasoning-first models and test-time compute scaling techniques continue to advance, the system design, capabilities, and user-experience for interacting with AI agents will change significantly.

Going forward I believe we will see more streamlined agent teams. Instead of having separate agents and hyper use-case specific prompts and tools we will likely see design patterns where a single RLM manages the entire workflow. This will also likely change how much background information the user needs to provide the agent if the agent is better equipped to explore a variety of different solution paths.

User interaction with agents will also change. Today many agent interfaces are still chat-focused with users expecting near-instant responses. Given that it takes RLMs longer to respond I think user-expectations and experiences will shift and we’ll see more instances where users delegate tasks that agent teams execute in the background. This execution time could take minutes or hours depending on the complexity of the task but ideally will result in thorough and highly traceable outputs. This could enable people to delegate many tasks to a variety of agent teams at once and spend their time focusing on human-centric tasks.

Despite their promising performance, many reasoning focused models still lack tool-calling capabilities. OpenAI’s newly released o3-mini is the first reasoning focused model that natively supports tool-calling, structured outputs, and developer prompts (the new version of system prompts). Tool-calling is critical for agents since it allows them to interact with the world, gather information, and actually execute tasks on our behalf. However, given the rapid pace of innovation in this space I expect we will soon see more RLMs with integrated tool calling.

In summary, this is just the beginning of a new age of general-purpose reasoning models that will continue to transform the way that we work and live.


Note: The opinions expressed in this article are solely my own and do not necessarily reflect the views or policies of my employer.

Interested in discussing further or collaborating? Reach out on LinkedIn!

References:

The post Improving Agent Systems & AI Reasoning appeared first on Towards Data Science.

]]>
Introducing Layer Enhanced Classification (LEC) https://towardsdatascience.com/introducing-layer-enhanced-classification-lec-4972f4f1c79f/ Fri, 20 Dec 2024 12:02:33 +0000 https://towardsdatascience.com/introducing-layer-enhanced-classification-lec-4972f4f1c79f/ A novel approach for lightweight safety classification using pruned language models

The post Introducing Layer Enhanced Classification (LEC) appeared first on Towards Data Science.

]]>
Leveraging the hidden state from an intermediate Transformer layer for efficient and robust content safety and prompt injection classification
Image by author and GPT-4o meant to represent the robust language understanding provided by Large Language Models.
Image by author and GPT-4o meant to represent the robust language understanding provided by Large Language Models.

Introduction

As the adoption of Language Models (LMs) grows, it’s more and more important to detect inappropriate content in both the user’s input and the generated outputs of the language model. With each new model release from any major model provider, one of the first things people try to do is find ways to "jailbreak" or otherwise manipulate the model to respond in ways it shouldn’t. A quick search on Google or X reveals many examples of how people have found ways around model alignment tuning to get models to respond to inappropriate requests. Furthermore, many companies have released Generative AI based chatbots publicly for tasks like customer service, which often end up suffering from prompt injection attacks and responding to tasks both inappropriate and far beyond their intended use. Detecting and classifying these instances is extremely important for businesses so that they don’t end up with a system that can be easily manipulated by their users, especially if they deploy their chat systems publicly.

My team, Mason Sawtell, Sandi Besen, Jim Brown, and I recently published our paper Lightweight Safety Classification using Pruned Language Models as an ArXiv preprint. Our work introduces a new approach, Layer Enhanced Classification (LEC), and demonstrates that using LEC it is possible to effectively classify both content safety violations and prompt injection attacks by using the hidden state(s) from the intermediate transformer layer(s) of a Language Model to train a penalized logistic regression classifier with very few trainable parameters (769 on the low end) and a small number of training examples, often fewer than 100. This approach combines the computational efficiency of a simple classification model with the robust language understanding of a Language Model.

All of the models trained using our approach, LEC, outperform special-purpose models designed for each task as well as GPT-4o. We find that there are optimal intermediate transformer layers that produce the necessary features for both content safety and prompt injection classification tasks. This is important because it suggests you can use the same model to simultaneously classify content safety violations, prompt injections, and generate the output tokens. Alternatively, you could use a very small LM, prune it to the optimal intermediate layer, and use the outputs from this layer as the features for the classification task. This would allow for an incredibly compute efficient and lightweight classifier that integrates well with an existing LM inference pipeline.

This is the first of several articles I plan to share on this topic. In this article I will summarize the goals, approach, key results, and implications of our research. In a future article, I plan to share how we applied our approach to IBM’s Granite-8B model and an open-source model without any guardrails, allowing both models to detect content safety & prompt injection violations and generate output tokens all in one pass through the model. For further details on our research feel free to check out the full paper, read my colleague Sandi Besen’s article, or reach out with questions.

Goals & Approach

Overview: Our research focuses on understanding how well the hidden states of intermediate transformer layers perform when used as the input features for classification tasks. We wanted to understand if small general-purpose models and special-purpose models for content safety and prompt injection classification tasks would perform better on these tasks if we could identify the optimal layer to use for the task instead of using the entire model / the last layer for classification. We also wanted to understand how small of a model, in terms of the total number of parameters, we could use as a starting point for this task. Other research has shown that different layers of the model focus on different characteristics of any given prompt input, our work finds that the intermediate layers tend to best capture the features that are most important for these classification tasks.

Datasets: For both content safety and prompt injection classification tasks we compare the performance of models trained using our approach to baseline models on task-specific datasets. Previous work indicated our classifiers would only see small performance improvements after a few hundred examples so for both classification tasks we used a task-specific dataset with 5,000 randomly sampled examples, allowing for enough data diversity while minimizing compute and training time. For the content safety dataset we use a combination of the SALAD Data dataset from OpenSafetyLab and the LMSYS-Chat-1M dataset from LMSYS. For the prompt injection dataset we use the SPML dataset since it includes system and user prompt pairs. This is critical because some user requests might seem "safe" (e.g., "help me solve this math problem") but they ask the model to respond outside of the system’s intended use as defined in the system prompt (e.g. "You are a helpful AI assistant for Company X, you only respond to questions about our company").

Model Selection: We use GPT-4o as a baseline model for both tasks since it is widely considered one of the most capable LLMs and in some cases outperformed the baseline special-purpose model(s). For content safety classification we use Llama Guard 3 1B and 8B models and for prompt injection classification we use Protect AI’s DeBERTA v3 Base Prompt Injection v2 model since these models are considered leaders in their respective areas. We apply our approach, LEC, to the baseline special purpose models (Llama Guard 3 1B, Llama Guard 3 8B, and DeBERTa v3 Base Prompt Injection) and general-purpose models. For general-purpose models we selected Qwen 2.5 Instruct in sizes 0.5B, 1.5B, and 3B since these models are relatively close in size to the special-purpose models.

This setup allows us to compare 3 key things:

  1. How well our approach performs when applied to a small general-purpose model compared to both baseline models (GPT-4o and the special-purpose model).
  2. How much applying our approach improves the performance of the special-purpose model relative to its own baseline performance on that task.
  3. How well our approach generalizes across model architectures, by evaluating its performance on both general-purpose and special-purpose models.

Important Implementation Details: For both Qwen 2.5 Instruct models and task-specific special-purpose models we prune individual layers and capture the hidden state of the transformer layer to train a Penalized Logistic Regression (PLR) model with L2 regularization. The PLR model has the same number of trainable parameters as the size of the model’s hidden state plus one for the bias in binary classification tasks, this ranges from 769 for the smallest model (Protect AI’s DeBERTa) to 4097 for the largest model (Llama Guard 3 8B). We train the classifier with varying numbers of examples for each layer allowing us to understand the impact of individual layers on the task and how many training examples are necessary to surpass the baseline models’ performance or achieve optimal performance in terms of F1 score. We run our entire test set through the baseline models to establish their performance on each task.

Image by author and team demonstrating the LEC training process at a high level. Training examples are independently passed through a model and the hidden state at each transformer layer is captured. These hidden states are then used to train classifiers. Each classifier is trained with a varying number of examples. The results allow us to determine which layers produce the most task-relevant features and how many examples are needed to achieve the best performance.
Image by author and team demonstrating the LEC training process at a high level. Training examples are independently passed through a model and the hidden state at each transformer layer is captured. These hidden states are then used to train classifiers. Each classifier is trained with a varying number of examples. The results allow us to determine which layers produce the most task-relevant features and how many examples are needed to achieve the best performance.

Key Results

In this section I’ll cover the important results across both tasks and for each task, content safety classification and prompt injection classification, individually.

Key findings across both tasks:

  1. Overall, our approach results in a higher F1 score across all evaluated tasks, models, and number of of training examples, typically surpassing baseline model performance within 20–100 examples.
  2. The intermediate layers tend to show the largest improvement in F1 score compared to the final layer when trained on fewer examples. These layers also tend to have the best performance relative to the baseline models. This indicates that local features important to both classification tasks are represented early on in the transformer network and suggests that use cases with fewer training examples can especially benefit from our approach.
  3. Furthermore, we found that applying our approach to the special-purpose models outperforms the models own baseline performance, typically within 20 examples, by identifying and using the most task-relevant layer.
  4. Both general-purpose Qwen 2.5 Instruct models and task-specific special-purpose models achieve higher F1 scores within fewer examples with our approach. This suggests that our approach generalizes across architectures and domains.
  5. In the Qwen 2.5 Instruct models, we find that the intermediate model layers attain higher F1 scores with fewer examples for both content safety and prompt injection classification tasks. This suggests that it’s feasible to use one model for both classification tasks and generate the outputs in one pass. The additional compute time for these extra classification steps would be almost negligible given the small size of the classifiers.

Content safety classification results:

Image by author and team demonstrating LEC performance at select layers on the binary content safety classification task for Qwen 2.5 Instruct 0.5B, Llama Guard 3 1B, and Llama Guard 3 8b. The x-axis shows the number of training examples, and the Y-axis reflects the weighted F1-score.
Image by author and team demonstrating LEC performance at select layers on the binary content safety classification task for Qwen 2.5 Instruct 0.5B, Llama Guard 3 1B, and Llama Guard 3 8b. The x-axis shows the number of training examples, and the Y-axis reflects the weighted F1-score.
  1. For both binary and multi-class classification, the general and special purpose models trained using our approach typically outperform the baseline Llama Guard 3 models within 20 examples and GPT-4o in fewer than 100 examples.
  2. For both binary and multi-class classification, the general and special purpose LEC models typically surpass all baseline models performance for the intermediate layers if not all layers. Our results on binary content safety classification surpass the baselines by the widest margins attaining maximum F1-scores of 0.95 or 0.96 for both Qwen 2.5 Instruct and Llama Guard LEC models. In comparison, GPT-4o’s baseline F1 score is 0.82, Llama Guard 3 1B’s is 0.65 , and Llama Guard 3 8B’s is 0.71.
  3. For binary classification our approach performs comparably when applied to Qwen 2.5 Instruct 0.5B, Llama Guard 3 1B, and Llama Guard 3 8B. The models attain a maximum F1 score of 0.95, 0.96, and 0.96 respectively. Interestingly, Qwen 2.5 Instruct 0.5B surpasses GPT-4o’s baseline performance in 15 examples for the middle layers while it takes both Llama Guard 3 models 55 examples to do so.
  4. For multi-class classification, a very small LEC model using the hidden state from the middle layers of Qwen 2.5 Instruct 0.5B surpasses GPT-4o’s baseline performance within 35 training examples for all three difficulty levels of the multi-class classification task.

Prompt injection classification results:

Image by author and team demonstrating LEC performance at select layers on the prompt injection classification task for Qwen 2.5 Instruct 0.5B and DeBERTa v3 Prompt Injection v2 models. The x-axis shows the number of training examples, and the Y-axis reflects the weighted F1-score. These graphs demonstrate how both LEC models outperform the baselines for the intermediate model layers with minimal training examples.
Image by author and team demonstrating LEC performance at select layers on the prompt injection classification task for Qwen 2.5 Instruct 0.5B and DeBERTa v3 Prompt Injection v2 models. The x-axis shows the number of training examples, and the Y-axis reflects the weighted F1-score. These graphs demonstrate how both LEC models outperform the baselines for the intermediate model layers with minimal training examples.
  1. Applying our approach to both general-purpose Qwen 2.5 Instruct models and special-purpose DeBERTa v3 Prompt Injection v2 results in both models intermediate layers outperforming the baseline models in fewer than 100 training examples. This again indicates that our approach generalizes across model architectures and domains.
  2. All three Qwen 2.5 Instruct model sizes surpass the baseline DeBERTa v3 Prompt Injection v2 model’s F1 score of 0.73 within 5 training examples for all model layers.
  3. Qwen 2.5 Instruct 0.5B surpasses GPT-4o’s performance for the middle layer, layer 12 in 55 examples. Similar, but slightly better performance is observed for the larger Qwen 2.5 Instruct models.
  4. Applying our approach to the DeBERTa v3 Prompt Injection v2 model results in a maximum F1 score of 0.98, significantly surpassing the model’s baseline performance F1 score of 0.73 on this task.
  5. The intermediate layers achieve the highest weighted F1 scores for both the DeBERTa model and across Qwen 2.5 Instruct model sizes.

Conclusion

In our research we focused on two Responsible Ai related classification tasks but expect this approach to work for other classification tasks provided that the important features for the task can be detected by the intermediate layers of the model.

We demonstrated that our approach of training a classification model on the hidden state from an intermediate transformer layer creates effective content safety and prompt injection classification models with minimal parameters and training examples. Furthermore, we illustrated how our approach improves the performance of existing special-purpose models compared to their own baseline results.

Our results suggest two promising options for integrating top-performing content safety and prompt injection classifiers into existing LLM inference workflows. One option is to take a lightweight small model like the ones explored in our paper, prune it to the optimal layer and use it as a feature extractor for the classification task. The classification model could then be used to identify any content safety violations or prompt injections before processing the user input with a closed-source model like GPT-4o. The same classification model could be used to validate the generated response before sending it to the user. A second option is to apply our approach to an open-source, general-purpose model, like IBM’s Granite or Meta’s Llama models, identify which layers are most relevant to the classification task, then update the inference pipeline to simultaneously classify content safety and prompt injections while generating the output response. If content safety or prompt injections are detected you could easily stop the output generation, otherwise if there are no violations, the model can continue generating it’s response. Either of these options could be extended to apply to AI-agent based scenarios depending on the model used for each agent.

In summary, LEC provides a new promising and practical solution to safeguarding Generative AI based systems by identifying content safety and prompt injection attacks with better performance and fewer training examples compared to existing approaches. This is critical for any person or business building with Generative AI today to ensure their systems are operating both responsibly and as intended.


Note: The opinions expressed both in this article and the research paper are solely those of the authors and do not necessarily reflect the views or policies of their respective employers.

Interested in discussing further or collaborating? Reach out on LinkedIn!

Additional References:

The post Introducing Layer Enhanced Classification (LEC) appeared first on Towards Data Science.

]]>
Computer Use and AI Agents: A New Paradigm for Screen Interaction https://towardsdatascience.com/computer-use-and-ai-agents-a-new-paradigm-for-screen-interaction-b2dcbea0df5b/ Wed, 30 Oct 2024 12:01:55 +0000 https://towardsdatascience.com/computer-use-and-ai-agents-a-new-paradigm-for-screen-interaction-b2dcbea0df5b/ Exploring the future of multimodal AI Agents and the Impact of Screen Interaction

The post Computer Use and AI Agents: A New Paradigm for Screen Interaction appeared first on Towards Data Science.

]]>
Introduction: The ever-evolving AI Agent Landscape

Recent announcements from Anthropic, Microsoft, and Apple are changing the way we think about AI Agents. Today, the term "Ai Agent" is oversaturated – nearly every AI-related announcement refers to agents, but their sophistication and utility vary greatly.

At one end of the spectrum, we have advanced agents that leverage multiple loops for planning, tool execution, and goal evaluation, iterating until they complete a task. These agents might even create and use memories, learning from their past mistakes to drive future successes. Determining what makes an effective agent is a very active area of AI research. It involves understanding what attributes make a successful agent (e.g., how should the agent plan, how should it use memory, how many tools should it use, how should it keep track of it’s task) and the best approach to configure a team of agents.

On the other end of the spectrum, we find AI agents that execute single purpose tasks that require little if any reasoning. These agents are often more workflow focused. For example, an agent that consistently summarizes a document and stores the result. These agents are typically easier to implement because the use cases are narrowly defined, requiring less planning or coordination across multiple tools and fewer complex decisions.

With the latest announcements from Anthropic, Microsoft, and Apple, we’re witnessing a shift from text-based AI agents to multimodal agents. This opens up the potential to give an agent written or verbal instructions and allow it to seamlessly navigate your phone or computer to complete tasks. This has great potential to improve accessibility across devices, but also comes with significant risks. Anthropic’s computer use announcement highlights the risks of giving AI unfettered access to your screen, and provides risk mitigation tactics like running Claude in a dedicated virtual machine or container, limiting internet access to an allowlist of permitted domains, including human in the loop checks, and avoiding giving the model access to sensitive data. They note that no content submitted to the API will be used for training.

Key Announcements from Anthropic, Microsoft, and Apple:

Anthropic’s Claude 3.5 Sonnet: Giving AI the Power to Use Computers

  • Overview: The goal of Computer Use is to give AI the ability to interact with a computer the same way a human would. Ideally Claude would be able to open and edit documents, click on various areas of the page, scroll and read pages, run and execute command line code, and more. Today, Claude can follow instructions from a human to move a cursor around the computer screen, click on relevant areas of the screen, and type into a virtual keyboard. Claude scored 14.9% on the OSWorld benchmark, which is higher than other AI models on the same benchmark, but still significantly behind humans (humans typically score 70–75%).
  • How it works: Claude looks at user submitted screenshots and counts pixels to determine where it needs to move the cursor to complete the task. Researchers note that Claude was not given internet access during training for safety reasons, but that Claude was able to generalize from training tasks like using a calculator and text-editor to more complex tasks. It even retried tasks when it failed. Computer use includes three Anthropic defined tools: computer, text editor, and bash. The computer tool is used for screen navigation, text editor is used for viewing, creating, and editing text files, and bash is used to run bash shell commands.
  • Challenges: Despite its promising performance, there's still a long way to go for Claude's computer use abilities. Today it struggles with scrolling and overall reliability, and it is vulnerable to prompt injections.
  • How to Use: Public beta available through the Anthropic API. Computer use can be combined with regular tool use; a hedged sketch of what a request might look like follows this list.
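To make the request shape concrete, here is a sketch of how a computer use call might be configured with the Anthropic Python SDK. The tool types, tool names, and beta flag below follow Anthropic's computer use beta announcement at the time of writing and should be verified against the current documentation; the model string and prompt are illustrative.

import anthropic

client = anthropic.Anthropic()

# Hedged sketch: tool types, names, and the beta flag mirror the October 2024
# computer use public beta - confirm against current Anthropic docs before use.
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],
    tools=[
        {"type": "computer_20241022", "name": "computer",
         "display_width_px": 1024, "display_height_px": 768},
        {"type": "text_editor_20241022", "name": "str_replace_editor"},
        {"type": "bash_20241022", "name": "bash"},
    ],
    messages=[{"role": "user", "content": "Open the quarterly report and summarize the first page."}],
)
print(response.content)  # may contain tool_use blocks that your application must execute in a sandboxed environment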

Microsoft’s OmniParser & GPT-4V: Making Screens Understandable and Actionable for AI

  • Overview: OmniParser is designed to parse screenshots of user interfaces and transform them into structured outputs. These outputs can be passed to a model like GPT-4V to generate actions based on the detected screen elements. OmniParser + GPT-4V were scored on a variety of benchmarks, including Windows Agent Arena, which adapts the OSWorld benchmark to create Windows-specific tasks. These tasks are designed to evaluate an agent's ability to plan, understand the screen, and use tools; OmniParser & GPT-4V scored ~20%.
  • How it Works: OmniParser combines multiple fine-tuned models to understand screens. It uses a finetuned interactable icon/region detection model (YOLOv8), a finetuned icon description model (BLIP-2 or Florence2), and an OCR module. These models are used to detect icons and text and generate descriptions before sending this output to GPT-4V which decides how to use the output to interact with the screen.
  • Challenges: Today, when OmniParser detects repeated icons or text and passes them to GPT-4V, GPT-4V usually fails to click on the correct icon. Additionally, OmniParser depends on the quality of the OCR output, so if a bounding box is off, the whole system might fail to click on the appropriate area for clickable links. There are also challenges with understanding certain icons, since sometimes the same icon is used to describe different concepts (e.g., three dots for loading versus for a menu item).
  • How to Use: OmniParser is available on GitHub & HuggingFace. You will need to install the requirements and load the model from HuggingFace; then you can try running the demo notebooks to see how OmniParser breaks down images.

Apple’s Ferret-UI: Bringing Multimodal Intelligence to Mobile UIs

  • Overview: Apple's Ferret (Refer and Ground Anything Anywhere at Any Granularity) has been around since 2023, but recently Apple released Ferret-UI, an MLLM (Multimodal Large Language Model) that can execute "referring, grounding, and reasoning tasks" on mobile UI screens. Referring tasks include actions like widget classification and icon recognition. Grounding tasks include tasks like find icon or find text. Ferret-UI can understand UIs and follow instructions to interact with the UI.
  • How it Works: Ferret-UI is based on Ferret and adapted to work on finer grained images by training with "any resolution" so it can better understand mobile UIs. Each image is split into two sub-images which have their own features generated. The LLM uses the full image, both sub-images, regional features, and text embeddings to generate a response.
  • Challenges: Some of the results cited in the Ferret-UI paper demonstrate instances where Ferret predicts nearby text instead of the target text, predicts valid words when presented with a screen that has misspelled words, and sometimes misclassifies UI attributes.
  • How to Use: Apple made the data and code available on GitHub for research use only. Apple released two Ferret-UI checkpoints, one built on Gemma-2b and one built on Llama-3–8B. The Ferret-UI models are subject to the licenses for Gemma and Llama while the dataset allows non-commercial use.

Summary: Three Approaches to AI Driven Screen Navigation

In summary, each of these systems demonstrates a different approach to building multimodal agents that can interact with computers or mobile devices on our behalf.

Anthropic’s Claude 3.5 Sonnet focuses on general computer interaction where Claude counts pixels to appropriately navigate the screen. Microsoft’s OmniParser addresses specific challenges for breaking down user interfaces into structured outputs which are then sent to models like GPT-4V to determine actions. Apple’s Ferret-UI is tailored to mobile UI comprehension allowing it to identify icons, text, and widgets while also executing open-ended instructions related to the UI.

Across each system, the workflow typically follows two key phases: one for parsing the visual information and one for reasoning about how to interact with it. Parsing screens accurately is critical for properly planning how to interact with the screen and making sure the system reliably executes tasks.
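A minimal sketch of that two-phase loop is shown below. The parse_screen function is a stand-in for whichever parser you use (OmniParser, Ferret-UI, or something custom) and returns hard-coded elements for illustration, while the reasoning step uses the OpenAI chat API purely as an example; nothing here is specific to any one vendor's implementation.

from openai import OpenAI

client = OpenAI()

def parse_screen(screenshot_path: str) -> list[dict]:
    # Stand-in for a real screen parser; returns hard-coded elements for illustration.
    return [
        {"id": 1, "type": "button", "label": "Submit", "bbox": [820, 600, 920, 640]},
        {"id": 2, "type": "text_field", "label": "Email address", "bbox": [400, 300, 800, 340]},
    ]

def decide_next_action(goal: str, elements: list[dict]) -> str:
    # Phase two: a reasoning model picks the next action from the parsed elements.
    prompt = (
        f"Goal: {goal}\n"
        f"Detected screen elements: {elements}\n"
        "Reply with the single next action, e.g. CLICK(id) or TYPE(id, text)."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

elements = parse_screen("screenshot.png")
print(decide_next_action("Sign up for the newsletter with jane@example.com", elements))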

Conclusion: Building Smarter, Safer AI Agents

In my opinion, the most exciting aspect of these developments is how multimodal capabilities and reasoning frameworks are starting to converge. While these tools offer promising capabilities, they still lag significantly behind human performance. There are also significant AI safety concerns which need to be addressed when implementing any agentic system with screen access.

One of the biggest benefits of agentic systems is their potential to overcome the cognitive limitations of individual models by breaking down tasks into specialized components. These systems can be built in many ways. In some cases, what appears to the user as a single agent may, behind the scenes, consist of a team of sub-agents – each managing distinct responsibilities like planning, screen interaction, or memory management. For example, a reasoning agent might coordinate with another agent that specializes in parsing screen data, while a separate agent curates memories to enhance future performance.

Alternatively, these capabilities might be combined within one robust agent. In this setup, the agent could have multiple internal planning modules – one focused on planning the screen interactions and another focused on managing the overall task. The best approach to structuring agents is still an open question, but the goal remains the same: to create agents that perform reliably over time, across multiple modalities, and adapt seamlessly to the user's needs.


Interested in discussing further or collaborating? Reach out on LinkedIn!

The post Computer Use and AI Agents: A New Paradigm for Screen Interaction appeared first on Towards Data Science.

AI Agents: The Intersection of Tool Calling and Reasoning in Generative AI https://towardsdatascience.com/ai-agents-the-intersection-of-tool-calling-and-reasoning-in-generative-ai-ff268eece443/ Sat, 05 Oct 2024 14:02:00 +0000 https://towardsdatascience.com/ai-agents-the-intersection-of-tool-calling-and-reasoning-in-generative-ai-ff268eece443/ Unpacking problem solving and tool-driven decision making in AI

The post AI Agents: The Intersection of Tool Calling and Reasoning in Generative AI appeared first on Towards Data Science.

Introduction: The Rise of Agentic AI

Today, new libraries and low-code platforms are making it easier than ever to build AI agents, also referred to as digital workers. Tool calling is one of the primary abilities driving the "agentic" nature of Generative AI models by extending their ability beyond conversational tasks. By executing tools (functions), agents can take action on your behalf and solve complex, multi-step problems that require robust decision making and interacting with a variety of external data sources.

This article focuses on how reasoning is expressed through tool calling, explores some of the challenges of tool use, covers common ways to evaluate tool-calling ability, and provides examples of how different models and agents interact with tools.

Expressions of Reasoning to Solve Problems

At the core of successful agents lie two key expressions of reasoning: reasoning through evaluation and planning and reasoning through tool use.

  • Reasoning through evaluation and planning relates to an agent's ability to effectively break down a problem by iteratively planning, assessing progress, and adjusting its approach until the task is completed. Techniques like Chain-of-Thought (CoT), ReAct, and Prompt Decomposition are all patterns designed to improve the model's ability to reason strategically by breaking down tasks to solve them correctly. This type of reasoning is more macro-level, ensuring the task is completed correctly by working iteratively and taking into account the results from each stage.
  • Reasoning through tool use relates to the agent's ability to effectively interact with its environment, deciding which tools to call and how to structure each call. These tools enable the agent to retrieve data, execute code, call APIs, and more. The strength of this type of reasoning lies in the proper execution of tool calls rather than reflecting on the results from the call.

While both expressions of reasoning are important, they don’t always need to be combined to create powerful solutions. For example, OpenAI’s new o1 model excels at reasoning through evaluation and planning because it was trained to reason using chain of thought. This has significantly improved its ability to think through and solve complex challenges as reflected on a variety of benchmarks. For example, the o1 model has been shown to surpass human PhD-level accuracy on the GPQA benchmark covering physics, biology, and chemistry, and scored in the 86th-93rd percentile on Codeforces contests. While o1’s reasoning ability could be used to generate text-based responses that suggest tools based on their descriptions, it currently lacks explicit tool calling abilities (at least for now!).

In contrast, many models are fine-tuned specifically for reasoning through tool use enabling them to generate function calls and interact with APIs very effectively. These models are focused on calling the right tool in the right format at the right time, but are typically not designed to evaluate their own results as thoroughly as o1 might. The Berkeley Function Calling Leaderboard (BFCL) is a great resource for comparing how different models perform on function calling tasks. It also provides an evaluation suite to compare your own fine-tuned model on various challenging tool calling tasks. In fact, the latest dataset, BFCL v3, was just released and now includes multi-step, multi-turn function calling, further raising the bar for tool based reasoning tasks.

Both types of reasoning are powerful independently, and when combined, they have the potential to create agents that can effectively break down complicated tasks and autonomously interact with their environment. For more examples of AI agent architectures for reasoning, planning, and tool calling, check out my team's survey paper on ArXiv.

Challenges with Tool-Calling: Navigating Complex Agent Behaviors

Building robust and reliable agents requires overcoming many different challenges. When solving complex problems, an agent often needs to balance multiple tasks at once including planning, interacting with the right tools at the right time, formatting tool calls properly, remembering outputs from previous steps, avoiding repetitive loops, and adhering to guidance to protect the system from jailbreaks/prompt injections/etc.

Too many demands can easily overwhelm a single agent, leading to a growing trend where what may appear to an end user as one agent is, behind the scenes, a collection of many agents and prompts working together to divide and conquer the task. This division allows tasks to be broken down and handled in parallel by different models and agents tailored to solve that particular piece of the puzzle.

It’s here that models with excellent tool calling capabilities come into play. While tool-calling is a powerful way to enable productive agents, it comes with its own set of challenges. Agents need to understand the available tools, select the right one from a set of potentially similar options, format the inputs accurately, call tools in the right order, and potentially integrate feedback or instructions from other agents or humans. Many models are fine-tuned specifically for tool calling, allowing them to specialize in selecting functions at the right time with high accuracy.

Some of the key considerations when fine-tuning a model for tool calling include:

  1. Proper Tool Selection: The model needs to understand the relationship between available tools, make nested calls when applicable, and select the right tool in the presence of other similar tools.
  2. Handling Structural Challenges: Although most models use JSON format for tool calling, other formats like YAML or XML can also be used. Consider whether the model needs to generalize across formats or if it should only use one. Regardless of the format, the model needs to include the appropriate parameters for each tool call, potentially using results from a previous call in subsequent ones (a small validation sketch follows this list).
  3. Ensuring Dataset Diversity and Robust Evaluations: The dataset used should be diverse and cover the complexity of multi-step, multi-turn function calling. Proper evaluations should be performed to prevent overfitting and avoid benchmark contamination.
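To make point two concrete, the sketch below pairs a generated tool call in JSON with a small check that the required parameters from the tool's schema are present; the schema, the call, and the validation logic are all illustrative examples rather than part of any particular benchmark or framework.

import json

# Illustrative tool schema and a model-generated call (both made up for this example).
tool_schema = {
    "name": "get_current_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "unit": {"type": "string"}},
        "required": ["city"],
    },
}

generated_call = json.loads('{"name": "get_current_weather", "arguments": {"city": "New York"}}')

def validate_call(call: dict, schema: dict) -> list[str]:
    # Return a list of problems so the agent (or a retry loop) can correct the call.
    problems = []
    if call["name"] != schema["name"]:
        problems.append(f"unknown tool: {call['name']}")
    missing = [p for p in schema["parameters"]["required"] if p not in call.get("arguments", {})]
    problems += [f"missing required parameter: {p}" for p in missing]
    return problems

print(validate_call(generated_call, tool_schema))  # an empty list means the call is structurally valid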

Common Benchmarks to Evaluate Tool-Calling

With the growing importance of tool use in language models, many datasets have emerged to help evaluate and improve model tool-calling capabilities. Two of the most popular benchmarks today are the Berkeley Function Calling Leaderboard and Nexus Function Calling Benchmark, both of which Meta used to evaluate the performance of their Llama 3.1 model series. A recent paper, ToolACE, demonstrates how agents can be used to create a diverse dataset for fine-tuning and evaluating model tool use.

Let’s explore each of these benchmarks in more detail:

  • Berkeley Function Calling Leaderboard (BFCL): BFCL contains 2,000 question-function-answer pairs across multiple programming languages. Today there are 3 versions of the BFCL dataset each with enhancements to better reflect real-world scenarios. For example, BFCL-V2, released August 19th, 2024 includes user contributed samples designed to address evaluation challenges related to dataset contamination. BFCL-V3 released September 19th, 2024 adds multi-turn, multi-step tool calling to the benchmark. This is critical for agentic applications where a model needs to make multiple tool calls over time to successfully complete a task. Instructions for evaluating models on BFCL can be found on GitHub, with the latest dataset available on HuggingFace, and the current leaderboard accessible here. The Berkeley team has also released various versions of their Gorilla Open-Functions model fine-tuned specifically for function-calling tasks.
  • Nexus Function Calling Benchmark: This benchmark evaluates models on zero-shot function calling and API usage across nine different tasks classified into three major categories for single, parallel, and nested tool calls. Nexusflow released NexusRaven-V2, a model designed for function-calling. The Nexus benchmark is available on GitHub and the corresponding leaderboard is on HuggingFace.
  • ToolACE: The ToolACE paper demonstrates a creative approach to overcoming challenges related to collecting real-world data for function-calling. The research team created an agentic pipeline to generate a synthetic dataset for tool calling consisting of over 26,000 different APIs. The dataset includes examples of single, parallel, and nested tool calls, as well as non-tool based interactions, and supports both single and multi-turn dialogs. The team released a fine-tuned version of Llama-3.1–8B-Instruct, ToolACE-8B, designed to handle these complex tool-calling related tasks. A subset of the ToolACE dataset is available on HuggingFace.

Each of these benchmarks facilitates our ability to evaluate model reasoning expressed through tool calling. These benchmarks and fine-tuned models reflect a growing trend towards developing more specialized models for specific tasks and increasing LLM capabilities by extending their ability to interact with the real world.

Examples of Tool-Calling in Action

If you're interested in exploring tool-calling in action, here are some examples to get you started, organized by ease of use: from simple built-in tools, to fine-tuned models, to agents with tool-calling abilities.

Level 1 – ChatGPT: The best place to start and see tool-calling live, without needing to define any tools yourself, is through ChatGPT. Here you can use GPT-4o through the chat interface to call and execute tools for web-browsing. For example, when asked "what's the latest AI news this week?" ChatGPT-4o will conduct a web search and return a response based on the information it finds. Remember the new o1 model does not have tool-calling abilities yet and cannot search the web.

Image by author 9/30/24

While this built-in web-searching feature is convenient, most use cases will require defining custom tools that can integrate directly into your own model workflows and applications. This brings us to the next level of complexity.

Level 2 – Using a Model with Tool Calling Abilities and Defining Custom Tools:

This level involves using a model with tool-calling abilities to get a sense of how effectively the model selects and uses its tools. It's important to note that when a model is trained for tool-calling, it only generates the text or code for the tool call; it does not actually execute the code itself. Something external to the model needs to invoke the tool, and it's at this point – where we're combining generation with execution – that we transition from language model capabilities to agentic systems.

To get a sense for how models express tool calls we can turn towards the Databricks Playground. For example, we can select the model Llama 3.1 405B and give it access to the sample tools get_distance_between_locations and get_current_weather. When prompted with the user message "I am going on a trip from LA to New York how far are these two cities? And what’s the weather like in New York? I want to be prepared for when I get there" the model decides which tools to call and what parameters to pass so it can effectively reply to the user.

Image by author 10/2/2024 depicting using the Databricks Playground for sample tool calling

In this example, the model suggests two tool calls. Since the model cannot execute the tools, the user needs to fill in a sample result to simulate the tool output (e.g., "2500" for the distance and "68" for the weather). The model then uses these simulated outputs to reply to the user.
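The same round trip can also be scripted against any OpenAI-compatible tool-calling endpoint (Databricks model serving endpoints expose one). The sketch below hard-codes a simulated tool result, exactly like the value you would type into the Playground; the tool definition, model name, and numbers are illustrative, and it assumes the model chooses to call a tool on the first turn.

import json
from openai import OpenAI

client = OpenAI()  # point base_url and api_key at your own OpenAI-compatible endpoint if needed

tools = [{
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather like in New York?"}]
first = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
tool_call = first.choices[0].message.tool_calls[0]
print("Model requested:", tool_call.function.name, tool_call.function.arguments)

# Simulate the tool output - the step you perform manually in the Playground.
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": json.dumps({"temperature_f": 68})})

final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)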

This approach to using the Databricks Playground allows you to observe how the model uses custom defined tools and is a great way to test your function definitions before implementing them in your tool-calling enabled applications or agents.

Outside of the Databricks Playground, we can observe and evaluate how effectively different models available on platforms like HuggingFace use tools through code directly. For example, we can load different models like Llama 3.2–3B-Instruct, ToolACE-8B, NexusRaven-V2–13B, and more from HuggingFace, give them the same system prompt, tools, and user message then observe and compare the tool calls each model returns. This is a great way to understand how well different models reason about using custom-defined tools and can help you determine which tool-calling models are best suited for your applications.

Here is an example demonstrating a tool call generated by Llama-3.2–3B-Instruct based on the following tool definitions and user message; the same steps could be followed for other models to compare generated tool calls.

import torch
from transformers import pipeline

function_definitions = """[
    {
        "name": "search_google",
        "description": "Performs a Google search for a given query and returns the top results.",
        "parameters": {
            "type": "dict",
            "required": [
                "query"
            ],
            "properties": {
                "query": {
                    "type": "string",
                    "description": "The search query to be used for the Google search."
                },
                "num_results": {
                    "type": "integer",
                    "description": "The number of search results to return.",
                    "default": 10
                }
            }
        }
    },
    {
        "name": "send_email",
        "description": "Sends an email to a specified recipient.",
        "parameters": {
            "type": "dict",
            "required": [
                "recipient_email",
                "subject",
                "message"
            ],
            "properties": {
                "recipient_email": {
                    "type": "string",
                    "description": "The email address of the recipient."
                },
                "subject": {
                    "type": "string",
                    "description": "The subject of the email."
                },
                "message": {
                    "type": "string",
                    "description": "The body of the email."
                }
            }
        }
    }
]
"""

# This is the suggested system prompt from Meta
system_prompt = """You are an expert in composing functions. You are given a question and a set of possible functions. 
Based on the question, you will need to make one or more function/tool calls to achieve the purpose. 
If none of the function can be used, point it out. If the given question lacks the parameters required by the function,
also point it out. You should only return the function call in tools call sections.

If you decide to invoke any of the function(s), you MUST put it in the format of [func_name1(params_name1=params_value1, params_name2=params_value2...), func_name2(params)]\n
You SHOULD NOT include any other text in the response.

Here is a list of functions in JSON format that you can invoke.\n\n{functions}\n""".format(functions=function_definitions)
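To round out the snippet, a minimal continuation that loads the model and generates a tool call might look like the following. The user message and generation settings are illustrative and not from the original run, and depending on your transformers version you may need to apply the tokenizer's chat template manually instead of passing the message list directly to the pipeline.

# Illustrative continuation: the user message below is a made-up example, and the
# Llama 3.2 repository on HuggingFace is gated, so you need approved access.
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Search Google for the top 3 articles on AI agents and email a summary to jane@example.com"},
]

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

outputs = pipe(messages, max_new_tokens=256)
print(outputs[0]["generated_text"][-1]["content"])
# Expected shape of the response: [search_google(query="AI agents", num_results=3), send_email(...)]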
Image by author sample output demonstrating generated tool call from Llama 3.2–3B-Instruct

From here we can move to Level 3 where we’re defining Agents that execute the tool-calls generated by the language model.

Level 3 Agents (invoking/executing LLM tool-calls): Agents often express reasoning through both planning and execution as well as tool calling, making them an increasingly important aspect of AI-based applications. Using libraries like LangGraph, AutoGen, Semantic Kernel, or LlamaIndex, you can quickly create an agent using models like GPT-4o or Llama 3.1 405B, which support both conversations with the user and tool execution.
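Before reaching for a full framework, it can be instructive to see how small the execution step really is. The sketch below parses the bracketed call format from the Meta system prompt shown earlier and dispatches to plain Python functions; the tool implementations are toy stand-ins, and in real applications frameworks like LangGraph or AutoGen handle this loop (plus state, retries, and memory) for you.

import ast

def execute_tool_calls(model_output: str, tools: dict):
    # Parse output like [search_google(query="AI agent news", num_results=3)]
    # and call the matching Python functions. Minimal sketch: no error handling or validation.
    tree = ast.parse(model_output.strip(), mode="eval")
    results = []
    for call in tree.body.elts:  # each element of the bracketed list is an ast.Call node
        name = call.func.id
        kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
        results.append(tools[name](**kwargs))
    return results

# Toy tool implementations purely for illustration.
def search_google(query: str, num_results: int = 10):
    return f"Top {num_results} results for '{query}'"

def send_email(recipient_email: str, subject: str, message: str):
    return f"Email sent to {recipient_email}"

registry = {"search_google": search_google, "send_email": send_email}
print(execute_tool_calls('[search_google(query="AI agent news", num_results=3)]', registry))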

Check out these guides for some exciting examples of agents in action:

Conclusion:

The future of agentic systems will be driven by models with strong reasoning abilities enabling them to effectively interact with their environment. As the field evolves, I expect we will continue to see a proliferation of smaller, specialized models focused on specific tasks like tool-calling and planning.

It’s important to consider the current limitations of model sizes when building agents. For example, according to the Llama 3.1 model card, the Llama 3.1–8B model is not reliable for tasks that involve both maintaining a conversation and calling tools. Instead, larger models with 70B+ parameters should be used for these types of tasks. This alongside other emerging research for fine-tuning small language models suggests that smaller models may serve best as specialized tool-callers while larger models may be better for more advanced reasoning. By combining these abilities, we can build increasingly effective agents that provide a seamless user experience and allow people to leverage these reasoning abilities in both professional and personal endeavors.

Interested in discussing further or collaborating? Reach out on LinkedIn!

The post AI Agents: The Intersection of Tool Calling and Reasoning in Generative AI appeared first on Towards Data Science.

Navigating the Latest GenAI Model Announcements – July 2024 https://towardsdatascience.com/navigating-the-latest-genai-model-announcements-july-2024-461f227f588f/ Fri, 26 Jul 2024 13:32:37 +0000 https://towardsdatascience.com/navigating-the-latest-genai-model-announcements-july-2024-461f227f588f/ A comprehensive guide to new models GPT4o-mini, Llama 3.1, Mistral NeMo 12B and other GenAI trends

The post Navigating the Latest GenAI Model Announcements – July 2024 appeared first on Towards Data Science.

Navigating the Latest GenAI Announcements – July 2024

A guide to new models GPT-4o mini, Llama 3.1, Mistral NeMo 12B and other GenAI trends

Image Created by Author with GPT-4o to represent different models

Introduction

Since the launch of ChatGPT in November 2022, it feels like almost every week there's a new model, novel prompting approach, innovative agent framework, or other exciting GenAI breakthrough. July 2024 is no different: this month alone we've seen the release of Mistral Codestral Mamba, Mistral NeMo 12B, GPT-4o mini, and Llama 3.1, amongst others. These models bring significant enhancements to areas like inference speed, reasoning ability, coding ability, and tool calling performance, making them a compelling choice for business use.

In this article we’ll cover the highlights of recently released models and discuss some of the major trends in GenAI today, including increasing context window sizes and improving performance across languages and modalities.

Overview of July Release Models

Mistral Codestral Mamba

  • Overview: Codestral Mamba 7B is designed for enhanced reasoning and coding capabilities using the Mamba architecture instead of the Transformer architecture used by most Language Models. This architecture enables in context retrieval for much longer sequences and has been tested for sequences up to 256K tokens. By comparison, most Transformer based models allow between 8-128K token context windows. The Mamba architecture also enables faster inference speeds than Transformer based models.
  • Availability: Codestral Mamba is an open source model under the Apache 2.0 License.
  • Performance: Codestral Mamba 7B outperforms CodeGemma-1.1 7B, CodeLlama 7B, and DeepSeekv1.5 7B on the HumanEval, MBPP, CruxE, HumanEval C++, and HumanEval JavaScript benchmarks. It performs similarly to Codestral 22B across these benchmarks despite its smaller size.
Image created by author based on results from Mistral AI Codestral Mamba announcement

Mistral NeMo 12B

  • Overview: Mistral NeMo 12B was produced by Mistral and Nvidia to offer a competitive language model in the 12B parameter range with a far larger context window than most models of this size. NeMo 12B has a 128K token context window, while similarly sized models Gemma 2 9B and Llama 3 8B offer only 8K token context windows. NeMo is designed for multilingual use cases and provides a new tokenizer, Tekken, which outperforms the Llama 3 tokenizer for compressing text across 85% of languages. The HuggingFace model card indicates NeMo should be used with lower temperatures than earlier Mistral models; they recommend setting the temperature to 0.3.
  • Availability: NeMo 12B is an open source model (offering both base and instruction-tuned checkpoints) under the Apache 2.0 License.
  • Performance: Mistral NeMo 12B outperforms Gemma 2 9B and Llama 3 8B across multiple zero- and five-shot benchmarks by as much as 10%. It also performs almost 2x better than Mistral 7B on WildBench, which is designed to measure models' performance on real-world tasks requiring complex reasoning and multiple conversation turns.
Image created by author based on results from Mistral AI NeMo announcement

Mistral Large 2

  • Overview: Mistral Large 2 provides a 128K token context window, improved function calling, support for numerous languages and 80+ coding languages. Like Codestral Mamba and NeMo, Mistral Large 2 was trained on a large volume of code allowing it to perform competitively with GPT-4o, Claude 3 Opus, and Llama 3.1 405B. During training, the Mistral team focused on reducing the model’s likelihood of hallucinations making Mistral Large 2 more likely to respond that it cannot find an answer or lacks the information needed to provide a response.
  • Availability: Mistral Large 2 is available under the Mistral Research License. This allows experimentation and modification for research and non-commercial use cases. For those interested in using Mistral Large 2 commercially, you can contact Mistral AI directly and request a Mistral Commercial License.
  • Performance: Mistral Large 2 outperforms GPT-4o, and Claude 3 Opus on function calling tasks and performs similarly to these models on instruction following and alignment based tasks evaluated by the Wild Bench and Arena Hard benchmarks.

GPT-4o mini

  • Overview: GPT-4o mini is a small, cost effective model that supports text and vision and offers competitive reasoning and tool calling performance. It has a 128K token context window with an impressive 16K token output length. It is the most cost effective model from OpenAI at 15 cents per million input tokens and 60 cents per million output tokens. OpenAI notes that this price is 99% cheaper than their text-davinci-003 model from 2022 indicating a trend towards cheaper, smaller, more capable models in a relatively short time frame. While GPT-4o mini does not support image, video, and audio inputs like GPT-4o does, OpenAI reports these features are coming soon. Like GPT-4o, GPT-4o mini has been trained with built-in safety measures and is the first OpenAI model that applies the instruction hierarchy method designed to make the model more resistant to prompt injections and jailbreaks. GPT-4o mini leverages the same tokenizer as GPT-4o which enables improved performance on non-English text. Shortly after the GPT-4o mini announcement, OpenAI also announced an experimental 64K token output for GPT-4o available through their alpha program.
  • Availability: GPT-4o mini is a closed source model available through OpenAI’s Assistants API, Chat Completions API, and Batch API. It is also available through Azure AI.
  • Performance: GPT-4o mini outperforms Gemini Flash and Claude Haiku, models of similar size, on multiple benchmarks including MMLU (Massive Multitask Language Understanding) which is designed to measure reasoning ability, MGSM (Multilingual Grade School Math) which measures mathematical reasoning, HumanEval which measures coding ability, and MMMU (Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark) which measures multimodal reasoning.
Image by author based on results from GPT-4o mini announcement

Llama 3.1

  • Overview: Llama 3.1 introduces a 128K token context window, a significant jump from the 8K token context window for Llama 3, which was released only three months ago in April. Llama 3.1 is available in three sizes: 405B, 70B, and 8B. It offers improved reasoning, tool-calling, and multilingual performance. Meta’s Llama 3.1 announcement calls Llama 3.1 405B the "first frontier-level open source AI model". This demonstrates a huge stride forward for the open source community and demonstrates Meta’s commitment to making AI accessible, Mark Zuckerberg discusses this in more detail in his article "Open Source AI is the Path Forward". The Llama 3.1 announcement also includes guidance on enabling common use cases like real-time and batch inference, fine-tuning, RAG, continued pre-training, synthetic data generation, and distillation. Meta also released the Llama Reference System to support developers working on agentic based use cases with Llama 3.1 and additional AI safety tools including Llama Guard 3 to moderate inputs and outputs in multiple languages, Prompt Guard to mitigate prompt injections, and CyberSecEval 3 to reduce GenAI security risks.
  • Availability: Llama 3.1 is an open source model. Meta has changed their license to allow developers to use the outputs from Llama models to train and improve other models. Models are available through HuggingFace, llama.meta.com, and through other partner platforms like Azure AI.
  • Performance: Each of the Llama 3.1 models outperforms other models in its size class across nearly all the common language model benchmarks for reasoning, coding, math, tool use, long context, and multilingual performance.
Image by author based on results from Meta Llama 3.1 announcement

Trends in GenAI Models

Overall, there is a trend towards increasingly capable models of all sizes with longer context windows, longer token output lengths, and lower price points. The push towards improved reasoning, tool calling, and coding abilities reflects the increasing demand for agentic systems capable of taking complex actions on behalf of users. To create effective agent systems, models need to understand how to break down a problem, how to use the tools available to them, and how to reconcile lots of information at one time.

The recent announcements from OpenAI and Meta reflect the growing discussion around AI safety with both companies demonstrating different ways to approach the same challenge. OpenAI has taken a closed source approach and improved model safety through applying feedback from experts in social psychology and misinformation and implementing new training methods. In contrast, Meta has doubled down on their open source initiatives and released new tools focused on helping developers mitigate AI safety concerns.

Image created by author with GPT-4o depicting an arena with closed and open source models competing.

Conclusion

In the future, I think we’ll continue to see advancements in generalist and specialist models with frontier models like GPT-4o and Llama 3.1 getting better and better at breaking down problems and performing a variety of tasks across modalities, while specialist models like Codestral Mamba will excel in their domain and become more adept at handling longer contexts and nuanced tasks within their area of expertise. Additionally, I expect we’ll see new benchmarks focused on models’ ability to follow multiple directions at once within a single turn and a proliferation of AI systems that leverage generalist and specialist models to perform tasks as a team.

Furthermore, while model performance is typically measured based on standard benchmarks, what ultimately matters is how humans perceive the performance and how effectively models can further human goals. The Llama 3.1 announcement includes an interesting graphic demonstrating how people rated responses from Llama 3.1 compared to GPT-4o, GPT-4, and Claude 3.5. The results show that Llama 3.1 received a tie from humans in over 50% of the examples, with the remaining win rates roughly split between Llama 3.1 and its challenger. This is significant because it suggests that open source models can now readily compete in a league that was previously dominated by closed source models.

Interested in discussing further or collaborating? Reach out on LinkedIn!

The post Navigating the Latest GenAI Model Announcements – July 2024 appeared first on Towards Data Science.

Understanding Techniques for Solving GenAI Challenges https://towardsdatascience.com/understanding-techniques-for-solving-genai-challenges-83a7ad4650bd/ Thu, 20 Jun 2024 21:32:33 +0000 https://towardsdatascience.com/understanding-techniques-for-solving-genai-challenges-83a7ad4650bd/ Dive into model pre-training, fine-tuning, RAG, prompt engineering, and more!

The post Understanding Techniques for Solving GenAI Challenges appeared first on Towards Data Science.

Dive into model pre-training, fine-tuning, RAG, prompt engineering, and more!
Source: Author & GPT4o. Image is designed to show a language model learning and developing its brain!

Introduction

Generative AI adoption is rapidly increasing for both individuals and businesses. A recent Gartner study found that GenAI solutions are the number one AI solution used by organizations, with most companies leveraging GenAI features built into existing tools like Microsoft 365 Copilot. In my experience, most businesses are looking for some sort of "private ChatGPT" they can use to get more value from their unique organizational data. Company goals vary from finding information in particular documents, generating reports based on tabular data, and summarizing content, to finding all the projects related to some domain, and much more.

This article explores various approaches to solve these problems, outlining the pros, cons, and applications of each. My goal is to provide guidance on when to consider different approaches and how to combine them for the best outcomes, covering everything from the most complex and expensive approaches like pre-training to the simplest, most cost-effective techniques like prompt engineering.

The sequence of the article is intended to build from the foundational concepts for model training (pre-training, continued pre-training, and fine tuning) to the more commonly understood techniques (RAG and prompt engineering) for interacting with existing models.

Image by author

Background

There is no one-size-fits-all approach to tackling GenAI problems. Most use cases require a combination of techniques to achieve successful outcomes. Typically, organizations start with a model like GPT-4, Llama 3 70B Instruct, or DBRX Instruct that has been pretrained on trillions of tokens to perform next token prediction and then fine-tuned for a particular task, like instruction following or chat. Instruction-based models are trained and optimized to follow specific directions given in the prompt, while chat-based models are trained and optimized to handle conversational formats over multiple turns, maintaining context and coherence throughout the conversation.

Using existing models allows organizations to take advantage of the significant time and financial investments made by companies like OpenAI, Meta, and Databricks to curate datasets, create innovative architectures, and train and evaluate their models.

Although not every company will need to pre-train or instruction fine-tune their models, anyone using a Large Language Model (LLM) benefits from the groundwork laid by these industry leaders. This foundation allows other companies to address their unique challenges without starting from scratch.

In the following sections, we’ll explore pre-training, fine-tuning (both instruction fine-tuning, and continued pre-training), Retrieval Augmented Generation (RAG), fine-tuning embeddings for RAG, and prompt engineering, discussing how and when each of these approaches should be used or considered.

Setting the Baseline with Pre-Training

Overview: Pre-Training a model creates a foundation which will be used as a base for all downstream tasks. This process includes defining the architecture for the model, curating a massive dataset (generally trillions of tokens), training the model, and evaluating its performance. In the context of LLMs and SLMs, the pre-training phase is used to inject knowledge into the model, enabling it to predict the next word or token in a sequence. For instance, in the sentence "the cat sat on the ___", the model learns to predict "mat".
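As a tiny illustration of what next-token prediction looks like in code, the sketch below asks a small pre-trained model for its most likely continuation. GPT-2 is used here only because it is small and freely available, and the exact token it predicts may differ from the example above.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# The logits at the last position give the distribution over the next token.
next_token_id = logits[0, -1].argmax()
print(tokenizer.decode(next_token_id))  # a plausible continuation such as " mat" or " floor"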

Companies like OpenAI have invested heavily in the pre-training phase for their GPT models, but since models like GPT-3.5, GPT-4, and GPT-4o are closed source it is not possible to use the underlying architecture and pre-train the model on a different dataset with different parameters. However, with resources like Mosaic AI’s pre-training API it is possible to pre-train open source models like DBRX.

Pros:

  • Complete control: The benefit of pre-training a model is that you'd have complete control over the entire process to create the model. You can tailor the architecture, dataset, and training parameters to your needs and test it with evaluations representative of your domain instead of focusing primarily on common benchmarks.
  • Inherent domain specific knowledge: By curating a dataset focused on a particular domain, the model can develop a deeper understanding of that domain compared to a general purpose model.

Cons:

  • Most expensive option: Pre-training requires an extreme amount of computational power (many, many GPUs) which means the cost of pre-training is typically in the millions to tens or hundreds of millions of dollars and often takes weeks to complete the training.
  • Knowledge cutoffs: The final model is also completed at a certain point in time, so it will have no inherent understanding of real time information unless augmented by techniques like RAG or function-calling.
  • Advanced requirements: This approach requires the most data and the most advanced expertise to achieve meaningful results.

Applications: Generally, pre-training your own model is only necessary if none of the other approaches are sufficient for your use case. For example, if you wanted to train a model to understand a new language it has no previous exposure to, you may consider pre-training it then fine-tuning it for your intended use.

Once the base training is complete, the models typically need to be fine-tuned so that they can perform tasks effectively. When you see a model labeled as a chat or instruct model, that indicates the base model has been fine-tuned for either of those purposes. Nearly any model you interact with today has been fine-tuned for one of these purposes so that end users can interact with the model efficiently.

Given the incredible cost and intensive process required to pre-train a model, most organizations decide to leverage existing models in their GenAI use cases. To get started with pretraining, check out Mosaic AI's pretraining API; this allows you to pretrain a Databricks DBRX model with different parameter sizes.

Image by author. Overview of LLM and SLM pre-training.

Adding Knowledge with Continued Pre-Training (CPT)

Overview: CPT is a type of fine-tuning that extends the knowledge of an existing model rather than training the entire model from scratch. The output of a model that's gone through CPT will still predict the next token. In general, it's recommended that you use CPT and then Instruction Fine-Tuning (IFT); this way you can extend the model's knowledge first, then tune it to a particular task like following instructions or chat. If done in the reverse order, the model may forget instructions that it learned during the IFT phase.

Pros:

  • No need for labeled training data: CPT does not require labeled training data. This is great if you have a lot of domain-specific or new information you want to teach the model in general. Since the output is still focused on next token prediction, the output from CPT is helpful if you want a text-completion model.
  • Faster and more cost effective than pre-training: CPT can be completed in hours or days using fewer GPUs than pre-training, making it faster and cheaper!

Cons:

  • Still comparatively expensive: CPT is significantly cheaper than pre-training, but can still be expensive and cost tens of thousands of dollars to train a model depending on the volume of data and number of GPUs required.
  • Requires curated evaluations: Additionally, you will need to create your own evaluations to make sure the model is performing well in the new domain you are teaching it.
  • Typically requires subsequent IFT: For most use cases, you would still need to perform IFT on the model once CPT finishes so that the final model can properly respond to questions or chats. This ultimately increases the time and cost until you have a model ready for use.

Applications: For industries with highly domain-specific content like healthcare or legal, CPT may be a great option for introducing new topics to the model. With tools like Mosaic AI's Fine-Tuning API you can easily get started with CPT; all you need is a series of text files you want to use for training. For the CPT process, all the text files will be concatenated with a separation token between each of the documents; Mosaic handles the complexity behind the scenes for how these files get fed to the model for training.
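Conceptually, the data preparation step amounts to the sketch below: gather your raw text files and join them with a separator token between documents. The "<|endoftext|>" separator and folder name are illustrative conventions, not requirements; the actual token, packing, and sharding are handled for you by the training framework.

from pathlib import Path

SEPARATOR = "<|endoftext|>"  # common convention; the real separator depends on your tokenizer

docs = [p.read_text() for p in sorted(Path("cpt_corpus").glob("*.txt"))]
training_blob = SEPARATOR.join(docs)
Path("cpt_training_blob.txt").write_text(training_blob)
print(f"Prepared {len(docs)} documents for continued pre-training.")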

As an example, let's say we used CPT with a series of text files about responsible AI and AI policies. If I prompt the model to "Tell me three principles important to Responsible AI", I would likely get a continuation that has a high probability of following the sentence I prompted, like "I need to understand the key Responsible AI principles so I can train an effective model". Although this response is related to my prompt, it does not directly answer the question. This demonstrates the need for IFT to refine the model's instruction-following capabilities.

Image by author inspired by Continual Learning for Large Language Models: A Survey

Tailoring Responses with Instruction Fine-Tuning (IFT)

Overview: IFT is used to teach a model how to perform a particular task. It typically requires thousands of examples and can be used for a specific purpose such as improving question answering, extracting key information, or adopting a certain tone.

Pros:

  • Speed and cost-effectiveness: IFT takes significantly less time to complete; this type of training can be achieved in minutes, making it not only faster but also much cheaper compared to pre-training or CPT.
  • Task-specific customization: This is a great method to get tailored results out of the model by guiding it to respond in a particular tone, classify documents, revise certain documents, and more.

Cons:

  • Requires labeled dataset: IFT needs labeled data to teach the model how it should behave. While there are many open-source datasets available, it may take time to properly create and label a dataset for your unique use case.
  • Potential decrease in general capabilities: Introducing new skills through IFT may reduce the model's performance on general tasks. If you are concerned about maintaining the model's ability to generalize, you may want to include examples of general skills in your training and evaluation set; this way you can measure performance on general tasks as well as the new skill(s) you are teaching.

Applications: IFT helps the model perform particular tasks like question answering much better. Using the prompt "Tell me three principles important to Responsible AI", a model that had undergone IFT would likely respond with an answer to the question like "Responsible AI is critical for ensuring the ethical use of models grounded in core principles like transparency, fairness, and privacy. Following responsible AI principles helps align the solution with broader societal values and ethical standards". This response is more useful to the end user compared to a response that may come from a CPT or PT model only since it addresses the question directly.

Note that there are a variety of fine-tuning approaches and techniques designed to improve the overall model performance and reduce both time and cost associated with training.

Image by author. Inspired by Continual Learning for Large Language Models: A Survey

Finding real-time or private information with Retrieval Augmented Generation (RAG)

Overview: RAG enables language models to answer questions using information outside of their training data. In the RAG process, a user query triggers a retrieval of relevant information from a vector index, which is then integrated into a new prompt along with the original query to generate a response. This technique is one of the most common techniques used today due to its effectiveness and simplicity.
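A minimal version of that loop fits in a few lines. The sketch below uses an in-memory list in place of a real vector index, a small open embedding model, and the OpenAI chat API for generation; every model name, document, and query here is a placeholder for your own data and stack.

import numpy as np
from openai import OpenAI
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "Our PTO policy allows 20 days of paid vacation per year.",
    "Expense reports must be submitted by the 5th of each month.",
    "The annual security training is due every October.",
]
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

query = "How many vacation days do employees get?"
query_vector = embedder.encode([query], normalize_embeddings=True)[0]
top_k = np.argsort(doc_vectors @ query_vector)[::-1][:2]  # cosine similarity on normalized vectors
context = "\n".join(documents[i] for i in top_k)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": f"Answer the question using only the context below and cite the lines you used.\n\nContext:\n{context}\n\nQuestion: {query}",
    }],
)
print(response.choices[0].message.content)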

Pros:

  • Access to real-time information & information beyond training data: RAG allows models to draw on information from diverse and constantly updated sources like the internet or internal document datastores. Anything that can be stored in a vector index or retrieved via a plugin/tool can be used in the RAG process.
  • Ease of implementation: RAG does not require custom training making it both cost-effective and straightforward to get started. It’s also a very well documented and researched area with many articles providing insights on how to improve responses from RAG systems.
  • Traceability and citations: All generated responses can include citations for which documents were used to answer the query, making it easy for the user to verify the information and understand how the response was generated. Since you know exactly what information got sent to the model to answer the question, it's easy to provide traceable answers to the end user, and if needed the end user can look at the referenced documents for more information. In comparison, if you are querying a model directly, it's difficult to know how it answered that question or what references were used to generate the response.

Cons:

  • Context window limitations: The first major problem is the context window size of different models: some models like GPT-4 and GPT-4o have a 128K token context window, while the Llama 3 series is still only 8K tokens. With smaller context windows, you cannot pass as much information to the model to answer the question. As a result, it becomes more important to have robust chunking and chunk re-ranking techniques in place so you can retrieve the right context and use this to respond to the user correctly.
  • The "Lost in the Middle Problem": Even with longer context windows, there is a common "lost in the middle problem" where models tend to pay more attention to information at the beginning or end of the prompt, meaning that if the answer to the question lies in the middle of the context, the model may still answer incorrectly even when presented with all the information needed to answer the question. Similarly, the models might mix up information they’ve retrieved and answer the question only partially correct. For example, I’ve seen when asking a model to find information about two companies and return their point of view on AI, the model has on occasion mixed up the companies policies.
  • Top K Retrieval Challenge: In typical RAG pipelines, only the top K documents (or chunks of text) related to the query are retrieved and sent to the model for a final response. This pattern yields better results when looking for specific details in a document corpus, but often fails to correctly answer exhaustive search based questions. For example, the prompt "give me all of the documents related to responsible AI" would need additional logic to keep searching through the vector index for all responsible AI documents instead of stopping after returning the first top K related chunks.
  • Overly similar documents: If the vector index contains documents that are all semantically similar, it might be difficult for the model to retrieve the exact document relevant to the task. This is particularly true in specialized domains or domains with uniform language. This may not be a problem in a vector index where the content of all the documents is diverse, however, if you’re using RAG against an index on medical documents where all the language is very similar and not something a typical embedding model would be trained on, it might be harder to find the documents / answers you’re looking for.

Applications: Any use case involving question answering over a set of documents will typically involve RAG. It's a very practical way to get started with Generative AI and requires no additional model training. Emerging AI Agent implementations also tend to have at least one tool for RAG, often with RAG-based tools for different data sources. For example, an internal support agent might have access to an HR tool and an IT support tool. In this set-up there could be a RAG component for both the HR and IT documents; each tool could have the same pipeline running behind the scenes, with the only difference being the document dataset.

Image by author. Overview of the RAG process.

Improving the R in RAG by Fine-Tuning Embeddings

Overview: Fine-tuning embeddings can improve the retrieval component of RAG. The goal of fine-tuning embeddings is to push the vector embeddings further apart in the vector space, making them more distinct from one another and therefore making it easier to find the documents most relevant to the question.

Pros:

  • Generally cost-effective: Fine-tuning embeddings is comparatively inexpensive when considering other training methods.
  • Domain-specific customization: This method can be a great option for distinguishing text in domains that the underlying embedding model was not as exposed to during training. For example, highly specific legal or health care documents may benefit from fine-tuning embeddings for those corpuses in their RAG pipeline.

Cons:

  • Requires labeled data & often re-training: A labeled dataset is needed to fine-tune an embedding model. Additionally, you may need to continuously re-train the embedding model as you add additional information to your index.
  • Additional maintenance across indexes: Depending on how many data sources you’re querying you also might have to keep track of multiple sets of embedding models and their related data sources. It’s important to remember that whatever embedding model was used to embed the corpus of documents must be the same model used to embed the query when it’s time to retrieve relevant information. If you are querying against multiple indexes, each embedded using a different embedding model, then you’ll need to make sure that your models match at the time of retrieval.

Applications: Fine-tuning embeddings is a great option if the traditional RAG approach is not effective because the documents in your index are too similar to one another. By fine-tuning the embeddings you can teach the model to differentiate better between domain specific concepts and improve your RAG results.
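As a rough sketch of what this looks like in practice, the example below fine-tunes a small embedding model on a handful of domain synonym pairs with the sentence-transformers library; the base model, training pairs, and hyperparameters are all placeholders for your own corpus.

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Labeled pairs of text that should be pulled together in the vector space.
train_examples = [
    InputExample(texts=["myocardial infarction", "heart attack"]),
    InputExample(texts=["cerebrovascular accident", "stroke"]),
    InputExample(texts=["renal insufficiency", "reduced kidney function"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)
model.save("domain-tuned-embedder")  # use this same model to embed both your documents and your queries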

Talking to Models with Prompt Engineering

Overview: Prompt engineering is the most common way to interact with Generative AI models: it simply means crafting the message you send to the model so that it produces the output you want. A prompt can be as simple as "Tell me a story about a German Shepherd" or incredibly complicated, with detailed instructions about exactly what you’d like the model to do.
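As a minimal example of what sending a prompt looks like in code, here is a sketch using the OpenAI Python client; the model name is a placeholder and any chat-capable model would work the same way.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        # The system message sets the model's role; the user message is the request itself.
        {"role": "system", "content": "You are a children's author who writes short, upbeat stories."},
        {"role": "user", "content": "Tell me a story about a German Shepherd who learns to surf."},
    ],
)
print(response.choices[0].message.content)
```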

Pros:

  • Immediate results: Experimenting with different prompts can be done anytime you have access to a language model and results are returned in seconds (or less)! As soon as the idea hits, you can begin working on refining a prompt until the model gives the desired response.
  • High performance on general tasks: Prompt engineering alone works great for generic tasks that do not require any retrieval of business specific information or real-time information.
  • Compatibility with other techniques: It will work with models that have been pre-trained, continuously pre-trained, or fine-tuned, and it can be used in conjunction with RAG, making it the most widely used and versatile of these approaches.

Cons:

  • Limited capability on its own: Prompt engineering alone is typically not enough to get the model to perform the way you want. In most cases, people want the model to interact with some external data, whether that’s a document database, an API, or a SQL table, which requires combining prompt engineering with RAG or other specialized tool calling.
  • Precision challenges: Writing the perfect prompt can be challenging and often requires a lot of tweaking until the model performs as intended. The prompt that works great with one model might fail miserably with another, requiring lots of iterations and experimentation across many models and prompt variations.

Applications: Prompt engineering is used in combination with all of the aforementioned techniques to produce the intended response. There are many prompt engineering techniques to help steer the model in the right direction; for more information, I recommend the Prompt Engineering Guide from Microsoft, which gives a variety of examples from Chain-of-Thought prompting and beyond.
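For instance, a simple few-shot Chain-of-Thought prompt just shows the model a worked example and asks it to reason step by step before answering; the wording and numbers below are purely illustrative.

```python
cot_prompt = """Q: A store sells pens in packs of 12 for $3. How much do 60 pens cost?
A: Let's think step by step.
60 pens is 60 / 12 = 5 packs. 5 packs at $3 each is 5 * 3 = $15. The answer is $15.

Q: A bus makes 4 trips a day and carries 30 passengers per trip. How many passengers does it carry in 7 days?
A: Let's think step by step.
"""
# Sending this prompt nudges the model to continue with its own step-by-step
# reasoning before committing to a final answer.
```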

Image by author. Overview of Prompt Engineering Process.

Conclusion

Generative AI technology is changing and improving every day. Most applications will require leveraging a variety of the techniques described in this article. Getting started with existing language models that have been fine-tuned for instruction or chat capabilities, and focusing on prompt engineering and RAG, is a great place to start! From there, finding more tailored use cases that require fine-tuning or instruction fine-tuning can provide even greater benefits.

Looking ahead, AI agents offer a promising way to take advantage of the latest advancements in both closed and open-source models that have been pre-trained on tons of public data and fine-tuned for chat/instruction following. When given the right tools, these agents can perform incredible tasks on your behalf, from information retrieval with RAG to helping plan company events or vacations.

Additionally, we can expect to see a proliferation of more domain-specific models as organizations with lots of specialized data begin pre-training their own models. For instance, companies like Harvey are already developing tailored AI solutions that can handle the unique demands of the legal industry. This trend will likely continue, leading to highly specialized models that deliver even more precise and relevant results in various fields.

By combining the strengths of different AI techniques and leveraging the power of AI agents and domain-specific models, organizations can unlock the full potential of Generative AI.


Interested in discussing further or collaborating? Reach out on LinkedIn!

The post Understanding Techniques for Solving GenAI Challenges appeared first on Towards Data Science.

Are Language Models Benchmark Savants or Real-World Problem Solvers? https://towardsdatascience.com/are-language-models-benchmark-savants-or-real-world-problem-solvers-725a7e1524e1/ Sat, 23 Mar 2024 03:33:14 +0000 https://towardsdatascience.com/are-language-models-benchmark-savants-or-real-world-problem-solvers-725a7e1524e1/ Evaluating the evolution and application of language models on real world tasks

The post Are Language Models Benchmark Savants or Real-World Problem Solvers? appeared first on Towards Data Science.

AI students taking an exam in a classroom. Image created by author and DALL-E 3.

In the realm of education, the best exams are those that challenge students to apply what they’ve learned in new and unpredictable ways, moving beyond memorizing facts to demonstrate true understanding. Our evaluations of language models should follow the same pattern. As new models flood the AI space every day, whether from giants like OpenAI and Anthropic or from smaller research teams and universities, it’s critical that our model evaluations dive deeper than performance on standard benchmarks. Emerging research suggests that the benchmarks we’ve relied on to gauge model capability are not as reliable as we once thought. For us to champion new models appropriately, our benchmarks must evolve to be as dynamic and complex as the real-world challenges we’re asking these models and emerging AI agent architectures to solve.

In this article we will explore the complexity of language model evaluation by answering the following questions:

  1. How are language models evaluated today?
  2. How reliable are language models that excel on benchmarks?
  3. Can language models and AI agents translate knowledge into action?
  4. Why should language models (or foundation models) master more than text?

So, how are language models evaluated today?

Today, most models, whether Large Language Models (LLMs) or Small Language Models (SLMs), are evaluated on a common set of benchmarks, including the Massive Multitask Language Understanding (MMLU), Grade School Math (GSM8K), and Big-Bench Hard (BBH) datasets, amongst others.

To provide a deeper understanding of the types of tasks each benchmark evaluates, here are some sample questions from each dataset:

  • MMLU: Designed to measure information the model learned during pre-training across a variety of STEM and humanities subjects, with difficulty levels ranging from elementary to advanced professional understanding, using multiple-choice questions (a sketch of how these are typically scored follows this list). Example college medicine question in MMLU: "In a genetic test of a newborn, a rare genetic disorder is found that has X-linked recessive transmission. Which of the following statements is likely true regarding the pedigree of the disorder? A. All descendants on the maternal side will have the disorder B. Females will be approximately twice as affected as males in their family. C. All daughters of an affected male will be affected. D. There will be equal distribution of males and females affected." (Correct answer is C) [2]

  • GSM8K: Language models typically struggle to solve math questions; the GSM8K dataset evaluates a model's ability to reason about and solve math problems using 8.5k diverse grade school math problems. Example: "Dean’s mother gave him $28 to go to the grocery store. Dean bought 6 toy cars and 5 teddy bears. Each toy car cost $12 and each teddy bear cost $1. His mother then feels generous and decides to give him and extra $10. How much money does Dean have left?" [3]

  • BBH: This dataset consists of 23 tasks from the Big-Bench dataset that language models have traditionally struggled to solve. These tasks generally require multi-step reasoning to complete successfully. Example: "If you follow these instructions, do you return to the starting point? Turn left. Turn right. Take 5 steps. Take 4 steps. Turn around. Take 9 steps. Options: – Yes – No" [4]
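To make concrete how a multiple-choice benchmark like MMLU is typically scored, here is a rough sketch; the query_model helper is a hypothetical stand-in for whatever model you are evaluating, and the question is the MMLU example quoted above.

```python
def query_model(prompt: str) -> str:
    """Hypothetical stand-in for a call to the language model being evaluated."""
    return "C"  # dummy response so the sketch runs end to end

question = (
    "In a genetic test of a newborn, a rare genetic disorder is found that has "
    "X-linked recessive transmission. Which of the following statements is likely "
    "true regarding the pedigree of the disorder?"
)
choices = {
    "A": "All descendants on the maternal side will have the disorder.",
    "B": "Females will be approximately twice as affected as males in their family.",
    "C": "All daughters of an affected male will be affected.",
    "D": "There will be equal distribution of males and females affected.",
}
correct = "C"

prompt = (
    question
    + "\n"
    + "\n".join(f"{letter}. {text}" for letter, text in choices.items())
    + "\nAnswer with a single letter."
)
predicted = query_model(prompt).strip().upper()[:1]
is_correct = predicted == correct  # benchmark accuracy is the mean of this over all questions
```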

Anthropic’s recent announcement of Claude-3 shows their Opus model surpassing GPT-4 as the leading model on a majority of the common benchmarks. For example, Claude-3 Opus performed at 86.8% on MMLU, narrowly surpassing GPT-4 which scored 86.4%. Claude-3 Opus also scored 95% on GSM8K and 86.8% on BBH compared to GPT-4’s 92% and 83.1% respectively [1].

While the performance of models like GPT-4 and Claude on these benchmarks is impressive, these tasks are not always representative of the types of challenges businesses want to solve. Additionally, there is a growing body of research suggesting that models are memorizing benchmark questions rather than understanding them. This does not necessarily mean that the models aren’t capable of generalizing to new tasks (we see LLMs and SLMs perform amazing feats every day), but it does mean we should reconsider how we’re evaluating, scoring, and promoting models.

How reliable are language models that excel on benchmarks?

Research from Microsoft, the Institute of Automation CAS, and the University of Science and Technology, China demonstrates that when various language models are asked rephrased or modified benchmark questions, they perform significantly worse than when asked the same benchmark questions without modification. For their research, presented in the paper DyVal 2, the researchers took questions from benchmarks like MMLU and modified them by rephrasing the question, adding an extra answer option, rephrasing the answers, permuting the answers, or adding extra content to the question. Comparing model performance on the "vanilla" dataset against the modified questions, they saw a decrease in performance; for example, GPT-4 scored 84.4 on the vanilla MMLU questions and 68.86 on the modified MMLU questions [5].

Source: DyVal2, Model Performance on Vanilla Benchmarks Compared to Probing Benchmark
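The answer-permutation probe described above is straightforward to reproduce on your own evaluation questions; the sketch below (with illustrative placeholder choices) shuffles the options and tracks where the correct answer moves, so accuracy can be compared before and after the shuffle.

```python
import random

def permute_choices(choices: dict[str, str], answer_key: str, seed: int = 0):
    """Shuffle the answer texts across the letters and return the new letter of the
    correct answer. Comparing accuracy before and after the shuffle is a simple
    probe for memorization of the original answer positions."""
    letters = list(choices.keys())  # e.g. ["A", "B", "C", "D"]
    texts = list(choices.values())
    random.Random(seed).shuffle(texts)
    shuffled = dict(zip(letters, texts))
    correct_text = choices[answer_key]
    new_answer_key = next(k for k, v in shuffled.items() if v == correct_text)
    return shuffled, new_answer_key

example = {"A": "Yes", "B": "No", "C": "Maybe", "D": "Not enough information"}
shuffled, new_key = permute_choices(example, answer_key="C")
print(shuffled, new_key)
```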

Similarly, research from the Department of Computer Science at the University of Arizona indicates that there is a significant amount of data contamination in language models [6]. This means the information in the benchmarks is becoming part of the models' training data, effectively making benchmark scores irrelevant since the models are being tested on information they were trained on.

Additional research from Fudan University, Tongji University, and Alibaba highlights the need for self-evolving dynamic evaluations for AI agents to combat the issues of data contamination and benchmark memorization [7]. These dynamic benchmarks will help prevent models from memorizing or learning information during pre-training that they’d later be tested on. Although a recurring influx of new benchmarks may create challenges when comparing an older model to a newer model, ideally these benchmarks will mitigate issues of data contamination and make it easier to gauge how well a model understands topics from training.

When evaluating model capability for a particular problem, we need to grasp both how well the model understands information learned during pre-training and how well it can generalize to novel tasks or concepts beyond its training data.

Can language models and AI agents translate knowledge into action?

As we look to use models as AI agents to perform actions on our behalf, whether that’s booking a vacation, writing a report, or researching new topics for us, we’ll need additional benchmarks or evaluation mechanisms that can assess the reliability and accuracy of these agents. Most businesses looking to harness the power of foundation models need to give the model access to a variety of tools integrated with their unique data sources, and need the model to reason and plan when and how to use those tools effectively. These types of tasks are not represented in many traditional LLM benchmarks.

Source: AgentVerse, results from team of agents compared to single agent on software development task involving tool calling and code execution

To address this gap, many research teams are creating their own benchmarks and frameworks that evaluate agent performance on tasks involving tool use and knowledge outside of the model’s training data. For example, the authors of AgentVerse evaluated how well teams of agents could perform real-world tasks involving event planning, software development, and consulting. The researchers created their own set of 10 test tasks, which were manually evaluated to determine whether the agents performed the right set of actions, used the proper tools, and reached an accurate result. They found that teams of agents operating in a cycle with defined stages for agent recruitment, task planning, independent task execution, and subsequent evaluation led to superior outcomes compared to independent agents [8].

Beyond single modalities and into the real world: Why should language models (or foundation models) master more than text?

In my opinion, the emerging agent architectures and benchmarks are a great step towards understanding how well language models will perform on business-oriented problems, but one limitation is that most are still text-focused. As we consider the world and the dynamic nature of most jobs, we will need evaluations for agent systems and models that cover performance on text-based tasks as well as visual and auditory tasks. The AlgoPuzzleVQA dataset is one example of evaluating models on their ability to reason, read, and visually interpret mathematical and algorithmic puzzles [9].

Source: Are Language Models Puzzle Prodigies? Example questions from AlgoPuzzleVQA dataset

While businesses may not be interested in how well a model can solve a puzzle, it is still a step in the right direction for understanding how well models can reason about multimodal information.

Conclusion

As we continue adopting foundation models in our daily routines and professional endeavors, we need additional evaluation options that mirror real world problems. Dynamic and multimodal benchmarks are one key component of this. However, as we introduce additional agent frameworks and architectures with many AI agents collaborating to solve a problem, evaluation and comparison across models and frameworks become even more challenging. The true measure of foundation models lies not in their ability to conquer standardized tests, but in their capacity to understand, adapt, and act within the complex and often unpredictable real world. By changing how we evaluate language models, we challenge these models to evolve from text-based intellects and benchmark savants to comprehensive thinkers capable of tackling multifaceted (and multimodal) challenges.

Interested in discussing further or collaborating? Reach out on LinkedIn!

The post Are Language Models Benchmark Savants or Real-World Problem Solvers? appeared first on Towards Data Science.
