Deep Dives | Towards Data Science

Generative AI Is Declarative

Michael Herman — Wed, 05 Mar 2025 16:36:00 +0000

ChatGPT launched in 2022 and kicked off the Generative Ai boom. In the two years since, academics, technologists, and armchair experts have written libraries worth of articles on the technical underpinnings of generative AI and about the potential capabilities of both current and future generative AI models.

Surprisingly little has been written about how we interact with these tools—the human-AI interface. The point where we interact with AI models is at least as important as the algorithms and data that create them. “There is no success where there is no possibility of failure, no art without the resistance of the medium” (Raymond Chandler). In that vein, it’s useful to examine human-AI interaction and the strengths and weaknesses inherent in that interaction. If we understand the “resistance in the medium” then product managers can make smarter decisions about how to incorporate generative AI into their products. Executives can make smarter decisions about what capabilities to invest in. Engineers and designers can build around the tools’ limitations and showcase their strengths. Everyday people can know when to use generative AI and when not to.

How to order a cheeseburger with AI

Imagine walking into a restaurant and ordering a cheeseburger. You don’t tell the chef how to grind the beef, how hot to set the grill, or how long to toast the bun. Instead, you simply describe what you want: “I’d like a cheeseburger, medium rare, with lettuce and tomato.” The chef interprets your request, handles the implementation, and delivers the desired outcome. This is the essence of declarative interaction—focusing on the what rather than the how.

Now, imagine interacting with a Large Language Model (LLM) like ChatGPT. You don’t have to provide step-by-step instructions for how to generate a response. Instead, you describe the result you’re looking for: “A user story that lets us implement A/B testing for the Buy button on our website.” The LLM interprets your prompt, fills in the missing details, and delivers a response. Just like ordering a cheeseburger, this is a declarative mode of interaction.

Explaining the steps to make a cheeseburger is an imperative interaction. Our LLM prompts sometimes feel imperative. We might phrase our prompts like a question: ”What is the tallest mountain on earth?” This is equivalent to describing “the answer to the question ‘What is the tallest mountain on earth?’” We might phrase our prompt as a series of instructions: ”Write a summary of the attached report, then read it as if you are a product manager, then type up some feedback on the report.” But, again, we’re describing the result of a process with some context for what that process is. In this case, it is a sequence of descriptive results—the report then the feedback.

This is a more useful way to think about LLMs and generative AI. In some ways it is more accurate; the neural network model behind the curtain doesn’t explain why or how it produced one output instead of another. More importantly though, the limitations and strengths of generative AI make more sense and become more predictable when we think of these models as declarative.

LLMs as a declarative mode of interaction

Computer scientists use the term “declarative” to describe coding languages. SQL is one of the most common. The code describes the output table and the procedures in the database figure out how to retrieve and combine the data to produce the result. LLMs share many of the benefits of declarative languages like SQL or declarative interactions like ordering a cheeseburger.

Focus on desired outcome: Just as you describe the cheeseburger you want, you describe the output you want from the LLM. For example, “Summarize this article in three bullet points” focuses on the result, not the process.
Abstraction of implementation: When you order a cheeseburger, you don’t need to know how the chef prepares it. When submitting SQL code to a server, the server figures out where the data lives, how to fetch it, and how to aggregate it based on your description. You as the user don’t need to know how. With LLMs, you don’t need to know how the model generates the response. The underlying mechanisms are abstracted away.
Filling in missing details: If you don’t specify onions on your cheeseburger, the chef won’t include them. If you don’t specify a field in your SQL code, it won’t show up in the output table. This is where LLMs differ slightly from declarative coding languages like SQL. If you ask ChatGPT to create an image of “a cheeseburger with lettuce and tomato” it may also show the burger on a sesame seed bun or include pickles, even if that wasn’t in your description. The details you omit are inferred by the LLM using the “average” or “most likely” detail depending on the context, with a bit of randomness thrown in. Ask for the cheeseburger image six times; it may show you three burgers with cheddar cheese, two with Swiss, and one with pepper jack.

Like other forms of declarative interaction, LLMs share one key limitation. If your description is vague, ambiguous, or lacks enough detail, then the result may not be what you hoped to see. It is up to the user to describe the result with sufficient detail.

This explains why we often iterate to get what we’re looking for when using LLMs and generative AI. Going back to our cheeseburger analogy, the process to generate a cheeseburger from an LLM may look like this.

“Make me a cheeseburger, medium rare, with lettuce and tomatoes.” The result also has pickles and uses cheddar cheese. The bun is toasted. There’s mayo on the top bun.
“Make the same thing but this time no pickles, use pepper jack cheese, and a sriracha mayo instead of plain mayo.” The result now has pepper jack, no pickles. The sriracha mayo is applied to the bottom bun and the bun is no longer toasted.
“Make the same thing again, but this time, put the sriracha mayo on the top bun. The buns should be toasted.” Finally, you have the cheeseburger you’re looking for.

This example demonstrates one of the main points of friction with human-AI interaction. Human beings are really bad at describing what they want with sufficient detail on the first attempt.

When we asked for a cheeseburger, we had to refine our description to be more specific (the type of cheese). In the second generation, some of the inferred details (whether the bun was toasted) changed from one iteration to the next, so then we had to add that specificity to our description as well. Iteration is an important part of AI-human generation.

Insight: When using generative AI, we need to design an iterative human-AI interaction loop that enables people to discover the details of what they want and refine their descriptions accordingly.

To iterate, we need to evaluate the results. Evaluation is extremely important with generative AI. Say you’re using an LLM to write code. You can evaluate the code quality if you know enough to understand it or if you can execute it and inspect the results. On the other hand, hypothetical questions can’t be tested. Say you ask ChatGPT, “What if we raise our product prices by 5 percent?” A seasoned expert could read the output and know from experience if a recommendation doesn’t take into account important details. If your product is property insurance, then increasing premiums by 5 percent may mean pushback from regulators, something an experienced veteran of the industry would know. For non-experts in a topic, there’s no way to tell if the “average” details inferred by the model make sense for your specific use case. You can’t test and iterate.

Insight: LLMs work best when the user can evaluate the result quickly, whether through execution or through prior knowledge.

The examples so far involve general knowledge. We all know what a cheeseburger is. When you start asking about non-general information—like when you can make dinner reservations next week—you delve into new points of friction.

In the next section we’ll think about different types of information, what we can expect the AI to “know”, and how this impacts human-AI interaction.

What did the AI know, and when did it know it?

Above, I explained how generative AI is a declarative mode of interaction and how that helps understand its strengths and weaknesses. Here, I’ll identify how different types of information create better or worse human-AI interactions.

Understanding the information available

When we describe what we want to an LLM, and when it infers missing details from our description, it draws from different sources of information. Understanding these sources of information is important. Here’s a useful taxonomy for information types:

General information used to train the base model.
Non-general information that the base model is not aware of.
- Fresh information that is new or changes rapidly, like stock prices or current events.
- Non-public information, like facts about you and where you live or about your company, its employees, its processes, or its codebase.

General information vs. non-general information

LLMs are built on a massive corpus of written word data. A large part of GPT-3 was trained on a combination of books, journals, Wikipedia, Reddit, and CommonCrawl (an open-source repository of web crawl data). You can think of the models as a highly compressed version of that data, organized in a gestalt manner—all the like things are close together. When we submit a prompt, the model takes the words we use (and any words added to the prompt behind the scenes) and finds the closest set of related words based on how those things appear in the data corpus. So when we say “cheeseburger” it knows that word is related to “bun” and “tomato” and “lettuce” and “pickles” because they all occur in the same context throughout many data sources. Even when we don’t specify pickles, it uses this gestalt approach to fill in the blanks.

This training information is general information, and a good rule of thumb is this: if it was in Wikipedia a year ago then the LLM “knows” about it. There could be new articles on Wikipedia, but that didn’t exist when the model was trained. The LLM doesn’t know about that unless told.

Now, say you’re a company using an LLM to write a product requirements document for a new web app feature. Your company, like most companies, is full of its own lingo. It has its own lore and history scattered across thousands of Slack messages, emails, documents, and some tenured employees who remember that one meeting in Q1 last year. The LLM doesn’t know any of that. It will infer any missing details from general information. You need to supply everything else. If it wasn’t in Wikipedia a year ago, the LLM doesn’t know about it. The resulting product requirements document may be full of general facts about your industry and product but could lack important details specific to your firm.

This is non-general information. This includes personal info, anything kept behind a log-in or paywall, and non-digital information. This non-general information permeates our lives, and incorporating it is another source of friction when working with generative AI.

Non-general information can be incorporated into a generative AI application in three ways:

Through model fine-tuning (supplying a large corpus to the base model to expand its reference data).
Retrieved and fed it to the model at query time (e.g., the retrieval augmented generation or “RAG” technique).
Supplied by the user in the prompt.

Insight: When designing any human-AI interactions, you should think about what non-general information is required, where you will get it, and how you will expose it to the AI.

Fresh information

Any information that changes in real-time or is new can be called fresh information. This includes new facts like current events but also frequently changing facts like your bank account balance. If the fresh information is available in a database or some searchable source, then it needs to be retrieved and incorporated into the application. To retrieve the information from a database, the LLM must create a query, which may require specific details that the user didn’t include.

Here’s an example. I have a chatbot that gives information on the stock market. You, the user, type the following: “What is the current price of Apple? Has it been increasing or decreasing recently?”

The LLM doesn’t have the current price of Apple in its training data. This is fresh, non-general information. So, we need to retrieve it from a database.
The LLM can read “Apple”, know that you’re talking about the computer company, and that the ticker symbol is AAPL. This is all general information.
What about the “increasing or decreasing” part of the prompt? You did not specify over what period—increasing in the past day, month, year? In order to construct a database query, we need more detail. LLMs are bad at knowing when to ask for detail and when to fill it in. The application could easily pull the wrong data and provide an unexpected or inaccurate answer. Only you know what these details should be, depending on your intent. You must be more specific in your prompt.

A designer of this LLM application can improve the user experience by specifying required parameters for expected queries. We can ask the user to explicitly input the time range or design the chatbot to ask for more specific details if not provided. In either case, we need to have a specific type of query in mind and explicitly design how to handle it. The LLM will not know how to do this unassisted.

Insight: If a user is expecting a more specific type of output, you need to explicitly ask for enough detail. Too little detail could produce a poor quality output.

Non-public information

Incorporating non-public information into an LLM prompt can be done if that information can be accessed in a database. This introduces privacy issues (should the LLM be able to access my medical records?) and complexity when incorporating multiple non-public sources of information.

Let’s say I have a chatbot that helps you make dinner reservations. You, the user, type the following: “Help me make dinner reservations somewhere with good Neapolitan pizza.”

The LLM knows what a Neapolitan pizza is and can infer that “dinner” means this is for an evening meal.
To do this task well, it needs information about your location, the restaurants near you and their booking status, or even personal details like dietary restrictions. Assuming all that non-public information is available in databases, bringing them all together into the prompt takes a lot of engineering work.
Even if the LLM could find the “best” restaurant for you and book the reservation, can you be confident it has done that correctly? You never specified how many people you need a reservation for. Since only you know this information, the application needs to ask for it upfront.

If you’re designing this LLM-based application, you can make some thoughtful choices to help with these problems. We could ask about a user’s dietary restrictions when they sign up for the app. Other information, like the user’s schedule that evening, can be given in a prompting tip or by showing the default prompt option “show me reservations for two for tomorrow at 7PM”. Promoting tips may not feel as automagical as a bot that does it all, but they are a straightforward way to collect and integrate the non-public information.

Some non-public information is large and can’t be quickly collected and processed when the prompt is given. These need to be fine-tuned in batch or retrieved at prompt time and incorporated. A chatbot that answers information about a company’s HR policies can obtain this information from a corpus of non-public HR documents. You can fine-tune the model ahead of time by feeding it the corpus. Or you can implement a retrieval augmented generation technique, searching a corpus for relevant documents and summarizing the results. Either way, the response will only be as accurate and up-to-date as the corpus itself.

Insight: When designing an AI application, you need to be aware of non-public information and how to retrieve it. Some of that information can be pulled from databases. Some needs to come from the user, which may require prompt suggestions or explicitly asking.

If you understand the types of information and treat human-AI interaction as declarative, you can more easily predict which AI applications will work and which ones won’t. In the next section we’ll look at OpenAI’s Operator and deep research products. Using this framework, we can see where these applications fall short, where they work well, and why.

Critiquing OpenAI’s Operator and deep research through a declarative lens

I have now explained how thinking of generative AI as declarative helps us understand its strengths and weaknesses. I also identified how different types of information create better or worse human-AI interactions.

Now I’ll apply these ideas by critiquing two recent products from OpenAI—Operator and deep research. It’s important to be honest about the shortcomings of AI applications. Bigger models trained on more data or using new techniques might one day solve some issues with generative AI. But other issues arise from the human-AI interaction itself and can only be addressed by making appropriate design and product choices.

These critiques demonstrate how the framework can help identify where the limitations are and how to address them.

The limitations of Operator

Journalist Casey Newton of Platformer reviewed Operator in an article that was largely positive. Newton has covered AI extensively and optimistically. Still, Newton couldn’t help but point out some of Operator’s frustrating limitations.

[Operator] can take action on your behalf in ways that are new to AI systems — but at the moment it requires a lot of hand-holding, and may cause you to throw up your hands in frustration.

My most frustrating experience with Operator was my first one: trying to order groceries. “Help me buy groceries on Instacart,” I said, expecting it to ask me some basic questions. Where do I live? What store do I usually buy groceries from? What kinds of groceries do I want?

It didn’t ask me any of that. Instead, Operator opened Instacart in the browser tab and begin searching for milk in grocery stores located in Des Moines, Iowa.

The prompt “Help me buy groceries on Instacart,” viewed declaratively, describes groceries being purchased using Instacart. It doesn’t have a lot of the information someone would need to buy groceries, like what exactly to buy, when it would be delivered, and to where.

It’s worth repeating: LLMs are not good at knowing when to ask additional questions unless explicitly programmed to do so in the use case. Newton gave a vague request and expected follow-up questions. Instead, the LLM filled in all the missing details with the “average”. The average item was milk. The average location was Des Moines, Iowa. Newton doesn’t mention when it was scheduled to be delivered, but if the “average” delivery time is tomorrow, then that was likely the default.

If we engineered this application specifically for ordering groceries, keeping in mind the declarative nature of AI and the information it “knows”, then we could make thoughtful design choices that improve functionality. We would need to prompt the user to specify when and where they want groceries up front (non-public information). With that information, we could find an appropriate grocery store near them. We would need access to that grocery store’s inventory (more non-public information). If we have access to the user’s previous orders, we could also pre-populate a cart with items typical to their order. If not, we may add a few suggested items and guide them to add more. By limiting the use case, we only have to deal with two sources of non-public information. This is a more tractable problem than Operator’s “agent that does it all” approach.

Newton also mentions that this process took eight minutes to complete, and “complete” means that Operator did everything up to placing the order. This is a long time with very little human-in-the-loop iteration. Like we said before, an iteration loop is very important for human-AI interaction. A better-designed application would generate smaller steps along the way and provide more frequent interaction. We could prompt the user to describe what to add to their shopping list. The user might say, “Add barbeque sauce to the list,” and see the list update. If they see a vinegar-based barbecue sauce, they can refine that by saying, “Replace that with a barbeque sauce that goes well with chicken,” and might be happier when it’s replaced by a honey barbecue sauce. These frequent iterations make the LLM a creative tool rather than a does-it-all agent. The does-it-all agent looks automagical in marketing, but a more guided approach provides more utility with a less frustrating and more delightful experience.

Elsewhere in the article, Newton gives an example of a prompt that Operator performed well: “Put together a lesson plan on the Great Gatsby for high school students, breaking it into readable chunks and then creating assignments and connections tied to the Common Core learning standard.” This prompt describes an output using much more specificity. It also solely relies on general information—the Great Gatsby, the Common Core standard, and a general sense of what assignments are. The general-information use case lends itself better to AI generation, and the prompt is explicit and detailed in its request. In this case, very little guidance was given to create the prompt, so it worked better. (In fact, this prompt comes from Ethan Mollick who has used it to evaluate AI chatbots.)

This is the risk of general-purpose AI applications like Operator. The quality of the result relies heavily on the use case and specificity provided by the user. An application with a more specific use case allows for more design guidance and can produce better output more reliably.

The limitations of deep research

Newton also reviewed deep research, which, according to OpenAI’s website, is an “agent that uses reasoning to synthesize large amounts of online information and complete multi-step research tasks for you.”

Deep research came out after Newton’s review of Operator. Newton chose an intentionally tricky prompt that prods at some of the tool’s limitations regarding fresh information and non-general information: “I wanted to see how OpenAI’s agent would perform given that it was researching a story that was less than a day old, and for which much of the coverage was behind paywalls that the agent would not be able to access. And indeed, the bot struggled more than I expected.”

Near the end of the article, Newton elaborates on some of the shortcomings he noticed with deep research.

OpenAI’s deep research suffers from the same design problem that almost all AI products have: its superpowers are completely invisible and must be harnessed through a frustrating process of trial and error.

Generally speaking, the more you already know about something, the more useful I think deep research is. This may be somewhat counterintuitive; perhaps you expected that an AI agent would be well suited to getting you up to speed on an important topic that just landed on your lap at work, for example.

In my early tests, the reverse felt true. Deep research excels for drilling deep into subjects you already have some expertise in, letting you probe for specific pieces of information, types of analysis, or ideas that are new to you.

The “frustrating trial and error” shows a mismatch between Newton’s expectations and a necessary aspect of many generative AI applications. A good response requires more information than the user will probably give in the first attempt. The challenge is to design the application and set the user’s expectations so that this interaction is not frustrating but exciting.

Newton’s more poignant criticism is that the application requires already knowing something about the topic for it to work well. From the perspective of our framework, this makes sense. The more you know about a topic, the more detail you can provide. And as you iterate, having knowledge about a topic helps you observe and evaluate the output. Without the ability to describe it well or evaluate the results, the user is less likely to use the tool to generate good output.

A version of deep research designed for lawyers to perform legal research could be powerful. Lawyers have an extensive and common vocabulary for describing legal matters, and they’re more likely to see a result and know if it makes sense. Generative AI tools are fallible, though. So, the tool should focus on a generation-evaluation loop rather than writing a final draft of a legal document.

The article also highlights many improvements compared to Operator. Most notably, the bot asked clarifying questions. This is the most impressive aspect of the tool. Undoubtedly, it helps that deep search has a focused use-case of retrieving and summarizing general information instead of a does-it-all approach. Having a focused use case narrows the set of likely interactions, letting you design better guidance into the prompt flow.

Good application design with generative AI

Designing effective generative AI applications requires thoughtful consideration of how users interact with the technology, the types of information they need, and the limitations of the underlying models. Here are some key principles to guide the design of generative AI tools:

1. Constrain the input and focus on providing details

Applications are inputs and outputs. We want the outputs to be useful and pleasant. By giving a user a conversational chatbot interface, we allow for a vast surface area of potential inputs, making it a challenge to guarantee useful outputs. One strategy is to limit or guide the input to a more manageable subset.

For example, FigJam, a collaborative whiteboarding tool, uses pre-set template prompts for timelines, Gantt charts, and other common whiteboard artifacts. This provides some structure and predictability to the inputs. Users still have the freedom to describe further details like color or the content for each timeline event. This approach ensures that the AI has enough specificity to generate meaningful outputs while giving users creative control.

2. Design frequent iteration and evaluation into the tool

Iterating in a tight generation-evaluation loop is essential for refining outputs and ensuring they meet user expectations. OpenAI’s Dall-E is great at this. Users quickly iterate on image prompts and refine their descriptions to add additional detail. If you type “a picture of a cheeseburger on a plate”, you may then add more detail by specifying “with pepperjack cheese”.

AI code generating tools work well because users can run a generated code snippet immediately to see if it works, enabling rapid iteration and validation. This quick evaluation loop produces better results and a better coder experience.

Designers of generative AI applications should pull the user in the loop early, often, in a way that is engaging rather than frustrating. Designers should also consider the user’s knowledge level. Users with domain expertise can iterate more effectively.

Referring back to the FigJam example, the prompts and icons in the app quickly communicate “this is what we call a mind map” or “this is what we call a gantt chart” for users who want to generate these artifacts but don’t know the terms for them. Giving the user some basic vocabulary can help them better generate desired results quickly with less frustration.

3. Be mindful of the types of information needed

LLMs excel at tasks involving general knowledge already in the base training set. For example, writing class assignments involves absorbing general information, synthesizing it, and producing a written output, so LLMs are very well-suited for that task.

Use cases that require non-general information are more complex. Some questions the designer and engineer should ask include:

Does this application require fresh information? Maybe this is knowledge of current events or a user’s current bank account balance. If so, that information needs to be retrieved and incorporated into the model.
How much non-general information does the LLM need to know? If it’s a lot of information—like a corpus of company documentation and communication—then the model may need to be fine tuned in batch ahead of time. If the information is relatively small, a retrieval augmented generation (RAG) approach at query time may suffice.
How many sources of non-general information—small and finite or potentially infinite? General purpose agents like Operator face the challenge of potentially infinite non-general information sources. Depending on what the user requires, it could need to access their contacts, restaurant reservation lists, financial data, or even other people’s calendars. A single-purpose restaurant reservation chatbot may only need access to Yelp, OpenTable, and the user’s calendar. It’s much easier to reconcile access and authentication for a handful of known data sources.
Is there context-specific information that can only come from the user? Consider our restaurant reservation chatbot. Is the user making reservations for just themselves? Probably not. “How many people and who” is a detail that only the user can provide, an example of non-public information that only the user knows. We shouldn’t expect the user to provide this information upfront and unguided. Instead, we can use prompt suggestions so they include the information. We may even be able to design the LLM to ask these questions when the detail is not provided.

4. Focus on specific use cases

Broad, all-purpose chatbots often struggle to deliver consistent results due to the complexity and variability of user needs. Instead, focus on specific use cases where the AI’s shortcomings can be mitigated through thoughtful design.

Narrowing the scope helps us address many of the issues above.

We can identify common requests for the use case and incorporate those into prompt suggestions.
We can design an iteration loop that works well with the type of thing we’re generating.
We can identify sources of non-general information and devise solutions to incorporate it into the model or prompt.

5. Translation or summary tasks work well

A common task for ChatGPT is to rewrite something in a different style, explain what some computer code is doing, or summarize a long document. These tasks involve converting a set of information from one form to another.

We have the same concerns about non-general information and context. For instance, a Chatbot asked to explain a code script doesn’t know the system that script is part of unless that information is provided.

But in general, the task of transforming or summarizing information is less prone to missing details. By definition, you have provided the details it needs. The result should have the same information in a different or more condensed form.

The exception to the rules

There is a case when it doesn’t matter if you break any or all of these rules—when you’re just having fun. LLMs are creative tools by nature. They can be an easel to paint on, a sandbox to build in, a blank sheet to scribe. Iteration is still important; the user wants to see the thing they’re creating as they create it. But unexpected results due to lack of information or omitted details may add to the experience. If you ask for a cheeseburger recipe, you might get some funny or interesting ingredients. If the stakes are low and the process is its own reward, don’t worry about the rules.

The post Generative AI Is Declarative appeared first on Towards Data Science.

Vision Transformers (ViT) Explained: Are They Better Than CNNs?

Rein Bugnot — Fri, 28 Feb 2025 22:12:11 +0000

1. Introduction

Ever since the introduction of the self-attention mechanism, Transformers have been the top choice when it comes to Natural Language Processing (NLP) tasks. Self-attention-based models are highly parallelizable and require substantially fewer parameters, making them much more computationally efficient, less prone to overfitting, and easier to fine-tune for domain-specific tasks [1]. Furthermore, the key advantage of transformers over past models (like RNN, LSTM, GRU and other neural-based architectures that dominated the NLP domain prior to the introduction of Transformers) is their ability to process input sequences of any length without losing context, through the use of the self-attention mechanism that focuses on different parts of the input sequence, and how those parts interact with other parts of the sequence, at different times [2]. Because of these qualities, Transformers has made it possible to train language models of unprecedented size, with more than 100B parameters, paving the way for the current state-of-the-art advanced models like the Generative Pre-trained Transformer (GPT) and the Bidirectional Encoder Representations from Transformers (BERT) [1].

However, in the field of computer vision, convolutional neural networks or CNNs, remain dominant in most, if not all, computer vision tasks. While there has been an increasing collection of research work that attempts to implement self-attention-based architectures to perform computer vision tasks, very few has reliably outperformed CNNs with promising scalability [3]. The main challenge with integrating the transformer architecture with image-related tasks is that, by design, the self-attention mechanism, which is the core component of transformers, has a quadratic time complexity with respect to sequence length, i.e. O(n2), as shown in Table I and as discussed further in Part 2.1. This is usually not a problem for NLP tasks that use a relatively small number of tokens per input sequence (e.g., a 1,000-word paragraph will only have 1,000 input tokens, or a few more if sub-word units are used as tokens instead of full words). However, in computer vision, the input sequence (the image) can have a token size with orders of magnitude greater than that of NLP input sequences. For example, a relatively small 300 x 300 x 3 image can easily have up to 270,000 tokens and require a self-attention map with up to 72.9 billion parameters (270,0002) when self-attention is applied naively.

Table I. Time complexity for different layer types [2].

For this reason, most of the research work that attempt to use self-attention-based architectures to perform computer vision tasks did so either by applying self-attention locally, using transformer blocks in conjunction with CNN layers, or by only replacing specific components of the CNN architecture while maintaining the overall structure of the network; never by only using a pure transformer [3]. The goal of Dr. Dosovitskiy, et. al. in their work, “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale”, is to show that it is indeed possible to implement image classification by applying self-attention globally through the use of the basic Transformer encoder architure, while at the same time requiring significantly less computational resources to train, and outperforming state-of-the-art convolutional neural networks like ResNet.

2. The Transformer

Transformers, introduced in the paper titled “Attention is All You Need” by Vaswani et al. in 2017, are a class of neural network architectures that have revolutionized various natural language processing and machine learning tasks. A high level view of its architecture is shown in Fig. 1.

Fig. 1. The Transformer model architecture showing the encoder (left block)
and decoder components (right block) [2]

Since its introduction, transformers have served as the foundation for many state-of-the-art models in NLP; including BERT, GPT, and more. Fundamentally, they are designed to process sequential data, such as text data, without the need for recurrent or convolutional layers [2]. They achieve this by relying heavily on a mechanism called self-attention.

The self-attention mechanism is a key innovation introduced in the paper that allows the model to capture relationships between different elements in a given sequence by weighing the importance of each element in the sequence with respect to other elements [2]. Say for instance, you want to translate the following sentence:

“The animal didn’t cross the street because it was too tired.”

What does the word “it” in this particular sentence refer to? Is it referring to the street or the animal? For us humans, this may be a trivial question to answer. But for an algorithm, this can be considered a complex task to perform. However, through the self-attention mechanism, the transformer model is able to estimate the relative weight of each word with respect to all the other words in the sentence, allowing the model to associate the word “it” with “animal” in the context of our given sentence [4].

Fig. 2. Sample output of the 5^th encoder in a 5-encoder stack self-attention block given the word “it” as an input. We can see that the attention mechanism is associating our input word with the phrase “The Animal” [4].

2.1. The Self-Attention Mechanism

A transformer transforms a given input sequence by passing each element through an encoder (or a stack of encoders) and a decoder (or a stack of decoders) block, in parallel [2]. Each encoder block contains a self-attention block and a feed forward neural network. Here, we only focus on the transformer encoder block as this was the component used by Dosovitskiy et al. in their Vision Transformer image classification model.

As is the case with general NLP applications, the first step in the encoding process is to turn each input word into a vector using an embedding layer which converts our text data into a vector that represents our word in the vector space while retaining its contextual information. We then compile these individual word embedding vectors into a matrix X, where each row i represents the embedding of each element i in the input sequence. Then, we create three sets of vectors for each element in the input sequence; namely, Key (K), Query (Q), and Value (V). These sets are derived by multiplying matrix X with the corresponding trainable weight matrices WQ, WK, and WV [2].

Afterwards, we perform a matrix multiplication between K and Q, divide the result by the square-root of the dimensionality of K: …and then apply a softmax function to normalize the output and generate weight values between 0 and 1 [2].

We will call this intermediary output the attention factor. This factor, shown in Eq. 4, represents the weight that each element in the sequence contributes to the calculation of the attention value at the current position (word being processed). The idea behind the softmax operation is to amplify the words that the model thinks are relevant to the current position, and attenuate the ones that are irrelevant. For example, in Fig. 3, the input sentence “He later went to report Malaysia for one year” is passed into a BERT encoder unit to generate a heatmap that illustrates the contextual relationship of each word with each other. We can see that words that are deemed contextually associated produce higher weight values in their respective cells, visualized in a dark pink color, while words that are contextually unrelated have low weight values, represented in pale pink.

Fig. 3. Attention matrix visualization – weights in a BERT Encoding Unit [5]

Finally, we multiply the attention factor matrix to the value matrix V to compute the aggregated self-attention value matrix Z of this layer [2], where each row i in Z represents the attention vector for word i in our input sequence. This aggregated value essentially bakes the “context” provided by other words in the sentence into the current word being processed. The attention equation shown in Eq. 5 is sometimes also referred to as the Scaled Dot-Product Attention.

2.2 The Multi-Headed Self-Attention

In the paper by Vaswani et. al., the self-attention block is further augmented with a mechanism known as the “multi-headed” self-attention, shown in Fig 4. The idea behind this is instead of relying on a single attention mechanism, the model employs multiple parallel attention “heads” (in the paper, Vaswani et. al. used 8 parallel attention layers), wherein each of these attention heads learns different relationships and provides unique perspectives on the input sequence [2]. This improves the performance of the attention layer in two important ways:

First, it expands the ability of the model to focus on different positions within the sequence. Depending on multiple variations involved in the initialization and training process, the calculated attention value for a given word (Eq. 5) can be dominated by other certain unrelated words or phrases or even by the word itself [4]. By computing multiple attention heads, the transformer model has multiple opportunities to capture the correct contextual relationships, thus becoming more robust to variations and ambiguities in the input.Second, since each of our Q, K, V matrices are randomly initialized independently across all the attention heads, the training process then yields several Z matrices (Eq. 5), which gives the transformer multiple representation subspaces [4]. For example, one head might focus on syntactic relationships while another might attend to semantic meanings. Through this, the model is able to capture more diverse relationships within the data.

Fig. 4. Illustrating the Multi-Headed Self-Attention Mechanism. Each individual attention head yields a scaled dot-product attention value, which are concatenated and multiplied to a learned matrix W^O to generate the aggregated multi-headed self-attention value matrix [4].

3. The Vision Transformer

The fundamental innovation behind the Vision Transformer (ViT) revolves around the idea that images can be processed as sequences of tokens rather than grids of pixels. In traditional CNNs, input images are analyzed as overlapping tiles via a sliding convolutional filter, which are then processed hierarchically through a series of convolutional and pooling layers. In contrast, ViT treats the image as a collection of non-overlapping patches, which are treated as the input sequence to a standard Transformer encoder unit.

Fig. 5. The Vision Transformer architecture (left), and the Transfomer encoder unit
derived from the Fig. 1 (right)[3].

By defining the input tokens to the transformer as non-overlapping image patches rather than individual pixels, we are therefore able to reduce the dimension of the attention map from ⟮𝐻 𝓍 𝑊⟯² to ⟮𝑛_𝑝ℎ 𝓍 𝑛_𝑝𝑤 ⟯² given 𝑛_𝑝ℎ ≪𝐻 and 𝑛_𝑝𝑤≪ 𝑊; where 𝐻 and 𝑊 are the height and width of the image, and 𝑛_𝑝ℎ and 𝑛_𝑝𝑙 are the number of patches in the corresponding axes. By doing so, the model is able to handle images of varying sizes without requiring extensive architectural changes [3].

These image patches are then linearly embedded into lower-dimensional vectors, similar to the word embedding step that produces matrix X in Part 2.1. Since transformers do not contain recurrence nor convolutions, they lack the capacity to encode positional information of the input tokens and are therefore permutation invariant [2]. Hence, as it is done in NLP applications, a positional embedding is appended to each linearly encoded vector prior to input into the transformer model, in order to encode the spatial information of the patches, ensuring that the model understands the position of each token relative to other tokens within the image. Additionally, an extra learnable classifier cls embedding is added to the input. All of these (the linear embeddings of each 16 x 16 patch, the extra learnable classifier embedding, and their corresponding positional embedding vectors) are passed through a standard Transformer encoder unit as discussed in Part 2. The output corresponding to the added learnable cls embedding is then used to perform classification via a standard MLP classifer head [3].

4. The Result

In the paper, the two largest models, ViT-H/14 and ViT-L/16, both pre-trained on the JFT-300M dataset, are compared to state-of-the-art CNNs—as shown in Table II, including Big Transfer (BiT), which employs supervised transfer learning with large ResNets, and Noisy Student, a large EfficientNet trained using semi-supervised learning on ImageNet and JFT-300M without labels [3]. At the time of this study’s publication, Noisy Student held the state-of-the-art position on ImageNet, while BiT-L on the other datasets utilized in the paper [3]. All models were trained in TPUv3 hardware, and the number of TPUv3-core-days that it took to train each model were recorded.

Table II. Comparison of model performance against popular image classification benchmarks. Reported here are the mean and standard deviation of the accuracies, averaged over three fine-tuning runs [3].

We can see from the table that Vision Transformer models pre-trained on the JFT-300M dataset outperforms ResNet-based baseline models on all datasets; while, at the same time, requiring significantly less computational resources (TPUv3-core-days) to pre-train. A secondary ViT-L/16 model was also trained on a much smaller public ImageNet-21k dataset, and is shown to also perform relatively well while requiring up to 97% less computational resources compared to state-of-the-art counter parts [3].

Fig. 6 shows the comparison of the performance between the BiT and ViT models (measured using the ImageNet Top1 Accuracy metric) across different pre-training datasets of varying sizes. We see that the ViT-Large models underperform compared to the base models on the small datasets like ImageNet, and roughly equivalent performance on ImageNet-21k. However, when pre-trained on larger datasets like JFT-300M, the ViT clearly outperforms the base model [3].

Fig. 6. BiT (ResNet) vs ViT on different pre-training datasets [3].

Further exploring how the size of the dataset relates to model performance, the authors trained the models on various random subsets of the JFT dataset—9M, 30M, 90M, and the full JFT-300M. Additional regularization was not added on smaller subsets in order to assess the intrinsic model properties (and not the effect of regularization) [3]. Fig. 7 shows that ViT models overfit more than ResNets on smaller datasets. Data shows that ResNets perform better with smaller pre-training datasets but plateau sooner than ViT; which then outperforms the former with larger pre-training. The authors conclude that on smaller datasets, convolutional inductive biases play a key role in CNN model performance, which ViT models lack. However, with large enough data, learning relevant patterns directly outweighs inductive biases, wherein ViT excels [3].

Fig. 7. ResNet vs ViT on different subsets of the JFT training dataset [3].

Finally, the authors analyzed the models’ transfer performance from JFT-300M vs total pre-training compute resources allocated, across different architectures, as shown in Fig. 8. Here, we see that Vision Transformers outperform ResNets with the same computational budget across the board. ViT uses approximately 2-4 times less compute to attain similar performance as ResNet [3]. Implementing a hybrid model does improve performance on smaller model sizes, but the discrepancy vanishes for larger models, which the authors find surprising as the initial hypothesis is that the convolutional local feature processing should be able to assist ViT regardless of compute size [3].

Fig. 8. Performance of the models across different pre-training compute values—exa floating point operations per second (or exaFLOPs) [3].

4.1 What does the ViT model learn?

In order to understand how ViT processes image data, it is important to analyze its internal representations. In Part 3, we saw that the input patches generated from the image are fed into a linear embedding layer that projects the 16×16 patch into a lower dimensional vector space, and its resulting embedded representations are then appended with positional embeddings. Fig. 9 shows that the model indeed learns to encode the relative position of each patch in the image. The authors used cosine similarity between the learned positional embeddings across patches [3]. High cosine similarity values emerge on similar relative area within the position embedding matrix corresponding to the patch; i.e., the top right patch (row 1, col 7) has a corresponding high cosine similarity value (yellow pixels) on the top-right area of the position embedding matrix [3].

Fig. 9. Learned positional embedding for the input image patches [3].

Meanwhile, Fig. 10 (left) shows the top principal components of learned embedding filters that are applied to the raw image patches prior to the addition of the positional embeddings. What’s interesting for me is how similar this is to the learned hidden layer representations that you get from Convolutional neural networks, an example of which is shown in the same figure (right) using the AlexNet architecture.

Fig. 10. Filters of the initial linear embedding layer of ViT-L/32 (left) [3].
The first layer of filters from AlexNet (right) [6].

By design, the self-attention mechanism should allow ViT to integrate information across the entire image, even at the lowest layer, effectively giving ViTs a global receptive field at the start. We can somehow see this effect in Fig. 10 where the learned embedding filters captured lower level features like lines and grids, as well as higher level patterns combining lines and color blobs. This in contrast with CNNs whose receptive field size at the lowest layer is very small (because local application of the convolution operation only attends to the area defined by the filter size), and only widens towards the deeper convolutions as further applications of convolutions extract context from the combined information extracted from lower layers. The authors further tested this by measuring the attention distance which is computed from the “average distance in the image space across which information is integrated based on the attention weights [3].” The results are shown in Fig. 11.

Fig. 11. Size of attended area by head and network depth [3].

From the figure, we can see that even at very low layers of the network, some heads attend to most of the image already (as indicated by data points with high mean attention distance value at lower values of network depth); thus proving the ability of the ViT model to integrate image information globally, even at the lowest layers.

Finally, the authors also calculated the attention maps from the output token to the input space using Attention Rollout by averaging the attention weights of the ViT-L/16 across all heads and then recursively multiplying the weight matrices of all layers. This results in a nice visualization of what the output layer attends to prior to classification, shown in Fig. 12 [3].

Fig. 12. Representative examples of attention from the output token to the input space [3].

5. So, is ViT the future of Computer Vision?

The Vision Transformer (ViT) introduced by Dosovitskiy et. al. in the research study showcased in this paper is a groundbreaking architecture for computer vision tasks. Unlike previous methods that introduce image-specific biases, ViT treats an image as a sequence of patches and process it using a standard Transformer encoder, such as how Transformers are used in NLP. This straightforward yet scalable strategy, combined with pre-training on extensive datasets, has yielded impressive results as discussed in Part 4. The Vision Transformer (ViT) either matches or surpasses the state-of-the-art on numerous image classification datasets (Fig. 6, 7, and 8), all while maintaining cost-effectiveness in pre-training [3].

However, like in any technology, it has its limitations. First, in order to perform well, ViTs require a very large amount of training data that not everyone has access to in the required scale, especially when compared to traditional CNNs. The authors of the paper used the JFT-300M dataset, which is a limited-access dataset managed by Google [7]. The dominant approach to get around this is to use the model pre-trained on the large dataset, and then fine-tune it to smaller (downstream) tasks. However, second, there are still very few pre-trained ViT models available as compared to the available pre-trained CNN models, which limits the availability of transfer learning benefits for these smaller, much more specific computer vision tasks. Third, by design, ViTs process images as sequences of tokens (discussed in Part 3), which means they do not naturally capture spatial information [3]. While adding positional embeddings do help remedy this lack of spatial context, ViTs may not perform as well as CNNs in image localization tasks, given CNNs convolutional layers that are excellent at capturing these spatial relationships.

Moving forward, the authors mention the need to further study scaling ViTs for other computer vision tasks such as image detection and segmentation, as well as other training methods like self-supervised pre-training [3]. Future research may focus on making ViTs more efficient and scalable, such as developing smaller and more lightweight ViT architectures that can still deliver the same competitive performance. Furthermore, providing better accessibility by creating and sharing a wider range of pre-trained ViT models for various tasks and domains can further facilitate the development of this technology in the future.

References

N. Pogeant, “Transformers - the NLP revolution,” Medium, https://medium.com/mlearning-ai/transformers-the-nlp-revolution-5c3b6123cfb4 (accessed Sep. 23, 2023).
A. Vaswani, et. al. “Attention is all you need.” NIPS 2017.
A. Dosovitskiy, et. al. “An Image is Worth 16×16 Words: Transformers for Image Recognition at Scale,” ICLR 2021.
X. Wang, G. Chen, G. Qian, P. Gao, X.-Y. Wei, Y. Wang, Y. Tian, and W. Gao, “Large-scale multi-modal pre-trained models: A comprehensive survey,” Machine Intelligence Research, vol. 20, no. 4, pp. 447–482, 2023, doi: 10.1007/s11633-022-1410-8.
H. Wang, “Addressing Syntax-Based Semantic Complementation: Incorporating Entity and Soft Dependency Constraints into Metonymy Resolution”, Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Attention-matrix-visualization-a-weights-in-BERT-Encoding-Unit-Entity-BERT-b_fig5_359215965 [accessed 24 Sep, 2023]
A. Krizhevsky, et. al. “ImageNet Classification with Deep Convolutional Neural Networks,” NIPS 2012.
C. Sun, et. al. “Revisiting Unreasonable Effectiveness of Data in Deep Learning Era,” Google Research, ICCV 2017.

* ChatGPT, used sparingly to rephrase certain paragraphs for better grammar and more concise explanations. All ideas in the report belong to me unless otherwise indicated. Chat Reference: https://chat.openai.com/share/165501fe-d06d-424b-97e0-c26a81893c69

The post Vision Transformers (ViT) Explained: Are They Better Than CNNs? appeared first on Towards Data Science.

Unraveling Large Language Model Hallucinations

Prashal Ruchiranga — Fri, 28 Feb 2025 20:15:37 +0000

Introduction

In a YouTube video titled Deep Dive into LLMs like ChatGPT, former Senior Director of AI at Tesla, Andrej Karpathy discusses the psychology of Large Language Models (LLMs) as emergent cognitive effects of the training pipeline. This article is inspired by his explanation of LLM hallucinations and the information presented in the video.

You might have seen model hallucinations. They are the instances where LLMs generate incorrect, misleading, or entirely fabricated information that appears plausible. These hallucinations happen because LLMs do not “know” facts in the way humans do; instead, they predict words based on patterns in their training data. Early models released a few years ago struggled significantly with hallucinations. Over time, mitigation strategies have improved the situation, though hallucinations haven’t been fully eliminated.

An illustrative example of LLM hallucinations (Image by Author)

Zyler Vance is a completely fictitious name I came up with. When I input the prompt “Who is Zyler Vance?” into the falcon-7b-instruct model, it generates fabricated information. Zyler Vance is not a character in The Cloverfield Paradox (2018) movie. This model, being an older version, is prone to hallucinations.

LLM Training Pipeline

To understand where these hallucinations originate from, you have to be familiar with the training pipeline. Training LLMs typically involve three major stages.

Pretraining
Post-training: Supervised Fine-Tuning (SFT)
Post-training: Reinforcement Learning with Human Feedback (RLHF)

Pretraining

This is the initial stage of the training for LLMs. During pretraining the model is exposed to a huge quantity of very high-quality and diverse text crawled from the internet. Pretraining helps the model learn general language patterns, grammar, and facts. The output of this training phase is called the base model. It is a token simulator that predicts the next word in a sequence.

To get a sense of what the pretraining dataset might look like you can see the FineWeb dataset. FineWeb dataset is fairly representative of what you might see in an enterprise-grade language model. All the major LLM providers like OpenAI, Google, or Meta will have some equivalent dataset internally like the FineWeb dataset.

Post-Training: Supervised Fine-Tuning

As I mentioned before, the base model is a token simulator. It simply samples internet text documents. We need to turn this base model into an assistant that can answer questions. Therefore, the pretrained model is further refined using a dataset of conversations. These conversation datasets have hundreds of thousands of conversations that are multi-term and very long covering a diverse breadth of topics.

Illustrative human assistant conversations from InstructGPT distribution

These conversations come from human labelers. Given conversational context human lablers write out ideal responses for an assistant in any situation. Later, we take the base model that is trained on internet documents and substitute the dataset with the dataset of conversations. Then continue the model training on this new dataset of conversations. This way, the model adjusts rapidly and learns the statistics of how this assistant responds to queries. At the end of training the model is able to imitate human-like responses.

OpenAssistant/oasst1 is one of the open-source conversations dataset available at hugging face. This is a human-generated and human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages.

Post-training: Reinforcement Learning with Human Feedback

Supervised Fine-Tuning makes the model capable. However, even a well-trained model can generate misleading, biased, or unhelpful responses. Therefore, Reinforcement Learning with Human Feedback is required to align it with human expectations.

We start with the assistant model, trained by SFT. For a given prompt we generate multiple model outputs. Human labelers rank or score multiple model outputs based on quality, safety, and alignment with human preferences. We use these data to train a whole separate neural network that we call a reward model.

The reward model imitates human scores. It is a simulator of human preferences. It is a completely separate neural network, probably with a transformer architecture, but it is not a language model in the sense that it generates diverse language. It’s just a scoring model.

Now the LLM is fine-tuned using reinforcement learning, where the reward model provides feedback on the quality of the generated outputs. So instead of asking a real human, we’re asking a simulated human for their score of an output. The goal is to maximize the reward signal, which reflects human preferences.

Why Hallucinations?

Now that we have a clearer understanding of the training process of large language models, we can continue with our discussion on hallucinations.

Hallucinations originate from the Supervised Fine-Tuning stage of the training pipeline. The following is a specific example of three potential conversations you might have on your training set.

Examples of human-assistant conversations (Image by Author)

As I have shown earlier, this is what human-assistant conversations would look like in the training time. These conversations are created by human labelers under strict guidelines. When a labeler is writing the correct answer for the assistant in each one of these cases either they know this person or they research them on the internet. After that, they write the assistant response that has a confident tone of an answer.

At test time, if the model is asked about an individual it has not seen during training, it does not simply respond with an acknowledgment of ignorance. Simply put it does not reply with “Oh, I don’t know”. Instead, the model statistically imitates the training set.

In the training set, the questions in the form “Who is X?” are confidently answered with the correct answer. Therefore at the test time, the model replies with the style of the answer and it gives the statistically most likely guess. So it just makes stuff up that is statistically consistent with the style of the answer in its training set.

Model Interrogation

Our question now is how to mitigate the hallucinations. It is evident that our dataset should include examples where the correct answer for the assistant is that the model does not know about some particular fact. However, these answers must be produced only in instances where the model actually does not know. So the key question is how do we know what the model knows and what it does not? We need to probe the model to figure that out empirically.

The task is to figure out the boundary of the model’s knowledge. Therefore, we need to interrogate the model to figure out what it knows and doesn’t know. Then we can add examples to the training set for the things that the model doesn’t know. The correct response, in such cases, is that the model does not know them.

An example of a training instance where the model doesn’t know the answer to a particular question

Let’s take a look at how Meta dealt with hallucinations using this concept for the Llama 3 series of models.

In their 2024 paper titled “The Llama 3 Herd of Models”, Touvron et al. describe how they have developed a knowledge-probing technique to achieve this. Their primary approach involves generating data that aligns model generations with subsets of factual data present in the pre-training data. They describe the following procedure for the data generation process:

Extract a data snippet from the pre-training data.

Generate a factual question about these snippets (context) by prompting Llama 3.

Sample responses from Llama 3 to the question.

Score the correctness of the generations using the original context as a reference and Llama 3 as a judge.

Score the informativeness of the generations using Llama 3 as a judge.

Generate a refusal for responses which are consistently informative and incorrect across the generations, using Llama 3. (p. 27)

After that data generated from the knowledge probe is used to encourage the model to only answer the questions for which it knows about, and refrain from answering questions that it is unsure about. Implementing this technique has improved the hallucination issue over time.

Using Web Search

We have better mitigation strategies than just saying we do not know. We can provide the LLM with an opportunity to generate factual responses and accurately address the question. What would you do, in a case where I ask you a factual question that you don’t have an answer to? How do you answer the question? You could do some research and search the internet to figure out the answer to the question. Then tell me the answer to the question. We can do the same thing with LLMs.

You can think of the knowledge inside the parameters of the trained neural network as a vague recollection of things that the model has seen during pretraining a long time ago. Knowledge in the model parameters is analogous to something in your memory that you read a month ago. You can remember things that you read continuously over time than something you read rarely. If you don’t have a good recollection of information that you read, what you do is go and look it up. When you look up information, you are essentially refreshing your working memory with information, allowing you to retrieve and discuss it.

We need some equivalent mechanism to allow the model to refresh its memory or recollection of information. We can achieve this by introducing tools for the model. The model can use web search tools instead of just replying with “I am sorry, I don’t know the answer”. To achieve this we need to introduce special tokens, such as and along with a protocol that defines how the model is allowed to use these tokens. In this mechanism, the language model can emit special tokens. Now in a case where the model doesn’t know the answer, it has the option to emit the special token instead of replying with “I am sorry, I don’t know the answer”. After that, the model will emit the query and .

Here when the program that is sampling from the model encounters the special token during inference, it will pause the generation process instead of sampling the next token in the sequence. It will initiate a session with the search engine, input the search query into the search engine, and retrieve all the extracted text from the results. Then it will insert that text inside the context window.

The extracted text from the web search is now within the context window that will be fed into the neural network. Think of the context window as the working memory of the model. The data inside the context window is directly accessible by the model. It is directly fed into the neural network. Therefore it is no longer a vague recollection of information. Now, when sampling new tokens, it can very easily reference the data that has been copy-pasted there. Thus, this is a general overview of how these web search tools function.

An example of a training instance with special tokens. The […] notation indicates the placeholder for the extracted content

How can we teach the model to correctly use these tools like web search? Again we accomplish this through training sets. We now need enough data and numerous conversations that demonstrate, by example, how the model should use web search. We need to illustrate with examples aspects such as: “What are the settings where you are using the search? What does it look like? How do you start a search?” Because of the pretraining stage, it possesses a native understanding of what a web search is and what constitutes a good search query. Therefore, if your training set contains several thousand examples, the model will be able to understand clearly how the tool works.

Conclusion

Large language model hallucinations are inherent consequences of the training pipeline, particularly arising from the supervised fine-tuning stage. Since language models are designed to generate statistically probable text, they often produce responses that appear plausible but lack a factual basis.

Early models were prone to hallucinations significantly. However, the problem has improved with the implementation of various mitigation strategies. Knowledge probing techniques and training the model to use web search tools have been proven effective in mitigating the problem. Despite these improvements, completely eliminating hallucinations remains an ongoing challenge. As LLMs continue to evolve, mitigating hallucinations to a large extent is crucial to ensuring their reliability as a trustworthy knowledge base.

If you enjoyed this article, connect with me on X (formerly Twitter) for more insights.

The post Unraveling Large Language Model Hallucinations appeared first on Towards Data Science.

LLaDA: The Diffusion Model That Could Redefine Language Generation

Maxime Wolf — Wed, 26 Feb 2025 18:18:22 +0000

Introduction

What if we could make language models think more like humans? Instead of writing one word at a time, what if they could sketch out their thoughts first, and gradually refine them?

This is exactly what Large Language Diffusion Models (LLaDA) introduces: a different approach to current text generation used in Large Language Models (LLMs). Unlike traditional autoregressive models (ARMs), which predict text sequentially, left to right, LLaDA leverages a diffusion-like process to generate text. Instead of generating tokens sequentially, it progressively refines masked text until it forms a coherent response.

In this article, we will dive into how LLaDA works, why it matters, and how it could shape the next generation of LLMs.

I hope you enjoy the article!

The current state of LLMs

To appreciate the innovation that LLaDA represents, we first need to understand how current large language models (LLMs) operate. Modern LLMs follow a two-step training process that has become an industry standard:

Pre-training: The model learns general language patterns and knowledge by predicting the next token in massive text datasets through self-supervised learning.
Supervised Fine-Tuning (SFT): The model is refined on carefully curated data to improve its ability to follow instructions and generate useful outputs.

Note that current LLMs often use RLHF as well to further refine the weights of the model, but this is not used by LLaDA so we will skip this step here.

These models, primarily based on the Transformer architecture, generate text one token at a time using next-token prediction.

Simplified Transformer architecture for text generation (Image by the author)

Here is a simplified illustration of how data passes through such a model. Each token is embedded into a vector and is transformed through successive transformer layers. In current LLMs (LLaMA, ChatGPT, DeepSeek, etc), a classification head is used only on the last token embedding to predict the next token in the sequence.

This works thanks to the concept of masked self-attention: each token attends to all the tokens that come before it. We will see later how LLaDA can get rid of the mask in its attention layers.

Attention process: input embeddings are multiplied byQuery, Key, and Value matrices to generate new embeddings (Image by the author, inspired by [3])

If you want to learn more about Transformers, check out my article here.

While this approach has led to impressive results, it also comes with significant limitations, some of which have motivated the development of LLaDA.

Current limitations of LLMs

Current LLMs face several critical challenges:

Computational Inefficiency

Imagine having to write a novel where you can only think about one word at a time, and for each word, you need to reread everything you’ve written so far. This is essentially how current LLMs operate — they predict one token at a time, requiring a complete processing of the previous sequence for each new token. Even with optimization techniques like KV caching, this process is quite computationally expensive and time-consuming.

Limited Bidirectional Reasoning

Traditional autoregressive models (ARMs) are like writers who could never look ahead or revise what they’ve written so far. They can only predict future tokens based on past ones, which limits their ability to reason about relationships between different parts of the text. As humans, we often have a general idea of what we want to say before writing it down, current LLMs lack this capability in some sense.

Amount of data

Existing models require enormous amounts of training data to achieve good performance, making them resource-intensive to develop and potentially limiting their applicability in specialized domains with limited data availability.

What is LLaDA

LLaDA introduces a fundamentally different approach to Language Generation by replacing traditional autoregression with a “diffusion-based” process (we will dive later into why this is called “diffusion”).

Let’s understand how this works, step by step, starting with pre-training.

LLaDA pre-training

Remember that we don’t need any “labeled” data during the pre-training phase. The objective is to feed a very large amount of raw text data into the model. For each text sequence, we do the following:

We fix a maximum length (similar to ARMs). Typically, this could be 4096 tokens. 1% of the time, the lengths of sequences are randomly sampled between 1 and 4096 and padded so that the model is also exposed to shorter sequences.
We randomly choose a “masking rate”. For example, one could pick 40%.
We mask each token with a probability of 0.4. What does “masking” mean exactly? Well, we simply replace the token with a special token: . As with any other token, this token is associated with a particular index and embedding vector that the model can process and interpret during training.
We then feed our entire sequence into our transformer-based model. This process transforms all the input embedding vectors into new embeddings. We apply the classification head to each of the masked tokens to get a prediction for each. Mathematically, our loss function averages cross-entropy losses over all the masked tokens in the sequence, as below:

Loss function used for LLaDA (Image by the author)

5. And… we repeat this procedure for billions or trillions of text sequences.

Note, that unlike ARMs, LLaDA can fully utilize bidirectional dependencies in the text: it doesn’t require masking in attention layers anymore. However, this can come at an increased computational cost.

Hopefully, you can see how the training phase itself (the flow of the data into the model) is very similar to any other LLMs. We simply predict randomly masked tokens instead of predicting what comes next.

LLaDA SFT

For auto-regressive models, SFT is very similar to pre-training, except that we have pairs of (prompt, response) and want to generate the response when giving the prompt as input.

This is exactly the same concept for LlaDa! Mimicking the pre-training process: we simply pass the prompt and the response, mask random tokens from the response only, and feed the full sequence into the model, which will predict missing tokens from the response.

The innovation in inference

Innovation is where LLaDA gets more interesting, and truly utilizes the “diffusion” paradigm.

Until now, we always randomly masked some text as input and asked the model to predict these tokens. But during inference, we only have access to the prompt and we need to generate the entire response. You might think (and it’s not wrong), that the model has seen examples where the masking rate was very high (potentially 1) during SFT, and it had to learn, somehow, how to generate a full response from a prompt.

However, generating the full response at once during inference will likely produce very poor results because the model lacks information. Instead, we need a method to progressively refine predictions, and that’s where the key idea of ‘remasking’ comes in.

Here is how it works, at each step of the text generation process:

Feed the current input to the model (this is the prompt, followed by tokens)
The model generates one embedding for each input token. We get predictions for the tokens only. And here is the important step: we remask a portion of them. In particular: we only keep the “best” tokens i.e. the ones with the best predictions, with the highest confidence.
We can use this partially unmasked sequence as input in the next generation step and repeat until all tokens are unmasked.

You can see that, interestingly, we have much more control over the generation process compared to ARMs: we could choose to remask 0 tokens (only one generation step), or we could decide to keep only the best token every time (as many steps as tokens in the response). Obviously, there is a trade-off here between the quality of the predictions and inference time.

Let’s illustrate that with a simple example (in that case, I choose to keep the best 2 tokens at every step)

LLaDA generation process example (Image by the author)

Note, in practice, the remasking step would work as follows. Instead of remasking a fixed number of tokens, we would remask a proportion of s/t tokens over time, from t=1 down to 0, where s is in [0, t]. In particular, this means we remask fewer and fewer tokens as the number of generation steps increases.

Example: if we want N sampling steps (so N discrete steps from t=1 down to t=1/N with steps of 1/N), taking s = (t-1/N) is a good choice, and ensures that s=0 at the end of the process.

The image below summarizes the 3 steps described above. “Mask predictor” simply denotes the Llm (LLaDA), predicting masked tokens.

Pre-training (a.), SFT (b.) and inference (c.) using LLaDA. (source: [1])

Can autoregression and diffusion be combined?

Another clever idea developed in LLaDA is to combine diffusion with traditional autoregressive generation to use the best of both worlds! This is called semi-autoregressive diffusion.

Divide the generation process into blocks (for instance, 32 tokens in each block).
The objective is to generate one block at a time (like we would generate one token at a time in ARMs).
For each block, we apply the diffusion logic by progressively unmasking tokens to reveal the entire block. Then move on to predicting the next block.

Semi-autoregressive process (source: [1])

This is a hybrid approach: we probably lose some of the “backward” generation and parallelization capabilities of the model, but we better “guide” the model towards the final output.

I think this is a very interesting idea because it depends a lot on a hyperparameter (the number of blocks), that can be tuned. I imagine different tasks might benefit more from the backward generation process, while others might benefit more from the more “guided” generation from left to right (more on that in the last paragraph).

Why “Diffusion”?

I think it’s important to briefly explain where this term actually comes from. It reflects a similarity with image diffusion models (like Dall-E), which have been very popular for image generation tasks.

In image diffusion, a model first adds noise to an image until it’s unrecognizable, then learns to reconstruct it step by step. LLaDA applies this idea to text by masking tokens instead of adding noise, and then progressively unmasking them to generate coherent language. In the context of image generation, the masking step is often called “noise scheduling”, and the reverse (remasking) is the “denoising” step.

How do Diffusion Models work? (source: [2])

You can also see LLaDA as some type of discrete (non-continuous) diffusion model: we don’t add noise to tokens, but we “deactivate” some tokens by masking them, and the model learns how to unmask a portion of them.

Results

Let’s go through a few of the interesting results of LLaDA.

You can find all the results in the paper. I chose to focus on what I find the most interesting here.

Training efficiency: LLaDA shows similar performance to ARMs with the same number of parameters, but uses much fewer tokens during training (and no RLHF)! For example, the 8B version uses around 2.3T tokens, compared to 15T for LLaMa3.
Using different block and answer lengths for different tasks: for example, the block length is particularly large for the Math dataset, and the model demonstrates strong performance for this domain. This could suggest that mathematical reasoning may benefit more from the diffusion-based and backward process.

Source: [1]

Interestingly, LLaDA does better on the “Reversal poem completion task”. This task requires the model to complete a poem in reverse order, starting from the last lines and working backward. As expected, ARMs struggle due to their strict left-to-right generation process.

Source: [1]

LLaDA is not just an experimental alternative to ARMs: it shows real advantages in efficiency, structured reasoning, and bidirectional text generation.

Conclusion

I think LLaDA is a promising approach to language generation. Its ability to generate multiple tokens in parallel while maintaining global coherence could definitely lead to more efficient training, better reasoning, and improved context understanding with fewer computational resources.

Beyond efficiency, I think LLaDA also brings a lot of flexibility. By adjusting parameters like the number of blocks generated, and the number of generation steps, it can better adapt to different tasks and constraints, making it a versatile tool for various language modeling needs, and allowing more human control. Diffusion models could also play an important role in pro-active AI and agentic systems by being able to reason more holistically.

As research into diffusion-based language models advances, LLaDA could become a useful step toward more natural and efficient language models. While it’s still early, I believe this shift from sequential to parallel generation is an interesting direction for AI development.

Thanks for reading!

Check out my previous articles:

Training Large Language Models: From TRPO to GRPO

The Math Behind “The Curse of Dimensionality”

Feel free to connect on LinkedIn
Follow me on GitHub for more content
Visit my website: maximewolf.com

References:

[1] Liu, C., Wu, J., Xu, Y., Zhang, Y., Zhu, X., & Song, D. (2024). Large Language Diffusion Models. arXiv preprint arXiv:2502.09992. https://arxiv.org/pdf/2502.09992
[2] Yang, Ling, et al. “Diffusion models: A comprehensive survey of methods and applications.” ACM Computing Surveys 56.4 (2023): 1–39.
[3] Alammar, J. (2018, June 27). The Illustrated Transformer. Jay Alammar’s Blog. https://jalammar.github.io/illustrated-transformer/

The post LLaDA: The Diffusion Model That Could Redefine Language Generation appeared first on Towards Data Science.

Talking about Games

Dorian Drost — Fri, 21 Feb 2025 19:14:25 +0000

Game theory is a field of research that is quite prominent in Economics but rather unpopular in other scientific disciplines. However, the concepts used in game theory can be of interest to a wider audience, including data scientists, statisticians, computer scientists or psychologists, to name just a few. This article is the opener to a four-chapter tutorial series on the fundamentals of game theory, so stay tuned for the upcoming articles.

In this article, I will explain the kinds of problems Game Theory deals with and introduce the main terms and concepts used to describe a game. We will see some examples of games that are typically analysed within game theory and lay the foundation for deeper insights into the capabilities of game theory in the later chapters. But before we go into the details, I want to introduce you to some applications of game theory, that show the multitude of areas game-theoretic concepts can be applied to.

Applications of game theory

Even french fries can be an application of game theory. Photo by engin akyurt on Unsplash

Does it make sense to vote for a small party in an election if this party may not have a chance to win anyway? Is it worth starting a price war with your competitor who offers the same goods as you? Do you gain anything if you reduce your catch rate of overfished areas if your competitors simply carry on as before? Should you take out insurance if you believe that the government will pay for the reconstruction after the next hurricane anyway? And how should you behave in the next auction where you are about to bid on your favourite Picasso painting?

All these questions (and many more) live within the area of applications that can be modelled with game theory. Whenever a situation includes strategic decisions in interaction with others, game-theoretic concepts can be applied to describe this situation formally and search for decisions that are not made intuitively but that are backed by a notion of rationality. Key to all the situations above is that your decisions depend on other people’s behaviour. If everybody agrees to conserve the overfished areas, you want to play along to preserve nature, but if you think that everybody else will continue fishing, why should you be the only one to stop? Likewise, your voting behaviour in an election might heavily depend on your assumptions about other people’s votes. If nobody votes for that candidate, your vote will be wasted, but if everybody thinks so, the candidate doesn’t have a chance at all. Maybe there are many people who say “I would vote for him if others vote for him too”.

Similar situations can happen in very different situations. Have you ever thought about having food delivered and everybody said “You don’t have to order anything because of me, but if you order anyway, I’d take some french fries”? All these examples can be applications of game theory, so let’s start understanding what game theory is all about.

Understanding the game

Before playing, you need to understand the components of the game. Photo by Laine Cooper on Unsplash

When you hear the word game, you might think of video games such as Minecraft, board games such as Monopoly, or card games such as poker. There are some common principles to all these games: We always have some players who are allowed to do certain things determined by the game’s rules. For example, in poker, you can raise, check or give up. In Monopoly, you can buy a property you land on or don’t buy it. What we also have is some notion of how to win the game. In poker, you have to get the best hand to win and in Monopoly, you have to be the last person standing after everybody went bankrupt. That also means that some actions are better than others in some scenarios. If you have two aces on the hand, staying in the game is better than giving up.

When we look at games from the perspective of game theory, we use the same concepts, just more formally.

A game in game theory consists of n players, where each player has a strategy set and a utility function.

A game consists of a set of players I = {1, .., n}, where each player has a set of strategies S and a utility function ui(s1, s2, … sn). The set of strategies is determined by the rules of the games. For example, it could be S = {check, raise, give-up} and the player would have to decide which of these actions they want to use. The utility function u (also called reward) describes how valuable a certain action of a player would be, given the actions of the other players. Every player wants to maximize their utility, but now comes the tricky part: The utility of an action of yours depends on the other players’ actions. But for them, the same applies: Their actions’ utilities depend on the actions of the other players (including yours).

Let’s consider a well-known game to illustrate this point. In rock-paper-scissors, we have n=2 players and each player can choose between three actions, hence the strategy set is S={rock, paper, scissors} for each player. But the utility of an action depends on what the other player does. If our opponent chooses rock, the utility of paper is high (1), because paper beats rock. But if your opponent chooses scissors, the utility of paper is low (-1), because you would lose. Finally, if your opponent chooses paper as well, you reach a draw and the utility is 0.

Utility values for player one choosing paper for three choices of the opponents strategy.

Instead of writing down the utility function for each case individually, it is common to display games in a matrix like this:

The first player decides for the row of the matrix by selecting his action and the second player decides for the column. For example, if player 1 chooses paper and player 2 chooses scissors, we end up in the cell in the third column and second row. The value in this cell is the utility for both players, where the first value corresponds to player 1 and the second value corresponds to player 2. (-1,1) means that player 1 has a utility of -1 and player 2 has a utility of 1. Scissors beat paper.

Some more details

Now we have understood the main components of a game in game theory. Let me add a few more hints on what game theory is about and what assumptions it uses to describe its scenarios.

We often assume that the players select their actions at the same time (like in rock-paper-scissors). We call such games static games. There are also dynamic games in which players take turns deciding on their actions (like in chess). We will consider these cases in a later chapter of this tutorial.
In game theory, it is typically assumed that the players can not communicate with each other so they can’t come to an agreement before deciding on their actions. In rock-paper-scissors, you wouldn’t want to do that anyway, but there are other games where communication would make it easier to choose an action. However, we will always assume that communication is not possible.
Game theory is considered a normative theory, not a descriptive one. That means we will analyse games concerning the question “What would be the rational solution?” This may not always be what people do in a likewise situation in reality. Such descriptions of real human behaviour are part of the research field of behavioural economics, which is located on the border between Psychology and economics.

The prisoner’s dilemma

The prisoner’s dilemma is all about not ending up here. Photo by De an Sun on Unsplash

Let us become more familiar with the main concepts of game theory by looking at some typical games that are often analyzed. Often, such games are derived from are story or scenario that may happen in the real world and require people to decide between some actions. One such story could be as follows:

Say we have two criminals who are suspected of having committed a crime. The police have some circumstantial evidence, but no actual proof for their guilt. Hence they question the two criminals, who now have to decide if they want to confess or deny the crime. If you are in the situation of one of the criminals, you might think that denying is always better than confessing, but now comes the tricky part: The police propose a deal to you. If you confess while your partner denies, you are considered a crown witness and will not be punished. In this case, you are free to go but your partner will go to jail for six years. Sounds like a good deal, but be aware, that the outcome also depends on your partner’s action. If you both confess, there is no crown witness anymore and you both go to jail for three years. If you both deny, the police can only use circumstantial evidence against you, which will lead to one year in prison for both you and your partner. But be aware, that your partner is offered the same deal. If you deny and he confesses, he is the crown witness and you go to jail for six years. How do you decide?

The prisoner’s dilemma.

The game derived from this story is called the prisoner’s dilemma and is a typical example of a game in game theory. We can visualize it as a matrix just like we did with rock-paper-scissors before and in this matrix, we easily see the dilemma the players are in. If both deny, they receive a rather low punishment. But if you assume that your partner denies, you might be tempted to confess, which would prevent you from going to jail. But your partner might think the same, and if you both confess, you both go to jail for longer. Such a game can easily make you go round in circles. We will talk about solutions to this problem in the next chapter of this tutorial. First, let’s consider some more examples.

Bach vs. Stravinsky

Who do you prefer, Bach or Stravinsky? Photo by Sigmund on Unsplash

You and your friend want to go to a concert together. You are a fan of Bach’s music but your friend favors the Russian 20th. century composer Igor Stravinsky. However, you both want to avoid being alone in any concert. Although you prefer Bach over Stravinsky, you would rather go to the Stravinsky concert with your friend than go to the Bach concert alone. We can create a matrix for this game:

Bach vs. Stravinsky

You decide for the row by going to the Bach or Stravinsky concert and your friend decides for the column by going to one of the concerts as well. For you, it would be best if you both chose Bach. Your reward would be 2 and your friend would get a reward of 1, which is still better for him than being in the Stravinsky concert all by himself. However, he would be even happier, if you were in the Stravinsky concert together.

Do you remember, that we said players are not allowed to communicate before making their decision? This example illustrates why. If you could just call each other and decide where to go, this would not be a game to investigate with game theory anymore. But you can’t call each other so you just have to go to any of the concerts and hope you will meet your friend there. What do you do?

Arm or disarm?

Make love, not war. Photo by Artem Beliaikin on Unsplash

A third example brings us to the realm of international politics. The world would be a much happier place with fewer firearms, wouldn’t it? However, if nations think about disarmament, they also have to consider the choices other nations make. If the USA disarms, the Soviet Union might want to rearm, to be able to attack the USA — that was the thinking during the Cold War, at least. Such a scenario could be described with the following matrix:

The matrix for the disarm vs. upgrade game.

As you see, when both nations disarm, they get the highest reward (3 each), because there are fewer firearms in the world and the risk of war is minimized. However, if you disarm, while the opponent upgrades, your opponent is in the better position and gets a reward of 2, while you only get 0. Then again, it might have been better to upgrade yourself, which gives a reward of 1 for both players. That is better than being the only one who disarms, but not as good as it would get if both nations disarmed.

The solution?

All these examples have one thing in common: There is no single option that is always the best. Instead, the utility of an action for one player always depends on the other player’s action, which, in turn, depends on the first player’s action and so on. Game theory is now interested in finding the optimal solution and deciding what would be the rational action; that is, the action that maximizes the expected reward. Different ideas on how exactly such a solution looks like will be part of the next chapter in this series.

Summary

Learning about game theory is as much fun as playing a game, don’t you think? Photo by Christopher Paul High on Unsplash

Before continuing with finding solutions in the next chapter, let us recap what we have learned so far.

A game consists of players, that decide for actions, which have a utility or reward.
The utility/reward of an action depends on the other players’ actions.
In static games, players decide for their actions simultaneously. In dynamic games, they take turns.
The prisoner’s dilemma is a very popular example of a game in game theory.
Games become increasingly interesting if there is no single action that is better than any other.

Now that you are familiar with how games are described in game theory, you can check out the next chapter to learn how to find solutions for games in game theory.

References

The topics introduced here are typically covered in standard textbooks on game theory. I mainly used this one, which is written in German though:

Bartholomae, F., & Wiens, M. (2016). Spieltheorie. Ein anwendungsorientiertes Lehrbuch. Wiesbaden: Springer Fachmedien Wiesbaden.

An alternative in English language could be this one:

Espinola-Arredondo, A., & Muñoz-Garcia, F. (2023). Game Theory: An Introduction with Step-by-step Examples. Springer Nature.

Game theory is a rather young field of research, with the first main textbook being this one:

Von Neumann, J., & Morgenstern, O. (1944). Theory of games and economic behavior.

Like this article? Follow me to be notified of my future posts.

The post Talking about Games appeared first on Towards Data Science.

How To Generate GIFs from 3D Models with Python

Florent Poux, Ph.D. — Fri, 21 Feb 2025 02:23:47 +0000

As a data scientist, you know that effectively communicating your insights is as important as the insights themselves.

But how do you communicate over 3D data?

I can bet most of us have been there: you spend days, weeks, maybe even months meticulously collecting and processing 3D data. Then comes the moment to share your findings, whether it’s with clients, colleagues, or the broader scientific community. You throw together a few static screenshots, but they just don’t capture the essence of your work. The subtle details, the spatial relationships, the sheer scale of the data—it all gets lost in translation.

Comparing 3D Data Communication Methods. © F. Poux

Or maybe you’ve tried using specialized 3D visualization software. But when your client uses it, they struggle with clunky interfaces, steep learning curves, and restrictive licensing.

What should be a smooth, intuitive process becomes a frustrating exercise in technical acrobatics. It’s an all-too-common scenario: the brilliance of your 3D data is trapped behind a wall of technical barriers.

This highlights a common issue: the need to create shareable content that can be opened by anyone, i.e., that does not demand specific 3D data science skills.

Think about it: what is the most used way to share visual information? Images.

But how can we convey the 3D information from a simple 2D image?

Well, let us use “first principle thinking”: let us create shareable content stacking multiple 2D views, such as GIFs or MP4s, from raw point clouds.

The bread of magic to generate GIF and MP4. © F. Poux

This process is critical for presentations, reports, and general communication. But generating GIFs and MP4s from 3D data can be complex and time-consuming. I’ve often found myself wrestling with the challenge of quickly generating rotating GIF or MP4 files from a 3D point cloud, a task that seemed simple enough but often spiraled into a time-consuming ordeal.

Current workflows might lack efficiency and ease of use, and a streamlined process can save time and improve data presentation.

Let me share a solution that involves leveraging Python and specific libraries to automate the creation of GIFs and MP4s from point clouds (or any 3D dataset such as a mesh or a CAD model).

Think about it. You’ve spent hours meticulously collecting and processing this 3D data. Now, you need to present it in a compelling way for a presentation or a report. But how can we be sure it can be integrated into a SaaS solution where it is triggered on upload? You try to create a dynamic visualization to showcase a critical feature or insight, and yet you’re stuck manually capturing frames and stitching them together. How can we automate this process to seamlessly integrate it into your existing systems?

An example of a GIF generated with the methodology. © F. Poux

If you are new to my (3D) writing world, welcome! We are going on an exciting adventure that will allow you to master an essential 3D Python skill. Before diving, I like to establish a clear scenario, the mission brief.

Once the scene is laid out, we embark on the Python journey. Everything is given. You will see Tips (Notes and Growing) to help you get the most out of this article. Thanks to the 3D Geodata Academy for supporting the endeavor.

The Mission

You are working for a new engineering firm, “Geospatial Dynamics,” which wants to showcase its cutting-edge LiDAR scanning services. Instead of sending clients static point cloud images, you propose to use a new tool, which is a Python script, to generate dynamic rotating GIFs of project sites.

After doing so market research, you found that this can immediately elevate their proposals, resulting in a 20% higher project approval rate. That’s the power of visual storytelling.

The three stages of the mission towards an increase project approval. © F. Poux

On top, you can even imagine a more compelling scenario, where “GeoSpatial Dynamics” is able to process point clouds massively and then generate MP4 videos that are sent to potential clients. This way, you lower the churn and make the brand more memorable.

With that in mind, we can start designing a robust framework to answer our mission’s goal.

The Framework

I remember a project where I had to show a detailed architectural scan to a group of investors. The usual still images just could not capture the fine details. I desperately needed a way to create a rotating GIF to convey the full scope of the design. That is why I’m excited to introduce this Cloud2Gif Python solution. With this, you’ll be able to easily generate shareable visualizations for presentations, reports, and communication.

The framework I propose is straightforward yet effective. It takes raw 3D data, processes it using Python and the PyVista library, generates a series of frames, and stitches them together to create a GIF or MP4 video. The high-level workflow includes:

The various stages of the framework in this article. © F. Poux

1. Loading the 3D data (mesh with texture).

2. Loading a 3D Point Cloud

3. Setting up the visualization environment.

4. Generating a GIF

4.1. Defining a camera orbit path around the data.

4.2. Rendering frames from different viewpoints along the path.

4.3. Encoding the frames into a GIF or

5. Generating an orbital MP4

6. Creating a Function

7. Testing with multiple datasets

This streamlined process allows for easy customization and integration into existing workflows. The key advantage here is the simplicity of the approach. By leveraging the basic principles of 3D data rendering, a very efficient and self-contained script can be put together and deployed on any system as long as Python is installed.

This makes it compatible with various edge computing solutions and allows for easy integration with sensor-heavy systems. The goal is to generate a GIF and an MP4 from a 3D data set. The process is simple, requiring a 3D data set, a bit of magic (the code), and the output as GIF and MP4 files.

The growth of the solution as we move along the major stages. © F. Poux

Now, what are the tools and libraries that we will need for this endeavor?

1. Setup Guide: The Libraries, Tools and Data

For this project, we primarily use the following two Python libraries:

NumPy: The cornerstone of numerical computing in Python. Without it, I would have to deal with every vertex (point) in a very inefficient way. NumPy Official Website
pyvista: A high-level interface to the Visualization Toolkit (VTK). PyVista enables me to easily visualize and interact with 3D data. It handles rendering, camera control, and exporting frames. PyVista Official Website

PyVista and Numpy libraries for 3D Data. © F. Poux

These libraries provide all the necessary tools to handle data processing, visualization, and output generation. This set of libraries was carefully chosen so that a minimal amount of external dependencies is present, which improves sustainability and makes it easily deployable on any system.

Let me share the details of the environment as well as the data preparation setup.

Quick Environment Setup Guide

Let me provide very brief details on how to set up your environment.

Step 1: Install Miniconda

Four simple steps to get a working Miniconda version:

Visit: https://docs.conda.io/projects/miniconda/en/latest/
Download the “installer file” for your Operating System (Let it be Windows, MacOS or a Linux distribution)
Run the installer
Open terminal/command prompt and verify with: conda — version

How to install Anaconda for 3D Coding. © F. Poux

Step 2: Create a new environment

You can run the following code in your terminal

conda create -n pyvista_env python=3.10
conda activate pyvista_env

Step 3: Install required packages

For this, you can leverage pip as follows:

pip install numpy
pip install pyvista

Step 4: Test the installation

If you want to test your installation, type python in your terminal and run the following lines:

import numpy as np
import pyvista as pv
print(f”PyVista version: {pv.__version__}”)

This should return the pyvista version. Do not forget to exit Python from your terminal afterward (Ctrl+C).

Note: Here are some common issues and workarounds:

If PyVista doesn’t show a 3D window: pip install vtk
If environment activation fails: Restart the terminal
If data loading fails: Check file format compatibility (PLY, LAS, LAZ supported)

Beautiful, at this stage, your environment is ready. Now, let me share some quick ways to get your hands on 3D datasets.

Data Preparation for 3D Visualization

At the end of the article, I share with you the datasets as well as the code. However, in order to ensure you are fully independent, here are three reliable sources I regularly use to get my hands on point cloud data:

The LiDAR Data Download Process. © F. Poux

The USGS 3DEP LiDAR Point Cloud Downloads

Visit: https://apps.nationalmap.gov/lidar-explorer/
Navigate to your area of interest
Download LAZ/LAS files

OpenTopography

Visit: https://portal.opentopography.org/datasets
Create a free account
Choose “Point Cloud” datasets
Download selected region

ETH Zurich’s PCD Repository

Visit: https://www.eth3d.net/datasets
Download the “high-res multi-view” datasets
Extract the PLY files

For quick testing, you can also use PyVista’s built-in example data:

# Load sample data
from pyvista import examples
terrain = examples.download_crater_topo()
terrain.plot()

Note: Remember to always check the data license and attribution requirements when using public datasets.

Finally, to ensure a complete setup, below is a typical expected folder structure:

project_folder/
├── environment.yml
├── data/
│ └── pointcloud.ply
└── scripts/
└── gifmaker.py

Beautiful, we can now jump right onto the first stage: loading and visualizing textured mesh data.

2. Loading and Visualizing Textured Mesh Data

One first critical step is properly loading and rendering 3D data. In my research laboratory, I have found that PyVista provides an excellent foundation for handling complex 3D visualization tasks.

Here’s how you can approach this fundamental step:

import numpy as np
import pyvista as pv

mesh = pv.examples.load_globe()
texture = pv.examples.load_globe_texture()

pl = pv.Plotter()
pl.add_mesh(mesh, texture=texture, smooth_shading=True)
pl.show()

This code snippet loads a textured globe mesh, but the principles apply to any textured 3D model.

The earth rendered as a sphere with PyVista. © F. Poux

Let me discuss and speak a bit about the smooth_shading parameter. It’s a tiny element that renders the surfaces more continuous (as opposed to faceted), which, in the case of spherical objects, improves the visual impact.

Now, this is just a starter for 3D mesh data. This means that we deal with surfaces that join points together. But what if we want to work solely with point-based representations?

In that scenario, we have to consider shifting our data processing approach to propose solutions to the unique visual challenges attached to point cloud datasets.

3. Point Cloud Data Integration

Point cloud visualization demands extra attention to detail. In particular, adjusting the point density and the way we represent points on the screen has a noticeable impact.

Let us use a PLY file for testing (see the end of the article for resources).

The example PLY point cloud data with PyVista. © F. Poux

You can load a point cloud pv.read and create scalar fields for better visualization (such as using a scalar field based on the height or extent around the center of the point cloud).

In my work with LiDAR datasets, I’ve developed a simple, systematic approach to point cloud loading and initial visualization:

cloud = pv.read('street_sample.ply')
scalars = np.linalg.norm(cloud.points - cloud.center, axis=1)

pl = pv.Plotter()
pl.add_mesh(cloud)
pl.show()

The scalar computation here is particularly important. By calculating the distance from each point to the cloud’s center, we create a basis for color-coding that helps convey depth and structure in our visualizations. This becomes especially valuable when dealing with large-scale point clouds where spatial relationships might not be immediately apparent.

Moving from basic visualization to creating engaging animations requires careful consideration of the visualization environment. Let’s explore how to optimize these settings for the best possible results.

4. Optimizing the Visualization Environment

The visual impact of our animations heavily depends on the visualization environment settings.

Through extensive testing, I’ve identified key parameters that consistently produce professional-quality results:

pl = pv.Plotter(off_screen=False)
pl.add_mesh(
   cloud,
   style='points',
   render_points_as_spheres=True,
   emissive=False,
   color='#fff7c2',
   scalars=scalars,
   opacity=1,
   point_size=8.0,
   show_scalar_bar=False
   )

pl.add_text('test', color='b')
pl.background_color = 'k'
pl.enable_eye_dome_lighting()
pl.show()

As you can see, the plotter is initialized off_screen=False to render directly to the screen. The point cloud is then added to the plotter with specified styling. The style=’points’ parameter ensures that the point cloud is rendered as individual points. The scalars=’scalars’ argument uses the previously computed scalar field for coloring, while point_size sets the size of the points, and opacity adjusts the transparency. A base color is also set.

Note: In my experience, rendering points as spheres significantly improves the depth perception in the final generated animation. You can also combine this by using the eye_dome_lighting feature. This algorithm adds another layer of depth cues through some sort of normal-based shading, which makes the structure of point clouds more apparent.

You can play around with the various parameters until you obtain a rendering that is satisfying for your applications. Then, I propose that we move to creating the animated GIFs.

A GIF of the point cloud. © F. Poux

5. Creating Animated GIFs

At this stage, our aim is to generate a series of renderings by varying the viewpoint from which we generate these.

This means that we need to design a camera path that is sound, from which we can generate frame rendering.

This means that to generate our GIF, we must first create an orbiting path for the camera around the point cloud. Then, we can sample the path at regular intervals and capture frames from different viewpoints.

These frames can then be used to create the GIF. Here are the steps:

The 4 stages in the animated gifs generation. © F. Poux

I change to off-screen rendering
I take the cloud length parameters to set the camera
I create a path
I create a loop that takes a point of this pass

Which translates into the following:

pl = pv.Plotter(off_screen=True, image_scale=2)
pl.add_mesh(
   cloud,
   style='points',
   render_points_as_spheres=True,
   emissive=False,
   color='#fff7c2',
   scalars=scalars,
   opacity=1,
   point_size=5.0,
   show_scalar_bar=False
   )

pl.background_color = 'k'
pl.enable_eye_dome_lighting()
pl.show(auto_close=False)

viewup = [0, 0, 1]

path = pl.generate_orbital_path(n_points=40, shift=cloud.length, viewup=viewup, factor=3.0)
pl.open_gif("orbit_cloud_2.gif")
pl.orbit_on_path(path, write_frames=True, viewup=viewup)
pl.close()

As you can see, an orbital path is created around the point cloud using pl.generate_orbital_path(). The path’s radius is determined by cloud_length, the center is set to the center of the point cloud, and the normal vector is set to [0, 0, 1], indicating that the circle lies in the XY plane.

From there, we can enter a loop to generate individual frames for the GIF (the camera’s focal point is set to the center of the point cloud).

The image_scale parameter deserves special attention—it determines the resolution of our output.

I’ve found that a value of 2 provides a good balance between the perceived quality and the file size. Also, the viewup vector is crucial for maintaining proper orientation throughout the animation. You can experiment with its value if you want a rotation following a non-horizontal plane.

This results in a GIF that you can use to communicate very easily.

Another synthetic point cloud generated GIF. © F. Poux

But we can push one extra stage: creating an MP4 video. This can be useful if you want to obtain higher-quality animations with smaller file sizes as compared to GIFs (which are not as compressed).

6. High-Quality MP4 Video Generation

The generation of an MP4 video follows the exact same principles as we used to generate our GIF.

Therefore, let me get straight to the point. To generate an MP4 file from any point cloud, we can reason in four stages:

Gather your configurations over the parameters that best suit you.
Create an orbital path the same way you did with GIFs
Instead of using the open_gif function, let us use it open_movie to write a “movie” type file.
We orbit on the path and write the frames, similarly to our GIF method.

Note: Don’t forget to use your proper configuration in the definition of the path.

This is what the end result looks like with code:

pl = pv.Plotter(off_screen=True, image_scale=1)
pl.add_mesh(
   cloud,
   style='points_gaussian',
   render_points_as_spheres=True,
   emissive=True,
   color='#fff7c2',
   scalars=scalars,
   opacity=0.15,
   point_size=5.0,
   show_scalar_bar=False
   )

pl.background_color = 'k'
pl.show(auto_close=False)

viewup = [0.2, 0.2, 1]

path = pl.generate_orbital_path(n_points=40, shift=cloud.length, viewup=viewup, factor=3.0)
pl.open_movie("orbit_cloud.mp4")
pl.orbit_on_path(path, write_frames=True)
pl.close()

Notice the use of points_gaussian style and adjusted opacity—these settings provide interesting visual quality in video format, particularly for dense point clouds.

And now, what about streamlining the process?

7. Streamlining the Process with a Custom Function

To make this process more efficient and reproducible, I’ve developed a function that encapsulates all these steps:

def cloudgify(input_path):
   cloud = pv.read(input_path)
   scalars = np.linalg.norm(cloud.points - cloud.center, axis=1)
   pl = pv.Plotter(off_screen=True, image_scale=1)
   pl.add_mesh(
       cloud,
       style='Points',
       render_points_as_spheres=True,
       emissive=False,
       color='#fff7c2',
       scalars=scalars,
       opacity=0.65,
       point_size=5.0,
       show_scalar_bar=False
       )

   pl.background_color = 'k'
   pl.enable_eye_dome_lighting()
   pl.show(auto_close=False)

   viewup = [0, 0, 1]

   path = pl.generate_orbital_path(n_points=40, shift=cloud.length, viewup=viewup, factor=3.0)
  
   pl.open_gif(input_path.split('.')[0]+'.gif')
   pl.orbit_on_path(path, write_frames=True, viewup=viewup)
   pl.close()
  
   path = pl.generate_orbital_path(n_points=100, shift=cloud.length, viewup=viewup, factor=3.0)
   pl.open_movie(input_path.split('.')[0]+'.mp4')
   pl.orbit_on_path(path, write_frames=True)
   pl.close()
  
   return

Note: This function standardizes our visualization process while maintaining flexibility through its parameters. It incorporates several optimizations I’ve developed through extensive testing. Note the different n_points values for GIF (40) and MP4 (100)—this balances file size and smoothness appropriately for each format. The automatic filename generation split(‘.’)[0] ensures consistent output naming.

And what better than to test our new creation on multiple datasets?

8. Batch Processing Multiple Datasets

Finally, we can apply our function to multiple datasets:

dataset_paths= ["lixel_indoor.ply", "NAAVIS_EXTERIOR.ply", "pcd_synthetic.ply", "the_adas_lidar.ply"]

for pcd in dataset_paths:
   cloudgify(pcd)

This approach can be remarkably efficient when processing large datasets made of several files. Indeed, if your parametrization is sound, you can maintain consistent 3D visualization across all outputs.

Growing: I am a big fan of 0% supervision to create 100% automatic systems. This means that if you want to push the experiments even more, I suggest investigating ways to automatically infer the parameters based on the data, i.e., data-driven heuristics. Here is an example of a paper I wrote a couple of years down the line that focuses on such an approach for unsupervised segmentation (Automation in Construction, 2022)

A Little Discussion

Alright, you know my tendency to push innovation. While relatively simple, this Cloud2Gif solution has direct applications that can help you propose better experiences. Three of them come to mind, which I leverage on a weekly basis:

Interactive Data Profiling and Exploration: By generating GIFs of complex simulation results, I can profile my results at scale very quickly. Indeed, the qualitative analysis is thus a matter of slicing a sheet filled with metadata and GIFs to check if the results are on par with my metrics. This is very handy
Educational Materials: I often use this script to generate engaging visuals for my online courses and tutorials, enhancing the learning experience for the professionals and students that go through it. This is especially true now that most material is found online, where we can leverage the capacity of browsers to play animations.
Real-time Monitoring Systems: I worked on integrating this script into a real-time monitoring system to generate visual alerts based on sensor data. This is especially relevant for sensor-heavy systems, where it can be difficult to extract meaning from the point cloud representation manually. Especially when conceiving 3D Capture Systems, leveraging SLAM or other techniques, it can be helpful to get a feedback loop in real-time to ensure a cohesive registration.

However, when we consider the broader research landscape and the pressing needs of the 3D data community, the real value proposition of this approach becomes evident. Scientific research is increasingly interdisciplinary, and communication is key. We need tools that enable researchers from diverse backgrounds to understand and share complex 3D data easily.

The Cloud2Gif script is self-contained and requires minimal external dependencies. This makes it ideally suited for deployment on resource-constrained edge devices. And this may be the top application that I worked on, leveraging such a straightforward approach.

As a little digression, I saw the positive impact of the script in two scenarios. First, I designed an environmental monitoring system for diseases in farmland crops. This was a 3D project, and I could include the generation of visual alerts (with an MP4 file) based on the real-time LiDAR sensor data. A great project!

In another context, I wanted to provide visual feedback to on-site technicians using a SLAM-equipped system for mapping purposes. I integrated the process to generate a GIF every 30 seconds that showed the current state of data registration. It was a great way to ensure consistent data capture. This actually allowed us to reconstruct complex environments with better consistency in managing our data drift.

Conclusion

Today, I walked through a simple yet powerful Python script to transform 3D data into dynamic GIFs and MP4 videos. This script, combined with libraries like NumPy and PyVista, allows us to create engaging visuals for various applications, from presentations to research and educational materials.

The key here is accessibility: the script is easily deployable and customizable, providing an immediate way of transforming complex data into an accessible format. This Cloud2Gif script is an excellent piece for your application if you need to share, assess, or get quick visual feedback within data acquisition situations.

What is next?

Well, if you feel up for a challenge, you can create a simple web application that allows users to upload point clouds, trigger the video generation process, and download the resulting GIF or MP4 file.

This, in a similar manner as shown here:

In addition to Flask, you can also create a simple web application that can be deployed on Amazon Web Services so that it is scalable and easily accessible to anyone, with minimal maintenance.

These are skills that you develop through the Segmentor OS Program at the 3D Geodata Academy.

About the author

Florent Poux, Ph.D. is a Scientific and Course Director focused on educating engineers on leveraging AI and 3D Data Science. He leads research teams and teaches 3D Computer Vision at various universities. His current aim is to ensure humans are correctly equipped with the knowledge and skills to tackle 3D challenges for impactful innovations.

Resources

The post How To Generate GIFs from 3D Models with Python appeared first on Towards Data Science.

Zero Human Code: What I Learned from Forcing AI to Build (and Fix) Its Own Code for 27 Straight Days

Daniel Bentes — Wed, 19 Feb 2025 19:03:44 +0000

27 days, 1,700+ commits, 99,9% AI generated code

The narrative around AI development tools has become increasingly detached from reality. YouTube is filled with claims of building complex applications in hours using AI assistants. The truth?

I spent 27 days building ObjectiveScope under a strict constraint: the AI tools would handle ALL coding, debugging, and implementation, while I acted purely as the orchestrator. This wasn’t just about building a product — it was a rigorous experiment in the true capabilities of Agentic Ai development.

A dimwitted AI intern and a frustrated product manager walked into a bar… (Image by author)

The experiment design

Two parallel objectives drove this project:

Transform a weekend prototype into a full-service product
Test the real limits of AI-driven development by maintaining a strict “no direct code changes” policy

This self-imposed constraint was crucial: unlike typical AI-assisted development where developers freely modify code, I would only provide instructions and direction. The AI tools had to handle everything else — from writing initial features to debugging their own generated issues. This meant that even simple fixes that would take seconds to implement manually often required careful prompting and patience to guide the AI to the solution.

The rules

No direct code modifications (except for critical model name corrections — about 0.1% of commits)
All bugs must be fixed by the AI tools themselves
All feature implementations must be done entirely through AI
My role was limited to providing instructions, context, and guidance

This approach would either validate or challenge the growing hype around agentic Ai Development tools.

The development reality

Let’s cut through the marketing hype. Building with pure AI assistance is possible but comes with significant constraints that aren’t discussed enough in tech circles and marketing lingo.

The self-imposed restriction of not directly modifying code turned what might be minor issues in traditional development into complex exercises in AI instruction and guidance.

Core challenges

Deteriorating context management

As application complexity grew, AI tools increasingly lost track of the broader system context
Features would be recreated unnecessarily or broken by seemingly unrelated changes
The AI struggled to maintain consistent architectural patterns across the codebase
Each new feature required increasingly detailed prompting to prevent system degradation
Having to guide the AI to understand and maintain its own code added significant complexity

Technical limitations

Regular battles with outdated knowledge (e.g., consistent attempts to use deprecated third party library versions)
Persistent issues with model names (AI constantly changing “gpt-4o” or “o3-mini” to “gpt-4” as it identified this as the “bug” in the code during debugging sessions). The 0.1% of my direct interventions were solely to correct model references to avoid wasting time and money
Integration challenges with modern framework features became exercises in patient instruction rather than quick fixes
Code and debugging quality varied between prompts. Sometimes I just reverted and gave it the same prompt again with much better results.

Self-debugging constraints

What would be a 5-minute fix for a human often turned into hours of carefully guiding the AI
The AI frequently introduced new issues (and even new features) while trying to fix existing ones
Success required extremely precise prompting and constant vigilance
Each bug fix needed to be validated across the entire system to ensure no new issues were introduced
More often than not the AI lied about what it actually implemented!

Always verify the generated code! (Image by author)

Tool-specific insights

Lovable

Excelled at initial feature generation but struggled with maintenance
Performance degraded significantly as project complexity increased
Had to be abandoned in the final three days due to increasing response times and bugs in the tool itself
Strong with UI generation but weak at maintaining system consistency

Cursor Composer

More reliable for incremental changes and bug fixes
Better at maintaining context within individual files
Struggled with cross-component dependencies
Required more specific prompting but produced more consistent results
Much better at debugging and having control

Difficulty with abstract concepts

My experience with these agentic coding tools is that while they may excel at concrete tasks and well-defined instructions, they often struggle with abstract concepts, such as design principles, user experience, and code maintainability. This limitation hinders their ability to generate code that is not only functional but also elegant, efficient, and aligned with best practices. This can result in code that is difficult to read, maintain, or scale, potentially creating more work in the long run.

Unexpected learnings

The experiment yielded several unexpected but valuable insights about AI-driven development:

The evolution of prompting strategies

One of the most valuable outcomes was developing a collection of effective debugging prompts. Through trial and error, I discovered patterns in how to guide AI tools through complex debugging scenarios. These prompts now serve as a reusable toolkit for other AI development projects, demonstrating how even strict constraints can lead to transferable knowledge.

Architectural lock-in

Perhaps the most significant finding was how early architectural decisions become nearly immutable in pure AI development. Unlike traditional development, where refactoring is a standard practice, changing the application’s architecture late in the development process proved almost impossible. Two critical issues emerged:

Growing file complexity

Files that grew larger over time became increasingly risky to modify, as a prompt to refactor the file often introduced hours of iterations to make the things work again.
The AI tools struggled to maintain context across larger amount of files
Attempts at refactoring often resulted in broken functionality and even new features I didn’t ask for
The cost of fixing AI-introduced bugs during refactoring often outweigh potential benefits

Architectural rigidity

Initial architectural decisions had outsized impact on the entire development process, specially when combining different AI tools to work on the same codebase
The AI’s inability to comprehend full system implications made large-scale changes dangerous
What would be routine refactoring in traditional development became high-risk and time consuming operations

This differs fundamentally from typical AI-assisted development, where developers can freely refactor and restructure code. The constraint of pure AI development revealed how current tools, while powerful for initial development, struggle with the evolutionary nature of software architecture.

Key learnings for AI-only development

Early decisions matter more

Initial architectural choices become nearly permanent in pure AI development
Changes that would be routine refactoring in traditional development become high-risk operations
Success requires more upfront architectural planning than typical development

Context is everything

AI tools excel at isolated tasks but struggle with system-wide implications
Success requires maintaining a clear architectural vision that the current AI tools don’t seem to provide
Documentation and context management become critical as complexity grows

Time investment reality

Claims of building complex apps in hours are misleading. The process requires significant time investment in:

Precise prompt engineering
Reviewing and guiding AI-generated changes
Managing system-wide consistency
Debugging AI-introduced issues

Tool selection matters

Different tools excel at different stages of development
Success requires understanding each tool’s strengths and limitations
Be prepared to switch or even combine tools as project needs evolve

Scale changes everything

AI tools excel at initial development but struggle with growing complexity
System-wide changes become exponentially more difficult over time
Traditional refactoring patterns don’t translate well to AI-only development

The human element

The role shifts from writing code to orchestrating AI systems
Strategic thinking and architectural oversight become more critical
Success depends on maintaining the bigger picture that AI tools often miss
Stress management and deep breathing is encouraged as frustration builds up

The Art of AI Instruction

Perhaps the most practical insight from this experiment can be summed up in one tip: Approach prompt engineering like you’re talking to a really dimwitted intern. This isn’t just amusing — it’s a fundamental truth about working with current AI systems:

Be Painfully Specific: The more you leave ambiguous, the more room there is for the AI to make incorrect assumptions and “screw up”
Assume No Context: Just like an intern on their first day, the AI needs everything spelled out explicitly
Never Rely on Assumptions: If you don’t specify it, the AI will make its own (often wrong) decisions
Check Everything: Trust but verify — every single output needs review

This mindset shift was crucial for success. While AI tools can generate impressive code, they lack the common sense and contextual understanding that even a junior developers possess. Understanding this limitation transforms frustration into an effective strategy.

When frustration takes over. An example of how NOT to prompt (Image by author)

The Result: A Full-Featured Goal Achievement Platform

While the development process revealed crucial insights about AI tooling, the end result speaks for itself: ObjectiveScope emerged as a sophisticated platform that transforms how solopreneurs and small teams manage their strategic planning and execution.

ObjectiveScope transforms how founders and teams manage strategy and execution. At its core, AI-powered analysis eliminates the struggle of turning complex strategy documents into actionable plans — what typically takes hours becomes a 5-minute automated process. The platform doesn’t just track OKRs; it actively helps you create and manage them, ensuring your objectives and key results actually align with your strategic vision while automatically keeping everything up to date.

Screenshot of the strategy analysis section in ObjectiveScope (Image by author)

For the daily chaos every founder faces, the intelligent priority management system turns overwhelming task lists into clear, strategically-aligned action plans. No more Sunday night planning sessions or constant doubt about working on the right things. The platform validates that your daily work truly moves the needle on your strategic goals.

Team collaboration features solve the common challenge of keeping everyone aligned without endless meetings. Real-time updates and role-based workspaces mean everyone knows their priorities and how they connect to the bigger picture.

Real-World Impact

ObjectiveScope addresses critical challenges I’ve repeatedly encountered while advising startups, managing my own ventures or just talking to other founders.

I’m spending 80% less time on planning, eliminating the constant context switching that kills productivity, and maintaining strategic clarity even during the busiest operational periods. It’s about transforming strategic management from a burdensome overhead into an effortless daily rhythm that keeps you and your team focused on what matters most.

I’ll be expanding ObjectiveScope to address other key challenges faced by founders and teams. Some ideas in the pipeline are:

An agentic chat assistant will provide real-time strategic guidance, eliminating the uncertainty of decision-making in isolation.
Smart personalization will learn from your patterns and preferences, ensuring recommendations actually fit your working style and business context.
Deep integrations with Notion, Slack, and calendar tools will end the constant context-switching between apps that fragments strategic focus.
Predictive analytics will analyze your performance patterns to flag potential issues before they impact goals and suggest resource adjustments when needed.
And finally, flexible planning approaches — both on-demand and scheduled — will ensure you can maintain strategic clarity whether you’re following a stable plan or responding to rapid market changes.

Each enhancement aims to transform a common pain point into an automated, intelligent solution.

Looking Forward: Evolution Beyond the Experiment

The initial AI-driven development phase was just the beginning. Moving forward, I’ll be taking a more hands-on approach to building new capabilities, informed by the insights gained from this experiment. I certainly can’t take the risk of letting AI completely loose in the code when we are in production.

This evolution reflects a key learning from the first phase of the experiment: while AI can build complex applications on its own, the path to product excellence requires combining AI capabilities with human insight and direct development expertise. At least for now.

The Emergence of “Long Thinking” in Coding

The shift toward “long thinking” through reasoning models in AI development marks a critical evolution in how we might build software in the future. This emerging approach emphasizes deliberate reasoning and planning — essentially trading rapid responses for better-engineered solutions. For complex software development, this isn’t just an incremental improvement; it’s a fundamental requirement for producing production-grade code.

This capability shift is redefining the developer’s role as well, but not in the way many predicted. Rather than replacing developers, AI is elevating their position from code implementers to system architects and strategic problem solvers. The real value emerges when developers focus on the tasks AI can’t handle well yet: battle tested system design, architectural decisions, and creative problem-solving. It’s not about automation replacing human work — it’s about automation enhancing human capability.

Next Steps: Can AI run the entire business operation?

I’m validating whether ObjectiveScope — a tool built by AI — can be operated entirely by AI. The next phase moves beyond AI development to test the boundaries of AI operations.

Using ObjectiveScope’s own strategic planning capabilities, combined with various AI agents and tools, I’ll attempt to run all business operations — marketing, strategy, customer support, and prioritization — without human intervention.

It’s a meta-experiment where AI uses AI-built tools to run an AI-developed service…

Stay tuned for more!

The post Zero Human Code: What I Learned from Forcing AI to Build (and Fix) Its Own Code for 27 Straight Days appeared first on Towards Data Science.

The Gamma Hurdle Distribution

Jeff Allard — Sat, 08 Feb 2025 02:24:14 +0000

Which Outcome Matters?

Here is a common scenario : An A/B test was conducted, where a random sample of units (e.g. customers) were selected for a campaign and they received Treatment A. Another sample was selected to receive Treatment B. “A” could be a communication or offer and “B” could be no communication or no offer. “A” could be 10% off and “B” could be 20% off. Two groups, two different treatments, where A and B are two discrete treatments, but without loss of generality to greater than 2 treatments and continuous treatments.

So, the campaign runs and results are made available. With our backend system, we can track which of these units took the action of interest (e.g. made a purchase) and which did not. Further, for those that did, we log the intensity of that action. A common scenario is that we can track purchase amounts for those that purchased. This is often called an average order amount or revenue per buyer metric. Or a hundred different names that all mean the same thing — for those that purchased, how much did they spend, on average?

For some use-cases, the marketer is interested in the former metric — the purchase rate. For example, did we drive more (potentially first time) buyers in our acquisition campaign with Treatment A or B? Sometimes, we are interested in driving the revenue per buyer higher so we put emphasis on the latter.

More often though, we are interested in driving revenue in a cost effective manner and what we really care about is the revenue that the campaign produced overall. Did treatment A or B drive more revenue? We don’t always have balanced sample sizes (perhaps due to cost or risk avoidance) and so we divide the measured revenue by the number of candidates that were treated in each group (call these counts N_A and N_B). We want to compare this measure between the two groups, so the standard contrast is simply:

This is just the mean revenue for Treatment A minus mean revenue for Treatment B, where that mean is taken over the entire set of targeted units, irrespective if they responded or not. Its interpretation is likewise straightforward — what is the average revenue per promoted unit increase going from Treatment A versus Treatment B?

Of course, this last measure accounts for both of the prior: the response rate multiplied by the mean revenue per responder.

Uncertainty?

How much a buyer spends is highly variable and a couple large purchases in one treatment group or the other can skew the mean significantly. Likewise, sample variation can be significant. So, we want to understand how confident we are in this comparison of means and quantify the “significance” of the observed difference.

So, you throw the data in a t-test and stare at the p-value. But wait! Unfortunately for the marketer, the vast majority of the time, the purchase rate is relatively low (sometimes VERY low) and hence there are a lot of zero revenue values — often the vast majority. The t-test assumptions may be badly violated. Very large sample sizes may come to the rescue, but there is a more principled way to analyze this data that is useful in multiple ways, that will be explained.

Example Dataset

Lets start with the sample dataset to makes things practical. One of my favorite direct marketing datasets is from the KDD Cup 98.

url = 'https://kdd.ics.uci.edu/databases/kddcup98/epsilon_mirror/cup98lrn.zip'
filename = 'cup98LRN.txt'

r = requests.get(url)
z = zipfile.ZipFile(io.BytesIO(r.content))
z.extractall()


pdf_data = pd.read_csv(filename, sep=',')
pdf_data = pdf_data.query('TARGET_D >=0')
pdf_data['TREATMENT'] =  np.where(pdf_data.RFA_2F >1,'A','B')
pdf_data['TREATED'] =  np.where(pdf_data.RFA_2F >1,1,0)
pdf_data['GT_0'] = np.where(pdf_data.TARGET_D >0,1,0)
pdf_data = pdf_data[['TREATMENT', 'TREATED', 'GT_0', 'TARGET_D']]

In the code snippet above we are downloading a zip file (the learning dataset specifically), extracting it and reading it into a Pandas data frame. The nature of this dataset is campaign history from a non-profit organization that was seeking donations through direct mailings. There is no treatment variants within this dataset, so we are pretending instead and segmenting the dataset based on the frequency of past donations. We call this indicator TREATMENT (as the categorical and create TREATED as the binary indicator for ‘A’ ). Consider this the results of a randomized control trial where a portion of the sample population was treated with an offer and the remainder were not. We track each individual and accumulate the amount of their donation.

So, if we examine this dataset, we see that there are about 95,000 promoted individuals, generally distributed equally across the two treatments:

Treatment A has a larger response rate but overall the response rate in the dataset is only around 5%. So, we have 95% zeros.

For those that donated, Treatment A appears to be associated with a lower average donation amount.

Combining together everyone that was targeted, Treatment A appears to be associated with a higher average donation amount — the higher response rate outweighs the lower donation amount for responders— but not by much.

Finally, the histogram of the donation amount is shown here, pooled over both treatments, which illustrates the mass at zero and a right skew.

A numerical summary of the two treatment groups quantifies the phenomenon observed above — while Treatment A appears to have driven significantly higher response, those that were treated with A donated less on average when they responded. The net of these two measures, the one we are ultimately after — the overall mean donation per targeted unit – appears to still be higher for Treatment A. How confident we are in that finding is the subject of this analysis.

Gamma Hurdle

One way to model this data and answer our research question in terms of the difference between the two treatments in generating the average donation per targeted unit is with the Gamma Hurdle distribution. Similar to the more well known Zero Inflated Poisson (ZIP) or NB (ZINB) distribution, this is a mixture distribution where one part pertains to the mass at zero and the other, in the cases where the random variable is positive, the gamma density function.

Here π represents the probability that the random variable y is > 0. In other words its the probability of the gamma process. Likewise, (1- π) is the probability that the random variable is zero. In terms of our problem, this pertains to the probability that a donation is made and if so, it’s value.

Lets start with the component parts of using this distribution in a regression – logistic and gamma regression.

Logistic Regression

The logit function is the link function here, relating the log odds to the linear combination of our predictor variables, which with a single variable such as our binary treatment indicator, looks like:

Where π represents the probability that the outcome is a “positive” (denoted as 1) event such as a purchase and (1-π) represents the probability that the outcome is a “negative” (denoted as 0) event. Further, π which is the qty of interest above, is defined by the inverse logit function:

Fitting this model is very simple, we need to find the values of the two betas that maximize the likelihood of the data (the outcome y)— which assuming N iid observations is:

We could use any of multiple libraries to quickly fit this model but will demonstrate PYMC as the means to build a simple Bayesian logistic regression.

Without any of the normal steps of the Bayesian workflow, we fit this simple model using MCMC.

import pymc as pm
import arviz as az
from scipy.special import expit


with pm.Model() as logistic_model:

    # noninformative priors
    intercept = pm.Normal('intercept', 0, sigma=10)
    beta_treat = pm.Normal('beta_treat', 0, sigma=10)

    # linear combination of the treated variable 
    # through the inverse logit to squish the linear predictor between 0 and 1
    p =  pm.invlogit(intercept + beta_treat * pdf_data.TREATED)

    # Individual level binary variable (respond or not)
    pm.Bernoulli(name='logit', p=p, observed=pdf_data.GT_0)

    idata = pm.sample(nuts_sampler="numpyro")

az.summary(idata, var_names=['intercept', 'beta_treat'])

If we construct a contrast of the two treatment mean response rates, we find that as expected, the mean response rate lift for Treatment A is 0.026 larger than Treatment B with a 94% credible interval of (0.024 , 0.029).

# create a new column in the posterior which contrasts Treatment A - B
idata.posterior['TREATMENT A - TREATMENT B'] = expit(idata.posterior.intercept + idata.posterior.beta_treat) -  expit(idata.posterior.intercept)

az.plot_posterior(
    idata,
    var_names=['TREATMENT A - TREATMENT B']
)

Gamma Regression

The next component is the gamma distribution with one of it’s parametrizations of it’s probability density function, as shown above:

This distribution is defined for strictly positive random variables and if used in business for values such as costs, customer demand spending and insurance claim amounts.

Since the mean and variance of gamma are defined in terms of α and β according to the formulas:

for gamma regression, we can parameterize by α and β or by μ and σ. If we make μ defined as a linear combination of predictor variables, then we can define gamma in terms of α and β using μ:

The gamma regression model assumes (in this case, the inverse link is another common option) the log link which is intended to “linearize” the relationship between predictor and outcome:

Following almost exactly the same methodology as for the response rate, we limit the dataset to only responders and fit the gamma regression using PYMC.

with pm.Model() as gamma_model:

    # noninformative priors
    intercept = pm.Normal('intercept', 0, sigma=10)
    beta_treat = pm.Normal('beta_treat', 0, sigma=10)

    shape = pm.HalfNormal('shape', 5)

    # linear combination of the treated variable 
    # through the exp to ensure the linear predictor is positive
    mu =  pm.Deterministic('mu',pm.math.exp(intercept + beta_treat * pdf_responders.TREATED))

    # Individual level binary variable (respond or not)
    pm.Gamma(name='gamma', alpha = shape, beta = shape/mu,  observed=pdf_responders.TARGET_D)

    idata = pm.sample(nuts_sampler="numpyro")

az.summary(idata, var_names=['intercept', 'beta_treat'])

# create a new column in the posterior which contrasts Treatment A - B
idata.posterior['TREATMENT A - TREATMENT B'] = np.exp(idata.posterior.intercept + idata.posterior.beta_treat) -  np.exp(idata.posterior.intercept)

az.plot_posterior(
    idata,
    var_names=['TREATMENT A - TREATMENT B']
)

Again, as expected, we see the mean lift for Treatment A to have an expected value equal to the sample value of -7.8. The 94% credible interval is (-8.3, -7.3).

The components, response rate and average amount per responder shown above are about as simple as we can get. But, its a straight forward extension to add additional predictors in order to 1) estimate the Conditional Average Treatment Effects (CATE) when we expect the treatment effect to differ by segment or 2) reduce the variance of the average treatment effect estimate by conditioning on pre-treatment variables.

Hurdle Model (Gamma) Regression

At this point, it should be pretty straightforward to see where we are progressing. For the hurdle model, we have a conditional likelihood, depending on if the specific observation is 0 or greater than zero, as shown above for the gamma hurdle distribution. We can fit the two component models (logistic and gamma regression) simultaneously. We get for free, their product, which in our example is an estimate of the donation amount per targeted unit.

It would not be difficult to fit this model with using a likelihood function with a switch statement depending on the value of the outcome variable, but PYMC has this distribution already encoded for us.

import pymc as pm
import arviz as az

with pm.Model() as hurdle_model:

    ## noninformative priors ##
    # logistic
    intercept_lr = pm.Normal('intercept_lr', 0, sigma=5)
    beta_treat_lr = pm.Normal('beta_treat_lr', 0, sigma=1)

    # gamma
    intercept_gr = pm.Normal('intercept_gr', 0, sigma=5)
    beta_treat_gr = pm.Normal('beta_treat_gr', 0, sigma=1)

    # alpha
    shape = pm.HalfNormal('shape', 1)

    ## mean functions of predictors ##
    p =  pm.Deterministic('p', pm.invlogit(intercept_lr + beta_treat_lr * pdf_data.TREATED))
    mu =  pm.Deterministic('mu',pm.math.exp(intercept_gr + beta_treat_gr * pdf_data.TREATED))
    
    ## likliehood ##
    # psi is pi
    pm.HurdleGamma(name='hurdlegamma', psi=p, alpha = shape, beta = shape/mu, observed=pdf_data.TARGET_D)

    idata = pm.sample(cores = 10)

If we examine the trace summary, we see that the results are exactly the same for the two component models.

As noted, the mean of the gamma hurdle distribution is π * μ so we can create a contrast:

# create a new column in the posterior which contrasts Treatment A - B
idata.posterior['TREATMENT A - TREATMENT B'] = ((expit(idata.posterior.intercept_lr + idata.posterior.beta_treat_lr))* np.exp(idata.posterior.intercept_gr + idata.posterior.beta_treat_gr)) - \
                                                    ((expit(idata.posterior.intercept_lr))* np.exp(idata.posterior.intercept_gr))

az.plot_posterior(
    idata,
    var_names=['TREATMENT A - TREATMENT B']

The mean expected value of this model is 0.043 with a 94% credible interval of (-0.0069, 0.092). We could interrogate the posterior to see what proportion of times the donation per buyer is predicted to be higher for Treatment A and any other decision functions that made sense for our case — including adding a fuller P&L to the estimate (i.e. including margins and cost).

Notes: Some implementations parameterize the gamma hurdle model differently where the probability of zero is π and hence the mean of the gamma hurdle involves (1-π) instead. Also note that at the time of this writing there appears to be an issue with the nuts samplers in PYMC and we had to fall back on the default python implementation for running the above code.

Summary

With this approach, we get the same inference for both models separately and the extra benefit of the third metric. Fitting these models with PYMC allows us all the benefits of Bayesian analysis — including injection of prior domain knowledge and a full posterior to answer questions and quantify uncertainty!

Credits:

All images are the authors, unless otherwise noted.
The dataset used is from the KDD 98 Cup sponsored by Epsilon. https://kdd.ics.uci.edu/databases/kddcup98/kddcup98.html (CC BY 4.0)

The post The Gamma Hurdle Distribution appeared first on Towards Data Science.

A Visual Guide to How Diffusion Models Work

Yue Wu — Thu, 06 Feb 2025 20:17:21 +0000

This article is aimed at those who want to understand exactly how Diffusion Models work, with no prior knowledge expected. I’ve tried to use illustrations wherever possible to provide visual intuitions on each part of these models. I’ve kept mathematical notation and equations to a minimum, and where they are necessary I’ve tried to define and explain them as they occur.

Intro

I’ve framed this article around three main questions:

What exactly is it that diffusion models learn?
How and why do diffusion models work?
Once you’ve trained a model, how do you get useful stuff out of it?

The examples will be based on the glyffuser, a minimal text-to-image diffusion model that I previously implemented and wrote about. The architecture of this model is a standard text-to-image denoising diffusion model without any bells or whistles. It was trained to generate pictures of new “Chinese” glyphs from English definitions. Have a look at the picture below — even if you’re not familiar with Chinese writing, I hope you’ll agree that the generated glyphs look pretty similar to the real ones!

Random examples of glyffuser training data (left) and generated data (right).

What exactly is it that diffusion models learn?

Generative Ai models are often said to take a big pile of data and “learn” it. For text-to-image diffusion models, the data takes the form of pairs of images and descriptive text. But what exactly is it that we want the model to learn? First, let’s forget about the text for a moment and concentrate on what we are trying to generate: the images.

Probability distributions

Broadly, we can say that we want a generative AI model to learn the underlying probability distribution of the data. What does this mean? Consider the one-dimensional normal (Gaussian) distribution below, commonly written 𝒩(μ,σ²) and parameterized with mean μ = 0 and variance σ² = 1. The black curve below shows the probability density function. We can sample from it: drawing values such that over a large number of samples, the set of values reflects the underlying distribution. These days, we can simply write something like x = random.gauss(0, 1) in Python to sample from the standard normal distribution, although the computational sampling process itself is non-trivial!

Values sampled from an underlying distribution (here, the standard normal 𝒩(0,1)) can then be used to estimate the parameters of that distribution.

We could think of a set of numbers sampled from the above normal distribution as a simple dataset, like that shown as the orange histogram above. In this particular case, we can calculate the parameters of the underlying distribution using maximum likelihood estimation, i.e. by working out the mean and variance. The normal distribution estimated from the samples is shown by the dotted line above. To take some liberties with terminology, you might consider this as a simple example of “learning” an underlying probability distribution. We can also say that here we explicitly learnt the distribution, in contrast with the implicit methods that diffusion models use.

Conceptually, this is all that generative AI is doing — learning a distribution, then sampling from that distribution!

Data representations

What, then, does the underlying probability distribution of a more complex dataset look like, such as that of the image dataset we want to use to train our diffusion model?

First, we need to know what the representation of the data is. Generally, a machine learning (ML) model requires data inputs with a consistent representation, i.e. format. For the example above, it was simply numbers (scalars). For images, this representation is commonly a fixed-length vector.

The image dataset used for the glyffuser model is ~21,000 pictures of Chinese glyphs. The images are all the same size, 128 × 128 = 16384 pixels, and greyscale (single-channel color). Thus an obvious choice for the representation is a vector x of length 16384, where each element corresponds to the color of one pixel: x = (x₁,x₂,…,x₁₆₃₈₄). We can call the domain of all possible images for our dataset “pixel space”.

An example glyph with pixel values labelled (downsampled to 32 × 32 pixels for readability).

Dataset visualization

We make the assumption that our individual data samples, x, are actually sampled from an underlying probability distribution, q(x), in pixel space, much as the samples from our first example were sampled from an underlying normal distribution in 1-dimensional space. Note: the notation x ∼ q(x) is commonly used to mean: “the random variable x sampled from the probability distribution q(x).”

This distribution is clearly much more complex than a Gaussian and cannot be easily parameterized — we need to learn it with a ML model, which we’ll discuss later. First, let’s try to visualize the distribution to gain a better intution.

As humans find it difficult to see in more than 3 dimensions, we need to reduce the dimensionality of our data. A small digression on why this works: the manifold hypothesis posits that natural datasets lie on lower dimensional manifolds embedded in a higher dimensional space — think of a line embedded in a 2-D plane, or a plane embedded in 3-D space. We can use a dimensionality reduction technique such as UMAP to project our dataset from 16384 to 2 dimensions. The 2-D projection retains a lot of structure, consistent with the idea that our data lie on a lower dimensional manifold embedded in pixel space. In our UMAP, we see two large clusters corresponding to characters in which the components are arranged either horizontally (e.g. 明) or vertically (e.g. 草). An interactive version of the plot below with popups on each datapoint is linked here.

Click here for an interactive version of this plot.

Let’s now use this low-dimensional UMAP dataset as a visual shorthand for our high-dimensional dataset. Remember, we assume that these individual points have been sampled from a continuous underlying probability distribution q(x). To get a sense of what this distribution might look like, we can apply a KDE (kernel density estimation) over the UMAP dataset. (Note: this is just an approximation for visualization purposes.)

This gives a sense of what q(x) should look like: clusters of glyphs correspond to high-probability regions of the distribution. The true q(x) lies in 16384 dimensions — this is the distribution we want to learn with our diffusion model.

We showed that for a simple distribution such as the 1-D Gaussian, we could calculate the parameters (mean and variance) from our data. However, for complex distributions such as images, we need to call on ML methods. Moreover, what we will find is that for diffusion models in practice, rather than parameterizing the distribution directly, they learn it implicitly through the process of learning how to transform noise into data over many steps.

Takeaway

The aim of generative AI such as diffusion models is to learn the complex probability distributions underlying their training data and then sample from these distributions.

How and why do diffusion models work?

Diffusion models have recently come into the spotlight as a particularly effective method for learning these probability distributions. They generate convincing images by starting from pure noise and gradually refining it. To whet your interest, have a look at the animation below that shows the denoising process generating 16 samples.

In this section we’ll only talk about the mechanics of how these models work but if you’re interested in how they arose from the broader context of generative models, have a look at the further reading section below.

What is “noise”?

Let’s first precisely define noise, since the term is thrown around a lot in the context of diffusion. In particular, we are talking about Gaussian noise: consider the samples we talked about in the section about probability distributions. You could think of each sample as an image of a single pixel of noise. An image that is “pure Gaussian noise”, then, is one in which each pixel value is sampled from an independent standard Gaussian distribution, 𝒩(0,1). For a pure noise image in the domain of our glyph dataset, this would be noise drawn from 16384 separate Gaussian distributions. You can see this in the previous animation. One thing to keep in mind is that we can choose the means of these noise distributions, i.e. center them, on specific values — the pixel values of an image, for instance.

For convenience, you’ll often find the noise distributions for image datasets written as a single multivariate distribution 𝒩(0,I) where I is the identity matrix, a covariance matrix with all diagonal entries equal to 1 and zeroes elsewhere. This is simply a compact notation for a set of multiple independent Gaussians — i.e. there are no correlations between the noise on different pixels. In the basic implementations of diffusion models, only uncorrelated (a.k.a. “isotropic”) noise is used. This article contains an excellent interactive introduction on multivariate Gaussians.

Diffusion process overview

Below is an adaptation of the somewhat-famous diagram from Ho et al.’s seminal paper “Denoising Diffusion Probabilistic Models” which gives an overview of the whole diffusion process:

Diagram of the diffusion process adapted from Ho et al. 2020. The glyph 锂, meaning “lithium”, is used as a representative sample from the dataset.

I found that there was a lot to unpack in this diagram and simply understanding what each component meant was very helpful, so let’s go through it and define everything step by step.

We previously used x ∼ q(x) to refer to our data. Here, we’ve added a subscript, xₜ, to denote timestep t indicating how many steps of “noising” have taken place. We refer to the samples noised a given timestep as x ∼ q(xₜ). x₀ is clean data and xₜ (t = T) ∼ 𝒩(0,1) is pure noise.

We define a forward diffusion process whereby we corrupt samples with noise. This process is described by the distribution q(xₜ|xₜ₋₁). If we could access the hypothetical reverse process q(xₜ₋₁|xₜ), we could generate samples from noise. As we cannot access it directly because we would need to know x₀, we use ML to learn the parameters, θ, of a model of this process, 𝑝θ(𝑥ₜ₋₁∣𝑥ₜ). (That should be p subscript θ but medium cannot render it.)

In the following sections we go into detail on how the forward and reverse diffusion processes work.

Forward diffusion, or “noising”

Used as a verb, “noising” an image refers to applying a transformation that moves it towards pure noise by scaling down its pixel values toward 0 while adding proportional Gaussian noise. Mathematically, this transformation is a multivariate Gaussian distribution centered on the pixel values of the preceding image.

In the forward diffusion process, this noising distribution is written as q(xₜ|xₜ₋₁) where the vertical bar symbol “|” is read as “given” or “conditional on”, to indicate the pixel means are passed forward from q(xₜ₋₁) At t = T where T is a large number (commonly 1000) we aim to end up with images of pure noise (which, somewhat confusingly, is also a Gaussian distribution, as discussed previously).

The marginal distributions q(xₜ) represent the distributions that have accumulated the effects of all the previous noising steps (marginalization refers to integration over all possible conditions, which recovers the unconditioned distribution).

Since the conditional distributions are Gaussian, what about their variances? They are determined by a variance schedule that maps timesteps to variance values. Initially, an empirically determined schedule of linearly increasing values from 0.0001 to 0.02 over 1000 steps was presented in Ho et al. Later research by Nichol & Dhariwal suggested an improved cosine schedule. They state that a schedule is most effective when the rate of information destruction through noising is relatively even per step throughout the whole noising process.

Forward diffusion intuition

As we encounter Gaussian distributions both as pure noise q(xₜ, t = T) and as the noising distribution q(xₜ|xₜ₋₁), I’ll try to draw the distinction by giving a visual intuition of the distribution for a single noising step, q(x₁∣x₀), for some arbitrary, structured 2-dimensional data:

Each noising step q(xₜ|xₜ₋₁) is a Gaussian distribution conditioned on the previous step.

The distribution q(x₁∣x₀) is Gaussian, centered around each point in x₀, shown in blue. Several example points x₀⁽ⁱ⁾ are picked to illustrate this, with q(x₁∣x₀ = x₀⁽ⁱ⁾) shown in orange.

In practice, the main usage of these distributions is to generate specific instances of noised samples for training (discussed further below). We can calculate the parameters of the noising distributions at any timestep t directly from the variance schedule, as the chain of Gaussians is itself also Gaussian. This is very convenient, as we don’t need to perform noising sequentially—for any given starting data x₀⁽ⁱ⁾, we can calculate the noised sample xₜ⁽ⁱ⁾ by sampling from q(xₜ∣x₀ = x₀⁽ⁱ⁾) directly.

Forward diffusion visualization

Let’s now return to our glyph dataset (once again using the UMAP visualization as a visual shorthand). The top row of the figure below shows our dataset sampled from distributions noised to various timesteps: xₜ ∼ q(xₜ). As we increase the number of noising steps, you can see that the dataset begins to resemble pure Gaussian noise. The bottom row visualizes the underlying probability distribution q(xₜ).

The dataset xₜ (above) sampled from its probability distribution q(xₜ) (below) at different noising timesteps.

Reverse diffusion overview

It follows that if we knew the reverse distributions q(xₜ₋₁∣xₜ), we could repeatedly subtract a small amount of noise, starting from a pure noise sample xₜ at t = T to arrive at a data sample x₀ ∼ q(x₀). In practice, however, we cannot access these distributions without knowing x₀ beforehand. Intuitively, it’s easy to make a known image much noisier, but given a very noisy image, it’s much harder to guess what the original image was.

So what are we to do? Since we have a large amount of data, we can train an ML model to accurately guess the original image that any given noisy image came from. Specifically, we learn the parameters θ of an ML model that approximates the reverse noising distributions, pθ(xₜ₋₁ ∣ xₜ) for t = 0, …, T. In practice, this is embodied in a single noise prediction model trained over many different samples and timesteps. This allows it to denoise any given input, as shown in the figure below.

The ML model predicts added noise at any given timestep t.

Next, let’s go over how this noise prediction model is implemented and trained in practice.

How the model is implemented

First, we define the ML model — generally a deep neural network of some sort — that will act as our noise prediction model. This is what does the heavy lifting! In practice, any ML model that inputs and outputs data of the correct size can be used; the U-net, an architecture particularly suited to learning images, is what we use here and frequently chosen in practice. More recent models also use vision transformers.

We use the U-net architecture (Ronneberger et al. 2015) for our ML noise prediction model. We train the model by minimizing the difference between predicted and actual noise.

Then we run the training loop depicted in the figure above:

We take a random image from our dataset and noise it to a random timestep tt. (In practice, we speed things up by doing many examples in parallel!)
We feed the noised image into the ML model and train it to predict the (known to us) noise in the image. We also perform timestep conditioning by feeding the model a timestep embedding, a high-dimensional unique representation of the timestep, so that the model can distinguish between timesteps. This can be a vector the same size as our image directly added to the input (see here for a discussion of how this is implemented).
The model “learns” by minimizing the value of a loss function, some measure of the difference between the predicted and actual noise. The mean square error (the mean of the squares of the pixel-wise difference between the predicted and actual noise) is used in our case.
Repeat until the model is well trained.

Note: A neural network is essentially a function with a huge number of parameters (on the order of 10⁶ for the glyffuser). Neural network ML models are trained by iteratively updating their parameters using backpropagation to minimize a given loss function over many training data examples. This is an excellent introduction. These parameters effectively store the network’s “knowledge”.

A noise prediction model trained in this way eventually sees many different combinations of timesteps and data examples. The glyffuser, for example, was trained over 100 epochs (runs through the whole data set), so it saw around 2 million data samples. Through this process, the model implicity learns the reverse diffusion distributions over the entire dataset at all different timesteps. This allows the model to sample the underlying distribution q(x₀) by stepwise denoising starting from pure noise. Put another way, given an image noised to any given level, the model can predict how to reduce the noise based on its guess of what the original image. By doing this repeatedly, updating its guess of the original image each time, the model can transform any noise to a sample that lies in a high-probability region of the underlying data distribution.

Reverse diffusion in practice

We can now revisit this video of the glyffuser denoising process. Recall a large number of steps from sample to noise e.g. T = 1000 is used during training to make the noise-to-sample trajectory very easy for the model to learn, as changes between steps will be small. Does that mean we need to run 1000 denoising steps every time we want to generate a sample?

Luckily, this is not the case. Essentially, we can run the single-step noise prediction but then rescale it to any given step, although it might not be very good if the gap is too large! This allows us to approximate the full sampling trajectory with fewer steps. The video above uses 120 steps, for instance (most implementations will allow the user to set the number of sampling steps).

Recall that predicting the noise at a given step is equivalent to predicting the original image x₀, and that we can access the equation for any noised image deterministically using only the variance schedule and x₀. Thus, we can calculate xₜ₋ₖ based on any denoising step. The closer the steps are, the better the approximation will be.

Too few steps, however, and the results become worse as the steps become too large for the model to effectively approximate the denoising trajectory. If we only use 5 sampling steps, for example, the sampled characters don’t look very convincing at all:

There is then a whole literature on more advanced sampling methods beyond what we’ve discussed so far, allowing effective sampling with much fewer steps. These often reframe the sampling as a differential equation to be solved deterministically, giving an eerie quality to the sampling videos — I’ve included one at the end if you’re interested. In production-level models, these are usually preferred over the simple method discussed here, but the basic principle of deducing the noise-to-sample trajectory is the same. A full discussion is beyond the scope of this article but see e.g. this paper and its corresponding implementation in the Hugging Face diffusers library for more information.

Alternative intuition from score function

To me, it was still not 100% clear why training the model on noise prediction generalises so well. I found that an alternative interpretation of diffusion models known as “score-based modeling” filled some of the gaps in intuition (for more information, refer to Yang Song’s definitive article on the topic.)

The dataset xₜ sampled from its probability distribution q(xₜ) at different noising timesteps; below, we add the score function ∇ₓ log q(xₜ).

I try to give a visual intuition in the bottom row of the figure above: essentially, learning the noise in our diffusion model is equivalent (to a constant factor) to learning the score function, which is the gradient of the log of the probability distribution: ∇ₓ log q(x). As a gradient, the score function represents a vector field with vectors pointing towards the regions of highest probability density. Subtracting the noise at each step is then equivalent to moving following the directions in this vector field towards regions of high probability density.

As long as there is some signal, the score function effectively guides sampling, but in regions of low probability it tends towards zero as there is little to no gradient to follow. Using many steps to cover different noise levels allows us to avoid this, as we smear out the gradient field at high noise levels, allowing sampling to converge even if we start from low probability density regions of the distribution. The figure shows that as the noise level is increased, more of the domain is covered by the score function vector field.

Summary

The aim of diffusion models is learn the underlying probability distribution of a dataset and then be able to sample from it. This requires forward and reverse diffusion (noising) processes.
The forward noising process takes samples from our dataset and gradually adds Gaussian noise (pushes them off the data manifold). This forward process is computationally efficient because any level of noise can be added in closed form a single step.
The reverse noising process is challenging because we need to predict how to remove the noise at each step without knowing the original data point in advance. We train a ML model to do this by giving it many examples of data noised at different timesteps.
Using very small steps in the forward noising process makes it easier for the model to learn to reverse these steps, as the changes are small.
By applying the reverse noising process iteratively, the model refines noisy samples step by step, eventually producing a realistic data point (one that lies on the data manifold).

Takeaway

Diffusion models are a powerful framework for learning complex data distributions. The distributions are learnt implicitly by modelling a sequential denoising process. This process can then be used to generate samples similar to those in the training distribution.

Once you’ve trained a model, how do you get useful stuff out of it?

Earlier uses of generative AI such as “This Person Does Not Exist” (ca. 2019) made waves simply because it was the first time most people had seen AI-generated photorealistic human faces. A generative adversarial network or “GAN” was used in that case, but the principle remains the same: the model implicitly learnt a underlying data distribution — in that case, human faces — then sampled from it. So far, our glyffuser model does a similar thing: it samples randomly from the distribution of Chinese glyphs.

The question then arises: can we do something more useful than just sample randomly? You’ve likely already encountered text-to-image models such as Dall-E. They are able to incorporate extra meaning from text prompts into the diffusion process — this in known as conditioning. Likewise, diffusion models for scientific scientific applications like protein (e.g. Chroma, RFdiffusion, AlphaFold3) or inorganic crystal structure generation (e.g. MatterGen) become much more useful if can be conditioned to generate samples with desirable properties such as a specific symmetry, bulk modulus, or band gap.

Conditional distributions

We can consider conditioning as a way to guide the diffusion sampling process towards particular regions of our probability distribution. We mentioned conditional distributions in the context of forward diffusion. Below we show how conditioning can be thought of as reshaping a base distribution.

A simple example of a joint probability distribution p(x, y), shown as a contour map, along with its two marginal 1-D probability distributions, p(x) and p(y). The highest points of p(x, y) are at (x₁, y₁) and (x₂, y₂). The conditional distributions p(x∣y = y₁) and p(x∣y = y₂) are shown overlaid on the main plot.

Consider the figure above. Think of p(x) as a distribution we want to sample from (i.e., the images) and p(y) as conditioning information (i.e., the text dataset). These are the marginal distributions of a joint distribution p(x, y). Integrating p(x, y) over y recovers p(x), and vice versa.

Sampling from p(x), we are equally likely to get x₁ or x₂. However, we can condition on p(y = y₁) to obtain p(x∣y = y₁). You can think of this as taking a slice through p(x, y) at a given value of y. In this conditioned distribution, we are much more likely to sample at x₁ than x₂.

In practice, in order to condition on a text dataset, we need to convert the text into a numerical form. We can do this using large language model (LLM) embeddings that can be injected into the noise prediction model during training.

Embedding text with an LLM

In the glyffuser, our conditioning information is in the form of English text definitions. We have two requirements: 1) ML models prefer fixed-length vectors as input. 2) The numerical representation of our text must understand context — if we have the words “lithium” and “element” nearby, the meaning of “element” should be understood as “chemical element” rather than “heating element”. Both of these requirements can be met by using a pre-trained LLM.

The diagram below shows how an LLM converts text into fixed-length vectors. The text is first tokenized (LLMs break text into tokens, small chunks of characters, as their basic unit of interaction). Each token is converted into a base embedding, which is a fixed-length vector of the size of the LLM input. These vectors are then passed through the pre-trained LLM (here we use the encoder portion of Google’s T5 model), where they are imbued with additional contextual meaning. We end up with a array of n vectors of the same length d, i.e. a (n, d) sized tensor.

We can convert text to a numerical embedding imbued with contextual meaning using a pre-trained LLM.

Note: in some models, notably Dall-E, additional image-text alignment is performed using contrastive pretraining. Imagen seems to show that we can get away without doing this.

Training the diffusion model with text conditioning

The exact method that this embedding vector is injected into the model can vary. In Google’s Imagen model, for example, the embedding tensor is pooled (combined into a single vector in the embedding dimension) and added into the data as it passes through the noise prediction model; it is also included in a different way using cross-attention (a method of learning contextual information between sequences of tokens, most famously used in the transformer models that form the basis of LLMs like ChatGPT).

Conditioning information can be added via multiple different methods but the training loss remains the same.

In the glyffuser, we only use cross-attention to introduce this conditioning information. While a significant architectural change is required to introduce this additional information into the model, the loss function for our noise prediction model remains exactly the same.

Testing the conditioned diffusion model

Let’s do a simple test of the fully trained conditioned diffusion model. In the figure below, we try to denoise in a single step with the text prompt “Gold”. As touched upon in our interactive UMAP, Chinese characters often contain components known as radicals which can convey sound (phonetic radicals) or meaning (semantic radicals). A common semantic radical is derived from the character meaning “gold”, “金”, and is used in characters that are in some broad sense associated with gold or metals.

Even with a single sampling step, conditioning guides denoising towards the relevant regions of the probability distribution.

The figure shows that even though a single step is insufficient to approximate the denoising trajectory very well, we have moved into a region of our probability distribution with the “金” radical. This indicates that the text prompt is effectively guiding our sampling towards a region of the glyph probability distribution related to the meaning of the prompt. The animation below shows a 120 step denoising sequence for the same prompt, “Gold”. You can see that every generated glyph has either the 釒 or 钅 radical (the same radical in traditional and simplified Chinese, respectively).

Takeaway

Conditioning enables us to sample meaningful outputs from diffusion models.

Further remarks

I found that with the help of tutorials and existing libraries, it was possible to implement a working diffusion model despite not having a full understanding of what was going on under the hood. I think this is a good way to start learning and highly recommend Hugging Face’s tutorial on training a simple diffusion model using their diffusers Python library (which now includes my small bugfix!).

I’ve omitted some topics that are crucial to how production-grade diffusion models function, but are unnecessary for core understanding. One is the question of how to generate high resolution images. In our example, we did everything in pixel space, but this becomes very computationally expensive for large images. The general approach is to perform diffusion in a smaller space, then upscale it in a separate step. Methods include latent diffusion (used in Stable Diffusion) and cascaded super-resolution models (used in Imagen). Another topic is classifier-free guidance, a very elegant method for boosting the conditioning effect to give much better prompt adherence. I show the implementation in my previous post on the glyffuser and highly recommend this article if you want to learn more.

Fun extras

Diffusion sampling using the DPMSolverSDEScheduler developed by Katherine Crowson and implemented in Hugging Face diffusers—note the smooth transition from noise to data.

The post A Visual Guide to How Diffusion Models Work appeared first on Towards Data Science.