Maxime Wolf, Author at Towards Data Science

LLaDA: The Diffusion Model That Could Redefine Language Generation

How LLaDA works, why it matters, and how it could shape the next generation of LLMs

Introduction

What if we could make language models think more like humans? Instead of writing one word at a time, what if they could sketch out their thoughts first, and gradually refine them?

This is exactly what Large Language Diffusion Models (LLaDA) introduces: a different approach to the text generation used in current Large Language Models (LLMs). Unlike traditional autoregressive models (ARMs), which predict text sequentially, left to right, LLaDA leverages a diffusion-like process to generate text. Instead of generating tokens sequentially, it progressively refines masked text until it forms a coherent response.

In this article, we will dive into how LLaDA works, why it matters, and how it could shape the next generation of LLMs.

I hope you enjoy the article!

The current state of LLMs

To appreciate the innovation that LLaDA represents, we first need to understand how current large language models (LLMs) operate. Modern LLMs follow a two-step training process that has become an industry standard:

  1. Pre-training: The model learns general language patterns and knowledge by predicting the next token in massive text datasets through self-supervised learning.
  2. Supervised Fine-Tuning (SFT): The model is refined on carefully curated data to improve its ability to follow instructions and generate useful outputs.

Note that current LLMs often use RLHF as well to further refine the weights of the model, but this is not used by LLaDA so we will skip this step here.

These models, primarily based on the Transformer architecture, generate text one token at a time using next-token prediction.

Simplified Transformer architecture for text generation (Image by the author)

Here is a simplified illustration of how data passes through such a model. Each token is embedded into a vector and is transformed through successive transformer layers. In current LLMs (LLaMA, ChatGPT, DeepSeek, etc), a classification head is used only on the last token embedding to predict the next token in the sequence.

This works thanks to the concept of masked self-attention: each token attends to all the tokens that come before it. We will see later how LLaDA can get rid of the mask in its attention layers.

Attention process: input embeddings are multiplied by Query, Key, and Value matrices to generate new embeddings (Image by the author, inspired by [3])

If you want to learn more about Transformers, check out my article here.

While this approach has led to impressive results, it also comes with significant limitations, some of which have motivated the development of LLaDA.

Current limitations of LLMs

Current LLMs face several critical challenges:

Computational Inefficiency

Imagine having to write a novel where you can only think about one word at a time, and for each word, you need to reread everything you’ve written so far. This is essentially how current LLMs operate — they predict one token at a time, requiring a complete processing of the previous sequence for each new token. Even with optimization techniques like KV caching, this process is quite computationally expensive and time-consuming.

Limited Bidirectional Reasoning

Traditional autoregressive models (ARMs) are like writers who can never look ahead or revise what they’ve written so far. They can only predict future tokens based on past ones, which limits their ability to reason about relationships between different parts of the text. As humans, we often have a general idea of what we want to say before writing it down; current LLMs lack this capability in some sense.

Amount of data

Existing models require enormous amounts of training data to achieve good performance, making them resource-intensive to develop and potentially limiting their applicability in specialized domains with limited data availability.

What is LLaDA

LLaDA introduces a fundamentally different approach to Language Generation by replacing traditional autoregression with a “diffusion-based” process (we will dive later into why this is called “diffusion”).

Let’s understand how this works, step by step, starting with pre-training.

LLaDA pre-training

Remember that we don’t need any “labeled” data during the pre-training phase. The objective is to feed a very large amount of raw text data into the model. For each text sequence, we do the following:

  1. We fix a maximum length (similar to ARMs). Typically, this could be 4096 tokens. 1% of the time, the lengths of sequences are randomly sampled between 1 and 4096 and padded so that the model is also exposed to shorter sequences.
  2. We randomly choose a “masking rate”. For example, one could pick 40%.
  3. We mask each token with a probability of 0.4. What does “masking” mean exactly? Well, we simply replace the token with a special token, <MASK>. As with any other token, this token is associated with a particular index and embedding vector that the model can process and interpret during training.
  4. We then feed our entire sequence into our transformer-based model. This process transforms all the input embedding vectors into new embeddings. We apply the classification head to each of the masked tokens to get a prediction for each. Mathematically, our loss function averages cross-entropy losses over all the masked tokens in the sequence, as below:
Loss function used for LLaDA (Image by the author)

  5. And… we repeat this procedure for billions or trillions of text sequences.

Note that, unlike ARMs, LLaDA can fully utilize bidirectional dependencies in the text: it no longer requires masking in its attention layers. However, this can come at an increased computational cost.

Hopefully, you can see how the training phase itself (the flow of the data into the model) is very similar to that of any other LLM. We simply predict randomly masked tokens instead of predicting what comes next.
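
To make this more concrete, here is a minimal, simplified sketch of one LLaDA-style pre-training step in PyTorch. The model is a placeholder for any transformer that maps token ids to per-position logits, and the loss is the plain averaged cross-entropy described above, so take it as an illustration rather than the paper’s exact implementation.

import torch
import torch.nn.functional as F

def llada_pretraining_step(model, token_ids, mask_token_id):
    # token_ids: (batch, seq_len) tensor of raw tokens from the corpus
    batch_size, seq_len = token_ids.shape

    # Step 2: randomly choose a masking rate t in (0, 1] for each sequence
    t = torch.rand(batch_size, 1)

    # Step 3: mask each token independently with probability t
    mask = torch.rand(batch_size, seq_len) < t
    masked_ids = token_ids.clone()
    masked_ids[mask] = mask_token_id

    # Step 4: predict every masked token with the transformer (no causal attention mask)
    logits = model(masked_ids)  # assumed shape: (batch, seq_len, vocab_size)

    # Average cross-entropy over the masked positions only
    loss = F.cross_entropy(logits[mask], token_ids[mask])
    return loss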

LLaDA SFT

For auto-regressive models, SFT is very similar to pre-training, except that we have pairs of (prompt, response) and want to generate the response when giving the prompt as input.

This is exactly the same concept for LLaDA! Mimicking the pre-training process, we simply concatenate the prompt and the response, mask random tokens from the response only, and feed the full sequence into the model, which predicts the missing tokens of the response.

The innovation in inference

Inference is where LLaDA gets more interesting, and truly utilizes the “diffusion” paradigm.

Until now, we always randomly masked some text as input and asked the model to predict these tokens. But during inference, we only have access to the prompt and we need to generate the entire response. You might think (and it’s not wrong) that the model has seen examples where the masking rate was very high (potentially 1) during SFT, and that it had to learn, somehow, how to generate a full response from a prompt.

However, generating the full response at once during inference will likely produce very poor results because the model lacks information. Instead, we need a method to progressively refine predictions, and that’s where the key idea of ‘remasking’ comes in.

Here is how it works, at each step of the text generation process:

  • Feed the current input to the model (this is the prompt, followed by <MASK> tokens)
  • The model generates one embedding for each input token. We get predictions for the <MASK> tokens only. And here is the important step: we remask a portion of them. In particular, we only keep the “best” tokens, i.e. the ones predicted with the highest confidence.
  • We can use this partially unmasked sequence as input in the next generation step and repeat until all tokens are unmasked.

You can see that, interestingly, we have much more control over the generation process compared to ARMs: we could choose to remask 0 tokens (only one generation step), or we could decide to keep only the best token every time (as many steps as tokens in the response). Obviously, there is a trade-off here between the quality of the predictions and inference time.

Let’s illustrate that with a simple example (in this case, I choose to keep the best 2 tokens at every step):

LLaDA generation process example (Image by the author)

Note that, in practice, the remasking step works as follows: instead of remasking a fixed number of tokens, we remask a proportion s/t of the still-masked tokens, as t goes from 1 down to 0, where s is in [0, t]. In particular, this means we remask fewer and fewer tokens as the number of generation steps increases.

Example: if we want N sampling steps (so N discrete steps from t=1 down to t=1/N with steps of 1/N), taking s = (t-1/N) is a good choice, and ensures that s=0 at the end of the process.
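
Here is a minimal sketch of this generation loop in PyTorch, under the same placeholder assumptions as before (a model mapping token ids to per-position logits), keeping the highest-confidence predictions at each step. This is one possible remasking strategy, not the only one.

import torch

@torch.no_grad()
def llada_generate(model, prompt_ids, response_len, num_steps, mask_token_id):
    # Start from a fully masked response
    response = torch.full((response_len,), mask_token_id,
                          dtype=prompt_ids.dtype, device=prompt_ids.device)

    for step in range(num_steps):
        t = 1.0 - step / num_steps       # current "time", from 1 down to 1/num_steps
        s = t - 1.0 / num_steps          # a fraction s/t of the masked tokens stays masked

        x = torch.cat([prompt_ids, response]).unsqueeze(0)
        logits = model(x)[0, prompt_ids.size(0):]   # logits for the response positions
        confidence, predictions = logits.softmax(dim=-1).max(dim=-1)

        still_masked = response == mask_token_id
        num_remask = int(still_masked.sum().item() * s / t)

        # Unmask everything, then remask the lowest-confidence predictions
        response = torch.where(still_masked, predictions, response)
        if num_remask > 0:
            conf = confidence.masked_fill(~still_masked, float("inf"))
            remask_idx = conf.topk(num_remask, largest=False).indices
            response[remask_idx] = mask_token_id

    return response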

The image below summarizes the 3 steps described above. “Mask predictor” simply denotes the LLM (LLaDA), predicting masked tokens.

Pre-training (a.), SFT (b.) and inference (c.) using LLaDA. (source: [1])

Can autoregression and diffusion be combined?

Another clever idea developed in LLaDA is to combine diffusion with traditional autoregressive generation to use the best of both worlds! This is called semi-autoregressive diffusion.

  • Divide the generation process into blocks (for instance, 32 tokens in each block).
  • The objective is to generate one block at a time (like we would generate one token at a time in ARMs).
  • For each block, we apply the diffusion logic by progressively unmasking tokens to reveal the entire block. Then move on to predicting the next block.
Semi-autoregressive process (source: [1])
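
Reusing the llada_generate sketch from above, a semi-autoregressive loop could look like this (again a simplified sketch, not the paper’s exact implementation):

def semi_autoregressive_generate(model, prompt_ids, num_blocks, block_len,
                                 steps_per_block, mask_token_id):
    # Generate the response block by block; within each block, run the
    # diffusion-style unmasking loop sketched earlier
    context = prompt_ids
    for _ in range(num_blocks):
        block = llada_generate(model, context, block_len, steps_per_block, mask_token_id)
        context = torch.cat([context, block])   # the finished block joins the context
    return context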

This is a hybrid approach: we probably lose some of the “backward” generation and parallelization capabilities of the model, but we better “guide” the model towards the final output.

I think this is a very interesting idea because it depends a lot on a hyperparameter (the number of blocks) that can be tuned. I imagine different tasks might benefit more from the backward generation process, while others might benefit more from the more “guided” left-to-right generation (more on that in the last paragraph).

Why “Diffusion”?

I think it’s important to briefly explain where this term actually comes from. It reflects a similarity with image diffusion models (like DALL-E), which have been very popular for image generation tasks.

In image diffusion, a model first adds noise to an image until it’s unrecognizable, then learns to reconstruct it step by step. LLaDA applies this idea to text by masking tokens instead of adding noise, and then progressively unmasking them to generate coherent language. In image-generation terms, the masking step plays the role of the forward “noising” process (governed by a noise schedule), and the reverse step, which progressively unmasks tokens, corresponds to the “denoising” step.

How do Diffusion Models work? (source: [2])

You can also see LLaDA as some type of discrete (non-continuous) diffusion model: we don’t add noise to tokens, but we “deactivate” some tokens by masking them, and the model learns how to unmask a portion of them.

Results

Let’s go through a few of the interesting results of LLaDA.

You can find all the results in the paper. I chose to focus on what I find the most interesting here.

  • Training efficiency: LLaDA shows similar performance to ARMs with the same number of parameters, but uses far fewer tokens during training (and no RLHF)! For example, the 8B version uses around 2.3T tokens, compared to 15T for LLaMA 3.
  • Using different block and answer lengths for different tasks: for example, the block length is particularly large for the Math dataset, and the model demonstrates strong performance for this domain. This could suggest that mathematical reasoning may benefit more from the diffusion-based and backward process.
Source: [1]
  • Interestingly, LLaDA does better on the “Reversal poem completion task”. This task requires the model to complete a poem in reverse order, starting from the last lines and working backward. As expected, ARMs struggle due to their strict left-to-right generation process.
Source: [1]

LLaDA is not just an experimental alternative to ARMs: it shows real advantages in efficiency, structured reasoning, and bidirectional text generation.

Conclusion

I think LLaDA is a promising approach to language generation. Its ability to generate multiple tokens in parallel while maintaining global coherence could definitely lead to more efficient training, better reasoning, and improved context understanding with fewer computational resources.

Beyond efficiency, I think LLaDA also brings a lot of flexibility. By adjusting parameters like the number of blocks generated and the number of generation steps, it can better adapt to different tasks and constraints, making it a versatile tool for various language modeling needs, and allowing more human control. Diffusion models could also play an important role in proactive AI and agentic systems by being able to reason more holistically.

As research into diffusion-based language models advances, LLaDA could become a useful step toward more natural and efficient language models. While it’s still early, I believe this shift from sequential to parallel generation is an interesting direction for AI development.

Thanks for reading!


Check out my previous articles:



References:

Training Large Language Models: From TRPO to GRPO

DeepSeek has recently made quite a buzz in the AI community, thanks to its impressive performance at relatively low costs. I think this is a perfect opportunity to dive deeper into how Large Language Models (LLMs) are trained. In this article, we will focus on the Reinforcement Learning (RL) side of things: we will cover TRPO, PPO, and, more recently, GRPO (don’t worry, I will explain all these terms soon!).

I have aimed to keep this article relatively easy to read and accessible, by minimizing the math, so you won’t need a deep Reinforcement Learning background to follow along. However, I will assume that you have some familiarity with Machine Learning, Deep Learning, and a basic understanding of how LLMs work.

I hope you enjoy the article!

The 3 steps of LLM training

The 3 steps of LLM training [1]

Before diving into RL specifics, let’s briefly recap the three main stages of training a Large Language Model:

  • Pre-training: the model is trained on a massive dataset to predict the next token in a sequence based on preceding tokens.
  • Supervised Fine-Tuning (SFT): the model is then fine-tuned on more targeted data and aligned with specific instructions.
  • Reinforcement Learning (often called RLHF for Reinforcement Learning with Human Feedback): this is the focus of this article. The main goal is to further align the model’s responses with human preferences, by allowing the model to learn directly from feedback.

Reinforcement Learning Basics

A robot trying to exit a maze! [2]

Before diving deeper, let’s briefly revisit the core ideas behind Reinforcement Learning.

RL is quite straightforward to understand at a high level: an agent interacts with an environment. The agent resides in a specific state within the environment and can take actions to transition to other states. Each action yields a reward from the environment: this is how the environment provides feedback that guides the agent’s future actions. 

Consider the following example: a robot (the agent) navigates (and tries to exit) a maze (the environment).

  • The state is the current situation of the environment (the robot’s position in the maze).
  • The robot can take different actions: for example, it can move forward, turn left, or turn right.
  • Successfully navigating towards the exit yields a positive reward, while hitting a wall or getting stuck in the maze results in negative rewards.

Easy! Now, let’s make an analogy to how RL is used in the context of LLMs.

RL in the context of LLMs

Simplified RLHF Process [3]

When used during LLM training, RL is defined by the following components:

  • Agent: the LLM itself.
  • Environment: everything external to the LLM, including user prompts, feedback systems, and other contextual information. This is basically the framework the LLM is interacting with during training.
  • Actions: these are responses to a query from the model. More specifically: these are the tokens that the LLM decides to generate in response to a query.
  • State: the current query being answered along with tokens the LLM has generated so far (i.e., the partial responses).
  • Rewards: this is a bit trickier. Unlike the maze example above, there is usually no binary reward. In the context of LLMs, rewards usually come from a separate reward model, which outputs a score for each (query, response) pair. This model is trained from human-annotated data (hence “RLHF”) where annotators rank different responses. The goal is for higher-quality responses to receive higher rewards.

Note: in some cases, rewards can actually be simpler. For example, in DeepSeekMath, rule-based approaches can be used because math responses tend to be more deterministic (correct or wrong answer).

Policy is the final concept we need for now. In RL terms, a policy is simply the strategy for deciding which action to take. In the case of an LLM, the policy outputs a probability distribution over possible tokens at each step: in short, this is what the model uses to sample the next token to generate. Concretely, the policy is determined by the model’s parameters (weights). During RL training, we adjust these parameters so the LLM becomes more likely to produce “better” tokens— that is, tokens that produce higher reward scores.
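
As a tiny illustration (the function below is a hypothetical helper, assuming we already have the LLM’s logits for the next position), sampling from the policy could look like this:

import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0):
    # The policy in action: turn logits into a probability distribution
    # over the vocabulary and sample the next token from it
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)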

We often write the policy as:

$$\pi_\theta(a \mid s)$$

where a is the action (a token to generate), s the state (the query and the tokens generated so far), and θ the model’s parameters.

This idea of finding the best policy is the whole point of RL! Since we don’t have labeled data (like we do in supervised learning) we use rewards to adjust our policy to take better actions. (In LLM terms: we adjust the parameters of our LLM to generate better tokens.)

TRPO (Trust Region Policy Optimization)

An analogy with supervised learning

Let’s take a quick step back to how supervised learning typically works. you have labeled data and use a loss function (like cross-entropy) to measure how close your model’s predictions are to the true labels.

We can then use algorithms like backpropagation and gradient descent to minimize our loss function and update the weights θ of our model.

Recall that our policy also outputs probabilities! In that sense, it is analogous to the model’s predictions in supervised learning… We are tempted to write something like:

$$L(\theta) = -\,A(s, a)\,\log \pi_\theta(a \mid s)$$

where s is the current state and a is a possible action.

A(s, a) is called the advantage function and measures how good the chosen action is in the current state, compared to a baseline. This is very much like the notion of labels in supervised learning, but derived from rewards instead of explicit labeling. To simplify, we can write the advantage as:

$$A(s, a) = r(s, a) - b(s)$$

i.e. the reward obtained for the action minus a baseline for that state.

In practice, the baseline is calculated using a value function. This is a common term in RL that I will explain later. What you need to know for now is that it measures the expected reward we would receive if we continue following the current policy from the state s.

What is TRPO?

TRPO (Trust Region Policy Optimization) builds on this idea of using the advantage function but adds a critical ingredient for stability: it constrains how far the new policy can deviate from the old policy at each update step (much like limiting the step size in gradient descent, for example).

  • It introduces a KL divergence term (see it as a measure of similarity) between the current and the old policy:
  • It also uses the ratio of the new policy to the old policy. This ratio, multiplied by the advantage function, gives us a sense of how beneficial each update is relative to the old policy.

Putting it all together, TRPO tries to maximize a surrogate objective (which involves the advantage and the policy ratio) subject to a KL divergence constraint.
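
For reference, the standard TRPO formulation is the following, where δ is a small threshold defining the “trust region”:

$$\max_\theta \; \mathbb{E}\!\left[\frac{\pi_\theta(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)}\, A(s, a)\right] \quad \text{subject to} \quad \mathbb{E}\!\left[D_{\mathrm{KL}}\!\big(\pi_{\theta_{\text{old}}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\big)\right] \le \delta$$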

PPO (Proximal Policy Optimization)

While TRPO was a significant advancement, it is no longer widely used in practice, especially for training LLMs, due to its computationally intensive second-order gradient calculations.

Instead, PPO is now the preferred approach in most LLM training pipelines, including those used for ChatGPT, Gemini, and more.

It is actually quite similar to TRPO, but instead of enforcing a hard constraint on the KL divergence, PPO introduces a “clipped surrogate objective” that implicitly restricts policy updates, and greatly simplifies the optimization process.

Here is a breakdown of the PPO objective function we maximize to tweak our model’s parameters.
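
In its standard form (from the PPO paper), with the probability ratio rₜ(θ) and a clipping parameter ε, the clipped surrogate objective reads:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\Big(r_t(\theta)\, A_t,\; \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_t\Big)\right], \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$$

The clipping keeps the ratio within [1−ε, 1+ε], which implicitly limits how far the policy can move in a single update.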

GRPO (Group Relative Policy Optimization)

How is the value function usually obtained?

Let’s first talk more about the advantage and the value functions I introduced earlier.

In typical setups (like PPO), a value model is trained alongside the policy. Its goal is to predict the value of each action we take (each token generated by the model), using the rewards we obtain (remember that the value should represent the expected cumulative reward).

Here is how it works in practice. Take the query “What is 2+2?” as an example. Our model outputs “2+2 is 4” and receives a reward of 0.8 for that response. We then go backward and attribute discounted rewards to each prefix:

  • “2+2 is 4” gets a value of 0.8
  • “2+2 is” (1 token backward) gets a value of 0.8γ
  • “2+2” (2 tokens backward) gets a value of 0.8γ²
  • etc.

where γ is the discount factor (0.9 for example). We then use these prefixes and associated values to train the value model.
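
A tiny sketch of this backward attribution (the helper name and the example values are just for illustration):

def discounted_prefix_values(final_reward, num_tokens_back, gamma=0.9):
    # Value of the full response, then of each shorter prefix,
    # walking one token backward at a time
    return [final_reward * gamma ** k for k in range(num_tokens_back + 1)]

print(discounted_prefix_values(0.8, 2))  # [0.8, 0.8γ, 0.8γ²] with γ = 0.9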

Important note: the value model and the reward model are two different things. The reward model is trained before the RL process and uses pairs of (query, response) and human rankings. The value model is trained concurrently with the policy, and aims to predict the future expected reward at each step of the generation process.

What’s new in GRPO

Even if, in practice, the reward model is often derived from the policy (by training only the “head”), we still end up maintaining many models and handling multiple training procedures (policy, reward, and value models). GRPO streamlines this by introducing a more efficient method.

Remember what I said earlier?

In PPO, we decided to use the value function as the baseline. GRPO chooses something else: for each query, GRPO generates a group of responses (a group of size G) and uses their rewards to calculate each response’s advantage as a z-score:

$$A_i = \frac{r_i - \mu}{\sigma}$$

where rᵢ is the reward of the i-th response and μ and σ are the mean and standard deviation of the rewards in that group.

This naturally eliminates the need for a separate value model. This idea makes a lot of sense when you think about it! It aligns with the value function we introduced before and also measures, in a sense, an “expected” reward we can obtain. Also, this new method is well adapted to our problem because LLMs can easily generate multiple non-deterministic outputs by using a low temperature (controls the randomness of tokens generation).
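
As a quick sketch (the reward values below are made up), computing these group-relative advantages takes just a few lines:

import torch

def group_relative_advantages(rewards, eps=1e-8):
    # z-score each reward within its group of G sampled responses
    rewards = torch.as_tensor(rewards, dtype=torch.float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

print(group_relative_advantages([0.2, 0.8, 0.5, 0.9]))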

This is the main idea behind GRPO: getting rid of the value model.

Finally, GRPO adds a KL divergence term (to be exact, GRPO uses a simple approximation of the KL divergence to improve the algorithm further) directly into its objective, comparing the current policy to a reference policy (often the post-SFT model).

See the final formulation below:

And… that’s mostly it for GRPO! I hope this gives you a clear overview of the process: it still relies on the same foundational ideas as TRPO and PPO but introduces additional improvements to make training more efficient, faster, and cheaper — key factors behind DeepSeek’s success.

Conclusion

Reinforcement Learning has become a cornerstone for training today’s Large Language Models, particularly through PPO, and more recently GRPO. Each method rests on the same RL fundamentals — states, actions, rewards, and policies — but adds its own twist to balance stability, efficiency, and human alignment:

  • TRPO introduced strict policy constraints via KL divergence
  • PPO eased those constraints with a clipped objective
  • GRPO took an extra step by removing the value model requirement and using group-based reward normalization. Of course, DeepSeek also benefits from other innovations, like high-quality data and other training strategies, but that is for another time!

I hope this article gave you a clearer picture of how these methods connect and evolve. I believe that Reinforcement Learning will become the main focus in training LLMs to improve their performance, surpassing pre-training and SFT in driving future innovations. 

If you’re interested in diving deeper, feel free to check out the references below or explore my previous posts.

Thanks for reading, and feel free to leave a clap and a comment!


Want to learn more about Transformers or dive into the math behind the Curse of Dimensionality? Check out my previous articles:



References:

The Math Behind “The Curse of Dimensionality”

Dive into the "Curse of Dimensionality" concept and understand the math behind all the surprising phenomena that arise in high dimensions.

Image from Dall-E

In the realm of machine learning, handling high-dimensional vectors is not just common; it’s essential. This is illustrated by the architecture of popular models like Transformers. For instance, BERT uses 768-dimensional vectors to encode the tokens of the input sequences it processes and to better capture complex patterns in the data. Given that our brain struggles to visualize anything beyond 3 dimensions, the use of 768-dimensional vectors is quite mind-blowing!

While some Machine and Deep Learning models excel in these high-dimensional scenarios, they also present many challenges. In this article, we will explore the concept of the "Curse Of Dimensionality", explain some interesting phenomena associated with it, delve into the mathematics behind these phenomena, and discuss their general implications for your Machine Learning models.

Note that detailed mathematical proofs related to this article are available on my website as a supplementary extension to this article.

What is the curse of dimensionality?

People often assume that geometric concepts familiar in three dimensions behave similarly in higher-dimensional spaces. This is not the case. As the dimension increases, many interesting and counterintuitive phenomena arise. The "Curse of Dimensionality" is a term coined by Richard Bellman (a famous mathematician) that refers to all these surprising effects.

What is so special about high dimensions is how the "volume" of the space (we’ll explore that in more detail soon) grows exponentially. Take a graduated line (in one dimension) from 1 to 10. There are 10 integers on this line. Extend that to 2 dimensions: it is now a square with 10 × 10 = 100 points with integer coordinates. Now consider "only" 80 dimensions: you would already have 10⁸⁰ points, which is roughly the number of atoms in the observable universe.

In other words, as the dimension increases, the volume of the space grows exponentially, resulting in data becoming increasingly sparse.

High-dimensional spaces are "empty"

Consider this other example. We want to calculate the farthest distance between two points in a unit hypercube (where each side has a length of 1):

  • In 1 dimension (the hypercube is a line segment from 0 to 1), the maximum distance is simply 1.
  • In 2 dimensions (the hypercube forms a square), the maximum distance is the distance between the opposite corners [0,0] and [1,1], which is √2, calculated using the Pythagorean theorem.
  • Extending this concept to n dimensions, the distance between the points at [0,0,…,0] and [1,1,…,1] is √n. This formula arises because each additional dimension adds a square of 1 to the sum under the square root (again by the Pythagorean theorem).

Interestingly, as the number of dimensions n increases, the largest distance within the hypercube grows at an O(√n) rate. This phenomenon illustrates a diminishing returns effect, where increases in dimensional space lead to proportionally smaller gains in spatial distance. More details on this effect and its implications will be explored in the following sections of this article.
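
A quick numerical check of this √n growth (the dimensions below are chosen arbitrarily):

import numpy as np

for n in [1, 2, 3, 10, 100, 1000]:
    # Distance between the corners [0, ..., 0] and [1, ..., 1] of the unit hypercube
    print(n, np.linalg.norm(np.ones(n)))   # equals sqrt(n)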

The notion of distance in high dimensions

Let’s dive deeper into the notion of distances we started exploring in the previous section.

We had our first glimpse of how high-dimensional spaces render the notion of distance almost meaningless. But what does this really mean, and can we mathematically visualize this phenomenon?

Let’s consider an experiment, using the same n-dimensional unit hypercube we defined before. First, we generate a dataset by randomly sampling many points in this cube: we effectively simulate a multivariate uniform distribution. Then, we sample another point (a "query" point) from that distribution and observe the distance from its nearest and farthest neighbor in our dataset.

Here is the corresponding Python code.

import numpy as np

def generate_data(dimension, num_points):
    ''' Generate random data points within [0, 1] for each coordinate in the given dimension '''
    data = np.random.rand(num_points, dimension)
    return data

def neighbors(data, query_point):
    ''' Returns the nearest and farthest point in data from query_point '''
    nearest_distance = float('inf')
    farthest_distance = 0
    for point in data:
        distance = np.linalg.norm(point - query_point)
        if distance < nearest_distance:
            nearest_distance = distance
        if distance > farthest_distance:
            farthest_distance = distance
    return nearest_distance, farthest_distance
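
For instance, a minimal driver to run the experiment across a few dimensions (the values below are arbitrary) could be:

dimensions = [2, 10, 100, 1000, 10000]
num_points = 1000

for d in dimensions:
    data = generate_data(d, num_points)
    query_point = np.random.rand(d)
    nearest, farthest = neighbors(data, query_point)
    print(f"dim={d:>5}  nearest={nearest:.3f}  farthest={farthest:.3f}  ratio={farthest / nearest:.2f}")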

We can also plot these distances:

Distances to nearest and farthest points as n increases (Image by the author)

Using a log scale, we observe that the relative difference between the distance to the nearest and farthest neighbor tends to decrease as the dimension increases.

This is a very unintuitive behavior: as explained in the previous section, points are very sparse from each other because of the exponentially increasing volume of the space, but at the same time, the relative distances between points become smaller.

The notion of nearest neighbors vanishes

This means that the very concept of distance becomes less relevant and less discriminative as the dimension of the space increases. As you can imagine, it poses problems for Machine Learning algorithms that solely rely on distances such as kNN.

The maths: the n-ball

We will now talk about some other interesting phenomena. For this, we’ll need the n-ball. An n-ball is the generalization of a ball to n dimensions. The n-ball of radius R is the collection of points at a distance of at most R from the center of the space, 0.

Let’s consider a radius of 1. The 1-ball is the segment [-1, 1]. The 2-ball is the disk delimited by the unit circle, whose equation is x² + y² ≤ 1. The 3-ball (what we normally call a "ball") has the equation x² + y² + z² ≤ 1. As you understand, we can extend that definition to any dimension:

The question now is: what is the volume of this ball? This is not an easy question and requires quite a lot of maths, which I won’t detail here. However, you can find all the details on my website, in my post about the volume of the n-ball.

After a lot of fun (integral calculus), you can prove that the volume of the n-ball can be expressed as follows, where Γ denotes the Gamma function:

$$V_n(R) = \frac{\pi^{n/2}}{\Gamma\left(\frac{n}{2} + 1\right)} R^n$$

For example, with R = 1 and n = 2, the volume is πR², because Γ(2) = 1. This is indeed the "volume" of the 2-ball (also called the "area" of a circle in this case).

However, beyond being an interesting mathematical challenge, the volume of the n-ball also has some very surprising properties.

As the dimension n increases, the volume of the n-ball converges to 0.

This is true for every radius, but let’s visualize this phenomenon with a few values of R.

Volume of the n-ball for different radii as the dimension increases (Image by the author)

As you can see, the volume does not simply decrease: it first increases and only then converges to 0. For R = 1, the ball with the largest volume is the 5-ball, and the value of n that reaches the maximum shifts to the right as R increases.

Here are the first values of the volume of the unit n-ball, up to n = 10.

Volume of the unit n-ball for different values of n (Image by the author)
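
These values can be reproduced with a few lines of Python, using the formula above:

from math import pi, gamma

def n_ball_volume(n, R=1.0):
    # Volume of the n-ball of radius R
    return pi ** (n / 2) / gamma(n / 2 + 1) * R ** n

for n in range(1, 11):
    print(n, round(n_ball_volume(n), 4))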

The volume of a high-dimensional unit ball is concentrated near its surface.

For small dimensions, the volume of a ball looks quite "homogeneous": this is not the case in high dimensions.

A spherical shell

Let’s consider an n-ball with radius R and another with radius R-dR, where dR is very small. The portion of the n-ball between these 2 balls is called a "shell" and corresponds to the portion of the ball near its surface (see the visualization above in 3D). We can compute the ratio between the "inner" volume of the ball (everything except the thin shell) and its total volume.

Ratio (inner volume / total volume) as n increases (Image by the author)

As we can see, it converges very quickly to 0: almost all the volume is near the surface in high dimensional spaces. For instance, for R = 1, dR = 0.05, and n = 50, about 92.3% of the volume is concentrated in the thin shell. This shows that in higher dimensions, the volume is in "corners". This is again related to the distortion of the concept of distance we have seen earlier.
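
Since the Rⁿ terms cancel, this ratio is simply ((R − dR)/R)ⁿ, which we can check numerically:

def inner_volume_ratio(n, R=1.0, dR=0.05):
    # Fraction of the n-ball volume that lies inside radius R - dR
    return ((R - dR) / R) ** n

print(1 - inner_volume_ratio(50))   # ~0.923: about 92.3% of the volume lies in the thin shell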

Note that the volume of the unit hypercube (here, denoting a cube centered at zero with a side length of 2) is 2ⁿ. The unit sphere is basically "empty" in very high dimensions, while the volume of the unit hypercube, in contrast, grows exponentially. Again, this shows how the idea of a "nearest neighbor" of a point loses its effectiveness: there is almost no point within a distance R of a query point q when n is large.

Curse of dimensionality, overfitting, and Occam’s Razor

The curse of dimensionality is closely related to the overfitting principle. Because of the exponential growth of the volume of the space with the dimension, we need very large datasets to adequately capture and model high-dimensional patterns. Even worse: we need a number of samples that grows exponentially with the dimension to overcome this limitation. This scenario, characterized by many features yet relatively few data points, is particularly prone to overfitting.

Occam’s Razor suggests that simpler models are generally better than complex ones because they are less likely to overfit. This principle is particularly relevant in high-dimensional contexts (where the curse of dimensionality plays a role) because it encourages the reduction of model complexity.

Applying Occam’s Razor principle in high-dimensional scenarios can mean reducing the dimensionality of the problem itself (via methods like PCA, feature selection, etc.), thereby mitigating some effects of the curse of dimensionality. Simplifying the model’s structure or the feature space helps in managing the sparse data distribution and in making distance metrics more meaningful again. For instance, dimensionality reduction is a very common preliminary step before applying the kNN algorithm. More recent methods, such as ANNs (Approximate Nearest Neighbors) also emerge as a way to deal with high-dimensional scenarios.
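
For example, a typical scikit-learn pipeline that reduces the dimension before applying kNN (the number of components and neighbors here are arbitrary) could look like this:

from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Project the data onto 20 principal components, then classify with kNN
model = make_pipeline(PCA(n_components=20), KNeighborsClassifier(n_neighbors=5))
# model.fit(X_train, y_train); model.predict(X_test)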

Blessing of dimensionality?

Image by Dall-E

While we’ve outlined the challenges of high-dimensional settings in machine learning, there are also some advantages!

  • High dimensions can enhance linear separability, making techniques like kernel methods more effective.
  • Additionally, deep learning architectures are particularly adept at navigating and extracting complex patterns from high-dimensional spaces.

As always with Machine Learning, this is a trade-off: leveraging these advantages involves balancing the increased computational demands with potential gains in model performance.

Conclusion

Hopefully, this gives you an idea of how "weird" geometry can be in high dimensions and the many challenges it poses for machine learning model development. We saw how, in high-dimensional spaces, data is very sparse but also tends to be concentrated in the corners, and distances lose their usefulness. For a deeper dive into the n-ball and mathematical proofs, I encourage you to visit the extended version of this article on my website.

While the "curse of dimensionality" outlines significant limitations in high-dimensional spaces, it’s exciting to see how modern deep learning models are increasingly adept at navigating these complexities. Consider the embedding models or the latest LLMs, for example, which utilize very high-dimensional vectors to more effectively discern and model textual patterns.


Want to learn more about Transformers and how they transform your data under the hood? Check out my previous article:

Transformers: How Do They Transform Your Data?



References:

Transformers: How Do They Transform Your Data?

Diving into the Transformers architecture and what makes them unbeatable at language tasks

Image by the author

In the rapidly evolving landscape of artificial intelligence and machine learning, one innovation stands out for its profound impact on how we process, understand, and generate data: Transformers. Transformers have revolutionized the field of natural language processing (NLP) and beyond, powering some of today’s most advanced AI applications. But what exactly are Transformers, and how do they manage to transform data in such groundbreaking ways? This article demystifies the inner workings of Transformer models, focusing on the encoder architecture. We will start by going through the implementation of a Transformer encoder in Python, breaking down its main components. Then, we will visualize how Transformers process and adapt input data during training.

While this blog doesn’t cover every architectural detail, it provides an implementation and an overall understanding of the transformative power of Transformers. For an in-depth explanation of Transformers, I suggest you look at the excellent Stanford CS224-n course.

I also recommend following the GitHub repository associated with this article for additional details. 😊

What is a Transformer encoder architecture?

The Transformer model from Attention Is All You Need

This picture shows the original Transformer architecture, combining an encoder and a decoder for sequence-to-sequence language tasks.

In this article, we will focus on the encoder architecture (the red block on the picture). This is what the popular BERT model is using under the hood: the primary focus is on understanding and representing the data, rather than generating sequences. It can be used for a variety of applications: text classification, named-entity recognition (NER), extractive question answering, etc.

So, how is the data actually transformed by this architecture? We will explain each component in detail, but here is an overview of the process.

  • The input text is tokenized: the Python string is transformed into a list of tokens (numbers)
  • Each token is passed through an Embedding layer that outputs a vector representation for each token
  • The embeddings are then further encoded with a Positional Encoding layer, adding information about the position of each token in the sequence
  • These new embeddings are transformed by a series of Encoder Layers, using a self-attention mechanism
  • A task-specific head can be added. For example, we will later use a classification head to classify movie reviews as positive or negative

It is important to understand that the Transformer architecture transforms the embedding vectors by mapping them from one representation in a high-dimensional space to another within the same space, applying a series of complex transformations.

Implementing an encoder architecture in Python

The Positional Encoder layer

Unlike RNN models, the attention mechanism makes no use of the order of the input sequence. The PositionalEncoder class adds positional encodings to the input embeddings, using two mathematical functions: cosine and sine.

Positional encoding matrix definition from Attention Is All You Need

Note that positional encodings don’t contain trainable parameters: they are the results of deterministic computations, which makes this method very tractable. Also, sine and cosine functions take values between -1 and 1 and have useful periodicity properties to help the model learn patterns about the relative positions of words.

import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionalEncoder(nn.Module):
    def __init__(self, d_model, max_length):
        super(PositionalEncoder, self).__init__()
        self.d_model = d_model
        self.max_length = max_length

        # Initialize the positional encoding matrix
        pe = torch.zeros(max_length, d_model)

        position = torch.arange(0, max_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float) * -(math.log(10000.0) / d_model))

        # Calculate and assign position encodings to the matrix
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        self.pe = pe.unsqueeze(0)

    def forward(self, x):
        x = x + self.pe[:, :x.size(1)] # update embeddings
        return x

Multi-Head Self-Attention

The self-attention mechanism is the key component of the encoder architecture. Let’s ignore the "multi-head" for now. Attention is a way to determine for each token (i.e. each embedding) the relevance of all other embeddings to that token, to obtain a more refined and contextually relevant encoding.

How does "it" pay attention to other words of the sequence? (The Illustrated Transformer)

There are four steps in the self-attention mechanism.

  • Use matrices Q, K, and V to respectively transform the inputs "query", "key" and "value". Note that for self-attention, query, key, and values are all equal to our input embedding
  • Compute the attention scores using a dot product between the query and the key. Scores are scaled by the square root of the embedding dimension to stabilize the gradients during training
  • Use a softmax layer to make these scores probabilities
  • The output is the weighted average of the values, using the attention scores as the weights

Mathematically, this corresponds to the following formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

The Attention Mechanism from Attention Is All You Need

What does "multi-head" mean? Basically, we can apply the described self-attention mechanism process several times, in parallel, and concatenate and project the outputs. This allows each head to focus on different semantic aspects of the sentence.

We start by defining the number of heads, the dimension of the embeddings (d_model), and the dimension of each head (head_dim). We also initialize the Q, K, and V matrices (linear layers), and the final projection layer.

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.head_dim = d_model // num_heads

        self.query_linear = nn.Linear(d_model, d_model)
        self.key_linear = nn.Linear(d_model, d_model)
        self.value_linear = nn.Linear(d_model, d_model)      
        self.output_linear = nn.Linear(d_model, d_model)

When using multi-head attention, we apply each attention head with a reduced dimension (head_dim instead of d_model), as in the original paper, making the total computational cost similar to a one-head attention layer with full dimensionality. Note that this is a logical split only. What makes multi-head attention so powerful is that it can still be represented via a single matrix operation, making computations very efficient on GPUs.

def split_heads(self, x, batch_size):
        # Split the sequence embeddings in x across the attention heads
        x = x.view(batch_size, -1, self.num_heads, self.head_dim)
        return x.permute(0, 2, 1, 3).contiguous().view(batch_size * self.num_heads, -1, self.head_dim)

We compute the attention scores and use a mask to avoid using attention on padded tokens. We apply a softmax activation to make these scores probabilities.

def compute_attention(self, query, key, mask=None):
      # Compute dot-product attention scores
      # dimensions of query and key are (batch_size * num_heads, seq_length, head_dim)
      scores = query @ key.transpose(-2, -1) / math.sqrt(self.head_dim)
      # Now, dimensions of scores is (batch_size * num_heads, seq_length, seq_length)
      if mask is not None:
          scores = scores.view(-1, scores.shape[0] // self.num_heads, mask.shape[1], mask.shape[2]) # for compatibility
          scores = scores.masked_fill(mask == 0, float('-1e20')) # mask to avoid attention on padding tokens
          scores = scores.view(-1, mask.shape[1], mask.shape[2]) # reshape back to original shape
      # Normalize attention scores into attention weights
      attention_weights = F.softmax(scores, dim=-1)

      return attention_weights

The forward method performs the multi-head logical split and computes the attention weights. Then, we get the output by multiplying these weights by the values. Finally, we reshape the output and project it with a linear layer.

def forward(self, query, key, value, mask=None):
      batch_size = query.size(0)

      query = self.split_heads(self.query_linear(query), batch_size)
      key = self.split_heads(self.key_linear(key), batch_size)
      value = self.split_heads(self.value_linear(value), batch_size)

      attention_weights = self.compute_attention(query, key, mask)

      # Multiply attention weights by values, concatenate and linearly project outputs
      output = torch.matmul(attention_weights, value)
      output = output.view(batch_size, self.num_heads, -1, self.head_dim).permute(0, 2, 1, 3).contiguous().view(batch_size, -1, self.d_model)
      return self.output_linear(output)

The Encoder Layer

This is the main component of the architecture, which leverages multi-head self-attention. We first implement a simple class to perform a feed-forward operation through 2 dense layers.

class FeedForwardSubLayer(nn.Module):
    def __init__(self, d_model, d_ff):
        super(FeedForwardSubLayer, self).__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.fc2(self.relu(self.fc1(x)))

We can now code the logic for the encoder layer. We start by applying self-attention to the input, which gives a vector of the same dimension. We then use our mini feed-forward network with Layer Norm layers. Note that we also use skip connections before applying normalization.

class EncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout):
        super(EncoderLayer, self).__init__()
        self.self_attn = MultiHeadAttention(d_model, num_heads)
        self.feed_forward = FeedForwardSubLayer(d_model, d_ff)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask):
        attn_output = self.self_attn(x, x, x, mask)
        x = self.norm1(x + self.dropout(attn_output)) # skip connection and normalization
        ff_output = self.feed_forward(x)
        return self.norm2(x + self.dropout(ff_output)) # skip connection and normalization

Putting Everything Together

It’s time to create our final model. We pass our data through an embedding layer. This transforms our raw tokens (integers) into a numerical vector. We then apply our positional encoder and several (num_layers) encoder layers.

class TransformerEncoder(nn.Module):
    def __init__(self, vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length):
        super(TransformerEncoder, self).__init__()
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoder(d_model, max_sequence_length)
        self.layers = nn.ModuleList([EncoderLayer(d_model, num_heads, d_ff, dropout) for _ in range(num_layers)])

    def forward(self, x, mask):
        x = self.embedding(x)
        x = self.positional_encoding(x)
        for layer in self.layers:
            x = layer(x, mask)
        return x

We also create a ClassifierHead class, which is used to transform the final embedding into class logits for our classification task (the softmax is applied inside the cross-entropy loss during training).

class ClassifierHead(nn.Module):
    def __init__(self, d_model, num_classes):
        super(ClassifierHead, self).__init__()
        self.fc = nn.Linear(d_model, num_classes)

    def forward(self, x):
        # Use only the first token's embedding: it corresponds to the [CLS] classification token
        logits = self.fc(x[:, 0, :])
        # Return raw logits: CrossEntropyLoss (used during training) applies the softmax internally
        return logits

Note that the dense layer (and the subsequent softmax, applied inside the loss) only uses the first embedding (corresponding to the first token of our input sequence). This is because when tokenizing the text, the first token is the [CLS] token which stands for "classification." The [CLS] token is designed to aggregate the entire sequence’s information into a single embedding vector, serving as a summary representation that can be used for classification tasks.

Note: the concept of including a [CLS] token originates from BERT, which was initially trained on tasks like next-sentence prediction. The [CLS] token was inserted to predict the likelihood that sentence B follows sentence A, with a [SEP] token separating the 2 sentences. For our model, the [SEP] token simply marks the end of the input sentence, as shown below.

[CLS] Token in BERT Architecture (All About AI)
[CLS] Token in BERT Architecture (All About AI)

When you think about it, it’s really mind-blowing that this single [CLS] embedding is able to capture so much information about the entire sequence, thanks to the self-attention mechanism’s ability to weigh and synthesize the importance of every piece of the text in relation to each other.

Training and visualization

Hopefully, the previous section gives you a better understanding of how our Transformer model transforms the input data. We will now write our training pipeline for our binary classification task using the IMDB dataset (movie reviews). Then, we will visualize the embedding of the [CLS] token during the training process to see how our model transformed it.

We first define our hyperparameters, as well as a BERT tokenizer. In the GitHub repository, you can see that I also coded a function to select a subset of the dataset with only 1200 train and 200 test examples.

import torch
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoTokenizer

num_classes = 2 # binary classification
d_model = 256 # dimension of the embedding vectors
num_heads = 4 # number of heads for self-attention
num_layers = 4 # number of encoder layers
d_ff = 512 # dimension of the dense layers in the encoder layers
sequence_length = 256 # maximum sequence length
dropout = 0.4 # dropout to avoid overfitting
num_epochs = 20
batch_size = 32

loss_function = torch.nn.CrossEntropyLoss()

dataset = load_dataset("imdb")
dataset = balance_and_create_dataset(dataset, 1200, 200) # check GitHub repo

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased', model_max_length=sequence_length)

You can try to use the BERT tokenizer on one of the sentences:

print(tokenized_datasets['train']['input_ids'][0])

Every sequence should start with the token 101, corresponding to [CLS], followed by some non-zero integers and padded with zeros if the sequence length is smaller than 256. Note that these zeros are ignored during the self-attention computation using our "mask".

tokenized_datasets = dataset.map(encode_examples, batched=True)
tokenized_datasets.set_format(type='torch', columns=['input_ids', 'attention_mask', 'label'])

train_dataloader = DataLoader(tokenized_datasets['train'], batch_size=batch_size, shuffle=True)
test_dataloader = DataLoader(tokenized_datasets['test'], batch_size=batch_size, shuffle=True)

vocab_size = tokenizer.vocab_size

encoder = TransformerEncoder(vocab_size, d_model, num_layers, num_heads, d_ff, dropout, max_sequence_length=sequence_length)
classifier = ClassifierHead(d_model, num_classes)

optimizer = torch.optim.Adam(list(encoder.parameters()) + list(classifier.parameters()), lr=1e-4)

We can now write our train function:

def train(dataloader, encoder, classifier, optimizer, loss_function, num_epochs):
    for epoch in range(num_epochs):        
        # Collect and store embeddings before each epoch starts for visualization purposes (check repo)
        all_embeddings, all_labels = collect_embeddings(encoder, dataloader)
        reduced_embeddings = visualize_embeddings(all_embeddings, all_labels, epoch, show=False)
        dic_embeddings[epoch] = [reduced_embeddings, all_labels]

        encoder.train()
        classifier.train()
        correct_predictions = 0
        total_predictions = 0
        for batch in tqdm(dataloader, desc="Training"):
            input_ids = batch['input_ids']
            attention_mask = batch['attention_mask'] # indicate where padded tokens are
            # These 2 lines make the attention_mask a matrix instead of a vector
            attention_mask = attention_mask.unsqueeze(-1)
            attention_mask = attention_mask & attention_mask.transpose(1, 2)
            labels = batch['label']
            optimizer.zero_grad()
            output = encoder(input_ids, attention_mask)
            classification = classifier(output)
            loss = loss_function(classification, labels)
            loss.backward()
            optimizer.step()
            preds = torch.argmax(classification, dim=1)
            correct_predictions += torch.sum(preds == labels).item()
            total_predictions += labels.size(0)

        epoch_accuracy = correct_predictions / total_predictions
        print(f'Epoch {epoch} Training Accuracy: {epoch_accuracy:.4f}')

You can find the collect_embeddings and visualize_embeddings functions in the GitHub repo. They store the [CLS] token embedding for each sentence of the training set, apply a dimensionality reduction technique called t-SNE to make them 2D vectors (instead of 256-dimensional vectors), and save an animated plot.

Let’s visualize the results.

Projected [CLS] embeddings for each training point (blue corresponds to positive sentences, red corresponds to negative sentences)

Observing the plot of projected [CLS] embeddings for each training point, we can see the clear distinction between positive (blue) and negative (red) sentences after a few epochs. This visual shows the remarkable capability of the Transformer architecture to adapt embeddings over time and highlights the power of the self-attention mechanism. The data is transformed in such a way that embeddings for each class are well separated, thereby significantly simplifying the task for the classifier head.

Conclusion

As we conclude our exploration of the Transformer architecture, it’s evident that these models are adept at tailoring data to a given task. With the use of positional encoding and multi-head self-attention, Transformers go beyond mere data processing: they interpret and understand information with a level of sophistication previously unseen. The ability to dynamically weigh the relevance of different parts of the input data allows for a more nuanced understanding and representation of the input text. This enhances performance across a wide array of downstream tasks, including text classification, question answering, named entity recognition, and more.

Now that you have a better understanding of the encoder architecture, you are ready to delve into decoder and encoder-decoder models, which are very similar to what we have just explored. Decoders play a pivotal role in generative tasks and are at the core of the popular GPT models.



References

[1] Vaswani, Ashish, et al. "Attention Is All You Need." 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

[2] "The Illustrated Transformer." Jay Alammar’s Blog, June 2018, http://jalammar.github.io/illustrated-transformer/

[3] Official PyTorch Implementation of the Transformer Architecture. GitHub repository, PyTorch, https://github.com/pytorch/pytorch/blob/master/torch/nn/modules/transformer.py

[4] Manning, Christopher, et al. "CS224n: Natural Language Processing with Deep Learning." Stanford University, Stanford CS224N NLP course, http://web.stanford.edu/class/cs224n/
