
Deep Dive into LSTMs and xLSTMs by Hand ✍

Explore the wisdom of LSTM leading into xLSTMs – a probable competition to the present-day LLMs

Image by author (The ancient wizard as created by my 4-year old)

"In the enchanted realm of Serentia, where ancient forests whispered secrets of spells long forgotten, there dwelled the Enigmastrider – a venerable wizard, guardian of timeless wisdom.

One pivotal day as Serentia faced dire peril, the Enigmastrider wove a mystical ritual using the Essence Stones, imbued with the essence of past, present, and future. Drawing upon ancient magic he conjured the LSTM, a conduit of knowledge capable of preserving Serentia’s history and foreseeing its destiny. Like a river of boundless wisdom, the LSTM flowed transcending the present and revealing what lay beyond the horizon.

From his secluded abode the Enigmastrider observed as Serentia was reborn, ascending to new heights. He knew that his arcane wisdom and tireless efforts had once again safeguarded a legacy in this magical realm."


And with that story we begin our expedition to the depths of one of the most appealing Recurrent Neural Networks – the Long Short-Term Memory Networks, very popularly known as the LSTMs. Why do we revisit this classic? Because they may once again become useful as longer context-lengths in language modeling grow in importance.

Can LSTMs once again get an edge over LLMs?

A short while ago, researchers in Austria came up with a promising initiative to revive the lost glory of LSTMs – by introducing the more evolved Extended Long Short-Term Memory, also called xLSTM. It would not be wrong to say that before Transformers, LSTMs held the throne for innumerable deep-learning successes. Now the question stands: with their abilities maximized and drawbacks minimized, can they compete with the present-day LLMs?

To learn the answer, let’s move back in time a bit and revise what LSTMs were and what made them so special:

Long Short-Term Memory Networks were first introduced in 1997 by Hochreiter and Schmidhuber to address the long-term dependency problem faced by RNNs. With around 106,518 citations on the paper, it is no wonder that LSTMs are a classic.

The key idea in an LSTM is the ability to learn when to remember and when to forget relevant information over arbitrary time intervals. Just like us humans. Rather than starting every idea from scratch – we rely on much older information and are able to very aptly connect the dots. Of course, when talking about LSTMs, the question arises – don’t RNNs do the same thing?

The short answer is yes, they do. However, there is a big difference. The RNN architecture does not support delving too much in the past – only up to the immediate past. And that is not very helpful.

As an example, let’s consider these lines John Keats wrote in ‘To Autumn’:

"Season of mists and mellow fruitfulness,

Close bosom-friend of the maturing sun;"

As humans, we understand that the words "mists" and "mellow fruitfulness" are conceptually related to the season of autumn, evoking ideas of a specific time of year. Similarly, LSTMs can capture this notion and use it to understand the context further when the words "maturing sun" come in. Despite the separation between these words in the sequence, LSTM networks can learn to associate them and keep the previous connections intact. And this is the big contrast when compared with the original Recurrent Neural Network framework.

And the way LSTMs do it is with the help of a gating mechanism. If we consider the architecture of an RNN vs an LSTM, the difference is very evident. The RNN has a very simple architecture – the past state and present input pass through an activation function to output the next state. An LSTM block, on the other hand, adds three gates on top of an RNN block: the input gate, the forget gate and the output gate, which together handle the past state along with the present input. This idea of gating is what makes all the difference.

To understand things further, let’s dive into the details with these incredible works on LSTMs and xLSTMs by the amazing Prof. Tom Yeh.

First, let’s understand the mathematical cogs and wheels behind LSTMs before exploring their newer version.

(All the images below, unless otherwise noted, are by Prof. Tom Yeh from the above-mentioned LinkedIn posts, which I have edited with his permission. )

So, here we go:

How does an LSTM work?

[1] Initialize

The first step begins with randomly assigning values to the previous hidden state h0 and memory cell C0. Keeping it in sync with the diagrams, we set

h0 → [1,1]

C0 → [0.3, -0.5]

[2] Linear Transform

In the next step, we perform a linear transform by multiplying the four weight matrices (Wf, Wc, Wi and Wo) with the concatenated current input X1 and the previous hidden state that we assigned in the previous step.

The resultant values are called feature values obtained as the combination of the current input and the hidden state.

[3] Non-linear Transform

This step is crucial in the LSTM process. It is a non-linear transform with two parts – a sigmoid σ and tanh.

The sigmoid is used to obtain gate values between 0 and 1. This layer essentially determines what information to retain and what to forget. The values always range between 0 and 1 – a ‘0’ implies completely eliminating the information whereas a ‘1’ implies keeping it in place.

  • Forget gate (f1): [-4, -6] → [0, 0]
  • Input gate (i1): [6, 4] → [1, 1]
  • Output gate (o1): [4, -5] → [1, 0]

In the next part, tanh is applied to obtain new candidate memory values that could be added on top of the previous information.

  • Candidate memory (C’1): [1, -6] → [0.8, -1]

[4] Update Memory

Once the above values are obtained, it is time to update the current state using these values.

The previous step made the decision on what needs to be done, in this step we implement that decision.

We do so in two parts:

  1. Forget: Multiply the previous memory values (C0) element-wise with the forget-gate values. This scales down (or zeroes out) the parts of the memory that were marked to be forgotten. → C0 .* f1
  2. Input: Multiply the candidate memory values (C’1) element-wise with the input-gate values to obtain the ‘input-scaled’ memory values. → C’1 .* i1

Finally, we add these two terms to get the updated memory C1, i.e. C0 .* f1 + C’1 .* i1 = C1

[5] Candidate Output

Finally, we make the decision on how the output is going to look like:

To begin, we first apply tanh as before to the new memory C1 to obtain a candidate output o’1. This pushes the values between -1 and 1.

[6] Update Hidden State

To get the final output, we multiply the candidate output o’1 obtained in the previous step with the sigmoid of the output gate o1 obtained in Step 3. The result obtained is the first output of the network and is the updated hidden state h1, i.e. o’1 .* o1 = h1.

– – Process t = 2 – –

We continue with the subsequent iterations below:

[7] Initialize

First, we copy the updates from the previous steps i.e. updated hidden state h1 and memory C1.

[8] Linear Transform

We repeat Step [2], multiplying the weight matrices with the concatenated current input X2 and the updated hidden state h1.

[9] Update Memory (C2)

We repeat steps [3] and [4] which are the non-linear transforms using sigmoid and tanh layers, followed by the decision on forgetting the relevant parts and introducing new information – this gives us the updated memory C2.

[10] Update Hidden State (h2)

Finally, we repeat steps [5] and [6] which adds up to give us the second hidden state h2.

Next, we have the final iteration.

– – Process t = 3 – –

[11] Initialize

Once again we copy the hidden state and memory from the previous iteration i.e. h2 and C2.

[12] Linear Transform

We perform the same linear-transform as we do in Step 2.

[13] Update Memory (C3)

Next, we perform the non-linear transforms and perform the memory updates based on the values obtained during the transform.

[14] Update Hidden State (h3)

Once done, we use those values to obtain the final hidden state h3.

Summary:

To summarize the working above, the key thing to remember is that an LSTM depends on three main gates: input, forget and output. These gates, as can be inferred from their names, control which parts of the information are relevant, how much of it to keep, and which parts can be discarded.

Very briefly, the steps to do so are as follows:

  1. Initialize the hidden state and memory values from the previous state.
  2. Perform linear-transform to help the network start looking at the hidden state and memory values.
  3. Apply non-linear transforms (sigmoid and tanh) to determine what values to retain/discard and to obtain new candidate memory values.
  4. Based on the decision (values obtained) in Step 3, we perform memory updates.
  5. Next, we determine what the output is going to look like based on the memory update obtained in the previous step. We obtain a candidate output here.
  6. We combine the candidate output with the gated output value obtained in Step 3 to finally reach the intermediate hidden state.

This loop continues for as many iterations as needed.
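For readers who prefer code, here is a minimal NumPy sketch of a single LSTM step, assuming the standard sigmoid/tanh gate formulation described above (the hand exercise rounds these values for readability). The weight shapes and toy numbers are illustrative only:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_c, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step following the walkthrough above.

    x_t: current input, h_prev / c_prev: previous hidden state and memory.
    The W_* matrices act on the concatenation [h_prev; x_t], as in Step [2].
    """
    z = np.concatenate([h_prev, x_t])       # [2] linear-transform input
    f_t = sigmoid(W_f @ z + b_f)            # [3] forget gate
    i_t = sigmoid(W_i @ z + b_i)            # [3] input gate
    o_t = sigmoid(W_o @ z + b_o)            # [3] output gate
    c_cand = np.tanh(W_c @ z + b_c)         # [3] candidate memory C'
    c_t = f_t * c_prev + i_t * c_cand       # [4] update memory
    h_t = o_t * np.tanh(c_t)                # [5]-[6] candidate output gated by o_t
    return h_t, c_t

# Toy run matching the dimensions used above (hidden size 2, input size 2).
rng = np.random.default_rng(0)
h0, c0 = np.array([1.0, 1.0]), np.array([0.3, -0.5])
x1 = rng.normal(size=2)
W = {k: rng.normal(size=(2, 4)) for k in "fico"}
b = {k: np.zeros(2) for k in "fico"}
h1, c1 = lstm_step(x1, h0, c0, W["f"], W["i"], W["c"], W["o"],
                   b["f"], b["i"], b["c"], b["o"])
print(h1, c1)
```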

Extended Long-Short Term Memory (xLSTM)

The need for xLSTMs

When LSTMs emerged, they definitely set the platform for doing something that had not been done previously. Recurrent Neural Networks could have memory, but it was very limited – hence the birth of the LSTM to support long-term dependencies. However, it was not enough, because processing inputs strictly as sequences obstructed the use of parallel computation and, moreover, performance still dropped over very long dependencies.

Thus, as a solution to it all were born the transformers. But the question still remained – can we once again use LSTMs by addressing their limitations to achieve what Transformers do? To answer that question, came the xLSTM architecture.

How is xLSTM different from LSTM?

xLSTMs can be seen as a very evolved version of LSTMs. The underlying structure of the LSTM is preserved in the xLSTM; however, new elements have been introduced that help handle the drawbacks of the original form.

Exponential Gating & Scalar Memory Mixing – sLSTM

The most crucial difference is the introduction of exponential gating. In LSTMs, when we perform Step [3], we apply sigmoid gating to the gates, while in xLSTMs this is replaced by exponential gating.

For example, the input gate i1, previously computed with a sigmoid, i1 = σ(ĩ1), is now computed with an exponential, i1 = exp(ĩ1), where ĩ1 is the pre-activation obtained from the linear transform.

With the bigger range that exponential gating provides, xLSTMs are able to handle updates better compared to the sigmoid function, which compresses inputs to the range (0, 1). There is a catch though – exponential values may grow to be very large. To mitigate that problem, xLSTMs incorporate normalization, and the logarithm function seen in the equations below plays an important role here.

Image from Reference [1]

The logarithm does reverse the effect of the exponential, but their combined application, as the xLSTM paper claims, leads to balanced states.

This exponential gating along with memory mixing among the different gates (as in the original LSTM) forms the sLSTM block.
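As a rough illustration only (the exact formulation is in the xLSTM paper; the variable names and the stabilizer bookkeeping below are my own simplification), a single sLSTM-style gating step might be sketched as:

```python
import numpy as np

def slstm_gate_step(i_tilde, f_tilde, z_t, o_t, c_prev, n_prev, m_prev):
    """Sketch of sLSTM-style exponential gating with a log-domain stabilizer m.

    i_tilde, f_tilde: pre-activations of the input and (exponential) forget gates,
    z_t: candidate cell input, o_t: output gate value (sigmoid already applied).
    """
    m_t = np.maximum(f_tilde + m_prev, i_tilde)   # stabilizer: running max in log space
    i_t = np.exp(i_tilde - m_t)                   # stabilized exponential input gate
    f_t = np.exp(f_tilde + m_prev - m_t)          # stabilized exponential forget gate
    c_t = f_t * c_prev + i_t * z_t                # cell update (cf. Step [4] of the LSTM)
    n_t = f_t * n_prev + i_t                      # normalizer state
    h_t = o_t * (c_t / n_t)                       # normalized hidden state
    return c_t, n_t, m_t, h_t

# Toy call with scalar states; even large pre-activations stay numerically stable.
print(slstm_gate_step(i_tilde=50.0, f_tilde=45.0, z_t=0.8, o_t=0.9,
                      c_prev=0.3, n_prev=1.0, m_prev=0.0))
```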

Matrix Memory Cell – mLSTM

The other new aspect of the xLSTM architecture is the move from a scalar memory to a matrix memory, which allows it to process more information in parallel. It also bears resemblance to the transformer architecture by introducing key, query and value vectors, with the normalizer state computed as the weighted sum of key vectors, where each key vector is weighted by the input and forget gates.
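A simplified sketch of this matrix-memory update, again with my own variable names, scalar gates, and none of the paper's further refinements (such as gate stabilization or multiple heads), could look like this:

```python
import numpy as np

def mlstm_step(q_t, k_t, v_t, i_t, f_t, o_t, C_prev, n_prev):
    """Sketch of an mLSTM-style matrix-memory update with scalar gates.

    q_t, k_t, v_t: query/key/value vectors for the current token,
    i_t, f_t: input and forget gate values, o_t: output gate vector.
    C is a d x d matrix memory, n a normalizer vector (weighted sum of keys).
    """
    d = k_t.shape[0]
    k_scaled = k_t / np.sqrt(d)                          # scale the key, as in attention
    C_t = f_t * C_prev + i_t * np.outer(v_t, k_scaled)   # store the value along the key direction
    n_t = f_t * n_prev + i_t * k_scaled                  # update the normalizer state
    h_tilde = C_t @ q_t / max(abs(n_t @ q_t), 1.0)       # retrieve with the query and normalize
    return C_t, n_t, o_t * h_tilde

# Toy call with a 3-dimensional memory.
rng = np.random.default_rng(0)
d = 3
q, k, v = rng.normal(size=d), rng.normal(size=d), rng.normal(size=d)
C0, n0 = np.zeros((d, d)), np.zeros(d)
C1, n1, h1 = mlstm_step(q, k, v, i_t=1.0, f_t=0.9, o_t=np.ones(d), C_prev=C0, n_prev=n0)
print(h1)
```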

Once the sLSTM and mLSTM blocks are ready, they are stacked one over the other using residual connections to yield xLSTM blocks and finally the xLSTM architecture.

Thus, the introduction of exponential gating (with appropriate normalization) along with the newer memory structures establishes a strong foundation for xLSTMs to achieve results similar to the transformers.

To summarize:

  1. An LSTM is a special Recurrent Neural Network (RNN) that allows connecting previous information to the current state, just as we humans do with the persistence of our thoughts. LSTMs became incredibly popular because of their ability to look far into the past rather than depending only on the immediate past. What made it possible was the introduction of special gating elements into the RNN architecture:
  • Forget Gate: Determines what information from the previous cell state should be kept or forgotten. By selectively forgetting irrelevant past information, the LSTM maintains long-term dependencies.
  • Input Gate: Determines what new information should be stored in the cell state. By controlling how the cell state is updated, it incorporates new information important for predicting the current output.
  • Output Gate: Determines what information should be output as the hidden state. By selectively exposing parts of the cell state as the output, the LSTM can provide relevant information to subsequent layers while suppressing the non-pertinent details, thus propagating only the important information over longer sequences.
  2. An xLSTM is an evolved version of the LSTM that addresses the drawbacks faced by the LSTM. It is true that LSTMs are capable of handling long-term dependencies; however, the information is processed sequentially and thus doesn’t incorporate the power of parallelism that today’s transformers capitalize on. To address that, xLSTMs bring in:
  • sLSTM: Exponential gating, which covers larger ranges compared to the sigmoid activation, along with scalar memory mixing.
  • mLSTM: New memory structures with matrix memory to enhance memory capacity and enable more efficient information retrieval.

Will LSTMs make their comeback?

LSTMs overall are part of the Recurrent Neural Network family, processing information sequentially and recursively. The advent of Transformers largely obliterated the application of recurrence; however, their struggle to handle extremely long sequences remains a burning problem, since self-attention scales quadratically with context length.

Thus, it does seem worthwhile to explore options that could at least illuminate a solution path, and a good starting point would be going back to LSTMs – in short, LSTMs have a good chance of making a comeback. The present xLSTM results definitely look promising. And then, to round it all up, the use of recurrence by Mamba stands as good testimony that this could be a lucrative path to explore.

So, let’s follow along in this journey and see it unfold while keeping in mind the power of recurrence!

P.S. If you would like to work through this exercise on your own, here is a link to a blank template for your use.

Blank Template for hand-exercise

Now go have fun and create some Long Short-Term effect!

References:

  1. xLSTM: Extended Long Short-Term Memory, Beck, Maximilian, et al., May 2024. https://arxiv.org/abs/2405.04517
  2. Long Short-Term Memory, Sepp Hochreiter and Jürgen Schmidhuber, 1997, Neural Comput. 9, 8 (November 15, 1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735

Deep Dive into Anthropic’s Sparse Autoencoders by Hand ✍️

Explore the concepts behind the interpretability quest for LLMs

Image by author (Zephyra, the protector of Lumaria by my 4-year old)

"In the mystical lands of Lumaria, where ancient magic filled the air, lived Zephyra, the Ethereal Griffin. With the body of a lion and the wings of an eagle, Zephyra was the revered protector of the Codex of Truths, an ancient script holding the universe’s secrets.

Nestled in a sacred cave, the Codex was safeguarded by Zephyra’s viridescent eyes, which could see through deception to unveil pure truths. One day, a dark sorcerer descended on the lands of Lumaria and sought to shroud the world in ignorance by concealing the Codex. The villagers called upon Zephyra, who soared through the skies, as a beacon of hope. With a majestic sweep of the wings, Zephyra created a protective barrier of light around the grove, repelling the sorcerer and exposing the truths.

After a long duel, it was concluded that the dark sorcerer was no match to Zephyra’s light. Through her courage and vigilance, the true light kept shining over Lumaria. And as time went by, Lumaria was guided to prosperity under Zephyra’s protection and its path stayed illuminated by the truths Zephyra safeguarded. And this is how Zephyra’s legend lived on!"


Anthropic’s journey ‘towards extracting interpretable features’

Following the story of Zephyra, Anthropic AI delved into the expedition of extracting meaningful features in a model. The idea behind this investigation lies in understanding how different components in a neural network interact with one another and what role each component plays.

According to the paper "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning", a Sparse Autoencoder is able to successfully extract meaningful features from a model. In other words, Sparse Autoencoders help break down the problem of ‘polysemanticity’ – neural activations that correspond to several meanings/interpretations at once – by focusing on sparsely activating features that each hold a single interpretation, i.e. are more one-directional.

To understand how all of it is done, we have these beautiful handiworks on Autoencoders and Sparse Autoencoders by Prof. Tom Yeh that explain the behind-the-scenes workings of these phenomenal mechanisms.

(All the images below, unless otherwise noted, are by Prof. Tom Yeh from the above-mentioned LinkedIn posts, which I have edited with his permission. )

To begin, let us first explore what an Autoencoder is and how it works.

What is an Autoencoder?

Imagine a writer has his desk strewn with different papers – some are his notes for the story he is writing, some are copies of final drafts, some are again illustrations for his action-packed story. Now amidst this chaos, it is hard to find the important parts – more so when the writer is in a hurry and the publisher is on the phone demanding a book in two days. Thankfully, the writer has a very efficient assistant – this assistant makes sure the cluttered desk is cleaned regularly, grouping similar items, organizing and putting things into their right place. And as and when needed, the assistant would retrieve the correct items for the writer, helping him meet the deadlines set by his publisher.

Well, the name of this assistant is Autoencoder. It mainly has two functions – encoding and decoding. Encoding refers to condensing input data and extracting the essential features (organization). Decoding is the process of reconstructing original data from encoded representation while aiming to minimize information loss (retrieval).

Now let’s look at how this assistant works.

How does an Autoencoder Work?

Given : Four training examples X1, X2, X3, X4.

[1] Auto

The first step is to copy the training examples to targets Y’. The Autoencoder’s work is to reconstruct these training examples. Since the targets are the training examples themselves, the word ‘Auto’ is used which is Greek for ‘self’.

[2] Encoder : Layer 1 +ReLU

As we have seen in all our previous models, a simple weight and bias matrix coupled with ReLU is powerful and is able to do wonders. Thus, by using the first Encoding layer we reduce the size of the original feature set from 4×4 to 3×4.

A quick recap:

Linear transformation : The input embedding vector is multiplied by the weight matrix W and then added with the bias vector b,

z = Wx + b, where W is the weight matrix, x is our input vector and b is the bias vector.

ReLU activation function : Next, we apply the ReLU to this intermediate z.

ReLU returns the element-wise maximum of the input and zero. Mathematically, h = max{0,z}.

[3] Encoder : Layer 2 + ReLU

The output of the previous layer is processed by the second Encoder layer which reduces the input size further to 2×3. This is where the extraction of relevant features occurs. This layer is also called the ‘bottleneck’ since the outputs in this layer have much lower features than the input features.

[4] Decoder : Layer 1 + ReLU

Once the encoding process is complete, the next step is to decode the relevant features to build ‘back’ the final output. To do so, we multiply the features from the last step with the corresponding weights, add the biases and apply the ReLU layer. The result is a 3×4 matrix.

[5] Decoder : Layer 2 + ReLU

A second Decoder layer (weights, biases + ReLU) is applied to the previous output to give the final result, which is the reconstructed 4×4 matrix. We do so to get back to the original dimension in order to compare the results with our original target.

[6] Loss Gradients & BackPropagation

Once the output from the decoder layer is obtained, we calculate the gradients of the Mean Square Error (MSE) between the outputs (Y) and the targets (Y’). To do so, we find 2(Y − Y’), which gives us the final gradients that activate the backpropagation process and update the weights and biases accordingly.
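Putting the six steps together, here is a compact NumPy sketch of the forward pass and the MSE gradient, using random illustrative weights and the same 4 → 3 → 2 → 3 → 4 layer sizes as the walkthrough:

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

# Toy setup: four 4-dimensional training examples (columns), targets Y' = X.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 4))
W1, b1 = rng.normal(size=(3, 4)), np.zeros((3, 1))   # encoder layer 1: 4 -> 3
W2, b2 = rng.normal(size=(2, 3)), np.zeros((2, 1))   # encoder layer 2 (bottleneck): 3 -> 2
W3, b3 = rng.normal(size=(3, 2)), np.zeros((3, 1))   # decoder layer 1: 2 -> 3
W4, b4 = rng.normal(size=(4, 3)), np.zeros((4, 1))   # decoder layer 2: 3 -> 4

H1 = relu(W1 @ X + b1)        # [2] encoder layer 1 + ReLU
H2 = relu(W2 @ H1 + b2)       # [3] bottleneck features
H3 = relu(W3 @ H2 + b3)       # [4] decoder layer 1 + ReLU
Y  = relu(W4 @ H3 + b4)       # [5] reconstruction

grad_Y = 2 * (Y - X)          # [6] gradient of the MSE w.r.t. the output
print(np.mean((Y - X) ** 2), grad_Y.shape)
```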

Now that we understand how the Autoencoder works, it’s time to explore how its sparse variation is able to achieve interpretability for Large Language Models (LLMs).

Sparse Autoencoder – How does it work?

To start with, suppose we are given:

  • The output of a transformer after the feed-forward layer has processed it, i.e. let us assume we have the model activations for five tokens (X). They are good but they do not shed light on how the model arrives at its decision or makes the predictions.

The prime question here is:

Is it possible to map each activation (3D) to a higher-dimensional space (6D) that will help with the understanding?

[1] Encoder : Linear Layer

The first step in the Encoder layer is to multiply the input X with encoder weights and add biases (as done in the first step of an Autoencoder).

[2] Encoder : ReLU

The next sub-step is to apply the ReLU activation function to add non-linearity and suppress negative activations. This suppression leads to many features being set to 0 which enables the concept of sparsity – outputting sparse and interpretable features f.

Interpretability happens when we have only one or two positive features. If we examine f6, we can see X2 and X3 are positive, and may say that both have ‘Mountain’ in common.

[3] Decoder : Reconstruction

Once we are done with the encoder, we proceed to the decoder step. We multiply f with decoder weights and add biases. This outputs X’, which is the reconstruction of X from interpretable features.

As done in an Autoencoder, we want X’ to be as close to X as possible. To ensure that, further training is essential.

[4] Decoder : Weights

As an intermediary step, we compute the L2 norm for each of the weights in this step. We keep them aside to be used later.

L2-norm

Also known as Euclidean norm, L2-norm calculates the magnitude of a vector using the formula: ||x||₂ = √(Σᵢ xᵢ²).

In other words, it sums the squares of each component and then takes the square root over the result. This norm provides a straightforward way to quantify the length or distance of a vector in Euclidean space.
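For example, computing the column-wise L2 norms of a (hypothetical) decoder weight matrix in NumPy looks like this:

```python
import numpy as np

W_dec = np.array([[0.5, -1.0],
                  [2.0,  0.0],
                  [1.0,  3.0]])              # hypothetical decoder weights, one column per feature

col_norms = np.linalg.norm(W_dec, axis=0)    # ||w||_2 = sqrt(sum of squares) per feature column
print(col_norms)                             # kept aside for the sparsity step later
```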

Training

As mentioned earlier, a Sparse Autoencoder entails extensive training to get the reconstructed X’ closer to X. To illustrate that, we proceed to the next steps below:

[5] Sparsity : L1 Loss

The goal here is to obtain as many values that are zero or close to zero as possible. We do so by invoking L1 sparsity to penalize the absolute values of the weights – the core idea being that we want to make their sum as small as possible.

L1-loss

The L1-loss is calculated as the sum of the absolute values of the weights: L1 = λΣ|w|, where λ is a regularization parameter.

This encourages many weights to become zero, simplifying the model and thus enhancing interpretability.

In other words, L1 helps build the focus on the most relevant features while also preventing overfitting, improving model generalization, and reducing computational complexity.

[6] Sparsity : Gradient

The next step is to calculate the gradient term from L1, which is constant for positive values: for all values of f > 0, the corresponding entry is set to -1 (the direction in which the value should shrink).


How does L1 penalty push weights towards zero?

The gradient of the L1 penalty pushes weights towards zero through a process that applies a constant force, regardless of the weight’s current value. Here’s how it works (all images in this sub-section are by author):

The L1 penalty is expressed as: L1 = λΣ|w|.

The gradient of this penalty with respect to a weight w is: ∂L1/∂w = λ · sign(w),

where sign(w) is +1 for w > 0, −1 for w < 0, and 0 for w = 0.

During gradient descent, the update rule for a weight is: w ← w − 𝞰 · λ · sign(w),

where 𝞰 is the learning rate.

The constant subtraction (or addition) of 𝞰·λ from the weight value (depending on its sign) decreases the absolute value of the weight. If the weight is small enough, this process can drive it to exactly zero.
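A tiny numerical illustration of this constant pull, with a hypothetical λ and learning rate:

```python
def sign(x):
    return (x > 0) - (x < 0)

lam, eta = 0.1, 1.0               # hypothetical regularization strength and learning rate
w = 0.35
for step in range(5):
    w = w - eta * lam * sign(w)   # w <- w - eta * lambda * sign(w)
    if abs(w) < eta * lam:        # snap values that would cross zero to exactly zero
        w = 0.0
    print(step, round(w, 2))
# prints 0.25, 0.15, 0.0, 0.0, 0.0 -- a constant pull toward zero regardless of magnitude
```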


[7] Sparsity : Zero

For all other values that are already zero, we keep them unchanged since they have already been zeroed out.

[8] Sparsity : Weight

We multiply each row of the gradient matrix obtained in Step 6 by the corresponding decoder weight norms obtained in Step 4. This step is crucial as it prevents the model from learning large weights, which would add incorrect information while reconstructing the results.

[9] Reconstruction : MSE Loss

We use the Mean Square Error, or L2 loss function, to calculate the difference between X’ and X. The goal, as seen previously, is to minimize this error as much as possible.

[10] Reconstruction : Gradient

The gradient of the L2 loss is 2(X’ − X).

And hence, as with the original Autoencoder, we run backpropagation to update the weights and the biases. The catch here is finding a good balance between sparsity and reconstruction.
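Here is a hedged sketch of how the two loss terms might be combined in code; the shapes follow the 3-dimensional activations and 6 features used above, while the weights, the λ value and the exact weighting of the penalty are illustrative assumptions rather than Anthropic’s actual implementation:

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)

def sae_losses(X, W_enc, b_enc, W_dec, b_dec, lam=1e-3):
    """Sketch of one Sparse Autoencoder loss evaluation (names and lam are illustrative).

    Combines the reconstruction (MSE) term with the L1 sparsity term, here weighted
    by the decoder column norms as in Steps [4]-[8] above.
    """
    f = relu(W_enc @ X + b_enc)                        # [1]-[2] encode: interpretable features
    X_rec = W_dec @ f + b_dec                          # [3] decode: reconstruction X'
    mse = np.mean((X_rec - X) ** 2)                    # [9] reconstruction loss
    dec_norms = np.linalg.norm(W_dec, axis=0)          # [4] decoder column norms
    l1 = lam * np.sum(dec_norms[:, None] * np.abs(f))  # [5]+[8] sparsity penalty
    return mse + l1, mse, l1

# Toy shapes: 3-dimensional activations for 5 tokens, mapped to 6 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 5))
W_enc, b_enc = rng.normal(size=(6, 3)), np.zeros((6, 1))
W_dec, b_dec = rng.normal(size=(3, 6)), np.zeros((3, 1))
print(sae_losses(X, W_enc, b_enc, W_dec, b_dec))
```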

And with this, we come to the end of this very clever and intuitive way of learning how a model understands an idea and the direction it takes to generate a response.

To summarize:

  1. An Autoencoder overall consists of two parts : Encoder and Decoder. The Encoder uses weights and biases coupled with the ReLU activation function to compress the initial input features into a lower dimension, trying to capture only the relevant parts. The Decoder on the other hand takes the output of the Encoder and works to reconstruct the input features back to their original state. Since the targets in an Autoencoder are the initial features themselves, hence the use of the word ‘auto’. The aim, as is for standard neural networks, is to achieve the lowest error (difference) between the target and the input features – and it is achieved by propagating the gradient of the error through the network while updating the weights and biases.
  2. A Sparse Autoencoder consists of all the components of a standard Autoencoder along with a few more additions. The key here is the different approach in the training step. Since the aim is to retrieve interpretable features, we want to zero out those values which hold relatively less meaning. Once the encoder uses ReLU to suppress the negative values, we go a step further and use an L1 loss on the result to encourage sparsity by penalizing the absolute values of the weights. This is achieved by adding a penalty term to the loss function, which is the sum of the absolute values of the weights: λΣ|w|. The weights that remain non-zero are those that are crucial for the model’s performance.

Extracting Interpretable features using Sparsity

As humans, our brains activate only a small subset of neurons in response to specific stimuli. Likewise, Sparse Autoencoders learn a sparse representation of the input by leveraging sparsity constraints like L1 regularization. By doing so, a Sparse Autoencoder is able to extract interpretable features from complex data, thus enhancing the simplicity and interpretability of the learned features. This selective activation, mirroring biological neural processes, helps focus on the most relevant aspects of the input data, making the models more robust and efficient.

With Anthropic’s endeavor to understand interpretability in AI models, their initiative highlights the need for transparent and understandable AI systems, especially as they become more integrated into critical decision-making processes. By focusing on creating models that are both powerful and interpretable, Anthropic contributes to the development of AI that can be trusted and effectively utilized in real-world applications.

In conclusion, Sparse Autoencoders are vital for extracting interpretable features, enhancing model robustness, and ensuring efficiency. The ongoing work on understanding these powerful models and how they make inferences underscores the growing importance of interpretability in AI, paving the way for more transparent AI systems. It remains to be seen how these concepts evolve and drive us towards a future of safe integration of AI in our lives!

P.S. If you would like to work through this exercise on your own, here is a link to a blank template for your use.

Blank Template for hand-exercise

Now go have fun and help Zephyra keep the Codex of Truths safe!

Once again special thanks to Prof. Tom Yeh for supporting this work!

References:

[1] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, Bricken et al. Oct 2023 https://transformer-circuits.pub/2023/monosemantic-features/index.html

[2] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Templeton et al. May 2024 https://transformer-circuits.pub/2024/scaling-monosemanticity/

Deep Dive into LlaMA 3 by Hand ✍️

Explore the nuances of the transformer architecture behind Llama 3 and its prospects for the GenAI ecosystem

Image by author (The shining LlaMA 3 rendition by my 4-year old.)

"In the rugged mountain of the Andes, lived three very beautiful creatures – Rio, Rocky and Sierra. With their lustrous coat and sparkling eyes, they stood out as a beacon of strength and resilience.

As the story goes, it was said that from a very young age their thirst for knowledge was never-ending. They would seek out the wise elders of their herd, listening intently to their stories and absorbing their wisdom like a sponge. With that grew their superpower which was working together with others and learning that teamwork was the key to acing the trials in the challenging terrain of the Andes.

If they encountered travelers who had lost their way or needed help, Rio took in their perspective and led them with comfort, Rocky provided swift solutions while Sierra made sure they had the strength to carry on. And with this they earned admiration and encouraged everyone to follow their example.

As the sun set over the Andes, Rio, Rocky, and Sierra stood together, their spirits intertwined like the mountains themselves. And so, their story lived on as a testament to the power of knowledge, wisdom and collaboration and the will to make a difference.

They were the super-Llamas and the trio was lovingly called LlaMA3!"


LlaMA 3 by Meta

And this story is not very far from the story of Meta’s open-source Large Language Model (LLM) – LlaMA 3 (Large Language Model Meta AI). On April 18, 2024, Meta released its LlaMA 3 family of Large Language Models in 8B and 70B parameter sizes, claiming a major leap over LlaMA 2 and vying for the state of the art among LLMs at that scale.

According to Meta, there were four key focus points while building LlaMA 3 – the model architecture, the pre-training data, scaling up pre-training, and instruction fine-tuning. This leads us to ponder what we can do to reap the most out of this very competent model – on an enterprise scale as well as at the grass-root level.

To help explore the answers to some of these questions, I collaborated with Eduardo Ordax, Generative AI Lead at AWS, and Prof. Tom Yeh, CS Professor at University of Colorado, Boulder.

So, let’s start the trek:

How can we leverage the power of LlaMA 3?

API vs Fine-Tuning

As per recent practices, there are two main ways by which these LLMs are being accessed and worked with – API and fine-tuning. Even with these two very diverse approaches, there are other factors in the process, as can be seen in the following images, that become crucial.

(All images in this section are courtesy of Eduardo Ordax.)

There are mainly six stages through which a user can interact with LlaMA 3.

Stage 1 : Cater to a broad-case usage by using the model as is.

Stage 2 : Use the model as per a user-defined application.

Stage 3 : Use prompt-engineering to train the model to produce the desired outputs.

Stage 4 : Use prompt-engineering on the user side along with delving a bit into data retrieval and fine-tuning which is still mostly managed by the LLM provider.

Stage 5 : Take most of the matters in your own hand (the user), starting from prompt-engineering to data retrieval and fine-tuning (RAG models, PEFT models and so on).

Stage 6 : Create the entire foundational model starting from scratch – pre-training to post-training.

To gain the most out of these models, it is suggested that the best approach is to enter Stage 5, because that is where most of the flexibility lies with the user. Being able to customize the model as per the domain need is crucial in order to maximize its gains. And for that, staying out of the underlying systems does not yield optimal returns.

To be able to do so, here is a high-level picture of the tools that could prove to be useful:

The picture shows that in order to get the highest benefit from the models, a set structure and a road map are essential. There are three components to it:

  1. People: Not just end-users, but the whole range of data engineers, data scientists, MLOps Engineers, ML Engineers along with Prompt Engineers are important.
  2. Process: Not just plugging in the LLM into an API but focusing on the entire lifecycle of model evaluation, model deployment and fine-tuning to cater to specific needs.
  3. Tools: Not just the API access and API tools but the entire range of environments, different ML pipelines, separate accounts for access and running checks.

Of course, this is true for an enterprise-level deployment such that the actual benefits of the model can be reaped. And to be able to do so, the tools and practices under MLOps become very important. Combined with FMOps, these models can prove to be very valuable and enrich the GenAI ecosystem.

FMOps ⊆ MLOps ⊆ DevOps

MLOps, also known as Machine Learning Operations, is a part of Machine Learning Engineering that focuses on the development, deployment, and maintenance of ML models, ensuring that they run reliably and efficiently.

MLOps falls under DevOps (Development and Operations) but is specific to ML models.

FMOps (Foundational Model Operations), on the other hand, works for Generative AI scenarios by selecting, evaluating and fine-tuning the LLMs.

With all of it being said, one thing however remains constant. And that is the fact that LlaMA 3 is after all an LLM, and its implementation at the enterprise level is possible and beneficial only after the foundational elements are set and validated with rigor. To be able to do so, let us explore the technical details behind LlaMA 3.

What is the secret sauce behind LlaMA 3’s claim to fame?

At the fundamental level, yes, it is the transformer. If we go a little higher up in the process, the answer would be the transformer architecture but highly optimized to achieve superior performance on the common industry benchmarks while also enabling newer capabilities.

Good news is that since LlaMa 3 is open (open-source at Meta’s discretion), we have access to the Model Card that gives us the details to how this powerful architecture is configured.

So, let’s dive in and unpack the goodness:

How does the transformer architecture coupled with self-attention play its role in LlaMA 3?

To start with, here is a quick review on how the transformer works:

  1. The transformer architecture can be perceived as a combination of the attention layer and the feed-forward layer.
  2. The attention layer combines across features horizontally to produce a new feature.
  3. The feed-forward layer (FFN) combines the parts or the characteristics of a feature to produce new parts/characteristics. It does it vertically across dimensions.

(All the images in this section, unless otherwise noted, are by Prof. Tom Yeh, which I have edited with his permission.)

Below is a basic form of how the architecture looks and how it functions.

The transformer architecture containing the attention and the feed-forward blocks.

Here are the links to the deep-dive articles for Transformers and Self-Attention where the entire process is discussed in detail.

The essentials of LlaMA 3

It’s time to get into the nitty-gritty and discover how the transformer numbers play out in the real-life LlaMa 3 model. For our discussion, we will only consider the 8B variant. Here we go:

– What are the LlaMA 3 – 8B model parameters?

The primary numbers/values that we need to explore here are for the parameters that play a key role in the transformer architecture. And they are as below:

  • Layers : Layers here refer to the basic blocks of the Transformers – the attention layer and the FFN as can be seen in the image above. The layers are stacked one above the other where the input flows into one layer and its output is passed on to the next layer, gradually transforming the input data.
  • Attention heads : Attention heads are part of the self-attention mechanism. Each head scans the input sequence independently and performs the attention steps (Remember: the QK-module, SoftMax function.)
  • Vocabulary words : The vocabulary refers to the number of words the model recognizes or knows. Essentially, think of it as humans’ way of building our word repertoire so that we develop knowledge and versatility in a language. Most times, the bigger the vocabulary, the better the model performance.
  • Feature dimensions : These dimensions specify the size of the vectors representing each token in the input data. This number remains consistent throughout the model from the input embedding to the output of each layer.
  • Hidden dimensions : These dimensions are the internal size of the layers within the model, most commonly the size of the hidden layers of the feed-forward layers. As is the norm, the size of these layers can be larger than the feature dimension, helping the model extract and process more complex representations from the data.
  • Context-window size : The ‘window-size’ here refers to the number of tokens from the input sequence that the model considers at once when calculating attention.

With the terms defined, let us refer to the actual numbers for these parameters in the LlaMA 3 model. (The original source code where these numbers are stated can be found here.)


Keeping these values in mind, the next steps illustrate how each of them play their part in the model. They are listed in their order of appearance in the source-code.

[1] The context-window

While instantiating the LlaMa class, the variable max_seq_len defines the context-window. There are other parameters in the class but this one serves our purpose in relation to the transformer model. The max_seq_len here is 8K, which implies the attention head is able to scan 8K tokens in one go.

[2] Vocabulary-size and Attention Layers

Next up is the Transformer class which defines the vocabulary size and the number of layers. Once again the vocabulary size here refers to the set of words (and tokens) that the model can recognize and process. Attention layers here refer to the transformer block (the combination of the attention and feed-forward layers) used in the model.

Based on these numbers, LlaMA 3 has a vocabulary size of 128K which is quite large. Additionally, it has 32 copies of the transformer block.

[3] Feature-dimension and Attention-Heads

The feature dimension and the attention-heads make their way into the Self-Attention module. Feature dimension refers to the vector-size of the tokens in the embedding space and the attention-heads consist of the QK-module that powers the self-attention mechanism in the transformers.

[4] Hidden Dimensions

The hidden dimension features in the Feed-Forward class, specifying the size of the hidden layer in the model. For LlaMA 3, as illustrated here, the hidden layer is roughly 1.3 times the size of the feature dimension. A larger hidden dimension allows the network to create and manipulate richer representations internally before projecting them back down to the smaller output dimension.

[5] Combining the above parameters to form the Transformer

  • The first matrix is the input feature matrix, which goes through the Attention layer to create the Attention Weighted features. In this image the input feature matrix is only a 5 x 3 matrix, but in the real-world LlaMA 3 model it grows up to 8K x 4096, which is enormous.
  • The next one is the hidden layer in the Feed-Forward Network, which grows up to 5325 dimensions and then comes back down to 4096 in the final layer.

[6] Multiple-layers of the Transformer block

LlaMA 3 combines 32 of these above transformer blocks with the output of one passing down into the next block until the last one is reached.

[7] Let’s put it all together

Once we have set all the above pieces in motion, it is time to put it all together and see how they produce the LlaMA effect.

So, what is happening here?

Step 1 : First we have our input matrix, which is the size of 8K (context-window) x 128K (vocabulary-size). This matrix undergoes the process of embedding which takes this high-dimensional matrix into a lower dimension.

Step 2 : This lower dimension in this case turns out to be 4096 which is the specified dimension of the features in the LlaMA model as we had seen before. (A reduction from 128K to 4096 is immense and noteworthy.)

Step 3: This feature goes through the Transformer block where it is processed first by the Attention layer and then the FFN layer. The attention layer processes it across features horizontally whereas the FFN layer does it vertically across dimensions.

Step 4: Step 3 is repeated for 32 layers of the Transformer block. In the end the resultant matrix has the same dimension as the one used for the feature dimension.

Step 5: Finally this matrix is transformed back to the original size of the vocabulary matrix which is 128K so that the model can choose and map those words as available in the vocabulary.

And that’s how LlaMA 3 is essentially scoring high on those benchmarks and creating the LlaMA 3 effect.
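As a rough sanity check of these shapes, here is a small Python sketch that simply traces the tensor sizes through the five steps above; it is illustrative only and not Meta’s actual code, and the rounded figures (128K vocabulary, 5325 hidden units) are taken from the discussion above:

```python
# Illustrative shape walk-through using the numbers quoted above; variable names are mine.
vocab_size  = 128_000   # ~128K vocabulary (rounded)
context_len = 8_192     # 8K context window (max_seq_len)
dim         = 4_096     # feature dimension
ffn_hidden  = 5_325     # hidden dimension quoted above (~1.3 x dim)
n_layers    = 32        # transformer blocks

x_shape = (context_len, dim)                  # Steps 1-2: token ids embedded to 8K x 4096
for _ in range(n_layers):                     # Steps 3-4: 32 transformer blocks
    attn_shape = (context_len, dim)           #   attention mixes horizontally across tokens
    ffn_shape  = (context_len, ffn_hidden)    #   FFN expands vertically, then projects back
    x_shape    = (context_len, dim)           #   each block's output keeps the feature size
logits_shape = (context_len, vocab_size)      # Step 5: project back onto the vocabulary
print(x_shape, logits_shape)
```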

The LlaMA 3 Effect

LlaMA 3 was released in two model versions – 8B and 70B parameters – to serve a wide range of use-cases. In addition to achieving state-of-the-art performance on standard benchmarks, a new and rigorous human-evaluation set was also developed. And Meta promises to release better and stronger versions of the model, with it becoming multilingual and multimodal. The news is that newer and larger models are coming soon with over 400B parameters (early reports here show that it is already crushing benchmarks with an almost 20% score increase over LlaMA 3).

However, it is imperative to say that in spite of all the upcoming changes and all the updates, one thing is going to remain the same – the foundation of it all – the transformer architecture and the transformer block that enables this incredible technical advancement.

It could be a coincidence that LlaMA models were named so, but based on legend from the Andes mountains, the real llamas have always been revered for their strength and wisdom. Not very different from the Gen AI – ‘LlaMA’ models.

So, let’s follow along in this exciting journey of the GenAI Andes while keeping in mind the foundation that powers these large language models!

P.S. If you would like to work through this exercise on your own, here is a link to a blank template for your use.

Blank Template for hand-exercise

Now go have fun and create some LlaMA 3 effect!

Image by author

Deep Dive into Self-Attention by Hand✍︎

Explore the intricacies of the attention mechanism responsible for fueling the transformers

Attention! Attention!

Because ‘Attention is All You Need’.

No, I am not saying that, the Transformer is.

Image by author (Robtimus Prime seeking attention. As per my son, bright rainbow colors work better for attention and hence the color scheme.)

As of today, the world has been swept over by the power of Transformers. Not the likes of ‘Robtimus Prime’ but the ones that constitute neural networks. And that power is because of the concept of ‘attention‘. So, what does attention in the context of transformers really mean? Let’s try to find out some answers here:

First and foremost:

What are transformers?

Transformers are neural networks that specialize in learning context from the data. Quite similar to us trying to find the meaning of ‘attention and context’ in terms of transformers.

How do transformers learn context from the data?

By using the attention mechanism.

What is the attention mechanism?

The attention mechanism helps the model scan all parts of a sequence at each step and determine which elements need to be focused on. It was proposed as an alternative to the ‘strict/hard’ solution of fixed-length vectors in the encoder-decoder architecture, providing a ‘soft’ solution that focuses only on the relevant parts.

What is self-attention?

The attention mechanism worked to improve the performance of Recurrent Neural Networks (RNNs), with the effect seeping into Convolutional Neural Networks (CNNs). However, with the introduction of the transformer architecture in 2017, the need for RNNs and CNNs was quietly obliterated. And the central reason for it was the self-attention mechanism.

The self-attention mechanism was special in the sense that it was built to inculcate the context of the input sequence in order to enhance the attention mechanism. This idea became transformational as it was able to capture the complex nuances of a language.

As an example:

When I ask my 4-year old what transformers are, his answer only contains the words robots and cars. Because that is the only context he has. But for me, transformers also mean neural networks as this second context is available to the slightly more experienced mind of mine. And that is how different contexts provide different solutions and so tend to be very important.

The word ‘self’ refers to the fact that the attention mechanism examines the same input sequence that it is processing.

There are many variations of how self-attention is performed. But the scaled dot-product mechanism has been one of the most popular ones. This was the one introduced in the original transformer architecture paper in 2017 – "Attention is All You Need".

Where and how does self-attention feature in transformers?

I like to see the transformer architecture as a combination of two shells – the outer shell and the inner shell.

  1. The outer shell is a combination of the attention-weighting mechanism and the feed forward layer about which I talk in detail in this article.
  2. The inner shell consists of the self-attention mechanism and is part of the attention-weighting feature.

So, without further delay, let us dive into the details behind the self-attention mechanism and unravel the workings behind it. The Query-Key module and the SoftMax function play a crucial role in this technique.

This discussion is based on Prof. Tom Yeh’s wonderful AI by Hand Series on Self-Attention. (All the images below, unless otherwise noted, are by Prof. Tom Yeh from the above-mentioned LinkedIn post, which I have edited with his permission.)

So here we go:

Self-Attention

To build some context here, here is a pointer to how we process the ‘Attention-Weighting’ in the transformer outer shell.

Attention weight matrix (A)

The attention weight matrix A is obtained by feeding the input features into the Query-Key (QK) module. This matrix tries to find the most relevant parts in the input sequence. Self-Attention comes into play while creating the Attention weight matrix A using the QK-module.

How does the QK-module work?

Let us look at the different components of Self-Attention: Query (Q), Key (K) and Value (V).

I love using the spotlight analogy here as it helps me visualize the model throwing light on each element of the sequence and trying to find the most relevant parts. Taking this analogy a bit further, let us use it to understand the different components of Self-Attention.

Imagine a big stage getting ready for the world’s largest Macbeth production. The audience outside is teeming with excitement.

  • The lead actor walks onto the stage, the spotlight shines on him and he asks in his booming voice "Should I seize the crown?". The audience whispers in hushed tones and wonders which path this question will lead to. Thus, Macbeth himself represents the role of Query (Q) as he asks pivotal questions and drives the story forward.
  • Based on Macbeth’s query, the spotlight shifts to other crucial characters that hold information to the answer. The influence of other crucial characters in the story, like Lady Macbeth, triggers Macbeth’s own ambitions and actions. These other characters can be seen as the Key (K) as they unravel different facets of the story based on the particulars they know.
  • Finally, the extended characters – family, friends, supporter, naysayers provide enough motivation and information to Macbeth by their actions and perspectives. These can be seen as Value (V). The Value (V) pushes Macbeth towards his decisions and shapes the fate of the story.

And with that is created one of the world’s finest performances, that remains etched in the minds of the awestruck audience for the years to come.

Now that we have witnessed the role of Q, K, V in the fantastical world of performing arts, let’s return to planet matrices and learn the mathematical nitty-gritty behind the QK-module. This is the roadmap that we will follow:

Roadmap for the Self-Attention mechanism

And so the process begins.

We are given:

A set of 4 feature vectors (dimension 6)

Our goal :

Transform the given features into Attention Weighted Features.

[1] Create Query, Key, Value Matrices

To do so, we multiply the features with linear transformation matrices W_Q, W_K, and W_V, to obtain query vectors (q1,q2,q3,q4), key vectors (k1,k2,k3,k4), and value vectors (v1,v2,v3,v4) respectively as shown below:

To get Q, multiply W_Q with X:

To get K, multiply W_K with X:

Similarly, to get V, multiply W_V with X.

To be noted:

  1. As can be seen from the calculation above, we use the same set of features for both queries and keys. And that is how the idea of "self" comes into play here, i.e. the model uses the same set of features to create its query vector as well as the key vector.
  2. The query vector represents the current word (or token) for which we want to compute attention scores relative to other words in the sequence.
  3. The key vector represents the other words (or tokens) in the input sequence and we compute the attention score for each of them with respect to the current word.

[2] Matrix Multiplication

The next step is to multiply the transpose of K with Q i.e. K^T . Q.

The idea here is to calculate the dot product between every pair of query and key vectors. Calculating the dot product gives us an estimate of the matching score between every "key-query" pair, by using the idea of Cosine Similarity between the two vectors. This is the ‘dot-product’ part of the scaled dot-product attention.

Cosine-Similarity

Cosine similarity is the cosine of the angle between the vectors; that is, it is the dot product of the vectors divided by the product of their lengths. It roughly measures if two vectors are pointing in the same direction thus implying the two vectors are similar.

Remember cos(0°) = 1, cos(90°) = 0 , cos(180°) =-1

  • If the dot product between the two vectors is approximately 1, it implies we are looking at an almost zero angle between the two vectors meaning they are very close to each other.
  • If the dot product between the two vectors is approximately 0, it implies we are looking at vectors that are orthogonal to each other and not very similar.
  • If the dot product between the two vectors is approximately -1, it implies we are looking at an almost an 180° angle between the two vectors meaning they are opposites.
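A quick NumPy illustration of these three cases:

```python
import numpy as np

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (||a|| * ||b||)"""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))   # ~1: same direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 3.0])))   # ~0: orthogonal
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-4.0, 0.0])))  # ~-1: opposite
```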

[3] Scale

The next step is to scale/normalize each element by the square root of the dimension d_k. In our case the number is 3. Scaling down helps to keep the impact of the dimension on the matching score in check.

How does it do so? As per the original Transformer paper and going back to Probability 101, if the components of two vectors q and k are independent and identically distributed (i.i.d.) variables with mean 0 and variance 1, then their dot product over dimension d_k has mean 0 but variance d_k.

Now imagine how the matching score would look if our dimension is increased to 32, 64, 128 or even 4960 for that matter. The larger dimension would make the variance higher and push the values into regions ‘unknown’.

To keep the calculation simple here, since sqrt [3] is approximately 1.73205, we replace it with [ floor(□/2) ].

Floor Function

The floor function takes a real number as an argument and returns the largest integer less than or equal to that real number.

Eg : floor(1.5) = 1, floor(2.9) = 2, floor (2.01) = 2, floor(0.5) = 0.

The opposite of the floor function is the ceiling function.

This is the ‘scaled’ part of the scaled dot-product attention.

[4] Softmax

There are three parts to this step:

  1. Raise e to the power of the number in each cell (To make things easy, we use 3 to the power of the number in each cell.)
  2. Sum these new values across each column.
  3. For each column, divide each element by its respective sum (Normalize). The purpose of normalizing each column is to have numbers sum up to 1. In other words, each column then becomes a probability distribution of attention, which gives us our Attention Weight Matrix (A).

This Attention Weight Matrix is what we had obtained after passing our feature matrix through the QK-module in Step 2 in the Transformers section.

(Remark: The first column in the Attention Weight Matrix has a typo as the current elements don’t add up to 1. Please double-check. We are allowed these errors because we are human.)

The Softmax step is important as it assigns probabilities to the scores obtained in the previous steps and thus helps the model decide how much importance (higher/lower attention weights) needs to be given to each word given the current query. As is to be expected, higher attention weights signify greater relevance, allowing the model to capture dependencies more accurately.

Once again, the scaling in the previous step becomes important here. Without the scaling, the values of the resultant matrix get pushed into regions that the Softmax function does not process well and may result in vanishing gradients.
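Here is a minimal sketch of the column-wise Softmax described above. It defaults to base e, with base 3 available as the hand-friendly substitute used in the exercise; the score values are placeholders.

```python
import numpy as np

def softmax_columns(scores: np.ndarray, base: float = np.e) -> np.ndarray:
    # raise the base to the power of each cell, then normalize each column to sum to 1
    powered = base ** scores
    return powered / powered.sum(axis=0, keepdims=True)

scores = np.array([[1.0, 0.0],
                   [2.0, 1.0],
                   [0.0, 3.0]])      # hypothetical scaled matching scores

A = softmax_columns(scores)          # attention weights, one probability distribution per column
print(A.sum(axis=0))                 # each column sums to 1
```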

[5] Matrix Multiplication

Finally we multiply the value vectors (Vs) with the Attention Weight Matrix (A). These value vectors are important as they contain the information associated with each word in the sequence.

And the result of the final multiplication in this step is the set of attention-weighted features Z, the final output of the self-attention mechanism. These attention-weighted features essentially contain a weighted representation of the input features, assigning higher weights to features with higher relevance in the given context.
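Putting the pieces of this section together, here is a compact sketch of scaled dot-product self-attention. Only the shapes mirror the walkthrough (3-dimensional features for 5 tokens); the feature values and the value projection are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 3, 5                                        # feature dimension and number of tokens
X = rng.integers(0, 3, size=(d, n)).astype(float)  # hypothetical input features (d x n)

# "Self"-attention: queries and keys come from the same features.
Q, K = X, X
V = rng.standard_normal((d, d)) @ X                # value vectors (random projection for illustration)

scores = K.T @ Q / np.sqrt(d)                      # [2]-[3] dot products, then scale by sqrt(d_k)
A = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)  # [4] column-wise softmax
Z = V @ A                                          # [5] attention-weighted features (d x n)

print(Z.shape)                                     # (3, 5)
```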

Now with this information available, we continue to the next step in the transformer architecture where the feed-forward layer processes this information further.

And this brings us to the end of the brilliant self-attention technique!

Reviewing all the key points based on the ideas we talked about above:

  1. The attention mechanism grew out of an effort to improve the performance of RNNs by addressing the fixed-length vector bottleneck in the encoder-decoder architecture. The flexibility of moving beyond a fixed-length representation, with a focus on the relevant parts of a sequence, was the core strength behind attention.
  2. Self-attention was introduced as a way to incorporate the idea of context into the model. The self-attention mechanism evaluates the same input sequence that it processes, hence the use of the word ‘self’.
  3. There are many variants of the self-attention mechanism, and efforts are ongoing to make it more efficient. However, scaled dot-product attention is one of the most popular, and a crucial reason why the transformer architecture proved so powerful.
  4. The scaled dot-product self-attention mechanism comprises the Query-Key module (QK-module) along with the Softmax function. The QK-module extracts the relevance of each element of the input sequence by calculating attention scores, and the Softmax function complements it by turning those scores into probabilities.
  5. Once the attention scores are calculated, they are multiplied with the value vectors to obtain the attention-weighted features, which are then passed on to the feed-forward layer.

Multi-Head Attention

To cater to a varied and overall representation of the sequence, multiple copies of the self-attention mechanism are implemented in parallel which are then concatenated to produce the final attention-weighted values. This is called the Multi-Head Attention.
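As a rough sketch, multiple heads can be run in parallel and their outputs concatenated, as below. The head count, head size and all projection matrices are arbitrary choices for illustration, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(1)

def attention_head(X: np.ndarray, d_head: int) -> np.ndarray:
    d, n = X.shape
    W_q = rng.standard_normal((d_head, d))
    W_k = rng.standard_normal((d_head, d))
    W_v = rng.standard_normal((d_head, d))
    Q, K, V = W_q @ X, W_k @ X, W_v @ X
    scores = K.T @ Q / np.sqrt(d_head)
    A = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return V @ A                                   # one head's attention-weighted features (d_head x n)

X = rng.standard_normal((6, 5))                    # hypothetical features: 6 dimensions, 5 tokens
heads = [attention_head(X, d_head=3) for _ in range(2)]  # two heads run in parallel
Z = np.concatenate(heads, axis=0)                  # concatenate head outputs -> (6, 5)
print(Z.shape)
```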

Transformer in a Nutshell

This is how the inner-shell of the transformer architecture works. And bringing it together with the outer shell, here is a summary of the Transformer mechanism:

  1. The two big ideas in the Transformer architecture here are attention-weighting and the feed-forward layer (FFN). Both of them combined together allow the Transformer to analyze the input sequence from two directions. Attention looks at the sequence based on positions and the FFN does it based on the dimensions of the feature matrix.
  2. The part that powers the attention mechanism is the scaled dot-product Attention which consists of the QK-module and outputs the attention weighted features.

‘Attention Is really All You Need’

Transformers have been here for only a few years and the field of AI has already seen tremendous progress based on them. And the effort is still ongoing. When the authors used that title for their paper, they were not kidding.

It is interesting to see once again how a fundamental idea – the ‘dot product’ coupled with certain embellishments can turn out to be so powerful!

Image by author

P.S. If you would like to work through this exercise on your own, here are the blank templates for you to use.

Blank Template for hand-exercise

Now go have some fun with the exercise while paying attention to your Robtimus Prime!


Related Work:

Here are the other articles in the Deep Dive by Hand Series:

References:

  1. Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017).
  2. Bahdanau, Dzmitry, Kyunghyun Cho and Yoshua Bengio. "Neural Machine Translation by Jointly Learning to Align and Translate." CoRR abs/1409.0473 (2014).

The post Deep Dive into Self-Attention by Hand✍︎ appeared first on Towards Data Science.

]]>
Deep Dive into Transformers by Hand ✍︎ https://towardsdatascience.com/deep-dive-into-transformers-by-hand-%ef%b8%8e-68b8be4bd813/ Fri, 12 Apr 2024 05:18:36 +0000 https://towardsdatascience.com/deep-dive-into-transformers-by-hand-%ef%b8%8e-68b8be4bd813/ Explore the details behind the power of transformers

The post Deep Dive into Transformers by Hand ✍︎ appeared first on Towards Data Science.

]]>
There has been a new development in our neighborhood.

A ‘Robo-Truck,’ as my son likes to call it, has made its new home on our street.

It is a Tesla Cyber Truck and I have tried to explain that name to my son many times but he insists on calling it Robo-Truck. Now every time I look at Robo-Truck and hear that name, it reminds me of the movie Transformers where robots could transform to and from cars.

And isn’t it strange that Transformers as we know them today could very well be on their way to powering these Robo-Trucks? It’s almost a full circle moment. But where am I going with all these?

Well, I am heading to the destination – Transformers. Not the robot car ones but the neural network ones. And you are invited!

Image by author (Our Transformer – ‘Robtimus Prime’. Colors as mandated by my artist son.)

What are Transformers?

Transformers are essentially neural networks. Neural networks that specialize in learning context from the data.

But what makes them special is the presence of mechanisms that eliminate the need for convolution or recurrence in the network and reduce the reliance on labeled datasets.

What are these special mechanisms?

There are many. But the two mechanisms that are truly the force behind the transformers are attention weighting and feed-forward networks (FFN).

What is attention-weighting?

Attention-weighting is a technique by which the model learns which part of the incoming sequence needs to be focused on. Think of it as the ‘Eye of Sauron’ scanning everything at all times and throwing light on the parts that are relevant.

Fun-fact: Apparently, the researchers had almost named the Transformer model ‘Attention-Net’, given Attention is such a crucial part of it.

What is FFN?

In the context of transformers, FFN is essentially a regular multilayer perceptron acting on a batch of independent data vectors. Combined with attention, it produces the correct ‘position-dimension’ combination.

How do Attention and FFN work?

So, without further ado, let’s dive into how attention-weighting and FFN make transformers so powerful.

This discussion is based on Prof. Tom Yeh’s wonderful AI by Hand Series on Transformers . (All the images below, unless otherwise noted, are by Prof. Tom Yeh from the above-mentioned LinkedIn posts, which I have edited with his permission.)

So here we go:

The key ideas here : attention weighting and feed-forward network (FFN).

Keeping those in mind, suppose we are given:

  • 5 input features from a previous block (a 3×5 matrix here, where X1, X2, X3, X4 and X5 are the features and each of the three rows denotes one of their characteristics.)

[1] Obtain attention weight matrix A

The first step in the process is to obtain the attention weight matrix A. This is the part where the self-attention mechanism comes to play. What it is trying to do is find the most relevant parts in this input sequence.

We do it by feeding the input features into the query-key (QK) module. For simplicity, the details of the QK module are not included here.

[2] Attention Weighting

Once we have the attention weight matrix A (5×5), we multiply the input features (3×5) with it to obtain the attention-weighted features Z.

The important part here is that the features are combined based on their positions P1, P2 and P3, i.e. horizontally.

To break it down further, consider this calculation performed row-wise:

P1 × A1 → Z[1,1] = 11

P1 × A2 → Z[1,2] = 6

P1 × A3 → Z[1,3] = 7

P1 × A4 → Z[1,4] = 7

P1 × A5 → Z[1,5] = 5

...

P2 × A4 → Z[2,4] = 3

P3 × A5 → Z[3,5] = 1

As an example:

It seems a little tedious in the beginning but follow the multiplication row-wise and the result should be pretty straight-forward.

The cool thing is that, given the way our attention-weight matrix A is arranged, the new features Z turn out to be the following combinations of X:

Z1 = X1 + X2

Z2 = X2 + X3

Z3 = X3 + X4

Z4 = X4 + X5

Z5 = X5 + X1

(Hint : Look at the positions of 0s and 1s in matrix A).
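Here is a small sketch of this attention-weighting step, with a 0/1 attention matrix arranged exactly so that each new feature is the sum of two neighbouring ones, as listed above. The X values themselves are made up.

```python
import numpy as np

X = np.arange(15, dtype=float).reshape(3, 5)   # hypothetical 3x5 feature matrix (X1..X5 as columns)

# 5x5 attention weight matrix: column j has 1s in rows j and j+1 (wrapping around),
# so that Z_j = X_j + X_{j+1} and Z_5 = X_5 + X_1, as in the walkthrough.
A = np.zeros((5, 5))
for j in range(5):
    A[j, j] = 1
    A[(j + 1) % 5, j] = 1

Z = X @ A                                      # attention-weighted features, combined across positions
print(Z[:, 0], X[:, 0] + X[:, 1])              # Z1 equals X1 + X2
```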

[3] FFN : First Layer

The next step is to feed the attention-weighted features into the feed-forward neural network.

However, the difference here lies in combining the values across dimensions as opposed to positions in the previous step. It is done as below:

What this does is that it looks at the data from the other direction.

– In the attention step, we combined our input on the basis of the original features to obtain new features.

– In this FFN step, we consider their characteristics i.e. combine features vertically to obtain our new matrix.

Eg: P1(1,1) × Z1(1,1) + P2(1,2) × Z1(2,1) + P3(1,3) × Z1(3,1) + b(1) = 11, where b is the bias.

Once again, element-wise row operations to the rescue. Notice that the number of dimensions of the new matrix is increased to 4 here.

[4] ReLU

Our favorite step: ReLU, where the negative values obtained in the previous matrix are returned as zero and the positive values remain unchanged.

[5] FFN : Second Layer

Finally we pass it through the second layer where the dimensionality of the resultant matrix is reduced from 4 back to 3.

The output here is ready to be fed to the next block (see its similarity to the original matrix) and the entire process is repeated from the beginning.

The two key things to remember here are:

  1. The attention layer combines across positions (horizontally).
  2. The feed-forward layer combines across dimensions (vertically).

And this is the secret sauce behind the power of the transformers – the ability to analyze data from different directions.
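Here is a compact sketch of that two-direction view: attention combines across positions (columns), then a two-layer FFN with ReLU combines across dimensions (rows). The matrices are random stand-ins, not the numbers from the hand exercise.

```python
import numpy as np

rng = np.random.default_rng(2)

d, n, d_ff = 3, 5, 4
X = rng.standard_normal((d, n))               # input features: d dimensions x n positions

# Attention step: combine across positions (horizontally).
scores = X.T @ X / np.sqrt(d)
A = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
Z = X @ A                                     # (d, n)

# FFN step: combine across dimensions (vertically), expand to d_ff then project back to d.
W1, b1 = rng.standard_normal((d_ff, d)), np.zeros((d_ff, 1))
W2, b2 = rng.standard_normal((d, d_ff)), np.zeros((d, 1))
H = np.maximum(0, W1 @ Z + b1)                # first layer + ReLU -> (d_ff, n)
out = W2 @ H + b2                             # second layer -> back to (d, n), ready for the next block
print(out.shape)
```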

To summarize the ideas above, here are the key points:

  1. The transformer architecture can be perceived as a combination of the attention layer and the feed-forward layer.
  2. The attention layer combines the features to produce a new feature. E.g. think of combining two robots Robo-Truck and Optimus Prime to get a new robot Robtimus Prime.
  3. The feed-forward (FFN) layer combines the parts or the characteristics of a feature to produce new parts/characteristics. E.g. wheels of Robo-Truck and Ion-laser of Optimus Prime could produce a wheeled-laser.

The ever powerful Transformers

Neural networks have existed for quite some time now. Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) had been reigning supreme but things took quite an eventful turn once Transformers were introduced in the year 2017. And since then, the field of AI has grown at an exponential rate – with new models, new benchmarks, new learnings coming in every single day. And only time will tell if this phenomenal idea will one day lead the way for something even bigger – a real ‘Transformer’.

But for now it would not be wrong to say that an idea can really transform how we live!

P.S. If you would like to work through this exercise on your own, here is the blank template for your use.

Blank Template for hand-exercise

Now go have some fun and create your own Robtimus Prime!

The post Deep Dive into Transformers by Hand ✍︎ appeared first on Towards Data Science.

]]>
Deep Dive into Sora’s Diffusion Transformer (DiT) by Hand ✍︎ https://towardsdatascience.com/deep-dive-into-soras-diffusion-transformer-dit-by-hand-%ef%b8%8e-1e4d84ec865d/ Tue, 02 Apr 2024 13:58:00 +0000 https://towardsdatascience.com/deep-dive-into-soras-diffusion-transformer-dit-by-hand-%ef%b8%8e-1e4d84ec865d/ Explore the secret behind Sora's state-of-the-art videos

The post Deep Dive into Sora’s Diffusion Transformer (DiT) by Hand ✍︎ appeared first on Towards Data Science.

]]>
"In the ancient land of DiTharos, there once lived a legend, called Sora. A legend that embodied the very essence of unlimited potential, encompassing the vastness and the magnificence of the skies.

When it soared high with its iridescent wings spanning vast expanses and light bouncing off its striking body, one could hear the words ‘Sora IS Sky’ echoing into the heavens. What made it a legend was not just its epic enormity but its power to harness the elements of light scattered in the swirling clouds. With its mighty strength, the magic that Sora created with a single twirl, was a sight to behold!

They say, Sora lives on, honing its skills and getting stronger with each passing day, ready to fly when the hour is golden. When you see a splash of crimson red in the sky today, you would know it’s a speck of the legend flying into the realms of light!"


This is a story I told my son about a mythical dragon that lived in a far away land. We called it ‘The Legend of Sora’. He really enjoyed it because Sora is big and strong, and illuminated the sky. Now of course, he doesn’t understand the idea of transformers and diffusion yet, he’s only four, but he does understand the idea of a magnanimous dragon that uses the power of light and rules over DiTharos.

Image by author (The powerful Sora by my son – the color choices and the bold strokes are all his work.)

Sora by Open AI

And that story very closely resembles how our world’s Sora, Open AI’s text-to-video model, emerged in the realm of AI and has taken the world by storm. In principle, Sora builds on the diffusion transformer (DiT) architecture developed by William Peebles and Saining Xie in 2023.

In other words, it uses the idea of diffusion for predicting the videos and the strength of transformers for next-level scaling. To understand this further, let’s try to find the answer to these two questions:

  • What does Sora do when given a prompt to work on?
  • How does it combine the diffusion-transformer ideas?

Talking about the videos made by Sora, here is my favorite one of an adorable Dalmatian in the streets of Italy. How natural is its movement!

The prompt used for the video : "The camera directly faces colorful buildings in Burano Italy. An adorable dalmation looks through a window on a building on the ground floor. Many people are walking and cycling along the canal streets in front of the buildings."

How did Sora do this?

Without any further ado, let’s dive into the details and look at how Sora creates these super-realistic videos based on text-prompts.

How does Sora work?

Thanks once again to Prof. Tom Yeh’s wonderful AI by Hand Series, we have this great piece on Sora for our discussion. (All the images below, unless otherwise noted, are by Prof. Tom Yeh from the above-mentioned LinkedIn post, which I have edited with his permission.)

So, here we go:

Our goal – Generate a video based on a text-prompt.

We are given:

  • Training video
  • Text-prompt
  • Diffusion step t = 3

For our example, can you guess what our text-prompt is going to be? You are right. It is "Sora is sky". A diffusion step of t = 3 means noise is added over three steps, but for illustration we will stick to a single step in this example.

What is diffusion?

Diffusion mainly refers to the phenomenon of scattering of particles – think how we enjoy the soft sun rays making a peek from behind the clouds. This soft glow can be attributed to the scattering of sunlight as it passes through the cloud layer causing the rays to spread out in different directions.

The random motion of the particles drives this diffusion. And that is exactly what happens for diffusion models used in image generation. Random noise is added to the image causing the elements in the image to deviate from the original and thus making way for creating more refined images.

As we talk about diffusion in regards to image-models, the key idea to remember is ‘noise’.

The process begins here:

[1] Convert video into patches

When working with text-generation, the models break down the large corpus into small pieces called tokens and use these tokens for all the calculations. Similarly, Sora breaks down the video into smaller elements called visual patches to make the work simpler.

Since we are talking about a video, we are talking about images in multiple frames. In our example, we have four frames. Each of the four frames or matrices contains the pixels that create the image.

The first step here is to convert this training video into 4 spacetime patches as below:

[2] Reduce the dimension of these visual patches : Encoder

Next, dimension reduction. The idea of dimension reduction has existed for over a century now (Trivia : Principal Component Analysis, also known as PCA was introduced by Karl Pearson in 1901), but its significance hasn’t faded over time.

And Sora uses it too!

When we talk about Neural Networks, one of the fundamental ideas for dimension reduction is the encoder. Encoder, by its design, transforms high-dimensional data into lower-dimension by focusing on capturing the most relevant features of the data. Win-win on both sides: it increases the efficiency and speed of the computations while the algorithm gets useful data to work on.

Sora uses the same idea for converting the high-dimensional pixels into a lower-dimensional latent space. To do so, we multiply the patches by a weight matrix, add a bias, and then apply ReLU.

Note:

Linear transformation : The input embedding vector is multiplied by the weight matrix W and then added with the bias vector b,

z = Wx+b, where W is the weight matrix, x is our word embedding and b is the bias vector.

ReLU activation function : Next, we apply the ReLU to this intermediate z.

ReLU returns the element-wise maximum of the input and zero. Mathematically, h = max{0,z}.

  • The weight matrix here is a 2×4 matrix [ [1, 0, -1, 0], [0, 1, 0, 1] ] with the bias being [0,1].
  • The patches matrix here is 4×4.

Multiplying the patches by the weight matrix W, adding the bias b and applying ReLU gives us a latent space which is only a 2×4 matrix. Thus, by using the visual encoder, the dimension of the ‘model’ is reduced from 4 (2x2x1) to 2 (2×1).

In the original DiT paper, this reduction is from 196,608 (256x256x3) to 4096 (32x32x4), which is huge. Imagine working with 196,608 pixels against working with 4096 – a 48 times reduction!
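Going back to the toy example, here is a small sketch of this visual-encoder step, using the 2×4 weight matrix and bias mentioned above but made-up patch values.

```python
import numpy as np

W = np.array([[1, 0, -1, 0],
              [0, 1, 0, 1]], dtype=float)        # 2x4 weight matrix from the walkthrough
b = np.array([[0], [1]], dtype=float)            # bias [0, 1]

patches = np.array([[1, 0, 2, 1],
                    [0, 1, 1, 0],
                    [2, 1, 0, 1],
                    [1, 0, 1, 2]], dtype=float)  # hypothetical 4x4 spacetime patches

latent = np.maximum(0, W @ patches + b)          # linear transform + ReLU -> 2x4 latent
print(latent.shape)                              # model dimension reduced from 4 to 2
```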

Right after this dimension reduction, we have one of the most significant steps in the entire process – diffusion.

[3] Diffuse the model with noise

To introduce diffusion, we add sampled noise to the obtained latent features in the previous step to find the Noised Latent. The goal here is to ask the model to detect what the noise is.

This is in essence the idea of diffusion for image generation.

By adding noise to the image, the model is asked to guess what the noise is and what it looks like. In return, the model can generate a completely new image based on what it guessed and learnt from the noisy image.

It can also be seen relative to deleting a word from the language model and asking it to guess what the deleted word was.
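Here is a minimal sketch of this diffusion step, with Gaussian noise standing in for the sampled noise of the exercise and a made-up latent.

```python
import numpy as np

rng = np.random.default_rng(3)

latent = np.array([[1.0, 0.0, 2.0, 1.0],
                   [0.0, 1.0, 1.0, 2.0]])    # hypothetical 2x4 latent from the encoder

noise = rng.standard_normal(latent.shape)    # sampled noise (the "ground truth" the model must detect)
noised_latent = latent + noise               # diffuse: the model will later be asked to predict `noise`
print(noised_latent.shape)
```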

Now that the training video has been reduced and diffused with noise, the next steps are to make use of the text-prompt to get a video as advocated by the prompt. We do this by conditioning with the adaptive norm layer.

[4]-[6] Conditioning by Adaptive Norm Layer

What ‘conditioning’ essentially means is that we try to influence the behavior of the model using the additional information we have available. For example: since our prompt is ‘Sora is sky’, we would like the model to focus on elements such as the sky or clouds rather than attaching importance to other concepts like a hat or a plant. Thus, an adaptive norm layer massages the data in the network, or in better terms, dynamically scales and shifts it based on the input it receives.

What is scale and shift?

Scale occurs when we multiply. For example, we may start with a variable A. If we multiply it by 2, we get 2*A, which amplifies or scales the value of A up by a factor of 2. If we multiply it by ½, the value is scaled down by 0.5.

Shift is denoted by addition. For example, we may be walking on the number line. We start at 1 and are asked to shift to 5. We can either add 4 and get 1+4 = 5, or we could add a hundred 0.04s to get there, 1+(100*0.04) = 5. It all depends on whether we want to take bigger steps (4) or smaller steps (0.04) to reach our goal.

[4] Encode Conditions

To make use of the conditions, in our case the information we have for building the model, first we translate it into a form the model understands, i.e., vectors.

  • The first step in the process is to translate the prompt into a text embedding vector.
  • The next step is to translate step t = 3 into a binary vector.
  • The third step is to concatenate these vectors together.

[5] Estimate Scale/Shift

Remember that here we use an ‘adaptive’ layer norm which implies that it adapts its values based on what the current conditions of the model are. Thus, to capture the correct essence of the data, we need to include the importance of each element in the data. And it is done by estimating the scale and shift.

For estimating these values for our model, we multiply the concatenated vector of prompt and diffusion step with the weight and add the bias to it. These weights and biases are learnable parameters which the model learns and updates.

(Remark: The third element in the resultant vector, according to me, should be 1. It could be a small error in the original post but as humans we are allowed a bit of it, aren’t we? To maintain uniformity, I continue here with the values from the original post.)

The goal here is to estimate the scale [2,-1] and the shift [-1,5] (since our model size is 2, we have two scale and two shift parameters). We keep them under ‘X’ and ‘+’ respectively.

[6] Apply Scale/Shift

To apply the scale and shift obtained in the previous step, we multiply the noised latent in Step 3 by [2, -1] and shift it by adding [-1,5].

The result is the ‘conditioned’ noise latent.
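Here is a small sketch of applying the conditioning: the scale [2, -1] and shift [-1, 5] match the walkthrough, while the noised-latent values are placeholders (in the full exercise the scale and shift themselves come out of a learnable linear layer).

```python
import numpy as np

noised_latent = np.array([[1.0, -0.5, 2.0, 0.0],
                          [0.5, 1.0, -1.0, 2.0]])  # hypothetical 2x4 noised latent

# [5] In the exercise these are estimated from (weights @ condition_vector + bias);
# here we plug in the estimated values directly.
scale = np.array([[2.0], [-1.0]])
shift = np.array([[-1.0], [5.0]])

# [6] adaptive-layer-norm-style conditioning: scale and shift each dimension of the latent
conditioned = scale * noised_latent + shift
print(conditioned)
```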

[7]-[9] Transformer

The last three steps consist of adding the transformer element to the above diffusion and conditioning steps. This step helps us find the noise as predicted by the model.

[7] Self-Attention

This is the critical idea behind Transformers that make them so phenomenal!

What is self-attention?

It is a mechanism by which each word in a sentence analyzes every other word and measures how important they are to each other, making sense of the context and relationships in the text.

To enable self-attention, the conditioned noise latent is fed into the Query-Key function to obtain a self-attention matrix. The QK-values are omitted here for simplicity.

[8] Attention Pooling

Next, we multiply the conditioned noised latent with the self-attention matrix to obtain the attention weighted features.

[9] Point-wise Feed Forward Network

Once again returning to the basics, we multiply the attention-weighted features with weights and biases to obtain the predicted noise.
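Here is a rough sketch of steps [7]-[9]. Since the walkthrough omits the QK values and the feed-forward weights, all the projection matrices below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(4)

conditioned = rng.standard_normal((2, 4))      # hypothetical conditioned noised latent (2 dims x 4 patches)

# [7] self-attention matrix from a query-key function (random projections here)
W_q, W_k = rng.standard_normal((2, 2)), rng.standard_normal((2, 2))
scores = (W_k @ conditioned).T @ (W_q @ conditioned) / np.sqrt(2)
attn = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)

# [8] attention pooling: weight the conditioned latent by the attention matrix
pooled = conditioned @ attn                    # (2, 4)

# [9] point-wise feed-forward network -> predicted noise, same shape as the latent
W, b = rng.standard_normal((2, 2)), np.zeros((2, 1))
predicted_noise = W @ pooled + b
print(predicted_noise.shape)                   # (2, 4)
```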

Training

The last bit now is to train the model using Mean Square Error between the predicted noise and the sampled noise (ground truth).

[10] Calculate the MSE loss gradients and update learnable parameters

Using the MSE loss gradients, we use backpropagation to update all the parameters that are learnable (e.g., the weights and biases in the adaptive norm layer).

The encoder and decoder parameters are frozen and not learnable.

(Remark: The second element in the second row should be -1, a tiny error which makes things better).

[11]-[13] Generate New Samples

[11] Denoise

Now that we are ready to generate new videos (yay!), we first need to remove the noise we had introduced. To do so, we subtract the predicted noise from the noise-latent to obtain noise-free latent.

Mind you, this is not the same as our original latent. The reason is that we went through multiple conditioning and attention steps in between, which baked the context of our problem into the model, giving it a better feel for what its target should be while generating the video.

[12] Convert the latent space back to the pixels : Decoder

Just like we did for the encoder, we multiply the latent space patches by weights and biases, followed by ReLU. We can observe here that after the work of the decoder, the model is back to its original dimension of 4, which had been lowered to 2 when we used the encoder.
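A small sketch of the denoise-and-decode steps [11]-[12], assuming a decoder weight matrix that maps the 2-dimensional latent back up to 4 dimensions; all the values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(5)

noised_latent = rng.standard_normal((2, 4))
predicted_noise = rng.standard_normal((2, 4))

# [11] denoise: subtract the predicted noise from the noised latent
denoised_latent = noised_latent - predicted_noise

# [12] decode: project the 2-dimensional latent back up to 4 dimensions, then ReLU
W_dec, b_dec = rng.standard_normal((4, 2)), np.zeros((4, 1))
pixels = np.maximum(0, W_dec @ denoised_latent + b_dec)   # ready to be rearranged into frames
print(pixels.shape)                                       # (4, 4)
```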

[13] Time for the video!

The last step is to arrange the result from the above matrix into a sequence of frames which finally gives us our new video. Hooray!

And with that we come to the end of this supremely powerful technique. Congratulations, you have created a Sora video!

To summarize all that was said and done above, here are the 5 key points:

  1. Converting the videos into visual patches and then reducing their dimension is essential. A visual encoder is our friend here.
  2. As the name suggests, diffusion is the name of the game in this method. Adding noise to the video and then working with it at each of the subsequent steps (in different ways) is what this technique relies on.
  3. Next up is the transformer architecture that enhances the abilities of the diffusion process along with amplifying the scale of the model.
  4. Once the model is trained and ready to converge to a solution, the two D’s – denoiser and decoder come in handy. One by removing the noise and the other by projecting the low-dimensional space to its original dimension.
  5. Finally, the resultant pixels from the decoder are rearranged to generate the desired video.

(Once you are done with the article, I suggest you to read the story at the beginning once more. Can you spot the similarities between Sora of DiTharos and Sora of our world?)

The Diffusion-Transformer (DiT) Combo

The kind of videos Sora has been able to produce, it is worth saying that the Diffusion-Transformer duo is lethal. Along with it, the idea of visual patches opens up an avenue for tinkering with a range of image resolutions, aspect ratios and durations, which allows for utmost experimentation.

Overall, it would not be wrong to say that this idea is seminal and without a doubt is here to stay. According to this New York Times article , Sora was named after the Japanese word for sky and to evoke the idea of limitless potential. And having witnessed its initial promise, it is true that Sora has definitely set a new frontier in AI. Now it remains to see how well it stands the test of safety and time.

As the legend of DiTharos goes – "Sora lives on, honing its skills and getting stronger with each passing day, ready to fly when the hour is golden!"

P.S. If you would like to work through this exercise on your own, here is a blank template for you to use.

Blank Template for hand-exercise

Now go have some fun with Sora in the land of ‘DiTharos’!

The post Deep Dive into Sora’s Diffusion Transformer (DiT) by Hand ✍︎ appeared first on Towards Data Science.

]]>
Deep Dive into Vector Databases by Hand ✍︎ https://towardsdatascience.com/deep-dive-into-vector-databases-by-hand-e9ab71f54f80/ Wed, 20 Mar 2024 05:40:39 +0000 https://towardsdatascience.com/deep-dive-into-vector-databases-by-hand-e9ab71f54f80/ Explore what exactly happens behind-the-scenes in Vector Databases

The post Deep Dive into Vector Databases by Hand ✍︎ appeared first on Towards Data Science.

]]>
The other day I asked my favorite Large Language Model (LLM) to help me explain vectors to my almost 4-year old. In seconds, it spit out a story filled with mythical creatures and magic and vectors. And Voila! I had a sketch for a new children’s book, and it was impressive because the unicorn was called ‘LuminaVec’.

Image by the author (‘LuminaVec’ as interpreted by my almost 4-year old)

So, how did the model help weave this creative magic? Well, the answer is by using vectors (in real life) and most probably vector databases. How so? Let me explain.

Vectors and Embedding

First, the model doesn’t understand the exact words I typed in. What helps it understand the words are their numerical representations which are in the form of vectors. These vectors help the model find similarity among the different words while focusing on meaningful information about each. It does this by using embeddings which are low-dimensional vectors that try to capture the semantics and context of the information.

In other words, vectors in an embedding are lists of numbers that specify the position of an object with respect to a reference space. These objects can be features that define a variable in a dataset. With the help of these numerical vector values, we can determine how close or how far one feature is from the other – are they similar (close) or not similar (far)?

Now these vectors are quite powerful, but when we are talking about LLMs we need to be extra cautious about them because of the word ‘large’. As it happens with these ‘large’ models, these vectors may quickly become long and complex, spanning hundreds or even thousands of dimensions. If not dealt with carefully, the processing speed could suffer and the expense could mount very fast!

Vector Databases

To address this issue, we have our mighty warrior : Vector databases.

Vector databases are special databases that contain these vector embeddings. Similar objects have vectors that are closer to each other in the vector database, while dissimilar objects have vectors that are farther apart. So, rather than parsing the data every time a query comes in and generating these vector embeddings, which consumes huge resources, it is much faster to run the data through the model once, store it in the vector database and retrieve it as needed. This makes vector databases one of the most powerful solutions addressing the problem of scale and speed for these LLMs.

So, going back to the story about the rainbow unicorn, sparkling magic and powerful vectors – when I had asked the model that question, it may have followed a process as this –

  1. The embedding model first transformed the question to a vector embedding.
  2. This vector embedding was then compared to the embeddings in the vector database(s) related to fun stories for 5-year olds and vectors.
  3. Based on this search and comparison, the vectors that were the most similar were returned. The result should have consisted of a list of vectors ranked in their order of similarity to the query vector.

How does it really work?

To distill things even further, how about we go on a ride to resolve these steps on the micro-level? Time to go back to the basics! Thanks to Prof. Tom Yeh, we have this beautiful handiwork that explains the behind-the-scenes workings of the vectors and vector databases. (All the images below, unless otherwise noted, are by Prof. Tom Yeh from the above-mentioned LinkedIn post, which I have edited with his permission. )

So, here we go:

For our example, we have a dataset of three sentences with 3 words (or tokens) for each.

  • How are you
  • Who are you
  • Who am I

And our query is the sentence ‘am I you’.

In real life, a database may contain billions of sentences (think Wikipedia, news archives, journal papers, or any collection of documents), each with up to tens of thousands of tokens. Now that the stage is set, let the process begin:

[1] Embedding : The first step is generating vector embeddings for all the text that we want to be using. To do so, we search for our corresponding words in a table of 22 vectors, where 22 is the vocabulary size for our example.

In real life, the vocabulary size can be tens of thousands. The word embedding dimensions are in the thousands (e.g., 1024, 4096).

By searching for the words how are you in the vocabulary, the word embedding for it looks as this:

[2] Encoding : The next step is encoding the word embedding to obtain a sequence of feature vectors, one per word. For our example, the encoder is a simple perceptron consisting of a Linear layer with a ReLU activation function.

A quick recap:

Linear transformation : The input embedding vector is multiplied by the weight matrix W and then added with the bias vector b,

z = Wx+b, where W is the weight matrix, x is our word embedding and b is the bias vector.

ReLU activation function : Next, we apply the ReLU to this intermediate z.

ReLU returns the element-wise maximum of the input and zero. Mathematically, h = max{0,z}.

Thus, for this example the text embedding looks like this:

To show how it works, let’s calculate the values for the last column as an example.

Linear transformation :

[1×0 + 1×1 + 0×0 + 0×0] + 0 = 1

[0×0 + 1×1 + 0×0 + 1×0] + 0 = 1

[1×0 + 0×1 + 1×0 + 0×0] + (-1) = -1

[1×0 + (-1)×1 + 0×0 + 0×0] + 0 = -1

ReLU

max {0,1} =1

max {0,1} = 1

max {0,-1} = 0

max {0,-1} = 0

And thus we get the last column of our feature vector. We can repeat the same steps for the other columns.

[3] Mean Pooling : In this step, we club the feature vectors by averaging over the columns to obtain a single vector. This is often called text embedding or sentence embedding.

Other techniques for pooling such as CLS, SEP can be used but Mean Pooling is the one used most widely.

[4] Indexing : The next step involves reducing the dimensions of the text embedding vector, which is done with the help of a projection matrix. This projection matrix could be random. The idea here is to obtain a short representation which would allow faster comparison and retrieval.

This result is kept away in the vector storage.
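A small sketch of this indexing step, assuming a random projection matrix that shrinks a 4-dimensional text embedding down to 2 dimensions.

```python
import numpy as np

rng = np.random.default_rng(6)

text_embedding = np.array([1.0, 0.5, 0.0, 2.0])  # hypothetical 4-dimensional sentence embedding

P = rng.standard_normal((2, 4))                  # random projection matrix (4 -> 2 dimensions)
index_vector = P @ text_embedding                # short representation stored in the vector database
print(index_vector.shape)                        # (2,)
```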

[5] Repeat : The above steps [1]-[4] are repeated for the other sentences in the dataset "who are you" and "who am I".

Now that we have indexed our dataset in the vector database, we move on to the actual query and see how these indices play out to give us the solution.

Query : "am I you"

[6] To get started, we repeat the same steps as above – embedding, encoding and indexing to obtain a 2d-vector representation of our query.

[7] Dot Product (Finding Similarity)

Once the previous steps are done, we perform dot products. This is important as these dot products power the idea of comparison between the query vector and our database vectors. To perform this step, we transpose our query vector and multiply it with the database vectors.

[8] Nearest Neighbor

The final step is performing a linear scan to find the largest dot product, which for our example is 60/9. This is the vector representation for "who am I". In real life, a linear scan could be incredibly slow as it may involve billions of values, the alternative is to use an Approximate Nearest Neighbor (ANN) algorithm like the Hierarchical Navigable Small Worlds (HNSW).

And that brings us to the end of this elegant method.

Thus, by using the vector embeddings of the datasets in the vector database, and performing the steps above, we were able to find the sentence closest to our query. Embedding, encoding, mean pooling, indexing and then dot products form the core of this process.
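To tie the whole pipeline together, here is a compact sketch under assumptions: a toy vocabulary, a random encoder and projection matrix, mean pooling, and a brute-force dot-product scan standing in for an ANN index. None of the numbers come from the hand exercise.

```python
import numpy as np

rng = np.random.default_rng(7)

vocab = {w: i for i, w in enumerate("how are you who am I".split())}
E = rng.standard_normal((len(vocab), 4))          # toy word-embedding table
W_enc, b_enc = rng.standard_normal((4, 4)), np.zeros(4)
P = rng.standard_normal((2, 4))                   # random projection for indexing

def index_sentence(sentence: str) -> np.ndarray:
    tokens = E[[vocab[w] for w in sentence.split()]]     # [1] embedding
    feats = np.maximum(0, tokens @ W_enc.T + b_enc)      # [2] encoding (linear + ReLU)
    pooled = feats.mean(axis=0)                          # [3] mean pooling -> sentence embedding
    return P @ pooled                                    # [4] indexing (project to 2 dims)

database = {s: index_sentence(s) for s in ["how are you", "who are you", "who am I"]}

query = index_sentence("am I you")                           # [6] same steps for the query
scores = {s: float(v @ query) for s, v in database.items()}  # [7] dot products
print(max(scores, key=scores.get))                           # [8] nearest neighbour by linear scan
```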

The ‘large’ picture

However, to bring in the ‘large’ perspective one more time –

  • A dataset may contain millions or billions of sentences.
  • The number of tokens for each of them can be tens of thousands.
  • The word embedding dimensions can be in the thousands.

As we put all of these data and steps together, we are talking about performing operations on dimensions that are mammoth-like in size. And so, to power through this magnificent scale, vector databases come to the rescue. Since we started this article talking about LLMs, it would be a good place to say that because of the scale-handling capability of vector databases, they have come to play a significant role in Retrieval Augmented Generation (RAG). The scalability and speed offered by vector databases enable efficient retrieval for the RAG models, thus paving the way for an efficient generative model.

All in all it is quite right to say that vector databases are powerful. No wonder they have been here for a while – starting their journey of helping recommendation systems to now powering the LLMs, their rule continues. And with the pace vector embeddings are growing for different AI modalities, it seems like vector databases are going to continue their rule for a good amount of time in the future!

Image by the author

P.S. If you would like to work through this exercise on your own, here is a link to a blank template for your use.

Blank Template for hand-exercise

Now go have fun and create some ‘luminous vectoresque’ magic!

The post Deep Dive into Vector Databases by Hand ✍︎ appeared first on Towards Data Science.

]]>