
Deep Dive into Anthropic’s Sparse Autoencoders by Hand ✍️

Explore the concepts behind the interpretability quest for LLMs

Image by author (Zephyra, the protector of Lumaria by my 4-year-old)

"In the mystical lands of Lumaria, where ancient magic filled the air, lived Zephyra, the Ethereal Griffin. With the body of a lion and the wings of an eagle, Zephyra was the revered protector of the Codex of Truths, an ancient script holding the universe’s secrets.

Nestled in a sacred cave, the Codex was safeguarded by Zephyra’s viridescent eyes, which could see through deception to unveil pure truths. One day, a dark sorcerer descended on the lands of Lumaria and sought to shroud the world in ignorance by concealing the Codex. The villagers called upon Zephyra, who soared through the skies, as a beacon of hope. With a majestic sweep of the wings, Zephyra created a protective barrier of light around the grove, repelling the sorcerer and exposing the truths.

After a long duel, it was concluded that the dark sorcerer was no match for Zephyra’s light. Through her courage and vigilance, the true light kept shining over Lumaria. And as time went by, Lumaria was guided to prosperity under Zephyra’s protection, and its path stayed illuminated by the truths Zephyra safeguarded. And this is how Zephyra’s legend lived on!"


Anthropic’s journey ‘towards extracting interpretable features’

Following the story of Zephyra, Anthropic AI embarked on its own quest to extract meaningful features from a model. The idea behind this investigation is to understand how the different components of a neural network interact with one another and what role each component plays.

According to the paper "Towards Monosemanticity: Decomposing Language Models With Dictionary Learning" [1], a Sparse Autoencoder is able to extract meaningful features from a model. In other words, Sparse Autoencoders help tackle the problem of ‘polysemanticity’ – neural activations that correspond to several meanings/interpretations at once – by isolating sparsely activating features that each hold a single interpretation; in other words, features that are monosemantic.

To understand how all of it is done, we have these beautiful handiworks on Autoencoders and Sparse Autoencoders by Prof. Tom Yeh that explain the behind-the-scenes workings of these phenomenal mechanisms.

(All the images below, unless otherwise noted, are by Prof. Tom Yeh from the above-mentioned LinkedIn posts, which I have edited with his permission.)

To begin, let us first explore what an Autoencoder is and how it works.

What is an Autoencoder?

Imagine a writer whose desk is strewn with different papers – some are his notes for the story he is writing, some are copies of final drafts, and some are illustrations for his action-packed tale. Amidst this chaos, it is hard to find the important parts – more so when the writer is in a hurry and the publisher is on the phone demanding a book in two days. Thankfully, the writer has a very efficient assistant who makes sure the cluttered desk is cleaned regularly, grouping similar items, organizing them, and putting things in their right place. Whenever needed, the assistant retrieves the correct items for the writer, helping him meet the deadlines set by his publisher.

Well, the name of this assistant is Autoencoder. It mainly has two functions – encoding and decoding. Encoding refers to condensing input data and extracting the essential features (organization). Decoding is the process of reconstructing original data from encoded representation while aiming to minimize information loss (retrieval).

Now let’s look at how this assistant works.

How does an Autoencoder Work?

Given : Four training examples X1, X2, X3, X4.

[1] Auto

The first step is to copy the training examples to targets Y’. The Autoencoder’s work is to reconstruct these training examples. Since the targets are the training examples themselves, the word ‘Auto’ is used which is Greek for ‘self’.

[2] Encoder : Layer 1 + ReLU

As we have seen in all our previous models, a simple weight matrix and bias coupled with ReLU is powerful and able to do wonders. Thus, using the first Encoder layer, we reduce the size of the original feature set from 4×4 to 3×4.

A quick recap:

Linear transformation : The input vector is multiplied by the weight matrix W and the bias vector b is added,

z = Wx + b, where W is the weight matrix, x is our word embedding, and b is the bias vector.

ReLU activation function : Next, we apply the ReLU to this intermediate z.

ReLU returns the element-wise maximum of the input and zero. Mathematically, h = max{0,z}.
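If it helps to see this recap in code, here is a minimal NumPy sketch of the linear transformation followed by ReLU; the shapes (4 input features mapped to 3) follow the walkthrough, while the random values are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))    # weight matrix: 3 hidden features x 4 input features
b = rng.normal(size=(3, 1))    # bias vector
x = rng.normal(size=(4, 1))    # one input column (e.g. a word embedding)

z = W @ x + b                  # linear transformation: z = Wx + b
h = np.maximum(0, z)           # ReLU: element-wise maximum of z and zero
print(h.shape)                 # (3, 1) -- the compressed representation
```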

[3] Encoder : Layer 2 + ReLU

The output of the previous layer is processed by the second Encoder layer, which reduces the size further to 2×4. This is where the extraction of relevant features occurs. This layer is also called the ‘bottleneck’ since its outputs have far fewer features than the input.

[4] Decoder : Layer 1 + ReLU

Once the encoding process is complete, the next step is to decode the relevant features to build ‘back’ the final output. To do so, we multiply the features from the last step by the corresponding weights, add the biases, and apply ReLU. The result is a 3×4 matrix.

[5] Decoder : Layer 2 + ReLU

A second Decoder layer (weights, biases + ReLU) is applied to the previous output to give the final result, the reconstructed 4×4 matrix. We do so to get back to the original dimensions so that we can compare the result with our original targets.

[6] Loss Gradients & BackPropagation

Once the output from the decoder layer is obtained, we calculate the gradient of the Mean Squared Error (MSE) between the outputs (Y) and the targets (Y’). With respect to Y, this gradient is proportional to 2(Y − Y’); it is propagated backwards through the network, and the weights and biases are updated accordingly.
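For readers who prefer code to hand arithmetic, here is a minimal PyTorch sketch of the same 4 → 3 → 2 → 3 → 4 Autoencoder trained against its own inputs with MSE loss. The optimizer, learning rate, and random data are illustrative assumptions, not values from the hand exercise.

```python
import torch
import torch.nn as nn

# Encoder compresses 4 -> 3 -> 2; decoder reconstructs 2 -> 3 -> 4, as in the walkthrough.
encoder = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 2), nn.ReLU())
decoder = nn.Sequential(nn.Linear(2, 3), nn.ReLU(), nn.Linear(3, 4), nn.ReLU())
model = nn.Sequential(encoder, decoder)

X = torch.rand(4, 4)                              # four training examples, four features each
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(100):
    Y = model(X)                                  # reconstruction
    loss = ((Y - X) ** 2).mean()                  # MSE against the "auto" targets Y' = X
    optimizer.zero_grad()
    loss.backward()                               # gradient is proportional to 2(Y - X)
    optimizer.step()                              # update weights and biases
```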

Now that we understand how the Autoencoder works, it’s time to explore how its sparse variation is able to achieve interpretability for Large Language Models (LLMs).

Sparse Autoencoder – How does it work?

To start with, suppose we are given:

  • The output of a transformer after the feed-forward layer has processed it, i.e. the model activations for five tokens (X). These activations are useful for prediction, but they do not shed light on how the model arrives at its decisions or makes its predictions.

The prime question here is:

Is it possible to map each activation (3D) to a higher-dimensional space (6D) in a way that helps with understanding?

[1] Encoder : Linear Layer

The first step in the Encoder layer is to multiply the input X with encoder weights and add biases (as done in the first step of an Autoencoder).

[2] Encoder : ReLU

The next sub-step is to apply the ReLU activation function to add non-linearity and suppress negative activations. This suppression sets many features to 0, which enables sparsity – the output is a set of sparse, interpretable features f.

Interpretability comes from having only one or two positive features per pattern. If we examine f6, we can see that X2 and X3 are positive for it, and we may say that both have ‘Mountain’ in common.
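To make the sparsity concrete, here is a tiny PyTorch sketch of the encoder step alone, expanding 3-dimensional activations for five tokens into 6 candidate features as in the example above. The random inputs and weights are placeholders, so the exact zero pattern will differ from the hand-drawn one.

```python
import torch
import torch.nn as nn

X = torch.rand(5, 3)             # model activations for five tokens, 3 dimensions each
encoder = nn.Linear(3, 6)        # project up to 6 candidate features (3D -> 6D)

f = torch.relu(encoder(X))       # entries with negative pre-activations become exactly 0
print(f)                         # sparse, hopefully interpretable features
print((f == 0).float().mean())   # fraction of zeroed entries
```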

[3] Decoder : Reconstruction

Once we are done with the encoder, we proceed to the decoder step. We multiply f with decoder weights and add biases. This outputs X’, which is the reconstruction of X from interpretable features.

As done in an Autoencoder, we want X’ to be as close to X as possible. To ensure that, further training is essential.

[4] Decoder : Weights

As an intermediate step, we compute the L2 norm of each of the decoder weight vectors in this step and keep these values aside to be used later.

L2-norm

Also known as the Euclidean norm, the L2-norm calculates the magnitude of a vector using the formula: ||x||₂ = √(Σᵢ xᵢ²).

In other words, it sums the squares of the components and then takes the square root of the result. This norm provides a straightforward way to quantify the length of a vector in Euclidean space.
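As a quick sketch, assuming the norms are taken over each feature’s decoder column (which is how they are used in Step 8 below), the computation looks like this in NumPy:

```python
import numpy as np

# Hypothetical decoder weights: 3 output dimensions x 6 features.
W_dec = np.random.default_rng(1).normal(size=(3, 6))

# L2 norm of each feature's decoder column: square, sum, square root.
col_norms = np.linalg.norm(W_dec, axis=0)
# Equivalent to: np.sqrt((W_dec ** 2).sum(axis=0))
print(col_norms.shape)   # (6,) -- one norm per feature, kept aside for Step 8
```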

Training

As mentioned earlier, a Sparse Autoencoder requires extensive training to get the reconstructed X’ closer to X. To see how, we proceed with the steps below:

[5] Sparsity : L1 Loss

The goal here is to push as many values as possible to zero, or close to it. We do so by adding an L1 sparsity penalty on the absolute values of the feature activations f – the core idea being that we want their sum to be as small as possible.

L1-loss

The L1-loss is calculated as the sum of the absolute values of the weights: L1 = λΣ|w|, where λ is a regularization parameter.

This encourages many weights to become zero, simplifying the model and thus enhancing interpretability.

In other words, L1 helps build the focus on the most relevant features while also preventing overfitting, improving model generalization, and reducing computational complexity.

[6] Sparsity : Gradient

The next step is to calculate the L1 gradients, which are −1 for positive values. Thus, for all values of f > 0, the result is set to −1.


How does L1 penalty push weights towards zero?

The gradient of the L1 penalty pushes weights towards zero through a process that applies a constant force, regardless of the weight’s current value. Here’s how it works (all images in this sub-section are by author):

The L1 penalty is expressed as:

L1 = λΣ|w|

The gradient of this penalty with respect to a weight w is:

∂L1/∂w = λ·sign(w)

where sign(w) is:

sign(w) = +1 if w > 0, −1 if w < 0, and 0 if w = 0

During gradient descent, the update rule for the penalty term alone is:

w ← w − 𝞰·λ·sign(w)

where 𝞰 is the learning rate.

The constant subtraction (or addition) of 𝞰λ from the weight value, depending on its sign, decreases the absolute value of the weight. If the weight is small enough, this process can drive it to exactly zero.
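As a toy numeric illustration, with made-up numbers and a clip-to-zero rule that is a common practical convention rather than something prescribed by the walkthrough, repeatedly subtracting 𝞰λ drives a small weight to exactly zero:

```python
# Toy example: eta = 0.1 and lam = 0.5, so eta * lam = 0.05 is removed each step.
w, eta, lam = 0.3, 0.1, 0.5
for _ in range(10):
    if w == 0.0:
        break
    step = eta * lam * (1 if w > 0 else -1)   # eta times the gradient of the L1 penalty alone
    w_new = w - step
    # If the update would overshoot past zero, clip to exactly zero (our convention here).
    w = 0.0 if w * w_new <= 0 else w_new
print(w)   # 0.0 -- reached after six updates
```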


[7] Sparsity : Zero

For all other values that are already zero, the gradient contribution is zero, so we keep them unchanged.

[8] Sparsity : Weight

We multiply each row of the gradient matrix obtained in Step 6 by the corresponding decoder weight norms obtained in Step 4. This step is crucial, as it prevents the model from learning large decoder weights, which would add incorrect information when reconstructing the results.

[9] Reconstruction : MSE Loss

We use the Mean Squared Error (MSE), or L2 loss, to calculate the difference between X’ and X. The goal, as before, is to make this error as small as possible.

[10] Reconstruction : Gradient

The gradient of the L2 loss with respect to X’ is proportional to 2(X’ − X).

And hence, as with the original Autoencoder, we run backpropagation to update the weights and biases. The catch here is finding a good balance between sparsity and reconstruction.
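Putting the pieces together, here is a rough PyTorch sketch of a single training step under the assumptions used in this walkthrough: the reconstruction term is the MSE from Step 9, and the sparsity term penalizes the feature activations f scaled by the decoder column norms from Steps 4 and 8. The layer sizes, the sparsity coefficient lam, and the choice of Adam are illustrative placeholders, not Anthropic’s exact recipe.

```python
import torch
import torch.nn as nn

d_model, d_feat, lam = 3, 6, 1e-3                    # activation dim, feature dim, sparsity coefficient
enc = nn.Linear(d_model, d_feat)
dec = nn.Linear(d_feat, d_model)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

X = torch.rand(5, d_model)                           # activations for five tokens

f = torch.relu(enc(X))                               # sparse, non-negative features
X_hat = dec(f)                                       # reconstruction X'

recon_loss = ((X_hat - X) ** 2).mean()               # MSE; gradient proportional to 2(X' - X)
col_norms = dec.weight.norm(dim=0)                   # L2 norm of each feature's decoder column
sparsity_loss = lam * (f * col_norms).sum()          # L1-style penalty on the feature activations

loss = recon_loss + sparsity_loss                    # trade-off between reconstruction and sparsity
opt.zero_grad()
loss.backward()
opt.step()
```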

And with this, we come to the end of this very clever and intuitive way of learning how a model understands an idea and the direction it takes to generate a response.

To summarize:

  1. An Autoencoder overall consists of two parts: an Encoder and a Decoder. The Encoder uses weights and biases coupled with the ReLU activation function to compress the initial input features into a lower dimension, trying to capture only the relevant parts. The Decoder, on the other hand, takes the output of the Encoder and works to reconstruct the input features back to their original state. Since the targets in an Autoencoder are the initial features themselves, the word ‘auto’ is used. The aim, as with standard neural networks, is to achieve the lowest error (difference) between the targets and the reconstructed output – and this is achieved by propagating the gradient of the error through the network while updating the weights and biases.
  2. A Sparse Autoencoder consists of all the components of a standard Autoencoder, along with a few additions. The key difference lies in the training step. Since the aim here is to retrieve interpretable features, we want to zero out the values that hold relatively little meaning. Once the encoder uses ReLU to suppress the negative values, we go a step further and apply an L1 loss to the result to encourage sparsity by penalizing the absolute values of the remaining feature activations. This is achieved by adding a penalty term to the loss function, λΣ|f|. The features that remain non-zero are those that are crucial for the model’s performance.

Extracting Interpretable features using Sparsity

As humans, our brains activate only a small subset of neurons in response to specific stimuli. Likewise, Sparse Autoencoders learn a sparse representation of the input by leveraging sparsity constraints such as L1 regularization. By doing so, a Sparse Autoencoder is able to extract interpretable features from complex data, enhancing the simplicity and interpretability of what is learned. This selective activation, mirroring biological neural processes, helps the model focus on the most relevant aspects of the input data, making it more robust and efficient.

Anthropic’s endeavor to understand interpretability in AI models highlights the need for transparent and understandable AI systems, especially as they become more integrated into critical decision-making processes. By focusing on creating models that are both powerful and interpretable, Anthropic contributes to the development of AI that can be trusted and effectively utilized in real-world applications.

In conclusion, Sparse Autoencoders are vital for extracting interpretable features, enhancing model robustness, and ensuring efficiency. The ongoing work on understanding these powerful models and how they make inferences underscores the growing importance of interpretability in AI, paving the way for more transparent AI systems. It remains to be seen how these concepts evolve and drive us towards a future of safe integration of AI into our lives!

P.S. If you would like to work through this exercise on your own, here is a link to a blank template for your use.

Blank Template for hand-exercise

Now go have fun and help Zephyra keep the Codex of Truths safe!

Once again special thanks to Prof. Tom Yeh for supporting this work!

References:

[1] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning, Bricken et al. Oct 2023 https://transformer-circuits.pub/2023/monosemantic-features/index.html

[2] Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet, Templeton et al. May 2024 https://transformer-circuits.pub/2024/scaling-monosemanticity/

