Show and Tell

Implementing one of the earliest neural image caption generator models with PyTorch.

Photo by Ståle Grut on Unsplash

Introduction

Natural Language Processing and Computer Vision used to be two completely different fields. At least back when I started learning machine learning and deep learning, it felt like there were multiple paths to follow, and each of them, including NLP and Computer Vision, led to a completely different world. Over time, AI has become more and more advanced, and intersections between fields of study, including the two I just mentioned, have become increasingly common.

Today, many language models have the capability to generate images based on a given prompt. That's one example of the bridge between NLP and Computer Vision, but I'll save it for an upcoming article as it is a bit more complex. Instead, in this article I am going to discuss a simpler one: image captioning. As the name suggests, this is essentially a technique where a model accepts an image and returns a text that describes the input image.

One of the earliest papers on this topic is the one titled "Show and Tell: A Neural Image Caption Generator," written by Vinyals et al. back in 2015 [1]. In this article, I will focus on implementing the Deep Learning model proposed in the paper using PyTorch. Note that I won't actually demonstrate the training process here as that's a topic on its own. Let me know in the comments if you want a separate tutorial on that.


Image Captioning Framework

Generally speaking, image captioning can be done by combining two types of models: the one specialized to process images and another one capable of processing sequences. I believe you already know what kind of models work best for the two tasks – yes, you’re right, those are CNN and RNN, respectively. The idea here is that the CNN is utilized to encode the input image (hence this part is called encoder), whereas the RNN is used for generating a sequence of words based on the features encoded by the CNN (hence the RNN part is called decoder).

The authors of the paper approached this using GoogLeNet (a.k.a. Inception V1) for the encoder and an LSTM for the decoder. In fact, the use of GoogLeNet is not mentioned explicitly, but based on the illustration provided in the paper, the encoder architecture appears to be adopted from the original GoogLeNet paper [2]. The figure below shows what the proposed architecture looks like.

Figure 1. The image captioning model proposed in [1], where the encoder part (the leftmost block) implements the GoogLeNet model [2].

Talking more specifically about the connection between the encoder and the decoder, there are several methods available for connecting the two, namely init-inject, pre-inject, par-inject and merge, as mentioned in [3]. In the case of the Show and Tell paper, the authors used pre-inject, a method where the features extracted by the encoder are treated as the 0th token in the caption. Later, in the inference phase, we expect the decoder to generate a caption based solely on these image features.

Figure 2. The four methods possible to be used to connect the encoder and the decoder part of an image captioning model [3]. In our case we are going to use the pre-inject method (b).

Now that we understand the theory behind the image captioning model, we can jump into the code!


Implementation

I’ll break the implementation part into three sections: the Encoder, the Decoder, and the combination of the two. Before we actually get into them, we need to import the modules and initialize the required parameters in advance. Look at the Codeblock 1 below to see the modules I use.

# Codeblock 1
import torch  #(1)
import torch.nn as nn  #(2)
import torchvision.models as models  #(3)
from torchvision.models import GoogLeNet_Weights  #(4)

Let's break down these imports quickly: the line marked with #(1) is used for basic tensor operations, line #(2) is for initializing neural network layers, line #(3) is for loading various deep learning models, and line #(4) imports the pretrained weights for the GoogLeNet model.

Talking about the parameter configuration, EMBED_DIM and LSTM_HIDDEN_DIM are the only two parameters mentioned in the paper, which are both set to 512 as shown at line #(1) and #(2) in the Codeblock 2 below. The EMBED_DIM variable essentially indicates the feature vector size representing a single token in the caption. In this case, we can simply think of a single token as an individual word. Meanwhile, LSTM_HIDDEN_DIM is a variable representing the hidden state size inside the LSTM cell. This paper does not mention how many times this RNN-based layer is repeated, but based on the diagram in Figure 1, it seems like it only implements a single LSTM cell. Thus, at line #(3) I set the NUM_LSTM_LAYERS variable to 1.

# Codeblock 2
EMBED_DIM       = 512    #(1)
LSTM_HIDDEN_DIM = 512    #(2)
NUM_LSTM_LAYERS = 1      #(3)

IMAGE_SIZE      = 224    #(4)
IN_CHANNELS     = 3      #(5)

SEQ_LENGTH      = 30     #(6)
VOCAB_SIZE      = 10000  #(7)

BATCH_SIZE      = 1

The next two parameters are related to the input image, namely IMAGE_SIZE (#(4)) and IN_CHANNELS (#(5)). Since we are about to use GoogLeNet for the encoder, we need to match its original input shape (3×224×224). We also need to configure the parameters for the caption: here we assume that a caption is no longer than 30 words (#(6)) and that the dictionary contains 10000 unique words (#(7)). Lastly, the BATCH_SIZE parameter is used because PyTorch processes tensors in batches by default. Just to keep things simple, the number of image-caption pairs within a single batch is set to 1.

GoogLeNet Encoder

It is actually possible to use any kind of CNN-based model for the encoder. I found on the internet that [4] uses DenseNet, [5] uses Inception V3, and [6] utilizes ResNet for similar tasks. However, since my goal is to reproduce the model proposed in the paper as closely as possible, I am using the pretrained GoogLeNet model instead. Before we get into the encoder implementation, let's see what the GoogLeNet architecture looks like using the following code.

# Codeblock 3
models.googlenet()

The resulting output is very long as it lists literally all layers inside the architecture. Here I truncate the output since I only want you to focus on the last layer (the fc layer marked with #(1) in the Codeblock 3 Output below). You can see that this linear layer maps a feature vector of size 1024 into 1000. Normally, in a standard image classification task, each of these 1000 neurons corresponds to a specific class. So, for example, if you want to perform a 5-class classification task, you would need to modify this layer such that it projects the outputs to 5 neurons only. In our case, we need to make this layer produce a feature vector of length 512 (EMBED_DIM). With this, the input image will later be represented as a 512-dimensional vector after being processed by the GoogLeNet model. This feature vector size will exactly match with the token embedding dimension, allowing it to be treated as a part of our word sequence.

# Codeblock 3 Output
GoogLeNet(
  (conv1): BasicConv2d(
    (conv): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )
  (maxpool1): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=True)
  (conv2): BasicConv2d(
    (conv): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (bn): BatchNorm2d(64, eps=0.001, momentum=0.1, affine=True, track_running_stats=True)
  )

  .
  .
  .
  .

  (avgpool): AdaptiveAvgPool2d(output_size=(1, 1))
  (dropout): Dropout(p=0.2, inplace=False)
  (fc): Linear(in_features=1024, out_features=1000, bias=True)  #(1)
)

Now let’s actually load and modify the GoogLeNet model, which I do in the InceptionEncoder class below.

# Codeblock 4a
class InceptionEncoder(nn.Module):
    def __init__(self, fine_tune):  #(1)
        super().__init__()
        self.googlenet = models.googlenet(weights=GoogLeNet_Weights.IMAGENET1K_V1)  #(2)
        self.googlenet.fc = nn.Linear(in_features=self.googlenet.fc.in_features,  #(3)
                                      out_features=EMBED_DIM)  #(4)

        if fine_tune == True:       #(5)
            for param in self.googlenet.parameters():
                param.requires_grad = True
        else:
            for param in self.googlenet.parameters():
                param.requires_grad = False

        for param in self.googlenet.fc.parameters():
            param.requires_grad = True

The first thing we do in the above code is to load the model using models.googlenet(). It is mentioned in the paper that the model is already pretrained on the ImageNet dataset. Thus, we need to pass GoogLeNet_Weights.IMAGENET1K_V1 into the weights parameter, as shown at line #(2) in Codeblock 4a. Next, at line #(3) we access the classification head through the fc attribute, where we replace the existing linear layer with a new one having the output dimension of 512 (EMBED_DIM) (#(4)). Since this GoogLeNet model is already trained, we don’t need to train it from scratch. Instead, we can either perform fine-tuning or transfer learning in order to adapt it to the image captioning task.

In case you're not yet familiar with the two terms, fine-tuning is a method where we update the weights of the entire model. On the other hand, transfer learning is a technique where we only update the weights of the layers we replaced (in this case it's the last fully-connected layer), while keeping the weights of the existing layers frozen. To do so, I implement a flag named fine_tune at line #(1), which lets the model perform fine-tuning whenever it is set to True (#(5)).
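
As a quick sanity check (my own addition, not part of the original post), you can count the trainable parameters to confirm which of the two modes you are in; the count_trainable helper below is a made-up utility, not something from torchvision.

# Count trainable parameters to verify fine-tuning vs. transfer learning.
def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(count_trainable(InceptionEncoder(fine_tune=True)))   # entire GoogLeNet is trainable
print(count_trainable(InceptionEncoder(fine_tune=False)))  # only the new fc layer is trainable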

The forward() method is pretty straightforward since what we do here is simply passing the input image through the modified GoogLeNet model. See the Codeblock 4b below for the details. Additionally, here I also print out the tensor dimension before and after processing so that you can better understand how the InceptionEncoder model works.

# Codeblock 4b
    def forward(self, images):
        print(f'original\t: {images.size()}')
        features = self.googlenet(images)
        print(f'after googlenet\t: {features.size()}')

        return features

To test whether our encoder works properly, we can pass a dummy tensor of size 1×3×224×224 through the network as demonstrated in Codeblock 5. This tensor dimension simulates a single RGB image of size 224×224. You can see in the resulting output that our image now becomes a single-dimensional feature vector with the length of 512.

# Codeblock 5
inception_encoder = InceptionEncoder(fine_tune=True)

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
features = inception_encoder(images)
# Codeblock 5 Output
original         : torch.Size([1, 3, 224, 224])
after googlenet  : torch.Size([1, 512])

LSTM Decoder

As we have successfully implemented the encoder, we are now going to create the LSTM decoder, which I demonstrate in Codeblocks 6a and 6b. What we need to do first is to initialize the required layers, namely an embedding layer (#(1)), the LSTM layer itself (#(2)), and a standard linear layer (#(3)). The first one (nn.Embedding) is responsible for mapping every single token into a 512 (EMBED_DIM)-dimensional vector. Meanwhile, the LSTM layer processes the sequence of embedded tokens, and each of its outputs will be mapped into a 10000 (VOCAB_SIZE)-dimensional vector by the linear layer. Later on, the values contained in this vector will represent the likelihood of each word in the dictionary being chosen.

# Codeblock 6a
class LSTMDecoder(nn.Module):
    def __init__(self):
        super().__init__()

        #(1)
        self.embedding = nn.Embedding(num_embeddings=VOCAB_SIZE,
                                      embedding_dim=EMBED_DIM)
        #(2)
        self.lstm = nn.LSTM(input_size=EMBED_DIM, 
                            hidden_size=LSTM_HIDDEN_DIM, 
                            num_layers=NUM_LSTM_LAYERS, 
                            batch_first=True)
        #(3)        
        self.linear = nn.Linear(in_features=LSTM_HIDDEN_DIM, 
                                out_features=VOCAB_SIZE)

Next, let’s define the flow of the network using the following code.

# Codeblock 6b
    def forward(self, features, captions):                 #(1)
        print(f'features original\t: {features.size()}')
        features = features.unsqueeze(1)                   #(2)
        print(f"after unsqueeze\t\t: {features.shape}")

        print(f'captions original\t: {captions.size()}')
        captions = self.embedding(captions)                #(3)
        print(f"after embedding\t\t: {captions.shape}")

        captions = torch.cat([features, captions], dim=1)  #(4)
        print(f"after concat\t\t\t: {captions.shape}")

        captions, _ = self.lstm(captions)                  #(5)
        print(f"after lstm\t\t: {captions.shape}")

        captions = self.linear(captions)                   #(6)
        print(f"after linear\t\t: {captions.shape}")

        return captions

You can see in the above code that the forward() method of the LSTMDecoder class accepts two inputs: features and captions, where the former is the image that has been processed by the InceptionEncoder, while the latter is the caption of the corresponding image serving as the ground truth (#(1)). The idea here is that we perform the pre-inject operation by prepending the features tensor to captions using the code at line #(4). However, keep in mind that we need to adjust the shape of both tensors beforehand. To do so, we have to insert a single dimension at the 1st axis of the image features (#(2)). Meanwhile, the shape of the captions tensor will already align with our requirement right after being processed by the embedding layer (#(3)). Once the features and captions have been concatenated, we pass this tensor through the LSTM layer (#(5)) before it is eventually processed by the linear layer (#(6)). Look at the testing code below to better understand the flow of the two tensors.

# Codeblock 7
lstm_decoder = LSTMDecoder()

features = torch.randn(BATCH_SIZE, EMBED_DIM)  #(1)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))  #(2)

captions = lstm_decoder(features, captions)

In Codeblock 7, I assume that features is a dummy tensor that represents the output of the InceptionEncoder model (#(1)). Meanwhile, captions is the tensor representing a sequence of tokenized words, where in this case I initialize it with random integers ranging from 0 to 9999 (VOCAB_SIZE - 1) with a length of 30 (SEQ_LENGTH) (#(2)).

We can see in the output below that the features tensor initially has the dimension of 1×512 (#(1)). This tensor shape changes to 1×1×512 after the unsqueeze() operation (#(2)). The additional dimension in the middle (1) allows the tensor to be treated as a feature vector for a single timestep, which is necessary for compatibility with the LSTM layer. As for the captions tensor, its shape changes from 1×30 (#(3)) to 1×30×512 (#(4)), indicating that every single word is now represented as a 512-dimensional vector.

# Codeblock 7 Output
features original : torch.Size([1, 512])       #(1)
after unsqueeze   : torch.Size([1, 1, 512])    #(2)
captions original : torch.Size([1, 30])        #(3)
after embedding   : torch.Size([1, 30, 512])   #(4)
after concat      : torch.Size([1, 31, 512])   #(5)
after lstm        : torch.Size([1, 31, 512])   #(6)
after linear      : torch.Size([1, 31, 10000]) #(7)

After the pre-inject operation is performed, our tensor now has the dimension of 1×31×512, with the image features acting as the token at the 0th timestep in the sequence (#(5)). See the following figure for a better illustration of this idea.

Figure 3. What the resulting tensor looks like after the pre-injection operation. [3].

Next, we pass the tensor through the LSTM layer, where in this particular case the output tensor dimension remains the same. However, it is important to note that the tensor shapes at lines #(5) and #(6) in the above output are actually determined by different parameters: the dimensions only appear to match because EMBED_DIM and LSTM_HIDDEN_DIM were both set to 512. If we used a different value for LSTM_HIDDEN_DIM, the output dimension would be different as well. Finally, we project each of the 31 token embeddings to a vector of size 10000, which will later contain the probability of every possible token being predicted (#(7)).

GoogLeNet Encoder + LSTM Decoder

At this point, we have successfully created both the encoder and the decoder parts of the image captioning model. What I am going to do next is to combine them together in the ShowAndTell class below.

# Codeblock 8a
class ShowAndTell(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = InceptionEncoder(fine_tune=True)  #(1)
        self.decoder = LSTMDecoder()     #(2)

    def forward(self, images, captions):
        features = self.encoder(images)  #(3)
        print(f"after encoder\t: {features.shape}")

        captions = self.decoder(features, captions)      #(4)
        print(f"after decoder\t: {captions.shape}")

        return captions

I think the above code is pretty straightforward. In the __init__() method, we only need to initialize the InceptionEncoder as well as the LSTMDecoder models (#(1) and #(2)). Here I assume that we are about to perform fine-tuning rather than transfer learning, so I set the fine_tune parameter to True. Theoretically speaking, fine-tuning is better than transfer learning if you have a relatively large dataset since it works by re-adjusting the weights of the entire model. However, if your dataset is rather small, you should go with transfer learning instead – but that’s just the theory. It’s definitely a good idea to experiment with both options to see which works best in your case.

Still in the above codeblock, we configure the forward() method to accept image-caption pairs as input. With this configuration, we basically design this method such that it can only be used for training purposes. Here we initially process the raw image with the GoogLeNet inside the encoder block (#(3)). Afterwards, we pass the extracted features as well as the tokenized captions into the decoder block and let it produce another token sequence (#(4)). In the actual training, this caption output will then be compared with the ground truth to compute the error. This error value is used to compute gradients through backpropagation, which determine how the weights in the network are updated.
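
Just to make this part more tangible, below is a minimal sketch of what a single training step could look like. None of this comes from the paper or is shown in the original code: the choice of CrossEntropyLoss, the Adam optimizer, the learning rate, and the way I align the 31 output timesteps with the 30 ground-truth tokens (the output at timestep t is trained to predict the t-th caption token, so the last timestep is dropped) are all my assumptions.

# A minimal training-step sketch (my own assumptions, not from the paper).
model     = ShowAndTell()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

images   = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))

outputs = model(images, captions)                              # (1, 31, 10000)
loss = criterion(outputs[:, :-1, :].reshape(-1, VOCAB_SIZE),   # 30 predictions ...
                 captions.reshape(-1))                         # ... vs. 30 targets

optimizer.zero_grad()
loss.backward()
optimizer.step()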

It is important to know that we cannot use the forward() method to perform inference, so we need a separate one for that. In this case, I am going to implement the code specifically to perform inference in the generate() method below.

# Codeblock 8b
    def generate(self, images):  #(1)
        features = self.encoder(images)              #(2)
        print(f"after encoder\t\t: {features.shape}\n")

        words = []  #(3)
        states = None  # carry the LSTM hidden/cell states across timesteps
        for i in range(SEQ_LENGTH):                  #(4)
            print(f"iteration #{i}")
            features = features.unsqueeze(1)
            print(f"after unsqueeze\t\t: {features.shape}")

            # pass the previous states back in so the LSTM remembers the image
            # features and the tokens generated so far
            features, states = self.decoder.lstm(features, states)
            print(f"after lstm\t\t: {features.shape}")

            features = features.squeeze(1)           #(5)
            print(f"after squeeze\t\t: {features.shape}")

            probs = self.decoder.linear(features)    #(6)
            print(f"after linear\t\t: {probs.shape}")

            _, word = probs.max(dim=1)  #(7)
            print(f"after max\t\t: {word.shape}")

            words.append(word.item())  #(8)

            if word == 1:  #(9)
                break

            features = self.decoder.embedding(word)  #(10)
            print(f"after embedding\t\t: {features.shape}\n")

        return words       #(11)

Instead of taking two inputs like the previous one, the generate() method takes a raw image as its only input (#(1)). Since we want the features extracted from the image to be the initial input token, we first need to process the raw image with the encoder block before generating the subsequent tokens (#(2)). Next, we allocate an empty list for storing the token sequence to be produced later (#(3)). The tokens themselves are generated one by one, so we wrap the entire process inside a for loop, which stops iterating once it reaches 30 (SEQ_LENGTH) words at most (#(4)).

The steps done inside the loop are algorithmically similar to the ones we discussed earlier. However, since the LSTM cell here generates a single token at a time, the process requires the tensor to be treated a bit differently from the one passed through the forward() method of the LSTMDecoder class back in Codeblock 6b. The first difference you might notice is the squeeze() operation (#(5)), which is basically just a technical step so that the subsequent layer performs the linear projection correctly (#(6)). Then, we take the index of the feature vector having the highest value, which corresponds to the token most likely to come next (#(7)), and append it to the list we allocated earlier (#(8)). The loop breaks whenever the predicted index is the stop token, which in this case I assume to be the token at index 1 of the probs vector (#(9)). Otherwise, if the model does not produce the stop token, it converts the last predicted word into its 512 (EMBED_DIM)-dimensional vector (#(10)), allowing it to be used as the input features for the next iteration. Lastly, the generated word sequence is returned once the loop is completed (#(11)).

We are going to simulate the forward pass for the training phase using Codeblock 9 below. Here I pass two tensors through the show_and_tell model (#(1)), one representing a raw image of size 3×224×224 (#(2)) and the other a sequence of tokenized words (#(3)). Based on the resulting output, we can see that our model works properly, as the two input tensors successfully passed through the InceptionEncoder and the LSTMDecoder parts of the network.

# Codeblock 9
show_and_tell = ShowAndTell()  #(1)

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)  #(2)
captions = torch.randint(0, VOCAB_SIZE, (BATCH_SIZE, SEQ_LENGTH))      #(3)

captions = show_and_tell(images, captions)
# Codeblock 9 Output
after encoder : torch.Size([1, 512])
after decoder : torch.Size([1, 31, 10000])

Now, let’s assume that our show_and_tell model is already trained on an image captioning dataset, and thus ready to be used for inference. Look at the Codeblock 10 below to see how I do it. Here we set the model to eval() mode (#(1)), initialize the input image (#(2)), and pass it through the model using the generate() method (#(3)).

# Codeblock 10
show_and_tell.eval()  #(1)

images = torch.randn(BATCH_SIZE, IN_CHANNELS, IMAGE_SIZE, IMAGE_SIZE)  #(2)

with torch.no_grad():
    generated_tokens = show_and_tell.generate(images)  #(3)

The flow of the tensor can be seen in the output below. Here I truncate the resulting output because it just repeats the same token generation process for up to 30 iterations.

# Codeblock 10 Output
after encoder    : torch.Size([1, 512])

iteration #0
after unsqueeze  : torch.Size([1, 1, 512])
after lstm       : torch.Size([1, 1, 512])
after squeeze    : torch.Size([1, 512])
after linear     : torch.Size([1, 10000])
after max        : torch.Size([1])
after embedding  : torch.Size([1, 512])

iteration #1
after unsqueeze  : torch.Size([1, 1, 512])
after lstm       : torch.Size([1, 1, 512])
after squeeze    : torch.Size([1, 512])
after linear     : torch.Size([1, 10000])
after max        : torch.Size([1])
after embedding  : torch.Size([1, 512])

.
.
.
.

To see what the resulting caption looks like, we can just print out the generated_tokens list as shown below. Keep in mind that this sequence is still in the form of tokenized words. Later, in the post-processing stage, we will need to convert them back to the words corresponding to these numbers.

# Codeblock 11
generated_tokens
# Codeblock 11 Output
[5627,
 3906,
 2370,
 2299,
 4952,
 9933,
 402,
 7775,
 602,
 4414,
 8667,
 6774,
 9345,
 8750,
 3680,
 4458,
 1677,
 5998,
 8572,
 9556,
 7347,
 6780,
 9672,
 2596,
 9218,
 1880,
 4396,
 6168,
 7999,
 454]
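
As mentioned above, these IDs still need to be mapped back to actual words in the post-processing stage. Here is a small sketch of what that could look like; the index_to_word dictionary is a made-up placeholder that in practice would come from the vocabulary built while tokenizing the training captions, and I keep the earlier assumption that index 1 is the stop token.

# Hypothetical post-processing sketch: map token IDs back to words.
# index_to_word is a placeholder; in practice it comes from your own vocabulary.
index_to_word = {0: "<start>", 1: "<end>", 2: "a", 3: "dog", 4: "running"}

caption_words = [index_to_word.get(token, "<unk>") for token in generated_tokens]
print(" ".join(caption_words))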

Ending

With the above output, we’ve reached the end of our discussion on image captioning. Over time, many other researchers attempted to make improvements to accomplish this task. So, I think in the upcoming article I will discuss the state-of-the-art method on this topic.

Thanks for reading, I hope you learn something new today!

_By the way you can also find the code used in this article here._


References

[1] Oriol Vinyals et al. Show and Tell: A Neural Image Caption Generator. Arxiv. https://arxiv.org/pdf/1411.4555 [Accessed November 13, 2024].

[2] Christian Szegedy et al. Going Deeper with Convolutions. Arxiv. https://arxiv.org/pdf/1409.4842 [Accessed November 13, 2024].

[3] Marc Tanti et al. Where to put the Image in an Image Caption Generator. Arxiv. https://arxiv.org/pdf/1703.09137 [Accessed November 13, 2024].

[4] Stepan Ulyanin. Captioning Images with CNN and RNN, using PyTorch. Medium. https://medium.com/@stepanulyanin/captioning-images-with-pytorch-bc592e5fd1a3 [Accessed November 16, 2024].

[5] Saketh Kotamraju. How to Build an Image-Captioning Model in Pytorch. Towards Data Science. https://towardsdatascience.com/how-to-build-an-image-captioning-model-in-pytorch-29b9d8fe2f8c [Accessed November 16, 2024].

[6] Code with Aarohi. Image Captioning using CNN and RNN | Image Captioning using Deep Learning. YouTube. https://www.youtube.com/watch?v=htNmFL2BG34 [Accessed November 16, 2024].

DeepSeek-V3 Explained 1: Multi-head Latent Attention

Key architecture innovation behind DeepSeek-V2 and DeepSeek-V3 for faster inference

Image created by author using ChatGPT.

This is the first article of our new series "DeepSeek-V3 Explained", where we will try to demystify DeepSeek-V3 [1, 2], the latest model open-sourced by DeepSeek.

In this series, we aim to cover two major topics:

  • Major architecture innovations in DeepSeek-V3, including MLA (Multi-head Latent Attention) [3], DeepSeekMoE [4], auxiliary-loss-free load balancing [5], and multi-token prediction training.
  • Training of DeepSeek-V3, including pre-training, finetuning and RL alignment phases.

This article mainly focuses on Multi-head Latent Attention, which was first proposed in the development of DeepSeek-V2 and then used in DeepSeek-V3 as well.

Table of contents:

  • Background: we start from the standard MHA, explain why we need Key-Value cache at inference stage, how MQA and GQA try to optimize it, and how RoPE works, etc.
  • Multi-head Latent Attention: An in-depth introduction to MLA, including its motivations, why decoupled RoPE is needed, and its performance.
  • References.

Background

To better understand MLA and also make this article self-contained, we will revisit several related concepts in this section before diving into the details of MLA.

MHA in Decoder-only Transformers

Note that MLA was developed to speed up inference in autoregressive text generation, so the MHA we are talking about in this context is the one used in decoder-only Transformers.

The figure below compares three Transformer architectures used for decoding, where (a) shows both the encoder and decoder proposed in the original "Attention is All You Need" paper. Its decoder part is then simplified by [7], leading to the decoder-only Transformer model shown in (b), which is later used in many generation models like GPT [8].

Nowadays, LLMs more commonly adopt the structure shown in (c) for more stable training, with normalization applied to the input rather than the output, and LayerNorm replaced by RMS Norm. This will serve as the baseline architecture we will discuss in this article.

Figure 1. Transformer architectures. (a) encoder-decoder proposed in [6]. (b) Decoder-only Transformer proposed in [7] and used in GPT [8]. (c) An optimized version of (b) with RMS Norm before attention. [3]

Within this context, MHA calculation largely follows the process in [6], as shown in the figure below:

Figure 2. Scaled dot-product attention vs. Multi-Head Attention. Image from [6].

Assume we have n_h attention heads, and the dimension of each attention head is represented as d_h, so that the concatenated dimension will be (n_h · d_h).

Given a model with l layers, if we denote the input for the t-th token in a given layer as h_t with dimension d, we need to map the dimension of h_t from d to (n_h · d_h) using the linear mapping matrices.

More formally, we have (equations from [3]):

where W^Q, W^K and W^V are the linear mapping matrices:

After such mapping, q_t, k_t and v_t will be split into n_h heads to calculate the scaled dot-product attention:

where W^O is another projection matrix that maps the dimension back from (n_h · d_h) to d:
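
The equation images from the original post did not survive here, so for reference, below is a standard-notation sketch of the MHA computation just described; treat it as my reconstruction of Eqn. (1)-(8) rather than a verbatim copy from [3].

q_t = W^{Q} h_t, \quad k_t = W^{K} h_t, \quad v_t = W^{V} h_t, \qquad W^{Q}, W^{K}, W^{V} \in \mathbb{R}^{n_h d_h \times d}

[q_{t,1}; q_{t,2}; \dots; q_{t,n_h}] = q_t, \quad [k_{t,1}; \dots; k_{t,n_h}] = k_t, \quad [v_{t,1}; \dots; v_{t,n_h}] = v_t

o_{t,i} = \sum_{j=1}^{t} \mathrm{Softmax}_j\!\left(\frac{q_{t,i}^{\top} k_{j,i}}{\sqrt{d_h}}\right) v_{j,i}

u_t = W^{O} \, [o_{t,1}; o_{t,2}; \dots; o_{t,n_h}], \qquad W^{O} \in \mathbb{R}^{d \times n_h d_h}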

Note that the process described by Eqn.(1) to (8) above is just for a single token. During inference, we need to repeat this process for each newly generated token, which involves a lot of repeated calculation. This leads to a technique called Key-Value cache.

Key-Value Cache

As suggested by its name, Key-Value cache is a technique designed to speed up the autoregressive process by caching and reusing the previous keys and values, rather than re-computing them at each decoding step.

Note that KV cache is typically used only during the inference stage, since in training we still need to process the entire input sequence in parallel.

KV cache is commonly implemented as a rolling buffer. At each decoding step, only the new query Q is computed, while the K and V stored in the cache will be reused, so that the attention will be computed using the new Q and reused K, V. Meanwhile, the new token’s K and V will also be appended to the cache for later use.
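
As a toy illustration of this caching loop (my own sketch, independent of any of the papers cited here), with a single attention head and no batching:

import torch

d_h = 64                                 # head dimension for this toy example
W_q, W_k, W_v = (torch.randn(d_h, d_h) for _ in range(3))
k_cache, v_cache = [], []                # the "KV cache": grows by one entry per decoded token

def decode_step(x_t):
    q = x_t @ W_q                        # only the new token's query is computed
    k_cache.append(x_t @ W_k)            # previous keys/values are reused as-is,
    v_cache.append(x_t @ W_v)            # only the new key/value pair is appended
    K, V = torch.stack(k_cache), torch.stack(v_cache)    # (t, d_h)
    attn = torch.softmax(q @ K.T / d_h ** 0.5, dim=-1)   # attend over all cached keys
    return attn @ V                      # attention output for the new token only

for _ in range(5):                       # pretend we decode 5 tokens
    out = decode_step(torch.randn(d_h))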

However, the speedup achieved by KV cache comes at a cost of memory, since KV cache often scales with batch size × sequence length × hidden size × number of heads, leading to a memory bottleneck when we have larger batch size or longer sequences.

That further leads to two techniques aiming at addressing this limitation: Multi-Query Attention and Grouped-Query Attention.

Multi-Query Attention (MQA) vs Grouped-Query Attention (GQA)

The figure below shows the comparison between the original MHA, Grouped-Query Attention (GQA) [10] and Multi-Query Attention (MQA) [9].

Figure 3. MHA [6], GQA [10] AND MQA [9]. Image from [10].

The basic idea of MQA is to share a single key and a single value head across all query heads, which can significantly reduce memory usage but will also impact the accuracy of attention.

GQA can be seen as an interpolating method between MHA and MQA, where a single pair of key and value heads will be shared only by a group of query heads, not all queries. But still this will lead to inferior results compared to MHA.

In the later sections, we will see how MLA manages to seek a balance between memory efficiency and modeling accuracy.

RoPE (Rotary Positional Embeddings)

One last piece of background we need to mention is RoPE [11], which encodes positional information directly into the attention mechanism by rotating the query and key vectors in multi-head attention using sinusoidal functions.

More specifically, RoPE applies a position-dependent rotation matrix to the query and key vectors at each token, and uses sine and cosine functions for its basis but applies them in a unique way to achieve rotation.

To see what makes it position-dependent, consider a toy embedding vector with only 4 elements, i.e., (x_1, x_2, x_3, x_4).

To apply RoPE, we firstly group consecutive dimensions into pairs:

  • (x_1, x_2) -> position 1
  • (x_3, x_4) -> position 2

Then, we apply a rotation matrix to rotate each pair:

Figure 4. Illustration of the rotation matrix applied to a pair of tokens. Image by author.

where θ = θ(p) = p ⋅ θ_0​, and θ_0​ is a base frequency. In our 4-d toy example, this means that (x_1, x_2) will be rotated by θ_0​, and (x_3, x_4) will be rotated by 2 ⋅ θ_0.

This is why we call this rotation matrix position-dependent: at each position (or each pair), we apply a different rotation matrix, where the rotation angle is determined by the position.
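
To make the toy example concrete, here is a small numerical sketch (my own code, not from [11]) that applies exactly the rotation described above to a 4-d vector, with the first pair rotated by θ_0 and the second by 2·θ_0:

import torch

def toy_rope(x, theta_0=0.1):
    # Rotate each consecutive pair (x1, x2), (x3, x4), ... by an angle that grows
    # with the pair index, which is what makes the operation position-dependent.
    out = x.clone()
    for p in range(x.numel() // 2):
        theta = torch.tensor((p + 1) * theta_0)   # pair 1 -> theta_0, pair 2 -> 2*theta_0
        cos, sin = torch.cos(theta), torch.sin(theta)
        i = 2 * p
        out[i]     = cos * x[i] - sin * x[i + 1]
        out[i + 1] = sin * x[i] + cos * x[i + 1]
    return out

print(toy_rope(torch.tensor([1.0, 0.0, 1.0, 0.0])))  # the two pairs end up rotated differently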

RoPE is widely used in modern LLMs due to its efficiency in encoding long sequences, but as we can see from the above formula, it applies position-dependent rotations to both Q and K, which makes it incompatible with MLA in some ways.


Multi-head Latent Attention

Finally we can move on to the MLA part. In this section we will first lay out the high-level idea of MLA, and then dive deeper into why it needs to modify RoPE. Finally, we present the detailed algorithm of MLA as well as its performance.

MLA: High-level Idea

The basic idea of MLA is to compress the attention input h_t into a low-dimensional latent vector with dimension d_c, where d_c is much lower than the original (n_h · d_h). Later, when we need to calculate attention, we can map this latent vector back to the high-dimensional space to recover the keys and values. As a result, only the latent vector needs to be stored, leading to a significant memory reduction.

This process can be more formally described with the following equations, where c^{KV}_t is the latent vector, W^{DKV} is the compressing matrix that maps h_t's dimension from (n_h · d_h) down to d_c (here D in the superscript stands for "down-projection", meaning compressing the dimension), while W^{UK} and W^{UV} are both up-projection matrices that map the shared latent vector back to the high-dimensional space.

Similarly, we can also map the queries into a latent, low-dimensional vector and then map it back to the original, high-dimensional space:
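
The corresponding equation images are also missing here; written out in the notation above, the compression and up-projection steps look roughly like this (a reconstruction on my part, following the description in [3]):

c^{KV}_t = W^{DKV} h_t, \qquad k^{C}_t = W^{UK} c^{KV}_t, \qquad v^{C}_t = W^{UV} c^{KV}_t

c^{Q}_t = W^{DQ} h_t, \qquad q^{C}_t = W^{UQ} c^{Q}_t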

Why Decoupled RoPE is Needed

As we mentioned before, RoPE is a common choice for training generation models to handle long sequences. If we directly apply the above MLA strategy, that will be incompatible with RoPE.

To see this more clearly, consider what happens when we calculate attention using Eqn. (7): when we multiply the transposed q with k, the matrices W^Q and W^{UK} will appear in the middle, and their combination is equivalent to a single mapping from d_c to d.

In the original paper [3], the authors describe this by saying that W^{UK} can be "absorbed" into W^Q; as a result, we do not need to store W^{UK} in the cache, further reducing memory usage.

However, this will not be the case once we take the rotation matrix in Figure 4 into consideration, as RoPE will apply a rotation matrix on the left of W^{UK}, and this rotation matrix will end up in between the transposed W^Q and W^{UK}.

As we have explained in the background part, this rotation matrix is position-dependent, meaning the rotation matrix for each position is different. As a result, W^{UK} can no longer be absorbed by W^Q.

To address this conflict, the authors propose what they call "decoupled RoPE": they introduce additional query vectors along with a shared key vector, and use these additional vectors in the RoPE process only, while keeping the original keys isolated from the rotation matrix.

The entire process of MLA can be summarized as below (equation numbers reused from Appendix C in [3]):

Figure 5. MLA process. Image edited by author based on equations in [3].

where

  • Eqn. (37) to (40) describe how to process query tokens.
  • Eqn. (41) and (42) describe how to process key tokens.
  • Eqn. (43) and (44) describe how to use the additional shared key for RoPE, be aware that the output of (42) is not involved in RoPE.
  • Eqn. (45) describes how to process value tokens.

During this process, only the blue variables with boxes need to be cached. This process can be illustrated more clearly with the flowchart below:

Figure 6. Flowchart of MLA. Image from [3].

Performance of MLA

The table below compares the number of elements needed in the KV cache (per token) as well as the modeling capacity of MHA, GQA, MQA and MLA, demonstrating that MLA can indeed achieve a better balance between memory efficiency and modeling capacity.

Interestingly, the modeling capacity of MLA even surpasses that of the original MHA.

Table 1 from [3].

More specifically, the table below shows the performance of MHA, GQA and MQA on 7B models, where MHA significantly outperforms both MQA and GQA.

Table 8 from [3].

The authors of [3] also conduct a comparison between MHA and MLA, and the results are summarized in the table below, where MLA achieves better results overall.

Table 9 in [3].

References

The Three Phases of Learning Machine Learning

Part One: The beginner phase

After 6+ years of experience with machine learning, both in research and industry, it’s very interesting to see how far the field has gotten over the years. I still remember sitting in seminar rooms, listening to lectures on everything ML: deep learning, reinforcement learning, random forests, neural networks, natural language processing, …

Photo by Vlad Bagacian on Unsplash

One particular NLP lecture stands out in my memory, where we discussed the rapid advancements in the field. We had covered vanilla attention mechanisms the week before and were now looking at Transformer-based approaches. The great tutor showed us graphs with the parameter counts of the models. We then looked up the then-recent advances, and it became clear: any such figure is outdated within a month. That was when my ML journey had barely started, and already the amount of innovation being published again and again was insane.

Since then, my journey proceeded in several phases. From exchange with other ML people, it appears that they have a similar experience: From a beginner to seasoned practitioner, the journey can be divided into the three following phases:

Phase I: Beginner (~1 year; this article)
Phase II: Intermediate (~1 to 3 years; upcoming)
Phase III: Advanced (~5+ years; upcoming)

The beginner phase of learning Machine Learning

This article’s focus is on the first phase in your machine learning journey: the beginner phase. To help you start in ML, you can use the checklist below to track your progress (also see section Resources).

The beginner phase is everyone’s starting stage. At this stage, your primary goal is to understand the fundamental ideas behind machine learning: how machines learn from data and the essential algorithms that power ML models. You’ll dive into topics like decision trees, k-nearest neighbors (KNN), and linear regression. These concepts serve as the building blocks for your further ML journey.

At the start of your journey, it probably makes sense to get an overview of the field of Deep Learning. Here, I can recommend MIT's free Introduction to Deep Learning course, see Resources. It is an intense lecture series showcasing all areas of (modern) deep machine learning, and it lets you see what's awaiting you over the next year(s).

Practically speaking, we can divide the beginner phase into five categories.

  1. Data Handling
  2. Classic Machine Learning
  3. Neural Networks
  4. Theory
  5. Miscellaneous Skills

Data Handling

In this category, you’ll learn how to work with small datasets that fit easily into your computer’s memory. This is good, as you do not need to rent expensive compute online. Often, these datasets are readily available within ML frameworks like PyTorch or TensorFlow. Common examples of small-sized datasets include:

  • Images: MNIST (handwritten digits)
  • Audio: ESC-50 environmental sound recordings
  • Time-series: Heartbeat timeseries categorization
  • Text: IMDB movie reviews for sentiment analysis

The advantage of these beginner datasets is that they are well-documented and easy to manipulate. Usually, the preprocessing you have to do for them is just resizing images or trimming sentences, which you can handle with just a few lines of code. This lets you focus on learning the core concepts without having to preprocess larger-than-memory datasets.

Should you ever need more compute, try Google Colab. It's a free offering by Google that lets you run Jupyter Notebooks (i.e., interactive Python code) directly in your browser. See Resources.

Classic Machine Learning

While the focus of this post is on deep machine learning – as opposed to classic machine learning – it’s useful to learn the time-proven basics as well. Among these classic ML techniques, a few stand out:

  • Regression: The goal here is to predict continuous values based on input data. The Boston House pricing dataset is an often-used dataset for regression; the task is to predict house prices based on features like square footage, location, etc.
  • Clustering: This is about grouping similar data points together. K-means clustering is a great entry point for beginners, and I recently read a paper where the authors combined advanced deep learning techniques with k-means clustering.
  • Support Vector Machines (SVM): SVMs focus on finding the best decision boundary (hyperplane, essentially a border in n-dimensional space) to separate data into distinct classes. They are often used for classification datasets.

In general, though most recent advancements are the result of deep learning techniques, classic machine learning is still relevant. You do not always need advanced ML techniques; sometimes the basics suffice.

Neural Networks

Once you’re comfortable with the basics of classic ML, you’ll begin transitioning to neural networks (NNs), the foundation of deep learning. I recommend starting with dense neural networks, which consist of multiple fully connected layers, or linear layers in PyTorch. In these layers, each input is multiplied by a weight and combined with a bias to produce an output. You can use these networks for small and mid-sized datasets alike.

After dense neural networks, you'll dive into convolutional neural networks (CNNs). This type of network is essential for tasks involving image data. CNNs use convolution operations to identify features like edges, textures, or shapes in images, regardless of their position in the image. This location-independent ability makes CNNs a very versatile approach.

Lastly, for sequential data (such as time series) I recommend playing with recurrent neural networks (RNNs). RNNs are designed to retain information from previous steps, making them ideal for tasks like language modeling or time-series forecasting. In fact, earlier machine learning research (around 2014) primarily used RNNs for machine translation, see Resources.

Theory

Together with hands-on learning, it’s essential to develop a theoretical understanding of the methods you’re using. This includes mathematical notation, matrix operations, and the underlying principles of machine learning algorithms.

For instance, understanding Σ notation (summation) helps make complex mathematical expressions more concise and readable. Though it requires some learning, expressing operations this way helps avoid ambiguities, a downside of communicating in natural language. With practice, the math behind algorithms will begin to make more sense.

At this stage, you’ll also encounter key metrics used to evaluate ML models, such as accuracy, mean squared error, precision, and recall. These metrics will help you measure your models’ performance.
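
For example, the mean squared error, one of the metrics just mentioned, is usually written with exactly this kind of notation (the standard textbook definition, not anything specific to this article):

\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2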

Miscellaneous Skills

As you are studying classic machine learning, neural networks, and the theory, you will naturally also develop practical skills in the beginner phase. These skills include programming, model management, and virtual environments.

For programming, Python is the go-to language for ML, and its large ecosystem of libraries (like NumPy, pandas, Scikit-learn, etc.) will be invaluable.

The features these libraries provide come in handy once you analyze your dataset. For example, what’s the value range? Are there some outliers? How extreme are they? Are there unusable data samples? These are some questions that you want to answer, which usually happens on the go:

You write a short code snippet – it breaks because a data sample is different – you update the code – you have learned more about your data.

Basically, any time you are honing your data analysis skills, you will be improving your coding skills as well.

Model management, on the other hand, means saving, loading, and sharing trained models. This is relevant when you are training your neural network in one script but evaluating it in a separate one.
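
In PyTorch, for example, the core of it can be as small as the sketch below; the tiny linear model and the file name are placeholders for whatever you are actually working with:

import torch
import torch.nn as nn

# A tiny placeholder model, just to make the example self-contained.
model = nn.Linear(10, 2)

# Training script: save only the learned weights (the state dict).
torch.save(model.state_dict(), "model_weights.pt")

# Evaluation script: rebuild the same architecture, then load the saved weights.
model = nn.Linear(10, 2)
model.load_state_dict(torch.load("model_weights.pt"))
model.eval()  # switch to evaluation mode before computing metrics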

Lastly, everything you do on the code level requires a Python interpreter, a program that helps you execute your Python code. As you will have several projects, I recommend getting familiar with virtual environments. These "encapsulate" all the requirements and dependencies into their own disk space, so that your projects do not interfere with each other.


After the Beginner Stage

By the end of the beginner phase, you'll have a solid understanding of core machine learning concepts, hands-on experience with basic models, and the ability to read mathematical formulations. Through your focus on (small) projects, you will have practiced a variety of ML methods. This is a good foundation for tackling more complex topics in the next phase, the intermediate phase. It covers years 1 to 3 of your ML journey and is currently in preparation. Come back here later, or follow me or Towards Data Science to get notified when it is published.


Further Reading:

How I’d Learn Machine Learning Again, After 6 Years

Learning ML or Learning About Learning ML?

A checklist to track your Machine Learning progress

Resources:

Understanding the Evolution of ChatGPT: Part 3— Insights from Codex and InstructGPT

Mastering the art of fine-tuning: Learnings for training your own LLMs.
(Image from Unsplash)

This is the third article in our GPT series, and also the most practical one: finally, we will talk about how to effectively fine-tune LLMs.

It is practical in the sense that, if you were asked to train your own LLM today, you could skip pre-training and jump straight into using an open-source LLM or SLM. However, you would very likely still need to finetune it on your own data and task, and that is where this article can help.

More specifically, we will focus on two finetuned models, Codex and InstructGPT, as they represent two types of challenges in LLM finetuning:

  • Codex needs to adapt a pretrained LLM to a different modality (code scripts), as programming languages have many characteristics not found in natural language;
  • InstructGPT aims to make the model more aligned with human preferences, which cannot be achieved automatically by traditional language modeling objectives.

As we will see later, both challenges demand creativity and carefulness at every stage of the finetuning process: how to collect high-quality data, how to modify model architectures, how to effectively initialize your model, how to determine a proper objective, and how to properly evaluate it.

Below is the outline for this article:

  • Overview: why we need finetuning and what makes it so challenging; GPT3.5 and its finetuned versions.
  • Codex: how to evaluate code generation properly, how to collect data and how to adapt the model to process programming languages.
  • InstructGPT and ChatGPT: how to evaluate alignment, why RLHF works, and how it is implemented in InstructGPT.
  • Summary: best practices in LLM finetuning.

Below are the links to our previous articles if you are interested:

  • Part 1: An In-Depth Look at GPT-1 and What Inspired It: where we cover the pre-training plus finetuning paradigm and its evolution from CV to NLP, previous pre-training efforts such as Word2Vec and GloVe, decoder-only Transformers, auto-regressive vs. auto-encoding LM, and key innovations of GPT-1.
  • Part 2: GPT-2 and GPT-3: where we cover how GPT models were scaled up from 117M to 175B, under the philosophy of exploring task-agnostic pre-training via scaling hypothesis and in-context learning.

Overview

As we explained in our second article, both GPT-2 and GPT-3 can be considered as OpenAI’s experiments to test the potential of task-agnostic pre-training. While doing so, the authors also mentioned finetuning as a promising direction for future studies, as it might help the model to further improve its performance on certain tasks.

Why is Finetuning Needed?

The reasons are three-fold.

The first reason is of course performance. Pre-trained models are more like generalists that can perform a wide range of tasks reasonably well, but still they might struggle to beat the specialists trained on a particular task. If our goal is to have such a specialized model to help us on a very specific task, then finetuning should be definitely considered.

Another reason is that, albeit generally powerful, GPT-3 models are not always reliable in following human instructions, especially when those instructions become complex. This is because, as the authors explain in the InstructGPT paper, the pre-training objective focuses mainly on language modeling, i.e. predicting the next token, and such capabilities do not automatically translate to instruction-following. Thus, some special finetuning strategies are needed.

There are also concerns regarding safety and ethics, for very similar reasons: auto-regressive language modeling alone is not sufficient to keep the model from generating harmful or biased answers. For that issue as well, finetuning enables us to better control the generation process.

Challenges in Finetuning

Broadly speaking, there are two types of challenges in finetuning LLMs: the need to adapt to a new modality, and the need to align the model with human preferences.

Take Codex as an example of the former case, where the pre-trained model needs to be applied to a different modality with its own unique characteristics: to process code scripts, the model needs to understand the basic syntax of a specific programming language, handle static and dynamic types (and even infer types), and correctly handle indentation in languages like Python.

The latter case is more tricky in some ways, as "alignment" itself is a pretty vague and controversial concept, and it has to be defined more clearly and translated into a set of measurable aspects before we can actually finetune towards that goal. Moreover, even once we have worked out a definition of alignment, achieving it is non-trivial, as there is no ready-to-use training objective directly connected to it.

On top of that, we also need to collect high-quality domain-specific training data and rethink the evaluation process, including the evaluation dataset as well as the evaluation metrics to use.

In later sections, we will see how Codex and InstructGPT handled these issues. In particular, we will highlight how they implemented every step with both creativity and carefulness, from which anyone who wants to finetune his or her own LLM can learn something.

GPT-3.5

The GPT-3.5 series typically refers to the models finetuned on top of GPT-3, including the following variants (see wiki):

  • code-davinci-002: a version of Codex.
  • text-davinci-002: a transitional model from GPT-3 to InstructGPT.
  • text-davinci-003: more similar to InstructGPT.

Overall, GPT-3.5 can be considered as finetuned GPT-3 with enhanced instruction following, better generation quality, and better steerability. It is the foundation for several other models including ChatGPT, Codex, Whisper and the text model of DALL-E 2, which demonstrates the potential of effectively finetuning LLMs on specialized tasks.

In the following sections, we will dive deeper into Codex and InstructGPT. Rather than covering every detail of their finetuning process, we will mainly focus on the aspects that best showcase the importance of creativity and carefulness.


Codex

The Codex model was released in 2021 and is specialized in Python code-writing.

Below are a few aspects that we want to highlight.

Evaluation of Code Generation

When building a model for a new task, the first thing that often comes to mind is how to evaluate that task properly.

This is important because, without an effective evaluation protocol, we cannot determine whether we are really making progress, and sometimes we cannot even identify the gaps in our current model in the first place.

In the case of Codex, the authors first realized that standard match-based metrics such as BLEU score are not suitable for measuring code generation performance.

In case you are not familiar with BLEU score: it is widely used for evaluating text generation tasks such as machine translation, by comparing overlapping phrases and calculating a precision score, while also considering text length to ensure balance.

However, the same coding problem might be solved with different data structures or algorithms. For example, generating a Fibonacci sequence can be implemented by either a top-down or bottom-up DP algorithm, resulting in very different code scripts:

# Top-down dynamic programming: memoized recursion.
def fib_top_down(n, memo={}):
    if n in memo:
        return memo[n]
    if n <= 1:
        return n
    memo[n] = fib_top_down(n-1, memo) + fib_top_down(n-2, memo)
    return memo[n]

# Bottom-up dynamic programming: iterative table filling.
def fib_bottom_up(n):
    if n <= 1:
        return n
    dp = [0] * (n + 1)
    dp[0], dp[1] = 0, 1
    for i in range(2, n + 1):
        dp[i] = dp[i-1] + dp[i-2]
    return dp[n]

In that case, if we evaluate both solutions against a given reference solution using BLEU score, it is very likely that one or even both solutions will have very low BLEU scores, even though both solutions are correct.

An alternative is to evaluate what the authors call "functional correctness", for example the pass@k metric used by Kulal et al.: for each problem we generate k code samples and test each of them, and a problem is considered solved if any sample passes the unit tests. In the end, the total fraction of problems solved is reported. However, as the authors pointed out, calculating pass@k with this definition results in high variance due to the randomness in this process, especially when k is small.

To mitigate this issue, the authors propose another way to estimate pass@k: instead of generating k samples directly, they generate n ≥ k samples per task. As more samples are generated and tested, the estimation becomes more reliable even when k is small. Then, based on how many samples are correct (say c samples pass the unit tests), an unbiased estimator can be computed as below (a short code sketch follows the bullet list):

Figure 1. Left: the optimized pass@k definition. Right: a numerically stable script to calculate pass@k. (Image from Codex paper.)

where

  • C(n, k) is the number of ways to choose k samples out of n;
  • C(n-c, k) is the number of ways to choose k samples out of the (n-c) incorrect samples;
  • Thus, C(n-c, k)/C(n, k) represents the probability that all chosen samples are incorrect;
  • Finally, 1 – C(n-c, k)/C(n, k) represents the probability that at least one sample is correct.
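Here is a minimal sketch of that estimator, following the numerically stable formulation the paper describes (the function name and the example numbers are ours):

import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: 1 - C(n-c, k) / C(n, k),
    computed as a running product to stay numerically stable."""
    if n - c < k:
        return 1.0  # fewer than k incorrect samples, so any draw of k contains a correct one
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples generated, 15 passed the unit tests, estimate pass@10
print(pass_at_k(n=200, c=15, k=10))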

To further prove that optimizing for BLEU score is not equivalent to optimizing for functional correctness, the authors also plot the BLEU score densities for correct (blue) and wrong (green) solutions for 4 random coding problems, where the distributions are clearly not separable:

Figure 2. BLEU score probability density for correct (blue) and wrong (green) solutions for 4 random problems. (Image from Codex paper.)

Beyond optimizing for the evaluation metric, the authors also built a new dataset called HumanEval, which contains 164 hand-written programming problems. As shown in the example below, each problem includes a function signature, a docstring, a body and an average of 7.7 unit tests:

Figure 3. Example problems from the HumanEval dataset. (Image from Codex paper.)

Note that, as the authors mention in the paper, it is important for these tasks to be hand-written, since otherwise the evaluation problems might overlap with those seen during training. Also, to ensure the testing process does not pose any risks from executing malicious code, the authors created a sandbox for running the generated scripts.

Training Data Collection

Moving to the training part, the first question is how to collect high-quality training data. For code generation, the good news is that we can leverage the vast amount of code repositories from GitHub, but still some data cleaning strategies are needed, as the paper mentioned:

We filtered out files which were likely auto-generated, had average line length greater than 100, had maximum line length greater than 1000, or contained a small percentage of alphanumeric characters.

Note that most of these cleaning strategies are specialized to programming languages, so we might need to come up with other ideas when cleaning our own data.
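As an illustration, here is a rough sketch of such filters. The line-length thresholds come from the quote above, while the alphanumeric cutoff and the omission of the auto-generation heuristic are our own assumptions:

def looks_low_quality(source: str,
                      max_avg_line_len: int = 100,
                      max_line_len: int = 1000,
                      min_alnum_fraction: float = 0.25) -> bool:
    """Return True if a code file should be filtered out (sketch only)."""
    lines = source.splitlines()
    if not lines:
        return True
    avg_len = sum(len(l) for l in lines) / len(lines)
    max_len = max(len(l) for l in lines)
    alnum_fraction = sum(c.isalnum() for c in source) / max(len(source), 1)
    return (avg_len > max_avg_line_len
            or max_len > max_line_len
            or alnum_fraction < min_alnum_fraction)  # threshold is an assumption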

Adaptations in Finetuning

The most important adaptation is for the tokenizer, due to the obvious reason that the distribution of words in GitHub code differs a lot from that of natural language. In the Codex paper, the authors noted that this is especially the case when encoding whitespaces, making the original GPT-3 tokenizer less effective.

To fix that issue, an additional set of tokens were added to the vocabulary, to represent whitespace runs of different lengths. As mentioned in the paper, this simple modification enables representing code with 30% fewer tokens.

So, if our model needs to handle an input corpus whose distribution differs from natural language, we might need to study that distribution and adapt the tokenizer as well.
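As a rough illustration of the idea (not Codex's actual tokenizer change), one could map whitespace runs to dedicated tokens in a pre-tokenization step; the placeholder format below is purely hypothetical:

import re

# Replace runs of 2..25 spaces with single placeholder tokens such as "<space_8>",
# so that deeply indented code consumes far fewer tokens after tokenization.
WHITESPACE_RUN = re.compile(r" {2,25}")

def compress_whitespace(code: str) -> str:
    return WHITESPACE_RUN.sub(lambda m: f"<space_{len(m.group(0))}>", code)

snippet = "def f(x):\n        return x + 1"
print(compress_whitespace(snippet))
# -> "def f(x):\n<space_8>return x + 1"
# The placeholder strings would then be added to the tokenizer's vocabulary as single tokens.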

Findings in Evaluation

Firstly, the figure below shows the pass rates of different models on the HumanEval dataset. Overall, all the Codex variants show significantly better performance compared to GPT-3, where

  • Codex (finetuned on code) solves 28% of the problems;
  • Codex-S (finetuned on standalone functions) solves 37.7%;
  • Codex-S, when generating 100 samples and selecting the one with the highest mean log-probability, solves 44.5%;
  • A Codex-S oracle, which selects the sample that passes the unit tests, solves an impressive 77.5% of the problems.
Figure 4. Codex pass rates. (Image from Codex paper.)

Plus, a scaling law similar to that of GPT-3 is also observed, suggesting better performance can be achieved with even larger models:

Figure 5. Test loss vs. number of parameters. (Image from Codex paper.)

The authors also noticed that higher temperatures are preferred for larger k, highlighting the importance of careful hyper-parameter tuning:

Figure 6. Higher temperatures are preferred for larger k. (Image from Codex paper.)

InstructGPT and ChatGPT

Evaluation of Alignment

Properly evaluating "alignment" is also challenging, as its definition is not as clear-cut as other aspects such as accuracy. In this work the authors define alignment as models being "helpful, honest, and harmless" and translate these into more measurable properties:

  • Helpful: by measuring if the model could follow instructions and even infer intentions from a few-shot prompt.
  • Honest: by measuring truthfulness, or in the author’s words, "if the model’s statements about the world are true". More specifically, they propose to measure it by hallucination rate on the TruthfulQA dataset.
  • Harmless: by measuring "if an output is inappropriate in the context of a customer assistant, denigrates a protected class, or contains sexual or violent content", and benchmarking on datasets designed to measure bias and toxicity.

On top of that, to make sure the finetuning process does not cause severe regressions in pre-training performance, the evaluation process also needs to reflect quality on both the pre-training and finetuning objectives. For that reason, InstructGPT was evaluated on two separate datasets:

  • Evaluations on API distribution: this is mainly for evaluating the finetuning quality, by asking human labelers to rate which output is preferred;
  • Evaluations on public NLP datasets: this evaluates both the pre-training and finetuning quality, including traditional NLP datasets as well as datasets for evaluating model safety like truthfulness, toxicity and bias.

Next, we will briefly explain how RLHF works and how it is implemented in InstructGPT.

RLHF (Reinforcement Learning from Human Feedback)

The figure below shows the 5 elements in a typical Reinforcement Learning scenario:

Figure 7. Five elements in RL: Agent, Environment, Reward, State and Action. (Image from wiki.)

Now imagine you are teaching your puppy to sit, where you can find all the 5 elements:

  • Agent: Your puppy learning this new command "sit".
  • Environment: Everything around your puppy.
  • State: The situation your puppy is in (whether it is sitting or not).
  • Reward: A treat that you give your puppy when it follows your command;
  • Action: What your puppy could do, like sitting, jumping or barking.

Reinforcement Learning works like this: in the beginning your dog (agent) doesn't understand what "sit" means, but it will try different things like running, sitting or even barking (actions) in your house (environment). Every time it sits, it gets a treat (reward). Over time your puppy learns that sitting earns a treat, and it appears to finally understand "sit".

Training a model with RL follows a very similar trial-and-error approach. The key to RL is having a well-designed reward. This reward must be closely aligned with the goal; otherwise the agent will not be able to learn the desired behaviors. Meanwhile, producing such a reward should be as easy and quick as possible, since if it is too slow or too complicated to calculate the reward, the RL process will also become extremely slow, making it less useful in practical tasks.

For example, in a game, every action the agent takes will automatically get a score from the environment, and this score is directly connected to your agent’s performance in playing this game.

However, in many real-world applications, there is no ready-to-use reward like a score in a game. Instead researchers have to take great efforts in defining a proper reward function. Moreover, some desired behaviors are very difficult to translate into reward functions – for example, how could you define a reward function to guide the agent to answer questions more politely?

This leads to RLHF: Reinforcement Learning from Human Feedback.

Again in the puppy training example, imagine your puppy finally learns to sit, but sometimes it also barks while sitting, or it will jump onto the couch first instead of sitting quietly on the floor.

What can you do in that case?

With RLHF, you don’t just give your puppy a treat every time it sits. Instead, you give treats by comparing its behaviors. For example, if the puppy sits quietly on the floor, it gets a bigger reward than if it sits while barking or after jumping onto the couch. This way, your puppy learns that sitting quietly on the floor is better, even though you didn’t explicitly explain what "quiet" means.

As mentioned before, having an easy and fast reward is key to RL, which makes it unrealistic to involve a human in the training loop to provide direct feedback. To overcome this, we can collect some human feedback first, and then use this feedback to learn a reward function that mimics human preferences when comparing two actions.

In summary, RLHF typically involves three stages:

  • Collect human feedback: sample model outputs and ask human judges to compare which is better.
  • Learn a reward model that mimics the human judges' preferences.
  • Train a better policy using the learnt reward model in the RL process.

In case you are not familiar with RL terminology: a policy refers to the agent’s strategy to choose actions based on the state of the environment.

Next we will cover how this RLHF approach is implemented in finetuning InstructGPT.

Implementation of RLHF in InstructGPT

InstructGPT and ChatGPT were trained using the same model (see this blog), with RLHF being the key element in finetuning.

The training process largely follows the steps introduced in the previous section, with special care given to data quality and implementation details, which, in my opinion, are equally important in making InstructGPT such a success.

Now let me break it down.

Figure 8. An illustration of the RLHF steps in training InstructGPT/ChatGPT. (Image from InstructGPT paper.)

Step 1: Collect demonstration data and train a supervised policy

In this step, human labelers were asked to provide high-quality demonstrations of the desired behavior for each prompt.

Prompt dataset: To begin with, you need to have a prompt dataset from which you can sample individual prompts, and ideally that prompt dataset should be both useful and diverse.

To do that, the authors took an iterative approach: in the very beginning, labelers were asked to manually write some seed prompts, and these data were used to train a model via supervised learning. This model was later deployed to the OpenAI API to collect text prompts from users, which later formed the prompt dataset.

The table below shows the distribution of this prompt dataset, as diversity is very important in making sure the model will be trained on various tasks:

Human data collection: human data are needed in three components throughout the RLHF process, including writing demonstrations in Step 1, providing comparison data in Step 2, and conducting final evaluations after finetuning.

In the paper the authors mentioned many practices to ensure data quality:

  • Firstly, high-quality data come from good labelers. To ensure their ability in data labeling, a screening test was conducted to select labelers who were "sensitive to the preferences of different demographic groups, and were good at identifying outputs that were potentially harmful".
  • Secondly, to ensure consistency across labelers, an onboarding process was set up to train all of them, and detailed instructions for each task were provided. The authors also mention that they set up a shared chat room to answer questions from labelers.
  • Finally, to see how well the model generalizes to the preferences of different labelers, a separate group of labelers who didn't go through the screening test was hired for evaluation.

Based on these human demonstration data, a pretrained GPT-3 model was finetuned using supervised learning in the first step. This model is referred to as the baseline policy, which will be used to produce comparison outputs in Step 2 and initialize the PPO algorithm in Step 3.

Step 2: Collect comparison data and train a reward model

Comparison data collection: Once the baseline policy is available, it is used to generate outputs for some sampled prompts, and these outputs are reviewed and ranked by human labelers from best to worst. To speed up this ranking process, a set of K outputs is shown to the labelers simultaneously, where K ranges from 4 to 9.

Reward model training: The 6B-parameter reward model was initialized from the supervised baseline policy by removing the final unembedding layer, and then trained on the comparison data to assign a scalar score to each input-response pair. In particular, the authors mention that treating all comparisons from each prompt as a single batch, rather than shuffling the individual comparisons, helps alleviate overfitting. Note that we need to strike a balance when deciding the size of this reward model: it needs to be sufficiently large to accurately mimic human preferences, but it cannot be too large since it must support fast inference during the RL process.
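For intuition, here is a minimal sketch of the pairwise ranking loss commonly used for such reward models (the function name and the dummy scores are ours, not the paper's exact implementation):

import torch
import torch.nn.functional as F

def reward_ranking_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """Pairwise ranking loss: -log sigmoid(r(x, y_chosen) - r(x, y_rejected)),
    averaged over all comparison pairs in the batch."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Example with dummy scalar scores from a reward model for 4 comparison pairs
chosen = torch.tensor([1.2, 0.3, 2.1, -0.5])
rejected = torch.tensor([0.7, -0.1, 1.9, -1.0])
print(reward_ranking_loss(chosen, rejected))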

Step 3: Optimize a policy using the reward model with PPO

At this point we have got everything ready to finetune the model with RLHF: the initial policy and the reward model. The training in this step follows a typical RL process: in each episode, a new prompt is sampled (the "state") and new outputs will be generated (the model’s "action") by the current policy (the "agent"), and then the reward model will calculate a reward for the output ("reward"), according to which the policy will be updated using PPO.

Don’t worry if you are not familiar with PPO – it is simply a method designed to help the agent update its policy gradually and stably.

A few things to mention here:

  • A per-token KL penalty against the supervised baseline is added to the reward to mitigate over-optimization of the reward model (see the sketch after this list).
  • The authors further experimented with mixing the pretraining gradients into the PPO gradients to fix performance regressions on public NLP datasets (such regressions are often called "the alignment tax"); this variant is referred to as "PPO-ptx". In the paper, InstructGPT actually refers to the PPO-ptx models.
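Below is a rough sketch of how such a KL-penalized reward can be computed; the coefficient beta and the exact shaping are illustrative assumptions, not the paper's values:

import torch

def penalized_reward(reward_model_score: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     ref_logprobs: torch.Tensor,
                     beta: float = 0.02) -> torch.Tensor:
    """Sketch of a KL-penalized reward for RLHF-style finetuning.
    policy_logprobs / ref_logprobs: log-probs of the sampled tokens under the
    current policy and the frozen reference (SFT) model, shape (seq_len,)."""
    per_token_kl = policy_logprobs - ref_logprobs      # per-token log-ratio as a KL estimate
    return reward_model_score - beta * per_token_kl.sum()

# Example with dummy values for a 5-token response
r = penalized_reward(torch.tensor(1.5), torch.randn(5), torch.randn(5))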

Note that Step 2 and Step 3 can be iterated continuously:

  • With an updated policy (from Step 3), we can generate new outputs and collect more comparison data, which can be used to train a new reward model by repeating Step 2;
  • With a new reward model (from Step 2), we can get a better policy by repeating Step 3.

Findings in Evaluation

Due to space limitation we will not go through all the evaluation results in this article, instead we will just highlight several new findings.

As perhaps the most important finding, the results show that RLHF can indeed improve alignment. The figure below shows the win rate against the supervised 175B GPT-3 model, evaluated by human judges. Both PPO and PPO-ptx significantly outperform the GPT-3 baselines, and even the 1.3B PPO models beat the 175B GPT-3. This result clearly demonstrates the effectiveness of RLHF.

Figure 9. Human evaluation results. (Image from InstructGPT paper.)

The authors also found that InstructGPT shows improvements in truthfulness (hallucination rate reduced from 41% to 21%) and slight improvements in toxicity (25% fewer toxic outputs), but no significant improvement in reducing bias.

Another finding is that PPO-ptx can minimize performance regressions on public NLP datasets, as shown in the figure below.

Figure 10. Few-shot performance on public NLP datasets. (Image from InstructGPT paper.)

Summary

Training an LLM usually involves multiple stages: pre-training, supervised finetuning, and alignment with RLHF. For the task at hand, we can usually start from an open-source, pre-trained LLM and finetune it on domain-specific data.

A few questions to ask while finetuning your own LLMs (though this is not meant to be an exhaustive list):

  • Do we have a clear definition on the model’s desired behaviors? How can we evaluate such behaviors? If no available metrics to use, can we create one by ourselves?
  • Do we have available training data? If not, how can we collect such data by ourselves? If human labelers are needed, how to ensure their labeling quality?
  • What kind of cleaning or pre-processing is needed? Are there any heuristics we can use to check data quality?
  • Does our data cover a wide range of scenarios?
  • Do we need to modify our tokenizers? Do we need to modify the model structures? Do we need to add auxiliary finetuning objectives?
  • Does finetuning lead to regression on pre-training performance? Can we seek a balance?
  • Does finetuning lead to some unexpected negative behaviors? How can we mitigate that?
  • How to prevent overfitting in the finetuning process?
  • What hyper-parameters can we tune during finetuning or evaluation? Are there any heuristics we can leverage?

At the end of the day, exploring a new task is always both challenging and exciting, and I hope the learnings from this article can help make it less challenging, more exciting, and ultimately more enjoyable 🙂

Thanks for reading!

The post Understanding the Evolution of ChatGPT: Part 3- Insights from Codex and InstructGPT appeared first on Towards Data Science.

]]>
Influential Time-Series Forecasting Papers of 2023-2024: Part 1 https://towardsdatascience.com/influential-time-series-forecasting-papers-of-2023-2024-part-1-1b3d2e10a5b3/ Fri, 17 Jan 2025 12:02:18 +0000 https://towardsdatascience.com/influential-time-series-forecasting-papers-of-2023-2024-part-1-1b3d2e10a5b3/ Exploring the latest advancements in time series

The post Influential Time-Series Forecasting Papers of 2023-2024: Part 1 appeared first on Towards Data Science.

]]>
Influential Time-Series Forecasting Papers of 2023–2024: Part 1
Created with DALL·E 3

Let’s kick off 2025 with a roundup of notable time-series forecasting papers.

I’ve also included some key papers from 2023. 2024 was the year of foundation forecasting models – but I’ll cover those in a separate article.

The papers I’ll cover are:

  1. Deep Learning-Based Forecasting: A Case Study From the Online Fashion Industry
  2. Forecasting Reconciliation: A Review
  3. TSMixer: An All-MLP Architecture for Time Series Forecasting
  4. CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting
  5. Long-Range Transformers for Dynamic Spatiotemporal Forecasting

Let’s get started!

✅ I’ve launched AI Horizon Forecast, a newsletter focusing on time-series and innovative AI research. Subscribe here to broaden your horizons!

Deep Learning-Based Forecasting: A Case Study From the Online Fashion Industry

This paper is a gem.

It made headlines initially because Zalando, a leading online fashion retailer, used a custom Transformer model to outperform SOTA forecasting models – including LGBM.

But it’s more than that – this paper offers depth and is authored by brilliant researchers in the time series/ML ecosystem.

Paper Insights Overview

The paper describes Zalando’s in-depth retail forecasting pipeline and offers valuable insights:

  • Challenges unique to online retail fashion forecasting.
  • How to handle sparsity, cold starts, and short history (Figure 1).
  • Which external covariates were used, and how the Transformer was built to leverage them.
  • How elegant tricks, with interesting explanations, improved the vanilla Transformer and tailored it to time-series forecasting.
  • Extensive details on the custom loss function, evaluation metrics, hyperparameters, and training configurations – rarely disclosed in many papers.
Figure 1: Three edge cases in Zalando’s dataset: cold start, short history, and sparsity (Source [1])

The major contributions/key points of this paper are:

  • Covariate Handling: Using discount as a dynamic covariate.
  • Causal Relationship: Enforcing a monotonic relationship between discount and demand via a piecewise linear function parameterized by feedforward networks.
  • Two forecasting modes: Training and predicting short- and long-term demand with a single global model.
  • Scaling laws: Demonstrating the first sparks of scaling laws in Transformer forecasting models.

First, the authors explain how they configured the forecasting problem by translating the sales observations into demand forecasting. Demand planning is the most common case in retail forecasting.

Also, the authors cleverly impute demand when it cannot be observed (e.g., during stock shortages) instead of marking it as missing.

Note: Demand Forecasting is the first and most important step in a supply forecasting pipeline

Demand represents what customers want to buy, while sales are restricted by stock availability. Forecasting demand is crucial, as it reflects true customer needs, unlike sales, which are supply-dependent.

If sales drop to zero due to stock shortages, future forecasts based on sales will also reflect zero. This can mislead automated supply planning systems, leading to no new orders.

To ensure an efficient supply chain, always forecast demand instead of sales. Accurate demand forecasts prevent disruptions and maintain stock to meet customer needs.

Zalando’s Forecasting Pipeline

The paper categorizes features into 4 types:

  • static-global: time-independent & market-independent
  • dynamic-global: time-dependent & market-independent
  • dynamic-international: time-dependent & market-specific
  • static-international: time-independent & market-specific
Table 1: Covariate types in Zalando’s forecasting pipeline (Source [1])

While all covariates enhance performance, discount is the most influential. Also, data has a weekly frequency.

The Zalando team trained a global Transformer model with:

  • Input: a single tensor of dimension ℝ^(t×u×v), where t is the time dimension, u the batch size and v the dimension of the covariate vector.
  • Output: a near-future horizon of 5 weeks and a far-future horizon of 5 to 20 weeks.

The authors input details for different products, markets, and discount levels – and the output of the forecasting engine is an ℝ^(a×c×t×d) tensor that spans t = 26 future weeks, up to a = 1 × 10⁶ products, c = 14 countries and d = 15 discount levels.

Figure 2 describes the top-level view of the Transformer model:

Figure 2: Top-level view of the custom Transformer architecture (Source [1])

The model uses an encoder-decoder structure with time-series-specific adaptations:

  • Encoder: Processes past observed covariates.
  • Decoder: Handles future known inputs, generating short and long-term forecasts.
  • Non-autoregressive Decoder: Produces forecasts in a multi-step manner.
  • Positional Embeddings: They are not added to tokens as in the traditional Transformer – but as an extra covariate.

The key innovation was natively embedding the idea of a monotonic increase in demand with higher discounts into the model.

This makes sense since the 2 values are positively correlated – but if this relationship is not imposed in the model, it won’t necessarily occur all the time.

Future demand for a given discount is modeled via a piecewise-linear, monotonic demand response function. This function is parameterized by 2 FFNs in the Monotonic Demand Layer, and a softplus transformation ensures the demand function is monotonically increasing w.r.t. discount:

Figure 3: Softplus activation function (Source)
Figure 4: Demand as a piecewise linear monotonic demand response function (Source [1])
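To make this concrete, here is a minimal sketch of such a monotonic, piecewise-linear demand response layer. This is our own illustrative implementation, not Zalando's exact architecture; the number of segments, layer names and breakpoints are assumptions:

import torch
import torch.nn as nn
import torch.nn.functional as F

class MonotonicDemandLayer(nn.Module):
    # Demand as a piecewise-linear function of discount, forced to be
    # non-decreasing by making every segment slope positive via softplus.
    def __init__(self, hidden_dim: int, num_segments: int = 8):
        super().__init__()
        self.base = nn.Linear(hidden_dim, 1)               # baseline demand at zero discount
        self.slopes = nn.Linear(hidden_dim, num_segments)  # raw (unconstrained) segment slopes
        # fixed breakpoints of the discount axis in [0, 1)
        self.register_buffer("knots", torch.linspace(0.0, 1.0, num_segments + 1)[:-1])

    def forward(self, h: torch.Tensor, discount: torch.Tensor) -> torch.Tensor:
        # h: (batch, hidden_dim) decoder features; discount: (batch, 1) in [0, 1]
        slopes = F.softplus(self.slopes(h))                         # positive slopes -> monotonic
        segment_part = torch.clamp(discount - self.knots, min=0.0)  # piecewise-linear basis
        return self.base(h) + (slopes * segment_part).sum(dim=-1, keepdim=True)

layer = MonotonicDemandLayer(hidden_dim=64)
demand = layer(torch.randn(32, 64), torch.rand(32, 1))  # (32, 1), non-decreasing in discount

Because softplus keeps every segment slope positive, increasing the discount can never decrease the predicted demand, which is exactly the constraint described above.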

Finally, an important aspect of this paper was the first spark of scaling laws in forecasting. It was the first time a paper showed on a private, large, and diverse dataset that a Transformer-based forecasting model leveraged scale to deliver better performance:

Figure 5: Log-log plot of Zalando’s Transformer forecasting model vs a simpler model, in terms of performance against training size (Source [1])

This work established a foundation for large-scale, time-series forecasting models.

In general, it’s a well-written paper with interesting details that we didn’t cover here – but we’ll discuss them in a future article. Stay tuned!

Forecasting Reconciliation: A Review

Hierarchical forecasting and reconciliation are among the most active areas in time-series research.

It all started with a research team at Monash University (which included the distinguished professor Rob Hyndman) publishing the iconic paper [Hyndman et al., 2011]; of course, work in this area had started even earlier.

Hierarchical forecasting reached the broader machine-learning community through the M5 forecasting competition.

The paper [2], written by researchers from the same school of thought, is the best resource for tracking the latest forecasting reconciliation research.

Preliminaries on Forecasting Reconciliation

Consider a hierarchical time series with multiple levels, such as product and store. Your task is to forecast for all levels.

Figure 6: An overview of how the M5 time series levels are organized (Source)
  • Naive approach: We separately forecast sales for each product, store, and the total. However, the top-level forecasts won’t align with the sum of the lower levels, making them incoherent.
  • Bottom-up approach: Here, forecasts start at the bottom level and aggregate upwards. This avoids information loss but isn’t optimal. Lower-level series are often noisy or sparse, causing errors to accumulate as you aggregate.

Other approaches, like top-down and middle-out, have similar challenges.

The best way is to establish a mathematical reconciliation method:

Forecasting reconciliation is not a time-series model, but a process that ensures consistency across different levels of forecasts. Given base forecasts across all levels, this process optimally makes base forecasts coherent.

✅ Note: Some models generate coherent forecasts directly (see [2]).

Why Use Forecasting Reconciliation?

Tutorials often present forecasting reconciliation as a way to ensure forecasts make sense, avoiding questions from a manager like why the sum of product A and B sales doesn’t match the total estimate.

Ok, this is also true. But that’s not all.

Reconciliation often improves forecast accuracy in hierarchical datasets. Here’s why:

  • You can use the best model for each level.
  • For example, models at different levels can use distinct loss functions (e.g., Tweedie loss for non-negative, right-skewed lower-level data; Huber loss for outlier-prone top levels).

Once you’ve generated the base optimal forecasts at each level, you can select an appropriate reconciliation method from [2]. And just like that – boom! Your forecasts will not only be coherent but also likely more accurate overall!

Some Notation

Let’s break down hierarchical forecasting.

Suppose we have 3 stores, each selling 2 products. This setup gives us 10 time series in total, with 6 at the bottom level.

For instance, y_{A,t} represents the sales of store A at time t and y_{AA,t} represents the sales of product A in store A. The top-level y_t aggregates total sales.

Figure 7: A 2-level hierarchical tree diagram. The bottom level contains 6 time series, and we have 10 time series in total (Source [2])

The equation describing this hierarchy can be written compactly as:

y_t = S b_t

where the vector y_t stacks the observations at all levels, S is the summing matrix that encodes the aggregation structure, and b_t is the vector of bottom-level observations (sales) at time t.

Let’s zoom into the values of the above equation. For our example (1 total, 3 store-level and 6 bottom-level series), the summing matrix is:

S = [ 1 1 1 1 1 1
      1 1 0 0 0 0
      0 0 1 1 0 0
      0 0 0 0 1 1
      1 0 0 0 0 0
      0 1 0 0 0 0
      0 0 1 0 0 0
      0 0 0 1 0 0
      0 0 0 0 1 0
      0 0 0 0 0 1 ]

Hence, the summing matrix S determines the hierarchical structure: the first row aggregates everything into the total, the next three rows aggregate each store’s two products, and the identity block at the bottom passes the bottom-level series through unchanged.

Bottom-Up Approach

Reconciliation aims to adjust incoherent base forecasts to make them coherent and as accurate as possible.

We achieve this using a projection (or reconciliation) matrix G:

y_tilde_h = S G yhat_h

where:

  • y_tilde: reconciled forecasts at all levels
  • S: summing matrix
  • G: projection matrix
  • yhat: base forecasts at all levels
  • h: horizon

and in matrix format, the bottom-up projection matrix is simply G = [ 0(6×4) | I(6) ], i.e. a 6×4 block of zeros followed by a 6×6 identity matrix.

Notice that the first 4 columns of G (corresponding to the total and the 3 store-level series) zero out the aggregate-level base forecasts, while the identity block picks out only the bottom-level base forecasts.

Advanced Reconciliation Methods

The bottom-up approach is straightforward – it aggregates forecasts to ensure coherence.

However, it’s not a native reconciliation method since forecasts are made coherent simply by aggregation.

That’s why the bottom-up approach has a simple projection matrix G, but in practice, we can devise a reconciliation method by solving for the matrix G.

The goal is to find the optimal G matrix according to equation 6 that gives the most accurate reconciled forecasts.

The paper starts with the simple OLS equation for estimating G:

G = (S'S)^-1 S'

It then continues with the milestone paper Wickramasuriya et al. (2019), which introduces the MinTrace (MinT) method that minimizes the sum of variances of the reconciled forecast errors.

The matrix G is then computed as:

G = (S' W_h^(-1) S)^(-1) S' W_h^(-1)

where W_h is the variance-covariance matrix of the base forecast errors at horizon h.
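To make these formulas concrete, here is a minimal NumPy sketch of OLS- and MinT-style reconciliation for the 3-store example above. The identity W is purely illustrative; in practice it is estimated from the base-forecast residuals:

import numpy as np

# S maps the 6 bottom-level series to all 10 series (total, 3 stores, 6 products)
S = np.vstack([
    np.ones((1, 6)),                      # total
    np.kron(np.eye(3), np.ones((1, 2))),  # 3 store-level aggregates
    np.eye(6),                            # 6 bottom-level series
])

y_hat = np.random.rand(10)  # incoherent base forecasts at all levels (placeholder values)

# OLS reconciliation: G = (S'S)^-1 S'
G_ols = np.linalg.inv(S.T @ S) @ S.T
y_tilde_ols = S @ G_ols @ y_hat          # coherent forecasts

# MinT-style reconciliation: plug in an estimate W of the base-forecast error covariance
W = np.eye(10)                           # identity only for illustration
W_inv = np.linalg.inv(W)
G_mint = np.linalg.inv(S.T @ W_inv @ S) @ S.T @ W_inv
y_tilde_mint = S @ G_mint @ y_hat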

We won’t go into further details here, feel free to read the original paper[2].

Key Challenges and Tools

The authors discuss additional reconciliation approaches and how these can be enhanced with probabilistic forecasting or accelerated using sparse methods. However, reconciliation methods can be memory-intensive, especially with datasets containing hundreds of time series.

They also cover domain-specific applications of hierarchical forecasting and review software tools and libraries for implementation.

The paper is detailed and well-written but assumes some familiarity with the field. For beginners, I recommend Rob Hyndman’s and George Athanasopoulos’s Forecasting: Principles and Practice book which has a chapter dedicated to hierarchical forecasting.

TSMixer: An All-MLP Architecture for Time Series Forecasting

Boosted Trees excels at forecasting tabular-like time series data with multiple covariates.

However, deep learning models have struggled to leverage their architectures effectively on complex datasets due to overfitting issues.

Google researchers addressed this by developing TSMixer, a lightweight MLP-based model, and its enhanced variant, TSMixer-Ext, which accommodates exogenous and future-known variables. TSMixer-Ext delivered remarkable performance on Walmart’s M5 dataset, surpassing established SOTA models.

TSMixer stands out for its double-mixing operation, utilizing cross-variate information across both the time domain (time-mixing layer) and feature domain (feature-mixing layer).

Figure 8: Top-level view of TSMixer, featuring Time-Mixing and Feature-Mixing MLPs arranged in N blocks. (Source [3])

We covered TSMixer and TSMixer-Ext in detail in the article here – check it out for more insights!

Find a hands-on tutorial on TSMixer in the AI Projects folder (Project 8)

Figure 9: TSMixer’s rolling-forward forecasts on the ETTm2 dataset with prediction length = 48 (Image by author, Source)

CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting (ICLR 2024)

The release of the DLinear and TSMixer papers exposed weaknesses in Transformer-based forecasting models.

In multivariate setups (where models learn cross-variate dependencies across all time series and predict them jointly), Transformers often overfit.

The quadratic self-attention mechanism tended to capture noise, limiting performance – especially on smaller or toy datasets. As a result, newer attention-based approaches shifted to univariate training, yielding impressive results.

CARD revisits the multivariate scenario, aiming for SOTA results by redesigning attention mechanisms for time series.

Dual Channel Attention

Earlier, we saw how TSMixer uses a double MLP mixing operation across time and feature dimensions. Similarly, CARD applies attention across both token (time) and channel (feature) dimensions:

Figure 10: Top-level view of CARD’s architecture (Source [4])

Figure 10 is very similar to TSMixer’s architecture in Figure 8. The authors sparsify attention further with summary tokens, exponential smoothing, and extra projections. These additions reduce complexity (Figure 11), helping the model distill meaningful information.

Figure 11: Operations inside CARD’s attention block (Source [4])

Note: CARD partitions signals into patches, meaning tokenization occurs at the patch level.

Patching benefits attention-based models because point-wise tokenization lacks semantic richness, while patches better capture a sequence’s characteristics.

Token Blend Module

In Transformer forecasting, each attention head focuses on specific aspects of the data or feature patterns in the input sequence. For example:

  • One head might focus on short-term dependencies (e.g., relationships between nearby time steps).
  • Another might capture periodic behavior (e.g., weekly or seasonal cycles).

Thus, within a single attention head:

  • Tokens encode related information about the input sequence, filtered through the head’s particular "lens" or learned feature pattern.
  • Tokens adjacent to each other correspond to neighboring time steps in the sequence and share temporal proximity.

In vanilla attention, tokens merge across heads. The proposed token blend module (Figure 12) modifies this by merging adjacent tokens within the same head after multi-head attention, creating the tokens for the next stage:

Figure 12: Example of the token blend block in CARD (Source [4])

Since each attention head specializes in specific features (e.g., trends or periodic patterns), merging tokens within the same head retains this focus while expanding the temporal scope.

Signal Decay-Based Loss Function

The authors note that errors are higher at the far future end of the horizon. Near-future predictions contribute more to generalization improvements than far-future ones.

To address this, they use a decay-based loss function – a MAE-based variant that weights near-future predictions more:

where:

  • a_t(A) are the observations at time t, given historical information A
  • ahat_t are the predictions at time t, given historical information A
  • L is the sequence length

Hence, this loss emphasizes near-future accuracy more.
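As an illustration, here is a minimal sketch of a horizon-decay-weighted MAE; the exact weighting scheme used in CARD may differ, and the 1/sqrt(t) decay here is purely an assumption for demonstration:

import torch

def decay_weighted_mae(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """MAE where near-future horizon steps are weighted more heavily.
    pred, target: tensors of shape (batch, horizon, num_series)."""
    horizon = pred.shape[1]
    t = torch.arange(1, horizon + 1, dtype=pred.dtype, device=pred.device)
    weights = t.rsqrt()                      # weight ~ 1/sqrt(t), decaying with horizon
    weights = weights / weights.sum()        # normalize so the weights sum to 1
    abs_err = (pred - target).abs().mean(dim=(0, 2))  # MAE per horizon step
    return (weights * abs_err).sum()

loss = decay_weighted_mae(torch.randn(8, 96, 7), torch.randn(8, 96, 7))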

In general, CARD is an impressive model. Unfortunately, it hasn’t been integrated into any known forecasting library, but the authors have open-sourced it. I’ll definitely create a tutorial on this model in future articles.

Spacetimeformer

The paper "Long-Range Transformers for Dynamic Spatiotemporal Forecasting[5]", introduced a new model called Spacetimeformer.

What makes Spacetimeformer advantageous?

There are 3 broad categories of models here:

  • Temporal Models: Traditional attention-based time series models represent multiple variables per timestep in a single token, overlooking spatial relationships between features (Figure 13b).
  • Graph Models: Graph attention models manually encode feature relationships using static, hardcoded graphs, which cannot adapt over time (Figure 13c).
  • Spatiotemporal Models: Spacetimeformer integrates temporal and spatial attention, representing each feature’s value at a specific timestep as an individual token. This approach captures complex interactions across space, time, and value (Figure 13d).
Figure 13: Attention in Multivariate Forecasting. (a) A sequence with 3 variables, 2 context points, and 2 target points to predict. (b) Temporal attention, where each token encompasses all three variables, with darker blue lines indicating stronger attention between tokens. (c) Temporal attention with spatial interactions, where spatial relationships within each timestep are modeled using predefined spatial graphs (black lines). (d) Spatiotemporal attention, where each variable at each timestep is treated as an individual token.

Previous Transformer-based models used one input token per timestep so that the embedding of the token at time 𝑡 represents 𝑁 distinct variables at that moment in time.

This is problematic because:

  • Lagged Information: In practice, dependencies can be lagged, such as y depending on x_{i-1} rather than x_i. Time-based embeddings may overlook these lags.
  • Limited Receptive Field: Unlike NLP, where vocabularies are discrete, time-series data are continuous. Tokenizing by timepoints restricts receptive fields, limiting the capture of long-term, cross-temporal correlations.

Spacetimeformer’s Solution

Spacetimeformer addresses these limitations with the architecture shown in Figure 13d. For N variables and a sequence length of L, we have:

Multivariate inputs of shape (L, N) are flattened into sequences of shape (L × N, 1), where each token represents a single variable's value at a specific timestep. This transformation enables the model to jointly learn attention across both space and time, creating the "spatiotemporal attention" mechanism illustrated in Figure 13d.
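A minimal sketch of this flattening step (shapes and variable names are illustrative, not the official implementation):

import torch

# x: (batch, L, N) -> tokens: (batch, L * N, 1), one token per (timestep, variable) pair
batch, L, N = 32, 24, 5
x = torch.randn(batch, L, N)

values = x.reshape(batch, L * N, 1)                 # value carried by each token
time_index = torch.arange(L).repeat_interleave(N)   # which timestep each token belongs to
variable_index = torch.arange(N).repeat(L)          # which variable each token belongs to
# time_index and variable_index would feed the time and variable embeddings, respectively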

The embedding differences between a temporal and a SpatioTemporal model are shown in Figure 14:

Figure 14: The Spatiotemporal model flattens the N variables into a single sequence, and uses different types of embeddings to differentiate them (Source [5])

Figure 14 shows in detail the embedding types used by SpaceTimeFormer:

  • Given Embeddings: Multivariate inputs include time information, with missing values ("?") set to zero for predictions. These embeddings indicate whether values are provided as context or require prediction.
  • Time Embeddings: Time sequences are passed through a Time2Vec layer to create time embeddings that capture periodic patterns.
  • Variable Embeddings: Each time series covariate is mapped to a spatial representation using a lookup-table embedding.
  • Value & Time Embeddings: Time2Vec embeddings and variable values are projected using a feed-forward layer.
  • Position Embeddings: Typical learned position embeddings used in other models like BERT (not shown in Figure 15)
Figure 15: The different types of embeddings in Spacetimeformer and how they are handled (Source [5])

All embeddings are summed to model relationships across temporal and variable spaces, but this increases input sequence length.

Given self-attention’s quadratic complexity, scaling becomes challenging for large N values. To address this, the following optimizations are applied:

  • Performer’s FAVOR+ Attention: A linear approximation of attention using a random kernel method.
  • Global and Local Attention: Local attention allows each token to focus on its spatial variable’s timesteps, while global attention lets tokens attend across the entire spatiotemporal sequence. This reduces computations without compromising temporal dynamics.

Figure 16 illustrates SpaceTimeFormer’s encoder-decoder architecture:

Figure 16: Spacetimeformer top-level architecture (Source [5])

Benchmark results show promising performance, with the model outperforming other notable implementations. Newer models like iTransformer and the CARD model we saw earlier have since adopted this embedding mechanism across the time dimension.

Finally, attention weights can be extracted and visualized to reveal spatiotemporal relationships among variables:

Figure 17: Forecasting the temperature at 3 weather stations in Texas (lower left, blue diamonds) and 3 stations in New York (upper right, purple triangles) (Source [5])

You can find the official repo of Spacetimeformer here. This model has also been applied to other cases, such as financial forecasting.

References

[1] Kunz et al., Deep Learning based Forecasting: A Case Study from the Online Fashion Industry

[2] Athanasopoulos et al., Forecast Reconciliation: A Review

[3] Chen et al., TSMixer: An All-MLP Architecture for Time Series Forecasting

[4] Xue et al., CARD: Channel Aligned Robust Blend Transformer for Time Series Forecasting (ICLR 2024)

[5] Grigsby et al., Long-Range Transformers for Dynamic Spatiotemporal Forecasting

Thank you for reading!

The post Influential Time-Series Forecasting Papers of 2023-2024: Part 1 appeared first on Towards Data Science.

]]>
Why Data Scientists Can’t Afford Too Many Dimensions and What They Can Do About It https://towardsdatascience.com/why-data-scientists-cant-afford-too-many-dimensions-and-what-they-can-do-about-it-653230d50f9c/ Thu, 16 Jan 2025 13:32:00 +0000 https://towardsdatascience.com/why-data-scientists-cant-afford-too-many-dimensions-and-what-they-can-do-about-it-653230d50f9c/ An in-depth article about dimensionality reduction and its most popular methods

The post Why Data Scientists Can’t Afford Too Many Dimensions and What They Can Do About It appeared first on Towards Data Science.

]]>
Photo by Paulina Gasteiger on Unsplash

Dimensionality reduction is a central method in the field of Data Analysis and Machine Learning that makes it possible to reduce the number of dimensions in a data set while retaining as much of the information it contains as possible. This step is often necessary before training a model in order to save computing power and avoid the problem of overfitting.

In this article, we take a detailed look at dimensionality reduction and its objectives. We also illustrate the most commonly used methods and highlight the challenges of dimensionality reduction.

What is Dimensionality Reduction?

Dimensionality reduction comprises various methods that aim to reduce the number of characteristics and variables in a data set while preserving the information in it. In other words, fewer dimensions should enable a simplified representation of the data without losing patterns and structures within the data. This can significantly accelerate downstream analyses and also optimize machine learning models.

In many applications, problems arise due to the high number of variables in a data set, which is also referred to as the curse of dimensionality. For example, too many dimensions can lead to the following problems:

  • Overfitting: When machine learning models become overly reliant on characteristics of the training dataset and therefore only provide poor results for new, unseen data, this is known as overfitting. With a high number of features, the risk of overfitting increases as the model becomes more complex and therefore adapts too strongly to the errors in the data set.
  • Computational complexity: Analyses and models that have to process many variables often require more parameters to be trained. This increases the computational complexity, which is reflected either in a longer training time or in increased resource consumption.
  • Data Noise: With an increased number of variables, the probability of erroneous data or so-called noise also increases. This can influence the analyses and lead to incorrect predictions.

Although large data sets with many characteristics are very informative and valuable, the high number of dimensions can also quickly lead to problems. Dimensionality reduction is a method that attempts to preserve the information content of the data set while reducing the number of dimensions.

What is the Curse of Dimensionality?

The Curse of Dimensionality occurs with high-dimensional data sets, i.e. those that have a large number of attributes or features. At first, many attributes are a good thing because they contain a lot of information and describe the data points well. For example, if we have a dataset about people, the attributes can be information such as hair color, height, weight, eye color, etc.

In mathematical terms, however, each additional attribute means a new dimension in the space and therefore a significant increase in possibilities. This becomes clear from the following example, in which we want to find out which customers buy which products. In the first step, we only look at the age of the prospective customers and whether they have bought the product. We can still depict this relatively easily in a two-dimensional diagram.

Diagram of Buyers divided by Age | Source: Author

As soon as we add more information about the customer, things get a little more complex. The information on the customer’s income would mean a new axis on which the numerical income is mapped. So the two-dimensional diagram becomes a three-dimensional one. The additional attribute "gender" would lead to a fourth dimension and so on.

When working with data, it is desirable to have a lot of attributes and information in the data set to give the model many opportunities to recognize structures in the data. However, it can also lead to serious problems, as the name Curse of Dimensionality suggests.

Data Sparsity

The example shown illustrates a problem that occurs with many attributes. Due to the large number of dimensions, the so-called data space, i.e. the set of values a dataset can take on, also grows. This can lead to what is known as data sparsity, meaning that the training data set used to train the model contains certain value combinations not at all or only very rarely. As a result, the model only delivers poor results for these marginal cases.

Let’s assume that we examine 1,000 customers in our example, as it would be too time-consuming to survey even more customers or this data is simply not available. All age groups from young to old may be well represented among these customers. However, if the additional dimension of income is added, it becomes less likely that the possible characteristics, such as "young" and "high income" or "old" and "medium income", will occur and be backed up with enough data points.

Distance Concentration

If you want to evaluate the similarity of different data points in the field of machine learning, distance functions are often used. The most common clustering algorithms, such as k-means clustering, rely on calculating the distance between points and assigning them to a cluster depending on the result. In multidimensional spaces, however, it can quickly happen that all points are at a similar distance from each other, so that it seems almost impossible to separate them.

We are also familiar with this phenomenon from everyday life. If you take a photo of two objects, such as two trees, they can look very close to each other in the picture, as it is only a two-dimensional image. In real life, however, the trees may be several meters apart, which only becomes clear in three dimensions.

All these problems, which can occur in connection with many dimensions, are summarized under the term Curse of Dimensionality.

What are the Goals of Dimensionality Reduction?

Dimensionality reduction pursues three primary goals: improving model performance, visualizing data, and increasing processing speed. We will examine these in more detail in the following section.

Improving Model Performance

One of the main goals of dimensionality reduction is to improve model performance. By reducing the number of variables in a dataset, a less complex model can be used, which in turn reduces the risk of overfitting.

Models that have a large number of parameters and are therefore highly complex tend to overfit the training dataset and the noise in the data. As a result, the model delivers poorer results with new data that does not contain this noise, while the accuracy of the training data set is very good. This phenomenon is known as overfitting. During dimensionality reduction, unimportant or redundant features are removed from the data set, which reduces the risk of overfitting. As a result, the model delivers better quality for new, unseen data.

Visualization of Data

If you want to visualize data sets with many features, you face the challenge of mapping all this information into a two- or at most three-dimensional space. The easiest way to plot data is to assign a separate dimension to each feature, but anything beyond three dimensions is no longer directly graspable for us humans. Therefore, with high-dimensional data sets, we often cannot simply visualize the data to gain an initial understanding of its peculiarities, for example to recognize whether there are outliers.

Dimensionality reduction helps to reduce the number of dimensions to such an extent that visualization in two- or three-dimensional space is possible. This makes it easier to better understand the relationships between the variables and the data structures.

Increasing the Processing Speed

Computing time and the necessary resources play a major role in the implementation of projects, especially for machine learning and Deep Learning algorithms. Often, only limited resources are available, which should be used optimally. By removing redundant features from the data set at an early stage, you not only save time and computing power during data preparation but also when training the model, without having to accept lower performance.

In addition, dimensionality reduction makes it possible to use simpler models that not only require less power during initial training but can also perform calculations faster later during operation. This is an important factor, especially for real-time calculations.

Overall, Dimensionality Reduction is an important method for improving data analysis and building more robust machine-learning models. It is also an important step in the visualization of data.

Which Methods are used for Dimensionality Reduction?

In practice, various methods for dimensionality reduction have become established, three of which are explained in more detail below. Depending on the application and the structure of the data, these methods already cover a broad spectrum and can be used for most practical problems.

Principal Component Analysis

Principal component analysis (PCA) assumes that several variables in a data set possibly measure the same thing, i.e. are correlated. These different dimensions can be mathematically combined into so-called principal components without compromising the significance of the data set. The shoe size and height of a person, for example, are often correlated and can therefore be replaced by a common dimension to reduce the number of input variables.

Principal component analysis describes a method for mathematically calculating these components. The following two key concepts are central to this:

The covariance matrix is a matrix that specifies the pairwise covariances between two different dimensions of the data space. It is a square matrix, i.e. it has as many rows as columns. For any two dimensions, the covariance is calculated as follows:

COV(X, Y) = 1/(n-1) · Σᵢ (Xᵢ - X̄)(Yᵢ - Ȳ)

Here n stands for the number of data points in the data set, Xᵢ is the value of dimension X for the i-th data point and X̄ is the mean value of dimension X over all n data points. As can be seen from the formula, the covariance between two dimensions does not depend on their order, so COV(X,Y) = COV(Y,X). These values result in the following covariance matrix C for the two dimensions X and Y:

C = [ COV(X,X)  COV(X,Y)
      COV(Y,X)  COV(Y,Y) ]

The covariance of two identical dimensions is simply the variance of the dimension itself, i.e. COV(X,X) = VAR(X).

The covariance matrix is the first important step in the principal component analysis. Once this matrix has been created, the eigenvalues and eigenvectors can be calculated from it. Mathematically, the following equation is solved for the eigenvalues:

det(C - λI) = 0

Here λ is the desired eigenvalue and I is the identity matrix of the same size as the covariance matrix C. Solving this equation yields one or more eigenvalues of the matrix. Each eigenvalue describes how strongly the linear transformation stretches space in the direction of the associated eigenvector. An associated eigenvector can then be calculated for each eigenvalue by solving the slightly modified equation:

(C - λI) v = 0

where v is the desired eigenvector. In the case of the covariance matrix, the eigenvalue corresponds to the variance along the direction of its eigenvector, which in turn represents a principal component. Each eigenvector is therefore a mixture of different dimensions of the data set, i.e. a principal component, and the corresponding eigenvalue indicates how much of the data set's variance is explained by that eigenvector. The higher this value, the more important the principal component, as it contains a large proportion of the information in the data set.

Therefore, after calculating the eigenvalues, they are sorted by size and the eigenvalues with the highest values are selected. The corresponding eigenvectors are then calculated and used as principal components. This results in a dimension reduction, as only the principal components are used to train the model instead of the individual features of the data set.
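As a rough sketch of these steps in Python (using NumPy on synthetic data; scikit-learn's PCA class wraps the same procedure):

import numpy as np

# Synthetic data: 200 samples, 5 features, two of which are strongly correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 0.9 * X[:, 0] + rng.normal(scale=0.1, size=200)

# 1. Center the data and compute the covariance matrix
X_centered = X - X.mean(axis=0)
C = np.cov(X_centered, rowvar=False)

# 2. Eigenvalues and eigenvectors of the covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(C)

# 3. Sort by eigenvalue (descending) and keep the top k principal components
order = np.argsort(eigenvalues)[::-1]
k = 2
components = eigenvectors[:, order[:k]]

# 4. Project the data onto the principal components
X_reduced = X_centered @ components
print(X_reduced.shape)  # (200, 2)

The ratio of the selected eigenvalues to their total sum tells you how much of the data set's variance the chosen components explain.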

t-Distributed Stochastic Neighbor Embedding (t-SNE)

t-Distributed Stochastic Neighbor Embedding, or t-SNE for short, approaches the problem of dimensionality reduction differently by attempting to create a new, lower-dimensional space that adopts the distances of the data points from the higher-dimensional space as far as possible. The basic idea of this becomes clear in the following example.

It is not easy to transfer data sets from a high dimensionality to a low dimensionality while retaining as much information as possible from the data set. The following figure shows a simple, two-dimensional data set with a total of 50 data points. Three different clusters can be identified, which are also well separated from each other. The yellow cluster is furthest away from the other two clusters, while the purple and blue data points are closer to each other.

Two-dimensional data for t-SNE | Source: Author

The aim now is to convert this two-dimensional data set into a lower dimension, i.e. into one dimension. The simplest approach for this would be to represent the data either only by its X or Y coordinate.

Dimensionality reduction by X- or Y-coordinate | Source: Author

However, it is clear that this simple transformation has lost much of the information in the data set and gives a different picture than the original two-dimensional data. If only the X coordinates are used, it looks as if the yellow and purple clusters overlap and that all three clusters are roughly equidistant from each other. If, on the other hand, only the Y coordinates are used for dimensionality reduction, the yellow cluster is much better separated from the other clusters, but it looks as if the purple and blue clusters overlap.

The basic idea of t-SNE is that the distances from the high dimensionality are transferred to the low dimensionality as far as possible. To do this, it uses a stochastic approach and converts the distances between points into a probability that indicates how likely it is that two random points are next to each other.

More precisely, it is a conditional probability that indicates how likely it is that one point would choose the other point as a neighbor. Hence the name "Stochastic Neighbor Embedding".

Dimensionality reduction according to t-SNE | Source: Author

As you can see, this approach leads to a much better result, in which the three different clusters can be clearly distinguished from one another. It is also clear that the yellow data points are significantly further away from the other data points and the blue and purple clusters are somewhat closer together. To better understand the details of this approach, you are welcome to read our detailed article on this topic.
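As a minimal sketch of such a reduction with scikit-learn (synthetic three-cluster data; the perplexity value is an illustrative choice):

from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Synthetic two-dimensional data with three clusters, similar to the example above
X, _ = make_blobs(n_samples=50, centers=3, n_features=2, random_state=42)

# Reduce to one dimension while trying to preserve the neighborhood structure
tsne = TSNE(n_components=1, perplexity=10, method="exact", random_state=42)
X_1d = tsne.fit_transform(X)

print(X_1d.shape)  # (50, 1)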

Mastering t-SNE: A Comprehensive Guide to Understanding and Implementation in Python

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA for short) aims to identify and maximize the separability between classes by projecting the data onto a smaller dimension. In contrast to other methods, it places particular emphasis on maximizing the separability between classes and is therefore particularly important for classification tasks.

Two central measures are calculated:

  • Within-class scatter: This measure captures the variability within a class. The lower this value, the more similar the data points within a class are to each other, which makes the classes easier to separate.
  • Between-class scatter: This measure captures the variability between the mean values of the classes. It should be as large as possible, as a large value indicates that the classes are easier to separate from each other.

Simply put, these measures are combined into scatter matrices, and the eigenvalues and eigenvectors of the resulting matrix are calculated. The eigenvectors belonging to the largest eigenvalues span the new feature space, which has fewer dimensions than the original data set. All data points can then be projected onto these eigenvectors, which reduces the dimensionality.

Linear Discriminant Analysis is particularly suitable for applications in which the classes are already known and the data is also clearly labeled. It uses this class information from supervised learning to find a low-dimensional space that separates the classes as well as possible.

A disadvantage of LDA is that the number of dimensions in the reduced space is limited and depends on the number of classes: a data set with n classes can be reduced to at most n − 1 dimensions. In concrete terms, this means, for example, that a data set with three different classes can be reduced to at most a two-dimensional space.
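A minimal sketch with scikit-learn on synthetic labeled data (note that n_components can be at most the number of classes minus one):

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic labeled data set with 3 classes and 10 features
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)

# 3 classes -> at most 3 - 1 = 2 discriminant components
lda = LinearDiscriminantAnalysis(n_components=2)
X_reduced = lda.fit_transform(X, y)

print(X_reduced.shape)  # (300, 2)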

This approach to dimensionality reduction is also particularly suitable for large data sets, as the computational effort is moderate and scales well with the amount of data.

What are the Challenges of Dimensionality Reduction?

Dimensionality reduction is an important step in data pre-processing to increase the generalizability and overall performance of machine learning models and possibly save computing power. However, the methods also bring some challenges with them that should be considered before use. These include:

  • Loss of information: Although the methods presented attempt to retain as much of the variance, i.e. information content, of the data set as possible, some information is inevitably lost when dimensions are reduced. Important information may be discarded as well, which can lead to poorer analyses or model results; the resulting model may therefore be less accurate or less powerful.
  • Interpretability: Most algorithms used for dimensionality reduction produce a low-dimensional space that no longer contains the original dimensions, but a mathematical summary of them. Principal component analysis, for example, uses the eigenvectors, each of which is a combination of several dimensions of the data set. As a result, interpretability is lost to a certain extent, as the original dimensions are easier to measure and visualize. Especially in use cases where transparency is important and the results require a practical interpretation, this can be a decisive disadvantage.
  • Scaling: Dimensionality reduction is computationally intensive and takes time, especially for large data sets. In real-time applications, however, this time is not available, as it adds to the computing time of the model itself. Computationally intensive methods in particular, such as t-SNE or autoencoders, are quickly ruled out because they take too long for live predictions.
  • Selecting the method: Depending on the application, the dimensionality reduction methods presented have their strengths and weaknesses. The selection of a suitable algorithm therefore plays an important role, as it has a significant influence on the results. There is no one-size-fits-all solution and a new decision must always be made based on the application.

Dimensionality reduction has many advantages and is an integral part of data pre-processing in many applications. However, the disadvantages and challenges mentioned must also be taken into account to train an efficient model.

This is what you should take with you

  • Dimensionality reduction is a method of reducing the number of dimensions, i.e. variables, in a data set while retaining as much information content as possible.
  • With many input variables, the so-called curse of dimensionality arises. This leads to problems such as data sparsity or distance concentration.
  • Dimensionality reduction often leads to a lower probability of overfitting, better visualization, and optimized computing power when training machine learning models.
  • The most common dimensionality reduction methods include principal component analysis, t-SNE, and LDA.
  • However, dimensionality reduction also has disadvantages, such as poorer interpretability or loss of information.

The post Why Data Scientists Can’t Afford Too Many Dimensions and What They Can Do About It appeared first on Towards Data Science.

]]>
Understanding Flash Attention: Writing the Algorithm from Scratch in Triton https://towardsdatascience.com/understanding-flash-attention-writing-the-algorithm-from-scratch-in-triton-5609f0b143ea/ Wed, 15 Jan 2025 17:01:59 +0000 https://towardsdatascience.com/understanding-flash-attention-writing-the-algorithm-from-scratch-in-triton-5609f0b143ea/ Find out how Flash Attention works. Afterward, we'll refine our understanding by writing a GPU kernel of the algorithm in Triton.

The post Understanding Flash Attention: Writing the Algorithm from Scratch in Triton appeared first on Towards Data Science.

]]>

Read for free at alexdremov.me

Flash Attention is a revolutionary technique that dramatically accelerates the attention mechanism in transformer-based models, delivering processing speeds many times faster than naive methods. By cleverly tiling data and minimizing memory transfers, it tackles the notorious GPU memory bottleneck that large language models often struggle with.

In this post, we’ll dive into how Flash Attention leverages efficient I/O-awareness to reduce overhead, then take it a step further by crafting a block-sparse attention kernel in Triton.

💥 I will provide a simple explanation of how Flash Attention works. We will then implement the explained algorithm in Triton!

What is Attention?

The attention mechanism (or scaled dot-product attention) is a core element of transformer models, which is a leading architecture for solving the problem of language modeling. All popular models, like GPT, LLaMA, and BERT, rely on attention.

The formula is pretty simple:
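For queries Q, keys K, and values V with key dimension d_k, it reads:

\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V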

The rest is history.

Even though the formula looks simple, its computation involves multiplications of large tensors and a lot of data movement. Considering that this is a core part of the transformer architecture, optimizing the algorithm greatly improves the performance of the model in general.

In the naive implementation, attention requires O(n²) additional memory and O(n²) compute time complexity, where n is the sequence length. That’s a lot!

Flash Attention

Core Idea

The main idea of Flash attention can be summarized in a simple quote from the original paper:

We argue that a missing principle is making attention algorithms IO-aware – accounting for reads and writes between levels of GPU memory.

That is, modern GPUs have several types of memory:

  • SRAM – fast, on-chip, small
  • HBM – slower than SRAM, but much larger. This is what we usually refer to as GPU memory.

Check out the memory hierarchy in the image below to see the differences in bandwidth and sizes of different memory types.

Image from FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness by Tri Dao et al.

💡 To conduct computation, data must be transferred from HBM to SRAM, and this transfer is not overhead-free!

The Flash Attention algorithm proposes a method of computing attention in tiles, without explicitly materializing the attention scores tensor:

💥 Not materializing a matrix means that at any given time, the matrix does not exist in its full shape in memory.

It’s easy to see that this matrix requires O(n²) of memory to store. For large sequence lengths, that’s a lot of data! So, if we manage to avoid explicitly materializing this matrix, we can save lots of memory.

However, this matrix is necessary for transformer training, as it is part of backpropagation and gradient calculation. The authors propose recalculating this matrix during the backward pass (again without explicit materialization). Not only does this save lots of memory, it also provides huge speedups, as we don't need to transfer this enormous matrix between the different types of GPU memory.

Overall, this approach not only sped up calculations by taking the GPU's I/O characteristics into account, but also made it possible to process huge sequence lengths, as the memory complexity drops to O(n).

Tiled Attention Calculation

The last thing to understand is how to compute attention in tiles. Basically, this means that we will calculate attention over the full sequence by processing incoming tokens in small portions.

Well, it's easy to calculate QK^T in tiles. Since the attention head dimension is not large, we can load full rows of Q and full columns of K and carry out the multiplication tile by tile.

😡 Yes, if we want to have an enormous attention dimension, Flash Attention will not work without algorithm modifications.

As dimensions are usually quite small even for enormous models, this limitation is fair.

Tiled QK^T | Image by the author

So, we have QK^T calculated in SRAM. All that’s left is to apply softmax, multiply by V, and that’s it!

That’s where the trick is.

The problem is that the softmax denominator requires aggregation over the sequence length to normalize scores, and we do not have access to the whole length as we load data in tiles.

To address it, we can implement a concatenated softmax algorithm. Using it, we can calculate the softmax in an online fashion: adjusting the already computed values as new data comes in.

Taking the algorithm from the original article, we can define rules to compute the softmax over data concatenation. Having two vectors x1 and x2, we need to calculate the softmax denominator l(x) over those vectors’ concatenation [x1, x2]. If the vector’s maximum is m(x), we can easily derive the softmax denominator of the concatenation:
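With m(x) denoting the maximum of a vector x and l(x) = \sum_i e^{x_i - m(x)} the (stabilized) softmax denominator, the concatenation rules from the paper are:

m(x) = m([x_1, x_2]) = \max(m(x_1), m(x_2))

l(x) = l([x_1, x_2]) = e^{m(x_1) - m(x)} \, l(x_1) + e^{m(x_2) - m(x)} \, l(x_2)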

The last equivalence can be easily verified as
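For the first term (the second works analogously):

e^{m(x_1) - m(x)} \, l(x_1) = e^{m(x_1) - m(x)} \sum_i e^{x_{1,i} - m(x_1)} = \sum_i e^{x_{1,i} - m(x)}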

So, now we have what we want – we can calculate softmax per-tile and then, by doing re-normalization from the formula above, compute the global softmax. The last thing to do is to incorporate the tile of the V tensor and keep doing the same re-normalization (as matrix multiplication is a linear operation).

And all of this without loading the full sequence into memory or materializing QK^T!

💥 Notice that we calculate Softmax(QK^T) in tiles only, without needing to have the whole matrix at any moment.

Also, in the actual algorithm for numerical stability, we will compute not Softmax(x) but Softmax(x – max(x)). We can do that as softmax is invariant to constant shifts.

Triton Implementation

Now, we can easily implement the outlined algorithm in Triton, which is a tool that allows us to write efficient GPU kernels with the ease of Python.

💡 To learn more about Triton, check out their official guides.

Tutorials – Triton documentation

Outlining the Algorithm

The first step is to decide how we will assign jobs and what data each job will load. By the algorithm of tiled softmax, each job must have access to K, V over the whole sequence length. So, each job will iterate over K, V in tiles. We don’t have any algorithmic restriction on the number of Q tiles processed. Therefore, each job will load just one Q tile and work with it only – this way we will maximize job parallelism.

Kernel jobs data management | Image by the author

In summary, each job will load a single Q tile, iterate over all tiles in K and V, and store one tile of result corresponding to the Q tile.

The Kernel

What's left is to write the actual code. Let's focus on the core part first; the Triton-specific boilerplate can come afterward.

Below is a Triton pseudocode with every line explained.
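For orientation, here is a stripped-down sketch of the forward kernel (assuming float32, row-major (seq_len, HEAD_DIM) tensors, a sequence length divisible by the block sizes, and no masking, batching, or multiple heads; the full kernel lives in the repository linked further down):

import triton
import triton.language as tl

@triton.jit
def flash_attn_fwd(
    Q, K, V, O,          # pointers to (seq_len, HEAD_DIM) tensors
    seq_len, sm_scale,
    BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, HEAD_DIM: tl.constexpr,  # each >= 16 for tl.dot
):
    # Each job handles one tile of BLOCK_M query rows.
    start_m = tl.program_id(0)
    offs_m = start_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_d = tl.arange(0, HEAD_DIM)

    # Load the Q tile once; it stays on-chip for the whole job.
    q = tl.load(Q + offs_m[:, None] * HEAD_DIM + offs_d[None, :])

    # Running softmax statistics and the output accumulator.
    m_i = tl.zeros((BLOCK_M,), dtype=tl.float32) - float("inf")  # running max
    l_i = tl.zeros((BLOCK_M,), dtype=tl.float32)                 # running denominator
    acc = tl.zeros((BLOCK_M, HEAD_DIM), dtype=tl.float32)        # running P @ V

    # Iterate over all K/V tiles along the sequence.
    for start_n in range(0, seq_len, BLOCK_N):
        offs_n = start_n + tl.arange(0, BLOCK_N)
        k = tl.load(K + offs_n[:, None] * HEAD_DIM + offs_d[None, :])
        v = tl.load(V + offs_n[:, None] * HEAD_DIM + offs_d[None, :])

        # One tile of attention scores: (BLOCK_M, BLOCK_N).
        s = tl.dot(q, tl.trans(k)) * sm_scale

        # Online softmax update: re-normalize the old statistics to the new max.
        m_new = tl.maximum(m_i, tl.max(s, 1))
        alpha = tl.exp(m_i - m_new)
        p = tl.exp(s - m_new[:, None])
        l_i = l_i * alpha + tl.sum(p, 1)
        acc = acc * alpha[:, None] + tl.dot(p, v)
        m_i = m_new

    # Final normalization and write-back of the output tile.
    acc = acc / l_i[:, None]
    tl.store(O + offs_m[:, None] * HEAD_DIM + offs_d[None, :], acc)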

See? Easy!

What’s important is that you can see how simple it is to write such a thing as soon as we understand the idea of tiled softmax. Apart from that, there’s nothing complicated from the algorithm perspective.

💥 This kernel can be made even faster by applying further Triton optimizations. However, this is out of the scope of this article.

This pseudocode is pretty close to the actual code. You may find it on my GitHub by following the link. All that I added is just data management and PyTorch wrappers.

kernels/src/self_attention/kernel.py at main · alexdremov/kernels

❗Don’t hesitate to ask if something isn’t clear. I’m here in the comments 😁 .

The code above was extensively tested to match PyTorch’s scaled_dot_product_attention. You can also check out the tests to see how to use the written kernel.

Benchmarking

While we wrote the kernel in Triton to improve the algorithm understanding, it’s interesting to compare the performance with a naive implementation and PyTorch’s scaled_dot_product_attention.

Benchmarking implementations for different sequence lengths | Image by the author

As expected, the Flash Attention algorithm completely outperforms the naive implementation performance-wise. Also, I’ve marked with a dashed line the range of lengths for which the naive implementation causes a CUDA out-of-memory error.

We see that our Triton implementation is slightly slower than PyTorch SDPA, but the difference is not large. Considering that PyTorch SDPA is a well-optimized CUDA kernel, that's a nice result.

Benchmarking code is also available in the repository.

kernels/benchmark/benchmark_self_attention.py at main · alexdremov/kernels

This story was originally published on alexdremov.me. Check it out! (At least the TeX looks better there.)

Conclusions

In the post, I covered the motivation of the Flash Attention algorithm as well as its algorithm details. Finally, we were able to implement it from scratch in Triton, reproducing the speedups from the paper.

I hope this post improved your understanding of Flash Attention. Feel free to leave a comment below if you have any questions.

References

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

Tutorials – Triton documentation

GitHub – alexdremov/kernels: Collection of useful kernels

The post Understanding Flash Attention: Writing the Algorithm from Scratch in Triton appeared first on Towards Data Science.

]]>
LossVal Explained: Efficiently Estimate the Importance of Your Training Data https://towardsdatascience.com/lossval-explained-efficiently-estimate-the-importance-of-your-training-data-cef557434bf8/ Wed, 15 Jan 2025 14:01:59 +0000 https://towardsdatascience.com/lossval-explained-efficiently-estimate-the-importance-of-your-training-data-cef557434bf8/ How to Exploit the Loss Function for Efficient Data Valuation

The post LossVal Explained: Efficiently Estimate the Importance of Your Training Data appeared first on Towards Data Science.

]]>
LossVal Explained: Efficient Data Valuation for Neural Networks

How to exploit the loss function to efficiently estimate the importance of your training data

Data Valuation visualized. Understanding how important each training sample is for the performance of your machine learning model. (Image by author.)

This blog post summarizes and explains our paper "LossVal: Efficient Data Valuation for Neural Networks".


Not all data is created equal: Some training data points influence the training of a machine learning model much more than others. Quantifying the influence of each data point is usually highly inefficient and often relies on repeated retraining of the model. LossVal presents a new approach that efficiently integrates the data valuation process into the loss function of an artificial neural network.

What is Data Valuation?

Machine learning models are often trained with large datasets. In most cases, not all training samples in such a dataset are equally helpful or informative for the model. For example, if a data point is noisy or has a wrong label, it is less informative for your machine learning model. For one of the tasks in our paper, we trained a machine learning model on a vehicle crash test dataset to predict how harmful a crash would be for an occupant, based on some vehicle parameters. Some of the data points are from cars of the 80s and 90s! You can imagine that very old cars may be less important for the model's predictions on modern cars.

The process of understanding the effect of each training sample on the machine-learning model is called Data Valuation, where an importance score is assigned to each training sample. Data Valuation is a growing field connected to data markets, explainable AI, active learning, and many more. Many approaches have been proposed, like Data Shapley, Influence Functions, or LAVA. To learn more about this, you can check out my recent blog post that presents different Data Valuation methods and applications.

LossVal

The basic idea behind LossVal is to "learn" the importance score of each sample while training the model, similar to how the model weights are learned. This saves us from rerunning the training of the model multiple times and from having to track all model weight updates during the training.

To achieve this, we can modify standard loss functions like the mean squared error (MSE) and the cross-entropy loss. We incorporate instance-based weights into the loss and multiply it by a weighted distance function. In general, the LossVal loss functions have the following form:
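Schematically, with w denoting the vector of per-sample weights:

\text{LossVal} = \mathcal{L}_w(y, \hat{y}) \cdot \text{OT}_w(D_{train}, D_{val})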

where ℒ indicates the weighted target loss (weighted MSE or cross-entropy) and OT indicates a weighted distribution distance (OT stands for optimal transport). This results in new loss functions that can be used like any other loss function for training a neural network. However, during each training step, the weights w in the loss are updated using gradient descent.

We demonstrate this for regression tasks using the MSE and for classification using the cross-entropy loss. Afterward, we take a closer look at the distribution distance OT.

LossVal for Regression

Let’s start with the MSE. The standard MSE is the squared difference between the model prediction ŷ and the correct prediction y (with n being the index of the training sample):
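For N training samples:

\text{MSE} = \frac{1}{N} \sum_{n=1}^{N} (y_n - \hat{y}_n)^2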

For LossVal, we modify the MSE in two steps: First, a weight wₙ is included for each training instance n. Second, the whole MSE is multiplied with a weighted distribution distance function.
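Written out roughly as in the implementation shown further below (which squares the distance term):

\text{LossVal}_{\text{MSE}} = \left( \sum_{n=1}^{N} w_n \, (y_n - \hat{y}_n)^2 \right) \cdot \text{OT}_w^2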

LossVal for Classification

The cross-entropy loss is typically expressed like this:
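For N samples with one-hot targets y over the classes k:

\text{CE} = - \sum_{n=1}^{N} \sum_{k} y_{n,k} \log(\hat{y}_{n,k})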

We can modify the cross-entropy loss in the same way as the MSE:
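Analogously, again with the squared distance term as in the implementation:

\text{LossVal}_{\text{CE}} = - \left( \sum_{n=1}^{N} w_n \sum_{k} y_{n,k} \log(\hat{y}_{n,k}) \right) \cdot \text{OT}_w^2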

The Optimal Transport Distance

The optimal transport distance is the minimum effort you need to transform one distribution into another. It is also known as the earth mover distance, coming from the analogy of the fastest way to fill a hole with a pile of dirt. OT can be defined as:
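In the weighted form used here, with the training weights w entering through the admissible transport plans, this can be written roughly as:

\text{OT}_w(D_{train}, D_{val}) = \min_{\gamma \in \Pi(w, 1)} \sum_{n, j} c(x_n, x_j) \, \gamma_{n,j}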

where c is the cost of moving point xₙ to xⱼ. Each γ is a possible transport plan, defining how the points are moved. The optimal transport plan is the γ* with the least effort involved (the smallest distribution distance). Note that we include the weights w via the joint distribution Π(w, 1). In other words, OT_w is the weighted distance between the training and the validation set. You can find an in-depth explanation of optimal transport here.

In a more practical sense, minimizing OT_w by changing the weights will assign higher weights to the training data points that are similar to the validation data. Effectively, noisy samples get a smaller weight.

Implementation

Our implementation and all data are available on GitHub. The code below shows the implementation of LossVal for the mean squared error.

import torch
from geomloss import SamplesLoss  # assumed source of the Sinkhorn-based SamplesLoss used below

def LossVal_mse(train_X: torch.Tensor,
                train_y_true: torch.Tensor, train_y_pred: torch.Tensor,
                val_X: torch.Tensor, sample_ids: torch.Tensor,
                weights: torch.Tensor, device: torch.device) -> torch.Tensor:
    weights = weights.index_select(0, sample_ids)  # Select the weights corresponding to the sample_ids

    # Step 1: Compute the weighted mse loss
    loss = torch.sum((train_y_true - train_y_pred) ** 2, dim=1)
    weighted_loss = torch.sum(weights @ loss)  # Loss is a vector, weights is a matrix

    # Step 2: Compute the Sinkhorn distance between the training and validation distributions
    sinkhorn_distance = SamplesLoss(loss="sinkhorn")
    dist_loss = sinkhorn_distance(weights, train_X, torch.ones(val_X.shape[0], requires_grad=True).to(device), val_X)

    # Step 3: Combine mse and Sinkhorn distance
    return weighted_loss * dist_loss**2

This loss function works like any other loss function in PyTorch, with some peculiarities: the parameters include the validation set, the sample weights, and the indices of the samples in the batch. This is necessary to select the correct weights for the batched samples when calculating the weighted loss. Keep in mind that this implementation relies on PyTorch's automatic gradient calculation, which means the sample weights vector needs to be part of the model parameters. This way, the optimization of the weights benefits from the optimizer implementation, such as Adam. Alternatively, one could also update the weights by hand, using the gradient of the loss with respect to each weight. The implementation for cross-entropy works equivalently, but you need to replace the weighted MSE computation in Step 1 with a weighted cross-entropy.

But does it work?

Benchmark comparison of different Data Valuation methods for noisy sample detection. Higher is better. (Image by author.)

The graphic above shows the comparison between different data valuation approaches on the noisy sample detection task. This task is defined by the OpenDataVal benchmark. First, noise is added to p% of the training data; then data valuation is used to find the noisy samples. Better methods find more of the noisy samples and hence achieve a higher F1 score. The graph above shows the average over 6 classification and 6 regression datasets. We tested three different noise types: noisy labels, noisy features, and mixed noise. In the mixed-noise condition, half of the noisy samples have feature noise and the other half have label noise. In noisy sample detection, LossVal outperforms all other methods for label noise and mixed noise. However, LAVA performs better for feature noise.

The experimental setup for the point removal experiment (graphic below) is similar. However, here the goal is to remove the highest-valued data points from the training set and see how a model trained on the remaining data performs. This means that a better data valuation method leads to a faster degradation in model performance, because it removes important data points earlier. We found that LossVal matches state-of-the-art methods.

Benchmark comparison of different Data Valuation methods for removal of high-value points. Lower is better. (Image by author.)

For more detailed results, take a look at our paper.

Conclusion

The idea behind LossVal is simple: Use the gradient descent to find an optimal weight for each data point. The weight signifies the importance of the data point.

Our experiments show that LossVal achieves state-of-the-art performance on the OpenDataVal benchmark. LossVal has a lower time complexity than all other model-based approaches we tested and demonstrates a more robust performance over different types of noise and tasks.

Overall, LossVal presents an efficient alternative to other state-of-the-art data valuation approaches for neural networks.


Feel free to get in touch via LinkedIn

The post LossVal Explained: Efficiently Estimate the Importance of Your Training Data appeared first on Towards Data Science.

]]>
From Darwin to Deep Work https://towardsdatascience.com/from-darwin-to-deep-work-0db4bf1761d8/ Tue, 14 Jan 2025 13:02:21 +0000 https://towardsdatascience.com/from-darwin-to-deep-work-0db4bf1761d8/ Focus Strategies for Machine Learning Practitioners

The post From Darwin to Deep Work appeared first on Towards Data Science.

]]>
Focus strategies for machine learning practitioners

In December 1831, the HMS Beagle set sail on a nearly five-year journey around the world. Among its crew was 22-year-old Charles Darwin, eager to explore the tropical regions of the globe. During this voyage, Darwin collected fossils and developed a growing interest in the natural world, particularly in plants and species. The trip profoundly shaped Darwin’s thinking and laid the groundwork for his seminal work, On the Origin of Species, which was eventually published 23 years later in 1859.

Photo by Torsten Dederichs on Unsplash

Writing a book of such groundbreaking importance required clarity of thought. Darwin adhered to a strict daily schedule to stay productive. He would rise at 7 a.m. and dedicate the hours from 8 a.m. to noon to uninterrupted study and writing. These morning sessions were reserved for his most cognitively demanding work: analyzing his findings and refining his theories. After lunch, he would mentally work through challenges on long walks. Darwin’s disciplined routine and deep focus enabled him to tackle complex problems over extended periods – a testament to a concept called "Deep Work."

Deep Work

The term deep work was coined by computer science professor Cal Newport, in his book of the same name, Deep Work. Newport defines deep work as the ability to focus intensely on a single, cognitively demanding task. For Machine Learning (ML) practitioners, deep work is an especially critical and valuable skill.

Deep Work for ML practitioners

Machine learning research involves tasks that are intellectually demanding. Examples include reading papers, solving mathematical problems, and programming algorithms. Each of these activities is cognitively taxing, requiring full attention over long spans of uninterrupted time.

Take reading research papers, for instance. For beginners, it can take three hours or more to read a single paper end-to-end. Following the authors’ arguments and understanding the underlying concepts is mentally exhausting. If you’re interrupted during this process – whether by an email notification or a colleague dropping by – it disrupts your train of thought. Research has shown that returning to the original point of focus after an interruption takes significant time and effort [1][2].

Reading a paper is only one aspect of machine learning; programming and mathematics equally require continued focus. Writing code to implement an algorithm is a creative and demanding task. Interruptions during coding sessions not only break your concentration but can lead to errors that are frustrating and time-consuming to debug. As any math student can attest, working through complex mathematical proofs or equations demands sustained focus; distractions make it nearly impossible to follow the logical flow.

The Modern Workplace

While being able to work deeply, especially in the field of machine learning, is essential, modern workplaces offer a wide array of distractions. Offices are built on the idea of communication, and thus email and instant messaging systems like Teams are commonly used. Although these tools make collaboration easier, they also encourage constant communication. Even when you are deep into coding a new algorithm, a single click on a shiny icon is enough to fragment your concentration.

Statistics suggest that the average worker spends about 50% of their day on email alone [3]. As Cal Newport describes in Deep Work, it's not the case that you can instantly switch your attention to answering these emails and then go directly back to coding. Quite the contrary: the residual effect of attention means that part of your mental capacity is still caught up in the previous task (answering messages) when you return to your initial task (coding) [2].

The residual effect is amplified when you cannot finish the interrupting task (e.g., when a message you saw asks you to prepare a presentation for some event X), leaving unfinished fragments to linger in your mind. It's unlikely that Darwin would have been able to think clearly in such an environment.

Deep Work Strategies in ML

So, how can ML practitioners find the time to do what they are good at in a state of deep focus? At first glance, it is appealing to disconnect entirely from email and messaging apps and focus exclusively on reading papers and coding ML algorithms. However, this is neither practical nor advisable. On the contrary, communication is crucial in any workplace, and many ML projects benefit heavily from exchange with colleagues. Ignoring this reality entirely could lead to professional consequences, such as projects being cancelled.

Thus, instead of eliminating communication, the goal should be to balance it with deep work.

Over the past six years of working in machine learning (see Further Reading at this article's end), I've tried several strategies for integrating deep work into my ML practice. I've found that the best time to develop these habits is during your university or college years, when your schedule is more flexible. While lecture times might dictate parts of your day, you are still relatively free to structure your own studies.

Once you transition into the professional world, you’ll likely have less control over your schedule. However, even in these environments, you can still make time for deep work. Practically speaking, if deep work enables you to implement ML algorithms faster or develop better solutions, everyone benefits from your practice.

With that in mind, here are the three basic strategies that I found most practical for working deeply on machine learning. Use these as a starting point for implementing your own deep work practice.

Schedule Deep Work Blocks

Pre-scheduling blocks of deep work is probably the fastest approach to making time for deep, uninterrupted work. For example, set aside some hours in the morning for tasks that require maximum focus, such as reading papers, coding, or developing theories. In my experience, I found three to four hours of uninterrupted concentration to be a high upper bound, a time after which I usually need a (lunch) break to recharge. Generally, during this deep work time I am silencing my notifications.

Having two blocks of deep work per day (e.g., one in the morning, one in the evening) is doable, but might require practice, especially if you are not (or no longer) used to concentrating intensively over longer periods of time. For starters, I usually recommend two hours first thing in the morning, and then expanding the block with experience. Bonus point: schedule these blocks the evening before.

Batch Communication

Pre-scheduling blocks of deep work allows you to focus intensively, but it will not handle communication for you. Inspired by Newport, I suggest batching communication. Depending on your circumstances, you can check mail four times daily: directly in the morning, before lunch, after lunch, and late in the afternoon. I found that this approach minimizes interruptions while ensuring that you are still available for colleagues.

To repeat, skipping communication altogether is not an option and won’t help you and your colleagues. To some degree, machine learning lives from exchanging ideas and concerns – which is what conferences are made for.

Create a Distraction-free Environment

Once you have pre-scheduled times for deep work and handling communication, it's time to optimize your workspace for focus. In my experience, you have two levers: reducing sounds and reducing visual clutter. As for reducing sounds, I use earplugs (alternative: noise-canceling headphones) to block noise. For reducing visual distractions (think: conference posters, coworkers walking by, etc.), I try to work from distraction-free booths. If you work from home at least partially, then you can follow Darwin, who retreated to his office to work without distractions.


Charles Darwin's disciplined routine during the writing of On the Origin of Species offers valuable lessons for ML practitioners. Achieving breakthroughs, whether in natural science or machine learning, requires time to think and work deeply. In this article, I presented three strategies that I found most practical for building stretches of deep work into your machine learning practice: pre-schedule blocks of deep work, batch communication, and reduce environmental distractions. Start with these central strategies to implement your own deep work practice.


Acknowledgements: The introduction of Darwin is inspired by two Wikipedia entries. The story of Darwin’s routine is described in Deep Work by Cal Newport, another inspiration for this article.

Further reading:

How I’d Learn Machine Learning Again, After 6 Years

ML Beginners Should Read Papers

How (and Where) ML Beginners Can Find Papers

Deep Work: Rules for Focused Success in a Distracted World – Cal Newport

Sources:

The post From Darwin to Deep Work appeared first on Towards Data Science.

]]>