
Reducing the Size of Docker Images Serving Large Language Models (part 2)


How to reduce the size of a "small" Docker image by another 10%

Generated by Runway for the prompt: There are two containers on board the ship, one large and the other small. They are bright, vivid, and realistic in color.

Introduction

This is a continuation of the topic of reducing the size of Docker images serving large language models. In my previous story [1], I showed how to reduce the size of a Docker image serving a model from 7 GB to under 700 MB. The solution eliminated heavy libraries such as CUDA, cuDNN, cuBLAS, torch, and triton. It was possible by converting and quantizing the model to the ONNX format and using onnxruntime on CPU instead of torch on GPU.

In this story, I show how to reduce the size of the target image by another 10%. This might seem like overkill, as 700 MB is already a relatively small image. However, the techniques presented here offer a deeper look into a Docker image serving a language model. They help clarify which components are actually required to run the model and show that lighter alternatives exist for some of them.

Scripts and resources used in this story are also available on GitHub [2].



S-size Docker image

Let’s start by recalling the solution presented in [1], which allowed us to reduce the image size to under 700 MB.

Here is the Python code that implements the API endpoint:
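The exact listing is in the repository [2]; a minimal sketch of such an endpoint, with the FastAPI setup, endpoint path, and file names as illustrative assumptions (and with the tokenizer explicitly asked for numpy tensors), looks roughly like this:

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, PretrainedConfig

model_path = "models/xlm-roberta-base-language-detection-onnx"

tokenizer = AutoTokenizer.from_pretrained(model_path)
config = PretrainedConfig.from_pretrained(model_path)
ort_sess = ort.InferenceSession(model_path + "/model.onnx")

app = FastAPI()

class TextInput(BaseModel):
    text: str

@app.post("/detect")
def detect_language(input: TextInput):
    # tokenize the input text as numpy tensors
    text = [input.text]
    vector = tokenizer(text, padding=True, return_tensors="np")
    # convert the BatchEncoding object into a plain dict for onnxruntime
    vector = {k: v for k, v in vector.items()}
    # run inference and map the predicted class id to its label
    logits = ort_sess.run(None, vector)[0]
    label_id = int(np.argmax(logits, axis=1)[0])
    return {"language": config.id2label[label_id]}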

and the Dockerfile to build the image:
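Again, the exact file is in the repository [2]; as a rough sketch, with the base image and unpinned package versions as assumptions, it boils down to a slim Python base plus CPU-only dependencies:

FROM python:3.11-slim

WORKDIR /app

# CPU-only dependencies: no CUDA, cuDNN, cuBLAS, torch, or triton
RUN pip install --no-cache-dir fastapi uvicorn onnxruntime transformers

# quantized ONNX model, tokenizer, and config
COPY models/ models/
COPY api.py .

ENTRYPOINT ["uvicorn", "api:app", "--host", "0.0.0.0"]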

A more detailed explanation can be found in [1]. At this point, I would like to comment on one line in the Python script, as it is not so obvious – vector = {k: v for k, v in vector.items()}. At first glance, it does nothing, as it converts a dictionary into a … dictionary. In fact, it converts an object of type:

<class 'transformers.tokenization_utils_base.BatchEncoding'>

into a plain dict. This is required because the ort_sess.run method expects a dict as input. Without the conversion, a nasty exception is raised.

Now we can introduce a recipe for an even smaller Docker image serving the LLM: extra small, size XS.


XS-size Docker image

First, I will demonstrate the Python script for inference and the Dockerfile. Then, I will show the differences compared to the S-size image and discuss the changes.

Here is the Python script implementing the API endpoint:
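As with the S-size script, the exact listing is in the repository [2]; a minimal sketch (with the same illustrative assumptions, and with the inputs wrapped in numpy arrays for onnxruntime) looks roughly like this:

import json

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel
from tokenizers import Tokenizer

model_path = "models/xlm-roberta-base-language-detection-onnx"

# tokenizer and label mapping loaded without the transformers package
tokenizer = Tokenizer.from_file(model_path + "/tokenizer.json")
with open(model_path + "/config.json") as f:
    config = json.load(f)
ort_sess = ort.InferenceSession(model_path + "/model.onnx")

app = FastAPI()

class TextInput(BaseModel):
    text: str

@app.post("/detect")
def detect_language(input: TextInput):
    # tokenize and build the input dict expected by the ONNX model
    encoded = tokenizer.encode(input.text)
    vector = {
        "input_ids": np.array([encoded.ids], dtype=np.int64),
        "attention_mask": np.array([encoded.attention_mask], dtype=np.int64),
    }
    # run inference; json.load keeps the id2label keys as strings
    logits = ort_sess.run(None, vector)[0]
    label_id = int(np.argmax(logits, axis=1)[0])
    return {"language": config["id2label"][str(label_id)]}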

And the Dockerfile to build the image:
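As before, the exact file is in the repository [2]; a hedged sketch, with the base image, file names, and versions as assumptions, highlights the two changes: the swapped dependency and the compressed model unpacked by an entrypoint script:

FROM python:3.11-slim

WORKDIR /app

# tokenizers replaces the much heavier transformers package
RUN pip install --no-cache-dir fastapi uvicorn onnxruntime tokenizers

# the model and tokenizer are shipped as a compressed archive
COPY models/xlm-roberta-base-language-detection-onnx.tar.gz models/
COPY api.py .
COPY entrypoint_onnx_xs.sh .

# the entrypoint script unpacks the archive and starts the API
ENTRYPOINT ["bash", "entrypoint_onnx_xs.sh"]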

Here is the side-by-side comparison of the files with highlighted differences.

Image by author: Comparison of the Dockerfiles.
Image by author: Comparison of the Python scripts.

We have two major differences, which are explained in the following subsections.

Replace transformers with tokenizers

In the initial solution, the transformers library is used for two purposes:

  1. Tokenize the input text, i.e., transform the text into a dictionary of subtoken identifiers and attention masks.
  2. Load the label mapping from the configuration (integers to string labels).

Both tasks can be accomplished using lighter libraries. For reference, the transformers library alone weighs around 70 MB.

Tokenization can be performed by the tokenizers library, which transformers already uses under the hood. The library's size is only 12 MB. However, there is a bit more work to do, as AutoTokenizer wraps the output in the right data structure – a dictionary with model-specific attributes.

To load the tokenizer, instead of:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)

we do the following:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file(model_path + "/tokenizer.json")

We use the class method Tokenizer.from_file and pass the path directly to the tokenizer.json file.

To tokenize the texts, we replace this code block:

text = [input.text]
vector = tokenizer(text, padding=True)
vector = {k: v for k, v in vector.items()}

with these lines:

encoded = tokenizer.encode(input.text)
vector = {
    "input_ids": [encoded.ids],
    "attention_mask": [encoded.attention_mask],
}

Here, we manually create the dict structure with specific fields the model requires. The xlm-roberta model requires two attributes: input_ids and attention_mask. Other models may require a different set of attributes.

This is for the tokenization part. Now let’s change the way we load the label mapping.

Label mapping

We will use json.load(...) to read the mapping from the model config file. The method returns the config as a plain dict. The main difference between the output of PretrainedConfig.from_pretrained(...) and json.load(...) is that the former returns a PretrainedConfig object, while the latter returns a dict. To access the id2label mapping on the PretrainedConfig object, we use an attribute: config.id2label. To get the mapping from the dict, we use id2label as a key: config["id2label"]. The other difference is the type of the keys in the mapping. In the JSON file they are strings, and PretrainedConfig automatically casts them to int values; thus, we retrieve the label by calling config.id2label[label_id]. json.load(...), in turn, does not cast the keys and keeps them as strings. This is why we need to cast the int label id to a string explicitly – config["id2label"][str(label_id)].
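In code, the change is minimal; a short sketch, assuming the standard config.json file name and the model_path and label_id variables used above:

import json

# load the model config as a plain dict instead of a PretrainedConfig object
with open(model_path + "/config.json") as f:
    config = json.load(f)

# json.load keeps the id2label keys as strings, so cast the predicted id
label = config["id2label"][str(label_id)]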

That’s all for the first part, which covered using the tokenizers package instead of transformers. Now, let’s move on to the second part, related to model compression.

Compress the model

The second technique for reducing the size of the Docker image is shipping the model as a compressed archive. Keep in mind that this reduces the size of the Docker image only for storage and transportation (pushing to and pulling from the image registry). This is beneficial, as it reduces data transfer over the network and disk usage. At runtime, the model must be unpacked, so it will occupy more storage while the container is running. Another benefit is that the model may already be distributed as a compressed archive (for example, as an artifact stored in WandB [3], MLflow [4], or any other ML experiment tracking platform), so no additional steps are required.

The potential downside of this technique is the additional time needed to unpack the archive when the container starts. If you are optimizing for startup time, this approach might not suit you. In other cases, the decompression time is negligible, and we save some additional megabytes. In my case, compressing the model and tokenizer reduced their size from 301 MB to 219 MB.
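For reference, such an archive can be created with a single tar command before building the image (the directory layout is an illustrative assumption):

tar -czvf models/xlm-roberta-base-language-detection-onnx.tar.gz \
    -C models/ xlm-roberta-base-language-detection-onnx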

To decompress the model at runtime, I moved the entrypoint actions to a separate bash script:

#!/bin/bash

# unpack the compressed ONNX model into the models/ directory
tar -xvf models/xlm-roberta-base-language-detection-onnx.tar.gz -C models/

# start the API server
uvicorn api:app --host 0.0.0.0

and modify the ENTRYPOINT in the Dockerfile:

ENTRYPOINT ["bash", "entrypoint_onnx_xs.sh"]

And that’s it. In the next section, you will see the final result of our efforts in terms of Docker image size.


Docker image comparison

Let’s see what variants of the Docker image we have created so far:

  • language_detection_cuda – inference on GPU using the Torch backend and full model,
  • language_detection_onnx – inference on CPU using onnxruntime and quantized model,
  • language_detection_onnx_xs – inference on CPU using onnxruntime and quantized model, with the model compressed into an archive and the tokenizers package used instead of transformers.

To compare their sizes, let’s list the images:

docker images | grep language_detection

To simplify, the output contains only the image name and its size:

language_detection_cuda     (...)      7.05GB
language_detection_onnx     (...)      699MB
language_detection_onnx_xs  (...)      575MB

For a visual comparison, here is a chart with the Docker image sizes. The X axis is on a logarithmic scale so that the difference between ONNX-S and ONNX-XS is visible.

Image by author, generated using [5]

The final Docker image is 575 MB. Compared to the initial image, the size was reduced roughly 12-fold.


Conclusions

We have seen that it is possible to reduce the size of a Docker image serving an LLM from gigabytes to hundreds of megabytes. In our case, it went from 7 GB to 575 MB. Such a significant size reduction is useful when we are constrained by network transfer (pushing and pulling the image over a network), the image registry’s limitations, or the production server’s storage limitations.

Despite the many benefits of using small images, some downsides were not discussed here: slower inference and potentially lower performance. Many aspects should be considered when choosing the right approach: business requirements, expected performance and inference time, and available infrastructure. Since there are many factors to consider, this is material for a separate story – part 3 🙂


References

[1] https://towardsdatascience.com/reducing-the-size-of-docker-images-serving-llm-models-b70ee66e5a76

[2] https://github.com/CodeNLP/codenlp-docker-ml

[3] https://wandb.com

[4] https://mlflow.org

[5] https://www.rapidtables.com/tools/bar-graph.html

