How to reduce the size of a "small" Docker image by another 10%

Introduction
This is a continuation of the topic of reducing the size of Docker images serving large language models. In my previous story [1], I presented how to reduce the size of a Docker image serving a model from 7 GB to under 700 MB. The solution eliminated heavy libraries such as CUDA, cuDNN, cuBLAS, torch, and triton. It was possible by converting and quantizing the model to the ONNX format and using `onnxruntime` on CPU instead of `torch` on GPU.
In this story, I present how to reduce the size of the target image by another 10%. This might seem like overkill, as 700 MB is already a relatively small image. However, the techniques presented here offer a deeper look into a Docker image serving a language model: they help clarify which components are required to run the model and show that lighter alternatives exist for some of them.
Scripts and resources used in this story are also available on GitHub [2]:
[GitHub – CodeNLP/codenlp-docker-ml](https://github.com/CodeNLP/codenlp-docker-ml): This repository demonstrates how to create a small Docker image…
S-size Docker image
Let’s start by recalling the solution presented in [1], which allowed us to reduce the image size to under 700 MB.
Here is the Python code that implements the API endpoint:
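The script itself lives in the repository [2]; as a minimal sketch, assuming a FastAPI app, a model exported to `model.onnx`, and a `/detect` route (the file layout, route name, and `Input` model below are illustrative, not the original code), the endpoint looks roughly like this:
```python
# api.py – sketch of the S-size endpoint (paths and route are assumptions)
import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoTokenizer, PretrainedConfig

MODEL_PATH = "models/xlm-roberta-base-language-detection-onnx"  # assumed layout

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
config = PretrainedConfig.from_pretrained(MODEL_PATH)
ort_sess = ort.InferenceSession(MODEL_PATH + "/model.onnx")  # file name is an assumption

app = FastAPI()


class Input(BaseModel):
    text: str


@app.post("/detect")
def detect(input: Input):
    text = [input.text]
    vector = tokenizer(text, padding=True)
    # BatchEncoding -> plain dict expected by ort_sess.run (see the note below);
    # the explicit int64 cast is a safety net for onnxruntime tensor inputs
    vector = {k: np.array(v, dtype=np.int64) for k, v in vector.items()}
    logits = ort_sess.run(None, vector)[0]
    label_id = int(np.argmax(logits, axis=1)[0])
    return {"language": config.id2label[label_id]}
```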
and the Dockerfile to build the image:
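As a sketch of the corresponding Dockerfile; the base image, the way dependencies are installed, and the file layout are assumptions rather than the original file:
```dockerfile
# Sketch of the S-size Dockerfile (base image, packages, and paths are assumptions)
FROM python:3.11-slim

WORKDIR /app

RUN pip install --no-cache-dir fastapi uvicorn onnxruntime transformers

COPY models/ models/
COPY api.py .

ENTRYPOINT ["uvicorn", "api:app", "--host", "0.0.0.0"]
```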
A more detailed explanation can be found in [1]. At this point, I would like to add a comment about one line in the original script (line 25), as it is not so obvious: `vector = {k: v for k, v in vector.items()}`. At first glance, it does nothing, as it converts a dictionary into a… dictionary. In fact, it converts an object of type `<class 'transformers.tokenization_utils_base.BatchEncoding'>` into a plain `dict`. This is required because the `ort_sess.run` method expects a `dict` as input. Without the conversion, a nasty exception will arise.
Now we can introduce a recipe for an even smaller Docker image serving the LLM: super small size, XS.
XS-size Docker image
First, I will demonstrate the Python script for inference and the Dockerfile. Then, I will show the differences compared to the S-size image and discuss the changes.
Here is the Python script implementing the API endpoint:
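Again as a sketch rather than the original script (which is in the repository [2]); the file layout and route name are illustrative, and the two key changes – `tokenizers` plus `json` instead of `transformers` – are discussed in detail below:
```python
# api.py – sketch of the XS-size endpoint (paths and route are assumptions)
import json

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI
from pydantic import BaseModel
from tokenizers import Tokenizer

MODEL_PATH = "models/xlm-roberta-base-language-detection-onnx"  # unpacked at start-up

tokenizer = Tokenizer.from_file(MODEL_PATH + "/tokenizer.json")
with open(MODEL_PATH + "/config.json") as f:
    config = json.load(f)
ort_sess = ort.InferenceSession(MODEL_PATH + "/model.onnx")  # file name is an assumption

app = FastAPI()


class Input(BaseModel):
    text: str


@app.post("/detect")
def detect(input: Input):
    encoded = tokenizer.encode(input.text)
    # build the input dict by hand; int64 arrays are what the exported model expects
    vector = {
        "input_ids": np.array([encoded.ids], dtype=np.int64),
        "attention_mask": np.array([encoded.attention_mask], dtype=np.int64),
    }
    logits = ort_sess.run(None, vector)[0]
    label_id = int(np.argmax(logits, axis=1)[0])
    return {"language": config["id2label"][str(label_id)]}
```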
And the Dockerfile to build the image:
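And a sketch of the corresponding Dockerfile, again with the base image, packages, and paths as assumptions; the two visible changes are that the model is copied as a compressed archive and that start-up is delegated to a bash script, both discussed below:
```dockerfile
# Sketch of the XS-size Dockerfile (base image, packages, and paths are assumptions)
FROM python:3.11-slim

WORKDIR /app

RUN pip install --no-cache-dir fastapi uvicorn onnxruntime tokenizers

COPY models/xlm-roberta-base-language-detection-onnx.tar.gz models/
COPY api.py entrypoint_onnx_xs.sh ./

ENTRYPOINT ["bash", "entrypoint_onnx_xs.sh"]
```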
A side-by-side comparison of the two files reveals two major differences, which are explained in the following subsections.
Replace transformers with tokenizers
In the initial solution, the `transformers` library is used for two purposes:
- Tokenize the input text, i.e., transform the texts into a dictionary of subtoken identifiers and attention masks.
- Load the label mapping from the configuration (integers to string labels).
Both tasks can be accomplished using lighter libraries. By the way, the `transformers` library's size is around 70 MB.
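If you want to verify these numbers in your own environment, one rough way is to sum the file sizes under each installed package; the exact figures depend on the package versions and platform:
```python
# Rough on-disk size of the installed packages (results vary by version/platform)
import os

import tokenizers
import transformers


def package_size_mb(module) -> float:
    root = os.path.dirname(module.__file__)
    total = sum(
        os.path.getsize(os.path.join(dirpath, name))
        for dirpath, _, files in os.walk(root)
        for name in files
    )
    return total / (1024 * 1024)


print(f"transformers: {package_size_mb(transformers):.0f} MB")
print(f"tokenizers:   {package_size_mb(tokenizers):.0f} MB")
```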
Tokenization can be performed by the `tokenizers` library, which is already used under the hood by `transformers`. The library's size is only 12 MB. However, there is a bit more to do, as `AutoTokenizer` wraps the output in the right data structure – a dictionary with model-specific attributes.
To load the tokenizer, instead of:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)
```
we do the following:
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file(model_path + "/tokenizer.json")
```
We use the class method `Tokenizer.from_file` and pass the path directly to the `tokenizer.json` file.
To tokenize the texts, we replace this code block:
```python
text = [input.text]
vector = tokenizer(text, padding=True)
vector = {k: v for k, v in vector.items()}
```
with these lines:
```python
encoded = tokenizer.encode(input.text)
vector = {
    "input_ids": [encoded.ids],
    "attention_mask": [encoded.attention_mask],
}
```
Here, we manually create the `dict` structure with the specific fields the model requires. The `xlm-roberta` model requires two attributes: `input_ids` and `attention_mask`. Other models may require a different set of attributes.
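If you are not sure which input attributes a given ONNX model expects, `onnxruntime` can list them (the model path below is illustrative):
```python
import onnxruntime as ort

ort_sess = ort.InferenceSession("models/xlm-roberta-base-language-detection-onnx/model.onnx")
for model_input in ort_sess.get_inputs():
    # e.g. "input_ids tensor(int64) ['batch_size', 'sequence_length']"
    print(model_input.name, model_input.type, model_input.shape)
```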
That covers the tokenization part. Now let’s change the way we load the label mapping.
Label mapping
We will use `json.load(...)` to load the mapping from the model config file. The method returns a `dict` with the contents of the config file. The main difference between the output of `PretrainedConfig.from_pretrained(...)` and `json.load(...)` is that the former returns an object of the `PretrainedConfig` class, while the latter returns a plain `dict`. To access the id2label mapping on the `PretrainedConfig` object, we read it as an attribute, that is `config.id2label`. In turn, to get the mapping from the `dict`, we have to use id2label as a key, that is `config["id2label"]`. The other difference is the type of the keys in the mapping. In the config file they are stored as strings, but `PretrainedConfig` automatically casts them to int values; thus, we retrieve the label by calling `config.id2label[label_id]` with an int `label_id`. In turn, `json.load(...)` does not cast the keys and keeps each of them as a string. This is why we need to cast the int value to a string explicitly: `config["id2label"][str(label_id)]`.
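Put side by side, the two ways of reading the mapping look roughly like this (the model path and label id are illustrative):
```python
import json

from transformers import PretrainedConfig

model_path = "models/xlm-roberta-base-language-detection-onnx"  # illustrative path
label_id = 4  # some class index returned by the model

# S-size image: PretrainedConfig casts the id2label keys to int
config = PretrainedConfig.from_pretrained(model_path)
print(config.id2label[label_id])

# XS-size image: json.load keeps the keys as strings
with open(model_path + "/config.json") as f:
    config = json.load(f)
print(config["id2label"][str(label_id)])
```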
That’s all for the first part, related to using the `tokenizers` package instead of `transformers`. Now, let’s move to the other part, related to model compression.
Compress the model
The second technique for reducing the size of the Docker image is shipping the model as a compressed archive. Please keep in mind that this reduces the size of the Docker image for storage and transportation (pushing to and pulling from the image registry). This is beneficial, as it reduces data transfer over the network and disk usage. At runtime, however, the model must be unpacked, so it will occupy more storage while the container is running. The other benefit is that the model may already be distributed as a compressed archive (as an artifact stored in WandB [3], MLflow [4], or any other platform for ML experiment tracking), so no additional packaging step is required.
The potential downside of this technique is the additional time it takes to start the container, since the archive has to be unpacked first. If you are optimizing for startup time, this approach might not suit you. In other cases, the decompression time is negligible, and we can save some additional megabytes. In my case, compressing the model and tokenizer reduced their size from 301 MB to 219 MB.
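If the model is not yet distributed as an archive, creating one is a single command; the archive name matches the one used in the entrypoint script below, and the directory name is my assumption about the unpacked layout:
```bash
# Compress the exported model and tokenizer before building the image
tar -czvf models/xlm-roberta-base-language-detection-onnx.tar.gz \
    -C models/ xlm-roberta-base-language-detection-onnx
```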
To decompress the model at runtime, I moved the entrypoint actions to a separate bash script:
```bash
#!/bin/bash

tar -xvf models/xlm-roberta-base-language-detection-onnx.tar.gz -C models/
uvicorn api:app --host 0.0.0.0
```
and modify the ENTRYPOINT in the Dockerfile:
```dockerfile
ENTRYPOINT ["bash", "entrypoint_onnx_xs.sh"]
```
And that’s it. In the next section, you will see the final outcome of our efforts in terms of Docker image size.
Docker image comparison
Let’s review the variants of the Docker image that we have created so far:
- `language_detection_cuda` – inference on GPU using the `torch` backend and the full model,
- `language_detection_onnx` – inference on CPU using `onnxruntime` and the quantized model,
- `language_detection_onnx_xs` – inference on CPU using `onnxruntime` and the quantized model, with model compression and the `tokenizers` package instead of `transformers`.
```bash
docker images | grep language_detection
```
To simplify, the output contains only the image name and its size:
```
language_detection_cuda     (...)  7.05GB
language_detection_onnx     (...)  699MB
language_detection_onnx_xs  (...)  575MB
```
For a visual comparison, here is a chart with the Docker image sizes. The X axis is on a logarithmic scale so that the difference between ONNX-S and ONNX-XS remains visible.
![Image by author, generated using [5]](https://towardsdatascience.com/wp-content/uploads/2024/05/1yrv7Jy9-YciA2CtQJ7bLWQ.png)
The final Docker image has a size of 575 MB. Compared to the initial image, that is a reduction by a factor of roughly 12.
Conclusions
We have seen that it is possible to reduce the size of a Docker image serving an LLM from gigabytes to hundreds of megabytes; in our case, from 7 GB to 575 MB. Such a significant size reduction can be useful when we are constrained by network transfers (pushing and pulling the image over a network), the image registry’s limitations, or the production server’s memory limitations.
Despite the many benefits of using small images, some downsides were not discussed here: slower inference and potentially lower performance. Many aspects should be considered when choosing the right approach: business requirements, expected performance and inference time, and available infrastructure. Since there are many factors to consider, this is material for a separate story – part 3 🙂
References
[1] https://towardsdatascience.com/reducing-the-size-of-docker-images-serving-llm-models-b70ee66e5a76
[2] https://github.com/CodeNLP/codenlp-docker-ml