Kubernetes — Understanding and Utilizing Probes Effectively

Why proper configuration and implementation of Kubernetes probes is vital for any critical deployment

Introduction

Let’s talk about Kubernetes probes and why they matter in your deployments. When managing production-facing containerized applications, even small optimizations can have enormous benefits.

Reducing deployment times, helping your applications react better to scaling events, and keeping your running pods healthy all require fine-tuning your container lifecycle management. This is exactly why proper configuration — and implementation — of Kubernetes probes is vital for any critical deployment. They help your cluster make intelligent decisions about traffic routing, restarts, and resource allocation.

Properly configured probes dramatically improve your application reliability, reduce deployment downtime, and handle unexpected errors gracefully. In this article, we’ll explore the three types of probes available in Kubernetes and how utilizing them alongside each other helps configure more resilient systems.

Quick refresher

Understanding exactly what each probe does and some common configuration patterns is essential. Each of them serves a specific purpose in the container lifecycle and when used together, they create a rock-solid framework for maintaining your application availability and performance.

Startup: Optimizing start-up times

Start-up probes run only when a new pod is spun up because of a scale-up event or a new deployment, and stop once they succeed. They serve as a gatekeeper for the rest of the container checks, and fine-tuning them will help your applications better handle increased load or service degradation.

Sample Config:

startupProbe:
  httpGet:
    path: /health
    port: 80
  failureThreshold: 30
  periodSeconds: 10

Key takeaways:

  • Keep periodSeconds low, so that the probe fires often, quickly detecting a successful deployment.
  • Increase failureThreshold to a high enough value to accommodate for the worst-case start-up time.

The Startup probe will check whether your container has started by querying the configured path. It will additionally stop the triggering of the Liveness and Readiness probes until it is successful.

Liveness: Detecting dead containers

Your liveness probes answer a very simple question: “Is this pod still running properly?” If not, K8s will restart it.

Sample Config:

livenessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 10
  failureThreshold: 3

Key takeaways:

  • Since K8s will completely restart your container and spin up a new one, add a failureThreshold to combat intermittent abnormalities.
  • Avoid using initialDelaySeconds as it is too restrictive — use a Start-up probe instead.

Be mindful that a failing Liveness probe will bring down your currently running pod and spin up a new one, so avoid making it too aggressive — that’s for the next one.

Readiness: Handling unexpected errors

The readiness probe determines whether a pod should start — or continue — to receive traffic. It is extremely useful in situations where your container has lost its connection to the database or is otherwise over-utilized and should not receive new requests.

Sample Config:

readinessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 3
  failureThreshold: 1
  timeoutSeconds: 1

Key takeaways:

  • Since this is your first guard against sending traffic to unhealthy targets, make the probe aggressive and reduce periodSeconds.
  • Keep failureThreshold at a minimum; you want to fail fast.
  • Keep timeoutSeconds low as well, so slow containers are flagged quickly.
  • Give the readinessProbe ample time to recover by having a longer-running livenessProbe.

Readiness probes ensure that traffic will not reach a container that is not ready for it, which makes them one of the most important probes in the stack.

Putting it all together

As you can see, even if all of the probes have their own distinct uses, the best way to improve your application’s resilience strategy is using them alongside each other.

Your startup probe will assist you in scale-up scenarios and new deployments, allowing your containers to be brought up quickly. It runs only during start-up and holds off the other probes until it completes successfully.

The liveness probe helps in dealing with dead containers suffering from non-recoverable errors and tells the cluster to bring up a new, fresh pod just for you.

The readiness probe is the one telling K8s whether a pod should receive traffic. It can be extremely useful for dealing with intermittent errors or high resource consumption that results in slower response times.
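Putting the three sample configs from above into one place, a container spec that uses all of them together could look roughly like this (the /health endpoint and port 80 are the same illustrative values used earlier, and the image name is a placeholder):

containers:
  - name: my-app
    image: my-app:latest          # placeholder image
    ports:
      - containerPort: 80
    startupProbe:                 # gatekeeper: runs first, pauses the other probes
      httpGet:
        path: /health
        port: 80
      failureThreshold: 30        # generous budget for worst-case start-up time
      periodSeconds: 10
    livenessProbe:                # restarts the container when it is truly dead
      httpGet:
        path: /health
        port: 80
      periodSeconds: 10
      failureThreshold: 3         # tolerate intermittent blips
    readinessProbe:               # aggressively stops traffic to unhealthy replicas
      httpGet:
        path: /health
        port: 80
      periodSeconds: 3
      failureThreshold: 1
      timeoutSeconds: 1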

Additional configurations

Probes can be further configured to use a command in their checks instead of an HTTP request, as well as giving ample time for the container to safely terminate. While these are useful in more specific scenarios, understanding how you can extend your deployment configuration can be beneficial, so I’d recommend doing some additional reading if your containers handle unique use cases.
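For illustration, here is a hedged sketch of a liveness check that runs a command instead of an HTTP request, combined with a longer termination grace period; the /tmp/healthy file is a made-up convention, not something Kubernetes provides:

spec:
  terminationGracePeriodSeconds: 60   # give the container up to 60s to shut down cleanly
  containers:
    - name: my-app
      image: my-app:latest            # placeholder image
      livenessProbe:
        exec:
          command:                    # healthy as long as this command exits with 0
            - cat
            - /tmp/healthy            # hypothetical file the app touches while alive
        periodSeconds: 10
        failureThreshold: 3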

Further reading:
Liveness, Readiness, and Startup Probes
Configure Liveness, Readiness and Startup Probes

Designing, Building & Deploying an AI Chat App from Scratch (Part 2) https://towardsdatascience.com/designing-building-deploying-an-ai-chat-app-from-scratch-part-2-c75f712eebe5/ Mon, 20 Jan 2025 10:02:26 +0000 https://towardsdatascience.com/designing-building-deploying-an-ai-chat-app-from-scratch-part-2-c75f712eebe5/ Cloud Deployment and Scaling

Photo by Alex wong on Unsplash

1. Introduction

In the previous post, we built an AI-powered chat application on our local computer using microservices. Our stack included FastAPI, Docker, Postgres, Nginx and llama.cpp. The goal of this post is to learn more about the fundamentals of cloud deployment and scaling by deploying our app to Azure, making it available to real users. We’ll use Azure because they offer a free education account, but the process is similar for other platforms like AWS and GCP.

You can check a live demo of the app at chat.jorisbaan.nl. Now, obviously, this demo isn’t very large-scale, because the costs ramp up very quickly. With the tight scaling limits I configured I reckon it can handle about 10–40 concurrent users until I run out of Azure credits. However, I do hope it demonstrates the principles behind a scalable production system. We could easily configure it to scale to many more users with a higher budget.

I give a complete breakdown of our infrastructure and the costs at the end. The codebase is at https://github.com/jsbaan/ai-app-from-scratch.

A quick demo of the app at chat.jorisbaan.nl. We start a new chat, come back to that same chat, and start another chat.

1.1. Recap: local application

Let’s recap how we built our local app: A user can start or continue a chat with a language model by sending an HTTP request to http://localhost. An Nginx reverse proxy receives and forwards the request to a UI over a private Docker network. The UI stores a session cookie to identify the user, and sends requests to the backend: the language model API that generates text, and the database API that queries the database server.

Local architecture of the app. See part 1 for more details. Made by author in draw.io.

Table of contents

  1. Introduction
     1.1 Recap: local application

  2. Cloud architecture
     2.1 Scaling
     2.2 Kubernetes Concepts
     2.3 Azure Container Apps
     2.4 Azure architecture: putting it all together

  3. Deployment
     3.1 Setting up
     3.2 PostgreSQL server deployment
     3.3 Azure Container App Environment deployment
     3.4 Azure Container Apps deployment
     3.5 Scaling our Container Apps
     3.6 Custom domain name & HTTPS

  4. Resources & costs overview
  5. Roadmap
  6. Final thoughts
     Acknowledgements
     AI usage

2. Cloud architecture

Conceptually, our cloud architecture will not be too different from our local application: a bunch of containers in a private network with a gateway to the outside world, our users.

However, instead of running containers on our local computer with Docker Compose, we will deploy them to a computing environment that automatically scales across virtual or physical machines to many concurrent users.

2.1 Scaling

Scaling is a central concept in cloud architectures. It means being able to dynamically handle varying numbers of users (i.e., HTTP requests). Uvicorn, the web server running our UI and database API, can already handle about 40 concurrent requests. It’s even possible to use another web server called Gunicorn as a process manager that employs multiple Uvicorn workers in the same container, further increasing concurrency.
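For instance, assuming both gunicorn and uvicorn are installed in the container, running a FastAPI app with Gunicorn managing several Uvicorn workers looks roughly like this (the module path app.main:app and the worker count are placeholders):

gunicorn app.main:app \
  --workers 4 \
  --worker-class uvicorn.workers.UvicornWorker \
  --bind 0.0.0.0:80

Each worker is a separate process, so a single container can then serve several requests in parallel.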

Now, if we want to support even more concurrent requests, we could give each container more resources, like CPUs or memory (vertical scaling). However, a more reliable approach is to dynamically create copies (replicas) of a container based on the number of incoming HTTP requests or memory/CPU usage, and distribute the incoming traffic across replicas (horizontal scaling). Each replica container will be assigned an IP address, so we also need to think about networking: how to centrally receive all requests and distribute them over the container replicas.

This "prism" pattern is important: requests arrive centrally in some server (a load balancer) and fan out for parallel processing to multiple other servers (e.g., several identical UI containers).

Photo of two prisms by Fernando @cferdophotography on Unsplash

2.2 Kubernetes Concepts

Kubernetes is the industry standard system for automating deployment, scaling and management of containerized applications. Its core concepts are crucial to understand modern cloud architectures, including ours, so let’s quickly review the basics.

  • Node: A physical or virtual machine that runs containerized apps or manages the cluster.
  • Cluster: A set of Nodes managed by Kubernetes.
  • Pod: The smallest deployable unit in Kubernetes. Runs one main app container with optional secondary containers that share storage and networking.
  • Deployment: An abstraction that manages the desired state of a set of Pod replicas by deploying, scaling and updating them.
  • Service: An abstraction that manages a stable entrypoint (the service’s DNS name) to expose a set of Pods by distributing incoming traffic over the various dynamic Pod IP addresses. A Service has multiple types:
  • A ClusterIP Service exposes Pods within the Cluster.
  • A LoadBalancer Service exposes Pods to outside the Cluster. It triggers the cloud provider to provision an external public IP and load balancer outside the cluster that can be used to reach the cluster. These external requests are then routed via the Service to individual Pods.
  • Ingress: An abstraction that defines more complex rules for a cluster’s entrypoint. It can route traffic to multiple Services; give Services externally-reachable URLs; load balance traffic; and handle secure HTTPS.
  • Ingress Controller: Implements the Ingress rules. For example, an Nginx-based controller runs an Nginx server (like in our local app) under the hood that is dynamically configured to route traffic according to Ingress rules. To expose the Ingress Controller itself to the outside world, you can use a LoadBalancer Service. This architecture is often used; a minimal manifest sketch follows this list.
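To make these abstractions a bit more concrete, here is a rough, hedged sketch of a Deployment with three replicas and a LoadBalancer Service exposing them. The names, image, and replica count are illustrative only and are not part of the chat app's actual configuration.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ui                      # illustrative name
spec:
  replicas: 3                   # horizontal scaling: three identical Pods
  selector:
    matchLabels:
      app: ui
  template:
    metadata:
      labels:
        app: ui
    spec:
      containers:
        - name: ui
          image: registry.example.com/ui:latest   # hypothetical image
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: ui
spec:
  type: LoadBalancer            # asks the cloud provider for an external IP
  selector:
    app: ui                     # route traffic to Pods with this label
  ports:
    - port: 80
      targetPort: 80

The Deployment keeps three identical Pods running, and the Service gives them a single stable entrypoint; an Ingress plus Ingress Controller would sit in front of Services like this one.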

2.3 Azure Container Apps

Armed with these concepts, instead of deploying our app with Kubernetes directly, I wanted to experiment a little by using Azure Container Apps (ACA). This is a serverless platform built on top of Kubernetes that abstracts away some of its complexity.

With a single command, we can create a Container App Environment, which, under the hood, is an invisible Kubernetes Cluster managed by Azure. Within this Environment, we can run a container as a Container App that Azure internally manages as Kubernetes Deployments, Services, and Pods. See article 1 and article 2 for detailed comparisons.

A Container App Environment also auto-creates:

  1. An invisible Envoy Ingress Controller that routes requests to internal Apps and handles HTTPS and App auto-scaling based on request volume.
  2. An external Public IP address and Azure Load Balancer that routes external traffic to the Ingress Controller that in turn routes it to Apps (sounds similar to a Kubernetes LoadBalancer Service, eh?).
  3. An Azure-generated URL for each Container App that is publicly accessible over the internet or internal, based on its Ingress config.

This gives us everything we need to run our containers at scale. The only thing missing is a database. We will use an Azure-managed PostgreSQL server instead of deploying our own container, because it’s easier, more reliable and scalable. Our local Nginx reverse proxy container is also obsolete because ACA automatically deploys an Envoy Ingress Controller.

It’s interesting to note that we literally don’t have to change a single line of code in our local application; we can just treat it as a bunch of containers!

2.4 Azure architecture: putting it all together

Here is a diagram of the full Cloud architecture for our chat application that contains all our Azure resources. Let’s take a high level look at how a user request flows through the system.

Azure architecture diagram. Made by author in draw.io.
  1. User sends HTTPS request to chat.jorisbaan.nl.
  2. A Public DNS server like Google DNS resolves this domain name to an Azure Public IP address.
  3. The Azure Load Balancer on this IP address routes the request to the (for us invisible) Envoy Ingress Controller.
  4. The Ingress Controller routes the request to the UI Container App, which routes it to one of its Replicas where a UI web server is running.
  5. The UI web server makes requests to the database API and language model API Apps, which both route them to one of their Replicas.
  6. A database API replica queries the PostgreSQL server hostname. The Azure Private DNS Zone resolves the hostname to the PostgreSQL server’s IP address.

3. Deployment

So, how do we actually create all this? Rather than clicking around in the Azure Portal, infrastructure-as-code tools like Terraform are best to create and manage cloud resources. However, for simplicity, I will instead use the Azure CLI to create a bash script that deploys our entire application step by step. You can find the full Deployment script including environment variables here 🤖. We will go through it step by step now.

3.1 Setting up

We need an Azure account (I’m using a free education account), a clone of the https://github.com/jsbaan/ai-app-from-scratch repo, Docker to build and push the container images, the downloaded model, and the Azure CLI to start creating cloud resources.

We first create a resource group so our resources are easier to find, manage and delete. The --location parameter refers to the physical datacenter we’ll use to deploy our app’s infrastructure. Ideally, it is close to our users. We then create a private virtual network with 256 IP addresses to isolate, secure and connect our database server and Container Apps.

brew update && brew install azure-cli # for macos

echo "Create resource group"
az group create \
  --name $RESOURCE_GROUP \
  --location "$LOCATION"

echo "Create VNET with 256 IP addresses"
az network vnet create \
  --resource-group $RESOURCE_GROUP \
  --name $VNET \
  --address-prefix 10.0.0.0/24 \
  --location $LOCATION

3.2 PostgreSQL server deployment

Depending on the hardware, an Azure-managed PostgreSQL database server costs about $13 to $7000 a month. To communicate with Container Apps, we put the DB server within the same private virtual network but in its own subnet. A subnet is a dedicated range of IP addresses that can have its own security and routing rules.

We create the Azure PostgreSQL Flexible Server with private access. This means only resources within the same virtual network can reach it. Azure automatically creates a Private DNS Zone that manages a hostname for the database that resolves to its IP address. The database API will later use this hostname to connect to the database server.

We will randomly generate the database credentials and store them in a secure place: Azure KeyVault.

echo "Create subnet for DB with 128 IP addresses"
az network vnet subnet create 
  --resource-group $RESOURCE_GROUP 
  --name $DB_SUBNET 
  --vnet-name $VNET 
  --address-prefix 10.0.0.128/25

echo "Create a key vault to securely store and retrieve secrets, 
like the db password"
az keyvault create 
  --name $KEYVAULT 
  --resource-group $RESOURCE_GROUP 
  --location $LOCATION

echo "Give myself access to the key vault so I can store and retrieve 
the db password"
az role assignment create 
  --role "Key Vault Secrets Officer" 
  --assignee $EMAIL 
  --scope "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.KeyVault/vaults/$KEYVAULT"

echo "Store random db username and password in the key vault"
az keyvault secret set 
  --name postgres-username 
  --vault-name $KEYVAULT  
  --value $(openssl rand -base64 12 | tr -dc 'a-zA-Z' | head -c 12)
az keyvault secret set 
  --name postgres-password 
  --vault-name $KEYVAULT  
  --value $(openssl rand -base64 16)

echo "While we're at it, let's already store a secret session key for the UI"
az keyvault secret set 
  --name session-key 
  --vault-name $KEYVAULT  
  --value $(openssl rand -base64 16)

echo "Create PostgreSQL flexible server in our VNET in its own subnet. 
Auto-creates Private DS Zone."
POSTGRES_USERNAME=$(az keyvault secret show --name postgres-username --vault-name $KEYVAULT --query "value" --output tsv)
POSTGRES_PASSWORD=$(az keyvault secret show --name postgres-password --vault-name $KEYVAULT --query "value" --output tsv)
az postgres flexible-server create 
  --resource-group $RESOURCE_GROUP 
  --name $DB_SERVER 
  --vnet $VNET 
  --subnet $DB_SUBNET 
  --location $LOCATION 
  --admin-user $POSTGRES_USERNAME 
  --admin-password $POSTGRES_PASSWORD 
  --sku-name Standard_B1ms 
  --tier Burstable 
  --storage-size 32 
  --version 16 
  --yes

3.3 Azure Container App Environment deployment

With the network and database in place, let’s deploy the infrastructure to run containers – the Container App Environment (recall, this is a Kubernetes cluster under the hood).

We create another subnet with 128 IP addresses and delegate its management to the Container App Environment. The subnet should be big enough for every ten new replicas to get a new IP address in the subrange. We can then create the Environment. This is just a single command without much configuration.

echo "Create subnet for ACA with 128 IP addresses."
az network vnet subnet create 
  --resource-group $RESOURCE_GROUP 
  --name $ACA_SUBNET 
  --vnet-name $VNET 
  --address-prefix 10.0.0.0/25

echo "Delegate the subnet to ACA"
az network vnet subnet update 
  --resource-group $RESOURCE_GROUP 
  --vnet-name $VNET 
  --name $ACA_SUBNET 
  --delegations Microsoft.App/environments

echo "Obtain the ID of our subnet"
ACA_SUBNET_ID=$(az network vnet subnet show 
  --resource-group $RESOURCE_GROUP 
  --name $ACA_SUBNET 
  --vnet-name $VNET 
  --query id --output tsv)

echo "Create Container Apps Environment in our custom subnet.
By default, it has a Workload profile with Consumption plan."
az containerapp env create 
  --resource-group $RESOURCE_GROUP 
  --name $ACA_ENVIRONMENT 
  --infrastructure-subnet-resource-id $ACA_SUBNET_ID 
  --location $LOCATION

3.4 Azure Container Apps deployment

Each Container App needs a Docker image to run. Let’s first set up a Container Registry, and then build all our images locally and push them to the registry. Note that we simply copied the model file into the language model image using its Dockerfile, so we don’t need to mount external storage like we did for local deployment in part 1.

echo "Create container registry (ACR)"
az acr create \
  --resource-group $RESOURCE_GROUP \
  --name $ACR \
  --sku Standard \
  --admin-enabled true

echo "Login to ACR and push local images"
az acr login --name $ACR
docker build --tag $ACR.azurecr.io/$DB_API $DB_API
docker push $ACR.azurecr.io/$DB_API
docker build --tag $ACR.azurecr.io/$LM_API $LM_API
docker push $ACR.azurecr.io/$LM_API
docker build --tag $ACR.azurecr.io/$UI $UI
docker push $ACR.azurecr.io/$UI

Now, onto deployment. To create Container Apps we specify their Environment, container registry, image, and the port they will listen to for requests. The ingress parameter regulates whether Container Apps can be reached from the outside world. Our two APIs are internal and therefore completely isolated, with no public URL and no traffic ever routed from the Envoy Ingress Controller. The UI is external and has a public URL, but sends internal HTTP requests over the virtual network to our APIs. We pass these internal hostnames and db credentials as environment variables.

echo "Deploy DB API on Container Apps with the db credentials from the key 
vault as env vars. More secure is to use a managed identity that allows the 
container itself to retrieve them from the key vault. But for simplicity we 
simply fetch it ourselves using the CLI."
POSTGRES_USERNAME=$(az keyvault secret show --name postgres-username --vault-name $KEYVAULT --query "value" --output tsv)
POSTGRES_PASSWORD=$(az keyvault secret show --name postgres-password --vault-name $KEYVAULT --query "value" --output tsv)
az containerapp create --name $DB_API 
  --resource-group $RESOURCE_GROUP 
  --environment $ACA_ENVIRONMENT 
  --registry-server $ACR.azurecr.io 
  --image $ACR.azurecr.io/$DB_API 
  --target-port 80 
  --ingress internal 
  --env-vars "POSTGRES_HOST=$DB_SERVER.postgres.database.azure.com" "POSTGRES_USERNAME=$POSTGRES_USERNAME" "POSTGRES_PASSWORD=$POSTGRES_PASSWORD" 
  --min-replicas 1 
  --max-replicas 5 
  --cpu 0.5 
  --memory 1

echo "Deploy UI on Container Apps, and retrieve the secret random session 
key the UI uses to encrypt session cookies"
SESSION_KEY=$(az keyvault secret show --name session-key --vault-name $KEYVAULT --query "value" --output tsv)
az containerapp create --name $UI 
  --resource-group $RESOURCE_GROUP 
  --environment $ACA_ENVIRONMENT 
  --registry-server $ACR.azurecr.io 
  --image $ACR.azurecr.io/$UI 
  --target-port 80 
  --ingress external 
  --env-vars "db_api_url=http://$DB_API" "lm_api_url=http://$LM_API" "session_key=$SESSION_KEY" 
  --min-replicas 1 
  --max-replicas 5 
  --cpu 0.5 
  --memory 1 

echo "Deploy LM API on Container Apps"
az containerapp create --name $LM_API 
  --resource-group $RESOURCE_GROUP 
  --environment $ACA_ENVIRONMENT 
  --registry-server $ACR.azurecr.io 
  --image $ACR.azurecr.io/$LM_API 
  --target-port 80 
  --ingress internal 
  --min-replicas 1 
  --max-replicas 5 
  --cpu 2 
  --memory 4 
  --scale-rule-name my-http-rule 
  --scale-rule-http-concurrency 2

3.5 Scaling our Container Apps

Let’s take a look at how our Container Apps scale. Container Apps can scale to zero, which means they have zero replicas and stop running (and stop incurring costs). This is a feature of the serverless paradigm, where infrastructure is provisioned on demand. The invisible Envoy proxy handles scaling based on triggers, like concurrent HTTP requests. Spawning new replicas may take some time, which is called a cold start. We set the minimum number of replicas to 1 to avoid cold starts and the resulting timeout errors for first requests.

The default scaling rule creates a new replica whenever an existing replica receives 10 concurrent HTTP requests. This applies to the UI and the database API. To test whether this scaling rule makes sense, we would have to perform load testing to simulate real user traffic and see what each Container App replica can handle individually. My guess is that they can handle a lot more concurrent requests than 10, and we could relax the rule.
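If load testing showed that a single replica comfortably handles more traffic, the rule could presumably be relaxed after deployment with something like the command below. I'm assuming here that az containerapp update accepts the same scale-rule flags used with az containerapp create above; the concurrency value of 20 is just an example, not a measured number.

az containerapp update --name $UI \
  --resource-group $RESOURCE_GROUP \
  --scale-rule-name my-http-rule \
  --scale-rule-type http \
  --scale-rule-http-concurrency 20 \
  --min-replicas 1 \
  --max-replicas 5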

3.5.1 Scaling language model inference.

Even with our small, quantized language model, inference requires much more compute than a simple FastAPI app. The inference server handles incoming requests sequentially, and the default Container App resources of 0.5 virtual CPU cores and 1GB memory result in very slow response times: up to 30 seconds for generating 128 tokens with a context window of 1024 (these parameters are defined in the LM API’s Dockerfile).

Increasing vCPU to 2 and memory to 4GB gives much better inference speed, and handles about 10 requests within 30 seconds. I configured the http scaling rule very tightly at 2 concurrent requests, so whenever 2 users chat at the same time, the LM API will scale out.

With 5 maximum replicas, I think this will allow for roughly 10–40 concurrent users, depending on the length of the chat histories. Now, obviously, this isn’t very large-scale, but with a higher budget, we could increase vCPUs, memory and the number of replicas. Ultimately we would need to move to GPU-based inference. More on that later.

3.6 Custom domain name & HTTPS

The automatically generated URL from the UI App looks like https://chat-ui.purplepebble-ac46ada4.germanywestcentral.azurecontainerapps.io/. This isn’t very memorable, so I want to make our app available as subdomain on my website: chat.jorisbaan.nl.

I simply add two DNS records on my domain registrar portal (like GoDaddy): a CNAME record that links my chat subdomain to the UI’s URL, and a TXT record to prove ownership of the subdomain to Azure and obtain a TLS certificate.

# Obtain UI URL and verification code
URL=$(az containerapp show -n $UI -g $RESOURCE_GROUP -o tsv --query "properties.configuration.ingress.fqdn")
VERIFICATION_CODE=$(az containerapp show -n $UI -g $RESOURCE_GROUP -o tsv --query "properties.customDomainVerificationId")

# Add a CNAME record with the URL and a TXT record with the verification code to domain registrar
# (Do this manually)

# Add custom domain name to UI App
az containerapp hostname add --hostname chat.jorisbaan.nl -g $RESOURCE_GROUP -n $UI
# Configure managed certificate for HTTPS
az containerapp hostname bind --hostname chat.jorisbaan.nl -g $RESOURCE_GROUP -n $UI --environment $ACA_ENVIRONMENT --validation-method CNAME

Container Apps manages a free TLS certificate for my subdomain as long as the CNAME record points directly to the container’s domain name.

The public URL for the UI changes whenever I tear down and redeploy an Environment. We could use a fancier service like Azure Front Door or Application Gateway to get a stable URL and act as reverse proxy with additional security, global availability, and edge caching.

4. Resources & costs overview

Now that the app is deployed, let’s look at an overview of all the Azure resources the app uses. We created most of them ourselves, but Azure also automatically created a Load balancer, Public IP, Private DNS Zone, Network Watcher and Log Analytics workspace.

Screenshot of all resources from Azure Portal.

Some resources are free, others are free up to a certain time or compute budget, which is part of the reason I chose them. The following resources incur the highest costs:

  • Load Balancer (standard Tier): free for 1 month, then $18/month.
  • Container Registry (standard Tier): free for 12 months, then $19/month.
  • PostgreSQL Flexible Server (Burstable B1MS Compute Tier): free for 12 months, then at least $13/month.
  • Container App: Free for 50 CPU hours/month or 2M requests/month, then $10/month for an App with a single replica, 0.5 vCPUs and 1GB memory. The LM API with 2vCPUs, 4GB memory costs about $50 per month for a single replica.

You can see that the costs of this small (but scalable) app can quickly add up to hundreds of dollars per month, even without a GPU server to run a stronger language model! That’s the reason why the app probably won’t be up when you’re reading this.

It also becomes clear that Azure Container Apps is more expensive than I initially thought: it requires a standard-tier Load Balancer for automatic external ingress, HTTPS and auto-scaling. We could get around this by disabling external ingress and deploying a cheaper alternative – like a VM with a custom reverse proxy, or a basic-tier Load Balancer. Still, a standard-tier Kubernetes cluster would have cost at least $150/month, so ACA can be cheaper at small scale.

5. Roadmap

Now, before we wrap up, let’s look at just a few of the many directions to improve this deployment.

Continuous Integration & Continuous Deployment. I would set up a CI/CD pipeline that runs unit and integration tests and redeploys the app upon code changes. It might be triggered by a new git commit or merged pull request. This will also make it easier to see when a service isn’t deployed properly. I would also set up monitoring and alerting to be aware of issues quickly (like a crashing Container App instance).

Lower latency: the language model server. I would load test the whole app – simulating real-world user traffic – with something like Locust or Azure Load Testing. Even without load testing, we have an obvious bottleneck: the LM server. Small and quantized as it is, it can still take quite a while to produce lengthy answers, with no concurrency. For more users it would be faster and more efficient to run a GPU inference server with a batching mechanism that collects multiple generation requests in a queue – perhaps with Kafka – and runs batch inference on chunks.

With even more users, we might want several GPU-based LM servers that consume from the same queue. For GPU infrastructure I’d look into Azure Virtual Machines or something more fancy like Azure Machine Learning.

The llama.cpp inference engine is good for single-user CPU-based inference. When moving to a GPU-server, I would look into inference engines more suitable to batch inference, like vLLM or Huggingface TGI. And, obviously, a better (bigger) model for increased response quality – depending on the use case.

6. Final thoughts

I hope this project offers a glimpse of what an AI-powered web app in production may look like. I tried to balance realistic engineering with cutting about every corner to keep it simple, cheap, understandable, and limit my time and compute budget. Sadly, I cannot keep the app live for long since it would quickly cost hundreds of dollars per month. If someone can help with Azure credits to keep the app running, let me know!

Some closing thoughts about using managed services: Although Azure Container Apps abstracts away some of the Kubernetes complexity, it’s still extremely useful to have an understanding of the lower-level Kubernetes concepts. The automatically created invisible infrastructure like Public IPs, Load balancers and ingress controllers add unforeseen costs and make it difficult to understand what’s going on. Also, ACA documentation is limited compared to Kubernetes. However, if you know what you’re doing, you can set something up very quickly.

Acknowledgements

I heavily relied on the Azure docs, and the ACA docs in particular. Thanks to Dennis Ulmer for proofreading and Lucas de Haas for useful discussion.

AI usage

I experimented a bit more with AI tools compared to part 1. I used Pycharm’s CoPilot plugin for code completion and had quite some back-and-forth with ChatGPT to learn about the Azure or Kubernetes ecosystem, and to spar about bugs. I double-checked everything in the docs and most of the information was solid. Like part 1, I did not use AI to write this post, though I did use ChatGPT to paraphrase some bad-running sentences.

Complete MLOPS Cycle for a Computer Vision Project

Dive into MLOPS basics to improve your skills for designing, developing, and deploying computer vision projects for real-world, industrial applications

These days we encounter (and maybe build ourselves) many computer vision projects, since AI is the hottest topic in new technologies. Fine-tuning a pre-trained image classification, object detection, or any other computer vision model is not a big deal. But what is the correct way to create and deploy an AI project for industrial usage?

MLOps (Machine Learning Operations) is a set of practices, tools, and frameworks aimed at automating the development, deployment, monitoring, and management of machine learning models in production environments. It bridges the gap between the research and development environments and helps us improve both stages.

Image by Author

In this complete set of tutorials, we will be covering each step of a computer vision project’s MLOPS cycle.

A complete cycle of MLOPS for an AI project is listed below, with an example tool that we will use to accomplish the related step:

  1. Data versioning & Management (DVC)
  2. Experiment Tracking (MLFlow)
  3. Model Optimization (ONNX)
  4. Model Packaging & Serving (Docker)
  5. CI/CD for ML Pipelines (Git)
  6. Monitoring & Feedback Loops (Grafana)

In our tutorial, we will be examining all these steps over object detection or image classification models (sometimes for both)!

Let’s start directly by discovering what DVC is, why we need it, and how we can use it easily!

Data versioning & Management (DVC)

Imagine you work on an industrial project, where you expect to have an updated version of the dataset regularly, i.e. a new product is added and you need to retrain your model to keep up with the newest objects that your AI model should detect.

Storing the datasets in separate folders like dataset_v1, dataset_v2, … dataset_vx would be as awful as creating a new folder for every code update and calling them project_v1, project_v2, … project_vx. Fortunately, keeping track of our code and versioning it is handled by Git, a very common framework among developers. DVC comes to help us in the same way as Git, this time to keep track of our datasets and version them without the need to create a new folder each time we update our dataset!

Therefore, at the end of this tutorial, we will have learned how to convert our dataset environment from an unprofessional setup to a proper one, as shown in the figure below:

Image by Author

Assuming that you have Git already initialized in your project folder, you can follow the steps below; otherwise, first, initialize a Git repository because DVC collaborates with Git to track your dataset!

Download and Initialize DVC (Linux)

Download DVC using the following command if you are a Linux user; if not, find the correct command for your system in the official repository.

snap install dvc --classic

Go to your project environment and initialize DVC. We assume that your project structure is:

project
  |__ data
  |__ cfg
  |__ models
  |__ weights
  ...

So basically you have a main project folder with everything arranged inside as subfolders, including the data folder.

cd project
dvc init

Start Versioning

Put the first version of your dataset into the "data/" folder. In my case it is called dataset_v2, since I lost dataset_v1 for this old project, which I only had stored locally.

mv dataset_v2/* data/

Add this change to your DVC track history.

dvc add data

Make sure that Git doesn’t track your data as well; that would be totally unnecessary and bad usage of Git, since it’s responsible for tracking the development code, not the datasets!

.gitignore

data/*

Add the DVC metadata file to Git tracking, along with .gitignore since we have just updated it, and commit this change via Git.

git add data.dvc .gitignore
git commit -m "dataset version 2"

Define a local storage location where DVC will store the data, in its own format, across the different versions. I named it "local_onedrive_remote" since in the next steps we will learn how to push and pull data to and from our OneDrive cloud storage.

dvc remote add -d local_onedrive_remote ~/dvc_onedrive_remote

Time to push our first dataset to our local storage!

dvc push

Before repeating these steps until we have versioned all the datasets stored in different folders, we will take a look at how to keep this versioning in cloud storage. This is an important step if you want a backup or want to collaborate with your colleagues over the cloud. It also lets you pull the dataset, with all the available versions, onto any other machine where you need your dataset locally.

Image by Author

Rclone: a bridge between your local and remote storage

Rclone is a tool that helps push and pull your data between local and remote paths. It acts as the bridge that completes our data versioning pipeline.

Install Rclone into your local machine:

sudo apt update
sudo apt install rclone

Create a new Rclone configuration for your cloud storage. In my case it’s my personal OneDrive, but you can choose any type of cloud storage listed by Rclone:

rclone config

Press n to create a new configuration, enter the name you want to give to the configuration, and choose the type. For me it’s 21, referencing OneDrive in the given list.

If everything is fine, you should be able to see your new storage by running rclone config command again:

Image by Author

Also, a double-check via the rclone ls onedrive:/ command would be nice, to see whether it starts listing all the contents of your remote storage, so you can be sure the remote link is correct and mounted nicely in the storage object you call "onedrive" (or anything else you prefer for your personal cloud storage).

The last command for pushing the local storage versioning to our remote storage:

rclone sync ~/dvc_onedrive_remote onedrive:/DVC_STORAGE

What we do with this line is basically to synchronize our local storage (~/dvc_onedrive_remote) with the remote one (onedrive:/DVC_STORAGE), where onedrive is the selected name for the rclone remote repo while I configure it, and DVC_STORAGE is the folder I have created in my Onedrive to store my data.

That is all to set up our data versioning environment!

Now I will apply the same commands to add the newer versions of my dataset to my versioning history and delete all the separate folders one by one.

The following bash script is useful to run after copying a newer version of the dataset (dataset_v3, dataset_v4, ...) into the data/ folder, completing all the additional steps at once.

#!/bin/bash

# Step 1: Automatically determine the dataset version
# Count previous commits containing "dataset version"
previous_version=$(git log --oneline | grep -c "dataset version")

# Increment the dataset version
new_version=$((previous_version + 1))

# Step 2: Add the dataset to DVC
echo "Adding dataset to DVC..."
dvc add data

# Step 3: Stage the updated DVC metadata
echo "Staging DVC metadata..."
git add data.dvc

# Step 4: Commit with the new dataset version
commit_message="dataset version $new_version"
echo "Committing with message: $commit_message"
git commit -m "$commit_message"

# Step 5: Push to DVC remote
echo "Pushing dataset to DVC remote..."
dvc push

# Step 6: Sync with OneDrive via Rclone
echo "Syncing DVC cache with OneDrive..."
rclone sync ~/dvc_onedrive_remote onedrive:/DVC_STORAGE

echo "Dataset version $new_version successfully pushed and synced!"

Now that everything is done and I have a dataset having 7 different versions in my DVC storage and only the last version in my project directory, it’s time to see how we can travel between different versions in case we need to use an older version of our dataset.

Pull an old version of the dataset

The current and newest dataset I have in my project folder looks like the one below, with 56 classes written in classes.txt, 1072 training images, and 256 validation images.

Image by Author

Check which commit you need to go back to for the specific version:

git log --oneline
Image by Author

Let’s say I need dataset version 5: I choose 8f8de95 as the commit I want to go back to, and I pull the data from the DVC store back into my project folder.

git checkout 8f8de95
dvc pull

Now the current dataset in my project folder looks as below, with 39 classes written in classes.txt, 662 training images, and 152 validation images. I can see that even distribution.png has been tracked by DVC and reverted to the older version.

Image by Author

Get back to the newest dataset version

Let’s say we are done using the old dataset and want to go back to the latest version: two lines and we are done again!

git checkout master #or main according to your repo
dvc pull

Pull data from the cloud to a new machine

We used the rclone sync ~/dvc_onedrive_remote onedrive:/DVC_STORAGE command to synchronize our local remote repo with the cloud remote repo. When we need the inverse (from remote to local), it’s just the same command in the opposite direction! So the command rclone sync onedrive:/DVC_STORAGE ~/dvc_onedrive_remote would synchronize the remote storage with our local one and can be used on any new machine where you want to pull the data.

What if we have multiple datasets in the same project?

In real-world applications, you may have more than one subtask in the same project. For example, an object detection model and an image classification model may work in parallel or sequentially, and each needs to be trained with a different dataset. It is nothing more than arranging our project folder well and designing our DVC system accordingly:

Image by Author

Since our main workspace now contains multiple subtask folders, classification and detection, we rename data.dvc to data_detection.dvc and keep it in the root of our main folder, as well as creating a new one named data_classification.dvc.

Since we moved them, we should also update the paths written in the .dvc files:
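For example, data_detection.dvc might end up looking roughly like the sketch below, with the path field pointing into the detection subfolder; the hash, size and file count are placeholders, not real values from my project:

outs:
- md5: 1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d.dir   # placeholder hash
  size: 123456789                              # placeholder size in bytes
  nfiles: 1328                                 # placeholder file count
  path: detection/data                         # updated path after the move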

Repeating the previous steps, we configure the DVC for the classification subtask:

dvc add classification/data 
git add data_classification.dvc
dvc remote add -d local_onedrive_remote_classification ~/dvc_onedrive_remote_classification
dvc push -r local_onedrive_remote_classification data_classification.dvc 
rclone config # create a new cloud storage for classification dataset
rclone sync ~/dvc_onedrive_remote_classification onedrive:/DVC_STORAGE_CLASSIFICATION

That’s it! After arranging the workspace, adding new subtasks to the DVC system, and updating the previous task’s .dvc files or paths if necessary, the rest is just the same. The only thing you need to pay attention to is running the correct commands for the dataset you updated. For example, in our setup:

  1. If you have an update in the classification dataset:

dvc add classification/data
git add data_classification.dvc
dvc push -r local_onedrive_remote_classification data_classification.dvc
rclone sync ~/dvc_onedrive_remote_classification onedrive:/DVC_STORAGE_CLASSIFICATION

  2. If you have an update in the detection dataset:

dvc add detection/data
git add data_detection.dvc
dvc push -r local_onedrive_remote data_detection.dvc
rclone sync ~/dvc_onedrive_remote onedrive:/DVC_STORAGE

We are done with the first step of the MLOps cycle for our project. To see the next steps after setting up the data versioning environment, keep up with the following contents of this tutorial!

From Local to Cloud: Estimating GPU Resources for Open-Source LLMs

Estimating GPU memory for deploying the latest open-source LLMs
Source

If you’re like me, you probably get excited about the latest and greatest open-source LLMs – from models like Llama 3 to the more compact Phi-3 Mini. But before you jump into deploying your language model, there’s one crucial factor you need to plan for: GPU memory. Misjudge this, and your shiny new web app might choke, run sluggishly, or rack up hefty cloud bills. To make things easier, I’ll explain what quantization is, and I’ve prepared a GPU Memory Planning Cheat Sheet for 2024 – a handy summary of the latest open-source LLMs on the market and what you need to know before deployment.

Why Bother Estimating GPU Memory?

When deploying LLMs, guessing how much GPU memory you need is risky. Too little, and your model crashes. Too much, and you’re burning money for no reason.

Understanding these memory requirements upfront is like knowing how much luggage you can fit in your car before a road trip – it saves headaches and keeps things efficient.

Quantization: What’s It For?

Quantization impacts the "brain" of an LLM by simplifying the numerical precision of its weights, which are key to how the model generates text and makes decisions.

  1. Memory and Speed Boost: Reducing from 32-bit to 16-bit or 8-bit precision cuts down memory usage and speeds up inference, making deployment on limited GPUs more efficient. It’s like lightening the brain’s load to think faster.
  2. Trade-offs in "Thinking Power": With simpler, less precise weights, the model might lose some of its ability to handle complex or nuanced tasks, leading to less accurate or lower-quality outputs.
  3. Balancing Efficiency and Accuracy: For most applications, this precision loss is minimal (such as text summarization). But for tasks requiring fine detail, the impact can be more significant (such as resolving complex problems).

Estimating GPU Memory

To estimate the GPU memory (M) required for an LLM, use the following formula:

M = P × (Q / 8) × 1.2

Where:

  • M: GPU memory in gigabytes (GB)
  • P: Number of model parameters in billions
  • Q: Bit precision (e.g., 8, 16, or 32 bits)
  • 1.2: A 20% overhead factor for additional memory needs

Example

Consider the Grok-1 model from xAI, with 314 billion parameters (P = 314), deployed at 16-bit precision (Q = 16):

M = 314 × (16 / 8) × 1.2 = 314 × 2 × 1.2 = 753.6 GB

So, to deploy Grok-1 at 16-bit precision, you would need a whopping 753.6 GB of GPU memory. This clearly shows the massive resource requirements of these large-scale models!
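As a small sketch, the same estimate can be computed in a few lines of Python; the function simply encodes the formula above and is not tied to any particular library:

def estimate_gpu_memory_gb(params_billions: float, bits: int, overhead: float = 1.2) -> float:
    """Estimate the GPU memory (in GB) needed to serve a model's weights.

    params_billions: number of parameters in billions (P)
    bits: numerical precision of the weights (Q), e.g. 32, 16, or 8
    overhead: multiplier for extra memory such as activations and caches
    """
    bytes_per_param = bits / 8   # e.g. 16-bit precision -> 2 bytes per parameter
    return params_billions * bytes_per_param * overhead


# Grok-1: 314 billion parameters at 16-bit and 8-bit precision
print(estimate_gpu_memory_gb(314, 16))  # -> 753.6
print(estimate_gpu_memory_gb(314, 8))   # -> 376.8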

GPU Memory Planning Cheat Sheet in 2024

Made by author – This table gives a snapshot of some impressive open-source LLMs coming out in 2024, highlighting their specs and the GPU muscle they need.

From lightweight models like OpenELM to resource-hungry giants like Snowflake Arctic, context lengths vary up to 128,000 tokens, and using 8-bit precision can drastically cut GPU memory needs for efficient deployment.

Smaller models are ideal for solo developers or startups, while quantization helps make larger models feasible on budget-friendly hardware.

Key Takeaways to Make Your Life Easier

  1. Lower Precision Can Save You Big: Using 8-bit precision can drastically cut down on memory use. But keep in mind, it might come at a performance cost. It’s all about trade-offs.
  2. Account for Overhead: That 20% buffer in the formula isn’t just for fun. It helps you avoid nasty surprises like your model stalling due to a lack of memory.
  3. Pick the Right Model for Your Use Case: If you need long context windows for applications like document summarization, models like LWM or Jamba could be good. But watch out for their sky-high memory needs.

Conclusion

Now you have the information to make your own estimation based on your needs. If you’re deploying a model for real-time text generation, you don’t want latency or, worse, for the whole app to crash. And if you’re working in the cloud, optimizing GPU usage can mean thousands of dollars saved over time. This is why understanding these memory estimates is really important.




Economics of Hosting Open Source LLMs

Leveraging various deployment options

Large Language Models in Production
Total Processing Time on GPU vs CPU – Not to scale* | Image by author

If you’ve been experimenting with open-source models of different sizes, you’re probably asking yourself: what’s the most efficient way to deploy them?

What’s the pricing difference between on-demand and serverless providers, and is it really worth dealing with a player like AWS when there are LLM serving platforms?

I’ve decided to dive into this subject, comparing cloud vendors like AWS with newer alternatives like Modal, BentoML, Replicate, Hugging Face Endpoints, and Beam.

We’ll look at metrics such as processing time, cold start delays, and CPU, memory, and GPU costs to understand what’s most efficient and economical. We’ll also cover softer metrics like ease of deployment, developer experience, and community.

Some of the metrics we’ll look at | Image by author

We’ll explore a few use cases, such as deploying a smaller model on CPU versus running a 7–8 billion parameter model on GPU.

I’ll also dig into the process of deploying a smaller model on AWS Lambda with EFS and compare it against a more modern platform like Modal.

I won’t dive into optimization strategies here – things like speeding up inference with different frameworks or quantization – that’s a separate topic altogether.

Instead, this article will focus on how to choose the right deployment option, give you a chance to compare performance across different scenarios, and help you understand the economic costs of deploying both small and large LLMs.

Introduction

When you’re using off-the-shelf open-source models, there are plenty of API options that are easy to tap into. I recommend checking out this list for a few choices. You can also choose to self-host – take a look at the ‘Local Inference’ section in the same list.

However, you may need to use private, fine-tuned, or less common models.

You could of course host these locally as well, but you’ll need enough juice on your computer, plus you might want to integrate these models into an application running on another server.

This brings us to hosting open-source models on-demand or via serverless platforms. The idea is that you only pay for the resources you use, whether it’s on-demand or per run, as with serverless.

Serverless and on-demand work in much the same way, but with serverless, the scaling down happens faster, so you don’t pay for idle resources.

You can look at my scribbles below for more of a comparison.

On-demand vs serverless scribbles | Image by author

In this article, we’ll compare pricing for AWS’s EC2 and Lambda with several emerging platforms that have recently gained popularity.

Different deployment choices that we’ll cover | Image by author

This way, you’ll get a better sense of what might work best.

As a side note, I have not been paid by any of these vendors, so the information I share here is my own.

If you’re a stakeholder, this is a great way to understand the economics of the different options and what it might cost to run inference based on model size and vendor choice.

The first part of the article covers the research, which anyone can follow along with, while the second part goes into the technical aspects of deployment that you may or may not want to read.

LLM Serving Frameworks

Now, before we get started, I want to comment a bit on LLM inference frameworks, which simplify the setup of API endpoints to serve models. There are several open-source LLM serving frameworks available, including vLLM, TensorRT, and TGI, which we can use here.

You can check out some of the more popular ones in the ‘LLM Serving Frameworks’ section of the list I shared earlier (seen below).

From the LLM Resources list under ‘LLM Serving Frameworks’ | Image by author

Some have measured the performance differences between these frameworks, and you should definitely do your own research.

In this article, though, we’ll use vLLM, which is widely used – except when deploying a model via Hugging Face Endpoints, which will automatically use TGI for us.

To deploy a smaller transformer model running on CPU, I simply used the Hugging Face [pipeline](https://huggingface.co/docs/transformers/en/main_classes/pipelines) or the transformers library directly.
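As a hedged illustration of that CPU path, serving a small model behind a FastAPI endpoint with the Hugging Face pipeline could look roughly like this; the model name and task are placeholders, not the exact 400M model used in the tests:

from fastapi import FastAPI
from transformers import pipeline

app = FastAPI()

# Load the model once at startup; a small model fits comfortably in CPU memory.
classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder small model
    device=-1,  # -1 means run on CPU
)

@app.post("/predict")
def predict(texts: list[str]):
    # The pipeline accepts a batch of texts and returns one prediction per text.
    return classifier(texts, batch_size=30)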

The Research

In this first part, we’ll look at the efficiency, cost, and performance of our choices, both on-demand and serverless. We’ll start by going through the metrics before diving into any technical details.

Processing Time

Let’s begin by measuring the total processing time across the platforms when the container is warm (i.e., it’s been used within the last few seconds) with no concurrency.

We define processing time as the total time taken to complete the response. Note that some might measure time to first response, especially when streaming the output.

For consistency, I used the same prompts for each test. For the 400M model, I batched the texts by 30.

You can see the metrics below.

*Not to scale, see full calculations here | Image by the author

I only ran these tests a few times per platform on the same day. Ideally, I should have tested them over several days. I may have been unlucky for some of them.

To discuss how they did: the serverless providers, Modal and Beam, perform really well on CPU (shown as the light green bars). It’s naturally easier to boot up a 400M model than an 8B model.

I found that using even smaller models (under 130M) works great with AWS Lambda, especially if you cache your models using EFS.

While I really like Hugging Face Endpoints, I find their CPU instances to be a bit unpredictable. However, their AWS GPU instances are quite reliable and really fast.

I can get very fast responses on GPU with Hugging Face; even a 7B model hosted on an L4 instance can return a response within 10 seconds – something we can’t achieve with the serverless providers, which need more GPU power.

If we pick an A100 GPU, we see that all providers do very well for a 7B-8B parameter model, returning full responses within a few seconds.

Of course, speed is great, but we also need to consider other metrics.

Cold Boots

Next, let’s dive into cold boots, i.e. how long it takes for a model to respond if it hasn’t been used for a while. Even if you cache a model, it may still need to download shards, which can add a few seconds.

On-demand services may allow you to cache models for faster boot times, which I didn’t do here, but most serverless providers show you how to cache during build time, which can reduce cold boot latency.

Let’s look at the metrics across a few platforms below.

*Not to scale, see calculations here | Image by the author

Note that I calculated the entire processing time when cold; check the calculations directly if you want the cold boot times on their own.

As expected, the on-demand services where I didn’t cache the models, such as BentoML, Hugging Face Endpoints, and Baseten, perform worse.

While Hugging Face Endpoints can perform well once they’re running, you can still encounter cold boots lasting from 30 seconds to up to 5 minutes, which can be problematic if you need to scale up and down often. They will also throw 500 errors until the container is fully running again.

Serverless providers are faster as they are designed to scale quickly by asking us to cache the model weights when we first deploy.

On CPU, Beam performed the best, followed by Baseten, Modal, and Lambda with EFS. Smaller models are generally faster to boot up. Using Lambda for a small model with only 125M parameters showed great results, with quick processing times and minimal cold boot delays.

Although I would argue that using Modal or Beam for a smaller model would do fine as well.

GPU & CPU Pricing

Let’s turn to pricing. We need to look at the costs for CPU, memory, and GPU resources.

There are some noticeable differences between the platforms.

Serverless providers are generally more expensive since they also charge for CPU and memory on top of GPU usage. However, they don’t bill you for idle time, which can help offset the higher costs.

You can find the pricing for Nvidia GPUs in the image below.

*GPU pricing per platform, see calculations here | Image by the author

Take a look at SageMaker, though, which has the highest GPU cost of all the platforms here. If you need to stay on AWS, it may be better to use EC2 directly.

Let’s also look at CPU pricing.

*CPU/Mem pricing per platform, see calculations here | Image by the author

Hugging Face Endpoints leads with a cost of $0.07 for an instance with 2 vCPU and 4GB of memory; it’s too bad that their CPU instances just don’t perform that well.

Beam and Modal allow you to tweak the resources needed, which helps minimize costs. For a 400M model, I calculated that I only needed 3GB of memory and 1 core (2 vCPU) on both platforms.

On the other hand, Replicate forces us to use 4 vCPU regardless of the model size, making it the most expensive CPU option here.

We’ll go through a few use cases to compare pricing and efficiency across all these platforms.

Case 1: Fine-Tuned 400M Model running on CPU

The first case will be running a 400M model sporadically throughout the day. This means the container needs to scale up and down each time it’s called.

It may not always be the case that we need to scale up and down, but we’ll have to calculate it as if it is.

I ran this case study by batching 30 texts for each call with a smaller fine-tuned model, making 250 calls throughout the day. For simplicity, we’ll assume that the container is cold each time it runs (except for Hugging Face Endpoints).

*Not to scale, see calculations here | Image by the author

The serverless providers are a better option here, as we don’t pay for idle time in the same way as we do for on-demand. For BentoML we need to keep it idle for at least 5 minutes before it autoscales down, and for HF Endpoints we need to wait 15 minutes.

As a side note, if auto-scaling down is new to you: it means we tell the platform to scale our instance down automatically when it is idle, if the platform allows it.

They all have different requirements: Baseten and HF Endpoints have a 15-minute idle window, and BentoML has 5 minutes.

Since HF Endpoints take at least 15 minutes to scale down, calling the function every 5–6 minutes means they never get the chance to; we see very few cold boots, but mostly idle time.

We can see that having 17 hours of idle time, as in the HF case, or 18 hours, as in the BentoML case, is inherently inefficient: most of what we pay throughout the day goes to idle resources.
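
To make that concrete, here is a rough back-of-the-envelope calculation, assuming the $0.07 figure quoted for Hugging Face’s 2 vCPU / 4GB instance is an hourly rate; your exact rates and idle hours will differ.

# Rough monthly idle cost; assumes the $0.07 figure for a 2 vCPU / 4GB instance is per hour
hourly_rate = 0.07        # USD per hour (assumed hourly rate)
idle_hours_per_day = 17   # hours the instance sits warm but unused
days = 30

idle_cost = hourly_rate * idle_hours_per_day * days
print(f"~${idle_cost:.2f} per month paid purely for idle time")  # ~$35.70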

A cent or a dollar here and there might not seem like much for your first few days, but after a while it adds up.

Looking at monthly costs for running one smaller model on CPU | Image by the author

Just think of it like people saving a bit of money each day in their savings account – overpaying here would be the reverse of that.

But what if we ran all 250 calls while the container is warm? How much would the cost differ?

*Not to scale, see calculations here | Image by the author

Beam seems to be an outlier here, but I suspect that is because it bursts above the requested CPU in a way the other platforms don’t allow.

In this scenario, cold boots and idle time disappear. This shows that using a persistent container is the better choice if you’re processing everything in one go – it’s a lot cheaper.

It’s worth noting that a 400M model is best suited for a T4 GPU on both Hugging Face Endpoints and BentoML. This setup keeps costs low while significantly reducing processing time.

One thing to keep in mind: if you use AWS Lambda with EFS, you’ll incur an additional cost for a NAT Gateway, which can add $1 to $3 per day, increasing the overall cost more than is shown here.

Now, let’s move on to the second case – running a larger model with 7B to 8B parameters on GPU.

Case 2: General 8B Model running on GPU

For this case, I’ve been testing models like Mistral, Gemma, or Llama with sizes around 7B – 8B.

The scenario involves sporadically invoking the model 250 times throughout the day. We assume the container scales up and down for each invocation, even though this might not always be the case.

Just like with the CPU tests, we assume the on-demand services run for 24 hours, as they don’t have time to scale down between calls.

I have made sure to write out the GPU instance we’ve used here for each vendor. Look at the bar chart below.

*Not to scale, see calculations here | Image by the author

For the serverless providers, I’ve slightly inflated the processing time by multiplying it but excluded cold boots from the total price calculation.

While the actual cost might be lower, this adjustment is to be cautious. There is a chance you’ll be billed more, as you will pay for some of the cold boots.

*Not to scale, see calculations here | Image by the author

Just as we saw in the CPU case, running the 250 calls in one go is more cost-effective.

If you set up calculations for, say, Anthropic’s and OpenAI’s cheapest models and compared them to the cost of self-hosting, you would see that you pay significantly less to call their models with the same prompt than to host a model like this.

People call these vendors the McDonald’s of LLMs.

We assume that open source will be cheaper, but we rarely calculate the actual unit economics of hosting. These proprietary platforms are also subsidized by VC funding. However, as I mentioned before, there are cheaper ways to access open-source models using vendors you’ll find here.

If you want to dig into the detailed calculations, you can check out this file. Fair warning – it looks a bit messy.

User Experience

By now, you may have reached your own conclusions, but one last thing I want to cover is user experience.

If you are a non-coder then HF Endpoints is very easy to work with, as you can simply click to deploy a model from the HuggingFace hub. If you are a bit technical you may prefer other options where you have more control.

Replicate has a large follower base and a lot of public models shared by various people, so there is a community around it. They also have a few one-click train-and-deploy flows that make things easier.

However, I found Modal, Beam, and BentoML to offer a great developer experience in general. You deploy directly via the terminal and let the code run on their servers.

With Replicate, if you are deploying your own models, you’ll need a GPU machine; with Baseten, you need to download a library called Truss, which takes a bit of time.

I have collected some of my notes in this table (also seen below).

From the LLM Resources list | Image by author

The table will have links to get started scripts as well if you’re keen to work with any of these.

Now that we’ve covered most of the non-technical aspects, I’ll walk you through two deployment options for a model that performs well on CPU: AWS Lambda and Modal.

Technical Bits

In this section, we’ll go through deploying a 400M model that I’ve fine-tuned for keyword extraction using AWS Lambda with EFS, and compare it to a deployment on a newer platform like Modal.

Both tools are serverless, which means we need to cache the model properly at build time so we can quickly access it on consecutive runs. AWS provides a ready-made script that we can easily tweak, and I’ve also prepared a script for Modal here.

We’ll focus on two main things: how to deploy the model on each platform and reflecting on the key differences in the deployment process.

Deployment to Lambda w/ EFS

For this part, you can read it through or follow along to deploy.

To follow along, you will need git, the AWS CDK, Docker, Node.js 18+, and Python 3.9+ installed on your computer. Once you have all of these installed, you can open up a new terminal.

Create a new directory if you want to and then clone the repository below.

git clone https://github.com/aws-samples/zero-administration-inference-with-aws-lambda-for-hugging-face.git

Go into the directory that has been created.

cd zero-administration-inference-with-aws-lambda-for-hugging-face

You can now open up the files in your code editor.

I use VS Code, so I simply run the following.

code .

Now we can go into the files that have been created and tweak them a bit. Look into the Inference folder where you will see two files, sentiment.py and summarization.py.

We can easily change the models in these files to the ones we want.

Go to the Hugging Face hub and locate a model you are interested in.

I will go with one of mine.

If you’re interested in how to build a model like this you can see a tutorial [here](https://medium.com/towards-data-science/fine-tune-smaller-transformer-models-text-classification-77cbbd3bf02b) for the keyword extractor and here for text classification.

Once you’ve located a model you are interested in, you can click on the button ‘Use this model’.

As you can see, we have two choices here, but since this script is already using the pipeline, we’ll do the same.

I have changed the code in the file below to use a new model, and it now expects ‘texts’ for batching rather than just ‘text.’

# inference/summarization.py

import json
from transformers import pipeline

# Load the fine-tuned keyword extractor once, outside the handler, so warm invocations reuse it
extractor = pipeline("text2text-generation", model="ilsilfverskiold/tech-keywords-extractor")

def handler(event, context):
    # 'texts' should be an array so the whole batch is processed in one call
    texts = event['texts']
    response = {
        "statusCode": 200,
        # return the extraction for every text in the batch, not just the first result
        "body": extractor(texts)
    }
    return response
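
If you want a quick sanity check before deploying, you can invoke the handler locally; this snippet is purely hypothetical and not part of the AWS sample repository.

# Hypothetical local smoke test for the handler above (not part of the AWS sample repo)
if __name__ == "__main__":
    event = {"texts": ["Tesla releases a new AI chip for self-driving cars"]}
    print(handler(event, None))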

You can look into the image above to see the file structure.

I changed both scripts with different models that I usually use. Make sure you save the scripts once you are done.

Then you can set up a virtual env in a terminal.

python3 -m venv venv
source venv/bin/activate

Make sure you have NodeJS 18 before you download the requirements.

pip install -r requirements.txt

Before you can do anything else you need to make sure that the user you have configured with the AWS CDK has the correct permissions.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ecr:*",
                "ssm:*",
                "iam:*",
                "lambda:*",
                "s3:*",
                "ec2:*",
                "logs:*",
                "cloudformation:*",
                "elasticfilesystem:*"
            ],
            "Resource": "*"
        }
    ]
}

After this you can run bootstrap.

cdk bootstrap

If you have issues here, check whether aws-cdk-lib is installed and, if not, reinstall it.

pip install aws-cdk-lib
cdk bootstrap

Once you run this, the command will create a CloudFormation stack.

If you run into issues here with ECR, create the repository manually.

If you have Docker running on your computer you can now deploy via your terminal.

cdk deploy

From here the CDK starts building a Docker image for the Lambda function using the Dockerfile in your inference folder. Each Lambda function has been provisioned with 8 GB of memory and a 600-second timeout.

It will create a VPC that has an Internet Gateway, EFS for caching the models, several Docker-based Lambda functions for hosting both of the models in the script and a few IAM roles for Lambda execution.

This will take some time.

What it may look like if you have an unstable WiFi connection | Image by the author

I was sitting in a small village in Italy doing this so my internet connection failed and I had to rent a GPU machine to deploy.

This may not happen to you but just make sure you have enough juice and a stable internet connection to deploy.

Once you have deployed you can go to Lambda in the AWS console and look for your new functions. You can test them directly there. The first run will be slower but once it is warm it is a bit faster.

Some notes here: since the Lambda function is in a private subnet (inside the VPC), it cannot access the internet, which is why AWS will create a NAT Gateway for you. A NAT Gateway is pricey, though, and will incur costs of around $1–$3 per day regardless of how much it is used.

We could try to put the Lambda function inside a public subnet, but I did not try it. There may also be a way around this using VPC endpoints.

We do need a VPC for EFS so we can cache the models and avoid downloading them on every invocation. AWS Lambda has a very generous free tier, but the additional resources around it may incur costs that you need to be aware of.

Once you’re done testing I would recommend you destroy the resources so you do not pay for a NAT Gateway round the clock.

cdk destroy

An additional note on using this method: you cannot specify memory and CPU separately. If you need more CPU, you need to increase memory, which can get expensive.

However, I wouldn’t fully disregard AWS Lambda when using smaller models of 125M parameters or less. You can provision a Lambda function with less memory.

Deployment to Modal

Modal was built for deploying ML models, which makes this process a lot cleaner. The script we’ll use here to deploy the same model as before can be found here.

We can specify memory, CPU and GPU within our function directly when we deploy. We can also ask for an endpoint to be created for us within the script which will make it easier to test our model with an endpoint.

However, just because we’re using another platform, this does not mean that it won’t cost us a bit as well.

Remember the calculations we did before.

To get started you’ll need a Modal account and python3 installed. After you have created one we can open up a terminal and create a new folder.

mkdir testing_modal
cd testing_modal

We can then set up a virtual environment.

python3 -m venv venv
source venv/bin/activate

Install the Modal package using pip.

pip install modal

With Modal, all the resources, environment setup, and execution happen on their platform, not locally, so we won’t have the same issues as we did with deploying to AWS.

To authenticate, run this.

python3 -m modal setup

Now, if you do not have any files in the folder, create one.

touch text.py

You can simply paste the code below into it but we’ll also go through it.

# text.py
import modal
from pydantic import BaseModel
from typing import List

app = modal.App("text-generation") # set an app name
model_repo_id = "ilsilfverskiold/tech-keywords-extractor" # decide on your model repo
cache_dir = "/cache"

image = (
    modal.Image.debian_slim()
    .pip_install(
        "huggingface-hub==0.16.4",
        "transformers",
        "torch"
    )
)

# have these loaded in modal rather than locally
with image.imports():
    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# set up the function to run for extracting keywords from texts (as per the model we're using)
# the snapshot download should download the model on build and cache it 
@app.cls(cpu=1, memory=3000, image=image) # define cpu (cores), memory and/if gpu - default CPU request is 0.1 cores the soft CPU limit is 4.1 cores - default 128 MiB of memory
class TextExtraction:
    @modal.build()
    def download_model(self):
        from huggingface_hub import snapshot_download
        snapshot_download(repo_id=model_repo_id, cache_dir=cache_dir)

    @modal.enter()
    def load_model(self):
        self.tokenizer = AutoTokenizer.from_pretrained(model_repo_id, cache_dir=cache_dir)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(model_repo_id, cache_dir=cache_dir)

    @modal.method()
    def extract_text(self, texts):
        inputs = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
        outputs = self.model.generate(**inputs, max_new_tokens=100)
        generated_texts = [self.tokenizer.decode(output, skip_special_tokens=True) for output in outputs]

        return generated_texts

class TextsRequest(BaseModel):
    texts: List[str]

# set up the web endpoint 
@app.function(image=image)
@modal.web_endpoint(method="POST", label=f"{model_repo_id.split('/')[-1]}-web", docs=True)
def generate_web(request: TextsRequest):
    texts = request.texts
    extracted_texts = TextExtraction().extract_text.remote(texts)
    return {"extracted_texts": extracted_texts}
    # add potential error handling

Remember that I am using the same model, you may use another one.

To deploy you simply run.

modal deploy text.py

This script sets up an app in Modal called "text-generation" and builds a Docker image with the needed dependencies (huggingface-hub, transformers, and torch).

It installs these directly in Modal’s environment, so you don’t have to deal with it locally. The app asks for 1 CPU core and 3 GB of memory, which is the setup I used during testing.

Model caching is handled by @modal.build(), where it uses snapshot_download() to pull the model from Hugging Face and saves it in /cache. We need to do this so the model can be loaded faster on cold starts.

The @modal.enter() decorator runs when the TextExtraction class gets called for the first time, loading the tokenizer and model from the cached files into memory.

Once the model is loaded, you can call the extract_text() method to run inference. The @modal.web_endpoint sets up a serverless API endpoint that lets you hit extract_text() via a POST request and get your text extraction results back.

The whole thing runs in Modal’s environment, so we don’t need to worry about our own computer having enough juice. This matters more with larger models, of course.

Once it has been deployed you’ll see something like this in the terminal with your endpoint.

You’ll be able to see this application in your Modal dashboard.

To run this function you can call the url you got back in the terminal.

curl -X POST "https://<your-modal-endpoint-url>" \
  -H "Content-Type: application/json" \
  -d '{"texts": ["Artificial intelligence in healthcare represents a collection of multiple technologies enabling machines to sense, comprehend, act, and learn"]}'

This does not add authentication; please see Modal’s docs to add it.

Some Notes

As you’ve learned by now, using any deployment choice you need to first cache the model on build time to make sure the cold boot is faster once you scale down. If you want to try to deploy to any other platform you can see all get started scripts here.

Going with a newer platform isn’t necessarily bad, and it will be much faster to get started. However, sometimes your organisation is strict about the platforms you are allowed to use.

The cost may also be slightly steeper with an easier choice, but the ones I have shown you aren’t that far from EC2 directly in terms of cost.


If you’ve read this far, I hope you got some insight from the research I’ve done here and that it helps you pick a vendor.

❤

The post Economics of Hosting Open Source LLMs appeared first on Towards Data Science.

]]>
Reducing the Size of Docker Images Serving Large Language Models (part 2) https://towardsdatascience.com/reducing-the-size-of-docker-images-serving-large-language-models-part-2-b7226a0b6514/ Wed, 08 May 2024 07:11:17 +0000 https://towardsdatascience.com/reducing-the-size-of-docker-images-serving-large-language-models-part-2-b7226a0b6514/ How to reduce a "small" Docker image by another 10%.

The post Reducing the Size of Docker Images Serving Large Language Models (part 2) appeared first on Towards Data Science.

]]>
How to reduce the size of a "small" Docker image by another 10%
Generated by Runway for the prompt: There are two containers on board the ship, one large and the other small. They are bright, vivid, and realistic in color.

Introduction

This is a continuation of the topic of reducing the size of Docker images serving large language models. In my previous story [1], I presented how to reduce the size of a Docker image serving a model from 7 GB to under 700 MB. The solution eliminated heavy libraries like CUDA, cuDNN, cuBLAS, torch, and triton. This was possible by converting and quantizing the model to the ONNX format and using onnxruntime on CPU instead of torch on GPU.

In this story, I present how to reduce the size of the target image further, by another 10%. This might seem like overkill, as 700 MB is already a relatively small image. However, the techniques presented here provide a deeper look into a Docker image serving a language model. They help in understanding what components are required to run the model and show that there may be lighter alternatives.

Scripts and resources used in this story are also available on GitHub [2]:

GitHub – CodeNLP/codenlp-docker-ml: This repository demonstrates how to create a small Docker image…


S-size Docker image

Let’s start by recalling the solution presented in [1], which allowed us to reduce the image size to under 700 MB.

Here is the Python code that implements the API endpoint:

and the Dockerfile to build the image:

A more detailed explanation can be found here [1]. At this point, I would like to add a comment to line 25 in the Python script, as it is not so obvious – vector = {k: v for k, v in vector.items()}. At first glance, it does nothing as it converts from a dictionary into a … dictionary. In fact, it converts an object of type:

<class 'transformers.tokenization_utils_base.BatchEncoding'>

into a dict. This is required as the ort_sess.run method expects dict as an input. Without the conversion, a nasty exception will arise.

Now, we can introduce a recipe for an even smaller Docker image serving the LLM model: super small size, XS.


XS-size Docker image

First, I will demonstrate the Python script for inference and the Dockerfile. Then, I will show the differences compared to the S-size image and discuss the changes.

Here is the Python script implementing the API endpoint:

And the Dockerfile to build the image:

Here is the side-by-side comparison of the files with highlighted differences.

Image by author: Comparison of the Dockerfiles.
Image by author: Comparison of the Python scripts.

We have two major differences, which are explained in the following subsections.

Replace transformers with tokenizers

In the initial solution, the transformers library is used for two purposes:

  1. Tokenize the input text, i.e., transform the texts into a dictionary of subtoken identifiers and attention masks.
  2. Load the label mapping from the configuration (integers to string labels).

Both tasks can be accomplished using lighter libraries. By the way, the transformers library’s size is around 70 MB.

Tokenization can be performed by the tokenizers library, which is already used under the hood by transformers. The library’s size is only 12 MB. However, there is a bit more to do as AutoTokenizer wraps the output to the right data structure – a dictionary with model-specific attributes.

To load the tokenizer, instead of:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_path)

we do the following:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file(model_path + "/tokenizer.json")

We use the class method Tokenizer.from_file and pass a path directly to the tokenizer.json file.

To tokenize the texts, we replace this code block:

text = [input.text]
vector = tokenizer(text, padding=True)
vector = {k: v for k, v in vector.items()}

with these lines:

encoded = tokenizer.encode(input.text)
vector = {
    "input_ids": [encoded.ids],
    "attention_mask": [encoded.attention_mask],
}

Here, we manually create the dict structure with specific fields the model requires. The xlm-roberta model requires two attributes: input_ids and attention_mask. Other models may require a different set of attributes.

This is for the tokenization part. Now let’s change the way we load the label mapping.

Label mapping

We will use the json.load(...) method to load the mapping from the model config file. The method returns a dict containing the config. The main difference between the output of PretrainedConfig.from_pretrained(...) and json.load(...) is that the first returns an object of the PretrainedConfig class, while the other returns a plain dict. To access the id2label mapping on the PretrainedConfig object, we read it as an attribute, config.id2label. To get the mapping from the dict, we have to use "id2label" as a key, that is config["id2label"]. The other difference is the type of the keys in the mapping: PretrainedConfig automatically casts them to int values, so we retrieve the label by calling config.id2label[label_id]. In turn, json.load(...) does not cast the keys and treats each of them as a string, which is why we need to cast the int value to a string explicitly – config["id2label"][str(label_id)].
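
To make the difference concrete, here is a minimal sketch of both approaches; model_path and label_id stand in for the exported model directory and a predicted class index from the surrounding inference code.

import json
from transformers import PretrainedConfig

# Heavier option: PretrainedConfig casts the id2label keys to int
config_obj = PretrainedConfig.from_pretrained(model_path)
label = config_obj.id2label[label_id]

# Lighter option: json.load returns a plain dict and keeps the keys as strings
with open(model_path + "/config.json") as f:
    config = json.load(f)
label = config["id2label"][str(label_id)]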

That’s all for the first part, related to using the tokenizers package instead of transformers. Now, let’s move to the other part, related to model compression.

Compress the model

The second technique for reducing the size of the Docker image is storing the model as a compressed archive. Please keep in mind that this only reduces the size of the Docker image for storage and transportation (pushing to and pulling from the image registry). This has benefits, as it reduces data transfer over the network and disk usage. At runtime, the model must be unpacked, so it will occupy more storage when it is active. The other benefit is that the model may already be distributed as a compressed archive (as an artifact stored in WandB [3], MLflow [4], or any other platform for ML experiment tracking), so it won’t require additional steps.

The potential downside of this technique is the additional time it takes to unpack the archive when the container starts. If you’re optimizing to reduce initialization time, this might not suit you. In other cases, the decompression time is negligible and we save some additional MBs. In my case, compressing the model and tokenizer reduced their size from 301 MB to 219 MB.
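
For completeness, the archive itself can be produced with any tar tool; here is a sketch using Python's tarfile module. Only the archive name comes from the entrypoint script below; the source directory name is an assumption, so adjust it to your layout.

import tarfile

# Package the exported model directory into the archive referenced by the entrypoint script;
# the source directory name is an assumption - adjust it to your layout
with tarfile.open("models/xlm-roberta-base-language-detection-onnx.tar.gz", "w:gz") as tar:
    tar.add(
        "models/xlm-roberta-base-language-detection-onnx",
        arcname="xlm-roberta-base-language-detection-onnx",
    )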

To decompress the model at runtime, I move the entry point actions to a separate bash script:

#!/bin/bash

tar -xvf models/xlm-roberta-base-language-detection-onnx.tar.gz -C models/

uvicorn api:app --host 0.0.0.0

and modify the ENTRYPOINT in the Dockerfile:

ENTRYPOINT ["bash", "entrypoint_onnx_xs.sh"]

And that’s it. I welcome you to the next section, where you will see the final output of our efforts in terms of Docker image size.


Docker image comparison

Let’s see what are the variants of the Docker images that we have created so far:

  • language_detection_cuda – inference on GPU using the Torch backend and full model,
  • language_detection_onnx – inference on CPU using onnxruntime and quantized model,
  • language_detection_onnx_xs – inference on CPU using onnxruntime and quantized model, with model compression and the tokenizers package instead of transformers.

docker images | grep language_detection

To simplify, the output contains only the image name and its size:

language_detection_cuda     (...)      7.05GB
language_detection_onnx     (...)      699MB
language_detection_onnx_xs  (...)      575MB

For visual comparison, here is the chart with the Docker image sizes. The X axis is on a logarithmic scale so the difference between ONNX-S and ONNX-XS is visible.

Image by author, generated using [5]

The final Docker image has a size of 575 MB. Compared to the initial image, the size was reduced by 12 times.


Conclusions

We have seen that it is possible to reduce the size of the Docker image serving an LLM model from gigabytes to megabytes. In our case, it was from 7GB to 575MB. Such a significant size reduction can be useful when we are constrained by network transfers (pushing and pulling the image over a network), the image registry’s limitations, or the production server’s memory limitations.

Despite the many benefits of using small images, some downsides were not discussed here: slower inference and potentially lower performance. Many aspects should be considered when choosing the right approach: business requirements, expected performance and inference time, and available infrastructure. Since there are many factors to consider, this is material for a separate story – part 3 🙂


References

[1] https://towardsdatascience.com/reducing-the-size-of-docker-images-serving-llm-models-b70ee66e5a76

[2] https://github.com/CodeNLP/codenlp-docker-ml

[3] https://wandb.com

[4] https://mlflow.org

[5] https://www.rapidtables.com/tools/bar-graph.html

The post Reducing the Size of Docker Images Serving Large Language Models (part 2) appeared first on Towards Data Science.

]]>
Seven Requisite Skills for Navigating from Data Science to Applications https://towardsdatascience.com/seven-requisite-skills-for-navigating-from-data-science-to-applications-d23e04f7ee1f/ Fri, 12 Apr 2024 18:02:47 +0000 https://towardsdatascience.com/seven-requisite-skills-for-navigating-from-data-science-to-applications-d23e04f7ee1f/ Helping Entry-Level Data Scientists to Transform Ideas into Industrial-Level applications.

The post Seven Requisite Skills for Navigating from Data Science to Applications appeared first on Towards Data Science.

]]>
Helping Entry-Level Data Scientists to Transform Ideas into Industrial-Level applications
Image by author (Ideogram)

Back in my college days, my role in data science projects was like that of an alchemist: experimenting with fancy AI models to dig out the relationships among variables in the data of my field. Powerful AI algorithms consistently amazed me by outperforming traditional statistical methods and physics-based models. However, the real challenge began when I became an AI engineer in the industry in 2022. From then on, the technology stack of data science expanded rapidly into fields that I was unfamiliar with. My first challenge in the industry was to ship a model to the production environment, with the requirements of reliability, maintainability, and scalability. Looking back on my struggles, I realize that transforming AI models from prototypes into production-ready applications is nothing more than a combination of

  • Good design patterns
  • Robust code
  • Efficient deployment strategies

This article is a comprehensive guide summarizing seven key topics from my earlier sub-articles. Each topic explores one aspect of developing and deploying data science projects at an industry level:

  1. Code modularization
  2. Data validation
  3. Abstraction
  4. Configuration management
  5. Web service
  6. API documentation
  7. Docker and cloud

Using a streamflow forecasting application as a case study, the article will dive into each topic with core concepts and demos, equipping entry-level data scientists with powerful tools to enhance their career skills. Let’s start the journey of AI engineering!


1. Organizing Codes in Modular

Modularization divides a program into smaller and independent modules. Modular code makes it easier to maintain and debug, as errors can be solved within specific modules. Modular code also increases extensibility, as you only need to modify codes in specific modules when adding additional features. Moreover, creating code modules enables multiple developers to work on different parts of the project simultaneously.

Example

Here is the directory layout for the code of our streamflow forecasting application:

.gitignore
config.yaml
Dockerfile
LICENSE
main_service.py
main_train.py
README.md
requirements.txt
├── adapter
├── config
├── domain
├── model
├── resources
├── service
├── test
└── utils

Our streamflow forecasting application can be organized into the following code modules:

  • adapter: Reads data from various sources and converts them into the formats required by AI model training and inference.
  • config: Specifies configurable parameters for different components of an AI application pipeline, e.g., data reading, model training, and service deployment.
  • domain: Defines data schemas to maintain consistency of data flow.
  • model: Organizes functions associated with model training and inference, including the data loader setup and the training loop for PyTorch.
  • resources: Stores intermediate assets in the model training process, e.g., data scalers and model checkpoints.
  • service: Archives web framework code to run a model service.
  • test: Writes unit and integration test functions.
  • utils: Defines functions that can be utilized across other modules, such as date formatting.

By dividing the project into modules, development work becomes more manageable and trackable. Although the code structure above follows software design principles, it is not the only applicable structure. You can find the template that fits your project.


2. Automating Data Validation

Data validation ensures data is in the correct format, reducing the risk of errors and increasing the application’s robustness to changes. The way to validate data is to define standard, readable schemas for the input and output data of the pipeline’s key components. In Python, the validation process for various data types is accelerated by three widely used libraries:

  • pydantic: Validates any kind of data.
  • pandera: Validates tabular data.
  • jsonschema: Validates JSON data.
Figure 1. Python tools for data validation. Source: by author.

Example

The streamflow forecasting application requires specific formats for the training data and for the web request parameters.

  • Validating training data format

Data schema defined with pandera automatically checks the format of input data. For instance, meteorological data records from various sources must be formatted to follow the MeteoModel schema before being fed into models.

# @File: data_schema.py

from pandera import DataFrameModel
import pandera as pa

class MeteoModel(DataFrameModel):
    '''Meteorological data schema'''
    id: str
    time: str
    temperature_max: float = pa.Field(nullable=True)
    temperature_min: float = pa.Field(nullable=True)
    precipitation: float = pa.Field(nullable=True)
    evapotranspiration: float = pa.Field(nullable=True)

......

The schema can be used to validate the data format directly by calling MeteoModel.validate:

import pandas as pd

meteo_df = pd.DataFrame({
    'id': ['10251335', '10251335', '10251335', '10251335'],
    'time': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    'temperature_max': [10.0, 11.0, 12.0, 13.0],
    'temperature_min': [5.0, 6.0, 7.0, 8.0],
    'precipitation': [0.0, 0.1, 0.2, 0.3],
    'evapotranspiration': [0.0, 0.1, 0.2, 0.3]
})

try:
    MeteoModel.validate(meteo_df)
except Exception as e:
    print(e)
  • Validating request parameters

Web request parameters should be in JSON format, which is the standard for web APIs. The streamflow forecasting service takes site_id and forecast_days as input parameters:

# @File: query_schema.py

ForecastRequestSchema = {
    "type": "object",
    "properties": {
        "site_id": {"type": "string"},
        "forecast_days": {"type": "integer", "maximum": 5, "minimum": 1},
    },
    "required": ["site_id", "forecast_days"]
}

......

With the predefined schema, we can catch errors about invalid request parameters:

from jsonschema import validate

request_data = {
    "site_id": "USGS-12345678",
    "forecast_days": 3
}

try:
    validate(instance=request_data, schema=ForecastRequestSchema)
except Exception as e:
    print(e)

Explore more

From Data Science to Production: Automatic Data Validation


3. Abstraction

We often write repeated code in data science projects for logic such as processing data from various sources. To avoid repeating yourself, the best approach is to keep the main logic of your pipeline in one place and make it reusable. In Python, you can use abstract classes from the abc library to write generic code that works in different situations, which simplifies future extensions.

Example

To handle different sources of meteorological data in the streamflow forecasting application, the AbstractMeteoReader is defined as the template for concrete data readers.

# @File: abstract_reader.py

from domain.data_schema import MeteoModel
from abc import ABC, abstractmethod
from pandera.typing import DataFrame

class AbstractMeteoReader(ABC):
    '''Abstract class for the reader of meteorological data'''

    @abstractmethod
    def __init__(self, **kwargs):
        pass

    @abstractmethod
    def get_site_history_daily_meteo(self, site_id: str, lat: float, lon: float, history_days: int) -> DataFrame[MeteoModel]:
        pass

    @abstractmethod
    def get_site_forecast_daily_meteo(self, site_id: str, lat: float, lon: float, forecast_days: int) -> DataFrame[MeteoModel]:
        pass

......

All concrete readers should inherit from the abstract class and implement the common functions get_site_history_daily_meteo and get_site_forecast_daily_meteo. Here is the reader code for Open-Meteo weather data:

# @File: meteo_reader.py

from domain.data_schema import MeteoModel
from config.config_data import OpenMeteoDataConfig
from adapter.abstract_reader import AbstractMeteoReader
from pandera.typing import DataFrame

class OpenMeteoReader(AbstractMeteoReader):
    '''Data reader for Open-meteo meteorological data'''

    def __init__(self):
        self.config = OpenMeteoDataConfig

    def get_site_history_daily_meteo(self, site_id: str, latitude: float, longitude: float, history_days: int) -> DataFrame[MeteoModel]:
        ......

    def get_site_forecast_daily_meteo(self, site_id: str, latitude: float, longitude: float, forecast_days: int) -> DataFrame[MeteoModel]:
        ......

class XXXXMeteoReader(AbstractMeteoReader):
    '''Data reader for another meteorological data'''
    ......
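
A hypothetical factory, not part of the original codebase, shows how this abstraction connects to the configuration described in the next section: the configured data source name picks the concrete reader, and the rest of the pipeline only sees the abstract interface.

# Hypothetical factory (not in the original repo): pick a concrete reader by configured source name
def build_meteo_reader(source: str) -> AbstractMeteoReader:
    readers = {
        "open-meteo": OpenMeteoReader,
        # "xxxx": XXXXMeteoReader,
    }
    return readers[source]()

# The coordinates below are made-up example values
reader = build_meteo_reader("open-meteo")
forecast_df = reader.get_site_forecast_daily_meteo("USGS-12345678", 46.2, -119.1, forecast_days=3)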

Explore more

From Data Science to Production: Abstract Classes for Model Deployment


4. Configuration Management

Configuration management adapts a data science application to different environments and use cases. Configurable parameters include options for data reading, model hyperparameters, web service ports, etc. A recommended practice is to separate the configuration from the code instead of hard-coding it within functions. In Python, configurations can be defined in INI files, YAML files, or Python classes.

Figure 2. Options for configuration management in Python. Source: by author.

Example

The streamflow forecasting application is expected to work with different data sources, which means it works with different combinations of streamflow and meteorological datasets. We expose options of data sources to end users through the config.yaml file, making the forecasting applicable for multiple predefined regions.

# @File: config.yaml

data:
  flow_data: usgs
  meteo_data: open-meteo
service:
  port: 8888

The YAML file specifies clear and readable parameters that can be modified by end users and loaded in each deployment. Meanwhile, immutable configurations for users are defined through internal Python classes, such as OpenMeteoDataConfig.

# @File: config_data.py

import yaml

with open('config.yaml', 'r') as file:
    BaseConfig = yaml.safe_load(file)

class DataConfig:
    flow_data = BaseConfig['data']['flow_data']
    weather_data = BaseConfig['data']['meteo_data']
    history_days = 365 * 30

class OpenMeteoDataConfig:
    lag_days = 4
    varnames = ["temperature_2m_max", "temperature_2m_min", "precipitation_sum", "et0_fao_evapotranspiration"]

......

Explore more

From Data Science to Production: Configuration Management for ML Code


5. Building Service APIs

To integrate models into software or applications, the popular way is to create a standard web service interface, known as an Application Programming Interface (API). An API contains a set of URLs that are accessible to end users or other engineers. The most popular web framework libraries in Python include Flask, Tornado, and Django. They offer high-performance backends for developing web services.

Figure 3. Deploying an AI model as a web service. Source: by author.

Example

The streamflow forecasting application utilizes Tornado to create a web service. A Tornado web application consists of three parts: tornado.web.RequestHandler objects that execute your inference pipeline code, a tornado.web.Application object that routes requests to corresponding handlers, and a main function that runs the server. We’ve defined three handlers:

  • InfoHandler: Queries site information from the local database.
  • ForecastHandler: Feeds data into the forecasting model and generates predictions.
  • HealthHandler: Validates service connections.
# @File: main_service.py

from service.info_service import InfoService
from service.forecast_service import ForecastService
from config.config_service import ServiceConfig
import asyncio
import tornado
from tornado.concurrent import run_on_executor
import concurrent

class BaseHandler(tornado.web.RequestHandler):

    executor = concurrent.futures.ThreadPoolExecutor(max_workers=10)

    @run_on_executor
    def _process_get(self, service):
        query = self.request.arguments
        for key, value in query.items():
            query[key] = str(value[0].decode('utf-8'))
        print(query)
        response = service.execute(query)
        return response

class HealthHandler(BaseHandler):

    async def get(self):
        self.write("OK")

class InfoHandler(BaseHandler):

    async def get(self):
        service = InfoService()
        response = await self._process_get(service)
        self.write(response)

class ForecastHandler(BaseHandler):

    async def get(self):
        service = ForecastService()
        response = await self._process_get(service)
        self.write(response)

class Application(tornado.web.Application):

    _routes = [
        tornado.web.url(r"/healthCheck", HealthHandler),
        tornado.web.url(r"/info", InfoHandler),
        tornado.web.url(r"/forecast", ForecastHandler)
    ]

    def __init__(self):
        super(Application, self).__init__(self._routes)

async def main():
    app = Application()
    app.listen(ServiceConfig.port)
    await asyncio.Event().wait()

if __name__ == "__main__":
    asyncio.run(main())

Running the code on a server exposes three URLs for external requests: http://<ip>:<port>/healthCheck, http://<ip>:<port>/info, and http://<ip>:<port>/forecast.
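
Once the service is running, a client only needs the query parameters defined in the request schema; here is a minimal sketch using requests, where the host is a placeholder and the port matches the service section of config.yaml.

import requests

# Placeholder host; the port matches the service section of config.yaml
resp = requests.get(
    "http://localhost:8888/forecast",
    params={"site_id": "USGS-12345678", "forecast_days": 3},
)
print(resp.json())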

Explore more

From Data Science to Production: Building a Model Service Using Tornado


6. Simplifying API Documentation with Swagger

API documentation simplifies the integration of models by other developers. The OpenAPI specification provides a standard description format for web APIs, while manually writing detailed documentation can be tedious. Fortunately, the Swagger UI tool can generate API documentation directly from the application code. Powerful features of Swagger UI include:

  • Automatic documentation
  • Interactive interface
  • Well-defined structure & schema
Figure 4. Concepts of API documentation. Source: by author.

Example

The tornado_swagger Python library works with the Tornado web framework to automate API documentation generation. It creates a user-friendly UI page (http://<ip>:<port>/doc) that is deployed alongside the web service. With the help of tornado_swagger, the Swagger UI for the streamflow forecasting application displays three API methods:

  1. Health check API: Validates service connectivity.
  2. Info API: Lists river sites for forecasting.
  3. Forecast API: Predicts future streamflow for a given gauge station.

and two response data structures:

  1. InfoResponse: The data schema returned by the Info API.
  2. ForecastResponse: The data schema returned by the Forecast API.
Figure 5. Elements of Swagger documentation in the streamflow forecasting project. Source: by author.

Details of API methods and data schemas are specified through annotations added to the application code, which can be parsed by tornado_swagger to generate documentation UI. In addition, the API documentation supports direct interaction, facilitating understanding and testing.

Figure 6. Definitions of API methods and data schemas in Swagger documentation. Source: by author.
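
As an illustration, an annotation on the forecast handler could look roughly like the sketch below; the exact docstring schema and the setup call that registers the routes should be taken from the tornado_swagger documentation rather than from this example.

# Sketch of a tornado_swagger-style docstring annotation; consult the library docs for exact syntax
class ForecastHandler(BaseHandler):

    async def get(self):
        """
        ---
        tags: [Forecast]
        summary: Predict future streamflow for a given gauge station
        parameters:
          - name: site_id
            in: query
            required: true
            type: string
          - name: forecast_days
            in: query
            required: true
            type: integer
        responses:
          200:
            description: ForecastResponse
        """
        service = ForecastService()
        response = await self._process_get(service)
        self.write(response)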

Explore more

From Data Science to Production: Generating API Documentation with Swagger


7. Cloud Deployment: Docker and AWS Fargate

The final step is to deploy the model in a cloud environment. Data scientists, who focus on model building and refinement, want to spend as little effort as possible on deployment infrastructure and scalability. Launching a containerized application on a serverless cloud platform is a popular approach, which involves two tools:

  1. Docker: Creates lightweight and portable containerized applications.
  2. AWS Fargate: Runs containerized workloads in a serverless cloud environment.

Example

Figure 7. Concepts of AWS Fargate. Source: by author.

Key steps in the deployment process include:

  1. Building a Docker image
  2. Uploading the image to a remote repository
  3. Creating a cluster in Amazon Elastic Container Service (ECS)
  4. Creating a task definition in ECS
  5. Creating a service, specified as a Fargate instance, to run the predefined task

The forecast-service has been deployed in our TestCluster on ECS. AWS Fargate automatically orchestrates the service’s launch and scales the number of tasks based on real-time resource demand.

Figure 8. Console of AWS Fargate service. Source: by author.

Explore more

From Data Science to Production: Streamlining Model Deployment in Cloud Environment


Summary

Turning a data science project into a production-ready application requires clean code and efficient deployment. This article guides you through a journey with several key sections, from code management to cloud deployment. Each section introduces core ideas and corresponding tools. Following these sections, entry-level data scientists can develop robust, extensible, and scalable applications, making their models accessible to the public rapidly.

More details about each topic can be seen in the extended reading for each section. I am also excited to learn more about any additional skills you might encounter in the journey.

The post Seven Requisite Skills for Navigating from Data Science to Applications appeared first on Towards Data Science.

]]>
Scaling AI Models Like You Mean It https://towardsdatascience.com/scaling-ai-models-like-you-mean-it-3afa56c1e14b/ Wed, 10 Apr 2024 22:20:04 +0000 https://towardsdatascience.com/scaling-ai-models-like-you-mean-it-3afa56c1e14b/ Strategies for Overcoming the Challenges of Scaling Open-Source AI Models in Production

The post Scaling AI Models Like You Mean It appeared first on Towards Data Science.

]]>
If you’re reading this article, you probably need no introduction to the advantages of deploying open-source models. Over the past couple of years, we have seen incredible growth in both the quantity and quality of open-source models.

  • Platforms such as Hugging Face have democratized access to a wide array of models, including Large Language Models (LLMs) and diffusion models, empowering developers to innovate freely and efficiently.
  • Developers enjoy greater autonomy, as they can fine-tune and combine different models at will, leading to innovative approaches like Retrieval-Augmented Generation (RAG) and the creation of advanced agents.
  • From an economic perspective, open-source models provide substantial cost savings, enabling the use of smaller, specialized models that are more budget-friendly compared to general-purpose models like GPT-4.

Open-source models present an attractive solution, but what’s the next hurdle? Unlike using a model endpoint like OpenAI, where the model is a scalable black box behind the API, deploying your own open-source models introduces scaling challenges. It’s crucial to ensure that your model scales effectively with production traffic and maintains a seamless experience during traffic spikes. Additionally, it’s important to manage costs efficiently, so you only pay for what you use and avoid any financial surprises at the end of the month.

True north: Serverless functions for GPUs

Interestingly, this sounds like a challenge that modern serverless architectures, like AWS Lambda, have already solved – a solution that has existed for almost a decade. However, when it comes to AI model deployment, this isn’t quite the case.

The limitations of serverless functions for AI deployments are multifaceted.

  • No GPU support. Platforms like AWS Lambda don’t support GPU. This isn’t merely a technical oversight; it’s rooted in architectural and practical considerations.
  • GPUs cannot be easily shared. GPUs, while highly parallel devices, are not as flexible when it comes to handling multiple inference tasks on different models simultaneously.
  • GPUs are expensive. They’re exceptional for model inference tasks but costly to maintain, especially if not utilized continuously.

Next, let’s take a look at our scaling journey and the important lessons we have learned along the way.

The cold start problem

Before we could even begin to work on scaling, we have the notorious "cold start" problem. This issue presents itself in three different stages:

Breakdown of the cold start problem. Image by the author.
  1. Cloud provisioning: This phase involves the time it takes for a cloud provider to allocate an instance and integrate it into our cluster. This process varies widely, ranging from as quick as 30 seconds to several minutes, and in some cases, even hours, especially for high-demand instances like the Nvidia A100 and H100 GPUs.
  2. Container image pulling: Unlike simple Python job images, AI model serving images are very complex, due to the dependencies and custom libraries they require. Although cloud providers boast multi-gigabit network bandwidth, in our experience download speeds often fell far below that, with image pulling taking about 3 minutes.
  3. Model loading: The time required here largely depends on the model’s size, with larger models like LLMs and diffusion models taking significantly longer due to their billions of parameters. For example, loading a 5GB model like Stable Diffusion 2 might take approximately 1.3 minutes with 1Gbps network bandwidth, while larger models like Llama 13B and Mixtral 8x7B could require 3.5 minutes and 12.5 minutes respectively.

Each phase of the cold start issue demands specific strategies to minimize delays. In the following sections, we’ll explore each of them in more detail, sharing our strategies and solutions.

Cloud provisioning

In contrast to the homogeneous environment of serverless CPUs, managing a diverse range of compute instance types is crucial when dealing with GPUs, each tailored for specific use cases. For instance, IO-bound LLMs require high GPU memory bandwidth and capacity, while generative models need more powerful GPU compute.

Ensuring availability during peak traffic by maintaining all GPU instance types could lead to prohibitively high costs. To avoid the financial strain of idle instances, we implemented a "standby instances" mechanism. Rather than preparing for the maximum potential load, we maintained a calculated number of standby instances that match the incremental scaling step sizes. For example, if we scale by two GPUs at a time, we need to have two standby instances ready. This allows us to quickly add resources to our serving fleet as demand surges, significantly reducing wait time, while keeping cost manageable.

Image by the author.

In a multi-tenant environment, where multiple teams or, in our case, multiple organizations, share a common resource pool, we can achieve more efficient utilization rates. This shared environment allows us to balance varying resource demands, contributing to improved cost efficiency. However, managing multi-tenancy introduces challenges, such as enforcing quotas and ensuring network isolation, which can add complexity to the cluster.

Container image pulling

Serverless CPU workloads often use lightweight images, like the Python slim image (around 154 MB). In stark contrast, a container image built for serving an LLM can be much larger (6.7 GB); the bulk of this size comes from the various dependencies required to run the AI model.

Image by the author.

Despite high-bandwidth networks advertised by cloud providers, the reality often falls short, with actual download speeds being a fraction of the promised rates.

In practice, a significant portion of the files were never used. One option is to optimize the container image itself, but that quickly proved unmanageable. Instead, we shifted our focus to an on-demand file pulling approach. Specifically, we first downloaded only the image metadata, with the actual remote files being fetched later as needed. In addition, we leveraged peer-to-peer networking within the cluster to dramatically increase pulling efficiency.

Container image metadata can be pulled in seconds. Image by the author.

With these optimizations, we reduced the image pulling time from several minutes to mere seconds. However, we all know this measurement is "cheating" since the actual files are not pulled at this stage. The real file pulling occurs when the service runs. Therefore, it’s crucial to have a service framework that allows you to define behaviors at various lifecycle stages, such as initialization and serving. By doing all of the bootstrapping during initialization, we can ensure that all file dependencies are pulled. This way, when it comes to serving time, there are no delays caused by file pulling.

Service framework that enables service initialization and API definitions. Image by the author.

In the above example, model loading is done during the initialization lifecycle within __init__ and serving happens within the @bentoml.api named txt2img.
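
Below is a minimal sketch of what such a service might look like. The article only names __init__ and the @bentoml.api-decorated txt2img method; the service decorator arguments, the model ID, and the inference parameters are illustrative assumptions rather than the authors' actual implementation.

import bentoml
import torch
from diffusers import StableDiffusionPipeline
from PIL.Image import Image


@bentoml.service(resources={"gpu": 1})  # assumed decorator arguments
class StableDiffusionService:
    def __init__(self) -> None:
        # Initialization lifecycle: all file pulling and model loading happens
        # here, before the replica starts accepting traffic.
        self.pipe = StableDiffusionPipeline.from_pretrained(
            "stabilityai/stable-diffusion-2", torch_dtype=torch.float16
        )
        self.pipe.to("cuda")

    @bentoml.api
    def txt2img(self, prompt: str) -> Image:
        # Serving lifecycle: requests hit an already-warm model, so no file
        # pulling or weight loading happens on the request path.
        return self.pipe(prompt, num_inference_steps=30).images[0]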

Model loading

Initially, the most straightforward approach was to fetch the model directly from a remote store like Hugging Face. Using Content Delivery Networks (CDNs), NVMe SSDs, and shared memory, we could remove some of the bottlenecks. While this worked, it was far from optimal.

To improve this process, we considered using in-region network bandwidth. We seeded models in our distributed file systems and broke them into smaller chunks, allowing for parallel downloads. This drastically improved performance, but we still ran into the cloud provider’s network bandwidth limits.

In response, we further optimized to leverage in-cluster network bandwidth by using peer-to-peer sharing and tapping into local caches. While the improvements were substantial, they added a layer of complexity to the process, which we needed to abstract away from developers.

Image by the author.

Even with the above practices, we still suffered from a sequential bottleneck: the need to wait for each step to complete before proceeding with the next. Models had to be fully downloaded to the persistent drive before being loaded into CPU memory, and only then into the GPU.

Image by the author.

We turned to a stream-based method for loading model weights, using the distributed file cache system we had in place. This system allows programs to operate as if all files were logically available on disk. In reality, the required data is fetched on demand from remote storage, bypassing disk writes entirely. By leveraging a format like Safetensors, we can efficiently load the model weights into main memory through memory mapping (mmap) and then stream them into GPU memory.
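
As a rough illustration of the streaming idea, the sketch below uses the Safetensors safe_open API to memory-map a checkpoint and move tensors to the GPU one by one. The file path and device placement are assumptions; in the described system, the path is served by the distributed file cache rather than a plain local disk.

import torch
from safetensors import safe_open

WEIGHTS_PATH = "model.safetensors"  # assumed: exposed via the distributed file cache

state_dict = {}
with safe_open(WEIGHTS_PATH, framework="pt", device="cpu") as f:
    for name in f.keys():
        # Tensors are materialized lazily via mmap, so only the bytes actually
        # read are fetched, and nothing is written to disk first.
        state_dict[name] = f.get_tensor(name).to("cuda", non_blocking=True)

# model.load_state_dict(state_dict)  # hypothetical model object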

Moreover, we adopted asynchronous writing to disk. By doing so, we created a faster-access cache layer on the local disk. Thus, new deployments with only code changes could bypass the slower remote storage fetch phase, reading the model weights from local cache directly.
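
A minimal sketch of that asynchronous cache-warming idea follows; the paths are hypothetical, and the copy runs in a background thread so it never blocks serving.

import shutil
import threading

def warm_local_cache(remote_path: str, local_path: str) -> None:
    # Copy the weights that were streamed from the remote cache onto local
    # disk, so the next deployment with only code changes can read them locally.
    shutil.copyfile(remote_path, local_path)

threading.Thread(
    target=warm_local_cache,
    args=("/mnt/remote-cache/model.safetensors", "/var/cache/model.safetensors"),
    daemon=True,
).start()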

To summarize, we managed to optimize the cold start time and we were happy with the results:

  • No cloud provisioning delay with standby instances.
  • Faster container image pulling with on-demand and peer-to-peer streaming.
  • Accelerated model loading time with distributed file systems, peer-to-peer caching, and streamed loading to GPU memory.
  • Parallelized image pulling and model loading, enabled by the service framework.

Scaling metrics

Next, we need to identify the most indicative signal for scaling AI model deployments on GPUs.

Resource utilization metrics

Initially, we considered CPU utilization. It’s straightforward and has an intuitive default threshold, such as 80%. However, the obvious drawback is that CPU metrics don’t capture GPU utilization. Additionally, the Global Interpreter Lock (GIL) in Python limits parallelism, preventing high CPU utilization on multi-core instances, making CPU utilization a less feasible metric.

We also explored GPU utilization as a more direct measure of our models’ workloads. However, we encountered an issue: the GPU utilization reported by tools like nvml didn’t accurately represent the actual utilization of the GPU. This metric samples kernel usage over a period of time, and a GPU is considered utilized if at least one kernel is executing. This aligns with our observation that better performance can often be achieved through improved batching, even though the GPU device was already reported as having high utilization.

_Note: According to the NVIDIA documentation, utilization.gpu means "Percent of time over the past sample period during which one or more kernels was executing on the GPU. The sample period may be between 1 second and 1/6 second depending on the product"._

Resource-based metrics are inherently retrospective as they only reflect usage after the resources have been consumed. They’re also capped at 100%, which presents a problem: when scaling based on these metrics, the maximum ratio for adjustment is typically the current utilization over the desired threshold (see scaling formula below). This results in a conservative scale-up behavior that doesn’t necessarily match the actual demand of production traffic.

desiredReplicas = ceil[currentReplicas * ( currentMetricValue / desiredMetricValue )]
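
To make that ceiling concrete, here is a small worked example of the formula; the replica count and the 80% target are illustrative.

import math

def desired_replicas(current: int, current_metric: float, target_metric: float) -> int:
    # The autoscaling formula quoted above.
    return math.ceil(current * (current_metric / target_metric))

# A utilization metric saturates at 100%, so even if real demand has tripled,
# the scale-up per evaluation is capped at 100 / 80 = 1.25x.
print(desired_replicas(4, 100, 80))  # 5 replicas, regardless of how far demand overshot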

Request-based metrics

We turned to request-based metrics, which provide more proactive signals and are not capped at 100%.

QPS is a widely recognized metric, valued for its simplicity. However, its application in generative AI, such as with LLMs, is still an open question. QPS is not easy to configure, and because the cost per request varies with the number of tokens processed and generated, using QPS as a scaling metric can lead to inaccuracies.

Concurrency, on the other hand, has proven to be an ideal metric for reflecting the actual load on the system. It represents the number of active requests either queued or being processed. This metric:

  • Precisely reflects the load on the system. Little’s Law, which states that concurrency equals QPS multiplied by average latency, provides an elegant way to understand the relationship between QPS and concurrency. In practice, the average latency per request is often unknown in model serving, but by measuring concurrency directly, we don’t need to calculate it.
  • Accurately calculates the desired replicas using the scaling formula, allowing the deployment to scale directly to the ideal size without intermediate steps.
  • Is easy to configure based on batch size. For non-batchable models, it’s simply the number of GPUs, since each can only handle one generation task at a time. For models that support batching, the batch size determines the concurrency level.

For concurrency to work, we need support from the service framework to automatically instrument concurrency as a metric and serve it as a scaling signal for the deployment platform. We must also establish the right scaling policies to guard against overzealous scale-up during a traffic spike or premature scale-down when traffic is sparse.
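
As one possible way to instrument concurrency (the article does not prescribe a specific tool), the sketch below exposes an in-flight request gauge with prometheus_client; run_model and the port are hypothetical stand-ins.

from contextlib import contextmanager
from prometheus_client import Gauge, start_http_server

# Number of active requests (queued or in progress) on this replica.
IN_FLIGHT = Gauge("service_in_flight_requests", "Concurrent requests per replica")

@contextmanager
def track_concurrency():
    IN_FLIGHT.inc()
    try:
        yield
    finally:
        IN_FLIGHT.dec()

def run_model(payload):
    return {"echo": payload}  # stand-in for the real inference call

def handle_request(payload):
    with track_concurrency():
        return run_model(payload)

start_http_server(9090)  # metrics endpoint scraped by the autoscaling pipeline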

Request queue

Another important mechanism we integrated alongside concurrency is the request queue. It acts as a buffer and an orchestrator, ensuring that incoming requests are handled efficiently without overloading any single server replica.

In a scenario without a request queue, all incoming requests are dispatched directly to the server (6 requests in the image below). If multiple requests arrive simultaneously and there’s only one active server replica, it becomes a bottleneck. The server tries to process each request in a first-come-first-served manner, often leading to timeouts and a bad client experience.

Image by the author.

Conversely, with a request queue in place, the server consumes requests at an optimal rate, processing at a rate based on the concurrency defined for the service. When additional server replicas scale up, they too begin to pull from the queue. This mechanism prevents any single server from becoming overwhelmed and allows for a smoother, more manageable distribution of requests across the available infrastructure.
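
The sketch below captures the per-replica half of this idea, using an asyncio semaphore as the buffer; in the described system the queue sits in front of all replicas, and run_inference is a hypothetical stand-in for the model call.

import asyncio

MAX_CONCURRENCY = 4  # e.g. the batch size one replica can sustain
_slots = asyncio.Semaphore(MAX_CONCURRENCY)

async def run_inference(request: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for the real model call
    return f"result for {request}"

async def handle(request: str) -> str:
    # Requests beyond MAX_CONCURRENCY wait here instead of piling onto the
    # model server and timing out.
    async with _slots:
        return await run_inference(request)

async def main():
    results = await asyncio.gather(*(handle(f"req-{i}") for i in range(10)))
    print(results)

asyncio.run(main())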

Conclusions

Our journey in exploring AI model scaling solutions has been an adventure, which ultimately led us to create the scaling experience on BentoCloud – a platform that encapsulates all our learnings.

To avoid the impression of a promotion, we’ll illustrate our point with a picture that’s worth a thousand words. The monitoring dashboard below demonstrates the correlation between incoming requests and the scaling up of server instances.

Equally important to scaling up is the ability to scale down. As the requests waned to zero, the deployment reduced the number of active instances accordingly. This ability ensures that no unnecessary costs are incurred for unused resources, aligning expenditure with actual usage.

BentoCloud monitoring dashboard. Image by the author.

We hope the takeaway is that scaling for model deployments should be considered an important aspect of production applications. Unlike scaling CPU workloads, scaling model deployments on GPUs presents unique challenges, including cold start times, configuring scaling metrics, and orchestrating requests. When evaluating deployment platforms, their solutions to these challenges should be thoroughly assessed.

The post Scaling AI Models Like You Mean It appeared first on Towards Data Science.

]]>
5 Levels of MLOps Maturity https://towardsdatascience.com/5-levels-of-mlops-maturity-9c85adf09fe2/ Thu, 15 Jun 2023 18:35:42 +0000 https://towardsdatascience.com/5-levels-of-mlops-maturity-9c85adf09fe2/ Introduction

The post 5 Levels of MLOps Maturity appeared first on Towards Data Science.

]]>
Progression of ML infrastructure from Level 1 maturity to Level 5. Image by author.

Introduction

Building a solid infrastructure for ML systems is a big deal. It needs to ensure that the development and deployment of ML applications are organized and reliable. But here’s the thing – every company’s infrastructure needs are different. It depends on how many ML applications they have, how quickly they need to deploy, or how many requests they need to handle.

For example, if a company has just one model in production, the deployment process can be handled manually. On the other end of the spectrum, companies like Netflix or Uber, with hundreds of models in production, need highly specialized infrastructure to support them.

Now you might ask yourself: where does your company fit on that spectrum?

MLOps maturity levels shared by Google and Microsoft are here to help. They describe the advancement and sophistication of the ML infrastructure based on the best practices in the industry.

This blog post aims to synthesize and take the best from both frameworks. First, we’ll analyze the five maturity levels and show the progression from manual processes to advanced automated infrastructures. Then, in the last section, we will argue that some of the points presented by Microsoft and Google should not be followed blindly but rather adjusted to your needs. This should help you figure out where you stand with your infrastructure and find potential areas for improvement.

Alright, let’s dive in!

What is MLOps?

MLOps is a set of practices to establish a standardized and repeatable process for managing the entire ML lifecycle, starting from data preparation, model training, deployment, and monitoring. It borrows from the widely adopted DevOps practices in software engineering, which are focused on giving teams a rapid and continuously iterative approach to shipping software applications.

However, DevOps tools alone are not sufficient for the ML world, and MLOps differs from DevOps in several ways:

  • MLOps requires a multidisciplinary team with a diverse skill set. This team includes data engineers responsible for data collection and storage, data scientists who develop the models, machine learning engineers (MLEs) who deploy the models, and software engineers who integrate them with the product.
  • Data science is inherently experimental, allowing for ongoing improvement by exploring different models, data analysis, training techniques, and hyperparameter configurations. The infrastructure supporting MLOps should include tracking and evaluating both successful and unsuccessful approaches.
  • Even if the model is up and running in production, it still can fail due to changes in the incoming data. This is called a silent model failure, caused by data and concept drift. Therefore, ML infrastructure requires a monitoring system to constantly check the model’s performance and data to prevent this issue.

Now let’s explore the various maturity levels of MLOps infrastructure.

Level 1 – Manual

Manual ML infrastructure. The design is inspired by Google’s blog post. Image by author.

At this level, the data processing, experimentation, and model deployment processes are entirely manual. Microsoft refers to this level as ‘No MLOps’, since the ML lifecycle is difficult to repeat and automate.

The entire workflow relies heavily on skilled data scientists, with some assistance from a data engineer to prepare the data and a software engineer to integrate the model with the product/business processes if needed.

This approach works great in cases like:

  • Early-stage start-ups and proof of concept projects – where the focus is on experimentation and resources are limited. Developing and deploying ML models is the main concern before scaling up the operations.
  • Small-scale ML applications – The manual approach can be sufficient for ML applications with limited scope or a small user base, like a small online fashion store. With minimal data dependencies and real-time requirements, data scientists can manually handle the data processing, experimentation, and deployment processes.
  • Ad hoc ML tasks – In specific scenarios like marketing campaigns, one-time ML tasks or analyses may not require full MLOps implementation.

According to both Google and Microsoft, this approach also faces several limitations, including:

  • Lack of a monitoring system – there’s no visibility into the model’s performance. If it degrades, it will have a negative business impact. There’s also no post-deployment data science to understand the model’s behavior in production.
  • No frequent retrains of production models – no adaptation of the model to the latest trends or patterns.
  • Releases are painful and infrequent – since it’s done manually, releases of the models happen only a couple of times per year.
  • No centralized tracking of model performance – makes it hard to compare different models, reproduce results, or update a model.
  • Limited documentation and no versioning – poses several challenges: the risk of introducing unintended changes to the code, limited ability to roll back to a working version, and lack of repeatability.

Level 2 – Repeatable

Repeatable ML infrastructure with additional source repository and monitoring. Image by author.

Next, we introduce the DevOps aspect to the infrastructure by converting the experiments to the source code and storing them in the source repository using a version control system like Git.

Microsoft suggests changes to the data collection process by adding the following:

  • Data pipeline – allows data to be extracted from different sources and combined. The data is then transformed with cleaning, aggregating, or filtering operations. This makes the infrastructure more scalable, efficient, and accurate than the manual approach.
  • Data catalog – a centralized repository that includes information such as data source, data type, data format, owner, usage, and lineage. It helps to organize, manage, and maintain large volumes of data in a scalable and efficient manner.

To level up the infrastructure, we must bring in some automated testing alongside version control. This means using practices like unit tests, integration tests, or regression tests. These will help us deploy faster and make things more reliable by ensuring our code changes don’t cause errors or bugs.
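
As a small illustration of what such a test might look like, the snippet below checks a hypothetical clean_features preprocessing step with pytest; both the function and the data are made up for the example.

import pandas as pd

def clean_features(df: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical pipeline step: drop duplicate rows and impute missing ages.
    out = df.drop_duplicates()
    return out.assign(age=out["age"].fillna(out["age"].median()))

def test_clean_features_removes_duplicates_and_nulls():
    raw = pd.DataFrame({"age": [25, 25, None], "country": ["PL", "PL", "DE"]})
    cleaned = clean_features(raw)
    assert not cleaned.duplicated().any()
    assert cleaned["age"].notna().all()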

With all those changes in place, we can repeat the data collection and deployment process. However, we still need a proper monitoring system. Microsoft mentions it briefly by saying there’s "limited feedback on how well a model performs in production," but they don’t go into the details about it.

Level 3 – Reproducible

Reproducible ML infrastructure with automated training and orchestrated experiments. Image by author.

There are two key reasons why reproducibility is crucial: troubleshooting and collaboration. Imagine a scenario where the performance of your recently deployed model is deteriorating, resulting in inaccurate predictions. In that case, you need to keep a record of previous versions of the data and model so you can roll back to an earlier version of the model until you find the root cause of the underlying issue.

Moreover, reproducibility makes it easier for different team members to understand what others are doing and build on each other’s work. This collaborative approach and knowledge sharing can lead to faster innovation and better models.

To achieve reproducibility, we have to level up the architecture in four ways:

  • Automated training pipeline – handles the end-to-end process of training models, from data preparation to model evaluation.
  • Metadata store – a database used to track and manage metadata, including data sources, model configurations, hyperparameters, training runs, evaluation metrics, and all experiment data.
  • Model registry – a repository that stores ML models, their versions, and the artifacts necessary for deployment, making it possible to retrieve the exact version when needed (a minimal registration sketch follows this list).
  • Feature store – helps data scientists and machine learning engineers develop, test, and deploy models more efficiently by providing a centralized location for storing, managing, and serving features. It can also be used to track the evolution of features over time and to preprocess and transform features as needed.
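
As a minimal sketch of the metadata store and model registry working together, the snippet below logs a run and registers the resulting model with MLflow. MLflow is an assumption here, not a tool prescribed by Microsoft or Google, and registering a model by name requires a tracking server with registry support.

import mlflow
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

with mlflow.start_run():
    # Metadata store: hyperparameters and evaluation metrics for this run.
    mlflow.log_param("n_estimators", 200)
    model = RandomForestClassifier(n_estimators=200).fit(X_train, y_train)
    mlflow.log_metric("val_accuracy", model.score(X_val, y_val))

    # Model registry: this exact version can be retrieved later, enabling the
    # rollback scenario described above.
    mlflow.sklearn.log_model(
        model, artifact_path="model", registered_model_name="demo-classifier"
    )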

At that stage, a monitoring service is available, offering real-time feedback on the performance of the model. However, apart from confirming it’s there, neither Microsoft nor Google provides any additional information.

Level 4 – Automated

Automated ML infrastructure with CI/CD. Image by author.

This automation level helps data scientists efficiently explore new ideas in feature engineering, model architecture, and hyperparameters by automating the machine learning pipeline, including building, testing, and deployment. To achieve this, Microsoft suggests incorporating two extra components:

  • CI/CD – Continuous Integration (CI) ensures that code changes from different team members are integrated into a shared repository, while Continuous Deployment (CD) automates the deployment of validated code to production environments. This allows for rapid deployment of model updates, improvements, and bug fixes.
  • A/B testing of models – this model validation method involves comparing predictions and user feedback between an existing model and a candidate model to determine the better one (a minimal traffic-splitting sketch follows this list).
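
A minimal sketch of that traffic-splitting idea follows; the models, the 10% share, and the per-request randomization (rather than sticky, user-based assignment) are all illustrative assumptions.

import random

CANDIDATE_TRAFFIC_SHARE = 0.1  # assumed: 10% of requests go to the candidate

def route(request, current_model, candidate_model):
    # Log which variant served the request alongside user feedback, then
    # compare the two models' metrics before promoting the candidate.
    if random.random() < CANDIDATE_TRAFFIC_SHARE:
        return "candidate", candidate_model.predict(request)
    return "current", current_model.predict(request)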

Level 5 – Continuously improved

Continuously improved ML infrastructure with automated retraining. Image by author.

At this stage, the model is automatically retrained based on the trigger from the monitoring system. This process of retraining is also known as continuous learning. The objectives of continuous learning are:

  • Combat sudden data drifts that may occur, ensuring the model remains effective even when faced with unexpected changes in the data.
  • Adapt to rare events such as Black Friday, where patterns and trends in the data may significantly deviate from the norm.
  • Overcome the cold start problem, which arises when the model needs to make predictions for new users lacking historical data.

Push for automation

Microsoft and Google are major players in the cloud computing market, with Azure holding a 22% market share and Google at 10%. They offer a wide range of services, including computing, storage, and development tools, which are essential components for building advanced ML infrastructure.

Like any business, their main goal is to generate revenue by selling these services. This is partially why their blogs emphasize advancement and automation. However, a higher level of maturity doesn’t guarantee better results for your business. The optimal solution is the one that aligns with your company’s specific needs and the right tech stack.

While maturity levels can help you determine your current advancement, they shouldn’t be followed blindly, since Microsoft and Google’s main incentive is to sell their services. A specific example is their push for automated retraining. This process requires a lot of computation, yet it’s often unnecessary or even harmful. Retraining should be done when needed. What’s more important for your infrastructure is having a reliable monitoring system and an effective root cause analysis process.

Monitoring should start from the manual level

A limited monitoring system appears at level 2 in the described maturity levels. In reality, you should monitor your model as soon as business decisions are taken based on its output, regardless of maturity level. It allows you to reduce the risk of failure and see how the model performs regarding your business goals.

The initial step in monitoring can be as simple as comparing the model’s predictions to the actual values. This basic comparison is a baseline assessment of the model’s performance and a good starting point for further analysis when the model is failing. Additionally, it’s important to consider the evaluation of data science efforts, which includes measuring the return on investment (ROI). This means assessing the value that data science techniques and algorithms bring to the table. It’s crucial to understand how effective these efforts are in generating business value.
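
A minimal sketch of that baseline check might look like the snippet below; the data, the baseline value, and the 1.5x alert threshold are illustrative assumptions.

import pandas as pd

# Hypothetical join of logged predictions with ground truth that arrived later.
df = pd.DataFrame({
    "prediction": [110, 95, 130, 80],
    "actual":     [100, 90, 150, 60],
})

production_mae = (df["prediction"] - df["actual"]).abs().mean()
baseline_mae = 12.0  # assumed: error measured on the held-out test set

if production_mae > 1.5 * baseline_mae:
    print(f"ALERT: production MAE {production_mae:.1f} vs baseline {baseline_mae}")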

Evaluating ROI gives you insights and information that can help you make better decisions about allocating resources and planning future investments. As the infrastructure evolves, the monitoring system can become more complex with additional features and capabilities. However, even at the first level of maturity, you should still apply a basic monitoring setup to the infrastructure.

Risks of retraining

In the description of level 5, we listed the benefits of automatic retraining in production. However, before adding it to your infrastructure, you should consider the risks related to it:

1. Retraining on delayed data

In some real-world scenarios, like loan-default prediction, labels may be delayed for months or even years. The ground truth is still coming, but you are retraining your model using the old data, which may not represent the current reality well.

2. Failure to determine the root cause of the problem

If the model’s performance drops, it doesn’t always mean that it needs more data. There could be various reasons for the model’s failure, such as changes in downstream business processes, training-serving skew, or data leakage. You should first investigate to find the underlying issue and then retrain the model if necessary.

3. Higher risk of failure

Retraining amplifies the risk of model failure. Besides the fact that it adds complexity to the infrastructure, the more frequently you update, the more opportunities the model has to fail. Any undetected problem in data collection or preprocessing will be propagated to the model, resulting in a model retrained on flawed data.

4. Higher costs

Retraining is not a cost-free process. It involves expenses related to:

  • Storing and validating the retraining data
  • Compute resources to retrain the model
  • Testing a new model to determine if it performs better than the current one

Summary

ML systems are complex. Building and deploying models in a repeatable and sustainable manner is tough. In this blog post, we have explored five MLOps maturity levels based on Google’s and Microsoft’s industry best practices. We have discussed the evolution from manual deployment to automated infrastructures, highlighting the benefits that each level brings. However, it is crucial to understand that these practices should not be followed blindly. Instead, their adoption should be based on your company’s specific needs and requirements.

The post 5 Levels of MLOps Maturity appeared first on Towards Data Science.

]]>