Containers | Towards Data Science
https://towardsdatascience.com/tag/containers/
The world's leading publication for data science, AI, and ML professionals.

Kubernetes — Understanding and Utilizing Probes Effectively
https://towardsdatascience.com/kubernetes-understanding-and-utilizing-probes-effectively/ (Thu, 06 Mar 2025)
Why proper configuration and implementation of Kubernetes probes is vital for any critical deployment

Introduction

Let’s talk about Kubernetes probes and why they matter in your deployments. When managing production-facing containerized applications, even small optimizations can have enormous benefits.

Aiming to reduce deployment times, make your applications react better to scaling events, and keep running pods healthy requires fine-tuning your container lifecycle management. This is exactly why proper configuration — and implementation — of Kubernetes probes is vital for any critical deployment. They help your cluster make intelligent decisions about traffic routing, restarts, and resource allocation.

Properly configured probes dramatically improve your application's reliability, reduce deployment downtime, and help it handle unexpected errors gracefully. In this article, we'll explore the three types of probes available in Kubernetes and how utilizing them alongside each other helps you configure more resilient systems.

Quick refresher

Understanding exactly what each probe does and some common configuration patterns is essential. Each of them serves a specific purpose in the container lifecycle and when used together, they create a rock-solid framework for maintaining your application availability and performance.

Startup: Optimizing start-up times

Start-up probes are evaluated once, when a new pod is spun up because of a scale-up event or a new deployment. They serve as a gatekeeper for the rest of the container checks, and fine-tuning them will help your applications better handle increased load or service degradation.

Sample Config:

startupProbe:
  httpGet:
    path: /health
    port: 80
  failureThreshold: 30
  periodSeconds: 10

Key takeaways:

  • Keep periodSeconds low, so that the probe fires often, quickly detecting a successful deployment.
  • Increase failureThreshold to a high enough value to accommodate for the worst-case start-up time.

The startup probe checks whether your container has started by querying the configured path. It additionally holds off the liveness and readiness probes until it succeeds.

Liveness: Detecting dead containers

Your liveness probes answer a very simple question: “Is this pod still running properly?” If not, K8s will restart it.

Sample Config:

livenessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 10
  failureThreshold: 3

Key takeaways:

  • Since K8s will completely restart your container and spin up a new one, add a failureThreshold to combat intermittent abnormalities.
  • Avoid using initialDelaySeconds as it is too restrictive — use a Start-up probe instead.

Be mindful that a failing liveness probe will restart your currently running container, so avoid making it too aggressive — that's the job of the next one.

Readiness: Handling unexpected errors

The readiness probe determines whether a pod should start — or continue — to receive traffic. It is extremely useful in situations where your container has lost its connection to the database or is otherwise over-utilized and should not receive new requests.

Sample Config:

readinessProbe:
  httpGet:
    path: /health
    port: 80
  periodSeconds: 3
  failureThreshold: 1
  timeoutSeconds: 1

Key takeaways:

  • Since this is your first guard against sending traffic to unhealthy targets, make the probe aggressive and reduce periodSeconds.
  • Keep failureThreshold at a minimum — you want to fail fast.
  • Keep the timeout period at a minimum as well, so that slow containers are flagged quickly.
  • Give the readiness probe ample time to recover by having a longer-running liveness probe.

Readiness probes ensure that traffic never reaches a container that isn't ready for it, which makes them one of the most important probes in the stack.

Putting it all together

As you can see, even if all of the probes have their own distinct uses, the best way to improve your application’s resilience strategy is using them alongside each other.

Your startup probe will assist you in scale-up scenarios and new deployments, allowing your containers to be brought up quickly. It runs only during start-up and also holds off the rest of the probes until it successfully completes.

The liveness probe helps in dealing with dead containers suffering from non-recoverable errors and tells the cluster to bring up a new, fresh pod just for you.

The readiness probe is the one telling K8s whether a pod should receive traffic. It can be extremely useful for dealing with intermittent errors or high resource consumption resulting in slower response times.
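
As a rough sketch of how the three probes look side by side in a container spec (the image name is a placeholder, and the /health path and port are carried over from the samples above):

containers:
  - name: app
    image: my-app:latest   # placeholder image
    startupProbe:
      httpGet:
        path: /health
        port: 80
      failureThreshold: 30   # tolerate the worst-case start-up time
      periodSeconds: 10
    livenessProbe:
      httpGet:
        path: /health
        port: 80
      periodSeconds: 10
      failureThreshold: 3    # forgiving, since a failure triggers a restart
    readinessProbe:
      httpGet:
        path: /health
        port: 80
      periodSeconds: 3       # aggressive, to stop traffic quickly
      failureThreshold: 1
      timeoutSeconds: 1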

Additional configurations

Probes can be further configured to use a command in their checks instead of an HTTP request, as well as giving ample time for the container to safely terminate. While these are useful in more specific scenarios, understanding how you can extend your deployment configuration can be beneficial, so I’d recommend doing some additional reading if your containers handle unique use cases.
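
For instance, a liveness check that runs a command inside the container instead of an HTTP request could look like this minimal sketch (the file path is purely illustrative):

livenessProbe:
  exec:
    command:
      - cat
      - /tmp/healthy
  periodSeconds: 10
  failureThreshold: 3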

Further reading:
Liveness, Readiness, and Startup Probes
Configure Liveness, Readiness and Startup Probes

Why Data Scientists Should Care about Containers — and Stand Out with This Knowledge
https://towardsdatascience.com/why-data-scientists-should-care-about-containers-and-stand-out-with-this-knowledge/ (Thu, 20 Feb 2025)
“I train models, analyze data and create dashboards — why should I care about Containers?”

Many people who are new to the world of data science ask themselves this question. But imagine you have trained a model that runs perfectly on your laptop. However, error messages keep popping up in the cloud when others access it — for example because they are using different library versions.

This is where containers come into play: They allow us to make machine learning models, data pipelines and development environments stable, portable and scalable — regardless of where they are executed.

Let’s take a closer look.

Table of Contents
1 — Containers vs. Virtual Machines: Why containers are more flexible than VMs
2 — Containers & Data Science: Do I really need Containers? And 4 reasons why the answer is yes.
3 — First Practice, then Theory: Container creation even without much prior knowledge
4 — Your 101 Cheatsheet: The most important Docker commands & concepts at a glance
Final Thoughts: Key takeaways as a data scientist
Where Can You Continue Learning?

1 — Containers vs. Virtual Machines: Why containers are more flexible than VMs

Containers are lightweight, isolated environments. They contain applications with all their dependencies. They also share the kernel of the host operating system, making them fast, portable and resource-efficient.

I have written extensively about virtual machines (VMs) and virtualization in ‘Virtualization & Containers for Data Science Newbies’. But the most important thing is that VMs simulate complete computers and have their own operating system with their own kernel, running on a hypervisor. This means that they require more resources, but also offer greater isolation.

Both containers and VMs are virtualization technologies.

Both make it possible to run applications in an isolated environment.

But from these two descriptions, you can also see the three most important differences:

  • Architecture: While each VM has its own operating system (OS) and runs on a hypervisor, containers share the kernel of the host operating system. However, containers still run in isolation from each other. A hypervisor is the software or firmware layer that manages VMs and abstracts the operating system of the VMs from the physical hardware. This makes it possible to run multiple VMs on a single physical server.
  • Resource consumption: As each VM contains a complete OS, it requires a lot of memory and CPU. Containers, on the other hand, are more lightweight because they share the host OS.
  • Portability: You have to customize a VM for different environments because it requires its own operating system with specific drivers and configurations that depend on the underlying hardware. A container, on the other hand, can be created once and runs anywhere a container runtime is available (Linux, Windows, cloud, on-premise). Container runtime is the software that creates, starts and manages containers — the best-known example is Docker.

You can experiment faster with Docker — whether you’re testing a new ML model or setting up a data pipeline. You can package everything in a container and run it immediately. And you don’t have any “it works on my machine” problems. Your container runs the same everywhere — so you can simply share it.

2 — Containers & Data Science: Do I really need Containers? And 4 reasons why the answer is yes.

As a data scientist, your main task is to analyze, process and model data to gain valuable insights and predictions, which in turn are important for management.

Of course, you don’t need to have the same in-depth knowledge of containers, Docker or Kubernetes as a DevOps Engineer or a Site Reliability Engineer (SRE). Nevertheless, it is worth having container knowledge at a basic level — here are 4 examples of where you will come into contact with it sooner or later:

Model deployment

You are training a model. You not only want to use it locally but also make it available to others. To do this, you can pack it into a container and make it available via a REST API.

Let’s look at a concrete example: Your trained model runs in a Docker container with FastAPI or Flask. The server receives the requests, processes the data and returns ML predictions in real-time.
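
As a minimal sketch of what the serving code inside such a container could look like (the model file name, feature shape and endpoint path below are assumptions for illustration, not details from this article):

# app.py - runs inside the container, e.g. with `uvicorn app:app --port 8080`
from typing import List

import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical pre-trained model artifact

class Features(BaseModel):
    values: List[float]

@app.post("/predict")
def predict(features: Features):
    # Return the model's prediction for a single row of features
    prediction = model.predict([features.values])
    return {"prediction": prediction.tolist()}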

Reproducibility and easier collaboration

ML models and pipelines require specific libraries. For example, if you want to use a deep learning model like a Transformer, you need TensorFlow or PyTorch. If you want to train and evaluate classic machine learning models, you need Scikit-Learn, NumPy and Pandas. A Docker container now ensures that your code runs with exactly the same dependencies on every computer, server or in the cloud. You can also deploy a Jupyter Notebook environment as a container so that other people can access it and use exactly the same packages and settings.

Cloud integration

Containers include all packages, dependencies and configurations that an application requires. They therefore run uniformly on local computers, servers or cloud environments. This means you don’t have to reconfigure the environment.

For example, you write a data pipeline script. This works locally for you. As soon as you deploy it as a container, you can be sure that it will run in exactly the same way on AWS, Azure, GCP or the IBM Cloud.

Scaling with Kubernetes

Kubernetes helps you to orchestrate containers. But more on that below. If you now get a lot of requests for your ML model, you can scale it automatically with Kubernetes. This means that more instances of the container are started.
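
As an illustration only (the resource names and thresholds below are made up), a HorizontalPodAutoscaler that starts more container instances when CPU usage rises could look like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model        # assumes a Deployment serving the model exists
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70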

3 — First Practice, then Theory: Container creation even without much prior knowledge

Let’s take a look at an example that anyone can run through with minimal time — even if you haven’t heard much about Docker and containers. It took me 30 minutes.

We’ll set up a Jupyter Notebook inside a Docker container, creating a portable, reproducible Data Science environment. Once it’s up and running, we can easily share it with others and ensure that everyone works with the exact same setup.

0 — Install Docker Desktop and create a project directory

To be able to use containers, we need Docker Desktop. To do this, we download Docker Desktop from the official website.

Now we create a new folder for the project. You can do this directly in the desired location. I do this via the terminal — on Windows, press Windows + R and open CMD.

We use the following command:

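The original shows the command only as a screenshot; given the folder name used in the next steps, it is presumably something like:

mkdir jupyter-docker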

1. Create a Dockerfile

Now we open VS Code or another editor and create a new file with the name ‘Dockerfile’. We save this file without an extension in the same directory. It doesn’t need an extension because Docker looks for a file named exactly ‘Dockerfile’ by default.

We add the following code to this file:

# Use the official Jupyter notebook image with SciPy
FROM jupyter/scipy-notebook:latest  

# Set the working directory inside the container
WORKDIR /home/jovyan/work  

# Copy all local files into the container
COPY . .

# Start Jupyter Notebook without token
CMD ["start-notebook.sh", "--NotebookApp.token=''"]

We have thus defined a container environment for Jupyter Notebook that is based on the official Jupyter SciPy Notebook image.

First, with FROM we define which base image the container is built on. jupyter/scipy-notebook:latest is a preconfigured Jupyter notebook image and contains libraries such as NumPy, SciPy, Matplotlib and Pandas. Alternatively, we could also use a different image here.

With WORKDIR we set the working directory within the container. /home/jovyan/work is the default path used by Jupyter. User jovyan is the default user in Jupyter Docker images. Another directory could also be selected — but this directory is best practice for Jupyter containers.

With COPY . . we copy all files from the local directory — in this case the Dockerfile, which is located in the jupyter-docker directory — to the working directory /home/jovyan/work in the container.

With CMD ["start-notebook.sh", "--NotebookApp.token=''"] we specify the default start command for the container: it runs the Jupyter Notebook start script and starts the notebook without a token — this allows us to access it directly via the browser.

2. Create the Docker image

Next, we will build the Docker image. Make sure you have the previously installed Docker desktop open. We now go back to the terminal and use the following command:

cd jupyter-docker
docker build -t my-jupyter .

With cd jupyter-docker we navigate to the folder we created earlier. With docker build we create a Docker image from the Dockerfile. With -t my-jupyter we give the image a name. The dot is the build context: it tells Docker to build the image from the contents of the current directory. Note the space between the image name and the dot.

The Docker image is the template for the container. This image contains everything needed for the application such as the operating system base (e.g. Ubuntu, Python, Jupyter), dependencies such as Pandas, Numpy, Jupyter Notebook, the application code and the startup commands. When we “build” a Docker image, this means that Docker reads the Dockerfile and executes the steps that we have defined there. The container can then be started from this template (Docker image).

We can now watch the Docker image being built in the terminal.


We use docker images to check whether the image exists. If the output my-jupyter appears, the creation was successful.

docker images

If yes, we see the data for the created Docker image.

3. Start Jupyter container

Next, we want to start the container and use this command to do so:

docker run -p 8888:8888 my-jupyter

We start a container with docker run. First, we enter the name of the image from which we want to start a container. And with -p 8888:8888 we connect the local port (8888) with the port in the container (8888) — the port Jupyter runs on.

Alternatively, you can also perform this step in Docker Desktop.

4. Open Jupyter Notebook & create a test notebook

Now we open the URL [http://localhost:8888](http://localhost:8888/) in the browser. You should now see the Jupyter Notebook interface.

Here we will now create a Python 3 notebook and insert the following Python code into it.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 100)
y = np.sin(x)

plt.plot(x, y)
plt.title("Sine Wave")
plt.show()

Running the code will display the sine curve.

5. Terminate the container

At the end, we end the container either with ‘CTRL + C’ in the terminal or in Docker Desktop.

With docker ps we can check in the terminal whether containers are still running, and with docker ps -a we can display the container that has just been terminated.

6. Share your Docker image

If you now want to upload your Docker image to a registry, you can do this with the following command. This will upload your image to Docker Hub (you need a Docker Hub account for this). You can also upload it to a private registry of AWS Elastic Container, Google Container, Azure Container or IBM Cloud Container.

docker login

docker tag my-jupyter your-dockerhub-name/my-jupyter:latest

docker push your-dockerhub-name/my-jupyter:latest

If you then open Docker Hub and go to your repositories in your profile, the image should be visible.

This was a very simple example to get started with Docker. If you want to dive a little deeper, you can deploy a trained ML model with FastAPI via a container.

4 — Your 101 Cheatsheet: The most important Docker commands & concepts at a glance

You can actually think of a container like a shipping container. Regardless of whether you load it onto a ship (local computer), a truck (cloud server) or a train (data center) — the content always remains the same.

The most important Docker terms

  • Container: Lightweight, isolated environment for applications that contains all dependencies.
  • Docker: The most popular container platform that allows you to create and manage containers.
  • Docker Image: A read-only template that contains code, dependencies and system libraries.
  • Dockerfile: Text file with commands to create a Docker image.
  • Kubernetes: Orchestration tool to manage many containers automatically.

The basic concepts behind containers

  • Isolation: Each container contains its own processes, libraries and dependencies
  • Portability: Containers run wherever a container runtime is installed.
  • Reproducibility: You can create a container once and it runs exactly the same everywhere.

The most basic Docker commands

docker --version # Check if Docker is installed
docker ps # Show running containers
docker ps -a # Show all containers (including stopped ones)
docker images # List of all available images
docker info # Show system information about the Docker installation

docker run hello-world # Start a test container
docker run -d -p 8080:80 nginx # Start Nginx in the background (-d) with port forwarding
docker run -it ubuntu bash # Start interactive Ubuntu container with bash

docker pull ubuntu # Load an image from Docker Hub
docker build -t my-app . # Build an image from a Dockerfile

Final Thoughts: Key takeaways as a data scientist

👉 With Containers you can solve the “It works on my machine” problem. Containers ensure that ML models, data pipelines, and environments run identically everywhere, independent of OS or dependencies.

👉 Containers are more lightweight and flexible than virtual machines. While VMs come with their own operating system and consume more resources, containers share the host operating system and start faster.

👉 There are three key steps when working with containers: Create a Dockerfile to define the environment, use docker build to create an image, and run it with docker run — optionally pushing it to a registry with docker push.

And then there’s Kubernetes — a term that comes up a lot in this context. It’s an orchestration tool that automates container management, ensuring scalability, load balancing and fault recovery. This is particularly useful for microservices and cloud applications.

Before Docker, VMs were the go-to solution (see more in ‘Virtualization & Containers for Data Science Newbies’). VMs offer strong isolation, but require more resources and start slower.

So, Docker was developed in 2013 by Solomon Hykes to solve this problem. Instead of virtualizing entire operating systems, containers run independently of the environment — whether on your laptop, a server or in the cloud. They contain all the necessary dependencies so that they work consistently everywhere.

I simplify tech for curious minds🚀 If you enjoy my tech insights on Python, data science, Data Engineering, machine learning and AI, consider subscribing to my substack.

Where Can You Continue Learning?

Virtualization & Containers for Data Science Newbies
https://towardsdatascience.com/virtualization-containers-for-data-science-newbies/ (Wed, 12 Feb 2025)
Virtualization makes it possible to run multiple virtual machines (VMs) on a single piece of physical hardware. These VMs behave like independent computers, but share the same physical computing power. A computer within a computer, so to speak.

Many cloud services rely on virtualization. But other technologies, such as containerization and serverless computing, have become increasingly important.

Without virtualization, many of the digital services we use every day would not be possible. Of course, this is a simplification, as some cloud services also use bare-metal infrastructures.

In this article, you will learn how to set up your own virtual machine on your laptop in just a few minutes — even if you have never heard of Cloud Computing or containers before.

Table of Contents
1 — The Origins of Cloud Computing: From Mainframes to Serverless Architecture
2 — Understanding Virtualization: Why it’s the Basis of Cloud Computing
3 — What Data Scientists Should Know about Containers and VMs
4 — Create a Virtual Machine with VirtualBox
Final Thoughts
Where can you continue learning?

1 — The Origins of Cloud Computing: From Mainframes to Serverless Architecture

Cloud computing has fundamentally changed the IT landscape — but its roots go back much further than many people think. In fact, the history of the cloud began back in the 1950s with huge mainframes and so-called dumb terminals.

  • The era of mainframes in the 1950s: Companies used mainframes so that several users could access them simultaneously via dumb terminals. The central mainframes were designed for high-volume, business-critical data processing. Large companies still use them today, even if cloud services have reduced their relevance.
  • Time-sharing and virtualization: In the next decade (1960s), time-sharing made it possible for multiple users to access the same computing power simultaneously — an early model of today’s cloud. Around the same time, IBM pioneered virtualization, allowing multiple virtual machines to run on a single piece of hardware.
  • The birth of the internet and web-based applications in the 1990s: Six years before I was born, Tim Berners-Lee developed the World Wide Web, which revolutionized online communication and our entire working and living environment. Can you imagine our lives today without the internet? At the same time, PCs were becoming increasingly popular. In 1999, Salesforce revolutionized the software industry with Software as a Service (SaaS), allowing businesses to use CRM solutions over the internet without local installations.
  • The big breakthrough of cloud computing in the 2000s and 2010s:
    The modern cloud era began in 2006 with Amazon Web Services (AWS): Companies were able to flexibly rent infrastructure with S3 (storage) and EC2 (virtual servers) instead of buying their own servers. Microsoft Azure and Google Cloud followed with PaaS and IaaS services.
  • The modern cloud-native era: This was followed by the next innovation with containerization. Docker made Containers popular in 2013, followed by Kubernetes in 2014 to simplify the orchestration of containers. Next came serverless computing with AWS Lambda and Google Cloud Functions, which enabled developers to write code that automatically responds to events. The infrastructure is fully managed by the cloud provider.

Cloud computing is more the result of decades of innovation than a single new technology. From time-sharing to virtualization to serverless architectures, the IT landscape has continuously evolved. Today, cloud computing is the foundation for streaming services like Netflix, AI applications like ChatGPT and global platforms like Salesforce.

2 — Understanding Virtualization: Why Virtualization is the Basis of Cloud Computing

Virtualization means abstracting physical hardware, such as servers, storage or networks, into multiple virtual instances.

Several independent systems can be operated on the same physical infrastructure. Instead of dedicating an entire server to a single application, virtualization enables multiple workloads to share resources efficiently. For example, Windows, Linux or another environment can be run simultaneously on a single laptop — each in an isolated virtual machine.

This saves costs and resources.

Even more important, however, is the scalability: Infrastructure can be flexibly adapted to changing requirements.

Before cloud computing became widely available, companies often had to maintain dedicated servers for different applications, leading to high infrastructure costs and limited scalability. If more performance was suddenly required, for example because webshop traffic increased, new hardware was needed. The company had to add more servers (horizontal scaling) or upgrade existing ones (vertical scaling).

This is different with virtualization: For example, I can simply upgrade my virtual Linux machine from 8 GB to 16 GB RAM or assign 4 cores instead of 2. Of course, only if the underlying infrastructure supports this. More on this later.

And this is exactly what cloud computing makes possible: The cloud consists of huge data centers that use virtualization to provide flexible computing power — exactly when it is needed. So, virtualization is a fundamental technology behind cloud computing.

How does serverless computing work?

What if you didn’t even have to manage virtual machines anymore?

Serverless computing goes one step further than virtualization and containerization. The cloud provider handles most infrastructure tasks — including scaling, maintenance and resource allocation — so developers can focus on writing and deploying code.

But does serverless really mean that there are no more servers?

Of course not. The servers are there, but they are invisible for the user. Developers no longer have to worry about them. Instead of manually provisioning a virtual machine or container, you simply deploy your code, and the cloud automatically executes it in a managed environment. Resources are only provided when the code is running. For example, you can use AWS Lambda, Google Cloud Functions or Azure Functions.

What are the advantages of serverless?

As a developer, you don’t have to worry about scaling or maintenance. This means that if there is a lot more traffic at a particular event, the resources are automatically adjusted. Serverless computing can be cost-efficient, especially in Function-as-a-Service (FaaS) models. If nothing is running, you pay nothing. However, some serverless services have baseline costs (e.g. Firestore).

Are there any disadvantages?

You have much less control over the infrastructure and no direct access to the servers. There is also a risk of vendor lock-in. The applications are strongly tied to a cloud provider.

A concrete example of serverless: API without your own server

Imagine you have a website with an API that provides users with the current weather. Normally, a server runs around the clock — even at times when no one is using the API.

With AWS Lambda, things work differently: A user enters ‘Mexico City’ on your website and clicks on ‘Get weather’. This request triggers a Lambda function in the background, which retrieves the weather data and sends it back. The function is then stopped automatically. This means you don’t have a permanently running server and no unnecessary costs — you only pay when the code is executed.
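
A minimal sketch of such a function is shown below; the handler fakes the weather lookup, the event shape assumes an API Gateway trigger, and none of this is code from the article:

import json

def lambda_handler(event, context):
    # Read the requested city from the query string (API Gateway proxy event)
    city = (event.get("queryStringParameters") or {}).get("city", "Mexico City")

    # In a real function you would call a weather API here
    weather = {"city": city, "temperature_c": 22, "condition": "sunny"}

    return {
        "statusCode": 200,
        "body": json.dumps(weather),
    }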

3 — What Data Scientists Should Know about Containers and VMs — What’s the Difference?

You’ve probably heard of containers. But what is the difference to virtual machines — and what is particularly relevant as a data scientist?

Both containers and virtual machines are virtualization technologies.

Both make it possible to run applications in isolation.

Both offer advantages depending on the use case: While VMs provide strong security, containers excel in speed and efficiency.

The main difference lies in the architecture:

  • Virtual machines virtualize the entire hardware — including the operating system. Each VM has its own operating system (OS). This in turn requires more memory and resources.
  • Containers, on the other hand, share the host operating system and only virtualize the application layer. This makes them significantly lighter and faster.

Put simply, virtual machines simulate entire computers, while containers only encapsulate applications.

Why is this important for data scientists?

Since as a data scientist you will come into contact with machine learning, data engineering or data pipelines, it is also important to understand something about containers and virtual machines. Sure, you don’t need to have in-depth knowledge of it like a DevOps Engineer or a Site Reliability Engineer (SRE).

Virtual machines are used in data science, for example, when a complete operating system environment is required — such as a Windows VM on a Linux host. Data science projects often need specific environments. With a VM, it is possible to provide exactly the same environment — regardless of which host system is available.

A VM is also needed when training deep learning models with GPUs in the cloud. With cloud VMs such as AWS EC2 or Azure Virtual Machines, you have the option of training the models with GPUs. VMs also completely separate different workloads from each other to ensure performance and security.

Containers are used in data science for data pipelines, for example, where tools such as Apache Airflow run individual processing steps in Docker containers. This means that each step can be executed in isolation and independently of each other — regardless of whether it involves loading, transforming or saving data. Even if you want to deploy machine learning models via Flask / FastAPI, a container ensures that everything your model needs (e.g. Python libraries, framework versions) runs exactly as it should. This makes it super easy to deploy the model on a server or in the cloud.

4 — Create a Virtual Machine with VirtualBox

Let’s make this a little more concrete and create an Ubuntu VM. 🚀

I use the VirtualBox software with my Windows Lenovo laptop. The virtual machine runs in isolation from your main operating system so that no changes are made to your actual system. If you have Windows Pro Edition, you can also enable Hyper-V (pre-installed by default, but disabled). With an Intel Mac, you should also be able to use VirtualBox. With Apple Silicon, Parallels Desktop or UTM is apparently the better alternative (I haven’t tested this myself).

1) Install VirtualBox

The first step is to download the installation file from the official VirtualBox website and install VirtualBox. VirtualBox is installed including all necessary drivers.

You can ignore the note about missing dependencies Python Core / win32api as long as you do not want to automate VirtualBox with Python scripts.

Then we start the Oracle VirtualBox Manager.

2) Download the Ubuntu ISO file

Next, we download the Ubuntu ISO file from the Ubuntu website. An Ubuntu ISO file is a compressed image file of the Ubuntu operating system. This means that it contains a complete copy of the installation data. I download the LTS version because this version receives security and maintenance updates for 5 years (Long Term Support). Note the location of the .iso file, as we will use it later in VirtualBox.


3) Create a virtual machine in VirtualBox

Next, we create a new virtual machine in the VirtualBox Manager and give it the name Ubuntu VM 2025. Here we select Linux as the type and Ubuntu (64-bit) as the version. We also select the previously downloaded ISO file from Ubuntu as the ISO image. It would also be possible to add the ISO file later in the mass storage menu.


Next, we select a user name vboxuser2025 and a password for access to the Ubuntu system. The hostname is the name of the virtual machine within the network or system. It must not contain any spaces. The domain name is optional and would be used if the network has multiple devices.

We then assign the appropriate resources to the virtual machine. I choose 8 GB (8192 MB) RAM, as my host system has 64 GB RAM. I recommend 4GB (4096) as a minimum. I assign 2 processors, as my host system has 8 cores and 16 logical processors. It would also be possible to assign 4 cores, but this way I have enough resources for my host system. You can find out how many cores your host system has by opening the Task Manager in Windows and looking at the number of cores under the Performance tab under CPU.


Next, we click on ‘Create a virtual hard disk now’ to create a virtual hard disk. A VM requires its own virtual hard disk to install the OS (e.g. Ubuntu, Windows). All programs, files and configurations of the VM are stored on it — just like on a physical hard disk. The default value is 25 GB. If you want to use a VM for machine learning or data science, more storage space (e.g. 50–100 GB) would be useful to have room for large data sets and models. I keep the default setting.

We can then see that the virtual machine has been created and can be used.

4) Use Ubuntu VM

We can now use the newly created virtual machine like a normal separate operating system. The VM is completely isolated from the host system. This means you can experiment in it without changing or jeopardizing your main system.

If you are new to Linux, you can try out basic commands like ls, cd, mkdir or sudo to get to know the terminal. As a data scientist, you can set up your own development environments, install Python with Pandas and Scikit-learn to develop data analysis and machine learning models. Or you can install PostgreSQL and run SQL queries without having to set up a local database on your main system. You can also use Docker to create containerized applications.

Final Thoughts

Since the VM is isolated, we can install programs, experiment and even destroy the system without affecting the host system.

Let’s see if virtual machines remain relevant in the coming years. As companies increasingly use microservice architectures (instead of monoliths), containers with Docker and Kubernetes will certainly become even more important. But knowing how to set up a virtual machine and what it is used for is certainly useful.

I simplify tech for curious minds. If you enjoy my tech insights on Python, data science, data engineering, machine learning and AI, consider subscribing to my substack.

Where Can You Continue Learning?

The Fallacy of Complacent Distroless Containers
https://towardsdatascience.com/the-fallacy-of-complacent-distroless-containers-8b09bd3ad55a/ (Thu, 02 Jan 2025)
Making containers smaller is the most popular practice when reducing your attack surface. But how real is this sense of security?

Building Docker images is an easy and accessible practice; however, perfecting them is still an art that is challenging to master. In pursuit of the smallest, most secure and yet functional container images, developers find themselves faced with distroless practices that usually involve complex tooling, deep distro knowledge and error-prone trimming strategies. In fact, such practices often neglect the use of package managers, contributing to a security abyss, as most vulnerability scanners rely on package manager metadata to detect the software components within the container image.

Building container images

When you build a container image, you’re packaging your application, together with its dependencies, in a portable software unit that can later be deployed in isolation, without the need to virtualize an entire operating system.

Building container images is actually a very accessible practice nowadays. There’s an abundance of tools (e.g. Docker, Rockcraft, Buildah…) specifically for that purpose.

But, in the process of packing your application and everything it needs in order to run, could you possibly be adding more than what’s needed?

Most of the time, the answer is yes!

Here’s a very simple Dockerfile:

FROM ubuntu:24.04

RUN apt update && apt install -y --no-install-recommends nginx \
  && rm -rf /var/lib/apt/lists/*

ENTRYPOINT ["nginx"]
CMD ["-g", "daemon off;"]

In this example, we’re packing Nginx on top of an Ubuntu 24.04 image. But,

  • ubuntu:24.04 will be in our final image. Do we actually need it? Most likely not. With it, a bunch of unnecessary software (e.g. utilities like apt) will be kept and thus increase the image’s attack surface;
  • even though we were careful not to install recommendations and clean the apt lists, we still installed the whole Nginx package and all its dependencies. Do we need all that? This is a trickier one to answer as it depends a lot on the use case, but we surely know we don’t want things like Nginx’s man pages, for example.

Distroless containers

"Distroless" images contain only your application and its runtime dependencies.

They do not contain the typical additional libraries or utilities from a Linux distribution.

This has been the most advocated practice in the space of container security for the past 7 years. And although conceptually right, what’s the cost of building these smaller and "more secure(?)" distroless Containers?

  • Easy to build? Not really. It can be a hard craft to master as you may need to use specialized tooling and require deep distro knowledge to effectively "remove the distro".
  • Error-prone? Yes. Some of the most advocated strategies for building distroless images involve following a "top-down" approach – i.e. bloating a base container with your application, and then manually cherry-picking the desired contents into a "scratch" environment.

Correlation is not causation

It’s not because your container is smaller, that it will necessarily be more secure! In fact, the making of distroless containers is prone to the creation of blind spots.

A 2022 Rezilion report by Yotam Perkal tested the reliability and consistency of different vulnerability scanners by scanning 20 of the most popular container images and comparing the resulting vulnerability reports. Besides the abundance of HIGH and CRITICAL misidentifications, the report also shows an 82% average precision from these tools, with a significant portion of the results comprising both False Positives and False Negatives.

To be honest, I’m ok with False Positives – it’s like being told you’re sick, when in reality it was just an examination error – it’s scary, but not truly dangerous.

False Negatives, on the other hand, are much worse! It’s like having a problem you’re not aware of – a blind spot!

The main cause for security blind spots

One of the main reasons why vulnerability scanners are unable to detect certain vulnerabilities is because most of them rely on package metadata and are thus unable to detect software components not managed by package managers.

Don’t believe me? Let me show you.

For demonstration purposes, let’s just take a popular and vulnerable Docker image from Docker Hub, and a popular vulnerability scanner.

Let’s say:

  • Trivy as the scanner, and
  • [ubuntu:lunar](https://hub.docker.com/layers/library/ubuntu/lunar/images/sha256-ea1285dffce8a938ef356908d1be741da594310c8dced79b870d66808cb12b0f) as the Docker image.

At the time of writing this, the chosen Docker image is already EOL, and vulnerable. According to Trivy, this image has a total of 11 CVEs:

$ trivy image ubuntu:lunar
...
ubuntu:lunar (ubuntu 23.04)

Total: 11 (UNKNOWN: 0, LOW: 2, MEDIUM: 9, HIGH: 0, CRITICAL: 0)

BUT, this is a Debian-based container image, so what does Trivy say if we delete the image’s package metadata? Let’s see…

$ echo '''
FROM ubuntu:lunar

# Whiteout the dpkg status file
RUN rm /var/lib/dpkg/status
''' | docker build -t ubuntu:lunar-tampered -

Drumroll please… 🥁

$ trivy image ubuntu:lunar-tampered
...
ubuntu:lunar-tampered (ubuntu 23.04)

Total: 0 (UNKNOWN: 0, LOW: 0, MEDIUM: 0, HIGH: 0, CRITICAL: 0)

Zero, zip, zilch, nada…no vulnerabilities! Or so it looks. But we know there are still 11 CVEs. We just deleted the package metadata Trivy relies on to perform the scan.


What now?

Vulnerability scanners behave differently and may rely on information within the container image itself in order to produce accurate reports!

So here’s a checklist you can use to ensure the containers you build and consume are not carrying hidden vulnerabilities:

  1. it’s not because it’s small and Distroless that the container you’re planning to use is secure.
  2. look beyond the vulnerability scanner. As we saw in the example above, a single missing file can cause the scanners to fail to identify CVEs. So don’t turn a blind eye on this! Yes, use scanners! But also try looking around for hints that blind spots may exist. How?

    • some scanners, like Trivy, will actually issue a warning when the files they rely on (like dpkg/status above) are missing. E.g. Trivy will say: "WARN No OS package is detected. Make sure you haven't deleted any files that contain information about the installed packages. WARN e.g. files under '/lib/apk/db/', '/var/lib/dpkg/' and '/var/lib/rpm'"
    • some of these tools, Trivy included, can also produce SBOMs. This is a more user-friendly way of double-checking the image's software components. So try to produce an SBOM (e.g. trivy image --format spdx-json --output result.json <yourImage>). Is this SBOM empty? Is it missing components that you'd expect to see in that image? If so, then the vulnerability scanner will very likely also fail to produce an accurate report.
  3. vulnerability scanners vary, so don’t rely just on one scanner. Try choosing the ones with better support for the type of software ecosystem that is packed inside the image you want to use.
  4. avoid "dead drops" when building your container. I.e. cherry-picking the minimum set of files you need to make your application work might sound appealing, but you may unintentionally be leaving out the necessary metadata for the scanners to work properly.
  5. related to the above, favor the use of package managers. Yes, some of them aren't really well adjusted to work with minimal containers, but as we saw above, some of the metadata they produce is critical for a proper security scan. A quick sanity check is sketched right after this list.
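
As a quick illustration of that last point (this check is not from the original checklist), one cheap way to confirm the package database is still present before trusting a clean scan:

# List installed packages via dpkg; an empty or missing database is a red flag
docker run --rm ubuntu:lunar dpkg-query -W | head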

In a coming article, I’ll explore a few techniques for building minimal container images, from the typical multi-stage approach to the distroless-friendly tools like Chisel.

Debugging SageMaker Endpoints With Docker
https://towardsdatascience.com/debugging-sagemaker-endpoints-with-docker-7a703fae3a26/ (Fri, 16 Jun 2023)
An Alternative to SageMaker Local Mode
A pain point with getting started with SageMaker Real-Time Inference is that it is hard to debug at times. When creating an endpoint there are a number of ingredients you need to make sure are baked properly for successful deployment.

  • Proper file structuring of model artifacts depending on the Model Server and Container that you are utilizing. Essentially the model.tar.gz you provide must be in a format that is compliant with the Model Server.
  • If you have a custom inference script that implements pre- and post-processing for your model, you need to ensure that the handlers implemented are compliant with your model server and that there are no scripting errors at the code level either.

Previously we have discussed SageMaker Local Mode, but at the time of this article Local Mode does not support all hosting options and model servers that are available for SageMaker Deployment.

To overcome this limitation we take a look at using Docker with a sample model and how we can test/debug our model artifacts and inference script prior to SageMaker Deployment. In this specific example we will utilize the BART Model that I have covered in my last article and see how we can host it with Docker.

NOTE: For those of you new to AWS, make sure you make an account at the following link if you want to follow along. The article also assumes an intermediate understanding of SageMaker Deployment, I would suggest following this article for understanding Deployment/Inference more in depth. An intermediate level of understanding of Docker will also be helpful to fully understand this example.

How Does SageMaker Hosting Work?

Before we can get to the code portion of this article, let’s take a look at how SageMaker actually serves requests. At its core, SageMaker Inference has two constructs:

  • Container: This establishes the runtime environment for the model, it is also integrated with the model server that you are utilizing. You can either utilize one of the existing Deep Learning Containers (DLCs) or Build Your Own Container.
  • Model Artifacts: In the CreateModel API call we specify an S3 URL with the model data present in the format of a model.tar.gz (tarball). This model data is loaded into the opt/ml/model directory on the container, this also includes any inference script that you provide.

The key here is that the container needs a web server implemented that responds to port 8080 on the /invocations and /ping paths. An example of a web server we have implemented with these paths is Flask during a Bring Your Own Container example.

With Docker what we will do is expose this port and point towards our local script and model artifacts, this way we simulate the way a SageMaker Endpoint is expected to behave.

Testing With Docker

For simplicity’s sake we will utilize my BART example from my last article, you can grab the artifacts from this repository. Here you should see the following files:

  • model.py: This is the inference script that we are working with. In this case we are utilizing DJL Serving which expects a model.py with a handler function implementing inference. Your inference script still needs to be compatible with the format that the model server expects.
  • requirements.txt: Any additional dependencies that your model.py script requires. For DJL Serving PyTorch is already installed beforehand, we use numpy for data processing.
  • serving.properties: This is also a DJL specific file, here you can define any configuration at the model level (ex: workers per model)

We have our model artifacts, now we need the container that we will be utilizing. In this case we can retrieve the existing DJL DeepSpeed image. For an extensive list of the images that are already provided by AWS please reference this guide. You can also build your own image locally and point towards that. In this case we are operating in a SageMaker Classic Notebook Instance environment which comes with Docker pre-installed as well.

To work with existing AWS provided images we first need to login to AWS Elastic Container Registry (ECR) to retrieve the image, you can do that with the following shell command.

$(aws ecr get-login --region us-east-1 --no-include-email --registry-ids 763104351884)

You should see a login succeeded message similar to the following.

Login Succeeded (Screenshot by Author)

Once logged in we can get to the path where our model artifacts are stored and run the following command which will launch the model server. If you have not already retrieved the image, this will also be pulled from ECR.

docker run \
-v /home/ec2-user/SageMaker:/opt/ml/model \
--cpu-shares 512 \
-p 8080:8080 \
763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.0-cu117 \
serve

A few key points here:

  • We are exposing port 8080 as SageMaker Inference expects.
  • We also point towards the existing image. This string is dependent on the region and model you are operating in. You can also utilize the SageMaker Python SDK's image_uris.retrieve API call to identify the appropriate image to pull here.
Image being retrieved (Screenshot by Author)

After the image has been pulled you see that the model server has been started.

DJL Server Started (Screenshot by Author)

We can also verify this container is running by utilizing the following Docker command.

docker container ls
Container Started (Screenshot by Author)

We see that the API is exposed via port 8080 which we can send sample requests to via curl. Notice we specify the /invocations path that SageMaker Containers expect.

curl -X POST http://localhost:8080/invocations -H "Content-type: text/plain" \
 -d "This is a sample test string"

We then see inference returned for the request and also the model server tracking the response and emitting our logging statements from our inference script.

Sample Request (Screenshot by Author)

Let’s break our model.py and see if we can catch the error early with Docker. Here in the inference function I add a syntactically incorrect print statement and restart my model server to see if this error is captured.

def inference(self, inputs):
        """
        Custom service entry point function.

        :param inputs: the Input object holds the text for the BART model to infer upon
        :return: the Output object to be send back
        """

        #sample error
        print("=)

We can then see this error captured by the model server when we execute the docker run command.

Error captured by Model Server (Screenshot by Author)

Note that you are not limited to utilizing just curl to test your container. We can also use something like the Python requests library to interface and work with the container. A sample request would look like the following:

import requests

headers = {
    'Content-type': 'text/plain',
}

# Pass the text payload for the model to run inference on
response = requests.post('http://localhost:8080/invocations', headers=headers,
                         data='This is a sample test string')

Utilizing something like requests you can run larger scale load tests on the container. Note that the hardware you are running the container on is what is being utilized (think of this as your equivalent to the instance behind a Sagemaker Endpoint).
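
For example, here is a rough sketch of a small load test; the payload and request count are arbitrary choices, not from the article:

import time
import requests

payload = "This is a sample test string"
headers = {"Content-type": "text/plain"}

latencies = []
for _ in range(100):
    start = time.time()
    response = requests.post(
        "http://localhost:8080/invocations", headers=headers, data=payload
    )
    latencies.append(time.time() - start)

print(f"avg latency: {sum(latencies) / len(latencies):.3f}s, "
      f"last status code: {response.status_code}")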

Additional Resources & Conclusion

GitHub – RamVegiraju/SageMaker-Docker-Local: How to locally test SageMaker Inference with Docker

You can find the code for the entire example at the link above. With SageMaker Inference you want to avoid the pain of waiting for the endpoint to be created just to capture any errors. Utilizing this approach you can work with any SageMaker container to test and debug your model artifacts and inference scripts.

As always feel free to leave any feedback or questions, thank you for reading!


If you enjoyed this article feel free to connect with me on LinkedIn and subscribe to my Medium Newsletter. If you’re new to Medium, sign up using my Membership Referral.

How to Create Reusable R Containers for SageMaker Jobs
https://towardsdatascience.com/how-to-create-reusable-r-containers-for-sagemaker-jobs-a3d481daf5cd/ (Wed, 04 May 2022)
A guide to creating reusable containers on SageMaker for R developers

SageMaker is great in terms of giving you full flexibility to use its services with your own runtime and language of choice. If none of the available runtimes or languages fit your code, you first need to overcome the initial hurdle of creating a compatible docker container that SageMaker can use.

In this blog, we take a deep dive into how to create such R containers for use in SageMaker, and we try to understand in more depth how SageMaker works. This gives us better clarity on some decisions that we will make during the container build phase. For an end-to-end example of an ML pipeline that utilises these R containers, check this GitHub example.

Docker containers in a nutshell

There is a slight chance you’ve landed on this article but have no idea what a docker container is. I will not attempt to explain what docker or containers are, since there are already about a million such articles written out there that do it better than I ever could.

In a nutshell, a container is a standard unit of software that packages up code and all its dependencies in a single "object" that can be executed safely and reliably across different systems.

For this blog, you need to be broadly familiar with a few concepts, namely what a Dockerfile, an image, a container registry and a container are. If you are curious about containers and want to learn more, you can start learning more about it here.

Why Containers + SageMaker?

SageMaker is built in a modular way that allows us to use our own containers with its services. This gives us the flexibility to use the libraries, programming languages and/or runtimes of our choice while still leveraging the full benefits of using its services.

R Container for SageMaker Processing

Creating an R container for Processing jobs is probably the simplest of all the containers we may need on SageMaker. The Dockerfile can be as follows:
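
The Dockerfile itself is embedded as a gist in the original post and is not reproduced here. A minimal sketch, assuming the public r-base image and a couple of illustrative packages, could look like this:

# Base image and package list are assumptions, not taken from the original gist
FROM r-base:4.2.0

# Install whatever R packages your processing scripts need
RUN R -e "install.packages(c('dplyr', 'readr'), repos='https://cloud.r-project.org')"

# SageMaker Processing runs `docker run <image>`, so making Rscript the
# entrypoint lets the job execute any script passed to it
ENTRYPOINT ["Rscript"]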

When the container is created and registered on ECR (Amazon Elastic Container Registry), we can run a processing job. This is similar to how we would usually run a processing job; we just pass the URI of the newly created image in the image_uri parameter of the job. An example of such a processing job run (as part of a pipeline as well) can be found in line 33 of pipeline.R in the example shared above. When the processing job runs, SageMaker runs the container with the following command:

docker run [AppSpecification.ImageUri]

Therefore, the entry point command (Rscript) will execute the script passed into the code argument of the ScriptProcessor. This container can thus be reused for all processing jobs that need to execute some arbitrary R code, assuming of course the necessary package dependencies are available for it.

Further customisations are possible, and if you are interested to dive deeper into how SageMaker containers work for Processing jobs specifically, feel free to read the relevant documentation page.

R Container for SageMaker Training and Deployment

Creating an R container for Training jobs, which can also be reused for deploying a model, involves a couple more steps compared to the straightforward example above.

A template Dockerfile can look as follows:
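
(Again, the full Dockerfile is in the linked repository; this is a rough sketch of its shape, and the base image, package list and file locations are illustrative assumptions.)

FROM r-base:4.2.0

# Install the packages the training and inference code needs
RUN R -e "install.packages(c('readr', 'jsonlite'), repos = 'https://cloud.r-project.org')"

# Copy the two helper scripts explained below
COPY run.sh /opt/ml/run.sh
COPY entrypoint.R /opt/ml/entrypoint.R
RUN chmod +x /opt/ml/run.sh

# SageMaker invokes the container as "docker run image train" or
# "docker run image serve"; run.sh forwards that argument to entrypoint.R
ENTRYPOINT ["/opt/ml/run.sh"]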

You will notice that once we install the necessary packages our model code requires, we also copy a run.sh and an entrypoint.R file. Let’s see what these files are and why they are needed.

#!/bin/bash
echo "ready to execute"
Rscript /opt/ml/entrypoint.R $1

The run.sh script is very simple: all it does is run the entrypoint.R script, passing along the command-line argument held in $1. We do this because SageMaker runs the Docker container for training and serving with the commands:

docker run image train

or

docker run image serve

depending on whether we called the training or the deployment methods. Based on the argument $1, which is either "train" or "serve", we want to differentiate the next step. The bash script is required here to pass this argument down to the Rscript execution, as there is no straightforward way to read the docker run arguments from within the R code. If you know of a better/simpler way of doing the above, please do let me know in the comments!

Let’s now look into the entrypoint.R script:
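
(The actual entrypoint.R lives in the linked repository; the sketch below only illustrates its logic, assuming jsonlite and the AWS CLI are available in the container and using train.R as a hypothetical name for the entry script inside sourcedir.tar.gz.)

args <- commandArgs(trailingOnly = TRUE)

if (args[1] == "train") {
  # SageMaker writes the hyperparameters, including the S3 location of the
  # zipped training code, to this well-known path
  params <- jsonlite::read_json("/opt/ml/input/config/hyperparameters.json")
  # hyperparameter values arrive as quoted strings, so strip the quotes
  submit_dir <- gsub('"', "", params[["sagemaker_submit_directory"]])

  # Download and unpack sourcedir.tar.gz into /opt/ml/code, then run it
  dir.create("/opt/ml/code", recursive = TRUE, showWarnings = FALSE)
  system(paste("aws s3 cp", submit_dir, "/opt/ml/code/sourcedir.tar.gz"))
  system("tar -xzf /opt/ml/code/sourcedir.tar.gz -C /opt/ml/code")
  source("/opt/ml/code/train.R")  # hypothetical entry script name
} else if (args[1] == "serve") {
  # SageMaker has already unpacked model.tar.gz under /opt/ml/model,
  # including the code/ subfolder we bundled during training
  source("/opt/ml/model/code/deploy.R")
}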

This is now getting way more SageMaker specific, so let’s unpack it! SageMaker has a well-defined file structure where it saves files and expects to find files under /opt/ml/. Specifically, what we utilise here is the following:

/opt/ml/
    - input/config/hyperparameters.json
    - code/
    - model/
        - <model artifacts>
        - code/

hyperparameters.json file: When a training estimator is created, we will want to pass in some custom code to define and train our model. When this is passed, SageMaker will zip those files (it could be a whole directory of files you need to pass for training) into one file called "sourcedir.tar.gz" and upload it to an S3 location. Once we start a training job, SageMaker creates the file hyperparameters.json in the location /opt/ml/input/config/ that contains any passed hyperparameters, but also contains the key "sagemaker_submit_directory" with the value of the S3 location where the "sourcedir.tar.gz" file was uploaded. When in training mode, we need to download and unzip our training code. This is exactly what the first section of the above if statement is doing.

code directory: Following the convention of how SageMaker downloads and unpacks the training code on the built-in algorithms and managed framework containers, we extract the training code into the directory /opt/ml/code/. This is not a requirement, but rather a good practice to follow the service’s standards.

model directory: This is the directory where SageMaker will automatically download the model artefacts and the code relevant to inference. The second section of the if statement in the above snippet leverages this to source the deploy.R script. It is important to note here that this Dockerfile & code sample assumes that our inference code includes a deploy.R file, which will be the one run for deployment. If you follow a different convention for naming this file, feel free to rename it. In this code example, during the training process and once the model is created, the artefacts of the model are saved under the /opt/ml/model folder. We also save the inference code in the code/ subfolder of the same directory. This way, when SageMaker zips the files to create the model.tar.gz file, this file will also include the code necessary for deployment.

The above is an architectural/design decision taken to bundle the inference code with the model itself. It can be perfectly valid for your use case to decouple these two and keep the inference code separate from the model artefacts. This is of course possible, and it is up to you to decide which approach to follow.

Please also note that the model artefacts are saved in a single model.tar.gz file on S3; during deployment, however, SageMaker will automatically download and unzip this file, so we don’t have to do this manually ourselves.

Pro Tip: You may want to have different containers for training and deploying, in which case the above step can be simplified and the run.sh script skipped altogether.

Further customisations are possible, and if you are interested to dive deeper into how SageMaker containers work specifically for training and inference jobs, feel free to read the relevant documentation page.

Building the containers

If you are familiar with building containers, you will realise that there is nothing inherently special about the following process. All we need to do is build the containers based on the Dockerfiles provided and register the images with ECR; the SageMaker job will pull the image at runtime. If you already know how to build and register an image to ECR, feel free to skip this section of the post.

For users of RStudio on SageMaker, or anyone not able or willing to have the Docker daemon running on their development environment, I suggest outsourcing the actual building of the container to another AWS service, namely AWS CodeBuild. Luckily for us, we don’t need to interact with that service directly, thanks to the useful SageMaker Docker Build utility that hides all this complexity from us. Install the utility with a command like:

py_install("sagemaker-studio-image-build", pip=TRUE)

and we are good to go. Building the container requires a single command:

sm-docker build . --file {Dockerfile-Name} --repository {ECR-Repository-Name:Optional-Tag}

Conclusion

SageMaker Processing, Training and Hosting capabilities are really versatile, and by bringing our own container we can build our model and our application exactly the way we want.

In this blog we explored how we can create our own, reusable, R-enabled docker container that we can use for our processing, training and deployment needs.

The complete example of the code used in this post can be found in this Github repository.

Reach out to me in the comments or connect with me on LinkedIn if you are building your own containers for R on SageMaker and would like to discuss it!

The post How to Create Reusable R Containers for SageMaker Jobs appeared first on Towards Data Science.

]]>
Heroku + Docker in 10 Minutes https://towardsdatascience.com/heroku-docker-in-10-minutes-f4329c4fd72f/ Sun, 06 Feb 2022 18:20:56 +0000 https://towardsdatascience.com/heroku-docker-in-10-minutes-f4329c4fd72f/ Deployment for Python applications made easy - and it's free

The post Heroku + Docker in 10 Minutes appeared first on Towards Data Science.

]]>
It’s easy to launch a web application on a local machine when all that’s required is knowing a programming language. However, it takes a lot of trial and error to deploy a web application, especially when more tools are involved and you now have to worry about the deployment environment, scalability, and other concerns. If you’re looking to deploy a Python web application (i.e. Flask/Django), this article is for you! You can skip the first three sections if you already have some knowledge of Heroku and Docker.

Update: This article is part of a series. Check out other "in 10 Minutes" topics here!

Table of Contents


Why Heroku?

Heroku is a cloud platform as a service (PaaS) that allows applications to be hosted on the cloud. For people looking for a free hosting platform for Python applications, Heroku is one of the top choices (although there are paid tiers as well).

For the free tier, Heroku offers integration with GitHub and the use of Heroku containers for deployment, referred to as dynos. It is worth mentioning some caveats of the free tier that I find cumbersome.

  1. It is not possible to choose a custom domain, so applications will be hosted on the <app-name>.herokuapp.com domain
  2. There is no SSL certificate, but a workaround is to manually type in https:// to get the same secure lock icon
  3. Free dynos will sleep after some period of inactivity, therefore relaunching the application will take some time (~ 1 minute) for the container to start up
  4. The repository will be compiled into a slug, and the performance will start degrading if the slug size exceeds 500 MB. This would severely limit the repository size. A workaround is to compile your application into a Docker image and bypass the slug size requirement

Update: Heroku has removed its free tier as of Nov 2022, an alternative would be to use Google Cloud or Fly. If you are using the paid version of Heroku, feel free to read on!

Google Cloud vs. Fly.io as Heroku Alternatives

Why Docker?

Docker helps deliver the web application in packages called containers. Using containers is a best practice for deployment since each container packages its own software, libraries, and configuration files, making it easy to scale your web application up or down. However, in the Heroku free tier, I don’t think there is an option to scale the web application up.

The last caveat in the previous section was also the reason why I chose to switch from Heroku containers to Docker containers. I was expanding my web application and realized my repository had grown too big and my web application was getting increasingly slow. After the switch to Docker containers, my web application runs even faster than on my local computer!

I would recommend using Heroku with Docker to future-proof your web application so you don’t have to perform the switch as I did.

Docker Crash Course

The instructions on how to create a container are usually written in a Dockerfile, and the files to exclude from the build are listed in .dockerignore. Below is a sample of what a .dockerignore file looks like, but this file is optional if you want Docker to include everything in the repository.

__pycache__
*.pyc
env/
db.sqlite3
docs/
*.log

For the Dockerfile, there are some commonly used commands,

  • FROM is used once at the start of Dockerfile to indicate which base image to use, for our case we would want to use a base image that supports Python
  • ARG defines variables that users pass in at build-time. In the example below, port is an argument to be passed in when building a Docker image
  • USER sets username or user group when running Docker image
  • COPY copies files and directories to the container
  • WORKDIR sets the working directory of the container
  • ENV sets environment variable
  • RUN runs shell command, which calls /bin/sh -c on Linux
  • EXPOSE informs Docker that the container listens on the specified network ports at runtime, used for testing Docker applications locally
  • CMD is used once at the end of the Dockerfile and contains the final command to execute when running the container

Information on the full list of Docker commands can be found in the Docker documentation. Below is a sample of what the Dockerfile looks like,

FROM python:3.8-slim
ARG port

USER root
COPY . /<your-app>
WORKDIR /<your-app>

ENV PORT=$port

RUN apt-get update && apt-get install -y --no-install-recommends apt-utils \
    && apt-get -y install curl \
    && apt-get install -y libgomp1

RUN chgrp -R 0 /<your-app> \
    && chmod -R g=u /<your-app> \
    && pip install pip --upgrade \
    && pip install -r requirements.txt
EXPOSE $PORT

CMD gunicorn app:server --bind 0.0.0.0:$PORT --preload

It is best to test that the Dockerfile builds and runs locally before deployment, which means building the Docker image and running it.

  • To build the Docker image: The following command passes in the port argument and names the image tmp_image, docker build --no-cache --build-arg port=8060 -t tmp_image .
  • To run the Docker image: The following command maps the container port to a local port so the web application can be viewed locally, docker run --rm -p 8060:8060 tmp_image

Deploy Docker on Heroku

Assuming you already have your folder structure for your web application, you only require 3 additional files for deployment to Heroku. Two of these files are instructions for Docker as explained in the previous section, and the last file heroku.yml contains deployment instructions for Heroku. Below is the sample project folder structure,

your-app
|-- .dockerignore
|-- app.py
|-- Dockerfile
|-- heroku.yml
|-- requirements.txt

For the heroku.yml file, you only need to indicate that the deployment uses Docker, and the file looks like this,

build:
  docker:
    web: Dockerfile

And that’s it! All that’s left is to follow the on-screen instructions to link Heroku to your codebase, and your web application will be ready after the build!
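
If you prefer the Heroku CLI over the dashboard, the linking and deployment flow typically looks something like this (your-app is a placeholder name, and this assumes the Heroku CLI is installed and the files above are committed to git):

heroku login
heroku create your-app                       # placeholder app name
heroku stack:set container --app your-app    # build from heroku.yml + Dockerfile instead of buildpacks
git push heroku main                         # triggers the Docker build and deploy (use master on older setups)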


Hope you have learnt how to deploy Python applications on Heroku using Docker. If you’re interested to view my web application which is deployed the same way, the link is below!

Tools to make life easier

Thank you for reading! If you liked this article, feel free to share it.


Related Links

Heroku Deployment Documentation: https://devcenter.heroku.com/categories/deploying-with-docker

Dockerfile Documentation: https://docs.docker.com/engine/reference/builder/

heroku.yml Documentation: https://devcenter.heroku.com/articles/build-docker-images-heroku-yml

The post Heroku + Docker in 10 Minutes appeared first on Towards Data Science.

]]>
MLOps with Docker and Jenkins: Automating Machine Learning Pipelines https://towardsdatascience.com/mlops-with-docker-and-jenkins-automating-machine-learning-pipelines-a3a4026c4487/ Tue, 28 Sep 2021 10:00:20 +0000 https://towardsdatascience.com/mlops-with-docker-and-jenkins-automating-machine-learning-pipelines-a3a4026c4487/ How to containerize your ML models with Docker and automate a Pipeline with Jenkins

The post MLOps with Docker and Jenkins: Automating Machine Learning Pipelines appeared first on Towards Data Science.

]]>
Introduction

The purpose of this post is to provide an example of how we can use DevOps tools like Docker and Jenkins to automate a Machine Learning pipeline. At the end of this post, you will know how to containerize a Machine Learning model with Docker and create a pipeline with Jenkins that automatically processes raw data, trains a model and returns the test accuracy every time we make a change in our repository.

All the code needed for this post can be found on GitHub.

For this task we will use the Adult Census Income dataset. The target variable is income: a binary variable that indicates whether an individual earns more than 50k a year or not.

📒 NOTE: As the purpose of this article is to automate a Machine Learning pipeline, we won’t dive into EDA, as it is out of scope. If you are curious about that you can check this Kaggle notebook, but it is not mandatory in order to understand what is done here.

Ok, so let’s start!

Setting the strategy

Before starting to code, I think it is important to understand the plan. If you look at the GitHub repository you will see three Python scripts: it is easy to figure out what they do by looking at their names 🙂 . We also have the raw dataset, adult.csv, and a Dockerfile (we will talk about it later). But first I want you to understand the workflow of this project, and for that we need to understand the inputs and outputs of our Python scripts:

Image by Author

As we see in the image, preprocessing.py takes the raw data as input and outputs processed data split into train and test. train.py takes train processed data as input and outputs the model and a json file where we will store the validation accuracy. test.py takes test processed data and the model as inputs and outputs a json file with test accuracy.

With this in mind, now we have a bunch of scripts that we have to run in a certain order, that create a bunch of files that we need to store and access. Furthermore, we want to automate all this process.

Nowadays, the best way to manage this is with Docker: with this tool you can create an isolated environment with all the dependencies needed to run your code (solving the "it works on my machine" problem!), which makes everything easier. Once we have that, we will be able to automate the whole process with Jenkins.

There are 3 concepts on which Docker is based: containers, images and Dockerfiles. It is indispensable to understand what they do in order to work with Docker. If you are not familiar with them, here is an intuitive definition:

  • Containers: A standard unit of software that packages everything you need to run your application (dependencies, environment variables…)
  • Dockerfile: This is a file in which you define everything you want to be inside of a container.
  • Image: This is the blueprint needed for running a container. You build an image by executing a Dockerfile.

So, in order to use Docker, you will follow these steps:

  1. Define a Dockerfile
  2. Build the image
  3. Run a container
  4. Run commands inside the container

Let’s go step by step.

Defining the Dockerfile

Here we have to define everything we need to run the pipeline. You can have a look at the Dockerfile in the repository, but if you are not familiar with the syntax it may be overwhelming at first. So what we are going to do here is talk about what we want to specify in it and have a look at the syntax step by step.

First, we need to specify where we want to run our pipeline. For most containerized applications people tend to choose a light Linux distribution, like Alpine. However, for our pipeline we will just use a Jupyter image called jupyter/scipy-notebook. In the Dockerfile, we specify the following command:

FROM jupyter/scipy-notebook

Then, we have to install some packages. For this purpose we use the command RUN:

USER root 
RUN apt-get update && apt-get install -y jq
RUN pip install joblib

📒 NOTE: It may not make much sense now, but we will need jq in order to access values inside json files, and joblib in order to serialize and deserialize the model.

The next thing we have to set is the layout of the files inside the container. We want to build a container that has this structure inside:

Image by Author

📒 NOTE: "work" folder is autogenerated by Docker. We are not going to put anything inside.

First we create the folders:

RUN mkdir model raw_data processed_data results

And then we set the directories as environment variables (so we don’t hard code paths all over the code)

ENV MODEL_DIR=/home/jovyan/model
ENV RAW_DATA_DIR=/home/jovyan/raw_data
ENV PROCESSED_DATA_DIR=/home/jovyan/processed_data
ENV RESULTS_DIR=/home/jovyan/results
ENV RAW_DATA_FILE=adult.csv
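
For context, the scripts can then read these locations at runtime instead of hard-coding paths. A minimal sketch of what that looks like in Python (the actual scripts in the repository may read them differently):

import os

# Directories and file names exposed by the Dockerfile as environment variables
RAW_DATA_DIR = os.environ["RAW_DATA_DIR"]
PROCESSED_DATA_DIR = os.environ["PROCESSED_DATA_DIR"]
MODEL_DIR = os.environ["MODEL_DIR"]
RESULTS_DIR = os.environ["RESULTS_DIR"]
RAW_DATA_FILE = os.environ["RAW_DATA_FILE"]

raw_data_path = os.path.join(RAW_DATA_DIR, RAW_DATA_FILE)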

Finally, we copy the scripts and the raw data from our repository. They will be placed inside the container image at build time.

COPY adult.csv ./raw_data/adult.csv 
COPY preprocessing.py ./preprocessing.py 
COPY train.py ./train.py 
COPY test.py ./test.py

Building the image

Once we have our Dockerfile specified, we can build the image. The command to do this is:

sudo -S docker build -t adult-model .

We specify the name of the image with -t adult-model (-t stands for tag) and the path of the Dockerfile with . (the current directory); Docker automatically picks up the file named "Dockerfile".

Running a container

Now that we have an image (a blueprint for a container), we can build a container!

📒 NOTE: we are going to run just one container, but in case you don’t know, once we have an image we can run as many containers from it as we want! This opens up a wide range of possibilities.

The command to run a container is the following:

sudo -S docker run -d --name model adult-model

where the -d flag stands for detached (it runs the container in the background). We name the container "model" and specify the image to use (adult-model).

Running commands inside the container

Now that we have our container running, we can run commands inside it using docker exec. In this project, we need to execute the scripts in order and then show the results. We can do that with the following commands:

  • Run preprocessing.py
sudo -S docker container exec model python3 preprocessing.py
  • Run train.py
sudo -S docker container exec model python3 train.py
  • Run test.py
sudo -S docker container exec model python3 test.py
  • Show validation accuracy and test accuracy
sudo -S docker container exec model cat \
    /home/jovyan/results/train_metadata.json \
    /home/jovyan/results/test_metadata.json

📒 NOTE: If you are curious enough (I guess you are) you will want to know what each script actually does. Don’t worry, if you are familiar with basic Machine Learning tools (here I basically use Pandas and SKlearn libraries), you can open the scripts and have a look at the code. It’s not a big deal and most of the lines are commented. If you want a deep understanding or you are looking for more complex models than the one shown here, you can take a look at this notebook.

The testing step

When building pipelines, it is common to have a step dedicated to testing whether the application is well built and good enough to be deployed to production. In this project, we will use a conditional statement that tests whether the validation accuracy is higher than a threshold. If it is, the model is deployed. If not, the process stops. The code for doing this is the following:

val_acc=$(sudo -S docker container exec model  jq .validation_acc  /home/jovyan/results/train_metadata.json)
threshold=0.8

if echo "$threshold > $val_acc" | bc -l | grep -q 1
then
    echo 'validation accuracy is lower than the threshold, process stopped'
else
   echo 'validation accuracy is higher than the threshold'
   sudo -S docker container exec model python3 test.py
   sudo -S docker container exec model cat  /home/jovyan/results/train_metadata.json  /home/jovyan/results/test_metadata.json 
fi

As you can see, first we set the two variables we want to compare (validation accuracy and the threshold) and then we pass them through a conditional statement. If the validation accuracy is higher than the threshold, we will execute the model for the test data and then we will show both test and validation results. If not, the process will stop.

And there we have it! Our model is fully containerized and we can run all the steps in our pipeline!

Question for Docker beginners: What is the point of what we have done?

If you are not familiar with Docker, you might now be asking: Ok, this is all good stuff, but in the end I just have my model and my predictions. I can also get them by running my Python code with no need to learn Docker, so what’s the point of all this?

I’m glad you asked 😎 .

First, having your Machine Learning models in a Docker container is really useful for deploying that model to a production environment. As an example, how many times have you seen code in a tutorial or a repository that you tried to replicate, only for your screen to fill with red when running the same code on your machine? If we don’t like going through that, imagine how our customers might feel. With Docker containers, this problem is solved.

Another reason why Docker is really useful is probably the same reason why you are reading this: to help automate an entire pipeline.

So, without any further ado, let’s get straight to the point!

Automating a ML pipeline with Jenkins

For this step we will use Jenkins, a widely used open-source automation server that provides an endless list of plugins to support building, deploying and automating any project.

This time, we will build the steps of the pipeline using Jenkins jobs. Each job will be a step in our pipeline.

📒 NOTE: To keep things running smoothly, you will probably need to configure a few things:

  • It is probable that you will experience some problems trying to connect Jenkins with GitHub if you run Jenkins on your localhost. If that is your case, consider creating a secure URL to your localhost. The best tool I have found for this is ngrok.
  • As Jenkins uses its own user (called jenkins), you may need to give it permissions to execute commands without password. You can do this by opening sudoers file with sudo visudo /etc/sudoers and pasting jenkins ALL=(ALL) NOPASSWD: ALL.

That being said, let’s see what is the plan. We will create 4 jobs:

  1. The "github-to-container" job: In this job we will "connect" Jenkins with GitHub so that the job is triggered every time we push a commit to GitHub. We will also build the Docker image and run a container.
  2. The "preprocessing" job: In this step we will execute the preprocessing.py script. This Job will be triggered by the "github-to-container" job.
  3. The "train" job: In this job we will execute the train.py script. This Job will be triggered by the "preprocessing" job.
  4. The "test" job: In this job we will pass the validation score through our conditional statement. If it is higher than the threshold, we will execute the test.py script and show the metadata (the validation and test accuracy). If the validation score is lower than the threshold, the process will stop and no metadata will be provided.

Once we know what to do, let’s go for it!

Creating Jenkins Jobs

The github-to-container job

For the github-to-container job, first we need to create a "connection" between GitHub and Jenkins. This is done using webhooks. To create the webhook, go to your repository in GitHub, choose Settings and select Webhooks. Select Add webhook. In the Payload URL, enter the URL where you run Jenkins and append "/github-webhook/". For content type choose "application/json". For "Which events would you like to trigger this webhook?", choose "Just the push event". At the bottom select Active. Select Add webhook.

Image by Author

Then, you will need to create a credential in Jenkins in order to access Github. In Jenkins, go to Manage Jenkins

Image by Author

Then select Manage Credentials,

Image by Author

In "Stores scoped to Jenkins" select Jenkins

Image by Author

Then select "Global credentials (unrestricted)"

Image by Author

and Add credentials

Image by Author

Here, for Scope select "Global (Jenkins, nodes, items, all child items, etc)", for username and password write your Github username and password. You can leave ID empty as it will be autogenerated. You can also add a description. Finally, click OK.

Now let’s build the first job!

In Jenkins, go to New Item,

Image by Author

then give it a name and choose Freestyle project.

Image by Author

Next step is to set the configuration. For this step, in Source Code Management choose Git, and then paste the URL of your repository and your Github credentials.

Image by Author

Then, in Build Triggers select "GitHub hook trigger for GITScm polling". Finally in the build section choose Add build step, then Execute shell, and then write the code to build the image and run the container (we have already talked about it):

Image by Author
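
For reference, the shell step shown in the screenshot boils down to the two commands we covered earlier:

sudo -S docker build -t adult-model .
sudo -S docker run -d --name model adult-model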

Choose save.

The preprocessing job

For the "preprocessing" job, in Source Code Management leave it as None. In Build Triggers, select "Build after other projects are built". Then in Projects to watch enter the name of the first job and select "Trigger only if build is stable".

Image by Author

In Build, choose Add build step, then execute shell, and write the code to run preprocessing.py:

Image by Author
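
The shell step here is the preprocessing command from earlier:

sudo -S docker container exec model python3 preprocessing.py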

The train job

The "train" job has the same scheme as the "preprocessing" job, but with a few differences. As you might guess, you will need to write the name of the second job in the Build triggers section:

Image by Author

And in the Build section write the code for running train.py.

Image by Author
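
Again, the shell step is the training command we used before:

sudo -S docker container exec model python3 train.py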

The test job

For the "test" job, select the "train" job for the Build Triggers section,

Image by Author

and in the Build section write the following code:

val_acc=$(sudo -S docker container exec model  jq .validation_acc  /home/jovyan/results/train_metadata.json)
threshold=0.8

if echo "$threshold > $val_acc" | bc -l | grep -q 1
then
    echo 'validation accuracy is lower than the threshold, process stopped'
else
   echo 'validation accuracy is higher than the threshold'
   sudo -S docker container exec model python3 test.py
   sudo -S docker container exec model cat  /home/jovyan/results/train_metadata.json  /home/jovyan/results/test_metadata.json 
fi    

sudo -S docker rm -f model

I have written it just in case you want to copy paste, but in Jenkins it should look like this:

Image by Author

Click save and there we have it! Our Pipeline is now fully automated!

You can now play with it: try making a commit in Github and see how every step goes automatically. At the end, if the model validation accuracy is higher than the threshold, the model will compute the test accuracy and give back the results.

📒 NOTE: In order to see the output of each step, select the step, click on the first number in the Build section at the bottom left and select Console Output. For the last step, you should see the validation and test accuracy.

I hope you have learned a lot! Thanks for reading!


References

Docker for Machine Learning – Part III

From DevOps to MLOPS: Integrate Machine Learning Models using Jenkins and Docker

From Naïve to XGBoost and ANN: Adult Census Income

The post MLOps with Docker and Jenkins: Automating Machine Learning Pipelines appeared first on Towards Data Science.

]]>
Deploying a Docker container with ECS and Fargate. https://towardsdatascience.com/deploying-a-docker-container-with-ecs-and-fargate-7b0cbc9cd608/ Sun, 04 Jul 2021 03:08:37 +0000 https://towardsdatascience.com/deploying-a-docker-container-with-ecs-and-fargate-7b0cbc9cd608/ Including a guide to the necessary permissions.

The post Deploying a Docker container with ECS and Fargate. appeared first on Towards Data Science.

]]>
Photo by Dominik Lückmann on Unsplash

This week I needed to deploy a Docker image on ECS as part of a data ingestion pipeline. I found the process of deploying the Docker image to ECS to be fairly straightforward, but getting the correct permissions from the security team was a bear.

In this article, we will dig into the steps to deploy a simple app to ECS and run it on a Fargate Cluster so you don’t have to worry about provisioning or maintaining EC2 instances. More importantly, we’ll take a look at the necessary IAM user and IAM role permissions, how to set them up, and what to request from your cyber security team if you need to do this at work.

Let’s dig in, starting with terminology.

ECS, ECR, Fargate

The three AWS technologies we are going to use here are Elastic Container Service (ECS), Elastic Container Registry (ECR), and Fargate.

ECS

ECS is the core of our work. In ECS we will create a task and run that task to deploy our Docker image to a container. ECS also handles the scaling of applications that need multiple instances running. In short, ECS manages the deployment of our application. Learn more.

ECR

ECR is versioned storage for Docker images on AWS. ECS pulls images from ECR when deploying. Learn more.

Fargate

Fargate provisions and manages clusters of compute instances. This is amazing because:

  1. You don’t have to provision or manage the EC2 instances your application runs on.
  2. You are only charged for the time your app is running. In the case of an application that runs a periodic task and exits this can save a lot of money.

Policies, Groups, IAM users, and IAM roles

If you are new to the AWS ecosystem and not doing this tutorial on a root account you will need to know a little about security management on AWS.

Policies

A policy is a collection of permissions for a specified service. For example, you could have a policy that only allows some users to view the ECS tasks, but allows other users to run them.

Policies can be attached to Groups or directly to individual IAM users.

Groups

Groups are what they sound like: groups of users that share access policies. When you add a policy to a group, all of the members of that group acquire the permissions in the policy.

IAM user

IAM stands for Identity and Access Management, but really it’s just an excuse to call a service that identifies a user "I am" (clever, right?). If you are not the root user you will be logging into the AWS Management Console as an IAM user.

IAM Roles

Roles are a little bit more confusing. A role is a set of permissions for an AWS service. They are used when one service needs permission to access another service. The role is created for the specific type of service it will be attached to and it is attached to an instance of that service. (There are other applications for Roles but they are beyond the scope of this article.)

Setting up Permissions

As I mentioned, this is the most painful part of the process. Amazon has tried to make this easy but access management is hard.

We’ll walk through setting up the appropriate policies from a root account. Then we’ll translate that into what to ask for from your security team so you can get your Docker container up and running on ECS.

Create an IAM User and assign permissions

This is a good exercise to go through just to get an idea of what is going on behind the scenes. It will help you negotiate the access you need from your organization to do your job.

  • Login to your AWS account as a root user. If you don’t have an account you can signup for an account here.
  • Search for IAM
Image by author
  • From the IAM dashboard select Users from the left menu.
  • Select Add user from the top of the page.
Image by author
  • On the Add user screen select a username.
  • Check Programmatic access and AWS Management Console access. The rest we can leave as they are.
Image by author

Attaching the ECS access policy

To keep our life simple, we are going to attach the access policies directly to this new IAM user. ECS requires permissions for many services such as listing roles and creating clusters in addition to permissions that are explicitly ECS. The best way to add all of these permissions to our new IAM user is to use an Amazon managed policy to grant access to the new user.

  • Select Attach existing policies directly directly under Set permissions.
  • Search for AmazonECS_FullAccess (the policies with the cube logo before them are Amazon managed policies).
  • Select the checkbox next to the policy.
Image by author

Create an ECR policy

We will also need to have access to ECR to store our images. The process is similar except that there is no Amazon managed policy option. We must create a new policy to attach to our IAM user.

  • Once again, select Create policy.
  • Under Service select Elastic Container Registry.
  • Under Actions select All Elastic Container Registry actions (ecr:*)
  • Under Resources select specific and Add ARN. Here we will select the region, leave our account number and select Any for Repository name.
Image by author
  • Click Add.
  • Skip the tags by clicking Next: Review.
  • Fill in an appropriate policy name. We will use ECR_FullAccess
  • Select Create policy

Attach the new policies to the IAM user

  • After creating the policies go back to the browser tab where we were creating the IAM user.
  • Refresh the policies by clicking on the refresh symbol to the top right of the policy table.
  • Search for ECR_FullAccess and AmazonECS_FullAccess and select the checkbox to the left of each policy to attach it to our IAM user.
Image by author
  • Select Next:Tags.
  • Leave tags blank.
  • Select Next:Review
  • Finally, review our work and create the user.
Image by author

When you submit this page you will get a confirmation screen. Save all of the information there in a safe place; we will need it when we deploy our container.

In the real world it is unlikely that you would need to create these permissions for yourself. It’s much more likely that you will need to request them from someone, perhaps a security team, at your organization. Now that you know a little about what is involved you are better prepared to make that request.

Your request should contain

  • a very brief explanation of what you need to accomplish.
  • a requested list of permissions.

The second is arguably unnecessary, but it will save everyone the time and pain of many back and forth emails as they try to work out exactly which permissions you need.

They may grant the permissions you request, or they may grant you a subset of them. They are the cyber security experts so if you get less than you ask for proceed in good faith. If you hit a wall, send them the error so they can grant the necessary permissions for you to move forward.

Your request could look something like this:

Hi Joe,
I need to deploy a Docker container on ECS.  I will also need access to ECR for this.
Please add the following to my IAM user privileges:
- AmazonECS_FullAccess managed policy
- The following policy for ECR access:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": "ecr:*",
            "Resource": "*"
        }
    ]
}
Thanks! You are the best!

Ok on to the main event

Deploying a Docker Container to ECS

The steps here are:

  1. Create the Docker image
  2. Create an ECR registry
  3. Tag the image
  4. Give the Docker CLI permission to access your Amazon account
  5. Upload your docker image to ECR
  6. Create a Fargate Cluster for ECS to use for the deployment of your container.
  7. Create an ECS Task.
  8. Run the ECS Task!

Create the Docker image

For the purpose of this demo I am going to use a simple Flask app that shows gifs of cats, from this GitHub repository. The app is part of docker-curriculum.com, which is a great Docker primer if you are just getting started.

Let’s begin

  • Clone the source files from GitHub and cd into the flask-app directory.
$ git clone https://github.com/prakhar1989/docker-curriculum.git
$ cd docker-curriculum/flask-app
  • Create the Docker image:
docker build -t myapp .

Test the app to make sure everything is working. The flask app we downloaded listens on port 5000 so we will use the same port to test.

docker run --publish 5000:5000 myapp

Now you should be able to go to localhost:5000 and see a random cat gif

Image by author

Yay!

Create an ECR registry.

In this step we are going to create the repository in ECR to store our image. We will need the ARN (Amazon Resource Name – a unique identifier for all AWS resources) of this repository to properly tag and upload our image.

First login to the AWS console with the test_user credentials we created earlier. Amazon will ask for your account id, username, and password.

Image by author
  • Once you are in, search for Elastic Container Registry and select it.
Image by author
  • From there fill in the name of the repository as myapp and leave everything else default.
Image by author
  • Select Create Repository in the lower left of the page and your repository is created. You will see your repository in the repository list and, most importantly, the ARN (here called a URI), which we will need to push up our image. Copy the URI for the next step.
Image by author

If you prefer you can also do the above step from the command line like so:

$ aws ecr create-repository \
    --repository-name myapp \
    --region us-east-1

Tag the image

In order for ECR to know which repository we are pushing our image to we must tag the image with that URI.

$ docker tag myapp [use your uri here]

The full command for my ECR registry looks like this:

docker tag myapp 828253152264.dkr.ecr.us-east-1.amazonaws.com/myapp

Give the Docker CLI permission to access your Amazon account

I’ll admit this step is a little convoluted. We need to log in to AWS to get a token, which we pass to Docker so it can upload our image to ECR. You will need the AWS CLI for the rest of our work.

  • First we’ll login to our aws account.
# aws configure

AWS will ask us for our credentials, which you saved from way back when we created the IAM user (right?). Use those credentials to authenticate.

  • Next, we need to generate an ECR login token for Docker. This step is best combined with the following step, but it’s good to take a deeper look to see what is going on. When you run the following command it spits out an ugly token. Docker needs that token to push to your repository.
# aws ecr get-login-password --region us-east-1
  • We can pipe that token straight into Docker like this. Make sure to replace [your account number] with your account number. The registry URL at the end of the command is the same as the URI we used earlier, just without the repository name.
# aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin [your account number].dkr.ecr.us-east-1.amazonaws.com

If all goes well the response will be Login Succeeded.

Upload your docker image to ECR

We’ve done the hard part now. It should be smooth sailing from here.

  • Use docker to push the image to the ECR repository.
docker push 828253152264.dkr.ecr.us-east-1.amazonaws.com/myapp

Create a Fargate Cluster.

Let’s return to the AWS management console for this step.

  • Search for Elastic Container Service and select Elastic Container Service.
  • From the left menu select Clusters
  • Select Create cluster
Image by author
  • Under Select cluster template we are going to select Networking only. We don’t need EC2 instances in our cluster because Fargate will take care of spinning up compute resources when we start our task and spinning them down when we stop it.
Image by author
  • I’ll name the cluster fargate-cluster, and the rest we can leave as is.
  • Select Create
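
If you prefer, the same cluster can also be created from the command line (assuming the AWS CLI is configured as in the earlier steps):

aws ecs create-cluster --cluster-name fargate-cluster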

Create an ECS Task

The ECS Task is the action that takes our image and deploys it to a container. To create an ECS Task, let’s go back to the ECS page and do the following:

  • Select Task Definitions from the left menu. Then select Create new Task Definition
Image by author
  • Select Fargate
  • Select Next Step
Image by author
  • Enter a name for the task. I am going to use myapp.
  • Leave Task Role and Network Mode set to their default values.
Image by author
  • Leave Task Execution Role set to its default.
  • For Task memory and Task CPU select the minimum values. We only need minimal resources for this test.
Image by author
  • Under Container definition select Add Container.
  • Enter a Container name. I will use myapp again.
  • In the Image box enter the ARN of our image. You will want to copy and paste this from the ECR dashboard if you haven’t already.
  • We can keep the Memory Limit to 128Mb
  • In port mappings you will notice that we can’t actually map anything. Whatever port we enter here will be opened on the instance and will map to the same port on the container. We will use 5000 because that is where our flask app listens.
Image by author
  • Leave everything else set to its default value and click Add in the lower left corner of the dialog.
  • Leave everything else in the Configure task and container definitions page as is and select Create in the lower left corner of the page.
  • Go back to the ECS page, select Task Definitions and we should see our new task with a status of ACTIVE.
Image by author

Run the ECS Task!

This is the moment we have all been waiting for.

  • Select the task in the Task definition list
  • Click Actions and select Run Task
Image by author
  • For Launch type: select Fargate
  • Make sure Cluster is set to the fargate-cluster we created earlier.
Image by author
  • For Cluster VPC select a VPC from the list. If you are building a custom app this should be the VPC assigned to any other AWS services you will need to access from your instance. For our app, any will do.
  • Add at least one subnet.
  • Auto-assign public IP should be set to ENABLED
Image by author
  • Edit the security group. Because our app listens on port 5000, and we opened port 5000 on our container, we also need to open port 5000 in the security group. By default the security group created by Run Task only allows incoming connections on port 80. Click on Edit next to the security group name and add a Custom TCP rule that opens port 5000.
Image by author

And finally, run the task by clicking Run Task in the lower left corner of the page.
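
The console handles the networking details for you; as a rough command-line equivalent, the same task could be launched with something like the following, where the subnet and security group IDs are placeholders for the ones you selected above:

aws ecs run-task \
    --cluster fargate-cluster \
    --task-definition myapp \
    --launch-type FARGATE \
    --count 1 \
    --network-configuration "awsvpcConfiguration={subnets=[subnet-0123456789abcdef0],securityGroups=[sg-0123456789abcdef0],assignPublicIp=ENABLED}"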

Check to see if our app is running

After you run the Task, you will be forwarded to the fargate-cluster page. When the Last Status for your task changes to RUNNING, your app is up and running. You may have to refresh the table a couple of times before the status is RUNNING. This can take a few minutes.

Image by author
  • Click on the link in the Task column.
  • Find the Public IP address in the Network section of the Task page.
Image by author
  • Enter the public IP address followed by :5000 in your browser to see your app in action.
Image by author

Shut down the app

When you are done looking at cat gifs, you’ll want to shut down your app to avoid charges.

  • From the ECS page select Clusters from the left menu, and select the fargate-cluster from the list of clusters.
Image by author
  • From the table at the bottom of the page select tasks.
  • Check the box next to the running task
  • Select stop from the dropdown menu at the top of the table
Image by author

Conclusion

Now that you know how to deploy a Docker image to ECS the world is your oyster. You can deploy a scraping app that runs until it completes then shuts down so you are only billed for the time it runs. You can scale a web service. You can spread cat gifs around the internet with multiple cat gif servers. It’s all up to you.

Resources

  • If your permissions do not allow your Task to create an ECS task execution IAM role you can create one with these directions.

Amazon ECS task execution IAM role

A Docker Tutorial for Beginners

  • The Amazon tutorial for deploying a Docker image to ECS.

Docker basics for Amazon ECS

Now go do good.

The post Deploying a Docker container with ECS and Fargate. appeared first on Towards Data Science.

]]>