A common pain point when getting started with SageMaker Real-Time Inference is that it can be hard to debug. When creating an endpoint, there are a number of ingredients you need to make sure are baked properly for a successful deployment:
- Proper file structuring of model artifacts depending on the Model Server and Container that you are utilizing. Essentially the model.tar.gz you provide must be in a format that is compliant with the Model Server.
- If you have a custom inference script that implements pre- and post-processing for your model, you need to ensure that the handlers you implement are compliant with your model server and that there are no errors at the code level either.
Previously we have discussed SageMaker Local Mode, but at the time of writing Local Mode does not support all of the hosting options and model servers available for SageMaker deployment.
To overcome this limitation, we take a look at using Docker with a sample model to test and debug our model artifacts and inference script prior to SageMaker deployment. In this specific example we will utilize the BART model that I covered in my last article and see how we can host it locally with Docker.
NOTE: For those of you new to AWS, make sure you create an account at the following link if you want to follow along. This article also assumes an intermediate understanding of SageMaker deployment; I would suggest following this article for understanding deployment/inference more in depth. An intermediate understanding of Docker will also be helpful to fully follow this example.
How Does SageMaker Hosting Work?
Before we can get to the code portion of this article, let’s take a look at how SageMaker actually serves requests. At its core, SageMaker Inference has two constructs:
- Container: This establishes the runtime environment for the model and is integrated with the model server that you are utilizing. You can either utilize one of the existing Deep Learning Containers (DLCs) or Bring Your Own Container.
- Model Artifacts: In the CreateModel API call we specify an S3 URL pointing to the model data in the format of a model.tar.gz (tarball). This model data is loaded into the /opt/ml/model directory on the container, and it also includes any inference script that you provide. A minimal sketch of this API call is shown after this list.
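To make the CreateModel piece concrete, here is a minimal boto3 sketch; the model name, image URI, S3 path, and IAM role ARN are placeholder assumptions for illustration, not values from this example.
import boto3

sm_client = boto3.client("sagemaker", region_name="us-east-1")

# Placeholder values: swap in your own image URI, model.tar.gz location, and role ARN
response = sm_client.create_model(
    ModelName="bart-djl-model",
    PrimaryContainer={
        "Image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.0-cu117",
        "ModelDataUrl": "s3://my-sample-bucket/bart/model.tar.gz",
    },
    ExecutionRoleArn="arn:aws:iam::123456789012:role/MySageMakerExecutionRole",
)
print(response["ModelArn"])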
The key here is that the container needs a web server that responds on port 8080 to the /invocations and /ping paths. An example of a web server we have implemented with these paths is Flask, from a Bring Your Own Container example.
With Docker, we will expose this port and point towards our local inference script and model artifacts. This way we simulate the way a SageMaker Endpoint is expected to behave.
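To make that contract concrete, here is a minimal Flask sketch of the kind of server a Bring Your Own Container would run; the prediction logic is a placeholder assumption for illustration.
from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/ping", methods=["GET"])
def ping():
    # SageMaker pings this path for container health checks
    return Response(status=200)

@app.route("/invocations", methods=["POST"])
def invocations():
    # SageMaker forwards inference requests to this path
    payload = request.get_data().decode("utf-8")
    # Placeholder: run your model's prediction logic here
    return Response(f"Received: {payload}", status=200, mimetype="text/plain")

if __name__ == "__main__":
    # SageMaker expects the container to listen on port 8080
    app.run(host="0.0.0.0", port=8080)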
Testing With Docker
For simplicity’s sake we will utilize my BART example from my last article, you can grab the artifacts from this repository. Here you should see the following files:

- model.py: This is the inference script that we are working with. In this case we are utilizing DJL Serving, which expects a model.py with a handler function implementing inference. Your inference script still needs to be compatible with the format that the model server expects.
- requirements.txt: Any additional dependencies that your model.py script requires. For DJL Serving, PyTorch is already installed beforehand; we use numpy for data processing.
- serving.properties: This is also a DJL-specific file where you can define any configuration at the model level (ex: workers per model). A rough sketch of these artifacts follows this list.
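For orientation, here is a rough sketch of what these artifacts can look like for DJL Serving; the handler body and the property values are assumptions for illustration, not the exact contents of the repository.
# model.py (sketch): DJL Serving looks for a handle() entry point
from djl_python import Input, Output

def handle(inputs: Input) -> Output:
    if inputs.is_empty():
        # Warm-up call from the model server, nothing to process
        return None
    text = inputs.get_as_string()
    # Placeholder: run BART inference on the input text here
    return Output().add(f"processed: {text}")
And a sample serving.properties with assumed values:
engine=Python
option.entryPoint=model.py
minWorkers=1
maxWorkers=2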
We have our model artifacts; now we need the container that we will be utilizing. In this case we can retrieve the existing DJL DeepSpeed image. For an extensive list of the images that are already provided by AWS, please reference this guide. You can also build your own image locally and point towards that. In this case we are operating in a SageMaker Classic Notebook Instance environment, which comes with Docker pre-installed as well.
To work with existing AWS-provided images, we first need to log in to AWS Elastic Container Registry (ECR) to retrieve the image, which you can do with the following shell command.
$(aws ecr get-login --region us-east-1 --no-include-email --registry-ids 763104351884)
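Note that aws ecr get-login is an AWS CLI v1 command and has been removed in AWS CLI v2. If you are on v2, the equivalent login (my assumption, not part of the original walkthrough) looks like the following.
aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com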
You should see a login succeeded message similar to the following.

Once logged in, we can navigate to the directory where our model artifacts are stored and run the following command, which will launch the model server. If you have not already retrieved the image, it will also be pulled from ECR.
docker run \
    -v /home/ec2-user/SageMaker:/opt/ml/model \
    --cpu-shares 512 \
    -p 8080:8080 \
    763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.21.0-deepspeed0.8.0-cu117 \
    serve
A few key points here:
- We are exposing port 8080 as SageMaker Inference expects.
- We also point towards the existing image. This URI is dependent on the region you are operating in and the model server you are utilizing. You can also utilize the SageMaker Python SDK's image_uris.retrieve API call to identify the appropriate image to pull here, as sketched below.
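Here is a minimal sketch of that lookup; the framework and version strings are assumptions matching the DJL DeepSpeed image used above, so double-check them against the SDK documentation for your setup.
from sagemaker import image_uris

# Look up the DJL DeepSpeed container URI for a given region
# (framework/version strings assumed to match the image used above)
image_uri = image_uris.retrieve(
    framework="djl-deepspeed",
    region="us-east-1",
    version="0.21.0",
)
print(image_uri)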

After the image has been pulled, you will see that the model server has been started.

We can also verify this container is running by utilizing the following Docker command.
docker container ls

We see that the API is exposed via port 8080, to which we can send sample requests via curl. Notice that we specify the /invocations path that SageMaker containers expect.
curl -X POST http://localhost:8080/invocations -H "Content-type: text/plain" \
    -d "This is a sample test string"
We then see inference returned for the request, with the model server tracking the response and emitting the logging statements from our inference script.

Let’s break our model.py and see if we can catch the error early with Docker. Here in the inference function I add a syntactically incorrect print statement and restart my model server to see if this error is captured.
def inference(self, inputs):
    """
    Custom service entry point function.
    :param inputs: the Input object holds the text for the BART model to infer upon
    :return: the Output object to be sent back
    """
    # sample error: syntactically invalid print statement
    print("=)
We can then see this error captured by the model server when we execute the docker run command.

Note that you are not limited to utilizing just curl to test your container. We can also use something like the Python requests library to interface with the container. A sample request would look like the following:
import requests

headers = {"Content-type": "text/plain"}
response = requests.post(
    "http://localhost:8080/invocations",
    headers=headers,
    data="This is a sample test string",
)
print(response.text)
Utilizing something like requests, you can run larger-scale load tests on the container; a rough sketch of such a test follows. Note that the hardware you are running the container on is what is being utilized (think of this as your equivalent of the instance behind a SageMaker Endpoint).
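As a rough sketch of a simple load test (the request count, payload, and concurrency level are arbitrary assumptions, not a tuned benchmark):
import time
from concurrent.futures import ThreadPoolExecutor

import requests

def invoke(payload: str) -> float:
    # Time a single request against the local container
    start = time.time()
    requests.post(
        "http://localhost:8080/invocations",
        headers={"Content-type": "text/plain"},
        data=payload,
    )
    return time.time() - start

# Fire 50 requests with 5 concurrent workers (arbitrary values for illustration)
with ThreadPoolExecutor(max_workers=5) as pool:
    latencies = list(pool.map(invoke, ["This is a sample test string"] * 50))

print(f"avg latency: {sum(latencies)/len(latencies):.3f}s, max: {max(latencies):.3f}s")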
Additional Resources & Conclusion
GitHub – RamVegiraju/SageMaker-Docker-Local: How to locally test SageMaker Inference with Docker
You can find the code for the entire example at the link above. With SageMaker Inference you want to avoid the pain of waiting for an endpoint to finish creating only to capture errors. Utilizing this approach, you can work with any SageMaker container to test and debug your model artifacts and inference scripts.
As always, feel free to leave any feedback or questions. Thank you for reading!
If you enjoyed this article feel free to connect with me on LinkedIn and subscribe to my Medium Newsletter. If you’re new to Medium, sign up using my Membership Referral.