
Back in my college days, my role in data science projects was like that of an alchemist: experimenting with fancy AI models to uncover the relationships among variables in data from my field. Powerful AI algorithms consistently amazed me by outperforming traditional statistical methods and physics-based models. The real challenge, however, began when I became an AI engineer in industry in 2022. From then on, the data science technology stack expanded rapidly into fields I was unfamiliar with. My first task in industry was to ship a model to the production environment, with requirements for reliability, maintainability, and scalability. Looking back at my struggles, I realize that transforming AI models from prototypes into production-ready applications comes down to a combination of:
- Good design patterns
- Robust code
- Efficient deployment strategies
This article is a comprehensive guide that distills seven key topics from my earlier sub-articles. Each topic explores one aspect of developing and deploying data science projects at an industry level:
- Code modularization
- Data validation
- Abstraction
- Configuration management
- Web service
- API documentation
- Docker and cloud
Using a streamflow forecasting application as a case study, this article dives into each topic with core concepts and demos, equipping entry-level data scientists with practical tools to advance their careers. Let’s start the journey of AI engineering!
1. Organizing Code into Modules
Modularization divides a program into smaller, independent modules. Modular code is easier to maintain and debug, because errors can be isolated and fixed within specific modules. It also improves extensibility: when adding a feature, you only need to modify the code in the relevant modules. Moreover, splitting the code into modules lets multiple developers work on different parts of the project simultaneously.
Example
Here is the directory layout for the code of our streamflow forecasting application:
.
├── .gitignore
├── config.yaml
├── Dockerfile
├── LICENSE
├── main_service.py
├── main_train.py
├── README.md
├── requirements.txt
├── adapter
├── config
├── domain
├── model
├── resources
├── service
├── test
└── utils
Our streamflow forecasting application can be organized into the following code modules:
- adapter: Reads data from various sources and converts it into the formats required for model training and inference.
- config: Specifies configurable parameters for different components of the application pipeline, e.g., data reading, model training, and service deployment.
- domain: Defines data schemas to keep the data flow consistent.
- model: Organizes functions associated with model training and inference, including the PyTorch data loader setup and training loop.
- resources: Stores intermediate assets from the model training process, e.g., data scalers and model checkpoints.
- service: Contains the web framework code that runs the model service.
- test: Holds unit and integration tests.
- utils: Defines helper functions shared across modules, such as date formatting.
By dividing the project into modules, development work becomes more manageable and easier to track. Although the structure above follows common software design principles, it is not the only viable layout; adapt the template to fit your project.
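To illustrate how these modules compose, here is a minimal, hypothetical sketch of main_train.py that uses the adapter and config classes shown later in this article; the site coordinates and the elided training step are placeholders, not the project’s actual code.
# @File: main_train.py (illustrative sketch only; the real entry point may differ)
from adapter.meteo_reader import OpenMeteoReader  # adapter: data access
from config.config_data import DataConfig         # config: parameters


def main():
    reader = OpenMeteoReader()
    # Placeholder site id and coordinates for illustration.
    meteo_df = reader.get_site_history_daily_meteo(
        "10251335", 36.0, -116.5, DataConfig.history_days
    )
    # Model training would follow here, using functions from the model module.


if __name__ == "__main__":
    main()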
2. Automating Data Validation
Data validation ensures that data is in the correct format, reducing the risk of errors and increasing the application’s robustness to change. A practical solution is to define standard, readable schemas for the input and output data of key pipeline components. In Python, validating various data types is accelerated by three widely used libraries:
- pydantic: Validates any kind of data.
- pandera: Validates tabular data.
- jsonschema: Validates JSON data.

Example
The streamflow forecasting application requires specific formats for training data and for the parameters of web requests.
- Validating training data format
A data schema defined with pandera automatically checks the format of input data. For instance, meteorological records from various sources must conform to the MeteoModel schema before being fed into models.
# @File: data_schema.py
from pandera import DataFrameModel
import pandera as pa


class MeteoModel(DataFrameModel):
    '''Meteorological data schema'''
    id: str
    time: str
    temperature_max: float = pa.Field(nullable=True)
    temperature_min: float = pa.Field(nullable=True)
    precipitation: float = pa.Field(nullable=True)
    evapotranspiration: float = pa.Field(nullable=True)
    ......
The schema can be used to validate the data format directly by calling MeteoModel.validate:
import pandas as pd

meteo_df = pd.DataFrame({
    'id': ['10251335', '10251335', '10251335', '10251335'],
    'time': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    'temperature_max': [10.0, 11.0, 12.0, 13.0],
    'temperature_min': [5.0, 6.0, 7.0, 8.0],
    'precipitation': [0.0, 0.1, 0.2, 0.3],
    'evapotranspiration': [0.0, 0.1, 0.2, 0.3]
})

try:
    MeteoModel.validate(meteo_df)
except Exception as e:
    print(e)
- Validating request parameters
Web request parameters should be in JSON format, which is the standard for web APIs. The streamflow forecasting service takes site_id and forecast_days as input parameters:
# @File: query_schema.py
ForecastRequestSchema = {
    "type": "object",
    "properties": {
        "site_id": {"type": "string"},
        "forecast_days": {"type": "integer", "maximum": 5, "minimum": 1},
    },
    "required": ["site_id", "forecast_days"]
}
......
With the predefined schema, we can catch errors caused by invalid request parameters:
from jsonschema import validate

request_data = {
    "site_id": "USGS-12345678",
    "forecast_days": 3
}

try:
    validate(instance=request_data, schema=ForecastRequestSchema)
except Exception as e:
    print(e)
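For completeness, the same request could also be validated with pydantic, which is listed above but not used in this project. The sketch below is illustrative only, and ForecastRequest is a hypothetical model rather than part of the application code:
# Hypothetical pydantic equivalent of ForecastRequestSchema (illustration only).
from pydantic import BaseModel, Field, ValidationError


class ForecastRequest(BaseModel):
    site_id: str
    forecast_days: int = Field(ge=1, le=5)


try:
    ForecastRequest(site_id="USGS-12345678", forecast_days=3)
except ValidationError as e:
    print(e)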
Explore more
3. Abstraction
We often write repeated code in data science projects for logic such as processing data from various sources. To avoid repeating yourself, keep the main logic of your pipeline in one place and make it reusable. In Python, abstract classes from the abc library let you write generic code that works in different situations, which simplifies future extensions.
Example
To handle different sources of meteorological data in the streamflow forecasting application, the AbstractMeteoReader is defined as the template for concrete data readers.
# @File: abstract_reader.py
from domain.data_schema import MeteoModel
from abc import ABC, abstractmethod
from pandera.typing import DataFrame


class AbstractMeteoReader(ABC):
    '''Abstract class for the reader of meteorological data'''

    @abstractmethod
    def __init__(self, **kwargs):
        pass

    @abstractmethod
    def get_site_history_daily_meteo(self, site_id: str, lat: float, lon: float, history_days: int) -> DataFrame[MeteoModel]:
        pass

    @abstractmethod
    def get_site_forecast_daily_meteo(self, site_id: str, lat: float, lon: float, forecast_days: int) -> DataFrame[MeteoModel]:
        pass

    ......
All concrete readers should inherit from the abstract class and implement the common functions get_site_history_daily_meteo and get_site_forecast_daily_meteo. Here is the reader code for Open-Meteo weather data:
# @File: meteo_reader.py
from domain.data_schema import MeteoModel
from config.config_data import OpenMeteoDataConfig
from adapter.abstract_reader import AbstractMeteoReader
from pandera.typing import DataFrame


class OpenMeteoReader(AbstractMeteoReader):
    '''Data reader for Open-Meteo meteorological data'''

    def __init__(self):
        self.config = OpenMeteoDataConfig

    def get_site_history_daily_meteo(self, site_id: str, latitude: float, longitude: float, history_days: int) -> DataFrame[MeteoModel]:
        ......

    def get_site_forecast_daily_meteo(self, site_id: str, latitude: float, longitude: float, forecast_days: int) -> DataFrame[MeteoModel]:
        ......


class XXXXMeteoReader(AbstractMeteoReader):
    '''Data reader for another meteorological data source'''
    ......
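One benefit of this abstraction is that the rest of the pipeline can stay agnostic about which reader it uses. Below is a hypothetical factory sketch (not from the original repository) that selects a concrete reader based on the configured data source:
# Illustrative sketch: pick a concrete reader at runtime from configuration.
from adapter.abstract_reader import AbstractMeteoReader
from adapter.meteo_reader import OpenMeteoReader
from config.config_data import DataConfig

_READERS = {"open-meteo": OpenMeteoReader}  # additional readers can be registered here


def build_meteo_reader() -> AbstractMeteoReader:
    # Downstream code only depends on the AbstractMeteoReader interface.
    return _READERS[DataConfig.weather_data]()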
Explore more
From Data Science to Production: Abstract Classes for Model Deployment
4. Configuration Management
Configuration management adapts a data science application to different environments and use cases. Configurable parameters include data reading options, model hyperparameters, web service ports, and more. A recommended practice is to separate configuration from code rather than hard-coding values inside functions. In Python, configurations can be defined in INI files, YAML files, or Python classes.

Example
The streamflow forecasting application is expected to work with different combinations of streamflow and meteorological data sources. We expose the data-source options to end users through the config.yaml file, making the forecast applicable to multiple predefined regions.
# @File: config.yaml
data:
  flow_data: usgs
  meteo_data: open-meteo
service:
  port: 8888
The YAML file specifies clear, readable parameters that can be modified by end users and loaded at each deployment. Meanwhile, configurations that end users should not modify are defined in internal Python classes, such as OpenMeteoDataConfig.
# @File: config_data.py
import yaml

with open('config.yaml', 'r') as file:
    BaseConfig = yaml.safe_load(file)


class DataConfig:
    flow_data = BaseConfig['data']['flow_data']
    weather_data = BaseConfig['data']['meteo_data']
    history_days = 365 * 30


class OpenMeteoDataConfig:
    lag_days = 4
    varnames = ["temperature_2m_max", "temperature_2m_min", "precipitation_sum", "et0_fao_evapotranspiration"]
    ......
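The service section of config.yaml is consumed the same way. The main_service.py code in Section 5 references ServiceConfig.port; a plausible config_service.py might look like the sketch below, which is an assumption since that file is not shown in this article:
# @File: config_service.py (illustrative sketch; the real file may differ)
import yaml

with open('config.yaml', 'r') as file:
    BaseConfig = yaml.safe_load(file)


class ServiceConfig:
    port = BaseConfig['service']['port']  # 8888 in the sample config.yaml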
Explore more
From Data Science to Production: Configuration Management for ML Code
5. Building Service APIs
To integrate models into software or applications, a popular approach is to create a standard web service interface, known as an Application Programming Interface (API). An API exposes a set of URLs that are accessible to end users or other engineers. The most popular web framework libraries in Python include Flask, Tornado, and Django; they offer high-performance backends for developing web services.

Example
The streamflow forecasting application uses Tornado to create a web service. A Tornado web application consists of three parts: tornado.web.RequestHandler objects that execute your inference pipeline code, a tornado.web.Application object that routes requests to the corresponding handlers, and a main function that runs the server. We’ve defined three handlers:
- InfoHandler: Queries site information from the local database.
- ForecastHandler: Feeds data into the forecasting model and generates predictions.
- HealthHandler: Validates service connections.
# @File: main_service.py
from service.info_service import InfoService
from service.forecast_service import ForecastService
from config.config_service import ServiceConfig
import asyncio
import tornado
from tornado.concurrent import run_on_executor
import concurrent


class BaseHandler(tornado.web.RequestHandler):
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=10)

    @run_on_executor
    def _process_get(self, service):
        query = self.request.arguments
        for key, value in query.items():
            query[key] = str(value[0].decode('utf-8'))
        print(query)
        response = service.execute(query)
        return response


class HealthHandler(BaseHandler):
    async def get(self):
        self.write("OK")


class InfoHandler(BaseHandler):
    async def get(self):
        service = InfoService()
        response = await self._process_get(service)
        self.write(response)


class ForecastHandler(BaseHandler):
    async def get(self):
        service = ForecastService()
        response = await self._process_get(service)
        self.write(response)


class Application(tornado.web.Application):
    _routes = [
        tornado.web.url(r"/healthCheck", HealthHandler),
        tornado.web.url(r"/info", InfoHandler),
        tornado.web.url(r"/forecast", ForecastHandler)
    ]

    def __init__(self):
        super(Application, self).__init__(self._routes)


async def main():
    app = Application()
    app.listen(ServiceConfig.port)
    await asyncio.Event().wait()


if __name__ == "__main__":
    asyncio.run(main())
Running the code on a server exposes three URLs for external requests: http://<ip>:<port>/healthCheck, http://<ip>:<port>/info, and http://<ip>:<port>/forecast.
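As a quick sanity check, the forecast endpoint can be queried with the validated parameters from Section 2. The snippet below is an illustrative client call using the requests library, assuming the service runs locally on the configured port 8888:
# Illustrative client-side check (not part of the service code).
import requests

response = requests.get(
    "http://localhost:8888/forecast",
    params={"site_id": "USGS-12345678", "forecast_days": 3},
)
print(response.status_code, response.text)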
Explore more
From Data Science to Production: Building a Model Service Using Tornado
6. Simplifying API Documentation with Swagger
API documentation simplifies the integration of your models by other developers. The OpenAPI specification provides a standard description format for web APIs, but manually writing detailed documentation can be tedious. Fortunately, the Swagger UI tool can generate API documentation directly from the application code. Powerful features of Swagger UI include:
- Automatic documentation
- Interactive interface
- Well-defined structure & schema

Example
The tornado_swagger Python library works with the Tornado web framework to automate API documentation generation. It creates a user-friendly UI page (http://<ip>:<port>/doc) that is deployed alongside the web service. With the help of tornado_swagger, the Swagger UI for the streamflow forecasting application displays three API methods:
- Health check API: Validates service connectivity.
- Info API: Lists river sites for forecasting.
- Forecast API: Predicts future streamflow for a given gauge station.
and two response data structures:
- InfoResponse: The data schema returned by the Info API.
- ForecastResponse: The data schema returned by the Forecast API.

Details of the API methods and data schemas are specified through annotations added to the application code, which tornado_swagger parses to generate the documentation UI. In addition, the API documentation supports direct interaction, which makes understanding and testing easier.
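As an illustration, such an annotation typically takes the form of a YAML-style docstring on the handler method. The sketch below is an assumption about what the ForecastHandler annotation could look like, not the project’s actual code; consult the tornado_swagger documentation for the exact fields it supports.
# @File: main_service.py (excerpt) -- hypothetical annotation on ForecastHandler.
class ForecastHandler(BaseHandler):
    async def get(self):
        """
        ---
        tags: [Forecast]
        summary: Predict future streamflow for a given gauge station
        parameters:
          - name: site_id
            in: query
            required: true
            type: string
          - name: forecast_days
            in: query
            required: true
            type: integer
        responses:
          200:
            description: ForecastResponse
        """
        service = ForecastService()
        response = await self._process_get(service)
        self.write(response)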

Explore more
From Data Science to Production: Generating API Documentation with Swagger
7. Cloud Deployment: Docker and AWS Fargate
The final step is to deploy the model in a cloud environment. Data scientists, who focus on model building and refinement, want to spend as little effort as possible on deployment infrastructure and scalability. Launching a containerized application on a serverless cloud platform is a popular approach that involves two tools:
- Docker: Creates lightweight and portable containerized applications.
- AWS Fargate: Runs containerized workloads in a serverless cloud environment.
Example

Key steps in the deployment process include:
- Building a Docker image
- Uploading the image to a remote repository
- Creating a cluster in Amazon Elastic Container Service (ECS)
- Creating a task definition in ECS
- Creating a service, specified as a Fargate instance, to run the predefined task
The forecast-service has been deployed in our TestCluster on ECS. AWS Fargate automatically orchestrates the service’s launch and scales the number of tasks based on real-time resource demand.

Explore more
From Data Science to Production: Streamlining Model Deployment in Cloud Environment
Summary
Turning a data science project into a production-ready application requires good design, robust code, and efficient deployment. This article has guided you through that journey, from code organization to cloud deployment, with each section introducing core ideas and the corresponding tools. Following these practices, entry-level data scientists can develop robust, extensible, and scalable applications and make their models accessible to users quickly.
More details about each topic can be found in the extended reading linked at the end of each section. I am also excited to hear about any additional skills you encounter along the journey.