
Back in my college days, my role in data science projects was like that of an alchemist: experimenting with fancy AI models to uncover the relationships among variables in data from my field. Powerful AI algorithms consistently amazed me by outperforming traditional statistical methods and physics-based models. The real challenge, however, began when I became an AI engineer in industry in 2022. From then on, the data science technology stack expanded rapidly into fields I was unfamiliar with. My first task in industry was to ship a model to the production environment, with requirements for reliability, maintainability, and scalability. Looking back at my struggles, I realize that transforming AI models from prototypes into production-ready applications comes down to a combination of:
- Good design patterns
- Robust code
- Efficient deployment strategies
This article is a comprehensive guide that distills seven key topics from my earlier sub-articles. Each topic explores one aspect of developing and deploying data science projects at an industry level:
- Code modularization
- Data validation
- Abstraction
- Configuration management
- Web service
- API documentation
- Docker and cloud
Using a streamflow forecasting application as a case study, this article dives into each topic with core concepts and demos, equipping entry-level data scientists with practical tools to advance their careers. Let’s start the journey of AI engineering!
1. Organizing Code into Modules
Modularization divides a program into smaller, independent modules. Modular code is easier to maintain and debug, because errors can be isolated and fixed within specific modules. It also improves extensibility: when adding a feature, you only need to modify the code in the relevant modules. Moreover, splitting the code into modules lets multiple developers work on different parts of the project simultaneously.
Example
Here is the directory layout for the code of our streamflow forecasting application:
.
├── .gitignore
├── config.yaml
├── Dockerfile
├── LICENSE
├── main_service.py
├── main_train.py
├── README.md
├── requirements.txt
├── adapter
├── config
├── domain
├── model
├── resources
├── service
├── test
└── utils
Our streamflow forecasting application can be organized into the following code modules:
- adapter: Reads data from various sources and converts it into the formats required for model training and inference.
- config: Specifies configurable parameters for different components of the application pipeline, e.g., data reading, model training, and service deployment.
- domain: Defines data schemas to keep the data flow consistent.
- model: Organizes functions associated with model training and inference, including the PyTorch data loader setup and training loop.
- resources: Stores intermediate assets from the model training process, e.g., data scalers and model checkpoints.
- service: Contains the web framework code that runs the model service.
- test: Holds unit and integration tests.
- utils: Defines helper functions shared across modules, such as date formatting.
By dividing the project into modules, development work becomes more manageable and easier to track. Although the structure above follows common software design principles, it is not the only viable layout; adapt the template to fit your project.
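To illustrate how these modules compose, here is a minimal, hypothetical sketch of main_train.py that uses the adapter and config classes shown later in this article; the site coordinates and the elided training step are placeholders, not the project’s actual code.
# @File: main_train.py (illustrative sketch only; the real entry point may differ)
from adapter.meteo_reader import OpenMeteoReader  # adapter: data access
from config.config_data import DataConfig         # config: parameters


def main():
    reader = OpenMeteoReader()
    # Placeholder site id and coordinates for illustration.
    meteo_df = reader.get_site_history_daily_meteo(
        "10251335", 36.0, -116.5, DataConfig.history_days
    )
    # Model training would follow here, using functions from the model module.


if __name__ == "__main__":
    main()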
2. Automating Data Validation
Data validation ensures that data is in the correct format, reducing the risk of errors and increasing the application’s robustness to change. A practical solution is to define standard, readable schemas for the input and output data of key pipeline components. In Python, validating various data types is accelerated by three widely used libraries:
- pydantic: Validates any kind of data.
- pandera: Validates tabular data.
- jsonschema: Validates JSON data.

Example
The streamflow forecasting application requires specific formats for training data and for the parameters of web requests.
- Validating training data format
A data schema defined with pandera automatically checks the format of input data. For instance, meteorological records from various sources must conform to the MeteoModel schema before being fed into models.
# @File: data_schema.py
from pandera import DataFrameModel
import pandera as pa


class MeteoModel(DataFrameModel):
    '''Meteorological data schema'''
    id: str
    time: str
    temperature_max: float = pa.Field(nullable=True)
    temperature_min: float = pa.Field(nullable=True)
    precipitation: float = pa.Field(nullable=True)
    evapotranspiration: float = pa.Field(nullable=True)
    ......
The schema can be used to validate the data format directly by calling MeteoModel.validate:
import pandas as pd

meteo_df = pd.DataFrame({
    'id': ['10251335', '10251335', '10251335', '10251335'],
    'time': ['2021-01-01', '2021-01-02', '2021-01-03', '2021-01-04'],
    'temperature_max': [10.0, 11.0, 12.0, 13.0],
    'temperature_min': [5.0, 6.0, 7.0, 8.0],
    'precipitation': [0.0, 0.1, 0.2, 0.3],
    'evapotranspiration': [0.0, 0.1, 0.2, 0.3]
})

try:
    MeteoModel.validate(meteo_df)
except Exception as e:
    print(e)
- Validating request parameters
Web request parameters should be in JSON format, which is the standard for web APIs. The streamflow forecasting service takes site_id and forecast_days as input parameters:
# @File: query_schema.py
ForecastRequestSchema = {
    "type": "object",
    "properties": {
        "site_id": {"type": "string"},
        "forecast_days": {"type": "integer", "maximum": 5, "minimum": 1},
    },
    "required": ["site_id", "forecast_days"]
}
......
With the predefined schema, we can catch errors caused by invalid request parameters:
from jsonschema import validate

request_data = {
    "site_id": "USGS-12345678",
    "forecast_days": 3
}

try:
    validate(instance=request_data, schema=ForecastRequestSchema)
except Exception as e:
    print(e)
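For completeness, the same request could also be validated with pydantic, which is listed above but not used in this project. The sketch below is illustrative only, and ForecastRequest is a hypothetical model rather than part of the application code:
# Hypothetical pydantic equivalent of ForecastRequestSchema (illustration only).
from pydantic import BaseModel, Field, ValidationError


class ForecastRequest(BaseModel):
    site_id: str
    forecast_days: int = Field(ge=1, le=5)


try:
    ForecastRequest(site_id="USGS-12345678", forecast_days=3)
except ValidationError as e:
    print(e)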
Explore more
3. Abstraction
We often write repeated code in data science projects for logic such as processing data from various sources. To avoid repeating yourself, keep the main logic of your pipeline in one place and make it reusable. In Python, abstract classes from the abc library let you write generic code that works in different situations, which simplifies future extensions.
Example
To handle different sources of meteorological data in the streamflow forecasting application, the AbstractMeteoReader is defined as the template for concrete data readers.
# @File: abstract_reader.py
from domain.data_schema import MeteoModel
from abc import ABC, abstractmethod
from pandera.typing import DataFrame


class AbstractMeteoReader(ABC):
    '''Abstract class for the reader of meteorological data'''

    @abstractmethod
    def __init__(self, **kwargs):
        pass

    @abstractmethod
    def get_site_history_daily_meteo(self, site_id: str, lat: float, lon: float, history_days: int) -> DataFrame[MeteoModel]:
        pass

    @abstractmethod
    def get_site_forecast_daily_meteo(self, site_id: str, lat: float, lon: float, forecast_days: int) -> DataFrame[MeteoModel]:
        pass

    ......
All concrete readers should inherit from the abstract class and implement the common functions get_site_history_daily_meteo and get_site_forecast_daily_meteo. Here is the reader code for Open-Meteo weather data:
# @File: meteo_reader.py
from domain.data_schema import MeteoModel
from config.config_data import OpenMeteoDataConfig
from adapter.abstract_reader import AbstractMeteoReader
from pandera.typing import DataFrame


class OpenMeteoReader(AbstractMeteoReader):
    '''Data reader for Open-Meteo meteorological data'''

    def __init__(self):
        self.config = OpenMeteoDataConfig

    def get_site_history_daily_meteo(self, site_id: str, latitude: float, longitude: float, history_days: int) -> DataFrame[MeteoModel]:
        ......

    def get_site_forecast_daily_meteo(self, site_id: str, latitude: float, longitude: float, forecast_days: int) -> DataFrame[MeteoModel]:
        ......


class XXXXMeteoReader(AbstractMeteoReader):
    '''Data reader for another meteorological data source'''
    ......
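One benefit of this abstraction is that the rest of the pipeline can stay agnostic about which reader it uses. Below is a hypothetical factory sketch (not from the original repository) that selects a concrete reader based on the configured data source:
# Illustrative sketch: pick a concrete reader at runtime from configuration.
from adapter.abstract_reader import AbstractMeteoReader
from adapter.meteo_reader import OpenMeteoReader
from config.config_data import DataConfig

_READERS = {"open-meteo": OpenMeteoReader}  # additional readers can be registered here


def build_meteo_reader() -> AbstractMeteoReader:
    # Downstream code only depends on the AbstractMeteoReader interface.
    return _READERS[DataConfig.weather_data]()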
Explore more
From Data Science to Production: Abstract Classes for Model Deployment
4. Configuration Management
Configuration management adapts a data science application to different environments and use cases. Configurable parameters include data reading options, model hyperparameters, web service ports, and more. A recommended practice is to separate configuration from code rather than hard-coding values inside functions. In Python, configurations can be defined in INI files, YAML files, or Python classes.

Example
The streamflow forecasting application is expected to work with different combinations of streamflow and meteorological data sources. We expose the data-source options to end users through the config.yaml file, making the forecast applicable to multiple predefined regions.
# @File: config.yaml
data:
  flow_data: usgs
  meteo_data: open-meteo
service:
  port: 8888
The YAML file specifies clear, readable parameters that can be modified by end users and loaded at each deployment. Meanwhile, configurations that end users should not modify are defined in internal Python classes, such as OpenMeteoDataConfig.
# @File: config_data.py
import yaml

with open('config.yaml', 'r') as file:
    BaseConfig = yaml.safe_load(file)


class DataConfig:
    flow_data = BaseConfig['data']['flow_data']
    weather_data = BaseConfig['data']['meteo_data']
    history_days = 365 * 30


class OpenMeteoDataConfig:
    lag_days = 4
    varnames = ["temperature_2m_max", "temperature_2m_min", "precipitation_sum", "et0_fao_evapotranspiration"]
    ......
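The service section of config.yaml is consumed the same way. The main_service.py code in Section 5 references ServiceConfig.port; a plausible config_service.py might look like the sketch below, which is an assumption since that file is not shown in this article:
# @File: config_service.py (illustrative sketch; the real file may differ)
import yaml

with open('config.yaml', 'r') as file:
    BaseConfig = yaml.safe_load(file)


class ServiceConfig:
    port = BaseConfig['service']['port']  # 8888 in the sample config.yaml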
Explore more
From Data Science to Production: Configuration Management for ML Code
5. Building Service APIs
To integrate models into software or applications, a popular approach is to create a standard web service interface, known as an Application Programming Interface (API). An API exposes a set of URLs that are accessible to end users or other engineers. The most popular web framework libraries in Python include Flask, Tornado, and Django; they offer high-performance backends for developing web services.

Example
The streamflow forecasting application uses Tornado to create a web service. A Tornado web application consists of three parts: tornado.web.RequestHandler objects that execute your inference pipeline code, a tornado.web.Application object that routes requests to the corresponding handlers, and a main function that runs the server. We’ve defined three handlers:
- InfoHandler: Queries site information from the local database.
- ForecastHandler: Feeds data into the forecasting model and generates predictions.
- HealthHandler: Validates service connections.
# @File: main_service.py
from service.info_service import InfoService
from service.forecast_service import ForecastService
from config.config_service import ServiceConfig
import asyncio
import tornado
from tornado.concurrent import run_on_executor
import concurrent


class BaseHandler(tornado.web.RequestHandler):
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=10)

    @run_on_executor
    def _process_get(self, service):
        query = self.request.arguments
        for key, value in query.items():
            query[key] = str(value[0].decode('utf-8'))
        print(query)
        response = service.execute(query)
        return response


class HealthHandler(BaseHandler):
    async def get(self):
        self.write("OK")


class InfoHandler(BaseHandler):
    async def get(self):
        service = InfoService()
        response = await self._process_get(service)
        self.write(response)


class ForecastHandler(BaseHandler):
    async def get(self):
        service = ForecastService()
        response = await self._process_get(service)
        self.write(response)


class Application(tornado.web.Application):
    _routes = [
        tornado.web.url(r"/healthCheck", HealthHandler),
        tornado.web.url(r"/info", InfoHandler),
        tornado.web.url(r"/forecast", ForecastHandler)
    ]

    def __init__(self):
        super(Application, self).__init__(self._routes)


async def main():
    app = Application()
    app.listen(ServiceConfig.port)
    await asyncio.Event().wait()


if __name__ == "__main__":
    asyncio.run(main())
Running the code on a server exposes three URLs for external requests: http://<ip>:<port>/healthCheck, http://<ip>:<port>/info, and http://<ip>:<port>/forecast.
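As a quick sanity check, the forecast endpoint can be queried with the validated parameters from Section 2. The snippet below is an illustrative client call using the requests library, assuming the service runs locally on the configured port 8888:
# Illustrative client-side check (not part of the service code).
import requests

response = requests.get(
    "http://localhost:8888/forecast",
    params={"site_id": "USGS-12345678", "forecast_days": 3},
)
print(response.status_code, response.text)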
Explore more
From Data Science to Production: Building a Model Service Using Tornado
6. Simplifying API Documentation with Swagger
API documentation simplifies the integration of your models by other developers. The OpenAPI specification provides a standard description format for web APIs, but manually writing detailed documentation can be tedious. Fortunately, the Swagger UI tool can generate API documentation directly from the application code. Powerful features of Swagger UI include:
- Automatic documentation
- Interactive interface
- Well-defined structure & schema

Example
The tornado_swagger Python library works with the Tornado web framework to automate API documentation generation. It creates a user-friendly UI page (http://<ip>:<port>/doc) that is deployed alongside the web service. With the help of tornado_swagger, the Swagger UI for the streamflow forecasting application displays three API methods:
- Health check API: Validates service connectivity.
- Info API: Lists river sites for forecasting.
- Forecast API: Predicts future streamflow for a given gauge station.
and two response data structures:
- InfoResponse: The data schema returned by the Info API.
- ForecastResponse: The data schema returned by the Forecast API.

Details of the API methods and data schemas are specified through annotations added to the application code, which tornado_swagger parses to generate the documentation UI. In addition, the API documentation supports direct interaction, which makes understanding and testing easier.
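As an illustration, such an annotation typically takes the form of a YAML-style docstring on the handler method. The sketch below is an assumption about what the ForecastHandler annotation could look like, not the project’s actual code; consult the tornado_swagger documentation for the exact fields it supports.
# @File: main_service.py (excerpt) -- hypothetical annotation on ForecastHandler.
class ForecastHandler(BaseHandler):
    async def get(self):
        """
        ---
        tags: [Forecast]
        summary: Predict future streamflow for a given gauge station
        parameters:
          - name: site_id
            in: query
            required: true
            type: string
          - name: forecast_days
            in: query
            required: true
            type: integer
        responses:
          200:
            description: ForecastResponse
        """
        service = ForecastService()
        response = await self._process_get(service)
        self.write(response)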

Explore more
From Data Science to Production: Generating API Documentation with Swagger
7. Cloud Deployment: Docker and AWS Fargate
The final step is to deploy the model in a cloud environment. Data scientists, who focus on model building and refinement, want to spend as little effort as possible on deployment infrastructure and scalability. Launching a containerized application on a serverless cloud platform is a popular approach that involves two tools:
- Docker: Creates lightweight and portable containerized applications.
- AWS Fargate: Runs containerized workloads in a serverless cloud environment.
Example

Key steps in the deployment process include:
- Building a Docker image
- Uploading the image to a remote repository
- Creating a cluster in Amazon Elastic Container Service (ECS)
- Creating a task definition in ECS
- Creating a service, specified as a Fargate instance, to run the predefined task
The forecast-service has been deployed in our TestCluster on ECS. AWS Fargate automatically orchestrates the service’s launch and scales the number of tasks based on real-time resource demand.

Explore more
From Data Science to Production: Streamlining Model Deployment in Cloud Environment
Summary
Turning a data science project into a production-ready application requires good design, robust code, and efficient deployment. This article has guided you through that journey, from code organization to cloud deployment, with each section introducing core ideas and the corresponding tools. Following these practices, entry-level data scientists can develop robust, extensible, and scalable applications and make their models accessible to users quickly.
More details about each topic can be found in the extended reading linked at the end of each section. I am also excited to hear about any additional skills you encounter along the journey.