
7 Tips to Future-Proof Machine Learning Projects

An Introduction to Developing More Collaborative, Reproducible and Reusable ML Code

7 Tips to Future-Proof ML Project (image by author)

There can be a knowledge gap when transitioning from exploratory Machine Learning projects, typical in research and study, to industry-level projects. This is because industry projects generally have three additional goals: collaboration, reproducibility, and reusability, which serve to enhance business continuity, increase efficiency, and reduce cost. Although I am nowhere near finding a perfect solution, I would like to document some tips for transforming exploratory, notebook-based ML code into an industry-ready project designed with more scalability and sustainability in mind.

I have categorized these tips into three key strategies:

  • Improvement 1: Modularization – Break Down Code into Smaller Pieces
  • Improvement 2: Versioning – Data, Code and Model Versioning
  • Improvement 3: Consistency – Consistent Structure and Naming Convention

Improvement 1: Modularization – Break Down Code into Smaller Pieces

Problem Statement

One struggle I have faced is keeping the entire data science project in a single notebook – which is common while learning data science. As you may have experienced, there are repeatable code components in a data science lifecycle; for instance, the same data preprocessing steps are applied to transform both training data and inference data. If not handled properly, this results in different versions of the same function being copied and reused at multiple locations. Not only does this decrease the consistency of the code, but it also makes troubleshooting the entire notebook more challenging.

Bad Example

train_data = train_data.drop(['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am'], axis=1)
numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']
train_data[numeric_cols] = train_data[numeric_cols].fillna(train_data[numeric_cols].mean())
train_data['Month'] = pd.to_datetime(train_data['Date']).dt.month.apply(str)

inference_data = inference_data.drop(['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am'], axis=1)
numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']
inference_data[numeric_cols] = inference_data[numeric_cols].fillna(inference_data[numeric_cols].mean())
inference_data['Month'] = pd.to_datetime(inference_data['Date']).dt.month.apply(str)

Tip 1: Reuse code where possible by creating and importing functions and modules

Good Example 1

import pandas as pd

def data_preparation(data):
    data = data.drop(['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am'], axis=1)
    numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']
    data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())
    data['Month'] = pd.to_datetime(data['Date']).dt.month.apply(str)
    return data

train_preprocessed = data_preparation(train_data)
inference_preprocessed = data_preparation(inference_data)

In this example, we extract the common processing steps into a data_preparation function and apply it to both train_data and inference_data. Breaking down a long script into self-contained components like this makes it easier to unit test and troubleshoot. Additionally, it reduces the risk of inconsistency that arises when we keep multiple copies of the same processing steps and accidentally modify or mistype one of them.
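As a quick illustration, here is a minimal pytest-style check of data_preparation (the sample values are made up for this example, and the test assumes the function defined above is importable):

import pandas as pd

def test_data_preparation():
    # a tiny, made-up sample covering the columns the function expects
    sample = pd.DataFrame({
        'Date': ['2024-01-01'],
        'MinTemp': [None], 'MaxTemp': [25.0], 'Rainfall': [0.0],
        'WindGustSpeed': [30.0], 'WindSpeed9am': [10.0],
        'Evaporation': [1.0], 'Sunshine': [8.0],
        'Cloud3pm': [3.0], 'Cloud9am': [2.0],
    })
    result = data_preparation(sample)
    assert 'Evaporation' not in result.columns  # dropped columns are gone
    assert result['Month'].iloc[0] == '1'       # Month is derived from Date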

Good Example 2

Furthermore, we can store this function in a standalone Python module (i.e. preprocessing.py below) and import the function from this file.

## file preprocessing.py ##
import pandas as pd

def data_preparation(data):
    data = data.drop(['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am'], axis=1)
    numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']
    data[numeric_cols] = data[numeric_cols].fillna(data[numeric_cols].mean())
    data['Month'] = pd.to_datetime(data['Date']).dt.month.apply(str)
    return data

## main script or notebook ##
from preprocessing import data_preparation
train_preprocessed = data_preparation(train_data)
inference_preprocessed = data_preparation(inference_data)

This makes it readily accessible and reusable for applications in other projects or by other team members.

Tip 2: Keep parameters in a separate config file

To further improve upon the script, we can store parameters, e.g. the dropped columns, in a separate configuration file (i.e. parameters.py below) and import them from the module.

Good Example 3

## parameters.py ##
DROP_COLS = ['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am']
NUM_COLS = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']

## preprocessing.py ##
import pandas as pd
from parameters import DROP_COLS, NUM_COLS

def data_preparation(data):
    data = data.drop(DROP_COLS, axis=1)
    data[NUM_COLS] = data[NUM_COLS].fillna(data[NUM_COLS].mean())
    data['Month'] = pd.to_datetime(data['Date']).dt.month.apply(str)
    return data

This is beneficial when the parameters remain constant within one iteration of the ML pipeline but may change as the pipeline evolves over time. While modularizing a simple script like the one above might seem unnecessary, it becomes effective as the script grows more complicated.

This approach is also widely used for storing and parsing model hyperparameters, or for keeping API tokens and password credentials in a secure location without exposing them in the script.
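For instance, a minimal sketch of this idea (the parameter names and values below are illustrative, assuming a scikit-learn model):

## parameters.py ##
MODEL_PARAMS = {'n_estimators': 200, 'max_depth': 10, 'random_state': 42}

## train.py ##
import os
from sklearn.ensemble import RandomForestClassifier
from parameters import MODEL_PARAMS

model = RandomForestClassifier(**MODEL_PARAMS)

# secrets are better read from the environment than hard-coded in scripts
api_token = os.environ.get('API_TOKEN')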

Improvement 2: Versioning – Data, Code and Model Versioning

Problem Statement

Suppose an unexpected trend is identified in the model output, requiring us to revisit the project. However, only a single output file was generated as the end result of the code. Since production data changes over time, regenerating the model output or tracing it back to the source has become nearly impossible. Furthermore, the model cannot be reused for future predictions.

Tip 3: Data Versioning

Industry data and production data are hardly static and may update on a daily basis. Therefore, it is crucial to take a snapshot of the data at the point in time when it is used for model training or prediction. A common practice is to use a timestamp to version the data.

from datetime import datetime
timestamp = datetime.today().strftime('%Y%m%d')
train_data.to_csv(f'train_data_{timestamp}.csv')
## output 
>>> train_data_20240101.csv

There are more elegant services and solutions in the industry. DVC is a good example if you are looking for tools that make the process more streamlined.

It is also important to save data snapshots throughout the project lifecycle, for example: raw data, processed data, train data, validation data, test data and inference data. This reduces the need to rerun the code from scratch each time. Besides, if data drift is detected in the final model output, keeping a record of the intermediate steps helps to identify where the changes occurred.
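One lightweight way to do this is a small helper that writes each intermediate dataset to a stage-specific folder (the save_snapshot name and folder layout below are illustrative, not a fixed standard):

from datetime import datetime
from pathlib import Path

def save_snapshot(df, stage, data_dir='data'):
    """Save a timestamped CSV snapshot of a DataFrame for a pipeline stage."""
    timestamp = datetime.today().strftime('%Y%m%d')
    path = Path(data_dir) / stage
    path.mkdir(parents=True, exist_ok=True)
    df.to_csv(path / f'{stage}_{timestamp}.csv', index=False)

save_snapshot(train_data, 'raw')
save_snapshot(train_preprocessed, 'processed')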

Tip 4: Model Versioning

Depending on the training data, preprocessing pipeline and hyperparameters, models developed from the same algorithm can vary significantly from each other, so it is essential to keep track of different model configurations during the experimentation phase. Models also carry a certain level of randomness: even when trained on the same dataset with the same process, the output can differ. This extends beyond machine learning or deep learning models – PCA and other data transformations that are fitted on training data can also introduce randomness, which means setting a random seed is important to mitigate variation in the output.
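As a minimal illustration (assuming scikit-learn; the parameter values here are arbitrary), fixing seeds explicitly helps keep runs repeatable:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

np.random.seed(42)  # fix NumPy's global random seed

# pass an explicit random_state to every component that accepts one
pca = PCA(n_components=5, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)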

While mastering model experiment tracking is a long journey, the first thing you can do is save the model to the working directory. There are multiple ways to save a trained model in Python; for example, we can use the pickle library.

import pickle

model_filename = 'model.pkl'
with open(model_filename, 'wb') as f:
    pickle.dump(model, f)

You may want to choose a more descriptive name for your filename, and it is always helpful to provide a brief description that explains the model variant.
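For example, a minimal sketch of this idea (the filename pattern and metadata fields are just one possible choice, and model is assumed to be a trained model from earlier steps):

import json
import pickle
from datetime import datetime

timestamp = datetime.today().strftime('%Y%m%d')
model_filename = f'rain_rf_{timestamp}.pkl'  # hypothetical descriptive name

with open(model_filename, 'wb') as f:
    pickle.dump(model, f)

# keep a short description of the model variant next to the model file
metadata = {
    'algorithm': 'RandomForestClassifier',
    'train_data': f'train_data_{timestamp}.csv',
    'description': 'baseline model with mean imputation',
}
with open(model_filename.replace('.pkl', '.json'), 'w') as f:
    json.dump(metadata, f, indent=2)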

To load the model:

with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

Tip 5: Code Versioning

The third versioning recommendation is to save the queries used to generate any output data, e.g. the SQL script for extracting the raw data. When executing batch inference, save the script with a precise date instead of a relative date. This keeps a record of the time snapshot for future reference.

-- use relative date
SELECT ID, Date, MinTemp, MaxTemp
FROM dim_temperature
WHERE Date <= DATEADD(year,-1,GETDATE())

-- use precise date
SELECT ID, Date, MinTemp, MaxTemp
FROM dim_temperature
WHERE Date <= '2024-01-01' AND Date >= '2023-01-01'
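To make this repeatable, one option is to render the query with pinned dates from Python and save the exact text that was executed alongside the output (the file path and variable names below are illustrative, and a queries/ folder is assumed to exist):

from datetime import date

start_date = date(2023, 1, 1)
end_date = date(2024, 1, 1)
query = f"""
SELECT ID, Date, MinTemp, MaxTemp
FROM dim_temperature
WHERE Date <= '{end_date}' AND Date >= '{start_date}'
"""

# save the exact query that was executed for future reference
with open(f'queries/extract_{end_date}.sql', 'w') as f:
    f.write(query)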

Furthermore, Git is undoubtedly an essential code versioning tool when collaborating on a Data Science project with other teammates. In short, it helps to track changes and revert to a previous checkpoint when necessary.

Improvement 3: Consistency – Consistent Structure and Naming Convention

Problem Statement

All the data, files, and scripts are stored in one flat directory. The code is tangled together within one notebook. It becomes difficult to figure out dependencies, and there is a risk of accidentally executing a line of code that overwrites previous output data. Due to the lack of reusability and consistency, every ML project is built from scratch without a standard workflow or structure.

Tip 6: Consistent Directory Structure

As the field of Data Science and Machine Learning has matured over time, consistent frameworks and project lifecycles have gradually developed, such as CRISP-DM and TDSP. Therefore, we can build the project directory structure to adhere to a standard framework. For instance, "cookiecutter data science" provides a logical, reasonably standardized, yet flexible project structure for doing and sharing data science work. Based on its recommended directory structure, I have adapted the reduced version below, which I also allow to evolve over time.

Feel free to develop a structure that best suits your workflow and can be used as a template to design all future projects. In addition to the benefit of consistency, it is a powerful way to organize thoughts and create a high level architecture during the development phase.

├── data
│   ├── output         <- The output data from the model.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── code               <- Source code for use in this project.
│   ├── __init__.py    <- Makes code a Python package
│   │
│   ├── data           <- Scripts to generate and process data
│   │   ├── data_preparation.py
│   │   └── data_preprocessing.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── inference_model.py
│   │   └── train_model.py
│   │
│   └── analysis       <- Scripts to create exploratory and results oriented visualizations
│       └── analysis.py
│

Tip 7: Consistent Naming Convention

Another way to introduce more consistency and reduce friction in team collaboration is to keep a naming convention for your models, data and code. There isn't a single best practice; it's about finding a method that suits your specific use cases. You may derive some insights from the HuggingFace or Kaggle model hubs, for example <model-name>-<parameters>-<model-version> or <model-name>-<data-version>-<use-case>.
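As a tiny sketch, such a convention can even be encoded as a helper so names stay uniform across the team (the pattern below is hypothetical):

def model_filename(model_name, data_version, use_case):
    """Compose a file name following <model-name>-<data-version>-<use-case>."""
    return f'{model_name}-{data_version}-{use_case}.pkl'

model_filename('xgboost', '20240101', 'rain-prediction')
## output
>>> 'xgboost-20240101-rain-prediction.pkl'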

And of course, documentation is always welcome to add extra details behind the names. However, it is often easier said than done: we may stick to a convention for the first few files and then completely forget about maintaining it. One tip I've learned is to create a template file with the naming convention and save it in the working directory. It then serves both as a reminder and as a file that can easily be duplicated when creating new files.


Thanks for reaching the end. If you would like to read more of my articles on Medium, I would really appreciate your support by signing up for a Medium membership.


Take Home Message

This article discussed how to future-proof machine learning projects with three key improvements: modularization, versioning, and consistency.

Improvement 1: Modularization

  • Tip 1: Reuse code where possible by creating and importing functions and modules
  • Tip 2: Keep parameters in a separate config file

Improvement 2: Versioning

  • Tip 3: Data versioning
  • Tip 4: Model versioning
  • Tip 5: Code versioning

Improvement 3: Consistency

  • Tip 6: Consistent directory structure
  • Tip 7: Consistent naming convention


Originally published at https://www.visual-design.net on Feb 24th, 2024.

