How to Spot and Prevent Model Drift Before it Impacts Your Business
https://towardsdatascience.com/how-to-spot-and-prevent-model-drift-before-it-impacts-your-business/
3 essential methods to track model drift you should know

Despite the AI hype, many tech companies still rely heavily on machine learning to power critical applications, from personalized recommendations to fraud detection. 

I’ve seen firsthand how undetected drifts can result in significant costs — missed fraud detection, lost revenue, and suboptimal business outcomes, just to name a few. So, it’s crucial to have robust monitoring in place if your company has deployed or plans to deploy machine learning models into production.

Undetected Model Drift can lead to significant financial losses, operational inefficiencies, and even damage to a company’s reputation. To mitigate these risks, it’s important to have effective model monitoring, which involves:

  • Tracking model performance
  • Monitoring feature distributions
  • Detecting both univariate and multivariate drifts

A well-implemented monitoring system can help identify issues early, saving considerable time, money, and resources.

In this comprehensive guide, I’ll provide a framework on how to think about and implement effective model monitoring, helping you stay ahead of potential issues and ensure the stability and reliability of your models in production.

What’s the difference between feature drift and score drift?

Score drift refers to a gradual change in the distribution of model scores. If left unchecked, this could lead to a decline in model performance, making the model less accurate over time.

On the other hand, feature drift occurs when one or more features experience changes in the distribution. These changes in feature values can affect the underlying relationships that the model has learned, and ultimately lead to inaccurate model predictions.

Simulating score shifts

To model real-world fraud detection challenges, I created a synthetic dataset with five financial transaction features.

The reference dataset represents the original distribution, while the production dataset introduces shifts to simulate an increase in high-value transactions without PIN verification on newer accounts, indicating an increase in fraud.

Each feature has different underlying distributions:

  • Transaction Amount: Log-normal distribution (right-skewed with a long tail)
  • Account Age (months): Clipped normal distribution between 0 and 60 (assuming a 5-year-old company)
  • Time Since Last Transaction: Exponential distribution
  • Transaction Count: Poisson distribution
  • Entered PIN: Binomial distribution.

To approximate model scores, I randomly assigned weights to these features and applied a sigmoid function to constrain predictions between 0 and 1. This mimics how a logistic regression fraud model generates risk scores.
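
To make this concrete, here is a rough sketch of how such a reference dataset and its scores could be simulated. The column names, distribution parameters, and weights below are illustrative assumptions, not the exact values behind the plots in this article:

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 10_000

# Reference dataset with the feature distributions described above (parameters are assumptions)
ref_data = pd.DataFrame({
    "transaction_amount": rng.lognormal(mean=4, sigma=1, size=n),
    "account_age_in_months": np.clip(rng.normal(30, 12, size=n), 0, 60),
    "time_since_last_transaction": rng.exponential(scale=5, size=n),
    "transaction_count": rng.poisson(lam=3, size=n),
    "entered_pin": rng.binomial(n=1, p=0.9, size=n),
})

# Approximate model scores: random feature weights passed through a sigmoid,
# mimicking how a logistic regression fraud model generates risk scores
weights = rng.normal(size=ref_data.shape[1])
linear = ref_data.to_numpy() @ weights
ref_data["model_score"] = 1 / (1 + np.exp(-(linear - linear.mean()) / linear.std()))

# prod_data would be generated the same way, with shifted parameters (not shown)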

As shown in the plot below:

  • Drifted features: Transaction Amount, Account Age, Transaction Count, and Entered PIN all experienced shifts in distribution, scale, or relationships.
Distribution of drifted features (image by author)
  • Stable feature: Time Since Last Transaction remained unchanged.
Distribution of stable feature (image by author)
  • Drifted scores: As a result of the drifted features, the distribution in model scores has also changed.
Distribution of model scores (image by author)

This setup allows us to analyze how feature drift impacts model scores in production.

Detecting model score drift using PSI

To monitor model scores, I used the Population Stability Index (PSI) to measure how much the model score distribution has shifted over time.

PSI works by binning continuous model scores and comparing the proportion of scores in each bin between the reference and production datasets. It compares the differences in proportions and their logarithmic ratios to compute a single summary statistic to quantify the drift.

Python implementation:

# Imports
import numpy as np

# Define function to calculate PSI given two datasets
def calculate_psi(reference, production, bins=10):
  # Discretize scores into bins
  min_val, max_val = 0, 1
  bin_edges = np.linspace(min_val, max_val, bins + 1)

  # Calculate proportions in each bin
  ref_counts, _ = np.histogram(reference, bins=bin_edges)
  prod_counts, _ = np.histogram(production, bins=bin_edges)

  ref_proportions = ref_counts / len(reference)
  prod_proportions = prod_counts / len(production)
  
  # Avoid division by zero
  ref_proportions = np.clip(ref_proportions, 1e-8, 1)
  prod_proportions = np.clip(prod_proportions, 1e-8, 1)

  # Calculate PSI for each bin
  psi = np.sum((ref_proportions - prod_proportions) * np.log(ref_proportions / prod_proportions))

  return psi
  
# Calculate PSI
psi_value = calculate_psi(ref_data['model_score'], prod_data['model_score'], bins=10)
print(f"PSI Value: {psi_value}")

Below is a summary of how to interpret PSI values:

  • PSI < 0.1: No drift, or very minor drift (distributions are almost identical).
  • 0.1 ≤ PSI < 0.25: Some drift. The distributions are somewhat different.
  • 0.25 ≤ PSI < 0.5: Moderate drift. A noticeable shift between the reference and production distributions.
  • PSI ≥ 0.5: Significant drift. There is a large shift, indicating that the distribution in production has changed substantially from the reference data.
Histogram of model score distributions (image by author)

The PSI value of 0.6374 suggests a significant drift between our reference and production datasets. This aligns with the histogram of model score distributions, which visually confirms the shift towards higher scores in production — indicating an increase in risky transactions.
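
A small helper (a sketch) can map a PSI value to the buckets above, which is convenient when monitoring many scores or features at once:

def interpret_psi(psi: float) -> str:
    """Map a PSI value to a qualitative drift level."""
    if psi < 0.1:
        return "no or minor drift"
    elif psi < 0.25:
        return "some drift"
    elif psi < 0.5:
        return "moderate drift"
    return "significant drift"

print(interpret_psi(psi_value))  # "significant drift" for a PSI of 0.6374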

Detecting feature drift

Kolmogorov-Smirnov test for numeric features

The Kolmogorov-Smirnov (K-S) test is my preferred method for detecting drift in numeric features, because it is non-parametric, meaning it doesn’t assume a normal distribution.

The test compares a feature’s distribution in the reference and production datasets by measuring the maximum difference between the empirical cumulative distribution functions (ECDFs). The resulting K-S statistic ranges from 0 to 1:

  • 0 indicates no difference between the two distributions.
  • Values closer to 1 suggest a greater shift.

Python implementation:

# Imports
import pandas as pd
from scipy.stats import ks_2samp

# Create an empty dataframe
ks_results = pd.DataFrame(columns=['Feature', 'KS Statistic', 'p-value', 'Drift Detected'])

# Loop through all numeric features (numeric_cols) and perform the K-S test
for col in numeric_cols:
    ks_stat, p_value = ks_2samp(ref_data[col], prod_data[col])
    drift_detected = p_value < 0.05

    # Store results in the dataframe
    ks_results = pd.concat([
        ks_results,
        pd.DataFrame({
            'Feature': [col],
            'KS Statistic': [ks_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

Below are ECDF charts of the four numeric features in our dataset:

ECDFs of four numeric features (image by author)

Let’s look at the account age feature as an example: the x-axis represents account age (0-50 months), while the y-axis shows the ECDF for both reference and production datasets. The production dataset skews towards newer accounts, as it has a larger proportion of observations with lower account ages.
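
For reference, ECDF charts like these can be produced with a few lines of matplotlib (a sketch; styling is omitted):

import matplotlib.pyplot as plt

def plot_ecdf(reference, production, feature_name):
    # Sort values and compute cumulative proportions for each dataset
    for data, label in [(reference, "reference"), (production, "production")]:
        x = np.sort(data)
        y = np.arange(1, len(x) + 1) / len(x)
        plt.plot(x, y, label=label)
    plt.xlabel(feature_name)
    plt.ylabel("ECDF")
    plt.legend()
    plt.show()

plot_ecdf(ref_data["account_age_in_months"], prod_data["account_age_in_months"],
          "account_age_in_months")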

Chi-Square test for categorical features

To detect shifts in categorical and boolean features, I like to use the Chi-Square test.

This test compares the frequency distribution of a categorical feature in the reference and production datasets, and returns two values:

  • Chi-Square statistic: A higher value indicates a greater shift between the reference and production datasets.
  • P-value: A p-value below 0.05 suggests that the difference between the reference and production datasets is statistically significant, indicating potential feature drift.

Python implementation:

# Imports
from scipy.stats import chi2_contingency

# Create empty dataframe with corresponding column names
chi2_results = pd.DataFrame(columns=['Feature', 'Chi-Square Statistic', 'p-value', 'Drift Detected'])

for col in categorical_cols:
    # Get normalized value counts for both reference and production datasets
    ref_counts = ref_data[col].value_counts(normalize=True)
    prod_counts = prod_data[col].value_counts(normalize=True)

    # Ensure all categories are represented in both
    all_categories = set(ref_counts.index).union(set(prod_counts.index))
    ref_counts = ref_counts.reindex(all_categories, fill_value=0)
    prod_counts = prod_counts.reindex(all_categories, fill_value=0)

    # Create contingency table
    contingency_table = np.array([ref_counts * len(ref_data), prod_counts * len(prod_data)])

    # Perform Chi-Square test
    chi2_stat, p_value, _, _ = chi2_contingency(contingency_table)
    drift_detected = p_value < 0.05

    # Store results in chi2_results dataframe
    chi2_results = pd.concat([
        chi2_results,
        pd.DataFrame({
            'Feature': [col],
            'Chi-Square Statistic': [chi2_stat],
            'p-value': [p_value],
            'Drift Detected': [drift_detected]
        })
    ], ignore_index=True)

The Chi-Square statistic of 57.31 with a p-value of 3.72e-14 confirms a large shift in our categorical feature, Entered PIN. This finding aligns with the histogram below, which visually illustrates the shift:

Distribution of categorical feature (image by author)

Detecting multivariate shifts

Spearman Correlation for shifts in pairwise interactions

In addition to monitoring individual feature shifts, it’s important to track shifts in relationships or interactions between features, known as multivariate shifts. Even if the distributions of individual features remain stable, multivariate shifts can signal meaningful differences in the data.

By default, Pandas’ .corr() function calculates Pearson correlation, which only captures linear relationships between variables. However, relationships between features are often non-linear yet still follow a consistent trend.

To capture this, we use Spearman correlation to measure monotonic relationships between features. It captures whether features change together in a consistent direction, even if their relationship isn’t strictly linear.

To assess shifts in feature relationships, we compare:

  • Reference correlation (ref_corr): Captures historical feature relationships in the reference dataset.
  • Production correlation (prod_corr): Captures new feature relationships in production.
  • Absolute difference in correlation: Measures how much feature relationships have shifted between the reference and production datasets. Higher values indicate more significant shifts.

Python implementation:

# Calculate correlation matrices
ref_corr = ref_data.corr(method='spearman')
prod_corr = prod_data.corr(method='spearman')

# Calculate correlation difference
corr_diff = abs(ref_corr - prod_corr)

Example: Change in correlation

Now, let’s look at the correlation between transaction_amount and account_age_in_months:

  • In ref_corr, the correlation is 0.00095, indicating a weak relationship between the two features.
  • In prod_corr, the correlation is -0.0325, indicating a weak negative correlation.
  • Absolute difference in the Spearman correlation is 0.0335, which is a small but noticeable shift.

The absolute difference in correlation indicates a shift in the relationship between transaction_amount and account_age_in_months.

There used to be no relationship between these two features, but the production dataset indicates that there is now a weak negative correlation, meaning that newer accounts have higher transaction amounts. This is spot on!
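
To inspect a specific pair, or to flag every pair whose relationship moved more than a chosen threshold, you can query corr_diff directly. This is a sketch, and the 0.01 threshold is an arbitrary example:

# Shift for a specific pair of features
print(corr_diff.loc["transaction_amount", "account_age_in_months"])

# Flag all feature pairs whose correlation shifted more than the threshold
threshold = 0.01
upper_triangle = np.triu(np.ones(corr_diff.shape), k=1).astype(bool)
shifted_pairs = (
    corr_diff.where(upper_triangle)   # keep each pair only once
    .stack()
    .loc[lambda s: s > threshold]
    .sort_values(ascending=False)
)
print(shifted_pairs)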

Autoencoder for complex, high-dimensional multivariate shifts

In addition to monitoring pairwise interactions, we can also look for shifts across more dimensions in the data.

Autoencoders are powerful tools for detecting high-dimensional multivariate shifts, where multiple features collectively change in ways that may not be apparent from looking at individual feature distributions or pairwise correlations.

An autoencoder is a neural network that learns a compressed representation of data through two components:

  • Encoder: Compresses input data into a lower-dimensional representation.
  • Decoder: Reconstructs the original input from the compressed representation.

To detect shifts, we compare the reconstructed output to the original input and compute the reconstruction loss.

  • Low reconstruction loss → The autoencoder successfully reconstructs the data, meaning the new observations are similar to what it has seen and learned.
  • High reconstruction loss → The production data deviates significantly from the learned patterns, indicating potential drift.

Unlike traditional drift metrics that focus on individual features or pairwise relationships, autoencoders capture complex, non-linear dependencies across multiple variables simultaneously.

Python implementation:

# Imports (Keras via TensorFlow; adjust the import path if you use standalone Keras)
from sklearn.preprocessing import StandardScaler
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Select the features used to train the autoencoder
ref_features = ref_data[numeric_cols + categorical_cols]
prod_features = prod_data[numeric_cols + categorical_cols]

# Normalize the data
scaler = StandardScaler()
ref_scaled = scaler.fit_transform(ref_features)
prod_scaled = scaler.transform(prod_features)

# Split reference data into train and validation
np.random.shuffle(ref_scaled)
train_size = int(0.8 * len(ref_scaled))
train_data = ref_scaled[:train_size]
val_data = ref_scaled[train_size:]

# Build autoencoder
input_dim = ref_features.shape[1]
encoding_dim = 3 
# Input layer
input_layer = Input(shape=(input_dim, ))
# Encoder
encoded = Dense(8, activation="relu")(input_layer)
encoded = Dense(encoding_dim, activation="relu")(encoded)
# Decoder
decoded = Dense(8, activation="relu")(encoded)
decoded = Dense(input_dim, activation="linear")(decoded)
# Autoencoder
autoencoder = Model(input_layer, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# Train autoencoder
history = autoencoder.fit(
    train_data, train_data,
    epochs=50,
    batch_size=64,
    shuffle=True,
    validation_data=(val_data, val_data),
    verbose=0
)

# Calculate reconstruction error
ref_pred = autoencoder.predict(ref_scaled, verbose=0)
prod_pred = autoencoder.predict(prod_scaled, verbose=0)

ref_mse = np.mean(np.power(ref_scaled - ref_pred, 2), axis=1)
prod_mse = np.mean(np.power(prod_scaled - prod_pred, 2), axis=1)

The charts below show the distribution of reconstruction loss between both datasets.

Distribution of reconstruction loss between actuals and predictions (image by author)

The production dataset has a higher mean reconstruction error than the reference dataset, indicating a shift in the overall data. This aligns with the changes introduced in the production dataset, which contains a larger share of newer accounts making high-value transactions.
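
To turn the reconstruction errors into a single drift signal, one option (a sketch) is to compare the two error distributions directly, for example with the same K-S test used earlier:

# Compare mean reconstruction errors
print(f"Reference MSE: {ref_mse.mean():.4f}, Production MSE: {prod_mse.mean():.4f}")

# Apply the K-S test to the reconstruction error distributions
ks_stat, p_value = ks_2samp(ref_mse, prod_mse)
print(f"K-S statistic: {ks_stat:.4f}, p-value: {p_value:.4e}")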

Summarizing

Model monitoring is an essential, yet often overlooked, responsibility for data scientists and machine learning engineers.

All the statistical methods led to the same conclusion, which aligns with the observed shifts in the data: they detected a trend in production towards newer accounts making higher-value transactions. This shift resulted in higher model scores, signaling an increase in potential fraud.

In this post, I covered techniques for detecting drift on three different levels:

  • Model score drift: Using Population Stability Index (PSI)
  • Individual feature drift: Using Kolmogorov-Smirnov test for numeric features and Chi-Square test for categorical features
  • Multivariate drift: Using Spearman correlation for pairwise interactions and autoencoders for high-dimensional, multivariate shifts.

These are just a few of the techniques I rely on for comprehensive monitoring — there are plenty of other equally valid statistical methods that can also detect drift effectively.

Detected shifts often point to underlying issues that warrant further investigation. The root cause could be as serious as a data collection bug, or as minor as a time change like daylight saving time adjustments.

There are also fantastic Python packages, like evidently.ai, that automate many of these comparisons. However, I believe there’s significant value in deeply understanding the statistical techniques behind drift detection, rather than relying solely on these tools.

What’s the model monitoring process like at places you’ve worked?


Want to build your AI skills?

👉🏻 I run the AI Weekender and write weekly blog posts on data science, AI weekend projects, and career advice for data professionals.


LLM + RAG: Creating an AI-Powered File Reader Assistant
https://towardsdatascience.com/llm-rag-creating-an-ai-powered-file-reader-assistant/
How to create a chatbot to answer questions about a file’s content

Introduction

AI is everywhere. 

It is hard not to interact at least once a day with a Large Language Model (LLM). Chatbots are here to stay. They’re in your apps, they help you write better, they compose emails, they read emails…well, they do a lot.

And I don’t think that that is bad. In fact, my opinion is the other way – at least so far. I defend and advocate for the use of AI in our daily lives because, let’s agree, it makes everything much easier.

I don’t have to spend time double-reading a document to find punctuation problems or typos. AI does that for me. I don’t waste time writing that follow-up email every single Monday. AI does that for me. I don’t need to read a huge and boring contract when I have an AI to summarize the main takeaways and action points for me!

These are only some of AI’s great uses. If you’d like to know more use cases of LLMs to make our lives easier, I wrote a whole book about them.

Now, thinking as a data scientist and looking at the technical side, not everything is that bright and shiny. 

LLMs are great for several general use cases that apply to anyone or any company. For example, coding, summarizing, or answering questions about general content created until the training cutoff date. However, when it comes to specific business applications, for a single purpose, or something new that didn’t make the cutoff date, that is when the models won’t be that useful if used out-of-the-box – meaning, they will not know the answer. Thus, it will need adjustments.

Training an LLM can take months and millions of dollars. What is even worse is that if we don’t adjust and tune the model to our purpose, there will be unsatisfactory results or hallucinations (when the model’s response doesn’t make sense given our query).

So what is the solution, then? Spending a lot of money retraining the model to include our data?

Not really. That’s when the Retrieval-Augmented Generation (RAG) becomes useful.

RAG is a framework that combines getting information from an external knowledge base with large language models (LLMs). It helps AI models produce more accurate and relevant responses.

Let’s learn more about RAG next.

What is RAG?

Let me tell you a story to illustrate the concept.

I love movies. There was a time when I knew which movies were competing for Best Picture at the Oscars, and which actors and actresses were nominated. And I would certainly know which ones took home the statue that year. But now I am all rusty on that subject. If you asked me who was competing, I would not know. And even if I tried to answer you, I would give you a weak response.

So, to provide you with a quality response, I will do what everybody else does: search for the information online, obtain it, and then give it to you. What I just did is the same idea as the RAG: I obtained data from an external database to give you an answer.

When we enhance the LLM with a content store where it can go and retrieve data to augment (increase) its knowledge base, that is the RAG framework in action.

RAG is like creating a content store where the model can enhance its knowledge and respond more accurately.

Diagram: User prompts and content using LLM + RAG
User prompt about Content C. LLM retrieves external content to aggregate to the answer. Image by the author.

Summarizing:

  1. Uses search algorithms to query external data sources, such as databases, knowledge bases, and web pages.
  2. Pre-processes the retrieved information.
  3. Incorporates the pre-processed information into the LLM.

Why use RAG?

Now that we know what the RAG framework is let’s understand why we should be using it.

Here are some of the benefits:

  • Enhances factual accuracy by referencing real data.
  • RAG can help LLMs process and consolidate knowledge to create more relevant answers 
  • RAG can help LLMs access additional knowledge bases, such as internal organizational data 
  • RAG can help LLMs create more accurate domain-specific content 
  • RAG can help reduce knowledge gaps and AI hallucination

As previously explained, I like to say that with the RAG framework, we are giving the LLM an internal search engine for the content we want to add to its knowledge base.

Well. All of that is very interesting. But let’s see an application of RAG. We will learn how to create an AI-powered PDF Reader Assistant.

Project

This is an application that allows users to upload a PDF document and ask questions about its content using AI-powered natural language processing (NLP) tools. 

  • The app uses Streamlit as the front end.
  • Langchain, OpenAI’s GPT-4 model, and FAISS (Facebook AI Similarity Search) for document retrieval and question answering in the backend.

Let’s break down the steps for better understanding:

  1. Loading a PDF file and splitting it into chunks of text.
    1. This makes the data optimized for retrieval
  2. Present the chunks to an embedding tool.
    1. Embeddings are numerical vector representations of data used to capture relationships, similarities, and meanings in a way that machines can understand. They are widely used in Natural Language Processing (NLP), recommender systems, and search engines.
  3. Next, we put those chunks of text and embeddings in the same DB for retrieval.
  4. Finally, we make it available to the LLM.

Data preparation

Preparing a content store for the LLM will take some steps, as we just saw. So, let’s start by creating a function that can load a file and split it into text chunks for efficient retrieval.

# Imports
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def load_document(pdf):
    # Load a PDF
    """
    Load a PDF and split it into chunks for efficient retrieval.

    :param pdf: PDF file to load
    :return: List of chunks of text
    """

    loader = PyPDFLoader(pdf)
    docs = loader.load()

    # Instantiate Text Splitter with a chunk size of 500 characters and an overlap of 100 characters so that context is not lost
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    # Split into chunks for efficient retrieval
    chunks = text_splitter.split_documents(docs)

    # Return
    return chunks
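
Used on its own, the function looks something like this (the file name is just a placeholder; each chunk is a LangChain Document object):

chunks = load_document("example.pdf")
print(f"Number of chunks: {len(chunks)}")
print(chunks[0].page_content[:200])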

Next, we will start building our Streamlit app, and we’ll use that function in the next script.

Web application

We will begin importing the necessary modules in Python. Most of those will come from the langchain packages.

FAISS is used for document retrieval; OpenAIEmbeddings transforms the text chunks into numerical vectors used for similarity search; ChatOpenAI is what enables us to interact with the OpenAI API; create_retrieval_chain is what actually performs the RAG step, retrieving the relevant data and augmenting the LLM with it; create_stuff_documents_chain glues the model and the ChatPromptTemplate together.

Note: You will need to generate an OpenAI API key to be able to run this script. If you have just created your account, you get some free credits. But if you have had it for some time, you may have to add 5 dollars in credits to access OpenAI’s API. An alternative is to use Hugging Face embeddings.

# Imports
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
from langchain.chains import create_retrieval_chain
from langchain_openai import ChatOpenAI
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from scripts.secret import OPENAI_KEY
from scripts.document_loader import load_document
import streamlit as st

This first code snippet will create the App title, create a box for file upload, and prepare the file to be added to the load_document() function.

# Create a Streamlit app
st.title("AI-Powered Document Q&A")

# Load document to streamlit
uploaded_file = st.file_uploader("Upload a PDF file", type="pdf")

# If a file is uploaded, create the TextSplitter and vector database
if uploaded_file:

    # Code to work around document loader from Streamlit and make it readable by langchain
    temp_file = "./temp.pdf"
    with open(temp_file, "wb") as file:
        file.write(uploaded_file.getvalue())
        file_name = uploaded_file.name

    # Load document and split it into chunks for efficient retrieval.
    chunks = load_document(temp_file)

    # Message user that document is being processed with time emoji
    st.write("Processing document... :watch:")

Machines understand numbers better than text, so in the end, we will have to provide the model with a database of numbers that it can compare and check for similarity when performing a query. That’s where the embeddings will be useful to create the vector_db, in this next piece of code.

    # Generate embeddings
    # Embeddings are numerical vector representations of data, typically used to capture relationships, similarities,
    # and meanings in a way that machines can understand. They are widely used in Natural Language Processing (NLP),
    # recommender systems, and search engines.
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_KEY,
                                  model="text-embedding-ada-002")

    # Can also use HuggingFaceEmbeddings
    # from langchain_huggingface.embeddings import HuggingFaceEmbeddings
    # embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    # Create vector database containing chunks and embeddings
    vector_db = FAISS.from_documents(chunks, embeddings)

Next, we create a retriever object to navigate in the vector_db.

    # Create a document retriever
    retriever = vector_db.as_retriever()
    llm = ChatOpenAI(model_name="gpt-4o-mini", openai_api_key=OPENAI_KEY)

Then, we will create the system_prompt, which is a set of instructions to the LLM on how to answer, and we will create a prompt template, preparing it to be added to the model once we get the input from the user.

    # Create a system prompt
    # It sets the overall context for the model.
    # It influences tone, style, and focus before user interaction starts.
    # Unlike user inputs, a system prompt is not visible to the end user.

    system_prompt = (
        "You are a helpful assistant. Use the given context to answer the question."
        "If you don't know the answer, say you don't know. "
        "{context}"
    )

    # Create a prompt Template
    prompt = ChatPromptTemplate.from_messages(
        [
            ("system", system_prompt),
            ("human", "{input}"),
        ]
    )

    # Create a chain
    # It creates a StuffDocumentsChain, which takes multiple documents (text data) and "stuffs" them together before passing them to the LLM for processing.

    question_answer_chain = create_stuff_documents_chain(llm, prompt)

Moving on, we create the core of the RAG framework, pasting together the retriever object and the prompt. This object adds relevant documents from a data source (e.g., a vector database) and makes it ready to be processed using an LLM to generate a response.

    # Creates the RAG
    chain = create_retrieval_chain(retriever, question_answer_chain)

Finally, we create the variable question for the user input. If this question box is filled with a query, we pass it to the chain, which calls the LLM to process and return the response, which will be printed on the app’s screen.

    # Streamlit input for question
    question = st.text_input("Ask a question about the document:")
    if question:
        # Answer
        response = chain.invoke({"input": question})['answer']
        st.write(response)
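
To try the app locally, save the script (the file name app.py below is an assumption) and launch it with Streamlit’s CLI:

streamlit run app.py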

Here is a screenshot of the result.

Screenshot of the AI-Powered Document Q&A
Screenshot of the final app. Image by the author.

And this is a GIF for you to see the File Reader AI Assistant in action!

GIF of the File Reader AI Assistant in action
File Reader AI Assistant in action. Image by the author.

Before you go

In this project, we learned what the RAG framework is and how it helps the LLM perform better, including on domain-specific knowledge.

AI can be powered with knowledge from an instruction manual, a company’s databases, finance files, or contracts, and then respond accurately to domain-specific queries. The knowledge base is augmented with a content store.

To recap, this is how the framework works:

1️⃣ User Query → Input text is received.

2️⃣ Retrieve Relevant Documents → Searches a knowledge base (e.g., a database, vector store).

3️⃣ Augment Context → Retrieved documents are added to the input.

4️⃣ Generate Response → An LLM processes the combined input and produces an answer.

GitHub repository

https://github.com/gurezende/Basic-Rag

About me

If you liked this content and want to learn more about my work, here is my website, where you can also find all my contacts.

https://gustavorsantos.me

References

https://cloud.google.com/use-cases/retrieval-augmented-generation

https://www.ibm.com/think/topics/retrieval-augmented-generation

https://youtu.be/T-D1OfcDW1M?si=G0UWfH5-wZnMu0nw

https://python.langchain.com/docs/introduction

https://www.geeksforgeeks.org/how-to-get-your-own-openai-api-key

Data Science: From School to Work, Part II
https://towardsdatascience.com/data-science-from-school-to-work-part-ii/
How to write clean Python code

In my previous article, I highlighted the importance of effective project management in Python development. Now, let’s shift our focus to the code itself and explore how to write clean, maintainable code — an essential practice in professional and collaborative environments. 

  • Readability & Maintainability: Well-structured code is easier to read, understand, and modify. Other developers — or even your future self — can quickly grasp the logic without struggling to decipher messy code.
  • Debugging & Troubleshooting: Organized code with clear variable names and structured functions makes it easier to identify and fix bugs efficiently.
  • Scalability & Reusability: Modular, well-organized code can be reused across different projects, allowing for seamless scaling without disrupting existing functionality.

So, as you work on your next Python project, remember: 

Half of good code is Clean Code.


Introduction

Python is one of the most popular and versatile programming languages, appreciated for its simplicity, readability and large community. Whether for web development, data analysis, artificial intelligence or task automation, Python offers powerful and flexible tools that are suitable for a wide range of areas.

However, the efficiency and maintainability of a Python project depend heavily on the practices used by the developers. Poor code structure, a lack of conventions, or missing documentation can quickly turn a promising project into a codebase that is hard to maintain and extend. It is precisely this point that makes the difference between student code and professional code.

This article is intended to present the most important best practices for writing high-quality Python code. By following these recommendations, developers can create scripts and applications that are not only functional, but also readable, performant and easily maintainable by third parties.

Adopting these best practices right from the start of a project not only ensures better collaboration within teams, but also prepares your code to evolve with future needs. Whether you’re a beginner or an experienced developer, this guide is designed to support you in all your Python developments.


The code structuration

Good code structuring in Python is essential. There are two main project layouts: flat layout and src layout.

The flat layout places the source code directly in the project root without an additional folder. This approach simplifies the structure and is well-suited for small scripts, quick prototypes, and projects that do not require complex packaging. However, it may lead to unintended import issues when running tests or scripts.

📂 my_project/
├── 📂 my_project/                  # Directly in the root
│   ├── 🐍 __init__.py
│   ├── 🐍 main.py                   # Main entry point (if needed)
│   ├── 🐍 module1.py             # Example module
│   └── 🐍 utils.py
├── 📂 tests/                            # Unit tests
│   ├── 🐍 test_module1.py
│   ├── 🐍 test_utils.py
│   └── ...
├── 📄 .gitignore                      # Git ignored files
├── 📄 pyproject.toml              # Project configuration (Poetry, setuptools)
├── 📄 uv.lock                         # UV file
├── 📄 README.md               # Main project documentation
├── 📄 LICENSE                     # Project license
├── 📄 Makefile                       # Automates common tasks
├── 📄 DockerFile                   # To create Docker image
├── 📂 .github/                        # GitHub Actions workflows (CI/CD)
│   ├── 📂 actions/               
│   └── 📂 workflows/

On the other hand, the src layout (src is the contraction of source) organizes the source code inside a dedicated src/ directory, preventing accidental imports from the working directory and ensuring a clear separation between source files and other project components like tests or configuration files. This layout is ideal for large projects, libraries, and production-ready applications as it enforces proper package installation and avoids import conflicts.

📂 my-project/
├── 📂 src/                              # Main source code
│   ├── 📂 my_project/            # Main package
│   │   ├── 🐍 __init__.py        # Makes the folder a package
│   │   ├── 🐍 main.py             # Main entry point (if needed)
│   │   ├── 🐍 module1.py       # Example module
│   │   └── ...
│   │   ├── 📂 utils/                  # Utility functions
│   │   │   ├── 🐍 __init__.py     
│   │   │   ├── 🐍 data_utils.py  # data functions
│   │   │   ├── 🐍 io_utils.py      # Input/output functions
│   │   │   └── ...
├── 📂 tests/                             # Unit tests
│   ├── 🐍 test_module1.py     
│   ├── 🐍 test_module2.py     
│   ├── 🐍 conftest.py              # Pytest configurations
│   └── ...
├── 📂 docs/                            # Documentation
│   ├── 📄 index.md                
│   ├── 📄 architecture.md         
│   ├── 📄 installation.md         
│   └── ...                     
├── 📂 notebooks/                   # Jupyter Notebooks for exploration
│   ├── 📄 exploration.ipynb       
│   └── ...                     
├── 📂 scripts/                         # Standalone scripts (ETL, data processing)
│   ├── 🐍 run_pipeline.py         
│   ├── 🐍 clean_data.py           
│   └── ...                     
├── 📂 data/                            # Raw or processed data (if applicable)
│   ├── 📂 raw/                    
│   ├── 📂 processed/
│   └── ....                                 
├── 📄 .gitignore                      # Git ignored files
├── 📄 pyproject.toml              # Project configuration (Poetry, setuptools)
├── 📄 uv.lock                         # UV file
├── 📄 README.md               # Main project documentation
├── 🐍 setup.py                       # Installation script (if applicable)
├── 📄 LICENSE                     # Project license
├── 📄 Makefile                       # Automates common tasks
├── 📄 DockerFile                   # To create Docker image
├── 📂 .github/                        # GitHub Actions workflows (CI/CD)
│   ├── 📂 actions/               
│   └── 📂 workflows/

Choosing between these layouts depends on the project’s complexity and long-term goals. For production-quality code, the src/ layout is often recommended, whereas the flat layout works well for simple or short-lived projects.

You can imagine different templates that are better adapted to your use case. It is important that you maintain the modularity of your project. Do not hesitate to create subdirectories and to group together scripts with similar functionalities and separate those with different uses. A good code structure ensures readability, maintainability, scalability and reusability and helps to identify and correct errors efficiently.

Cookiecutter is an open-source tool for generating preconfigured project structures from templates. It is particularly useful for ensuring the coherence and organization of projects, especially in Python, by applying good practices from the outset. Both the flat layout and the src layout can be initialized using the uv tool.


The SOLID principles

SOLID programming is an essential approach to software development based on five basic principles for improving code quality, maintainability and scalability. These principles provide a clear framework for developing robust, flexible systems. By following the SOLID principles, you reduce the risk of complex dependencies, make testing easier and ensure that applications can evolve more easily in the face of change. Whether you are working on a single project or a large-scale application, mastering SOLID is an important step towards adopting object-oriented programming best practices.

S — Single Responsibility Principle (SRP)

The principle of single responsibility means that a class/function can only manage one thing. This means that it only has one reason to change. This makes the code more maintainable and easier to read. A class/function with multiple responsibilities is difficult to understand and often a source of errors.

Example:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler


# Violates SRP
class MLPipeline:
    def __init__(self, df: pd.DataFrame, target_column: str):
        self.df = df
        self.target_column = target_column
        self.scaler = StandardScaler()
        self.model = RandomForestClassifier()
   
    def preprocess_data(self):
        self.df.fillna(self.df.mean(), inplace=True)  # Handle missing values
        X = self.df.drop(columns=[self.target_column])
        y = self.df[self.target_column]
        X_scaled = self.scaler.fit_transform(X)  # Feature scaling
        return X_scaled, y
        
    def train_model(self):
        X, y = self.preprocess_data()  # Data preprocessing inside model training
        self.model.fit(X, y)
        print("Model training complete.")

Here, the MLPipeline class has two responsibilities: preprocessing the data and training the model.

# Follows SRP
class DataPreprocessor:
    def __init__(self):
        self.scaler = StandardScaler()
        
    def preprocess(self, df: pd.DataFrame, target_column: str):
        df = df.copy()
        df.fillna(df.mean(), inplace=True)  # Handle missing values
        X = df.drop(columns=[target_column])
        y = df[target_column]
        X_scaled = self.scaler.fit_transform(X)  # Feature scaling
        return X_scaled, y


class ModelTrainer:
    def __init__(self, model):
        self.model = model
        
    def train(self, X, y):
        self.model.fit(X, y)
        print("Model training complete.")

O — Open/Closed Principle (OCP)

The open/close principle means that a class/function must be open to extension, but closed to modification. This makes it possible to add functionality without the risk of breaking existing code.

It is not easy to develop with this principle in mind, but a good indicator for the main developer is to see more and more additions (+) and fewer and fewer removals (-) in the merge requests during project development.
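
As an illustration (a sketch in the spirit of the other examples), here is a metric computation that is open to extension: new metrics are added as new classes, without modifying the existing code that uses them.

from abc import ABC, abstractmethod


class Metric(ABC):
    @abstractmethod
    def compute(self, y_true: list[float], y_pred: list[float]) -> float:
        ...


class MeanAbsoluteError(Metric):
    def compute(self, y_true, y_pred):
        return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Adding a new metric is an extension: no existing class or function changes
class MeanSquaredError(Metric):
    def compute(self, y_true, y_pred):
        return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate(metric: Metric, y_true, y_pred) -> float:
    # evaluate() is closed to modification: it never changes when a new Metric appears
    return metric.compute(y_true, y_pred)


print(evaluate(MeanAbsoluteError(), [1.0, 2.0], [1.5, 1.0]))  # 0.75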

L — Liskov Substitution Principle (LSP)

The Liskov substitution principle states that a subclass must be able to replace its parent class without changing the behavior of the program, ensuring that the subclass meets the expectations defined by the base class. It limits the risk of unexpected errors.

Example :

# Violates LSP
class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


class Square(Rectangle):
    def __init__(self, side):
        super().__init__(side, side)
# Changing the width of a square violates the idea of a square.

To respect the LSP, it is better to avoid this hierarchy and use independent classes:

class Shape:
    def area(self):
        raise NotImplementedError


class Rectangle(Shape):
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


class Square(Shape):
    def __init__(self, side):
        self.side = side

    def area(self):
        return self.side * self.side

I — Interface Segregation Principle (ISP)

The interface segregation principle states that several small, specific classes should be built instead of one large class with methods that cannot be used in certain cases. This reduces unnecessary dependencies.

Example:

# Violates ISP
class Animal:
    def fly(self):
        raise NotImplementedError

    def swim(self):
        raise NotImplementedError

It is better to split the class Animal into several classes:

# Follows ISP
class CanFly:
    def fly(self):
        raise NotImplementedError


class CanSwim:
    def swim(self):
        raise NotImplementedError


class Bird(CanFly):
    def fly(self):
        print("Flying")


class Fish(CanSwim):
    def swim(self):
        print("Swimming")

D — Dependency Inversion Principle (DIP)

The Dependency Inversion Principle means that a class must depend on an abstraction (such as an abstract class) and not on a concrete class. This reduces coupling between classes and makes the code more modular.

Example:

# Violates DIP
class Database:
    def connect(self):
        print("Connecting to database")


class UserService:
    def __init__(self):
        self.db = Database()

    def get_users(self):
        self.db.connect()
        print("Getting users")

Here, the attribute db of UserService depends on the class Database. To respect the DIP, db has to depend on an abstract class.

# Follows DIP
class DatabaseInterface:
    def connect(self):
        raise NotImplementedError


class MySQLDatabase(DatabaseInterface):
    def connect(self):
        print("Connecting to MySQL database")


class UserService:
    def __init__(self, db: DatabaseInterface):
        self.db = db

    def get_users(self):
        self.db.connect()
        print("Getting users")


# We can easily change the used database.
db = MySQLDatabase()
service = UserService(db)
service.get_users()

PEP standards

PEPs (Python Enhancement Proposals) are technical and informative documents that describe new features, language improvements or guidelines for the Python community. Among them, PEP 8, which defines style conventions for Python code, plays a fundamental role in promoting readability and consistency in projects.

Adopting the PEP standards, especially PEP 8, not only ensures that the code is understandable to other developers, but also that it conforms to the standards set by the community. This facilitates collaboration, re-reads and long-term maintenance.

In this article, I present the most important aspects of the PEP standards, including:

  • Style Conventions (PEP 8): Indentations, variable names and import organization.
  • Best practices for documenting code (PEP 257).
  • Recommendations for writing typed, maintainable code (PEP 484 and PEP 563).

Understanding and applying these standards is essential to take full advantage of the Python ecosystem and contribute to professional quality projects.


PEP 8

PEP 8 defines coding conventions to standardize Python code, and there is already a lot of documentation about it. I will not cover every recommendation in this post, only those I consider essential when reviewing code.

Naming conventions

Variable, function and module names should be in lower case, and use underscores to separate words. This typographical convention is called snake_case.


my_variable
my_new_function()
my_module

Constants are written in capital letters and defined at the beginning of the script (after the imports):


LIGHT_SPEED
MY_CONSTANT

Finally, class names and exceptions use the CamelCase format (a capital letter at the beginning of each word). Exception names must end with Error.


MyGreatClass
MyGreatError

Remember to give your variables names that make sense! Don’t use variable names like v1, v2, func1, i, toto…

Single-character variable names are permitted for loops and indexes:

my_list = [1, 3, 5, 7, 9, 11]
for i in range(len(my_list)):
    print(my_list[i])

A more “pythonic” way of writing, to be preferred to the previous example, gets rid of the i index:

my_list = [1, 3, 5, 7, 9, 11]
for element in my_list:
    print(element)

Spaces management

It is recommended to surround operators (+, -, *, /, //, %, ==, !=, >, not, in, and, or, …) with a space before AND after:

# recommended code:
my_variable = 3 + 7
my_text = "mouse"
my_text == my_variable

# not recommended code:
my_variable=3+7
my_text="mouse"
my_text==my_variable

However, you should not add more than one space around an operator. On the other hand, there are no spaces just inside square brackets, braces or parentheses:

# recommended code:
my_list[1]
my_dict{"key"}
my_function(argument)

# not recommended code:
my_list[ 1 ]
my_dict{ "key" }
my_function( argument )

A space is recommended after the characters “:” and “,”, but not before:

# recommended code:
my_list = [1, 2, 3]
my_dict = {"key1": "value1", "key2": "value2"}
my_function(argument1, argument2)

# not recommended code:
my_list = [1 , 2 , 3]
my_dict = {"key1":"value1", "key2":"value2"}
my_function(argument1 , argument2)

However, when indexing lists, we don’t put a space after the “:”:

my_list = [1, 3, 5, 7, 9, 1]

# recommended code:
my_list[1:3]
my_list[1:4:2]
my_list[::2]

# not recommended code:
my_list[1 : 3]
my_list[1: 4:2 ]
my_list[ : :2]

Line length

For the sake of readability, we recommend writing lines of code no longer than 80 characters. However, in certain circumstances this rule can be broken; for example, if you are working on a Dash project, it may be difficult to respect this recommendation.

The \ character can be used to cut lines that are too long.

For example:

my_variable = 3
if my_variable > 1 and my_variable < 10 \
    and my_variable % 2 == 1 and my_variable % 3 == 0:
    print(f"My variable is equal to {my_variable }")

Within a parenthesis, you can return to the line without using the \ character. This can be useful for specifying the arguments of a function or method when defining or using it:

def my_function(argument_1, argument_2,
                argument_3, argument_4):
    return argument_1 + argument_2

It is also possible to create multi-line lists or dictionaries by skipping a line after a comma:

my_list = [1, 2, 3,
           4, 5, 6,
           7, 8, 9]
my_dict = {"key1": 13,
           "key2": 42,
           "key3": -10}

Blank lines

In a script, blank lines are useful for visually separating different parts of the code. It is recommended to leave two blank lines before the definition of a function or class, and to leave a single blank line before the definition of a method (in a class). You can also leave a blank line in the body of a function to separate the logical sections of the function, but this should be used sparingly.
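
For example (a short sketch):

CONSTANT = 42


def top_level_function():
    """Top-level definitions are preceded by two blank lines."""
    return CONSTANT


class MyGreatClass:
    """Methods are separated by a single blank line."""

    def first_method(self):
        return "first"

    def second_method(self):
        return "second"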

Comments

Comments always begin with the # symbol followed by a space. They give clear explanations of the purpose of the code and must be kept in sync with the code, i.e. if the code is modified, the comments must be too (if applicable). They are on the same indentation level as the code they comment on. Comments are complete sentences, with a capital letter at the beginning (unless the first word is a variable, which is written without a capital letter) and a period at the end. I strongly recommend writing comments in English, and it is important to be consistent between the language used for comments and the language used to name variables. Finally, comments that follow the code on the same line should be avoided wherever possible, and should be separated from the code by at least two spaces.
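
A short sketch illustrating these conventions:

basket_values = [12.9, 30.0, 7.5]

# Compute the average basket value from the list above.
total = sum(basket_values)
average_basket = total / len(basket_values)

price = 10.5  # Price in euros, VAT included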

Tool to help you

Ruff is a linter (code analysis tool) and formatter for Python code, written in Rust. It combines the advantages of the flake8 linter and the black and isort formatters, while being faster.

Ruff has an extension on the VS Code editor.

To check your code you can type:

ruff check my_module.py

It is also possible to format the file with the following command:

ruff format my_module.py

PEP 20

PEP 20: The Zen of Python is a set of 19 principles written in poetic form. They are more a way of coding than actual guidelines.

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren’t special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one– and preferably only one –obvious way to do it.
Although that way may not be obvious at first unless you’re Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it’s a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea — let’s do more of those!

PEP 257

The aim of PEP 257 is to standardize the use of docstrings.

What is a docstring?

A docstring is a string that appears as the first instruction after the definition of a function, class or method. The docstring becomes the value of the __doc__ special attribute of this object.

def my_function():
    """This is a doctring."""
    pass

And we have:

>>> my_function.__doc__
>>> 'This is a docstring.'

We always write a docstring between triple double quotes """.

Docstring on a line

Used for simple functions or methods, it must fit on a single line, with no blank line at the beginning or end. The closing quotes are on the same line as the opening quotes, and there are no blank lines before or after the docstring.

def add(a, b):
    """Return the sum of a and b."""
    return a + b

A single-line docstring MUST NOT simply restate the function/method signature. Do not do:

def my_function(a, b):
    """ my_function(a, b) -> list"""

Docstring on several lines

The first line should be a summary of the object being documented. An empty line follows, followed by more detailed explanations or clarifications of the arguments.

def divide(a, b):
    """Divide a by b.

    Returns the result of the division. Raises a ValueError if b equals 0.
    """
    if b == 0:
        raise ValueError("Only Chuck Norris can divide by 0")
    return a / b

Complete Docstring

A complete docstring is made up of several parts (in this case, based on the numpydoc standard).

  1. Short description: Summarizes the main functionality.
  2. Parameters: Describes the arguments with their type, name and role.
  3. Returns: Specifies the type and role of the returned value.
  4. Raises: Documents exceptions raised by the function.
  5. Notes (optional): Provides additional explanations.
  6. Examples (optional): Contains illustrated usage examples with expected results or exceptions.

def calculate_mean(numbers: list[float]) -> float:
    """
    Calculate the mean of a list of numbers.

    Parameters
    ----------
    numbers : list of float
        A list of numerical values for which the mean is to be calculated.

    Returns
    -------
    float
        The mean of the input numbers.

    Raises
    ------
    ValueError
        If the input list is empty.

    Notes
    -----
    The mean is calculated as the sum of all elements divided by the number of elements.

    Examples
    --------
    Calculate the mean of a list of numbers:
    >>> calculate_mean([1.0, 2.0, 3.0, 4.0])
    2.5"""

Tool to help you

VsCode’s autoDocstring extension lets you automatically create a docstring template.

PEP 484

In some programming languages, typing is mandatory when declaring a variable. In Python, typing is optional, but strongly recommended. PEP 484 introduces a typing system for Python, annotating the types of variables, function arguments and return values. This PEP provides a basis for improving code readability, facilitating static analysis and reducing errors.

What is typing?

Typing consists in explicitly declaring the type (float, string, etc.) of a variable. The typing module provides standard tools for defining generic types, such as Sequence, List, Union, Any, etc.

To add type hints to a function, we use ":" after each argument name and "->" before the type of the returned value.

Here is a list of functions without type annotations:

def show_message(message):
    print(f"Message : {message}")

def addition(a, b):
    return a + b

def is_even(n):
    return n % 2 == 0

def list_square(numbers):
    return [x**2 for x in numbers]

def reverse_dictionary(d):
    return {v: k for k, v in d.items()}

def add_element(ensemble, element):
    ensemble.add(element)
    return ensemble

Now here’s how they should look:

from typing import List, Tuple, Dict, Set, Any

def show_message(message: str) -> None:
    print(f"Message : {message}")

def addition(a: int, b: int) -> int:
    return a + b

def is_even(n: int) -> bool:
    return n % 2 == 0

def list_square(numbers: List[int]) -> List[int]:
    return [x**2 for x in numbers]

def reverse_dictionary(d: Dict[str, int]) -> Dict[int, str]:
    return {v: k for k, v in d.items()}

def add_element(ensemble: Set[int], element: int) -> Set[int]:
    ensemble.add(element)
    return ensemble

Tool to help you

The MyPy extension automatically checks whether the use of a variable corresponds to the declared type. For example, for the following function:

def my_function(x: float) -> float:
    return x.mean()

The editor will point out that a float has no “mean” attribute.

Image from author

The benefit is twofold: you’ll know whether the declared type is the right one and whether the use of this variable corresponds to its type.

In the above example, x must be of a type that has a mean() method (e.g. np.array).
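
To make the example consistent, one option (a sketch assuming a NumPy array input) is:

import numpy as np


def my_function(x: np.ndarray) -> float:
    # np.ndarray does have a mean() method, so the declared types are now coherent
    return float(x.mean())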


Conclusion

In this article, we have looked at the most important principles for creating clean Python production code. A solid architecture, adherence to SOLID principles, and compliance with PEP recommendations (at least the four discussed here) are essential for ensuring code quality. The desire for beautiful code is not (just) coquetry. It standardizes development practices and makes teamwork and maintenance much easier. There’s nothing more frustrating than spending hours (or even days) reverse-engineering a program, deciphering poorly written code before you’re finally able to fix the bugs. By applying these best practices, you ensure that your code remains clear, scalable, and easy for any developer to work with in the future.


References

1. src layout vs flat layout

2. SOLID principles

3. Python Enhancement Proposals index

The post Data Science: From School to Work, Part II appeared first on Towards Data Science.

]]>
Debugging the Dreaded NaN https://towardsdatascience.com/debugging-the-dreaded-nan/ Thu, 27 Feb 2025 21:52:06 +0000 https://towardsdatascience.com/?p=598513 Capturing and reproducing failures in PyTorch training with Lightning

The post Debugging the Dreaded NaN appeared first on Towards Data Science.

]]>
You are training your latest AI model, anxiously watching as the loss steadily decreases when suddenly — boom! Your logs are flooded with NaNs (Not a Number) — your model is irreparably corrupted and you’re left staring at your screen in despair. To make matters worse, the NaNs don’t appear consistently. Sometimes your model trains just fine; other times, it fails inexplicably. Sometimes it will crash immediately, sometimes after many days of training.

NaNs in Deep Learning workloads are amongst the most frustrating issues to encounter. And because they often appear sporadically — triggered by a specific combination of model state, input data, and stochastic factors — they can be incredibly difficult to reproduce and debug.

Given the considerable cost of training AI models and the potential waste caused by NaN failures, it is recommended to have dedicated tools for capturing and analyzing NaN occurrences. In a previous post, we discussed the challenge of debugging NaNs in a TensorFlow training workload. We proposed an efficient scheme for capturing and reproducing NaNs and shared a sample TensorFlow implementation. In this post, we adopt and demonstrate a similar mechanism for debugging NaNs in PyTorch workloads. The general scheme is as follows:

On each training step:

  1. Save a copy of the training input batch.
  2. Check the gradients for NaN values. If any appear, save a checkpoint with the current model weights before the model is corrupted. Also, save the input batch and, if necessary, the stochastic state. Discontinue the training job.
  3. Reproduce and debug the NaN occurrence by loading the saved experiment state.

Although this scheme can be easily implemented in native PyTorch, we will take the opportunity to demonstrate some of the conveniences of PyTorch Lightning — a powerful open-source framework designed to streamline the development of machine learning (ML) models. Built on PyTorch, Lightning abstracts away many of the boiler-plate components of an ML experiment, such as training loops, data distribution, logging, and more, enabling developers to focus on the core logic of their models.

To implement our NaN capturing scheme, we will use Lightning’s callback interface — a dedicated structure that enables inserting custom logic at specific points during the flow of execution.

Importantly, please do not view our choice of Lightning or any other tool or technique that we mention as an endorsement of its use. The code that we will share is intended for demonstrative purposes — please do not rely on its correctness or optimality.

Many thanks to Rom Maltser for his contributions to this post.

NaNCapture Callback

To implement our NaN capturing solution, we create a NaNCapture Lightning callback. The constructor receives a directory path for storing/loading checkpoints and sets up the NaNCapture state. We also define utilities for checking for NaNs, storing checkpoints, and halting the training job.

import os
import torch
from copy import deepcopy
import lightning.pytorch as pl

class NaNCapture(pl.Callback):

    def __init__(self, dirpath: str):
        # path to checkpoint
        self.dirpath = dirpath
        
        # update to True when Nan is identified
        self.nan_captured = False
        
        # stores a copy of the last batch
        self.last_batch = None
        self.batch_idx = None

    @staticmethod
    def contains_nan(tensor):
        return torch.isnan(tensor).any().item()
        # alternatively, check for any non-finite value (NaN or inf)
        # return not torch.isfinite(tensor).all().item()

    @staticmethod
    def halt_training(trainer):
        trainer.should_stop = True
        # communicate stop command to all other ranks
        trainer.strategy.reduce_boolean_decision(trainer.should_stop,
                                                 all=False)

    def save_ckpt(self, trainer):
        os.makedirs(self.dirpath, exist_ok=True)
        # include trainer.global_rank to avoid conflict
        filename = f"nan_checkpoint_rank_{trainer.global_rank}.ckpt"
        full_path = os.path.join(self.dirpath, filename)
        print(f"saving ckpt to {full_path}")
        trainer.save_checkpoint(full_path, False)

Callback Function: on_train_batch_start

We begin by implementing the on_train_batch_start hook to store a copy of each input batch. In case of a NaN event, this batch will be stored in the checkpoint.
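
A minimal sketch of this hook could look as follows (the full version, which also captures and restores the random state, appears later in this post):

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        if not self.nan_captured:
            # keep a copy of the current batch so it can be stored
            # in the checkpoint if a NaN is detected later in this step
            self.last_batch = deepcopy(batch)
            self.batch_idx = batch_idx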

Callback Function: on_before_optimizer_step

Next we implement the on_before_optimizer_step hook. Here, we check for NaN entries in all of the gradient tensors. If found, we store a checkpoint with the uncorrupted model weights and halt the training.

    def on_before_optimizer_step(self, trainer, pl_module, optimizer):
        if not self.nan_captured:
            # Check if gradients contain NaN
            grads = [p.grad.view(-1) for p in pl_module.parameters()
                     if p.grad is not None]
            all_grads = torch.cat(grads)
            if self.contains_nan(all_grads):
                print("nan found")
                self.save_ckpt(trainer)
                self.halt_training(trainer)

Capturing the Training State

To enable reproducibility, we include the NaNCapture state in the checkpoint by appending it to the training state dictionary. Lightning provides dedicated utilities for saving and loading a callback state:

    def state_dict(self):
        d = {"nan_captured": self.nan_captured}
        if self.nan_captured:
            d["last_batch"] = self.last_batch
        return d


    def load_state_dict(self, state_dict):
        self.nan_captured = state_dict.get("nan_captured", False)
        if self.nan_captured:
            self.last_batch = state_dict["last_batch"]

Reproducing the NaN Occurrence

We have described how our NaNCapture callback can be used to store the training state that resulted in a NaN, but how do we reload this state in order to reproduce the issue and debug it? To accomplish this, we leverage Lightning’s dedicated data loading class, LightningDataModule.

DataModule Function: on_before_batch_transfer

In the code block below, we extend the LightningDataModule class to allow injecting a fixed training input batch. This is achieved by overriding the on_before_batch_transfer hook, as shown below:

from lightning.pytorch import LightningDataModule

class InjectableDataModule(LightningDataModule):

    def __init__(self):
        super().__init__()
        self.cached_batch = None

    def set_custom_batch(self, batch):
        self.cached_batch = batch

    def on_before_batch_transfer(self, batch, dataloader_idx):
        if self.cached_batch is not None:
            return self.cached_batch
        return batch

Callback Function: on_train_start

The final step is modifying the on_train_start hook of our NaNCapture callback to inject the stored training batch into the LightningDataModule.

    def on_train_start(self, trainer, pl_module):
        if self.nan_captured:
            datamodule = trainer.datamodule
            datamodule.set_custom_batch(self.last_batch)

In the next section we will demonstrate the end-to-end solution using a toy example.

Toy Example

To test our new callback, we create a resnet50-based image classification model with a loss function deliberately designed to trigger NaN occurrences.

Instead of using the standard CrossEntropy loss, we compute binary_cross_entropy_with_logits for each class independently and divide the result by the number of samples belonging to that class. Inevitably, we will encounter a batch in which one or more classes are missing, leading to a divide-by-zero operation, resulting in NaN values and corrupting the model.

The implementation below follows Lightning’s introductory tutorial.

import lightning.pytorch as pl
import torch
import torchvision
import torch.nn.functional as F

num_classes = 20


# define a lightning module
class ResnetModel(pl.LightningModule):
    def __init__(self):
        """Initializes a new instance of the ResnetModel class."""
        super().__init__()
        self.model = torchvision.models.resnet50(num_classes=num_classes)

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_nb):
        x, y = batch
        outputs = self(x)
        # uncomment for default loss
        # return F.cross_entropy(outputs, y)
        
        # calculate binary_cross_entropy for each class individually
        losses = []
        for c in range(num_classes):
            count = torch.count_nonzero(y==c)
            masked = torch.where(y==c, 1., 0.)
            loss = F.binary_cross_entropy_with_logits(
                outputs[..., c],
                masked,
                reduction='sum'
            )
            mean_loss = loss/count # could result in NaN
            losses.append(mean_loss)
        total_loss = torch.stack(losses).mean()
        return total_loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=0.02)

We define a synthetic dataset and encapsulate it in our InjectableDataModule class:

import os
import random
from torch.utils.data import Dataset, DataLoader

batch_size = 128
num_steps = 800

# A dataset with random images and labels
class FakeDataset(Dataset):
    def __len__(self):
        return batch_size*num_steps

    def __getitem__(self, index):
        rand_image = torch.randn([3, 224, 224], dtype=torch.float32)
        label = torch.tensor(random.randint(0, num_classes-1),
                             dtype=torch.int64)
        return rand_image, label



# define a lightning datamodule
class FakeDataModule(InjectableDataModule):

    def train_dataloader(self):
        dataset = FakeDataset()
        return DataLoader(
            dataset,
            batch_size=batch_size,
            num_workers=os.cpu_count(),
            pin_memory=True
        )

Finally, we initialize a Lightning Trainer with our NaNCapture callback and call trainer.fit with our Lightning module and Lightning DataModule.

import time

if __name__ == "__main__":

    # Initialize a lightning module
    lit_module = ResnetModel()

    # Initialize a DataModule
    mnist_data = FakeDataModule()

    # Train the model
    ckpt_dir = "./ckpt_dir"
    trainer = pl.Trainer(
        max_epochs=1,
        callbacks=[NaNCapture(ckpt_dir)]
    )

    ckpt_path = None

    # check if a NaN ckpt exists
    if os.path.isdir(ckpt_dir):
        dir_contents = [os.path.join(ckpt_dir, f)
                        for f in os.listdir(ckpt_dir)]
        ckpts = [f for f in dir_contents
                 if os.path.isfile(f) and f.endswith('.ckpt')]
        if ckpts:
            ckpt_path = ckpts[0]

    t0 = time.perf_counter()
    trainer.fit(lit_module, mnist_data, ckpt_path=ckpt_path)
    print(f"total runtime: {time.perf_counter() - t0}")

After a number of training steps, a NaN event will occur. At this point a checkpoint is saved with the full training state and the training is halted.

When the script is run again the exact state that caused the NaN will be reloaded allowing us to easily reproduce the issue and debug its root cause.

Performance Overhead

To assess the impact of our NaNCapture callback on runtime performance, we modified our experiment to use CrossEntropyLoss (to avoid NaNs) and measured the average throughput when running with and without NaNCapture callback. The experiments were conducted on an NVIDIA L40S GPU, with a PyTorch 2.5.1 Docker image.

Overhead of NaNCapture Callback (by Author)

For our toy model, the NaNCapture callback adds a minimal 1.5% overhead to the runtime performance — a small price to pay for the valuable debugging capabilities it provides.

Naturally, the actual overhead will depend on the specifics of the model and runtime environment.

How to Handle Stochasticity

The solution we have described so far will succeed in reproducing the training state provided that the model does not include any randomness. However, introducing stochasticity into the model definition is often critical for convergence. A common example of a stochastic layer is torch.nn.Dropout.

You may find that your NaN event depends on the precise state of randomness when the failure occurred. Consequently, we would like to enhance our NaNCapture callback to capture and restore the random state at the point of failure. The random state is determined by a number of libraries. In the code block below, we attempt to capture the full state of randomness:

import os
import torch
import random
import numpy as np
from copy import deepcopy
import lightning.pytorch as pl

class NaNCapture(pl.Callback):

    def __init__(self, dirpath: str):
        # path to checkpoint
        self.dirpath = dirpath
        
        # update to True when Nan is identified
        self.nan_captured = False
        
        # stores a copy of the last batch
        self.last_batch = None
        self.batch_idx = None

        # rng state
        self.rng_state = {
            "torch": None,
            "torch_cuda": None,
            "numpy": None,
            "random": None
        }

    @staticmethod
    def contains_nan(tensor):
        return torch.isnan(tensor).any().item()
        # alternatively, check for any non-finite value (NaN or inf)
        # return not torch.isfinite(tensor).all().item()

    @staticmethod
    def halt_training(trainer):
        trainer.should_stop = True
        trainer.strategy.reduce_boolean_decision(trainer.should_stop,
                                                 all=False)

    def save_ckpt(self, trainer):
        os.makedirs(self.dirpath, exist_ok=True)
        # include trainer.global_rank to avoid conflict
        filename = f"nan_checkpoint_rank_{trainer.global_rank}.ckpt"
        full_path = os.path.join(self.dirpath, filename)
        print(f"saving ckpt to {full_path}")
        trainer.save_checkpoint(full_path, False)

    def on_train_start(self, trainer, pl_module):
        if self.nan_captured:
            # inject batch
            datamodule = trainer.datamodule
            datamodule.set_custom_batch(self.last_batch)

    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx):
        if self.nan_captured:
            # restore random state
            torch.random.set_rng_state(self.rng_state["torch"])
            torch.cuda.set_rng_state_all(self.rng_state["torch_cuda"])
            np.random.set_state(self.rng_state["numpy"])
            random.setstate(self.rng_state["random"])
        else:
            # capture current batch
            self.last_batch = deepcopy(batch)
            self.batch_idx = batch_idx
    
            # capture current random state
            self.rng_state["torch"] = torch.random.get_rng_state()
            self.rng_state["torch_cuda"] = torch.cuda.get_rng_state_all()
            self.rng_state["numpy"] = np.random.get_state()
            self.rng_state["random"] = random.getstate()
    
    def on_before_optimizer_step(self, trainer, pl_module, optimizer):
        if not self.nan_captured:
            # Check if gradients contain NaN
            grads = [p.grad.view(-1) for p in pl_module.parameters()
                     if p.grad is not None]
            all_grads = torch.cat(grads)
            if self.contains_nan(all_grads):
                print("nan found")
                self.save_ckpt(trainer)
                self.halt_training(trainer)

    def state_dict(self):
        d = {"nan_captured": self.nan_captured}
        if self.nan_captured:
            d["last_batch"] = self.last_batch
            d["rng_state"] = self.rng_state
        return d

    def load_state_dict(self, state_dict):
        self.nan_captured = state_dict.get("nan_captured", False)
        if self.nan_captured:
            self.last_batch = state_dict["last_batch"]
            self.rng_state = state_dict["rng_state"]

Importantly, setting the random state may not guarantee full reproducibility. The GPU owes its power to its massive parallelism. In some GPU operations, multiple threads may read or write concurrently to the same memory locations, resulting in nondeterminism. PyTorch allows for some control over this via its use_deterministic_algorithms setting, but this may impact the runtime performance. Additionally, there is a possibility that the NaN event will not be reproduced once this configuration setting is changed. Please see the PyTorch documentation on reproducibility for more details.
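
As a minimal sketch (not part of the callback above), opting in to deterministic algorithms looks like this; note that some operations will raise an error if they have no deterministic implementation:

import torch

# request deterministic kernels where available; ops without a
# deterministic variant will raise an error at runtime
torch.use_deterministic_algorithms(True)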

Summary

Encountering NaN failures is one of the most discouraging events that can happen in machine learning development. These errors not only waste valuable computation and development resources, but often indicate fundamental issues in the model architecture or experiment design. Due to their sporadic, sometimes elusive nature, debugging NaN failures can be a nightmare.

This post introduced a proactive approach for capturing and reproducing NaN errors using a dedicated Lightning callback. The solution we shared is a proposal which can be modified and extended for your specific use case.

While this solution may not address every possible NaN scenario, it significantly reduces debugging time when applicable, potentially saving developers countless hours of frustration and wasted effort.

The post Debugging the Dreaded NaN appeared first on Towards Data Science.

]]>
Is Python Set to Surpass Its Competitors? https://towardsdatascience.com/is-python-set-to-surpass-its-competitors/ Wed, 26 Feb 2025 00:46:31 +0000 https://towardsdatascience.com/?p=598431 The features that make Python the most suitable programming language for most people

The post Is Python Set to Surpass Its Competitors? appeared first on Towards Data Science.

]]>
A soufflé is a baked egg dish that originated in France in the 18th century. The process of making an elegant and delicious French soufflé is complex, and in the past, it was typically only prepared by professional French pastry chefs. However, with pre-made soufflé mixes now widely available in supermarkets, this classic French dish has found its way into the kitchens of countless households. 

Python is like the pre-made soufflé mixes in programming. Many studies have consistently shown that Python is the most popular programming language among developers, and this advantage will continue to expand in 2025. Python stands out compared to languages like C, C++, Java, and Julia because it’s highly readable and expressive, flexible and dynamic, beginner-friendly yet powerful. These characteristics make Python the most suitable programming language for people even without programming basics. The following features distinguish Python from other programming languages:

  • Dynamic Typing
  • List Comprehensions
  • Generators
  • Argument Passing and Mutability

These features reveal Python’s intrinsic nature as a programming language. Without this knowledge, you’ll never truly understand Python. In today’s article, I will elaborate how Python excels over other programming languages through these features.

Dynamic Typing

For most programming languages like Java or C++, explicit data type declarations are required. But when it comes to Python, you don’t have to declare the type of a variable when you create one. This feature in Python is called dynamic typing, which makes Python flexible and easy to use.
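
For example, the same variable name can be rebound to objects of different types without any declaration:

x = 42            # x is bound to an int
x = "forty-two"   # the same name now refers to a str
x = [4, 2]        # ...and now to a list, with no type declaration anywhere
print(type(x))    # <class 'list'>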

List Comprehensions

List comprehensions are used to generate lists from other lists by applying functions to each element in the list. They provide a concise way to apply loops and optional conditions in a list.

For example, if you’d like to create a list of squares for even numbers between 0 and 9, you can use JavaScript, a regular loop in Python and Python’s list comprehension to achieve the same goal. 

JavaScript

let squares = Array.from({ length: 10 }, (_, x) => x)  // Create array [0, 1, 2, ..., 9]
   .filter(x => x % 2 === 0)                          // Filter even numbers
   .map(x => x ** 2);                                 // Square each number
console.log(squares);  // Output: [0, 4, 16, 36, 64]

Regular Loop in Python

squares = []
for x in range(10):
   if x % 2 == 0:
       squares.append(x**2)
print(squares) 

Python’s List Comprehension

squares = [x**2 for x in range(10) if x % 2 == 0]
print(squares)

All three sections of code above generate the same list [0, 4, 16, 36, 64], but Python’s list comprehension is the most elegant: the syntax is concise and clearly expresses the intent, while the regular Python loop is more verbose and requires explicit initialization and appending. The JavaScript syntax is the least elegant and readable because it requires chaining Array.from, filter, and map. Neither the regular Python loop nor the JavaScript version reads as close to natural language as the Python list comprehension does.

Generator

Generators in Python are a special kind of iterator that lets developers iterate over a sequence of values without storing them all in memory at once. They are created with the yield keyword. Other programming languages like C++ and Java, though offering similar functionality, don’t provide a built-in yield keyword in the same simple, integrated way. Here are several key advantages that make Python generators unique:

  • Memory Efficiency: Generators yield one value at a time so that they only compute and hold one item in memory at any given moment. This is in contrast to, say, a list in Python, which stores all items in memory.
  • Lazy Evaluation: Generators enable Python to compute values only as needed. This “lazy” computation results in significant performance improvements when dealing with large or potentially infinite sequences.
  • Simple Syntax: This might be the biggest reason why developers choose to use generators because they can easily convert a regular function into a generator without having to manage state explicitly.
def fibonacci():
   a, b = 0, 1
   while True:
       yield a
       a, b = b, a + b

fib = fibonacci()
for _ in range(100):
   print(next(fib))

The example above shows how to use the yield keyword when creating a sequence. In terms of memory usage and runtime, you will hardly notice any difference when generating 100 Fibonacci numbers with or without a generator. But when it comes to 100 million numbers in practice, you’d better use a generator, because a list of 100 million numbers could easily strain many systems’ resources.
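
To make the memory argument concrete, here is a small sketch comparing a list comprehension with the equivalent generator expression (the exact numbers will vary by platform):

import sys

squares_list = [x * x for x in range(1_000_000)]   # materializes every value
squares_gen = (x * x for x in range(1_000_000))    # computes values on demand

print(sys.getsizeof(squares_list))  # several megabytes
print(sys.getsizeof(squares_gen))   # roughly a couple hundred bytes, regardless of length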

Argument Passing and Mutability

In Python, we don’t really assign values to variables; instead, we bind variables to objects. The result of such an action depends on whether the object is mutable or immutable. If an object is mutable, changes made to it inside the function will affect the original object. 

def modify_list(lst):
   lst.append(4)

my_list = [1, 2, 3]
modify_list(my_list)
print(my_list)  # Output: [1, 2, 3, 4]

In the example above, we’d like to append 4 to the list my_list, which is [1, 2, 3]. Because lists are mutable, the append operation changes the original list my_list without creating a copy.

However, immutable objects, such as integers, floats, strings, tuples and frozensets, cannot be changed after creation. Therefore, any modification results in a new object. In the example below, because integers are immutable, the function creates a new integer rather than modifying the original variable.

def modify_number(n):
   n += 10
   return n

a = 5
new_a = modify_number(a)
print(a)      # Output: 5
print(new_a)  # Output: 15

Python’s argument passing is sometimes described as “pass-by-object-reference” or “pass-by-assignment.” This makes Python unique because it passes references uniformly (pass-by-object-reference), while other languages need to differentiate explicitly between pass-by-value and pass-by-reference. Python’s uniform approach is simple yet powerful. It avoids the need for explicit pointers or reference parameters but requires developers to be mindful of mutable objects.

With Python’s argument passing and mutability, we can enjoy the following benefits in coding:

  • Memory Efficiency: It saves memory by passing references instead of making full copies of objects. This especially benefits code development with large data structures.
  • Performance: It avoids unnecessary copies and thus improves the overall coding performance.
  • Flexibility: This feature provides convenience for updating data structure because developers don’t need to explicitly choose between pass-by-value and pass-by-reference.

However, this characteristic of Python forces developers to carefully choose between mutable and immutable data types, and it can also make debugging more complex.
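
One common way to keep this behavior under control is to copy a mutable argument explicitly whenever the caller’s object should not be modified, as in this small sketch:

def modify_list_safely(lst):
    # work on a copy so the caller's list is left untouched
    local = list(lst)
    local.append(4)
    return local

my_list = [1, 2, 3]
new_list = modify_list_safely(my_list)
print(my_list)   # Output: [1, 2, 3]  (unchanged)
print(new_list)  # Output: [1, 2, 3, 4]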

So is Python Really Simple?

Python’s popularity results from its simplicity, memory efficiency, high performance, and beginner-friendliness. It’s also the programming language that reads most like natural language, so even people who haven’t received systematic and holistic programming training are still able to understand it. These characteristics make Python a top choice among enterprises, academic institutes, and government organisations. 

For example, when we’d like to filter the “completed” orders with amounts greater than 200 and update a mutable summary report (a dictionary) with the total count and sum of amounts for an e-commerce company, we can use a list comprehension to create a list of orders meeting our criteria, skip the declaration of variable types, and modify the original dictionary through pass-by-assignment:

import random
import time

def order_stream(num_orders):
   """
   A generator that yields a stream of orders.
   Each order is a dictionary with dynamic types:
     - 'order_id': str
     - 'amount': float
     - 'status': str (randomly chosen among 'completed', 'pending', 'cancelled')
   """
   for i in range(num_orders):
       order = {
           "order_id": f"ORD{i+1}",
           "amount": round(random.uniform(10.0, 500.0), 2),
           "status": random.choice(["completed", "pending", "cancelled"])
       }
       yield order
       time.sleep(0.001)  # simulate delay

def update_summary(report, orders):
   """
   Updates the mutable summary report dictionary in-place.
   For each order in the list, it increments the count and adds the order's amount.
   """
   for order in orders:
       report["count"] += 1
       report["total_amount"] += order["amount"]

# Create a mutable summary report dictionary.
summary_report = {"count": 0, "total_amount": 0.0}

# Use a generator to stream 10,000 orders.
orders_gen = order_stream(10000)

# Use a list comprehension to filter orders that are 'completed' and have amount > 200.
high_value_completed_orders = [order for order in orders_gen
                              if order["status"] == "completed" and order["amount"] > 200]

# Update the summary report using our mutable dictionary.
update_summary(summary_report, high_value_completed_orders)

print("Summary Report for High-Value Completed Orders:")
print(summary_report)

If we’d like to achieve the same goal with Java, since Java lacks built-in generators and list comprehensions, we have to generate a list of orders, then filter and update a summary using explicit loops, making the code more complex, less readable, and harder to maintain.

import java.util.*;
import java.util.concurrent.ThreadLocalRandom;

class Order {
   public String orderId;
   public double amount;
   public String status;
  
   public Order(String orderId, double amount, String status) {
       this.orderId = orderId;
       this.amount = amount;
       this.status = status;
   }
  
   @Override
   public String toString() {
       return String.format("{orderId:%s, amount:%.2f, status:%s}", orderId, amount, status);
   }
}

public class OrderProcessor {
   // Generates a list of orders.
   public static List<Order> generateOrders(int numOrders) {
       List<Order> orders = new ArrayList<>();
       String[] statuses = {"completed", "pending", "cancelled"};
       Random rand = new Random();
       for (int i = 0; i < numOrders; i++) {
           String orderId = "ORD" + (i + 1);
           double amount = Math.round(ThreadLocalRandom.current().nextDouble(10.0, 500.0) * 100.0) / 100.0;
           String status = statuses[rand.nextInt(statuses.length)];
           orders.add(new Order(orderId, amount, status));
       }
       return orders;
   }
  
   // Filters orders based on criteria.
   public static List<Order> filterHighValueCompletedOrders(List<Order> orders) {
       List<Order> filtered = new ArrayList<>();
       for (Order order : orders) {
           if ("completed".equals(order.status) && order.amount > 200) {
               filtered.add(order);
           }
       }
       return filtered;
   }
  
   // Updates a mutable summary Map with the count and total amount.
   public static void updateSummary(Map<String, Object> summary, List<Order> orders) {
       int count = 0;
       double totalAmount = 0.0;
       for (Order order : orders) {
           count++;
           totalAmount += order.amount;
       }
       summary.put("count", count);
       summary.put("total_amount", totalAmount);
   }
  
   public static void main(String[] args) {
       // Generate orders.
       List<Order> orders = generateOrders(10000);
      
       // Filter orders.
       List<Order> highValueCompletedOrders = filterHighValueCompletedOrders(orders);
      
       // Create a mutable summary map.
       Map<String, Object> summaryReport = new HashMap<>();
       summaryReport.put("count", 0);
       summaryReport.put("total_amount", 0.0);
      
       // Update the summary report.
       updateSummary(summaryReport, highValueCompletedOrders);
      
       System.out.println("Summary Report for High-Value Completed Orders:");
       System.out.println(summaryReport);
   }
}

Conclusion

Equipped with dynamic typing, list comprehensions, generators, and its approach to argument passing and mutability, Python simplifies coding while enhancing memory efficiency and performance. As a result, Python has become an ideal programming language for self-learners.

Thank you for reading!

The post Is Python Set to Surpass Its Competitors? appeared first on Towards Data Science.

]]>
Efficient Data Handling in Python with Arrow https://towardsdatascience.com/efficient-data-handling-in-python-with-arrow/ Tue, 25 Feb 2025 20:56:16 +0000 https://towardsdatascience.com/?p=598426 Introducing Arrow to those who are still unaware of its power

The post Efficient Data Handling in Python with Arrow appeared first on Towards Data Science.

]]>
1. Introduction

We’re all used to working with CSVs, JSON files… With the traditional libraries and for large datasets, these can be extremely slow to read, write and operate on, leading to performance bottlenecks (been there). It’s precisely with big amounts of data that handling it efficiently becomes crucial for our data science/analytics workflow, and this is exactly where Apache Arrow comes into play. 

Why? The main reason resides in how the data is stored in memory. While JSON and CSVs, for example, are text-based formats, Arrow is a columnar in-memory data format (and that allows for fast data interchange between different data processing tools). Arrow is therefore designed to optimize performance by enabling zero-copy reads, reducing memory usage, and supporting efficient compression. 

Moreover, Apache Arrow is open-source and optimized for analytics. It is designed to accelerate big data processing while maintaining interoperability with various data tools, such as Pandas, Spark, and Dask. By storing data in a columnar format, Arrow enables faster read/write operations and efficient memory usage, making it ideal for analytical workloads.

Sounds great right? What’s best is that this is all the introduction to Arrow I’ll provide. Enough theory, we want to see it in action. So, in this post, we’ll explore how to use Arrow in Python and how to make the most out of it.

2. Arrow in Python

To get started, you need to install the necessary libraries: pandas and pyarrow.

pip install pyarrow pandas

Then, as always, import them in your Python script:

import pyarrow as pa
import pandas as pd

Nothing new yet, just necessary steps to do what follows. Let’s start by performing some simple operations.

2.1. Creating and Storing a Table

The simplest we can do is hardcode our table’s data. Let’s create a two-column table with football data:

teams = pa.array(['Barcelona', 'Real Madrid', 'Rayo Vallecano', 'Athletic Club', 'Real Betis'], type=pa.string())
goals = pa.array([30, 23, 9, 24, 12], type=pa.int8())

team_goals_table = pa.table([teams, goals], names=['Team', 'Goals'])

The format is pyarrow.table, but we can easily convert it to pandas if we want:

df = team_goals_table.to_pandas()

And restore it back to arrow using:

team_goals_table = pa.Table.from_pandas(df)

And we’ll finally store the table in a file. We could use different formats, like feather, parquet… I’ll use this last one because it’s fast and memory-optimized:

import pyarrow.parquet as pq
pq.write_table(team_goals_table, 'data.parquet')

Reading a parquet file would just consist of using pq.read_table('data.parquet').

2.2. Compute Functions

Arrow has its own compute module for the usual operations. Let’s start by comparing two arrays element-wise:

import pyarrow.compute as pc
>>> a = pa.array([1, 2, 3, 4, 5, 6])
>>> b = pa.array([2, 2, 4, 4, 6, 6])
>>> pc.equal(a,b)
[
  false,
  true,
  false,
  true,
  false,
  true
]

That was easy, we could sum all elements in an array with:

>>> pc.sum(a)
<pyarrow.Int64Scalar: 21>

And from this we could easily guess how we can compute a count, a floor, an exp, a mean, a max, a multiplication… No need to go over them, then. So let’s move to tabular operations.

We’ll start by showing how to sort it:

>>> table = pa.table({'i': ['a','b','a'], 'x': [1,2,3], 'y': [4,5,6]})
>>> pc.sort_indices(table, sort_keys=[('y', 'descending')])
<pyarrow.lib.UInt64Array object at 0x1291643a0>
[
  2,
  1,
  0
]

Just like in pandas, we can group values and aggregate the data. Let’s, for example, group by “i” and compute the sum on “x” and the mean on “y”:

>>> table.group_by('i').aggregate([('x', 'sum'), ('y', 'mean')])
pyarrow.Table
i: string
x_sum: int64
y_mean: double
----
i: [["a","b"]]
x_sum: [[4,2]]
y_mean: [[5,5]]

Or we can join two tables:

>>> t1 = pa.table({'i': ['a','b','c'], 'x': [1,2,3]})
>>> t2 = pa.table({'i': ['a','b','c'], 'y': [4,5,6]})
>>> t1.join(t2, keys="i")
pyarrow.Table
i: string
x: int64
y: int64
----
i: [["a","b","c"]]
x: [[1,2,3]]
y: [[4,5,6]]

By default, it is a left outer join but we could twist it by using the join_type parameter.
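
For instance, with a hypothetical third table that only partially overlaps on the key column, the effect of join_type becomes visible (a quick sketch):

t3 = pa.table({'i': ['a', 'd'], 'z': [7, 8]})

# inner join: keeps only the shared key 'a'
inner = t1.join(t3, keys="i", join_type="inner")

# full outer join: keeps all keys from both sides, filling gaps with nulls
outer = t1.join(t3, keys="i", join_type="full outer")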

There are many more useful operations, but let’s see just one more to avoid making this too long: appending a new column to a table.

>>> t1.append_column("z", pa.array([22, 44, 99]))
pyarrow.Table
i: string
x: int64
z: int64
----
i: [["a","b","c"]]
x: [[1,2,3]]
z: [[22,44,99]]

Before ending this section, we must see how to filter a table or array:

>>> t1.filter((pc.field('x') > 0) & (pc.field('x') < 3))
pyarrow.Table
i: string
x: int64
----
i: [["a","b"]]
x: [[1,2]]

Easy, right? Especially if you’ve been using pandas and numpy for years!

3. Working with files

We’ve already seen how we can read and write Parquet files. But let’s check some other popular file types so that we have several options available.

3.1. Apache ORC

Being very informal, Apache ORC can be understood as the equivalent of Arrow in the realm of file types (even though its origins have nothing to do with Arrow). Being more correct, it’s an open source and columnar storage format. 

Reading and writing it is as follows:

from pyarrow import orc
# Write table
orc.write_table(t1, 't1.orc')
# Read table
t1 = orc.read_table('t1.orc')

As a side note, we could decide to compress the file while writing by using the “compression” parameter.

3.2. CSV

No secret here, pyarrow has the CSV module:

from pyarrow import csv
# Write CSV
csv.write_csv(t1, "t1.csv")
# Read CSV
t1 = csv.read_csv("t1.csv")

# Write CSV compressed and without header
options = csv.WriteOptions(include_header=False)
with pa.CompressedOutputStream("t1.csv.gz", "gzip") as out:
    csv.write_csv(t1, out, options)

# Read compressed CSV and add custom header
t1 = csv.read_csv("t1.csv.gz", read_options=csv.ReadOptions(
    column_names=["i", "x"]
))

3.3. JSON

Pyarrow allows JSON reading but not writing. It’s pretty straightforward, let’s see an example supposing we have our JSON data in “data.json”:

from pyarrow import json
# Read json
fn = "data.json"
table = json.read_json(fn)

# We can now convert it to pandas if we want to
df = table.to_pandas()

3.4. Feather

Feather is a portable file format for storing Arrow tables or data frames (from languages like Python or R) that utilizes the Arrow IPC format internally. So, contrary to Apache ORC, this one was indeed created early in the Arrow project.

from pyarrow import feather
# Write feather from pandas DF
feather.write_feather(df, "t1.feather")
# Write feather from table, and compressed
feather.write_feather(t1, "t1.feather.lz4", compression="lz4")

# Read feather into table
t1 = feather.read_table("t1.feather")
# Read feather into df
df = feather.read_feather("t1.feather")

4. Advanced Features

We just touched upon the most basic features and what the majority would need while working with Arrow. However, its amazingness doesn’t end here, it’s right where it starts.

As this will be quite domain-specific and not useful for anyone (nor considered introductory) I’ll just mention some of these features without using any code:

  • We can handle memory management through the Buffer type (built on top of C++ Buffer object). Creating a buffer with our data does not allocate any memory; it is a zero-copy view on the memory exported from the data bytes object. Keeping up with this memory management, an instance of MemoryPool tracks all the allocations and deallocations (like malloc and free in C). This allows us to track the amount of memory being allocated.
  • Similarly, there are different ways to work with input/output streams in batches.
  • PyArrow comes with an abstract filesystem interface, as well as concrete implementations for various storage types. So, for example,  we can write and read parquet files from an S3 bucket using the S3FileSystem. Google Cloud and Hadoop Distributed File System (HDFS) are also accepted.

5. Conclusion and Key Takeaways

Apache Arrow is a powerful tool for efficient Data Handling in Python. Its columnar storage format, zero-copy reads, and interoperability with popular data processing libraries make it ideal for data science workflows. By integrating Arrow into your pipeline, you can significantly boost performance and optimize memory usage.

6. Resources

The post Efficient Data Handling in Python with Arrow appeared first on Towards Data Science.

]]>
How To Generate GIFs from 3D Models with Python https://towardsdatascience.com/how-to-generate-gifs-from-3d-models-with-python/ Fri, 21 Feb 2025 02:23:47 +0000 https://towardsdatascience.com/?p=598194 Complete Tutorial to Automate 3D Data Visualization. Use Python to convert point clouds and 3D models into GIFs & MP4s for easy sharing and collaboration

The post How To Generate GIFs from 3D Models with Python appeared first on Towards Data Science.

]]>
As a data scientist, you know that effectively communicating your insights is as important as the insights themselves.

But how do you communicate over 3D data?

I can bet most of us have been there: you spend days, weeks, maybe even months meticulously collecting and processing 3D data. Then comes the moment to share your findings, whether it’s with clients, colleagues, or the broader scientific community. You throw together a few static screenshots, but they just don’t capture the essence of your work. The subtle details, the spatial relationships, the sheer scale of the data—it all gets lost in translation.

Comparing 3D Data Communication Methods. © F. Poux

Or maybe you’ve tried using specialized 3D visualization software. But when your client uses it, they struggle with clunky interfaces, steep learning curves, and restrictive licensing.

What should be a smooth, intuitive process becomes a frustrating exercise in technical acrobatics. It’s an all-too-common scenario: the brilliance of your 3D data is trapped behind a wall of technical barriers.

This highlights a common issue: the need to create shareable content that can be opened by anyone, i.e., that does not demand specific 3D data science skills.

Think about it: what is the most used way to share visual information? Images.

But how can we convey the 3D information from a simple 2D image?

Well, let us use “first principles thinking”: let us create shareable content by stacking multiple 2D views, such as GIFs or MP4s, from raw point clouds.

The bread of magic to generate GIF and MP4. © F. Poux

This process is critical for presentations, reports, and general communication. But generating GIFs and MP4s from 3D data can be complex and time-consuming. I’ve often found myself wrestling with the challenge of quickly generating rotating GIF or MP4 files from a 3D point cloud, a task that seemed simple enough but often spiraled into a time-consuming ordeal. 

Current workflows might lack efficiency and ease of use, and a streamlined process can save time and improve data presentation.

Let me share a solution that involves leveraging Python and specific libraries to automate the creation of GIFs and MP4s from point clouds (or any 3D dataset such as a mesh or a CAD model).

Think about it. You’ve spent hours meticulously collecting and processing this 3D data. Now, you need to present it in a compelling way for a presentation or a report. But how can we be sure it can be integrated into a SaaS solution where it is triggered on upload? You try to create a dynamic visualization to showcase a critical feature or insight, and yet you’re stuck manually capturing frames and stitching them together. How can we automate this process to seamlessly integrate it into your existing systems?

An example of a GIF generated with the methodology. © F. Poux

If you are new to my (3D) writing world, welcome! We are going on an exciting adventure that will allow you to master an essential 3D Python skill. Before diving, I like to establish a clear scenario, the mission brief.

Once the scene is laid out, we embark on the Python journey. Everything is given. You will see Tips (🦚Notes and 🌱Growing) to help you get the most out of this article. Thanks to the 3D Geodata Academy for supporting the endeavor.

The Mission 🎯

You are working for a new engineering firm, “Geospatial Dynamics,” which wants to showcase its cutting-edge LiDAR scanning services. Instead of sending clients static point cloud images, you propose to use a new tool, which is a Python script, to generate dynamic rotating GIFs of project sites.

After doing some market research, you found that this can immediately elevate their proposals, resulting in a 20% higher project approval rate. That’s the power of visual storytelling.

The three stages of the mission towards increased project approval. © F. Poux

On top of that, you can even imagine a more compelling scenario, where “Geospatial Dynamics” is able to process point clouds at scale and then generate MP4 videos that are sent to potential clients. This way, you lower churn and make the brand more memorable.

With that in mind, we can start designing a robust framework to answer our mission’s goal.

The Framework

I remember a project where I had to show a detailed architectural scan to a group of investors. The usual still images just could not capture the fine details. I desperately needed a way to create a rotating GIF to convey the full scope of the design. That is why I’m excited to introduce this Cloud2Gif Python solution. With this, you’ll be able to easily generate shareable visualizations for presentations, reports, and communication.

The framework I propose is straightforward yet effective. It takes raw 3D data, processes it using Python and the PyVista library, generates a series of frames, and stitches them together to create a GIF or MP4 video. The high-level workflow includes:

The various stages of the framework in this article. © F. Poux

1. Loading the 3D data (mesh with texture).

2. Loading a 3D Point Cloud

3. Setting up the visualization environment.

4. Generating a GIF

 4.1. Defining a camera orbit path around the data.

 4.2. Rendering frames from different viewpoints along the path.

 4.3. Encoding the frames into a GIF or

5. Generating an orbital MP4

6. Creating a Function

7. Testing with multiple datasets

This streamlined process allows for easy customization and integration into existing workflows. The key advantage here is the simplicity of the approach. By leveraging the basic principles of 3D data rendering, a very efficient and self-contained script can be put together and deployed on any system as long as Python is installed.

This makes it compatible with various edge computing solutions and allows for easy integration with sensor-heavy systems. The goal is to generate a GIF and an MP4 from a 3D data set. The process is simple, requiring a 3D data set, a bit of magic (the code), and the output as GIF and MP4 files.

The growth of the solution as we move along the major stages. © F. Poux

Now, what are the tools and libraries that we will need for this endeavor?

1. Setup Guide: The Libraries, Tools and Data

© F. Poux

For this project, we primarily use the following two Python libraries:

  • NumPy: The cornerstone of numerical computing in Python. Without it, I would have to deal with every vertex (point) in a very inefficient way. NumPy Official Website
  • pyvista: A high-level interface to the Visualization Toolkit (VTK). PyVista enables me to easily visualize and interact with 3D data. It handles rendering, camera control, and exporting frames. PyVista Official Website
PyVista and Numpy libraries for 3D Data. © F. Poux

These libraries provide all the necessary tools to handle data processing, visualization, and output generation. This set of libraries was carefully chosen so that a minimal amount of external dependencies is present, which improves sustainability and makes it easily deployable on any system.

Let me share the details of the environment as well as the data preparation setup.

Quick Environment Setup Guide

Let me provide very brief details on how to set up your environment.

Step 1: Install Miniconda

Four simple steps to get a working Miniconda version:

How to install Anaconda for 3D Coding. © F. Poux

Step 2: Create a new environment

You can run the following code in your terminal

conda create -n pyvista_env python=3.10
conda activate pyvista_env

Step 3: Install required packages

For this, you can leverage pip as follows:

pip install numpy
pip install pyvista

Step 4: Test the installation

If you want to test your installation, type python in your terminal and run the following lines:

import numpy as np
import pyvista as pv
print(f"PyVista version: {pv.__version__}")

This should return the pyvista version. Do not forget to exit Python from your terminal afterward (Ctrl+C).

🦚 Note: Here are some common issues and workarounds:

  • If PyVista doesn’t show a 3D window: pip install vtk
  • If environment activation fails: Restart the terminal
  • If data loading fails: Check file format compatibility (PLY, LAS, LAZ supported)

Beautiful, at this stage, your environment is ready. Now, let me share some quick ways to get your hands on 3D datasets.

Data Preparation for 3D Visualization

At the end of the article, I share with you the datasets as well as the code. However, in order to ensure you are fully independent, here are three reliable sources I regularly use to get my hands on point cloud data:

The LiDAR Data Download Process. © F. Poux

The USGS 3DEP LiDAR Point Cloud Downloads

OpenTopography

ETH Zurich’s PCD Repository

For quick testing, you can also use PyVista’s built-in example data:

# Load sample data
from pyvista import examples
terrain = examples.download_crater_topo()
terrain.plot()

🦚 Note: Remember to always check the data license and attribution requirements when using public datasets.

Finally, to ensure a complete setup, below is a typical expected folder structure:

project_folder/
├── environment.yml
├── data/
│ └── pointcloud.ply
└── scripts/
└── gifmaker.py

Beautiful, we can now jump right onto the first stage: loading and visualizing textured mesh data.

2. Loading and Visualizing Textured Mesh Data

One first critical step is properly loading and rendering 3D data. In my research laboratory, I have found that PyVista provides an excellent foundation for handling complex 3D visualization tasks. 

© F. Poux

Here’s how you can approach this fundamental step:

import numpy as np
import pyvista as pv

mesh = pv.examples.load_globe()
texture = pv.examples.load_globe_texture()

pl = pv.Plotter()
pl.add_mesh(mesh, texture=texture, smooth_shading=True)
pl.show()

This code snippet loads a textured globe mesh, but the principles apply to any textured 3D model.

The earth rendered as a sphere with PyVista. © F. Poux

Let me say a bit about the smooth_shading parameter. It’s a small setting that renders surfaces as continuous (as opposed to faceted), which, in the case of spherical objects, improves the visual impact.

Now, this is just a starter for 3D mesh data. This means that we deal with surfaces that join points together. But what if we want to work solely with point-based representations? 

In that scenario, we have to consider shifting our data processing approach to propose solutions to the unique visual challenges attached to point cloud datasets.

3. Point Cloud Data Integration

Point cloud visualization demands extra attention to detail. In particular, adjusting the point density and the way we represent points on the screen has a noticeable impact. 

© F. Poux

Let us use a PLY file for testing (see the end of the article for resources). 

The example PLY point cloud data with PyVista. © F. Poux

You can load a point cloud with pv.read and create scalar fields for better visualization (such as a scalar field based on the height or the extent around the center of the point cloud).

In my work with LiDAR datasets, I’ve developed a simple, systematic approach to point cloud loading and initial visualization:

cloud = pv.read('street_sample.ply')
scalars = np.linalg.norm(cloud.points - cloud.center, axis=1)

pl = pv.Plotter()
pl.add_mesh(cloud)
pl.show()

The scalar computation here is particularly important. By calculating the distance from each point to the cloud’s center, we create a basis for color-coding that helps convey depth and structure in our visualizations. This becomes especially valuable when dealing with large-scale point clouds where spatial relationships might not be immediately apparent.

Moving from basic visualization to creating engaging animations requires careful consideration of the visualization environment. Let’s explore how to optimize these settings for the best possible results.

4. Optimizing the Visualization Environment

The visual impact of our animations heavily depends on the visualization environment settings. 

© F. Poux

Through extensive testing, I’ve identified key parameters that consistently produce professional-quality results:

pl = pv.Plotter(off_screen=False)
pl.add_mesh(
   cloud,
   style='points',
   render_points_as_spheres=True,
   emissive=False,
   color='#fff7c2',
   scalars=scalars,
   opacity=1,
   point_size=8.0,
   show_scalar_bar=False
   )

pl.add_text('test', color='b')
pl.background_color = 'k'
pl.enable_eye_dome_lighting()
pl.show()

As you can see, the plotter is initialized with off_screen=False to render directly to the screen. The point cloud is then added to the plotter with the specified styling. The style='points' parameter ensures that the point cloud is rendered as individual points. The scalars=scalars argument uses the previously computed scalar field for coloring, while point_size sets the size of the points and opacity adjusts the transparency. A base color is also set.

🦚 Note: In my experience, rendering points as spheres significantly improves the depth perception in the final generated animation. You can also combine this by using the eye_dome_lighting feature. This algorithm adds another layer of depth cues through some sort of normal-based shading, which makes the structure of point clouds more apparent.

You can play around with the various parameters until you obtain a rendering that is satisfying for your applications. Then, I propose that we move to creating the animated GIFs.

A GIF of the point cloud. © F. Poux

5. Creating Animated GIFs

At this stage, our aim is to generate a series of renderings by varying the viewpoint from which we generate these. 

© F. Poux

This means that we need to design a camera path that is sound, from which we can generate frame rendering. 

This means that to generate our GIF, we must first create an orbiting path for the camera around the point cloud. Then, we can sample the path at regular intervals and capture frames from different viewpoints. 

These frames can then be used to create the GIF. Here are the steps:

The 4 stages in the animated gifs generation. © F. Poux
  1. I change to off-screen rendering
  2. I take the cloud length parameters to set the camera
  3. I create a path
  4. I create a loop that takes each point of this path

Which translates into the following:

pl = pv.Plotter(off_screen=True, image_scale=2)
pl.add_mesh(
   cloud,
   style='points',
   render_points_as_spheres=True,
   emissive=False,
   color='#fff7c2',
   scalars=scalars,
   opacity=1,
   point_size=5.0,
   show_scalar_bar=False
   )

pl.background_color = 'k'
pl.enable_eye_dome_lighting()
pl.show(auto_close=False)

viewup = [0, 0, 1]

path = pl.generate_orbital_path(n_points=40, shift=cloud.length, viewup=viewup, factor=3.0)
pl.open_gif("orbit_cloud_2.gif")
pl.orbit_on_path(path, write_frames=True, viewup=viewup)
pl.close()

As you can see, an orbital path is created around the point cloud using pl.generate_orbital_path(). The orbit’s extent is scaled by the factor parameter, the path is shifted above the scene by cloud.length, and the viewup vector [0, 0, 1] defines the normal of the orbital plane, meaning the circle lies in the XY plane.

From there, orbit_on_path() moves the camera along this path and writes one frame per path point into the GIF (the camera’s focal point stays at the center of the point cloud).

The image_scale parameter deserves special attention—it determines the resolution of our output. 

I’ve found that a value of 2 provides a good balance between the perceived quality and the file size. Also, the viewup vector is crucial for maintaining proper orientation throughout the animation. You can experiment with its value if you want a rotation following a non-horizontal plane.

This results in a GIF that you can use to communicate very easily. 

Another synthetic point cloud generated GIF. © F. Poux

But we can push one extra stage: creating an MP4 video. This can be useful if you want to obtain higher-quality animations with smaller file sizes as compared to GIFs (which are not as compressed).

6. High-Quality MP4 Video Generation

The generation of an MP4 video follows the exact same principles as we used to generate our GIF. 

© F. Poux

Therefore, let me get straight to the point. To generate an MP4 file from any point cloud, we can reason in four stages:

© F. Poux
  • Gather your configurations over the parameters that best suit you.
  • Create an orbital path the same way you did with GIFs
  • Instead of using the open_gif function, let us use it open_movie to write a “movie” type file.
  • We orbit on the path and write the frames, similarly to our GIF method.

🦚 Note: Don’t forget to use your proper configuration in the definition of the path.

This is what the end result looks like with code:

pl = pv.Plotter(off_screen=True, image_scale=1)
pl.add_mesh(
   cloud,
   style='points_gaussian',
   render_points_as_spheres=True,
   emissive=True,
   color='#fff7c2',
   scalars=scalars,
   opacity=0.15,
   point_size=5.0,
   show_scalar_bar=False
   )

pl.background_color = 'k'
pl.show(auto_close=False)

viewup = [0.2, 0.2, 1]

path = pl.generate_orbital_path(n_points=40, shift=cloud.length, viewup=viewup, factor=3.0)
pl.open_movie("orbit_cloud.mp4")
pl.orbit_on_path(path, write_frames=True)
pl.close()

Notice the use of points_gaussian style and adjusted opacity—these settings provide interesting visual quality in video format, particularly for dense point clouds.

And now, what about streamlining the process?

7. Streamlining the Process with a Custom Function

© F. Poux

To make this process more efficient and reproducible, I’ve developed a function that encapsulates all these steps:

def cloudgify(input_path):
   # Load the point cloud and derive scalars from the distance to its center
   cloud = pv.read(input_path)
   scalars = np.linalg.norm(cloud.points - cloud.center, axis=1)

   # --- GIF generation (40 frames) ---
   pl = pv.Plotter(off_screen=True, image_scale=1)
   pl.add_mesh(
       cloud,
       style='points',
       render_points_as_spheres=True,
       emissive=False,
       color='#fff7c2',
       scalars=scalars,
       opacity=0.65,
       point_size=5.0,
       show_scalar_bar=False
       )
   pl.background_color = 'k'
   pl.enable_eye_dome_lighting()
   pl.show(auto_close=False)

   viewup = [0, 0, 1]

   path = pl.generate_orbital_path(n_points=40, shift=cloud.length, viewup=viewup, factor=3.0)
   pl.open_gif(input_path.split('.')[0]+'.gif')
   pl.orbit_on_path(path, write_frames=True, viewup=viewup)
   pl.close()

   # --- MP4 generation (100 frames for a smoother video) ---
   # A fresh plotter is created here because a closed plotter cannot render new frames.
   pl = pv.Plotter(off_screen=True, image_scale=1)
   pl.add_mesh(
       cloud,
       style='points',
       render_points_as_spheres=True,
       emissive=False,
       color='#fff7c2',
       scalars=scalars,
       opacity=0.65,
       point_size=5.0,
       show_scalar_bar=False
       )
   pl.background_color = 'k'
   pl.enable_eye_dome_lighting()
   pl.show(auto_close=False)

   path = pl.generate_orbital_path(n_points=100, shift=cloud.length, viewup=viewup, factor=3.0)
   pl.open_movie(input_path.split('.')[0]+'.mp4')
   pl.orbit_on_path(path, write_frames=True)
   pl.close()

   return

🦚 Note: This function standardizes our visualization process while remaining easy to adapt to your own settings. It incorporates several optimizations I’ve developed through extensive testing. Note the different n_points values for GIF (40) and MP4 (100), which balances file size and smoothness appropriately for each format. The automatic filename generation with split('.')[0] ensures consistent output naming.

And what better than to test our new creation on multiple datasets?

8. Batch Processing Multiple Datasets

© F. Poux

Finally, we can apply our function to multiple datasets:

dataset_paths= ["lixel_indoor.ply", "NAAVIS_EXTERIOR.ply", "pcd_synthetic.ply", "the_adas_lidar.ply"]

for pcd in dataset_paths:
   cloudgify(pcd)

This approach can be remarkably efficient when processing large datasets made of several files. Indeed, if your parametrization is sound, you can maintain consistent 3D visualization across all outputs.

🌱 Growing: I am a big fan of 0% supervision to create 100% automatic systems. This means that if you want to push the experiments even further, I suggest investigating ways to automatically infer the parameters from the data, i.e., data-driven heuristics. Here is an example of a paper I wrote a couple of years ago that focuses on such an approach for unsupervised segmentation (Automation in Construction, 2022).

A Little Discussion 

Alright, you know my tendency to push innovation. While relatively simple, this Cloud2Gif solution has direct applications that can help you deliver better experiences. Three of them come to mind, which I leverage on a weekly basis:

© F. Poux
  • Interactive Data Profiling and Exploration: By generating GIFs of complex simulation results, I can profile my results at scale very quickly. The qualitative analysis then becomes a matter of slicing a sheet filled with metadata and GIFs to check whether the results are on par with my metrics. This is very handy.
  • Educational Materials: I often use this script to generate engaging visuals for my online courses and tutorials, enhancing the learning experience for the professionals and students who go through them. This is especially true now that most material is found online, where we can leverage the capacity of browsers to play animations.
  • Real-time Monitoring Systems: I worked on integrating this script into a real-time monitoring system to generate visual alerts based on sensor data. This is especially relevant for sensor-heavy systems, where it can be difficult to extract meaning from the point cloud representation manually. When designing 3D capture systems that leverage SLAM or other techniques, a real-time feedback loop helps ensure a cohesive registration.

However, when we consider the broader research landscape and the pressing needs of the 3D data community, the real value proposition of this approach becomes evident. Scientific research is increasingly interdisciplinary, and communication is key. We need tools that enable researchers from diverse backgrounds to understand and share complex 3D data easily.

The Cloud2Gif script is self-contained and requires minimal external dependencies. This makes it ideally suited for deployment on resource-constrained edge devices. And this may be the top application that I worked on, leveraging such a straightforward approach.

As a little digression, I saw the positive impact of the script in two scenarios. First, I designed an environmental monitoring system for diseases in farmland crops. This was a 3D project, and I could include the generation of visual alerts (with an MP4 file) based on the real-time LiDAR sensor data. A great project!

In another context, I wanted to provide visual feedback to on-site technicians using a SLAM-equipped system for mapping purposes. I integrated the process to generate a GIF every 30 seconds that showed the current state of data registration. It was a great way to ensure consistent data capture. This actually allowed us to reconstruct complex environments with better consistency in managing our data drift.

Conclusion

Today, I walked through a simple yet powerful Python script to transform 3D data into dynamic GIFs and MP4 videos. This script, combined with libraries like NumPy and PyVista, allows us to create engaging visuals for various applications, from presentations to research and educational materials.

The key here is accessibility: the script is easily deployable and customizable, providing an immediate way of transforming complex data into an accessible format. This Cloud2Gif script is an excellent piece for your application if you need to share, assess, or get quick visual feedback within data acquisition situations.

What is next?

Well, if you feel up for a challenge, you can create a simple web application that allows users to upload point clouds, trigger the video generation process, and download the resulting GIF or MP4 file. 

A lightweight framework such as Flask is enough to prototype this: one route to upload a point cloud, one to trigger the generation process, and one to serve the resulting file.

You can then deploy this web application on Amazon Web Services so that it is scalable and easily accessible to anyone, with minimal maintenance.
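
Below is a minimal, illustrative sketch of such a Flask app (not the author's implementation). It assumes the cloudgify() function defined earlier lives in a module named cloud2gif, and the upload folder and route names are placeholders to adapt to your setup.

# app.py: minimal sketch of a Cloud2Gif web service (illustrative only, not production-ready)
import os
from flask import Flask, request, send_file

from cloud2gif import cloudgify  # assumes the cloudgify() function above is saved in cloud2gif.py

UPLOAD_DIR = "uploads"  # hypothetical folder for incoming point clouds
os.makedirs(UPLOAD_DIR, exist_ok=True)

app = Flask(__name__)

@app.route("/animate", methods=["POST"])
def animate():
    # Receive a point cloud file (e.g., a .ply) from the client
    uploaded = request.files["pointcloud"]
    input_path = os.path.join(UPLOAD_DIR, uploaded.filename)
    uploaded.save(input_path)

    # Run the Cloud2Gif pipeline, then return the generated GIF
    cloudgify(input_path)
    gif_path = input_path.rsplit(".", 1)[0] + ".gif"
    return send_file(gif_path, mimetype="image/gif")

if __name__ == "__main__":
    app.run(debug=True)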

These are skills that you develop through the Segmentor OS Program at the 3D Geodata Academy.

About the author

Florent Poux, Ph.D. is a Scientific and Course Director focused on educating engineers on leveraging AI and 3D Data Science. He leads research teams and teaches 3D Computer Vision at various universities. His current aim is to ensure humans are correctly equipped with the knowledge and skills to tackle 3D challenges for impactful innovations.

Resources

  1. 🏆Awards: Jack Dangermond Award
  2. 📕Book: 3D Data Science with Python
  3. 📜Research: 3D Smart Point Cloud (Thesis)
  4. 🎓Courses: 3D Geodata Academy Catalog
  5. 💻Code: Florent’s Github Repository
  6. 💌3D Tech Digest: Weekly Newsletter

The post How To Generate GIFs from 3D Models with Python appeared first on Towards Data Science.

]]>
Data Scientist: From School to Work, Part I https://towardsdatascience.com/data-scientist-from-school-to-work-part-i/ Wed, 19 Feb 2025 12:00:00 +0000 https://towardsdatascience.com/?p=597916 Nowadays, data science projects do not end with the proof of concept; every project has the goal of being used in production. It is important, therefore, to deliver high-quality code. I have been working as a data scientist for more than ten years and I have noticed that juniors usually have a weak level in […]

The post Data Scientist: From School to Work, Part I appeared first on Towards Data Science.

]]>
Nowadays, data science projects do not end with the proof of concept; every project has the goal of being used in production. It is important, therefore, to deliver high-quality code. I have been working as a data scientist for more than ten years, and I have noticed that juniors are often weak in development, which is understandable: to be a data scientist, you need to master math, statistics, algorithmics, and development, and also have knowledge of operational development. In this series of articles, I would like to share some tips and good practices for managing a professional data science project in Python. From Python to Docker, with a detour via Git, I will present the tools I use every day.


The other day, a colleague told me how he had to reinstall Linux after a mishap with Python. He had restored an old project that he wanted to customize, and, as a result of installing and uninstalling packages and changing versions, his Linux-based Python environment was no longer functional. It is an incident that could easily have been avoided by setting up a virtual environment, and it shows how important it is to manage these environments. Fortunately, there is now an excellent tool for this: uv.
The origin of these two letters is not clear. According to Zanie Blue (one of the creators):

“We considered a ton of names — it’s really hard to pick a name without collisions this day so every name was a balance of tradeoffs. uv was given to us on PyPI, is Astral-themed (i.e. ultraviolet or universal), and is short and easy to type.”

Now, let’s go into a little more detail about this wonderful tool.


Introduction

UV is a modern, minimalist manager for Python projects and packages. Developed entirely in Rust, it has been designed to simplify Dependency Management, virtual environment creation and project organization, and to limit common Python project problems such as dependency conflicts and broken environments. It aims to offer a smoother, more intuitive experience than traditional tools such as the pip + virtualenv combo or the Conda manager, and is claimed to be 10 to 100 times faster.

Whether for small personal projects or developing Python applications for production, UV is a robust and efficient solution for package management. 


Starting with UV

Installation

To install UV on Windows, I recommend using this command in a shell:

winget install --id=astral-sh.uv  -e

And if you are on Mac or Linux, use the installer script from the UV documentation:
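
curl -LsSf https://astral.sh/uv/install.sh | sh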

To verify correct installation, simply type the following command into a terminal:

uv version

Creation of a new Python project

Using UV you can create a new project by specifying the version of Python. To start a new project, simply type into a terminal:

uv init --python x.xx project_name

x.xx must be replaced by the desired Python version (e.g. 3.12). If you do not have the specified Python version installed, UV will take care of it and download the correct version to start the project.

This command creates a project directory named project_name and automatically initializes it as a Git repository. It contains several files:

  • A .gitignore file. It lists the elements of the repository to be ignored by Git versioning (it is basic and should be rewritten for a project ready to deploy).
  • A .python-version file. It indicates the Python version used in the project.
  • The README.md file. Its purpose is to describe the project and explain how to use it.
  • A hello.py file.
  • The pyproject.toml file. This file contains all the information about the tools used to build the project.
  • The uv.lock file. It is used to create the virtual environment when you use uv to run the script (it can be compared to the requirements.txt file).

Package installation

To install new packages in this new environment, you have to use:

uv add package_name

When the add command is used for the first time, UV creates a new virtual environment in the current working directory and installs the specified dependencies. A .venv/ directory appears. On subsequent runs, UV will use the existing virtual environment and install or update only the new packages requested. In addition, UV has a powerful dependency resolver. When executing the add command, UV analyzes the entire dependency graph to find a compatible set of package versions that meet all requirements (package version and Python version). Finally, UV updates the pyproject.toml and uv.lock files after each add command.

To uninstall a package, type the command:

uv remove package_name

It is very important to clean unused packages from your environment. Keep the dependency file as minimal as possible: if a package is not used or no longer needed, it must be deleted.

Run a Python script

Now your repository is initialized, your packages are installed, and your code is ready to be tested. You can activate the created virtual environment as usual, but it is more efficient to use the UV run command:

uv run hello.py

Using the run command guarantees that the script will be executed in the virtual environment of the project.


Manage the Python versions

You will often need to work with different Python versions. As mentioned in the introduction, you may be working on an old project that requires an old Python version, and it is often too difficult to update it. To list the Python versions installed on your machine (and those available for download), type the command:

uv python list

At any time, it is possible to change the Python version of your project. To do that, you have to modify the line requires-python in the pyproject.toml file.

For instance: requires-python = “>=3.9”

Then you have to synchronize your environment using the command:

uv sync

The command first checks existing Python installations. If the requested version is not found, UV downloads and installs it. UV also creates a new virtual environment in the project directory, replacing the old one.

But the new environment does not yet have your project installed. Thus, after a sync command, you have to type:

uv pip install -e .

Switch from virtualenv to uv

If you have a Python project initiated with pip and virtualenv and wish to switch to UV, nothing could be simpler. If there is no requirements file, activate your old virtual environment and export the installed packages and their versions:

pip freeze > requirements.txt

Then, you have to init the project with UV and install the dependencies:

uv init .
uv pip install -r requirements.txt
Correspondence table between pip + virtualenv and UV, image by author.

Use the tools

UV offers the possibility of using tools via the uv tool command. Tools are Python packages that provide command-line interfaces, such as ruff, pytest, mypy, etc. To install a tool, type the command:

uv tool install tool_name

But, a tool can be used without having been installed:

uv tool run tool_name

For convenience, an alias was created: uvx, which is equivalent to uv tool run. So, to run a tool, just type:

uvx tool_name

Conclusion

UV is a powerful and efficient Python package manager designed to provide fast dependency resolution and installation. It significantly outperforms traditional tools like pip or conda, making it an excellent choice to manage your Python projects.

Whether you’re working on small scripts or large projects, I recommend you get into the habit of using UV. And believe me, trying it out means adopting it.


References

1 — UV documentation: https://docs.astral.sh/uv/

2 — UV GitHub repository: https://github.com/astral-sh/uv

3 — A great datacamp article: https://www.datacamp.com/tutorial/python-uv

The post Data Scientist: From School to Work, Part I appeared first on Towards Data Science.

]]>
Tutorial: Semantic Clustering of User Messages with LLM Prompts https://towardsdatascience.com/tutorial-semantic-clustering-of-user-messages-with-llm-prompts/ Mon, 17 Feb 2025 15:00:00 +0000 https://towardsdatascience.com/?p=598031 As a Developer Advocate, it’s challenging to keep up with user forum messages and understand the big picture of what users are saying. There’s plenty of valuable content — but how can you quickly spot the key conversations? In this tutorial, I’ll show you an AI hack to perform semantic clustering simply by prompting LLMs! […]

The post Tutorial: Semantic Clustering of User Messages with LLM Prompts appeared first on Towards Data Science.

]]>
As a Developer Advocate, it’s challenging to keep up with user forum messages and understand the big picture of what users are saying. There’s plenty of valuable content — but how can you quickly spot the key conversations? In this tutorial, I’ll show you an AI hack to perform semantic clustering simply by prompting LLMs!

TL;DR 🔄 this blog post is about how to go from (data science + code) → (AI prompts + LLMs) for the same results — just faster and with less effort! 🤖⚡. It is organized as follows:

  • Inspiration and Data Sources
  • Exploring the Data with Dashboards
  • LLM Prompting to produce KNN Clusters
  • Experimenting with Custom Embeddings
  • Clustering Across Multiple Discord Servers

Inspiration and Data Sources

First, I’ll give props to the December 2024 paper Clio (Claude insights and observations), a privacy-preserving platform that uses AI assistants to analyze and surface aggregated usage patterns across millions of conversations. Reading this paper inspired me to try this.

Data. I used only publicly available Discord messages, specifically “forum threads”, where users ask for tech help. In addition, I aggregated and anonymized content for this blog. Per thread, I formatted the data into conversation-turn format, with roles identified as either “user” (asking the question) or “assistant” (anyone answering the user’s initial question). I also added a simple, hard-coded binary sentiment score (0 for “not happy” and 1 for “happy”) based on whether the user said thank you anywhere in their thread. For vectorDB vendors I used Zilliz/Milvus, Chroma, and Qdrant.
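
As a rough illustration, a single thread in this conversation-turn format might look like the following (field names and values are hypothetical, for illustration only):

# Hypothetical example of one thread in conversation-turn format
thread = [
    {"thread_id": 2, "role_name": "user", "message_content": "How do I create a collection with cosine distance?", "timestamp": "2025-01-15T10:02:00"},
    {"thread_id": 2, "role_name": "assistant", "message_content": "Pass the distance metric when creating the collection.", "timestamp": "2025-01-15T10:05:00"},
    {"thread_id": 2, "role_name": "user", "message_content": "That worked, thank you!", "timestamp": "2025-01-15T10:09:00"},
]
# The last user message contains "thank you", so this thread would get score = 1.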

The first step was to convert the anonymized data into a pandas data frame with columns: score, user, role, message, timestamp, thread, user_turns. Below is an excerpt: for thread_id=2, a user asked only one question, while for thread_id=3, a user asked four different questions in the same thread (the other two questions are at later timestamps, not shown below).

I added a naive sentiment 0|1 scoring function.

Python">def calc_score(df):
   # Define the target words
   target_words = ["thanks", "thank you", "thx", "🙂", "😉", "👍"]


   # Helper function to check if any target word is in the concatenated message content
   def contains_target_words(messages):
       concatenated_content = " ".join(messages).lower()
       return any(word in concatenated_content for word in target_words)


   # Group by 'thread_id' and calculate score for each group
   thread_scores = (
       df[df['role_name'] == 'user']
       .groupby('thread_id')['message_content']
       .apply(lambda messages: int(contains_target_words(messages)))
   )
   # Map the calculated scores back to the original DataFrame
   df['score'] = df['thread_id'].map(thread_scores)
   return df


...


if __name__ == "__main__":
  
   # Load parameters from YAML file
   config_path = "config.yaml"
   params = load_params(config_path)
   input_data_folder = params['input_data_folder']
   processed_data_dir = params['processed_data_dir']
   threads_data_file = os.path.join(processed_data_dir, "thread_summary.csv")
  
   # Read data from Discord Forum JSON files into a pandas df.
   clean_data_df = process_json_files(
       input_data_folder,
       processed_data_dir)
  
   # Calculate score based on specific words in message content
   clean_data_df = calc_score(clean_data_df)


   # Generate reports and plots
   plot_all_metrics(processed_data_dir)


   # Concat thread messages & save as CSV for prompting.
   thread_summary_df, avg_message_len, avg_message_len_user = \
   concat_thread_messages_df(clean_data_df, threads_data_file)
   assert thread_summary_df.shape[0] == clean_data_df.thread_id.nunique()

Exploring the Data with Dashboards

From the processed data above, I built traditional dashboards (a minimal aggregation sketch follows the chart notes below):

  • Message Volumes: One-off peaks in vendors like Qdrant and Milvus (possibly due to marketing events).
  • User Engagement: Top users bar charts and scatterplots of response time vs. number of user turns show that, in general, more user turns mean higher satisfaction. But, satisfaction does NOT look correlated with response time. Scatterplot dark dots seem random with regard to y-axis (response time). Maybe users are not in production, their questions are not very urgent? Outliers exist, such as Qdrant and Chroma, which may have bot-driven anomalies.
  • Satisfaction Trends: Around 70% of users appear happy to have any interaction. Data note: make sure to check emojis per vendor; sometimes users respond using emojis instead of words (see Qdrant and Chroma, for example).
Image by author of aggregated, anonymized data. Top lefts: Charts display Chroma’s highest message volume, followed by Qdrant, and then Milvus. Top rights: Top messaging users, Qdrant + Chroma possible bots (see top bar in top messaging users chart). Middle rights: Scatterplots of Response time vs Number of user turns shows no correlation with respect to dark dots and y-axis (response time). Usually higher satisfaction w.r.t. x-axis (user turns), except Chroma. Bottom lefts: Bar charts of satisfaction levels, make sure you catch possible emoji-based feedback, see Qdrant and Chroma.
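
For reference, here is a minimal sketch of how such per-vendor aggregates might be computed from the processed data frame. It assumes a clean_data_df with the columns used above plus a hypothetical vendor column; adapt the names to your own data:

import matplotlib.pyplot as plt

# Assumes clean_data_df has columns: vendor, thread_id, message_content, score
# (the vendor column is illustrative; match these names to your own processed data)

# Message volume per vendor
msg_volume = clean_data_df.groupby("vendor")["message_content"].count()
msg_volume.plot(kind="bar", title="Message volume per vendor")
plt.show()

# Average satisfaction (0/1 score) per vendor, counting each thread once
thread_scores = clean_data_df.drop_duplicates("thread_id").groupby("vendor")["score"].mean()
thread_scores.plot(kind="bar", title="Share of satisfied threads per vendor")
plt.show()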

LLM Prompting to produce KNN Clusters

For prompting, the next step was to aggregate data by thread_id. For LLMs, you need the texts concatenated together. I separated out user messages from entire thread messages, to see whether one or the other would produce better clusters. I ended up using just the user messages.

Example anonymized data for prompting. All message texts concatenated together.

With a CSV file for prompting, you’re ready to get started using an LLM to do data science!

!pip install -q google-generativeai
import os
import google.generativeai as genai


# Get API key from local system
api_key=os.environ.get("GOOGLE_API_KEY")


# Configure API key
genai.configure(api_key=api_key)


# List all the model names
for m in genai.list_models():
   if 'generateContent' in m.supported_generation_methods:
       print(m.name)


# Try different models and prompts
GEMINI_MODEL_FOR_SUMMARIES = "gemini-2.0-pro-exp-02-05"
model = genai.GenerativeModel(GEMINI_MODEL_FOR_SUMMARIES)
# Combine the prompt and CSV data.
full_input = prompt + "\n\nCSV Data:\n" + csv_data
# Inference call to Gemini LLM
response = model.generate_content(full_input)


# Save response.text as .json file...


# Check token counts and compare to model limit: 2 million tokens
print(response.usage_metadata)
Image by author. Top: Example LLM model names. Bottom: Example inference call to Gemini LLM token counts: prompt_token_count = input tokens; candidates_token_count = output tokens; total_token_count = sum total tokens used.

Unfortunately, the Gemini API kept cutting response.text short. I had better luck using AI Studio directly.

Image by author: Screenshot of example outputs from Google AI Studio.

My 5 prompts to Gemini Flash & Pro (temperature set to 0) are below.

Prompt#1: Get thread Summaries:

Given this .csv file, per row, add 3 columns:
– thread_summary = 205 characters or less summary of the row’s column ‘message_content’
– user_thread_summary = 126 characters or less summary of the row’s column ‘message_content_user’
– thread_topic = 3–5 word super high-level category
Make sure the summaries capture the main content without losing too much detail. Make user thread summaries straight to the point, capture the main content without losing too much detail, skip the intro text. If a shorter summary is good enough prefer the shorter summary. Make sure the topic is general enough that there are fewer than 20 high-level topics for all the data. Prefer fewer topics. Output JSON columns: thread_id, thread_summary, user_thread_summary, thread_topic.

Prompt#2: Get cluster stats:

Given this CSV file of messages, use column=’user_thread_summary’ to perform semantic clustering of all the rows. Use technique = Silhouette, with linkage method = ward, and distance_metric = Cosine Similarity. Just give me the stats for the method Silhouette analysis for now.

Prompt#3: Perform initial clustering:

Given this CSV file of messages, use column=’user_thread_summary’ to perform semantic clustering of all the rows into N=6 clusters using the Silhouette method. Use column=”thread_topic” to summarize each cluster topic in 1–3 words. Output JSON with columns: thread_id, level0_cluster_id, level0_cluster_topic.

Silhouette Score measures how similar an object is to its own cluster (cohesion) versus other clusters (separation). Scores range from -1 to 1. A higher average silhouette score generally indicates better-defined clusters with good separation. For more details, check out the scikit-learn silhouette score documentation.
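
For intuition, here is a minimal sketch of what the equivalent “classic” silhouette sweep looks like in scikit-learn. It assumes you already have an embedding matrix X (one vector per user thread summary), which the LLM-prompting workflow above does not require:

from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

# X: (n_threads, dim) array of embeddings for the user thread summaries (assumed available)
def silhouette_sweep(X, k_min=2, k_max=10):
    scores = {}
    for k in range(k_min, k_max + 1):
        # Agglomerative clustering with cosine distance (average linkage, since ward requires Euclidean)
        labels = AgglomerativeClustering(n_clusters=k, metric="cosine", linkage="average").fit_predict(X)
        scores[k] = silhouette_score(X, labels, metric="cosine")
    return scores

# Example: inspect the trade-off between score and cluster count
# scores = silhouette_sweep(X)
# print(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))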

Applying it to Chroma data: below, I show results from Prompt#2 as a plot of silhouette scores. I chose N=6 clusters as a compromise between a high score and fewer clusters. Most LLMs for data analysis these days take CSV as input and output JSON.

Image by author of aggregated, anonymized data. Left: I chose N=6 clusters as compromise between higher score and fewer clusters. Right: The actual clusters using N=6. Highest sentiment (highest scores) are for topics about Query. Lowest sentiment (lowest scores) are for topics about “Client Problems”.

From the plot above, you can see we are finally getting into the meat of what users are saying!

Prompt#4: Get hierarchical cluster stats:

Given this CSV file of messages, use the column=’thread_summary_user’ to perform semantic clustering of all the rows into Hierarchical Clustering (Agglomerative) with 2 levels. Use Silhouette score. What is the optimal number of next Level0 and Level1 clusters? How many threads per Level1 cluster? Just give me the stats for now, we’ll do the actual clustering later.

Prompt#5: Perform hierarchical clustering:

Accept this clustering with 2-levels. Add cluster topics that summarize text column “thread_topic”. Cluster topics should be as short as possible without losing too much detail in the cluster meaning.
– Level0 cluster topics ~1–3 words.
– Level1 cluster topics ~2–5 words.
Output JSON with columns: thread_id, level0_cluster_id, level0_cluster_topic, level1_cluster_id, level1_cluster_topic.
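
Once the LLM returns its JSON, you can join the cluster assignments back onto the thread summaries for plotting. Here is a minimal sketch, assuming the response was saved to a file named clusters.json with the columns listed above:

import pandas as pd

# Load the LLM's JSON output (assumed saved as clusters.json with the columns from Prompt#5)
clusters_df = pd.read_json("clusters.json")

# thread_summary_df is the per-thread data frame created earlier (or read back from thread_summary.csv)
merged_df = thread_summary_df.merge(clusters_df, on="thread_id", how="left")

# Quick sanity checks: cluster sizes and average sentiment per Level1 cluster
print(merged_df["level1_cluster_topic"].value_counts())
print(merged_df.groupby("level1_cluster_topic")["score"].mean().sort_values())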

I also prompted to generate Streamlit code to visualize the clusters (since I’m not a JS expert 😄). Results for the same Chroma data are shown below.

Image by author of aggregated, anonymized data. Left image: Each scatterplot dot is a thread with hover-info. Right image: Hierarchical clustering with raw data drill-down capabilities. Api and Package Errors looks like Chroma’s most urgent topic to fix, because sentiment is low and volume of messages is high.

I found this very insightful. For Chroma, clustering revealed that while users were happy with topics like Query, Distance, and Performance, they were unhappy about areas such as Data, Client, and Deployment.

Experimenting with Custom Embeddings

I repeated the above clustering prompts, using just the numerical embedding (“user_embedding”) in the CSV instead of the raw text summaries (“user_text”). I’ve explained embeddings in detail in previous blogs, along with the risks of overfit models on leaderboards. OpenAI has reliable embeddings which are extremely affordable by API call. Below is an example code snippet showing how to create embeddings.

from openai import OpenAI


EMBEDDING_MODEL = "text-embedding-3-small"
EMBEDDING_DIM = 512 # 512 or 1536 possible


# Initialize client with API key
openai_client = OpenAI(
   api_key=os.environ.get("OPENAI_API_KEY"),
)


# Function to create embeddings
def get_embedding(text, embedding_model=EMBEDDING_MODEL,
                 embedding_dim=EMBEDDING_DIM):
   response = openai_client.embeddings.create(
       input=text,
       model=embedding_model,
       dimensions=embedding_dim
   )
   return response.data[0].embedding


# Function to call per pandas df row in .apply()
def generate_row_embeddings(row):
   return {
       'user_embedding': get_embedding(row['user_thread_summary']),
   }


# Generate embeddings using pandas apply
embeddings_data = df.apply(generate_row_embeddings, axis=1)
# Add embeddings back into df as separate columns
df['user_embedding'] = embeddings_data.apply(lambda x: x['user_embedding'])
display(df.head())


# Save as CSV ...
Example data for prompting. Column “user_embedding” is an array length=512 of floating point numbers.

Interestingly, both Perplexity Pro and Gemini 2.0 Pro sometimes hallucinated cluster topics (e.g., misclassifying a question about slow queries as “Personal Matter”).

Conclusion: When performing NLP with prompts, let the LLM generate its own embeddings — externally generated embeddings seem to confuse the model.

Image by author of aggregated, anonymized data. Both Perplexity Pro and Google’s Gemini 1.5 Pro hallucinated Cluster Topics when given an externally-generated embedding column. Conclusion — when performing NLP with prompts, just keep the raw text and let the LLM create its own embeddings behind the scenes. Feeding in externally-generated embeddings seems to confuse the LLM!

Clustering Across Multiple Discord Servers

Finally, I broadened the analysis to include Discord messages from three different VectorDB vendors. The resulting visualization highlighted common issues — like both Milvus and Chroma facing authentication problems.

Image by author of aggregated, anonymized data: A multi-vendor VectorDB dashboard displays top issues across many companies. One thing that stands out is both Milvus and Chroma are having trouble with Authentication.

Summary

Here’s a summary of the steps I followed to perform semantic clustering using LLM prompts:

  1. Extract Discord threads.
  2. Format data into conversation turns with roles (“user”, “assistant”).
  3. Score sentiment and save as CSV.
  4. Prompt Google Gemini 2.0 flash for thread summaries.
  5. Prompt Perplexity Pro or Gemini 2.0 Pro for clustering based on thread summaries using the same CSV.
  6. Prompt Perplexity Pro or Gemini 2.0 Pro to write Streamlit code to visualize clusters (because I’m not a JS expert 😆).

By following these steps, you can quickly transform raw forum data into actionable insights — what used to take days of coding can now be done in just one afternoon!

References

  1. Clio: Privacy-Preserving Insights into Real-World AI Use, https://arxiv.org/abs/2412.13678
  2. Anthropic blog about Clio, https://www.anthropic.com/research/clio
  3. Milvus Discord Server, last accessed Feb 7, 2025
    Chroma Discord Server, last accessed Feb 7, 2025
    Qdrant Discord Server, last accessed Feb 7, 2025
  4. Gemini models, https://ai.google.dev/gemini-api/docs/models/gemini
  5. Blog about Gemini 2.0 models, https://blog.google/technology/google-deepmind/gemini-model-updates-february-2025/
  6. Scikit-learn Silhouette Score
  7. OpenAI Matryoshka embeddings
  8. Streamlit

The post Tutorial: Semantic Clustering of User Messages with LLM Prompts appeared first on Towards Data Science.

]]>