Analytics | Towards Data Science

Simplicity Over Black Boxes
https://towardsdatascience.com/simplicity-over-black-boxes-eefc72a5c507/

Turning complex ML models into simple, interpretable rules with Human Knowledge Models for actionable insights and easy inference

Cover, image by Author

Modern machine learning has delivered incredible breakthroughs in lots of fields. These successes come from advanced models that can uncover intricate patterns in massive datasets.

But there’s one area where this approach falls short: easy interpretation. Most ML models, often referred to as "black boxes," take vast amounts of data and contain thousands of coefficients and weights. Instead of extracting clear, actionable insights, they leave us with results that are difficult for humans to understand or apply.

This gap highlights the need for a different approach – one that focuses on finding concise, interpretable rules rather than relying solely on complex models.

Most efforts focus on explaining "black box" models rather than directly extracting knowledge from raw data. The root of the problem lies in the "machine-first" mindset, where the priority is building optimal algorithms rather than human-like knowledge extraction. Humans rely on knowledge to generate experiments and new data, but the reverse – converting data into actionable knowledge – is largely overlooked.

The quest to turn complex data into simple, human-understandable logic has been around for decades. Starting in the 1960s, research on formal logic inspired the creation of rule-learning algorithms like CORELS, RuleFit, and Skope-Rules. These methods extract concise Boolean expressions from complex data, often using greedy optimization techniques. However, despite progress in interpretable and Explainable AI, significant challenges remain.

Not long ago, a group of scientists from Russia and France proposed an approach called Human Knowledge Models (HKMs), which distill data into simple rules.

An HKM creates simple rules built from basic Boolean operators (AND, OR, NOT) and thresholds. It produces no more than four rules. These rules are easy to use in different domains, especially when a human is involved.

When you may want to use this:

  • When predictions will be used by field experts. For instance, a doctor might receive a model’s prediction about the likelihood of a patient having pneumonia. If the prediction comes from a "black box," it becomes challenging for the doctor to adjust the results based on personal experience, leading to low trust in such predictions. A more effective approach would be to distill clear, understandable rules from patient medical histories (e.g., "If the patient’s blood pressure is above 120 and their temperature exceeds 38°C, the risk of pneumonia is high").
  • When deploying an ML model to production is unjustifiably expensive or complex, a few business rules can often be implemented more easily, even directly on the front-end.
  • When you have a small number of features or observations and can’t build a complex model.

HKM training

The HKM training identifies:

  • thresholds to simplify continuous data;
  • the best subset of features;
  • the optimal Boolean logic to model decision-making.

The process generates all possible Boolean functions, simplifies them into concise formulas, and evaluates their performance. Unlike traditional Machine Learning, which adjusts coefficients, HKMs explore different combinations of features and rules to find the most effective yet human-comprehensible models.

Training process, image by Author

HKM training avoids imputing missing data, treating it as valuable information, and produces multiple top-performing models, giving users flexibility in choosing the most practical solution.
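
To give a feel for what such a search looks like, here is a toy sketch that exhaustively scores single threshold conditions and their pairwise AND/OR combinations. It is only an illustration of the concept described above, not the actual HKM/PeaViner algorithm, and all names in it are made up for the example.

import itertools
import numpy as np
from sklearn.metrics import f1_score

def candidate_conditions(X, feature_names, n_thresholds=5):
    # Simple one-feature threshold conditions, e.g. "age >= 30"
    conds = []
    for j, name in enumerate(feature_names):
        for thr in np.quantile(X[:, j], np.linspace(0.1, 0.9, n_thresholds)):
            conds.append((f"{name} >= {thr:.1f}", X[:, j] >= thr))
            conds.append((f"{name} < {thr:.1f}", X[:, j] < thr))
    return conds

def search_short_rules(X, y, feature_names, top_k=3):
    conds = candidate_conditions(X, feature_names)
    # Score every single condition as a classifier on its own
    scored = [(f1_score(y, mask.astype(int)), desc) for desc, mask in conds]
    # Score every pairwise AND / OR combination of conditions
    for (d1, m1), (d2, m2) in itertools.combinations(conds, 2):
        scored.append((f1_score(y, (m1 & m2).astype(int)), f"({d1}) AND ({d2})"))
        scored.append((f1_score(y, (m1 | m2).astype(int)), f"({d1}) OR ({d2})"))
    return sorted(scored, reverse=True)[:top_k]

Even this naive version returns several top-performing, human-readable candidates rather than a single opaque model, which mirrors the flexibility mentioned above.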

Where HKMs Fall Short

HKMs aren’t suited for every problem. For tasks with high branching complexity (a large number of features), their simplicity becomes a disadvantage. These scenarios require substantial memory and logic that exceed human processing capacity. Nonetheless, HKMs can still play a critical role in applied fields like healthcare, where they serve as a practical starting point to address obvious blind spots.

Another limitation lies in feature identification. Unlike deep-learning models that can automatically extract complex patterns, HKMs depend on humans to define and measure key features. That’s why feature engineering falls on the shoulders of the analyst.

Churn prediction example

As a toy example we will use a generated Churn prediction dataset.

Install the libraries:

!pip install git+https://github.com/EgorDudyrev/PeaViner
!pip install bitarray

Dataset generation:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

np.random.seed(42)

n_rows = 1500

charge_amount = np.random.normal(10, np.sqrt(2), n_rows).astype(int)
seconds_of_use = np.random.gamma(shape=2, scale=2500, size=n_rows).astype(int)
frequency_of_use = np.random.normal(20, np.sqrt(10), n_rows).astype(int)
tariff_plan = np.random.choice([1, 2], size=n_rows) 
status = np.random.choice([2, 3], size=n_rows) 
age = np.random.randint(20, 71, size=n_rows) 
noise = np.random.uniform(0, 1, n_rows)
churn = np.where(
    ((seconds_of_use <= 1000) & (age >= 30)) | (frequency_of_use < 16) | (noise < 0.1),
    1,
    0
)

df = pd.DataFrame({
    "Charge Amount": charge_amount,
    "Seconds of Use": seconds_of_use,
    "Frequency of use": frequency_of_use,
    "Tariff Plan": tariff_plan,
    "Status": status,
    "Age": age,
    "Churn": churn
})

The dataset contains some basic metrics on user characteristics and the binary target – Churn. Sample data:

Dataset, image by Author

Split the dataset into train and test groups:

X = df.drop(columns=['Churn'])
y = df.Churn

X, y = X.values.astype(float), y.values.astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=42)
print(f"Train size: {len(X_train)}; Test size: {len(X_test)}")

Train size: 1200; Test size: 300

Finally we may apply the model and check the quality:

from peaviner import PeaClassifier

model = PeaClassifier()
model.fit(X_train, y_train)
model_scores = (f1_score(y_train, model.predict(X_train)),
                f1_score(y_test, model.predict(X_test)))

print(f"Train F1 score: {model_scores[0]:.2f}, "
      f"Test F1 score: {model_scores[1]:.2f}")

Train F1 score: 0.78, Test F1 score: 0.77

The results are pretty stable, and the training process took 3 minutes.

Now, we can check the rule found by the model:

features = [f for f in df if f!='Churn']
model.explain(features)

(Age >= 30 AND Seconds of Use < 1010) OR Frequency of use < 16

Pretty straightforward and interpretable. With just three conditions, there is no need to run inference through the model; you simply apply this rule to split the users. And as you can see, it is almost identical to the theoretical rule used to generate the data.
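
For example, here is a minimal sketch of applying the discovered rule straight to the dataframe, with no model object involved (the thresholds are simply copied from the rule above):

rule_prediction = (
    ((df['Age'] >= 30) & (df['Seconds of Use'] < 1010))
    | (df['Frequency of use'] < 16)
).astype(int)

print(round(f1_score(df['Churn'], rule_prediction), 2))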

Now we’d like to compare the performance of the model with several other popular algorithms. We’ve chosen Decision Tree and XGBoost to compare three models of different levels of complexity.

from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.metrics import f1_score
import pandas as pd

kf = KFold(n_splits=5, random_state=42, shuffle=True)

def evaluate_model(model, X, y, get_leaves):
    scores_cv, leaves = [], []
    for train_idx, test_idx in kf.split(X, y):
        model.fit(X[train_idx], y[train_idx])
        scores_cv.append((
            f1_score(y[train_idx], model.predict(X[train_idx])),
            f1_score(y[test_idx], model.predict(X[test_idx]))
        ))
        leaves.append(get_leaves(model))
    return scores_cv, leaves

models = [
    ("XGBoost", XGBClassifier(), lambda m: (m.get_booster().trees_to_dataframe()['Feature'] == 'Leaf').sum()),
    ("Decision Tree", DecisionTreeClassifier(), lambda m: m.get_n_leaves()),
    ("Human Knowledge Model", PeaClassifier(), lambda m: 1),
]

models_data = []
for model_name, model, get_leaves in models:
    scores_cv, leaves = evaluate_model(model, X, y, get_leaves)
    models_data.extend({
        'Model': model_name,
        'Fold': fold_id,
        'Train F1': train,
        'Test F1': test,
        'Leaves': n_leaves
    } for fold_id, ((train, test), n_leaves) in enumerate(zip(scores_cv, leaves)))

models_data = pd.DataFrame(models_data)

The results for fold 0:

Models comparison, image by Author

As you see, the Human Knowledge Model used just 1 rule, the Decision Tree used 165 leaves and XGBoost 1647, yet the quality on the Test group is comparable.

Now we want to visualize the quality results for all folds:

import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [10, 6]
plt.rcParams["figure.dpi"] = 100
plt.rcParams["figure.facecolor"] = "white"
plt.rcParams['font.family'] = 'monospace'
plt.rcParams['font.size'] = 10

%load_ext autoreload
%autoreload 2
%config InlineBackend.figure_format = "retina"

plt.figure(figsize=(8, 5))

for ds_part, color in zip(['Train', 'Test'], ['black', '#f95d5f']):
    y_axis = f'{ds_part} F1'
    plt.scatter(models_data[y_axis], 1/models_data['Leaves'], label=ds_part, alpha=0.3, s=200, color=color)

avgs = models_data.groupby('Model')['Leaves'].mean().sort_values()
avg_f1 = models_data.groupby('Model')['Test F1'].mean()

# Add vertical lines
for model_name, n_leaves in avgs.items():
    plt.axhline(1/n_leaves, zorder=0, linestyle='--', color='gray', alpha=0.5, linewidth=0.6)
    plt.text(0.05, 1/n_leaves*1.1, model_name)

plt.xlabel('F1 score')
plt.ylabel('1 / Number of rules')
plt.yscale('log') 
plt.ylim(0, 3)
plt.xlim(0, 1.05)

# Removing top and right borders
plt.gca().spines['top'].set_visible(False)
plt.gca().spines['right'].set_visible(False)

plt.show()
Red dots – Test, grey dots – Train data, image by Author

As you see, the quality of HKM on the Test subset is even better than that of the more complex models. Obviously, that’s because the dataset is comparatively small and the feature dependencies are not that complex. Still, the rules generated by HKM can easily be used, for example, for a personalisation offer on the website. You don’t need any ML infrastructure: the rules can be incorporated even on the front end.


Conclusion

Human Knowledge Models present a novel and practical approach to integrating human-centric logic with AI, bridging the gap between explainability and performance. They achieve several key objectives:

  • Simplifying Complexity: they reduce complex Boolean expressions to their simplest forms;
  • Enhancing Explainable AI: unlike traditional "interpretable" AI, they focus on active human decision-making, offering a more precise and functional definition;
  • Challenging Black-Box Models: they provide an alternative to classical machine learning models, expanding AI into domains where black-box solutions are unacceptable.

References

  • Cowan N., "The magical number 4 in short-term memory: A reconsideration of mental storage capacity," Behavioral and Brain Sciences, vol. 24, no. 1, pp. 87–114, 2001.
  • Taniguchi H., Sato H., and Shirakawa T., "A machine learning model with human cognitive biases capable of learning from small and biased datasets," Scientific Reports, vol. 8, no. 1, article 7397, 2018.
  • Dudyrev E., Semenkov I., Kuznetsov S. O., Gusev G., Sharp A., and Pianykh O. S., "Human knowledge models: Learning applied knowledge from the data," PLOS ONE, vol. 17, no. 10, e0275814, 2022.
  • Dudyrev E. and Kuznetsov S. O., "Towards fast finding optimal short classifiers," CEUR Workshop Proceedings, vol. 3233, pp. 23–34, 2022.

Unlocking the Power of Machine Learning in Analytics: Practical Use Cases and Skills
https://towardsdatascience.com/unlocking-the-power-of-machine-learning-in-analytics-practical-use-cases-and-skills-5201cf457360/

Your essential machine learning checklist to excel as a data scientist in analytics

In the past decade, we have seen explosive growth in the data science industry, with a rise in machine learning and AI use cases. Meanwhile, the "Data Scientist" title has evolved into different roles at different companies. Thinking about functions, there are Product Data Scientists, Marketing Data Scientists, those specialized in Finance, Risk, and people supporting Operations, HR, etc.

Another common distinction is the DS Analytics (often referred to as DSA) and the DS Machine Learning (DSML) tracks. As the names suggest, the former focuses on analyzing data to derive insights, while the latter trains and deploys more machine learning models. However, this does not mean that DSA positions do not involve machine learning projects. You can often find machine learning among the required skills in the job descriptions of DSA openings.

This overlap often leads to confusion among aspiring data scientists. During coffee chats, I frequently hear questions like: Do DSA positions still require machine learning skills? Or do DSAs also deploy machine learning models? Unfortunately, the answer is not a simple Yes or No. Firstly, the boundaries between the two positions are always blurry (even a decade after the data science job became a trend). Sometimes, within the same company, DSAs supporting different functions could be using very different skill sets. Secondly, machine learning itself is a broad field, covering everything from simple linear regression to complex neural networks, and even LLMs. Therefore, some of them could be fantastic tools for analytics purposes, while others are more tailored for building a production-grade predictive model.

But if I have to answer the question: "Yes, DSAs also use machine learning skills, but just not in the same way as DSML positions. Instead of building complex models and optimizing for scalability, accuracy, and latency, DSAs primarily leverage machine learning as a tool to generate insights or support analyses and focus more on generating interpretable and actionable outputs to better inform business decisions."

In the section below, I will explain this answer further with three common machine learning use cases in Analytics. These examples will provide a more concrete answer to the above question, and highlight the machine learning skills you need to succeed in a DSA position.

Image created by DALL·E

Use Case 1: Metrics Driver Analysis

Usually, the first step to understanding a business or a product is to define and track the right metrics. But after completing this step, the question becomes how we can move the metrics in the right direction. Machine learning is a powerful tool for understanding what drives these metrics and how to influence them.

Let’s say you are tasked to improve customer retention. You might have heard tons of assumptions from user interviews, surveys, and even business intuition. A common way to validate the assumptions is to build a classification model to predict retention and identify important features.

Here is the process:

  1. Establish the assumptions based on conversations with your stakeholders and customers, and your business understanding. For example,

    • Customers with longer tenures are less likely to churn;
    • Heavy usage of feature X indicates higher loyalty to the product;
    • Customers who only use the mobile app are more likely to churn due to the missing features on the app.
  2. Collect features based on your assumptions, for example, customer tenure, time spent and frequency on feature X, and if the user is mobile only or not.
  3. Build a classification model to predict if a customer churned or not. Common classification models include Logistic Regression and tree-based models (random forests, XGBoost, LightGBM, CatBoost, etc.). My go-to model is the XGBoost model because it captures nonlinear relationships and feature interactions, has built-in methods to handle imbalanced datasets and avoid overfitting, and typically achieves a good accuracy baseline even before extensive parameter tuning.
  4. Generate data insights and business recommendations. You can use the model’s Feature Importance to understand which assumptions were correct. You can also use SHAP values to decompose the prediction into contributions from each feature. For example, if the model reveals that mobile user only is a very important feature and is correlated with a high churn rate, the next steps could be 1. improving the mobile app to close the feature gaps; 2. launching an email campaign encouraging mobile-only users to try more features on the desktop version. And of course, you will monitor the effectiveness of these recommendations and continue monitoring the retention rate.

From the above example, you can see that building the model is only part of the task for DSAs. The real value comes from interpreting results and translating them into actionable business strategies.
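
To make steps 3 and 4 concrete, here is a minimal sketch assuming a pandas dataframe df with the hypothesized features and a binary churned label (all column names here are illustrative, not a real dataset):

import pandas as pd
import xgboost as xgb
import shap
from sklearn.model_selection import train_test_split

features = ['tenure_months', 'feature_x_minutes', 'feature_x_sessions', 'is_mobile_only']
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['churned'], test_size=0.2, random_state=42
)

# Step 3: fit a classification model
model = xgb.XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X_train, y_train)

# Step 4: global feature importance shows which assumptions matter most
importance = pd.Series(model.feature_importances_, index=features).sort_values(ascending=False)
print(importance)

# SHAP values decompose individual predictions into per-feature contributions
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
shap.summary_plot(shap_values, X_test)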

Use Case 2: Customer Segmentation

We always talk about product-market fit; a critical part is understanding your customer portfolio and their needs. Therefore, data scientists are often asked to conduct customer segmentation tasks, grouping users based on similar behaviors or preferences.

There are numerous approaches to customer segmentation. Let me name a few:

  1. Simple demographics segmentation. For example, you work for a fashion retailer that owns product lines with different price tiers. In this case, a simple solution to customer segmentation is to break down customers based on their household income (if you have this data).
  2. RFM (Recency, Frequency, Monetary) analysis. You can collect metrics including how recently a customer spends (recency), how often they spend (frequency), and how much they spend (monetary). Then you can categorize high-recency, high-frequency, and high monetary value customers as VIP customers; high-recency, high-frequency, and low monetary value customers as price-sensitive customers, etc.
  3. Back to our topic, you can also build an unsupervised learning model (K-Means Clustering, Hierarchical Clustering, DBSCAN, etc.); a minimal clustering sketch follows this list. Unlike the classification model example above, unsupervised learning is a type of machine learning that identifies patterns and relationships in unlabeled data without predefined output labels. Let’s still stick to the fashion retailer example:

    • You can collect features including customer demographics (age, income, etc.) and purchase patterns (how often and how much they spend on different product lines and types), then run them through a clustering algorithm like K-Means. This will automatically group your customers into several clusters.
    • Though there are no existing true labels, you still need to evaluate the model outputs. Usually, you would examine the characteristics of each cluster to understand if the output makes business sense and what type of customers they represent, then give them a label that is intuitive to your stakeholders. For example, you might find a cluster of customers mostly purchasing discounted products, and you can label them "Bargain Hunters"; Another cluster of customers who spend a lot on luxury brands could be the "Luxury Shoppers" group.
    • Once you have the reasonable segments, they could be used to inform product strategies and conduct targeted email campaigns and personalized product recommendations.
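
Here is a minimal sketch of that clustering workflow, assuming a dataframe customers with illustrative demographic and spending columns (the feature names and the choice of four clusters are assumptions made for the example):

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

features = ['age', 'income', 'orders_per_year', 'avg_order_value', 'discount_share', 'luxury_share']
X = StandardScaler().fit_transform(customers[features])  # scaling matters for distance-based clustering

kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
customers['cluster'] = kmeans.fit_predict(X)

# Profile each cluster to attach an intuitive business label
print(customers.groupby('cluster')[features].mean().round(1))

Examining these per-cluster averages is exactly what lets you attach labels like "Bargain Hunters" or "Luxury Shoppers" before sharing the segments with stakeholders.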

Use Case 3: Experimentation and causal inference

Another common task for DSAs is measuring the causal impact of a certain event. This involves randomized experimentation and more advanced causal inference methods when a controlled experiment is not feasible.

When it comes to randomized experiments like A/B tests, an application of machine learning is to reduce noise in experiment results by controlling for covariates. These adjustments improve the sensitivity of experiments and lead to smaller sample sizes or shorter test durations while maintaining statistical power. CUPED (Controlled-experiment Using Pre-Experiment Data) is one of those variance reduction methods that can incorporate machine learning techniques. It reduces the variance of the outcome metric by adjusting for pre-experiment covariates that are predictive of the outcome – something you can achieve with a machine learning model.
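
As a minimal sketch of the classic CUPED adjustment (assuming numpy arrays y for the in-experiment metric and x for the same metric measured before the experiment; a machine learning prediction of the outcome built on pre-experiment data can play the role of x as well):

import numpy as np

def cuped_adjust(y, x):
    # theta is the regression coefficient of y on the pre-experiment covariate x
    theta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    # the adjusted metric keeps the same expected mean but has lower variance
    return y - theta * (x - x.mean())

# y_adj = cuped_adjust(y, x); then run the usual test on y_adj instead of y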

When it comes to causal inference, there are lots of machine learning use cases as it can be used to address key challenges like confounding, non-linearity, and high-dimensional data.

Here is an example of using machine learning to enhance Propensity Score Matching, which is a useful technique to manually create two comparable groups when you don’t have perfectly randomized test and control groups. Suppose your company launched a newsletter program and your stakeholder wants you to assess its impact on customer retention. However, users who subscribed to the newsletter might inherently differ from those who did not. To apply the Propensity Score Matching method here (a minimal code sketch follows the steps below),

  1. You can train a machine learning model (for example, logistic regression or XGBoost) to predict the likelihood of a user subscribing to the newsletter.
  2. Use the predicted "propensity scores" to match newsletter subscribers with similar non-subscribers.
  3. Compare the retention rate from this matched control and treatment group. This will give you a more fair evaluation of the newsletter’s impact on customer retention.
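
Here is a minimal sketch of those three steps, assuming a dataframe users with illustrative pre-treatment features, a subscribed flag, and a retained outcome (one-to-one nearest-neighbour matching on the propensity score is just one common choice):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

features = ['tenure_months', 'orders_last_90d', 'avg_session_minutes']  # illustrative

# Step 1: model the probability of subscribing to the newsletter
ps_model = LogisticRegression(max_iter=1000).fit(users[features], users['subscribed'])
users['propensity'] = ps_model.predict_proba(users[features])[:, 1]

# Step 2: match each subscriber with the non-subscriber closest in propensity score
treated = users[users['subscribed'] == 1]
control = users[users['subscribed'] == 0]
nn = NearestNeighbors(n_neighbors=1).fit(control[['propensity']])
_, idx = nn.kneighbors(treated[['propensity']])
matched_control = control.iloc[idx.ravel()]

# Step 3: compare retention between subscribers and their matched controls
print(treated['retained'].mean() - matched_control['retained'].mean())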

Machine learning also can be incorporated into other causal inference methods, such as Regression Discontinuity, Instrumental Variables, and Synthetic Control, which help address biases in observational data and derive the causal relationship.


I hope the three use cases above give you a concrete idea of how you can use machine learning in analytics workflows.

If you aim to be a Data Scientist specializing in Analytics, here is your essential machine learning skills checklist for your interview preparation and daily work.

  1. Data Preparation:

    • Handle missing values and outliers
    • Encode categorical variables
    • Apply data normalization techniques
  2. Common ML Algorithms: understand the assumptions and the pros and cons of each model, and how to pick the right one

    • Supervised learning models: Regression models (Linear Regression, Logistic Regression), Tree-based models (Random Forest, XGBoost)
    • Unsupervised learning models: K-Means and Hierarchical Clustering
  3. Model Training and Evaluation:

    • Select the right evaluation metric
    • Prevent overfitting
    • Handle imbalanced datasets
    • Feature selection and feature engineering
    • Hyperparameter tuning
  4. Model Interpretation:

    • Understand coefficients in Regression models
    • Understand Feature Importance in tree-based models
    • Use interpretability tools like SHAP
    • Translate model insights into business insights and communicate them effectively to non-technical stakeholders

By mastering these skills and understanding the use cases discussed, you’ll be well-equipped to leverage machine learning effectively as a Data Scientist in Analytics.


If you have enjoyed this article, please follow me and check out my other articles on Data Science, analytics, and AI.

Seven Common Causes of Data Leakage in Machine Learning

ChatGPT vs. Claude vs. Gemini for Data Analysis (Part 3): Best AI Assistant for Machine Learning

From Data Scientist to Data Manager: My First 3 Months Leading a Team

Building Effective Metrics to Describe Users
https://towardsdatascience.com/building-effective-metrics-to-describe-users-3212727c5a9e/

How can numerical user metrics, such as "3 visits in the past week," be transformed into a personalized assessment of whether this behavior is typical or unusual for the user?
Cover, image by Author

In almost any digital product, analysts often face the challenge of building a digital customer profile – a set of parameters that describe the customer’s state and behavior in one way or another.

What are the potential applications of these metrics?

  • Gaining insights into user behavior
  • Leveraging as features in ML models
  • Developing business rules for personalized offers

A simple example is e-commerce, with metrics like those listed in the table below.

Image by Author

These metrics are used everywhere, but what is the problem with them? They don’t take into account the specific user’s history or the dynamics of this particular metric. Is $200 of spend a lot for user 1? It’s unclear. Yet this distinction significantly influences the business decision we make next.

Even within the context of a single user, $200 can have a different meaning for the business depending on the user’s stage in their lifecycle with the product. $200 spent during user onboarding, peak activity, and re-activation are different things.

User journey, image by Author

We’d like to have some normalized metrics to be able to compare them across users. Something like this:

Image by Author

So how can we move from a numerical description of customer behavior to a more characteristic representation? For instance, how can the fact that "a customer hasn’t made a transaction for 7 days" be translated into an individualized assessment of whether this is unusual or typical for that specific customer? The challenge is to achieve this without losing interpretability, preserving business relevance, and avoiding black-box models.

A simple approach is to analyze the distribution of the metric and determine the probability of observing the current result (i.e., calculate the p-value). This helps us understand how extreme the value is compared to the user’s history.

Normal distribution, image by Author

But what’s the challenge here? In most cases, the distribution of metrics is not normal, making p-value calculations more complex.

A random metric would probably have a distribution similar to this one:

PDF, image by Author

We can apply a small trick and transform the Probability Density Function (PDF) into the Cumulative Distribution Function (CDF). Calculating the p-value in this case is much easier.

CDF, image by Author

So, we simply need to reconstruct the CDF from the user’s metric, which can be done efficiently using splines. Let’s create a toy example.


Imagine you are an e-commerce platform aiming to personalize your email campaigns based on user activity from the past week. If a user has been less active compared to previous weeks, you plan to send them a discount offer.

You’ve gathered user statistics and noticed the following for a user named John:

  • John visited the platform for the first time 15 days ago.
  • During the first 7 days (days 1–7), he made 9 visits.
  • During the next 7 days (days 2–8), he made 8 visits.
  • In total, we have 9 values.

Now, you want to evaluate how extreme the most recent value is compared to the previous ones.

import numpy as np
visits = np.array([9, 8, 6, 5, 8, 6, 8, 7])
num_visits_last_week = 6

Let’s create a CDF of these values.

import numpy as np
import matplotlib.pyplot as plt

values = np.array(sorted(set(visits)))                     # unique visit counts
counts = np.array([(visits == x).sum() for x in values])   # frequency of each value
probabilities = counts / counts.sum()
cdf = np.cumsum(probabilities)

plt.scatter(values, cdf, color='black', linewidth=10)
CDF, image by Author

Now we need to restore the function based on these values. We will use spline interpolation.

from scipy.interpolate import make_interp_spline

x_new = np.linspace(values.min(), values.max(), 300)  
spline = make_interp_spline(values, cdf, k=3)         
cdf_smooth = spline(x_new)

plt.plot(x_new, cdf_smooth, label='Spline CDF', color='black', linewidth=4)
plt.scatter(values, cdf, color='black', linewidth=10)
plt.scatter(values[-2:], cdf[-2:], color='#f95d5f', linewidth=10, zorder=5)
plt.show()
CDF with spline interpolation, image by Author

Not bad. But we observe a small problem between the red dots – the CDF must be monotonically increasing. Let’s fix this with a Piecewise Cubic Hermite Interpolating Polynomial.

from scipy.interpolate import PchipInterpolator

spline_monotonic = PchipInterpolator(values, cdf)
cdf_smooth = spline_monotonic(x_new)

plt.plot(x_new, cdf_smooth, color='black', linewidth=4)
plt.scatter(values, cdf, color='black', linewidth=10) 
plt.show()
CDF with Piecewise Cubic Hermite Interpolating, image by Author

Alright, now it’s perfect.

To calculate the p-value for our current observation (6 visits during the last week), we need to calculate the area of the filled region, which is simply the value of the CDF at that point.

Critical area, image by Author

To do so, let’s create a simple function calculate_p_value:

def calculate_p_value(x):
    if x < values.min():
        return 0  
    elif x > values.max():
        return 1  
    else:
        return spline_monotonic(x)  

p_value = calculate_p_value(num_visits_last_week)
print(f"Probability of getting less than {num_visits_last_week} equals: {p_value}")

Probability of getting less than 6 equals: 0.375

So the probability is quite high (we may compare it to a threshold of 0.1, for instance), and we decide not to send the discount to John. We need to do the same calculation for all the users.


Conclusion

In this article, we have seen how transforming raw user metrics into personalized assessments can lead to more effective business decisions. By using statistical methods like CDF and spline interpolation, we can better understand the context behind user actions and provide personalized, relevant offers that are informed by data.

From Default Python Line Chart to Journal-Quality Infographics
https://towardsdatascience.com/from-default-python-line-chart-to-journal-quality-infographics-80e3949eacc3/

Transform boring default Matplotlib line charts into stunning, customized visualizations

Cover, image by the Author

Everyone who has used Matplotlib knows how ugly the default charts look. In this series of posts, I’ll share some tricks to make your visualizations stand out and reflect your individual style.

We’ll start with a simple line chart, which is widely used. The main highlight will be adding a gradient fill below the plot – a task that’s not entirely straightforward.

So, let’s dive in and walk through all the key steps of this transformation!

Let’s make all the necessary imports first.

import pandas as pd
import numpy as np
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from matplotlib import rcParams
from matplotlib.path import Path
from matplotlib.patches import PathPatch

np.random.seed(38)

Now we need to generate sample data for our visualization. We will create something similar to what stock prices look like.

dates = pd.date_range(start='2024-02-01', periods=100, freq='D')
initial_rate = 75
drift = 0.003
volatility = 0.1
returns = np.random.normal(drift, volatility, len(dates))
rates = initial_rate * np.cumprod(1 + returns)

x, y = dates, rates

Let’s check how it looks with the default Matplotlib settings.

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(dates, rates)
ax.xaxis.set_major_locator(mdates.DayLocator(interval=30))
plt.show()
Default plot, image by Author

Not really fascinating, right? But we will gradually make it look better. Let’s:

  • set the title
  • set general chart parameters – size and font
  • placing the Y ticks to the right
  • changing the main line color, style and width
# General parameters
fig, ax = plt.subplots(figsize=(10, 6))
plt.title("Daily visitors", fontsize=18, color="black")
rcParams['font.family'] = 'DejaVu Sans'
rcParams['font.size'] = 14

# Axis Y to the right
ax.yaxis.tick_right()
ax.yaxis.set_label_position("right")

# Plotting main line
ax.plot(dates, rates, color='#268358', linewidth=2)
General params applied, image by Author

Alright, now it looks a bit cleaner.

Now we’d like to add a minimalistic grid to the background, remove the borders for a cleaner look, and remove the ticks from the Y axis.

# Grid
ax.grid(color="gray", linestyle=(0, (10, 10)), linewidth=0.5, alpha=0.6)
ax.tick_params(axis="x", colors="black")
ax.tick_params(axis="y", left=False, labelleft=False) 

# Borders
ax.spines["top"].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines["bottom"].set_color("black")
ax.spines['left'].set_color('white')
ax.spines['left'].set_linewidth(1)

# Remove ticks from axis Y
ax.tick_params(axis='y', length=0)
Grid added, image by Author

Now we’re adding a tiny aesthetic detail – the year near the first tick on the X axis. We also make the font color of the tick labels paler.

# Add year to the first date on the axis
x_interval = 30  # tick step in days; assumed to match the DayLocator interval used above

def custom_date_formatter(t, pos, dates, x_interval):
    date = dates[pos*x_interval]
    if pos == 0:
        return date.strftime("%d %b '%y")
    else:
        return date.strftime('%d %b')

ax.xaxis.set_major_formatter(ticker.FuncFormatter((lambda x, pos: custom_date_formatter(x, pos, dates=dates, x_interval=x_interval))))

# Ticks label color
[t.set_color('#808079') for t in ax.yaxis.get_ticklabels()]
[t.set_color('#808079') for t in ax.xaxis.get_ticklabels()]
Year near first date, image by Author

And we’re getting closer to the trickiest moment – how to create a gradient under the curve. There is no such option in Matplotlib, but we can simulate it by creating a gradient image and then clipping it with the chart.

# Gradient
numeric_x = np.array([i for i in range(len(x))])
numeric_x_patch = np.append(numeric_x, max(numeric_x))
numeric_x_patch = np.append(numeric_x_patch[0], numeric_x_patch)
y_patch = np.append(y, 0)
y_patch = np.append(0, y_patch)

path = Path(np.array([numeric_x_patch, y_patch]).transpose())
patch = PathPatch(path, facecolor='none')
plt.gca().add_patch(patch)

ax.imshow(numeric_x.reshape(len(numeric_x), 1),  interpolation="bicubic",
                cmap=plt.cm.Greens, 
                origin='lower',
                alpha=0.3,
                extent=[min(numeric_x), max(numeric_x), min(y_patch), max(y_patch) * 1.2], 
                aspect="auto", clip_path=patch, clip_on=True)
Gradient added, image by Author

Now it looks clean and nice. We just need to add several details using any editor (I prefer Google Slides) – a title, rounded border corners, and some numeric indicators.

Final visualization, image by Author

The full code to reproduce the visualization is below:

Linear Optimisations in Product Analytics
https://towardsdatascience.com/linear-optimisations-in-product-analytics-ace19e925677/

Solving the knapsack problem

It might be surprising, but in this article, I would like to talk about the knapsack problem, the classic optimisation problem that has been studied for over a century. According to Wikipedia, the problem is defined as follows:

Given a set of items, each with a weight and a value, determine which items to include in the collection so that the total weight is less than or equal to a given limit and the total value is as large as possible.

While product analysts may not physically pack knapsacks, the underlying mathematical model is highly relevant to many of our tasks. There are numerous real-world applications of the knapsack problem in product analytics. Here are a few examples:

  • Marketing Campaigns: The marketing team has a limited budget and capacity to run campaigns across different channels and regions. Their goal is to maximize a KPI, such as the number of new users or revenue, all while adhering to existing constraints.
  • Retail Space Optimization: A retailer with limited physical space in their stores seeks to optimize product placement to maximize revenue.
  • Product Launch Prioritization: When launching a new product, the operations team’s capacity might be limited, requiring prioritization of specific markets.

Such and similar tasks are quite common, and many analysts encounter them regularly. So, in this article, I’ll explore different approaches to solving it, ranging from naive, simple techniques to more advanced methods such as Linear Programming.

Another reason I chose this topic is that linear programming is one of the most powerful and popular tools in prescriptive analytics – a type of analysis that focuses on providing stakeholders with actionable options to make informed decisions. As such, I believe it is an essential skill for any analyst to have in their toolkit.

Case

Let’s dive straight into the case we’ll be exploring. Imagine we’re part of a marketing team planning activities for the upcoming month. Our objective is to maximize key performance indicators (KPIs), such as the number of acquired users and revenue while operating within a limited marketing budget.

We’ve estimated the expected outcomes of various marketing activities across different countries and channels. Here is the data we have:

  • country – the market where we can do some promotional activities;
  • channel – the acquisition method, such as social networks or influencer campaigns;
  • users – the expected number of users acquired within a month of the promo campaign;
  • cs_contacts – the incremental Customer Support contacts generated by the new users;
  • marketing_spending – the investment required for the activity;
  • revenue – the first-year LTV generated from acquired customers.

Note that the dataset is synthetic and randomly generated, so don’t try to infer any market-related insights from it.

First, I’ve calculated the high-level statistics to get a view of the numbers.

Let’s determine the optimal set of marketing activities that maximizes revenue while staying within the $30M marketing budget.

Brute force approach

At first glance, the problem may seem straightforward: we could calculate all possible combinations of marketing activities and select the optimal one. However, it might be a challenging task.

With 62 segments, there are 2⁶² possible combinations, as each segment can either be included or excluded. This results in approximately 4.6*10¹⁸ combinations – an astronomical number.

To better understand the computational feasibility, let’s consider a smaller subset of 15 segments and estimate the time required for one iteration.

import itertools
import pandas as pd
import tqdm

# reading data
df = pd.read_csv('marketing_campaign_estimations.csv', sep = '\t')
df['segment'] = df.country + ' - ' + df.channel

# calculating combinations
combinations = []
segments = list(df.segment.values)[:15]
print('number of segments: ', len(segments))

for num_items in range(len(segments) + 1):
  combinations.extend(
      itertools.combinations(segments, num_items)
  )
print('number of combinations: ', len(combinations))

tmp = []
for selected in tqdm.tqdm(combinations):
    tmp_df = df[df.segment.isin(selected)]
    tmp.append(
        {
        'selected_segments': ', '.join(selected),
        'users': tmp_df.users.sum(),
        'cs_contacts': tmp_df.cs_contacts.sum(),
        'marketing_spending': tmp_df.marketing_spending.sum(),
        'revenue': tmp_df.revenue.sum()
        }
    )

# number of segments:  15
# number of combinations:  32768

It took approximately 4 seconds to process 15 segments, allowing us to handle around 7,000 iterations per second. Using this estimate, let’s calculate the execution time for the full set of 62 segments.

2**62 / 7000 / 3600 / 24 / 365
# 20 890 800.6

Using brute force, it would take around 20.9 million years to get the answer to our question – clearly not a feasible option.

Execution time is entirely determined by the number of segments. Removing just one segment can reduce time twice. With this in mind, let’s explore possible ways to merge segments.

As usual, there are more small-sized segments than bigger ones, so merging them is a logical step. However, it’s important to note that this approach may reduce accuracy since multiple segments are aggregated into one. Despite this, it could still yield a solution that is "good enough."

To simplify, let’s merge all segments that contribute less than 0.1% of revenue.

df['share_of_revenue'] = df.revenue/df.revenue.sum() * 100
df['segment_group'] = list(map(
    lambda x, y: x if y >= 0.1 else 'other',
    df.segment,
    df.share_of_revenue
))

print(df[df.segment_group == 'other'].share_of_revenue.sum())
# 0.53
print(df.segment_group.nunique())
# 52

With this approach, we will merge ten segments into one, representing 0.53% of the total revenue (the potential margin of error). With 52 segments remaining, we can obtain the solution in just 20.4K years. While this is a significant improvement, it’s still not sufficient.

You may consider other heuristics tailored to your specific task. For instance, if your constraint is a ratio (e.g., contact rate = CS contacts / users ≤ 5%), you could group all segments where the constraint holds true, as the optimal solution will include all of them. In our case, however, I don’t see any additional strategies to reduce the number of segments, so brute force seems impractical.

That said, if the number of combinations is relatively small and brute force can be executed within a reasonable time, it can be an ideal approach. It’s simple to develop and provides accurate results.

Naive approach: looking at top-performing segments

Since brute force is not feasible for calculating all combinations, let’s consider a simpler algorithm to address this problem.

One possible approach is to focus on the top-performing segments. We can evaluate segment performance by calculating revenue per dollar spent, then sort all activities based on this ratio and select the top performers that fit within the marketing budget. Let’s implement it.

df['revenue_per_spend'] = df.revenue / df.marketing_spending 
df = df.sort_values('revenue_per_spend', ascending = False)
df['spend_cumulative'] = df.marketing_spending.cumsum()
selected_df = df[df.spend_cumulative <= 30000000]
print(selected_df.shape[0])
# 48 
print(selected_df.revenue.sum()/1000000)
# 107.92

With this approach, we selected 48 activities and got $107.92M in revenue.

Unfortunately, although the logic seems reasonable, it is not the optimal solution for maximizing revenue. Let’s look at a simple example with just three marketing activities.

Using the top markets approach, we would select France and achieve $68M in revenue. However, by choosing two other markets, we could achieve significantly better results – $97.5M. The key point is that our algorithm optimizes not only for maximum revenue but also for minimizing the number of selected segments. Therefore, this approach will not yield the best results, especially considering its inability to account for multiple constraints.

Linear Programming

Since all simple approaches have failed, we must return to the fundamentals and explore the theory behind this problem. Fortunately, the knapsack problem has been studied for many years, and we can apply optimization techniques to solve it in seconds rather than years.

The problem we’re trying to solve is an example of Integer Programming, which is actually a subdomain of Linear Programming.

We’ll discuss this shortly, but first, let’s align on the key concepts of the optimization process. Each optimisation problem consists of:

  • Decision variables: Parameters that can be adjusted in the model, typically representing the levers or decisions we want to make.
  • Objective function: The target variable we aim to maximize or minimize. It goes without saying that it must depend on the decision variables.
  • Constraints: Conditions placed on the decision variables that define their possible values. For example, ensuring the team cannot work a negative number of hours.

With these basic concepts in mind, we can define Linear Programming as a scenario where the following conditions hold:

  • The objective function is linear.
  • All constraints are linear.
  • Decision variables are real-valued.

Integer Programming is very similar to Linear Programming, with one key difference: some or all decision variables must be integers. While this may seem like a minor change, it significantly impacts the solution approach, requiring more complex methods than those used in Linear Programming. One common technique is branch-and-bound. We won’t dive deeper into the theory here, but you can always find more detailed explanations online.

For linear optimization, I prefer the widely used Python package PuLP. However, there are other options available, such as Python MIP or Pyomo. Let’s install PuLP via pip.

! pip install pulp

Now, it’s time to define our task as a mathematical optimisation problem. There are the following steps for it:

  • Define the set of decision variables (levers we can adjust).
  • Align on the objective function (a variable that we will be optimising for).
  • Formulate constraints (the conditions that must hold true during optimisations).

Let’s go through the steps one by one. But first, we need to create the problem object and set the objective – maximization in our case.

from pulp import *
problem = LpProblem("Marketing_campaign", LpMaximize)

The next step is defining the decision variables – parameters that we can change during optimisation. Our main decision is either to run a marketing campaign or not. So, we can model it as a set of binary variables (0 or 1) for each segment. Let’s do it with the PuLP library.

segments = range(df.shape[0])  
selected = LpVariable.dicts("Selected", segments, cat="Binary")

After that, it’s time to align on the objective function. As discussed, we want to maximise the revenue. The total revenue will be a sum of revenue from all the selected segments (where decision_variable = 1 ). Therefore, we can define this formula as the sum of the expected revenue for each segment multiplied by the decision binary variable.

problem += lpSum(
  selected[i] * list(df['revenue'].values)[i] 
  for i in segments
)

The final step is to add constraints. Let’s start with a simple constraint: our marketing spending must be below $30M.

problem += lpSum(
    selected[i] * df['marketing_spending'].values[i]
    for i in segments
) <= 30 * 10**6

Hint: you can print(problem) to double-check the objective function and constraints.

Now that we’ve defined everything, we can run the optimization and analyze the results.

problem.solve()

It takes less than a second to run the optimization, a significant improvement compared to the thousands of years that brute force would require.

Result - Optimal solution found

Objective value:                110162662.21000001
Enumerated nodes:               4
Total iterations:               76
Time (CPU seconds):             0.02
Time (Wallclock seconds):       0.02

Let’s save the results of the model execution – the decision variables indicating whether each segment was selected or not – into our dataframe.

df['selected'] = list(map(lambda x: x.value(), selected.values()))
print(df[df.selected == 1].revenue.sum()/10**6)
# 110.16

It works like magic, allowing you to obtain the solution quickly. Additionally, note that we achieved higher revenue compared to our naive approach: $110.16M versus $107.92M.

We’ve tested integer programming with a simple example featuring just one constraint, but we can extend it further. For instance, we can add additional constraints for our CS contacts to ensure that our Operations team can handle the demand in a healthy way:

  • The number of additional CS contacts ≤ 5K
  • Contact rate (CS contacts/users) ≤ 0.042
# define the problem
problem_v2 = LpProblem("Marketing_campaign_v2", LpMaximize)

# decision variables
segments = range(df.shape[0]) 
selected = LpVariable.dicts("Selected", segments, cat="Binary")

# objective function
problem_v2 += lpSum(
  selected[i] * list(df['revenue'].values)[i] 
  for i in segments
)

# Constraints
problem_v2 += lpSum(
    selected[i] * df['marketing_spending'].values[i]
    for i in segments
) <= 30 * 10**6

problem_v2 += lpSum(
    selected[i] * df['cs_contacts'].values[i]
    for i in segments
) <= 5000

problem_v2 += lpSum(
    selected[i] * df['cs_contacts'].values[i]
    for i in segments
) <= 0.042 * lpSum(
    selected[i] * df['users'].values[i]
    for i in segments
)

# run the optimisation
problem_v2.solve()

The code is straightforward, with the only tricky part being the transformation of the ratio constraint into a simpler linear form: the condition "CS contacts / users ≤ 0.042" is multiplied through by the total number of users, giving "total CS contacts ≤ 0.042 * total users", so both sides stay linear in the decision variables.

Another potential constraint you might consider is limiting the number of selected options, for example, to 10. This constraint could be pretty helpful in prescriptive analytics, for example, when you need to select the top-N most impactful focus areas.

# define the problem
problem_v3 = LpProblem("Marketing_campaign_v3", LpMaximize)

# decision variables
segments = range(df.shape[0]) 
selected = LpVariable.dicts("Selected", segments, cat="Binary")

# objective function
problem_v3 += lpSum(
  selected[i] * list(df['revenue'].values)[i] 
  for i in segments
)

# constraints
problem_v3 += lpSum(
    selected[i] * df['marketing_spending'].values[i]
    for i in segments
) <= 30 * 10**6

problem_v3 += lpSum(
    selected[i] for i in segments
) <= 10

# run the optimisation
problem_v3.solve()
df['selected'] = list(map(lambda x: x.value(), selected.values()))
print(df.selected.sum())
# 10

Another possible option to tweak our problem is to change the objective function. We’ve been optimising for the revenue, but imagine we want to maximise both revenue and new users at the same time. For that, we can slightly change our objective function.

Let’s consider the best approach. We could calculate the sum of revenue and new users and aim to maximize it. However, since revenue is, on average, 1000 times higher, the results might be skewed toward maximizing revenue. To make the metrics more comparable, we can normalize both revenue and users based on their total sums. Then, we can define the objective function as a weighted sum of these ratios. I would use equal weights (0.5) for both metrics, but you can adjust the weights to give more value to one of them.

# define the problem
problem_v4 = LpProblem("Marketing_campaign_v4", LpMaximize)

# decision variables
segments = range(df.shape[0]) 
selected = LpVariable.dicts("Selected", segments, cat="Binary")

# objective Function
problem_v4 += (
    0.5 * lpSum(
        selected[i] * df['revenue'].values[i] / df['revenue'].sum()
        for i in segments
    )
    + 0.5 * lpSum(
        selected[i] * df['users'].values[i] / df['users'].sum()
        for i in segments
    )
)

# constraints
problem_v4 += lpSum(
    selected[i] * df['marketing_spending'].values[i]
    for i in segments
) <= 30 * 10**6

# run the optimisation
problem_v4.solve()
df['selected'] = list(map(lambda x: x.value(), selected.values()))

We obtained the optimal objective function value of 0.6131, with revenue at $104.36M and 136.37K new users.

That’s it! We’ve learned how to use integer programming to solve various optimisation problems.

You can find the full code on GitHub.

Summary

In this article, we explored different methods for solving the knapsack problem and its analogues in product analytics.

  • We began with a brute-force approach but quickly realized it would take an unreasonable amount of time.
  • Next, we tried using common sense by naively selecting the top-performing segments, but this approach yielded incorrect results.
  • Finally, we turned to Integer Programming, learning how to translate our product tasks into optimization models and solve them effectively.

With this, I hope you’ve gained another valuable analytical tool for your toolkit.

Thank you very much for reading this article. I hope it was insightful for you. If you have any follow-up questions or comments, please leave them in the comments section.

Reference

All the images are produced by the author unless otherwise stated.

SQL vs. Calculators: Building Champion/Challenger Tests from Scratch
https://towardsdatascience.com/sql-vs-calculators-building-champion-challenger-tests-from-scratch-b457dc43d784/

In depth SQL code for creating your own statistical test design

CODE OR CLICK: WHAT IS BETTER FOR A/B TESTING
Image from Imagen 3

The $300 Million Button: How A/B Testing Changed E-Commerce Forever

I am sure a lot of people are aware of the $300 million button story. For those who are not, it is about a major e-commerce platform losing millions in potential revenue due to customer drop-offs at checkout. When this large online retailer changed a single button labeled "Register" to "Continue," with an option to register later, the company saw a $300 million increase in annual revenue. This case study was documented by UX expert Jared Spool (Source: UIE, Jared Spool, "The $300 Million Button"), showing how a minor change can drastically impact business outcomes.

Yet surprisingly, 58% of executives still rely on intuition when making business decisions, according to a PwC report (Source: PwC Global Data and Analytics Survey). I have always believed that the intuition of people with industry knowledge who are well-versed in business processes is important, but it adds more value when combined with observed evidence from data and numbers in decision-making. Champion/challenger testing is one such approach to decision-making that turns guesswork into scientific validation.

What Is Champion/Challenger Testing?

Champion/challenger testing (A/B testing) is a technique used by businesses to optimize processes and operations by selecting the options that improve performance: increasing revenue, reducing costs, and enhancing decision-making. The champion is the current operation or methodology that works best, while the challenger is the method or new strategy you want to test against your champion to see if it works better or worse than your current process. Your champion and challenger should have the same type of setup, such as similar types of accounts or customer segments, to ensure an apples-to-apples comparison. It is important to know what goal you are trying to achieve and what your key performance indicators should be to measure the success of the test.

Implementation Through Oracle SQL: A Practical Guide

When implementing champion/challenger testing, I have often wondered whether to rely on online calculators or invest in a database-driven SQL implementation. The answer depends on various factors, but let us explore an SQL approach through a practical example. Along the way, I will also walk you through the variables and conditions to consider so that we design a solid champion/challenger test.

Imagine a collection agency wanting to test the effectiveness of leaving voicemails versus not leaving them. The current strategy involves no voicemails. Some believe leaving voicemails could improve metrics like contact rate and payment rate, but implementing the change across all accounts carries risks: a potential reduction in contact rates, compliance considerations around leaving messages, the resource cost of leaving voicemails, and a possible decrease in payment rates. Let us design a rigorous test to evaluate the hypothesis.

To begin our implementation, we need a structured foundation that will track our test from start to finish. I used Oracle SQL Developer to write my SQL and, for illustration purposes in the voicemail testing context, I assumed the key component values mentioned below to generate the voicemail champion/challenger test. Here is what each of these key components means:

  1. Baseline Conversion Rate: Your current conversion rate for the metric you’re testing. In this specific voicemail test example, we are assuming 8% current payment rate as baseline conversion rate.
  2. Minimum Detectable Effect (MDE): The smallest improvement in conversion rate you care about detecting. For voicemails, we want to see if we can improve the current conversion rate by 10%, which means increasing it to 8.8% (8% * (1 + 0.10) = 8.8%).
  3. Statistical Significance Level: Typically set at 95%, meaning you’re 95% confident that your results are not due to chance.
  4. Statistical Power: Often set at 80%, this is a measure of whether the test has enough data to reach a conclusive result.
  5. Hypothesis / Tail type: a statement that predicts whether changing a certain variable will affect customer behavior. There are two types of hypotheses to consider, more commonly known as tail tests:

a) One-tail test: This test is recommended only when you want to know whether something is only better than current performance, or only worse. In the voicemail test, a one-tail test means we only want to know whether voicemails improve payment rates.

b) Two-tail test: This test is recommended when you need to detect any change in performance, whether better or worse than the current process. In the voicemail test, a two-tail test means we want to know whether voicemails will increase or decrease payment rates.

As we do not know whether voicemails will increase or decrease payment rates, we will be going with a two-tailed test.

with test_parameters as(
    select 
        0.08 as baseline_rate,       -- assuming current rate of 8% of payment rate
        10 as min_detectable_effect, -- wanting 10% improvement
        95 as significance_level,    -- 95% confidence level
        80 as statistical_power,     -- 80% statistical power
        'TWO' as tail_type,          -- 'ONE' or 'TWO' for tail type test 
        &volume as monthly_volume    -- substitution variable; a dynamic query to pull volume data can be used 
        -- example: (select count(*) from accounts where assign_date>=add_months(sysdate,-1) ) 
    from dual
    )

   select * from test_parameters;
SQL prompt for monthly volume input
Output Result

The configuration above is important because it records what we are testing and why. These metrics are the key components of the sample size calculation. I will show you the sample size calculation, the split ratio, the months and days needed to run your test and, finally, the recommendation results for different monthly volumes.

Sample Size Calculation

Using the right sample size is important to make sure your test results are statistically significant. A sample size that's too small may produce inaccurate results. Larger sample sizes give you more accurate average values, help identify outliers in the data and provide smaller margins of error. The question, ultimately, is what counts as a too-small versus too-large sample size. You will find the answer as you go through the article.

The Oracle script below shows how to calculate the sample size. I am using CTEs and have split them into multiple sections of snapshots to explain the code better. If you want to use the script, you need to put all the sections of code together. Now, let us set up our statistical parameters.

--statistical parameter conversion
    ,statistical_parameters as(
    select
        baseline_rate,
        min_detectable_effect,
        monthly_volume,
        tail_type,

    --set confidence level z-score based on tail type
        case when tail_type='ONE' then 
         case significance_level 
              when 90 then 1.28 -- One tailed test for 90% confidence
              when 95 then 1.645 -- One tailed test for 95% confidence
              when 99 then 2.326 -- One tailed test for 99% confidence
              else 1.645 end 
         else
             case significance_level 
              when 90 then 1.645 -- Two tailed test for 90% confidence
              when 95 then 1.96 -- Two tailed test for 95% confidence
              when 99 then 2.576 -- Two tailed test for 99% confidence
              else 1.96 end end as z_alpha,

    --set power level z-score (same for both tail types)
        case statistical_power
            when 80 then 0.84
            when 90 then 1.28
            when 95 then 1.645
            else 0.84 end as z_beta
    from test_parameters
    )

    select * from statistical_parameters;

This step converts the confidence levels into the statistical values used in sample size calculations. For collections, 95% confidence means we accept a 5% chance of the results being wrong, i.e., of concluding that voicemails help when they don't.

In statistical terms, z-alpha represents our confidence level, with different values depending on both the confidence level and the tail type. Two-tailed z-values are typically higher than one-tailed values because the error rate is split in both directions for a two-tailed test. In the voicemail scenario, the 5% chance of being wrong is split evenly (a 0.025 probability for payment rates going lower and 0.025 for payment rates going higher), whereas a one-tailed test concentrates the entire 0.05 probability in one direction, as we would only be interested in payments going either up or down, not both.

Statistical power is represented by z-beta. When we set 80% statistical power (z-beta = 0.84), we are saying we want to catch real changes 80% of the time and will accept missing them 20% of the time.

Z-alpha and z-beta put together mean that if voicemails truly help improve payment rates, we will detect this improvement 80% of the time, and when we do detect it, we can be 95% confident it is a real improvement and not due to chance.

Output Result

Let us now move to the calculation of the sample size needed. This calculation determines how many accounts we need to test. In our voicemail scenario, where we are looking to move from an 8% to an 8.8% payment rate, it tells us how many accounts we need to be confident that any increase or decrease in payment rate is real and not just due to chance.
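For reference, the expression in the script below mirrors the standard two-proportion sample size formula. Plugging in the values assumed earlier (a two-tailed test, so z_alpha = 1.96; 80% power, so z_beta = 0.84; an 8% baseline rate and a 10% MDE) gives roughly:

n per group = 2 * (z_alpha + z_beta)^2 * p * (1 - p) / (p * MDE)^2 = 2 * (1.96 + 0.84)^2 * 0.08 * 0.92 / (0.08 * 0.10)^2 ≈ 18,032 accounts

This lines up with the roughly 18k accounts per group used later in the article.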

--Sample size calculation
    ,sample_size_calculation as(
    select 
        baseline_rate,
        min_detectable_effect,
        monthly_volume,
        tail_type,
        z_alpha,
        z_beta,

    --calculate minimum effect size
        baseline_rate*(min_detectable_effect/100) as minimum_effect,

    --calculate base sample size
        ceil(
             case tail_type 
                  when 'ONE' then
                       ( power(z_alpha + z_beta, 2) * baseline_Rate * (1 - baseline_Rate)) / (power(baseline_Rate * (min_detectable_effect/100), 2))
                  else
                       (2 * power(z_alpha + z_beta, 2) * baseline_Rate * (1 - baseline_Rate)) / (power(baseline_Rate * (min_detectable_effect/100), 2)) 
                  end
             ) as required_sample_size     
    from statistical_parameters
    )
Output Result

Split Ratios and Test Duration

Split ratios determine how you divide your dataset between the champion (your current version) and the challenger(s) (your test versions). Common split ratios are two-way (like 50/50, 80/20 or 90/10 splits) or multi-way (like 50/25/25 or 70/10/10/10). Multi-way tests are used to test several variations while still keeping a control group.

Choosing a split ratio should not be random or depend solely on volume availability. It should also consider factors like your confidence in the challenger, the impact of the change (especially if it hurts current metrics), and whether the test meets the minimum sample size requirement.

The analysis below translates statistical requirements into business terms and shows how different split ratios affect test duration. It also shows the risk level associated with each split ratio. Split ratios represent how we divide accounts between champion and challenger.

 --split ratio
    ,split_ratios as(
    --generate split ratios from 10 to 50 for challenger
    Select  
        level * 10 as challenger_pct,
        100 - (level * 10) as control_pct
    from dual
    connect by level <= 5 -- This generates 10/90, 20/80, 30/70, 40/60, 50/50
    )

    --split_analysis
    ,split_analysis as(
    select 
        s.baseline_Rate * 100 as current_rate_pct,
        s.baseline_rate * (1 + s.min_detectable_effect/100) * 100 as target_rate_pct,
        s.min_detectable_effect as improvement_pct,
        s.tail_type,
        s.required_sample_size as sample_size_per_group,
        s.required_sample_size * 2 as total_sample_needed,
        s.monthly_volume,
        r.challenger_pct,
        r.control_pct,

    --calculate test duration (months) for different splits
        round(s.required_sample_size / (s.monthly_volume * (r.challenger_pct/100)), 1) as months_needed,

    --calculate test days needed for each split
        round(s.required_sample_size / (s.monthly_volume * (r.challenger_pct/100)) * 30, 0) as days_needed,

     --Assess risk level for each split
        case 
            when r.challenger_pct <= 20 then 'Conservative'
            when r.challenger_pct <= 35 then 'Balanced'
            else 'Aggressive' end as risk_level
    from sample_size_calculation s cross join split_ratios r
    )

    select * from split_analysis;

A conservative split only impacts the 10–20% of accounts getting the new treatment and protects the remaining 80–90% from potential negative impacts, but it takes longer to gather enough data. A balanced split impacts one third of the accounts and protects the rest, while gathering data faster. An aggressive split impacts up to half the accounts; it gathers data quickly, but exposes more accounts to risk.

Part of the output result

It is important to know how long a champion/challenger test should run. Run a test for too short a time and you risk making decisions based on incomplete or misleading data; run it too long and you may waste resources and delay decision-making. To maintain the balance, tests should generally run for a minimum of one full business cycle. They typically shouldn't run for more than 4–8 weeks, so that results don't get mixed up with other operational or seasonal changes taking place.

Risk Assessment and Volume Requirements

I often observe that analysts new to champion/challenger testing do not know which split ratio to opt for. We can decide by considering the risks associated with each split ratio and the volume needed to support it.

The worst-case scenario must be calculated to assess the risk level.

,risk_Assessment as(
        select 
            monthly_volume,
            sample_size_per_group,
            challenger_pct,
            risk_level,
        --assess potential impact
    round(monthly_volume * (challenger_pct/100) * (current_rate_pct/100)) as accounts_at_risk,
    round(monthly_volume * (challenger_pct/100) * (current_rate_pct/100) * (1 - (improvement_pct/100))) as worst_case_scenario
        from split_analysis
    )

    ,volume_recommendations as(
        select distinct 
            sample_size_per_group,
            --recommended monthly volumes for different completion timeframes for all splits
            ceil(sample_size_per_group / 0.5) as volume_for_1_month_50_50, --50/50 split
            ceil(sample_size_per_group / 0.4) as volume_for_1_month_40_60, --40/60 split
            ceil(sample_size_per_group / 0.3) as volume_for_1_month_30_70, --30/70 split
            ceil(sample_size_per_group / 0.2) as volume_for_1_month_20_80, --20/80 split
            ceil(sample_size_per_group / 0.1) as volume_for_1_month_10_90  --10/90 split
        from split_analysis
        )
Part of the output result

Let us say we opt for the 30/70 split ratio, which shows as a 'balanced' split for voicemails. With 10,000 monthly accounts, 3,000 accounts will receive voicemails while 7,000 accounts continue as normal. If voicemails perform poorly, this affects 3,000 accounts, and the maximum exposure is 240 payments at risk (3,000 * 8%). In the scenario where the voicemail test decreases payment rates by 10% instead of improving them, we would only receive 216 payments (3,000 * 8% * (1 – 10%)). This means we lose 24 payments which we would otherwise have received.

This worst-case calculation helps us understand what's at risk. With a more aggressive 50/50 split, we would have 5,000 accounts in the test group, risking a potential loss of 40 payments under worst-case conditions. A conservative 20/80 split would only risk 16 payments, though it would take longer to complete the test.

With a 50/50 split, we need a total volume of 36k accounts to get our required 18k accounts in the test group. Since we only have 10k accounts monthly, this means our test would take approximately 3.6 months to complete. Moving to the most conservative 10/90 split would require 180k accounts, making the test duration impractically long at 18 months.
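As a quick sanity check on these durations, the arithmetic can be reproduced with a one-off query run separately from the main script; 18,032 is the per-group sample size computed from the earlier formula and 10,000 is the assumed monthly volume:

--sanity check: months needed per split at a 10k monthly volume
select
    round(18032 / (10000 * 0.5), 1) as months_50_50,  -- ~3.6 months
    round(18032 / (10000 * 0.3), 1) as months_30_70,  -- ~6.0 months
    round(18032 / (10000 * 0.1), 1) as months_10_90   -- ~18 months
from dual;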

,final_Recommendation as(
    select
        sa.*,
        ra.accounts_At_Risk,
        ra.worst_case_scenario,
        vr.volume_for_1_month_50_50,
        vr.volume_for_1_month_40_60,
        vr.volume_for_1_month_30_70,
        vr.volume_for_1_month_20_80,
        vr.volume_for_1_month_10_90,
        --Generate final recommendations based on all split ratios
    case when sa.monthly_volume >= vr.volume_for_1_month_50_50 and sa.challenger_pct = 50 
         then 'AGGRESSIVE: 50/50 split possible. Fastest completion in ' || sa.days_needed || ' days but highest risk ' 
         when sa.monthly_volume >= vr.volume_for_1_month_40_60 and sa.challenger_pct = 40 
         then 'MODERATELY AGGRESSIVE: 40/60 split feasible. Completes in ' || sa.days_needed || ' days with moderate-high risk.'
         when sa.monthly_volume >= vr.volume_for_1_month_30_70 and sa.challenger_pct = 30 
         then 'BALANCED: 30/70 split recommended. Completes in ' || sa.days_needed || ' days with balanced risk.'
         when sa.monthly_volume >= vr.volume_for_1_month_20_80 and sa.challenger_pct = 20 
         then 'CONSERVATIVE: 20/80 split possible. Takes ' || sa.days_needed || ' days with lower risk.'
         when sa.monthly_volume >= vr.volume_for_1_month_10_90 and sa.challenger_pct = 10 
         then 'VERY CONSERVATIVE: 10/90 split possible. Takes ' || sa.days_needed || ' days but minimizes risk.'
         else 'NOT RECOMMENDED: Current volume of ' || sa.monthly_volume || ' insufficient for reliable testing with ' 
              || sa.challenger_pct || '/' ||  sa.control_pct || ' split.' end as recommendation
    from split_analysis sa join risk_assessment ra on sa.challenger_pct=ra.challenger_pct
        cross join volume_recommendations vr 
        )
select      
        tail_type as test_type,
        current_rate_pct || '%' as current_rate,
        target_rate_pct || '%' as target_rate,
        improvement_pct || '%' as improvement,
        sample_size_per_group as needed_per_group,
        total_sample_needed as total_needed,
        monthly_volume,
        challenger_pct || '/' || control_pct || ' split' as split_ratio,
        days_needed || ' days (' || round(months_needed, 1) || ' months)' as duration,
        risk_level,
        accounts_At_Risk || ' accounts at risk' as risk_exposure,
        worst_Case_Scenario || ' worst case' as risk_scenario,
            case
                when challenger_pct = 10 then
                    case    
                        when monthly_volume >= volume_for_1_month_10_90 
                        then 'Current volume (' || monthly_volume || ') sufficient for 10/90 split'
                        else 'Need ' || volume_for_1_month_10_90 
                        || ' monthly accounts for 10/90 split (current: ' || monthly_volume || ')'
                    end
                when challenger_pct = 20 then
                    case    
                        when monthly_volume >= volume_for_1_month_20_80 
                        then 'Current volume (' || monthly_volume || ') sufficient for 20/80 split'
                        else 'Need ' || volume_for_1_month_20_80 
                        || ' monthly accounts for 20/80 split (current: ' || monthly_volume || ')'
                    end
                 when challenger_pct = 30 then
                    case    
                        when monthly_volume >= volume_for_1_month_30_70 
                        then 'Current volume (' || monthly_volume || ') sufficient for 30/70 split'
                        else 'Need ' || volume_for_1_month_30_70 
                        || ' monthly accounts for 30/70 split (current: ' || monthly_volume || ')'
                    end
                 when challenger_pct = 40 then
                    case    
                        when monthly_volume >= volume_for_1_month_40_60 
                        then 'Current volume (' || monthly_volume || ') sufficient for 40/60 split'
                        else 'Need ' || volume_for_1_month_40_60 
                        || ' monthly accounts for 40/60 split (current: ' || monthly_volume || ')'
                    end
                else
                    case    
                        when monthly_volume >= volume_for_1_month_50_50 
                        then 'Current volume (' || monthly_volume || ') sufficient for 50/50 split'
                        else 'Need ' || volume_for_1_month_50_50 
                        || ' monthly accounts for 50/50 split (current: ' || monthly_volume || ')'
                    end
                end as volume_assessment,
            recommendation
        from final_Recommendation
        order by challenger_pct;
Part of the output result for 10k monthly volume

If monthly volume is 50,000 accounts:

Part of the output result for 50k monthly volume

Certain questions need to be thought through to decide which split ratio to choose, which risk level is acceptable, and ultimately what volume is available to test voicemails. Can the business accept potentially losing 40 payments monthly in exchange for completing the test in 3.6 months, or would it be better to risk only 16 payments monthly but extend the test duration? By carefully choosing your split ratio and understanding what sample sizes are appropriate, you can design tests that provide accurate and actionable insights.

Calculators versus SQL Implementation

Online calculators like Evan Miller's and Optimizely's are valuable tools, typically defaulting to a 50/50 split ratio or two-tailed tests. Another online tool, Statsig, doesn't default to anything, but it also doesn't provide the additional details we just coded in our SQL implementation. The SQL implementation becomes valuable here as it helps track not just the basic metrics but also risk exposure and test duration based on your actual monthly volume. This comprehensive view helps especially when you need to deviate from standard 50/50 splits or want to understand the effect of different split ratios on your test design and business risks.

Continuous Testing

Champion/challenger testing is not a one-time effort but a continuous cycle of improvement. Create performance reports and continuously monitor the results. Adapt to changing conditions, including seasonal shifts and economic changes. By integrating this approach into your strategy testing, you create a systematic approach to decision-making that drives innovation, mitigates risk, and, most importantly, backs up intuition with solid data evidence.
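A minimal sketch of such a monitoring query is shown below; the test_accounts table, the test_group flag and the paid_flag column are hypothetical names used purely for illustration, and would map to however you tag accounts in your own schema:

--hypothetical monitoring query: compare champion vs. challenger performance
select
    test_group,                                   -- e.g., 'CHAMPION' or 'CHALLENGER' (assumed flag)
    count(*)                        as accounts,
    sum(paid_flag)                  as payments,
    round(avg(paid_flag) * 100, 2)  as payment_rate_pct
from test_accounts                                -- hypothetical table of accounts enrolled in the test
where assign_date >= add_months(sysdate, -1)      -- last month of assignments
group by test_group
order by test_group;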

Note: All images, unless otherwise noted, are by the author.

The post SQL vs. Calculators: Building Champion/Challenger Tests from Scratch appeared first on Towards Data Science.

Your Data Quality Checks Are Worth Less (Than You Think) https://towardsdatascience.com/your-data-quality-checks-are-worth-less-than-you-think-c8bd181a1327/ Wed, 20 Nov 2024 18:04:28 +0000 https://towardsdatascience.com/your-data-quality-checks-are-worth-less-than-you-think-c8bd181a1327/ How to deliver outsized value on your data quality program


Over the last several years, data quality and observability have become hot topics. There is a huge array of solutions in the space (in no particular order, and certainly not exhaustive):

Regardless of their specific features, all of these tools have a similar goal: improve visibility of data quality issues, reduce the number of data incidents, and increase trust. Despite a lower barrier to entry, however, data quality programs remain difficult to implement successfully. I believe there are three pieces of low-hanging fruit that can improve your outcomes. Let's dive in!

Hint 1: Focus on process failures, not bad records (when you can)

For engineering-minded folks, it can be a hard pill to swallow that some number of "bad" records will not only flow into your system but through your system, and that may be OK! Consider the following:

  1. Will the bad records flush out when corrected in the source system? If so, you may go to extraordinary lengths in your warehouse or lakehouse to correct data that is trivial for a source system operator to fix, when that upstream fix would make your reporting correct on the next refresh anyway
  2. Is the dataset useful if it’s "directionally correct" in aggregate? CRM data is a classic example, since many fields need to be populated manually, and there’s a relatively high error rate compared to automated processes. Even if these errors aren’t corrected, as long as they’re not systemic, the dataset may still be useful
  3. Is accuracy of individual records extremely important? Financial reporting, operational reporting on sensor data from expensive machinery, and other "spot-critical" use cases deserve the time and effort needed to identify (and possibly isolate, remove, or remediate) bad records

If your data product can tolerate Type 1 or Type 2 issues, fantastic! You can save a lot of effort by focusing on detection and alerting of process failures rather than one-off or limited anomalies. You can measure high-level metrics skimmed from metadata, such as record counts, unique counts of key columns, and min/max values. A rogue process in your application or SaaS systems can generate too many or too few records, or perhaps a new enumerated value has been added to a column unexpectedly. Depending on your specific use cases, you may need to write custom tests (e.g., total revenue by date and market segment or region), so make sure to profile your data and common failure scenarios.
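To make this concrete, here is a minimal sketch of such a process-level check, assuming a daily-loaded orders table; the table name, columns, and thresholds are hypothetical and would need to be adapted to your own data:

--hypothetical process-level check: flag days where the load looks abnormal
select
    trunc(load_date)             as load_day,
    count(*)                     as record_count,
    count(distinct order_id)     as unique_orders,
    min(order_total)             as min_order_total,
    max(order_total)             as max_order_total
from orders                                       -- hypothetical table and columns
where load_date >= trunc(sysdate)
group by trunc(load_date)
having count(*) < 1000                            -- assumed lower bound for a normal daily load
    or count(*) > 100000                          -- assumed upper bound
    or count(*) <> count(distinct order_id);      -- duplicate keys hint at a process failure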

On the other hand, Type 3 issues require more complex systems and decisions. Do you move bad records to a dead-letter queue and send an alert for manual remediation? Do you build a self-healing process for well-understood data quality issues? Do you simply modify the record in some way to indicate the data quality issue so that downstream processes can decide how to handle the problem? These are all valid approaches, but they do require compute ($) and developer time ($$$$) to maintain.

Hint 2: Don’t duplicate your efforts

Long data pipelines with lots of transformation steps require a lot of testing to ensure data quality throughout, but don’t make the mistake of repeatedly testing the same data with the same tests. For example, you may be testing that an identifier is not null from a SaaS object or product event stream upon ingestion, and then your transform steps implement the same tests:

Image by Author
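To make the duplication concrete, here is a minimal sketch of the pattern described above; raw_events and stg_events are hypothetical table names:

--check at ingestion: the identifier must not be null
select count(*) as null_event_ids
from raw_events
where event_id is null;

--the same check repeated on the downstream model adds cost without adding coverage
select count(*) as null_event_ids
from stg_events
where event_id is null;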

These kinds of duplicate tests can add to cloud spend and development costs, even though they provide no value. The tricky part is that even if you’re aware that duplication is a bad pattern, long and complex pipelines can make reasoning about all of their data quality tests difficult.

To my knowledge, there isn’t mature tooling available to visualize data quality lineage; just because an upstream source has a data quality test doesn’t necessarily mean that it will capture the same kinds of issues as a test in a downstream model. To that end, engineers need to be intentional about data quality monitors. You can’t just add a test to a dataset and call it a day; you need to think about the broader data quality ecosystem and what a test adds to it.

Hint 3: Avoid alert fatigue and focus on what matters

Perhaps one of the biggest risks to your data quality program isn’t too few data quality tests; it’s too many! Frequently, we build out massive suites of data quality monitors and alerts, only to find our teams overwhelmed. When everything’s important, nothing is.

Photo by Brandon Schmidt on Unsplash

If your team can’t act on an alert, whether because of an internal force like capacity constraints or an external force like poor data source quality, you probably shouldn’t have it in place. That’s not to say that you shouldn’t have visibility into these issues, but they can be reserved for reports on a less frequent basis, where they can be evaluated alongside more actionable alerts.

Likewise, on a regular basis, review alerts and pages, and ruthlessly cut the ones that weren’t actionable. Nobody’s winning awards for generating pages and tickets for issues that couldn’t be resolved, or whose resolution wasn’t worth an engineer’s time to address.

Conclusion

Data quality monitoring is an essential component of any modern data operation, and despite the plethora of tools, both open source and proprietary, it can be difficult to implement a successful program. You can spend a lot of time and energy on data quality without seeing results.

To summarize:

  1. When possible, focus on aggregated data rather than individual data points
  2. Only test the data once for the same data quality issue. Duplicate tests waste compute and developer time
  3. Ensure your alert volume doesn’t overwhelm your team. If they can’t act on all of the alerts in a reasonable amount of time, you either have to staff up, or you need to cut down on the alerts

All of that being said, the most important thing to remember is to focus on value. It can be difficult to quantify the value of your data quality program, but at the very least, you should have some reasonable thesis about your interventions. We know that frozen, broken, or inaccurate pipelines can cost significant amounts of developer, analyst, and business stakeholder time. For every check and monitor, think about how you are or aren’t moving the needle. A big impact doesn’t require a massive program, as long as you target the right problems.

The post Your Data Quality Checks Are Worth Less (Than You Think) appeared first on Towards Data Science.

The “Gold-Rush Paradox” in Data: Why Your KPIs Need a Rethink https://towardsdatascience.com/the-gold-rush-paradox-in-data-why-your-kpis-need-a-rethink-9777e5dd01cd/ Tue, 05 Nov 2024 18:14:47 +0000 https://towardsdatascience.com/the-gold-rush-paradox-in-data-why-your-kpis-need-a-rethink-9777e5dd01cd/ You're not doing as good a job as you think you are



Introduction

There’s an interesting paradox in data engineering I’ve observed over the last couple of years.

On the one hand, Data and AI are touted as the new oil. The Data Engineering community is growing at a rate of knots. Some data engineers are "famous" and have reported salaries of $600,000. Must be getting some serious results…

Despite this, there are serious problems with data. Data Teams are viewed as a cost centre in many organisations, and many were let go during 2022. Duplication, or model sprawl, is rife, and governance is on the rise.

Businesses can barely make use of their data, and the perception amongst executives and leaders in the space is that the general quality of all this engineered data is quite low; certainly not AI-ready.

This leads to the "Gold-rush" Paradox:

The "Gold-Rush Paradox" encapsulates the tension between the high value placed on data and AI (akin to a modern-day gold rush) and the substantial difficulties in making data truly valuable for business. While there’s an influx of talent and investment, companies still struggle with data quality, governance, and the actual utility of their data. In other words, the data is seen as incredibly valuable, yet businesses often fail to refine it into something useful, leaving the high salaries and intense demand somewhat disconnected from tangible business outcomes.

A possible explanation lies in the differences between data engineering and software engineering.

Data Engineering is not Software Engineering

The art of software engineering has been honed for many years. The DORA metrics are now commonly accepted as the standard for running software teams using Agile effectively.

There is no similar framework for Data Engineering. The portability of the DORA metrics for data teams is questionable.

In this article, we’ll take a look at some metrics Data Engineering Teams use and show why these need a rethink to truly render Data Teams high-performing.

Settings and use-cases

The use-cases for data engineering vary wildly. A Data Engineer could be responsible for maintaining a Kafka cluster that ingests and transforms petabytes of data in near real-time for a proprietary trading business – this type of use-case is operational.

The aforementioned data engineer with a $600k salary worked at Netflix – this, I would also classify as operational.

Ingesting data daily from Salesforce to calculate lead times for deals and surfacing this information to sales reps to better inform them of their performance would also be called data engineering. This is more of a "Business Intelligence" or "BI" use-case.

ChatGPT defines BI as:

BI (Business Intelligence) is the process of collecting, analyzing, and transforming raw data into actionable insights to help businesses make informed decisions. It involves a combination of technologies, tools, and practices that allow organizations to analyze historical and current data to improve strategic, tactical, and operational decision-making.

I actually think the DORA metrics carry over pretty well when the day-to-day of data engineering looks more like software engineering. You are fundamentally shipping code and maintaining infrastructure; yes, the end product is data rather than an app or a piece of software, but the day-to-day work is much the same.

BI is where the gold rush paradox exists.

The current state: technical metrics for technical engineers

When performance reviews for data engineers and Analytics engineers come around, as they inevitably do, there is an open question around what metrics to use.

One common metric that data teams think about is failure rate – the percentage of data pipelines that fail.

For example, you might have an hourly pipeline that fails once in a month. That's 1/(24*30) ≈ 0.1%. Pretty good, right?

The problem is that in BI, this can be detrimental.

That single failure could be 9am on the month-end, when a finance team member is relying on the data to help them automate their month close.

If that one point in time fails, then they can’t make a decision. There is no BI that is served here.

Data Engineers are therefore doing a great job, but in this exaggerated example, the value of the work is literally 0.

The answer: rejig the KPIs

Fundamentally, there is simply a misalignment of incentives. Although Data Engineers are effectively employed by the business and for the business, their KPIs don't reflect this.

Neither do their processes – many data teams lack things like data orchestration, observability, data quality testing, automated alerts to end stakeholders and so on. Lacking these core parts of infrastructure increases the likelihood of adversely impacting the end user – these are all things software engineers would sooner quit than do without.

The answer is to rejig the KPIs to re-align incentives.

Data Teams should focus on the problems they are trying to solve, and adapt accordingly. For example, you could change the KPI to the number of tickets or data requests received around month-end.

Data Teams should also take the relevant parts of software engineering, but not all of them.

The DORA metrics may be a helpful component (velocity springs to mind), but monitoring failure rates is doomed to fail.

The Key DORA Metrics. Image by the Author

One area Data Teams should heed is that of standards – software engineers employ a culture of Continuous Integration and Continuous Deployment.

This requires careful set-up, end-to-end monitoring and logging for release pipelines, quality checking, testing and so on. Software engineers can spend anywhere from 25% to 70% of their time writing tests – it is a broad-brush statement, but most data engineers serving BI use-cases do not do this.

We’ve already seen that extremely high levels of uptime, if we are to generalise it that way, are critical to get the job done (perhaps you don’t know when Finance will need the data, but it will be detrimental if there’s a failure or inaccuracy they don’t know about when they do).

How can we expect to deliver that if we don’t also have the same level of robust infrastructure and testing as software engineers?

Not only do KPIs need to be rejigged, but the whole culture and approach to testing, CI/CD, and uptime needs to change.

Conclusion

It’s not all doom and gloom! There are many wonderful case studies of Data Teams making hay while the sun is shining in this golden age of Data and AI.

However, for BI use-cases, the quality of the end product still has some way to go in general.

One possible solution is to rejig the KPIs Data Teams and Analytics Teams use. In addition, it is recommended to shift the mindset of Data Engineering teams, particularly around risk and the acceptability of failures in production, closer to that of software engineering teams.

This will be hard. Many Data Engineers, myself included, have transitioned into the field from more of an analyst or non-technical background. The nous for software engineering best practices is not something that can be picked up overnight (as the founder of a software company, I know this all too well).

What's inevitable is that the KPIs need to change. The Data World is crying out for a standardised framework for effectively running data teams. The landscape is changing fast, and I for one am looking forward to fewer tools and more best practices.

The post The “Gold-Rush Paradox” in Data: Why Your KPIs Need a Rethink appeared first on Towards Data Science.

Are you Aware of the Potential of Your Data Expertise in Driving Business Profitability? https://towardsdatascience.com/are-you-aware-of-the-potential-of-your-data-expertise-in-driving-business-profitability-16cffb607437/ Thu, 24 Oct 2024 04:40:35 +0000 https://towardsdatascience.com/are-you-aware-of-the-potential-of-your-data-expertise-in-driving-business-profitability-16cffb607437/ A reflection of a supply chain data scientist who randomly discovered the power of data analytics to help small and large businesses


The biggest challenge I faced in Analytics projects was estimating the return on investment (ROI) of a solution.

What is the ROI of a forecasting engine for sales prediction?

This is often the first question a decision-maker asks when you propose designing a tool to solve their operational or Business problem.

As a solution design manager in logistics, my job was to price warehousing and transportation operations for retail and fashion companies.

Because of this, estimating the ROI of my solutions was slightly more manageable, but it was still difficult to convince decision-makers.

For example, I would explain, "This algorithm will improve picking productivity by 25%, which will result in a 12% reduction in variable costs."

These successes inspired me to share the solutions covering 60+ operational case studies published on this Medium blog.

While my focus was on improving Logistics operations, something unexpected happened along the way:

What if I shifted my focus to business profitability?

In one project, I applied these tools to a business case study: maximizing the profitability of a bakery.

The feedback I received helped me realize that both small and large businesses need to optimize their margins and automate data-driven decisions.

By speaking the language of business, I could sell my analytics solutions more effectively.

In this article, I want to share the insights I’ve gained from discovering the power of data to help businesses – and why I believe you should consider this path, too.

Summary

I. Introduction
Exploring the challenge of proving ROI in analytics projects
II. How Did I Develop My Business Acumen?
Sharing my early career experience as a Solution Design Manager
III. My journey discovering data analytics for business optimization
Optimization methods to help a bakery business improve profitability.
IV. Use your Analytics Skills to Solve Business Problems
How data analytics can answer the needs of decision makers.
V. How to Adapt your Analytics Approach to Business Problems?
The importance of translating business problems into simple analytics solutions 
VI. Conclusion
Understanding processes is an important skill to answer business problems

How did I develop my business acumen?

For the first four years of my career, I designed warehousing and transportation solutions for major international companies across Asia.

What does a Supply Chain Solution Design Manager do?

For example, imagine a retailer like Costco wants to set up a distribution centre in Shanghai.

  1. They provide data on volumes and process requirements in an RFP.
  2. My job is to design the solution (layout, staffing, equipment) and develop the pricing based on a cost model with over 100 parameters.
  3. We present the solution to the customer with a detailed pricing grid.

Our main concern? Gross margin!

To win the project, I had to ensure competitive pricing while maintaining a minimum margin without pricing below our costs.

Pricing Structure – (Image by Author)

For example, if I quoted €1.25 per box for picking, I knew exactly how the costs and margin were broken down:

  • €0.57 for workforce costs
  • €0.37 for equipment and consumables
  • €0.20 for fixed warehouse costs
  • €0.11 for our gross margin (an 8.8% margin on sales)

What’s next? You win the business and sign a three-year contract for a 5M€ budget.

But what happens if the customer wants to lower the price to €1.10 per box?

This often happens in low-margin companies like traditional retailers, automotive after-sales distributors or consumer goods.

You must find ways to reduce costs while keeping your 8.8% margin intact.

Therefore, I have been using data analytics to

Continuous Improvement Initiatives – (Image by Author)

These initiatives sit alongside many other operational improvements supported by data analytics, which are shared in this Medium blog.

Can I apply similar methodologies outside of logistics?

My journey discovering data analytics for business optimization

It began while consulting for a small logistics company that delivered products to bakeries in Paris.

With my naive perspective as a solution designer, I spoke with a bakery chain owner to understand their business model:

  • What is the profit margin (%) on a baguette sold for €1.50?
  • How much does the workforce cost (€) per croissant sold?
  • What percentage (%) of your costs are fixed?
  • What is the most profitable item in your shop?

To my surprise, they couldn’t answer any of these questions.

I realized there was another approach to doing business, where prices were set without explicit knowledge of the underlying costs, and operational visibility was almost zero.

This was a huge opportunity – what if we provided these business owners with the visibility and insights they desperately needed?

So, I took on the challenge of modelling a bakery.

I applied the same methodology I used for continuous improvement projects in logistics:

  1. Understand their current operations: fixed and variable costs, bottlenecks and revenue sources.
  2. Collect and process data, making necessary assumptions (e.g., production costs and sale prices for each item).
  3. Build a Python model to replicate their current setup and simulate different scenarios.

The result is the solution presented in the article: Maximize Business Profitability with Python.

Methodology Applied for the Bakery Profitability Case – (Image by Author)

What is the most profitable mix of products to sell?

Considering the limited resources to produce and store products, this model can provide the exact mix of products to sell.

The impact on the client exceeded my expectations.

"This is the first time I can estimate the impact of my business strategies on overall profitability," the owner said.

Additional Indicators: Equipment and Workforce usage ratio – (Image by Author)

After several iterations, we improved the algorithm to provide interesting insights.

Samir: "The bottlenecks of your production are the human resources and your oven capacity."

For such a simple algorithm, coded in less than an hour, the perceived business value was far greater than what I had experienced in the past.

This marked a turning point in my approach to using data analytics for business impact.

How can you support businesses with your analytics skills?

Use your Analytics Skills to Solve Business Problems

Since starting my consultancy and developing my sustainable supply chain SaaS, I have spoken with dozens of entrepreneurs across multiple industries.

This has been an opportunity to assess my ability to solve their problems with data and business acumen.

What do they need? Let’s explore an example!

A friend of mine, who runs a small business in the F&B industry, used my model to support decision-making and maximize revenue.

Business model of my friend – (Image by Author)

They source renewable cups from China and ship them via air or sea freight to a local warehouse.

From the warehouse, the cups are delivered to coffee shops and distributors.

What problems can data analytics solve in this case?

Among his biggest challenges were inventory management and cash flow.

"We’ve had to turn down orders because we don’t have enough cash to pay suppliers for restocking," he explained.

The core issue was clear: they needed visibility on their financial flows.

Value chain of the coffee cups – (Image by Author)

I needed to map out all the parameters involved to address this issue and model the value chain.

  • Payment terms and lead times from suppliers.
  • Service level agreements and payment terms with customers, segmented by sales channels (direct customers vs. distributors).
  • Fixed operational costs and cash flow management.
Simulation Engine of the business – (Image by Author)

The result is a simulation engine written in Python (with a reasonable granularity) that replicates my friend’s business.

What’s next? We can answer all my friend’s questions!

My friend’s biggest frustration was the lack of visibility and the inability to test hypotheses.

  • What if we reduce stock coverage from 8 weeks to 6 weeks?
  • Would it be more cost-effective to use air freight for deliveries?
  • Should we shift our sales strategy to focus only on distributors?

Due to the complexity of the parameters involved, he could not get clear answers before.

Summary of all the scenarios – (Image by Author)

Simulating these hypotheses is a matter of seconds with the model.

This approach saved thousands of euros by confirming they could safely reduce stock coverage without risking disruptions to customer supply.

It turned out to be an incredibly powerful tool for managing his business, built with simple analytics by

  • Understanding the parameters influencing the value chain
  • Simulate "what if" scenarios to assess business strategies
  • Find the optimal set-up to minimize costs and maximize profitability

Does this sound like an issue your business is facing? For more details, check out this article

Business Planning with Python – Inventory and Cash Flow Management

The insights from the model helped my friend reduce the cash needed to run the business and reduce the Cost of Goods Sold (COGS).

Later, we went beyond cost reductions by tackling revenue maximization.

He came back with another request related to pricing strategy.

Example of pricing strategies – (Image by Author)

With their new business partner, an expert in F&B who brings capital and market expertise, they were drafting strategies to boost turnover.

She proposed multiple pricing strategies, presented above, to increase the quantity ordered by customers.

My friend: "How can I assess these strategies and their impact on profitability?".

You just have to adapt the model by adding a pricing module and estimate the impact on profitability.

We can estimate the profitability and other business indicators for each pricing strategy with multiple volume scenarios.

Simulation Pricing Strategy 2 [Article: Link] – (Image by Author)

In the table above, we simulate the impact of pricing strategy two, assuming seven turnover scenarios.

Samir: "You need to get +200% growth to recover the profitability of your baseline scenario with this strategy."

We can assess each strategy proposed with a few clicks.

Business Planning with Python – Revenue Optimization

The insights generated by the model helped resolve heated discussions between the co-founders.

The process allowed them to reach common ground on the pricing strategy backed by precise profitability projections.

It sounds great, right?

But what does it take to succeed in this kind of project?

How to Adapt Your Analytics Approach to Business Problems?

As I mentioned several times, the analytics solutions designed for these projects are usually "technically basic."

I would say that 80% of the effort is in translating the business problem into an analytics solution.

Be curious! Show interest in the business model.

It’s an active process of asking the right questions to understand which indicators are important for business owners and how to model processes.

Value Chain of the Coffee Cups – (Image by Author)

Before reaching this level of modelling, I needed multiple iterations with my friend to ensure that my model accurately reflected the reality of his business.

Therefore, you need to make your model’s insights accessible to a non-technical audience so they can help you assess the accuracy of the results.

Is there a demand for this kind of solution?

There is a market for that.

Since I started working full-time as a consultant, I have received more requests for this kind of project than for my core Supply Chain engineering skills.

As you directly impact profitability and provide visibility to business decision-makers, getting CAPEX and involvement for the project is easier.

What did we learn from these two examples?

  • Business owners lack visibility on their processes and financial flows.
  • Understanding business models and processes is essential to design the right simulation model.
  • Decision makers value data-driven insights to support strategic projects.

This approach is adaptable to a wide range of business cases across different industries and company sizes.

Conclusion

I never expected to embark on a journey from a solution design manager optimizing logistics operations to a consultant helping businesses improve their profitability.

This is because I discovered that advanced analytics tools can be very effective in business optimization.

You don’t need to focus on the complexity of the analytics solutions (ML, optimization or GenAI) but on understanding the business.

Have you heard about sustainability?

I am currently learning about Europe’s Corporate Sustainability Reporting Directive (CSRD).

This will shape how companies must report on their sustainability efforts.

The idea is to introduce stricter requirements for transparency in environmental, social and governance (ESG) metrics.

ESG Pillars Presentation [Article: Link] – (Image by Author)

Our approach to business profitability presented in this article can also be applied to sustainability challenges.

Decision maker: "We need to reduce the scope 3 emissions of the distribution network by 30%."

For example, in this article about Green Inventory Management, I shared a case study on reducing carbon emissions from store deliveries.

The idea is to find the optimal delivery frequency to minimize the CO2 emissions.

Green Inventory Management [Article: Link] – (Image by Author)

The approach to answering this operational problem presented in the article is similar.

  1. Understand the operations: prepare and deliver orders for fashion retail stores

  2. Build a simulation model with Python to estimate the emissions. Inputs: Sales Data and Delivery Frequency / Outputs: CO2 emissions

  3. Test several scenarios with multiple delivery frequencies and calculate the reduction of emissions.

If you are interested in finding solutions to reduce emissions,

Data Science for Sustainability- Green Inventory Management

A similar approach for different problems.

For any case study, it is crucial to be curious, ask the right questions and engage with decision-makers to truly understand their pain points.

By doing so, you can create models that provide actionable insights to technical and non-technical audiences.

In my experience, these insights can significantly improve profitability and support decision-making for operational transformations.

If you haven’t yet considered applying your expertise to business challenges, now is the time!

You might discover a new way to create impact – just as I did.

About Me

Let’s connect on Linkedin and Twitter. I am a Supply Chain Engineer using data analytics to improve logistics operations and reduce costs.

For consulting or advice on business analytics and sustainable supply chain transformation, feel free to contact me through Logigreen Consulting.

If you are interested in data analytics and supply chain, please visit my website.

Samir Saci | Data Science & Productivity

The post Are you Aware of the Potential of Your Data Expertise in Driving Business Profitability? appeared first on Towards Data Science.
