ML Metamorphosis: Chaining ML Models for Optimized Results
The universal principle of knowledge distillation, model compression, and rule extraction

Figure 1. This and other images were created by the author with the help of recraft.ai

Machine learning (ML) model training typically follows a familiar pipeline: start with data collection, clean and prepare it, then move on to model fitting. But what if we could take this process further? Just as some insects undergo dramatic transformations before reaching maturity, ML models can evolve in a similar way (see Hinton et al. [1]) – what I will call the ML metamorphosis. This process involves chaining different models together, resulting in a final model that achieves significantly better quality than if it had been trained directly from the start.

Here’s how it works:

  • Start with some initial knowledge, Data 1.
  • Train an ML model, Model A (say, a neural network), on this data.
  • Generate new data, Data 2, using Model A.
  • Finally, use Data 2 to fit your target model, Model B.
Figure 2. An illustration of the ML metamorphosis

You may already be familiar with this concept from knowledge distillation, where a smaller neural network replaces a larger one. But ML metamorphosis goes beyond this, and neither the initial model (Model A) nor the final one (Model B) need be neural networks at all.

Example: ML metamorphosis on the MNIST Dataset

Imagine you’re tasked with training a multi-class decision tree on the MNIST dataset of handwritten digit images, but only 1,000 images are labelled. You could train the tree directly on this limited data, but the accuracy would be capped at around 0.67. Not great, right? Alternatively, you could use ML metamorphosis to improve your results.

But before we dive into the solution, let’s take a quick look at the techniques and research behind this approach.

1. Knowledge distillation (2015)

Even if you haven’t used knowledge distillation, you’ve probably seen it in action. For example, Meta suggests distilling its Llama 3.2 model to adapt it to specific tasks [2]. Or take DistilBERT – a distilled version of BERT [3] – or the DMD framework, which distills Stable Diffusion to speed up image generation by a factor of 30 [4].

At its core, knowledge distillation transfers knowledge from a large, complex model (the teacher) to a smaller, more efficient model (the student). The process involves creating a transfer set that includes both the original training data and additional data (either original or synthesized) pseudo-labeled by the teacher model. The pseudo-labels are known as soft labels – derived from the probabilities predicted by the teacher across multiple classes. These soft labels provide richer information than hard labels (simple class indicators) because they reflect the teacher’s confidence and capture subtle similarities between classes. For instance, they might show that a particular "1" is more similar to a "7" than to a "5."
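To make soft labels concrete, here is a minimal sketch of how a teacher's output logits can be turned into temperature-softened class probabilities; the temperature value and the random logits are illustrative assumptions, not settings taken from [1].

```python
import numpy as np

def soft_labels(teacher_logits: np.ndarray, temperature: float = 3.0) -> np.ndarray:
    """Turn teacher logits into softened class probabilities.

    A higher temperature flattens the distribution and exposes how the teacher
    ranks the non-target classes (e.g., that a particular "1" resembles a "7").
    """
    scaled = teacher_logits / temperature
    scaled -= scaled.max(axis=1, keepdims=True)   # for numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=1, keepdims=True)

# Example: random logits for two images over 10 digit classes; each row sums to 1
logits = np.random.default_rng(0).normal(size=(2, 10))
print(soft_labels(logits).round(3))
```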

By training on this enriched transfer set, the student model can effectively mimic the teacher’s performance while being much lighter, faster, and easier to use.

The student model obtained in this way is more accurate than it would have been if it had been trained solely on the original training set.

2. Model compression (2006)

Model compression [5] is often seen as a precursor to knowledge distillation, but there are important differences. Unlike knowledge distillation, model compression doesn’t seem to use soft labels, despite some claims in the literature [1,6]. I haven’t found any evidence that soft labels are part of the process. In fact, the method in the original paper doesn’t even rely on artificial neural networks (ANNs) as Model A. Instead, it uses an ensemble of models – such as SVMs, decision trees, random forests, and others.

Model compression works by approximating the feature distribution p(x) to create a transfer set. This set is then labelled by Model A, which provides the conditional distribution p(y|x). The key innovation in the original work is a technique called MUNGE to approximate p(x). As with knowledge distillation, the goal is to train a smaller, more efficient Model B that retains the performance of the larger Model A.

As in knowledge distillation, the compressed model trained in this way can often outperform a similar model trained directly on the original data, thanks to the rich information embedded in the transfer set [5].

Often, "model compression" is used more broadly to refer to any technique that reduces the size of Model A [7,8]. This includes methods like knowledge distillation but also techniques that don’t rely on a transfer set, such as pruning, quantization, or low-rank approximation for neural networks.

3. Rule extraction (1995)

When the problem isn’t computational complexity or memory, but the opacity of a model’s decision-making, pedagogical rule extraction offers a solution [9]. In this approach, a simpler, more interpretable model (Model B) is trained to replicate the behavior of the opaque teacher model (Model A), with the goal of deriving a set of human-readable rules. The process typically starts by feeding unlabelled examples – often randomly generated – into Model A, which labels them to create a transfer set. This transfer set is then used to train the transparent student model. For example, in a classification task, the student model might be a decision tree that outputs rules such as: "If feature X1 is above threshold T1 and feature X2 is below threshold T2, then classify as positive".
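As an illustration, the sketch below trains an opaque teacher, pseudo-labels randomly generated inputs with it, fits a shallow decision tree on this transfer set, and prints the extracted rules along with the fidelity. The synthetic data, model choices, and tree depth are my own assumptions for demonstration, not the setup studied in [9].

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

# Opaque teacher (Model A) trained on the original data
X, y = make_classification(n_samples=2000, n_features=5, random_state=0)
teacher = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Transfer set: randomly generated inputs pseudo-labelled by the teacher
rng = np.random.default_rng(0)
X_transfer = rng.uniform(X.min(axis=0), X.max(axis=0), size=(10_000, X.shape[1]))
y_transfer = teacher.predict(X_transfer)

# Transparent student (Model B): a shallow decision tree that yields readable rules
student = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_transfer, y_transfer)
print(export_text(student, feature_names=[f"X{i}" for i in range(X.shape[1])]))

# Fidelity: how closely the student mimics the teacher on the original data
print("fidelity:", (student.predict(X) == teacher.predict(X)).mean())
```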

The main goal of pedagogical rule extraction is to closely mimic the teacher model’s behavior, with fidelity – the accuracy of the student model relative to the teacher model – serving as the primary quality measure.

Interestingly, research has shown that transparent models created through this method can sometimes reach higher accuracy than similar models trained directly on the original data used to build Model A [10,11].

Pedagogical rule extraction belongs to a broader family of techniques known as "global" model explanation methods, which also include decompositional and eclectic rule extraction. See [12] for more details.

4. Simulations as Model A

Model A doesn’t have to be an ML model – it could just as easily be a computer simulation of an economic or physical process, such as the simulation of airflow around an airplane wing. In this case, Data 1 consists of the differential or difference equations that define the process. For any given input, the simulation makes predictions by solving these equations numerically. However, when these simulations become computationally expensive, a faster alternative is needed: a surrogate model (Model B), which can accelerate tasks like optimization [13]. When the goal is to identify important regions in the input space, such as zones of system stability, an interpretable Model B is developed through a process known as scenario discovery [14]. To generate the transfer set (Data 2) for both surrogate modelling and scenario discovery, Model A is run on a diverse set of inputs.

Back to our MNIST example

In an insightful article on TDS [15], Niklas von Moers shows how semi-supervised learning can improve the performance of a convolutional neural network (CNN) on the same input data. This result fits into the first stage of the ML metamorphosis pipeline, where Model A is a trained CNN classifier. The transfer set, Data 2, then contains the originally labelled 1,000 training examples plus about 55,000 examples pseudo-labelled by Model A with high-confidence predictions. I then train the target Model B, a decision tree classifier, on Data 2 and achieve an accuracy of 0.86 – much higher than 0.67 when training on the labelled part of Data 1 alone. This means that chaining the decision tree to the CNN solution reduces the error rate of the decision tree from 0.33 to 0.14. Quite an improvement, wouldn’t you say?
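The sketch below shows the shape of this second stage. It is a simplified stand-in for the actual experiment: the confidence threshold, the way Model A's predicted probabilities are passed in, and the model settings are assumptions of mine, so please refer to the repository for the real code.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

def fit_model_b(X_labeled, y_labeled, X_unlabeled, cnn_proba, X_test, y_test,
                threshold=0.95):
    """Chain a decision tree (Model B) to an already trained CNN (Model A).

    cnn_proba: class probabilities predicted by Model A for X_unlabeled.
    """
    confident = cnn_proba.max(axis=1) >= threshold            # keep confident pseudo-labels only
    X_data2 = np.vstack([X_labeled, X_unlabeled[confident]])  # Data 2 = labeled + pseudo-labeled
    y_data2 = np.concatenate([y_labeled, cnn_proba[confident].argmax(axis=1)])

    tree = DecisionTreeClassifier(random_state=0).fit(X_data2, y_data2)
    return tree, accuracy_score(y_test, tree.predict(X_test))
```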

For the full experimental code, check out the GitHub repository.

Conclusion

In summary, ML metamorphosis isn’t always necessary – especially if accuracy is your only concern and there’s no need for interpretability, faster inference, or reduced storage requirements. But in other cases, chaining models may yield significantly better results than training the target model directly on the original data.

Figure 2: For easy reference, here’s the illustration again

For a classification task, the process involves:

  • Data 1: The original, fully or partially labeled data.
  • Model A: A model trained on Data 1.
  • Data 2: A transfer set that includes pseudo-labeled data.
  • Model B: The final model, designed to meet additional requirements, such as interpretability or efficiency.

So why don’t we always use ML metamorphosis? The challenge often lies in finding the right transfer set, Data 2 [9]. But that’s a topic for another story.

References

[1] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the Knowledge in a Neural Network." arXiv preprint arXiv:1503.02531 (2015).

[2] Introducing Llama 3.2

[3] Sanh, Victor, et al. "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter." arXiv preprint arXiv:1910.01108 (2019).

[4] Yin, Tianwei, et al. "One-step diffusion with distribution matching distillation." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.

[5] Buciluǎ, Cristian, Rich Caruana, and Alexandru Niculescu-Mizil. "Model compression." Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. 2006.

[6] Knowledge distillation, Wikipedia

[7] An Overview of Model Compression Techniques for Deep Learning in Space, on Medium

[8] Distilling BERT Using an Unlabeled Question-Answering Dataset, on Towards Data Science

[9] Arzamasov, Vadim, Benjamin Jochum, and Klemens Böhm. "Pedagogical Rule Extraction to Learn Interpretable Models – an Empirical Study." arXiv preprint arXiv:2112.13285 (2021).

[10] Domingos, Pedro. "Knowledge acquisition from examples via multiple models." Proceedings of the Fourteenth International Conference on Machine Learning (ICML). Morgan Kaufmann, 1997.

[11] De Fortuny, Enric Junque, and David Martens. "Active learning-based pedagogical rule extraction." IEEE transactions on neural networks and learning systems 26.11 (2015): 2664–2677.

[12] Guidotti, Riccardo, et al. "A survey of methods for explaining black box models." ACM computing surveys (CSUR) 51.5 (2018): 1–42.

[13] Surrogate model, Wikipedia

[14] Scenario discovery in Python, blog post on Water Programming

[15] Teaching Your Model to Learn from Itself, on Towards Data Science

Optimizing Marketing Campaigns with Budgeted Multi-Armed Bandits
With demos, our new solution, and a video

Let’s dive right into a running example. Suppose a bank or a telecom company launches a new product/plan for existing customers. To promote this product, the company creates several call templates (scripts) for its sales representatives. The goal is to effectively convince customers to buy the new product or sign up for the new plan.

Here’s how the campaign works:

  • Call script creation: The marketing team develops multiple versions of the call script, each with a different approach to promoting the new product or plan.
  • Agent calls: Sales agents use these scripts to call a subset of customers. Each customer interaction uses one of the predefined scripts.
  • Data collection: During the calls, the company collects data on customer responses, such as interest shown, questions asked, and, ultimately, conversion rates (i.e., how many customers buy the new product or sign up for the new plan).
  • Real-time analysis: The company analyzes this data in real time to evaluate the effectiveness of each script. This analysis helps determine which scripts are more successful in converting customers to the new plan/product.
  • Strategy update: Based on ongoing analysis, the company dynamically adjusts the frequency of use of each script. Scripts with higher conversion rates are used more often, ensuring that the campaign becomes increasingly effective over time.

Next, we show how to model a simple version of this campaign using the conventional multi-armed bandit problem. As we add more details to make this model more realistic, we demonstrate how existing solutions and their simple adaptations fall short. We then present a new budgeted multi-armed bandit algorithm from our paper accepted for the KDD 2024 conference that performs very well at this task. We also provide links to the code and a short video summarizing the paper.

In this story, "we" is used because I am writing it together with Marco Heyden (Linkedin, Github), the author of the algorithm idea and of our paper [1]. All subsequent plots are created by us and with the code from this Jupyter notebook.

Multi-armed Bandit Solution

Our scenario is similar to a multi-armed bandit (MAB) problem. Imagine a player in a casino facing a slot machine ("bandit") with multiple arms, each with an unknown payoff (reward) distribution. The player’s goal is to maximize their total winnings by deciding which arms to play and how often to play each one. The challenge lies in balancing exploration, trying different arms to gather information about their rewards, and exploitation, using the information gathered to play the arm with the highest known reward.

In our example, each call script acts like a slot machine arm, where the reward is the success of the script. The reward is 1 if the customer signs up for the new plan or buys the new product and 0 otherwise. For instance, three call scripts with conversion rates of 0.1, 0.3, and 0.7 have successes following a Bernoulli distribution with expected values of 0.1, 0.3, and 0.7, respectively. The figure below illustrates the cumulative rewards for different strategies. The purple line represents using script 1 with a conversion rate of 0.1, while the green line represents using script 3 with a conversion rate of 0.7. These lines define the range of probable rewards. The light blue line shows the cumulative reward for a strategy that randomly selects a script for each call. In a realistic environment where only estimates of conversion rates are available, the cumulative reward of a good strategy should be close to the green line and at least above the light blue line.

A popular strategy for the multi-armed bandit problem is the Upper Confidence Bound (UCB) algorithm [2]. It assigns an upper confidence bound to the expected reward of each arm (call script) and plays the arm with the highest upper confidence bound. In this way, the algorithm actively explores actions with high uncertainty while exploiting actions with high known rewards. Mathematically, the UCB algorithm plays the arm i with the largest value of rᵢ(t) + √(2·ln t / Nᵢ(t)), where:

  • rᵢ(t) is the empirical mean reward of arm i up to time t.
  • Nᵢ(t) is the number of times arm i has been played up to time t.
  • t is the total number of plays so far.
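For illustration, here is a minimal simulation of this strategy on the three call scripts, using the standard UCB1 bound from [2]; the horizon and the random seed are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
conversion_rates = [0.1, 0.3, 0.7]           # the three call scripts
n_arms, horizon = len(conversion_rates), 5_000

reward_sum = np.zeros(n_arms)                # sum of observed rewards per arm
pulls = np.zeros(n_arms)                     # N_i(t)

for t in range(1, horizon + 1):
    if t <= n_arms:                          # play every arm once to initialize
        arm = t - 1
    else:
        ucb = reward_sum / pulls + np.sqrt(2 * np.log(t) / pulls)
        arm = int(np.argmax(ucb))
    reward = rng.random() < conversion_rates[arm]   # Bernoulli conversion
    reward_sum[arm] += reward
    pulls[arm] += 1

print("plays per script:", pulls.astype(int))        # script 3 should dominate
```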

The white line in the figure below shows this strategy in action for our example.

This upper bound is based on the Chernoff-Hoeffding bounds assuming a payoff distribution with support in [0,1], which is exactly our case. For reward distributions with a different support [aᵢʳ, bᵢʳ], where aᵢʳ and bᵢʳ are finite, the UCB should be scaled accordingly: the exploration term is multiplied by the width of the support, bᵢʳ − aᵢʳ.

The Importance of the Budget

So far we have focused on maximizing the sum of rewards after a given number of calls. However, it is unrealistic to expect calls corresponding to different scripts to have the same duration. If the marketing team’s capacity does not allow them to reach all customers within a given time budget (e.g., several months), it would be more practical to maximize the cumulative reward for a given call duration rather than the number of calls.

In our example, assume that the call durations (= costs) for scripts 1, 2, and 3 are constant (we will relax this assumption later) and equal 1, 2, and 8 minutes, respectively. If we now plot the results based on the total call duration instead of the number of calls, the strategy that always uses script 2 becomes the best choice, while the strategy that uses script 3 becomes the worst. The above UCB strategy that only considers conversion rates now performs much worse than a random strategy.

It is not difficult to cure the UCB strategy: normalize the reward estimates rᵢ(t) by their cost counterparts cᵢ and scale the confidence bound accordingly, as above.

The UCB strategy updated in this way again works quite well:

Random Call Duration

Observe that the assumption of a fixed call duration is also unrealistic. When both reward and cost are random, there are several ways to extend the UCB strategy to this case:

Reductive: Treat the reward-to-cost ratio vᵢ=rᵢ/cᵢ as a single random variable and pull the arm with the highest upper confidence bound UCBᵢᵛ of it.

United: Ignore the variation in costs and use their estimates cᵢ(t) to scale the UCBᵢʳ of the rewards, as in the cost-normalized bound above.

Composite: Pull the arm that maximizes the ratio UCBᵢʳ/LCBᵢᶜ of the upper reward bound to the lower cost bound.

For the Composite strategy we assume that the rewards come from [0,1], as before, to keep the formula a bit simpler.
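To make the structural differences clearer, here is a sketch of how the three selection rules could be scored. The exploration bonus and the exact form of the bounds are simplifying assumptions on my part and differ from the precise formulas in the original post and in [1].

```python
import numpy as np

def exploration_bonus(pulls: np.ndarray, t: int) -> np.ndarray:
    """Hoeffding-style bonus; the exact constants are an assumption of this sketch."""
    return np.sqrt(2 * np.log(t) / pulls)

def choose_arm(strategy, reward_sum, cost_sum, ratio_sum, pulls, t):
    r_mean = reward_sum / pulls          # empirical mean reward r_i(t)
    c_mean = cost_sum / pulls            # empirical mean cost c_i(t)
    v_mean = ratio_sum / pulls           # empirical mean of the ratio v_i = r_i / c_i
    bonus = exploration_bonus(pulls, t)

    if strategy == "reductive":          # upper bound on the ratio itself
        score = v_mean + bonus
    elif strategy == "united":           # reward UCB scaled by the cost estimate
        score = (r_mean + bonus) / c_mean
    elif strategy == "composite":        # reward UCB over cost LCB;
        score = (r_mean + bonus) / (c_mean - bonus)   # the denominator can be zero or negative
    else:
        raise ValueError(strategy)
    return int(np.argmax(score))
```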

All the above strategies have problems, being either excessively optimistic, too pessimistic, or just optimizing the wrong quantity.

The Reductive strategy seeks to maximize the expectation of the reward-cost ratio 𝔼(vᵢ)=𝔼(rᵢ/cᵢ). This is different from the campaign goal of maximizing reward while keeping costs within a given budget. For sufficiently high budgets, the latter is equivalent to maximizing the ratio of reward and cost expectations 𝔼(rᵢ)/𝔼(cᵢ). To see why 𝔼(rᵢ/cᵢ)≠𝔼(rᵢ)/𝔼(cᵢ), observe that if both rᵢ and cᵢ are i.i.d. Bernoulli distributed variables, then 𝔼(rᵢ)/𝔼(cᵢ)=1, while 𝔼(rᵢ/cᵢ) is infinite or undefined, depending on how you resolve the division by zero. The support of vᵢ=rᵢ/cᵢ is also infinite in this case, making the corresponding upper confidence bound useless.

Because the United strategy does not account for cost variation, it often produces upper confidence bounds that are too tight, limiting the exploration part of the strategy. This happens whenever the empirical mean of the costs exceeds the true mean, e.g. in 50% of the cases with a symmetric cost distribution.

The Composite strategy explicitly models uncertainty in costs. However, this model is too conservative, allowing the denominator to be zero or even negative(!). As a result, the composite strategy spends too many resources on exploration and undervalues the exploitation step.

Note also that the upper bound for the mean reward discussed so far can be well above 1, even though we know that the reward takes values in {0,1}. In such a case, this bound is obviously too loose, which may increase exploration and thus partially offset the effect of considering the costs fixed in the United strategy.

The following plot shows the performance of all three strategies for our example setting, where we now model the call duration with beta distributions with support [1,10] and expected values of 1, 2, and 8 minutes for scripts 1, 2, and 3, respectively. The Reductive strategy performs almost as poorly as selecting a random script, the Composite strategy performs slightly better, and the United strategy emerges as the clear winner. But can one beat the United strategy?

Our Solution: Asymmetric Confidence Intervals

Our ω-UCB strategy addresses the shortcomings of the previously described solutions. While we start with the same ratio of UCBᵢʳ/LCBᵢᶜ as the Composite strategy, we use a different method to compute these bounds. Our confidence intervals are more precise and remain within the support of the rewards or costs. Specifically, let p be a bounded random variable – either a reward or a cost – with support [a,b]. We compute the confidence interval [LCBᵖ, UCBᵖ] for p as follows.

  • μ is the empirical mean of p at time t,
  • σ² is the empirical variance of p at time t,
  • N is the number of observations of p up to time t,
  • t is the current time,
  • c is the number of arm pulls after which one expects reliable estimates of mean and variance; in practice, c=30 works well,
  • ρ is the parameter of ω-UCB; ρ=1 gives better asymptotic properties of ω-UCB, but for practical use, when the number of rounds is ~10⁴ or less, we recommend ρ=1/4.

The following figure shows how good the ω-UCB is. Using it provides almost the maximum possible cumulative reward.

We also created a 2-minute video summarizing the idea of the ω-UCB:

Final Thoughts

By now, you are well equipped with insights on how to optimize a marketing campaign in real time using instant customer feedback. We have described a powerful algorithm to help you do this. However, it would be overly optimistic to assume that this alone will guarantee success. Below, we outline a few additional considerations that can further enhance a campaign’s effectiveness.

First, it is unlikely that the reward will be known immediately. Often, the best you can hope for is an indication of interest from the customer. Therefore, constructing a reliable proxy for the reward, perhaps using data from previous campaigns, is essential.

Next, this discussion has focused on selecting the best script for an average or representative customer. Digging a little deeper, it is likely that different scripts will work better for different groups of customers. The best script may vary by group. A simple approach would be to segment customers and treat each segment-script combination as a separate arm in the Budgeted MAB algorithm we described. In an earlier story, I discussed a method for identifying interesting customer segments; for a campaign, selecting an appropriate target variable for this method will be important.

Find Unusual Segments in Your Data with Subgroup Discovery

Finally, in addition to customer characteristics, "environmental" factors such as time of day or day of week can affect the relative performance of scripts. To account for all of these factors, you might consider extending the methodology to contextual budgeted bandits, which is the subject of a separate story.

References

[1] Marco Heyden, Vadim Arzamasov, Edouard Fouché, and Klemens Böhm. "Budgeted Multi-Armed Bandits with Asymmetric Confidence Intervals." KDD ’24

[2] Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine Learning 47 (2002): 235–256.

An Undeservedly Forgotten Correlation Coefficient
A nonlinear correlation measure for your everyday tasks

Traditional correlation coefficients such as Pearson’s ρ, Spearman’s ρ, or Kendall’s τ are limited to finding linear or monotonic relationships and struggle to identify more complex association structures. The recent article on TDS [1] about a new correlation coefficient ξ that aims to overcome these limitations has received a lot of attention and has been discussed intensively. One of the questions raised in the comments was what particular advantages ξ brings over a nonlinear correlation measure based on mutual information. An experiment may be worth a thousand words in such debates. So in this story, I experimentally compare ξ to the mutual information-based coefficient R along a variety of properties one would like a nonlinear correlation measure to satisfy. Based on the results, I would strongly recommend R over ξ for the majority of routines that require finding nonlinear associations.

Requirements

Let me first summarize and convince you about the desired properties of a coefficient we are looking for. We want an association measure A(x,y) that

  • is nonlinear. That is, it takes the value zero when x and y are independent, and its modulus takes the value one when there is an exact, possibly nonlinear, relationship between the variables, such as x = h(t), y = f(t), where t is a parameter;
  • is symmetric. That is, A(x,y)=A(y,x). The opposite would be confusing;
  • is consistent. That is, it is equal to the linear correlation coefficient ρ when x, y have a bivariate normal distribution, i.e., it is a generalization of ρ to other distributions. This is because ρ is widely used in practice, and many of us have developed a sense of how its values relate to the strength of the relationship. In addition, ρ has a clear meaning for a bivariate standard normal distribution, since it completely defines it;
  • is scalable – one can compute correlations even for datasets with many observations in a reasonable time;
  • is precise, i.e., has a low variance estimator.

The table below summarizes the results of my experiments, where green indicates that the measure has the property tested, red indicates the opposite, and orange is slightly better than red. Let me now walk you through the experiments; you can find their code in this Github repo [2] in the R programming language.

Image created by author

Coefficients of correlation

I use the following coefficient implementations and their configurations

  • For the linear correlation coefficient ρ, I use the standard function cor() from the ‘stats’ package;
  • for ξ, I use the xicor() function from the ‘XICOR’ package [3];
  • mutual information (MI) takes values in the range [0,∞) and there are several ways to estimate it. Therefore, for R one has to choose (a) the MI estimator to use and (b) the transformation to bring MI into the range [0,1].

There are histogram-based and nearest neighbor-based MI estimators. Although many still use histogram-based estimators, I believe that Kraskov’s nearest neighbor estimator [4] is one of the best. I will use its implementation mutinfo() from the ‘FNN’ package [5] with the parameter k=2 as suggested in the paper.

Write in the comments if you want to know more about this particular MI estimator

There are also several ways to normalize the MI to the interval [0,1]. I will use the one below because it has been shown to have a consistency property, and I will demonstrate it in the experiments.
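As a sketch of the computation, I assume below the normalization R = √(1 − exp(−2·MI)), the transformation commonly attributed to [6]; for this Python illustration I also substitute scikit-learn's kNN-based MI estimator for the R-language FNN implementation used in the actual experiments.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mi_coefficient(x: np.ndarray, y: np.ndarray, k: int = 2) -> float:
    """Nonlinear correlation measure in [0, 1] based on a kNN estimate of mutual information."""
    mi = mutual_info_regression(x.reshape(-1, 1), y, n_neighbors=k, random_state=0)[0]
    return float(np.sqrt(1.0 - np.exp(-2.0 * mi)))

# A noiseless nonlinear relationship: x and y lie on a circle
t = np.random.default_rng(0).uniform(0, 2 * np.pi, 10_000)
print(mi_coefficient(np.cos(t), np.sin(t)))  # close to 1
```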

This measure R is called the Mutual Information Coefficient [6]. However, I have noticed a tendency to confuse it with the more recent Maximal Information Coefficient (MIC) [7]. The latter has been shown to be worse than some alternatives [8], and to lack some of the properties it is supposed to have [9].

Nonlinearity

In the figure below, I have calculated all three correlation coefficients for donut-shaped data of 10K points with different donut thicknesses. As expected, the linear correlation coefficient ρ does not capture the existence of a relationship in any of the plots. In contrast, R correctly determines that x and y are related and takes the value of 1 for the data in the right plot, which corresponds to a noiseless relationship between x and y: x = cos(t) and y = sin(t). However, the coefficient ξ is only 0.24 in the latter case. More importantly, in the left plot, ξ is close to zero, even though x and y are not independent.

Image created by author

Symmetry

In the figure below, I calculated these quantities for data sets generated from a different distribution. I obtained ρ(x,y)=ρ(y,x) and R(x,y)=R(y,x), so I report only a single value for these measures. However, ξ(x,y) and ξ(y,x) are very different. This is probably due to the fact that y=f(x), but x is not a function of y. This behavior may not be desirable in reality, since it is not easy to interpret a non-symmetric correlation matrix.

Image created by author

Consistency

In this experiment, I computed all coefficients for data sets resulting from a bivariate standard normal distribution with a given correlation coefficient of 0.4, 0.7, or 1. Both ρ and R are close to the true correlation, while ξ is not, i.e. it does not have the consistency property defined above.

Image created by author

Scalability

To check the performance of the estimators, I generated data sets of different sizes consisting of two independent and uniformly distributed variables. The figure below shows the time in milliseconds required to compute each coefficient. When the dataset consists of 50K points, R is about 1000 times slower than ξ and about 10000 times slower than ρ. However, it still takes ~10 seconds to compute, which is reasonable when computing a moderate number of correlations. Given the advantages of R discussed above, I’d suggest using it even for computing large numbers of correlations – just subsample your data randomly to ~10K points, where computing R takes less than a second.

Image created by author

Precision

For different samples from the same distribution, there will be different estimates of the correlation coefficient. If there is an association between x and y, we want the variance of these estimates to be small compared to the mean of the correlation. For a measure A(x,y) one can compute precision=sd(A)/mean(A), where sd is a standard deviation. Lower values of this quantity are better. The following table contains precision values calculated from a bivariate normal distribution on data sets of different sizes with different values of the correlation between dimensions. ξ is the least precise, while ρ is the most precise.

Image created by author

References

[1] A New Coefficient of Correlation

[2] My experiments on Github

[3] XICOR package for R

[4] Kraskov, A., Stögbauer, H., & Grassberger, P. (2004). Estimating mutual information. Physical review E, 69(6), 066138.

[5] FNN package for R

[6] Granger, C., & Lin, J. L. (1994). Using the mutual information coefficient to identify lags in nonlinear models. Journal of time series analysis, 15(4), 371–384.

[7] Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., … & Sabeti, P. C. (2011). Detecting novel associations in large data sets. science, 334(6062), 1518–1524.

[8] Simon, N., & Tibshirani, R. (2014). Comment on "Detecting Novel Associations in Large Data Sets" by Reshef et al., Science Dec 16, 2011. arXiv preprint arXiv:1401.7645.

[9] Kinney, J. B., & Atwal, G. S. (2014). Equitability, mutual information, and the maximal information coefficient. Proceedings of the National Academy of Sciences, 111(9), 3354–3359.

A Benchmark and Taxonomy of Categorical Encoders
New. Comprehensive. Extendable.

Image created by author with recraft.ai

A large share of datasets contain categorical features. For example, out of 665 datasets on the UC Irvine Machine Learning Repository [1], 42 are fully categorical and 366 are reported as mixed. However, distance-based ML models and almost all scikit-learn implementations require features in a numerical format. Categorical encoders replace the categories in such features with real numbers.

A variety of categorical encoders exist, but there have been few attempts to compare them on many datasets, with various ML models, and in different pipelines. This article is about one of the latest benchmarks of encoders from our recent publication [2] (poster, code on GitHub). In this story, I focus on the content that complements the publication and is of practical importance. In particular, beyond the summary of our benchmark results, I will:

  • Provide a list of 55 categorical encoders and the links to find their explanations and implementations for most of them.
  • Explain that you can also use our code as a supplement to the Category Encoders Python module (https://contrib.scikit-learn.org/category_encoders/) for the encoders not yet implemented there.
  • Categorize the encoders into families so that you do not have to remember each individual encoder, but instead have an idea of how to build a member of each family.
  • Explain how you can reuse the code from [2] and detailed benchmark data to include your encoder, dataset, or ML models in the comparison without having to re-run the existing experiments. Depending on the scope of your experiments and your computational resources, this can save you weeks of computation.

Why another benchmark?

There are already several scientific studies comparing categorical encoders [3–12] and at least one categorical encoder benchmark on TDS [13]. The study [2] that I will present here differs mainly in scope: We compared representative encoders in different configurations from a variety of encoder families. We experimented with 5 ML models (decision tree, kNN, SVM, logistic regression, LGBM), 4 quality metrics (AUC, accuracy, balanced accuracy, F1-score), 3 tuning strategies (which I will describe shortly), 50 datasets, and 32 encoder configurations.

The following table shows the encoders covered in our benchmark (the dark green column) and in other experimental comparisons (the light green columns). The blue columns show the encoders described in two additional sources: in the article dedicated to contrast encoders [15] and on Medium [14]:

Categorical Encoding: Key Insights

The last yellow column shows the encoders covered by the Category Encoders module [16]. Note that the code from [2] implements some encoders – from the similarity, binning, and data constraining families – that are not part of the Category Encoders module. In addition, we have found that the interface to the GLMM encoder implemented in R and used in [2] is much faster than the GLMM encoder from Category Encoders. Therefore, you may find our code useful for these implementations.

Table 1. Encoder families and their coverage by various resources. Author owns copyright

Table 1 already contains a number of encoders, and the list is by no means exhaustive. To navigate the encoder landscape, it is therefore useful to classify encoders in order to understand the principles of encoder design, rather than to memorize a large number of individual encoders.

Families of encoders

In the following, we consider a categorical feature of length n with cardinality k. At the top level, categorical encoders are supervised or unsupervised.

1. Unsupervised encoders do not include the target variable in the encoding process.

1.1. Identifier encoders transform categorical variables using an injective function, i.e., they assign each category a unique numeric value or a unique combination of numeric values. They create from 1 up to k new features. For example, the One-Hot encoder creates k features, label or ordinal encoders create a single new feature, and Base N encoders create ⌈log(k)⌉ new features, where the logarithm is of base N.

These encoders are useful for categorical variables that denote unique identifiers, such as product codes or zip codes. Typically, with the exception of ordinal encoder, identifier encoders do not assume that any inherent order in the categories conveys meaningful information for analysis, and thus ignore any such order during the encoding process.
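For instance, with the Category Encoders module the identifier encoders mentioned above can be applied in a few lines (the toy column below is made up):

```python
import pandas as pd
import category_encoders as ce

df = pd.DataFrame({"country": ["DE", "FR", "ES", "FR", "DE"]})

print(ce.OneHotEncoder(cols=["country"]).fit_transform(df))         # k new features
print(ce.OrdinalEncoder(cols=["country"]).fit_transform(df))        # 1 new feature
print(ce.BaseNEncoder(cols=["country"], base=2).fit_transform(df))  # about log2(k) new features
```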

1.2. Contrast encoders transform categorical variables by assigning numerical values based on comparisons between categories. Typically, a set of k-1 new features represents a categorical variable with k categories. Unlike identifier encoders, these encoders are specifically designed to explore the relationships between different levels of a categorical variable in a regression analysis.

To create a contrast encoder, one has a choice of different coding schemes [15]. Each contrasts a category with other categories in a particular way to test a related hypothesis about the data. For example, Helmert coding contrasts each level with the mean of subsequent levels, while sum coding contrasts each level with the overall mean.

1.3. Frequency encoders replace the categories of a categorical variable with corresponding values that are a function of the frequency of those categories in the data set. Unlike identifier or contrast encoders, frequency encoders create a single new feature and are not necessarily injective functions. They provide a numerical representation of occurrence rates, which can be particularly useful when an ML model benefits from understanding the commonality of category values. All three frequency encoders in Table 1 are monotonically increasing functions of frequency. However, this is not a necessary condition for defining this group.

1.4. Similarity encoders [5, 8, 18] transform categorical data into numerical form by applying similarity or distance metrics that capture similarities between different categories.

One group of similarity encoders [8, 18] is based on a morphological comparison between two categories treated as strings. Examples of similarity metrics are Levenshtein’s ratio, Jaro-Winkler similarity, or N-gram similarity. The categorical variable is then encoded as a vector, where each dimension corresponds to a pairwise comparison of a reference category with all categories, and the value represents the computed similarity score (similar to constructing a variance-covariance matrix). Encoders of this group typically create k new features. This encoding is particularly useful for handling "dirty" categorical datasets that may contain typos and redundancies [18]. One can think of One-Hot encoding as a special case of similarity encoding, where the similarity measure can take only two values, 0 or 1.

Another group of similarity encoders [5], including, e.g. Embeddings, Min-Hash, or Gamma-Poisson matrix factorization, is designed for high cardinality categories (k>>1). They project categorical features into a lower dimensional space. This group is thus similar to binning encoders, which are also particularly useful for large k but do not aim to preserve the morphological similarity of categories.

2. Supervised encoders use information about the target variable. For regression or binary classification tasks, they typically create a single new feature for each categorical one.

2.1. Simple target encoders capture the relationship between the categorical characteristic and the target. Constructing a simple target encoder involves calculating a statistic based on the target variable for each level of the categorical variable. Common statistics include the mean, median, or probability ratios of the target variable conditional on each category.

The procedure is as follows:

  • For each category level, group the data by the categorical variable.
  • Within each group, compute the desired statistic of interest.
  • For each instance in the data, replace the original categorical value with the corresponding statistic.
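A minimal pandas sketch of these three steps, using the mean as the statistic and a small made-up table:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["DE", "DE", "FR", "FR", "ES"],
    "churn":   [1,    0,    1,    1,    0],
})

# Steps 1-2: group by the category and compute the statistic of interest (here, the mean)
category_means = df.groupby("country")["churn"].mean()

# Step 3: replace each categorical value with the corresponding statistic
df["country_encoded"] = df["country"].map(category_means)
print(df)
```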

Simple target encoding runs the risk of overfitting, especially for small data sets and categories with very few observations. Techniques such as smoothing (regularization) and cross-validation, which we describe shortly, can help mitigate this risk.

2.2. Smoothing encoders are generalizations of simple target encoders that introduce a smoothing parameter. The purpose of this smoothing parameter is to prevent overfitting and to improve the generalization of the encoder to new data, especially when there are categories with a small number of observations. A common formula for the smoothed value is

encoded value = (m · mean(target | category) + s · mean(target)) / (m + s),

where m is the number of times the category occurs in the data set and s is the smoothing parameter. The exact formula may vary slightly, see [13]. By adjusting the smoothing parameter, you can control the balance between the category statistic and the overall statistic. A larger smoothing parameter makes the encoding less sensitive to the category-specific target statistic. Setting the smoothing parameter to zero in the above formula results in a simple target encoding.
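Continuing the made-up table from the sketch above, a smoothed version that mirrors this formula might look as follows (s denotes the smoothing parameter):

```python
import pandas as pd

def smoothed_target_encode(df: pd.DataFrame, cat: str, target: str, s: float) -> pd.Series:
    """Blend the per-category mean with the global mean, weighted by the category size m."""
    global_mean = df[target].mean()
    stats = df.groupby(cat)[target].agg(["mean", "count"])   # per-category mean and m
    smoothed = (stats["count"] * stats["mean"] + s * global_mean) / (stats["count"] + s)
    return df[cat].map(smoothed)

df = pd.DataFrame({"country": ["DE", "DE", "FR", "FR", "ES"], "churn": [1, 0, 1, 1, 0]})
print(smoothed_target_encode(df, "country", "churn", s=10.0))  # s=0 recovers the simple target encoding
```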

2.3. Data-constraining encoders use information from a subset of rows in a data set. An example of a data-constraining encoder is the leave-one-out encoder, which, for a given categorical value, calculates the statistic of the target variable over all other instances in which the category occurs, excluding the current instance. This approach ensures that the encoded value for a given data point does not include its own target value.

Data-constraining strategies help create encodings that are more representative of the true relationships in the unseen data and less prone to overfitting. The number of unique values in the encoded column may exceed the cardinality of the original categorical column k. One can also introduce smoothing into data-constraining encodings.

3. Binning encoders can be both supervised and unsupervised. Often, their work can be divided into two steps. The first step creates an auxiliary categorical feature by sorting the original categories into bins. The second step applies an encoder from one of the other groups discussed above to this auxiliary feature. Both steps can be either unsupervised or supervised independently of each other. For example, in the first step, you can form the bins by grouping rare categories together (unsupervised, e.g., One-Hot MC encoder); alternatively, you can apply a simple target encoder and group the categories with close encoded values together (supervised, e.g., Discretized Target encoder).

Some binning encoders, e.g. Hashing, do not have this two-stage structure. The number of new features created by binning encoders is usually <k.

Tuning strategies

Before proceeding with the results, I will briefly summarize the tuning strategies from our benchmark. This is important to understand the scope of the following figures and to reuse the data and code from our benchmark. The three tuning strategies are

No tuning:

  • Encode the categorical columns of the training data;
  • Train an ML model with its default hyperparameters.

Model tuning:

  • Encode the categorical columns of the training data;
  • Divide the training data into folds;
  • Optimize the hyperparameters of an ML model using cross-validation.

Full tuning:

  • Split training data into folds;
  • Encode each training fold separately (in case of data constraining encoder family, each fold is further split into the nested folds);
  • Optimize the hyperparameters of an ML model with cross-validation.
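One way to implement the full tuning strategy with scikit-learn is to place the encoder inside a Pipeline, so that it is refit on every training fold during the hyperparameter search. The encoder, model, and grid below are illustrative choices rather than the benchmark configuration from [2]:

```python
import category_encoders as ce
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("encoder", ce.TargetEncoder()),              # refit on every training fold
    ("model", LogisticRegression(max_iter=1000)),
])

search = GridSearchCV(
    pipe,
    param_grid={"model__C": [0.01, 0.1, 1, 10]},
    cv=5,
    scoring="roc_auc",
)
# search.fit(X_train, y_train)  # X_train holds the raw categorical columns
```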

Results

I will name the winning encoders from our experiments, point out which tuning strategy performed best and should be part of your ML pipeline, and present the runtimes for various encoders.

Ranking

Figure 1. Performance of encoders in all experiments. Author owns copyright

In Figure 1, the boxplots show the distribution of encoder ranks across all datasets, quality metrics, and tuning strategies. That is, each boxplot includes ~50×4×3=600 experiments for individual models and ~50×4×3×5=3000 experiments for all models, excluding the small fraction of experiments that timed out or failed for other reasons. We observed that four encoders, namely One-Hot, Binary (‘Bin’ on the plot), Sum, and Weight of Evidence, are consistently among the best performing. For logistic regression, the difference of these four from the rest is statistically significant. This result is somewhat surprising since many previous studies (see [13, 14]) have reported drawbacks of unsupervised encoders, especially One-Hot. The exact meanings of all other encoder abbreviations in Figure 1 are not important for the rest, but you can find them in [2].

Tuning strategy

Figure 2. Performance gain of full tuning over no tuning. Author owns copyright
Figure 3. Performance gain of full tuning over model tuning. Author owns copyright

Figure 2 plots the performance difference between full and no tuning strategies (as I described above) for each categorical encoder. Each covers the experiments with different datasets, ML models, and performance metrics. Figure 3 is a similar plot comparing full and model tuning strategies. Note that these plots do not include some of the ML models and datasets because we limited the computational complexity. See [2] for details and more plots.

Based on these results, I would generally recommend sticking to the full tuning strategy, i.e., encoding data for each fold separately when optimizing the hyperparameters of an ML model; this is especially important for data constraining encoders.

Runtime

Figure 4. Runtime of encoders. Author owns copyright

Finally, Figure 4 plots the times required to encode data sets on a logarithmic scale. Simple target and binning encoders are the fastest, while smoothing and data constraining encoders are the slowest. The four best performing encoders – One-Hot, Binary (‘Bin’ on the plot), Sum, and Weight of Evidence – take a reasonable amount of time to process data.

Reusing the code and the results

A nice feature of our benchmark is that you can easily extend it to test your approach to categorical encoding and put it in the context of existing experimental results. For example, you may want to test the encoders from Table 1 that are not part of our benchmark, or you may want to test your custom composite encoder like "if n/k > a then One-Hot MC else One-Hot", where k is the feature cardinality and n is the size of the dataset, as before. We explain the process on GitHub and share the demonstration of how to do this in the kaggle notebook [17]. Below is a short summary of [17].

For illustrative purposes, we limited the example to a logistic regression model without hyperparameter tuning. The corresponding results of our benchmark are shown in Figure 5 below.

Figure 5. Encoder ranks for logistic regression with no tuning strategy. Author owns copyright

Now suppose that Yury Kashnitsky, who kindly provided feedback on our paper, wonders whether the hashing encoder (https://www.kaggle.com/code/kashnitsky/vowpal-wabbit-tutorial-blazingly-fast-learning) is competitive with the encoders we evaluated. In [17], we show how to perform the missing experiments and find that the hashing encoder performs reasonably well with our choice of its hyperparameter, as Figure 6 shows.

Figure 6. Encoder ranks with hashing trick encoder added. Author owns copyright

Conclusion

I summarized our benchmark of categorical encoders and explained how you can benefit from our shared artifacts. You may have learned:

  • Categorical encoders, as ML models, can be supervised or unsupervised.
  • I presented eight families of encoders. Some encoders preserve the cardinality of categorical features, others can reduce it (often the case for binning, but also for some other encoders, e.g., Naive Target) or increase it (data constraining).
  • One-Hot, Binary, Sum, and Weight of Evidence encoders performed best on average in our experiments, especially with logistic regression.
  • See the kaggle notebook [17] for a demo of how to add the desired encoders to the benchmark and plot the results; the necessary code is available on GitHub.

Acknowledgements

This story would not exist if Federico Matteucci, my colleague and first author in [2], had not written the code for the benchmark and for the kaggle notebook [17].

Resources

[1] UC Irvine Machine Learning Repository

[2] Matteucci, F., Arzamasov, V., & Böhm, K. (2024). A benchmark of categorical encoders for binary classification. Advances in Neural Information Processing Systems, 36. (paper, code on GitHub, poster)

[3] Seca, D., & Mendes-Moreira, J. (2021). Benchmark of encoders of nominal features for regression. In World Conference on Information Systems and Technologies (pp. 146–155). Cham: Springer International Publishing.

[4] Pargent, F., Pfisterer, F., Thomas, J., & Bischl, B. (2022). Regularized target encoding outperforms traditional methods in supervised machine learning with high cardinality features. Computational Statistics, 37(5), 2671–2692.

[5] Cerda, P., & Varoquaux, G. (2020). Encoding high-cardinality string categorical variables. IEEE Transactions on Knowledge and Data Engineering, 34(3), 1164–1176.

[6] Potdar, K., Pardawala, T. S., & Pai, C. D. (2017). A comparative study of categorical variable encoding techniques for neural network classifiers. International journal of computer applications, 175(4), 7–9.

[7] Dahouda, M. K., & Joe, I. (2021). A deep-learned embedding technique for categorical features encoding. IEEE Access, 9, 114381–114391.

[8] Cerda, P., Varoquaux, G., & Kégl, B. (2018). Similarity encoding for learning with dirty categorical variables. Machine Learning, 107(8), 1477–1494.

[9] Wright, M. N., & König, I. R. (2019). Splitting on categorical predictors in random forests. PeerJ, 7, e6339.

[10] Gnat, S. (2021). Impact of categorical variables encoding on property mass valuation. Procedia Computer Science, 192, 3542–3550.

[11] Johnson, J. M., & Khoshgoftaar, T. M. (2021, August). Encoding techniques for high-cardinality features and ensemble learners. In 2021 IEEE 22nd international conference on information reuse and integration for Data Science (IRI) (pp. 355–361). IEEE.

[12] Valdez-Valenzuela, E., Kuri-Morales, A., & Gomez-Adorno, H. (2021). Measuring the effect of categorical encoders in machine learning tasks using synthetic data. In Advances in Computational Intelligence: 20th Mexican International Conference on Artificial Intelligence, MICAI 2021, Mexico City, Mexico, October 25–30, 2021, Proceedings, Part I 20 (pp. 92–107). Springer International Publishing.

[13] Benchmarking Categorical Encoders (a story on TDS)

[14] Categorical Encoding: Key Insights (my story on Medium)

[15] R Library Contrast Coding Systems for categorical variables

[16] Category Encoders (python module)

[17] Add a custom encoder to the benchmark (kaggle notebook)

[18] Similarity Encoding for Dirty Categories Using dirty_cat (a story on TDS)

Find Unusual Segments in Your Data with Subgroup Discovery
Patient rule induction method finds 35% better segments than previously reported

Image created by author with recraft.ai

Inspired by an in-depth Medium article [1] with a case study on identifying bank customer segments with high churn reduction potential, this story explores a similar challenge through the lens of subgroup discovery methods [2]. Intrigued by the parallels, I applied a subgroup discovery approach to the same dataset and uncovered a segment with a 35% higher churn reduction potential – a significant improvement over what was previously reported. This story will take you through each step of the process, including building the methodology from the ground up. At the end of this journey, you’ll gain:

  • A clear understanding of the Patient Rule Induction Method (PRIM), a mature yet powerful subgroup discovery technique.
  • The skills to apply PRIM to your datasets and tailor it to your specific needs.

The complete code for PRIM and the experiment is on GitHub [3].

Patient Rule Induction Method

For the experiment, I’ve chosen my favorite subgroup discovery method: PRIM [4]. Despite its long presence in the field, PRIM has a unique mix of properties that make it very versatile:

  • Numerical data handling: PRIM easily handles numerical data without the need for binning. Unlike typical methods that discretize variables (e.g., categorizing age into predefined groups such as 45–54 years), PRIM overcomes this limitation. For example, it can identify more nuanced criteria such as age > 37.
  • Intelligent categorical data processing: PRIM can discover complex segments within categorical data. It can go beyond simple classifications such as country = Germany to more complex definitions such as country not in {France}.
  • Simplicity: While traditional subgroup discovery methods are often burdened with multiple parameters, PRIM is refreshingly simple. It relies primarily on a single, unambiguous peeling parameter: the proportion of points removed from a candidate segment in each iteration.
  • Efficiency: Being a heuristic approach, PRIM is remarkably fast. Despite its large search space, segment identification is typically resolved in milliseconds.
  • Interactivity and control: PRIM enables interactive analysis. Users can balance segment size against potential impact by examining a series of "nested" segments and selecting the most appropriate one. It also supports incremental segment discovery by removing already segmented data.
  • Flexibility: The flexibility of the method extends to the optimization function it is designed to enhance. This function isn’t limited to a single variable. For example, PRIM can identify segments where the correlation between two variables is significantly different from their correlation in the entire data set.

In summary, PRIM’s straightforward logic not only makes it easy to implement, but also allows for customization.

PRIM algorithm

PRIM works through two distinct phases: peeling and pasting. Peeling starts from a segment encompassing the entire dataset and gradually shrinks it while optimizing its quality. Pasting works similarly, but in the opposite direction – it tries to expand the selected candidate segment without quality loss. In our previous experiments [5], we observed that the pasting phase typically contributes minimally to the output quality. Therefore, I will focus on the peeling phase. The underlying logic of the peeling phase is as follows:

1. Initialize:
   - Set the peeling parameter (usually 0.05)
   - Set the initial box (segment) to encompass the entire data space.
   - Define the target quality function (e.g., a potential churn reduction).

2. While the stopping criterion is not met:
   - For each dimension of the data space:
     * Identify a candidate peel: a small portion of the data
       (defined by the peeling parameter) whose removal maximizes
       the quality of the remaining data.
   - Update the box by removing the best candidate peel
     (across all dimensions) from the current box.
   - Update the dataset by removing the data points that fall outside
     the new box.

3. End when the stopping criterion is met 
   (e.g., after a certain number of iterations 
   or minimum number of data points remaining).

4. Return the final box and all the preceding boxes as candidate segments.

In this pseudo-code:

  • box refers to the current segment of the data.
  • The target quality function is typically some statistic of the response variable (mean, median, etc.) that we want to maximize or minimize.
  • The peeling parameter determines the proportion of data points to be removed in each iteration. It is usually set to a small value, such as 0.05, hence the word "patient" in the method’s name.
  • The stopping criterion ensures that enough data points remain for analysis.
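
To make this concrete, here is a minimal Python sketch of the peeling loop. It is not the implementation from my repository [3]: the function name peel, its parameters (alpha for the peeling parameter, min_support for the stopping criterion), and the choice of the in-box mean of the target as the quality function are illustrative assumptions.

```python
import pandas as pd

def peel(df: pd.DataFrame, target: str, alpha: float = 0.05, min_support: int = 200):
    """Sketch of PRIM's peeling phase: maximize the mean of `target` within the box."""
    data = df.copy()
    box = {}                                    # current segment description: {column: [rules]}
    candidates = [({}, data[target].mean(), len(data))]

    while len(data) > min_support:              # stopping criterion: enough points must remain
        best = None                             # (quality, column, rule, rows_to_keep)
        for col in data.columns.drop(target):
            if pd.api.types.is_numeric_dtype(data[col]) and data[col].nunique() > 2:
                # candidate peels: cut off the lower or the upper alpha-fraction of the range
                low, high = data[col].quantile([alpha, 1 - alpha])
                options = [(data[col] >= low, ('>=', low)),
                           (data[col] <= high, ('<=', high))]
            else:
                # categorical / binary: candidate peels drop one category at a time
                options = [(data[col] != val, ('!=', val)) for val in data[col].unique()]
            for keep, rule in options:
                if keep.all() or not keep.any():          # skip peels that remove nothing or everything
                    continue
                quality = data.loc[keep, target].mean()   # quality of the remaining data
                if best is None or quality > best[0]:
                    best = (quality, col, rule, keep)
        if best is None:
            break
        quality, col, rule, keep = best
        box.setdefault(col, []).append(rule)    # tighten the box along the chosen dimension
        data = data[keep]                       # drop the peeled points
        candidates.append(({k: list(v) for k, v in box.items()}, quality, len(data)))
    return candidates
```

Each returned candidate is a triple of the box description, its quality, and the number of points it contains; choosing among these nested boxes is exactly the interactive step mentioned earlier.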

Consider simple examples of how PRIM handles numeric and categorical variables:

Numeric variables: Imagine you have a numeric variable such as age. In each step of the peeling phase, PRIM looks at the range of that variable (say, age from 18 to 80). PRIM then "peels off" a portion of that range from either end, as defined by the peeling parameter. For example, it might remove ages 75 to 80 because doing so improves the target quality function in the remaining data (e.g., increasing the churn reduction potential). The animation below shows PRIM finding an interesting segment (with a high proportion of orange squares) in a 2D numeric dataset.

PRIM at work on a 2D numerical dataset. Image by the author

Categorical nominal variables: Now consider a categorical nominal variable such as country, with categories such as Germany, France, and Spain. In the peeling phase, PRIM evaluates each category based on how much its removal improves the target quality function and peels off the least promising one. For example, if removing "Germany" results in a subset where the target quality function is improved (such as a higher potential churn reduction), then all data points with "Germany" are "peeled". Note that the peeling parameter has no effect on the processing of categorical data, which can cause undesired effects in some cases, as I will discuss in the section "Better segments via enforced 'patience'", together with a simple remedy.

Categorical ordinal variables: For ordinal variables, disjoint intervals in segment descriptions can sometimes be less intuitive. Consider an education variable with levels such as primary, secondary, vocational, bachelor, and graduate. A rule like education in {primary, bachelor} may not fit the ordinal nature of the data well because it combines non-adjacent categories. If you prefer a more coherent segmentation that respects the natural order of the variable, such as education > secondary, an ordinal encoding is a useful workaround. For more background on categorical encoding, you may find my earlier post [6] helpful, as it points you to the relevant resources.
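
A minimal sketch of that workaround; the data here is hypothetical, and pd.Categorical with an explicit level order does the encoding:

```python
import pandas as pd

levels = ['primary', 'secondary', 'vocational', 'bachelor', 'graduate']
df = pd.DataFrame({'education': ['secondary', 'graduate', 'primary', 'bachelor']})

# Map education to integer codes 0..4 that respect the given order, so PRIM can peel
# it like a numeric variable; a rule like education_ord > 1 then means "above secondary".
df['education_ord'] = pd.Categorical(df['education'],
                                     categories=levels, ordered=True).codes
```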

Experiment: Churn for bank customers

Now everything is ready to start the experiment. Following the Medium article on identifying unique data segments [1], I will apply the PRIM method to the Churn for Bank Customers [7] dataset from Kaggle, available under the CC0: Public Domain license. I will also adopt the target quality function, churn_est_reduction, from that article:
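
A minimal sketch of how this quality function might be computed, i.e., the segment's churn-rate lift multiplied by the segment's size; the churn-indicator column name exited is my assumption about the prepared data, and [1] gives the exact definition:

```python
import pandas as pd

def churn_est_reduction(segment: pd.DataFrame, full_data: pd.DataFrame,
                        churn_col: str = 'exited') -> float:
    """Estimated churn reduction: segment size times the excess churn rate in the segment."""
    baseline = full_data[churn_col].mean()        # average churn rate over the whole dataset
    lift = segment[churn_col].mean() - baseline   # how much higher churn is inside the segment
    return len(segment) * lift
```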

That is, I will look for segments with many customers whose churn rate is much higher than the baseline, i.e., the average churn rate in the entire dataset. I then run PRIM, which gives me a set of nested candidate segments, and plot churn_est_reduction against the number of clients in each segment.

Candidate segments found by PRIM: churn_est_reduction vs. number of clients. Image by the author

The highest quality, churn_est_reduction = 457, is achieved for the 11th candidate segment, described by num_of_products < 2, is_active_member < 1, age > 37. This is quite an improvement over the previously reported maximum churn_est_reduction = 410 in [1]. Comparing the segment descriptions, I suspect that the main reason for this improvement is PRIM’s ability to handle numeric variables.

Better segments via enforced ‘patience’

Something suspicious is going on in the previous plot. By its nature, PRIM is expected to be "patient", i.e., to reduce the segment size only a little at each iteration. However, the second candidate segment is only half the size of the first one – PRIM has cut off half the data at once. The reason is the low cardinality of some features, which is often the case with categorical or indicator variables. For example, is_active_member only takes the values 0 or 1. PRIM can only cut off large chunks of data for such variables, giving them an unfair advantage.

To address this issue, I’ve added a parameter called patience that gives more weight to smaller cuts. Specifically, for the task at hand, I prioritize candidate cuts by multiplying the churn-rate reduction by the segment size raised to the power of patience, which helps fine-tune the selection of segments based on their size.
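
One way to read this weighting is sketched below; the function name peel_priority and the column name exited are illustrative, and the exact scoring in my code [3] may differ in detail:

```python
import pandas as pd

def peel_priority(box_data: pd.DataFrame, full_data: pd.DataFrame,
                  patience: float = 2.0, churn_col: str = 'exited') -> float:
    """Score for ranking candidate peels: churn-rate lift of the remaining box,
    weighted by its size raised to the power of `patience`. Larger values of
    `patience` penalize big cuts, restoring 'patient' behavior for low-cardinality
    features such as is_active_member."""
    lift = box_data[churn_col].mean() - full_data[churn_col].mean()
    return lift * len(box_data) ** patience
```

Applying PRIM with patience = 2 to the data yields the following candidate segments: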

Candidate segments found by PRIM with patience = 2. Image by the author

Now the best candidate segment is num_of_products < 2, 37 < age < 64 with churn_est_reduction = 548, much better than any previous result!

Finding multiple segments

Let us say we have selected the newly discovered segment and asked one of two responsible teams to focus on it. Can PRIM find a job for the other team, i.e., another group of clients outside the first segment with a high potential churn rate reduction? Yes, it can, with the so-called "covering" approach [4]: one simply drops the clients belonging to the previously selected segment(s) from the dataset and applies PRIM once again. So I removed the data with num_of_products < 2, 37 < age < 64 and applied PRIM to the rest.
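
In code, the covering step is just a filter before rerunning PRIM. A sketch follows; df and the snake_case column names are assumptions about the prepared dataset, and peel refers to the earlier sketch:

```python
# Drop the clients covered by the segment selected above ...
first_segment = (df['num_of_products'] < 2) & (df['age'] > 37) & (df['age'] < 64)
remaining = df[~first_segment]

# ... and run PRIM once more on what is left.
new_candidates = peel(remaining, target='exited')
```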

Candidate segments found after removing the first segment (covering approach). Image by the author

Here the best candidate segment is gender != 'Male', num_of_products > 2, balance > 0.0, with churn_est_reduction = 93.

Summary

To wrap things up, I illustrated PRIM’s strong performance on a customer churn dataset for the task of finding unusual segments. Points to note:

  • PRIM has identified highly insightful segments with 35% higher quality than previously reported.
  • I shared the code [3] for practical application and further experimentation. It is very concise and, unlike other existing implementations [8–9], makes it easy to replace the target quality function with one tailored to a specific need.
  • I endorse PRIM for its robust features, such as effective handling of both numeric and categorical data, flexible segment definition, and fast execution, and recommend it for similar analytical challenges.

References

[1] Figuring out the most unusual segments in data
[2] Atzmueller, Martin. "Subgroup discovery." Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 5.1 (2015): 35–49.
[3] My code for PRIM and the experiment
[4] Friedman, Jerome H., and Nicholas I. Fisher. "Bump hunting in high-dimensional data." Statistics and Computing 9.2 (1999): 123–143.
[5] Arzamasov, Vadim, and Klemens Böhm. "REDS: rule extraction for discovering scenarios." Proceedings of the 2021 International Conference on Management of Data. 2021.
[6] Categorical Encoding: Key Insights
[7] Churn for Bank Customers dataset
[8] Patient Rule Induction Method for Python
[9] Patient Rule Induction Method for R

The post Find Unusual Segments in Your Data with Subgroup Discovery appeared first on Towards Data Science.
