Barrett Studdard, Author at Towards Data Science – The world’s leading publication for data science, AI, and ML professionals.
https://towardsdatascience.com/author/barrettstuddard/

Speed up Linear Regression with Matrix Math (24 Nov 2021)
https://towardsdatascience.com/speed-up-linear-regression-with-matrix-math-fe5ff7f2b53b/

Photo by Jan Huber on Unsplash

Linear Regression is an extremely popular and useful model. It’s used by Excel Gurus and Data Scientists alike – but how can we fit lots of regression models quickly? This article walks through various ways to fit a linear regression model and how to speed things up with some Linear Algebra.

In this article, I’ll walk through a few different approaches for ordinary least squares linear regression.

  1. Fitting a model using the statsmodels library
  2. Fitting a Simple Linear Regression using just Numpy
  3. Using the Matrix Formulation
  4. How to fit multiple models with one pass over grouped data
  5. Comparing performance of looping vs. matrix approaches

Data Overview

The data utilized in this article is generated sample data. Using numpy and introducing some random noise, the dataset below has been created to demonstrate OLS fitting techniques. This data will be stored in x and y variables for the next few examples.

Data Overview, Image by author

Statsmodels OLS

One great approach for fitting a linear model is utilizing the statsmodels library. We simply pass in our two variables, call the fit method, and can retrieve results via a call to summary(). Note – these parameters can also be accessed directly using fit.params.
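The original snippet is not reproduced in this excerpt; a minimal sketch of that workflow (assuming the generated data lives in 1-D numpy arrays x and y) might look like:

import numpy as np
import statsmodels.api as sm

# Add a constant column so the model fits an intercept
X_with_const = sm.add_constant(x)

# Fit ordinary least squares and view the results
fit = sm.OLS(y, X_with_const).fit()
print(fit.summary())
print(fit.params)  # intercept and slope can also be accessed directly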

Statsmodels OLS Linear Regression Results, Image by author

The above results show a number of details, but for this walkthrough we’ll focus on the coefficients. You’ll see under the coef column that the y intercept (labeled const) is ~211.62 and our one x regressor coefficient is 101.65. This fits in well with expectations given the parameters and randomness introduced to the initial dataset.

Numpy Arithmetic Solution for Simple Linear Regression

If we are dealing with just one regressor/independent variable (similar to this example), then it is pretty straightforward to solve for the slope and intercept directly.

Simple Linear Regression – Slope (Coefficient) Formula, Image by author

Looking at the above formula, the slope is the sum of (x less its mean) times (y less its mean), divided by the sum of (x less its mean) squared. The intercept can then be found using this slope and the means of x and y as a second step.

This coefficient formula is straightforward to carry out directly in numpy with a couple lines of code. The dot function in numpy will take care of our sumproducts and we can call .mean() directly on the vectors to find the respective averages.

Finally, we check that the results of our numpy solution exactly match the statsmodels version using the allclose function.
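A sketch of that arithmetic, reusing the x and y arrays and the fit object from the statsmodels example above:

import numpy as np

# Slope: sum of (x - x_mean)(y - y_mean) divided by sum of (x - x_mean)^2
x_demeaned = x - x.mean()
y_demeaned = y - y.mean()
slope = np.dot(x_demeaned, y_demeaned) / np.dot(x_demeaned, x_demeaned)

# Intercept follows from the slope and the two means
intercept = y.mean() - slope * x.mean()

# Compare against the statsmodels fit
print('Simple linear regression arithmetic slope of {:.2f} vs. statsmodels ols fit slope of {:.2f}.'.format(slope, fit.params[1]))
print('Results equal:', np.allclose(slope, fit.params[1]))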

Output:
Simple linear regression arithmetic slope of 101.65 vs. statsmodels ols fit slope of 101.65. 
Results equal: True

Matrix Formula

The matrix formula extends OLS linear regression one step further – allowing us to derive the intercept and slope from X and y directly, even for multiple regressors. This formula is as follows; for a detailed derivation, check out this writeup from the Economic Theory Blog.

OLS Matrix Formula, Image by author

The numpy code below mirrors the formula quite directly. You could also use dot products here, but matmul is used as this fits our next scenario better when extending to three dimensions (multiple models fitted at once).
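A sketch of that matrix computation (building X with a column of ones for the intercept is an assumption about how the design matrix was constructed):

import numpy as np

# Design matrix with an intercept column
X = np.column_stack([np.ones_like(x), x])

# beta = (X^T X)^-1 X^T y
xT = X.T
beta = np.matmul(np.linalg.inv(np.matmul(xT, X)), np.matmul(xT, y))
print(beta)  # first element is the intercept, second is the slope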

Before moving on to the next step with new datasets, a quick check of each method reveals the same intercept and slope – three different ways of achieving the same thing for the use case of simple linear regression.

OLS Results Three Ways, Image by author

Multiple Regressions at Once

Use Case Examples

Moving on to our next use case, what if you want to run multiple regressions at once? Let’s say you have a retail dataset with various stores, and you wanted to run ols regressions independently on each store – perhaps to compare coefficients or for finer grained linear model fits.

Another example use case – an ecommerce store may want to run a number of ols linear regressions over monthly sales data by product. They want to fit linear models independently in order to get a rough sense of directionality and trend over time by observing various models’ coefficients.

Data Shape and Multiple Matrix Formula

To demonstrate, a dataset with 2 sets of 50 data points is created. Both have 2 regressors/independent variables, but we want to fit a linear model to each of the two groupings separately.

The data has a shape of (2, 50, 3) – representing 2 groupings, 50 data points in each, and 2 independent variables for each (plus an intercept column). Our Y variable is now (2, 50) – 2 groupings, 50 data points.

If we try to plug this into our statsmodels ols function, we’ll get an error. However, we can extend our matrix formula to find the intercepts/slopes for each model in one pass. Let’s define mathematically what we are trying to accomplish:

Multiple Groupings Matrix Formula, Image by author

For n regressions (2 in this case), with each individual regression grouping of data represented by k, we want to run the matrix version of OLS linear regression we performed in the previous section. You’ll see this is the same formula as the prior section; we are just performing multiple iterations (k = 1 to n), one set of matrix multiplications for each grouping.

Numpy Transpose Adjustments

Implementing this in numpy requires a few simple changes. If we just plugged our X and y variables directly into the prior formulation, the dimensions wouldn’t line up how we want. The reason is that the plain transpose (.T in numpy) reverses every axis, changing our shape from (2, 50, 3) to (3, 50, 2). Not exactly what we want…

The numpy matmul documentation specifies that if more than 2 dimensions are passed in with either argument, the operation is treated as a stack of matrices residing in the last two indexes. This is the behavior we are looking for, so the first dimension should remain as 2 (stacking our groupings) and we just want to transpose the 50 and 3.

We can use the transpose function and specify exactly how to reorder our axes to get the right shape. Passing in axes=(0, 2, 1) will get the desired shape – swapping the last two axes, but keeping our n regressions dimension in place to stack over.

For the following code examples, X will now be referred to as X_many and similarly for y in order to differentiate from prior examples.
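A sketch of the transpose call (assuming X_many has shape (2, 50, 3)):

# Keep the grouping axis in place and swap only the last two axes
xT_many = np.transpose(X_many, axes=(0, 2, 1))
print(xT_many.shape)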

Output: (2, 3, 50)

Revised Numpy Operations

With this transpose knowledge in mind, time to make the changes to solve this grouped version. For readability, the numpy code has been split into two lines. Transpose is modified to work along the proper dimensions and an additional empty axis is added to y_many just to line up the matmul operation.
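A sketch of those operations, reusing xT_many from above:

# beta_k = (X_k^T X_k)^-1 X_k^T y_k, computed for every grouping at once
xT_x_inv = np.linalg.inv(np.matmul(xT_many, X_many))
betas = np.matmul(np.matmul(xT_x_inv, xT_many), y_many[..., np.newaxis]).squeeze(-1)
betas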

Output:
array([[209.9240413375,  95.3254823487,  61.4157238175],
       [514.164936664 ,   8.1355322839,   4.3353162671]])

We get back an array that contains our intercept and coefficient for each independent variable. Performing a similar (but looped) operation using statsmodels and using numpy.allclose to check reveals that the coefficients are the same.

We were able to fit many regressions at once. This approach can be helpful in cases where we want to fit a bunch of linear models that are grouped and partitioned by one particular dimension and solved independently.

Einsum Notation

We can also solve this with the einsum function in numpy. This uses Einstein summation convention to specify how operations are performed. At a high level, we define the subscripts for each matrix as a string – if the order is swapped that specifies a transpose and multiplication occurs over common letters. For more information, check out the numpy docs as they walk through a few examples.

The einsum version has been broken out into a few more steps for readability. Inside the notation string, we are specifying the transposes along with the matrix multiplication to be performed. This also allows us to handle the np.newaxis piece of code by instead specifying the output dimensions.
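A sketch of the einsum version:

# X^T X for every grouping k, summing over the observation axis i
xT_x = np.einsum('kij,kil->kjl', X_many, X_many)
xT_x_inv = np.linalg.inv(xT_x)

# (X^T X)^-1 X^T y for every grouping, specifying the output axes directly
betas = np.einsum('kmj,kij,ki->km', xT_x_inv, X_many, y_many)
print(betas)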

Output:
[[209.9240413375  95.3254823487  61.4157238175]
 [514.164936664    8.1355322839   4.3353162671]]

This gives us the same results as the prior version, although performance can be a tad slower. If you are familiar with einsum notation, this can be a useful implementation.

Speed Comparison

Groupby/Apply Pattern

One way I’ve seen multiple regressions implemented at once is using a pandas dataframe with the groupby/apply method. In this method, we define a function and pandas will split the dataset into groupings and apply that function – returning and reducing the results as needed.

This is the use case in our "grouped regression". Switching our numpy arrays into one dataframe, grouping by a regression_number column, and applying a custom OLS function that fits and returns the results will provide the same coefficients (although pandas displays with a few fewer decimal places).
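The article’s helper function is not shown here; an illustrative sketch (the column names const, x1, x2, and y are placeholders):

import pandas as pd
import statsmodels.api as sm

def fit_ols(group_df):
    """Fit OLS on one grouping and return its coefficients as a Series."""
    fit = sm.OLS(group_df['y'], group_df[['const', 'x1', 'x2']]).fit()
    return fit.params

# long_df holds one row per observation with a regression_number column
results = long_df.groupby('regression_number').apply(fit_ols)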

Groupby/Apply Regression Results, Image by author

Comparing Performance

Now that we have two different ways of performing OLS over multiple groupings, how do they perform?

For this test, instead of 2 groupings of 50 points and 2 regressors, we’ll expand to 100 groupings and 3 regressors – shape of (100, 50, 4) – as a more realistic scenario.

Running a timeit test in our notebook using the %%timeit magic command shows the pandas groupby/apply approach takes 120 milliseconds while the numpy matrix version takes only 0.151 milliseconds. Matrix Multiplication was almost 800 times faster!

Granted, I’m sure there are ways to optimize the pandas implementation, but the nice part about the matrix version is a few lines of code directly in numpy is already fast.

Summary

Understanding the math behind certain basic approaches to Machine Learning can allow you to customize and extend approaches where needed. Many open source libraries are fast, well tested, and often do exactly what you need them to.

However, there may be one-off situations where implementing a simple, but well-understood approach such as linear regression using numpy can yield better performance without too much additional complexity.

_All examples and files available on Github._


Originally published at https://datastud.dev.

Classification with Imbalanced Data (20 Nov 2021)
https://towardsdatascience.com/classification-with-imbalanced-data-f13ccb0496b3/
Using various resampling methods to improve machine learning models

Photo by Aziz Acharki on Unsplash

Building classification models on data that has largely imbalanced classes can be difficult. Using techniques such as oversampling, undersampling, resampling combinations, and custom filtering can improve accuracy.

In this article, I’ll walk through a few different approaches to deal with data imbalance in classification tasks.

  1. Oversampling
  2. Undersampling
  3. Combining Oversampling and Undersampling
  4. Custom Filtering and Sampling

Scenario and Data Overview

To demonstrate various class imbalance techniques, a fictitious dataset of credit card defaults will be used. In our scenario, we are trying to build an explainable classifier that takes two inputs (age and card balance) and predicts whether someone will miss an upcoming payment.

A mock up of the data is shown in the following charts. You’ll see there are some random defaults (orange dots) throughout the data, but they make up a small percentage (24 out of 374 training example instances, ~6.4%). This can be challenging for some Machine Learning classification algorithms to pick up on, and we may want to limit our potential set of model choices in certain cases for explainability/regulation reasons.

For this scenario, our goal will be 90%+ on accuracy and 50%+ on F1 score (harmonic mean of precision/recall) using logistic regression.

Data Overview, Image and Data by author

Baseline Logistic Regression

For a baseline model to use for comparisons, we’ll run a simple logistic regression and plot the decision bounds, as well as calculate various accuracy metrics.
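The fitting code is not included in this excerpt; a minimal sketch with scikit-learn (assuming train/test splits of the age and balance features are already in place):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score

# Fit a plain logistic regression on the original, imbalanced training data
baseline_clf = LogisticRegression()
baseline_clf.fit(train_x, train_y)

preds = baseline_clf.predict(test_x)
print('Accuracy: {:.2f}, F1: {:.2f}'.format(accuracy_score(test_y, preds), f1_score(test_y, preds)))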

The base logistic regression meets our 90%+ accuracy goal, but fails on the precision/recall front. You’ll see why below: with such a small relative size of the defaulted class, the model is just predicting every single data point as not defaulted (represented by the light blue background in the decision boundary plot).

Baseline Logistic Regression – Before Any Resampling, Image by author

However, there is a clear section in the top left of our data (very young age with high balances) that seems to default more often than random. Can we do any better when incorporating resampling methods?

Oversampling

One popular method for dealing with this problem is oversampling. Imbalanced-learn (imblearn) is a Python library that provides many different methods for classification tasks with imbalanced classes; one of the popular oversampling methods it implements is SMOTE.

SMOTE stands for Synthetic Minority Over-sampling Technique. Given the name, you can probably intuit what it does – creating synthetic additional data points for the class with fewer data points. It does this by taking into account other features – you can almost think of it as using interpolation between the few samples we do have to add new data points where they might exist.

Applying SMOTE is straightforward; we simply pass in our x/y training data and get back the desired resampled data. Plotting this new data, we now show an even distribution of classes (350 defaulted vs. 350 not defaulted). A lot of new defaulted class data points were created, which should allow our model to learn a function that doesn’t just predict the same outcome for every data point.
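A sketch of the resampling call with imblearn (default settings):

from imblearn.over_sampling import SMOTE

# Oversample the minority (defaulted) class up to parity with the majority class
smote = SMOTE(random_state=42)
x_resampled, y_resampled = smote.fit_resample(train_x, train_y)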

SMOTE Adjusted Training Data, Image by author

Fitting a new logistic regression model on this resampled data yields the decision boundary below.

You can see now that instead of a blue background representing a non default decision for the entire chart (as our baseline model did), this new model trained on SMOTE resampled data is predicting the top left portion defaulting (represented with a light orange background).

SMOTE Decision Bounds, Image by author

Oversampling the class that only had a few data points certainly led to a higher percentage of default predictions, but did we accomplish our goals? Accuracy dropped to ~76%, but F1 score increased to ~30%. Not quite there yet, so let’s try some additional methods to see if this can be improved.

Undersampling

The opposite of oversampling the class with fewer examples is undersampling the class with more. Using the approach of Edited Nearest Neighbors we can strategically undersample data points. Doing this leads to the following modified training data – we still have our 24 default class data points, but the majority class now only has 287 of the original 350 data points in our new training dataset.
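A sketch of the undersampling step:

from imblearn.under_sampling import EditedNearestNeighbours

# Remove majority-class points whose neighborhood disagrees with their label
enn = EditedNearestNeighbours()
x_resampled, y_resampled = enn.fit_resample(train_x, train_y)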

ENN Adjusted Training Data, Image by author

This results in the following decision bounds. The model properly targets the top left as the most frequently defaulted region, but the F1 score isn’t where we need it to be. There are certainly data points still on the table that can be captured to create a more ideal decision boundary.

ENN Decision Bounds, Image by author

Undersampling + Oversampling

Another popular method is combining the two approaches. We can oversample using SMOTE and then clean the data points using ENN. This is referred to as SMOTEENN in imblearn.
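A sketch of the combined step:

from imblearn.combine import SMOTEENN

# Oversample with SMOTE, then clean the result with Edited Nearest Neighbours
smote_enn = SMOTEENN(random_state=42)
x_resampled, y_resampled = smote_enn.fit_resample(train_x, train_y)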

SMOTEENN Adjusted Training Data, Image by author

Our count of labels is a lot closer to equal, but we have fewer overall data points. How does this do on our metrics?

SMOTEENN Decision Bounds, Image by author

This method resulted in a bit more of an extreme decision boundary, with a further drop in accuracy and lower F1 score.

Custom Sampling + Smote

For our data, SMOTE seems to help, but maybe we can be a bit more targeted with which data points we want to oversample. One approach we can take is using a simple K Nearest Neighbors classifier and keeping just the data points whose neighbors also belong to our defaulted class with some probability threshold.
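The exact filtering code is not shown in this excerpt; one way to sketch the idea (the neighbor count of 5 and the 0.5 probability threshold are assumptions, as is treating class 1 as "defaulted"):

from sklearn.neighbors import KNeighborsClassifier

# Fit a small KNN model on the training data
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(train_x, train_y)

# Keep defaulted rows only when enough of their neighbors also defaulted
default_proba = knn.predict_proba(train_x)[:, 1]
keep_mask = (train_y == 0) | ((train_y == 1) & (default_proba >= 0.5))
x_filtered, y_filtered = train_x[keep_mask], train_y[keep_mask]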

With this in place, we now have the following set of data – reducing our defaulted class from 24 to 10 (but hopefully getting rid of the noisy data points that may be throwing off our SMOTE processes and creating too aggressive interpolations).

KNN Filtered Data, Image by author

Performing SMOTE (using the same code as in the earlier steps) results in the following training dataset – creating 350 defaulted class samples from our original 10:

Custom Sampling + SMOTE Adjusted Training Data, Image by author

We train another logistic regression model using this resampled data, and we now meet our goals! We can see the decision boundary now accounts for that pocket of defaults after training on our adjusted data. Accuracy is still 90%+ and F1 Score is above our goal of 50%.

Custom Sampling + SMOTE Decision Bounds, Image by author

Summary

There are a variety of methods to deal with imbalanced classes when building a machine learning model. Depending on your constraints, goals, and data – some may be more applicable than others. We can also come up with some creative methods for resampling in order to build a classifier that properly targets the decision bounds and scenarios we are interested in – filtering out what may appear to be noise in our data.

All examples and files available on Github.


Originally published at https://datastud.dev.

A Straightforward Guide to A/B Testing in Python (8 Nov 2021)
https://towardsdatascience.com/a-straightforward-guide-to-a-b-testing-in-python-4432d40f997c/
Add rigor to your experiments with a few simple steps

Photo by Jason Dent on Unsplash

A/B testing can be extremely useful during experimentation, adding statistical rigor to situations where you compare one option against another. This is one step that can help guard against drawing faulty conclusions.

This article will demonstrate the critical steps in an A/B test:

  1. Determining Minimum Detectable Effect
  2. Calculating Sample Size
  3. Analyzing Results for Statistical Significance

Scenario Overview

To demonstrate a situation where a company may employ A/B testing, I’m going to create a fictitious video game dataset.

In this scenario, a video game development company has recognized that lots of users quit playing after one particular level. A hypothesis posed by the product team is that the level is too hard, and making it easier will allow for less user frustration and ultimately more players continuing to play the game.

Sample Size Calculation

Desired Business Effect

Our video game company has made some tweaks to the level to make it easier, effectively changing the difficulty from hard to medium. Before blindly rolling an update out to all the users, the product team wants to test the changes on a small sample to see if they make a positive impact.

The desired metric is the percentage of players who continue playing after reaching our level in question. Currently, 70% of users continue to play while the remaining 30% stop playing. The product team has decided increasing this to 75% would warrant deploying the changes and making an update to the game.

Minimum Detectable Effect

This piece of information (70% to 75%) helps us calculate the minimum detectable effect – one of the inputs to calculating sample size. Our problem is testing two proportions, so we’ll use the proportion_effectsize function in statsmodels to translate this change into something we can work with.
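A sketch of that calculation:

from statsmodels.stats.proportion import proportion_effectsize

# Effect size for moving the continue-playing rate from 70% to 75%
effect_size = proportion_effectsize(0.70, 0.75)
print('For a change from 0.70 to 0.75 - the effect size is {:.2f}.'.format(effect_size))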

Output: For a change from 0.70 to 0.75 - the effect size is -0.11.

Sample Size

Now that we have our minimum detectable effect, we can calculate sample size. To do this for a proportion problem, the zt_ind_solve_power function is used (also from the statsmodels library).

We set the nobs1 (number of observations of sample 1) parameter to None, which signifies this is what we are looking to solve for.

A significance level of 0.05 and power of 0.8 are commonly chosen default values, but these can be adjusted based on the scenario and desired false positive vs. false negative sensitivity.
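A sketch of the power analysis call:

import numpy as np
from statsmodels.stats.power import zt_ind_solve_power

# Leave nobs1 as None to solve for the required sample size per group
sample_size = zt_ind_solve_power(effect_size=effect_size, nobs1=None, alpha=0.05, power=0.8)
print('{} sample size required given power analysis and input parameters.'.format(int(np.ceil(sample_size))))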

Output: 1250 sample size required given power analysis and input parameters.

Experiment Data Overview

After communicating the sample size needed to the product team, an experiment was run where a random sample of 1,250 players played the new easier level and another 1,250 played the hard level.

After gathering the data, we learn that 980 users (of 1,250) continued playing after reaching the medium difficulty level while 881 (of 1,250) continued playing after reaching the hard difficulty level. This seems decent enough and above our hope of at least a 5% improvement – should we make the change?

Before deciding, it’s important to test for statistical significance to guard against the possibility that this difference could have occurred simply by random chance (accounting for our significance and power).

Experiment Data, Image and Data by author

Analyzing Results

Calculating Inputs

The statistical test we’ll use is the proportions_ztest. We need to calculate a number of inputs first – the number of successes and observations. Data is stored in a pandas dataframe with columns for medium and hard – with 0 representing users that stopped playing and 1 representing those that continued.
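A sketch of those input calculations (the dataframe name results_df is a placeholder):

# Successes are the 1s (players who continued); observations are the row counts
successes = [results_df['medium'].sum(), results_df['hard'].sum()]
nobs = [len(results_df), len(results_df)]
print('Medium: {} successes of {} trials. Hard: {} successes of {} trials.'.format(
    successes[0], nobs[0], successes[1], nobs[1]))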

Output: Medium: 980 successes of 1250 trials. Hard: 881 successes of 1250 trials.

Performing the Z Test

Performing the z test is simple with the required inputs:
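A sketch of the test itself:

from statsmodels.stats.proportion import proportions_ztest

z_stat, p_value = proportions_ztest(count=successes, nobs=nobs)
print('z stat of {:.3f} and p value of {:.4f}.'.format(z_stat, p_value))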

Output: z stat of 4.539 and p value of 0.0000. 

A p value below 0.05 meets our criteria. Reducing the level difficulty to medium is the way to go!

Summary

While this is a fictitious and straightforward example, utilizing a similar approach in the real world can add some rigor to your experiments. Consider using A/B testing when you have two samples and are looking to gauge whether an experiment resulted in a significant change based on your desired metric.

All examples and files available on Github.


Originally published at https://datastud.dev.

Preprocessing Text Data for Machine Learning (29 Oct 2021)
https://towardsdatascience.com/preprocessing-text-data-for-machine-learning-6b98f7bb0258/
Using NLTK and animations to show common text transformations

Photo by Patrick Tomasso on Unsplash

Unstructured text data requires unique steps to preprocess in order to prepare it for Machine Learning. This article walks through some of those steps including tokenization, stopwords, removing punctuation, lemmatization, stemming, and vectorization.

Dataset Overview

To demonstrate some natural language processing text cleaning methods, we’ll be using song lyrics from one of my favorite musicians – Jimi Hendrix. The raw song lyric data can be found here. For the purposes of this demo, we’ll be using a few lines from his famous song The Wind Cries Mary:

A broom is drearily sweeping
Up the broken pieces of yesterdays life
Somewhere a queen is weeping
Somewhere a king has no wife
And the wind, it cries Mary

Tokenization

One of the first steps in most natural language processing workflows is to tokenize your text. There are a few different varieties, but at the most basic sense this involves splitting a text string into individual words.

We’ll first review NLTK (used to demo most of the concepts in this article) and quickly see tokenization applied in a couple of other frameworks.

NLTK

We’ve read the dataset into a list of strings and can use the word_tokenize function from the NLTK Python library. You’ll see that, looping through each line, applying word_tokenize splits the line into individual words and special characters.
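A sketch of that loop (assuming lines holds the lyric strings):

from nltk.tokenize import word_tokenize

# Split each line of lyrics into individual word and punctuation tokens
tokenized_lines = [word_tokenize(line) for line in lines]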

Before Tokenization, Image by author
After Tokenization, Image by author

Tokenizers In Other Libraries

There are many different ways to accomplish tokenization. The NLTK library has some great functions in this realm, but others include spaCy and many of the deep learning frameworks. Some examples of tokenization in those libraries are below.

Torch
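A minimal sketch using torchtext’s built-in basic tokenizer:

from torchtext.data.utils import get_tokenizer

tokenizer = get_tokenizer('basic_english')
tokenized_lines = [tokenizer(line) for line in lines]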

spaCy
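And a sketch with spaCy (assuming the small English model is installed):

import spacy

nlp = spacy.load('en_core_web_sm')
docs = [nlp(line) for line in lines]                    # each result is a spaCy Doc object
tokenized_lines = [[token.text for token in doc] for doc in docs]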

There can be slight differences from one tokenizer to another, but the above more or less do the same. The spaCy library has its own objects that incorporate the framework’s features, for example returning a doc object instead of a list of tokens.

Stopwords

There may be some instances where removing stopwords improves the understanding or accuracy of a natural language processing model. Stopwords are commonly used words that may not carry much information and can often be removed with little information loss. You can get a list of stopwords from NLTK with the following Python commands.

Loading and Viewing Stopwords
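A sketch of loading the list:

import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')                  # one-time download of the corpus
stop_words = set(stopwords.words('english'))
print(stop_words)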

Stopwords, Image by author

Removing Stopwords

We can create a simple function for removing stopwords and returning an updated list.
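A sketch of such a function:

def remove_stopwords(tokens, stop_words):
    """Return only the tokens that are not in the stopword list."""
    return [token for token in tokens if token.lower() not in stop_words]

cleaned_lines = [remove_stopwords(tokens, stop_words) for tokens in tokenized_lines]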

Text Before & After Stopword Removal, Image by author

Punctuation

Similar to stopwords, since our text is already split into sentences, punctuation can be removed without much information loss, cleaning the text up to just words. One approach is to simply use the string module’s list of punctuation characters.
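A sketch using the string module:

import string

def remove_punctuation(tokens):
    """Drop tokens that are purely punctuation characters."""
    return [token for token in tokens if token not in string.punctuation]

cleaned_lines = [remove_punctuation(tokens) for tokens in cleaned_lines]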

We had one extra comma that is now removed after applying this function:

Text After Punctuation Removal, Image by author

Lemmatization

We can further standardize our text through lemmatization. This boils down a word to just the root, which can be useful in minimizing the unique number of words used. This is certainly an optional step; in some cases, such as text generation, this information may be important, while in others, such as classification, it may be less so.

Single Word Test

To lemmatize our tokens, we’ll use the NLTK WordNetLemmatizer. One example applying the lemmatizer to the word "cries" yields the root word "cry".
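A sketch of that single-word test:

from nltk.stem import WordNetLemmatizer

# May require nltk.download('wordnet') the first time
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('cries', pos='v'))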

Output text: cry

All Tokens/Parts of Speech

The NLTK function runs on specific parts of speech, so we’ll loop through these in a generalized function to lemmatize tokens.

Text Before & After Lemmatization, Image by author

Stemming

Stemming is similar to lemmatization, but rather than converting to a root word, it chops off suffixes and prefixes. I prefer lemmatization since it is less aggressive and the words are still valid; however, stemming is also still sometimes used, so I show it here.

Snowball Stemmer

There are many different flavors of stemming algorithms, for this example we use the SnowballStemmer from NLTK. Applying stemming to "sweeping" removes the suffix and yields the word "sweep".
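A sketch of the single-word test:

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')
print(stemmer.stem('sweeping'))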

Output text: sweep

Apply to All Tokens

Similar to past steps, we can create a more generic function and apply this to each line.

Text Before & After Stemming, Image by author

As you can see, some of these are not words. For this reason, I prefer to go with lemmatization in almost all cases so that word lookups during embedding are more successful.

Put it all together

We’ve gone through a number of possible steps to clean our text and created functions for each while doing so. One final step would be combining this into one simple and generalized function to run on the text. I wrapped the functions in one combined function that allows enabling any desired step and running each sequentially on the various lines of text. I used a functional approach below, but a class could certainly be built as well using similar principles.

Vector Embedding

Now that we finally have our text cleaned, is it ready for machine learning? Not quite. Most models require numeric inputs rather than strings. In order to do this, embeddings where strings are converted into vectors are often used. You can think of this as numerically capturing the information and meaning of text in a fixed length numerical vector.

We’ll walk through an example of using gensim; however, many of the deep learning frameworks may have ways to quickly load pre-trained embeddings as well.

Gensim Pre-Trained Model Overview

The library that we’ll be using to look up pre-trained embedding vectors for our cleaned tokens is gensim. It has multiple pre-trained embeddings available for download; you can review these in the word2vec module’s inline documentation.

Most Similar Words

Gensim provides multiple functionalities to use with the pre-trained embeddings. One is viewing which words are most similar. To get an idea of how this works, let’s try the word "queen" which is contained in our sample Jimi Hendrix lyrics.
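The exact pre-trained model used is not specified in this excerpt; a sketch using one of gensim’s downloadable 100-dimensional GloVe models (matching the embedding dimension of 100 seen later):

import gensim.downloader as api

# Download (once) and load pre-trained 100-dimensional word vectors
model = api.load('glove-wiki-gigaword-100')
print(model.most_similar('queen'))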

Most Similar Words, Image by author

Retrieving Vector Embedding Examples

To convert a word to embedding vector we simply use the pre-trained model like a dictionary. Let’s see what the embedding vector looks like for the word "broom".

Sample Embedding Vector, Image by author

Applying to All Tokens

Similar to past steps, we can simply loop through the cleaned tokens and build out a list converting to vectors instead. In reality, there is likely some error handling for words that don’t look up successfully (and raise a KeyError since they are missing from the dictionary), but I’ve omitted that in this simple example.

Padding Vectors

Many natural language processing models require the same number of words as inputs. However, text length is often ragged, with each line not conforming to the exact same number of words. To fix this, one approach often taken is padding sequences. We can add dummy vectors at the end of the shorter sentences to get everything aligned.

Padding Sequences in Pytorch

Many libraries have helper methods for this type of workflow. For example, torch allows us to pad sequences in this manner as follows.
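A sketch of that padding step (vector_lines, a list of per-line embedding lists, is a placeholder name):

import torch
from torch.nn.utils.rnn import pad_sequence

# Convert each line's vectors to a tensor, then pad every line to the longest length
line_tensors = [torch.tensor(vectors) for vectors in vector_lines]
padded = pad_sequence(line_tensors, batch_first=True)
print(padded.shape)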

Output: torch.Size([5, 4, 100])

After padding our sequences, you can now see that the 5 lines of text are each of length 4 with an embedding dimension of 100 as expected. But what happened to our first line which only had three words (tokens) after cleaning?

Viewing Padded Vectors

Torch by default just creates zero value vectors for anything that needs to be padded.

Padded Zero Vector Example, Image by author

Summary

Text data often requires unique steps when preparing data for machine learning. Cleaning text is important to standardize words to allow for embeddings and lookups, while losing the least amount of information possible for a given task. Once you’ve cleaned and prepared text data, it can be used for more advanced machine learning workflows like text generation or classification.

All examples and files available on Github.


Originally published at https://datastud.dev.

Filling Gaps in Time Series Data (22 Oct 2021)
https://towardsdatascience.com/filling-gaps-in-time-series-data-2db7366f1965/
Resample using pandas as a step in your time series data prep

Photo by Aron Visuals on Unsplash

Time Series data does not always come perfectly clean. Some days may have gaps and missing values. Machine Learning models may require no data gaps, and you will need to fill missing values as part of the data analysis and cleaning process. This article walks through how to identify and fill those gaps using the pandas resample method.

Original Data

For demonstration purposes, I mocked up some daily time series data (range of 10 days total) with some purposeful gaps. The initial data looks as follows:

Initial Dataset, Image by author

Resample Method

One powerful time series function in pandas is the resample function. This allows us to specify a rule for resampling a time series.

This resampling functionality is also useful for identifying and filling gaps in time series data – if we call resample on the same grain. For example, the original dataset we are working with has gaps and not every day has a value. Utilizing the resample function as follows will identify these gaps as NA values.
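A sketch of that call (assuming df has a daily DatetimeIndex):

# Re-declare the daily grain; days missing from the index come back as NaN
df_resampled = df.resample('D').asfreq()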

Simple Resample, Image by author
Simple Resample Chart, Image by author

As you’ll see in the above, the resample method inserts NA values for days that did not exist. This expands our dataframe and essentially identifies the gaps to be handled. The next step is to fill these NA values with actual numbers based on a variety of methods.

Forward Fill Resample

One method for filling the missing values is a forward fill. With this approach, the value directly prior is used to fill the missing value. For example, the 2nd through 4th were missing in our data and will be filled with the value from the 1st (1.0).

Forward Fill Resample, Image by author
Forward Fill Chart, Image by author

Backward Fill Resample

A similar method is the backward fill. After the above, you can probably guess what this does – uses the value after to fill missing data points. Instead of filling the 2nd through 4th with the 1.0 from the first day in our time series – you’ll see below that it now takes on the value of 2.0 (pulling from October 5th).

Backward Fill Resample, Image by author
Backward Fill Chart, Image by author

Interpolate Fill Resample

The final method in this article is the interpolate method. The below charts show interpolation, where data is essentially fitted from one point to the next. You’ll see in the below examples that smooth lines connect the missing values.
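Each of the three fill approaches described above is a one-line variation on the resampled frame; a sketch:

df_ffill = df.resample('D').ffill()          # forward fill
df_bfill = df.resample('D').bfill()          # backward fill
df_interp = df.resample('D').interpolate()   # linear interpolation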

Interpolate Resample, Image by author
Interpolate Fill Chart, Image by author

Summary

There are many ways to identify and fill gaps in time series data. The resample function is one easy way to identify and then fill missing data points. This can be used to prepare and clean data before building your machine learning model.

All examples and files available on Github.


Originally published at https://datastud.dev.

Automated Exploratory Data Analysis (18 Oct 2021)
https://towardsdatascience.com/automated-exploratory-data-analysis-da9fc5928e0d/
Using the python edatk library to find insights in your data

Photo by Possessed Photography on Unsplash

Exploratory data analysis is a critical initial step to building a Machine Learning model. Better understanding your data can make discovering outliers, feature engineering, and ultimately modeling more effective.

Some parts of exploratory data analysis, such as generating feature histograms and missing values counts, can be mostly automated. This article walks through an open source library I created that runs some basic automated EDA processes.

EDATK: Automated EDA Toolkit

To help speed up exploratory data analysis, I created edatk and open sourced the code. This allows you to install it via pip and run automated EDA with a few lines of code. It is still in alpha stages, so treat it as a supplement to your existing EDA workflow.

The main features of edatk are:

  1. Ease of Use: Running automated exploratory data analysis over a pandas dataframe is just one line of code.
  2. HTML Report Output: By providing a folder location, edatk will build an html report that presents visuals and tables in a clean manner.
  3. Target Column Exploration: This is one of the key features of edatk. Passing in an optional target_column parameter specifies to add visual layers and cues where possible, helping you spot trends between input features and the column you are predicting in a supervised machine learning setup. If your problem doesn’t fit a supervised machine learning pattern, you can simply ignore this parameter.
  4. Inferred Chart Types: Based on column types in your dataframe, edatk will infer which metrics to calculate and chart types to display.

For this demonstration, we’ll be using the common iris dataset. The dataset has various features of an iris plant and the task is to predict the species.

We won’t build a machine learning model in this article, but will run automated EDA to spot trends that may be useful for selecting or building new features to incorporate into model training.

Running Automated EDA

The main way to run edatk is as follows, with a couple critical steps:

  1. Import the library and load your dataset. For this demo, we use seaborn to load the iris dataset into a pandas dataframe.
  2. Run the auto_eda method, passing in your dataframe, save (output) location, and target column. The output location and target column are optional, but recommended if you can provide these values.

It’s that simple! Edatk runs through various routines based on the column types and cardinality of each column. Visualizations are automatically generated and an html report is built with the outputs. The full html report generated by the below code can be viewed here.
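The code itself is only a couple of lines; a sketch of those two steps (the save_location parameter name is based on the description above and may differ slightly from the library’s actual signature):

import seaborn as sns
import edatk as eda

# Load the iris dataset into a pandas dataframe
df = sns.load_dataset('iris')

# Run automated EDA, saving an html report and flagging the prediction target
eda.auto_eda(df, save_location='report_iris', target_column='species')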

Analyzing Results

Single Column Statistics

The first portion of the report loops through all the columns and calculates basic descriptive statistics. This takes the form of an initial table with min, max, percentage of rows with missing values, etc. The next portion shows some basic descriptive charts such as box plots and histograms.

The following screenshots show what gets generated for each column, with Sepal Length (one of the dataset features used to predict species) as the example.

Single Column Table, Image by author
Single Column Visuals, Image by author

Multiple Column Statistics

One of the most useful parts of exploring your data is plotting feature pairs and analyzing them against the target. This can give you ideas on how to engineer new features. If you pass in a target_column when calling auto_eda, many of these feature pair visualizations will include color coding according to this target variable. This makes it quick and easy to spot potential trends.

For example, one of the plots produced is a scatter plot with petal_length on the x axis and petal_width on the y axis. The three different species we are looking to train our model to predict are color coded. One can quickly spot some separation here. Including these two features alone should produce a decent starting point for a model. You could also potentially combine them into one newly engineered feature to capture the relationship.

Paired Column Visuals Scatter, Image by author

The generated visuals are not always scatter plots. The library looks at column types to determine the type of visualization that should be generated. For example, the categorical species column is plotted against petal_width using a box plot (example below).

Paired Column Visuals Box Plot, Image by author

Caveats

Edatk can handle some larger datasets (in terms of number of rows) as some sampling does occur for plots that are known to be performance intensive. However, since pair plot combinations are generated – an extremely wide dataset with a large number of columns may cause issues. The auto_eda method provides a column_list parameter to pass in a smaller list of column names in the event of this scenario.

Finally, edatk is still in alpha stages – so treat as a supplement to your existing eda workflow.

Contributing

This library is still a work in progress, but open sourced to all who wish to contribute to make it better!

The planned features can be viewed here on the github repo, as well as some basic instructions and git commands for those looking to make a first pull request.

Summary

Automated exploratory data analysis can help you better understand the data and spot initial trends.

Edatk is one such library that seeks to automate some of this work. Check it out and let me know what you think!

All examples and files available on Github.


Originally published at https://datastud.dev.

Detecting Outliers Using Python (8 Oct 2021)
https://towardsdatascience.com/detecting-outliers-using-python-66b25fc66e67/
Using Isolation Forests for Automated Outlier Detection

Photo by Rupert Britton on Unsplash

What is Outlier Detection?

Detecting outliers can be important when exploring your data before building any type of Machine Learning model. Some causes of outliers include data collection issues, measurement errors, and data input errors. Detecting outliers is one step in analyzing data points for potential errors that may need to be removed prior to model training. This helps prevent a machine learning model from learning incorrect relationships and potentially lowering accuracy.

In this article, we will mock up a dataset from two distributions and see if we can detect the outliers.

Data Generation

To test out the outlier detection model, a fictitious dataset from two samples was generated. Drawing 200 points at random from one distribution and 5 points at random from a separate shifted distribution gives us the below starting point. You’ll see the 200 initial points in blue and our outliers in orange. We know which is which since this was generated data, but on an unknown dataset the goal is to essentially spot the outliers without having that inside knowledge. Let’s see how well some out of the box scikit-learn algorithms can do.

Initial Dataset

Isolation Forest

One method of detecting outliers is using an Isolation Forest model from scikit-learn. This allows us to build a model that is similar to a random forest, but designed to detect outliers.

The pandas dataframe starting point after data generation is as follows – one column for the numerical values and a second ground truth column that we can use for accuracy scoring:

Initial Dataframe – First 5 Rows

Fit Model

The first step is to fit our model, note the fit method just takes in X as this is an unsupervised machine learning model.
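A sketch of the fit, using the Value column referenced in the rules further below:

from sklearn.ensemble import IsolationForest

iso_forest = IsolationForest(random_state=42)
iso_forest.fit(df[['Value']])     # unsupervised: only X is passed in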

Predict Outliers

Using the predict method, we can predict whether a value is an outlier or not (1 is not an outlier, -1 is an outlier).
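A sketch of the prediction step:

# 1 = inlier, -1 = outlier
df['prediction'] = iso_forest.predict(df[['Value']])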

Review Results

To review the results, we’ll both plot and calculate accuracy. Plotting our new prediction column on the original dataset yields the following. We can see that the outliers were picked up properly; however, some of the tails of our standard distribution were as well. We could further modify a contamination parameter to tune this to our dataset, but this is a great out of the box pass.

Accuracy, precision, and recall can also be simply calculated in this example. The model was 90% accurate as some of the data points from the initial dataset were incorrectly flagged as outliers.
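A sketch of that scoring (the ground truth column name and its 1 = outlier encoding are assumptions):

from sklearn.metrics import accuracy_score, precision_score, recall_score

# Map the isolation forest output (-1 outlier, 1 inlier) to 1/0 outlier flags
pred_outlier = (df['prediction'] == -1).astype(int)
true_outlier = df['ground_truth']

print('Accuracy {:.0%}, Precision {:.0%}, Recall {:.0%}'.format(
    accuracy_score(true_outlier, pred_outlier),
    precision_score(true_outlier, pred_outlier),
    recall_score(true_outlier, pred_outlier)))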

Output: Accuracy 90%, Precision 20%, Recall 100%

Explain Rules

We can use decision tree classifiers to explain some of what is going on here.
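One way to generate the rules below is to fit a shallow decision tree on the isolation forest’s predictions and print it (the max depth of 2 is an assumption):

from sklearn.tree import DecisionTreeClassifier, export_text

# Relabel predictions so the printed classes read as Outlier / Standard
labels = df['prediction'].map({-1: 'Outlier', 1: 'Standard'})

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(df[['Value']], labels)
print(export_text(tree, feature_names=['Value']))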

|--- Value <= 1.57
|   |--- Value <= -1.50
|   |   |--- class: Outlier
|   |--- Value >  -1.50
|   |   |--- class: Standard
|--- Value >  1.57
|   |--- class: Outlier

The basic rules are keying off -1.5 and 1.57 as the range to determine "normal" and everything else is an outlier.

Elliptic Envelope

Isolation forests are not the only method for detecting outliers. Another that is suited for Gaussian distributed data is an Elliptic Envelope.

The code is essentially the same; we are just swapping out the model being used. Since our data was pulled from a random sample, this resulted in a slightly better fit.

Output: Accuracy 92%, Precision 24%, Recall 100%

Different outlier detection models can be run on our data to automatically detect outliers. This can be a first step taken to analyze potential data issues that may negatively affect our modeling efforts.

All examples and files available on Github.


Originally published at https://datastud.dev.

Predicting Wine Prices with Tuned Gradient Boosted Trees (27 Sep 2021)
https://towardsdatascience.com/predicting-wine-prices-with-tuned-gradient-boosted-trees-9ab5ebd0b85e/
Using Optuna to find the optimal hyperparameter combination

What is Hyperparameter Tuning?

Many popular Machine Learning libraries use the concept of hyperparameters. These can be thought of as configuration settings or controls for your machine learning model. While many parameters are learned or solved for during the fitting of your model (think regression coefficients), some inputs require a data scientist to specify values up front. These are the hyperparameters, which are then used to build and train the model.

One example in gradient boosted decision trees is the depth of a decision tree. Higher values yield potentially more complex trees that can pick up on certain relationships, while smaller trees may be able to generalize better and avoid overfitting (which can lead to issues when predicting unseen data). This is just one example of a hyperparameter – many models have a number of these inputs which all must be defined by a data scientist or alternatively use defaults provided by the code library.

This can seem overwhelming – how do we know which combination of hyperparameters will result in the most accurate model? Tuning (finding the best combination) by hand can take a long time and cover a small sample space. One approach which will be covered here is using Optuna to automate some of this work. Rather than manually testing combinations, ranges for hyperparameters can be specified and Optuna runs a study to determine the optimal combination given time constraints.

Dataset Overview

To demonstrate Optuna and hyperparameter tuning, we’ll be using a dataset containing wine ratings and prices from Kaggle. Given some input features for a bottle of red wine – such as region, points, and variety – how close can we predict the price of a wine using hyperparameter tuning?

Dataset Overview

A few lines of code to load in our data and train/test split:

import pandas as pd
from sklearn.model_selection import train_test_split

# Read in data from local csv
df = pd.read_csv('winemag-data-130k-v2.csv')

# Choose just a few features for demonstration, infer categorical features
feature_cols = ['country', 'points', 'province', 'region_1', 'region_2', 'taster_name', 'variety', 'winery']
cat_features = [col for col in feature_cols if df[col].dtype == 'object']
for col in cat_features:
    df[col] = df[col].fillna('Other')
target_col = 'price'

# Train test split
train_df, test_df = train_test_split(df, test_size=0.3, shuffle=False)

train_x = train_df.loc[:, feature_cols]
train_y = train_df.loc[:, target_col]

test_x = test_df.loc[:, feature_cols]
test_y = test_df.loc[:, target_col]

Model Training

Baseline Models

To know if our hyperparameter optimization is helpful, we’ll train a couple of baseline models. The first is taking a simple mean price. Using this methodology results in a mean absolute percentage error of 79% – not very good; hopefully some machine learning modeling can improve our predictions!

The second baseline is training our model (using the Catboost library) with default parameters. Below are a few lines of this code. This beats our baseline simple mean prediction, but can we do better with further optimization?

import numpy as np
from catboost import CatBoostRegressor
from sklearn.metrics import mean_absolute_percentage_error

# Train a model with default parameters and score
# (eda.cross_validate_custom is the author's custom cross-validation helper)
model = CatBoostRegressor(loss_function='RMSE', eval_metric='RMSE', verbose=False, cat_features=cat_features, random_state=42)
default_train_score = np.mean(eda.cross_validate_custom(train_x, train_y, model, mean_absolute_percentage_error))
print('Training with default parameters results in a training score of {:.3f}.'.format(default_train_score))
Output: Training with default parameters results in a training score of 0.298.

Hyperparameter Optimized Model

Setting up the optimization study

To create our model with optimized hyperparameters, we create what Optuna calls a study – this allows us to define a trial with hyperparameter ranges and optimize for the best combination.

You’ll see in the below code, we define an objective function with a trial object that suggests hyperparameters according to our defined ranges. We then create the study and optimize to let Optuna do its thing.

def objective(trial):

    # Define parameter dictionary used to build catboost model
    params = {
        'loss_function': 'RMSE',
        'eval_metric': 'RMSE',
        'verbose': False,
        'cat_features': cat_features,
        'random_state': 42,
        'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.2),
        'depth': trial.suggest_int('depth', 2, 12),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000, step=50)
    }

    # Build and score model
    clf = CatBoostRegressor(**params)
    score = np.mean(eda.cross_validate_custom(train_x, train_y, clf, mean_absolute_percentage_error))

    return score
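The study creation itself is not shown above; a minimal sketch of that step (the number of trials is an assumption):

import optuna

# Minimize the cross-validated percentage error returned by the objective
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=50)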

Reviewing results

Optuna stores the best results in our study object. Running the below allows us to access the best trial and review training results.

# Grab best trial from optuna study
best_trial_optuna = study.best_trial
print('Best score {:.3f}. Params {}'.format(best_trial_optuna.value, best_trial_optuna.params))
Output: Best score 0.288. Params {'learning_rate': 0.0888813729642258, 'depth': 12, 'n_estimators': 800}

Comparing to default parameters

Doing a quick comparison of training results to our initial run with default parameters shows good signs. You’ll see that the optimized model has a better training fit (score is percentage error in this case, so lower = better).

# Compare best trial vs. default parameters
print('Default parameters resulted in a score of {:.3f} vs. Optuna hyperparameter optimization score of {:.3f}.'.format(default_train_score, best_trial_optuna.value))
Output: Default parameters resulted in a score of 0.298 vs. Optuna hyperparameter optimization score of 0.288.

Analyzing Optimization Trends

One neat thing is the parallel coordinates plot. This lets us view trials and analyze hyperparameters for potential trends. We may want to run a new study after reviewing results if we find interesting optimizations, allowing us to search additional hyperparameter spaces.

# Visualize results to spot any hyperparameter trends
from optuna.visualization import plot_parallel_coordinate
plot_parallel_coordinate(study)
Parallel Coordinates Plot

You can see the cost metric on the left (lower = better). Following the dark lines (best trials), you’ll notice a higher depth works best, a learning rate in the middle of our values tested, and a larger number of estimators. Given these findings, we could re-run a study narrowing in on these values and potentially expanding ranges of others – such as depth potentially increasing above our upper bound or adding in additional hyperparameters to tune.

Comparing Test Results

The final step is to compare test results. The first step is seeing how our simple mean prediction baseline does on the test set.

# Run baseline model (default predicting mean)
preds_baseline = np.zeros_like(test_y)
preds_baseline = np.mean(train_y) + preds_baseline
baseline_model_score = mean_absolute_percentage_error(test_y, preds_baseline)
print('Baseline score (mean) is {:.2f}.'.format(baseline_model_score))
Output: Baseline score (mean) is 0.79.

The next step is to view test results on our default hyperparameter model:

# Rerun default model on full training set and score on test set
simple_model = model.fit(train_x, train_y)
simple_model_score = mean_absolute_percentage_error(test_y, model.predict(test_x))
print('Default parameter model score is {:.2f}.'.format(simple_model_score))
Output: Default parameter model score is 0.30.

A lot better than simply using the mean as a prediction. Can our hyperparameter-optimized solution do any better on the test set?

# Rerun optimized model on full training set and score on test set
params = best_trial_optuna.params
params['loss_function'] = 'RMSE'
params['eval_metric'] ='RMSE'
params['verbose'] = False
params['cat_features'] = cat_features
params['random_state'] = 42
opt_model = CatBoostRegressor(**params)
opt_model.fit(train_x, train_y)
opt_model_score = mean_absolute_percentage_error(test_y, opt_model.predict(test_x))
print('Optimized model score is {:.2f}.'.format(opt_model_score))
Output: Optimized model score is 0.29.

We were able to improve on our model with hyperparameter optimization! We only searched a small space for a few trials, but improved our cost metric, leading to a better score (lower error by 1%).

All examples and files available on Github.


Originally published at https://datastud.dev.

Learning Power BI (20 Sep 2021)
https://towardsdatascience.com/learning-power-bi-f06d95119f28/
My favorite resources I've used to continually learn and master Power BI

One of the most frequent questions I encounter from those learning Power BI relates to the best resources to learn from.

There are a lot of great blogs, YouTube channels, and community members who post great tips and tricks for learning to develop Power BI datasets and reports. In this article, I’m going to list some of my favorites that I’ve encountered so far in my journey learning the tool. Resources are broken down into categories, roughly going from backend to frontend.

Power Query

The first step in building a Power BI dataset lies in sourcing and transforming the data. This is where Power Query comes in – this part of the tool allows you to ingest data from a variety of sources (csv, SQL, etc) and make any transformations before loading into your model.

  1. Alex Powers and 30 Day Query: Alex is knowledgeable on all things Power BI, but I’ve learned a ton from him in the Power Query and M code realm. One initiative he put together is the 30 Day Query challenge, where you solve a problem each day for 30 days attempting not to break query folding. You can follow him on Twitter and learn more about the 30 Day Query Challenge here.
  2. Matthew Roche: Matthew’s writing is fantastic, you can find his blog here. His articles on dataflows in Power BI (think Power Query in the service) help both understand and solidify concepts in general as well as how to apply knowledge to building enterprise-grade solutions.

DAX and Data Modeling

DAX is the formula language that you use to create calculated columns and measures in your dataset. I’ll also group in resources around modeling your data (relationships between tables, cardinality, building star schemas, defining data types, etc).

  1. My go-to favorite is sqlbi for all things DAX and modeling data in Power BI. Their articles dive deeply into concepts in these areas and others. They also have great books and video courses that I highly recommend.
  2. Phil Seamark’s blog is fantastic and has been really beneficial when digging into areas around performance optimization and more complex aggregation scenarios. If you’re looking to push the limits of performance and optimizing your dataset, Phil’s blog is a great place for inspiration.

Front End Design

Creating interactive reports that perform well and focus on end user experience is particularly important.

  1. Chris Hamill’s blog has great concepts in the area of report design. Creatively using bookmarks, improving front end rendering performance, and many other concepts are covered in his blog and associated works.
  2. Reid Havens covers some great tips and tricks around building Power BI reports on his YouTube channel.

All Things Power BI

  1. Guy In A Cube: This YouTube channel covers all things Power BI. It’s a great way to learn as well as stay on top of recent developments and pick up tips/tricks along the way.
  2. Microsoft’s Power BI Blog can be a great way to stay on top of product releases. I’d recommend bookmarking and returning periodically to keep up-to-date on all the developments. New features can be seen every month and this helps to keep on top of all the functionality. You can view the blog here.

That’s it for my top list. This is by no means exhaustive – there are tons of fantastic resources out there. Have a favorite of your own? Post in the comments to help others.


Originally published at https://datastud.dev.
