
A beginner’s guide to understanding and performing hyperparameter tuning for Machine Learning models

The What, Why, and How of Hyperparameter Tuning

Hyperparameter tuning is an important part of developing a Machine Learning model.

In this article, I illustrate the importance of hyperparameter tuning by comparing the predictive power of logistic regression models with various hyperparameter values.

First things first.

What are hyperparameters? – The what

Parameter vs. hyperparameter

Parameters are estimated from the dataset. They are part of the model equation. The equation below is a logistic regression model, where θ (theta) is the vector containing the parameters of the model.

$$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

Hyperparameters are set manually to help in the estimation of the model parameters. They are not part of the final model equation.

Examples of hyperparameters in logistic regression

1. Learning rate (α). One way of training a logistic regression model is with gradient descent. The learning rate (α) is an important part of the gradient descent algorithm. It determines how much the parameter θ changes with each iteration.

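$$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$$

Gradient descent update for parameter (θ) of feature j, where J(θ) is the cost function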
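To make the update rule concrete, here is a minimal pure-Python sketch of a single gradient descent step for logistic regression. It assumes the cost J(θ) is the unregularized log loss summed over the training examples, so the partial derivative for feature j works out to Σ (h(x) − y) · x_j. The function names and the tiny four-row dataset are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, row):
    # h_theta(x): the predicted probability for one example
    return sigmoid(sum(t * x for t, x in zip(theta, row)))

def gradient_step(theta, X, y, alpha):
    """Apply one update theta_j := theta_j - alpha * dJ/dtheta_j to every theta_j."""
    errors = [predict_proba(theta, row) - label for row, label in zip(X, y)]
    return [
        theta_j - alpha * sum(err * row[j] for err, row in zip(errors, X))
        for j, theta_j in enumerate(theta)
    ]

# Made-up data: the first column is a constant 1 for the intercept
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]]
y = [1, 0, 1, 0]
theta = [0.0, 0.0]

small_step = gradient_step(theta, X, y, alpha=0.01)
large_step = gradient_step(theta, X, y, alpha=0.1)
print(small_step, large_step)
```

Starting from the same θ, the step taken with α = 0.1 is ten times larger than the one taken with α = 0.01. That is all the learning rate does: it scales how far each update moves.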

Need a refresher on gradient descent? Read this absolute intro article to linear regression and gradient descent.

Linear regression and gradient descent for absolute beginners

2. Regularization parameter (λ). The regularization parameter (λ) is a constant in the "penalty" term added to the cost function. Adding this penalty to the cost function is called regularization. There are two types of regularization – L1 and L2. They differ in the equation for the penalty: L1 penalizes the sum of the absolute values of the weights, while L2 penalizes the sum of their squares.

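In a linear regression, the cost function is simply the sum of squared errors. With an L2 regularization term added, it becomes:

$$J(\theta) = \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2 + \lambda \sum_{j=1}^{n} \theta_j^2$$

Here h_θ(x) = θᵀx is the linear model’s prediction, m is the number of training examples, and n is the number of features.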

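In logistic regression, the cost function is the binary cross entropy, or log loss, function. With an L2 regularization term added, it becomes:

$$J(\theta) = -\sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right] + \lambda \sum_{j=1}^{n} \theta_j^2$$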

What does regularization do?

In training a model, the model is supposed to find a weight for each feature. Each weight is a value in the vector theta. Because there’s now a penalty on the size of each weight, the model is incentivized to move the weights for some features closer to 0. Regularization therefore reduces the complexity of the model, which helps avoid overfitting.
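Here is a small sketch of how the penalty changes which weights look "best". It assumes the cost is the summed log loss plus λ times the sum of squared weights, with the intercept (the first entry of theta) left unpenalized. The function name and the toy data are invented for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def l2_regularized_log_loss(theta, X, y, lam):
    """Summed log loss plus lam * sum of squared weights (intercept excluded)."""
    log_loss = 0.0
    for row, label in zip(X, y):
        h = sigmoid(sum(t * x for t, x in zip(theta, row)))
        log_loss -= label * math.log(h) + (1 - label) * math.log(1 - h)
    penalty = lam * sum(t ** 2 for t in theta[1:])
    return log_loss + penalty

# Made-up data: the first column is a constant 1 for the intercept
X = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0], [1.0, -0.3]]
y = [1, 0, 1, 0]

small_weights = [0.0, 1.0]
large_weights = [0.0, 10.0]

cost_small_plain = l2_regularized_log_loss(small_weights, X, y, lam=0.0)
cost_large_plain = l2_regularized_log_loss(large_weights, X, y, lam=0.0)
cost_small_l2 = l2_regularized_log_loss(small_weights, X, y, lam=1.0)
cost_large_l2 = l2_regularized_log_loss(large_weights, X, y, lam=1.0)

print("no penalty:", cost_small_plain, cost_large_plain)
print("lambda = 1:", cost_small_l2, cost_large_l2)
```

Without the penalty, the large weight gives the lower cost on this toy data. With λ = 1, the ordering flips and the smaller weight wins, which is the pull toward 0 described above.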

How do you go about tuning hyperparameters? – The how

Now that we know WHAT to tune, let’s talk about the process for tuning them.

There are several strategies for tuning hyperparameters. Two of them are Grid Search and Random Search.

Grid Search

In grid search, we preset a list of values for each hyperparameter. Then, we evaluate the model for every combination of the values in these lists.

The pseudocode would go something like this:

import itertools

penalties = ['none', 'l1', 'l2']
lambdas = [0.001, 0.1, 1, 5, 10]  # "lambda" is a reserved word in Python
alphas = [0.001, 0.01, 0.1]

hyperparameters = [penalties, lambdas, alphas]

# grid_values is a list of all possible combinations of penalty, lambda, and alpha
grid_values = list(itertools.product(*hyperparameters))

scores = []

for penalty, lam, alpha in grid_values:
    # create a logistic regression classifier
    # (MyLogisticRegression and its argument names are placeholders, not a real library class)
    classifier = MyLogisticRegression(penalty=penalty, lam=lam, alpha=alpha)

    # train the model with training data
    classifier.fit(X_train, y_train)

    # score the model with test data
    score = classifier.score(X_test, y_test)
    scores.append([(penalty, lam, alpha), score])

# Use scores to determine which combination had the best score
best_combination, best_score = max(scores, key=lambda item: item[1])
print(best_combination, best_score)

(Realistically, we’d evaluate several types of "score" such as accuracy, F1 score, etc. I will go over these in a later section.)

Random Search

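In random search, we don’t provide a preset list of values for each hyperparameter. Instead, we give the searcher a distribution for each hyperparameter, and the search algorithm tries random combinations of values drawn from those distributions to find the best one. When there are many hyperparameters, random search is often much more efficient than grid search, because the number of combinations it tries is fixed up front rather than growing with every value added to the grid.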
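As a rough illustration, here is what a random search can look like with scikit-learn’s RandomizedSearchCV. This is a sketch under a few assumptions: the data is synthetic (generated with make_classification, not the Titanic dataset), the only hyperparameter searched is C (scikit-learn’s inverse regularization strength), and the log-uniform range and n_iter=10 are arbitrary choices.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data, not the Titanic dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Instead of a fixed list, give the searcher a distribution to sample C from
param_distributions = {'C': loguniform(1e-3, 1e2)}

random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_distributions,
    n_iter=10,        # number of random combinations to try
    cv=5,
    random_state=0,
)
random_search.fit(X_demo, y_demo)

print(random_search.best_params_, random_search.best_score_)
```

The cost of the search is set by n_iter, so adding more hyperparameters to param_distributions does not multiply the number of model fits the way it does in a grid.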

Let’s do some machine learning!

The dataset I used is the Titanic dataset from Kaggle. In a previous article, I used this dataset to predict whether our beloved Jack would have indeed died in the shipwreck if he were a real passenger.

We’ll continue the trend in this post to predict survival likelihood based on passenger characteristics.

Would Jack Realistically Have Died aboard the Titanic?

Before we can train the model, we need to do some data processing. Essentially, I had to:

Drop columns that might not be useful, for instance, "Name."
Drop rows (2 total) with a missing value for "Embark."
Fill in missing age values (177 total) with a guess. In this case, the guess is based on "Parch" – number of parents and children aboard.
Transform categorical variables with one-hot encoding.
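The four steps above could look roughly like this in pandas. This is only a sketch: the rows are made up and merely mimic the Kaggle column names (where the port column is called "Embarked"), and filling each missing age with the median age of passengers who share the same "Parch" value is my assumption about what a "guess based on Parch" means. The actual code is in the notebook.

```python
import numpy as np
import pandas as pd

# Made-up rows that only mimic the Kaggle column names; not real passenger data
df = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D', 'E', 'F'],
    'Sex': ['male', 'female', 'female', 'male', 'male', 'female'],
    'Age': [22.0, np.nan, 30.0, 4.0, np.nan, 8.0],
    'Parch': [0, 0, 0, 2, 2, 2],
    'Embarked': ['S', 'C', None, 'Q', 'S', 'S'],
    'Survived': [0, 1, 1, 1, 0, 1],
})

# 1. Drop columns that are unlikely to be useful
df = df.drop(columns=['Name'])

# 2. Drop rows with a missing embarkation port
df = df.dropna(subset=['Embarked'])

# 3. Fill missing ages with the median age of passengers with the same Parch
age_guess = df.groupby('Parch')['Age'].transform('median')
df = df.assign(Age=df['Age'].fillna(age_guess))

# 4. One-hot encode the categorical variables
df = pd.get_dummies(df, columns=['Sex', 'Embarked'])

print(df)
```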

You can also find the code for these data processing steps, along with a more detailed explanation, in my Jupyter Notebook.

After all the data processing, this is my final DataFrame.

Next up, separate the data into a training set and a test set for model training and evaluation.
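A minimal sketch of the split using scikit-learn’s train_test_split. The 80/20 ratio, the stratification, and the synthetic data are assumptions for the example, not necessarily what I used for the Titanic data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the processed feature matrix X and the label vector y
X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)

# Hold out 20% of the rows for testing; stratify keeps the class balance similar
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

print(X_train.shape, X_test.shape)
```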

Let’s see how regularization and the learning rate alpha affect model performance.

Why is tuning hyperparameters important? – The why

As you’ll see shortly, tuning hyperparameters affects a model’s accuracy and F1 score. Not sure what these metrics mean? See their definitions in my previous Titanic article.

Effect of regularization

I used scikit-learn’s LogisticRegression classifier to fit and test my data. There are many solvers to choose from, and each solver has its own algorithm for convergence. For illustrative purposes, I chose the "saga" solver. It’s the only solver to support L1, L2, and no regularization.

Note: for scikit-learn’s LogisticRegression, instead of the λ regularization parameter, the classifier takes in a "C", which is the inverse of regularization strength. Think of it as 1/λ.
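To see the inverse relationship in action, here is a small sketch that fits the same model with a weak and a strong penalty and compares the size of the learned weights. It assumes synthetic data and treats C as exactly 1/λ, which is a simplification (the two conventions can differ by a scaling factor).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data, not the Titanic dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

coef_norms = {}
for lam in (0.01, 100.0):
    C = 1.0 / lam  # small lambda -> large C (weak penalty), and vice versa
    model = LogisticRegression(C=C, max_iter=1000).fit(X_demo, y_demo)
    coef_norms[lam] = float(np.linalg.norm(model.coef_))

print(coef_norms)
```

The larger λ (smaller C) should give a noticeably smaller weight vector, since the default L2 penalty is pulling the weights toward 0 more strongly.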

I used scikit-learn’s GridSearchCV to obtain the model’s score for every combination of penalty = ["none", "l1", "l2"] and C = [0.05, 0.1, 0.5, 1, 5].

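from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

clf = LogisticRegression(solver='saga', max_iter=5000, random_state=0)

# Note: scikit-learn 1.2 and later expect penalty=None instead of the string 'none'
param_grid = { 'penalty': ['none', 'l1', 'l2'], 'C': [0.05, 0.1, 0.5, 1, 5] }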

grid_search = GridSearchCV(clf, param_grid=param_grid)

grid_search.fit(X, y)

result = grid_search.cv_results_

GridSearchCV does an internal 5-fold cross validation. The average of model scores for each combination is:

L2 regularization with a C of 0.1 performed the best!
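If you want to look at the per-combination averages yourself, they live in cv_results_. Here is a sketch of turning them into a table with pandas. It runs on synthetic data and searches over C only, so the numbers it prints are not the Titanic results discussed here.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data, not the Titanic dataset
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

demo_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.05, 0.1, 0.5, 1, 5]},
    cv=5,  # 5-fold cross-validation, spelled out for clarity
)
demo_search.fit(X_demo, y_demo)

# One row per combination; mean_test_score is the average over the 5 folds
results = pd.DataFrame(demo_search.cv_results_)
summary = results[['param_C', 'mean_test_score', 'std_test_score', 'rank_test_score']]

print(summary.sort_values('rank_test_score'))
```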

Side note #1: I also implemented a random search algorithm with scikit-learn’s RandomizedSearchCV. If you’re curious, you can find the example in my Jupyter Notebook.

Side note #2: I’m sure you noticed that no regularization performed better than L1 and that, in many cases, there was no difference between no regularization and L2. The best explanation I have is that scikit-learn’s LogisticRegression might already be working well without regularization. Nevertheless, regularization did bring some improvement.

We’ll see later that regularization does play a big role in the SGDClassifier.

I then did a side-by-side comparison of several performance metrics without regularization and with L2 regularization.

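from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pandas as pd

tuned = LogisticRegression(solver='saga', penalty='l2', C=0.1, max_iter=5000, random_state=2)

# No regularization (scikit-learn 1.2 and later expect penalty=None instead of 'none')
not_tuned = LogisticRegression(solver='saga', penalty='none', max_iter=5000, random_state=2)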

tuned.fit(X_train, y_train)
not_tuned.fit(X_train, y_train)

y_pred_tuned = tuned.predict(X_test)
y_pred_not_tuned = not_tuned.predict(X_test)

data = {
    'accuracy': [accuracy_score(y_test, y_pred_tuned), accuracy_score(y_test, y_pred_not_tuned)],
    'precision': [precision_score(y_test, y_pred_tuned), precision_score(y_test, y_pred_not_tuned)],
    'recall': [recall_score(y_test, y_pred_tuned), recall_score(y_test, y_pred_not_tuned)],
    'f1 score': [f1_score(y_test, y_pred_tuned), f1_score(y_test, y_pred_not_tuned)]
}

pd.DataFrame.from_dict(data, orient='index', columns=['tuned', 'not tuned'])

The tuned model performed better than the untuned one in every metric except recall. Again, read my previous Titanic article if you need a refresher on what these metrics mean.
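As a quick reminder of how the four metrics relate to each other, here is a sketch that computes them from confusion-matrix counts using the standard definitions. The counts are hypothetical and the function name is mine; these are not results from the Titanic model.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)   # of the predicted positives, how many were right
    recall = tp / (tp + fn)      # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1 score': f1}

# Hypothetical counts, not results from the Titanic model
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

Because F1 is the harmonic mean of precision and recall, a model can improve its F1 score while losing a little recall, as long as precision improves enough to make up for it.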

Effect of learning rate (and regularization)

To see how different learning rates can affect model performance, I used scikit-learn’s SGDClassifier (stochastic gradient descent classifier). It allows me to tweak the learning rate, whereas the LogisticRegression classifier does not.

There are three parameters to SGDClassifier we could tweak: alpha, learning_rate, and eta0. The terminology is a bit confusing, so bear with me.

The learning_rate is the type of learning rate ("optimal" vs. "constant").

The eta0 is the algorithm’s learning rate when learning_rate is "constant". This is what I’ve been calling alpha (α) so far in this article.

The alpha is the constant that multiplies the regularization term. It’s also used to calculate the learning rate when learning_rate is "optimal". alpha serves the purpose of what’s commonly referred to as lambda.

Thus, there are several ways to set the learning rate in SGDClassifier. If you want a constant learning rate, set learning_rate='constant' and eta0=the_learning_rate_you_want. If you want a dynamic learning rate (that depends on the step you’re at), set learning_rate='optimal'. In the case of "optimal", eta0 is not used, and alpha serves the dual purpose of regularization strength and a constant in computing the dynamic learning rate at each step.

Below is a grid search algorithm for finding the best hyperparameters (for constant learning rate). I’m using the "constant" learning rate and I set the maximum iteration to 50,000.

from sklearn.linear_model import SGDClassifier
import matplotlib.pyplot as plt

# Note: scikit-learn 1.1 and later call this loss "log_loss"; "log" was removed in 1.3
sgd = SGDClassifier(loss="log", penalty="l2", max_iter=50000, random_state=100)

param_grid = {
  'eta0': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1],
  'learning_rate': ['constant'],
  'alpha': [0.00001, 0.0001, 0.001, 0.01, 0.1, 1]
}

grid_search = GridSearchCV(sgd, param_grid=param_grid)

grid_search.fit(X, y)

result = grid_search.cv_results_

The searcher gave alpha (here it means regularization strength) of 0.1 and eta0 (learning rate) of 0.0001 as the best params with a score of 0.7176.

I’ve plotted accuracy vs. learning rate (eta0) for a couple of different values of regularization strength (alpha). You can see that the learning rate and the regularization strength both have a significant effect on a model’s performance.

The accuracy is pretty low for a learning rate of 0.00001. This is likely due to the algorithm converging too slowly during gradient descent; after 50,000 iterations, we’re nowhere near the minimum. The accuracy is also low for high learning rates (0.1 and 1). This is likely due to overshooting. Below is a rescaled plot with all the alphas.

Regularization strength (alpha) plays a role in accuracy too. For any given learning rate (eta0), there’s a wide spread in accuracy depending on what the alpha value is.
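The two learning rate failure modes mentioned earlier (converging too slowly and overshooting) are easy to reproduce on a toy problem. The sketch below runs plain gradient descent on f(θ) = θ², whose minimum is at 0. It is not the Titanic model; the function, the starting point, and the three learning rates are chosen purely for illustration.

```python
def minimize_quadratic(alpha, steps=50, theta=1.0):
    """Run gradient descent on f(theta) = theta**2, whose gradient is 2 * theta."""
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

# Too small, reasonable, and too large learning rates
final_theta = {alpha: minimize_quadratic(alpha) for alpha in (0.001, 0.1, 1.1)}

for alpha, theta in final_theta.items():
    print(alpha, theta)
```

After 50 steps, the smallest learning rate has barely moved away from the starting point of 1.0, the middle one is very close to the minimum at 0, and the largest one has overshot on every step and ended up far away from it.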

Learning rate and regularization are just two hyperparameters in machine learning models. Every machine learning algorithm has its own set of hyperparameters. Questions? Comments? Respond below.