Tidymodels is a meta-package, much like the tidyverse, that loads an array of useful tidy packages into your session. The difference is that it loads the packages used throughout the machine learning model development process.
I'll introduce each one as we use it and provide a practical use case. Tidymodels was developed by Max Kuhn, who was brought on by RStudio to develop a tidy-friendly means of generating machine learning models. Kuhn's previous and best-known work is caret (Classification And REgression Training). Caret is already an incredibly powerful package in its own right and rivals Python's scikit-learn.
However, each model type traditionally has its own specific package, input requirements, specifications and training methods. Tidymodels seeks to resolve this by simplifying the workflow and providing a uniform syntax for generating machine learning models.
The Big Mart Sales dataset is available through Analytics Vidhya as one of the free projects they provide to members.
https://courses.analyticsvidhya.com/courses/big-mart-sales-prediction-using-r
The aim of the project is to develop and evaluate a series of regression models for predicting Item_Outlet_Sales from a number of input variables.
When going through the project myself, I found that many aspects were written in base R, weren't as intuitive as they could be, and didn't take advantage of the ease of use that the tidyverse packages offer. Hence I saw a real opportunity to translate this project into the "tidy" way of thinking and demonstrate tidymodels.
Exploratory Data Analysis
#Load Packages
library(tidymodels)
library(readr) #Read in CSV files
library(magrittr) #Tidy pipes
library(skimr) #Data summaries
library(GGally) #Scatterplot matrices
library(patchwork) #Visualisation grids
library(stringr) #String manipulation
library(forcats) #Working with factors
library(corrr) #Develop tidy correlation plots
library(scales) #Axis formatting helpers
library(vip) #Most important feature visualisations
We load a series of packages, the most important being the tidymodels meta-package. Much like the tidyverse, it loads a series of smaller packages that have discrete uses and each play a part in the model development process, alongside dplyr, ggplot2, purrr, tibble etc. We'll introduce each package as we progress through the model development.
#Load Data
train <- read_csv("train.csv")
skim(train)
The data is loaded into the project, and we check the structure using the handy skim function from the skimr package. There are 12 variables (7 character and 5 numeric) and 8,523 rows of data.
From here we need to tidy the dataset, as some variables have missing values. We investigate further by visualising each variable in the dataset.
I wrote the function below to make the initial EDA steps of my data science projects straightforward: it creates a list of plots via purrr's map2(), which is then passed to patchwork's wrap_plots() to generate a grid of plots.
viz_by_dtype <- function (x, y) {
  title <- str_replace_all(y, "_", " ") %>%
    str_to_title()
  if ("factor" %in% class(x)) {
    ggplot(train, aes(x, fill = x)) +
      geom_bar() +
      theme_minimal() +
      theme(legend.position = "none",
            axis.text.x = element_text(angle = 45, hjust = 1),
            axis.text = element_text(size = 8)) +
      scale_fill_viridis_d() +
      labs(title = title, y = "", x = "")
  }
  else if ("numeric" %in% class(x)) {
    ggplot(train, aes(x)) +
      geom_histogram() +
      theme_minimal() +
      labs(title = title, y = "", x = "")
  }
  else if ("integer" %in% class(x)) {
    ggplot(train, aes(x)) +
      geom_histogram() +
      theme_minimal() +
      labs(title = title, y = "", x = "")
  }
  else if ("character" %in% class(x)) {
    ggplot(train, aes(x, fill = x)) +
      geom_bar() +
      theme_minimal() +
      scale_fill_viridis_d() +
      theme(legend.position = "none",
            axis.text.x = element_text(angle = 45, hjust = 1),
            axis.text = element_text(size = 8)) +
      labs(title = title, y = "", x = "")
  }
}
variable_list <- colnames(train) %>% as.list()
variable_plot <- map2(train, variable_list, viz_by_dtype) %>%
wrap_plots(
ncol = 3,
nrow = 4,
heights = 150,
widths = 150
)
From this we note:
- Item Identifier has 1559 individual values, and for the sake of model simplicity we will drop this variable
- Item Weight has no apparent distribution
- Item Fat Content has inconsistent labels
- Item Visibility is right-skewed
- Item Type has a variety of labels
- Item MRP has four major groups
- Outlet Identifier has 10 unique labels
- Outlet Establishment Year is varied, many stores from the mid-90s onwards
- Outlet Size has missing values
- Outlet Location Type has three labels
- Outlet Type has four labels
- Item Outlet Sales is right-skewed
Furthermore, let's investigate whether any interesting relationships exist between Item_Outlet_Sales and the other numeric variables. Using ggscatmat() from GGally we generate a neat bivariate analysis, with density estimates along the diagonal and the Pearson correlation for each variable pair.
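A minimal sketch of how that scatterplot matrix can be generated (the explicit selection of numeric columns here is my own assumption):
#Bivariate analysis of the numeric variables against Item_Outlet_Sales
train %>%
  select(Item_Weight, Item_Visibility, Item_MRP, Item_Outlet_Sales) %>%
  ggscatmat(alpha = 0.2) +
  theme_minimal()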
Plotting the numeric variables against the target variable, we make three observations:
- Item_Outlet_Sales is spread across the entire range of Item_Weight without any obvious pattern
- Item_Visibility has no apparent relationship with Item_Outlet_Sales
- Item_MRP has four distinct bands and is moderately correlated with Item_Outlet_Sales; we will exploit this later in feature engineering
Data Wrangling
As noted, we need to correct a number of the variable values.
#Correct mislabeled Item_Fat_Content
train %<>% mutate(Item_Fat_Content = if_else(Item_Fat_Content %in% c("reg", "Regular"), "Regular", "Low Fat"))
#Outlet Size is missing a label
train %<>% mutate(Outlet_Size = if_else(is.na(Outlet_Size),"Small",Outlet_Size))
Outlet Size is imputed based on the NA label having a very similar Item_Outlet_Sales distribution to Small outlets.
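A rough sketch of that check (run on the freshly loaded data, before the imputation above); the "Missing" level is introduced purely for plotting:
#Compare Item_Outlet_Sales across Outlet_Size levels, treating NA as its own group
train %>%
  mutate(Outlet_Size = replace_na(Outlet_Size, "Missing")) %>%
  ggplot(aes(x = Outlet_Size, y = Item_Outlet_Sales, fill = Outlet_Size)) +
  geom_boxplot() +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_viridis_d()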
Model Development: Introducing tidymodels
So far we have used an array of tidyverse packages to wrangle and visualise data, let’s now walk through the process of developing a Machine Learning pipeline using the tidymodels meta package.
Data Splitting – rsample
Much like scikit-learn's train_test_split function, rsample's initial_split performs a similar operation. In the code below, a split object, mart_split, is generated using the initial_split function, passing our training dataframe, setting the proportion to 75%, and stratifying by the Item_Outlet_Sales variable.
This effectively creates a list of two dataframes, accessed via training() or testing(). We will use the test dataframe to judge how well our model performs on unseen data.
set.seed(55)
mart_split <- initial_split(train, prop = 0.75, strata = Item_Outlet_Sales)
mart_train <- training(mart_split)
mart_test <- testing(mart_split)
Feature Engineering – recipes
The recipes package provides a large array of functions for processing variables and making them machine learning friendly. The three main functions of the recipes package are:
- recipe() – defines a sequential series of pre-processing steps (OR gather and list my ingredients and how I’m going to prepare them)
- prep() – calculate required statistical transformations (OR prepare my ingredients for cooking)
- bake() – applies the pre-processing steps to the data set (OR cook my prepared dish)
The recipe function is constructed like other model formulas, outcome ~ predictors; here we indicate that we want to use all other variables as predictors by placing a period on the right-hand side of the formula (~ .).
We have used the recipe to generate a new variable, Price_Per_Unit, the quotient of Item_MRP and Item_Weight, reduced the skewness of Price_Per_Unit by applying a log transformation (with an offset of 1), and taken the square root of Item_Visibility. Item_MRP is also discretised into four bins, exploiting the bands we observed earlier. Previously, to generate dummy variables, whereby nominal/character variables are converted to binary columns indicating whether or not a row belongs to a particular group (1 = True, 0 = False), we might have used reshape2::dcast(); here the power of recipes is demonstrated, as a single piped instruction, step_dummy(), creates dummy variables for all nominal variables. I've decided to remove Item_Identifier (via step_rm) rather than dummy-encode it, for simplicity; besides, doing so would create over 1,500 columns and likely overcomplicate the model.
mart_recipe <-
  training(mart_split) %>%
  recipe(Item_Outlet_Sales ~ .) %>%
  step_rm(Item_Identifier) %>%
  step_impute_bag(Item_Weight) %>%
  step_impute_knn(Item_Visibility) %>%
  step_mutate(Price_Per_Unit = Item_MRP/Item_Weight) %>%
  step_sqrt(Item_Visibility) %>%
  step_log(Price_Per_Unit, offset = 1) %>%
  step_discretize(Item_MRP, num_breaks = 4) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal())
mart_recipe_prepped <- prep(mart_recipe)
The recipe object is then passed to the prep() function, which also returns a recipe object; however, the required statistics and transformations have now been estimated from the training data, and thus the recipe has been 'prep'ared.
Passing the prepped recipe object to bake() applies the transformations, and because a number of dummy variables have been created we now have 54 columns.
mart_train <- bake(mart_recipe_prepped, new_data = mart_train)
dim(mart_train)
[1] 6391 54
As the prepared recipe is effectively a transformation pipeline, we can recall these same transformations on the test data set.
mart_test <- mart_recipe_prepped %>%
bake(testing(mart_split))
dim(mart_test)
[1] 2132 54
Voila, fully processed training and test datasets.
Correlation Plot – corrr
Correlation plots are very useful for understanding which predictors are correlated with the outcome variable and with each other. To create a correlation plot, the data needs to be transformed into a correlation data frame (cor_df). We take mart_train, select all numeric variables, and pass them to corrr::correlate(), which generates the cor_df containing the Pearson correlation for each variable pair.
The cor_df object is passed to rplot(), and, treating it just like any other ggplot object, we can customise its appearance using theme() and colour scales.
corr_df <- mart_train %>%
  select(where(is.numeric)) %>%
  correlate() %>%
  rearrange() %>%
  shave()
rplot(corr_df) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90)) +
scale_colour_viridis_c()
A word to the wise: you probably aren't going to use a correlation plot with this many variables in practice, and as I write this it is hard to read without a larger monitor. However, do observe how elegant the code to create such a plot is.
At this juncture, one might decide to proceed with a principal component analysis (PCA) to reduce the number of variables by importance. I'll leave that for another time.
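If you did want to explore that, one possible sketch (mart_recipe_pca is a hypothetical name) is to append a PCA step to the existing recipe:
#Sketch only: retain enough principal components to explain ~80% of the variance
mart_recipe_pca <- mart_recipe %>%
  step_pca(all_numeric_predictors(), threshold = 0.8)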
Model Development – parsnip
The team at RStudio, particularly Max Kuhn, have a seriously wicked sense of humour. In naming this package they couldn't call it caret2, so they chose parsnip, another root vegetable.
As we are seeking to create a regression model that predicts Item_Outlet_Sales, for the sake of demonstration we will develop two models and use the root mean squared error (RMSE) as the evaluation metric.
- Linear Regression
- Random Forest
Generating a model using parsnip is incredibly elegant. Starting with our first model to evaluate, linear regression, we declare a model specification and then the computational engine that will be used to fit the model. In this instance the parsnip model will use the stats::lm() function.
lm_model <- linear_reg() %>%
set_engine("lm")
lm_mart_fit <-
lm_model %>%
fit(Item_Outlet_Sales ~ ., data = mart_train)
Calling the parsnip model object generates an output similar to calling a stats::lm() object: the formula and a list of coefficients for each input variable.
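If you prefer a tibble of coefficients to the printed summary, broom-style tidying also works on parsnip fit objects; a small optional sketch:
#Inspect the fitted coefficients as a tidy tibble
tidy(lm_mart_fit)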
lm_mart_res <-
predict(lm_mart_fit, new_data = mart_test) %>%
bind_cols(mart_test %>% select(Item_Outlet_Sales))
lm_mart_res %>%
ggplot(aes(x = Item_Outlet_Sales, y = .pred)) +
geom_abline() +
geom_point(alpha = 0.5) +
theme_minimal() +
labs(x = "Item Outlet Sales", y = "Predicted Item Outlet Sales")
Measuring Performance – yardstick
Using the output of our regression model on mart_test, we can evaluate how accurate the model is using the yardstick package, which provides an effective, tidy manner of evaluating model performance.
metrics <- metric_set(rmse, rsq, mae)
metrics(lm_mart_res, truth = Item_Outlet_Sales, estimate = .pred)
yardstick::metric_set creates a wrapper function for specified error metrics. In our case, we have specified Root Mean Squared Error (RMSE), R² and Mean Absolute Error (MAE).
Calling metrics() with the results tibble and the actual and predicted columns generates a tibble containing each of the error measures.
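The RMSE(Model) figure quoted in the comparison below is the error on the training data; a minimal sketch of how it could be computed with the same metric set:
#Training-set error for the linear model
predict(lm_mart_fit, new_data = mart_train) %>%
  bind_cols(mart_train %>% select(Item_Outlet_Sales)) %>%
  metrics(truth = Item_Outlet_Sales, estimate = .pred)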
Comparing the error of the model to the error of the prediction on the test set, we find that RMSE(Test) = 1141.8 and RMSE(Model) = 1144.3. The difference is slight, so we can say the model generalises well to the test data.
However, the regression model has predicted negative values for a number of observations and is thus inappropriate for this context, as we cannot have negative sales.
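A quick way to confirm this is to count the negative predictions, for example:
#Count test-set predictions that fall below zero
lm_mart_res %>%
  filter(.pred < 0) %>%
  nrow()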
Generating ML Pipelines – workflows
As shown above, there are a number of granular steps we've undertaken to generate a basic linear model; the workflows package provides an elegant means of piping each of these steps together. We'll now demonstrate how useful workflows are by generating a random forest model.
Random Forest Models
Random Forests are an ensemble of Decision Trees, generally trained via the bagging method. At each split, the algorithm searches for the best feature among a random subset of features. This results in greater tree diversity, trading a higher bias for a lower variance and, generally, a better model overall.
rf_workflow <-
workflow() %>%
add_recipe(mart_recipe) %>% #Pre-processing Steps
add_model(rand_forest(mode = "regression") %>% #Specify Model
set_engine("ranger"))
In just a few lines of code, we have generated an ML workflow that can now be used like a model object.
rf_mart_res <-
rf_workflow %>%
fit(training(mart_split)) %>%
predict(new_data = testing(mart_split)) %>%
bind_cols(mart_test %>% select(Item_Outlet_Sales))
rf_mart_res %>%
ggplot(aes(x= Item_Outlet_Sales, y = .pred)) +
geom_abline(lty = 2)+
geom_point(alpha = 0.5)+
theme_minimal()+
labs(x = "Item Outlet Sales", y = "Predicted Item Outlet Sales", title = "Random Forest Regression")
Here we find that RMSE(Train) = 1110.446 and RMSE(Test) = 1126.359. Random Forest models do have a tendency to overfit; we can probably improve the model further by tuning the hyperparameters, and this too can be accomplished using workflows together with the dials and tune packages.
Refining Model Development – dials
Tuning models by adjusting their hyperparameters is an important aspect of model development.
Enter the dials package and the function grid_regular(), which sets up a tibble with every combination of the hyperparameters that require tuning. dials::grid_regular() works similarly to tidyr::expand_grid().
Below, we re-establish a random forest model specification, this time setting trees to 500 and both mtry and min_n to tune(). mtry is the number of predictors sampled at each split, and min_n is the minimum number of observations required to split a node.
A tuning workflow is generated just as before: we pipe together a recipe and a model specification.
Furthermore, we create a 4-fold cross-validation object (vfold_cv).
Lastly, we generate a tuning results object by calling tune_grid() (a grid search), which iterates through each of the combinations in rf_grid, fitting tune_wf with 4-fold cross-validation.
set.seed(256)
rf_mod <-
rand_forest(trees = 500,
mtry = tune(),
min_n = tune()) %>%
set_engine("ranger", importance = "impurity", num.threads = 12) %>%
set_mode("regression")
#Establish Model Flow
tune_wf <- workflow() %>%
add_recipe(mart_recipe) %>%
add_model(rf_mod)
#Generate grid to perform grid search for hyperparameter optimisation
rf_grid <- grid_regular(mtry(range = c(6,10)),
min_n(range = c(14,20)),
levels = c(10,9))
# 4-fold Cross Validation Stratified by Item_Outlet_Sales
folds <- vfold_cv(train, v = 4, strata = Item_Outlet_Sales)
#Train and evaluate all combinations of hyperparameters specified in rf_grid
doParallel::registerDoParallel(cores = 12)
rf_grid_search <- tune_grid(
tune_wf,
resamples = folds,
grid = rf_grid)
Please note, we've utilised parallel processing to use more of the workstation's compute power, as grid searches can be computationally demanding.
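Once the search has finished, the parallel backend registered above can be released; a small housekeeping sketch:
#Release the workers registered by registerDoParallel()
doParallel::stopImplicitCluster()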
We can visualise the results of this computation to see which combination of hyperparameters provides the lowest RMSE.
rf_grid_search %>%
collect_metrics() %>%
filter(.metric == "rmse") %>%
select(mean, min_n, mtry) %>%
filter(mtry > 4) %>%
ggplot(aes(min_n, mean, color = as_factor(mtry))) +
geom_point()+
geom_line()+
scale_color_viridis_d() +
theme_minimal()+
scale_x_continuous(breaks = pretty_breaks())+
theme(legend.position = "bottom") +
labs(x = "Minimum Number of Observations to Split Node", y = "RMSE", title = "Grid Search Results for Random Forest", color = "Number of Predictors Sampled at Each Split")
Calling show_best() provides more detail about the best-performing combinations from the grid search.
rf_grid_search %>% show_best(metric = "rmse")
We can use the results of the grid search to update the random forest model specifications.
rf_best_rmse <- select_best(rf_grid_search, metric = "rmse")
final_rf <- finalize_model(
rf_mod,
rf_best_rmse
)
final_rf
Using the vip package, we can fit final_rf and highlight the top 10 most important features.
final_rf %>%
  fit(Item_Outlet_Sales ~ ., data = bake(prep(mart_recipe), training(mart_split))) %>%
  vip(geom = "col", num_features = 10) +
  theme_minimal()
What this really highlights is the importance of feature engineering: several of the top features were created during our cleaning of the data or within the recipe.
Finally, let's evaluate final_rf.
final_wf <-
workflow() %>%
add_recipe(mart_recipe) %>%
add_model(final_rf)
final_rf_res <-
fit(final_wf, training(mart_split)) %>%
predict(new_data = testing(mart_split)) %>%
bind_cols(mart_test %>% select(Item_Outlet_Sales))
final_rf_res %>% ggplot(aes(x= Item_Outlet_Sales, y = .pred)) +
geom_abline(lty = 2)+
geom_point(alpha = 0.5)+
theme_minimal()+
labs(x="Item Outlet Sales", y= "Predicted Item Outlet Sales", title = "Tuned Random Forest Regression")
metrics(final_rf_res, truth = Item_Outlet_Sales, estimate = .pred)
Comparing RMSE(Test) = 1120.365 to RMSE(Train) = 1106.629, the gap between training and test error has shrunk by roughly 2 RMSE units, so hyperparameter tuning has slightly reduced overfitting and produced a marginally better model.
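For completeness, one way the training-set figure could be computed is sketched below (the exact value depends on how the final workflow was refit):
#Training-set error for the tuned workflow, using the same metric set
fit(final_wf, training(mart_split)) %>%
  predict(new_data = training(mart_split)) %>%
  bind_cols(training(mart_split) %>% select(Item_Outlet_Sales)) %>%
  metrics(truth = Item_Outlet_Sales, estimate = .pred)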
Conclusion
This project has sought to exemplify the ease with which the tidymodels meta-package can be used to generate machine learning workflows, making each step pipeable, fitting two model types and evaluating their performance against a test set. If you wanted to take this one step further, it is very easy to train an XGBoost model; I haven't included this for the sake of brevity.
One of the key takeaways when comparing tidymodels with, say, scikit-learn is that every step in the tidymodels process takes tibble dataframes and doesn't necessitate transforming dataframes into matrices. Furthermore, because everything works through tibbles, visual diagnostics of model performance and hyperparameter tuning are, as I've demonstrated, easier to produce and understand.
I highly recommend that people who are interested in learning more about tidymodels seek out the fantastic Julia Silge at https://juliasilge.com/, and her collaboration with Max Kuhn on the highly anticipated Tidy Modeling with R (which you can preview as it's being written at https://www.tmwr.org/).