Murray Gillin, Author at Towards Data Science

HELP! We've Been HECS'd

A statistical simulation of the Universities Accord recommendations

Introduction

In Australia, students can afford tertiary education through a government loan known as the Higher Education Loan Program (HELP). To preserve the real value of the loan, balances are indexed annually in line with the Consumer Price Index. Graduates begin repaying the loan when their post-tax income exceeds ~$51k, with a stepped repayment rate that reaches a maximum of 10% for incomes above ~$151k. The most recent indexation rate was a record 7.1%, the highest since the 1990s, which made many people far more aware of how these student debts quietly grow.

I’m one of these people, completing a Bachelor with Honours when leaving school, and then later retraining with a Master’s program during COVID. For most of my 20s I wasn’t earning enough to make the compulsory contributions threshold, so in time the debt crept up. The addition of a Master’s degree (~40k) then pushed the debt to a new level. Whilst now I’m taking steps to remove the debt and recover the non-productive deductions that are being made, you can imagine this is a burden on many graduates.

The Universities Accord Review has proposed the following changes to how these debts are repaid (Recommendation 16).

That to reduce the long-term financial costs of studying for students, the Australian Government make student contributions fairer and better reflective of lifetime benefits that students will gain from studying and reduce the burden of HELP loans by introducing fairer and simpler indexation and repayment arrangements.

This should involve:

A. reducing student contributions to address the most significant impacts of the Job-ready Graduates (JRG) package starting with students in humanities, other society and culture, communications and human movement, and moving toward a student contribution system based on projected potential lifetime earnings

B. reducing the financial burden of repayment on low-income earners and limiting disincentives to work additional hours by moving to a system of HELP repayment based on marginal rates

C. reducing repayment times by changing the timing of indexation for HELP loans so that amounts withheld for compulsory repayment can be accounted for before indexation is applied

D. ensuring that growth in HELP loans does not outpace growth in wages by setting the HELP indexation rate to the lower of the Consumer Price Index (CPI) and the Wage Price Index (WPI)

E. reviewing bank lending practices to ensure banks recognise that HELP loans are not like other types of loans and are not treated in a way that unduly limits peoples’ borrowing capacity for home loans.

Let’s unpack these recommendations, specifically those relating to the indexation and repayment of the student loan, and frame these as questions of statistics.

Point B indicates that the current repayment rates create possible disincentives for low-income earners. While this is a compassionate view, it neglects the long-term impact of a protracted debt position. The debt compounds each year, independently of an individual's capacity to pay. We'll examine the impact of repayment rates on the year of clearance and review how other countries index or apply interest to student loans.

Point C is an interesting case and at face value seems fair: as with any other debt, deductions made from post-tax salaries should be credited against the principal before indexation is applied. We can run simulations to this effect and calculate the change in the distribution of debt clearance years across students.

Point D is a curious attempt to limit debt growth to the lower of CPI or WPI. Here we can do a simple statistical exercise and measure, over the last decade of published data, how often the CPI is lower than the WPI.

To address these questions we will do the following:

  • Will students benefit from moving between CPI and WPI to minimise the indexation rate?
  • Simulate the trajectory of student loans under three scenarios: the current state, the proposed future state noted above, and a flat 10% repayment rate applied to all income levels above the current minimum threshold. This exercise will then allow us to assess whether the distribution of clearance years differs between scenarios.

Will students benefit from applying the lower of CPI and WPI?

This is one of the key recommendations designed to reduce indexation's impact on students. It is logically flawed because salary growth and salary level are independent of the debt level. However, let us take it at face value and evaluate whether indexing to the lower of the two would meaningfully slow the growth of debt versus CPI alone. Figures for WPI and CPI have been sourced and reviewed, both made available under CC BY 4.0 thanks to the Australian Bureau of Statistics.

library(tidyverse) # dplyr, tidyr, readr, ggplot2 and friends
library(brms)      # Bayesian regression models
library(tidybayes) # add_epred_draws(), compare_levels()
library(ggdist)    # theme_ggdist(), stat_halfeye()

wpi_df <- readr::read_csv("All sector WPI, quarterly and annual movement (%), seasonally adjusted (a).csv", 
    col_types = cols(`Quarterly (%)` = col_skip()), 
    skip = 1) %>% 
    rename(month_year = ...1,
           value = `Annual (%)`) %>% 
    drop_na() %>% 
    mutate(month_year = lubridate::my(month_year),
           rate = 'wpi')

cpi_df <- read_csv("All groups CPI, Australia, quarterly and annual movement (%).csv", 
    col_types = cols(`Change from previous quarter (%)` = col_skip()), 
    skip = 1) %>% 
    rename(month_year = ...1,
           value = `Annual change (%)`) %>% 
    drop_na() %>% 
    mutate(month_year = lubridate::my(month_year),
           rate = 'cpi')

rates_df <- bind_rows(cpi_df, wpi_df) %>% 
    filter(month_year >= '2014-03-01' & month_year <= '2023-12-01') %>% 
  mutate(rate = as_factor(rate))

rates_df %>% 
  ggplot(aes(month_year, value, color = rate)) +
    geom_point() +
    geom_line() +
    theme_ggdist() +
    scale_color_viridis_d(begin = 0.3, end = 0.7) +
    labs(x = 'Date', y = 'Index Value', color = 'Rate', title = 'Comparison of Annual CPI and WPI')
Comparison of Annual CPI and WPI by Quarter (Image by Author)

We can see that the variance of CPI is greater than that of WPI, but over time is there a real difference between the means of the two underlying distributions?
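
As a quick descriptive check before any modelling, we can also count how often annual WPI came in below annual CPI across the window (a small sketch reusing the rates_df object created above):

rates_df %>% 
  pivot_wider(names_from = rate, values_from = value) %>% 
  summarise(quarters = n(),
            share_wpi_below_cpi = mean(wpi < cpi, na.rm = TRUE))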

We fit a Bayesian model to 10 years of historical data, take draws from the expected values of the posterior distributions then perform a difference in means assessment, using Bayesian methods.

indices <- 
  brm(
    bf(value ~ rate + 0,
       sigma ~ rate + 0),
    data = rates_df,
    prior = c(prior(normal(2, 2), class = 'b')),
    family = gaussian, 
    iter = 2000, chains = 4, seed = 246, cores = 4, sample_prior = 'yes'
  )

new_df <- tibble(rate = c('wpi', 'cpi'))

new_df %>% 
  add_epred_draws(indices) %>% 
  compare_levels(.epred, rate, comparison = list(c('cpi', 'wpi'))) %>% 
  ggplot(aes(.epred, fill = after_stat(x > 0))) +
    stat_halfeye() +
    geom_vline(xintercept = 0, lty = 2) +
    theme_ggdist() +
    scale_fill_manual(values = c("gray80", "skyblue")) +
    labs(y = 'Density', x = 'Difference in Posterior Means', title = 'Difference in Posterior Means of WPI and CPI',
         subtitle = "80% of Density is Greater than 0\nApplying ROPE of 10%, Difference is Negligible", fill = 'Value Greater than 0')
Difference in Posterior Mean Distribution of CPI and WPI (Image by Author)

Based on the posterior distributions for WPI and CPI, the mean CPI exceeds the mean WPI by an average of 0.3bps. 80% of the posterior distribution of differences is greater than 0, and within a ROPE of 10% we can consider this difference negligible. Indexing to the lower of WPI and CPI will at best marginally soften the impact of inflation, but in the long term it won't materially help students overcome these debts.
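
For transparency, the share of the posterior difference sitting inside a ROPE can be computed directly from the draws. A sketch reusing the objects above; the ±0.2 band is illustrative (in the same units as the index values), set it to whatever you consider practically equivalent:

new_df %>% 
  add_epred_draws(indices) %>% 
  compare_levels(.epred, rate, comparison = list(c('cpi', 'wpi'))) %>% 
  ungroup() %>% 
  summarise(p_gt_zero = mean(.epred > 0),
            p_in_rope = mean(abs(.epred) < 0.2)) # illustrative ROPE half-width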

Simulating Indexation and Repayment Scenarios of Students

Let us continue to assess the other elements of the recommendation, namely indexation timing and compulsory repayment rates.

Outline of Key Simulated Assumptions

To simulate student outcomes we need reasonable prior knowledge of the key variables.

To simulate graduate salaries we've taken a fairly broad view, using the median graduate salary of $68,000 and applying the lognormal distribution featured below. This way we capture most graduate salaries around this point, while allowing for the much higher starting salaries available in some sectors. Similarly, for debt, we've taken the current average student debt and applied a reasonably right-skewed distribution to capture the broad range of debts.

For the indexation rate, we've simply taken the mean annual CPI figure across the last 10 years of quarterly data and assumed it follows a normal distribution, as above. Similarly, we've taken a conservative yet positive view of salary growth, assuming an average annual increase of 3% following a log-normal distribution that allows graduates to climb faster through promotions or new job opportunities.
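
Before plugging these assumptions into the simulation, it is worth sanity-checking what they imply. The sketch below draws from the same distributions parameterised in the simulation code that follows and summarises a few quantiles (assumption_draws is simply a throwaway name introduced here):

set.seed(246)

assumption_draws <- tibble(
  starting_salary = rlnorm(1e5, meanlog = log(68000), sdlog = log(1.34)),
  starting_debt   = rlnorm(1e5, meanlog = 10.2, sdlog = 0.5),
  indexation_rate = rnorm(1e5, mean = 0.027, sd = 0.012),
  salary_growth   = rlnorm(1e5, meanlog = -3.5, sdlog = 0.6) + 1
)

assumption_draws %>% 
  pivot_longer(everything(), names_to = 'variable') %>% 
  group_by(variable) %>% 
  summarise(p10 = quantile(value, 0.1),
            median = median(value),
            p90 = quantile(value, 0.9))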

Simulated Variable Distributions (Image by Author)
index <- 1:50000 #simulate 50000 students

year <- 0:19 #over 20 years

calculate_salary <- function(previous, basevalue, multiplier) {
  coalesce(basevalue, multiplier * previous)
}

set.seed(246)

base_df <- 
  crossing(index, year) %>% 
  group_by(index) %>% #salary and growth varies by individual
  mutate(salary_0 = rlnorm(1, mean = log(68000), sdlog = log(1.34)),
         debt_0 =  rlnorm(1, meanlog = 10.2, sdlog = 0.5)) %>% 
  group_by(year) %>% #indexation rate applies uniformly across all indices each year
  mutate(indexation_rate = rnorm(1, mean = 0.027, sd = 0.012),
         indexation_rate = round(indexation_rate, 3)) %>% 
  group_by(index, year) %>% 
  mutate(salary_growth = rlnorm(1, meanlog = -3.5, sdlog = 0.6) + 1,
         salary_growth = round(salary_growth, 3)) %>% 
  group_by(index) %>% 
  mutate(salary_0 = if_else(year > 0, NA, salary_0),
         salary_1 = accumulate2(salary_0, salary_growth[-1], calculate_salary),
         salary_1 = case_when(salary_1 < 18200 ~ salary_1 * (1-0), #calculate post-tax income
           salary_1 >= 18201 & salary_1 <= 45000 ~  salary_1 - 0 - (salary_1-18200)*(0.19),
           salary_1 >= 45001 & salary_1 <= 120000 ~ salary_1 - 5092 - (salary_1-45000)*(0.325),
           salary_1 >= 120001 & salary_1 <= 180000 ~ salary_1 - 29467 - (salary_1-120000)*(0.37),
           salary_1 >= 180001 ~ salary_1 - 51667 - (salary_1-180000)*(0.45)
         ))

The code above sets up the simulation of 50,000 students over 20 years. We use our custom function calculate_salary with purrr::accumulate2 to iteratively apply the sampled growth rates to each salary, and then calculate post-tax income. One point to note: each candidate draws from their own growth distribution, but indexation applies to all students at the same rate each year.
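
To make the accumulate2() mechanics concrete, here is a toy run with hypothetical numbers (not drawn from the simulation), using the calculate_salary() helper defined above:

accumulate2(
  .x = c(70000, NA, NA, NA),   # year-0 salary, then NA placeholders for later years
  .y = c(1.03, 1.05, 1.02),    # growth multipliers for years 1-3
  .f = calculate_salary
)
# 70000, 72100, 75705, 77219.1 - each NA placeholder is replaced by
# multiplier * previous via coalesce()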

# Current state: indexation is applied to the full principal before the
# compulsory repayment is credited
calculate_remaining_debt <- function(principal, payment, interest_rate) {
  interest = principal * interest_rate
  remaining_debt = principal + interest - payment
  remaining_debt = ifelse(remaining_debt < 0, 0, remaining_debt)
  remaining_debt
}
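
# For intuition (hypothetical numbers): a $25,000 debt indexed at 4.7% with a
# $2,000 compulsory repayment leaves 25000 * 1.047 - 2000 = 24175 under this
# current-state rule, because indexation hits the full balance before the
# repayment is credited
calculate_remaining_debt(25000, 2000, 0.047)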

set.seed(246)

df <- base_df %>% mutate(
         repayment_rate = case_when(
           salary_1 < 51550 ~ 0.0, # Repayment Rates Post-Tax Income
           salary_1 >= 51550 & salary_1 <= 59518 ~ 0.01,
           salary_1 >= 59519 & salary_1 <= 63089 ~ 0.02,
           salary_1 >= 63090 & salary_1 <= 66875 ~ 0.025,
           salary_1 >= 66876 & salary_1 <= 70888 ~ 0.03,
           salary_1 >= 70889 & salary_1 <= 75140 ~ 0.035,
           salary_1 >= 75141 & salary_1 <= 79649 ~ 0.04,
           salary_1 >= 79650 & salary_1 <= 84429 ~ 0.045,
           salary_1 >= 84430 & salary_1 <= 89494 ~ 0.05,
           salary_1 >= 89495 & salary_1 <= 94865 ~ 0.055,
           salary_1 >= 94866 & salary_1 <= 100557 ~ 0.06,
           salary_1 >= 100558 & salary_1 <= 106590 ~ 0.065,
           salary_1 >= 106591 & salary_1 <= 112985 ~ 0.07,
           salary_1 >= 112986 & salary_1 <= 119764 ~ 0.075,
           salary_1 >= 119765 & salary_1 <= 126950 ~ 0.08,
           salary_1 >= 126951 & salary_1 <= 134568 ~ 0.085,
           salary_1 >= 134569 & salary_1 <= 142642 ~ 0.09,
           salary_1 >= 142643 & salary_1 <= 151200 ~ 0.095,
           salary_1 > 151200 ~ 0.1),
         repayment = salary_1 * repayment_rate,
         debt_1 = accumulate(2:n(), .init = first(debt_0),
                             ~ calculate_remaining_debt(.x, repayment[.y], indexation_rate[.y])),
         repayment = if_else(debt_1 == 0, 0, repayment),
         debt_paid = if_else(debt_1 == 0, 'y', 'n'))

df <- 
df %>% 
  group_by(index) %>% 
  mutate(clearance_year = if_else(lag(debt_1, default = first(debt_1)) > 0 & debt_1 == 0, 1, 0),
         clearance_cum = cumsum(clearance_year)) %>% 
  filter(clearance_cum == 0 | clearance_year == 1) %>% 
  select(1:11) %>%
  mutate(is_paid = if_else(debt_paid == 'y', 1, 0),
         debt = round(debt_1, 2),
         salary = round(salary_1, 2),
         scenario = 'INDEXATION PRE PAYMENT; PROGRESSIVE PAYMENT RATE',
         scenario_l = 'A',
         group = if_else(max(is_paid) == 1, 'paid', 'unpaid'))

# Repeat Again with Different Calculation of Debt, Indexation Post Repayment

# Proposed change: the compulsory repayment is credited first and indexation
# is then applied to the reduced balance
calculate_remaining_debt <- function(principal, payment, interest_rate) {
  remaining_debt = principal - payment + (principal - payment) * interest_rate
  remaining_debt = ifelse(remaining_debt < 0, 0, remaining_debt)
  remaining_debt
}
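
# Note: the original post jumps straight to re-summarising df2 below without
# showing the step that rebuilds the trajectories under this new rule. A
# minimal sketch of the presumably intended step, mirroring the construction
# of df above (the progressive_rate() helper is introduced here purely for
# brevity and is not part of the original code):

progressive_rate <- function(salary) {
  thresholds <- c(51550, 59519, 63090, 66876, 70889, 75141, 79650, 84430, 89495,
                  94866, 100558, 106591, 112986, 119765, 126951, 134569, 142643, 151201)
  rates <- c(0, 0.01, 0.02, 0.025, 0.03, 0.035, 0.04, 0.045, 0.05, 0.055,
             0.06, 0.065, 0.07, 0.075, 0.08, 0.085, 0.09, 0.095, 0.1)
  rates[findInterval(salary, thresholds) + 1]
}

df2 <- base_df %>% 
  mutate(repayment_rate = progressive_rate(salary_1),
         repayment = salary_1 * repayment_rate,
         debt_1 = accumulate(2:n(), .init = first(debt_0),
                             ~ calculate_remaining_debt(.x, repayment[.y], indexation_rate[.y])),
         repayment = if_else(debt_1 == 0, 0, repayment),
         debt_paid = if_else(debt_1 == 0, 'y', 'n'))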

df2 <- 
df2 %>% 
  group_by(index) %>% 
  mutate(clearance_year = if_else(lag(debt_1, default = first(debt_1)) > 0 & debt_1 == 0, 1, 0),
         clearance_cum = cumsum(clearance_year)) %>% 
  filter(clearance_cum == 0 | clearance_year == 1) %>% 
  select(1:11) %>%
  mutate(is_paid = if_else(debt_paid == 'y', 1, 0),
         debt = round(debt_1, 2),
         salary = round(salary_1, 2),
         scenario = 'INDEXATION AFTER PAYMENT; PROGRESSIVE PAYMENT RATE',
         scenario_l = 'B',
         group = if_else(max(is_paid) == 1, 'paid', 'unpaid'))

# Last Scenario Applies a Flat Rate of 10% Repayment Above Minimum Threshold

# Scenario C is labelled "indexation pre payment", so we restore the
# current-state debt rule that was overwritten above for scenario B
calculate_remaining_debt <- function(principal, payment, interest_rate) {
  interest = principal * interest_rate
  remaining_debt = principal + interest - payment
  ifelse(remaining_debt < 0, 0, remaining_debt)
}

df3 <- base_df %>% 
  mutate(repayment_rate = case_when(
           salary_1 < 51550 ~ 0.0,
           salary_1 >= 51550 ~ 0.1), 
         repayment = salary_1 * repayment_rate,
         debt_1 = accumulate(2:n(), .init = first(debt_0),
                             ~ calculate_remaining_debt(.x, repayment[.y], indexation_rate[.y])),
         repayment = if_else(debt_1 == 0, 0, repayment),
         debt_paid = if_else(debt_1 == 0, 'y', 'n'))

df3 <- 
df3 %>% 
  group_by(index) %>% 
  mutate(clearance_year = if_else(lag(debt_1, default = first(debt_1)) > 0 & debt_1 == 0, 1, 0),
         clearance_cum = cumsum(clearance_year)) %>% 
  filter(clearance_cum == 0 | clearance_year == 1) %>% 
  select(1:11) %>%
  mutate(is_paid = if_else(debt_paid == 'y', 1, 0),
         debt = round(debt_1, 2),
         salary = round(salary_1, 2),
         scenario = 'INDEXATION PRE PAYMENT; FLAT PAYMENT RATE',
         scenario_l = 'C',
         group = if_else(max(is_paid) == 1, 'paid', 'unpaid'))
Sample Dataframe (Image by Author)

We then combine the three scenarios into a single view for later analysis. Below we visualise the first 9 students in our dataset.

df4 <- bind_rows(df, df2, df3)

df4 %>% 
  filter(index <= 9) %>% 
  ggplot(aes(year, debt, color = scenario)) +
    geom_point() +
    geom_line() +
    facet_wrap(~index) +
    theme_ggdist() +
    scale_color_brewer(palette = "Dark2") +
    theme(legend.position = 'bottom', legend.direction = 'vertical') +
    labs(x = 'Year', y = 'Debt', title = 'Scenarios of First Nine Students')

Let’s unpack, we’ve simulated three scenarios noted above. The reason we’ve simulated a 10% repayment rate, is because this is what New Zealand do for student debt, starting from ~ NZD 25,000. Some might say this is quite punitive, however, NZ does not charge interest or apply indexation unless graduates leave the country to work. I thought this would be a reasonable scenario to scope out. The above visualization shows the debt trajectory for each scenario for the first 9 students in our dataset – we’ve simulated 50,000 students in total.

Index 4 is a great example of how debt indexation proceeds independently of salary. In the current state, no repayments were made against the debt for the first 14 years; it wasn't until year 15 that the graduate's post-tax salary rose above the ~$51,000 threshold for compulsory payments. In that period the debt grew by nearly $22,000, and within the 20-year horizon it is never cleared.

The low progressive rates mean this student may never pay off the debt, as indexation will continue to outpace the current rate of repayment each year. Which is the better outcome: a graduate who pays off their student debt, or one who carries this burden for the rest of their professional life, with increments scraped from their salary that are worth less than the effect of indexation?

I think this highlights a key missing point of the review: what problem(s) are we solving for? In my opinion, the HELP system must enable graduates to pay off their debt as soon as reasonably possible.

Modelling Years to Debt Clearance

Now that we have a dataset with simulations for all students, we can assess the distribution in years to debt clearance across each scenario.

df4 %>% 
  filter(group == 'paid') %>% 
  group_by(index, scenario) %>%
  summarise(debt_clearance = max(year)) %>% 
  ggplot(aes(debt_clearance, fill = scenario)) +
    geom_histogram(binwidth = 1) +
    facet_grid(~scenario) +
    theme_ggdist() +
    scale_fill_brewer(palette = "Dark2", aesthetics = c('color', 'fill')) +
    theme(legend.position = 'none') +
    labs(x = 'Years to Debt Clearance', y = 'Count', title = 'Distribution of Years to Debt Clearance by Scenario')
Distribution of Years to Debt Clearance by Scenario (Image by Author)

What is becoming quite clear is that changing the indexation point makes little difference; however, setting a repayment rate well above the indexation rate enables students to pay their debt off much sooner. Let's complete a Bayesian ANOVA to get a sense of the differences in posterior mean years. Given we are dealing with count data, below we fit three models, varying the likelihood and the number of parameters.

set.seed(246)

# Take Sample of Total Dataframe
count_df_sample <- df4 %>%
  filter(group == 'paid') %>% 
  group_by(index, scenario, scenario_l) %>%
  summarise(debt_clearance = max(year))  %>% 
  group_by(scenario_l) %>% 
  slice_sample(n = 4000)

# Poisson Likelihood
m1a <- brm(
  debt_clearance ~ scenario_l + 0,
  data = count_df_sample,
  family = poisson,
  prior = c(prior(gamma(9, 1), class = 'b', lb = 0)),
  chains = 4, iter = 2000, cores = 4, threads = threading(2)
) %>% 
  add_criterion(c('loo', 'waic'),  moment_match = T)

# Negative Binomial Likelihood w/ Pooling
m1b <- brm(
  debt_clearance ~ scenario_l + 0,
  data = count_df_sample,
  family = negbinomial,
  prior = c(prior(gamma(9, 1), class = 'b', lb = 0)),
  chains = 4, iter = 2000, cores = 4, threads = threading(2)
) %>% 
  add_criterion(c('loo', 'waic'),  moment_match = T)

# Establish Prior for Non-Pooling Negative Binomial 

prior <- get_prior(
  bf(debt_clearance ~ scenario_l + 0,
  shape  ~ scenario_l + 0),
  data = count_df_sample,
  family = negbinomial,
  prior = c(prior(gamma(9, 1), class = 'b', lb = 0))) %>% 
  as_tibble() %>% 
  mutate(prior = if_else(class == 'b' & dpar == 'shape' & coef == '', 'gamma(9, 1)',  prior),
         lb = if_else(class == 'b' & dpar == 'shape' & coef == '', '0',  lb),
         prior = if_else(class == 'b' & dpar == '' & coef == '', 'gamma(6, 1)', prior),
         lb = if_else(class == 'b' & dpar == '' & coef == '', '0', lb)) %>% 
  as.brmsprior()

# Negative Binomial Likelihood w/o Pooling

m1c <- brm(
  bf(debt_clearance ~ scenario_l + 0,
  shape  ~ scenario_l + 0),
  data = count_df_sample,
  family = negbinomial,
  prior = prior,
  chains = 4, iter = 2000, cores = 4, threads = threading(2)
) %>% 
  add_criterion(c('loo', 'waic'),  moment_match = T)

loo_compare(m1a, m1b, m1c) %>% print(simplify = F)
Output from LOO Comparison (Image by Author)

Of the three models, the negative binomial model without pooling shows the best out-of-sample predictive performance, so we'll use it for the ANOVA. For reference, scenario A is indexation pre payment with progressive rates, B is indexation post payment with progressive rates, and C is indexation pre payment with a flat rate.

new_df <- tibble(scenario_l = c('A', 'B', 'C'))

m1c %>% 
  tidybayes::epred_draws(new_df, ndraws = 4000, seed = 111) %>% 
  compare_levels(.epred, scenario_l) %>% 
  ggplot(aes(.epred, scenario_l, fill = scenario_l)) + 
    stat_dist_halfeye() +
    geom_vline(xintercept = 0, linetype = 'dashed') +
    theme_ggdist() +
    scale_fill_brewer(palette = "Dark2") +
    labs(x = 'Difference in Posterior Means', y = 'Scenario', title = 'Differences in Posterior Mean Between Each Scenario', fill = 'Scenario')

The above tells us two things. Firstly, changing the timing of indexation has a negligible impact on the maturity of HELP debts. Secondly, a 10% flat rate of repayment decreases the time to pay off the debt by an average of 4 years.

It also follows that the shorter the life of the debt, the smaller the impact of effective compounding on the debt value, and the sooner graduates regain the portion of their salaries otherwise directed to repayments – an effective post-tax pay increase.

Concluding Remarks

This analysis has quantitatively evaluated the recommendations put forth by the Universities Accord Review in a wholly transparent and repeatable manner.

Our analysis has sought to provide what should be considered a best-case view of the alternative approaches relative to the status quo. Our simulations take optimistic, but not unrealistic, views of debt and salary growth over 20 years.

Firstly, we assessed whether swapping between CPI and WPI would be worthwhile in the long term. Based on our modelling, we expect a negligible long-term difference from shifting between the two indices for the calculation of indexation, amounting to a mean difference of 0.3bps in favour of CPI.

Secondly, we simulated debt and salary trajectories for 50,000 graduates over 20 years under the three scenarios above, moving the indexation point and increasing the repayment rate. We then used this data to assess differences in mean years to debt clearance and found that changing the indexation point has no tangible effect, while increasing the repayment rate to a flat 10% represents a viable option for enabling graduates to pay off their debt faster. We can express these statements with respect to our DAG: by increasing repayments we observed a faster decrease in debt levels. Further, the impact of indexation is independent of salary, so any attempt to mitigate its impact based on income levels ignores the relationship below.

Causal Diagram for HELP Debt Maintenance (Image by Author)

It becomes a case of more sacrifice up front, but over time graduates avoid the compounding effect that indexation has on principal debt levels. This scenario replicates, to an extent, what occurs in NZ: a flat, broad rate that applies from an even lower minimum threshold.

It is incumbent on the authors of the Universities Accord Review to discuss the implications of their recommendations. The report makes implied causal statements without validation or appropriate simulation. You cannot in one sentence seek to alleviate the financial pressures of expensive tertiary education, and then fail to offer costed solutions for how students, through compulsory and voluntary payments, can clear their debts in a more timely fashion.

The Federal Government, to their credit, have said HELP debts are under review, but given the recommendations in the report, and based on the simulations we've conducted, I doubt that any of these will have a material benefit in improving graduate financial outcomes.

How Does Socio-Educational Index Influence School Leaver Outcomes? – A Bayesian Analysis with R and brms

Following the very positive reception of the previous article, which sought to understand whether there is a proportional difference in post-secondary outcomes between school sectors, let's follow up with a little more. We stated that the modelling was not intended to be causal and was merely descriptive, prefaced by the acknowledgement that there are numerous factors one might consider in a causal model. It so happens a proxy measure exists: ICSEA, or the Index of Community Socio-Educational Advantage.

A Bayesian Comparison of School Leaver Outcomes with R and brms

Continuing on from the previous article, we will explore whether ICSEA has a causal relationship with proportional outcomes across sectors, amounting to a Bayesian ANCOVA workflow.

Photo by Vasily Koloda on Unsplash

Background

Before jumping into the modelling, let's develop our causal model and understanding. ICSEA is a scale that identifies the socio-educational advantage of a school. It is calculated by ACARA (the Australian Curriculum, Assessment and Reporting Authority) and factors in parents' educational background and occupation, the Aboriginal status of students, and the geographic location of the school. The average ICSEA value is set to 1000 with a standard deviation of 100, so higher ICSEA values indicate schools with greater socio-educational advantage.

There are innumerable factors that can affect student performance and outcomes; we've simplified them to something more tangible below. Starting on the left-hand side, we can reasonably expect a parent's level of education to influence their child's personal ambitions; parental education will also influence salary, and hence which suburb the family can afford to live in, putting them proximal to schools within certain sectors. We will reasonably assume that affluent areas have a greater mix of Independent and Catholic schools relative to Government schools. That is a lot of fields and data points to collate, summarise and build regression models over to test our causal assumptions. Fortunately ICSEA is our proxy measure for most of these input variables, and thus on the right-hand side we have our simplified causal model.

Causal DAGs for Student Outcomes - Image by Author

Academic literature has extensively explored this. One such review paper by Alyahyan & Düştegör, linked below, summarises an array of possible inputs that may influence academic success. We can see this is an incredibly complex set of relationships spanning psychological, sociological, economic and environmental factors.

Predicting academic success in higher education: literature review and best practices …

Summary of Factors Potentially Impacting the Prediction of Students' Academic Success - Image Taken From Ref [1] Used Under CC BY 4.0

Let’s not muddy the waters, because success is a subjective term derived from some preferred outcome.

Our research question is about understanding if there is a difference in outcomes across ICSEA levels between sectors, a subtle but important distinction.

Load Libraries and Datasets

ACARA, the Australian Curriculum, Assessment and Reporting Authority, publishes ICSEA and other performance indexes for all schools nationally here. This dataset is made available under CC BY 4.0. As before, we'll load data from the Victorian State Government's On Track Survey of post-school leaver outcomes for students graduating in 2021, also available under CC BY 4.0.

Below we do some rudimentary cleaning and then join our two datasets, profile and df_long, to create df_longb, which contains sector, outcome and proportions as before, as well as the corresponding ICSEA score for each school. Two schools didn't have ICSEA scores, so they have been removed.

library(tidyverse)
library(brms)
library(tidybayes)
library(readxl)
df <- read_excel("DestinationData2022.xlsx", 
    sheet = "SCHOOL PUBLICATION TABLE 2022", 
    skip = 3)
colnames(df) <- c('vcaa_code', 'school_name', 'sector', 'locality', 'total_completed_year12', 'on_track_consenters', 'respondants', 'bachelors', 'deferred', 'tafe', 'apprentice_trainee', 'employed', 'looking_for_work', 'other') 

df <- drop_na(df)

df2 <- df |> 
  mutate(across(5:14, ~ as.numeric(.x)), #convert all numeric fields captured as characters
         across(8:14, ~ .x/100 * respondants), #calculate counts from proportions
         across(8:14, ~ round(.x, 0)), #round to whole integers
         respondants = bachelors + deferred + tafe + apprentice_trainee + employed + looking_for_work + other) |> #recalculate total respondents
  filter(sector != 'A') #remove A schools (low volume)

df_long <- df2 |>
  select(sector, school_name, 7:14) |> 
  pivot_longer(c(-1, -2, -3), names_to = 'outcome', values_to = 'proportion') |> 
  mutate(proportion = proportion/respondants)

profile <- read_excel("school-profile-2022.xlsx", 
    sheet = "SchoolProfile 2022") |> #From ACARA
  filter(State == 'VIC' & `School Type` %in% c('Secondary', 'Combined')) |> 
  select(`School Sector`, `School Name`, ICSEA, `ICSEA Percentile`) |> 
  drop_na() |> 
  mutate(`School Sector` = substr(`School Sector`, 1, 1),
         ICSEA_s = ((ICSEA - min(ICSEA-1)) / (max(ICSEA+1) - min(ICSEA-1))))

colnames(profile) <- c('sector', 'school_name', 'ICSEA', 'Percentile', 'ICSEA_s')

df_longb <-
  df_long |> 
  mutate(school_name = str_to_upper(school_name)) |> 
  left_join(profile |> mutate(school_name = str_to_upper(school_name))) |> 
  drop_na()

Exploratory Data Analysis

Our datasets are prepped and ready to go, so let's visualise the features of interest.

profile |> 
  ggplot(aes(y = sector, x = ICSEA, fill = sector)) +
    stat_halfeye(alpha = 0.5) +
    theme_ggdist() +
    scale_fill_viridis_d(begin = 0.4, end = 0.8) +
    labs(title = 'Distribution of ICSEA Values By Sector', y = 'Sector')
Distribution of ICSEA Scores by School Sector - Image by Author

Interestingly, and perhaps unsurprisingly, public schools have the greatest variance in ICSEA, while independent schools have the highest median.

Bayesian ANOVA – Estimating Difference in Posterior Means of ICSEA

Let’s first examine if the difference between sectors amongst ICSEA is real, and like the previous article will conduct a Bayesian ANOVA. Also this begins to address the first part of our DAG, the relationship between ICSEA and Sector. We’ve instructed the model to provide parameter values for mu and phi for each level of sector.

Mathematical Form for Model m4e - Image by Author
m4e <- 
  brm(bf(ICSEA_s ~ sector + 0,
         phi ~ sector + 0),
      family = Beta,
      data = profile,
      prior = c(prior(normal(-1, 2), class = 'b'),
                prior(gamma(6, 2), class = b, dpar = phi, lb = 1)),
      seed = 246, warmup = 1000, iter = 6000, cores = 4, chains = 4, save_pars = save_pars(all = T)) |> 
  add_criterion(c('waic', 'loo'), moment_match = T)

summary(m4e)

We made a decision to rescale ICSEA using min-max so that we could model this using a Beta likelihood function, and thus the mu and phi parameters use logit and log link functions respectively.

Effectively the beta terms of the mean will be the logit value of the mean percentile for each sector, given rescaled ICSEA values.
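
To move a coefficient back onto the original ICSEA scale, we simply invert both transformations – inverse-logit, then undo the min-max rescaling. A small sketch with a hypothetical coefficient value (the real values come from summary(m4e)):

b_sector <- 0.35 # hypothetical logit-scale coefficient for one sector

plogis(b_sector) * ((max(profile$ICSEA) + 1) - (min(profile$ICSEA) - 1)) + (min(profile$ICSEA) - 1)
# inverse-logit back to [0, 1], then reverse the min-max rescaling to ICSEA units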

Model Summary Output for m4e - Image by Author

Let’s briefly perform a posterior predictive check of each sector to see if the model is capturing as much of the observed data as reasonable.

alt_df <- profile |> 
  select(sector, ICSEA) |> 
  mutate(y = 'y', 
         draw = 99)

sim_df <- tibble(sector = c('C', 'I', 'G')) |> 
  add_predicted_draws(m4e, ndraws = 1200, seed = 246) |> #generates 1200 posterior draws for each sector
  mutate(ICSEA = (.prediction * (1211 - 808) + 808)) |> #rescale ICSEA back
  ungroup() |> 
  mutate(y = 'y_rep',
         draw = rep(seq(from = 1, to = 20, length.out = 20), times = 180)) |> #draws 
  select(sector, ICSEA, draw, y) |> 
  bind_rows(alt_df)

sim_df |> 
  ggplot(aes(ICSEA, group = draw)) +
    geom_density(aes(color = y)) +
    facet_grid(~ sector) +
    scale_color_manual(name = '', 
                       values = c('y' = "darkblue",
                                  'y_rep' = "grey")) +
    theme_ggdist() +
    labs(y = 'Density', x = 'y', title = 'Distribution of Observed and Replicated Proportions by Sector', subtitle = 'Model = m4e - Non-Pooled Phi')
Posterior Predictive Check of Model m4e - Image by Author

Our model reasonably captures the shape of each sector's ICSEA distribution. Lastly, let's ask the model what the mean ICSEA difference between each pair of sectors is, using a Bayesian ANOVA.

new_df <- tibble(sector = c('I', 'G', 'C'))

epred_draws(m4e, newdata = new_df) |> 
  mutate(ICSEA = .epred * (1211 - 808) + 808) |> 
  compare_levels(ICSEA, by = sector, comparison = rlang::exprs(C - G, I - G, C - I)) |> 
    mutate(sector = fct_inorder(sector),
         sector = fct_shift(sector, -1), 
         sector = fct_rev(sector)) |> 
  ggplot(aes(x = ICSEA, y = sector, fill = sector)) +
      stat_halfeye() +
      geom_vline(xintercept = 0, lty = 2) + 
      theme_ggdist() +
      theme(legend.position = 'bottom') +
      scale_fill_viridis_d(begin = 0.4, end = 0.8) + 
      labs(x = 'Difference in Mean Posterior ICSEA', y = 'Sector Comparison', title = 'Sector Comparison of Posterior Distributions of Mean ICSEA')
Bayesian ANOVA - Mean ICSEA Difference Between Sectors - Image by Author

Given what we observed in our exploratory data analysis, these differences are not surprising. The average difference between Catholic and Government schools is 46 points on the ICSEA scale, and 80 points between Independent and Government schools – the latter is almost a full standard deviation. Independent schools are on average 30 points higher than Catholic schools. All of these differences are non-zero and real.

Bayesian ANCOVA – Impact of ICSEA on Proportional Outcomes by Sector.

Given we'd like to understand how much outcome proportions vary with ICSEA across sectors, we need to employ a hierarchical model. As our target variable is a proportion, we'll continue using a Beta likelihood function.

We want to model the variation of ICSEA across sectors and proportional outcomes. This model took about 78 minutes to run on my machine using within-chain threading, so be warned. We won't pretend to understand every single parameter in this model; we set broad, general priors, and more importantly we are building a model that will enable us to answer the question.

m5d <- 
  brm(bf(proportion ~ 1 + (ICSEA_s|sector:outcome),
         phi ~ 1 + (ICSEA_s|sector:outcome)),
      family = Beta, 
      data = df_longb |> mutate(proportion = proportion + 0.01), 
      prior = c(prior(cauchy(0, 2), class = sd),
                prior(cauchy(0, 2), class = Intercept)),
      seed = 246, chains = 4, cores = 4, iter = 8000, warmup = 1000, save_pars = save_pars(all = T),
  control = list(adapt_delta = 0.99, max_treedepth = 15), threads = threading(2)) |> 
  add_criterion(c('waic', 'loo'), moment_match = T)

summary(m5d)
Model Output for Model m5d – Image by Author

Let’s inspect the curve fit against the observed data.

new_df <- expand_grid(sector = c('C', 'I', 'G'),
            outcome = unique(df_long$outcome),
            ICSEA_s = seq(from = 0, to = 1, by = 0.05))

m5d |> 
epred_draws(new_df, ndraws = 200) |> 
  filter(outcome != 'other') |> 
  mutate(ICSEA = ICSEA_s * ((max(df_longb$ICSEA) + 1) - (min(df_longb$ICSEA) - 1)) + min(df_longb$ICSEA) - 1) |> 
  ggplot(aes(ICSEA, .epred, color = sector)) +
    stat_lineribbon() +
    geom_point(data = df_longb |> filter(outcome != 'other'), aes(y = proportion, x = ICSEA), alpha = 0.3) +
    facet_grid(sector ~ outcome) + 
    theme_ggdist() +
    theme(legend.position = 'bottom') +
    scale_color_viridis_d(begin = 0.2, end = 0.8) + 
    scale_fill_brewer(palette = "Greys") +
    labs(y = 'Proportion', x = 'ICSEA', title = 'Beta Regression Curve Fits of Proportion & ICSEA by Sector and Outcome', subtitle = 'm5d Non-Pooled Phi Term')
Regression Curve Fits of Proportion by ICSEA for Sector and Outcome - Image by Author

The above tells an interesting story. Firstly, the proportion of students attending university post-high school shows a positive trend with ICSEA, almost uniformly across sectors. In contrast, the proportion of students in employment post-high school is negatively correlated with ICSEA. The impact of ICSEA is becoming a little more obvious, so let's quantify the expected differences in proportional outcomes by ICSEA across sectors – the ANCOVA.

new_df <- expand_grid(sector = c('C', 'I', 'G'),
            outcome = c('apprentice_trainee', 'bachelors', 'deferred', 'employed', 'looking_for_work', 'tafe'),
            ICSEA_s = seq(from = 0, to = 1, by = 0.05))

m5d |> 
epred_draws(new_df, ndraws = 2000, seed = 246) |> 
  compare_levels(variable = .epred, by = sector, comparison = rlang::exprs(C - G, I - G, C - I)) |> 
  mutate(ICSEA = ICSEA_s * ((max(df_longb$ICSEA) + 1) - (min(df_longb$ICSEA) - 1)) + min(df_longb$ICSEA) - 1,
         sector = fct_inorder(sector),
         sector = fct_shift(sector, -1), 
         sector = fct_rev(sector))  |> 
  ggplot(aes(y = .epred, x = ICSEA, color = sector)) +
    stat_lineribbon() +
    geom_hline(yintercept = c(0, -0.05, 0.05), lty = c(2)) +
    scale_fill_brewer(palette = "Greys") +
    scale_color_viridis_d(begin = 0.2, end = 0.8) + 
    facet_grid(outcome ~ sector, scales = 'free_y') +
    theme_ggdist() +
    theme(legend.position = 'bottom') +
    scale_y_continuous(labels = scales::label_percent()) +
    labs(x = 'ICSEA', y = 'Difference in Posterior Means', title = 'Sector Comparison of Outcomes by ICSEA')
Comparison of Mean Outcome Proportions by ICSEA and Sector - Image by Author

The above tells an interesting story and also highlights a key benefit of the Bayesian approach – measuring uncertainty. Let's home in on the bachelors outcome. At ICSEA levels below ~1000, Government schools on average have a higher proportion of students going on to undergraduate education than both Independent and Catholic schools.

Comparison of Bachelor Proportions by Sector and ICSEA - Image by Author

Readers will be drawn to the narrowing of the difference in posterior means between ICSEA values of 1000 and 1100. Our estimates are more certain there because the sectors overlap substantially in this range. On either side, the uncertainty broadens because of the decreasing number of schools in the respective sectors, as reinforced by the density plot of ICSEA by sector above.

Generally speaking, 50% of the posterior density largely resides within ±5% when conditioning on ICSEA for the bachelors outcome. This reinforces the positive relationship between ICSEA and school leavers attending some form of university education, with a small bias towards non-Government schools.

If we were to apply a ROPE (Region of Practical Equivalence) set to within ±5% of the marginal differences, we could say that the differences are indeed negligible.

Pipe DAG - Image by Author

Evidenced by the differences in expected proportional outcomes by ICSEA, there is only a marginal difference across sectors when conditioning on ICSEA. Referring back to our causal DAG above, we can effectively reframe this from a Collider to a Pipe.

The outcome proportion is conditionally independent of sector when conditioning on ICSEA. Alternatively, once we know the ICSEA of a school, the influence of sector is negligible within a ROPE of ±5% of the mean posterior difference.

Outcome Proportion ⫫ Sector | ICSEA
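
We can also write this implied conditional independence down formally with the dagitty package (not loaded earlier, so this is a standalone sketch encoding the Pipe structure described above):

library(dagitty)

school_dag <- dagitty("dag {
  Sector -> ICSEA -> Outcome
}")

impliedConditionalIndependencies(school_dag)
# Outcome _||_ Sector | ICSEA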

Summary and Conclusion

In this article we have continued to explore secondary school leaver outcomes, this time incorporating the Index of Community Socio-Educational Advantage (ICSEA), a scale that identifies the socio-educational advantage of a school, as a proxy measure. We performed an initial Bayesian ANOVA to identify differences in posterior mean ICSEA across sectors. We then merged this with the outcomes dataset and performed a Bayesian ANCOVA to answer whether outcome proportions vary across sectors when conditioning on ICSEA.

This work begins to reveal a possible causal model, which we've illustrated throughout with DAGs. Our conclusion, that Outcome Proportion ⫫ Sector | ICSEA, is not surprising given what we understand about ICSEA. It was designed to capture an array of possible confounders in a single metric that describes a school- or population-level effect. For the purpose of this exercise it has allowed us to quantify the mean differences across sectors, which we found by and large to be negligible within a ±5% ROPE.

References

  1. Alyahyan, E., Düştegör, D. Predicting academic success in higher education: literature review and best practices. Int J Educ Technol High Educ 17, 3 (2020). https://doi.org/10.1186/s41239-020-0177-7

A Bayesian Comparison of School Leaver Outcomes with R and brms

ANOVA – Bayesian Style

Much is made of what we want to do when we leave school. We get asked as young children what we want to do when we grow up, and then proceed to spend 13 years in pre-tertiary education. In public policy, much is made of the differences between the Government, Catholic, and Independent school systems, particularly when it comes to public funding of non-Government schools and how resources should be allocated.

Is there any real difference between student outcomes given the school sector?

In Victoria, Australia, the State Government conducts an annual survey of post-high-school outcomes (the On Track Survey). This dataset (available under CC BY 4.0) spans several years; we'll only look at the most recent available as of the authoring of this article, 2021.

This article will seek to answer this question by applying Bayesian analysis methodologies, using the R package brms.


Load Libraries and Dataset

Below we load our required packages. The dataset is presented in a wide report format with many merged cells; R doesn't like this, so we have to do a bit of hard coding to relabel vectors and create a tidy data frame.

library(tidyverse) #Tidyverse meta package
library(brms) #Bayesian Modelling Package
library(tidybayes) #Tidy Helper Functions and Visualisation Geoms for Bayesian Models
library(readxl)
df <- read_excel("DestinationData2022.xlsx", 
    sheet = "SCHOOL PUBLICATION TABLE 2022", 
    skip = 3)
colnames(df) <- c('vcaa_code', 
                  'school_name', 
                  'sector', 
                  'locality', 
                  'total_completed_year12', 
                  'on_track_consenters', 
                  'respondants', 
                  'bachelors', 
                  'deferred', 
                  'tafe', 
                  'apprentice_trainee', 
                  'employed', 
                  'looking_for_work', 
                  'other') 

df <- drop_na(df)

Here is a sample of what our dataset looks like.

The dataset has 14 fields.

  • VCAA Code – Administrative ID for each school
  • School Name
  • Sector – G = Government, C = Catholic & I = Independent
  • Locality or Suburb
  • Total Students Completing Year 12
  • On Track Consenters
  • Respondents
  • Rest of Columns Represent the Percentage of Respondents for Each Outcome

For the purposes of our cross-sectional study we are interested in the outcome proportions by sector, and thus we need to convert this wide data frame into a long format.

df_long <- df |> 
  mutate(across(5:14, ~ as.numeric(.x)), #convert all numeric fields captured as characters
         across(8:14, ~ .x/100 * respondants), #calculate counts from proportions
         across(8:14, ~ round(.x, 0)), #round to whole integers
         respondants = bachelors + deferred + tafe + apprentice_trainee + employed + looking_for_work + other) |> #recalculate total respondents
  filter(sector != 'A') |>  #Low volume 
  select(sector, school_name, 7:14) |> 
  pivot_longer(c(-1, -2, -3), names_to = 'outcome', values_to = 'proportion') |> 
  mutate(proportion = proportion/respondants)

Exploratory Data Analysis

Let’s briefly visualize and summarize our dataset. The government sector has 157, Independent has 57, and Catholic has 50 schools listed in this dataset.

df |> 
  mutate(sector = fct_infreq(sector)) |> 
  ggplot(aes(sector)) +
    geom_bar(aes(fill = sector)) + 
    geom_label(aes(label = after_stat(count)), stat = 'count', vjust = 1.5) +
    labs(x = 'Sector', y = 'Count', title = 'Count of Schools by Sector', fill = 'Sector') +
    scale_fill_viridis_d(begin = 0.2, end = 0.8) +
    theme_ggdist()

df_long |> 
  ggplot(aes(sector, proportion, fill = outcome)) +
    geom_boxplot(alpha = 0.8) +
    geom_point(position = position_dodge(width=0.75), alpha = 0.1, color = 'grey',  aes(group = outcome)) +
    labs(x = 'Sector', y = 'Proportion', fill = 'Outcome', title = 'Boxplot of Respondant Proportions by Sector and Outcome') +
    scale_fill_viridis_d(begin = 0.2, end = 0.8) +
    theme_ggdist()

The distributions tell an interesting story. The bachelors outcome shows the greatest variability across all sectors, with Independent schools having the greatest median proportion of students electing to pursue undergraduate education. It is interesting to note that Government schools have the highest median proportion of students going straight into employment after completing high school. The other outcomes don't vary much – we'll revisit this soon.

Statistical Modelling – Beta Likelihood Regression

We’re focused on proportions of students by school and their outcomes after high school graduation. A beta likelihood is our best bet in these instances. brms is a brilliant package by Buerkner to develop Bayesian models. Our goal is to model the distribution of proportions by outcome and sector.

Beta regression models two parameters, μ and φ. μ represents the mean proportion, and φ is a measure of precision (or variance).
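
Under the mean-precision parameterisation used by brms, the usual Beta shape parameters are recovered as shape1 = μφ and shape2 = (1 − μ)φ, which is easy to verify by simulation (a quick sketch):

mu  <- 0.4 # mean proportion
phi <- 15  # precision - larger phi means draws sit tighter around mu

draws <- rbeta(1e5, shape1 = mu * phi, shape2 = (1 - mu) * phi)

mean(draws)             # approximately 0.4
var(draws)              # approximately mu * (1 - mu) / (1 + phi) = 0.015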

Our first model is represented below. Note we are starting directly with an interaction between sector and outcome, because this is the question we want the model to answer – hence this is an ANOVA-style model.

We are asking the model to create individual beta terms for each combination of sector and outcome, with a pooled phi term – in other words, to model different proportion means with a common precision. Under the normal(-1, 2) prior, roughly two-thirds of the prior mass for each coefficient sits between (-3, 1) on the logit scale, or about (0.05, 0.73) as proportions – a sufficiently wide but informed prior. The global phi estimate is a positive number, so we use a gamma prior that is sufficiently wide.
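
It's easy to check what a normal(-1, 2) prior implies on the proportion scale before fitting anything (a quick prior-check sketch):

set.seed(246)
prior_draws <- rnorm(1e5, mean = -1, sd = 2)

quantile(plogis(prior_draws), probs = c(0.16, 0.5, 0.84))
# roughly 0.05, 0.27 and 0.73 - the bulk of the prior mass sits on plausible
# outcome proportions while still allowing values near 0 or 1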

# Modelling Proportion with Sector:Outcome Interaction term using Beta - Pooled Phi term

m3a <- brm(
  proportion ~ sector:outcome + 0, 
  family = Beta,
  data = df_long |> mutate(proportion = proportion + 0.01), # Beta regression can't take outcomes that are 0 so we fudge here by adding tiny increment
  prior = c(prior(normal(-1, 2), class = 'b'),
            prior(gamma(6, 1), class = 'phi')),
  iter = 8000, warmup = 1000, cores = 4, chains = 4, seed = 246, save_pars = save_pars(all = T),
  control = list(adapt_delta = 0.99, max_treedepth = 15)
) |> 
  add_criterion(c('waic', 'loo'), moment_match = T)

summary(m3a)

Note in the model setup we've added a 1% increment to all proportions – this is because Beta regression does not handle zero values. We attempted to model this with a zero-inflated Beta likelihood, but it took much longer to converge.

Similarly, we can model without a pooled phi. This intuitively makes sense given what we observed in the distributions above: there is variation amongst the sector and outcome combinations. The model is defined below.

m3b <- brm(
  bf(proportion ~ sector:outcome + 0,
     phi ~ sector:outcome + 0),
  family = Beta,
  data = df_long |> mutate(proportion = proportion + 0.01),
  prior = c(prior(normal(-1, 2), class = 'b')),
  iter = 8000, warmup = 1000, cores = 4, chains = 4, seed = 246, save_pars = save_pars(all = T),
  control = list(adapt_delta = 0.99)
) |> 
  add_criterion(c('waic', 'loo'), moment_match = T)

summary(m3b)

Model Diagnostics and Comparison

With two models in hand, we'll now compare their out-of-sample predictive accuracy using the Bayesian LOO estimate of the expected log pointwise predictive density (elpd_loo).

comp <- loo_compare(m3a, m3b)

print(comp, simplify = F)

Simply put, the higher the expected log pointwise leave-one-out value, the greater the predictive accuracy on unseen data. This gives us a good relative measure of accuracy between models. We can further check this by completing a posterior predictive check, a comparison of observed and simulated values. In our case, model m3b better models the observed data.

alt_df <- df_long |> 
  select(sector, outcome, proportion) |> 
  rename(value = proportion) |> 
  mutate(y = 'y', 
         draw = 99) |> 
  select(sector, outcome, draw, value, y)

sim_df <- expand_grid(sector = c('C', 'I', 'G'),
            outcome = unique(df_long$outcome)) |> 
  add_predicted_draws(m3b, ndraws = 1200) |> 
  rename(value = .prediction) |> 
  ungroup() |> 
  mutate(y = 'y_rep',
         draw = rep(seq(from = 1, to = 50, length.out = 50), times = 504)) |> 
  select(sector, outcome, draw, value, y) |> 
  bind_rows(alt_df)

sim_df |> 
  ggplot(aes(value, group = draw)) +
    geom_density(aes(color = y)) +
    facet_grid(outcome ~ sector, scales = 'free_y') +
    scale_color_manual(name = '', 
                       values = c('y' = "darkblue",
                                  'y_rep' = "grey")) +
    theme_ggdist() +
    labs(y = 'Density', x = 'y', title = 'Distribution of Observed and Replicated Proportions by Sector and Outcome')

Model m3b, given its non-pooled variance (phi) term, better captures the variation in proportion distributions by sector and outcome.

ANOVA – Bayesian Style

Recall, our research question is about understanding whether there are differences in outcomes between sectors, and by how much. In frequentist statistics we might use ANOVA, a difference-of-means approach between groups. The weakness of that approach is that the results provide an estimate and a confidence interval without a sense of uncertainty about the estimate, plus a counterintuitive p-value stating whether or not the difference in means is statistically significant. No, thank you.

Below we generate a set of expected values for each sector and outcome combination, then use the excellent tidybayes::compare_levels() function, which performs the heavy lifting: it computes the difference in posterior means between sectors for each outcome. We excluded the 'other' outcome for the sake of brevity.

new_df <- expand_grid(sector = c('I', 'G', 'C'), 
          outcome = c('apprentice_trainee', 'bachelors', 'deferred', 'employed', 'looking_for_work', 'tafe'))

epred_draws(m3b, newdata = new_df) |> 
  compare_levels(.epred, by = sector, comparison = rlang::exprs(C - G, I - G, C - I)) |> 
    mutate(sector = fct_inorder(sector),
         sector = fct_shift(sector, -1), 
         sector = fct_rev(sector))  |> 
  ggplot(aes(x = .epred, y = sector, fill = sector)) +
      stat_halfeye() +
      geom_vline(xintercept = 0, lty = 2) + 
      facet_wrap(~ outcome, scales = 'free_x') +
      theme_ggdist() +
      theme(legend.position = 'bottom') +
      scale_fill_viridis_d(begin = 0.4, end = 0.8) +
      labs(x = 'Proportional Difference', y = 'Sector', title = 'Differences in Posterior Means by Sector and Outcome', fill = 'Sector')

Alternatively, we can summarise all these distributions with a neat table for easier interpretation, using an 89% credible interval.

library(gt) # presentation tables (not loaded above)

marg_gt <- epred_draws(m3b, newdata = new_df) |> 
  compare_levels(.epred, by = sector, comparison = rlang::exprs(C - G, I - G, C - I)) |> 
  median_qi(.width = .89) |> 
  mutate(across(where(is.numeric), ~round(.x, 3))) |> 
  select(-c(.point, .interval, .width)) |> 
  arrange(outcome, sector) |> 
  rename(diff_in_means = .epred, 
         Q5.5 = .lower, 
         Q94.5 = .upper) |> 
  group_by(outcome) |> 
  gt() |> 
  tab_header(title = 'Sector Marginal Distributions by Outcome') |> 
  #tab_stubhead(label = 'Sector Comparison') |> 
  fmt_percent() |> 
  gtExtras::gt_theme_pff()

For example if we were to summarise a comparison in a formal report we might write the following.

Students in Government schools are less likely to begin undergraduate degrees than their counterparts in Catholic and Independent schools post-VCE completion.

On average, 42.5% (between 39.5% and 45.6%) of Government school students, 53.2% (between 47.7% and 58.4%) of Catholic school students and 65% (between 60.1% and 69.7%) of Independent school students begin undergraduate education after completing Year 12.

There is an 89% posterior probability that the difference between Catholic and Government student undergraduate enrolment is between 5.6% and 15.7% with a mean of 10.7%. Additionally, the difference between Independent and Government student undergraduate enrolment is between 17.8% and 27% with a mean of 22.5%.

These differences are substantial, and there’s a 100% chance that these differences are not zero.

Summary and Conclusion

In this article we have demonstrated how to model proportional data with a Beta likelihood in a Bayesian framework, and how to perform a Bayesian ANOVA to explore the differences in proportional outcomes between sectors.

We have not sought to create a causal understanding of these differences. One can imagine that several factors influence how individual students perform: socioeconomic background and parents' educational attainment, in addition to school-level impacts such as resourcing. It's an incredibly complex area of public policy that can often get bogged down in zero-sum rhetoric.

Personally, I'm the first person in my extended family to attend and complete tertiary education. I went to a middle-band public high school and achieved reasonably good scores to gain entrance into my first preference. My parents, who both left school at the age of 16, encouraged me to pursue education. Whilst this article provides evidence that the difference between Government and non-Government schools is real, it is purely descriptive in nature.

The post A Bayesian Comparison of School Leaver Outcomes with R and brms appeared first on Towards Data Science.

Building Blocks of Causal Inference – A DAGgy approach using Lego https://towardsdatascience.com/building-blocks-of-causal-inference-a-daggy-approach-using-lego-cac1372348f3/ Sat, 04 Feb 2023 01:56:32 +0000 https://towardsdatascience.com/building-blocks-of-causal-inference-a-daggy-approach-using-lego-cac1372348f3/ An Introduction to Causal Inference with DAGs and Bayesian Regression

Building Blocks of Causal Inference – A DAGgy Approach Using Lego

Causal Inference is a fascinating topic. Causal models seek to create a mechanistic understanding of how variables are related. Recently I’ve been reading Statistical Rethinking by Richard McElreath, whose eloquent and accessible writing has changed the way I think about not just regression and statistical analysis, but life.

This article seeks to explore causal modelling using Directed Acyclic Graphs or DAGs and brms (Buerkner).

Photo by Markus Spiske on Unsplash

I found this wonderful article by Ziegler et al. that explores teaching multiple linear regression models to school children. The dataset I'll be using is taken from it and is available under a CC BY 4.0 license.

Building a Multiple Linear Regression Model With LEGO Brick Data

Load Packages and Data

library(tidyverse)
library(tidybayes)
library(brms)
library(ggdag)
library(dagitty)

train <- read_csv('lego.population.csv')

train_parse <- 
train %>% 
  drop_na(Weight, Pieces) %>% 
  mutate(Weight = as.numeric(str_sub(Weight, 1, 3)),
         Price = as.numeric(str_remove(Price, '[^[:alnum:]]')),
         Amazon_Price = as.numeric(str_remove(Amazon_Price, '[^[:alnum:]]')),
         Price = coalesce(Price, Amazon_Price)) %>% 
  drop_na(Price, Weight) %>% 
  mutate(Weight_c = Weight/max(Weight),
         Price_c = Price/max(Price),
         Pieces_c = Pieces/max(Pieces)) %>% 
  drop_na(Weight, Pieces, Price)

We perform some basic data cleansing, and then min/max scale the variables so that they're on the same scale, which makes our later interpretation of the beta posterior means a lot easier.

Exploratory Data Analysis

Using GGally::ggpairs() we generate the below pair plot for our variables of interest.

GGally::ggpairs(train_parse %>% 
                  select(Weight_c, Pieces_c, Price_c),
                aes(alpha = 1/3)) +
  theme_minimal() +
  labs(title = 'Pairplot for Weight, Pieces and Price')
Variable Pairplot Using Min-Max Scaled Variables (Image by Author)

Nothing overtly interesting to note, other than the strong positive linear relationships between all three variables. But what is the correct causal model? Does increasing the price also increase the number of pieces? I don’t think so.

Causal Inference

Using the loaded dataset we are going to create a causal understanding of Price. The dataset is conceptually easy to understand causally.

Thousands of Lego sets exist, from basic children's Duplo to large sets with thousands of pieces like the Millennium Falcon or Death Star from Star Wars. Prices range from $10–20 up to thousands of dollars for limited edition scale model sets. Can we create a reasonable causal model for the price given a set's features?

DAGs

In the below, we use the dagitty package to set up our proposed DAG with some plot co-ordinates, and then use ggdag to plot the figure below, which describes all the different pathways by which Weight and the number of Pieces can influence the Price of a Lego set.

dag_coords <- 
  tibble(name = c("Pcs", 'W', 'Pr'),
         x = c(1, 2, 2),
         y = c(2, 2, 1))

dagify(Pr ~ Pcs,
       Pr ~ W, 
       W ~ Pcs,
       coords = dag_coords) %>% 
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
    geom_dag_point(color = 'dark red', alpha = 1/4, size = 20) +
    geom_dag_edges(edge_color = 'dark red') +
    geom_dag_text(color = 'dark red') +
    scale_x_continuous(NULL, breaks = NULL, expand = c(.1, .1)) +
    scale_y_continuous(NULL, breaks = NULL, expand = c(.1, .1)) +
    theme_bw() +
    theme(panel.grid = element_blank())

Our task is now to test the implications of these DAGs and then finalise a causal model that best describes the influence of these variables on Price based on our dataset. To begin with, we can reasonably say that the number of pieces is positively correlated with weight and price, i.e. the more pieces a set has, the heavier and more expensive it will be. Similarly, we can also reason that larger, heavier sets should cost more.

Complete DAG for Weight, Pieces and Price (Image by Author)

Regress Price on Pieces

A natural place to start is regressing the price on the number of pieces, Pcs → Pr. The beauty of Bayesian analysis is being able to provide prior distributions before looking at the data. The distribution of set prices is modelled as a normal distribution, with the mean given by linear terms for the intercept and a gradient term for pieces. The intercept has a relatively broad Gaussian prior with a mean of $20 and a standard deviation of $6; in essence, this describes the distribution of base prices of a Lego set without any pieces. The beta term describes the value increase per piece. A Google search indicates the average price of a Lego piece is 11c, so this is a good starting point, and we'll make the standard deviation sufficiently broad.

Pcs → Pr Prior Formula (Image by Author)
mP_Pr <- brm(Price ~ 1 + Pieces,
             family = gaussian,
             prior = c(prior(exponential(1), class = sigma),
                       prior(normal(20, 6), class = Intercept),
                       prior(normal(0.11, 0.04), class = b)
                       ),
             data = train_parse,
             warmup = 1000, iter = 2000, chains = 4, cores = 4, seed = 246) %>% 
  add_criterion(criterion = c('loo'))
summary(mP_Pr)
Summary Output for Pcs → Pr (Image by Author)

Our priors weren't too far off after being exposed to the data: the posterior mean for the intercept term represents the average price of a Lego set with zero pieces, and each piece adds 8c to the value of the set. The posterior distributions have fairly narrow error terms on their respective scales, so the model is fairly confident of these values given the data.
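
If we wanted to quantify that confidence more formally, brms offers hypothesis() for probing statements about the coefficients; a small sketch against the mP_Pr fit above (the 11c bound is simply the retail estimate used in the prior):

# Evidence ratios and posterior probabilities for simple statements about the
# per-piece coefficient from the Pcs → Pr model
hypothesis(mP_Pr, c("Pieces > 0", "Pieces < 0.11"))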

Regress Price on Weight

Similarly to the above, we set some reasonable yet broad priors that will be swamped by the data. We've assumed the same intercept prior as before; the beta term for Weight is again very broad, with a mean of $50/kg and a standard deviation of $15.

W → Pr Prior Formula (Image by Author)
mW_Pr <- brm(Price ~ 1 + Weight,
             family = gaussian,
             prior = c(prior(exponential(5), class = sigma),
                       prior(normal(20, 6), class = Intercept),
                       prior(normal(50, 15), class = b)),
             data = train_parse,
             warmup = 1000, iter = 2000, chains = 4, cores = 4, seed = 246)
Summary Output for W → Pr (Image by Author)

This time the Intercept term is lower than before, with a posterior mean of $5.77, and the gradient term for Weight has a posterior mean of $57.90/kg. Note this model is more confident, as the error terms are narrower than when regressing Price on Pieces.

Regress Price on Pieces and Weight

Now things get a little more interesting – we've created two Bayesian regression models for Price, one using the number of Pieces and one using Weight. Both seem like reasonable models, but which is the better causal model?

Meet the Collider, in which the number of Pieces and Weight are independent of each other (Pcs || W). We can already remove this from consideration given what we mechanistically know about the variables, but let's develop models to support this.
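
As a side note, the dagitty package (already loaded above) can enumerate the testable implications of a candidate DAG for us; a quick sketch of the collider, with Pieces and Weight each pointing at Price and no edge between them:

# The collider DAG implies Pcs and W are marginally independent
collider_dag <- dagitty('dag { Pcs -> Pr ; W -> Pr }')
impliedConditionalIndependencies(collider_dag)
# Pcs _||_ W

Given the strong positive correlation between Pieces and Weight we saw in the pair plot, that implication already looks implausible, which is what the models below confirm.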

The Collider DAG and Formula for W → Pr ← Pcs (Images by Author)
# The Collider
mWP_Pr <- brm(
  Price ~ 1 + Pieces + Weight, 
  family = gaussian,
             prior = c(prior(exponential(5), class = sigma),
                       prior(normal(20, 6), class = Intercept),
                       prior(normal(50, 15), class = b, coef = Weight),
                       prior(normal(0.11, 0.04), class = b, coef = Pieces)
                       ),
  data = train_parse,
             warmup = 1000, iter = 4000, chains = 4, cores = 4, seed = 246)
Outputs Summary for W → Pr ← Pcs (Image by Author)

We notice immediately that the coefficient for the intercept has not changed much at all, and both the predictor coefficients have decreased. This isn't surprising given how strongly correlated they are, and is an example of multicollinearity. This is enough to disprove the Collider as an appropriate causal model, as it violates the independence assumption between the two variables. Below we visualise the posterior distributions of Weight and Pieces together as a scatter plot to display the multicollinearity: there is a shared axis of variance. On the right-hand side, we also see the near-zero difference in the posterior distributions of the Weight term between regressing on Weight alone vs. Weight and Pieces together.

# LHS: Plot of Posterior Draws Weight vs. Pieces
as_draws_df(mWP_Pr) %>%
  ggplot(aes(x = b_Pieces_c, y=b_Weight_c)) +
    geom_point(alpha = 0.3) +
    labs(y = 'Weight', x = 'Pieces', title = 'Weight and Pieces Display Multicollinearity and Are Not Independent', subtitle = 'How Can Price per Piece Increase as Price per Unit Weight Decreases?') +
    theme_minimal()

# RHS: Comparison of Posterior Draws for Weight Across W → Pr ← Pcs & W → Pr
tidy_draws(mWP_Pr) %>% 
  select(b_Pieces_c, b_Weight_c) %>% 
  transmute(Pieces_Weight = b_Pieces_c + b_Weight_c,
         Weight = tidy_draws(mW_Pr) %>% select(b_Weight_c) %>% as_vector()) %>% 
  pivot_longer(1:2, names_to = 'Variable', values_to = 'posterior_samples') %>% 
  ggplot(aes(posterior_samples, fill = Variable)) +
           geom_density(alpha = 1/3) +
  theme_minimal() +
  labs(title = 'Addition of Pieces and Weight is Nearly Equivalent to Weight Alone',
       subtitle = 'Variables Min-Max Scaled',
       y = 'Density',
       x = 'Posterior Distribution')

#NB We've used models regenerated using min-max scaled variables. 
Weight and Pieces Demonstrate Multicollinearity, Variables Min-Max Scaled (Image by Author)

The next DAG is known as the Pipe, whereby the number of Pieces is independent of Price when conditioning on Weight (or Pcs || Pr | W). Alternatively, we can think of it this way: once we know the Weight of a set, knowing the number of Pieces adds no further value to our understanding of Price.

The Pipe Pcs → W → Pr (Image by Author)

Below we establish the Pipe as a Bayesian model, using the priors from our previous examples.

# The Pipe
pr_model <- bf(Price ~ 1 + Weight)
w_model <- bf(Weight ~ 1 + Pieces)

mWP_Pr2 <- brm(pr_model + w_model + set_rescor(F),
    prior = c(prior(exponential(5), class = sigma, resp = Price),
              prior(exponential(5), class = sigma, resp = Weight),
              prior(normal(50, 15), class = b, coef = Weight, resp = Price),
              prior(normal(0.11, 0.04), class = b, coef = Pieces, resp = Weight)),
             family = gaussian,
             data = train_parse,
             warmup = 1000, iter = 4000, chains = 4, cores = 4, seed = 246)
Summary Output for Pcs → W → Pr (Image by Author)

The Price intercept remains unchanged as we are still regressing Price on Weight, and its posterior mean is effectively the same as in our first model, W → Pr. The posterior mean of the Weight intercept could potentially represent the weight of the packaging before any pieces are added. The regression coefficient for Pieces is relatively small (0.0014 kg, or 1.4 g, per piece).

Let’s test the implied conditional independence with a counterfactual model. We establish new data, varying the number of pieces while holding the weight constant at 1kg.

# Counterfactual Plot Pieces on Price, Holding Weight Constant
nd <- tibble(Pieces = seq(from = 0, to = 5000, length.out = 50),
             Weight = 1)

predict(mWP_Pr2, 
        resp = c('Price'),
        newdata = nd) %>% 
  as_tibble() %>% 
  bind_cols(nd) %>% 
  ggplot(aes(x = Pieces, y = Estimate, ymin = Q2.5, ymax = Q97.5)) +
    geom_smooth(stat = 'identity', alpha = 1/5, size = 1/4) +
    labs(y = 'Counterfactual Price', x = 'Manipulated Pieces')
Counterfactual Plot for Pieces on Price (Image by Author)

As the impact of Pieces on Price is mediated through Weight, it follows that Price will remain constant while Weight is held constant, regardless of the number of Pieces. This supports the implied conditional independency that Price is independent of Pieces when conditioning on Weight (or Pcs || Pr | W). Alternatively, once we know the Weight of a set, knowing the number of Pieces adds no further information to our understanding of Price.
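
As a supplementary check not performed above, we could also compare the out-of-sample predictive performance of the two single-predictor models with approximate leave-one-out cross-validation; a sketch reusing the mP_Pr and mW_Pr fits:

# Higher elpd indicates better expected predictive performance;
# mP_Pr already has the loo criterion attached via add_criterion()
loo_compare(loo(mP_Pr), loo(mW_Pr))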


Concluding Remarks

In this article, we’ve demonstrated an accessible introduction to causal modelling using Bayesian regression. We’ve developed a causal model for the impacts of physical variables (Pieces and Weight) on Price, and found that we can reasonably believe that regressing Price on Weight is a decent causal model.

I’d like to take a chance to thank Richard McElreath for his wonderful writing and lectures on causal modelling which have inspired me to change the way I think, taking a more rigorous Bayesian approach to my work and life.


Thank you. I hope you enjoyed reading this as much as I enjoyed writing. If you’re not a Medium member – use my referral link below and get regular updates on new publications from myself and other fantastic Medium authors.

Murray Gillin – Medium

The post Building Blocks of Causal Inference – A DAGgy approach using Lego appeared first on Towards Data Science.

Bank Customer Churn with Tidymodels - Part 2 Decision Threshold Analysis https://towardsdatascience.com/bank-customer-churn-with-tidymodels-part-2-decision-threshold-analysis-c658845ef1f/ Thu, 06 Jan 2022 07:20:55 +0000 https://towardsdatascience.com/bank-customer-churn-with-tidymodels-part-2-decision-threshold-analysis-c658845ef1f/ Bank Customer Churn with Tidymodels – Part 2 Decision Threshold Analysis Welcome back for Part 2 of our exploration of the Bank Customer Churn problem now examining decision threshold analysis. In Part 1 we developed a candidate workflow that achieved strong results across a variety of classification metrics, and discussed the impacts of different up- […]

Bank Customer Churn with Tidymodels – Part 2 Decision Threshold Analysis

Welcome back for Part 2 of our exploration of the Bank Customer Churn problem now examining decision threshold analysis.

In Part 1 we developed a candidate workflow that achieved strong results across a variety of classification metrics, and discussed the impacts of different up- and downsampling techniques to manage the 4:1 class imbalance in the bank customer churn dataset (taken from https://www.kaggle.com/shivan118/churn-modeling-dataset (License CC0: Public Domain)).

Part 1 is available here, and I recommend reading it first to better understand the context and model development.

This article will focus on explaining the consequences of model output to a non-technical audience. We will complete a decision threshold analysis, generate a cost function and present two scenarios – the threshold that either best differentiates between classes OR has the lowest cost. We will use the probably package from tidymodels to complete this analysis. Our aim is to identify and present a costed decision threshold analysis to our stakeholders, weighing the cost of customer churn against that of intervention strategies.

Load Packages

library(tidymodels) #ML Metapackage
library(probably) #Threshold Analysis
library(forcats) #Working with factors
library(patchwork) #ggplot grids
tidymodels_prefer()
options(yardstick.event_first = FALSE)
class_metric <- metric_set(accuracy, f_meas, j_index, kap, precision, sensitivity, specificity, mcc)

Finalize and Fit Model

Picking up from where we left off in Part 1, we’ll identify the best performing workflow and finalise the model.

best_result <- wf_sample_exp %>% 
  extract_workflow_set_result("UPSAMPLE_Boosted_Trees") %>% 
  select_best(metric = 'j_index')
xgb_fit <- wf_sample_exp %>% 
  extract_workflow("UPSAMPLE_Boosted_Trees") %>% 
  finalize_workflow(best_result) %>%
  fit(training(cust_split))

workflowsets::extract_workflow_set_result() takes a named workflow and returns a tibble of all trialled hyperparameter combinations, from which select_best() picks the best based on the metric supplied. workflows::extract_workflow() again takes a named workflow, and finalize_workflow() then updates its hyperparameters based on what is stored in best_result. The resulting workflow is then fit to the training data.
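
As an aside, the same train-then-assess pattern can be expressed in one step with tune::last_fit(), which fits the finalised workflow on the training portion of the split and evaluates it on the test portion; a small sketch using the objects above (xgb_last_fit is just an illustrative name):

# Convenience wrapper: fit on training(cust_split), evaluate on testing(cust_split)
xgb_last_fit <- wf_sample_exp %>% 
  extract_workflow("UPSAMPLE_Boosted_Trees") %>% 
  finalize_workflow(best_result) %>% 
  last_fit(cust_split)
collect_metrics(xgb_last_fit)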

Threshold Analysis

xgb_fit %>% 
  predict(new_data = testing(cust_split), type = 'prob') %>% 
  bind_cols(testing(cust_split)) %>% 
  ggplot(aes(x=.pred_1, fill = Exited, color = Exited)) +
    geom_histogram(bins = 40, alpha = 0.5) +
    theme_minimal() +
    scale_fill_viridis_d(aesthetics = c('color', 'fill'), end = 0.8) +
    labs(title = 'Distribution of Prediction Probabilities by Exited Status', x = 'Probability Prediction', y = 'Count')

By predicting probabilities, we can visualise the respective distributions by churn status. At the default threshold of 0.5, observations with a predicted probability above 0.5 are predicted as churning, and vice versa. Threshold analysis identifies an optimal threshold given the desired metrics, and the probably package enables us to carry out such analysis. probably::threshold_perf() takes the truth and estimate columns, sequentially varies the threshold and calculates sensitivity, specificity and the J-index at each value.

#Generate Probability Prediction Dataset
xgb_pred <- xgb_fit %>% 
  predict(new_data = testing(cust_split), type = 'prob') %>% 
  bind_cols(testing(cust_split)) %>% 
  select(Exited, .pred_0, .pred_1)
#Generate Sequential Threshold Tibble
threshold_data <- xgb_pred %>% 
  threshold_perf(truth = Exited, estimate = .pred_1, thresholds = seq(0.1, 1, by = 0.01))
#Identify Threshold for Maximum J-Index
max_j_index <- threshold_data %>% 
  filter(.metric == 'j_index') %>% 
  filter(.estimate == max(.estimate)) %>% 
  select(.threshold) %>% 
  as_vector()
#Visualise Threshold Analysis
threshold_data %>% 
  filter(.metric != 'distance') %>% 
  ggplot(aes(x=.threshold, y=.estimate, color = .metric)) +
   geom_line(size = 2) +
   geom_vline(xintercept = max_j_index, lty = 5, alpha = .6) +
   theme_minimal() +
   scale_colour_viridis_d(end = 0.8) +
   labs(x='Threshold', 
        y='Estimate', 
        title = 'Balancing Performance by Varying Threshold',
        subtitle = 'Vertical Line = Max J-Index',
        color = 'Metric')

Our analysis indicates that the threshold with the highest J-index is 0.47.

To extend this analysis, we can hypothetically tune the threshold against any available metric, as below. I couldn't get this to work using a yardstick::metric_set() with probably::threshold_perf() while also including pr_auc and roc_auc, so I had to use some purrr-fect trickery.

#Threshold Analysis by Several Classification Metrics
list(pred_df = list(pred_df = xgb_pred), 
     threshold = list(threshold = seq(0.03, 0.99, by = 0.01))) %>% 
cross_df() %>% 
  mutate(pred_data = map2(pred_df, threshold, ~mutate(.x, .prob_class = as_factor(if_else(.pred_1 < .y , 0, 1)))),
         pred_data = map2(pred_data,  threshold, ~mutate(.x, .prob_metric = if_else(.pred_1 < .y , 0, 1))),
         pred_metric = map(pred_data, ~class_metric(.x, truth = Exited, estimate = .prob_class)),
         roc_auc = map(pred_data, ~roc_auc(.x, truth = Exited, estimate = .prob_metric)),
         pr_auc = map(pred_data, ~pr_auc(.x, truth = Exited, estimate = .prob_metric)),
         pred_metric = pmap(list(pred_metric, roc_auc, pr_auc),~bind_rows(..1,..2,..3))) %>%
  select(pred_metric, threshold) %>%                                                            
  unnest(pred_metric) %>%                                                                        
  ggplot(aes(x=threshold, y=.estimate, color = .metric)) +
    geom_line(size = 1) +
    scale_color_viridis_d() +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45)) +
    facet_wrap(~.metric, nrow = 2) +
    labs(title = 'Impact of Decision Threshold on Classification Metrics', x= 'Threshold', y = 'Estimate', color = 'Metric')

The above aside, we have all we need from the output of probably::threshold_perf(). Sensitivity and Specificity enable us to calculate the FPR and FNR for a particular threshold, and hence a cost function.

Cost Function

Here we enter a hypothetical situation not captured within the original dataset. In order to calculate the total cost of FNs and FPs as below, we need an approximation of Customer Lifetime Value (CLV) and a cost of intervention. Let's suppose the cost of intervention is $99, the value of an annual fee for a standard account: if we suspect a customer is churning, customer service will honour a year's discount on the annual fee.

For annualised CLV we take the sum of account fees and credit card fees. We assume that each product has a $99 annual fee, except credit cards which have a $149 fee. So, as below, we calculate an approximate CLV per customer and then take the median value as the CLV, which happens to be $149.

N.B. I know this is a very basic view of what constitutes CLV, and no doubt a real-life scenario would also account for the interest a customer has paid on credit or loans, amongst other avenues of bank customer revenue.

train %>% 
  mutate(CreditCardFees = HasCrCard*149,
         AccountFees = (NumOfProducts - HasCrCard)*99,
         CLV = CreditCardFees + AccountFees) %>% 
  ggplot(aes(CLV)) +
   geom_histogram() +
   theme_minimal() +
   labs(title = 'Distribution of Annual CLV', x='CLV', y = 'Count')

In applying this logic to our threshold_data tibble, we can visualise these functions.

threshold_data %>% 
  filter(.metric %in% c('sens', 'spec')) %>% 
  pivot_wider(id_cols = .threshold, values_from = .estimate, names_from = .metric) %>% 
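  # N.B. 510 and 1991 below are presumably the churner / non-churner counts in the
  # test split, and $149 / $99 are the CLV and intervention cost assumptions above;
  # Cost_FN = FNR x churners x CLV, Cost_FP = FPR x non-churners x intervention cost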
  mutate(Cost_FN = ((1-sens) * 510 * 149), 
         Cost_FP = ((1-spec) * 1991 * 99),
         Total_Cost = Cost_FN + Cost_FP) %>% 
 select(.threshold, Cost_FN, Cost_FP, Total_Cost) %>% 
 pivot_longer(2:4, names_to = 'Cost_Function', values_to = 'Cost') %>% 
  ggplot(aes(x = .threshold, y = Cost, color = Cost_Function)) +
    geom_line(size = 1.5) +
    theme_minimal() +
    scale_colour_viridis_d(end = 0.8) +
    labs(title = 'Threshold Cost Function', x = 'Threshold')

Scenario Analysis – Minimising Cost or Maximising Differentiation

Having established the cost functions, we can identify a decision threshold that minimises these costs. As noted in the introduction, we can think of two scenarios: the threshold that optimises the J-index, or the threshold that minimises cost. This is demonstrated below.

threshold_data %>% 
  filter(.metric %in% c('sens', 'spec')) %>% 
  pivot_wider(id_cols = .threshold, values_from = .estimate, names_from = .metric) %>% 
  mutate(Cost = ((1-sens) * 510 * 149) + ((1-spec) * 1991 * 99),
         j_index = (sens+spec)-1) %>% 
  ggplot(aes(y=Cost, x = .threshold)) +
    geom_line() +
    geom_point(aes(size = j_index, color = j_index)) +
    geom_vline(xintercept = 0.47, lty = 2) +
    annotate(x = 0.36, y=100000, geom = 'text', label = 'Best Class Differentiation\nJ-Index = 0.56,\nCost = $57,629,\nThreshold = 0.47') +
    geom_vline(xintercept = 0.69, lty = 2) +
    annotate(x = 0.81, y = 100000, geom = 'text', label = 'Lowest Cost Model\nJ-Index = 0.48,\nCost = $48,329,\nThreshold = 0.69') +    
    theme_minimal() +
    scale_colour_viridis_c() +
    labs(title = 'Decision Threshold Attrition Cost Function', 
         subtitle = 'Where Cost(FN) = $149 & Cost(FP) = $99',
         x = 'Classification Threshold', size = 'J-Index', color = 'J-Index')

Interestingly, given our cost assumptions, the lowest cost threshold is 0.69, which unsurprisingly increases specificity (the true negative rate) at the cost of sensitivity (the true positive rate). We visualise the confusion matrices below.

t1 <- xgb_pred %>% 
  mutate(.pred = make_two_class_pred(.pred_0, levels(Exited), threshold = 0.5)) %>%
  conf_mat(estimate = .pred, Exited) %>% 
  autoplot(type = 'heatmap') + 
  scale_fill_gradient2() +
  labs(title = 'Default Decision Threshold = 0.50')
t2 <- xgb_pred %>% 
  mutate(.pred = make_two_class_pred(.pred_0, levels(Exited), threshold = 0.47)) %>%
  conf_mat(estimate = .pred, Exited) %>% 
  autoplot(type = 'heatmap') + 
  scale_fill_gradient2() +
  labs(title = 'With Adjusted Decision Threshold = 0.47')
t3 <- xgb_pred %>% 
  mutate(.pred = make_two_class_pred(.pred_0, levels(Exited), threshold = 0.69)) %>%
  conf_mat(estimate = .pred, Exited) %>% 
  autoplot(type = 'heatmap') + 
  scale_fill_gradient2() +
  labs(title ='With Adjusted Decision Threshold = 0.69')
t2 / t1 / t3 +
  plot_annotation(title = 'Confusion Matrices for UPSAMPLE_Boosted_Trees')

Concluding Remarks

We've completed a decision threshold analysis using the probably package and constructed a hypothetical scenario analysis. Given our assumptions, the lowest cost model reduces model performance. This is the trade-off the business needs to consider: productionise an effective model that differentiates the classes moderately well, or go with the lower cost option despite needing more interventions due to a greater number of false positive predictions.


Thank you for reading this article, and I hope you enjoyed it. I write these to teach myself something, and I hope you’ve learnt something too. If you’re not a Medium member – use my referral link below and get regular updates on new publications from myself and other fantastic Medium authors.

Join Medium with my referral link – Murray Gillin

The post Bank Customer Churn with Tidymodels - Part 2 Decision Threshold Analysis appeared first on Towards Data Science.

Bank Customer Churn with Tidymodels – Part 1 Model Development https://towardsdatascience.com/bank-customer-churn-with-tidymodels-part-1-model-development-cdec4eeb131c/ Wed, 29 Dec 2021 11:17:22 +0000 https://towardsdatascience.com/bank-customer-churn-with-tidymodels-part-1-model-development-cdec4eeb131c/ Exploring Imbalanced Classification with Tidymodels

Exploring imbalanced classification with Tidymodels

Imagine you're a data scientist at a large multi-national bank and the Chief Customer Officer approaches you to develop a means of predicting customer churn. You develop a snapshot dataset of 10,000 customers, with a 1:4 class imbalance in favour of customers not leaving, to train such a binary classification model. To assist in model development, you decide to investigate various sampling techniques that might help with the class imbalance.

Photo by Jorge Fernández Salas on Unsplash
Photo by Jorge Fernández Salas on Unsplash

This multi-part series will cover the following topics in the process of developing and explaining a model, with a focus on translating model output to business outcomes and communicating to senior stakeholders.

The end goal is to create a model that enables the bank to target current customers that might be classified as churning and apply some intervention to prevent that churn. Interventions come at a cost so we’ll seek to balance the false negative rate against the false positive rate. We’ll develop a cost function and threshold analysis in Part 2 with the probably package.

Part 3 will focus on understanding the region of variable space where churn is more predominant, using the DALEX/DALEXtra XAI packages to understand, at both local and global levels, the leading factors behind customer churn.

In Part 4 we will take a different approach and apply survival analysis methods to this dataset, as it is right-censored and features a time to event (Tenure) and an outcome, utilising the recently published survival analysis features in tidymodels.


Load Packages

library(tidymodels)
library(themis) #Recipe functions to deal with class imbalances
library(tidyposterior) #Bayesian Resampling Comparisons
library(baguette) #Bagging Model Specifications
library(corrr) #Correlation Plots
library(readr) #Read .csv Files
library(magrittr) #Pipe Operators
library(stringr) #String Manipulation
library(forcats) #Handling Factors
library(skimr) #Quick Statistical EDA
library(patchwork) #Create ggplot Patchworks
library(GGally) #Pair Plots
options(yardstick.event_first = FALSE) #Evaluate second factor level as factor of interest for yardstick metrics

Load Data

Data is taken from https://www.kaggle.com/shivan118/churn-modeling-dataset (License CC0: Public Domain) and loaded as below.

train <- read_csv("Churn_Modelling.csv") %>% 
  select(-c(Surname, RowNumber, CustomerId))

Exploratory Data Analysis

I like the skimr package to quickly provide a summary of all dataset variables.

skim(train)
Output from Skim (Image by Author)

Our target variable, Exited, has an approximate 4:1 ratio between the two possible outcomes, where Exited = 1 refers to customer churn; a quick count (sketched below) makes this explicit. To visualise the full dataset we use the custom function further below.
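
A minimal sketch of that count, reusing the train tibble loaded above:

# Roughly 4:1 split between retained (Exited = 0) and churned (Exited = 1) customers
train %>% 
  count(Exited) %>% 
  mutate(prop = n / sum(n))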

viz_by_dtype <- function (x,y) {
  title <- str_replace_all(y,"_"," ") %>% 
           str_to_title()
  if ("factor" %in% class(x)) {
    ggplot(train, aes(x, fill = x)) +
      geom_bar() +
      theme_minimal() +
      theme(legend.position = "none",
            axis.text.x = element_text(angle = 45, hjust = 1),
            axis.text = element_text(size = 8)) +
      scale_fill_viridis_d()+
      labs(title = title, y = "", x = "")
  }
  else if ("numeric" %in% class(x)) {
    ggplot(train, aes(x)) +
      geom_histogram()  +
      theme_minimal() +
      theme(legend.position = "none") +
      scale_fill_viridis_d() +
      labs(title = title, y = "", x = "")
  } 
  else if ("integer" %in% class(x)) {
    ggplot(train, aes(x)) +
      geom_histogram() +
      theme_minimal() +
      theme(legend.position = "none") +
      scale_fill_viridis_d()+
      labs(title = title, y = "", x = "")
  }
  else if ("character" %in% class(x)) {
    ggplot(train, aes(x, fill = x)) +
      geom_bar() +
      theme_minimal() +
      scale_fill_viridis_d() +
      theme(legend.position = "none",
            axis.text.x = element_text(angle = 45, hjust = 1),
            axis.text = element_text(size = 8)) +
      labs(title = title, y  ="", x= "")
  }
}
variable_list <- colnames(train) %>% as.list()
variable_plot <- map2(train, variable_list, viz_by_dtype) %>%
  wrap_plots(               
    ncol = 3,
    heights = 150,
    widths = 150)
ggsave("eda.png", dpi = 600)
Output from viz_by_dtype function (Image by Author)

From the above we get a better understanding of the distribution of continuous variables and counts of discrete variables.

  • Credit Score is approximately normally distributed
  • Geography is split across three countries, with France being predominant
  • Gender is almost evenly split
  • Age is approximately right-skewed normally distributed
  • Tenure has no apparent distribution, with the bulk of customers staying between 2–9 years
  • Balance is normally distributed with a large number of customers with a zero balance
  • Most customers either have 1 or 2 products
  • Has Credit Card indicates 70% of customers have a credit card
  • Is Active Member shows that 51.5% of customers are active users
  • Estimated Salary shows no apparent distribution

Bivariate Numeric Analysis

Now we'll seek to understand if there are any relationships between the numeric variables using GGally::ggpairs().

ggpairs(train %>% 
          select(-c(HasCrCard,IsActiveMember,NumOfProducts, Gender, Geography)) %>% 
          drop_na() %>% 
          mutate(Exited = if_else(Exited == 1, "Y","N")), ggplot2::aes(color = Exited, alpha = 0.3)) + 
  scale_fill_viridis_d(end = 0.8, aesthetics = c("color", "fill")) + 
  theme_minimal() +
  labs(title = "Numeric Bivariate Analysis of Customer Churn Data")
Numeric Pairplots by Target (Image by Author)

The only thing worth noting in the above is the rightward shift in age for customers who have churned, indicating older customers may have a greater likelihood of leaving.

Categorical Variable Analysis

The next step is to establish whether there are any relationships between the categorical variables and the target.

The below describes the creation of a summary dataframe that calculates the mean and 95% confidence interval for each categorical variable and the target variable.

train %>% 
  mutate(Exited = if_else(Exited == 1, "Y", "N"),
         HasCrCard = if_else(HasCrCard == 1, "Y", "N"),
         IsActiveMember = if_else(IsActiveMember == 1, "Y", "N"),
         NumOfProducts = as.character(NumOfProducts)) %>% 
  select(Exited,where(is.character)) %>% 
  drop_na() %>% 
  mutate(Exited = if_else(Exited == "Y",1,0)) %>% 
  pivot_longer(2:6, names_to = "Variables", values_to = "Values") %>% 
  group_by(Variables, Values) %>% 
    summarise(mean = mean(Exited),
              conf_int = 1.96*sd(Exited)/sqrt(n())) %>% 
  ggplot(aes(x=Values, y=mean, color=Values)) +
    geom_point() +
    geom_errorbar(aes(ymin = mean - conf_int, ymax = mean + conf_int), width = 0.1) +
    theme_minimal() +
    theme(legend.position = "none",
        axis.title.x = element_blank(),
        axis.title.y = element_blank()) +
   scale_color_viridis_d(aesthetics = c("color", "fill"), end = 0.8) +
   facet_wrap(~Variables, scales = 'free') +
   labs(title = 'Categorical Variable Analysis', subtitle = 'With 95% Confidence Intervals')
Categorical Variable Analysis with 95% Confidence Intervals (Image by Author)

Of note, we see that gender, non-active membership, the number of products and geography show a significantly different propensity to churn. Conversely, whether or not a customer has a credit card shows no significant impact on churn likelihood. This should be taken with a slight grain of salt given the class imbalance.

Model Development

Data Splitting – rsample

Using rsample::initial_split(), we specify a split of the training data 3:1.

set.seed(246)
cust_split <- initial_split(train, prop = 0.75, strata = Exited)

Model Specifications – Parsnip & Baguette

We are going to constrain the range of model specifications to tree-based models of increasing complexity, using the parsnip and baguette (for bagged trees) packages. Each model specifies its respective hyperparameters as tune() for screening in the next step.

dt_cust <- 
decision_tree(cost_complexity = tune(), tree_depth = tune(), min_n = tune()) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")
rf_cust <- 
rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>% 
  set_engine("ranger", importance = "impurity") %>% 
  set_mode("classification")
xgboost_cust <- 
boost_tree(mtry = tune(), trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), sample_size = tune())  %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")
bagged_cust <- 
bag_tree(cost_complexity = tune(), tree_depth = tune(), min_n = tune()) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")

Feature Engineering – Recipes

Next we'll specify the feature engineering steps using the recipes package. We will use this stage to develop 10 recipes that each apply a different sampling technique to handle the class imbalance. The themis package provides recipe steps to facilitate these sampling techniques.

  • SMOTE – Synthetic Minority Oversampling Technique
  • ROSE – Random Over-Sampling Examples
  • BSMOTE – Borderline Synthetic Minority Oversampling Technique
  • UPSAMPLING – Add duplicate minority class data to specified ratio with majority class
  • ADASYN – Adaptive Synthetic Oversampling
  • TOMEK – Remove TOMEK links in Majority Class
  • NEARMISS – Remove Majority Class Instances by Undersampling
  • NOSAMPLING – No sampling procedure
  • SMOTE-DOWNSAMPLING – Generate synthetic minority instances and remove majority instances
  • ROSE-DOWNSAMPLING – Oversample minority instances and downsample majority
recipe_template <-
    recipe(Exited ~., data = training(cust_split)) %>% 
    step_integer(HasCrCard, IsActiveMember, zero_based = T) %>% 
    step_integer(NumOfProducts) %>% 
    step_mutate(SalaryBalanceRatio = EstimatedSalary/Balance,
              CreditScoreAgeRatio = CreditScore/Age,
              TenureAgeRatio = Tenure/Age,
              SalaryBalanceRatio = if_else(is.infinite(SalaryBalanceRatio),0,SalaryBalanceRatio)) %>% 
    step_scale(all_numeric_predictors(), -c(HasCrCard, Age, IsActiveMember, NumOfProducts)) %>% 
    step_dummy(all_nominal_predictors()) %>% 
    step_samplingmethod(Exited) #Change or Add Sampling Steps Here as Necessary

In the above we've engineered additional features by taking quotients of the different continuous variables.
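
Since step_samplingmethod() above is only a placeholder, here is a hedged sketch of what one concrete variant, the SMOTE recipe (recipe_1 in the list that follows), might look like using themis::step_smote(); the other nine recipes differ only in that final step, and note step_smote() expects the outcome to be a factor.

recipe_1 <-
    recipe(Exited ~., data = training(cust_split)) %>% 
    step_integer(HasCrCard, IsActiveMember, zero_based = T) %>% 
    step_integer(NumOfProducts) %>% 
    step_mutate(SalaryBalanceRatio = EstimatedSalary/Balance,
              CreditScoreAgeRatio = CreditScore/Age,
              TenureAgeRatio = Tenure/Age,
              SalaryBalanceRatio = if_else(is.infinite(SalaryBalanceRatio),0,SalaryBalanceRatio)) %>% 
    step_scale(all_numeric_predictors(), -c(HasCrCard, Age, IsActiveMember, NumOfProducts)) %>% 
    step_dummy(all_nominal_predictors()) %>% 
    step_smote(Exited) #Synthetic minority oversampling of the churn class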

Correlation Plot – corrr

We visualise the correlation plot of the entire trained dataset, using the recipe with no sampling techniques applied.

cust_train <- recipe_8 %>% prep() %>% bake(new_data = NULL)
cust_test <- recipe_8 %>% prep() %>% bake(testing(cust_split))
cust_train %>% 
  bind_rows(cust_test) %>% 
  mutate(Exited = as.numeric(Exited)) %>% 
  correlate() %>%
  rplot(print_cor = T, .order = "alphabet") +
    scale_color_gradient2(low = 'orange', high = 'light blue') + 
    theme(axis.text.x = element_text(angle = 90)) +
    labs(title = "Correlation Plot for Trained Dataset")
Correlation Plot for Trained Dataset (Image by Author)

Nothing surprising in the above: negative correlations between age and the derived ratios that we generated. Exited has a positive correlation with age, a slight negative correlation with CreditScoreAgeRatio and a slight positive correlation with SalaryBalanceRatio.

Workflow Map – workflowsets

Using the workflowsets package we can generate a list of 40 workflows from the parsnip and recipes combinations AND screen 20 hyperparameter combinations for each workflow. We create a 5-fold cross-validation object using rsample::vfold_cv(), with the strata argument set to Exited so that each fold has a consistent ratio of the target variable levels.

recipe_list <- 
list(SMOTE = recipe_1, ROSE = recipe_2, BSMOTE = recipe_3, UPSAMPLE = recipe_4, ADASYN = recipe_5, TOMEK=recipe_6, NEARMISS = recipe_7, NOSAMPLING = recipe_8, SMOTEDOWNSAMPLE= recipe_9, ROSEDOWNSAMPLE = recipe_10)
model_list <- 
list(Decision_Tree = dt_cust, Boosted_Trees = xgboost_cust, Random_Forest = rf_cust, Bagged_Trees = bagged_cust)
wf_set <- 
workflow_set(preproc = recipe_list, models = model_list, cross = T)
set.seed(246)
train_resamples <- 
vfold_cv(training(cust_split), v = 5, strata = Exited)
class_metric <- metric_set(accuracy, f_meas, j_index, kap, precision, sensitivity, specificity, roc_auc, mcc, pr_auc)
doParallel::registerDoParallel(cores = 12)
wf_sample_exp <- 
  wf_set %>% 
  workflow_map(resamples = train_resamples, 
               verbose = TRUE, 
               metrics = class_metric, 
               seed = 246)

We use the yardstick::metric_set() function to create a custom set of metrics for evaluation. These custom metrics are passed along with the wf_set object to workflow_map() to screen all 40 workflows and output all of the metric calculations for each workflow. The resulting workflow_set object, wf_sample_exp, can now be analysed and used for model comparisons.

This is quite computationally taxing, so I recommend firing up all available cores to ease this along.
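
As a quick alternative to the manual ranking below, workflowsets also provides rank_results() to order every workflow by a chosen metric; a small sketch:

# Rank all 40 workflows by mean J-index across resamples, keeping only the best
# hyperparameter configuration per workflow
rank_results(wf_sample_exp, rank_metric = "j_index", select_best = TRUE) %>% 
  filter(.metric == "j_index")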

collect_metrics(wf_sample_exp) %>% 
  separate(wflow_id, into = c("Recipe", "Model_Type"), sep = "_", remove = F, extra = "merge") %>% 
  group_by(.metric) %>% 
  select(-.config) %>% 
  distinct() %>%
  group_by(.metric, wflow_id) %>% 
  filter(mean == max(mean)) %>% 
  group_by(.metric) %>% 
  mutate(Workflow_Rank =  row_number(-mean),
         .metric = str_to_upper(.metric)) %>% 
  arrange(Workflow_Rank) %>% 
  ggplot(aes(x=Workflow_Rank, y = mean, color = Model_Type)) +
    geom_point(aes(shape = Recipe)) +
    scale_shape_manual(values = 1:n_distinct(recipe_list)) +
    geom_errorbar(aes(ymin = mean-std_err, ymax = mean+std_err)) +
    theme_minimal() +
    scale_color_viridis_d() +
    labs(title = "Performance Comparison of Workflows", x = "Workflow Rank", y="Error Metric", color = "Model Types", shape = "Recipes") +
    facet_wrap(~.metric,scales = 'free_y',ncol = 4)
Comparison of all workflows by various classification metrics. (Image by Author)

The purpose of the above is to demonstrate just a sample of classification metrics that one might look at with an imbalanced dataset. We want a model that sufficiently differentiates between classes, but given our particular problem we need to minimise false negatives (customers that churn but are predicted otherwise). Part 2 discusses a cost function to find balance between cost of intervention and class differentiation.

Given the class imbalance, ROC AUC and accuracy are not appropriate metrics. We consider Precision-Recall AUC, KAP, J-index, Matthews Correlation Coefficient and specificity. With this in mind, UPSAMPLE_Boosted_Trees is a strong candidate with good results across all of these metrics.

Bayesian Model Comparison – tidyposterior

Focusing on the J-index, we can compare the resampling posterior distributions using tidyposterior. tidyposterior::perf_mod() takes the wf_sample_exp object that contains the results from workflow_map() and completes a Bayesian comparison of the resamples, generating posterior distributions of the mean of the metric of interest. N.B. the workflow_set object MUST have the target metric calculated, else this will not work.

jindex_model_eval <- 
  perf_mod(wf_sample_exp, metric = "j_index", iter = 5000)
jindex_model_eval %>% 
  tidy() %>% 
  mutate(model = fct_inorder(model)) %>% 
  separate(model, into = c("Recipe", "Model_Type"), sep = "_", remove = F, extra = "merge") %>% 
  ggplot(aes(x=posterior, fill = Model_Type)) +
    geom_density(aes(alpha = 0.7)) +
    theme_minimal() +
    scale_fill_viridis_d(end = 0.8) +
    facet_wrap(~Recipe, nrow = 10) +
    labs(title = "Comparison of Posterior Distributions of Model Recipe Combinations", 
       x = expression(paste("Posterior for Mean J Index")), 
       y = "")
Comparison of Workflows and Impact of Sampling Procedure on J-Index. (Image by Author)

There is a lot to unpack in the above, but in short it demonstrates the impact of the sampling procedure used to handle class imbalance on the J-index. This can be completed for any metric, and I think it is a useful exercise for exploring the impact of different up- and downsampling procedures. The sampling procedure should be treated like a hyperparameter: the ideal choice will differ depending on the nature of the dataset, the feature engineering and the metrics of interest. We note that boosted trees perform generally well, alongside the bagged tree methods, depending on the sampling procedure.
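
Taking this one step further, tidyposterior can also contrast two specific workflows and summarise the posterior for their difference in mean J-index, optionally against a region of practical equivalence (ROPE). A hedged sketch, assuming the workflow IDs follow the recipe_model naming used above:

# Posterior difference in mean J-index between the upsampled and unsampled XGBoost workflows
jindex_diff <- contrast_models(jindex_model_eval,
                               list_1 = "UPSAMPLE_Boosted_Trees",
                               list_2 = "NOSAMPLING_Boosted_Trees")
# Summarise against a +/- 0.02 region of practical equivalence
summary(jindex_diff, size = 0.02)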

Concluding Remarks

We have demonstrated the model development process for a binary classifier with a 4:1 imbalanced dataset. Our best performing workflow features an XGBoost model with an upsampling procedure to level the ratio of classes. In Part 2 we will fit the model and complete a decision threshold analysis using the probably package, developing two scenarios – change the decision threshold to either minimise the cost of churn and intervention OR enable better class differentiation. Part 2 will focus heavily on translating the model output for business stakeholders, enabling them to make an informed decision around the cost and risk of customer churn.


Thank you for reading, and please keep an eye out for the subsequent parts. I hope you enjoyed reading this as much as I enjoyed writing it. If you're not a Medium member – use my referral link below and get regular updates on new publications from myself and other fantastic Medium authors.

Join Medium with my referral link – Murray Gillin

The post Bank Customer Churn with Tidymodels – Part 1 Model Development appeared first on Towards Data Science.

5 Tips for Analysts (and their Managers) https://towardsdatascience.com/5-tips-for-analysts-and-their-managers-4327ef6e9e13/ Fri, 27 Aug 2021 12:01:11 +0000 https://towardsdatascience.com/5-tips-for-analysts-and-their-managers-4327ef6e9e13/ Add more value as an analyst and ways managers can support and retain talent

Office Hours

I've worked professionally as an analyst for just over two years in large retail organisations. Depending on the company you work for, analysts are either seen as an asset or just a numbers grunt, and this pigeonholing, in my experience, is driven by the insecurity of some managers when it comes to being more data driven rather than making decisions based on gut feeling, otherwise known as the HiPPO (Highest Paid Person's Opinion).

Photo by Isaac Smith on Unsplash

The inspiration for this article stems from that one weekly meeting we all sit through where various teams present their function's KPI metrics for the previous week. You know the meeting, where managers explain their figures with a spreadsheet prepared by an analyst in their team, with KPI boxes or, worse, slabs of unformatted tables. Is anyone paying attention, and how can we do better? Here I provide 5 tips for analysts to increase the value they bring to their team, and advice for managers to better engage with their analysts.

Analyst Tip #1 – Know your end user

The life of an analyst is tough. We need to be strong communicators, extremely logical thinkers and creative. Before you go diving into your data warehouse (or data lake), spend some time considering your end user's needs. This may even mean setting aside time to take a brief from them and ask meaningful questions, because in the end this will make writing that query much easier. Whatever they've asked for, add another 50% in scope. Explore adjacent datasets that may add more value to your analysis. I always have 'what if?' circling in my mind when writing queries for new analysis.

Manager Tip: Be very specific about what you're asking your analyst to do, and be patient as they seek to flesh out your idea. Set clear expectations and give reasonable timelines for delivery. Treat this as a collaboration and not an exercise in top-down management.

Analyst Tip #2 – Be impactful with visual analytics

Data visualisation is an art form. Most people are visual, and the endeavour of an analyst is to make insight easily perceptible to their target audience by using well designed visualisations. So many times I have seen people put four weeks' worth of figures in a table when the audience would perceive changes more easily with a time series chart. Visualisations should be designed with a specific question in mind that they are seeking to answer. Understanding trends over time is incredibly important for managers, and understanding how and why trends are shifting or staying the same is even more important. Reporting aggregated figures tends to hide shifts in the underlying sands that simple KPI reporting doesn't reflect; tell the story with a series of well designed visualisations.

Manager Tip: Be brave and turn up to your next stand up with a handful of visualisations that report the same figures with better detail and improve engagement with your audience. It may prompt a discussion that leads to better business outcomes.

Analyst Tip #3 – Develop flexible and dynamic dashboards

An extension of the above is the dashboard design process. Who doesn't love a good dashboard? Dashboards are collections of metrics and visualisations, with the intention of providing deeper analysis and enabling underlying trends not immediately visible in KPIs to be perceived. I've worked with Tableau, Microstrategy, PowerBI and Quicksight, and by and large they are all variations on the same theme. They become even more powerful when you start developing means of interactivity, usually in the form of controllable filters and parameters.

Parameters are my absolute favourite, particularly when working with time series data: being able to change the dimension that the group-by clause uses, or how the metric is calculated, is so powerful. A recent dashboard I've developed enables you to switch the time series between week-on-week figures, cumulative year to date, percent change week on week and a 4-week moving average. Dashboards should be easy and intuitive to use, and enable your customer to answer an array of questions using the data you've curated. Design is an incredibly thoughtful and creative process and shouldn't be rushed. Look back at your work from time to time and consider ways it can be improved.

Manager Tip: Spend quality time with your analysts so that they can explain the features of a new dashboard and how you can use it most effectively. Make sure to understand the scope of the questions it can answer, and provide meaningful feedback. It’s a collaboration.

Analyst Tip #4 – Don’t enable access to raw data

This is a pet peeve of mine: a manager doesn't trust the figures you've presented, so they insist on seeing the data or having raw data visible in a dashboard. What's the point really? The point, actually, is that you haven't done your job properly. If you're ever approached to provide large datasets, always ask "What are you seeking to understand?". The last thing you want is a manager or colleague wasting their time playing with a spreadsheet trying to find something that isn't there, or worse, is there and you haven't surfaced it. For the sake of transparency, if someone would like to understand how the dataset was generated, explain (where reasonable) the table logic. This goes back to our first tip: properly understand the data you've collated after you've understood your end user's needs.

Manager Tip: Don’t ask to see raw data, use this as an opportunity for feedback and explore with your analyst the question you’d like answered. 9 times out of 10 it may require a further query to be generated.

Analyst Tip #5 – Value yourself

Sometimes analysts don't get the praise or acknowledgement we deserve. We slave over complex queries, work with data engineers and architects so that we can access additional data sources, and develop visualisations and dashboards. The last thing you want after all that effort is to not be valued by your team and management. What we do takes time, but analytics done reproducibly adds tremendous value to our respective businesses. It has never been a better time to be an analyst: the next wave of technological uplift is being accelerated with the impacts of COVID-19 being felt worldwide, and businesses are shifting to better utilise their most precious asset, data. So value yourself and your skills, and if you feel as if you aren't being valued, (discreetly) find another job. You've got nothing to lose.

Manager Tip: Keep your analysts engaged in meaningful work, and facilitate their ongoing development. Find them an internal mentor and/or if your organisation can afford to do so, enable further training. The job market is going through a generational shift and is increasingly competitive as organisations seek to grow their internal analytics capabilities. Employee churn is generally the result of poor management and organisational culture.

Concluding Remarks

Having worked in environments that take different views on the role of the analyst, here's mine. Analysts are inherently business partners: the interface between our line of business and data. You are playing a growing and important role in your organisation's future direction. You enable data-driven decision making in how you communicate your findings and enable self-service analytics. You have a unique understanding of business systems and see links that others may not. You will collaborate across functional areas and create professional networks internal and external to your organisation. Think about ways you can add value and how you can work smarter. Lastly, be visible and take credit for your work.

To the managers who read along, I recommend establishing a regular 1:1 meeting to maintain open lines of communication with your analyst. Collaborate constructively with them and they’ll provide you with the data firepower you require to drive business outcomes.

Today's analysts are tomorrow's business leaders; informed by data and unique business understanding, we represent a new generation of managers that will make better and faster decisions.

The post 5 Tips for Analysts (and their Managers) appeared first on Towards Data Science.

Using Workflow Sets to Screen and Compare Model-Recipe Combinations for Bank Loan Classification https://towardsdatascience.com/using-workflow-sets-to-screen-and-compare-model-recipe-combinations-for-bank-loan-classification-fcaad2853290/ Sat, 17 Jul 2021 17:31:09 +0000 https://towardsdatascience.com/using-workflow-sets-to-screen-and-compare-model-recipe-combinations-for-bank-loan-classification-fcaad2853290/ Screening A Series of Model Types and Feature Engineering Steps for a Classification Problem with Tidymodels

Consider you are a data scientist at a large bank, and your CDO has instructed you to develop a means of automating bank loan decisions. You decide that this should be a binary classifier and proceed to collect several hundred data points. But what kind of model should you start off with, and what sort of feature engineering should you complete? Why not screen multiple combinations?

This project will use the loan prediction dataset to exemplify the use of workflow_sets() within the Tidymodels machine learning ecosystem. Screening a series of models and feature engineering pipelines is important to avoid any internal bias towards a particular model type, a point underscored by the "No Free Lunch" theorem. If you're interested in learning more about the Tidymodels ecosystem before reading on, please check out my previous article.

Loan Prediction Data has been taken from https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/.

Photo by Iro Klg on Unsplash

Load Packages and Data

library(tidymodels) #ML meta packages
library(themis) #Recipe functions to deal with class imbalances
library(discrim) #Naive Bayes Models
library(tidyposterior) #Bayesian Performance Comparison
library(corrr) #Correlation Viz
library(readr) #Read Tabular Data
library(magrittr) #Pipe Operators
library(stringr) #Work with Strings
library(forcats) #Work with Factors
library(skimr) #Data Summary
library(patchwork) #ggplot grids
library(GGally) #Scatterplot Matrices
train <- read_csv("train_ctrUa4K.csv")
train %<>% rename(Applicant_Income = ApplicantIncome,
                  CoApplicant_Income = CoapplicantIncome,
                  Loan_Amount = LoanAmount) 

Exploratory Data Analysis

Before we start screening models, let’s complete our due diligence with some basic exploratory data analysis.

We use the skimr package to produce the output below.

skim(train)
Output of skimr (Image by Author)

Furthermore, we can visualise each variable with an appropriate plot type using the following function.

#Automated Exploratory Data Analysis
viz_by_dtype <- function (x,y) {
  title <- str_replace_all(y,"_"," ") %>% 
           str_to_title()
  if ("factor" %in% class(x)) {
    ggplot(train, aes(x, fill = x)) +
      geom_bar() +
      theme_minimal() +
      theme(legend.position = "none",
            axis.text.x = element_text(angle = 45, hjust = 1),
            axis.text = element_text(size = 8)) + #theme() after theme_minimal() so it isn't overridden
      scale_fill_viridis_d()+ #discrete palette for a discrete fill
      labs(title = title, y = "", x = "")
  }
  else if ("numeric" %in% class(x)) {
    ggplot(train, aes(x)) +
      geom_histogram()  +
      theme_minimal() +
      scale_fill_viridis_c()+
      labs(title = title, y = "", x = "")
  } 
  else if ("integer" %in% class(x)) {
    ggplot(train, aes(x)) +
      geom_histogram() +
      theme_minimal() +
      scale_fill_viridis_c()+
      labs(title = title, y = "", x = "")
  }
  else if ("character" %in% class(x)) {
    ggplot(train, aes(x, fill = x)) +
      geom_bar() +
      theme_minimal() +
      scale_fill_viridis_d()+
      theme(legend.position = "none",
            axis.text.x = element_text(angle = 45, hjust = 1),
            axis.text = element_text(size = 8)) +
      labs(title = title, y  ="", x= "")
  }
}
variable_plot <- map2(train, colnames(train), viz_by_dtype) %>%
  wrap_plots(ncol = 3,
             nrow = 5)
Visual EDA for Loan Dataset (Image by Author)
  • Gender imbalance, high proportion of male applicants, with NAs present
  • Marital Status, almost 3:2 married applicants to unmarried, with NAs present
  • Majority of applicants do not have children
  • Majority of applicants are university graduates
  • Majority are not self-employed
  • Applicant Income is right skewed
  • Co-applicant Income is right skewed
  • Loan Amount is right skewed
  • Loan Amount Term is generally 360 months
  • Majority of applicants have a credit history, with NAs present
  • Mix of Property Area Types
  • Majority of applicants had their application approved; the target variable Loan_Status is imbalanced roughly 3:2

Bivariate Data Analysis

Quantitative Variables

Using GGally::ggpairs(), we generate a variable matrix plot for all numeric variables and coloured by Loan_Status.

#Correlation Matrix Plot
ggpairs(train %>% select(7:10,13), ggplot2::aes(color = Loan_Status, alpha = 0.3)) + 
  theme_minimal() + 
  scale_fill_viridis_d(aesthetics = c("color", "fill"), begin = 0.15, end = 0.85) +
  labs(title = "Numeric Bivariate Analysis of Loan Data")
Bivariate Analysis of Numeric Predictors and Loan Status (Image by Author)

Qualitative Variables

The below visualisation was generated iteratively from a summary dataframe, created as below, changing the character variable of interest each time.

#Generate Summary Variables for Qualitative Variables
summary_train <- 
  train %>% 
  select(where(is.character),
         -Loan_ID) %>% 
  drop_na() %>% 
  mutate(Loan_Status = if_else(Loan_Status == "Y",1,0)) %>% 
  pivot_longer(1:6, names_to = "Variables", values_to = "Values") %>% 
  group_by(Variables, Values) %>% 
    summarise(mean = mean(Loan_Status),
              conf_int = 1.96*sd(Loan_Status)/sqrt(n())) %>% 
  pivot_wider(names_from = Variables, values_from = Values)
summary_train %>% select(Married, mean, conf_int) %>% 
  drop_na() %>% 
  ggplot(aes(x=Married, y = mean, color = Married)) +
  geom_point() +
  geom_errorbar(aes(ymin = mean - conf_int, ymax = mean + conf_int), width = 0.1) +
  theme_minimal() +
  theme(legend.position = "none",
        axis.title.x = element_blank(),
        axis.title.y = element_blank()) +
  scale_colour_viridis_d(aesthetics = c("color", "fill"), begin = 0.15, end = 0.85) +
  labs(title="Married")
Proportion of a Successful Loan Application for Each Categorical Variable with 95% Confidence Intervals (Image by Author)

The benefit of having completed a visualisation like the above is that it gives an understanding of both the magnitude of the difference in loan application success and the uncertainty, visualised by the 95% confidence intervals for each categorical variable. From this we note the following:

  • Married applicants tend to have a greater chance of their application being approved.
  • Graduate applicants have a greater chance of success, maybe because of more stable/higher incomes.
  • Similarly, there is far greater variation in the success of applicants who work for themselves, likely because of variation in income stability
  • Number of children doesn’t appear to have a great impact on loan status
  • Female customers show much greater variability in application success, while male customers are more consistent. We need to be careful of this so as to not create ingrained biases in the algorithm.
  • Semiurban, or Suburban has the highest success proportion.
  • Lastly, there is a significant difference in success rates between customers with and without a Credit History.

If we wanted to, we could explore these relationships further with ANOVA/Chi Squared testing.
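
As a quick illustration, a minimal sketch of a chi-squared test between Credit_History and Loan_Status (using base R's chisq.test) could look like the following.

#Sketch: test the association between Credit_History and Loan_Status
train %>% 
  filter(!is.na(Credit_History)) %>% 
  with(table(Credit_History, Loan_Status)) %>% 
  chisq.test()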


Data Splitting – rsamples

We’ve split the data 80:20, stratified by Loan_Status. Each split tibble can be accessed by calling training() or testing() on the mc_split object.

#Split Data for Testing and Training
set.seed(101)
loan_split <- initial_split(train, prop = 0.8, strata = Loan_Status)
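
As mentioned above, each partition can be pulled straight from the split object; a quick sketch to confirm the sizes:

#Sketch: access the training and testing partitions from the split object
dim(training(loan_split))
dim(testing(loan_split))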

Model Development – parsnip

As noted above, we want to screen an array of models and home in on a couple that would be good candidates. Below we’ve initialised a series of standard classification models.

#Initialise Seven Models for Screening
nb_loan <- 
  naive_Bayes(smoothness = tune(), Laplace = tune()) %>% 
  set_engine("klaR") %>% 
  set_mode("classification")
logistic_loan <- 
  logistic_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet") %>% 
  set_mode("classification")
dt_loan <- decision_tree(cost_complexity = tune(), tree_depth = tune(), min_n = tune()) %>% 
  set_engine("rpart") %>% 
  set_mode("classification")
rf_loan <- 
  rand_forest(mtry = tune(), trees = tune(), min_n = tune()) %>% 
  set_engine("ranger") %>% 
  set_mode("classification")
knn_loan <- nearest_neighbor(neighbors = tune(), weight_func = tune(), dist_power = tune()) %>% 
  set_engine("kknn") %>% 
  set_mode("classification")
svm_loan <- 
  svm_rbf(cost = tune(), rbf_sigma = tune(), margin = tune()) %>% 
  set_engine("kernlab") %>% 
  set_mode("classification")
xgboost_loan <- boost_tree(mtry = tune(), trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(), loss_reduction = tune(), sample_size = tune())  %>% 
  set_engine("xgboost") %>% 
  set_mode("classification")

Feature Engineering – recipes

The below is a series of recipe configurations. How will the two different imputation methods compare against one another? In the recipes below we use either step_impute_mean (for numeric) and step_impute_mode (for nominal), or step_impute_bag (imputation using bagged trees), which can impute both numeric and nominal variables.

Note that with Credit_History we’ve elected not to impute this value; instead we’ve recoded it to three integer values: 1 where a credit history exists, -1 where it doesn’t, and 0 where it is unknown.

As Loan_Status is imbalanced, we’ve applied SMOTE (Synthetic Minority Over-sampling Technique) to recipes 2 and 4 (a quick check of the resulting class balance is sketched after the recipe definitions below). We’ll complete a workflow comparison to understand any difference in accuracy between the imputation strategies.

#Initialise Four Recipes
recipe_1 <- 
  recipe(Loan_Status ~., data = training(loan_split)) %>% 
  step_rm(Loan_ID) %>%
  step_mutate(Credit_History = if_else(Credit_History == 1, 1, -1,0)) %>% 
  step_scale(all_numeric_predictors(), -Credit_History) %>% 
  step_impute_bag(Gender, 
                  Married, 
                  Dependents, 
                  Self_Employed, 
                  Loan_Amount, 
                  Loan_Amount_Term) %>% 
  step_dummy(all_nominal_predictors())
recipe_2 <- 
  recipe(Loan_Status ~., data = training(loan_split)) %>% 
  step_rm(Loan_ID) %>%
  step_mutate(Credit_History = if_else(Credit_History == 1, 1, -1,0)) %>% 
  step_scale(all_numeric_predictors(), -Credit_History) %>% 
  step_impute_bag(Gender, 
                  Married, 
                  Dependents, 
                  Self_Employed, 
                  Loan_Amount, 
                  Loan_Amount_Term) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_smote(Loan_Status)

recipe_3 <- 
  recipe(Loan_Status ~., data = training(loan_split)) %>% 
  step_rm(Loan_ID) %>%
  step_mutate(Credit_History = if_else(Credit_History == 1, 1, -1,0)) %>%  
  step_scale(all_numeric_predictors(), -Credit_History) %>%  
  step_impute_mean(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors())
recipe_4 <- 
  recipe(Loan_Status ~., data = training(loan_split)) %>% 
  step_rm(Loan_ID) %>%
  step_mutate(Credit_History = if_else(Credit_History == 1, 1, -1,0)) %>% 
  step_scale(all_numeric_predictors(), -Credit_History) %>%  
  step_impute_mean(all_numeric_predictors()) %>%
  step_impute_mode(all_nominal_predictors()) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_smote(Loan_Status)
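
As flagged above, here is a quick sanity check on the SMOTE step: a sketch that preps recipe_2 and counts the (now balanced) classes in the processed training data.

#Sketch: confirm class balance of the processed training data after SMOTE (recipe_2)
recipe_2 %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  count(Loan_Status)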

Taking recipe_1, we can prepare both the training and test datasets per the transformations.

#Prep and Bake Training and Test Datasets
loan_train <- recipe_1 %>% prep() %>% bake(new_data = NULL)
loan_test <- recipe_1 %>% prep() %>% bake(testing(loan_split))

Then take these transformed datasets and visualise variable correlation.

#Generate Correlation Visualisation
loan_train %>% bind_rows(loan_test) %>% 
  mutate(Loan_Status = if_else(Loan_Status == "Y",1,0)) %>% 
              correlate() %>%
              rearrange() %>% 
              shave() %>% 
              rplot(print_cor = T,.order = "alphabet") +
                theme_minimal() +
                theme(axis.text.x = element_text(angle = 90)) +
                scale_color_viridis_c() +
                labs(title = "Correlation Plot for Trained Loan Dataset")
Correlation Plot for Trained Loan Dataset (Image by Author)

From the above we note a strong relationship between Loan_Status and whether an applicant has a Credit_History. This makes sense: banks are more inclined to give loans to customers with a credit history, which is evidence, positive or otherwise, of a customer’s ability to repay the loan.

Iterating Parsnip and Recipe Combinations – Workflow Sets

Workflow sets enable us to screen all possible combinations of recipe and parsnip model specifications. Below, we’ve created two named lists, one of recipes and one of models.

#Generate List of Recipes
recipe_list <- 
list(Recipe1 = recipe_1, Recipe2 = recipe_2, Recipe3 = recipe_3, Recipe4 = recipe_4)
#Generate List of Model Types
model_list <- 
list(Random_Forest = rf_loan, SVM = svm_loan, Naive_Bayes = nb_loan, Decision_Tree = dt_loan, Boosted_Trees = xgboost_loan, KNN = knn_loan, Logistic_Regression = logistic_loan)

Here comes the fun part: creating a series of workflows using workflow_set().

model_set <- workflow_set(preproc = recipe_list, models = model_list, cross = T)

Setting cross = T instructs workflow_set to create all possible combinations of parsnip model and recipe specification.

set.seed(2)
train_resamples <- bootstraps(training(loan_split), strata = Loan_Status)
doParallel::registerDoParallel(cores = 12)
all_workflows <- 
  model_set %>% workflow_map(resamples = train_resamples, 
                             verbose = TRUE)

We’ve initialised a bootstraps resampling procedure; this is computationally more demanding, but results in less error overall. Passing the model_set object to workflow_map, along with the resampling procedure, will initiate a screen of all parsnip-recipe combinations, showing an output like the one below. This is a demanding procedure, which is why we’ve registered parallel compute to ease things along.

Using the resampling procedure, workflow_map will fit each workflow to the training resamples, compute accuracy on each resample and take the mean. Furthermore, as each model specification flags hyperparameters for tuning with tune(), workflow_map will also perform some tuning (a grid search by default) to provide an optimal result.

Output from workflow_map (Image by Author)
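
Note that workflow_map passes additional arguments through to the underlying tuning function, so the size of the tuning grid and the metrics can be controlled explicitly. A hedged sketch (the values here are illustrative, not the settings used above):

#Sketch: explicitly controlling the tuning performed by workflow_map
all_workflows_custom <- 
  model_set %>% 
  workflow_map("tune_grid",
               resamples = train_resamples,
               grid = 10,                               #candidate parameter sets per workflow
               metrics = metric_set(accuracy, roc_auc), #metrics estimated on each resample
               verbose = TRUE)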

Once completed, we can call the all_workflows object and collect all the metrics that were calculated in this procedure, and visualise the results.

N.B. there is an autoplot() function available for workflow_set objects; however, we want to be able to compare recipe and model performance separately, so we’ve re-engineered the features for the visualisation below.

#Visualise Performance Comparison of Workflows
collect_metrics(all_workflows) %>% 
  separate(wflow_id, into = c("Recipe", "Model_Type"), sep = "_", remove = F, extra = "merge") %>% 
  filter(.metric == "accuracy") %>% 
  group_by(wflow_id) %>% 
  filter(mean == max(mean)) %>% 
  group_by(model) %>% 
  select(-.config) %>% 
  distinct() %>%
  ungroup() %>% 
  mutate(Workflow_Rank =  row_number(-mean),
         .metric = str_to_upper(.metric)) %>%
  ggplot(aes(x=Workflow_Rank, y = mean, shape = Recipe, color = Model_Type)) +
    geom_point() +
    geom_errorbar(aes(ymin = mean-std_err, ymax = mean+std_err)) +
    theme_minimal()+
    scale_colour_viridis_d() +
    labs(title = "Performance Comparison of Workflow Sets", x = "Workflow Rank", y = "Accuracy", color = "Model Types", shape = "Recipes")
Performance Comparison of All Workflows (Image by Author)

From the above we can observe:

  • Naive Bayes and KNN models have performed the worst
  • Tree-Based methods have performed very well
  • SMOTE has not offered an increase in performance for the more complex model types; there is no change between recipes for the decision tree model.

Whilst we’ve made some very general first-glance observations, given the slight differences in mean accuracy among, say, the top 8 workflows, how do we discern which are genuinely distinct and which are practically the same?

Post Hoc Analysis of Resampling – tidyposterior

One way to compare models is to examine the resampling results and ask: are the models actually different? The tidyposterior::perf_mod function enables comparison of all of our workflow combinations by generating a set of posterior distributions for a given metric, which can then be compared against each other for practical equivalence. Essentially, we can make between-workflow comparisons.

doParallel::registerDoParallel(cores = 12)
set.seed(246)
acc_model_eval <- perf_mod(all_workflows, metric = "accuracy", iter = 5000)
Sample output from perf_mod (Image by Author)

The results of perf_mod are then passed to tidy() to extract the underlying findings of the posterior analysis. Their distributions are visualised below.

#Extract Results from Posterior Analysis and Visualise Distributions
acc_model_eval %>% 
  tidy() %>% 
  mutate(model = fct_inorder(model)) %>% 
  ggplot(aes(x=posterior)) +
   geom_histogram(bins = 50) +
   theme_minimal() +
   facet_wrap(~model, nrow = 7, ncol = 6) +
   labs(title = "Comparison of Posterior Distributions of Model Recipe Combinations", x = expression(paste("Posterior for Mean Accuracy")), y = "")
Posterior Distributions of Mean Accuracy for all Workflows (Image by Author)

Looking back at our performance comparison, we observe that the two highest-ranked workflows are Boosted Trees and Decision Trees with Recipe1, models at opposite ends of the interpretability scale for tree-based methods. We can seek to understand this further by taking the difference in means of their respective posterior distributions.

Using contrast_models() we can conduct that analysis.

#Compare Two Models - Difference in Means
mod_compare <- contrast_models(acc_model_eval,
                            list_1 = "Recipe1_Decision_Tree",
                            list_2 = "Recipe1_Boosted_Trees")
a1 <- mod_compare %>% 
  as_tibble() %>% 
  ggplot(aes(x=difference)) +
  geom_histogram(bins = 50, col = "white", fill = "#73D055FF")+
  geom_vline(xintercept = 0, lty = 2) +
  theme_minimal()+
  scale_fill_viridis_b()+
  labs(x= "Posterior for Mean Difference in Accuracy", y="", title = "Posterior Mean Difference Recipe1_Decision_Tree & Recipe1_Boosted_Trees")
a2 <- acc_model_eval %>% 
  tidy() %>% mutate(model = fct_inorder(model)) %>% 
  filter(model %in% c("Recipe1_Boosted_Trees", "Recipe1_Decision_Tree")) %>% 
  ggplot(aes(x=posterior)) +
  geom_histogram(bins = 50, col = "white", fill = "#73D055FF") +
  theme_minimal()+
  scale_colour_viridis_b() +
  facet_wrap(~model, nrow = 2, ncol = 1) +
  labs(title = "Comparison of Posterior Distributions of Model Recipe Combinations", x = expression(paste("Posterior for Mean Accuracy")), y = "")
a2/a1
Individual Mean Accuracy Posterior Distributions, and Their Difference (Image by Author)

Piping the mod_compare object to summary generates the following output

mod_compare %>% summary()
mean
0.001371989 #Difference in means between posterior distributions
probability
0.5842 #Proportion of Posterior of Mean Difference > 0

As also evidenced by the posterior for the mean difference, the difference between the respective workflows’ posterior distributions for mean accuracy is small. 58.4% of the posterior distribution of the difference in means is above 0, so we can infer that the positive difference, while likely real, is slight.

We can investigate further, with the notion of practical equivalence.

summary(mod_compare, size = 0.02)
pract_equiv
0.9975

The effect size sets thresholds on both sides of zero, [-0.02, 0.02], for the posterior distribution of the difference in means, and pract_equiv measures how much of that distribution falls within these thresholds. For our comparison, 99.75% of the posterior for the difference in means lies within the region of practical equivalence, i.e. Recipe1_Boosted_Trees is practically equivalent to Recipe1_Decision_Tree.

We can complete this exercise for all workflows as below.

#Pluck and modify underlying tibble from autoplot()
autoplot(acc_model_eval, type = "ROPE", size = 0.02) %>% 
  pluck("data") %>% 
  mutate(rank = row_number(-pract_equiv)) %>% 
  arrange(rank) %>% 
  separate(model, into = c("Recipe", "Model_Type"), sep = "_", remove = F, extra = "merge") %>% 
  ggplot(aes(x=rank, y= pract_equiv, color = Model_Type, shape = Recipe)) +
   geom_point(size = 5) +
   theme_minimal() +
   scale_colour_viridis_d() +
   labs(y= "Practical Equivalance", x = "Workflow Rank", size = "Probability of Practical Equivalence", color = "Model Type", title = "Practical Equivalence of Workflow Sets", subtitle = "Calculated Using an Effect Size of 0.02")
Practical Equivalence of Workflows With Effects Size of 0.02 (Image by Author)

We’ve taken the data that underpins the autoplot object, which conducts the ROPE (Region of Practical Equivalence) calculation, so that we can engineer and display more features. In doing so we can easily compare each of the workflows using Recipe1_Decision_Tree as a benchmark. There are a handful of workflows that would likely perform just as well as Recipe1_Decision_Tree.

The top two candidates, Boosted_Trees and Decision_Tree, sit on opposite ends of the interpretability spectrum. Boosted Trees is very much a black-box method; conversely, Decision Trees are more easily understood and in this instance provide the best result. Hence we’ll finalise our workflow using Recipe1_Decision_Tree.

#Pull Best Performing Hyperparameter Set From workflow_map Object
best_result <- all_workflows %>% 
  pull_workflow_set_result("Recipe1_Decision_Tree") %>% 
  select_best(metric = "accuracy")
#Finalise Workflow Object With Best Parameters
dt_wf <- all_workflows %>% 
  pull_workflow("Recipe1_Decision_Tree") %>% 
  finalize_workflow(best_result)
#Fit Workflow Object to Training Data and Predict Using Test Dataset
dt_res <- 
  dt_wf %>%
  fit(training(loan_split)) %>% 
  predict(new_data = testing(loan_split)) %>% 
  bind_cols(loan_test) %>% 
  mutate(.pred_class = fct_infreq(.pred_class),
         Loan_Status = fct_infreq(Loan_Status))
#Calculate Accuracy of Prediction
accuracy(dt_res, truth = Loan_Status, estimate = .pred_class)
Model Accuracy Against Test Data (Image by Author)
Confusion Matrix for Predictions (Image by Author)
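
For reference, the confusion matrix shown above can be reproduced with yardstick's conf_mat(); a minimal sketch:

#Sketch: generate the confusion matrix for the test-set predictions
conf_mat(dt_res, truth = Loan_Status, estimate = .pred_class)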

Our finalised model has managed to generate a very strong result, 84.68% accuracy on the test dataset. The model generally performs well when predicting approvals for loans that were granted. However, the model also approves loans that the bank decided against. This is a curious case: obviously the more loans a bank offers, the higher the potential revenue, but also the higher the risk exposure. It indicates the dataset contains some inconsistencies that the model is not able to fully distinguish, and it has been ‘confused’ by observations in the training set.

This presents an argument for model interpretability, in this example we cannot generate more data, so how do we know the basis by which the predictions have been made?

#Fit and Extract Fit from Workflow Object
dt_wf_fit <- 
  dt_wf %>% 
  fit(training(loan_split))
dt_fit <- 
  dt_wf_fit %>% 
  pull_workflow_fit()
#Generate Decision Tree Plot Using rpart.plot package
rpart.plot::rpart.plot(dt_fit$fit)
Decision Tree Diagram for Model Decision Making (Image by Author)

As our model is simple enough to visualise and understand, a tree plot suffices.

Immediately we observe that the decision has been based solely on the presence of a Credit History. The model is also erring on the side of approving applicants with an unknown Credit_History (= 0) and denying applicants without one (Credit_History = -1). The inconsistencies noted above are evident: not everyone with a credit history is granted a loan, but it certainly helps.


Conclusion

You return to your CDO and explain your findings. They’re impressed with the accuracy result, but now the business has to weigh the risk of default from permitting applications with unknown credit histories against the potential revenue they might otherwise generate.

This article has sought to explain the power of workflow_sets within the Tidymodels ecosystem. Furthermore, we’ve explored the "No Free Lunch" theorem: no one model can be said to be best in advance, and hence it is best practice to screen many models. The types of models that respond well are also dependent on the feature engineering that is carried out.


Thank you for reading my second publication. If you’re interested in Tidymodels, I published an introductory project explaining each of the core packages when building a regression model.

I would like to acknowledge the wonderful work of Julia Silge and Max Kuhn for their ongoing efforts in developing the Tidymodels ecosystem. Their in-progress book can be viewed at https://www.tmwr.org/, and has been instrumental in building my understanding.

The post Using Workflow Sets to Screen and Compare Model-Recipe Combinations for Bank Loan Classification appeared first on Towards Data Science.

Big Sales Mart Regression Revisited: Enter the tidymodels https://towardsdatascience.com/big-sales-mart-regression-revisited-enter-the-tidymodels-a6a432be58d4/ Sun, 20 Jun 2021 23:03:21 +0000 https://towardsdatascience.com/big-sales-mart-regression-revisited-enter-the-tidymodels-a6a432be58d4/ Tidymodels is a meta package much like the tidyverse that loads an array of useful tidy packages into your session. The difference being it loads a series of packages used through the machine learning model development process. I’ll introduce each one as we utilise them and provide the practical use case. Tidymodels was developed by […]

Tidymodels is a meta package much like the tidyverse that loads an array of useful tidy packages into your session. The difference being it loads a series of packages used through the machine learning model development process.

I’ll introduce each one as we utilise them and provide the practical use case. Tidymodels was developed by Max Kuhn, who was onboarded by RStudio to develop a tidy-friendly means of generating machine learning models. Kuhn’s previous and most well-known work is caret (Classification And REgression Training). Caret is already an incredibly powerful package in its own right and rivals Python’s scikit-learn.

However, each model type has its own specific package, input requirements, specifications and training methods. Tidymodels seeks to resolve that issue by simplifying the workflow and creating a uniform syntax needed to generate machine learning models.

The Big Sales Mart dataset is available through Analytics Vidhya and is one of the free projects they provide to members.

https://courses.analyticsvidhya.com/courses/big-mart-sales-prediction-using-r

The aim of the project is to develop and evaluate a series of regression models in predicting Item Outlet Sales from a number of input variables.

When going through the project myself, I found that many aspects were written in base R, weren’t as intuitive as they could be, and didn’t take advantage of the ease of use that the tidyverse packages offer. Hence I saw a real opportunity to translate this project into the "tidy" way of thinking and demonstrate tidymodels.


Exploratory Data Analysis

#Load Packages
library(tidymodels) #ML meta package
library(readr) #Read Tabular Data
library(skimr) #Data Summary
library(magrittr) #Tidy pipes
library(patchwork) #Visualisation grids
library(stringr) #String Manipulation
library(forcats) #Working with Factors
library(corrr) #Develop tidy Correlation Plots
library(GGally) #Scatterplot Matrices
library(scales) #Pretty axis breaks
library(vip) #Most important feature visualisations

We load a series of packages; of most importance is the tidymodels meta package. Much like the tidyverse, it implements a series of smaller packages that have discrete uses and play a part in the model development process, alongside dplyr, ggplot2, purrr, tibble etc. We’ll introduce each package as we progress through the model development.

#Load Data
train = read_csv("train.csv")
skim(train)

The data is loaded into the project, and we check the structure using the handy skim function from the skimr package. There are 12 variables (7 character and 5 numeric) and 8,523 rows of data.

From here we need to tidy the dataset, as some variables have missing values. We investigate further by visualising each variable in our dataset.

I wrote this function to make the initial EDA steps of my Data Science projects very straightforward: it creates a list of plots via purrr’s map2(), which is then passed to patchwork’s wrap_plots() to generate a grid of plots.

#Automated Exploratory Data Analysis
viz_by_dtype <- function (x,y) {
  title <- str_replace_all(y,"_"," ") %>% 
           str_to_title()
  if ("factor" %in% class(x)) {
    ggplot(train, aes(x, fill = x)) +
      geom_bar() +
      theme_minimal() +
      theme(legend.position = "none",
            axis.text.x = element_text(angle = 45, hjust = 1),
            axis.text = element_text(size = 8)) +
      scale_fill_viridis_d()+
      labs(title = title, y = "", x = "")
  }
  else if ("numeric" %in% class(x)) {
    ggplot(train, aes(x)) +
      geom_histogram()  +
      theme_minimal() +
      scale_fill_viridis_d()+
      labs(title = title, y = "", x = "")
  } 
  else if ("integer" %in% class(x)) {
    ggplot(train, aes(x)) +
      geom_histogram() +
      theme_minimal() +
      scale_fill_viridis_d()+
      labs(title = title, y = "", x = "")
  }
  else if ("character" %in% class(x)) {
    ggplot(train, aes(x, fill = x)) +
      geom_bar() +
      theme_minimal() +
      scale_fill_viridis_d()+
      theme(legend.position = "none",
            axis.text.x = element_text(angle = 45, hjust = 1),
            axis.text = element_text(size = 8)) +
      labs(title = title, y  ="", x= "")
  }
}
variable_list <- colnames(train) %>% as.list()
variable_plot <- map2(train, variable_list, viz_by_dtype) %>%
  wrap_plots(               
    ncol = 3,
    nrow = 4,
    heights = 150,
    widths = 150
  )

From this we note:

  • Item Identifier has 1559 individual values, and for the sake of model simplicity we will drop this variable
  • Item Weight has no apparent distribution
  • Item Fat Content has inconsistent labels
  • Item Visibility is right-skewed
  • Item Type has a variety of labels
  • Item MRP has four major groups
  • Outlet Identifier has 10 unique labels
  • Outlet Establishment Year is varied, many stores from the mid-90s onwards
  • Outlet Size has inconsistent labels
  • Outlet Location Type has three labels
  • Outlet Type has four labels
  • Item Outlet Sales is right-skewed

Furthermore, let’s investigate whether any interesting relationships exist between Item_Outlet_Sales and the other numeric variables. Using ggscatmat() from GGally, we generate a neat bivariate analysis, with density estimates along the diagonal and the Pearson correlation for each variable pair.
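
A minimal sketch of that call (the column selection is assumed from the numeric variables described above):

#Sketch: scatterplot matrix of the numeric variables
train %>% 
  select(Item_Weight, Item_Visibility, Item_MRP, Item_Outlet_Sales) %>% 
  drop_na() %>% 
  ggscatmat(alpha = 0.3) +
  theme_minimal()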

Plotting the numeric variables against the target variable, we make three observations:

  • Item Sales is spread well across the entire range of the Item_Weight without any obvious patterns
  • Item_Visibility has no relationship with Item_Outlet_Sales
  • Item_MRP has four distinct bands and is moderately correlated with Item_Outlet_Sales, we will exploit this later in feature engineering.

Data Wrangling

As noted, we need to correct a number of our variables.

#Correct mislabeled Item_Fat_Content
train %<>% mutate(Item_Fat_Content = if_else(Item_Fat_Content %in% c("reg", "Regular"), "Regular", "Low Fat"))
#Outlet Size is missing a label
train %<>% mutate(Outlet_Size = if_else(is.na(Outlet_Size),"Small",Outlet_Size))

Outlet Size is imputed based on the NA label having a very similar Item_Outlet_Sales distribution to Small outlets.
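
A quick check along those lines, run before the imputation above, might look like the following sketch.

#Sketch: compare Item_Outlet_Sales across Outlet_Size levels, including the NA group
train %>% 
  group_by(Outlet_Size) %>% 
  summarise(median_sales = median(Item_Outlet_Sales),
            mean_sales   = mean(Item_Outlet_Sales),
            n = n())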

Model Development: Introducing tidymodels

So far we have used an array of tidyverse packages to wrangle and visualise data, let’s now walk through the process of developing a Machine Learning pipeline using the tidymodels meta package.

Data Splitting – rsamples

Much like scikit-learn’s train/test split utilities, rsample’s initial_split completes a similar operation. In the below, an mc_split object, mart_split, is generated using the initial_split function, passing our training dataframe, setting the proportion to 75%, and stratifying by the Item_Outlet_Sales variable.

This effectively creates a list of two dataframes, accessed by calling either training() or testing(). We will use the test dataframe to judge how well our model performs on unseen data.

set.seed(55)
mart_split <- initial_split(train, prop = 0.75, strata = Item_Outlet_Sales)
mart_train <- training(mart_split)
mart_test <- testing(mart_split)

Feature Engineering – recipes

The recipes package provides a large array of functions for processing variables and making them machine-learning friendly. The three main functions of the recipes package are

  • recipe() – defines a sequential series of pre-processing steps (OR gather and list my ingredients and how I’m going to prepare them)
  • prep() – calculate required statistical transformations (OR prepare my ingredients for cooking)
  • bake() – applies the pre-processing steps to the data set (OR cook my prepared dish)

The recipe function is constructed like other model formulas, outcome ~ predictors; here we’ve indicated that we want to use all other variables as predictors for the model by using the dot shorthand (~ .).

We have used the recipe to generate a new variable, Price_Per_Unit, which is the quotient of Item_MRP and Item_Weight, then sought to reduce the skewness of Price_Per_Unit by applying a log transformation (with an offset of 1), and also taken the square root of Item_Visibility. Previously, to generate dummy variables, whereby nominal/character variables are converted to binary columns indicating whether or not that row belongs to a particular group (1 = True, 0 = False), we might have used reshape2::dcast(); here the power of recipes is demonstrated, being able to pipe a simple instruction, step_dummy, to create dummy variables for all nominal variables. I’ve decided not to have dummy variables created for Item_Identifier for simplicity; doing so would create over 1,500 additional columns and likely overcomplicate the model.

mart_recipe <- 
  training(mart_split) %>% 
  recipe(Item_Outlet_Sales ~ .) %>% 
  step_rm(Item_Identifier) %>%
  step_impute_bag(Item_Weight) %>% 
  step_impute_knn(Item_Visibility) %>% 
  step_mutate(Price_Per_Unit = Item_MRP/Item_Weight) %>% 
  step_sqrt(Item_Visibility) %>%
  step_log(Price_Per_Unit, offset = 1) %>% 
  step_discretize(Item_MRP,num_breaks = 4) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_dummy(all_nominal())
mart_recipe_prepped <- prep(mart_recipe)

The recipe object is then passed through the prep() function, which also returns a recipe object; however, the required statistics for each transformation have now been estimated from the training data, and thus the data has been ‘prep’ared.

Passing the prep(ed) recipe object to bake executes the transformations, and because a number of dummy variables have been created we now have 54 columns.

mart_train <- bake(mart_recipe_prepped, new_data = mart_train)
dim(mart_train)
[1] 6391 54

As the prepared recipe is effectively a transformation pipeline, we can recall these same transformations on the test data set.

mart_test <- mart_recipe_prepped %>% 
             bake(testing(mart_split))
dim(mart_test)
[1] 2132 54

Voila, fully processed training and test datasets.

Correlation Plot – corrr

Correlation plots are very interesting and important for understanding which predictors are correlated with the outcome variable. To create a correlation plot, the data needs to be transformed into a correlation data frame (cor_df). We take mart_train, select all numeric variables and pass them to corrr::correlate(), which generates the cor_df giving the Pearson correlation between each variable pair.

The cor_df object is passed through to rplot(), and, treating it just like any other ggplot object, we can customise its appearance using theme() and colour schemes.

corr_df <- mart_train %>% select(where(is.numeric)) %>% 
              correlate() %>%
              rearrange() %>% 
              shave()
rplot(corr_df) +
 theme_minimal() +
 theme(axis.text.x = element_text(angle = 90)) +
 scale_colour_viridis_c()

A word to the wise: you probably aren’t going to use a correlation plot with this many variables in general, and as I write, this one is hard to read without a larger monitor. However, do observe how elegant the code was to create such a plot.

At this juncture, one might decide to complete a PCA and reduce the number of variables. I’ll leave a full treatment to another time, though a sketch of what that step could look like is shown below.
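
A sketch only, appending an optional PCA step to the existing recipe (the variance threshold here is assumed, not tuned):

#Sketch: an optional PCA step appended to the recipe
mart_recipe_pca <- 
  mart_recipe %>% 
  step_normalize(all_numeric_predictors()) %>% #PCA expects centred/scaled inputs
  step_pca(all_numeric_predictors(), threshold = 0.9)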

Model Development – parsnip

The team at RStudio, particularly Max Kuhn, have a seriously wicked sense of humour. In naming this package they couldn’t call it caret2, so they chose parsnip.

As we are seeking to create a regression model that predicts Item_Outlet_Sales, for the sake of demonstration we will develop two models and use the root mean squared error (RMSE) as the evaluation metric.

  • Linear Regression
  • Random Forest

Generating a model using parsnip is incredibly elegant. Starting with our first model to evaluate, linear regression, we declare a model specification and then the computational engine that will be used to fit the model. In this instance the parsnip model will use the stats::lm() function.

lm_model <- linear_reg() %>% 
            set_engine("lm")
lm_mart_fit <- 
  lm_model %>%
  fit(Item_Outlet_Sales ~ ., data = mart_train)

Calling the parsnip model object will generate an output similar to calling a stats::lm() object: the formula and the list of coefficients for each input variable.
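
If you prefer the coefficients in a tibble, broom's tidy() also works directly on the parsnip fit; a one-line sketch:

#Sketch: coefficients as a tidy tibble
tidy(lm_mart_fit)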

lm_mart_res <- 
  predict(lm_mart_fit, new_data = mart_test) %>% 
  bind_cols(mart_test %>% select(Item_Outlet_Sales))
lm_mart_res %>%
  ggplot(aes(x = Item_Outlet_Sales, y = .pred)) +
  geom_abline() +
  geom_point(alpha = 0.5) +
  theme_minimal() +
  labs(x = "Item Outlet Sales", y = "Predicted Item Outlet Sales")

Measuring Performance – yardstick

Using the output of our regression model on mart_test, we can evaluate how accurate the model is using the yardstick package. The yardstick package provides an effective tidy manner for evaluating model performance.

metrics <- metric_set(rmse, rsq, mae)
metrics(lm_mart_res, truth = Item_Outlet_Sales, estimate = .pred)

yardstick::metric_set creates a wrapper function for specified error metrics. In our case, we have specified Root Mean Squared Error (RMSE), R² and Mean Absolute Error (MAE).

Calling metrics(), specifying the results tibble together with the truth and estimate columns, generates a tibble with each of the error measures.

Comparing the error of the model to the error of the prediction on the test set we find that RMSE(Test) = 1141.8 and RMSE(Model) = 1144.3. The difference is slight and we can say that the model applies well to the test data.

However, the regression model has predicted negative values for a number of observations and is therefore inappropriate in this context, as we cannot have negative sales.
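
A quick sketch to quantify how often this happens on the test set:

#Sketch: count test-set predictions below zero
lm_mart_res %>% 
  summarise(n_negative = sum(.pred < 0),
            prop_negative = mean(.pred < 0))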

Generating ML Pipelines – Workflows

As above, there are a number of granular steps we’ve undertaken to generate a basic linear model, the workflows package provides an elegant means of piping each of these steps. We’ll now demonstrate how useful workflows are by generating a random forest model.

Random Forest Models

Random Forest algorithms are an ensemble of Decision Trees, generally trained via the bagging method. At each split, the algorithm searches for the best feature among a random subset of features. This results in greater tree diversity, trading slightly higher bias for lower variance and a better model overall.

rf_workflow <- 
  workflow() %>% 
  add_recipe(mart_recipe) %>% #Pre-processing Steps
  add_model(rand_forest(mode = "regression") %>% #Specify Model
              set_engine("ranger"))

In four lines of code, we have generated an ML workflow that can now be used as a model object.

rf_mart_res <- 
  rf_workflow %>% 
  fit(training(mart_split)) %>% 
  predict(new_data = testing(mart_split)) %>%
  bind_cols(mart_test %>% select(Item_Outlet_Sales))
rf_mart_res %>% 
  ggplot(aes(x= Item_Outlet_Sales, y = .pred)) +
   geom_abline(lty = 2)+
   geom_point(alpha = 0.5)+
   theme_minimal()+
   labs(x="Item Outlet Sales", y= "Predicted Item Outlet Sales", title = "Random Forest Regression")

Here we find that RMSE(Train) = 1110.446 and RMSE(Test) = 1126.359. Random Forest models do have a tendency to overfit; we can probably improve the model further by tuning the hyperparameters, and this too can be accomplished using workflows and the dials package.

Dials – Refining Model Development

Tuning models by adjusting their hyperparameters is an important aspect of model development.

Enter the dials package, and the function grid_regular(), which sets up a tibble with every possible combination of the hyperparameters that require tuning. dials::grid_regular() works similarly to tidyr::expand_grid().

Below, we re-establish a random forest model specification, this time setting trees to 500 and both mtry and min_n to tune(). mtry is the number of predictors sampled at each split, and min_n is the minimum number of observations required to split a node.

A tuning workflow is generated just as before, we pipe a model and recipe specification.

Furthermore, we create a 4-fold cross validation object (vfold_cv).

Lastly, we generate a tune_results object through initialising tune_grid() (a grid search) which iterates through each of the combinations in the rf_grid, with the tune_wf using 4-fold cross validation.

set.seed(256)
rf_mod <- 
  rand_forest(trees = 500,
              mtry = tune(),
              min_n = tune()) %>% 
  set_engine("ranger", importance = "impurity", num.threads = 12) %>% 
  set_mode("regression")
#Establish Model Flow
tune_wf <- workflow() %>%
  add_recipe(mart_recipe) %>%
  add_model(rf_mod)
#Generate grid to perform grid search for hyperparameter optimisation
rf_grid <- grid_regular(mtry(range = c(6,10)), 
                        min_n(range = c(14,20)), 
                        levels = c(10,9))
# 4-fold Cross Validation Stratified by Item_Outlet_Sales
folds <- vfold_cv(train, v = 4, strata = Item_Outlet_Sales)
#Train and evaluate all combinations of hyperparameters specified in rf_grid
doParallel::registerDoParallel(cores = 12)
rf_grid_search <- tune_grid(
  tune_wf,
  resamples = folds,
  grid = rf_grid)

Please note, we’ve utilised parallel processing to use more of the workstation’s compute power, as grid searches can be computationally demanding.

We can visualise the results of this computation and visually see which combination of hyperparameters provides the lowest RMSE.

rf_grid_search %>%
  collect_metrics() %>% 
  filter(.metric == "rmse") %>%
  select(mean, min_n, mtry) %>%
  filter(mtry > 4) %>% 
  ggplot(aes(min_n, mean, color = as_factor(mtry))) +
  geom_point()+
  geom_line()+
  scale_color_viridis_d() +
  theme_minimal()+
  scale_x_continuous(breaks = pretty_breaks())+
  theme(legend.position = "bottom") +
  labs(x = "Minimum Number of Observations to Split Node", y = "RMSE", title = "Grid Search Results for Random Forest", color = "Number of Predictors Sampled at Each Split")

Calling show_best() provides more details about the results of the grid search

rf_grid_search %>% show_best()

We can use the results of the grid search to update the random forest model specifications.

rf_best_rmse <- select_best(rf_grid_search, "rmse")
final_rf <- finalize_model(
  rf_mod,
  rf_best_rmse
)
final_rf

Using the vip package, we can fit final_rf and highlight the top 10 most important features.

final_rf %>%
  fit(Item_Outlet_Sales ~., data = bake(prep(mart_recipe),training(mart_split))) %>% 
  vip(geom=c("col"), num_features = 10) +
  theme_minimal()

What this really highlights is the importance of feature engineering: several of these features were created during our cleaning of the data or through the recipe.

Finally, let’s evaluate final_rf.

final_wf <- 
  workflow() %>% 
  add_recipe(mart_recipe) %>% 
  add_model(final_rf)
final_rf_res <- 
  fit(final_wf, training(mart_split)) %>% 
  predict(new_data = testing(mart_split)) %>% 
  bind_cols(mart_test %>% select(Item_Outlet_Sales))
final_rf_res %>% ggplot(aes(x= Item_Outlet_Sales, y = .pred)) +
                  geom_abline(lty = 2)+
                  geom_point(alpha = 0.5)+
                  theme_minimal()+
                  labs(x="Item Outlet Sales", y= "Predicted Item Outlet Sales", title = "Tuned Random Forest Regression")

metrics(final_rf_res, truth = Item_Outlet_Sales, estimate = .pred)

Comparing RMSE(Test) = 1120.365 to RMSE(Train) = 1106.629, we have slightly reduced overfitting, narrowing the train-test gap by around 2 RMSE units, and produced a marginally better model following hyperparameter tuning.


Conclusion

This project has sought to exemplify the ease with which the Tidymodels meta package can be used to generate machine learning workflows by making each step pipeable, fitting two model types and evaluating their performance against a testing set. If you wanted to take this one step further, it is very easy to train an XGBoost model; I haven’t included a full example for the sake of brevity, though a sketch of the specification is shown below.
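
As a sketch only, an (untuned) XGBoost specification can be dropped into the same workflow structure:

#Sketch: swapping an XGBoost specification into the existing workflow
xgb_workflow <- 
  workflow() %>% 
  add_recipe(mart_recipe) %>% 
  add_model(boost_tree(mode = "regression", trees = 500) %>% 
              set_engine("xgboost"))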

One of the key takeaways when comparing tidymodels with, say, scikit-learn is that every step in the tidymodels process takes tibble dataframes and doesn’t necessitate transforming dataframes into matrices. Furthermore, because everything works through tibbles, visual diagnostics of model performance and hyperparameter tuning are, as I’ve demonstrated, easier to produce and understand.

I highly recommend that people who are interested in learning more about tidymodels seek out the fantastic Julia Silge at https://juliasilge.com/, and her collaboration with Max Kuhn on the highly anticipated Tidy Modeling with R (which you can preview as it’s being written at https://www.tmwr.org/).

The post Big Sales Mart Regression Revisited: Enter the tidymodels appeared first on Towards Data Science.
