The world’s leading publication for data science, AI, and ML professionals.

A Bayesian Comparison of School Leaver Outcomes with R and brms

ANOVA - Bayesian Style

Much is made of what we want to do when we leave school. We get asked as young children what we want to do when we grow up, and then proceed to spend 13 years in pre-Tertiary Education. In public policy, much is made around the differences between the Government, Catholic, and Independent school systems, particularly when it comes to public funding of non-Government schools, and how resources should be allocated.

Is there any real difference between student outcomes given the school sector?

In Victoria, Australia, the State Government conduct an annual survey to gauge post-High School outcomes (On Track Survey). This dataset (available under CC BY 4.0) is available across several years, and we’ll only look at the most recent available as of the authoring of this article, 2021.

This article will seek to answer these questions applying Bayesian analysis methodologies, using the R package brms.


Load Libraries and Dataset

Below we load our required packages, the dataset is a presented in a wide report format, with many merged cells. R doesn’t like this so we have do a bit of hard coding to relabel vectors and create a tidy data frame.

library(tidyverse) #Tidyverse meta package
library(brms) #Bayesian Modelling Package
library(tidybayes) #Tidy Helper Functions and Visualisation Geoms for Bayesian Models
library(readxl)
df <- read_excel("DestinationData2022.xlsx", 
    sheet = "SCHOOL PUBLICATION TABLE 2022", 
    skip = 3)
colnames(df) <- c('vcaa_code', 
                  'school_name', 
                  'sector', 
                  'locality', 
                  'total_completed_year12', 
                  'on_track_consenters', 
                  'respondants', 
                  'bachelors', 
                  'deferred', 
                  'tafe', 
                  'apprentice_trainee', 
                  'employed', 
                  'looking_for_work', 
                  'other') 

df <- drop_na(df)

Here is an sample of what our dataset looks like.

The dataset has 14 fields.

  • VCAA Code – Administrative ID for each code
  • School Name
  • Sector – G = Government, C = Catholic & I = Independent
  • Locality or Suburb
  • Total Students Completing Year 12
  • On Track Consenters
  • Respondents
  • Rest of Columns Represent the Percentage of Respondents for Each Outcome

For the purposes of our cross-sectional study we are interested in the outcome proportions by sector, and thus we need to convert this wide data frame into a long format.

df_long <- df |> 
  mutate(across(5:14, ~ as.numeric(.x)), #convert all numeric fields captured as characters
         across(8:14, ~ .x/100 * respondants), #calculate counts from proportions
         across(8:14, ~ round(.x, 0)), #round to whole integers
         respondants = bachelors + deferred + tafe + apprentice_trainee + employed + looking_for_work + other) |> #recalculate total respondents |> 
  filter(sector != 'A') |>  #Low volume 
  select(sector, school_name, 7:14) |> 
  pivot_longer(c(-1, -2, -3), names_to = 'outcome', values_to = 'proportion') |> 
  mutate(proportion = proportion/respondants)

Exploratory Data Analysis

Let’s briefly visualize and summarize our dataset. The government sector has 157, Independent has 57, and Catholic has 50 schools listed in this dataset.

df |> 
  mutate(sector = fct_infreq(sector)) |> 
  ggplot(aes(sector)) +
    geom_bar(aes(fill = sector)) + 
    geom_label(aes(label = after_stat(count)), stat = 'count', vjust = 1.5) +
    labs(x = 'Sector', y = 'Count', title = 'Count of Schools by Sector', fill = 'Sector') +
    scale_fill_viridis_d(begin = 0.2, end = 0.8) +
    theme_ggdist()

df_long |> 
  ggplot(aes(sector, proportion, fill = outcome)) +
    geom_boxplot(alpha = 0.8) +
    geom_point(position = position_dodge(width=0.75), alpha = 0.1, color = 'grey',  aes(group = outcome)) +
    labs(x = 'Sector', y = 'Proportion', fill = 'Outcome', title = 'Boxplot of Respondant Proportions by Sector and Outcome') +
    scale_fill_viridis_d(begin = 0.2, end = 0.8) +
    theme_ggdist()

The distributions tell an interesting story. Bachelors outcome shows the greatest variability across all sectors, with Independent schools having the greatest median student proportion electing to pursue undergraduate education. Interesting to note that Government schools have the highest median proportion of students going on to employment after completing high school. All other outcomes don’t vary too much – we’ll revisit this soon.

Statistical Modelling – Beta Likelihood Regression

We’re focused on proportions of students by school and their outcomes after high school graduation. A beta likelihood is our best bet in these instances. brms is a brilliant package by Buerkner to develop Bayesian models. Our goal is to model the distribution of proportions by outcome and sector.

Beta regression models two parameters, μ and φ. μ represents the mean proportion, and φ is a measure of precision (or variance).

Our first model is represented below, note we are already starting with an interaction between Sector and Outcome because this is the question we want the model to answer, and hence this is an ANOVA model.

We are asking the model to create individual Beta terms for each combination of Sector and Outcome, with a pooled φ term, or model different proportion means with the same variance. We expect 50% of these proportions to sit between (-3, 1) on the logit scale or (0.01, 0.73) as proportions. This is sufficiently wide but informed prior. The global Phi estimate is a positive number, so we use a gamma distribution that is sufficiently wide.

# Modelling Proportion with Sector:Outcome Interaction term using Beta - Pooled Phi term

m3a <- brm(
  proportion ~ sector:outcome + 0, 
  family = Beta,
  data = df_long |> mutate(proportion = proportion + 0.01), # Beta regression can't take outcomes that are 0 so we fudge here by adding tiny increment
  prior = c(prior(normal(-1, 2), class = 'b'),
            prior(gamma(6, 1), class = 'phi')),
  iter = 8000, warmup = 1000, cores = 4, chains = 4, seed = 246, save_pars = save_pars(all = T),
  control = list(adapt_delta = 0.99, max_treedepth = 15)
) |> 
  add_criterion(c('waic', 'loo'), moment_match = T)

summary(m3a)

Note in the model setup we’ve added a 1% increment to all proportions – this is because Beta regression does not handle zero-values. We attempted to model this with zero inflated beta but it took much longer to converge.

Similarly, we can model without a pooled phi, this intuitively makes sense given what we observed in the distributions above, there is variation amongst each of the sector and outcome combinations. The model is defined below.

m3b <- brm(
  bf(proportion ~ sector:outcome + 0,
     phi ~ sector:outcome + 0),
  family = Beta,
  data = df_long |> mutate(proportion = proportion + 0.01),
  prior = c(prior(normal(-1, 2), class = 'b')),
  iter = 8000, warmup = 1000, cores = 4, chains = 4, seed = 246, save_pars = save_pars(all = T),
  control = list(adapt_delta = 0.99)
) |> 
  add_criterion(c('waic', 'loo'), moment_match = T)

summary(m3b)

Model Diagnostics and Comparison

With two models in hand, we’ll now compare their out of sample predictive accuracy using Bayesian LOO estimate of the expected log pointwise predictive density (elpd_loo).

comp <- loo_compare(m3a, m3b)

print(comp, simplify = F)

Simply put, the higher the expected logpoint wise leave one out value is, the greater the predictive accuracy on unseen data. This gives us a good relative measure of model accuracy between models. We can further check this by completing a posterior predictive check, a comparison of observed and simulated values. In our case, model m3b is the better models the observed data.

alt_df <- df_long |> 
  select(sector, outcome, proportion) |> 
  rename(value = proportion) |> 
  mutate(y = 'y', 
         draw = 99) |> 
  select(sector, outcome, draw, value, y)

sim_df <- expand_grid(sector = c('C', 'I', 'G'),
            outcome = unique(df_long$outcome)) |> 
  add_predicted_draws(m3b, ndraws = 1200) |> 
  rename(value = .prediction) |> 
  ungroup() |> 
  mutate(y = 'y_rep',
         draw = rep(seq(from = 1, to = 50, length.out = 50), times = 504)) |> 
  select(sector, outcome, draw, value, y) |> 
  bind_rows(alt_df)

sim_df |> 
  ggplot(aes(value, group = draw)) +
    geom_density(aes(color = y)) +
    facet_grid(outcome ~ sector, scales = 'free_y') +
    scale_color_manual(name = '', 
                       values = c('y' = "darkblue",
                                  'y_rep' = "grey")) +
    theme_ggdist() +
    labs(y = 'Density', x = 'y', title = 'Distribution of Observed and Replicated Proportions by Sector and Outcome')

Model m3b given it’s non-pooled variance or phi term is capable of better capturing the variation in proportion distributions by sector and outcome.

ANOVA – Bayesian Style

Recall, our research question is about seeking to understand if there are differences in outcomes amongst sectors and by how much. In frequentist statistics, we might use ANOVA, a difference of means approach between groups. The weakness of this approach is that the results provide an estimate and a confidence interval, without sense of uncertainty about these estimates, and a counterintuitive p-value stating whether or not the difference in means is statistically significant. No, thank you.

Below we generate a set of expected values for each sector and outcome combination interaction, then use the excellent tidybayes::compare_levels() function that performs the heavy lifting. It computes the difference in posterior means between sector for each outcome. We excluded the ‘other’ outcome for sake of brevity.

new_df <- expand_grid(sector = c('I', 'G', 'C'), 
          outcome = c('apprentice_trainee', 'bachelors', 'deferred', 'employed', 'looking_for_work', 'tafe'))

epred_draws(m3b, newdata = new_df) |> 
  compare_levels(.epred, by = sector, comparison = rlang::exprs(C - G, I - G, C - I)) |> 
    mutate(sector = fct_inorder(sector),
         sector = fct_shift(sector, -1), 
         sector = fct_rev(sector))  |> 
  ggplot(aes(x = .epred, y = sector, fill = sector)) +
      stat_halfeye() +
      geom_vline(xintercept = 0, lty = 2) + 
      facet_wrap(~ outcome, scales = 'free_x') +
      theme_ggdist() +
      theme(legend.position = 'bottom') +
      scale_fill_viridis_d(begin = 0.4, end = 0.8) +
      labs(x = 'Proportional Difference', y = 'Sector', title = 'Differences in Posterior Means by Sector and Outcome', fill = 'Sector')

Alternatively we can summarise all these distributions with a neat table for easier interpretation and a 89% credible interval.

marg_gt <- epred_draws(m3b, newdata = new_df) |> 
  compare_levels(.epred, by = sector, comparison = rlang::exprs(C - G, I - G, C - I)) |> 
  median_qi(.width = .89) |> 
  mutate(across(where(is.numeric), ~round(.x, 3))) |> 
  select(-c(.point, .interval, .width)) |> 
  arrange(outcome, sector) |> 
  rename(diff_in_means = .epred, 
         Q5.5 = .lower, 
         Q94.5 = .upper) |> 
  group_by(outcome) |> 
  gt() |> 
  tab_header(title = 'Sector Marginal Distributions by Outcome') |> 
  #tab_stubhead(label = 'Sector Comparison') |> 
  fmt_percent() |> 
  gtExtras::gt_theme_pff()

For example if we were to summarise a comparison in a formal report we might write the following.

Students in Government schools are less likely to begin undergraduate degrees then their counterparts in Catholic and Independent Schools post-VCE completion.

On average, 42.5% (between 39.5% and 45.6%) of Government school students, 53.2% (between 47.7% and 58.4%) of Catholic school students and 65% (between 60.1% and 69.7%) of Independent school students begin undergraduate education after completion of Year 12 students.

There is an 89% posterior probability that the difference between Catholic and Government student undergraduate enrolment is between 5.6% and 15.7% with a mean of 10.7%. Additionally, the difference between Independent and Government student undergraduate enrolment is between 17.8% and 27% with a mean of 22.5%.

These differences are substantial, and there’s a 100% chance that these differences are not zero.

Summary and Conclusion

In this article we have demonstrated how to model proportional data using a Beta likelihood function using Bayesian modelling, and then perform Bayesian ANOVA to explore the differences in proportional outcomes between sectors.

We have not sought to create a causal understanding of these differences. One can imagine that there are several factors that influence how individual students perform, socioeconomic background, parents educational attainment, in addition to impacts at school level, resourcing etc. It’s an incredibly complex area of public policy that can often get bogged down in zero-sum rhetoric.

Personally, I’m the first person in my extended family to attend and complete tertiary education. I went to a middle band public high school and achieved reasonably good scores to gain entrance into my first preference. My parents encouraged me to pursue education, both electing to leave school at the age of 16. Whilst this article provides evidence that the difference between Government and Non-Government schools is real, it is purely descriptive in nature.


Related Articles