AI4CI logo CASA logo

Data Overview: School Attainment Analysis

Sources, construction, and descriptive statistics for the KS4 panel dataset

Author

Adam Dennett with some assistance from Claude Code

Published

March 19, 2026

1 Introduction

This document describes the construction and characteristics of the school-level panel dataset used in the multilevel regression analysis of KS4 Attainment 8 scores. It covers:

  1. Data sources — where each variable originates
  2. Panel construction — how per-year data are harmonised and linked
  3. Cleaning and imputation — handling of missing values and the 2024–25 carry-forward strategy
  4. Descriptive statistics — distributions, correlations, and completeness of the analysis variables

All figures are saved as publication-ready PDF and JPEG files in output/data_overview_files/.

2 Data Sources

Table 1 summarises the six primary data feeds that are combined into the final panel.

Show code
tibble::tribble(
  ~Source, ~Provider, ~Coverage, ~`Key variables`, ~`Data link`,
  "KS4 Performance Tables", "DfE", "4 years (2021-22 to 2024-25)",
    "ATT8SCR, P8MEA, PTFSM6CLA1A, PTPRIORLO, ADMPOL_PT, TOTPUPS",
    '<a href="https://www.find-school-performance-data.service.gov.uk/" target="_blank">find-school-performance-data.service.gov.uk</a>',
  "School Census", "DfE", "4 years (matched to KS4)",
    "PNUMEAL, PNUMENGFL, TPUP, TFSM6CLA1A",
    '<a href="https://www.find-school-performance-data.service.gov.uk/" target="_blank">find-school-performance-data.service.gov.uk</a>',
  "Absence (perf. tables + API)", "DfE / EES", "4 years",
    "PERCTOT, PPERSABS10",
    '<a href="https://explore-education-statistics.service.gov.uk/find-statistics/pupil-absence-in-schools-in-england" target="_blank">explore-education-statistics.service.gov.uk</a>',
  "Ofsted Ratings", "Ofsted Official Statistics", "As at 31 Aug 2024",
    "OFSTEDRATING_1 (harmonised 1\u20134)",
    '<a href="https://www.gov.uk/government/statistical-data-sets/monthly-management-information-ofsteds-school-inspections-outcomes" target="_blank">gov.uk/ofsted-outcomes</a>',
  "Edubase", "DfE", "Oct 2024 snapshot",
    "gor_name (region), easting, northing",
    '<a href="https://get-information-schools.service.gov.uk/" target="_blank">get-information-schools.service.gov.uk</a>',
  "School Workforce Census", "DfE / EES", "4 years (2024-25 delayed)",
    "Teacher retention, leadership pay %, sickness days",
    '<a href="https://explore-education-statistics.service.gov.uk/find-statistics/school-workforce-in-england" target="_blank">explore-education-statistics.service.gov.uk</a>'
) %>%
  kable(format = "html", align = "lllll", escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  save_tbl("data_sources")
Table 1: Primary data sources and their contribution to the panel
Source Provider Coverage Key variables Data link
KS4 Performance Tables DfE 4 years (2021-22 to 2024-25) ATT8SCR, P8MEA, PTFSM6CLA1A, PTPRIORLO, ADMPOL_PT, TOTPUPS find-school-performance-data.service.gov.uk
School Census DfE 4 years (matched to KS4) PNUMEAL, PNUMENGFL, TPUP, TFSM6CLA1A find-school-performance-data.service.gov.uk
Absence (perf. tables + API) DfE / EES 4 years PERCTOT, PPERSABS10 explore-education-statistics.service.gov.uk
Ofsted Ratings Ofsted Official Statistics As at 31 Aug 2024 OFSTEDRATING_1 (harmonised 1–4) gov.uk/ofsted-outcomes
Edubase DfE Oct 2024 snapshot gor_name (region), easting, northing get-information-schools.service.gov.uk
School Workforce Census DfE / EES 4 years (2024-25 delayed) Teacher retention, leadership pay %, sickness days explore-education-statistics.service.gov.uk

3 Variable Definitions

Table 2 lists all variables used in the multilevel model, together with their role and a brief description. This serves as a reference for the sections that follow.

Show code
model_vars_def <- c(
  "ATT8SCR", "ATT8SCR_FSM6CLA1A", "ATT8SCR_NFSM6CLA1A",
  "PTFSM6CLA1A", "PERCTOT", "PNUMEAL", "PTPRIORLO",
  "ADMPOL_PT", "gorard_segregation",
  "OFSTEDRATING_1", "gor_name", "LANAME",
  "remained_in_the_same_school",
  "teachers_on_leadership_pay_range_percent",
  "average_number_of_days_taken",
  "has_imputed_predictors"
)

tibble(
  Variable = model_vars_def,
  Label = var_labels(model_vars_def),
  Role = c(rep("Outcome", 3),
           rep("Fixed effect", 6),
           rep("Random intercept", 3),
           rep("Fixed effect", 3),
           "Flag")
) %>%
  kable(align = "lll") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  pack_rows("Outcomes", 1, 3) %>%
  pack_rows("Fixed-effect predictors", 4, 9) %>%
  pack_rows("Random-effect grouping", 10, 12) %>%
  pack_rows("Workforce predictors", 13, 15) %>%
  pack_rows("Metadata", 16, 16) %>%
  save_tbl("variable_definitions")
Table 2: Definitions of all variables used in the multilevel model
Variable Label Role
Outcomes
ATT8SCR Avg. Attainment 8 score per pupil Outcome
ATT8SCR_FSM6CLA1A Avg. Attainment 8 score (disadvantaged) Outcome
ATT8SCR_NFSM6CLA1A Avg. Attainment 8 score (non-disadvantaged) Outcome
Fixed-effect predictors
PTFSM6CLA1A % KS4 pupils who are disadvantaged Fixed effect
PERCTOT % overall absence Fixed effect
PNUMEAL % pupils with English not first language Fixed effect
PTPRIORLO % KS4 pupils with low prior attainment (KS2) Fixed effect
ADMPOL_PT Admissions policy (new definition from 2019) Fixed effect
gorard_segregation Gorard segregation index Fixed effect
Random-effect grouping
OFSTEDRATING_1 Ofsted rating (harmonised) Random intercept
gor_name Government Office Region Random intercept
LANAME Local authority name Random intercept
Workforce predictors
remained_in_the_same_school Teachers remaining in same school (FTE) Fixed effect
teachers_on_leadership_pay_range_percent % teachers on leadership pay range Fixed effect
average_number_of_days_taken Avg. teacher sickness days Fixed effect
Metadata
has_imputed_predictors 2024-25 predictors are estimates (carry-forward) Flag

4 Panel Construction Pipeline

The panel is assembled through four sequential scripts, each adding data and producing an intermediate .rds file.

Show code
tibble::tribble(
  ~Step, ~Script, ~`Input`, ~`Output`, ~Description,
  "1", "01_extract_data.R",
    "Raw DfE ZIPs", "year_data_list.rds",
    "Unzip, read KS4 + census + absence + school info per year; join by URN; filter RECTYPE=1 (mainstream)",
  "2", "02_build_panel.R",
    "year_data_list.rds", "panel_raw.rds",
    "Download Edubase (region/coords), Ofsted ratings, absence API; join to each year; harmonise Ofsted text/numeric; stack into panel",
  "3", "03_add_workforce.R",
    "panel_raw.rds", "panel_with_workforce.rds",
    "Download 5 workforce datasets from EES (turnover, pay, sickness, size, ratios); join by URN + time_period",
  "4", "04_compute_derived.R",
    "panel_with_workforce.rds", "panel_data.rds",
    "Gorard segregation index; impute 2024-25 missing predictors; log transforms; final cleaning"
) %>%
  kable(align = "clllll") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 10) %>%
  column_spec(5, width = "25em") %>%
  save_tbl("pipeline_steps")
Table 3: Data pipeline steps from raw ZIP files to the final analysis panel
Step Script Input Output Description
1 01_extract_data.R Raw DfE ZIPs year_data_list.rds Unzip, read KS4 + census + absence + school info per year; join by URN; filter RECTYPE=1 (mainstream)
2 02_build_panel.R year_data_list.rds panel_raw.rds Download Edubase (region/coords), Ofsted ratings, absence API; join to each year; harmonise Ofsted text/numeric; stack into panel
3 03_add_workforce.R panel_raw.rds panel_with_workforce.rds Download 5 workforce datasets from EES (turnover, pay, sickness, size, ratios); join by URN + time_period
4 04_compute_derived.R panel_with_workforce.rds panel_data.rds Gorard segregation index; impute 2024-25 missing predictors; log transforms; final cleaning

5 Sample Overview

5.1 Panel Dimensions

Show code
sample_tbl <- model_panel %>%
  group_by(`Academic year` = year_label) %>%
  summarise(
    Schools = n_distinct(URN),
    `School-year obs.` = n(),
    LAs = n_distinct(LANAME),
    Regions = n_distinct(gor_name),
    .groups = "drop"
  )

totals <- tibble(
  `Academic year` = "Total / Overall",
  Schools = n_distinct(model_panel$URN),
  `School-year obs.` = nrow(model_panel),
  LAs = n_distinct(model_panel$LANAME),
  Regions = n_distinct(model_panel$gor_name)
)

bind_rows(sample_tbl, totals) %>%
  kable(align = "lrrrr", format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(nrow(sample_tbl) + 1, bold = TRUE) %>%
  save_tbl("panel_dimensions")
Table 4: Panel dimensions by academic year (state-funded mainstream schools only)
Academic year Schools School-year obs. LAs Regions
2021-22 3,349 3,349 152 9
2022-23 3,351 3,351 152 9
2023-24 3,372 3,372 152 9
2024-25 3,347 3,347 152 9
Total / Overall 3,523 13,419 152 9
ImportantSample restriction

All figures and tables in this report are based on the analysis sample of mainstream state-funded schools only (Academies and LA-maintained schools). The following school types have been excluded from the raw extract:

Table 5
School type Observations Schools
Independent school 3,641 987
College 55 17
Special school 15 15
Total excluded 3,711 NA

This reduces the panel from 17,130 to 13,419 school-year observations (3,523 unique schools across 152 LAs). These school types are excluded because they operate under different funding, governance, and accountability frameworks that make direct comparison with the mainstream state-funded sector inappropriate. Independent schools, in particular, do not participate in the same performance table reporting regime, and their inclusion would confound the relationship between deprivation-related predictors and attainment outcomes.

5.2 Geographic Distribution

Show code
p_region <- model_panel %>%
  filter(!is.na(gor_name)) %>%
  count(gor_name, year_label) %>%
  ggplot(aes(x = fct_reorder(gor_name, n, .fun = sum), y = n, fill = year_label)) +
  geom_col(position = "dodge", width = 0.7) +
  coord_flip() +
  scale_fill_manual(values = year_cols, name = "Academic year") +
  labs(x = NULL, y = "Number of schools",
       title = "Schools per region by academic year") +
  theme_pub

save_fig(p_region, "region_counts", width = 8, height = 5)
p_region
Figure 1: Number of schools per region and academic year

The South East, North West, and London contribute the largest number of schools, consistent with their population sizes. School counts are highly stable across years, reflecting the fact that the panel tracks the same institutions over time with minimal entry or exit.

5.3 School Type Composition

Show code
p_types <- model_panel %>%
  filter(!is.na(MINORGROUP)) %>%
  count(MINORGROUP) %>%
  mutate(pct = n / sum(n) * 100,
         MINORGROUP = fct_reorder(MINORGROUP, n)) %>%
  ggplot(aes(x = MINORGROUP, y = pct)) +
  geom_col(fill = "#2e6260", width = 0.7) +
  geom_text(aes(label = sprintf("%.1f%%", pct)), hjust = -0.1, size = 3) +
  coord_flip() +
  labs(x = NULL, y = "% of school-year observations",
       title = "School type composition (analysis sample, all years pooled)") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  theme_pub

save_fig(p_types, "school_types", width = 7, height = 4)
p_types
Figure 2: School type composition across the panel

Academies form the largest group in the analysis sample, followed by Maintained schools. Independent schools, colleges, and special schools have been excluded from all figures and tables in this report (see Table 10 for the full breakdown).

5.4 Ofsted Rating Distribution

Show code
p_ofsted <- model_panel %>%
  filter(!is.na(OFSTEDRATING_1)) %>%
  count(year_label, OFSTEDRATING_1) %>%
  group_by(year_label) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ungroup() %>%
  ggplot(aes(x = year_label, y = pct, fill = OFSTEDRATING_1)) +
  geom_col(position = "stack", width = 0.65) +
  geom_text(aes(label = sprintf("%.0f%%", pct)),
            position = position_stack(vjust = 0.5), size = 3, colour = "white") +
  scale_fill_manual(
    values = c("Outstanding" = "#2e6260", "Good" = "#abc766",
               "Requires Improvement" = "#f9dd73", "Inadequate" = "#4e3c56"),
    name = "Ofsted rating"
  ) +
  labs(x = "Academic year", y = "% of schools",
       title = "Distribution of Ofsted ratings by year") +
  theme_pub

save_fig(p_ofsted, "ofsted_distribution", width = 7, height = 4.5)
p_ofsted
Figure 3: Ofsted rating distribution by academic year

The vast majority of schools in the analysis sample are rated Good, followed by Outstanding. The proportion rated Requires Improvement or Inadequate is small and relatively stable across years. Because Ofsted ratings are carried forward until a school is re-inspected, the year-to-year changes are modest and reflect the pace of the inspection cycle rather than rapid shifts in school quality.

5.5 Admissions Policy

Show code
if ("ADMPOL_PT" %in% names(model_panel)) {
  p_adm <- model_panel %>%
    filter(!is.na(ADMPOL_PT)) %>%
    mutate(ADMPOL_PT = factor(ADMPOL_PT)) %>%
    count(year_label, ADMPOL_PT) %>%
    group_by(year_label) %>%
    mutate(pct = n / sum(n) * 100) %>%
    ggplot(aes(x = year_label, y = pct, fill = ADMPOL_PT)) +
    geom_col(position = "stack", width = 0.65) +
    scale_fill_manual(values = c("#2e6260", "#abc766", "#f9dd73", "#e16fca", "#4e3c56", "#49a0c4", "#ff7f89"), name = "Admissions policy") +
    labs(x = "Academic year", y = "% of schools",
         title = "Admissions policy composition by year") +
    theme_pub

  save_fig(p_adm, "admissions_policy", width = 7, height = 3.5)
  p_adm
}
Figure 4: Admissions policy distribution

The admissions policy variable captures whether a school selects pupils on academic ability. The overwhelming majority of mainstream state-funded schools operate non-selective admissions; the small selective share is concentrated in areas that retain grammar schools. This variable is included in the model to control for the compositional advantage that selective schools enjoy.

6 Outcome Variables

NoteAnalysis sample

From this section onward, all statistics, figures, and tables are based on the analysis sample of mainstream state-funded schools only — Academies and Maintained schools. Independent schools (3,641 observations), colleges (55), and special schools (15) are excluded because they operate under different funding, governance, and accountability frameworks that make direct comparison with the mainstream sector inappropriate. This reduces the panel from 17,130 to 13,419 school-year observations. Full details of the restriction are given in Table 10.

6.1 Descriptive Statistics

Show code
outcome_vars <- c("ATT8SCR", "ATT8SCR_FSM6CLA1A", "ATT8SCR_NFSM6CLA1A")

outcome_stats <- model_panel %>%
  pivot_longer(all_of(outcome_vars), names_to = "Variable", values_to = "value") %>%
  filter(!is.na(value)) %>%
  group_by(Variable, `Year` = year_label) %>%
  summarise(
    N     = n(),
    Mean  = round(mean(value), 1),
    SD    = round(sd(value), 1),
    Min   = round(min(value), 1),
    Median = round(median(value), 1),
    Max   = round(max(value), 1),
    .groups = "drop"
  ) %>%
  mutate(Variable = var_label(Variable))

outcome_stats %>%
  kable(align = "llrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  collapse_rows(columns = 1, valign = "top") %>%
  save_tbl("outcome_statistics")
Table 6: Descriptive statistics for Attainment 8 outcome variables by academic year
Variable Year N Mean SD Min Median Max
Avg. Attainment 8 score per pupil 2021-22 3228 49.3 9.3 9.0 48.4 88.2
2022-23 3250 46.9 9.3 12.0 45.8 87.6
2023-24 3270 46.7 9.5 0.8 45.7 86.6
2024-25 3277 46.8 9.3 0.9 45.6 87.2
Avg. Attainment 8 score (disadvantaged) 2021-22 3184 40.4 9.2 12.9 38.8 85.1
2022-23 3204 37.7 9.2 10.9 36.0 85.8
2023-24 3232 37.5 9.6 14.3 36.0 85.9
2024-25 3246 37.7 9.4 18.1 36.0 86.0
Avg. Attainment 8 score (non-disadvantaged) 2021-22 3186 52.0 8.3 11.2 51.4 86.5
2022-23 3206 49.6 8.2 12.8 48.8 86.4
2023-24 3235 49.5 8.3 14.8 48.6 86.7
2024-25 3247 49.9 8.3 25.9 48.7 87.5
Show code
outcome_stats_pooled <- model_panel %>%
  pivot_longer(all_of(outcome_vars), names_to = "Variable", values_to = "value") %>%
  filter(!is.na(value)) %>%
  group_by(Variable) %>%
  summarise(
    N      = n(),
    Mean   = round(mean(value), 1),
    SD     = round(sd(value), 1),
    Min    = round(min(value), 1),
    Median = round(median(value), 1),
    Max    = round(max(value), 1),
    .groups = "drop"
  ) %>%
  mutate(Variable = var_label(Variable))

outcome_stats_pooled %>%
  kable(align = "lrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  save_tbl("outcome_statistics_pooled")
Table 7: Descriptive statistics for Attainment 8 outcome variables (full panel, pooled across years)
Variable N Mean SD Min Median Max
Avg. Attainment 8 score per pupil 13025 47.4 9.4 0.8 46.3 88.2
Avg. Attainment 8 score (disadvantaged) 12866 38.3 9.4 10.9 36.8 86.0
Avg. Attainment 8 score (non-disadvantaged) 12874 50.2 8.3 11.2 49.3 87.5

Mean Attainment 8 scores for all pupils sit in the mid-to-high 40s and have edged upward over the four-year window. The disadvantaged subgroup consistently scores around 10 points below the non-disadvantaged subgroup, and its standard deviation is slightly wider, reflecting greater heterogeneity in outcomes for this group. Scores for all three measures are approximately normally distributed, so the log transformation applied to the outcome in the multilevel model serves primarily to stabilise variance across the fitted-value range rather than to correct severe skew.

6.2 Outcome Distributions

Show code
outcome_long <- model_panel %>%
  select(year_label, all_of(outcome_vars)) %>%
  pivot_longer(-year_label, names_to = "outcome", values_to = "score") %>%
  filter(!is.na(score)) %>%
  mutate(
    outcome = recode(outcome,
      ATT8SCR               = "All pupils",
      ATT8SCR_FSM6CLA1A     = "Disadvantaged",
      ATT8SCR_NFSM6CLA1A    = "Non-disadvantaged"
    ),
    outcome = factor(outcome,
                     levels = c("All pupils", "Disadvantaged", "Non-disadvantaged"))
  )

p_density <- ggplot(outcome_long, aes(x = score, fill = year_label)) +
  geom_density(alpha = 0.45, colour = NA) +
  facet_wrap(~ outcome, ncol = 1, scales = "free_y") +
  scale_fill_manual(values = year_cols, name = "Academic year") +
  labs(x = "Attainment 8 score", y = "Density",
       title = "Distribution of Attainment 8 scores") +
  theme_pub

save_fig(p_density, "outcome_density", width = 8, height = 7)
p_density
Figure 5: Density plots of Attainment 8 scores by pupil group and year

The density plots confirm that all three Attainment 8 measures are unimodal and broadly symmetric. The disadvantaged subgroup distribution is shifted noticeably to the left, with a slight positive tail. Year-on-year shapes are stable (especially since 2022-23 - post COVID), suggesting that the national distribution of school-level attainment has not shifted dramatically over the period, except for an immediately post-COVID drop.

6.3 Disadvantage Gap Over Time

Show code
gap_data <- model_panel %>%
  filter(!is.na(ATT8SCR_FSM6CLA1A), !is.na(ATT8SCR_NFSM6CLA1A)) %>%
  mutate(gap = ATT8SCR_NFSM6CLA1A - ATT8SCR_FSM6CLA1A)

gap_summary <- gap_data %>%
  group_by(year_label) %>%
  summarise(
    mean_gap   = mean(gap, na.rm = TRUE),
    median_gap = median(gap, na.rm = TRUE),
    sd_gap     = sd(gap, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )

p_gap <- ggplot(gap_data, aes(x = year_label, y = gap)) +
  geom_boxplot(aes(fill = year_label), alpha = 0.6, outlier.size = 0.5,
               show.legend = FALSE) +
  geom_point(data = gap_summary, aes(y = mean_gap),
             shape = 18, size = 3, colour = "red") +
  scale_fill_manual(values = year_cols) +
  labs(x = "Academic year",
       y = "ATT8 gap (non-disadvantaged \u2013 disadvantaged)",
       title = "School-level disadvantage gap in Attainment 8",
       subtitle = "Red diamond = annual mean") +
  theme_pub

save_fig(p_gap, "disadvantage_gap_trend", width = 7, height = 4.5)
p_gap
Figure 6: Attainment gap between non-disadvantaged and disadvantaged pupils by year

The school-level disadvantage gap (the difference in attainment between disadvantaged pupils and non-disadvantaged pupils) is remarkably persistent over the period, with median values of around 13–14 Attainment 8 points across all four years. The interquartile range is wide, indicating that some schools narrow the gap far more effectively than others — a key motivation for the multilevel modelling approach. A small number of schools show negative gaps (i.e. disadvantaged pupils outperforming their non-disadvantaged peers), visible as outliers below zero.

6.4 Outcomes by Region

Show code
p_outcome_region <- model_panel %>%
  filter(!is.na(gor_name)) %>%
  select(gor_name, all_of(outcome_vars)) %>%
  pivot_longer(-gor_name, names_to = "outcome", values_to = "score") %>%
  filter(!is.na(score)) %>%
  mutate(
    outcome = recode(outcome,
      ATT8SCR            = "All pupils",
      ATT8SCR_FSM6CLA1A  = "Disadvantaged",
      ATT8SCR_NFSM6CLA1A = "Non-disadvantaged"
    ),
    outcome = factor(outcome,
                     levels = c("All pupils", "Disadvantaged", "Non-disadvantaged")),
    gor_name = fct_reorder(gor_name, score, .fun = median)
  )

p_out_reg <- ggplot(p_outcome_region, aes(x = gor_name, y = score, fill = gor_name)) +
  geom_violin(alpha = 0.6, colour = NA, scale = "width") +
  geom_boxplot(width = 0.15, alpha = 0.8, outlier.size = 0.3, show.legend = FALSE) +
  facet_wrap(~ outcome, ncol = 3) +
  coord_flip() +
  labs(x = NULL, y = "Attainment 8 score",
       title = "Attainment 8 by region and pupil group",
       subtitle = "All years pooled") +
  theme_pub +
  theme(legend.position = "none",
        strip.text = element_text(size = 9))

save_fig(p_out_reg, "outcome_by_region", width = 11, height = 6)
p_out_reg
Figure 7: Attainment 8 distributions by region — the grouping variable for the multilevel model’s regional random intercept

London stands out with the highest median scores across all three pupil groups — a well-documented feature of the English school system. The North East and Yorkshire & The Humber cluster at the lower end, though there is substantial overlap between regions. Notably, regional rankings are broadly consistent across the three outcome groups, suggesting that the “London effect” benefits disadvantaged and non-disadvantaged pupils alike.

6.5 Outcomes by Ofsted Rating

Show code
p_outcome_ofsted <- model_panel %>%
  filter(!is.na(OFSTEDRATING_1)) %>%
  select(OFSTEDRATING_1, all_of(outcome_vars)) %>%
  pivot_longer(-OFSTEDRATING_1, names_to = "outcome", values_to = "score") %>%
  filter(!is.na(score)) %>%
  mutate(
    outcome = recode(outcome,
      ATT8SCR            = "All pupils",
      ATT8SCR_FSM6CLA1A  = "Disadvantaged",
      ATT8SCR_NFSM6CLA1A = "Non-disadvantaged"
    ),
    outcome = factor(outcome,
                     levels = c("All pupils", "Disadvantaged", "Non-disadvantaged"))
  )

ofsted_cols <- c("Outstanding" = "#2e6260", "Good" = "#abc766",
                 "Requires Improvement" = "#f9dd73", "Inadequate" = "#4e3c56")

p_out_ofs <- ggplot(p_outcome_ofsted,
                    aes(x = OFSTEDRATING_1, y = score, fill = OFSTEDRATING_1)) +
  geom_violin(alpha = 0.6, colour = NA, scale = "width") +
  geom_boxplot(width = 0.15, alpha = 0.8, outlier.size = 0.3, show.legend = FALSE) +
  facet_wrap(~ outcome, ncol = 3) +
  scale_fill_manual(values = ofsted_cols) +
  labs(x = NULL, y = "Attainment 8 score",
       title = "Attainment 8 by Ofsted rating and pupil group",
       subtitle = "All years pooled") +
  theme_pub +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 30, hjust = 1, size = 8),
        strip.text = element_text(size = 9))

save_fig(p_out_ofs, "outcome_by_ofsted", width = 10, height = 5)
p_out_ofs
Figure 8: Attainment 8 distributions by Ofsted rating — the grouping variable for the multilevel model’s Ofsted random intercept

There is a clear gradient in attainment across Ofsted categories: Outstanding schools have the highest median scores and the tightest distributions, while Inadequate schools sit noticeably lower with wider spread. The step from Good to Requires Improvement is the sharpest for all three pupil groups.

TipRelation to the multilevel null model

These two plots illustrate the structure captured by the random intercepts in the multilevel model. Region (gor_name) and Ofsted rating (OFSTEDRATING_1) each enter as grouping variables, allowing the model to estimate a separate baseline (intercept) for each level. Before any school-level predictors are added, these group-level intercepts absorb the systematic variation visible in the violin plots above — effectively representing a null model that accounts only for which region and Ofsted category a school belongs to. The fixed-effect predictors (deprivation, absence, EAL, etc.) then explain the remaining within-group variation around these baselines.

7 Predictor Variables

The multilevel model includes eight fixed-effect predictors capturing school composition, pupil characteristics, and workforce conditions. This section summarises their distributions and motivates the log transformations applied to the most heavily skewed variables.

7.1 Descriptive Statistics

Show code
pred_vars_yr <- c(
  "PTFSM6CLA1A", "PERCTOT", "PNUMEAL", "PTPRIORLO",
  "gorard_segregation",
  "remained_in_the_same_school",
  "teachers_on_leadership_pay_range_percent",
  "average_number_of_days_taken"
)
pred_vars_yr <- intersect(pred_vars_yr, names(model_panel))

pred_table_yearly <- model_panel %>%
  pivot_longer(all_of(pred_vars_yr), names_to = "Variable", values_to = "value") %>%
  filter(!is.na(value)) %>%
  group_by(Variable, Year = year_label) %>%
  summarise(
    N      = n(),
    Mean   = round(mean(value), 2),
    SD     = round(sd(value), 2),
    Min    = round(min(value), 2),
    Median = round(median(value), 2),
    Max    = round(max(value), 2),
    .groups = "drop"
  ) %>%
  mutate(Variable = var_label(Variable))

pred_table_yearly %>%
  kable(align = "llrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  collapse_rows(columns = 1, valign = "top") %>%
  save_tbl("predictor_statistics_yearly")
Table 8: Descriptive statistics for predictor variables by academic year
Variable Year N Mean SD Min Median Max
% overall absence 2021-22 3309 9.14 2.40 2.10 9.00 28.10
2022-23 3316 9.19 2.62 3.10 8.90 32.60
2023-24 3332 9.08 2.63 2.60 8.80 26.90
2024-25 3322 8.25 2.45 1.98 8.05 26.73
% pupils with English not first language 2021-22 3303 17.29 18.83 0.00 9.30 93.30
2022-23 3308 17.96 18.46 0.00 10.45 92.70
2023-24 3318 18.43 18.20 0.00 11.40 94.80
2024-25 3284 18.48 18.24 0.00 11.40 94.80
% KS4 pupils who are disadvantaged 2021-22 3228 26.88 14.87 0.00 24.00 96.00
2022-23 3250 26.70 14.39 0.00 24.00 93.00
2023-24 3270 26.65 13.99 0.00 24.40 82.80
2024-25 3277 28.35 14.46 0.00 25.90 80.10
% KS4 pupils with low prior attainment (KS2) 2021-22 3228 25.78 11.20 0.00 26.00 74.00
2022-23 3250 22.24 9.99 0.00 22.00 92.00
2023-24 3270 22.13 9.65 0.00 22.00 79.20
Avg. teacher sickness days 2021-22 3098 8.66 3.88 0.00 8.50 85.00
2022-23 3018 7.83 3.58 0.00 7.30 91.50
2023-24 3248 8.07 3.54 0.00 7.60 74.00
Gorard segregation index 2021-22 3226 0.18 0.05 0.00 0.17 0.31
2022-23 3248 0.17 0.05 0.00 0.17 0.32
2023-24 3267 0.17 0.05 0.00 0.16 0.30
2024-25 3276 0.16 0.04 0.00 0.15 0.27
Teachers remaining in same school (FTE) 2021-22 3302 52.71 21.77 0.00 51.40 146.50
2022-23 3292 52.56 21.63 0.00 51.00 151.90
2023-24 3311 53.49 21.65 0.00 51.90 155.90
% teachers on leadership pay range 2021-22 3301 11.59 5.14 0.00 10.64 66.67
2022-23 3292 11.63 5.01 0.00 10.77 60.00
2023-24 3311 12.07 5.15 0.00 11.25 75.00
2024-25 3314 12.23 5.09 0.00 11.32 66.67
Show code
pred_vars <- c(
  "PTFSM6CLA1A", "PERCTOT", "PNUMEAL", "PTPRIORLO",
  "gorard_segregation",
  "remained_in_the_same_school",
  "teachers_on_leadership_pay_range_percent",
  "average_number_of_days_taken"
)
pred_vars <- intersect(pred_vars, names(model_panel))

pred_table <- map_dfr(pred_vars, function(v) {
  x <- model_panel[[v]]
  tibble(
    Variable = var_label(v),
    N        = sum(!is.na(x)),
    `% available` = round(sum(!is.na(x)) / nrow(model_panel) * 100, 1),
    Mean     = round(mean(x, na.rm = TRUE), 2),
    SD       = round(sd(x, na.rm = TRUE), 2),
    Min      = round(min(x, na.rm = TRUE), 2),
    Median   = round(median(x, na.rm = TRUE), 2),
    Max      = round(max(x, na.rm = TRUE), 2)
  )
})

pred_table %>%
  kable(align = "lrrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  pack_rows("School composition", 1, 4) %>%
  pack_rows("Segregation (LA-level)", 5, 5) %>%
  pack_rows("Workforce", 6, nrow(pred_table)) %>%
  save_tbl("predictor_statistics")
Table 9: Descriptive statistics for predictor variables (full panel, pooled across years)
Variable N % available Mean SD Min Median Max
School composition
% KS4 pupils who are disadvantaged 13025 97.1 27.14 14.45 0.00 24.90 96.00
% overall absence 13279 99.0 8.91 2.56 1.98 8.70 32.60
% pupils with English not first language 13213 98.5 18.04 18.44 0.00 10.80 94.80
% KS4 pupils with low prior attainment (KS2) 9748 72.6 23.38 10.43 0.00 23.00 92.00
Segregation (LA-level)
Gorard segregation index 13017 97.0 0.17 0.05 0.00 0.17 0.32
Workforce
Teachers remaining in same school (FTE) 9905 73.8 52.92 21.68 0.00 51.40 155.90
% teachers on leadership pay range 13218 98.5 11.88 5.10 0.00 11.00 75.00
Avg. teacher sickness days 9364 69.8 8.19 3.68 0.00 7.70 91.50

There is wide variation across schools in most composition variables. The proportion of disadvantaged pupils (PTFSM6CLA1A) averages around 30% but ranges from near-zero to over 90%, reflecting the substantial segregation that exists within the English school system. Overall absence rates (PERCTOT) are tightly clustered for most schools but exhibit a long right tail of high-absence outliers. The workforce variables — teacher retention, leadership pay share, and average sickness days — have somewhat lower coverage (particularly for 2024–25, which relies on carry-forward imputation; see Table 13).

7.2 Predictor Distributions

Show code
pred_long <- model_panel %>%
  select(year_label, all_of(pred_vars)) %>%
  pivot_longer(-year_label, names_to = "variable", values_to = "value") %>%
  filter(!is.na(value)) %>%
  mutate(variable = var_label(variable))

p_hist <- ggplot(pred_long, aes(x = value)) +
  geom_histogram(aes(fill = year_label), bins = 40, alpha = 0.65,
                 position = "identity", colour = NA) +
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  scale_fill_manual(values = year_cols, name = "Academic year") +
  labs(x = NULL, y = "Frequency",
       title = "Predictor variable distributions") +
  theme_pub +
  theme(strip.text = element_text(size = 8))

save_fig(p_hist, "predictor_distributions", width = 10, height = 8)
p_hist
Figure 9: Distributions of key predictor variables (raw scale)

Several predictors exhibit clear right skew — notably the proportion of pupils with English as an additional language (PNUMEAL) and average teacher sickness days. The FSM eligibility rate and overall absence rate are also right-skewed, though less severely. By contrast, the Gorard segregation index and teacher retention FTE are approximately symmetric. These distributional differences motivate the selective log-transformation strategy described below.

7.3 Log-Transformed Predictors

Predictors with |skewness| > 1 are log-transformed for the multilevel model. remained_in_the_same_school (skewness = 0.47) is entered on the raw scale as it is near-symmetric. Figure 10 shows the raw and log-transformed distributions for the variables that undergo this transformation.

Show code
log_vars <- c("PTFSM6CLA1A", "PERCTOT", "PNUMEAL",
              "average_number_of_days_taken")
log_vars <- intersect(log_vars, names(model_panel))

raw_long <- model_panel %>%
  select(all_of(log_vars)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  filter(!is.na(value), value > 0) %>%
  mutate(scale = "Raw")

log_long <- model_panel %>%
  select(all_of(log_vars)) %>%
  mutate(across(everything(), safe_log)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  filter(!is.na(value)) %>%
  mutate(scale = "Log-transformed")

compare_long <- bind_rows(raw_long, log_long) %>%
  mutate(
    variable = var_label(variable),
    scale = factor(scale, levels = c("Raw", "Log-transformed"))
  )

p_log <- ggplot(compare_long, aes(x = value)) +
  geom_histogram(bins = 40, fill = "#2e6260", alpha = 0.7, colour = "white",
                 linewidth = 0.2) +
  facet_grid(variable ~ scale, scales = "free") +
  labs(x = NULL, y = "Frequency",
       title = "Effect of log transformation on skewed predictors") +
  theme_pub +
  theme(strip.text = element_text(size = 9))

save_fig(p_log, "log_transform_comparison", width = 8, height = 7)
p_log
Figure 10: Raw vs. log-transformed distributions for skewed predictor variables

The side-by-side comparison confirms that the log transformation substantially reduces skewness for all four treated variables. After transformation, the distributions are much closer to the normal shape assumed by the linear mixed-effects model, improving the reliability of coefficient estimates and standard errors.

8 Bivariate Relationships

The interactive scatter plot below allows exploration of the bivariate relationships between each Attainment 8 outcome and the predictor variables. Use the dropdowns to select the outcome (y-axis) and predictor (x-axis), and the checkboxes to independently toggle log-transformation on each axis. This lets you compare the raw-scale relationship with the model-scale relationship (where several variables enter as logs). The OLS regression line and R² update in real time.

Show code
library(plotly)
library(htmltools)
library(jsonlite)

# --- Build a clean data frame with both raw and log versions ---
bivar_df <- model_panel %>%
  filter(!is.na(ATT8SCR)) %>%
  mutate(
    log_ATT8SCR            = safe_log(ATT8SCR),
    log_ATT8SCR_FSM6CLA1A  = safe_log(ATT8SCR_FSM6CLA1A),
    log_ATT8SCR_NFSM6CLA1A = safe_log(ATT8SCR_NFSM6CLA1A),
    log_PTFSM6CLA1A        = safe_log(PTFSM6CLA1A),
    log_PERCTOT             = safe_log(PERCTOT),
    log_PNUMEAL             = safe_log(PNUMEAL),
    log_PTPRIORLO           = safe_log(PTPRIORLO),
    log_gorard_segregation  = safe_log(gorard_segregation),
    log_remained            = safe_log(remained_in_the_same_school),
    log_leadership_pay      = safe_log(teachers_on_leadership_pay_range_percent),
    log_sickness_days       = safe_log(average_number_of_days_taken)
  )

# Hover label
bivar_df$hover_label <- if ("SCHNAME" %in% names(bivar_df) && "LANAME" %in% names(bivar_df)) {
  paste0(bivar_df$SCHNAME, " (", bivar_df$LANAME, ")")
} else {
  paste0("URN ", bivar_df$URN)
}

# --- Define variable metadata ---
# Each entry: raw column, log column, display label
outcome_meta <- list(
  list(raw = "ATT8SCR",             log = "log_ATT8SCR",
       label = "Attainment 8 — All Pupils"),
  list(raw = "ATT8SCR_FSM6CLA1A",   log = "log_ATT8SCR_FSM6CLA1A",
       label = "Attainment 8 — Disadvantaged"),
  list(raw = "ATT8SCR_NFSM6CLA1A",  log = "log_ATT8SCR_NFSM6CLA1A",
       label = "Attainment 8 — Non-Disadvantaged")
)

pred_meta <- list(
  list(raw = "PTFSM6CLA1A", log = "log_PTFSM6CLA1A",
       label = "% Disadvantaged (FSM)"),
  list(raw = "PERCTOT", log = "log_PERCTOT",
       label = "% Overall Absence"),
  list(raw = "PNUMEAL", log = "log_PNUMEAL",
       label = "% EAL"),
  list(raw = "PTPRIORLO", log = "log_PTPRIORLO",
       label = "% Low Prior Attainment (KS2)"),
  list(raw = "gorard_segregation", log = "log_gorard_segregation",
       label = "Gorard Segregation Index"),
  list(raw = "remained_in_the_same_school", log = "log_remained",
       label = "Teachers Remaining (FTE)"),
  list(raw = "teachers_on_leadership_pay_range_percent",
       log = "log_leadership_pay",
       label = "% Teachers on Leadership Pay"),
  list(raw = "average_number_of_days_taken", log = "log_sickness_days",
       label = "Avg. Teacher Sickness Days")
)

# Filter out predictors with no data
pred_meta <- Filter(function(p) any(!is.na(bivar_df[[p$raw]])), pred_meta)

# Split by year
years <- as.character(sort(unique(bivar_df$year_label)))
year_dfs <- lapply(years, function(yr) bivar_df[bivar_df$year_label == yr, ])
names(year_dfs) <- years

yr_colours <- c("#2e6260", "#abc766", "#e16fca", "#4e3c56")
n_tr <- length(years)

# --- Build base plotly (one trace per year + one OLS line) ---
fig <- plot_ly()
for (i in seq_along(years)) {
  df_i <- year_dfs[[i]]
  fig <- fig %>%
    add_markers(
      x = df_i[[pred_meta[[1]]$raw]],
      y = df_i[[outcome_meta[[1]]$raw]],
      text = df_i[["hover_label"]],
      name = years[i],
      marker = list(color = yr_colours[i], opacity = 0.35, size = 4),
      hovertemplate = paste0(
        "<b>%{text}</b><br>x: %{x:.3f}<br>y: %{y:.2f}",
        "<extra>", years[i], "</extra>"
      )
    )
}

# Regression line trace (hidden until JS computes OLS)
fig <- fig %>%
  add_lines(
    x = c(0, 1), y = c(0, 1),
    name = "OLS fit",
    line = list(color = "#7b132b", width = 2),
    showlegend = FALSE, hoverinfo = "skip", visible = FALSE
  )

fig <- fig %>%
  layout(
    xaxis  = list(title = pred_meta[[1]]$label),
    yaxis  = list(title = outcome_meta[[1]]$label),
    legend = list(orientation = "h", x = 1, xanchor = "right",
                  y = 1.02, yanchor = "bottom"),
    margin = list(t = 50, l = 80, b = 60)
  )

fig$elementId <- "bivar-overview"

# --- Pack ALL column data (raw + log) into JSON ---
all_cols <- c(
  unlist(lapply(outcome_meta, function(o) c(o$raw, o$log))),
  unlist(lapply(pred_meta, function(p) c(p$raw, p$log)))
)

js_data <- list()
for (yr in years) {
  yr_list <- list()
  df <- year_dfs[[yr]]
  for (col in all_cols) {
    if (col %in% names(df)) {
      yr_list[[col]] <- round(df[[col]], 5)
    }
  }
  js_data[[yr]] <- yr_list
}

# --- HTML dropdown + checkbox controls ---
select_style <- paste0(
  "padding:6px 10px; border:1px solid #bbb; border-radius:4px;",
  "font-size:0.88em; background:#fff; cursor:pointer;"
)
label_style <- paste0(
  "font-weight:600; font-size:0.82em; display:block;",
  "margin-bottom:4px; color:#555;"
)
cb_style <- paste0(
  "display:flex; align-items:center; gap:6px; margin-top:6px;",
  "font-size:0.82em; color:#555;"
)

controls <- tags$div(
  style = "display:flex; gap:24px; margin-bottom:14px; flex-wrap:wrap; align-items:end;",
  tags$div(
    tags$label("Outcome (y-axis):", `for` = "bvo-outcome", style = label_style),
    tags$select(
      id = "bvo-outcome", style = paste0(select_style, "min-width:230px;"),
      lapply(seq_along(outcome_meta), function(k)
        tags$option(value = k, outcome_meta[[k]]$label))
    ),
    tags$div(style = cb_style,
      tags$input(type = "checkbox", id = "bvo-log-y"),
      tags$label("Log scale", `for` = "bvo-log-y")
    )
  ),
  tags$div(
    tags$label("Predictor (x-axis):", `for` = "bvo-predictor", style = label_style),
    tags$select(
      id = "bvo-predictor", style = paste0(select_style, "min-width:270px;"),
      lapply(seq_along(pred_meta), function(k)
        tags$option(value = k, pred_meta[[k]]$label))
    ),
    tags$div(style = cb_style,
      tags$input(type = "checkbox", id = "bvo-log-x"),
      tags$label("Log scale", `for` = "bvo-log-x")
    )
  ),
  tags$div(
    tags$label("Year:", `for` = "bvo-year", style = label_style),
    tags$select(
      id = "bvo-year", style = paste0(select_style, "min-width:140px;"),
      tags$option(value = "all", "All Years"),
      lapply(years, function(yr) tags$option(value = yr, yr))
    )
  )
)

# --- Metadata arrays for JS ---
outcome_js <- lapply(outcome_meta, function(o)
  list(raw = o$raw, log = o$log, label = o$label))
pred_js <- lapply(pred_meta, function(p)
  list(raw = p$raw, log = p$log, label = p$label))

# --- JavaScript: wire controls -> Plotly.restyle / relayout + live OLS ---
js_block <- tags$script(HTML(sprintf('
(function() {
  var DATA  = %s;
  var YEARS = %s;
  var N     = YEARS.length;
  var LINE  = N;

  var OUTCOMES   = %s;
  var PREDICTORS = %s;

  function fitOLS(xAll, yAll) {
    var x = [], y = [];
    for (var i = 0; i < xAll.length; i++) {
      if (xAll[i] != null && yAll[i] != null &&
          isFinite(xAll[i]) && isFinite(yAll[i])) {
        x.push(xAll[i]); y.push(yAll[i]);
      }
    }
    var n = x.length;
    if (n < 3) return null;
    var sx=0, sy=0, sxx=0, sxy=0, syy=0;
    for (var i = 0; i < n; i++) {
      sx += x[i]; sy += y[i];
      sxx += x[i]*x[i]; sxy += x[i]*y[i]; syy += y[i]*y[i];
    }
    var xb = sx/n, yb = sy/n;
    var Sxx = sxx - n*xb*xb, Sxy = sxy - n*xb*yb, Syy = syy - n*yb*yb;
    if (Sxx === 0) return null;
    var b1 = Sxy / Sxx, b0 = yb - b1 * xb;
    var SSres = 0;
    for (var i = 0; i < n; i++) {
      var r = y[i] - (b0 + b1*x[i]); SSres += r*r;
    }
    var r2 = (Syy > 0) ? 1 - SSres/Syy : 0;
    var xmin = x[0], xmax = x[0];
    for (var i = 1; i < n; i++) {
      if (x[i] < xmin) xmin = x[i];
      if (x[i] > xmax) xmax = x[i];
    }
    return { b0:b0, b1:b1, r2:r2, n:n, xmin:xmin, xmax:xmax };
  }

  function update() {
    var gd = document.getElementById("bivar-overview");
    if (!gd || !gd._fullLayout) return;

    var outIdx  = parseInt(document.getElementById("bvo-outcome").value) - 1;
    var predIdx = parseInt(document.getElementById("bvo-predictor").value) - 1;
    var yrVal   = document.getElementById("bvo-year").value;
    var logX    = document.getElementById("bvo-log-x").checked;
    var logY    = document.getElementById("bvo-log-y").checked;

    var outMeta  = OUTCOMES[outIdx];
    var predMeta = PREDICTORS[predIdx];
    var xCol = logX ? predMeta.log : predMeta.raw;
    var yCol = logY ? outMeta.log  : outMeta.raw;
    var xLabel = (logX ? "log(" : "") + predMeta.label + (logX ? ")" : "");
    var yLabel = (logY ? "log(" : "") + outMeta.label  + (logY ? ")" : "");

    var xArr = [], yArr = [], visArr = [];
    var allX = [], allY = [];
    for (var i = 0; i < N; i++) {
      var yr = YEARS[i];
      var xd = DATA[yr][xCol] || [];
      var yd = DATA[yr][yCol] || [];
      xArr.push(xd); yArr.push(yd);
      var vis = (yrVal === "all" || yr === yrVal);
      visArr.push(vis);
      if (vis) {
        for (var j = 0; j < xd.length; j++) {
          allX.push(xd[j]); allY.push(yd[j]);
        }
      }
    }

    var idx = []; for (var i = 0; i < N; i++) idx.push(i);
    Plotly.restyle(gd, { x: xArr, y: yArr, visible: visArr }, idx);

    var fit = fitOLS(allX, allY);
    if (fit) {
      Plotly.restyle(gd, {
        x: [[ fit.xmin, fit.xmax ]],
        y: [[ fit.b0 + fit.b1*fit.xmin, fit.b0 + fit.b1*fit.xmax ]],
        visible: [true]
      }, [LINE]);

      var sign = (fit.b1 >= 0) ? " + " : " \\u2212 ";
      var eqn  = "\\u0177 = " + fit.b0.toFixed(2) + sign +
                 Math.abs(fit.b1).toFixed(3) + "x" +
                 "\\u2003\\u2003R\\u00b2 = " + fit.r2.toFixed(3) +
                 "\\u2003\\u2003n = " + fit.n.toLocaleString();

      Plotly.relayout(gd, {
        "xaxis.title": xLabel,
        "yaxis.title": yLabel,
        annotations: [{
          text: eqn,
          xref: "paper", yref: "paper",
          x: 0.02, y: 0.98,
          xanchor: "left", yanchor: "top",
          showarrow: false,
          font: { size: 12.5, color: "#4e3c56", family: "monospace" },
          bgcolor: "rgba(255,255,255,0.88)",
          borderpad: 5
        }]
      });
    } else {
      Plotly.restyle(gd, { visible: [false] }, [LINE]);
      Plotly.relayout(gd, {
        "xaxis.title": xLabel,
        "yaxis.title": yLabel,
        annotations: []
      });
    }
  }

  ["bvo-outcome","bvo-predictor","bvo-year"].forEach(function(id) {
    document.getElementById(id).addEventListener("change", update);
  });
  ["bvo-log-x","bvo-log-y"].forEach(function(id) {
    document.getElementById(id).addEventListener("change", update);
  });

  var poll = setInterval(function() {
    var gd = document.getElementById("bivar-overview");
    if (gd && gd._fullLayout) { clearInterval(poll); update(); }
  }, 250);
})();
',
  toJSON(js_data, digits = 5, na = "null", auto_unbox = FALSE),
  toJSON(years),
  toJSON(outcome_js, auto_unbox = TRUE),
  toJSON(pred_js, auto_unbox = TRUE)
)))

htmltools::tagList(controls, fig, js_block)
Figure 11

Toggling the log checkboxes reveals how the transformation reshapes each relationship. For example, the raw % Disadvantaged vs raw Attainment 8 plot shows a curved, heteroscedastic pattern; switching both axes to log scale linearises the relationship and stabilises the variance — illustrating why the model uses log(y) ~ log(x) for these variables.

9 Analysis Sample Restriction

The multilevel models are fitted to mainstream state-funded schools only (Academies and Maintained schools). Independent schools, colleges, and special schools are excluded because they operate under different funding, governance, and admissions structures that make direct comparison inappropriate.

Table 10 shows the effect of this restriction.

Show code
type_summary <- panel_data %>%
  mutate(
    in_model = MINORGROUP %in% c("Academy", "Maintained school"),
    MINORGROUP = ifelse(is.na(MINORGROUP), "(Missing)", MINORGROUP)
  ) %>%
  count(MINORGROUP, in_model) %>%
  arrange(desc(n)) %>%
  mutate(
    `% of panel` = round(n / sum(n) * 100, 1),
    Status = ifelse(in_model, "Included", "Excluded")
  ) %>%
  select(`School type` = MINORGROUP, `School-year obs.` = n,
         `% of panel`, Status)

type_summary %>%
  kable(align = "lrrl") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(which(type_summary$Status == "Excluded"),
           color = "#4e3c56", italic = TRUE) %>%
  save_tbl("school_type_filter")
Table 10: Sample restriction by school type (MINORGROUP)
School type School-year obs. % of panel Status
Academy 10905 63.7 Included
Independent school 3641 21.3 Excluded
Maintained school 2514 14.7 Included
College 55 0.3 Excluded
Special school 15 0.1 Excluded
Show code
model_summary <- model_panel %>%
  group_by(`Academic year` = year_label) %>%
  summarise(
    Schools = n_distinct(URN),
    `School-year obs.` = n(),
    LAs = n_distinct(LANAME),
    .groups = "drop"
  )

totals <- tibble(
  `Academic year` = "Total / Overall",
  Schools = n_distinct(model_panel$URN),
  `School-year obs.` = nrow(model_panel),
  LAs = n_distinct(model_panel$LANAME)
)

bind_rows(model_summary, totals) %>%
  kable(align = "lrrr", format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(nrow(model_summary) + 1, bold = TRUE) %>%
  save_tbl("analysis_sample_dimensions")
Table 11: Analysis sample dimensions after excluding non-mainstream schools
Academic year Schools School-year obs. LAs
2021-22 3,349 3,349 152
2022-23 3,351 3,351 152
2023-24 3,372 3,372 152
2024-25 3,347 3,347 152
Total / Overall 3,523 13,419 152

All descriptive statistics and figures from this point forward are based on the analysis sample (13,419 school-year observations from 3523 mainstream state-funded schools).

10 Missing Data

Understanding the pattern of missing data is important for assessing whether the analysis sample is representative and whether the imputation strategy (described in the next section) is appropriate.

10.1 Completeness by Variable and Year

Show code
all_model_vars <- c(
  "ATT8SCR", "ATT8SCR_FSM6CLA1A", "ATT8SCR_NFSM6CLA1A",
  "PTFSM6CLA1A", "PERCTOT", "PPERSABS10", "PNUMEAL", "PTPRIORLO",
  "ADMPOL_PT", "gorard_segregation", "OFSTEDRATING_1",
  "remained_in_the_same_school",
  "teachers_on_leadership_pay_range_percent",
  "average_number_of_days_taken",
  "gor_name", "LANAME"
)
all_model_vars <- intersect(all_model_vars, names(model_panel))

completeness <- model_panel %>%
  group_by(year_label) %>%
  summarise(
    across(all_of(all_model_vars),
           ~ round(sum(!is.na(.)) / n() * 100, 1),
           .names = "{.col}"),
    .groups = "drop"
  ) %>%
  pivot_longer(-year_label, names_to = "Variable", values_to = "pct") %>%
  pivot_wider(names_from = year_label, values_from = pct) %>%
  mutate(Variable = var_label(Variable))

completeness %>%
  kable(align = "lrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  save_tbl("data_completeness")
Table 12: Data completeness (% non-missing) by variable and academic year
Variable 2021-22 2022-23 2023-24 2024-25
Avg. Attainment 8 score per pupil 96.4 97.0 97.0 97.9
Avg. Attainment 8 score (disadvantaged) 95.1 95.6 95.8 97.0
Avg. Attainment 8 score (non-disadvantaged) 95.1 95.7 95.9 97.0
% KS4 pupils who are disadvantaged 96.4 97.0 97.0 97.9
% overall absence 98.8 99.0 98.8 99.3
% persistent absentees (10%+ sessions missed) 98.8 99.0 98.8 99.3
% pupils with English not first language 98.6 98.7 98.4 98.1
% KS4 pupils with low prior attainment (KS2) 96.4 97.0 97.0 0.0
Admissions policy (new definition from 2019) 100.0 100.0 100.0 100.0
Gorard segregation index 96.3 96.9 96.9 97.9
Ofsted rating (harmonised) 99.5 99.2 97.3 98.1
Teachers remaining in same school (FTE) 98.6 98.2 98.2 0.0
% teachers on leadership pay range 98.6 98.2 98.2 99.0
Avg. teacher sickness days 92.5 90.1 96.3 0.0
Government Office Region 100.0 100.0 100.0 100.0
Local authority name 100.0 100.0 100.0 100.0

Core attainment and composition variables (Attainment 8, FSM eligibility, absence, EAL) enjoy near-complete coverage across all four years. The notable exception is the workforce block — teacher retention, leadership pay share, and sickness absence — which drops to 0% availability in 2024–25 because the School Workforce Census for that year had not been published at the time of data extraction. KS2 prior attainment (PTPRIORLO) is also missing for 2024–25 due to the lingering effects of COVID-era assessment cancellations.

10.2 Missing Data Heatmap

Show code
miss_long <- model_panel %>%
  select(year_label, all_of(all_model_vars)) %>%
  group_by(year_label) %>%
  summarise(
    across(everything(), ~ sum(is.na(.)) / n() * 100),
    .groups = "drop"
  ) %>%
  pivot_longer(-year_label, names_to = "variable", values_to = "pct_missing") %>%
  mutate(variable = var_label(variable))

p_miss <- ggplot(miss_long, aes(x = year_label,
                                 y = fct_reorder(variable, pct_missing, .fun = max),
                                 fill = pct_missing)) +
  geom_tile(colour = "white", linewidth = 0.8) +
  geom_text(aes(label = sprintf("%.0f%%", pct_missing)), size = 3) +
  scale_fill_gradient(low = "#bbd2f0", high = "#7b132b",
                      name = "% missing", limits = c(0, 100)) +
  labs(x = "Academic year", y = NULL,
       title = "Missing data by variable and year") +
  theme_pub +
  theme(panel.grid = element_blank())

save_fig(p_miss, "missing_heatmap", width = 8, height = 5)
p_miss
Figure 12: Missing data pattern across model variables by academic year

The heatmap makes the pattern stark: missingness is concentrated almost entirely in the workforce and prior attainment variables for the 2024–25 year. For the earlier three years, data completeness is consistently high across all variables. This structured, year-specific pattern supports the use of a carry-forward imputation strategy rather than more complex multiple imputation approaches.

NoteTeacher sickness absence: a notable gap in 2021–22 and 2022–23

Although the heatmap shows high overall completeness for the earlier years, one variable — average_number_of_days_taken (teacher sickness absence) — has a less visible but consequential pattern of missingness. Approximately 570 school-year observations pass all other filters but fail on this variable: ~250 in 2021-22 and ~244 in 2022-23, falling to ~54 in 2023-24 and ~22 in 2024-25. Around 420 of these are genuinely missing (NA) and ~150 are reported as zero (excluded because the log transform requires strictly positive values). Academies account for roughly 80% of the affected schools, suggesting patchier sickness absence reporting in the early post-COVID workforce census returns. Because the model uses log(average_number_of_days_taken), these cases cannot be retained without either imputation or removing the variable from the model specification.

11 Imputation: 2024–25 Carry-Forward

Four predictor variables are unavailable for the 2024–25 academic year:

  • PTPRIORLO — KS2 prior attainment was cancelled due to the lingering effects of COVID-19 disruption
  • remained_in_the_same_school — workforce turnover data not yet published
  • teachers_on_leadership_pay_range_percent — workforce pay data not yet published
  • average_number_of_days_taken — workforce sickness data not yet published

These are imputed using a two-step approach:

  1. Carry-forward: use the school’s 2023–24 value
  2. Fallback: if no 2023–24 value exists, use the school-level mean across all available years

A flag variable has_imputed_predictors is set to TRUE for any 2024–25 row that received at least one imputed value.

Show code
impute_vars <- c("PTPRIORLO", "remained_in_the_same_school",
                 "teachers_on_leadership_pay_range_percent",
                 "average_number_of_days_taken")
impute_vars <- intersect(impute_vars, names(model_panel))

rows_2425 <- model_panel %>% filter(year_label == "2024-25")
n_2425 <- nrow(rows_2425)

imputation_summary <- map_dfr(impute_vars, function(v) {
  n_available <- sum(!is.na(rows_2425[[v]]))
  n_missing   <- n_2425 - n_available
  tibble(
    Variable          = var_label(v),
    `2024-25 rows`    = n_2425,
    `Non-missing`     = n_available,
    `Still missing`   = n_missing,
    `% imputed/available` = round(n_available / n_2425 * 100, 1)
  )
})

flagged <- sum(rows_2425$has_imputed_predictors, na.rm = TRUE)

imputation_summary %>%
  kable(align = "lrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  footnote(general = paste0(flagged, " of ", n_2425,
                            " 2024-25 rows flagged as having imputed predictors (",
                            round(flagged / n_2425 * 100, 1), "%).")) %>%
  save_tbl("imputation_summary")
Table 13: Imputation summary for 2024-25 rows
Variable 2024-25 rows Non-missing Still missing % imputed/available
% KS4 pupils with low prior attainment (KS2) 3347 0 3347 0
Teachers remaining in same school (FTE) 3347 0 3347 0
% teachers on leadership pay range 3347 3314 33 99
Avg. teacher sickness days 3347 0 3347 0
Note:
3224 of 3347 2024-25 rows flagged as having imputed predictors (96.3%).
Show code
if (length(impute_vars) > 0) {
  compare_df <- model_panel %>%
    filter(year_label %in% c("2023-24", "2024-25")) %>%
    select(year_label, all_of(impute_vars)) %>%
    pivot_longer(-year_label, names_to = "variable", values_to = "value") %>%
    filter(!is.na(value)) %>%
    mutate(variable = var_label(variable))

  p_impute <- ggplot(compare_df, aes(x = value, fill = year_label)) +
    geom_density(alpha = 0.5, colour = NA) +
    facet_wrap(~ variable, scales = "free", ncol = 2) +
    scale_fill_manual(
      values = c("2023-24" = "#e16fca", "2024-25" = "#4e3c56"),
      name = "Academic year"
    ) +
    labs(x = NULL, y = "Density",
         title = "Observed (2023-24) vs. imputed (2024-25) distributions",
         subtitle = "2024-25 values are carry-forwards from 2023-24 or school-level means") +
    theme_pub

  save_fig(p_impute, "imputed_vs_observed", width = 8, height = 6)
  p_impute
}
Figure 13: Comparison of observed (2023-24) vs. carried-forward (2024-25) values for imputed variables

The density overlays show that the carried-forward 2024–25 distributions closely match the observed 2023–24 distributions, as expected. This provides reassurance that the imputation does not introduce systematic bias into the predictor values. The assumption underpinning this approach is that school-level workforce characteristics are slow-moving and unlikely to change dramatically within a single year.

11.1 Year-on-Year Stability of KS2 Prior Attainment

A key question for the carry-forward strategy is whether imputed variables are sufficiently stable from one year to the next at the school level. Figure 14 addresses this for PTPRIORLO (the percentage of pupils entering with low KS2 prior attainment) by plotting each school’s value in one year against its value in the next. If the variable were perfectly stable, all points would fall on the 45-degree line.

Show code
# Build wide data: one row per school with each year's PTPRIORLO
prior_wide <- model_panel %>%
  filter(!is.na(PTPRIORLO)) %>%
  select(URN, year_label, PTPRIORLO) %>%
  pivot_wider(names_from = year_label, values_from = PTPRIORLO,
              names_prefix = "y_")

# Pair 1: 2021-22 vs 2022-23
pair1 <- prior_wide %>%
  filter(!is.na(`y_2021-22`), !is.na(`y_2022-23`))
r1 <- cor(pair1$`y_2021-22`, pair1$`y_2022-23`)

# Pair 2: 2022-23 vs 2023-24
pair2 <- prior_wide %>%
  filter(!is.na(`y_2022-23`), !is.na(`y_2023-24`))
r2 <- cor(pair2$`y_2022-23`, pair2$`y_2023-24`)

p_scatter1 <- ggplot(pair1, aes(x = `y_2021-22`, y = `y_2022-23`)) +
  geom_point(alpha = 0.25, size = 1, colour = "#2e6260") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey30") +
  annotate("text", x = Inf, y = -Inf,
           label = paste0("r = ", sprintf("%.3f", r1), "\nn = ", format(nrow(pair1), big.mark = ",")),
           hjust = 1.1, vjust = -0.5, size = 3.5, colour = "grey30") +
  labs(x = "% low prior attainment (2021-22)",
       y = "% low prior attainment (2022-23)",
       title = "2021-22 vs 2022-23") +
  coord_equal() +
  theme_pub

p_scatter2 <- ggplot(pair2, aes(x = `y_2022-23`, y = `y_2023-24`)) +
  geom_point(alpha = 0.25, size = 1, colour = "#abc766") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey30") +
  annotate("text", x = Inf, y = -Inf,
           label = paste0("r = ", sprintf("%.3f", r2), "\nn = ", format(nrow(pair2), big.mark = ",")),
           hjust = 1.1, vjust = -0.5, size = 3.5, colour = "grey30") +
  labs(x = "% low prior attainment (2022-23)",
       y = "% low prior attainment (2023-24)",
       title = "2022-23 vs 2023-24") +
  coord_equal() +
  theme_pub

p_prior_scatter <- p_scatter1 + p_scatter2 +
  plot_annotation(
    title = "Year-on-year stability of PTPRIORLO at the school level",
    subtitle = "Dashed line = perfect carry-forward (y = x)",
    theme = theme(plot.title = element_text(face = "bold", size = 12),
                  plot.subtitle = element_text(colour = "grey40", size = 10))
  )

save_fig(p_prior_scatter, "ptpriorlo_year_scatter", width = 10, height = 5)
p_prior_scatter
Figure 14: Year-on-year stability of KS2 prior attainment (% low prior) at the school level

Both panels show a tight clustering of schools along the 45-degree line, with high year-on-year correlations. This confirms that the proportion of pupils entering with low KS2 prior attainment is a stable, slow-moving school characteristic. The degree of scatter is modest and symmetric around the diagonal, indicating no systematic drift. This pattern provides strong support for the carry-forward imputation of PTPRIORLO into 2024–25: if a school’s value barely changes from 2021–22 to 2022–23 or from 2022–23 to 2023–24, it is reasonable to assume a similar level of stability into the following year.

12 Derived Variables

12.1 Gorard Segregation Index

The Gorard segregation index measures how unevenly disadvantaged pupils are distributed across schools within each Local Authority (calculation details). It is computed per LA per year as:

\[G_s = \frac{1}{2} \sum_{i=1}^{n} \left| \frac{F_i}{F_{\text{total}}} - \frac{T_i}{T_{\text{total}}} \right|\]

where \(F_i\) is the number of disadvantaged pupils in school \(i\), \(F_{\text{total}}\) the LA total, and \(T_i\), \(T_{\text{total}}\) the corresponding all-pupil totals.

Show code
gorard_la <- model_panel %>%
  filter(!is.na(gorard_segregation)) %>%
  distinct(LANAME, year_label, gorard_segregation)

p_gorard <- ggplot(gorard_la, aes(x = gorard_segregation, fill = year_label)) +
  geom_histogram(bins = 30, alpha = 0.65, position = "identity", colour = "white",
                 linewidth = 0.2) +
  scale_fill_manual(values = year_cols, name = "Academic year") +
  labs(x = "Gorard segregation index", y = "Number of LAs",
       title = "Distribution of LA-level segregation index") +
  theme_pub

save_fig(p_gorard, "gorard_segregation", width = 7, height = 4)
p_gorard
Figure 15: Distribution of Gorard segregation index across LAs by year

Most Local Authorities have segregation index values between 0.10 and 0.35, indicating moderate unevenness in the distribution of disadvantaged pupils across schools. A small number of LAs show notably high segregation (above 0.40), often corresponding to areas with grammar schools or other selective provision. The distribution is fairly stable across years, consistent with the view that school-level segregation is a structural feature that changes slowly.

13 Correlation Structure

Show code
# Variables that enter the model as log(x)
log_transform_vars <- c("PTFSM6CLA1A", "PERCTOT", "PNUMEAL",
                         "average_number_of_days_taken")

cor_vars <- c("ATT8SCR", "PTFSM6CLA1A", "PERCTOT", "PNUMEAL",
              "PTPRIORLO", "gorard_segregation",
              "remained_in_the_same_school",
              "teachers_on_leadership_pay_range_percent",
              "average_number_of_days_taken")
cor_vars <- intersect(cor_vars, names(model_panel))

cor_data <- model_panel %>%
  select(all_of(cor_vars)) %>%
  drop_na() %>%
  mutate(across(all_of(intersect(log_transform_vars, names(.))), safe_log))

cor_mat <- cor(cor_data)

# Apply labels — prefix logged variables with "log()"
cor_labels <- sapply(cor_vars, function(v) {
  lbl <- var_label(v)
  if (v %in% log_transform_vars) paste0("log(", lbl, ")") else lbl
})
labs_vec <- str_wrap(cor_labels, width = 20)
rownames(cor_mat) <- labs_vec
colnames(cor_mat) <- labs_vec

cor_long <- cor_mat %>%
  as.data.frame() %>%
  rownames_to_column("var1") %>%
  pivot_longer(-var1, names_to = "var2", values_to = "r") %>%
  mutate(
    var1 = factor(var1, levels = rev(labs_vec)),
    var2 = factor(var2, levels = labs_vec)
  )

p_cor <- ggplot(cor_long, aes(x = var2, y = var1, fill = r)) +
  geom_tile(colour = "white", linewidth = 0.5) +
  geom_text(aes(label = sprintf("%.2f", r)), size = 2.2) +
  scale_fill_gradient2(low = "#7b132b", mid = "#f3f3f3", high = "#2e6260",
                       midpoint = 0, limits = c(-1, 1), name = "Pearson r") +
  labs(x = NULL, y = NULL,
       title = "Correlation matrix: outcome and predictors") +
  theme_pub +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
    axis.text.y = element_text(size = 8),
    panel.grid = element_blank()
  )

save_fig(p_cor, "correlation_matrix", width = 9, height = 7)
p_cor
Figure 16: Pairwise Pearson correlations among the outcome and all fixed-effect predictors (log-transformed variables shown on the model scale)

Correlations are computed on the model scale — i.e. the four log-transformed predictors are shown after transformation, matching the scale on which they enter the multilevel model. Several expected patterns emerge. The proportion of disadvantaged pupils has a strong negative correlation with Attainment 8, confirming that school-level deprivation is the dominant compositional predictor. Absence is positively correlated with deprivation and negatively with attainment. The EAL proportion shows a modest positive association with attainment, likely reflecting the concentration of EAL pupils in London schools which tend to perform above the national average. Among the workforce variables, teacher retention shows a positive correlation with attainment, while sickness absence is negatively associated. Importantly, no pair of predictors is so highly correlated as to raise serious multicollinearity concerns for the multilevel model.

14 Summary

The full panel dataset contains 17,130 school-year observations from 4534 unique schools. After restricting to mainstream state-funded schools (Academies and Maintained schools), the analysis sample comprises 13,419 school-year observations from 3523 schools across 152 Local Authorities and 4 academic years (2021–22 to 2024–25). Key features:

  • Sample restriction: Independent schools (3,641 obs.), colleges, and special schools are excluded from the analysis.
  • Outcome completeness: Attainment 8 scores (all pupils) are available for 100% of analysis sample observations by construction.
  • Predictor completeness: Core predictors (disadvantage, absence, EAL) exceed 95% availability across all years. Workforce variables are unavailable for 2024–25 and are carry-forward imputed.
  • Imputation scope: 24% of analysis sample observations (confined to 2024–25) contain at least one imputed predictor.
  • Transformations: Four right-skewed predictors are log-transformed; the Gorard segregation index is computed at the LA-year level.