Data Overview: School Attainment Analysis

Sources, construction, and descriptive statistics for the KS4 panel dataset

Author

Adam Dennett with some assistance from Claude Code

Published

March 19, 2026

1 Introduction

This document describes the construction and characteristics of the school-level panel dataset used in the multilevel regression analysis of KS4 Attainment 8 scores. It covers:

Data sources — where each variable originates
Panel construction — how per-year data are harmonised and linked
Cleaning and imputation — handling of missing values and the 2024–25 carry-forward strategy
Descriptive statistics — distributions, correlations, and completeness of the analysis variables

All figures are saved as publication-ready PDF and JPEG files in output/data_overview_files/.

2 Data Sources

Table 1 summarises the six primary data feeds that are combined into the final panel.

Show code

tibble::tribble(
  ~Source, ~Provider, ~Coverage, ~`Key variables`, ~`Data link`,
  "KS4 Performance Tables", "DfE", "4 years (2021-22 to 2024-25)",
    "ATT8SCR, P8MEA, PTFSM6CLA1A, PTPRIORLO, ADMPOL_PT, TOTPUPS",
    '<a href="https://www.find-school-performance-data.service.gov.uk/" target="_blank">find-school-performance-data.service.gov.uk</a>',
  "School Census", "DfE", "4 years (matched to KS4)",
    "PNUMEAL, PNUMENGFL, TPUP, TFSM6CLA1A",
    '<a href="https://www.find-school-performance-data.service.gov.uk/" target="_blank">find-school-performance-data.service.gov.uk</a>',
  "Absence (perf. tables + API)", "DfE / EES", "4 years",
    "PERCTOT, PPERSABS10",
    '<a href="https://explore-education-statistics.service.gov.uk/find-statistics/pupil-absence-in-schools-in-england" target="_blank">explore-education-statistics.service.gov.uk</a>',
  "Ofsted Ratings", "Ofsted Official Statistics", "As at 31 Aug 2024",
    "OFSTEDRATING_1 (harmonised 1\u20134)",
    '<a href="https://www.gov.uk/government/statistical-data-sets/monthly-management-information-ofsteds-school-inspections-outcomes" target="_blank">gov.uk/ofsted-outcomes</a>',
  "Edubase", "DfE", "Oct 2024 snapshot",
    "gor_name (region), easting, northing",
    '<a href="https://get-information-schools.service.gov.uk/" target="_blank">get-information-schools.service.gov.uk</a>',
  "School Workforce Census", "DfE / EES", "4 years (2024-25 delayed)",
    "Teacher retention, leadership pay %, sickness days",
    '<a href="https://explore-education-statistics.service.gov.uk/find-statistics/school-workforce-in-england" target="_blank">explore-education-statistics.service.gov.uk</a>'
) %>%
  kable(format = "html", align = "lllll", escape = FALSE) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  save_tbl("data_sources")

Table 1: Primary data sources and their contribution to the panel

Source	Provider	Coverage	Key variables	Data link
KS4 Performance Tables	DfE	4 years (2021-22 to 2024-25)	ATT8SCR, P8MEA, PTFSM6CLA1A, PTPRIORLO, ADMPOL_PT, TOTPUPS	find-school-performance-data.service.gov.uk
School Census	DfE	4 years (matched to KS4)	PNUMEAL, PNUMENGFL, TPUP, TFSM6CLA1A	find-school-performance-data.service.gov.uk
Absence (perf. tables + API)	DfE / EES	4 years	PERCTOT, PPERSABS10	explore-education-statistics.service.gov.uk
Ofsted Ratings	Ofsted Official Statistics	As at 31 Aug 2024	OFSTEDRATING_1 (harmonised 1–4)	gov.uk/ofsted-outcomes
Edubase	DfE	Oct 2024 snapshot	gor_name (region), easting, northing	get-information-schools.service.gov.uk
School Workforce Census	DfE / EES	4 years (2024-25 delayed)	Teacher retention, leadership pay %, sickness days	explore-education-statistics.service.gov.uk

3 Variable Definitions

Table 2 lists all variables used in the multilevel model, together with their role and a brief description. This serves as a reference for the sections that follow.

Show code

model_vars_def <- c(
  "ATT8SCR", "ATT8SCR_FSM6CLA1A", "ATT8SCR_NFSM6CLA1A",
  "PTFSM6CLA1A", "PERCTOT", "PNUMEAL", "PTPRIORLO",
  "ADMPOL_PT", "gorard_segregation",
  "OFSTEDRATING_1", "gor_name", "LANAME",
  "remained_in_the_same_school",
  "teachers_on_leadership_pay_range_percent",
  "average_number_of_days_taken",
  "has_imputed_predictors"
)

tibble(
  Variable = model_vars_def,
  Label = var_labels(model_vars_def),
  Role = c(rep("Outcome", 3),
           rep("Fixed effect", 6),
           rep("Random intercept", 3),
           rep("Fixed effect", 3),
           "Flag")
) %>%
  kable(align = "lll") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  pack_rows("Outcomes", 1, 3) %>%
  pack_rows("Fixed-effect predictors", 4, 9) %>%
  pack_rows("Random-effect grouping", 10, 12) %>%
  pack_rows("Workforce predictors", 13, 15) %>%
  pack_rows("Metadata", 16, 16) %>%
  save_tbl("variable_definitions")

Table 2: Definitions of all variables used in the multilevel model

Variable	Label	Role
Outcomes
ATT8SCR	Avg. Attainment 8 score per pupil	Outcome
ATT8SCR_FSM6CLA1A	Avg. Attainment 8 score (disadvantaged)	Outcome
ATT8SCR_NFSM6CLA1A	Avg. Attainment 8 score (non-disadvantaged)	Outcome
Fixed-effect predictors
PTFSM6CLA1A	% KS4 pupils who are disadvantaged	Fixed effect
PERCTOT	% overall absence	Fixed effect
PNUMEAL	% pupils with English not first language	Fixed effect
PTPRIORLO	% KS4 pupils with low prior attainment (KS2)	Fixed effect
ADMPOL_PT	Admissions policy (new definition from 2019)	Fixed effect
gorard_segregation	Gorard segregation index	Fixed effect
Random-effect grouping
OFSTEDRATING_1	Ofsted rating (harmonised)	Random intercept
gor_name	Government Office Region	Random intercept
LANAME	Local authority name	Random intercept
Workforce predictors
remained_in_the_same_school	Teachers remaining in same school (FTE)	Fixed effect
teachers_on_leadership_pay_range_percent	% teachers on leadership pay range	Fixed effect
average_number_of_days_taken	Avg. teacher sickness days	Fixed effect
Metadata
has_imputed_predictors	2024-25 predictors are estimates (carry-forward)	Flag

4 Panel Construction Pipeline

The panel is assembled through four sequential scripts, each adding data and producing an intermediate .rds file.

Show code

tibble::tribble(
  ~Step, ~Script, ~`Input`, ~`Output`, ~Description,
  "1", "01_extract_data.R",
    "Raw DfE ZIPs", "year_data_list.rds",
    "Unzip, read KS4 + census + absence + school info per year; join by URN; filter RECTYPE=1 (mainstream)",
  "2", "02_build_panel.R",
    "year_data_list.rds", "panel_raw.rds",
    "Download Edubase (region/coords), Ofsted ratings, absence API; join to each year; harmonise Ofsted text/numeric; stack into panel",
  "3", "03_add_workforce.R",
    "panel_raw.rds", "panel_with_workforce.rds",
    "Download 5 workforce datasets from EES (turnover, pay, sickness, size, ratios); join by URN + time_period",
  "4", "04_compute_derived.R",
    "panel_with_workforce.rds", "panel_data.rds",
    "Gorard segregation index; impute 2024-25 missing predictors; log transforms; final cleaning"
) %>%
  kable(align = "clllll") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 10) %>%
  column_spec(5, width = "25em") %>%
  save_tbl("pipeline_steps")

Table 3: Data pipeline steps from raw ZIP files to the final analysis panel

Step	Script	Input	Output	Description
1	01_extract_data.R	Raw DfE ZIPs	year_data_list.rds	Unzip, read KS4 + census + absence + school info per year; join by URN; filter RECTYPE=1 (mainstream)
2	02_build_panel.R	year_data_list.rds	panel_raw.rds	Download Edubase (region/coords), Ofsted ratings, absence API; join to each year; harmonise Ofsted text/numeric; stack into panel
3	03_add_workforce.R	panel_raw.rds	panel_with_workforce.rds	Download 5 workforce datasets from EES (turnover, pay, sickness, size, ratios); join by URN + time_period
4	04_compute_derived.R	panel_with_workforce.rds	panel_data.rds	Gorard segregation index; impute 2024-25 missing predictors; log transforms; final cleaning

5 Sample Overview

5.1 Panel Dimensions

Show code

sample_tbl <- model_panel %>%
  group_by(`Academic year` = year_label) %>%
  summarise(
    Schools = n_distinct(URN),
    `School-year obs.` = n(),
    LAs = n_distinct(LANAME),
    Regions = n_distinct(gor_name),
    .groups = "drop"
  )

totals <- tibble(
  `Academic year` = "Total / Overall",
  Schools = n_distinct(model_panel$URN),
  `School-year obs.` = nrow(model_panel),
  LAs = n_distinct(model_panel$LANAME),
  Regions = n_distinct(model_panel$gor_name)
)

bind_rows(sample_tbl, totals) %>%
  kable(align = "lrrrr", format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(nrow(sample_tbl) + 1, bold = TRUE) %>%
  save_tbl("panel_dimensions")

Table 4: Panel dimensions by academic year (state-funded mainstream schools only)

Academic year	Schools	School-year obs.	LAs	Regions
2021-22	3,349	3,349	152	9
2022-23	3,351	3,351	152	9
2023-24	3,372	3,372	152	9
2024-25	3,347	3,347	152	9
Total / Overall	3,523	13,419	152	9

Sample restriction

All figures and tables in this report are based on the analysis sample of mainstream state-funded schools only (Academies and LA-maintained schools). The following school types have been excluded from the raw extract:

Table 5

School type	Observations	Schools
Independent school	3,641	987
College	55	17
Special school	15	15
Total excluded	3,711	NA

This reduces the panel from 17,130 to 13,419 school-year observations (3,523 unique schools across 152 LAs). These school types are excluded because they operate under different funding, governance, and accountability frameworks that make direct comparison with the mainstream state-funded sector inappropriate. Independent schools, in particular, do not participate in the same performance table reporting regime, and their inclusion would confound the relationship between deprivation-related predictors and attainment outcomes.

5.2 Geographic Distribution

Show code

p_region <- model_panel %>%
  filter(!is.na(gor_name)) %>%
  count(gor_name, year_label) %>%
  ggplot(aes(x = fct_reorder(gor_name, n, .fun = sum), y = n, fill = year_label)) +
  geom_col(position = "dodge", width = 0.7) +
  coord_flip() +
  scale_fill_manual(values = year_cols, name = "Academic year") +
  labs(x = NULL, y = "Number of schools",
       title = "Schools per region by academic year") +
  theme_pub

save_fig(p_region, "region_counts", width = 8, height = 5)
p_region

Figure 1: Number of schools per region and academic year

The South East, North West, and London contribute the largest number of schools, consistent with their population sizes. School counts are highly stable across years, reflecting the fact that the panel tracks the same institutions over time with minimal entry or exit.

5.3 School Type Composition

Show code

p_types <- model_panel %>%
  filter(!is.na(MINORGROUP)) %>%
  count(MINORGROUP) %>%
  mutate(pct = n / sum(n) * 100,
         MINORGROUP = fct_reorder(MINORGROUP, n)) %>%
  ggplot(aes(x = MINORGROUP, y = pct)) +
  geom_col(fill = "#2e6260", width = 0.7) +
  geom_text(aes(label = sprintf("%.1f%%", pct)), hjust = -0.1, size = 3) +
  coord_flip() +
  labs(x = NULL, y = "% of school-year observations",
       title = "School type composition (analysis sample, all years pooled)") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  theme_pub

save_fig(p_types, "school_types", width = 7, height = 4)
p_types

Figure 2: School type composition across the panel

Academies form the largest group in the analysis sample, followed by Maintained schools. Independent schools, colleges, and special schools have been excluded from all figures and tables in this report (see Table 10 for the full breakdown).

5.4 Ofsted Rating Distribution

Show code

p_ofsted <- model_panel %>%
  filter(!is.na(OFSTEDRATING_1)) %>%
  count(year_label, OFSTEDRATING_1) %>%
  group_by(year_label) %>%
  mutate(pct = n / sum(n) * 100) %>%
  ungroup() %>%
  ggplot(aes(x = year_label, y = pct, fill = OFSTEDRATING_1)) +
  geom_col(position = "stack", width = 0.65) +
  geom_text(aes(label = sprintf("%.0f%%", pct)),
            position = position_stack(vjust = 0.5), size = 3, colour = "white") +
  scale_fill_manual(
    values = c("Outstanding" = "#2e6260", "Good" = "#abc766",
               "Requires Improvement" = "#f9dd73", "Inadequate" = "#4e3c56"),
    name = "Ofsted rating"
  ) +
  labs(x = "Academic year", y = "% of schools",
       title = "Distribution of Ofsted ratings by year") +
  theme_pub

save_fig(p_ofsted, "ofsted_distribution", width = 7, height = 4.5)
p_ofsted

Figure 3: Ofsted rating distribution by academic year

The vast majority of schools in the analysis sample are rated Good, followed by Outstanding. The proportion rated Requires Improvement or Inadequate is small and relatively stable across years. Because Ofsted ratings are carried forward until a school is re-inspected, the year-to-year changes are modest and reflect the pace of the inspection cycle rather than rapid shifts in school quality.

5.5 Admissions Policy

Show code

if ("ADMPOL_PT" %in% names(model_panel)) {
  p_adm <- model_panel %>%
    filter(!is.na(ADMPOL_PT)) %>%
    mutate(ADMPOL_PT = factor(ADMPOL_PT)) %>%
    count(year_label, ADMPOL_PT) %>%
    group_by(year_label) %>%
    mutate(pct = n / sum(n) * 100) %>%
    ggplot(aes(x = year_label, y = pct, fill = ADMPOL_PT)) +
    geom_col(position = "stack", width = 0.65) +
    scale_fill_manual(values = c("#2e6260", "#abc766", "#f9dd73", "#e16fca", "#4e3c56", "#49a0c4", "#ff7f89"), name = "Admissions policy") +
    labs(x = "Academic year", y = "% of schools",
         title = "Admissions policy composition by year") +
    theme_pub

  save_fig(p_adm, "admissions_policy", width = 7, height = 3.5)
  p_adm
}

Figure 4: Admissions policy distribution

The admissions policy variable captures whether a school selects pupils on academic ability. The overwhelming majority of mainstream state-funded schools operate non-selective admissions; the small selective share is concentrated in areas that retain grammar schools. This variable is included in the model to control for the compositional advantage that selective schools enjoy.

6 Outcome Variables

Analysis sample

From this section onward, all statistics, figures, and tables are based on the analysis sample of mainstream state-funded schools only — Academies and Maintained schools. Independent schools (3,641 observations), colleges (55), and special schools (15) are excluded because they operate under different funding, governance, and accountability frameworks that make direct comparison with the mainstream sector inappropriate. This reduces the panel from 17,130 to 13,419 school-year observations. Full details of the restriction are given in Table 10.

6.1 Descriptive Statistics

Show code

outcome_vars <- c("ATT8SCR", "ATT8SCR_FSM6CLA1A", "ATT8SCR_NFSM6CLA1A")

outcome_stats <- model_panel %>%
  pivot_longer(all_of(outcome_vars), names_to = "Variable", values_to = "value") %>%
  filter(!is.na(value)) %>%
  group_by(Variable, `Year` = year_label) %>%
  summarise(
    N     = n(),
    Mean  = round(mean(value), 1),
    SD    = round(sd(value), 1),
    Min   = round(min(value), 1),
    Median = round(median(value), 1),
    Max   = round(max(value), 1),
    .groups = "drop"
  ) %>%
  mutate(Variable = var_label(Variable))

outcome_stats %>%
  kable(align = "llrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  collapse_rows(columns = 1, valign = "top") %>%
  save_tbl("outcome_statistics")

Table 6: Descriptive statistics for Attainment 8 outcome variables by academic year

Variable	Year	N	Mean	SD	Min	Median	Max
Avg. Attainment 8 score per pupil	2021-22	3228	49.3	9.3	9.0	48.4	88.2
	2022-23	3250	46.9	9.3	12.0	45.8	87.6
	2023-24	3270	46.7	9.5	0.8	45.7	86.6
	2024-25	3277	46.8	9.3	0.9	45.6	87.2
Avg. Attainment 8 score (disadvantaged)	2021-22	3184	40.4	9.2	12.9	38.8	85.1
	2022-23	3204	37.7	9.2	10.9	36.0	85.8
	2023-24	3232	37.5	9.6	14.3	36.0	85.9
	2024-25	3246	37.7	9.4	18.1	36.0	86.0
Avg. Attainment 8 score (non-disadvantaged)	2021-22	3186	52.0	8.3	11.2	51.4	86.5
	2022-23	3206	49.6	8.2	12.8	48.8	86.4
	2023-24	3235	49.5	8.3	14.8	48.6	86.7
	2024-25	3247	49.9	8.3	25.9	48.7	87.5

Show code

outcome_stats_pooled <- model_panel %>%
  pivot_longer(all_of(outcome_vars), names_to = "Variable", values_to = "value") %>%
  filter(!is.na(value)) %>%
  group_by(Variable) %>%
  summarise(
    N      = n(),
    Mean   = round(mean(value), 1),
    SD     = round(sd(value), 1),
    Min    = round(min(value), 1),
    Median = round(median(value), 1),
    Max    = round(max(value), 1),
    .groups = "drop"
  ) %>%
  mutate(Variable = var_label(Variable))

outcome_stats_pooled %>%
  kable(align = "lrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  save_tbl("outcome_statistics_pooled")

Table 7: Descriptive statistics for Attainment 8 outcome variables (full panel, pooled across years)

Variable	N	Mean	SD	Min	Median	Max
Avg. Attainment 8 score per pupil	13025	47.4	9.4	0.8	46.3	88.2
Avg. Attainment 8 score (disadvantaged)	12866	38.3	9.4	10.9	36.8	86.0
Avg. Attainment 8 score (non-disadvantaged)	12874	50.2	8.3	11.2	49.3	87.5

Mean Attainment 8 scores for all pupils sit in the mid-to-high 40s and have edged upward over the four-year window. The disadvantaged subgroup consistently scores around 10 points below the non-disadvantaged subgroup, and its standard deviation is slightly wider, reflecting greater heterogeneity in outcomes for this group. Scores for all three measures are approximately normally distributed, so the log transformation applied to the outcome in the multilevel model serves primarily to stabilise variance across the fitted-value range rather than to correct severe skew.

6.2 Outcome Distributions

Show code

outcome_long <- model_panel %>%
  select(year_label, all_of(outcome_vars)) %>%
  pivot_longer(-year_label, names_to = "outcome", values_to = "score") %>%
  filter(!is.na(score)) %>%
  mutate(
    outcome = recode(outcome,
      ATT8SCR               = "All pupils",
      ATT8SCR_FSM6CLA1A     = "Disadvantaged",
      ATT8SCR_NFSM6CLA1A    = "Non-disadvantaged"
    ),
    outcome = factor(outcome,
                     levels = c("All pupils", "Disadvantaged", "Non-disadvantaged"))
  )

p_density <- ggplot(outcome_long, aes(x = score, fill = year_label)) +
  geom_density(alpha = 0.45, colour = NA) +
  facet_wrap(~ outcome, ncol = 1, scales = "free_y") +
  scale_fill_manual(values = year_cols, name = "Academic year") +
  labs(x = "Attainment 8 score", y = "Density",
       title = "Distribution of Attainment 8 scores") +
  theme_pub

save_fig(p_density, "outcome_density", width = 8, height = 7)
p_density

Figure 5: Density plots of Attainment 8 scores by pupil group and year

The density plots confirm that all three Attainment 8 measures are unimodal and broadly symmetric. The disadvantaged subgroup distribution is shifted noticeably to the left, with a slight positive tail. Year-on-year shapes are stable (especially since 2022-23 - post COVID), suggesting that the national distribution of school-level attainment has not shifted dramatically over the period, except for an immediately post-COVID drop.

6.3 Disadvantage Gap Over Time

Show code

gap_data <- model_panel %>%
  filter(!is.na(ATT8SCR_FSM6CLA1A), !is.na(ATT8SCR_NFSM6CLA1A)) %>%
  mutate(gap = ATT8SCR_NFSM6CLA1A - ATT8SCR_FSM6CLA1A)

gap_summary <- gap_data %>%
  group_by(year_label) %>%
  summarise(
    mean_gap   = mean(gap, na.rm = TRUE),
    median_gap = median(gap, na.rm = TRUE),
    sd_gap     = sd(gap, na.rm = TRUE),
    n = n(),
    .groups = "drop"
  )

p_gap <- ggplot(gap_data, aes(x = year_label, y = gap)) +
  geom_boxplot(aes(fill = year_label), alpha = 0.6, outlier.size = 0.5,
               show.legend = FALSE) +
  geom_point(data = gap_summary, aes(y = mean_gap),
             shape = 18, size = 3, colour = "red") +
  scale_fill_manual(values = year_cols) +
  labs(x = "Academic year",
       y = "ATT8 gap (non-disadvantaged \u2013 disadvantaged)",
       title = "School-level disadvantage gap in Attainment 8",
       subtitle = "Red diamond = annual mean") +
  theme_pub

save_fig(p_gap, "disadvantage_gap_trend", width = 7, height = 4.5)
p_gap

Figure 6: Attainment gap between non-disadvantaged and disadvantaged pupils by year

The school-level disadvantage gap (the difference in attainment between disadvantaged pupils and non-disadvantaged pupils) is remarkably persistent over the period, with median values of around 13–14 Attainment 8 points across all four years. The interquartile range is wide, indicating that some schools narrow the gap far more effectively than others — a key motivation for the multilevel modelling approach. A small number of schools show negative gaps (i.e. disadvantaged pupils outperforming their non-disadvantaged peers), visible as outliers below zero.

6.4 Outcomes by Region

Show code

p_outcome_region <- model_panel %>%
  filter(!is.na(gor_name)) %>%
  select(gor_name, all_of(outcome_vars)) %>%
  pivot_longer(-gor_name, names_to = "outcome", values_to = "score") %>%
  filter(!is.na(score)) %>%
  mutate(
    outcome = recode(outcome,
      ATT8SCR            = "All pupils",
      ATT8SCR_FSM6CLA1A  = "Disadvantaged",
      ATT8SCR_NFSM6CLA1A = "Non-disadvantaged"
    ),
    outcome = factor(outcome,
                     levels = c("All pupils", "Disadvantaged", "Non-disadvantaged")),
    gor_name = fct_reorder(gor_name, score, .fun = median)
  )

p_out_reg <- ggplot(p_outcome_region, aes(x = gor_name, y = score, fill = gor_name)) +
  geom_violin(alpha = 0.6, colour = NA, scale = "width") +
  geom_boxplot(width = 0.15, alpha = 0.8, outlier.size = 0.3, show.legend = FALSE) +
  facet_wrap(~ outcome, ncol = 3) +
  coord_flip() +
  labs(x = NULL, y = "Attainment 8 score",
       title = "Attainment 8 by region and pupil group",
       subtitle = "All years pooled") +
  theme_pub +
  theme(legend.position = "none",
        strip.text = element_text(size = 9))

save_fig(p_out_reg, "outcome_by_region", width = 11, height = 6)
p_out_reg

Figure 7: Attainment 8 distributions by region — the grouping variable for the multilevel model’s regional random intercept

London stands out with the highest median scores across all three pupil groups — a well-documented feature of the English school system. The North East and Yorkshire & The Humber cluster at the lower end, though there is substantial overlap between regions. Notably, regional rankings are broadly consistent across the three outcome groups, suggesting that the “London effect” benefits disadvantaged and non-disadvantaged pupils alike.

6.5 Outcomes by Ofsted Rating

Show code

p_outcome_ofsted <- model_panel %>%
  filter(!is.na(OFSTEDRATING_1)) %>%
  select(OFSTEDRATING_1, all_of(outcome_vars)) %>%
  pivot_longer(-OFSTEDRATING_1, names_to = "outcome", values_to = "score") %>%
  filter(!is.na(score)) %>%
  mutate(
    outcome = recode(outcome,
      ATT8SCR            = "All pupils",
      ATT8SCR_FSM6CLA1A  = "Disadvantaged",
      ATT8SCR_NFSM6CLA1A = "Non-disadvantaged"
    ),
    outcome = factor(outcome,
                     levels = c("All pupils", "Disadvantaged", "Non-disadvantaged"))
  )

ofsted_cols <- c("Outstanding" = "#2e6260", "Good" = "#abc766",
                 "Requires Improvement" = "#f9dd73", "Inadequate" = "#4e3c56")

p_out_ofs <- ggplot(p_outcome_ofsted,
                    aes(x = OFSTEDRATING_1, y = score, fill = OFSTEDRATING_1)) +
  geom_violin(alpha = 0.6, colour = NA, scale = "width") +
  geom_boxplot(width = 0.15, alpha = 0.8, outlier.size = 0.3, show.legend = FALSE) +
  facet_wrap(~ outcome, ncol = 3) +
  scale_fill_manual(values = ofsted_cols) +
  labs(x = NULL, y = "Attainment 8 score",
       title = "Attainment 8 by Ofsted rating and pupil group",
       subtitle = "All years pooled") +
  theme_pub +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 30, hjust = 1, size = 8),
        strip.text = element_text(size = 9))

save_fig(p_out_ofs, "outcome_by_ofsted", width = 10, height = 5)
p_out_ofs

Figure 8: Attainment 8 distributions by Ofsted rating — the grouping variable for the multilevel model’s Ofsted random intercept

There is a clear gradient in attainment across Ofsted categories: Outstanding schools have the highest median scores and the tightest distributions, while Inadequate schools sit noticeably lower with wider spread. The step from Good to Requires Improvement is the sharpest for all three pupil groups.

Relation to the multilevel null model

These two plots illustrate the structure captured by the random intercepts in the multilevel model. Region (gor_name) and Ofsted rating (OFSTEDRATING_1) each enter as grouping variables, allowing the model to estimate a separate baseline (intercept) for each level. Before any school-level predictors are added, these group-level intercepts absorb the systematic variation visible in the violin plots above — effectively representing a null model that accounts only for which region and Ofsted category a school belongs to. The fixed-effect predictors (deprivation, absence, EAL, etc.) then explain the remaining within-group variation around these baselines.

7 Predictor Variables

The multilevel model includes eight fixed-effect predictors capturing school composition, pupil characteristics, and workforce conditions. This section summarises their distributions and motivates the log transformations applied to the most heavily skewed variables.

7.1 Descriptive Statistics

Show code

pred_vars_yr <- c(
  "PTFSM6CLA1A", "PERCTOT", "PNUMEAL", "PTPRIORLO",
  "gorard_segregation",
  "remained_in_the_same_school",
  "teachers_on_leadership_pay_range_percent",
  "average_number_of_days_taken"
)
pred_vars_yr <- intersect(pred_vars_yr, names(model_panel))

pred_table_yearly <- model_panel %>%
  pivot_longer(all_of(pred_vars_yr), names_to = "Variable", values_to = "value") %>%
  filter(!is.na(value)) %>%
  group_by(Variable, Year = year_label) %>%
  summarise(
    N      = n(),
    Mean   = round(mean(value), 2),
    SD     = round(sd(value), 2),
    Min    = round(min(value), 2),
    Median = round(median(value), 2),
    Max    = round(max(value), 2),
    .groups = "drop"
  ) %>%
  mutate(Variable = var_label(Variable))

pred_table_yearly %>%
  kable(align = "llrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  collapse_rows(columns = 1, valign = "top") %>%
  save_tbl("predictor_statistics_yearly")

Table 8: Descriptive statistics for predictor variables by academic year

Variable	Year	N	Mean	SD	Min	Median	Max
% overall absence	2021-22	3309	9.14	2.40	2.10	9.00	28.10
	2022-23	3316	9.19	2.62	3.10	8.90	32.60
	2023-24	3332	9.08	2.63	2.60	8.80	26.90
	2024-25	3322	8.25	2.45	1.98	8.05	26.73
% pupils with English not first language	2021-22	3303	17.29	18.83	0.00	9.30	93.30
	2022-23	3308	17.96	18.46	0.00	10.45	92.70
	2023-24	3318	18.43	18.20	0.00	11.40	94.80
	2024-25	3284	18.48	18.24	0.00	11.40	94.80
% KS4 pupils who are disadvantaged	2021-22	3228	26.88	14.87	0.00	24.00	96.00
	2022-23	3250	26.70	14.39	0.00	24.00	93.00
	2023-24	3270	26.65	13.99	0.00	24.40	82.80
	2024-25	3277	28.35	14.46	0.00	25.90	80.10
% KS4 pupils with low prior attainment (KS2)	2021-22	3228	25.78	11.20	0.00	26.00	74.00
	2022-23	3250	22.24	9.99	0.00	22.00	92.00
	2023-24	3270	22.13	9.65	0.00	22.00	79.20
Avg. teacher sickness days	2021-22	3098	8.66	3.88	0.00	8.50	85.00
	2022-23	3018	7.83	3.58	0.00	7.30	91.50
	2023-24	3248	8.07	3.54	0.00	7.60	74.00
Gorard segregation index	2021-22	3226	0.18	0.05	0.00	0.17	0.31
	2022-23	3248	0.17	0.05	0.00	0.17	0.32
	2023-24	3267	0.17	0.05	0.00	0.16	0.30
	2024-25	3276	0.16	0.04	0.00	0.15	0.27
Teachers remaining in same school (FTE)	2021-22	3302	52.71	21.77	0.00	51.40	146.50
	2022-23	3292	52.56	21.63	0.00	51.00	151.90
	2023-24	3311	53.49	21.65	0.00	51.90	155.90
% teachers on leadership pay range	2021-22	3301	11.59	5.14	0.00	10.64	66.67
	2022-23	3292	11.63	5.01	0.00	10.77	60.00
	2023-24	3311	12.07	5.15	0.00	11.25	75.00
	2024-25	3314	12.23	5.09	0.00	11.32	66.67

Show code

pred_vars <- c(
  "PTFSM6CLA1A", "PERCTOT", "PNUMEAL", "PTPRIORLO",
  "gorard_segregation",
  "remained_in_the_same_school",
  "teachers_on_leadership_pay_range_percent",
  "average_number_of_days_taken"
)
pred_vars <- intersect(pred_vars, names(model_panel))

pred_table <- map_dfr(pred_vars, function(v) {
  x <- model_panel[[v]]
  tibble(
    Variable = var_label(v),
    N        = sum(!is.na(x)),
    `% available` = round(sum(!is.na(x)) / nrow(model_panel) * 100, 1),
    Mean     = round(mean(x, na.rm = TRUE), 2),
    SD       = round(sd(x, na.rm = TRUE), 2),
    Min      = round(min(x, na.rm = TRUE), 2),
    Median   = round(median(x, na.rm = TRUE), 2),
    Max      = round(max(x, na.rm = TRUE), 2)
  )
})

pred_table %>%
  kable(align = "lrrrrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  pack_rows("School composition", 1, 4) %>%
  pack_rows("Segregation (LA-level)", 5, 5) %>%
  pack_rows("Workforce", 6, nrow(pred_table)) %>%
  save_tbl("predictor_statistics")

Table 9: Descriptive statistics for predictor variables (full panel, pooled across years)

Variable	N	% available	Mean	SD	Min	Median	Max
School composition
% KS4 pupils who are disadvantaged	13025	97.1	27.14	14.45	0.00	24.90	96.00
% overall absence	13279	99.0	8.91	2.56	1.98	8.70	32.60
% pupils with English not first language	13213	98.5	18.04	18.44	0.00	10.80	94.80
% KS4 pupils with low prior attainment (KS2)	9748	72.6	23.38	10.43	0.00	23.00	92.00
Segregation (LA-level)
Gorard segregation index	13017	97.0	0.17	0.05	0.00	0.17	0.32
Workforce
Teachers remaining in same school (FTE)	9905	73.8	52.92	21.68	0.00	51.40	155.90
% teachers on leadership pay range	13218	98.5	11.88	5.10	0.00	11.00	75.00
Avg. teacher sickness days	9364	69.8	8.19	3.68	0.00	7.70	91.50

There is wide variation across schools in most composition variables. The proportion of disadvantaged pupils (PTFSM6CLA1A) averages around 30% but ranges from near-zero to over 90%, reflecting the substantial segregation that exists within the English school system. Overall absence rates (PERCTOT) are tightly clustered for most schools but exhibit a long right tail of high-absence outliers. The workforce variables — teacher retention, leadership pay share, and average sickness days — have somewhat lower coverage (particularly for 2024–25, which relies on carry-forward imputation; see Table 13).

7.2 Predictor Distributions

Show code

pred_long <- model_panel %>%
  select(year_label, all_of(pred_vars)) %>%
  pivot_longer(-year_label, names_to = "variable", values_to = "value") %>%
  filter(!is.na(value)) %>%
  mutate(variable = var_label(variable))

p_hist <- ggplot(pred_long, aes(x = value)) +
  geom_histogram(aes(fill = year_label), bins = 40, alpha = 0.65,
                 position = "identity", colour = NA) +
  facet_wrap(~ variable, scales = "free", ncol = 3) +
  scale_fill_manual(values = year_cols, name = "Academic year") +
  labs(x = NULL, y = "Frequency",
       title = "Predictor variable distributions") +
  theme_pub +
  theme(strip.text = element_text(size = 8))

save_fig(p_hist, "predictor_distributions", width = 10, height = 8)
p_hist

Figure 9: Distributions of key predictor variables (raw scale)

Several predictors exhibit clear right skew — notably the proportion of pupils with English as an additional language (PNUMEAL) and average teacher sickness days. The FSM eligibility rate and overall absence rate are also right-skewed, though less severely. By contrast, the Gorard segregation index and teacher retention FTE are approximately symmetric. These distributional differences motivate the selective log-transformation strategy described below.

7.3 Log-Transformed Predictors

Predictors with |skewness| > 1 are log-transformed for the multilevel model. remained_in_the_same_school (skewness = 0.47) is entered on the raw scale as it is near-symmetric. Figure 10 shows the raw and log-transformed distributions for the variables that undergo this transformation.

Show code

log_vars <- c("PTFSM6CLA1A", "PERCTOT", "PNUMEAL",
              "average_number_of_days_taken")
log_vars <- intersect(log_vars, names(model_panel))

raw_long <- model_panel %>%
  select(all_of(log_vars)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  filter(!is.na(value), value > 0) %>%
  mutate(scale = "Raw")

log_long <- model_panel %>%
  select(all_of(log_vars)) %>%
  mutate(across(everything(), safe_log)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  filter(!is.na(value)) %>%
  mutate(scale = "Log-transformed")

compare_long <- bind_rows(raw_long, log_long) %>%
  mutate(
    variable = var_label(variable),
    scale = factor(scale, levels = c("Raw", "Log-transformed"))
  )

p_log <- ggplot(compare_long, aes(x = value)) +
  geom_histogram(bins = 40, fill = "#2e6260", alpha = 0.7, colour = "white",
                 linewidth = 0.2) +
  facet_grid(variable ~ scale, scales = "free") +
  labs(x = NULL, y = "Frequency",
       title = "Effect of log transformation on skewed predictors") +
  theme_pub +
  theme(strip.text = element_text(size = 9))

save_fig(p_log, "log_transform_comparison", width = 8, height = 7)
p_log

Figure 10: Raw vs. log-transformed distributions for skewed predictor variables

The side-by-side comparison confirms that the log transformation substantially reduces skewness for all four treated variables. After transformation, the distributions are much closer to the normal shape assumed by the linear mixed-effects model, improving the reliability of coefficient estimates and standard errors.

8 Bivariate Relationships

The interactive scatter plot below allows exploration of the bivariate relationships between each Attainment 8 outcome and the predictor variables. Use the dropdowns to select the outcome (y-axis) and predictor (x-axis), and the checkboxes to independently toggle log-transformation on each axis. This lets you compare the raw-scale relationship with the model-scale relationship (where several variables enter as logs). The OLS regression line and R² update in real time.

Show code

library(plotly)
library(htmltools)
library(jsonlite)

# --- Build a clean data frame with both raw and log versions ---
bivar_df <- model_panel %>%
  filter(!is.na(ATT8SCR)) %>%
  mutate(
    log_ATT8SCR            = safe_log(ATT8SCR),
    log_ATT8SCR_FSM6CLA1A  = safe_log(ATT8SCR_FSM6CLA1A),
    log_ATT8SCR_NFSM6CLA1A = safe_log(ATT8SCR_NFSM6CLA1A),
    log_PTFSM6CLA1A        = safe_log(PTFSM6CLA1A),
    log_PERCTOT             = safe_log(PERCTOT),
    log_PNUMEAL             = safe_log(PNUMEAL),
    log_PTPRIORLO           = safe_log(PTPRIORLO),
    log_gorard_segregation  = safe_log(gorard_segregation),
    log_remained            = safe_log(remained_in_the_same_school),
    log_leadership_pay      = safe_log(teachers_on_leadership_pay_range_percent),
    log_sickness_days       = safe_log(average_number_of_days_taken)
  )

# Hover label
bivar_df$hover_label <- if ("SCHNAME" %in% names(bivar_df) && "LANAME" %in% names(bivar_df)) {
  paste0(bivar_df$SCHNAME, " (", bivar_df$LANAME, ")")
} else {
  paste0("URN ", bivar_df$URN)
}

# --- Define variable metadata ---
# Each entry: raw column, log column, display label
outcome_meta <- list(
  list(raw = "ATT8SCR",             log = "log_ATT8SCR",
       label = "Attainment 8 — All Pupils"),
  list(raw = "ATT8SCR_FSM6CLA1A",   log = "log_ATT8SCR_FSM6CLA1A",
       label = "Attainment 8 — Disadvantaged"),
  list(raw = "ATT8SCR_NFSM6CLA1A",  log = "log_ATT8SCR_NFSM6CLA1A",
       label = "Attainment 8 — Non-Disadvantaged")
)

pred_meta <- list(
  list(raw = "PTFSM6CLA1A", log = "log_PTFSM6CLA1A",
       label = "% Disadvantaged (FSM)"),
  list(raw = "PERCTOT", log = "log_PERCTOT",
       label = "% Overall Absence"),
  list(raw = "PNUMEAL", log = "log_PNUMEAL",
       label = "% EAL"),
  list(raw = "PTPRIORLO", log = "log_PTPRIORLO",
       label = "% Low Prior Attainment (KS2)"),
  list(raw = "gorard_segregation", log = "log_gorard_segregation",
       label = "Gorard Segregation Index"),
  list(raw = "remained_in_the_same_school", log = "log_remained",
       label = "Teachers Remaining (FTE)"),
  list(raw = "teachers_on_leadership_pay_range_percent",
       log = "log_leadership_pay",
       label = "% Teachers on Leadership Pay"),
  list(raw = "average_number_of_days_taken", log = "log_sickness_days",
       label = "Avg. Teacher Sickness Days")
)

# Filter out predictors with no data
pred_meta <- Filter(function(p) any(!is.na(bivar_df[[p$raw]])), pred_meta)

# Split by year
years <- as.character(sort(unique(bivar_df$year_label)))
year_dfs <- lapply(years, function(yr) bivar_df[bivar_df$year_label == yr, ])
names(year_dfs) <- years

yr_colours <- c("#2e6260", "#abc766", "#e16fca", "#4e3c56")
n_tr <- length(years)

# --- Build base plotly (one trace per year + one OLS line) ---
fig <- plot_ly()
for (i in seq_along(years)) {
  df_i <- year_dfs[[i]]
  fig <- fig %>%
    add_markers(
      x = df_i[[pred_meta[[1]]$raw]],
      y = df_i[[outcome_meta[[1]]$raw]],
      text = df_i[["hover_label"]],
      name = years[i],
      marker = list(color = yr_colours[i], opacity = 0.35, size = 4),
      hovertemplate = paste0(
        "<b>%{text}</b><br>x: %{x:.3f}<br>y: %{y:.2f}",
        "<extra>", years[i], "</extra>"
      )
    )
}

# Regression line trace (hidden until JS computes OLS)
fig <- fig %>%
  add_lines(
    x = c(0, 1), y = c(0, 1),
    name = "OLS fit",
    line = list(color = "#7b132b", width = 2),
    showlegend = FALSE, hoverinfo = "skip", visible = FALSE
  )

fig <- fig %>%
  layout(
    xaxis  = list(title = pred_meta[[1]]$label),
    yaxis  = list(title = outcome_meta[[1]]$label),
    legend = list(orientation = "h", x = 1, xanchor = "right",
                  y = 1.02, yanchor = "bottom"),
    margin = list(t = 50, l = 80, b = 60)
  )

fig$elementId <- "bivar-overview"

# --- Pack ALL column data (raw + log) into JSON ---
all_cols <- c(
  unlist(lapply(outcome_meta, function(o) c(o$raw, o$log))),
  unlist(lapply(pred_meta, function(p) c(p$raw, p$log)))
)

js_data <- list()
for (yr in years) {
  yr_list <- list()
  df <- year_dfs[[yr]]
  for (col in all_cols) {
    if (col %in% names(df)) {
      yr_list[[col]] <- round(df[[col]], 5)
    }
  }
  js_data[[yr]] <- yr_list
}

# --- HTML dropdown + checkbox controls ---
select_style <- paste0(
  "padding:6px 10px; border:1px solid #bbb; border-radius:4px;",
  "font-size:0.88em; background:#fff; cursor:pointer;"
)
label_style <- paste0(
  "font-weight:600; font-size:0.82em; display:block;",
  "margin-bottom:4px; color:#555;"
)
cb_style <- paste0(
  "display:flex; align-items:center; gap:6px; margin-top:6px;",
  "font-size:0.82em; color:#555;"
)

controls <- tags$div(
  style = "display:flex; gap:24px; margin-bottom:14px; flex-wrap:wrap; align-items:end;",
  tags$div(
    tags$label("Outcome (y-axis):", `for` = "bvo-outcome", style = label_style),
    tags$select(
      id = "bvo-outcome", style = paste0(select_style, "min-width:230px;"),
      lapply(seq_along(outcome_meta), function(k)
        tags$option(value = k, outcome_meta[[k]]$label))
    ),
    tags$div(style = cb_style,
      tags$input(type = "checkbox", id = "bvo-log-y"),
      tags$label("Log scale", `for` = "bvo-log-y")
    )
  ),
  tags$div(
    tags$label("Predictor (x-axis):", `for` = "bvo-predictor", style = label_style),
    tags$select(
      id = "bvo-predictor", style = paste0(select_style, "min-width:270px;"),
      lapply(seq_along(pred_meta), function(k)
        tags$option(value = k, pred_meta[[k]]$label))
    ),
    tags$div(style = cb_style,
      tags$input(type = "checkbox", id = "bvo-log-x"),
      tags$label("Log scale", `for` = "bvo-log-x")
    )
  ),
  tags$div(
    tags$label("Year:", `for` = "bvo-year", style = label_style),
    tags$select(
      id = "bvo-year", style = paste0(select_style, "min-width:140px;"),
      tags$option(value = "all", "All Years"),
      lapply(years, function(yr) tags$option(value = yr, yr))
    )
  )
)

# --- Metadata arrays for JS ---
outcome_js <- lapply(outcome_meta, function(o)
  list(raw = o$raw, log = o$log, label = o$label))
pred_js <- lapply(pred_meta, function(p)
  list(raw = p$raw, log = p$log, label = p$label))

# --- JavaScript: wire controls -> Plotly.restyle / relayout + live OLS ---
js_block <- tags$script(HTML(sprintf('
(function() {
  var DATA  = %s;
  var YEARS = %s;
  var N     = YEARS.length;
  var LINE  = N;

  var OUTCOMES   = %s;
  var PREDICTORS = %s;

  function fitOLS(xAll, yAll) {
    var x = [], y = [];
    for (var i = 0; i < xAll.length; i++) {
      if (xAll[i] != null && yAll[i] != null &&
          isFinite(xAll[i]) && isFinite(yAll[i])) {
        x.push(xAll[i]); y.push(yAll[i]);
      }
    }
    var n = x.length;
    if (n < 3) return null;
    var sx=0, sy=0, sxx=0, sxy=0, syy=0;
    for (var i = 0; i < n; i++) {
      sx += x[i]; sy += y[i];
      sxx += x[i]*x[i]; sxy += x[i]*y[i]; syy += y[i]*y[i];
    }
    var xb = sx/n, yb = sy/n;
    var Sxx = sxx - n*xb*xb, Sxy = sxy - n*xb*yb, Syy = syy - n*yb*yb;
    if (Sxx === 0) return null;
    var b1 = Sxy / Sxx, b0 = yb - b1 * xb;
    var SSres = 0;
    for (var i = 0; i < n; i++) {
      var r = y[i] - (b0 + b1*x[i]); SSres += r*r;
    }
    var r2 = (Syy > 0) ? 1 - SSres/Syy : 0;
    var xmin = x[0], xmax = x[0];
    for (var i = 1; i < n; i++) {
      if (x[i] < xmin) xmin = x[i];
      if (x[i] > xmax) xmax = x[i];
    }
    return { b0:b0, b1:b1, r2:r2, n:n, xmin:xmin, xmax:xmax };
  }

  function update() {
    var gd = document.getElementById("bivar-overview");
    if (!gd || !gd._fullLayout) return;

    var outIdx  = parseInt(document.getElementById("bvo-outcome").value) - 1;
    var predIdx = parseInt(document.getElementById("bvo-predictor").value) - 1;
    var yrVal   = document.getElementById("bvo-year").value;
    var logX    = document.getElementById("bvo-log-x").checked;
    var logY    = document.getElementById("bvo-log-y").checked;

    var outMeta  = OUTCOMES[outIdx];
    var predMeta = PREDICTORS[predIdx];
    var xCol = logX ? predMeta.log : predMeta.raw;
    var yCol = logY ? outMeta.log  : outMeta.raw;
    var xLabel = (logX ? "log(" : "") + predMeta.label + (logX ? ")" : "");
    var yLabel = (logY ? "log(" : "") + outMeta.label  + (logY ? ")" : "");

    var xArr = [], yArr = [], visArr = [];
    var allX = [], allY = [];
    for (var i = 0; i < N; i++) {
      var yr = YEARS[i];
      var xd = DATA[yr][xCol] || [];
      var yd = DATA[yr][yCol] || [];
      xArr.push(xd); yArr.push(yd);
      var vis = (yrVal === "all" || yr === yrVal);
      visArr.push(vis);
      if (vis) {
        for (var j = 0; j < xd.length; j++) {
          allX.push(xd[j]); allY.push(yd[j]);
        }
      }
    }

    var idx = []; for (var i = 0; i < N; i++) idx.push(i);
    Plotly.restyle(gd, { x: xArr, y: yArr, visible: visArr }, idx);

    var fit = fitOLS(allX, allY);
    if (fit) {
      Plotly.restyle(gd, {
        x: [[ fit.xmin, fit.xmax ]],
        y: [[ fit.b0 + fit.b1*fit.xmin, fit.b0 + fit.b1*fit.xmax ]],
        visible: [true]
      }, [LINE]);

      var sign = (fit.b1 >= 0) ? " + " : " \\u2212 ";
      var eqn  = "\\u0177 = " + fit.b0.toFixed(2) + sign +
                 Math.abs(fit.b1).toFixed(3) + "x" +
                 "\\u2003\\u2003R\\u00b2 = " + fit.r2.toFixed(3) +
                 "\\u2003\\u2003n = " + fit.n.toLocaleString();

      Plotly.relayout(gd, {
        "xaxis.title": xLabel,
        "yaxis.title": yLabel,
        annotations: [{
          text: eqn,
          xref: "paper", yref: "paper",
          x: 0.02, y: 0.98,
          xanchor: "left", yanchor: "top",
          showarrow: false,
          font: { size: 12.5, color: "#4e3c56", family: "monospace" },
          bgcolor: "rgba(255,255,255,0.88)",
          borderpad: 5
        }]
      });
    } else {
      Plotly.restyle(gd, { visible: [false] }, [LINE]);
      Plotly.relayout(gd, {
        "xaxis.title": xLabel,
        "yaxis.title": yLabel,
        annotations: []
      });
    }
  }

  ["bvo-outcome","bvo-predictor","bvo-year"].forEach(function(id) {
    document.getElementById(id).addEventListener("change", update);
  });
  ["bvo-log-x","bvo-log-y"].forEach(function(id) {
    document.getElementById(id).addEventListener("change", update);
  });

  var poll = setInterval(function() {
    var gd = document.getElementById("bivar-overview");
    if (gd && gd._fullLayout) { clearInterval(poll); update(); }
  }, 250);
})();
',
  toJSON(js_data, digits = 5, na = "null", auto_unbox = FALSE),
  toJSON(years),
  toJSON(outcome_js, auto_unbox = TRUE),
  toJSON(pred_js, auto_unbox = TRUE)
)))

htmltools::tagList(controls, fig, js_block)

Outcome (y-axis):

Log scale

Predictor (x-axis):

Log scale

Year:

Figure 11

Toggling the log checkboxes reveals how the transformation reshapes each relationship. For example, the raw % Disadvantaged vs raw Attainment 8 plot shows a curved, heteroscedastic pattern; switching both axes to log scale linearises the relationship and stabilises the variance — illustrating why the model uses log(y) ~ log(x) for these variables.

9 Analysis Sample Restriction

The multilevel models are fitted to mainstream state-funded schools only (Academies and Maintained schools). Independent schools, colleges, and special schools are excluded because they operate under different funding, governance, and admissions structures that make direct comparison inappropriate.

Table 10 shows the effect of this restriction.

Show code

type_summary <- panel_data %>%
  mutate(
    in_model = MINORGROUP %in% c("Academy", "Maintained school"),
    MINORGROUP = ifelse(is.na(MINORGROUP), "(Missing)", MINORGROUP)
  ) %>%
  count(MINORGROUP, in_model) %>%
  arrange(desc(n)) %>%
  mutate(
    `% of panel` = round(n / sum(n) * 100, 1),
    Status = ifelse(in_model, "Included", "Excluded")
  ) %>%
  select(`School type` = MINORGROUP, `School-year obs.` = n,
         `% of panel`, Status)

type_summary %>%
  kable(align = "lrrl") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(which(type_summary$Status == "Excluded"),
           color = "#4e3c56", italic = TRUE) %>%
  save_tbl("school_type_filter")

Table 10: Sample restriction by school type (MINORGROUP)

School type	School-year obs.	% of panel	Status
Academy	10905	63.7	Included
Independent school	3641	21.3	Excluded
Maintained school	2514	14.7	Included
College	55	0.3	Excluded
Special school	15	0.1	Excluded

Show code

model_summary <- model_panel %>%
  group_by(`Academic year` = year_label) %>%
  summarise(
    Schools = n_distinct(URN),
    `School-year obs.` = n(),
    LAs = n_distinct(LANAME),
    .groups = "drop"
  )

totals <- tibble(
  `Academic year` = "Total / Overall",
  Schools = n_distinct(model_panel$URN),
  `School-year obs.` = nrow(model_panel),
  LAs = n_distinct(model_panel$LANAME)
)

bind_rows(model_summary, totals) %>%
  kable(align = "lrrr", format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  row_spec(nrow(model_summary) + 1, bold = TRUE) %>%
  save_tbl("analysis_sample_dimensions")

Table 11: Analysis sample dimensions after excluding non-mainstream schools

Academic year	Schools	School-year obs.	LAs
2021-22	3,349	3,349	152
2022-23	3,351	3,351	152
2023-24	3,372	3,372	152
2024-25	3,347	3,347	152
Total / Overall	3,523	13,419	152

All descriptive statistics and figures from this point forward are based on the analysis sample (13,419 school-year observations from 3523 mainstream state-funded schools).

10 Missing Data

Understanding the pattern of missing data is important for assessing whether the analysis sample is representative and whether the imputation strategy (described in the next section) is appropriate.

10.1 Completeness by Variable and Year

Show code

all_model_vars <- c(
  "ATT8SCR", "ATT8SCR_FSM6CLA1A", "ATT8SCR_NFSM6CLA1A",
  "PTFSM6CLA1A", "PERCTOT", "PPERSABS10", "PNUMEAL", "PTPRIORLO",
  "ADMPOL_PT", "gorard_segregation", "OFSTEDRATING_1",
  "remained_in_the_same_school",
  "teachers_on_leadership_pay_range_percent",
  "average_number_of_days_taken",
  "gor_name", "LANAME"
)
all_model_vars <- intersect(all_model_vars, names(model_panel))

completeness <- model_panel %>%
  group_by(year_label) %>%
  summarise(
    across(all_of(all_model_vars),
           ~ round(sum(!is.na(.)) / n() * 100, 1),
           .names = "{.col}"),
    .groups = "drop"
  ) %>%
  pivot_longer(-year_label, names_to = "Variable", values_to = "pct") %>%
  pivot_wider(names_from = year_label, values_from = pct) %>%
  mutate(Variable = var_label(Variable))

completeness %>%
  kable(align = "lrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE, font_size = 11) %>%
  save_tbl("data_completeness")

Table 12: Data completeness (% non-missing) by variable and academic year

Variable	2021-22	2022-23	2023-24	2024-25
Avg. Attainment 8 score per pupil	96.4	97.0	97.0	97.9
Avg. Attainment 8 score (disadvantaged)	95.1	95.6	95.8	97.0
Avg. Attainment 8 score (non-disadvantaged)	95.1	95.7	95.9	97.0
% KS4 pupils who are disadvantaged	96.4	97.0	97.0	97.9
% overall absence	98.8	99.0	98.8	99.3
% persistent absentees (10%+ sessions missed)	98.8	99.0	98.8	99.3
% pupils with English not first language	98.6	98.7	98.4	98.1
% KS4 pupils with low prior attainment (KS2)	96.4	97.0	97.0	0.0
Admissions policy (new definition from 2019)	100.0	100.0	100.0	100.0
Gorard segregation index	96.3	96.9	96.9	97.9
Ofsted rating (harmonised)	99.5	99.2	97.3	98.1
Teachers remaining in same school (FTE)	98.6	98.2	98.2	0.0
% teachers on leadership pay range	98.6	98.2	98.2	99.0
Avg. teacher sickness days	92.5	90.1	96.3	0.0
Government Office Region	100.0	100.0	100.0	100.0
Local authority name	100.0	100.0	100.0	100.0

Core attainment and composition variables (Attainment 8, FSM eligibility, absence, EAL) enjoy near-complete coverage across all four years. The notable exception is the workforce block — teacher retention, leadership pay share, and sickness absence — which drops to 0% availability in 2024–25 because the School Workforce Census for that year had not been published at the time of data extraction. KS2 prior attainment (PTPRIORLO) is also missing for 2024–25 due to the lingering effects of COVID-era assessment cancellations.

10.2 Missing Data Heatmap

Show code

miss_long <- model_panel %>%
  select(year_label, all_of(all_model_vars)) %>%
  group_by(year_label) %>%
  summarise(
    across(everything(), ~ sum(is.na(.)) / n() * 100),
    .groups = "drop"
  ) %>%
  pivot_longer(-year_label, names_to = "variable", values_to = "pct_missing") %>%
  mutate(variable = var_label(variable))

p_miss <- ggplot(miss_long, aes(x = year_label,
                                 y = fct_reorder(variable, pct_missing, .fun = max),
                                 fill = pct_missing)) +
  geom_tile(colour = "white", linewidth = 0.8) +
  geom_text(aes(label = sprintf("%.0f%%", pct_missing)), size = 3) +
  scale_fill_gradient(low = "#bbd2f0", high = "#7b132b",
                      name = "% missing", limits = c(0, 100)) +
  labs(x = "Academic year", y = NULL,
       title = "Missing data by variable and year") +
  theme_pub +
  theme(panel.grid = element_blank())

save_fig(p_miss, "missing_heatmap", width = 8, height = 5)
p_miss

Figure 12: Missing data pattern across model variables by academic year

The heatmap makes the pattern stark: missingness is concentrated almost entirely in the workforce and prior attainment variables for the 2024–25 year. For the earlier three years, data completeness is consistently high across all variables. This structured, year-specific pattern supports the use of a carry-forward imputation strategy rather than more complex multiple imputation approaches.

Teacher sickness absence: a notable gap in 2021–22 and 2022–23

Although the heatmap shows high overall completeness for the earlier years, one variable — average_number_of_days_taken (teacher sickness absence) — has a less visible but consequential pattern of missingness. Approximately 570 school-year observations pass all other filters but fail on this variable: ~250 in 2021-22 and ~244 in 2022-23, falling to ~54 in 2023-24 and ~22 in 2024-25. Around 420 of these are genuinely missing (NA) and ~150 are reported as zero (excluded because the log transform requires strictly positive values). Academies account for roughly 80% of the affected schools, suggesting patchier sickness absence reporting in the early post-COVID workforce census returns. Because the model uses log(average_number_of_days_taken), these cases cannot be retained without either imputation or removing the variable from the model specification.

11 Imputation: 2024–25 Carry-Forward

Four predictor variables are unavailable for the 2024–25 academic year:

PTPRIORLO — KS2 prior attainment was cancelled due to the lingering effects of COVID-19 disruption
remained_in_the_same_school — workforce turnover data not yet published
teachers_on_leadership_pay_range_percent — workforce pay data not yet published
average_number_of_days_taken — workforce sickness data not yet published

These are imputed using a two-step approach:

Carry-forward: use the school’s 2023–24 value
Fallback: if no 2023–24 value exists, use the school-level mean across all available years

A flag variable has_imputed_predictors is set to TRUE for any 2024–25 row that received at least one imputed value.

Show code

impute_vars <- c("PTPRIORLO", "remained_in_the_same_school",
                 "teachers_on_leadership_pay_range_percent",
                 "average_number_of_days_taken")
impute_vars <- intersect(impute_vars, names(model_panel))

rows_2425 <- model_panel %>% filter(year_label == "2024-25")
n_2425 <- nrow(rows_2425)

imputation_summary <- map_dfr(impute_vars, function(v) {
  n_available <- sum(!is.na(rows_2425[[v]]))
  n_missing   <- n_2425 - n_available
  tibble(
    Variable          = var_label(v),
    `2024-25 rows`    = n_2425,
    `Non-missing`     = n_available,
    `Still missing`   = n_missing,
    `% imputed/available` = round(n_available / n_2425 * 100, 1)
  )
})

flagged <- sum(rows_2425$has_imputed_predictors, na.rm = TRUE)

imputation_summary %>%
  kable(align = "lrrrr") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE) %>%
  footnote(general = paste0(flagged, " of ", n_2425,
                            " 2024-25 rows flagged as having imputed predictors (",
                            round(flagged / n_2425 * 100, 1), "%).")) %>%
  save_tbl("imputation_summary")

Table 13: Imputation summary for 2024-25 rows

Variable	2024-25 rows	Non-missing	Still missing	% imputed/available
% KS4 pupils with low prior attainment (KS2)	3347	0	3347	0
Teachers remaining in same school (FTE)	3347	0	3347	0
% teachers on leadership pay range	3347	3314	33	99
Avg. teacher sickness days	3347	0	3347	0
Note:
3224 of 3347 2024-25 rows flagged as having imputed predictors (96.3%).

Show code

if (length(impute_vars) > 0) {
  compare_df <- model_panel %>%
    filter(year_label %in% c("2023-24", "2024-25")) %>%
    select(year_label, all_of(impute_vars)) %>%
    pivot_longer(-year_label, names_to = "variable", values_to = "value") %>%
    filter(!is.na(value)) %>%
    mutate(variable = var_label(variable))

  p_impute <- ggplot(compare_df, aes(x = value, fill = year_label)) +
    geom_density(alpha = 0.5, colour = NA) +
    facet_wrap(~ variable, scales = "free", ncol = 2) +
    scale_fill_manual(
      values = c("2023-24" = "#e16fca", "2024-25" = "#4e3c56"),
      name = "Academic year"
    ) +
    labs(x = NULL, y = "Density",
         title = "Observed (2023-24) vs. imputed (2024-25) distributions",
         subtitle = "2024-25 values are carry-forwards from 2023-24 or school-level means") +
    theme_pub

  save_fig(p_impute, "imputed_vs_observed", width = 8, height = 6)
  p_impute
}

Figure 13: Comparison of observed (2023-24) vs. carried-forward (2024-25) values for imputed variables

The density overlays show that the carried-forward 2024–25 distributions closely match the observed 2023–24 distributions, as expected. This provides reassurance that the imputation does not introduce systematic bias into the predictor values. The assumption underpinning this approach is that school-level workforce characteristics are slow-moving and unlikely to change dramatically within a single year.

11.1 Year-on-Year Stability of KS2 Prior Attainment

A key question for the carry-forward strategy is whether imputed variables are sufficiently stable from one year to the next at the school level. Figure 14 addresses this for PTPRIORLO (the percentage of pupils entering with low KS2 prior attainment) by plotting each school’s value in one year against its value in the next. If the variable were perfectly stable, all points would fall on the 45-degree line.

Show code

# Build wide data: one row per school with each year's PTPRIORLO
prior_wide <- model_panel %>%
  filter(!is.na(PTPRIORLO)) %>%
  select(URN, year_label, PTPRIORLO) %>%
  pivot_wider(names_from = year_label, values_from = PTPRIORLO,
              names_prefix = "y_")

# Pair 1: 2021-22 vs 2022-23
pair1 <- prior_wide %>%
  filter(!is.na(`y_2021-22`), !is.na(`y_2022-23`))
r1 <- cor(pair1$`y_2021-22`, pair1$`y_2022-23`)

# Pair 2: 2022-23 vs 2023-24
pair2 <- prior_wide %>%
  filter(!is.na(`y_2022-23`), !is.na(`y_2023-24`))
r2 <- cor(pair2$`y_2022-23`, pair2$`y_2023-24`)

p_scatter1 <- ggplot(pair1, aes(x = `y_2021-22`, y = `y_2022-23`)) +
  geom_point(alpha = 0.25, size = 1, colour = "#2e6260") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey30") +
  annotate("text", x = Inf, y = -Inf,
           label = paste0("r = ", sprintf("%.3f", r1), "\nn = ", format(nrow(pair1), big.mark = ",")),
           hjust = 1.1, vjust = -0.5, size = 3.5, colour = "grey30") +
  labs(x = "% low prior attainment (2021-22)",
       y = "% low prior attainment (2022-23)",
       title = "2021-22 vs 2022-23") +
  coord_equal() +
  theme_pub

p_scatter2 <- ggplot(pair2, aes(x = `y_2022-23`, y = `y_2023-24`)) +
  geom_point(alpha = 0.25, size = 1, colour = "#abc766") +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed", colour = "grey30") +
  annotate("text", x = Inf, y = -Inf,
           label = paste0("r = ", sprintf("%.3f", r2), "\nn = ", format(nrow(pair2), big.mark = ",")),
           hjust = 1.1, vjust = -0.5, size = 3.5, colour = "grey30") +
  labs(x = "% low prior attainment (2022-23)",
       y = "% low prior attainment (2023-24)",
       title = "2022-23 vs 2023-24") +
  coord_equal() +
  theme_pub

p_prior_scatter <- p_scatter1 + p_scatter2 +
  plot_annotation(
    title = "Year-on-year stability of PTPRIORLO at the school level",
    subtitle = "Dashed line = perfect carry-forward (y = x)",
    theme = theme(plot.title = element_text(face = "bold", size = 12),
                  plot.subtitle = element_text(colour = "grey40", size = 10))
  )

save_fig(p_prior_scatter, "ptpriorlo_year_scatter", width = 10, height = 5)
p_prior_scatter

Figure 14: Year-on-year stability of KS2 prior attainment (% low prior) at the school level

Both panels show a tight clustering of schools along the 45-degree line, with high year-on-year correlations. This confirms that the proportion of pupils entering with low KS2 prior attainment is a stable, slow-moving school characteristic. The degree of scatter is modest and symmetric around the diagonal, indicating no systematic drift. This pattern provides strong support for the carry-forward imputation of PTPRIORLO into 2024–25: if a school’s value barely changes from 2021–22 to 2022–23 or from 2022–23 to 2023–24, it is reasonable to assume a similar level of stability into the following year.

12 Derived Variables

12.1 Gorard Segregation Index

The Gorard segregation index measures how unevenly disadvantaged pupils are distributed across schools within each Local Authority (calculation details). It is computed per LA per year as:

\[G_s = \frac{1}{2} \sum_{i=1}^{n} \left| \frac{F_i}{F_{\text{total}}} - \frac{T_i}{T_{\text{total}}} \right|\]

where \(F_i\) is the number of disadvantaged pupils in school \(i\), \(F_{\text{total}}\) the LA total, and \(T_i\), \(T_{\text{total}}\) the corresponding all-pupil totals.

Show code

gorard_la <- model_panel %>%
  filter(!is.na(gorard_segregation)) %>%
  distinct(LANAME, year_label, gorard_segregation)

p_gorard <- ggplot(gorard_la, aes(x = gorard_segregation, fill = year_label)) +
  geom_histogram(bins = 30, alpha = 0.65, position = "identity", colour = "white",
                 linewidth = 0.2) +
  scale_fill_manual(values = year_cols, name = "Academic year") +
  labs(x = "Gorard segregation index", y = "Number of LAs",
       title = "Distribution of LA-level segregation index") +
  theme_pub

save_fig(p_gorard, "gorard_segregation", width = 7, height = 4)
p_gorard

Figure 15: Distribution of Gorard segregation index across LAs by year

Most Local Authorities have segregation index values between 0.10 and 0.35, indicating moderate unevenness in the distribution of disadvantaged pupils across schools. A small number of LAs show notably high segregation (above 0.40), often corresponding to areas with grammar schools or other selective provision. The distribution is fairly stable across years, consistent with the view that school-level segregation is a structural feature that changes slowly.

13 Correlation Structure

Show code

# Variables that enter the model as log(x)
log_transform_vars <- c("PTFSM6CLA1A", "PERCTOT", "PNUMEAL",
                         "average_number_of_days_taken")

cor_vars <- c("ATT8SCR", "PTFSM6CLA1A", "PERCTOT", "PNUMEAL",
              "PTPRIORLO", "gorard_segregation",
              "remained_in_the_same_school",
              "teachers_on_leadership_pay_range_percent",
              "average_number_of_days_taken")
cor_vars <- intersect(cor_vars, names(model_panel))

cor_data <- model_panel %>%
  select(all_of(cor_vars)) %>%
  drop_na() %>%
  mutate(across(all_of(intersect(log_transform_vars, names(.))), safe_log))

cor_mat <- cor(cor_data)

# Apply labels — prefix logged variables with "log()"
cor_labels <- sapply(cor_vars, function(v) {
  lbl <- var_label(v)
  if (v %in% log_transform_vars) paste0("log(", lbl, ")") else lbl
})
labs_vec <- str_wrap(cor_labels, width = 20)
rownames(cor_mat) <- labs_vec
colnames(cor_mat) <- labs_vec

cor_long <- cor_mat %>%
  as.data.frame() %>%
  rownames_to_column("var1") %>%
  pivot_longer(-var1, names_to = "var2", values_to = "r") %>%
  mutate(
    var1 = factor(var1, levels = rev(labs_vec)),
    var2 = factor(var2, levels = labs_vec)
  )

p_cor <- ggplot(cor_long, aes(x = var2, y = var1, fill = r)) +
  geom_tile(colour = "white", linewidth = 0.5) +
  geom_text(aes(label = sprintf("%.2f", r)), size = 2.2) +
  scale_fill_gradient2(low = "#7b132b", mid = "#f3f3f3", high = "#2e6260",
                       midpoint = 0, limits = c(-1, 1), name = "Pearson r") +
  labs(x = NULL, y = NULL,
       title = "Correlation matrix: outcome and predictors") +
  theme_pub +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
    axis.text.y = element_text(size = 8),
    panel.grid = element_blank()
  )

save_fig(p_cor, "correlation_matrix", width = 9, height = 7)
p_cor

Figure 16: Pairwise Pearson correlations among the outcome and all fixed-effect predictors (log-transformed variables shown on the model scale)

Correlations are computed on the model scale — i.e. the four log-transformed predictors are shown after transformation, matching the scale on which they enter the multilevel model. Several expected patterns emerge. The proportion of disadvantaged pupils has a strong negative correlation with Attainment 8, confirming that school-level deprivation is the dominant compositional predictor. Absence is positively correlated with deprivation and negatively with attainment. The EAL proportion shows a modest positive association with attainment, likely reflecting the concentration of EAL pupils in London schools which tend to perform above the national average. Among the workforce variables, teacher retention shows a positive correlation with attainment, while sickness absence is negatively associated. Importantly, no pair of predictors is so highly correlated as to raise serious multicollinearity concerns for the multilevel model.

14 Summary

The full panel dataset contains 17,130 school-year observations from 4534 unique schools. After restricting to mainstream state-funded schools (Academies and Maintained schools), the analysis sample comprises 13,419 school-year observations from 3523 schools across 152 Local Authorities and 4 academic years (2021–22 to 2024–25). Key features:

Sample restriction: Independent schools (3,641 obs.), colleges, and special schools are excluded from the analysis.
Outcome completeness: Attainment 8 scores (all pupils) are available for 100% of analysis sample observations by construction.
Predictor completeness: Core predictors (disadvantage, absence, EAL) exceed 95% availability across all years. Workforce variables are unavailable for 2024–25 and are carry-forward imputed.
Imputation scope: 24% of analysis sample observations (confined to 2024–25) contain at least one imputed predictor.
Transformations: Four right-skewed predictors are log-transformed; the Gorard segregation index is computed at the LA-year level.