Tidy Tuesday: Selected British Literary Prizes

tidytuesday
R
literature
diversity
Three decades of British literary prize data — examining gender representation, ethnic diversity trends, and the educational backgrounds that predict literary prestige.
Author

Sean Thimons

Published

October 28, 2025

Preface

From TidyTuesday repository.

This dataset from the Post45 Data Collective examines 32 years (1990–2022) of British literary award data, documenting author demographics alongside prize information — including gender identity, ethnicity, educational background, and whether authors were shortlisted or won.

  • In which genres are women and ethnically diverse writers most frequently shortlisted or awarded?
  • Have prizes demonstrated improvement in gender and ethnic representation?
  • Does a connection exist between specific educational backgrounds and writer selection success?

Loading necessary packages

My handy booster pack that allows me to install (if needed) and load my usual and favorite packages, as well as some helpful functions.

Code
# Packages ----------------------------------------------------------------

{
  if (!requireNamespace("pak", quietly = TRUE)) {
    install.packages(
      "pak",
      repos = sprintf(
        "https://r-lib.github.io/p/pak/stable/%s/%s/%s",
        .Platform$pkgType,
        R.Version()$os,
        R.Version()$arch
      )
    )
  }

  install_booster_pack <- function(package, load = TRUE) {
    for (pkg in package) {
      if (!requireNamespace(pkg, quietly = TRUE)) {
        pak::pkg_install(pkg)
      }
      if (load) {
        library(pkg, character.only = TRUE)
      }
    }
  }

  if (file.exists('packages.txt')) {
    packages <- read.table('packages.txt')
    install_booster_pack(package = packages$Package, load = FALSE)
    rm(packages)
  } else {
    booster_pack <- c(
      ### IO ----
      'fs',
      'here',
      'janitor',
      'rio',
      'tidyverse',

      ### EDA ----
      'skimr',

      ### Plot ----
      'ggrepel',
      'ggtext',
      'scales',

      ### Misc ----
      'tidytuesdayR'
    )

    install_booster_pack(package = booster_pack, load = TRUE)
    rm(install_booster_pack, booster_pack)
  }

  # Custom Functions ----

  `%ni%` <- Negate(`%in%`)

  geometric_mean <- function(x) {
    exp(mean(log(x[x > 0]), na.rm = TRUE))
  }

  my_skim <- skim_with(
    numeric = sfl(
      n = length,
      min = ~ min(.x, na.rm = T),
      p25 = ~ stats::quantile(., probs = .25, na.rm = TRUE, names = FALSE),
      med = ~ median(.x, na.rm = T),
      p75 = ~ stats::quantile(., probs = .75, na.rm = TRUE, names = FALSE),
      max = ~ max(.x, na.rm = T),
      mean = ~ mean(.x, na.rm = T),
      geo_mean = ~ geometric_mean(.x),
      sd = ~ stats::sd(., na.rm = TRUE),
      hist = ~ inline_hist(., 5)
    ),
    append = FALSE
  )
}

Load raw data from package

raw <- tidytuesdayR::tt_load('2025-10-28')

prizes <- raw$prizes

Exploratory Data Analysis

The my_skim() function is a modified version of the skimr::skim() function that returns the number of missing data points (cells as NA) as well as the inverse (e.g.: number of rows that are not NA), the count, minimum, 25%, median, 75%, max, mean, geometric mean, and standard deviation. It also generates a little ASCII histogram. Neat!

Literary Prizes

prizes %>%
  my_skim(.)
Data summary
Name Piped data
Number of rows 952
Number of columns 23
_______________________
Column type frequency:
character 20
logical 1
numeric 2
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
prize_alias 0 1 11 39 0 15 0
prize_name 0 1 11 39 0 29 0
prize_institution 0 1 5 44 0 10 0
prize_genre 0 1 3 18 0 9 0
person_id 0 1 4 4 0 682 0
person_role 0 1 6 11 0 2 0
last_name 0 1 2 16 0 622 0
first_name 0 1 2 16 0 435 0
name 0 1 8 25 0 681 0
gender 0 1 3 10 0 4 0
sexuality 0 1 5 12 0 3 0
ethnicity_macro 0 1 5 18 0 9 0
ethnicity 0 1 5 34 0 149 0
highest_degree 0 1 2 24 0 10 0
degree_institution 0 1 3 52 0 187 0
degree_field_category 0 1 3 28 0 13 0
degree_field 0 1 3 48 0 160 0
viaf 0 1 3 22 0 676 0
book_id 0 1 5 5 0 861 0
book_title 0 1 1 91 0 859 0

Variable type: logical

skim_variable n_missing complete_rate mean count
uk_residence 0 1 0.7 TRU: 670, FAL: 282

Variable type: numeric

skim_variable n_missing complete_rate n min p25 med p75 max mean geo_mean sd hist
prize_id 0 1 952 1 8 10 13 16 9.74 8.41 4.22 ▃▂▇▃▅
prize_year 0 1 952 1991 2002 2009 2016 2022 2008.52 2008.50 8.44 ▅▆▆▇▇
prizes %>%
  count(prize_alias, sort = TRUE)
# A tibble: 15 × 2
   prize_alias                                 n
   <chr>                                   <int>
 1 Booker Prize                              191
 2 Women's Prize for Fiction                 162
 3 Baillie Gifford Prize for Non-Fiction     141
 4 Gold Dagger                               107
 5 Ted Hughes Award for New Work in Poetry    60
 6 BSFA Award for Best Novel                  33
 7 James Tait Black Prize for Fiction         33
 8 Costa Biography Award                      32
 9 Costa Book of the Year                     31
10 Costa Children's Book Award                31
11 Costa First Novel Award                    31
12 Costa Novel Award                          31
13 Costa Poetry Award                         31
14 TS Eliot Prize                             30
15 James Tait Black Prize for Drama            8
prizes %>%
  count(person_role, sort = TRUE)
# A tibble: 2 × 2
  person_role     n
  <chr>       <int>
1 shortlisted   534
2 winner        418
prizes %>%
  count(gender, sort = TRUE)
# A tibble: 4 × 2
  gender         n
  <chr>      <int>
1 man          475
2 woman        470
3 non-binary     6
4 transman       1
prizes %>%
  count(ethnicity_macro, sort = TRUE)
# A tibble: 9 × 2
  ethnicity_macro        n
  <chr>              <int>
1 White British        499
2 Non-UK White         199
3 Asian                 67
4 Irish                 53
5 Jewish                43
6 Black British         29
7 African               23
8 Caribbean             20
9 Non-White American    19
prizes %>%
  count(prize_genre, sort = TRUE)
# A tibble: 9 × 2
  prize_genre            n
  <chr>              <int>
1 fiction              448
2 non-fiction          141
3 poetry               121
4 crime                107
5 sff                   33
6 biography             32
7 children's            31
8 no/any/multi genre    31
9 drama                  8

Diversity in Literary Prizes

Gender Representation Over Time

gender_by_year <- prizes %>%
  filter(!is.na(gender), gender %in% c("woman", "man")) %>%
  group_by(prize_year, gender) %>%
  summarize(n = n(), .groups = "drop") %>%
  group_by(prize_year) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()

# 5-year rolling window
gender_rolling <- gender_by_year %>%
  filter(gender == "woman") %>%
  arrange(prize_year) %>%
  mutate(pct_rolling = zoo::rollmean(pct, k = 5, fill = NA, align = "right"))

Winners vs. Shortlisted by Gender

win_rate <- prizes %>%
  filter(!is.na(gender), gender %in% c("woman", "man")) %>%
  group_by(gender, person_role) %>%
  summarize(n = n(), .groups = "drop") %>%
  group_by(gender) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()

win_rate
# A tibble: 4 × 4
  gender person_role     n   pct
  <chr>  <chr>       <int> <dbl>
1 man    shortlisted   243 0.512
2 man    winner        232 0.488
3 woman  shortlisted   289 0.615
4 woman  winner        181 0.385

Educational Background

prizes %>%
  filter(!is.na(degree_institution)) %>%
  count(degree_institution, sort = TRUE) %>%
  head(15)
# A tibble: 15 × 2
   degree_institution            n
   <chr>                     <int>
 1 unknown                     181
 2 University of Oxford        116
 3 University of Cambridge      73
 4 none                         51
 5 University of East Anglia    29
 6 University of London         29
 7 Harvard University           20
 8 Trinity College, Dublin      19
 9 Columbia University          13
10 University of Iowa           10
11 University of Sheffield      10
12 Queen's University            9
13 University of Edinburgh       9
14 University of Glasgow         8
15 Yale University               8
prizes %>%
  filter(!is.na(highest_degree)) %>%
  count(highest_degree, sort = TRUE)
# A tibble: 10 × 2
   highest_degree               n
   <chr>                    <int>
 1 Bachelors                  296
 2 unknown                    227
 3 Masters                    208
 4 Doctorate                  157
 5 none                        51
 6 Postgraduate                 6
 7 Juris Doctor                 4
 8 Certificate of Education     1
 9 Diploma                      1
10 MD                           1

Visualizing Diversity in British Letters

# Subdued literary palette
gender_cols <- c("woman" = "#8E44AD", "man" = "#2C3E50")

ggplot(gender_by_year, aes(x = prize_year, y = pct, fill = gender)) +
  geom_area(alpha = 0.7) +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "#888888") +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_x_continuous(breaks = seq(1990, 2022, 4)) +
  scale_fill_manual(values = gender_cols, name = "Gender") +
  labs(
    title = "Gender Balance in British Literary Prizes (1990–2022)",
    subtitle = "Share of shortlisted and winning authors by gender | Dashed line = parity",
    x = "Year",
    y = "Share of Nominations + Wins",
    caption = "Source: TidyTuesday 2025-10-28 | Post45 Data Collective"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 17, color = "#2C3E50"),
    plot.subtitle = element_text(size = 11, color = "#555555"),
    plot.caption = element_text(size = 9, color = "#888888"),
    legend.position = "bottom",
    panel.grid.minor = element_blank()
  )

ggplot(ethnicity_by_year, aes(x = prize_year, y = pct_diverse)) +
  geom_col(fill = "#D35400", alpha = 0.7, width = 0.8) +
  geom_smooth(method = "loess", se = TRUE, color = "#8E44AD", linewidth = 1, alpha = 0.2) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_x_continuous(breaks = seq(1990, 2022, 4)) +
  labs(
    title = "Ethnic Diversity in British Literary Prizes",
    subtitle = "Percentage of non-white shortlisted/winning authors per year | LOESS trend",
    x = "Year",
    y = "% Ethnically Diverse Authors",
    caption = "Source: TidyTuesday 2025-10-28 | Post45 Data Collective"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 17, color = "#2C3E50"),
    plot.subtitle = element_text(size = 11, color = "#555555"),
    plot.caption = element_text(size = 9, color = "#888888"),
    panel.grid.minor = element_blank()
  )

Final thoughts and takeaways

British literary prizes have slowly but measurably diversified over three decades. Gender representation has approached parity in recent years, with women regularly making up half or more of shortlists — a meaningful shift from the male-dominated early 1990s. Ethnic diversity shows a more uneven trajectory, with notable gains in some years but inconsistent progress overall.

The educational background data reveals an uncomfortable truth about literary prestige: Oxbridge graduates remain heavily overrepresented among prize nominees and winners. This suggests that access to literary networks, mentorship, and publishing connections may matter as much as raw talent — a structural advantage that diversity initiatives in prize selection alone cannot fully address.

Note

The researchers emphasize that this data captures broad patterns and “is not intended as definitive.” Demographic classifications — particularly around ethnicity and gender — are based on publicly available information and may not reflect individuals’ self-identification.