Tidy Tuesday: CDC Archived Datasets

tidytuesday
R
public health
text analysis
Exploring the metadata of 1,257 CDC datasets archived before a federal purge of LGBTQ+ and HIV-related health data.
Author

Sean Thimons

Published

February 11, 2025

Preface

From the TidyTuesday repository.

This week’s dataset documents CDC datasets that were archived before the Trump administration purged health agency websites of LGBTQ+ and HIV-related content. The data includes metadata for 1,257 archived datasets along with Federal Program Inventory (FPI) codes and OMB bureau codes that map programs to their parent agencies.

The Infectious Diseases Society of America emphasized that “removal of HIV- and LGBTQ-related resources…creates a dangerous gap in scientific information” crucial for disease professionals and outbreak response.

This analysis focuses on two questions:

  • Which Bureaus and Programs contain the most archived datasets?
  • Which keywords appear most frequently across datasets?

Loading necessary packages

My handy booster pack: it installs (if needed) and loads my usual and favorite packages, along with some helpful functions.

Code
# Packages ----------------------------------------------------------------

{
  # Install pak if it's not already installed
  if (!requireNamespace("pak", quietly = TRUE)) {
    install.packages(
      "pak",
      repos = sprintf(
        "https://r-lib.github.io/p/pak/stable/%s/%s/%s",
        .Platform$pkgType,
        R.Version()$os,
        R.Version()$arch
      )
    )
  }

  # CRAN Packages ----
  install_booster_pack <- function(package, load = TRUE) {
    for (pkg in package) {
      if (!requireNamespace(pkg, quietly = TRUE)) {
        pak::pkg_install(pkg)
      }
      if (load) {
        library(pkg, character.only = TRUE)
      }
    }
  }

  if (file.exists('packages.txt')) {
    packages <- read.table('packages.txt', header = TRUE)

    install_booster_pack(package = packages$Package, load = FALSE)

    rm(packages)
  } else {
    ## Packages ----

    booster_pack <- c(
      ### IO ----
      'fs',
      'here',
      'janitor',
      'rio',
      'tidyverse',

      ### EDA ----
      'skimr',

      ### Plot ----
      # 'esquisse',          # Interactive plot builder
      # 'paletteer',         # Color palette collection
      'patchwork', # Multi-panel layouts — combining keyword and category plots
      'ggtext', # Rich text in ggplot — formatted subtitle/caption text
      'ggrepel', # Non-overlapping labels — labeling top categories

      ### Text ----
      'tidytext', # Text mining — tokenizing tags/keywords column

      ### Reporting ----
      'gt', # Grammar of tables — formatted summary tables

      ### Misc ----
      'tidytuesdayR'
    )

    # ! Change load flag to load packages
    install_booster_pack(package = booster_pack, load = TRUE)
    rm(install_booster_pack, booster_pack)
  }

  # Custom Functions ----

  `%ni%` <- Negate(`%in%`)

  geometric_mean <- function(x) {
    exp(mean(log(x[x > 0]), na.rm = TRUE))
  }

  my_skim <- skim_with(
    numeric = sfl(
      n = length,
      min = ~ min(.x, na.rm = TRUE),
      p25 = ~ stats::quantile(.x, probs = 0.25, na.rm = TRUE, names = FALSE),
      med = ~ median(.x, na.rm = TRUE),
      p75 = ~ stats::quantile(.x, probs = 0.75, na.rm = TRUE, names = FALSE),
      max = ~ max(.x, na.rm = TRUE),
      mean = ~ mean(.x, na.rm = TRUE),
      geo_mean = ~ geometric_mean(.x),
      sd = ~ stats::sd(.x, na.rm = TRUE),
      hist = ~ inline_hist(.x, 5)
    ),
    append = FALSE
  )
}
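As a quick sanity check on the custom helpers above (a minimal illustration, not part of the pipeline):

```r
# `%ni%` is the negation of `%in%`: TRUE where an element is absent
`%ni%` <- Negate(`%in%`)
c("hiv", "tobacco") %ni% c("tobacco", "mortality")
#> [1]  TRUE FALSE

# geometric_mean() drops non-positive values, then takes exp(mean(log(x)))
geometric_mean <- function(x) {
  exp(mean(log(x[x > 0]), na.rm = TRUE))
}
geometric_mean(c(1, 10, 100))
#> [1] 10
geometric_mean(c(0, 1, 10, 100)) # the zero is silently dropped
#> [1] 10
```

Dropping non-positive values keeps the log defined, but it does mean zeros quietly vanish from the summary.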

Load raw data from package

raw <- tidytuesdayR::tt_load('2025-02-11')

cdc_datasets <- raw$cdc_datasets
fpi_codes <- raw$fpi_codes
omb_codes <- raw$omb_codes

Exploratory Data Analysis

The my_skim() function defined above is a customized skimr::skim(). On top of skim’s default missing-value count and complete rate, its numeric summary reports the count, minimum, 25th percentile, median, 75th percentile, maximum, mean, geometric mean, and standard deviation, and draws a little inline ASCII histogram. Neat! (These tables are almost entirely character columns, so the default skim() output appears below.)

CDC Datasets

The CDC datasets table is primarily character columns (URLs, tags, contact info), so we’ll focus on completeness patterns rather than numeric summaries. Sparse, less analytically useful columns such as footnotes, license, suggested_citation, and glossary_methodology are dropped before skimming.

cdc_datasets %>%
  select(
    -footnotes,
    -license,
    -suggested_citation,
    -glossary_methodology,
    -analytical_methods_reference,
    -access_level_comment,
    -collection,
    -language
  ) %>%
  skim()
Data summary
Name Piped data
Number of rows 1257
Number of columns 19
_______________________
Column type frequency:
character 19
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
dataset_url 0 1.00 64 257 0 1257 0
contact_name 13 0.99 3 144 0 134 0
contact_email 7 0.99 4 54 0 91 0
bureau_code 8 0.99 6 7 0 6 0
program_code 8 0.99 6 7 0 16 0
category 0 1.00 4 100 0 53 0
tags 0 1.00 4 1683 0 798 0
publisher 597 0.53 3 178 0 29 0
public_access_level 647 0.49 6 17 0 4 0
source_link 673 0.46 19 140 0 143 0
issued 900 0.28 4 13 0 122 0
geographic_coverage 920 0.27 2 118 0 25 0
temporal_applicability 948 0.25 4 143 0 187 0
update_frequency 919 0.27 5 82 0 31 0
described_by 1085 0.14 39 150 0 126 0
homepage 767 0.39 23 123 0 122 0
geographic_unit_of_analysis 1214 0.03 5 56 0 21 0
geospatial_resolution 1122 0.11 5 37 0 28 0
references 1063 0.15 3 651 0 75 0

The key columns for our analysis are category, tags, bureau_code, program_code, and public_access_level. The tags column contains comma-separated keywords that we can tokenize for text analysis.
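The tokenization we’ll apply later boils down to splitting on commas, then normalizing case and whitespace. A toy sketch (assuming the tidyverse is loaded; toy data, not the real tags column):

```r
library(dplyr)
library(tidyr)
library(stringr)

toy <- tibble(tags = c("NNDSS, WONDER ,covid-19", "Mortality"))

result <- toy %>%
  separate_rows(tags, sep = ",") %>%          # one row per comma-separated tag
  mutate(tags = str_trim(str_to_lower(tags))) # normalize case and whitespace

# four rows: nndss, wonder, covid-19, mortality
result
```

Note that separate_rows() keeps the surrounding whitespace from the original string, which is why the str_trim() step matters.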

FPI Codes

fpi_codes %>%
  skim()
Data summary
Name Piped data
Number of rows 1554
Number of columns 6
_______________________
Column type frequency:
character 6
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
agency_name 0 1.00 11 45 0 25 0
program_name 0 1.00 4 190 0 1496 0
additional_information_optional 1464 0.06 18 50 0 17 0
agency_code 0 1.00 3 3 0 25 0
program_code 0 1.00 7 7 0 1554 0
program_code_pod_format 0 1.00 7 7 0 1554 0

OMB Codes

omb_codes %>%
  skim()
Data summary
Name Piped data
Number of rows 368
Number of columns 6
_______________________
Column type frequency:
character 3
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
agency_name 0 1 12 81 0 134 0
bureau_name 0 1 4 84 0 359 0
treasury_code 0 1 1 2 0 74 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
agency_code 0 1.00 161.40 211.68 1 9.75 22.0 352.75 920 ▇▂▂▁▁
bureau_code 0 1.00 20.61 25.26 0 0.00 11.0 30.50 97 ▇▂▁▁▁
cgac_code 28 0.92 131.96 193.05 0 13.75 49.5 91.00 920 ▇▁▂▁▁

Joining Datasets: Mapping Programs to Bureaus

Before diving into the analysis, we need to connect the CDC dataset metadata to the organizational structure provided by the FPI and OMB code tables. The bureau_code and program_code columns in cdc_datasets serve as keys to look up human-readable agency and program names.
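The bureau_code keys are strings like "009:20" (agency 009, bureau 20), so the first step is splitting on the colon. A toy sketch of that parse (assuming the tidyverse is loaded; toy values, not the real table):

```r
library(dplyr)
library(tidyr)

toy <- tibble(bureau_code = c("009:20", "009:38", NA))

parsed <- toy %>%
  separate(
    bureau_code,
    into = c("agency_code_str", "bureau_code_str"),
    sep = ":",
    remove = FALSE,
    fill = "right" # rows without a colon (or NA) get NA in the second column
  ) %>%
  mutate(
    agency_code_num = as.numeric(agency_code_str), # "009" -> 9
    bureau_code_num = as.numeric(bureau_code_str)  # "20"  -> 20
  )
```

Converting to numeric strips the zero-padding, which is what lets the join against omb_codes (whose codes are stored as numbers) line up.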

# Parse bureau_code into agency and bureau components
# Format is "agency_code:bureau_code" e.g., "009:20"
cdc_enriched <- cdc_datasets %>%
  separate(
    bureau_code,
    into = c("agency_code_str", "bureau_code_str"),
    sep = ":",
    remove = FALSE,
    fill = "right"
  ) %>%
  mutate(
    agency_code_num = as.numeric(agency_code_str),
    bureau_code_num = as.numeric(bureau_code_str)
  ) %>%
  left_join(
    omb_codes,
    by = c("agency_code_num" = "agency_code", "bureau_code_num" = "bureau_code")
  ) %>%
  left_join(
    fpi_codes %>% select(program_name, program_code_pod_format),
    by = c("program_code" = "program_code_pod_format")
  )

cdc_enriched %>%
  count(bureau_name, sort = TRUE) %>%
  head(10)
# A tibble: 5 × 2
  bureau_name                                      n
  <chr>                                        <int>
1 Centers for Disease Control and Prevention     953
2 Department of Health and Human Services        285
3 <NA>                                            12
4 Health Resources and Services Administration     6
5 Centers for Medicare and Medicaid Services       1

Which Bureaus and Programs Hold the Most Archived Data?

# Top bureaus
top_bureaus <- cdc_enriched %>%
  filter(!is.na(bureau_name)) %>%
  count(bureau_name, sort = TRUE) %>%
  head(10) %>%
  mutate(bureau_name = fct_reorder(bureau_name, n))

# Top programs
top_programs <- cdc_enriched %>%
  filter(!is.na(program_name)) %>%
  count(program_name, sort = TRUE) %>%
  head(10) %>%
  mutate(program_name = fct_reorder(program_name, n))
top_bureaus %>%
  gt() %>%
  tab_header(
    title = "Top 10 Bureaus by Archived Dataset Count"
  ) %>%
  cols_label(
    bureau_name = "Bureau",
    n = "Datasets"
  )
Top 10 Bureaus by Archived Dataset Count
Bureau Datasets
Centers for Disease Control and Prevention 953
Department of Health and Human Services 285
Health Resources and Services Administration 6
Centers for Medicare and Medicaid Services 1

What Are These Datasets About? Keyword Analysis

The tags column contains comma-separated keywords describing each dataset. By tokenizing these tags, we can see which public health topics are most represented in the archived data.

# Tokenize the tags column — each tag is comma-separated
keyword_counts <- cdc_datasets %>%
  select(tags) %>%
  filter(!is.na(tags)) %>%
  separate_rows(tags, sep = ",") %>%
  mutate(tags = str_trim(str_to_lower(tags))) %>%
  filter(tags != "") %>%
  count(tags, sort = TRUE)

keyword_counts %>%
  head(20) %>%
  gt() %>%
  tab_header(
    title = "Top 20 Keywords Across Archived CDC Datasets"
  ) %>%
  cols_label(
    tags = "Keyword",
    n = "Frequency"
  )
Top 20 Keywords Across Archived CDC Datasets
Keyword Frequency
nndss 292
wonder 292
nedss 291
netss 291
covid-19 166
this dataset does not have any tags 154
coronavirus 129
nchs 127
united states 120
mortality 116
deaths 104
nvss 86
2019 81
osh 79
provisional 73
age 71
brfss 70
mmwr 65
prevalence 65
tobacco 64

Categorizing Keywords by Public Health Domain

To understand the thematic landscape, we can group keywords into broader public health domains and see how the archived data breaks down.

# Flag keywords related to the purge's stated focus
hiv_lgbtq_keywords <- c(
  "hiv",
  "aids",
  "lgbtq",
  "lesbian",
  "gay",
  "bisexual",
  "transgender",
  "sexual orientation",
  "gender identity",
  "sexual health",
  "sti",
  "sexually transmitted",
  "prep",
  "antiretroviral",
  "hiv/aids",
  "hiv prevention",
  "men who have sex with men",
  "msm"
)

keyword_flagged <- keyword_counts %>%
  mutate(
    hiv_lgbtq_related = str_detect(
      tags,
      str_c(hiv_lgbtq_keywords, collapse = "|")
    )
  )
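One caveat: plain substring matching over-matches short terms, e.g. "sti" inside "statistics" or "prep" inside "preparedness". A more conservative variant (a sketch for comparison, not the flagging used for the counts below) wraps the pattern in word boundaries:

```r
library(stringr)

# A few short terms from the keyword list, illustrating the over-match risk
terms <- c("hiv", "sti", "prep")
pattern <- str_c("\\b(", str_c(terms, collapse = "|"), ")\\b")

str_detect(
  c("hiv testing", "statistics", "prep uptake", "preparedness"),
  pattern
)
#> [1]  TRUE FALSE  TRUE FALSE
```

Without the \\b anchors, "statistics" and "preparedness" would both be flagged.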

hiv_lgbtq_summary <- keyword_flagged %>%
  group_by(hiv_lgbtq_related) %>%
  summarize(
    unique_keywords = n(),
    total_occurrences = sum(n),
    .groups = "drop"
  )

hiv_lgbtq_summary %>%
  gt() %>%
  tab_header(
    title = "HIV/LGBTQ+ Related Keywords vs. Other Keywords"
  ) %>%
  cols_label(
    hiv_lgbtq_related = "HIV/LGBTQ+ Related",
    unique_keywords = "Unique Keywords",
    total_occurrences = "Total Occurrences"
  )
HIV/LGBTQ+ Related Keywords vs. Other Keywords
HIV/LGBTQ+ Related Unique Keywords Total Occurrences
FALSE 1325 9484
TRUE 31 63
Important

The removal of these datasets doesn’t just affect researchers studying HIV or LGBTQ+ health. Many of these datasets are cross-cutting — surveillance data, behavioral surveys, and demographic health indicators that inform a wide range of public health decisions.

Dataset Access Levels

Understanding which datasets were public vs. restricted helps quantify the transparency impact.

cdc_datasets %>%
  count(public_access_level, sort = TRUE) %>%
  gt() %>%
  tab_header(
    title = "Distribution of Public Access Levels"
  ) %>%
  cols_label(
    public_access_level = "Access Level",
    n = "Count"
  )
Distribution of Public Access Levels
Access Level Count
NA 647
public 532
public domain 70
non-public 7
restricted public 1

Category Landscape

category_counts <- cdc_datasets %>%
  filter(!is.na(category)) %>%
  count(category, sort = TRUE) %>%
  mutate(category = fct_reorder(category, n))

Visualization

Top Keywords in Archived CDC Datasets

# Pull HIV/LGBTQ+ keywords that actually appear in the data
hiv_lgbtq_hits <- keyword_flagged %>%
  filter(hiv_lgbtq_related) %>%
  arrange(desc(n))

# Combine: top 20 overall + any HIV/LGBTQ+ keywords not already in the top 20
top_overall <- keyword_counts %>% head(20)

hiv_extras <- hiv_lgbtq_hits %>%
  filter(tags %ni% top_overall$tags)

combined_kw <- bind_rows(
  top_overall %>% mutate(hiv_lgbtq = tags %in% hiv_lgbtq_hits$tags),
  hiv_extras %>%
    head(10) %>%
    mutate(hiv_lgbtq = TRUE) %>%
    select(-hiv_lgbtq_related)
) %>%
  distinct(tags, .keep_all = TRUE) %>%
  mutate(tags = fct_reorder(tags, n))

p_keywords <- ggplot(combined_kw, aes(x = n, y = tags, fill = hiv_lgbtq)) +
  geom_col() +
  scale_fill_manual(
    values = c("TRUE" = "#D32F2F", "FALSE" = "#1565C0"),
    labels = c("TRUE" = "HIV/LGBTQ+ Related", "FALSE" = "Other"),
    name = NULL
  ) +
  labs(
    title = "Most Frequent Keywords in Archived CDC Datasets",
    subtitle = "Top 20 overall keywords plus HIV/LGBTQ+-related keywords (red) wherever they rank",
    x = "Number of Datasets",
    y = NULL,
    caption = "Source: TidyTuesday 2025-02-11 | CDC Archived Datasets"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(color = "gray40", size = 11),
    legend.position = "top",
    panel.grid.major.y = element_blank()
  )

p_keywords

Datasets by Category and Bureau

p_bureau <- ggplot(top_bureaus, aes(x = n, y = bureau_name)) +
  geom_col(fill = "#2E7D32") +
  labs(
    title = "Top 10 Bureaus",
    x = "Archived Datasets",
    y = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major.y = element_blank()
  )

p_category <- ggplot(
  category_counts %>% head(10), # head(), not tail(): counts are sorted descending
  aes(x = n, y = category)
) +
  geom_col(fill = "#6A1B9A") +
  labs(
    title = "Top 10 Categories",
    x = "Archived Datasets",
    y = NULL
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major.y = element_blank()
  )

p_bureau +
  p_category +
  plot_annotation(
    title = "Organizational and Thematic Distribution of Archived CDC Data",
    subtitle = "Which corners of the CDC were most affected?",
    caption = "Source: TidyTuesday 2025-02-11 | CDC Archived Datasets",
    theme = theme(
      plot.title = element_text(face = "bold", size = 16),
      plot.subtitle = element_text(color = "gray40", size = 12)
    )
  )

Update Frequency of Archived Datasets

Understanding how frequently these datasets were being updated before archival tells us something about how “alive” the data was.

# Decode ISO 8601 repeat-duration codes (e.g., "R/P1M") and normalize free-text
# entries. Note: the "Biweekly" check must come before "Weekly", since "weekly"
# is a substring of "biweekly" and case_when() takes the first match.
update_freq <- cdc_datasets %>%
  filter(!is.na(update_frequency)) %>%
  mutate(
    freq = str_to_lower(update_frequency),
    update_label = case_when(
      str_detect(freq, "r/p1d|daily") ~ "Daily",
      str_detect(freq, "r/p2w|biweekly|two weeks") ~ "Biweekly",
      str_detect(freq, "r/p1w|weekly|weekdays") ~ "Weekly",
      str_detect(freq, "r/p1m|monthly") ~ "Monthly",
      str_detect(freq, "r/p3m|quarterly") ~ "Quarterly",
      str_detect(freq, "r/p6m|semiannual") ~ "Semiannually",
      str_detect(freq, "r/p1y|annual") ~ "Annually",
      str_detect(freq, "r/p2y|r/p4y|r/p5y") ~ "Multi-year",
      str_detect(freq, "irregular|continuous") ~ "Irregular",
      str_detect(freq, "no longer|not updated|archived|will not be") ~ "No longer updated",
      TRUE ~ "Other"
    )
  ) %>%
  count(update_label, sort = TRUE) %>%
  mutate(update_label = fct_reorder(update_label, n))

ggplot(update_freq, aes(x = n, y = update_label)) +
  geom_col(fill = "#E65100") +
  labs(
    title = "Update Frequency of Archived CDC Datasets",
    subtitle = "Most archived datasets were on active update schedules before removal",
    x = "Number of Datasets",
    y = NULL,
    caption = "Source: TidyTuesday 2025-02-11 | CDC Archived Datasets"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 16),
    plot.subtitle = element_text(color = "gray40"),
    panel.grid.major.y = element_blank()
  )

Final thoughts and takeaways

The 1,257 archived CDC datasets represent a broad cross-section of the agency’s public health surveillance and reporting infrastructure. The data is not narrowly scoped to HIV or LGBTQ+ health — it spans chronic disease surveillance, environmental health, injury prevention, and population-level health indicators. The keyword analysis reveals that while HIV- and LGBTQ+-related terms are present, the majority of affected datasets cover general public health topics, suggesting that the purge cast a wider net than its stated focus.

The organizational breakdown shows that the bulk of archived data came from a small number of CDC bureaus, concentrating the knowledge gap in specific programmatic areas. Many of these datasets were being updated on annual or more frequent cycles, meaning they were actively maintained resources — not stale archives gathering dust. Their removal creates gaps in longitudinal data that may be difficult or impossible to reconstruct.

Note

The archival effort that produced this dataset was a proactive response by civil society and data preservation organizations. The fact that this metadata exists at all is thanks to groups who anticipated the purge and acted to document what was publicly available before it disappeared.

The broader implication is structural: when public health data disappears from federal servers, it doesn’t just affect researchers. Clinicians, state health departments, epidemiologists tracking outbreaks, and community health organizations all lose access to the evidence base they rely on for decision-making. Data removal is, in practice, a form of policy change that bypasses the legislative process.