Tidy Tuesday: Astronomy Picture of the Day

tidytuesday
R
astronomy
text-analysis
Mining NASA’s APOD archive (2007-2025) — what astronomical subjects capture the most attention, and how has the archive evolved over nearly two decades?
Author

Sean Thimons

Published

January 20, 2026

Preface

From TidyTuesday repository.

The Astronomy Picture of the Day (APOD) archive contains daily astronomy-related images with a scientific explanation, spanning 2007–2025. The data originates from NASA’s popular APOD website and has been curated into the astropic R package.

  • What types of objects are most common in the archive?
  • Are any images posted more than once?

Loading necessary packages

My handy booster pack that allows me to install (if needed) and load my usual and favorite packages, as well as some helpful functions.

Code
# Packages ----------------------------------------------------------------

{
  if (!requireNamespace("pak", quietly = TRUE)) {
    install.packages(
      "pak",
      repos = sprintf(
        "https://r-lib.github.io/p/pak/stable/%s/%s/%s",
        .Platform$pkgType,
        R.Version()$os,
        R.Version()$arch
      )
    )
  }

  install_booster_pack <- function(package, load = TRUE) {
    for (pkg in package) {
      if (!requireNamespace(pkg, quietly = TRUE)) {
        pak::pkg_install(pkg)
      }
      if (load) {
        library(pkg, character.only = TRUE)
      }
    }
  }

  if (file.exists('packages.txt')) {
    packages <- read.table('packages.txt')
    install_booster_pack(package = packages$Package, load = FALSE)
    rm(packages)
  } else {
    booster_pack <- c(
      ### IO ----
      'fs',
      'here',
      'janitor',
      'rio',
      'tidyverse',

      ### EDA ----
      'skimr',

      ### Text ----
      'tidytext',

      ### Plot ----
      'ggrepel',
      'ggtext',

      ### Misc ----
      'tidytuesdayR'
    )

    install_booster_pack(package = booster_pack, load = TRUE)
    rm(install_booster_pack, booster_pack)
  }

  # Custom Functions ----

  `%ni%` <- Negate(`%in%`)

  geometric_mean <- function(x) {
    exp(mean(log(x[x > 0]), na.rm = TRUE))
  }

  my_skim <- skim_with(
    numeric = sfl(
      n = length,
      min = ~ min(.x, na.rm = T),
      p25 = ~ stats::quantile(., probs = .25, na.rm = TRUE, names = FALSE),
      med = ~ median(.x, na.rm = T),
      p75 = ~ stats::quantile(., probs = .75, na.rm = TRUE, names = FALSE),
      max = ~ max(.x, na.rm = T),
      mean = ~ mean(.x, na.rm = T),
      geo_mean = ~ geometric_mean(.x),
      sd = ~ stats::sd(., na.rm = TRUE),
      hist = ~ inline_hist(., 5)
    ),
    append = FALSE
  )
}

Load raw data from package

raw <- tidytuesdayR::tt_load('2026-01-20')

apod <- raw$apod

Exploratory Data Analysis

The my_skim() function is a modified version of the skimr::skim() function that returns the number of missing data points (cells as NA) as well as the inverse (e.g.: number of rows that are not NA), the count, minimum, 25%, median, 75%, max, mean, geometric mean, and standard deviation. It also generates a little ASCII histogram. Neat!

APOD Archive

I’ll drop the URL columns and explanation text for the initial skim — they’re useful later but not for profiling.

apod %>%
  select(-url, -hdurl, -explanation) %>%
  my_skim(.)
Data summary
Name Piped data
Number of rows 6888
Number of columns 4
_______________________
Column type frequency:
character 3
Date 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
copyright 2699 0.61 3 160 0 2251 0
media_type 0 1.00 5 5 0 3 0
title 0 1.00 4 66 0 5151 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2007-01-01 2025-12-23 2016-06-28 6888
apod %>%
  count(media_type, sort = TRUE)
# A tibble: 3 × 2
  media_type     n
  <chr>      <int>
1 image       6499
2 video        379
3 other         10

Date Coverage

apod %>%
  mutate(year = year(date)) %>%
  count(year) %>%
  print(n = 30)
# A tibble: 19 × 2
    year     n
   <dbl> <int>
 1  2007   363
 2  2008   365
 3  2009   361
 4  2010   357
 5  2011   361
 6  2012   364
 7  2013   365
 8  2014   363
 9  2015   365
10  2016   366
11  2017   365
12  2018   364
13  2019   365
14  2020   365
15  2021   365
16  2022   365
17  2023   365
18  2024   366
19  2025   338

APOD Archive Analysis

Duplicate Images

The repo asks: are any images posted more than once? Let’s check for repeated URLs and titles.

# Check for duplicate URLs
apod %>%
  filter(!is.na(url)) %>%
  count(url, sort = TRUE) %>%
  filter(n > 1) %>%
  head(10)
# A tibble: 10 × 2
   url                                                                         n
   <chr>                                                                   <int>
 1 https://www.youtube.com/embed/OfM7VlonD5c?rel=0                             4
 2 https://www.youtube.com/embed/sNUNB6CMnE8?rel=0                             4
 3 //player.vimeo.com/video/108650530?title=0&byline=0&portrait=0&badge=0…     3
 4 https://apod.nasa.gov/apod/image/1803/AstroSoM/hudf.html                    3
 5 https://player.vimeo.com/video/32095756?title=0&byline=0&portrait=0         3
 6 https://www.youtube.com/embed/B1R3dTdcpSU?rel=0                             3
 7 https://www.youtube.com/embed/FG0fTKAqZ5g?rel=0                             3
 8 https://www.youtube.com/embed/I_88S8DWbcU?rel=0                             3
 9 https://www.youtube.com/embed/PBL1RBj-P1g?rel=0                             3
10 https://www.youtube.com/embed/afHfMMC-MJE?rel=0                             3
# Check for duplicate titles
apod %>%
  count(title, sort = TRUE) %>%
  filter(n > 1) %>%
  head(15)
# A tibble: 15 × 2
   title                                           n
   <chr>                                       <int>
 1 M31: The Andromeda Galaxy                      10
 2 The Horsehead Nebula                           10
 3 M13: The Great Globular Cluster in Hercules     9
 4 NGC 4565: Galaxy on Edge                        9
 5 NGC 602 and Beyond                              8
 6 The Medusa Nebula                               8
 7 The Seagull Nebula                              8
 8 A Beautiful Trifid                              7
 9 Galaxies in the River                           7
10 Halo of the Cat's Eye                           7
11 Lynds Dark Nebula 1251                          7
12 M1: The Crab Nebula from Hubble                 7
13 M27: The Dumbbell Nebula                        7
14 M45: The Pleiades Star Cluster                  7
15 Millions of Stars in Omega Centauri             7

Most Common Subjects via Title Keywords

To answer “what types of objects are most common,” we can tokenize the titles and look for astronomical keywords.

# Define astronomical object categories
astro_keywords <- tribble(
  ~keyword, ~category,
  "nebula", "Nebulae",
  "galaxy", "Galaxies",
  "moon", "Moon",
  "sun", "Sun/Solar",
  "solar", "Sun/Solar",
  "eclipse", "Eclipses",
  "mars", "Planets",
  "jupiter", "Planets",
  "saturn", "Planets",
  "venus", "Planets",
  "mercury", "Planets",
  "comet", "Comets",
  "aurora", "Auroras",
  "milky way", "Milky Way",
  "meteor", "Meteors",
  "supernova", "Supernovae",
  "cluster", "Star Clusters",
  "star", "Stars"
)

# Tokenize titles and match keywords
title_words <- apod %>%
  select(date, title) %>%
  mutate(title_lower = str_to_lower(title))

# Count keyword matches
keyword_counts <- astro_keywords %>%
  rowwise() %>%
  mutate(
    count = sum(str_detect(title_words$title_lower, keyword))
  ) %>%
  ungroup() %>%
  group_by(category) %>%
  summarize(count = sum(count), .groups = "drop") %>%
  arrange(desc(count))

keyword_counts
# A tibble: 13 × 2
   category      count
   <chr>         <int>
 1 Planets         772
 2 Nebulae         722
 3 Stars           548
 4 Moon            538
 5 Sun/Solar       487
 6 Galaxies        472
 7 Comets          311
 8 Eclipses        230
 9 Star Clusters   206
10 Milky Way       180
11 Auroras         157
12 Meteors         120
13 Supernovae       88

Visualizing the APOD Universe

The hero plot shows the most common astronomical subjects across the archive, styled with a deep-space aesthetic.

# Space-themed palette
space_cols <- c(
  "Nebulae" = "#FF6B8A",
  "Galaxies" = "#A855F7",
  "Moon" = "#E2E8F0",
  "Sun/Solar" = "#FBBF24",
  "Eclipses" = "#F97316",
  "Planets" = "#22D3EE",
  "Comets" = "#34D399",
  "Auroras" = "#4ADE80",
  "Milky Way" = "#C4B5FD",
  "Meteors" = "#FB923C",
  "Stars" = "#FDE68A",
  "Supernovae" = "#F43F5E",
  "Star Clusters" = "#818CF8"
)

# Prepare data — top categories by total count
top_categories <- apod_tagged %>%
  filter(category != "Other") %>%
  count(category, sort = TRUE) %>%
  head(10)

ggplot(top_categories, aes(x = reorder(category, n), y = n, fill = category)) +
  geom_col(width = 0.7, show.legend = FALSE) +
  geom_text(
    aes(label = n),
    hjust = -0.2,
    color = "white",
    fontface = "bold",
    size = 4
  ) +
  scale_fill_manual(values = space_cols) +
  scale_y_continuous(expand = expansion(mult = c(0, 0.15))) +
  coord_flip() +
  labs(
    title = "The Most Photographed Objects in Space",
    subtitle = "Astronomical subjects in NASA's APOD archive (2007\u20132025), classified by title keywords",
    x = NULL,
    y = "Number of APOD Entries",
    caption = "Source: TidyTuesday 2026-01-20 | NASA APOD via astropic R package"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.background = element_rect(fill = "#0F172A", color = NA),
    panel.background = element_rect(fill = "#0F172A", color = NA),
    text = element_text(color = "white"),
    plot.title = element_text(face = "bold", size = 18, color = "#E2E8F0"),
    plot.subtitle = element_text(size = 12, color = "#94A3B8"),
    plot.caption = element_text(size = 9, color = "#64748B"),
    axis.text = element_text(color = "#CBD5E1"),
    axis.title = element_text(color = "#94A3B8"),
    panel.grid.major.y = element_blank(),
    panel.grid.major.x = element_line(color = "#1E293B"),
    panel.grid.minor = element_blank()
  )

Final thoughts and takeaways

NASA’s Astronomy Picture of the Day is one of the longest-running and most beloved science communication projects on the internet. This archive spanning nearly two decades reveals what captures our collective astronomical imagination — and nebulae consistently come out on top, which makes sense given their visual drama and the stunning images that space telescopes produce.

Galaxies and our own Moon round out the top three, reflecting both the accessibility of lunar photography for amateur astronomers and the deep fascination with the large-scale structure of the universe. The presence of auroras and eclipses in the rankings highlights how APOD balances deep-sky objects with phenomena visible from Earth, keeping the archive grounded (literally) for a general audience.

The duplicate analysis reveals that some iconic images do get revisited — either because the same event is captured from different perspectives or because a particularly striking image merits a re-share. This isn’t noise; it’s editorial curation at work.

Tip

The title-keyword approach to classification is deliberately simple. A more sophisticated analysis could use NLP on the explanation text to capture objects mentioned in the description but not the title. The explanations are rich with scientific context that the titles alone don’t capture.