Tidy Tuesday: Edible Plants Database

tidytuesday
R
agriculture
botany
Exploring the growing conditions of 146 edible plant species from the GROW Observatory — which plants thrive where, and what do sunlight and temperature preferences reveal about cultivation strategies?
Author

Sean Thimons

Published

February 3, 2026

Preface

Data from the TidyTuesday repository.

The Edible Plant Database (EPD) contains information on 146 edible plant species, derived from the GROW Observatory, a European Citizen Science initiative focused on food cultivation, soil monitoring, and land observation. The dataset provides growing conditions and harvest/germination timelines to address questions like “What can I plant now?” and “What will yield crops in the future?” The TidyTuesday repository suggests two questions to explore:

  • Do plants requiring more sunlight also require higher temperatures?
  • Which cultivation classes demand the most water?

Loading necessary packages

My handy booster pack installs (if needed) and loads my usual favorite packages, and defines a few helper functions.

# Packages ----------------------------------------------------------------

{
  # Install pak if it's not already installed
  if (!requireNamespace("pak", quietly = TRUE)) {
    install.packages(
      "pak",
      repos = sprintf(
        "https://r-lib.github.io/p/pak/stable/%s/%s/%s",
        .Platform$pkgType,
        R.Version()$os,
        R.Version()$arch
      )
    )
  }

  # CRAN Packages ----
  install_booster_pack <- function(package, load = TRUE) {
    for (pkg in package) {
      if (!requireNamespace(pkg, quietly = TRUE)) {
        pak::pkg_install(pkg)
      }
      if (load) {
        library(pkg, character.only = TRUE)
      }
    }
  }

  if (file.exists('packages.txt')) {
    packages <- read.table('packages.txt')

    install_booster_pack(package = packages$Package, load = FALSE)

    rm(packages)
  } else {
    ## Packages ----

    booster_pack <- c(
      ### IO ----
      'fs',
      'here',
      'janitor',
      'rio',
      'tidyverse',

      ### EDA ----
      'skimr',

      ### Plot ----
      'ggtext',
      'ggrepel',
      'patchwork',

      ### Misc ----
      'tidytuesdayR'
    )

    # ! Change load flag to load packages
    install_booster_pack(package = booster_pack, load = TRUE)
    rm(install_booster_pack, booster_pack)
  }

  # Custom Functions ----

  `%ni%` <- Negate(`%in%`)

  geometric_mean <- function(x) {
    exp(mean(log(x[x > 0]), na.rm = TRUE))
  }

  my_skim <- skim_with(
    numeric = sfl(
      n = length,
      min = ~ min(.x, na.rm = TRUE),
      p25 = ~ stats::quantile(., probs = .25, na.rm = TRUE, names = FALSE),
      med = ~ median(.x, na.rm = TRUE),
      p75 = ~ stats::quantile(., probs = .75, na.rm = TRUE, names = FALSE),
      max = ~ max(.x, na.rm = TRUE),
      mean = ~ mean(.x, na.rm = TRUE),
      geo_mean = ~ geometric_mean(.x),
      sd = ~ stats::sd(., na.rm = TRUE),
      hist = ~ inline_hist(., 5)
    ),
    append = FALSE
  )
}

Load raw data from package

raw <- tidytuesdayR::tt_load('2026-02-03')

plants <- raw$edible_plants

Exploratory Data Analysis

The my_skim() function is a modified version of skimr::skim(). For each numeric column it returns the number of missing values (cells that are NA), the complete rate (the share of values that are not NA), the count, minimum, 25th percentile, median, 75th percentile, maximum, mean, geometric mean, and standard deviation. Note that the count (n) is the full column length, so it includes missing values. It also generates a little ASCII histogram. Neat!

Edible Plants

I’ll drop the free-text columns (description, requirements, nutritional_info, sensitivities) since they’re not useful for quantitative analysis, and focus on the structured growing condition fields.

plants %>%
  select(-description, -requirements, -nutritional_info, -sensitivities) %>%
  my_skim(.)
Data summary
Name Piped data
Number of rows 140
Number of columns 16
_______________________
Column type frequency:
character 13
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
taxonomic_name 0 1.00 5 40 0 123 0
common_name 0 1.00 3 32 0 140 0
cultivation 0 1.00 5 14 0 12 0
sunlight 0 1.00 8 34 0 6 0
water 0 1.00 3 9 0 8 0
nutrients 0 1.00 3 40 0 7 0
soil 46 0.67 4 64 0 33 0
season 73 0.48 5 28 0 11 0
temperature_class 0 1.00 5 11 0 6 0
temperature_germination 36 0.74 2 15 0 20 0
temperature_growing 82 0.41 2 5 0 31 0
days_germination 38 0.73 1 15 0 16 0
days_harvest 27 0.81 1 15 0 59 0

Variable type: numeric

skim_variable n_missing complete_rate n min p25 med p75 max mean geo_mean sd hist
preferred_ph_lower 0 1.00 140 4.5 5.5 6 6.00 7.0 5.83 5.82 0.42 ▁▅▇▁▁
preferred_ph_upper 0 1.00 140 6.0 7.0 7 7.50 8.5 7.11 7.10 0.48 ▂▇▅▁▁
energy 128 0.09 140 0.0 26.5 31 38.75 88.0 35.67 38.60 26.57 ▂▇▁▁▂
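
One detail in the energy row deserves a second look: geo_mean (38.60) is larger than mean (35.67), which cannot happen when both statistics are computed on the same values. The likely cause is that geometric_mean() keeps only x > 0, so the two zero-energy entries are dropped from one statistic but not the other. A small check, using the twelve non-missing energy values that appear in the Energy Content table further down:

```r
# The twelve non-missing energy values listed in the Energy Content section
energy <- c(88, 80, 50, 35, 34, 31, 31, 27, 27, 25, 0, 0)

# Same helper as in the setup chunk: non-positive values are dropped
geometric_mean <- function(x) {
  exp(mean(log(x[x > 0]), na.rm = TRUE))
}

arithmetic <- mean(energy)             # all 12 values, zeros included
geometric <- geometric_mean(energy)    # only the 10 positive values

# Like-for-like comparison: arithmetic mean of the positive values only
arithmetic_pos <- mean(energy[energy > 0])

round(
  c(arithmetic = arithmetic, geometric = geometric, arithmetic_pos = arithmetic_pos),
  2
)
```

On the same ten positive values the geometric mean sits below the arithmetic mean, as expected. When a column contains zeros, the geo_mean and mean columns of my_skim() describe different subsets and should not be compared directly.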

Let’s also get a quick look at the categorical columns to understand the levels we’re working with:

plants %>%
  count(sunlight, sort = TRUE)
# A tibble: 6 × 2
  sunlight                               n
  <chr>                              <int>
1 Full sun                              87
2 Full sun/partial shade                44
3 Partial shade                          5
4 Full sun/partial shade/full shade      2
5 full sun/partial shade/ full shade     1
6 partial shade                          1
plants %>%
  count(water, sort = TRUE)
# A tibble: 8 × 2
  water         n
  <chr>     <int>
1 Medium       93
2 High         24
3 Low          18
4 Very High     1
5 Very Low      1
6 Very low      1
7 high          1
8 very high     1
plants %>%
  count(cultivation, sort = TRUE)
# A tibble: 12 × 2
   cultivation        n
   <chr>          <int>
 1 Miscellaneous     65
 2 Brassica          21
 3 Legume            10
 4 Allium             9
 5 Cucurbit           8
 6 Umbelliferae       8
 7 Solanaceae         6
 8 Lamiaceae          4
 9 Chenopodiaceae     3
10 Salad              3
11 Solanum            2
12 Brassicas          1
plants %>%
  count(temperature_class, sort = TRUE)
# A tibble: 6 × 2
  temperature_class     n
  <chr>             <int>
1 Hardy                67
2 Tender               37
3 Very hardy           21
4 Half hardy           12
5 Very tender           2
6 Very hard             1
plants %>%
  count(season, sort = TRUE)
# A tibble: 12 × 2
   season                           n
   <chr>                        <int>
 1 <NA>                            73
 2 Perennial                       37
 3 Annual                          12
 4 biennial                         8
 5 biennial, grown as annual        2
 6 perennial                        2
 7 Annual/perannial                 1
 8 Biennial, grown as an annual     1
 9 Perrenial                        1
10 Perrenial evergreen              1
11 Semi-evergreen perrenial         1
12 Shrub                            1
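
Several of these levels differ only by case, spacing, or spelling ("High" vs. "high", "full sun/partial shade/ full shade", "Perrenial"), so the counts above split what are probably the same categories. Below is a minimal sketch of one way to collapse them. The helper names (normalize_level, fix_spelling) and the spelling lookup are my own assumptions and not part of the dataset or of any package; in particular, reading "Very hard" as "Very hardy" is a guess about intent.

```r
# Collapse case and spacing variants of a categorical label
normalize_level <- function(x) {
  x <- trimws(x)
  x <- gsub("\\s*/\\s*", "/", x)             # "a/ b" becomes "a/b"
  x <- tolower(x)
  sub("^([a-z])", "\\U\\1", x, perl = TRUE)  # sentence case; NA stays NA
}

# Spelling fixes are judgment calls, so they live in an explicit lookup
# that can be reviewed (hypothetical mapping, assumed for illustration)
spelling_fixes <- c(
  "Very hard" = "Very hardy",
  "Brassicas" = "Brassica",
  "Perrenial" = "Perennial"
)

fix_spelling <- function(x, fixes = spelling_fixes) {
  hit <- !is.na(x) & x %in% names(fixes)
  x[hit] <- unname(fixes[x[hit]])
  x
}

water <- c("Very High", "very high", "high", "Very Low", "Very low", NA)
normalize_level(water)

sunlight <- c(
  "full sun/partial shade/ full shade",
  "Full sun/partial shade/full shade"
)
unique(normalize_level(sunlight))

fix_spelling(normalize_level(c("Very hard", "Brassicas", "Hardy")))
```

Both helpers take and return character vectors, so they could be applied to the real columns inside mutate(). Applied to the levels listed above, this would reduce water from 8 levels to 5 and sunlight from 6 to 4.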

Growing Condition Analysis

The two questions from the TidyTuesday repo are closely related — both are about how growing requirements cluster together. Let’s tackle them systematically.

Sunlight vs. Temperature Preferences

Do sun-loving plants also prefer warmer conditions? Let’s look at the cross-tabulation of sunlight requirements and temperature class.

plants %>%
  filter(!is.na(sunlight), !is.na(temperature_class)) %>%
  count(sunlight, temperature_class) %>%
  pivot_wider(names_from = temperature_class, values_from = n, values_fill = 0)
# A tibble: 6 × 7
  sunlight      `Half hardy` Hardy Tender `Very hardy` `Very tender` `Very hard`
  <chr>                <int> <int>  <int>        <int>         <int>       <int>
1 Full sun                 4    45     26           10             2           0
2 Full sun/par…            8    19      6           10             0           1
3 Full sun/par…            0     2      0            0             0           0
4 Partial shade            0     1      3            1             0           0
5 full sun/par…            0     0      1            0             0           0
6 partial shade            0     0      1            0             0           0
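
Raw counts are hard to compare because the sunlight groups differ so much in size. A rough way to read the table is to convert each row to proportions and compare the share of tender plants. The sketch below re-enters the two largest rows of the table above by hand; it is a descriptive comparison only, not a significance test.

```r
# Counts copied from the two largest rows of the cross-tabulation above
xtab <- matrix(
  c(
    4, 45, 26, 10, 2, 0,
    8, 19,  6, 10, 0, 1
  ),
  nrow = 2,
  byrow = TRUE,
  dimnames = list(
    sunlight = c("Full sun", "Full sun/partial shade"),
    temperature_class = c(
      "Half hardy", "Hardy", "Tender", "Very hardy", "Very tender", "Very hard"
    )
  )
)

# Share of each temperature class within a sunlight group (rows sum to 1)
row_share <- prop.table(xtab, margin = 1)

# Share of frost-sensitive plants (Tender or Very tender) per sunlight group
tender_share <- rowSums(row_share[, c("Tender", "Very tender")])
round(tender_share, 2)
```

By this measure, 28 of 87 full-sun plants are tender or very tender, against 6 of 44 in the full sun/partial shade group, which points toward sun-loving plants being more frost-sensitive on average.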

Water Demands by Cultivation Class

Which cultivation classes are the thirstiest? Let’s look at the distribution of water requirements across different cultivation types.

plants %>%
  filter(!is.na(water), !is.na(cultivation)) %>%
  count(cultivation, water) %>%
  pivot_wider(names_from = water, values_from = n, values_fill = 0)
# A tibble: 12 × 9
   cultivation    Medium  High   Low  high `Very High` `Very Low` `Very low`
   <chr>           <int> <int> <int> <int>       <int>      <int>      <int>
 1 Allium              9     0     0     0           0          0          0
 2 Brassica           12     8     1     0           0          0          0
 3 Brassicas           1     0     0     0           0          0          0
 4 Chenopodiaceae      3     0     0     0           0          0          0
 5 Cucurbit            5     3     0     0           0          0          0
 6 Lamiaceae           0     0     3     1           0          0          0
 7 Legume              5     1     1     0           1          1          1
 8 Miscellaneous      45     8    12     0           0          0          0
 9 Salad               0     2     1     0           0          0          0
10 Solanaceae          5     0     0     0           0          0          0
11 Solanum             2     0     0     0           0          0          0
12 Umbelliferae        6     2     0     0           0          0          0
# ℹ 1 more variable: `very high` <int>
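
To rank classes by thirst, it helps to treat water as an ordered scale and ask what share of each class is rated High or above. The sketch below uses counts copied from the table above for five classes, with the lowercase "high" entry for Lamiaceae folded into High. The level ordering and the High-or-above threshold are my assumptions.

```r
# Assumed ordering of the water levels, from driest to thirstiest
water_levels <- c("Very low", "Low", "Medium", "High", "Very high")

# Counts copied from the table above (five classes only, for brevity)
counts <- data.frame(
  cultivation = c(
    "Allium", "Brassica", "Brassica", "Brassica", "Cucurbit", "Cucurbit",
    "Lamiaceae", "Lamiaceae", "Salad", "Salad"
  ),
  water = c(
    "Medium", "Medium", "High", "Low", "Medium", "High",
    "Low", "High", "High", "Low"
  ),
  n = c(9, 12, 8, 1, 5, 3, 3, 1, 2, 1)
)

# An ordered factor lets us compare levels with >=
counts$water <- factor(counts$water, levels = water_levels, ordered = TRUE)

# Share of plants rated High or above within each cultivation class
by_class <- split(counts, counts$cultivation)
high_share <- vapply(
  by_class,
  function(d) sum(d$n[d$water >= "High"]) / sum(d$n),
  numeric(1)
)

sort(round(high_share, 2), decreasing = TRUE)
```

Several classes have fewer than ten plants (Salad has three), so these shares are indicative at best.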

pH Preferences by Cultivation Class

The pH range columns give us one of the few continuous measures to work with. Let’s see how preferred soil acidity varies across cultivation types.

plants %>%
  filter(!is.na(preferred_ph_lower), !is.na(preferred_ph_upper)) %>%
  mutate(
    ph_range = preferred_ph_upper - preferred_ph_lower,
    ph_midpoint = (preferred_ph_lower + preferred_ph_upper) / 2
  ) %>%
  group_by(cultivation) %>%
  summarize(
    n = n(),
    mean_ph_mid = mean(ph_midpoint, na.rm = TRUE),
    mean_ph_range = mean(ph_range, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(mean_ph_mid)
# A tibble: 12 × 4
   cultivation        n mean_ph_mid mean_ph_range
   <chr>          <int>       <dbl>         <dbl>
 1 Brassicas          1        6.25          2.5 
 2 Umbelliferae       8        6.34          1.44
 3 Miscellaneous     65        6.39          1.24
 4 Solanaceae         6        6.42          1.5 
 5 Salad              3        6.45          0.9 
 6 Cucurbit           8        6.47          1.31
 7 Solanum            2        6.5           2   
 8 Legume            10        6.54          1.08
 9 Allium             9        6.58          1.17
10 Brassica          21        6.65          1.36
11 Chenopodiaceae     3        6.67          1.33
12 Lamiaceae          4        6.75          1.25

Energy Content

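Let’s also see which plants pack the most energy per 100g. Only 12 of the 140 entries report an energy value (128 are missing), and two of those are recorded as 0, so this is a ranked listing and not a basis for relating energy to growing conditions.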

plants %>%
  filter(!is.na(energy)) %>%
  arrange(desc(energy)) %>%
  select(common_name, cultivation, energy, sunlight, water) %>%
  head(20)
# A tibble: 12 × 5
   common_name      cultivation   energy sunlight                          water
   <chr>            <chr>          <dbl> <chr>                             <chr>
 1 Beans (Broad)    Legume            88 Full sun/partial shade/full shade Very…
 2 Pea              Legume            80 Full sun                          Very…
 3 Kale             Brassica          50 Full sun                          Low  
 4 Brussels Sprouts Brassica          35 Full sun                          Medi…
 5 Broccoli         Brassica          34 Full sun/partial shade            Medi…
 6 Cauliflower      Brassica          31 Full sun                          Medi…
 7 Bell Pepper      Solanaceae        31 Full sun/partial shade            Medi…
 8 Beans (Runner)   Legume            27 Full sun                          Medi…
 9 Beans (French)   Legume            27 Full sun                          Medi…
10 Cabbage (Spring) Legume            25 Full sun                          High 
11 Beetroot         Umbelliferae       0 Full sun/partial shade            Medi…
12 Endive           Miscellaneous      0 Full sun/partial shade            Medi…

Visualizing Growing Conditions

The hero plot pairs the two suggested questions in a single stacked layout: the sunlight × temperature association on top, and water demand by cultivation class below.

# Define a botanical color palette
garden_cols <- c(
  "Full sun" = "#E8A838",
  "Full sun / partial shade" = "#C4A24D",
  "Partial shade" = "#7BA05B",
  "Full shade" = "#2D5F2D",
  "Partial shade / full shade" = "#4A7A4A"
)

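# One color per water level, ordered from driest to thirstiest
water_cols <- c(
  "Very low" = "#EFD9B4",
  "Low" = "#D4A76A",
  "Medium" = "#5B8C5A",
  "High" = "#2E6B8A",
  "Very high" = "#17405A"
)

# Panel 1: Sunlight vs Temperature heatmap
# Collapse case/spacing variants (e.g. "partial shade", "full sun/partial shade/ full shade")
p1_data <- plants %>%
  filter(!is.na(sunlight), !is.na(temperature_class)) %>%
  mutate(
    sunlight = str_to_sentence(str_replace_all(sunlight, "\\s*/\\s*", "/"))
  ) %>%
  count(sunlight, temperature_class)

p1 <- ggplot(p1_data, aes(x = temperature_class, y = sunlight, fill = n)) +
  geom_tile(color = "white", linewidth = 1.5) +
  geom_text(aes(label = n), size = 5, fontface = "bold", color = "white") +
  scale_fill_gradient(low = "#A8D5A2", high = "#1B5E20", name = "Count") +
  labs(
    title = "Do sun-loving plants prefer warmer conditions?",
    x = "Temperature Class",
    y = "Sunlight Requirement"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    panel.grid = element_blank(),
    legend.position = "bottom"
  )

# Panel 2: Water demand by cultivation class
# Normalize case so "high" and "Very High" match the palette names;
# the factor levels also put the legend in driest-to-thirstiest order
p2_data <- plants %>%
  filter(!is.na(water), !is.na(cultivation)) %>%
  mutate(water = factor(str_to_sentence(water), levels = names(water_cols))) %>%
  count(cultivation, water) %>%
  group_by(cultivation) %>%
  mutate(pct = n / sum(n)) %>%
  ungroup()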

p2 <- ggplot(
  p2_data,
  aes(x = reorder(cultivation, -n, sum), y = pct, fill = water)
) +
  geom_col(position = "fill", width = 0.7) +
  scale_fill_manual(values = water_cols, name = "Water Need") +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(
    title = "Which cultivation classes demand the most water?",
    x = "Cultivation Class",
    y = "Proportion of Plants"
  ) +
  theme_minimal(base_size = 13) +
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.text.x = element_text(angle = 45, hjust = 1),
    panel.grid.minor = element_blank(),
    legend.position = "bottom"
  )

# Combine with patchwork
combined <- p1 /
  p2 +
  plot_annotation(
    title = "Growing Conditions of Edible Plants",
    subtitle = paste(nrow(plants), "entries from the GROW Observatory Edible Plant Database"),
    caption = "Source: TidyTuesday 2026-02-03 | University of Dundee Edible Plant Database",
    theme = theme(
      plot.title = element_text(size = 18, face = "bold", color = "#2D5F2D"),
      plot.subtitle = element_text(size = 13, color = "#555555"),
      plot.caption = element_text(size = 9, color = "#888888")
    )
  )

combined

Final thoughts and takeaways

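The Edible Plant Database offers a compact but revealing snapshot of how edible species relate to their growing environments (140 entries in this release, against the 146 species the source describes). The heatmap of sunlight versus temperature preferences shows that 87 of the 140 entries (62%) want full sun and another 44 (31%) take full sun or partial shade; the largest single cell is hardy, full-sun plants (45). That makes intuitive sense, as most food crops have been bred for productive, sun-drenched conditions rather than shade tolerance. There is also a hint of an answer to the first question: 28 of the 87 full-sun plants (32%) are tender or very tender, versus 6 of 44 (14%) in the full sun/partial shade group.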

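The water demand breakdown by cultivation class tells a complementary story. Medium is the default almost everywhere (93 of 140 entries), and all 9 alliums sit there. Brassicas (8 of 21 rated High), cucurbits (3 of 8), and salad crops (2 of 3) lean thirstier, while Lamiaceae herbs lean dry (3 of 4 rated Low). Legumes span the full range from very low to very high. Several of these classes are small, so the shares are indicative only. This kind of profiling is exactly what the GROW Observatory aimed to support: giving citizen scientists and home gardeners a data-driven way to plan what to grow based on their local conditions.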

Tip

If you’re planning a garden, the pH preference data is particularly actionable — most edible plants cluster in the 5.5–7.5 range, but there’s meaningful variation. Testing your soil pH before planting season can save a lot of heartache.

One limitation: many of the numeric fields (germination days, harvest days, temperature ranges) are stored as character strings with range notation (e.g., “10-14”), which limits direct quantitative analysis without parsing. A natural extension would be to extract those ranges into min/max numeric columns for more granular modeling.
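
As a starting point for that extension, here is a minimal sketch of a range parser. The function name parse_range is hypothetical, and the only format I am sure of is the "10-14" style quoted above; anything else (single values, extra spaces, trailing units) is an assumption about what the columns might contain. Because the hyphen is treated as a separator, negative numbers are not supported.

```r
# Extract the smallest and largest number found in each string.
# Assumes non-negative values; "-" is read as a range separator.
parse_range <- function(x) {
  x <- as.character(x)
  clean <- ifelse(is.na(x), "", x)  # empty string yields no matches
  found <- regmatches(
    clean,
    gregexpr("[0-9]+(?:\\.[0-9]+)?", clean, perl = TRUE)
  )
  lo <- vapply(
    found,
    function(v) if (length(v) > 0) min(as.numeric(v)) else NA_real_,
    numeric(1)
  )
  hi <- vapply(
    found,
    function(v) if (length(v) > 0) max(as.numeric(v)) else NA_real_,
    numeric(1)
  )
  data.frame(min = lo, max = hi)
}

# Toy inputs, not taken from the dataset
days <- c("10-14", "7", NA, "21 - 30")
parsed <- parse_range(days)
parsed
```

The result has one row per input, so it can be bound back onto the data, for example as days_germination_min and days_germination_max columns. It would be worth tabulating the distinct raw strings first to confirm that this pattern covers them.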