Sean Thimons – The Good, The Bad, and The Ugly

About This Talk

The project: Serapeum — an R/Shiny research tool with RAG pipelines, citation networks, slide generation, and PDF extraction.

The experiment: Use Claude Code as an autonomous workflow engine via a custom framework called GSD (Get Stuff Done).

The numbers:

256 sessions over 26 days
1,568 messages, ~60/day
91 commits, 587 files touched
+56,491 / -3,594 lines

/gsd-discuss-phase
/gsd-plan-phase
/gsd-execute-phase
/gsd-verify-work
/gsd-audit-milestone
/gsd-complete-milestone

“Don’t prompt Claude to write code. Prompt Claude to run a software team.”

What Is GSD?

A phase-based development framework built entirely from custom Claude Code slash commands.

The Lifecycle

%%{init: {'theme': 'dark', 'flowchart': {'rankSpacing': 12, 'nodeSpacing': 10, 'padding': 20}, 'themeVariables': {'fontSize': '26px'}}}%%
flowchart TD
    D["/gsd-discuss-phase"] --> P["/gsd-plan-phase"]
    P --> E["/gsd-execute-phase"]
    E --> V["/gsd-verify-work"]
    V --> A["/gsd-audit-milestone"]
    A --> C["/gsd-complete-milestone"]

What Happens Under the Hood

Each command orchestrates subagents:

Research agents explore the codebase
Pattern mappers find analogous code
Plan checkers validate before execution
Executors work in parallel waves
Verifiers confirm goal achievement

Milestones group phases into versioned releases (v12–v20 in one month).

ACT I: THE GOOD

The Numbers That Worked

Metric	Value
Fully achieved outcomes	73 / 91 (80%)
Multi-file changes	56 sessions
Parallel sessions (multi-clauding)	54 events, 15% of messages
Dissatisfied signals	Only 8

These aren’t “rename this variable” tasks.

These are multi-file, multi-step, plan-execute-verify pipelines.

Win: Autonomous Content Pipeline

Tidy Tuesday blog posts: 26 sessions, 20+ complete QMD posts

One Slash Command =

Fetch this week’s dataset
Exploratory data analysis
Three visualizations
Written narrative
Quarto render
Smoke test
Git commit

Self-Correction

Claude handles without intervention:

Palette errors
Column name mismatches
Windows path issues
Broken formatting

This is what workflows buy you: repeatability at scale.

Win: Phase Orchestration

Phase 35 example: discussion -> planning -> UI spec -> execution -> 80 passing tests

Multi-wave subagent architecture:

Wave 1: Research + pattern mapping
Wave 2: Planning with checker validation
Wave 3: Parallel execution in isolated tasks
Wave 4: Verification against plan goals

Each wave has gates — the next wave doesn’t start until the previous one passes.
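The gating idea can be sketched in a few lines of R. This is an illustration only, not GSD's actual implementation: `run_waves()` and the four placeholder wave functions are hypothetical names, and each wave is reduced to a function that returns TRUE when its gate passes.

```r
# Hypothetical gate runner: each wave is a function returning TRUE (pass)
# or FALSE (fail). A failed gate stops the pipeline before the next wave.
run_waves <- function(waves) {
  completed <- character(0)
  for (name in names(waves)) {
    passed <- isTRUE(waves[[name]]())
    if (!passed) {
      return(list(ok = FALSE, completed = completed, failed_at = name))
    }
    completed <- c(completed, name)
  }
  list(ok = TRUE, completed = completed, failed_at = NA_character_)
}

# Placeholder waves; real ones would launch subagents and inspect their output.
waves <- list(
  research     = function() TRUE,
  planning     = function() TRUE,
  execution    = function() FALSE,  # simulate a failed gate
  verification = function() TRUE
)

result <- run_waves(waves)
```

Here `result$failed_at` names the wave whose gate failed, and the verification wave never runs.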

Bug Bash Discipline

plan
  -> fix
    -> Codex cross-audit
      -> Copilot review
        -> merge

The plan artifact does real work — it’s a contract between you and the agent about what “done” means.

The Good: Summary

Workflows transform an AI assistant from a pair programmer into a project manager with commit access.

Three things that worked:

Reusable skill templates — encode hard-won knowledge once, reuse forever
Phase gates — verify between stages, prevent compound errors
Parallel orchestration — three things at once, safely

…or so I thought.

ACT II: THE BAD

Friction by the Numbers

Friction Type	Count
Buggy generated code	25
Wrong initial approach	18
Misunderstood request	7
Environment issues	4
User rejected action	3

Error Type	Count
Command failed	222
User rejected tool call	40
File too large	29
File not found	20
Edit failed	7

The workflow doesn’t eliminate mistakes — it changes where they happen.

The Inverted Narrative

The most common bug class: Claude assumes things about data it hasn’t seen.

Recurring Data Bugs

Column names after janitor::clean_names() — guessed wrong
Proportions (0–1) vs. percentages (0–100) — broken filters
quantile() returns a double; sprintf("%d") rejects non-whole doubles — crash
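Each of these three bugs has a cheap guard in base R. The sketch below is one possible set of checks, not code from Serapeum: the helper names (`require_columns()`, `is_proportion()`, `format_quantile()`) and the example columns `funding_share` and `visits` are made up for illustration.

```r
# Guard 1: fail fast if an expected column is missing after renaming.
require_columns <- function(df, expected) {
  missing <- setdiff(expected, names(df))
  if (length(missing) > 0) {
    stop("Missing columns: ", paste(missing, collapse = ", "))
  }
  invisible(df)
}

# Guard 2: confirm a share column is on the 0-1 scale before filtering on it.
is_proportion <- function(x) {
  all(x >= 0 & x <= 1, na.rm = TRUE)
}

# Guard 3: quantile() returns a double, so round and convert before using %d.
format_quantile <- function(x, prob) {
  q <- quantile(x, probs = prob, names = FALSE)
  sprintf("%d", as.integer(round(q)))
}

# Hypothetical data for illustration only.
df <- data.frame(
  funding_share = c(0.12, 0.40, 0.75),
  visits = c(10, 20, 30)
)

require_columns(df, c("funding_share", "visits"))
share_ok <- is_proportion(df$funding_share)
p90_visits <- format_quantile(df$visits, 0.9)
```

The point of the guards is that they read the data instead of assuming it, which is the step the agent tends to skip.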

The British Library Post

Claude wrote: “GIA funding share decreased”

The data showed: it increased.

Every workflow step passed:

EDA ran
Visualization rendered
Smoke test succeeded
Commit created

Structural correctness ≠ semantic correctness.

The workflow ran perfectly. The conclusion was backwards.
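A semantic check for this kind of mistake can be small. The sketch below assumes the narrative's claim is available as a single word ("increased" or "decreased") and compares it with the direction computed from the data. The function names are hypothetical and the numbers are invented, not the British Library figures. Comparing first and last values is a deliberately crude definition of "trend".

```r
# Classify the direction of a series from its first and last values.
trend_direction <- function(values) {
  change <- values[length(values)] - values[1]
  if (change > 0) "increased" else if (change < 0) "decreased" else "unchanged"
}

# Gate: stop before publishing if the narrative's claim disagrees with the data.
check_claim <- function(values, claimed) {
  actual <- trend_direction(values)
  if (!identical(actual, claimed)) {
    stop(sprintf("Narrative says '%s' but the data %s.", claimed, actual))
  }
  invisible(TRUE)
}

# Made-up shares for illustration only.
share_by_year <- c(0.41, 0.44, 0.47)

# Capture the error message instead of halting, so the example runs end to end.
outcome <- tryCatch(
  check_claim(share_by_year, claimed = "decreased"),
  error = function(e) conditionMessage(e)
)
```

With a gate like this in the pipeline, the backwards claim would fail a step instead of reaching the commit.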

Claude Overreach

When you give an agent autonomy, it uses that autonomy:

Scope creep: Proposed dropping an embedding column without checking downstream usage. The research refiner needed it.

Not listening: Asked about merge strategy. Got a milestone proposal instead.

Premature claims: Claimed 12 issues were closed before finishing. “Did something stop?” became a recurring question.

Broken tools: The AskUserQuestion tool rendered blank questions 3 times in a row. Had to reject and switch to plain text.

Autonomy requires trust. Trust requires boundaries. The more autonomous the workflow, the more damage a wrong autonomous decision causes.

The Bad: Summary

Workflows solve the process problem but not the judgment problem.

The agent doesn’t say “I’m not sure about this column name” — it just uses the wrong one
The more structured the workflow, the more confidently it executes wrong steps

Better workflows can mask worse judgment.

ACT III: THE UGLY

The Setup: Six Parallel Milestones

Mid-March. Six milestones. Six branches. Six Claude sessions. Each running the full GSD pipeline independently.

PR	Milestone	Created	Merged
#156	v12 UX Polish & Onboarding	Mar 13	Mar 20
#161	v13 Search & Discovery	Mar 17	Mar 20
#168	v14 Citation Network Evolution	Mar 19	Mar 20
#162	v15 AI Infrastructure	Mar 18	Mar 20
#221	v16 Content & Output Quality	Mar 21	never merged
#163	v17 PDF Image Pipeline	Mar 19	Mar 21

Each milestone: individually disciplined. Full discuss-plan-execute-verify lifecycle.

All six needed to merge in the same 48-hour window.

The Anatomy of the Disaster

v12 merged into integration — fine.
v13 merged — 8 review issues fixed first, but OK.
v14 merged, then immediately reverted.
- Namespace scope bugs + missing FWCI scaffold discovered post-merge.
- Had to fix and re-merge.
v15 merged — only after a dedicated “resolve conflicts with integration” commit.
v17 had to absorb v12–v15 INTO ITSELF before it could merge.
- The feature branch swallowed four other milestones.
v16 never merged as a PR. PR #221 closed.
- Changes rescued via a separate “uncommitted integration” fix PR the next day.

The Git Graph

*   d834744 merge: v17 + v16 into integration
|\
| *   6beabed v17.0 PDF image pipeline
| |\
| | *   202b90c merge: integration (v12-v15) INTO v17
| | |\
| | |/
| |/|
* | |   0f2ad9a merge: v16 Content & Output Quality
|\ \ \
| |/ /
|/| |
| * | 4322df7 fix: address PR #221 review feedback
* |     4e299ca Merge PR #162 (v15-ai-infrastructure)
| * |   69b9b32 merge: resolve conflicts with integration
|/
*       4c1516b Merge v14 into integration
*       da06085 Revert "Merge PR #168 (v14)"  <-- THE REVERT
*       152ffdb Merge PR #168 (v14-citation-network)
*       3b1fde3 Merge PR #161 (v13-search-discovery)
*       05852b6 Merge v12-ux-polish into integration

135 files. +28,656 lines. -1,261 lines.

The Aftermath

The next week was integration debris:

PR	What Broke	Days After
#233	v16 uncommitted changes salvaged	+1
#237	Document download broken	+2
#241	v18 Bug Bash: 13 issues, 5 fix sessions	+3
#253	Missing community column migration	+4
#262	Citation network bind length mismatch	+4

The v18 Bug Bash wasn’t planned work.

It was cleanup. Five sessions (A through E) to fix what the integration broke.

“The workflows produced code faster than I could integrate it. I was assembling six sets of IKEA furniture that all shared the same screws.”

Why the Workflow Couldn’t Save Me

The GSD framework had no concept of:

Cross-milestone coordination — no shared integration cadence; branches diverged for days
File overlap awareness — rag.R, mod_settings.R, mod_slides.R, db.R all touched by 3+ milestones
Migration collision — numbered migrations from parallel branches guaranteed conflicts
An integration agent — the one role that doesn’t exist in the framework
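The first three gaps are mechanically detectable before any merge. The sketch below assumes each branch's changed files are already available as a character vector (for example from `git diff --name-only`). The `R/` and `migrations/` paths, the migration file names, and both helper functions are hypothetical, chosen only to mirror the files named above.

```r
# Hypothetical input: files changed on each milestone branch.
changed <- list(
  v13 = c("R/rag.R", "R/db.R", "migrations/012_search_index.sql"),
  v14 = c("R/db.R", "R/mod_settings.R", "migrations/012_community.sql"),
  v15 = c("R/rag.R", "R/db.R", "R/mod_settings.R")
)

# Files touched by at least `min_branches` branches, with the branches listed.
find_overlaps <- function(changed, min_branches = 2) {
  changed <- lapply(changed, unique)
  files <- unlist(changed, use.names = FALSE)
  branches <- rep(names(changed), lengths(changed))
  by_file <- split(branches, files)
  by_file[lengths(by_file) >= min_branches]
}

# Migration files from different branches that share a numeric prefix.
find_migration_collisions <- function(changed) {
  files <- unique(unlist(changed, use.names = FALSE))
  migrations <- basename(files[grepl("^migrations/", files)])
  numbers <- sub("^([0-9]+)_.*$", "\\1", migrations)
  by_number <- split(migrations, numbers)
  by_number[lengths(by_number) > 1]
}

overlaps <- find_overlaps(changed)
collisions <- find_migration_collisions(changed)
```

A report like `overlaps` is the kind of input an integration role would need in order to decide which milestones can run in parallel and which must be sequenced.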

Each agent was a perfectly disciplined soldier executing a flawless plan. But nobody was the general.

The Windows Tax

And then there’s the unglamorous stuff:

Claude uses /tmp/ paths
Windows R can’t access /tmp/
This happened in 5+ separate sessions
R segfaults on network calls on Windows
File locking blocks git worktree remove

I have a “never use /tmp/” rule in:

My CLAUDE.md
My skill templates
My memory system
My hooks

Claude still uses /tmp/.

Instruction adherence degrades with context length.

By the time Claude is deep in a multi-stage pipeline, the guardrails have scrolled off its priority list.
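Two small pieces of R illustrate the alternative. `tempfile()` asks R for a path in the session's own temp directory, which works on Windows. A check that scans generated code for hard-coded `/tmp/` paths is the sort of thing a hook could run and refuse on. `find_tmp_paths()` and the example script lines are hypothetical, and how the check is wired into a hook is left out.

```r
# Portable alternative: let R choose the session temp directory.
out_path <- tempfile(pattern = "plot_", fileext = ".png")

# Hook-style check: return the line numbers that hard-code a quoted /tmp/ path.
find_tmp_paths <- function(lines) {
  grep("[\"']/tmp/", lines)
}

# Hypothetical generated script, one element per line.
script <- c(
  "library(ggplot2)",
  "ggsave('/tmp/plot.png')",
  "ggsave(tempfile(fileext = '.png'))"
)

bad_lines <- find_tmp_paths(script)
```

Unlike a written instruction, a check like this gives the same answer however long the session has been running.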

THE LESSONS

What Workflows Buy You (And What They Don’t)

Workflows are a floor, not a ceiling.

What They Buy

Repeatability
Parallelism
Institutional knowledge
Scale (56K lines / month)

What They Don’t

Semantic correctness
Cross-workflow coordination
Judgment under ambiguity
Self-awareness about confidence

The uncomfortable truth: the maintenance burden of the workflow system is itself a significant time investment. Every mistake adds a new guardrail. Every guardrail adds context length. Context length degrades guardrail adherence.

What You Actually Need Beyond Workflows

Semantic verification gates — “Is this trend going up or down?” before writing the narrative
Cross-workflow coordination — if two agents touch the same files, they need sequencing, not parallelism
Hooks over instructions — hooks actually block bad behavior; instructions get forgotten as the context grows
A convergence strategy — regular integration merges, not a big bang at the end
Honest accounting — report net productivity after subtracting the error correction and integration tax

The Maturity Model

Level	Description	Where It Breaks
1. Prompting	Ask the AI to write code	Quality of each prompt
2. Templates	Reusable prompt patterns	Doesn’t adapt to context
3. Skills / Workflows	Multi-step pipelines	Semantic errors, overreach
4. Verification loops	Workflows + runtime validation	Maintenance burden
5. Coordinated fleets	Parallel agents with shared state	We’re not here yet

I’m at Level 3–4.

Level 5 requires solving Level 3’s problems first.

The Good: Workflows made me 5x more productive on a good day.

The Bad: The agent doesn’t know what it doesn’t know.

The Ugly: Six perfect pipelines, one catastrophic merge.

Workflows are not all you need.

You also need verification, coordination, humility about what autonomy actually means, and a willingness to maintain the machine that maintains your code.

Sean Thimons

github.com/seanthimons