How I Debug Production Issues (A Real Framework, Not Guessing)
Early in my career, I debugged by vibes. Something broke, I'd stare at the code, change something, redeploy, hope. Sometimes it worked. Often it made things worse.
At Home Depot — where a bug affected 2,300+ stores — I couldn't afford to guess. I developed a framework. It's not glamorous, but it works every time.
The Framework: ISOLATE
- I — Identify the symptom (not the cause)
- S — Scope the blast radius
- O — Observe the data (logs, metrics, traces)
- L — List hypotheses (minimum 3)
- A — Assess each hypothesis with evidence
- T — Test the fix in isolation
- E — Explain what happened (postmortem)
Let me walk through a real example.
Real Case: The Dashboard That Took 30 Seconds to Load
I — Identify the symptom. Users report the quality dashboard takes 30+ seconds to load. Locally it loads in 2 seconds. Production only.
Don't jump to "it's a database problem" or "it's a network issue" yet. Just describe what you see.
S — Scope the blast radius. Is it all users or specific ones? All browsers? Started when? Correlated with a deploy?
In this case: all users, started 3 days ago, no deploy in that window. That eliminates "we shipped broken code" as the cause.
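If your releases are tagged in git, ruling a deploy in or out takes about a minute. A quick sketch (it assumes tagged releases and an `origin/main` branch; swap in your own deploy log if you have one):

```bash
# Did anything ship in the suspect window? Check recent release tags and merged commits.
git tag --sort=-creatordate --format='%(creatordate:short) %(refname:short)' | head -5
git log --since="3 days ago" --oneline origin/main
```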
O — Observe the data.
```bash
# Check API response times
curl -o /dev/null -s -w "Total: %{time_total}s\n" https://api.example.com/quality

# Check database query times (via psql against the production database; connection flags omitted)
psql -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"

# Check for connection pool exhaustion
psql -c "SELECT count(*) FROM pg_stat_activity WHERE state = 'active';"
```
The data showed: API response time was 28 seconds. Database query took 0.3 seconds. So the bottleneck wasn't the database.
L — List hypotheses.
- GitHub API rate limiting (we fetch CI artifacts)
- Artifact ZIP file grew too large (parsing bottleneck)
- DNS resolution delay on the serverless function cold start
Never go with just one hypothesis. If you only have one, you'll confirm it whether it's right or not.
A — Assess each hypothesis.
Hypothesis 1: Checked the GitHub API response headers — `X-RateLimit-Remaining: 4823`. Not rate limited.
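You can verify this without burning any quota. A quick sketch, assuming a `GITHUB_TOKEN` environment variable with access to the repo (the `rate_limit` endpoint reports remaining quota without consuming it):

```bash
# How many core API requests do we have left this hour?
curl -s -H "Authorization: Bearer $GITHUB_TOKEN" https://api.github.com/rate_limit \
  | jq '.resources.core.remaining'
```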
Hypothesis 2: Downloaded the latest artifact ZIP — 340KB. Normal size. But wait — the function was downloading artifacts from the LAST 20 workflow runs to find the newest one. And GitHub's artifact API was paginating slowly because the repo had 500+ workflow runs.
Found it. The API was making 20 sequential HTTP requests to GitHub, each taking ~1.4 seconds. That also disposed of hypothesis 3: a cold-start DNS delay is a one-time cost and can't account for 28 seconds of latency on every page load.
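To see the cost for yourself, you can time the pagination directly. A minimal sketch, with `OWNER/REPO` and `GITHUB_TOKEN` as placeholders for the real repo and token:

```bash
# Time each of 20 sequential workflow-run requests; the per-request latency adds up
for page in $(seq 1 20); do
  curl -o /dev/null -s -w "page ${page}: %{time_total}s\n" \
    -H "Authorization: Bearer $GITHUB_TOKEN" \
    "https://api.github.com/repos/OWNER/REPO/actions/runs?per_page=1&page=${page}"
done
```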
T — Test the fix. Changed the logic to fetch only the latest 3 runs instead of 20. Tested locally against production GitHub data. Load time: 3.2 seconds.
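The shape of the fix, sketched as a raw API call (the real change lived in the serverless function's fetch logic; `OWNER/REPO` and `GITHUB_TOKEN` are placeholders):

```bash
# One request for the 3 most recent runs instead of paging through 20
curl -s -H "Authorization: Bearer $GITHUB_TOKEN" \
  "https://api.github.com/repos/OWNER/REPO/actions/runs?per_page=3" \
  | jq '.workflow_runs[] | {id, created_at}'
```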
Deployed behind a feature flag. Monitored for 24 hours. Confirmed fix.
E — Explain what happened. Wrote a 1-page postmortem: what broke, why, how we found it, what we changed, and what we'd do to prevent similar issues (added a `per_page=3` parameter and a circuit breaker for GitHub API calls).
Why This Works Better Than Guessing
The framework forces you to separate observation from interpretation. Most debugging goes wrong at step 1: someone observes "it's slow" and immediately concludes "it's a database problem." They spend 4 hours optimizing queries while the real issue is a network call.
By listing 3+ hypotheses before investigating any of them, you prevent confirmation bias. The fix is usually in the hypothesis you almost didn't write down.
The One-Liner Version
When someone asks me how I debug, I say: "I don't look for the bug. I look for the data that tells me where the bug isn't. I eliminate everything it's not, and what's left is the answer."
It's Sherlock Holmes, but for API calls.