โ† Writing

What a Failed Database Migration Taught Me About Assumptions in Production

A migration script that worked fine in every lower environment. A deployment that looked clean. And then, 30 minutes later, errors in prod. Here's what we missed, why we missed it, and what I'd do differently.

What happened

Routine deployment. Code changes promoted to production, pipeline completed clean, logs looked fine. Thirty minutes later, errors: customers couldn't see order updates, and queries were referencing a column that didn't exist in the production schema.

Time to detect: 30 minutes. Time to recover: 12 hours.

The root cause: a database schema change was included in the codebase but the migration script was inadvertently omitted from the production pipeline step. In every environment below prod, that migration had been applied weeks earlier - so nothing caught the gap. The pipeline ran clean. It just didn't run the thing that mattered.

The sneaky part

We'd been running that migration in lower environments for weeks without a problem. That track record created confidence - exactly the wrong kind. The migration worked fine. What we didn't verify was whether production would actually execute it.

The mistake in one sentence

We tested the migration. We didn't test that production would run the migration. Lower environments validate the migration - not the pipeline that runs it. Those are two different things.
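One way to test the pipeline rather than the migration is to compare what each environment's pipeline actually runs. The sketch below is a minimal illustration, not what we had in place: the `PIPELINES` dictionary and its step names are hypothetical stand-ins for whatever your CI system lets you export.

```python
# Hypothetical pipeline definitions: ordered step names per environment.
# In practice these would be read from your CI configuration.
PIPELINES = {
    "staging": ["build", "run_migrations", "deploy_app", "smoke_test"],
    "prod": ["build", "deploy_app", "smoke_test"],
}


def steps_missing_from(target, reference):
    """Return steps that the reference pipeline runs but the target does not.

    Order follows the reference pipeline so the output reads top to bottom.
    """
    return [step for step in reference if step not in target]


if __name__ == "__main__":
    gap = steps_missing_from(PIPELINES["prod"], PIPELINES["staging"])
    if gap:
        print(f"prod pipeline is missing steps that staging runs: {gap}")
```

A check like this only catches differences in the step list. It says nothing about whether a step with the same name does the same thing in both environments.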

The second thing that hurt us: we underestimated recovery time by a significant margin. Every benchmark we had came from lower-environment replays, and prod data volume was roughly 10x what those environments held. When you're giving stakeholders a recovery timeline during an incident, being off by a factor like that erodes trust fast. Prod is not a bigger staging. Treat your time estimates accordingly.
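To make "accordingly" concrete, here is a rough sketch of turning a lower-environment benchmark into a range rather than a single number. Everything in it is an assumption: the function name, the linear scaling with data volume, and the default uncertainty factor of 2 are illustrative, and real recovery work (index rebuilds, lock contention) can scale worse than linearly.

```python
def recovery_estimate_range(benchmark_minutes, volume_multiplier, uncertainty=2.0):
    """Return a (low, high) recovery estimate in minutes.

    low  assumes time scales linearly with data volume.
    high pads that by an uncertainty factor, because linear scaling
         is an optimistic assumption, not a measured fact.
    """
    if benchmark_minutes <= 0 or volume_multiplier <= 0 or uncertainty < 1:
        raise ValueError("benchmark and multiplier must be positive, uncertainty >= 1")
    low = benchmark_minutes * volume_multiplier
    high = low * uncertainty
    return low, high


if __name__ == "__main__":
    # Illustrative numbers only: a 45-minute replay and a 10x volume gap.
    low, high = recovery_estimate_range(45, 10)
    print(f"Estimated recovery: {low / 60:.1f} to {high / 60:.1f} hours")
```

The point is less the arithmetic than the habit: state the multiplier out loud, and give stakeholders both ends of the range.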

What I'd do differently

Add a schema pre-flight check to the pipeline

Before deploying code that expects a new schema state, verify that the target environment actually has it. This should be a pipeline gate, not a manual step. If the migration hasn't run, the deployment stops - not silently, not after the fact.
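A minimal sketch of such a gate, assuming SQLite so the example stays self-contained. The `EXPECTED_COLUMNS` mapping and the table and column names are hypothetical. On other databases you would query `information_schema.columns` instead of using `PRAGMA table_info`.

```python
import sqlite3

# Hypothetical: the columns the new application code expects, per table.
EXPECTED_COLUMNS = {"orders": {"id", "status", "updated_at"}}


def missing_columns(conn, expected):
    """Return {table: [missing column names]} for every table that is behind."""
    gaps = {}
    for table, columns in expected.items():
        # PRAGMA table_info is SQLite-specific. It returns no rows for a
        # table that does not exist, so every expected column is reported.
        rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
        present = {row[1] for row in rows}  # row[1] is the column name
        absent = sorted(columns - present)
        if absent:
            gaps[table] = absent
    return gaps


def preflight(conn, expected):
    """Stop the pipeline with a non-zero exit if the schema is behind."""
    gaps = missing_columns(conn, expected)
    if gaps:
        raise SystemExit(f"Schema pre-flight failed, missing columns: {gaps}")


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    preflight(conn, EXPECTED_COLUMNS)  # exits: updated_at is missing
```

Because `SystemExit` with a message produces a non-zero exit code, most CI systems will treat this as a failed step and halt before the application deploy runs.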

Decouple schema changes from app deployments

Run the migration first, confirm the schema, then deploy the application. If the migration fails, you haven't deployed anything - the old code still runs against the old schema. Coupling them means a migration failure becomes a partial deployment problem at the worst possible time.
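The ordering can be sketched as a small orchestration function. This is an illustration under assumptions, not our actual pipeline: `run_migration`, `schema_ready`, and `deploy_app` are hypothetical callables standing in for whatever your tooling provides.

```python
def release(run_migration, schema_ready, deploy_app):
    """Run migration, verify the schema, then deploy, stopping at the first failure.

    If run_migration raises, or schema_ready returns False, deploy_app is
    never called, so the old code keeps running against the old schema.
    """
    run_migration()
    if not schema_ready():
        raise RuntimeError("migration step finished but schema check failed; app not deployed")
    deploy_app()


if __name__ == "__main__":
    log = []
    release(
        run_migration=lambda: log.append("migrate"),
        schema_ready=lambda: True,
        deploy_app=lambda: log.append("deploy"),
    )
    print(log)
```

Keeping the schema check as its own step between the two matters: a migration step that exits cleanly is not the same as a schema that is in the expected state.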

Write error messages for 10 AM under pressure

The error correctly named the missing column. It didn't suggest why it was missing or where to look. That gap added unnecessary time to diagnosis. Good error messages include context - what was expected, what was found, where the configuration comes from. That's not polish, it's part of your recovery tooling.
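Here is a sketch of what that could look like as an exception type. The class name, its fields, and the migration path in the usage example are all hypothetical. The idea is only that the message carries what was expected, what was found, and where to look next.

```python
class SchemaMismatchError(Exception):
    """Raised when code expects a column the connected database does not have."""

    def __init__(self, table, column, found_columns, migration_hint):
        self.table = table
        self.column = column
        self.found_columns = sorted(found_columns)
        self.migration_hint = migration_hint
        super().__init__(
            f"Expected column '{column}' on table '{table}', "
            f"but found only: {', '.join(self.found_columns)}. "
            f"A likely cause is a migration that has not run in this environment. "
            f"Check: {migration_hint}"
        )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    err = SchemaMismatchError(
        table="orders",
        column="updated_at",
        found_columns={"id", "status"},
        migration_hint="migrations/add_updated_at.sql",
    )
    print(err)
```

Whoever is on call gets the missing column, the columns that do exist, and a pointer to the migration, all in one line of the log.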

This incident wasn't caused by carelessness. It was caused by a system that assumed consistency it hadn't verified. The fix wasn't to be more careful - it was to make the pipeline explicit enough that the gap couldn't hide.

Key takeaways

  • A clean deployment is not a healthy system - verify schema state, not just deployment status
  • Lower environments validate migrations, not the pipelines that run them
  • Decouple schema changes from app deployments to keep blast radius small
  • Prod volume is not nonprod volume - give recovery estimates as ranges with an explicit multiplier
  • Error messages are recovery tooling - write them before you need them