Why I Proposed Removing Our Manual Approval Gate

Manual approval gates feel safe. They feel responsible. They feel like the kind of thing any careful developer would insist on. I used to think that too - until I started paying attention to what they actually cost us.

The backstory

I work on a high-volume order processing system. On any given day, the system is handling well over a million events, serving millions of customer accounts, and carrying years of transaction history. When something goes wrong in a system like this, people notice immediately - support tickets spike, operations teams get paged, and the trust customers place in the platform takes a quiet hit.

For a long time, our CI/CD pipeline relied on a manual approval gate before any production release. Developers had to manually check the last deployment time, review recent changes, and apply a production tag. On paper, it was a safety measure. In practice, it had become something else entirely.

Manual gates often feel safe, but they sometimes create bottlenecks without adding actual quality checks.

What the pain actually looked like

The problems weren't dramatic. There was no single moment where the approval gate caused a catastrophic incident I could point to. It was slower and quieter than that - accumulated friction that was easy to dismiss in isolation but hard to ignore in aggregate.

  • Deployments slowed down in ways that were hard to argue against individually. An approver was in a meeting. A reviewer was in a different timezone. A hotfix that should have taken 20 minutes took 3 hours because the right person wasn't available.
  • Developer feedback loops stretched. When you can't ship a fix and see the result quickly, you start batching changes. Batching means bigger deployments. Bigger deployments mean more risk and harder rollbacks.
  • The gate created false confidence. Because a human reviewed it, it felt safer. But the review was happening at the worst possible time - after the code was written, after tests ran, when cognitive load was highest and context was lowest.
  • Real incidents got worse. We had a production issue involving duplicate orders linked to a toggle-related race condition. By the time a fix was ready, the manual approval step added meaningful delay to getting it out. A faster pipeline would have meant fewer affected records.
  • Approvers were unsure what we were actually deploying. Without deep visibility into each deployment's change bundle, they often felt uncertain - not about the specific code, but about what else might be riding along. That uncertainty made people conservative in ways that slowed things down without meaningfully improving safety.

Writing the proposal

I've been at companies where people complain about process in Slack but never do anything about it. I didn't want that to be me. So I wrote a formal proposal and brought it to the team.

Writing it was harder than I expected - not technically, but structurally. The challenge was that I was asking people to give up something that felt safe. The burden of proof was on me to show that we'd actually be safer without the gate, not just faster.

Here's how I framed it:

01
Acknowledge the real risk first. Don't pretend the gate has no value. It exists because someone got burned before. Name that, respect it, then explain why the world has changed since then.
02
Show what replaces it - specifically. Saying "we'll use automation instead" isn't enough. I outlined: feature toggles for risky changes, dedicated release branches for high-risk deployments, enhanced monitoring as a real-time safety net, automated rollback capability, and a phased rollout starting with one service.
03
Use a real incident to make it concrete. Abstract arguments about speed don't land. A real example - even a small one - where the gate delayed response to a production issue lands differently. I used one from our own history.
04
Don't make it a binary choice. The proposal wasn't "remove the gate forever." It was "pilot removal on one service, measure it, then decide." That framing dramatically lowers perceived risk for stakeholders.
05
Name your safety measures clearly. Feature toggles, automated rollbacks, production fallback plans, comprehensive test coverage. I put these in the proposal explicitly so no one could walk away thinking we were removing a guardrail with nothing underneath it.

Honest reflection

The hardest part wasn't getting the proposal approved - it was confronting my own uncertainty. I had to be honest with the team that I wasn't 100% certain this was the right call. What I was certain about was that the current state had a cost we weren't measuring. Engineers who frame proposals as "I'm right" tend to get pushback. Engineers who frame proposals as "here's the trade-off I think we should make" tend to get conversations.

What replaces the gate

The alternative to manual gates isn't chaos - it's investment in the right infrastructure.

Feature toggles

This is the single most powerful tool you can have. If you can deploy code that's "dark" - in production but not activated - you've separated deployment from release. That means you can ship at any time and turn things on deliberately, with observability in place and the ability to turn them back off instantly without a rollback. Manual approval gates largely exist because teams haven't built this capability yet.
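To make the deploy-versus-release split concrete, here's a minimal sketch in Python. Everything in it is an assumption for illustration: `ToggleStore`, the `new_pricing` toggle name, and the pricing rule are hypothetical, not our actual system. A real toggle store would live in config or a database, not in memory.

```python
from dataclasses import dataclass, field


@dataclass
class ToggleStore:
    """Hypothetical in-memory toggle store, standing in for a real toggle service."""
    flags: dict = field(default_factory=dict)

    def is_enabled(self, name: str) -> bool:
        # Unknown toggles default to off, so newly deployed code stays dark.
        return self.flags.get(name, False)

    def set(self, name: str, enabled: bool) -> None:
        self.flags[name] = enabled


def legacy_total(prices_cents: list) -> int:
    """Existing behavior: plain sum, in integer cents."""
    return sum(prices_cents)


def new_total(prices_cents: list) -> int:
    """New behavior (made up for this example): 10% off orders over 100.00."""
    subtotal = sum(prices_cents)
    if subtotal > 10_000:
        return subtotal - subtotal // 10
    return subtotal


def order_total(prices_cents: list, toggles: ToggleStore) -> int:
    # Both code paths are deployed. The toggle decides which one is released.
    if toggles.is_enabled("new_pricing"):
        return new_total(prices_cents)
    return legacy_total(prices_cents)
```

The point of the sketch: flipping `new_pricing` on or off changes behavior without a deployment, and flipping it off is the "instant undo" that doesn't need a rollback.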

Automated rollback

Rollback should be a button, not a procedure. If reverting to the last known good state requires manual steps and coordination across several people, it's too slow. The goal is: detect → decide → revert in minutes, not hours. When rollback is fast and reliable, the cost of a bad deployment drops dramatically - which means the value of a gate trying to prevent bad deployments also drops.
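Here's a rough sketch of that detect → decide → revert loop, reduced to its skeleton. The `Deployer` class, the version strings, and the 5% error-rate threshold are all hypothetical. In a real pipeline the error rate would come from your monitoring system and the rollback would call your deploy tooling.

```python
from dataclasses import dataclass


@dataclass
class Deployer:
    """Hypothetical release tracker: the live version and the last known good one."""
    current: str
    last_good: str

    def deploy(self, version: str) -> None:
        self.current = version

    def mark_good(self) -> None:
        # Promote the live version to "last known good" once it looks healthy.
        self.last_good = self.current

    def rollback(self) -> str:
        # Revert to the last known good version in a single step.
        self.current = self.last_good
        return self.current


def check_and_revert(deployer: Deployer, error_rate: float,
                     threshold: float = 0.05) -> bool:
    """Detect -> decide -> revert. Returns True if a rollback was triggered."""
    if error_rate > threshold:   # detect and decide
        deployer.rollback()      # revert
        return True
    deployer.mark_good()
    return False
```

Usage is deliberately boring: deploy a version, feed `check_and_revert` the observed error rate, and the revert happens without anyone being paged to approve it.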

Real monitoring, not just alerts

There's a difference between having alerts and having observability. Alerts tell you something broke. Observability lets you understand what changed, when, and why - and correlate it with the deployment timeline. If you have that, you don't need a human to review every deployment. You need humans to respond when the system tells them something is wrong.
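As a small illustration of correlating an alert with the deployment timeline, here's a sketch that finds the most recent deployment shortly before an alert fired. The function name, the 30-minute window, and the version labels are assumptions for the example. Real observability tooling does far more than this, but the core question is the same: what changed just before things broke?

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple


def suspect_deployment(
    deployments: List[Tuple[datetime, str]],
    alert_time: datetime,
    window: timedelta = timedelta(minutes=30),
) -> Optional[str]:
    """Return the latest version deployed within `window` before the alert, if any."""
    candidates = [
        (deployed_at, version)
        for deployed_at, version in deployments
        if timedelta(0) <= alert_time - deployed_at <= window
    ]
    if not candidates:
        return None
    # The most recent deployment before the alert is the first suspect.
    return max(candidates, key=lambda item: item[0])[1]


deploys = [
    (datetime(2024, 1, 1, 9, 0), "orders-v41"),    # hypothetical versions
    (datetime(2024, 1, 1, 11, 45), "orders-v42"),
]
suspect = suspect_deployment(deploys, datetime(2024, 1, 1, 12, 0))
```

With the sample data above, an alert at 12:00 points at `orders-v42`, which was deployed 15 minutes earlier. An alert hours after the last deployment returns `None`, which is itself useful: it suggests the cause is probably not a release.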

Phased rollout

Deploy to a small percentage of traffic first. Watch it. Then expand. This is how you get the safety benefit of "someone review this" without the bottleneck of "someone manually approve this." The system is your reviewer. It just doesn't need a meeting invite.
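One common way to do this is deterministic percentage bucketing: hash a stable identifier into a bucket from 0 to 99 and expose the new version only to buckets below the current rollout percentage. The sketch below assumes that approach. The stage percentages, the 1% error budget, and the account ID format are made up for illustration.

```python
import hashlib


def in_rollout(account_id: str, percent: int) -> bool:
    """Deterministically place an account in or out of the rollout."""
    digest = hashlib.sha256(account_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return bucket < percent


def next_stage(current_percent: int, error_rate: float,
               stages=(1, 10, 50, 100), max_error_rate: float = 0.01) -> int:
    """Expand to the next stage if healthy; drop to 0% if the error budget is blown."""
    if error_rate > max_error_rate:
        return 0  # halt: nobody sees the new version
    for stage in stages:
        if stage > current_percent:
            return stage
    return current_percent  # already fully rolled out
```

Because the bucket depends only on the account ID, an account that was in the 10% stage is still in at 50%, so nobody flips back and forth between versions as the rollout expands.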


The broader lesson

I've worked in healthcare software, insurance platforms, and high-volume e-commerce. Every domain has its version of the manual approval gate. Sometimes it's a deployment gate. Sometimes it's a change management ticket that requires three signatures. Sometimes it's a code freeze window that lasts two weeks before a major release.

These things usually start as reasonable responses to real problems. They accumulate over time, and nobody removes them because removal feels risky. The process becomes the system. Eventually, the process itself can end up causing the kind of incidents it was designed to prevent.

The question I try to ask now: Is this control actually reducing risk, or is it distributing the feeling of risk? Those are different things. A manual gate that makes the approver feel responsible for a deployment doesn't make the deployment safer. It just moves accountability around in a way that can feel like safety.

The goal of a CI/CD pipeline isn't just to ship code faster. It's to make the cost of being wrong low enough that being right doesn't require perfection.

When you build toward that, manual approval gates start to look less like safety nets and more like the thing you built before you had a real safety net.

Key takeaways
  • Manual gates add latency without eliminating risk - measure the cost before defending the process
  • The alternative to gates is investment: feature toggles, automated rollback, real observability
  • Frame proposals as trade-off conversations, not arguments - you'll get further
  • Start with a pilot on one service. Let the data make the case for broader rollout
  • Rollback speed is your best safety guarantee - faster than any approval process
  • Separate "deployment" from "release" with feature flags and you've solved most of the problem

Where I'd look next

If you're working through a similar conversation at your company: the DORA research on deployment frequency and change failure rate tells a clear story. The State of DevOps reports have repeatedly found that speed and stability aren't in tension - the highest-performing teams deploy more frequently and also tend to have lower change failure rates. That data is useful when making the case internally.

Also worth understanding: trunk-based development. A lot of the risk that manual gates are trying to manage comes from long-lived feature branches that accumulate divergence. Shorter cycles reduce that risk structurally, before you even need a deployment gate conversation.

The infrastructure investment is real. Feature toggles take time to build well. Good rollback requires you to think about database migrations, state management, and downstream consumers. It's not free. But it pays back in ways that are hard to overstate once you've worked on a team that has it.