The backstory
I work on a high-volume order processing system. On any given day, the system is handling well over a million events, serving millions of customer accounts, and carrying years of transaction history. When something goes wrong in a system like this, people notice immediately - support tickets spike, operations teams get paged, and the trust customers place in the platform takes a quiet hit.
For a long time, our CI/CD pipeline relied on a manual approval gate before any production release. Developers had to manually check the last deployment time, review recent changes, and apply a production tag. On paper, it was a safety measure. In practice, it had become something else entirely.
Manual gates often feel safe, but they sometimes create bottlenecks without adding actual quality checks.
What the pain actually looked like
The problems weren't dramatic. There was no single moment where the approval gate caused a catastrophic incident I could point to. It was slower and quieter than that - accumulated friction that was easy to dismiss in isolation but hard to ignore in aggregate.
- Deployments slowed down in ways that were hard to argue against individually. An approver was in a meeting. A reviewer was in a different timezone. A hotfix that should have taken 20 minutes took 3 hours because the right person wasn't available.
- Developer feedback loops stretched. When you can't ship a fix and see the result quickly, you start batching changes. Batching means bigger deployments. Bigger deployments mean more risk and harder rollbacks.
- The gate created false confidence. Because a human had reviewed the release, it felt safer. But the review was happening at the worst possible time - after the code was written and the tests had run, when cognitive load was highest and context was lowest.
- Real incidents got worse. We had a production issue involving duplicate orders linked to a toggle-related race condition. By the time a fix was ready, the manual approval step added meaningful delay to getting it out. A faster pipeline would have meant fewer affected records.
- We were never quite sure what we were deploying. Without deep visibility into each deployment's change bundle, approvers often felt uncertain - not about the specific code under review, but about what else might be riding along. That uncertainty made people conservative in ways that slowed things down without meaningfully improving safety.
Writing the proposal
I've been at companies where people complain about process in Slack but never do anything about it. I didn't want that to be me. So I wrote a formal proposal and brought it to the team.
Writing it was harder than I expected - not technically, but structurally. The challenge was that I was asking people to give up something that felt safe. The burden of proof was on me to show that we'd actually be safer without the gate, not just faster.
Here's how I framed it:
The hardest part wasn't getting the proposal approved - it was confronting my own uncertainty. I had to be honest with the team that I wasn't 100% certain this was the right call. What I was certain about was that the current state had a cost we weren't measuring. Engineers who frame proposals as "I'm right" tend to get pushback. Engineers who frame proposals as "here's the trade-off I think we should make" tend to get conversations.
What replaces the gate
The alternative to manual gates isn't chaos - it's investment in the right infrastructure.
Feature toggles
This is the single most powerful tool you can have. If you can deploy code that's "dark" - in production but not activated - you've separated deployment from release. That means you can ship at any time and turn things on deliberately, with observability in place and the ability to turn them back off instantly without a rollback. Manual approval gates largely exist because teams haven't built this capability yet.
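To make "deployed but dark" concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `ToggleStore`, `order_total_cents`, and the `new_discount_rule` toggle are hypothetical names, and a real system would read toggles from a config service rather than an in-memory dict.

```python
from dataclasses import dataclass, field


@dataclass
class ToggleStore:
    """In-memory stand-in for a toggle service (hypothetical, for illustration)."""
    flags: dict = field(default_factory=dict)

    def set(self, name: str, enabled: bool) -> None:
        self.flags[name] = enabled

    def is_enabled(self, name: str) -> bool:
        # Unknown toggles default to off, so newly deployed code stays dark.
        return self.flags.get(name, False)


def order_total_cents(subtotal_cents: int, toggles: ToggleStore) -> int:
    """Hypothetical pricing path with a new rule shipped behind a toggle."""
    if toggles.is_enabled("new_discount_rule"):
        # New behavior: already in production, active only while the toggle is on.
        return subtotal_cents * 90 // 100
    # Existing behavior stays the default.
    return subtotal_cents


toggles = ToggleStore()
dark = order_total_cents(10_000, toggles)        # deployed, not released
toggles.set("new_discount_rule", True)
released = order_total_cents(10_000, toggles)    # released by flipping the toggle
toggles.set("new_discount_rule", False)
reverted = order_total_cents(10_000, toggles)    # turned off again, no redeploy
print(dark, released, reverted)  # prints: 10000 9000 10000
```

The point of the sketch is the default: an unknown toggle reads as off, so shipping the code changes nothing until someone deliberately turns it on.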
Automated rollback
Rollback should be a button, not a procedure. If reverting to the last known good state requires manual steps and coordination between people, it's too slow. The goal is: detect → decide → revert in minutes, not hours. When rollback is fast and reliable, the cost of a bad deployment drops dramatically - which means the value of a gate trying to prevent bad deployments also drops.
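The detect → decide → revert loop can be sketched roughly like this. It is not our actual tooling: the `Release` record, the 5% error-rate threshold, and the function names are all assumptions, and the "revert" here only picks the target version rather than talking to a real deploy system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Release:
    version: str
    healthy: bool  # assumed to be recorded by post-deploy checks


def should_roll_back(error_rate: float, threshold: float = 0.05) -> bool:
    """Decide step: compare the observed error rate to an assumed threshold."""
    return error_rate > threshold


def last_known_good(history: List[Release]) -> Optional[Release]:
    """Most recent release before the current one that passed its checks."""
    for release in reversed(history[:-1]):
        if release.healthy:
            return release
    return None


def version_to_run(history: List[Release], error_rate: float) -> str:
    """Detect, decide, revert: return the version that should be serving traffic."""
    current = history[-1]
    if not should_roll_back(error_rate):
        return current.version
    current.healthy = False  # remember that this release went bad
    target = last_known_good(history)
    if target is None:
        # Nothing to revert to; this is where a human has to step in.
        return current.version
    return target.version


history = [Release("1.0", True), Release("1.1", False), Release("1.2", True)]
print(version_to_run(history, error_rate=0.20))  # prints: 1.0
```

Note that the sketch skips over 1.1 because it was already marked unhealthy. Tracking health per release is what makes "last known good" a lookup instead of a discussion.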
Real monitoring, not just alerts
There's a difference between having alerts and having observability. Alerts tell you something broke. Observability lets you understand what changed, when, and why - and correlate it with the deployment timeline. If you have that, you don't need a human to review every deployment. You need humans to respond when the system tells them something is wrong.
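As a small illustration of correlating an alert with the deployment timeline, the sketch below lists the deployments that landed shortly before an alert fired. The `Deployment` record, the 30-minute window, and the service names are hypothetical; a real setup would pull this from your deploy log and monitoring stack.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Deployment(NamedTuple):
    service: str
    version: str
    deployed_at: datetime


def deployments_before(
    alert_at: datetime,
    deployments: List[Deployment],
    window: timedelta = timedelta(minutes=30),  # assumed lookback window
) -> List[Deployment]:
    """Deployments inside the window before the alert, most recent first."""
    candidates = [
        d for d in deployments
        if alert_at - window <= d.deployed_at <= alert_at
    ]
    return sorted(candidates, key=lambda d: d.deployed_at, reverse=True)


log = [
    Deployment("orders", "2.3.0", datetime(2024, 1, 10, 9, 0)),
    Deployment("billing", "1.8.1", datetime(2024, 1, 10, 13, 40)),
    Deployment("orders", "2.3.1", datetime(2024, 1, 10, 13, 55)),
]
alert = datetime(2024, 1, 10, 14, 0)
for d in deployments_before(alert, log):
    print(d.service, d.version)
# prints:
# orders 2.3.1
# billing 1.8.1
```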
Phased rollout
Deploy to a small percentage of traffic first. Watch it. Then expand. This is how you get the safety benefit of "someone review this" without the bottleneck of "someone manually approve this." The system is your reviewer. It just doesn't need a meeting invite.
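Here is one way the "small percentage first, then expand" idea might look in code. The stage percentages, the 1% error budget, and bucketing by a hash of the account ID are assumptions for the sketch, not a description of a specific rollout tool.

```python
import hashlib

STAGES = [1, 10, 50, 100]  # assumed rollout percentages


def bucket(account_id: str) -> int:
    """Map an account to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(account_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def in_rollout(account_id: str, percent: int) -> bool:
    """True if this account should get the new version at the given percentage."""
    return bucket(account_id) < percent


def next_stage(current_percent: int, error_rate: float,
               max_error_rate: float = 0.01) -> int:
    """Expand to the next stage if the new version looks healthy, otherwise halt."""
    if error_rate > max_error_rate:
        return 0  # send all traffic back to the old version
    later = [s for s in STAGES if s > current_percent]
    return later[0] if later else current_percent


print(next_stage(1, error_rate=0.002))   # prints: 10
print(next_stage(10, error_rate=0.080))  # prints: 0
```

Hashing the account ID keeps each account in the same bucket across requests, so an account that saw the new version at 10% still sees it at 50%.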
The broader lesson
I've worked in healthcare software, insurance platforms, and high-volume e-commerce. Every domain has its version of the manual approval gate. Sometimes it's a deployment gate. Sometimes it's a change management ticket that requires three signatures. Sometimes it's a code freeze window that lasts two weeks before a major release.
These things usually start as reasonable responses to real problems. They accumulate over time, and nobody removes them because removal feels risky. The process becomes the system. Eventually, the process itself is causing the kind of incidents it was designed to prevent.
The question I try to ask now: Is this control actually reducing risk, or is it distributing the feeling of risk? Those are different things. A manual gate that makes the approver feel responsible for a deployment doesn't make the deployment safer. It just moves accountability around in a way that can feel like safety.
The goal of a CI/CD pipeline isn't just to ship code faster. It's to make the cost of being wrong low enough that being right doesn't require perfection.
When you build toward that, manual approval gates start to look less like safety nets and more like the thing you built before you had a real safety net.
Key takeaways
- Manual gates add latency without eliminating risk - measure the cost before defending the process
- The alternative to gates is investment: feature toggles, automated rollback, real observability
- Frame proposals as trade-off conversations, not arguments - you'll get further
- Start with a pilot on one service. Let the data make the case for broader rollout
- Rollback speed is your best safety guarantee - faster than any approval process
- Separate "deployment" from "release" with feature flags and you've solved 80% of the problem
Where I'd look next
If you're working through a similar conversation at your company: the DORA research on deployment frequency and change failure rate tells a clear story. The State of DevOps reports consistently show that speed and stability aren't in tension - teams that deploy more frequently also tend to have lower change failure rates. That data is useful when making the case internally.
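If you want the same two numbers for your own pipeline, the arithmetic is simple. The sketch below assumes you can export a list of deployments with a flag for whether each one needed remediation; the function names and sample figures are made up for illustration and simplify how DORA defines the metrics.

```python
from typing import List


def deployment_frequency(deploy_count: int, period_days: int) -> float:
    """Average deployments per day over the period."""
    return deploy_count / period_days


def change_failure_rate(caused_failure: List[bool]) -> float:
    """Share of deployments that needed a fix, rollback, or hotfix."""
    if not caused_failure:
        return 0.0
    return sum(caused_failure) / len(caused_failure)


# Hypothetical week: 14 deployments, 2 of which needed remediation.
outcomes = [False] * 12 + [True] * 2
print(deployment_frequency(len(outcomes), period_days=7))  # prints: 2.0
print(round(change_failure_rate(outcomes), 3))             # prints: 0.143
```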
Also worth understanding: trunk-based development. A lot of the risk that manual gates are trying to manage comes from long-lived feature branches that accumulate divergence. Shorter cycles reduce that risk structurally, before you even need a deployment gate conversation.
The infrastructure investment is real. Feature toggles take time to build well. Good rollback requires you to think about database migrations, state management, and downstream consumers. It's not free. But it pays back in ways that are hard to overstate once you've worked on a team that has it.