We Skipped a Game-Day Drill. Then Chaos Came Calling.
Disasters in tech never arrive on schedule. They show up at 2 a.m., during a peak sale, or in the middle of a long-awaited product launch. And yet, many engineering teams still treat game-day drills—the practice of rehearsing failure scenarios in a controlled environment—as optional.
We were one of those teams.
This is the story of how we skipped a game-day drill, why chaos came calling soon after, and what we learned about resilience, responsibility, and the hidden costs of untested assumptions.
Setting the Stage: The Calm Before the Storm
Our engineering team was working on a distributed service that powered a mid-size SaaS platform. Think authentication flows, background jobs, and data APIs used by thousands of paying customers daily.
We prided ourselves on uptime. CI/CD was humming, observability dashboards glowed in Grafana, and we had just migrated core workloads to Kubernetes. The architecture diagram looked like something straight out of a conference talk:
Load Balancers in front of Kubernetes ingress controllers.
Stateless Node.js services wrapped in Docker.
PostgreSQL managed on RDS.
Redis cluster for caching and queues.
Prometheus + Alertmanager wired into Slack.
It wasn’t perfect, but it was clean.
So when the idea of running a game-day drill came up—intentionally breaking things to validate recovery processes—the response was lukewarm.
“We already know how to recover. The runbooks are written. Besides, we’re too busy shipping features.”
That was the prevailing mindset. And so, the drill was quietly pushed to “next quarter.”
Chaos Knocks: The Unscheduled Game-Day
It was a Thursday afternoon when it happened. A spike in traffic from a new marketing campaign coincided with a minor Redis failover event.
That should have been a routine hiccup. But our assumptions were about to be tested.
Redis Failover Took Longer Than Expected.
Connections from our Node.js services piled up. Latency spiked.
Circuit Breakers Didn’t Trigger Properly.
Instead of gracefully degrading, services kept retrying requests (a sketch of the guard we were missing follows this list).
Postgres Connection Pool Saturated.
With Redis unavailable, fallback queries hammered the database. Suddenly, we weren’t serving cached results—we were hitting the source of truth at full blast.
Cascading Failures Began.
Auth service slowed to a crawl. API response times ballooned. Monitoring dashboards lit up red.
PagerDuty Went Off.
First-line engineers scrambled. Senior SREs joined the call. Within minutes, we had a war room running with 30+ people.
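In hindsight, the missing piece was a guard between the cache lookup and the database fallback. Below is a minimal, hand-rolled circuit breaker sketch in TypeScript to illustrate the pattern; fetchFromRedis and the thresholds are hypothetical stand-ins rather than our production code, and a real service would more likely lean on an established library such as opossum.

```typescript
// Minimal circuit breaker sketch (TypeScript / Node.js).
// fetchFromRedis is a hypothetical placeholder for the real cache call; thresholds are illustrative.

type State = "closed" | "open" | "half-open";

class CircuitBreaker<T> {
  private state: State = "closed";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly action: () => Promise<T>,
    private readonly fallback: () => T,
    private readonly failureThreshold = 5,    // consecutive failures before opening
    private readonly resetTimeoutMs = 10_000, // how long to stay open before probing again
  ) {}

  async fire(): Promise<T> {
    // While open, short-circuit immediately instead of piling retries onto the backend.
    if (this.state === "open") {
      if (Date.now() - this.openedAt < this.resetTimeoutMs) return this.fallback();
      this.state = "half-open"; // let a single probe request through
    }
    try {
      const result = await this.action();
      this.failures = 0;
      this.state = "closed";
      return result;
    } catch {
      this.failures += 1;
      if (this.state === "half-open" || this.failures >= this.failureThreshold) {
        this.state = "open";
        this.openedAt = Date.now();
      }
      return this.fallback();
    }
  }
}

// Hypothetical cache call; in the real service this would be an ioredis GET.
async function fetchFromRedis(key: string): Promise<string | null> {
  throw new Error(`redis unavailable (simulated) for ${key}`);
}

// Graceful degradation: serve a default instead of falling back to a Postgres query.
const cachedLookup = new CircuitBreaker(() => fetchFromRedis("user:123:profile"), () => null);

cachedLookup.fire().then((value) => console.log("profile:", value));
```

A breaker like this opens after a handful of consecutive failures and returns a cheap fallback, so a slow Redis failover degrades one feature instead of saturating the Postgres connection pool behind it.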
The Human Side of Failure
Technical failures are stressful. But the human response to failure is what often defines the outcome.
During the incident, some painful realities surfaced:
Runbooks Were Outdated.
The Redis failover guide referenced old Kubernetes manifests and service names. Nobody could find the updated docs in Confluence.
Roles Weren’t Clear.
Who was incident commander? Who was responsible for customer comms? We lost precious minutes figuring that out.
Tooling Assumptions Failed.
Alert thresholds were tuned for “normal” traffic, not marketing-driven spikes. We had noise in Slack, but little actionable signal.
Decision Fatigue Set In.
Engineers were debating fixes in chat while customers tweeted angrily about outages. The stress curve went exponential.
After 90 minutes of firefighting, we stabilized Redis, restarted overloaded services, and gradually restored normal operations.
But the damage was done:
Downtime lasted 1 hour 32 minutes.
Multiple angry customers threatened churn.
An internal postmortem revealed a dozen preventable failures.
We hadn’t planned a game-day. But chaos had delivered one for us—at the worst possible time.
Why Game-Day Drills Matter
That incident changed how we thought about resilience. We realized something simple but profound:
Game-day drills aren’t about finding out if things can break. They’re about finding out how people respond when things break.
A game-day drill is essentially a fire drill for software systems. Firefighters don’t wait for a real building to catch fire before practicing. Pilots don’t wait for a real engine failure before rehearsing emergency procedures. So why should engineering teams wait for production outages before testing their responses?
Key benefits of running game-day drills include:
Validating Runbooks.
Outdated docs are worse than no docs. Drills expose gaps before customers do.
Clarifying Roles.
Everyone should know who the incident commander is, who’s on-call, and who handles external comms.
Building Muscle Memory.
In high-stress situations, people fall back on practice. Drills help make response patterns automatic.
Testing Observability.
Metrics, logs, and alerts often look great until they’re actually needed. Game-days test whether your monitoring is signal or noise.
Psychological Safety.
Practicing in a controlled setting reduces panic when real chaos arrives.
How to Run a Game-Day Drill
After the incident, we instituted quarterly game-day drills. Here’s the structure we adopted—adapted from chaos engineering and SRE best practices.
1. Define Objectives
Don’t just “break stuff.” Decide what you want to test. Examples:
Redis failover recovery.
API latency under 2x normal load.
Incident commander role clarity.
2. Pick a Failure Scenario
Common scenarios include:
Shutting down a database replica.
Killing Kubernetes pods.
Injecting network latency.
Rotating expired certificates.
We used tools like Gremlin and Chaos Mesh, but manual triggers (e.g., stopping a container) worked too.
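Even the manual trigger is worth scripting, so the drill is repeatable and leaves a record of what was broken and when. Here is a rough TypeScript sketch that deletes one random pod matching a label via kubectl; the namespace and selector (staging, app=api) are hypothetical examples, and this should obviously be pointed at a non-production environment first.

```typescript
// killRandomPod.ts: delete one random pod matching a label, as a simple chaos trigger.
// Assumes kubectl is installed and the current context points at the target cluster.
// The namespace and label selector below are hypothetical examples.
import { execFileSync } from "node:child_process";

const NAMESPACE = "staging";
const LABEL_SELECTOR = "app=api";

function kubectl(args: string[]): string {
  return execFileSync("kubectl", args, { encoding: "utf8" }).trim();
}

// List matching pod names.
const output = kubectl([
  "get", "pods",
  "-n", NAMESPACE,
  "-l", LABEL_SELECTOR,
  "-o", "jsonpath={.items[*].metadata.name}",
]);
const pods = output.split(/\s+/).filter(Boolean);

if (pods.length === 0) {
  console.error(`No pods found in ${NAMESPACE} matching ${LABEL_SELECTOR}`);
  process.exit(1);
}

// Pick a victim and record what was done; the drill log matters as much as the failure.
const victim = pods[Math.floor(Math.random() * pods.length)];
console.log(`[game-day] ${new Date().toISOString()} deleting pod ${victim}`);
kubectl(["delete", "pod", victim, "-n", NAMESPACE]);
console.log("[game-day] pod deleted; watch dashboards and alerts for recovery behaviour");
```

During the drill itself, whoever runs the script should stay quiet and let the on-call engineer discover the failure through alerts, because that is what a real incident looks like.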
3. Schedule and Announce
A drill isn’t meant to ambush people. Give notice, schedule it during business hours, and ensure the right people are available.
4. Run the Drill
Start a timer.
Introduce the failure.
Observe responses.
Capture decisions and communication patterns.
5. Debrief and Document
The real value comes from the postmortem:
What went well?
What slowed us down?
Which assumptions were wrong?
What action items should we take?
6. Automate Learnings
Update runbooks, adjust monitoring thresholds, refine escalation paths. Then repeat the cycle.
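One way to make that automation concrete is to let CI police runbook freshness. Below is a minimal sketch that assumes runbooks live as markdown files under a runbooks/ directory in the repository (a hypothetical layout, with an arbitrary 90-day threshold); it flags any file that has not been committed to recently, so stale docs fail the build instead of failing the responder.

```typescript
// checkRunbookFreshness.ts: fail the build if any runbook has not been updated recently.
// Assumes runbooks are markdown files under runbooks/ in a git repository (hypothetical layout).
import { execFileSync } from "node:child_process";

const MAX_AGE_DAYS = 90;
const RUNBOOK_DIR = "runbooks";

function git(args: string[]): string {
  return execFileSync("git", args, { encoding: "utf8" }).trim();
}

// All tracked markdown files under the runbook directory.
const files = git(["ls-files", `${RUNBOOK_DIR}/*.md`]).split("\n").filter(Boolean);

const now = Date.now();
const stale: string[] = [];

for (const file of files) {
  // Unix timestamp of the last commit that touched this file.
  const lastCommit = Number(git(["log", "-1", "--format=%ct", "--", file]));
  const ageDays = (now / 1000 - lastCommit) / 86_400;
  if (ageDays > MAX_AGE_DAYS) {
    stale.push(`${file} (last updated ${Math.floor(ageDays)} days ago)`);
  }
}

if (stale.length > 0) {
  console.error("Stale runbooks found:\n" + stale.join("\n"));
  process.exit(1); // fail the CI job so someone re-validates the doc
}
console.log(`All ${files.length} runbooks updated within ${MAX_AGE_DAYS} days.`);
```

Alert-threshold reviews and escalation-path checks can be wired up the same way, so each drill's action items turn into guardrails rather than tickets.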
Lessons Learned from Skipping the Drill
Our painful outage crystallized a few lessons that I believe every engineering team should hear:
Runbooks rot without rehearsal.
Documentation is like gym equipment: unused, it gathers dust.
Chaos finds the weakest link.
In our case, it was Redis failover. For your team, it might be TLS certs, DNS, or IAM policies.
Incidents are people problems, not just system problems.
Miscommunication burned more minutes than Redis failover did.
Game-day drills are cheaper than real outages.
Our 90-minute outage cost us more in lost revenue and goodwill than a year of game-day drills would have.
Building a Culture of Resilience
Instituting game-day drills wasn’t just about process—it was about culture. Initially, engineers grumbled:
“This feels like busywork.”
“We don’t have time for pretend outages.”
But over time, attitudes shifted. After a few drills, engineers began to see the value:
Confidence grew.
Communication became smoother.
New hires learned faster.
Eventually, game-day drills became part of our DevOps DNA. They weren’t a chore; they were a competitive advantage.
Broader Industry Examples
We’re not alone in this realization. Some of the world’s top engineering organizations bake game-day drills into their practice:
Netflix’s Chaos Monkey randomly terminates production instances to ensure resilience.
Google SREs run DiRT (Disaster Recovery Testing), simulating outages to validate response.
Amazon is known for “Game Day Exercises” where teams intentionally break services.
If billion-dollar platforms with world-class engineering talent rely on drills, why should smaller teams think they’re exempt?
Practical Takeaways for DevOps Teams
If you take only a few things from this story, let it be these:
Don’t wait for chaos to test resilience. Practice it intentionally.
Game-day drills are about people as much as systems. Test communication, roles, and decision-making.
Start small, iterate fast. You don’t need a full chaos engineering platform to begin. Kill a pod. Rotate a cert. Pull the plug—safely.
Document relentlessly. Every drill should produce updated runbooks, playbooks, and lessons learned.
Make it cultural. Treat drills as part of engineering craftsmanship, not optional overhead.
Closing Thoughts
When we skipped our scheduled game-day drill, chaos gave us one anyway—on its own terms. It cost us money, sleep, and customer trust.
But it also forced us to grow. We became more disciplined, more communicative, and more resilient. Today, game-day drills are no longer negotiable for us. They are a pillar of our engineering practice.
The truth is simple:
Outages are inevitable.
Preparedness is optional.
And the only thing more expensive than running a game-day drill is not running one.
So ask yourself: when chaos comes calling, will your team be rehearsed, or will it be improvising?

