Alec Moloney • 5 minute read

The Incident No One Declared

Summary: Explore a gripping account of managing a major incident, detailing the challenges faced, lessons learned, and strategies implemented to ensure better responses in the future. Ideal for leaders in incident management and operational excellence.

This article is from our Tales from the Trenches collection. Hear first-hand experiences from leaders and operators as they face challenging customers, leadership scenarios, incidents, and more.

It’s 7:30am. You’ve just woken up and reached for your phone to find a major issue in full swing. It’s been burning for over an hour, but no incident has been declared. A single team member is fighting the blaze, the customer is asking for a call, and that sole responder is about to clock off for the day.

Beyond getting to your desk as fast as possible, what’s the right next move? This was the conundrum I faced several years ago for one of our largest accounts on their biggest day of the year.

Declaring the incident

In hindsight, I was incredibly lucky. By the time I had landed in my home office and written a calm response in Slack to the initial responder, the incident had stabilised. This let me declare the incident and step in as Incident Commander, before comfortably replying to the customer that the immediate impact had passed (they never replied about that call). From here I began the process of untangling the events that had occurred overnight.

Given that the incident had stabilised, I was able to start with a dissection of the Zendesk tickets (some 40-odd replies and private notes) and the Slack communications between us and the customer.

To build a clear picture of “what went wrong”, and to gain confidence that it isn’t about to happen again, I generally start by researching to establish a baseline: what was supposed to happen? In this case, the customer had let us know ahead of the day that they were expecting a significant increase in traffic to their service and would need an infrastructure change to accommodate it.

A couple of days before, we’d run a relatively comprehensive series of scaling tests to ensure we’d be able to cope, so in theory everything should have been fine. That pointed to one of two possibilities: a problem with how we executed the plan on the day, or an unknown factor affecting how the infrastructure behaved under real load.

Establishing a timeline

Once I’d built an understanding of what was meant to happen, I started building a timeline of what actually happened. I worked through a combination of our customer replies and the observability metrics in Grafana, looking for correlations between our communications and infrastructure events.

Grafana was useful but painted a rather confusing tale. When the service failed, I could see 4xx (client error) response codes on the load balancer instead of the 5xx (server error) codes I expected, followed by a spike of 2xx (success) responses as the service began to recover. That 2xx spike gave us a false sense of hope early on and led us to communicate, erroneously, that no data had been lost; with only a single responder, no one was sanity-checking our comms.
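If you ever need to rebuild this kind of timeline outside of your dashboards, a rough sketch like the one below shows the general idea: bucket load balancer access-log entries into per-minute counts by status class. It assumes a CSV export with a timestamp and status code per request; the file name and column names are hypothetical stand-ins, not a reflection of our actual setup.

```python
# Rough sketch: bucket load balancer access-log entries into per-minute
# counts of 2xx/4xx/5xx responses to build an incident timeline.
# Assumes a CSV export with ISO 8601 "timestamp" and "status" columns;
# the file name and column names are hypothetical.
import csv
from collections import Counter, defaultdict
from datetime import datetime

buckets = defaultdict(Counter)  # minute -> counts per status class

with open("lb_access_log.csv", newline="") as f:
    for row in csv.DictReader(f):
        ts = datetime.fromisoformat(row["timestamp"])
        minute = ts.replace(second=0, microsecond=0)
        status_class = row["status"][0] + "xx"  # e.g. "503" -> "5xx"
        buckets[minute][status_class] += 1

# Print the timeline so it can be lined up against comms timestamps.
for minute in sorted(buckets):
    print(minute.isoformat(), dict(buckets[minute]))
```

Lining those per-minute counts up against the timestamps on your customer replies makes a misleading 2xx spike (and a premature “no data lost” message) much easier to spot.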

As the picture emerged, I had several key questions that I needed to answer:

  • Did we execute the plan as intended against the timeline and scaling marks?
  • What stopped us from adjusting our scaling strategy in response to the incident?
  • What was the true nature of the 2xx spike: request retries or a natural traffic surge?
  • What stopped us from declaring a major incident sooner?

These four key questions became the focal point of my investigation as I pushed on until my boss at the time came online. After a long debrief call on Zoom (by which point I had reasonably confident answers to all four), the next stage of the response began.

Coordinating cross-functionally

When you have a major incident on a major account, you end up with a rather crowded room (metaphorically, of course, because this was all in Slack). As Incident Commander, my role shifted swiftly from investigator to coordinator as other regions came online throughout the afternoon and evening. My focus here spanned several areas:

  • working with our engineering and SRE teams to detail why the alternative scaling approach was ultimately infeasible, assess the chances of recovering the failed requests, and establish a higher level of confidence around that pesky 2xx spike,
  • communicating the current status of investigations and immediate outcomes to the account team (largely through an Incident Status document, while vigilantly reminding people not to send comms unless they had been approved),
  • setting up a post-incident review (PIR) to collaboratively dissect and document the incident, lessons learnt, and action items to stop the issue from recurring,
  • and liaising with leadership to set up a customer-facing meeting and talking points.

Most of the details were squared away in Slack ahead of the PIR, and I spent my early evening drafting the postmortem based on my research from earlier in the day and the details I’d gathered from the team so far.

After the PIR, which ran for a cosy hour and 45 minutes, everyone walked out of the room looking a bit gloomy, but we had a solid set of takeaway actions to make sure what happened could never happen again. It also gave us very clear talking points for the customer meeting a couple of hours later.

Turning resolution into learning

Sitting inside both the response team and the Support team as a leader gave me a unique vantage point on the cause, fallout, and operational outcomes of this incident.

A key takeaway for our team was that we had left everything in the hands of a single responder from the get-go. This, paired with a hesitation to declare an incident sooner, gave us no avenue for wider visibility and no way to rein things in before they got out of hand. We mitigated this the same way airlines do: by adding a co-pilot for significant infrastructure changes and having an SRE verify the plan ahead of time.

While it was a notable bump in the road for the account, we recovered from it. We walked away not only knowing more, but also with a shared appreciation of high-impact incidents and how to handle them as an organisation in the future.

We also took the chance to strengthen cross-functional ties between Support and SRE, develop scenario-specific playbooks for high-impact customer events, and roll out education across the company to reduce hesitancy around declaring incidents.

While major incidents aren’t fun, they expose what’s working and what isn’t. To keep your team supported and minimise risk, make sure they’re trained to:

  • Declare incidents early.
  • Work together to form shared theories and manage workload.
  • Apply checks and balances to customer comms.
  • Take tangible action to enshrine learnings.

Ultimately, when incidents are embraced, they help you work better together as a team, accelerating improvement and delivering better outcomes for your employees and customers.

About the author
Alec Moloney is a support and operations leader exploring how teams scale with clarity, empathy, and intent. He contributes articles to Customer Support Leaders and helps shape its evolving platform.