Three Guiding Lights on Sustaining Resilience
I was recently invited to speak on an internal panel about reliability and resilience culture at a big tech company. Honestly, those are the kinds of rooms I love being in: rooms where people want to talk about more than just tools and metrics.
Toward the end of the session, someone asked a question I really liked:
“If we had to choose just three things to sustain a resilient, healthy reliability culture, what would they be?”
The question helped me summarize what I wanted them to take away from the session, and I thought I’d write it down here too.
When things inevitably get messy, clarity about what matters most is incredibly valuable.
1. Know what matters to your users, and make it really visible
Reliability, at its simplest, means a system does what’s expected of it when it’s needed. That means reliability only matters in terms of what your users care about, and when they care about it.
When I was working at an online supermarket during the height of the pandemic, we experienced this in a clear and pretty intense way.
Every morning at 9 a.m. sharp, traffic would spike like clockwork for exactly seven minutes. We knew it was coming, because we had notified our users about it ourselves; our physical capacity was limited at the time. You probably remember how important it was to be able to order groceries online during the pandemic. But it turned out we weren’t as ready as we thought we were. At least not at first.
In those seven short minutes, thousands of people were trying to place their grocery orders. The site started to stall very quickly, and fires started spreading through our infrastructure. We weren’t just dealing with a slow site; we were, of course, failing our users in a moment that mattered. Perhaps even more than in normal times.
So we started adapting our systems specifically for those seven minutes. Even though we had already scaled up, we needed to do a lot more resilience engineering work to make sure our systems could handle the load during that window. The rest of the day didn’t matter nearly as much, but it gave us the time to make changes for the next morning and to keep testing against real traffic.
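To make the idea concrete, here’s a minimal, hypothetical sketch of one common way to handle a known, recurring spike: scale up shortly before the window opens so capacity is already warm, and scale back down afterwards. The times, replica counts, and the `desired_replicas` helper are illustrative assumptions, not a description of our actual setup.

```python
import datetime

# Hypothetical sketch: pre-scale ahead of a known, recurring traffic window.
PEAK_START = datetime.time(9, 0)          # the spike begins at 9 a.m.
PEAK_END = datetime.time(9, 7)            # ...and lasts roughly seven minutes
WARM_UP = datetime.timedelta(minutes=10)  # start scaling before the window opens


def desired_replicas(now: datetime.datetime, baseline: int, peak: int) -> int:
    """Return how many replicas we want right now.

    Capacity is raised *before* the window so it is warm when the spike hits,
    and dropped back to baseline once the window has passed.
    """
    start = datetime.datetime.combine(now.date(), PEAK_START)
    end = datetime.datetime.combine(now.date(), PEAK_END)
    if start - WARM_UP <= now <= end:
        return peak
    return baseline


# A few minutes before nine, we are already at peak capacity.
print(desired_replicas(datetime.datetime(2020, 4, 1, 8, 55), baseline=4, peak=40))  # -> 40
```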
Knowing (and updating, when needed) your critical user journeys is the foundation of reliability work, because it makes user needs impossible to ignore. Those needs are why we develop our products in the first place.
Knowing them is great, but we also need them to be painfully visible to everyone. Everyone in our organization deserves to understand the real impact of their work, and to know where to put their attention when prioritization is needed. That understanding should not be buried in dashboards or hidden in metrics. It should live somewhere everyone, from product designer to intern, can see and understand it.
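As a toy illustration of what “visible” could mean in practice, here’s a hedged sketch: critical user journeys declared in one place, with plain-language descriptions and targets, plus a status line anyone can read. The journey names, targets, and helper function are all hypothetical, not taken from any real system.

```python
# Hypothetical sketch: critical user journeys declared once, in plain language,
# so anyone from product designer to intern can read what matters and why.
CRITICAL_USER_JOURNEYS = {
    "place_order": {
        "description": "A customer can add items to a basket and check out",
        "availability_target": 0.999,
    },
    "book_delivery_slot": {
        "description": "A customer can reserve a delivery slot",
        "availability_target": 0.995,
    },
}


def journey_status(journey: str, current_availability: float) -> str:
    """Render a human-readable status line for one critical user journey."""
    cuj = CRITICAL_USER_JOURNEYS[journey]
    ok = current_availability >= cuj["availability_target"]
    marker = "OK" if ok else "AT RISK"
    return (
        f"[{marker}] {cuj['description']}: "
        f"{current_availability:.3%} (target {cuj['availability_target']:.1%})"
    )


print(journey_status("place_order", 0.9982))
# -> [AT RISK] A customer can add items to a basket and check out: 99.820% (target 99.9%)
```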
When that visibility is built into our systems and our culture, reliability becomes something everyone can actually act on, and not just talk about.
2. Create psychological safety around failure
There’s plenty of research showing the link between psychological safety and high-performing teams (e.g., Google’s Project Aristotle). And I believe this becomes especially critical around failure.
Incidents are stressful. They throw off your week, your priorities, and sometimes your sleep. But they’re also incredibly human. They create moments where people show up for each other in unexpected ways, and they hold space for a real sense of togetherness and collaboration.
At my latest company (about 5,000 people), we had a major outage that lasted nearly three days. It was long and messy. We were challenged both organizationally and technically in ways we could not have imagined or prepared for.
But I’ll never forget when someone wrote afterward in Slack:
“Best team-building experience ever.”
This was not sarcasm; they really meant it.
There was something about being in it together, solving something pretty substantial, without fear, that made it meaningful and worth it.
I get it though. Measuring the “success” of learning from incidents can feel complicated. We want numbers that are easy to calculate. But honestly, I don’t think they tell us much about how people actually feel.
When we focus too much on metrics like mean time to resolution, or push teams to keep incident numbers low every quarter, we risk building a culture of fear. (You might like Štěpán Davidovič’s paper on MTTR for further reading.)
I think metrics should always be a starting point, an invitation to look deeper, not a target to hit. And we know this: if the system punishes failure, people will find ways to hide it.
What if we measured how supported and clear people felt during incidents? How prepared our teams were under pressure?
What if we celebrated our togetherness and creative problem-solving? Perhaps allowed ourselves to feel frustrated by failure, not to assign blame, but to fuel our drive to learn.
The safer your teams feel, the more resilient and reliable your systems will become.
And often, the simplest way to know is just to ask: How did it feel to be part of that incident or incident review?
3. Let incidents update your mental models
Our systems evolve all the time. New features get shipped, teams restructure. But our mental models of how everything fits together often lag behind. And I think that’s expected.
When something breaks, we rely heavily on those mental models. We don’t have time to check every diagram or document. We need to make decisions quickly, based on what we believe is true about how the system behaves. So: can we use incidents to update our understanding?
At my last company, our incident report template started with an “overview” section. We created high-level diagrams as part of every report. These showed the affected services and how they interacted. If you’ve ever tried explaining a system to someone new, you know how powerful a simple sketch can be. Useful for the author. Useful for the reader.
Over time, those diagrams became incredibly handy. New hires used them to get oriented. Teams referenced them to understand upstream and downstream effects. And because they were updated after every incident, they reflected the actual messiness and changes we had just navigated.
Incidents are more than things to “get through.” They’re signals. They show us where our models need updating, where our assumptions break, and where we might need to shift priorities next.
If you find yourself asking, “Are we actually learning from this?”, check whether your understanding has changed. Do you have a new version of how your systems behave? Or of how your organization responds to triggers? That kind of questioning opens the door to deeper understanding. It forces us to adjust how we work, and how we design the systems we rely on.
Over time, you’ll notice a shift.
You’ll catch yourself mentioning—and celebrating—how something didn’t get worse this time, because the failure modes you saw last time shaped how you designed your system to adapt better.
A little more prepared.
A little more resilient.
Closing thoughts
These three ideas aren’t quick fixes or instant improvements; they’re ways of thinking and working that I find meaningful and impactful. They help remind us what’s genuinely important, beyond immediate metrics or short-term goals.
At its core, resilience is deeply human: how we intervene in the systems we build, how we collaborate, and how we navigate challenges together. All of that shapes the resilience of both our systems and our teams.
It’s not always easy. And it won’t ever be perfect.
But it can be deeply human.
And that makes it worth doing.