You Can’t Debug a System by Blaming a Person
“I understand why we need to be blameless, but I have this person in my team who is often reckless. How can I not blame them when their actions continuously make things worse?”
Someone asked me this at the SRE meetup, right after my talk on incidents. Since then I’ve been thinking about it, because it surfaces a concern many people might have.
Underneath that question, I actually hear two others:
If we say we are blameless, do we lose accountability?
Do we silently tolerate behavior that feels unsafe in practice?
That’s what I want to explore in this article. I think the point is not blame itself, but what blame does to our work.
How blame stops your debugging
I want to start by stepping away from incidents for a second.
Let’s say you're in the kitchen with a friend. You’re cooking together, chatting, chopping, stirring. At some point, your friend cuts their finger while chopping onions. The first thing you do is obvious: mitigate the impact. You grab a paper towel, help them rinse the cut, and find a band-aid.
Now imagine that right after you put on the band-aid, you say a version of: “Well, be a little more careful with that knife.”
In that very moment, you close an important connection, because you’ve decided the “cause” is that they weren’t careful enough. As a result, you may never find out that the knife has a small notch in the blade that makes it slip sometimes, or that your friend’s attention was low because of a hard conversation earlier, or that you were both rushing because the pan was already hot and you wanted to get the onions in before they burned.
You might only see some of that later, when it happens to you.
“Just be more careful” feels like an answer, but it trades a simple story for a chance to understand what’s really going on.
When “human error” is as far as you go
Now obviously, the little kitchen scene is very simple, but a version of this shows up in incident work all the time when we accept “human error” as the cause.
Let’s imagine Joe deploys a change that puts a system into a bad state and impacts customers. The obvious cause seems to be “Joe didn’t test properly.”
Notice what we lose when we stop there:
How tests are actually planned and used in this team
Whether the test suite runs fast enough to give people both pace and confidence, or whether in practice they mostly run parts of it
What Joe thought was true when he hit deploy, and which conditions told him “this is probably fine”
How deploys are set up, and what kind of feedback they give before and after changes go live
How easy it is to roll back when something looks off
Joe is also more stressed now. He already was, because most of us are pretty good at blaming ourselves after causing an adverse impact. He may now hesitate even more before deploying, his team might avoid owning changes that are even slightly risky, or someone else might do exactly the same thing a little later with an even worse outcome (complex systems are messy).
The more you do this (because it feels like “easy” troubleshooting), the more you build systems that depend on humans being constantly careful and alert. Think of the endless low-severity alerts in Slack, or the 20-step runbooks you are expected to run at 3 AM.
As a result, your organization never really learns from these events, and people end up babysitting mediocre systems in a constant, low-level state of vigilance or, worse, quiet ignorance.
And yet, you hired Joe because he is a highly capable engineer, and with this you end up doing the opposite of enabling good engineering. You lose a lot more.
If Joe and your organization look at this incident with curiosity and debug it as an event in a complex socio-technical system, something more interesting becomes possible: you can continuously improve the system in which this decision lives, for all of you, and build more resilience and capacity from that insight.
That might look like:
making it easier to see test coverage in one place before deploying,
improving the deploy pipeline to include better safety checks,
designing for quicker rollback when indicators move in a worrying direction and so on.
In other words: you engineer safety and resilience into your systems. This way, Joe and everyone involved become empowered engineers who have learned about a new class of triggers, and who can help shape the system in ways that go beyond the change he made.
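To make that a bit more concrete, here is a minimal sketch of what “designing for quicker rollback when indicators move in a worrying direction” could look like. It is only an illustration: the error-rate metric, the rollback command, and the thresholds are hypothetical placeholders, not a prescription for any particular stack.

```python
# A minimal, hypothetical sketch of a post-deploy safety gate.
# The metric source and rollback command are stand-ins; in a real
# pipeline they would call your monitoring API and deploy tooling.
import time
from typing import Callable


def watch_and_rollback(
    get_error_rate: Callable[[], float],  # e.g. reads from your metrics backend
    rollback: Callable[[], None],         # e.g. redeploys the previous release
    threshold: float = 0.05,              # error rate considered "worrying"
    checks: int = 10,                     # how many times to look after deploy
    interval_s: float = 30.0,             # seconds between checks
) -> bool:
    """Watch a key indicator after a deploy; roll back if it crosses the threshold.

    Returns True if the deploy stayed healthy, False if a rollback was triggered.
    """
    for _ in range(checks):
        if get_error_rate() > threshold:
            # The system reacts, instead of relying on a human noticing at 3 AM.
            rollback()
            return False
        time.sleep(interval_s)
    return True


if __name__ == "__main__":
    # Simulated run: the error rate spikes on the third check.
    samples = iter([0.01, 0.02, 0.12])
    healthy = watch_and_rollback(
        get_error_rate=lambda: next(samples),
        rollback=lambda: print("rolling back to previous release"),
        interval_s=0.0,  # no waiting in this simulation
    )
    print("deploy healthy:", healthy)
```

The exact script doesn’t matter; what matters is that the system itself watches the indicator and reacts, so Joe’s constant carefulness stops being the only safety mechanism.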
In a way, that’s really Joe’s job, and all of ours, isn’t it?
Blameless ≠ silence about human actions
Blameless also doesn’t mean you stop talking about what people did.
In post-incident work – reviews, reports, debriefs – people need to contribute and talk about what happened (even though that’s much harder if blame is already in the air). These conversations are a crucial part of debugging and understanding the bigger picture of your complex systems.
The difference is that you treat those actions as clues, as data points rather than verdicts. (I love this old post from Lorin, too.) You want to understand how an action made sense at the time (and why it didn’t later), so you can see more of how your system actually behaves under the kinds of triggers many of you hit every day.
You are inevitably part of the complex systems you build. You interact with them all the time. Sometimes, those interactions become part of a bigger story, like an incident. That’s the gold mine of the event: if you keep debugging it instead of stopping at “human error”, you can turn what you learn back into better systems.
When you see a “reckless” pattern
Let’s come back to the original question from the meetup: what about the person in your team who really does seem “reckless” to you?
This obviously needs a lot more context, so instead of stopping at “reckless”, I will leave you with a few questions to debug further:
What conditions are they working in when this happens? (time pressure, unclear ownership, missing support, etc.)
What kind of power or access do they have? (For example: can every engineer write directly to the production database? Can anyone bypass guardrails easily?)
If it’s mostly under incident stress, what is it about how we run incidents that might push people into “fight or flight”?
These questions are aimed at the joint system you all live in: humans + tools/systems + roles + pressures, not at someone’s personality.
If, after looking at that context, you still see a worrying pattern in everyday work that impacts the team culture, that’s a signal for follow-up outside the incident space. Name what you’re noticing, explain how it is impacting your team, be clear about expectations, and offer support for change.
Incidents as an honest reflection of how work is “done”
One thing I believe deeply is that incidents are one of the most honest places to see how your organization already functions.
If you look with curiosity, you will see how people communicate when things are unclear, who is able to step in (and why), where responsibilities are fuzzy, where people are improvising around gaps, how much psychological safety there is in the team, and how understandable your systems actually are.
You’ll also see how much your teams are doing despite a lack of clarity or support, continuously adapting to keep your systems up and running.
If you keep the incident space as a learning space, part of “work as usual”, you get access to all of this. People feel safer doing honest engineering, and they’re more able to build resilient systems and engineering cultures.
If this resonates and you want to explore the safety science behind it, here are a few resources:
The Field Guide to Understanding “Human Error” – Sidney Dekker
Engineering a Safer World – Nancy G. Leveson
How Complex Systems Fail – Richard Cook (short, free paper)
The Varieties of Human Work (Work as Done vs Work as Imagined) – Steven Shorrock
If you’d like to go deeper on this with your team and think I might be able to help, you’re welcome to get in touch for an intro call.