The security incident at Symantec. The Volkswagen diesel emissions scandal. The market rigging allegations at Citigroup. In each of these cases, employees were singled out as the root causes of the organization’s failures. Yet given the operational scale, complexity and pace of these companies, is it viable — or beneficial — to put the onus on specific individuals?
Not at all, according to Dave Zwieback, the Head of Engineering at music analytics company Next Big Sound (acquired by Pandora) and author of Beyond Blame: Learning From Failure and Success. Zwieback has spent decades leading engineers in companies operating in high-stakes, pressure-cooker industries, such as technology, financial services and pharma, where it’s commonplace for someone to take the fall when critical systems fail.
In this interview, Zwieback deconstructs why it’s so tempting to blame a person or team for every mishap. To counter this habit, he outlines principles and tactics to help dynamic companies shift from identifying culprits to making improvements that matter. Any fast-growing company that seeks to adapt with real accountability and honesty will gain from Zwieback’s methodology to prioritize resilience over punishment.
When it comes to the importance of banishing blame from the organization, Zwieback likes to invoke technology journalist David Kirkpatrick. To paraphrase him: In a world consumed by software, your company is now a learning company, regardless of what industry you’re in. Pretending that it’s not spells serious peril. Zwieback’s counsel is meant to protect startups from this trap, by reprogramming their responses to setbacks and evolving traditional postmortems into “learning reviews.”
For all the talk of disruption, companies frequently default to the path of least resistance when something goes awry. “Blame and biases — such as hindsight bias — give us a really convenient story about what happened in any negative situation. Cognitive science research shows us that to the extent that a story feels comfortable, we believe that it's true,” says Zwieback. “The reality is that when we get to that comfortable story we stop learning — we say ‘Oh, that explains it.’ This short-circuits the learning process to alleviate the discomfort of dealing with the complexity of systems — and organizations — that we’re building. That may be a short-term relief, but it’s a compounding expense over the long run because we’re not addressing areas of fragility.”
The only way to extract a full account of what happened is to remove blame and punishment from our retrospectives.
The way to reset that hardwiring is to remove fear and minimize biases that creep in during the investigation of failures. “Say there’s an incident and five minutes into the postmortem, we find out what happened and who’s responsible: Bobby and Susan screwed up. That feels good because there’s an unambiguous explanation: the so-called ‘root cause’. In this case, we’ve found our ‘bad apples,’ and can deal with them punitively so that such failures will never happen again. We may even feel better about our company culture and our colleagues if the individuals accept the blame and own up to what they did to ‘cause’ the incident,” says Zwieback.
The truth is that the most critical learning has been left on the table because we’ve overlooked the deeper context of the incident. “If we remove Bobby and Susan from the equation, could we be sure that the incident would never happen again? No. In all likelihood, the conditions that contributed to the negative outcome — that fragility that was necessary for the incident to occur — are still there. What’s worse, the two folks who know the most about these conditions — Bobby and Susan — are no longer there to help learn from this incident, and make the system more resilient. And others who have information that could materially improve future outcomes now have even more reasons to withhold it because they don’t want to be punished. That’s the problem with blame and punishment,” says Zwieback. “The recent Volkswagen incident is a perfect example of an effort to stop the bleeding with blame shifting: ‘This was a couple of software engineers who put [the cheating software] in for whatever reason,’ Michael Horn, VW's U.S. chief executive, told a House subcommittee hearing. ‘To my understanding, this was not a corporate decision. This was something individuals did.’”
In the case of VW, the real story is far more complex and uncomfortable than the ‘few bad apples’ fairy tale. “You’ve fired the engineers to show everyone that you’re taking this seriously, that something is being swiftly done to address the failure. But you haven’t actually figured out how these people were able to do what they did, and can’t guarantee that something similar won’t happen in the future. You haven’t answered the tough questions: What tradeoffs were the engineers making? How did their actions make sense to them at the time? Were they under pressure from management or project timelines? How were the engineers (and their bosses) incentivized? What deeper organizational issues may have contributed to this outcome? Without answering these questions, you’ve learned almost nothing from this incident, and have little chance of improving. That’s the crux of the issue,” says Zwieback.
If we want to make our systems and companies more resilient, we must not wish away failure, but learn and improve from it.
The only way to establish real accountability is to move past blame and punishment on an organizational level. “This is a difficult task, but if you want people to be able to provide the full account of what happened — which is the very definition of accountability — you need to remove blame and punishment from the equation. Otherwise, folks are naturally disincentivized to share what they know. In most cases where something goes wrong, they won’t disclose the full situation, will omit critical information, and may even go as far as to remove evidence of their activity. You might say ‘Oh, that’ll never happen in my organization, we hire responsible adults!’ But if people’s jobs or pay are on the line when they make mistakes, I can guarantee you that some level of avoidance of accountability is happening in your organization.”
To encourage honesty, we need safety. “It’s impossible to learn without all the data, without the full account of what happened. This is why information is more important than punishment,” says Zwieback. “It's counterintuitive but we can take a page from our justice system, which is quite concerned with punishment, for examples of granting people immunity in exchange for information. Unlike people in our organizations, these individuals have often committed heinous crimes. Why does the system let them get away with murder? Because in these cases, the information that they provide can have a much bigger impact than any punishment that is doled out.”
Choose reconciliation over retribution when something goes wrong. You’re less likely to lose people and lessons.
By shifting the role of the individual in an incident from suspect to witness, the process of learning what happened becomes inclusive and far richer. We can now discover weak links in our processes and organizations. “In all my years, I’ve learned that, by and large, people show up to work to do a good job. Sometimes it doesn’t go well,” says Zwieback. “In the context of restorative justice, we think of an account as something to provide rather than something to settle. Then you’re able to go beyond blaming Bobby or Susan and towards evaluating the conditions that were necessary for an incident to happen. Only then do we have a chance to address the underlying fragility.”
The first step to banishing blame to make your startup more resilient is to swap out the traditional postmortem. Instead, Zwieback and his partner Yulia Sheynkman have developed the following three-step framework to help companies institute learning reviews. Here’s how:
At many companies, postmortems are only conducted at the end of projects in order to outline and analyze the factors that contributed to their failure. “In theory, a postmortem has a lot of redeeming elements — reflection, examination and evaluation — but, in practice, it often devolves into an unhelpful process,” says Zwieback. “First, most people wait for failures to conduct them. Then, when they do them, they are laser-focused on finding a root cause, which, besides being an illusion, is more often than not attributed to a person. When they do identify the cause, they stop the analysis.”
Learning reviews can be conducted after each experiment or iteration and are designed to facilitate learning from both failures and successes. “If we only wait for death and destruction — as the macabre ‘postmortem’ implies — we are grossly limiting our opportunities to learn. Failures just don't happen frequently enough to learn at the rate that’s needed to really thrive in technology,” says Zwieback. “Mostly things are neutral or go reasonably well. We also want to be constantly learning why things went well. Let's figure out what contributed to this really successful iteration so we can feed the learning back into our organization and systems, and make subsequent iterations even better.”
Most importantly, remind your team over and over again that they’re part of a learning organization. In a subtle way, calling for a learning review (instead of a postmortem) primes people to focus on the desired outcome, namely, learning. But the real work comes from building trust over time, by repeatedly focusing on the context of the incident versus the culprit. “You can't have full accountability with blame. Remind your people that you are all operating within complex systems,” says Zwieback. “The way they function and fail is often unpredictable. One of the signs that you’re working with a complex system is you do something today and it has one outcome and then you do the same thing tomorrow and it has a different outcome.”
Here are Zwieback’s and Sheynkman’s tenets (with additional reading) to help keep your company focused on setting the context to maximize learning:
The purpose of the learning review is to learn so that we can improve our systems and organizations. No one will be blamed, shamed, demoted, fired, or punished in any way for providing a full account of what happened. Going beyond blame and punishment is the only way to gather full accounts of what happened—to fully hold people accountable.
We’re likely working within complex, adaptive systems, and thus cannot apply the simplistic, linear, cause-and-effect models to investigating trouble within such systems. (See A Leader's Framework for Decision Making by David J. Snowden and Mary E. Boone)
Failure is a normal part of the functioning of complex systems. All systems fail—it’s just a matter of time. (See How Complex Systems Fail by Richard I. Cook, MD.)
We seek not only to understand the few things that go wrong, but also the many things that go right, in order to make our systems more resilient. (See From Safety-I to Safety-II: A White Paper by Erik Hollnagel, et al.)
The root cause for both the functioning and malfunctions in all complex systems is impermanence (i.e., the fact that all systems are changeable by nature). Knowing the root cause, we no longer seek it, and instead look for the many conditions that allowed a particular situation to manifest. We accept that not all conditions are knowable or fixable.
Human error is a symptom—never the cause—of trouble deeper within the system (e.g., the organization). We accept that no person wants to do a bad job, and we reject the “few bad apples” theory. We seek to understand why it made sense for people to do what they did, given the information they had at the time. (See The Field Guide to Understanding Human Error by Sidney Dekker)
While conducting the learning review, we will fall under the influence of cognitive biases. The most common ones are hindsight, outcome, and availability biases; and fundamental attribution error. We may not notice that we’re under the influence, so we request help from participants in becoming aware of biases during the review. (See Thinking, Fast and Slow by Daniel Kahneman)
A timeline should represent an account of what happened as determined by the people who were involved and impacted. “If it's a really big failure or success, create a timeline with input from as many people as possible, representing diverse points of view, including marketing, PR, HR, engineering and other teams,” says Zwieback. “Although in technology companies the task of facilitating, consolidating and assembling those diverse perspectives into a timeline often falls on someone in IT or Engineering, with some training, anyone in the organization can do it. A good timeline shows not just what happened, but serves as the backbone of the conversation — a reference point — to keep the review on track.”
The most powerful feature of a well-constructed timeline is its ability to help transport those involved back in time. “The timeline should not reflect what happened from the biased perspective of the present. Build it to capture what people were thinking at the time it was happening,” says Zwieback. “This is how to surface biases, specifically hindsight bias. We tend to look back at an incident and it seems obvious what we could, would or should have done, but didn't. But at the time, that was not apparent.”
Follow these three principles to construct a robust timeline:
Ask each individual involved to share what they knew, when they knew it and how they knew it. There are three key points to emphasize: 1) tell them to describe what happened without explaining; 2) remind them to timestamp (or estimate) when they knew what they knew; and 3) systematically ask how they made sense of the incident at each stage.
Encourage and protect diverse points of view. The more varied points of view that the facilitator can collect, the fuller the picture of the incident. Honor divergent and dissenting opinions by verbally asking for and acknowledging them.
Reframe the facilitator's role. Given that the facilitator is frequently the leader or a senior member of the team, it’s likely that she’s already coming to the table with a mix of inputs — and biases — from her people. Her role must be clear and distinctly different: to listen in order to discover, and to verify by synthesizing.
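The three principles above can be sketched as a simple data structure: each participant contributes entries recording what they knew, when, and how they made sense of it, and the facilitator merges all perspectives into one chronological account. This is a minimal, hypothetical illustration; the field names and example data are assumptions, not part of Zwieback's framework.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TimelineEntry:
    timestamp: datetime   # when the person knew it (or their best estimate)
    person: str           # whose perspective this entry records
    observation: str      # what happened, described without explaining
    sense_making: str     # how the situation made sense to them at the time

def build_timeline(entries):
    """Merge entries from all participants into one chronological account."""
    return sorted(entries, key=lambda e: e.timestamp)

# Illustrative example: two perspectives on the same incident, interleaved
# by when each person knew what they knew (the details are invented).
timeline = build_timeline([
    TimelineEntry(datetime(2015, 10, 1, 9, 30), "Susan",
                  "Saw the error rate spike on the dashboard",
                  "Assumed it was the usual Monday traffic surge"),
    TimelineEntry(datetime(2015, 10, 1, 9, 5), "Bobby",
                  "Deployed the configuration change",
                  "Believed it had passed checks in staging"),
])
```

Sorting by timestamp rather than by speaker keeps the review anchored to what was known at each moment, which is exactly what makes hindsight bias visible: the merged account shows what each person could actually see at the time.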
Focusing on building the timeline is another method to keep participants working together rather than pointing fingers at each other. “During the learning review, listen for, and help participants be aware of blaming, cognitive biases, and counterfactuals, such as ‘we could have,’ ‘we should have,’ ‘if only’ and ‘we didn’t.’ Use empathy and humor throughout the learning review to defuse tense situations, especially at the beginning.”
Zwieback recommends the following set of questions to help build the timeline and keep the learning review on track:
Did we know this at the time or is it only obvious in hindsight?
When did we learn this fact?
How does knowing the outcome affect our perception of the situation (or the individual involved in the incident)?
We’ve recently had a similar problem. How is this one different?
Can you please describe what happened without explaining (too much)?
Any one of us would have done what Bobby or Susan did. Knowing what they knew then, how did their actions make sense at the time?
What did we do right?
How do we know this?
Question how, not why an incident occurred. Asking how unearths the conditions that contributed to success or failure. Probing why gets you tangled in bias.
Closure is of utmost importance for a learning review. The first step is to determine and prioritize remediation items. “It’s essentially ranking the steps that should be taken to change the conditions that ‘hosted’ the incident in the first place. That may involve paying down some technical debt, creating a new policy, scheduling a quarterly stand-up so managers can flag challenges or tracking a new metric on a dashboard,” says Zwieback. “To keep the learning review focused, these action items can be discussed and prioritized in separate follow-up meetings with the relevant people or teams.”
Lastly, the facilitator must publish the learning review write-up as widely as possible. “Both successes and failures need to become part of institutional memory at any learning organization. I recommend sharing the findings with everyone in the company via e-mail or during all-hands meetings,” says Zwieback. “If the incident negatively impacted people — especially customers — consider using the 3 Rs to structure the writeup. The Rs stand for ‘Regret, Reason, and Remedy’ from Drop the Pink Elephant by Bill McFarlan. It provides a straightforward formula for a meaningful, satisfying apology.”
A company that has especially mastered closing the loop — and philosophy of the learning review — is Etsy. “It's culturally ingrained there. John Allspaw, Etsy’s CTO, who coined the term ‘blameless postmortem,’ has brought research from the field of Human Factors and Systems Safety into the world of large-scale web operations,” says Zwieback. “Failures are celebrated as learning and improvement opportunities. I highly recommend a post on Etsy's 'just culture,' which is the company’s practice of balancing safety and accountability in its work.”
There are enough pressures that can overheat startups: fundraising, recruiting talent, battling market entrants, hitting sales targets. So, when something goes wrong, why blame individuals for failures and accelerate internal combustion? By doing so, companies are at risk of not only alienating their people, but also missing the full context of why a situation happened. Instead, rally the team to get a holistic snapshot of the conditions — not the culprits — that led to an outcome. Couple immunity and accountability to gather more reliable data about incidents. Lastly, produce timelines and write-ups to generate productive institutional memory.
“The failure rates of companies are too high to sabotage your organization from the inside. Startups need to learn fast and adapt to survive, and blame and bias short-circuit any real learning,” says Zwieback. “If the magnitude of a mishap or success is great, you can be sure that the opportunity to learn will be, too. So, you have the option to choose punitive or restorative justice. You may feel resolution by punishing a few people, but you’ve set back the company by not focusing on what will prevent the error from happening again. Failure will happen. Make sure learning does, too.”