United Airlines flight 232 to Philadelphia was being redirected to the small Sioux City airport because of serious mechanical difficulties. It crashed, killing 111 passengers and crew. Fortunately, the large number of emergency workers available and the heroic airmanship of the crew helped make it possible to save 185 onboard. Most of my unit spent the first day of our annual training collecting the dead from the tarmac and the nearby cornfields.
During the flight, the DC-10's tail-mounted engine failed catastrophically, causing the fast-spinning turbine blades to fly out like shrapnel in all directions. The debris from the turbine managed to cut the lines to all three redundant hydraulic systems, making the aircraft nearly uncontrollable. Although the crew was able to guide the aircraft in the direction of the airport by varying the thrust to the two remaining wing-mounted engines, the lack of tail control made a normal landing impossible.
Aviation officials would refer to this as a “one-in-a-billion” event,² and the media repeated this claim. But because mathematical misconceptions are much more common than one in a billion, if someone tells you that something that had just occurred had merely a one-in-a-billion chance of occurrence, you should consider the possibility that they calculated the odds incorrectly.
This event, as may be the case with the recent 737 MAX 8 crashes, was an example of a common mode failure because a single source caused multiple failures. If the failures of the three hydraulic systems were entirely independent of each other, then the failure of all three hydraulic systems in the DC-10 would be extremely unlikely. But because all three hydraulic systems had lines near the tail engine, a single event could damage all of them. The common mode failure wiped out the benefits of redundancy. Likewise, a single software problem may cause failures on multiple 737 MAX 8 aircraft.
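The arithmetic behind that point can be sketched in a few lines. This is only an illustration under stated assumptions: the per-system failure probability and the probability of a shared damaging event are hypothetical numbers chosen for readability, not figures from the DC-10 investigation, and the function names are invented for this sketch.

```python
def prob_all_fail_independent(p_single: float, n_systems: int) -> float:
    """Probability that every redundant system fails when failures are independent."""
    return p_single ** n_systems


def prob_all_fail_with_common_mode(
    p_single: float, n_systems: int, p_common: float
) -> float:
    """Probability that every system fails when one shared event can disable all of them.

    Either the common event occurs (all systems fail together), or it does not
    and the systems must still all fail independently.
    """
    return p_common + (1.0 - p_common) * p_single ** n_systems


# Hypothetical inputs, for illustration only.
P_SINGLE = 1e-3   # assumed chance that one hydraulic system fails on a flight
P_COMMON = 1e-6   # assumed chance of a single event that severs every line
N_SYSTEMS = 3

independent = prob_all_fail_independent(P_SINGLE, N_SYSTEMS)
with_common = prob_all_fail_with_common_mode(P_SINGLE, N_SYSTEMS, P_COMMON)

print(f"Independent failures only: {independent:.3e}")
print(f"With a common mode event:  {with_common:.3e}")
print(f"Ratio:                     {with_common / independent:,.0f}x")
```

With these made-up inputs, the independence assumption yields a figure on the order of one in a billion, while even a one-in-a-million shared cause dominates the total and makes the loss of all three systems roughly a thousand times more likely.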
Now consider that the cracks in the turbine blades of the DC-10 would have been detected except for what the National Transportation Safety Board (NTSB) called “inadequate consideration given to human factors” in the turbine blade inspection process. Is human error more likely than one in a billion? Absolutely. And human error in large complex software systems like those used on the 737 MAX 8 is almost inevitable and takes significant quality control to avoid. In a way, human error was an even-more-common common mode failure in the system.
But the common mode failure hierarchy could be taken even further. Suppose that the risk management method itself was fundamentally flawed. If that were the case, then perhaps problems in design and inspection procedures, whether in hydraulics or software, would be very hard to discover and much more likely to materialize. In effect, flawed risk management is the ultimate common mode failure.
And suppose such methods are flawed not just in one airline but in most organizations. The effects of disasters like Katrina, the financial crisis of 2008/2009, Deepwater Horizon, Fukushima, or even the 737 MAX 8 could be inadequately planned for simply because the methods used to assess the risk were misguided. Ineffective risk management methods that somehow manage to become standard spread this vulnerability to everything they touch.
The ultimate common mode failure would be a failure of the risk management process itself. A weak risk management approach is effectively the biggest risk in the organization.
The financial crisis occurring while I wrote the first edition of this book was another example of a common mode failure that traces its way back to the failure of risk management of firms such as AIG, Lehman Brothers, Bear Stearns, and the federal agencies appointed to oversee them. Previously loose credit practices and overly leveraged positions combined with an economic downturn to create a cascade of loan defaults, tightening credit among institutions, and further economic downturns. Poor risk management methods are used in government and business to make decisions that not only guide risk decisions involving billions—or trillions—of dollars but also are used to affect decisions that impact on human health and safety.
Fortunately, the cost to fix the problem is almost always a fraction of a percent of the size of what is being risked. For example, a more realistic evaluation of risks in a large IT portfolio worth over a hundred million dollars would not have to cost more than a million—probably a lot less. Unfortunately, the adoption of a more rigorous and scientific management of risk is still not widespread. And for major risks, such as those in the previous list, that is a big problem for corporate profits, the economy, public safety, national security, and you.
A NASA scientist once told me how NASA reacts to risk events. If she were driving to work, veered off the road, and ran into a tree, NASA management would develop a class to teach everyone how not to run into that specific tree. In a way, that's how most organizations deal with risk events. They may fix the immediate cause but not address whether the original risk analysis allowed that entire category of flaws to happen in the first place.
KEY DEFINITIONS: RISK MANAGEMENT AND SOME RELATED TERMS
There are numerous topics under the broad term risk management, but the term is often used in a much narrower sense than it should be. This is because risk is used too narrowly, management is used too narrowly, or both. We also need to discuss a few other key terms that will come up a lot and how they fit together with risk management, especially the terms risk assessment, risk analysis, and decision analysis.
If you start looking for definitions of risk, you will find many wordings that add up to the same thing and a few versions that are fundamentally different. For now, I'll skirt some of the deeper philosophical issues about what risk means (yes, there are some, but that will come later) and I'll avoid some of the definitions that seem to be unique to specialized uses. Chapter 6 is devoted to why the definition I am going to propose is preferable to various mutually exclusive alternatives that each have proponents who assume their definition is the “one true” definition.
For now, I'll focus on a definition that, although it contradicts some uses of the term, best represents the one used by well-established, mathematical treatments of the term (e.g., actuarial science), as well as any English dictionary or even how the lay public uses the term.
Long definition: A potential loss, disaster, or other undesirable event measured with probabilities assigned to losses of various magnitudes
Shorter (equivalent) definition: The possibility that something bad could happen
The second definition is more to the point, but the first definition describes a way to quantify a risk. First, we determine a probability that the undesirable event will occur. Then, we need to determine the magnitude of the loss from this event in terms of financial losses, lives lost, and so on.
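As a rough illustration of the long definition, a risk can be written down as a set of probabilities assigned to losses of various magnitudes. The sketch below is a minimal example under stated assumptions: the `LossOutcome` type, the function names, and every number are hypothetical, and the expected loss is only one possible summary of such a table, not a method the text prescribes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LossOutcome:
    """One possible outcome: how likely it is and how much is lost."""
    probability: float
    loss: float  # in any consistent unit, e.g. dollars


def check_probabilities(outcomes: list[LossOutcome]) -> None:
    """Raise if the outcome probabilities do not form a full distribution."""
    total = sum(o.probability for o in outcomes)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        raise ValueError(f"probabilities sum to {total}, expected 1.0")


def expected_loss(outcomes: list[LossOutcome]) -> float:
    """Probability-weighted average loss across all outcomes."""
    return sum(o.probability * o.loss for o in outcomes)


def prob_loss_exceeds(outcomes: list[LossOutcome], threshold: float) -> float:
    """Chance that the loss is strictly greater than the threshold."""
    return sum(o.probability for o in outcomes if o.loss > threshold)


# Hypothetical one-year risk for a single undesirable event.
outcomes = [
    LossOutcome(probability=0.90, loss=0.0),          # event does not occur
    LossOutcome(probability=0.08, loss=100_000.0),    # minor incident
    LossOutcome(probability=0.02, loss=5_000_000.0),  # major incident
]

check_probabilities(outcomes)
print(f"Expected loss: {expected_loss(outcomes):,.0f}")
print(f"P(loss > 1,000,000): {prob_loss_exceeds(outcomes, 1_000_000.0):.2f}")
```

With these invented figures the expected loss is roughly 108,000, almost all of it driven by the rare major incident. That is why the probability of exceeding a large threshold is worth reporting alongside the average.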
The undesirable event could be just about anything, including natural disasters, a major product recall, the default of a major debtor, hackers releasing sensitive customer data, political instability surrounding a foreign office, workplace accidents resulting in injuries, or a pandemic flu virus disrupting supply chains. It could also mean personal misfortunes, such as a car accident on the way to work, loss of a job, a heart attack, and so on. Almost anything that could go wrong is a risk.
Because risk management generally applies to a management process in an organization, I'll focus a bit less on personal risks. Of course, my chance of having a heart attack is an important personal risk to assess and I certainly try to manage that risk. But when I'm talking about the failure of risk management—as the title of this book indicates—I'm not really focusing on whether individuals couldn't do a better job of managing personal risks like losing weight to avoid heart attacks. I'm referring to major organizations that have adopted what is ostensibly some sort of formal risk management approach that they use to make critical business and public policy decisions.