Features | From Pivot Magazine

We live in the golden age of meltdowns 

The systems that run our banks, airlines, hospitals and other vital services are in danger of collapsing at any moment. How do we avoid catastrophe? 


Airlines have grounded planes for days because of problems with their reservation systems, says author Chris Clearfield. “Can’t they just use paper tickets? No. There’s no roomful of people typing tickets on typewriters anymore.” (Photo by Getty)

Eleven years ago in Washington, D.C., Metro train 112—travelling along a track wired so that operators would know precisely where all rolling stock was located—crashed into another, effectively invisible train. Nine people died. Three years later, in 2012, the stock trading system at Knight Capital, a Wall Street financial services company, went into what could charitably be described as psychosis and executed four million nonsensical trades in 45 minutes. Overnight, the company lost more than half a billion dollars; within a year, it no longer existed as an independent firm. And in 2015, Washington state announced that the criminal tracking system used by its corrections department had been miscalculating prisoner release dates for 12 years, resulting in nearly 3,200 felons being released too early. By the time the state realized its mistake, two of the prisoners had been charged with violent crimes committed while they should have been in jail. 

Isolated incidents, yes, but all of them linked by common DNA, according to Chris Clearfield and András Tilcsik, co-authors of Meltdown: What Plane Crashes, Oil Spills, and Dumb Business Decisions Can Teach Us About How to Succeed at Work and at Home, which won the 2019 National Business Book Award. You can blame these disasters—and the many others that show up in the pages of Meltdown—on any number of factors: the Internet of Things, the continuous and almost organic layering of lines of computer code upon lines of computer code over the years, the leaderless and unpredictable power of social media, or—in a more root-cause explanation—our lust for speed and efficiency. But the end result is the same: the systems that facilitate (travel, banking) and even rule (medical care, 911 calls) our lives are more complex and more closely linked than ever before, with ever-shrinking margins for error. As a result, the authors and many of their contemporaries argue, we are living in the “Golden Age of meltdowns.”

According to Clearfield (a pilot, former derivatives trader and founder of the research and consulting firm System Logic) and his co-author, Tilcsik (Canada Research Chair in Strategy, Organizations, and Society at the University of Toronto’s Rotman School of Management), the shared DNA of these meltdowns comes from two factors in combination. Systems are far more complex than they once were, more like webs than through lines. A car assembly plant, for instance, is complicated but linear—the product moves from station to station, and if one knot in the line unravels, it is immediately clear where the failure is located, and movement can be halted or rerouted until a repair is completed. But internet-based systems and some physical plants—a nuclear power plant, say—are more complex, with parts interacting in intricate and often invisible ways. When something goes wrong, it can be hard to pinpoint the problem and easy to make it worse with an ill-informed diagnosis. Sometimes, as in three years of Brexit turmoil, no one “can control the argument and control how it evolves,” says Rick Nason, a risk management and complexity consultant.

Compounding the complexity is what engineers call “tight coupling.” Slack, which creates redundancy and slows things down but also buys time and space for fixes, is increasingly squeezed out of systems. Now those systems all have to run like a Prussian railway, with the precise number and combination of inputs always available—and the system can’t be shut down while being repaired. Complexity plus tight coupling, basic features of the contemporary socioeconomic landscape, means breakdowns often come at system operators like rockets out of a fog bank, fired randomly and hitting targets less by design than happenstance.

Nor are fallbacks generally available, adds Clearfield. “Off the top of my head, I can think of perhaps five examples of airlines grounding for days because of a problem with their reservation systems,” he says. “Part of me is always thinking, ‘Can’t you just send paper tickets to people?’ No, they can’t. They don’t have the infrastructure for that anymore. There is no roomful of people typing tickets or cheques on typewriters now.” 

Canadian public servants can swear to that. When the federal government decommissioned its 40-year-old payroll system, two-thirds of its employees were under- or overpaid by the new, problem-riddled Phoenix pay system, and they could secure no quick relief, such as a cheque made out in the correct amount. 

Systems are not simply complex within themselves but also in the ways they interact with other systems. “We have made our systems more efficient,” Clearfield says, as much in labour and material costs as in any other area, “so efficient that we can’t do that kind of reversion as easily as before.” So efficient, in fact, as to open new doors for malicious actors: as Clearfield and Tilcsik discuss, car hijackings, which not so long ago required physical presence, can now be done remotely, by hacking into a vehicle’s internet connection. Interconnectivity is wonderful, right up until the moment it isn’t.

In ICUs, alerts sounded every eight minutes. Since 90 per cent of them were false positives, the natural boy-who-cried-wolf reaction from harried staffers was to “tune them out,” write Chris Clearfield and András Tilcsik, co-authors of Meltdown: What Plane Crashes, Oil Spills, and Dumb Business Decisions Can Teach Us About How to Succeed at Work and at Home (Photo by Getty)

Meltdown’s diagnosis is as disturbing as it is convincing. So what can be done to avert these disasters? The book’s prescriptions for public policy and private enterprise are “more simple than easy,” says Clearfield. For the most part, they turn on human psychology. He notes how, counter-intuitively, more warning alerts can lead to less safety. A study of bedside alarms in intensive-care hospital units, for instance, found that an alert sounded every eight minutes. Since 90 per cent of them were false positives, the natural boy-who-cried-wolf reaction from harried staffers “was to tune them out.” Nason calls that “risk homeostasis,” a situation where safety bells and whistles lead to a smug and dangerous assumption of security. In the same way, jaywalkers pay closer attention to oncoming traffic than people crossing at a light or crosswalk—the increased safety features “do not actually increase your safety.” The solution to that problem is to prune safety indicators and restrict attention-grabbing alerts to true emergencies, write Clearfield and Tilcsik. On commercial aircraft, for example, a low-level advisory like an amber text message will inform the pilots if fluid is low in a hydraulic system; a noisier, higher-level alarm arrives only if the fluid actually runs out.
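The alarm-fatigue arithmetic above can be worked through directly. The following back-of-the-envelope sketch uses only the two figures cited in the article (an alert every eight minutes, 90 per cent of them false); everything else is derived:

```python
# Illustrative arithmetic for the ICU alarm figures cited above.
# Inputs from the article; derived quantities are simple consequences.

MINUTES_PER_DAY = 24 * 60
ALERT_INTERVAL_MIN = 8          # one alert every eight minutes
FALSE_POSITIVE_RATE = 0.90      # 90 per cent of alerts are false alarms

# How many alerts a staffer faces in a day
alerts_per_day = MINUTES_PER_DAY // ALERT_INTERVAL_MIN            # 180 per day

# How few of those actually matter (~18 per day)
true_alerts_per_day = alerts_per_day * (1 - FALSE_POSITIVE_RATE)

# Expected gap between genuine emergencies (~80 minutes)
minutes_between_true_alerts = ALERT_INTERVAL_MIN / (1 - FALSE_POSITIVE_RATE)

print(alerts_per_day, round(true_alerts_per_day), round(minutes_between_true_alerts))
```

On these numbers, a nurse is interrupted 180 times a day but rewarded for attention only about once every 80 minutes, which is why pruning low-value alerts, as on commercial aircraft, raises rather than lowers safety.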

In the end, transparency, aided by a healthy dollop of humility, has to be the goal. “Transparency is huge,” says Clearfield, who adds that you need to understand what’s going on in your system, treat the data it gives you as a learning opportunity, and listen to it. He offers the example of a physician who investigated a close call at a University of California hospital, where a patient received 38 times the required dose of a medication despite three separate alerts. And he urges an organizational structure in which the CEO sends a thank-you note not just to the employee who makes a great catch, but even to the one who flags a suspected error and turns out to be wrong. 

That’s the very definition of “simple but not easy,” requiring the upending of standard notions of hierarchy and other innate human tendencies. One effective strategy, however it’s termed—psychologists call it “prospective hindsight,” while Clearfield prefers “premortems”—is to engage people’s imaginations. If you ask team members whether they’ve covered every foreseeable problem that might plague a new launch, the answer is liable to be yes. Instead, ask them to imagine that it’s now two years later and their project has gone grotesquely wrong—what could have happened? “The premortem is sort of a mental hack that taps into our love of storytelling,” says Clearfield. It flips the incentive by rewarding negative thinking that can save the day.

Diversity among safety watchers is also an enormous help, argues Meltdown. The more people within a system resemble one another—in any combination of education, background, skin colour or chromosome—the more they will unconsciously defer to one another’s assurances, even when doubts niggle in the back of their minds. The more outsider-ish participants are—again by various combinations, even simply coming from elsewhere in the organization or from outside the firm entirely—the less automatic the assumptions, the more careful and thought-through the arguments.

Aviation and nuclear power, because the stakes are so high, have always been the most scrutinized industries in terms of system security. Despite continuing setbacks in the aviation sector, Clearfield is emphatic in his praise of how the airline industry and government regulators have built continuous questioning into their best practices. “They have all these checklists that question even the captain,” he notes, referencing studies that showed that accidents were more prevalent when the senior pilot was at the helm because the junior pilot was reluctant to speak up and upset the hierarchy of authority. “In the 1990s, the Federal Aviation Administration made a bold move by effectively saying, ‘Look, we’re not going to get safer by fining more people but by figuring out how this industry can operate in a different way.’ That was the primary driving factor and it turned out to be really powerful in making aviation safety-oriented. Now there are 100,000 flights a day, almost all of which land safely.”

But, as Clearfield’s earlier comments on flight reservations show, flight safety doesn’t mean airlines have avoided more ordinary business meltdowns, less deadly but extremely damaging nonetheless. For most businesses—which now have supply lines, turnaround times and distribution schedules as complicated and tightly coupled as a power plant’s—almost all crises have core causes and/or consequences that are financial or reputational (which, ultimately, come to the same thing). Those hard-to-see-coming breakdowns are what worry businesses today.

When it comes to preventing meltdowns, a firm’s in-house financial professionals, especially its accountants, are key figures. “For any business,” Clearfield repeatedly comments in his book and in an interview with Pivot, “there is data in your system. You just have to find a way to get at it and then read it.”

CPAs are in a prime position to do just that, says Tashia Batstone, CPA Canada’s senior vice-president of external relations and business development. “Data scientists can come up with lots of great, valid information, but they’re not necessarily able to translate it in a way that’s useful, because they may not have a deep understanding of the business,” she says. “That’s the sweet spot for CPAs.” She envisions accountants as trusted advisers who can ensure information is valid, reliable and consistent, and then translate it into insights that help senior management drive the success of the business. “It’s clear that CPAs are trying to pivot from being number crunchers to being business advisers,” Nason adds. “I think that’s absolutely brilliant.”

Financial professionals are always involved in coping with any business breakdown “because it’s very difficult for any company to have a crisis that doesn’t have a financial component,” says Jacqui d’Eon, who spent 10 years with Deloitte Canada before establishing JAd’E Communications, a firm that specializes in reputation management and corporate crisis communications. But those professionals also have a role to play in preparing for catastrophic possibilities that affect more than finance. When H1N1, a highly unusual variety of influenza, began infecting children and working-age populations in the spring of 2009, Deloitte was concerned it would sideline scores of employees. So Deloitte assembled a task force, which included the CFO, to look at all aspects of the business: “How would we pay people, for example, if our payroll staff were not able to work?” says d’Eon. “Can we move work from one place to another if an entire project team got sick? What alternate systems could we establish?” In that case, professional accountants were helping to manage company-wide operations and systems. “Accountants,” says d’Eon, “have a whole set of great skills.”

In managing risk, the first thing to do is determine whether you are in a “complicated” or “complex” environment, says Nason. A complicated environment calls for the checklist approach: “Find the experts, find the best practices,” such as premortems. 

A complex environment, though, is emergent and social. It’s Brexit, for example, or a flock of starlings “all moving in one pattern and then another, completely unpredictable and completely leaderless.” Or, in Nason’s preferred analogy, it’s a teenager. “An argument that worked with teens at 10 o’clock doesn’t work at 10:15 because teenagers are in a social system where they’re interacting with their peers and constantly adapting. You don’t ‘solve’ a teenager, right?” Following a checklist with a teen will likely lead to unattractive and unintended consequences, he says. “As a parent, sometimes you’ve got to take a deep breath and say, ‘Okay, I can’t fix this. So how do I manage it?’ ” says Nason. “Business managers should do the same.”

Complex situations require “a great deal of humility on the part of the manager, which is probably the major hindrance to managing complexity: you have to think and manage, not solve.” And one thing you have to think about, says Nason, is that “risk” has a neutral, even positive, side. All crises are also opportunities, and the unexpected may be good news. That concept, the opposite of treating risk entirely as “the Department of No,” injects “a totally different culture into an organization, so that people are scanning ahead for good things as well as bad.”

Accountants, Nason adds, must have the insight to realize that questions of value almost always have shades of grey. “And shades of grey imply risk and the need for thinking people as guides,” he says. Accountants are especially adept at pointing out that even numbers that look great can actually be hiding something—“how they could, if you did this, lead us in a different direction.” 

It all comes down to Donald Rumsfeld’s famous remark, that there are known knowns, known unknowns and unknown unknowns out there waiting for us. Humility, transparency and fostering a workplace culture that rewards questioning from all ranks are the best guardrails we have.