
The Many Faces of Resilience

A review of network science and complexity theory as they apply to the ability of systems to resist stress and recover from faults.

Qualitative definitions of resilience are abundant in the literature, which says resilience is the ability to resist stress and recover from faults. Essentially, a resilient object is like the timepiece that “takes a licking and keeps on ticking.” However, a more rigorous definition is needed for purposes of measuring and protecting critical assets, such as a nation’s infrastructure. If we cannot quantify it, how do we know if we have enough?

Key Insights

• The study of resilience and the risk of collapse in complex systems, such as supply chains and communication networks, began with the foundational work of Perrow and Bak circa 1980.
• Perrow and Bak’s work is captured in modern complex systems theory, which models systems as networks with a connectivity topology.
• Resilience/fragility of complex systems emerges (partially) from network topology, aka structure, and is quantified by the network’s spectral radius.
• Frequency of collapse versus consequence is modeled as a power law whose exponent is a fractal dimension. Resilience increases with fractal dimension.

When my garaged automobile fails to start in the morning, it is of little consequence to anyone but me. However, if it stalls in the middle of a busy freeway, the consequences can be large. It risks tying up traffic and delaying perhaps thousands of commuters. My automobile becomes more important as a component of a (traffic) system. The traffic system’s ability to deal with the stress of a stalled automobile is a kind of resilience. Thus, resilience is a property of a system.

When a cog in a system breaks, the cost to the system may be much larger than the cost of the cog itself. The simple idea of a small fault leading to major, even catastrophic, failure was the principle underlying Perrow’s normal accident theory (NAT) in the 1980s.12 According to NAT, catastrophic system failure is a dynamic process which begins with a small fault that cascades into a larger fault and leads to an even bigger fault—a cascade of faults. At each subsequent fault in the cascade, the consequence is magnified by an “invisible” coupling force. Perrow documented this chain reaction through careful examination of the Three Mile Island nuclear power plant collapse and other notable catastrophes. But he never identified the invisible coupling force. To Perrow, the problem was poor management.

More recently, Kott et al.9 modeled the cascading of one or more faults as propagation of node failure in a directed network. In the previous automobile and highway system example, the system is the highway network consisting of nodes (intersections) and links (highway segments). When one node is blocked, upstream links and nodes become congested and fail. This kind of cascading is common in infrastructure failures because infrastructure systems are highly connected. According to the authors, resilience is relative to critical functionality, which may be defined as the percentage of nodes that are functioning or the ratio of a network’s actual flow to its maximum capacity.

At about the same time as Perrow’s pioneering work, Per Bak and colleagues became enamored of a simple sand-pile experiment subsequently known as the Bak-Tang-Wiesenfeld (BTW) experiment.3,8 The BTW similitude modeled collapse as a sand pile building up to a peak as grains of sand are dropped on a plate, eventually resulting in landslides. The interesting thing was that the size of the landslides and the elapsed time between them were unpredictable. Statistically, they obeyed a long-tailed power law without moments (no average or standard deviation). This was a hint that resilience is related to the mysterious coupling force that Perrow described in his writings.

Bak later developed a theory of collapse he called punctuated equilibrium, a concept first proposed by Niles Eldredge and Stephen Jay Gould7 (see Figure 1).4 Punctuated equilibrium is cyclical: Systems act like sand piles, whereby self-organization builds up to a critical level, then collapses occur, followed by repeated build ups, and so on. Most events are small, but a rare few are extremely large. Events considered normal accidents result in ratcheting up self-organization: self-organized criticality (SOC). Events that occur with low frequency but high consequence (at the long end of the power law) are considered “black swans” and result in large adaptations or extinctions. Bak applied his ideas to evolution and major cataclysms in history.

Figure 1. The author’s adaptation of Bak’s punctuated equilibrium theory to terrorism. SOC increases to a critical tipping point, resulting in a major collapse called a “black swan.”10 Fragility is proportional to SOC according to Per Bak.

Bak discovered SOC, Perrow’s invisible coupling force. As it turns out, self-organization is a major cause of fragility—that is, anti-resiliency. Bak went further in 1999, claiming, rightfully, that a major source of SOC is efficiency and optimization. Business practices that attempt to optimize systems such as today’s supply chains reduce redundancy and resiliency and increase the likelihood of a normal accident or even worse: a black swan event.13

Exceedance probability (EP) obtained from empirical frequency data is one quantitative measure of resilience. It is based on the frequency of consequences exceeding a certain value along the consequence x-axis. Probable maximum loss (PML) is another way of quantifying resilience. Probable loss is the expected loss computed as the product of consequence and exceedance probability, x EP(x), and PML is the maximum of that product over x.

For small probability of component failure (v), exceedance probability approximates a power law, EP(x) ~ x^{-q}, where q is the fractal dimension. Therefore, probable loss is a function of component failure probability and consequence, x, yielding an exceedance probability curve with fractal dimension, q. PML is the maximum value of probable loss, but the two are different facets of the same thing.
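The two measures above can be computed directly from a list of observed consequences. The following is a minimal Python sketch, not the author's tooling: the function names and the sample data are hypothetical, and EP(x) is taken here as the fraction of events whose consequence is at least x, which is one common convention that the text does not pin down.

```python
from bisect import bisect_left

def exceedance_probability(losses):
    """Return sorted (x, EP(x)) pairs; EP(x) = fraction of events with loss >= x."""
    ordered = sorted(losses)
    n = len(ordered)
    points = []
    for x in sorted(set(ordered)):
        # bisect_left counts the events strictly below x
        at_or_above = n - bisect_left(ordered, x)
        points.append((x, at_or_above / n))
    return points

def probable_maximum_loss(losses):
    """Return (x, x * EP(x)) at the consequence where probable loss peaks."""
    curve = ((x, x * ep) for x, ep in exceedance_probability(losses))
    return max(curve, key=lambda point: point[1])

# Made-up consequence values, for illustration only
sample = [1, 1, 1, 2, 2, 3, 5, 8, 20, 100]
for x, ep in exceedance_probability(sample):
    print(f"x={x:>3}  EP={ep:.1f}  probable loss={x * ep:.1f}")
print("PML:", probable_maximum_loss(sample))
```

With this sample, the single largest event dominates: probable loss peaks at x = 100, where EP is 0.1. That is the signature of a fat tail, in which rare events carry most of the expected loss.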

More to the point, resilience declines as the long-tailed power law, x^{-q}, describing exceedance due to a failure becomes fatter and/or longer. That is, systems with fat-tailed failure distributions (smaller q) are more fragile than shorter and thinner-tailed distributions (larger q), as shown in Figure 2.

Figure 2. (a) Exceedance probability and probable maximum loss for the Western Power Grid of the U.S. (WECC). Actual data collected from 2007 to 2017 fits a fat-tailed (q = 0.953) power law with R^2 of 0.97. Data from: https://www.nerc.com/pa/RAPA/PA/Pages/Section1600DataRequests.aspx. (b) The fit of exceedance probability versus shipping port size results in a thinner, shorter power law (q = 1.331) with R^2 of 0.97. Data from: https://www.worldshipping.org/top-50-ports.

The probable loss curve of Figure 2b declines more rapidly than the probable loss curve of Figure 2a because q = 1.331 is greater than q = 0.953, indicating a thinner tail. Regardless of differences in consequence and range of consequences, the theory holds equally for electric power outages, earthquakes, forest fires, shipping port risk, and cyber exploits on the Internet.
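A fractal dimension like those quoted for Figure 2 can be estimated by fitting a straight line to the exceedance curve on log-log axes. The text does not say how its fits were produced, so the sketch below assumes ordinary least squares on log EP versus log x. The function name is hypothetical and the data are synthetic, generated from an exact power law so the fit can be checked.

```python
import math

def fit_power_law(xs, eps):
    """Fit log EP = log c - q * log x by least squares; return (q, r_squared)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(e) for e in eps]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    syy = sum((b - mean_y) ** 2 for b in ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)
    return -slope, r_squared  # q is the negated slope

# Synthetic exceedance values drawn from x^(-1.331); not real port data
xs = [1, 2, 4, 8, 16]
eps = [x ** -1.331 for x in xs]
q, r2 = fit_power_law(xs, eps)
print(f"q = {q:.3f}, R^2 = {r2:.3f}")
```

Least squares on log-log data is simple, but it weights the sparse tail heavily; maximum-likelihood estimators are a common alternative when the tail is the quantity of interest.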

Thus, while PML is a measure of risk, it can be used as a measure of resilience when a system is subjected to stress. That is, one can compute a baseline PML under nominal conditions and compare the PML values when the system is stressed via simulation or in the real world. This requires that sources of fragility be identified and applied to network models of infrastructure.

Common Sources of Fragility

Of course, there is more to resilience than fat-tailed EP. In a previous work,11 I listed several cause-and-effect scenarios that lead to fragility—that is, stress-induced faults. Bak’s optimization is a prime example, but there are others, called criticality factors:

Tragedy of the commons (ToC). A system becomes fragile when carrying capacity is exceeded by overloading. For example, the U.S. builds more interstate highways while maintenance funds diminish and more power plants depend on fewer transmission lines, leading to more frequent and larger outages.

Paradox of enrichment (PoE). A system may become unstable due to an enrichment that exceeds its carrying capacity. For example, Braess’s paradox is the observation that adding one or more roads to a road network can slow down overall traffic flow through it. One can argue that the 2008–2009 financial collapse was the result of home ownership exceeding its carrying capacity (65%), thus causing the Great Recession.

Competitive exclusion principle (CEP). Gause’s competitive exclusion principle says that competitive ecosystems tend to eliminate all but one competitor, because sooner or later, one competitor gains a small advantage over all others, grows faster, and becomes fitter than all others. This leads to a monopoly, in general, which reduces redundancy and diversity. As a result, resilience is reduced, largely because monopolies are optimized organizations that tend to build optimized (profitable) systems. The Internet’s universal TCP/IP standard protocol is an example. The Internet is easy to attack because TCP/IP forms a mono-culture—an attack on one server is an attack on all servers. The pioneers of the Internet call this the “Internet’s original sin.”

Preferential attachment. Preferential attachment is the most common form of self-organization that leads to SOC and CEP. In practice, preferential attachment creates concentrations of assets, bottlenecks, and single points of failure. The Internet continues to undergo restructuring because of preferential attachment—a hub-and-spoke architecture is emerging due to economics and regulation. Power grids and transportation systems have reached high levels of SOC due to decades of preferential attachment. Wherever preferential attachment is at work, the resulting system is likely to be vulnerable because of a critical hub, essential bottleneck, or weakest link.

Each of these criticality factors may be quantified, studied, and altered to improve resilience, but they must first be understood. Resilience is not an add-on; it is a property of every system that can be formalized and quantitatively studied like any other engineering discipline.
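As one example of quantifying a criticality factor, Braess's paradox (mentioned under the paradox of enrichment) can be checked with a few lines of arithmetic. The sketch below uses the standard textbook setup rather than anything from this article: 4,000 drivers, two routes that each combine a congestible link costing T/100 minutes for T drivers with a fixed 45-minute link, and an optional zero-cost shortcut joining the two congestible links. All names and numbers are illustrative assumptions.

```python
DRIVERS = 4000
FIXED = 45  # minutes on each uncongested link

def variable(traffic):
    """Travel time in minutes on a congestible link carrying `traffic` drivers."""
    return traffic / 100

def times_without_shortcut(via_a):
    """Route times when `via_a` drivers go Start-A-End and the rest Start-B-End."""
    via_b = DRIVERS - via_a
    return variable(via_a) + FIXED, FIXED + variable(via_b)

def times_with_shortcut(via_a_only, via_b_only, via_shortcut):
    """Route times once a zero-cost A-to-B link exists."""
    on_start_a = via_a_only + via_shortcut   # traffic on the Start-A link
    on_b_end = via_b_only + via_shortcut     # traffic on the B-End link
    return {
        "A": variable(on_start_a) + FIXED,
        "B": FIXED + variable(on_b_end),
        "shortcut": variable(on_start_a) + variable(on_b_end),
    }

print(times_without_shortcut(2000))       # even split: both routes take 65 minutes
print(times_with_shortcut(0, 0, 4000))    # everyone on the shortcut: 80 minutes
```

With everyone on the shortcut, each driver takes 80 minutes, yet switching alone to either original route would take 85, so nobody has an incentive to move. The added link leaves every traveler 15 minutes worse off than the 65-minute split that held before it was built.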

Applying the Theory

The salient properties of a system with invisible coupling can easily be represented as a network, G = {N, L, f}, where N is a set of nodes representing the system components, L is a set of links representing real or intangible connections between pairs of nodes, and f is a mapping function that defines connectivity—that is, which nodes are connected by which links. Function f is often given as a connection matrix with zeros everywhere except where an element is set to 1, signifying a link connecting two nodes. Links make invisible coupling tangible and hence are the subject of analysis.
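The definition G = {N, L, f} maps naturally onto a small data structure. Here is a minimal sketch, assuming undirected links and using hypothetical component names; a directed network would set only one of the two matrix entries per link.

```python
def connection_matrix(nodes, links):
    """Build the 0/1 connection matrix f for node list N and link list L."""
    index = {name: i for i, name in enumerate(nodes)}
    matrix = [[0] * len(nodes) for _ in nodes]
    for a, b in links:
        matrix[index[a]][index[b]] = 1
        matrix[index[b]][index[a]] = 1  # undirected: mirror the entry
    return matrix

# Hypothetical three-component system
nodes = ["reactor", "pump", "turbine"]
links = [("reactor", "pump"), ("pump", "turbine")]
matrix = connection_matrix(nodes, links)
for name, row in zip(nodes, matrix):
    print(f"{name:>8} {row}  degree={sum(row)}")
```

Each row sum is that node's number of connections, which is the quantity that preferential attachment concentrates in hubs.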

Any system can be represented as a network. For example, Perrow’s Three Mile Island power plant example is a system of reactors, pumps, and so on, represented as nodes, and pipes represented as links. Transportation systems contain links (roads) and intersections (nodes); the Internet has nodes (servers) and links (communications); communities are made up of people (nodes) and links (relationships), and so on.

Networks are useful abstractions of systems because we know a lot about them. We know that the World Wide Web’s network structure obeys a power law, as does Bak’s sand pile,1 and that its structure makes it less resilient to attacks on its hubs.2,8 In fact, structure contributes to the fragility or resilience of every network, especially when the network is scale-free.5

Network science and complexity theory provide the first steps towards a rigorous theory of resilience. This theory is emerging, but we know a few of its characteristics. First, as a system becomes more structured, its resilience may decrease under certain kinds of stress and increase under other kinds of stress. It is as if structure exaggerates resilience or the lack of it. Structure may emerge from preferential attachment and optimization that shape a network by increasing the number of connections of a few nodes while reducing the connections of most others. This results in a scale-free or scale-free-like topology with a central hub node and many minimally connected nodes.
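The way preferential attachment produces hubs can be seen in a toy growth model. The sketch below is an assumption-laden illustration, not the procedure behind the article's figures: each new node adds exactly one link, and it picks its partner with probability proportional to the partner's current number of connections. The function name is hypothetical.

```python
import random

def preferential_attachment(n, rng):
    """Grow an n-node network; each newcomer links to one degree-weighted target."""
    adjacency = {0: [1], 1: [0]}
    # A node appears in `endpoints` once per link it touches, so a uniform
    # draw from this list is a draw proportional to degree.
    endpoints = [0, 1]
    for new in range(2, n):
        target = rng.choice(endpoints)
        adjacency[new] = [target]
        adjacency[target].append(new)
        endpoints.extend([new, target])
    return adjacency

network = preferential_attachment(200, random.Random(42))
degrees = sorted((len(nbrs) for nbrs in network.values()), reverse=True)
print("largest degrees:", degrees[:5])
print("nodes with one link:", sum(1 for d in degrees if d == 1))
```

Early nodes tend to accumulate connections while most latecomers keep a single link, which is the hub-plus-periphery pattern the paragraph describes.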

The topology of a network can be captured in a single number, called the spectral radius, which is the largest eigenvalue of the connection matrix. Spectral radius increases when the density of links or the number of connections of the hub(s) increases. Intuitively, increasing the density of links as measured by number of links per node also increases the opportunity for cascading, whereby failure of an adjacent node or link propagates failure to other nodes and links. The overall effect on resilience depends on how resilience is measured.
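For a small undirected network, the spectral radius can be approximated without any numerical library. The following sketch uses power iteration on the shifted matrix A + I, a standard trick that prevents the oscillation plain power iteration shows on bipartite networks such as stars. It assumes a symmetric 0/1 connection matrix; the function name and the two five-node test networks are illustrative.

```python
import math

def spectral_radius(matrix, iterations=500):
    """Approximate the largest eigenvalue of a symmetric 0/1 connection matrix."""
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        # Multiply by (A + I); the shift keeps the top eigenvalue strictly dominant
        w = [v[i] + sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T A v recovers the eigenvalue of A itself
    av = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
    return sum(v[i] * av[i] for i in range(n))

# Two networks with five nodes and four links each
star = [[0, 1, 1, 1, 1], [1, 0, 0, 0, 0], [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0], [1, 0, 0, 0, 0]]
path = [[0, 1, 0, 0, 0], [1, 0, 1, 0, 0], [0, 1, 0, 1, 0],
        [0, 0, 1, 0, 1], [0, 0, 0, 1, 0]]
print(round(spectral_radius(star), 3), round(spectral_radius(path), 3))
```

The star comes out at approximately 2.0 and the path at approximately 1.732 (the square root of 3), so concentrating the same four links on one hub raises the spectral radius even though node and link counts are unchanged.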

Figure 3 illustrates how resilience is impacted by structure, in particular the emergence of hubs. Figure 3a shows a random network (links are randomly assigned to nodes), and Figure 3b shows a scale-free network with a hub and a greater number of less-connected nodes. Note that the two networks have the same number of nodes and links, but their topologies are radically different. The spectral radius of the random network is 4.75, while the more structured scale-free network has a spectral radius of 6.48.

Figure 3. Structure plays a central role in resilience.

Figure 3a might represent a road network while Figure 3b could be a supply-chain/distribution network with a centralized warehouse. Both are subject to failure due to several criticality factors. For example, the road network might fail due to blockage, congestion, or removal of one or more links. The supply chain may suffer from similar causes or from supply/demand imbalances. We focus on blockages, congestion/overloading, and link failures in the remainder of this article.

Figure 3 shows the result of blocking analysis. Nodes or links are said to be “blocking” if the effect of removing them separates the network into islands, such that it becomes impossible for a commodity to flow from one island to another. Parts of the network become unreachable, and resilience decreases as the percentage of blocking nodes/links increases. In Figure 3b, the scale-free network is more resilient than the random network by this measure.
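Blocking nodes as defined here correspond to what graph theory calls articulation points. A brute-force sketch follows: remove each node in turn and test whether the rest of the network is still connected. The function names and the three toy networks are hypothetical, and the approach favors clarity over speed; blocking links can be found the same way by removing one link at a time.

```python
from collections import deque

def is_connected(adjacency, removed=frozenset()):
    """Breadth-first check that all nodes outside `removed` can reach each other."""
    nodes = [n for n in adjacency if n not in removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in removed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return len(seen) == len(nodes)

def blocking_nodes(adjacency):
    """Nodes whose removal splits the remaining network into islands."""
    return sorted(n for n in adjacency if not is_connected(adjacency, {n}))

star = {"hub": ["a", "b", "c", "d"], "a": ["hub"], "b": ["hub"],
        "c": ["hub"], "d": ["hub"]}
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
ring = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
for name, net in [("star", star), ("path", path), ("ring", ring)]:
    found = blocking_nodes(net)
    print(name, found, f"{100 * len(found) / len(net):.0f}% blocking")
```

In these toy cases the star has a single blocking node (its hub), the path has three, and the ring has none, which shows how the percentage of blocking nodes depends on topology rather than on size.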

By extending blocking analysis one can study the impact of link removal on the network’s ability to function. What happens when one link fails? Directional networks such as supply chains tend to re-route flows around broken links, which tends to overload other links. In Figure 3, the random network is more resilient by the “link minus one” stress test because worst-case overloading is 175% as compared with 185% for the scale-free network. In general, random networks offer a greater number of alternative routes and avoid overloading.
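A "link minus one" stress test can be sketched as follows: route every demand, fail one link at a time, reroute, and record the worst ratio of load to capacity. This is a simplified stand-in for whatever routing model produced the article's percentages. It assumes undirected links, fewest-hops routing, and demands that are simply dropped if their destination becomes unreachable; the names, the four-node network, and the unit capacities are all hypothetical.

```python
from collections import deque

def route(adjacency, source, sink, down=None):
    """Links on a fewest-hops path from source to sink, avoiding the failed link."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        if current == sink:
            break
        for neighbor in adjacency[current]:
            if neighbor not in parent and frozenset((current, neighbor)) != down:
                parent[neighbor] = current
                queue.append(neighbor)
    if sink not in parent:
        return []  # unreachable: the demand is dropped rather than rerouted
    links, node = [], sink
    while parent[node] is not None:
        links.append(frozenset((parent[node], node)))
        node = parent[node]
    return links

def link_loads(adjacency, demands, down=None):
    """Total flow carried by each link after routing all demands."""
    loads = {}
    for source, sink, amount in demands:
        for link in route(adjacency, source, sink, down):
            loads[link] = loads.get(link, 0) + amount
    return loads

def worst_case_overload(adjacency, demands, capacity):
    """Largest load/capacity ratio seen over all single-link failures."""
    worst = 0.0
    for down in capacity:
        for link, load in link_loads(adjacency, demands, down).items():
            worst = max(worst, load / capacity[link])
    return worst

square = {"a": ["b", "d"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "a"]}
demands = [("a", "c", 1), ("a", "d", 1)]
capacity = {frozenset(p): 1 for p in [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]}
print(f"worst-case load: {100 * worst_case_overload(square, demands, capacity):.0f}%")
```

In this example no link exceeds its capacity until one fails, after which rerouted flow doubles the load on a surviving link, giving a worst case of 200%.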

The literature on scale-free networks says they are more resilient in terms of random failures of nodes leading to cascades via adjacent nodes.2,8,9 This is largely due to the abundance of nodes with little connectivity. However, links are just as important as nodes with respect to cascading faults. When simulating cascading random link failures, the structured scale-free network is slightly less resilient: Tail risk is lower for the random network than the scale-free network (fractal dimension of 1.87 for scale-free versus 2.41 for random). A higher fractal dimension implies a shorter and thinner power law, hence less PML risk.

An intelligent actor would be more likely to attack the highly connected nodes, however, correctly thinking that more damage is caused by a targeted exploit. For obvious reasons, this is the strategy of cyber criminals. In this case, the random network does much better because it has fewer and smaller hubs. When the hub of each network is targeted, the fractal dimension of the random network’s EP is 1.97, compared with 1.57 for the scale-free network—that is, the random network is more resilient. Recall, a larger fractal dimension means a shorter and thinner EP. The corresponding PML risk of scale-free failure is approximately three times greater.
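The contrast between random and targeted failures can be explored with a simple cascade simulation. The sketch below assumes a generic independent-cascade model in which each failed node knocks out each working neighbor with probability v; the article does not describe the internals of its simulations, so this is only an illustration, and all names are hypothetical.

```python
import random
from collections import deque

def cascade_size(adjacency, start, v, rng):
    """Number of failed nodes when a fault at `start` spreads with probability v."""
    failed = {start}
    frontier = deque([start])
    while frontier:
        current = frontier.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in failed and rng.random() < v:
                failed.add(neighbor)
                frontier.append(neighbor)
    return len(failed)

def simulate(adjacency, v, trials, rng, target=None):
    """Cascade sizes for random starting nodes, or for a fixed targeted node."""
    nodes = sorted(adjacency)
    sizes = []
    for _ in range(trials):
        start = target if target is not None else rng.choice(nodes)
        sizes.append(cascade_size(adjacency, start, v, rng))
    return sizes

star = {"hub": ["a", "b", "c", "d"], "a": ["hub"], "b": ["hub"],
        "c": ["hub"], "d": ["hub"]}
rng = random.Random(7)
random_sizes = simulate(star, 0.3, 1000, rng)
targeted_sizes = simulate(star, 0.3, 1000, rng, target="hub")
print("mean size, random start:", sum(random_sizes) / len(random_sizes))
print("mean size, hub targeted:", sum(targeted_sizes) / len(targeted_sizes))
```

The list of cascade sizes returned by each run is the raw material for an exceedance probability curve, from which a fractal dimension for random versus targeted failure could then be estimated.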

By increasing link capacity to accommodate overloading, we can measure network resilience. Essentially, resilience is tolerance of overloading links. In this contest, the scale-free network wins, 7.17 versus 1.36, as shown in Figure 4.

The Table summarizes these results, but the list is not complete. Other criticality factors, such as the number of paths from source to sink node or various measures of loss when a node or link fails, are possible and equally valid. The point is that the likelihood of adjacent nodes “infecting” nodes and links, cascading due to overloading, and topological structure play a major role in determining resilience. The network structure of a flow network is more resilient when it has alternate paths around a damaged node or link. It is also more resilient when nodes and links have spare capacity.

Table. Summary of criticality factors and their impact on resilience.

Complexity, Fractals, and Resilience

On February 19, 2013, Deputy Secretary of Defense Ashton B. Carter defined complex catastrophe as “any natural or man-made incident, including cyberspace attack, power grid failure, and terrorism, which results in cascading failures of multiple, interdependent, critical, life-sustaining infrastructure sectors and causes extraordinary levels of mass casualties, damage, or disruption severely affecting the population, environment, economy, public health, national morale, response efforts, and/or government functions.”6

Such catastrophes are complex because of their interconnectedness or linkages across assets and sector boundaries. These linkages are responsible for cascading failures that render such catastrophes extraordinary and unusually consequential. They are also often characterized by long-tailed exceedance frequency distributions. The frequency of exceeding a level of consequence is plotted against consequence, where consequence may be defined in terms of cost, size, fatalities, and so on. Most often, the EP obeys a power law.

Power laws are self-similar fractals, with dimension q equal to the exponent of the power law, p(x) = x^{-q}. This suggests many catastrophic events are fractals. Mark Twain said, “History doesn’t repeat, but it rhymes.” We can say, “Disasters don’t repeat, but they are self-similar; that is, fractals.” The fractal dimension represents another quantitative measure of resiliency. Moreover, it provides us a handle on quantifying other measures, such as PML and risk, which are directly or indirectly related to resilience.
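The self-similarity claim can be made explicit with one line of algebra. Rescaling the consequence by any factor c (a symbol introduced here for illustration) changes the curve only by a constant multiple, so the curve looks the same at every scale and plots as a straight line of slope -q on log-log axes:

```latex
p(x) = x^{-q}
\quad\Longrightarrow\quad
p(cx) = (cx)^{-q} = c^{-q}\,p(x),
\qquad
\log p(x) = -q \log x .
```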

Acknowledgment

Thank you to Susan Ginsburg, CEO of Criticality Sciences, Inc. (https://www.critsci.com), for the use of NetResilience, a software tool for analyzing risk and resilience in infrastructure systems that can be represented as a network containing nodes and links.

Figure. Watch the author discuss this work in the exclusive Communications video. https://cacm.acm.org/videos/many-faces-of-resilience

1. Adamic, L.A. et al. Power-law distribution of the World Wide Web. Science 287, 5461 (2000), 2115.

2. Albert, R., Jeong, H., and Barabási, A-L. The Internet's Achilles' Heel: Error and attack tolerance of complex networks. Nature 406 (2000).

3. Bak, P., Tang, C., and Wiesenfeld, K. Self-organized criticality: An explanation of 1/f noise. Physical Review Letters 59 (July 1987), 381–384.

4. Bak, P. How Nature Works: The Science of Self-Organized Criticality, Copernicus Press, New York, NY (1996).

5. Barabási, A-L and Bonabeau, E. Scale-free networks. Scientific American 288, 5 (May 2003), 60–69.

6. Carter, A.B. Definition of the term Complex Catastrophe, Department of Defense Memorandum (February 19, 2013).

7. Eldredge, N. and Gould, S.J. Punctuated equilibria: An alternative to phyletic gradualism. In Models in Paleobiology, T.J.M. Schopf, Ed. Freeman Cooper, San Francisco (1972), 82–115.

8. Gao, J., Barzel, B., and Barabási, A-L. Universal resilience patterns in complex networks. Nature 530 (2016), 307–312.

9. Kott, A. and Linkov, I., eds. Cyber Resilience of Systems and Networks. Springer, Switzerland (2019).

10. Lewis, T.G. Bak's Sand Pile: Strategies for a Catastrophic World. Agile Press, California (2011), 382.

11. Lewis, T.G. Critical Infrastructure Protection in Homeland Security: Defending a Networked Nation. 3rd ed., John Wiley & Sons (2014).

12. Perrow, C. Normal Accidents: Living With High-Risk Technologies. Princeton University Press (1999).

13. Taleb, N.N. The Black Swan: The Impact of the Highly Improbable. Random House (2007).
