Computing Applications Viewpoint

Cyber Efficiency and Cyber Resilience

Complementary objectives competing for the same resources.

By Igor Linkov, Alexandre Ligo, Kelsey Stoddard, Beatrice Perez, Andrew Strelzoff, Emanuele Bellini, and Alexander Kott

Posted Apr 1 2023

Introduction
Motivation
Formulation
Challenges with Optimization of Long-Term Efficiency
Approaches to Overcome Barriers and Optimize Long-Term Efficiency
Illustration: WannaCry Ransomware and the NHS
Conclusion
References
Authors
Footnotes

When a digital system is developed or purchased, a primary consideration is how successfully the system accomplishes its desired function. Measuring that success, however, should encompass the functional lifespan of the system, including its long-term efficiency. Efficiency is defined as the ability to do things well, successfully, and without waste. A short-term view of efficiency might not account for threats with a low probability of occurring in the next days or weeks. Achieving long-term efficiency, with the system accomplishing its function even in the presence of disruptions, requires consideration of several factors, including short-term efficiency and resilience. Resilience is defined as the ability of a system to absorb, respond to, recover from, and adapt to disruptive events.^9,14,20,22

Enhancing cyber resilience often requires investing in qualified labor, redundant equipment, and software. Such investment increases the cost per byte or user and therefore is detrimental to short-term efficiency. However, enhanced resilience reduces the impact of disruptions and speeds up recovery from them. While investing in resilience improves long-term efficiency, its optimization requires perfect knowledge of the system’s short-term efficiency, the exact nature and impact of all future failures, and knowledge of the system’s response to those failures (that is, resilience). As it is impossible to quantify the exact nature and impact of all future failures, it is also impossible to deterministically optimize a system’s long-term efficiency. While risks can’t be understood deterministically, the impact and nature of those risks can be estimated probabilistically. Ideally, the uncertainty introduced by using a probabilistic approach demands prioritizing the most effective solution based on the long-term goals. We need a framework that compares the tradeoffs between short-term efficiency and resilience to optimize the long-term efficiency goals and therefore the effectiveness of a proposed solution.

Efficiency has been examined extensively, both in operations research and general management⁴ and in specific disciplines such as computer science,¹ communications,¹⁵ and supply chain management.^5,11 Likewise, resilience is the theme of a substantial body of knowledge^18,20 with studies focusing on particular areas such as cyber resilience,^3,17 aviation, climate change and epidemiology;^24,25 or specific types of disruptions, such as the ripple effect in supply chains.¹¹

Recent work has examined the relationship between efficiency and resilience in a system from two different perspectives. One body of knowledge has characterized efficiency and resilience as conflicting objectives, proposing a trade-off between them in a system’s design or optimization.^11,24,25 The second perspective asserts that systems optimized exclusively for efficiency often sacrifice redundancy, reliability, robustness, and other attributes related to resilience.^1,12

In this Viewpoint, we explore the relationship between short-term efficiency and resilience of digital or cyber-physical systems through the lens of functionality, resources, and cost over time. The objective is to present a framework for evaluating the long-term efficiency of a system as a combination of short-term efficiency and resilience. After one or more disruptions, systems that were optimized for both short-term efficiency and resilience as a combined objective are more efficient in the long term than systems that optimize exclusively for efficiency in the short term. The challenges of estimating resilience against future disruptions and possible approaches to overcome these challenges are discussed, followed by an illustration, the WannaCry ransomware attack on the National Health Service (NHS) in England, of the impact of disruptions and of investment in resilience.

Motivation

As an illustration of the relationship between efficiency and resilience, consider a scenario in which two systems initially have identical configurations and levels of functionality (or output). In the case of a healthcare system, for example, functionality and output over a given period refer to the volume of patient appointments, response time to emergency calls, and preventive care exams, among others. At an instant denoted as T_pre, one of the systems is modified to improve resilience while the other is kept unchanged. Because of the modification, these systems respond differently to a disruption. Conceptually, the dynamics of the two systems are as follows.

Initial state: Identical systems. Consider two systems S₁ and S₂ that initially have identical configuration and level of functionality. This initial state is represented in the leftmost part of the timeline illustrated in Figure 1, as the period before T_pre. More specifically:

Figure 1. Timeline of two hypothetical systems S₁ and S₂. Initially the two systems are identical, then S₂ is modified at T_pre, and a disruption occurs at T_dis.

Let C be the total cost incurred by a system to achieve a given level of functionality. Since S₁ and S₂ are identical, for the period before T_pre both systems have total cost C₀.

Time T_pre: Modification to improve resilience. At time T_pre, managers of system S₂ decide to improve its resilience against future attacks or disasters. For example, in a communications network, this might be extra routers or links added to improve network resilience, and in a distributed computing platform it might represent redundant servers, storage, and so forth. On the other hand, system S₁ does not receive any investment for resilience at time T_pre. Note that functionality or output F₂ of system S₂ is unaffected by the extra investment since its purpose is to improve resilience rather than functionality. This is illustrated in Figure 1(A) in the period after T_pre. However, the equipment, services, and so forth added at time T_pre increase the total cost of system S₂ from C₀ to C₂. This is illustrated in Figure 1(B).

Immediately after T_pre, it becomes apparent that system S₂ is less efficient than S₁, at least in the short term. This is because S₂ incurs a cost C₂>C₁ (where C₁=C₀, since S₁ is unchanged) to produce the same functionality as S₁. We call this a short-term loss of efficiency.

Time T_dis: Disruption. At time T_dis, a natural disaster, operations error, or cyber-attack disrupts both systems, affecting their functionality as illustrated in Figure 1(A). System S₁ suffers a sharp reduction in functionality, followed by a relatively slow recovery toward the pre-disruption level F₀.

On the other hand, the additional investment in resilience in system S₂ pays off. As illustrated in Figure 1(A), the functionality of system S₂ is more stable (that is, has less loss) and recovers (that is, goes back to pre-disruption levels) faster than system S₁ in the time following the disruption. Figure 1(B) illustrates the increase in total costs caused by the disruption. This increase includes direct costs of foregone output and costs to remedy the disruption. The increase in total costs may also include indirect or future costs related to lawsuits, loss of users or customers that no longer trust the system, and so forth. Since system S₁ suffers a more severe and longer impact, it is reasonable to expect that the increase in total cost is relatively larger than that of system S₂, which suffers fewer losses.

Figure 1(C) is one way to illustrate the difference in efficiency (short and long term) between the systems over time. Figure 1(C) represents the ratio of the total cost to functionality or output. In other words, the curves represent the unit cost of each system. Under “normal” conditions (before a disruption, illustrated as t<T_dis), system S₁ has the best short-term efficiency because of its cheaper unit cost. But when a disruption occurs at T_dis, functionality declines (see Figure 1(A) for T_dis<t<T_rec) while cost increases simultaneously (Figure 1(B) for the same period), and these two effects compound to raise the cost per unit illustrated in Figure 1(C). System S₂, on the other hand, has both a smaller decrease in output and a smaller increase in total cost.

The key takeaway of the scenario illustrated in Figure 1 is that resilience may be detrimental to short-term efficiency but can be necessary for long-term efficiency. System S₂ initially has worse short-term efficiency because of the costly resources added to enhance resilience. However, S₂ has higher long-term efficiency than S₁ after failures, attacks, or disasters because S₂ turns out to have a lower cost per unit in the longer term.

Formulation

Efficiency. To formalize the idea illustrated in Figure 1, let the cost per unit CU=C/F be the ratio of total cost C to functionality F of a system, as illustrated in Figure 1(C). Note that CU is directly related to efficiency. The lower CU is in a period of time, the more efficient a system is during that period. Therefore, the difference in efficiency between two systems S₁ and S₂ can be expressed as the area between CU₁ and CU₂ over time:

ΔCU = ∫ [CU₁(t) − CU₂(t)] dt

This difference in efficiency is illustrated in Figure 2.

Figure 2. Difference in efficiency between two systems, expressed as the area between the curves of cost per unit of the systems.

Note that for the example illustrated in Figure 1 and Figure 2:

For t<T_pre: ΔCU=0 if the systems S₁ and S₂ are initially identical.
For T_pre≤t<T_dis: consider the special case where S₂ has resilience resources added at T_pre only and S₁ is kept unchanged. If x₂ units of resilience resources are added at an average unit cost c, then C₂=C₀+cx₂ and therefore CU₂=(C₀+cx₂)/F₀>C₀/F₀=CU₁. This period represents short-term efficiency: S₁ is more efficient than S₂.
For T_dis≤t<T_rec: in contrast, this period includes a disruption. Concerning long-term efficiency, S₂ is more efficient than S₁. The difference in long-term efficiency after a disruption depends on how much the degradation and recovery of system S₂ following the disruption differs from system S₁. The long-term efficiency difference ΔCU in this period is intimately related to resilience as explained in the next subsection.
For T_rec≤t: The difference in long-term efficiency after restoration from a disruption also depends on S₁ and S₂, but this period may be irrelevant (or even non-existent) if disruptions are recurrent events, that is, if there is a sequence of periods similar to the one between T_dis and T_rec.
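
The period-by-period comparison above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the sampled cost and functionality values, the helper names `trapezoid`, `unit_cost`, and `delta_cu`, and the sign convention (area of CU₁ minus CU₂, so that a positive value favors S₂) are all illustrative assumptions.

```python
# Illustrative sketch: all numbers are hypothetical, not data from the article.

def trapezoid(ys, dt=1.0):
    """Approximate the integral of evenly spaced samples with the trapezoid rule."""
    return sum((a + b) / 2.0 * dt for a, b in zip(ys, ys[1:]))

def unit_cost(costs, outputs):
    """Cost per unit CU = C / F at each sample."""
    return [c / f for c, f in zip(costs, outputs)]

def delta_cu(cu1, cu2, dt=1.0):
    """Area between the unit-cost curves (assumed sign: CU1 - CU2)."""
    return trapezoid([a - b for a, b in zip(cu1, cu2)], dt)

# Period T_pre <= t < T_dis: same output F0 = 100, but S2 pays for resilience.
pre_cu1 = unit_cost([1000] * 5, [100] * 5)
pre_cu2 = unit_cost([1100] * 5, [100] * 5)

# Period T_dis <= t < T_rec: S1 loses more output and incurs more cost.
dis_cu1 = unit_cost([1000, 1500, 1500, 1200, 1000], [100, 50, 60, 80, 100])
dis_cu2 = unit_cost([1100, 1200, 1100, 1100, 1100], [100, 80, 100, 100, 100])

short_term = delta_cu(pre_cu1, pre_cu2)  # negative: S1 is more efficient
long_term = delta_cu(dis_cu1, dis_cu2)   # positive: S2 is more efficient
print(short_term, long_term)
```

With these made-up inputs the short-term area is -4.0 and the post-disruption area is 32.0, so the later gain outweighs the early loss. Different inputs could easily reverse that balance, which is the point of evaluating both periods.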

Resilience. Let R be a measure of the resilience of a system. One way to define R is as the area under the curve of the system’s functionality F over the period T_dis≤t<T_rec following a disruption at T_dis (for example, a cyberattack). Two resilience properties are directly associated with the change in F: (i) the decrease in output F from the baseline F₀; (ii) the time the system takes to recover (that is, to return to the pre-disruption output level F₀).

Therefore, the area under the curve (that is, the integral of F over time) can be understood as a measure of resilience that combines the impact of the disruption and recovery to the previous output level. The larger the integral, the more resilient a system is. When a disruption occurs, system S₂ is more resilient than S₁ (a larger integral) when either the degradation in F₂ is less than the degradation in F₁, or the recovery of F₂ to the pre-disruption level is faster than the recovery of F₁. In the example illustrated in Figure 1(A), both conditions hold: the degradation in F₂ is less than the degradation in F₁, and the recovery of F₂ to the pre-disruption level is faster than the recovery of F₁. As with the case of efficiency, we can define a difference in resilience between two systems: ΔR = R₂ − R₁ = ∫ [F₂(t) − F₁(t)] dt, taken over T_dis≤t<T_rec.

Note that resilience R and long-term efficiency (as represented by the inverse of CU) are positively correlated in the period T_dis≤t<T_rec. A system with higher resilience R₂>R₁ will also be more efficient in the long term because it has a lower unit cost CU₂<CU₁ in the period T_dis≤t<T_rec. For example, if S₂ experiences a degradation in F that is 10% less than that of S₁, then both R and 1/CU for S₂ will be relatively higher than R and 1/CU for S₁. The takeaway argument is that a resilient system is also efficient in the long term: there is no conflict. For example, a healthcare system that is prepared for a relatively small degradation in patient appointments following a disruption will have both high resilience and high efficiency in volume of appointments per dollar.
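
As a rough illustration of the area-under-the-curve definition, the sketch below computes R for two hypothetical functionality curves sampled over T_dis≤t<T_rec. The sample values and the function name `resilience` are assumptions for illustration, and the normalization by F₀ times the duration is an added convenience that does not appear in the formulation above.

```python
# Illustrative sketch: functionality samples are hypothetical.

def resilience(outputs, dt=1.0):
    """Area under the functionality curve F(t), using the trapezoid rule."""
    return sum((a + b) / 2.0 * dt for a, b in zip(outputs, outputs[1:]))

F0 = 100.0                       # pre-disruption functionality
f1 = [100, 50, 60, 80, 100]      # S1: deep loss, slow recovery
f2 = [100, 80, 100, 100, 100]    # S2: shallow loss, fast recovery

r1 = resilience(f1)
r2 = resilience(f2)
delta_r = r2 - r1                # positive when S2 is more resilient

# Optional normalization (an assumption): 1.0 would mean no loss at all.
duration = len(f1) - 1
norm1 = r1 / (F0 * duration)
norm2 = r2 / (F0 * duration)
print(r1, r2, delta_r, norm1, norm2)
```

Both a smaller drop and a faster recovery enlarge the area, so a single number captures the two resilience properties listed earlier.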

Challenges with Optimization of Long-Term Efficiency

The formulation here implies that the optimal investment in resilience resources is the one that maximizes the difference in long-term efficiency after a disruption, ΔCU. However, finding such an optimal investment is challenging. Even though there is a clear goal, possible barriers to performing the optimization include uncertainty regarding the frequency/probability and impact of disruptions and uncertainty regarding the effect of each dollar invested in resilience with respect to reduction of impact and/or speed of recovery from disruptions.

The uncertainty regarding the frequency/probability and impact of disruptions has several implications that make the optimization for long-term efficiency hard. One is to judge threats to be outside the horizon of interest. For example, a manager might prioritize short-term efficiency as share prices and budgetary/time constraints often mean that short-term efficiency is the metric by which their, and the project’s, performance is gauged. Another implication is a refusal to invest in resilience against unknown threats. For example, a manager could refuse to prepare the system against attacks that decrypt information based on quantum computing, arguing that nobody knows what specific attacks and their impact will look like. If these implications are widespread across organizations and sectors, then there is yet another incentive to focus on short-term efficiency: “no one else is investing in resilience—if we do, we will not be competitive.”

The uncertainty regarding the effect of each dollar invested in resilience for reduction of impact and/or speed of recovery from disruptions also has implications that make the optimization challenging. Managers may assess that they do not know if anything will happen, and if it does, benefit-cost analysis of resilience cannot be performed because the resilience gain per dollar invested cannot be determined a priori. For these reasons, managers may struggle to choose between investment alternatives. Should one invest in thicker data center walls, anti-aircraft missiles, or anti-virus software? Moreover, the uncertainty regarding the return on investment may prevent managers from acting if they perceive that they are paying for other people or organizations to benefit.

Another challenge to optimizing long-term efficiency is that estimates of impacts need to be made for at least two scenarios: one without any investment in resilience (that is, S₁) and at least one where resources are added to improve resilience (that is, S₂). Moreover, if there are N>2 scenarios, then ΔR and ΔCU must be estimated for multiple pairs (S₁,S_i), i=2,…,N, to choose the scenario with the highest improvement in ΔR and ΔCU.
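
The pairwise comparison can be sketched as follows, assuming the functionality and cost curves of each scenario have already been estimated. All data, the scenario labels, and the helper `compare` are hypothetical. Ranking by ΔCU alone is one possible choice made for this sketch; the text does not prescribe how to combine ΔR and ΔCU.

```python
# Illustrative sketch: scenario data are hypothetical estimates.

def trapezoid(ys, dt=1.0):
    """Trapezoid-rule integral of evenly spaced samples."""
    return sum((a + b) / 2.0 * dt for a, b in zip(ys, ys[1:]))

def compare(baseline, candidate, dt=1.0):
    """Return (delta_R, delta_CU) of a candidate relative to the baseline S1."""
    f_base, c_base = baseline
    f_cand, c_cand = candidate
    delta_r = trapezoid(f_cand, dt) - trapezoid(f_base, dt)
    cu_gap = [cb / fb - cc / fc
              for fb, cb, fc, cc in zip(f_base, c_base, f_cand, c_cand)]
    return delta_r, trapezoid(cu_gap, dt)

# Each scenario is (functionality samples, total-cost samples) after T_dis.
s1 = ([100, 50, 60, 80, 100], [1000, 1500, 1500, 1200, 1000])
candidates = {
    "S2": ([100, 80, 100, 100, 100], [1100, 1200, 1100, 1100, 1100]),
    "S3": ([100] * 5, [1500] * 5),  # heavy investment, no loss of output
}

results = {name: compare(s1, s) for name, s in candidates.items()}
best = max(results, key=lambda name: results[name][1])  # rank by delta_CU
print(results, best)
```

In this made-up case S3 has the larger ΔR (110 versus 90) but S2 has the larger ΔCU (32 versus 20), which shows why both quantities need to be estimated for every pair before choosing.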

Approaches to Overcome Barriers and Optimize Long-Term Efficiency

The uncertainty about disruptions and resilience benefit from investment affects the optimization of long-term efficiency. Therefore, approaches that help reduce this uncertainty will result in investments in resilience that are close to optimal. One approach is to use historical data about past disruptions and past resilience investments to forecast future disruptions and investments. This data should ideally include past disruptions with information about impact and frequency on given systems and data about past investments in resilience that enable the assessment of resilience benefit per dollar (reduction and recovery from disruptions). One shortcoming of this approach is that it does not help optimize the investment for new threats (for example, malware that was never seen before).

Simulation is another approach to reduce the uncertainty about disruptions and resilience benefit from investment, informing the optimization of long-term efficiency. Simulations can be as simple as tabletop red teaming in cybersecurity, where subject matter experts estimate the probabilities and impact of unforeseen events, but the assumptions can be subjective and the results ineffective.¹⁰ The other extreme would be high-fidelity digital twins of cyber-physical systems with varied capabilities of functionality emulation. For example, NREL’s CEEP is considered a sophisticated platform for the simulation of cyber events.⁷ Either way, simulation of events with high impact and low probability (for example, quantum computer attacks), as well as simulation of distinct resilience measures (for example, placement of redundant servers versus cyber defense software or teams), may facilitate the optimization of long-term efficiency with less uncertainty.
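
At the simple end of that spectrum, a toy Monte Carlo model can already show how the disruption probability changes which configuration is more efficient. The sketch below is an assumption-laden illustration and not a model of any real platform: disruptions are independent per period, each configuration is summarized by a hypothetical base cost, loss fraction, and disruption cost, and the function name `simulate_unit_cost` is invented here.

```python
import random

# Illustrative sketch: parameters are hypothetical, disruptions are assumed
# independent across periods.

def simulate_unit_cost(base_cost, loss_fraction, disruption_cost,
                       p_disruption, periods=1000, f0=100.0, seed=0):
    """Estimate long-run cost per unit of output under random disruptions."""
    rng = random.Random(seed)
    total_cost = 0.0
    total_output = 0.0
    for _ in range(periods):
        disrupted = rng.random() < p_disruption
        total_cost += base_cost + (disruption_cost if disrupted else 0.0)
        total_output += f0 * ((1.0 - loss_fraction) if disrupted else 1.0)
    return total_cost / total_output

# S1: no resilience investment; S2: higher base cost, milder disruptions.
s1 = dict(base_cost=1000.0, loss_fraction=0.5, disruption_cost=500.0)
s2 = dict(base_cost=1100.0, loss_fraction=0.1, disruption_cost=100.0)

for p in (0.0, 0.05, 0.2, 1.0):
    cu1 = simulate_unit_cost(p_disruption=p, **s1)
    cu2 = simulate_unit_cost(p_disruption=p, **s2)
    print(p, round(cu1, 2), round(cu2, 2))
```

With no disruptions S1 is cheaper per unit, and with a disruption in every period S2 is cheaper. Sweeping the probability between those extremes gives a rough sense of where the investment starts to pay off under the stated assumptions.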

As future work, the approach proposed in this Viewpoint could benefit from insights from Pareto optimality methods. Previous work has discussed the Pareto front when optimizing criteria such as reliability and affordability (for example, Bhattacharya et al.²) and a similar approach could be used to explore equilibrium levels of resilience and efficiency.
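
One way such a Pareto analysis might look is sketched below, assuming each investment option has already been scored on total cost (lower is better) and resilience R (higher is better). The options, their scores, and the helper names `dominates` and `pareto_front` are hypothetical and are not taken from the cited work.

```python
# Illustrative sketch: options and scores are hypothetical.

def dominates(a, b):
    """True if a is no worse than b on both criteria and strictly better on one."""
    no_worse = a["cost"] <= b["cost"] and a["resilience"] >= b["resilience"]
    better = a["cost"] < b["cost"] or a["resilience"] > b["resilience"]
    return no_worse and better

def pareto_front(options):
    """Keep the options that no other option dominates."""
    return [o for o in options if not any(dominates(p, o) for p in options)]

options = [
    {"name": "A", "cost": 1000, "resilience": 290},
    {"name": "B", "cost": 1100, "resilience": 380},
    {"name": "C", "cost": 1500, "resilience": 400},
    {"name": "D", "cost": 1600, "resilience": 370},
]

front = [o["name"] for o in pareto_front(options)]
print(front)
```

Option D is dropped because another option is both cheaper and more resilient. The remaining options represent genuine trade-offs that a decision maker would still have to weigh.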

Illustration: WannaCry Ransomware and the NHS

In May 2017, malware of large-scale impact was released on the Internet. Using the EternalBlue vulnerability on Microsoft systems (made public earlier the same year), the worm, WannaCry, would infect a computer and encrypt its files, which were then held for ransom. The victim was offered the decryption key in exchange for a payment of the equivalent of 600 USD in Bitcoin (per infected device). After three days, if the ‘ransom’ was not paid, the files were permanently deleted. WannaCry spread to 230,000 computers in over 150 countries, costing the global economy approximately 4 billion USD,⁶ which represents total cost C in our formulation. While the NHS was not the intended target, it was among the hardest hit. The malware spread to 80 out of the 236 public hospitals in England, costing the Department of Health an estimated 120 million USD.¹⁸ This financial loss and the loss of output in terms of healthcare that could not be provided justify investments in resilience resources (as discussed previously). The global and U.K. losses are a tangible portion of the aggregate increase in cost that follows a disruption in our model. For example, an investment in resilience R that reduced impact and sped recovery for the NHS would prevent part or all of the loss of 120 million USD, or result in minimal impact on the cost per patient CU, preserving long-term efficiency. If the investment does not exceed the prevented loss, it is justifiable.

While we have discussed that unpredictable disruptions make it difficult to optimize investments, this attack was not unprecedented. Two hospitals in the NHS had fallen victim to ransomware attacks earlier, in October 2016²¹ and again in January 2017.⁸ As a result of the earlier attacks, the NHS was in the process of conducting on-site cyber security assessments, and all of the 88 hospitals that had been audited at the time had failed the assessment. Moreover, NHS Digital, the Health and Social Care entity in charge of IT, had issued guidance to hospitals to install the updates that would have prevented the success of WannaCry.

However, the resources required to mitigate the effect of the malware did not add value in terms of short-term efficiency. They streamline monitoring and provide clarity in a moment of emergency but make limited, if any, contribution to the day-to-day operation of a hospital, increasing costs in the short term. As a result, prior investments in resilience were possibly below the amount that would optimize long-term efficiency.

Conclusion

In this Viewpoint, we showed that resilience pays off. Adding resources for resilience to a system likely increases its costs initially without expanding functionality, causing an initial decline in short-term efficiency. However, cyber disruptions are increasingly likely (if not certain), and they decrease the system’s functionality and simultaneously increase its costs due to lost customers or users, lawsuits, and other damage. A system that is prepared for resilience has smaller declines in functionality and fewer cost overruns, and this advantage can more than compensate for the initial costs of adding resources for resilience. This not only improves resilience but optimizes long-term efficiency as well.

This is especially true for cyber systems and cyber-physical systems. For example, the electric grid is increasingly dependent on information and supervisory systems that control power generation, transmission, and distribution. Disruptions that impair the functionality of control systems may cause electric outages that put lives at risk and incur economic losses in businesses or missions that are beyond the boundaries of the grid. Resilience resources that mitigate those losses, therefore, result in a benefit to energy users that exceeds the upfront cost of the grid. Therefore, adding certain resources should improve both long-term efficiency and resilience.

However, there probably is a limit on such long-term efficiency and resilience gains. In the grid example, adding a redundant server to a single-server control system will probably result in enhanced long-term efficiency and resilience. Adding a third server, a fourth, and so forth will increase cost and probably will not enhance resilience as much as the second server. At some point, adding resources may not improve resilience at all while still adding to the overall cost and therefore becoming detrimental to long-term efficiency. Finding the level of investment in resilience that is optimal for a given system is challenging, if not impossible. While the cost of equipment, software, or personnel dedicated to cybersecurity and resilience is generally known, its benefit depends on the intensity and frequency of future threats.
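
The diminishing return from each extra server can be illustrated with a deliberately simple model. The sketch below assumes that each of n redundant servers fails independently with probability q during a disruption and that an outage occurs only if all of them fail. This independence assumption is a strong one that correlated cyber attacks can violate, and the cost figures and function names are hypothetical.

```python
# Illustrative sketch: independent failures and all numbers are assumptions.

def expected_cost(n, server_cost, outage_loss, q):
    """Cost of n servers plus expected loss if all n fail (probability q**n)."""
    return n * server_cost + (q ** n) * outage_loss

def marginal_gain(n, q):
    """Reduction in outage probability from adding the n-th server."""
    return q ** (n - 1) - q ** n

Q = 0.1               # assumed per-server failure probability
SERVER_COST = 10.0    # assumed cost of one server
OUTAGE_LOSS = 1000.0  # assumed loss from a full outage

costs = {n: expected_cost(n, SERVER_COST, OUTAGE_LOSS, Q) for n in range(1, 6)}
best_n = min(costs, key=costs.get)
gains = [marginal_gain(n, Q) for n in range(2, 6)]
print(costs, best_n, gains)
```

Under these assumptions the second server cuts expected cost sharply, while the third and later servers add more in cost than they remove in expected loss. The optimum shifts as soon as the failure probability or the size of the loss changes, which is why the benefit side of the calculation is the hard part.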

DOI: 10.1145/3549073

April 2023 Issue

Published: April 1, 2023

Vol. 66 No. 4

Pages: 33-37
