On the afternoon of May 6, 2010, the U.S. equity markets experienced an extraordinary upheaval. Over approximately 10 minutes, the Dow Jones Industrial Average dropped more than 600 points, representing the disappearance of approximately $800 billion of market value. The share price of several blue-chip multinational companies fluctuated dramatically; shares that had been at tens of dollars plummeted to a penny in some cases and rocketed to values over $100,000 per share in others. As suddenly as this market downturn occurred, it reversed, so over the next few minutes most of the loss was recovered and share prices returned to levels close to what they had been before the crash.
Key Insights
- Coalitions of systems, in which the system elements are managed and owned independently, pose challenging new problems for systems engineering.
- When the fundamental basis of engineering—reductionism—breaks down, incremental improvements to current engineering techniques are unable to address the challenges of developing, integrating, and deploying large-scale complex IT systems.
- Developing complex systems requires a socio-technical perspective involving human, organizational, social, and political factors, as well as technical factors.
This event came to be known as the “Flash Crash,” and, in the inquiry report published six months later,7 the trigger event was identified as a single block sale of $4.1 billion of futures contracts executed with uncommon urgency on behalf of a fund-management company. That sale began a complex pattern of interactions between the high-frequency algorithmic trading systems (algos) that buy and sell blocks of financial instruments on incredibly short timescales.
A software bug did not cause the Flash Crash; rather, the interactions of independently managed software systems created conditions unforeseen (probably unforeseeable) by the owners and developers of the trading systems. Within seconds, the result was a failure in the broader socio-technical markets that increasingly rely on the algos (see the sidebar “Socio-Technical Systems”).
Society depends on complex IT systems created by integrating and orchestrating independently managed systems. The incredible increase in scale and complexity in them over the past decade means new software-engineering techniques are needed to help us cope with their inherent complexity. Here, we explain the principal reasons today’s software-engineering methods and tools do not scale, proposing a research and education agenda to help address the inherent problems of large-scale complex IT systems, or LSCITS, engineering.
Coalitions of Systems
The key characteristic of these systems is that they are assembled from other systems that are independently controlled and managed. While there is increasing awareness in the software-engineering community of related issues,10 the most relevant background work comes from systems engineering. Systems engineering focuses on developing systems as a whole, as defined by the International Council on Systems Engineering (http://www.incose.org/): “Systems engineering integrates all the disciplines and specialty groups into a team effort forming a structured development process that proceeds from concept to production to operation. Systems engineering considers both the business and the technical needs of all customers with the goal of providing a quality product that meets the user needs.”
Systems engineering emerged to take a systemwide perspective on complex engineered systems involving structures and electrical and mechanical systems. Almost all systems today are software-intensive, and systems engineers address the challenge of constructing ultra-large-scale software systems.17 The most relevant aspect of systems engineering is work on “system of systems,” or SoS,12 about which Maier said the distinction between SoS and complex monolithic systems is that SoS elements are operationally and managerially independent. Characterizing SoS, he covered a range of systems, from directed (developed for a particular purpose) to virtual (lacking a central management authority or centrally agreed purpose). LSCITS is a type of SoS in which the elements are owned and managed by different organizations. In this classification, the collection of systems that led to the Flash Crash (an LSCITS) would be called a “virtual system of systems.” However, since Maier’s article was published in 1998, the word “virtual” has generally taken on a different meaning (as in virtual machines); consequently, we propose an alternative term that we find more descriptive: “coalition of systems.”
The systems in a coalition of systems work together, sometimes reluctantly, as doing so is in their mutual interest. Coalitions of systems are not explicitly designed but come into existence as different member systems interact according to agreed-upon protocols. Like political coalitions, there might even be hostility between various members, and members enter and leave according to their interpretation of their own best interests.
The interacting algos that led to the Flash Crash represent an example of a coalition of systems, serving the purposes of their owners and cooperating only because they have to. The owners of the individual systems were competing finance companies that were often mutually hostile. Each system jealously guarded its own information and could change without consulting any other system.
Dynamic coalitions of software-intensive systems are a challenge for software engineering. Designing dependability into the coalition is not possible, as there is no overall design authority, nor is it possible to centrally control the behavior of individual systems. The systems in the coalition can change unpredictably or be completely replaced, and the organizations running them might themselves cease to exist. Coalition “design” involves the protocols for communications, and each organization using the coalition orchestrates the constituent systems its own way. However, the designers and managers of each individual system must consider how to make it robust enough to ensure their organizations are not threatened by failure or undesirable behavior elsewhere in the coalition.
System Complexity
Complexity stems from the number and type of relationships between the system’s components and between the system and its environment. If a relatively small number of relationships exist between system components and they change relatively slowly over time, then engineers can develop deterministic models of the system and make predictions concerning its properties.
However, when the elements in a system involve many dynamic relationships, complexity is inevitable. Complex systems are nondeterministic, and system characteristics cannot be predicted by analyzing the system’s constituents. Such characteristics emerge when the whole system is put to use and change over time, depending on how it is used and on the state of its external environment.
Dynamic relationships include those between system elements and the system’s environment that change. For example, a trust relationship is a dynamic relationship; initially, component A might not trust component B, so, following some interchange, A checks that B has performed as expected. Over time, these checks may be reduced in scope as A’s trust in B increases. However, some failure in B may profoundly influence that trust, and, after the failure, even more stringent checks might be introduced.
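The trust example can be made concrete with a minimal sketch. It assumes a hypothetical model in which A verifies a fraction of B’s responses, relaxes that fraction after each verified success, and returns to checking everything after a failure. The class name, the decay factor, and the floor on the check rate are illustrative assumptions, not part of any real system.

```python
from dataclasses import dataclass


@dataclass
class TrustTracker:
    """Tracks how closely component A checks component B (hypothetical model)."""

    check_rate: float = 1.0   # fraction of B's responses that A verifies
    min_rate: float = 0.05    # assumed floor: A never stops checking entirely
    decay: float = 0.9        # assumed multiplier applied after each success

    def record_success(self) -> None:
        # Each verified success lets A relax its checks slightly.
        self.check_rate = max(self.min_rate, self.check_rate * self.decay)

    def record_failure(self) -> None:
        # A single failure sends A back to verifying every response.
        self.check_rate = 1.0


tracker = TrustTracker()
for _ in range(20):
    tracker.record_success()
rate_after_successes = tracker.check_rate   # lower than the initial rate

tracker.record_failure()
rate_after_failure = tracker.check_rate     # full checking is restored
```

Even in this toy form, the relationship between A and B depends on the history of their interaction, which is why it cannot be fixed at design time.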
Complexity stemming from the dynamic relationships between elements in a system depends on the existence and nature of these relationships. Engineers cannot analyze this inherent complexity during system development, as it depends on the system’s dynamic operating environment. Coalitions of systems in which elements are large software systems are always inherently complex. The relationships between the elements of the coalition change because they are not independent of how the systems are used or of the nature of their operating environments. Consequently, the nonfunctional (often even the functional) behavior of coalitions of systems is emergent and impossible to predict completely.
Even when the relationships between system elements are simpler, relatively static, and, in principle, understandable, there may be so many elements and relationships that understanding them is practically impossible. Such complexity is called “epistemic complexity” due to our lack of knowledge of the system rather than some inherent system characteristics.16 For example, it may be possible in principle to deduce the traceability relationships between requirements and design, but, if appropriate tools are not available, doing so may be practically impossible.
If you do not know enough about a system’s components and their relationships, you cannot make predictions about overall behavior, even if the system lacks dynamic relationships between its elements. Epistemic complexity increases with system size; as ever-larger systems are built, they are inevitably more difficult to understand and their behavior and properties more difficult to predict. This distinction between inherent and epistemic complexity is important. As discussed in the following section, it is the primary reason new approaches to software engineering are needed.
Reductionism and Software Engineering
In some respects, software engineering has been incredibly successful. Compared to the systems built in the 1970s and 1980s, modern software is much larger, more complex, more reliable, and often developed more quickly. Software products deliver astonishing functionality at relatively low cost.
Software engineering has focused on reducing and managing epistemic complexity, so, where inherent complexity is relatively limited and a single organization controls all system elements, software engineering is highly effective. However, for coalitions of systems with a high degree of inherent complexity, today’s software engineering techniques are inadequate.
This is reflected in the failure that is all too common in large government-funded projects. The software may be delivered late, be more expensive to develop than anticipated, and be inadequate for the needs of its users. An example of such a project was the attempt, from 2000 to 2010, to automate U.K. health records; the project was ultimately abandoned at a cost estimated at $5 billion to $10 billion.19
The fundamental reason today’s software engineering cannot effectively manage inherent complexity is that its basis is in developing individual programs rather than in interacting systems. The consequence is that software-engineering methods are unsuitable for building LSCITS. To appreciate why, we need to examine the essential divide-and-conquer reductionist assumption that is the basis of all modern engineering.
Reductionism is a philosophical position that a complex system is no more than the sum of its parts, and that an account of the overall system can be reduced to accounts of individual constituents. From an engineering perspective, this means systems engineers must be able to design a system so it is composed of discrete smaller parts and interfaces allowing the parts to work together. A systems engineer then builds the system elements and integrates them to create the desired overall system.
Researchers generally adopt this reductionist assumption, and their work concerns finding better ways to decompose problems or systems (such as software architecture), better ways to create the parts of the system (such as object-oriented techniques), or better ways to do system integration (such as test-first development). Underlying all software-engineering methods and techniques (see Figure 1) are three reductionist assumptions:
System owners control system development. A reductionist perspective takes the view that an ultimate controller has the authority to take decisions about a system and is therefore able to enforce decisions on, say, how components interact. However, when systems consist of independently owned and managed elements, there is no such owner or controller and no central authority to take or enforce design decisions;
Decisions are rational, driven by technical criteria. In practice, decision making in organizations is profoundly influenced by political considerations, with actors striving to maintain or improve their current positions to avoid losing face. Technical considerations are rarely the most significant factor in large-system decision making; and
The problem is definable, and system boundaries are clear. The nature of “wicked problems”15 is that the “problem” is constantly changing, depending on the perceptions and status of stakeholders. As stakeholder positions change, the boundaries are likewise redefined.
However, for coalitions of systems, these assumptions never hold true, and many software project “failures,” where software is delivered late and/or over budget, are a consequence of adherence to the reductionist view. To help address inherent complexity, software engineering must look toward the systems, people, and organizations that make up a software system’s environment. We need to represent, analyze, model, and simulate potential operational environments for coalitions of systems to help us understand and manage, so far as possible, the complex relationships in the coalition.
Challenges
Since 2006, initiatives in the U.S. and in Europe have sought to address engineering large coalitions of systems. In the U.S., a report by the influential Software Engineering Institute at Carnegie Mellon University (http://www.sei.cmu.edu/) on ultra-large-scale systems (ULSS)13 triggered research leading to creation of the Center for Ultra-Large Scale Software-Intensive Systems, or ULSSIS (http://ulssis.cs.virginia.edu/ULSSIS), a research consortium involving the University of Virginia, Michigan State University, Vanderbilt University, and the University of California, San Diego. In the U.K., the comparable LSCITS Initiative addresses problems of inherent and epistemic complexity in LSCITS, while Hillary Sillitto, a senior systems architect at Thales Land & Joint Systems U.K., has proposed ULSS design principles.17
Northrop et al.13 made the point that developing ultra-large-scale systems needs to go beyond incremental improvements to current methods, identifying seven important research areas: human interaction; computational emergence; design; computational engineering; adaptive system infrastructure; adaptable and predictable system quality; and policy, acquisition, and management. The SEI ULSS report suggested it is essential to deploy expertise from a range of disciplines to address these challenges.
We agree the research required is interdisciplinary and that incremental improvement in existing techniques is unable to address the long-term software-engineering challenges of ultra-large-scale systems engineering. However, a weakness of the SEI report was its failure to set out a roadmap outlining how large-scale systems engineering can get from where it is today to the research it proposed.
Software engineers worldwide creating large complex software systems require more immediate, perhaps more incremental, research, driven by the practical problems of complex IT systems engineering. The pragmatic proposals we outline here begin to address some of them, aiming for medium-term, as well as a longer-term, impact on LSCITS engineering.
The research topics we propose here might be viewed as part of the roadmap that could lead us from current practice to LSCITS engineering. We see them as a bridge between the short- and medium-term imperative to improve our ability to create coalitions of systems and the longer-term vision set out in the SEI ULSS report.
Developing coalitions of systems involves both engineering individual systems to work within a coalition and orchestrating and configuring the coalition to meet organizational needs. Based on the ideas in the SEI ULSS report and on our own experience in the U.K. LSCITS Initiative, we have identified 10 questions that can help define a research agenda for future LSCITS software engineering:
How can interactions between independent systems be modeled and simulated? To help understand and manage coalitions of systems, LSCITS engineers need dynamic models that are updated in real time with information from the system itself. These models are needed to help make what-if assessments of the consequences of system-change options. This requires new performance- and failure-modeling techniques where the models adapt automatically in response to system-monitoring data. We do not suggest simulations can be complete or predict all possible problems. However, other engineering disciplines (such as civil and aeronautical engineering) have benefited enormously from simulation, and comparable benefits could be achieved for software engineering.
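As a rough illustration of a model that adapts to monitoring data and supports what-if assessment, the following sketch uses a textbook single-server (M/M/1) queue, a deliberately simple stand-in that is far from the new modeling techniques called for here. The class name, the smoothing factor, and the sample values are all assumptions made for the example.

```python
class AdaptivePerformanceModel:
    """Toy M/M/1 model whose service-time estimate tracks monitoring data."""

    def __init__(self, service_time: float, smoothing: float = 0.2):
        self.service_time = service_time  # estimated mean seconds per request
        self.smoothing = smoothing        # assumed weight for each new observation

    def update(self, observed_service_time: float) -> None:
        # An exponential moving average pulls the model toward recent observations.
        self.service_time += self.smoothing * (observed_service_time - self.service_time)

    def what_if(self, arrival_rate: float) -> float:
        """Predicted mean response time at `arrival_rate` requests per second."""
        utilization = arrival_rate * self.service_time
        if utilization >= 1.0:
            return float("inf")  # the queue grows without bound
        return self.service_time / (1.0 - utilization)


model = AdaptivePerformanceModel(service_time=0.010)
for sample in [0.012, 0.012, 0.012]:   # monitoring shows the service slowing down
    model.update(sample)

current = model.what_if(arrival_rate=40.0)   # prediction at today's load
doubled = model.what_if(arrival_rate=80.0)   # what if the load doubles?
```

A model this simple ignores interactions between independently managed systems, which is exactly where the research challenge lies; it shows only the update-then-query pattern.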
How can coalitions of systems be monitored? And what are the warning signs problems produce? In the run-up to the Flash Crash, no warning signs indicated the market was tending toward an unstable state. To help avoid transition to an unstable system state, systems engineers need to know which indicators provide information about the state of the coalition of systems, how those indicators may be used to provide early warnings of system problems, and how, if necessary, to switch to safe-mode operating conditions that prevent damage.
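The mechanics of turning an indicator into an early warning and a safe-mode switch can be sketched as follows. Which indicators are actually predictive remains the open research question; the sketch simply assumes a single positive-valued indicator, a rolling baseline, and two hypothetical thresholds.

```python
from collections import deque
from statistics import mean


class CoalitionMonitor:
    """Watches one health indicator and suggests an operating mode."""

    def __init__(self, window: int = 5, warn: float = 0.02, halt: float = 0.05):
        self.history = deque(maxlen=window)  # most recent indicator values
        self.warn = warn   # assumed relative deviation that triggers a warning
        self.halt = halt   # assumed relative deviation that triggers safe mode

    def observe(self, value: float) -> str:
        """Return 'normal', 'warning', or 'safe_mode' for a new reading."""
        if len(self.history) < self.history.maxlen:
            self.history.append(value)
            return "normal"  # not enough data yet to form a baseline
        baseline = mean(self.history)  # assumes the indicator stays positive
        deviation = abs(value - baseline) / baseline
        self.history.append(value)
        if deviation >= self.halt:
            return "safe_mode"
        if deviation >= self.warn:
            return "warning"
        return "normal"


monitor = CoalitionMonitor()
readings = [100.0, 100.2, 99.9, 100.1, 100.0, 97.5, 90.0]
modes = [monitor.observe(r) for r in readings]
```

With these illustrative readings, the sharp drop at the end first raises a warning and then suggests safe mode. In a real coalition, no single member sees all the data, so deciding who computes such indicators is itself part of the problem.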
How can systems be designed to recover from failure? A fundamental principle of software engineering is that systems should be built so they do not fail, leading to development of methods and tools based on fault avoidance, fault detection, and fault tolerance. However, as coalitions of systems are constructed with independently managed elements and negotiated requirements, avoiding failure is increasingly impractical. Indeed, what seems to be a failure for some users may have no effect on others. Because some failures are ambiguous, automated systems cannot cope on their own. Human operators must use information from the system, intervening to enable it to recover from failure. This means understanding the socio-technical processes of failure recovery, the support the operators need, and how to design coalition members to be “good citizens” able to support failure recovery.
How can socio-technical factors be integrated into systems and software-engineering methods? Software- and systems-engineering methods support development of technical systems and generally consider human, social, and organizational issues to be outside the system’s boundary. However, such nontechnical factors significantly affect development, integration, and operation of coalitions of systems. Though a considerable body of work covers socio-technical systems, it has not been industrialized or made accessible to practitioners. Baxter and Sommerville2 surveyed this work and proposed a route to industrial-scale use of socio-technical methods. However, much more research and experience is required before socio-technical analyses are used routinely for complex systems engineering.
To what extent can coalitions of systems be self-managing? Needed is research into self-management so systems are able to detect changes in both their operation and operational environment and dynamically reconfigure themselves to cope with the changes. The danger is that reconfiguration will create further complexity, so a key requirement is for the techniques to operate in a safe, predictable, auditable way ensuring self-management does not conflict with “design for recovery.”
How can organizations manage complex, dynamically changing system configurations? Coalitions of systems will be constructed through orchestration and configuration, and desired system configurations will change dynamically in response to load, indicators of system health, unavailability of components, and system-health warnings. Ways are needed to support construction by configuration, managing configuration changes and recording changes, including automated changes from the self-management system, in real time, so an audit trail includes the configuration of the coalition at any point in time.
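One way to picture such an audit trail is an append-only change log from which the configuration at any past moment can be replayed. The sketch is an assumption-laden illustration: the class, the field names, and the service and provider labels are all hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class ConfigAuditTrail:
    """Append-only log of configuration changes for a coalition."""

    times: list = field(default_factory=list)    # change timestamps, ascending
    entries: list = field(default_factory=list)  # (key, value, source) per change

    def record(self, time: float, key: str, value: str, source: str) -> None:
        if self.times and time < self.times[-1]:
            raise ValueError("changes must be recorded in time order")
        self.times.append(time)
        self.entries.append((key, value, source))

    def config_at(self, time: float) -> dict:
        """Replay every change made at or before `time`."""
        config = {}
        for key, value, _source in self.entries[: bisect_right(self.times, time)]:
            config[key] = value
        return config


trail = ConfigAuditTrail()
trail.record(1.0, "pricing_service", "provider-a", source="operator")
trail.record(2.0, "risk_service", "provider-b", source="operator")
trail.record(3.0, "pricing_service", "provider-c", source="self-management")

snapshot = trail.config_at(2.5)   # configuration before the automated change
latest = trail.config_at(3.0)     # configuration after it
```

Recording the source of each change keeps automated reconfigurations distinguishable from operator actions when the trail is later reviewed.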
How should the agile engineering of coalitions of systems be supported? The business environment changes quickly in response to economic circumstances, competition, and business reorganization. Likewise, coalitions of systems must be able to change quickly to reflect current business needs. A model of system change that relies on lengthy processes of requirements analysis and approval does not work. Agile methods of programming have been successful for small- to medium-size systems where the dominant activity is systems development. For large complex systems, development processes are often dominated by coordination activities involving multiple stakeholders and engineers who are not colocated. How can agile approaches be effective for “systems development in the large” to support multi-organization global systems development?
How should coalitions of systems be regulated and certified? Many such coalitions represent critical systems, failure of which could threaten individuals, organizations, and national economies. They may have to be certified by regulators checking that, as far as possible, they do not pose a threat to their operators or to the wider systems environment. But certification is expensive. For some safety-critical systems, the cost of certification can exceed the cost of development, and certification costs will increase as systems become larger and more complex. Though certification as practiced today is almost certainly impossible for coalitions of systems, research is urgently needed into incremental and evolutionary certification so our ability to deploy critical complex systems is not curtailed by certification requirements. This issue is social, as well as technical, as societies decide what level of certification is socially and legally acceptable.
How can systems undergo “probabilistic verification”? Today’s techniques of system testing and more formal analysis are based on the assumption that a system involves a definitive specification and that behavior deviating from it is recognized. Coalitions of systems have no such specification nor is system behavior guaranteed to be deterministic. The key verification issue will not be whether the system is correct but the probability that it satisfies essential properties (such as safety) that take into account its probabilistic, real-time, nondeterministic behavior.8,11
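To give a flavor of asking “how probable is it that the property holds?” rather than “is the system correct?”, the following sketch estimates that probability by repeated sampling and attaches an error margin from Hoeffding’s inequality. This is a simple sampling approach chosen for illustration, not the formal techniques in the cited work, and `simulate_run` is a hypothetical stand-in with an assumed 2% violation rate.

```python
import math
import random


def simulate_run(rng: random.Random) -> bool:
    """Hypothetical stand-in for one run of a nondeterministic system.

    Returns True when the safety property held during the run.
    """
    # Assumed toy model: the property is violated in about 2% of runs.
    return rng.random() >= 0.02


def estimate_safety(runs: int, confidence: float, seed: int = 0):
    """Estimate P(property holds) and a Hoeffding error margin."""
    rng = random.Random(seed)
    successes = sum(simulate_run(rng) for _ in range(runs))
    estimate = successes / runs
    # Hoeffding: P(|estimate - p| >= eps) <= 2 * exp(-2 * runs * eps**2),
    # solved for eps at the requested confidence level.
    margin = math.sqrt(math.log(2 / (1 - confidence)) / (2 * runs))
    return estimate, margin


estimate, margin = estimate_safety(runs=10_000, confidence=0.95)
```

The margin shrinks only with the square root of the number of runs, so demonstrating very small violation probabilities by sampling alone quickly becomes expensive.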
How should shared knowledge in a coalition of systems be represented? We assume the systems in a coalition interact through service interfaces so the system has no overarching controller. Information is encoded in a standards-based representation. The key problem will not be compatibility but understanding what the information exchange really means. This is addressed today on a system-by-system basis through negotiation between system owners to clarify the meaning of shared information. However, if dynamic coalitions are allowed, with systems entering and leaving the coalition, negotiation is not practical. The key is developing a means of sharing the meaning of information, perhaps through ontologies like those proposed by Antoniou and van Harmelen1 involving the semantic Web.
A major problem researchers must address is lack of knowledge of what happens in real systems. High-profile failures (such as the Flash Crash) lead to inquiries, but more needs to be known about the practical difficulties faced by developers and operators of coalitions of systems and how to address them as they arise. New ideas, tools, and methods must be supported by long-term empirical studies of the systems and their development processes to provide a solid information base for research and innovation.
The U.K. LSCITS Initiative5 addresses some of these issues, working with partners from the computer, financial services, and health-care industries to develop an understanding of the fundamental systems engineering problems they face. Key to this work is a long-term engagement with the National Health Information Center to create coalitions of systems to provide external access to and analysis of vast amounts of health and patient data.
The project is developing practical techniques of socio-technical systems engineering2 and exploring design for failure.18 It has so far developed practical, predictable techniques for autonomic system management3,4 and is investigating the scaling up of agile methods14 and exploring incremental system certification9 and development of techniques for system simulation and modeling.
Education. To address the practical issues of creating, managing, and operating LSCITS, engineers need knowledge and understanding of the systems and of techniques outside a “normal” software- or systems-engineering education. In the U.K., the LSCITS Initiative provides a new kind of doctoral degree, comparable in standard to a Ph.D. in computer science or engineering. Students get an engineering doctorate, or EngD, in LSCITS,20 with the following key differences between EngD and Ph.D.:
Industrial problems. Students must work on and spend significant time on an industrial problem. Universities cannot simply replicate the complexity of modern software-intensive systems, with few faculty members having experience and understanding of the systems;
Range of courses. Students must take a range of courses focusing on complexity and systems engineering (such as for LSCITS, socio-technical systems, high-integrity systems engineering, empirical methods, and technology innovation); and
Portfolio of work. Students do not have to deliver a conventional thesis, a book on a single topic, but can deliver a portfolio of work around their selected area; it is a better reflection of work in industry and makes it easier for the topic to evolve as systems change and new research emerges.
However, graduating a few advanced doctoral students is not enough. Universities and industry must also create master’s courses that educate complex-systems engineers for the coming decades; our thoughts on what might be covered are outlined in Figure 2. The courses must be multidisciplinary, combining engineering and business disciplines. It is not only the knowledge the disciplines bring that is important but also that students be sensitized to the perspectives of a variety of disciplines and so move beyond the silo of single-discipline thinking.
Conclusion
Since the emergence of widespread networking in the 1990s, all societies have grown increasingly dependent on complex software-intensive systems, with failure having profound social and economic consequences. Industrial organizations and government agencies alike build these systems without understanding how to analyze their behavior and without appropriate engineering principles to support their construction.
The SEI ULSS report13 argued that current engineering methods are inadequate, saying: “For 40 years, we have embraced the traditional engineering perspective. The basic premise underlying the research agenda presented in this document is that beyond certain complexity thresholds, a traditional centralized engineering perspective is no longer adequate nor can it be the primary means by which ultra-complex systems are made real.” A key contribution of our work in LSCITS is articulating the fundamental reasons this assertion is true. Because engineering has its basis in the philosophical notion of reductionism, and reductionism breaks down in the face of complexity, it is inevitable that traditional software-engineering methods will fail when used to develop LSCITS. Current software engineering is simply not good enough. We need to think differently to address the urgent need for new engineering approaches to help construct large-scale complex coalitions of systems we can trust.
Acknowledgments
We would like to thank our colleagues Gordon Baxter and John Rooksby of St. Andrews University in Scotland and Hillary Sillitto of Thales Land & Joint Systems U.K. for their constructive comments on drafts of this article. The work reported here was partially funded by the U.K. Engineering and Physical Sciences Research Council (www.epsrc.ac.uk) grant EP/F001096/1.