Self-managed systems and services

Introduction

By Jean-Philippe Martin-Flatin, Joe Sventek, and Kurt Geihs

Posted Mar 1 2006

Stable and dependable IT services and infrastructures are of paramount importance for modern enterprises. Traditional solutions for managing and controlling the networked systems and services that sustain them appear to have reached their limits. As the number of hardware and software components to be managed increases, and as the dependencies between user-perceived services and the underlying networks and systems become ever more intricate and dynamic, management systems have grown increasingly complex. They do not scale easily, are difficult to configure and use, and struggle to correlate service-level problems with problems in the underlying infrastructure. Even worse, the algorithms used in management platforms (for event correlation, for example) are gradually shifting from science to art. This trend calls into question the ability of existing management methodologies to cope with tomorrow’s enterprise IT infrastructures and services.

To address these challenges, researchers have experimented for several years with alternative paradigms that organize and structure systems and services differently so that they are easier to manage. Since centralized or weakly distributed solutions no longer work, we need to migrate to strongly distributed management solutions in which management decisions (including resource allocations, security settings, and device reconfigurations) are highly decentralized and made autonomously by network devices and systems [3, 4].

After spending several years wandering along the trail of agents (mobile or intelligent), which proved to be of limited interest for building scalable, flexible, and reliable management systems, the management community is now turning to self-management, which is also attracting considerable interest in the distributed systems [1, 2] and software engineering [5] communities. This interest builds on the success already attained by self-organizing and self-adaptive systems in other disciplines, such as distributed artificial intelligence, materials science, and thermodynamics. There is considerable controversy, however, regarding the exact definition of a self-managing system and how such systems can be engineered. Biological and sociological phenomena often serve as inspirations and guidelines for the design of self-organizing systems.

To investigate these issues and facilitate cross-pollination between different research communities, an interdisciplinary workshop was held in May 2005 in Nice, France. The goal of the First International Workshop on Self-Managed Systems and Services (SelfMan 2005) [6] was to inspire participants from different backgrounds to analyze and discuss the potential of self-* properties (for example, self-management, self-organization, self-adaptiveness, self-monitoring, self-tuning, self-repair, and self-configuration) for managing and controlling networked systems and services [1]. The articles in this special section are derived from SelfMan 2005 and offer a high degree of topical diversity to better acquaint readers with this wide-ranging subject area.

In the first article, Robertson and Williams describe a mechanism for increasing the robustness of software systems. Rather than trying to identify all possible failures of a system and specify a response for each of them, they add dynamic fault awareness and recovery capabilities to existing systems. This makes it possible to identify unanticipated failures and find workarounds for them. To enable a software system to reason about its behavior, the authors complement it with models of the causal relationships between its components, models of intended behavior, and models of correct execution; when available, models of known failures are also used. By sensing its state, reasoning about the difference between the expected state and the observed state, and modifying its running software, the system is able to improve its dependability over time. The example presented in this article pertains to Mars rovers used in the MIT Model-Based Embedded and Robotic Systems testbed.
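
The cycle of sensing state, comparing it with a model of intended behavior, and modifying the running software can be illustrated with a minimal sketch. This is not the authors' implementation: the component names, the `expected` model, the tolerance, and the `fallbacks` table are all hypothetical, chosen only to show the shape of the loop.

```python
from typing import Dict, List


def diagnose(expected: Dict[str, float], observed: Dict[str, float],
             tolerance: float = 0.1) -> List[str]:
    """Return the components whose observed state deviates from the model."""
    suspects = []
    for name, want in expected.items():
        got = observed.get(name)
        # A missing reading, or one outside the tolerance, counts as a fault.
        if got is None or abs(got - want) > tolerance:
            suspects.append(name)
    return suspects


def recover(suspects: List[str], active: Dict[str, str],
            fallbacks: Dict[str, str]) -> Dict[str, str]:
    """Swap each suspect component for its fallback, if one is known."""
    updated = dict(active)
    for name in suspects:
        if name in fallbacks:
            updated[name] = fallbacks[name]
    return updated


# Hypothetical rover readings: the camera frame rate is far below the model.
expected = {"wheel_speed": 1.0, "camera_fps": 30.0}
observed = {"wheel_speed": 1.02, "camera_fps": 12.0}
active = {"wheel_speed": "encoder_v1", "camera_fps": "stereo_driver"}
fallbacks = {"camera_fps": "mono_driver"}

suspects = diagnose(expected, observed)
active = recover(suspects, active, fallbacks)
```

In this sketch the fault is not enumerated in advance; it is inferred from the gap between the expected and the observed state, which is the property the paragraph above emphasizes.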

Porter and Katz examine how to design self-adaptive Web services that can cope with transient overload. By using simple statistical techniques, they uncover the ripple effects of a given Web service request in multi-tier systems and identify the component in which the performance problem lies. Middleware components are monitored in a black-box manner, and the statistical findings are displayed using a visualization tool. The staff in charge of monitoring networks and systems can then use this tool to diagnose and pinpoint performance problems and make management decisions accordingly, for example, by activating a suitable admission control scheme. These ideas were benchmarked using an auction Web site. The results show a large increase in the number of pages served per second and a large decrease in the worst-case response time of the Web site.

Rolia et al. propose a self-managing system for assigning resources from a shared and overbooked resource pool to different applications with diverse Quality of Service (QoS) requirements. To increase the chances that application QoS objectives are satisfied, workload managers (containers that control access to shares of resource capacity) monitor their workload demands and dynamically adjust their allocation of resources to the current application workloads. Each workload manager supports two classes of service (guaranteed service and best-effort service) characterized by different allocation priorities (high and low). Workload managers automatically divide the workload demands of a given application into these two classes of services to meet the application QoS requirements specified by the application owner. The ideas proposed in this article are validated by a case study comprising 26 applications from a large enterprise order entry system.
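
One way to picture the two classes of service is the sketch below. It is an illustration under stated assumptions, not the scheme from the article: the per-application `guaranteed_cap`, the proportional sharing of leftover capacity, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    app: str
    guaranteed: float   # high-priority part of the demand
    best_effort: float  # low-priority part of the demand


def split_demand(app: str, demand: float, guaranteed_cap: float) -> Request:
    """Split an application's demand into the two service classes.

    Demand up to guaranteed_cap (a hypothetical limit derived from the
    owner's QoS requirements) is requested as guaranteed service; the
    remainder is requested as best-effort service.
    """
    guaranteed = min(demand, guaranteed_cap)
    return Request(app, guaranteed, demand - guaranteed)


def allocate(requests: List[Request], capacity: float) -> Dict[str, float]:
    """Serve guaranteed requests first, then share what is left.

    Assumes the pool is sized so that guaranteed requests always fit.
    """
    grants = {r.app: r.guaranteed for r in requests}
    remaining = capacity - sum(grants.values())
    if remaining < 0:
        raise ValueError("guaranteed demand exceeds pool capacity")
    wanted = sum(r.best_effort for r in requests)
    # Scale best-effort grants down proportionally when the pool is short.
    scale = min(1.0, remaining / wanted) if wanted > 0 else 0.0
    for r in requests:
        grants[r.app] += r.best_effort * scale
    return grants


requests = [split_demand("orders", 6.0, 4.0), split_demand("billing", 3.0, 2.0)]
grants = allocate(requests, capacity=8.0)
```

With a pool of 8 units and a total demand of 9, both applications receive their guaranteed share in full, and only the best-effort share is reduced.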

Leibnitz et al. exploit a biological model known as Adaptive Response by Attractor Selection, which models how a specific type of cell adapts to changes in the availability of a nutrient in its environment, to route packets in a self-organized manner in packet-switched overlay networks. These networks are typically used in peer-to-peer (P2P) computing, mobile ad hoc networks, and sensor networks. For each path between a given source and a given destination, the routing algorithm takes measurements of a metric (such as the round-trip time or the available bandwidth) as input and automatically adapts the packet transmission probabilities for each path. This algorithm is validated with simulation results that demonstrate fast recovery from path failures.
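
The idea of adapting per-path transmission probabilities from a measured metric can be sketched as follows. This is not the attractor selection model itself; it is a simpler deterministic stand-in that uses exponential smoothing toward weights proportional to the inverse round-trip time. The function name and the `rate` parameter are assumptions made for illustration.

```python
from typing import List


def adapt_probabilities(probs: List[float], rtts: List[float],
                        rate: float = 0.5) -> List[float]:
    """Move path selection probabilities toward paths with lower RTT.

    probs: current transmission probability per path (sums to 1).
    rtts:  latest round-trip time measured on each path, in seconds.
    rate:  hypothetical smoothing factor in (0, 1]; higher reacts faster.
    """
    # Target weights are proportional to 1/RTT, so faster paths attract traffic.
    inverse = [1.0 / r for r in rtts]
    total = sum(inverse)
    targets = [w / total for w in inverse]
    # A convex combination of two distributions is still a distribution.
    return [(1 - rate) * p + rate * t for p, t in zip(probs, targets)]


probs = [0.5, 0.5]
for _ in range(10):
    probs = adapt_probabilities(probs, rtts=[0.02, 0.2])  # path 0 is faster

for _ in range(10):
    probs = adapt_probabilities(probs, rtts=[5.0, 0.2])   # path 0 degrades
```

After the first loop most traffic is sent on path 0; after its round-trip time degrades, the probabilities shift to path 1 within a few updates, which mirrors the recovery behavior the paragraph describes.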

Singh and Haahr generalize a model of self-organization devised almost 50 years ago for solving a sociological issue and use it to adapt the network topology in a pure P2P network. By performing topology adaptation, they adjust the overlay network topology to satisfy certain criteria when peers leave or join the network. For example, in file-sharing P2P applications, all the peers in a given group must have a high bandwidth capacity. The authors detail the algorithm initially proposed by Schelling (a winner of the 2005 Nobel Prize in Economics), propose a generalized algorithm, and finally validate their solution through simulation.
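
A Schelling-style rule applied to an overlay can be sketched as follows: a peer is satisfied when a sufficient fraction of its neighbors is of its own kind, and an unsatisfied peer rewires one link. This is an illustration in the spirit of Schelling's model, not the generalized algorithm from the article; the `kind` labels, the `threshold`, and the rewiring policy are assumptions.

```python
from typing import Dict, Set


def satisfied(peer: str, links: Dict[str, Set[str]],
              kind: Dict[str, str], threshold: float) -> bool:
    """A peer is satisfied if enough of its neighbors share its kind."""
    neighbors = links[peer]
    if not neighbors:
        return False
    alike = sum(1 for n in neighbors if kind[n] == kind[peer])
    return alike / len(neighbors) >= threshold


def adapt(peer: str, links: Dict[str, Set[str]],
          kind: Dict[str, str], threshold: float) -> None:
    """Rewire one link of an unsatisfied peer toward a peer of its own kind."""
    if satisfied(peer, links, kind, threshold):
        return
    unlike = sorted(n for n in links[peer] if kind[n] != kind[peer])
    candidates = sorted(p for p in links
                        if p != peer and p not in links[peer]
                        and kind[p] == kind[peer])
    if not candidates:
        return  # no similar peer is available; keep the current links
    if unlike:
        # Drop one dissimilar neighbor; links are kept symmetric.
        dropped = unlike[0]
        links[peer].discard(dropped)
        links[dropped].discard(peer)
    added = candidates[0]
    links[peer].add(added)
    links[added].add(peer)


# Hypothetical overlay: "a" and "d" have high bandwidth, "b" and "c" low.
kind = {"a": "high", "b": "low", "c": "low", "d": "high"}
links = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
adapt("a", links, kind, threshold=0.5)
```

Sorting the candidates makes the choice deterministic for this example; a real overlay would more likely pick among the peers it happens to know about.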

We hope these articles illuminate the broad spectrum of this subject area and indicate the opportunities for research and innovation in self-management. As the articles in this section demonstrate, self-management requires technologies and methodologies more advanced than those available today. We encourage readers to join us in addressing these challenges.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

DOI

10.1145/1118178.1118199

March 2006 Issue

Published: March 1, 2006

Vol. 49 No. 3

Pages: 36-39
