October 1990 - Vol. 33 No. 10
Features
Practical programmer: Software teams
I have often heard the phrase, “We see what we know.” As technicians, we concentrate on technical ways to manage complexity: abstraction, design techniques, high-level languages, and so on. That is what we know best. But when the tale is told of a project that failed, the blame is often laid not on technical difficulties, but on management and interpersonal problems.

In the last six months, I have seen firsthand how attention to the social organization of a software team can make a big difference in the success of a development project. I work in a “Research and Development” group. “Research” means that some aspects of the project are experimental—we do not know for sure what is going to work. “Development” means we are expected to produce high-quality software for real users. So while we want to encourage creative thought, we must pay heed to the lessons of commercial software developers in quality assurance, testing, documentation, and project control.

Our all-wise project leader decided we also needed to pay heed to the lessons of sociology. In particular, we began to apply the ideas found in Larry Constantine's work on the organization of software teams. Our efforts have resulted in a team that is productive, flexible, and comfortable. I thought these qualities were unusual enough to merit a column on the subject.
Introduction—discrete event simulation
This special section deals with the simulation of complex systems, such as computer, communications, and manufacturing systems—specifically focusing on stochastic discrete event simulation. Such systems are typically (but not always) modeled by a network of queues, in which jobs compete for the system's resources. For example, in an on-line computer database system, the jobs would represent transactions, and the system's resources would include processors, disks, main memory, data locks, etc. Since job arrival patterns and resource requirements are unpredictable, such systems are inherently stochastic (random). As these systems increase in complexity, it becomes increasingly difficult to build analytically tractable performance models; thus simulation, because of its versatility, often becomes the only viable analysis technique. This issue contains articles that span a broad range of current topics in simulation, including:
Efficient execution of simulation models using parallel processors;
Integration of real-time systems, artificial intelligence, and simulation for automated factory control;
Rapid prototyping and simulation of distributed systems;
Sensitivity analysis of simulation output;
Random number generation; and
Effective use of simulation in analyzing and improving the performance of actual systems.
Richard Fujimoto's article is a state-of-the-art survey on the execution of simulation models on parallel processors. Fujimoto describes why discrete event simulations have proven to be a difficult class of applications to parallelize. He then describes two basic parallelization approaches (conservative and optimistic) and recent experience with these approaches, including his own results in which significant speedups have been obtained on nontrivial problems using the optimistic “Time Warp” approach. He also provides a critique and assessment of the basic approaches to parallel simulation.

The article by Sanjay Jain, Karon Barber, and David Osterfeld describes a system for factory scheduling. The scheduling system obtains the status of the plant floor from the factory's real-time monitoring and control computer system. This information is fed into the scheduler, which generates a new schedule and sends it back to the factory control computer. The scheduler integrates a simulation model (which simulates backwards in time) and an expert system whose rules encode scheduling heuristics. The system is currently in use at a highly automated General Motors component production facility.

Alexander Dupuy, Jed Schwartz, Yechiam Yemini, and David Bacon's article describes NEST, a UNIX-based environment for prototyping, modeling, and simulating distributed systems. NEST has a graphical interface for building simulation models. In addition, users can program (or prototype) “node functions” that implement the system's control algorithms (e.g., routing protocols in a communications network). The node functions are linked into the simulation, allowing simulation (and debugging) of a system using its actual control logic. With only minor modification, the node functions can be decoupled from the simulation and used to build the actual system. The authors illustrate this combined use of prototyping and simulation using RIP, a simple routing protocol for IP (Internet Protocol) networks. The authors also describe other features of NEST, including its ability to dynamically reconfigure the simulation model during execution, which permits the study of how the system reacts to changing conditions, such as link failures.

The article by Peter Glynn is concerned with sensitivity analysis of simulation output. He discusses a general-purpose method for using simulation to estimate the derivative of a performance measure, and provides explicit formulae for estimating derivatives for several broad classes of stochastic processes that typically arise in discrete event simulations.

Pierre L'Ecuyer's article is a timely survey on pseudorandom number generation. As L'Ecuyer explains, this topic has recently received renewed attention for a variety of reasons, including the proliferation of generators on microcomputers, the need for portable generators, the requirement for generators with very long periods (as machines get faster), and the application of pseudorandom number generators to cryptology. L'Ecuyer examines several classes of generators, outlines their theoretical and empirical properties, and discusses implementation issues as well.

The article by David Miller presents a case study illustrating the use of simulation modeling to analyze the performance of an IBM semiconductor manufacturing facility in Essex Junction, Vermont. Miller describes the relevant background and goals of the modeling study, the modeling approach, and model validation.
He then describes the results of the simulation experiments, which included investigating a variety of line-loading and scheduling policies. Miller found that a lot-release policy that keeps a fixed amount of work-in-process in the line could significantly reduce the lot turnaround time without reducing throughput, when compared to the policy that was in use in the facility. Miller then describes the specific changes, suggested by the modeling study, that were implemented in Essex Junction, and the corresponding improvement in the facility's efficiency.
Parallel discrete event simulation
Parallel discrete event simulation (PDES), sometimes called distributed simulation, refers to the execution of a single discrete event simulation program on a parallel computer. PDES has attracted a considerable amount of interest in recent years. From a pragmatic standpoint, this interest arises from the fact that large simulations in engineering, computer science, economics, and military applications, to mention a few, consume enormous amounts of time on sequential machines. From an academic point of view, parallel simulation is interesting because it represents a problem domain that often contains substantial amounts of parallelism (e.g., see [59]), yet paradoxically, is surprisingly difficult to parallelize in practice. A sufficiently general solution to the PDES problem may lead to new insights in parallel computation as a whole. Historically, the irregular, data-dependent nature of PDES programs has marked them as an application for which vectorization techniques using supercomputer hardware provide little benefit [14].

A discrete event simulation model assumes the system being simulated changes state only at discrete points in simulated time. The simulation model jumps from one state to another upon the occurrence of an event. For example, a simulator of a store-and-forward communication network might include state variables to indicate the length of message queues, the status of communication links (busy or idle), etc. Typical events might include the arrival of a message at some node in the network, the forwarding of a message to another network node, and component failures.

We are especially concerned with the simulation of asynchronous systems, where events are not synchronized by a global clock but rather occur at irregular time intervals. For these systems, few simulator events occur at any single point in simulated time; therefore parallelization techniques based on lock-step execution using a global simulation clock perform poorly or require assumptions in the timing model that may compromise the fidelity of the simulation. Concurrent execution of events at different points in simulated time is required, but as we shall soon see, this introduces interesting synchronization problems that are at the heart of the PDES problem.

This article deals with the execution of a simulation program on a parallel computer by decomposing the simulation application into a set of concurrently executing processes. For completeness, we conclude this section by mentioning other approaches to exploiting parallelism in simulation problems. Comfort, Shepard, and others have proposed using dedicated functional units to implement specific sequential simulation functions (e.g., event list manipulation and random number generation [20, 23, 47]). This method can provide only a limited amount of speedup, however. Zhang, Zeigler, and Concepcion use the hierarchical decomposition of the simulation model to allow an event consisting of several subevents to be processed concurrently [21, 98]. A third alternative is to execute independent, sequential simulation programs on different processors [11, 39]. This replicated-trials approach is useful if the simulation is largely stochastic and one is performing long simulation runs to reduce variance, or if one is attempting to run a specific simulation problem across a large number of different parameter settings. However, one drawback of this approach is that each processor must contain sufficient memory to hold the entire simulation.
Furthermore, this approach is less suitable in a design environment, where the results of one experiment determine the next experiment to be performed: one must wait for an entire sequential run to complete before any results are available.
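As background for the parallelization problem, the sketch below shows the sequential event-processing loop that PDES must somehow distribute: a simulation clock and a timestamp-ordered event list, with each handler free to change state and schedule new events. It is a minimal Python illustration with invented names, not code from any system discussed in this section.

```python
import heapq
import itertools

# Minimal sketch of a sequential discrete event simulator (illustrative
# names; not from any PDES system discussed here). State changes only at
# discrete points in simulated time, one event at a time.

class Simulator:
    def __init__(self):
        self.clock = 0.0                # current simulated time
        self.events = []                # event list, ordered by timestamp
        self._tie = itertools.count()   # breaks ties between equal timestamps

    def schedule(self, time, handler):
        heapq.heappush(self.events, (time, next(self._tie), handler))

    def run(self, until):
        # The core loop a parallel simulator must distribute: repeatedly
        # remove the smallest-timestamp event and process it.
        while self.events and self.events[0][0] <= until:
            self.clock, _, handler = heapq.heappop(self.events)
            handler(self)

# Example: a message arrives at a node every 2.0 units of simulated time.
def arrival(sim):
    print(f"t={sim.clock:.1f}: message arrives")
    sim.schedule(sim.clock + 2.0, arrival)

sim = Simulator()
sim.schedule(0.0, arrival)
sim.run(until=6.0)   # prints arrivals at t=0.0, 2.0, 4.0, 6.0
```

The difficulty described above is that processing events with different timestamps concurrently can violate the order this loop enforces; the conservative and optimistic protocols surveyed by Fujimoto are two ways of preserving it.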
Expert simulation for on-line scheduling
The state of the art in manufacturing has moved toward flexibility, automation, and integration. The efforts spent on bringing computer-integrated manufacturing (CIM) to plant floors have been motivated by the overall thrust to get new products to market faster. One of the links in CIM is plant floor scheduling, which is concerned with efficiently orchestrating the plant floor to meet customer demand and with responding quickly to changes on the plant floor and changes in customer demand. The Expert System Scheduler (ESS) has been developed to address this link in CIM. The scheduler utilizes real-time plant information to generate plant floor schedules that honor the factory's resource constraints while taking advantage of the flexibility of its components.
The scheduler uses heuristics developed by an experienced human factory scheduler for most of the decisions involved in scheduling. The expertise of the human scheduler has been built into the computerized version using the expert system approach of the discipline of artificial intelligence (AI). Deterministic simulation concepts have been used to develop the schedule and determine the decision points. Indeed, simulation modeling and AI techniques share many concepts, and the two disciplines can be used synergistically. Examples of some common concepts are the ability of entities to carry attributes and change dynamically (simulation—entities/attributes or transaction/parameters versus AI—frames/slots); the ability to control the flow of entities through a model of the system (simulation—conditional probabilities versus AI—production rules); and the ability to change the model based upon state variables (simulation—language constructs based on variables versus AI—pattern-invoked programs). Shannon [6] highlights similarities and differences between conventional simulation and an AI approach. Kusiak and Chen [3] report increasing use of simulation in the development of expert systems.
ESS uses the synergy between AI techniques and simulation modeling to generate schedules for plant floors. Advanced concepts from each of the two areas are used in this endeavor. The expert system has been developed using frames and object-oriented coding, which provides flexibility in knowledge representation. The concept of “backward” simulation, similar to the AI concept of backward chaining, is used to construct the events in the schedule. Some portions of the schedule are constructed using forward, or conventional, simulation.
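To give a rough feel for backward simulation, the sketch below schedules a single job's routing in reverse from its due date, so that each operation's latest start time falls out of the backward walk. This is a toy Python illustration under strong assumptions (one job, a fixed routing, no resource contention, no expert-system rules); ESS itself is far richer.

```python
# Toy sketch of backward scheduling: walk the routing in reverse from the
# due date, so each operation finishes exactly when its successor starts.
# Assumes one job and no resource contention (ESS handles much more).

def backward_schedule(operations, due_date):
    """operations: list of (name, processing_time) in routing order."""
    schedule = []
    finish = due_date
    for name, proc_time in reversed(operations):
        start = finish - proc_time          # latest start meeting the due date
        schedule.append((name, start, finish))
        finish = start                      # predecessor must finish by here
    schedule.reverse()
    return schedule

routing = [("machining", 3.0), ("assembly", 2.0), ("inspection", 1.0)]
for name, start, finish in backward_schedule(routing, due_date=40.0):
    print(f"{name}: start {start:.1f}, finish {finish:.1f}")
# machining 34.0-37.0, assembly 37.0-39.0, inspection 39.0-40.0
```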
The implementation of expert systems and simulation concepts is intertwined in ESS. However, the application of the concepts from these two areas will be treated separately for ease of presentation. We will first discuss the expert system approach and provide a flavor of the heuristics. The concept of backward simulation and the motive behind it will then be explored, along with some details of the implementation and the plant floor where the scheduler is currently being used. We will then highlight some advantages and disadvantages of using the expert simulation approach for scheduling and, finally, examine the synergistic relationship between expert systems and simulation.
NEST: a network simulation and prototyping testbed
The Network Simulation Testbed (NEST) is a graphical environment for simulation and rapid prototyping of distributed networked systems and protocols. Designers of distributed networked systems require the ability to study a system's operation under a variety of simulated network scenarios. For example, designers of a routing protocol need to study the steady-state performance features of the mechanism as well as its dynamic response to failure of links or switching nodes. Similarly, designers of a distributed transaction processing system need to study the performance of the system under a variety of load models as well as its response to failure conditions. NEST provides a complete environment for modeling, execution, and monitoring of distributed systems of arbitrary complexity.
NEST is embedded within a standard UNIX environment. A user develops a simulation model of a communication network using the NEST generic monitor tools. Node functions (e.g., a routing protocol) as well as communication link behaviors (e.g., packet loss or delay features) are typically coded by the user in C; in principle, any high-level block-structured language could be supported for this purpose. These user-provided procedures are linked with the simulated network model and executed efficiently by the NEST simulation server. The user can reconfigure the simulation scenario either through graphical interaction or under program control. The results of an execution can be graphically monitored through custom monitors, developed using the NEST graphical tools.
NEST may thus be used to conduct simulation studies of arbitrary distributed networked systems. However, unlike pure simulation tools, NEST may also be used as an environment for rapid prototyping of distributed systems and protocols. The actual code of the systems developed in this manner can be used at any development stage as the node functions for a simulation. The behavior of the system may be examined under a variety of simulated scenarios. For example, in the development of a routing protocol for a mobile packet radio network, it is possible to examine the speed with which the routing protocol responds to changes in the topology, as well as the probability and expected duration of routing loops. The actual code of the routing protocol may be embedded as node functions within NEST. The only modifications of the code will involve the use of NEST calls upon the simulated network to send, receive, or broadcast a message. Thus NEST is particularly useful as a tool to study the performance behavior of real (or realistically modeled) distributed systems in response to complex, dynamic network behaviors. Such dynamic response is typically beyond the scope of analytical techniques, which are restricted to modeling steady-state equilibrium behaviors.
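The decoupling idea can be suggested with a small sketch. NEST's real node functions are written in C against NEST's own send/receive/broadcast calls, which are not reproduced here; every name below is a hypothetical Python stand-in. The point is only that control logic written against a narrow message-passing interface can be linked against either a simulated network or a real one.

```python
# Hypothetical sketch (not the NEST API): node logic written against a
# narrow send/receive interface runs unchanged over a simulated network.

class SimulatedNetwork:
    """In-memory message queues standing in for real links."""
    def __init__(self):
        self.queues = {}                          # node id -> pending messages

    def send(self, dst, msg):
        self.queues.setdefault(dst, []).append(msg)

    def receive(self, node):
        q = self.queues.get(node, [])
        return q.pop(0) if q else None

def forward(node, net, next_hop):
    """Node function: relay a message one hop toward its destination.
    In a deployed system, `net` would be a real transport offering the
    same send/receive calls, and this function would be reused as-is."""
    msg = net.receive(node)
    if msg and msg["dest"] != node:
        net.send(next_hop[(node, msg["dest"])], msg)

net = SimulatedNetwork()
next_hop = {("A", "C"): "B", ("B", "C"): "C"}     # static routes A -> B -> C
net.send("A", {"dest": "C", "payload": "hello"})
for node in ("A", "B"):
    forward(node, net, next_hop)
print(net.queues["C"])   # the message has been relayed to C
```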
Traditional approaches to simulation are either language-based or model-based. Language-based approaches (e.g., Simula, Simscript) provide users with specialized programming language constructs to support modeling and simulation. The key advantage of these approaches is their generality of application. These approaches, however, are fundamentally limited as tools for studying complex distributed systems. First, they separate the tasks of modeling and simulation from those of design and development. A designer of a network protocol is required to develop the code in one environment using one language (e.g., C), while simultaneously developing a consistent simulation model (e.g., in Simscript). The distinctions between the simulation model and the actual system may be significant enough to reduce the effectiveness of simulation. This is particularly true for complex systems involving a long design cycle and significant changes. Second, these approaches require the modeler to efficiently manage the complexity of scheduling distributed system models (under arbitrary network scenarios).
Model-based approaches (e.g., queuing-network simulators such as IBM's RESQ [12]) provide users with extensive collections of tools supporting a particular simulation-modeling technique. The key advantage of model-based approaches is the efficiency with which they may handle large-scale simulations by utilizing model-specific techniques (e.g., fast algorithms to solve complex queuing network models). Their key disadvantage is a narrower scope of applications and questions that they may answer. For example, it is not possible within a pure queuing-network model to model and analyze complex transient behaviors (e.g., formation of routing loops in a mobile packet radio network). The model-based approach, like the language-based approaches, suffers from having simulation/testing separated from design/development. It has the additional important disadvantage of requiring users to develop in-depth understanding of the modeling techniques. Designers of distributed database transaction systems are often unfamiliar with queuing models.
NEST pursues a different approach to simulation studies: extending a networked operating system environment to support simulation modeling and efficient execution. This environment-based approach to simulation shares the generality of its modeling power with language-based approaches. NEST may be used to model arbitrary distributed interacting systems. NEST also shares with the language-based approach an internal execution architecture that accomplishes very efficient scheduling of a large number of processes. However, unlike language-based approaches, the user does not need to be concerned with management of complex simulation scheduling problems. Furthermore, NEST does not require the user to master or use a separate simulation language facility; the processes of design, development and simulation are fully integrated. The user can study the behavior of the actual system being developed (at any level of detail) under arbitrary simulated scenarios. The routing protocol designer, for example, can attach the routing protocol designed (actual code with minor adjustments) to a NEST simulation and study the system behavior. As the system changes through the design process, new simulation studies may be conducted by attaching the new code to the same simulation models. NEST can thus be used as an integral part of the design process along with other tools (e.g., for debugging).
Like model-based approaches, NEST is specifically targeted toward a limited scope of applications: distributed networked systems. NEST supports a built-in customizable communication network model. However, this scope has been sufficiently broad to support studies ranging from low-level communication protocols to complex distributed transaction processing systems, avionic systems, and even manufacturing processes.
The environment-based approach to simulation offers a few important attractions to users:
1. Simulation is integrated with the range of tools supported by the environment. The user can utilize graphics, statistical packages, debuggers, and other standard tools of choice in the simulation study, and simulation can become an integral part of a standard development process.
2. Users need not develop extensive new skills or knowledge to pursue simulation studies.
3. Standard features of the environment can be used to enhance the range of applicability.
NEST simulation is configured as a network server with monitors as clients. The client/server model permits multiple remote accesses to a shared testbed. This can be very important in supporting a large-scale multisite project.
In this article we describe the architecture of NEST, illustrate its use, discuss aspects of its implementation and design, and provide examples of NEST applications.
Likelihood ratio gradient estimation for stochastic systems
Consider a computer system having a CPU that feeds jobs to two input/output (I/O) devices having different speeds. Let θ be the fraction of jobs routed to the first I/O device, so that 1 - θ is the fraction routed to the second. Suppose that α = α(θ) is the steady-state amount of time that a job spends in the system. Given that θ is a decision variable, a designer might wish to minimize α(θ) over θ. Since α(·) is typically difficult to evaluate analytically, Monte Carlo optimization is an attractive methodology. By analogy with deterministic mathematical programming, efficient Monte Carlo gradient estimation is an important ingredient of simulation-based optimization algorithms. As a consequence, gradient estimation has recently attracted considerable attention in the simulation community. It is our goal, in this article, to describe one efficient method for estimating gradients in the Monte Carlo setting, namely the likelihood ratio method (also known as the efficient score method). This technique has been previously described (in less general settings than those developed in this article) in [6, 16, 18, 21]. An alternative gradient estimation procedure is infinitesimal perturbation analysis; see [11, 12] for an introduction. While infinitesimal perturbation analysis is typically more difficult to apply to a given application than the likelihood ratio technique of interest here, it often turns out to be statistically more accurate.
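The essence of the method can be shown in one dimension. For a density f(x; θ), differentiating E_θ[h(X)] = ∫ h(x) f(x; θ) dx under the integral sign gives d/dθ E_θ[h(X)] = E_θ[h(X) · ∂/∂θ log f(X; θ)], so the derivative can be estimated by averaging h(X) times the score over simulated samples. The Python sketch below applies this to a deliberately trivial case (an exponential random variable) where the exact answer is known; it illustrates the idea, not the general estimators derived in the article.

```python
import random

# Likelihood ratio (score function) gradient estimation for a toy case.
# X ~ Exponential(theta), density f(x; theta) = theta * exp(-theta * x),
# so d/dtheta log f(X; theta) = 1/theta - X.  With h(X) = X we have
# E[X] = 1/theta, hence the exact derivative is -1/theta**2.

def lr_gradient_estimate(theta, n=200_000, seed=1):
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.expovariate(theta)       # simulate X under parameter theta
        score = 1.0 / theta - x          # d/dtheta log f(X; theta)
        total += x * score               # h(X) times the score
    return total / n

theta = 2.0
print("estimate:", lr_gradient_estimate(theta))   # close to -0.25
print("exact:   ", -1.0 / theta**2)
```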
In this article, we first describe two important problems which motivate our study of efficient gradient estimation algorithms. Next, we will present the likelihood ratio gradient estimator in a general setting in which the essential idea is most transparent. The section that follows then specializes the estimator to discrete-time stochastic processes. We derive likelihood ratio gradient estimators for both time-homogeneous and non-time-homogeneous discrete-time Markov chains. Later, we discuss likelihood ratio gradient estimation in continuous time. As examples of our analysis, we present the gradient estimators for time-homogeneous continuous-time Markov chains; non-time-homogeneous continuous-time Markov chains; semi-Markov processes; and generalized semi-Markov processes. (The analysis throughout these sections assumes the performance measure that defines α(θ) corresponds to a terminating simulation.) Finally, we conclude the article with a brief discussion of the basic issues that arise in extending the likelihood ratio gradient estimator to steady-state performance measures.
Random numbers for simulation
In the mind of the average computer user, the problem of generating uniform variates by computer was solved long ago. After all, every computer system offers one or more functions to do so. Many software products, like compilers, spreadsheets, statistical or numerical packages, etc., also offer their own. These functions supposedly return numbers that can be used, for all practical purposes, as if they were the values taken by independent random variables with a uniform distribution between 0 and 1. Many people use them with faith and feel happy with the results. So, why bother?
Other (less naive) people do not feel happy with the results, and with good reason. Despite renewed crusades, blatantly bad generators still abound, especially on microcomputers [55, 69, 85, 90, 100]. Other generators widely used on medium-sized computers are perhaps not so spectacularly bad, but still fail some theoretical and/or empirical statistical tests, and/or generate easily detectable regular patterns [56, 65].
Fortunately, many applications appear quite robust to these defects. But with the rapid increase in desktop computing power, increasingly sophisticated simulation studies are being performed that require more and more “random” numbers and whose results are more sensitive to the quality of the underlying generator [28, 40, 65, 90]. Sometimes, using a not-so-good generator can give totally misleading results. Perhaps this happens rarely, but when it does, the consequences can be disastrous. For that reason, researchers are still actively investigating ways of building generators. The main goal is to design more robust generators without having to pay too much in terms of portability, flexibility, and efficiency. In the following sections, we give a quick overview of the ongoing research. We focus mainly on efficient and recently proposed techniques for generating uniform pseudorandom numbers. Stochastic simulations typically transform such numbers to generate variates according to more complex distributions [13, 25]. Here, “uniform pseudorandom” means that the numbers behave from the outside as if they were the values of i.i.d. random variables, uniformly distributed over some finite set of symbols. This set of symbols is often a set of integers of the form {0, . . . , m - 1}, and the symbols are usually transformed by some function into values between 0 and 1, to approximate the U(0, 1) distribution. Other tutorial-like references on uniform variate generation include [13, 23, 52, 54, 65, 84, 89].
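As a concrete (if dated) reference point, the sketch below implements one of the simplest generators of the kind surveyed here: a multiplicative linear congruential generator with the “minimal standard” parameters of Park and Miller (a = 16807, m = 2^31 - 1). It shows how a sequence over the finite set {1, ..., m - 1} is scaled into (0, 1) to approximate U(0, 1); it is not an endorsement of this generator over those discussed in the article.

```python
# Minimal multiplicative linear congruential generator with the Park-Miller
# "minimal standard" parameters.  The state walks through {1, ..., M-1};
# dividing by M scales each value into (0, 1) to approximate U(0, 1).

M = 2**31 - 1          # modulus (a Mersenne prime)
A = 16807              # multiplier

def minstd(seed, n):
    x = seed % M or 1  # state must be a nonzero residue modulo M
    for _ in range(n):
        x = (A * x) % M
        yield x / M

for u in minstd(seed=12345, n=5):
    print(u)
```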
Simulation of a semiconductor manufacturing line
Product turnaround time (TAT) is defined as the clock or elapsed time from wafer release to completed wafer fabrication. It is also known as cycle time, mean elapsed time, or manufacturing lead time. Turnaround time is arguably more important in semiconductor fabrication than in any other industry because it has a major impact on contamination levels, process control capabilities, yield-learning rates and product costs.
Semiconductor manufacturers must strictly control particulate contamination to achieve high device yields. Longer turnaround times increase the opportunity for the particles to migrate to wafer surfaces, even in strict clean-room environments [11]. The negative relationship between time and product yields is shown in Figure 1.
Submicron device fabrication also demands stringent process control capability. Variation in time between steps is a major contributor to process variability, directly affecting process yields. Sequential processes performed minutes apart may produce significantly different results than the identical processes performed hours apart because the properties of materials change over time.
The slope of the yield learning curve is also a function of turnaround time. Slow feedback caused by high turnaround times delays problem recognition and verification of solutions. Semiconductor fabrication is especially sensitive to turnaround time because definitive functional results are not available until circuits are completely fabricated on the wafer, typically hundreds of process steps after the raw silicon wafers are released into the manufacturing line. Figure 2 depicts the relationship between turnaround time, shown as a multiple of theoretical or raw process time (RPT), and the yield learning rate.
The impact of longer turnaround time is not limited to reduced yields. Longer TAT also increases product costs. For example, a line with a 10-day TAT that starts 1,000 wafers a day will have 10,000 wafers of work-in-process (WIP). A line with the same wafer starts per day (WSD) but an 11-day TAT will have 11,000 wafers in WIP. The longer TAT causes higher carrying costs for partially finished goods, more space for WIP storage, additional resources for product tracking and control, and many other additional expenses.
Minimizing turnaround time is critically important in the semiconductor industry due to its major contribution to greater product yields and lower costs. However, it is not the only determinant of success in semiconductor manufacturing. State-of-the-art facilities cost hundreds of millions of dollars to build and equip, due to requirements for cleanliness, vibration control, chemical and gas purity and other components that are specified to angstroms and sub-microns. The semiconductor industry's capital-intensive nature demands high throughput rates and maximum utilization of resources to attain a competitive cost per wafer.
The amount of work-in-process (WIP), also known as line-loading levels, affects both turnaround time and throughput performance. Throughput analysis techniques [4, 6] generate curves demonstrating that infinite levels of work in process maximize throughput (Figure 3) by ensuring that resources never starve for work. Queuing theory analysis [1] produces curves that show minimum line-loading levels produce minimum turnaround times by eliminating time spent queued for busy resources (Figure 4). The inverse shapes of these two performance curves demonstrate the inherent conflict in line-loading decisions when attempting to both maximize throughput and minimize turnaround time.
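The two idealized curves can be reproduced with a textbook calculation. The Python sketch below runs exact mean value analysis (MVA) for a closed, product-form network of single-server stations, each visited once per cycle; by Little's law, WIP = throughput × turnaround time at every line-loading level. Real wafer lines violate the product-form assumptions (rework, tool failures, batching), so this is only an illustration of the underlying tradeoff, not a substitute for the detailed simulation the article describes.

```python
# Exact mean value analysis (MVA) for a closed network of single-server
# stations, each visited once per cycle.  As the WIP level n grows,
# throughput rises toward a ceiling (the shape of Figure 3) while
# turnaround time grows without bound (the shape of Figure 4).

def mva(service_times, max_wip):
    queues = [0.0] * len(service_times)    # mean queue length per station
    results = []
    for n in range(1, max_wip + 1):
        # Arrival theorem: an arriving job sees the time-average state of
        # the same network with n - 1 jobs.
        resid = [s * (1.0 + q) for s, q in zip(service_times, queues)]
        tat = sum(resid)                   # turnaround time for one cycle
        x = n / tat                        # throughput, via Little's law
        queues = [x * r for r in resid]
        results.append((n, x, tat))
    return results

for n, x, tat in mva(service_times=[2.0, 1.0, 1.5], max_wip=8):
    print(f"WIP={n}: throughput={x:.3f} jobs/unit time, turnaround={tat:.2f}")
```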
Wafer fabrication involves hundreds of individual tools performing multiple processes to produce an array of sophisticated end products. Actual turnaround time and throughput curves for each tool depend on many variables unique to that tool, such as arrival rates, service rates, rework rates, failure rates, and starvation and blockage opportunities. Resolution of the line-loading conflict is especially difficult given the complexity of semiconductor manufacturing, and simulation emerges as perhaps the only currently existing methodology for taking into account the detailed interactions among elements in such a manufacturing environment [3, 14].
Traditional dialogue design applied to modern user interfaces
Several examples show that principles for the design of usable dialogues are just as important for modern graphical user interfaces as they are for traditional text-based interfaces.
Economic analysis of microcomputer hardware
Economic analysis of the hardware characteristics of the personal computer segment of the microcomputer market indicates that cost is driven most strongly by the bundle of attributes a micro offers. Interestingly, certain technological characteristics found to be important explanatory variables for the cost of larger computers were not found to be significant for micros.
Inside risks: ghosts, mysteries and uncertainty
Background: Perhaps the most frustrating of all the Risks Forum cases are those in which uncertainty and doubts remain to spook us long after the event.