IFRA, an acronym for Instruction Footprint Recording and Analysis, overcomes major challenges associated with a very expensive step in post-silicon validation of processorspinpointing a bug location and the instruction sequence that exposes the bug from a system failure, such as a crash. Special on-chip recorders, inserted in a processor during design, collect instruction footprintsspecial information about flows of instructions, and what the instructions did as they passed through various microarchitectural blocks of the processor. The recording is done concurrently during the normal operation of the processor in a post-silicon system validation setup. Upon detection of a system failure, the recorded information is scanned out and analyzed offline for bug localization. Special self-consistency-based program analysis techniques, together with the test-program binary of the application executed during post-silicon validation, are used for this purpose. Major benefits of using IFRA over traditional techniques for post-silicon bug localization are (1) it does not require full system-level reproduction of bugs, and (2) it does not require full system-level simulation. Hence, it can overcome major hurdles that limit the scalability of traditional post-silicon validation methodologies. Simulation results on a complex superscalar processor demonstrate that IFRA is effective in accurately localizing electrical bugs with 1% chip-level area impact.
Post-Silicon validation involves operating one or more manufactured chips in actual application environments to validate correct behaviors across specified operating conditions. According to recent industry reports, post-silicon validation is becoming significantly expensive. Intel reported a headcount ratio of 3:1 for design vs. post-silicon validation.19 According to Abramovici et al.,1 post-silicon validation may consume 35% of average chip development time. Yerramilli25 observes that post-silicon validation costs are rising faster than the design costs.
Loosely speaking, there are two types of bugs that design and validation engineers worry about:
Post-silicon validation involves four steps:
Josephson9 points out that the second step, bug localization, dominates post-silicon validation effort and costs. Two major factors that contribute to the high cost of traditional post-silicon bug localization approaches are:
Due to these factors, a functional bug typically takes hours to days to be localized vs. an electrical bug that requires days to weeks and more expensive equipments.10
IFRA, an acronym for Instruction Footprint Recording and Analysis, targets bug localization in processors. Figure 1 shows IFRA-based post-silicon bug localization flow. During chip design, a processor is augmented with low-cost hardware recorders (Section 2) for recording instruction footprints, which are compact pieces of information describing the flows of instructions (i.e., where each instruction was at various points of time), and what the instructions did as they passed through various design blocks of the processor. During post-silicon bug detection, instruction footprints are recorded in each recorder, concurrently with system operation, in a circular fashion to capture the last few thousand cycles of history before a failure.
Upon detection of a system failure, the recorded footprints are scanned out through a Boundary-scan interface, which is a standard interface present in most chips for testing purposes. Since a single run up to a failure is sufficient for IFRA to capture the necessary information (details in Section 2), failure reproduction is not required for localization purposes.
The scanned-out footprints, together with the test-program binary executed during post-silicon bug detection, are post-processed off-line using special analysis techniques (Section 3) to identify the microarchitectural block with the bug, and the instruction sequence that exposes the bug (i.e., the bug exposing stimulus). Microarchitectural block boundaries are defined specifically for IFRA. Examples include instruction queue control, scheduler, forwarding path, decoders, etc. IFRA post-analysis techniques do not require any system-level simulation because they rely on checking for self-consistencies in the footprints with respect to the test-program binary.
Once a bug is localized using IFRA, existing circuit-level debug techniques4, 9 can then quickly identify the root cause of bugs, resulting in significant gains in productivity, cost, and time-to-market.
In this paper, we demonstrate the effectiveness of IFRA for a DEC Alpha 21264-like superscalar processor model6 because its architectural simulator2 and RTL model24 are publicly available. Such superscalar processors contain aggressive performance-enhancement features (e.g., execution of multiple instructions per cycle, execution of instructions out of program order, and prediction of branch targets and outcomes) that are present in many commercial high-performance processors.22 Such features significantly complicate post-silicon validation. For simpler in-order processors (e.g., ARMv6, Intel Atom, SUN Niagra cores), IFRA can be significantly simplified.
There is little consensus about models of functional bugs.8 Hence, we focus on electrical bugs that can be modeled as bit-flips (more details in Section 4). Extensive IFRA simulations demonstrate:
Related work on post-silicon validation can be broadly classified as formal methods,5 on-chip trace buffers for hardware debugging,1 off-chip program and data tracing,13 clock manipulation,9 scan-aided techniques,4 check-pointing with deterministic replay,21 and online assertion checking.1, 3 Table 1 presents a qualitative comparison of IFRA vs. existing post-silicon bug localization techniques. In Table 1, a technique is categorised as being intrusive if it can alter the functional/electrical behavior of the system which may prevent electrical bugs to get exposed.
Section 2 describes hardware support for IFRA. Section 3 describes off-line analysis techniques performed on the scanned-out instruction footprints. Section 4 presents simulation results, followed by conclusions in Section 5.
The three hardware components of IFRA's recording infrastructure, for a superscalar processor, are indicated as shaded parts in Figure 2.
While an instruction, with an ID appended, flows through a pipeline stage, it generates an instruction footprint corresponding to that pipeline stage which is stored in the recorder associated with that pipeline stage. An instruction footprint corresponding to a pipeline stage consists of
Synthesis results (using Synopsys Design Compiler with TSMC 0.13 microns library) show that the area impact of the IFRA hardware infrastructure is 1% on the Illinois Verilog Model24 assuming a 2MB on-chip cache, which is typical of current desktop/server processors. The area cost is dominated by the circular buffers present in the recorders. Interconnect area cost is relatively low because the wires connecting the recorders (Figure 2) operate at slow speed, and a large portion of this routing reuses existing on-chip scan chains that are present for manufacturing testing purposes.
2.1. ID-assignment unit
For the recorded data to be useful for offline analysis, it is necessary to identify which of the trillions of instructions that passed through the processor, produced each of the recorded footprints. Hence, each footprint in a recorder must have an identifier or ID.
Simplistic ID assignment schemes have limited applicability. For example, assigning consecutive numbers to each incoming instruction, in a circular fashion, using very wide IDs is wasteful: using 40-bit IDs will increase the instruction footprint total storage to 160KB from 60KB. When IDs are too short, e.g., 8-bit IDs if there can be only 256 instructions in a processor at any one time, aliasing can occur for processors supporting out-of-order execution and pipeline flushes (process of discarding instructions in the middle of execution to enforce a change in control flow). There can be multiple instructions with the same ID in a processor at any given time that may execute out of program order making it very difficult, if not impossible, to distinguish.
The PC (program counter) value cannot be used as an instruction ID for processors supporting out-of-order execution, because programs with loops may produce multiple instances of the same instruction with the same PC value. These multiple instances may execute out of program order.
It is difficult to use time-stamps or other global synchronization mechanisms as instruction IDs for processors supporting multiple clock domains and/or DVFS (dynamic voltage and frequency scaling) for power management.
Our special ID assignment scheme, described below, uses log24n bits, where n is the maximum number of instructions in a processor at any one time (e.g., n = 64 for Alpha 21264). The first two rules assign consecutive numbers to incoming instructions and the third rule allows the scheme to work18 under all the aforementioned circumstances: i.e., for processors supporting out-of-order execution, pipeline flushes, multiple clock domains and DVFS.
Instruction IDs are assigned to individual instructions as they exit the fetch stage and enter the decode stage. Since multiple instructions may exit the fetch stage in parallel at any given clock cycle, multiple IDs are assigned in parallel.
2.2. Post-trigger generators
Suppose that a test program has been executing for billions of cycles and an electrical bug is exercised after 5 billion cycles from start. Moreover, suppose that the electrical bug causes a system crash after another 1 billion cycles (i.e., 6 billion cycles from the start). With limited storage, we are only interested in capturing the information around the time when the electrical bug is exercised. Hence, 5 billions of cycles worth of information before the bug occurrence may not be necessary. On the other hand, if we stop recording only after the system crashes, all the useful recorded information will be overwritten. Thus, we must incorporate mechanisms, referred to as post-triggers, for reducing error detection latency, the length of time between the appearance of an error caused by a bug and visible system failure.
Post-triggers targeting five different failure scenarios are listed in Table 2. A hard post-trigger fires when there is an evident sign of failure, and causes the processor operation to terminate. Classical hardware error detection techniques such as parity bits for arrays and residue codes for arithmetic units20 as well as in-built exceptions, such as unimplemented instruction exceptions and arithmetic exceptions, belong to this category.
However, hard post-triggers mechanisms alone are not sufficient, e.g., two tricky scenarios described in the last two rows of Table 3. These two failure scenarios may be detected several millions of cycles after an error occurs, causing useful recorded information to be overwritten even with the existing error detection mechanisms. Hence, we introduce the notion of soft post-triggers.
A soft post-trigger fires when there is an early symptom of a possible failure. It causes the recording in all recorders to pause, but allows the processor to keep running. If a hard post-trigger for the failure corresponding to the symptom occurs within a pre-specified amount of time, the processor stops. If a hard post-trigger does not fire within the specified time, the recording resumes assuming that the symptom was false.
Segmentation fault (or segfault) requires OS handling and, hence, may take several millions of cycles to resolve. Null-pointer dereference is detected by adding simple hardware in the Load/Store unit. For other illegal memory accesses, TLB-miss is used as the soft post-trigger. If a segfault is not declared by the OS while servicing the TLB-miss, the recording is resumed on TLB-refill. On the other hand, if a segfault is returned, then a hard post-trigger is activated.
Once recorder contents are scanned out, footprints belonging to same instruction (but in multiple recorders) are identified and linked together using a technique called footprint linking (Section 3.1). The linked footprints are also mapped to the corresponding instruction in the test-program binary using the program counter value stored in the fetch-stage recorder (Table 2).
As shown in Figure 3, after the footprint linking, four high-level post-analysis techniques (Section 3.2) that are independent of microarchitecture are run. After which, low-level analysis (Section 3.3), represented as a decision diagram, asks a series of microarchitecture-specific questions until the final bug location-time pair(s) is obtained. The bug exposing stimuli are derived from the location-time pairs. Currently, the decision diagram is created manually based on the microarchitecture. Automatic generation of such decision diagrams is a topic of future research.
The post-analysis techniques rely on the concept of self-consistency which checks for the existence of contradictory events in collected footprints with respect to the test-program binary. While such checks are extensively used in fault-tolerant computing for error detection12, 16, 23 the key difference here is that we use them for bug localization. Such application is possible because, unlike fault-tolerant computing, the checks are performed off-line enabling more complex analysis for localization purposes.
3.1. Footprint linking
Figure 4 shows a part of a test program and the contents of three (out of many) recorders right after they are scanned out. As explained in Section 2, since we use short instruction IDs (8-bits for Alpha 21264-like processor), we end up having multiple footprints having the same ID in the same recorder and/or multiple recorders. For example, in Figure 4, ID 0 appears in three entries of the fetch-stage recorder, in two entries of the issue-stage recorder, and in three entries of the execution-stage recorder.
Which of these ID 0s correspond to the same instruction? This question is answered by the following special properties enforced by the ID assignment scheme presented in Section 2.1:
Property 1. All flushed instructions are identified by utilizing Rule 3 in our special ID assignment scheme (Section 2.1).
Property 2. If instruction A was fetched before instruction B, and they both have the same ID, then A will always exit any pipeline stage (and leave its footprint in the corresponding recorder) before B does for that same pipeline stage.
In Figure 4, using the first property, footprints corresponding to flushed instructions are identified and discarded. After discarding, using the second property, the youngest ID 0s across all recorders are linked together, followed by linking of the second youngest ID 0s, and so on. Since the PC is stored in the fetch-stage recorder, we can link the instruction ID back to the test program binary to find the corresponding instruction.
3.2. High-level analysis
IFRA uses four high-level analysis techniques (1) data dependency analysis, (2) program control-flow analysis, (3) load-store analysis, and (4) decoding analysis.
Each analysis technique is applied separately. We are interested in the inconsistency that is closest to the electrical bug manifestation in terms of time (i.e., the eldest inconsistency). Thus, if multiple of them identify inconsistencies, then the reported inconsistencies are compared to see which one occurred the earliest. The high-level analysis technique with the earliest occurring inconsistency then decides the entry point into the decision diagram for low-level analysis. Here we briefly explain the control-flow analysis, one of the high-level analysis techniques, to illustrate the idea.
In the program control-flow analysis, four types of illegal transitions are searched in the PC sequence of the serial execution trace (obtained from fetch-stage recorder and test-program binary during footprint linking), starting from the eldest PC.
If any illegal transition is found, the low-level analysis scrutinizes the PC register with the instruction that made an illegal transition.
3.3. Low-level analysis
The low-level analysis involves asking a series of microarchitectural-specific questions according to the decision diagram. We present a simple example by tracing one of the paths in the decision diagram.
Consider an example where a segfault (Section 2.2) during instruction access was detected, and the fourth illegal transition of the control-flow analysis was identified. We also assume that R5 shown in Figure 5 was the register used for the register-indirect transition. Instructions B and C have producer-consumer relationship: B writes its result in to register R0, and C uses a value from register R0.
The first question in the decision diagram is whether C consumed the value B produced. The execute-stage recorder contains the residues of results and the issue-stage recorder contains the residues of operands of instructions. Comparing the two values during post-analysis shows that they do not match; i.e., B produced a value with residue of 5, while C received a value with residue of 3. This is clearly a problem.
The second question in the decision diagram is whether C and B used the same physical register to pass along the value. Analysis of the contents of the dispatch-stage recorder, which records the physical register name, reveals that B wrote its results into physical register P2, while C read its operand value from physical register P5, and they are not the same as shown in Figure 6.
There is again a problem, and the third question in the decision diagram asks whether C used a value produced by the previous producer (instruction that wrote its result into register R0 prior to the immediate producer) of register R0. Instruction A in Figure 7 is the previous producer of register R0 and analysis of the contents of the dispatch-stage recorder reveals that indeed that is the case.
Asking several more questions leads to the bug location and the exposing stimulus shown in Figure 8. The instruction trace between instruction A and instruction B is responsible for stimulating the bug, and the trace afterwards is responsible for propagating the bug to an observation point such as a soft post-trigger.
We evaluated IFRA by injecting errors into a microarchitectural simulator2 augmented with IFRA. For an Alpha 21264 configuration (4-way pipeline, 64 maximum instructions in-flight, 2 ALUs, 2 multipliers, 2 load/store units), there are 200 different microarchitectural blocks (excluding array structures and arithmetic units since errors inside those structures are immediately detected and localized using parity and/or residue codes, as discussed in Section 2.2). Each block has an average size equivalent of 10K 2-input NAND gates. Seven benchmarks from SPECint2000 (bzip2, gcc, gap, gzip, mcf, parser, vortex) were chosen as validation test programs as they represent a variety of workloads. Each recorder was sized to have 1024 entries.
All bugs were modeled as single bit-flips at flip-flops to target hard-to-repeat electrical bugs. This is an effective model because electrical bugs eventually manifest themselves as incorrect values arriving at flip-flops for certain input combinations and operating conditions.15
Errors were injected in one of 1191 flip-flops [Park and Mitra17]. No errors were injected inside array structures since they have built-in parities for error detection.
Upon error injection, the following scenarios are possible:
Out of 100,000 error injection runs, 800 of them resulted in Cases 2 and 3. Figure 9 presents results from these two cases. The "exactly located" category represents the cases in which IFRA returned a single and correct location-time pair (as defined in Section 1). The "candidate located" category represents the cases in which IFRA returned multiple location-time pairs (called candidates) out of over 200,000 possible pairs (1 out of 200 microarchitectural blocks and 1 out of 1,000 cycles), and at least 1 pair was fully correct in both location and in time. The "completely missed" category represents the cases where none of the returned pairs were correct, even if either location or time is correct. In addition, we pessimistically report all errors that resulted in Case 3 as "completely missed." All error injections were performed after a million cycles from the beginning of the program in order to demonstrate that there is no need to keep track of footprints from the beginning.
It is clear from Figure 9 that a large percentage of bugs were uniquely located to correct location-time pair, while very few bugs were completely missed, demonstrating the effectiveness of IFRA.
IFRA targets the problem of post-silicon bug localization in a system setup, which is a major challenge in processor post-silicon design validation. There are two major novelties of IFRA:
IFRA overcomes major post-silicon bug localization challenges.
IFRA creates several interesting research directions:
The authors thank A. Bracy, B. Gottlieb, N. Hakim, D. Josephson, P. Patra, J. Stinson, H. Wang of Intel Corporation, O. Mutlu and S. Blanton of Carnegie Mellon University, T. Hong of Stanford University, and E. Rentschler of AMD for helpful discussions and advice. This research is supported in part by the Semiconductor Research Corporation and the National Science Foundation. Sung-Boem Park is also partially supported by Samsung Scholarship, formerly the Samsung Lee Kun Hee Scholarship Foundation.
18. Park S., Hong, T., Mitra, S. Post-silicon bug localization in processors using instruction footprint recording and analysis (IFRA). IEEE Trans. Comput. Aided Des. Integrated Circuits Syst. 28, 10 (Oct. 2009), 15451558.
A previous version of this paper appeared in the Proceedings of the 45th ACM-IEEE Design Automation Conference (2008, Anaheim, CA).
Figure 8. Bug location (enclosed in grey area includes part of the decoder responsible for decoding the architectural destination register, the write circuitry into a register mapping table, and all the pipeline registers in between) shown on the left and the exposing stimulus shown on the right.
©2010 ACM 0001-0782/10/0200 $10.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2010 ACM, Inc.
No entries found