IFRA, an acronym for Instruction Footprint Recording and Analysis, overcomes major challenges associated with a very expensive step in post-silicon validation of processors—pinpointing a bug location and the instruction sequence that exposes the bug from a system failure, such as a crash. Special on-chip recorders, inserted in a processor during design, collect instruction footprints—special information about flows of instructions, and what the instructions did as they passed through various microarchitectural blocks of the processor. The recording is done concurrently during the normal operation of the processor in a post-silicon system validation setup. Upon detection of a system failure, the recorded information is scanned out and analyzed offline for bug localization. Special self-consistency-based program analysis techniques, together with the test-program binary of the application executed during post-silicon validation, are used for this purpose. Major benefits of using IFRA over traditional techniques for post-silicon bug localization are (1) it does not require full system-level reproduction of bugs, and (2) it does not require full system-level simulation. Hence, it can overcome major hurdles that limit the scalability of traditional post-silicon validation methodologies. Simulation results on a complex superscalar processor demonstrate that IFRA is effective in accurately localizing electrical bugs with 1% chip-level area impact.
1. Introduction
Post-silicon validation involves operating one or more manufactured chips in actual application environments to validate correct behavior across specified operating conditions. According to recent industry reports, post-silicon validation is becoming increasingly expensive. Intel reported a headcount ratio of 3:1 for design vs. post-silicon validation.19 According to Abramovici et al.,1 post-silicon validation may consume 35% of average chip development time. Yerramilli25 observes that post-silicon validation costs are rising faster than design costs.
Loosely speaking, there are two types of bugs that design and validation engineers worry about:
- Bugs caused by the interactions between the design and the physical effects, also called electrical bugs.10 Such bugs generally manifest themselves only under certain operating conditions (temperature, voltage, frequency). Examples include setup and hold time problems.
- Functional bugs, also called logic bugs, caused by design errors.
Post-silicon validation involves four steps:
- Detecting a problem by running a test program, such as an OS, games, or functional tests, until a system failure occurs (e.g., system crash, segmentation fault, or exceptions).
- Localizing the problem to a small region from the system failure, e.g., a bug in an adder inside an ALU of a complex processor. The stimulus that exposes the bug, e.g., the particular 10 lines of code from some application, is also important.
- Identifying the root cause of the problem. For example, an electrical bug may be caused by power-supply noise slowing down a circuit path resulting in an error at the adder output.
- Fixing or bypassing the problem by microcode patching,7 circuit editing,11 or, as a last resort, respinning using a new mask.
Josephson9 points out that the second step, bug localization, dominates post-silicon validation effort and costs. Two major factors that contribute to the high cost of traditional post-silicon bug localization approaches are:
- Failure reproduction, which involves returning the chip to an error-free state and re-executing the failure-causing stimulus (including test-program segment, interrupts, and operating conditions) to reproduce the same failure. Unfortunately, many electrical bugs are hard to reproduce. The difficulty of bug reproduction is exacerbated by the presence of asynchronous I/Os and multiple clock domains.
- System-level simulation for obtaining golden responses, i.e., correct signal values for every clock cycle for the entire system (i.e., the chip and all the peripheral devices on the board) to compare against the signal values produced by the chip being validated. Running system-level simulation is typically 7 to 8 orders of magnitude slower than actual silicon.
Due to these factors, localizing a functional bug typically takes hours to days, whereas localizing an electrical bug takes days to weeks and requires more expensive equipment.10
IFRA, an acronym for Instruction Footprint Recording and Analysis, targets bug localization in processors. Figure 1 shows IFRA-based post-silicon bug localization flow. During chip design, a processor is augmented with low-cost hardware recorders (Section 2) for recording instruction footprints, which are compact pieces of information describing the flows of instructions (i.e., where each instruction was at various points of time), and what the instructions did as they passed through various design blocks of the processor. During post-silicon bug detection, instruction footprints are recorded in each recorder, concurrently with system operation, in a circular fashion to capture the last few thousand cycles of history before a failure.
Upon detection of a system failure, the recorded footprints are scanned out through a Boundary-scan interface, which is a standard interface present in most chips for testing purposes. Since a single run up to a failure is sufficient for IFRA to capture the necessary information (details in Section 2), failure reproduction is not required for localization purposes.
The scanned-out footprints, together with the test-program binary executed during post-silicon bug detection, are post-processed off-line using special analysis techniques (Section 3) to identify the microarchitectural block with the bug, and the instruction sequence that exposes the bug (i.e., the bug exposing stimulus). Microarchitectural block boundaries are defined specifically for IFRA. Examples include instruction queue control, scheduler, forwarding path, decoders, etc. IFRA post-analysis techniques do not require any system-level simulation because they rely on checking for self-consistencies in the footprints with respect to the test-program binary.
Once a bug is localized using IFRA, existing circuit-level debug techniques4, 9 can then quickly identify the root cause of bugs, resulting in significant gains in productivity, cost, and time-to-market.
In this paper, we demonstrate the effectiveness of IFRA for a DEC Alpha 21264-like superscalar processor model6 because its architectural simulator2 and RTL model24 are publicly available. Such superscalar processors contain aggressive performance-enhancement features (e.g., execution of multiple instructions per cycle, execution of instructions out of program order, and prediction of branch targets and outcomes) that are present in many commercial high-performance processors.22 Such features significantly complicate post-silicon validation. For simpler in-order processors (e.g., ARMv6, Intel Atom, Sun Niagara cores), IFRA can be significantly simplified.
There is little consensus about models of functional bugs.8 Hence, we focus on electrical bugs that can be modeled as bit-flips (more details in Section 4). Extensive IFRA simulations demonstrate:
- For 75% of injected electrical bugs, IFRA pinpointed their exact location (1 out of 200 microarchitectural blocks) and the time they were injected (1 out of over 1,000 cycles), together referred to as a location-time pair. For 21% of injected bugs, IFRA correctly identified their location-time pairs together with 5 other candidates (out of over 200,000 possible pairs) on average. IFRA completely missed correct location-time pairs for only 4% of injected bugs.
- The aforementioned results were obtained without relying on system-level simulation and failure reproduction.
- IFRA hardware introduces a very small area impact of 1% (dominated by on-chip memory for storing 60KB of instruction footprints). If on-chip trace buffers1 already exist for validation purposes, they can be reused to reduce the area impact. Alternatively, a part of data cache may also be used to reduce the area impact of IFRA.
Related work on post-silicon validation can be broadly classified as formal methods,5 on-chip trace buffers for hardware debugging,1 off-chip program and data tracing,13 clock manipulation,9 scan-aided techniques,4 check-pointing with deterministic replay,21 and online assertion checking.1, 3 Table 1 presents a qualitative comparison of IFRA vs. existing post-silicon bug localization techniques. In Table 1, a technique is categorized as intrusive if it can alter the functional or electrical behavior of the system, which may prevent electrical bugs from being exposed.
Section 2 describes hardware support for IFRA. Section 3 describes off-line analysis techniques performed on the scanned-out instruction footprints. Section 4 presents simulation results, followed by conclusions in Section 5.
2. IFRA Hardware Support
The three hardware components of IFRA’s recording infrastructure, for a superscalar processor, are indicated as shaded parts in Figure 2.
- A set of distributed recorders, denoted by ‘R’ in Figure 2, with dedicated circular buffers. As an instruction passes through a pipeline stage, the recorder associated with that stage records information specific to that stage (Table 2). When no instruction passes through a pipeline stage for many cycles, consecutive idle cycles are compacted into a single entry in the corresponding recorder.
- An ID (identification) assignment unit responsible for assigning and appending an ID to each instruction that enters the processor.
- A post-trigger generator, which is a mechanism for deciding when to stop recording.
While an instruction, with an ID appended, flows through a pipeline stage, it generates an instruction footprint corresponding to that pipeline stage which is stored in the recorder associated with that pipeline stage. An instruction footprint corresponding to a pipeline stage consists of
- The instruction’s ID that was appended
- Auxiliary information (Table 2) that tells us what the instruction did in the microarchitectural blocks contained in that pipeline stage
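The recorder behavior described above (a circular buffer per pipeline stage, one footprint per passing instruction, and consecutive idle cycles compacted into a single entry) can be modeled in software roughly as follows. This is only an illustrative sketch, not the hardware design: the class names, the `aux` dictionary, and the use of `None` as an idle marker are assumptions made for the example.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Footprint:
    instr_id: Optional[int]  # None marks a compacted run of idle cycles (assumed encoding)
    aux: dict                # stage-specific auxiliary information
    idle_cycles: int = 0     # length of the idle run, used only by idle entries


class Recorder:
    """Software model of one stage recorder with a circular buffer (hypothetical)."""

    def __init__(self, capacity: int = 1024):
        # A deque with maxlen drops the oldest entry once the buffer is full.
        self.entries = deque(maxlen=capacity)

    def record(self, instr_id: int, aux: dict) -> None:
        """Store the footprint of an instruction passing through this stage."""
        self.entries.append(Footprint(instr_id, aux))

    def idle(self) -> None:
        """Account for one cycle in which no instruction passes through the stage."""
        if self.entries and self.entries[-1].instr_id is None:
            self.entries[-1].idle_cycles += 1  # extend the existing idle entry
        else:
            self.entries.append(Footprint(None, {}, idle_cycles=1))


rec = Recorder(capacity=4)
rec.record(0, {"pc": 0x1000})
for _ in range(3):
    rec.idle()          # three idle cycles occupy a single entry
rec.record(1, {"pc": 0x1004})
```

In this model the three idle cycles cost one buffer slot instead of three, which is the point of the compaction: buffer capacity is spent on instruction history rather than on stalls.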
Synthesis results (using Synopsys Design Compiler with a TSMC 0.13-micron library) show that the area impact of the IFRA hardware infrastructure is 1% on the Illinois Verilog Model24 assuming a 2MB on-chip cache, which is typical of current desktop/server processors. The area cost is dominated by the circular buffers present in the recorders. Interconnect area cost is relatively low because the wires connecting the recorders (Figure 2) operate at slow speed, and a large portion of this routing reuses existing on-chip scan chains that are present for manufacturing testing purposes.
For the recorded data to be useful for offline analysis, it is necessary to identify which of the trillions of instructions that passed through the processor produced each of the recorded footprints. Hence, each footprint in a recorder must have an identifier or ID.
Simplistic ID assignment schemes have limited applicability. For example, assigning consecutive numbers to each incoming instruction, in a circular fashion, using very wide IDs is wasteful: using 40-bit IDs will increase the instruction footprint total storage to 160KB from 60KB. When IDs are too short, e.g., 8-bit IDs if there can be only 256 instructions in a processor at any one time, aliasing can occur for processors supporting out-of-order execution and pipeline flushes (the process of discarding instructions in the middle of execution to enforce a change in control flow). There can be multiple instructions with the same ID in a processor at any given time, and they may execute out of program order, making it very difficult, if not impossible, to distinguish them.
The PC (program counter) value cannot be used as an instruction ID for processors supporting out-of-order execution, because programs with loops may produce multiple instances of the same instruction with the same PC value. These multiple instances may execute out of program order.
It is difficult to use time-stamps or other global synchronization mechanisms as instruction IDs for processors supporting multiple clock domains and/or DVFS (dynamic voltage and frequency scaling) for power management.
Our special ID assignment scheme, described below, uses log2(4n) bits, where n is the maximum number of instructions in a processor at any one time (e.g., n = 64 for Alpha 21264). The first two rules assign consecutive numbers to incoming instructions and the third rule allows the scheme to work18 under all the aforementioned circumstances: i.e., for processors supporting out-of-order execution, pipeline flushes, multiple clock domains and DVFS.
Instruction IDs are assigned to individual instructions as they exit the fetch stage and enter the decode stage. Since multiple instructions may exit the fetch stage in parallel at any given clock cycle, multiple IDs are assigned in parallel.
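As a quick check of the ID width, the following sketch computes the number of bits for a given n, assuming the width is log2(4n) rounded up to a whole number of bits. The function name `id_bits` is hypothetical and only serves the illustration.

```python
def id_bits(n: int) -> int:
    """Bits needed to encode 4n distinct IDs, i.e., ceil(log2(4n)).

    n is the maximum number of instructions in the processor at any one time.
    """
    if n < 1:
        raise ValueError("n must be positive")
    # (4n - 1).bit_length() is an exact integer form of ceil(log2(4n)).
    return (4 * n - 1).bit_length()


print(id_bits(64))  # 8
```

For n = 64 this gives 8 bits, which matches the 8-bit IDs mentioned for the Alpha 21264-like processor and is far narrower than the 40-bit IDs of the simplistic scheme.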
Suppose that a test program has been executing for billions of cycles and an electrical bug is exercised 5 billion cycles after the start. Moreover, suppose that the electrical bug causes a system crash after another 1 billion cycles (i.e., 6 billion cycles from the start). With limited storage, we are only interested in capturing the information around the time when the electrical bug is exercised. Hence, the 5 billion cycles' worth of information before the bug occurrence may not be necessary. On the other hand, if we stop recording only after the system crashes, all the useful recorded information will be overwritten. Thus, we must incorporate mechanisms, referred to as post-triggers, for reducing error detection latency, the length of time between the appearance of an error caused by a bug and visible system failure.
Post-triggers targeting five different failure scenarios are listed in Table 3. A hard post-trigger fires when there is an evident sign of failure, and causes the processor operation to terminate. Classical hardware error detection techniques such as parity bits for arrays and residue codes for arithmetic units20 as well as built-in exceptions, such as unimplemented instruction exceptions and arithmetic exceptions, belong to this category.
However, hard post-trigger mechanisms alone are not sufficient, as illustrated by the two tricky scenarios described in the last two rows of Table 3. These two failure scenarios may be detected several millions of cycles after an error occurs, causing useful recorded information to be overwritten even with the existing error detection mechanisms. Hence, we introduce the notion of soft post-triggers.
A soft post-trigger fires when there is an early symptom of a possible failure. It causes the recording in all recorders to pause, but allows the processor to keep running. If a hard post-trigger for the failure corresponding to the symptom fires within a pre-specified amount of time, the processor stops. If a hard post-trigger does not fire within the specified time, the recording resumes, assuming that the symptom was false.
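The interplay between soft and hard post-triggers amounts to a small state machine. The sketch below models it cycle by cycle; the class name, the `timeout_cycles` parameter (standing in for the pre-specified amount of time), and the per-cycle `step` interface are assumptions for illustration and do not describe the actual hardware.

```python
from enum import Enum, auto


class State(Enum):
    RECORDING = auto()  # recorders capture footprints normally
    PAUSED = auto()     # soft post-trigger fired; recording paused, processor runs
    STOPPED = auto()    # hard post-trigger fired; processor halted for scan-out


class PostTriggerController:
    """Hypothetical cycle-level model of the soft/hard post-trigger policy."""

    def __init__(self, timeout_cycles: int):
        self.timeout_cycles = timeout_cycles  # assumed pre-specified window
        self.state = State.RECORDING
        self.cycles_paused = 0

    def step(self, soft: bool = False, hard: bool = False) -> State:
        """Advance one cycle given the trigger signals observed in that cycle."""
        if self.state is State.STOPPED:
            return self.state
        if hard:
            self.state = State.STOPPED
        elif self.state is State.RECORDING and soft:
            self.state = State.PAUSED
            self.cycles_paused = 0
        elif self.state is State.PAUSED:
            self.cycles_paused += 1
            if self.cycles_paused >= self.timeout_cycles:
                self.state = State.RECORDING  # symptom judged to be false
        return self.state


ctrl = PostTriggerController(timeout_cycles=3)
ctrl.step(soft=True)   # early symptom: recording pauses
ctrl.step(hard=True)   # failure confirmed: processor stops, buffers are preserved
```

Pausing instead of stopping is what keeps the history intact: the buffers are frozen at the symptom, so a hard post-trigger that arrives much later still finds the footprints from around the time of the error.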
Segmentation fault (or segfault) requires OS handling and, hence, may take several millions of cycles to resolve. Null-pointer dereference is detected by adding simple hardware in the Load/Store unit. For other illegal memory accesses, TLB-miss is used as the soft post-trigger. If a segfault is not declared by the OS while servicing the TLB-miss, the recording is resumed on TLB-refill. On the other hand, if a segfault is returned, then a hard post-trigger is activated.
3. Post-Analysis Techniques
Once recorder contents are scanned out, footprints belonging to the same instruction (but in multiple recorders) are identified and linked together using a technique called footprint linking (Section 3.1). The linked footprints are also mapped to the corresponding instruction in the test-program binary using the program counter value stored in the fetch-stage recorder (Table 2).
As shown in Figure 3, after the footprint linking, four high-level post-analysis techniques (Section 3.2) that are independent of microarchitecture are run. After that, low-level analysis (Section 3.3), represented as a decision diagram, asks a series of microarchitecture-specific questions until the final bug location-time pair(s) is obtained. The bug exposing stimuli are derived from the location-time pairs. Currently, the decision diagram is created manually based on the microarchitecture. Automatic generation of such decision diagrams is a topic of future research.
The post-analysis techniques rely on the concept of self-consistency, which checks for the existence of contradictory events in collected footprints with respect to the test-program binary. While such checks are extensively used in fault-tolerant computing for error detection,12, 16, 23 the key difference here is that we use them for bug localization. Such application is possible because, unlike fault-tolerant computing, the checks are performed off-line, enabling more complex analysis for localization purposes.
Figure 4 shows a part of a test program and the contents of three (out of many) recorders right after they are scanned out. As explained in Section 2, since we use short instruction IDs (8 bits for an Alpha 21264-like processor), we end up having multiple footprints with the same ID in the same recorder and/or multiple recorders. For example, in Figure 4, ID 0 appears in three entries of the fetch-stage recorder, in two entries of the issue-stage recorder, and in three entries of the execution-stage recorder.
Which of these ID 0s correspond to the same instruction? This question is answered by the following special properties enforced by the ID assignment scheme presented in Section 2.1:
Property 1. All flushed instructions are identified by utilizing Rule 3 in our special ID assignment scheme (Section 2.1).
Property 2. If instruction A was fetched before instruction B, and they both have the same ID, then A will always exit any pipeline stage (and leave its footprint in the corresponding recorder) before B does for that same pipeline stage.
In Figure 4, using the first property, footprints corresponding to flushed instructions are identified and discarded. After discarding, using the second property, the youngest ID 0s across all recorders are linked together, followed by linking of the second youngest ID 0s, and so on. Since the PC is stored in the fetch-stage recorder, we can link the instruction ID back to the test program binary to find the corresponding instruction.
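The linking step can be sketched as follows, under simplifying assumptions that are not from the paper: flushed footprints have already been removed (Property 1), every remaining instruction has left a footprint in each recorder considered, and each recorder is given as a list of `(instr_id, aux)` tuples ordered oldest first. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def link_footprints(recorders: dict) -> list:
    """Link footprints with the same ID across recorders, youngest first.

    recorders maps a stage name to a list of (instr_id, aux) tuples in
    recording order (oldest first), with flushed entries already discarded.
    Returns one dict per linked instruction: {"id": ..., stage: aux, ...}.
    """
    per_stage = {}
    for stage, entries in recorders.items():
        by_id = defaultdict(list)
        for instr_id, aux in entries:
            by_id[instr_id].append(aux)
        per_stage[stage] = by_id

    all_ids = set().union(*(by_id.keys() for by_id in per_stage.values()))
    linked = []
    for instr_id in sorted(all_ids):
        # Only link as many occurrences as every recorder still holds.
        depth = min(len(per_stage[s].get(instr_id, [])) for s in per_stage)
        for k in range(1, depth + 1):  # k = 1 is the youngest occurrence
            record = {"id": instr_id}
            for stage in per_stage:
                # Property 2: the k-th youngest footprint with this ID in each
                # recorder belongs to the same instruction.
                record[stage] = per_stage[stage][instr_id][-k]
            linked.append(record)
    return linked


recorders = {
    "fetch": [(0, {"pc": 0x100}), (1, {"pc": 0x104}), (0, {"pc": 0x200})],
    # ID 1 executes before the first ID 0 (out of program order).
    "execute": [(1, {"residue": 2}), (0, {"residue": 5}), (0, {"residue": 3})],
}
linked = link_footprints(recorders)
```

In the example, the youngest ID 0 in the fetch recorder (PC 0x200) is paired with the youngest ID 0 in the execute recorder, even though instructions with different IDs were executed out of program order.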
IFRA uses four high-level analysis techniques: (1) data dependency analysis, (2) program control-flow analysis, (3) load-store analysis, and (4) decoding analysis.
Each analysis technique is applied separately. We are interested in the inconsistency that is closest in time to the electrical bug manifestation (i.e., the earliest inconsistency). Thus, if more than one technique identifies an inconsistency, the reported inconsistencies are compared to see which one occurred the earliest. The high-level analysis technique with the earliest occurring inconsistency then decides the entry point into the decision diagram for low-level analysis. Here we briefly explain the control-flow analysis, one of the high-level analysis techniques, to illustrate the idea.
In the program control-flow analysis, four types of illegal transitions are searched for in the PC sequence of the serial execution trace (obtained from the fetch-stage recorder and the test-program binary during footprint linking), starting from the oldest PC.
- The PC does not increment by +4 except in the presence of a control flow transition instruction (e.g., branch, jump).
- A PC jump does not occur in the presence of an unconditional transition instruction.
- The PC does not jump to the correct target in the presence of a direct transition (with a target address that does not depend on a register value).
- The PC does not jump to an address that is part of the executable address space (determined from the program binary) in the presence of a register-indirect transition (with a target address that depends on a register value).
If any illegal transition is found, the low-level analysis scrutinizes the PC register with the instruction that made an illegal transition.
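The four checks listed above can be expressed as a single pass over the serial PC trace. The following is a minimal sketch under stated assumptions: instructions are 4 bytes wide (as the +4 increment suggests), each instruction has been classified from the binary into one of four hypothetical kinds (`"other"`, `"cond_branch"`, `"jump"`, `"indirect"`), and the executable address space is a single contiguous range. None of these names come from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    pc: int
    kind: str                     # "other", "cond_branch", "jump", or "indirect"
    target: Optional[int] = None  # static target of a direct transition


def first_illegal_transition(trace, exec_range):
    """Return (index, rule) for the oldest illegal transition, or None.

    trace: list of Instr in serial program order, oldest first.
    exec_range: (low, high) bounds of the executable address space.
    rule: 1..4, matching the order of the four checks in the text.
    """
    low, high = exec_range
    for i in range(len(trace) - 1):
        cur, nxt = trace[i], trace[i + 1].pc
        fell_through = nxt == cur.pc + 4
        if cur.kind == "other":
            if not fell_through:
                return i, 1  # PC changed without a control-flow instruction
        elif cur.kind == "jump":
            if nxt != cur.target:
                # No jump at all is rule 2; a jump to the wrong place is rule 3.
                return i, (2 if fell_through else 3)
        elif cur.kind == "cond_branch":
            if not fell_through and nxt != cur.target:
                return i, 3  # taken branch landed on the wrong target
        elif cur.kind == "indirect":
            if not (low <= nxt < high):
                return i, 4  # register-indirect target outside executable space
    return None


trace = [
    Instr(0x1000, "other"),
    Instr(0x1004, "indirect"),
    Instr(0x9000, "other"),   # outside the executable range below
]
result = first_illegal_transition(trace, (0x1000, 0x2000))
```

Because the scan starts at the oldest PC and returns on the first violation, the reported transition is the earliest control-flow inconsistency, which is the one that matters for choosing the entry point into the low-level analysis.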
The low-level analysis involves asking a series of microarchitecture-specific questions according to the decision diagram. We present a simple example by tracing one of the paths in the decision diagram.
Consider an example where a segfault (Section 2.2) during instruction access was detected, and the fourth illegal transition of the control-flow analysis was identified. We also assume that R0 shown in Figure 5 was the register used for the register-indirect transition. Instructions B and C have a producer-consumer relationship: B writes its result into register R0, and C uses a value from register R0.
The first question in the decision diagram is whether C consumed the value B produced. The execute-stage recorder contains the residues of results and the issue-stage recorder contains the residues of operands of instructions. Comparing the two values during post-analysis shows that they do not match; i.e., B produced a value with residue of 5, while C received a value with residue of 3. This is clearly a problem.
The second question in the decision diagram is whether C and B used the same physical register to pass along the value. Analysis of the contents of the dispatch-stage recorder, which records the physical register name, reveals that B wrote its results into physical register P2, while C read its operand value from physical register P5, and they are not the same as shown in Figure 6.
There is again a problem, and the third question in the decision diagram asks whether C used a value produced by the previous producer (instruction that wrote its result into register R0 prior to the immediate producer) of register R0. Instruction A in Figure 7 is the previous producer of register R0 and analysis of the contents of the dispatch-stage recorder reveals that indeed that is the case.
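The three questions traced so far can be written as a short chain of checks over linked footprints. This is an illustrative sketch only: the dictionary keys and the returned labels are hypothetical, and the real decision diagram continues with further microarchitecture-specific questions after each outcome.

```python
def trace_operand_path(a: dict, b: dict, c: dict) -> str:
    """Walk the first three questions for one architectural register.

    a, b, c: linked footprints of the previous producer, the immediate
    producer, and the consumer. Keys used here are assumed names:
      result_residue  (execute-stage recorder, producers)
      operand_residue (issue-stage recorder, consumer)
      dest_preg / src_preg (dispatch-stage recorder, physical register names)
    """
    # Question 1: did C consume the value B produced?
    if c["operand_residue"] == b["result_residue"]:
        return "consistent"
    # Question 2: did C and B use the same physical register?
    if c["src_preg"] == b["dest_preg"]:
        return "same_register_wrong_value"
    # Question 3: did C read the register written by the previous producer A?
    if c["src_preg"] == a["dest_preg"]:
        return "read_previous_producer"
    return "read_unrelated_register"


# Values taken from the example in the text: residues 5 vs. 3, registers P2 vs. P5.
A = {"result_residue": 3, "dest_preg": "P5"}  # A's residue is assumed for the example
B = {"result_residue": 5, "dest_preg": "P2"}
C = {"operand_residue": 3, "src_preg": "P5"}
outcome = trace_operand_path(A, B, C)
```

With the values from the example, the walk ends at the third question: C read the physical register written by A rather than the one written by B.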
Asking several more questions leads to the bug location and the exposing stimulus shown in Figure 8. The instruction trace between instruction A and instruction B is responsible for stimulating the bug, and the trace afterwards is responsible for propagating the bug to an observation point such as a soft post-trigger.
4. Results
We evaluated IFRA by injecting errors into a microarchitectural simulator2 augmented with IFRA. For an Alpha 21264 configuration (4-way pipeline, 64 maximum instructions in-flight, 2 ALUs, 2 multipliers, 2 load/store units), there are 200 different microarchitectural blocks (excluding array structures and arithmetic units since errors inside those structures are immediately detected and localized using parity and/or residue codes, as discussed in Section 2.2). Each block has an average size equivalent to 10K 2-input NAND gates. Seven benchmarks from SPECint2000 (bzip2, gcc, gap, gzip, mcf, parser, vortex) were chosen as validation test programs as they represent a variety of workloads. Each recorder was sized to have 1024 entries.
All bugs were modeled as single bit-flips at flip-flops to target hard-to-repeat electrical bugs. This is an effective model because electrical bugs eventually manifest themselves as incorrect values arriving at flip-flops for certain input combinations and operating conditions.15
Errors were injected in one of 1191 flip-flops [Park and Mitra17]. No errors were injected inside array structures since they have built-in parities for error detection.
Upon error injection, the following scenarios are possible:
- Case 1: The error vanishes without any effect at the system level or produces an incorrect program output without any post-trigger firing. This case is related to the coverage of validation test programs and post-triggers, and is not the focus of this paper.
- Case 2: Failure manifestation with short error latency, where recorders successfully capture the history from error injection to failure manifestation (including situations where recording is stopped/paused upon activation of soft post-triggers).
- Case 3: Failure manifestation with long error latency, where 1024-entry recorders fail to capture the history from error injection to failure (including soft post-triggers).
Out of 100,000 error injection runs, 800 of them resulted in Cases 2 and 3. Figure 9 presents results from these two cases. The “exactly located” category represents the cases in which IFRA returned a single and correct location-time pair (as defined in Section 1). The “candidate located” category represents the cases in which IFRA returned multiple location-time pairs (called candidates) out of over 200,000 possible pairs (1 out of 200 microarchitectural blocks and 1 out of 1,000 cycles), and at least 1 pair was fully correct in both location and in time. The “completely missed” category represents the cases where none of the returned pairs were correct, even if either location or time is correct. In addition, we pessimistically report all errors that resulted in Case 3 as “completely missed.” All error injections were performed after a million cycles from the beginning of the program in order to demonstrate that there is no need to keep track of footprints from the beginning.
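The three reporting categories can be stated precisely as a small classification function. The sketch below is an assumption-laden illustration: location-time pairs are modeled as `(block, cycle)` tuples, and the `captured` flag stands in for whether the run fell into Case 2 (history captured) or Case 3 (history lost).

```python
def classify(returned_pairs: set, true_pair: tuple, captured: bool = True) -> str:
    """Classify one error-injection run.

    returned_pairs: set of (block, cycle) candidates reported by post-analysis.
    true_pair: the injected (block, cycle).
    captured: False models Case 3, which is pessimistically counted as a miss.
    """
    if not captured or true_pair not in returned_pairs:
        # A pair with only the location or only the time correct does not count.
        return "completely missed"
    if len(returned_pairs) == 1:
        return "exactly located"
    return "candidate located"


truth = ("scheduler", 412)  # hypothetical injected location-time pair
print(classify({("scheduler", 412)}, truth))
print(classify({("scheduler", 412), ("decoder", 410)}, truth))
print(classify({("scheduler", 400)}, truth))
```

The third call returns "completely missed" even though the block is right, reflecting the strict requirement that a pair be correct in both location and time.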
It is clear from Figure 9 that a large percentage of bugs were uniquely located to the correct location-time pair, while very few bugs were completely missed, demonstrating the effectiveness of IFRA.
5. Conclusion
IFRA targets the problem of post-silicon bug localization in a system setup, which is a major challenge in processor post-silicon design validation. There are two major novelties of IFRA:
- High-level abstraction for bug localization using low-cost hardware recorders that record semantic information about instruction data and control flows concurrently in a system setup.
- Special techniques, based on self-consistency, to analyze the recorded data for localization after failure detection.
IFRA overcomes major post-silicon bug localization challenges.
- It helps bridge a major gap between system-level and circuit-level debug.
- Failure reproduction is not required.
- Self-consistency checks associated with the analysis techniques minimize the need for full system-level simulation.
IFRA creates several interesting research directions:
- Automated construction of the post-analysis decision diagram for a given microarchitecture.
- Sensitivity analysis and characterization of the interrelationships between post-analysis techniques, architectural features, error detection mechanisms, recorder sizes, and bug types.
- Application to homogeneous/heterogeneous multi- and many-core systems, and systems-on-chip (SoCs) consisting of nonprocessor designs.
Acknowledgment
The authors thank A. Bracy, B. Gottlieb, N. Hakim, D. Josephson, P. Patra, J. Stinson, H. Wang of Intel Corporation, O. Mutlu and S. Blanton of Carnegie Mellon University, T. Hong of Stanford University, and E. Rentschler of AMD for helpful discussions and advice. This research is supported in part by the Semiconductor Research Corporation and the National Science Foundation. Sung-Boem Park is also partially supported by Samsung Scholarship, formerly the Samsung Lee Kun Hee Scholarship Foundation.
Figures
Figure 1. Post-silicon bug localization flow using IFRA.
Figure 2. Superscalar processor augmented with recording infrastructure.
Figure 3. Post-analysis summary: Park et al
Figure 4. Instruction footprint linking, with a maximum number of 2 instructions in flight (i.e., n = 2).
Figure 5. First question in the low-level analysis example: Did C consume the value B produced? Answer: No
Figure 6. Second question asked in the low-level analysis example: Did C and B use the same physical register to pass along the value? Answer: No
Figure 7. Third question asked in the low-level analysis example: Did C and A use the same physical register to pass along the value? Answer: Yes
Figure 8. Bug location shown on the left (the grey area, which includes part of the decoder responsible for decoding the architectural destination register, the write circuitry into a register mapping table, and all the pipeline registers in between) and the exposing stimulus shown on the right.