A stress procedure for reliability screening, the SHOrt Voltage Elevation (SHOVE) test, is analyzed here. During SHOVE, test vectors are run at a higher-than-normal supply voltage for a short period; functional tests and IDDQ tests are then performed at the normal voltage. This procedure is effective in screening oxide thinning, which occurs when the oxide thickness of a transistor is less than expected, as well as via defects. The stress voltage for SHOVE testing should be set such that the electric field across the oxide is approximately 6 MV/cm. The stress time can be calculated using the "effective oxide thinning" model. We also discuss the input-vector requirements for efficiently stressing complementary CMOS logic gates and CMOS domino logic gates.
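As a rough illustration of the 6 MV/cm guideline (the stress-time calculation from the effective-oxide-thinning model is not reproduced here), the following sketch converts an assumed oxide thickness into the corresponding SHOVE stress voltage; the thickness values are hypothetical examples, not data from the study.

# Hypothetical illustration of the ~6 MV/cm oxide-field guideline for SHOVE;
# the oxide thicknesses below are example values only.
E_STRESS = 6e6  # target oxide field during stress, in V/cm

def shove_stress_voltage(t_ox_nm):
    """Supply voltage that places roughly E_STRESS across an oxide of t_ox_nm nanometers."""
    t_ox_cm = t_ox_nm * 1e-7        # 1 nm = 1e-7 cm
    return E_STRESS * t_ox_cm       # V = E * t_ox

for t_ox in (5.0, 7.0, 10.0):       # example oxide thicknesses in nm
    print(f"t_ox = {t_ox:4.1f} nm  ->  stress voltage ~ {shove_stress_voltage(t_ox):.1f} V")

For a 7 nm oxide, for example, the guideline gives a stress voltage of about 4.2 V.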
The goal of IC production test is to avoid selling bad parts. The goal of fault grading is to assure that the test is thorough enough that only an acceptably small fraction of shipped parts are bad. Fault grading is almost always based on a single-stuck-fault (SSF) model.
The following article is a condensation of a COSINE task force report. The editing process required to condense the original report for these pages necessarily involves arbitrary judgements, and may have introduced emphases and perspectives that are not held by the authors of the report. All such differences are unintentional. Interested readers should obtain copies of the full report, which are available without charge from: Commission on Education, National Academy of Engineering, 2101 Constitution Avenue, N.W., Washington, D.C. 20418.
In previous work, a simple bound, 2/(L+1), on the aliasing probability in serial signature analysis for a random test pattern of length L was derived. This simple bound is sharpened here by almost a factor of two. For serial signature analysis, it is shown that the aliasing probability is bounded above by (L+1)/L^2 = (1+ε_L)/L (ε_L small for large L) for test lengths L less than the period, L_c, of the signature polynomial. The simple bounds derived are compared with exact as well as experimentally measured aliasing probability values. It is conjectured that 1/L is the best monotonic bound on the aliasing probability for serial signature analysis.
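The following Monte Carlo sketch shows the quantity being bounded under the usual independent-error model, in which each of the L response bits is erroneous with probability p; the degree-4 polynomial, L, and p are arbitrary illustrative choices, not values from the paper.

import random

POLY = 0b10011    # x^4 + x + 1, a primitive degree-4 polynomial (period 15)
DEGREE = 4

def remainder(bits):
    """Shift the error bit stream through the signature register (polynomial division mod 2)."""
    reg = 0
    for b in bits:
        reg = (reg << 1) | b
        if reg & (1 << DEGREE):
            reg ^= POLY
    return reg

def aliasing_probability(L, p, trials=200_000):
    """Estimate P(zero signature remainder | nonzero error stream)."""
    alias = nonzero = 0
    for _ in range(trials):
        errors = [1 if random.random() < p else 0 for _ in range(L)]
        if any(errors):
            nonzero += 1
            if remainder(errors) == 0:
                alias += 1
    return alias / nonzero

L, p = 10, 0.2
print(f"measured aliasing         ~ {aliasing_probability(L, p):.4f}")
print(f"earlier bound 2/(L+1)     = {2 / (L + 1):.4f}")
print(f"sharpened bound (L+1)/L^2 = {(L + 1) / L**2:.4f}")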
Real-time computing systems often have stringent reliability and performance requirements. Failures in real-time systems can result in data corruption and degraded performance, and can ultimately be catastrophic. Previous researchers demonstrated the use of reconfigurable hardware (e.g., Field-Programmable Gate Arrays, or FPGAs) for implementing cost-effective fault tolerance techniques in general applications. In this dissertation, we developed techniques to design reliable real-time systems on FPGAs.
To demonstrate the effectiveness of our design techniques, we implemented a robot control algorithm on FPGAs. Various fault tolerance features are implemented in the robot controller to ensure reliability. Our implementation results show that the performance of the FPGA-based controller with triple modular redundancy (TMR) is comparable to that of a software-implemented control algorithm (with TMR) running on a microprocessor. We developed a roll-forward technique for transient error recovery in TMR-based real-time systems. Our technique does not require any recomputation and therefore significantly reduces the timing overhead associated with conventional recovery techniques. Analytical results show that our recovery scheme can significantly improve the reliability of TMR systems compared to conventional approaches. Implementation results for the robot controller design demonstrate that our recovery scheme introduces very little area overhead.
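As a generic illustration of roll-forward recovery in a TMR system, not the specific FPGA scheme developed in this dissertation, the sketch below votes on the three replica outputs and simply reloads any disagreeing replica with the voted state, so execution continues without rollback or recomputation.

def majority(a, b, c):
    """Two-out-of-three majority vote."""
    return a if a == b or a == c else b

def tmr_step(states, step, inp):
    """Run one computation step on three replicas, vote, and resynchronize them."""
    outputs = [step(s, inp) for s in states]
    voted = majority(*outputs)
    # Roll-forward: a disagreeing replica is reloaded with the voted state; nothing is recomputed.
    return [voted, voted, voted], voted

# Toy replicated computation: an accumulator.
step = lambda state, inp: state + inp
states = [0, 0, 0]
for i, x in enumerate([3, 5, 2, 7]):
    if i == 2:
        states[1] = 999            # inject a transient error into replica 1
    states, out = tmr_step(states, step, x)
    print(f"input {x}: voted output {out}")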
The conventional approach to permanent fault repair in FPGAs is to reconfigure the design so that the faulty part is avoided. However, for TMR systems with high area utilization or long mission times, this approach may not be applicable because additional hardware resources are unavailable. In such circumstances, our new permanent fault repair scheme reconfigures the original TMR-based design into another fault-tolerant design of smaller area so that the faulty elements are avoided. Unlike TMR systems, however, the reconfigured design may incur extra delays during transient error recovery. Three new design techniques for this repair scheme are presented. Analytical results show that these techniques can significantly reduce the delay overhead due to rollbacks. The effectiveness of our repair approach is demonstrated using the robot controller design. A repair scheme is also designed for FPGA interconnects. Unlike conventional schemes that use redundant buses, our scheme needs only a spare wire and can repair failures caused by a single faulty wire connecting FPGAs.
This paper analyzes the data integrity of one of the most widely used lossless data compression techniques, Lempel-Ziv (LZ) compression. Because data reconstruction from compressed codewords relies on previously decoded results, a transient error during compression may propagate to the decoder and cause significant corruption in the reconstructed data. To recover the system from transient faults, we designed two rollback error recovery schemes for the LZ compression hardware, the "reload-retry" and "direct-retry" schemes. Statistical analyses show that the "reload-retry" scheme can recover the LZ compression process from transient faults in one dictionary reload cycle with a small amount of hardware redundancy. The "direct-retry" scheme can recover normal operation with a shorter latency but with a small degradation in the compression ratio.
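The sketch below uses a minimal LZ78-style coder, assumed here only as a stand-in for the LZ hardware of the paper, to show why such errors propagate: each codeword references a dictionary entry built from earlier decoded output, so corrupting a single codeword index corrupts the reconstruction from that point onward.

def lz78_encode(text):
    dictionary, out, prefix = {"": 0}, [], ""
    for ch in text:
        if prefix + ch in dictionary:
            prefix += ch
        else:
            out.append((dictionary[prefix], ch))       # (index of longest match, next char)
            dictionary[prefix + ch] = len(dictionary)
            prefix = ""
    if prefix:
        out.append((dictionary[prefix[:-1]], prefix[-1]))
    return out

def lz78_decode(codes):
    entries, out = [""], []
    for idx, ch in codes:
        phrase = entries[idx] + ch                     # relies on previously decoded entries
        entries.append(phrase)
        out.append(phrase)
    return "".join(out)

codes = lz78_encode("abababababababab")
print("decoded (clean):    ", lz78_decode(codes))

corrupted = list(codes)
idx, ch = corrupted[2]
corrupted[2] = (idx ^ 1, ch)                           # flip one bit of a single codeword index
print("decoded (corrupted):", lz78_decode(corrupted))  # wrong from that phrase onward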
A new method for delay fault testing of digital circuits is presented. Unlike catastrophic failures, which simply produce incorrect steady-state logic values at the circuit outputs, delay faults change the shape of the output waveforms by moving the signal transitions in time. Because the output waveforms therefore contain information about the circuit delays, the waveforms between samples are analyzed as well, rather than only latching the outputs at the sampling time. Two classes of output waveform analysis are discussed. In the first technique, the output waveform is observed for any changes after the sampling time, since in a fault-free circuit the outputs are expected to have stabilized at the desired logic values. In the second technique, information is extracted from the faulty and fault-free waveforms before the sampling time and compared for any differences. Circuits for the waveform analyzers are presented to show that the method is feasible, and experimental results are given.
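The following is a schematic, software-only version of the first technique: any output transition observed after the sampling time flags a suspected delay fault. The waveforms are simple (time, value) edge lists invented for illustration and do not model the analyzer circuits presented in the paper.

def has_late_transition(edges, t_sample):
    """edges: list of (time, logic_value) pairs, one per output transition."""
    return any(t > t_sample for t, _ in edges)

T_SAMPLE = 10.0                             # example sampling instant, in ns
fault_free = [(0.0, 0), (4.2, 1)]           # output settles well before the sample point
delayed    = [(0.0, 0), (11.3, 1)]          # rising transition pushed past the sample point

for name, wave in [("fault-free", fault_free), ("delayed", delayed)]:
    verdict = "delay fault suspected" if has_late_transition(wave, T_SAMPLE) else "pass"
    print(f"{name}: {verdict}")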
Delay defects can escape detection during the normal production test flow, particularly if they do not affect any of the long paths exercised by the tests. Some delay defects can have their delay increased, making them easier to detect, by carrying out the test at a very low supply voltage (VLV testing). However, VLV testing is not effective for delay defects caused by high-resistance interconnects. This paper presents a screening technique for such defects: cold testing, which relies on carrying out the test at low temperature. One particular type of defect, the silicide open, is analyzed, and experimental data are presented to demonstrate the effectiveness of cold testing.
Research to develop a testing methodology for flight software is described. An experiment was conducted using assertions to dynamically test digital flight control software. The experiment showed that 87% of typical errors introduced into the program would be detected by assertions. Detailed analysis of the test data showed that the number of assertions needed to detect those errors could be reduced to a minimal set. The analysis also revealed that the most effective assertions tested program parameters that provided greater indirect (collateral) testing of other parameters. In addition, a prototype watchdog task system was built to evaluate the effectiveness of executing assertions in parallel by using the multitasking features of Ada.
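The sketch below is a toy example of such executable assertions; the parameter names, ranges, and the cross-parameter (collateral) check are invented for illustration and are not taken from the flight control program used in the experiment.

def check_assertions(state):
    """Return the list of assertion violations for one control-loop iteration."""
    violations = []
    # Range assertions on individual parameters.
    if not -30.0 <= state["pitch_deg"] <= 30.0:
        violations.append("pitch out of range")
    if not 0.0 <= state["airspeed_mps"] <= 300.0:
        violations.append("airspeed out of range")
    # Cross-parameter assertion: checking a derived quantity also exercises the
    # parameters it is computed from (the 'collateral testing' effect).
    predicted_climb = state["airspeed_mps"] * state["pitch_deg"] * 0.017
    if abs(predicted_climb - state["climb_mps"]) > 5.0:
        violations.append("climb rate inconsistent with pitch and airspeed")
    return violations

good = {"pitch_deg": 5.0, "airspeed_mps": 100.0, "climb_mps": 8.5}
bad  = {"pitch_deg": 5.0, "airspeed_mps": 100.0, "climb_mps": -20.0}   # injected error
print(check_assertions(good))   # -> []
print(check_assertions(bad))    # -> ['climb rate inconsistent with pitch and airspeed']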
Redundancy techniques like duplication and Triple Modular Redundancy (TMR) are widely used for designing dependable systems to ensure high reliability and data integrity. In this paper, for the first time, we develop fault models for common-mode failures (CMFs) in redundant systems and describe techniques to design redundant systems protected against the modeled CMFs. We first develop an input-register-CMF model that targets systems with register-files. This paper shows that, in the presence of input-register-CMFs, we can always design duplex or TMR systems that either produce correct outputs or indicate error situations when incorrect outputs are produced. This property ensures data integrity. Next, we extend the input-register-CMF model to consider systems where the storage elements of the registers are not organized in register-files; instead, the register flip-flops are placed using conventional CAD programs. For this case, we present a technique to synthesize redundant systems with guaranteed data integrity against the extended input-register-CMFs.
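As a generic illustration of the data-integrity property (either a correct output or an error indication), the sketch below shows a duplex adder whose second copy stores its operands bitwise complemented, so a common-mode failure that forces the same register bit to the same value in both copies perturbs only one of the decoded operands and is caught by the comparator. The encoding and fault model are illustrative assumptions, not the input-register-CMF model or the synthesis technique of the paper.

WIDTH = 8
MASK = (1 << WIDTH) - 1

def force_bit(x, bit, value):
    """Force one bit of x to a fixed value (models a common-mode upset)."""
    return (x | (1 << bit)) if value else (x & ~(1 << bit) & MASK)

def duplex_add(a, b, cmf=None):
    a1, b1 = a, b                        # copy 1 stores the operands directly
    a2, b2 = ~a & MASK, ~b & MASK        # copy 2 stores them bitwise complemented
    if cmf is not None:                  # common-mode failure: the same input-register
        bit, val = cmf                   # bit is forced to the same value in both copies
        a1 = force_bit(a1, bit, val)
        a2 = force_bit(a2, bit, val)
    r1 = (a1 + b1) & MASK
    r2 = ((~a2 & MASK) + (~b2 & MASK)) & MASK   # copy 2 decodes its operands first
    if r1 != r2:
        return None, True                # mismatch: error indicated, no wrong data delivered
    return r1, False

print(duplex_add(23, 42))                  # -> (65, False): correct result, no error flag
print(duplex_add(23, 42, cmf=(3, 0)))      # -> (None, True): common-mode failure detected and flagged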