It is possible to screen COTS components for appropriate levels of tds, and to shield them as required, but there is no similar fix for SEUs. In this context, software-based fault tolerance is an attractive solution, since it allows dependable systems to be implemented without incurring the high costs of developing custom hardware-based tolerance techniques that are not readily available in off-the-shelf products [1]. Efficient fault-injection-based assessment of software. Fault tolerance techniques are also separated into two major fields, hardware redundancy and time redundancy. In the context of software-implemented fault injection (SWIFI), injecting MBUs into all variables is infeasible because of the exponential size of the fault space, which makes it necessary to carefully select the fault injection points that maximise the probability of causing a failure. Keywords: software-implemented fault tolerance, distributed systems, high availability.
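The size of the fault space mentioned above can be made concrete with a short sketch. The helper names `flip_bits` and `fault_space_size` are hypothetical and are not taken from any SWIFI tool; the sketch assumes a fault is modelled as a set of inverted bit positions in a fixed-width word.

```python
from math import comb


def flip_bits(value, positions, width=32):
    """Return value with the given bit positions inverted (a simulated upset)."""
    mask = 0
    for pos in positions:
        if not 0 <= pos < width:
            raise ValueError(f"bit {pos} outside a {width}-bit word")
        mask ^= 1 << pos  # a position listed twice cancels itself out
    return (value ^ mask) & ((1 << width) - 1)


def fault_space_size(n_bits, max_flips):
    """Count the distinct faults that flip between 1 and max_flips bits."""
    return sum(comb(n_bits, k) for k in range(1, max_flips + 1))
```

For a single 32-bit variable there are 32 single-bit faults, and allowing up to two flipped bits already gives 528. The count keeps growing with the number of simultaneous flips and with the number of variables, which is why injection points have to be selected rather than enumerated.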
Reliability through redundant parallelism for microsatellite. In the ARGOS experiment, these two approaches were compared in an actual space experiment. Fault mitigation techniques based on pure software, known as software-implemented hardware fault tolerance (SIHFT), are very attractive for use in commercial off-the-shelf (COTS) microprocessors because they do not require physical modification of the system. An open and versatile fault-injection framework for. We assess the effectiveness of software-implemented hardware fault tolerance (SIHFT) techniques in enhancing the reliability of COTS. In addition, several design constraints are analyzed to determine the scalability of the system. A SIHFT technique can provide an inexpensive alternative to hardware and/or information redundancy. COTS board: first, the software fault tolerance techniques that have been used in the ARGOS test bed are introduced in this section. All hardware-implemented FI techniques require specialized hardware setups, and only FI via test access ports can be achieved with COTS hardware while retaining the repeatability and controllability necessary for detailed post-injection analysis. The goal of this project is to collect data on the errors that occur in microprocessors in a space environment, to determine the tradeoffs between fault avoidance and fault tolerance, and to. COTS devices for space applications has been suggested to accelerate. Modern experiments in particle physics are based on advanced and sophisticated electronic systems which have to operate under radiation impact. An efficient control-flow checking technique for the. The effects of an ARMOR-based SIFT environment on the.
A software-implemented configurable control flow checking. In particular, in microprocessor-based systems, since the fault mitigation techniques based on pure software, also known as software-implemented hardware fault tolerance (SIHFT) [6], do not require physical modi. We present the architecture of the fault-tolerant control computer of the BIRD. The introduction of software-implemented hardware fault tolerance (SIHFT) [6] techniques for fault detection is applicable to COTS-based devices, providing low-cost solutions for enhancing the reliability of these systems without modifying the hardware. Together, these aspects of the Dependable Multiprocessor will allow space scientists to perform on. There's a research group at Stanford that's done a lot of work with SIHFT, and even flew a test processor on the ARGOS satellite.
We then present a quantitative evaluation that demonstrates significant reliability improvement from the COTS-based fault tolerance. A major concern in digital electronics used in space is radiation-induced. This technical report contains the text of Philip Shirvani's Ph.D. SIHFT and COTS commercial components are designed to function in an. Predeployment validation of fault-tolerant systems through. In this paper, I compare fault avoidance techniques with purely software-implemented fault tolerance techniques, drawing on the experiments carried out in a special space mission, the ARGOS project. Christopher Wilson, New Opportunities Lead, NASA Goddard. Actually, there has been some research into using software-implemented hardware fault tolerance (SIHFT) to guard against the effects of bit flips. Reis presents a software-implemented fault tolerance mechanism named SWIFT, with an enhanced control-flow checking mechanism based on compiler technology [5].
Software-implemented fault detection for high-performance space applications, Michael Turmon, Robert Granat, and Daniel S. In the ARGOS project [1], these two approaches are compared in an actual space experiment. Validating software-implemented fault tolerance mechanisms. A performance evaluation of the software-implemented fault tolerance computer, Daniel L. Fault-tolerant computing in space environment. The book presents the theory behind software-implemented hardware fault tolerance, as well as the practical aspects of putting it to work on real examples. Shirvani, Department of Electrical Engineering, Stanford University, Ph.D. Butler, NASA Langley Research Center, Hampton, Virginia. The results of a performance evaluation of the Software Implemented Fault Tolerance (SIFT) computer system conducted in the NASA Avionics Integration Research Laboratory are presented. Introduction. Radiation, such as alpha particles and cosmic rays, can cause transient faults in electronic systems.
The Unconventional Stellar Aspect (USA) experiment on. Software Implemented Fault Tolerance (SIFT): SIFT is a contrasting approach to a fault-tolerant multiprocessor. Fault tolerance through redundant COTS components for satellite processing applications. Advanced-technology fault-tolerant software is used to detect and correct faults. Using approximate computing and selective hardening for. Our fault injection experiment simulating bit flips in memory shows that EDDI provides over. Validating software-implemented fault tolerance mechanisms for critical space systems (regular paper). Abstract: Fault-tolerant system architectures for space applications are currently validated using system-level testing. Chapter 1: Introduction. Space, the final frontier, will become more and more popular. In Proceedings of the International Conference on Information, Communications and Signal Processing. Segall, Carnegie-Mellon University, Pittsburgh, Pennsylvania. Prepared for Langley Research Center under grant NAG-1-190, National Aeronautics and Space Administration, Office of Management.
Software-implemented fault injection: much more cost-effective alternatives are several variants. Section 4 presents data from software-implemented fault insertion experiments and provides a comparison, where applicable, to similar hardware fault insertion experiments. SEUs are a major cause of concern in a space environment, and have also been observed at ground. The comparison of different fault injection techniques leads to the conclusion that the emulation-based approach has key advantages for achieving the goals required for fault tolerance. Hardware and software fault tolerance of softcore processors implemented in SRAM-based FPGAs, Nathaniel H. In the ARGOS project [1], these two approaches are compared in an actual space.
Fault tolerance: adding an extra node; temporal redundancy: allowing extra time. Fault tolerance can be defined as the ability to comply with the specification in spite of faults. It is in this context that we describe and test the mathematical background for using checksum methods to validate results returned by a numerical subroutine operating in an SEU-prone environment. This work was supported in part by the Ballistic Missile Defense Organization, Innovative Science and Technology (BMDO/IST) Directorate and administered through the Department of the Navy, Office of Naval Research under grant nos. Software-implemented hardware fault tolerance. In effect, extra hardware and additional computations are the costs of using COTS equipment in space. Rollins, Department of Electrical and Computer Engineering, Doctor of Philosophy. Softcore processors are an attractive alternative to using expensive radiation-hardened processors for space-based applications. Off-the-shelf (COTS) components have been investigated for space applications because of their. Commercial off-the-shelf (COTS) components have been considered as a low-cost alternative to radiation-hardened parts. Single COTS processor with time redundancy (SIFT): in this approach, a single COTS processor is used together with software-implemented fault tolerance (SIFT), which executes the entire software or certain software sections twice or more. Bringing fault-tolerant gigahertz computing to space. Towards affordable fault-tolerant nanosatellite computing.
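The time-redundancy approach described above, in which a section of software is executed twice or more on a single processor, can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheme of any particular mission: the function name `run_with_time_redundancy` and the `max_runs` parameter are hypothetical, results are assumed to be hashable and comparable with `==`, and faults are assumed to be transient so that a rerun can succeed.

```python
from collections import Counter


def run_with_time_redundancy(task, *args, max_runs=3):
    """Execute task at least twice; accept a result once two runs agree."""
    first = task(*args)
    second = task(*args)
    if first == second:
        return first  # common case: both executions match
    results = [first, second]
    # Mismatch detected: spend extra time on further runs and take a majority.
    for _ in range(max_runs - 2):
        results.append(task(*args))
        value, count = Counter(results).most_common(1)[0]
        if count >= 2:
            return value
    raise RuntimeError("no two executions agreed; fault not masked")
```

Two matching runs detect a transient error at roughly double the execution time; the optional third run is what turns detection into correction.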
Software-implemented hardware fault tolerance (SIHFT) techniques can provide low-cost solutions for enhancing the reliability of these systems without modifying the hardware [3, 4, 5]. Software-based fault tolerance techniques, also referred to in the literature as software-implemented hardware fault tolerance (SIHFT) [10], are techniques implemented in software to protect. Huhtinen, CERN, Geneva, Switzerland, First evaluation of the single event upset. The REE system design deployed a software-implemented fault tolerance (SIFT) middleware layer, known as the Adaptive Reconfigurable Mobile Objects of Reliability (ARMOR). By accurately evaluating the advantages and disadvantages of the already available approaches, the book provides a guide to developers willing to adopt software-implemented hardware.
Software-implemented fault injection techniques allow injection of faults. By featuring COTS devices to perform the critical data processing, supported by simpler rad-hard devices that monitor and manage the COTS devices, and augmented with novel uses of fault tolerant. The presence of SEUs requires that applications be self-checking, or tolerant of errors, as the first layer of fault tolerance. Using approximate computing and selective hardening for the. The typical space-based redundancy is software-implemented hardware fault tolerance based on a fault-tolerant compiler and two-pass adjudicators (TPA). Fault-tolerant computing for radiation environments, Center for Reliable Computing (CRC), Philip P. Development of the Remote Exploration and Experimentation (REE) commercial off-the-shelf (COTS) based spaceborne supercomputer requires a detailed model of single event upset (SEU) induced faults. Reliable management services for COTS-based space systems and. A SIHFT technique can provide an inexpensive alternative to hardware and/or information redundancy techniques and can be especially attractive when using. Traditional hardware-based fault-tolerance (FT) concepts for general purpose. Software fault tolerance is the ability of computer software to continue its normal operation despite the presence of system or hardware faults. The space industry is continually growing and new products and services will be required.
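One way an application can be self-checking is to validate the result of a numerical routine with a checksum, in the spirit of the checksum methods mentioned earlier. The sketch below is an assumption-laden illustration for a matrix-vector product and is not the formulation used in any of the cited work: the function names and the relative tolerance `tol` are hypothetical. It relies on the identity that the sum of the entries of `y = A x` equals the dot product of the column sums of `A` with `x`.

```python
def matvec(a, x):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def checksum_ok(a, x, y, tol=1e-9):
    """Check a returned y against the column-sum checksum of a and x."""
    col_sums = [sum(col) for col in zip(*a)]
    expected = sum(c * xj for c, xj in zip(col_sums, x))
    actual = sum(y)
    # Relative tolerance so floating-point rounding is not flagged as a fault.
    scale = max(1.0, abs(expected), abs(actual))
    return abs(expected - actual) <= tol * scale
```

The check costs one extra pass over the matrix rather than a full recomputation. It detects a corrupted result only when the corruption changes the sum by more than the tolerance, so errors that cancel in the sum go unnoticed.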
However, these techniques cause software overheads that may affect the efficiency and costs of the overall system. Fault masking is enhanced through multiple buses or a redundantly linked network over which multiple copies of data are transmitted. Fault isolation can be obtained by physical isolation of failed hardware components. COTS hardware and software layers, the middleware layer is projected to provide a system availability of 0. Software-implemented hardware fault tolerance experiments: COTS in space. Software-implemented transient fault detection in space. Fault-tolerant parallel-processing supercomputer for spacecraft onboard scientific data processing. Software-implemented hardware fault tolerance (SIHFT).
Fault injection experiment results in space borne. International Conference on Dependable Systems and Networks (FTCS-30 and DCCA-8), Fast Abstracts, pp. A Valgrind-based soft error injection tool for SIHFT evaluations. Wood, Software-implemented hardware fault tolerance experiments. The paper deals with the problem of checking system fault susceptibility in simulation experiments. Software-implemented hardware fault tolerance. Section III presents our research on field-programmable logic devices (FPLDs) and their use in adaptive computing systems (ACS). The Dependable Multiprocessor (DM) is one of the four experiments on the upcoming NMP Space Technology 8 (ST8) mission, to be launched in 2009, and the experiment seeks to deploy commercial off-the-shelf (COTS) technology to boost onboard processing performance per watt. Scheduling tradeoffs for heterogeneous computing on an. Software-implemented hardware fault tolerance experiments: COTS in space, January 2000. Hardware fault tolerance (SIHFT) techniques can provide. This mechanism is useful for software fault tolerance, but does nothing for the other related hardware modules.
A fault-tolerant structure for reliable multicore systems. Programs are partitioned into blocks and acceptance tests are. Software fault tolerance for low-to-moderate radiation. This is viable for systems relying on hardware measures, but unsuitable for fault tolerance (FT) implemented in software. Job scheduling, COTS components, space systems, scalability, fault tolerance. In Section II, we discuss software-implemented hardware fault tolerance (SIHFT) techniques and the space experiment that we are involved in. Using approximate computing (Electronics). Following the COTS philosophy laid out above, our general approach has been to wrap exist. The primary goals of this project are to collect data on the errors that occur in digital integrated circuits in a space environment, to determine the tradeoffs between fault avoidance and fault. In the previous stage of the experiment, we observed that SEUs corrupted the memory, forcing frequent system resets. COTS-based fault tolerance, IEEE 94, space applications, tree topology, bus network reliability. Principal contact.
After a brief overview of the software development processes, we note how hard-to-detect design faults are likely to be introduced during development. Operating system support for redundant multithreading. Proceedings of the International Conference on Dependable Systems and Networks, New York, NY. Software-implemented fault detection for high-performance. Power fluctuation and electromagnetic interference may cause bit flips in memories. Introduction. High-performance applications that form the basis of many. In contrast, software-implemented fault tolerance can easily be adapted by patching the involved applications and integrating with operating system (OS) resource scheduling. We assessed the effectiveness of software-implemented hardware fault tolerance (SIHFT) techniques in enhancing the reliability of COTS. Use of commodity off-the-shelf (COTS) components in space, on the other hand, implies that faults must be handled in software. Fault-tolerant computing: basic concepts, UCLA Computer. Software-implemented fault inserters. Emulation-based approach to ISO 26262 compliant processors design.
COTS in space, International Conference on Dependable Systems and Networks, Fast Abstracts, June 25-28, 2000. In the ARGOS project, these two approaches were compared in an actual space experiment. Fault-tolerant software has the ability to satisfy requirements despite failures. Effectiveness of software-based hardening for radiation-induced. Center for Reliable Computing technical report: fault-tolerant. Introduction. Radiation, that is, alpha particles and cosmic rays, can cause transient faults in electronic systems; such faults cause errors known as single-event upsets (SEUs). The Dependable Multiprocessor validation experiment will demonstrate the technological maturity of a COTS-based computer architecture and its fault-tolerant software. Emulation-based approach to ISO 26262 compliant processors.
System management services for high-performance in-situ. COTS in space, International Conference on Dependable Systems and Networks (FTCS-30 and DCCA-8), New York, NY, 2000, pages B56-57. Dependable Multiprocessor: space mission and science news. The CRC ARGOS project involves fault tolerance experiments conducted on a couple of processor boards on board the ARGOS experimental satellite. We implemented EDAC in software and use periodic scrubbing to protect the code segments of operating.
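Software error detection and correction with periodic scrubbing can be illustrated with a small sketch. The source does not say which code was used, so the Hamming(7,4) single-error-correcting code below is an assumption chosen for brevity, and all function names are hypothetical. Each 4-bit value is stored as a 7-bit codeword, and a scrub pass rewrites any word that contains a single flipped bit before a second upset can accumulate in it.

```python
DATA_POS = (3, 5, 6, 7)  # codeword positions (1-based) that carry data bits


def syndrome(word):
    """XOR of the 1-based positions of all set bits; 0 for a valid codeword."""
    s = 0
    for pos in range(1, 8):
        if (word >> (pos - 1)) & 1:
            s ^= pos
    return s


def encode(nibble):
    """Encode a 4-bit value as a 7-bit Hamming codeword."""
    word = 0
    for i, pos in enumerate(DATA_POS):
        if (nibble >> i) & 1:
            word |= 1 << (pos - 1)
    s = syndrome(word)
    for p in (1, 2, 4):  # set parity bits so the overall syndrome becomes 0
        if s & p:
            word |= 1 << (p - 1)
    return word


def correct(word):
    """Return (corrected word, whether a single-bit error was repaired)."""
    s = syndrome(word)
    if s:
        word ^= 1 << (s - 1)  # the syndrome is the position of the bad bit
    return word, s != 0


def decode(word):
    """Correct a codeword and extract its 4 data bits."""
    word, _ = correct(word)
    return sum(((word >> (pos - 1)) & 1) << i for i, pos in enumerate(DATA_POS))


def scrub(memory):
    """One scrubbing pass: repair single-bit errors in place, return the count."""
    repaired = 0
    for i, word in enumerate(memory):
        fixed, changed = correct(word)
        if changed:
            memory[i] = fixed
            repaired += 1
    return repaired
```

In a real system `scrub` would run periodically over the protected region. This code corrects one flipped bit per word; two flips in the same word are miscorrected, which is the reason scrubbing must run often enough to keep errors from accumulating.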
Predeployment validation of fault-tolerant systems through software-implemented fault insertion, Edward W. Tools, techniques, and contributions of this dissertation. The next step will be the design and fabrication of a hardware prototype that will match the mass and form factor of a future flight model and will demonstrate. When the hardware cannot be changed, a pure software method is the only feasible solution. Strategies for fault-tolerant, space-based computing. Software-implemented hardware fault tolerance experiments. Such faults cause errors called single-event upsets (SEUs). This additional overhead diminishes the gain in performance that the use of COTS equipment would otherwise provide.
Orals presentation, March 5, 2001. The challenge: Stanford CRC ARGOS project. Software-implemented hardware fault tolerance. The problem of designing a hardened system becomes very important, especially in places such as accelerators and synchrotrons, where the results of the experiments depend on control systems based on digital devices, e.g. Center for Reliable Computing technical report: fault. Software-implemented hardware fault tolerance experiments: COTS in space, 2000. Hardware fault tolerance; software fault tolerance; software-implemented hardware fault tolerance. In all types, fault tolerance is. These techniques include software-implemented error detection and correction codes, error detection by duplicated instructions, control flow checking by software signatures, and a watchdog timer. Fault-tolerant computing in space environment and software. Many experiments and studies using pin-level fault injection were carried out during the. Single event upsets are a major cause of concern in a space. Chameleon is a software-implemented fault tolerance (SIFT) middleware capable of providing adaptive fault tolerance in a COTS (components off-the-shelf) environment, with the capability to adapt to changing runtime requirements as well as changing application requirements. Radtest testing board for the software implemented.
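Of the techniques listed above, control flow checking by software signatures lends itself to a short sketch. The version below is a deliberately simplified illustration and not the algorithm of any published scheme: each basic block is identified by a name acting as its signature, the table of legal successors is assumed to be known at build time, and `SignatureMonitor`, `ControlFlowError`, `CFG` and `run` are hypothetical names.

```python
class ControlFlowError(RuntimeError):
    """Raised when execution reaches a block by an illegal transfer."""


class SignatureMonitor:
    def __init__(self, successors, entry):
        self.successors = successors  # block signature -> legal next blocks
        self.current = entry

    def enter(self, block):
        """Called at the top of each block to check and update the signature."""
        if block not in self.successors.get(self.current, ()):
            raise ControlFlowError(
                f"illegal transfer {self.current!r} -> {block!r}"
            )
        self.current = block


# Control-flow graph of a small program: entry, a loop body, and an exit block.
CFG = {"entry": {"loop"}, "loop": {"loop", "exit"}, "exit": set()}


def run(trace):
    """Replay a sequence of block entries and return the final block."""
    monitor = SignatureMonitor(CFG, "entry")
    for block in trace:
        monitor.enter(block)
    return monitor.current
```

A legal trace such as `["loop", "loop", "exit"]` passes, while a jump straight from `entry` to `exit`, as a corrupted branch target might cause, raises `ControlFlowError`. A check of this kind only catches transfers that leave the legal graph; a wrong but legal branch is not detected.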