Knowledge (XXG)

Fault tolerance


594:
application of brake force to all wheels. It would also be prohibitively costly to further double up the main components, and they would add considerable weight. However, the similarly critical systems for actuating the brakes under driver control are inherently less robust, generally using a cable (which can rust, stretch, jam, or snap) or hydraulic fluid (which can leak, boil and develop bubbles, or absorb water and thus lose effectiveness). Thus in most modern cars the footbrake hydraulic circuit is diagonally divided to give two smaller points of failure: the loss of either reduces brake power by only 50% and does not cause as much dangerous brake-force imbalance as a straight front-back or left-right split. Should the hydraulic circuit fail completely (a very rare occurrence), there is a failsafe in the form of the cable-actuated parking brake. It operates the otherwise relatively weak rear brakes, but can still bring the vehicle to a safe halt in conjunction with transmission/engine braking, so long as the demands on it are in line with normal traffic flow. The cumulatively unlikely combination of total footbrake failure with the need for harsh braking in an emergency will likely result in a collision, but still one at lower speed than would otherwise have been the case.
602:
rear brake is relatively strong compared to its automotive cousin, being a powerful disc on some sports models, even though the usual intent is for the front system to provide the vast majority of braking force; as the overall vehicle weight is more central, the rear tire is generally larger and has better traction, so that the rider can lean back to put more weight on it, therefore allowing more brake force to be applied before the wheel locks. On cheaper, slower utility-class machines, even if the front wheel should use a hydraulic disc for extra brake force and easier packaging, the rear will usually be a primitive, somewhat inefficient, but exceptionally robust rod-actuated drum, thanks to the ease of connecting the footpedal to the wheel in this way and, more importantly, the near impossibility of catastrophic failure even if the rest of the machine, like a lot of low-priced bikes after their first few years of use, is on the point of collapse from neglected maintenance.
849:. It attaches to the application process when an error occurs, repairs the execution, tracks the repair effects as the execution continues, contains the repair effects within the application process, and detaches from the process after all repair effects are flushed from the process state. It does not interfere with the normal execution of the program and therefore incurs negligible overhead. For 17 of 18 systematically collected real world null-dereference and divide-by-zero errors, a prototype implementation enables the application to continue to execute to provide acceptable output and service to its users on the error-triggering inputs. 598:
built into it per se (and it typically uses a cheaper, lighter, but less hardwearing cable actuation system), and it can suffice, if this happens on a hill, to use the footbrake to momentarily hold the vehicle still, before driving off to find a flat piece of road on which to stop. Alternatively, on shallow gradients, the transmission can be shifted into Park, Reverse or First gear, and the transmission lock / engine compression used to hold it stationary, as there is no need for them to include the sophistication to first bring it to a halt.
443: 344: 775:(TMR). The voting circuit can determine which replication is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result, and discard the erroneous version. After this, the internal state of the erroneous replication is assumed to be different from that of the other two, and the voting circuit can switch to a DMR mode. This model can be applied to any larger number of replications. 796:. Two replicated elements operate in lockstep as a pair, with a voting circuit that detects any mismatch between their operations and outputs a signal indicating that there is an error. Another pair operates exactly the same way. A final circuit selects the output of the pair that does not proclaim that it is in error. Pair-and-spare requires four replicas rather than the three of TMR, but has been used commercially. 482:(used in computing, similar to "fail safe") operates at a reduced level of performance after some component fails. For example, if grid power fails, a building may operate lighting at reduced levels or elevators at reduced speeds. In computing, if insufficient network bandwidth is available to stream an online video, a lower-resolution version might be streamed in place of the high-resolution version. 38: 313:
that to be fully effective, the system had to be self-repairing and diagnosing – isolating a fault and then implementing a redundant backup while alerting a need for repair. This is known as N-model redundancy, where faults cause automatic fail-safes and a warning to the operator, and it is still the most common form of level one fault-tolerant design in use today.
890:
the type of redundant resources added to the system. In time redundancy the computation or data transmission is repeated and the result is compared to a stored copy of the previous result. This kind of testing is currently referred to as 'In Service Fault Tolerance Testing', or ISFTT for short.
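As a rough illustration of time redundancy, the following Python sketch repeats a computation and compares each repeat against a stored copy of the first result. The function name `run_with_time_redundancy` and the `repeats` parameter are hypothetical, chosen only for this example; a real system would also need a policy for what to do once a mismatch is detected.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_time_redundancy(compute: Callable[[], T], repeats: int = 2) -> T:
    """Run `compute` several times and accept the result only if all runs agree.

    Hypothetical helper: it detects a fault by comparison, it does not correct it.
    """
    if repeats < 2:
        raise ValueError("time redundancy needs at least two runs")
    stored = compute()                # first run: keep a stored copy of the result
    for _ in range(repeats - 1):
        repeated = compute()          # repeat the same computation later in time
        if repeated != stored:        # a mismatch signals a fault in one of the runs
            raise RuntimeError("results disagree: fault detected")
    return stored
```

Because the same hardware is reused, this approach costs extra time rather than extra components, which is the trade-off that distinguishes it from space redundancy.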
882:
tires, and no one tire is critical (with the exception of the front tires, which are used to steer, but generally carry less load, each and in total, than the other four to 16, so are less likely to fail). The idea of incorporating redundancy in order to improve the reliability of a system was pioneered by
632:
Fault containment to prevent propagation of the failure – Some failure mechanisms can cause a system to fail by propagating the failure to the rest of the system. An example of this kind of failure is the "rogue transmitter" that can swamp legitimate communication in a system and cause overall system
597:
In comparison with the foot pedal activated service brake, the parking brake itself is a less critical item, and unless it is being used as a one-time backup for the footbrake, will not cause immediate danger if it is found to be nonfunctional at the moment of application. Therefore, no redundancy is
153:
Fault tolerance specifically refers to a system's capability to handle faults without any degradation or downtime. In the event of an error, end-users remain unaware of any issues. Conversely, a system that experiences errors with some interruption in service or graceful degradation of performance is
919:
Even if the operator is aware of the fault, having a fault-tolerant system is likely to reduce the importance of repairing the fault. If the faults are not corrected, this will eventually lead to system failure, when the fault-tolerant component fails completely or when all redundant components have
912:
Another variation of this problem is when fault tolerance in one component prevents fault detection in a different component. For example, if component B performs some operation based on the output from component A, then fault tolerance in B can hide a problem with A. If component B is later changed
905:
To continue the above passenger vehicle example, with either of the fault-tolerant systems it may not be obvious to the driver when a tire has been punctured. This is usually handled with a separate "automated fault-detection system". In the case of the tire, an air pressure monitor detects the loss
889:
Two kinds of redundancy are possible: space redundancy and time redundancy. Space redundancy provides additional components, functions, or data items that are unnecessary for fault-free operation. Space redundancy is further classified into hardware, software and information redundancy, depending on
593:
Another excellent and long-term example of this principle being put into practice is the braking system: whilst the actual brake mechanisms are critical, they are not particularly prone to sudden (rather than progressive) failure, and are in any case necessarily duplicated to allow even and balanced
950:
A fault-tolerant design may allow for the use of inferior components, which would have otherwise made the system inoperable. While this practice has the potential to mitigate the cost increase, use of multiple inferior components may lower the reliability of the system to a level equal to, or even
834:
Recovery shepherding is a lightweight technique to enable software programs to recover from otherwise fatal errors such as null pointer dereference and divide by zero. Compared to the failure-oblivious computing technique, recovery shepherding works on the compiled program binary directly and does
601:
On motorcycles, a similar level of fail-safety is provided by simpler methods; first, the front and rear brake systems are entirely separate, regardless of their method of activation (that can be cable, rod or hydraulic), allowing one to fail entirely while leaving the other unaffected. Second, the
624:
to the failing component – When a failure occurs, the system must be able to isolate the failure to the offending component. This requires the addition of dedicated failure detection mechanisms that exist only for the purpose of fault isolation. Recovery from a fault condition requires classifying
586:, so the second test is passed. The cost of a redundant restraint method like seat belts is quite low, both economically and in terms of weight and space, so the third test is passed. Therefore, adding seat belts to all vehicles is an excellent idea. Other "supplemental restraint systems", such as 316:
Voting was another initial method, as discussed above, with multiple redundant backups operating constantly and checking each other's results. For example, if four components reported an answer of 5 and one component reported an answer of 6, the other four would "vote" that the fifth component was
312:
In general, the early efforts at fault-tolerant designs were focused mainly on internal diagnosis, where a fault would indicate something was failing and a worker could replace it. SAPO, for instance, had a method by which faulty memory drums would emit a noise before failure. Later efforts showed
446:
An example of graceful degradation by design in an image with transparency. Each of the top two images is the result of viewing the composite image in a viewer that recognises transparency. The bottom two images are the result in a viewer with no support for transparency. Because the transparency
550:
Providing fault-tolerant design for every component is normally not an option. Associated redundancy brings a number of penalties: increase in weight, size, power consumption, cost, as well as time to design, verify, and test. Therefore, a number of choices have to be examined to determine which
881:
Redundancy is the provision of functional capabilities that would be unnecessary in a fault-free environment. This can consist of backup components that automatically "kick in" if one component fails. For example, large cargo trucks can lose a tire without any major consequences. They have many
667:
Research into the kinds of tolerances needed for critical systems involves a large amount of interdisciplinary work. The more complex the system, the more carefully all possible interactions have to be considered and prepared for. Considering the importance of high-value systems in transport,
582:. If the vehicle rolls over or undergoes severe g-forces, then this primary method of occupant restraint may fail. Restraining the occupants during such an accident is absolutely critical to safety, so the first test is passed. Accidents causing occupant ejection were quite common before 262:. This computer had a backup of memory arrays to use memory recovery methods and thus it was called the JPL Self-Testing-And-Repairing computer. It could detect its own errors and fix them or bring up redundant modules as needed. The computer is still working, as of early 2022. 451:
A highly fault-tolerant system might continue at the same level of performance even though one or more components have failed. For example, a building with a backup electrical generator will provide the same voltage to wall outlets even if the grid power fails.
486:
is another example, where web pages are available in a basic functional format for older, small-screen, or limited-capability web browsers, but in an enhanced version for browsers capable of handling additional technologies or that have a larger display.
788:
Bringing the replications into synchrony requires making their internal stored states the same. They can be started from a fixed initial state, such as the reset state. Alternatively, the internal state of one replica can be copied to another replica.
913:(to a less fault-tolerant design) the system may fail suddenly, making it appear that the new component B is the problem. Only after the system has been carefully scrutinized will it become clear that the root problem is actually with component A. 413:
to ignore new and unsupported HTML entities without causing the document to be unusable. Additionally, some sites, including popular platforms such as Twitter (until December 2020), provide an optional lightweight front end that does not rely on
647:
In addition, fault-tolerant systems are characterized in terms of both planned service outages and unplanned service outages. These are usually measured at the application level and not just at a hardware level. The figure of merit is called
517:
component is designed to report at the first point of failure, rather than generating reports when downstream components fail. This allows easier diagnosis of the underlying problem, and may prevent improper operation in a broken state.
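A fail-fast component can be sketched in a few lines of Python. The function `parse_percentage` is a hypothetical example, not taken from any particular system: it rejects a bad value at the point where it first enters, so the error message names the real cause instead of surfacing later in some downstream calculation.

```python
def parse_percentage(raw: str) -> float:
    """Convert text to a percentage, failing immediately on invalid input."""
    value = float(raw)  # raises ValueError right away if `raw` is not a number
    if not 0 <= value <= 100:
        # Report at the first point of failure rather than passing the bad value on.
        raise ValueError(f"percentage out of range: {value!r}")
    return value
```

The alternative, clamping or silently accepting the value, would keep the program running but would move the visible failure away from its origin.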
785:, with each gate of each replication making the same state transition on the same edge of the clock, and the clocks to the replications being exactly in phase. However, it is possible to build lockstep systems without this requirement. 320:
Historically, the trend has been to move away from N-model and toward M out of N, because of the complexity of systems and the difficulty of ensuring that the transition from the fault-negative to the fault-positive state does not disrupt operations.
463:, whether it functions at a reduced level or fails completely, does so in a way that protects people, property, or data from injury, damage, intrusion, or disclosure. In computers, a program might fail-safe by executing a 815:. The technique can be applied in different contexts. It can handle invalid memory reads by returning a manufactured value to the program, which in turn, makes use of the manufactured value and ignores the former 281:
enough during a fault to allow continued operation, while relying on constant human monitoring of computer output to detect faults. Again, IBM developed the first computer of this kind for NASA for guidance of
541:
is a condition when a single means for protection against hazard in equipment is defective or a single external abnormal condition is present, e.g. short circuit between the live parts and the applied part.
447:
mask (center bottom) is discarded, only the overlay (center top) remains. The image on the left has been designed to degrade gracefully, hence is still meaningful without its transparency information.
513:
will alert users that a component failure has occurred, even if it continues to operate with full performance, so that failure can be repaired or imminent complete failure anticipated. Likewise, a
533:
is defective. If a single fault condition results unavoidably in another single fault condition, the two failures are considered one single fault condition. A source offers the following example:
763:
fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replications of each element should be in the same state. The same inputs are provided to each
944:, for example, have so many redundant and fault-tolerant components that their weight is increased dramatically over uncrewed systems, which do not require the same level of safety. 934:, where operators tested the emergency backup cooling by disabling primary and secondary cooling. The backup failed, resulting in a core meltdown and massive release of radiation. 1804: 373:) before the backup also fails. It is helpful if the time between failures is as long as possible, but this is not specifically required in a fault-tolerant system. 1362:
Fault tolerant computing in computer design Neilforoshan, M.R Journal of Computing Sciences in Colleges archive Volume 18, Issue 4 (April 2003) Pages: 213 – 220,
826:
The approach has performance costs: because the technique rewrites code to insert dynamic checks for address validity, execution time will increase by 80% to 500%.
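The idea of returning a manufactured value for an invalid read can be illustrated with a small Python sketch. The class `FailureObliviousBuffer` and its `manufactured` parameter are hypothetical names for this example only; the actual technique instruments compiled programs rather than wrapping a list, and the choice of manufactured value here (zero by default) is an assumption.

```python
class FailureObliviousBuffer:
    """Read-only buffer that tolerates invalid indices instead of raising."""

    def __init__(self, data, manufactured=0):
        self._data = list(data)
        self._manufactured = manufactured  # value handed back for invalid reads

    def read(self, index):
        # Dynamic validity check on every access; this is where the
        # execution-time overhead of the technique comes from.
        if 0 <= index < len(self._data):
            return self._data[index]
        # Invalid read: return a manufactured value so the caller can continue.
        return self._manufactured
```

A conventional bounds checker would raise an error or abort at the point marked "invalid read"; here the caller simply continues with the manufactured value.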
767:, and the same outputs are expected. The outputs of the replications are compared using a voting circuit. A machine with two replications of each element is termed 355:
Hardware fault tolerance sometimes requires that broken parts be taken out and replaced with new parts while the system is still operational (in computing known as
626: 940:
Both fault-tolerant components and redundant components tend to increase cost. This can be a purely economic cost or can include other measures, such as weight.
1848: 771:(DMR). The voting circuit can then only detect a mismatch and recovery relies on other methods. A machine with three replications of each element is termed 503: 1891: 570:
Requiring a redundant car engine, for example, would likely be too expensive, both economically and in terms of weight and space, to be considered.
467:(as opposed to an uncontrolled crash) to prevent data corruption after an error occurs. A similar distinction is made between "failing well" and " 194: 1717:, Computer network security: Fourth International Conference on Mathematical Methods, Models, and Architectures for Computer Network Security, 842: 906:
of pressure and notifies the driver. The alternative is a "manual fault-detection system", such as manually inspecting all tires at each stop.
209:). Several other machines were developed along this line, mostly for military use. Eventually, they separated into three distinct categories: 1765: 1726: 1688: 1633: 1596: 1576: 1746:
Long, Fan; Sidiroglou-Douskos, Stelios; Rinard, Martin (2014). "Automatic Runtime Error Repair and Containment via Recovery Shepherding".
1294: 1247: 1195: 250:
Most of the development in the so-called LLNM (Long Life, No Maintenance) computing was done by NASA during the 1960s, in preparation for
154:
termed 'resilient'. In resilience, the system adapts to the error, maintaining service but acknowledging a certain impact on performance.
274: 1085: 782: 1841: 1133: 1060: 121: 736:
implementations of the same specification, and using them like replicated systems to cope with errors in a specific implementation.
1718: 725:: Providing multiple identical instances of the same system and switching to one of the remaining instances in case of a failure ( 2255: 641: 55: 1151:"The STAR (Self-Testing And Repairing) Computer: An Investigation Of the Theory and Practice Of Fault-tolerant Computer Design" 1528: 102: 2008: 1901: 1876: 1020: 864: 858: 708: 395: 59: 74: 138:
to maintain proper operation despite failures or faults in one or more of its components. This capability is essential for
1967: 1881: 1075: 491: 332: 2270: 2265: 2229: 1834: 711:: Providing multiple identical instances of the same system or subsystem, directing tasks or requests to all of them in 81: 2044: 1957: 1710: 1404: 1906: 1429: 1080: 672:
and the military, the field of topics that touch on research is very wide: it can include such obvious subjects as
381: 494:
are designed to continue operation despite an error, exception, or invalid input, instead of crashing completely.
238:
Computers with a high amount of runtime that would be under heavy use, such as many of the supercomputers used by
48: 2275: 2260: 2054: 2039: 1665:, Lecture Notes in Computer Science, vol. 11058, Cham: Springer International Publishing, pp. 376–390, 1379: 985: 930:, there is no easy way to verify that the backup components are functional. The most infamous example of this is 772: 366: 206: 88: 618:– If a system experiences a failure, it must continue to operate without interruption during the repair process. 227:
Computers that were very dependable but required constant monitoring, such as those used to monitor and control
2250: 2069: 1947: 1942: 1886: 876: 839: 722: 637:
or other mechanisms that isolate a rogue transmitter or failing component to protect the system are required.
1916: 1911: 1150: 1065: 990: 768: 764: 615: 483: 70: 2191: 1896: 846: 442: 423: 419: 347:"M2 Mobile Web", the original mobile web front end of Twitter, later served as fallback legacy version to 1149:
Algirdas Avižienis; George C. Gilley; Francis P. Mathur; David A. Rennels; John A. Rohr; David K. Rubin.
506:
are likewise expected to prevent complete failure in situations like earthquakes, floods, or collisions.
2125: 2095: 1871: 1200: 1045: 1025: 1010: 564:
Some components, like the drive shaft in a car, are not likely to fail, so no fault tolerance is needed.
406: 960:
There is a difference between fault tolerance and systems that rarely have problems. For instance, the
2216: 2186: 1070: 778: 760: 634: 526: 510: 499: 495: 328: 147: 574:
An example of a component that passes all the tests is a car's occupant restraint system. While the
2196: 2181: 2150: 1857: 1090: 1055: 1015: 749: 370: 239: 228: 169:
issues. Non-computing examples include structures that retain their integrity despite damage from
2206: 2201: 1952: 1921: 1771: 1666: 1639: 1508: 1332: 1275: 1225: 931: 712: 689: 348: 331:
were among the first companies specializing in the design of fault-tolerant computer systems for
306: 170: 1748:
Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation
1367: 704:
Spare components address the first fundamental characteristic of fault tolerance in three ways:
380:
built their entire business on such machines, which used single-point tolerance to create their
277:
in the United States. These entities needed computers with massive amounts of uptime that would
1807:", in Automata Studies, eds. C. Shannon and J. McCarthy, Princeton University Press, pp. 43–98 1350: 2160: 2100: 1817: 1794:", Proceedings of 15th International Symposium on Fault-Tolerant Computing (FTSC-15), pp. 2–11 1761: 1722: 1684: 1629: 1592: 1572: 1476: 1363: 1324: 1267: 1217: 1129: 1040: 941: 808: 693: 673: 255: 162: 139: 971:. But when a fault did occur they still stopped operating completely, and therefore were not 95: 2165: 2135: 2064: 2059: 1998: 1983: 1751: 1676: 1621: 1468: 1314: 1306: 1259: 1209: 1005: 995: 961: 883: 669: 514: 377: 324: 295: 278: 143: 369:
should be long enough for the operators to have sufficient time to fix the broken devices (
343: 2211: 2110: 2105: 2049: 2024: 1000: 964: 927: 816: 753: 677: 621: 259: 186: 158: 898:
Fault-tolerant design's advantages are obvious, while many of its disadvantages are not:
558:
In a car, the radio is not critical, so this component has less need for fault tolerance.
2140: 2130: 2085: 2029: 1926: 1791: 1167: 251: 213:
Machines that would last a long time without any maintenance, such as the ones used on
190: 2280: 2244: 2120: 2115: 2090: 1962: 1659:"Context-Aware Failure-Oblivious Computing as a Means of Preventing Buffer Overflows" 1351:"The F14A Central Air Data Computer, and the LSI Technology State-of-the-Art in 1968" 1279: 1243: 1191: 1118: 1050: 468: 464: 270: 17: 1336: 1229: 317:
faulty and have it taken out of service. This is called M out of N majority voting.
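The voting example above can be sketched in Python as follows. The function name `majority_vote` and its return shape are illustrative assumptions; a hardware voter compares signals rather than Python values, and the answers here must be hashable.

```python
from collections import Counter


def majority_vote(answers):
    """Return the majority answer and the indices of components that disagreed."""
    if not answers:
        raise ValueError("no answers to vote on")
    winner, count = Counter(answers).most_common(1)[0]
    if count <= len(answers) // 2:
        # Without a strict majority the voter cannot tell which side is faulty.
        raise RuntimeError("no strict majority")
    faulty = [i for i, answer in enumerate(answers) if answer != winner]
    return winner, faulty
```

With four components reporting 5 and one reporting 6, `majority_vote([5, 5, 6, 5, 5])` returns `(5, [2])`, identifying the third component as the one to take out of service.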
2155: 1805:
Probabilistic Logics and Synthesis of Reliable Organisms from Unreliable Components
1776: 1643: 1538: 1319: 685: 649: 431: 357: 1658: 1560:
Essentials of Equipment in Anaesthesia, Critical Care, and Peri-Operative Medicine
967:
systems had failure rates of two hours per forty years, and therefore were highly
1680: 1434: 1125: 1029: 410: 365:
and represents the vast majority of fault-tolerant systems. In such systems the
243: 217: 198: 37: 1618:
2012 Seventh International Conference on Availability, Reliability and Security
1168:"Voyager Mission state (more often than not at least three months out of date)" 394:
architectures may encompass also the computer software, for example by process
2145: 1456: 820: 681: 415: 202: 1480: 1328: 1271: 1221: 1756: 1035: 812: 583: 456: 391: 351:
without JavaScript support and/or incompatible browsers until December 2020.
221: 174: 1613: 1310: 1263: 1213: 1625: 726: 659:
Fault-tolerant systems are typically based on the concept of redundancy.
629:(NIST) categorizes faults based on locality, cause, duration, and effect. 427: 302: 283: 266: 232: 166: 1713:, in Gorodetski, Vladimir I.; Kotenko, Igor; Skormin, Victor A. (eds.), 2003: 1295:"Operating System Structures to Support Security and Reliable Software" 653: 579: 1826: 1472: 867:
is a technique to avoid catastrophic failures in distributed systems.
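The text does not describe how a circuit breaker works, so the following Python sketch is based on the commonly described form of the pattern and should be read as an assumption: after a number of consecutive failures the breaker "opens" and rejects calls outright for a cool-down period, then lets one trial call through. The names `CircuitBreaker`, `max_failures`, `reset_after`, and `clock` are all hypothetical.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency for a while after repeated errors."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after   # cool-down in seconds
        self._clock = clock              # injectable so the sketch can be tested
        self._failures = 0
        self._opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise RuntimeError("circuit open: call rejected")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0               # success closes the circuit again
        self._opened_at = None
        return result
```

Rejecting calls quickly while the circuit is open keeps callers from piling up behind a dependency that is already failing, which is how the pattern limits the spread of a failure.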
502:
continue to transmit data despite the failure of some links or nodes.
1993: 1507:
Thampi, Sabu M. (2009-11-23). "Introduction to Distributed Systems".
716: 587: 530: 385: 291: 135: 1589:
Reliability evaluation of some fault-tolerant computer architectures
1430:"Legacy Twitter Shutdown Means You Can't Tweet From The 3DS Anymore" 1671: 1792:
Dependable Computing and Fault Tolerance: Concepts and Terminology
1513: 441: 342: 1820:", IEEE Transactions on Computers, vol. 25, no. 12, pp. 1304–1312 1457:"Evaluation and comparison of fault-tolerant software techniques" 490:
In fault-tolerant computer systems, programs that are considered
2034: 1533: 745: 741: 590:, are more expensive and so pass that test by a smaller margin. 402: 376:
Fault tolerance is notably successful in computer applications.
287: 214: 1830: 1657:
Rigger, Manuel; Pekarek, Daniel; Mössenböck, Hanspeter (2018),
823:, which inform the program of the error or abort the program. 361:). Such a system implemented with a single backup is known as 31: 1571:
Dubrova, E. (2013). "Fault-Tolerant Design", Springer, 2013,
1455:
Hudak, J.J.; Suh, B.-H.; Siewiorek, D.P.; Segall, Z. (1993).
301:
In the 1970s, much work happened in the field. For instance,
254:
and other research aspects. NASA's first machine went into a
578:
occupant restraint system is not normally thought of, it is
258:, and their second attempt, the JSTAR computer, was used in 1116:
Daniel P. Siewiorek; C. Gordon Bell; Allen Newell (1982).
656:
system would statistically provide 99.999% availability.
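An availability percentage translates directly into expected downtime. The short Python sketch below does the arithmetic; the function name is hypothetical, and the 365.25-day year is an assumption made for the calculation.

```python
def downtime_minutes_per_year(availability_percent, days_per_year=365.25):
    """Expected minutes of downtime per year for a given availability percentage."""
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * days_per_year * 24 * 60
```

For 99.999% availability this gives roughly 5.26 minutes of downtime per year, while 99.9% allows roughly 526 minutes, or almost nine hours.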
568:
How expensive is it to make the component fault tolerant?
401:
Data formats may also be designed to degrade gracefully.
161:, ensuring the overall system remains functional despite 903:
Interference with fault detection in the same component.
1614:"Oblivious and Fair Server-Aided Two-Party Computation" 926:
For certain critical fault-tolerant systems, such as a
910:
Interference with fault detection in another component.
1750:. PLDI '14'. New York, NY, US: ACM. pp. 227–238. 819:
value it tried to access. This is in sharp contrast to
610:
The basic characteristics of fault tolerance require:
1111: 1109: 1107: 27:
Resilience of systems to component failures or errors
715:, and choosing the correct result on the basis of a 265:
Hyper-dependable computers were pioneered mostly by
2174: 2078: 2017: 1976: 1935: 1864: 951:
worse than, a comparable non-fault-tolerant system.
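A small calculation shows how this can happen. The sketch assumes the simplest textbook model, in which redundant copies fail independently and the system works as long as at least one copy works; the function name and the reliability figures are invented for illustration and do not come from any measured system.

```python
def parallel_reliability(component_reliability, copies):
    """Reliability of `copies` independent redundant components in parallel."""
    # The system fails only if every copy fails.
    return 1 - (1 - component_reliability) ** copies
```

Under these assumptions, two inferior components that are each 70% reliable give a combined reliability of about 0.91, which is worse than a single good component at 0.95. A third inferior copy raises the figure to about 0.973, but at additional cost and weight.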
781:fault-tolerant machines are most easily made fully 62:. Unsourced material may be challenged and removed. 1495:Operating Systems. Internals and Design Principles 1117: 748:, except RAID 0, are examples of a fault-tolerant 1405:"Why your website should work without JavaScript" 652:and is expressed as a percentage. For example, a 1196:"Reliability Issues in Computing System Design" 535: 1711:"Characterizing Software Self-Healing Systems" 627:National Institute of Standards and Technology 201:connected via relays, with a voting method of 1842: 8: 1715:Characterizing Software Self-Healing Systems 1120:Computer Structures: Principles and Examples 185:The first known fault-tolerant computer was 1194:; Lee, P.A.; Treleaven, P. C. (June 1978). 1849: 1835: 1827: 917:Reduction of priority of fault correction. 1775: 1755: 1670: 1512: 1318: 122:Learn how and when to remove this message 1892:Earth systems engineering and management 434:with limited web browsing capabilities. 1103: 1741: 1739: 1737: 1612:Herzberg, Amir; Shulman, Haya (2012). 504:Resilient buildings and infrastructure 551:components should be fault tolerant: 157:Typically, fault tolerance describes 7: 1293:Theodore A. Linden (December 1976). 746:redundant array of independent disks 625:the fault or failing component. The 562:How likely is the component to fail? 134:Fault tolerance is the ability of a 60:adding citations to reliable sources 525:is a situation where one means for 1591:. Springer-Verlag. November 1980. 1380:"History of TANDEM COMPUTERS, INC" 1248:"Fault tolerant operating systems" 1086:Self-management (computer science) 835:not need to recompile to program. 25: 1927:Sociocultural Systems Engineering 1558:Baha Al-Shaikh, Simon G. Stacey, 1061:List of system quality attributes 1461:IEEE Transactions on Reliability 1428:Fairfax, Zackerie (2020-11-28). 688:, formal or exclusionary logic, 498:is the opposite of robustness. 


Text is available under the Creative Commons Attribution-ShareAlike License. Additional terms may apply.