Definition of Fault Tolerance in Network Encyclopedia.
What is Fault Tolerance?
Fault tolerance is any mechanism or technology that allows a computer or operating system to recover from a failure. In fault-tolerant systems, data remains available when one component of the system fails.
Here are some examples of fault-tolerant systems:
- Transactional log files that protect the Microsoft Windows registry and allow recovery of hives
- RAID 5 disk systems that protect against data loss
- Uninterruptible power supplies (UPS) that protect the system against primary power failure
A fault-tolerant design enables a system to continue its intended operation, possibly at a reduced level, rather than failing completely when some part of the system fails. The term is most commonly used to describe computer systems designed to remain more or less fully operational, perhaps with a reduction in throughput or an increase in response time, in the event of a partial failure. That is, the system as a whole is not stopped by problems in either the hardware or the software. Examples in other fields are a motor vehicle designed to remain drivable if one of its tires is punctured, and a structure that retains its integrity in the presence of damage from causes such as fatigue, corrosion, manufacturing flaws, or impact.
Redundancy is the provision of functional capabilities that would be unnecessary in a fault-free environment. This can consist of backup components that automatically “kick in” if one component fails. For example, large cargo trucks can lose a tire without any major consequences. They have many tires, and no single tire is critical. The exception is the front tires, which are used to steer, but these generally carry less load, each and in total, than the other four to 16 tires, so they are less likely to fail. The idea of incorporating redundancy in order to improve the reliability of a system was pioneered by John von Neumann in the 1950s.
Two kinds of redundancy are possible:
- Space redundancy: provides additional components, functions, or data items that are unnecessary for fault-free operation. Space redundancy is further classified into hardware, software and information redundancy, depending on the type of redundant resources added to the system.
- Time redundancy: the computation or data transmission is repeated and the result is compared to a stored copy of the previous result. This kind of testing is currently referred to as In Service Fault Tolerance Testing (ISFTT).
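As a rough illustration of time redundancy, the sketch below repeats a computation and compares each result to a stored copy of the previous one, accepting a value only when two consecutive runs agree. The names `run_with_time_redundancy` and `make_flaky_adder` are hypothetical, and the "transient fault" is simulated; this is a minimal sketch, not a description of any particular product.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_time_redundancy(compute: Callable[[], T], max_attempts: int = 3) -> T:
    """Repeat a computation until two consecutive runs return the same result."""
    previous = compute()  # stored copy of the previous result
    for _ in range(max_attempts - 1):
        current = compute()
        if current == previous:
            return current  # two consecutive runs agree
        previous = current  # disagreement: keep the newer result and retry
    raise RuntimeError("results never agreed; the fault may be permanent")


def make_flaky_adder(a: int, b: int) -> Callable[[], int]:
    """Build an adder whose first call is corrupted (simulated transient fault)."""
    calls = {"count": 0}

    def add() -> int:
        calls["count"] += 1
        # Only the first call returns a wrong value.
        return a + b + 1 if calls["count"] == 1 else a + b

    return add


result = run_with_time_redundancy(make_flaky_adder(2, 3))
print(result)
```

The cost of this approach is extra time rather than extra hardware. It can mask a transient fault, but a fault that corrupts every run in the same way would go unnoticed.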
Spare components help eliminate single points of failure in three ways:
- Replication: Providing multiple identical instances of the same system or subsystem, directing tasks or requests to all of them in parallel, and choosing the correct result on the basis of a quorum;
- Redundancy: Providing multiple identical instances of the same system and switching to one of the remaining instances in case of a failure (failover);
- Diversity: Providing multiple different implementations of the same specification, and using them like replicated systems to cope with errors in a specific implementation.
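The three approaches above can be sketched in a few lines. In this minimal example each "instance" is just a Python function; `quorum_result`, `failover_result`, and the sample replicas are hypothetical names chosen for illustration. Replication sends the request to every replica and takes the majority answer, failover tries instances in order until one responds, and diversity is represented by two different implementations of the same squaring specification.

```python
from collections import Counter
from typing import Callable, Sequence

Service = Callable[[int], int]


def quorum_result(replicas: Sequence[Service], request: int) -> int:
    """Replication: ask every replica and return the strict-majority answer."""
    answers = []
    for replica in replicas:
        try:
            answers.append(replica(request))
        except Exception:
            continue  # a crashed replica casts no vote
    if not answers:
        raise RuntimeError("no replica responded")
    value, votes = Counter(answers).most_common(1)[0]
    if votes <= len(replicas) // 2:
        raise RuntimeError("no majority among replicas")
    return value


def failover_result(instances: Sequence[Service], request: int) -> int:
    """Redundancy: return the answer of the first instance that responds."""
    for instance in instances:
        try:
            return instance(request)
        except Exception:
            continue  # switch to the next remaining instance
    raise RuntimeError("all instances failed")


def square(x: int) -> int:
    return x * x


def square_via_pow(x: int) -> int:
    return x ** 2  # diversity: a different implementation of the same spec


def faulty_square(x: int) -> int:
    return x * x + 1  # simulated wrong answer


def crashed(x: int) -> int:
    raise ConnectionError("instance is down")  # simulated outage


voted = quorum_result([square, square_via_pow, faulty_square], 4)
failed_over = failover_result([crashed, square], 4)
print(voted, failed_over)
```

Voting can mask a replica that returns a wrong answer, at the cost of running every replica for every request. Failover is cheaper, but it only helps when a failure is detectable, for example as a crash or timeout.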
All implementations of RAID (redundant array of independent disks) except RAID 0 are examples of fault-tolerant storage devices that use data redundancy.
Just because your system is fault-tolerant doesn’t mean you are fully prepared for disaster. You still need to perform regular backups of important data. For example, a RAID 5 disk system will protect against data loss if one disk drive fails, but not if two or more drives fail simultaneously.
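The single-drive limit of RAID 5 follows from how parity works. The sketch below is a simplified model that assumes three data blocks and one XOR parity block for a single stripe; real RAID 5 arrays rotate parity across the drives, and the helper name `xor_blocks` is hypothetical. It shows that any one missing block can be rebuilt from the survivors, while two missing blocks cannot.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length byte blocks together, one byte position at a time."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe: three data blocks plus a parity block (simplified model).
data = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data)

# One drive lost (the one holding data[1]): XOR of the survivors rebuilds it.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])

# Two drives lost (data[1] and data[2]): the survivors only yield
# data[1] XOR data[2], which identifies neither missing block.
leftover = xor_blocks([data[0], parity])
print(leftover == data[1], leftover == data[2])
```

This is why backups remain necessary: a single parity block carries enough information to recover exactly one missing block per stripe.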