High-Availability (HA), Fault Tolerance (FT) & Disaster Recovery (DR)

Ref: https://learn.cantrill.io/courses/1820301/lectures/41301631 [SAA-C03]

High-Availability (HA)

🔧 A highly available (HA) system aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period
- In other words, HA is about optimizing a system's online time (uptime)
  - Upon failure, components can be replaced/fixed quickly (often with automation)
- Incurs additional costs compared to normal systems
  - Requires designing in a certain level of automation
  - Might require redundancy
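
The "agreed level of uptime" is often written as a percentage. As a rough worked example (not from the course material), the sketch below converts an uptime target into a yearly downtime budget. The function name `allowed_downtime_minutes` and the list of targets are assumptions chosen purely for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Illustrative targets only; real agreements define their own numbers.
for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% uptime -> {minutes:.1f} min/year of allowed downtime")
```

Each extra "nine" shrinks the budget by a factor of ten, which is part of why higher availability tends to need more automation and redundancy.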
Examples:
- Have a spare physical server ready so if the main one fails, the spare can be used
- Automatic failovers: substituting a failed server/instance with a standby/replica
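
A minimal sketch of the automatic failover idea, assuming a hypothetical active/standby pair (the `Server` and `FailoverPair` names are made up for illustration and do not correspond to any AWS API). Note that the request arriving during the failure is lost: the outage is short, but it is not avoided.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool = True


class FailoverPair:
    """Hypothetical active/standby pair: only one server serves traffic at a time."""

    def __init__(self, primary: Server, standby: Server) -> None:
        self.active = primary
        self.standby = standby
        self.failovers = 0

    def handle(self, request: str) -> str:
        if not self.active.healthy:
            # This request is lost: the user sees a brief disruption.
            self._fail_over()
            raise ConnectionError("service briefly unavailable, please retry")
        return f"{self.active.name} handled {request}"

    def _fail_over(self) -> None:
        # Promote the standby; the failed server becomes the standby
        # until it is repaired or replaced.
        self.active, self.standby = self.standby, self.active
        self.failovers += 1


pair = FailoverPair(Server("primary"), Server("standby"))
print(pair.handle("req-1"))

pair.active.healthy = False  # simulate a failure of the active server
try:
    pair.handle("req-2")
except ConnectionError as exc:
    print("short outage:", exc)

print(pair.handle("req-3"))  # now served by the promoted standby
```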
‼️ HA is NOT about preventing failures, guaranteeing good UX, or avoiding disruption/outages!!
- 🔧 HA systems have outages, but they are shorter than in normal systems
- User disruption is fine as long as the system gets back online quickly
  - An HA system can be offline for a bit while components are replaced after failure
  - Users may have to log in again
💡 TLDR: HA is about fast & automatic recovery from failures
- RL example: carry an extra spare tyre in your 4x4 vehicle while going through the desert
  - Changing a flat tyre is a disruption, but it's much better than having to call someone to come with an extra tyre and find you in the middle of the desert!
HA Summary Diagram

Fault Tolerance (FT)

🔧 A fault-tolerant (FT) system can continue to operate properly even if some of its components fail
- One or more system components can be faulty at any given time
  - Proper operation should continue while the faults are being fixed
  - System must stop using faulty components and reroute traffic automatically
- ‼️ Much more than HA!! System needs to tolerate failure → much more expensive too!
  - There can be NO downtime/outages
  - Needs more redundancy than HA
Example: Heart monitoring system in a hospital (2x monitors connected to 2 servers)
- There can be no downtime at all, or else the patient dies!
💡 RL example: An airplane with redundant engines and electronics
- The airplane must keep flying even if an engine or some of the electronics fail
- Can't repair the airplane while flying
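
A minimal sketch of the FT idea, assuming a hypothetical service backed by redundant replicas (the `Replica` and `FaultTolerantService` names are illustrative only). Unlike a failover that drops requests during the switch, the caller never sees an error as long as at least one replica is healthy: faulty replicas are skipped and the request is rerouted automatically.

```python
class Replica:
    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def process(self, request: str) -> str:
        if not self.healthy:
            raise RuntimeError(f"{self.name} is faulty")
        return f"{self.name} processed {request}"


class FaultTolerantService:
    """Hypothetical service that routes around faulty replicas."""

    def __init__(self, replicas: list) -> None:
        self.replicas = list(replicas)

    def handle(self, request: str) -> str:
        for replica in self.replicas:
            try:
                return replica.process(request)
            except RuntimeError:
                # Skip the faulty replica and reroute to the next one.
                continue
        # Only reached when every redundant component has failed.
        raise RuntimeError("all replicas failed: redundancy exhausted")


monitor_a = Replica("monitor-a")
monitor_b = Replica("monitor-b")
service = FaultTolerantService([monitor_a, monitor_b])

print(service.handle("heartbeat-1"))
monitor_a.healthy = False  # a component fails while the system is in use
print(service.handle("heartbeat-2"))  # still answered, with no visible outage
```

The cost shows up in the constructor: every replica has to be provisioned and kept running all the time, which is why FT needs more redundancy (and money) than HA.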
FT Summary Diagram

Disaster Recovery (DR)

🔧 Disaster Recovery (DR) = set of policies, tools & procedures to enable the recovery or continuation of vital technology, infrastructure and systems following a natural or human-induced disaster
- 💡 In essence: what to do if HA & FT fail?
- Plan what to do if/when a disaster occurs that knocks out a system
  - Pre-planning: what happens BEFORE disaster
  - DR process: what happens AFTER disaster
- Nowadays very automated → minimizes human errors
- Business continuity (BC) focuses on keeping business operations running during a disaster, while DR focuses on restoring data access and IT infrastructure after a disaster
Usually in disasters there's shock, stress & fear → leads to bad decisions → DR is critical
- Plan documentation, staff, training, etc. in advance, BEFORE disaster hits!