Ref: https://learn.cantrill.io/courses/1820301/lectures/41301631 [SAA-C03]
High-Availability (HA)
- 🔧 A highly available (HA) system aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period
- In other words, HA is about optimizing a system's online time (uptime)
- Upon failure, components can be replaced/fixed quickly (often with automation)
- Incurs additional costs compared to normal systems
- Requires a design with a certain level of automation
- Might require redundancy
- Examples:
- Have a spare physical server ready so if the main one fails, the spare can be used
- Automatic failovers: substituting a failed server/instance with a standby/replica
- ‼️ HA is NOT about stopping failures, providing good UX, or preventing disruption/outages!!
- 🔧 HA systems have outages, but they are shorter than in normal systems
- User disruption is fine if the system can get back online faster
- An HA system can be offline for a bit while components are replaced after failure
- Users may have to log in again
- 💡 TLDR: HA is about fast & automatic recovery from issues
- RL example: carry a spare tyre in your 4x4 vehicle while going through the desert
- Changing a flat tyre is a disruption, but it's much better than having to call someone to come with an extra tyre and find you in the middle of the desert!
- HA Summary Diagram
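The automatic failover example above can be sketched as a small simulation. This is a minimal sketch, not a real failover mechanism: the `Server` class, the tick-based clock, and the numbers (`fail_at`, `failover_ticks`) are all hypothetical and chosen only to show that an HA system still has an outage, just a much shorter one.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    healthy: bool = True


def simulate_failover(total_ticks: int, fail_at: int, failover_ticks: int) -> list[bool]:
    """Return one bool per tick: True if the request in that tick was served.

    The primary fails at tick `fail_at`. Promoting the standby takes
    `failover_ticks` ticks, and requests during that window are lost.
    (All values are illustrative assumptions.)
    """
    primary = Server("primary")
    standby = Server("standby")
    active = primary
    promoted_at = None  # tick at which the standby takes over
    served = []
    for tick in range(total_ticks):
        if tick == fail_at:
            primary.healthy = False
            promoted_at = tick + failover_ticks
        if not active.healthy and promoted_at is not None and tick >= promoted_at:
            active = standby  # failover: the standby replaces the failed primary
        served.append(active.healthy)
    return served


def availability(served: list[bool]) -> float:
    """Percentage of ticks in which the system was online."""
    return 100.0 * sum(served) / len(served)


if __name__ == "__main__":
    automatic = simulate_failover(total_ticks=1000, fail_at=400, failover_ticks=5)
    manual = simulate_failover(total_ticks=1000, fail_at=400, failover_ticks=200)
    print(f"automatic failover: {availability(automatic):.1f}% uptime")
    print(f"manual replacement: {availability(manual):.1f}% uptime")
```

Both runs have an outage. The only difference is how long recovery takes, which is exactly what HA optimizes.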
Fault-Tolerance (FT)
- 🔧 An FT system can continue to operate properly even if some of its components fail
- Components can contain one or more faults
- Proper operation must continue while the faults are being fixed
- System must stop using faulty components and reroute traffic automatically
- ‼️ Much more than HA!! System needs to tolerate failure → much more expensive too!
- There can be NO downtime/outages
- Needs more redundancy than HA
- Example: Heart monitoring system in a hospital (2x monitors connected to 2 servers)
- There can be no downtime at all, because losing monitoring could put the patient's life at risk!
- 💡 RL example: An airplane with redundant engines and electronics
- Airplane must keep flying even if an engine or an electronic system fails
- Can't repair the airplane while flying
- FT Summary Diagram
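The "stop using faulty components and reroute traffic automatically" idea can be sketched as follows. This is a simplified illustration under stated assumptions: `ComponentFailure`, `fault_tolerant_call`, and the two server functions are hypothetical names, and a real FT system would typically run its redundant components in parallel rather than trying them one after another.

```python
from typing import Callable, Sequence


class ComponentFailure(Exception):
    """Raised by a (hypothetical) component that has failed."""


def fault_tolerant_call(components: Sequence[Callable[[str], str]], request: str) -> str:
    """Try each redundant component in turn and return the first good answer."""
    errors = []
    for component in components:
        try:
            return component(request)
        except ComponentFailure as exc:
            errors.append(exc)  # skip the faulty component and reroute
    # Only reached when every redundant component has failed.
    raise RuntimeError(f"all {len(components)} redundant components failed: {errors}")


def healthy_server(request: str) -> str:
    return f"reading for {request}"


def broken_server(request: str) -> str:
    raise ComponentFailure("server A is down")


if __name__ == "__main__":
    # Server A has failed, but the caller still gets a normal answer from server B.
    print(fault_tolerant_call([broken_server, healthy_server], "patient-42"))
```

Unlike the HA case, the caller never sees the failure. The price is the extra redundancy needed to always have a working component available.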
Disaster Recovery (DR)
- 🔧 Disaster Recovery (DR) = set of policies, tools & procedures to enable the recovery or continuation of vital technology, infrastructure and systems following a natural or human-induced disaster
- 💡 In essence: what to do if HA & FT fail?
- Plan what to do if/when a disaster occurs that knocks out a system
- Pre-planning: what happens BEFORE disaster
- DR process: what happens AFTER disaster
- Nowadays very automated → minimizes human errors
- Business continuity (BC) focuses on keeping business operations running during a disaster, while DR focuses on restoring data access and IT infrastructure after a disaster
- Usually in disasters there's shock, stress & fear → leads to bad decisions → DR is critical
- Plan documentation, staff, training, etc. in advance, before disaster hits!