Recovery Point Objective (RPO) & Recovery Time Objective (RTO)

Ref: https://learn.cantrill.io/courses/2022818/lectures/45637149

YouTube: https://www.youtube.com/watch?v=KesoHnsZWoA

Recovery Point Objective (RPO)

🔧 Two definitions:
1. Maximum allowable age of the last backup in the event of a disaster or outage
2. Maximum amount of data (measured in time) that can be lost during a DR situation before that loss will exceed what the organization can tolerate
‼️ RPO is measured in time (mins/hours)!!!
Understanding how data can be lost is key to implementing RPO requirements
Recovery points = successful backups
Backups should happen at least as often as the RPO…
- ❗ …but a failed backup can double data loss!! → hence more frequency is advisable
  - 💡 e.g. A company's RPO is 30 mins, and backups happen every 25 mins. Imagine that the latest successful backup was at 9:00, and the one scheduled at 9:25 fails. At 9:40 there's a disaster and all data is lost… The latest successful backup is 40 mins old, which exceeds the RPO! → To tolerate one failed backup, run backups at least every 30/2 = 15 mins instead!
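
A minimal sketch of the example above, to make the arithmetic explicit. The function name `data_loss_at`, the 8:35 backup, and the calendar date are illustrative assumptions; only the 9:00, 9:25 and 9:40 times and the 30-min RPO come from the example.

```python
from datetime import datetime, timedelta


def data_loss_at(disaster, backups):
    """Return the data loss (as a timedelta) if a disaster strikes at `disaster`.

    `backups` is a list of (timestamp, succeeded) pairs. Only successful
    backups taken at or before the disaster count as recovery points.
    """
    recovery_points = [t for t, ok in backups if ok and t <= disaster]
    if not recovery_points:
        raise ValueError("no successful backup before the disaster")
    return disaster - max(recovery_points)


day = datetime(2024, 1, 1)  # arbitrary date (assumption); only the times matter
backups = [
    (day.replace(hour=8, minute=35), True),   # assumed earlier backup
    (day.replace(hour=9, minute=0), True),    # latest successful backup
    (day.replace(hour=9, minute=25), False),  # the failed backup
]
rpo = timedelta(minutes=30)

loss = data_loss_at(day.replace(hour=9, minute=40), backups)
print(loss, "exceeds RPO" if loss > rpo else "within RPO")  # 0:40:00 exceeds RPO
```

The failed 9:25 run is ignored as a recovery point, which is why the loss is measured from 9:00 and not from 9:25.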
Lower RPO = more frequent backups = higher cost → Trade-off
- Aim for Goldilocks approach → “as close to true business requirements as possible”
  - 💡 In the example above, backups every 5 mins are too costly and backups every 25 mins are too risky → backups every 15 mins are probably a good solution
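
The trade-off can be sketched as a small calculation. This is a rough model, assuming the worst case is a disaster striking just before the next backup would have run. The helper names and the candidate intervals are hypothetical, not from the lecture.

```python
def max_backup_interval(rpo_minutes, tolerated_failures=1):
    """Longest interval that still meets the RPO if `tolerated_failures`
    consecutive backups fail (RPO 30 mins, 1 failure -> 30 / 2 = 15 mins)."""
    return rpo_minutes / (tolerated_failures + 1)


def worst_case_loss(interval_minutes, consecutive_failures=0):
    """Upper bound on data loss: each failed backup adds one more interval."""
    return interval_minutes * (consecutive_failures + 1)


rpo = 30
for interval in (5, 15, 25):
    loss = worst_case_loss(interval, consecutive_failures=1)
    verdict = "OK" if loss <= rpo else "RPO breached"
    print(f"every {interval} mins -> up to {loss} mins lost after one failure: {verdict}")

print("largest safe interval:", max_backup_interval(rpo), "mins")
```

Cost is not modelled here; the 5-min schedule passes the check but runs three times as many backups as the 15-min one.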

Recovery Time Objective (RTO)

🔧 Two definitions:
1. Maximum tolerable duration of a service outage
2. Maximum tolerable length of time it takes to restore services to normal after an outage
More critical systems need a lower RTO
‼️ Recovery time…:
- …Begins at moment of failure!
- …Ends when system is operational and handed back to the business!!
ALL OF THIS is included in recovery time:
1. (Most important step) Must identify that the system has failed!
  - Monitoring and notification
    - 💡 Is there monitoring? Is it reliable? How does it notify staff? etc.
  - ❗ Even in the best case scenario, identifying system failure already takes some time!
2. Investigate the issue
3. Decide how to restore:
  1. Backup to restore: What type? Who restores it? Where? How? Documentation?…
  2. Where to restore: Do we have a spare server? Secondary site?…
  - 💡 can be stressful, especially at night
4. Finally: after restoration is in place → Business testing, user testing, final handover
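
A rough sketch of how the steps above add up to the recovery time. All durations and the 4-hour RTO are made-up numbers for illustration; the phase names simply mirror the steps listed.

```python
from datetime import timedelta

# Hypothetical durations for each step, counted from the moment of failure.
phases = {
    "identify the failure": timedelta(minutes=15),
    "investigate the issue": timedelta(minutes=30),
    "decide how to restore": timedelta(minutes=20),
    "restore": timedelta(minutes=90),
    "testing and handover": timedelta(minutes=45),
}
rto = timedelta(hours=4)  # assumed target


def recovery_time(phases):
    """Total time from failure to handover: the sum of every phase."""
    return sum(phases.values(), timedelta())


total = recovery_time(phases)
meets_rto = total <= rto
print(f"recovery time {total}, RTO {rto}:", "met" if meets_rto else "missed")
```

With these numbers the restore itself is less than half of the total, which is why detection, decision-making and testing matter as much as restore speed.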
Effective ways to reduce recovery time:
- Planning
- Monitoring
- Notification process
- Spare HW
- Staff training
- More efficient systems (virtualized or cloud-based, e.g. AWS)
Once again, aim for Goldilocks approach