Ref: https://learn.cantrill.io/courses/2022818/lectures/45637149
YouTube: https://www.youtube.com/watch?v=KesoHnsZWoA
Recovery Point Objective (RPO)
- 🔧 Two definitions:
- Maximum allowable age of the last backup in the event of a disaster or outage
- Maximum amount of data (measured in time) that can be lost during a DR situation before that loss will exceed what the organization can tolerate
- ‼️ RPO is measured in time (mins/hours)!!!
- Understanding how data can be lost is key to implementing RPO requirements
- Recovery points = successful backups
- Backups should happen at least as often as the RPO…
- ❗ …but a failed backup can double data loss!! → hence more frequency is advisable
- 💡 e.g. A company's RPO is 30 mins, and backups run every 25 mins. The latest successful backup was at 9:00, and the one scheduled for 9:25 fails. At 9:40 there's a disaster and all data is lost… The latest backup is 40 mins old, which exceeds the RPO! → Backing up at least every 30/2 = 15 mins keeps the loss within the RPO even if one backup fails.
- Lower RPO = more frequent backups = higher cost → Trade-off
- Aim for Goldilocks approach → “as close to true business requirements as possible”
- 💡 In the example above, backups every 5 mins are too costly and backups every 25 mins are too risky → backups every 15 mins are probably a good solution
- RPO Summary Diagram
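The backup-frequency reasoning above can be sketched in Python. This is only an illustration under stated assumptions: the function names (`data_loss_at`, `worst_case_loss`, `max_interval`) and the calendar date are made up for the sketch, while the 30-min RPO, the 25-min interval and the 9:00 / 9:40 times come from the example in these notes.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def data_loss_at(disaster: datetime, successful_backups: list[datetime]) -> timedelta:
    """Age of the most recent successful backup at the moment of the disaster."""
    earlier = [b for b in successful_backups if b <= disaster]
    if not earlier:
        raise ValueError("no successful backup before the disaster")
    return disaster - max(earlier)


def worst_case_loss(interval: timedelta, failed_in_a_row: int = 0) -> timedelta:
    """Upper bound on data loss if up to `failed_in_a_row` consecutive backups fail."""
    return interval * (failed_in_a_row + 1)


def max_interval(rpo: timedelta, failed_in_a_row: int = 1) -> timedelta:
    """Longest backup interval that stays within the RPO despite failed backups."""
    return rpo / (failed_in_a_row + 1)


RPO = timedelta(minutes=30)
last_ok = datetime(2024, 1, 1, 9, 0)    # arbitrary date; 9:00 backup succeeded
disaster = datetime(2024, 1, 1, 9, 40)  # the 9:25 backup failed, so it is not listed

loss = data_loss_at(disaster, [last_ok])
print(loss, "exceeds RPO" if loss > RPO else "within RPO")
print(worst_case_loss(timedelta(minutes=25), failed_in_a_row=1))
print(max_interval(RPO, failed_in_a_row=1))
```

Tolerating more consecutive failures simply shrinks the interval further (RPO divided by failures + 1), which is where the cost side of the trade-off comes in.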
Recovery Time Objective (RTO)
- 🔧 Two definitions:
- Maximum tolerable duration of a service outage
- Maximum tolerable length of time it takes to restore services to normal after an outage
- Systems with lower RTO are more critical
- ‼️ Recovery time…:
- …Begins at moment of failure!
- …Ends when system is operational and handed back to the business!!
- ALL OF THIS is included in recovery time:
- (Most important step) Must identify that the system has failed!
- Monitoring and notification
- 💡 Is there monitoring? Is it reliable? How does it notify staff? etc.
- ❗ Even in the best-case scenario, identifying system failure already takes some time!
- Investigate the issue
- Decide how to restore:
- Backup to restore: What type? Who restores it? Where? How? Documentation?…
- Where to restore: Do we have a spare server? Secondary site?…
- 💡 This can be stressful, especially at night
- Finally: after restoration is in place → Business testing, user testing, final handover
- Effective ways to reduce recovery time:
- Planning
- Monitoring
- Notification process
- Spare HW
- Staff training
- More efficient systems (virtual or cloud e.g. AWS)
- Once again, aim for Goldilocks approach
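Since recovery time runs from the moment of failure to final handover, it can be thought of as the sum of every phase listed above. The sketch below illustrates that with made-up numbers: the phase durations, the 2-hour RTO and the names `PHASES`, `recovery_time` and `meets_rto` are all hypothetical, not figures from the course.

```python
from datetime import timedelta

# Hypothetical durations for each phase of a recovery (illustration only).
PHASES = {
    "identify failure (monitoring + notification)": timedelta(minutes=10),
    "investigate the issue": timedelta(minutes=20),
    "decide how and where to restore": timedelta(minutes=15),
    "restore from backup": timedelta(minutes=60),
    "business/user testing and handover": timedelta(minutes=30),
}


def recovery_time(phases: dict) -> timedelta:
    """Total time from moment of failure to handover back to the business."""
    return sum(phases.values(), timedelta())


def meets_rto(phases: dict, rto: timedelta) -> bool:
    """True if the summed phases fit within the RTO."""
    return recovery_time(phases) <= rto


RTO = timedelta(hours=2)  # assumed target for this sketch
print(recovery_time(PHASES), "meets RTO:", meets_rto(PHASES, RTO))

# Assumed effect of better monitoring and spare hardware on two phases.
improved = dict(PHASES)
improved["identify failure (monitoring + notification)"] = timedelta(minutes=2)
improved["restore from backup"] = timedelta(minutes=45)
print(recovery_time(improved), "meets RTO:", meets_rto(improved, RTO))
```

The point of the breakdown is that the restore itself is only one term in the sum; detection, investigation and testing all count against the RTO too.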