Ref: https://learn.cantrill.io/courses/1820301/lectures/41301341
EC2 Instance Status Checks (EC2 Instance Status Monitoring)
- 🔧 Every EC2 instance has 2 high-level status checks:
- System status check
- Issues impacting EC2 service as a whole or the EC2 host
- Failure could indicate: loss of system power, loss of NW connectivity, host SW issues, host HW issues
- Instance status check
- Issues impacting the EC2 instance itself
- Failure could indicate: corrupted FS, incorrect instance NWing, OS kernel issues
- 💡 Each check represents a separate set of tests, failure of one suggests a specific set of underlying problems (and a different set than the other check)

- When you launch an instance, checks start in initializing state…
- Once the instance reaches 2/2 checks passed → “all is well” within the instance
- ‼️ All is well from the perspective of EC2! (your own custom configurations might NOT be correctly initialized, we will explore that in later EC2 sections)
- If either check does not pass and the instance is not launching correctly → a problem to address!
- You must address failure by taking actions on the instance manually or automatically
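The two checks above can be read programmatically. As a minimal sketch, the helper below summarises them from a dict shaped like the `describe_instance_status` response — a live check would call `boto3`'s `ec2.describe_instance_status(InstanceIds=[...])`; here the response is a hand-written sample, and the instance ID is a placeholder:

```python
# Sample shaped like an EC2 describe-instance-status response:
# SystemStatus  = host / EC2-service level check
# InstanceStatus = instance / OS level check
SAMPLE_RESPONSE = {
    "InstanceStatuses": [
        {
            "InstanceId": "i-0123456789abcdef0",   # placeholder ID
            "SystemStatus": {"Status": "ok"},
            "InstanceStatus": {"Status": "impaired"},
        }
    ]
}

def summarize_checks(response):
    """Return {instance_id: 'N/2 checks passed'} for each instance."""
    summary = {}
    for status in response["InstanceStatuses"]:
        passed = sum(
            1
            for key in ("SystemStatus", "InstanceStatus")
            if status[key]["Status"] == "ok"
        )
        summary[status["InstanceId"]] = f"{passed}/2 checks passed"
    return summary

print(summarize_checks(SAMPLE_RESPONSE))
```

Anything short of 2/2 is the signal to intervene — manually, or automatically via the alarms below.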
EC2 Status Check Alarm
- 🔧 A CW Alarm can be configured to notify and/or take action automatically if an EC2 status check is not passing (i.e. instance failure)
- Notifications usually sent to SNS topics
- Possible actions to address instance failure:
- Reboot → traditional method to bring instance back to functional state
- Stop → useful for performing diagnostics
- Terminate → useful if HA is configured
- EC2 can be configured to reprovision missing instances if our system has HA (e.g. with an EC2 Auto Scaling Group)
- Termination is fine in this case → avoids storing many stopped instances and billing
- Recover → uses EC2 Auto Recovery feature (tries to restore instance to functional state)
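A status check alarm combines the `StatusCheckFailed` metrics in the `AWS/EC2` namespace with `arn:aws:automate:…:ec2:<action>` alarm actions. The sketch below only builds the parameter dict a `boto3` `cloudwatch.put_metric_alarm(**params)` call would take — the instance ID, account ID, SNS topic name, and the 2-minute evaluation window are illustrative assumptions:

```python
# Sketch: CloudWatch alarm parameters for an EC2 status check failure.
REGION = "us-east-1"
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

params = {
    "AlarmName": f"status-check-failed-{INSTANCE_ID}",
    "Namespace": "AWS/EC2",
    # Combined metric; StatusCheckFailed_System / _Instance also exist
    "MetricName": "StatusCheckFailed",
    "Dimensions": [{"Name": "InstanceId", "Value": INSTANCE_ID}],
    "Statistic": "Maximum",
    "Period": 60,
    "EvaluationPeriods": 2,  # fail 2 consecutive minutes before acting
    "Threshold": 1.0,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "AlarmActions": [
        # Notify via SNS (placeholder account ID and topic name)...
        f"arn:aws:sns:{REGION}:111122223333:ops-alerts",
        # ...and take an action: :reboot, :stop, :terminate, or :recover
        f"arn:aws:automate:{REGION}:ec2:reboot",
    ],
}

print(params["AlarmActions"][1])
```

Swapping the second action ARN between `:reboot`, `:stop`, `:terminate`, and `:recover` selects among the four responses listed above.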
EC2 Auto Recovery
- 🔧 Simple feature that automatically attempts to recover an EC2 instance in response to status check issues
- ‼️ Not designed to recover against large scale or complex system issues!
- Only handles a narrow set of error scenarios for an isolated instance
- Won't fix every problem in EC2
- Still useful to avoid alerting a sys admin for common failures
- Takes a set of actions and tries its best to recover the instance
- Moves instance to new host (in same AZ)
- Starts instance up in new host with exact same configuration as before
- Maintains IP addressing (private, public, and Elastic IPs are all retained — unlike a manual stop/start, where a non-EIP public IPv4 changes)
- Instance will reboot if instance is configured to do so
- ‼️ Considerations/Limitations:
- Relies on having spare EC2 host capacity (otherwise it will fail to recover)
- Only works on modern types of instances (A1, C4, C5, M4, M5, R3, R4, R5…)
- ❗ Won't work if instance is using instance store volumes! (EBS volumes are fine)
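The limitations above can be expressed as a rough pre-flight eligibility check. This is a sketch under stated assumptions: the instance-type family list below is illustrative (taken from the examples above, not exhaustive), and the input dict mirrors the `BlockDeviceMappings` shape of a `describe_instances` response rather than querying the live API:

```python
# Illustrative subset of instance families supporting auto recovery
# (from the list above; the real supported set is larger and evolves).
SUPPORTED_FAMILIES = {"a1", "c4", "c5", "m4", "m5", "r3", "r4", "r5"}

def recovery_eligible(instance):
    """Rough check: can the CloudWatch recover action apply here?"""
    family = instance["InstanceType"].split(".")[0]
    if family not in SUPPORTED_FAMILIES:
        return False, f"unsupported instance family: {family}"
    # Instance store volumes rule recovery out; EBS-only is required.
    for mapping in instance.get("BlockDeviceMappings", []):
        if "Ebs" not in mapping:
            return False, "instance store volume attached"
    return True, "eligible"

print(recovery_eligible({
    "InstanceType": "m5.large",
    "BlockDeviceMappings": [{"DeviceName": "/dev/xvda", "Ebs": {}}],
}))
```

Note this check says nothing about spare host capacity in the AZ — that limitation only surfaces at recovery time, which is why HA designs (e.g. Auto Scaling Groups) remain the stronger option.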