Ref: https://learn.cantrill.io/courses/1820301/lectures/41301341
EC2 Instance Status Checks (EC2 Instance Status Monitoring)
- 🔧 Every EC2 instance has 2 high-level status checks:
- System status check
- Issues impacting EC2 service as a whole or the EC2 host
- Failure could indicate: loss of system power, loss of NW connectivity, host SW issues, host HW issues
- Instance status check
- Issues impacting the EC2 instance itself
- Failure could indicate: corrupted FS, incorrect instance NWing, OS kernel issues
- 💡 Each check represents a separate set of tests, failure of one suggests a specific set of underlying problems (and a different set than the other check)

- When you launch an instance, checks start in initializing state…
- Once the instance reaches 2/2 checks passed → “all is well” within the instance
- ‼️ All is well from the perspective of EC2! (your own custom configurations might NOT be correctly initialized, we will explore that in later EC2 sections)
- If either check does not pass and the instance is not launching correctly → a problem to address!
- You must address failure by taking actions on the instance manually or automatically
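The two checks above can be read programmatically. As a minimal sketch, the helper below summarises them from a dict shaped like the `describe_instance_status` response — a live check would call `boto3`'s `ec2.describe_instance_status(InstanceIds=[...])`; here the response is a hand-written sample, and the instance ID is a placeholder:

```python
# Sample shaped like an EC2 describe-instance-status response:
# SystemStatus  = host / EC2-service level check
# InstanceStatus = instance / OS level check
SAMPLE_RESPONSE = {
    "InstanceStatuses": [
        {
            "InstanceId": "i-0123456789abcdef0",   # placeholder ID
            "SystemStatus": {"Status": "ok"},
            "InstanceStatus": {"Status": "impaired"},
        }
    ]
}

def summarize_checks(response):
    """Return {instance_id: 'N/2 checks passed'} for each instance."""
    summary = {}
    for status in response["InstanceStatuses"]:
        passed = sum(
            1
            for key in ("SystemStatus", "InstanceStatus")
            if status[key]["Status"] == "ok"
        )
        summary[status["InstanceId"]] = f"{passed}/2 checks passed"
    return summary

print(summarize_checks(SAMPLE_RESPONSE))
```

Anything short of 2/2 is the signal to intervene — manually, or automatically via the alarms below.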
EC2 Status Check Alarm
- 🔧 A CW Alarm can be configured to notify and/or take action automatically if an EC2 status check is not passing (i.e. instance failure)
- Notifications usually sent to SNS topics
- Possible actions to address instance failure:
- Reboot → traditional method to bring instance back to functional state
- Stop → useful for performing diagnostics
- Terminate → useful if HA is configured
- EC2 can be configured to reprovision missing instances if our system has HA (e.g. with an EC2 Auto Scaling Group)
- Termination is fine in this case → avoids storing many stopped instances and billing
- Recover → uses EC2 Auto Recovery feature (tries to restore instance to functional state)
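A status check alarm combines the `StatusCheckFailed` metrics in the `AWS/EC2` namespace with `arn:aws:automate:…:ec2:<action>` alarm actions. The sketch below only builds the parameter dict a `boto3` `cloudwatch.put_metric_alarm(**params)` call would take — the instance ID, account ID, SNS topic name, and the 2-minute evaluation window are illustrative assumptions:

```python
# Sketch: CloudWatch alarm parameters for an EC2 status check failure.
REGION = "us-east-1"
INSTANCE_ID = "i-0123456789abcdef0"  # placeholder

params = {
    "AlarmName": f"status-check-failed-{INSTANCE_ID}",
    "Namespace": "AWS/EC2",
    # Combined metric; StatusCheckFailed_System / _Instance also exist
    "MetricName": "StatusCheckFailed",
    "Dimensions": [{"Name": "InstanceId", "Value": INSTANCE_ID}],
    "Statistic": "Maximum",
    "Period": 60,
    "EvaluationPeriods": 2,  # fail 2 consecutive minutes before acting
    "Threshold": 1.0,
    "ComparisonOperator": "GreaterThanOrEqualToThreshold",
    "AlarmActions": [
        # Notify via SNS (placeholder account ID and topic name)...
        f"arn:aws:sns:{REGION}:111122223333:ops-alerts",
        # ...and take an action: :reboot, :stop, :terminate, or :recover
        f"arn:aws:automate:{REGION}:ec2:reboot",
    ],
}

print(params["AlarmActions"][1])
```

Swapping the second action ARN between `:reboot`, `:stop`, `:terminate`, and `:recover` selects among the four responses listed above.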
EC2 Auto Recovery
- 🔧 Simple feature that automatically attempts to recover an EC2 instance in response to status check issues
- ‼️ Not designed to recover against large scale or complex system issues!
- Only handles a narrow set of error scenarios for an isolated instance
- Won't fix every problem in EC2
- Still useful to avoid alerting a sys admin for common failures
- Takes a set of actions and tries its best to recover the instance
- Moves instance to new host (in same AZ)
- Starts instance up in new host with exact same configuration as before
- Maintains IP addressing (private, public, and Elastic IPs are all retained — unlike a manual stop/start, where a non-EIP public IPv4 changes)
- Instance will reboot if instance is configured to do so
- ‼️ Considerations/Limitations:
- Relies on having spare EC2 host capacity (otherwise it will fail to recover)
- Only works on modern types of instances (A1, C4, C5, M4, M5, R3, R4, R5…)
- ❗ Won't work if instance is using instance store volumes! (EBS volumes are fine)
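The limitations above can be expressed as a rough pre-flight eligibility check. This is a sketch under stated assumptions: the instance-type family list below is illustrative (taken from the examples above, not exhaustive), and the input dict mirrors the `BlockDeviceMappings` shape of a `describe_instances` response rather than querying the live API:

```python
# Illustrative subset of instance families supporting auto recovery
# (from the list above; the real supported set is larger and evolves).
SUPPORTED_FAMILIES = {"a1", "c4", "c5", "m4", "m5", "r3", "r4", "r5"}

def recovery_eligible(instance):
    """Rough check: can the CloudWatch recover action apply here?"""
    family = instance["InstanceType"].split(".")[0]
    if family not in SUPPORTED_FAMILIES:
        return False, f"unsupported instance family: {family}"
    # Instance store volumes rule recovery out; EBS-only is required.
    for mapping in instance.get("BlockDeviceMappings", []):
        if "Ebs" not in mapping:
            return False, "instance store volume attached"
    return True, "eligible"

print(recovery_eligible({
    "InstanceType": "m5.large",
    "BlockDeviceMappings": [{"DeviceName": "/dev/xvda", "Ebs": {}}],
}))
```

Note this check says nothing about spare host capacity in the AZ — that limitation only surfaces at recovery time, which is why HA designs (e.g. Auto Scaling Groups) remain the stronger option.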