Failure Management and AWS: How to Withstand and Repair Problems

Every system will encounter problems and occasionally fail. What makes a system reliable is its ability to react quickly and efficiently to failures.

The goal is a workload that returns to its normal operating level automatically, with little or no disruption to its users.

Architecting for Resiliency
Resiliency is the ability to bounce back from failure, overload, or attack. The AWS Well-Architected Framework describes five best practices for making your workload as resilient as possible.

Monitor All Components
Set up automated monitoring that continuously checks every component of your workload. Define key performance indicators (KPIs) in terms of your business goals, not the technical details of how the system runs. When monitoring detects that a KPI threshold has been breached, automation can respond to the failure.

Monitoring can also detect degradation, an early sign that a failure is likely, so that your automated systems can act before the failure occurs.
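As a rough illustration, the distinction between a degraded KPI and a breached one can be sketched as two thresholds. The KPI (successful orders per minute), the threshold values, and the names `KpiThresholds` and `classify_kpi` are assumptions made for this example, not part of any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KpiThresholds:
    """Two levels for a 'higher is better' KPI (hypothetical values)."""
    degraded_below: float  # early warning: a failure is becoming likely
    breached_below: float  # the KPI is breached: treat as a failure


def classify_kpi(value: float, thresholds: KpiThresholds) -> str:
    """Return 'healthy', 'degraded', or 'breached' for one KPI reading."""
    if value < thresholds.breached_below:
        return "breached"
    if value < thresholds.degraded_below:
        return "degraded"
    return "healthy"


# Hypothetical business KPI: successful orders per minute.
orders = KpiThresholds(degraded_below=80.0, breached_below=50.0)

for reading in (100.0, 70.0, 40.0):
    print(reading, classify_kpi(reading, orders))
```

In a real workload the "degraded" state would trigger preventive action and the "breached" state would trigger recovery; here both are only labels.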

Keep Healthy Resources Separate
Instead of relying on a single resource, spread your workload across several independent ones, so that if one fails, the remaining healthy resources continue to handle requests.

For location-level failures, such as the loss of an Availability Zone or AWS Region, set up systems that fail over to healthy resources in unaffected locations. AWS services such as Elastic Load Balancing and Auto Scaling can help by distributing load across resources and Availability Zones.
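The failover idea can be sketched as a small routing function. This is a simplified model, not how a load balancer is actually implemented: the resource dictionaries, the location names, and `pick_target` are all hypothetical.

```python
from typing import Dict, List, Optional


def pick_target(resources: List[Dict], preferred_location: str) -> Optional[Dict]:
    """Choose a healthy resource, preferring one in `preferred_location`.

    Falls back to a healthy resource in any other location, and returns
    None only when nothing is healthy.
    """
    healthy = [r for r in resources if r["healthy"]]
    if not healthy:
        return None
    local = [r for r in healthy if r["location"] == preferred_location]
    return (local or healthy)[0]


# Hypothetical fleet spread over two locations.
fleet = [
    {"name": "web-1", "location": "zone-a", "healthy": False},
    {"name": "web-2", "location": "zone-b", "healthy": True},
]

# web-1 is unhealthy, so requests for zone-a fail over to web-2 in zone-b.
print(pick_target(fleet, "zone-a")["name"])  # web-2
```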

Automate Healing
It takes time for a team member to receive a notification, learn about the problem, and determine a plan of action. Instead, build automation that detects and repairs failures without waiting for a person.

Consider using AWS services such as Auto Scaling and EC2 automatic recovery to help your system repair itself.
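The general pattern behind this kind of self-healing is a reconciliation loop: compare the healthy fleet with the desired size and replace whatever is missing. The sketch below models that idea in plain Python. It does not call AWS, and `heal`, the `fleet` mapping, and the `launch` callback are hypothetical stand-ins.

```python
from typing import Callable, Dict, List


def heal(fleet: Dict[str, bool], desired: int, launch: Callable[[], str]) -> List[str]:
    """Drop unhealthy instances, then launch replacements up to `desired`.

    `fleet` maps instance id -> healthy flag and is modified in place.
    `launch` is a stand-in for whatever starts a new instance; it must
    return an id that is not already in the fleet.
    Returns the ids of the instances that were launched.
    """
    for instance_id in [i for i, ok in fleet.items() if not ok]:
        del fleet[instance_id]

    launched: List[str] = []
    while len(fleet) < desired:
        new_id = launch()
        fleet[new_id] = True
        launched.append(new_id)
    return launched
```

Running a function like this on a schedule, or in response to a failed health check, removes the delay of waiting for a person to notice and react.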

Static Stability Prevents Bimodal Behaviour
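A workload exhibits bimodal behaviour when it behaves differently in normal and failure modes, for example by relying on launching new instances only after an Availability Zone has failed. Design your workloads to be statically stable, so that they operate in a single mode, and test that they behave the same way when failures occur.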

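Do not allow clients to bypass your workload’s cache when failures occur, even during a cascading failure. Bypassing the cache sharply changes the demand on your workload, which is itself bimodal behaviour and is likely to cause further failures.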
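One common way to achieve static stability is to provision enough capacity in advance that losing a location requires no new launches. The arithmetic can be sketched as follows; `instances_per_az` and the example numbers are illustrative assumptions, not AWS guidance on sizing.

```python
import math


def instances_per_az(required: int, az_count: int, az_failures_tolerated: int = 1) -> int:
    """Instances to run in each Availability Zone so that `required`
    instances remain after `az_failures_tolerated` zones are lost."""
    surviving = az_count - az_failures_tolerated
    if surviving < 1:
        raise ValueError("at least one Availability Zone must survive")
    return math.ceil(required / surviving)


# Example: the workload needs 6 instances and runs in 3 zones.
per_az = instances_per_az(required=6, az_count=3)
print(per_az, per_az * 3)  # 3 per zone, 9 in total
```

With 9 instances running all the time, losing one zone still leaves the 6 that are needed, so the workload behaves the same way in normal and failure conditions. The trade-off is the cost of the extra capacity.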

Notifications
Have your automated systems notify the relevant team when a system is nearing failure or has failed. Teams should also be notified whenever your systems detect a problem that will affect availability.

Well-Architected Review
If you’re struggling to make your systems reliable, WOLK, an experienced AWS Partner, is authorised to perform a Well-Architected Review.

Through the review, WOLK can identify high-risk items and any areas that are low in compliance with the Framework. The team can then mitigate the problems, ensuring your systems are reliable and resilient.