Cloud Reliability: The Five AWS Design Principles

Reliability is the third pillar of the AWS Well-Architected Framework. It covers the ability of your workloads and applications to perform their intended function correctly and consistently whenever they are expected to.

Using the five design principles of the reliability pillar, you can create workloads and applications that are reliable for their entire lifecycle.

Automatically Recover From Failure
Automation is a vital element of the reliability pillar. Set up systems that monitor Key Performance Indicators (KPIs) that measure business value, not just the technical health of a service. When one of the KPIs reads too low or too high, your monitoring system should automatically notify you and continue tracking the problem.

You can also set up automatic recovery systems that your monitoring systems trigger when there’s a problem.
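As a rough illustration of that flow, here is a minimal sketch in plain Python. It is not an AWS API: the names `KpiRule`, `evaluate`, and `check_and_recover`, as well as the `orders_per_minute` KPI and its limits, are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class KpiRule:
    """Acceptable range for one KPI (hypothetical structure)."""
    name: str
    low: float
    high: float


def evaluate(rule: KpiRule, value: float) -> Optional[str]:
    """Return None when the value is in range, otherwise a short description."""
    if value < rule.low:
        return f"{rule.name} too low: {value} < {rule.low}"
    if value > rule.high:
        return f"{rule.name} too high: {value} > {rule.high}"
    return None


def check_and_recover(
    rule: KpiRule,
    value: float,
    notify: Callable[[str], None],
    recover: Callable[[str], None],
) -> bool:
    """Notify and trigger recovery when the KPI is out of range.

    Returns True if a breach was found, False otherwise.
    """
    problem = evaluate(rule, value)
    if problem is None:
        return False
    notify(problem)      # tell a human what happened
    recover(rule.name)   # hand off to the automatic recovery procedure
    return True


# Example wiring: lists stand in for a real alerting and recovery system.
alerts, recovered = [], []
rule = KpiRule("orders_per_minute", low=50, high=5000)
check_and_recover(rule, 12, alerts.append, recovered.append)
```

In a real setup, `notify` and `recover` would be whatever alerting and remediation hooks your monitoring service provides. The point of the sketch is only that detection, notification, and recovery are chained without a person in the loop.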

To prepare as much as possible for failure, you can set up systems that track trends in your metrics, so they can anticipate likely problems and let you remediate them before they occur.
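One simple way to track a trend is to fit a straight line through recent samples and estimate when it will cross a limit. The sketch below assumes evenly spaced samples and a rising metric (disk usage, for instance). The function names and the sample numbers are hypothetical, and real workloads often need something more robust than a linear fit.

```python
from typing import Optional, Sequence, Tuple


def linear_fit(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for samples taken at t = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    return slope, mean_v - slope * mean_t


def intervals_until(values: Sequence[float], threshold: float) -> Optional[float]:
    """Estimate how many sampling intervals after the last sample the
    fitted line reaches the threshold.

    Returns None when the trend is flat or falling, and 0.0 when the
    fitted line has already reached the threshold.
    """
    slope, intercept = linear_fit(values)
    if slope <= 0:
        return None
    t_cross = (threshold - intercept) / slope
    return max(0.0, t_cross - (len(values) - 1))


# Disk usage in percent, one sample per hour (made-up numbers).
usage = [60, 62, 64, 66, 68]
hours_left = intervals_until(usage, threshold=90)  # 11.0 for this data
```

With usage growing two points per hour from 68%, the estimate is 11 more hours until 90%, which is enough warning to act before the disk fills up.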

Test Recovery Procedures
Just as you test your workload’s operating procedures, you should also evaluate its recovery methods. While working in the cloud, you can use automation to deliberately cause a failure in your workload and observe how well the recovery systems and procedures work.
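The idea of an automated failure drill can be sketched with a toy example. Everything here is a stand-in: `Service` is a hypothetical in-memory component, not a real AWS resource, and `recover` represents whatever recovery procedure you want to test.

```python
class Service:
    """Toy stand-in for a workload component (hypothetical)."""

    def __init__(self) -> None:
        self.healthy = True
        self.restarts = 0

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise RuntimeError("service unavailable")
        return f"ok:{request}"


def recover(service: Service) -> None:
    """Recovery procedure under test: restart the service."""
    service.restarts += 1
    service.healthy = True


def run_failure_drill(service: Service, recover_fn) -> dict:
    """Inject a failure, confirm it is visible, run recovery, verify the result."""
    service.healthy = False  # the injected fault

    # Step 1: the failure should be observable from the outside.
    try:
        service.handle("probe")
        detected = False
    except RuntimeError:
        detected = True

    # Step 2: run the recovery procedure being tested.
    recover_fn(service)

    # Step 3: the service should answer normally again.
    try:
        recovered = service.handle("probe") == "ok:probe"
    except RuntimeError:
        recovered = False

    return {"failure_detected": detected, "recovered": recovered}


report = run_failure_drill(Service(), recover)
```

A drill that reports `recovered` as false tells you the procedure is broken before a real outage does.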

It’s also possible to use automation to recreate past failures. If you’re unsure of exactly where a failure occurred, a recreation can help you determine the cause and reduce the chance that it happens again.

Scale Horizontally
Instead of running your workload on one large resource, consider spreading it across several smaller resources. If a single large resource fails, you might have to shut down your entire system for the repair, whereas the failure of one small resource affects only part of the workload.

Distribute requests across the smaller resources, and make sure those resources don’t share a common point of failure.
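To make the idea concrete, here is a minimal round-robin sketch that skips resources marked as unhealthy. The `Pool` class and the node names are hypothetical. In practice a managed load balancer does this job, and this sketch only shows the principle.

```python
from itertools import cycle
from typing import Iterable, List


class Pool:
    """Round-robin over several small resources, skipping unhealthy ones."""

    def __init__(self, nodes: Iterable[str]) -> None:
        self.nodes: List[str] = list(nodes)
        self.healthy = set(self.nodes)
        self._ring = cycle(self.nodes)

    def mark_down(self, node: str) -> None:
        self.healthy.discard(node)

    def mark_up(self, node: str) -> None:
        if node in self.nodes:
            self.healthy.add(node)

    def pick(self) -> str:
        """Return the next healthy node, or raise if none is available."""
        # One full pass over the ring is enough to find any healthy node.
        for _ in range(len(self.nodes)):
            node = next(self._ring)
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy nodes")


pool = Pool(["node-a", "node-b", "node-c"])
pool.mark_down("node-b")                  # simulate one small resource failing
targets = [pool.pick() for _ in range(4)]
# Requests keep flowing to node-a and node-c only.
```

Losing `node-b` reduces capacity by a third but does not stop the workload, which is the benefit of scaling horizontally.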

Don’t Guess Capacity
Don’t just assume that your workload can handle the demands you place on it. One of the most common causes of workload failure is resource saturation, where demand exceeds what the available resources can handle.

Use AWS tools to monitor the demands placed on your workload and its saturation level. Create systems that automatically add or remove resources as demand changes, so that your workload has enough capacity without being over-provisioned.
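One way to act on those measurements instead of guessing is a simple proportional rule: size the fleet so that measured utilization moves back toward a target. The sketch below is an assumption for illustration, not the algorithm of any particular AWS service. The function name, the default limits, and the example numbers are all hypothetical.

```python
import math


def desired_capacity(
    current: int,
    utilization_pct: float,
    target_pct: float,
    minimum: int = 1,
    maximum: int = 20,
) -> int:
    """Return how many instances are needed to bring utilization to the target.

    Proportional rule: desired = ceil(current * utilization / target),
    clamped to the range [minimum, maximum].
    """
    if current < 1:
        raise ValueError("current must be at least 1")
    if not 0 < target_pct <= 100:
        raise ValueError("target_pct must be in (0, 100]")
    raw = math.ceil(current * utilization_pct / target_pct)
    return max(minimum, min(maximum, raw))


# 4 instances running at 90% against a 60% target: scale out to 6.
scale_out = desired_capacity(4, utilization_pct=90, target_pct=60)

# 4 instances idling at 30% against the same target: scale in to 2.
scale_in = desired_capacity(4, utilization_pct=30, target_pct=60)
```

The minimum and maximum bounds matter as much as the formula: they keep a bad metric reading from scaling the workload to zero or to an unexpectedly large bill.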

Manage Change
Use automation to make changes to your workload and its infrastructure. Automation reduces the risk of human error.

Changes made to the automation itself should also be tracked and reviewed, preferably with the help of another automated system.
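As a small sketch of what automated change tracking can look like, the code below records every configuration change with a fingerprint of the before and after states and a list of the keys that changed. The function names, the log format, and the sample scaling settings are hypothetical. In practice, version control and infrastructure-as-code tooling usually fill this role.

```python
import hashlib
import json
from typing import Any, Dict, List, Tuple


def fingerprint(config: Dict[str, Any]) -> str:
    """Stable SHA-256 hash of a JSON-serializable configuration."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff(old: Dict[str, Any], new: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each changed key to its (before, after) values.

    A key missing on one side is reported as None, so this simple version
    cannot tell a missing key from an explicit None value.
    """
    changes = {}
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before != after:
            changes[key] = (before, after)
    return changes


def record_change(log: List[dict], author: str, old: dict, new: dict) -> dict:
    """Append an audit entry for a change and return it for review."""
    entry = {
        "author": author,
        "before": fingerprint(old),
        "after": fingerprint(new),
        "changes": diff(old, new),
        "reviewed": False,  # flipped by a reviewer or an automated check
    }
    log.append(entry)
    return entry


audit_log: List[dict] = []
old_cfg = {"min_instances": 2, "max_instances": 6}
new_cfg = {"min_instances": 2, "max_instances": 10, "cooldown_s": 300}
entry = record_change(audit_log, "deploy-bot", old_cfg, new_cfg)
```

Because each entry lists exactly what changed, a reviewer or a second automated check can approve or reject the change without comparing the full configurations by hand.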

Work with an AWS Well-Architected Partner
To ensure you are compliant with all five design principles of the reliability pillar, consider working with an experienced AWS Partner. The WOLK team stays up-to-date with the current design principles and best practices of the AWS Well-Architected Framework.

By performing a Well-Architected Review, we can identify any non-compliance issues and then mitigate them for you.