Prevent and Troubleshoot Issues with CloudWatch Alarms

Learning Objectives

After completing this unit, you’ll be able to:

  • Explain the features of Amazon CloudWatch alarms.
  • Describe how Amazon CloudWatch alarms help prevent and troubleshoot operational issues.

Your cat photo application is hosted by various AWS services. It’s automatically sending metrics to CloudWatch. You can view and analyze those metrics using either the CloudWatch console or the AWS CLI. That’s great! 

However, you now have a lot of emails from your end users. Your website is having serious operational issues, and end users get an error when they try to upload their adorable cat photos. You log in to AWS and check your CloudWatch dashboards and CloudWatch Logs only to see that the number of 500-error responses has gone up 300% over the past 24 hours. 

You start digging. You find out that a recent code change is causing this. You fix the code problem, deploy the application update, and wait to see what happens. Over the next few hours, you can see in CloudWatch that the 500-error response rate has returned to normal levels—the issue is fixed. 

CloudWatch gave you the ability to troubleshoot and pinpoint what the issue was. But wouldn’t it be nice if you could have been notified about the rise in 500-error responses before your end users had to contact you? This is where CloudWatch alarms come into play. We explore what CloudWatch alarms are and how they work in this unit.

Configure a CloudWatch Alarm

You can create CloudWatch alarms to automatically initiate actions based on sustained changes in the state of your metrics. You configure when an alarm is triggered and which action is performed. 

First, decide which metric you want to set an alarm on. Then define the threshold at which the alarm should trigger. Finally, define the time period for which the metric must stay beyond that threshold before the alarm is triggered. 

CloudWatch alarm logo with three vertical bars and a circle over the last bar, with the three requirements to create an alarm: a metric, threshold, and time period.

For example, if you want an alarm on an EC2 instance to trigger when CPU utilization goes over a threshold of 80%, you also need to specify how long the CPU utilization must stay over that threshold. You don’t want to trigger an alarm on short, temporary CPU spikes. You only want to trigger it when the CPU is elevated for a sustained amount of time, for example over 80% for 5 minutes or longer, because that points to a potential resource issue. 
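The difference between a short spike and a sustained breach can be sketched in plain Python. This is a simplified illustration, not how CloudWatch is implemented: the function name `breaches_for_sustained_period`, the one-datapoint-per-minute sampling, and the sample values are all assumptions made for the example.

```python
from typing import Sequence


def breaches_for_sustained_period(
    datapoints: Sequence[float], threshold: float, periods_required: int
) -> bool:
    """Return True only if the most recent `periods_required` datapoints
    are all above `threshold` (hypothetical helper for illustration)."""
    if periods_required < 1:
        raise ValueError("periods_required must be at least 1")
    if len(datapoints) < periods_required:
        # Not enough history yet to call the breach sustained.
        return False
    recent = datapoints[-periods_required:]
    return all(value > threshold for value in recent)


# Assumed sample data: one average CPU utilization value (percent) per minute.
short_spike = [35.0, 40.0, 95.0, 42.0, 38.0]
sustained_load = [85.0, 88.0, 91.0, 86.0, 90.0]

print(breaches_for_sustained_period(short_spike, 80.0, 5))     # False
print(breaches_for_sustained_period(sustained_load, 80.0, 5))  # True
```

The single 95% reading in the first series does not trigger anything, while five consecutive minutes above 80% does.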

Keeping all that in mind, to set up an alarm you need to choose the metric, the threshold, and the time period.

Diagram that shows if CPU utilization is greater than 80% for 5 minutes, send a message to the dev team or create another instance to handle the load.

An alarm has three possible states.

  • OK: The metric is within the defined threshold. Everything appears to be operating normally.
  • ALARM: The metric is outside of the defined threshold. This could be an operational issue.
  • INSUFFICIENT_DATA: The alarm has just started, the metric is not available, or not enough data is available for the metric to determine the alarm state.
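As a rough mental model, the three states can be expressed as a small function over recent datapoints. This is a sketch under simplifying assumptions: the `AlarmState` enum and `evaluate_state` function are hypothetical names, a missing datapoint is represented as `None`, and any missing datapoint in the window is treated as insufficient data. CloudWatch itself lets you configure how missing data is treated, which this sketch does not model.

```python
from enum import Enum
from typing import Optional, Sequence


class AlarmState(Enum):
    OK = "OK"
    ALARM = "ALARM"
    INSUFFICIENT_DATA = "INSUFFICIENT_DATA"


def evaluate_state(
    datapoints: Sequence[Optional[float]], threshold: float, periods_required: int
) -> AlarmState:
    """Classify the last `periods_required` datapoints (simplified sketch)."""
    if periods_required < 1:
        raise ValueError("periods_required must be at least 1")
    window = datapoints[-periods_required:]
    present = [value for value in window if value is not None]
    if len(present) < periods_required:
        # Too few datapoints, or some are missing: the state cannot be determined.
        return AlarmState.INSUFFICIENT_DATA
    if all(value > threshold for value in present):
        return AlarmState.ALARM
    return AlarmState.OK


print(evaluate_state([50.0, 60.0, 55.0], 80.0, 3).value)  # OK
print(evaluate_state([85.0, 90.0, 95.0], 80.0, 3).value)  # ALARM
print(evaluate_state([85.0, None, 95.0], 80.0, 3).value)  # INSUFFICIENT_DATA
```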

An alarm’s actions are triggered when the alarm transitions from one state to another. An action can be an Amazon EC2 action, an Auto Scaling action, or a notification sent to Amazon Simple Notification Service (SNS).
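The idea that actions fire on a state transition, and not on every evaluation, can be illustrated with a toy class. Everything here (`AlarmSketch`, `on_enter`, `set_state`, the alarm name) is hypothetical and exists only for the example. In CloudWatch you attach actions to the alarm per target state rather than writing this logic yourself.

```python
from typing import Callable, Dict, List

# An action receives the alarm name, the old state, and the new state.
Action = Callable[[str, str, str], None]


class AlarmSketch:
    """Toy alarm that runs registered actions only when its state changes."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.state = "INSUFFICIENT_DATA"  # assumed starting state for a new alarm
        self._actions: Dict[str, List[Action]] = {}

    def on_enter(self, state: str, action: Action) -> None:
        """Register an action to run when the alarm enters `state`."""
        self._actions.setdefault(state, []).append(action)

    def set_state(self, new_state: str) -> bool:
        """Move to `new_state`; return True if this was a real transition."""
        if new_state == self.state:
            return False  # no transition, so no actions run
        old_state, self.state = self.state, new_state
        for action in self._actions.get(new_state, []):
            action(self.name, old_state, new_state)
        return True


notifications: List[str] = []
alarm = AlarmSketch("high-cpu")  # hypothetical alarm name
alarm.on_enter("ALARM", lambda name, old, new: notifications.append(f"{name}: {old} -> {new}"))

alarm.set_state("OK")     # transition, but no action is registered for OK
alarm.set_state("ALARM")  # transition, the notification action runs
alarm.set_state("ALARM")  # same state again, nothing runs
print(notifications)      # ['high-cpu: OK -> ALARM']
```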

To learn more about Amazon SNS and Auto Scaling, check out the resources section at the end of this unit.

Use CloudWatch Alarms to Prevent and Troubleshoot Issues

Think about the issue described at the beginning of this unit, where the cat photo application was serving 500-error responses on picture upload. How can you use CloudWatch alarms to notify you when this begins?

In the last unit, you learned that you can create metric filters for your logs. CloudWatch Logs uses a metric filter to turn log data into metrics that you can graph or set an alarm on. For the cat photo application, you set up a metric filter for 500-error response codes. 
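Conceptually, a metric filter scans log events for a pattern and emits a count. The sketch below imitates that locally in Python to show the idea. It does not use the CloudWatch Logs filter syntax, and the log line format (`timestamp method path status`), the sample lines, and the function name `count_500s_per_hour` are assumptions made for this example.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable

# Assumed log format: "<ISO timestamp> <method> <path> <status>"
SAMPLE_LOG_LINES = [
    "2024-05-01T10:05:12 POST /upload 500",
    "2024-05-01T10:17:40 GET /photos 200",
    "2024-05-01T10:42:03 POST /upload 500",
    "2024-05-01T11:03:55 POST /upload 500",
    "this line does not match the expected format",
]


def count_500s_per_hour(lines: Iterable[str]) -> Dict[datetime, int]:
    """Count log lines with status 500, grouped by the hour they occurred in."""
    counts: Counter = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip lines that do not fit the assumed format
        timestamp, _method, _path, status = parts
        if status == "500":
            hour = datetime.fromisoformat(timestamp).replace(
                minute=0, second=0, microsecond=0
            )
            counts[hour] += 1
    return dict(counts)


for hour, count in sorted(count_500s_per_hour(SAMPLE_LOG_LINES).items()):
    print(hour.isoformat(), count)
# 2024-05-01T10:00:00 2
# 2024-05-01T11:00:00 1
```

The resulting per-hour counts play the role of the metric that an alarm would then watch.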

Then, you define an alarm for that metric that goes into the ALARM state if the number of 500-error responses exceeds a certain threshold for a sustained time period. Let’s say the alarm should enter the ALARM state when there are more than five 500-error responses per hour. Next, you define an action that you want to take place when the alarm is triggered. 
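The "more than five 500-error responses per hour" rule maps fairly directly onto the parameters of the CloudWatch `PutMetricAlarm` API. The sketch below only builds the parameter dictionary and does not call AWS. The alarm name, the namespace `CatPhotoApp`, the metric name `Http500Count`, and the SNS topic ARN are hypothetical placeholders, and treating missing data as not breaching is an assumption, not something this unit prescribes.

```python
from typing import Any, Dict


def build_alarm_params(
    topic_arn: str, threshold: int = 5, period_seconds: int = 3600
) -> Dict[str, Any]:
    """Build keyword arguments for a CloudWatch metric alarm (sketch only)."""
    return {
        "AlarmName": "cat-photo-500-errors",   # hypothetical name
        "Namespace": "CatPhotoApp",            # hypothetical metric filter namespace
        "MetricName": "Http500Count",          # hypothetical metric filter metric
        "Statistic": "Sum",                    # total errors within each period
        "Period": period_seconds,              # one hour, in seconds
        "EvaluationPeriods": 1,                # evaluate a single one-hour period
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",  # strictly more than five
        "TreatMissingData": "notBreaching",    # assumption: no log matches means no errors
        "AlarmActions": [topic_arn],           # notify this SNS topic on ALARM
    }


# Placeholder ARN for illustration only.
params = build_alarm_params("arn:aws:sns:us-east-1:123456789012:cat-photo-alerts")
print(params["ComparisonOperator"], params["Threshold"], params["Period"])
# GreaterThanThreshold 5 3600

# With boto3 installed and AWS credentials configured, these parameters
# could be passed along roughly like this:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```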

In this case, it makes sense to send an email or text alert to you so you can start troubleshooting the website, hopefully fixing it before it becomes a bigger issue. Once the alarm is set up, you feel comfortable knowing that if the error happens again, you’ll be notified promptly.

Use a 500-error response filter to create an alarm for 500-error response greater than five per hour that creates an action like triggering Amazon SNS and Auto Scaling.

You can set up different alarms for different reasons to help you prevent or troubleshoot operational issues. In the scenario just described, the alarm triggered an SNS notification that went to a person who looked into the issue manually. Another option is to have alarms trigger actions that automatically remediate technical issues. 

For example, you can set up an alarm that reboots an EC2 instance, or one that scales services up or down. You can even set up an alarm to trigger an SNS notification, which then triggers an AWS Lambda function. The Lambda function can then call AWS APIs to manage your resources and remediate operational issues. By using AWS services together like this, you can respond to events more quickly.
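The alarm-to-SNS-to-Lambda chain might look roughly like the handler below. It assumes the usual shape of an SNS-triggered Lambda event, where the alarm details arrive as a JSON string under `Records[].Sns.Message` with fields such as `AlarmName` and `NewStateValue`. The remediation mapping and alarm names are hypothetical, and the handler only returns a description of what it would do instead of calling any AWS API.

```python
import json
from typing import Any, Dict, List

# Hypothetical mapping from alarm name to the remediation we would attempt.
REMEDIATIONS = {
    "high-cpu": "reboot-instance",
    "cat-photo-500-errors": "roll-back-deployment",
}


def choose_remediation(alarm_message: Dict[str, Any]) -> str:
    """Pick a remediation for an alarm message (illustrative logic only)."""
    if alarm_message.get("NewStateValue") != "ALARM":
        return "no-action"  # only react when the alarm enters ALARM
    return REMEDIATIONS.get(alarm_message.get("AlarmName", ""), "notify-on-call")


def lambda_handler(event: Dict[str, Any], context: Any) -> List[str]:
    """Entry point: decode each SNS record and decide what to do."""
    decisions = []
    for record in event.get("Records", []):
        alarm_message = json.loads(record["Sns"]["Message"])
        decisions.append(choose_remediation(alarm_message))
    return decisions


# Local usage with a hand-built event that imitates an SNS notification.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "AlarmName": "cat-photo-500-errors",
                        "OldStateValue": "OK",
                        "NewStateValue": "ALARM",
                    }
                )
            }
        }
    ]
}
print(lambda_handler(sample_event, None))  # ['roll-back-deployment']
```

In a real function, the returned decision would be replaced by the corresponding AWS SDK call, with permissions scoped to just that operation.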

Wrap Up

CloudWatch alarms proactively alert you when something isn’t quite right with your metrics. You choose which metrics to associate an alarm with, define the normal thresholds for those metrics, and then define the action that takes place when an alarm is triggered. This gives you the ability to respond to operational events as they happen, instead of reacting after your end users are negatively impacted.