Unexplainable delay between Alarm data breach and Alarm state change

0

Hello fellow developers,

I have encountered a problem, in which there is a significant delay between a Cloudwatch Alarm breach and the actual Alarm state change.
I can not figure out where the delay is coming from and after searching through the documentation and the developer forums without any success, I require your assistance :)

What I have done? (Steps to reproduce)
In concrete, I deployed a webserver using ECS Fargate behind an internet-facing application loadbalancer. The metric, on which the alarm should trigger, is the amount of 4xx Response Codes, which the application loadbalancer receives from the webserver. In case the sum of failed requests (yielding in a 4xx response code) over a timeperiod of 1 minute is bigger than 1, the alarm should switch from the OK in the ALARM state.

The whole setup can be replicated using the following cdk code:

    const vpc = new ec2.Vpc(scope, "SampleVpc", {
      subnetConfiguration: [{ // avoid autonomous creation of private subnets and costly NAT gateways
        cidrMask: 18,
        name: 'Public',
        subnetType: ec2.SubnetType.PUBLIC,
      }]
    });

    //nginx provides a sample page served over port 80
    const initialContainerImage = ecs.ContainerImage.fromRegistry("nginx");
    const loadBalancedFargateService = new ecsPatterns.ApplicationLoadBalancedFargateService(this, "FargateService", {
      assignPublicIp: true,
      taskImageOptions: {
        image: initialContainerImage,
      },
      publicLoadBalancer: true,
      desiredCount: 1,
      vpc: vpc
    });

    //create an alarm, when loadbalancer register a HTTP 4xx response code from targets
    const metric4xxresponseCode = loadBalancedFargateService.loadBalancer.metricHttpCodeTarget(elbv2.HttpCodeTarget.TARGET_4XX_COUNT)
    const alarm = new cw.Alarm(this, "4xxAlarm", {
      alarmName: "4xxAlarm",
      metric: metric4xxresponseCode,
      threshold: 1,
      evaluationPeriods: 1,
      statistic: "sum",
      period: cdk.Duration.minutes(1),
      treatMissingData: cw.TreatMissingData.NOT_BREACHING
    });

What was the expected behaviour?
Each minute, the Application Loadbalancer publishes the 4xxResponseCode metric to Cloudwatch (also written in this documentation: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-cloudwatch-metrics.html).
Because the period of the alarm is also set to one minute and only the datapoints of the last period are evaluated (defined through the evaluationPeriod = 1), the alarm should trigger within one minute, after the treshold is violated. Assuming the 4xx response code only occurs one time, the alarm should also switch back from the ALARM state to the OK state one minute after the ALARM state was reached, since the period is set to one minute, right?

What did actually happen?
After provoking some 404 error codes, the metric HTTPCode_Target_4XX_Count
shows at 10:10 a sum of 3, therefore violationg the Treshold HTTPCode_Target_4XX_Count >= 1 for 1 datapoints within 1 minute.
Anyway, the alarm state changes 4 minutes later at 10:14:

$ aws cloudwatch describe-alarm-history --alarm-name 4xxAlarm
{
    "AlarmHistoryItems": [
        {
            "AlarmName": "4xxAlarm",
            "Timestamp": "2020-03-13T10:20:37.155000+00:00",
            "HistoryItemType": "StateUpdate",
            "HistorySummary": "Alarm updated from ALARM to OK",
            "HistoryData": "{\"version\":\"1.0\",\"oldState\":{\"stateValue\":\"ALARM\",\"stateReason\":\"Threshold Crossed: 1 datapoint [3.0 (13/03/20 10:10:00)] was greater than or equal to the threshold (1.0).\",\"stateReasonData\":{\"version\":\"1.0\",\"queryDate\":\"2020-03-13T10:14:37.152+0000\",\"startDate\":\"2020-03-13T10:10:00.000+0000\",\"statistic\":\"Sum\",\"period\":60,\"recentDatapoints\":[3.0],\"threshold\":1.0}},\"newState\":{\"stateValue\":\"OK\",\"stateReason\":\"Threshold Crossed: no datapoints were received for 1 period and 1 missing datapoint was treated as [NonBreaching].\",\"stateReasonData\":{\"version\":\"1.0\",\"queryDate\":\"2020-03-13T10:20:37.149+0000\",\"statistic\":\"Sum\",\"period\":60,\"recentDatapoints\":[],\"threshold\":1.0}}}"
        },
        {
            "AlarmName": "4xxAlarm",
            "Timestamp": "2020-03-13T10:14:37.157000+00:00",
            "HistoryItemType": "StateUpdate",
            "HistorySummary": "Alarm updated from OK to ALARM",
            "HistoryData": "{\"version\":\"1.0\",\"oldState\":{\"stateValue\":\"OK\",\"stateReason\":\"Threshold Crossed: no datapoints were received for 1 period and 1 missing datapoint was treated as [NonBreaching].\",\"stateReasonData\":{\"version\":\"1.0\",\"queryDate\":\"2020-03-13T10:05:37.195+0000\",\"statistic\":\"Sum\",\"period\":60,\"recentDatapoints\":[],\"threshold\":1.0}},\"newState\":{\"stateValue\":\"ALARM\",\"stateReason\":\"Threshold Crossed: 1 datapoint [3.0 (13/03/20 10:10:00)] was greater than or equal to the threshold (1.0).\",\"stateReasonData\":{\"version\":\"1.0\",\"queryDate\":\"2020-03-13T10:14:37.152+0000\",\"startDate\":\"2020-03-13T10:10:00.000+0000\",\"statistic\":\"Sum\",\"period\":60,\"recentDatapoints\":[3.0],\"threshold\":1.0}}}"
        },
[...]
}

Note, that the switch back to OK states takes 6 minutes. Why is there such a huge delay, even when the period can be set to 1 minute?

Thanks for your help!

asked 4 years ago3403 views
2 Answers
1

Hi,

The reporting criteria for the HTTPCode_Target_4XX_Count metric is if there is a non-zero value. That means data point will only be reported if a non-zero value is generated, otherwise nothing will be pushed to the metric.

CloudWatch standard alarm evaluates its state every minute and no matter what value you set for how to treat missing data, when an alarm evaluates whether to change state, CloudWatch attempts to retrieve a higher number of data points than specified by Evaluation Periods (1 in this case). The exact number of data points it attempts to retrieve depends on the length of the alarm period and whether it is based on a metric with standard resolution or high resolution. The time frame of the data points that it attempts to retrieve is the evaluation range. Treat missing data as setting is applied if all the data in the evaluation range is missing, and not just if the data in evaluation period is missing.

Hence, CloudWatch alarms will look at some previous data points to evaluate its state, and will use the treat missing data as setting if all the data in evaluation range is missing. In this case, for the time when alarm did not transition to OK state, it was using the previous data points in the evaluation range to evaluate its state, as expected.

The alarm evaluation in case of missing data is explained in detail here, that will help in understanding this further:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html#alarms-evaluating-missing-data

Hope this helps !

answered 4 years ago
0

also changing the period in the metric options from the default 5 minutes to 1 minute did not change the previously observed behavior

    const metric4xxResponseCode = loadBalancedFargateService.loadBalancer.metricHttpCodeTarget(elbv2.HttpCodeTarget.TARGET_4XX_COUNT, {
      period: cdk.Duration.minutes(1)
    });
answered 4 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions