Skip to main content

AWS ECS Service Status Check

Validates the status of an Amazon ECS service. This probe checks if the specified ECS service has reached the desired state (e.g., running tasks count matches desired count) within the given timeout.

Infrastructure type

  • Kubernetes

Use cases

AWS ECS Service Status Check probe helps you:

  • Ensure ECS services maintain desired task count during failures
  • Validate service auto-scaling behavior during load changes
  • Monitor service health during container chaos experiments
  • Verify service deployment and rollback operations

Overview

This probe uses the AWS CLI to query ECS service status and validates that the service has reached its desired state. It continuously checks the running task count against the desired count, making it ideal for monitoring service stability during chaos experiments.

Probe type

Command Probe

Prerequisites

  • Kubernetes cluster with chaos infrastructure installed
  • AWS credentials configured with appropriate IAM permissions:
    • ecs:DescribeServices
    • ecs:DescribeClusters
    • ecs:ListTasks
  • Network connectivity to AWS API endpoints
  • Target ECS cluster and services should be accessible

Probe properties

Command

healthchecks -name aws-ecs

Comparator

TypeCriteriaValue
stringcontains[Pass]

The probe passes when the command output contains [Pass], indicating the ECS service has reached the desired state.

Environment variables

VariableDescriptionRequiredDefault
CLUSTER_NAMEName of the ECS cluster containing the service (e.g., my-ecs-cluster).Yes-
SERVICE_NAMESComma-separated list of ECS service names to check (e.g., web-service,api-service).Yes-
REGIONAWS region where the ECS cluster is located (e.g., us-east-1, eu-west-2).Yes-
STATUS_CHECK_TIMEOUTMaximum time in seconds to wait for the service to reach desired state.No180
STATUS_CHECK_DELAYDelay in seconds between status checks.No2

Run properties

PropertyDescriptionTypeDefault
timeoutMaximum time to wait for the probe to complete (e.g., 30s, 1m, 5m)String300s
intervalTime between probe executions (e.g., 5s, 30s, 1m)String10s
attemptNumber of retry attempts before marking the probe as failedInteger1
pollingIntervalTime between retry attempts (e.g., 1s, 5s, 10s)String-
initialDelayInitial delay before starting the probe (e.g., 0s, 10s, 30s)String-
stopOnFailureStop the experiment if the probe failsBooleanfalse
verbosityLog verbosity level (info, debug, trace)String-

Probe definition

You can define this probe in your chaos experiment as follows:

Single service check

probe:
- name: "ecs-service-health-check"
type: "cmdProbe"
mode: "Continuous"
cmdProbe/inputs:
command: "healthchecks -name aws-ecs"
comparator:
type: "string"
criteria: "contains"
value: "[Pass]"
env:
- name: CLUSTER_NAME
value: "production-cluster"
- name: SERVICE_NAMES
value: "web-service"
- name: REGION
value: "us-east-1"
- name: STATUS_CHECK_TIMEOUT
value: "180"
- name: STATUS_CHECK_DELAY
value: "2"
runProperties:
timeout: 300s
interval: 10s
attempt: 1
stopOnFailure: false

Multiple services check

probe:
- name: "ecs-multi-service-check"
type: "cmdProbe"
mode: "Edge"
cmdProbe/inputs:
command: "healthchecks -name aws-ecs"
comparator:
type: "string"
criteria: "contains"
value: "[Pass]"
env:
- name: CLUSTER_NAME
value: "production-cluster"
- name: SERVICE_NAMES
value: "web-service,api-service,worker-service"
- name: REGION
value: "ap-south-1"
- name: STATUS_CHECK_TIMEOUT
value: "300"
- name: STATUS_CHECK_DELAY
value: "5"
runProperties:
timeout: 400s
interval: 15s
attempt: 3
pollingInterval: 5s