ECS task stop

ECS task stop is an AWS chaos fault that stops a configurable percentage of running ECS tasks (selected by explicit task IDs or by SERVICE_NAME) for a configurable duration. Each stopped task moves to STOPPED once its containers exit, after SIGTERM and any stopTimeout grace period. For service-managed tasks, the ECS service controller launches replacements to restore the desired count. Standalone tasks started with run-task are not replaced.

Use this fault to test how an ECS service behaves when individual task replicas disappear: whether traffic reroutes to healthy replicas, whether the deployment reaches steady state again within the recovery SLA, and whether dependent services (caches, databases, sidecars) survive the partial outage.

Run your first experiment

If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.


Use cases

Run this fault when you want to answer concrete questions like:

  • Replica failover: When a percentage of replicas is stopped, do remaining tasks absorb the traffic without dropping requests?
  • Service recovery time: How long does the service controller take to bring the desired count back, and does the deployment ever stall?
  • Load balancer rotation: Does the target group remove stopped tasks quickly enough, and does it register replacement tasks cleanly?
  • Sidecar dependencies: For tasks with sidecars (Envoy, log shippers, etc.), do the sidecars in surviving tasks remain healthy when peer tasks disappear?
  • Connection draining: Does the ECS stopTimeout and SIGTERM handling give the application enough time to drain long-lived connections?

Prerequisites

  • Kubernetes version: 1.21 or later for the chaos infrastructure cluster. Go to What's supported to confirm distribution support.
  • Target ECS cluster and service: CLUSTER_NAME exists in REGION. Either SERVICE_NAME is set (so the fault stops a percentage of the service's tasks), or TASK_REPLICA_IDS is set (so the fault stops those exact task IDs).
  • AWS credentials available: Either an AWS credentials file uploaded as a File Secret in Harness Secret Manager (see Authentication below) or an IAM role for service accounts (IRSA) bound to the chaos infrastructure service account.
  • IAM permissions granted: The credentials or role include the permissions listed below.

Supported environments

| Platform | Support status |
| --- | --- |
| Amazon ECS on EC2 launch type | Supported |
| Amazon ECS on Fargate launch type | Supported |
| Service-managed tasks | Supported (tasks are rescheduled automatically) |
| Standalone tasks (run-task) | Supported (tasks are stopped but not rescheduled) |
| AWS regions | Supported in every commercial region; pass the region in REGION |

Permissions required

The IAM principal that the chaos pod uses (the credentials mounted from the Harness Secret Manager file secret, the IRSA role on the chaos service account, or the role assumed via ASSUME_ROLE_ARN) needs the following AWS actions.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:DescribeClusters",
        "ecs:DescribeServices",
        "ecs:DescribeTasks",
        "ecs:ListTasks",
        "ecs:StopTask"
      ],
      "Resource": "*"
    }
  ]
}

  • ecs:ListTasks and ecs:DescribeTasks are used to enumerate candidate task IDs from SERVICE_NAME or to validate the explicit IDs in TASK_REPLICA_IDS.
  • ecs:StopTask drives the fault.
  • ecs:DescribeServices and ecs:DescribeClusters are used to validate the target and (optionally) to confirm desired count is restored.

Go to common policy for all AWS faults to use a single superset IAM policy across every AWS fault.


Authentication

The fault supports three credential delivery models. Pick one based on how your chaos infrastructure is deployed.

| Method | When to use it | How to configure |
| --- | --- | --- |
| Harness Secret Manager file secret | Chaos infrastructure runs outside EKS, or you want explicit static credentials | Upload the AWS credentials file as a File Secret in Harness Secret Manager and reference its identifier via AWS_AUTHENTICATION_SECRET |
| IAM Roles for Service Accounts (IRSA) | Chaos infrastructure runs in EKS and uses an OIDC-bound service account | No tunable changes; the chaos pod inherits the role automatically. Go to AWS IAM integration to set it up |
| Assume role | The fault needs to act in a different account or with elevated permissions | Set ASSUME_ROLE_ARN to the role ARN; the chaos pod assumes the role on top of its base credentials |

When using the Harness Secret Manager method, the contents of the File Secret should be the AWS credentials file in the standard ~/.aws/credentials format:

[default]
aws_access_key_id = REPLACE_WITH_ACCESS_KEY_ID
aws_secret_access_key = REPLACE_WITH_SECRET_ACCESS_KEY

Upload this file as a File Secret in Harness Secret Manager (Project Setup → Secrets → New File Secret), and pass the secret identifier in AWS_AUTHENTICATION_SECRET when configuring the fault.
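
Before uploading the file, it can be worth confirming which IAM identity it actually resolves to. The sketch below assumes the AWS CLI is installed locally; check_credentials_file is a hypothetical helper name, and it relies on the standard AWS_SHARED_CREDENTIALS_FILE environment variable to point the CLI at a non-default credentials file.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the fault): print the ARN of the identity
# that a given credentials file resolves to.
# Usage: check_credentials_file <path-to-credentials-file> <region>
check_credentials_file() {
  local file="$1" region="$2"

  # Fail early if the file is missing or unreadable.
  if [ ! -r "$file" ]; then
    echo "credentials file not readable: $file" >&2
    return 1
  fi

  # AWS_SHARED_CREDENTIALS_FILE overrides the default ~/.aws/credentials path
  # for this one call only.
  AWS_SHARED_CREDENTIALS_FILE="$file" aws sts get-caller-identity \
    --region "$region" \
    --query 'Arn' \
    --output text
}
```

For example, check_credentials_file ./credentials us-east-1 should print the ARN of the user or role whose keys are in the [default] profile. If it prints an unexpected identity, fix the file before creating the File Secret.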

Go to AWS named profile for chaos to switch between profiles inside a single credentials file.


Fault tunables

Configure the following fault parameters when you add ECS task stop to an experiment in Chaos Studio. Defaults are shown for reference.

Required parameters

| Tunable | Description | Default |
| --- | --- | --- |
| CLUSTER_NAME | Name of the target ECS cluster. | (required) |
| REGION | AWS region that hosts the ECS cluster (for example us-east-1). | (required) |

Targeting parameters

| Tunable | Description | Default |
| --- | --- | --- |
| SERVICE_NAME | Name of the target ECS service. When set (and TASK_REPLICA_IDS is empty), the fault selects TASK_REPLICA_AFFECTED_PERC of the service's running tasks. | "" |
| TASK_REPLICA_IDS | Comma-separated list of explicit ECS task IDs to stop (overrides SERVICE_NAME and percentage selection). | 0 |
| TASK_REPLICA_AFFECTED_PERC | Percentage of the service's running tasks to stop when SERVICE_NAME is set. 0 corresponds to one task. | 100 |
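
To get a feel for how a percentage maps to a task count, the arithmetic can be sketched in shell. This is an illustration only. The rounding rule used here (round down, with a floor of one task so that 0 maps to one task) is an assumption, not confirmed behaviour of the fault, and tasks_to_stop is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: estimate how many tasks a percentage would select.
# ASSUMPTION: integer division (round down) with a minimum of one task.
# Usage: tasks_to_stop <running-task-count> <percentage>
tasks_to_stop() {
  local running="$1" perc="$2"

  # Nothing to select if the service has no running tasks.
  if [ "$running" -le 0 ]; then
    echo 0
    return
  fi

  local count=$(( running * perc / 100 ))

  # A percentage of 0 (or one that rounds down to zero) still maps to one task.
  if [ "$count" -lt 1 ]; then
    count=1
  fi
  echo "$count"
}

# tasks_to_stop 10 50   prints 5
# tasks_to_stop 4 0     prints 1
# tasks_to_stop 3 100   prints 3
```

With the default of 100, every running task of the service is selected, so start with a smaller percentage when the service has no spare capacity.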

Chaos parameters

| Tunable | Description | Default |
| --- | --- | --- |
| TOTAL_CHAOS_DURATION | Duration of the fault in seconds. | 60 |
| CHAOS_INTERVAL | Delay in seconds between successive stop iterations when running for more than one cycle. | 60 |
| SEQUENCE | Order in which multiple tasks are stopped: parallel stops all selected tasks at once; serial stops them one at a time. | parallel |
| RAMP_TIME | Wait period in seconds before and after the fault. Go to ramp time to read how it is applied. | 0 |

Authentication

| Tunable | Description | Default |
| --- | --- | --- |
| ASSUME_ROLE_ARN | ARN of an IAM role to assume on top of the base credentials. Leave empty to use the base credentials directly. | "" |
| AWS_AUTHENTICATION_SECRET | Identifier of the File Secret in Harness Secret Manager that contains the AWS credentials file. Not required when using IRSA. | "" |

Tunables that apply to every fault are documented in common tunables for all faults. AWS-specific shared tunables are documented in common AWS fault tunables.


Fault execution in brief

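The fault resolves the target task list from TASK_REPLICA_IDS, or from SERVICE_NAME plus TASK_REPLICA_AFFECTED_PERC, and calls StopTask on every selected task in CLUSTER_NAME / REGION. It then waits for TOTAL_CHAOS_DURATION seconds before reporting the result. The fault does not hold back recovery: for service-managed tasks, the ECS service controller starts restoring the desired count as soon as it sees the running count drop.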
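
The same flow can be approximated by hand with the AWS CLI, which is useful for a dry run before wiring up an experiment. This is a rough manual sketch, not the fault's implementation. stop_service_tasks and the DRY_RUN switch are hypothetical names introduced here, and the sketch simply takes the first tasks returned by list-tasks rather than reproducing the fault's own selection logic.

```shell
#!/usr/bin/env bash
# Manual approximation of the fault's flow; NOT the fault's implementation.
# Usage: stop_service_tasks <cluster> <service> <region> <count>
# Set DRY_RUN=1 to print the tasks that would be stopped without stopping them.
stop_service_tasks() {
  local cluster="$1" service="$2" region="$3" count="$4"
  local arns arn stopped=0

  # Step 1: resolve the running tasks that belong to the service.
  arns=$(aws ecs list-tasks \
    --cluster "$cluster" \
    --service-name "$service" \
    --desired-status RUNNING \
    --region "$region" \
    --query 'taskArns[]' \
    --output text)

  # Step 2: stop up to <count> of them, one after another.
  for arn in $arns; do
    [ "$stopped" -ge "$count" ] && break
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "would stop $arn"
    else
      aws ecs stop-task \
        --cluster "$cluster" \
        --task "$arn" \
        --reason "manual chaos test" \
        --region "$region" \
        --query 'task.lastStatus' \
        --output text
    fi
    stopped=$(( stopped + 1 ))
  done
}
```

Running DRY_RUN=1 stop_service_tasks my-cluster my-service us-east-1 2 (placeholder names) lists two candidate task ARNs without touching them. Dropping DRY_RUN stops real tasks, so try it against a non-production service first.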


Expected behavior during fault execution

  • Selected tasks leave RUNNING and pass through DEACTIVATING, STOPPING, and DEPROVISIONING before reaching STOPPED. The SIGTERM and stopTimeout from the task definition are honoured, so well-behaved applications get the chance to drain, and tasks with a long stopTimeout take correspondingly longer to stop.
  • Targets in the load balancer target group for those tasks are deregistered (or fail health checks) and are removed from rotation.
  • For service-managed tasks, the ECS service controller observes that running count is below desired count and launches replacements according to the deployment configuration.
  • For standalone tasks (started with run-task), no replacement is launched; the running task count for the workload remains reduced after the fault.
  • Customer requests in flight to a stopped task either drain via SIGTERM or are reset, depending on the application.

When the fault ends

This fault stops tasks but does not start replacements itself; the ECS service controller is responsible for restoring desired count. The fault waits TOTAL_CHAOS_DURATION seconds before reporting success so that the resilience signals can be measured during the recovery window.

Signals to watch

Attach resilience probes to assert each layer:

  • Application availability: Use an HTTP probe against the service endpoint to detect downtime and measure recovery time.
  • Running task count: Use a Prometheus probe on aws_ecs_running_task_count to confirm desired count is restored.
  • Target group health: Use a Prometheus probe on aws_applicationelb_healthy_host_count to confirm replacement tasks register cleanly.
  • Task lifecycle: Use a command probe that runs aws ecs describe-tasks and asserts the expected task ARN set at each phase.
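
As one possible building block for a command probe, the check below compares a service's running count with its desired count. It is a sketch under assumptions: service_at_desired_count is a hypothetical helper name, and how the exit code or output is wired into a probe depends on your probe configuration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only when runningCount equals desiredCount.
# Usage: service_at_desired_count <cluster> <service> <region>
service_at_desired_count() {
  local cluster="$1" service="$2" region="$3"
  local counts running desired

  # Text output puts both numbers on one line, separated by whitespace.
  counts=$(aws ecs describe-services \
    --cluster "$cluster" \
    --services "$service" \
    --region "$region" \
    --query 'services[0].[runningCount,desiredCount]' \
    --output text) || return 2

  # Split the line into positional parameters: $1 = running, $2 = desired.
  set -- $counts
  running="${1:-}"
  desired="${2:-}"

  echo "running=$running desired=$desired"
  # A missing service yields no desired count, which is treated as a failure.
  [ -n "$desired" ] && [ "$running" = "$desired" ]
}
```

The function prints the two counts and returns 0 only when they match, so it can be polled during the recovery window to see when the service is back at full strength.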

Verify the fault execution effect

While the experiment is running, confirm tasks are stopped and the service recovers:

  1. List running tasks for the service.

    aws ecs list-tasks \
    --cluster <cluster> \
    --service-name <service> \
    --desired-status RUNNING \
    --region <region>

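    The running task count should drop by roughly TASK_REPLICA_AFFECTED_PERC percent of the service's tasks during the fault and return to the desired count afterwards. Replacements can start quickly, so the dip may be brief.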

  2. Inspect individual task state.

    aws ecs describe-tasks \
    --cluster <cluster> \
    --tasks <task-arn> \
    --region <region> \
    --query "tasks[].[taskArn,lastStatus,stoppedReason]"

    Stopped tasks should report STOPPED with a stop reason; replacement tasks should appear with new ARNs.

  3. Check service deployment status.

    aws ecs describe-services \
    --cluster <cluster> \
    --services <service> \
    --region <region> \
    --query "services[].deployments"

    The deployment should reach runningCount == desiredCount once recovery is complete.
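
To put a number on recovery, the built-in aws ecs wait services-stable waiter can be timed. The sketch below is a hedged example, and measure_recovery_seconds is a hypothetical helper name. The waiter polls periodically rather than continuously, so treat the result as a coarse upper bound rather than an exact recovery time.

```shell
#!/usr/bin/env bash
# Hypothetical helper: time how long the service takes to become stable again.
# Usage: measure_recovery_seconds <cluster> <service> <region>
measure_recovery_seconds() {
  local cluster="$1" service="$2" region="$3"
  local start end

  start=$(date +%s)

  # Blocks until the service reports a stable state; exits non-zero if the
  # waiter gives up before that happens.
  aws ecs wait services-stable \
    --cluster "$cluster" \
    --services "$service" \
    --region "$region" || return 1

  end=$(date +%s)
  echo $(( end - start ))
}
```

Start it right after the tasks are stopped and compare the printed number of seconds with your recovery SLA. A non-zero exit means the service did not stabilise within the waiter's limit.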


Recovery and cleanup

  • End of duration: The fault stops driving and the ECS service controller restores desired count on its own schedule.
  • Abort the experiment: Stopping the experiment from Chaos Studio cancels any pending iterations; tasks that have already been stopped are not restored by the fault (the service controller restores them as usual).
  • Manual recovery: For standalone tasks (no service), restart them with aws ecs run-task and the appropriate task definition.
  • Workload recovery: Long-lived TCP connections that were dropped during the outage need to be re-established by the caller. Sticky sessions tied to a specific task IP are broken.

Limitations

  • Standalone tasks are not replaced: Only service-managed tasks are replaced; tasks started with run-task stay stopped after the fault.
  • No fine-grained selection beyond service: Selection is by service or explicit ID; you cannot target tasks by tag or by container instance in this fault. Use placement-constraint-aware patterns or TASK_REPLICA_IDS for finer control.
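  • Deployment in progress: The fault selects from whatever tasks are running at the moment it executes. During a rolling deployment that set mixes old and new task revisions, so results are harder to interpret; run the fault against a service in steady state where possible.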
  • Cross-region targeting: A single experiment targets one region (the value of REGION).
  • stopTimeout is honoured: Stopping tasks is not instant; high stopTimeout values in the task definition can cause the fault to take longer to fully drain replicas than TOTAL_CHAOS_DURATION suggests.

Troubleshooting

ECS task stop experiment fails with AccessDeniedException in Harness Chaos Engineering

The credentials supplied to the chaos pod do not have the required ECS permissions in the target region. Confirm the IAM policy attached to the user, role, or IRSA service account includes ecs:ListTasks, ecs:DescribeTasks, ecs:StopTask, ecs:DescribeServices, and ecs:DescribeClusters. When using ASSUME_ROLE_ARN, also confirm the source identity is trusted to assume the role.
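
One way to narrow down which action is missing is the IAM policy simulator. The sketch below assumes the caller is allowed to run iam:SimulatePrincipalPolicy, and find_denied_ecs_actions is a hypothetical helper name. The simulator evaluates the policies attached to the principal, so it may not reflect every control that applies at run time.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list the ECS actions from the required policy that the
# simulator does not report as allowed for a principal.
# Usage: find_denied_ecs_actions <user-or-role-arn>
find_denied_ecs_actions() {
  local principal_arn="$1"

  # Each output line is "<action><TAB><decision>"; keep the non-allowed ones.
  aws iam simulate-principal-policy \
    --policy-source-arn "$principal_arn" \
    --action-names ecs:ListTasks ecs:DescribeTasks ecs:StopTask \
                   ecs:DescribeServices ecs:DescribeClusters \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
    --output text | awk '$2 != "allowed" { print $1 }'
}
```

Empty output means the simulator considers all five actions allowed. Any action that is printed is a candidate for the missing permission.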

ECS task stop reports no tasks were selected

The fault could not find any candidate tasks. The most common causes are: SERVICE_NAME does not exist in CLUSTER_NAME; the service currently has zero running tasks; TASK_REPLICA_IDS were provided but the IDs are stale (already STOPPED); or TASK_REPLICA_AFFECTED_PERC=0 with no specific IDs. Verify with 'aws ecs list-tasks --cluster <cluster> --service-name <service> --region <region>' and re-run.

ECS service did not reschedule tasks after ECS task stop

The most common causes are: the affected tasks were standalone (not service-managed) and therefore are not rescheduled automatically; the service has maximumPercent set such that the replacement count is constrained; the cluster does not have enough free capacity (CPU/memory reservation) to schedule replacements; or task placement constraints prevent rescheduling. Inspect 'aws ecs describe-services' for events and adjust capacity or constraints accordingly.
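
The service event log usually states why a replacement could not be placed. A small sketch for pulling the first few entries of that log follows. recent_service_events is a hypothetical helper name, and the default of five entries is an arbitrary choice.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first few entries of a service's event list.
# Usage: recent_service_events <cluster> <service> <region> [limit]
recent_service_events() {
  local cluster="$1" service="$2" region="$3" limit="${4:-5}"

  # Each output line is "<createdAt><TAB><message>".
  aws ecs describe-services \
    --cluster "$cluster" \
    --services "$service" \
    --region "$region" \
    --query "services[0].events[:${limit}].[createdAt,message]" \
    --output text
}
```

Look for messages about insufficient CPU or memory, or about placement constraints, and compare them with the causes listed above.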

Customer requests are dropped instead of drained when tasks are stopped

The application is not handling SIGTERM gracefully or the task definition's stopTimeout is too short. Increase stopTimeout in the task definition to at least the longest request duration plus a buffer, and ensure the application begins draining (closing keep-alive connections, refusing new requests) on SIGTERM. Also confirm the load balancer's deregistration delay is long enough for in-flight requests.
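
To compare the two settings side by side, both can be read with the AWS CLI. This is a sketch with a hypothetical helper name (show_drain_settings). It assumes the service sits behind a target group whose ARN you already know.

```shell
#!/usr/bin/env bash
# Hypothetical helper: show stopTimeout per container and the target group's
# deregistration delay, the two settings that bound graceful draining.
# Usage: show_drain_settings <task-definition> <target-group-arn> <region>
show_drain_settings() {
  local task_def="$1" target_group_arn="$2" region="$3"

  echo "stopTimeout per container (None means no explicit value is set):"
  aws ecs describe-task-definition \
    --task-definition "$task_def" \
    --region "$region" \
    --query 'taskDefinition.containerDefinitions[].[name,stopTimeout]' \
    --output text

  echo "target group deregistration delay (seconds):"
  aws elbv2 describe-target-group-attributes \
    --target-group-arn "$target_group_arn" \
    --region "$region" \
    --query "Attributes[?Key=='deregistration_delay.timeout_seconds'].Value" \
    --output text
}
```

If either value is shorter than your longest expected request, in-flight requests can be cut off even when the application handles SIGTERM correctly.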