Skip to main content

ECS agent stop

ECS agent stop disrupts the state of infrastructure resources.

  • The fault induces an agent stop chaos on AWS ECS using Amazon SSM Run command, this is carried out by using SSM Docs which is in-built in the fault for the give chaos scenario.
  • It causes agent container stop on ECS with a given CLUSTER_NAME envrionment variable using an SSM docs for a specific duration.

ECS Agent Stop

Usage

View fault usage
ECS agent stop chaos stops the agent that manages the task container on the ECS cluster, thereby impacting its delivery. Killing the agent container disrupts the performance of the task containers.

Prerequisites

  • Kubernetes >= 1.17

  • Ensure that the ECS container metadata is enabled this feature is disabled by default. To enable it please follow the aws docs to Enabling container metadata. If you have your task running prior this activity you may need to restart it to get the metadata directory as mentioned in the docs.

  • Ensure that both you and ECS cluster instances have a Role with required AWS access to do SSM and ECS operations. Refer the below mentioned sample policy for the fault (please note that the sample policy can be minimised further). To know more checkout Systems Manager Docs. Also, please refer the below mentioned policy required for the experiment.

  • Ensure to create a Kubernetes secret having the AWS access configuration(key) in the CHAOS_NAMESPACE. A sample secret file looks like:

apiVersion: v1
kind: Secret
metadata:
name: cloud-secret
type: Opaque
stringData:
cloud_config.yml: |-
# Add the cloud AWS credentials respectively
[default]
aws_access_key_id = XXXXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXX
  • It is recommended to use the same secret name, i.e. cloud-secret. Otherwise, you will need to update the AWS_SHARED_CREDENTIALS_FILE environment variable in the fault template and you may be unable to use the default health check probes.

  • Refer to AWS Named Profile For Chaos to know how to use a different profile for AWS faults.

Permissions required

Here is an example AWS policy to execute the fault.

View policy for the fault
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"ecs:UpdateContainerInstancesState",
"ecs:RegisterContainerInstance",
"ecs:ListContainerInstances",
"ecs:DeregisterContainerInstance",
"ecs:DescribeContainerInstances",
"ecs:ListTasks",
"ecs:DescribeClusters"

],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ssm:GetDocument",
"ssm:DescribeDocument",
"ssm:GetParameter",
"ssm:GetParameters",
"ssm:SendCommand",
"ssm:CancelCommand",
"ssm:CreateDocument",
"ssm:DeleteDocument",
"ssm:GetCommandInvocation",
"ssm:UpdateInstanceInformation",
"ssm:DescribeInstanceInformation"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ec2messages:AcknowledgeMessage",
"ec2messages:DeleteMessage",
"ec2messages:FailMessage",
"ec2messages:GetEndpoint",
"ec2messages:GetMessages",
"ec2messages:SendReply"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ec2:DescribeInstances"
],
"Resource": [
"*"
]
}
]
}

Refer to the superset permission/policy to execute all AWS faults.

Default validations

The ECS container instance should be in healthy state.

Fault tunables

Fault tunables

Mandatory fields

Variables Description Notes
CLUSTER_NAME Name of the target ECS cluster Single name supported For example, demo-cluster
REGION The AWS region name of the target ECS cluster For example, us-east-2

Optional fields

Variables Description Notes
TOTAL_CHAOS_DURATION The total time duration for chaos insertion (sec) Defaults to 30s
CHAOS_INTERVAL The interval (in sec) between successive instance termination. Defaults to 30s
AWS_SHARED_CREDENTIALS_FILE Provide the path for aws secret credentials Defaults to /tmp/cloud_config.yml
SEQUENCE It defines sequence of chaos execution for multiple instance Default value: parallel. Supported: serial, parallel
RAMP_TIME Period to wait before and after injection of chaos in sec For example, 30

Fault examples

Common and AWS-specific tunables

Refer the common attributes and AWS-specific tunables to tune the common tunables for all faults and aws specific tunables.

Agent Stop

It derives the target agent from CLUSTER_NAME and stop them for a certain chaos duration.

Use the following example to tune this:

# stops the agent of an ECS cluster
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: engine-nginx
spec:
engineState: "active"
annotationCheck: "false"
chaosServiceAccount: litmus-admin
experiments:
- name: ecs-agent-stop
spec:
components:
env:
# provide the name of ECS cluster
- name: CLUSTER_NAME
value: 'demo'
- name: REGION
value: 'us-east-2'
- name: TOTAL_CHAOS_DURATION
VALUE: '60'