ECS agent stop

ECS agent stop disrupts the state of infrastructure resources. This fault:

Stops the ECS agent on AWS ECS by using the Amazon SSM Run Command, which runs an SSM document that is built into the fault for this chaos scenario.
Stops the agent container on the ECS cluster named by the CLUSTER_NAME environment variable for a specific duration. Stopping the agent container disrupts the performance of the task containers.


Use cases

ECS agent stop halts the agent that manages the task containers on the ECS cluster, which impacts task delivery.

Prerequisites

Kubernetes >= 1.17
The ECS container instance should be in a healthy state.
ECS container metadata should be enabled (this feature is disabled by default). To enable it, follow the AWS documentation on enabling container metadata. If your task was running before you enabled the metadata, you may need to restart the task to get the metadata directory, as mentioned in the AWS documentation.
You and the ECS cluster instances should have a role with the required AWS access to perform SSM and ECS operations. For more information, go to the Systems Manager documentation.

The Kubernetes secret should contain the AWS access configuration (key) and should exist in the CHAOS_NAMESPACE. A sample secret file looks like:

apiVersion: v1
kind: Secret
metadata:
  name: cloud-secret
type: Opaque
stringData:
  cloud_config.yml: |-
    # Add the cloud AWS credentials respectively
    [default]
    aws_access_key_id = XXXXXXXXXXXXXXXXXXX
    aws_secret_access_key = XXXXXXXXXXXXXXX

tip

HCE recommends that you use the same secret name, that is, cloud-secret. Otherwise, you will need to update the AWS_SHARED_CREDENTIALS_FILE environment variable in the fault template with the new secret name and you won't be able to use the default health check probes.

Below is an example AWS policy to execute the fault.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ecs:UpdateContainerInstancesState",
                "ecs:RegisterContainerInstance",
                "ecs:ListContainerInstances",
                "ecs:DeregisterContainerInstance",
                "ecs:DescribeContainerInstances",
                "ecs:ListTasks",
                "ecs:DescribeClusters"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ssm:GetDocument",
                "ssm:DescribeDocument",
                "ssm:GetParameter",
                "ssm:GetParameters",
                "ssm:SendCommand",
                "ssm:CancelCommand",
                "ssm:CreateDocument",
                "ssm:DeleteDocument",
                "ssm:GetCommandInvocation",
                "ssm:UpdateInstanceInformation",
                "ssm:DescribeInstanceInformation"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2messages:AcknowledgeMessage",
                "ec2messages:DeleteMessage",
                "ec2messages:FailMessage",
                "ec2messages:GetEndpoint",
                "ec2messages:GetMessages",
                "ec2messages:SendReply"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeInstances"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

note

This experiment induces chaos within a container and depends on an EC2 instance. Typically, such experiments are prefixed with "ECS container" and involve direct interaction with the EC2 instances that host the ECS containers.
To use a different profile for AWS faults, go to AWS named profile for chaos. To execute all AWS faults, go to the superset permission/policy.
To tune the tunables that are common to all faults and those that are specific to AWS, go to the common tunables and AWS-specific tunables.

Mandatory tunables

Tunable	Description	Notes
CLUSTER_NAME	Name of the target ECS cluster	Only a single name is supported. For example, `demo-cluster`. For more information, go to cluster name.
REGION	The AWS region name of the target ECS cluster	For example, `us-east-2`

Optional tunables

Tunable	Description	Notes
TOTAL_CHAOS_DURATION	The total time duration for chaos insertion (sec)	Default: 30 s. For more information, go to duration of the chaos.
CHAOS_INTERVAL	The interval (in sec) between successive chaos iterations	Default: 30 s. For more information, go to chaos interval.
AWS_SHARED_CREDENTIALS_FILE	Path to the AWS secret credentials file	Default: `/tmp/cloud_config.yml`
SEQUENCE	Sequence of chaos execution for multiple instances	Default: parallel. Supports serial and parallel. For more information, go to sequence of chaos execution.
RAMP_TIME	Period to wait before and after injecting chaos (in sec)	For example, 30 s. For more information, go to ramp time.
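
As a rough illustration of how the optional tunables fit together with the mandatory ones, the sketch below sets several of them in the env list of a ChaosEngine. It assumes the optional tunables are passed the same way as CLUSTER_NAME and REGION. The engine name, cluster name, and all values are placeholders chosen for the example, not recommended settings.

```yaml
# Sketch only: names and values are illustrative assumptions
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: ecs-agent-stop
    spec:
      components:
        env:
        # mandatory tunables
        - name: CLUSTER_NAME
          value: 'demo-cluster'
        - name: REGION
          value: 'us-east-2'
        # optional tunables (durations are in seconds)
        - name: TOTAL_CHAOS_DURATION
          value: '60'
        - name: CHAOS_INTERVAL
          value: '30'
        # run against multiple instances one after another
        - name: SEQUENCE
          value: 'serial'
        - name: RAMP_TIME
          value: '30'
```

Any optional tunable that is left out falls back to the default listed in the table above.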

Agent stop

The agent of the target ECS cluster is stopped for a specific duration. Specify the target cluster by using the CLUSTER_NAME environment variable.

The following YAML snippet illustrates the use of this environment variable:

# stops the agent of an ECS cluster
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: ecs-agent-stop
    spec:
      components:
        env:
        # provide the name of ECS cluster
        - name: CLUSTER_NAME
          value: 'demo'
        - name: REGION
          value: 'us-east-2'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
