Skip to main content

ECS container IO stress

ECS container IO stress disrupts the state of infrastructure resources. It induces stress on the AWS ECS container using Amazon SSM Run command, which is carried out using SSM docs which is in-built into the fault.

  • It causes I/O stress on the containers of the ECS task using the given CLUSTER_NAME environment variable for a specific duration.
  • To select the Task Under Chaos (TUC), use the servie name associated with the task. If you provide the service name along with the cluster name, all the tasks associated with the given service will be selected as chaos targets.
  • It tests the ECS task sanity (service availability) and recovery of the task containers subject to I/O stress.
  • This experiment induces chaos within a container and depends on an EC2 instance. Typically, these are prefixed with "ECS container" and involve direct interaction with the EC2 instances hosting the ECS containers.

ECS Container IO Stress

Use cases

ECS container IO stress determines how a container recovers from a memory exhaustion. File system read and write evicts the application (task container) and impacts its delivery. These issues are also known as noisy-neighbour problems. Injecting a rogue process into a target container starves the main microservice process (typically pid 1) of the resources allocated to it (where the limits are defined). This slows down the application traffic or exhausts the resources leading to eviction of all task containers.

Prerequisites

  • Kubernetes >= 1.17
  • ECS container metadata is enabled (disabled by default). To enable it, refer to this docs. If your task is running from before, you may need to restart it to get the metadata directory.
  • You and the ECS cluster instances have a role with the required AWS access to perform the SSM and ECS operations. Refer to systems manager docs.
  • Create a Kubernetes secret that has the AWS access configuration(key) in the CHAOS_NAMESPACE. Below is a sample secret file:
apiVersion: v1
kind: Secret
metadata:
name: cloud-secret
type: Opaque
stringData:
cloud_config.yml: |-
# Add the cloud AWS credentials respectively
[default]
aws_access_key_id = XXXXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXX
tip

HCE recommends that you use the same secret name, that is, cloud-secret. Otherwise, you will need to update the AWS_SHARED_CREDENTIALS_FILE environment variable in the fault template with the new secret name and you won't be able to use the default health check probes.

Below is an example AWS policy to execute the fault.

{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"ecs:UpdateContainerInstancesState",
"ecs:RegisterContainerInstance",
"ecs:ListContainerInstances",
"ecs:DeregisterContainerInstance",
"ecs:DescribeContainerInstances",
"ecs:ListTasks",
"ecs:DescribeClusters"

],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ssm:GetDocument",
"ssm:DescribeDocument",
"ssm:GetParameter",
"ssm:GetParameters",
"ssm:SendCommand",
"ssm:CancelCommand",
"ssm:CreateDocument",
"ssm:DeleteDocument",
"ssm:GetCommandInvocation",
"ssm:UpdateInstanceInformation",
"ssm:DescribeInstanceInformation"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ec2messages:AcknowledgeMessage",
"ec2messages:DeleteMessage",
"ec2messages:FailMessage",
"ec2messages:GetEndpoint",
"ec2messages:GetMessages",
"ec2messages:SendReply"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ec2:DescribeInstances"
],
"Resource": [
"*"
]
}
]
}
note
  • The ECS container instance should be in a healthy state.
  • It is recommended to use the same secret name, i.e. cloud-secret. Otherwise, you will need to update the AWS_SHARED_CREDENTIALS_FILE environment variable in the fault template and you may be unable to use the default health check probes.
  • Refer to the superset permission/policy to execute all AWS faults.
  • Refer to AWS named profile for chaos to know how to use a different profile for AWS faults.
  • Refer to the common attributes and AWS-specific tunables to tune the common tunables for all faults and aws specific tunables.

Mandatory tunables

Variables Description Notes
CLUSTER_NAME Name of the target ECS cluster. For example, cluster-1.
REGION Region name of the target ECS cluster For example, us-east-1.

Optional tunables

Variables Description Notes
TOTAL_CHAOS_DURATION Duration that you specify, through which chaos is injected into the target resource (in seconds). Default: 30 s. For more information, go to duration of the chaos.
CHAOS_INTERVAL Interval between successive instance terminations (in seconds). Default: 30 s. For more information, go to chaos interval.
AWS_SHARED_CREDENTIALS_FILE Path to the AWS secret credentials. Defaults to /tmp/cloud_config.yml.
SERVICE_NAME Target ECS service name. For example, app-svc. For more information, go to ECS service name.
FILESYSTEM_UTILIZATION_BYTES Memory consumed during I/O stress (in GB). Default: 1. For more information, go to file system utilization in bytes.
FILESYSTEM_UTILIZATION_PERCENTAGE Memory consumed during I/O stress (in percentage). Default: 10. For more information, go to file system utilization in percentage.
VOLUME_MOUNT_PATH Location that points to the volume mount path used in I/O stress. Default: /tmp. For more information, go to volume mount path.
NUMBER_OF_WORKERS Number of workers for memory stress. Default: 1. For more information, go to workers.
SEQUENCE Sequence of chaos execution for multiple instances Default: parallel. Supports serial and parallel. For more information, go to sequence of chaos execution.
RAMP_TIME Period to wait before and after injecting chaos (in seconds). For example, 30 s. For more information, go to ramp time.

File system utilization percentage

Amount of free space available in the ECS container (in percentage). Tune it by using the FILESYSTEM_UTILIZATION_PERCENTAGE environment variable.

The following YAML snippet illustrates the use of this environment variable:

# stress the i/o of the targeted pod with FILESYSTEM_UTILIZATION_PERCENTAGE of total free space 
# it is mutually exclusive with the FILESYSTEM_UTILIZATION_BYTES.
# if both are provided then it will use FILESYSTEM_UTILIZATION_PERCENTAGE for stress
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: engine-nginx
spec:
engineState: "active"
annotationCheck: "false"
chaosServiceAccount: litmus-admin
experiments:
- name: ecs-container-io-stress
spec:
components:
env:
# percentage of free space of file system, need to be stressed
- name: FILESYSTEM_UTILIZATION_PERCENTAGE
value: '10' #in percentage
- name: TOTAL_CHAOS_DURATION
VALUE: '60'

File system utilization bytes

Amount of free space available in the ECS container (in gigabytes). Tune it by using the FILESYSTEM_UTILIZATION_BYTES environment variable.

The FILESYSTEM_UTILIZATION_BYTES and FILESYSTEM_UTILIZATION_PERCENTAGE environment variables are mutually exclusive. If values for both variables are provided, FILESYSTEM_UTILIZATION_PERCENTAGE takes precedence.

The following YAML snippet illustrates the use of this environment variable:

# stress the i/o of the targeted pod with FILESYSTEM_UTILIZATION_PERCENTAGE of total free space 
# it is mutually exclusive with the FILESYSTEM_UTILIZATION_BYTES.
# if both are provided then it will use FILESYSTEM_UTILIZATION_PERCENTAGE for stress
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: engine-nginx
spec:
engineState: "active"
annotationCheck: "false"
chaosServiceAccount: litmus-admin
experiments:
- name: ecs-container-io-stress
spec:
components:
env:
# size of io to be stressed
- name: FILESYSTEM_UTILIZATION_BYTES
value: '1' #in GB
- name: TOTAL_CHAOS_DURATION
VALUE: '60'

Mount path

Location that points to the volume mount path used in I/O stress. Tune it by using the VOLUME_MOUNT_PATH environment variable.

The following YAML snippet illustrates the use of this environment variable:

# provide the volume mount path, which needs to be filled
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: engine-nginx
spec:
engineState: "active"
annotationCheck: "false"
chaosServiceAccount: litmus-admin
experiments:
- name: ecs-container-io-stress
spec:
components:
env:
# path need to be stressed/filled
- name: VOLUME_MOUNT_PATH
value: '/some-dir-in-container'
- name: TOTAL_CHAOS_DURATION
VALUE: '60'

Workers for stress

The number of workers on which you apply stress. Tune it by using the NUMBER_OF_WORKERS environment variable.

The following YAML snippet illustrates the use of this environment variable:

# number of workers for the stress
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: engine-nginx
spec:
engineState: "active"
annotationCheck: "false"
chaosServiceAccount: litmus-admin
experiments:
- name: ecs-container-io-stress
spec:
components:
env:
# number of io workers
- name: NUMBER_OF_WORKERS
value: '4'
- name: TOTAL_CHAOS_DURATION
VALUE: '60'

ECS service name

Service name whose tasks are stopped. Tune it by using the SERVICE_NAME environment variable.

The following YAML snippet illustrates the use of this environment variable:

# stop the tasks of an ECS cluster
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: engine-nginx
spec:
engineState: "active"
annotationCheck: "false"
chaosServiceAccount: litmus-admin
experiments:
- name: ecs-task-stop
spec:
components:
env:
# provide the name of ECS cluster
- name: CLUSTER_NAME
value: 'demo'
- name: SERVICE_NAME
vale: 'test-svc'
- name: REGION
value: 'us-east-1'
- name: TOTAL_CHAOS_DURATION
VALUE: '60'