ECS Container CPU Hog
Introduction
ECS Container CPU Hog contains chaos to disrupt the state of infra resources. The fault induces stress chaos on the AWS ECS container using the SSM Run Command; this is carried out via SSM docs that are in-built into the fault for the given chaos experiment.
It injects CPU resource stress/exhaustion on the target task container for the given duration. The number of CPU cores to be stressed can be provided as input.
To select the Task Under Chaos (TUC), you can make use of the service name associated with the task; that is, if you provide the service name along with the cluster name, all the tasks associated with the given service will be selected as chaos targets.
It tests the ECS task sanity (service availability) and the recovery of the task containers subjected to CPU stress.
Uses
By injecting a rogue process into a target task container, the fault starves the main microservice process (typically PID 1) of the resources allocated to it (where limits are defined), causing slowness in application traffic. In other cases, unrestrained resource use can cause the instance to exhaust its resources, leading to the eviction of all task containers. This category of chaos fault helps build immunity in applications undergoing such stress scenarios.
Prerequisites
- Ensure that Kubernetes Version >= 1.17
Enable Container Metadata
Ensure that the ECS container metadata is enabled; this feature is disabled by default. Refer to the AWS docs: Enabling container metadata. This allows HCE to obtain container details, such as the containerID of the containers running the ECS tasks. To enable it:
- In the EC2 dashboard sidebar, click Launch Configurations under Auto Scaling.
- Create a copy of the auto-scaling configuration used in the target ECS cluster. This creates a new (copied) launch template.
- In the new (copied) launch template, update the IAM role of the instances with ECS-SSM permissions (as shown in the permission requirement section below).
- Update the user data to set ECS_ENABLE_CONTAINER_METADATA to true, as shown in the snippet after these steps.
- Save the launch configuration by clicking 'Create Launch Template'.
- Go back to the Auto Scaling group and switch to the launch template (launch configurations are being deprecated by AWS).
- Update the cluster's Auto Scaling group with the newer launch template.
- Restart the instances of the ECS cluster so they pull the updated configuration.
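For reference, here is a minimal user data sketch that enables container metadata, assuming an Amazon ECS-optimized AMI where the ECS agent reads its configuration from /etc/ecs/ecs.config:

```bash
#!/bin/bash
# Append the agent setting that enables ECS container metadata
echo "ECS_ENABLE_CONTAINER_METADATA=true" >> /etc/ecs/ecs.config
```

To roll the instances onto the updated launch template, one option is to start an instance refresh on the Auto Scaling group; the group name below is a placeholder:

```bash
aws autoscaling start-instance-refresh --auto-scaling-group-name <ecs-cluster-asg>
```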
Ensure that both the user and the ECS cluster instances have a role with the required AWS access to perform SSM and ECS operations. Refer to the sample policy for this fault in the permission requirement section below.
Ensure to create a Kubernetes secret having the AWS access configuration (key) in the CHAOS_NAMESPACE. A sample secret file looks like:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: cloud-secret
type: Opaque
stringData:
  cloud_config.yml: |-
    # Add the cloud AWS credentials respectively
    [default]
    aws_access_key_id = XXXXXXXXXXXXXXXXXXX
    aws_secret_access_key = XXXXXXXXXXXXXXX
```
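Assuming the manifest above is saved as cloud-secret.yml (an illustrative file name), it can be applied to the chaos namespace as usual:

```bash
kubectl apply -f cloud-secret.yml -n <CHAOS_NAMESPACE>
```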
- If you change the secret key name (from cloud_config.yml), please also update the AWS_SHARED_CREDENTIALS_FILE ENV value in the ChaosExperiment CR with the same name.
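For illustration, a sketch of the corresponding ENV entry in the ChaosExperiment CR, assuming the secret key was renamed to my_config.yml (a hypothetical name):

```yaml
# env entry in the ChaosExperiment CR (illustrative)
- name: AWS_SHARED_CREDENTIALS_FILE
  value: '/tmp/my_config.yml' # must match the renamed secret key
```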
Permission Requirement
- Here is an example AWS policy to execute this fault.
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "ecs:UpdateContainerInstancesState",
        "ecs:RegisterContainerInstance",
        "ecs:ListContainerInstances",
        "ecs:DeregisterContainerInstance",
        "ecs:DescribeContainerInstances",
        "ecs:ListTasks",
        "ecs:DescribeClusters"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetDocument",
        "ssm:DescribeDocument",
        "ssm:GetParameter",
        "ssm:GetParameters",
        "ssm:SendCommand",
        "ssm:CancelCommand",
        "ssm:CreateDocument",
        "ssm:DeleteDocument",
        "ssm:GetCommandInvocation",
        "ssm:UpdateInstanceInformation",
        "ssm:DescribeInstanceInformation"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2messages:AcknowledgeMessage",
        "ec2messages:DeleteMessage",
        "ec2messages:FailMessage",
        "ec2messages:GetEndpoint",
        "ec2messages:GetMessages",
        "ec2messages:SendReply"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```
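As one way to attach this policy, you could add it as an inline policy on the instance role via the AWS CLI; the role name below is a placeholder, and the policy is assumed to be saved as policy.json:

```bash
aws iam put-role-policy \
  --role-name <ecs-instance-role> \
  --policy-name ecs-chaos-policy \
  --policy-document file://policy.json
```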
- Refer to the superset permission/policy to execute all AWS faults.
Default Validations
- The ECS container instance should be in a healthy state.
Fault Tunables
Mandatory Fields
Variables | Description | Notes |
---|---|---|
CLUSTER_NAME | Name of the target ECS cluster | Eg. cluster-1 |
REGION | The region name of the target ECS cluster | Eg. us-east-1 |
Optional Fields
Variables | Description | Notes |
---|---|---|
TOTAL_CHAOS_DURATION | The total time duration for chaos insertion (sec) | Defaults to 30s |
CHAOS_INTERVAL | The interval (in sec) between successive chaos injections | Defaults to 30s |
AWS_SHARED_CREDENTIALS_FILE | Provide the path of the AWS secret credentials file | Defaults to /tmp/cloud_config.yml |
CPU_CORE | Provide the number of CPU cores to consume | Defaults to 0 (all available cores) |
CPU_LOAD | Provide the percentage of CPU to be consumed | Defaults to 100 |
SEQUENCE | It defines the sequence of chaos execution for multiple instances | Default value: parallel. Supported: serial, parallel |
RAMP_TIME | Period to wait before and after injection of chaos (in sec) | Eg. 30 |
Fault Examples
Common and AWS specific tunables
Refer to the common attributes and AWS-specific tunables to tune the common tunables for all faults and the AWS-specific tunables.
CPU Cores
It defines the number of CPU cores to hog on the target container instances. It can be tuned via the CPU_CORE ENV. A value of 0 means all the available CPU resources will be consumed.
Use the following example to tune this:
```yaml
# cpu cores for the stress
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: ecs-container-cpu-hog
      spec:
        components:
          env:
            # provide the cpu core to be hogged
            - name: CPU_CORE
              value: '0'
            - name: REGION
              value: 'us-east-2'
            - name: TOTAL_CHAOS_DURATION
              value: '60'
```
CPU Load
It defines the percentage of CPU to be consumed on the target container instances. It can be tuned via the CPU_LOAD ENV. A CPU load of 100 means 100% of each provided CPU core will be consumed.
Use the following example to tune this:
```yaml
# cpu load for the stress
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
    - name: ecs-container-cpu-hog
      spec:
        components:
          env:
            # provide the cpu load percentage
            - name: CPU_LOAD
              value: '100'
            - name: CPU_CORE
              value: '0'
            - name: TOTAL_CHAOS_DURATION
              value: '60'
```
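Once tuned, a ChaosEngine manifest like the ones above can be applied in the usual way; the file name here is illustrative:

```bash
kubectl apply -f ecs-container-cpu-hog-engine.yaml -n <CHAOS_NAMESPACE>
```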