ECS Container CPU Hog

Introduction

  • ECS Container CPU Hog disrupts the state of infrastructure resources. The fault induces CPU stress on an AWS ECS container using SSM Run Command, carried out through an SSM document that is built into the fault for the given chaos experiment.

  • It injects CPU stress or exhaustion on the target task container for the given duration. The number of CPU cores to stress can be provided as input.

  • To select the Task Under Chaos (TUC), you can use the service name associated with the task: if you provide the service name along with the cluster name, all the tasks associated with the given service are selected as chaos targets.

  • It tests the ECS task sanity (service availability) and recovery of the task containers subjected to CPU stress.

Fault execution flow chart

ECS Container CPU Hog

Uses

CPU hogs are a common scenario with containers and applications. They can result in the eviction of the application (task container) and impact its delivery. Such scenarios can still occur despite the availability aids that Docker provides. These problems are generally referred to as "Noisy Neighbour" problems.

By injecting a rogue process into a target task container, the fault starves the main microservice process (typically PID 1) of the resources allocated to it (where limits are defined), which slows application traffic. In other cases, unrestrained use can cause the instance to exhaust its resources, leading to the eviction of all task containers. This category of chaos fault helps build the application's immunity to such stress scenarios.

Prerequisites

info
  • Ensure that the Kubernetes version is 1.17 or later.
Enable Container Metadata

Ensure that ECS container metadata is enabled; this feature is disabled by default. Refer to the AWS docs: Enabling container metadata. This lets HCE discover container details, such as the container ID, for the containers running the ECS tasks.

NOTE: Follow these steps to enable container metadata and attach an IAM role to the cluster instances:
  • In the EC2 dashboard sidebar, click Launch Configurations under Auto Scaling.

  • Create a copy of the Auto Scaling configuration used in the target ECS cluster. This creates a new (copied) Launch Template.

  • In the new (copied) Launch Template, update the IAM role of the instances with the ECS and SSM permissions listed in the Permission Requirement section below.

  • Update the user data so that ECS_ENABLE_CONTAINER_METADATA is set to true.

  • Save the configuration by clicking 'Create Launch Template'.

  • Go back to the Auto Scaling group and switch to the launch template (AWS is deprecating launch configurations).

  • Update the cluster auto-scaling group with the newer launch template.

  • Restart the instances of the ECS cluster so that they pull the updated configuration.
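
The user data change in the steps above can be sketched as a cloud-config fragment. This is a minimal sketch, assuming the launch template's user data is in cloud-init cloud-config format and that the ECS agent reads /etc/ecs/ecs.config (its standard location on ECS-optimized AMIs). If the existing user data is a shell script, the same ECS_ENABLE_CONTAINER_METADATA=true line would be appended to that file from the script instead.

```yaml
#cloud-config
# Sketch only: assumes cloud-config user data and the default ECS agent
# config path. Keep any settings the existing user data already writes
# (for example the cluster registration) alongside this entry.
write_files:
  - path: /etc/ecs/ecs.config
    append: true
    content: |
      ECS_ENABLE_CONTAINER_METADATA=true
```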

  • Ensure that both the user and the ECS cluster instances have a role with the AWS access required for SSM and ECS operations. Refer to the sample policy in the Permission Requirement section below.

  • Create a Kubernetes secret that holds the AWS access configuration (key) in the CHAOS_NAMESPACE. A sample secret file looks like:

apiVersion: v1
kind: Secret
metadata:
  name: cloud-secret
type: Opaque
stringData:
  cloud_config.yml: |-
    # Add the cloud AWS credentials respectively
    [default]
    aws_access_key_id = XXXXXXXXXXXXXXXXXXX
    aws_secret_access_key = XXXXXXXXXXXXXXX
  • If you change the secret key name (from cloud_config.yml), also update the AWS_SHARED_CREDENTIALS_FILE ENV value in the ChaosExperiment CR to match the new name.
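
To illustrate the note above, the fragment below sketches how a renamed secret key might be reflected in the ChaosExperiment CR. It assumes the usual Litmus layout (spec.definition.env and spec.definition.secrets) and that the secret is mounted at /tmp/, as the default path /tmp/cloud_config.yml suggests. The key name aws_creds.yml is hypothetical.

```yaml
# Fragment of the ChaosExperiment CR; only the relevant fields are shown.
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosExperiment
metadata:
  name: ecs-container-cpu-hog
spec:
  definition:
    env:
      # Must point at the key name used in the secret's stringData
      # (aws_creds.yml is a hypothetical replacement for cloud_config.yml)
      - name: AWS_SHARED_CREDENTIALS_FILE
        value: /tmp/aws_creds.yml
    secrets:
      # Assumed mount of the cloud-secret shown above
      - name: cloud-secret
        mountPath: /tmp/
```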

Permission Requirement

  • Here is an example AWS policy to execute this fault.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "ecs:UpdateContainerInstancesState",
        "ecs:RegisterContainerInstance",
        "ecs:ListContainerInstances",
        "ecs:DeregisterContainerInstance",
        "ecs:DescribeContainerInstances",
        "ecs:ListTasks",
        "ecs:DescribeClusters"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetDocument",
        "ssm:DescribeDocument",
        "ssm:GetParameter",
        "ssm:GetParameters",
        "ssm:SendCommand",
        "ssm:CancelCommand",
        "ssm:CreateDocument",
        "ssm:DeleteDocument",
        "ssm:GetCommandInvocation",
        "ssm:UpdateInstanceInformation",
        "ssm:DescribeInstanceInformation"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2messages:AcknowledgeMessage",
        "ec2messages:DeleteMessage",
        "ec2messages:FailMessage",
        "ec2messages:GetEndpoint",
        "ec2messages:GetMessages",
        "ec2messages:SendReply"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}

Default Validations

info
  • The ECS container instance should be in a healthy state.

Fault Tunables

Mandatory Fields

| Variables | Description | Notes |
| --- | --- | --- |
| CLUSTER_NAME | Name of the target ECS cluster | Eg. cluster-1 |
| REGION | The region name of the target ECS cluster | Eg. us-east-1 |

Optional Fields

| Variables | Description | Notes |
| --- | --- | --- |
| TOTAL_CHAOS_DURATION | The total time duration for chaos insertion (sec) | Defaults to 30s |
| CHAOS_INTERVAL | The interval (in sec) between successive chaos injections | Defaults to 30s |
| AWS_SHARED_CREDENTIALS_FILE | Path to the AWS secret credentials file | Defaults to /tmp/cloud_config.yml |
| CPU_CORE | Number of CPU cores to consume | Defaults to 0 |
| CPU_LOAD | Percentage of CPU to consume | Defaults to 100 |
| SEQUENCE | Sequence of chaos execution for multiple instances | Default value: parallel. Supported: serial, parallel |
| RAMP_TIME | Period to wait before and after injection of chaos (sec) | Eg. 30 |
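
As a rough illustration of how the fields above fit together, the following ChaosEngine sketch sets both mandatory fields and a few optional ones. It follows the layout of the fault examples in this document. The values are placeholders taken from the tables, and serial is chosen only to show a non-default SEQUENCE.

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: ecs-container-cpu-hog
    spec:
      components:
        env:
        # mandatory fields
        - name: CLUSTER_NAME
          value: 'cluster-1'
        - name: REGION
          value: 'us-east-1'
        # optional fields (placeholder values)
        - name: TOTAL_CHAOS_DURATION
          value: '60'
        - name: CHAOS_INTERVAL
          value: '30'
        - name: SEQUENCE
          value: 'serial'
        - name: RAMP_TIME
          value: '30'
```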

Fault Examples

Common and AWS specific tunables

Refer to the common attributes and AWS-specific tunables to tune the tunables that are common to all faults and those that are specific to AWS.

CPU Cores

It specifies the number of CPU cores to hog on the target container instances. It can be tuned via the CPU_CORE ENV. A value of 0 means that all the available CPU resources should be consumed.

Use the following example to tune this:

# cpu cores for the stress
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: ecs-container-cpu-hog
    spec:
      components:
        env:
        # provide the cpu core to be hogged
        - name: CPU_CORE
          value: '0'
        - name: REGION
          value: 'us-east-2'
        - name: TOTAL_CHAOS_DURATION
          value: '60'

CPU Load

It specifies the percentage of CPU to be consumed on the target container instances. It can be tuned via the CPU_LOAD ENV. A CPU load of 100 means 100% of each CPU core provided.

Use the following example to tune this:

# cpu load for the stress
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: ecs-container-cpu-hog
    spec:
      components:
        env:
        # provide the cpu load percentage
        - name: CPU_LOAD
          value: '100'
        - name: CPU_CORE
          value: '0'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
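
A partial load can be expressed the same way. The sketch below assumes, per the description above, that CPU_LOAD applies to each core named by CPU_CORE, so it would stress two cores at roughly 50% each. The values are illustrative only.

```yaml
# cpu load on a fixed number of cores (illustrative values)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: ecs-container-cpu-hog
    spec:
      components:
        env:
        # percentage of each selected core to consume
        - name: CPU_LOAD
          value: '50'
        # number of cores to stress
        - name: CPU_CORE
          value: '2'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
```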