Skip to main content

ECS container memory hog

ECS container memory hog disrupts the state of infrastructure resources. It induces stress on the AWS ECS container using Amazon SSM Run command, which is carried out using SSM docs which is in-built into the fault.

  • It causes memory stress on the containers of the ECS task using the given CLUSTER_NAME environment variable for a specific duration.
  • To select the Task Under Chaos (TUC), use the service name associated with the task. If you provide the service name along with the cluster name, all the tasks associated with the given service will be selected as chaos targets.
  • It tests the ECS task sanity (service availability) and recovery of the task containers subject to memory stress.

ECS Container Memory Hog

Usage

View fault usage
Memory usage inside containers is subject to constraints. If the limits are specified, exceeding them can result in termination of the container (due to OOMKill of the primary process, often pid 1). The container is restarted, depending on the policy specified. When there are no limits on the memory consumption of containers, containers on the instance can be killed based on their oom_score, which extends to all the task containers running on the instance. This results in a bigger blast radius. This fault launches a stress process within the target container, that causes the primary process in the container to have constraints based on resources or eat up the available system memory on the instance when limits on resources are not specified.

Prerequisites

  • Kubernetes >= 1.17
  • ECS container metadata is enabled (disabled by default). To enable it, refer to this docs. If your task is running from before, you may need to restart it to get the metadata directory.
  • You and the ECS cluster instances have a role with the required AWS access to perform the SSM and ECS operations. Refer to systems manager docs.
  • Create a Kubernetes secret that has the AWS access configuration(key) in the CHAOS_NAMESPACE. Below is a sample secret file:
apiVersion: v1
kind: Secret
metadata:
name: cloud-secret
type: Opaque
stringData:
cloud_config.yml: |-
# Add the cloud AWS credentials respectively
[default]
aws_access_key_id = XXXXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXX
  • If you change the secret key name (from cloud_config.yml), ensure that you update the AWS_SHARED_CREDENTIALS_FILE environment variable in the chaos experiment with the new name.

Permissions required

Here is an example AWS policy to execute the fault.

View policy for the fault
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "VisualEditor0",
"Effect": "Allow",
"Action": [
"ecs:UpdateContainerInstancesState",
"ecs:RegisterContainerInstance",
"ecs:ListContainerInstances",
"ecs:DeregisterContainerInstance",
"ecs:DescribeContainerInstances",
"ecs:ListTasks",
"ecs:DescribeClusters"

],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ssm:GetDocument",
"ssm:DescribeDocument",
"ssm:GetParameter",
"ssm:GetParameters",
"ssm:SendCommand",
"ssm:CancelCommand",
"ssm:CreateDocument",
"ssm:DeleteDocument",
"ssm:GetCommandInvocation",
"ssm:UpdateInstanceInformation",
"ssm:DescribeInstanceInformation"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ec2messages:AcknowledgeMessage",
"ec2messages:DeleteMessage",
"ec2messages:FailMessage",
"ec2messages:GetEndpoint",
"ec2messages:GetMessages",
"ec2messages:SendReply"
],
"Resource": "*"
},
{
"Effect": "Allow",
"Action": [
"ec2:DescribeInstances"
],
"Resource": [
"*"
]
}
]
}

Refer to the superset permission/policy to execute all AWS faults.

Default validations

The ECS container instance should be in a healthy state.

Fault tunables

Fault tunables

Mandatory fields

Variables Description Notes
CLUSTER_NAME Name of the target ECS cluster. For example, cluster-1.
REGION Region name of the target ECS cluster. For example, us-east-1.

Optional fields

Variables Description Notes
TOTAL_CHAOS_DURATION Duration that you specify, through which chaos is injected into the target resource (in seconds). Defaults to 30s.
CHAOS_INTERVAL The interval between successive instance terminations (in seconds). Defaults to 30s.
AWS_SHARED_CREDENTIALS_FILE Path to the AWS secret credentials. Defaults to /tmp/cloud_config.yml.
MEMORY_CONSUMPTION Memory consumed (in mebibytes). Defaults to 100.
MEMORY_PERCENTAGE Memory consumed (in percentage). Defaults to 0.
SEQUENCE Sequence of chaos execution for multiple instances. Defaults to parallel. Supports serial sequence as well.
RAMP_TIME Period to wait before and after injecting chaos (in seconds). For example, 30s.

Fault examples

Common and AWS-specific tunables

Refer to the common attributes and AWS-specific tunables to tune the common tunables for all faults and aws specific tunables.

Memory percentage

It specifies the memory consumed by the target container (in terms of percenage) for a duration specified by TOTAL_CHAOS_DURATION environment variable. You can tune it using the MEMORY_PERCENTAGE environment variable.

Use the following example to tune it:

# define the memory percentage in MB
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: engine-nginx
spec:
engineState: "active"
annotationCheck: "false"
chaosServiceAccount: litmus-admin
experiments:
- name: ecs-container-memory-hog
spec:
components:
env:
# memory percentage value
- name: MEMORY_PERCENTAGE
value: '100' #in MB
- name: TOTAL_CHAOS_DURATION
value: '60'

Memory consumption

It specifies the memory consumed by the target container (in terms of mebibytes) for a duration specified by TOTAL_CHAOS_DURATION environment variable. You can tune it using the MEMORY_CONSUMPTION environment variable.

Use the following example to tune it:

# define the memory consumption in MB
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: engine-nginx
spec:
engineState: "active"
annotationCheck: "false"
chaosServiceAccount: litmus-admin
experiments:
- name: ecs-container-memory-hog
spec:
components:
env:
# memory consumption value
- name: MEMORY_CONSUMPTION
value: '500' #in MB
- name: TOTAL_CHAOS_DURATION
value: '60'

Workers for stress

It specifies the worker's count to apply stress. You can tune it using the NUMBER_OF_WORKERS environment variable. Use the following example to tune it:

# number of workers used for the stress
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
name: engine-nginx
spec:
engineState: "active"
annotationCheck: "false"
chaosServiceAccount: litmus-admin
experiments:
- name: ecs-container-memory-hog
spec:
components:
env:
# number of workers for stress
- name: NUMBER_OF_WORKERS
value: '1'
- name: TOTAL_CHAOS_DURATION
value: '60'