Chaos Faults for AWS
Introduction
AWS faults disrupt the resources running on different AWS services from the EKS cluster. To perform such AWS chaos experiments, you will need to authenticate HCE with the AWS platform. This can be done in two ways.
- Using secrets: You can use secrets to authenticate HCE with AWS regardless of whether the Kubernetes cluster is used for the deployment. This is the Kubernetes-native way of authenticating HCE with AWS.
- IAM integration: You can authenticate HCE with AWS using IAM when you have deployed chaos on an EKS cluster. You can associate an IAM role with a Kubernetes service account. This service account can then provide AWS permissions to the experiment pods that use it.
Why should I use IAM integration for AWS authentication?
IAM roles for service accounts provide the following benefits.
- Least privilege: With IAM roles for service accounts, you don't need to extend permissions to all pods on a node (for example, by widening the node IAM role so that every pod can make AWS API calls). You can scope IAM permissions to a service account, and only the pods that use that service account have access to those permissions.
- Credential isolation: The experiment can only retrieve credentials for the IAM role associated with a particular service account. This experiment would not have access to credentials for other experiments belonging to other pods.
Below are the steps to enable service accounts to access AWS resources.
Step 1: Create an IAM OpenID Connect (OIDC) provider for your cluster
You must create an IAM OpenID Connect (OIDC) identity provider for your cluster with eksctl. This step is performed once per cluster. For more information, go to the AWS documentation to set up an OIDC provider.
Below is the command to check whether your cluster has an existing IAM OIDC provider. The cluster name in this example is litmus-demo and the region is us-west-1. Replace these values based on your environment.
aws eks describe-cluster --name <litmus-demo> --query "cluster.identity.oidc.issuer" --output text
Output:
https://oidc.eks.us-west-1.amazonaws.com/id/D054E55B6947B1A7B3F200297789662C
To list the IAM OIDC providers in your account, execute the following command.
aws iam list-open-id-connect-providers | grep <D054E55B6947B1A7B3F200297789662C>
Replace <D054E55B6947B1A7B3F200297789662C> (including <>) with the value returned from the output of the previous command.
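The provider ID you grep for is simply the final path segment of the issuer URL returned by `aws eks describe-cluster`. A small sketch of extracting it, using the example value from the output above:

```python
# Extract the OIDC provider ID from the issuer URL returned by
# `aws eks describe-cluster`. The ID is the last path segment of the URL.
from urllib.parse import urlparse

issuer = "https://oidc.eks.us-west-1.amazonaws.com/id/D054E55B6947B1A7B3F200297789662C"
oidc_id = urlparse(issuer).path.rsplit("/", 1)[-1]
print(oidc_id)  # D054E55B6947B1A7B3F200297789662C
```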
If no IAM OIDC identity provider is available for your account, create one for your cluster using the following command.
Replace <litmus-demo> (including <>) with values of your choice.
eksctl utils associate-iam-oidc-provider --cluster litmus-demo --approve
2021-09-07 14:54:01 [ℹ] eksctl version 0.52.0
2021-09-07 14:54:01 [ℹ] using region us-west-1
2021-09-07 14:54:04 [ℹ] will create IAM Open ID Connect provider for cluster "litmus-demo" in "us-west-1"
2021-09-07 14:54:05 [✔] created IAM Open ID Connect provider for cluster "litmus-demo" in "us-west-1"
Step 2: Create an IAM role and policy for your service account
Create an IAM policy with the permissions that you would like the experiment to have. There are several ways to create a new IAM permission policy; go to the AWS documentation on creating IAM policies to learn more. Then use the eksctl command to create a service account with the IAM policy attached.
eksctl create iamserviceaccount \
--name <service_account_name> \
--namespace <service_account_namespace> \
--cluster <cluster_name> \
--attach-policy-arn <IAM_policy_ARN> \
--approve \
--override-existing-serviceaccounts
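The `<IAM_policy_ARN>` above refers to a policy you create beforehand. As an illustration only, a minimal policy for EC2 stop/start faults might look like the following; scope the action list to the faults you actually run:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:StartInstances",
        "ec2:StopInstances",
        "ec2:DescribeInstances"
      ],
      "Resource": "*"
    }
  ]
}
```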
Step 3: Associate an IAM role with a service account
Define the IAM role for every Kubernetes service account in your cluster that requires access to AWS resources by adding the following annotation to the service account.
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/<IAM_ROLE_NAME>
You can also annotate the experiment service account using the command:
kubectl annotate serviceaccount -n <SERVICE_ACCOUNT_NAMESPACE> <SERVICE_ACCOUNT_NAME> \
eks.amazonaws.com/role-arn=arn:aws:iam::<ACCOUNT_ID>:role/<IAM_ROLE_NAME>
- Annotating the litmus-admin service account in the HCE namespace works for most experiments.
- For the cluster autoscaler experiment, annotate the service account in the kube-system namespace.
Step 4: Verify that the experiment service account is associated with the IAM role
If you run an experiment and exec into one of its pods, you can verify whether the AWS_WEB_IDENTITY_TOKEN_FILE and AWS_ROLE_ARN environment variables exist.
kubectl exec -n litmus <ec2-terminate-by-id-z4zdf> -- env | grep AWS
Output:
AWS_VPC_K8S_CNI_LOGLEVEL=DEBUG
AWS_ROLE_ARN=arn:aws:iam::<ACCOUNT_ID>:role/<IAM_ROLE_NAME>
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
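As a quick programmatic sanity check, you can assert that both variables are present in the captured `env` output. A sketch over the sample output above:

```python
# Verify that the IAM-for-service-account variables are present in the
# `env` output captured from the experiment pod (sample shown above).
env_output = """\
AWS_VPC_K8S_CNI_LOGLEVEL=DEBUG
AWS_ROLE_ARN=arn:aws:iam::<ACCOUNT_ID>:role/<IAM_ROLE_NAME>
AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/eks.amazonaws.com/serviceaccount/token
"""

# Parse KEY=VALUE lines into a dict, splitting only on the first '='.
env = dict(line.split("=", 1) for line in env_output.splitlines())
assert "AWS_ROLE_ARN" in env and "AWS_WEB_IDENTITY_TOKEN_FILE" in env
print("IAM role association looks good")
```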
Step 5: Configure the experiment CR
Since you have already configured IAM for the experiment service account, you don't have to create a secret and mount it in the experiment custom resource (CR), which is otherwise done by default. To remove the secret mount, remove the following lines from the experiment YAML.
secrets:
- name: cloud-secret
mountPath: /tmp/
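If you generate experiment manifests programmatically, dropping the secret mount amounts to deleting the `secrets` key. A sketch over a plain dict standing in for the parsed experiment YAML (use a YAML library for real manifests):

```python
# Remove the `secrets` mount from an experiment definition represented
# as a plain dict (stand-in for the parsed experiment YAML).
experiment = {
    "name": "ec2-terminate-by-id",
    "secrets": [{"name": "cloud-secret", "mountPath": "/tmp/"}],
}

experiment.pop("secrets", None)  # no-op if the key is already absent
print(experiment)  # {'name': 'ec2-terminate-by-id'}
```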
Now, you can run chaos experiments with IAM integration.
Here are AWS faults that you can execute and validate.
ALB AZ down
ALB AZ down takes down an availability zone (AZ) on a target application load balancer (ALB) for a specific duration.
CLB AZ down
CLB AZ down takes down an availability zone (AZ) on a target classic load balancer (CLB) for a specific duration.
EBS loss by ID
EBS loss by ID disrupts the state of an EBS volume by detaching it from the node (or EC2 instance) using the volume ID, for a certain duration.
EBS loss by tag
EBS loss by tag disrupts the state of an EBS volume by detaching it from the node (or EC2 instance) using the volume tag, for a certain duration.
EC2 CPU hog
EC2 CPU hog induces CPU stress on the target AWS EC2 instance using the Amazon SSM Run command, which is carried out using an SSM doc that is in-built into the fault.
EC2 DNS chaos
EC2 DNS chaos causes DNS errors on the specified EC2 instance for a specific duration.
ALB AZ down
ALB AZ down takes down an availability zone (AZ) on a target application load balancer (ALB) for a specific duration. This fault:
- Restricts access to certain availability zones for a specific duration.
- Tests the sanity, availability, and recovery workflows of the application pods attached to the load balancer.
- Breaks the connectivity of the ALB with the given zones, impacting their delivery.
- Disrupts the application's performance by detaching the AZ from the application load balancer.
CLB AZ down
CLB AZ down takes down the AZ (Availability Zones) on a target CLB for a specific duration. This fault:
- Restricts access to certain availability zones for a specific duration.
- Tests the application sanity, availability, and recovery workflows of the application pod attached to the load balancer.
View fault usage
- CLB AZ down fault breaks the connectivity of a CLB with the given zones and impacts their delivery.
- Detaching the AZ from the classic load balancer disrupts the dependent application's performance.
EBS loss by ID
EBS loss by ID disrupts the state of an EBS volume by detaching it from the node (or EC2 instance) using the volume ID, for a certain duration.
- In the case of EBS persistent volumes, the volumes can self-attach, and the re-attachment step can be skipped.
- It tests the deployment sanity (replica availability and uninterrupted service) and recovery workflows of the application pod.
EBS loss by tag
EBS loss by tag disrupts the state of an EBS volume by detaching it from the node (or EC2 instance) using the volume tag, for a certain duration.
- In the case of EBS persistent volumes, the volumes can self-attach, and the re-attachment step can be skipped.
- It tests the deployment sanity (replica availability and uninterrupted service) and recovery workflows of the application pod.
EC2 CPU hog
EC2 CPU hog induces CPU stress on the target AWS EC2 instance using the Amazon SSM Run command, which is carried out using an SSM doc that is in-built into the fault.
- It causes CPU stress on the target EC2 instance(s) for a specific duration.
EC2 DNS chaos
EC2 DNS chaos causes DNS errors on the specified EC2 instance for a specific duration.
- It determines the performance of the application (or process) running on the EC2 instance(s).
EC2 HTTP latency
EC2 HTTP latency disrupts the state of infrastructure resources. This fault induces HTTP chaos on an AWS EC2 instance using the Amazon SSM Run command, carried out using an SSM doc that is in-built into the fault.
- It injects HTTP response latency into the service whose port is specified using the TARGET_SERVICE_PORT environment variable, by starting a proxy server and redirecting the traffic through it.
- It introduces HTTP latency chaos on the EC2 instance using an SSM doc for a certain chaos duration.
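In the experiment definition, these tunables are passed as environment variables. A hypothetical fragment of the fault's `env` section — `TARGET_SERVICE_PORT` comes from the description above, while `LATENCY` and `TOTAL_CHAOS_DURATION` are assumed names; check them against the fault reference:

```yaml
# Illustrative only: LATENCY and TOTAL_CHAOS_DURATION are assumed
# tunable names; TARGET_SERVICE_PORT is named in the fault description.
env:
  - name: TARGET_SERVICE_PORT
    value: "80"      # port of the service to proxy
  - name: LATENCY
    value: "2000"    # injected response latency in milliseconds
  - name: TOTAL_CHAOS_DURATION
    value: "60"      # chaos duration in seconds
```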
EC2 HTTP modify body
EC2 HTTP modify body injects HTTP chaos that modifies the body of the request (or response) by starting a proxy server and redirecting the traffic through it.
- It tests the application's resilience to erroneous (or incorrect) HTTP response bodies.
EC2 HTTP modify header
EC2 HTTP modify header injects HTTP chaos that modifies the headers of the request (or response) by starting a proxy server and redirecting the traffic through it.
- It modifies the headers of the requests and responses of the service.
- This can be used to test the application's resilience to incorrect (or incomplete) headers.
EC2 HTTP reset peer
EC2 HTTP reset peer injects an HTTP reset on the service whose port is specified using the TARGET_SERVICE_PORT environment variable.
- It stops the outgoing HTTP requests by resetting the TCP connection for the requests.
- It determines the application's resilience to a lossy (or flaky) HTTP connection.
EC2 HTTP status code
EC2 HTTP status code injects HTTP chaos that modifies the status code of the response by starting a proxy server and redirecting the traffic through it.
- It tests the application's resilience to erroneous HTTP response codes from the application server.
EC2 IO stress
EC2 IO stress disrupts the state of infrastructure resources.
- The fault induces stress on the AWS EC2 instance using the Amazon SSM Run command, carried out using an SSM doc that is in-built into the fault.
- It causes I/O stress on the EC2 instance for a certain duration.
EC2 memory hog
EC2 memory hog disrupts the state of infrastructure resources.
- The fault induces stress on the AWS EC2 instance using the Amazon SSM Run command, carried out using an SSM doc that is in-built into the fault.
- It causes memory exhaustion on the EC2 instance for a specific duration.
EC2 network latency
EC2 network latency causes flaky access to the application (or services) by injecting network packet latency to EC2 instance(s).
- It determines the performance of the application (or process) running on the EC2 instances.
EC2 network loss
EC2 network loss causes flaky access to the application (or services) by injecting network packet loss to EC2 instance(s).
- It checks the performance of the application (or process) running on the EC2 instances.
EC2 process kill
EC2 process kill fault kills the target processes running on an EC2 instance.
- It checks the performance of the application/process running on the EC2 instance(s).
EC2 stop by ID
EC2 stop by ID stops an EC2 instance using the provided instance ID or list of instance IDs.
- It brings back the instance after a specific duration.
- It checks the performance of the application (or process) running on the EC2 instance.
- When the MANAGED_NODEGROUP environment variable is enabled, the fault does not try to start the instance after chaos. Instead, it checks for the addition of a new node instance to the cluster.
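As a sketch, the tunables above might appear in the fault spec as follows — `MANAGED_NODEGROUP` is named in the description, while `EC2_INSTANCE_ID` and `REGION` are assumed names; confirm them against the fault reference:

```yaml
env:
  - name: EC2_INSTANCE_ID    # assumed name; the target instance ID(s)
    value: "i-0123456789abcdef0"
  - name: REGION             # assumed name; region of the instance
    value: "us-west-1"
  - name: MANAGED_NODEGROUP  # named in the description above
    value: "enable"
```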
EC2 stop by tag
EC2 stop by tag stops an EC2 instance using the provided tag.
- It brings back the instance after a specific duration.
- It checks the performance of the application (or process) running on the EC2 instance.
- When the MANAGED_NODEGROUP environment variable is enabled, the fault does not try to start the instance after chaos. Instead, it checks for the addition of a new node instance to the cluster.
ECS agent stop
ECS agent stop disrupts the state of infrastructure resources.
- The fault induces agent-stop chaos on AWS ECS using the Amazon SSM Run command, carried out using an SSM doc that is in-built into the fault for the given chaos scenario.
- It stops the agent container on the ECS cluster specified by the CLUSTER_NAME environment variable, using an SSM doc, for a specific duration.
ECS container CPU hog
ECS container CPU hog disrupts the state of infrastructure resources. It induces stress on the AWS ECS container using the Amazon SSM Run command, which is carried out using an SSM doc that is in-built into the fault.
- It causes CPU chaos on the containers of the ECS task in the cluster specified by the CLUSTER_NAME environment variable, for a specific duration.
- To select the task under chaos (TUC), use the service name associated with the task. If you provide the service name along with the cluster name, all the tasks associated with the given service are selected as chaos targets.
- It tests the ECS task sanity (service availability) and the recovery of the task containers subjected to CPU stress.
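A sketch of how the cluster and service selectors might be supplied in the fault's `env` section — `CLUSTER_NAME` is named in the description, while `SERVICE_NAME` and both values are assumptions:

```yaml
env:
  - name: CLUSTER_NAME   # named in the description above
    value: "litmus-demo"
  - name: SERVICE_NAME   # assumed name; selects the tasks under chaos
    value: "my-app-service"
```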
ECS container IO stress
ECS container IO stress disrupts the state of infrastructure resources. It induces stress on the AWS ECS container using the Amazon SSM Run command, which is carried out using an SSM doc that is in-built into the fault.
- It causes I/O stress on the containers of the ECS task in the cluster specified by the CLUSTER_NAME environment variable, for a specific duration.
- To select the task under chaos (TUC), use the service name associated with the task. If you provide the service name along with the cluster name, all the tasks associated with the given service are selected as chaos targets.
- It tests the ECS task sanity (service availability) and the recovery of the task containers subjected to I/O stress.
ECS container memory hog
ECS container memory hog disrupts the state of infrastructure resources. It induces stress on the AWS ECS container using the Amazon SSM Run command, which is carried out using an SSM doc that is in-built into the fault.
- It causes memory stress on the containers of the ECS task in the cluster specified by the CLUSTER_NAME environment variable, for a specific duration.
- To select the task under chaos (TUC), use the service name associated with the task. If you provide the service name along with the cluster name, all the tasks associated with the given service are selected as chaos targets.
- It tests the ECS task sanity (service availability) and the recovery of the task containers subjected to memory stress.
ECS container network latency
ECS container network latency disrupts the state of infrastructure resources. It induces network latency on the AWS ECS container using the Amazon SSM Run command, which is carried out using an SSM doc that is in-built into the fault.
- It causes network latency on the containers of the ECS task in the cluster specified by the CLUSTER_NAME environment variable, for a specific duration.
- To select the task under chaos (TUC), use the service name associated with the task. If you provide the service name along with the cluster name, all the tasks associated with the given service are selected as chaos targets.
- It tests the ECS task sanity (service availability) and the recovery of the task containers subjected to network chaos.
ECS container network loss
ECS container network loss disrupts the state of infrastructure resources.
- The fault induces chaos on the AWS ECS container using the Amazon SSM Run command, carried out using an SSM doc that is in-built into the fault.
- It causes network packet loss on the containers of the ECS task in the given cluster.
- To select the task under chaos (TUC), use the service name associated with the task. If you provide the service name along with the cluster name, all the tasks associated with the given service are selected as chaos targets.
- It tests the ECS task sanity (service availability) and the recovery of the task containers subjected to network chaos.
ECS instance stop
ECS instance stop induces stress on an AWS ECS cluster. It derives the instance under chaos from the ECS cluster.
- It causes an EC2 instance to stop and be removed from the ECS cluster for a specific duration.
ECS task stop
ECS task stop is an AWS fault that injects chaos to stop the ECS tasks based on the services or task replica ID and checks the task availability.
- This fault results in the unavailability of the application running on the tasks.
Lambda delete event source mapping
Lambda delete event source mapping removes the event source mapping from an AWS Lambda function for a specific duration.
- It checks the performance of the application (or service) without the event source mapping which may cause missing entries in a database.
Lambda toggle event mapping state
Lambda toggle event mapping state toggles (or sets) the event source mapping state to disabled for a Lambda function during a specific duration.
- It checks the performance of the running application (or service) when the event source mapping is not enabled which may cause missing entries in a database.
Lambda update function memory
Lambda update function memory causes the memory of a Lambda function to be updated to a specified value for a certain duration.
- It checks the performance of the application (or service) running with a new memory limit.
- It helps determine a safe overall memory limit value for the function.
- The smaller the memory limit, the longer the Lambda function takes to run under load.
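As an illustration, the fault's tunables might be supplied like this — both env names and values below are assumptions; check the fault reference for the exact tunable names:

```yaml
env:
  - name: FUNCTION_NAME         # assumed name; the target Lambda function
    value: "my-function"
  - name: MEMORY_IN_MEGABYTES   # assumed name; temporary memory limit
    value: "128"
```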
Lambda update function timeout
Lambda update function timeout causes the timeout of a Lambda function to be updated to a specified value for a certain duration.
- It checks the performance of the application (or service) running with a new timeout.
- It also helps determine a safe overall timeout value for the function.
Lambda update role permission
Lambda update role permission is an AWS fault that modifies the role policies associated with a Lambda function.
- It verifies the handling mechanism for function failures.
- It can also be used to update the role attached to a Lambda function.
- It checks the performance of the running Lambda application when it does not have enough permissions.
Lambda delete function concurrency
Lambda delete function concurrency is an AWS fault that deletes the Lambda function's reserved concurrency, thereby ensuring that the function has adequate unreserved concurrency to run.
- It examines the performance of the running Lambda application when the function lacks sufficient concurrency.
RDS instance delete
RDS instance delete removes an instance from an AWS RDS cluster.
- This makes the cluster unavailable for a specific duration.
- It determines how quickly an application can recover from an unexpected cluster deletion.
RDS instance reboot
RDS instance reboot induces a reboot on an RDS instance in an AWS RDS cluster. It derives the instance under chaos from the RDS cluster.
Windows EC2 blackhole chaos
Windows EC2 blackhole chaos results in access loss to the given target hosts or IPs by injecting firewall rules.
Windows EC2 CPU hog
Windows EC2 CPU hog induces CPU stress on the target AWS Windows EC2 instance using the Amazon SSM Run command.
Use cases
Windows EC2 CPU hog:
- Simulates the situation of a lack of CPU for processes running on the instance, which degrades their performance.
- Simulates slow application traffic or exhaustion of the resources, leading to degradation in the performance of processes on the instance.
Windows EC2 memory hog
Windows EC2 memory hog induces memory stress on the target AWS Windows EC2 instance using Amazon SSM Run command.
Use cases
Windows EC2 memory hog:
- Causes memory stress on the target AWS EC2 instance(s).
- Simulates the situation of memory leaks in the deployment of microservices.
- Simulates application slowness due to memory starvation, and noisy neighbour problems due to hogging.