Chaos faults for AWS
Introduction
AWS faults disrupt resources running on different AWS services from the EKS cluster. To perform such AWS chaos experiments, you need to authenticate CE with the AWS platform. This can be done in one of two ways (a minimal credential-resolution sketch follows the list).
- Using secrets: You can use secrets to authenticate CE with AWS regardless of whether the Kubernetes cluster is used for the deployment. This is Kubernetes' native way of authenticating CE with AWS.
- IAM integration: You can authenticate CE with AWS using IAM when you have deployed chaos on the EKS cluster. You can associate an IAM role with a Kubernetes service account, and that service account provides AWS permissions to any experiment pod that uses it.
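The two options differ only in how the experiment pod obtains AWS credentials. Below is a minimal, illustrative Python (boto3) sketch, not part of CE itself, showing how the AWS SDK's default credential chain resolves either source; the environment variable names are the standard ones set by a mounted secret or by the EKS service-account integration.

```python
import os
import boto3

# A minimal sketch: boto3's default credential chain covers both approaches.
# 1. Secret-based auth: the experiment pod mounts a Kubernetes secret whose keys
#    are exposed as AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.
# 2. IAM integration (EKS): annotating the service account with an IAM role
#    injects AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE, and boto3 assumes
#    the role automatically.

def describe_auth_mode():
    if os.environ.get("AWS_WEB_IDENTITY_TOKEN_FILE"):
        mode = "IAM role via service account (web identity)"
    elif os.environ.get("AWS_ACCESS_KEY_ID"):
        mode = "static credentials from a mounted secret"
    else:
        mode = "instance profile or other default provider"
    identity = boto3.client("sts").get_caller_identity()
    return mode, identity["Arn"]

if __name__ == "__main__":
    mode, arn = describe_auth_mode()
    print(f"Auth mode: {mode}\nCaller ARN: {arn}")
```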
Here are AWS faults that you can execute and validate.
ALB AZ down
ALB AZ down takes down the availability zones (AZs) on a target application load balancer for a specific duration.
CLB AZ down
CLB AZ down takes down the availability zones (AZs) on a target Classic Load Balancer (CLB) for a specific duration.
DynamoDB replication pause
DynamoDB replication pause fault pauses the data replication in DynamoDB tables over multiple locations for the chaos duration.
EBS loss by ID
EBS loss by ID disrupts the state of an EBS volume by detaching it from the node (or EC2 instance) using the volume ID for a certain duration.
EBS loss by tag
EBS loss by tag disrupts the state of an EBS volume by detaching it from the node (or EC2 instance) using the provided tag for a certain duration.
EC2 CPU hog
EC2 CPU hog disrupts the state of infrastructure resources. It induces CPU stress on the AWS EC2 instance using the Amazon SSM Run command, which is carried out using SSM docs that come in-built with the fault.
ALB AZ down
ALB AZ down takes down the availability zones (AZs) on a target application load balancer for a specific duration. This fault restricts access to certain availability zones for a specific duration.
Use cases
- Tests the application sanity, availability, and recovery workflows of the application pod attached to the load balancer.
- ALB AZ down fault breaks the connectivity of an ALB with the given zones and impacts the delivery of the dependent application.
- Detaching the AZ from the application load balancer disrupts the application's performance.
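A minimal boto3 sketch of the kind of API calls such a fault relies on; the load balancer ARN, region, and target zone are placeholders, and the real fault is driven by the experiment tunables rather than a script like this.

```python
import time
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

ALB_ARN = "arn:aws:elasticloadbalancing:us-east-1:111122223333:loadbalancer/app/demo/abc123"  # placeholder
TARGET_ZONE = "us-east-1a"  # AZ to take down
CHAOS_DURATION = 60         # seconds

# Record the subnets currently attached to the ALB.
lb = elbv2.describe_load_balancers(LoadBalancerArns=[ALB_ARN])["LoadBalancers"][0]
original_subnets = [az["SubnetId"] for az in lb["AvailabilityZones"]]

# Drop the subnets that belong to the target AZ (an ALB must keep at least
# two AZs attached, so at least two other zones must remain).
remaining = [
    az["SubnetId"] for az in lb["AvailabilityZones"] if az["ZoneName"] != TARGET_ZONE
]
elbv2.set_subnets(LoadBalancerArn=ALB_ARN, Subnets=remaining)

try:
    time.sleep(CHAOS_DURATION)   # chaos window
finally:
    # Restore the original subnet attachments.
    elbv2.set_subnets(LoadBalancerArn=ALB_ARN, Subnets=original_subnets)
```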
CLB AZ down
CLB AZ down takes down the availability zones (AZs) on a target Classic Load Balancer (CLB) for a specific duration. This fault restricts access to certain availability zones for a specific duration.
Use cases
- Tests the application sanity, availability, and recovery workflows of the application pod attached to the load balancer.
- CLB AZ down fault breaks the connectivity of a CLB with the given zones and impacts the delivery of the dependent application.
- Detaching the AZ from the classic load balancer disrupts the dependent application's performance.
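A comparable sketch for a Classic Load Balancer, which exposes explicit enable/disable calls for availability zones; the load balancer name, region, and zone are placeholders.

```python
import time
import boto3

elb = boto3.client("elb", region_name="us-east-1")

CLB_NAME = "demo-classic-lb"   # placeholder
TARGET_ZONE = "us-east-1a"
CHAOS_DURATION = 60            # seconds

# Detach the target AZ from the CLB, hold for the chaos duration, then re-attach.
# For a CLB inside a VPC, detach_load_balancer_from_subnets /
# attach_load_balancer_to_subnets are the equivalent calls.
elb.disable_availability_zones_for_load_balancer(
    LoadBalancerName=CLB_NAME, AvailabilityZones=[TARGET_ZONE]
)
try:
    time.sleep(CHAOS_DURATION)
finally:
    elb.enable_availability_zones_for_load_balancer(
        LoadBalancerName=CLB_NAME, AvailabilityZones=[TARGET_ZONE]
    )
```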
DynamoDB replication pause
DynamoDB replication pause fault pauses the data replication in DynamoDB tables over multiple locations for the chaos duration.
- While the chaos experiment is running, any changes to the DynamoDB table are not replicated to the other regions, making the data in DynamoDB inconsistent.
- You can execute this fault only on a global DynamoDB table, that is, a table with more than one replica (a minimal precondition check is sketched below).
Use cases
DynamoDB replication pause determines the resilience of the application when data (in a database) that needs to be constantly updated is disrupted.
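The fault's precondition, a global table with more than one replica, can be verified with a describe call; a minimal sketch, with the table name and region as placeholders:

```python
import boto3

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

TABLE_NAME = "orders"  # placeholder table name

# A global table reports its replica regions in the table description;
# an empty list means the table is not global and the fault does not apply.
table = dynamodb.describe_table(TableName=TABLE_NAME)["Table"]
replicas = table.get("Replicas", [])

if not replicas:
    raise SystemExit(f"{TABLE_NAME} has no replicas; the fault requires a global table.")

print(f"{TABLE_NAME} replicates to:", [r["RegionName"] for r in replicas])
```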
EBS loss by ID
EBS loss by ID disrupts the state of an EBS volume by detaching it from the node (or EC2 instance) using the volume ID for a certain duration.
- In the case of EBS persistent volumes, the volumes can self-attach, and the re-attachment step can be skipped.
Use cases
- It tests the deployment sanity (replica availability and uninterrupted service) and recovery workflows of the application pod.
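A minimal boto3 sketch of the detach/re-attach cycle such a fault performs; the volume ID, instance ID, region, and device name are placeholders.

```python
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

VOLUME_ID = "vol-0123456789abcdef0"     # placeholder
INSTANCE_ID = "i-0123456789abcdef0"     # placeholder
DEVICE = "/dev/xvdf"                    # device name recorded before detaching
CHAOS_DURATION = 60                     # seconds

# Detach the EBS volume and wait until it is fully available.
ec2.detach_volume(VolumeId=VOLUME_ID)
ec2.get_waiter("volume_available").wait(VolumeIds=[VOLUME_ID])

time.sleep(CHAOS_DURATION)  # chaos window

# Re-attach the volume (skipped when a persistent volume self-attaches).
ec2.attach_volume(VolumeId=VOLUME_ID, InstanceId=INSTANCE_ID, Device=DEVICE)
ec2.get_waiter("volume_in_use").wait(VolumeIds=[VOLUME_ID])
```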
EBS loss by tag
EBS loss by tag disrupts the state of an EBS volume by detaching it from the node (or EC2 instance) using the provided tag for a certain duration.
- In the case of EBS persistent volumes, the volumes can self-attach, and the re-attachment step can be skipped.
Use cases
- It tests the deployment sanity (replica availability and uninterrupted service) and recovery workflows of the application pod.
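The tag variant differs only in how target volumes are discovered; a small sketch of the lookup (the tag key and value are placeholders), after which the same detach/re-attach cycle shown above applies.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find in-use volumes carrying the chaos tag (placeholder key/value).
volumes = ec2.describe_volumes(
    Filters=[
        {"Name": "tag:team", "Values": ["payments"]},
        {"Name": "status", "Values": ["in-use"]},
    ]
)["Volumes"]

targets = [v["VolumeId"] for v in volumes]
print("Candidate volumes:", targets)
```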
EC2 CPU hog
EC2 CPU hog disrupts the state of infrastructure resources. It induces CPU stress on the AWS EC2 instance using the Amazon SSM Run command, which is carried out using SSM docs that come in-built with the fault.
- It causes CPU chaos on the EC2 instance for a specific duration.
Use cases
- Induces CPU stress on the target AWS EC2 instance(s).
- Simulates a lack of CPU for processes running on the application, which degrades their performance.
- Simulates slow application traffic or exhaustion of the resources, leading to degradation in the performance of processes on the instance.
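The fault drives the stress through its in-built SSM document; the sketch below uses the generic AWS-RunShellScript document as a stand-in to show the shape of the SSM Run Command call. The instance ID, region, and stress parameters are placeholders, and stress-ng is assumed to be available on the instance.

```python
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder
CHAOS_DURATION = 60                  # seconds
CPU_CORES = 2

# Send a CPU stress command to the instance via SSM Run Command.
response = ssm.send_command(
    InstanceIds=[INSTANCE_ID],
    DocumentName="AWS-RunShellScript",  # stand-in for the fault's in-built SSM doc
    Parameters={
        "commands": [f"stress-ng --cpu {CPU_CORES} --timeout {CHAOS_DURATION}s"]
    },
)
command_id = response["Command"]["CommandId"]

# Wait for the command to finish and inspect the result.
ssm.get_waiter("command_executed").wait(CommandId=command_id, InstanceId=INSTANCE_ID)
print(ssm.get_command_invocation(CommandId=command_id, InstanceId=INSTANCE_ID)["Status"])
```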
EC2 DNS chaos
EC2 DNS chaos causes DNS errors on the specified EC2 instance for a specific duration. It determines the performance of the application (or process) running on the EC2 instance(s).
Use cases
- Determines the performance of the application (or process) running on the EC2 instance(s).
- Simulates unavailable (or distorted) network connectivity from the VM to the target hosts.
- Determines the impact of DNS chaos on the infrastructure and standalone tasks.
- Simulates unavailability of the DNS server (loss of access to any external domain from a given microservice, access to cloud provider dependencies, and access to specific third party services).
EC2 HTTP latency
EC2 HTTP latency disrupts the state of infrastructure resources. This fault induces HTTP chaos on an AWS EC2 instance using the Amazon SSM Run command, carried out using SSM docs that come in-built with the fault.
- It injects HTTP response latency into the service whose port is specified using the TARGET_SERVICE_PORT environment variable, by starting a proxy server and redirecting the traffic through the proxy server.
- It introduces HTTP latency chaos on the EC2 instance using an SSM doc for a certain chaos duration.
Use cases
- Delays the network connectivity from the VM to the target hosts.
- Simulates latency to specific API services for (or from) a given microservice.
- Simulates a slow response on specific third party (or dependent) components (or services).
EC2 HTTP modify body
EC2 HTTP modify body injects HTTP chaos which affects the request (or response) by modifying the status code, body, or headers. This is done by starting a proxy server and redirecting the traffic through the proxy server.
Use cases
- Tests the application's resilience to an erroneous (or incorrect) HTTP response body.
EC2 HTTP modify header
EC2 HTTP modify header injects HTTP chaos which affects the request (or response) by modifying the status code (or the body or the headers) by starting the proxy server and redirecting the traffic through the proxy server. It modifies the headers of requests and responses of the service.
Use cases
- This can be used to test service resilience towards incorrect or incomplete headers.
EC2 HTTP reset peer
EC2 HTTP reset peer injects HTTP reset on the service whose port is specified using the TARGET_SERVICE_PORT environment variable.
- It stops the outgoing HTTP requests by resetting the TCP connection for the requests.
Use cases
- Verifies connection timeout by simulating premature connection loss (firewall issues or other issues) between microservices.
- Simulates connection resets due to resource limitations on the server side, such as an out-of-memory server, a killed process, or an overloaded server due to a high amount of traffic.
- Determines the application's resilience to a lossy (or flaky) HTTP connection.
EC2 HTTP status code
EC2 HTTP status code injects HTTP chaos that affects the request (or response) by modifying the status code (or the body or the headers) by starting a proxy server and redirecting the traffic through the proxy server.
Use cases
- Tests the application's resilience to erroneous code HTTP responses from the application server.
- Simulates unavailability of specific API services (503, 404).
- Simulates unavailability of specific APIs for (or from) a given microservice (TBD or Path Filter) (404).
- Simulates unauthorized requests for 3rd party services (401 or 403), and API malfunction (internal server error) (50x).
EC2 IO stress
EC2 IO stress disrupts the state of infrastructure resources.
- The fault induces stress on the AWS EC2 instance using the Amazon SSM Run command, which is carried out using SSM docs that come in-built with the fault.
- It causes IO stress on the EC2 instance for a certain duration.
Use cases
- Simulates slower disk operations by the application.
- Simulates noisy neighbour problems by hogging the disk bandwidth.
- Verifies the disk performance on increasing IO threads and varying IO block sizes.
- Checks how the application functions under high disk latency conditions, when IO traffic is high and includes large I/O blocks, and when other services monopolize the IO disks.
EC2 memory hog
EC2 memory hog disrupts the state of infrastructure resources.
- The fault induces stress on the AWS EC2 instance using the Amazon SSM Run command, which is carried out using SSM docs that come in-built with the fault.
- It causes memory exhaustion on the EC2 instance for a specific duration.
Use cases
- Causes memory stress on the target AWS EC2 instance(s).
- Simulates the situation of memory leaks in the deployment of microservices.
- Simulates application slowness due to memory starvation, and noisy neighbour problems due to hogging.
EC2 network latency
EC2 network latency causes flaky access to the application (or services) by injecting network packet latency to EC2 instance(s). This fault:
- Degrades the network without marking the EC2 instance as unhealthy (or unworthy) of traffic, which is resolved using a middleware that switches traffic based on SLOs (performance parameters).
- May cause the EC2 instance to stall or get corrupted while waiting endlessly for a packet.
- Limits the impact (blast radius) to the traffic that you wish to test, by specifying the IP addresses.
Use cases
- Determines the performance of the application (or process) running on the EC2 instances.
- Simulates a consistently slow network connection between microservices (for example, cross-region connectivity between active-active peers of a given service or across services or poor cni-performance in the inter-pod-communication network).
- Simulates jittery connection with transient latency spikes between microservices.
- Simulates a slow response on specific third party (or dependent) components (or services), and degraded data-plane of service-mesh infrastructure.
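A hedged sketch of the kind of traffic-control command such a fault typically injects via SSM; the interface name, latency value, instance ID, and region are assumptions, and the actual fault uses its own SSM document and tunables.

```python
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder
INTERFACE = "eth0"                   # assumed primary network interface
LATENCY_MS = 200
CHAOS_DURATION = 60                  # seconds

# Add egress latency with tc/netem, hold for the chaos window, then clean up.
commands = [
    f"tc qdisc add dev {INTERFACE} root netem delay {LATENCY_MS}ms",
    f"sleep {CHAOS_DURATION}",
    f"tc qdisc del dev {INTERFACE} root netem",
]
ssm.send_command(
    InstanceIds=[INSTANCE_ID],
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": commands},
)
```

A network loss fault can follow the same pattern, swapping the delay rule for a netem packet-loss rule scoped to the chosen destination IPs.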
EC2 network loss
EC2 network loss causes flaky access to the application (or services) by injecting network packet loss to EC2 instance(s). This fault:
- Degrades the network without marking the EC2 instance as unhealthy (or unworthy) of traffic, which is resolved using a middleware that switches traffic based on SLOs (performance parameters).
- May cause the EC2 instance to stall or get corrupted while waiting endlessly for a packet.
- Limits the impact (blast radius) to the traffic that you wish to test, by specifying the IP addresses.
Use cases
- Determines the performance of the application (or process) running on the EC2 instances.
- Simulates a consistently slow network connection between microservices (for example, cross-region connectivity between active-active peers of a given service or across services or poor cni-performance in the inter-pod-communication network).
- Simulates jittery connection with transient latency spikes between microservices.
- Simulates a slow response on specific third party (or dependent) components (or services), and degraded data-plane of service-mesh infrastructure.
EC2 process kill
EC2 process kill fault kills the target processes running on an EC2 instance. This fault disrupts application-critical processes, such as databases or message queues, running on the EC2 instance by killing their underlying processes or threads.
Use cases
EC2 process kill determines the resilience of applications when processes on EC2 instances are unexpectedly killed (or disrupted).
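A minimal sketch of the underlying action, again via SSM Run Command; the instance ID, region, and process name are placeholders.

```python
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder
PROCESS_NAME = "nginx"               # placeholder target process

# Kill the target process on the instance; resilience is judged by whether
# the application (or its supervisor) recovers it.
ssm.send_command(
    InstanceIds=[INSTANCE_ID],
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": [f"pkill -9 {PROCESS_NAME}"]},
)
```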
EC2 stop by ID
EC2 stop by ID stops an EC2 instance using the provided instance ID or list of instance IDs.
- It brings back the instance after a specific duration.
- It checks the performance of the application (or process) running on the EC2 instance.
- When the MANAGED_NODEGROUP environment variable is enabled, the fault does not try to start the instance after chaos. Instead, it checks for the addition of a new node instance to the cluster.
Use cases
- Determines the performance of the application (or process) running on the EC2 instance.
- Determines the resilience of an application to unexpected halts in the EC2 instance by validating its failover capabilities.
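A minimal boto3 sketch of the stop/start cycle; the instance ID, region, and duration are placeholders, and when MANAGED_NODEGROUP is enabled the restart step is replaced by a check for a replacement node.

```python
import time
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

INSTANCE_ID = "i-0123456789abcdef0"  # placeholder
CHAOS_DURATION = 120                 # seconds

# Stop the instance and wait until it is fully stopped.
ec2.stop_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_stopped").wait(InstanceIds=[INSTANCE_ID])

time.sleep(CHAOS_DURATION)  # chaos window

# Bring the instance back (skipped for managed node groups, where the
# node group is expected to supply a replacement node instead).
ec2.start_instances(InstanceIds=[INSTANCE_ID])
ec2.get_waiter("instance_running").wait(InstanceIds=[INSTANCE_ID])
```

The tag variant works the same way, except that the target instances are first discovered with a describe call filtered on the provided tag.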
EC2 stop by tag
EC2 stop by tag stops an EC2 instance using the provided tag.
- It brings back the instance after a specific duration.
- It checks the performance of the application (or process) running on the EC2 instance.
- When the MANAGED_NODEGROUP environment variable is enabled, the fault does not try to start the instance after chaos. Instead, it checks for the addition of a new node instance to the cluster.
Use cases
- Determines the performance of the application (or process) running on the EC2 instance.
- Determines the resilience of an application to unexpected halts in the EC2 instance by validating its failover capabilities.
ECS agent stop
ECS agent stop disrupts the state of infrastructure resources. This fault:
- Induces an agent stop chaos on AWS ECS using the Amazon SSM Run command, which is carried out using SSM documentation that comes in-built with the fault for the given chaos scenario.
- Causes the agent container to stop on ECS for a specific duration, with a given CLUSTER_NAME environment variable, using SSM documentation. Killing the agent container disrupts the performance of the task containers.
Use cases
- ECS agent stop halts the agent that manages the task container on the ECS cluster, thereby impacting its delivery.
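A hedged sketch of stopping the ECS agent on a container instance through SSM; it assumes an Amazon Linux ECS-optimized AMI where the agent runs as the `ecs` service, and the instance ID and region are placeholders.

```python
import boto3

ssm = boto3.client("ssm", region_name="us-east-1")

CONTAINER_INSTANCE_EC2_ID = "i-0123456789abcdef0"  # placeholder ECS container instance
CHAOS_DURATION = 60                                # seconds

# Stop the ECS agent, hold for the chaos window, then restart it.
commands = [
    "systemctl stop ecs",
    f"sleep {CHAOS_DURATION}",
    "systemctl start ecs",
]
ssm.send_command(
    InstanceIds=[CONTAINER_INSTANCE_EC2_ID],
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": commands},
)
```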
ECS container CPU hog
ECS container CPU hog disrupts the state of infrastructure resources. It induces stress on the AWS ECS container using Amazon SSM Run command, which is carried out using SSM documentation that is in-built into the fault. This fault:
- Causes CPU chaos on the containers of the ECS task using the given CLUSTER_NAME environment variable for a specific duration.
- To select the task under chaos (TUC), use the service name associated with the task. If you provide the service name along with the cluster name, all the tasks associated with the given service are selected as chaos targets.
- This experiment induces chaos within a container and depends on an EC2 instance. Typically, these are prefixed with "ECS container" and involve direct interaction with the EC2 instances hosting the ECS containers.
Use cases
- Evicts the application (task container) thereby impacting its delivery. These issues are known as noisy neighbour problems.
- Simulates a lack of CPU for processes running on the application, which degrades their performance.
- Verifies metrics-based horizontal pod autoscaling as well as vertical autoscale, that is, demand-based CPU addition.
- Scales the nodes based on growth beyond budgeted pods.
- Verifies the autopilot functionality of (cloud) managed clusters.
- Verifies multi-tenant load issue, wherein when the load increases on one container, it does not cause downtime in other containers.
- Tests the ECS task sanity (service availability) and recovery of the task containers subject to CPU stress.
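A sketch of how the CLUSTER_NAME tunable maps to the EC2 instances that actually receive the SSM stress command; the cluster name, region, and stress parameters are placeholders, and the real fault narrows targets to the containers of the selected task rather than stressing whole hosts.

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")
ssm = boto3.client("ssm", region_name="us-east-1")

CLUSTER_NAME = "demo-cluster"   # placeholder, corresponds to the CLUSTER_NAME tunable
CHAOS_DURATION = 60             # seconds

# Resolve the ECS container instances of the cluster to their EC2 instance IDs.
container_instance_arns = ecs.list_container_instances(cluster=CLUSTER_NAME)[
    "containerInstanceArns"
]
described = ecs.describe_container_instances(
    cluster=CLUSTER_NAME, containerInstances=container_instance_arns
)
ec2_ids = [ci["ec2InstanceId"] for ci in described["containerInstances"]]

# Inject CPU stress on the hosts via SSM (stand-in for the fault's in-built doc).
ssm.send_command(
    InstanceIds=ec2_ids,
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": [f"stress-ng --cpu 2 --timeout {CHAOS_DURATION}s"]},
)
```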
ECS container HTTP latency
ECS container HTTP latency induces HTTP chaos on containers running in an Amazon ECS (Elastic Container Service) task. This fault introduces latency in the HTTP responses of containers of a specific service using a proxy server, simulating delays in network connectivity or slow responses from the dependent services.
Use cases
- Modifies the HTTP responses of containers in a specified ECS service by starting a proxy server and redirecting traffic through the proxy server.
- Simulates scenarios where containers experience delays in network connectivity or slow responses from dependent services, which may impact the behavior of your application.
- Validates the behavior of your application and infrastructure during simulated HTTP latency, such as:
- Testing how your application handles delays in network connectivity from containers to dependent services.
- Verifying the resilience of your system when containers experience slow responses from dependent services.
- Evaluating the impact of HTTP latency on the performance and availability of your application.
ECS container HTTP modify body
ECS container HTTP modify body injects HTTP chaos which affects the request or response by modifying the status code, body, or headers. This is achieved by starting a proxy server and redirecting the traffic through the proxy server.
Use cases
- Tests the resilience of the ECS application container to an erroneous (or incorrect) HTTP response body.