Chaos faults for AWS
Introduction
AWS faults disrupt the resources running on different AWS services from the EKS cluster. To perform such AWS chaos experiments, you will need to authenticate CE with the AWS platform. This can be done in two ways.
- Using secrets: You can use secrets to authenticate CE with AWS regardless of whether the Kubernetes cluster is used for the deployment. This is Kubernetes' native way of authenticating CE with AWS.
- IAM integration: You can authenticate CE with AWS using IAM when you have deployed chaos on an EKS cluster. You associate an IAM role with a Kubernetes service account, and that service account provides AWS permissions to the experiment pod that uses it.
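As a rough illustration of the secrets option, the sketch below builds a Kubernetes Secret manifest that carries an AWS shared-credentials file. The secret name `cloud-secret`, the data key `cloud_config.yml`, the namespace, and the helper name `build_aws_secret` are assumptions for illustration; check your CE installation for the names it actually expects.

```python
import base64
import json


def build_aws_secret(access_key_id: str, secret_access_key: str,
                     name: str = "cloud-secret",           # assumed secret name
                     key: str = "cloud_config.yml",        # assumed data key
                     namespace: str = "default") -> dict:  # assumed namespace
    """Return a Kubernetes Secret manifest holding an AWS credentials file."""
    # Standard AWS shared-credentials layout with a single default profile.
    credentials = (
        "[default]\n"
        f"aws_access_key_id = {access_key_id}\n"
        f"aws_secret_access_key = {secret_access_key}\n"
    )
    # Kubernetes expects values under `data` to be base64-encoded strings.
    encoded = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {key: encoded},
    }


if __name__ == "__main__":
    # Placeholder values only; never commit real credentials.
    manifest = build_aws_secret("AKIAEXAMPLE", "example-secret-key")
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON manifests, the printed output can be saved to a file and applied with `kubectl apply -f`.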
Here are AWS faults that you can execute and validate.
ALB AZ down
ALB AZ down takes down availability zones (AZs) on a target application load balancer (ALB) for a specific duration, restricting access to those zones.
CLB AZ down
CLB AZ down takes down availability zones (AZs) on a target Classic Load Balancer (CLB) for a specific duration, restricting access to those zones.
AZ blackhole
AZ blackhole causes a network blackhole by isolating traffic in specific availability zones across an entire region. You can control the blast radius by providing the target VPC IDs for the AZ failure.
VPC route misconfiguration
VPC route misconfiguration causes network issues by misconfiguring the route table associated with the target VPC.
DynamoDB replication pause
DynamoDB replication pause pauses data replication between DynamoDB table replicas in different locations for the chaos duration.
- While the chaos experiment runs, changes to the DynamoDB table are not replicated to other regions, so the data in the table becomes inconsistent across regions.
- You can execute this fault only on a global DynamoDB table, that is, a table with more than one replica.
Use cases
DynamoDB replication pause determines the resilience of the application when data (in a database) that needs to be constantly updated is disrupted.
EBS loss by ID
EBS loss by ID detaches an EBS volume by volume ID for a configurable duration and reattaches it afterwards, so you can test how a workload behaves when its storage disappears.
EBS loss by tag
EBS loss by tag detaches EBS volumes selected by tag for a configurable duration and reattaches them afterwards, so you can test how workloads behave when a tagged subset of storage disappears.
Use cases
VOLUME_AFFECTED_PERC keeps the impact within the planned blast radius.
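To make the blast-radius idea concrete, the sketch below shows one way a percentage such as `VOLUME_AFFECTED_PERC` could map to a number of targets. The rounding rule (round down, but always pick at least one target when any exist) and the function names are assumptions for illustration, not the fault's documented behaviour.

```python
import random
from typing import List, Optional


def target_count(total: int, affected_perc: int) -> int:
    """Number of targets for a percentage, assuming floor with a minimum of one."""
    if total <= 0:
        return 0
    if not 0 <= affected_perc <= 100:
        raise ValueError("affected_perc must be between 0 and 100")
    return max(1, total * affected_perc // 100)


def pick_targets(volume_ids: List[str], affected_perc: int,
                 seed: Optional[int] = None) -> List[str]:
    """Randomly choose the subset of volumes the experiment would touch."""
    rng = random.Random(seed)  # A seed makes the selection reproducible.
    return rng.sample(volume_ids, target_count(len(volume_ids), affected_perc))
```

Under these assumptions, ten tagged volumes at 50 percent yield five targets, and three volumes at 10 percent still yield one.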
EC2 CPU hog
EC2 CPU hog stresses a configurable number of CPU cores at a configurable load percentage inside a target EC2 instance for a configurable duration, so you can test how the workload behaves when its host is CPU-starved.
EC2 DNS chaos
EC2 DNS chaos fails DNS resolution for selected hostnames on a target EC2 instance for a configurable duration, so you can test how the workload reacts when a dependency cannot be resolved.
EC2 HTTP latency
EC2 HTTP latency adds latency to inbound HTTP traffic on a configurable port of a target EC2 instance for a configurable duration, so you can test how clients react when an HTTP service responds slowly.
EC2 HTTP modify body
EC2 HTTP modify body rewrites HTTP response bodies on a configurable port of a target EC2 instance for a configurable duration, so you can test how clients react when an upstream returns unexpected content.
EC2 HTTP modify header
EC2 HTTP modify header adds, changes, or removes HTTP headers on requests or responses on a configurable port of a target EC2 instance for a configurable duration, so you can test how clients and servers react when headers are missing or malformed.
Use cases
Authorization is stripped from requests.
EC2 HTTP reset peer
EC2 HTTP reset peer resets inbound TCP connections to an HTTP service on a configurable port of a target EC2 instance for a configurable duration, so you can test how clients react when the server tears down connections mid-flight.
EC2 HTTP status code
EC2 HTTP status code rewrites HTTP response status codes on a configurable port of a target EC2 instance for a configurable duration, so you can test how clients react to specific error codes returned by an upstream service.
Use cases
429 handling, circuit-breaker open/close behaviour, and cache fallback on 502.
401/403) paths refresh tokens cleanly.
EC2 IO stress
EC2 IO stress generates sustained filesystem read and write load on a target EC2 instance for a configurable duration, so you can test how the workload behaves under disk pressure or near-full storage.
Use cases
ephemeral-storage limits evict pods as expected.
EC2 memory hog
EC2 memory hog consumes a configurable amount of memory inside a target EC2 instance for a configurable duration, so you can test how the workload behaves when its host is starved of memory.
EC2 network latency
EC2 network latency adds configurable latency and jitter to outbound traffic on a target EC2 instance for a configurable duration, so you can test how the workload reacts when network round-trip times grow.
EC2 network loss
EC2 network loss drops a configurable percentage of outbound packets on a target EC2 instance for a configurable duration, so you can test how the workload reacts when network reliability degrades.
Use cases
tcp_retransmits) and alerts.
EC2 process kill
EC2 process kill kills one or more processes by PID inside a target EC2 instance for a configurable duration, so you can test how the workload recovers when a critical process disappears without losing the host.
Use cases
FORCE between SIGTERM and SIGKILL.
EC2 stop by ID
EC2 stop by ID stops one or more EC2 instances identified by their instance IDs for a configurable duration and then starts them again, so you can test how the workload behaves when a specific host disappears. When MANAGED_NODEGROUP=enable, the fault waits for a replacement node from the auto-scaling group instead of starting the original instance.
EC2 stop by tag
EC2 stop by tag stops EC2 instances selected by tag for a configurable duration and starts them again afterwards, so you can test how a workload behaves when a tagged subset of capacity disappears.
Use cases
MANAGED_NODEGROUP=enable).
INSTANCE_AFFECTED_PERCENTAGE keeps the blast radius within plan.
ECS agent stop
ECS agent stop disrupts the state of infrastructure resources. This fault:
- Stops the ECS agent using the Amazon SSM Run Command, carried out through an SSM document that is built into the fault for the given chaos scenario.
- Stops the agent container on ECS for a specific duration, on the cluster specified by the CLUSTER_NAME environment variable. Killing the agent container disrupts the performance of the task containers.
Use cases
- ECS agent stop halts the agent that manages the task container on the ECS cluster, thereby impacting its delivery.
ECS container CPU hog
ECS container CPU hog disrupts the state of infrastructure resources. It induces stress on the AWS ECS container using the Amazon SSM Run Command, carried out through an SSM document that is built into the fault. This fault:
- Causes CPU chaos on the containers of the ECS task, in the cluster specified by the CLUSTER_NAME environment variable, for a specific duration.
- Selects the Task Under Chaos (TUC) by the service name associated with the task. If you provide the service name along with the cluster name, all the tasks associated with that service are selected as chaos targets.
- Induces chaos within a container and depends on an EC2 instance. Faults of this kind are typically prefixed with "ECS container" and interact directly with the EC2 instances hosting the ECS containers.
Use cases
- Evicts the application (task container) thereby impacting its delivery. These issues are known as noisy neighbour problems.
- Simulates a lack of CPU for processes running on the application, which degrades their performance.
- Verifies metrics-based horizontal pod autoscaling as well as vertical autoscaling, that is, demand-based CPU addition.
- Scales the nodes based on growth beyond budgeted pods.
- Verifies the autopilot functionality of (cloud) managed clusters.
- Verifies multi-tenant load handling, that is, that increased load on one container does not cause downtime in other containers.
- Tests the ECS task sanity (service availability) and recovery of the task containers subject to CPU stress.
ECS container HTTP latency
ECS container HTTP latency induces HTTP chaos on containers running in an Amazon ECS (Elastic Container Service) task. This fault introduces latency in the HTTP responses of containers of a specific service using a proxy server, simulating delays in network connectivity or slow responses from the dependent services.
ECS container HTTP modify body
ECS container HTTP modify body injects HTTP chaos which affects the request or response by modifying the status code, body, or headers. This is achieved by starting a proxy server and redirecting the traffic through the proxy server.
ECS container HTTP modify header
ECS container HTTP modify header injects HTTP chaos which modifies the headers of the request or response of the service.
- This is achieved by starting a proxy server and redirecting the traffic through the proxy server.
- This experiment induces chaos within a container and depends on an EC2 instance. Typically, these are prefixed with "ECS container" and involve direct interaction with the EC2 instances hosting the ECS containers.
Use cases
ECS container HTTP modify header tests the resilience of the ECS application container to erroneous or incorrect HTTP headers in the request or response.
ECS container HTTP reset peer
ECS container HTTP reset peer injects HTTP reset on the service whose port is specified using the TARGET_SERVICE_PORT environment variable.
- It stops the outgoing HTTP requests by resetting the TCP connection for the requests.
Use cases
- It determines the application's resilience to a lossy (or flaky) HTTP connection.
- Simulates premature connection loss (firewall issues or other issues) between microservices (verify connection timeout).
- Simulates connection resets due to resource limitations on the server side, such as an out-of-memory server, a killed process, or overload caused by heavy traffic.
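A client that tolerates this fault typically retries a reset connection with a bounded backoff. The sketch below is one minimal way to do that; the helper name `call_with_retry`, the retry limits, and the choice of exceptions are assumptions for illustration rather than anything the fault prescribes.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(operation: Callable[[], T], attempts: int = 3,
                    base_delay: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, retrying when the peer resets or drops the connection."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return operation()
        except (ConnectionResetError, ConnectionAbortedError, TimeoutError):
            if attempt == attempts - 1:
                raise  # Out of retries: surface the failure to the caller.
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Running the reset-peer fault against a service whose callers use a wrapper like this is a quick way to check that retries are bounded and that the final failure is still reported.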
ECS container HTTP status code
ECS container HTTP status code injects HTTP chaos that affects the request (or response) by modifying the status code (or the body or the headers). It does this by starting a proxy server on the target ECS containers and redirecting the traffic through it.
ECS container IO stress
ECS container IO stress disrupts the state of infrastructure resources. It induces stress on the AWS ECS container using the Amazon SSM Run Command, carried out through an SSM document that is built into the fault.
- It causes I/O stress on the containers of the ECS task, in the cluster specified by the CLUSTER_NAME environment variable, for a specific duration.
- To select the Task Under Chaos (TUC), use the service name associated with the task. If you provide the service name along with the cluster name, all the tasks associated with the given service will be selected as chaos targets.
- It tests the ECS task sanity (service availability) and recovery of the task containers subject to I/O stress.
- This experiment induces chaos within a container and depends on an EC2 instance. Faults of this kind are typically prefixed with "ECS container" and interact directly with the EC2 instances hosting the ECS containers.
Use cases
- Determines how a container recovers from memory exhaustion.
- Heavy file system reads and writes can evict the application (task container) and impact its delivery. These issues are also known as noisy-neighbour problems.
- Injecting a rogue process into a target container starves the main microservice process (typically pid 1) of the resources allocated to it (where the limits are defined). This slows down the application traffic or exhausts the resources leading to eviction of all task containers.
ECS container memory hog
ECS container memory hog disrupts the state of infrastructure resources. It induces stress on the AWS ECS container using the Amazon SSM Run Command, carried out through an SSM document that is built into the fault.
- It causes memory stress on the containers of the ECS task, in the cluster specified by the CLUSTER_NAME environment variable, for a specific duration.
- To select the Task Under Chaos (TUC), use the service name associated with the task. If you provide the service name along with the cluster name, all the tasks associated with the given service will be selected as chaos targets.
- It tests the ECS task sanity (service availability) and recovery of the task containers subject to memory stress.
Use cases
Memory usage inside containers is subject to constraints. If limits are specified, exceeding them can terminate the container (through an OOM kill of the primary process, often PID 1). The container is then restarted, depending on the policy specified. When there are no limits on the memory consumption of containers, containers on the instance can be killed based on their oom_score, which extends to all the task containers running on the instance and results in a bigger blast radius. This fault launches a stress process within the target container. That process either pushes the primary process against its resource limits or, when no limits are specified, eats up the available system memory on the instance.
ECS container network latency
ECS container network latency disrupts the state of infrastructure resources. It adds network delay to the AWS ECS container using the Amazon SSM Run Command, carried out through an SSM document that is built into the fault.
- It causes network stress on the containers of the ECS task, in the cluster specified by the CLUSTER_NAME environment variable, for a specific duration.
- To select the Task Under Chaos (TUC), use the service name associated with the task. If you provide the service name along with the cluster name, all the tasks associated with the given service will be selected as chaos targets.
- It tests the ECS task sanity (service availability) and recovery of the task containers subject to network stress.
Use cases
This fault degrades the network of the task container without the container being marked as unhealthy (or unworthy of traffic). It simulates issues within the ECS task network or in communication across services in different availability zones (or regions). You can mitigate such issues with middleware that switches traffic based on certain SLOs (or performance parameters), or surface the degradation through notifications (or alerts). The fault also determines the impact on the microservice: the task may stall or get corrupted while waiting endlessly for a packet. You can limit the impact (blast radius) to only the traffic you wish to test by specifying the service used to find the TUC (Task Under Chaos). This fault helps improve the resilience of the services over time.
ECS container network loss
ECS container network loss disrupts the state of infrastructure resources.
- The fault induces chaos on the AWS ECS container using the Amazon SSM Run Command, carried out through an SSM document that is built into the fault.
- It causes network disruption on the containers of the ECS task in the given cluster.
- To select the Task Under Chaos (TUC), use the service name associated with the task. If you provide the service name along with the cluster name, all the tasks associated with the given service will be selected as chaos targets.
- It tests the ECS task sanity (service availability) and recovery of the task containers subjected to network chaos.
Use cases
This fault degrades the network of the task container without the container being marked as unhealthy (or unworthy of traffic). It simulates issues within the ECS task network or in communication across services in different availability zones (or regions). You can mitigate such issues with middleware that switches traffic based on certain SLOs (or performance parameters), or surface the degradation through notifications (or alerts). The fault also determines the impact on the microservice: the task may stall or get corrupted while waiting endlessly for a packet. You can limit the impact (blast radius) to only the traffic you wish to test by specifying the service used to find the TUC (Task Under Chaos). The fault simulates a degraded network with varied percentages of dropped packets between microservices, loss of access to specific third-party (or dependent) services (or components), a blackhole against traffic to a given AZ (failure simulation of availability zones), and network partitions (split-brain) between peer replicas of a stateful application. This fault helps improve the resilience of the services over time.
ECS container volume detach
ECS container volume detach provides a mechanism to detach and remove volumes associated with ECS task containers in an Amazon ECS (Elastic Container Service) task.
This experiment primarily involves ECS Fargate and doesn't depend on EC2 instances. Faults of this kind focus on altering the state or resources of ECS containers without direct container interaction.
ECS Fargate CPU hog
ECS Fargate CPU hog generates high CPU load on a specific task running in an ECS service.
ECS Fargate memory hog
ECS Fargate memory hog generates high memory load on a specific task running in an ECS service.
ECS instance stop
ECS instance stop induces stress on an AWS ECS cluster. It derives the instance under chaos from the ECS cluster.
- It causes an EC2 instance to stop and be removed from the ECS cluster for a specific duration.
Use cases
ECS instance stop breaks the agent that manages the task container on the ECS cluster, thereby impacting its delivery. Killing the EC2 instance disrupts the performance of the task container.
ECS invalid container image
ECS invalid container image allows you to update the Docker image used by a container in an Amazon ECS (Elastic Container Service) task.
ECS network restrict
ECS network restrict allows you to restrict the network connectivity of containers in an Amazon ECS (Elastic Container Service) task by modifying the container security rules.
ECS task scale
ECS task scale is an AWS fault that injects chaos to scale (up or down) the ECS tasks based on the services and checks the task availability. This fault affects the availability of a task in an ECS cluster.
Use cases
ECS task scale affects the availability of a task in a cluster.
It determines the resilience of an application when ECS tasks are unexpectedly scaled up (or down).
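The scale-and-restore cycle implied by ECS task scale can be sketched against the ECS API. The example below accepts any object that exposes boto3's ECS `describe_services` and `update_service` calls; the function name, the fixed wait, and the restore step are assumptions for illustration and not the fault's actual implementation.

```python
import time
from typing import Any, Callable


def scale_service_temporarily(ecs: Any, cluster: str, service: str,
                              chaos_count: int, duration_s: float,
                              sleep: Callable[[float], None] = time.sleep) -> int:
    """Set a service's desired task count for a while, then restore it.

    `ecs` is expected to behave like boto3.client("ecs").
    Returns the original desired count.
    """
    described = ecs.describe_services(cluster=cluster, services=[service])
    original = described["services"][0]["desiredCount"]
    ecs.update_service(cluster=cluster, service=service, desiredCount=chaos_count)
    try:
        sleep(duration_s)  # Chaos window: observe task availability here.
    finally:
        # Always restore the original count, even if the wait is interrupted.
        ecs.update_service(cluster=cluster, service=service, desiredCount=original)
    return original
```

Passing the client in as a parameter keeps the sketch testable with a stub and leaves credentials and region selection to the caller.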
ECS task stop
ECS task stop is an AWS fault that injects chaos to stop the ECS tasks based on the services or task replica ID and checks the task availability.
- This fault results in the unavailability of the application running on the tasks.
Use cases
This fault determines the resilience of an application when ECS tasks stop unexpectedly because a task is unavailable.
ECS update container resource limit
ECS update container resource limit allows you to modify the CPU and memory resources of containers in an Amazon ECS (Elastic Container Service) task.
ECS update container timeout
ECS update container timeout modifies the start and stop timeout for ECS containers in Amazon ECS clusters. It allows you to specify how long containers are allowed to start or stop before they are considered failed.
ECS update task role
ECS update task role allows you to modify the IAM task role associated with an Amazon ECS (Elastic Container Service) task.
Generic experiment template
Generic experiment template provides a template to natively inject faults using AWS Fault Injection Service (FIS) for different services, such as EC2, EBS, DynamoDB, and so on.
- You need to create an FIS template and store it.
- Provide parameters to the pre-created FIS templates and execute experiments.
- You can specify the template ID and region on Harness to execute the experiments using these FIS templates.
- You can monitor and report the results of executing the experiment from these FIS templates.
Use cases
- Inject faults natively using FIS services.
- Monitor and report the results of executing the experiment from the FIS templates.
- Build chaos experiments with pre-defined templates or build experiments from scratch using FIS service.
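The workflow above (a stored FIS template, its ID, and a region) can be sketched with the AWS SDK. The example below accepts any object that exposes boto3's FIS `start_experiment` and `get_experiment` calls; the function name, the polling interval, and the set of states treated as final are assumptions for illustration, not how Harness runs the template.

```python
import time
import uuid
from typing import Any, Callable

# States treated as final in this sketch (an assumption, not an exhaustive list).
TERMINAL_STATES = {"completed", "stopped", "failed", "cancelled"}


def run_fis_template(fis: Any, template_id: str, poll_s: float = 10.0,
                     max_polls: int = 60,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Start an experiment from a stored FIS template and return its final status.

    `fis` is expected to behave like boto3.client("fis", region_name=...).
    """
    started = fis.start_experiment(
        clientToken=str(uuid.uuid4()),  # Idempotency token for the request.
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    for _ in range(max_polls):
        status = fis.get_experiment(id=experiment_id)["experiment"]["state"]["status"]
        if status in TERMINAL_STATES:
            return status
        sleep(poll_s)
    raise TimeoutError(f"experiment {experiment_id} did not finish in time")
```

The region is chosen when the client is created, which mirrors specifying the template ID and region together.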
Lambda delete event source mapping
Lambda delete event source mapping removes the event source mapping from an AWS Lambda function for a specific duration. Deleting an event source mapping from a Lambda function is critical. It can lead to failure in updating the database on an event trigger, which can break the service.
Lambda function layer detach
Lambda function layer detach is an AWS fault that detaches the Lambda layer associated with the function, thereby causing dependency-related issues or breaking the Lambda function that relies on the layer's content.
Lambda delete function concurrency
Lambda delete function concurrency deletes the Lambda function's reserved concurrency, thereby ensuring that the function has adequate unreserved concurrency to run.
Lambda toggle event mapping state
Lambda toggle event mapping state toggles (or sets) the event source mapping state to disable for a Lambda function during a specific duration. Toggling between different states of event source mapping from a Lambda function may lead to failures when updating the database on an event trigger. This can break the service and impact its delivery.
Lambda update function memory
Lambda update function memory updates the memory of a Lambda function to a specific value for a certain duration. This fault:
- Helps determine a safe overall memory limit for the function. The smaller the memory limit, the longer the Lambda function takes under load.
Use cases
- Helps build resilience to unexpected scenarios such as a Lambda function hitting its memory limit, which slows down the service and impacts its delivery. Running out of memory due to smaller limits interrupts the flow of the given function.
- Checks the performance of the application (or service) running with a new memory limit.
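One way to picture the update-and-revert behaviour is the sketch below. It accepts any object that exposes boto3's Lambda `get_function_configuration` and `update_function_configuration` calls; the function name and the fixed wait are assumptions for illustration, not the fault's actual implementation.

```python
import time
from typing import Any, Callable


def set_memory_temporarily(lambda_client: Any, function_name: str,
                           chaos_memory_mb: int, duration_s: float,
                           sleep: Callable[[float], None] = time.sleep) -> int:
    """Change a function's memory size for a while, then restore it.

    `lambda_client` is expected to behave like boto3.client("lambda").
    Returns the original memory size in MB.
    """
    config = lambda_client.get_function_configuration(FunctionName=function_name)
    original_mb = config["MemorySize"]
    lambda_client.update_function_configuration(
        FunctionName=function_name, MemorySize=chaos_memory_mb)
    try:
        sleep(duration_s)  # Chaos window: drive load and watch duration metrics.
    finally:
        # Revert to the original size even if the wait is interrupted.
        lambda_client.update_function_configuration(
            FunctionName=function_name, MemorySize=original_mb)
    return original_mb
```

A production version would likely also wait for each configuration update to finish before issuing the next one; the sketch omits that for brevity.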
Lambda update function timeout
Lambda update function timeout updates the timeout of a Lambda function to a specific value for a certain duration, which can cause the function to time out. Timeout errors interrupt the flow of the given function.
Hitting a timeout is a frequent scenario with Lambda functions. This can break the service and impact the delivery. Such scenarios can occur despite the availability aids provided by AWS.
Lambda inject status code
Use cases
- Assess how downstream services react when receiving non-standard or error HTTP status codes, ensuring that error-handling logic and fallback mechanisms are effective.
- Test the robustness of client applications and APIs when they encounter unexpected status codes, allowing for early detection of integration issues.
- Evaluate and fine-tune retry policies and error logging strategies by simulating intermittent faulty responses in a controlled manner.
- Check that integrated services handle delayed responses, ensuring that timeouts and fallback mechanisms are appropriately configured.
- Inject latency when interacting with external APIs or databases to determine if your system can maintain functionality under slower-than-expected response times.
- Evaluate the impact of delays typically experienced during cold starts or resource contention, and refine scaling strategies accordingly.
Lambda update role permission
Lambda update role permission is an AWS fault that modifies the role policies associated with a Lambda function. Sometimes, Lambda functions depend on services like RDS, DynamoDB, and S3. In such cases, certain permissions are required to access these services. This fault helps understand how your application would behave when a Lambda function does not have enough permissions to access the services.
Lambda modify response body
Lambda modify response body modifies the response body of a Lambda function at runtime, simulating unexpected output alterations. This interrupts the flow of the given function.
NLB AZ down
NLB AZ down takes down access to availability zones (AZs) on a target network load balancer (NLB) for a specific duration, restricting access to those zones.
RDS instance delete
RDS instance delete deletes a target RDS DB instance, so you can test how applications behave when a database disappears permanently and how disaster-recovery procedures handle the loss.
RDS instance reboot
RDS instance reboot reboots a target RDS DB instance (with optional Multi-AZ failover) for a configurable duration, so you can test how applications behave when their database restarts.
Use cases
FAILOVER=true.
Resource access restrict
Resource access restrict restricts access to a specific AWS resource for a specific duration.
SSM chaos by ID
SSM chaos by ID runs an arbitrary AWS Systems Manager document against a target EC2 instance selected by ID, so you can inject custom chaos that is not covered by a dedicated fault.
SSM chaos by tag
SSM chaos by tag runs an arbitrary AWS Systems Manager document against EC2 instances selected by tag, so you can inject custom chaos against a logical group of hosts.
Use cases
INSTANCE_AFFECTED_PERC.
Windows EC2 blackhole chaos
Windows EC2 blackhole chaos results in access loss to the given target hosts or IPs by injecting firewall rules.
Windows EC2 CPU hog
Windows EC2 CPU hog induces CPU stress on the AWS Windows EC2 instances using the Amazon SSM Run Command.
Windows EC2 memory hog
Windows EC2 memory hog induces memory stress on the target AWS Windows EC2 instance using the Amazon SSM Run Command.
Windows EC2 network latency
Windows EC2 network latency delays network packets on the Windows VM of the target EC2 instance(s) using Clumsy.
Windows EC2 network loss
Windows EC2 network loss causes network packet loss on the Windows VM of the target EC2 instance(s) using Clumsy. It results in flaky access to the application and lets you check the performance of the services running on the Windows VMs under network loss conditions.
Windows EC2 process kill
Windows EC2 process kill kills the target processes running on a Windows EC2 instance. This fault disrupts application-critical processes running on the instance by killing their underlying processes or threads.
Lambda block TCP connection
Lambda block TCP connection is an AWS fault that simulates network blocks for TCP connections of a Lambda function. This fault helps you evaluate how your application responds when outbound TCP connections from a Lambda function are blocked.