Chaos faults for AWS
Introduction
AWS faults disrupt the resources running on different AWS services from the EKS cluster. To perform such AWS chaos experiments, you will need to authenticate CE with the AWS platform. This can be done in two ways.
- Using secrets: You can use secrets to authenticate CE with AWS regardless of whether the Kubernetes cluster is used for the deployment. This is the Kubernetes-native way to authenticate CE with AWS.
- IAM integration: You can authenticate CE with AWS using IAM when you have deployed chaos on an EKS cluster. You associate an IAM role with a Kubernetes service account, and that service account then provides AWS permissions to any experiment pod that uses it.
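As a rough illustration of the two options above, the following sketch builds a Kubernetes Secret manifest that holds an AWS credentials file, and a service account manifest annotated with an IAM role. The secret name `cloud-secret`, the key `cloud_config.yml`, the `chaos` namespace, the service account name, and the role ARN are placeholder assumptions, not values taken from this page. The `eks.amazonaws.com/role-arn` annotation is the standard one that EKS uses for IAM roles for service accounts.

```python
import base64
import json


def aws_credentials_secret(access_key_id, secret_access_key,
                           name="cloud-secret", namespace="chaos",
                           key="cloud_config.yml"):
    """Build a Secret manifest wrapping an AWS shared-credentials file.

    The default name, namespace, and key are illustrative assumptions.
    """
    credentials = (
        "[default]\n"
        f"aws_access_key_id = {access_key_id}\n"
        f"aws_secret_access_key = {secret_access_key}\n"
    )
    # Kubernetes expects Secret data values to be base64-encoded strings.
    encoded = base64.b64encode(credentials.encode("utf-8")).decode("ascii")
    return {
        "apiVersion": "v1",
        "kind": "Secret",
        "metadata": {"name": name, "namespace": namespace},
        "type": "Opaque",
        "data": {key: encoded},
    }


def irsa_service_account(role_arn, name="chaos-sa", namespace="chaos"):
    """Build a ServiceAccount manifest bound to an IAM role on EKS."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


if __name__ == "__main__":
    # Placeholder values; never commit real credentials.
    print(json.dumps(aws_credentials_secret("AKIAEXAMPLE", "example-secret"), indent=2))
    print(json.dumps(irsa_service_account("arn:aws:iam::111122223333:role/chaos-role"), indent=2))
```

Either manifest can be serialized and applied with your usual Kubernetes tooling. With the IAM option, no long-lived keys are stored in the cluster.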
Here are AWS faults that you can execute and validate.
ALB AZ down
ALB AZ down detaches one or more availability zones from an Application Load Balancer for a configurable duration, then reattaches them, so you can test how multi-AZ workloads behave when a single AZ disappears from the load balancer rotation.
CLB AZ down
CLB AZ down disables one or more availability zones on a Classic Load Balancer for a configurable duration, then re-enables them, so you can test how multi-AZ workloads behave when an AZ disappears from a CLB.
Use cases
- … N-1 AZ capacity is sufficient on the remaining AZs.
AZ blackhole
AZ blackhole isolates network traffic in one or more AWS Availability Zones (optionally scoped to specific VPCs or subnets) for a configurable duration and restores connectivity afterwards, so you can test how multi-AZ workloads handle a zone-level outage.
VPC route misconfiguration
VPC route misconfiguration temporarily removes specified CIDR routes from one or more VPC route tables for a configurable duration and restores them afterwards, so you can test how the workload behaves when egress to a Transit Gateway, NAT Gateway, VPC peer, or internet gateway disappears.
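Many faults on this page share the same shape: apply a change, hold it for a duration, then restore the original state. The sketch below illustrates that shape for route removal using an in-memory dictionary in place of a real VPC route table. It calls no AWS API, and the names `routes_removed` and `run_fault` are hypothetical, not part of any product.

```python
import time
from contextlib import contextmanager


@contextmanager
def routes_removed(route_table, cidrs):
    """Temporarily remove destination CIDRs from an in-memory route table.

    route_table is a plain dict {destination_cidr: target} standing in for
    a VPC route table.
    """
    removed = {cidr: route_table.pop(cidr) for cidr in cidrs if cidr in route_table}
    try:
        yield removed
    finally:
        # Restore even if the body raises, so the fault never outlives the run.
        route_table.update(removed)


def run_fault(route_table, cidrs, duration_seconds, probe):
    """Remove routes, call probe() while they are gone, wait, then restore."""
    with routes_removed(route_table, cidrs):
        result = probe()
        time.sleep(duration_seconds)
    return result


if __name__ == "__main__":
    table = {
        "0.0.0.0/0": "nat-gateway",
        "10.1.0.0/16": "transit-gateway",
        "10.0.0.0/16": "local",
    }
    during = run_fault(table, ["0.0.0.0/0"], 1, lambda: sorted(table))
    print("routes during fault:", during)
    print("routes after fault:", sorted(table))
```

Putting the restore step in a `finally` block is the key design choice: the original routes come back even when the probe fails partway through.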
DynamoDB replication pause
DynamoDB replication pause pauses cross-region replication on one or more Amazon DynamoDB global tables for a configurable duration using an AWS Fault Injection Service (FIS) experiment, so you can test how your application handles a brief stop in multi-region consistency.
EBS loss by ID
EBS loss by ID detaches an EBS volume by volume ID for a configurable duration and reattaches it afterwards, so you can test how a workload behaves when its storage disappears.
EBS loss by tag
EBS loss by tag detaches EBS volumes selected by tag for a configurable duration and reattaches them afterwards, so you can test how workloads behave when a tagged subset of storage disappears.
Use cases
- … VOLUME_AFFECTED_PERC keeps the impact within the planned blast radius.
EC2 CPU hog
EC2 CPU hog stresses a configurable number of CPU cores at a configurable load percentage inside a target EC2 instance for a configurable duration, so you can test how the workload behaves when its host is CPU-starved.
EC2 DNS chaos
EC2 DNS chaos fails DNS resolution for selected hostnames on a target EC2 instance for a configurable duration, so you can test how the workload reacts when a dependency cannot be resolved.
EC2 HTTP latency
EC2 HTTP latency adds latency to inbound HTTP traffic on a configurable port of a target EC2 instance for a configurable duration, so you can test how clients react when an HTTP service responds slowly.
EC2 HTTP modify body
EC2 HTTP modify body rewrites HTTP response bodies on a configurable port of a target EC2 instance for a configurable duration, so you can test how clients react when an upstream returns unexpected content.
EC2 HTTP modify header
EC2 HTTP modify header adds, changes, or removes HTTP headers on requests or responses on a configurable port of a target EC2 instance for a configurable duration, so you can test how clients and servers react when headers are missing or malformed.
Use cases
- … Authorization is stripped from requests.
EC2 HTTP reset peer
EC2 HTTP reset peer resets inbound TCP connections to an HTTP service on a configurable port of a target EC2 instance for a configurable duration, so you can test how clients react when the server tears down connections mid-flight.
EC2 HTTP status code
EC2 HTTP status code rewrites HTTP response status codes on a configurable port of a target EC2 instance for a configurable duration, so you can test how clients react to specific error codes returned by an upstream service.
Use cases
- … 429 handling, circuit-breaker open/close behaviour, and cache fallback on 502.
- … 401/403) paths refresh tokens cleanly.
EC2 IO stress
EC2 IO stress generates sustained filesystem read and write load on a target EC2 instance for a configurable duration, so you can test how the workload behaves under disk pressure or near-full storage.
Use cases
- … ephemeral-storage limits evict pods as expected.
EC2 memory hog
EC2 memory hog consumes a configurable amount of memory inside a target EC2 instance for a configurable duration, so you can test how the workload behaves when its host is starved of memory.
EC2 network latency
EC2 network latency adds configurable latency and jitter to outbound traffic on a target EC2 instance for a configurable duration, so you can test how the workload reacts when network round-trip times grow.
EC2 network loss
EC2 network loss drops a configurable percentage of outbound packets on a target EC2 instance for a configurable duration, so you can test how the workload reacts when network reliability degrades.
Use cases
- … tcp_retransmits) and alerts.
EC2 process kill
EC2 process kill kills one or more processes by PID inside a target EC2 instance for a configurable duration, so you can test how the workload recovers when a critical process disappears without losing the host.
Use cases
- … FORCE between SIGTERM and SIGKILL.
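The difference between SIGTERM and SIGKILL matters when you test process recovery: SIGTERM can be caught and handled for a graceful shutdown, while SIGKILL cannot. The sketch below shows the two signals on a throwaway local child process. Mapping a `force` flag onto the two signals is an assumption made for illustration, not the documented behaviour of the FORCE tunable, and the code is POSIX-only.

```python
import os
import signal
import subprocess
import sys


def kill_process(pid, force=False):
    """Send SIGKILL when force is true, otherwise SIGTERM.

    The flag-to-signal mapping is an illustrative assumption.
    """
    sig = signal.SIGKILL if force else signal.SIGTERM
    os.kill(pid, sig)
    return sig


def demo(force):
    """Start a sleeping child process, kill it, and report how it exited."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    sent = kill_process(child.pid, force=force)
    child.wait(timeout=10)
    # On POSIX, a negative return code is the number of the terminating signal.
    return sent, child.returncode


if __name__ == "__main__":
    for force in (False, True):
        sent, returncode = demo(force)
        print(f"force={force}: sent {sent.name}, child return code {returncode}")
```

A process that needs to flush state on shutdown should be exercised with both signals, since only the SIGTERM path gives it a chance to clean up.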
EC2 stop by ID
EC2 stop by ID stops one or more EC2 instances identified by their instance IDs for a configurable duration and then starts them again, so you can test how the workload behaves when a specific host disappears. When MANAGED_NODEGROUP=enable, the fault waits for a replacement node from the auto-scaling group instead of starting the original instance.
EC2 stop by tag
EC2 stop by tag stops EC2 instances selected by tag for a configurable duration and starts them again afterwards, so you can test how a workload behaves when a tagged subset of capacity disappears.
Use cases
- … MANAGED_NODEGROUP=enable).
- … INSTANCE_AFFECTED_PERCENTAGE keeps the blast radius within plan.
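Tag-based faults limit their blast radius with a percentage tunable such as INSTANCE_AFFECTED_PERCENTAGE. The sketch below shows one plausible way to turn a percentage into a concrete target list. Rounding up and always selecting at least one target are assumptions; the actual fault may round differently, and `select_targets` is a hypothetical helper.

```python
import math
import random


def select_targets(candidates, affected_percentage, seed=None):
    """Pick a random subset covering roughly affected_percentage of candidates.

    Rounding up and selecting at least one target are assumptions made for
    this sketch.
    """
    if not 0 <= affected_percentage <= 100:
        raise ValueError("affected_percentage must be between 0 and 100")
    candidates = list(candidates)
    if not candidates:
        return []
    count = max(1, math.ceil(len(candidates) * affected_percentage / 100))
    # A seeded generator makes the selection reproducible between runs.
    return random.Random(seed).sample(candidates, count)


if __name__ == "__main__":
    instances = [f"i-{n:04d}" for n in range(10)]  # placeholder instance IDs
    print(select_targets(instances, 25, seed=7))
```

Computing the selection before the run lets you confirm that the number of affected instances matches the planned blast radius.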
ECS agent stop
ECS agent stop halts the ECS agent on one or more EC2 container instances belonging to an ECS cluster for a configurable duration, so you can test how the cluster behaves when the data-plane bridge between agent and control plane is interrupted.
ECS container CPU hog
ECS container CPU hog stresses CPU inside containers of EC2-backed ECS tasks for a configurable duration, so you can test how the application and the host behave under CPU saturation.
ECS container HTTP latency
ECS container HTTP latency adds latency to inbound HTTP traffic on a configurable port of containers in an EC2-backed ECS task for a configurable duration, so you can test how clients react when an HTTP service responds slowly.
ECS container HTTP modify body
ECS container HTTP modify body rewrites HTTP response bodies on a configurable port of containers in an EC2-backed ECS task for a configurable duration, so you can test how clients react when an upstream returns unexpected content.
ECS container HTTP reset peer
ECS container HTTP reset peer resets inbound TCP connections to an HTTP service on a configurable port of containers in an EC2-backed ECS task for a configurable duration, so you can test how clients react when the server tears down connections mid-flight.
ECS container HTTP status code
ECS container HTTP status code rewrites HTTP response status codes on a configurable port of containers in an EC2-backed ECS task for a configurable duration, so you can test how clients react to specific error codes returned by an upstream service.
Use cases
- … 429 handling, circuit-breaker open/close behaviour, and cache fallback on 502.
- … 401/403) paths refresh tokens cleanly.
ECS container IO stress
ECS container IO stress generates sustained filesystem read and write load inside containers of EC2-backed ECS tasks for a configurable duration, so you can test how the workload behaves under disk pressure.
ECS container memory hog
ECS container memory hog consumes a configurable amount of memory inside containers of EC2-backed ECS tasks for a configurable duration, so you can test how the workload behaves when its container is starved of memory.
ECS container network latency
ECS container network latency adds configurable latency to outbound traffic from containers in EC2-backed ECS tasks for a configurable duration, so you can test how the workload reacts when network round-trip times grow.
ECS container network loss
ECS container network loss drops a configurable percentage of outbound packets from containers in EC2-backed ECS tasks for a configurable duration, so you can test how the workload reacts when network reliability degrades.
ECS container volume detach
ECS container volume detach detaches EBS volumes attached to ECS task containers for a configurable duration and reattaches them afterwards, so you can test how stateful tasks behave when their storage disappears.
ECS Fargate CPU hog
ECS Fargate CPU hog stresses CPU inside a Fargate task for a configurable duration, so you can test how the application behaves when its task is CPU-starved.
ECS Fargate memory hog
ECS Fargate memory hog consumes a configurable amount of memory inside a Fargate task for a configurable duration, so you can test how the application behaves when its task is starved of memory.
ECS instance stop
ECS instance stop stops one or more EC2 container instances belonging to an ECS cluster for a configurable duration, then starts them again, so you can test how the cluster reschedules tasks and how the workload behaves when a host disappears.
ECS invalid container image
ECS invalid container image swaps the container image on an ECS service to an invalid value for a configurable duration, then restores the original image, so you can test how deployments, rollbacks, and monitoring react to a failing image pull.
Use cases
- … deploymentConfiguration) prevents traffic from shifting to the bad revision.
ECS network restrict
ECS network restrict modifies the security group rules of an ECS service for a configurable duration and restores them afterwards, so you can test how the workload behaves when outbound or inbound network access is restricted.
ECS task scale
ECS task scale changes the desired count of an ECS service for a configurable duration and restores it afterwards, so you can test how the workload behaves under sudden scale-up or scale-down.
ECS task stop
ECS task stop stops one or more ECS tasks (selected by service or task ID) for a configurable duration, so you can test how the workload behaves when a specific task disappears.
ECS update container resource limit
ECS update container resource limit re-registers an ECS task definition with reduced CPU or memory limits for a configurable duration and restores the original limits afterwards, so you can test how the workload behaves under tightened resource constraints.
ECS update container timeout
ECS update container timeout re-registers an ECS task definition with modified container start or stop timeouts for a configurable duration and restores the originals afterwards, so you can test how the deployment behaves when container start or stop takes longer than expected.
ECS update task role
ECS update task role swaps the IAM task role on an ECS service for a configurable duration and restores the original afterwards, so you can test how the workload behaves when its IAM permissions change.
Generic experiment template
Generic experiment template (also known as Generic FIS experiment template) starts a pre-existing AWS Fault Injection Service (FIS) template by ID, so you can fold native AWS-managed faults into a Harness chaos experiment and probe, verify, and report on the result as you do with any other Harness fault.
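To make "starts a pre-existing FIS template by ID" concrete, the sketch below starts an experiment from a template and polls until it reaches a terminal state. It is not the fault's implementation. It assumes `fis_client` behaves like `boto3.client("fis")`, whose `start_experiment` and `get_experiment` calls it uses; `run_fis_template` and the template ID shown are hypothetical.

```python
import time
import uuid

# FIS experiment statuses after which no further progress is expected.
TERMINAL_STATUSES = {"completed", "stopped", "failed", "cancelled"}


def run_fis_template(fis_client, template_id, poll_seconds=10, timeout_seconds=600):
    """Start an FIS experiment from a template and wait for a terminal status.

    fis_client is expected to behave like boto3.client("fis").
    """
    started = fis_client.start_experiment(
        clientToken=str(uuid.uuid4()),  # idempotency token for the request
        experimentTemplateId=template_id,
    )
    experiment_id = started["experiment"]["id"]
    deadline = time.monotonic() + timeout_seconds
    while True:
        experiment = fis_client.get_experiment(id=experiment_id)["experiment"]
        status = experiment["state"]["status"]
        if status in TERMINAL_STATUSES:
            return experiment_id, status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"experiment {experiment_id} still {status}")
        time.sleep(poll_seconds)


if __name__ == "__main__":
    import boto3  # third-party dependency; needs AWS credentials configured

    # "EXT-placeholder" is not a real template ID.
    print(run_fis_template(boto3.client("fis"), "EXT-placeholder"))
```

Passing the client in as a parameter keeps the polling logic testable with a stub and leaves region and credential handling to the caller.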
Lambda block TCP connection
Lambda block TCP connection blocks outbound TCP connections from an AWS Lambda function to one or more target hostnames for a configurable duration, so you can test how the function behaves when a TCP-based dependency is unreachable.
Lambda delete event source mapping
Lambda delete event source mapping deletes one or more event source mappings on an AWS Lambda function for a configurable duration and recreates them afterwards, so you can test how the workload behaves when the function stops receiving events from its source.
Lambda function layer detach
Lambda function layer detach detaches a specified Lambda layer from a target AWS Lambda function for a configurable duration and reattaches it afterwards, so you can test how the workload behaves when a shared dependency layer disappears.
Lambda delete function concurrency
Lambda delete function concurrency deletes the reserved concurrency configuration on an AWS Lambda function for a configurable duration and restores it afterwards, so you can test how the workload behaves when the function has to share account-level concurrency with other functions.
Use cases
- … Throttles and account-level concurrency usage.
Lambda toggle event mapping state
Lambda toggle event mapping state disables one or more event source mappings on an AWS Lambda function for a configurable duration and re-enables them afterwards, so you can test how the workload behaves when the function temporarily stops receiving events from its source.
Lambda update function memory
Lambda update function memory lowers the memory allocation of an AWS Lambda function for a configurable duration and restores it afterwards, so you can test how the workload behaves with less memory and a proportionally smaller CPU share.
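The lower-then-restore sequence described above can be sketched with the Lambda configuration API. This is an illustration under assumptions, not the fault's implementation: `lambda_client` is expected to behave like `boto3.client("lambda")`, and `with_reduced_memory` and the function name are hypothetical.

```python
import time


def with_reduced_memory(lambda_client, function_name, memory_mb, duration_seconds):
    """Lower a function's memory, hold it for a while, then restore it.

    lambda_client is expected to behave like boto3.client("lambda").
    Returns the original memory size in MB.
    """
    config = lambda_client.get_function_configuration(FunctionName=function_name)
    original_mb = config["MemorySize"]
    lambda_client.update_function_configuration(
        FunctionName=function_name, MemorySize=memory_mb
    )
    try:
        time.sleep(duration_seconds)
    finally:
        # Restore the recorded value even if the wait is interrupted.
        # A real run may also need to wait for the first update to finish
        # before this second update is accepted.
        lambda_client.update_function_configuration(
            FunctionName=function_name, MemorySize=original_mb
        )
    return original_mb


if __name__ == "__main__":
    import boto3  # third-party dependency; needs AWS credentials configured

    # "my-function" is a placeholder function name.
    print(with_reduced_memory(boto3.client("lambda"), "my-function", 128, 60))
```

Recording the original value before the change, rather than hard-coding it, is what makes the restore step safe to reuse across functions.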
Lambda update function timeout
Lambda update function timeout lowers the configured timeout of an AWS Lambda function for a configurable duration and restores it afterwards, so you can test how the workload behaves when invocations are cut short.
Use cases
- … Errors from timed-out invocations.
Lambda inject latency
Lambda inject latency adds a configurable amount of latency to every invocation of an AWS Lambda function for a configurable duration, so you can test how upstream callers and downstream consumers handle slower-than-expected responses, cold-start spikes, and resource contention.
Use cases
- … Duration and end-to-end p99.
Lambda inject status code
Lambda inject status code overrides the HTTP status code returned by an AWS Lambda function for a configurable duration, so you can test how upstream callers and downstream consumers handle unexpected error status responses.
Use cases
- … Errors and API Gateway / ALB 5xx.
Lambda update role permission
Lambda update role permission detaches a specified IAM policy from the execution role attached to an AWS Lambda function for a configurable duration and reattaches it afterwards, so you can test how the workload behaves when the function loses permission to call a downstream AWS service.
Lambda modify response body
Lambda modify response body overrides the response body returned by an AWS Lambda function for a configurable duration, so you can test how upstream callers and client applications handle unexpected payload shapes and corrupted data.
NLB AZ down
NLB AZ down detaches one or more availability zones from a Network Load Balancer for a configurable duration, then reattaches them, so you can test how multi-AZ NLB workloads behave when a zone disappears from the load balancer surface.
RDS instance delete
RDS instance delete deletes a target RDS DB instance, so you can test how applications behave when a database disappears permanently and how disaster-recovery procedures handle the loss.
RDS instance reboot
RDS instance reboot reboots a target RDS DB instance (with optional Multi-AZ failover) for a configurable duration, so you can test how applications behave when their database restarts.
Use cases
- … FAILOVER=true.
Resource access restrict
Resource access restrict temporarily strips ingress or egress rules from one or more AWS security groups for a configurable duration and restores them afterwards, so you can test how the workload behaves when network access to (or from) an AWS resource disappears.
SSM chaos by ID
SSM chaos by ID runs an arbitrary AWS Systems Manager document against a target EC2 instance selected by ID, so you can inject custom chaos that is not covered by a dedicated fault.
SSM chaos by tag
SSM chaos by tag runs an arbitrary AWS Systems Manager document against EC2 instances selected by tag, so you can inject custom chaos against a logical group of hosts.
Use cases
- … INSTANCE_AFFECTED_PERC.
Windows EC2 blackhole chaos
Windows EC2 blackhole chaos drops all network traffic destined for specific IPs or hosts on one or more Windows EC2 instances for a configurable duration, so you can test how Windows-hosted workloads behave when a specific dependency is completely unreachable.
Windows EC2 CPU hog
Windows EC2 CPU hog stresses a configurable number of CPU cores at a configurable load percentage inside one or more Windows EC2 instances for a configurable duration, so you can test how Windows-hosted workloads behave when their host is CPU-starved.
Windows EC2 memory hog
Windows EC2 memory hog consumes a configurable amount of memory inside one or more Windows EC2 instances for a configurable duration, so you can test how Windows-hosted workloads behave when their host is starved of memory.
Windows EC2 network latency
Windows EC2 network latency adds a configurable amount of latency to network traffic destined for specific IPs or hosts on one or more Windows EC2 instances for a configurable duration, so you can test how Windows-hosted workloads behave when the network is slow.
Windows EC2 network loss
Windows EC2 network loss drops a configurable percentage of network packets destined for specific IPs or hosts on one or more Windows EC2 instances for a configurable duration, so you can test how Windows-hosted workloads behave when the network is lossy.
Windows EC2 process kill
Windows EC2 process kill kills one or more processes (selected by PID or process name) on one or more Windows EC2 instances for a configurable duration, so you can test how Windows-hosted workloads behave when their backing processes die.