Chaos faults for AWS

Introduction

AWS faults disrupt resources that run on different AWS services, and they are executed from the EKS cluster. To run these AWS chaos experiments, you need to authenticate CE with the AWS platform. You can do this in two ways.

  • Using secrets: You can use secrets to authenticate CE with AWS regardless of the Kubernetes cluster used for the deployment. This is the Kubernetes-native way of authenticating CE with AWS.
  • IAM integration: You can authenticate CE with AWS using IAM when you have deployed chaos on an EKS cluster. You associate an IAM role with a Kubernetes service account, and any experiment pod that uses that service account receives the role's AWS permissions.
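As an illustration of the IAM integration option, the sketch below builds a Kubernetes ServiceAccount manifest carrying the `eks.amazonaws.com/role-arn` annotation that EKS uses to associate an IAM role with a service account. The account name, namespace, and role ARN are hypothetical placeholders, not values from this page.

```python
import json


def service_account_manifest(name: str, namespace: str, role_arn: str) -> dict:
    """Build a ServiceAccount manifest annotated with an IAM role ARN."""
    return {
        "apiVersion": "v1",
        "kind": "ServiceAccount",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # EKS reads this annotation to map the service account to an IAM role.
            "annotations": {"eks.amazonaws.com/role-arn": role_arn},
        },
    }


# Hypothetical values, for illustration only.
manifest = service_account_manifest(
    "chaos-experiment-sa",
    "chaos",
    "arn:aws:iam::111122223333:role/chaos-experiment-role",
)
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, which accepts JSON manifests as well as YAML.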

Here are AWS faults that you can execute and validate.

ALB AZ down

ALB AZ down detaches one or more availability zones from an Application Load Balancer for a configurable duration, then reattaches them, so you can test how multi-AZ workloads behave when a single AZ disappears from the load balancer rotation.

Use cases
  • Validate AZ-level resilience and DNS-based client failover within the TTL budget.
  • Confirm remaining AZs absorb redirected traffic without breaching latency SLOs.
  • Verify cross-zone load balancing behavior and target group re-registration after recovery.

CLB AZ down

CLB AZ down disables one or more availability zones on a Classic Load Balancer for a configurable duration, then re-enables them, so you can test how multi-AZ workloads behave when an AZ disappears from a CLB.

Use cases
  • Validate AZ-level resilience and DNS-based client failover.
  • Confirm N-1 AZ capacity is sufficient on the remaining AZs.
  • Verify instances re-register cleanly when the AZ returns.
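The N-1 capacity check in the use cases above comes down to simple arithmetic. A minimal sketch, assuming traffic is spread evenly across AZs and redistributes evenly to the survivors (real workloads may not behave this uniformly); the function name and the example numbers are illustrative.

```python
def utilization_after_az_loss(current_utilization: float, az_count: int, azs_lost: int = 1) -> float:
    """Estimate per-AZ utilization once the load of the lost AZs moves to the survivors.

    current_utilization is a fraction (0.5 means 50%) measured with all AZs healthy.
    """
    if az_count <= azs_lost:
        raise ValueError("at least one AZ must survive")
    return current_utilization * az_count / (az_count - azs_lost)


# Three AZs running at 50% each: the two survivors would run at about 75%.
estimate = utilization_after_az_loss(0.5, az_count=3)
print(f"estimated utilization on surviving AZs: {estimate:.0%}")
```

A result above 1.0 suggests the remaining AZs cannot absorb the load without scaling out first.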

AZ blackhole

AZ blackhole isolates network traffic in one or more AWS Availability Zones (optionally scoped to specific VPCs or subnets) for a configurable duration and restores connectivity afterwards, so you can test how multi-AZ workloads handle a zone-level outage.

Use cases
  • Validate ALB / NLB failover to remaining zones when an AZ goes dark.
  • Confirm Multi-AZ databases (RDS, ElastiCache, OpenSearch, MSK) survive a single-AZ blackhole without data loss.
  • Rehearse Auto Scaling and disaster-recovery automation under a zone-level outage.

VPC route misconfiguration

VPC route misconfiguration temporarily removes specified CIDR routes from one or more VPC route tables for a configurable duration and restores them afterwards, so you can test how the workload behaves when egress to a Transit Gateway, NAT Gateway, VPC peer, or internet gateway disappears.

Use cases
  • Detect blast radius of a future change to a VPC route table before rolling it out.
  • Validate clean error handling when egress to a TGW / NAT Gateway / peer is broken.
  • Confirm alarms on NAT bytes or TGW packet drops fire within the SLA.

DynamoDB replication pause

DynamoDB replication pause pauses cross-region replication on one or more Amazon DynamoDB global tables for a configurable duration using an AWS Fault Injection Service (FIS) experiment, so you can test how your application handles a brief stop in multi-region consistency.

Use cases
  • Validate eventual-consistency tolerance when replication latency spikes.
  • Confirm cross-region failover automation does not misfire on a temporary replication pause.
  • Rehearse global-table catch-up after a multi-region replication gap.

EBS loss by ID

EBS loss by ID detaches an EBS volume by volume ID for a configurable duration and reattaches it afterwards, so you can test how a workload behaves when its storage disappears.

Use cases
  • Validate clean IO-error handling and database failover when the data volume disappears.
  • Confirm the workload reconnects cleanly when the volume is reattached.
  • Rehearse disaster-recovery procedures for missing-volume scenarios.

EBS loss by tag

EBS loss by tag detaches EBS volumes selected by tag for a configurable duration and reattaches them afterwards, so you can test how workloads behave when a tagged subset of storage disappears.

Use cases
  • Validate replica absorption and stateful-workload failover when a tagged subset of storage disappears.
  • Confirm VOLUME_AFFECTED_PERC keeps the impact within the planned blast radius.
  • Rehearse the recovery procedure for losing a tagged subset of storage.
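How a percentage-based selector such as VOLUME_AFFECTED_PERC bounds the blast radius can be sketched as follows. This is an illustration only: the rounding rule (round down, but always target at least one volume) and the random sampling are assumptions, not the documented behaviour of the fault, and the volume IDs are made up.

```python
import math
import random


def pick_targets(volume_ids, affected_perc, seed=None):
    """Pick a random subset of volumes sized by a percentage (assumed semantics)."""
    if not 0 <= affected_perc <= 100:
        raise ValueError("affected_perc must be between 0 and 100")
    volume_ids = list(volume_ids)
    if not volume_ids:
        return []
    # Assumption: round down, but never select fewer than one volume.
    count = max(1, math.floor(len(volume_ids) * affected_perc / 100))
    return random.Random(seed).sample(volume_ids, count)


# Hypothetical volume IDs carrying the same tag.
tagged = [f"vol-{n:04d}" for n in range(10)]
print(pick_targets(tagged, affected_perc=30, seed=7))
```

Working through the expected count before a run makes it easier to confirm that the observed impact matches the planned blast radius.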

EC2 CPU hog

EC2 CPU hog stresses a configurable number of CPU cores at a configurable load percentage inside a target EC2 instance for a configurable duration, so you can test how the workload behaves when its host is CPU-starved.

Use cases
  • Validate p99 latency stays within SLO when all cores are saturated.
  • Confirm CloudWatch CPU alarms and ASG scale-out trigger in the expected window.
  • Test burst-credit exhaustion on T-family instances and co-tenant isolation.
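Checking the first use case above means computing a p99 from latency samples collected while the fault is active. A minimal nearest-rank sketch; the sample values and the 250 ms SLO threshold are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, gathered during the experiment.
latencies_ms = [42, 45, 47, 51, 55, 58, 63, 70, 120, 310]
slo_ms = 250
p99 = percentile(latencies_ms, 99)
print(f"p99={p99} ms, within SLO: {p99 <= slo_ms}")
```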

EC2 DNS chaos

EC2 DNS chaos fails DNS resolution for selected hostnames on a target EC2 instance for a configurable duration, so you can test how the workload reacts when a dependency cannot be resolved.

Use cases
  • Validate clean error handling when a critical dependency cannot be resolved.
  • Confirm resolver retry semantics back off correctly instead of amplifying load.
  • Test multi-target outages and observability coverage of DNS failures.

EC2 HTTP latency

EC2 HTTP latency adds latency to inbound HTTP traffic on a configurable port of a target EC2 instance for a configurable duration, so you can test how clients react when an HTTP service responds slowly.

Use cases
  • Validate client timeouts and retry-with-backoff paths under HTTP slowness.
  • Confirm connection pools absorb the added latency without being exhausted.
  • Test load-balancer behaviour and end-to-end SLO impact across the call graph.
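The retry-with-backoff path that this fault exercises can be sketched as capped exponential backoff with full jitter. This is a generic client-side pattern, not part of the fault itself; the base delay, cap, and attempt count are illustrative assumptions.

```python
import random


def backoff_delays(attempts, base=0.1, cap=5.0, rng=None):
    """Return one randomized delay (in seconds) per retry attempt.

    Each delay is drawn uniformly from [0, min(cap, base * 2**attempt)],
    so retries spread out instead of arriving in synchronized waves.
    """
    rng = rng or random.Random()
    return [rng.uniform(0, min(cap, base * 2 ** attempt)) for attempt in range(attempts)]


for attempt, delay in enumerate(backoff_delays(5)):
    print(f"retry {attempt + 1}: wait up to {min(5.0, 0.1 * 2 ** attempt):.1f}s, drew {delay:.3f}s")
```

When the injected latency exceeds the client timeout, the sum of these delays plus the per-attempt timeout gives a rough upper bound on how long a caller can be held.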

EC2 HTTP modify body

EC2 HTTP modify body rewrites HTTP response bodies on a configurable port of a target EC2 instance for a configurable duration, so you can test how clients react when an upstream returns unexpected content.

Use cases
  • Validate schema-validation and parse-error paths in clients.
  • Test empty-response and truncated-payload handling.
  • Confirm UX degrades gracefully when the API returns unexpected content.

EC2 HTTP modify header

EC2 HTTP modify header adds, changes, or removes HTTP headers on requests or responses on a configurable port of a target EC2 instance for a configurable duration, so you can test how clients and servers react when headers are missing or malformed.

Use cases
  • Validate auth-failure paths when Authorization is stripped from requests.
  • Test cache-control and CORS-header changes against downstream caches and browser clients.
  • Confirm tracing-header propagation breaks (or recovers) exactly where expected.

EC2 HTTP reset peer

EC2 HTTP reset peer resets inbound TCP connections to an HTTP service on a configurable port of a target EC2 instance for a configurable duration, so you can test how clients react when the server tears down connections mid-flight.

Use cases
  • Validate client-side retry safety when connections are reset before the response arrives.
  • Test HTTP connection-pool recovery after a churn event.
  • Confirm load-balancer detection and observability of TCP resets.

EC2 HTTP status code

EC2 HTTP status code rewrites HTTP response status codes on a configurable port of a target EC2 instance for a configurable duration, so you can test how clients react to specific error codes returned by an upstream service.

Use cases
  • Validate 4xx vs 5xx semantics: clients refrain from retrying 4xx but retry 5xx with backoff.
  • Test 429 handling, circuit-breaker open/close behaviour, and cache fallback on 502.
  • Confirm auth-failure (401/403) paths refresh tokens cleanly.
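The 4xx-versus-5xx rule in the first use case above can be expressed as a small decision function. A sketch under assumptions: which codes count as retryable varies by API, and treating 429 as retryable is a common convention rather than something this page specifies.

```python
def should_retry(status: int) -> bool:
    """Decide whether a response status is worth retrying (with backoff)."""
    if status == 429:
        # Throttled: retry later, ideally honouring a Retry-After header if present.
        return True
    if 500 <= status <= 599:
        # Server-side failure: the same request may succeed on a later attempt.
        return True
    # 2xx/3xx need no retry; other 4xx indicate a request the client must fix first.
    return False


for code in (200, 401, 404, 429, 502, 503):
    print(code, "retry" if should_retry(code) else "do not retry")
```

Injecting each status in turn and comparing the client's observed retries against a table like this is one way to spot clients that retry non-retryable errors.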

EC2 IO stress

EC2 IO stress generates sustained filesystem read and write load on a target EC2 instance for a configurable duration, so you can test how the workload behaves under disk pressure or near-full storage.

Use cases
  • Validate disk-bound latency and write-error handling under saturation.
  • Test near-full disk behaviour and WAL flush stalls for databases.
  • Confirm EKS ephemeral-storage limits evict pods as expected.

EC2 memory hog

EC2 memory hog consumes a configurable amount of memory inside a target EC2 instance for a configurable duration, so you can test how the workload behaves when its host is starved of memory.

Use cases
  • Validate OOM-killer victim selection lands on the right process.
  • Test JVM heap pressure, container memory limits, and pod restarts on EKS.
  • Confirm CloudWatch memory alarms trigger ASG scale-out in time.

EC2 network latency

EC2 network latency adds configurable latency and jitter to outbound traffic on a target EC2 instance for a configurable duration, so you can test how the workload reacts when network round-trip times grow.

Use cases
  • Validate cross-AZ latency tolerance and database-call timeouts.
  • Test connection-pool resilience and retry-storm protection under added latency.
  • Confirm SLO error budgets burn at the expected rate when latency is injected.
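The burn-rate check in the last use case above is a ratio: the observed error rate divided by the error budget (1 minus the SLO target). A minimal sketch with illustrative numbers; the function name is hypothetical.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' the error budget is being spent."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / budget


# A 99.9% SLO leaves a 0.1% budget; 0.5% of requests failing burns it roughly 5x too fast.
rate = burn_rate(error_rate=0.005, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
```

Estimating the expected burn rate before the run gives a concrete number to compare against what the SLO dashboard reports while latency is injected.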

EC2 network loss

EC2 network loss drops a configurable percentage of outbound packets on a target EC2 instance for a configurable duration, so you can test how the workload reacts when network reliability degrades.

Use cases
  • Validate partial-loss tolerance and 100%-loss failover behaviour.
  • Test TCP retransmission cost and replica failover under packet loss.
  • Confirm loss surfaces in network metrics (tcp_retransmits) and alerts.

EC2 process kill

EC2 process kill kills one or more processes by PID inside a target EC2 instance for a configurable duration, so you can test how the workload recovers when a critical process disappears without losing the host.

Use cases
  • Validate supervisor (systemd, container runtime) restart cadence.
  • Test crash vs graceful-shutdown semantics by toggling FORCE between SIGTERM and SIGKILL.
  • Confirm liveness probes detect the failure and trigger restarts cleanly.

EC2 stop by ID

EC2 stop by ID stops one or more EC2 instances identified by their instance IDs for a configurable duration and then starts them again, so you can test how the workload behaves when a specific host disappears. When MANAGED_NODEGROUP=enable, the fault waits for a replacement node from the auto-scaling group instead of starting the original instance.

Use cases
  • Validate replica failover when an instance hosting a workload is stopped.
  • Confirm load balancer health checks detach and reattach the instance cleanly.
  • Test auto-scaling group response and EKS managed node group recovery.

EC2 stop by tag

EC2 stop by tag stops EC2 instances selected by tag for a configurable duration and starts them again afterwards, so you can test how a workload behaves when a tagged subset of capacity disappears.

Use cases
  • Validate replica failover for a tagged tier and load-balancer detach/reattach.
  • Confirm auto-scaling group response and EKS managed-node-group recovery (MANAGED_NODEGROUP=enable).
  • Verify INSTANCE_AFFECTED_PERCENTAGE keeps the blast radius within plan.

ECS agent stop

ECS agent stop halts the ECS agent on one or more EC2 container instances belonging to an ECS cluster for a configurable duration, so you can test how the cluster behaves when the data-plane bridge between agent and control plane is interrupted.

Use cases
  • Validate that running tasks continue to serve traffic while the agent is offline.
  • Confirm the cluster detects the disconnected instance and that the agent recovers cleanly.
  • Test that new task placements skip the affected instance until the agent reconnects.

ECS container CPU hog

ECS container CPU hog stresses CPU inside containers of EC2-backed ECS tasks for a configurable duration, so you can test how the application and the host behave under CPU saturation.

Use cases
  • Validate p99 latency stays within SLO when containers are CPU-starved.
  • Confirm CloudWatch CPU alarms and service autoscaling trigger in the expected window.
  • Test noisy-neighbour isolation across containers on the same host.

ECS container HTTP latency

ECS container HTTP latency adds latency to inbound HTTP traffic on a configurable port of containers in an EC2-backed ECS task for a configurable duration, so you can test how clients react when an HTTP service responds slowly.

Use cases
  • Validate client timeouts and retry-with-backoff paths under HTTP slowness.
  • Confirm connection pools absorb the added latency without being exhausted.
  • Test load-balancer behaviour and end-to-end SLO impact across the call graph.

ECS container HTTP modify body

ECS container HTTP modify body rewrites HTTP response bodies on a configurable port of containers in an EC2-backed ECS task for a configurable duration, so you can test how clients react when an upstream returns unexpected content.

Use cases
  • Validate schema-validation and parse-error paths in clients.
  • Test empty-response and truncated-payload handling.
  • Confirm UX degrades gracefully when the API returns unexpected content.

ECS container HTTP reset peer

ECS container HTTP reset peer resets inbound TCP connections to an HTTP service on a configurable port of containers in an EC2-backed ECS task for a configurable duration, so you can test how clients react when the server tears down connections mid-flight.

Use cases
  • Validate client-side retry safety when connections are reset before the response arrives.
  • Test HTTP connection-pool recovery after a churn event.
  • Confirm load-balancer detection and observability of TCP resets.

ECS container HTTP status code

ECS container HTTP status code rewrites HTTP response status codes on a configurable port of containers in an EC2-backed ECS task for a configurable duration, so you can test how clients react to specific error codes returned by an upstream service.

Use cases
  • Validate 4xx vs 5xx semantics: clients refrain from retrying 4xx but retry 5xx with backoff.
  • Test 429 handling, circuit-breaker open/close behaviour, and cache fallback on 502.
  • Confirm auth-failure (401/403) paths refresh tokens cleanly.

ECS container IO stress

ECS container IO stress generates sustained filesystem read and write load inside containers of EC2-backed ECS tasks for a configurable duration, so you can test how the workload behaves under disk pressure.

Use cases
  • Validate disk-bound latency and write-error handling under saturation.
  • Test near-full disk behaviour and WAL flush stalls for stateful workloads.
  • Confirm ephemeral-storage limits behave as expected when IO load is sustained.

ECS container memory hog

ECS container memory hog consumes a configurable amount of memory inside containers of EC2-backed ECS tasks for a configurable duration, so you can test how the workload behaves when its container is starved of memory.

Use cases
  • Validate OOM-killer victim selection lands on the right process.
  • Test JVM heap pressure and container memory limits.
  • Confirm CloudWatch memory alarms trigger service autoscaling in time.

ECS container network latency

ECS container network latency adds configurable latency to outbound traffic from containers in EC2-backed ECS tasks for a configurable duration, so you can test how the workload reacts when network round-trip times grow.

Use cases
  • Validate cross-AZ latency tolerance and dependency-call timeouts.
  • Test connection-pool resilience and retry-storm protection under added latency.
  • Confirm SLO error budgets burn at the expected rate when latency is injected.

ECS container network loss

ECS container network loss drops a configurable percentage of outbound packets from containers in EC2-backed ECS tasks for a configurable duration, so you can test how the workload reacts when network reliability degrades.

Use cases
  • Validate partial-loss tolerance and 100%-loss failover behaviour.
  • Test TCP retransmission cost and replica failover under packet loss.
  • Confirm loss surfaces in network metrics and alerts.

ECS container volume detach

ECS container volume detach detaches EBS volumes attached to ECS task containers for a configurable duration and reattaches them afterwards, so you can test how stateful tasks behave when their storage disappears.

Use cases
  • Validate clean IO-error handling and stateful-task failover when the data volume disappears.
  • Confirm the task reconnects cleanly when the volume is reattached.
  • Rehearse disaster-recovery procedures for missing-volume scenarios on ECS.

ECS Fargate CPU hog

ECS Fargate CPU hog stresses CPU inside a Fargate task for a configurable duration, so you can test how the application behaves when its task is CPU-starved.

Use cases
  • Validate that Fargate task vCPU sizing is sufficient for the workload's peak.
  • Confirm autoscaling on the ECS service scales out under sustained CPU pressure.
  • Test the impact of a noisy sidecar consuming CPU on the main application container.

ECS Fargate memory hog

ECS Fargate memory hog consumes a configurable amount of memory inside a Fargate task for a configurable duration, so you can test how the application behaves when its task is starved of memory.

Use cases
  • Validate OOM behaviour and task restart inside the Fargate task.
  • Confirm that the application gracefully degrades or restarts when memory is exhausted.
  • Test the impact of a noisy sidecar consuming memory on the main application container.

ECS instance stop

ECS instance stop stops one or more EC2 container instances belonging to an ECS cluster for a configurable duration, then starts them again, so you can test how the cluster reschedules tasks and how the workload behaves when a host disappears.

Use cases
  • Validate task rescheduling onto surviving container instances.
  • Confirm Auto Scaling Group response when an EC2 container instance disappears.
  • Test workload availability across a multi-AZ ECS cluster during host loss.

ECS invalid container image

ECS invalid container image swaps the container image on an ECS service to an invalid value for a configurable duration, then restores the original image, so you can test how deployments, rollbacks, and monitoring react to a failing image pull.

Use cases
  • Validate that the service detects image-pull failures (the ECS counterpart of Kubernetes ImagePullBackOff, such as CannotPullContainerError) and surfaces them in alerts.
  • Confirm deployment circuit-breaker (deploymentConfiguration) prevents traffic from shifting to the bad revision.
  • Rehearse rollback runbooks for failed image pulls.

ECS network restrict

ECS network restrict modifies the security group rules of an ECS service for a configurable duration and restores them afterwards, so you can test how the workload behaves when outbound or inbound network access is restricted.

Use cases
  • Validate clean error handling when outbound access to a dependency is blocked.
  • Confirm health checks behave correctly when inbound access is restricted.
  • Test fallback paths when an SG rule change breaks a specific port.

ECS task scale

ECS task scale changes the desired count of an ECS service for a configurable duration and restores it afterwards, so you can test how the workload behaves under sudden scale-up or scale-down.

Use cases
  • Validate replica failover when the task count is reduced.
  • Confirm autoscaling and capacity-provider behaviour during a sudden scale-up.
  • Test deployment circuit-breaker behaviour under aggressive scale changes.

ECS task stop

ECS task stop stops one or more ECS tasks (selected by service or task ID) for a configurable duration, so you can test how the workload behaves when a specific task disappears.

Use cases
  • Validate that the ECS service replaces stopped tasks within the deployment configuration.
  • Confirm load-balancer target deregistration and re-registration are clean.
  • Test that standalone (non-service) tasks fail upstream callers gracefully.

ECS update container resource limit

ECS update container resource limit re-registers an ECS task definition with reduced CPU or memory limits for a configurable duration and restores the original limits afterwards, so you can test how the workload behaves under tightened resource constraints.

Use cases
  • Validate workload behaviour when CPU or memory limits are tightened.
  • Confirm autoscaling triggers more aggressively under reduced limits.
  • Rehearse the rollback runbook for incorrect resource-limit changes.

ECS update container timeout

ECS update container timeout re-registers an ECS task definition with modified container start or stop timeouts for a configurable duration and restores the originals afterwards, so you can test how the deployment behaves when container start or stop takes longer than expected.

Use cases
  • Validate that deployments fail fast (or wait gracefully) when container start exceeds the timeout.
  • Confirm container shutdown handlers complete within the configured stop timeout.
  • Rehearse rollback for accidentally too-low start/stop timeouts.

ECS update task role

ECS update task role swaps the IAM task role on an ECS service for a configurable duration and restores the original afterwards, so you can test how the workload behaves when its IAM permissions change.

Use cases
  • Validate clean error handling when the task role loses permissions to a dependency (S3, DynamoDB, KMS).
  • Confirm monitoring detects AccessDenied surges from the application.
  • Test fallback or retry behaviour against AWS API permission errors.

Generic experiment template

Generic experiment template (also known as Generic FIS experiment template) starts a pre-existing AWS Fault Injection Service (FIS) template by ID, so you can fold native AWS-managed faults into a Harness chaos experiment and probe, verify, and report on the result as you do with any other Harness fault.

Use cases
  • Drive an existing FIS template from Chaos Studio so you can attach probes, hypothesis criteria, and reports.
  • Mix native FIS actions with Harness-native faults inside one experiment.
  • Centralise FIS results alongside every other Harness chaos run.

Lambda block TCP connection

Lambda block TCP connection blocks outbound TCP connections from an AWS Lambda function to one or more target hostnames for a configurable duration, so you can test how the function behaves when a TCP-based dependency is unreachable.

Use cases
  • Validate fail-fast behaviour when a TCP-based dependency (database, cache, external API) is unreachable.
  • Confirm function timeout protects against TCP-blocked dependencies without amplifying cost.
  • Test alarm fidelity for elevated Lambda error rate.

Lambda delete event source mapping

Lambda delete event source mapping deletes one or more event source mappings on an AWS Lambda function for a configurable duration and recreates them afterwards, so you can test how the workload behaves when the function stops receiving events from its source.

Use cases
  • Validate event backlog handling (SQS / Kinesis / DynamoDB Streams) when the mapping is removed.
  • Confirm drain behaviour and idempotency when the mapping is recreated.
  • Test alarm fidelity for iterator age and queue depth.
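Idempotency on redelivery, which the second use case above checks, is commonly achieved by de-duplicating on a message ID. The sketch below keeps seen IDs in memory purely for illustration; a real consumer would typically use a durable store, and the `message_id` field name is an assumption.

```python
class IdempotentConsumer:
    """Wrap a handler so that a message with an already-seen ID is processed only once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # in-memory only; illustration, not durable

    def process(self, message: dict) -> bool:
        """Return True if the handler ran, False if the message was a duplicate."""
        message_id = message["message_id"]
        if message_id in self._seen:
            return False
        self._handler(message)
        # Mark as seen only after the handler succeeds, so failures can be retried.
        self._seen.add(message_id)
        return True


processed = []
consumer = IdempotentConsumer(processed.append)
consumer.process({"message_id": "m-1", "body": "order created"})
consumer.process({"message_id": "m-1", "body": "order created"})  # redelivered after the mapping returns
print(len(processed))
```

After the mapping is recreated, comparing the number of handler executions with the number of distinct message IDs shows whether redelivered events caused duplicate side effects.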

Lambda function layer detach

Lambda function layer detach detaches a specified Lambda layer from a target AWS Lambda function for a configurable duration and reattaches it afterwards, so you can test how the workload behaves when a shared dependency layer disappears.

Use cases
  • Validate clean error reporting when the layer's libraries or binaries disappear.
  • Audit whether the function actually uses the libraries provided by the layer.
  • Confirm reattach restores normal operation without manual intervention.

Lambda delete function concurrency

Lambda delete function concurrency deletes the reserved concurrency configuration on an AWS Lambda function for a configurable duration and restores it afterwards, so you can test how the workload behaves when the function has to share account-level concurrency with other functions.

Use cases
  • Validate throttling exposure when the reservation disappears.
  • Confirm alarm fidelity on Throttles and account-level concurrency usage.
  • Test downstream consumer behaviour when the function throughput drops.

Lambda toggle event mapping state

Lambda toggle event mapping state disables one or more event source mappings on an AWS Lambda function for a configurable duration and re-enables them afterwards, so you can test how the workload behaves when the function temporarily stops receiving events from its source.

Use cases
  • Validate event backlog handling when the mapping is disabled (without losing the mapping).
  • Confirm drain behaviour when the mapping is re-enabled.
  • Test alarm fidelity for iterator age, queue depth, and "no invocations".

Lambda update function memory

Lambda update function memory lowers the memory allocation of an AWS Lambda function for a configurable duration and restores it afterwards, so you can test how the workload behaves with less memory and a proportionally smaller CPU share.

Use cases
  • Validate OOM behaviour and the impact of reduced CPU share on duration.
  • Identify the lowest safe memory setting for cost optimization.
  • Confirm alarms fire when memory pressure rises.
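Lambda allocates CPU in proportion to configured memory, and AWS documents that a function has the equivalent of one vCPU at 1,769 MB. The sketch below uses that single data point to estimate the CPU share at other memory settings. The strictly linear scaling and the way the 6 vCPU ceiling is applied are simplifying assumptions, so treat the result as a rough guide only.

```python
MB_PER_VCPU = 1769   # documented point: one vCPU equivalent at 1,769 MB
MAX_VCPUS = 6.0      # documented upper bound for a single function
MIN_MB, MAX_MB = 128, 10240


def approx_vcpu_share(memory_mb: int) -> float:
    """Rough estimate of the vCPU share for a memory setting (assumes linear scaling)."""
    if not MIN_MB <= memory_mb <= MAX_MB:
        raise ValueError(f"memory_mb must be between {MIN_MB} and {MAX_MB}")
    return min(memory_mb / MB_PER_VCPU, MAX_VCPUS)


for setting in (128, 512, 1769, 3008):
    print(f"{setting} MB -> about {approx_vcpu_share(setting):.2f} vCPU")
```

Because lowering memory also lowers CPU, an increase in Duration during the fault does not by itself indicate memory pressure.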

Lambda update function timeout

Lambda update function timeout lowers the configured timeout of an AWS Lambda function for a configurable duration and restores it afterwards, so you can test how the workload behaves when invocations are cut short.

Use cases
  • Validate clean caller behaviour when tail invocations are killed.
  • Identify the lowest safe timeout setting for cost optimization.
  • Confirm alarms fire on elevated Errors from timed-out invocations.

Lambda inject latency

Lambda inject latency adds a configurable amount of latency to every invocation of an AWS Lambda function for a configurable duration, so you can test how upstream callers and downstream consumers handle slower-than-expected responses, cold-start spikes, and resource contention.

Use cases
  • Validate caller timeout and retry behaviour when the function is slow.
  • Confirm retries do not amplify load on the function or its dependencies.
  • Test alarm fidelity for Duration and end-to-end p99.

Lambda inject status code

Lambda inject status code overrides the HTTP status code returned by an AWS Lambda function for a configurable duration, so you can test how upstream callers and downstream consumers handle unexpected error status responses.

Use cases
  • Validate caller error handling, retry budgets, circuit breakers, and fallback flows.
  • Confirm alarm fidelity on Lambda Errors and API Gateway / ALB 5xx.
  • Confirm clients handle unexpected statuses without crashing.
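The circuit-breaker behaviour named in the first use case above can be sketched as a small state machine on the caller side. This is a generic pattern rather than anything the fault provides; the thresholds, the rule that only 5xx responses count as failures, and the injectable clock are illustrative assumptions.

```python
import time


class CircuitBreaker:
    """Open after consecutive failures; allow a trial request after a cool-down."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self) -> bool:
        if self._opened_at is None:
            return True
        # Half-open: once the cool-down has elapsed, let a trial request through.
        return self._clock() - self._opened_at >= self.reset_after

    def record(self, status: int) -> None:
        if status >= 500:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
        else:
            self._failures = 0
            self._opened_at = None  # a success closes the circuit


breaker = CircuitBreaker(failure_threshold=2, reset_after=10.0)
for status in (500, 500):
    breaker.record(status)
print("requests allowed:", breaker.allow_request())
```

While the fault is injecting error statuses, the breaker should stay open apart from periodic trial requests, and it should close again shortly after the fault ends.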

Lambda update role permission

Lambda update role permission detaches a specified IAM policy from the execution role attached to an AWS Lambda function for a configurable duration and reattaches it afterwards, so you can test how the workload behaves when the function loses permission to call a downstream AWS service.

Use cases
  • Validate fail-fast behaviour when an AWS API permission disappears.
  • Confirm alarm fidelity on Lambda errors and downstream service access denials.
  • Test caller and downstream behaviour during the IAM propagation window.

Lambda modify response body

Lambda modify response body overrides the response body returned by an AWS Lambda function for a configurable duration, so you can test how upstream callers and client applications handle unexpected payload shapes and corrupted data.

Use cases
  • Validate response-schema validation in client applications.
  • Identify silent consumers that accept invalid responses without alerting.
  • Confirm alarm fidelity on application-level error rate.

NLB AZ down

NLB AZ down detaches one or more availability zones from a Network Load Balancer for a configurable duration, then reattaches them, so you can test how multi-AZ NLB workloads behave when a zone disappears from the load balancer surface.

Use cases
  • Validate AZ-level resilience and DNS-based client failover for TCP/UDP workloads.
  • Test long-lived TCP connection recovery when the AZ endpoint is removed.
  • Confirm cross-zone behaviour and target re-registration after recovery.

RDS instance delete

RDS instance delete deletes a target RDS DB instance, so you can test how applications behave when a database disappears permanently and how disaster-recovery procedures handle the loss.

Use cases
  • Rehearse the DR runbook for restoring a deleted DB instance from snapshot.
  • Validate read-replica promotion when the primary disappears.
  • Confirm monitoring detects the deletion within the expected window.

RDS instance reboot

RDS instance reboot reboots a target RDS DB instance (with optional Multi-AZ failover) for a configurable duration, so you can test how applications behave when their database restarts.

Use cases
  • Validate connection-pool reconnection across a reboot.
  • Test Multi-AZ failover and read-replica behaviour with FAILOVER=true.
  • Confirm write-path timeout handling and that monitoring raises one cohesive alert during the reboot.

Resource access restrict

Resource access restrict temporarily strips ingress or egress rules from one or more AWS security groups for a configurable duration and restores them afterwards, so you can test how the workload behaves when network access to (or from) an AWS resource disappears.

Use cases
  • Validate clean fail-fast behaviour when network access disappears.
  • Confirm multi-AZ resilience absorbs the load on healthy resources.
  • Test alarm fidelity on connection errors and target health.

SSM chaos by ID

SSM chaos by ID runs an arbitrary AWS Systems Manager document against a target EC2 instance selected by ID, so you can inject custom chaos that is not covered by a dedicated fault.

Use cases
  • Run a custom shell script or domain-specific failure not covered by another fault.
  • Trigger filesystem corruption, kernel-level chaos, or other one-shot scenarios.
  • Validate hypotheses without authoring a new dedicated fault.

SSM chaos by tag

SSM chaos by tag runs an arbitrary AWS Systems Manager document against EC2 instances selected by tag, so you can inject custom chaos against a logical group of hosts.

Use cases
  • Run a custom shell script across a tagged service tier.
  • Apply domain-specific failures to a percentage of tagged hosts via INSTANCE_AFFECTED_PERC.
  • Validate hypotheses across a fleet without authoring a new dedicated fault.
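To reason about how many hosts a percentage such as INSTANCE_AFFECTED_PERC would touch, a rough sketch is shown below. It is only an assumption about how a percentage could map to a target set. The fault's actual rounding and selection logic is not described here and may differ. `pick_targets` is a hypothetical helper.

```python
import math
import random


def pick_targets(instance_ids, affected_perc, rng=None):
    """Pick a random subset covering affected_perc percent of the instances.

    Assumption: the count is rounded up, so any nonzero percentage selects
    at least one instance. The real fault may round differently.
    """
    if not 0 <= affected_perc <= 100:
        raise ValueError("affected_perc must be between 0 and 100")
    rng = rng or random.Random()
    count = math.ceil(len(instance_ids) * affected_perc / 100)
    return rng.sample(instance_ids, count)
```

For example, under this rounding assumption a tag matching 10 instances with a percentage of 25 would yield 3 targets.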

Windows EC2 blackhole chaos

Windows EC2 blackhole chaos drops all network traffic destined for specific IPs or hosts on one or more Windows EC2 instances for a configurable duration, so you can test how Windows-hosted workloads behave when a specific dependency is completely unreachable.

Use cases
  • Validate fail-fast and circuit-breaker behaviour when a dependency is unreachable.
  • Confirm cross-region or fallback routing engages within the SLA.
  • Test monitoring alert fidelity on "dependency unreachable" conditions.

Windows EC2 CPU hog

Windows EC2 CPU hog stresses a configurable number of CPU cores at a configurable load percentage inside one or more Windows EC2 instances for a configurable duration, so you can test how Windows-hosted workloads behave when their host is CPU-starved.

Use cases
  • Validate p99 latency stays within SLO when all cores are saturated.
  • Confirm CloudWatch CPU alarms and ASG scale-out trigger in the expected window.
  • Test burst-credit exhaustion on T-family instances and co-tenant isolation.

Windows EC2 memory hog

Windows EC2 memory hog consumes a configurable amount of memory inside one or more Windows EC2 instances for a configurable duration, so you can test how Windows-hosted workloads behave when their host is starved of memory.

Use cases
  • Validate that the OS swap (pagefile) absorbs the pressure without crashing the application.
  • Confirm CloudWatch memory alarms trigger ASG scale-out in time.
  • Test application behaviour as available memory falls below critical thresholds.

Windows EC2 network latency

Windows EC2 network latency adds a configurable amount of latency to network traffic destined for specific IPs or hosts on one or more Windows EC2 instances for a configurable duration, so you can test how Windows-hosted workloads behave when the network is slow.

Use cases
  • Validate timeouts and retry behaviour for targeted dependency slowdowns.
  • Confirm latency budgets across the call graph hold within SLO.
  • Test cross-AZ or cross-region replica failover when one path is slow.

Windows EC2 network loss

Windows EC2 network loss drops a configurable percentage of network packets destined for specific IPs or hosts on one or more Windows EC2 instances for a configurable duration, so you can test how Windows-hosted workloads behave when the network is lossy.

Use cases
  • Validate retry recovery and TCP backoff under partial packet loss.
  • Confirm monitoring detects elevated loss on the affected path.
  • Test replica failover when a specific path becomes lossy.

Windows EC2 process kill

Windows EC2 process kill kills one or more processes (selected by PID or process name) on one or more Windows EC2 instances for a configurable duration, so you can test how Windows-hosted workloads behave when their backing processes die.

Use cases
  • Validate Windows Service Control Manager recovery options.
  • Confirm cluster failover (SQL Server AlwaysOn, MSCS) when a local process dies.
  • Test custom watchdog/supervisor (NSSM, FireDaemon) respawn behaviour.