EBS loss by ID

EBS loss by ID is an AWS chaos fault that detaches an EBS volume by its volume ID for a configurable duration and reattaches it at the end of the fault. The fault calls AWS EC2 APIs directly; no agent on the host instance is required.

Use this fault to test how a workload behaves when its storage disappears. Does the application surface a clean IO error, or does it crash? Does the database fail over to a replica? Does monitoring fire the right alert?

Run your first experiment

If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.


Use cases

Run this fault when you want to answer concrete questions like:

  • IO error handling: When the data volume disappears, does the application surface a clean error or hang?
  • Database failover: Does a stateful workload (database, message broker) fail over to a replica, and how long until the replica is fully promoted?
  • Recovery after reattach: When the volume comes back, does the application reconnect cleanly, or does it require manual intervention (remount, restart)?
  • Detached-volume detection: Does monitoring fire the right alert at the right time, and does the runbook actually point at the cause?
  • Disaster-recovery rehearsal: Validate that the recovery procedure for a missing volume works end to end.

Prerequisites

  • Kubernetes version: 1.21 or later for the chaos infrastructure cluster.
  • Target volume identified: EBS_VOLUME_ID is the ID of an existing EBS volume in REGION that is in in-use state.
  • Volume is detachable: The volume is not the root volume of a running instance. A root volume can be detached only after its instance is stopped.
  • AWS credentials available: Either an AWS credentials file uploaded as a File Secret in Harness Secret Manager (see Authentication below) or IRSA on the chaos infrastructure service account.
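
The volume prerequisites above can be checked before an experiment starts. The following is a minimal sketch, not part of the fault itself: `preflight` is a hypothetical helper, and `ec2` is assumed to be a boto3 EC2 client (or any object that exposes the same `describe_volumes` and `describe_instances` calls).

```python
def preflight(ec2, volume_id):
    """Return (instance_id, device) if the volume is a valid target, else raise ValueError.

    `ec2` is assumed to behave like a boto3 EC2 client.
    """
    volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    if volume["State"] != "in-use" or not volume["Attachments"]:
        raise ValueError(f"{volume_id} is {volume['State']}, expected in-use")

    attachment = volume["Attachments"][0]
    instance_id, device = attachment["InstanceId"], attachment["Device"]

    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    is_running = instance["State"]["Name"] == "running"
    if is_running and instance.get("RootDeviceName") == device:
        raise ValueError(f"{volume_id} is the root volume of running instance {instance_id}")

    return instance_id, device


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   print(preflight(boto3.client("ec2", region_name="us-east-1"), "vol-0a1b2c3d4e5f67890"))
```

The client is passed in rather than created inside the function so that the check can be exercised against a stub without AWS access.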

Supported environments

| Platform | Support status |
| --- | --- |
| Amazon EBS (gp2, gp3, io1, io2, st1, sc1) | Supported |
| EBS volumes attached to EC2 instances | Supported |
| EBS volumes attached to Outposts instances | Supported |
| AWS regions | Supported in every commercial region; pass the region in REGION |
| Root volumes of running instances | Not directly supported; stop the instance first |
| Volumes attached to multiple instances (MultiAttachEnabled) | Detach is supported, but reattach must restore the original attachment topology |

Permissions required

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVolumes",
        "ec2:DescribeInstances",
        "ec2:AttachVolume",
        "ec2:DetachVolume"
      ],
      "Resource": "*"
    }
  ]
}

  • ec2:DescribeVolumes resolves EBS_VOLUME_ID and captures the current attachment (instance ID + device name) so the volume can be reattached after the fault.
  • ec2:DetachVolume drives the fault.
  • ec2:AttachVolume restores the attachment at the end of the fault.
  • ec2:DescribeInstances confirms the instance the volume reattaches to is still running.
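
A policy document can be checked against this action list before it is attached. The sketch below is illustrative only: `missing_actions` is a hypothetical helper that matches action names exactly. It does not expand wildcards such as `ec2:*`, and it ignores Deny statements, Resource, and Condition, so treat an empty result as a quick sanity check rather than a full IAM evaluation.

```python
import json

# The four actions listed in the policy above.
REQUIRED_ACTIONS = {
    "ec2:DescribeVolumes",
    "ec2:DescribeInstances",
    "ec2:AttachVolume",
    "ec2:DetachVolume",
}


def missing_actions(policy_json):
    """Return the required actions that no Allow statement names explicitly."""
    statements = json.loads(policy_json).get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]

    granted = set()
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # IAM allows a single action string
            actions = [actions]
        granted.update(actions)

    return sorted(REQUIRED_ACTIONS - granted)
```

For example, passing a policy that omits `ec2:DetachVolume` returns a one-element list naming that action.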

Go to common policy for all AWS faults to use a single superset IAM policy.


Authentication

The fault supports three credential delivery models. Pick one based on how your chaos infrastructure is deployed.

| Method | When to use it | How to configure |
| --- | --- | --- |
| Harness Secret Manager file secret | Chaos infrastructure runs outside EKS, or you want explicit static credentials | Upload the AWS credentials file as a File Secret in Harness Secret Manager and reference its identifier via AWS_AUTHENTICATION_SECRET |
| IAM Roles for Service Accounts (IRSA) | Chaos infrastructure runs in EKS and uses an OIDC-bound service account | No tunable changes; the chaos pod inherits the role automatically. Go to AWS IAM integration to set it up |
| Assume role | The fault needs to act in a different account or with elevated permissions | Set ASSUME_ROLE_ARN to the role ARN; the chaos pod assumes the role on top of its base credentials |

When using the Harness Secret Manager method, the File Secret should contain an AWS credentials file in the standard ~/.aws/credentials format:

[default]
aws_access_key_id = REPLACE_WITH_ACCESS_KEY_ID
aws_secret_access_key = REPLACE_WITH_SECRET_ACCESS_KEY

Upload this file as a File Secret in Harness Secret Manager (Project Setup → Secrets → New File Secret), and pass the secret identifier in AWS_AUTHENTICATION_SECRET.
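
Before uploading, it can help to confirm that the file parses and that the placeholders were replaced. This is a minimal sketch using Python's standard `configparser`; `check_credentials_file` is a hypothetical helper, and the check for the `REPLACE_WITH_` prefix is an assumption based on the template above.

```python
import configparser


def check_credentials_file(text, profile="default"):
    """Return a list of problems found in an AWS credentials file (empty if none)."""
    # Interpolation is disabled so that a '%' in a value is never treated as syntax.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)

    if not parser.has_section(profile):
        return [f"missing [{profile}] section"]

    problems = []
    for key in ("aws_access_key_id", "aws_secret_access_key"):
        value = parser.get(profile, key, fallback="").strip()
        if not value:
            problems.append(f"{key} is missing or empty")
        elif value.startswith("REPLACE_WITH_"):
            problems.append(f"{key} still contains the placeholder")
    return problems
```

An empty list means the file has the expected section and both keys; it does not confirm that AWS accepts the credentials.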


Fault tunables

Required parameters

| Tunable | Description | Default |
| --- | --- | --- |
| EBS_VOLUME_ID | ID of the target EBS volume (for example vol-0a1b2c3d4e5f67890). | (required) |
| REGION | AWS region that hosts the target volume. | (required) |

Chaos parameters

| Tunable | Description | Default |
| --- | --- | --- |
| TOTAL_CHAOS_DURATION | Duration of the detached state in seconds. | 30 |
| CHAOS_INTERVAL | Time interval between successive iterations of the detach-attach cycle (in seconds). | 30 |
| DEFAULT_HEALTH_CHECK | When true, the fault performs default health checks against the volume's attachment state. | false |
| SEQUENCE | Reserved for parity with other faults; this fault targets one volume. | parallel |
| RAMP_TIME | Wait period in seconds before and after the fault. | 0 |

Authentication

| Tunable | Description | Default |
| --- | --- | --- |
| ASSUME_ROLE_ARN | ARN of an IAM role to assume on top of the base credentials. | "" |
| AWS_AUTHENTICATION_SECRET | Identifier of the File Secret in Harness Secret Manager that contains the AWS credentials file. Not required when using IRSA. | "" |

Detach is disruptive

A detach call interrupts in-flight IO on the volume. Plan around the workload's tolerance for sudden IO loss, and ensure backups exist before targeting a production volume.


Fault execution in brief

Looks up EBS_VOLUME_ID in REGION, captures its current attachment (instance ID and device name), calls DetachVolume, waits TOTAL_CHAOS_DURATION seconds, then calls AttachVolume to restore the original attachment.
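
The sequence above can be sketched in a few lines. This is an illustration of the described flow, not the fault's actual implementation: `run_ebs_loss` is a hypothetical function, `ec2` is assumed to behave like a boto3 EC2 client, and waiting for the `volume_available` waiter before the timer starts is an assumption about ordering.

```python
import time


def run_ebs_loss(ec2, volume_id, duration_s, sleep=time.sleep):
    """Detach a volume, hold it detached for duration_s seconds, then reattach it."""
    # Capture the current attachment so it can be restored later.
    volume = ec2.describe_volumes(VolumeIds=[volume_id])["Volumes"][0]
    attachment = volume["Attachments"][0]
    instance_id, device = attachment["InstanceId"], attachment["Device"]

    ec2.detach_volume(VolumeId=volume_id)
    try:
        # Assumption: wait until the volume reports "available" before timing the fault.
        ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])
        sleep(duration_s)
    finally:
        # Always attempt the reattach, even if the wait is interrupted.
        ec2.attach_volume(VolumeId=volume_id, InstanceId=instance_id, Device=device)

    return instance_id, device
```

Placing the reattach in a `finally` block mirrors the abort behavior described later on this page: the attachment is restored whether the wait completes or is cut short.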


Expected behavior during fault execution

  • The volume transitions from in-use to detaching to available within seconds.
  • The instance that owned the volume sees an IO error on the device once in-flight IO completes or fails.
  • Filesystems mounted from the volume become inaccessible; reads and writes return EIO.
  • Applications either log the IO failure cleanly or hang waiting for it, depending on how they handle disk errors.
  • At the end of the fault, the volume transitions back to in-use. The original instance may need to remount the filesystem: the device path returns, but the kernel may have given up on the mount point.

When the fault ends

The chaos pod calls AttachVolume for the original instance and device. From the AWS perspective the volume is attached again (state in-use), but the kernel on the original instance may need a manual remount (mount <device> <path>).

Signals to watch

  • Volume state: Use a command probe that runs aws ec2 describe-volumes --volume-ids <id> and asserts on State.
  • CloudWatch volume metrics: Use a Prometheus probe on aws_ebs_volume_queue_length; the metric goes to 0 while the volume is detached.
  • Application IO errors: Use a Prometheus probe on the application's IO-error counter (or any database-specific WAL-error metric).

Verify the fault execution effect

While the experiment is running:

  1. Confirm the volume is detached.

    aws ec2 describe-volumes \
    --region <region> \
    --volume-ids <volume-id> \
    --query "Volumes[0].[State,Attachments]"

    The state should be available and Attachments empty.

  2. After the fault ends, confirm the volume is attached again.

    aws ec2 describe-volumes \
    --region <region> \
    --volume-ids <volume-id> \
    --query "Volumes[0].[State,Attachments[0].InstanceId,Attachments[0].Device]"
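
Both checks can be automated by parsing the command output. The sketch below assumes the CLI is run with JSON output (for example with `--output json`), so the two `--query` expressions above yield small JSON arrays; `is_detached` and `is_attached_to` are hypothetical helper names.

```python
import json


def is_detached(cli_output):
    """Check the output of the first query: [State, Attachments]."""
    state, attachments = json.loads(cli_output)
    return state == "available" and not attachments


def is_attached_to(cli_output, instance_id, device):
    """Check the output of the second query: [State, InstanceId, Device]."""
    state, attached_instance, attached_device = json.loads(cli_output)
    return state == "in-use" and (attached_instance, attached_device) == (instance_id, device)
```

Comparing against the original instance ID and device name confirms that the attachment was restored as it was, not only that the volume is attached somewhere.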

Recovery and cleanup

  • End of duration: The chaos pod reattaches the volume to the original instance and device.
  • Abort the experiment: Stopping the experiment from Chaos Studio triggers the reattach call.
  • Manual recovery: If the reattach call fails (instance gone, device name conflict), call aws ec2 attach-volume --volume-id <volume-id> --instance-id <instance-id> --device <device> manually.
  • Filesystem remount: Even after a clean reattach, the kernel on the original instance may need a manual mount if the original mount was lost.

Limitations

  • No agent control on the instance: This fault operates at the EBS API layer. It does not unmount the filesystem cleanly first, so the workload sees a hard IO failure rather than a graceful detach.
  • Reattach may fail: If the original instance is stopped or terminated between detach and reattach, the reattach call fails. Recover manually.
  • Device name reuse: The original device name (for example /dev/xvdf) must still be available on the original instance at reattach time. The fault does not free it for you.
  • Root volumes: Cannot detach the root volume of a running instance. Stop the instance first.
  • Cross-AZ: A volume cannot move between Availability Zones; the fault always reattaches to the same AZ.

Troubleshooting

EBS loss by ID experiment fails with UnauthorizedOperation

The credentials supplied to the chaos pod do not have the required EC2 permissions in the target region. Confirm the IAM policy attached to the user, role, or IRSA service account includes ec2:DetachVolume, ec2:AttachVolume, ec2:DescribeVolumes, and ec2:DescribeInstances. When using ASSUME_ROLE_ARN, also confirm the trust policy allows the source identity.

EBS loss by ID experiment failed to reattach the volume

The most common causes are: the original instance was stopped or terminated between detach and reattach; the device name on the original instance is now occupied by another volume; or the instance you are attaching to (for example, a replacement instance) is in a different AZ than the volume. Recover with aws ec2 attach-volume --volume-id <volume-id> --instance-id <instance-id> --device <device>; if that fails, attach to a different instance in the same AZ, mount, copy data off, and reattach to the intended instance manually.
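
These three causes can be checked from the describe output before retrying the attach. A minimal sketch, assuming `volume` is one entry from a boto3 `describe_volumes` response and `instance` is one entry from `describe_instances`; `reattach_blockers` is a hypothetical helper name.

```python
def reattach_blockers(volume, instance, device):
    """Return human-readable reasons why attaching `volume` to `instance` may fail."""
    blockers = []

    state = instance["State"]["Name"]
    if state != "running":
        blockers.append(f"instance is {state}")

    used_devices = {m["DeviceName"] for m in instance.get("BlockDeviceMappings", [])}
    if device in used_devices:
        blockers.append(f"device {device} is already in use")

    if volume["AvailabilityZone"] != instance["Placement"]["AvailabilityZone"]:
        blockers.append("volume and instance are in different Availability Zones")

    return blockers
```

An empty list does not guarantee that the attach succeeds; it only rules out the causes listed above.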

EBS loss by ID reattached the volume but the filesystem is read-only

The kernel marked the filesystem read-only after the hard detach. Connect to the instance via SSM Session Manager, unmount with 'umount -lf <mount>' (lazy + force), run fsck if needed, then remount. For production-critical volumes, snapshot the volume before chaos and consider running fsck during a maintenance window after the experiment.
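
To confirm the read-only state on a Linux instance before unmounting, the mount options can be read from /proc/mounts. This is a small sketch under that assumption; `is_read_only` is a hypothetical helper that takes the file's text so it can be run on captured output as well.

```python
def is_read_only(mounts_text, mount_point):
    """Return True/False for the mount's read-only flag, or None if it is not mounted.

    `mounts_text` is the content of /proc/mounts: device, mount point,
    filesystem type, and comma-separated options on each line.
    """
    result = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # A later entry for the same mount point shadows an earlier one.
            result = "ro" in fields[3].split(",")
    return result


# Usage on the instance (mount point is an example):
#   with open("/proc/mounts") as f:
#       print(is_read_only(f.read(), "/data"))
```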