Pod IO mistake

Last updated on Jun 2, 2026

Pod IO mistake is a Kubernetes pod-level chaos fault that overwrites a configurable number of bytes in matched filesystem read or write syscalls on a target container's mounted volume for a configurable duration. The application sees the call succeed, but the data flowing through it has been seeded with wrong bytes. When the fault ends, syscalls behave normally again immediately.

Use this fault to test how a service detects and recovers from silent data corruption: a flaky disk that flips bytes, a buggy controller that returns shuffled blocks, or a hardware error that escapes parity checks.

warning

Because this fault has a large blast radius, we recommend that you do not run it in a production environment.
During fault execution, the application pod may fail to complete IO writes when the write syscall is targeted. Any data produced in that window can be lost.
Data written before the fault starts is not affected.

Run your first experiment

If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.

Use cases

Run this fault when you want to answer concrete questions like:

Checksum coverage: Does the application verify checksums on read, or does it silently propagate corrupted data downstream?
Format-level validation: When a JSON or Protobuf payload is read with zeroed bytes in the middle, does the parser reject the message or crash on it?
Replication conflict handling: For replicated stateful workloads, does a corrupted write cause the follower to diverge, and does the cluster detect the split?
Backup integrity: If a backup process reads corrupted bytes from a snapshot, does the resulting backup get flagged or stored as-is?
Recovery story: Once the fault ends, can the application repair its in-memory state, or does it stay broken until restart?

Prerequisites

Kubernetes version: 1.21 or later. Go to What's supported to confirm distribution support.
Target pod is Running: The pod you intend to target is in the Running state with a mounted volume.
Privileged pods allowed: The cluster lets you schedule privileged pods in the chaos namespace. GKE Autopilot supports this fault but requires the one-time setup in Chaos on GKE Autopilot; other locked-down distributions may need similar exemptions.
Container runtime access: The chaos pod can reach the container runtime socket on the target node (/run/containerd/containerd.sock, /var/run/docker.sock, or /var/run/crio/crio.sock).
Mounted volume: The target container has at least one writable mounted volume reachable at MOUNT_PATH.
Workload selector defined: The chaos experiment knows the target workload by kind, namespace, and either names or labels.

Supported environments

Platform	Support status
Amazon EKS	Supported
Azure AKS	Supported
Google GKE	Supported
Red Hat OpenShift	Supported
Rancher	Supported
VMware Tanzu	Supported
Self-managed Kubernetes (CNCF-certified)	Supported
GKE Autopilot	Supported with Autopilot setup
EKS Fargate, ACI virtual nodes	Not supported (no access to container runtime sockets)

Permissions required

The fault runs under the chaos infrastructure's service account.

Resource (`apiGroup`)	Verbs	Why it is needed
`pods` (`""`)	`get`, `list`, `create`, `delete`, `deletecollection`, `patch`, `update`	Discover target pods and run the chaos pod on the same node
`pods/log` (`""`)	`get`, `list`, `watch`	Stream chaos pod logs for status and debugging
`deployments`, `statefulsets`, `replicasets`, `daemonsets` (`apps`)	`get`, `list`	Resolve the target workload to the pods it owns
`events` (`""`)	`get`, `list`, `create`, `patch`, `update`	Record fault progress as Kubernetes events
`jobs` (`batch`)	`get`, `list`, `create`, `delete`, `deletecollection`	Run the chaos job that drives the fault

The default Harness chaos infrastructure service account already includes these permissions.
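If you run under a custom service account, the table above can be turned into a quick self-check. The following is a minimal sketch, not Harness tooling: the function name `print_rbac_checks` is hypothetical, and the namespace and service account names are placeholders you supply. It only prints `kubectl auth can-i` commands, one per verb and resource pair, so you can review them before running anything.

```shell
# Hypothetical helper: print one `kubectl auth can-i` command per
# verb/resource pair listed in the permissions table.
# $1 = chaos namespace, $2 = service account name (both placeholders).
print_rbac_checks() {
  ns=$1
  sa=$2
  # Each line is "resource|verbs". The pods/log subresource is expressed
  # with --subresource, because can-i reads "pods/log" as TYPE/NAME.
  while IFS='|' read -r resource verbs; do
    for verb in $verbs; do
      echo "kubectl auth can-i $verb $resource -n $ns --as=system:serviceaccount:$ns:$sa"
    done
  done <<'EOF'
pods|get list create delete deletecollection patch update
pods --subresource=log|get list watch
deployments.apps|get list
statefulsets.apps|get list
replicasets.apps|get list
daemonsets.apps|get list
events|get list create patch update
jobs.batch|get list create delete deletecollection
EOF
}
```

Piping the output to a shell, for example `print_rbac_checks <chaos-namespace> <service-account> | sh`, runs each check against the cluster; any check that answers `no` points at a missing rule.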

Fault tunables

Configure the following fault parameters when you add Pod IO mistake to an experiment in Chaos Studio. Defaults are shown for reference.

Chaos parameters

Tunable	Description	Default
`SEED`	Wrong-data pattern to substitute. Common values: `zero` (zero bytes), `random`, or a literal string.	`"zero"`
`MAX_TIMES`	Maximum number of places per syscall payload to seed the wrong data.	`1`
`MAX_SIZE`	Maximum size in bytes of each seeded chunk.	`8`
`PERCENTAGE`	Percentage of matching IO requests to corrupt, between `0` and `100`. `100` corrupts every request.	`100`
`TOTAL_CHAOS_DURATION`	Duration of the fault in seconds.	`60`

Filters

Tunable	Description	Default
`MOUNT_PATH`	Volume mount path inside the container to scope the fault. Empty applies to all mounts.	`""`
`FILE_PATH`	File path or glob beneath `MOUNT_PATH` to scope the fault. Empty applies to all files under `MOUNT_PATH`.	`""`
`METHOD_TYPES`	Comma-separated syscall types to corrupt: `read`, `write`, `open`. Empty applies to all three.	`""`

Targeting

Tunable	Description	Default
`TARGET_PODS`	Comma-separated list of pod names to target. Empty selects from the workload's pods using `POD_AFFECTED_PERCENTAGE`.	`""`
`TARGET_CONTAINER`	Container in the pod whose filesystem to corrupt. Empty targets the first container in the pod spec.	`""`
`NODE_LABEL`	Label selector to filter target pods by the node they run on. Empty disables node-based filtering.	`""`
`POD_AFFECTED_PERCENTAGE`	Percentage of the workload's pods to target. `0` means one pod.	`0`
`SEQUENCE`	When multiple pods are targeted, inject `parallel` (all at once) or `serial` (one after another).	`parallel`

Runtime and helper

Tunable	Description	Default
`CONTAINER_RUNTIME`	Container runtime on the target nodes. One of `containerd`, `docker`, `crio`.	`containerd`
`SOCKET_PATH`	Path to the container runtime socket on the target node. Set to match `CONTAINER_RUNTIME`.	`/run/containerd/containerd.sock`
`RAMP_TIME`	Wait period in seconds before and after the fault. Go to ramp time to read how it is applied.	`0`

Common pod selection tunables (TARGET_WORKLOAD_KIND, TARGET_WORKLOAD_NAMESPACE, TARGET_WORKLOAD_NAMES, TARGET_WORKLOAD_LABELS) are documented in common pod fault tunables. Tunables that apply to every fault are documented in common tunables for all faults.
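Before submitting an experiment, it can help to sanity-check the values you plan to use. The sketch below is an illustration only: the helper names `valid_percentage` and `valid_method_types` are hypothetical, and the strictness (integers only, no spaces around commas) is a choice made for this sketch rather than documented behavior of the fault.

```shell
# Hypothetical pre-flight checks for PERCENTAGE and METHOD_TYPES.

# Succeeds when $1 is an integer between 0 and 100.
valid_percentage() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$1" -ge 0 ] && [ "$1" -le 100 ]
}

# Succeeds when $1 is empty (meaning all syscall types) or a
# comma-separated list drawn from read, write, open.
valid_method_types() {
  [ -z "$1" ] && return 0
  old_ifs=$IFS
  IFS=,
  for m in $1; do
    case "$m" in
      read|write|open) ;;
      *) IFS=$old_ifs; return 1 ;;
    esac
  done
  IFS=$old_ifs
}
```

For example, `valid_percentage "$PERCENTAGE" && valid_method_types "$METHOD_TYPES"` returns success for the defaults listed in the tables above.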

Writes seed corruption to disk

When METHOD_TYPES includes write, the wrong bytes can land on disk and survive the fault. Always run against a recoverable test environment with backups; never run against production data.

Configure for your container runtime

Set CONTAINER_RUNTIME and SOCKET_PATH to match the runtime on the target node:

`CONTAINER_RUNTIME`	`SOCKET_PATH`
`containerd` (default)	`/run/containerd/containerd.sock`
`docker`	`/var/run/docker.sock`
`crio`	`/var/run/crio/crio.sock`
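If you are unsure which runtime a node uses, the node object reports it in `status.nodeInfo.containerRuntimeVersion` (for example `containerd://1.7.2`). The following sketch maps that string to the two tunables. The function names are hypothetical and the version strings are examples, not values taken from this page.

```shell
# Hypothetical helpers: derive CONTAINER_RUNTIME and SOCKET_PATH from a
# node's containerRuntimeVersion string.

# "cri-o://1.28.1" -> "crio", "containerd://1.7.2" -> "containerd"
runtime_from_version() {
  name=${1%%://*}              # keep the part before "://"
  case "$name" in
    containerd) echo containerd ;;
    docker)     echo docker ;;
    cri-o)      echo crio ;;
    *)          echo "unknown runtime: $name" >&2; return 1 ;;
  esac
}

# Socket paths as listed in the table above.
socket_path_for_runtime() {
  case "$1" in
    containerd) echo /run/containerd/containerd.sock ;;
    docker)     echo /var/run/docker.sock ;;
    crio)       echo /var/run/crio/crio.sock ;;
    *)          echo "unknown runtime: $1" >&2; return 1 ;;
  esac
}
```

To feed it a real value, read the field with `kubectl get node <node-name> -o jsonpath='{.status.nodeInfo.containerRuntimeVersion}'` and pass the result to `runtime_from_version`. Nodes with a non-default socket location still need `SOCKET_PATH` set by hand.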

Fault execution in brief

The fault intercepts filesystem syscalls (read, write, open) inside the target container's mount namespace and replaces up to MAX_TIMES slices of up to MAX_SIZE bytes in each matched payload with the SEED pattern. PERCENTAGE limits how many matching requests are corrupted; the remaining IO passes through unchanged.
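To build intuition for what a single seeded chunk looks like, you can mimic the effect on a scratch file. This sketch does not inject the fault and is not how the fault is implemented; it overwrites bytes directly with `dd` to imitate `SEED=zero` with one chunk of 8 bytes. The helper names `zero_chunk` and `checksum_ok`, the sample content, and the offset are all made up for illustration.

```shell
# Local illustration only: imitate SEED=zero on a scratch file.

# zero_chunk FILE OFFSET SIZE: overwrite SIZE bytes at OFFSET with zeros
# without changing the file length.
zero_chunk() {
  dd if=/dev/zero of="$1" bs=1 seek="$2" count="$3" conv=notrunc 2>/dev/null
}

# checksum_ok FILE DIGEST: succeed when FILE still has the given SHA-256.
checksum_ok() {
  actual=$(sha256sum "$1" | cut -d' ' -f1)
  [ "$actual" = "$2" ]
}

# Demo on a temporary file.
tmp=$(mktemp)
printf 'order-id=12345;amount=99.50;status=paid\n' > "$tmp"
expected=$(sha256sum "$tmp" | cut -d' ' -f1)

zero_chunk "$tmp" 9 8          # one chunk (MAX_TIMES=1) of 8 bytes (MAX_SIZE=8)

if checksum_ok "$tmp" "$expected"; then
  echo "corruption missed"
else
  echo "corruption detected"
fi
rm -f "$tmp"
```

The file keeps its original length, so size checks alone do not catch the change; only a content check such as a checksum does. That is the gap this fault is meant to probe.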

Expected behavior during fault execution

Matched syscalls return success with corrupted payloads. Applications that do not checksum what they read or write pass the corruption downstream.
Format-aware parsers (JSON, Protobuf, BSON) typically fail loudly with parse errors when the corruption falls in a structural region.
Checksumming layers (RocksDB, etcd, MongoDB) surface verification failures and may close affected segments or trip read-repair.
Writes can persist corrupted bytes to the backing volume. The corruption survives until overwritten or repaired by the application.

When the fault ends

Syscalls behave normally again immediately. Any corrupted data already written to the volume remains until the application overwrites or repairs it.

Signals to watch

Attach resilience probes to assert each layer:

Application error rate: Use an HTTP probe to detect parse or checksum errors surfacing as 5xx.
Pod readiness: Use a Kubernetes probe to fail when the target pod stops being Ready because of failing self-checks.
Checksum / corruption metrics: Use a Prometheus probe on workload-specific counters (RocksDB block_checksum_mismatch, MongoDB wiredTiger cursor errors, etc.).

Verify the fault execution effect

While the experiment is running, confirm the data is corrupted:

Read a matched file and inspect a hexdump.

```
kubectl exec -n <namespace> <target-pod> -c <target-container> -- \
  sh -c "head -c 64 <MOUNT_PATH>/<FILE_PATH> | hexdump -C"
```

The leading bytes should contain the SEED pattern instead of the original content.

Confirm application-level impact.
```
kubectl logs -n <namespace> <target-pod> -c <target-container> --tail=200
```
Look for parse errors, checksum mismatches, or rejected payloads.

Recovery and cleanup

End of duration: Syscalls behave normally again automatically.
Abort the experiment: Stopping the experiment from Chaos Studio triggers the same cleanup path.
Persisted corruption: Bytes written to the volume during the fault remain corrupted. Restore from backup, replay from a clean source, or rebuild affected state.

Limitations

Serverless Kubernetes (EKS Fargate, ACI virtual nodes): These platforms do not expose container runtime sockets and reject the privileged access the fault needs. GKE Autopilot is supported once the one-time setup in Chaos on GKE Autopilot is in place.
Windows containers: This fault is supported on Linux pods only.
Read-only volumes: ConfigMap, Secret, and downward API volumes are read-only, so there are no writes on them for the fault to corrupt.
Persisted corruption is not undone: Writes that succeeded during the fault leave corrupted bytes on disk. Plan recovery before running.
Page cache and mmap: Bytes already in the page cache or mapped pages may bypass the syscall path; behavior on these reads is not guaranteed.

Troubleshooting

Pod IO mistake experiment stays Pending or never starts in Harness Chaos Engineering

Inspect the chaos pods in the experiment namespace with kubectl describe pod -n <chaos-namespace>. The most common causes are taints on the target node that the chaos pods do not tolerate, insufficient resources, or a PodSecurity admission policy blocking privileged pods. Add the required tolerations to the experiment, or run it in a namespace that allows the privileged Pod Security level.

No corruption observed during pod-io-mistake

The most common causes are: MOUNT_PATH does not match a real mount inside the container (verify with kubectl exec <pod> -- mount); the FILE_PATH or METHOD_TYPES filters are too narrow; or the application reads through the page cache and never hits the patched syscall. To isolate the cause, re-run with PERCENTAGE=100 and empty filters, drop caches if possible, and then hexdump a fresh read.
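The mount check can be scripted. The sketch below assumes the usual Linux `mount` output format (`<device> on <path> type <fstype> (<options>)`); the helper name `is_mount_point` and the `/data` path in the usage note are hypothetical.

```shell
# Hypothetical helper: read `mount` output on stdin and succeed when the
# given path appears as a mount point.
# Expected line shape: "/dev/sda1 on /data type ext4 (rw,relatime)"
is_mount_point() {
  awk -v p="$1" '$2 == "on" && $3 == p { found = 1 } END { exit !found }'
}
```

For example, `kubectl exec -n <namespace> <target-pod> -c <target-container> -- mount | is_mount_point /data` exits with status 0 only when `/data` is a real mount inside the container. The match is exact, so a trailing slash in MOUNT_PATH counts as a different path in this sketch.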

Connection to container runtime fails for pod-io-mistake in Harness Chaos Engineering

The default SOCKET_PATH is /run/containerd/containerd.sock. For Docker, set CONTAINER_RUNTIME=docker and SOCKET_PATH=/var/run/docker.sock. For CRI-O, set CONTAINER_RUNTIME=crio and SOCKET_PATH=/var/run/crio/crio.sock.

Data is still corrupted after pod-io-mistake ends

Writes that landed on disk during the fault remain corrupted. Restore the affected files from backup, replay from a known-good source, or rebuild the volume. Run this fault only against test data that can be recreated.

Related faults

Pod IO error: Fail filesystem syscalls with a configurable error code instead of corrupting data.
Pod IO latency: Add delay to filesystem syscalls.
Pod IO attribute override: Override file attributes on the mounted volume.
Pod IO stress: Generate sustained read and write load.
Common pod fault tunables: Shared environment variables for selecting target pods and workloads.
