Best Practices for Probe Validation - Node Level Faults
This topic describes best practices for using resilience probes with Kubernetes node-level chaos faults.
Common Node Fault Tunables
Environment variables shared by node-level chaos faults for selecting target nodes by name, by label, or by percentage.
Kubelet Service Kill
Stop the kubelet on a Kubernetes node to simulate node loss without rebooting, and test eviction, rescheduling, and recovery behavior.
Node CPU Hog
Exhaust CPU on a Kubernetes node to test scheduler behavior, pod eviction under pressure, HPA reactions, and noisy-neighbor isolation.
Node Drain
Cordon and drain a Kubernetes node using the Eviction API to test PodDisruptionBudget enforcement, graceful shutdown, and rescheduling behavior.
Node IO Stress
Stress disk I/O on a Kubernetes node to test ephemeral-storage eviction, etcd write tolerance, log shipper backpressure, and noisy-neighbor isolation.
Node Memory Hog
Exhaust memory on a Kubernetes node to test kubelet eviction order, QoS-based pod prioritization, OOM behavior, and noisy-neighbor isolation.
Node Network Latency
Inject configurable network latency on a Kubernetes node's interface to test application timeouts, retry tuning, and tail-latency resilience.
Node Network Loss
Drop a configurable percentage of packets on a Kubernetes node's network interface to test cluster, application, and control-plane resilience.
Node Restart
Reboot a Kubernetes node over SSH to test how the cluster handles sudden node loss, pod rescheduling, and stateful recovery.
Node Taint
Apply a temporary taint to a Kubernetes node to test toleration correctness, scheduling policies, and NoExecute eviction behavior.
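
The target selection summarized under Common Node Fault Tunables (by name, by label, or by percentage) can be illustrated with a small sketch. This is not the chaos tooling's actual implementation: the `Node` type, the `select_target_nodes` function, its parameter names, and the precedence (explicit names first, then label, then percentage applied to whatever remains, rounded up to at least one node) are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a cluster node: a name plus its labels."""
    name: str
    labels: dict = field(default_factory=dict)


def select_target_nodes(nodes, names=None, label=None, percentage=None):
    """Return the names of nodes a fault would target.

    Assumed precedence: explicit names win over a label selector; the
    percentage, if given, then trims the resulting candidate list.
    """
    if names:
        wanted = set(names)
        candidates = [n for n in nodes if n.name in wanted]
    elif label:
        # Only simple "key=value" selectors are handled in this sketch.
        key, _, value = label.partition("=")
        candidates = [n for n in nodes if n.labels.get(key) == value]
    else:
        candidates = list(nodes)

    if percentage is not None and candidates:
        # Assumption: round up and always keep at least one node.
        count = max(1, math.ceil(len(candidates) * percentage / 100))
        candidates = candidates[:count]

    return [n.name for n in candidates]


cluster = [
    Node("node-a", {"role": "worker"}),
    Node("node-b", {"role": "worker"}),
    Node("node-c", {"role": "infra"}),
]
by_name = select_target_nodes(cluster, names=["node-b"])
by_label = select_target_nodes(cluster, label="role=worker")
half_of_workers = select_target_nodes(cluster, label="role=worker", percentage=50)
```

Keeping the selection logic in one pure function makes it easy to check which nodes a given combination of tunables would affect before running a fault against a real cluster.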