Node network latency

Node network latency is a Kubernetes node-level chaos fault that adds a configurable delay to packets leaving a target node for a configurable duration. Every workload that shares the node's network stack experiences the added round-trip time: application pods, DaemonSets, kube-proxy, CNI agents, and the kubelet itself.

Use this fault to simulate a slow network neighbor: a congested transit link, a saturated NIC, an overloaded service mesh sidecar, or a cross-region failover that has stretched east-west latency.

Run your first experiment

If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.


Use cases

Run this fault when you want to answer concrete questions like:

  • Client timeout configuration: Are HTTP, gRPC, and database client timeouts tuned for realistic tail latency, or do they fire so aggressively that a brief spike collapses connection pools?
  • Retry and hedging behavior: Do retries spread over time as designed, or do they pile on at the same instant and amplify the slowdown?
  • Service-mesh and load balancer behavior: Do outlier detection and latency-aware load balancing evict the slow node's endpoints from the rotation, and how quickly?
  • Asynchronous queue depth: When upstream calls take longer, do queues, thread pools, and goroutine counts stay bounded, or do they grow until the application runs out of memory?
  • SLO and error budget validation: Does your latency SLO fire before user-visible degradation, or only after?
  • Cross-region or cross-AZ failover: When latency to a dependency stretches to cross-region levels (for example, 100 to 300 ms), does the application keep serving from the local replica, or does it amplify the slowness by waiting?

Prerequisites

  • Kubernetes version: 1.21 or later. Go to What's supported to confirm distribution support.
  • Standard Linux networking modules: Target nodes run a Linux kernel that includes the standard networking modules (present by default on most distributions). On minimal or hardened images, your platform team may need to install kernel-modules-extra (RHEL-family) or the equivalent package.
  • Privileged pods allowed: The cluster lets you schedule privileged pods with NET_ADMIN and SYS_ADMIN capabilities in the chaos namespace. GKE Autopilot supports this fault but requires the one-time setup in Chaos on GKE Autopilot; other locked-down distributions may need similar exemptions.
  • Node readiness: Target nodes are in Ready state before the fault is launched. The fault reports a precheck failure otherwise.
  • Application timeouts are configured: Without explicit client timeouts, your application cannot distinguish a slow request from a hung one. Run this fault against services that have realistic timeouts; otherwise the observation has no failure boundary.

Supported environments

Platform | Support status
Amazon EKS | Supported
Azure AKS | Supported
Google GKE | Supported
Red Hat OpenShift | Supported
Rancher | Supported
VMware Tanzu | Supported
Self-managed Kubernetes (CNCF-certified) | Supported
GKE Autopilot | Supported with Autopilot setup

Permissions required

The fault runs under the chaos infrastructure's service account. The account must be able to perform the following operations against the target cluster.

Resource (apiGroup) | Verbs | Why it is needed
pods ("") | get, list, create, delete, deletecollection, patch, update | Run the chaos pod that injects the fault on the target node
pods/log ("") | get, list, watch | Stream chaos pod logs for status and debugging
events ("") | get, list, create, patch, update | Record fault progress as Kubernetes events
nodes ("") | get, list | Discover target nodes and validate selectors
jobs (batch) | get, list, create, delete, deletecollection | Run the chaos job that drives the fault

The default Harness chaos infrastructure service account already includes these permissions. You only need to extend it if you are running with a restricted scope, for example a namespace-scoped install where you have removed cluster-wide node access.


Fault tunables

Configure the following fault parameters when you add Node network latency to an experiment in Chaos Studio. Defaults are shown for reference.

Chaos parameters

Tunable | Description | Default
NETWORK_LATENCY | Delay added to each packet on the shaped path. Accepts a Go-style duration string (for example, 500ms, 2s, 1m). | "2s"
TOTAL_CHAOS_DURATION | Duration of the fault in seconds. | 60
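
When you sanity-check a NETWORK_LATENCY value against your client timeouts, it can help to convert the duration string to seconds first. The following is a minimal sketch, assuming the fault accepts the same unit suffixes as Go's `time.ParseDuration` (this page only says "Go-style"); `parse_duration` is a hypothetical helper, not part of the product.

```python
import re

# Unit multipliers in seconds. Assumption: the same suffixes that Go's
# time.ParseDuration accepts (ns, us, µs, ms, s, m, h).
_UNITS = {"ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
          "s": 1.0, "m": 60.0, "h": 3600.0}
# "ms" is listed before "m" and "s" so that "500ms" is not read as minutes.
_TOKEN = re.compile(r"(\d+(?:\.\d+)?)(ns|us|µs|ms|s|m|h)")


def parse_duration(text: str) -> float:
    """Convert a duration such as '500ms', '2s', or '1m30s' to seconds."""
    if not text:
        raise ValueError("empty duration")
    pos = 0
    total = 0.0
    while pos < len(text):
        match = _TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNITS[match.group(2)]
        pos = match.end()
    return total


def exceeds_timeout(network_latency: str, client_timeout_s: float) -> bool:
    """True when the injected delay alone is at least the client timeout."""
    return parse_duration(network_latency) >= client_timeout_s
```

For example, `exceeds_timeout("2s", 1.0)` is true, which suggests that a client with a one-second timeout would fail every call on the shaped path at the default delay.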

Scoping

Tunable | Description | Default
DESTINATION_IPS | Comma-separated list of IP addresses or CIDRs to scope the delay to. Empty means all destinations. | ""
DESTINATION_HOSTS | Comma-separated list of DNS names or FQDNs to scope the delay to. Empty means all destinations. | ""
NETWORK_INTERFACE | Interface on which tc rules are applied. | eth0
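
Before you run a scoped experiment, you may want to confirm that the dependency you care about actually falls inside DESTINATION_IPS. The sketch below uses Python's standard `ipaddress` module; `parse_destination_ips` and `is_delayed` are hypothetical helpers for illustration. It considers DESTINATION_IPS only and does not model how the fault combines it with DESTINATION_HOSTS.

```python
import ipaddress


def parse_destination_ips(value: str):
    """Parse a DESTINATION_IPS-style string into a list of networks."""
    networks = []
    for item in value.split(","):
        item = item.strip()
        if item:
            # strict=False tolerates host bits, so 10.0.0.5/24 becomes
            # 10.0.0.0/24; a bare address becomes a single-host network.
            networks.append(ipaddress.ip_network(item, strict=False))
    return networks


def is_delayed(dest_ip: str, destination_ips: str) -> bool:
    """Return True if traffic to dest_ip falls inside the configured scope."""
    networks = parse_destination_ips(destination_ips)
    if not networks:
        # Empty scope: this page states that all destinations are affected.
        return True
    address = ipaddress.ip_address(dest_ip)
    return any(address in network for network in networks)
```

For instance, `is_delayed("172.16.0.1", "10.0.0.0/24,192.168.1.5")` returns `False`, which would explain an experiment in which traffic to 172.16.0.1 shows no added latency.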

Targeting

Tunable | Description | Default
TARGET_NODES | Comma-separated list of node names to target. Go to target multiple nodes to read more. | ""
NODE_LABEL | Label selector for choosing target nodes. Go to target nodes with labels to read more. | ""
NODES_AFFECTED_PERCENTAGE | Percentage of nodes (matching the selector) to target. 0 means one node. | 0
SEQUENCE | When multiple nodes are targeted, inject parallel (all at once) or serial (one after another). | parallel

Runtime and helper

Tunable | Description | Default
RAMP_TIME | Wait period in seconds before and after the fault. Go to ramp time to read how it is applied. | 0

Tunables that apply to every chaos fault are documented in common tunables for all faults.

Start small and scope aggressively

The default 2s latency is high enough to break most service-level objectives. For experiments that target a single dependency path, set DESTINATION_HOSTS to the FQDN of the slow dependency and start with a smaller NETWORK_LATENCY (for example, 100ms to 500ms). Apply the default 2s only when you intend to simulate a near-partition.


Fault execution in brief

The fault configures the target node's network interface to add the specified delay (with optional jitter) to outgoing packets for the configured duration. You can scope the delay to specific destination IPs or hosts so that other traffic passes through unaffected.

Because the delay is added at the node's network path, every connection passing through it sees a longer round-trip time. The paths and behaviors most commonly affected are:

Path | What is affected
Pod to pod across nodes | East-west RPC latency grows. Connection pools that assume tight RTTs may exhaust.
Pod to external endpoints | Egress to image registries, third-party APIs, managed databases, and SaaS providers. TLS handshake and TCP keepalive timers stretch.
Pod to kube-apiserver | In-cluster API calls take longer, but kubelet heartbeats survive unless the added latency is comparable to node-monitor-grace-period.
Node to node | Overlay and underlay traffic between nodes slows. CNI control-plane chatter slows but rarely breaks.

Pods on the same node that talk over localhost or the node's CNI bridge are typically unaffected. Bridge-mode CNIs (for example kindnet, flannel with VXLAN, default Calico) keep intra-node traffic local; some overlay configurations hairpin through the host NIC and pick up the delay.
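
To make the mechanism concrete, the sketch below builds the kind of `tc` command lines that a delay like this is commonly implemented with on Linux. This page only states that tc rules are applied on NETWORK_INTERFACE; the exact commands the fault issues are not documented here, so treat the qdisc layout, the band numbers, and the helper names `netem_commands` and `cleanup_command` as assumptions. The functions return strings and execute nothing.

```python
def netem_commands(interface: str, delay_ms: int, dest_cidrs=()):
    """Return tc command lines (as strings) that would add egress delay.

    Illustrative only: not taken from the fault's implementation.
    """
    if not dest_cidrs:
        # Unscoped: attach netem at the root so every egress packet is delayed.
        return [f"tc qdisc add dev {interface} root netem delay {delay_ms}ms"]

    commands = [
        # A prio qdisc with a fourth band. The default priomap only references
        # the first three bands, so band 4 receives only filter-matched packets.
        f"tc qdisc add dev {interface} root handle 1: prio bands 4",
        # Delay only the traffic that lands in band 4.
        f"tc qdisc add dev {interface} parent 1:4 netem delay {delay_ms}ms",
    ]
    for cidr in dest_cidrs:
        # Steer packets for each scoped destination into the delayed band.
        commands.append(
            f"tc filter add dev {interface} protocol ip parent 1:0 prio 3 "
            f"u32 match ip dst {cidr} flowid 1:4"
        )
    return commands


def cleanup_command(interface: str) -> str:
    """Deleting the root qdisc removes the delay and any attached filters."""
    return f"tc qdisc del dev {interface} root"
```

Printing `netem_commands("eth0", 200, ["10.0.0.5/32"])` shows why an unscoped run affects every path in the table above while a scoped run touches only the matched destinations.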


Expected behavior during fault execution

At the default 2s delay, applications with realistic timeouts (typically 1 to 5 seconds for HTTP, lower for in-cluster RPCs) will see immediate failures: timeouts firing, retries piling up, and connection pools draining.

In detail:

  • TCP round-trip time on the shaped path increases by NETWORK_LATENCY. Existing connections survive; new connections take longer to establish because the TCP handshake and the TLS handshake (one additional round trip with TLS 1.3, two with TLS 1.2) each pay the added delay.
  • Application-level timeouts decide what users see. If the client timeout is shorter than NETWORK_LATENCY, every call on the shaped path times out. If it is longer, the call succeeds but tail latency spikes.
  • Connection pools that bound concurrency to a small number can exhaust quickly. A pool of size 50 that serves 1000 req/s at 5 ms RTT needs only about 5 connections in flight. With 2 seconds added, each request holds its connection for about 2 seconds, so the same pool can complete only about 25 req/s and becomes the bottleneck.
  • The kubelet's node lease renewal RPC keeps working at the default 2s delay because the node-monitor-grace-period (40 to 50 seconds in current Kubernetes versions) is well above the added latency. At extreme delays (tens of seconds), the node can still flip to NotReady.
  • Retries that lack jitter and exponential backoff amplify the load on upstream services, sometimes triggering rate limiting or cascading failures further upstream.
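
The connection pool arithmetic above follows from Little's law (average requests in flight = arrival rate × time each request spends in the system). The sketch below works through the numbers, under the simplifying assumption that each request occupies one pooled connection for exactly one round trip; the function names are illustrative.

```python
def required_connections(rate_rps: float, latency_s: float) -> float:
    """Little's law: average in-flight requests = arrival rate x latency."""
    return rate_rps * latency_s


def max_throughput(pool_size: int, latency_s: float) -> float:
    """Highest request rate a fixed-size pool can sustain at this latency."""
    return pool_size / latency_s


BASELINE_RTT_S = 0.005   # 5 ms round trip before the fault
ADDED_LATENCY_S = 2.0    # default NETWORK_LATENCY
POOL_SIZE = 50
OFFERED_LOAD_RPS = 1000

before = required_connections(OFFERED_LOAD_RPS, BASELINE_RTT_S)
during = required_connections(OFFERED_LOAD_RPS, BASELINE_RTT_S + ADDED_LATENCY_S)
ceiling = max_throughput(POOL_SIZE, BASELINE_RTT_S + ADDED_LATENCY_S)

print(f"connections needed before the fault: {before:.0f}")
print(f"connections needed during the fault: {during:.0f}")
print(f"throughput ceiling with {POOL_SIZE} connections: {ceiling:.1f} req/s")
```

Under these assumptions the workload needs roughly 5 connections before the fault and roughly 2,005 during it, so a pool of 50 can complete only about 25 req/s while the remaining requests queue or time out.
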

Latency, not loss

Unlike packet loss, this fault rarely produces a NotReady node transition at the default delay. If you need to test the cluster's partition-handling behavior (pod eviction, taint manager, StatefulSet stuck-Terminating), use Node network loss instead.

Signals to watch

A useful experiment captures signals from three layers. Attach resilience probes to assert each layer automatically:

  • Application service-level indicators: Watch p50, p95, and p99 latency for the affected workloads. The signal that matters is whether your timeouts catch the degradation before users do. Use an HTTP probe for direct endpoint latency assertions.
  • Connection pool and concurrency metrics: Look for in-flight request counts, queue depth, thread pool saturation, and goroutine counts. Use a Prometheus probe or an APM probe to fail the experiment when concurrency exceeds your safe ceiling.
  • Upstream amplification: Track RPS to the slow dependency before and during the fault. Retries without jitter often double or triple the load. Use a Prometheus probe on the upstream's request rate to detect the amplification pattern.
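
If the upstream request rate does multiply during the fault, the usual remedy is capped exponential backoff with jitter on the client side. The following is a minimal sketch of the "full jitter" variant, in which each retry sleeps for a random time between zero and an exponentially growing ceiling. The function name and the default `base_s` and `cap_s` values are illustrative assumptions, not recommendations from this page.

```python
import random


def backoff_delays(attempts: int, base_s: float = 0.1, cap_s: float = 5.0,
                   rng=random):
    """Yield one sleep interval per retry: capped exponential backoff
    with full jitter."""
    for attempt in range(attempts):
        # The ceiling doubles on each attempt until it reaches cap_s.
        ceiling = min(cap_s, base_s * (2 ** attempt))
        # Full jitter: choose uniformly in [0, ceiling] so that clients that
        # failed at the same instant do not retry at the same instant.
        yield rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for number, delay in enumerate(backoff_delays(5), start=1):
        print(f"retry {number}: sleep {delay:.3f}s")
```

Passing a seeded `random.Random` instance as `rng` makes the schedule reproducible in tests.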

Verify the fault execution effect

While the experiment is running, confirm that latency is reaching the node:

  1. Measure RTT from outside the node. From another pod on a different node:

    kubectl run -it --rm netshoot --image=nicolaka/netshoot --restart=Never -- \
    ping -c 20 <target-pod-ip-or-svc>

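    Average RTT should be close to baseline + NETWORK_LATENCY. Packet loss should remain at zero; that is the signature that distinguishes latency from loss injection. Prefer a pod IP on the target node as the ping target, because ClusterIP Services often do not answer ICMP. If the fault is scoped with DESTINATION_IPS or DESTINATION_HOSTS, expect the added delay only when the probing pod's address is in scope.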

  2. Watch application latency metrics. Open the dashboard for the affected service and confirm p99 has stepped up by approximately NETWORK_LATENCY. If you do not see the step, traffic is routing around the slow path (for example, in-cluster localhost or sidecar bypass).
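
The RTT check in step 1 can be automated by parsing the summary that `ping` prints. The sketch below is one way to do it, assuming the common Linux summary format (a "% packet loss" line and a min/avg/max line); `check_ping` is a hypothetical helper, and the sample text is illustrative rather than captured from a real run.

```python
import re

_LOSS = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
# Matches "rtt min/avg/max/mdev = a/b/c/d ms" and "round-trip min/avg/max = a/b/c ms".
_RTT = re.compile(r"= ([\d.]+)/([\d.]+)/([\d.]+)")


def check_ping(summary: str, baseline_ms: float, latency_ms: float,
               tolerance: float = 0.1) -> bool:
    """Return True if loss is zero and average RTT is within `tolerance`
    (as a fraction) of baseline + injected latency."""
    loss = _LOSS.search(summary)
    rtt = _RTT.search(summary)
    if loss is None or rtt is None:
        raise ValueError("unrecognized ping summary")
    avg_ms = float(rtt.group(2))
    expected_ms = baseline_ms + latency_ms
    return (float(loss.group(1)) == 0.0
            and abs(avg_ms - expected_ms) <= tolerance * expected_ms)


# Illustrative summary, not output from an actual experiment.
SAMPLE = (
    "20 packets transmitted, 20 received, 0% packet loss, time 19027ms\n"
    "rtt min/avg/max/mdev = 2000.412/2000.913/2001.705/0.321 ms"
)

if __name__ == "__main__":
    print(check_ping(SAMPLE, baseline_ms=0.9, latency_ms=2000.0))
```

A `False` result with zero loss and an average near baseline suggests that the probe traffic is not crossing the shaped path.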


Recovery and cleanup

  • End of duration: When TOTAL_CHAOS_DURATION elapses, the delay configuration is removed automatically and RTT returns to baseline within a few seconds.
  • Backlogged requests drain: Application connection pools, retry queues, and request backlogs that built up during the fault take longer to drain than the fault itself. Plan a post-experiment observation window of two to three minutes before declaring recovery complete.
  • No node reboot or pod eviction: Unlike Node network loss, this fault does not normally cause the controller-manager to mark the node NotReady, so no taint is applied and no pods are evicted by the taint manager.
  • If automated cleanup did not complete: Reboot the affected node (or have your admin reset its network configuration). The fault does not persist across a node reboot.
  • Abort the experiment early: Stop the experiment from Harness Chaos Studio. Cleanup runs before the chaos pod exits.

Limitations

This fault is not appropriate in the following scenarios:

  • Serverless Kubernetes (EKS Fargate, ACI virtual nodes): These platforms do not expose real nodes or allow the privileged access this fault needs. GKE Autopilot is supported once the one-time setup in Chaos on GKE Autopilot is in place.
  • Windows nodes: This fault is supported on Linux nodes only. Use the equivalent Windows network fault on Windows-only workloads.
  • Single-node clusters or co-located chaos infrastructure: If the chaos infrastructure pods live on the slow node, the experiment loses observability. Schedule chaos infrastructure on a node outside the blast radius.
  • Pods using hostNetwork: true: These bypass per-pod network namespaces. Delay is still applied to the node, but observed behavior depends on how the pod uses the host stack.
  • Hardened or stripped kernels: Some custom distribution kernels omit the networking modules this fault depends on, and the fault fails immediately on those nodes. On RHEL-family hosts, install kernel-modules-extra and reboot to restore the missing modules.
  • Applications without explicit timeouts: Without client timeouts, slowness is indistinguishable from a hang and there is no observable failure boundary. Configure timeouts in the application under test before running this fault.

Troubleshooting

Node network latency experiment stays Pending or never starts in Harness Chaos Engineering

Inspect the chaos pods in the experiment namespace with kubectl describe pod -n <chaos-namespace>. The most common causes are taints on the target node, insufficient CPU or memory on the node, or a PodSecurity admission policy blocking privileged pods. Add the required tolerations to the experiment, free resources on the node, or run the experiment in a namespace with privileged Pod Security level.

Node network latency fails immediately on hardened nodes or with an unknown interface

Either the required Linux networking kernel modules are missing on the target node, or the NETWORK_INTERFACE value does not match a real interface on the node. SSH to the node and run ip -br link to list interfaces. On RHEL-family hosts install kernel-modules-extra and reboot to restore the missing modules; otherwise set NETWORK_INTERFACE to the actual interface name (for example ens5 on EKS, eth0 on most others).

Node network latency fault runs but application latency does not increase

Three common causes: (1) the workload routes intra-node over the CNI bridge and never crosses the shaped interface, (2) DESTINATION_HOSTS resolves to IPs not on the shaped path, or (3) the application talks to a sidecar over localhost, which bypasses the shaped path. To verify the delay end to end, ping a pod on the target node from a pod on a different node.

Node flips to NotReady when running node-network-latency at high delay values

At delays comparable to or greater than node-monitor-grace-period (40 to 50 seconds in current Kubernetes), kubelet lease renewals can miss their window and the controller-manager flips the node to NotReady. Use a lower NETWORK_LATENCY value, or scope the delay with DESTINATION_IPS or DESTINATION_HOSTS to the specific dependency you want to slow down so that traffic to the kube-apiserver stays off the shaped path.