Pod network partition

Pod network partition is a Kubernetes pod-level chaos fault that isolates a target pod by creating a temporary NetworkPolicy that blocks ingress, egress, or both for a configurable duration. The partition is enforced by the cluster's CNI at the dataplane, so this fault has lighter prerequisites than the other pod network faults: it works wherever the cluster CNI enforces NetworkPolicy (Calico, Cilium, AWS VPC CNI in policy mode, Azure CNI, etc.). When the fault ends, the policy is deleted and connectivity is restored as soon as the CNI reconciles the change.

Use this fault to test how a service behaves when it is fully cut off from a specific peer or dependency: a network partition between leader and follower, a failed gateway, a split namespace, or an annotated dependency you want to isolate without changing application code.

Run your first experiment

If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.


Use cases

Run this fault when you want to answer concrete questions like:

  • Leader-follower partition: Cut off the leader of a stateful service from its followers. Does the follower set elect a new leader within the configured timeout? Does the original leader step down cleanly when the partition heals?
  • Cross-namespace isolation: Block one workload from a specific peer namespace and confirm the workload either fails fast or degrades to a documented fallback (cache, default value, error response).
  • Dependency cutoff: Isolate a workload from its database, message bus, or external API and confirm the circuit breaker engages, the cache serves stale data, and callers are not exposed to long timeouts.
  • Probe and ejection behavior: Does the pod turn NotReady when its readiness endpoint depends on a destination that is now unreachable? (Kubelet probe traffic itself is usually not blocked by NetworkPolicy.) Does the service mesh outlier-detection eject the pod from upstream pools within the SLO?

Prerequisites

  • Kubernetes version: 1.21 or later. Go to What's supported to confirm distribution support.
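  • CNI with NetworkPolicy enforcement: The cluster's CNI plugin must enforce NetworkPolicy. Plain Flannel and the default AWS VPC CNI in non-policy mode do not. Running kubectl api-resources | grep networkpolicy only confirms that the NetworkPolicy API is served, which is true on every cluster; confirm enforcement from your CNI's configuration or documentation.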
  • Target pods are Running: The application pods you intend to target are in the Running state before the fault is launched.
  • Workload selector defined: The chaos experiment knows the target workload (Deployment, StatefulSet, etc.) by kind, namespace, and either names or labels.
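
The version prerequisite above can be checked with a small preflight helper. This is a minimal sketch, not part of the product: the function name k8s_version_supported is hypothetical, and the usage line assumes jq is installed.

```shell
# Returns 0 when the given Kubernetes version string is 1.21 or later.
# Accepts forms such as "v1.27.3" or "1.21.0-eks-1234" (vendor suffixes are ignored).
k8s_version_supported() {
  local v="${1#v}"            # strip the leading "v", if any
  local major="${v%%.*}"      # text before the first dot
  local rest="${v#*.}"        # text after the first dot
  local minor="${rest%%.*}"   # text between the first and second dot
  minor="${minor%%[!0-9]*}"   # drop non-numeric suffixes such as "+"
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 21 ]; }
}
```

A possible invocation against a live cluster: k8s_version_supported "$(kubectl version -o json | jq -r '.serverVersion.gitVersion')" && echo "version OK".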

Supported environments

| Platform | Support status |
| --- | --- |
| Amazon EKS (with Calico, Cilium, or VPC CNI in policy mode) | Supported |
| Azure AKS | Supported |
| Google GKE | Supported |
| Red Hat OpenShift | Supported |
| Rancher | Supported |
| VMware Tanzu | Supported |
| Self-managed Kubernetes with policy-enforcing CNI | Supported |
| GKE Autopilot | Supported with Autopilot setup |
| Clusters with no NetworkPolicy enforcement | Not supported (the policy is created but has no effect) |

Permissions required

The fault runs under the chaos infrastructure's service account.

| Resource (apiGroup) | Verbs | Why it is needed |
| --- | --- | --- |
| networkpolicies (networking.k8s.io) | get, list, create, delete, patch | Create and delete the chaos NetworkPolicy in the target namespace |
| pods ("") | get, list | Discover target pods and confirm they are Running |
| deployments, statefulsets, replicasets, daemonsets (apps) | get, list | Resolve the target workload to the pods it owns |
| events ("") | get, list, create, patch, update | Record fault progress as Kubernetes events |
| jobs (batch) | get, list, create, delete, deletecollection | Run the chaos job that drives the fault |

The default Harness chaos infrastructure service account already includes these permissions.
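
If you use a custom service account, the permissions table can be audited with kubectl auth can-i. The sketch below is illustrative only: check_chaos_rbac is a hypothetical helper, and it checks every verb against a single namespace for simplicity, even though the source does not say in which namespace the chaos job runs.

```shell
# Checks every resource/verb pair from the permissions table for one service account.
# Usage: check_chaos_rbac <chaos-ns> <sa-name> <target-ns>
check_chaos_rbac() {
  local subject="system:serviceaccount:$1:$2" target_ns="$3" missing=0
  local entry resource verbs verb
  for entry in \
    "networkpolicies.networking.k8s.io:get list create delete patch" \
    "pods:get list" \
    "deployments.apps:get list" \
    "statefulsets.apps:get list" \
    "replicasets.apps:get list" \
    "daemonsets.apps:get list" \
    "events:get list create patch update" \
    "jobs.batch:get list create delete deletecollection"
  do
    resource="${entry%%:*}"   # part before the colon
    verbs="${entry#*:}"       # space-separated verbs after the colon
    for verb in $verbs; do
      # "kubectl auth can-i" exits non-zero when the answer is "no".
      if ! kubectl auth can-i "$verb" "$resource" --as="$subject" -n "$target_ns" >/dev/null 2>&1; then
        echo "missing: $verb $resource"
        missing=1
      fi
    done
  done
  return "$missing"
}
```

The function prints one line per missing permission and returns non-zero if any are missing, so it can gate an experiment in a pipeline. Impersonation with --as requires that your own user is allowed to impersonate service accounts.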


Fault tunables

Configure the following fault parameters when you add Pod network partition to an experiment in Chaos Studio. Defaults are shown for reference.

Chaos parameters

| Tunable | Description | Default |
| --- | --- | --- |
| POLICY_TYPES | Which traffic direction to block. One of ingress, egress, or all. | all |
| TOTAL_CHAOS_DURATION | Duration of the fault in seconds. | 60 |

Partition scope (what is blocked)

If you set none of the following, the partition blocks all traffic in the chosen direction. Combine them to narrow the partition.

| Tunable | Description | Default |
| --- | --- | --- |
| DESTINATION_IPS | Comma-separated CIDRs to block (for example, 10.0.0.0/16,192.168.1.0/24). | "" |
| DESTINATION_HOSTS | Comma-separated hostnames to block. The helper resolves them and adds the resolved IPs to the policy. | "" |
| POD_SELECTOR | Label selector matching peer pods to block, in key=value,key=value form. | "" |
| NAMESPACE_SELECTOR | Label selector matching peer namespaces to block, in key=value,key=value form. | "" |
| PORTS | Comma-separated ports (or port/protocol pairs) the policy should restrict to. | "" |
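
Standard NetworkPolicy rules are allow-only, so blocking specific CIDRs is usually expressed as "allow everything except these ranges". The sketch below shows one way a DESTINATION_IPS value could map to such a policy. It is an assumption for illustration: the source does not document the exact manifest the fault generates, and render_cidr_partition and the policy name are hypothetical.

```shell
# Renders an egress NetworkPolicy that allows everything except the given CIDRs.
# Usage: render_cidr_partition <namespace> <pod-label key=value> <cidr,cidr,...>
render_cidr_partition() {
  local ns="$1" key="${2%%=*}" value="${2#*=}" cidrs="$3" cidr
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-cidr-partition   # hypothetical name
  namespace: ${ns}
spec:
  podSelector:
    matchLabels:
      ${key}: "${value}"
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
EOF
  # Emit one "except" entry per comma-separated CIDR.
  local IFS=','
  for cidr in $cidrs; do
    printf '        - %s\n' "$cidr"
  done
}
```

For example, render_cidr_partition demo app=checkout 10.0.0.0/16,192.168.1.0/24 | kubectl apply --dry-run=client -f - validates the manifest without creating it. The namespace and label here are placeholders.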

Runtime and helper

| Tunable | Description | Default |
| --- | --- | --- |
| RAMP_TIME | Wait period in seconds before and after the fault. Go to ramp time to read how it is applied. | 0 |

Tunables that apply to every chaos fault are documented in common tunables for all faults.

Empty partition scope blocks everything

With no DESTINATION_*, POD_SELECTOR, or NAMESPACE_SELECTOR set, the chaos NetworkPolicy denies all traffic in the chosen direction, including DNS to kube-system. If you only want to cut off a single dependency, always set at least one scope tunable.


Fault execution in brief

Creates a Kubernetes NetworkPolicy in the target pod's namespace that denies ingress, egress, or both for the configured duration, optionally scoped to specific destinations, namespaces, pod selectors, or ports so other traffic remains unaffected.
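
For the unscoped case, a policy that selects the target pod, lists a direction in policyTypes, and defines no allow rules denies all traffic in that direction. The following sketch renders such a manifest from a POLICY_TYPES value. It approximates what the fault might create rather than reproducing its actual output: render_full_partition, the policy name, and the pod label are hypothetical, and only the harness.io/chaos label is taken from the verification commands in this page.

```shell
# Renders a NetworkPolicy with no allow rules for the chosen direction(s).
# Usage: render_full_partition <namespace> <pod-label key=value> <ingress|egress|all>
render_full_partition() {
  local ns="$1" key="${2%%=*}" value="${2#*=}" types
  case "$3" in
    ingress) types='  - Ingress' ;;
    egress)  types='  - Egress' ;;
    all)     types='  - Ingress
  - Egress' ;;
    *) echo "POLICY_TYPES must be ingress, egress, or all" >&2; return 1 ;;
  esac
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-full-partition   # hypothetical name
  namespace: ${ns}
  labels:
    harness.io/chaos: pod-network-partition
spec:
  podSelector:
    matchLabels:
      ${key}: "${value}"
  policyTypes:
${types}
EOF
}
```

Piping the output to kubectl apply --dry-run=client -f - is a safe way to inspect the result before experimenting on a real workload.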


Expected behavior during fault execution

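  • TCP connections in the blocked direction stop progressing. Existing sockets stall until their retransmission, tcp_user_timeout, or OS keepalive limits expire (often 5 to 15 minutes). New connection attempts usually hang until the client's connect timeout, because most CNIs silently drop denied packets; a CNI configured to reject denied traffic returns connection refused or host unreachable immediately instead.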
  • UDP traffic (DNS, QUIC) drops silently. If you blocked egress without explicitly allowing DNS, name resolution fails.
  • The target pod can still reach anything not covered by the policy. With POD_SELECTOR=app=foo, only pods with that label are blocked.
  • Health probes from the kubelet still work (kubelet talks to the pod over the node's network, which is not subject to pod-level NetworkPolicy).
  • Service meshes that use sidecar-to-sidecar TCP detect the failures and may eject the pod from upstream pools.

When the fault ends

The NetworkPolicy is deleted, and connectivity is restored as fast as the CNI can reconcile (typically within one second). TCP connections that timed out during the partition do not automatically reconnect; the application must handle that itself.
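
Because timed-out connections are not re-established for you, clients typically retry with backoff after the partition heals. The helper below is a generic sketch of that pattern in shell, useful for post-fault smoke checks; retry_with_backoff is a hypothetical name, and real applications would implement the equivalent in their own client library.

```shell
# Retries a command with exponential backoff until it succeeds.
# Usage: retry_with_backoff <max-attempts> <initial-delay-seconds> <command...>
retry_with_backoff() {
  local max="$1" delay="$2" attempt=1
  shift 2
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))        # double the wait before the next attempt
    attempt=$((attempt + 1))
  done
}
```

For example, retry_with_backoff 6 1 nc -z -w 2 <blocked-host> <port> waits 1, 2, 4, 8, and 16 seconds between attempts before giving up.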

Signals to watch

Attach resilience probes to assert each layer:

  • Application failover: Use an HTTP probe against the dependency to confirm the partition is in effect, and a second HTTP probe against the application's user-facing endpoint to confirm graceful degradation.
  • Cluster events: Use a Kubernetes probe to confirm the NetworkPolicy was created and deleted as expected.
  • Mesh ejection: Use a Prometheus probe on Istio/Envoy cluster.outlier_detection.ejections_active or your mesh's equivalent.

Verify the fault execution effect

While the experiment is running, confirm that the partition is actually in effect:

  1. Inspect the chaos NetworkPolicy.

    kubectl get networkpolicy -n <namespace> -l harness.io/chaos=pod-network-partition -o yaml

    You should see the policy with policyTypes and the rules you configured.

  2. Test connectivity from the target pod.

    kubectl exec -n <namespace> <pod-name> -- nc -zv -w 2 <blocked-host> <port>

    The connection should fail. For an unblocked destination, the same command should succeed.
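
The two connectivity checks in step 2 can be combined into one assertion. This is a sketch under the same assumption as the command above, namely that nc is available in the target container; assert_partition is a hypothetical helper name.

```shell
# Succeeds only when the blocked endpoint is unreachable and the control endpoint is reachable.
# Usage: assert_partition <namespace> <pod> <blocked-host:port> <allowed-host:port>
assert_partition() {
  local ns="$1" pod="$2" blocked="$3" allowed="$4"
  # The blocked destination must NOT accept a connection.
  if kubectl exec -n "$ns" "$pod" -- nc -z -w 2 "${blocked%:*}" "${blocked##*:}" >/dev/null 2>&1; then
    echo "FAIL: $blocked is still reachable"
    return 1
  fi
  # A destination outside the partition scope must still accept a connection.
  if ! kubectl exec -n "$ns" "$pod" -- nc -z -w 2 "${allowed%:*}" "${allowed##*:}" >/dev/null 2>&1; then
    echo "FAIL: $allowed is unreachable, so the partition is wider than intended"
    return 1
  fi
  echo "OK: $blocked blocked, $allowed reachable"
}
```

Checking an unblocked destination alongside the blocked one helps distinguish a scoped partition from an accidental block-all policy.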


Recovery and cleanup

  • End of duration: The chaos NetworkPolicy is deleted automatically. The CNI reconciles within seconds and connectivity returns.

  • Abort the experiment: Stopping the experiment from Chaos Studio triggers the same cleanup path.

  • Failed cleanup: If automated cleanup did not complete, delete the chaos NetworkPolicy manually:

    kubectl delete networkpolicy -n <namespace> -l harness.io/chaos=pod-network-partition

  • Stuck connections: TCP sockets that timed out during the partition stay closed; the application reconnects on its own schedule.


Limitations

This fault is not appropriate in the following scenarios:

  • CNI plugins without NetworkPolicy enforcement: Plain Flannel, default AWS VPC CNI (without policy mode), and a few others do not enforce policies. The fault creates the policy but it has no effect.
  • hostNetwork pods: Pods using the host network namespace are not subject to namespaced NetworkPolicy. The fault refuses to inject on hostNetwork: true pods.
  • Cluster-scoped resources: This fault works at the pod level. To partition a node from the rest of the cluster, use Node network loss with NETWORK_PACKET_LOSS_PERCENTAGE: 100.
  • kubelet-to-pod traffic: NetworkPolicy does not apply to kubelet probe traffic in most CNI implementations, so liveness and readiness probes still work even under "block all" rules. To break probes, target a different layer.

Troubleshooting

Pod network partition created but the target pod can still reach the blocked destination

The cluster CNI does not enforce NetworkPolicy. Verify with kubectl describe pod -n kube-system -l k8s-app=<cni> or your CNI's docs. Plain Flannel and the default AWS VPC CNI in non-policy mode do not enforce policies. Switch to Calico, Cilium, or enable AWS VPC CNI policy mode.

Pod network partition blocks more traffic than expected in Harness Chaos Engineering

With no DESTINATION_*, POD_SELECTOR, or NAMESPACE_SELECTOR set, the chaos NetworkPolicy denies all traffic in the chosen direction, including DNS to kube-system. Either set scope tunables to narrow the partition, or add an explicit allow rule outside this fault for DNS (UDP/53 to kube-system pods).
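
Because standard NetworkPolicy rules are additive, a separate allow policy for DNS can coexist with the chaos policy. The sketch below renders one such policy. It assumes cluster DNS runs in kube-system and that namespaces carry the kubernetes.io/metadata.name label, which Kubernetes sets automatically on 1.21 and later; render_dns_allow and the policy name are hypothetical.

```shell
# Renders an egress policy that allows DNS (UDP and TCP port 53) to kube-system.
# Usage: render_dns_allow <namespace> <pod-label key=value>
render_dns_allow() {
  local ns="$1" key="${2%%=*}" value="${2#*=}"
  cat <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: example-allow-dns   # hypothetical name
  namespace: ${ns}
spec:
  podSelector:
    matchLabels:
      ${key}: "${value}"
  policyTypes:
  - Egress
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: kube-system
    ports:
    - protocol: UDP
      port: 53
    - protocol: TCP
      port: 53
EOF
}
```

TCP port 53 is included because resolvers fall back to TCP for large responses. Clusters that use a node-local DNS cache or run DNS outside kube-system need a different peer selector.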

NetworkPolicy not deleted after pod-network-partition ends

Automated cleanup did not complete. Delete the chaos NetworkPolicy manually with kubectl delete networkpolicy -n <namespace> -l harness.io/chaos=pod-network-partition. Verify connectivity returns with a test from inside the target pod.

Permission denied creating NetworkPolicy for pod-network-partition

The chaos service account lacks the create verb on networkpolicies in the networking.k8s.io group for the target namespace. Verify with kubectl auth can-i create networkpolicies --as=system:serviceaccount:<chaos-ns>:<sa-name> -n <target-ns>. Grant the missing role binding.
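
One way to grant the missing permission is a namespaced Role and RoleBinding created with kubectl. This is a minimal sketch: grant_netpol_role and the role name chaos-netpol-editor are hypothetical, and your organization may prefer to manage RBAC declaratively.

```shell
# Grants NetworkPolicy permissions in the target namespace to the chaos service account.
# Usage: grant_netpol_role <chaos-ns> <sa-name> <target-ns>
grant_netpol_role() {
  local chaos_ns="$1" sa="$2" target_ns="$3"
  # Role with the NetworkPolicy verbs listed in the permissions table.
  kubectl create role chaos-netpol-editor -n "$target_ns" \
    --verb=get,list,create,delete,patch \
    --resource=networkpolicies.networking.k8s.io &&
  # Bind the role to the service account that lives in the chaos namespace.
  kubectl create rolebinding chaos-netpol-editor -n "$target_ns" \
    --role=chaos-netpol-editor \
    --serviceaccount="${chaos_ns}:${sa}"
}
```

After running it, repeat the kubectl auth can-i check above to confirm that the answer is now yes.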