Node Status Check

Node Status Check is a built-in Command Probe template that validates the current state of Kubernetes nodes during a chaos experiment. It confirms that the targeted nodes stay healthy and Ready, so you can assert that cluster capacity holds up while a node-level fault is injected. You select nodes by name or by label.

The probe runs the healthchecks utility bundled in the chaos probe image, queries the Kubernetes API, and prints [Pass] when every targeted node is healthy. The comparator marks the probe as passed when the output contains [Pass].

Built-in probe template

This is a built-in Command Probe template that runs on Kubernetes chaos infrastructure. Add it to an experiment from the probe library and customize its inputs. Go to Built-in probe templates to browse the full library, or go to Command probe to understand how command probes work.


Use cases

Use this probe template to:

  • Verify that nodes stay healthy during chaos experiments.
  • Validate node recovery after failures.
  • Monitor cluster health during node-level chaos.
  • Confirm node availability during infrastructure disruptions.

How the probe works

The template configures a Command Probe that runs healthchecks -name node-level. The utility resolves the target nodes from TARGET_NODE, TARGET_NODES, or NODE_LABEL, queries the Kubernetes API, and prints [Pass] when every resolved node is in a healthy state. The comparator passes the probe when the output contains [Pass], and fails it otherwise.
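
As a rough illustration of that flow, the following shell sketch approximates the check with kubectl. It is not the source of the healthchecks utility: the function names are hypothetical, the order in which the selectors are consulted is an assumption, and "healthy" is narrowed here to the node's Ready condition having status True.

```shell
# Sketch only: approximates the node check with kubectl.
# All function names are hypothetical; this is not the healthchecks source.

# Print the names of the nodes to check, one per line.
# Assumption: explicit node names take precedence over NODE_LABEL.
resolve_nodes() {
  names="${TARGET_NODES:-${TARGET_NODE:-}}"
  if [ -n "$names" ]; then
    printf '%s\n' "$names" | tr ',' '\n'
  elif [ -n "${NODE_LABEL:-}" ]; then
    kubectl get nodes -l "$NODE_LABEL" \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'
  fi
}

# Succeed only when the node reports the Ready condition with status True.
node_is_ready() {
  ready="$(kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
  [ "$ready" = "True" ]
}

# Print [Pass] only when at least one node resolved and every node is Ready.
check_nodes() {
  nodes="$(resolve_nodes)"
  if [ -z "$nodes" ]; then
    echo "no nodes matched the selectors" >&2
    return 1
  fi
  for node in $nodes; do
    if ! node_is_ready "$node"; then
      echo "node $node is not Ready" >&2
      return 1
    fi
  done
  echo "[Pass] all targeted nodes are Ready"
}
```

With one of the selector variables exported, calling check_nodes against a cluster gives a quick manual approximation of what the probe asserts. Failure messages go to standard error so that only the [Pass] line reaches the compared output.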


Prerequisites

  • Chaos infrastructure: A Kubernetes chaos infrastructure installed in the target cluster.
  • Node access: Access to the cluster nodes you want to check.
  • RBAC permissions: The chaos service account needs get and list permissions on nodes. Nodes are cluster-scoped, so grant these through a ClusterRole and ClusterRoleBinding.

Probe properties

Command

healthchecks -name node-level

Comparator

| Type | Criteria | Value |
| --- | --- | --- |
| string | contains | [Pass] |

The probe passes when the command output contains [Pass], which indicates that every targeted node is in a healthy state.
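
To illustrate the comparison, here is a minimal shell sketch of a contains check. The function name is hypothetical, and the sketch assumes the comparator does a literal substring match, so the square brackets in [Pass] are not interpreted as a pattern.

```shell
# Sketch of a "contains" comparison against the literal text [Pass].
# Assumption: the comparator does a plain substring match, not a regex match.
output_contains_pass() {
  case "$1" in
    *"[Pass]"*) return 0 ;;  # quoted, so the brackets are matched literally
    *) return 1 ;;
  esac
}
```

Under that assumption, output such as a bare "Pass" without the brackets would not satisfy the comparator.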

Environment variables

| Variable | Description | Required | Default |
| --- | --- | --- | --- |
| TARGET_NODE | Comma-separated list of nodes to check. Provide one of TARGET_NODE, TARGET_NODES, or NODE_LABEL. | Conditional | - |
| TARGET_NODES | Comma-separated list of nodes to check. Provide one of TARGET_NODE, TARGET_NODES, or NODE_LABEL. | Conditional | - |
| NODE_LABEL | Node label used to list nodes to check (for example, node-role.kubernetes.io/worker=). Provide one of TARGET_NODE, TARGET_NODES, or NODE_LABEL. | Conditional | - |
| STATUS_CHECK_TIMEOUT | Maximum time in seconds to wait for the status check. | No | 180 |
| STATUS_CHECK_DELAY | Delay in seconds between status checks. | No | 2 |

Node selection

Provide at least one of TARGET_NODE, TARGET_NODES, or NODE_LABEL.


Run properties

| Property | Description | Type | Default |
| --- | --- | --- | --- |
| timeout | Maximum time to wait for the probe to complete (for example, 30s, 1m, 5m). | String | 180s |
| interval | Time between probe executions (for example, 1s, 5s, 10s). | String | 1s |
| attempt | Number of retry attempts before the probe is marked as failed. | Integer | 1 |
| pollingInterval | Time between retry attempts (for example, 1s, 5s, 10s). | String | - |
| initialDelay | Initial delay before the probe starts (for example, 0s, 10s, 30s). | String | - |
| stopOnFailure | Stop the experiment if the probe fails. | Boolean | false |
| verbosity | Log verbosity level (info, debug, trace). | String | - |

Troubleshooting

Node Status Check probe fails because no nodes matched the target

The selectors did not resolve any nodes. Confirm that the values in TARGET_NODE or TARGET_NODES match node names exactly, or that NODE_LABEL matches a label applied to the nodes. An empty match is treated as a failure.
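
One way to check the selectors before rerunning the experiment is to ask the cluster directly. The helper names below are hypothetical; the kubectl commands they wrap are standard.

```shell
# Sketch: list the nodes a selector would resolve to before running the probe.
# The function names are hypothetical helpers.

# Show every node with its labels, to compare against NODE_LABEL.
show_node_labels() {
  kubectl get nodes --show-labels
}

# List the nodes matched by a label selector such as
# node-role.kubernetes.io/worker= ; an empty list means the probe has no targets.
nodes_for_label() {
  kubectl get nodes -l "$1" -o name
}

# Check that each comma-separated node name exists exactly as written.
nodes_exist() {
  for name in $(printf '%s' "$1" | tr ',' ' '); do
    kubectl get node "$name" -o name >/dev/null || return 1
  done
}
```

For example, nodes_exist "$TARGET_NODES" returns a non-zero status as soon as one name is not found, which usually points to a typo or a node that has been replaced.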

Node Status Check probe fails with a forbidden or RBAC error

The chaos service account does not have permission to read nodes. Nodes are cluster-scoped, so grant get and list on nodes through a ClusterRole and ClusterRoleBinding for the chaos service account, then rerun the experiment.
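
A minimal sketch of that grant with kubectl follows. The role and binding name chaos-node-reader is made up for illustration, and the namespace and service account are arguments because they depend on how the chaos infrastructure was installed.

```shell
# Sketch: grant read access on nodes to the chaos service account.
# Assumptions: the role and binding name "chaos-node-reader" is made up,
# and the namespace and service account names are supplied by the caller.
grant_node_read() {
  ns="$1"
  sa="$2"
  kubectl create clusterrole chaos-node-reader \
    --verb=get,list --resource=nodes &&
  kubectl create clusterrolebinding chaos-node-reader \
    --clusterrole=chaos-node-reader --serviceaccount="$ns:$sa"
}

# Ask the API server whether the service account can list nodes.
can_list_nodes() {
  kubectl auth can-i list nodes --as="system:serviceaccount:$1:$2"
}
```

Running can_list_nodes with the same namespace and service account before and after the grant confirms whether the permission took effect.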

Node Status Check probe times out before the node becomes Ready

The node did not return to a Ready state within STATUS_CHECK_TIMEOUT. Increase STATUS_CHECK_TIMEOUT and the run-property timeout, and inspect the node with kubectl describe node to find NotReady conditions, taints, or kubelet issues.
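
The inspection step could look like the sketch below, which prints the node's conditions and taints before the full description. The function name is a hypothetical helper.

```shell
# Sketch: summarize why a node might not be Ready.
# The function name is a hypothetical helper.
inspect_node() {
  # One line per condition, in the form Type=Status.
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'
  # Taints that may keep workloads off the node.
  kubectl get node "$1" -o jsonpath='{.spec.taints}{"\n"}'
  # Full detail, including recent events and condition messages.
  kubectl describe node "$1"
}
```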