Node restart

Last updated on Dec 4, 2025

Node restart disrupts the state of the node by restarting it.

Use cases

Node restart fault:

Helps understand how an application behaves when a node is rebooted in a cluster.
Determines the deployment sanity (replica availability and uninterrupted service) and recovery workflows of the application pod in the event of an unexpected node restart.
Simulates loss of critical services (or node-crash).
Verifies resource budgeting on cluster nodes (whether request or limit settings are honored on available nodes).
Determines whether topology constraints are adhered to (node selectors, tolerations, zone distribution, affinity or anti-affinity policies).

Permissions required

Below is a sample Kubernetes role that defines the permissions required to execute the fault.

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: hce
  name: node-restart
spec:
  definition:
    scope: Cluster
permissions:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["create", "delete", "get", "list", "patch", "deletecollection", "update"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "get", "list", "patch", "update"]
  - apiGroups: ["litmuschaos.io"]
    resources: ["chaosEngines", "chaosExperiments", "chaosResults"]
    verbs: ["create", "delete", "get", "list", "patch", "update"]
  - apiGroups: [""]
    resources: ["pods/log"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["pods/exec"]
    verbs: ["get", "list", "create"]
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "delete", "get", "list", "deletecollection"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list"]

Prerequisites

Kubernetes > 1.16
Create a Kubernetes secret named id-rsa in the namespace where the fault will be executed. The secret field ssh-privatekey must contain the private SSH key of SSH_USER, which is used to connect to the node that hosts the target pod.
- Below is a sample secret file:
```
apiVersion: v1
kind: Secret
metadata:
  name: id-rsa
type: kubernetes.io/ssh-auth
stringData:
  ssh-privatekey: |-
    # SSH private key for ssh contained here
```
For those already familiar with an SSH client, the steps to create an RSA key pair for remote SSH access are summarized below.

Create a new key pair and store the private and public keys in files named my-id-rsa-key and my-id-rsa-key.pub, respectively:

ssh-keygen -f ~/my-id-rsa-key -t rsa -b 4096

For each available node, run the following command to copy the public key in my-id-rsa-key.pub to that node:

ssh-copy-id -i ~/my-id-rsa-key.pub user@node

For further details, refer to this documentation. After copying the public key to all nodes and creating the secret, you are all set to execute the fault.

The target nodes should be in the Ready state before and after injecting chaos.

Supported environments

Platform	Support Status
GKE (Google Kubernetes Engine)	✅ Supported
EKS (Amazon Elastic Kubernetes Service)	✅ Supported
AKS (Azure Kubernetes Service)	✅ Supported
GKE Autopilot	✅ Supported
Self-managed Kubernetes	✅ Supported

Recommended alternatives for managed Kubernetes

For managed Kubernetes services, consider using cloud-specific VM stop faults instead of node-restart:

EKS: Use ec2-stop-by-id or ec2-stop-by-tag
GKE: Use gcp-vm-instance-stop
AKS: Use azure-instance-stop

These alternatives don't require SSH access and work directly with the cloud provider APIs.

Mandatory tunables

Tunable	Description	Notes
TARGET_NODE	Name of the target node subject to chaos. If this is not provided, a random node is selected.	For more information, go to target node.
NODE_LABEL	Node label used to filter the target nodes.	It is mutually exclusive with the `TARGET_NODE` environment variable. If both are provided, `TARGET_NODE` takes precedence. For more information, go to target node with labels.
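
As an illustration of label-based targeting, the following is a minimal sketch of a ChaosEngine that sets NODE_LABEL and leaves out the node-name tunable. The engine layout mirrors the other snippets on this page. The `key=value` format and the label value shown are assumptions made for illustration, so replace them with a label that exists on your nodes.

```yaml
# select the target node by label instead of by name
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: node-restart
    spec:
      components:
        env:
        # label used to filter the target nodes
        # (hypothetical value, assumed to be in key=value form)
        - name: NODE_LABEL
          value: 'kubernetes.io/hostname=node01'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
```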

Optional tunables

Tunable	Description	Notes
LIB_IMAGE	Image used by the helper pod that runs the fault.	Default: `harness/chaos-go-runner:main-latest`. For more information, go to image used by the helper pod.
SSH_USER	Name of the SSH user.	Default: `root`. For more information, go to SSH user.
TARGET_NODE_IP	Internal IP of the target node subject to chaos. If not provided, the fault uses the node IP of the `TARGET_NODE`.	Default: empty. For more information, go to target node internal IP.
REBOOT_COMMAND	Command used to reboot.	Default: `sudo systemctl reboot`. For more information, go to reboot command.
TOTAL_CHAOS_DURATION	Duration for which chaos is injected into the target resource (in seconds).	Default: 120 s. For more information, go to duration of the chaos.
RAMP_TIME	Period to wait before and after injecting chaos (in seconds).	For example, 30 s. For more information, go to ramp time.
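
The optional tunables that have no dedicated snippet on this page can be set in the same way as the others. The following is a minimal sketch that combines RAMP_TIME and LIB_IMAGE, assuming the same engine layout used elsewhere on this page. The image value is the documented default and the ramp time of 30 seconds is the example value from the table, not a recommendation.

```yaml
# set the ramp time and the image alongside the target node
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: node-restart
    spec:
      components:
        env:
        # period to wait before and after injecting chaos (in seconds)
        - name: RAMP_TIME
          value: '30'
        # image named by LIB_IMAGE (documented default shown)
        - name: LIB_IMAGE
          value: 'harness/chaos-go-runner:main-latest'
        # name of the target node
        - name: TARGET_NODE
          value: 'node01'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
```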

Reboot command

Command used to restart the target node. It defaults to sudo systemctl reboot. Tune it by using the REBOOT_COMMAND environment variable.

The following YAML snippet illustrates the use of this environment variable:

# provide the reboot command
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: node-restart
    spec:
      components:
        env:
        # command used for the reboot
        - name: REBOOT_COMMAND
          value: 'sudo systemctl reboot'
        # name of the target node
        - name: TARGET_NODE
          value: 'node01'
        - name: TOTAL_CHAOS_DURATION
          value: '60'

SSH user

Name of the SSH user for the target node. Tune it by using the SSH_USER environment variable.

The following YAML snippet illustrates the use of this environment variable:

# name of the ssh user used to ssh into targeted node
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: node-restart
    spec:
      components:
        env:
        # name of the ssh user
        - name: SSH_USER
          value: 'root'
        # name of the target node
        - name: TARGET_NODE
          value: 'node01'
        - name: TOTAL_CHAOS_DURATION
          value: '60'

Target node internal IP

Internal IP of the target node (optional). If it is not provided, the fault uses the node IP of the node named in TARGET_NODE. Tune it by using the TARGET_NODE_IP environment variable.

The following YAML snippet illustrates the use of this environment variable:

# internal ip of the targeted node
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: node-restart
    spec:
      components:
        env:
        # internal ip of the targeted node
        - name: TARGET_NODE_IP
          value: '10.0.170.92'
        # name of the target node
        - name: TARGET_NODE
          value: 'node01'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
