Azure AKS node down

The Azure AKS node down fault deallocates nodes in an Azure Kubernetes Service (AKS) cluster for a specified chaos duration.

  • It helps to check the resilience of your applications when AKS nodes become unavailable.
  • It targets VMSS (Virtual Machine Scale Set) instances in the AKS node pools and temporarily deallocates them.
  • You can filter target nodes by node pool name, availability zone, and percentage of nodes to affect.

Use cases

Azure AKS node down:

  • Determines the resilience of applications when AKS cluster nodes become unavailable.
  • Validates that workloads are properly distributed across nodes and can handle node failures gracefully.
  • Tests the behavior of Kubernetes scheduling and auto-scaling when nodes are deallocated.
  • Simulates availability zone (AZ) failures by targeting nodes in specific zones.
  • Verifies that critical applications have proper pod disruption budgets and replica counts.
  • Validates that monitoring and alerting systems detect node failures.
  • Ensures that stateful applications handle node loss without data corruption.
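
As an illustration of the workload settings these use cases exercise, here is a minimal sketch of a Deployment spread across zones plus a PodDisruptionBudget. The names (web, web-pdb), the image, and the replica counts are hypothetical and not part of the fault. Note that a PodDisruptionBudget only limits voluntary evictions such as node drains; a deallocated node is an involuntary disruption, so replica count and pod spread are what keep the application available during this fault. The policy/v1 API assumes Kubernetes 1.21 or later, and topology spread constraints assume 1.19 or later.

```yaml
# Hypothetical workload: three replicas spread across availability zones
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
      # keep the per-zone pod count within 1 of each other where possible
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app: web
      containers:
      - name: web
        image: nginx:1.25
---
# Hypothetical budget: keep at least two pods during voluntary evictions
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```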

Prerequisites

  • Kubernetes >= 1.17
  • Azure authentication configured for chaos faults. Refer to Azure authentication methods for setup instructions.
  • The target AKS cluster should be in a running state before chaos injection.

Required Azure permissions

The service principal needs the following permissions:

  • Reader role on the AKS cluster's resource group
  • Virtual Machine Contributor role on the AKS cluster's node resource group (auto-generated resource group containing VMSS instances)
  • Alternatively, a custom role with these permissions:
    • Microsoft.ContainerService/managedClusters/read (on AKS cluster resource group)
    • Microsoft.Compute/virtualMachineScaleSets/read (on node resource group)
    • Microsoft.Compute/virtualMachineScaleSets/virtualMachines/read (on node resource group)
    • Microsoft.Compute/virtualMachineScaleSets/virtualMachines/deallocate/action (on node resource group)
    • Microsoft.Compute/virtualMachineScaleSets/virtualMachines/powerOff/action (on node resource group; needed for VMs with ephemeral OS disks)
    • Microsoft.Compute/virtualMachineScaleSets/virtualMachines/start/action (on node resource group)

Mandatory tunables

| Tunable | Description | Notes |
|---|---|---|
| AKS_CLUSTER_NAME | Name of the Azure Kubernetes Service (AKS) cluster. | For example, my-aks-cluster. For more information, go to AKS cluster name. |
| AKS_RESOURCE_GROUP | Resource group of the AKS cluster. | For example, rg-aks-cluster. For more information, go to resource group field in the YAML file. |

Optional tunables

| Tunable | Description | Notes |
|---|---|---|
| TOTAL_CHAOS_DURATION | Duration for which chaos is injected into the target resource (in seconds). | Defaults to 30s. For more information, go to duration of the chaos. |
| CHAOS_INTERVAL | Time interval between successive chaos iterations (in seconds). | Defaults to 30s. For more information, go to chaos interval. |
| TARGET_NODE_POOL_NAMES | Comma-separated list of node pool names to target. | Empty means all node pools. For example, nodepool1,nodepool2. For more information, go to target node pools. |
| TARGET_ZONES | Comma-separated list of availability zones to target. | Empty means all zones. For example, 1,2,3. For more information, go to target zones. |
| NODE_AFFECTED_PERCENTAGE | Percentage of nodes to affect. | Defaults to 0 (corresponds to 1 instance). For more information, go to node affected percentage. |
| SEQUENCE | Sequence of chaos execution for multiple nodes. | Defaults to parallel. Also supports serial sequence. For more information, go to sequence of chaos execution. |
| RAMP_TIME | Period to wait before and after injecting chaos (in seconds). | For example, 30s. For more information, go to ramp time. |
| DEFAULT_HEALTH_CHECK | Whether to run the default health check built into the fault. | Default: 'false'. For more information, go to default health check. |

Deallocate AKS nodes

It deallocates AKS cluster nodes for a specific chaos duration. Tune it by using the AKS_CLUSTER_NAME, AKS_RESOURCE_GROUP, and NODE_AFFECTED_PERCENTAGE environment variables.

Use the following example to tune it:

# deallocate AKS nodes for a certain chaos duration
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: azure-aks-node-down
    spec:
      components:
        env:
        # name of the AKS cluster
        - name: AKS_CLUSTER_NAME
          value: 'my-aks-cluster'
        # resource group of the AKS cluster
        - name: AKS_RESOURCE_GROUP
          value: 'rg-aks-cluster'
        # percentage of nodes to affect
        - name: NODE_AFFECTED_PERCENTAGE
          value: '100'
        - name: TOTAL_CHAOS_DURATION
          value: '60'

Target specific node pools

It targets nodes from specific node pools in the AKS cluster. Tune it by using the TARGET_NODE_POOL_NAMES environment variable.

Use the following example to tune it:

# target specific node pools in AKS cluster
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: azure-aks-node-down
    spec:
      components:
        env:
        # name of the AKS cluster
        - name: AKS_CLUSTER_NAME
          value: 'my-aks-cluster'
        # resource group of the AKS cluster
        - name: AKS_RESOURCE_GROUP
          value: 'rg-aks-cluster'
        # comma-separated list of node pool names to target
        - name: TARGET_NODE_POOL_NAMES
          value: 'nodepool1,nodepool2'
        # percentage of nodes to affect
        - name: NODE_AFFECTED_PERCENTAGE
          value: '50'
        - name: TOTAL_CHAOS_DURATION
          value: '60'

Target nodes by availability zone

It targets nodes from specific availability zones in the AKS cluster. Tune it by using the TARGET_ZONES environment variable.

Use the following example to tune it:

# target nodes in specific availability zones
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: azure-aks-node-down
    spec:
      components:
        env:
        # name of the AKS cluster
        - name: AKS_CLUSTER_NAME
          value: 'my-aks-cluster'
        # resource group of the AKS cluster
        - name: AKS_RESOURCE_GROUP
          value: 'rg-aks-cluster'
        # comma-separated list of availability zones to target
        - name: TARGET_ZONES
          value: '1,2'
        # percentage of nodes to affect
        - name: NODE_AFFECTED_PERCENTAGE
          value: '50'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
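
The node pool and zone filters can presumably be set in the same engine to narrow the blast radius, for example to simulate losing one zone of a single pool. The sketch below assumes the two filters are applied together as an intersection; the source lists both filters but does not state how they combine, so verify the selected instances on a non-production cluster first. All values are illustrative.

```yaml
# sketch: target only zone 1 of nodepool1 (assumes the filters intersect)
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: azure-aks-node-down
    spec:
      components:
        env:
        - name: AKS_CLUSTER_NAME
          value: 'my-aks-cluster'
        - name: AKS_RESOURCE_GROUP
          value: 'rg-aks-cluster'
        # restrict targets to a single node pool
        - name: TARGET_NODE_POOL_NAMES
          value: 'nodepool1'
        # restrict targets to a single availability zone
        - name: TARGET_ZONES
          value: '1'
        # affect every node that matches both filters
        - name: NODE_AFFECTED_PERCENTAGE
          value: '100'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
```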

Node affected percentage

It specifies the percentage of nodes to be affected in the target AKS cluster. Tune it by using the NODE_AFFECTED_PERCENTAGE environment variable.

Use the following example to tune it:

# affect a specific percentage of nodes
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: azure-aks-node-down
    spec:
      components:
        env:
        # name of the AKS cluster
        - name: AKS_CLUSTER_NAME
          value: 'my-aks-cluster'
        # resource group of the AKS cluster
        - name: AKS_RESOURCE_GROUP
          value: 'rg-aks-cluster'
        # percentage of nodes to affect (0-100), where 0 means 1 instance
        - name: NODE_AFFECTED_PERCENTAGE
          value: '30'
        # sequence of chaos execution
        - name: SEQUENCE
          value: 'parallel'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
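
The remaining optional tunables from the table (SEQUENCE set to serial, CHAOS_INTERVAL, RAMP_TIME, and DEFAULT_HEALTH_CHECK) can be set the same way. The following is a sketch only: it assumes the time values are plain numbers of seconds, mirroring how TOTAL_CHAOS_DURATION is written in the examples above, and the specific values are illustrative rather than recommended.

```yaml
# sketch: deallocate the selected nodes one at a time, with ramp time
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: azure-aks-node-down
    spec:
      components:
        env:
        - name: AKS_CLUSTER_NAME
          value: 'my-aks-cluster'
        - name: AKS_RESOURCE_GROUP
          value: 'rg-aks-cluster'
        - name: NODE_AFFECTED_PERCENTAGE
          value: '50'
        # affect the selected nodes one after another instead of all at once
        - name: SEQUENCE
          value: 'serial'
        # interval between successive chaos iterations (assumed seconds)
        - name: CHAOS_INTERVAL
          value: '30'
        # wait before and after injecting chaos (assumed seconds)
        - name: RAMP_TIME
          value: '30'
        # run the health check built into the fault
        - name: DEFAULT_HEALTH_CHECK
          value: 'true'
        - name: TOTAL_CHAOS_DURATION
          value: '120'
```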