Azure AKS node down
The Azure AKS node down fault deallocates nodes in an Azure Kubernetes Service (AKS) cluster for a specified chaos duration.
- It helps to check the resilience of your applications when AKS nodes become unavailable.
- It targets VMSS (Virtual Machine Scale Set) instances in the AKS node pools and temporarily deallocates them.
- You can filter target nodes by node pool name, availability zone, and percentage of nodes to affect.
Use cases
Azure AKS node down:
- Determines the resilience of applications when AKS cluster nodes become unavailable.
- Validates that workloads are properly distributed across nodes and can handle node failures gracefully.
- Tests the behavior of Kubernetes scheduling and auto-scaling when nodes are deallocated.
- Simulates availability zone (AZ) failures by targeting nodes in specific zones.
- Verifies that critical applications have proper pod disruption budgets and replica counts.
- Validates that monitoring and alerting systems detect node failures.
- Ensures that stateful applications handle node loss without data corruption.
Prerequisites
- Kubernetes >= 1.17
- Azure authentication configured for chaos faults. Refer to Azure authentication methods for setup instructions.
- The target AKS cluster should be in a running state before chaos injection.
Required Azure permissions
The service principal needs the following permissions:
- Reader role on the AKS cluster's resource group
- Virtual Machine Contributor role on the AKS cluster's node resource group (auto-generated resource group containing VMSS instances)
- Alternatively, a custom role with these permissions:
  - Microsoft.ContainerService/managedClusters/read (on AKS cluster resource group)
  - Microsoft.Compute/virtualMachineScaleSets/read (on node resource group)
  - Microsoft.Compute/virtualMachineScaleSets/virtualMachines/read (on node resource group)
  - Microsoft.Compute/virtualMachineScaleSets/virtualMachines/deallocate/action (on node resource group)
  - Microsoft.Compute/virtualMachineScaleSets/virtualMachines/powerOff/action (on node resource group, for ephemeral OS disk VMs)
  - Microsoft.Compute/virtualMachineScaleSets/virtualMachines/start/action (on node resource group)
Mandatory tunables
| Tunable | Description | Notes |
|---|---|---|
| AKS_CLUSTER_NAME | Name of the Azure Kubernetes Service (AKS) cluster. | For example, my-aks-cluster. For more information, go to AKS cluster name. |
| AKS_RESOURCE_GROUP | Resource group of the AKS cluster. | For example, rg-aks-cluster. For more information, go to resource group field in the YAML file. |
Optional tunables
| Tunable | Description | Notes |
|---|---|---|
| TOTAL_CHAOS_DURATION | Total duration for which chaos is injected into the target resource (in seconds). | Defaults to 30s. For more information, go to duration of the chaos. |
| CHAOS_INTERVAL | Time interval between successive chaos iterations (in seconds). | Defaults to 30s. For more information, go to chaos interval. |
| TARGET_NODE_POOL_NAMES | Comma-separated list of node pool names to target. | Empty means all node pools. For example, nodepool1,nodepool2. For more information, go to target node pools. |
| TARGET_ZONES | Comma-separated list of availability zones to target. | Empty means all zones. For example, 1,2,3. For more information, go to target zones. |
| NODE_AFFECTED_PERCENTAGE | Percentage of nodes to affect. | Defaults to 0 (corresponds to 1 instance). For more information, go to node affected percentage. |
| SEQUENCE | Sequence of chaos execution for multiple nodes. | Defaults to parallel. Also supports serial sequence. For more information, go to sequence of chaos execution. |
| RAMP_TIME | Period to wait before and after injecting chaos (in seconds). | For example, 30s. For more information, go to ramp time. |
| DEFAULT_HEALTH_CHECK | Determines whether to run the default health check that is built into the fault. | Defaults to 'false'. For more information, go to default health check. |
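The timing and health-check tunables in the table above can be combined with the mandatory tunables. The following is a minimal sketch, not a recommended configuration: it assumes the same ChaosEngine layout used elsewhere on this page, assumes that durations are passed as plain numbers of seconds (as the TOTAL_CHAOS_DURATION examples do), and uses illustrative values throughout.

```yaml
# illustrative only: set the optional timing tunables explicitly
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: azure-aks-node-down
    spec:
      components:
        env:
        # name of the AKS cluster
        - name: AKS_CLUSTER_NAME
          value: 'my-aks-cluster'
        # resource group of the AKS cluster
        - name: AKS_RESOURCE_GROUP
          value: 'rg-aks-cluster'
        # total chaos duration in seconds
        - name: TOTAL_CHAOS_DURATION
          value: '120'
        # interval between successive chaos iterations in seconds
        - name: CHAOS_INTERVAL
          value: '30'
        # wait period before and after chaos in seconds (value format assumed)
        - name: RAMP_TIME
          value: '30'
        # keep the built-in health check disabled (the documented default)
        - name: DEFAULT_HEALTH_CHECK
          value: 'false'
```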
Deallocate AKS nodes
It deallocates AKS cluster nodes for a specific chaos duration. Tune it by using the AKS_CLUSTER_NAME, AKS_RESOURCE_GROUP, and NODE_AFFECTED_PERCENTAGE environment variables.
Use the following example to tune it:
# deallocate AKS nodes for a certain chaos duration
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: azure-aks-node-down
    spec:
      components:
        env:
        # name of the AKS cluster
        - name: AKS_CLUSTER_NAME
          value: 'my-aks-cluster'
        # resource group of the AKS cluster
        - name: AKS_RESOURCE_GROUP
          value: 'rg-aks-cluster'
        # percentage of nodes to affect
        - name: NODE_AFFECTED_PERCENTAGE
          value: '100'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
Target specific node pools
It targets nodes from specific node pools in the AKS cluster. Tune it by using the TARGET_NODE_POOL_NAMES environment variable.
Use the following example to tune it:
# target specific node pools in AKS cluster
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: azure-aks-node-down
    spec:
      components:
        env:
        # name of the AKS cluster
        - name: AKS_CLUSTER_NAME
          value: 'my-aks-cluster'
        # resource group of the AKS cluster
        - name: AKS_RESOURCE_GROUP
          value: 'rg-aks-cluster'
        # comma-separated list of node pool names to target
        - name: TARGET_NODE_POOL_NAMES
          value: 'nodepool1,nodepool2'
        # percentage of nodes to affect
        - name: NODE_AFFECTED_PERCENTAGE
          value: '50'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
Target nodes by availability zone
It targets nodes from specific availability zones in the AKS cluster. Tune it by using the TARGET_ZONES environment variable.
Use the following example to tune it:
# target nodes in specific availability zones
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: azure-aks-node-down
    spec:
      components:
        env:
        # name of the AKS cluster
        - name: AKS_CLUSTER_NAME
          value: 'my-aks-cluster'
        # resource group of the AKS cluster
        - name: AKS_RESOURCE_GROUP
          value: 'rg-aks-cluster'
        # comma-separated list of availability zones to target
        - name: TARGET_ZONES
          value: '1,2'
        # percentage of nodes to affect
        - name: NODE_AFFECTED_PERCENTAGE
          value: '50'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
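To approximate the loss of a whole availability zone, which is one of the use cases listed earlier, one option is to target a single zone and affect every node in it. The sketch below assumes that NODE_AFFECTED_PERCENTAGE is applied to the nodes that remain after the zone filter; this page does not state that explicitly, so verify the selected instances in the fault logs before relying on it. The zone and duration values are illustrative.

```yaml
# illustrative only: take down all nodes in a single availability zone
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: azure-aks-node-down
    spec:
      components:
        env:
        # name of the AKS cluster
        - name: AKS_CLUSTER_NAME
          value: 'my-aks-cluster'
        # resource group of the AKS cluster
        - name: AKS_RESOURCE_GROUP
          value: 'rg-aks-cluster'
        # restrict the fault to zone 1 only
        - name: TARGET_ZONES
          value: '1'
        # affect all matching nodes (assumed to mean all nodes in zone 1)
        - name: NODE_AFFECTED_PERCENTAGE
          value: '100'
        # total chaos duration in seconds
        - name: TOTAL_CHAOS_DURATION
          value: '60'
```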
Node affected percentage
It specifies the percentage of nodes to be affected in the target AKS cluster. Tune it by using the NODE_AFFECTED_PERCENTAGE environment variable.
Use the following example to tune it:
# affect a specific percentage of nodes
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: azure-aks-node-down
    spec:
      components:
        env:
        # name of the AKS cluster
        - name: AKS_CLUSTER_NAME
          value: 'my-aks-cluster'
        # resource group of the AKS cluster
        - name: AKS_RESOURCE_GROUP
          value: 'rg-aks-cluster'
        # percentage of nodes to affect (0-100), where 0 means 1 instance
        - name: NODE_AFFECTED_PERCENTAGE
          value: '30'
        # sequence of chaos execution
        - name: SEQUENCE
          value: 'parallel'
        - name: TOTAL_CHAOS_DURATION
          value: '60'
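The SEQUENCE tunable also accepts serial. The following sketch shows the serial variant of the example above; it assumes that serial mode deallocates the selected nodes one after another instead of all at once, and this page does not describe how the chaos duration is divided among nodes in that mode. The values are illustrative.

```yaml
# illustrative only: affect the selected nodes one at a time
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: engine-nginx
spec:
  engineState: "active"
  annotationCheck: "false"
  chaosServiceAccount: litmus-admin
  experiments:
  - name: azure-aks-node-down
    spec:
      components:
        env:
        # name of the AKS cluster
        - name: AKS_CLUSTER_NAME
          value: 'my-aks-cluster'
        # resource group of the AKS cluster
        - name: AKS_RESOURCE_GROUP
          value: 'rg-aks-cluster'
        # percentage of nodes to affect
        - name: NODE_AFFECTED_PERCENTAGE
          value: '50'
        # run chaos on the selected nodes serially instead of in parallel
        - name: SEQUENCE
          value: 'serial'
        # total chaos duration in seconds
        - name: TOTAL_CHAOS_DURATION
          value: '120'
```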