
Best Practices for Probe Validation - Node Level Faults

This topic describes best practices for using resilience probes with Kubernetes node-level chaos faults.

Node Restart

  • HTTP Probe: To perform an application health check before and after chaos for the applications scheduled on that node.
  • CMD Source Probe (Node startup time): To measure the node start time, benchmark application performance, and detect SLA violations.
  • CMD Source Probe (Alert): To check if an alert is triggered on the node reboot signal.
  • CMD Source Probe (Node Status check): To check if Kubernetes updates the node status to "not-ready" when the node takes too long to start (a node-status check is sketched after this list).
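
As an illustration of the node-status validation above, here is a minimal sketch of the kind of check a CMD source probe could run, written with the Kubernetes Python client. The node name and kubeconfig access are assumptions; your actual probe command or source image may differ.

```python
# Sketch: check whether a node's Ready condition has flipped during a restart fault.
# Assumes kubeconfig access and a placeholder node name.
from kubernetes import client, config

def node_ready(node_name: str) -> bool:
    """Return True if the node's Ready condition is currently 'True'."""
    core = client.CoreV1Api()
    node = core.read_node(node_name)
    for cond in node.status.conditions or []:
        if cond.type == "Ready":
            return cond.status == "True"
    return False

if __name__ == "__main__":
    # Load local kubeconfig; use config.load_incluster_config() inside a pod.
    config.load_kube_config()
    # During the restart window this is expected to report False (NotReady);
    # after recovery it should return to True.
    print(node_ready("worker-node-1"))
```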

Node Network Loss

  • HTTP Probe (Application Health): To check whether the application endpoint is healthy for the applications scheduled on that node.
  • Prometheus/Datadog/Dynatrace Queries (Latency, Error rate): To check the latency and error rate of the applications (a sample query comparison is sketched after this list).
  • CMD Source Probe (Data Consistency): To ensure that data is not corrupted and the read and write operations involving the affected node work as expected.
  • CMD Source Probe (Failover): To check how quickly the application fails over to another node if the node is unavailable for a longer duration.
  • CMD Source Probe (Failover): To check if the system brings up a new node.
  • Prometheus/Datadog/Dynatrace (Alerts): To check if alerts are fired when the node is unreachable for longer than the threshold duration.
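
To make the latency/error-rate validation above concrete, the following is a minimal sketch of the comparison such a probe performs, assuming a Prometheus HTTP API endpoint. The URL, metric names, and threshold are placeholders, not values from this topic.

```python
# Sketch: compare an instant Prometheus query result against a threshold,
# the way a latency/error-rate probe would during the fault window.
import requests

# Placeholders: point these at your Prometheus endpoint and application metrics.
PROM_URL = "http://prometheus.monitoring.svc:9090/api/v1/query"
QUERY = 'sum(rate(http_requests_total{status=~"5..",app="my-app"}[1m]))'
THRESHOLD = 0.05  # maximum acceptable 5xx rate during the fault window

def error_rate_within_threshold() -> bool:
    """Query Prometheus and compare the instant value against the threshold."""
    resp = requests.get(PROM_URL, params={"query": QUERY}, timeout=10)
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # An empty result means no matching 5xx samples were observed.
    value = float(result[0]["value"][1]) if result else 0.0
    return value <= THRESHOLD

if __name__ == "__main__":
    print("PASS" if error_rate_within_threshold() else "FAIL")
```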

Node Drain

Node CPU Hog

  • HTTP Probe (Health Check): To check whether the application endpoint is still responsive.
  • CMD Probe/APM Queries (Latency): To check how long a user waits before getting a response. This helps you evaluate the retry, exponential back-off, and failover mechanisms (if any).
  • CMD Source Probe/APM Queries (Alerts): To check if alerts are fired when your system becomes unresponsive.
  • CMD Source Probe (Cluster Autoscaler): To check if the nodes are scaled (up or down) according to your autoscaler. It also measures the time the service takes to scale the nodes.
  • CMD Source Probe (Pod Failover): To check if pods can fail over to other nodes if the impacted node becomes unresponsive (a rescheduling check is sketched after this list).
  • CMD Source Probe: To check if the containers scheduled on the target node restarted.
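
For the pod-failover validation above, this is a minimal sketch of the check a CMD source probe could run, using the Kubernetes Python client. The namespace, label selector, and node name are placeholders.

```python
# Sketch: verify that pods matching a selector are Running somewhere
# other than the impacted node.
from kubernetes import client, config

def pods_rescheduled(namespace: str, label_selector: str, impacted_node: str) -> bool:
    """True if every matching pod is Running on a node other than the impacted one."""
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(namespace, label_selector=label_selector).items
    return all(
        pod.status.phase == "Running" and pod.spec.node_name != impacted_node
        for pod in pods
    )

if __name__ == "__main__":
    config.load_kube_config()
    print(pods_rescheduled("default", "app=my-app", "worker-node-1"))
```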

Node Memory Hog

  • HTTP Probe (Health Check): To check whether the application endpoint is still responsive or hangs.
  • CMD Probe/APM Queries (Latency): To check how long a user waits before getting a response. This helps you evaluate the retry, exponential back-off, and failover mechanisms (if any).
  • CMD Source Probe/APM Queries (Alerts): To check if alerts are fired when your system becomes unresponsive.
  • CMD Source Probe (Cluster Autoscaler): To check if the nodes are scaled (up or down) according to your autoscaler. It also measures the time the service takes to scale the nodes.
  • CMD Source Probe (Pod Failover): To check if pods can fail over to other nodes if the impacted node becomes unresponsive.
  • CMD Source Probe: To check if the containers scheduled on the target node restarted.
  • CMD Source Probe: To check if alerts were fired or the node status was updated when the node restarted due to an OOM kill, that is, when memory usage exceeded the allotted memory (a restart/OOM check is sketched after this list).
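
For the restart/OOM validation above, here is a minimal sketch of the check a CMD source probe could run, using the Kubernetes Python client. The namespace and node name are placeholders; your probe may instead rely on alert queries or `kubectl` output.

```python
# Sketch: list containers on the target node that restarted or were OOMKilled.
from kubernetes import client, config

def restarted_or_oom_killed(namespace: str, node_name: str):
    """Return (pod, container) pairs on the node that restarted or were OOMKilled."""
    core = client.CoreV1Api()
    pods = core.list_namespaced_pod(
        namespace, field_selector=f"spec.nodeName={node_name}"
    ).items
    hits = []
    for pod in pods:
        for status in pod.status.container_statuses or []:
            terminated = status.last_state.terminated if status.last_state else None
            oom_killed = terminated is not None and terminated.reason == "OOMKilled"
            if status.restart_count > 0 or oom_killed:
                hits.append((pod.metadata.name, status.name))
    return hits

if __name__ == "__main__":
    config.load_kube_config()
    print(restarted_or_oom_killed("default", "worker-node-1"))
```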

Kubelet Service Kill

  • HTTP Probe (Health Check): To check whether the application endpoint is still responsive or hangs.
  • CMD Probe/APM Queries (Alerts): To check if alerts are fired and the node status changes to "not-ready" when the kubelet service is stopped for a long duration.
  • CMD Source Probe (DaemonSet): To check if the DaemonSet pods remain functional after the kubelet recovers (a readiness check is sketched after this list).
  • CMD Source Probe (Cluster Autoscaler): To check if the nodes are scaled (up or down) based on your autoscaler. It also checks the time required to scale the nodes.
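
For the DaemonSet validation above, this is a minimal sketch of the check a CMD source probe could run after the kubelet recovers, using the Kubernetes Python client. The DaemonSet name and namespace are placeholders.

```python
# Sketch: confirm a DaemonSet is back to full strength after kubelet recovery.
from kubernetes import client, config

def daemonset_healthy(name: str, namespace: str) -> bool:
    """True if every desired DaemonSet pod is scheduled and ready."""
    apps = client.AppsV1Api()
    ds = apps.read_namespaced_daemon_set(name, namespace)
    return ds.status.number_ready == ds.status.desired_number_scheduled

if __name__ == "__main__":
    config.load_kube_config()
    print(daemonset_healthy("node-exporter", "monitoring"))
```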