Best Practices for Probe Validation - Node-Level Faults
This topic describes best practices for using resilience probes with Kubernetes node-level chaos faults.
Node Restart
- HTTP Probe: To perform an application health check before and after chaos for the applications scheduled on that node (see the sketch after this list).
- CMD Source Probe (Node Startup Time): To measure the node startup time, benchmark application performance, and find SLA violations.
- CMD Source Probe (Alert): To check whether an alert is triggered on the node reboot signal.
- CMD Source Probe (Node Status Check): To check whether Kubernetes updates the node status to "NotReady" when the node takes longer than expected to start.
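As an illustration, the health-check and node-status validations above could be written as LitmusChaos-style probe definitions (the schema these resilience probes are based on). This is a minimal sketch, not a drop-in manifest: the URL, node name, image, and run properties are placeholders, and the duration-string format assumes a recent (3.x) probe schema.

```yaml
probe:
  # HTTP probe: application health check before and after chaos (Edge mode).
  - name: app-health-check
    type: httpProbe
    mode: Edge
    httpProbe/inputs:
      url: http://my-app.app-ns.svc.cluster.local:8080/healthz  # hypothetical endpoint
      method:
        get:
          criteria: "=="
          responseCode: "200"
    runProperties:
      probeTimeout: 5s
      interval: 2s
      attempt: 2
  # CMD probe: the node must report Ready again by the end of the test.
  - name: node-ready-check
    type: cmdProbe
    mode: EOT
    cmdProbe/inputs:
      command: kubectl get node my-node-1 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
      comparator:
        type: string
        criteria: equal
        value: "True"
      source:
        image: bitnami/kubectl:latest  # probe pod needs RBAC to read nodes
    runProperties:
      probeTimeout: 30s
      interval: 5s
      attempt: 3
```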
Node Network Loss
- HTTP Probe (Application Health): To check whether the application endpoint is healthy for the applications scheduled on that node.
- Prometheus/Datadog/Dynatrace Queries (Latency, Error Rate): To check the latency and error rate of the applications (see the sketch after this list).
- CMD Source Probe (Data Consistency): To ensure that data is not corrupted and that read and write operations involving the affected node work as expected.
- CMD Source Probe (Failover): To check how quickly the application fails over to another node if the node is unavailable for a longer duration.
- CMD Source Probe (Failover): To check whether the system brings up a new node.
- Prometheus/Datadog/Dynatrace (Alerts): To check whether alerts are fired when a node in the network is unavailable for longer than the threshold duration.
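For the latency and error-rate checks, a Prometheus probe can compare a query result against a threshold. The sketch below assumes a Prometheus server at the given endpoint and standard `http_request_duration_seconds` / `http_requests_total` metrics; the endpoint, job label, and thresholds are all illustrative.

```yaml
probe:
  # Prometheus probe: p99 latency should stay under 500 ms during the chaos window.
  - name: latency-check
    type: promProbe
    mode: Continuous
    promProbe/inputs:
      endpoint: http://prometheus.monitoring.svc.cluster.local:9090  # hypothetical Prometheus service
      query: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="my-app"}[1m])) by (le))
      comparator:
        criteria: "<"
        value: "0.5"
    runProperties:
      probeTimeout: 10s
      interval: 10s
      attempt: 1
  # Prometheus probe: 5xx error rate should stay below 5%.
  - name: error-rate-check
    type: promProbe
    mode: Continuous
    promProbe/inputs:
      endpoint: http://prometheus.monitoring.svc.cluster.local:9090
      query: sum(rate(http_requests_total{job="my-app",status=~"5.."}[1m])) / sum(rate(http_requests_total{job="my-app"}[1m]))
      comparator:
        criteria: "<"
        value: "0.05"
    runProperties:
      probeTimeout: 10s
      interval: 10s
      attempt: 1
```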
Node Drain
- HTTP Probe (Application Health): To check whether the application endpoint is healthy.
- Prometheus/Datadog/Dynatrace (Alerts): To check whether alerts are fired while the node drain is in progress.
- CMD Source Probe: To ensure that all pods/replicas are in a healthy state after being rescheduled (see the sketch after this list).
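One way to validate the rescheduled replicas is a cmdProbe that compares the workload's ready-replica count at the end of the test. A minimal sketch, again assuming the LitmusChaos-style schema; the deployment name, namespace, and replica count are placeholders.

```yaml
probe:
  # CMD probe: at the end of the test, all replicas of the workload must be ready again.
  - name: replica-health-check
    type: cmdProbe
    mode: EOT
    cmdProbe/inputs:
      command: kubectl get deployment my-app -n app-ns -o jsonpath='{.status.readyReplicas}'
      comparator:
        type: int
        criteria: ">="
        value: "3"  # expected replica count; placeholder
      source:
        image: bitnami/kubectl:latest
    runProperties:
      probeTimeout: 30s
      interval: 10s
      attempt: 5
```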
Node CPU Hog
- HTTP Probe (Health Check): To check whether the application endpoint is still responsive.
- CMD Probe/APM Queries (Latency): To check how long a user waits before getting a response. This helps you understand the retry, exponential backoff, and failover mechanisms (if any).
- CMD Source Probe/APM Queries (Alerts): To check whether alerts are fired when your system is unresponsive.
- CMD Source Probe (Cluster Autoscaler): To check whether nodes are scaled (up or down) by your autoscaler, and to measure the time the service takes to scale the nodes (see the sketch after this list).
- CMD Source Probe (Pod Failover): To check whether pods can fail over to other nodes if the impacted node becomes unresponsive.
- CMD Source Probe: To check whether the containers scheduled on the target node were restarted.
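The autoscaler check could be a cmdProbe that counts cluster nodes during chaos and expects the count to rise above the baseline. A hedged sketch; the expected count, image, and intervals are assumptions, and whether pipes need the explicit `sh -c` wrapper may depend on your probe runtime.

```yaml
probe:
  # CMD probe: while CPU is hogged, the autoscaler should have added capacity.
  # The command is wrapped in sh -c so the pipe runs inside the probe pod.
  - name: autoscaler-check
    type: cmdProbe
    mode: OnChaos
    cmdProbe/inputs:
      command: sh -c "kubectl get nodes --no-headers | wc -l"
      comparator:
        type: int
        criteria: ">="
        value: "4"  # baseline node count plus one; placeholder
      source:
        image: bitnami/kubectl:latest
    runProperties:
      probeTimeout: 30s
      interval: 30s
      attempt: 3
```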
Node Memory Hog
- HTTP Probe (Health Check): To check whether the application endpoint is still responsive or hangs.
- CMD Probe/APM Queries (Latency): To check how long a user waits before getting a response. This helps you understand the retry, exponential backoff, and failover mechanisms (if any).
- CMD Source Probe/APM Queries (Alerts): To check whether alerts are fired when your system is unresponsive.
- CMD Source Probe (Cluster Autoscaler): To check whether nodes are scaled (up or down) by your autoscaler, and to measure the time the service takes to scale the nodes.
- CMD Source Probe (Pod Failover): To check whether pods can fail over to other nodes if the impacted node becomes unresponsive.
- CMD Source Probe: To check whether the containers scheduled on the target node were restarted.
- CMD Source Probe: To check whether alerts were fired or the node status was updated when the node was restarted due to an OOM kill (memory usage exceeding the allotted memory) (see the sketch after this list).
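The node-status side of the OOM check could be a cmdProbe that verifies the node no longer reports MemoryPressure once the hog ends. A minimal sketch assuming the LitmusChaos-style schema; the node name and run properties are placeholders.

```yaml
probe:
  # CMD probe: once the hog ends, the node should no longer report MemoryPressure.
  - name: memory-pressure-check
    type: cmdProbe
    mode: EOT
    cmdProbe/inputs:
      command: kubectl get node my-node-1 -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}'
      comparator:
        type: string
        criteria: equal
        value: "False"
      source:
        image: bitnami/kubectl:latest
    runProperties:
      probeTimeout: 30s
      interval: 10s
      attempt: 3
```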
Kubelet Service Kill
- HTTP Probe (Health Check): To check whether the application endpoint is still responsive or hangs.
- CMD Probe/APM Queries (Alerts): To check whether alerts are fired and the node status changes to "NotReady" when the kubelet service is stopped for a long duration (see the sketch after this list).
- CMD Source Probe (DaemonSet): To check whether the DaemonSet pods remain functional after the kubelet recovers.
- CMD Source Probe (Cluster Autoscaler): To check whether nodes are scaled (up or down) by your autoscaler, and to measure the time required to scale the nodes.
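A sketch of how the NotReady and DaemonSet checks might look, again assuming the LitmusChaos-style cmdProbe schema; the node and DaemonSet names, counts, and delays are placeholders. Note that the chaos duration must outlast the control plane's node-monitor grace period, or the node will never be marked NotReady.

```yaml
probe:
  # CMD probe: while the kubelet is down, the node should be marked NotReady.
  # The control plane waits out its node-monitor grace period (about 40s by
  # default) before flipping the status, so the chaos duration must exceed it.
  - name: node-notready-check
    type: cmdProbe
    mode: OnChaos
    cmdProbe/inputs:
      command: kubectl get node my-node-1 -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
      comparator:
        type: string
        criteria: notEqual
        value: "True"
      source:
        image: bitnami/kubectl:latest
    runProperties:
      probeTimeout: 30s
      interval: 15s
      attempt: 3
      initialDelay: 60s  # let the grace period elapse first; placeholder
  # CMD probe: after kubelet recovery, the DaemonSet pods should all be ready.
  - name: daemonset-ready-check
    type: cmdProbe
    mode: EOT
    cmdProbe/inputs:
      command: kubectl get daemonset my-ds -n kube-system -o jsonpath='{.status.numberReady}'
      comparator:
        type: int
        criteria: ">="
        value: "3"  # expected DaemonSet pod count; placeholder
      source:
        image: bitnami/kubectl:latest
    runProperties:
      probeTimeout: 30s
      interval: 10s
      attempt: 5
```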