
Run your first chaos experiment

Welcome to Harness Chaos Engineering!

This guide helps you set up a project, environment, and infrastructure, and execute your first Kubernetes pod-level chaos experiment on your application (known as the target). Fulfill the prerequisites before executing chaos experiments on Harness CE.

Step 1: Identify the microservice to target

  1. Identify the microservice in your application to target, that is, the one whose resources will be affected. In this guide, you will delete a Kubernetes pod from your application. Pod delete is the simplest chaos experiment and is recommended as the first step because it has a small blast radius.
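
For example, one way to shortlist a target is to list the workloads and pods in your application's namespace. A minimal sketch, assuming your application runs in a namespace you substitute for <namespace>:

    # List the deployments (candidate target microservices) in the namespace
    kubectl get deployments -n <namespace>

    # List the pods backing those deployments, along with their labels
    kubectl get pods -n <namespace> --show-labels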

The diagram below describes the high-level steps to inject chaos into your application.


Step 2: Create a project

  1. Sign up or log in to your account, and access the Chaos Engineering module. Create a new project or ask your administrator to add you to an existing project.

Step 3: Create an environment

You can follow the interactive guide or the step-by-step guide to create your environment.

tip

If you have already created an environment (or one is available), you can select it from the list of environments instead of creating a new one.

Step 4: Create an infrastructure

Step 5: Create observability infrastructure

Once you are ready to target your Kubernetes resources, execute the simplest fault, pod delete, on your application. This fault deletes the pods of a Deployment, StatefulSet, or DaemonSet to validate the resiliency of a microservice application.

  1. Run the following commands to set up the target application microservices and, optionally, the observability infrastructure, which includes Grafana, Prometheus, and a BlackBox exporter. Installing the observability infrastructure provides a dashboard that helps you validate the health of the constituent application microservices in real time.
❯ kubectl apply -f path-to-manifest/app.yaml -n <namespace>
❯ kubectl apply -f path-to-manifest/monitoring.yaml -n <namespace>
info
  • Earlier, you specified the installation mode as Specific namespace access, so the resources are deployed in a specific namespace.

  • The target application and observability infrastructure pods are available in the hce namespace.

  • You can access the monitoring service dashboard to see the status of the application.

  • To view the pods in a namespace, execute:

    kubectl get pods -n <namespace>
  • To list the services available in a namespace, execute:

    kubectl get services -n <namespace>
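
If you installed the observability infrastructure, one way to open the monitoring dashboard locally is to port-forward its service. A minimal sketch, assuming the Grafana service created by monitoring.yaml is named grafana and listens on port 3000 (use the service listing above to confirm the actual name and port):

    # Forward the dashboard service to localhost, then browse to http://localhost:3000
    kubectl port-forward svc/grafana 3000:3000 -n <namespace>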

Step 6: Define resilience probes

You can define a resilience probe while creating an experiment, or you can create one beforehand and use it in the chaos experiment. In this example, you will create a probe first and then use it in the experiment. Create a probe before you move to the next step.
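
For this guide, an HTTP probe that expects a 200 response from the application's endpoint is a good fit for the pod delete fault. Conceptually, such a probe performs a check similar to the one below; a minimal sketch, assuming a hypothetical service name, port, and path that you replace with your application's own:

    # Print the HTTP status code returned by the application endpoint; 200 means it is reachable
    curl -s -o /dev/null -w "%{http_code}\n" http://<app-service>.<namespace>.svc.cluster.local:<port>/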

Step 7: Construct a chaos experiment

Once the target application is deployed, create a chaos experiment. Target the pods of the microservice of your choice with the pod delete fault. Use resilience probes to verify that the application is healthy before chaos is injected and after the chaos duration is complete.
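
When you define the target application for the fault, you typically provide its namespace, kind, and label. If you are unsure of the label, you can read it off the deployment; a minimal sketch, assuming a hypothetical deployment name that you replace with your own:

    # Print the label selector of the target deployment
    kubectl get deployment <deployment-name> -n <namespace> -o jsonpath='{.spec.selector.matchLabels}'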

Step 8: Observe chaos execution

To execute the chaos experiment, click Save, and then Run. This schedules an experiment run. You can see the logs of the experiment in the Experiment Builder tab.
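
While the experiment runs, you can also watch the target pods from the command line to see the pod being deleted and a replacement being scheduled. For example:

    # Watch pod status changes in the application namespace during the chaos run
    kubectl get pods -n <namespace> -w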


tip

You can also view the application's behavior using the application metrics dashboard. The probe success percentage for website availability (a 200 response code) drops steeply, along with the 99th-percentile (green line) queries per second (QPS) and the access duration for the application microservices, while the mean QPS (yellow line) rises steeply. This is because no pod is available at that moment to service the query requests.

Evaluate the experiment run

When the experiment execution concludes, you get a resilience score of 0%. You will see that the pod delete fault step failed. Before analyzing the experiment result, validate that the application is accessible again, without any errors. You can validate this from the Grafana dashboard metrics (optional), which show the application returning to normal once the chaos duration is over.

The chaos result shows the pod delete experiment as Failed. This is because the probe failed: the pod was unavailable while the pod delete fault was being injected.

Conclusion

Congratulations on running your first chaos experiment! Want to know how to enhance the resilience of the application and execute the experiment successfully? Increase the replicas of the target deployment to at least two so that at least one pod survives the pod delete fault and keeps the application available. Try running it on your own, as shown below.
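
For example, if your target is a single-replica deployment, one way to do this is to scale it up before re-running the experiment; a sketch, assuming a hypothetical deployment name:

    # Scale the target deployment to two replicas so at least one pod survives the fault
    kubectl scale deployment <deployment-name> --replicas=2 -n <namespace>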

Next steps