
Prerequisites

This topic describes the generic prerequisites to fulfill before executing chaos experiments with Harness Chaos Engineering (SaaS).

The prerequisites are grouped by category below. Each category lists the checks to perform or permissions to obtain, along with the relevant Harness notes.

Access to a Harness account
Sign up here in case you don't have an account already. If you already have an account, log in and navigate to the "Chaos Engineering" module.
Target Preparation
A cloud-native microservices application or an infrastructure component.
Platform Details
  1. Targeted platform(s) for Kubernetes can be AKS, GKE, or EKS.
  2. The Kubernetes version used on the targeted cluster(s).
  3. For the GKE platform, choose between Standard and Autopilot mode.
Harness notes:
  1. Harness Chaos supports fault injection across different Kubernetes clusters from a single, centralized execution environment (cluster), provided there is network connectivity for the transient runners/pods on the target clusters.
  2. The delegate is responsible for launching the transient runners across clusters.
  3. To use IRSA for AWS resource chaos, the central execution environment should be set up on EKS.
Network Connectivity
  1. Check for outbound connectivity from the target clusters to the internet (a quick check is sketched below).
  2. Check if whitelisting is required.
  3. Check if the chaos runners should be configured with any proxy settings.
The transient runners connect to app.harness.io endpoints over port 443.
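
To verify outbound connectivity and any proxy requirements before onboarding, something along the lines of the following minimal sketch can be run from a pod or node in the target cluster. It only assumes the app.harness.io endpoint and port 443 mentioned above, plus the standard HTTPS_PROXY environment-variable convention for proxies.

```python
# Minimal sketch: verify outbound connectivity to the Harness SaaS endpoint,
# honouring any HTTPS_PROXY / https_proxy environment variables that the chaos
# runners would also need to be configured with.
import os
import urllib.error
import urllib.request

ENDPOINT = "https://app.harness.io"  # endpoint the transient runners connect to (port 443)


def check_outbound(url: str = ENDPOINT, timeout: int = 10) -> bool:
    proxy = os.environ.get("HTTPS_PROXY") or os.environ.get("https_proxy")
    if proxy:
        print(f"Using proxy from environment: {proxy}")
    try:
        # urllib picks up proxy settings from the environment automatically.
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            print(f"Reached {url} (HTTP {resp.status})")
            return True
    except urllib.error.HTTPError as exc:
        # An HTTP error status still means the endpoint was reachable over 443.
        print(f"Reached {url} (HTTP {exc.code})")
        return True
    except (urllib.error.URLError, OSError) as exc:
        print(f"Could not reach {url}: {exc}")
        return False


if __name__ == "__main__":
    raise SystemExit(0 if check_outbound() else 1)
```
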
RBAC
Check the scope of the chaos component workloads to be created on the clusters. Harness CE facilitates creating custom roles/service accounts on the cloud providers for chaos purposes.
  1. By design, Harness CE discovers services and injects chaos on workloads across namespaces on a given cluster, i.e., it is cluster-scoped by default.
  2. Check if discovery and fault injection should be restricted to a specific namespace (in namespace mode, node faults can't be executed).
  3. If there are any approval processes for creating roles with specific policies/permissions, Harness CE provides the minimal policy spec based on the resources and faults involved.
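
As a quick way to confirm whether the identity you plan to use is cluster-scoped or restricted to a namespace, a Kubernetes SelfSubjectAccessReview can be issued. The sketch below uses the `kubernetes` Python client and a valid kubeconfig or in-cluster config; the "chaos" namespace is a placeholder, and this is not the policy spec that Harness CE generates.

```python
# Sketch: check whether the current identity can list pods cluster-wide or
# only within a specific namespace, via a SelfSubjectAccessReview.
# Assumes the `kubernetes` Python client and a reachable cluster.
from kubernetes import client, config


def can_list_pods(namespace: str = "") -> bool:
    """Empty namespace means 'all namespaces' (cluster scope) for namespaced resources."""
    review = client.V1SelfSubjectAccessReview(
        spec=client.V1SelfSubjectAccessReviewSpec(
            resource_attributes=client.V1ResourceAttributes(
                verb="list",
                resource="pods",
                namespace=namespace or None,
            )
        )
    )
    resp = client.AuthorizationV1Api().create_self_subject_access_review(review)
    return bool(resp.status.allowed)


if __name__ == "__main__":
    config.load_kube_config()  # or config.load_incluster_config()
    print("cluster-scoped pod listing allowed:", can_list_pods())
    print("namespace-scoped pod listing allowed:", can_list_pods("chaos"))  # placeholder namespace
```
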
Image Registry
  1. Check if you can pull images from public upstream image registries for workloads on your cluster.
  2. If private registries are involved, check whether they require credentials/secrets.
  3. Check whether the different target clusters subscribe to a single registry or to more than one.
  4. Check for any naming conventions for the images or tags, and determine whether you can retain the upstream names or must rename them.
If a private registry is used, configure the registry settings within the Harness Chaos module.
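
For reference, a standard kubernetes.io/dockerconfigjson pull secret is one way clusters authenticate to a private registry. The sketch below, using the Kubernetes Python client, is illustrative only; the registry URL, namespace, secret name, and credentials are placeholders, and you may instead configure the registry settings directly in the Harness Chaos module.

```python
# Sketch: create an image pull secret for a private registry with the
# Kubernetes Python client. Registry URL, namespace, and credentials are
# placeholders; substitute your own values.
import base64
import json

from kubernetes import client, config


def make_pull_secret(name, namespace, registry, username, password):
    auth = base64.b64encode(f"{username}:{password}".encode()).decode()
    dockerconfig = {
        "auths": {registry: {"username": username, "password": password, "auth": auth}}
    }
    return client.V1Secret(
        metadata=client.V1ObjectMeta(name=name, namespace=namespace),
        type="kubernetes.io/dockerconfigjson",
        string_data={".dockerconfigjson": json.dumps(dockerconfig)},
    )


if __name__ == "__main__":
    config.load_kube_config()
    secret = make_pull_secret(
        "chaos-registry-creds", "chaos", "registry.example.com", "svc-user", "s3cr3t"
    )
    client.CoreV1Api().create_namespaced_secret(namespace="chaos", body=secret)
```
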
Deployment
  1. Check for specific procedures involved in deploying third-party workloads (Harness Delegate/chaos agent components). Determine whether the workloads can be manifest-based deployments or must be Helm-based only.
  2. Check whether the deployment can be a manual action or must be pipeline-driven only.
Container Runtime Permissions
  1. Check if you have policies/admission controllers that validate workloads for runtime security (for example, OPA, Kyverno, Anthos Config Management, or custom controllers).
  2. Check for restrictions on the topology/scheduling of third-party cluster workloads.

For some of the resource and network-related faults, transient chaos workloads are required to run with privileged containers, specific Linux capabilities, and the root user; an illustrative sketch follows. The policy for this should be created and applied on the cluster. There are guardrails, implemented using ChaosGuard, to ensure that only specific users can run such faults on a specific cluster at a specific time.
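
To give a sense of what such a policy needs to allow, the sketch below builds a container securityContext with the Kubernetes Python client. It is illustrative only: the container name, image, and the specific Linux capabilities shown are assumptions, and the actual permissions required depend on the faults you run.

```python
# Sketch: the kind of elevated securityContext that transient chaos pods may
# need for certain resource and network faults. Illustrative only; the exact
# capabilities, user, and image depend on the faults being executed.
from kubernetes import client

chaos_runner_container = client.V1Container(
    name="chaos-runner",  # placeholder name
    image="registry.example.com/chaos-runner:latest",  # placeholder image
    security_context=client.V1SecurityContext(
        privileged=True,   # required by some resource faults
        run_as_user=0,     # root user
        capabilities=client.V1Capabilities(
            add=["NET_ADMIN", "SYS_ADMIN"],  # example Linux capabilities
        ),
    ),
)
```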

Service Meshes
Check which target clusters are configured with service meshes. Check whether all the workloads must have sidecar proxies injected or whether some can be excluded.
Resource Constraints
  1. Check if there are any policies that mandate resource requests/limits on all containers.
  2. Determine if there are constraints on the values, such as namespace quotas.
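
If such policies exist, the chaos workloads will need explicit requests and limits sized to fit any namespace quotas. A minimal sketch with the Kubernetes Python client follows; the values are placeholders.

```python
# Sketch: explicit resource requests/limits for a chaos workload container,
# as mandated by some cluster policies. Values are placeholders; size them
# against any namespace quotas in force.
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={"cpu": "100m", "memory": "128Mi"},
    limits={"cpu": "500m", "memory": "512Mi"},
)
```
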
Application Observability
  1. Check if the target applications hosted on the cluster or running on AWS resources have well-defined health-check mechanisms.
  2. Check if there is more than one way to determine their steady-state behavior.
  3. Check if the application dependencies/downstream components and their health indicators are defined.

Harness Chaos provides Resilience Probes to validate app health and/or any hypothesis about app behavior during the course of the experiment. Probe results generate a Resilience Score.
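
In practice you would model such checks as Resilience Probes in Harness, but as a plain illustration of what a "well-defined health check" means, the hypothetical sketch below polls an application health endpoint and treats a 2xx response within a latency budget as steady state. The URL, timeout, and latency threshold are assumptions.

```python
# Sketch: a plain HTTP steady-state check against a hypothetical application
# health endpoint. In Harness this would typically be a Resilience Probe;
# the URL, timeout, and latency budget here are illustrative assumptions.
import time
import urllib.error
import urllib.request

HEALTH_URL = "http://checkout.example.svc.cluster.local:8080/healthz"  # placeholder
LATENCY_BUDGET_SECONDS = 0.5  # placeholder steady-state hypothesis


def is_steady(url: str = HEALTH_URL) -> bool:
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            latency = time.monotonic() - start
            return 200 <= resp.status < 300 and latency <= LATENCY_BUDGET_SECONDS
    except (urllib.error.URLError, OSError):
        return False


if __name__ == "__main__":
    print("steady state:", is_steady())
```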

Monitoring
  1. Determine the observability systems/APMs in use for the services running on the cluster/AWS resources.
  2. Check if these APM endpoints are accessible from within the cluster workloads.

Platform-Specific Fault Permissions

Refer to the specific docs for platform-specific permissions required to execute faults.

Next steps