
Key Concepts

This guide covers the essential terminology and concepts for Harness Resilience Testing.

Chaos Engineering

Chaos Engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent and unexpected conditions in production.

Core Principles

Steady State Hypothesis

The steady state represents your system's normal operating condition. Before running chaos experiments, you define:

  • Measurable system outputs that indicate normal behavior
  • Baseline metrics using Service Level Objectives (SLOs)
  • Acceptable thresholds for system performance

Example: "Our API should maintain 99.9% availability with response times under 200ms during normal operations."

Blast Radius

The blast radius is the scope of impact a chaos experiment can have on your system. Best practices:

  • Start small with non-critical systems or components
  • Gradually increase experiment scope as confidence grows
  • Use infrastructure controls to limit impact
  • Implement automatic rollback mechanisms
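
One common way to keep the blast radius small is to cap the fraction of instances a fault may touch. The sketch below illustrates that pattern; the function and parameter names are hypothetical, not Harness syntax.

```python
import math
import random

def select_targets(instances: list[str], blast_radius_pct: float) -> list[str]:
    """Pick a random subset of instances, capped at blast_radius_pct percent.

    Starting with a small percentage (e.g. 10%) on non-critical components
    keeps the impact of a failed hypothesis contained.
    """
    count = max(1, math.floor(len(instances) * blast_radius_pct / 100))
    return random.sample(instances, count)

pods = [f"checkout-{i}" for i in range(20)]
print(select_targets(pods, blast_radius_pct=10))  # e.g. ['checkout-7', 'checkout-3']
```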

Hypothesis-Driven Testing

Each chaos experiment follows a scientific approach:

  1. Identify the steady state and specify SLOs
  2. Hypothesize what will happen when a fault is injected
  3. Inject the failure in a controlled manner with minimal blast radius
  4. Validate whether the system maintains steady state and meets SLOs
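
Read as control flow, the four steps look roughly like the sketch below. The three callables (`check_steady_state`, `inject_fault`, `rollback_fault`) are hypothetical stand-ins for your monitoring and fault-injection tooling.

```python
def run_experiment(check_steady_state, inject_fault, rollback_fault) -> str:
    """Hypothesis-driven chaos experiment, expressed as plain control flow."""
    # 1. Identify the steady state: confirm the SLOs hold before injecting anything.
    if not check_steady_state():
        return "aborted: system was not in steady state to begin with"

    # 2. Hypothesis: the system will keep meeting its SLOs while the fault is active.
    # 3. Inject the failure in a controlled manner (small blast radius, reversible).
    inject_fault()
    try:
        # 4. Validate: does the steady state still hold under fault?
        result = "passed" if check_steady_state() else "failed: resilience gap found"
    finally:
        rollback_fault()  # always restore the system, even if validation errors out
    return result
```

The try/finally mirrors the automatic-rollback best practice above: the fault is reverted even if validation itself fails.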

Key Components

Chaos Experiments

A chaos experiment is a set of operations, executed together, that inject faults into a target resource and validate the system's resilience. Each experiment:

  • Targets specific infrastructure or application components
  • Injects one or more chaos faults in a defined sequence
  • Uses probes to validate system behavior
  • Can include custom actions for notifications or integrations
  • Generates a resilience score based on results
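
As an illustration of that structure, the sketch below models an experiment as plain Python data, with the resilience score computed as the percentage of probes that passed (one simple way such a score can be derived). The field names are assumptions for clarity, not the actual Harness experiment manifest.

```python
from dataclasses import dataclass, field

@dataclass
class Probe:
    name: str
    passed: bool = False

@dataclass
class ChaosExperiment:
    """Illustrative model of an experiment: target, faults, probes, actions."""
    target: str                       # e.g. a Kubernetes deployment
    faults: list[str]                 # faults injected in a defined sequence
    probes: list[Probe]               # health validations run alongside the faults
    actions: list[str] = field(default_factory=list)  # notifications, webhooks, delays

    def resilience_score(self) -> float:
        """Percentage of probes that passed -- one simple way to score a run."""
        if not self.probes:
            return 0.0
        return 100.0 * sum(p.passed for p in self.probes) / len(self.probes)

run = ChaosExperiment(
    target="deployment/checkout",
    faults=["pod-delete", "network-latency"],
    probes=[Probe("api-availability", passed=True), Probe("p95-latency", passed=False)],
)
print(run.resilience_score())  # 50.0
```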

Chaos Faults

Chaos faults are pre-built failure scenarios that simulate real-world issues:

Fault Categories

  • Kubernetes Faults: Pod deletions, container kills, resource stress
  • Cloud Platform Faults: AWS, GCP, Azure service disruptions
  • Infrastructure Faults: CPU stress, memory exhaustion, network latency, disk pressure
  • Application Faults: Service failures, error injection, timeout simulation

Harness provides 200+ ready-to-use chaos faults in the Enterprise ChaosHub.
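
Faults are typically parameterized by how long, how often, and how widely they run. The snippet below sketches such a configuration as plain Python data; the keys are descriptive placeholders, not the exact parameters Harness faults use.

```python
# Illustrative configuration for a Kubernetes pod-delete fault.
# Keys are descriptive placeholders, not the exact Harness fault parameters.
pod_delete_fault = {
    "name": "pod-delete",
    "category": "kubernetes",
    "duration_seconds": 60,      # how long the fault stays active
    "interval_seconds": 10,      # how often pods are deleted during that window
    "pods_affected_pct": 25,     # blast radius: only a quarter of replicas at once
    "target": {"namespace": "shop", "label_selector": "app=checkout"},
}
```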

Resilience Probes

Resilience probes are health validation mechanisms that run during chaos experiments to verify your system maintains its steady state:

Probe Modes

  • Continuous Mode: Monitor throughout the experiment's duration
  • Edge Mode: Check at the edges of fault injection (immediately before and after it)
  • OnChaos Mode: Validate only while chaos is being injected

Probe Types

  • HTTP Probe: Validate API endpoints and services
  • Command Probe: Execute custom commands and validate output
  • Prometheus Probe: Query Prometheus metrics
  • Datadog Probe: Query Datadog metrics
  • Dynatrace Probe: Query Dynatrace metrics
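
For the simplest case, the sketch below shows what an HTTP probe check amounts to: call an endpoint, then validate the status code and response time against expected values. It uses only the Python standard library; the URL and thresholds are illustrative.

```python
import time
import urllib.request

def http_probe(url: str, expected_status: int = 200,
               max_response_ms: float = 500.0) -> bool:
    """Return True if the endpoint answers with the expected status fast enough."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            status_ok = resp.status == expected_status
    except OSError:
        return False  # connection errors and timeouts count as probe failures
    elapsed_ms = (time.monotonic() - start) * 1000
    return status_ok and elapsed_ms <= max_response_ms

# In Continuous mode the same check is simply repeated on an interval for the
# whole experiment; in OnChaos mode it runs only while the fault is active.
# print(http_probe("https://shop.example.com/health"))  # hypothetical endpoint
```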

Actions

Actions are custom tasks that execute within experiments:

  • Send notifications to Slack, PagerDuty, or email
  • Trigger webhooks for external integrations
  • Execute custom scripts or commands
  • Add delays between experiment steps
  • Integrate with monitoring and observability tools
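
As an example, the sketch below implements a minimal webhook-style notification action using only the Python standard library. The webhook URL and payload shape are assumptions for illustration, not a specific Slack or PagerDuty API.

```python
import json
import urllib.request

def notify_webhook(webhook_url: str, message: str) -> int:
    """Post a JSON message to a webhook -- a typical notification action."""
    body = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(webhook_url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status

# Example: announce the start of an experiment run (hypothetical URL).
# notify_webhook("https://hooks.example.com/chaos", "Starting pod-delete experiment")
```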

ChaosHub

A ChaosHub is a centralized repository for reusable chaos engineering resources:

Enterprise ChaosHub

  • Provided by Harness with 200+ chaos faults
  • Regularly updated with new faults and improvements
  • Available across all projects and organizations

Custom ChaosHub

  • Create your own hub for organization-specific resources
  • Store custom experiment templates
  • Share probe templates across teams
  • Version control using Git integration

Chaos Infrastructure

Chaos infrastructure represents the target environment where chaos experiments execute:

Deployment Models

  • Agent-Based: Deploy chaos agents on Linux or Windows hosts
  • Agentless: Use Harness Delegate for Kubernetes and cloud resources

Supported Targets

  • Kubernetes clusters (EKS, GKE, AKS, OpenShift)
  • Linux and Windows hosts
  • AWS, GCP, Azure cloud resources
  • VMware infrastructure

Load Testing

Load Testing validates that your system can handle expected and peak traffic while maintaining performance and reliability.

Key Concepts (Coming Soon):

  • Traffic patterns and virtual users
  • Performance metrics (response time, throughput, error rate)
  • Bottleneck identification
  • Auto-scaling validation

Disaster Recovery Testing

Disaster Recovery Testing validates that backup systems, failover mechanisms, and recovery procedures work during catastrophic scenarios.

Key Concepts (Coming Soon):

  • Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
  • Failover mechanisms and backup validation
  • Business continuity planning
  • Disaster recovery procedures
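
As a preview of the first bullet: RTO bounds how long recovery may take, and RPO bounds how much recent data may be lost. The sketch below expresses a drill result as a simple check against both objectives; all values are hypothetical.

```python
from datetime import timedelta

def meets_recovery_objectives(time_to_restore: timedelta, data_loss_window: timedelta,
                              rto: timedelta, rpo: timedelta) -> bool:
    """RTO bounds how long recovery may take; RPO bounds how much data may be lost."""
    return time_to_restore <= rto and data_loss_window <= rpo

# Hypothetical drill: service restored in 25 minutes, last good backup 3 hours old.
print(meets_recovery_objectives(time_to_restore=timedelta(minutes=25),
                                data_loss_window=timedelta(hours=3),
                                rto=timedelta(minutes=30),
                                rpo=timedelta(hours=4)))  # True
```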

Harness Resilience Testing Concepts

These concepts apply across all resilience testing activities in the Harness platform.

Environments

Logical groupings of your infrastructure where tests are executed:

  • Organize resources by purpose (dev, staging, production)
  • Control access and permissions per environment
  • Isolate test execution to specific infrastructure scopes

Governance

Controls and policies to ensure safe, controlled testing:

RBAC (Role-Based Access Control)

Fine-grained permissions that determine who can:

  • Create and modify tests
  • Execute tests on specific infrastructure
  • View results and analytics
  • Manage governance policies

ChaosGuard

ChaosGuard provides advanced governance specifically for chaos experiments:

  • Define when experiments can run (time windows)
  • Specify where experiments can execute (infrastructure scope)
  • Control what faults can be injected (fault restrictions)
  • Set approval requirements for high-risk tests
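
Conceptually, each ChaosGuard condition is a set of constraints an experiment run must satisfy before it starts. The sketch below models that idea in Python; the field names and evaluation logic are illustrative and do not reflect ChaosGuard's actual rule syntax.

```python
from dataclasses import dataclass
from datetime import datetime, time

@dataclass
class GuardRule:
    """Illustrative model of a guard condition for a chaos experiment run."""
    allowed_start: time            # earliest time of day experiments may run
    allowed_end: time              # latest time of day experiments may run
    allowed_infra: set[str]        # infrastructure the rule permits
    allowed_faults: set[str]       # faults the rule permits
    requires_approval: bool = False

    def permits(self, infra: str, fault: str, when: datetime,
                approved: bool = False) -> bool:
        return (self.allowed_start <= when.time() <= self.allowed_end
                and infra in self.allowed_infra
                and fault in self.allowed_faults
                and (approved or not self.requires_approval))

rule = GuardRule(allowed_start=time(9, 0), allowed_end=time(17, 0),
                 allowed_infra={"staging-cluster"}, allowed_faults={"pod-delete"},
                 requires_approval=True)
print(rule.permits("staging-cluster", "pod-delete",
                   datetime(2024, 5, 6, 10, 30), approved=True))  # True
```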

GameDays

GameDays are coordinated resilience testing exercises:

  • Plan and schedule multiple tests together
  • Simulate real incident scenarios
  • Practice team response procedures
  • Validate runbooks and recovery processes

Risks

Automated identification and tracking of system weaknesses:

  • Resilience Risks: Potential failure points in your system
  • Performance Risks: Bottlenecks and capacity issues
  • Compliance Risks: Gaps in meeting RTO/RPO targets

Risks are discovered automatically and can be validated through targeted tests.

Next Steps

Now that you understand the core concepts: