
Key Concepts

This guide covers the essential terminology and concepts for Harness Resilience Testing.

Chaos Engineering

Chaos Engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent and unexpected conditions in production.

Core Principles

Steady State Hypothesis

The steady state represents your system's normal operating condition. Before running chaos experiments, you define:

  • Measurable system outputs that indicate normal behavior
  • Baseline metrics using Service Level Objectives (SLOs)
  • Acceptable thresholds for system performance

Example: "Our API should maintain 99.9% availability with response times under 200ms during normal operations."

Blast Radius

The blast radius is the scope of impact a chaos experiment can have on your system. Best practices:

  • Start small with non-critical systems or components
  • Gradually increase experiment scope as confidence grows
  • Use infrastructure controls to limit impact
  • Implement automatic rollback mechanisms
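
One common way to keep the blast radius small is to cap the fraction of instances a fault may touch. The sketch below illustrates that pattern; the function and parameter names are hypothetical, not Harness syntax.

```python
import math
import random

def select_targets(instances: list[str], blast_radius_pct: float) -> list[str]:
    """Pick a random subset of instances, capped at blast_radius_pct percent.

    Starting with a small percentage (e.g. 10%) on non-critical components
    keeps the impact of a failed hypothesis contained.
    """
    count = max(1, math.floor(len(instances) * blast_radius_pct / 100))
    return random.sample(instances, count)

pods = [f"checkout-{i}" for i in range(20)]
print(select_targets(pods, blast_radius_pct=10))  # e.g. ['checkout-7', 'checkout-3']
```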

Hypothesis-Driven Testing

Each chaos experiment follows a scientific approach:

  1. Identify the steady state and specify SLOs
  2. Hypothesize what will happen when a fault is injected
  3. Inject the failure in a controlled manner with minimal blast radius
  4. Validate whether the system maintains steady state and meets SLOs
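
Read as control flow, the four steps look roughly like the sketch below. The three callables (`check_steady_state`, `inject_fault`, `rollback_fault`) are hypothetical stand-ins for your monitoring and fault-injection tooling.

```python
def run_experiment(check_steady_state, inject_fault, rollback_fault) -> str:
    """Hypothesis-driven chaos experiment, expressed as plain control flow."""
    # 1. Identify the steady state: confirm the SLOs hold before injecting anything.
    if not check_steady_state():
        return "aborted: system was not in steady state to begin with"

    # 2. Hypothesis: the system will keep meeting its SLOs while the fault is active.
    # 3. Inject the failure in a controlled manner (small blast radius, reversible).
    inject_fault()
    try:
        # 4. Validate: does the steady state still hold under fault?
        result = "passed" if check_steady_state() else "failed: resilience gap found"
    finally:
        rollback_fault()  # always restore the system, even if validation errors out
    return result
```

The try/finally mirrors the automatic-rollback best practice above: the fault is reverted even if validation itself fails.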

Key Components

Chaos Experiments

A chaos experiment is a set of operations, executed together, that inject faults into a target resource and validate the system's resilience. Each experiment:

  • Targets specific infrastructure or application components
  • Injects one or more chaos faults in a defined sequence
  • Uses probes to validate system behavior
  • Can include custom actions for notifications or integrations
  • Generates a resilience score based on results
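
As an illustration of that structure, the sketch below models an experiment as plain Python data, with the resilience score computed as the percentage of probes that passed (one simple way such a score can be derived). The field names are assumptions for clarity, not the actual Harness experiment manifest.

```python
from dataclasses import dataclass, field

@dataclass
class Probe:
    name: str
    passed: bool = False

@dataclass
class ChaosExperiment:
    """Illustrative model of an experiment: target, faults, probes, actions."""
    target: str                       # e.g. a Kubernetes deployment
    faults: list[str]                 # faults injected in a defined sequence
    probes: list[Probe]               # health validations run alongside the faults
    actions: list[str] = field(default_factory=list)  # notifications, webhooks, delays

    def resilience_score(self) -> float:
        """Percentage of probes that passed -- one simple way to score a run."""
        if not self.probes:
            return 0.0
        return 100.0 * sum(p.passed for p in self.probes) / len(self.probes)

run = ChaosExperiment(
    target="deployment/checkout",
    faults=["pod-delete", "network-latency"],
    probes=[Probe("api-availability", passed=True), Probe("p95-latency", passed=False)],
)
print(run.resilience_score())  # 50.0
```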

Chaos Faults

Chaos faults are pre-built failure scenarios that simulate real-world issues:

Fault Categories

  • Kubernetes Faults: Pod deletions, container kills, resource stress
  • Cloud Platform Faults: AWS, GCP, Azure service disruptions
  • Infrastructure Faults: CPU stress, memory exhaustion, network latency, disk pressure
  • Application Faults: Service failures, error injection, timeout simulation

Harness provides 200+ ready-to-use chaos faults in the Enterprise ChaosHub.
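
Faults are typically parameterized by how long, how often, and how widely they run. The snippet below sketches such a configuration as plain Python data; the keys are descriptive placeholders, not the exact parameters Harness faults use.

```python
# Illustrative configuration for a Kubernetes pod-delete fault.
# Keys are descriptive placeholders, not the exact Harness fault parameters.
pod_delete_fault = {
    "name": "pod-delete",
    "category": "kubernetes",
    "duration_seconds": 60,      # how long the fault stays active
    "interval_seconds": 10,      # how often pods are deleted during that window
    "pods_affected_pct": 25,     # blast radius: only a quarter of replicas at once
    "target": {"namespace": "shop", "label_selector": "app=checkout"},
}
```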

Resilience Probes

Resilience probes are health validation mechanisms that run during chaos experiments to verify your system maintains its steady state:

Probe Modes

  • Continuous Mode: Monitor throughout the experiment's duration
  • Edge Mode: Check at the edges of fault injection (immediately before and after it)
  • OnChaos Mode: Validate only while chaos is being injected

Probe Types

  • HTTP Probe: Validate API endpoints and services
  • Command Probe: Execute custom commands and validate output
  • Prometheus Probe: Query Prometheus metrics
  • Datadog Probe: Query Datadog metrics
  • Dynatrace Probe: Query Dynatrace metrics
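
For the simplest case, the sketch below shows what an HTTP probe check amounts to: call an endpoint, then validate the status code and response time against expected values. It uses only the Python standard library; the URL and thresholds are illustrative.

```python
import time
import urllib.request

def http_probe(url: str, expected_status: int = 200,
               max_response_ms: float = 500.0) -> bool:
    """Return True if the endpoint answers with the expected status fast enough."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            status_ok = resp.status == expected_status
    except OSError:
        return False  # connection errors and timeouts count as probe failures
    elapsed_ms = (time.monotonic() - start) * 1000
    return status_ok and elapsed_ms <= max_response_ms

# In Continuous mode the same check is simply repeated on an interval for the
# whole experiment; in OnChaos mode it runs only while the fault is active.
# print(http_probe("https://shop.example.com/health"))  # hypothetical endpoint
```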

Actions

Actions are custom tasks that execute within experiments:

  • Send notifications to Slack, PagerDuty, or email
  • Trigger webhooks for external integrations
  • Execute custom scripts or commands
  • Add delays between experiment steps
  • Integrate with monitoring and observability tools
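
As an example, the sketch below implements a minimal webhook-style notification action using only the Python standard library. The webhook URL and payload shape are assumptions for illustration, not a specific Slack or PagerDuty API.

```python
import json
import urllib.request

def notify_webhook(webhook_url: str, message: str) -> int:
    """Post a JSON message to a webhook -- a typical notification action."""
    body = json.dumps({"text": message}).encode("utf-8")
    req = urllib.request.Request(webhook_url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=5) as resp:
        return resp.status

# Example: announce the start of an experiment run (hypothetical URL).
# notify_webhook("https://hooks.example.com/chaos", "Starting pod-delete experiment")
```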

ChaosHub

A ChaosHub is a centralized repository for reusable chaos engineering resources:

Enterprise ChaosHub

  • Provided by Harness with 200+ chaos faults
  • Regularly updated with new faults and improvements
  • Available across all projects and organizations

Custom ChaosHub

  • Create your own hub for organization-specific resources
  • Store custom experiment templates
  • Share probe templates across teams
  • Version control using Git integration

Chaos Infrastructure

Chaos infrastructure represents the target environment where chaos experiments execute:

Deployment Models

  • Agent-Based: Deploy chaos agents on Linux or Windows hosts
  • Agentless: Use Harness Delegate for Kubernetes and cloud resources

Supported Targets

  • Kubernetes clusters (EKS, GKE, AKS, OpenShift)
  • Linux and Windows hosts
  • AWS, GCP, Azure cloud resources
  • VMware infrastructure

Load Testing

Load Testing validates that your system can handle expected and peak traffic while maintaining performance and reliability.

Key Concepts (Coming Soon):

  • Traffic patterns and virtual users
  • Performance metrics (response time, throughput, error rate)
  • Bottleneck identification
  • Auto-scaling validation

Disaster Recovery Testing

Disaster Recovery Testing validates that backup systems, failover mechanisms, and recovery procedures work during catastrophic scenarios.

Key Concepts (Coming Soon):

  • Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
  • Failover mechanisms and backup validation
  • Business continuity planning
  • Disaster recovery procedures
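
As a preview of the first bullet: RTO bounds how long recovery may take, and RPO bounds how much recent data may be lost. The sketch below expresses a drill result as a simple check against both objectives; all values are hypothetical.

```python
from datetime import timedelta

def meets_recovery_objectives(time_to_restore: timedelta, data_loss_window: timedelta,
                              rto: timedelta, rpo: timedelta) -> bool:
    """RTO bounds how long recovery may take; RPO bounds how much data may be lost."""
    return time_to_restore <= rto and data_loss_window <= rpo

# Hypothetical drill: service restored in 25 minutes, last good backup 3 hours old.
print(meets_recovery_objectives(time_to_restore=timedelta(minutes=25),
                                data_loss_window=timedelta(hours=3),
                                rto=timedelta(minutes=30),
                                rpo=timedelta(hours=4)))  # True
```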

Harness Resilience Testing Concepts

These concepts apply across all resilience testing activities in the Harness platform.

Environments

Logical groupings of your infrastructure where tests are executed:

  • Organize resources by purpose (dev, staging, production)
  • Control access and permissions per environment
  • Isolate test execution to specific infrastructure scopes

Governance

Controls and policies to ensure safe, controlled testing:

RBAC (Role-Based Access Control)

Fine-grained permissions that determine who can:

  • Create and modify tests
  • Execute tests on specific infrastructure
  • View results and analytics
  • Manage governance policies

ChaosGuard

ChaosGuard provides advanced governance specifically for chaos experiments:

  • Define when experiments can run (time windows)
  • Specify where experiments can execute (infrastructure scope)
  • Control what faults can be injected (fault restrictions)
  • Set approval requirements for high-risk tests
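
Conceptually, each ChaosGuard condition is a set of constraints an experiment run must satisfy before it starts. The sketch below models that idea in Python; the field names and evaluation logic are illustrative and do not reflect ChaosGuard's actual rule syntax.

```python
from dataclasses import dataclass
from datetime import datetime, time

@dataclass
class GuardRule:
    """Illustrative model of a guard condition for a chaos experiment run."""
    allowed_start: time            # earliest time of day experiments may run
    allowed_end: time              # latest time of day experiments may run
    allowed_infra: set[str]        # infrastructure the rule permits
    allowed_faults: set[str]       # faults the rule permits
    requires_approval: bool = False

    def permits(self, infra: str, fault: str, when: datetime,
                approved: bool = False) -> bool:
        return (self.allowed_start <= when.time() <= self.allowed_end
                and infra in self.allowed_infra
                and fault in self.allowed_faults
                and (approved or not self.requires_approval))

rule = GuardRule(allowed_start=time(9, 0), allowed_end=time(17, 0),
                 allowed_infra={"staging-cluster"}, allowed_faults={"pod-delete"},
                 requires_approval=True)
print(rule.permits("staging-cluster", "pod-delete",
                   datetime(2024, 5, 6, 10, 30), approved=True))  # True
```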

GameDays

GameDays are coordinated resilience testing exercises:

  • Plan and schedule multiple tests together
  • Simulate real incident scenarios
  • Practice team response procedures
  • Validate runbooks and recovery processes

Risks

Automated identification and tracking of system weaknesses:

  • Resilience Risks: Potential failure points in your system
  • Performance Risks: Bottlenecks and capacity issues
  • Compliance Risks: Gaps in meeting RTO/RPO targets

Risks are discovered automatically and can be validated through targeted tests.

Next Steps

Now that you understand the core concepts: