Key Concepts
This guide covers the essential terminology and concepts for Harness Resilience Testing.
Chaos Engineering
Chaos Engineering is the discipline of experimenting on a system to build confidence in its capability to withstand turbulent and unexpected conditions in production.
Core Principles
Steady State Hypothesis
The steady state represents your system's normal operating condition. Before running chaos experiments, you define:
- Measurable system outputs that indicate normal behavior
- Baseline metrics using Service Level Objectives (SLOs)
- Acceptable thresholds for system performance
Example: "Our API should maintain 99.9% availability with response times under 200ms during normal operations."
Blast Radius
The scope of impact a chaos experiment can have on your system. Best practices:
- Start small with non-critical systems or components
- Gradually increase experiment scope as confidence grows
- Use infrastructure controls to limit impact
- Implement automatic rollback mechanisms
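As a rough illustration of "start small," the sketch below caps how many candidate targets a fault may touch. The helper and its parameters are hypothetical, not Harness features.

```python
import math

# Hypothetical helper, not a Harness feature: cap the fraction of
# targets a fault is allowed to affect.
def limit_blast_radius(targets: list, percent: float) -> list:
    """Return at most `percent`% of targets (always at least one)."""
    count = max(1, math.floor(len(targets) * percent / 100))
    return targets[:count]

pods = ["pod-a", "pod-b", "pod-c", "pod-d"]
print(limit_blast_radius(pods, 25))  # ['pod-a'] -- one pod out of four
```

Raising `percent` over successive runs mirrors the practice of gradually widening the experiment scope as confidence grows.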
Hypothesis-Driven Testing
Each chaos experiment follows a scientific approach:
- Identify the steady state and specify SLOs
- Hypothesize what will happen when a fault is injected
- Inject the failure in a controlled manner with minimal blast radius
- Validate whether the system maintains steady state and meets SLOs
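The four steps above can be sketched as a generic loop. All names here are illustrative, and the "fault" is a stub; a real experiment would measure live metrics and inject an actual fault.

```python
# Illustrative hypothesis-driven run; all names are hypothetical.
def run_experiment(measure, inject_fault, hypothesis_holds):
    baseline = measure()              # 1. identify steady state / SLOs
    # 2. hypothesis: the system should stay within SLOs under fault
    inject_fault()                    # 3. inject failure, minimal blast radius
    observed = measure()
    return hypothesis_holds(baseline, observed)  # 4. validate steady state

# Stub run: latency rises under fault but stays within the 200 ms SLO.
latency = {"ms": 150}
result = run_experiment(
    measure=lambda: latency["ms"],
    inject_fault=lambda: latency.update(ms=180),
    hypothesis_holds=lambda base, obs: obs < 200,
)
print(result)  # True: steady state maintained
```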
Key Components
Chaos Experiments
A chaos experiment is a defined sequence of operations that injects faults into a target resource and validates the system's resilience. Each experiment:
- Targets specific infrastructure or application components
- Injects one or more chaos faults in a defined sequence
- Uses probes to validate system behavior
- Can include custom actions for notifications or integrations
- Generates a resilience score based on results
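Harness derives the resilience score from experiment results; the exact formula belongs to the platform, but a weighted average of probe outcomes, as in the sketch below, captures the idea. Probe names and weights are illustrative.

```python
# Illustrative scoring sketch, not Harness's actual formula:
# weighted percentage of probes that passed during the experiment.
def resilience_score(probe_results: dict, weights: dict) -> float:
    total = sum(weights.values())
    passed = sum(w for name, w in weights.items() if probe_results.get(name))
    return round(100 * passed / total, 1)

results = {"http-check": True, "latency-check": True, "error-rate": False}
weights = {"http-check": 1, "latency-check": 2, "error-rate": 2}
print(resilience_score(results, weights))  # 60.0
```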
Chaos Faults
Chaos faults are pre-built failure scenarios that simulate real-world issues:
Fault Categories
- Kubernetes Faults: Pod deletions, container kills, resource stress
- Cloud Platform Faults: AWS, GCP, Azure service disruptions
- Infrastructure Faults: CPU stress, memory exhaustion, network latency, disk pressure
- Application Faults: Service failures, error injection, timeout simulation
Harness provides 200+ ready-to-use chaos faults in the Enterprise ChaosHub.
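For intuition, the categories above can be pictured as a small catalog. The fault names are typical examples of each category (pod-delete, for instance, is a common Kubernetes fault), but the structure is illustrative, not the ChaosHub's internal format.

```python
# Illustrative fault catalog keyed by the categories above; not the
# ChaosHub's internal representation. Fault names are typical examples.
FAULT_CATALOG = {
    "kubernetes": ["pod-delete", "container-kill", "pod-cpu-hog"],
    "cloud": ["aws-ec2-stop", "gcp-vm-stop", "azure-vm-stop"],
    "infrastructure": ["cpu-stress", "memory-hog", "network-latency", "disk-fill"],
    "application": ["service-failure", "error-injection", "timeout"],
}

def faults_in(category: str) -> list:
    return FAULT_CATALOG.get(category, [])

print(faults_in("kubernetes"))  # ['pod-delete', 'container-kill', 'pod-cpu-hog']
```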
Resilience Probes
Resilience probes are health validation mechanisms that run during chaos experiments to verify your system maintains its steady state:
Probe Modes
- Continuous Mode: Monitor throughout experiment duration
- Edge Mode: Check immediately before and after fault injection
- OnChaos Mode: Validate only during fault injection
Probe Types
- HTTP Probe: Validate API endpoints and services
- Command Probe: Execute custom commands and validate output
- Prometheus Probe: Query Prometheus metrics
- Datadog Probe: Query Datadog metrics
- Dynatrace Probe: Query Dynatrace metrics
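A command probe, for example, runs a command and validates its output against an expected value. The sketch below illustrates the idea in plain Python; it is not the Harness probe implementation.

```python
import subprocess

# Illustrative command-probe sketch, not the Harness implementation:
# run a shell command and compare its output to an expected value.
def command_probe(cmd: str, expected: str) -> bool:
    out = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return out.returncode == 0 and out.stdout.strip() == expected

print(command_probe("echo healthy", "healthy"))  # True
print(command_probe("echo healthy", "down"))     # False
```

An HTTP probe follows the same pattern, except the "command" is a request to an API endpoint and the validation checks the status code or response body.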
Actions
Actions are custom tasks that execute within experiments:
- Send notifications to Slack, PagerDuty, or email
- Trigger webhooks for external integrations
- Execute custom scripts or commands
- Add delays between experiment steps
- Integrate with monitoring and observability tools
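As one example, a notification action might build a webhook payload. Slack's incoming webhooks accept a JSON body with a `text` field; the function below is an illustrative sketch that builds such a payload without sending anything, and the experiment name is made up.

```python
import json

# Illustrative notification action: build (but do not send) a Slack-style
# webhook payload. Slack incoming webhooks accept {"text": "..."} JSON.
def slack_notification(experiment: str, score: float) -> str:
    message = f"Chaos experiment '{experiment}' finished with resilience score {score}"
    return json.dumps({"text": message})

payload = slack_notification("pod-delete-checkout", 87.5)
print(payload)
```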
ChaosHub
A ChaosHub is a centralized repository for reusable chaos engineering resources:
Enterprise ChaosHub
- Provided by Harness with 200+ chaos faults
- Regularly updated with new faults and improvements
- Available across all projects and organizations
Custom ChaosHub
- Create your own hub for organization-specific resources
- Store custom experiment templates
- Share probe templates across teams
- Version control using Git integration
Chaos Infrastructure
Chaos infrastructure represents the target environment where chaos experiments execute:
Deployment Models
- Agent-Based: Deploy chaos agents on Linux or Windows hosts
- Agentless: Use Harness Delegate for Kubernetes and cloud resources
Supported Targets
- Kubernetes clusters (EKS, GKE, AKS, OpenShift)
- Linux and Windows hosts
- AWS, GCP, Azure cloud resources
- VMware infrastructure
Load Testing
Load Testing validates that your system can handle expected and peak traffic while maintaining performance and reliability.
Key Concepts (Coming Soon):
- Traffic patterns and virtual users
- Performance metrics (response time, throughput, error rate)
- Bottleneck identification
- Auto-scaling validation
Disaster Recovery Testing
Disaster Recovery Testing validates that backup systems, failover mechanisms, and recovery procedures work during catastrophic scenarios.
Key Concepts (Coming Soon):
- Recovery Time Objective (RTO) and Recovery Point Objective (RPO)
- Failover mechanisms and backup validation
- Business continuity planning
- Disaster recovery procedures
Harness Resilience Testing Concepts
These concepts apply across all resilience testing activities in the Harness platform.
Environments
Logical groupings of your infrastructure where tests are executed:
- Organize resources by purpose (dev, staging, production)
- Control access and permissions per environment
- Isolate test execution to specific infrastructure scopes
Governance
Controls and policies to ensure safe, controlled testing:
RBAC (Role-Based Access Control)
Fine-grained permissions that control who can:
- Create and modify tests
- Execute tests on specific infrastructure
- View results and analytics
- Manage governance policies
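A role-to-permission check can be pictured as in the sketch below. The role names and permission strings are made up for illustration and are not Harness's actual RBAC model.

```python
# Illustrative RBAC sketch; role and permission names are hypothetical.
ROLE_PERMISSIONS = {
    "chaos-admin": {"create_test", "execute_test", "view_results", "manage_policies"},
    "chaos-engineer": {"create_test", "execute_test", "view_results"},
    "viewer": {"view_results"},
}

def is_allowed(role: str, permission: str) -> bool:
    return permission in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("viewer", "execute_test"))          # False
print(is_allowed("chaos-engineer", "execute_test"))  # True
```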
ChaosGuard
ChaosGuard provides advanced governance specifically for chaos experiments:
- Define when experiments can run (time windows)
- Specify where experiments can execute (infrastructure scope)
- Control what faults can be injected (fault restrictions)
- Set approval requirements for high-risk tests
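The time-window and fault-restriction conditions can be sketched as a single guard check. The field names and rule shape below are illustrative, not ChaosGuard's actual rule syntax.

```python
from datetime import time

# Illustrative guard check; not ChaosGuard's actual rule syntax.
def experiment_allowed(now: time, window: tuple, fault: str, allowed_faults: set) -> bool:
    """Allow only permitted faults inside the approved time window."""
    start, end = window
    return start <= now <= end and fault in allowed_faults

business_hours = (time(9, 0), time(17, 0))
print(experiment_allowed(time(10, 30), business_hours, "pod-delete", {"pod-delete"}))  # True
print(experiment_allowed(time(22, 0), business_hours, "pod-delete", {"pod-delete"}))   # False
```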
GameDays
GameDays are coordinated resilience testing exercises:
- Plan and schedule multiple tests together
- Simulate real incident scenarios
- Practice team response procedures
- Validate runbooks and recovery processes
Risks
Automated identification and tracking of system weaknesses:
- Resilience Risks: Potential failure points in your system
- Performance Risks: Bottlenecks and capacity issues
- Compliance Risks: Gaps in meeting RTO/RPO targets
Risks are discovered automatically and can be validated through targeted tests.
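A discovered risk could be tracked as a small record like the sketch below; the fields, category strings, and example risk are illustrative only, not the platform's data model.

```python
from dataclasses import dataclass

# Illustrative risk record; fields and categories are made up for this sketch.
@dataclass
class Risk:
    name: str
    category: str            # "resilience" | "performance" | "compliance"
    validated: bool = False  # flipped once a targeted test confirms the finding

    def validate(self) -> None:
        self.validated = True

risk = Risk("checkout service runs a single replica", "resilience")
risk.validate()
print(risk.validated)  # True
```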
Next Steps
Now that you understand the core concepts:
- Architecture: Learn about the control plane and execution plane architecture
- Get Started with Chaos Testing: Run your first chaos experiment
- Explore Chaos Faults: Browse 200+ ready-to-use fault scenarios
- Set Up Governance: Configure RBAC and ChaosGuard for safe testing