
Comparison of EC2 Chaos Approach for Kubernetes versus Linux

This topic compares two approaches to EC2 chaos injection: the Kubernetes agent (with SSM) and the native Linux agent.

Each area below is described for Kubernetes agent-driven EC2 chaos (Kubernetes agent) and for native Linux agent-driven EC2 chaos (Native Linux agent).
Install Prerequisites/Agent Setup
Kubernetes agent:
  1. Installing the agent requires the user to be a cluster-admin or to be mapped to a cluster role with the required permissions.
  2. The SSM Agent must be installed on the target EC2 instance(s); it runs with sudo by default.
  3. The default SSM IAM role must be attached to the target EC2 instance(s).
  4. Either create a secret with account user credentials or map an appropriate IAM role reference/ARN to the chaos ServiceAccount to carry out the chaos injection.
Native Linux agent:
  • Console access to the machine as root/sudo, or the ability to inject processes remotely over SSH as root/sudo.
Installed Components
Kubernetes agent: The K8s agent comprises the following stateless deployments in a dedicated namespace: subscriber, wf-controller, chaos-operator, and exporter, along with some secrets and ConfigMaps.
Native Linux agent: The native Linux chaos agent comprises a systemd-based service (configured with post hook). The agent config, logs, and cron configuration are stored in dedicated, predefined paths.
Dependencies (a combination of upstream Linux and Harness utilities required for chaos injection)
Kubernetes agent:
  • Can be installed just-in-time by the experiment or placed on the machine beforehand (for disconnected setups).
  • tc, stress-ng, jq, iproute2, tproxy, dns-interceptor
Native Linux agent:
  • Installed as part of the agent installation process.
  • tc, stress-ng, jq, iproute2, tproxy, byteman
Network Connectivity
Kubernetes agent:
  From the chaos agent:
  1. Outbound over port 443 to Harness from the Kubernetes cluster.
  2. Outbound over port 443 to cloud account resource endpoints from the Kubernetes cluster.
  3. Outbound to application health endpoints (the ones used for resilience validation) from the Kubernetes cluster.
  From the EC2 instance:
  4. Outbound over port 443 to package repo/Harness S3 endpoints to pull dependencies (in connected mode).
Native Linux agent:
  1. Outbound over port 443 to Harness from the VM.
  2. Outbound over port 443 to package repo/Harness S3 endpoints to pull dependencies (in connected mode).
  3. Outbound to application health endpoints (the ones used for resilience validation) from the VM.
Lifecycle Management
Kubernetes agent:
  • Availability: Tracked via heartbeat. Can be scaled down to 0 replicas under idle conditions.
  • Upgrade: Automatic and manual upgrades are supported.
  • Note: Automated upgrades are available only via Kubernetes manifests. Helm bundle upgrades are manual/offline.
  • Uninstall/Deletion: The "Disconnect" operation from the control plane removes the subscriber and the configs/secrets involved in auth.
Native Linux agent:
  • Availability: Tracked via heartbeat. The service can be stopped under idle conditions.
  • Upgrade: Only manual upgrades are supported.
  • Uninstall/Deletion: Performed via an offline uninstaller utility.
Permissions/Access for Chaos Injection
Kubernetes agent: Depends upon the nature of the fault. A master policy for EC2 faults covers all supported faults on EC2.
Native Linux agent: Run experiments as the root user.
Chaos Experiment Execution
Kubernetes agent:
  • Max Execution Time: Chaos Duration + Probe Validation Timeout + [~60-120s] (relatively higher)
  • Note: Involves generating K8s events and creating transient pods to carry out the fault business logic, which can add to the overall execution time.
  • Parallel Fault Support Within Experiment: Yes
  • Multi-Infra Support Within Experiment: No
  • Support for HTTP Probes: Yes
  • Support for Command Probes in Source Mode (custom validation via user-defined container images): Yes
Native Linux agent:
  • Max Execution Time: Chaos Duration + Probe Validation Timeout (relatively lower)
  • Parallel Fault Support Within Experiment: Yes
  • Multi-Infra Support Within Experiment: Yes
  • Support for HTTP Probes: Yes
  • Support for Command Probes in Source Mode (custom validation via user-defined container images): No
Execution Control
Kubernetes agent:
  • Abort Support: Yes. Internally invokes cancellation of the SSM command (which in turn is a bash script). However, AWS notes some risk that operations continue after cancellation.
  • SSM Agent Crash: Recovery depends on AWS-native mechanisms.
Native Linux agent:
  • Abort Support: Yes. An abort-watcher ensures graceful cancellation of the chaos process.
  • Chaos Agent Crash: The agent service is configured with hooks (ExecStart/Stop) that remove all residual chaos from the system as a safety measure.
Logs
Kubernetes agent: Logs are based on the success of the SSM commands; stdout/stderr must be fetched explicitly.
Native Linux agent: Custom logs tracking each stage of the fault injection are available.
OS-Specific Fault Coverage
Kubernetes agent: Not available
Native Linux agent: Available
Custom Chaos Support (SSH, Load)
Kubernetes agent: Available
Native Linux agent: Not available
APM Integrations for Probes
Kubernetes agent: Supports Prometheus, Dynatrace, Datadog, and NewRelic out-of-the-box.
Native Linux agent: Dynatrace and Datadog are supported out-of-the-box. Others can be implemented using custom/command probes.
Harness Chaos Management Feature Support (Cron, ChaosGuard, Gamedays, CD Integration)
Kubernetes agent: Available
Native Linux agent: Gameday support available
Agent Reuse for Managed Service Chaos
Kubernetes agent: Supported
Native Linux agent: Not available
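
The maximum execution times listed under Chaos Experiment Execution come down to simple arithmetic, which the following minimal sketch illustrates. It is not part of any Harness API: the function name, the parameter names, the "kubernetes"/"linux" labels, and the treatment of the ~60-120s Kubernetes overhead as a fixed range are all assumptions made for illustration.

```python
from typing import Tuple

# Approximate extra time for the Kubernetes agent (K8s events, transient pods),
# taken from the "~60-120s" figure in the comparison above.
K8S_OVERHEAD_RANGE_S: Tuple[int, int] = (60, 120)


def max_execution_time(chaos_duration_s: int,
                       probe_timeout_s: int,
                       agent: str) -> Tuple[int, int]:
    """Return a (low, high) estimate of the max execution time in seconds.

    `agent` is a hypothetical label: "kubernetes" or "linux".
    """
    if chaos_duration_s < 0 or probe_timeout_s < 0:
        raise ValueError("durations must be non-negative")
    base = chaos_duration_s + probe_timeout_s
    if agent == "kubernetes":
        low, high = K8S_OVERHEAD_RANGE_S
        return base + low, base + high
    if agent == "linux":
        # No additional overhead term is listed for the native Linux agent.
        return base, base
    raise ValueError(f"unknown agent type: {agent!r}")


if __name__ == "__main__":
    # Example: a 120s fault with a 30s probe validation timeout.
    print(max_execution_time(120, 30, "kubernetes"))  # (210, 270)
    print(max_execution_time(120, 30, "linux"))       # (150, 150)
```

For the same fault and probe settings, the Kubernetes estimate is a range because of the variable overhead, while the native Linux estimate collapses to a single value.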