Comparison of EC2 Chaos Approach for Kubernetes versus Linux

This topic compares two approaches to EC2 chaos injection: the Kubernetes agent with SSM, and the native Linux agent.

Area	Kubernetes agent driven EC2 chaos	Native Linux agent driven EC2 chaos
Install Prerequisites/Agent Setup	Installing the agent requires the user to be a cluster-admin or to be mapped to a cluster role with the required permissions. The SSM Agent must be installed on the target EC2 instance(s) (it runs with sudo by default). The default SSM IAM role must be attached to the target EC2 instance(s). To carry out the chaos injection, either create a secret with the account user credentials or map an appropriate IAM role reference/ARN to the chaos ServiceAccount.	Console access to the machine as root/sudo, or the ability to inject processes remotely over SSH as root/sudo.
Installed Components	The K8s agent comprises the following stateless deployments in a dedicated namespace: subscriber, wf-controller, chaos-operator, and exporter, along with some secrets and ConfigMaps.	The native Linux chaos agent comprises a systemd-based service (configured with a post hook). The agent config, logs, and cron configuration are stored in dedicated, predefined paths.
Dependencies (a combination of upstream Linux and Harness utilities required for chaos injection)	Installed just-in-time by the experiment, or placed on the machine beforehand (for disconnected setups): tc, stress-ng, jq, iproute2, tproxy, dns-interceptor.	Installed as part of the agent installation process: tc, stress-ng, jq, iproute2, tproxy, byteman.
Network Connectivity	From the chaos agent: Outbound over port 443 to Harness from the Kubernetes cluster. Outbound over port 443 to cloud account resource endpoints from the Kubernetes cluster. Outbound to application health endpoints (the ones used for resilience validation) from the Kubernetes cluster. From the EC2 instance: Outbound over port 443 to package repo/Harness S3 endpoints to pull dependencies (in connected mode).	Outbound over port 443 to Harness from the VM. Outbound over port 443 to package repo/Harness S3 endpoints to pull dependencies (in connected mode). Outbound to application health endpoints (the ones used for resilience validation) from the VM.
Lifecycle Management	Availability: Tracked via heartbeat. Can be scaled down to 0 replicas when idle. Upgrade: Automatic and manual upgrades are supported. Note: Automated upgrades are supported only for Kubernetes manifest-based installations; Helm bundle upgrades are manual/offline. Uninstall/Deletion: The "Disconnect" operation from the control plane removes the subscriber and the configs/secrets involved in auth.	Availability: Tracked via heartbeat. The service can be stopped when idle. Upgrade: Only manual upgrades are supported. Uninstall/Deletion: Performed via an offline uninstaller utility.
Permissions/Access for Chaos Injection	Depends on the nature of the fault. A master policy for EC2 faults covers all supported faults on EC2.	Experiments run as the root user.
Chaos Experiment Execution	Max Execution Time: Chaos Duration + Probe Validation Timeout + [~60-120s] (relatively higher). Note: Involves generation of K8s events and creation of transient pods to carry out the fault business logic, which can add to the overall execution time. Parallel Fault Support Within Experiment: Yes. Multi-Infra Support Within Experiment: No. Support for HTTP Probes: Yes. Support for Command Probes in Source Mode (custom validation via user-defined container images): Yes.	Max Execution Time: Chaos Duration + Probe Validation Timeout (relatively lower). Parallel Fault Support Within Experiment: Yes. Multi-Infra Support Within Experiment: Yes. Support for HTTP Probes: Yes. Support for Command Probes in Source Mode (custom validation via user-defined container images): No.
Execution Control	Abort Support: Yes. Internally invokes cancellation of the SSM command (which in turn is a bash script). However, there is some risk of continued operations, as highlighted by AWS. SSM Agent Crash: Recovery depends on AWS-native mechanisms.	Abort Support: Yes. An abort-watcher ensures graceful cancellation of the chaos process. Chaos Agent Crash: The agent service is configured with the right hooks (ExecStart/Stop), which remove all residual chaos on the system as a safety measure.
Logs	Logs reflect the success of the SSM commands; stdout/stderr must be fetched explicitly.	Custom logs tracking each stage of the fault injection are available.
OS-Specific Fault Coverage	Not available	Available
Custom Chaos Support (SSH, Load)	Available	Not available
APM Integrations for Probes	Supports Prometheus, Dynatrace, Datadog, NewRelic out-of-the-box.	Dynatrace and Datadog supported out-of-the-box. Others can be implemented using custom/command probes.
Harness Chaos Management Feature Support (Cron, ChaosGuard, Gamedays, CD Integration)	Available	Gameday support available
Agent Reuse for Managed Service Chaos	Supported	Not available
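
The execution-time formulas in the Chaos Experiment Execution row can be worked through as a small calculation. The following is a minimal sketch, not Harness code: the function name `estimate_max_execution_seconds`, the agent labels "kubernetes" and "linux", and the example durations are hypothetical, and the 60-120 second overhead is only the approximate range quoted in the table.

```python
from typing import Tuple

# Approximate extra time (seconds) the table attributes to the Kubernetes
# agent path (K8s events, transient pods). Taken from the "~60-120s" figure.
K8S_OVERHEAD_RANGE_S = (60, 120)


def estimate_max_execution_seconds(
    chaos_duration_s: int,
    probe_timeout_s: int,
    agent: str,
) -> Tuple[int, int]:
    """Return a (low, high) estimate of the maximum execution time in seconds.

    `agent` is a hypothetical label: "kubernetes" or "linux".
    """
    if chaos_duration_s < 0 or probe_timeout_s < 0:
        raise ValueError("durations must be non-negative")

    base = chaos_duration_s + probe_timeout_s
    if agent == "kubernetes":
        low, high = K8S_OVERHEAD_RANGE_S
        return base + low, base + high
    if agent == "linux":
        # The table lists no additional overhead term for the native Linux agent.
        return base, base
    raise ValueError(f"unknown agent: {agent!r}")


if __name__ == "__main__":
    # Illustrative inputs only: 300 s of chaos, 30 s probe validation timeout.
    print(estimate_max_execution_seconds(300, 30, "kubernetes"))
    print(estimate_max_execution_seconds(300, 30, "linux"))
```

For the same fault and probe settings, the Kubernetes path differs from the native Linux path only by the overhead term, which is why the table describes it as relatively higher.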