EC2 process kill

EC2 process kill is an AWS chaos fault that kills one or more processes, identified by PID, inside a target EC2 instance for a configurable duration. The host stays running; only the listed processes are terminated. The fault dispatches the kill through AWS Systems Manager (SSM) Run Command, so the target instance must have the SSM Agent installed and an instance profile whose IAM role lets the agent communicate with Systems Manager.

Use this fault to test how the workload reacts when a critical process disappears: does a supervisor (systemd, container runtime, custom watchdog) restart it cleanly, does the application surface the failure to its callers, and how long until traffic recovers?

Run your first experiment

If you have not configured the chaos infrastructure yet, go to Quickstart to install the chaos infrastructure and run an experiment end to end.


Use cases

Run this fault when you want to answer concrete questions like:

  • Supervisor recovery: When a managed process (systemd unit, container, custom worker) is killed, does the supervisor restart it within the expected window?
  • Crash semantics vs graceful shutdown: When you kill a process with SIGKILL (FORCE=true) instead of SIGTERM (FORCE=false), does the application recover identically, or does state corruption surface?
  • Liveness probe correctness: For containerized workloads, do liveness probes detect the failure and trigger a restart at the correct cadence?
  • Caller-side resilience: When the process is unavailable, do upstream callers retry, fall back, or fail gracefully?
  • Observability coverage: Does the kill surface in logs, traces, and alerts with enough context to drive a runbook?

Prerequisites

  • Kubernetes version: 1.21 or later for the chaos infrastructure cluster.
  • Target instance is reachable via SSM: The instance has the SSM Agent running and an instance profile with the AmazonSSMManagedInstanceCore policy (or equivalent). Confirm with aws ssm describe-instance-information --filters "Key=InstanceIds,Values=<id>".
  • Selector provided: Either EC2_INSTANCE_ID or EC2_INSTANCE_TAG is set.
  • Processes identified: PROCESS_IDS lists the PIDs to kill on the target instance.
  • AWS credentials available: Either an AWS credentials file uploaded as a File Secret in Harness Secret Manager (see Authentication below) or IRSA on the chaos infrastructure service account.
  • IAM permissions granted: The credentials or role include the SSM and EC2 permissions listed below.
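The SSM reachability prerequisite can be checked before the experiment starts. The following is a minimal sketch, assuming the AWS CLI is installed and configured; `ssm_ping_status` and `ssm_is_online` are hypothetical helper names, not part of the fault. It reads the `PingStatus` field that `describe-instance-information` returns for a registered instance.

```shell
# Hypothetical helpers (not part of the fault): pre-flight check that the
# target instance is registered with Systems Manager and its agent is online.

# Print the PingStatus reported for the instance. With --output text the CLI
# prints "None" when the instance is not registered at all.
ssm_ping_status() {
  local region="$1" instance_id="$2"
  aws ssm describe-instance-information \
    --region "$region" \
    --filters "Key=InstanceIds,Values=${instance_id}" \
    --query 'InstanceInformationList[0].PingStatus' \
    --output text
}

# Succeed only when the agent reports "Online".
ssm_is_online() {
  [ "$(ssm_ping_status "$1" "$2")" = "Online" ]
}
```

Example usage: `ssm_is_online <region> <id> && echo "ready for chaos"`. Any other status, such as `ConnectionLost` or `None`, suggests the agent or the instance profile needs attention first.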

Supported environments

| Platform | Support status |
| --- | --- |
| Amazon EC2 (Linux instances with SSM Agent) | Supported |
| Amazon EC2 (Windows instances with SSM Agent) | Supported |
| Amazon EKS managed worker nodes | Supported (if SSM Agent is installed) |
| Amazon EKS self-managed worker nodes | Supported (if SSM Agent is installed) |
| AWS regions | Supported in every commercial region |
| Targeting by tag | Supported via EC2_INSTANCE_TAG |
| Targeting by ID | Supported via EC2_INSTANCE_ID |

Permissions required

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeInstances",
        "ec2:DescribeInstanceStatus"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "ssm:SendCommand",
        "ssm:CancelCommand",
        "ssm:GetCommandInvocation",
        "ssm:DescribeInstanceInformation",
        "ssm:GetDocument",
        "ssm:DescribeDocument"
      ],
      "Resource": "*"
    }
  ]
}

  • ec2:DescribeInstances / ec2:DescribeInstanceStatus resolve the target instance and confirm reachability.
  • ssm:SendCommand and ssm:GetCommandInvocation send the kill command and read its result.
  • ssm:CancelCommand is used to roll back if the experiment is aborted mid-flight.
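  • ssm:DescribeInstanceInformation confirms the SSM Agent is online before the fault starts.
  • ssm:GetDocument and ssm:DescribeDocument grant read access to the content and metadata of SSM documents, such as the one the Run Command uses.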

Go to common policy for all AWS faults to use a single superset IAM policy across every AWS fault.


Authentication

The fault supports three credential delivery models. Pick one based on how your chaos infrastructure is deployed.

| Method | When to use it | How to configure |
| --- | --- | --- |
| Harness Secret Manager file secret | Chaos infrastructure runs outside EKS, or you want explicit static credentials | Upload the AWS credentials file as a File Secret in Harness Secret Manager and reference its identifier via AWS_AUTHENTICATION_SECRET |
| IAM Roles for Service Accounts (IRSA) | Chaos infrastructure runs in EKS and uses an OIDC-bound service account | No tunable changes; the chaos pod inherits the role automatically. Go to AWS IAM integration to set it up |
| Assume role | The fault needs to act in a different account or with elevated permissions | Set ASSUME_ROLE_ARN to the role ARN; the chaos pod assumes the role on top of its base credentials |

When using the Harness Secret Manager method, the File Secret should contain an AWS credentials file in the standard ~/.aws/credentials format:

[default]
aws_access_key_id = REPLACE_WITH_ACCESS_KEY_ID
aws_secret_access_key = REPLACE_WITH_SECRET_ACCESS_KEY

Upload this file as a File Secret in Harness Secret Manager (Project Setup → Secrets → New File Secret), and pass the secret identifier in AWS_AUTHENTICATION_SECRET.


Fault tunables

Required parameters

| Tunable | Description | Default |
| --- | --- | --- |
| REGION | AWS region that hosts the target instance. | (required) |
| PROCESS_IDS | Comma-separated list of PIDs to kill on the target instance (for example 1234,1235). | (required) |
| EC2_INSTANCE_ID or EC2_INSTANCE_TAG | One of these must be set to select the target instance(s). | "" |

Chaos parameters

| Tunable | Description | Default |
| --- | --- | --- |
| TOTAL_CHAOS_DURATION | Duration of the fault in seconds. | 30 |
| INSTANCE_AFFECTED_PERC | Percentage of matching instances to target (applies only when using EC2_INSTANCE_TAG). 0 targets one instance. | 0 |
| FORCE | When true, kills with SIGKILL (force kill); when false, sends SIGTERM for graceful shutdown. | false |
| INSTALL_DEPENDENCIES | Install dependencies inside the target instance if missing. Set to False to skip. | True |
| PROXY | HTTP/HTTPS proxy used by the in-instance installer (for example https_proxy=http://proxy.server:3128). Leave empty when no proxy is needed. | "" |
| SEQUENCE | Order in which multiple instances are processed: parallel or serial. | parallel |
| RAMP_TIME | Wait period in seconds before and after the fault. Go to ramp time to read how it is applied. | 0 |

Authentication

| Tunable | Description | Default |
| --- | --- | --- |
| ASSUME_ROLE_ARN | ARN of an IAM role to assume on top of the base credentials. | "" |
| AWS_AUTHENTICATION_SECRET | Identifier of the File Secret in Harness Secret Manager that contains the AWS credentials file. Not required when using IRSA. | "" |

Tunables that apply to every fault are documented in common tunables for all faults.

Use FORCE deliberately

FORCE=false (SIGTERM) tests graceful-shutdown paths. FORCE=true (SIGKILL) simulates a hard crash and skips the application's shutdown hooks. Run both to validate full coverage.
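To make the difference between the two modes concrete, the sketch below prints the shell command that corresponds to a given FORCE and PROCESS_IDS pair. It is an illustration only: the exact command the fault runs inside the instance is not documented here, and `kill_command_for` is a hypothetical helper name.

```shell
# Hypothetical helper (the fault's real in-instance command is an assumption):
# print the kill command equivalent to a FORCE / PROCESS_IDS pair.
kill_command_for() {
  local force="$1" pids="$2" signal
  case "$force" in
    true|True) signal=KILL ;;  # hard crash, shutdown hooks are skipped
    *)         signal=TERM ;;  # graceful shutdown request
  esac
  # PROCESS_IDS is comma-separated; kill expects space-separated PIDs.
  printf 'kill -s %s %s\n' "$signal" "$(printf '%s' "$pids" | tr ',' ' ')"
}
```

For example, `kill_command_for true 1234,1235` prints `kill -s KILL 1234 1235`, while `kill_command_for false 1234,1235` prints `kill -s TERM 1234 1235`. Running the printed command by hand on a test instance is one way to preview how a process reacts to each signal before running the fault.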


Fault execution in brief

The fault sends an SSM Run Command to the selected instance(s) in REGION that kills the PIDs listed in PROCESS_IDS, using SIGKILL when FORCE=true and SIGTERM otherwise. The chaos window lasts TOTAL_CHAOS_DURATION seconds, after which the experiment exits.


Expected behavior during fault execution

  • The named processes terminate on the target instance(s).
  • If a supervisor (systemd, container runtime, custom watchdog) is responsible for the process, a replacement starts and the application recovers; if not, the process stays gone until it is restarted manually, the host is rebooted, or the application is redeployed.
  • Callers of the process may see connection resets, 5xx responses, or timeouts depending on what the process serves.
  • Logs on the target instance (the supervisor's, or the application's own when it handles SIGTERM) typically record the signal that terminated the process.

When the fault ends

The chaos pod stops issuing kill commands. Processes restarted by their supervisors continue running normally; processes without a supervisor remain dead.

Signals to watch

Attach resilience probes to assert each layer:

  • Process state on the target: Use a command probe running aws ssm send-command --instance-ids <id> --document-name AWS-RunShellScript --parameters 'commands=["pgrep <process>"]' to confirm the kill.
  • Application availability: Use an HTTP probe against an endpoint served by the killed process to detect downtime and measure restart time.
  • Supervisor restart events: Use a Prometheus probe on node_systemd_unit_state{state="active"} (or the equivalent for your supervisor) to confirm the restart.

Verify the fault execution effect

While the experiment is running, confirm the processes are gone:

  1. Run a process query via SSM.

    aws ssm send-command \
    --region <region> \
    --instance-ids <id> \
    --document-name AWS-RunShellScript \
    --parameters 'commands=["ps -p <pid> -o pid,comm || echo missing"]'
  2. Check supervisor logs.

    For systemd:

    aws ssm send-command \
    --region <region> \
    --instance-ids <id> \
    --document-name AWS-RunShellScript \
    --parameters 'commands=["journalctl -u <service> --since \"5 minutes ago\""]'

    You should see a stop event followed by a restart.
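Note that `aws ssm send-command` returns as soon as the command is queued, so its output is the command metadata rather than the result of `ps` or `journalctl`. The sketch below shows one way to read the result, assuming the AWS CLI is configured; `ssm_run` is a hypothetical helper name, and the script argument must not contain double quotes because it is interpolated into the `--parameters` value.

```shell
# Hypothetical helper: run one shell command on an instance through SSM and
# print its standard output.
ssm_run() {
  local region="$1" instance_id="$2" script="$3" command_id

  # Queue the command and keep only its ID.
  command_id="$(aws ssm send-command \
    --region "$region" \
    --instance-ids "$instance_id" \
    --document-name AWS-RunShellScript \
    --parameters "commands=[\"${script}\"]" \
    --query 'Command.CommandId' \
    --output text)" || return 1

  # Wait for a terminal state. The waiter exits non-zero when the remote
  # command itself fails, so ignore that and read the output regardless.
  aws ssm wait command-executed \
    --region "$region" \
    --command-id "$command_id" \
    --instance-id "$instance_id" || true

  # Print what the remote command wrote to stdout.
  aws ssm get-command-invocation \
    --region "$region" \
    --command-id "$command_id" \
    --instance-id "$instance_id" \
    --query 'StandardOutputContent' \
    --output text
}
```

Example usage: `ssm_run <region> <id> 'ps -p <pid> -o pid,comm || echo missing'`. The credentials used need ssm:SendCommand and ssm:GetCommandInvocation, which the policy above already includes.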


Recovery and cleanup

  • End of duration: The chaos pod stops issuing kill commands.
  • Abort the experiment: Stopping the experiment from Chaos Studio cancels any in-flight SSM command.
  • Manual recovery: If the killed process has no supervisor, restart it manually (systemctl restart <service> via SSM, kubectl rollout restart for a Kubernetes workload, or your platform's equivalent).

Limitations

  • PID stability: PIDs change across reboots and supervisor restarts. Capture the PID immediately before the experiment or target a long-lived process whose PID file is stable.
  • SSM Agent required: Instances without the SSM Agent running cannot be targeted. Bake the agent into your AMI or install it via cloud-init.
  • Single command per instance: Multiple PIDs are passed to a single kill command; if the command fails midway, some processes may have already exited and others may not.
  • No payload control: This fault only kills processes. It does not modify their behavior or arguments. Use ec2-cpu-hog or ec2-memory-hog for resource-level chaos.
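Because PIDs are unstable, it can help to capture them right before the run and check the value before passing it to PROCESS_IDS. The following is a minimal sketch with hypothetical helper names. The validation is deliberately strict (digits separated by single commas, no spaces), which is an assumption about what the fault accepts rather than documented behavior.

```shell
# Hypothetical helpers for preparing a PROCESS_IDS value.

# Join newline-separated PIDs (the default pgrep output) with commas.
to_process_ids() {
  paste -sd, -
}

# Succeed only for a non-empty, comma-separated list of numeric PIDs.
valid_process_ids() {
  printf '%s' "$1" | grep -Eq '^[0-9]+(,[0-9]+)*$'
}
```

On the target instance, `pgrep -x nginx | to_process_ids` would print something like `1234,1235`, where `nginx` is only an example process name. Checking the result with `valid_process_ids` before starting the experiment catches an empty list, which usually means the process was not running when the PIDs were captured.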

Troubleshooting

EC2 process kill experiment fails with InvalidInstanceId in Harness Chaos Engineering

The SSM Agent is not online for the target instance. Confirm with aws ssm describe-instance-information --filters 'Key=InstanceIds,Values=<id>'. If the instance is missing from the output, install the SSM Agent (it ships with Amazon Linux 2 and most official AMIs) and attach an instance profile that includes AmazonSSMManagedInstanceCore.

EC2 process kill experiment reports success but the process is still running

The most common causes are: the PID in PROCESS_IDS does not exist on the target instance; FORCE=false sent SIGTERM and the process ignored or trapped the signal; or a supervisor restarted the process almost immediately. Verify by running 'ps -p <pid>' via SSM during the fault. If the original PID is gone but the service is still up, a supervisor restarted it under a new PID. If the process traps SIGTERM, rerun with FORCE=true to use SIGKILL.

EC2 process kill experiment fails with AccessDeniedException calling ssm:SendCommand

The chaos pod's IAM principal lacks ssm:SendCommand or related SSM permissions. Add ssm:SendCommand, ssm:GetCommandInvocation, ssm:DescribeInstanceInformation, ssm:CancelCommand, ssm:GetDocument, and ssm:DescribeDocument to the policy. If using ASSUME_ROLE_ARN, also confirm the trust policy allows the source identity.
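When ASSUME_ROLE_ARN is involved, it can be useful to separate "wrong base identity" from "trust policy rejects the caller". The sketch below uses two standard STS calls with the same credentials the chaos pod would use; the helper names and the session name `chaos-preflight` are hypothetical.

```shell
# Hypothetical helpers for debugging ASSUME_ROLE_ARN problems.

# Print the ARN of the identity behind the current credentials.
caller_arn() {
  aws sts get-caller-identity --query 'Arn' --output text
}

# Try to assume the role; prints the assumed-role ARN on success and
# returns non-zero when the trust policy or permissions reject the call.
try_assume_role() {
  local role_arn="$1"
  aws sts assume-role \
    --role-arn "$role_arn" \
    --role-session-name chaos-preflight \
    --query 'AssumedRoleUser.Arn' \
    --output text
}
```

If `caller_arn` prints an unexpected identity, fix the base credentials first. If it looks right but `try_assume_role <role-arn>` fails with an access error, the role's trust policy most likely does not list that identity as a principal.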