Linux DNS error
Linux DNS error is a chaos fault that returns DNS failures for host names matching HOST_NAMES (filtered by MATCH_SCHEME) on the target Linux machine for DURATION, then restores normal DNS resolution. Queries that do not match are forwarded to UPSTREAM_SERVER. The fault runs through the Linux Chaos Infrastructure (LCI) systemd service installed on the target VM.
Use this fault to test how a workload behaves during a partial or full DNS outage: whether the application surfaces clean resolution errors or hangs, whether DNS caching layers absorb the failure, whether retries amplify the outage, and whether monitoring detects the resolution failures within the alerting SLA.
If you have not installed the Linux Chaos Infrastructure yet, go to Linux Chaos Infrastructure to install the agent and connect the VM to the control plane.
Use cases
Run this fault when you want to answer concrete questions like:
- DNS failure paths: When HOST_NAMES fail to resolve, do application clients surface clean errors or hang on the resolver?
- Cache absorption: Do local DNS caches (nscd, systemd-resolved) absorb the failure for previously resolved entries?
- Retry storms: Do dependent services retry the failed lookups in a way that amplifies the outage?
- Monitoring fidelity: Do alerts on DNS failures, connection errors, and end-to-end p99 fire within the alerting SLA?
Prerequisites
- Linux Chaos Infrastructure installed: The linux-chaos-infrastructure systemd service is active on the target VM and the infrastructure is in CONNECTED state. Go to Linux Chaos Infrastructure to install it.
- DNS interceptor port available: DNS_PORT (default 53) is not bound by another resolver on the target VM, or the fault is configured to coexist with a local resolver.
- /tmp is exec-mountable: The DNS interceptor binary executes from /tmp. Verify with: findmnt -l | grep noexec | grep /tmp (no output means /tmp allows execution). If /tmp is mounted noexec, remount with: sudo mount /tmp -o remount,exec
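The prerequisites above can be checked before the experiment starts. The following is a minimal sketch, not part of the Linux Chaos Infrastructure tooling: the helper names has_noexec and port_in_use are hypothetical, and the helpers only parse text so they can be fed from findmnt and ss output.

```shell
#!/bin/sh
# Hypothetical pre-flight helpers (illustrative names, not shipped with LCI).

# has_noexec OPTIONS
# Exit 0 if the comma-separated mount options contain "noexec".
has_noexec() {
    case ",$1," in
        *,noexec,*) return 0 ;;
        *)          return 1 ;;
    esac
}

# port_in_use PORT
# Reads `ss -H -lntu` output on stdin (Netid State Recv-Q Send-Q Local Peer)
# and exits 0 if any local address in column 5 ends in :PORT.
port_in_use() {
    awk -v port="$1" '
        { n = split($5, parts, ":"); if (parts[n] == port) found = 1 }
        END { exit !found }'
}

# Example usage on the target VM:
#   systemctl is-active linux-chaos-infrastructure
#   has_noexec "$(findmnt -n -o OPTIONS --target /tmp)" && echo "/tmp is noexec"
#   ss -H -lntu | port_in_use 53 && echo "port 53 is already bound"
```

The port check only reports that something is listening; it does not say whether the fault can coexist with that listener.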
Supported environments
The fault has been tested on the following Linux distributions. Go to Linux fault requirements to see the full compatibility matrix.
| Platform | Support status |
|---|---|
| Ubuntu 16+, Debian 10+ | Supported |
| CentOS 7+, RHEL 7+, Fedora 30+ | Supported |
| openSUSE LEAP 15.4+ / SUSE Linux Enterprise 15+ | Supported |
Permissions required
This fault is classified as an Advanced Linux fault. It requires the Linux Chaos Infrastructure systemd service to run with the root user and root user group on the target VM so it can bind the DNS interceptor port and rewrite resolution. No cloud credentials are needed.
Fault tunables
Configure the following fault parameters when you add Linux DNS error to an experiment in Chaos Studio. Defaults are shown for reference.
Chaos parameters
| Tunable | Description | Default |
|---|---|---|
| DURATION | Total duration of the fault. Accepts [hours]h[minutes]m[seconds]s format (for example, 30s, 1m25s, 1h3m2s). | 30s |
| HOST_NAMES | Comma-separated host names to fail. Leave empty to fail every query forwarded to the interceptor. | "" |
| MATCH_SCHEME | Match type for host names. Accepts exact or substring. | exact |
| UPSTREAM_SERVER | Upstream DNS server used for forwarding queries that do not match HOST_NAMES. Leave empty to use the system resolver. | "" |
| DNS_PORT | Port on which the DNS interceptor listens for redirected queries. | 53 |
| RAMP_TIME | Wait period in seconds before and after the fault. Go to ramp time to read how it is applied. | 0 |
Tunables that apply to every fault are documented in common tunables for all faults.
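When sizing probe timeouts or monitoring windows, it can help to convert a DURATION value into seconds. The helper below is a sketch of the [hours]h[minutes]m[seconds]s format described in the table; the name duration_to_seconds is hypothetical, and the strict unit ordering it enforces is an assumption rather than documented parser behavior.

```shell
#!/bin/sh
# duration_to_seconds VALUE
# Prints VALUE (for example 30s, 1m25s, 1h3m2s) as seconds; exits 1 if the
# value does not fit the h/m/s pattern. The body runs in a subshell, so the
# variables and the early exits stay local to the function.
duration_to_seconds() (
    rest=$1
    total=0
    [ -n "$rest" ] || exit 1
    for unit in h m s; do
        case $rest in
            *"$unit"*)
                value=${rest%%"$unit"*}   # digits before the unit
                rest=${rest#*"$unit"}     # remainder after the unit
                case $value in ''|*[!0-9]*) exit 1 ;; esac
                # Strip leading zeros so the shell does not read 08 as octal.
                while [ "${#value}" -gt 1 ] && [ "${value#0}" != "$value" ]; do
                    value=${value#0}
                done
                case $unit in
                    h) total=$((total + value * 3600)) ;;
                    m) total=$((total + value * 60)) ;;
                    s) total=$((total + value)) ;;
                esac ;;
        esac
    done
    [ -z "$rest" ] || exit 1
    echo "$total"
)

# Example: duration_to_seconds 1h3m2s prints 3782 (3600 + 180 + 2).
```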
Fault execution in brief
Redirects DNS traffic on the target VM to a local interceptor on DNS_PORT for DURATION. Queries matching HOST_NAMES per MATCH_SCHEME are answered with a failure; other queries are forwarded to UPSTREAM_SERVER (or the system resolver).
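The matching step can be pictured with a small sketch. It is not the interceptor's code: host_matches is a hypothetical helper, and details such as case sensitivity, trailing dots, and whitespace handling are assumptions (the sketch compares strings byte for byte).

```shell
#!/bin/sh
# host_matches SCHEME QUERY HOST_NAMES
# Exit 0 if QUERY matches an entry of the comma-separated HOST_NAMES list
# under SCHEME (exact or substring). An empty HOST_NAMES matches every query,
# mirroring the "leave empty to fail every query" note in the tunables table.
# The body runs in a subshell so the IFS change and exits stay local.
host_matches() (
    scheme=$1
    query=$2
    names=$3
    [ -n "$names" ] || exit 0
    set -f          # no glob expansion while splitting the list
    IFS=,
    for name in $names; do
        [ -n "$name" ] || continue
        case $scheme in
            exact)
                [ "$query" = "$name" ] && exit 0 ;;
            substring)
                case $query in *"$name"*) exit 0 ;; esac ;;
        esac
    done
    exit 1
)

# Example: matched queries would be failed, everything else forwarded.
#   host_matches exact api.example.com "api.example.com,db.internal" && echo fail
#   host_matches exact www.example.com "api.example.com" || echo forward
```

Under this reading, substring is the broader setting: an entry such as api.example also catches www.api.example.com, which widens the blast radius compared with exact.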
Expected behavior during fault execution
- Lookups for HOST_NAMES return a failure (SERVFAIL or equivalent) for the duration of the fault.
- Application clients see DNS errors when they attempt to resolve the matched hosts; subsequent TCP/HTTP calls fail with resolver errors such as "Temporary failure in name resolution" or "Name or service not known".
- Lookups for hosts that do not match continue to succeed via UPSTREAM_SERVER.
- After the duration ends, DNS redirection is removed and normal resolution resumes.
When the fault ends, the Linux Chaos Infrastructure removes the DNS redirect rule and stops the interceptor. New lookups go to the system resolver. Cached resolutions in nscd or systemd-resolved may persist for the cache TTL.
Signals to watch
Attach resilience probes to assert each layer:
- DNS failures: Use a command probe running dig @127.0.0.1 -p <DNS_PORT> <host> and assert it returns SERVFAIL during the chaos window.
- Application errors: Use a Prometheus probe on application DNS or connection-error metrics.
- End-to-end availability: Use an HTTP probe on a user-visible endpoint that depends on the matched hosts.
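For the DNS failure signal, a command probe needs a stable value to compare. One possible approach, sketched here with hypothetical helper names (dig_status, expect_status), is to extract the status field from the dig header line rather than matching the whole output.

```shell
#!/bin/sh
# dig_status
# Reads dig output on stdin and prints the response status from the header
# line (for example NOERROR, SERVFAIL, NXDOMAIN). Prints nothing when dig
# received no response at all, such as on a timeout.
dig_status() {
    sed -n 's/^;; ->>HEADER<<-.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# expect_status WANT
# Exit 0 only if the status parsed from stdin equals WANT.
expect_status() {
    [ "$(dig_status)" = "$1" ]
}

# Example usage during the chaos window (DNS_PORT and host are placeholders):
#   dig @127.0.0.1 -p "$DNS_PORT" +tries=1 +time=2 "$host" | dig_status
#   dig @127.0.0.1 -p "$DNS_PORT" +tries=1 +time=2 "$host" | expect_status SERVFAIL
```

Printing the bare status keeps the probe comparison to a single token; how the probe consumes it (output comparison or exit code) depends on how the command probe is configured.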
Verify the fault execution effect
Confirm that DNS fails while the experiment is running and recovers afterwards:
- Query a matched host: dig +tries=1 +time=2 <one-of-HOST_NAMES> or getent hosts <one-of-HOST_NAMES>. The query should return SERVFAIL (or empty getent output) during the chaos window and resolve normally afterwards.
- Query a non-matched host: dig +tries=1 +time=2 example.com. The query should resolve normally throughout, confirming the filter is working.
- Inspect Linux Chaos Infrastructure logs: sudo journalctl -u linux-chaos-infrastructure -n 100 --no-pager. Look for the fault start, the redirect setup, and the fault end markers.
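To see the failure window and the recovery on a timeline, the checks above can be sampled in a loop. This sketch uses getent because it goes through the same NSS path applications use (including any nscd cache), whereas dig queries the DNS server directly. The function names resolve_state and watch_resolution are hypothetical.

```shell
#!/bin/sh
# resolve_state COMMAND [ARGS...]
# Runs the given lookup command and prints "ok" if it succeeds, "fail" otherwise.
resolve_state() {
    if "$@" >/dev/null 2>&1; then echo ok; else echo fail; fi
}

# watch_resolution HOST SAMPLES [INTERVAL_SECONDS]
# Prints one "HH:MM:SS host ok|fail" line per sample using getent hosts.
# The body runs in a subshell so its variables stay local.
watch_resolution() (
    host=$1
    samples=$2
    interval=${3:-1}
    i=0
    while [ "$i" -lt "$samples" ]; do
        printf '%s %s %s\n' "$(date +%H:%M:%S)" "$host" \
            "$(resolve_state getent hosts "$host")"
        i=$((i + 1))
        if [ "$i" -lt "$samples" ]; then sleep "$interval"; fi
    done
)

# Example: sample a matched host every 5 seconds for about a minute,
# starting shortly before the experiment and ending after DURATION.
#   watch_resolution <one-of-HOST_NAMES> 12 5
```

A run that spans the chaos window should show ok, then a block of fail lines, then ok again; if cached answers keep a matched host resolving, the fail block may be shorter than DURATION or missing.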
Recovery and cleanup
- End of duration: The Linux Chaos Infrastructure removes the DNS redirect and stops the interceptor when DURATION elapses.
- Abort the experiment: Stopping the experiment from Chaos Studio also removes the redirect.
- Manual recovery: If the redirect survives an abort, inspect the NAT table (sudo iptables -t nat -L -n -v), delete the rules created by the fault, and kill the interceptor with sudo pkill -f <interceptor-binary> (the binary name is recorded in the LCI logs).
- Cache purge: If cached failures persist after the chaos window, flush local DNS caches with sudo resolvectl flush-caches (sudo systemd-resolve --flush-caches on older systemd releases) or sudo nscd -i hosts.
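For manual recovery, deleting individual rules is safer than flushing the whole NAT table. The sketch below turns iptables -t nat -S output into candidate delete commands. It assumes the leftover rules match on --dport with the DNS port, which is a guess about the rule shape and not documented behavior, and nat_delete_cmds is a hypothetical helper. It only prints commands, so review them before running any.

```shell
#!/bin/sh
# nat_delete_cmds PORT
# Reads `iptables -t nat -S` output on stdin and prints one
# "iptables -t nat -D ..." command for each appended rule (-A) that
# contains "--dport PORT". Nothing is executed.
nat_delete_cmds() {
    awk -v port="$1" '
        $1 == "-A" {
            for (i = 1; i < NF; i++) {
                if ($i == "--dport" && $(i + 1) == port) {
                    sub(/^-A /, "-D ")          # append rule -> delete rule
                    print "iptables -t nat " $0
                    break
                }
            }
        }'
}

# Example usage on the target VM:
#   sudo iptables -t nat -S | nat_delete_cmds 53
# Inspect the printed commands, then run the ones that belong to the fault
# with sudo.
```

Any rule on that port is listed, including ones that predate the experiment, so the output is a starting point for review and not a cleanup script.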
Limitations
- Single VM scope: Each fault run targets one VM (the VM hosting the selected Linux Chaos Infrastructure).
- System resolver coexistence: If a local resolver (systemd-resolved, dnsmasq) already binds port 53, configure DNS_PORT to a free port and verify clients use it.
- TTL-bound cache: Previously resolved hosts may continue to resolve from the nscd or systemd-resolved cache until the TTL expires.
- No partial failure mode: Matched lookups always fail; there is no failure rate tunable. Use a smaller HOST_NAMES set to limit blast radius.
Troubleshooting
Linux DNS error fault did not cause resolution failures in Harness Chaos Engineering
Confirm clients are using the interceptor port (DNS_PORT) and that nscd or systemd-resolved is not serving cached answers. Flush local DNS caches and re-test. Also verify the linux-chaos-infrastructure systemd service is active and CONNECTED.
Port 53 already in use
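A local resolver such as systemd-resolved or dnsmasq commonly binds port 53. Either stop the local resolver before running the fault, or configure DNS_PORT to a free port (for example, 1053). Because nameserver entries in /etc/resolv.conf cannot specify a port, a non-default port only takes effect for queries that the fault redirects to it or for applications that set the port in their own resolver configuration.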
/tmp mounted noexec prevents the interceptor from starting
Remount /tmp with exec permissions for the duration of the experiment: sudo mount /tmp -o remount,exec. Restore the original mount options after the experiment if required by your security policy.
Related faults
- Linux DNS spoof: Return spoofed IPs instead of failures.
- Linux network loss: Drop packets entirely instead of failing DNS.
- Linux API block: Block API responses through a proxy instead of DNS.