Alert Management

Last updated on Jul 2, 2026

Harness AI SRE lets you receive, route, and respond to alerts from any monitoring system or custom application.

Overview

Alert management in Harness AI SRE enables you to:

Receive alerts from any source: Webhook integrations support 25+ monitoring tools with pre-configured templates, plus custom webhooks for any system that can send HTTP requests.
Route alerts intelligently: Route alerts based on service, environment, team, severity, or custom fields.
Enrich alerts with context: Automatically add service metadata, team information, historical data, and related incidents.
Automate responses: Trigger runbooks, create incidents, send notifications, and execute remediation actions.

How Alerts Work

Alert Ingestion

Alerts enter Harness AI SRE through webhook integrations:

External monitoring system (Datadog, PagerDuty, Prometheus, etc.) detects an issue
Webhook POST sends alert payload to your unique webhook URL
Field extraction parses the JSON payload using JSONPath expressions
Field mapping populates alert properties using Mustache templates
Alert created in Harness AI SRE with enriched context
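
The extraction and mapping steps above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Harness AI SRE implements it: the `extract` and `render` helpers, the payload shape, and the field names are all hypothetical, and the helpers handle only a small subset of JSONPath (dotted keys and list indexes) and Mustache (plain variables).

```python
import re

def extract(payload, path):
    """Resolve a simple JSONPath such as '$.alert.title' or '$.tags[0]'."""
    current = payload
    # Each token is either a dotted key (.name) or a list index ([n]).
    for key, index in re.findall(r"\.([A-Za-z_][\w-]*)|\[(\d+)\]", path):
        try:
            current = current[key] if key else current[int(index)]
        except (KeyError, IndexError, TypeError):
            return None  # Path does not exist in this payload.
    return current

def render(template, fields):
    """Substitute {{name}} placeholders, Mustache-style (variables only)."""
    def replace(match):
        value = fields.get(match.group(1))
        return "" if value is None else str(value)
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", replace, template)

# Hypothetical payload from a monitoring tool.
payload = {
    "alert": {"title": "High CPU usage", "priority": "P1"},
    "tags": ["api-service", "us-east-1"],
}

# Hypothetical extraction rules: field name -> JSONPath expression.
extraction = {
    "title": "$.alert.title",
    "service": "$.tags[0]",
    "region": "$.tags[1]",
}
fields = {name: extract(payload, path) for name, path in extraction.items()}

# Hypothetical mapping: alert property -> Mustache template.
mapping = {"summary": "{{title}} on {{service}} in {{region}}"}
alert = {prop: render(tpl, fields) for prop, tpl in mapping.items()}

print(alert["summary"])  # High CPU usage on api-service in us-east-1
```

Keeping extraction and mapping as two separate passes means the same extracted fields can feed several alert properties.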

Go to Ingest Alerts to set up webhook integrations.

Alert Routing

Once alerts are received, alert routing rules determine how they are processed:

Route to services: Automatically link alerts to the affected service
Assign to teams: Route alerts to the responsible on-call team
Create incidents: Automatically create incidents for critical alerts
Trigger runbooks: Execute automated remediation workflows
Suppress duplicates: Deduplicate alerts based on custom rules
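
One common way to suppress duplicates is to fingerprint each alert on a chosen set of fields and drop repeats that arrive within a time window. The sketch below assumes that approach; the `Deduplicator` class, the key fields, and the 300-second window are illustrative choices, not Harness AI SRE settings.

```python
import hashlib

class Deduplicator:
    """Flags alerts whose fingerprint was seen within the last `window_seconds`."""

    def __init__(self, keys, window_seconds):
        self.keys = keys              # Alert fields that define "the same alert".
        self.window = window_seconds
        self.last_seen = {}           # Fingerprint -> timestamp of latest sighting.

    def fingerprint(self, alert):
        raw = "|".join(str(alert.get(k, "")) for k in self.keys)
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def is_duplicate(self, alert, now):
        fp = self.fingerprint(alert)
        previous = self.last_seen.get(fp)
        # Every sighting refreshes the timestamp, so the window slides.
        self.last_seen[fp] = now
        return previous is not None and now - previous < self.window

dedup = Deduplicator(keys=["service", "title"], window_seconds=300)
alert = {"service": "api-service", "title": "High CPU usage"}

results = [
    dedup.is_duplicate(alert, now=0),    # First sighting.
    dedup.is_duplicate(alert, now=120),  # Repeat inside the window.
    dedup.is_duplicate(alert, now=500),  # 380 seconds after the last sighting.
]
print(results)  # [False, True, False]
```

Timestamps are passed in explicitly so the behavior is easy to test; a real pipeline would typically use the alert's receive time.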

Go to Route Alerts to set up routing and automation.

Alert Sources

Webhook Integrations

Harness AI SRE uses webhook integrations to receive alerts from monitoring systems. Webhooks provide:

25+ pre-configured templates: Ready-to-use integrations for popular monitoring tools
Custom webhooks: Support for any system that can send HTTP POST requests
Flexible payload mapping: JSONPath extraction and Mustache templates for field mapping
Dual trigger methods: HTTP POST endpoint or email address for legacy systems

Supported Monitoring Tools (Webhook Templates)

Application Performance Monitoring:

Infrastructure Monitoring:

Incident Management:

Security and Compliance:

Lacework

Website Monitoring:

AlertSite

Alert Correlation:

BigPanda

SLO Monitoring:

Harness SLO

Go to Webhook Templates to browse all available templates.

Custom Monitoring Tools

For custom monitoring solutions, internal applications, or legacy systems:

Generic Webhook: Accepts any JSON payload with custom field mapping
Email Triggers: Send alerts via email for systems without webhook support
Custom Field Mapping: Use JSONPath, Mustache templates, and CEL expressions
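
For a system with no pre-built template, sending an alert amounts to an HTTP POST with a JSON body. The following is a minimal sketch using only the Python standard library. The webhook URL is a placeholder and the payload field names are made up for illustration; use the URL shown for your own webhook and whatever fields your field mapping expects.

```python
import json
import urllib.request

def build_alert_request(webhook_url, payload):
    """Build (but do not send) a JSON POST request for a generic webhook."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        webhook_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send_alert(webhook_url, payload, timeout=10):
    """Send the alert and return the HTTP status code."""
    request = build_alert_request(webhook_url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status

# Placeholder URL and hypothetical payload fields.
WEBHOOK_URL = "https://example.invalid/your-webhook-url"
payload = {
    "title": "High CPU usage on api-service in us-east-1",
    "severity": "critical",
    "service": "api-service",
}

request = build_alert_request(WEBHOOK_URL, payload)
# To deliver the alert: status = send_alert(WEBHOOK_URL, payload)
```

Separating request construction from sending makes the payload easy to inspect or unit test without network access.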

Go to Create a Webhook for custom webhook setup.

Service Paging Webhooks

For dedicated on-call paging from external systems:

Service-specific webhook URLs: Each service gets a unique paging webhook
Automatic on-call routing: Pages the current on-call engineer
Email-based triggering: Legacy systems can page via email
Bypass alert routing: Direct service paging without alert rule processing

Go to Service Paging Webhook for dedicated service paging.

Alert Configuration

Alert Routing

Configure alert routing based on:

Service: Route alerts to specific services based on payload fields
Environment: Separate production, staging, and development alerts
Team: Direct alerts to the responsible team or on-call schedule
Severity: Escalate critical alerts and suppress informational ones
Custom fields: Route based on any field in the alert payload

Example routing rules:

Critical alerts → Create incident + page on-call
Warning alerts → Create alert + notify Slack channel
Info alerts → Log only, no notifications
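
The three example rules can be expressed as a small lookup table. This is a sketch only: the rule format and action names are invented here for illustration and do not reflect the Harness AI SRE rule syntax, and falling back to log-only for unknown severities is an assumption.

```python
# Hypothetical rule table mirroring the three example rules above.
RULES = [
    {"severity": "critical", "actions": ["create_incident", "page_on_call"]},
    {"severity": "warning", "actions": ["create_alert", "notify_slack"]},
    {"severity": "info", "actions": ["log_only"]},
]

def route(alert, rules=RULES):
    """Return the actions of the first rule matching the alert's severity."""
    severity = str(alert.get("severity", "")).lower()
    for rule in rules:
        if rule["severity"] == severity:
            return list(rule["actions"])
    # Assumption: unrecognized severities are logged without notification.
    return ["log_only"]

print(route({"severity": "Critical"}))  # ['create_incident', 'page_on_call']
print(route({"severity": "warning"}))   # ['create_alert', 'notify_slack']
print(route({"title": "no severity"}))  # ['log_only']
```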

Go to Route Alerts for routing configuration.

Alert Enrichment

Enhance alerts with additional context automatically:

Service metadata: Service owner, runbooks, documentation links
Team information: On-call schedule, escalation policy, Slack channel
Environment details: Region, cluster, deployment version
Historical data: Similar past incidents, resolution patterns
Related incidents: Link to active incidents for the same service

Enrichment data is added when the alert is created and is visible in the alert timeline.
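
Conceptually, enrichment is a lookup keyed on a field of the alert, with the result attached at creation time. The sketch below assumes a simple in-memory service catalog; the catalog contents, the `context` key, and the `enrich` helper are hypothetical and only illustrate the pattern.

```python
# Hypothetical service catalog: service name -> metadata to attach.
SERVICE_CATALOG = {
    "api-service": {
        "owner": "platform-team",
        "runbook": "restart-api",
        "slack_channel": "#api-alerts",
    },
}

def enrich(alert, catalog):
    """Return a copy of the alert with service metadata under 'context'."""
    enriched = dict(alert)
    metadata = catalog.get(alert.get("service"), {})
    # Copy the metadata so later edits to the alert do not touch the catalog.
    enriched["context"] = dict(metadata)
    return enriched

alert = {"service": "api-service", "title": "High CPU usage"}
enriched = enrich(alert, SERVICE_CATALOG)
print(enriched["context"]["owner"])  # platform-team
```

Returning a copy leaves the original alert unchanged, so the raw payload and the enriched view can both be kept.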

Alert Actions

Define automated actions when alerts are received:

Incident Creation:

Create incidents automatically for critical alerts
Link related alerts to existing incidents
Inherit incident metadata from service configuration

Runbook Execution:

Trigger diagnostic runbooks automatically
Execute remediation workflows
Gather context before human intervention

Notifications:

Send Slack messages to team channels
Page on-call engineers via PagerDuty
Post to Microsoft Teams or Google Chat

Ticketing:

Create Jira issues for alerts requiring follow-up
Open ServiceNow incidents for escalation
Track alert resolution in external systems

Go to Create a Runbook to automate alert responses.

Alert Lifecycle

Alert States

Alerts progress through these states:

New: Alert received from monitoring system
Acknowledged: On-call engineer acknowledges alert
Resolved: Underlying issue resolved, either automatically or manually
Closed: Alert closed after resolution confirmation
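
The four states can be modeled as a small state machine. The transition table below is an assumption made for illustration: in particular, it allows New to move straight to Resolved (for example, when a monitoring system sends a recovery before anyone acknowledges), which the list above does not spell out.

```python
from enum import Enum

class AlertState(Enum):
    NEW = "new"
    ACKNOWLEDGED = "acknowledged"
    RESOLVED = "resolved"
    CLOSED = "closed"

# Assumed transitions; the actual product may permit others.
ALLOWED = {
    AlertState.NEW: {AlertState.ACKNOWLEDGED, AlertState.RESOLVED},
    AlertState.ACKNOWLEDGED: {AlertState.RESOLVED},
    AlertState.RESOLVED: {AlertState.CLOSED},
    AlertState.CLOSED: set(),
}

def transition(current, target):
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target

state = AlertState.NEW
for step in (AlertState.ACKNOWLEDGED, AlertState.RESOLVED, AlertState.CLOSED):
    state = transition(state, step)
print(state.value)  # closed
```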

Alert Resolution

Alerts can be resolved in multiple ways:

Automatic Resolution:

Monitoring system sends resolution webhook (e.g., Datadog recovery)
Alert rule detects resolution condition
Linked incident is resolved

Manual Resolution:

On-call engineer marks alert as resolved
Runbook completes successfully
External system (Jira, ServiceNow) updates status

Alert History

All alert events are tracked in the alert timeline:

Alert creation from monitoring system
Status changes (acknowledged, resolved)
Associated incidents and runbooks
Comments and annotations
Related alerts and context

Best Practices

Alert Design

Use clear, actionable alert names:

❌ Alert triggered
✅ High CPU usage on api-service in us-east-1

Include relevant context in alert descriptions:

Service name and environment
Current metric value and threshold
Link to dashboard or logs
Suggested remediation steps

Set appropriate thresholds:

Avoid alert fatigue from noisy thresholds
Balance sensitivity with false positive rate
Use dynamic thresholds for variable workloads

Configure proper severity levels:

Critical: Service down, immediate action required
High: Degraded performance, page on-call
Medium: Warning condition, notify team channel
Low: Informational, log only