Automate Incident Response with Runbooks
Learn how to automate incident response workflows in Harness AI SRE using runbooks, triggers, and integrations.
Overview
AI SRE automates incident workflows through:
- Runbooks - Sequences of automated actions (notifications, API calls, scripts)
- Triggers - Conditions that automatically execute runbooks (alert rules, incident events, status changes)
- Alert Rules - Route alerts to on-call teams and auto-create incidents
- Integrations - Connect to Slack, Jira, ServiceNow, Zoom, PagerDuty, and more
Workflow automation in AI SRE uses form-based UI configuration with Mustache templates, not YAML files. Actions are configured through forms where you can:
- Select integration type (Slack, Jira, HTTP, etc.)
- Fill in action parameters using form fields
- Reference incident data using Mustache syntax like {{incident.title}} or {{alert.severity}}
- Test actions before saving
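To make the Mustache references concrete, here is a minimal Python sketch of how a dotted placeholder such as {{incident.title}} could be resolved against incident data. It is not AI SRE's renderer: the `render` and `lookup` helpers and the sample data are hypothetical, and only plain variable placeholders are handled (no sections, partials, or escaping).

```python
import re

# Matches {{ dotted.path }} placeholders only; sections, partials, and
# HTML escaping from full Mustache are deliberately left out of this sketch.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def lookup(context: dict, path: str):
    """Walk a dotted path such as 'incident.title' through nested dicts."""
    value = context
    for key in path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def render(template: str, context: dict) -> str:
    """Replace each placeholder with its value; unknown paths render as ''."""
    def substitute(match: re.Match) -> str:
        value = lookup(context, match.group(1))
        return "" if value is None else str(value)

    return PLACEHOLDER.sub(substitute, template)


# Hypothetical sample data, shaped like the variables named above.
sample = {
    "incident": {"title": "Checkout latency"},
    "alert": {"severity": "critical"},
}
print(render("{{incident.title}} ({{alert.severity}})", sample))
```

A variable that does not exist renders as an empty string here, which matches standard Mustache behavior and is one reason to test actions before saving.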
Automation Patterns
Pattern 1: Alert Detection → Incident Creation
Use case: Automatically create incidents from high-severity alerts
How to configure:
- Navigate to Alerts → Alert Rules
- Click Create Alert Rule
- Configure the rule:
  - Name: "P1/P2 Alerts Auto-Create Incidents"
  - Conditions: alert.priority in [p1_critical, p2_error]
  - Action: Create Incident
  - Incident Type: Select "Service Incident"
  - Map Fields:
    - Title: {{alert.title}}
    - Service: {{alert.service}}
    - Severity: {{alert.priority}}
- Click Save
Result: When P1 or P2 alerts are received, incidents are automatically created with pre-populated fields.
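The logic of this rule can be sketched in Python as follows. This only illustrates the condition and field mapping configured above; the `incident_from_alert` function and the dictionary shapes are assumptions, not part of the AI SRE API.

```python
from typing import Optional

# Priorities named in the rule condition: alert.priority in [p1_critical, p2_error]
AUTO_CREATE_PRIORITIES = {"p1_critical", "p2_error"}


def incident_from_alert(alert: dict) -> Optional[dict]:
    """Return pre-populated incident fields, or None when the rule does not match."""
    if alert.get("priority") not in AUTO_CREATE_PRIORITIES:
        return None
    return {
        "type": "Service Incident",
        "title": alert["title"],        # Title: {{alert.title}}
        "service": alert["service"],    # Service: {{alert.service}}
        "severity": alert["priority"],  # Severity: {{alert.priority}}
    }


# Hypothetical alert used only to exercise the sketch.
alert = {"title": "High error rate", "service": "payment-service", "priority": "p1_critical"}
print(incident_from_alert(alert))
```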
Pattern 2: Incident Creation → Automated Response
Use case: When a P1 incident is created, automatically notify on-call, create a Zoom bridge, and trigger a diagnostic runbook
How to configure:
- Create a runbook for P1 response:
  - Navigate to Runbooks → Create Runbook
  - Name: "P1 Incident Response"
  - Add actions:
    - Zoom: Create Meeting
      - Name: {{incident.title}} - Incident Bridge
      - Participants: {{incident.responders}}
    - Slack: Post Message
      - Channel: #incidents
      - Message:
        🔴 **P1 Incident Created**
        **Title**: {{incident.title}}
        **Service**: {{incident.service}}
        **Zoom Bridge**: {{runbook.outputs.zoom_create_meeting.join_url}}
        **Incident Link**: {{incident.url}}
    - On-Call: Page Service
      - Service: {{incident.service}}
      - Message: P1 incident - join bridge at {{runbook.outputs.zoom_create_meeting.join_url}}
- Configure automatic trigger:
  - In the runbook editor, go to Triggers tab
  - Click Add Trigger
  - Trigger Type: Incident Created
  - Conditions: incident.severity equals SEV0 OR SEV1
  - Click Save
Result: P1 incidents automatically trigger the runbook, creating a Zoom bridge and notifying responders.
Pattern 3: Status Change → Stakeholder Notification
Use case: When an incident's status changes to "Resolved", notify stakeholders and create a post-incident review task
How to configure:
- Create a runbook:
  - Name: "Incident Resolution Workflow"
  - Add actions:
    - Slack: Post Message
      - Channel: #incidents
      - Message:
        ✅ **Incident Resolved**
        **Title**: {{incident.title}}
        **Duration**: {{incident.duration}}
        **Resolved By**: {{incident.resolver}}
        Post-incident review scheduled.
    - Jira: Create Issue
      - Project: SRE
      - Issue Type: Task
      - Summary: Post-Incident Review: {{incident.title}}
      - Description:
        Incident: {{incident.url}}
        Duration: {{incident.duration}}
        Severity: {{incident.severity}}
        Complete post-incident review within 3 business days.
      - Assignee: {{incident.owner}}
- Configure trigger:
  - Trigger Type: Incident Field Updated
  - Field: Status
  - New Value: Resolved
Result: When incidents are marked resolved, stakeholders are notified and a review task is created in Jira.
Pattern 4: Time-Based Escalation
Use case: If a P1 incident isn't acknowledged within 5 minutes, escalate to VP Engineering
How to configure:
- Create escalation runbook:
  - Name: "P1 Escalation"
  - Add actions:
    - Slack: Post Message
      - Channel: #exec-alerts
      - Message:
        🚨 **ESCALATION: P1 Incident Not Acknowledged**
        **Title**: {{incident.title}}
        **Created**: {{incident.created_at}}
        **Service**: {{incident.service}}
        **Link**: {{incident.url}}
        @vp-engineering - immediate attention required
    - PagerDuty: Trigger Escalation
      - Escalation Policy: "Executive Escalation"
- Configure trigger:
  - Trigger Type: Incident Created
  - Conditions: incident.severity equals SEV0 OR SEV1
  - Delay: 5 minutes
  - Only run if: incident.status not equals Acknowledged
Result: If P1 incidents remain unacknowledged for 5 minutes, executives are paged.
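The decision this trigger makes once the delay has elapsed can be sketched in Python. This illustrates the configured conditions only, not AI SRE's scheduler; the `should_escalate` function and the incident dictionary shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone

ESCALATION_DELAY = timedelta(minutes=5)   # Delay: 5 minutes
ESCALATED_SEVERITIES = {"SEV0", "SEV1"}   # Conditions: SEV0 OR SEV1


def should_escalate(incident: dict, now: datetime) -> bool:
    """True when the delay has passed and the incident is still unacknowledged."""
    if incident["severity"] not in ESCALATED_SEVERITIES:
        return False
    if now - incident["created_at"] < ESCALATION_DELAY:
        return False
    # Only run if: incident.status not equals Acknowledged
    return incident["status"] != "Acknowledged"


# Hypothetical incident created at 12:00 UTC and never acknowledged.
created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
incident = {"severity": "SEV1", "status": "Open", "created_at": created}
print(should_escalate(incident, created + timedelta(minutes=6)))
```

Note that the "Only run if" condition is checked when the delay expires, so an incident acknowledged at minute four does not escalate.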
Pattern 5: Deployment Change → Proactive Investigation
Use case: When a deployment occurs, automatically check for related alerts and create an incident if errors spike
How to configure:
- Send deployment webhooks to AI SRE:
  - Configure your CI/CD pipeline to POST deployment events to:
    POST https://app.harness.io/gateway/ai-sre/api/v1/orgs/{org}/projects/{project}/webhooks/deploy
  - Webhook payload:
    {
      "service": "payment-service",
      "version": "v2.3.1",
      "environment": "production",
      "deployed_by": "jane@company.com",
      "git_commit": "abc123"
    }
- AI SRE automatically:
  - Correlates the deployment with any alerts occurring within 30 minutes
  - Surfaces the deployment as a root cause theory in investigations
  - Links related pull requests for code review
No additional configuration needed - deployment correlation is automatic when webhooks are sent.
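As an illustration, a pipeline step could build the webhook request with Python's standard library along these lines. The URL and payload come from the steps above; the `build_deploy_request` and `send` helpers are hypothetical, and authentication is not covered by this page, so any required headers are passed in through `extra_headers` as an assumption.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

BASE_URL = "https://app.harness.io/gateway/ai-sre/api/v1"


def build_deploy_request(org: str, project: str, event: dict,
                         extra_headers: Optional[dict] = None) -> urllib.request.Request:
    """Build (but do not send) the POST request for a deployment event."""
    url = (
        f"{BASE_URL}/orgs/{urllib.parse.quote(org, safe='')}"
        f"/projects/{urllib.parse.quote(project, safe='')}/webhooks/deploy"
    )
    headers = {"Content-Type": "application/json"}
    # Assumption: authentication headers, if required, are supplied by the caller.
    headers.update(extra_headers or {})
    body = json.dumps(event).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


EVENT = {
    "service": "payment-service",
    "version": "v2.3.1",
    "environment": "production",
    "deployed_by": "jane@company.com",
    "git_commit": "abc123",
}

# "my-org" and "my-project" are placeholder identifiers.
request = build_deploy_request("my-org", "my-project", EVENT)
print(request.get_method(), request.full_url)
```

Calling `send(request)` from the final pipeline step would deliver the event; it is left uncalled here so the sketch runs without network access.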
Runbook Components
Actions
Runbooks execute sequences of actions. Available action types:
Communication Actions
- Slack: Post Message - Send formatted messages to Slack channels
- Slack: Update Message - Update existing Slack messages
- Microsoft Teams: Post Message - Send messages to Teams channels
- Google Chat: Post Message - Send messages to Google Chat spaces
- Zoom: Create Meeting - Generate instant Zoom bridges
Ticketing Actions
- Jira: Create Issue - Create Jira tickets with incident context
- Jira: Update Issue - Update existing Jira issues
- ServiceNow: Create Incident - Create ServiceNow incidents
- ServiceNow: Update Incident - Update ServiceNow incidents
Deployment Actions
- Harness: Run Pipeline - Trigger Harness CD pipelines (rollback, scale, deploy)
On-Call Actions
- Page Service - Page the on-call responders for a service
- Page User - Directly page a specific user
- Page Team - Page all members of a team
Custom Actions
- HTTP Request - Call any REST API
- Script - Run custom JavaScript/Python logic
Action Inputs
Actions use form-based configuration with Mustache template support:
Text fields accept Mustache variables:
Title: {{incident.title}}
Service: {{incident.service}}
Owner: {{incident.owner}}
Available variables:
- {{incident.*}} - Current incident fields (title, severity, service, owner, status, etc.)
- {{alert.*}} - Alert fields if runbook triggered by alert rule
- {{runbook.outputs.*}} - Outputs from previous actions in the runbook
- {{user.*}} - User who triggered the runbook (name, email)
Example: Reference previous action output
Action 1: Zoom: Create Meeting
→ Outputs: join_url, meeting_id
Action 2: Slack: Post Message
Message: Join the incident bridge at {{runbook.outputs.zoom_create_meeting.join_url}}
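The chaining shown above can be modeled with a short Python sketch: each action's outputs are stored under a key derived from its name, and later actions render their inputs against them. The key derivation (`output_key`) is inferred from names like zoom_create_meeting in this guide and is an assumption, as are the stub handlers and the example URL.

```python
import re


def output_key(action_name: str) -> str:
    """'Zoom: Create Meeting' -> 'zoom_create_meeting' (inferred convention)."""
    return re.sub(r"[^a-z0-9]+", "_", action_name.lower()).strip("_")


def render(template: str, context: dict) -> str:
    """Resolve {{dotted.path}} placeholders; missing paths become ''."""
    def substitute(match: re.Match) -> str:
        value = context
        for key in match.group(1).split("."):
            value = value.get(key, "") if isinstance(value, dict) else ""
        return str(value)

    return re.sub(r"\{\{\s*([\w.]+)\s*\}\}", substitute, template)


def run_runbook(actions: list, context: dict) -> list:
    """Run actions in order, exposing each action's outputs to the next ones."""
    context = {**context, "runbook": {"outputs": {}}}
    resolved_actions = []
    for name, inputs, handler in actions:
        resolved = {field: render(text, context) for field, text in inputs.items()}
        resolved_actions.append((name, resolved))
        context["runbook"]["outputs"][output_key(name)] = handler(resolved)
    return resolved_actions


# Stub handlers standing in for the real integrations.
def fake_zoom(inputs: dict) -> dict:
    return {"join_url": "https://example.invalid/j/123", "meeting_id": "123"}


def fake_slack(inputs: dict) -> dict:
    return {}


actions = [
    ("Zoom: Create Meeting", {"Name": "{{incident.title}} - Incident Bridge"}, fake_zoom),
    ("Slack: Post Message",
     {"Message": "Join the incident bridge at {{runbook.outputs.zoom_create_meeting.join_url}}"},
     fake_slack),
]
result = run_runbook(actions, {"incident": {"title": "Checkout latency"}})
print(result[1][1]["Message"])
```

Because outputs only exist after an action has run, an action can reference earlier actions but not later ones, so action order matters.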
Triggers
Runbooks can be triggered:
Manual Execution
- Run from incident timeline
- Run from pinned runbooks list
- Run via /harness run <slug> Slack command
Automatic Triggers
Incident Created
- Condition: incident.severity in [SEV0, SEV1]
- Runs when new incidents match conditions
Incident Field Updated
- Condition: incident.status changed_to Resolved
- Runs when specific fields change
Alert Rule Match
- Configured in alert rules
- Runs when alerts meet routing criteria
Scheduled
- Cron syntax: 0 9 * * 1 (every Monday at 9am)
- Use for: daily health checks, weekly reports
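To show how the five cron fields (minute, hour, day of month, month, day of week) map onto a schedule, here is a small Python sketch that checks whether a given time matches an expression. It is a simplified matcher written for illustration: it supports only `*` and single numbers, not ranges, lists, or steps, and the `cron_matches` function is hypothetical.

```python
from datetime import datetime


def field_matches(field: str, value: int) -> bool:
    """A field matches when it is '*' or equals the value exactly."""
    return field == "*" or int(field) == value


def cron_matches(expression: str, when: datetime) -> bool:
    """Check a 5-field cron expression against a datetime."""
    minute, hour, day_of_month, month, day_of_week = expression.split()
    # Python numbers Monday as 0; cron numbers Sunday as 0 and Monday as 1.
    cron_weekday = (when.weekday() + 1) % 7
    return (
        field_matches(minute, when.minute)
        and field_matches(hour, when.hour)
        and field_matches(day_of_month, when.day)
        and field_matches(month, when.month)
        and field_matches(day_of_week, cron_weekday)
    )


monday_9am = datetime(2024, 1, 1, 9, 0)   # 1 January 2024 was a Monday
tuesday_9am = datetime(2024, 1, 2, 9, 0)
print(cron_matches("0 9 * * 1", monday_9am), cron_matches("0 9 * * 1", tuesday_9am))
```

The time zone a schedule is evaluated in is not stated here, so it is worth confirming before relying on a specific hour.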
Conditions
Trigger conditions use field comparisons:
Operators:
- equals, not_equals
- in, not_in (for arrays)
- changed_to, changed_from (for field updates)
- contains, not_contains (for strings)
- greater_than, less_than (for numbers)
Examples:
incident.severity in [SEV0, SEV1]
incident.service equals payment-service
incident.status changed_to Resolved
alert.priority equals p1_critical
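The semantics of these operators can be sketched in Python as below. This is one reading of the operator names, not AI SRE's evaluator; the `evaluate` function, its argument order, and the use of a previous record for changed_to and changed_from are assumptions.

```python
from typing import Any, Optional


def get_field(record: Optional[dict], path: str) -> Any:
    """Resolve a dotted path such as 'incident.status'; None when absent."""
    value: Any = record
    for key in path.split("."):
        if not isinstance(value, dict):
            return None
        value = value.get(key)
    return value


def evaluate(path: str, operator: str, expected: Any,
             current: dict, previous: Optional[dict] = None) -> bool:
    """Evaluate one field comparison against the current (and previous) record."""
    actual = get_field(current, path)
    before = get_field(previous, path)
    if operator == "equals":
        return actual == expected
    if operator == "not_equals":
        return actual != expected
    if operator == "in":
        return actual in expected
    if operator == "not_in":
        return actual not in expected
    if operator == "contains":
        return isinstance(actual, str) and expected in actual
    if operator == "not_contains":
        return isinstance(actual, str) and expected not in actual
    if operator == "greater_than":
        return actual is not None and actual > expected
    if operator == "less_than":
        return actual is not None and actual < expected
    if operator == "changed_to":
        return previous is not None and before != expected and actual == expected
    if operator == "changed_from":
        return previous is not None and before == expected and actual != expected
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical before/after snapshots of the same incident.
old = {"incident": {"severity": "SEV1", "service": "payment-service", "status": "Open"}}
new = {"incident": {"severity": "SEV1", "service": "payment-service", "status": "Resolved"}}
print(evaluate("incident.severity", "in", ["SEV0", "SEV1"], new))
print(evaluate("incident.status", "changed_to", "Resolved", new, old))
```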
Integration Examples
Slack Incident Channel Creation
Goal: Create a dedicated Slack channel for each P1 incident
- Enable Slack integration:
  - Navigate to Project Settings → Integrations
  - Connect Slack workspace
- Create runbook:
  - Action: Slack: Create Channel
    - Channel Name: inc-{{incident.short_id}}-{{incident.service}}
    - Topic: {{incident.title}} - {{incident.severity}}
    - Invite Users: {{incident.responders}}
  - Action: Slack: Post Message
    - Channel: {{runbook.outputs.slack_create_channel.channel_id}}
    - Message:
      🚨 **Incident Summary**
      **Service**: {{incident.service}}
      **Severity**: {{incident.severity}}
      **Owner**: {{incident.owner}}
      View details: {{incident.url}}
- Configure trigger:
  - Trigger Type: Incident Created
  - Conditions: incident.severity in [SEV0, SEV1]
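Slack channel names must be lowercase, at most 80 characters, and free of spaces and periods, so a service name such as "Payment Service" would not be valid as-is in the Channel Name template. Whether the Slack action normalizes names automatically is not stated here. The Python sketch below shows one way such a name could be normalized; the `slack_channel_name` function is hypothetical and keeps only ASCII letters, digits, hyphens, and underscores.

```python
import re

MAX_CHANNEL_NAME_LENGTH = 80  # Slack's documented limit for channel names


def slack_channel_name(short_id: str, service: str) -> str:
    """Build 'inc-<short_id>-<service>' and normalize it for Slack."""
    raw = f"inc-{short_id}-{service}".lower()
    # Replace anything outside a-z, 0-9, '_' and '-' with a hyphen.
    cleaned = re.sub(r"[^a-z0-9_-]+", "-", raw)
    # Collapse repeated hyphens and trim them from the ends.
    cleaned = re.sub(r"-{2,}", "-", cleaned).strip("-")
    return cleaned[:MAX_CHANNEL_NAME_LENGTH]


print(slack_channel_name("1042", "Payment Service"))
```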
Jira Bidirectional Sync
Goal: Create Jira tickets for incidents and sync status updates
Outbound (AI SRE → Jira):
- Create runbook:
  - Action: Jira: Create Issue
    - Project: INCIDENT
    - Summary: {{incident.title}}
    - Description:
      Incident: {{incident.url}}
      Service: {{incident.service}}
      Severity: {{incident.severity}}
  - Action: AI SRE: Add Timeline Event
    - Description: Jira ticket created: {{runbook.outputs.jira_create_issue.issue_key}}
- Trigger: Incident Created with severity P1/P2
Inbound (Jira → AI SRE):
Requires custom configuration using Jira Automation Rules:
- In Jira, create automation rule:
  - Trigger: Issue Updated
  - Condition: Status changed
  - Action: Send web request
    - URL: https://app.harness.io/gateway/ai-sre/api/v1/incidents/{{issue.customfield_incident_id}}/status
    - Method: PUT
    - Body: {"status": "{{issue.status}}"}
See Jira Integration Guide for complete setup.
ServiceNow Change Correlation
Goal: Automatically surface ServiceNow change requests in incident investigations
- Configure ServiceNow connector:
  - Navigate to Project Settings → Connectors
  - Add ServiceNow connector with credentials
- Enable RCA Change Agent:
  - The agent automatically polls ServiceNow every 5 minutes
  - Change requests are correlated with incidents by:
    - Time window (changes within 1 hour of incident)
    - Service/CI matching
    - Configuration item relationships
- View correlated changes:
  - Open an incident
  - Go to Investigation tab
  - ServiceNow changes appear as root cause theories
See RCA Change Agent for details.
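As a rough illustration of the first two correlation criteria, the Python sketch below filters change requests by time window and service match. It is not the RCA Change Agent's algorithm: the `correlated_changes` function and the field names (`created_at`, `completed_at`, `service`) are hypothetical, "within 1 hour" is read as one hour on either side of the incident, and configuration item relationships are not modeled.

```python
from datetime import datetime, timedelta

CORRELATION_WINDOW = timedelta(hours=1)  # changes within 1 hour of incident


def correlated_changes(incident: dict, changes: list) -> list:
    """Return changes for the same service inside the time window, nearest first."""
    def distance(change: dict) -> timedelta:
        return abs(incident["created_at"] - change["completed_at"])

    matches = [
        change for change in changes
        if distance(change) <= CORRELATION_WINDOW
        and change["service"] == incident["service"]
    ]
    return sorted(matches, key=distance)


# Hypothetical incident and change requests.
incident = {"service": "payment-service", "created_at": datetime(2024, 1, 1, 12, 0)}
changes = [
    {"number": "CHG001", "service": "payment-service", "completed_at": datetime(2024, 1, 1, 11, 30)},
    {"number": "CHG002", "service": "payment-service", "completed_at": datetime(2024, 1, 1, 10, 0)},
    {"number": "CHG003", "service": "search-service", "completed_at": datetime(2024, 1, 1, 11, 45)},
]
print([change["number"] for change in correlated_changes(incident, changes)])
```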
Best Practices
Start Simple
- Begin with 1-2 critical workflows (e.g., P1 notifications)
- Add automation incrementally as you validate effectiveness
- Avoid over-automating before processes are stable
Test Thoroughly
- Use runbook test mode to validate actions without creating real incidents
- Test with non-production services first
- Verify Mustache variables render correctly with sample data
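One way to check templates against sample data before relying on them is to list every placeholder that fails to resolve. The Python sketch below does this for plain {{dotted.path}} variables; the `unresolved_variables` function and the sample data are hypothetical and independent of the runbook test mode.

```python
import re


def unresolved_variables(template: str, sample: dict) -> list:
    """Return placeholder paths that have no value in the sample data."""
    missing = []
    for path in re.findall(r"\{\{\s*([\w.]+)\s*\}\}", template):
        value = sample
        for key in path.split("."):
            value = value.get(key) if isinstance(value, dict) else None
            if value is None:
                break
        if value is None and path not in missing:
            missing.append(path)
    return missing


template = "Resolved: {{incident.title}} after {{incident.duration}} by {{incident.resolver}}"
sample = {"incident": {"title": "Checkout latency", "resolver": "jane@company.com"}}
print(unresolved_variables(template, sample))
```

An empty list means every variable in the template rendered with the sample data.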
Handle Failures Gracefully
- Add error notifications if critical actions fail
- Use HTTP action retries for flaky APIs
- Monitor runbook execution logs for patterns
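For a Script action or an external service that calls a flaky API, retrying with exponential backoff is a common approach. The Python sketch below is a generic pattern, not the HTTP action's built-in retry behavior; the `call_with_retries` function and its parameters are hypothetical.

```python
import time
from typing import Callable, Tuple, Type


def call_with_retries(
    operation: Callable[[], object],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError, TimeoutError),
    sleep: Callable[[float], object] = time.sleep,
):
    """Call operation(), retrying on transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...


# Simulated flaky call that succeeds on the third attempt.
state = {"calls": 0}


def flaky_call() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


print(call_with_retries(flaky_call, sleep=lambda seconds: None))
```

Only errors listed in `retry_on` are retried; anything else fails immediately, which keeps genuine bugs from being masked by retries.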
Monitor Effectiveness
- Track runbook execution success rates
- Measure time-to-response improvements
- Gather feedback from incident responders
- Iterate based on what works
Document Runbook Purpose
- Add clear descriptions to each runbook
- Document what triggers it and what it does
- Note any prerequisites (credentials, permissions)
- Keep runbooks focused on single workflows
Related Documentation
- Create a Runbook - Detailed runbook creation guide
- Runbook Triggers - Configure automatic execution
- Alert Rules - Route alerts and auto-create incidents
- Slack Integration - Slack action reference
- Jira Integration - Jira action reference