Computing uptime for Harness Modules

This is a Harness operational reference guide for the Service Level Indicators (SLIs) across our modules. Our SLOs are calculated from these user-centric SLIs.

Weightage Factor

Harness operations applies a weighting factor to the SLIs after any incident.

  • Major outage = 100% of the downtime hit
  • Partial outage = 30% of the downtime hit
  • Degraded performance = none. Degraded performance does impact the user experience, but it is not technically downtime.
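
The weighting above can be sketched as a small lookup. This is only an illustration, not the tooling Harness operations uses; the function name, the dictionary keys, and the 10-minute example duration are assumptions made for the sketch.

```python
# Weighting factors from the list above, expressed as whole percentages.
# The keys ("major", "partial", "degraded") are illustrative names.
WEIGHT_PERCENT = {
    "major": 100,    # full downtime hit
    "partial": 30,   # 30% of the downtime hit
    "degraded": 0,   # not counted as downtime
}


def weighted_downtime_seconds(severity: str, duration_seconds: int) -> float:
    """Return the downtime charged against availability for one incident."""
    return duration_seconds * WEIGHT_PERCENT[severity] / 100


# A hypothetical 10-minute partial outage is charged as 180 seconds.
print(weighted_downtime_seconds("partial", 10 * 60))  # 180.0
```

Storing the weights as whole percentages keeps the arithmetic exact for durations measured in whole seconds.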

A production incident, commonly known as an "incident," is an unexpected event or problem in our live production environments that results in a complete or partial service disruption. A partial incident renders one or more functions of a module nonfunctional or inaccessible. All production incidents are posted on our status page (https://status.harness.io), and users can subscribe to its feeds to get notified.

Recent Incidents and How We Calculate Our Availability

Oct 4th - Impacted Continuous Integration Enterprise (CIE) - Self Hosted Runners

Incident: Issue with sending Git Status for PR
URL: https://status.harness.io/incidents/p24h63dhy18d

  • Component: Platform/Delegate
  • SLI: API Error Rate
  • Availability: Partial outage of 28 minutes
  • Threshold: More than 1% over 5 min rolling window
  • SLA Impact: Partial outage of 28 minutes (1,680 seconds). Taking 30% of the downtime hit comes to 504 seconds.

During the incident, the error rate for the Platform/Delegate component exceeded 1% over a 5-minute rolling window because of a missing dependency, resulting in a partial outage for CIE - Self Hosted Runners in Prod-2.

Oct 16th - Impacted all the components in Prod-2

Incident: Failed to retrieve license information seen for some customers
URL: https://status.harness.io/incidents/bwpdhdyyyjfw

  • Component: Platform/Login
  • SLI: API Error Rate
  • Availability: Partial outage of 8 minutes
  • Threshold: More than 1% over 5 min rolling window
  • SLA Impact: Partial outage of 8 minutes (480 seconds). Taking 30% of the downtime hit comes to 144 seconds.

During the incident, the error rate for the Platform/Login component exceeded 1% over a 5-minute rolling window, resulting in a partial outage across all of our components in Prod-2.
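
As a rough illustration of how the two incidents above could roll up into an availability figure, the sketch below sums their weighted downtime and divides by the length of the month. It assumes a 31-day October measurement window and that both incidents count against the same module in Prod-2 (for example CIE); the source does not state the window used, so treat the resulting percentage as an example only.

```python
# Assumption: availability is measured over the calendar month (31 days).
SECONDS_IN_OCTOBER = 31 * 24 * 60 * 60  # 2,678,400 seconds

# (label, outage duration in seconds, weighting in percent)
incidents = [
    ("Oct 4th", 28 * 60, 30),   # partial outage, 1,680 seconds
    ("Oct 16th", 8 * 60, 30),   # partial outage, 480 seconds
]

# Weighted downtime: 504 + 144 seconds.
weighted = sum(duration * pct / 100 for _, duration, pct in incidents)

availability = 100 * (1 - weighted / SECONDS_IN_OCTOBER)

print(f"{weighted:.0f} s weighted downtime, {availability:.4f}% availability")
# prints: 648 s weighted downtime, 99.9758% availability
```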

Service Level Indicators specific to Harness Modules

Pipelines

The pipeline is a core construct of the Harness platform. The SLIs defined here apply to CD, CI, STO, and any other module whose usage is tied to a pipeline.

Component | SLI | Threshold | Availability
Pipeline/Triggers/Input Sets | APIs Error rate | More than 1% over 5 min rolling window | Major Outage
Pipeline/Triggers/Input Sets | API Response Time | 95th percentile: > 1s over 5 min rolling window | Degraded Performance
Pipeline Executions failure caused by Harness platform | Failure rate Increase | More than 50% over 5 min rolling window | Major Outage
Pipeline Executions failure caused by Harness platform | Failure rate Increase | More than 1% over 5 min rolling window | Partial Outage
Pipeline Executions failure caused by Harness platform | Slow Executions | 2x of average latency in a rolling window of 5 mins | Degraded Performance
Triggers | Trigger Activations | More than 1% over 5 min rolling window | Degraded Performance
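
Many thresholds in this guide take the form "more than 1% over 5 min rolling window". A minimal sketch of such a check follows, assuming per-minute request and error counts are available and that a window is only evaluated once five full minutes of data exist. The function name and the sample data are hypothetical; this is not the actual monitoring implementation.

```python
from collections import deque


def breaching_minutes(samples, window=5, threshold_pct=1.0):
    """Return the indices of minutes whose trailing window breaches the threshold.

    samples: list of (requests, errors) tuples, one per minute.
    """
    recent = deque(maxlen=window)  # holds the last `window` minutes
    breaches = []
    for minute, sample in enumerate(samples):
        recent.append(sample)
        if len(recent) < window:
            continue  # assumption: wait until a full window is available
        total_requests = sum(requests for requests, _ in recent)
        total_errors = sum(errors for _, errors in recent)
        if total_requests and 100 * total_errors / total_requests > threshold_pct:
            breaches.append(minute)
    return breaches


# Hypothetical traffic: one bad minute (10% errors) in otherwise clean data.
samples = [(1000, 0)] * 5 + [(1000, 100)] + [(1000, 0)] * 5
print(breaching_minutes(samples))  # [5, 6, 7, 8, 9]
```

The single bad minute keeps the 5-minute error rate at 2% until it leaves the window, which is why five consecutive minutes are reported.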

Platform

Core platform constructs and services are foundational to Harness modules and any breach of these SLIs will impact all of the Harness modules.

Component | SLI | Threshold | Availability
Access Control | Permissions Change Processing Time | New permissions (additions/removals) should take effect within 5 minutes | Degraded Performance
Platform resources (All APIs) - Account, Login, Project/Org, Connectors, Secrets, Delegate, Settings, Notifications, Audits, Templates, Services, Environments, Policies, File Store, Log Uploads | API Error rate | More than 1% over 5 min rolling window | Partial Outage
Platform resources (All APIs) | API Response Time | 95th percentile: > 1s over 5 min rolling window | Degraded Performance
Notifications | Notification Delivery Latency | 99% of notifications are dispatched within 1 minute from the moment they are sent to the notification service | Degraded Performance
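
The "95th percentile: > 1s" response-time thresholds can be illustrated with a nearest-rank percentile over the response times seen in one 5-minute window. The nearest-rank method and the function names are assumptions chosen for the sketch; the source does not say which percentile estimator is used in production.

```python
def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("no samples in window")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def is_degraded(response_times_s, limit_s=1.0):
    """True when the 95th percentile response time exceeds the limit."""
    return percentile_nearest_rank(response_times_s, 95) > limit_s


# Hypothetical window: 6% of requests are slow, so the p95 lands on a slow one.
print(is_degraded([0.2] * 94 + [1.5] * 6))  # True
# With only 4% slow requests, the p95 is still a fast request.
print(is_degraded([0.2] * 96 + [1.5] * 4))  # False
```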

CD & GitOps (NextGen)

All the Pipeline and Platform SLIs are applicable here.

Component | SLI | Threshold | Availability
GitOps | APIs Error rate | More than 1% over 5 min rolling window | Partial Outage
GitOps | API Response Time | 95th percentile: > 1s over 5 min rolling window | Degraded Performance

CI Test Intelligence

All the Pipeline and Platform SLIs are applicable here.

SLI | Threshold | Availability
APIs Error rate | More than 1% over 5 min rolling window | Degraded Performance
API Response Time | 95th percentile: > 1s over 5 min rolling window | Degraded Performance

Feature Flags

All the Platform SLIs are applicable here. The Pipeline SLIs are relevant if the FF use case is tied to a pipeline.

SLI | Threshold | Availability
Admin UI response time | 95th percentile: > 30s over a 5 minute rolling window | Degraded Performance
Admin UI error rate | 5% of requests over a 5 min rolling window fail to respond or return a 5xx error | Partial Outage
Authentication response time | 95th percentile: > 30s over a 5 minute rolling window | Degraded Performance
Authentication error rate | 5% of requests over a 5 min rolling window fail to respond or return a 5xx error | Major Outage
SDK evaluation response time | 95th percentile: > 30s over a 5 minute rolling window | Degraded Performance
SDK evaluation error rate | 5% of requests over a 5 min rolling window fail to respond or return a 5xx error | Major Outage
SDK metrics response time | 95th percentile: > 30s over a 5 minute rolling window | Degraded Performance
SDK metrics error rate | 5% of requests over a 5 min rolling window fail to respond or return a 5xx error | Partial Outage
SDK events request response time | 95th percentile: > 30s over a 5 minute rolling window | Degraded Performance
SDK events request error rate | 5% of requests over a 5 min rolling window fail to respond or return a 5xx error | Major Outage

Dashboards

SLI | Threshold | Availability
Dashboards not Loading | For a duration of 60 secs | Major Outage
Latency in Loading dashboards | 2x of average latency in a rolling window of 5 mins | Degraded Performance
CRUD/Actions not working | For a duration of 60 secs | Partial Outage

Cloud Cost Management

All the Pipeline and Platform SLIs are applicable here.

SLI | Threshold | Availability
APIs Error rate | More than 1% over 5 min rolling window | Major Outage
API Response Time | 95th percentile: > 1s over 5 min rolling window | Degraded Performance
CCM UI is down (ping failure) | For a consecutive duration of 30 secs | Major Outage
Perspective load times | Greater than 2 mins for a consecutive duration of 10 mins | Partial Outage
Max AutoStopping rule warmup time | Greater than 10 mins for a consecutive duration of 30 mins | Partial Outage
Max asset gov policy evaluation | Greater than 15 mins for a consecutive duration of 30 mins | Partial Outage
Cloud provider data ingestion delay | Greater than 48 hrs of no data received | Partial Outage
K8s data at hourly granularity | No events received for more than 6 hrs | Partial Outage
K8s data at daily granularity | No events received for more than 48 hrs | Partial Outage

Chaos Engineering

All the Platform SLIs are applicable here. The Pipeline SLIs are relevant if the chaos use case is tied to a pipeline.

SLI | Threshold | Availability
APIs Error rate | More than 1% over 5 min rolling window | Major Outage
API Response Time | 95th percentile: > 1s over 5 min rolling window | Degraded Performance
Load times on UI | Data load time > 10s consecutively over a 5 min period | Degraded Performance
ChaosGuard Rule Evaluation Duration | The ChaosGuard rule evaluation stage takes > 10s consecutively over a 5 min period across experiment runs | Degraded Performance

Service Reliability Management

All the Platform SLIs are applicable here.

Component | SLI | Threshold | Availability
SLO Creation API | APIs Error rate | More than 5% over 5 min rolling window | Major Outage
SLO Creation API | APIs Error rate | More than 1% over 5 min rolling window | Partial Outage
SLO Update API | APIs Error rate | More than 5% over 5 min rolling window | Major Outage
SLO Update API | APIs Error rate | More than 1% over 5 min rolling window | Partial Outage
SLO List API | APIs Error rate | More than 5% over 5 min rolling window | Major Outage
SLO List API | APIs Error rate | More than 1% over 5 min rolling window | Partial Outage
Monitored Service Creation API | APIs Error rate | More than 5% over 5 min rolling window | Major Outage
Monitored Service Creation API | APIs Error rate | More than 1% over 5 min rolling window | Partial Outage
Monitored Service Update API | APIs Error rate | More than 5% over 5 min rolling window | Major Outage
Monitored Service Update API | APIs Error rate | More than 1% over 5 min rolling window | Partial Outage
Monitored Service List API | APIs Error rate | More than 5% over 5 min rolling window | Major Outage
Monitored Service List API | APIs Error rate | More than 1% over 5 min rolling window | Partial Outage
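
Each API in the table above has two tiers on the same SLI, so the stricter tier has to be checked first. A minimal sketch, assuming the error rate for the 5-minute window has already been computed as a percentage (the function name is hypothetical):

```python
from typing import Optional


def classify_error_rate(error_rate_pct: float) -> Optional[str]:
    """Map a 5-minute error rate (in percent) to an availability level."""
    if error_rate_pct > 5:
        return "Major Outage"
    if error_rate_pct > 1:
        return "Partial Outage"
    return None  # within threshold, no availability impact


print(classify_error_rate(7.5))  # Major Outage
print(classify_error_rate(2.0))  # Partial Outage
print(classify_error_rate(0.4))  # None
```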

Security Test Orchestration

All the Platform SLIs are applicable here. The Pipeline SLIs are relevant if the STO use case is tied to a pipeline.

Component | SLI | Threshold | Availability
STO APIs | 4xx Error Rate | More than 5% over 5 min rolling window | Possible Partial Outage
STO APIs | 5xx Error Rate | More than 1% over 5 min rolling window | Partial Outage
STO APIs | 5xx Error Rate | More than 5% over 5 min rolling window | Major Outage
STO APIs | Response Time | 95th percentile: > 1s over 5 min rolling window | Degraded Performance
Pipeline Steps | Security Step Execution Failures | 25% increase in security stage execution failures in a rolling window of 5 mins | Partial Outage

Continuous Error Tracking

All the Platform SLIs are applicable here.

SLI | Threshold | Availability
APIs Error rate | More than 1% over 5 min rolling window | Major Outage
API Response Time | 95th percentile: > 1s over 5 min rolling window | Degraded Performance
Agent cannot connect to CET collector | For a consecutive duration of 60 secs | Major Outage
Agent not being shown as connected in the UI | For a consecutive duration of 60 secs | Partial Outage
Agent not being shown as connected in the UI | Latency greater than 30 seconds for a consecutive duration of 10 mins (95th percentile) | Degraded Performance
UI is down | For a consecutive duration of 30 secs | Major Outage
ARC screen is down | No hit is openable | Major Outage
ARC screen is down | Some hits aren’t openable - at least 20% of the hits in a total of at least 20 unique events | Degraded Performance
Tiny links not working | Tiny link doesn’t direct to a viewable ARC screen | Partial Outage
Tiny links not working | Tiny link should be clickable after no more than 90s after it was logged | Degraded Performance
New events/Metrics don’t show up on the summary or event list | For a consecutive duration of 180 secs | Major Outage
New events/Metrics don’t show up on the summary or event list | Latency greater than 125 seconds in metrics since happened in the agent until shown in the UI | Degraded Performance
Notifications | Expected notification doesn’t arrive for a consecutive duration of 60 secs after the ETA | Major Outage
Notifications | Latency greater than 30 seconds | Degraded Performance
Notifications | Links in notifications don’t work | Degraded Performance
Admin operations not working (Including: Tokens, Critical events, hide & resolve events, Jira integration, Notifications, Saved Search) | For a consecutive duration of 30 secs | Major Outage

Internal Developer Portal

All the Platform SLIs are applicable here.

SLI | Threshold | Availability
IDP UI is down (Included: Catalog, Self service Hub, Scorecards. Excluded: Non-Harness owned plugins) | For a consecutive duration of 30 secs | Major Outage
IDP admin UI is down | For a consecutive duration of 30 secs | Partial Outage
Unable to access Service Catalog APIs | 5XX Errors for a consecutive duration of 30 secs (95th percentile) | Major Outage
Unable to access Service Catalog APIs | Latency greater than 30 seconds for a consecutive duration of 10 mins (95th percentile) | Partial Outage
Scorecards not functional | 5XX Errors for a consecutive duration of 30 secs (95th percentile) | Partial Outage
Scorecards not functional | Latency greater than 60 seconds for a consecutive duration of 10 mins (95th percentile) | Degraded Performance
Issue with IDP admin operations | 5XX Errors for a consecutive duration of 30 secs (95th percentile) | Partial Outage
Issue with IDP admin operations | Latency greater than 10 seconds for a consecutive duration of 10 mins (95th percentile) | Degraded Performance
Open Source Plugins functionality | 5XX Errors for a consecutive duration of 30 secs (95th percentile) | Degraded Performance
Open Source Plugins functionality | Latency greater than 30 seconds for a consecutive duration of 10 mins (95th percentile) | Degraded Performance

Code Repository

All the Platform SLIs are applicable here.

SLI | Threshold | Availability
Git Operations success rate (Clone, Pull, Push and associated operations like Merge, Blame) | > 99.9% over a rolling 5 min window | Major Outage
Git Operations execution time (Clone, Pull, Push and associated operations like Merge, Blame) | 2x increase of time for git operations | Degraded Performance
CODE Reviews Error Rate Increase | 5% increase in 5xx errors in a rolling window of 5 mins | Partial Outage
CODE Reviews Latency Increase | 2x of average latency in a rolling window of 5 mins | Degraded Performance
PR Checks & Webhooks - Error Rate Increase in PR Checks | 5% increase in 5xx errors in a rolling window of 5 mins | Degraded Performance
PR Checks & Webhooks - Webhooks are not triggered | 5% increase in 5xx errors in a rolling window of 5 mins | Degraded Performance
UI unable to render page | For a consecutive duration of 2 min | Major Outage

Infrastructure as Code Management

SLI | Threshold | Availability
APIs Error rate | More than 1% over 5 min rolling window | Major Outage
API Response Time | 95th percentile: > 1s over 5 min rolling window | Degraded Performance
Unable to run IaC Stage & Steps in a Pipeline | APIs are down for a consecutive duration of 60 seconds | Major Outage
Unable to run IaC Stage & Steps in a Pipeline | 10% of traffic generates 5xx error in a rolling window of 5 mins | Partial Outage
Unable to run IaC Stage & Steps in a Pipeline | 2x of average latency in a rolling window of 5 mins | Degraded Performance

Supply Chain Security

All the Platform and Pipeline SLIs are applicable here.

SLI | Threshold | Availability
APIs Error rate | More than 1% over 5 min rolling window | Major Outage
API Response Time | 95th percentile: > 1s over 5 min rolling window | Degraded Performance

Software Engineering Insights

SLI | Threshold | Availability
Login Failure (Legacy only) | Greater than 30 seconds for a consecutive duration of 5 minutes | Major Outage
Integrations list API Error Rate | Failure rate (5XX) of the API in 5 minutes > 0.5 | Major Outage
Integrations list API Latency | Response time greater than 30 seconds | Degraded Performance
Ingestion Delay | Delay in receiving any events > 24 hours | Partial Outage
ETL / Aggregations Delay | Delay in receiving any events > 48 hours | Partial Outage
ETL / Aggregations Performance | Jobs stuck in scheduled state for more than 12 hours | Degraded Performance
ES Indexing Delay | Delay in receiving any events > 48 hours | Partial Outage
UI dashboard widget Load times | Greater than 3 mins for a consecutive duration of 10 mins for all customers | Degraded Performance
UI landing page/dashboard page not loading | For a consecutive duration of 5 mins | Major Outage
Trellis Events | Delay in processing events > 24 hours or monthly calculation not finished in first 7 days | Degraded Performance
DB Health | DB Load > 80% | Degraded Performance
ES Cluster health | ES cluster state RED / read-only mode | Partial Outage
Server API Error rate (5XX) | More than 1% over 5 min rolling window | Major Outage
API Response Time | 95th percentile: > 15s over 5 min rolling window | Degraded Performance