Failure strategy support for Service and Environment
Currently, this feature is behind the feature flag, CDS_SERVICE_INFRA_FAILURE_STRATEGY
. Contact Harness Support to enable the feature.
Failure strategies can be applied to Service and Environment steps, allowing users to define failure actions like retries, abort etc. in case of failure. This approach helps maintain deployment stability by allowing the system to handle unexpected issues gracefully. By default, the failure strategy for Service and Environment steps is inherited from the stage configuration.
Supported Failure Strategies
The following failure strategies will be supported for Service and Environment steps:
Services
- Rollback Pipeline
- Retry
- Abort
- Mark as Failure
Environment and Infrastructure
- Rollback Pipeline
- Retry
- Abort
- Mark as Failure
- Rollback Stage
Examples covering failure strategy for Service and Infrastructure
The Service and Infrastructure step does not support failure strategy configuration and rather inherits. This means that any failure actions configured for the stage will apply to the Service step as well.
Retry Service Step
YAML configuration demonstrating the Retry strategy for a service step
This pipeline YAML demonstrates a service step that references an invalid artifact (primaryArtifactRef
). As a result, the step will fail. The failure strategy is configured to retry
the step twice, waiting 10 seconds between retries. If the issue persists, the pipeline will be marked as failed.
Pipeline Rollback on Infra Failure
YAML configuration where an incorrect infrastructure configuration triggers a pipeline rollback
This pipeline defines an infrastructure step with an incorrect connectorRef
. The failure strategy is configured to trigger a PipelineRollback
on failure. This means that if the infrastructure setup fails, the entire pipeline will revert to its previous state.
The infrastructure step will fail due to the incorrect connector reference, causing the pipeline to roll back both t2
and t1
stages.
Failure Strategy Support for Multiservice, Multi-Infrastructure, and Matrix Deployment
Currently, this feature is behind the feature flag, CDS_MULTI_DEPLOYMENT_ON_FAILURE
. Contact Harness Support to enable the feature.
Multiservice deployments can be achieved using either the native multiservice configuration or the matrix configuration.
Applying Failure strategy during Multi-Deployment
In the example below, two services (ser_1, ser_2) are deployed across two infrastructures (infra2, infra1). The infrastructures are configured to deploy in parallel, and a failure strategy is defined to ensure resilient handling of any errors during deployment.
Failure strategy during Multi-Deployment
In this deployment will proceed as follow:-
(ser_1 and infra_2) and ( ser_2 and infra_2 ) will be deployed in parallel (different services within the same infrastructure). (ser_1 and infra_1) and 4 (ser_2 and infra_1) will be deployed in parallel (different services within the same infrastructure)
You need to add the following field in the YAML configuration to handle failure scenarios in multi-deployment setups:-
multiDeploymentConfig:
onFailure: <value>
The onFailure
field supports two values:
RunAll
: This value ensures that if one deployment fails, the system will continue executing all remaining deployments.
multiDeploymentConfig:
onFailure: RunAll
In the example above, we have given given failure srategy for multideployment as RunAll, for example there is an issue with ser_1 and both the stages in series fails as well as there is an issue with infra_2 due to invalid configuration of artifact and connectors. In this case stage with (ser_1 infra_2), (ser_2 and infra_2) and (ser_1 and infra_1) will fail. The RunAll failure strategy ensures that even though these stages are failing (ser_2 and infra_1) proceeds with the deployment.
SkipQueued
When this value is used, if any deployment fails, all queued deployments that haven’t yet started will be skipped.
multiDeploymentConfig:
onFailure: SkipQueued
If a failure occurs during the deployment of ser_1 on infra2, all subsequent deployments, such as ser_2 on infra1, will be skipped, halting the pipeline for queued tasks.
In a parallel deployment, any deployments that have already started cannot be stopped, even if a failure occurs. Only queued deployments that haven’t begun execution will be skipped.
Applying Failure strategy for multi service with Matrix looping
Adding Failure strategy using Matrix for Multi-Deployment
The matrix strategy defines two variables—service and infra with each service rollingDeploy2
and rollingDeploy1
deployed across both infra1
and infra2
.
The onFailure
field needs to be added inside the matrix configuration to define how the system should handle failures.
The onFailure
field supports two values:
strategy:
onFailure: <value>
RunAll
This ensures that even if one or more infrastructure-service combinations fail, the remaining ones will continue to execute. For instance, if infra1 encounters issues, the system will continue deploying to infra2.
strategy:
onFailure: RunAll
If infra1 encounters connector issues and fails during the deployment, the remaining combination will still proceed to deployment. This allows the other deployments to continue without interruption
SkipQueued
When this value is applied, any failures that occur will cause the system to skip the remaining queued deployments that haven’t started yet.
strategy:
onFailure: SkipQueued
In this case, if infra1 fails during deployment due to an issue (e.g., invalid connector reference), the remaining deployment will be skipped, stopping the execution for all pending or queued combinations.
Override Failure Strategy at Service/Environment Level
Users can override stage-level failure strategies and define service- or environment-specific strategies to control pipeline behavior during failures.
Supported Failure Strategies
-
Service Level:
- Rollback Pipeline
- Retry Step
- Abort
- Mark as Failure
-
Environment and Infrastructure Level:
- Abort
- Rollback Pipeline
- Retry Step
- Mark as Failure
- Rollback Stage
These strategies can be configured in the Advanced Configuration section of the service/environment tabs in the pipeline and are reflected in the YAML under the failureStrategies
field. Failure strategies are directly associated with the service/environment configuration, ensuring flexibility and alignment with the existing failure strategy schema.
Behavior of Failure Strategies
-
Fallback to Stage-Level Strategy:
If no failure strategy is set at the service or environment level, the pipeline will fall back to the failure strategy defined at the stage level. However, this fallback only works if the stage-level strategy is within the subset of allowed failure strategies for the service or environment level.Failure strategies at the service/environment level are completely independent of step-level failure strategies.
-
Flexibility Over Propagation:
When propagating services and environments across stages, failure strategies are not propagated. Instead, users are given the flexibility to define a specific failure strategy for each stage, ensuring tailored configurations.
Limitations
-
Excluded Strategies for Environment Failures:
Strategies such as Manual Intervention, Ignore Failure, Mark as Success, and Proceed with Default Values are not supported as they are not applicable to environment failures. -
Excluded Strategies for Service Failures:
Strategies such as Manual Intervention, Ignore Failure, Mark as Success, and Proceed with Default Values are not supported as they are not applicable to service failures. -
Stage-Level Failure Strategy vs. Service/Environment-Level Failures If a failure strategy is defined at the stage level, it will only apply if the failure occurs at a level that supports the chosen strategy.
For example:
- Suppose a failure occurs at the service or environment level, but the stage-level failure strategy is set to Mark as Success.
- Since the service level only supports Rollback Pipeline, Retry Step, Abort, and Mark as Failure, the Mark as Success strategy will not be applicable in this scenario. This means failure strategies must be aligned with the supported failure-handling mechanisms at each level.