Orca Operations and Fine-Tuning Orca Performance
The following KB article explains how a modern Orca operates in Spinnaker and provides guidance on fine-tuning Orca performance in modern Spinnaker environments. Please note that customers should enable monitoring in their environment before proceeding with any changes.
Contents: Introduction to Orca | Background on Classic Orca Operations vs. Modern Orca Services | Differences and Changes - Blocking Queues | Differences and Changes - pollSkippedNoCapacity | Monitoring Orca | Queue Lag | Thread Handling | Activity Load | Tuning Orca | Scaling | Scaling Vertically | Scaling Horizontally
Introduction to Orca
Orca is the system that handles and processes "tasks" and "stages" for Spinnaker. Application saves, server creation, pipeline updates, and everything else run as "tasks" that execute through Orca. Orca has a few key components:
- The work queue system
- The storage system for executions
- The pipelines/stages themselves

The work queue tracks the in-process task state and schedules/runs the pipeline tasks. The storage system stores the results of each task and all artifacts tasks generate or add.
Background on Classic Orca Operations vs. Modern Orca Services
- Changes to Orca's queue to fill up available work: https://medium.com/@rizza/dev-journal-ed-4-e2cf2ef2fbc6
- Blog on monitoring (Orca emphasis): https://blog.spinnaker.io/monitoring-spinnaker-part-1-4847f42a3abd
While the articles above share some unique insights into Orca operations, they are largely based on blog posts Netflix published on the topic more than five years ago.
While helpful, they present some information that no longer matches how modern Orca operates or the current code. They remain useful as generic operational guidance, but several of the metrics they reference don't match those in currently running systems.
Differences and Changes - Blocking Queues
All Java services, as a rule, use thread pools as part of their operations. In a thread pool, a work queue holds the tasks waiting to be run. The work queue is exposed via the threadPool.* metrics, with a critical metric reported as blockingQueueSize. The blogs above state that this "should always be 0" when looking at the messageHandlerPool thread pool (aka the core work pool). It can be queried using the query below:
max(threadpool_blockingQueueSize{id="messageHandlerPool",spinSvc="orca"})
Differences and Changes - pollSkippedNoCapacity
Please also note that the pollSkippedNoCapacity message sometimes referenced in articles and blog posts doesn't work: it is ignored via a NoOp handler, is not added as a metric, and isn't sent via any notification. This appears to be something Netflix extended or completed internally and never released to OSS.
Monitoring Orca
Before tuning the Orca service, customers should have monitoring set up with the Armory Observability Plugin. A monitored environment lets Cloud Admins see whether their adjustments resolve issues and helps pinpoint where further fine-tuning is needed. For more information about setting up monitoring within a Spinnaker environment, please visit our Docs: https://docs.armory.io/continuous-deployment/armory-admin/observe/observability-configure/
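For reference, a minimal sketch of what enabling the plugin can look like in the SpinnakerService.yml manifest is below. The plugin ID, profile placement, and metrics keys shown here are assumptions for illustration; the linked Armory docs are the authoritative source for the exact configuration.

spec:
  spinnakerConfig:
    profiles:
      spinnaker:                 # the "spinnaker" profile applies to all services, including Orca
        spinnaker:
          extensibility:
            plugins:
              Armory.ObservabilityPlugin:
                enabled: true
                config:
                  metrics:
                    prometheus:  # backend-specific settings; verify these keys against the docs above
                      enabled: true
            # a plugin repository entry is also required; see the docs above for the URL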
Queue Lag
Query by message latency: the longer a message takes to get through the system, the longer end users see pipelines take. This is the key performance indicator for how Orca is doing at processing messages.
-- NRQL
FROM Metric SELECT max(queue.message.lag)/1000 FACET customerName
TIMESERIES since 7 days ago
-- PromQL (rough)
queue_message_lag_seconds_max
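As a usage example, the rough PromQL metric above can also drive an alert on queue lag. The following is a generic Prometheus alerting-rule sketch assuming that metric name; the 5-second threshold and 10-minute window are arbitrary starting points, not recommendations from this article.

groups:
  - name: orca-queue-lag
    rules:
      - alert: OrcaQueueLagHigh
        # metric name taken from the rough PromQL above; threshold is illustrative only
        expr: max(queue_message_lag_seconds_max) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Orca queue message lag has been above 5s for 10 minutes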
Thread Handling
Administrators can query threading capacity as follows. This metric shows how close the service is to the maximum usage of the thread work pool. If it stays consistently above 80%, scaling Orca vertically or horizontally is recommended.
-- NRQL for New Relic metrics
FROM Metric SELECT max(threadpool.activeCount)/max(threadpool.maximumPoolSize)
WHERE metricName LIKE 'threadpool.%' AND spinSvc = 'orca' AND id = 'messageHandlerPool'
FACET hostname TIMESERIES SINCE 7 days ago LIMIT 100
-- PromQL rough equivalent (not tested)
max(threadpool_activeCount_total{spinSvc="orca",id="messageHandlerPool"}) / max(threadpool_maximumPoolSize_total{spinSvc="orca",id="messageHandlerPool"})
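To act on the 80% guidance above, the same ratio can back an alert. A minimal Prometheus alerting-rule sketch, assuming the rough (untested) metric names from the PromQL query above:

groups:
  - name: orca-thread-pool
    rules:
      - alert: OrcaMessageHandlerPoolSaturated
        # active threads divided by the pool maximum; 0.8 mirrors the 80% guidance above
        expr: >
          max(threadpool_activeCount_total{spinSvc="orca",id="messageHandlerPool"})
          /
          max(threadpool_maximumPoolSize_total{spinSvc="orca",id="messageHandlerPool"}) > 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: Orca messageHandlerPool has been above 80% utilization for 15 minutes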
Activity Load
There's no current metric to track activity load, but Orca stores all pipeline context in a database. These operations can generate large amounts of data and communications. It's recommended to monitor:
- Database network traffic
- Host network traffic
- The "body" size of pipelines in Orca's database. Anything over 200MB indicates that some pipelines are too large.
- You can set various flags to disable passing context to child pipelines ("skip downstream output" flag), as done via https://github.com/spinnaker/orca/pull/4080 and https://github.com/spinnaker/orca/pull/3989.
Tuning Orca
The following changes are recommended as a starting point within the Orca configuration of the Spinnaker manifest. Adjustments should be made in the SpinnakerService.yml file under the spec.spinnakerConfig.profiles.orca section. Please note that cleanup timings should be adjusted according to the organization's compliance rules: shortening them clears execution history sooner, which may not meet the organization's data retention requirements.
spec:
  # spec.spinnakerConfig - This section is how to specify Spinnaker configuration
  spinnakerConfig:
    # spec.spinnakerConfig.profiles - This section contains the YAML of each service's profile
    profiles:
      orca:
        # Allows plugins to override webhook/job stages
        dynamic-stage-resolver:
          enabled: true
        ## Sets it to do a full poll per cycle. Drastically improves queue handling
        keiko:
          queue:
            fillExecutorEachCycle: true
        ## Don't keep a lot of TASK history - this can be LARGE.
        ## Defaults to 30 days retention normally. Depends on the app
        tasks:
          daysOfExecutionHistory: 180
          useManagedServiceAccounts: true
          useSharedManagedServiceAccounts: true
        pollers:
          topApplicationExecutionCleanup:
            enabled: true
            intervalMs: 1200000 # 20 minutes, default 1 hour
            threshold: 1000 # default is 2000
            chunkSize: 20 ## defaults to 1
            exceptionApplicationThresholds:
              appKeyKeepHistory: 2000 ## overrides the threshold for a specific app
          oldPipelineCleanup:
            threshold-days: 90
            enabled: true
Note that at this time, SQL Keiko queues are not used by default. If administrators are interested in enabling Keiko SQL, there are quite a few properties that must be set differently; these are documented for testing below. Customers should add the following configuration under the spec.spinnakerConfig.profiles.orca section, in addition to the adjustments above.
spec:
  spinnakerConfig:
    profiles:
      orca:
        keiko:
          queue:
            sql:
              enabled: true
            redis:
              enabled: false
        ## it's critical to get the pending execution service to use the same queue config
        ## as the main queue so it can repopulate pending executions
        queue:
          pending-execution-service:
            redis:
              enabled: false
            sql:
              enabled: true
We are continuously evaluating these configurations as Spinnaker evolves and may have further recommendations for SQL in the future. There have been some reports of latency or failures when using SQL for Orca's queues at a very large scale, but it hasn't been determined whether the root cause was environmental or a difference between the SQL and Redis configurations.
Scaling
After taking the appropriate readings of the environment, Administrators may want to look at scaling Orca to help with increased traffic/demand. One note of caution: scaling Orca will put an increased load on the rest of the system, causing more calls to Echo, Rosco, Clouddriver, and the other supporting services that Orca interacts with to execute tasks.
Scaling Vertically
handlerCorePoolSize and handlerMaxPoolSize directly impact the threads available for queue processing. Query by id, for example:
max(threadpool_blockingQueueSize{id="messageHandlerPool",spinSvc="orca"})
to see how a given pool id is keeping up. Increase these values as needed, with caution: doing so will increase the load per Orca pod.
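As a sketch only, these pool sizes are set in the same orca profile as the other properties above. The property path below is an assumption inferred from the parameter names; verify the actual @Value bindings on the messageHandlerPool bean in OrcaConfiguration.java (linked in the next section) before applying it.

spec:
  spinnakerConfig:
    profiles:
      orca:
        # HYPOTHETICAL property path - confirm against OrcaConfiguration.java before use
        queue:
          handler:
            core-pool-size: 40   # example value only
            max-pool-size: 60    # example value only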
Scaling Horizontally
Administrators can also scale Orca pods horizontally to increase capacity, rather than increasing available threads. See: https://github.com/spinnaker/orca/blob/master/orca-core/src/main/java/com/netflix/spinnaker/orca/config/OrcaConfiguration.java#L193
Many of Orca's thread pools use fixed sizes and are not adjustable via JVM settings, so scaling horizontally is likely required regardless of any thread settings. These fixed pools do not see anywhere near the same traffic as the message handler pool and probably won't need the same amount of scaling.
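For example, Orca's replica count (and resource sizing) can be raised through the standard deploymentEnvironment.customSizing block of the manifest rather than the orca profile. A sketch, with arbitrary example values:

spec:
  spinnakerConfig:
    config:
      deploymentEnvironment:
        customSizing:
          orca:
            replicas: 3        # arbitrary example; size to observed load
            requests:
              cpu: 1
              memory: 2Gi
            limits:
              cpu: 2
              memory: 4Gi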