Armory Enterprise (Spinnaker) Sizing and Scaling Guide
Introduction
This guide describes general architecture considerations for Armory Enterprise as a whole. Armory Enterprise is composed of multiple services, each with a distinct job. These services can be scaled independently based on your usage. This sizing and scaling guide uses the following assumptions:
- Pipelines used to evaluate Armory Enterprise are simple, consisting of one Deploy stage and two Wait stages. If you expect your pipelines to be complex, divide the supported executions by the number of non-trivial stages (such as baking or deploying) you expect in your pipelines.
- API requests simulate potential tool requests as well as user activity. This guide provides the number of concurrent users.
- All services run with at least two replicas for basic availability. It is possible to run with fewer replicas at the cost of potential outages.
When you are attempting to size or scale your instance, keep the following in mind:
- The number of active users impacts how to size the API gateway service (Gate).
- Complex pipelines impact the amount of work the orchestration service (Orca) performs.
- Different providers (Kubernetes, GCP, AWS, etc) have different execution profiles for the Clouddriver service.
Architecture considerations
- Monitor your Armory Enterprise instance.
- Have a log management solution in place to use with Armory Enterprise. This will help with troubleshooting. Additionally, you can opt into Armory’s log aggregation to make troubleshooting easier.
- Armory does not recommend using High Availability modes: these modes distribute services rather than increase availability. These configurations can create excessive management overhead and complexity and require significantly higher operational cost. Instead, use the Kubernetes agent and/or horizontally scale services as needed. To find out more about how the High Availability function works in Spinnaker, see the following KB article: https://support.armory.io/support?id=kb_article_view&sysparm_article=KB0010327
Scaling Armory Enterprise Architecture
Kubernetes cluster
The Kubernetes cluster sizing recommendations assume that only Armory Enterprise, monitoring agents, and the Kubernetes operator run in the cluster. It also provides extra room for rolling deployments of Armory Enterprise itself.
- Use a supported Kubernetes version (1.18+).
- If possible, use a cluster with nodes in different availability zones.
- Scale out with multiple smaller nodes instead of scaling up with fewer larger nodes. This makes the cluster more resilient to the loss of a node.
- The smaller nodes still need to be able to handle the largest pods in terms of CPU and memory.
Make sure to use multiple replicas for availability, scaling, and performance. When using multiple replicas, enable locking in the Igor service to prevent duplicate requests. Armory recommends running the Echo service in SQL mode to avoid duplicate CRON triggers.
Igor provides locking as a configuration option:
spec:
  spinnakerConfig:
    profiles:
      igor:
        locking:
          enabled: true
Echo requires one of the following setups:
- SQL rather than in-memory storage (https://github.com/spinnaker/echo#configuration); see the sketch after this list
- Echo in an HA setup: https://support.armory.io/support?id=kb_article_view&sysparm_article=KB0010327
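As a rough sketch only, an Operator profile for Echo's SQL mode might look like the following. The exact property names and whether a separate migration user is required should be verified against the Echo configuration link above; the endpoint, schema name, and credentials are placeholders.
```yaml
spec:
  spinnakerConfig:
    profiles:
      echo:
        sql:
          enabled: true        # use SQL instead of in-memory storage
          connectionPool:
            # Placeholder endpoint, schema, and credentials
            jdbcUrl: jdbc:mysql://<db-endpoint>:3306/echo
            user: echo_service
            password: <password>
          migration:
            jdbcUrl: jdbc:mysql://<db-endpoint>:3306/echo
            user: echo_migrate
            password: <password>
```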
Database
The Clouddriver, Orca, Echo, and Front50 services must each use a different database. These databases can live in the same database cluster or in separate ones. Using a single cluster makes management easier and more cost-effective, but keep in mind that the connections used by Spinnaker add up across all services. When configuring your databases, use the following guidelines:
- Use a MySQL-compatible database engine.
  - Armory recommends using Aurora for cross-region replication.
- If available, use cross-region replication to ensure data durability.
- Be aware of which service stores what kind of data:
  - Front50’s database contains pipeline definitions and needs to be backed up.
  - Orca’s database contains the pipeline execution history that is displayed in the UI.
  - Clouddriver’s database contains your infrastructure cache. If it is lost, Spinnaker needs to rebuild the cache, which may take a while depending on the size of your infrastructure. It doesn’t have long-term value.
Make sure the network latency between Armory Enterprise and the database cluster is reasonable. Armory recommends locating the database cluster in the same datacenter as Armory Enterprise.
Your database cluster must support the number of open connections from Spinnaker and any other tool you need. For numbers refer to the database connections chart in the profiles below.
Tune Clouddriver’s connection pools based on your usage with the sql.connectionPools.cacheWriter.maxPoolSize and sql.connectionPools.default.maxPoolSize parameters. Both values default to 20 and need to be increased to handle more tasks per Clouddriver replica, as shown in the example below.
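For example, with the Operator these two parameters can be set in Clouddriver's profile. The pool sizes below are illustrative; size them for your task volume and make sure the database accepts the total connections across all Clouddriver replicas.
```yaml
spec:
  spinnakerConfig:
    profiles:
      clouddriver:
        sql:
          connectionPools:
            default:
              maxPoolSize: 50      # default pool for general task processing (defaults to 20)
            cacheWriter:
              maxPoolSize: 50      # pool used by the caching agents (defaults to 20)
```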
Redis
Most services rely on Redis for lightweight storage and/or task coordination. Spinnaker does not store many items in Redis as is reflected in the following recommendations.
- Use one CPU for Redis since it is single threaded.
- Use a managed Redis service if available.
- Armory recommends using Redis for provider agent scheduling. Redis is required for HTTP session storage, the Terraform integration, bake history, and numerous other features (see the example below for using a managed Redis endpoint).
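For a managed Redis, one approach is to override the Redis endpoint through service settings, assuming the standard Halyard/Operator service-settings mechanism; the endpoint below is a placeholder.
```yaml
spec:
  spinnakerConfig:
    service-settings:
      redis:
        # Point Spinnaker services at the managed Redis endpoint
        overrideBaseUrl: redis://<managed-redis-endpoint>:6379
        # Do not deploy or manage an in-cluster Redis
        skipLifeCycleManagement: true
```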
Spinnaker services
There are numerous parameters for tuning Spinnaker. These include:
- JDBC settings: max pool sizes, connection timeouts, lifetimes, and so on. See https://support.armory.io/support?id=kb_article_view&sysparm_article=KB0010395 for some tuning options.
- HTTP request tuning: adjust the max requests per host and other timeouts accordingly. These timeouts control many of the interactions between services, and also control access to artifact storage systems, CI systems, and more. See https://support.armory.io/support?id=kb_article_view&sysparm_article=KB0010163
- Memory tuning: watch JVM memory metrics and increase memory based on utilization. Look at heap memory usage metrics and GC collection times (see the sketch after this list).
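As a sketch of the memory-tuning point above, JVM heap sizes can be raised per service via the env passthrough in service settings (Clouddriver shown here; the -Xms/-Xmx values are illustrative and should stay within the container's memory request/limit).
```yaml
spec:
  spinnakerConfig:
    service-settings:
      clouddriver:
        env:
          JAVA_OPTS: "-Xms4g -Xmx8g"   # illustrative heap sizing; align -Xmx with container memory
```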
Base profile
Introduction
This guide describes the hardware recommendations for what is considered a base profile for Armory Enterprise (Spinnaker), based on the architecture described in part 1 of the Sizing and Scaling Guide.
Base profile
The following minimum recommendations are based on a usage profile similar to the following:
- 50 AWS Accounts, 10-15 Kubernetes accounts, or 30,000 Docker Images
- 250 deployments per day over a 5-hour window.
- 30 requests per second coming from browser sessions or tools
- 10x burst for both pipelines and API calls.
Hardware sizing
Service | Replicas | CPU | Memory |
---|---|---|---|
Clouddriver | 5 | 2500m | 10.0Gi |
Deck | 2 | 300m | 2.0Gi |
Echo | 2 | 2000m | 3.0Gi |
Fiat | 2 | 2000m | 3.0Gi |
Front50 | 2 | 2000m | 3.0Gi |
Gate | 2 | 2000m | 4.0Gi |
Kayenta | 2 | 750m | 2.0Gi |
Igor | 2 | 2000m | 4.0Gi |
Orca | 2 | 2000m | 4.0Gi |
Rosco | 2 | 2000m | 3.0Gi |
Terraformer | 2 | 500m | 2.0Gi |
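One way to apply these replica counts and resource values declaratively is the deployment environment's custom component sizing, sketched below for Clouddriver only; repeat per service and verify the customSizing block against your Operator/Halyard version.
```yaml
spec:
  spinnakerConfig:
    config:
      deploymentEnvironment:
        customSizing:
          clouddriver:
            replicas: 5
            requests:
              cpu: 2500m
              memory: 10Gi
            limits:
              cpu: 2500m
              memory: 10Gi
```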
The following examples describe what the supporting infrastructure looks like for this profile. Note the following:
- The starting size for compute is 3 * m5.2xlarge or 5 * m5.xlarge nodes (or equivalent).
- Use gp3 disks or equivalent SSD-backed storage types to minimize host disk I/O wait times.
EKS Node Pool
- Type: m5.xlarge
- vCPU: 4
- Memory: 16.0 Gb
- Storage type: EBS storage, gp3, 150 GB
- Nodes: 5
GCP Node Pool
- Type: n1-standard-16
- vCPU: 4
- Memory: 16.0 Gb
- Disk Type: pd-ssd
- Disk Size: 150 GB
- Nodes: 3
Redis
- Type: cache.r4.4xlarge (or equivalent)
- vCPU: 4
- Memory: 25.0 Gb
RDS/Aurora
- Type: db.r5.xlarge (or equivalent)
- vCPU: 8
- Memory: 32.0 Gb
- Disk Type: EBS storage, gp3, 150 GB
Cloud SQL
- Type: db-n1-standard-2
- Disk Type: PD_SSD
- Disk Size: 157 GB
Scaling the base profile
Introduction
This guide describes the following:
- What to monitor to determine if you need to scale your Armory Enterprise instance beyond the usage levels of the base profile described in part two
- What a scaled up instance might look like and the impact of increased usages
Monitoring
Different parts of the environment consume resources differently, so each part needs to be monitored based on different metrics. To understand what needs to be scaled for your instance, you need to utilize white box, black box, and infrastructure monitoring. Note that some of the performance metrics can be improved without increasing resources.
JVM Spinnaker services
Area | Alert Condition | Actions | Concerns |
---|---|---|---|
JVM: CPU Time | >75% for 10 mins | Increase available CPU/Memory for the service | None |
JVM: G1 Old/Young Generation | >10% for 10 mins | Investigate node pool disk IOPS; if correlated with other JVM alerts, increase available CPU/Memory for the service | Clouddriver: interim failures in Clouddriver accounts could potentially give a false positive alert |
JVM: Heap memory usage | >85% for 10 mins | Increase available memory for the service | Clouddriver: interim failures in Clouddriver accounts could potentially give a false positive alert |
Pod CPU request/limits usage | >95% for 360 mins | Increase available CPU request/limits | Clouddriver: interim failures in Clouddriver accounts could potentially give a false positive alert. Separate Clouddriver from the other services in regards to the threshold and time window |
Pod Memory request/limits usage | >95% for 360 mins | Increase available Memory request/limits | Clouddriver always uses the full available pod memory when at scale and operates for long periods of time without redeployment. Separate Clouddriver from the other services in regards to the threshold and time window |
Terraformer service
The Terraformer service is used by the Terraform Integration stage.
Area | Alert Condition | Actions | Concerns |
---|---|---|---|
Pod CPU request/limits usage | >95% for 360 mins | Increase available CPU request/limits | Separate Clouddriver from the other services in regards to the threshold and time window in order to catch Terraformer resource issues |
Pod Memory request/limits usage | >95% for 360 mins | Increase available Memory request/limits | Separate Clouddriver from the other services in regards to the threshold and time window in order to catch Terraformer resource issues |
Terraformer application metrics | None | Starting with 2.24.0, Terraformer exposes an OpenMetrics endpoint for metric scraping. Enable the configuration and add proper alerts in your monitoring solution. | |
Aurora Cluster
Area | Alert Condition | Actions | Concerns |
---|---|---|---|
Write latency | Baseline deviation for 15 mins | Identify the root cause of the increased latency | None |
Read latency | Baseline deviation for 15 mins | Identify the root cause of the increased latency | None |
CPU utilization | >70% for 15 mins | Increase the Aurora cluster instance type | None |
Redis cluster
Area | Alert Condition | Actions | Concerns |
---|---|---|---|
Redis Get latency | Baseline deviation for 15 mins | Identify the root cause of the increased latency | None |
Redis Set latency | Baseline deviation for 15 mins | Identify the root cause of the increased latency | None |
Redis CPU utilization | >90% for 15 mins | Increase the Redis cluster instance type | None |
Redis Storage utilization | >85% for 15 mins | Run the Redis keys cleanup job for Rosco and Terraformer keys; if cleanup doesn't resolve the alert, increase the instance type | None |
Armory Enterprise application metrics
Area | Alert Condition | Actions | Concerns |
---|---|---|---|
JDBC connections usage | >50% for 3 mins | Increase the available connection pool of the service | None |
Clouddriver caching agents | >85% for 10 mins | Increase the max-concurrent-caching agents config and/or increase the available Clouddriver replicas | NumberOfConfiguredRegions |
Scaling Exercise Example
Current Usage Profile
This example describes how to scale your Armory Enterprise instance based on a usage profile that resembles the following:
Total Accounts using Clouddriver Caching Agents
- AWS: 250
- ECS: 175
- Kubernetes: 4
Running Agents | Number of Replicas | Max-Concurrent Agents | Utilization |
---|---|---|---|
5803 | 7 | 1000 | 85% |
Calculation: Utilization = RunningAgents / (NumReplicas * MaxConcurrentAgents) * 100%
Weekly usage stats:
- Number of Active Applications: 50
  - Includes stage invocations like SavePipeline or UpdatePipeline
- Number of Executed Pipelines: 40
- Number of Fiat Groups: 1200
- Weekly Active Users: 125
- Max Concurrent stage executions (1h): 350
- Top Stage Invocations by Type:
  - RunJob Stage
  - Terraform Stage
  - Evaluate Artifacts
Resource usage by service
Service | Replicas | Requests/Limits | Average utilization 1 week |
---|---|---|---|
Clouddriver | 7 | CPU: 6500m Memory: 12Gi | JVM Heap used: 50.4% JVM CPU used: 25.5% Container CPU used: 31% Container Memory used: 99.99% |
Terraformer | 4 | CPU: 3000m Memory: 4Gi | Container CPU used: 1.2% Container Memory used: 68% |
Dinghy | 1 | CPU: 1000m Memory: 512Mi | Container CPU used: - Container Memory used: 3% |
Echo | 1 | CPU: 2000m Memory: 4 Gi | JVM Heap used: 23.9% JVM CPU used: 4.45% Container CPU used: 4.45% Container Memory used: 76.5 % |
Fiat | 2 | CPU: 2000m Memory: 18Gi | JVM Heap used: 7.8% JVM CPU used: 2.2% Container CPU used: 2.2% Container Memory used: 55.7% |
Front50 | 2 | CPU: 1500m Memory: 3Gi | JVM Heap used: 30.4% JVM CPU used: 1.65% Container CPU used: 2.1% Container Memory used: 57.8% |
Gate | 2 | CPU: 1000m Memory: 8Gi | JVM Heap used: 20.4% JVM CPU used: 3.3% Container CPU used: 3.4% Container Memory used: 57.6% |
Igor | 2 | CPU: 750m Memory: 2.0Gi | JVM Heap used: 23.1% JVM CPU used: 0.6% Container CPU used: 0.8% Container Memory used: 51.5% |
Orca | 2 | CPU: 1500m Memory: 10Gi | JVM Heap used: 36.9% JVM CPU used: 4.1% Container CPU used: 5% Container Memory used: 83.5% |
Rosco | 2 | CPU: 4000m Memory: 8Gi | JVM Heap used: 27.6% JVM CPU used: 0.3% Container CPU used: 0.5% Container Memory used: 84.5% |
EKS Node pools
- Type: m5.4xlarge
- vCPU: 16
- Memory: 64Gb
- Maximum Network interfaces: 8
- Private IPv4 per interface: 30
- Current node pool utilization (average 1 week):
- CPU: 20%
- Memory: 35%
- Current # of nodes: 9
- Current # of pods: 92
VPC IP allocation/reservation:
- IP Reservation: 540
- Calculation: NumberOfNodes * 2 ENIs per instance * 30 IPs per ENI (9 * 2 * 30 = 540)
- Note: At a minimum, every instance has 2 ENIs attached (1 primary, 1 secondary). The AWS CNI assigns a private IP from the secondary ENI to each pod. The number of IPv4 addresses reserved per ENI depends on the instance type.
- IP Allocation: 110
- Calculation: (number of nodes * 2) + (number of Pods)
- Unused IPs in the Warm pool: 430
Optimize the IP reservation for the existing cluster so that scaling the number of replicas or nodes, or switching to a larger instance type, won't over-reserve IPs from the private subnets. For more information, see https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/eni-and-ip-target.md.
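The linked document describes environment variables on the aws-node (VPC CNI) DaemonSet that control how many IPs each node pre-reserves. A sketch of the relevant container env entries follows; the values are illustrative and should be tuned to your expected pod density per node.
```yaml
# Env entries on the aws-node DaemonSet container (kube-system namespace)
env:
  - name: WARM_IP_TARGET       # keep only a small buffer of free IPs attached to each node
    value: "10"
  - name: MINIMUM_IP_TARGET    # pre-allocate roughly the expected number of pods per node
    value: "12"
```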
Aurora Cluster
- Instance Type: db.r5.2xlarge
- Current CPU utilization(average 1 week): 49%
Redis Cluster
- Instance Type: cache.r5.2xlarge
- Current CPU utilization: 5%
- Current Storage utilization: 73%
Scaling Projection Plan
These projections are based on a rough estimation for a potential increase in accounts, users, and deployments by 30%, 50%, or 100% of the current usage. The exact scaling needs depend on various factors.
Assumptions
- Each additional AWS/ECS account requires an additional 512MB in a Clouddriver pod, given that each account caches the same amount of resources.
- Each additional AWS account requires 17 caching agents per region.
- Each additional ECS account requires 10 caching agents per region
- EKS Node pools will remain in the same instance type
- Spinnaker resources are calculated at rest, since resource utilization depends on various factors during active executions.
The following table shows the projected increases for certain key metrics:
Key Metric | Current | +30% | +50% | +100% |
---|---|---|---|---|
Active Users | 125 | 163 | 188 | 250 |
Active Pipelines | 40 | 52 | 60 | 80 |
Concurrent Stage Executions | 350 | 455 | 525 | 700 |
AWS accounts | 250 | 325 | 375 | 500 |
ECS accounts | 175 | 227 | 262 | 350 |
Spinnaker resources
Caching agents
Calculation: CurrentRunningAgents + (NewAWSAccounts * 17 + NewECSAccounts * 10) * NumberOfConfiguredRegions
Assuming that the max-concurrent-agents config remains at 1000, the number of Clouddriver replicas needs to increase:
Running Agents | +30% | +50% | +100% |
---|---|---|---|
5803 | 7564 | 8747 | 11701 |
Spinnaker Resources
Clouddriver
Calculation: CurrentMemoryRequest + (512MB * NewAccounts) / NumberOfReplicas
Memory | +30% | +50% | +100% |
---|---|---|---|
12Gi (6 Replicas) | 17Gi (8 Replicas) | 19Gi (9 Replicas) | 24Gi (12 Replicas) |
Orca
An estimate for a 100% increase is 3 replicas / 4 vCPU / 12Gi memory. Scaling Orca depends on the Redis cluster and on downstream services like Clouddriver. Expect to increase resources and replicas as the number of concurrent stages increases.
Fiat
Scaling Fiat depends on the following:
- Number of active users
- Groups sync from the external identity provider
- Number of active executions
An estimate for a 100% increase is 3 replicas / 2 vCPU / 18Gi memory.
Expect to increase resources and replicas as the number of active users and pipeline executions increases. However, this depends on the setup of the external identity provider and the sync processing of the available roles. The amount of data returned can drastically change these rough projections.
Front50
An estimate for a 100% increase is 3 replicas / 1.5 vCPU / 4Gi memory. Scaling Front50 depends on the number of applications, pipelines, and the stage invocations on them. Expect to increase resources and replicas as the number of applications and pipelines increases.
Terraformer and Rosco
Scaling either the Terraformer service or the Rosco service depends on the concurrent executions of Terraform or Bake stages. Expect to increase resources and replicas as those concurrent executions increase.
EKS node pools
The major load increase comes from the horizontal and vertical scaling of Clouddriver; only a small portion is allocated to scaling the other Spinnaker services. Only Clouddriver scaling is taken into consideration for the calculations below.
Number of Nodes | +30% | +50% | +100% |
---|---|---|---|
9 | 10 | 11 | 12 |
The following VPC IPs are calculated to be reserved (without changes to the warm pool configuration):
Reserved IPs | +30% (10 Nodes) | +50% (11 Nodes) | +100% (12 Nodes) |
---|---|---|---|
540 | 600 | 660 | 720 |
By optimizing the IP reservation in the CNI configuration, the number of reserved IPs and unused reserved private IPs per node can be decreased.
Aurora cluster
Based on the projected scaling of the Clouddriver and Orca services, and given that the weekly average CPU load is around 50%, rough estimates indicate that the Aurora instance type needs to be upgraded once more than 50% of the projected increase occurs.
Redis cluster
Be aware of the amount of pressure that the projected scaling of the Orca, Fiat, Rosco, and Terraformer services puts on Redis in terms of active users, concurrent stage executions, and Terraform/Bake stages. The projected increases do not require upgrading the Redis cluster instance type.