Armory Enterprise (Spinnaker) Sizing and Scaling Guide
Introduction
This guide describes general architecture considerations for Armory Enterprise as a whole. Armory Enterprise is composed of multiple services, each with a distinct job. These services can be scaled independently based on your usage. This sizing and scaling guide uses the following assumptions:
- Pipelines used to evaluate Armory Enterprise are simple, consisting of one Deploy stage and two Wait stages. If you expect your pipelines to be complex, divide the supported executions by the number of non-trivial stages (such as baking or deploying) you expect in your pipelines.
- API requests simulate potential tool requests as well as user activity. This guide provides the number of concurrent users.
- All services run with at least two replicas for basic availability. It is possible to run with fewer replicas at the cost of potential outages.
When you are attempting to size or scale your instance, keep the following in mind:
- The number of active users impacts how to size the API gateway service (Gate).
- Complex pipelines impact the amount of work the orchestration service (Orca) performs.
- Different providers (Kubernetes, GCP, AWS, etc) have different execution profiles for the Clouddriver service.
Architecture considerations
- Monitor your Armory Enterprise instance.
- Have a log management solution in place to use with Armory Enterprise. This will help with troubleshooting. Additionally, you can opt into Armory’s log aggregation to make troubleshooting easier.
- Armory does not recommend using High Availability modes: these modes distribute services rather than increase availability. These configurations can create excessive management overhead and complexity and require significantly higher operational cost. Instead, use the Kubernetes agent and/or horizontally scale services as needed. To find out more about how the High Availability function works in Spinnaker, see the following KB article: https://support.armory.io/support?id=kb_article_view&sysparm_article=KB0010327
Scaling Armory Enterprise Architecture
Kubernetes cluster
The Kubernetes cluster sizing recommendations assume that only Armory Enterprise, monitoring agents, and the Kubernetes operator run in the cluster. It also provides extra room for rolling deployments of Armory Enterprise itself.
- Use a supported Kubernetes version (1.18+).
- If possible, use a cluster with nodes in different availability zones.
- Scale out with multiple smaller nodes instead of scaling up with fewer larger nodes. This makes the cluster more resilient to the loss of a node.
- The smaller nodes still need to be able to handle the largest pods in terms of CPU and memory.
Make sure to use multiple replicas for availability, scaling, and performance. When using multiple replicas, enable locking in the Igor service to prevent duplicate requests. Armory recommends running the Echo service in SQL mode to avoid duplicate CRON triggers.
Igor provides locking as a configuration option:
spec:
  spinnakerConfig:
    profiles:
      igor:
        locking:
          enabled: true
Echo requires one of the following setups:
- SQL rather than in-memory storage (https://github.com/spinnaker/echo#configuration); see the sketch after this list
- Echo in an HA setup: https://support.armory.io/support?id=kb_article_view&sysparm_article=KB0010327
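As a rough sketch only, an Operator profile for Echo's SQL mode might look like the following. The exact property names and whether a separate migration user is required should be verified against the Echo configuration link above; the endpoint, schema name, and credentials are placeholders.
```yaml
spec:
  spinnakerConfig:
    profiles:
      echo:
        sql:
          enabled: true        # use SQL instead of in-memory storage
          connectionPool:
            # Placeholder endpoint, schema, and credentials
            jdbcUrl: jdbc:mysql://<db-endpoint>:3306/echo
            user: echo_service
            password: <password>
          migration:
            jdbcUrl: jdbc:mysql://<db-endpoint>:3306/echo
            user: echo_migrate
            password: <password>
```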
Database
The Clouddriver, Orca, Echo, and Front50 services must each use a different database. These databases can live in the same database cluster or in separate ones. Using a single cluster makes management easier and more cost-effective, but keep in mind that the connections used by Spinnaker add up across all services. When configuring your databases, use the following guidelines:
- Use a MySQL-compatible database engine.
  - Armory recommends using Aurora for cross-region replication.
- If available, use cross-region replication to ensure data durability.
- Be aware of which service stores what kind of data:
  - Front50’s database contains pipeline definitions and needs to be backed up.
  - Orca’s database contains the pipeline execution history that is displayed in the UI.
  - Clouddriver’s database contains your infrastructure cache. If it is lost, Spinnaker needs to rebuild the cache, which may take a while depending on the size of your infrastructure. It doesn’t have long-term value.
Make sure the network latency between Armory Enterprise and the database cluster is reasonable. Armory recommends locating the database cluster in the same datacenter as Armory Enterprise.
Your database cluster must support the number of open connections from Spinnaker and any other tool you need. For numbers refer to the database connections chart in the profiles below.
Tune Clouddriver’s connection pools based on your usage with the sql.connectionPools.cacheWriter.maxPoolSize and sql.connectionPools.default.maxPoolSize parameters. Both values default to 20 and need to be increased to handle more tasks per Clouddriver replica, as shown in the example below.
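For example, with the Operator these two parameters can be set in Clouddriver's profile. The pool sizes below are illustrative; size them for your task volume and make sure the database accepts the total connections across all Clouddriver replicas.
```yaml
spec:
  spinnakerConfig:
    profiles:
      clouddriver:
        sql:
          connectionPools:
            default:
              maxPoolSize: 50      # default pool for general task processing (defaults to 20)
            cacheWriter:
              maxPoolSize: 50      # pool used by the caching agents (defaults to 20)
```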
Redis
Most services rely on Redis for lightweight storage and/or task coordination. Spinnaker does not store many items in Redis as is reflected in the following recommendations.
- Use one CPU for Redis since it is single threaded.
- Use a managed Redis service if available.
- Armory recommends using Redis for provider agent scheduling. Redis is required for HTTP session storage, the Terraform integration, bake history, and numerous other features (see the example below for using a managed Redis endpoint).
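For a managed Redis, one approach is to override the Redis endpoint through service settings, assuming the standard Halyard/Operator service-settings mechanism; the endpoint below is a placeholder.
```yaml
spec:
  spinnakerConfig:
    service-settings:
      redis:
        # Point Spinnaker services at the managed Redis endpoint
        overrideBaseUrl: redis://<managed-redis-endpoint>:6379
        # Do not deploy or manage an in-cluster Redis
        skipLifeCycleManagement: true
```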
Spinnaker services
There are numerous parameters for tuning Spinnaker. These include:
- JDBC settings: max pool sizes, connection timeouts, lifetimes, and so on. See https://support.armory.io/support?id=kb_article_view&sysparm_article=KB0010395 for some tuning options.
- HTTP request tuning: adjust the max requests per host and other timeouts accordingly. These timeouts control many of the interactions between services, and also control access to artifact storage systems, CI systems, and more. See https://support.armory.io/support?id=kb_article_view&sysparm_article=KB0010163
- Memory tuning: watch JVM memory metrics and increase memory based on utilization. Look at heap memory usage metrics and GC collection times (see the sketch after this list).
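As a sketch of the memory-tuning point above, JVM heap sizes can be raised per service via the env passthrough in service settings (Clouddriver shown here; the -Xms/-Xmx values are illustrative and should stay within the container's memory request/limit).
```yaml
spec:
  spinnakerConfig:
    service-settings:
      clouddriver:
        env:
          JAVA_OPTS: "-Xms4g -Xmx8g"   # illustrative heap sizing; align -Xmx with container memory
```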
Base profile
Introduction
This guide describes the hardware recommendations for what is considered a base profile for Armory Enterprise (Spinnaker), based on the architecture described in part 1 of the Sizing and Scaling Guide.
Base profile
The following minimum recommendations are based on a usage profile similar to the following:
- 50 AWS Accounts, 10-15 Kubernetes accounts, or 30,000 Docker Images
- 250 deployments per day over a 5-hour window.
- 30 requests per second coming from browser sessions or tools
- 10x burst for both pipelines and API calls.
Hardware sizing
Service | Replicas | CPU | Memory |
---|---|---|---|
Clouddriver | 5 | 2500m | 10.0Gi |
Deck | 2 | 300m | 2.0Gi |
Echo | 2 | 2000m | 3.0Gi |
Fiat | 2 | 2000m | 3.0Gi |
Front50 | 2 | 2000m | 3.0Gi |
Gate | 2 | 2000m | 4.0Gi |
Kayenta | 2 | 750m | 2.0Gi |
Igor | 2 | 2000m | 4.0Gi |
Orca | 2 | 2000m | 4.0Gi |
Rosco | 2 | 2000m | 3.0Gi |
Terraformer | 2 | 500m | 2.0Gi |
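One way to apply these replica counts and resource values declaratively is the deployment environment's custom component sizing, sketched below for Clouddriver only; repeat per service and verify the customSizing block against your Operator/Halyard version.
```yaml
spec:
  spinnakerConfig:
    config:
      deploymentEnvironment:
        customSizing:
          clouddriver:
            replicas: 5
            requests:
              cpu: 2500m
              memory: 10Gi
            limits:
              cpu: 2500m
              memory: 10Gi
```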
The following examples describe what the supporting infrastructure looks like for this profile. Note the following:
- The starting size for compute is 3 * m5.2xlarge or 5 * m5.xlarge nodes (or equivalent).
- Use gp3 disks or equivalent SSD-backed storage types to minimize host disk I/O wait times.
EKS Node Pool
- Type: m5.xlarge
- vCPU: 4
- Memory: 16.0 Gb
- Storage type: EBS storage, gp3, 150 GB
- Nodes: 5
GCP Node Pool
- Type: n1-standard-16
- vCPU: 4
- Memory: 16.0 Gb
- Disk Type: pd-ssd
- Disk Size: 150 GB
- Nodes: 3
Redis
- Type: cache.r4.4xlarge (or equivalent)
- vCPU: 4
- Memory: 25.0 Gb
RDS/Aurora
- Type: db.r5.xlarge (or equivalent)
- vCPU: 8
- Memory: 32.0 Gb
- Disk Type: EBS storage, gp3, 150 GB
Cloud SQL
- Type: db-n1-standard-2
- Disk Type: PD_SSD
- Disk Size: 157 GB
Scaling the base profile
Introduction
This guide describes the following:
- What to monitor to determine if you need to scale your Armory Enterprise instance beyond the usage levels of the base profile described in part two
- What a scaled up instance might look like and the impact of increased usages
Monitoring
Different parts of the environment consume resources differently, so each part needs to be monitored based on different metrics. To understand what needs to be scaled for your instance, you need to utilize white box, black box, and infrastructure monitoring. Note that some of the performance metrics can be improved without increasing resources.
JVM Spinnaker services
Area | Alert Condition | Actions | Concerns |
---|---|---|---|
JVM: CPU Time | >75% for 10 mins | Increase available CPU/Memory for the service | None |
JVM: G1 Old/Young Generation | >10% for 10 mins | Investigate node pool disk IOPS; if correlated with other JVM alerts, increase available CPU/Memory for the service | Clouddriver: interim failures in Clouddriver accounts could potentially give a false positive alert |
JVM: Heap memory usage | >85% for 10 mins | Increase available memory for the service | Clouddriver: interim failures in Clouddriver accounts could potentially give a false positive alert |
Pod CPU request/limits usage | >95% for 360 mins | Increase available CPU request/limits | Clouddriver: interim failures in Clouddriver accounts could potentially give a false positive alert. Separate Clouddriver from the other services in regards to the threshold and time window |
Pod Memory request/limits usage | >95% for 360 mins | Increase available Memory request/limits | Clouddriver always uses the full available pod memory when at scale and operates for long periods of time without redeployment. Separate Clouddriver from the other services in regards to the threshold and time window |
Terraformer service
The Terraformer service is used by the Terraform Integration stage.
Area | Alert Condition | Actions | Concerns |
---|---|---|---|
Pod CPU request/limits usage | >95% for 360 mins | Increase available CPU request/limits | Separate Clouddriver from the other services in regards to the threshold and time window in order to catch Terraformer resource issues |
Pod Memory request/limits usage | >95% for 360 mins | Increase available Memory request/limits | Separate Clouddriver from the other services in regards to the threshold and time window in order to catch Terraformer resource issues |
Terraformer application metrics | None | Starting with 2.24.0, Terraformer exposes an OpenMetrics endpoint for metric scraping. Enable the configuration and add proper alerts in your monitoring solution. | |
Aurora Cluster
Area | Alert Condition | Actions | Concerns |
---|---|---|---|
Write latency | Baseline deviation for 15 mins | Identify the root cause of the increased latency | None |
Read latency | Baseline deviation for 15 mins | Identify the root cause of the increased latency | None |
CPU utilization | >70% for 15 mins | Increase the Aurora cluster instance type | None |
Redis cluster
Area | Alert Condition | Actions | Concerns |
---|---|---|---|
Redis Get latency | Baseline deviation for 15 mins | Identify the root cause of the increased latency | None |
Redis Set latency | Baseline deviation for 15 mins | Identify the root cause of the increased latency | None |
Redis CPU utilization | >90% for 15 mins | Increase the Redis cluster instance type | None |
Redis Storage utilization | >85% for 15 mins | Run the Redis keys cleanup job for Rosco and Terraformer keys; if cleanup doesn't resolve the alert, increase the instance type | None |
Armory Enterprise application metrics
Area | Alert Condition | Actions | Concerns |
---|---|---|---|
JDBC connections usage | >50% for 3 mins | Increase the available connection pool of the service | None |
Clouddriver caching agents | >85% for 10 mins | Increase the max-concurrent-caching agents config and/or increase the available Clouddriver replicas | NumberOfConfiguredRegions |
Scaling Exercise Example
Current Usage Profile
This example describes how to scale your Armory Enterprise instance based on a usage profile that resembles the following:
Total Accounts using Clouddriver Caching Agents
- AWS: 250
- ECS: 175
- Kubernetes: 4
Running Agents | Number of Replicas | Max-Concurrent Agents | Utilization |
---|---|---|---|
5803 | 7 | 1000 | 85% |
Calculation: Utilization = RunningAgents / (NumReplicas * MaxConcurrentAgents) * 100%
Weekly usage stats:
- Number of Active Applications: 50
  - Includes stage invocations like SavePipeline or UpdatePipeline
- Number of Executed Pipelines: 40
- Number of Fiat Groups: 1200
- Weekly Active Users: 125
- Max Concurrent stage executions (1h): 350
- Top Stage Invocations by Type:
  - RunJob Stage
  - Terraform Stage
  - Evaluate Artifacts
Resource usage by service
Service | Replicas | Requests/Limits | Average utilization 1 week |
---|---|---|---|
Clouddriver | 7 | CPU: 6500m Memory: 12Gi | JVM Heap used: 50.4% JVM CPU used: 25.5% Container CPU used: 31% Container Memory used: 99.99% |
Terraformer | 4 | CPU: 3000m Memory: 4Gi | Container CPU used: 1.2% Container Memory used: 68% |
Dinghy | 1 | CPU: 1000m Memory: 512Mi | Container CPU used: - Container Memory used: 3% |
Echo | 1 | CPU: 2000m Memory: 4 Gi | JVM Heap used: 23.9% JVM CPU used: 4.45% Container CPU used: 4.45% Container Memory used: 76.5 % |
Fiat | 2 | CPU: 2000m Memory: 18Gi | JVM Heap used: 7.8% JVM CPU used: 2.2% Container CPU used: 2.2% Container Memory used: 55.7% |
Front50 | 2 | CPU: 1500m Memory: 3Gi | JVM Heap used: 30.4% JVM CPU used: 1.65% Container CPU used: 2.1% Container Memory used: 57.8% |
Gate | 2 | CPU: 1000m Memory: 8Gi | JVM Heap used: 20.4% JVM CPU used: 3.3% Container CPU used: 3.4% Container Memory used: 57.6% |
Igor | 2 | CPU: 750m Memory: 2.0Gi | JVM Heap used: 23.1% JVM CPU used: 0.6% Container CPU used: 0.8% Container Memory used: 51.5% |
Orca | 2 | CPU: 1500m Memory: 10Gi | JVM Heap used: 36.9% JVM CPU used: 4.1% Container CPU used: 5% Container Memory used: 83.5% |
Rosco | 2 | CPU: 4000m Memory: 8Gi | JVM Heap used: 27.6% JVM CPU used: 0.3% Container CPU used: 0.5% Container Memory used: 84.5% |
EKS Node pools
- Type: m5.4xlarge
- vCPU: 16
- Memory: 64Gb
- Maximum Network interfaces: 8
- Private IPv4 per interface: 30
- Current node pool utilization (average 1 week):
- CPU: 20%
- Memory: 35%
- Current # of nodes: 9
- Current # of pods: 92
VPC IP allocation/reservation:
- IP Reservation: 540
- Calculation: NumberOfNodes * 2 ENIs per instance * 30 IPs per ENI (9 * 2 * 30 = 540)
- Note: At a minimum, every instance has 2 ENIs attached (1 primary, 1 secondary). The AWS CNI assigns a private IP from the secondary ENI to each pod. The number of IPv4 addresses reserved per ENI depends on the instance type.
- IP Allocation: 110
- Calculation: (number of nodes * 2) + (number of Pods)
- Unused IPs in the Warm pool: 430
Optimize the IP reservation for the existing cluster so that scaling the number of replicas or nodes, or switching to a larger instance type, won't over-reserve IPs from the private subnets. For more information, see https://github.com/aws/amazon-vpc-cni-k8s/blob/master/docs/eni-and-ip-target.md.
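The linked document describes environment variables on the aws-node (VPC CNI) DaemonSet that control how many IPs each node pre-reserves. A sketch of the relevant container env entries follows; the values are illustrative and should be tuned to your expected pod density per node.
```yaml
# Env entries on the aws-node DaemonSet container (kube-system namespace)
env:
  - name: WARM_IP_TARGET       # keep only a small buffer of free IPs attached to each node
    value: "10"
  - name: MINIMUM_IP_TARGET    # pre-allocate roughly the expected number of pods per node
    value: "12"
```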
Aurora Cluster
- Instance Type: db.r5.2xlarge
- Current CPU utilization(average 1 week): 49%
Redis Cluster
- Instance Type: cache.r5.2xlarge
- Current CPU utilization: 5%
- Current Storage utilization: 73%
Scaling Projection Plan
These projections are based on a rough estimation for a potential increase in accounts, users, and deployments by 30%, 50%, or 100% of the current usage. The exact scaling needs depend on various factors.
Assumptions
- Each additional AWS/ECS account requires an additional 512MB in a Clouddriver pod, given that each account caches the same amount of resources.
- Each additional AWS account requires 17 caching agents per region.
- Each additional ECS account requires 10 caching agents per region
- EKS Node pools will remain in the same instance type
- Spinnaker resources are calculated at rest, since resource utilization depends on various factors during active executions.
The following table shows the projected increases for certain key metrics:
Key Metric | Current | +30% | +50% | +100% |
---|---|---|---|---|
Active Users | 125 | 163 | 188 | 250 |
Active Pipelines | 40 | 52 | 60 | 80 |
Concurrent Stage Executions | 350 | 455 | 525 | 700 |
AWS accounts | 250 | 325 | 375 | 500 |
ECS accounts | 175 | 227 | 262 | 350 |
Spinnaker resources
Caching agents
Calculation: CurrentRunningAgents + (NewAWSAccounts * 17 + NewECSAccounts * 10) * NumberOfConfiguredRegions
Assuming that the max-concurrent-agents config remains at 1000, the number of Clouddriver replicas needs to increase:
Running Agents | +30% | +50% | +100% |
---|---|---|---|
5803 | 7564 | 8747 | 11701 |
Spinnaker Resources
Clouddriver
Calculation: CurrentMemoryRequest + (512MB * NewAccounts) / NumberOfReplicas
Memory | +30% | +50% | +100% |
---|---|---|---|
12Gi (6 Replicas) | 17Gi (8 Replicas) | 19Gi (9 Replicas) | 24Gi (12 Replicas) |
Orca
An estimate for a 100% increase is 3 replicas / 4 vCPU / 12Gi memory. Scaling Orca depends on the Redis cluster and on downstream services like Clouddriver. Expect to increase resources and replicas as the number of concurrent stages increases.
Fiat
Scaling Fiat depends on the following:
- Number of active users
- Groups sync from the external identity provider
- Number of active executions
An estimate for a 100% increase is 3 replicas / 2 vCPU / 18Gi memory.
Expect to increase resources and replicas as the number of active users and pipeline executions increases. However, this depends on the setup of the external identity provider and the sync processing of the available roles. The amount of data returned can drastically change these rough projections.
Front50
An estimate for a 100% increase is 3 replicas / 1.5 vCPU / 4Gi memory. Scaling Front50 depends on the number of applications, pipelines, and the stage invocations on them. Expect to increase resources and replicas as the number of applications and pipelines increases.
Terraformer and Rosco
Scaling either the Terraformer service or the Rosco service depends on the concurrent executions of Terraform or Bake stages. Expect to increase resources and replicas as those concurrent executions increase.
EKS node pools
The major load increase comes from the horizontal and vertical scaling of Clouddriver; only a small portion is allocated to scaling the other Spinnaker services. Only Clouddriver scaling is taken into consideration for the calculations below.
Number of Nodes | +30% | +50% | +100% |
---|---|---|---|
9 | 10 | 11 | 12 |
The following VPC IPs are calculated to be reserved (without changes to the warm pool configuration):
Reserved IPs | +30% (10 Nodes) | +50% (11 Nodes) | +100% (12 Nodes) |
---|---|---|---|
540 | 600 | 660 | 720 |
By optimizing the IP reservation in the CNI configuration, the number of reserved IPs and unused reserved private IPs per node can be decreased.
Aurora cluster
Based on the projected scaling of the Clouddriver and Orca services, and given that the weekly average CPU load is around 50%, rough estimates indicate that the Aurora instance type needs to be upgraded once more than 50% of the projected increase occurs.
Redis cluster
Be aware of the amount of pressure that the projected scaling of the Orca, Fiat, Rosco, and Terraformer services puts on Redis in terms of active users, concurrent stage executions, and Terraform/Bake stages. The projected increases do not require upgrading the Redis cluster instance type.