Jason McIntosh's handy information about Setting Up Monitoring
Introduction
Below is a reference to Jason's recommendations to JPMC about monitoring their environment.
A lot of the information below has been recreated in the KB article https://support.armory.io/support?id=kb_article&sysparm_article=KB0010370. The information below is Jason's raw notes about monitoring: https://cloud-armory.slack.com/archives/CC0TJ4K24/p1626192030226800?thread_ts=1626186324.221500&cid=CC0TJ4K24. Please note that these are general best practices for just about any software deployment in the cloud; they should not be shared directly with customers as direct recommendations. The KB article attempts to provide some clues, but shies away from providing a full end-to-end solution, as customers' DevOps teams and DBAs should ultimately be responsible for figuring out what external monitoring is necessary.
Prerequisites
N/A
Instructions
- Watch CloudWatch metrics on your instances, particularly burst budget limits on EBS volumes and CPU usage. INFRASTRUCTURE monitoring of key metrics is KEY - these are platform-agnostic metrics that “Ops” teams should be aware of and alert on (see the CloudWatch sketch after this list).
- Watch Redis & MySQL metrics - storage, IO wait times, etc., and in particular transactions-per-second limits. DBAs usually know these intimately well; at the very least, watch some of the common ones: IO wait time, transaction burst capacity and transactions-per-second rates, as well as CPU utilization… AND STORAGE USAGE!! (See the Redis/MySQL spot-check sketch after this list.)
- Watch your APPS - particularly JVM metric data, queue times, cache latency. Specifically:
- https://blog.spinnaker.io/monitoring-spinnaker-part-1-4847f42a3abd
- https://blog.spinnaker.io/monitoring-spinnaker-sla-metrics-a408754f6b7b
- And our SLI/SLO doc [internal use only]: https://paper.dropbox.com/doc/Armory-SLISLO-Documentation--BPVZrXCL8ulTK2w9Nv8p5tkjAg-UiwecmNjt0DGV7vE4NQ6a
- ANYTHING in Java where GC collection is over 5-10% is a sign of memory exhaustion, and the heap needs to be increased. If it keeps increasing continuously, there's something leaking and/or causing a leak (it can be external or internal - e.g. a thread pool blocked by an external resource can eventually fill up, and that is tricky to track, ETC). See the GC-fraction sketch after this list.
- I HIGHLY suggest an APM - like Datadog/New Relic/etc. - that can monitor the JVM itself at a deeper level than what is reported via metrics. This is an extra layer with 2-5% overhead, but it gives you far more visibility far more easily. You can get SOME of this data from the metrics, but… not always.
- Distributed tracing - often comes with an APM, but not always. Zipkin/Sleuth are USEFUL tools for tracking flow through a system so you know whether a parameter broke things.
- JMX - if you can, this can provide some INSANE diagnostics, but it requires a connection to the JVM (similar to APM agents). The NICE thing is that a lot of things work well with it - e.g. Tomcat used to provide GREAT JMX controls for things like JDBC pool configuration & thread manipulation at RUNTIME. SUPER useful for viewing things like that. Less common in today's world, but still POTENTIALLY useful if configured/enabled for remote RMI calls.
- Monitor your OTHER systems - the EKS environment, API rate limits on artifact stores, throttling of anything, etc. - which are often signs of systems struggling. These are not always “super critical” but can be signs that Spinnaker is overwhelming your transitive infrastructure. Key ones: GitHub APIs, for example, or burst rate limits on AWS or Google APIs (see the GitHub rate-limit sketch after this list).
- WATCH any of your artifact stores, CI pipelines, etc. E.g. Jenkins jobs & polling: polling can HAMMER Jenkins, but it can ALSO fail if there are too many changes and not enough capacity to watch for those changes (see the Jenkins queue sketch after this list).
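For the EBS burst budget and CPU bullet, a minimal sketch of what watching those CloudWatch metrics can look like, assuming boto3 with working AWS credentials; the volume ID, instance ID, and thresholds are placeholders to swap for your own:

```python
# Sketch: poll a couple of the CloudWatch metrics called out above.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")


def recent_average(namespace, metric, dimensions, minutes=30):
    """Average of a CloudWatch metric over the last `minutes` minutes."""
    now = datetime.now(timezone.utc)
    resp = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric,
        Dimensions=dimensions,
        StartTime=now - timedelta(minutes=minutes),
        EndTime=now,
        Period=300,
        Statistics=["Average"],
    )
    points = resp["Datapoints"]
    return sum(p["Average"] for p in points) / len(points) if points else None


# EBS burst budget on a volume backing Redis/MySQL (placeholder volume ID).
burst = recent_average("AWS/EBS", "BurstBalance",
                       [{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}])
# CPU on a Spinnaker node (placeholder instance ID).
cpu = recent_average("AWS/EC2", "CPUUtilization",
                     [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}])

if burst is not None and burst < 20:
    print(f"WARNING: EBS burst balance is low ({burst:.0f}%)")
if cpu is not None and cpu > 80:
    print(f"WARNING: sustained CPU utilization is high ({cpu:.0f}%)")
```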
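For the Redis & MySQL bullet, a rough spot-check sketch using the redis and PyMySQL client libraries; hostnames, credentials, and the status variables chosen here are assumptions, and a DBA's own dashboards and alerts should be the real source of truth:

```python
# Sketch: spot-check a few Redis and MySQL numbers a DBA would normally watch.
import os

import pymysql
import redis

# Redis: ops/sec and memory usage straight from INFO.
r = redis.Redis(host="redis.example.internal", port=6379)
info = r.info()
print("redis ops/sec:", info["instantaneous_ops_per_sec"])
print("redis used memory:", info["used_memory_human"])

# MySQL: commit rate and lock/connection pressure from GLOBAL STATUS.
conn = pymysql.connect(host="mysql.example.internal", user="monitor",
                       password=os.environ["MYSQL_MONITOR_PASSWORD"])
with conn.cursor() as cur:
    cur.execute("SHOW GLOBAL STATUS WHERE Variable_name IN "
                "('Com_commit', 'Threads_running', 'Innodb_row_lock_waits')")
    for name, value in cur.fetchall():
        print(f"mysql {name}: {value}")
conn.close()
```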
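For the GC rule of thumb (over 5-10% of time in GC), a sketch that estimates the GC fraction by sampling a Prometheus-format metrics endpoint twice. The endpoint URL and the jvm_gc_pause_seconds_sum metric name are assumptions (that is what a Micrometer Prometheus registry would publish); adjust them to whatever your services actually expose:

```python
# Sketch: estimate the fraction of wall-clock time a JVM spends in GC by
# sampling a Prometheus-format metrics endpoint twice, one minute apart.
import time
import urllib.request

METRICS_URL = "http://clouddriver.example.internal:8008/prometheus"  # placeholder


def gc_pause_seconds_total(url):
    """Sum every jvm_gc_pause_seconds_sum sample exposed on the endpoint."""
    body = urllib.request.urlopen(url, timeout=10).read().decode()
    total = 0.0
    for line in body.splitlines():
        if line.startswith("jvm_gc_pause_seconds_sum"):
            total += float(line.rsplit(" ", 1)[1])
    return total


first = gc_pause_seconds_total(METRICS_URL)
time.sleep(60)
second = gc_pause_seconds_total(METRICS_URL)

gc_fraction = (second - first) / 60.0
print(f"time spent in GC over the last minute: {gc_fraction:.1%}")
if gc_fraction > 0.05:  # the 5-10% rule of thumb from the notes above
    print("WARNING: too much time in GC - look at heap size and/or leaks")
```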
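For GitHub API rate limits, a small sketch against GitHub's rate_limit endpoint; the token environment variable and the 10% threshold are placeholders:

```python
# Sketch: check remaining GitHub API quota for the token Spinnaker polls with.
import os

import requests

resp = requests.get(
    "https://api.github.com/rate_limit",
    headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},
    timeout=10,
)
resp.raise_for_status()
core = resp.json()["resources"]["core"]
print(f"GitHub API: {core['remaining']}/{core['limit']} requests remaining")
if core["remaining"] < core["limit"] * 0.1:
    print("WARNING: under 10% of GitHub API quota left - polling may start failing")
```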
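For Jenkins polling capacity, a sketch that watches build queue depth via the Jenkins JSON API; the URL, credentials, and threshold are placeholders:

```python
# Sketch: watch Jenkins build queue depth, one sign that polling plus normal
# CI load is outrunning executor capacity.
import os

import requests

resp = requests.get(
    "https://jenkins.example.internal/queue/api/json",
    auth=("monitor", os.environ["JENKINS_API_TOKEN"]),
    timeout=10,
)
resp.raise_for_status()
queued = len(resp.json().get("items", []))
print(f"Jenkins build queue depth: {queued}")
if queued > 25:  # placeholder threshold
    print("WARNING: builds are stacking up; triggers and polling may lag or time out")
```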