Thoughts on monitoring and troubleshooting
Monitor for good behavior or die under a pile of alerts
In my career in Site Reliability I have participated in a lot of postmortem conversations about incidents. A lot of them. While I was at Facebook (now Meta), my favorite meetings were the SEV Reviews, where we reviewed the main incidents of the week. Each large engineering organization ran its own, and we also ran a large company-wide meeting.
One of the most frustrating moments in those conversations, for me, is when people raise an action item to create a new alert that will detect (in the future) the specific failure that caused the incident under discussion.
One of the best indicators of a solid monitoring plan for a service is the number of alerts it triggers, regardless of severity level. Only production problems should alert the oncall and wake people in the middle of the night. A disk is full? Hopefully I do not care, unless it causes a production issue. And it should probably not cause a production issue on a large-scale system of hundreds or thousands of machines.
Monitoring systems for bad behavior is an uphill battle. If you go down this rabbit hole, you will spend endless amounts of time trying to anticipate how systems fail. Let me give you the spoiler: you will fail at that. Systems will, no matter what you do, surprise you in how they fail, people will keep making mistakes, and you won’t be able to anticipate what the next mistake will be. No number of alerts for failure will help you keep things under control. I have been on teams that had literally hundreds of alerts configured and got a couple of THOUSAND alerts triggered every day. I have been oncall for a rotation that had hundreds of alerts triggered daily. You know what I did? I learned which ones I should ignore, and looked only at the high-signal alerts. As an engineer on the team, you likely know which ones those are too.
Should we give up? No. Of course not. There’s a very simple way to avoid this scenario.
First, you need to separate in your mind what is monitoring and alerting and what is troubleshooting.
Monitoring and alerting
Every system has a known expected behavior. It can be a predictable egress curve, a number of orders, an error level, or a P99 latency number. Your monitoring system should have one alert for each of those, and let me give you another spoiler: if you spend a few days thinking about this, your list of alerts will likely be fewer than 10 for the *very important stuff*, which will cover most if not all of the incidents you will see in the future. Monitor your service or product for its good behavior, and investigate each deviation from it.
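To make that concrete, here is a minimal sketch of what a “good behavior” check could look like, in Python only for illustration. The metric names, thresholds, and the `query_metric()` stub are all made-up assumptions, not a reference implementation; wire it into whatever time series store you already use.

```python
# Minimal sketch of a "good behavior" alert check.
# Metric names and thresholds are illustrative placeholders.
from statistics import quantiles
from typing import List


def query_metric(name: str, window_minutes: int) -> List[float]:
    """Stub: return recent samples for `name` from your metrics backend."""
    raise NotImplementedError("wire this up to your own time series store")


def p99(samples: List[float]) -> float:
    # quantiles(..., n=100) returns the 1st..99th percentiles; take the last one.
    return quantiles(samples, n=100)[-1]


def should_alert() -> bool:
    # One alert per expected behavior: latency stays in band, orders keep flowing.
    latency_p99 = p99(query_metric("checkout.latency_ms", window_minutes=5))
    orders_per_minute = query_metric("checkout.orders_per_minute", window_minutes=5)[-1]
    return latency_p99 > 800 or orders_per_minute < 50
```

The point is that the check encodes what “healthy” looks like for the service, so anything that breaks that expectation surfaces here, regardless of which component failed.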
Is it a system issue causing trouble to one of the main good behaviors my system should display? Then this is an incident. Is it something that can wait a bit to be looked at? Then this is a Jira ticket. Is it a small single host/instance issue that a robot can deal with? Then plug a response script into the alert. Make sure this is a convention in your team and everyone knows what type of attention each level of ticket priority requires.
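Here is one way to write that convention down as code, a small sketch with a hypothetical `Alert` type, so each alert maps to exactly one agreed-upon kind of response:

```python
# Sketch of the "what attention does this alert deserve" convention.
# The Alert fields and the routing rules are assumptions for illustration.
from dataclasses import dataclass
from enum import Enum, auto


class Response(Enum):
    PAGE_ONCALL = auto()      # production impact: investigate now
    FILE_TICKET = auto()      # real, but can wait for working hours
    AUTO_REMEDIATE = auto()   # single host/instance issue a robot can fix


@dataclass
class Alert:
    name: str
    affects_production: bool
    single_instance: bool


def route(alert: Alert) -> Response:
    if alert.single_instance and not alert.affects_production:
        return Response.AUTO_REMEDIATE
    if alert.affects_production:
        return Response.PAGE_ONCALL
    return Response.FILE_TICKET
```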
Up to here, do I need to know what exactly is wrong? Probably not. An alert is the trigger of an action, which can be an investigation or an automated response, and it can require immediate attention or not. The alert itself doesn’t have to contain the cause of the incident. You need to take the next step and try to uncover the cause of the problem at hand. But how?
Please don’t tell me you’ll start SSHing into machines and reading logs. :-)
Troubleshooting
These are the systems that collect data from your service, plot graphs, and let you drill down from an error or retrace the path of a failing request so you can get to the root cause.
These are the systems you use to figure out what that flag raised by your monitoring system actually means.
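One common way to make that kind of drill-down possible is to attach trace context to requests as they flow through the system. As a sketch, assuming you use something like OpenTelemetry’s Python API (with the SDK and an exporter configured elsewhere), the instrumentation can be as small as:

```python
# Sketch: annotate a request with a span so a tracing backend can later
# reconstruct its path. Assumes the OpenTelemetry SDK and an exporter are
# configured elsewhere; without them this API is a harmless no-op.
from opentelemetry import trace

tracer = trace.get_tracer(__name__)


def handle_order(order_id: str) -> None:
    with tracer.start_as_current_span("handle_order") as span:
        span.set_attribute("order.id", order_id)  # searchable attribute
        # ... actual request handling goes here ...
```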
The better your troubleshooting systems, the less time you will spend investigating an issue, and your mitigation will potentially be much faster. I would even argue that time to mitigate an issue is influenced more by the troubleshooting systems you use than by the monitoring system you have, because at the end of the day, even if you do not detect an issue, eventually your users will let you know things aren’t working.
My suggestion here is that you treat this particular stack as a critical service you must provide to the engineering team, because, well, it is. Having a good toolset to drill down into issues is a game changer for raising the quality of the conversation about how things fail when they do.
There are several types of services that are useful in this scenario, from tools such as Honeycomb.io or Sentry to more traditional time series collection services for the metrics you care about. You can use profiling systems for CPU and memory usage, and you should probably aggregate logs somewhere (likely sampled). There are plenty of options in the SaaS world.
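On the “logs aggregated somewhere (probably sampled)” point, here is a minimal sketch of what sampling could look like: keep every server error, keep a small fraction of healthy traffic, and emit structured records so the aggregation system can slice them later. The 1% rate and the field names are just placeholders.

```python
# Sketch of sampled, structured request logging. Sample rate and fields
# are illustrative assumptions; adjust to your own traffic and tooling.
import json
import logging
import random

logger = logging.getLogger("request_log")
SAMPLE_RATE = 0.01  # keep 1% of successful requests


def log_request(path: str, status: int, duration_ms: float) -> None:
    if status < 500 and random.random() > SAMPLE_RATE:
        return  # drop most of the healthy traffic, keep all server errors
    logger.info(json.dumps({
        "path": path,
        "status": status,
        "duration_ms": duration_ms,
    }))
```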
Without good troubleshooting systems, the second best option is to create theories about what happened and then try to confirm or falsify them. It is a valid investigation method if nothing else is available, but it tends to take a lot more time, and if you have an SLO with your customers, you’re probably not interested in lengthy investigations if you can avoid them.
Not all incidents will be simple to troubleshoot and mitigate. I’ve been involved in incidents where basically the whole engineering team lost access to all of its resources, including the troubleshooting systems. It does happen, but in my experience those “omgtheworldisfallingapart” incidents are much less common, thankfully.
Deliberately separating monitoring and alerting from troubleshooting will greatly reduce the number of alerts your team gets without affecting your ability to detect production issues. In addition, being clear that you need to invest in something other than alerts, namely a good troubleshooting stack, will ensure that when incidents do happen, the time to mitigate will be shorter than it would be otherwise.
Some references on monitoring:
Google SRE’s Monitoring Distributed Systems.
Brendan Gregg’s USE Method.
My favorite ever time series visualisation library: Cubism.