Since its release, the Google SRE handbook has convinced many organizations of the value of establishing SLOs. SLOs measure the health and success of infrastructure and applications, which is useful for any organization to know. However, for every company that successfully implements these measurements, there is another struggling to create its objectives.
In my experience, I have seen this happen for a variety of reasons, but the most prevalent are:
These are all valid hurdles. When I first started creating SLOs, it was difficult to apply the principles outlined in the handbook to the real-life scenarios I was encountering. Then, when it seemed I finally understood SLOs and SLIs, I made the mistake of creating far too many of them. On top of that, the metrics I had generated beforehand were far too complicated. It was overwhelming to sort through, to say the least. I’ve been there, and my experience is not unique. Here’s how I simplified the process for myself and, hopefully, for you.
Let’s start by explaining what SLOs and SLIs are. SLOs are Service Level Objectives and SLIs are Service Level Indicators. These are just fancy terms for “threshold” and “measurement”. To have a threshold, you need a measurement, and to have a measurement, you first need a goal. For example, if I have a web application, my goal is that it responds successfully the vast majority of the time. With that goal in mind, we can measure the percentage of successful requests by dividing the count of 200 responses by the total number of responses and multiplying the result by 100. Now we have a metric on which to set our threshold (or objective) and determine the point at which action is needed.
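As a minimal sketch of that measurement, here is the calculation in Python. The function name and the sample status codes are made up for illustration, and counting only 200 responses as successful follows the example above; your application may need a broader definition of success.

```python
from collections import Counter


def success_rate_percent(status_codes):
    """Return the percentage of responses with status 200.

    Returns None when there are no responses, since a percentage
    of zero requests is undefined.
    """
    if not status_codes:
        return None
    counts = Counter(status_codes)
    return 100 * counts[200] / len(status_codes)


# Hypothetical sample: 1,000 responses, 985 of them successful.
responses = [200] * 985 + [500] * 10 + [503] * 5
sli = success_rate_percent(responses)
print(f"Successful requests: {sli:.1f}%")  # Successful requests: 98.5%
```

The number this prints is the SLI. The SLO is the threshold we decide to hold it to.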
This is where it becomes more of an art than an exact science: we need to determine the tolerance for violating the threshold.
To determine the tolerance, we can create an SLO that says “our successful request percentage should be greater than 98% over 4 weeks.” That window is only an example; you may want to measure over one week or three months instead. Essentially, your timeframe for measurement should reflect how you want to take action on an SLO violation. In some organizations, when an SLO is broken, the SREs hand the pager back to the development team to fix the issues. In worst-case scenarios, development is brought to a halt while the team addresses the problems. In my experience, it is better to plan for a longer timeline when testing an SLO’s violation tolerance. A longer timeline lets issues present themselves without disrupting regular development or schedules. But if the metric represents the most critical aspect of your business, you may want a much shorter timeline. Whatever the case may be, SLOs and SLIs can be thought of as an “action threshold” and a “measurement”.
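To see how much the timeframe matters, here is a rough sketch that evaluates the same 98% objective over two different windows. The daily counts, the dates, and the function name are all invented for illustration; in practice these numbers would come from your monitoring system.

```python
from datetime import date, timedelta

SLO_TARGET_PERCENT = 98.0  # the example objective from the text


def window_success_percent(daily_counts, end_day, window_days):
    """Success percentage over the window_days ending on end_day.

    daily_counts maps a date to a (successful, total) pair of request counts.
    Returns None if the window contains no requests.
    """
    start_day = end_day - timedelta(days=window_days - 1)
    good = 0
    total = 0
    for day, (ok_count, request_count) in daily_counts.items():
        if start_day <= day <= end_day:
            good += ok_count
            total += request_count
    if total == 0:
        return None
    return 100 * good / total


# Hypothetical data: 28 days at 99% success, with one bad day on January 10.
first_day = date(2024, 1, 1)
daily_counts = {first_day + timedelta(days=i): (990, 1000) for i in range(28)}
daily_counts[date(2024, 1, 10)] = (600, 1000)
end_day = date(2024, 1, 28)

four_week = window_success_percent(daily_counts, end_day, window_days=28)
one_week = window_success_percent(daily_counts, end_day, window_days=7)

print(f"4 weeks: {four_week:.2f}% (SLO met: {four_week > SLO_TARGET_PERCENT})")
print(f"1 week:  {one_week:.2f}% (SLO met: {one_week > SLO_TARGET_PERCENT})")
```

With this made-up data, the one bad day breaks the objective over four weeks but has already dropped out of the one-week window. Neither answer is wrong; the window simply decides how long an incident keeps counting against you.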
Now that we have a better understanding of what SLOs and SLIs are, we can go into the metrics that actually power them. Remember when I said we needed a goal to have a metric? Starting from a goal makes me consider what we’re trying to measure and why; otherwise, I waste time hunting for a “cool” metric to monitor. When you start instead with a metric that has to be generated specially or is too specific, it can lead to an SLO that doesn’t have much meaning and is difficult to act upon.
Here’s an example I came across recently:
Someone was measuring the total execution time of a single SQL query over a day. When this threshold is crossed, what does it mean? Was the rest of the application affected? Did a user have a bad experience, or were there errors or latency increases? The answers, in this case, were no; but since they had once seen a dramatic increase in the execution time, they had concerns and believed they should monitor it. However, when the monitor did go off, there wasn’t anything they could take action on. So what do we do instead of monitoring very specific items? This will depend on your application, but for this case, we want to ask, “What actually indicates a problem?” A problem can likely be measured in error rate or response time. When those alerts go off, it’s much more likely that there is something actionable to address. It is usually best to measure at a higher level. While you may need to invest more work in investigating, you are more likely to catch an issue this way. That query could have impacted site latency, but what if site latency was impacted by something else that wasn’t monitored? We would not have caught it.
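Here is a small sketch of what those two higher-level measurements might look like. The `Request` record, the 500 ms cutoff, and the choice to count status codes of 500 and above as errors are all assumptions for the example, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int          # HTTP status code returned to the user
    duration_ms: float   # total time to serve the request


def error_rate_percent(requests):
    """Percentage of requests that ended in a server error (assumed: status >= 500)."""
    if not requests:
        return None
    errors = sum(1 for r in requests if r.status >= 500)
    return 100 * errors / len(requests)


def slow_request_percent(requests, threshold_ms=500):
    """Percentage of requests slower than threshold_ms (an assumed cutoff)."""
    if not requests:
        return None
    slow = sum(1 for r in requests if r.duration_ms > threshold_ms)
    return 100 * slow / len(requests)


# Hypothetical traffic: mostly fast successes, a few slow ones, a couple of errors.
traffic = (
    [Request(200, 120.0)] * 95
    + [Request(200, 900.0)] * 3
    + [Request(500, 50.0)] * 2
)

print(f"Error rate:    {error_rate_percent(traffic):.1f}%")    # Error rate:    2.0%
print(f"Slow requests: {slow_request_percent(traffic):.1f}%")  # Slow requests: 3.0%
```

Neither number says anything about a particular SQL query, and that is the point: if the query, or anything else, starts hurting users, it should show up in one of these two measurements.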
Lastly, we should only ever create as many SLOs as we need. It may seem tempting to alert on more items than necessary, with all the modern monitoring tools making it really easy to monitor hundreds or thousands of metrics per host. But think about it this way: I’m a fan of cooking reality shows, and one of the biggest mistakes a contestant can make is to split one dish into several. As a result, instead of being judged on one item, they are judged on several, leaving room for error and distraction. The same holds true for monitoring, especially when there are actions to take when SLOs are violated. If an SLO is built on a metric that doesn’t have a broader application impact, it could hamper or stop development until it’s addressed. And at the same time, if a tree falls in the woods and no one is around to hear it, does it make a sound? This isn’t to say you should ignore metrics; it’s to say you should pick and choose what is actually important and representative of your goals. The best way is to narrow down your SLIs to items that represent the latency, error rate, traffic, and capacity of your systems, then drill into issues from there.
SLOs and SLIs don’t need to be difficult or complicated. We certainly don’t need them for every item. They can be simple metrics representing the health and success of your application, allowing you to catch issues at a high level before your customers report them. Start by determining your goals and which metrics represent them best. This is your path to SLO simplicity.