
Monitors: Simple Practices with Datadog

Effectively streamline your tags and monitors
5 min read | by Paul Kim | August 5, 2024

How much does waking up an engineer at 3 AM cost?

I used to work at a firm that blasted a page, email, and text whenever a network router dropped a connection in the office. That page went directly to me in case the network imploded and they needed “all hands on deck.”

“We need to make sure it’s not a catastrophic issue, so it’s worth waking you up.” As preventative measures go, the reasoning was rational enough.

The network going down was a real problem, but there has to be a middle ground between paying an engineer overtime and getting no alerts at all for a network device failure.

Nowadays, I’ve learned there’s a metric ton of nuance in alerting. What’s the difference between a network complaining every day at 2 PM and a network complaining once at 3 AM? Do we alert on problems we already know about from past incidents, or should we preemptively alert on anything that looks scary? Do engineers really care about warnings, or should we just stick to paging? A monitoring strategy can balloon into a design document with a spec sheet for every application, but is all that overhead worth it if an engineer still ends up paged in the middle of the night?

Building Datadog Monitors, alongside Dashboards, is the bread and butter at RapDev. It’s surprising how much value a well-structured suite of monitors delivers for a team. A good foundation with those two products builds confidence across network architecture, applications, and infrastructure.

But monitors can get a bit unwieldy without guardrails, especially in Datadog. You need a strategy to keep your monitoring consistent and practical. Here are some general guidelines and tricks I’ve used to keep it simple:

Tagging

An odd first choice for a blog about monitors, but Datadog is 100% powered by tags. Without clear and consistent tagging across the applications you own, a lot of data falls through the cracks. The failure mode usually isn’t anything as dramatic as a wrong deployment version number or hostname; it’s as simple as mixing env:prod tags with env:production. Tagging requires an overarching strategy and could be an entire blog post in itself. Or some outside help. But once tagging is in place, the rest is easy-peasy.
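One practical place to enforce that consistency is the Agent configuration itself. Here’s a minimal sketch of host-level tags in datadog.yaml; the values are hypothetical, but the point is that every host emits one canonical spelling per key:

```yaml
# datadog.yaml (Agent config) -- hypothetical values
tags:
  - env:prod            # always env:prod, never env:production
  - team:network
  - owner:rapdev
  - service:edge-router
```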

Alert for information, Page for action 

One of the bigger mistakes is paging for every alert related to prod. If someone has to take action the microsecond a CPU core hits 100%, you’ll be paying for a lot more than the SLA you’re missing. In Datadog, you can send an email or chat message for metrics you simply want a historical record of, and reserve pages for clear, actionable events. A cluster that auto-scales at 100% CPU can have its alert directed to a Slack channel for the record, but a cluster that’s been crash-looping for 60 minutes should have an engineer looking at the instance. This simple split prevents a lot of headaches once you’re managing many alerts.
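In practice, the split comes down to which notification handles you put in each monitor’s message. A minimal sketch, assuming hypothetical Slack and PagerDuty integration names:

```
Informational monitor (record it, don't page):
CPU pegged at 100% on {{host.name}} -- the autoscaler should absorb this. @slack-infra-alerts

Actionable monitor (page a human):
Pods on {{host.name}} have been crash-looping for 60+ minutes. @pagerduty-infra-oncall
```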

Thresholds should be simple and clear

A CPU utilization alert set at 43% should make sense out of context. The engineer looking at the metric should immediately know what the issue is and whether the value is too low or too high. If they can’t, either the threshold needs to move to something that intuitively drives action, or the length of time spent at that threshold should. Which brings me to the next tip…
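For illustration, here’s a metric monitor query whose threshold explains itself (the scope and numbers are hypothetical):

```
avg(last_10m):avg:system.cpu.user{env:prod} by {host} > 90
```

Any engineer reading this knows a host averaging above 90% CPU for ten minutes is saturated; the same query with > 43 would just invite a shrug.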

Your evaluation length needs to be longer

You don’t need to wake someone up the microsecond a network goes down. But if the network has been down for an hour, it’s time to get someone in front of that router. Again, the length of time should be an intuitive indicator of severity. In Datadog, adjusting the evaluation window is as simple as changing a dropdown option, and you can scope a monitor to certain times of day for the cases where something genuinely needs attention within the minute. Context is important.
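One way to encode “down for an hour” directly in the query is to pick an aggregation and window that only trip when the problem persists. A minimal sketch, assuming a hypothetical reachability metric that reports 1 when the router responds and 0 when it doesn’t:

```
max(last_1h):avg:network.device.reachable{device:office-router} < 1
```

Because the max over the whole hour has to stay below 1, the monitor only fires if the device never responded during the entire window; a brief blip at 3 AM won’t page anyone.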

Tags, again

Your monitors need tags too. Unfortunately, monitor tags are their own siloed set of metadata, applied manually rather than inherited from the resources they watch. But once set in Datadog, your team can easily search, filter, and find the monitors they own and need to remediate. A simple tag of owner:rapdev carries much more value than it seems once you can slice all your monitor data easily.
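If you create monitors through the API or infrastructure-as-code, the tags can be baked in from day one. A sketch of the relevant fields in a monitor-creation payload (the names and query are hypothetical):

```json
{
  "name": "Prod CPU saturation",
  "type": "metric alert",
  "query": "avg(last_10m):avg:system.cpu.user{env:prod} by {host} > 90",
  "message": "CPU above 90% for 10m on {{host.name}}. @slack-infra-alerts",
  "tags": ["owner:rapdev", "team:infra", "service:edge-router"]
}
```

From there, searching tag:"owner:rapdev" on the Manage Monitors page pulls up everything the team owns.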

Learn how to notify

The notification messages of Datadog monitors are an incredibly powerful tool, capable of alerting dynamically based on parameters and tags. Or they can page the whole company. Make use of {{#is_alert}} and {{#is_warning}} blocks to alert or page intelligently depending on which threshold was crossed, and pull in context from tags and data to aid with incidents and retros. Notifications can get unwieldy, so keeping them short and actionable makes them practical even when multiple alerts come in at once. Datadog’s own documentation says it best.
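Here’s a minimal sketch of a tiered message, reusing the same hypothetical handles as above: warnings land in chat, alerts page.

```
{{#is_warning}}
CPU on {{host.name}} is at {{value}}% -- approaching saturation. @slack-infra-alerts
{{/is_warning}}
{{#is_alert}}
CPU on {{host.name}} has been above {{threshold}}% for 10+ minutes -- needs a human. @pagerduty-infra-oncall
{{/is_alert}}
```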

Keeping it simple

Monitoring can become a complicated solution to simple problems. With clear guidelines and a platform that allows engineers to solve issues effectively, Datadog can be a powerful tool to simplify incidents and reduce the total time to remediation. The time and effort spent building simple and effective monitoring solutions will be worth it to the next engineer sleeping soundly at 3 AM.

Read more about our tagging strategy offerings and workshops, and reach out to our team at chat@rapdev.io to learn more.

Written by
Paul Kim
Boston
Avid Board Gamer and Software Engineer turned Datadog Engineer. Discovering the power of application monitoring, Cloud infrastructure, and trading 1 wood for 2 sheep.