As we discussed in a previous blog post, Datadog application performance monitoring (APM) helps developers troubleshoot issues with their backend services. In this blog, I will describe the typical process I use to troubleshoot backend performance problems in Datadog.
I will start by focusing on the Service Catalog page. The Service Catalog is the main landing page for Datadog APM services. It displays all of your services broken down by environment, but that's not its only function: clicking on a service gives you an overview that includes ownership, enabled features, reliability, performance, and more.
From the Service Catalog view, we can navigate to the Service Page, which shows performance information for a specific slice of the service. I usually use the Service Page as a launching pad to narrow down when and where a problem occurred: first I scan the summary charts to spot when the issue started, and then I drag-select on the graph to narrow the timeframe to just the problem window.
Once I’ve selected the relevant timeframe, I open the Service Page view. It is a more detailed version of what we see in the Service Catalog, and it is the main page I use to start drilling into the data. In this situation, I want to immediately inspect the failing traces within the selected time window.
From here, I scroll down to the traces section, which contains all of the requests executed during the timeframe I selected in the summary section. I then use the search bar to find problematic traces, typically by filtering for requests that returned an error and sorting by duration to surface long-running traces.
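To make that concrete, a trace search query for this step could look something like the one below. The environment and service names are hypothetical, and the exact facets available depend on how your services are tagged, but filtering on an error status is usually the quickest way to isolate failing requests:

```
env:production service:checkout-api status:error
```

With the list filtered to errors, sorting by the duration column brings the slowest failing requests to the top.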
Once I’ve applied my filters, I begin opening individual traces. When you open a trace view, you are presented with what at first seems like an overwhelming amount of information. It can feel daunting, but once you understand that Datadog is trying to provide as much context around the trace as possible, it starts to make more sense.
The first section I examine is the set of four views at the top: the flame graph, waterfall, span list, and map. Each is designed to show how the request performed on this service, as well as any upstream or downstream calls involving other services. We will dig deeper into these views another time.
In the section below, you will find tabs for metadata, errors, host metrics, runtime metrics, application logs, process metrics, network metrics, SQL queries, and method hotspots. Together, these tabs give you the information needed to determine what caused the problem.
Oftentimes the issue will be found under the errors tab, which shows either the response code that was returned or the exception that was thrown by the code. However, there are times when the error is less readily apparent, or you want a more detailed answer about why the problem is occurring. The other tabs are a great way to get those details. Start by checking the infrastructure, runtime metrics, process, and network tabs to rule out a resource- or host-level performance issue. If you still haven’t found the root cause, look into the application logs and database queries: logs may provide context that isn't captured in the traces, and poor database performance has been the cause of many backend service issues in the past.
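For example, when pivoting to the application logs tab, a query along these lines keeps the log stream focused on the failing requests. The service name here is hypothetical, and the available attributes depend on how your logs are tagged:

```
service:checkout-api status:error
```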
If nothing is found in those sections, I look at method hotspots. This view gives you a breakdown of how much time was spent in the methods that executed the request.
If none of those options work, I begin to check other requests, either upstream or downstream of my request, as they provide additional context to help me determine the root cause.
In this trace, the issue was a bottleneck in the network performance of the host running the service. I determined this by inspecting the errors tab, which pointed to a network issue.
I then looked at the network tab and saw a spike in TCP retransmits during that time. Next, I inspected the logs tab and filtered by errors, where I found that GraphQL had unexpectedly closed the connection.
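If you want to sanity-check a finding like this outside of Datadog, the retransmit counters on the Linux host itself usually tell the same story. The exact commands and output vary by distribution, but something like the following surfaces the kernel's retransmission stats:

```
# Cumulative TCP retransmission counters since boot
netstat -s | grep -i retrans

# Per-socket retransmit details for currently open TCP connections
ss -ti | grep -i retrans
```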
By following the steps listed above, I was able to quickly drill down and find several issues on my backend services that pointed me to a likely cause of the performance problems I was experiencing on my host.
In this blog, I walked you through navigating the APM Service Page and using its views to begin troubleshooting issues. In the next blog, we will discuss how to use other features on the APM page to build a proactive monitoring environment.
Co-written by my cat, Tikka, who lay on me while I was writing this blog, making the entire process more difficult.
Interested in learning more? Watch our fireside chat on how you can proactively monitor your applications and reduce MTTR with Datadog APM, or reach out to us at chat@radpev.io.