I recently had the opportunity to lead a large-scale observability migration, moving a client’s infrastructure and applications off Dynatrace and onto Datadog. The scope covered roughly 100 applications across 3,000 hosts, mostly Linux, and 15 OpenShift Kubernetes clusters.
Living in these OpenShift clusters were hundreds of Java microservice workloads deployed with a common Helm chart. These microservices used HikariCP for database connection pooling so that database connections would not become a bottleneck during peak load. As a result, the teams developing and supporting these applications were particularly interested in HikariCP’s connection pool metrics, in addition to the standard APM telemetry gathered by default for Java applications.
This led to the question: how do we collect these custom JMX metrics? The typical pattern is to use the JMXFetch plugin built into the Datadog Agent, but that would require our common Helm chart to expose the JMX service on some containerPort. Therein lay the main challenge: the common Helm chart only created a containerPort for HTTP traffic on 8080. Dynatrace had been able to collect these metrics without such a containerPort, so I knew there had to be a way to do the same with Datadog.
Enter the Datadog Java tracer, dd-java-agent.jar. We were already in the process of instrumenting the applications deployed by the common Helm chart for APM, which involved adding dd-java-agent.jar to the Java base Docker images used across most Java applications. We extended this same pattern to collect the HikariCP JMX metrics. I worked with a few developers to query the JMX service from inside a container and came up with the following JMX config to collect the connection pool metrics.
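A minimal sketch of that configuration, assuming HikariCP MBean registration is enabled (registerMbeans=true) so the pools are exposed under the com.zaxxer.hikari JMX domain; the file name and metric aliases here are illustrative:

```yaml
# hikaricp.yaml - JMXFetch config baked into the Java base image
init_config:

instances:
  # jvm_direct attaches JMXFetch to the local JVM, so the usual
  # host/port attributes for remote JMX are not needed
  - jvm_direct: true
    conf:
      - include:
          # HikariCP pool MBeans (requires registerMbeans=true on the pool)
          domain: com.zaxxer.hikari
          attribute:
            ActiveConnections:
              alias: hikaricp.connections.active
              metric_type: gauge
            IdleConnections:
              alias: hikaricp.connections.idle
              metric_type: gauge
            TotalConnections:
              alias: hikaricp.connections.total
              metric_type: gauge
            ThreadsAwaitingConnection:
              alias: hikaricp.connections.pending
              metric_type: gauge
```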
The main difference between this configuration and one you would run in the typical fashion from the Datadog Agent is the presence of “jvm_direct: true”. This tells the check to attach to the Java process directly and allows us to omit the usually required host and port attributes. Once this configuration was added to the base Docker image, we updated the common Helm chart startup command.
This was easily configurable through the chart’s values and involved passing the path of our JMX configuration file to DD_JMXFETCH_CONFIG. In practice, this amounted to adding the following Java system property to the startup command.
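Roughly like the snippet below, expressed here as chart values; the key name and config path are hypothetical and only need to match wherever the file was baked into the base image (dd.jmxfetch.config is the system-property form of DD_JMXFETCH_CONFIG):

```yaml
# values.yaml (hypothetical key name for the common chart)
javaOptions: >-
  -javaagent:/app/dd-java-agent.jar
  -Ddd.jmxfetch.config=/app/datadog/hikaricp.yaml
```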
The final step was to put all the pieces together. We deployed the new version of the Java base image using the common Helm chart and, once it rolled out successfully, bumped the common Helm chart itself to the instrumented version. Shortly after requests were made against the new application deployment, the metrics began reporting.
To wrap up, I'll call out some of the ‘gotchas’ encountered in the process. Since we were working in Kubernetes, we needed to update the Agent configuration to bind DogStatsD, which listens on a Unix Domain Socket by default, to a host port so the tracer’s embedded JMXFetch could ship its metrics to the node-local Agent. This involved adding the following to our Datadog Agent Operator configuration.
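A sketch of the relevant piece, assuming the v2alpha1 DatadogAgent resource managed by the Datadog Operator:

```yaml
apiVersion: datadoghq.com/v2alpha1
kind: DatadogAgent
metadata:
  name: datadog
spec:
  features:
    dogstatsd:
      # Expose DogStatsD on a UDP host port so traffic from the
      # tracer's embedded JMXFetch can reach the node-local Agent
      hostPortConfig:
        enabled: true
        hostPort: 8125
```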
If you have just deployed your JMX-instrumented application, generated some request activity, and validated that your connection pools are initialized, you may still find that the metrics do not report until the next JMX bean refresh. This refresh interval is configured with DD_JMXFETCH_REFRESH_BEANS_PERIOD and defaults to 600 seconds. The lapse in metrics only occurs during the first refresh period for a newly created pod.
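If that initial wait matters, the period can be shortened on the application side; a sketch, assuming the common chart exposes a hook for extra environment variables:

```yaml
# values.yaml (hypothetical env hook in the common chart)
extraEnv:
  - name: DD_JMXFETCH_REFRESH_BEANS_PERIOD
    value: "120"  # refresh the MBean list every 2 minutes instead of 10
```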
Ready to optimize your Datadog environment? Contact us today and unlock the true potential of your observability stack.