You’re running a Kubernetes cluster, and you want to know what’s going on. Scouring the internet, you look for ways to monitor your cluster and stumble upon Datadog and Prometheus. While neither is specifically made for Kubernetes, both of these tools are highly regarded for Kubernetes monitoring. So which one should you choose?
As with many things in life, it depends. What are your needs?
There can be many reasons you might want to monitor a cluster:
- Do you want to know the state of your nodes?
- Do you want greater insight into how your application is performing?
- Maybe you want a complete overview of not just your cluster, but how your infrastructure and applications are performing in general.
Let's take a look into what these two tools provide, what sets them apart, and what use cases they’re best suited for.
What sets these two apart?
To begin, let’s dive into the major differences between Prometheus and Datadog.
Who manages them?
For many, this will be the biggest differentiator between the two. After all, whether a service is managed or unmanaged can have a huge impact on setup, configuration, and maintainability.
Datadog is managed, meaning you don’t have to manage anything yourself. You pay them a set amount of money based on your monthly usage, and they take care of pretty much everything. You’ll only need to worry about configuring the tool, and if anything doesn’t make sense, you’ll have a support channel to help you.
Prometheus, by contrast, is unmanaged by default: you self-manage it, and you’re in charge of everything. It’s up to you how you deploy the application, how it’s backed up, how it’s maintained, and so on. Limited only by the feature set of the tool, you are the one in control. (Recently, a number of cloud providers have also launched Prometheus as a managed service.)
This level of control can be a huge advantage. The disadvantage is that you also have to maintain the tool. An unmanaged tool is usually free to use, but you still have to pay someone on your staff to maintain it.
The number of features
The second biggest differentiator has to be the feature set you’re getting from Datadog vs Prometheus. Datadog aims to be a one-stop shop, whereas Prometheus is specialized.
While a vanilla setup of Prometheus doesn’t provide many features other than metric collection, it’s very expandable. Third-party tools enable you to get most of the same features as Datadog; you just have to be aware that many of these features do not come out of the box. This is a case in point of how different a managed service can be from an unmanaged one.
The goals of each tool
Knowing the goal of a tool and whether it aligns with your vision for monitoring can be a great help in making your choice.
Datadog wants to be more of a one-stop shop, where you can get metrics, logging, and application monitoring all in one place.
Prometheus wants to be the greatest time-series database, and not much more. This does mean that Prometheus has ended up being a lot more expandable and integrates with a variety of third-party tools.
If you’re familiar with Helm or willing to try it out, both of these tools are very easy to set up. There’s a Helm chart for Datadog, and there’s a Helm chart for Prometheus. If you’re not familiar with Helm, you essentially configure everything in a `values.yaml` file, run `helm install`, and you’re all set.
You can also set up both Datadog and Prometheus manually. Datadog provides official documentation on how to set up their Agent using a DaemonSet. If the setup process fails for some reason, you can contact their support to aid in troubleshooting.
This is very much in contrast to Prometheus, which has no official documentation for Kubernetes. If the setup fails, you’ll have to rely on the community for support. Whether this is an advantage or disadvantage is up to you. Do you want to pay a service provider or your engineer for troubleshooting?
Initial setup and shared goals are important considerations when choosing a tool, but eventually you’ll have to start comparing features. Here you’ll get an overview of how these tools compare in the most common monitoring use cases.
As soon as you’ve installed the Datadog agent into your cluster, it starts collecting metrics and sends them to Datadog’s servers. You can then view them in the web UI, create dashboards, set up alerts, and so on. After setup, you can choose which metrics you want to collect from among the many that Datadog already has defined. If you don’t find all the metrics you want, you can define your own. This will cost you, however, as Datadog charges for custom metrics.
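To make the idea of a custom metric more concrete, here is a minimal sketch of how an application could hand one to a local Datadog agent over DogStatsD, the agent’s StatsD-compatible UDP listener. It assumes the agent is listening on the default port 8125 on localhost; the metric name `checkout.queue_depth` and its tags are hypothetical, and in practice you would more likely use one of Datadog’s client libraries than raw sockets.

```python
import socket


def format_dogstatsd(name, value, metric_type="g", tags=None):
    """Build a DogStatsD datagram: <name>:<value>|<type>|#<tag>,<tag>.

    metric_type is "g" for a gauge or "c" for a counter.
    """
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Send the datagram over UDP (fire-and-forget, no reply is expected).

    Assumption: a Datadog agent with DogStatsD enabled listens on host:port.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


if __name__ == "__main__":
    # Hypothetical metric name and tags, for illustration only.
    payload = format_dogstatsd(
        "checkout.queue_depth", 42, "g", ["env:prod", "service:checkout"]
    )
    print(payload)
    send_metric(payload)
```

Every distinct combination of metric name and tag values can count toward your custom metric usage, so it’s worth keeping tags low in cardinality.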
In the case of Prometheus, you first have to configure the tool to your liking and tell it which metrics you want to collect. Configuring Prometheus is a matter of defining a ConfigMap, after which it will start scraping various services for your defined metrics. To view these metrics, you can use the integrated `/graph` endpoint or use a tool like Grafana.
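What Prometheus scrapes is a plain-text page that each service exposes over HTTP, conventionally at `/metrics`. The sketch below renders one metric in that text format so you can see what a scrape target returns. The function name `render_metric` and the sample values are made up for illustration; a real service would normally use an official Prometheus client library rather than building the text by hand.

```python
def _escape_label_value(value):
    """Escape backslashes, double quotes, and newlines in a label value."""
    return (
        str(value)
        .replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            # Sort labels so the output is stable between calls.
            label_str = ",".join(
                f'{key}="{_escape_label_value(val)}"'
                for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Sample values are illustrative, not real measurements.
    print(
        render_metric(
            "http_requests_total",
            "counter",
            "Total HTTP requests.",
            [({"method": "get", "code": "200"}, 1027)],
        ),
        end="",
    )
```

Because the format is this simple, almost anything that can serve HTTP can become a scrape target, which is a large part of why Prometheus integrates with so many tools.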
Any Kubernetes cluster is constantly creating events. These events show you what’s happening inside your cluster and can offer a more detailed look into your cluster, rather than only looking at metrics.
When you’ve installed the Datadog agent in your cluster, it will forward these events to Datadog, making them viewable in their web UI where you can create dashboards and alerts.
Prometheus doesn’t natively ship with support for event collection. You’ll have to turn to the community to see what third-party tools are available for this functionality. This may look like a disadvantage for Prometheus, but in reality it shows how expandable the tool is. The community is large, and the chances of finding a tool for your needs are good.
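If you just want a feel for what these events contain before picking a tool, you can inspect them yourself. The following is a small sketch, not a replacement for either product: it assumes you have saved the output of `kubectl get events -o json` and counts `Warning` events by their `reason` field. The function name `warning_reasons` and the sample data are illustrative.

```python
import json
from collections import Counter


def warning_reasons(events_json):
    """Count Warning events by reason.

    events_json is the JSON text produced by `kubectl get events -o json`,
    which is a list object with the events under "items".
    """
    items = json.loads(events_json).get("items", [])
    counts = Counter()
    for event in items:
        if event.get("type") == "Warning":
            # "count" may be missing or null; treat that as one occurrence.
            counts[event.get("reason", "Unknown")] += event.get("count") or 1
    return counts


if __name__ == "__main__":
    # Hand-written sample resembling kubectl output, for illustration only.
    sample = json.dumps({
        "items": [
            {"type": "Warning", "reason": "BackOff", "count": 5},
            {"type": "Normal", "reason": "Pulled", "count": 1},
            {"type": "Warning", "reason": "FailedScheduling"},
        ]
    })
    for reason, total in warning_reasons(sample).most_common():
        print(reason, total)
```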
After enabling the Kubernetes integration in Datadog, you get some premade dashboards which will show you an overview of Pods, Nodes, Deployments, and so on. If you’re not finding what you want from these pre-configured dashboards, you have the option of making your own.
As mentioned previously, Prometheus does ship with a `/graph` endpoint. However, it’s commonly understood that you want to use a third-party tool like Grafana for viewing your metrics. This is especially evident when looking at the official documentation, which dedicates a section to Grafana.
For many, alerting is the entire point of having a monitoring system. Using your collected metrics and events to know how your system is performing is great, but using them to get notified when something goes wrong is a powerful mechanism.
Datadog offers an extensive alerting suite. You can set up a traditional threshold alert, for example notifying you when CPU load is above 80 percent. They also offer more advanced alerts, like anomaly alerts. You may have a service where it’s expected that CPU load fluctuates a lot, but with a pattern to it: high during lunch, low at midnight. In Datadog, you can set up an alert that triggers if CPU load is suddenly high at midnight or suddenly low during lunch.
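The difference between the two alert types can be sketched in a few lines. This is a deliberately simple illustration and not Datadog’s actual anomaly detection algorithm: it compares the current reading against the mean and standard deviation of past readings for the same hour of day. The function names, the `tolerance` parameter, and the sample history are all assumptions made for the example.

```python
from statistics import mean, stdev


def threshold_alert(cpu_percent, limit=80.0):
    """Classic threshold alert: fire when CPU load exceeds a fixed limit."""
    return cpu_percent > limit


def anomaly_alert(history_by_hour, hour, cpu_percent, tolerance=3.0):
    """Fire when the reading is far from what is normal for this hour.

    history_by_hour maps an hour (0-23) to at least two past CPU readings.
    A reading is anomalous if it lies more than `tolerance` standard
    deviations from the historical mean for that hour.
    """
    past = history_by_hour[hour]
    baseline = mean(past)
    # Use a floor of 1.0 so a very flat history doesn't alert on tiny changes.
    spread = max(stdev(past), 1.0)
    return abs(cpu_percent - baseline) > tolerance * spread


if __name__ == "__main__":
    # Illustrative history: quiet at midnight (hour 0), busy at lunch (hour 12).
    history = {0: [10, 12, 11, 9], 12: [70, 75, 72, 68]}
    print(threshold_alert(65))            # below the fixed 80 percent limit
    print(anomaly_alert(history, 0, 65))  # unusually high for midnight
    print(anomaly_alert(history, 12, 20)) # unusually low for lunch
```

A load of 65 percent at midnight never crosses the 80 percent threshold, yet it stands out clearly against the midnight baseline, which is exactly the kind of case anomaly alerts are meant to catch.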
As has been the case for many things, Prometheus only does part of the job itself: it evaluates alerting rules, but its official documentation relies on a separate component, Alertmanager, to group and deliver the notifications. This is another example of Prometheus not trying to be a jack-of-all-trades but rather a master of one. Whether this is an advantage or disadvantage is again up to you and your use case.
At this point, you know what Datadog and Prometheus are, as well as the specific advantages and disadvantages of both. Datadog is better if you can pay for a managed service; Prometheus is better if you want more control. If Datadog fills all your needs on its own, perfect! If it doesn’t, it can be quite the task to expand upon it. That being said, in many cases expansion won’t be necessary, considering the number of features you’re getting.
Prometheus can require many engineering hours to set up. However, if it doesn’t fulfill all your needs on its own, it’s very expandable with third-party tools. On top of that, it’s open source, so any issues you may have with the core application can be addressed with a pull request.
Finally, you may also want to spend some time comparing Sysdig to Prometheus as Sysdig, like Datadog, offers more of a platform-type approach. And consider comparing Datadog to a variety of SaaS-based alternatives, like Airplane. Airplane is the developer platform for building internal tools. With Airplane, you can build custom internal tools that fit your engineering workflows. Using Views, you can build a React-based monitoring dashboard that makes it easy to track your applications.