Kubernetes has become an increasingly important tool for companies running workloads with containers. Small and large organizations have migrated to K8s for a number of reasons, including cost savings, autoscaling, and improved productivity.
These days, most organizations are also using a cloud-hosted or managed Kubernetes service. Cloud providers like AWS, GCP, and Azure all offer managed Kubernetes services. However, some engineering teams are still working in a physical on-prem environment, where they are responsible for supporting Kubernetes and maintaining the hardware.
While Kubernetes has many benefits, there is often a need for teams to deploy a monitoring and observability stack to troubleshoot issues that happen within the cluster and the applications themselves.
For companies with highly secure environments, like those in financial services or healthcare, using a SaaS offering for monitoring and observability might not be feasible. The risk of sensitive data leaving a secure environment for a third-party environment keeps engineering leaders up at night.
For these companies, whether they are using a managed Kubernetes service or running K8s with physical hardware on-prem, choosing to run their observability tools in-cluster, in-cloud, or “on-prem” offers peace of mind. In this article, we’ll explore some of the challenges of using an on-prem, or in-cloud, set of observability tools, introduce some best practices, and highlight a handful of tools that companies are using.
Challenges of introducing monitoring on-prem or in-cloud
While running your monitoring and observability toolset on-prem has many benefits, there are without a doubt a number of challenges that organizations might face. These challenges include the added complexity of self-managing a toolset, additional direct costs, and the limited availability of tooling that works on-prem or in an in-cloud environment.
Added complexity of self-management
One of the biggest challenges of running a monitoring solution in an on-prem environment is that you have to manage it yourself. This gives engineering teams one more system to operate, and it can create a scenario in which the monitoring toolset itself needs to be monitored.
Fortunately, today, there are a number of open-source tools with an emphasis on Kubernetes. And many of these tools have become reliable and secure, due in large part to robust contributor communities.
While self-managing your toolset on-prem or in-cloud will take additional engineering time, it may be worthwhile for companies in highly secure industries.
Additional direct costs
When you use a managed solution or SaaS offering, the vendor carries the cost of hosting and bundles it into its price. When you run a toolset on-prem or in-cloud, however, you are responsible for hosting and managing that resource yourself.
By self-hosting your toolset, you might find that the direct costs of hosting and maintaining the environment will increase. For example, if you are self-hosting Prometheus in-cloud, there will be meaningful cloud costs associated with receiving and storing metrics in the time series database.
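To make the storage cost concrete, here is a minimal back-of-the-envelope sketch. It uses the capacity-planning rule of thumb from the Prometheus documentation (retention time multiplied by ingested samples per second multiplied by bytes per sample, with roughly 1 to 2 bytes per sample). The cluster size, scrape interval, and retention period are hypothetical figures chosen for illustration, not measurements.

```python
def samples_per_second(active_series: int, scrape_interval_seconds: float) -> float:
    """Approximate ingestion rate: every active series yields one sample per scrape."""
    return active_series / scrape_interval_seconds


def estimate_disk_bytes(retention_days: float, ingest_rate: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough disk need: retention seconds * samples per second * bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * ingest_rate * bytes_per_sample


# Hypothetical cluster: 500,000 active series, scraped every 30 s, kept for 15 days.
rate = samples_per_second(500_000, 30)
size_gib = estimate_disk_bytes(15, rate) / 1024**3
print(f"{rate:.0f} samples/s, roughly {size_gib:.1f} GiB of disk")
```

The estimate covers only the time series data itself. Compute, memory, backups, and network egress would add to the bill, so treat the result as a lower bound.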
Limited availability of tooling
Unfortunately, companies looking to run their toolset in an on-prem environment have fewer vendors to choose from. Popular solutions like Datadog and New Relic are not offered as on-prem or in-cloud deployments. And while Datadog, for example, allows you to run its agent in an on-prem cluster, the collected data still has to leave your environment.
Tools for K8s monitoring on-prem
As discussed in the previous section, a limited number of tools are available for companies looking to run their monitoring on-prem. In this section, we’ll highlight tools and platforms that are available for companies looking to keep their monitoring toolset on-prem: a popular open-source stack and its enterprise edition, and a commercial APM offering.
Using Prometheus and Grafana - popular open-source tools
Prometheus, a popular open-source monitoring solution, and a graduated CNCF project, is available for companies running workloads on-prem. Prometheus is a time series database known for its ability to scale to massive environments and for its efficiency.
Prometheus is often used in conjunction with Grafana, which is an open-source tool for visualization. Today, both Prometheus and Grafana can be run as managed SaaS offerings, through the cloud providers or through Grafana Cloud. However, many organizations still run these tools together in on-prem environments, too.
Prometheus is known for its large number of integrations, which make it easy to bring third-party data into Prometheus. It’s also known for its popular PromQL query language, and for its alerting capabilities with Alertmanager.
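As a rough illustration of what working with PromQL looks like from code, the sketch below builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`) and parses a response in the documented JSON shape. The host name is a placeholder, the query assumes cAdvisor container metrics are being scraped, and the response is a canned sample so the example runs without a server.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    query_string = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query_string}"


def parse_instant_vector(payload: dict, label: str) -> dict:
    """Map the chosen label's value to the sample value of each returned series."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    values = {}
    for series in payload["data"]["result"]:
        _timestamp, raw_value = series["value"]  # the sample value is a string
        values[series["metric"].get(label, "")] = float(raw_value)
    return values


# Assumed query: per-pod CPU usage over the last 5 minutes.
QUERY = "sum by (pod) (rate(container_cpu_usage_seconds_total[5m]))"
url = build_query_url("http://prometheus.example.internal:9090", QUERY)

# Canned response (hypothetical pods and values) in the API's response format.
sample = json.loads("""
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {"pod": "checkout-7d9f"}, "value": [1700000000, "0.42"]},
                     {"metric": {"pod": "cart-5c2a"}, "value": [1700000000, "0.08"]}]}}
""")
cpu_by_pod = parse_instant_vector(sample, "pod")
print(url)
print(cpu_by_pod)
```

In a real setup you would fetch `url` with an HTTP client and pass the decoded JSON to `parse_instant_vector`.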
Grafana, when it’s used to visualize the data collected in Prometheus, can be helpful for teams when they are troubleshooting. Grafana Enterprise is an increasingly popular toolset for the largest enterprises looking to use Grafana on-prem. Included with Grafana Enterprise is access to Enterprise Logs, Enterprise Metrics, and Enterprise Traces. There are a number of additional enterprise features like authentication, usage insights, and scalability guarantees.
Grafana Enterprise pricing is not publicly available, and you must request a demo in order to learn more.
AppDynamics - an APM ideal for on-prem and hybrid environments
AppDynamics, now part of Cisco, has been a leading provider of application performance monitoring since 2008.
AppDynamics has a Kubernetes Monitoring suite that can be run on-prem, hybrid, and across multiple cloud providers. This is particularly helpful for companies that are looking to aggregate data between on-prem and cloud environments in one tool.
While there is quite a bit of overlap between what AppDynamics offers and what you can get from other tooling, there are a couple of places where AppDynamics excels.
AppDynamics offers automatic root cause analysis, which uses historical performance to identify and alert on actual issues that need to be addressed. This can help to reduce alert fatigue, a common complaint with Prometheus’s Alertmanager.
AppDynamics also offers users a toolset to tie monitoring to actual business performance. With AppDynamics, users are able to correlate latency with the conversion rate on a given path. For example, if you were running an e-commerce website, you could compare the performance of checkout services, like Add To Cart, against checkout conversion rates at given points in time.
Users can access a 15-day free trial, or book a demo with a member of the sales team to get started.
Best practices to implement
In this section, we’ll explore some best practices for organizations of all sizes. As part of this, we’ll discuss which metrics are important, tips for troubleshooting more effectively, and which critical pieces of data might be providing clues about the end-user experience.
Keep a close eye on metrics
Monitoring metrics is often the first step for companies standing up a monitoring stack. Keeping track of pod and node CPU and memory, among other things like disk space, is important to ensure that overall cluster health is strong.
One of the biggest benefits of using Kubernetes is that it can efficiently scale resources, such as the number of pods and nodes, as demand on your applications grows. By tracking resource usage, as well as the allocation of resources, teams are able to ensure a more performant experience. And by tracking historical usage alongside requests and limits, teams are often able to rightsize and reduce expenses.
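One simple way to approach rightsizing can be sketched as follows: take historical CPU usage samples for a container, pick a high percentile, add headroom, and compare the result with the current request. The percentile, the headroom, and the sample data are all assumptions for illustration; real workloads may need a longer history and different margins.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ContainerUsage:
    name: str
    cpu_request_millicores: int
    cpu_usage_millicores: List[float]  # historical usage samples


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[min(len(ordered) - 1, max(0, rank - 1))]


def recommend_cpu_request(usage: ContainerUsage, pct: float = 95,
                          headroom_percent: int = 20) -> int:
    """Suggest a CPU request: the chosen usage percentile plus headroom."""
    peak = percentile(usage.cpu_usage_millicores, pct)
    return math.ceil(peak * (100 + headroom_percent) / 100)


# Hypothetical container that requests a full core but rarely uses half of it.
checkout = ContainerUsage(
    name="checkout",
    cpu_request_millicores=1000,
    cpu_usage_millicores=[100, 120, 110, 400, 130, 125, 115, 105, 140, 135],
)
recommended = recommend_cpu_request(checkout)
savings = checkout.cpu_request_millicores - recommended
print(f"{checkout.name}: request {checkout.cpu_request_millicores}m, "
      f"suggested {recommended}m, reclaimable {savings}m")
```

A negative `savings` value would indicate the opposite problem: a container that is under-provisioned and at risk of throttling.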
Use Kubernetes events when troubleshooting
The Kubernetes events feed can be a highly valuable source of information, especially when it comes to troubleshooting and remediation.
The Kubernetes events feed captures a historical record of everything happening inside of the cluster including both Normal and Warning events. When issues arise, Warning events can act as a guide to what went wrong and when. And when used alongside logs and metrics, Kubernetes events can be a helpful starting point for gathering additional context and data.
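As a sketch of how Warning events can be pulled out of the feed, the snippet below filters and summarizes events in the shape produced by `kubectl get events -o json` (a list under `items`, each with `type`, `reason`, `lastTimestamp`, and `involvedObject`). The event data is made up for the example; in practice you would load the JSON from the cluster.

```python
from collections import Counter


def warning_events(event_list: dict) -> list:
    """Return Warning events, oldest first (ISO-8601 strings sort chronologically)."""
    warnings = [e for e in event_list.get("items", []) if e.get("type") == "Warning"]
    return sorted(warnings, key=lambda e: e.get("lastTimestamp") or "")


def summarize(events: list) -> list:
    """Count events per (kind, name, reason), most frequent first."""
    counts = Counter(
        (e["involvedObject"]["kind"], e["involvedObject"]["name"], e["reason"])
        for e in events
    )
    return counts.most_common()


# Hypothetical events in the shape of `kubectl get events -o json`.
events = {"items": [
    {"type": "Warning", "reason": "BackOff", "lastTimestamp": "2024-01-03T10:05:00Z",
     "message": "Back-off restarting failed container",
     "involvedObject": {"kind": "Pod", "name": "checkout-7d9f"}},
    {"type": "Normal", "reason": "Scheduled", "lastTimestamp": "2024-01-03T10:00:00Z",
     "message": "Successfully assigned pod",
     "involvedObject": {"kind": "Pod", "name": "cart-5c2a"}},
    {"type": "Warning", "reason": "BackOff", "lastTimestamp": "2024-01-03T10:02:00Z",
     "message": "Back-off restarting failed container",
     "involvedObject": {"kind": "Pod", "name": "checkout-7d9f"}},
]}

warnings = warning_events(events)
for (kind, name, reason), count in summarize(warnings):
    print(f"{kind}/{name}: {reason} x{count}")
```

Grouping by object and reason makes repeated failures, such as a pod stuck restarting, stand out from one-off warnings.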
Correlations make troubleshooting faster
We’ve all been there. Endlessly scrolling through logs trying to figure out what happened, and then matching timestamps up in order to get the full picture.
Teams can avoid doing this manual matching and correlating by finding a toolset that correlates automatically. Grafana Enterprise and AppDynamics both have the ability to correlate multiple metrics and other pieces of information, such as logs, at given points in time.
Correlations make troubleshooting a more efficient process, and teams should use a toolset that enables this behavior by default.
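To show the idea behind time-based correlation, here is a minimal sketch that gathers every record (log line, event, or metric sample) falling within a window around an anomaly. This is not how Grafana Enterprise or AppDynamics implement correlation; the record shape, the window size, and the data are assumptions for illustration.

```python
from datetime import datetime, timedelta


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp, accepting a trailing 'Z' for UTC."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def correlate(anomaly_at: str, records: list,
              window: timedelta = timedelta(minutes=2)) -> list:
    """Return records within +/- window of the anomaly, in time order."""
    center = parse_ts(anomaly_at)
    nearby = [r for r in records if abs(parse_ts(r["timestamp"]) - center) <= window]
    return sorted(nearby, key=lambda r: parse_ts(r["timestamp"]))


# Hypothetical mixed records from logs, Kubernetes events, and metrics.
records = [
    {"source": "log", "timestamp": "2024-01-03T10:04:30Z", "text": "db connection timeout"},
    {"source": "event", "timestamp": "2024-01-03T10:05:10Z", "text": "BackOff checkout-7d9f"},
    {"source": "log", "timestamp": "2024-01-03T09:30:00Z", "text": "cache warmed"},
    {"source": "metric", "timestamp": "2024-01-03T10:03:45Z", "text": "p99 latency 2.4s"},
]

related = correlate("2024-01-03T10:05:00Z", records)
for record in related:
    print(record["timestamp"], record["source"], record["text"])
```

Even this naive window match replaces the manual step of lining up timestamps across several tools.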
Track end-user experiences too
While infrastructure metrics are important, it’s also wise to spend time thinking about end-user performance metrics. By implementing an APM, or APM-like features, engineering teams can track the performance that end users actually experience. For example, teams may find it valuable to track latency for particular microservices, or for the particular paths that customers use.
Teams might also find it useful to track status codes by request and to alert when problematic status codes are identified.
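A status-code alert can be sketched as an error-rate check over a window of recent requests. The 5% threshold and the minimum request count below are illustrative assumptions, and the status codes are sample data.

```python
def error_rate(status_codes: list) -> float:
    """Fraction of requests in the window that returned a 5xx status."""
    if not status_codes:
        return 0.0
    errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return errors / len(status_codes)


def should_alert(status_codes: list, threshold: float = 0.05,
                 min_requests: int = 20) -> bool:
    """Alert only when there is enough traffic and the 5xx rate exceeds the threshold."""
    if len(status_codes) < min_requests:
        return False  # too few requests for the rate to be meaningful
    return error_rate(status_codes) > threshold


# Hypothetical window of 40 requests on a checkout path, 4 of them server errors.
window = [200] * 34 + [404, 404] + [500, 502, 503, 500]
print(f"5xx rate: {error_rate(window):.1%}, alert: {should_alert(window)}")
```

The minimum request count keeps a single failed request on a quiet path from triggering a page.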
While engineering teams likely aren’t making decisions for the end-user, it is important for the team to work alongside other parts of the organization, like the product and marketing teams, to implement monitoring that ensures a performant end-user experience.
Be aware of alert fatigue
Whether you use a SaaS offering or one of the on-prem offerings mentioned above, alert fatigue is an issue worth getting ahead of.
As the name suggests, alert fatigue is the result of inadequately set alerts and monitoring, which leads to wasted time and energy for the engineering team. Alert fatigue is typically caused by setting too many alerts, or by using the wrong thresholds. Instead, engineering teams should work to set alerts only on the actions that have a direct impact or could lead to a compounded issue.
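One common way to cut noisy alerts is to fire only when a threshold has been breached for a sustained period, which is the idea behind the `for` clause in Prometheus alerting rules. The sketch below illustrates that logic on a list of evenly spaced samples; it is not Prometheus code, and the threshold and sample values are made up.

```python
def sustained_breach(samples: list, threshold: float, required_consecutive: int) -> bool:
    """True if the value stays above the threshold for enough consecutive samples."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= required_consecutive:
            return True
    return False


# Hypothetical CPU utilization (%) sampled once a minute.
brief_spike = [40, 45, 95, 42, 41, 43]
real_problem = [40, 91, 93, 96, 97, 94]

print(sustained_breach(brief_spike, threshold=90, required_consecutive=5))
print(sustained_breach(real_problem, threshold=90, required_consecutive=5))
```

With a five-sample requirement, the brief spike is ignored while the sustained overload still triggers an alert.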
Prometheus supports alerting rules, and common distributions such as the kube-prometheus-stack Helm chart ship with a set of preconfigured alerts. Before turning all of these alerts on, engineers should double-check that each alert has an actual business use case.
Finally, teams should work to deliver the alerts to the right place and person. By using a paging system like PagerDuty, teams can forward alerts to specific people or teams depending on the date and time. Knowing who is responsible for which issues can also help to avoid fatiguing an entire team.
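The routing idea can be sketched in a few lines: critical alerts always page the owning team's on-call, while lower-severity alerts go to a chat channel during business hours and are held otherwise. The team names, targets, and business-hours definition are hypothetical, and a real paging system would handle schedules and escalation for you.

```python
from datetime import datetime

# Hypothetical routing table: team label -> chat channel and on-call pager.
ROUTES = {
    "payments": {"chat": "chat:#payments-alerts", "pager": "pager:payments-oncall"},
    "platform": {"chat": "chat:#platform-alerts", "pager": "pager:platform-oncall"},
}
DEFAULT_ROUTE = {"chat": "chat:#ops-alerts", "pager": "pager:ops-oncall"}


def is_business_hours(now: datetime) -> bool:
    """Assumed business hours: Monday to Friday, 09:00 to 17:00 local time."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def route_alert(alert: dict, now: datetime) -> str:
    """Pick a destination from the alert's team, severity, and the current time."""
    route = ROUTES.get(alert.get("team"), DEFAULT_ROUTE)
    if alert.get("severity") == "critical":
        return route["pager"]  # critical alerts always reach the on-call engineer
    if is_business_hours(now):
        return route["chat"]
    return "hold-until-business-hours"


alert = {"name": "HighErrorRate", "team": "payments", "severity": "warning"}
print(route_alert(alert, datetime(2024, 1, 3, 10, 0)))  # a Wednesday morning
print(route_alert(alert, datetime(2024, 1, 6, 3, 0)))   # a Saturday night
```

Holding non-critical alerts outside business hours is one design choice among several; some teams may prefer a daily digest instead.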
In this article, we explored the concept of running a Kubernetes monitoring toolset on-prem or in-cloud. We highlighted some of the challenges of going on-prem, including the limited availability of some tools and the added costs associated with self-managing your monitoring stack. We also explored some popular tools for going on-prem with your Kubernetes monitoring toolset. Prometheus and AppDynamics are good examples of monitoring tools that can be run either on-prem or in-cloud. And finally, we explored some best practices that make troubleshooting more efficient, improve the end-user experience, and help avoid the alert fatigue that is often a byproduct of observability tools.
If you're looking for an internal tooling platform that makes it even easier to monitor your applications, then you should check out Airplane. Airplane is the developer platform for building custom internal tools. With Airplane you can transform scripts, queries, APIs, and more into powerful workflows and UIs.
The basic building blocks of Airplane are Tasks, which are single or multi-step functions that anyone can use. Airplane also offers Views, a React-based platform for building custom UIs. You can also execute Airplane Tasks within your own VPC by using self-hosted agents. Airplane also offers self-hosted storage, making it easy to satisfy rigorous security and compliance requirements for your data.