Let’s consider a scenario.
It’s 12:45 p.m., and you’re wrapping up a lunchtime interview with the senior hiring manager and the director of global infrastructure at an e-commerce startup looking for SREs. The director leans forward and says, “Okay, I’ve got one more question. In your own words, can you tell us the difference between observability and monitoring?”
To this point, you’ve confidently answered the majority of the questions thrown at you. Yet despite your twelve years of experience, this one is tricky.
To state it simply: monitoring focuses on the what, and observability focuses on the why.
The feedback provided by your systems is just as important as (if not more important than) the feedback provided by your customers.
Applications composed of complex and highly available resources require each component to share as much state information as possible. Extracting and comprehending this data can inform company decisions as to how best to improve application performance.
This article will cover some often misinterpreted concepts of observability and monitoring. Not only will you learn what exactly these terms mean, but you’ll also learn how their concepts complement each other. Together, they can supply you with an understanding of what’s really going on inside your deployments, so you can react efficiently to events.
The history of observability and monitoring
Rudolf E. Kálmán first wrote about observability in his work on control theory. A basic level of observability can be achieved with applications and infrastructure—health checks, storage capacity, latency, and throughput are all metrics that have been available for quite some time. However, today’s microservices architectures demand much deeper observation of your applications and services.
Monitoring goes back even further than the concept of observability. But in terms of internet-based applications, it can be argued that network monitoring marks the beginning of monitoring in the modern sense, specifically when SNMP (Simple Network Management Protocol) was established in 1988.
By now, monitoring has evolved into a concept that encompasses the collection of state data, the processing and visualization of this data, and sending alerts to inform your interested parties when unsatisfactory states arise.
What is observability?
Observability can be understood as an in-depth look into the inner workings of modern applications to provide you with ways to quickly determine problems and potential resolutions.
Alternatively, observability can be described as the qualitative measure of an application’s external outputs used to determine its overall health. In other words, with higher fidelity external outputs, you have a better chance of determining the true states of your applications and the reasons behind them.
As an example, today many software solutions are web-based, which means more often than not, software is being deployed to an internet-based platform. Determining whether this web-based application is available or not would obviously be very important to monitor. If your application goes offline, the health checks that you have in place enter into a failed state, and in turn, notify any interested parties that some or all of your application is suddenly unavailable. Your pager goes off, and what happens next?
You proceed to determine why the outage occurred by collecting as much data as you can about each of the components in question. Observability refers to the quality of this collected data and how well it articulates the reasons behind the outage.
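To make the health-check scenario above concrete, here is a minimal sketch of an HTTP availability check using only the Python standard library (the endpoint URL and timeout are placeholders for whatever your deployment defines):

```python
import urllib.request

def check_health(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError, HTTPError, connection refusals, and timeouts all
        # subclass OSError: any of them means the check has failed.
        return False
```

A real health check would run on a schedule and feed its pass/fail state into your alerting pipeline; this sketch only covers the probe itself.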
When it comes to IT resources, observability is at the basis of problem determination. In IT, there are three concepts that provide system administrators and application developers with actionable intelligence: observability, monitoring, and analysis. Your monitoring is only as effective as the observability that your resources provide, and the quality of your analysis is directly proportional to the quality of your monitoring stack. These three concepts reveal the information necessary to understand and react to your applications’ needs.
Today, application stacks are highly available microservices that are constantly deployed, destroyed, and redeployed. The “web” of data flowing back and forth between these services makes problem determination much more challenging. This is why it is so important to identify the external outputs that offer the most detail on the state of your application’s components. You must tune into the true signals from your applications and eliminate as much of the noise as possible.
So how do you achieve those higher fidelity outputs that provide you with the vision you need to appropriately analyze what’s happening inside your cloud-native applications? Here are five things to consider if you’re working to increase observability:
1. Define SLOs and determine SLIs
Defining Service Level Objectives involves understanding how your applications should behave—in other words, detailing the desired characteristics and the appropriate behaviors. Then, identify the Service Level Indicators, or metrics, that will allow you to measure your progress toward those objectives.
If you’re looking for more information on this topic, Google has published a great article on the topic of SLIs, SLAs, and SLOs.
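To sketch the idea in code, an availability SLI and its error-budget arithmetic might look like this (the 99.9% objective below is purely illustrative; your SLO targets will differ):

```python
def availability_sli(successful: int, total: int) -> float:
    """SLI: the fraction of requests that succeeded."""
    return successful / total if total else 1.0

def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget still unspent.

    With a 99.9% SLO the budget is 0.1% of requests; if the measured
    SLI is 99.95%, only half of that budget has been consumed.
    """
    budget = 1.0 - slo   # allowed failure rate
    burned = 1.0 - sli   # actual failure rate
    return 1.0 - (burned / budget) if budget else 0.0

# 99,950 of 100,000 requests succeeded against a 99.9% objective:
sli = availability_sli(99_950, 100_000)
remaining = error_budget_remaining(sli, slo=0.999)
```

Here `remaining` comes out to roughly 0.5—half the budget is left—which is the kind of signal teams use to decide whether to ship features or focus on reliability.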
2. Centralized logging
Each component of your highly available applications generates plenty of logging information. For example, if you are running workloads on Kubernetes, both application-level and cluster-level logs will be generated.
Without a centralized way of viewing the logs from all sources involved, it will get increasingly difficult to find the reasons behind the states your applications are in.
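One common pattern that makes centralized log collection easier is emitting structured, one-JSON-object-per-line logs that any central collector can parse and index. Here is a minimal sketch using Python’s standard logging module (the service name and field set are illustrative):

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so a central collector
    (for example, a log shipper reading stdout) can parse and index it."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": record.name,  # which component emitted this line
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout-service")  # hypothetical service name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed")
```

With every component logging in the same structured shape, searching across sources for the state behind a failure becomes a query rather than a scavenger hunt.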
3. Event and metrics collection
Events and metrics should be collected at the various layers of an application’s deployment.
Here’s an example: in cloud computing, your applications are at the mercy of the servers or VMs that ultimately host them. The issues encountered within your applications often trace back to problems with the host. By performing event and metric collection at the hypervisor layer, you gain the out-of-band insight that reveals issues at the host level.
You should also take into consideration that containerized applications are hosted on platforms like Kubernetes. Errors can occur at the platform level, in addition to the infrastructure/node level. The events and metrics that you collect at the varying layers of your application will help in gaining a complete picture of what’s happening with your applications.
4. Tracing
Tracing is one of the most important concepts in observability. Having the ability to trace the individual transactions that occur within your distributed applications and understanding the lifecycle of your application requests are vital to fully understanding how your applications work.
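The core mechanic of tracing is propagating a single trace ID through every hop of a request so that spans from different services can be stitched back into one request lifecycle. A minimal hand-rolled sketch of that idea follows; real systems typically rely on a standard such as OpenTelemetry rather than rolling their own:

```python
import contextvars
import uuid

# The current request's trace ID, propagated implicitly down the call stack.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

def start_trace() -> str:
    """Begin a new trace at the edge of the system (e.g. an incoming request)."""
    trace_id = uuid.uuid4().hex
    trace_id_var.set(trace_id)
    return trace_id

def annotate(event: str) -> dict:
    """Tag an event with the active trace ID so downstream spans can be
    correlated back to the originating request."""
    return {"trace_id": trace_id_var.get(), "event": event}

# Every downstream call sees the same trace ID without passing it explicitly.
start_trace()
spans = [annotate("auth.verify"), annotate("cart.load"), annotate("payment.charge")]
```

The event names here are hypothetical; the point is that all three spans carry the same ID, which is what lets a tracing backend reconstruct the full request path.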
5. Automation
When an unexpected event occurs, even your legacy monitoring solutions offer you an automated way of being notified. However, the automation shouldn’t stop at the notification step. Use automation to pull logs and other data at the time of the event. This saves time and allows you to begin determining what the problems are faster than retrieving the information manually.
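As a sketch of that idea, an alert handler might automatically gather context at notification time. Note that `fetch_recent_logs` below is a hypothetical stand-in for a query against whatever log store you run:

```python
import datetime

def fetch_recent_logs(service: str, window_minutes: int = 15) -> list[str]:
    """Hypothetical placeholder: a real implementation would query your
    log store for the last `window_minutes` of this service's logs."""
    return [f"{service}: sample log line"]

def on_alert(alert: dict) -> dict:
    """When an alert fires, attach the surrounding context automatically
    instead of stopping at the notification."""
    fired_at = datetime.datetime.now(datetime.timezone.utc)
    return {
        "alert": alert["name"],
        "fired_at": fired_at.isoformat(),
        "recent_logs": fetch_recent_logs(alert["service"]),
    }

incident = on_alert({"name": "HighErrorRate", "service": "checkout"})
```

By the time a human opens the incident, the relevant logs are already attached, shaving minutes off problem determination.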
It’s hard to know the exact metrics you should collect for your application, since all applications are unique. Generally speaking, there are a few metrics to consider, like:
- Requests per second. How often are requests being received?
- Average response time. How long does it take for your application to respond?
- Server uptime. How long have your servers been up?
- Error rates. What is the rate at which your application is receiving errors?
- Queue time. How long are jobs taking to process?
- Garbage collection. How often does garbage collection run for your application?
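Several of the metrics above reduce to simple arithmetic over raw counters you are likely already collecting. A sketch, assuming the counters exist:

```python
def requests_per_second(request_count: int, window_seconds: float) -> float:
    """Throughput: requests observed over a measurement window."""
    return request_count / window_seconds

def error_rate(error_count: int, request_count: int) -> float:
    """Fraction of requests that resulted in an error."""
    return error_count / request_count if request_count else 0.0

def average_response_time(total_latency_ms: float, request_count: int) -> float:
    """Mean latency in milliseconds across the window."""
    return total_latency_ms / request_count if request_count else 0.0
```

Averages hide outliers, so in practice teams often track latency percentiles (p95, p99) alongside the mean; the counter-based approach is the same.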
What is monitoring?
In 2016, Greg Poirier, then CTO at OpSee (a project that has since ended), gave the well-known “Monitoring is Dead” presentation. In it, he defines monitoring as “the action of observing and checking the behavior and outputs of a system and its components over time.”
Observability exposes internal data via external outputs, and the more information made available, the more insight you can gain. Monitoring—the act of observing and checking—won’t matter, however, without the system’s raw data and outputs to act on.
Monitoring starts with the events that take your applications into an unsatisfactory state: a network outage, a server that stops responding, and so on. It’s at this point that the concept of observability becomes important, as you attempt to determine the reasons behind those failures.
Monitoring isn’t just about failure detection either. It should also be used to monitor metrics like utilization and billing.
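As a small illustration of monitoring beyond failure detection, a utilization check might classify each sample against thresholds before anything has actually failed (the 70%/90% thresholds here are placeholders for whatever your capacity planning dictates):

```python
def check_utilization(cpu_percent: float,
                      warn: float = 70.0,
                      crit: float = 90.0) -> str:
    """Classify a CPU utilization sample against monitoring thresholds."""
    if cpu_percent >= crit:
        return "critical"
    if cpu_percent >= warn:
        return "warning"
    return "ok"
```

The same pattern applies to billing-oriented metrics: a threshold on projected monthly spend can page someone long before an invoice surprises you.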
Observability and monitoring: working together
Observability and monitoring should be practiced together. To take action, you need extensive observability from your applications and infrastructure, a robust monitoring solution to expose and interpret this data, and a solid means of performing analysis on this data to determine what needs to improve.
Observability and monitoring are practices that should be continuously improved upon as you gain more insight from your applications and services. Failures will happen. The idea is to detect problems before they arise, or at the very least, provide yourself with as much information as you can when the failures occur.
What are the false positives? Is your alerting noisy or valuable? Are you collecting useful output from the systems you employ? Are you able to understand what’s happening from the data you’ve collected?
Observability and monitoring work together to provide your organization with the insight required to make well-informed, data-backed decisions. Combine the insight that observability provides with a process of monitoring this data in real-time, and you will have the power to swiftly tune your software as needed.
In many cases, this insight can be leveraged proactively: just as health data helps us determine the condition of the body and take preventative measures, observability data allows you to do the same for your systems.
If observability refers to a system’s visibility, then monitoring is the practice of using that visibility to your advantage. As modern applications become more distributed and more architecturally complex, it becomes increasingly important to inspect their inner workings.
Modern infrastructure and applications have evolved, so the instrumentation that we use to determine problems must adapt to these changes as well. Today, cloud-native applications add their fair share of complexity to the process of observing your applications and monitoring their overall state.
If you're looking for an internal tooling platform that makes it easy to build monitoring dashboards for your applications, then check out Airplane. With Airplane, you can transform scripts, queries, APIs, and more into powerful UIs and workflows that anyone on your team can use. Airplane also offers strong built-ins, such as audit logs and approval flows, that make your internal tools secure and easy to track.