There are some basic resources that are always needed, no matter what application you’re running. CPU, memory, and disk space are universal, and will be used in all applications. Most engineers have a proper grasp of how to handle CPU and memory, but not everyone takes the time to understand how to properly work with disks.
In Kubernetes, this can become disastrous over time, because Kubernetes will start to “save” itself once a node becomes overloaded. It does this by killing (evicting) pods, thereby reducing the load on the node. This can cause problems in two ways: your application may not know how to handle a sudden shutdown properly, or the remaining pods may not have enough resources to handle the given load.
In this article, you’ll get a high-level overview of what it means when a node disk is under pressure, what some typical causes are, and how you can diagnose it.
What is node disk pressure
Node disk pressure means, as the name suggests, that the disks attached to the node are under pressure. More precisely, the kubelet sets the DiskPressure condition on a node when available disk space or inodes fall below its eviction thresholds. You’re unlikely to encounter node disk pressure, as there are measures built into Kubernetes to avoid it, but it does happen from time to time. While there are a variety of things that can cause node disk pressure, there are two primary reasons you may encounter it.
The first reason you might run into node disk pressure is that Kubernetes has not cleaned up unused images. By default, this shouldn’t happen, as the kubelet routinely checks whether there are any images not in use and deletes the unused images that it finds. This makes it an unlikely source of node disk pressure, but it is worth keeping in mind.
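As a rough illustration of the kind of check involved, the sketch below computes how full a filesystem is and compares it against a threshold. The default of 85 mirrors the kubelet's default for imageGCHighThresholdPercent; the function names and the example path are illustrative assumptions, and the directory that actually holds images depends on your container runtime.

```shell
#!/bin/sh
# Print the used-space percentage (integer, no % sign) of the filesystem
# that holds the given path.
usage_percent() {
    # df -P prints stable POSIX columns; column 5 is "Capacity", e.g. "42%".
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when the usage value is at or above the threshold.
# The default of 85 mirrors the kubelet default for imageGCHighThresholdPercent;
# the comparison itself is only an illustration, not the kubelet's own logic.
above_threshold() {
    used="$1"
    threshold="${2:-85}"
    [ "$used" -ge "$threshold" ]
}

# Example (the path is an assumption; adjust it to your container runtime):
#   above_threshold "$(usage_percent /var/lib/containerd)" && echo "threshold reached"
```

Run on a node, a check like this gives a quick sense of how close the disk is to the point where image cleanup would normally begin.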
The other issue, which you’re significantly more likely to run into, is a problem of logs building up. The default behavior in Kubernetes is to save logs in two cases: it saves the logs of any running containers, and it saves the logs of the most recently exited container to aid in troubleshooting. This is an attempt to strike a balance between keeping important logs and getting rid of logs that aren’t useful by deleting them over time. However, if you have a long-running container that produces a lot of logs, they may build up enough to exceed the capacity of the node disk.
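To see whether logs are the culprit, it can help to total log sizes per pod on the node itself. The sketch below assumes the kubelet's usual log location, /var/log/pods, where each pod gets its own subdirectory; the function name is hypothetical and the path may differ on your distribution.

```shell
#!/bin/sh
# Print the total size (in KiB) of each immediate subdirectory of a log root,
# largest first. On a node the root is typically /var/log/pods (assumption).
log_usage_by_pod() {
    root="${1:-/var/log/pods}"
    count="${2:-10}"
    # -s: one total per argument; -k: sizes in 1 KiB blocks.
    du -sk "$root"/*/ 2>/dev/null | sort -rn | head -n "$count"
}

# Example: log_usage_by_pod /var/log/pods 5
```

A single pod directory that dwarfs the others usually points to the long-running, chatty container described above.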
To figure out what the exact issue is, you need to find the details of which files are taking up the greatest amount of space.
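Before digging into files, it is worth confirming which nodes actually report the condition. The sketch below is one way to do that with kubectl's jsonpath output; the function names are hypothetical, and it assumes you have kubectl access to the cluster.

```shell
#!/bin/sh
# Read "NODE STATUS" pairs on stdin and print the nodes whose status is "True".
nodes_with_disk_pressure() {
    awk '$2 == "True" { print $1 }'
}

# Print each node name followed by the status of its DiskPressure condition.
# Requires kubectl access to the cluster.
list_disk_pressure_status() {
    kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}{end}'
}

# Example: list_disk_pressure_status | nodes_with_disk_pressure
```

Any node name printed by the example pipeline is one where the kubelet currently reports disk pressure.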
Troubleshooting node disk pressure
To troubleshoot the issue of node disk pressure, you need to figure out what files are taking up the most space. Since Kubernetes nodes typically run Linux, this is easily done by running the du command. You can either manually SSH into each Kubernetes node, or use a DaemonSet.
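If you go the SSH route, a minimal sketch of the du approach might look like the following. The function name and the example path are illustrative assumptions; point it at whichever directory you suspect.

```shell
#!/bin/sh
# Print the N largest files and directories under a path, largest first.
top_disk_usage() {
    dir="$1"
    count="${2:-20}"
    # -a: report files as well as directories; -k: sizes in 1 KiB blocks.
    # Errors such as unreadable directories are silenced to keep the list clean.
    du -ak "$dir" 2>/dev/null | sort -rn | head -n "$count"
}

# Example: top_disk_usage /var/lib 20
```

Note that directories appear alongside their contents, so the first entries are usually parent directories; the interesting lines are the first files or leaf directories further down.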
Deploying and understanding the DaemonSet
To get the DaemonSet deployed, you can either use the GitHub Gist for the DaemonSet directly, or you can create a file with the following contents:
Now you can run the following:
Before utilizing the DaemonSet for troubleshooting, it’s important to understand what is happening inside it. If you take a look at the manifest file above, you’ll notice that it is, in fact, very simple. A lot of it is boilerplate, but the important thing to note is the args field. This is where the du command is set to run, and then print the top twenty results. Further down, you can also see that the host volume is mounted into the container at the path
Utilizing the DaemonSet
First, you need to make sure that the DaemonSet is properly deployed, which you can do by running kubectl get pods -l app=disk-checker. This should produce an output like the following:
$ kubectl get pods -l app=disk-checker
The number of pods you see here will depend on how many nodes are running inside your cluster. Once you’ve verified that the pods are running, you can start looking at their logs by executing kubectl logs -l app=disk-checker. This may take some time, but in the end, you should see a list of files and their sizes, which will give you greater insight into what is taking up space on your nodes. What you do next depends on which files are taking up space: look through the output from the DaemonSet and work out whether it’s log files, application files, or something else that’s using your disk space.
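One caveat when reading logs by label: when a selector is given, kubectl logs shows only the last 10 lines per pod by default, which can cut off a longer listing. The sketch below wraps the command with two standard kubectl flags to avoid that; the function name is hypothetical.

```shell
#!/bin/sh
# Print the complete log of every disk-checker pod, prefixing each line with
# the pod it came from so results from different nodes stay distinguishable.
disk_checker_logs() {
    # --tail=-1: show all lines; with a label selector, kubectl otherwise
    #            shows only the last 10 lines per pod.
    # --prefix:  prepend the pod and container name to every line.
    kubectl logs -l app=disk-checker --tail=-1 --prefix
}

# Example: disk_checker_logs | less
```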
You should always analyze the output from the DaemonSet before deciding how to tackle the issue, but in most cases one of two solutions will apply.
You may figure out that the issue is caused by necessary application data, making it impossible to delete the files. In this case, you will have to increase the size of the node disks to ensure that there’s sufficient room for the application files. This is an easy step to implement, but it increases the cost of running your cluster. Therefore, a better first step is to look at how the application is structured, and see if you can find ways to make it less reliant on application files, thereby decreasing the overall need for disk usage.
On the other hand, you may find that you have applications that have produced a lot of files that are no longer needed. In this case, the fix is as simple as deleting the unnecessary files. Depending on how your application is set up in terms of availability, you may be able to just restart the pod, leading Kubernetes to automatically clean up any files from the container. Note that this will only be done if you are using ephemeral volumes, not if you are using persistent volumes. The Kubernetes documentation on volumes covers the difference in more detail.
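If the application is managed by a Deployment, one way to restart its pods without deleting them by hand is a rolling restart. This is a sketch under that assumption; the names my-app and default are hypothetical placeholders, and the function name is illustrative.

```shell
#!/bin/sh
# Replace all pods of a Deployment so their containers, and any ephemeral
# volumes tied to the old pods, are recreated from scratch.
# "my-app" and "default" are hypothetical placeholders.
restart_deployment() {
    name="${1:-my-app}"
    namespace="${2:-default}"
    kubectl rollout restart "deployment/$name" -n "$namespace"
    # Wait until the replacement pods are ready before moving on.
    kubectl rollout status "deployment/$name" -n "$namespace"
}

# Example: restart_deployment my-app default
```

Because the restart is rolling, pods are replaced gradually rather than all at once, which matters if the application has availability requirements.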
By now you should know more about what it means when you experience an issue with node disk pressure, and what your first suspects should be when you run into it: a failure in image garbage collection or a buildup of log files.
You will likely have to either increase the size of the disks in your cluster, or clean up unused files. No matter the issue and solution, you now have a better understanding of how to move forward when you run into this issue.