OOMKilled: troubleshooting Kubernetes memory requests and limits

Oct 23, 2021

If you’ve been working with Kubernetes for any period of time, you’ve probably come across the OOMKilled error. It can be a frustrating error to debug if you don’t understand how it works. In this article, we’ll take a closer look at the OOMKilled error, why it occurs, how to troubleshoot it when it happens, and what steps you can take to help prevent it.

Memory in Kubernetes

Let’s begin by understanding how Kubernetes thinks about memory allocation. When the scheduler is trying to decide how to place pods in the Kubernetes cluster, it looks at the capacity for each node.

You should note that a node with 8 GB of memory won’t necessarily have 8 GB available to run pods. Kubernetes tries to determine how much of the 8 GB the node needs for normal operation and how much is left over to run pods.

You can see a breakdown of allocatable resources by taking a look at the YAML for a node: kubectl get node my_node -o yaml. You should see something like this:

Allocatable:
  attachable-volumes-aws-ebs: 25
  cpu: 3920m
  ephemeral-storage: 95551679124
  hugepages-1Gi: 0
  hugepages-2Mi: 0
  memory: 15104488Ki
  pods: 58
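To make the memory figure above easier to read, here is a minimal Python sketch that converts a Kubernetes-style memory quantity into bytes. The helper name parse_memory and the suffix table are our own illustration and cover only a common subset of suffixes, not everything Kubernetes accepts.

```python
# Common binary (power-of-two) and decimal suffixes for memory quantities.
# This is a subset chosen for illustration.
SUFFIXES = {
    "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4,
    "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '15104488Ki' to bytes."""
    # Try two-letter suffixes before one-letter ones.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * SUFFIXES[suffix])
    return int(quantity)  # no suffix: the value is already in bytes


allocatable = parse_memory("15104488Ki")
print(allocatable)  # 15466995712
print(f"{allocatable / 1024**3:.2f} GiB")
```

For the node shown above, that works out to roughly 14.4 GiB of memory that can be handed out to pods.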


Based on these allocatable resources, the scheduler decides which pods to run where, and it tries to make sure that none of the nodes in the cluster end up running more pods than they can handle.

When you define a container, you can set two different variables for memory. Whether or not you set these variables and what you set them to can have huge repercussions for your pod.

The first is the request. This tells Kubernetes that this particular container needs, at minimum, this much memory. The scheduler will only place the pod on a node that still has at least this much unreserved allocatable memory. If you set neither a request nor a limit, Kubernetes assumes by default that the container needs no memory, so it won’t guarantee that your pod will be placed on a node with enough memory. (If you set only a limit, Kubernetes uses the limit as the request.)

The next value you can set is the limit, which is the maximum. The container won’t always need this much memory, but if it asks for more, it can’t go above the limit. Limits can be tricky: when Kubernetes places a pod, it only checks the request, not the limit.

We’ll take a look at how this can contribute to an OOMKilled error in a bit.
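To make the difference between the two values concrete, here is a small Python sketch of a placement-style check. The container dictionary mirrors the resources.requests and resources.limits fields of a container spec, but the container itself, the numbers, and the function names are hypothetical. The real scheduler considers far more than memory, so treat this as a simplification.

```python
MIB = 1024 * 1024

# Hypothetical container: requests 256 MiB, may grow up to 512 MiB.
container = {
    "name": "app",
    "resources": {
        "requests": {"memory": 256 * MIB},
        "limits": {"memory": 512 * MIB},
    },
}


def requested_memory(container: dict) -> int:
    """Return the memory request in bytes, or 0 when no request is set."""
    return container.get("resources", {}).get("requests", {}).get("memory", 0)


def fits_on_node(container: dict, allocatable: int, already_requested: int) -> bool:
    """Compare only requests against the node's allocatable memory.

    The limit is deliberately ignored here, which is the point of the sketch.
    """
    return already_requested + requested_memory(container) <= allocatable


node_allocatable = 1024 * MIB
print(fits_on_node(container, node_allocatable, already_requested=700 * MIB))
print(fits_on_node(container, node_allocatable, already_requested=800 * MIB))
```

Notice that the 512 MiB limit never enters the check. A pod can be placed on a node that could not actually hold it if it grew all the way to its limit.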


The request and the limit are important because they play a big role in how Kubernetes decides which pods to kill when it needs to free up resources. Roughly speaking, pods are killed in this order, with the most likely candidates first:

  • Pods with neither a limit nor a request set
  • Pods with no set limit
  • Pods that are over their memory request but under their limit
  • Pods using less than their requested memory
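The ordering above loosely corresponds to the quality of service (QoS) classes Kubernetes assigns to pods: BestEffort, Burstable, and Guaranteed. Here is a simplified Python sketch of how a pod’s class follows from its containers’ requests and limits. The function name and the input shape (a list of per-container dictionaries) are our own assumptions for illustration.

```python
def qos_class(containers: list) -> str:
    """Classify a pod as BestEffort, Burstable, or Guaranteed (simplified)."""
    any_set = False
    all_guaranteed = True
    for c in containers:
        requests = c.get("requests", {})
        limits = c.get("limits", {})
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # A missing request falls back to the limit.
            request = requests.get(resource, limit)
            # Guaranteed needs a limit on both resources, equal to the request.
            if limit is None or request != limit:
                all_guaranteed = False
    if not any_set:
        return "BestEffort"  # nothing set at all: first in line to be killed
    return "Guaranteed" if all_guaranteed else "Burstable"


print(qos_class([{}]))
print(qos_class([{"requests": {"memory": "256Mi"}}]))
print(qos_class([{
    "requests": {"cpu": "500m", "memory": "256Mi"},
    "limits": {"cpu": "500m", "memory": "256Mi"},
}]))
```

The three calls print BestEffort, Burstable, and Guaranteed in that order. Pods in the Guaranteed class are the last candidates when memory gets tight.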

So what is OOMKilled?

OOMKilled is an error that actually has its origins in Linux. The Linux kernel includes a mechanism called the OOM Killer (Out of Memory Killer), and it tracks memory usage per process. If the system is in danger of running out of available memory, the OOM Killer will come in and start killing processes to try to free up memory and prevent a crash. The goal of the OOM Killer is to free up as much memory as possible by killing the fewest processes.

Under the hood, the OOM Killer assigns each running process a score (oom_score). The higher the score, the more likely the process is to be killed. The method it uses to calculate this score is beyond this tutorial, but it’s good to know that Kubernetes takes advantage of the score to help make decisions about which pods to kill.
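One way Kubernetes influences the score is through the per-process oom_score_adj value, which the kubelet sets for containers based on the pod’s QoS class. The Python sketch below follows the values described in the Kubernetes documentation on node out-of-memory behavior as we understand them. Treat the constants as an assumption that may vary between versions; the function name is our own.

```python
def oom_score_adj(qos: str, memory_request_bytes: int, node_capacity_bytes: int) -> int:
    """Approximate the oom_score_adj assigned to a container's processes."""
    if qos == "Guaranteed":
        return -997  # very unlikely to be picked by the OOM Killer
    if qos == "BestEffort":
        return 1000  # first in line
    # Burstable: the larger the request relative to the node, the lower the score.
    adj = 1000 - (1000 * memory_request_bytes) // node_capacity_bytes
    return min(max(2, adj), 999)


GIB = 1024**3
print(oom_score_adj("Burstable", 1 * GIB, 8 * GIB))  # 875
```

A container that requests 1 GiB on an 8 GiB node ends up with an adjustment of 875, so it is more likely to be killed than a Guaranteed container and less likely than a BestEffort one.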

The kubelet running on each node monitors memory consumption. If memory on the node becomes scarce, the kubelet will start killing pods. Essentially, the idea is to preserve the health of the node so that the rest of the pods running on it don’t fail. The needs of the many outweigh the needs of the few, and the few get murdered.

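There are two main flavors of the OOMKilled error you’ll see in Kubernetes. The names below are descriptive labels; the reason Kubernetes reports in the container status is simply OOMKilled: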

  • OOMKilled: Limit Overcommit
  • OOMKilled: Container Limit Reached

Let’s take a look at each one.

OOMKilled because of limit overcommit

Remember that limit variable we talked about? Here is where it can get you into trouble.

The OOMKilled: Limit Overcommit error can occur when the sum of pod limits is greater than the available memory on the node. For example, if you have a node with 8 GB of available memory, the scheduler might place eight pods on it that each request 1 GB of memory. However, if even one of those pods is configured with a limit of, say, 1.5 GB, you run the risk of running out of memory. All it would take is for that one pod to have a spike in traffic or an unknown memory leak, and Kubernetes will be forced to start killing pods.
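The arithmetic behind that example can be sketched in a few lines of Python. The function name, field names, and the assumption that the other seven pods have a limit equal to their request are ours, chosen only to illustrate the scenario.

```python
def overcommit_report(allocatable_mib: int, pods: list) -> dict:
    """Compare summed requests and summed limits against a node's memory."""
    total_requests = sum(p["request_mib"] for p in pods)
    total_limits = sum(p["limit_mib"] for p in pods)
    return {
        # Placement looks at requests, so this is what gets the pods scheduled.
        "requests_fit": total_requests <= allocatable_mib,
        # How far the pods could exceed the node if all of them hit their limits.
        "limit_overcommit_mib": max(0, total_limits - allocatable_mib),
    }


# Seven pods with request == limit == 1 GiB, plus one pod with a 1.5 GiB limit.
pods = [{"request_mib": 1024, "limit_mib": 1024} for _ in range(7)]
pods.append({"request_mib": 1024, "limit_mib": 1536})

print(overcommit_report(8192, pods))
# {'requests_fit': True, 'limit_overcommit_mib': 512}
```

All eight pods fit by request, yet the node is overcommitted by 512 MiB if every pod grows to its limit at once.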

You might also want to check the host itself and see if there are any processes running outside of Kubernetes that could be eating up memory, leaving less for the pods.

OOMKilled because of container limit reached

While the Limit Overcommit error is related to the total amount of memory on the node, Container Limit Reached is usually confined to a single pod. When Kubernetes detects a container in a pod using more memory than its set limit, it will kill that container with the error OOMKilled: Container Limit Reached.

When this happens, check the application logs to try to understand why the pod was using more memory than the set limit. It could be for a number of reasons, such as a spike in traffic or a long-running Kubernetes job that caused it to use more memory than usual.
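Before digging through logs, it helps to know which containers were actually killed. Here is a minimal Python sketch that scans pod data in the shape produced by kubectl get pods -o json and lists containers whose current or previous state was terminated with the reason OOMKilled. The function name and the sample pod names are hypothetical.

```python
def find_oom_killed(pod_list: dict) -> list:
    """Return (pod, container, exit code) for containers terminated by OOMKilled."""
    hits = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            # Check the current state first, then the previous one.
            for key in ("state", "lastState"):
                terminated = status.get(key, {}).get("terminated")
                if terminated and terminated.get("reason") == "OOMKilled":
                    hits.append((pod_name, status["name"], terminated.get("exitCode")))
                    break  # report each container at most once
    return hits


# Hypothetical sample data: one restarted container, one healthy container.
sample = {
    "items": [
        {
            "metadata": {"name": "api-7d9f"},
            "status": {"containerStatuses": [
                {"name": "api", "state": {"running": {}},
                 "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}},
            ]},
        },
        {
            "metadata": {"name": "worker-x2k1"},
            "status": {"containerStatuses": [
                {"name": "worker", "state": {"running": {}}, "lastState": {}},
            ]},
        },
    ]
}

print(find_oom_killed(sample))  # [('api-7d9f', 'api', 137)]
```

To run it against a real cluster, you could load the kubectl output with json.load(sys.stdin) and pass the result to the function. An exit code of 137 means the process received SIGKILL, which is what you would expect from an out-of-memory kill.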

If during your investigation you find that the application is running as expected and that it just requires more memory to run, you might consider increasing the values for request and limit.

Conclusion

In this article, we took a closer look at the Kubernetes OOMKilled error, an error that has its origins in Linux. Together with memory requests and limits, it shapes how Kubernetes schedules pods and decides which pods to kill when resources are running low. Don’t forget to consider the two flavors of the OOMKilled error, Container Limit Reached and Limit Overcommit. Understanding them both can go a long way toward successful troubleshooting and help you run into the error less often in the future.

To track errors efficiently and ensure they are solved as quickly as possible, try using a powerful internal tooling platform like Airplane. You can use Airplane Tasks to build multi-step workflows that can help you debug issues, or use Views (a React-based UI platform) to build monitoring dashboards that help you see, troubleshoot, and resolve errors as quickly as possible.

To build your first workflow or dashboard that can help you monitor and resolve errors, sign up for a free account or book a demo.

Josh Alletto
Josh Alletto is a developer, writer, and educator with over ten years of experience teaching in classrooms. He is currently a Senior Site Reliability Engineer (SRE) at Lessonly.