Kubernetes liveness probes determine whether your pods are running normally. Setting up these probes helps you check whether your workloads are healthy. They can identify application instances that have entered a failed state, even when the pod that contains the instance appears to be operational.
Kubernetes automatically monitors your pods and restarts them when failures are detected. This handles issues where your application crashes, terminating its process and emitting a non-zero exit code. Not all issues exhibit this behavior, though. Your application could have lost its database connection, or be experiencing timeouts while communicating with a third-party service. In these situations, the pod will look from the outside like it’s running, but users won’t be able to access the application within.
Liveness probes are a mechanism for indicating your application’s internal health to the Kubernetes control plane. Kubernetes uses liveness probes to detect issues within your pods. When a liveness check fails, Kubernetes restarts the container in an attempt to restore your service to an operational state.
In this article, we’ll explore when liveness probes should be used, how you can create them, and some best practices to be aware of as you add probes to your cluster.
Why do liveness probes matter?
Liveness probes enhance Kubernetes’ ability to manage workloads on your behalf. Without probes, you need to manually monitor your pods to distinguish which application instances are healthy and which are not. This becomes time-consuming and error-prone when you’re working with hundreds or thousands of pods.
Allowing unhealthy pods to continue without detection degrades your service’s stability over time. Pods that are silently failing as they age, perhaps due to race conditions, deadlocks, or corrupted caches, will gradually reduce your service’s capacity to handle new requests. Eventually, your entire pod fleet could be affected, even though all the containers report as running.
Debugging this kind of issue is often confusing and inefficient. Since all your dashboards show your pods as operational, it’s easy to head down the wrong diagnostic path. Using liveness probes to communicate information about pods’ internal states to Kubernetes lets your cluster handle problems for you, reducing the maintenance burden and helping to keep serviceable pods available.
Types of liveness probes
There are four basic types of liveness probes:
- Exec: The probe runs a command inside the container. The probe is considered successful if the command terminates with 0 as its exit code.
- HTTP: The probe makes an HTTP GET request against a URL in the container. The probe is successful when the container’s response has an HTTP status code in the 200-399 range.
- TCP: The probe tries to connect to a specific TCP port inside the container; if the port is open, the probe is deemed successful.
- gRPC: The probe calls the container’s gRPC health-checking service, for applications that use gRPC. This probe type was introduced as an alpha feature in Kubernetes v1.23, has been enabled by default since v1.24, and became stable in v1.27.
The probe types share five basic parameters for configuring the frequency and success criteria of the checks:
- initialDelaySeconds: Sets a delay between the time the container starts and the first time the probe is executed. Defaults to zero seconds.
- periodSeconds: Defines how frequently the probe will be executed after the initial delay. Defaults to ten seconds.
- timeoutSeconds: Each probe will time out and be marked as failed after this many seconds. Defaults to one second.
- failureThreshold: The number of consecutive failures Kubernetes tolerates before it gives up on the container. The container is only restarted once the probe has failed this many times in a row. Defaults to three.
- successThreshold: The number of consecutive successful checks required before a container that previously failed is deemed healthy again. Defaults to one, and must be one for liveness probes.
Successful liveness probes have no impact on your cluster. The targeted container will keep running, and a new probe will be scheduled to run after the configured periodSeconds delay. Once a probe has failed failureThreshold times in a row, Kubernetes restarts the container, as it’s expected that the fresh instance will be healthy.
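As a rough illustration of where these parameters sit, the fragment below spells out all five on an exec probe, each set to its default value. The cat /tmp/healthy command is a hypothetical placeholder and is not taken from this article’s examples.

```yaml
# Sketch only: every timing parameter is written out with its default value.
livenessProbe:
  exec:
    command: ["cat", "/tmp/healthy"]   # hypothetical check command
  initialDelaySeconds: 0   # run the first probe as soon as the container starts
  periodSeconds: 10        # probe every ten seconds
  timeoutSeconds: 1        # a probe that takes longer than one second counts as failed
  failureThreshold: 3      # restart after three consecutive failures
  successThreshold: 1      # must be 1 for liveness probes
```

Omitting any of these fields gives the same behavior, so in practice you would only write out the ones you change.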
Creating liveness probes
Liveness probes are defined by a pod’s spec.containers.livenessProbe field. Here’s a simple example of an exec (command) type liveness probe:
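A manifest along the following lines fits the walkthrough in this section. Treat it as a sketch: the liveness-probe-demo name and the busybox:latest image match the event output shown later, while the exact shell command and the trailing sleep are assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-probe-demo
spec:
  containers:
    - name: liveness-probe-demo
      image: busybox:latest
      args:
        - /bin/sh
        - -c
        # Create the file, remove it after 30 seconds, then keep the container alive.
        # The final sleep is an assumption so the process doesn't exit on its own.
        - touch /healthcheck; sleep 30; rm -f /healthcheck; sleep 3600
      livenessProbe:
        exec:
          command:
            - cat
            - /healthcheck
        initialDelaySeconds: 5   # wait five seconds before the first probe
        periodSeconds: 15        # then probe every fifteen seconds
```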
This pod’s container has a liveness probe with an initial delay of five seconds, which then reads the content of the /healthcheck file every fifteen seconds. The container is configured to create the /healthcheck file when it starts up; it then removes the file after thirty seconds have elapsed. The liveness probe’s cat command will begin to return non-zero exit codes at this point, causing subsequent probes to be marked as failed.
Apply the YAML manifest to your cluster by passing the path of the file you saved it in to kubectl apply -f.
Now inspect the events of the pod you’ve created:
$ kubectl describe pod liveness-probe-demo
Type    Reason     Age  From               Message
----    ------     ---  ----               -------
Normal  Scheduled  30s  default-scheduler  Successfully assigned liveness-probe-demo...
Normal  Pulling    29s  kubelet            Pulling image "busybox:latest"
Normal  Pulled     28s  kubelet            Successfully pulled image "busybox:latest" in 1.1243596453s
Normal  Created    28s  kubelet            Created container liveness-probe-demo
Normal  Started    28s  kubelet            Started container liveness-probe-demo
Everything looks good! The container was created and started successfully. There’s no sign of any failed liveness probes.
Now wait for thirty seconds before retrieving the events again:
$ kubectl describe pod liveness-probe-demo
Type     Reason     Age  From               Message
----     ------     ---  ----               -------
Normal   Scheduled  70s  default-scheduler  Successfully assigned liveness-probe-demo...
Normal   Pulling    69s  kubelet            Pulling image "busybox:latest"
Normal   Pulled     68s  kubelet            Successfully pulled image "busybox:latest" in 1.1243596453s
Normal   Created    68s  kubelet            Created container liveness-probe-demo
Normal   Started    68s  kubelet            Started container liveness-probe-demo
Warning  Unhealthy  10s  kubelet            Liveness probe failed: cat: can't open '/healthcheck': No such file or directory
Normal   Killing    10s  kubelet            Container liveness-probe-demo failed liveness probe, will be restarted
The event log now shows that the liveness probe began to fail after the container deleted its /healthcheck file. The event reveals the output from the liveness probe’s command. If you used a different probe type, such as HTTP or TCP, you’d see relevant information such as the HTTP status code instead.
HTTP probes are created in a similar manner to exec commands. Nest an httpGet field instead of exec in your livenessProbe configuration.
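An HTTP probe matching the description that follows could be sketched like this. The pod name is hypothetical, and the registry.k8s.io/liveness image reference with its /server argument is an assumption about which example image is meant; the path, port, and period come from the text.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-probe-http-demo   # hypothetical name
spec:
  containers:
    - name: liveness-probe-http-demo
      image: registry.k8s.io/liveness   # assumed reference for the Kubernetes example server
      args:
        - /server
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 15   # probe every fifteen seconds
```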
This probe sends an HTTP GET request to /healthz on the container’s port 8080 every fifteen seconds. The image used is a minimal HTTP server provided by Kubernetes as an example liveness check provider. The server issues a successful response with a 200 status code for the first ten seconds of its life. After that point, it will return a 500, failing the liveness probe and causing the container to restart.
The livenessProbe.httpGet field supports optional host, scheme, and httpHeaders fields to customize the request that’s made. The host defaults to the pod’s internal IP address; the scheme defaults to HTTP. The following snippet sets up a probe to make an HTTP request with a custom header:
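One way such a probe might look is shown below. The header name and value are hypothetical placeholders chosen for illustration; the path and port reuse the earlier HTTP example.

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
    httpHeaders:
      # Hypothetical header; replace with whatever your health endpoint expects.
      - name: X-Health-Check
        value: liveness
  periodSeconds: 15
```

Each entry under httpHeaders is a name and value pair, so several headers can be listed if the endpoint needs them.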
TCP probes try to open a socket to your container on a specified port. Add a tcpSocket.port field to your livenessProbe configuration to use this probe type:
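A minimal sketch of a TCP probe follows. The port number and the period are assumptions picked for illustration; use the port your application actually listens on.

```yaml
livenessProbe:
  tcpSocket:
    port: 8080        # assumed port
  periodSeconds: 15   # assumed period, matching the earlier examples
```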
The probe will be considered failed if the socket can’t be opened.
gRPC probes are the newest type of probe. The implementation is similar to the grpc-health-probe utility, which was commonly used before Kubernetes integrated the functionality.
To use a gRPC probe, ensure you’re on Kubernetes v1.23 or later. On v1.23, the GRPCContainerProbe feature gate must be enabled manually; from v1.24 onward it’s on by default. Add a grpc.port field to your pod’s livenessProbe to define where health checks should be directed to:
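The manifest below is a sketch of a gRPC probe against an etcd container on port 2379, as described next. The pod name, the image tag, the etcd command-line flags, and the initial delay are assumptions made for illustration.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: liveness-probe-grpc-demo   # hypothetical name
spec:
  containers:
    - name: etcd
      image: registry.k8s.io/etcd:3.5.1-0   # assumed image tag
      command:
        - /usr/local/bin/etcd
        - --data-dir
        - /var/lib/etcd
        - --listen-client-urls
        - http://0.0.0.0:2379
        - --advertise-client-urls
        - http://127.0.0.1:2379
      ports:
        - containerPort: 2379
      livenessProbe:
        grpc:
          port: 2379              # where the gRPC health check is sent
        initialDelaySeconds: 10   # assumed delay to let etcd start
```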
The etcd container image is used here as an example of a gRPC-compatible service. Kubernetes will send gRPC health check requests to port 2379 in the container. The liveness probe will be marked as failed when the container issues an unhealthy response. The probe is also considered failed if the service doesn’t implement the gRPC health checking protocol.
Best practices for effective probes
Liveness probes have some pitfalls that you need to watch out for. Foremost among these is the impact misconfigured probes can have on your application. A probe that’s run too frequently wastes resources and impedes performance; conversely, probing too infrequently can let containers sit in an unhealthy state for too long.
The periodSeconds, timeoutSeconds, and success and failure threshold parameters should be used to tune your probes to your application. Pay attention to how long your probe’s command, API request, or gRPC call takes to complete. Use this value with a small buffer period as your timeoutSeconds. The periodSeconds value needs to be bespoke to your environment; a good rule of thumb is to use the smallest value possible for simple, short-running probes. More intensive commands may need to wait longer between repetitions.
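To make the tuning advice concrete, suppose you have observed that your health endpoint usually responds in about two seconds. That measurement, the /healthz path, and the port are all hypothetical; the fragment only sketches how the numbers could be derived from it.

```yaml
livenessProbe:
  httpGet:
    path: /healthz    # hypothetical endpoint
    port: 8080        # hypothetical port
  timeoutSeconds: 3   # about 2s observed response time plus a 1s buffer
  periodSeconds: 10   # a cheap check, so it can repeat fairly often
  failureThreshold: 3 # roughly 30 seconds of consecutive failures before a restart
```

A slower or more expensive check would call for a larger timeoutSeconds and a longer periodSeconds, at the cost of detecting failures later.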
Probes themselves should be as lightweight as possible. To ensure that your checks can execute quickly and efficiently, avoid using expensive operations within your probes. The target of your probe’s command or HTTP request should be independent of your main application, so it can run to completion even during failure conditions. A probe that’s served by your standard application entry point could lead to inaccurate results if its framework fails to start or a required external dependency is unavailable.
Here are a few other best practices to keep in mind:
- Probes are affected by restart policies. Container restart policies are applied after probes. This means your containers need restartPolicy: Always (the default) or restartPolicy: OnFailure so Kubernetes can restart them after a failed probe. Using the Never policy will keep the container in the failed state.
- Probes should be consistent in their execution. You should be able to approximate the execution time of your probes, so you can configure their period, delay, and timeout correctly. Observe your real-world workloads instead of using the defaults that Kubernetes provides.
- Not every container needs a probe. Simple containers that always terminate on failure don’t need a probe. You may also omit probes from low-priority services, where the command would need to be relatively expensive to accurately determine healthiness.
- Revisit your probes regularly. New features, optimizations, and regressions in your app can all impact probe performance and what constitutes a “healthy” state. Set a reminder to regularly check your probes and make necessary adjustments.
Other types of probes
Liveness probes aren’t your only option for disclosing a pod’s internal status to Kubernetes. Liveness probes exclusively focus on the ongoing health of your application; two other probes are better suited for detecting problems early in a pod’s lifecycle.
Readiness probes determine when new containers are able to receive traffic. Pods with one of these probes won’t receive traffic from services until the probe succeeds. You can use this mechanism to prevent a new container from handling user requests while its bootstrap scripts are running.
Startup probes are the final type of probe. They indicate if a container’s application has finished launching. When a container has this type of probe, its liveness and readiness probes won’t be executed until the startup probe has succeeded. It’s a way to avoid continual container restarts due to probes that fail because the application’s not ready to handle them.
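The three probe types can be combined on a single container. The fragment below is a sketch only: the container name, image, endpoints, port, and all timing values are hypothetical.

```yaml
containers:
  - name: app                  # hypothetical container
    image: example/app:1.0     # hypothetical image
    startupProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30     # allows up to 30 x 10s = 300s for startup
    readinessProbe:
      httpGet:
        path: /ready
        port: 8080
      periodSeconds: 5         # controls whether the pod receives traffic
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 15        # only begins once the startup probe has succeeded
```

Giving the startup probe a generous failureThreshold lets a slow-starting application finish launching without loosening the liveness probe that guards it afterwards.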
Liveness probes are a Kubernetes mechanism for exposing whether the applications inside your containers are healthy. They address the disconnect between Kubernetes’ perception of your pods and the reality of whether users can actually access your service.
In this article, you’ve learned why you should use liveness probes, the types of probes that are available, and how you can start attaching them to your pods. We’ve also discussed some of the config tweaks necessary to prevent probes from becoming problems themselves.
If you're looking for an easy way to monitor your Kubernetes pods, then consider using Airplane. Airplane is the developer platform for building internal tools. With Airplane, you can build custom UIs and dashboards that make monitoring unhealthy pods easy to do.