If an eviction threshold is met and the grace period has passed, the node initiates the process of evicting pods until the signal goes below
the defined threshold.
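As a rough illustration of that check, the Go sketch below triggers eviction only when a signal has stayed past its threshold for longer than its grace period. The type, field names, and example signal are hypothetical and are not taken from the kubelet source.

```go
package main

import (
	"fmt"
	"time"
)

// softThreshold is a hypothetical soft eviction threshold with a grace period,
// loosely modeled on the behavior described above.
type softThreshold struct {
	signal      string        // e.g. "memory.available"
	threshold   int64         // signal value (in bytes) at or below which the threshold is met
	gracePeriod time.Duration // how long the threshold must stay met before evicting
}

// shouldEvict reports whether eviction should start: the signal must be at or
// below the threshold, and must have been there for at least the grace period.
// metSince is the time at which the threshold was first observed to be met.
func shouldEvict(t softThreshold, observed int64, metSince, now time.Time) bool {
	if observed > t.threshold {
		return false // signal is healthy; nothing to do
	}
	return now.Sub(metSince) >= t.gracePeriod
}

func main() {
	t := softThreshold{signal: "memory.available", threshold: 500 << 20, gracePeriod: time.Minute}
	metSince := time.Now().Add(-2 * time.Minute)
	// 200Mi available is below the 500Mi threshold and has been for two minutes: evict.
	fmt.Println(shouldEvict(t, 200<<20, metSince, time.Now())) // true
}
```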
The node ranks pods for eviction by their quality of service, and, among those with the same quality of service, by the consumption of the starved compute resource relative to the pod’s scheduling request.
The following table lists each quality of service (QoS) level and describes how pods at that level are ranked for eviction.
Table 2. Quality of Service Levels

| Quality of Service | Description |
| --- | --- |
| `Guaranteed` | Pods that consume the highest amount of the starved resource relative to their request are failed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource. |
| `Burstable` | Pods that consume the highest amount of the starved resource relative to their request for that resource are failed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource. |
| `BestEffort` | Pods that consume the highest amount of the starved resource are failed first. |
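The ranking rule above can be sketched briefly in Go. This is only an illustration, not the kubelet's eviction code: the pod fields, the numeric QoS encoding, and the usage-minus-request comparison are assumptions made for the example.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a simplified, hypothetical view of a pod for eviction ranking:
// its QoS class plus usage and request of the starved resource, in bytes.
type pod struct {
	name    string
	qos     int   // 0 = BestEffort, 1 = Burstable, 2 = Guaranteed
	usage   int64 // current consumption of the starved resource
	request int64 // scheduling request for the starved resource
}

// rankForEviction orders pods so the best eviction candidate comes first:
// lower QoS classes first, and within a class, the pod consuming the most
// of the starved resource above its request.
func rankForEviction(pods []pod) {
	sort.Slice(pods, func(i, j int) bool {
		if pods[i].qos != pods[j].qos {
			return pods[i].qos < pods[j].qos // BestEffort before Burstable before Guaranteed
		}
		// Within a QoS class, larger consumption relative to the request ranks first.
		return pods[i].usage-pods[i].request > pods[j].usage-pods[j].request
	})
}

func main() {
	pods := []pod{
		{name: "batch", qos: 0, usage: 300 << 20},
		{name: "web", qos: 1, usage: 600 << 20, request: 256 << 20},
		{name: "db", qos: 2, usage: 900 << 20, request: 1 << 30},
	}
	rankForEviction(pods)
	fmt.Println(pods[0].name) // "batch": the BestEffort pod is the first eviction candidate
}
```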
A `Guaranteed` pod will never be evicted because of another pod's resource consumption unless a system daemon (such as node, docker, or journald) is consuming more resources than were reserved through the system-reserved or kube-reserved allocations, or unless only `Guaranteed` pods remain on the node.
If only `Guaranteed` pods remain, the node evicts the `Guaranteed` pod that least impacts node stability, limiting the effect of the unexpected consumption on the other `Guaranteed` pods.
Local disk is a `BestEffort` resource. If necessary, the node evicts pods one at a time to reclaim disk when `DiskPressure` is encountered. The node ranks pods by quality of service. If the node is responding to inode starvation, it reclaims inodes by evicting the pods with the lowest quality of service first.
If the node is responding to a lack of available disk, it ranks the pods within a quality of service level by how much local disk they consume and evicts the largest consumers first, as sketched below.
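Under `DiskPressure`, the only difference from the ranking described earlier is which usage figure feeds the comparison. The sketch below, with hypothetical type and field names, picks that figure based on whether the node is starved for inodes or for disk space.

```go
package main

import "fmt"

// diskUsage holds hypothetical per-pod local disk statistics.
type diskUsage struct {
	bytesUsed  int64 // local disk space consumed by the pod
	inodesUsed int64 // inodes consumed by the pod
}

// evictionMetric returns the figure used to rank pods within a QoS level under
// DiskPressure: inode counts when inodes are starved, disk bytes otherwise.
func evictionMetric(u diskUsage, inodeStarvation bool) int64 {
	if inodeStarvation {
		return u.inodesUsed
	}
	return u.bytesUsed
}

func main() {
	u := diskUsage{bytesUsed: 2 << 30, inodesUsed: 120000}
	fmt.Println(evictionMetric(u, true))  // 120000: rank by inodes under inode starvation
	fmt.Println(evictionMetric(u, false)) // 2147483648: rank by disk bytes otherwise
}
```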
Understanding Quality of Service and Out of Memory Killer
If the node experiences a system out of memory (OOM) event before it is able to reclaim memory, the node depends on the OOM killer to respond.
The node sets an `oom_score_adj` value for each container based on the quality of service of the pod.
Table 3. Quality of Service Levels

| Quality of Service | `oom_score_adj` Value |
| --- | --- |
| `Guaranteed` | -998 |
| `Burstable` | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
| `BestEffort` | 1000 |
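For a `Burstable` pod, the formula above scales the adjustment with how small the memory request is relative to the machine's capacity: large requests push the value toward 2, tiny requests toward 999. The following is a minimal sketch of that arithmetic; the function name and the example figures are illustrative, not part of the kubelet API.

```go
package main

import "fmt"

// burstableOOMScoreAdj applies the Burstable formula from the table above:
// min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999).
func burstableOOMScoreAdj(memoryRequestBytes, machineMemoryCapacityBytes int64) int64 {
	adj := 1000 - (1000*memoryRequestBytes)/machineMemoryCapacityBytes
	if adj < 2 {
		adj = 2
	}
	if adj > 999 {
		adj = 999
	}
	return adj
}

func main() {
	// Hypothetical example: a 1 GiB memory request on a 16 GiB node.
	// 1000 - (1000 * 1GiB) / 16GiB = 1000 - 62 with integer division.
	fmt.Println(burstableOOMScoreAdj(1<<30, 16<<30)) // 938
}
```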
If the node is unable to reclaim memory prior to experiencing a system OOM event, the `oom_killer` calculates an `oom_score`:
% of node memory a container is using + `oom_score_adj` = `oom_score`
The node then kills the container with the highest score.
Containers with the lowest quality of service that are consuming the largest amount of memory relative to the scheduling request are failed first.
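As a rough worked example of that calculation (the usage percentages and adjustment values are illustrative): a `BestEffort` container using 2% of node memory scores roughly 2 + 1000 = 1002, while a `Burstable` container using 30% of node memory with an `oom_score_adj` of 938 scores roughly 30 + 938 = 968, so the `BestEffort` container is killed first even though it is using far less memory.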
Unlike with pod eviction, if a pod's container is OOM killed, the node can restart it based on the pod's restart policy.