Overview

This topic discusses best-effort attempts to prevent OpenShift Origin from experiencing out-of-memory (OOM) and out-of-disk-space conditions.

A node must maintain stability when available compute resources are low. This is especially important when dealing with incompressible resources such as memory or disk. If either resource is exhausted, the node becomes unstable.

Administrators can proactively monitor nodes for and prevent against situations where the node runs out of compute and memory resources using configurable eviction policies.

This topic also provides information on how OpenShift Origin handles out-of-resource conditions and provides an example scenario and recommended practices:

If swap memory is enabled for a node, that node cannot detect that it is under MemoryPressure.

To take advantage of memory based evictions, operators must disable swap.

Configuring Eviction Policies

An eviction policy allows a node to fail one or more pods when the node is running low on available resources. Failing a pod allows the node to reclaim needed resources.

An eviciton policy is a combination of an eviction trigger signal with a specific eviction threshold value, that is set in the node configuration file or through the command line. Evictions can be either hard, where a node takes immediate action on a pod that exceeds a threshold, or soft, where a node allows a grace period before taking action. See the sections below for important information the differences between hard and soft evictions.

By using well-configured eviction policies, a node can proactively monitor for and prevent against total starvation of a compute resource.

When the node fails a pod, it terminates all containers in the pod, and the PodPhase is transitioned to Failed.

Using the Node Configuration to Create a Policy

To configure an eviction policy, edit the node configuration file (the /etc/origin/node/node-config.yaml file) to specify the eviction thresholds under the eviction-hard or eviction-soft parameters.

For example:

Example 1. Sample Node Configuration file for a hard eviction
kubeletArguments:
  eviction-hard: (1)
  - memory.available<500Mi(2)
  - nodefs.available<500Mi
  - nodefs.inodesFree<100Mi
  - imagefs.available<100Mi
  - imagefs.inodesFree<100Mi
1 The type of eviction: Use this parameter for a hard eviction.
2 An eviction threshold based on a specific eviction trigger signal.
Example 2. Sample Node Configuration file for a soft eviction
kubeletArguments:
  eviction-soft: (1)
  - memory.available<500Mi (2)
  - nodefs.available<500Mi
  - nodefs.inodesFree<100Mi
  - imagefs.available<100Mi
  - imagefs.inodesFree<100Mi
  eviction-soft-grace-period:(3)
  - memory.available=1m30s
  - nodefs.available=1m30s
  - nodefs.inodesFree=1m30s
  - imagefs.available=1m30s
  - imagefs.inodesFree=1m30s
1 The type of eviction: Use this parameter for a soft eviction.
2 An eviction threshold based on a specific eviction trigger signal.
3 The grace period for the soft eviction. Leave the default values for optimal performance.
  1. Restart the OpenShift Origin service for the changes to take effect:

    # systemctl restart origin-node

Understanding Eviction Signals

You can configure a node to trigger eviction decisions on any of the signals described in the table below. You add an eviction signal to an eviction threshold along with a threshold value.

The value of each signal is described in the Description column based on the node summary API.

To view the signals:

curl <certificate details> \
  https://<master>/api/v1/nodes/<node>/proxy/stats/summary
Table 1. Supported Eviction Signals
Node Condition Eviction Signal Value Description

MemoryPressure

memory.available

memory.available = node.status.capacity[memory] - node.stats.memory.workingSet

Available memory on the node has exceeded an eviction threshold.

DiskPressure

nodefs.available

nodefs.available = node.stats.fs.available

Available diskspace on either the node root file system or image file system has exceeded an eviction threshold.

nodefs.inodesFree

nodefs.inodesFree = node.stats.fs.inodesFree

imagefs.available

imagefs.available = node.stats.runtime.imagefs.available

imagefs.inodesFree

imagefs.inodesFree = node.stats.runtime.imagefs.inodesFree

Each of the above signals supports either a literal or percentage-based value. The percentage-based value is calculated relative to the total capacity associated with each signal.

A script derives the value for memory.available from your cgroup driver using the same set of steps that the kubelet performs. The script excludes inactive file memory (that is, the number of bytes of file-backed memory on inactive LRU list) from its calculation as it assumes that inactive file memory is reclaimable under pressure.

Do not use tools like free -m, because free -m does not work in a container.

The node supports the nodefs and imagefs file system partitions when detecting disk pressure, as follows:

  • The nodefs file system that the node uses for local disk volumes, daemon logs, and so on (for example, the file system that provides /).

  • The imagefs file system that the container runtime uses for storing images and individual container writable layers.

OpenShift Origin monitors these file systems every 10 seconds.

If you store volumes and logs in a dedicated file system, the node will not monitor that file system.

As of OpenShift Origin 3.4, the node supports the ability to trigger eviction decisions based on disk pressure. Before evicting pods becuase of disk pressure, the node also performs container and image garbage collection. In future releases, garbage collection will be deprecated in favor of a pure disk-eviction based configuration.

Understanding Eviction Thresholds

You can configure a node to specify eviction thresholds, which triggers the node to reclaim resources, by adding a threshold to the node configuration file.

If an eviction threshold is met, independent of its associated grace period, the node reports a condition indicating that the node is under memory or disk pressure. This prevents the scheduler from scheduling any additional pods on the node while attempts to reclaim resources are made.

The node continues to report node status updates at the frequency specified by the node-status-update-frequency argument, which defaults to 10s (ten seconds).

Eviction thresholds can be hard, for when the node takes immediate action when a threshold is met, or soft, for when you allow a grace period before reclaiming resources.

Soft eviction usage is more common when you are targeting a certain level of utilization, but can tolerate temporary spikes. We recommended setting the soft eviction threshold lower than the hard eviction threshold, but the time period can be operator-specific. The system reservation should also cover the soft eviction threshold.

The soft eviction threshold is an advanced feature. You should configure a hard eviction threshold before attempting to use soft eviction thresholds.

Thresholds are configured in the following form:

<eviction_signal><operator><quantity>

For example, if an operator has a node with 10Gi of memory, and that operator wants to induce eviction if available memory falls below 1Gi, an eviction threshold for memory can be specified as either of the following:

memory.available<1Gi
memory.available<10%

The node evaluates and monitors eviction thresholds every 10 seconds and the value can not be modified. This is the housekeeping interval.

Understanding Hard Eviction Thresholds

A hard eviction threshold has no grace period and, if observed, the node takes immediate action to reclaim the associated starved resource. If a hard eviction threshold is met, the node kills the pod immediately with no graceful termination.

To configure hard eviction thresholds, add eviction thresholds to the node configuration file under eviction-hard, as shown in Using the Node Configuration to Create a Policy.

Sample Node Configuration file with hard eviction thresholds
kubeletArguments:
  eviction-hard:
  - memory.available<500Mi
  - nodefs.available<500Mi
  - nodefs.inodesFree<100Mi
  - imagefs.available<100Mi
  - imagefs.inodesFree<100Mi

This example is a general guideline and not recommended settings.

Default Hard Eviction Thresholds

OpenShift Origin uses the following default configuration for eviction-hard.

...
kubeletArguments:
  eviction-hard:
  - memory.available<100Mi
  - nodefs.available<10%
  - nodefs.inodesFree<5%
  - imagefs.available<15%
...

Understanding Soft Eviction Thresholds

A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. The node does not reclaim resources associated with the eviction signal until that grace period is exceeded. If no grace period is provided in the node configuration the node errors on startup.

In addition, if a soft eviction threshold is met, an operator can specify a maximum allowed pod termination grace period to use when evicting pods from the node. If eviction-max-pod-grace-period is specified, the node uses the lesser value among the pod.Spec.TerminationGracePeriodSeconds and the maximum-allowed grace period. If not specified, the node kills pods immediately with no graceful termination.

For soft eviction thresholds the following flags are supported:

  • eviction-soft: a set of eviction thresholds (for example, memory.available<1.5Gi) that, if met over a corresponding grace period, triggers a pod eviction.

  • eviction-soft-grace-period: a set of eviction grace periods (for example, memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction.

  • eviction-max-pod-grace-period: the maximum-allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.

To configure soft eviction thresholds, add eviction thresholds to the node configuration file under eviction-soft, as shown in Using the Node Configuration to Create a Policy.

Sample Node Configuration files with soft eviction thresholds
kubeletArguments:
  eviction-soft:
  - memory.available<500Mi
  - nodefs.available<500Mi
  - nodefs.inodesFree<100Mi
  - imagefs.available<100Mi
  - imagefs.inodesFree<100Mi
  eviction-soft-grace-period:
  - memory.available=1m30s
  - nodefs.available=1m30s
  - nodefs.inodesFree=1m30s
  - imagefs.available=1m30s
  - imagefs.inodesFree=1m30s

This example is a general guideline and not recommended settings.

Configuring the Amount of Resource for Scheduling

You can control how much of a node resource is made available for scheduling in order to allow the scheduler to fully allocate a node and to prevent evictions.

Set system-reserved equal to the amount of resource you want available to the scheduler for deploying pods and for system-daemons. Evictions should only occur if pods use more than their requested amount of an allocatable resource.

A node reports two values:

  • Capacity: How much resource is on the machine

  • Allocatable: How much resource is made available for scheduling.

To configure the amount of allocatable resources, edit the node configuration file (the /etc/origin/node/node-config.yaml file) to add or modify the system-reserved parameter for eviction-hard or eviction-soft.

+

kubeletArguments:
  eviction-hard: (1)
    - "memory.available<500Mi"
  system-reserved:
    - "1.5Gi"
1 This threshold can either be eviction-hard or eviction-soft.
  1. Restart the OpenShift Origin service for the changes to take effect:

    # systemctl restart origin-node

Controlling Node Condition Oscillation

If a node is oscillating above and below a soft eviction threshold, but not exceeding its associated grace period, the corresponding node condition oscillates between true and false, which can cause problems for the scheduler.

To prevent this oscillation, set the eviction-pressure-transition-period parameter to control how long the node must wait before transitioning out of a pressure condition.

  1. Edit or add the parameter to the kubeletArguments section of the node configuration file (the /etc/origin/node/node-config.yaml) using a set of <resource_type>=<resource_quantity> pairs.

kubeletArguments:
  eviction-pressure-transition-period="5m"

+ The node toggles the condition back to false when the node has not observed an eviction threshold being met for the specified pressure condition for the specified period.

+

Use the default value (5 minutes) before doing any adjustments. The default choice is intended to allow the system to stabilize, and to prevent the scheduler from assigning new pods to the node before it has settled.

  1. Restart the OpenShift Origin services for the changes to take effect:

    # systemctl restart origin-node

Reclaiming Node-level Resources

If an eviction criteria is satisfied, the node initiates the process of reclaiming the pressured resource until the signal goes below the defined threshold. During this time, the node does not support scheduling any new pods.

The node attempts to reclaim node-level resources prior to evicting end-user pods, based on whether the host system has a dedicated imagefs configured for the container runtime.

With Imagefs

If the host system has imagefs:

  • If the nodefs file system meets eviction thresholds, the node frees up disk space in the following order:

    • Delete dead pods/containers

  • If the imagefs file system meets eviction thresholds, the node frees up disk space in the following order:

    • Delete all unused images

Without Imagefs

If the host system does not have imagefs:

  • If the nodefs file system meets eviction thresholds, the node frees up disk space in the following order:

    • Delete dead pods/containers

    • Delete all unused images

Understanding Pod Eviction

If an eviction threshold is met and the grace period is passed, the node initiates the process of evicting pods until the signal goes below the defined threshold.

The node ranks pods for eviction by their quality of service, and, among those with the same quality of service, by the consumption of the starved compute resource relative to the pod’s scheduling request.

Each QOS level has an OOM score, which the Linux out-of-memory tool (OOM killer) uses to determine which pods to kill. See Understanding Quality of Service and Out of Memory Killer below.

The following table lists each QOS level and the associated OOM score.

Table 2. Quality of Service Levels
Quality of Service Description

Guaranteed

Pods that consume the highest amount of the starved resource relative to their request are failed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource.

Burstable

Pods that consume the highest amount of the starved resource relative to their request for that resource are failed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource.

BestEffort

Pods that consume the highest amount of the starved resource are failed first.

A Guaranteed pod will never be evicted because of another pod’s resource consumption unless a system daemon (such as node, docker, journald) is consuming more resources than were reserved using system-reserved, or kube-reserved allocations or if the node has only Guaranteed pods remaining.

If the node has only Guaranteed pods remaining, the node evicts a Guaranteed pod that least impacts node stability and limits the impact of the unexpected consumption to other Guaranteed pods.

Local disk is a BestEffort resource. If necessary, the node evicts pods one at a time to reclaim disk when DiskPressure is encountered. The node ranks pods by quality of service. If the node is responding to inode starvation, it will reclaim inodes by evicting pods with the lowest quality of service first. If the node is responding to lack of available disk, it will rank pods within a quality of service that consumes the largest amount of local disk, and evict those pods first.

Understanding Quality of Service and Out of Memory Killer

If the node experiences a system out of memory (OOM) event before it is able to reclaim memory, the node depends on the OOM killer to respond.

The node sets a oom_score_adj value for each container based on the quality of service for the pod.

Table 3. Quality of Service Levels
Quality of Service oom_score_adj Value

Guaranteed

-998

Burstable

min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)

BestEffort

1000

If the node is unable to reclaim memory prior to experiencing a system OOM event, the oom_killer calculates an oom_score:

% of node memory a container is using + `oom_score_adj` = `oom_score`

The node then kills the container with the highest score.

Containers with the lowest quality of service that are consuming the largest amount of memory relative to the scheduling request are failed first.

Unlike pod eviction, if a pod container is OOM failed, it can be restarted by the node based on the node restart policy.

Understanding the Pod Scheduler and OOR Conditions

The scheduler views node conditions when placing additional pods on the node. For example, if the node has an eviction threshold like the following:

eviction-hard is "memory.available<500Mi"

and available memory falls below 500Mi, the node reports a value in Node.Status.Conditions as MemoryPressure as true.

Table 4. Node Conditions and Scheduler Behavior
Node Condition Scheduler Behavior

MemoryPressure

If a node reports this condition, the scheduler will not place BestEffort pods on that node.

DiskPressure

If a node reports this condition, the scheduler will not place any additional pods on that node.

Example Scenario

Consider the following scenario.

An opertator:

  • has a node with a memory capacity of 10Gi;

  • wants to reserve 10% of memory capacity for system daemons (kernel, node, etc.);

  • wants to evict pods at 95% memory utilization to reduce thrashing and incidence of system OOM.

Implicit in this configuration is the understanding that system-reserved should include the amount of memory covered by the eviction threshold.

To reach that capacity, either some pod is using more than its request, or the system is using more than 1Gi.

If a node has 10 Gi of capacity, and you want to reserve 10% of that capacity for the system daemons (system-reserved), perform the following calculation:

capacity = 10 Gi
system-reserved = 10 Gi * .1 = 1 Gi

The amount of allocatable resources becomes:

allocatable = capacity - system-reserved = 9 Gi

This means by default, the scheduler will schedule pods that request 9 Gi of memory to that node.

If you want to turn on eviction so that eviction is triggered when the node observes that available memory falls below 10% of capacity for 30 seconds, or immediately when it falls below 5% of capacity, you need the scheduler to see allocatable as 8Gi. Therefore, ensure your system reservation covers the greater of your eviction thresholds.

capacity = 10 Gi
eviction-threshold = 10 Gi * .1 = 1 Gi
system-reserved = (10Gi * .1) + eviction-threshold = 2 Gi
allocatable = capacity - system-reserved = 8 Gi

Enter the following in the node-config.yaml:

kubeletArguments:
  system-reserved:
  - "2Gi"
  eviction-hard:
  - memory.available<.5Gi
  eviction-soft:
  - memory.available<1Gi
  eviction-soft-grace-period:
  - memory.available=30s

This configuration ensures that the scheduler does not place pods on a node that immediately induce memory pressure and trigger eviction assuming those pods use less than their configured request.

DaemonSets and Out of Resource Handling

If a node evicts a pod that was created by a DaemonSet, the pod will immediately be recreated and rescheduled back to the same node, because the node has no ability to distinguish a pod created from a DaemonSet versus any other object.

In general, DaemonSets should not create BestEffort pods to avoid being identified as a candidate pod for eviction. Instead DaemonSets should ideally launch Guaranteed pods.