Ceph filesystem client eviction

When a filesystem client is unresponsive or otherwise misbehaving, it may be necessary to forcibly terminate its access to the filesystem. This process is called eviction.

This process is somewhat thorough in order to protect against data inconsistency resulting from misbehaving clients.

OSD blacklisting

First, prevent the client from performing any more data operations by blacklisting it at the RADOS level. You may be familiar with this concept as fencing in other storage systems.

Identify the client to evict from the MDS session list:

# ceph daemon mds.a session ls
[
    { "id": 4117,
      "num_leases": 0,
      "num_caps": 1,
      "state": "open",
      "replay_requests": 0,
      "reconnecting": false,
      "inst": "client.4117 172.16.79.251:0\/3271",
      "client_metadata": { "entity_id": "admin",
          "hostname": "fedoravm.localdomain",
          "mount_point": "\/home\/user\/mnt"}}]

In this case the ‘fedoravm’ client has address 172.16.79.251:0/3271, so we blacklist it as follows:

# ceph osd blacklist add 172.16.79.251:0/3271
blacklisting 172.16.79.251:0/3271 until 2014-12-09 13:09:56.569368 (3600 sec)
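
If you want to confirm the entry, or remove it again later (typically only once the client has unmounted or its host has been rebooted), the other blacklist subcommands can be used:

# ceph osd blacklist ls
# ceph osd blacklist rm 172.16.79.251:0/3271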

OSD epoch barrier

While the client is now blacklisted in the central (mon) copy of the OSD map, it is necessary to ensure that this OSD map update has propagated to all daemons involved in subsequent filesystem I/O. To do this, use the osdmap barrier MDS admin socket command.

First read the latest OSD epoch:

# ceph osd dump
epoch 12
fsid fd61ca96-53ff-4311-826c-f36b176d69ea
created 2014-12-09 12:03:38.595844
modified 2014-12-09 12:09:56.619957
...

In this case it is 12. Now request the MDS to barrier on this epoch:

# ceph daemon mds.a osdmap barrier 12
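
If you are scripting this step, the epoch can be extracted from the JSON-formatted dump rather than read by eye. The following sketch assumes jq is available and that the JSON output of ceph osd dump carries a top-level "epoch" field:

# epoch=$(ceph osd dump --format=json | jq .epoch)
# ceph daemon mds.a osdmap barrier "$epoch"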

MDS session eviction

Finally, it is safe to evict the client’s MDS session, such that any capabilities it held may be issued to other clients. The ID here is the id attribute from the session ls output:

# ceph daemon mds.a session evict 4117

That’s it! The client has now been evicted, and any resources it had locked will now be available for other clients.
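
To double-check, you can list the sessions again and confirm that the evicted client (id 4117 in this example) no longer appears:

# ceph daemon mds.a session ls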

Background: OSD epoch barrier

The purpose of the barrier is to ensure that when we hand out any capabilities which might allow touching the same RADOS objects, the clients we hand them to have a sufficiently recent OSD map, so that they do not race with cancelled operations (from ENOSPC) or with blacklisted clients (from evictions).

More specifically, the cases where we set an epoch barrier are:

  • Client eviction (where the client is blacklisted and other clients must wait for a post-blacklist epoch to touch the same objects)
  • OSD map full flag handling in the client (where the client may cancel some OSD ops from a pre-full epoch, so other clients must wait until the full epoch or later before touching the same objects).
  • MDS startup: because we don’t persist the barrier epoch, we must assume that the latest OSD map is always required after a restart.

Note that this is a global value for simplicity; we could maintain it on a per-inode basis instead, but we don’t, because:

  • It would be more complicated
  • It would use an extra 4 bytes of memory for every inode
  • It would not be much more efficient: almost everyone has the latest OSD map anyway, so in most cases everyone will breeze through this barrier rather than waiting.
  • We only do this barrier in very rare cases, so any benefit from per-inode granularity would only very rarely be seen.

The epoch barrier is transmitted along with all capability messages, and instructs the receiver of the message to avoid sending any more RADOS operations to OSDs until it has seen this OSD epoch. This mainly applies to clients (doing their data writes directly to files), but also applies to the MDS because things like file size probing and file deletion are done directly from the MDS.
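
When debugging a barrier that a daemon appears to be stuck behind, it can help to see which OSD epoch an MDS has actually observed. Depending on the Ceph release, the admin socket status command reports this (the exact fields, such as an osdmap epoch or barrier value, vary between versions, so treat this as a hint rather than a stable interface):

# ceph daemon mds.a status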