Before troubleshooting your OSDs, check your monitors and network first. If you execute ceph health or ceph -s on the command line and Ceph returns a health status, the monitors have a quorum. If you don’t have a monitor quorum or if there are errors with the monitor status, address the monitor issues first. Check your networks to ensure they are running properly, because networks may have a significant impact on OSD operation and performance.
A good first step in troubleshooting your OSDs is to obtain information in addition to the information you collected while monitoring your OSDs (e.g., ceph osd tree).
If you haven’t changed the default path, you can find Ceph log files at /var/log/ceph:
ls /var/log/ceph
If you don’t get enough log detail, you can change your logging level. See Logging and Debugging for details to ensure that Ceph performs adequately under high logging volume.
Use the admin socket tool to retrieve runtime information. For details, list the sockets for your Ceph processes:
ls /var/run/ceph
Then, execute the following, replacing {socket-name} with an actual socket name to show the list of available options:
ceph --admin-daemon /var/run/ceph/{socket-name} help
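To run the same admin socket command against every daemon on a host, the two steps above can be combined. The following is a minimal sketch, not part of Ceph itself: the function names list_admin_sockets and for_each_socket are hypothetical, and the sketch assumes that sockets live under /var/run/ceph and use the default .asok suffix.

```shell
# Print the admin socket paths found under a run directory.
# Assumption: sockets use the default ".asok" suffix.
list_admin_sockets() {
    dir=${1:-/var/run/ceph}
    for sock in "$dir"/*.asok; do
        # Skip the literal pattern when nothing matches.
        [ -e "$sock" ] && printf '%s\n' "$sock"
    done
    return 0
}

# Run one admin socket command (for example: help) against every socket.
for_each_socket() {
    run_dir=$1
    shift
    list_admin_sockets "$run_dir" | while IFS= read -r path; do
        echo "== $path"
        ceph --admin-daemon "$path" "$@"
    done
}
```

For example, for_each_socket /var/run/ceph help prints the available options for each running daemon in turn.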
The admin socket, among other things, allows you to:
Filesystem issues may arise. To display your filesystem’s free space, execute df.
df -h
Execute df --help for additional usage.
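If you check free space regularly, it may help to filter the df output down to the filesystems that are filling up. The sketch below is illustrative only: fs_above is a hypothetical helper, it expects the POSIX df -P column layout on standard input, and it assumes mount points contain no spaces.

```shell
# Read `df -P` output on stdin and print "mountpoint capacity" for every
# filesystem whose use percentage is at or above the given limit.
fs_above() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)          # "90%" -> "90"
        if (use + 0 >= limit) print $6, $5
    }'
}
```

A typical invocation would be df -P | fs_above 85.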
To retrieve diagnostic messages, use dmesg with less, more, grep or tail. For example:
dmesg | grep scsi
Periodically, you may need to perform maintenance on a subset of your cluster, or resolve a problem that affects a failure domain (e.g., a rack). If you do not want CRUSH to automatically rebalance the cluster as you stop OSDs for maintenance, set the cluster to noout first:
ceph osd set noout
Once the cluster is set to noout, you can begin stopping the OSDs within the failure domain that requires maintenance work.
ceph osd stop osd.{num}
Note
Placement groups within the OSDs you stop will become degraded while you are addressing issues within the failure domain.
Once you have completed your maintenance, restart the OSDs.
ceph osd start osd.{num}
Finally, you must unset the cluster from noout.
ceph osd unset noout
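Because forgetting the final step leaves the cluster unable to mark OSDs out, one option is to wrap the maintenance work so that noout is always unset afterwards. This is a minimal sketch under that assumption; with_noout is a hypothetical helper name, and the maintenance command you pass to it is your own.

```shell
# Run a maintenance command with noout set, then unset noout again
# whether or not the maintenance command succeeded.
with_noout() {
    ceph osd set noout || return 1
    "$@"
    rc=$?
    ceph osd unset noout
    # Report the exit status of the maintenance command itself.
    return "$rc"
}
```

For example, with_noout ./rack3-maintenance.sh would run a (hypothetical) site-specific script that stops the OSDs, performs the work and restarts them.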
Under normal circumstances, simply restarting the ceph-osd daemon will allow it to rejoin the cluster and recover.
If you start your cluster and an OSD won’t start, check the following:
If you cannot resolve the issue and the email list isn’t helpful, you may contact Inktank for support.
When a ceph-osd process dies, the monitor will learn about the failure from surviving ceph-osd daemons and report it via the ceph health command:
ceph health
HEALTH_WARN 1/3 in osds are down
Specifically, you will get a warning whenever there are ceph-osd processes that are marked in and down. You can identify which ceph-osds are down with:
ceph health detail
HEALTH_WARN 1/3 in osds are down
osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
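To feed the list of down OSDs into another script, the detail output can be filtered. The sketch below assumes the line format shown in the example above ("osd.N is down since ..."); other Ceph versions may word the line differently, and down_osds is a hypothetical helper name.

```shell
# Read `ceph health detail` output on stdin and print the name of each
# OSD reported as down, one per line.
down_osds() {
    awk '$1 ~ /^osd\.[0-9]+$/ && $2 == "is" && $3 == "down" { print $1 }'
}
```

A typical invocation would be ceph health detail | down_osds.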
If there is a disk failure or other fault preventing ceph-osd from functioning or restarting, an error message should be present in its log file in /var/log/ceph.
If the daemon stopped because of a heartbeat failure, the underlying kernel file system may be unresponsive. Check dmesg output for disk or other kernel errors.
If the problem is a software error (failed assertion or other unexpected error), it should be reported to the ceph-devel email list.
Ceph prevents you from writing to a full OSD so that you don’t lose data. In an operational cluster, you should receive a warning when your cluster is getting near its full ratio. The mon osd full ratio defaults to 0.95, or 95% of capacity before it stops clients from writing data. The mon osd nearfull ratio defaults to 0.85, or 85% of capacity when it generates a health warning.
Full cluster issues usually arise when testing how Ceph handles an OSD failure on a small cluster. When one node holds a high percentage of the cluster’s data, losing it can push the remaining OSDs past the nearfull and full ratios immediately. If you are testing how Ceph reacts to OSD failures on a small cluster, you should leave ample free disk space and consider temporarily lowering the mon osd full ratio and mon osd nearfull ratio.
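The arithmetic behind the two ratios can be sketched as follows. This is an illustration rather than Ceph's own logic: osd_fullness is a hypothetical helper, the defaults of 0.85 and 0.95 are the ones quoted above, and treating a value exactly equal to a ratio as having reached it is an assumption.

```shell
# Classify a utilisation figure against the nearfull and full ratios.
# Arguments: used total [nearfull-ratio] [full-ratio]
osd_fullness() {
    used=$1 total=$2 nearfull=${3:-0.85} full=${4:-0.95}
    awk -v u="$used" -v t="$total" -v n="$nearfull" -v f="$full" 'BEGIN {
        if (t <= 0) { print "unknown"; exit }
        r = u / t
        if (r >= f)      print "full"
        else if (r >= n) print "nearfull"
        else             print "ok"
    }'
}
```

For example, osd_fullness 97 100 prints full, while osd_fullness 97 100 0.90 0.98 prints nearfull because both thresholds were overridden.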
Full ceph-osds will be reported by ceph health:
ceph health
HEALTH_WARN 1 nearfull osds
osd.2 is near full at 85%
Or:
ceph health
HEALTH_ERR 1 nearfull osds, 1 full osds
osd.2 is near full at 85%
osd.3 is full at 97%
The best way to deal with a full cluster is to add new ceph-osds, allowing the cluster to redistribute data to the newly available storage.
If you cannot start an OSD because it is full, you may delete some data by deleting some placement group directories in the full OSD.
Important
If you choose to delete a placement group directory on a full OSD, DO NOT delete the same placement group directory on another full OSD, or YOU MAY LOSE DATA. You MUST maintain at least one copy of your data on at least one OSD.
See Monitor Config Reference for additional details.
A commonly recurring issue involves slow or unresponsive OSDs. Ensure that you have eliminated other troubleshooting possibilities before delving into OSD performance issues. For example, ensure that your networks are working properly and your OSDs are running. Check to see if OSDs are throttling recovery traffic.
Tip
Newer versions of Ceph provide better recovery handling by preventing recovering OSDs from using up so many system resources that up and in OSDs become unavailable or otherwise slow.
Ceph is a distributed storage system, so it depends upon networks to peer with OSDs, replicate objects, recover from faults and check heartbeats. Networking issues can cause OSD latency and flapping OSDs. See Flapping OSDs for details.
Ensure that Ceph processes and Ceph-dependent processes are connected and/or listening.
netstat -a | grep ceph
netstat -l | grep ceph
sudo netstat -p | grep ceph
Check network statistics.
netstat -s
A storage drive should only support one OSD. Sequential read and sequential write throughput can bottleneck if other processes share the drive, including journals, operating systems, monitors, other OSDs and non-Ceph processes.
Ceph acknowledges writes after journaling, so fast SSDs are an attractive option to accelerate the response time, particularly when using the ext4 or XFS filesystems. By contrast, the btrfs filesystem can write and journal simultaneously.
Note
Partitioning a drive does not change its total throughput or sequential read/write limits. Running a journal in a separate partition may help, but you should prefer a separate physical drive.
Check your disks for bad sectors and fragmentation. Either can cause total throughput to drop substantially.
Monitors are generally light-weight processes, but they do lots of fsync(), which can interfere with other workloads, particularly if monitors run on the same drive as your OSDs. Additionally, if you run monitors on the same host as the OSDs, you may incur performance issues related to:
In these cases, multiple OSDs running on the same host can drag each other down by doing lots of commits. That often leads to bursty writes.
Spinning up co-resident processes such as a cloud-based solution, virtual machines and other applications that write data to Ceph while operating on the same hardware as OSDs can introduce significant OSD latency. Generally, we recommend optimizing a host for use with Ceph and using other hosts for other processes. The practice of separating Ceph operations from other applications may help improve performance and may streamline troubleshooting and maintenance.
If you turned logging levels up to track an issue and then forgot to turn logging levels back down, the OSD may be writing a lot of logs to the disk. If you intend to keep logging levels high, you may consider mounting a drive at the default log directory (by default, each daemon logs to /var/log/ceph/$cluster-$name.log).
Depending upon your configuration, Ceph may reduce recovery rates to maintain performance or it may increase recovery rates to the point that recovery impacts OSD performance. Check to see if the OSD is recovering.
Check the kernel version you are running. Older kernels may not receive new backports that Ceph depends upon for better performance.
Try running one OSD per host to see if performance improves. Old kernels might not have a recent enough version of glibc to support syncfs(2).
Currently, we recommend deploying clusters with XFS or ext4. The btrfs filesystem has many attractive features, but bugs in the filesystem may lead to performance issues.
We recommend 1GB of RAM per OSD daemon. You may notice that during normal operations, the OSD only uses a fraction of that amount (e.g., 100-200MB). Unused RAM makes it tempting to use the excess RAM for co-resident applications, VMs and so forth. However, when OSDs go into recovery mode, their memory utilization spikes. If there is no RAM available, the OSD performance will slow considerably.
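As a quick sanity check of the 1GB-per-OSD guideline, you can compare a host's total memory with the number of OSD daemons it runs. The helper below is a hypothetical sketch; it only does the subtraction and leaves it to you to supply the OSD count and the total RAM in MB.

```shell
# Print how many MB of RAM remain after reserving 1GB (1024 MB) per OSD.
# A negative result means the host has less RAM than the guideline suggests.
# Arguments: osd-count total-ram-mb
osd_ram_headroom_mb() {
    osd_count=$1
    total_mb=$2
    echo $(( total_mb - osd_count * 1024 ))
}
```

On Linux, the total could be taken from /proc/meminfo, for example osd_ram_headroom_mb 12 "$(awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo)". Remember that any headroom also has to cover recovery spikes and co-resident processes.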
If a ceph-osd daemon is slow to respond to a request, it will generate log messages complaining about requests that are taking too long. The warning threshold defaults to 30 seconds, and is configurable via the osd op complaint time option. When this happens, the cluster log will receive messages.
Legacy versions of Ceph complain about ‘old requests’:
osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
New versions of Ceph complain about ‘slow requests’:
{date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
{date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
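When many of these warnings accumulate, extracting the request ages makes it easier to see how bad the delays are. The sketch below assumes the ‘slow request N seconds old’ wording shown above; slow_request_ages is a hypothetical helper name, and the log file you feed it depends on your setup.

```shell
# Read cluster log lines on stdin and print the age, in seconds, of each
# individual slow request. Summary lines ("1 slow requests, ...") do not
# match the pattern and are ignored.
slow_request_ages() {
    sed -n 's/.*slow request \([0-9.][0-9.]*\) seconds old.*/\1/p'
}
```

For example, piping a log through slow_request_ages | sort -n | tail -1 shows the oldest request reported.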
Possible causes include:
Possible solutions
We recommend using both a public (front-end) network and a cluster (back-end) network so that you can better meet the capacity requirements of object replication. Another advantage is that you can run a cluster network such that it isn’t connected to the internet, thereby preventing some denial of service attacks. When OSDs peer and check heartbeats, they use the cluster (back-end) network when it’s available. See Monitor/OSD Interaction for details.
However, if the cluster (back-end) network fails or develops significant latency while the public (front-end) network operates optimally, OSDs currently do not handle this situation well: they mark each other down on the monitor while marking themselves up. We call this scenario ‘flapping’.
If something is causing OSDs to ‘flap’ (repeatedly getting marked down and then up again), you can force the monitors to stop the flapping with:
ceph osd set noup # prevent OSDs from getting marked up
ceph osd set nodown # prevent OSDs from getting marked down
These flags are recorded in the osdmap structure:
ceph osd dump | grep flags
flags no-up,no-down
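If a script needs to know which flags are currently set, the flags line can be split into one flag per line. This is a minimal sketch: osd_flags is a hypothetical helper, and it assumes only that the line begins with the word flags followed by a comma-separated list, as in the output above. The exact flag spellings may differ between Ceph versions.

```shell
# Read `ceph osd dump` output on stdin and print each flag on its own line.
osd_flags() {
    awk '$1 == "flags" {
        n = split($2, f, ",")
        for (i = 1; i <= n; i++) print f[i]
    }'
}
```

A typical invocation would be ceph osd dump | osd_flags.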
You can clear the flags with:
ceph osd unset noup
ceph osd unset nodown
Two other flags are supported, noin and noout, which prevent booting OSDs from being marked in (allocated data) or protect OSDs from eventually being marked out (regardless of what the current value for mon osd down out interval is).
Note
noup, noout, and nodown are temporary in the sense that once the flags are cleared, the action they were blocking should occur shortly after. The noin flag, on the other hand, prevents OSDs from being marked in on boot, and any daemons that started while the flag was set will remain that way.