Following the pattern of several other multi-threaded UNIX® kernels, FreeBSD deals with interrupt handlers by giving them their own thread context. Providing a context for interrupt handlers allows them to block on locks. To help avoid latency, however, interrupt threads run at real-time kernel priority. Thus, interrupt handlers should not execute for very long to avoid starving other kernel threads. In addition, since multiple handlers may share an interrupt thread, interrupt handlers should not sleep or use a sleepable lock to avoid starving another interrupt handler.
The interrupt threads currently in FreeBSD are referred to as heavyweight interrupt threads. They are called this because switching to an interrupt thread involves a full context switch. In the initial implementation, the kernel was not preemptive and thus interrupts that interrupted a kernel thread would have to wait until the kernel thread blocked or returned to userland before they would have an opportunity to run.
To deal with the latency problems, the kernel in FreeBSD has been made preemptive. Currently, we only preempt a kernel thread when we release a sleep mutex or when an interrupt comes in. However, the plan is to make the FreeBSD kernel fully preemptive as described below.
Not all interrupt handlers execute in a thread context. Instead, some handlers execute directly in primary interrupt context. These interrupt handlers are currently misnamed “fast” interrupt handlers since the INTR_FAST flag used in earlier versions of the kernel is used to mark these handlers. The only interrupts which currently use these types of interrupt handlers are clock interrupts and serial I/O device interrupts. Since these handlers do not have their own context, they may not acquire blocking locks and thus may only use spin mutexes.
Finally, there is one optional optimization that can be added in MD code called lightweight context switches. Since an interrupt thread executes in a kernel context, it can borrow the vmspace of any process. Thus, in a lightweight context switch, the switch to the interrupt thread does not switch vmspaces but borrows the vmspace of the interrupted thread. In order to ensure that the vmspace of the interrupted thread does not disappear out from under us, the interrupted thread is not allowed to execute until the interrupt thread is no longer borrowing its vmspace. This can happen when the interrupt thread either blocks or finishes. If an interrupt thread blocks, then it will use its own context when it is made runnable again. Thus, it can release the interrupted thread.
The cons of this optimization are that it is very machine specific and complex, and thus only worth the effort if there is a large performance improvement. At this point it is probably too early to tell, and in fact, it will probably hurt performance as almost all interrupt handlers will immediately block on Giant and require a thread fix-up when they block. Also, an alternative method of interrupt handling has been proposed by Mike Smith that works like so:
Each interrupt handler has two parts: a predicate which runs in primary interrupt context and a handler which runs in its own thread context.
If an interrupt handler has a predicate, then when an interrupt is triggered, the predicate is run. If the predicate returns true then the interrupt is assumed to be fully handled and the kernel returns from the interrupt. If the predicate returns false or there is no predicate, then the threaded handler is scheduled to run.
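The proposed two-part scheme can be illustrated with a small userland sketch. This is not kernel code: the types intr_source, intr_predicate_t, and intr_handler_t, the function intr_dispatch, and the example callbacks are all hypothetical names invented for illustration, and "scheduling" the threaded handler is reduced to bumping a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types; none of these names come from the FreeBSD source. */
typedef bool (*intr_predicate_t)(void *arg);
typedef void (*intr_handler_t)(void *arg);

struct intr_source {
	intr_predicate_t predicate; /* runs in primary interrupt context; may be NULL */
	intr_handler_t   handler;   /* runs later in its own thread context */
	void            *arg;
	int              scheduled; /* stand-in for "threaded handler made runnable" */
};

/* Stand-in for putting the interrupt thread on a run queue. */
static void
schedule_threaded_handler(struct intr_source *src)
{
	src->scheduled++;
}

/*
 * Dispatch one interrupt. Returns true if the predicate fully handled it,
 * false if the threaded handler had to be scheduled.
 */
static bool
intr_dispatch(struct intr_source *src)
{
	assert(src != NULL && src->handler != NULL);
	if (src->predicate != NULL && src->predicate(src->arg))
		return (true);
	/* Predicate returned false, or there is no predicate at all. */
	schedule_threaded_handler(src);
	return (false);
}

/* Example callbacks used only to exercise the sketch. */
static bool always_handled(void *arg) { (void)arg; return (true); }
static bool never_handled(void *arg) { (void)arg; return (false); }
static void noop_handler(void *arg) { (void)arg; }
```

With a source whose predicate is always_handled, intr_dispatch returns without scheduling anything; with never_handled or a NULL predicate, the threaded handler is scheduled exactly once per interrupt.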
Fitting lightweight context switches into this scheme might prove rather complicated. Since we may want to change to this scheme at some point in the future, it is probably best to defer work on lightweight context switches until we have settled on the final interrupt handling architecture and determined how lightweight context switches might or might not fit into it.
Kernel preemption is fairly simple. The basic idea is that a CPU should always be doing the highest priority work available. Well, that is the ideal at least. There are a couple of cases where the expense of achieving the ideal is not worth being perfect.
Implementing full kernel preemption is very straightforward: when you schedule a thread to be executed by putting it on a run queue, you check to see if its priority is higher than the currently executing thread. If so, you initiate a context switch to that thread.
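The check described above can be modelled in a few lines of userland C. This is only a sketch under stated assumptions: the struct layouts, the fixed-size unsorted run queue, and the name setrunqueue_sketch are hypothetical, a lower numeric value is taken to mean a higher priority, and a "context switch" is reduced to swapping a pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RUNQ_MAX 8

struct thread {
	int priority;                 /* lower value means higher priority */
};

struct cpu {
	struct thread *curthread;     /* thread currently executing */
	struct thread *runq[RUNQ_MAX];/* simplified, unsorted run queue */
	int            runq_len;
	int            switches;      /* number of simulated context switches */
};

/*
 * Make td runnable. If it outranks the running thread, preempt: the
 * running thread goes back on the run queue and td takes over the CPU.
 * Returns true if a preemption happened.
 */
static bool
setrunqueue_sketch(struct cpu *cpu, struct thread *td)
{
	assert(cpu->runq_len < RUNQ_MAX);
	if (td->priority < cpu->curthread->priority) {
		cpu->runq[cpu->runq_len++] = cpu->curthread;
		cpu->curthread = td;
		cpu->switches++;
		return (true);
	}
	cpu->runq[cpu->runq_len++] = td;
	return (false);
}
```

Making a priority 6 thread runnable while a priority 4 thread is executing only queues it, while making a priority 1 thread runnable switches to it immediately.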
While locks can protect most data in the case of a preemption, not all of the kernel is preemption safe. For example, if a thread holding a spin mutex is preempted and the new thread attempts to grab the same spin mutex, the new thread may spin forever as the interrupted thread may never get a chance to execute. Also, some code, such as the code to assign an address space number for a process during exec on the Alpha, must not be preempted as it supports the actual context switch code. Preemption is disabled for these code sections by using a critical section.
The responsibility of the critical section API is to prevent context switches inside of a critical section. With a fully preemptive kernel, every setrunqueue of a thread other than the current thread is a preemption point. One implementation is for critical_enter to set a per-thread flag that is cleared by its counterpart. If setrunqueue is called with this flag set, it does not preempt regardless of the priority of the new thread relative to the current thread. However, since critical sections are used in spin mutexes to prevent context switches and multiple spin mutexes can be acquired, the critical section API must support nesting. For this reason the current implementation uses a nesting count instead of a single per-thread flag.
In order to minimize latency, preemptions inside of a critical section are deferred rather than dropped. If a thread that would normally cause a preemption is made runnable while the current thread is in a critical section, then a per-thread flag is set to indicate that there is a pending preemption. When the outermost critical section is exited, the flag is checked. If the flag is set, then the current thread is preempted to allow the higher priority thread to run.
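The nesting count and the deferred preemption flag can be sketched together as follows. This is a userland model, not the kernel implementation: the field names critnest and pending_preempt, the helper maybe_preempt, and the _sketch suffixes are hypothetical, a lower numeric value is assumed to mean a higher priority, and a preemption is recorded by incrementing a counter instead of switching threads.

```c
#include <assert.h>
#include <stdbool.h>

struct thread {
	int  priority;        /* lower value means higher priority */
	int  critnest;        /* critical section nesting count */
	bool pending_preempt; /* a preemption was deferred (hypothetical name) */
	int  preemptions;     /* stand-in for actual context switches */
};

static void
critical_enter_sketch(struct thread *td)
{
	td->critnest++;
}

/*
 * Called when a thread of priority new_priority becomes runnable.
 * Inside a critical section the preemption is deferred, not dropped.
 * Returns true only if the preemption happened immediately.
 */
static bool
maybe_preempt(struct thread *td, int new_priority)
{
	if (new_priority >= td->priority)
		return (false);
	if (td->critnest > 0) {
		td->pending_preempt = true;
		return (false);
	}
	td->preemptions++;
	return (true);
}

static void
critical_exit_sketch(struct thread *td)
{
	assert(td->critnest > 0);
	/* Only leaving the outermost section acts on a deferred preemption. */
	if (--td->critnest == 0 && td->pending_preempt) {
		td->pending_preempt = false;
		td->preemptions++;
	}
}
```

With two nested sections open, a higher priority thread becoming runnable only sets the flag; the first exit changes nothing, and the second (outermost) exit performs the deferred preemption.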
Interrupts pose a problem with regard to spin mutexes. If a low-level interrupt handler needs a lock, it must not interrupt any code needing that lock, to avoid possible data structure corruption. Currently, providing this mechanism is piggybacked onto the critical section API by means of the cpu_critical_enter and cpu_critical_exit functions. Currently this API disables and re-enables interrupts on all of FreeBSD's current platforms. This approach may not be purely optimal, but it is simple to understand and simple to get right. Theoretically, this second API need only be used for spin mutexes that are used in primary interrupt context. However, to make the code simpler, it is used for all spin mutexes and even all critical sections. It may be desirable to split out the MD API from the MI API and only use it in conjunction with the MI API in the spin mutex implementation. If this approach is taken, then the MD API likely would need a rename to show that it is a separate API.
As mentioned earlier, a couple of trade-offs have been made to sacrifice cases where perfect preemption may not always provide the best performance.
The first trade-off is that the preemption code does not take other CPUs into account. Suppose we have two CPUs, A and B, with the priority of A's thread as 4 and the priority of B's thread as 2. If CPU B makes a thread with priority 1 runnable, then in theory, we want CPU A to switch to the new thread so that we will be running the two highest priority runnable threads. However, the cost of determining which CPU to enforce a preemption on as well as actually signaling that CPU via an IPI along with the synchronization that would be required would be enormous. Thus, the current code would instead force CPU B to switch to the higher priority thread. Note that this still puts the system in a better position as CPU B is executing a thread of priority 1 rather than a thread of priority 2.
The second trade-off limits immediate kernel preemption to real-time priority kernel threads. In the simple case of preemption defined above, a thread is always preempted immediately (or as soon as a critical section is exited) if a higher priority thread is made runnable. However, many threads executing in the kernel only execute in a kernel context for a short time before either blocking or returning to userland. Thus, if the kernel preempts these threads to run another non-realtime kernel thread, the kernel may switch out the executing thread just before it is about to sleep or return to userland. The cache on the CPU must then adjust to the new thread. When the kernel returns to the preempted thread, it must refill all the cache information that was lost. In addition, two extra context switches are performed that could be avoided if the kernel deferred the preemption until the first thread blocked or returned to userland. Thus, by default, the preemption code will only preempt immediately if the higher priority thread is a real-time priority thread.
Turning on full kernel preemption for all kernel threads has value as a debugging aid since it exposes more race conditions. It is especially useful on UP systems where many races are hard to simulate otherwise. Thus, there is a kernel option FULL_PREEMPTION to enable preemption for all kernel threads that can be used for debugging purposes.
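The second trade-off and the FULL_PREEMPTION option amount to a small decision function, sketched here in userland C. The function name should_preempt_now, the runtime full_preemption parameter (the real option is a compile-time kernel option), and the SKETCH_RT_MAX boundary of the real-time priority range are all assumptions made for illustration; a lower numeric value is taken to mean a higher priority.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical upper bound of the real-time priority range. */
#define SKETCH_RT_MAX 47

/*
 * Decide whether a newly runnable thread should preempt the current
 * thread right away. Without full preemption, only real-time priority
 * threads preempt immediately; others wait until the current thread
 * blocks or returns to userland.
 */
static bool
should_preempt_now(int new_pri, int cur_pri, bool full_preemption)
{
	assert(new_pri >= 0 && cur_pri >= 0);
	if (new_pri >= cur_pri)
		return (false);          /* not higher priority: never preempt */
	if (full_preemption)
		return (true);           /* debugging mode: always preempt */
	return (new_pri <= SKETCH_RT_MAX);
}
```

Under these assumptions, a priority 10 thread preempts a priority 100 thread immediately in either mode, whereas a priority 90 thread does so only when full preemption is enabled.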
Simply put, a thread migrates when it moves from one CPU to another. In a non-preemptive kernel this can only happen at well-defined points such as when calling msleep or returning to userland. However, in the preemptive kernel, an interrupt can force a preemption and possible migration at any time. This can have negative effects on per-CPU data since, with the exception of curthread and curpcb, the data can change whenever you migrate. Since you can potentially migrate at any time, this renders unprotected per-CPU data access rather useless. Thus it is desirable to be able to disable migration for sections of code that need per-CPU data to be stable.
Critical sections currently prevent migration since they do not allow context switches. However, this may be too strong of a requirement to enforce in some cases since a critical section also effectively blocks interrupt threads on the current processor. As a result, another API has been provided to allow the current thread to indicate that if it is preempted it should not migrate to another CPU.
This API is known as thread pinning and is provided by the scheduler. The API consists of two functions: sched_pin and sched_unpin. These functions manage a per-thread nesting count td_pinned. A thread is pinned when its nesting count is greater than zero and a thread starts off unpinned with a nesting count of zero. Each scheduler implementation is required to ensure that pinned threads are only executed on the CPU that they were executing on when sched_pin was first called. Since the nesting count is only written to by the thread itself and is only read by other threads when the pinned thread is not executing but while sched_lock is held, td_pinned does not need any locking. The sched_pin function increments the nesting count and sched_unpin decrements the nesting count. Note that these functions only operate on the current thread and bind the current thread to the CPU it is executing on at the time. To bind an arbitrary thread to a specific CPU, the sched_bind and sched_unbind functions should be used instead.
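The pinning count can be modelled with a short userland sketch. The td_pinned field is the one named in the text; everything else is an assumption for illustration: the cpu field, the sched_pickcpu_sketch helper standing in for the scheduler's CPU selection, and the explicit thread argument (the functions described above operate on the current thread only).

```c
#include <assert.h>
#include <stdbool.h>

struct thread {
	int td_pinned; /* nesting count; greater than zero means pinned */
	int cpu;       /* CPU the thread was last executing on (hypothetical) */
};

/* Increment the nesting count; only the thread itself would do this. */
static void
sched_pin_sketch(struct thread *td)
{
	td->td_pinned++;
}

/* Decrement the nesting count; must pair with an earlier pin. */
static void
sched_unpin_sketch(struct thread *td)
{
	assert(td->td_pinned > 0);
	td->td_pinned--;
}

/*
 * Scheduler side: pick a CPU for a runnable thread. A pinned thread may
 * only go back to the CPU it was on; an unpinned one may migrate to the
 * preferred (for example, idle) CPU.
 */
static int
sched_pickcpu_sketch(const struct thread *td, int preferred_cpu)
{
	if (td->td_pinned > 0)
		return (td->cpu);
	return (preferred_cpu);
}
```

A thread last run on CPU 2 and pinned twice stays on CPU 2 until both pins are released, after which the scheduler is again free to place it on any CPU.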
The timeout kernel facility permits kernel services to register functions for execution as part of the softclock software interrupt. Events are scheduled based on a desired number of clock ticks, and callbacks to the consumer-provided function will occur at approximately the right time.
The global list of pending timeout events is protected by a global spin mutex, callout_lock; all access to the timeout list must be performed with this mutex held. When softclock is woken up, it scans the list of pending timeouts for those that should fire. In order to avoid lock order reversal, the softclock thread will release the callout_lock mutex when invoking the provided timeout callback function. If the CALLOUT_MPSAFE flag was not set during registration, then Giant will be grabbed before invoking the callout, and then released afterwards. The callout_lock mutex will be re-grabbed before proceeding. The softclock code is careful to leave the list in a consistent state while releasing the mutex. If DIAGNOSTIC is enabled, then the time taken to execute each function is measured, and a warning is generated if it exceeds a threshold.
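The locking protocol followed by softclock can be sketched in userland C. This is a simplified model, not the kernel code: the locks are plain flags (fake_lock) rather than real mutexes, the pending list is a singly linked list rescanned from the head after every callback, and all names ending in _sketch as well as record_locks are hypothetical. The mpsafe field plays the role of the CALLOUT_MPSAFE flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Flag-based stand-in for a mutex; asserts catch misuse in the sketch. */
struct fake_lock { int held; };
static void lock_acquire(struct fake_lock *l) { assert(l->held == 0); l->held = 1; }
static void lock_release(struct fake_lock *l) { assert(l->held == 1); l->held = 0; }

static struct fake_lock callout_lock_sketch;
static struct fake_lock giant_sketch;

struct callout_sketch {
	struct callout_sketch *next;
	int   ticks_due;          /* tick at which the callback should fire */
	bool  mpsafe;             /* models CALLOUT_MPSAFE */
	void (*func)(void *);
	void *arg;
};

static struct callout_sketch *pending; /* protected by callout_lock_sketch */

/* Fire every pending callout that is due at tick 'now'; returns the count. */
static int
softclock_sketch(int now)
{
	int fired = 0;

	lock_acquire(&callout_lock_sketch);
	struct callout_sketch **pp = &pending;
	while (*pp != NULL) {
		struct callout_sketch *c = *pp;
		if (c->ticks_due > now) {
			pp = &c->next;
			continue;
		}
		/* Unlink first so the list is consistent while unlocked. */
		*pp = c->next;
		c->next = NULL;
		lock_release(&callout_lock_sketch);
		if (!c->mpsafe)
			lock_acquire(&giant_sketch);
		c->func(c->arg);
		if (!c->mpsafe)
			lock_release(&giant_sketch);
		lock_acquire(&callout_lock_sketch);
		fired++;
		/* The list may have changed while unlocked: rescan from the head. */
		pp = &pending;
	}
	lock_release(&callout_lock_sketch);
	return (fired);
}

/* Example callback recording which locks were held when it ran. */
static int giant_seen, callout_lock_seen;
static void
record_locks(void *arg)
{
	(void)arg;
	giant_seen += giant_sketch.held;
	callout_lock_seen += callout_lock_sketch.held;
}
```

Running the sketch over a list with one MP-safe entry, one non-MP-safe entry, and one entry that is not yet due fires the first two, holds the Giant stand-in only around the non-MP-safe callback, and never holds the list lock during a callback.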