struct ucred is the kernel's internal credential structure, and is
generally used as the basis for process-driven access control within
the kernel. BSD-derived systems use a “copy-on-write” model
for credential data: multiple references may exist for a
credential structure, and when a change needs to be made, the
structure is duplicated, modified, and then the reference
replaced. Due to widespread caching of the credential to
implement access control on open, this results in substantial
memory savings. With a move to fine-grained SMP, this model
also saves substantially on locking operations by requiring
that modification only occur on an unshared credential,
avoiding the need for explicit synchronization when consuming
a known-shared credential.
Credential structures with a single reference are
considered mutable; shared credential structures must not be
modified or a race condition is risked. A mutex, cr_mtxp,
protects the reference count of struct ucred so as to
maintain consistency. Any use of the structure requires a
valid reference for the duration of the use, or the structure
may be released out from under the illegitimate
consumer.
The struct ucred mutex is a leaf mutex and is implemented via a
mutex pool for performance reasons.
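The copy-on-write rule can be illustrated with a small userland model. This is a minimal sketch, not the kernel implementation: struct cred, cred_alloc, cred_hold, cred_free, and cred_set_uid are hypothetical names, and a pthread mutex stands in for the pool mutex referenced by cr_mtxp.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical userland stand-in for struct ucred. */
struct cred {
	pthread_mutex_t	*c_mtxp;	/* protects c_ref only */
	unsigned	 c_ref;		/* number of references */
	unsigned	 c_uid;		/* example payload */
};

/* One shared mutex models a (single-entry) mutex pool. */
static pthread_mutex_t cred_pool_mtx = PTHREAD_MUTEX_INITIALIZER;

struct cred *
cred_alloc(unsigned uid)
{
	struct cred *c = calloc(1, sizeof(*c));

	if (c == NULL)
		abort();
	c->c_mtxp = &cred_pool_mtx;
	c->c_ref = 1;
	c->c_uid = uid;
	return (c);
}

/* Gain an additional reference; the caller must already hold one. */
struct cred *
cred_hold(struct cred *c)
{
	pthread_mutex_lock(c->c_mtxp);
	c->c_ref++;
	pthread_mutex_unlock(c->c_mtxp);
	return (c);
}

/* Drop a reference and free the structure when the last one goes away. */
void
cred_free(struct cred *c)
{
	int last;

	pthread_mutex_lock(c->c_mtxp);
	assert(c->c_ref > 0);
	last = (--c->c_ref == 0);
	pthread_mutex_unlock(c->c_mtxp);
	if (last)
		free(c);
}

/*
 * Change the uid.  A credential with a single reference is mutable and
 * is updated in place; a shared one is duplicated first and the caller's
 * reference moves to the copy.  The caller stores the returned pointer.
 */
struct cred *
cred_set_uid(struct cred *c, unsigned uid)
{
	struct cred *copy;
	int shared;

	pthread_mutex_lock(c->c_mtxp);
	shared = (c->c_ref > 1);
	pthread_mutex_unlock(c->c_mtxp);
	if (shared) {
		copy = cred_alloc(c->c_uid);	/* duplicate */
		cred_free(c);			/* release the shared original */
		c = copy;
	}
	c->c_uid = uid;				/* modify the unshared copy */
	return (c);
}
```

Readers that hold a reference never take a lock to inspect c_uid in this model, because a credential visible to more than one holder is never written.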
Usually, credentials are used in a read-only manner for access
control decisions, and in this case td_ucred is generally preferred
because it requires no locking. When a process's credential is
updated, the proc lock must be held across the check and update
operations to avoid races. The process credential p_ucred must be
used for check and update operations to prevent time-of-check,
time-of-use races.
If system call invocations will perform access control after
an update to the process credential, the value of td_ucred
must also be refreshed to the current process value. This will
prevent use of a stale credential following a change. The kernel
automatically refreshes the td_ucred pointer in the thread
structure from the process p_ucred whenever a process enters
the kernel, permitting use of a fresh credential for kernel
access control.
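The interaction between p_ucred and td_ucred can be sketched in userland as follows. This is an illustrative model under stated assumptions: thread_refresh_cred and proc_setuid are hypothetical names, the permission rule in proc_setuid is invented for the example, and a single global mutex stands in for the credential reference-count mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct cred { unsigned c_ref; unsigned c_uid; };
struct proc { pthread_mutex_t p_mtx; struct cred *p_ucred; };
struct thread { struct proc *td_proc; struct cred *td_ucred; };

/* Protects every c_ref in this model. */
static pthread_mutex_t cred_mtx = PTHREAD_MUTEX_INITIALIZER;

struct cred *
cred_new(unsigned uid)
{
	struct cred *c = malloc(sizeof(*c));

	if (c == NULL)
		abort();
	c->c_ref = 1;
	c->c_uid = uid;
	return (c);
}

struct cred *
cred_hold(struct cred *c)
{
	pthread_mutex_lock(&cred_mtx);
	c->c_ref++;
	pthread_mutex_unlock(&cred_mtx);
	return (c);
}

void
cred_drop(struct cred *c)
{
	int last;

	pthread_mutex_lock(&cred_mtx);
	assert(c->c_ref > 0);
	last = (--c->c_ref == 0);
	pthread_mutex_unlock(&cred_mtx);
	if (last)
		free(c);
}

/* Run on every simulated kernel entry: pick up the current p_ucred. */
void
thread_refresh_cred(struct thread *td)
{
	struct proc *p = td->td_proc;
	struct cred *old = NULL;

	pthread_mutex_lock(&p->p_mtx);
	if (td->td_ucred != p->p_ucred) {
		old = td->td_ucred;
		td->td_ucred = cred_hold(p->p_ucred);
	}
	pthread_mutex_unlock(&p->p_mtx);
	if (old != NULL)
		cred_drop(old);
}

/*
 * Update the process credential.  The check and the update both use
 * p_ucred and both happen under the proc lock, so no other thread can
 * change the credential between them.  Illustrative rule: only uid 0
 * may switch to a different uid.
 */
int
proc_setuid(struct proc *p, unsigned uid)
{
	struct cred *new = cred_new(uid);	/* allocate before locking */
	struct cred *old;

	pthread_mutex_lock(&p->p_mtx);
	if (p->p_ucred->c_uid != 0 && p->p_ucred->c_uid != uid) {
		pthread_mutex_unlock(&p->p_mtx);
		cred_drop(new);
		return (-1);
	}
	old = p->p_ucred;
	p->p_ucred = new;
	pthread_mutex_unlock(&p->p_mtx);
	cred_drop(old);
	return (0);
}
```

Between proc_setuid and the next thread_refresh_cred, the thread's cached pointer still refers to the old credential, which is the stale window the text describes.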
struct prison stores
administrative details pertinent to the maintenance of jails
created using the jail(2) API. This includes the
per-jail hostname, IP address, and related settings. This
structure is reference-counted since pointers to instances of
the structure are shared by many credential structures. A
single mutex, pr_mtx, protects read
and write access to the reference count and all mutable
variables inside the struct prison. Some variables are set only
when the jail is created, and a valid reference to the
struct prison is sufficient to read
these values. The precise locking of each entry is documented
via comments in sys/jail.h.
The TrustedBSD MAC Framework maintains data in a variety
of kernel objects, in the form of struct label. In general,
labels in kernel objects are protected by the same lock as the
remainder of the kernel object. For example, the v_label
label in struct vnode is protected
by the vnode lock on the vnode.
In addition to labels maintained in standard kernel objects,
the MAC Framework also maintains a list of registered and
active policies. The policy list is protected by a global
mutex (mac_policy_list_lock) and a busy
count (also protected by the mutex). Since many access
control checks may occur in parallel, entry to the framework
for a read-only access to the policy list requires holding the
mutex while incrementing (and later decrementing) the busy
count. The mutex need not be held for the duration of the
MAC entry operation, since some operations, such as label
operations on file system objects, are long-lived. To modify the
policy list, such as during policy registration and de-registration,
the mutex must be held and the busy count must be zero,
to prevent modification of the list while it is in use.
A condition variable, mac_policy_list_not_busy, is available to
threads that need to wait for the list to become unbusy, but
this condition variable must only be waited on if the caller is
holding no other locks, or a lock order violation may be
possible. The busy count, in effect, acts as a form of
shared/exclusive lock over access to the framework: the difference
is that, unlike with an sx lock, consumers waiting for the list
to become unbusy may be starved, rather than permitting lock
order problems with regard to the busy count and other locks
that may be held on entry to (or inside) the MAC Framework.
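The busy-count protocol can be modeled in userland with a pthread mutex and condition variable. This is a sketch under stated assumptions, not the MAC Framework code: all function names are hypothetical, and an integer counter stands in for the policy list itself.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t policy_list_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t policy_list_not_busy = PTHREAD_COND_INITIALIZER;
static int policy_busy_count;	/* protected by policy_list_lock */
static int policy_count;	/* stand-in for the policy list */

/* Reader entry: the mutex is held only long enough to bump the count. */
void
policy_list_busy(void)
{
	pthread_mutex_lock(&policy_list_lock);
	policy_busy_count++;
	pthread_mutex_unlock(&policy_list_lock);
}

/* Reader exit: wake any waiting writer once the list is idle. */
void
policy_list_unbusy(void)
{
	pthread_mutex_lock(&policy_list_lock);
	assert(policy_busy_count > 0);
	if (--policy_busy_count == 0)
		pthread_cond_broadcast(&policy_list_not_busy);
	pthread_mutex_unlock(&policy_list_lock);
}

/* Writer that refuses to wait: returns -1 if the list is in use. */
int
policy_register_try(void)
{
	pthread_mutex_lock(&policy_list_lock);
	if (policy_busy_count != 0) {
		pthread_mutex_unlock(&policy_list_lock);
		return (-1);
	}
	policy_count++;		/* modify the list: mutex held, count zero */
	pthread_mutex_unlock(&policy_list_lock);
	return (0);
}

/* Writer that waits; the caller must hold no other locks. */
void
policy_register_wait(void)
{
	pthread_mutex_lock(&policy_list_lock);
	while (policy_busy_count != 0)
		pthread_cond_wait(&policy_list_not_busy, &policy_list_lock);
	policy_count++;
	pthread_mutex_unlock(&policy_list_lock);
}
```

Because policy_list_busy never checks for waiting writers, a steady stream of readers can keep the count above zero indefinitely, which is the starvation trade-off noted above.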
The module subsystem uses a single lock to protect its shared
data. This lock is a shared/exclusive (SX) lock that is acquired
frequently, in either shared or exclusive mode, so a few macros
have been added to make access to the lock easier. These macros
are defined in sys/module.h and are quite basic
in terms of usage. The main structures protected under this lock
are the module_t structures (when shared)
and the global modulelist_t structure,
modules. One should review the related source code in
kern/kern_module.c to further understand the
locking strategy.
The newbus system will have one sx lock. Readers will hold a shared (read) lock (sx_slock(9)) and writers will hold an exclusive (write) lock (sx_xlock(9)). Internal functions will not do locking at all. Externally visible ones will lock as needed. Those items that do not matter if the race is won or lost will not be locked, since they tend to be read all over the place (e.g., device_get_softc(9)). There will be relatively few changes to the newbus data structures, so a single lock should be sufficient and not impose a performance penalty.
- process hierarchy
- proc locks, references
- thread-specific copies of proc entries to freeze during system calls, including td_ucred
- inter-process operations
- process groups and sessions
Lots of references to sched_lock and notes
pointing at specific primitives and related magic elsewhere in the
document.
The select and poll functions permit threads to block
waiting on events on file descriptors, most frequently
whether or not the file descriptors are readable or
writable.
...
The SIGIO service permits a process to request the delivery
of a SIGIO signal to its process group when the read/write
status of specified file descriptors changes. At most one
process or process group is permitted to register for SIGIO
from any given kernel object, and that process or group is
referred to as the owner. Each object supporting SIGIO
registration contains a pointer field that is NULL
if the object is not registered, or
points to a struct sigio describing
the registration. This field is protected by a global mutex,
sigio_lock. Callers to SIGIO maintenance
functions must pass in this field “by reference”
so that local register copies of the field are not made when
unprotected by the lock.
One struct sigio is allocated for
each registered object associated with any process or process
group, and contains back-pointers to the object, owner, signal
information, a credential, and the general disposition of the
registration. Each process or process group contains a list of
registered struct sigio structures,
p_sigiolst for processes, and
pg_sigiolst for process groups.
These lists are protected by the process or process group
locks respectively. Most fields in each struct sigio
are constant for the duration of the
registration, with the exception of the
sio_pgsigio field which links the
struct sigio into the process or
process group list. Developers implementing new kernel
objects supporting SIGIO will, in general, want to avoid
holding structure locks while invoking SIGIO supporting
functions, such as fsetown or funsetown, to avoid
defining a lock order between structure locks and the global
SIGIO lock. This is generally possible through use of an
elevated reference count on the structure, such as reliance
on a file descriptor reference to a pipe during a pipe
operation.
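The “by reference” convention can be illustrated with a userland model. This is a sketch with hypothetical names (struct sigio_reg, sigio_set_owner, sigio_clear_owner, sigio_get_owner, struct pipe_obj), using a pthread mutex in place of sigio_lock; it is not the kernel's fsetown or funsetown.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical, reduced stand-in for struct sigio. */
struct sigio_reg {
	int			 sio_owner;	/* owning process or group id */
	struct sigio_reg	**sio_myref;	/* back-pointer to the object field */
};

/* Global lock protecting every object's registration field. */
static pthread_mutex_t sigio_mtx = PTHREAD_MUTEX_INITIALIZER;

/*
 * The caller passes the ADDRESS of the object's field.  The field is
 * dereferenced only while the lock is held, so no unlocked copy of
 * its value ever exists in the caller.
 */
int
sigio_set_owner(struct sigio_reg **fieldp, int owner)
{
	struct sigio_reg *new, *old;

	new = malloc(sizeof(*new));	/* allocate before locking */
	if (new == NULL)
		return (-1);
	new->sio_owner = owner;
	new->sio_myref = fieldp;

	pthread_mutex_lock(&sigio_mtx);
	old = *fieldp;
	*fieldp = new;
	pthread_mutex_unlock(&sigio_mtx);
	free(old);			/* free(NULL) is a no-op */
	return (0);
}

void
sigio_clear_owner(struct sigio_reg **fieldp)
{
	struct sigio_reg *old;

	pthread_mutex_lock(&sigio_mtx);
	old = *fieldp;
	*fieldp = NULL;
	pthread_mutex_unlock(&sigio_mtx);
	free(old);
}

/* Returns the owner id, or 0 if the object is not registered. */
int
sigio_get_owner(struct sigio_reg **fieldp)
{
	int owner;

	pthread_mutex_lock(&sigio_mtx);
	owner = (*fieldp != NULL) ? (*fieldp)->sio_owner : 0;
	pthread_mutex_unlock(&sigio_mtx);
	return (owner);
}

/* Example object carrying a registration field. */
struct pipe_obj {
	struct sigio_reg *pipe_sigio;
};
```

Had sigio_set_owner taken the field's value instead of its address, the caller would have read the pointer before the lock was acquired, and a concurrent registration could replace or free it in between.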
The sysctl MIB service is invoked
from both within the kernel and from userland applications
using a system call. At least two issues are raised in
locking: first, the protection of the structures maintaining
the namespace, and second, interactions with kernel variables
and functions that are accessed by the sysctl interface.
Since sysctl permits the direct export (and modification) of
kernel statistics and configuration parameters, the sysctl
mechanism must become aware of appropriate locking semantics
for those variables. Currently, sysctl makes use of a single
global sx lock to serialize use of sysctl; however, it is
assumed to operate under Giant and other protections are not
provided. The remainder of this section speculates on locking
and semantic changes to sysctl.
- Need to change the order of operations for sysctls that update values from “read old, copyin and copyout, write new” to “copyin, lock, read old and write new, unlock, copyout”. Normal sysctls that just copyout the old value and set a new value that they copyin may still be able to follow the old model. However, it may be cleaner to use the second model for all of the sysctl handlers to avoid lock operations.
- To allow for the common case, a sysctl could embed a pointer to a mutex in the SYSCTL_FOO macros and in the struct. This would work for most sysctls. For values protected by sx locks, spin mutexes, or other locking strategies besides a single sleep mutex, SYSCTL_PROC nodes could be used to get the locking right.
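The proposed ordering (copyin, lock, read old and write new, unlock, copyout) can be sketched in userland. This is an illustrative model only: fake_copyin and fake_copyout are stand-ins for copyin(9) and copyout(9), and tunable, tunable_mtx, and sysctl_handle_tunable are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

static pthread_mutex_t tunable_mtx = PTHREAD_MUTEX_INITIALIZER;
static int tunable = 10;	/* protected by tunable_mtx */

/* Stand-ins for the user copy routines, which may fault and sleep. */
static int
fake_copyin(const void *uaddr, void *kaddr, size_t len)
{
	memcpy(kaddr, uaddr, len);
	return (0);
}

static int
fake_copyout(const void *kaddr, void *uaddr, size_t len)
{
	memcpy(uaddr, kaddr, len);
	return (0);
}

/*
 * oldp receives the previous value if non-NULL; newp supplies a new
 * value if non-NULL.  Both user copies happen outside the lock, so the
 * mutex covers only the read of the old value and the write of the new.
 */
int
sysctl_handle_tunable(void *oldp, const void *newp)
{
	int newval = 0, oldval, error;

	if (newp != NULL) {			/* 1. copyin */
		error = fake_copyin(newp, &newval, sizeof(newval));
		if (error != 0)
			return (error);
	}
	pthread_mutex_lock(&tunable_mtx);	/* 2. lock */
	oldval = tunable;			/* 3. read old */
	if (newp != NULL)
		tunable = newval;		/*    write new */
	pthread_mutex_unlock(&tunable_mtx);	/* 4. unlock */
	if (oldp != NULL)			/* 5. copyout */
		return (fake_copyout(&oldval, oldp, sizeof(oldval)));
	return (0);
}
```

With this ordering the old value and the new value are exchanged atomically with respect to other handlers, and no lock is held across an operation that might sleep.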
The taskqueue interface has two basic locks associated
with it in order to protect the related shared data. The
taskqueue_queues_mutex serves as a
lock to protect the taskqueue_queues TAILQ.
The other mutex associated with this system is the one in the
struct taskqueue data structure. This
synchronization primitive protects the
integrity of the data in the struct taskqueue.
It should be noted that there are no
separate macros to assist users in locking down their own work
since these locks are most likely not going to be used outside of
kern/subr_taskqueue.c.