struct ucred
is the kernel's internal credential
structure, and is generally used as the basis for process-driven access control within
the kernel. BSD-derived systems use a “copy-on-write” model for credential
data: multiple references may exist for a credential structure, and when a change needs
to be made, the structure is duplicated, modified, and then the reference replaced. Due
to wide-spread caching of the credential to implement access control on open, this
results in substantial memory savings. With a move to fine-grained SMP, this model also
saves substantially on locking operations by requiring that modification only occur on an
unshared credential, avoiding the need for explicit synchronization when consuming a
known-shared credential.
Credential structures with a single reference are considered mutable; shared
credential structures must not be modified or a race condition is risked. A mutex, cr_mtxp
protects the reference count of struct ucred
so as to maintain consistency. Any use of the
structure requires a valid reference for the duration of the use, or the structure may be
released out from under the illegitimate consumer.
The struct ucred
mutex is a leaf mutex and is
implemented via a mutex pool for performance reasons.
Usually, credentials are used in a read-only manner for access control decisions, and
in this case td_ucred
is generally preferred because it
requires no locking. When a process' credential is updated the proc lock must be held across the check and update operations thus
avoid races. The process credential p_ucred
must be used
for check and update operations to prevent time-of-check, time-of-use races.
If system call invocations will perform access control after an update to the process
credential, the value of td_ucred
must also be refreshed
to the current process value. This will prevent use of a stale credential following a
change. The kernel automatically refreshes the td_ucred
pointer in the thread structure from the process p_ucred
whenever a process enters the kernel, permitting use of a fresh credential for kernel
access control.
Details to follow.
struct prison
stores administrative details pertinent
to the maintenance of jails created using the jail(2) API. This
includes the per-jail hostname, IP address, and related settings. This structure is
reference-counted since pointers to instances of the structure are shared by many
credential structures. A single mutex, pr_mtx
protects
read and write access to the reference count and all mutable variables inside the struct
jail. Some variables are set only when the jail is created, and a valid reference to the
struct prison
is sufficient to read these values. The
precise locking of each entry is documented via comments in sys/jail.h.
The TrustedBSD MAC Framework maintains data in a variety of kernel objects, in the
form of struct label
. In general, labels in kernel
objects are protected by the same lock as the remainder of the kernel object. For
example, the v_label
label in struct vnode
is protected by the vnode lock on the vnode.
In addition to labels maintained in standard kernel objects, the MAC Framework also
maintains a list of registered and active policies. The policy list is protected by a
global mutex (mac_policy_list_lock
) and a busy count (also
protected by the mutex). Since many access control checks may occur in parallel, entry to
the framework for a read-only access to the policy list requires holding the mutex while
incrementing (and later decrementing) the busy count. The mutex need not be held for the
duration of the MAC entry operation--some operations, such as label operations on file
system objects--are long-lived. To modify the policy list, such as during policy
registration and de-registration, the mutex must be held and the reference count must be
zero, to prevent modification of the list while it is in use.
A condition variable, mac_policy_list_not_busy
, is
available to threads that need to wait for the list to become unbusy, but this condition
variable must only be waited on if the caller is holding no other locks, or a lock order
violation may be possible. The busy count, in effect, acts as a form of shared/exclusive
lock over access to the framework: the difference is that, unlike with an sx lock,
consumers waiting for the list to become unbusy may be starved, rather than permitting
lock order problems with regards to the busy count and other locks that may be held on
entry to (or inside) the MAC Framework.
For the module subsystem there exists a single lock that is used to protect the shared
data. This lock is a shared/exclusive (SX) lock and has a good chance of needing to be
acquired (shared or exclusively), therefore there are a few macros that have been added
to make access to the lock more easy. These macros can be located in sys/module.h and are quite basic in terms of usage. The main
structures protected under this lock are the module_t
structures (when shared) and the global modulelist_t
structure, modules. One should review the related source code in kern/kern_module.c to further understand the locking strategy.
The newbus system will have one sx lock. Readers will hold a shared (read) lock (sx_slock(9)) and writers will hold an exclusive (write) lock (sx_xlock(9)). Internal functions will not do locking at all. Externally visible ones will lock as needed. Those items that do not matter if the race is won or lost will not be locked, since they tend to be read all over the place (e.g. device_get_softc(9)). There will be relatively few changes to the newbus data structures, so a single lock should be sufficient and not impose a performance penalty.
...
- process hierarchy
- proc locks, references
- thread-specific copies of proc entries to freeze during system calls, including td_ucred
- inter-process operations
- process groups and sessions
Lots of references to sched_lock
and notes pointing at
specific primitives and related magic elsewhere in the document.
The select
and poll
functions permit threads to block waiting on events on file descriptors--most frequently,
whether or not the file descriptors are readable or writable.
...
The SIGIO service permits processes to request the delivery of a SIGIO signal to its
process group when the read/write status of specified file descriptors changes. At most
one process or process group is permitted to register for SIGIO from any given kernel
object, and that process or group is referred to as the owner. Each object supporting
SIGIO registration contains pointer field that is NULL
if
the object is not registered, or points to a struct sigio
describing the registration. This field is protected by a global mutex, sigio_lock
. Callers to SIGIO maintenance functions must pass in
this field “by reference” so that local register copies of the field are not
made when unprotected by the lock.
One struct sigio
is allocated for each registered
object associated with any process or process group, and contains back-pointers to the
object, owner, signal information, a credential, and the general disposition of the
registration. Each process or progress group contains a list of registered struct sigio
structures, p_sigiolst
for processes, and pg_sigiolst
for process groups. These lists are protected by
the process or process group locks respectively. Most fields in each struct sigio
are constant for the duration of the registration,
with the exception of the sio_pgsigio
field which links
the struct sigio
into the process or process group list.
Developers implementing new kernel objects supporting SIGIO will, in general, want to
avoid holding structure locks while invoking SIGIO supporting functions, such as fsetown
or funsetown
to avoid
defining a lock order between structure locks and the global SIGIO lock. This is
generally possible through use of an elevated reference count on the structure, such as
reliance on a file descriptor reference to a pipe during a pipe operation.
The sysctl
MIB service is invoked from both within the
kernel and from userland applications using a system call. At least two issues are raised
in locking: first, the protection of the structures maintaining the namespace, and
second, interactions with kernel variables and functions that are accessed by the sysctl
interface. Since sysctl permits the direct export (and modification) of kernel statistics
and configuration parameters, the sysctl mechanism must become aware of appropriate
locking semantics for those variables. Currently, sysctl makes use of a single global sx
lock to serialize use of sysctl
; however, it is assumed to
operate under Giant and other protections are not provided. The remainder of this section
speculates on locking and semantic changes to sysctl.
- Need to change the order of operations for sysctl's that update values from read old, copyin and copyout, write new to copyin, lock, read old and write new, unlock, copyout. Normal sysctl's that just copyout the old value and set a new value that they copyin may still be able to follow the old model. However, it may be cleaner to use the second model for all of the sysctl handlers to avoid lock operations.
- To allow for the common case, a sysctl could embed a pointer to a mutex in the SYSCTL_FOO macros and in the struct. This would work for most sysctl's. For values protected by sx locks, spin mutexes, or other locking strategies besides a single sleep mutex, SYSCTL_PROC nodes could be used to get the locking right.
The taskqueue's interface has two basic locks associated with it in order to protect
the related shared data. The taskqueue_queues_mutex
is meant
to serve as a lock to protect the taskqueue_queues
TAILQ.
The other mutex lock associated with this system is the one in the struct taskqueue
data structure. The use of the synchronization
primitive here is to protect the integrity of the data in the struct taskqueue
. It should be noted that there are no separate
macros to assist the user in locking down his/her own work since these locks are most
likely not going to be used outside of kern/subr_taskqueue.c.
This, and other documents, can be downloaded from ftp://ftp.FreeBSD.org/pub/FreeBSD/doc/.
For questions about FreeBSD, read the documentation before contacting <[email protected]>.
For questions about this documentation, e-mail <[email protected]>.