This section covers the machine-independent part of the Linuxulator: the emulation infrastructure needed for Linux® 2.6 emulation, the thread local storage (TLS) implementation (on i386), and futexes. It then briefly discusses some syscalls.
One of the major areas of progress in the development of Linux® 2.6 was threading. Prior to 2.6, Linux® threading support was implemented in the linuxthreads library. The library was a partial implementation of POSIX® threading. Each thread was a separate process, created with the clone syscall so that the processes shared the address space (and other things). The main weaknesses of this approach were that every thread had a different PID and that signal handling was broken (from the pthreads perspective). Performance was also poor (use of SIGUSR signals for thread synchronization, kernel resource consumption, etc.), so to overcome these problems a new threading system was developed and named NPTL.
The NPTL library focused on two things, but a third came along and is usually considered part of NPTL. The two things were embedding threads into the process structure and futexes. The third was TLS, which NPTL does not strictly require but on which the whole NPTL userland library depends. These improvements yielded much better performance and standards conformance. NPTL is the standard threading library on Linux® systems these days.
The FreeBSD Linuxulator implementation approaches NPTL in three main areas: TLS, futexes, and PID mangling, which is meant to simulate Linux® threads. The following sections describe each of these areas.
These sections deal with the way Linux® threads are managed and how we simulate that in FreeBSD.
The Linux® emulation layer in FreeBSD supports runtime setting of the emulated version. This is done via sysctl(8), namely compat.linux.osrelease, which is set to 2.4.2 by default (as of April 2007). For all Linux® versions up to 2.6 it only determined what uname(1) outputs. With 2.6 emulation it is different: setting this sysctl(8) affects the runtime behaviour of the emulation layer. Setting it to 2.6.x sets the value of linux_use_linux26, while setting it to anything else leaves it unset. This variable (plus per-prison variables of the very same kind) determines whether the 2.6 infrastructure (mainly PID mangling) is used in the code or not. The version setting is system-wide and affects all Linux® processes. The sysctl(8) should not be changed while any Linux® binary is running, as that might break things.
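The effect of the version setting can be modelled in a few lines. The following is a minimal userspace sketch, not the kernel code: the function sk_set_osrelease and the global sk_use_linux26 are hypothetical stand-ins for the sysctl handler and linux_use_linux26, and the prefix test is our assumption about what "2.6.x" means.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's linux_use_linux26 variable. */
int sk_use_linux26 = 0;

/*
 * Model of setting compat.linux.osrelease: a release string that starts
 * with "2.6" followed by '.' or the end of the string switches the 2.6
 * infrastructure on; everything else switches it off.
 */
void
sk_set_osrelease(const char *release)
{
	assert(release != NULL);
	sk_use_linux26 = (strncmp(release, "2.6", 3) == 0 &&
	    (release[3] == '.' || release[3] == '\0'));
}
```

With this model, "2.4.2" leaves the flag clear and "2.6.16" sets it, which matches the behaviour described above.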
The semantics of Linux® threading are a little confusing and use entirely different nomenclature from FreeBSD. A process in Linux® consists of a struct task embedding two identifier fields, PID and TGID. PID is not a process ID but a thread ID. The TGID identifies a thread group, in other words a process. For a single-threaded process the PID equals the TGID.
A thread in NPTL is just an ordinary process that happens to have a TGID not equal to its PID and a group leader not equal to itself (and shared VM etc., of course). Everything else happens in the same way as for an ordinary process. There is no separation of shared state into some external structure as in FreeBSD. This creates some duplication of information and possible data inconsistency. The Linux® kernel seems to use task -> group information in some places and task information elsewhere; it is really not very consistent and looks error-prone.
Every NPTL thread is created by a call to the
clone
syscall with a specific set of flags
(more in the next subsection). The NPTL implements strict
1:1 threading.
In FreeBSD we emulate NPTL threads with ordinary FreeBSD processes that share VM space, etc., and the PID gymnastics are just mimicked in the emulation-specific structure attached to the process. The structure attached to the process looks like:
    struct linux_emuldata {
            pid_t pid;

            int *child_set_tid;   /* in clone(): Child's TID to set on clone */
            int *child_clear_tid; /* in clone(): Child's TID to clear on exit */

            struct linux_emuldata_shared *shared;

            int pdeath_signal;    /* parent death signal */

            LIST_ENTRY(linux_emuldata) threads; /* list of linux threads */
    };
The PID is used to identify the FreeBSD process that attaches this structure. The child_set_tid and child_clear_tid are used for TID address copyout when a process is created and when it exits. The shared pointer points to a structure shared among threads. The pdeath_signal variable identifies the parent death signal and the threads pointer is used to link this structure to the list of threads. The linux_emuldata_shared structure looks like:
    struct linux_emuldata_shared {
            int refs;

            pid_t group_pid;

            LIST_HEAD(, linux_emuldata) threads; /* head of list of linux threads */
    };
The refs field is a reference counter used to determine when we can free the structure, to avoid memory leaks. The group_pid field identifies the PID ( = TGID) of the whole process ( = thread group). The threads pointer is the head of the list of threads in the process.
The linux_emuldata
structure can be obtained
from the process using em_find
. The prototype
of the function is:
struct linux_emuldata *em_find(struct proc *, int locked);
Here, proc
is the process we want the emuldata
structure from and the locked parameter determines whether we want to
lock or not. The accepted values are EMUL_DOLOCK
and EMUL_DOUNLOCK
. More about locking
later.
Because FreeBSD and Linux® differ in what they consider a process ID and a thread ID, we have to translate between the two views. We do it by PID mangling. This means that we fake what a PID (=TGID) and TID (=PID) is between kernel and userland. The rule of thumb is that in kernel (in Linuxulator) PID = PID and TGID = shared -> group_pid, and to userland we present PID = shared -> group_pid and TID = proc -> p_pid. The PID member of the linux_emuldata structure is a FreeBSD PID.
The above mainly affects the getpid, getppid and gettid syscalls, where we use PID/TGID respectively. In the copyout of TIDs in child_clear_tid and child_set_tid we copy out the FreeBSD PID.
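The mangling rule can be illustrated with a small userspace model. This is only a sketch under our own assumptions: the sk_ structures are cut-down copies of linux_emuldata and linux_emuldata_shared that keep just the PID fields, and the two functions are hypothetical helpers, not the actual syscall implementations.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Simplified copies of the two structures, keeping only the PID fields. */
struct sk_emuldata_shared {
	int   refs;
	pid_t group_pid;	/* PID (= TGID) of the whole thread group */
};

struct sk_emuldata {
	pid_t pid;		/* FreeBSD PID of this "thread" process */
	struct sk_emuldata_shared *shared;
};

/* What Linux userland sees as getpid(): the thread group ID. */
pid_t
sk_linux_getpid(const struct sk_emuldata *em)
{
	assert(em != NULL && em->shared != NULL);
	return (em->shared->group_pid);
}

/* What Linux userland sees as gettid(): the FreeBSD PID of this process. */
pid_t
sk_linux_gettid(const struct sk_emuldata *em)
{
	assert(em != NULL);
	return (em->pid);
}
```

Two emulated threads that share one sk_emuldata_shared report the same PID to userland but different TIDs, which is exactly the view a Linux® program expects.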
The clone
syscall is the way threads are
created in Linux®. The syscall prototype looks like this:
int linux_clone(l_int flags, void *stack, void *parent_tidptr, int dummy, void * child_tidptr);
The flags
parameter tells the syscall how
exactly the processes should be cloned. As described above, Linux®
can create processes sharing various things independently, for
example two processes can share file descriptors but not VM, etc.
The lowest byte of the flags parameter is the exit signal of the newly created process. The stack parameter, if non-NULL, tells where the thread stack is; if it is NULL we are supposed to copy-on-write the calling process stack (i.e. do what the normal fork(2) routine does). The parent_tidptr parameter is used as an address for copying out the process PID (i.e. thread id) once the process is sufficiently instantiated but is not runnable yet. The dummy parameter is here because of the very strange calling convention of this syscall on i386. It uses the registers directly and does not let the compiler do it, which results in the need for a dummy parameter. The child_tidptr parameter is used as an address for copying out the PID once the process has finished forking and when the process exits.
The syscall itself proceeds by setting the corresponding flags depending on the flags passed in. For example, CLONE_VM maps to RFMEM (sharing of VM), etc. The only nit here is CLONE_FS and CLONE_FILES, because FreeBSD does not allow setting these separately, so we fake it by not setting RFFDG (copying of the fd table and other fs information) if either of these is set. This does not cause any problems, because those flags are always set together. After setting the flags the process is forked using the internal fork1 routine, which is instructed not to put the process on a run queue, i.e. not to set it runnable. After the forking is done we possibly reparent the newly created process to emulate CLONE_PARENT semantics.
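The flag translation described above can be sketched as follows. This is a simplified userspace model, not the code of linux_clone: the SK_CLONE_ values are the Linux® clone flag values as we know them, while the SK_RF constants only borrow the names of rfork(2) flags and their numeric values should be treated as placeholders. The use of SK_RFPROC and SK_RFSTOPPED to express "create a process that is not runnable yet" is our assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Linux clone flag values. */
#define SK_CLONE_VM    0x00000100
#define SK_CLONE_FS    0x00000200
#define SK_CLONE_FILES 0x00000400
#define SK_CSIGNAL     0x000000ff	/* exit signal lives in the lowest byte */

/* rfork-style flags; names follow rfork(2), values are placeholders. */
#define SK_RFFDG     (1 << 2)	/* copy the fd table */
#define SK_RFPROC    (1 << 4)	/* create a new process */
#define SK_RFMEM     (1 << 5)	/* share the address space */
#define SK_RFSTOPPED (1 << 17)	/* keep the child off the run queue */

/* Split Linux clone flags into fork flags and the exit signal. */
void
sk_clone_translate(int flags, int *fork_flags, int *exit_signal)
{
	int ff = SK_RFPROC | SK_RFSTOPPED;

	assert(fork_flags != NULL && exit_signal != NULL);
	if (flags & SK_CLONE_VM)
		ff |= SK_RFMEM;
	/* Copy the fd table only if neither sharing flag was requested. */
	if ((flags & (SK_CLONE_FS | SK_CLONE_FILES)) == 0)
		ff |= SK_RFFDG;
	*fork_flags = ff;
	*exit_signal = flags & SK_CSIGNAL;
}
```

Note how a caller that sets only one of CLONE_FS and CLONE_FILES still ends up sharing both, which is the faking mentioned above.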
The next part is creating the emulation data. Threads in Linux® do not signal their parents, so we set the exit signal to 0 to disable this. After that, child_set_tid and child_clear_tid are set, enabling the functionality later in the code. At this point we copy out the PID to the address specified by parent_tidptr. The process stack is set by simply rewriting the thread frame %esp register (%rsp on amd64).
The next part is setting up TLS for the newly created process. After this, vfork(2) semantics might be emulated, and finally the newly created process is put on a run queue and its PID is copied out to the parent process via the clone return value.
The clone syscall can be, and in fact is, used for emulating the classic fork(2) and vfork(2) syscalls. Newer glibc on a 2.6 kernel uses clone to implement the fork(2) and vfork(2) syscalls.
The locking is implemented per-subsystem because we do not expect a lot of contention on these. There are two locks: emul_lock, used to protect manipulation of linux_emuldata, and emul_shared_lock, used to protect manipulation of linux_emuldata_shared. The emul_lock is a nonsleepable blocking mutex while emul_shared_lock is a sleepable blocking sx_lock. Because of the per-subsystem locking we can coalesce some locks, and that is why em_find offers non-locking access.
This section deals with TLS, also known as thread local storage.
Threads in computer science are entities within a process that can be scheduled independently from each other. The threads in a process share process-wide data (file descriptors, etc.) but also have their own stack for their own data. Sometimes there is a need for process-wide data specific to a given thread. Imagine the name of the thread in execution or something like that. The traditional UNIX® threading API, pthreads, provides a way to do it via pthread_key_create(3), pthread_setspecific(3) and pthread_getspecific(3), where a thread can create a key to the thread local data and use pthread_getspecific(3) or pthread_setspecific(3) to manipulate those data. You can easily see that this is not the most comfortable way this could be accomplished. So various producers of C/C++ compilers introduced a better way. They defined a new modifier keyword, __thread, that specifies that a variable is thread specific. A new method of accessing such variables was developed as well (at least on i386). The pthreads method tends to be implemented in userspace as a trivial lookup table. The performance of such a solution is not very good. So the new method uses (on i386) segment registers to address a segment where the TLS area is stored, so accessing a thread variable is just a matter of prefixing the address with the segment register, thus addressing via it. The segment registers are usually %gs and %fs, acting like segment selectors. Every thread has its own area where the thread local data are stored, and the segment must be loaded on every context switch. This method is very fast and used almost exclusively in the whole i386 UNIX® world. Both FreeBSD and Linux® implement this approach and it yields very good results. The only drawback is the need to reload the segment on every context switch, which can slow down context switches. FreeBSD tries to avoid this overhead by using only 1 segment descriptor for this while Linux® uses 3. Interestingly, almost nothing uses more than 1 descriptor (only Wine seems to use 2), so Linux® pays this unnecessary price for context switches.
The i386 architecture implements so-called segments. A segment is a description of an area of memory: the base address (bottom) of the memory area, the end of it (ceiling), type, protection, etc. The memory described by a segment can be accessed using segment selector registers (%cs, %ds, %ss, %es, %fs, %gs). For example, let us suppose we have a segment whose base address is 0x1234 (and some length) and this code:
mov %edx,%gs:0x10
This will store the content of the %edx register into memory location 0x1244. Some segment registers have a special use, for example %cs is used for the code segment and %ss is used for the stack segment, but %fs and %gs are generally unused. Segments are stored either in a global GDT table or in a local LDT table. The LDT is accessed via an entry in the GDT. The LDT can store more types of segments. The LDT can be per process. Both tables define up to 8191 entries.
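The address computation in the example above is simply base plus offset, subject to a limit check. The following sketch models that in plain C. The sk_segment structure and sk_segment_translate are hypothetical names, and the structure is a drastic simplification of a real descriptor (no type or protection bits).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal model of a segment: where it starts and how far it reaches. */
struct sk_segment {
	uint32_t base;	/* base address (bottom) */
	uint32_t limit;	/* highest valid offset within the segment */
};

/*
 * Translate a segment-relative offset into a linear address.
 * Returns 0 on success and -1 if the offset lies outside the segment,
 * which on real hardware would raise a protection fault.
 */
int
sk_segment_translate(const struct sk_segment *seg, uint32_t offset,
    uint32_t *linear)
{
	assert(seg != NULL && linear != NULL);
	if (offset > seg->limit)
		return (-1);
	*linear = seg->base + offset;
	return (0);
}
```

For a segment with base 0x1234, translating offset 0x10 gives 0x1244, the location written by the mov instruction in the example.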
There are two main ways of setting up TLS in Linux®. It can be set when cloning a process using the clone syscall, or the process can call set_thread_area. When a process passes the CLONE_SETTLS flag to clone, the kernel expects the memory pointed to by the %esi register to be a Linux® user space representation of a segment, which gets translated to the machine representation of a segment and loaded into a GDT slot. The GDT slot can be specified with a number, or -1 can be used, meaning that the system itself should choose the first free slot. In practice, the vast majority of programs use only one TLS entry and do not care about the number of the entry. We exploit this in the emulation and in fact depend on it.
Loading of TLS for the current thread happens by calling set_thread_area, while loading TLS for a second process in clone is done in a separate block in clone. Those two functions are very similar. The only difference is the actual loading of the GDT segment, which happens on the next context switch for the newly created process, while set_thread_area must load it directly. The code basically does this. It copies the Linux® form segment descriptor from userland. The code checks the number of the descriptor, but because this differs between FreeBSD and Linux® we fake it a little. We only support indexes of 6, 3 and -1. The 6 is the genuine Linux® number, 3 is the genuine FreeBSD one and -1 means autoselection. Then we set the descriptor number to constant 3 and copy this out to userspace. We rely on the userspace process using the number from the descriptor, but this works most of the time (we have never seen a case where this did not work) as the userspace process typically passes in -1. Then we convert the descriptor from the Linux® form to a machine dependent form (i.e. operating system independent form) and copy this to the FreeBSD defined segment descriptor. Finally we can load it. We assign the descriptor to the thread's PCB (process control block) and load the %gs segment using load_gs. This loading must be done in a critical section so that nothing can interrupt us. The CLONE_SETTLS case works exactly like this, except that the loading using load_gs is not performed. The segment used for this (segment number 3) is shared for this use between FreeBSD processes and Linux® processes, so the Linux® emulation layer does not add any overhead over plain FreeBSD.
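The index handling described above is small enough to model directly. The sketch below is a userspace illustration only: sk_fixup_tls_entry and the SK_ constants are hypothetical names, and the real code works on the full Linux® descriptor copied in from userland rather than on a bare integer.

```c
#include <assert.h>
#include <stddef.h>

#define SK_LINUX_TLS_ENTRY   6		/* genuine Linux index, per the text */
#define SK_FREEBSD_TLS_ENTRY 3		/* genuine FreeBSD index, per the text */
#define SK_TLS_AUTOSELECT    (-1)	/* "pick a slot for me" */

/*
 * Validate the entry number passed in from userland and rewrite it to the
 * single slot the emulation uses. Returns 0 on success, or -1 (leaving the
 * value untouched) if the index is not one of the supported three.
 */
int
sk_fixup_tls_entry(int *entry_number)
{
	assert(entry_number != NULL);
	switch (*entry_number) {
	case SK_LINUX_TLS_ENTRY:
	case SK_FREEBSD_TLS_ENTRY:
	case SK_TLS_AUTOSELECT:
		/* This is the value that would be copied back out. */
		*entry_number = SK_FREEBSD_TLS_ENTRY;
		return (0);
	default:
		return (-1);
	}
}
```

Whatever supported index the program asks for, it is told it got slot 3, which is why the emulation depends on userland using the number returned in the descriptor.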
The amd64 implementation is similar to the i386 one but there was initially no 32bit segment descriptor used for this purpose (hence not even native 32bit TLS users worked) so we had to add such a segment and implement its loading on every context switch (when a flag signaling use of 32bit is set). Apart from this the TLS loading is exactly the same just the segment numbers are different and the descriptor format and the loading differs slightly.
Threads need some kind of synchronization and POSIX® provides several mechanisms: mutexes for mutual exclusion, read-write locks for mutual exclusion with a biased ratio of reads and writes, and condition variables for signaling a status change. It is interesting to note that the POSIX® threading API lacks support for semaphores. The implementations of those synchronization routines depend heavily on the type of threading support we have. In a pure 1:M (userspace) model the implementation can be done solely in userspace and thus be very fast (the condition variables will probably end up being implemented using signals, i.e. not fast) and simple. In the 1:1 model, the situation is also quite clear: the threads must be synchronized using kernel facilities (which is very slow because a syscall must be performed). The mixed M:N scenario either combines the first and second approaches or relies solely on the kernel. Thread synchronization is a vital part of thread-enabled programming and its performance can affect the resulting program a lot. Recent benchmarks on the FreeBSD operating system showed that an improved sx_lock implementation yielded a 40% speedup in ZFS (a heavy sx user). This is in-kernel stuff, but it shows clearly how important the performance of synchronization primitives is.
Threaded programs should be written with as little contention on locks as possible. Otherwise, instead of doing useful work the thread just waits on a lock. Because of this, most well-written threaded programs show little lock contention.
Linux® implements 1:1 threading, i.e. it has to use in-kernel synchronization primitives. As stated earlier, well-written threaded programs have little lock contention. So a typical sequence could be performed as two atomic increments/decrements of a mutex reference counter, which is very fast, as presented by the following example:
    pthread_mutex_lock(&mutex);
    ....
    pthread_mutex_unlock(&mutex);
1:1 threading forces us to perform two syscalls for those mutex calls, which is very slow.
The solution Linux® 2.6 implements is called futexes. Futexes implement the check for contention in userspace and call kernel primitives only in a case of contention. Thus the typical case takes place without any kernel intervention. This yields reasonably fast and flexible synchronization primitives implementation.
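To make the idea concrete, here is a sketch of a mutex whose uncontended path never enters the kernel. It follows a widely described three-state scheme (0 unlocked, 1 locked, 2 locked with possible waiters) and is not meant to reproduce what any particular libc does. The sk_futex_wait and sk_futex_wake functions are hypothetical stubs that only count calls, standing in for the real syscalls, so with them a contended lock degenerates into spinning.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Mutex states: 0 = unlocked, 1 = locked, 2 = locked, maybe with waiters. */
struct sk_mutex {
	atomic_int state;
};

/* Counts how often the "kernel" was entered; stands in for real syscalls. */
int sk_kernel_calls = 0;

/* Hypothetical stubs for the FUTEX_WAIT and FUTEX_WAKE operations. */
void
sk_futex_wait(atomic_int *uaddr, int val)
{
	(void)uaddr; (void)val;
	sk_kernel_calls++;
}

void
sk_futex_wake(atomic_int *uaddr, int n)
{
	(void)uaddr; (void)n;
	sk_kernel_calls++;
}

void
sk_mutex_lock(struct sk_mutex *m)
{
	int c = 0;

	assert(m != NULL);
	/* Fast path: 0 -> 1 with one atomic operation, no kernel involved. */
	if (atomic_compare_exchange_strong(&m->state, &c, 1))
		return;
	/* Slow path: mark the mutex contended and sleep until it is free. */
	if (c != 2)
		c = atomic_exchange(&m->state, 2);
	while (c != 0) {
		sk_futex_wait(&m->state, 2);
		c = atomic_exchange(&m->state, 2);
	}
}

void
sk_mutex_unlock(struct sk_mutex *m)
{
	assert(m != NULL);
	/* Fast path: 1 -> 0 means nobody was waiting, no kernel involved. */
	if (atomic_fetch_sub(&m->state, 1) != 1) {
		atomic_store(&m->state, 0);
		sk_futex_wake(&m->state, 1);
	}
}
```

An uncontended lock and unlock pair leaves sk_kernel_calls at zero; only an unlock that finds the state at 2 pays for a wake call.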
The futex syscall looks like this:
int futex(void *uaddr, int op, int val, struct timespec *timeout, void *uaddr2, int val3);
In this example uaddr
is an address of the
mutex in userspace, op
is an operation we are
about to perform and the other parameters have per-operation
meaning.
Futexes implement the following operations:
FUTEX_WAIT
FUTEX_WAKE
FUTEX_FD
FUTEX_REQUEUE
FUTEX_CMP_REQUEUE
FUTEX_WAKE_OP
FUTEX_WAIT: This operation verifies that the value val is written at address uaddr. If not, EWOULDBLOCK is returned; otherwise the thread is queued on the futex and gets suspended. If the argument timeout is non-zero it specifies the maximum time for the sleeping, otherwise the sleeping is infinite.
FUTEX_WAKE: This operation takes the futex at uaddr and wakes up the first val threads queued on this futex.
FUTEX_REQUEUE: This operation takes val threads queued on the futex at uaddr, wakes them up, and takes the next val2 threads and requeues them on the futex at uaddr2.
FUTEX_CMP_REQUEUE: This operation does the same as FUTEX_REQUEUE but it first checks that the value at uaddr equals val3.
FUTEX_WAKE_OP: This operation performs an atomic operation, encoded in val3 (which packs several other values), on uaddr2. Then it wakes up val threads on the futex at uaddr, and if the atomic operation returned a positive number it wakes up val2 threads on the futex at uaddr2.
The operations implemented in
FUTEX_WAKE_OP
:
FUTEX_OP_SET
FUTEX_OP_ADD
FUTEX_OP_OR
FUTEX_OP_ANDN
FUTEX_OP_XOR
There is no val2
parameter in the
futex prototype. The val2
is taken from the
struct timespec *timeout
parameter
for operations FUTEX_REQUEUE
,
FUTEX_CMP_REQUEUE
and
FUTEX_WAKE_OP
.
The futex emulation in FreeBSD is taken from NetBSD and further
extended by us. It is placed in linux_futex.c
and linux_futex.h
files. The
futex
structure looks like:
    struct futex {
            void *f_uaddr;
            int f_refcount;

            LIST_ENTRY(futex) f_list;

            TAILQ_HEAD(lf_waiting_proc, waiting_proc) f_waiting_proc;
    };
And the structure waiting_proc
is:
    struct waiting_proc {
            struct thread *wp_t;

            struct futex *wp_new_futex;

            TAILQ_ENTRY(waiting_proc) wp_list;
    };
A futex is obtained using the futex_get function, which searches a linear list of futexes and returns the found one or creates a new futex. When we are done with a futex we call the futex_put function, which decreases the reference counter of the futex; if the refcount reaches zero the futex is released.
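The get/put pattern can be sketched in userspace as follows. This is an illustrative model under our own assumptions, not the code from linux_futex.c: the sk_ names are hypothetical, a plain singly linked list replaces the kernel list macros, the waiter queue is left out, and no locking is shown.

```c
#include <assert.h>
#include <stdlib.h>

struct sk_futex {
	void *f_uaddr;			/* userspace address used as the key */
	int f_refcount;
	struct sk_futex *f_next;	/* plain singly linked list */
};

struct sk_futex *sk_futex_list = NULL;

/* Return the futex for uaddr, creating it on first use; takes a reference. */
struct sk_futex *
sk_futex_get(void *uaddr)
{
	struct sk_futex *f;

	for (f = sk_futex_list; f != NULL; f = f->f_next) {
		if (f->f_uaddr == uaddr) {
			f->f_refcount++;
			return (f);
		}
	}
	f = malloc(sizeof(*f));
	if (f == NULL)
		return (NULL);
	f->f_uaddr = uaddr;
	f->f_refcount = 1;
	f->f_next = sk_futex_list;
	sk_futex_list = f;
	return (f);
}

/* Drop a reference; unlink and free the futex when the last one goes. */
void
sk_futex_put(struct sk_futex *f)
{
	struct sk_futex **pp;

	assert(f != NULL && f->f_refcount > 0);
	if (--f->f_refcount > 0)
		return;
	for (pp = &sk_futex_list; *pp != NULL; pp = &(*pp)->f_next) {
		if (*pp == f) {
			*pp = f->f_next;
			break;
		}
	}
	free(f);
}
```

Two lookups of the same address return the same object with a reference count of two, and the object disappears from the list after the second put.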
When a futex queues a thread for sleeping it creates a waiting_proc structure and puts this structure on the list inside the futex structure, then it just performs a tsleep(9) to suspend the thread. The sleep can be timed out. After tsleep(9) returns (the thread was woken up or it timed out) the waiting_proc structure is removed from the list and destroyed. All this is done in the futex_sleep function. If we got woken up from futex_wake with wp_new_futex set, we sleep on that new futex. This way the actual requeueing is done in this function.
Waking up a thread sleeping on a futex is performed in the futex_wake function. First, in this function we mimic the strange Linux® behaviour, where it wakes up N threads for all operations, the only exception being that the REQUEUE operations are performed on N+1 threads. But this usually does not make any difference as we are waking up all threads. Next, in a loop, we wake up n threads; after this we check whether there is a new futex for requeueing. If so, we requeue up to n2 threads on the new futex. This cooperates with futex_sleep.
The FUTEX_WAKE_OP operation is quite complicated. First we obtain two futexes at addresses uaddr and uaddr2, then we perform the atomic operation using val3 and uaddr2. Then val waiters on the first futex are woken up, and if the atomic operation condition holds we wake up val2 (i.e. timeout) waiters on the second futex.
The atomic operation takes two parameters, encoded_op and uaddr. The encoded operation encodes the operation itself, the comparator, the operation argument, and the comparing argument. The pseudocode for the operation is like this one:
    oldval = *uaddr2
    *uaddr2 = oldval OP oparg
And this is done atomically. First the number at uaddr is copied in and the operation is done. The code handles page faults, and if no page fault occurs oldval is compared to the cmparg argument with the cmp comparator.
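The decoding and the operation itself can be sketched in portable C. The bit layout and the numeric codes below follow the FUTEX_OP encoding from the Linux® futex header as far as we know it (op in bits 28 to 31, cmp in bits 24 to 27, then two signed 12-bit arguments). The sk_ names are hypothetical, the update is deliberately not atomic, page faults are not modelled, and the variant that shifts oparg is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { SK_OP_SET, SK_OP_ADD, SK_OP_OR, SK_OP_ANDN, SK_OP_XOR };
enum { SK_CMP_EQ, SK_CMP_NE, SK_CMP_LT, SK_CMP_LE, SK_CMP_GT, SK_CMP_GE };

/* Pack op, oparg, cmp and cmparg into one 32-bit word. */
#define SK_FUTEX_OP(op, oparg, cmp, cmparg)				\
	((((uint32_t)(op) & 0xf) << 28) | (((uint32_t)(cmp) & 0xf) << 24) | \
	 (((uint32_t)(oparg) & 0xfff) << 12) | ((uint32_t)(cmparg) & 0xfff))

/* Sign-extend the low 12 bits of v. */
static int
sk_sext12(uint32_t v)
{
	v &= 0xfff;
	return ((v & 0x800) ? (int)v - 0x1000 : (int)v);
}

/*
 * Non-atomic model: *uaddr2 = oldval OP oparg, then report whether
 * "oldval CMP cmparg" holds. Returns 1 or 0, or -1 for an unknown code
 * (in which case *uaddr2 is left untouched).
 */
int
sk_futex_op(uint32_t encoded_op, int *uaddr2)
{
	int op = (int)((encoded_op >> 28) & 0xf);
	int cmp = (int)((encoded_op >> 24) & 0xf);
	int oparg = sk_sext12(encoded_op >> 12);
	int cmparg = sk_sext12(encoded_op);
	int oldval, newval, result;

	assert(uaddr2 != NULL);
	oldval = *uaddr2;
	switch (op) {
	case SK_OP_SET:  newval = oparg; break;
	case SK_OP_ADD:  newval = oldval + oparg; break;
	case SK_OP_OR:   newval = oldval | oparg; break;
	case SK_OP_ANDN: newval = oldval & ~oparg; break;
	case SK_OP_XOR:  newval = oldval ^ oparg; break;
	default:         return (-1);
	}
	switch (cmp) {
	case SK_CMP_EQ: result = (oldval == cmparg); break;
	case SK_CMP_NE: result = (oldval != cmparg); break;
	case SK_CMP_LT: result = (oldval < cmparg); break;
	case SK_CMP_LE: result = (oldval <= cmparg); break;
	case SK_CMP_GT: result = (oldval > cmparg); break;
	case SK_CMP_GE: result = (oldval >= cmparg); break;
	default:        return (-1);
	}
	*uaddr2 = newval;
	return (result);
}
```

The return value plays the role of the "atomic operation condition": a caller modelling FUTEX_WAKE_OP would wake waiters on the second futex only when it is 1.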
In this section I am going to describe some smaller syscalls that are worth mentioning because their implementation is not obvious or because they are interesting from another point of view.
During development of the Linux® 2.6.16 kernel, the *at syscalls were added. Those syscalls (openat for example) work exactly like their at-less counterparts with the slight exception of the dirfd parameter. This parameter changes where the given file, on which the syscall is to be performed, is looked up. When the filename parameter is absolute, dirfd is ignored, but when the path to the file is relative, it comes into play. The dirfd parameter is a file descriptor of some directory, relative to which the relative pathname is resolved, or AT_FDCWD. So for example the openat syscall can be used like this:
    file descriptor 123 = /tmp/foo/, current working directory = /tmp/

    openat(123, "/tmp/bah", flags, mode)    /* opens /tmp/bah */
    openat(123, "bah", flags, mode)         /* opens /tmp/foo/bah */
    openat(AT_FDCWD, "bah", flags, mode)    /* opens /tmp/bah */
    openat(stdio, "bah", flags, mode)       /* returns error because stdio is not a directory */
This infrastructure is necessary to avoid races when opening files outside the working directory. Imagine that a process consists of two threads, thread A and thread B. Thread A issues open("/tmp/foo/bah", flags, mode) and before returning it gets preempted and thread B runs. Thread B does not care about the needs of thread A and renames or removes /tmp/foo/. We got a race. To avoid this we can open /tmp/foo and use it as dirfd for the openat syscall. This also enables the user to implement per-thread working directories.
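From the userland side, the race-free pattern looks roughly like this. The sketch uses only the POSIX open, openat and close calls. The helper name sk_open_in_dir is hypothetical, and a real program would usually keep the directory descriptor open for further lookups instead of closing it immediately.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Open the file "name" relative to the directory "dir". The directory is
 * pinned by its descriptor, so renaming it after the first open() does
 * not redirect the lookup of "name". Returns a file descriptor, or -1
 * with errno set.
 */
int
sk_open_in_dir(const char *dir, const char *name, int flags, mode_t mode)
{
	int dirfd, fd, saved_errno;

	assert(dir != NULL && name != NULL);
	dirfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0)
		return (-1);
	fd = openat(dirfd, name, flags, mode);
	saved_errno = errno;
	close(dirfd);		/* may clobber errno, so restore it */
	errno = saved_errno;
	return (fd);
}
```

Passing a descriptor that does not refer to a directory as dirfd with a relative path makes openat fail with ENOTDIR, which corresponds to the last case in the example above.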
The Linux® family of *at syscalls contains: linux_openat, linux_mkdirat, linux_mknodat, linux_fchownat, linux_futimesat, linux_fstatat64, linux_unlinkat, linux_renameat, linux_linkat, linux_symlinkat, linux_readlinkat, linux_fchmodat and linux_faccessat. All these are implemented using the modified namei(9) routine and a simple wrapping layer.
The implementation is done by altering the namei(9) routine (described above) to take an additional parameter dirfd in its nameidata structure, which specifies the starting point of the pathname lookup instead of using the current working directory every time. The resolution of dirfd from a file descriptor number to a vnode is done in the native *at syscalls. When dirfd is AT_FDCWD, the dvp entry in the nameidata structure is NULL, but when dirfd is a different number we obtain a file for this file descriptor, check whether this file is valid, and if there is a vnode attached to it we get that vnode. Then we check that this vnode is a directory. In the actual namei(9) routine we simply substitute the dvp vnode for the dp variable, which determines the starting point. The namei(9) routine is not used directly but via a chain of different functions on various levels. For example, openat goes like this:
    openat() --> kern_openat() --> vn_open() --> namei()
For this reason kern_open and vn_open must be altered to incorporate the additional dirfd parameter. No compat layer is created for those because there are not many users of them and the users can be easily converted. This general implementation enables FreeBSD to implement its own *at syscalls. This is being discussed right now.
The ioctl interface is quite fragile due to its generality. We have to bear in mind that devices differ between Linux® and FreeBSD, so some care must be applied to get ioctl emulation right. The ioctl handling is implemented in linux_ioctl.c, where the linux_ioctl function is defined. This function simply iterates over sets of ioctl handlers to find a handler that implements a given command. The ioctl syscall has three parameters: the file descriptor, the command and an argument. The command is a 16-bit number, which in theory is divided into the high 8 bits, determining the class of the ioctl command, and the low 8 bits, which are the actual command within the given set. The emulation takes advantage of this division. We implement handlers for each set, like sound_handler or disk_handler. Each handler has a maximum command and a minimum command defined, which are used for determining which handler is used. There are slight problems with this approach because Linux® does not use the set division consistently, so sometimes ioctls for a different set are inside a set they should not belong to (SCSI generic ioctls inside the cdrom set, etc.). FreeBSD currently does not implement many Linux® ioctls (compared to NetBSD, for example) but the plan is to port those from NetBSD. The trend is to use Linux® ioctls even in the native FreeBSD drivers because of the easy porting of applications.
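The range-based dispatch can be sketched as a small table walk. This is a userspace illustration under our own assumptions: the sk_ types and handlers are hypothetical, the class bytes 0x12 and 0x53 are made up for the example, and the real handlers take the thread and the syscall arguments rather than a bare command number.

```c
#include <assert.h>
#include <stddef.h>

typedef int sk_ioctl_fn(unsigned int cmd);

struct sk_ioctl_handler {
	sk_ioctl_fn *func;
	unsigned int low;	/* minimum command handled */
	unsigned int high;	/* maximum command handled */
};

/* Two toy handlers that just tag the low byte of the command. */
int sk_disk_handler(unsigned int cmd)  { return (100 + (int)(cmd & 0xff)); }
int sk_sound_handler(unsigned int cmd) { return (200 + (int)(cmd & 0xff)); }

const struct sk_ioctl_handler sk_handlers[] = {
	{ sk_disk_handler,  0x1200, 0x12ff },	/* class byte 0x12 (made up) */
	{ sk_sound_handler, 0x5300, 0x53ff },	/* class byte 0x53 (made up) */
};

/* Walk the handler table and dispatch; -1 means no handler claims cmd. */
int
sk_ioctl_dispatch(unsigned int cmd)
{
	size_t i;

	assert(cmd <= 0xffff);	/* commands are 16-bit numbers */
	for (i = 0; i < sizeof(sk_handlers) / sizeof(sk_handlers[0]); i++) {
		if (cmd >= sk_handlers[i].low && cmd <= sk_handlers[i].high)
			return (sk_handlers[i].func(cmd));
	}
	return (-1);
}
```

Because the match is on a range rather than on the class byte alone, a handler can also claim commands that Linux® placed in a set where they do not really belong.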
Every syscall should be debuggable. For this purpose we introduce a small infrastructure. We have the ldebug facility, which tells whether a given syscall should be debugged (settable via a sysctl). For printing we have LMSG and ARGS macros. Those are used for altering a printable string for uniform debugging messages.