This chapter describes processes and threads in NetBSD. This includes process startup, traps and system calls, process and thread creation and termination, signal delivery, and thread scheduling.
CAUTION! This chapter is a work in progress: it has not yet been reviewed for typos or for technical mistakes.
On Unix systems, new programs are started using the
execve
system call. If successful,
execve
replaces the currently-executing program
by a new one. This is done within the same process, by reinitializing
the whole virtual memory mapping and loading the new program binary in
memory. All the process's threads (except for the calling one) are
terminated, and the calling thread's CPU context is reset for executing
the new program's startup code.
Here is the execve
prototype:
int execve(const char *path, char *const argv[], char *const envp[]);
path
is the filesystem path to the new
executable. argv
and envp
are two NULL-terminated string arrays that hold the new program
arguments and environment variables. execve
is
responsible for copying the arrays to the new process stack.
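The NULL-terminated array convention can be illustrated with a minimal sketch. The helper count_strings below is hypothetical (it is not a NetBSD function); it only shows how the number of arguments or environment strings can be recovered from such an array before it is copied.

```c
#include <stddef.h>

/*
 * Hypothetical helper, not part of NetBSD: count the entries of a
 * NULL-terminated string array such as argv or envp.
 */
size_t
count_strings(char *const vec[])
{
	size_t n = 0;

	/* The array has no explicit length; the NULL entry ends it. */
	while (vec[n] != NULL)
		n++;
	return n;
}
```

For an argv array holding "ls", "-l" and a terminating NULL, the helper returns 2, which is the value a program would later see as its argument count.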
Here is the top-down modular diagram for execve
implementation in the NetBSD kernel when executing a native 32 bit ELF
binary on an i386 machine:
src/sys/kern/kern_exec.c: sys_execve
src/sys/kern/kern_exec.c: execve1
src/sys/kern/kern_exec.c: check_exec
src/sys/kern/kern_verifiedexec.c: veriexec_verify
src/sys/kern/kern_conf.c: *execsw[]->es_makecmds
src/sys/kern/exec_elf32.c: exec_elf_makecmds
src/sys/kern/exec_elf32.c: exec_check_header
src/sys/kern/exec_elf32.c: exec_read_from
src/sys/kern/exec_conf.c: *execsw[]->u.elf_probe_func
src/sys/kern/exec_elf32.c: netbsd_elf_probe
src/sys/kern/exec_elf32.c: elf_load_psection
src/sys/kern/exec_elf32.c: elf_load_file
src/sys/kern/exec_conf.c: *execsw[]->es_setup_stack
src/sys/kern/exec_subr.c: exec_setup_stack
*fetch_element
src/sys/kern/kern_exec.c: execve_fetch_element
*vcp->ev_proc
src/sys/kern/exec_subr.c: vmcmd_map_zero
src/sys/kern/exec_subr.c: vmcmd_map_pagedvn
src/sys/kern/exec_subr.c: vmcmd_map_readvn
src/sys/kern/exec_subr.c: vmcmd_readvn
src/sys/kern/exec_conf.c: *execsw[]->es_copyargs
src/sys/kern/kern_exec.c: copyargs
src/sys/kern/kern_clock.c: stopprofclock
src/sys/kern/kern_descrip.c: fdcloseexec
src/sys/kern/kern_sig.c: execsigs
src/sys/kern/kern_ras.c: ras_purgeall
src/sys/kern/exec_subr.c: doexechooks
src/sys/sys/event.h: KNOTE
src/sys/kern/kern_event.c: knote
src/sys/kern/exec_conf.c: *execsw[]->es_setregs
src/sys/arch/i386/i386/machdep.c: setregs
src/sys/kern/kern_exec.c: exec_sigcode_map
src/sys/kern/kern_exec.c: *p->p_emul->e_proc_exit (NULL)
src/sys/kern/kern_exec.c: *p->p_emul->e_proc_exec (NULL)
sys_execve
calls execve1
with a pointer to a function called fetch_element
,
which is responsible for loading program arguments and environment variables
into kernel space.
The primary reason for this abstraction is to allow fetching
pointers from a 32 bit process on a 64 bit system.
execve1
uses a variable of type
struct exec_package (defined in
src/sys/sys/exec.h
) to share information
with the called functions.
The es_makecmds
method is responsible for checking
if the program can be loaded, and for building a set of virtual memory
commands (vmcmds) that can be used later to set up the virtual memory
space and to load the program code and data sections. The set of
vmcmds is stored in the ep_vmcmds
field of the
exec package. Using a set of vmcmds allows the
execution process to be cancelled before a commitment point.
The exec switch is an array of structure struct execsw
defined in src/sys/kern/exec_conf.c
:
execsw[]
.
The struct execsw itself is defined in
src/sys/sys/exec.h
.
Each entry in the exec switch is written for a given executable
format and a given kernel ABI. It contains test methods to check if
a binary fits the format and ABI, and the methods to load it and start
it up if it does. The entry also holds the various methods called along the
execve
code path.
Table 3.1. struct execsw fields summary
Field name | Description |
---|---|
es_hdrsz | The size of the executable format header |
es_makecmds | A method that checks if the program can be executed and, if it can, creates the vmcmds required to set up the virtual memory space (this includes loading the executable code and data sections). |
u.elf_probe_func, u.ecoff_probe_func, u.macho_probe_func | Executable probe method, used by the es_makecmds method to check if the binary can be executed. The u field is a union that contains probe methods for the ELF, ECOFF and Mach-O formats. |
es_emul | The struct emul used for handling different kernel ABIs. It is covered in detail in Section 3.2.2, “Multiple kernel ABI support with the emul switch”. |
es_prio | A priority level for this exec switch entry. This field helps choose the test order for exec switch entries. |
es_arglen | XXX ? |
es_copyargs | Method used to copy the new program arguments and environment to user space |
es_setregs | Machine-dependent method used to set up the initial process CPU registers |
es_coredump | Method used to produce a core from the process |
es_setup_stack | Method called by es_makecmds to produce a set of vmcmds for setting up the new process stack. |
execve1
iterates over the exec switch entries,
using the es_prio
field for ordering, and calls the
es_makecmds
method of each entry until it gets
a match.
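The iteration can be sketched as follows. This is a simplified model, not the kernel code: struct toy_execsw, the two toy probes and toy_find_format are hypothetical names, the array is assumed to be already sorted by es_prio, and the probes look only at the first bytes of a header.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for struct execsw; only two fields are modelled. */
struct toy_execsw {
	int es_prio;				/* ordering hint */
	int (*es_makecmds)(const char *hdr);	/* 0 on match, else an errno */
};

/* Toy probe: accepts headers starting with the ELF magic number. */
int
toy_elf_makecmds(const char *hdr)
{
	return strncmp(hdr, "\177ELF", 4) == 0 ? 0 : ENOEXEC;
}

/* Toy probe: accepts interpreter scripts starting with "#!". */
int
toy_script_makecmds(const char *hdr)
{
	return strncmp(hdr, "#!", 2) == 0 ? 0 : ENOEXEC;
}

/*
 * Walk the switch (assumed already sorted by es_prio) and return the
 * first entry whose es_makecmds accepts the header, or NULL if none does.
 */
const struct toy_execsw *
toy_find_format(const struct toy_execsw *sw, size_t n, const char *hdr)
{
	for (size_t i = 0; i < n; i++)
		if (sw[i].es_makecmds(hdr) == 0)
			return &sw[i];
	return NULL;
}
```

When no entry matches, the caller would typically fail the exec with ENOEXEC; that part is left out of the sketch.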
The es_makecmds
will fill the exec package's
ep_vmcmds
field with vmcmds that will be used later
for setting up the new process virtual memory space. See
Section 3.1.3.2, “Virtual memory space setup commands (vmcmds)” for details about the vmcmds.
The executable format probe is called by the
es_makecmds
method. Its job is simply to check
if the executable binary can be handled by this exec switch entry.
It can check a signature in the binary (e.g.: ELF note section),
the name of a dynamic linker embedded in the binary, and so on.
Some probe functions act as wildcards and are used as a
last resort, with the help of the es_prio
field.
This is the case of the native ELF 32 bit entry, for instance.
Vmcmds are stored in an array of struct exec_vmcmd
(defined in src/sys/sys/exec.h
) in the
ep_vmcmds
field of the exec
package, before execve1
decides to execute or
destroy them.
struct exec_vmcmd defines,
in the ev_proc
field, a pointer to the
method that will perform the command. The other fields are
used to store the method's arguments.
Four methods are available in
src/sys/kern/exec_subr.c:
Table 3.2. vmcmd methods
Name | Description |
---|---|
vmcmd_map_pagedvn | Map memory from a vnode. Appropriate for handling demand-paged text and data segments. |
vmcmd_map_readvn | Read memory from a vnode. Appropriate for handling non-demand-paged text/data segments, i.e. impure objects (a la OMAGIC and NMAGIC). |
vmcmd_readvn | XXX ? |
vmcmd_map_zero | Maps a region of zero-filled memory |
Vmcmds are created using new_vmcmd
,
and can be destroyed using kill_vmcmd
.
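The idea of queuing commands first and committing later can be sketched as below. All names prefixed with toy_ are hypothetical and the structure is much simpler than the real struct exec_vmcmd: each command only records an address and a length, and the single method just counts bytes instead of mapping memory.

```c
#include <stddef.h>

#define TOY_MAXCMDS 8

/* Simplified stand-in for struct exec_vmcmd: a method plus its arguments. */
struct toy_vmcmd {
	int (*ev_proc)(struct toy_vmcmd *);	/* performs the command */
	unsigned long ev_addr;			/* target address */
	unsigned long ev_len;			/* length in bytes */
};

struct toy_vmcmd_set {
	struct toy_vmcmd cmds[TOY_MAXCMDS];
	size_t used;
};

/* Bytes "mapped" so far; stands in for real address space changes. */
unsigned long toy_mapped_bytes;

/* Toy method: pretend to map a zero-filled region. */
int
toy_map_zero(struct toy_vmcmd *cmd)
{
	toy_mapped_bytes += cmd->ev_len;
	return 0;
}

/* Queue a command; nothing is mapped yet. Returns 0, or -1 if full. */
int
toy_new_vmcmd(struct toy_vmcmd_set *set, int (*proc)(struct toy_vmcmd *),
    unsigned long addr, unsigned long len)
{
	if (set->used == TOY_MAXCMDS)
		return -1;
	set->cmds[set->used].ev_proc = proc;
	set->cmds[set->used].ev_addr = addr;
	set->cmds[set->used].ev_len = len;
	set->used++;
	return 0;
}

/* Cancel: drop every queued command without running it. */
void
toy_kill_vmcmds(struct toy_vmcmd_set *set)
{
	set->used = 0;
}

/* Commit: run the queued commands in order, stopping at the first error. */
int
toy_run_vmcmds(struct toy_vmcmd_set *set)
{
	for (size_t i = 0; i < set->used; i++) {
		int error = set->cmds[i].ev_proc(&set->cmds[i]);

		if (error != 0)
			return error;
	}
	return 0;
}
```

Because queuing has no side effects, a failure detected before the commit point only requires dropping the set.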
The es_setup_stack
field of the exec switch
holds a pointer to the method in charge of generating the vmcmd
for setting up the stack space. Filling the stack with arguments and
environment is done later, by the es_copyargs
method.
For native ELF binaries, the
netbsd32_elf32_copyargs
method (obtained by a macro from elf_copyargs
in src/sys/kern/exec_elf32.c
) is used. It calls
copyargs
(from
src/sys/kern/kern_exec.c
) for the part of the
job which is not specific to ELF.
copyargs
has to copy back the arguments
and environment strings from the kernel copy (in the exec package)
to the new process stack in userland. Then
the arrays of pointers to the strings are reconstructed, and finally,
the pointers to the array, and the argument count, are copied to the
top of the stack. The new program stack pointer will be set to
point to the argument count, followed by the argument array pointer,
as expected by any ANSI program.
Dynamic ELF executables are special: they need a structure called the ELF auxiliary table to be copied onto the stack. The table is an array of key/value pairs for various things such as the ELF header address in user memory, the page size, or the entry point of the ELF executable.
Note that when starting a dynamic ELF executable, the ELF
loader (also known as the interpreter:
/usr/libexec/ld.elf_so
) is loaded with the
executable by the kernel. The ELF loader is started by
the kernel and is responsible for starting the executable itself
afterwards.
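The ELF auxiliary table mentioned above can be pictured as an array of key/value pairs terminated by a null key. The sketch below is illustrative only: struct toy_auxinfo and toy_aux_lookup are hypothetical names, and the key values are the usual ELF ABI numbers for AT_NULL, AT_PHDR, AT_PAGESZ and AT_ENTRY.

```c
#include <stddef.h>

/* Usual ELF ABI key values, redefined here under hypothetical names. */
#define TOY_AT_NULL	0	/* end of table */
#define TOY_AT_PHDR	3	/* program headers address */
#define TOY_AT_PAGESZ	6	/* page size */
#define TOY_AT_ENTRY	9	/* entry point of the executable */

/* One key/value pair of the auxiliary table. */
struct toy_auxinfo {
	unsigned long a_type;
	unsigned long a_v;
};

/*
 * Scan the table up to the TOY_AT_NULL terminator. Returns 1 and stores
 * the value if the key is present, 0 otherwise.
 */
int
toy_aux_lookup(const struct toy_auxinfo *aux, unsigned long key,
    unsigned long *val)
{
	for (; aux->a_type != TOY_AT_NULL; aux++) {
		if (aux->a_type == key) {
			*val = aux->a_v;
			return 1;
		}
	}
	return 0;
}
```

A dynamic loader would perform a lookup of this kind to find, for example, the entry point of the executable it has to start.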
es_setregs
is a machine
dependent method responsible for setting up the initial
process CPU registers. On any machine, the method will
have to set the registers holding the instruction pointer,
the stack pointer and the machine state. Some ports will need
more work (for instance, i386 will set up the segment registers
and the Local Descriptor Table).
The CPU registers are stored in a struct trapframe, available from struct lwp.
After execve
has finished its work,
the new process is ready for running. It is available in the run
queue and it will be picked up by the scheduler when
appropriate.
From the scheduler point of view, starting or resuming a process execution is the same operation: returning to userland. This involves switching to the process virtual memory space, and loading the process CPU registers. By loading the machine state register with the system bit off, kernel privileges are dropped.
XXX details
When the processor encounters an exception (memory fault, division
by zero, system call instruction...), it executes a trap: control
is transferred to the kernel, and after some assembly routines in
locore.S
, control reaches the
syscall_plain
function (from src/sys/arch/i386/i386/syscall.c
on i386) for
system calls, or the
trap
function
(from src/sys/arch/i386/i386/trap.c
on i386) for
other traps.
There is also a syscall_fancy
system call
handler which is only used when the process is being traced by
ktrace.
The struct emul is defined in
src/sys/sys/proc.h
. It defines various methods
and parameters to handle system calls and traps. Each kernel ABI
supported by the NetBSD kernel has its own struct emul.
For instance, Linux ABI defines emul_linux
in
src/sys/compat/linux/common/linux_exec.c
,
and the native ABI defines emul_netbsd
, in
src/sys/kern/kern_exec.c
.
The struct emul for the current ABI is obtained
from the es_emul
field of the exec switch entry
that was selected by execve
. The kernel holds a
pointer to it in the process' struct proc (defined in
src/sys/sys/proc.h
).
Most importantly, the struct emul defines the system call handler function, and the system call table.
Each kernel ABI has a system call table. The table maps system
call numbers to functions implementing the system call in the kernel
(e.g.: system call number 2 is fork
). The
convention (for native syscalls) is that the kernel function
implementing syscall foo
is called sys_foo
. Emulation syscalls have
their own conventions, like linux_sys_
prefix for the Linux emulation.
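A system call table of this kind can be sketched as an array indexed by system call number. The code below is a toy model under stated assumptions: the toy_ types and functions are hypothetical, the table has only two entries, and argument copying from userland is left out entirely.

```c
#include <errno.h>
#include <stddef.h>

typedef long toy_register_t;

/* Minimal stand-in for struct lwp. */
struct toy_lwp {
	int l_lid;
};

/* Every implementation function shares this prototype. */
typedef int (*toy_sy_call_t)(struct toy_lwp *, void *, toy_register_t *);

/* One table entry: argument count and implementation function. */
struct toy_sysent {
	short sy_narg;
	toy_sy_call_t sy_call;
};

/* Placeholder for unimplemented system call numbers. */
int
toy_sys_nosys(struct toy_lwp *l, void *v, toy_register_t *retval)
{
	(void)l; (void)v; (void)retval;
	return ENOSYS;
}

/* Toy system call: return the identifier of the calling thread. */
int
toy_sys_getlid(struct toy_lwp *l, void *v, toy_register_t *retval)
{
	(void)v;
	*retval = l->l_lid;
	return 0;
}

/* The table: index is the system call number. */
const struct toy_sysent toy_sysent_table[] = {
	{ 0, toy_sys_nosys },	/* 0 */
	{ 0, toy_sys_getlid },	/* 1 */
};

/* Dispatcher: validate the number, then call through the table. */
int
toy_syscall(struct toy_lwp *l, size_t code, void *args,
    toy_register_t *retval)
{
	size_t n = sizeof(toy_sysent_table) / sizeof(toy_sysent_table[0]);

	if (code >= n)
		return ENOSYS;
	return toy_sysent_table[code].sy_call(l, args, retval);
}
```

An emulated ABI would plug in a different table and possibly a different dispatcher, which is the kind of variation struct emul is described as capturing.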
The native system call table can be found in
src/sys/kern/syscalls.master
.
This file is not written in C. After any change, it
must be processed by the Makefile
available
in the same directory. syscalls.master
processing
is controlled by the configuration found in
syscalls.conf
, and it will output several
files:
Table 3.3. Files produced from syscalls.master
File name | Description |
---|---|
syscallargs.h | Defines the system call argument structures, used to pass data from the system call handler function to the function implementing the system call. |
syscalls.c | An array of strings containing the names of the system calls |
syscall.h | Preprocessor defines for each system call name and number, used in libc |
sysent.c | An array containing, for each system call, an entry with the number of arguments, the size of the system call arguments structure, and a pointer to the function that implements the system call in the kernel |
In order to avoid namespace collisions, non-native ABIs have
syscalls.conf
defining output file names prefixed
by tags (e.g. linux_
for the Linux ABI).
System call argument structures (syscallarg for short) are always used to pass arguments to functions implementing the system calls. Each system call has its own syscallarg structure. This encapsulation layer is here to hide endianness differences.
All functions implementing system calls have the same prototype:
int syscall(struct lwp *l, void *v, register_t *retval);
l
is the struct lwp
for the calling thread, v
is the
syscallarg structure pointer, and retval
is a pointer to the return value. The function returns the error
code (see errno(2)) or 0 if there was no error. Note that
the prototype is not the same as the “declaration”
in syscalls.master
. The declaration in
syscalls.master
corresponds to the
documented prototype for the system call. This is because system
calls as seen from userland programs have different prototypes,
but the sys_
kernel functions implementing them must have the same prototype
to unify the interface between MD syscall handlers and MI
syscall implementation. In ...
syscalls.master
, the
declaration shows the syscall arguments as seen by
userland and determines the members of the syscallarg structure,
which encapsulates the syscall arguments and has one member for
each one.
While generating the files listed above some substitutions
on the function name are performed: the syscalls tagged as
COMPAT_XX
are prefixed by
compat_xx_
, same for the syscallarg structure
name. So the actual kernel function implementing those syscalls
have to be defined in a corresponding way. Example: if
syscalls.master
has a line
97 COMPAT_30 { int sys_socket(int domain, int type, int protocol); }
the actual syscall function will have this prototype:
int compat_30_sys_socket(struct lwp *l, void *v, register_t *retval);
and v
is a pointer to struct
compat_30_sys_socket_args, whose declaration is the
following:
struct compat_30_sys_socket_args {
        syscallarg(int) domain;
        syscallarg(int) type;
        syscallarg(int) protocol;
};
Note the correspondence with the documented prototype of the
socket(2) syscall and the declaration of
sys_socket
in
syscalls.master
. The types of syscall
arguments are wrapped by the syscallarg
macro, which ensures that the structure members will be padded
to a minimum size, again for unified interface between MD and
MI code. That's why those members should not be accessed
directly, but by the SCARG
macro, which
takes a pointer to the syscall arg structure and the argument
name and extracts the argument's value. See
below for an example.
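The padding idea can be sketched with a simplified pair of macros. This is an assumption-laden model, not the kernel definition: toy_syscallarg pads each member to register size with a union and ignores the endianness handling the real macro is described as providing, and toy_sys_add is a hypothetical system call invented for the illustration.

```c
#include <stddef.h>

typedef long toy_register_t;

/* Minimal stand-in for struct lwp. */
struct toy_lwp {
	int l_lid;
};

/* Simplified: pad every member to the size of a register. */
#define toy_syscallarg(type)	union { toy_register_t pad; type datum; }

/* Access one argument by name, hiding the padding. */
#define TOY_SCARG(p, k)		((p)->k.datum)

/* Argument structure for a hypothetical "add" system call. */
struct toy_sys_add_args {
	toy_syscallarg(int) a;
	toy_syscallarg(int) b;
};

/* Implementation function with the uniform prototype. */
int
toy_sys_add(struct toy_lwp *l, void *v, toy_register_t *retval)
{
	struct toy_sys_add_args *uap = v;

	(void)l;
	*retval = TOY_SCARG(uap, a) + TOY_SCARG(uap, b);
	return 0;
}
```

Each member occupies exactly one register-sized slot regardless of its declared type, which is what lets machine-dependent code fill the structure without knowing the individual argument types.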
The system call implementation in libc is autogenerated
from the kernel implementation. As an example, let's examine the
implementation of the access(2) function in libc. It can be
found in the access.S
file, which does not
exist in the sources — it is autogenerated when libc is
built. It uses macros defined in
src/sys/sys/syscall.h
and
src/lib/libc/arch/MACHINE_ARCH/SYS.h
: the
syscall.h
file contains defines which
map the syscall names to syscall numbers. The syscall function
names are changed by replacing the sys_
prefix by SYS_
. The
syscall.h
header file is also autogenerated
from src/sys/kern/syscalls.master
by
running make init_sysent.c in
src/sys/kern
, as described above. By
including SYS.h
, we get
syscall.h
and the
RSYSCALL
macro, which accepts the syscall
name, automatically adds the SYS_
prefix,
takes the corresponding number, and defines a function of the
name given whose body is just the execution of the syscall
itself with the right number. (The method of execution and of
transfer of the syscall number and its arguments are machine
dependent, but this is hidden in the
RSYSCALL
macro.)
To continue the example of access(2),
syscall.h
contains
#define SYS_access 33
so
RSYSCALL(access)
will result
in defining the function access
, which will
execute the syscall with number 33. Thus,
access.S
needs to contain just:
#include "SYS.h"
RSYSCALL(access)
To automate this further, it is enough to add the name of this
file to the ASM
variable in
src/lib/libc/sys/Makefile.inc
and the file will be
autogenerated with this content when libc is built.
The above is true for libc functions which correspond exactly
to the kernel syscalls. It is not always the case, even if the
functions are found in section 2 of the manuals. For example the
wait(2), wait3(2) and waitpid(2) functions are
implemented as wrappers of only one syscall, wait4(2). In
such case the procedure above yields the
wait4
function and the wrappers can
reference it as if it were a normal C function.
Let's pretend that the access(2) syscall does not exist yet and you want to add it to the kernel. How to proceed?
Add the syscall to the
src/sys/kern/syscalls.master
list:
33 STD { int sys_access(const char *path, int flags); }
Run make init_sysent.c in
src/sys/kern
. This will update the
autogenerated files: syscallargs.h
,
syscall.h
,
init_sysent.c
and
syscalls.c
.
Implement the kernel part of the system call, which will have the prototype:
int sys_access(struct lwp *l, void *v, register_t *retval);
as all other syscalls.
To get the syscall arguments, cast
v
to a pointer to struct
sys_access_args and use the SCARG
macro to retrieve them from that structure. For example, to get the
flags
argument if uap
is a
pointer to struct sys_access_args obtained by
casting v
, use:
SCARG(uap, flags)
The type
struct sys_access_args and the function
sys_access
are declared in
sys/syscallargs.h
, which is autogenerated from
src/sys/kern/syscalls.master
. Use
#include <sys/syscallargs.h>
to get those declarations.
Look in
src/sys/kern/vfs_syscalls.c
for the real
implementation of sys_access
.
Run make includes in
src/sys/sys
. This will copy the
autogenerated include files (most importantly,
syscall.h
) to
usr/include
under
DESTDIR
, where libc build will find them in
the next steps.
Add
access.S
to the
ASM
variable in
src/lib/libc/sys/Makefile.inc
.
This is all. To test the new syscall, simply rebuild libc
(access.S
will be generated at this point) and
reboot with a new kernel containing the new syscall. To make the
new syscall generally useful, its prototype should be added to an
appropriate header file for use by userspace programs — in
the case of access(2), this is unistd.h, which is found in
the NetBSD sources at src/include/unistd.h
.
If the system call ABI (or even API) changes, it is necessary to implement the old syscall with the original semantics to be used by old binaries. The new version of the syscall has a different syscall number, while the original one retains the old number. This is called versioning.
The naming conventions associated with versioning are
complex. If the original system call is called
foo
(and implemented by a
sys_foo
function) and it is changed after the
x.y release, the new syscall will be named
__fooxy
, with the function implementing it
being named sys___fooxy
. The original syscall
(left for compatibility) will still be declared as sys_foo in
syscalls.master
, but will be tagged as
COMPAT_XY
, so the function will be named
compat_xy_sys_foo
. We will call
sys_foo
the original version,
sys___fooxy
the new version and
compat_xy_sys_foo
the compatibility version
in the procedure described below.
Now if the syscall is versioned again after version
z.q has been released, the newest version
will be called __foozq
. The intermediate
version (formerly the new version) will have to be retained for
compatibility, so it will be tagged as
COMPAT_ZQ
, which will change the function
name from sys___fooxy
to
compat_zq_sys___fooxy
. The oldest version
compat_xy_sys_foo
will be unaffected by the
second versioning.
How do you change a system call ABI or API and add a compatibility version? Let's look at a real example: versioning of the socket(2) system call after the error code in case of unsupported address family changed from EPROTONOSUPPORT to EAFNOSUPPORT between NetBSD 3.0 and 4.0.
Tag the original version (
sys_socket
) with the right
COMPAT_XY
in
syscalls.master
. In the case of
sys_socket
, it is
COMPAT_30
, because NetBSD 3.0 was the
last version before the system call changed.
Add the new version at the end of
syscalls.master
(this effectively allocates a
new syscall number). Name the new version as described
above. In our case, it will be sys___socket30
:
394 STD { int sys___socket30(int domain, int type, int protocol); }
Rename the function implementing the system call from
sys_socket
to
sys___socket30
to match the change
above. Ideally, at this moment the change which requires
versioning would be made. (Though in practice it happens
that a change is made and only later it is realized that it
breaks compatibility and versioning is needed.)
Implement the compatibility version, name it
compat_xy_sys_... as described above. The implementation belongs
under src/sys/compat
and it shouldn't be a
modified copy of the new version, because the copies would
eventually diverge. Rather, it should be implemented in terms of
the new version, adding the adjustments needed for compatibility
(which means that it should behave exactly as the old
version did.)
In our example, the compatibility version would be
named compat_30_sys_socket
. It can be found in
src/sys/compat/common/uipc_syscalls_30.c
.
The syscalls.master
tables
for various emulations under
src/sys/compat
used to refer to
sys_socket
. Whether these references
should be changed to the compatibility version or to the new
version depends on the behavior of the OS that we intend to
emulate. E.g. FreeBSD uses the old error number, while
System V uses the new one.
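The advice to implement the compatibility version in terms of the new one can be sketched as follows. The functions are toy models with hypothetical names and a single argument; only the error code translation, taken from the socket(2) example above, is meant to be representative.

```c
#include <errno.h>

#define TOY_AF_INET 2	/* the only address family this toy model supports */

/* New version: an unsupported address family yields EAFNOSUPPORT. */
int
toy_sys___socket30(int domain)
{
	return (domain == TOY_AF_INET) ? 0 : EAFNOSUPPORT;
}

/*
 * Compatibility version: call the new version, then restore the error
 * code that old binaries expect. No logic is duplicated.
 */
int
toy_compat_30_sys_socket(int domain)
{
	int error = toy_sys___socket30(domain);

	if (error == EAFNOSUPPORT)
		error = EPROTONOSUPPORT;
	return error;
}
```

Since the wrapper contains only the adjustment, later fixes to the new version automatically benefit the compatibility version as well.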
Now the kernel should be compilable and old statically linked binaries should work, as should binaries using the old libc. Nothing uses the new syscall yet. We have to make a new libc, which will contain both the new and the compatibility syscall:
In
src/lib/libc/sys/Makefile.inc
, replace
the name of the old syscall by the new syscall
(__socket30
in our example). When libc is
rebuilt, it will contain the new function, but no programs use
this internal name with underscore, so it is not useful yet. Also,
we have lost the old name.
To make newly compiled programs use the new syscall
when they refer to the usual name
(socket
in our example), we add a
__RENAME(newname)
statement after the
declaration of the usual name. In the case of
socket
, this is
src/sys/sys/socket.h
:
int socket(int, int, int)
#if !defined(__LIBC12_SOURCE__) && !defined(_STANDALONE)
__RENAME(__socket30)
#endif
;
Now, when a program is recompiled using this header,
references to socket
will be replaced
by __socket30
, except for compilation
of standalone tools (basically bootloaders), which define
_STANDALONE
, and libc compat code itself,
which defines __LIBC12_SOURCE__
. The
__RENAME
causes the compiler to emit
references to the __socket30
symbol
when socket
is used in the source. The
symbol will be then resolved by the linker to the new
function (implemented by the new system call). Old binaries
are unaware of this and continue to reference
socket
, which should be resolved to the
old function (having the same API as before the change). We
will re-add the old function in the next step.
Re-add the old function to libc in
src/lib/libc/compat/sys
, implementing
it using the new function. Note that we did not use the
compatibility syscall in the kernel at all, so old programs
will work with the new libc, even if the kernel is built
without COMPAT_30
. The compatibility
syscall is there only for the old libc, which is used if the
shared library was not upgraded, or internally by statically
linked programs. We are done — we have covered the cases of old binaries, old libc and new kernel (including statically linked binaries), old binaries, new libc and new kernel, and new binaries, new libc and new kernel.
When committing your work (either a new syscall or a new
syscall version with the compatibility syscalls), you should
remember to commit the source
(syscalls.master
) for the autogenerated files
first, and then regenerate and commit the autogenerated
files. They contain the RCS Id of the source file and this way,
the RCS Id will refer to the current source version. The assembly
files generated by
src/lib/libc/sys/Makefile.inc
are not kept in
the repository at all, they are regenerated every time libc is
built.
When executing 32 bit binaries on a 64 bit system, care must be taken to only use addresses below 4 GB. This is a problem at process creation, when the stack and heap are allocated, but also for each system call, where 32 bit pointers handled by the 32 bit process are manipulated by the 64 bit kernel.
For a kernel built as a 64 bit binary, a 32 bit pointer is
not something that makes sense: pointers can only be 64 bit long.
This is why 32 bit pointers are defined as a u_int32_t
synonym called netbsd32_pointer_t
(in src/sys/compat/netbsd32/netbsd32.h
).
For copyin
and copyout
,
true 64 bit pointers are required. They are obtained by casting the
netbsd32_pointer_t through the
NETBSD32PTR64
macro.
Most of the time, implementing a 32 bit system call is just
a matter of casting pointers and calling the 64 bit version of the system call.
An example of such a situation can be found in
src/sys/compat/netbsd32/netbsd32_time.c
:
netbsd32_timer_delete
. Provided that the 32 bit
system call argument structure pointer is called uap
,
and the 64 bit one is called ua
, then helper macros
called NETBSD32TO64_UAP
,
NETBSD32TOP_UAP
,
NETBSD32TOX_UAP
, and
NETBSD32TOX64_UAP
can be used. Sources in
src/sys/compat/netbsd32
provide multiple examples.
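The conversion pattern can be sketched as below. Everything here is a simplified illustration with hypothetical names: toy32_pointer_t plays the role of the 32 bit pointer type, TOY_PTR64 stands in for the conversion macro, and the two argument structures are plain structs without the syscallarg padding.

```c
#include <stddef.h>
#include <stdint.h>

/* A 32 bit user pointer as seen by a 64 bit kernel: just an integer. */
typedef uint32_t toy32_pointer_t;

/* Widen the integer to a real pointer usable by copyin-style routines. */
#define TOY_PTR64(p)	((void *)(uintptr_t)(p))

/* Arguments as passed by the 32 bit process. */
struct toy32_read_args {
	int fd;
	toy32_pointer_t buf;
	uint32_t nbyte;
};

/* Arguments as expected by the native 64 bit implementation. */
struct toy_read_args {
	int fd;
	void *buf;
	size_t nbyte;
};

/* Convert each member; only the pointer needs the special macro. */
void
toy32_read_args_to_64(const struct toy32_read_args *uap,
    struct toy_read_args *ua)
{
	ua->fd = uap->fd;
	ua->buf = TOY_PTR64(uap->buf);
	ua->nbyte = uap->nbyte;
}
```

After the conversion, the 32 bit entry point can simply call the native implementation with the widened structure.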
For each kernel ABI, struct emul defines a
machine-dependent sendsig
function, which
is responsible for altering the process user context so that it calls a
signal handler.
sendsig
builds a stack frame containing
the CPU registers before the signal handler invocation. The CPU
registers are altered so that on return to userland, the process
executes the signal handler and has the stack pointer set to the
new stack frame.
If requested at sigaction
call time,
sendsig
will also add a struct siginfo
to the stack frame.
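From the userland side, requesting the struct siginfo is done with the SA_SIGINFO flag of sigaction. The sketch below uses only standard POSIX calls; the function and variable names prefixed with toy_ are hypothetical, and the handler merely records what it received.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <string.h>

/* Set by the handler; sig_atomic_t is safe to write from a handler. */
volatile sig_atomic_t toy_seen_signo;
volatile sig_atomic_t toy_seen_info;

/* Three-argument handler form, selected by SA_SIGINFO. */
void
toy_handler(int signo, siginfo_t *info, void *ctx)
{
	(void)ctx;
	toy_seen_signo = signo;
	toy_seen_info = (info != NULL && info->si_signo == signo);
}

/*
 * Install the handler for signo, deliver the signal to ourselves and
 * return the signal number observed by the handler, or -1 on error.
 */
int
toy_deliver(int signo)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = toy_handler;
	sa.sa_flags = SA_SIGINFO;	/* ask for the siginfo argument */
	sigemptyset(&sa.sa_mask);
	if (sigaction(signo, &sa, NULL) == -1)
		return -1;
	if (raise(signo) != 0)
		return -1;
	return toy_seen_signo;
}
```

The third handler argument points to the saved user context, which is the information the kernel needs again when the handler returns.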
Last but not least, sendsig
may copy
a small piece of assembly code involved in signal cleanup, which is called the
signal trampoline. This is detailed
in the next section. Note that modern NetBSD native programs
do not use that feature anymore: it is only used for older programs
and for the emulation of other OSes.
Once the signal handler returns, the kernel must destroy the signal handler context and restore the previous process state. This can be achieved in two ways.
First method, using the kernel-provided signal trampoline:
sendsig
has copied the signal trampoline onto
the stack and has prepared the stack and/or CPU registers so that the
signal handler returns to the signal trampoline. The job of the
signal trampoline is to call the sigreturn
or the setcontext
system call, passing a pointer
to the CPU registers saved on stack. This restores the CPU registers
to their values before the signal handler invocation, and next time the
process will return to userland, it will resume its execution where it
stopped.
The native signal trampoline for i386 is called
sigcode
and can be found in
src/sys/arch/i386/i386/locore.S
. Each emulated ABI
has its own signal trampoline, which can be quite close to the native
one, except usually for the sigreturn
system call
number.
The second method is to use a signal trampoline provided by libc.
This is what modern NetBSD native programs do. At the time the
sigaction
system call is invoked, the libc stub
passes a pointer to a signal trampoline in libc, which is in charge
of calling setcontext
.
sendsig
will use that pointer as the return address
for the signal handler. This method is better than the previous one,
because it removes the need for an executable stack page where the
signal trampoline is stored. The trampoline is now stored in the code
segment of libc. For instance, for i386, the signal trampoline
is named __sigtramp_siginfo_2
and can be found in
src/lib/libc/arch/i386/sys/__sigtramp2.S
.
NetBSD 5.0 introduced a new scheduling API that allows for different scheduling algorithms to be implemented and selected at compile-time. There are currently two different scheduling algorithms available: The traditional 4.4BSD-based scheduler and the more modern M2 scheduler.
NetBSD supports the three scheduling policies required by POSIX in order to support the POSIX real-time scheduling extensions:
SCHED_OTHER: Time sharing (TS), the default on NetBSD
SCHED_FIFO: First in, first out
SCHED_RR: Round-robin
SCHED_FIFO and SCHED_RR are predefined scheduling policies, leaving SCHED_OTHER as an implementation-specific policy.
Currently, there are 224 different priority levels with 64 being available for the user level. Scheduling priorities are organized within the following classes:
Table 3.4. Scheduling priorities
Class | Range | # Levels | Description |
---|---|---|---|
Kernel (RT) | 192..223 | 32 | Software interrupts. |
User (RT) | 128..191 | 64 | Real-time user threads (SCHED_FIFO and SCHED_RR policies). |
Kernel threads | 96..127 | 32 | Internal kernel threads (kthreads), used by I/O, VM and other kernel subsystems. |
Kernel | 64..95 | 32 | Kernel priority for user processes/threads, temporarily assigned when entering kernel-space and blocking. |
User (TS) | 0..63 | 64 | Time-sharing range, user processes and threads (SCHED_OTHER policy) |
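The ranges of Table 3.4 can be restated as a small lookup function. The function name toy_prio_class is hypothetical; the boundaries and class names are copied from the table.

```c
#include <stddef.h>

/*
 * Map a priority level (0..223) to the class name used in Table 3.4.
 * Returns NULL for values outside the valid range.
 */
const char *
toy_prio_class(int prio)
{
	if (prio < 0 || prio > 223)
		return NULL;
	if (prio >= 192)
		return "Kernel (RT)";
	if (prio >= 128)
		return "User (RT)";
	if (prio >= 96)
		return "Kernel threads";
	if (prio >= 64)
		return "Kernel";
	return "User (TS)";
}
```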
Threads running with the SCHED_FIFO policy have a fixed priority, i.e. the kernel does not change their priority dynamically. A SCHED_FIFO thread runs until one of the following occurs:
completion
voluntarily yielding the CPU
blocking on an I/O operation or other resources (memory allocation, locks)
preemption by a higher priority real-time thread
SCHED_RR works similarly to SCHED_FIFO, except that such threads have a default time-slice of 100 ms.
For the SCHED_OTHER policy, both schedulers currently use the same run queue implementation, employing multi-level feedback queues. By dynamically adjusting a thread's priority to reflect its CPU and resource utilization, this approach allows the system to be responsive even under heavy loads.
Each runnable thread is placed on one of the runqueues, according to its priority. Each thread is allowed to run on the CPU for a certain amount of time, its time-slice or quantum. Once the thread has used up its time-slice, it is placed at the back of its runqueue. When the scheduler searches for a new thread to run on the CPU, the first thread of the highest priority, non-empty runqueue is selected.
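The selection rule can be sketched with one FIFO queue per priority level. This is a toy model, not the code in kern_runq.c: the names are hypothetical, each queue is a small fixed-size ring buffer of thread identifiers, and a higher number means a higher priority, as in Table 3.4.

```c
#include <stddef.h>

#define TOY_NPRIO	224	/* number of priority levels */
#define TOY_QLEN	8	/* capacity of each toy queue */

/* One FIFO run queue, stored as a ring buffer of thread identifiers. */
struct toy_runq {
	int ids[TOY_QLEN];
	size_t head;
	size_t count;
};

struct toy_sched {
	struct toy_runq q[TOY_NPRIO];
};

/* Place a thread at the back of the queue for its priority. */
int
toy_enqueue(struct toy_sched *s, int prio, int id)
{
	struct toy_runq *q;

	if (prio < 0 || prio >= TOY_NPRIO)
		return -1;
	q = &s->q[prio];
	if (q->count == TOY_QLEN)
		return -1;
	q->ids[(q->head + q->count) % TOY_QLEN] = id;
	q->count++;
	return 0;
}

/*
 * Remove and return the first thread of the highest priority non-empty
 * queue, or -1 if every queue is empty.
 */
int
toy_pick_next(struct toy_sched *s)
{
	for (int prio = TOY_NPRIO - 1; prio >= 0; prio--) {
		struct toy_runq *q = &s->q[prio];

		if (q->count > 0) {
			int id = q->ids[q->head];

			q->head = (q->head + 1) % TOY_QLEN;
			q->count--;
			return id;
		}
	}
	return -1;
}
```

A thread whose time-slice expires would be passed to toy_enqueue again, which puts it behind the other threads of the same priority and yields round-robin behavior within a level.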
The 4.4BSD scheduler adjusts a thread's priority
dynamically as it accumulates CPU-time. CPU utilization is
incremented in hardclock
each time
the system clock ticks and the thread is found to be
executing. An estimate of a thread's recent CPU
utilization is stored in l_estcpu
,
which is adjusted once per second
in schedcpu
via a digital decay
filter. Whenever a thread accumulates four ticks in its
CPU utilization, schedclock
invokes
resetpriority
to recalculate the
process's scheduling priority.
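The digital decay filter can be illustrated with the classic 4.4BSD formula, in which the estimate is multiplied by (2 * load) / (2 * load + 1) once per second and the nice value is added. The sketch below is an assumption about the general shape of the computation, written with plain integers under the hypothetical name toy_decay_estcpu; the actual schedcpu code uses fixed-point arithmetic and may differ in detail.

```c
/*
 * One application of the classic 4.4BSD decay filter:
 *   estcpu = (2 * load) / (2 * load + 1) * estcpu + nice
 * computed in integer arithmetic. "load" is the load average rounded
 * to an integer, which is a simplification made for this sketch.
 */
int
toy_decay_estcpu(int estcpu, int load, int nice)
{
	return (2 * load * estcpu) / (2 * load + 1) + nice;
}
```

With a load of 1, each application keeps two thirds of the previous estimate, so CPU time consumed in the past weighs less and less on the priority of a thread that has stopped running.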
The common scheduler API is implemented within the file
src/sys/kern/kern_synch.c
. Additional
information can be found in csf(9). Generic run-queues
are implemented
in src/sys/kern/kern_runq.c
. Detailed
information about the 4.4BSD scheduler is given in
[McKusick]. A description of the SVR4
scheduler is provided in [Goodheart].