Let us take a look at the command that links the kernel. This will help identify the exact location where the loader passes execution to the kernel. This location is the kernel's actual entry point.
sys/conf/Makefile.i386:
ld -elf -Bdynamic -T /usr/src/sys/conf/ldscript.i386 -export-dynamic \
-dynamic-linker /red/herring -o kernel -X locore.o \
<lots of kernel .o files>
A few interesting things can be seen here. First, the
kernel is an ELF dynamically linked binary, but the dynamic
linker for kernel is /red/herring
, which is
definitely a bogus file. Second, taking a look at the file
sys/conf/ldscript.i386
gives an idea about
what ld options are used when
compiling a kernel. Reading through the first few lines, the
string
sys/conf/ldscript.i386:
ENTRY(btext)
says that a kernel's entry point is the symbol `btext'.
This symbol is defined in locore.s
:
sys/i386/i386/locore.s:
.text
/**********************************************************************
*
* This is where the bootblocks start us, set the ball rolling...
*
*/
NON_GPROF_ENTRY(btext)
First, the register EFLAGS is set to a predefined value of 0x00000002. Then all the segment registers are initialized:
sys/i386/i386/locore.s:
/* Don't trust what the BIOS gives for eflags. */
pushl $PSL_KERNEL
popfl
/*
* Don't trust what the BIOS gives for %fs and %gs. Trust the bootstrap
* to set %cs, %ds, %es and %ss.
*/
mov %ds, %ax
mov %ax, %fs
mov %ax, %gs
btext calls the routines
recover_bootinfo()
,
identify_cpu()
,
create_pagetables()
, which are also defined
in locore.s
. Here is a description of what
they do:
recover_bootinfo | This routine parses the parameters to the kernel
passed from the bootstrap. The kernel may have been
booted in 3 ways: by the loader, described above, by the
old disk boot blocks, or by the old diskless boot
procedure. This function determines the booting method,
and stores the struct bootinfo
structure into the kernel memory. |
identify_cpu | This functions tries to find out what CPU it is
running on, storing the value found in a variable
_cpu . |
create_pagetables | This function allocates and fills out a Page Table Directory at the top of the kernel memory area. |
The next steps are enabling VME, if the CPU supports it:
testl $CPUID_VME, R(_cpu_feature) jz 1f movl %cr4, %eax orl $CR4_VME, %eax movl %eax, %cr4
Then, enabling paging:
/* Now enable paging */ movl R(_IdlePTD), %eax movl %eax,%cr3 /* load ptd addr into mmu */ movl %cr0,%eax /* get control word */ orl $CR0_PE|CR0_PG,%eax /* enable paging */ movl %eax,%cr0 /* and let's page NOW! */
The next three lines of code are because the paging was set, so the jump is needed to continue the execution in virtualized address space:
pushl $begin /* jump to high virtualized address */ ret /* now running relocated at KERNBASE where the system is linked to run */ begin:
The function init386()
is called with
a pointer to the first free physical page, after that
mi_startup()
. init386
is an architecture dependent initialization function, and
mi_startup()
is an architecture independent
one (the 'mi_' prefix stands for Machine Independent). The
kernel never returns from mi_startup()
, and
by calling it, the kernel finishes booting:
sys/i386/i386/locore.s:
movl physfree, %esi
pushl %esi /* value of first for init386(first) */
call _init386 /* wire 386 chip for unix operation */
call _mi_startup /* autoconfiguration, mountroot etc */
hlt /* never returns to here */
init386()
is defined in
sys/i386/i386/machdep.c
and performs
low-level initialization specific to the i386 chip. The
switch to protected mode was performed by the loader. The
loader has created the very first task, in which the kernel
continues to operate. Before looking at the code, consider
the tasks the processor must complete to initialize protected
mode execution:
Initialize the kernel tunable parameters, passed from the bootstrapping program.
Prepare the GDT.
Prepare the IDT.
Initialize the system console.
Initialize the DDB, if it is compiled into kernel.
Initialize the TSS.
Prepare the LDT.
Set up proc0's pcb.
init386()
initializes the tunable
parameters passed from bootstrap by setting the environment
pointer (envp) and calling init_param1()
.
The envp pointer has been passed from loader in the
bootinfo
structure:
sys/i386/i386/machdep.c:
kern_envp = (caddr_t)bootinfo.bi_envp + KERNBASE;
/* Init basic tunables, hz etc */
init_param1();
init_param1()
is defined in
sys/kern/subr_param.c
. That file has a
number of sysctls, and two functions,
init_param1()
and
init_param2()
, that are called from
init386()
:
sys/kern/subr_param.c:
hz = HZ;
TUNABLE_INT_FETCH("kern.hz", &hz);
TUNABLE_<typename>_FETCH is used to fetch the value from the environment:
/usr/src/sys/sys/kernel.h:
#define TUNABLE_INT_FETCH(path, var) getenv_int((path), (var))
Sysctl kern.hz
is the system clock
tick. Additionally, these sysctls are set by
init_param1()
: kern.maxswzone,
kern.maxbcache, kern.maxtsiz, kern.dfldsiz, kern.maxdsiz,
kern.dflssiz, kern.maxssiz, kern.sgrowsiz
.
Then init386()
prepares the Global
Descriptors Table (GDT). Every task on an x86 is running in
its own virtual address space, and this space is addressed by
a segment:offset pair. Say, for instance, the current
instruction to be executed by the processor lies at CS:EIP,
then the linear virtual address for that instruction would be
“the virtual address of code segment CS” + EIP.
For convenience, segments begin at virtual address 0 and end
at a 4Gb boundary. Therefore, the instruction's linear
virtual address for this example would just be the value of
EIP. Segment registers such as CS, DS etc are the selectors,
i.e., indexes, into GDT (to be more precise, an index is not a
selector itself, but the INDEX field of a selector).
FreeBSD's GDT holds descriptors for 15 selectors per
CPU:
sys/i386/i386/machdep.c:
union descriptor gdt[NGDT * MAXCPU]; /* global descriptor table */sys/i386/include/segments.h:
/* * Entries in the Global Descriptor Table (GDT) */ #define GNULL_SEL 0 /* Null Descriptor */ #define GCODE_SEL 1 /* Kernel Code Descriptor */ #define GDATA_SEL 2 /* Kernel Data Descriptor */ #define GPRIV_SEL 3 /* SMP Per-Processor Private Data */ #define GPROC0_SEL 4 /* Task state process slot zero and up */ #define GLDT_SEL 5 /* LDT - eventually one per process */ #define GUSERLDT_SEL 6 /* User LDT */ #define GTGATE_SEL 7 /* Process task switch gate */ #define GBIOSLOWMEM_SEL 8 /* BIOS low memory access (must be entry 8) */ #define GPANIC_SEL 9 /* Task state to consider panic from */ #define GBIOSCODE32_SEL 10 /* BIOS interface (32bit Code) */ #define GBIOSCODE16_SEL 11 /* BIOS interface (16bit Code) */ #define GBIOSDATA_SEL 12 /* BIOS interface (Data) */ #define GBIOSUTIL_SEL 13 /* BIOS interface (Utility) */ #define GBIOSARGS_SEL 14 /* BIOS interface (Arguments) */
Note that those #defines are not selectors themselves, but just a field INDEX of a selector, so they are exactly the indices of the GDT. for example, an actual selector for the kernel code (GCODE_SEL) has the value 0x08.
The next step is to initialize the Interrupt Descriptor
Table (IDT). This table is referenced by the processor when a
software or hardware interrupt occurs. For example, to make a
system call, user application issues the
INT 0x80
instruction. This is a software
interrupt, so the processor's hardware looks up a record with
index 0x80 in the IDT. This record points to the routine that
handles this interrupt, in this particular case, this will be
the kernel's syscall gate. The IDT may have a maximum of 256
(0x100) records. The kernel allocates NIDT records for the
IDT, where NIDT is the maximum (256):
sys/i386/i386/machdep.c:
static struct gate_descriptor idt0[NIDT];
struct gate_descriptor *idt = &idt0[0]; /* interrupt descriptor table */
For each interrupt, an appropriate handler is set. The
syscall gate for INT 0x80
is set as
well:
sys/i386/i386/machdep.c:
setidt(0x80, &IDTVEC(int0x80_syscall),
SDT_SYS386TGT, SEL_UPL, GSEL(GCODE_SEL, SEL_KPL));
So when a userland application issues the
INT 0x80
instruction, control will transfer
to the function _Xint0x80_syscall
, which
is in the kernel code segment and will be executed with
supervisor privileges.
Console and DDB are then initialized:
sys/i386/i386/machdep.c:
cninit();
/* skipped */
#ifdef DDB
kdb_init();
if (boothowto & RB_KDB)
Debugger("Boot flags requested debugger");
#endif
The Task State Segment is another x86 protected mode structure, the TSS is used by the hardware to store task information when a task switch occurs.
The Local Descriptors Table is used to reference userland code and data. Several selectors are defined to point to the LDT, they are the system call gates and the user code and data selectors:
/usr/include/machine/segments.h:
#define LSYS5CALLS_SEL 0 /* forced by intel BCS */
#define LSYS5SIGR_SEL 1
#define L43BSDCALLS_SEL 2 /* notyet */
#define LUCODE_SEL 3
#define LSOL26CALLS_SEL 4 /* Solaris >= 2.6 system call gate */
#define LUDATA_SEL 5
/* separate stack, es,fs,gs sels ? */
/* #define LPOSIXCALLS_SEL 5*/ /* notyet */
#define LBSDICALLS_SEL 16 /* BSDI system call gate */
#define NLDT (LBSDICALLS_SEL + 1)
Next, proc0's Process Control Block
(struct pcb
) structure is initialized.
proc0 is a struct proc
structure that
describes a kernel process. It is always present while the
kernel is running, therefore it is declared as global:
sys/kern/kern_init.c:
struct proc proc0;
The structure struct pcb
is a part of a
proc structure. It is defined in
/usr/include/machine/pcb.h
and has a
process's information specific to the i386 architecture, such
as registers values.
This function performs a bubble sort of all the system initialization objects and then calls the entry of each object one by one:
sys/kern/init_main.c:
for (sipp = sysinit; *sipp; sipp++) {
/* ... skipped ... */
/* Call function */
(*((*sipp)->func))((*sipp)->udata);
/* ... skipped ... */
}
Although the sysinit framework is described in the Developers' Handbook, I will discuss the internals of it.
Every system initialization object (sysinit object) is
created by calling a SYSINIT() macro. Let us take as example
an announce
sysinit object. This object
prints the copyright message:
sys/kern/init_main.c:
static void
print_caddr_t(void *data __unused)
{
printf("%s", (char *)data);
}
SYSINIT(announce, SI_SUB_COPYRIGHT, SI_ORDER_FIRST, print_caddr_t, copyright)
The subsystem ID for this object is SI_SUB_COPYRIGHT (0x0800001), which comes right after the SI_SUB_CONSOLE (0x0800000). So, the copyright message will be printed out first, just after the console initialization.
Let us take a look at what exactly the macro
SYSINIT()
does. It expands to a
C_SYSINIT()
macro. The
C_SYSINIT()
macro then expands to a static
struct sysinit
structure declaration with
another DATA_SET
macro call:
/usr/include/sys/kernel.h:
#define C_SYSINIT(uniquifier, subsystem, order, func, ident) \
static struct sysinit uniquifier ## _sys_init = { \ subsystem, \
order, \ func, \ ident \ }; \ DATA_SET(sysinit_set,uniquifier ##
_sys_init);
#define SYSINIT(uniquifier, subsystem, order, func, ident) \
C_SYSINIT(uniquifier, subsystem, order, \
(sysinit_cfunc_t)(sysinit_nfunc_t)func, (void *)ident)
The DATA_SET()
macro expands to a
MAKE_SET()
, and that macro is the point
where all the sysinit magic is hidden:
/usr/include/linker_set.h:
#define MAKE_SET(set, sym) \
static void const * const __set_##set##_sym_##sym = &sym; \
__asm(".section .set." #set ",\"aw\""); \
__asm(".long " #sym); \
__asm(".previous")
#endif
#define TEXT_SET(set, sym) MAKE_SET(set, sym)
#define DATA_SET(set, sym) MAKE_SET(set, sym)
In our case, the following declaration will occur:
static struct sysinit announce_sys_init = { SI_SUB_COPYRIGHT, SI_ORDER_FIRST, (sysinit_cfunc_t)(sysinit_nfunc_t) print_caddr_t, (void *) copyright }; static void const *const __set_sysinit_set_sym_announce_sys_init = &announce_sys_init; __asm(".section .set.sysinit_set" ",\"aw\""); __asm(".long " "announce_sys_init"); __asm(".previous");
The first __asm
instruction will create
an ELF section within the kernel's executable. This will
happen at kernel link time. The section will have the name
.set.sysinit_set
. The content of this
section is one 32-bit value, the address of announce_sys_init
structure, and that is what the second
__asm
is. The third
__asm
instruction marks the end of a
section. If a directive with the same section name occurred
before, the content, i.e., the 32-bit value, will be appended
to the existing section, so forming an array of 32-bit
pointers.
Running objdump on a kernel binary, you may notice the presence of such small sections:
%
objdump -h /kernel
7 .set.cons_set 00000014 c03164c0 c03164c0 002154c0 2**2 CONTENTS, ALLOC, LOAD, DATA 8 .set.kbddriver_set 00000010 c03164d4 c03164d4 002154d4 2**2 CONTENTS, ALLOC, LOAD, DATA 9 .set.scrndr_set 00000024 c03164e4 c03164e4 002154e4 2**2 CONTENTS, ALLOC, LOAD, DATA 10 .set.scterm_set 0000000c c0316508 c0316508 00215508 2**2 CONTENTS, ALLOC, LOAD, DATA 11 .set.sysctl_set 0000097c c0316514 c0316514 00215514 2**2 CONTENTS, ALLOC, LOAD, DATA 12 .set.sysinit_set 00000664 c0316e90 c0316e90 00215e90 2**2 CONTENTS, ALLOC, LOAD, DATA
This screen dump shows that the size of .set.sysinit_set
section is 0x664 bytes, so 0x664/sizeof(void
*)
sysinit objects are compiled into the kernel.
The other sections such as .set.sysctl_set
represent other linker sets.
By defining a variable of type struct
linker_set
the content of
.set.sysinit_set
section will be
“collected” into that variable:
sys/kern/init_main.c:
extern struct linker_set sysinit_set; /* XXX */
The struct linker_set
is defined as
follows:
/usr/include/linker_set.h:
struct linker_set {
int ls_length;
void *ls_items[1]; /* really ls_length of them, trailing NULL */
};
The first node will be equal to the number of a sysinit objects, and the second node will be a NULL-terminated array of pointers to them.
Returning to the mi_startup()
discussion, it is must be clear now, how the sysinit objects
are being organized. The mi_startup()
function sorts them and calls each. The very last object is
the system scheduler:
/usr/include/sys/kernel.h:
enum sysinit_sub_id {
SI_SUB_DUMMY = 0x0000000, /* not executed; for linker*/
SI_SUB_DONE = 0x0000001, /* processed*/
SI_SUB_CONSOLE = 0x0800000, /* console*/
SI_SUB_COPYRIGHT = 0x0800001, /* first use of console*/
...
SI_SUB_RUN_SCHEDULER = 0xfffffff /* scheduler: no return*/
};
The system scheduler sysinit object is defined in the file
sys/vm/vm_glue.c
, and the entry point for
that object is scheduler()
. That
function is actually an infinite loop, and it represents a
process with PID 0, the swapper process. The proc0 structure,
mentioned before, is used to describe it.
The first user process, called init,
is created by the sysinit object
init
:
sys/kern/init_main.c:
static void
create_init(const void *udata __unused)
{
int error;
int s;
s = splhigh();
error = fork1(&proc0, RFFDG | RFPROC, &initproc);
if (error)
panic("cannot fork init: %d\n", error);
initproc->p_flag |= P_INMEM | P_SYSTEM;
cpu_set_fork_handler(initproc, start_init, NULL);
remrunqueue(initproc);
splx(s);
}
SYSINIT(init,SI_SUB_CREATE_INIT, SI_ORDER_FIRST, create_init, NULL)
The create_init()
allocates a new
process by calling fork1()
, but does not
mark it runnable. When this new process is scheduled for
execution by the scheduler, the
start_init()
will be called. That
function is defined in init_main.c
. It
tries to load and exec the init
binary,
probing /sbin/init
first, then
/sbin/oinit
,
/sbin/init.bak
, and finally
/stand/sysinstall
:
sys/kern/init_main.c:
static char init_path[MAXPATHLEN] =
#ifdef INIT_PATH
__XSTRING(INIT_PATH);
#else
"/sbin/init:/sbin/oinit:/sbin/init.bak:/stand/sysinstall";
#endif
All FreeBSD documents are available for download at http://ftp.FreeBSD.org/pub/FreeBSD/doc/
Questions that are not answered by the
documentation may be
sent to <[email protected]>.
Send questions about this document to <[email protected]>.