进程

       ______________     CPU Available     ______________
      |              | ----------------> |              |
      | TASK_RUNNING |                     | Real Running |
      |______________| <---------------- |______________|
                           CPU Busy
            |   /|\
Waiting for |    | Resource
Resource   |    | Available
           \|/   |
    ______________________
   |                      |
   | TASK_INTERRUPTIBLE / |
   | TASK-UNINTERRUPTIBLE |
   |______________________|

Main Multitasking Flow

从系统内核的角度看来，一个进程仅仅是进程控制表（process table）中的一项。进程控制表中的每一项都是一个task_struct 结构，而task_struct 结构本身是在include/linux/sched.h中定义的。在task_struct结构中存储各种低级和高级的信息，包括从一些硬件设备的寄存器拷贝到进程的工作目录的链接点。

进程控制表既是一个数组，又是一个双向链表，同时又是一个树。其物理实现是一个包括多个指针的静态数组。此数组的长度保存在include/linux/tasks.h 定义的常量NR_TASKS中，其缺省值为128，数组中的结构则保存在系统预留的内存页中。链表是由next_task 和prev_task两个指针实现的，而树的实现则比较复杂。

系统启动后，内核通常作为某一个进程的代表。一个指向task_struct的全局指针变量current用来记录正在运行的进程。变量current只能由kernel/sched.c中的进程调度改变。当系统需要查看所有的进程时，则调用for_each_task，这将比系统搜索数组的速度要快得多。

二、用户进程和内核线程

某一个进程只能运行在用户方式（user mode）或内核方式（kernel mode）下。用户程序运行在用户方式下，而系统调用运行在内核方式下。在这两种方式下所用的堆栈不一样：用户方式下用的是一般的堆栈，而内核方式下用的是固定大小的堆栈（一般为一个内存页的大小）

尽管linux是一个宏内核系统，内核线程依然存在，以便并行地处理一些内核的“家务室”。这些任务不占用USER memory（用户空间），而仅仅使用KERNEL memory。和其他内核模块一样，它们也在高级权限（i386系统中的RING 0）下工作作。内核线程是被kernel_thread [arch/i386/kernel/process]创建的,它又通过调用著名的clone系统调用[arch/i386/kernel/process.c] (类似fork系统调用的所有功能都是由它最终实现):

int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
{
long retval, d0;

        __asm__ __volatile__(
                "movl %%esp,%%esi\n\t"
                "int $0x80\n\t"         /* Linux/i386 system call */
                "cmpl %%esp,%%esi\n\t" /* child or parent? */
                "je 1f\n\t"             /* parent - jump */
                /* Load the argument into eax, and push it. That way, it does
                 * not matter whether the called function is compiled with
                 * -mregparm or not. */
                "movl %4,%%eax\n\t"
                "pushl %%eax\n\t"
                "call *%5\n\t"          /* call fn */
                "movl %3,%0\n\t"        /* exit */
                "int $0x80\n"
                "1:\t"
                :"=&a" (retval), "=&S" (d0)
                :"0" (__NR_clone), "i" (__NR_exit),
                 "r" (arg), "r" (fn),
                 "b" (flags | CLONE_VM)
                : "memory");
        return retval;
}

一旦调用，我们就有了一个新的任务（Task） (一般PID都很小, 例如2,3,等) 等待一个响应很慢的资源，例如swap或者usb事件，以便同步。下面是一些最常用的内核线程(你可以用ps x命令):

PID      COMMAND
1        init
2        keventd
3        kswapd
4        kreclaimd
5        bdflush
6        kupdated
7        kacpid
67        khubd

init内核线程也是启动以后最初的进程。它会调用其它用户模式的任务，(/etc/inittab)例如控制台守护进程（daemons）， tty守护进程以及网络守护进程(rc脚本)。

下面是一个典型的内核线程kswapd [mm/vmscan.c].
kswapd是被clone()建立的 [arch/i386/kernel/process.c]''

|do_initcalls
   |kswapd_init
      |kernel_thread
         |syscall fork (in assembler)

·do_initcalls [init/main.c]
·kswapd_init [mm/vmscan.c]
·kernel_thread [arch/i386/kernel/process.c]

三进程创建，运行和消失

Linux系统使用系统调用fork( )来创建一个进程，使用exit( )来结束进程。fork( )和exit( )的源程序保存在kernel/fork.c and kernel/exit.c中。fork( )的主要任务是初始化要创建进程的数据结构，其主要的步骤有：

1）申请一个空闲的页面来保存task_struct。
2）查找一个空的进程槽（find_empty_process( )）。
3）为kernel_stack_page申请另一个空闲的内存页作为堆栈。
4）将父进程的LDT表拷贝给子进程。
5）复制父进程的内存映射信息。
6）管理文件描述符和链接点。

·sys_fork [arch/i386/kernel/process.c]
·do_fork [kernel/fork.c]
·alloc_task_struct [include/asm/processor.c]
·__get_free_pages [mm/page_alloc.c]
·get_pid [kernel/fork.c]
·copy_files
·copy_fs
·copy_sighand
·copy_mm
·allocate_mm
·mm_init
·pgd_alloc -> get_pgd_fast [include/asm/pgalloc.h]
·get_pgd_slow
·dup_mmap [kernel/fork.c]
·copy_page_range [mm/memory.c]
·ptep_set_wrprotect [include/asm/pgtable.h]
·clear_bit [include/asm/bitops.h]
·copy_segments [arch/i386/kernel/process.c]
·copy_thread
·SET_LINKS [include/linux/sched.h]
·wake_up_process [kernel/sched.c]

撤消一个进程可能稍微复杂些，因为撤消子进程必须通知父进程。另外，使用kill( )也可以结束一个进程。sys_kill( )、sys_wait( )和sys_exit( )都保存在文件exit.c中。

使用fork ( )创建一个进程后，程序的两个拷贝都在运行。通常一个拷贝使用exec ( )调用另一个拷贝。系统调用exec ( )负责定位可执行文件的二进制代码，并负责装入和运行。Linux系统中的exec ( )通过使用linux_binfmt结构支持多种二进制格式。每种二进制格式都代表可执行代码和链接库。linux _binfmt结构种包含两个指针，一个指向装入可执行代码的函数，另一个指向装入链接库的函数。

Unix系统提供给程序员6种调用exec( ) 的方法。其中的5种是作为库函数实现，而sys_execve( )是由系统内核实现的。它执行一个十分简单的任务：装入可执行文件的文件头，并试图执行它。如果文件的头两个字节是#! ,那么它就调用在文件第一行中所指定的解释器，否则，它将逐个尝试注册的二进制格式。

[目录]

信号
struct semaphore {
        atomic_t count; 进程抓取semaphore时减1
        int sleepers; 抓取semaphore失败时增1
        wait_queue_head_t wait; semaphore的等待队列
};
        down(&sem) 编绎成:

        movl $sem,% ecx        通过寄存器ecx向__down函数传递sem指针
        decl sem
        js 2f 如果为负值,表示semaphore已被占用,执行__down_failed过程
1:
由于出现semaphore竞争的可能性比较小,将分支代码转移到.text.lock段,以缩短正常的指令路径.
.section .text.lock,"ax"
2:        call __down_failed
        jmp 1b
.previous
        ...

up(&sem) 编绎成:

        movl $sem,% ecx
        incl sem
        jle 2f 如果小于或等于0,表示该semaphore有进程在等待,就去调用__up_wakeup
1:
.section .text.lock,"ax"
2:        call __up_wakeup
        jmp 1b
.previous
        ...
__down_failed:
        pushl % eax
        pushl % edx
        pushl % ecx ; eax,edx,ecx是3个可用于函数参数的寄存器
        call __down
        popl % ecx
        popl % edx
        popl % eax
        ret
__up_wakeup:
        pushl % eax
        pushl % edx
        pushl % ecx
        call __up
        popl % ecx
        popl % edx
        popl % eax
        ret
; semaphore.c
void __down(struct semaphore * sem)
{
        struct task_struct *tsk = current;
        DECLARE_WAITQUEUE(wait, tsk);
        tsk->state = TASK_UNINTERRUPTIBLE;
        add_wait_queue_exclusive(&sem->wait, &wait);
        // 将当前进程加入到该semaphore的等待队列中

        spin_lock_irq(&semaphore_lock);
        sem->sleepers++;
        for (;;) {
                int sleepers = sem->sleepers;

                /*
                * Add "everybody else" into it. They aren't
                * playing, because we own the spinlock.
                */

                // atomic_add_negative(int i,atomic_t *v)将i + v->counter相加,
                // 结果为负返回1,否则返回0
                if (!atomic_add_negative(sleepers - 1, &sem->count)) {
                // 如果(sleepers - 1 + sem->count.counter)非负,则说明
                // semaphore已经被释放,可以返回
                        sem->sleepers = 0;
                        break;
                }
                sem->sleepers = 1;        /* us - see -1 above */

                spin_unlock_irq(&semaphore_lock);
                // 当semaphore被up()唤醒时,schedule()返回
                schedule();
                // 虽然已线程被up恢复,但为防止碰巧又有一个线程获得了semaphore,
                // 因此将它们放在循环体中
                tsk->state = TASK_UNINTERRUPTIBLE;
                spin_lock_irq(&semaphore_lock);
        }
        spin_unlock_irq(&semaphore_lock);
        // 该进程获得了semaphore,将它从等待队列中删除
        remove_wait_queue(&sem->wait, &wait);
        tsk->state = TASK_RUNNING;
        // 为什么这里要调用wake_up,是因为调用它没有副作用从而防止潜在的死锁吗?
        wake_up(&sem->wait);
}
void __up(struct semaphore *sem)
{
扩展为
__wake_up_common(&sem->wait,TASK_UNINTERRUPTIBLE|TASK_INTERRUPTIBLE,1,0);
唤醒队列中第1个进程,即将第1个进程放入运行队列
        wake_up(&sem->wait);
}
; sched.c
static inline void __wake_up_common (wait_queue_head_t *q, unsigned int
mode,
                                     int nr_exclusive, const int sync)
{
        struct list_head *tmp, *head;
        struct task_struct *p;
        unsigned long flags;

if (!q)
goto out;

wq_write_lock_irqsave(&q->lock, flags);

#if WAITQUEUE_DEBUG
CHECK_MAGIC_WQHEAD(q);
#endif

        head = &q->task_list;
#if WAITQUEUE_DEBUG
        if (!head->next || !head->prev)
                WQ_BUG();
#endif
        tmp = head->next;
        while (tmp != head) {
                unsigned int state;
                wait_queue_t *curr = list_entry(tmp, wait_queue_t,
task_list);
                tmp = tmp->next;

#if WAITQUEUE_DEBUG
                CHECK_MAGIC(curr->__magic);
#endif
                p = curr->task;
                state = p->state;
                if (state & mode) {
#if WAITQUEUE_DEBUG
                        curr->__waker = (long)__builtin_return_address(0);
#endif
                        if (sync)
                                wake_up_process_synchronous(p);
                        else
                                wake_up_process(p);
                        if ((curr->flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
                                break;
                }
        }
        wq_write_unlock_irqrestore(&q->lock, flags);
out:
        return;
}
; sched.c
inline void wake_up_process(struct task_struct * p)
{
        unsigned long flags;

        /*
        * We want the common case fall through straight, thus the goto.
        */
        spin_lock_irqsave(&runqueue_lock, flags);
        p->state = TASK_RUNNING;
        if (task_on_runqueue(p))
                goto out;
        add_to_runqueue(p);
        reschedule_idle(p);
out:
        spin_unlock_irqrestore(&runqueue_lock, flags);
}
; sched.c
static inline void wake_up_process_synchronous(struct task_struct * p)
{
        unsigned long flags;

        /*
        * We want the common case fall through straight, thus the goto.
        */
        spin_lock_irqsave(&runqueue_lock, flags);
        p->state = TASK_RUNNING;
        if (task_on_runqueue(p))
                goto out;
        add_to_runqueue(p);
out:
        spin_unlock_irqrestore(&runqueue_lock, flags);
}
; sched.h
static inline int task_on_runqueue(struct task_struct *p)
{
        return (p->run_list.next != NULL);
}
; sched.c
static inline void add_to_runqueue(struct task_struct * p)
{
        list_add(&p->run_list, &runqueue_head);
        nr_running++;
}
static LIST_HEAD(runqueue_head);

; fork.c
void add_wait_queue_exclusive(wait_queue_head_t *q, wait_queue_t *
wait)
{
unsigned long flags;

        wq_write_lock_irqsave(&q->lock, flags);
        wait->flags = WQ_FLAG_EXCLUSIVE;
        __add_wait_queue_tail(q, wait);
        wq_write_unlock_irqrestore(&q->lock, flags);
}
; wait.h
static inline void __add_wait_queue_tail(wait_queue_head_t *head,
                                                wait_queue_t *new)
{
#if WAITQUEUE_DEBUG
        if (!head || !new)
                WQ_BUG();
        CHECK_MAGIC_WQHEAD(head);
        CHECK_MAGIC(new->__magic);
        if (!head->task_list.next || !head->task_list.prev)
                WQ_BUG();
#endif
        list_add_tail(&new->task_list, &head->task_list);
}
正执行调度的函数是schedule(void),它选择一个最合适的进程执行，并且真正进行上下文切换，
使得选中的进程得以执行。而reschedule_idle(struct task_struct *p)的作用是为进程选择
一个合适的CPU来执行，如果它选中了某个CPU，则将该CPU上当前运行进程的need_resched标志
置为1,然后向它发出一个重新调度的处理机间中断，使得选中的CPU能够在中断处理返回时执行
schedule函数，真正调度进程p在CPU上执行。在schedule()和reschedule_idle()中调用了goodness()
函数。goodness()函数用来衡量一个处于可运行状态的进程值得运行的程度。此外，在schedule()
函数中还调用了schedule_tail()函数;在reschedule_idle()函数中还调用了reschedule_idle_slow()。

[目录]

/*
* 'sched.c' is the main kernel file. It contains scheduling primitives
* (sleep_on, wakeup, schedule etc) as well as a number of simple system
* call functions (type getpid(), which just extracts a field from
* current-task
*/
#include
#include
#include
#include
#include
#include
#include
#define LATCH (1193180/HZ)
extern void mem_use(void);
extern int timer_interrupt(void);
extern int system_call(void);
union task_union {
struct task_struct task;
char stack[PAGE_SIZE];
};
static union task_union init_task = {INIT_TASK,};
long volatile jiffies=0;
long startup_time=0;
struct task_struct *current = &(init_task.task), *last_task_used_math =
NULL;
struct task_struct * task[NR_TASKS] = {&(init_task.task), };
long user_stack [ PAGE_SIZE>>2 ] ;
struct {
long * a;
short b;
} stack_start = { & user_stack [PAGE_SIZE>>2] , 0x10 };
/*
* 'math_state_restore()' saves the current math information in the
* old math state array, and gets the new ones from the current task
*/
void math_state_restore() @@协处理器状态保存
{
if (last_task_used_math)
__asm__("fnsave %0"::"m" (last_task_used_math->tss.i387));
if (current->used_math)
__asm__("frstor %0"::"m" (current->tss.i387));
else {
__asm__("fninit"::);
current->used_math=1;
}
last_task_used_math=current;
}
/*
* 'schedule()' is the scheduler function. This is GOOD CODE! There
* probably won't be any reason to change this, as it should work well
* in all circumstances (ie gives IO-bound processes good response etc).
* The one thing you might take a look at is the signal-handler code
here.
*
* NOTE!! Task 0 is the 'idle' task, which gets called when no other
* tasks can run. It can not be killed, and it cannot sleep. The 'state'
* information in task[0] is never used.
*/
void schedule(void)
{
int i,next,c;
struct task_struct ** p;
/* check alarm, wake up any interruptible tasks that have got a signal
*/
for(p = &LAST_TASK ; p > &FIRST_TASK ; --p)
if (*p) {
if ((*p)->alarm && (*p)->alarm < jiffies) {
@@??
(*p)->signal |= (1<<(SIGALRM-1));@@14-1
(*p)->alarm = 0;
}
if ((*p)->signal && (*p)->state==TASK_INTERRUPTIBLE)
(*p)->state=TASK_RUNNING;
}
@@ task 1 如何变为TASK_RUNNING??signal 如何得到,alarm如何变非0且 /* this is the
scheduler proper: */
@@操作系统最重要的函数，调度算法
@@这个循环要找到一个可运行的任务才能退出,会死在这吗?即如没有一个可运行
while (1) {
c = -1;
next = 0;
i = NR_TASKS;
p = &task[NR_TASKS];
while (--i) {
if (!*--p)
continue;
if ((*p)->state == TASK_RUNNING && (*p)->counter > c)
c = (*p)->counter, next = i;
}
if (c) break; @@记数大于零
for(p = &LAST_TASK ; p > &FIRST_TASK ; --p)
if (*p)
(*p)->counter = ((*p)->counter >> 1) +
(*p)->priority;
}
switch_to(next);
}
int sys_pause(void)
{
current->state = TASK_INTERRUPTIBLE; @@任务可中断
schedule();
return 0;
}
void sleep_on(struct task_struct **p)
{
struct task_struct *tmp;
if (!p)
return;
if (current == &(init_task.task))
panic("task[0] trying to sleep");
tmp = *p;
*p = current;
current->state = TASK_UNINTERRUPTIBLE;
schedule();
if (tmp) @@激活p，什么时候回来？唤醒上次睡眠的进程
tmp->state=0;
}
void interruptible_sleep_on(struct task_struct **p)
{
struct task_struct *tmp;
if (!p)
return;
if (current == &(init_task.task))
panic("task[0] trying to sleep");
tmp=*p;
*p=current;
repeat: current->state = TASK_INTERRUPTIBLE;
schedule();
if (*p && *p != current) {
(**p).state=0;
goto repeat;
}
@@好象下不来
*p=NULL;
if (tmp)
tmp->state=0;
}
void wake_up(struct task_struct **p)
{
if (p && *p) {
(**p).state=0; @@唤醒该进程running
*p=NULL; @@睡眠栈为0
}
}
void do_timer(long cpl) @@定时调度
{
if (cpl)
current->utime++; @@用户态时间加一
else
current->stime++; @@系统态时间加一
if ((--current->counter)>0) return; @@当前记数减一
current->counter=0;
if (!cpl) return;
schedule();
}
int sys_alarm(long seconds)
{
current->alarm = (seconds>0)?(jiffies+HZ*seconds):0;
return seconds;
}
int sys_getpid(void)
{
return current->pid;
}
int sys_getppid(void)
{
return current->father;
}
int sys_getuid(void)
{
return current->uid;
}
int sys_geteuid(void)
{
return current->euid;
}
int sys_getgid(void)
{
return current->gid;
}
int sys_getegid(void)
{
return current->egid;
}
int sys_nice(long increment)
{
if (current->priority-increment>0)
current->priority -= increment;
return 0;
}
int sys_signal(long signal,long addr,long restorer)
{
long i;
switch (signal) {
case SIGHUP: case SIGINT: case SIGQUIT: case SIGILL:
case SIGTRAP: case SIGABRT: case SIGFPE: case SIGUSR1:
case SIGSEGV: case SIGUSR2: case SIGPIPE: case SIGALRM:
case SIGCHLD:
i=(long) current->sig_fn[signal-1];
current->sig_fn[signal-1] = (fn_ptr) addr;
current->sig_restorer = (fn_ptr) restorer;
return i;
default: return -1;
}
}
void sched_init(void)
{
int i;
struct desc_struct * p;
set_tss_desc(gdt+FIRST_TSS_ENTRY,&(init_task.task.tss));@@init task tss
set_ldt_desc(gdt+FIRST_LDT_ENTRY,&(init_task.task.ldt));@@init ldt
p = gdt+2+FIRST_TSS_ENTRY;
for(i=1;i task[i] = NULL;
p->a=p->b=0;
p++;
p->a=p->b=0;
p++;
}
ltr(0); @@调入task 0的tss
lldt(0); @@调入task 0的ldt
outb_p(0x36,0x43); /* binary, mode 3, LSB/MSB, ch 0 */
outb_p(LATCH & 0xff , 0x40); /* LSB */
outb(LATCH >> 8 , 0x40); /* MSB */
set_intr_gate(0x20,&timer_interrupt); @@irq 0 时钟中断
outb(inb_p(0x21)&~0x01,0x21);
set_system_gate(0x80,&system_call);
}

[目录]

进程信号队列
每个进程具有一个sigpending结构所描述的信号队列,它有3个成员,head指向第一个sigqueue成员,tail指向最末的sigqueue成员的next指针,signal描述了此队列中的信号集.

static int
send_signal(int sig, struct siginfo *info, struct sigpending *signals);
将信号sig和对应的消息结构info添加到信号队列signal中.
static int
collect_signal(int sig, struct sigpending *list, siginfo_t *info);
返回信号sig在队列list中的信息info.

struct task_struct {
        ...
        struct sigpending pending;
        ...
};
struct sigpending {
       struct sigqueue *head, **tail;
       sigset_t signal;
};
struct sigqueue {
       struct sigqueue *next;
       siginfo_t info;
       };
// kernel/signal.c
static int
send_signal(int sig, struct siginfo *info, struct sigpending *signals)
{
     struct sigqueue * q = NULL;
     /* Real-time signals must be queued if sent by sigqueue, or
       some other real-time mechanism. It is implementation
      defined whether kill() does so. We attempt to do so, on
      the principle of least surprise, but since kill is not
     allowed to fail with EAGAIN when low on memory we just
     make sure at least one signal gets delivered and don't
     pass on the info struct. */

    if (atomic_read(&nr_queued_signals) < max_queued_signals) {
       q = kmem_cache_alloc(sigqueue_cachep, GFP_ATOMIC);
    }
    // nr_queued_signals和max_queued_signals用来限制全局sigqueue成员的数目
    if (q) {
        atomic_inc(&nr_queued_signals);
        q->next = NULL;
        *signals->tail = q;
        signals->tail = &q->next; tail总是指向最末的信号成员的next指针                switch ((unsign
ed long) info)
         {
        case 0:
          // info参数如果为0,表示信号来源于当前用户进程                                q->info.si_signo =
sig;
        q->info.si_errno = 0;
        q->info.si_code = SI_USER;
        q->info.si_pid = current->pid;
        q->info.si_uid = current->uid;
        break;
        case 1:
          // info参数如果为1,表示信号来源于内核本身                                q->info.si_signo = sig;
        q->info.si_errno = 0;
        q->info.si_code = SI_KERNEL;
           q->info.si_pid = 0;
        q->info.si_uid = 0;
           break;
        default:
           // 否则从info指针中拷贝信号
           copy_siginfo(&q->info, info);
        break;
        }
      }
      else if (sig >= SIGRTMIN && info && (unsigned long)info != 1                   && info->
si_code != SI_USER)
      {
        ; 如果该信号是内核发出的实时信号,就返回错误码
        /*
         * Queue overflow, abort. We may abort if the signal was rt
         * and sent by user using something other than kill().
       */
        return -EAGAIN;
      }
      sigaddset(&signals->signal, sig); 将sig号标记在队列的信号集上
      return 0;
}
static int
collect_signal(int sig, struct sigpending *list, siginfo_t *info)
{
      if (sigismember(&list->signal, sig)) {
        /* Collect the siginfo appropriate to this signal. */                struct sigqueue *q, **
pp;
         pp = &list->head; pp指向第一个信号成员的next指针
        while ((q = *pp) != NULL) {
                if (q->info.si_signo == sig)                                            goto found_it;
                pp = &q->next;
         }
        /* Ok, it wasn't in the queue. We must have
           been out of queue space. So zero out the
          info.
         */
        sigdelset(&list->signal, sig);
        info->si_signo = sig;
        info->si_errno = 0;
        info->si_code = 0;
        info->si_pid = 0;
        info->si_uid = 0;
        return 1;
   found_it:
   // 将找到信号成员从信号队列中删除
        if ((*pp = q->next) == NULL)
        list->tail = pp;
/* Copy the sigqueue information and free the queue entry */
       copy_siginfo(info, &q->info);
       kmem_cache_free(sigqueue_cachep,q);
       atomic_dec(&nr_queued_signals);
/* Non-RT signals can exist multiple times.. */
       if (sig >= SIGRTMIN) {
                while ((q = *pp) != NULL) {
                if (q->info.si_signo == sig)                                             goto found_another;
                pp = &q->next;
                }
       }
       sigdelset(&list->signal, sig);
    found_another:
        return 1;
    }
    return 0;
}

[目录]

SMP
多处理机系统正在变得越来越普通。尽管大多数用户空间代码仍将完美地运行，而且有些情况下不需要增加额外的代码就能利用SMP特性的优势，但是内核空间代码必须编写成具备“SMP意识”且是“SMP安全的”。以下几段文字解释如何去做。

问题

    当有多个CPU时，同样的代码可能同时在两个或多个CPU上执行。这在如下所示用于初始化某个图像设备的例程中可能会出问题。
        void init_hardware(void)
        {
            outb(0x1, hardware_base + 0x30);
            outb(0x2, hardware_base + 0x30);
            outb(0x3, hardware_base + 0x30);
            outb(0x4, hardware_base + 0x30);
        }
    假设该硬件依赖于寄存器0x30按顺序依次被设为0、1、2、3来初始化，那么要是有另一个CPU来参乎的话，事情就会搞糟。想象有两个CPU的情形，它们都在执行这个例程，不过2号CPU进入得稍慢点：
        CPU 1                           CPU 2

        0x30 = 1
        0x30 = 2                        0x30 = 1
        0x30 = 3                        0x30 = 2
        0x30 = 4                        0x30 = 3
                                        0x30 = 4
    这会发生什么情况呢？从我们设想的硬件设备看来，它在寄存器0x30上收到的字节按顺序为：1、2、1、3、2、4、3、4。
    啊！原本好好的事第二个CPU一来就搞得一团糟了也。所幸的是，我们有防止这类事情发生的办法。

自旋锁小历史

    2.0.x版本的Linux内核通过给整个内核引入一个全局变量来防止多于一个CPU会造成的问题。这意味着任何时刻只有一个CPU能够执行来自内核空间的代码。这样尽管能工作，但是当系统开始以多于2个的CPU出现时，扩展性能就不怎么好。
    2.1.x版本的内核系列加入了粒度更细的SMP支持。这意味着不再依赖于以前作为全局变量出现的“大锁”，而是每个没有SMP意识的例程现在都需要各自的自旋锁。文件asm/spinlock.h中定义了若干类型的自旋锁。
    有了局部化的自旋锁后，不止一个CPU同时执行内核空间代码就变得可能了。

简单的自旋锁

    理解自旋锁的最简单方法是把它作为一个变量看待，该变量把一个例程或者标记为“我当前在另一个CPU上运行，请稍等一会”，或者标记为“我当前不在运行”。如果1号CPU首先进入该例程，它就获取该自旋锁。当2号CPU试图进入同一个例程时，该自旋锁告诉它自己已为1号CPU所持有，需等到1号CPU释放自己后才能进入。
        spinlock_t my_spinlock = SPIN_LOCK_UNLOCKED;
        unsigned long flags;

        spin_lock (&my_spinlock);
        ...
        critical section
        ...
        spin_unlock (&my_spinlock);

中断

    设想我们的硬件的驱动程序还有一个中断处理程序。该处理程序需要修改某些由我们的驱动程序定义的全局变量。这会造成混乱。我们如何解决呢？
    保护某个数据结构，使它免遭中断之修改的最初方法是全局地禁止中断。在已知只有自己的中断才会修改自己的驱动程序变量时，这么做效率很低。所幸的是，我们现在有更好的办法了。我们只是在使用共享变量期间禁止中断，此后重新使能。
    实现这种办法的函数有三个：
        disable_irq()
        enable_irq()
        disable_irq_nosync()
    这三个函数都取一个中断号作为参数。注意，禁止一个中断的时间太长会导致难以追踪程序缺陷，丢失数据，甚至更坏。
    disable_irq函数的非同步版本允许所指定的IRQ处理程序继续运行，前提是它已经在运行，普通的disable_irq则所指定的IRQ处理程序不在如何CPU上运行。
    如果需要在中断处理程序中修改自旋锁，那就不能使用普通的spin_lock()和spin_unlock()，而应该保存中断状态。这可通过给这两个函数添加_irqsave后缀很容易地做到：
        spinlock_t my_spinlock = SPIN_LOCK_UNLOCKED;
        unsigned long flags;

        spin_lock_irqsave(&my_spinlock, flags);
        ...
        critical section
        ...
        spin_unlock_irqrestore (&my_spinlock, flags);

[目录]

内核线程页目录的借用
创建内核线程的时候，由于内核线程没有用户空间，而所有进程的内核页目录都是一样的（（某些情况下可能有不同步的情况出现，主要是为了减轻同步所有进程内核页目录的开销，而只是在各个进程要访问内核空间，如果有不同步的情况，然后才进行同步处理），所以创建的内核线程的内核页目录总是借用进程0的内核页目录。

>>> kernel_thread以标志CLONE_VM调用clone系统调用
/*
* Create a kernel thread
*/
int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
{
long retval, d0;

__asm__ __volatile__(
"movl %%esp,%%esi\n\t"
"int $0x80\n\t" /* Linux/i386 system call */
"cmpl %%esp,%%esi\n\t" /* child or parent? */
/* Load the argument into eax, and push it. That way, it does
   * not matter whether the called function is compiled with
   * -mregparm or not. */
"movl %4,%%eax\n\t"
"pushl %%eax\n\t"
"call *%5\n\t" /* call fn */
"movl %3,%0\n\t" /* exit */
"int $0x80\n"
"1:\t"
:"=&a" (retval), "=&S" (d0)
:"0" (__NR_clone), "i" (__NR_exit),
   "r" (arg), "r" (fn),
   "b" (flags | CLONE_VM)
: "memory");
return retval;
}

>>> sys_clone->do_fork->copy_mm：
static int copy_mm(unsigned long clone_flags, struct task_struct * tsk)
{
struct mm_struct * mm, *oldmm;
int retval;

。。。。。。。。

tsk->mm = NULL;
tsk->active_mm = NULL;

/*
* Are we cloning a kernel thread?
*
* We need to steal a active VM for that..
*/
>>> 如果是内核线程的子线程（mm=NULL），则直接退出，此时内核线程mm和active_mm均为为NULL
oldmm = current->mm;
if (!oldmm)
return 0;

>>> 内核线程，只是增加当前进程的虚拟空间的引用计数
if (clone_flags & CLONE_VM) {
atomic_inc(&oldmm->mm_users);
mm = oldmm;
goto good_mm;
}

。。。。。。。。。。

good_mm:
>>> 内核线程的mm和active_mm指向当前进程的mm_struct结构
tsk->mm = mm;
tsk->active_mm = mm;
return 0;

。。。。。。。
}

然后内核线程一般调用daemonize来释放对用户空间的引用：
>>> daemonize->exit_mm->_exit_mm：
/*
* Turn us into a lazy TLB process if we
* aren't already..
*/
static inline void __exit_mm(struct task_struct * tsk)
{
struct mm_struct * mm = tsk->mm;

mm_release();
if (mm) {
atomic_inc(&mm->mm_count);
if (mm != tsk->active_mm) BUG();
/* more a memory barrier than a real lock */
task_lock(tsk);
>>> 释放用户虚拟空间的数据结构
tsk->mm = NULL;
task_unlock(tsk);
enter_lazy_tlb(mm, current, smp_processor_id());

>>> 递减mm的引用计数并是否为0，是则释放mm所代表的映射
mmput(mm);
}
}

asmlinkage void schedule(void)
{
。。。。。。。。。
if (!current->active_mm) BUG();

。。。。。。。。。

prepare_to_switch();
{
struct mm_struct *mm = next->mm;
struct mm_struct *oldmm = prev->active_mm;
>>> mm = NULL，选中的为内核线程
if (!mm) {
>>> 对内核线程,active_mm = NULL，否则一定是出错了
   if (next->active_mm) BUG();
>>> 选中的内核线程active_mm借用老进程的active_mm
   next->active_mm = oldmm;
   atomic_inc(&oldmm->mm_count);
   enter_lazy_tlb(oldmm, next, this_cpu);
} else {
>>> mm != NULL 选中的为用户进程，active_mm必须与mm相等，否则一定是出错了
   if (next->active_mm != mm) BUG();
   switch_mm(oldmm, mm, next, this_cpu);
}

>>> prev = NULL ，切换出去的是内核线程
if (!prev->mm) {
>>> 设置其 active_mm = NULL 。
prev->active_mm = NULL;
mmdrop(oldmm);
}
}

}

对内核线程的虚拟空间总结一下：
１、创建的时候：
父进程是用户进程，则mm和active_mm均共享父进程的，然后内核线程一般调用daemonize适头舖m
父进程是内核线程，则mm和active_mm均为NULL
总之，内核线程的mm = NULL；进程调度的时候以此为依据判断是用户进程还是内核线程。

２、进程调度的时候
如果切换进来的是内核线程，则置active_mm为切换出去的进程的active_mm；
如果切换出去的是内核线程，则置active_mm为NULL。

[目录]

代码分析
LINUX系统是分时多用户系统, 它有多进程系统的特点，CPU按时间片分配给各个用户使用, 而在实质上应该说CPU按时间片分配给各个进程使用, 每个进程都有自己的运行环境以使得在CPU做进程切换时保存该进程已计算了一半的状态。

进程的切换包括三个层次:

    ·用户数据的保存: 包括正文段(TEXT), 数据段(DATA,BSS), 栈段(STACK), 共享内存段(SHARED MEMORY)的保存。
    ·寄存器数据的保存: 包括PC(program counter,指向下一条要执行的指令的地址),   PSW(processor status word,处理机状态字), SP(stack pointer,栈指针), PCBP(pointer of process control block,进程控制块指针), FP(frame pointer,指向栈中一个函数的local 变量的首地址), AP(augument pointer,指向栈中函数调用的实参位置), ISP(interrupt stack pointer,中断栈指针), 以及其他的通用寄存器等。
    ·系统层次的保存: 包括proc,u,虚拟存储空间管理表格,中断处理栈。以便于该进程再一次得到CPU时间片时能正常运行下去。

    多进程系统的一些突出的特点:
并行化
   一件复杂的事件是可以分解成若干个简单事件来解决的, 这在程序员的大脑中早就形成了这种概念, 首先将问题分解成一个个小问题, 将小问题再细分, 最后在一个合适的规模上做成一个函数。在软件工程中也是这么说的。如果我们以图的方式来思考, 一些小问题的计算是可以互不干扰的, 可以同时处理, 而在关键点则需要统一在一个地方来处理, 这样程序的运行就是并行的, 至少从人的时间观念上来说是这样的。而每个小问题的计算又是较简单的。
简单有序
   这样的程序对程序员来说不亚于管理一班人, 程序员为每个进程设计好相应的功能, 并通过一定的通讯机制将它们有机地结合在一起, 对每个进程的设计是简单的, 只在总控部分小心应付(其实也是蛮简单的), 就可完成整个程序的施工。
互不干扰
   这个特点是操作系统的特点, 各个进程是独立的, 不会串位。
事务化
   比如在一个数据电话查询系统中, 将程序设计成一个进程只处理一次查询即可, 即完成一个事务。当电话查询开始时, 产生这样一个进程对付这次查询; 另一个电话进来时, 主控程序又产生一个这样的进程对付, 每个进程完成查询任务后消失. 这样的编程多简单, 只要做一次查询的程序就可以了。

   Linux是一个多进程的操作系统，进程是分离的任务，拥有各自的权利和责任。如果一个进程崩溃，它不应该让系统的另一个进程崩溃。每一个独立的进程运行在自己的虚拟地址空间，除了通过安全的核心管理的机制之外无法影响其他的进程。
   在一个进程的生命周期中，进程会使用许多系统资源。比如利用系统的CPU执行它的指令，用系统的物理内存来存储它和它的数据。它会打开和使用文件系统中的文件，会直接或者间接使用系统的物理设备。如果一个进程独占了系统的大部分物理内存和CPU，对于其他进程就是不公平的。所以Linux必须跟踪进程本身和它使用的系统资源以便公平地管理系统中的进程。
   系统最宝贵的资源就是CPU。通常系统只有一个CPU。Linux作为一个多进程的操作系统，它的目标就是让进程在系统的CPU上运行，充分利用CPU。如果进程数多于CPU（一般情况都是这样），其他的进程就必须等到CPU被释放才能运行。多进程的思想就是：一个进程一直运行，直到它必须等待，通常是等待一些系统资源，等拥有了资源，它才可以继续运行。在一个单进程的系统中，比如DOS，CPU被简单地设为空闲，这样等待资源的时间就会被浪费。而在一个多进程的系统中，同一时刻许多进程在内存中，当一个进程必须等待时，操作系统将CPU从这个进程切换到另一个更需要的进程。
   我们组分析的是Linux进程的状态转换以及标志位的作用，它没有具体对应某个系统调用，而是分布在各个系统调用中。所以我们详细而广泛地分析了大量的原码，对进程状态转换的原因、方式和结果进行了分析，大致总结了整个Linux系统对进程状态管理的实现机制。

Linux中，每个进程用一个task_struct的数据结构来表示，用来管理系统中的进程。Task向量表是指向系统中每一个task_struct数据结构的指针的数组。这意味着系统中的最大进程数受到Task向量表的限制，缺省是512。这个表让Linux可以查到系统中的所有的进程。操作系统初始化后，建立了第一个task_struct数据结构INIT_TASK。当新的进程创建时，从系统内存中分配一个新的task_struct，并增加到Task向量表中。为了更容易查找，用current指针指向当前运行的进程。

   task_struct结构中有关于进程调度的两个重要的数据项：
   struct task_struct {
       ………….
       volatile long state; /* -1 unrunnable , 0 runnable , >0 stopped */
       unsigned long flags; /* per process flags, defined below */
       ………….
     }；
   每个在Task向量表中登记的进程都有相应的进程状态和进程标志，是进行进程调度的重要依据。进程在执行了相应的进程调度操作后，会由于某些原因改变自身的状态和标志，也就是改变state和flags这两个数据项。进程的状态不同、标志位不同对应了进程可以执行不同操作。在Linux2.2.8版本的sched.h中定义了六种状态，十三种标志。
//进程状态
#define TASK_RUNNING                0
#define TASK_INTERRUPTIBLE        1
#define TASK_UNINTERRUPTIBLE        2
#define TASK_ZOMBIE                4
#define TASK_STOPPED                8
#define TASK_SWAPPING                16

它们的含义分别是：

TASK_RUNNING：正在运行的进程（是系统的当前进程）或准备运行的进程（在Running队列中，等待被安排到系统的CPU）。处于该状态的进程实际参与了进程调度。
TASK_INTERRUPTIBLE：处于等待队列中的进程，待资源有效时唤醒，也可由其它进程被信号中断、唤醒后进入就绪状态。
TASK_UNINTERRUPTIBLE：处于等待队列中的进程，直接等待硬件条件，待资源有效时唤醒，不可由其它进程通过信号中断、唤醒。
TASK_ZOMBIE：终止的进程，是进程结束运行前的一个过度状态（僵死状态）。虽然此时已经释放了内存、文件等资源，但是在Task向量表中仍有一个task_struct数据结构项。它不进行任何调度或状态转换，等待父进程将它彻底释放。
TASK_STOPPED：进程被暂停，通过其它进程的信号才能唤醒。正在调试的进程可以在该停止状态。
TASK_SWAPPING：进程页面被兑换出内存的进程。这个状态基本上没有用到，只有在sched.c的count_active_tasks（）函数中判断处于该种状态的进程也属于active的进程，但没有对该状态的赋值。

//进程标志位：
#define PF_ALIGNWARN        0x00000001
#define PF_STARTING        0x00000002
#define PF_EXITING        0x00000004
#define PF_PTRACED        0x00000010
#define PF_TRACESYS        0x00000020
#define PF_FORKNOEXEC        0x00000040
#define PF_SUPERPRIV        0x00000100
#define PF_DUMPCORE        0x00000200
#define PF_SIGNALED        0x00000400
#define PF_MEMALLOC        0x00000800
#define PF_VFORK            0x00001000
#define PF_USEDFPU        0x00100000
#define PF_DTRACE        0x00200000

其中PF_STARTING没有用到。
PF_MEMEALLOC和PF_VFORK这两个标志位是新版本中才有的。

各个标志位的代表着不同含义，对应着不同调用：

PF_ALIGNWARN    标志打印“对齐”警告信息，只有在486机器上实现
PF_STARTING      进程正被创建
PF_EXITING        标志进程开始关闭。
在do_exit()时置位。
    current->flags |= PF_EXITING
用于判断是否有效进程。
在nlmclnt_proc()(在fs\lockd\clntproc.c)，如果current_flag为PF_EXITING，则进程由于正在退出清除所有的锁，将执行异步RPC 调用。
PF_PTRACED      进程被跟踪标志，
在do_fork()时清位。
    p->flags &= ~PF_PTRACED
当ptrace(0)被调用时置位，在进程释放前要清掉。
    current->flags |= PF_PTRACED
在sys_trace()中判断
如果request为PTRACE_TRACEME，如是则将current_flag置为PF_PTRACED；
如果request为PTRACE_ATTACH，则将child_flag置为PF_PTRACED，给child发一个SIGSTOP信号;
如果request为PTRACE_DETACH ，则将child清除PF_PTRACED。
在syscall_trace()中判断current_flag如果为PF_TRACED和PF_TRACESYS,则current强行退出时的出错代码置为SIGTRAP并将状态置为STOPPED。
PF_TRACESYS       正在跟踪系统调用。
do_fork()时清位，在进程释放前要清掉。
在sys_trace()中判断request如果为PTRACE_SYSCALL，则将child->flags 置为 PF_TRACESYS；如为PTRACE_SYSCALL，则将child->flags 清除 PF_TRACESYS；然后唤醒child。如果request为PTRACE_SINGLESTEP（即单步跟踪），则将child_flag清除PF_TRACESYS,唤醒child。
PF_FORKNOEXEC    进程刚创建，但还没执行。
在do_fork()时置位。
    p->flags |= PF_FORKNOEXEC
在调入格式文件时清位。
    p->flags &= ~ PF_FORKNOEXEC
PF_SUPERPRIV       超级用户特权标志。
    如果是超级用户进程则置位，用户特权设为超级用户，如是超级用户，在统计时置统计标志(accounting flag)为ASU。
PF_DUMPCORE       标志进程是否清空core文件。
Core文件由gdb进行管理，给用户提供有用信息，例如查看浮点寄存器的内容比较困难，事实上我们可以从内核文件里的用户结构中得到
Core文件格式如下图：
UPAGE
DATA
STACK
              Core 文件结构
       UPAGE是包含用户结构的一个页面，告诉gdb文件中现有内容所有寄存器也在           UPAGE中，通常只有一页。DATA存放数据区。STACK堆栈区
   最小Core文件长度为三页（12288字节）
在task_struct中定义一个dumpable变量，当dumpable==1时表示进程可以清空core文件（即将core文件放入回收站），等于0时表示该进程不能清空core文件（即core文件以放在回收站中，不可再放到回收站中），此变量初值为1。
例如在调用do_aout_core_dump()时判断current->dumpable是否等于1（即判断该进程是否能将core文件放入回收站），如果等于1则将该变量置为0，在当前目录下建立一个core dump image ,在清空用户结构前，由gdb算出数据段和堆栈段的位置和使用的虚地址，用户数据区和堆栈区在清空前将相应内容写入core dump，将PF_DUMPCORE置位，清空数据区和堆栈区。
只有在aout_core_dump()内调用do_aout_core_dump(),而没有地方调用aout_core_dump()。对其它文件格式也是类似。
9、 PF_SIGNALED       标志进程被信号杀出。
在do_signal()中判断信号，如果current收到信号为SIGHUP, SIGINT, SIGIOT, SIGKILL, SIGPIPE, SIGTERM, SIGALRM, SIGSTKFLT, SIGURG, SIGXCPU, SIGXFSZ, SIGVTALRM, SIGPROF, SIGIO, SIGPOLL, SIGLOST, SIGPWR，则执行lock_kernel(),将信号加入current的信号队列，将current->flag置为PF_SIGNALED,然后执行do_exit()
PF_USEDFPU        标志该进程使用FPU，此标志只在SMP时使用。
在task_struct中有一变量used_math，进程是否使用FPU。
在CPU从prev切换到next时，如果prev使用FPU则prev的flag清除PF_USEDFPU。
    prev->flags&=~PF_USEDFPU
在flush_thread()(arch\i386\kernel\process.c)、restore_i387_hard()、save_i387_hard()(arch\i386\kernel\signal.c)中，如果是SMP方式，且使用FPU则stts()，否则清除PF_USEDFPU。
    current->flags &= ~PF_USEDFPU
在sys_trace()中如果request为PTRACE_SETFPREGS，则将child的used_math置为1，将child_flag清除PF_USEDFPU。
    child->flags &= ~PF_USEDFPU
在SMP方式下进行跟踪时，判断是否使用FPU。
在跟踪时出现数学错误时清位。
    current->flags &= ~PF_USEDFPU
PF_DTRACE          进程延期跟踪标志，只在m68k下使用。
跟踪一个trapping指令时置位。
    current->flags |= PF_DTRACE
PF_ONSIGSTK        标志进程是否工作在信号栈，只在m68k方式下使用。
liunx 2.1.19版本中使用此标志位，而2.2.8版本中不使用。
在处理信号建立frame时如果sigaction标志为ONSTACK，则将current->flag置为PF_ONSIGSTK。
PF_MEMALLOC      进程分配内存标志。
linux 2.2.8版本中使用此标志位。
在kpiod()和kwpad()中置位。
    tsk->flags |= PF_MEMALLOC
PF_VFORK           linux 2.2.8版本中使用此标志位。
在copy_flags(unsigned long clone_flags, struct task_struct *p)，如果clone_flags为CLONE_VFORK，则将p的flags置为PF_VFORK。
在mm_release（）中将current ->flags清除PF_VFORK。
    tsk->flags &= ~PF_VFORK
    具体的分析由我组的另外同学进行。

Linux的各进程之间的状态转换的系统调用
我将参与Linux的各进程之间的状态转换的系统调用总结成一张流程图：

进程的创建：TASK_RUNNING

第一个进程在系统启动时创建，当系统启动的时候它运行在核心态，这时，只有一个进程：初始化进程。象所有其他进程一样，初始进程有一组用堆栈、寄存器等等表示的机器状态。当系统中的其他进程创建和运行的时候这些信息存在初始进程的task_struct数据结构中。在系统初始化结束的时候，初始进程启动一个核心进程（叫做init）然后执行空闲循环，什么也不做。当没有什么可以做的时候，调度程序会运行这个空闲的进程。这个空闲进程的task_struct是唯一一个不是动态分配而是在核心连接的时候静态定义的，为了不至于混淆，叫做init_task。
系统调用sys_fork 和sys_clone都调用函数do_fork（）（在kernel/fork.中定义）。
进程由do_fork()函数创建，先申请空间，申请核心堆栈；然后在Task向量表中找到空闲位置；在进行正式初始化以前，将新创建的进程的状态都置为TASK_UNINTERRUPTIBLE，以免初始化过程被打断；开始初始化工作，如初始化进程时钟、信号、时间等数据；继承父进程的资源，如文件、信号量、内存等；完成进程初始化后,由父进程调用wake_up_process()函数将其唤醒，状态变为TASK_RUNNING，挂到就绪队列run queue，返回子进程的pid。

// C:\SRCLNX\KERNEL\FORK.C
int do_fork(unsigned long clone_flags, unsigned long usp, struct pt_regs *regs)
{

        为新进程申请PCB空间;
        if (申请不到)
                返回错误,退出;
        为新进程申请核心堆栈;
        if (核心堆栈申请不到)
                返回错误,退出;
        为新进程在Task向量表中找到空闲位置;
/*复制父进程current PCB中的信息，继承current的资源*/；
    p = current;
        在进行正式初始化以前，将新创建的进程的状态都置为TASK_UNINTERRUPTIBLE，以免初始化过程被打断,并置一些标志位.
/*为防止信号、定时中断误唤醒未创建完毕的进                                       程，将子进程的状态设成不可中断的*/
        p->state = TASK_UNINTERRUPTIBLE;
/*跟踪状态和超级用户特权是没有继承性的，因为在root用户为普通用户创建进程时，出于安全考虑这个普通用户的进程不允许拥有超级用户特权。*/
        p->flags &= ~(PF_PTRACED|PF_TRACESYS|PF_SUPERPRIV);
/*将进程标志设成初建，在进程第一次获得CPU时，内核将根据此标志进行一定操作*/
        p->flags |= PF_FORKNOEXEC;
   开始Task_struct的初始化工作，如初始化进程时钟、信号、时间等数据;
   继承父进程所有资源:
                拷贝父进程当前打开的文件;
                拷贝父进程在VFS的位置;
                拷贝父进程的信号量;
                拷贝父进程运行的内存;
                拷贝父进程的线程;
   初始化工作结束，父进程将其将其唤醒,挂入running队列中，返回子进程的pid;

}

进程的调度（schedule()）：

   处于TASK_RUNNING状态的进程移到run queue，会由schedule()按CPU调度算法在合适的时候选中，分配给CPU。
   新创建的进程都是处于TASK_RUNNING状态，而且被挂到run queue的队首。进程调度采用变形的轮转法（round robin）。当时间片到时（10ms的整数倍），由时钟中断引起新一轮调度，把当前进程挂到run queue队尾。
   所有的进程部分运行与用户态，部分运行于系统态。底层的硬件如何支持这些状态各不相同但是通常有一个安全机制从用户态转入系统态并转回来。用户态比系统态的权限低了很多。每一次进程执行一个系统调用，它都从用户态切换到系统态并继续执行。这时让核心执行这个进程。Linux中，进程不是互相争夺成为当前运行的进程，它们无法停止正在运行的其它进程然后执行自身。每一个进程在它必须等待一些系统事件的时候会放弃CPU。例如，一个进程可能不得不等待从一个文件中读取一个字符。这个等待发生在系统态的系统调用中。进程使用了库函数打开并读文件，库函数又执行系统调用从打开的文件中读入字节。这时，等候的进程会被挂起，另一个更加值得的进程将会被选择执行。进程经常调用系统调用，所以经常需要等待。即使进程执行到需要等待也有可能会用去不均衡的CPU事件，所以Linux使用抢先式的调度。用这种方案，每一个进程允许运行少量一段时间，200毫秒，当这个时间过去，选择另一个进程运行，原来的进程等待一段时间直到它又重新运行。这个时间段叫做时间片。
   需要调度程序选择系统中所有可以运行的进程中最值得的进程。一个可以运行的进程是一个只等待CPU的进程。Linux使用合理而简单的基于优先级的调度算法在系统当前的进程中进行选择。当它选择了准备运行的新进程，它就保存当前进程的状态、和处理器相关的寄存器和其他需要保存的上下文信息到进程的task_struct数据结构中。然后恢复要运行的新的进程的状态（又和处理器相关），把系统的控制交给这个进程。为了公平地在系统中所有可以运行（runnable）的进程之间分配CPU时间，调度程序在每一个进程的task_struct结构中保存了信息。
   policy 进程的调度策略：Linux有两种类型的进程：普通和实时。实时进程比所有其它进程的优先级高。如果有一个实时的进程准备运行，那么它总是先被运行。实时进程有两种策略：环或先进先出（round robin and first in first out）。在环的调度策略下，每一个实时进程依次运行，而在先进先出的策略下，每一个可以运行的进程按照它在调度队列中的顺序运行，这个顺序不会改变。
   Priority 进程的调度优先级。也是它允许运行的时候可以使用的时间量（jiffies）。你可以通过系统调用或者renice命令来改变一个进程的优先级。
   Rt_priority Linux支持实时进程。这些进程比系统中其他非实时的进程拥有更高的优先级。这个域允许调度程序赋予每一个实时进程一个相对的优先级。实时进程的优先级可以用系统调用来修改Coutner 这时进程可以运行的时间量（jiffies）。进程启动的时候等于优先级（priority），每一次时钟周期递减。
调度程序schedule()从核心的多个地方运行。它可以在把当前进程放到等待队列之后运行，也可以在系统调用之后进程从系统态返回进程态之前运行。需要运行调度程序的另一个原因是系统时钟刚好把当前进程的计数器(counter)置成了0。每一次调度程序运行它做以下工作：
（1）kernel work 调度程序运行bottom half handler并处理系统的调度任务队列。
（2）Current pocess 在选择另一个进程之前必须处理当前进程。
（3）如果当前进程的调度策略是环则它放到运行队列的最后。
（4）如果任务状态是TASK_INTERRUPTIBLE的而且它上次调度的时候收到过一个信号，它的状态变为TASK_RUNNING;
     如果当前进程超时，它的状态成为RUNNING;
     如果当前进程的状态为RUNNING则保持此状态;
     不是RUNNING或者INTERRUPTIBLE的进程被从运行队列中删除。这意味着当调度程序查找最值得运行的进程时不会考虑这样的进程。
(5)Process Selection 调度程序查看运行队列中的进程，查找最值得运行的进程。如果有实时的进程（具有实时调度策略），就会比普通进程更重一些。普通进程的重量是它的counter，但是对于实时进程则是counter 加1000。这意味着如果系统中存在可运行的实时进程，就总是在任何普通可运行的进程之前运行。当前的进程，因为用掉了一些时间片（它的counter减少了），所以如果系统中由其他同等优先级的进程，就会处于不利的位置：这也是应该的。如果几个进程又同样的优先级，最接近运行队列前段的那个就被选中。当前进程被放到运行队列的后面。如果一个平衡的系统，拥有大量相同优先级的进程，那么回按照顺序执行这些进程。这叫做环型调度策略。不过，因为进程需要等待资源，它们的运行顺序可能会变化。
(6)Swap Processes 如果最值得运行的进程不是当前进程，当前进程必须被挂起，运行新的进程。当一个进程运行的时候它使用了CPU和系统的寄存器和物理内存。每一次它调用例程都通过寄存器或者堆栈传递参数、保存数值比如调用例程的返回地址等。因此，当调度程序运行的时候它在当前进程的上下文运行。它可能是特权模式：核心态，但是它仍旧是当前运行的进程。当这个进程要挂起时，它的所有机器状态，包括程序计数器(PC)和所有的处理器寄存器，必须存到进程的task_struct数据结构中。然后，必须加载新进程的所有机器状态。这种操作依赖于系统，不同的CPU不会完全相同地实现，不过经常都是通过一些硬件的帮助。
(7)交换出去进程的上下文发生在调度的最后。前一个进程存储的上下文，就是当这个进程在调度结束的时候系统的硬件上下文的快照。相同的，当加载新的进程的上下文时，仍旧是调度结束时的快照，包括进程的程序计数器和寄存器的内容。
(8)如果前一个进程或者新的当前进程使用虚拟内存，则系统的页表需要更新。同样，这个动作适合体系结构相关。Alpha AXP处理器，使用TLT（Translation Look-aside Table）或者缓存的页表条目，必须清除属于前一个进程的缓存的页表条目。

下面我就来总结一下进程创建以后到被杀死的整个进程生命周期中，状态可能在TASK_RUNNING、TASK_INTERRUPTIBLE、TASK_UNINTERRUPTIBLE 、TASK_STOPPED以及TASK_ZOMBLE之间转换的原因。

进程在TASK_RUNNING以及TASK_UNINTERRUPTIBLE、TASK_INTERRUPTIBLE之间转换：
获得CPU而正在运行的进程会由于某些原因，比如：申请不到某个资源，其状态会从TASK_RUNNING变为TASK_INTERRUPTIBLE或TASK_UNINTERRUPTIBLE的等待状态。同样在经历了某些情况，处于等待状态的进程会被重新唤醒，等待分配给CPU。状态为TASK_INTERRUPTIBLE的睡眠进程会被唤醒，回到TASK_RUNNING状态，重新等待schedule()分配给它CPU，继续运行，比如：当申请资源有效时，也可以由signal或定时中断唤醒。而状态为TASK_INTERRUPTIBLE的睡眠进程只有当申请资源有效时被唤醒，不能被signal、定时中断唤醒。

1.通过sleep_on()、interruptible_sleep_on()、sleep_on_timeout()、interruptible_sleep_on_timeout（）以及wake_up()、wake_up_process()、wake_up_interruptible()函数对进行的转换：

   sleep_on():TASK_RUNNING->TASK_UNINTERRUPTIBLE
   当拥有CPU的进程申请资源无效时，会通过sleep_on()，将进程从TASK_RUNNING切换到TASK_UNINTERRUPTIBLE状态。sleep_on()函数的作用就是将current进程的状态置成TASK_UNINTERRUPTIBLE,并加到等待队列中。
   一般来说引起状态变成TASK_UNINTERRUPTIBLE的资源申请都是对一些硬件资源的申请，如果得不到这些资源，进程将不能执行下去，不能由signal信号或时钟中断唤醒，而回到TASK_RUNNING状态。
   我们总结了这种类型的转换原因有：
（1）对某些资源的操作只能由一个进程进行，所以系统对该项资源采用上锁机制。在申请该项资源时，必须先申请资源的锁，如果已经被别的进程占用，则必须睡眠在对该锁的等待队列上。而且这种睡眠不能被中断，必须等到得到了资源才能继续进行下去。
如：
对网络连接表锁（Netlink table lock）的申请, sleep_on(&nl_table_wait)；
对交换页进行I/O操作的锁的申请, sleep_on(&lock_queue)；
对Hash表操作的锁的申请, sleep_on(&hash_wait)；
在UMSDOS文件系统创建文件或目录时，必须等待其他同样的创建工作结束,sleep_on (&dir->u.umsdos_i.u.dir_info.p);

（2）某些进程在大部分时间处于睡眠状态，仅在需要时被唤醒去执行相应的操作，当执行完后，该进程又强制去睡眠。
如：
wakeup_bdflush（)是对dirty buffer进行动态的响应，一旦该进程被激活，就将一定数量的dirty buffer写回磁盘，然后调用sleep_on(&bdflush_done),又去睡眠。

interruptible_sleep_on():TASK_RUNNING->TASK_INTERRUPTIBLE
   与sleep_on()函数非常地相象，当拥有CPU的进程申请资源无效时，会通过interruptible_sleep_on()，将进程从TASK_RUNNING切换到TASK_INTERRUPTIBLE状态。interruptible_sleep_on()函数的作用就是将current进程的状态置成TASK_INTERRUPTIBLE,并加到等待队列中。
   处于TASK_INTERRUPTIBLE状态的进程可以在资源有效时被wake_up()、wake_up_interruptible()或wake_up_process()唤醒，或收到signal信号以及时间中断后被唤醒。
   进行这种转换的原因基本上与sleep_on()相同，申请资源无效时进程切换到等待状态。与之不同的是处于interruptible_sleep_on()等待状态的进程是可以接受信号或中断而重新变为running状态。所以可以认为对这些资源的申请没有象在sleep_on()中资源的要求那么严格，必须得到该资源进程才能继续其运行下去。

sleep_on_timeout():TASK_RUNNING->TASK_UNINTERRUPTIBLE
sleep_on_timeout(&block.b_wait, 30*HZ);

interruptible_sleep_on_timeout（）：TASK_RUNNING->TASK_INTERRUPTIBLE
虽然在申请资源或运行中出现了某种错误，但是系统仍然给进程一次重新运行的机会。调用该函数将进程从TASK_RUNNING切换到TASK_INTERRUTIBLE状态，并等待规定的时间片长度,再重新试一次。
如：在smb_request_ok 中产生了连接失败的错误，会在sem_retry()中给一次重新连接的机会。//interruptible_sleep_on_timeout(&server->wait, 5*HZ)；

wake_up():TASK_UNINTERRUPTIBLE-> TASK_RUNNING;
          TASK_INTERRUPTIBLE-> TASK_RUNNING
   处于TASK_UNINTERRUPTIBLE状态的进程不能由signal信号或时钟中断唤醒，只能由wake_up()或wake_up_process()唤醒。wake_up()函数的作用是将wait_queue中的所有状态为TASK_INTERRUPTIBLE或TASK_UNINTERRUPTIBLE的进程状态都置为TASK_RUNNING,并将它们都放到running队列中去，即唤醒了所有等待在该队列上的进程。
void wake_up(struct wait_queue **q)
{
        struct wait_queue *next;
        struct wait_queue *head;

        if (!q || !(next = *q))
                return;
        head = WAIT_QUEUE_HEAD(q);
        while (next != head) {
                struct task_struct *p = next->task;
                next = next->next;
                if (p != NULL) {
                        if ((p->state == TASK_UNINTERRUPTIBLE) ||
                            (p->state == TASK_INTERRUPTIBLE))
                                wake_up_process(p);
                }
                if (!next)
                        goto bad;
        }
        return;
bad:
        printk("wait_queue is bad (eip = %p)\n",
                __builtin_return_address(0));
        printk("        q = %p\n",q);
        printk("       *q = %p\n",*q);
}

wake_up()在下列情况下被调用：
这个函数通常在资源有效时调用，资源锁已经被释放，等待该资源的所有进程都被置为TASK_RUNNING状态，移到run queue，重新参与调度，对这一资源再次竞争。这时又会有某个进程竞争到了该项资源，而其他的进程在申请失败后，又回到TASK_UNINTERRUPTIBLE或TASK_INTERRUPTIBLE状态。
如：
网络连接表锁（Netlink table lock）释放后,唤醒等待该锁的所有睡眠进程 wake_up(&nl_table_wait)；
对交换页进行I/O操作的锁释放后,唤醒等待该锁的所有睡眠进程, wake_up(&lock_queue)；
对Hash表操作的锁释放后,唤醒等待该锁的所有睡眠进程，wake_up(&hash_wait)；
在UMSDOS文件系统创建文件或目录工作结束后,唤醒其他由于等待它创建结束而睡眠的进程， wake_up (&dir->u.umsdos_i.u.dir_info.p)；

唤醒睡眠进程执行某些操作：
如：
bd_flush()函数要将一些dirty buffer写回磁盘，就调用wake_up(&bdflush_done)，唤醒正在睡眠的wakeup_bdflush（)进程去处理写回。

wake_up_process()：TASK_UNINTERRUPTIBLE-> TASK_RUNNING;
TASK_INTERRUPTIBLE-> TASK_RUNNING
wake_up_process()函数的作用是将参数所指的那个进程状态从TASK_INTERRUPTIBLE,TASK_UNINTERRUPTIBLE变为TASK_RUNNING,并将它放到running队列中去。

void wake_up_process(struct task_struct * p)
{
unsigned long flags;

        /*
        * We want the common case fall through straight, thus the goto.
        */
        spin_lock_irqsave(&runqueue_lock, flags);
        p->state = TASK_RUNNING;
        if (p->next_run)
                goto out;
        add_to_runqueue(p);
        spin_unlock_irqrestore(&runqueue_lock, flags);

        reschedule_idle(p);
        return;
out:
        spin_unlock_irqrestore(&runqueue_lock, flags);
}

这个函数的实现机制与wake_up()的不同在于，它只能唤醒某一个特定的睡眠进程，而wake_up()是唤醒整个等待队列的睡眠进程。所以，它的唤醒的原因与wake_up()也有一定的区别，除了由于wake_up()对它的调用之外，它唤醒进程并不是由于资源有效造成的，唤醒的进程也不是因等待资源有效而睡眠的进程。有以下几种情况：
父进程对子进程的唤醒：
如：
在sys_ptrace()中当收到的跟踪请求为：PTRACE_CONT（在处理完信号后继续）；PTRACE_KILL（将子进程杀出）；PTRACE_SINGLESTEP（对子进程进行单步跟踪）；PTRACE_DETACH的时候，都会在处理结束时，唤醒子进程，给子进程一个运行的机会。
在do_fork()中，新建进程初始化完毕，会由父进程唤醒它，将该进程移到run queue中，置状态为TASK_RUNNING。

当需要的时候唤醒某个睡眠的系统调用，进行处理：
如：
kswapd_process页面交换进程，通常是处于睡眠状态的，当某个进程需要更多的内存，而调用try_to_free_pages（）时，就会唤醒kswapd_process页面交换进程，调入更多的内存页面。

收到信号所进行的相应处理：
如：
某一进程的时间片到了，process_timeout()会调用wake_up_process()唤醒该进程；
收到某些signal信号:处于STOPPED状态的进程收到SIGKILL或SIGCONT会被唤醒（注：处于STOPPED状态的进程不能被wake_up()唤醒）；以及收到某些非实时信号，不需加到signal队列中去，处于TASK_INTERRUPTIBLE的进程有机会被唤醒。

资源有效时，wake_up()对整个等待队列的唤醒是通过对每个等待队列上的进程调用wake_up_process()实现的。

wake_up_interruptible（）：TASK_INTERRUPTIBLE-> TASK_RUNNING
   将wait_queue中的所有状态为 TASK_INTERRUPTIBLE的进程状态都置为TASK_RUNNING,并将它们都放到running queue中去。
   这个函数通常在send_sig(发出信号)后调用,以使信号发出后能及时得到响应，或者当空闲下来时，希望检查一下是否有收到有效信号的能运行的进程时，也可以调用这个函数，如：
在进程退出前调用notify_parent(),给父进程send_sig（）后，将调用wake_up_interruptible （），使信号能够得到及时的响应。
usr\src\linux\KERNEL\EXIT.C 中定义了
void notify_parent(struct task_struct * tsk, int signal)
{
        send_sig(signal, tsk->p_pptr, 1);
        wake_up_interruptible(&tsk->p_pptr->wait_chldexit);
}
   当某一进程要结束时，它可以通过调用notify_parent(current, current->exit_signal)通知父进程以唤醒睡眠在wait_chldexit上的父进程

2. Semaphores（信号灯）

信号量用于生成锁机制,避免发生数据不一致。
信号量最简单的形式就是内存中一个位置，它的取值可以由多个进程检验和设置。检验和设置的操作，至少对于关联的每一个进程来讲，是不可中断或者说有原子性：只要启动就不能中止。检验和设置操作的结果是信号灯当前值和设置值的和，可以是正或者负。根据测试和设置操作的结果，一个进程可能必须睡眠直到信号灯的值被另一个进程改变。信号灯可以用于实现临界区域（critical regions），就是重要的代码区，同一时刻只能有一个进程运行。
   对信号灯的操作是通过以下两组基本函数实现的：
1．void __up(struct semaphore *sem) ：TASK_UNINTERRUPTIBLE-> TASK_RUNNING;
                                  TASK_INTERRUPTIBLE-> TASK_RUNNING
    int __do_down(struct semaphore * sem, int task_state)由以下两个函数调用，分别转换到不同的等待状态：
(1)int __down_interruptible (struct semaphore * sem)：
   TASK_RUNNING ->TASK_INTERRUPTIBLE;
(2)void __down(struct semaphore * sem):
   TASK_RUNNING ->TASK_UNINTERRUPTIBLE；
2. extern inline void up(struct semaphore * sem)
   extern inline void down(struct semaphore * sem)；
   extern inline int down_interruptible(struct semaphore * sem)；

Linux信号量是通过两路counter变量实现的：当进程由于申请不到临界区资源而睡眠时，会将semaphore结构中的”count”变量值原子地递减1，进程睡眠等待临界区资源的释放；而当up()函数唤醒睡眠等待进程时，如果”count”变量值小于0，会将semaphore结构中的” waking”变量值原子地递增1，唤醒睡眠进程。虽然所有等待进程都被唤醒。但只有首先得到” waking”的进程才能得到信号量，继续运行下去，其他进程仍然回到最初的等待状态。

Linux定义信号灯结构是：
struct semaphore {
        atomic_t count;
        int waking;
        struct wait_queue * wait;
};
   信号灯的值初始化为一个宏定义的结构MUTEX的值{count=1，waking=0，wait=NULL}。

void __up(struct semaphore *sem)：
占有临界区资源的进程，调用__up()释放资源。在__up()函数中，调用wake_one_more ()函数，原子地读sem->count, 如果sem->count <=0,则sem->waking ++,并唤醒所有等待在该sem-->wait上的进程。
void __up(struct semaphore *sem)
{
wake_one_more(sem);
wake_up(&sem->wait);
}

int __do_down(struct semaphore * sem, int task_state):
申请临界区资源的进程会通过调用__do_down()来竞争资源。在__do_down()函数中，调用waking_non_zero(struct semaphore *sem)或waking_non_zero_interruptible(struct semaphore *sem)抢占临界区资源，如果抢占到，则将当前进程置为TASK_RUNNING,否则将当前进程的状态置为task_state，并处于循环等待状态。
进程通过waking_non_zero（）来竞争临界区资源，在该函数中判断sem-->waking的值，如果sem-->waking 大于0,sem->waking -- 并返回1，否则返回0。
int __do_down(struct semaphore * sem, int task_state)
{
        struct task_struct *tsk = current;
        struct wait_queue wait = { tsk, NULL };
        int                ret = 0 ;

tsk->state = task_state;
add_wait_queue(&sem->wait, &wait); /*将进程加入到等待队列*/

        for (;;)
        {
                if (waking_non_zero(sem))        /* 是否已经被唤醒 */
                    break ;                            /* 是的，跳出循环 */

                if (   task_state == TASK_INTERRUPTIBLE
                    && (tsk->signal & ~tsk->blocked)
        /* 如果进程状态为TASK_INTERRUPTIBLE，且收到信号量，并未被屏蔽*/
                   )
                {
                    ret = -EINTR ;                     /* 中断 */
                    atomic_inc(&sem->count) ;        /* 放弃down操作，原子递增信号量的count值 */
                    break ;
                }

                schedule();                    /* 重新调度 */
                tsk->state = task_state;      /*未能竞争到信号量的进程重新置成执行down操
                                        作前的状态*/
        }

        tsk->state = TASK_RUNNING;        /*竞争到信号量的进程置为TASK_RUNNING状态*/
        remove_wait_queue(&sem->wait, &wait);/*将进程从等待队列中删除*/
        return(ret) ;

} /* __do_down */

其中_do__down()又分别由__down()和__do_down()调用，进程转换到不同状态。
void __down(struct semaphore * sem): TASK_RUNNING ->TASK_UNINTERRUPTIBLE；
void __down(struct semaphore * sem)
{
__do_down(sem,TASK_UNINTERRUPTIBLE) ;
}

int __down_interruptible (struct semaphore * sem)： TASK_RUNNING ->TASK_INTERRUPTIBLE;
int __down_interruptible(struct semaphore * sem)
{
return(__do_down(sem,TASK_INTERRUPTIBLE)) ;
}

在Linux中定义了两种不同的信号灯：
（1）定义在某个数据结构上：
在linux系统中有很多数据结构中定义了这样的信号灯，来控制对这个数据结构的资源访问，比如不允许对某个内存单元进行多进程访问，就通过定义在该内存单元上的某个信号灯mmap_sem进行__up()、_down()、up()、down()操作。
如：
struct mm_struct中有mmap_sem信号灯；
struct inode中有i_sem、i_atomic_write信号灯；
struct nlm_file中有f_sema信号灯；
struct nlm_host中有h_sema信号灯；
struct superblock中有s_vfs_rename_sem信号灯；
struct vfsmount中有mnt_dquot.semaphore信号灯；
struct task_struct中有vfork_sem信号灯；//注：这个信号灯在2.0.36版本是没有的，新版本2.2.8中才有的，用于vfork()。
struct unix_opt中有readsem信号灯；
struct smb_sb_info中有sem信号灯；
申请这些数据结构中的临界区资源，就要进行相应的信号灯操作。

（2）定义在全局的单独信号灯数据：
还有一些单独的全局信号灯，它们并不属于某一个数据结构，而是系统定义的全局静态的信号灯，可能有多个进程对这种不属于某个特定数据结构的全局临界资源的申请，则系统通过这些全局信号灯来分配资源。
如：
nlm_file_sema；
nlmsvc_sema；
lockd_start；
read_sem；
nlm_host_sema；
read_semaphore；
uts_sem
mount_sem；
cache_chain_sem；
rpciod_sema；
rpciod_running；
mfw_sema；
firewall_sem；

   我们来分析一个例子说明信号灯的操作。例如对文件的写操作，我们假设有许多协作的进程对一个单一的数据文件进行写操作。我们希望对文件的访问必须严格地协调。因此这里就利用了inode结构上定义的信号灯inode->i_sem。
在 /usr/src/linux/mm/filemap.c中：
static int filemap_write_page(struct vm_area_struct * vma,
        unsigned long offset,
        unsigned long page)
{
        int result;
        struct file file;
    struct inode * inode;
        struct buffer_head * bh;

……………

        down(&inode->i_sem);
        result = do_write_page(inode, &file, (const char *) page, offset);
        up(&inode->i_sem);
        return result;
}

在该文件写操作的代码中，加入两个信号灯操作，第一个down(&inode->i_sem)检查并把信号灯的值减小，第二个up(&inode->i_sem)检查并增加它。访问文件的第一个进程试图减小信号灯的数值，如果成功，信号灯的count取值成为0。这个进程现在可以继续运行并使用数据文件。但是，如果另一个进程需要使用这个文件，现在它试图减少信号灯的count数值，它会失败因为结果会是-1。这个进程会被挂起直到第一个进程处理完数据文件。当第一个进程处理完数据文件，它会增加信号灯的waking数值成为1。现在等待进程会被唤醒，这次它减小信号灯的尝试会成功。

每一个独立的信号灯操作可能都需要维护一个调整动作。Linux至少为每一个进程的每一个信号灯数组都维护一个sem_undo的数据结构。如果请求的进程没有，就在需要的时候为它创建一个。这个新的sem_undo数据结构同时在进程的task_struct数据结构和信号灯队列的semid_ds数据结构的队列中排队。对信号灯队列中的信号灯执行操作的时候，和这个操作值相抵消的值加到这个进程的sem_undo数据结构的调整队列这个信号灯的条目上。所以，如果操作值为2，那么这个就在这个信号灯的调整条目上增加-2。

当进程被删除，比如退出的时候，Linux遍历它的sem_undo数据结构组，并实施对于信号灯数组的调整。如果删除信号灯，它的sem_undo数据结构仍旧停留在进程的task_struct队列中，但是相应的信号灯数组标识符标记为无效。这种情况下，清除信号灯的代码只是简单地废弃这个sem_undo数据结构。

3.锁机制
   lock_…（）；
   unlock_…（）；
   wait_on_…（）：TASK_RUNNING ->TASK_UNINTERRUPTIBLE；
   进程在RUNNING,WAITING状态间转换时，锁机制也是Linux中解决进程之间共享资源的一个方法。锁就是在资源的结构定义中加入一个锁成员，或为一个标志位,它的取值可以由多个进程检验和设置。锁可以用于实现对资源的共享竞争。具体来说当一个进程占用一个资源时，先对其上锁，然后再进行相关的操作，如果这时别的进程也要用这个资源，则必须等待这个锁被解开后，才可以进行下去。
   但是，锁仅在某些数据结构和资源申请中才会用到，进程在申请某种特定资源时，会调用相应的__wait_on_… 函数来判断是否该资源已经被上锁，如果未上锁或已被解锁，则分配资源给进程，否则进程加入到等待队列中去。这种类型的申请有：__wait_on_dquot、__wait_on_buffer、__wait_on_inode、__wait_on_page、__wait_on_super等。
   值得注意的是，如果申请不到这种资源，进程的状态都是转变成TASK_UNINTERRUPTIBLE。
   定义锁的方式有两种：
专门的某项数据结构：
如：Superblock的数据结构中专门定义了锁数据项：s_lock；
置数据结构中某一项的某个标志位为锁标志：
如：
struct inode中定义了i_state的数据项，通过判断i_state 是否置了 I_LOCK，来判断该inode节点是否上锁。（2.2.8版本中定义）//注意：在2.2.0.34版本中是采用了专门的数据项i_lock来进行锁操作的。
struct buffer_head 中定义了b_state的数据项，通过判断b_state是否置了 BH_Lock位,来判断该缓冲区头是否上锁。
struct dquot中定义了dq_flags的数据项，通过判断dq_flags是否置了DQ_LOCKED位,来判断该dquot是否上锁。
struct page中定义了flags的数据项，通过判断flags是否置了PG_locked 位,来判断该页是否上锁。//注：程序中一般采用PageLocked(page)函数来判断是否上锁。

   我们以buffer_head的加锁和解锁操作为例来解释一下通过锁机制进行的状态转换，在这里要申请buffer_head 资源，先要申请到锁，buffer_head的加锁和解锁就是通过置位和复位bh->b_state来实现的：
//申请资源时将该缓冲区上锁，置锁位，如果申请不到，睡眠在等待队列上，等待该锁的释放。
extern inline void lock_buffer(struct buffer_head * bh)
{
        while (set_bit(BH_Lock, &bh->b_state))
                __wait_on_buffer(bh);
}

//资源释放时，清该缓冲区锁位，并唤醒等待队列上的进程，参与竞争资源。
void unlock_buffer(struct buffer_head * bh)
{
......

        clear_bit(BH_Lock, &bh->b_state);
        wake_up(&bh->b_wait);
        ......
}

//检验该锁位是否已经置位
static inline int buffer_locked(struct buffer_head * bh)
{
return test_bit(BH_Lock, &bh->b_state);
}

//在 \USR\SRC\LINUX\FS\BUFFER.C中定义了__wait_on_buffer(stuct buffer_head * bh);该函数判断该buffer_head是否已经被上了锁，如果是，则不能得到资源，将进程置成TASK_UNINTERRUPTIBLE，加入bh-->b_wait队列中，调用schedule()转去调用其他的进程，否则，分配给资源，进程进入TASK_running状态。
void __wait_on_buffer(struct buffer_head * bh)
{
struct wait_queue wait = { current, NULL };

        bh->b_count++;
        add_wait_queue(&bh->b_wait, &wait);/*进程加入到等待锁的队列*/
repeat:
        run_task_queue(&tq_disk);
        current->state = TASK_UNINTERRUPTIBLE;/*进程状态置为TASK_UNINTERRUPTIBLE*/
        if (buffer_locked(bh)) {
                schedule();    /*如果申请不到锁，重新调度CPU*/
                goto repeat;
        }
        remove_wait_queue(&bh->b_wait, &wait);/*进程从等待队列中删除*/
        bh->b_count--;
        current->state = TASK_RUNNING; /*进程状态置为TASK_ RUNNING*/
}

4. 管道（流）
   管道做为系统的特殊设备文件,可以是内存方式的,也可以是外存方式的。管道的传输一般是单向的,即一个管道一向,若两个进程要做双向传输则需要2个管道.管道生成时即有两端,一端为读,一端为写,两个进程要协调好,一个进程从读方读,另一个进程向写方写。管道的读写使用流设备的读写函数,即:read(),write.管道的传输方式为FIFO,流方式的.不象消息队列可以按类型读取.管道分为有名管道和无名管道：
1．有名管道
    一般为系统特殊文件方式,使用的进程之间不一定要有父子关系或兄弟关系.
2．无名管道
    一般为内存方式,使用的进程之间一定要有父子关系或兄弟关系.

Linux shell允许重定向。例如：

$ ls | pr | lpr

把列出目录文件的命令ls的输出通过管道接到pr命令的标准输入上进行分页。最后，pr命令的标准输出通过管道连接到lpr命令的标准输入上，在缺省打印机上打印出结果。管道是单向的字节流，把一个进程的标准输出和另一个进程的标准输入连接在一起。没有一个进程意识到这种重定向，和它平常一样工作。是shell建立了进程之间的临时管道。在Linux中，使用指向同一个临时VFS INODE节点（本身指向内存中的一个物理页）的两个file数据结构来实现管道。当写进程向管道中写的时候，字节拷贝到了共享的数据页，当从管道中读的时候，字节从共享页中拷贝出来。Linux必须同步对于管道的访问。必须保证管道的写和读步调一致，它使用锁、等待队列和信号。
运用管道方式进行通讯的进程，由于都是调用sleep_on_interruptible,因此都是睡眠在TASK_INTERRUPTIBLE状态的。

管道结构的定义在include\linux\pipe_fs_i.h中，
struct pipe_inode_info {
        struct wait_queue * wait;
        char * base;
        unsigned int start;
        unsigned int len;
        unsigned int lock; //用到了锁
        unsigned int rd_openers;
        unsigned int wr_openers;
        unsigned int readers;
        unsigned int writers;
};

对管道的操作主要有读和写两种：
1．向一个管道写pipe_write()：
在\fs\pipe.c中定义了static int pipe_write(struct inode * inode, struct file * filp, const char * buf, int count)；
实现机制：当写进程向管道写的时候，它使用标准的write库函数。这些库函数传递的文件描述符是进程的file数据结构组中的索引，每一个都表示一个打开的文件，在这种情况下，是打开的管道。Linux系统调用使用描述这个管道的file数据结构指向的write例程。这个write例程使用表示管道的VFS INODE节点存放的信息，来管理写的请求。如果有足够的空间把所有的字节都写导管到中，只要管道没有被读进程锁定，Linux为写进程上锁，并把字节从进程的地址空间拷贝到共享的数据页。如果管道被读进程锁定或者空间不够，当前进程睡眠，并放在管道INODE节点的等待队列中，并调用调度程序，运行另外一个进程。它是可以中断的，所以它可以接收信号。当管道中有了足够的空间写数据或者锁定解除，写进程就会被读进程唤醒。当数据写完之后，管道的VFS INODE 节点锁定解除，管道INODE节点的等待队列中的所有读进程都会被唤醒。

2．从一个管道读Pipe_read()：
在\fs\pipe.c中定义了static int pipe_read(struct inode * inode, struct file * filp, char * buf, int count)；
实现机制：从管道中读取数据和写数据非常相似。进程允许进行非阻塞的读（依赖于它们打开文件或者管道的模式），这时，如果没有数据可读或者管道被锁定，会返回一个错误。这意味着进程会继续运行。另一种方式是在管道的INODE节点的等待队列中等待，直到写进程完成。如果管道的进程都完成了操作，管道的INODE节点和相应的共享数据页被废弃。

进程在TASK_RUNNING和TASK_STOPPED间的转换：
1．进程从TASK_RUNNING->TASK_STOPPED的转换：
TASK_STOPPED状态是一种暂停状态，和TASK_STOPPED状态配合工作的标志为PF_PTRACED和PF_TRACESYS，分别表示被跟踪和正在跟踪系统调用，一个是被动的，一个是主动的。
进程可通过两种途径进入TASK_STOPPED状态：
1)．受其它进程的syscall_trace（）系统调用的控制而暂时将CPU交给控制进程。
在调用syscall_trace（）之前必须先调用sys_ptrace()（简称ptrace（）），进行跟踪系统调用之前的准备工作。只有调用sys_ptrace()后，current的PF_PTRACED和PF_TRACESYS标志都已置位，跟踪和被跟踪的关系都已明确，再调用syscall_trace（）才能真正使进程转入STOPPED状态。
        syscall_trace（）实现步骤：
（1）检验是否调用过ptrace（）做过准备，没有则返回；
        （2）置状态STOPPED ，通知父进程，子进程状态已变；
        （3）进行CPU重新调度，将current进程从runqueue删除。
（4）如果exit_code非空，将它的值作为接收到的信号放到signal中。若是SIGTRAP
     则current进程将一直处于stopped，直到收到其它信号。

sys_ptrace()实现步骤：
（1）如果request == PTRACE_TRACEME，则有进程要求跟踪current进程:
                若current进程已被其它进程跟踪返回；
    否则置current进程已被进程跟踪的标记；
（2）如果current进程想跟踪其它进程:
            a.不能跟踪init进程；
                b.找pid对应的进程child，找不到返回出错；
                c.如果request为PTRACE_ATTACH
    如果current进程试图ATTACH自己，出错；
                如果试图attach的进程child已被attach，出错；
                否则        child->flags |= PF_PTRACED;做上标记，child已被attach；如果child
    不是current的子进程，将它变成current的子进程；并且发SIGSTOP信号，暂
    停它。
（3）进行其他合法性检查；
    （4）判断request，执行相应操作：
                case PTRACE_SYSCALL:继续执行，在下一个系统调用返回处停下。
    case PTRACE_CONT:发信号后重新开始运行。
                        如果request == PTRACE_SYSCALL，置child标志位PF_TRACESYS;
                        否则        清child标志位PF_TRACESYS，以防止重新运行后因历史原因在下一个
        系统调用返回处停下；
                        唤醒child进程。
        case PTRACE_KILL: 想要结束child进程，唤醒child进程，并在退出信息
                      exit_code中存放SIGKILL信号。
                case PTRACE_SINGLESTEP: 进行单步运行环境设置。
                case PTRACE_DETACH: 恢复child进程的自由。清跟踪标志，并唤醒child进程                                        恢复child进程的原始亲属关系。

2)．收到要求它暂停的信号。
另一种进入STOPPED状态的方法是信号，SIGSTOP信号使自由的current进程，打上PF_PTRACED标记，并将它转入STOPPED状态。do_signal在检查进程收到的信号时，若发现current进程已打上PF_PTRACED标记，则除收到的SIGKILL信号的情况外，current进程都将马上进入STOPPED状态。
do_signal（）实现步骤：
（1）current进程已打上PF_PTRACED标记，则除收到的SIGKILL信号的情况外，进程都将进入TASK_STOPPED状态，通知父进程，并重新调度；
（2）如果收到信号SIGSTOP:如果当前进程标志位不是PF_PTRACED，则置当前进程状态为TASK_STOPPED; 通知父进程，并重新调度；

2.进程从TASK_STOPPED->TASK_RUNNING的转换:
从TASK_STOPPED状态转到TASK_RUNNING状态通过“信号唤醒”。当有SIGKILL或SIGCONT信号发给TASK_STOPPED状态下的进程时，进程将被wake_up_process()唤醒。
int send_sig(unsigned long sig,struct task_struct * p,int priv)
{
        ………;
        save_flags(flags); cli();   /*保存标志，关中断*/
        if ((sig == SIGKILL) || (sig == SIGCONT)) {
                if (p->state == TASK_STOPPED)
                        wake_up_process(p);     /*若进程p的状态是STOPPED，并且所发送的信号是SIGKILL和SIGCONT，将p状态赋成RUNNING，并挂到run-queue*/
p->exit_code = 0;      /*退出信息没有*/
                p->signal &= ~( (1<<(SIGSTOP-1)) | (1<<(SIGTSTP-1)) |
                                (1<<(SIGTTIN-1)) | (1<<(SIGTTOU-1)) );         /*处理过信号后，将p的可能曾接受到的SIGSTOP、SIGTSTP、SIGTTIN、SIGTTOU信号清掉*/
        }
        if (sig == SIGSTOP || sig == SIGTSTP || sig == SIGTTIN || sig == SIGTTOU)
                p->signal &= ~(1<<(SIGCONT-1));     /*若信号为SIGSTOP、SIGTSTP、SIGTTIN、SIGTTOU中的任一种，将p可能曾接受到的SIGCONT信号清掉*/
        restore_flags(flags);          /*恢复CPU标志同时打开中断*/
        generate(sig,p);     /*登记不能立即被处理的信号。*/
        return 0;
}

进程的终止:从TASK_RUNNING->TASK_ZOMBIE的转换
进程终止由可终止进程的系统调用通过调用do_exit（）实现，do_exit（）终止current进程，首先为current进程做上PF_EXITING的标记，释放current进程的存储管理信息、文件系统、文件信息、信号响应函数指针数组，将状态置成TASK_ZOMBIE，通知current的父进程，最后进行重新调度。do_exit()带一个参数code，用于传递终止进程的原因。

do_exit(long code)流程：
    （1）如果进程在中断服务程序中调用do_exit（），则打印提示信息
    （2）记录进程的记帐信息
    （3）进程标志置为PF_EXITING
（4）释放定时器链表
        （5）释放临界区数据
        （6）将消息队列中和current进程有关项删除
        （7）释放进程的存储管理信息
        （8）释放进程已打开文件的信息
        （9）释放进程的文件系统
        （10）释放进程的信号响应函数指针数组等管理信息
        （11）释放进程的LDT
        （12）进程状态置为TASK_ZOMBIE
        （13）置上退出信息，通知所有进程亲戚，它要退出了#
        （14）exec_domain结构共享计数减1， binfmt结构共享计数减1
（15）重新调度，将current进程从run-queue中删除，交出CPU

exit_notify （）函数向所有和current进程有关的进程发相应的消息，以便它们开展工作，exit_notify （）还判断cueernt进程所在组是否会因current进程的退出而悬空，如果悬空并且组内有stopped状态的进程则发信号；同时进行一系列的指针调整，调整因current进程的死亡引起的进程关系转变。
exit_notify （）流程：
将所有原始进程为current的进程变成init进程的孙子。
如果父进程和current进程不在同一组，但在同一session内并且current进程组内所有进程的父进程和它在同一组，也就是说，current进程所在组会因current的退出而悬挂，同时current进程所在组内有stopped进程，就向整个组发SIGHUP和SIGCONT信号。
通知父进程进程死了。
调整所有current进程的子进程的父进程指针，将它们挂到它们的原始进程下，
     将以往的跟踪被跟踪历史清除，调整它和新的兄弟的关系；检查每一个current
     进程的子进程所在的组是否会悬挂，如果子进程和current进程不在同一组，并
     且这个组已悬挂，组内有stopped的进程，就向组员发SIGHUP 和 SIGCONT信号。        （5）如果current进程是session的主管，就和它所控制的tty脱离，向current
     进程显示终端所在的组发SIGHUP 和 SIGCONT信号。

进程直接或间接地调用do_exit() 后，进程进入ZOMBIE状态，还有一块PCB未释放。PCB的释放必须由它的父进程执行，当父进程调用sys_wait4（）时释放进入ZOMBIE状态的子进程的PCB。

具体调用do_exit()的函数有以下情况：
具体对应的系统调用出错，不得不终止进程，如：
do_page_fault（）：这个系统调用处理页面错误，它找到页面地址，出错原因，并将它转入相应的处理函数。当发生越界（out of memory）等bad page的致命错误。

sys_sigreturn（）：一般情况下sys_sigreturn（）将sigcontext的内容保存到堆栈中，保存过程中当发现段寄存器越界了，这个致命错误就将导致进程结束。

setup_frame（）：setup_frame（）建立信号响应函数的运行的环境，保存当前寄存器，将相应寄存器赋上信号响应函数的信息。在具体设定之前首先进行存储条件检验，不满足就不得不结束进程。

save_v86_state（）：save_v86_state（）保存虚拟86模式下（virtual 86 mode）的信息，如果进程PCB中vm86的信息为空的，无法继续进行操作，只能结束进程。

（2）其他终止进程的情况，通过调用以下函数实现终止：
sys_exit（）：是一个系统调用，实现终止调用它的当前进程。

sys_reboot()：sys_reboot()只能被特权级用户调用，用于重新启动系统。

do_signal（）：do_signal（）是处理信号的一个函数。检查current进程每一个接收到的signal，如果是结束进程的信号，结束进程进行相应处理。

die_if_kernel（）。

[目录]

线程
1 概述
1.1 线程的定义(Introduction)
Threads can best be described as “lightweight processes”. The traditional UNIX-style notion of a process has been found to be inconvenient, if not inadequate for several applications in distributed systems development. The needs of these applications are best served by threads, which generalize the notion of a process so that it can be associated with multiple activities. The popularity of threads has resulted in their implementation on UNIX systems and thread libraries are now widely available to programmers for the development of concurrent applications.

1.2 Threads Implementation
Threads can be implemented in one of two ways:
1. User-level threads:
There is no kernel support for multi-threaded processes. Hence, the kernel only has a single-threaded process abstraction, but multi-threaded processes are implemented in a library of procedures linked with application programs. The kernel has no knowledge of lightweight processes (threads), and therefore cannot schedule them independently. A threads run-time library organizes the scheduling of threads. A thread would block the process and therefore all threads within it if it made a blocking system call, so the asynchronous I/O facilities of UNIX are used. The major disadvantage of this scheme is that threads within a process cannot take advantage of a multi-processor.
（上段译文）User-level没有核心支持的多线程的进程。因此，核心只有单线程进程概念，而多线程进程由与应用程序连接的过程库实现。核心不知道线程的存在，也就不能独立的调度这些线程了。一个线程运行库组织线程的调度。如果一个线程调用了一个阻塞的系统调用，进程可能被阻塞，当然其中的所有线程也同时被阻塞，所以UNIX使用了异步I/O工具。这种机制的的最大缺点是不能发挥多处理器的优势。
The advantages include:
（系统消耗小）Certain thread operations are significantly less costly. For example, switching between threads belonging to the same process do not necessarily involve a system call, and hence save this over-head.
（可以修改以适应特殊的应用）User-level thread implementations can be customized or changed to suit the particular application requirements. This is particularly useful for real-time multimedia processing etc. Also, it is possible to support many more user-level threads than can by default by a kernel.
2. Kernel-level threads:
This implementation allows threads within different processes to be scheduled according to a single scheme of relative prioritizing. This is suited for exploiting the concurrence of multiprocessors.
核心级线程如许不同进程里的线程按照同一相对优先方法调度，这适合于发挥多处理器的并发优点。
Most of the current thread library implementations available today implement user-level threads. There have been several research projects that have implemented some form of Kernel-level threads. Notable among these are the Mach distributed OS, which combines the advantages of user-level and kernel-level threads by allowing user-level code to provide scheduling hints to the kernel thread scheduler. By providing such a two-level scheduling scheme, the kernel retains control over the allocation of processor time, but also allows a process to take advantage of multiple processors.

1.3 Thread Libraries
The two most widely used thread libraries are POSIX and Solaris thread libraries. Both implementations are inter-operable, their functionality is similar, and can be used within the same application. However, only POSIX threads are guaranteed to be fully portable to other POSIX-compliant environments.
Similarities:
Most of the functions in both libraries, libpthread and libthread, have a counterpart in the other library. POSIX functions and Solaris functions, whose names have similar endings, usually have similar functionality, number of arguments, and use of arguments. All POSIX threads function names begin with the prefix pthread? where as the Solaris threads function names begin with the prefix thr?
Differences:
POSIX
is more portable
establishes characteristics for each thread according to configurable attribute objects
implements thread cancellation
enforces scheduling algorithms
allows for clean-up handlers for fork(2) calls
Solaris
threads can be suspended and continued
implements an optimized mutex, readers/writer locking
may increase the concurrency
implements daemon threads, for whose demise the process does not wait

1.4 Threads Standards
There are three different definitions for thread libraries competing for attention today: Win32, OS/2, and POSIX. The first two are proprietary and limited to their individual platforms (Win32 threads run only under NT and Win95, OS/2 threads on OS/2). The POSIX specification (IEEE 1003.1c, aka Pthreads) is intended for all computing platforms, and implementations are available or in the works for almost all major UNIX systems (including Linux), along with VMS.

POSIX Threads
The POSIX standard defines the API and behavior that all the Pthreads libraries must meet. It is part of the extended portion of POSIX, so it is not a requirement for meeting XPG4, but it is required for X/Open UNIX 98, and all major UNIX vendors have committed to meeting this standard. As of this writing, (7/97) almost all UNIX vendors have released a library.

Win32 and OS/2 Threads
Both the NT and OS/2 implementations contain some fairly radical differences
from the POSIX standard--to the degree that even porting from one or the other
to POSIX will prove moderately challenging. Microsoft has not announced any
plans to adopt POSIX. There are freeware POSIX libraries for Win32 (see
Commercial Products on page 249), and OS/2 also has an optional POSIX library.

DCE Threads
Before POSIX completed work on the standard, it produced a number of drafts which it published for comment. Draft 4 was used as the basis for the threads library in DCE. It is similar to the final spec, but it does contain a number of significant differences. Presumably, no one is writing any new DCE code.

Solaris Threads
Also known as UI threads, this is the library, which SunSoft used in developing Solaris 2 before the POSIX, committee completed their work. It will be available on Solaris 2 for the foreseeable future, although we expect most applications writers will opt for Pthreads. The vast majority of the two libraries are virtually identical.
1.5        Linux线程的思想及特点
1.5.1        LinuxThreads
http://pauillac.inria.fr/~xleroy/linuxthreads
Xavier Leroy at INRIA (Paris, France), with input from Pavel Krauz, Richard Henderson and others, has developed a Pthreads library that implements the One-to-One model, allowing it to take advantage of multiple processors. It is based on the new Linux system call, clone()2 . It runs on Linux 2.0 and up, on Intel, Alpha, SPARC, m68k, and MIPS machines. One limitation is its non-standard implementation of signal handling.
1.5.2        Implementation model for LinuxThreads
LinuxThreads follows the so-called "one-to-one" model: each thread is actually a separate process in the kernel. The kernel scheduler takes care of scheduling the threads, just like it schedules regular processes. The threads are created with the Linux clone() system call, which is a generalization of fork() allowing the new process to share the memory space, file descriptors, and signal handlers of the parent.
LinuxThreads采用称为1-1模型：每个线程实际上在核心是一个个单独的进程，核心的调度程序负责线程的调度，就象调度普通进程。线程是用系统调用clone()创建的，clone()系统调用是fork()的普遍形式，它允许新进程共享父进程的存储空间、文件描述符和软中断处理程序。
Advantages of the "one-to-one" model include:

Minimal overhead on CPU-intensive multiprocessing (with about one thread per processor); 最小限度消耗的CPU级多处理技术（每个CPU一个线程）；
Minimal overhead on I/O operations; 最小限度消耗的I/O操作；
A simple and robust implementation (the kernel scheduler does most of the hard work for us)；一种简单和强壮的实现（核心调度程序为我们做了大部分艰难的工作）。

The main disadvantage is more expensive context switches on mutex and condition operations, which must go through the kernel. This is mitigated by the fact that context switches in the Linux kernel are pretty efficient.

1.5.3 Consider other implementation models

There are basically two other models. The "many-to-one" model relies on a user-level scheduler that context-switches between the threads entirely in user code; viewed from the kernel, there is only one process running. This model is completely out of the question for me, since it does not take advantage of multiprocessors, and require unholy magic to handle blocking I/O operations properly. There are several user-level thread libraries available for Linux, but I found all of them deficient in functionality, performance, and/or robustness.
还有另外两种基本模型。多对一模型依赖于用户级的调度程序，线程切换完全由用户程序完成；从核心角度看，只有一个进程正在运行。这种模型不是我们所关心的，因为它无法利用多处理器的优点，而且要用不合理的方法处理I/O操作阻塞。
The "many-to-many" model combines both kernel-level and user-level scheduling: several kernel-level threads run concurrently, each executing a user-level scheduler that selects between user threads. Most commercial Unix systems (Solaris, Digital Unix and IRIX) implement POSIX threads this way. This model combines the advantages of both the "many-to-one" and the "one-to-one" model, and is attractive because it avoids the worst-case behaviors of both models -- especially on kernels where context switches are expensive, such as Digital Unix. Unfortunately, it is pretty complex to implement, and requires kernel supporting which Linux does not provide. Linus Torvalds and other Linux kernel developers have always been pushing the "one-to-one" model in the name of overall simplicity, and are doing a pretty good job of making kernel-level context switches between threads efficient. LinuxThreads is just following the general direction they set.
2        Linux核心对线程的支持
Linux核心对线程的支持主要是通过其系统调用，下文将进行系统的介绍。
2.1        系统调用clone()
以下是系统调用clone的代码：
asmlinkage int sys_clone(struct pt_regs regs)
{
        unsigned long clone_flags;
        unsigned long newsp;

        clone_flags = regs.ebx;
        newsp = regs.ecx;
if (!newsp)
                newsp = regs.esp;
        return do_fork(clone_flags, newsp, &regs);
}

与系统调用clone功能相似的系统调用有fork,但fork事实上只是clone的功能的一部分，clone与fork的主要区别在于传递了几个参数，而当中最重要的参数就是conle_flags,下表是系统定义的几个clone_flags标志：
标志        Value        含义
CLONE_VM        0x00000100        置起此标志在进程间共享VM
CLONE_FS        0x00000200        置起此标志在进程间共享文件系统信息
CLONE_FILES        0x00000400        置起此标志在进程间共享打开的文件
CLONE_SIGHAND        0x00000800        置起此标志在进程间共享信号处理程序
如果置起以上标志所做的处理分别是：
置起CLONE_VM标志：
                mmget(current->mm);
                /*
                * Set up the LDT descriptor for the clone task.
                */
                copy_segments(nr, tsk, NULL);
                SET_PAGE_DIR(tsk, current->mm->pgd);
置起CLONE_ FS标志：
                atomic_inc(&current->fs->count);
置起CLONE_ FILES标志：
                atomic_inc(&oldf->count);
置起CLONE_ SIGHAND标志：
                atomic_inc(&current->sig->count);
2.2        与线程调度相关的系统调用
以下是glibc-linuxthread用来进行调度的系统调度：
        .long SYMBOL_NAME(sys_sched_setparam)   /* 系统调用154 */
/*用来设置进程（或线程）的调度参数*/
        .long SYMBOL_NAME(sys_sched_getparam)
/*用来获取进程（或线程）的调度参数*/
        .long SYMBOL_NAME(sys_sched_setscheduler)
/*用来设置进程（或线程）的调度参数*/
        .long SYMBOL_NAME(sys_sched_getscheduler)
/*用来获取进程（或线程）的调度参数*/
        .long SYMBOL_NAME(sys_sched_yield)
/*用来强制核心重新调度进程（或线程）*/
        .long SYMBOL_NAME(sys_sched_get_priority_max)
/*用来设置进程（或线程）的调度参数*/
        .long SYMBOL_NAME(sys_sched_get_priority_min)
/*用来获取进程（或线程）的调度参数*/
        .long SYMBOL_NAME(sys_sched_rr_get_interval) /* 系统调用161 */
/*用来获取进程（或线程）的调度时间间隔*/

3        Linux线程的实现
3.1        LinuxThreads概述
现在的0.8版LinuxThreads，是迄今为止在Linux下支持threads的最好的Runtime-library，而包含0.8版LinuxThreads的最好的Runtime-library是glibc- 2.1，下文所要分析的正是glibc-linuxthreads-2.1。
首先介绍一下0.8版LinuxThreads，它实现了一种BiCapitalized面向Linux的Posix 1003.1c"pthread"标准接口。LinuxThreads提供核心级线程即每个线程是一个独立的UNIX进程，通过调用新的系统调用与其它线程共享地址空间。线程由核心调度，就象UNIX进程调度一样。使用它的要求是：LINUX 版本2.0 或以上（要求有新的clone() 系统调用和新的实时调度程序）。对于Intel平台：要求有libc 5.2.18或后续版本，推荐使用5.2.18 或 5.4.12 及其后续版本；5.3.12和5.4.7有问题，也支持glibc 2,实际上是支持它的一个特别合适的版本。到目前支持Intel, Alpha, Sparc, Motorola 68k, ARM and MIPS平台，还支持多处理器
3.2        主要的数据结构及初始化
3.2.1        数据结构和部分数据初始化
/* Arguments passed to thread creation routine */
//传递给线程创建程序的参数
struct pthread_start_args {
void * (*start_routine)(void *); /* function to run */
void * arg;                   /* its argument */
sigset_t mask;                /* initial signal mask for thread */
int schedpolicy;              /* initial scheduling policy (if any) */
struct sched_param schedparam; /* initial scheduling parameters (if any) */
};

/* The type of thread descriptors */
//线程描述符类型
typedef struct _pthread_descr_struct * pthread_descr;

struct _pthread_descr_struct {
pthread_descr p_nextlive, p_prevlive;
                                /* Double chaining of active threads */
pthread_descr p_nextwaiting; /* Next element in the queue holding the thr */
pthread_t p_tid;              /* Thread identifier */
int p_pid;                    /* PID of Unix process */
int p_priority;               /* Thread priority (== 0 if not realtime) */
struct _pthread_fastlock * p_lock; /* Spinlock for synchronized accesses */
int p_signal;                 /* last signal received */
sigjmp_buf * p_signal_jmp;    /* where to siglongjmp on a signal or NULL */
sigjmp_buf * p_cancel_jmp;    /* where to siglongjmp on a cancel or NULL */
char p_terminated;            /* true if terminated e.g. by pthread_exit */
char p_detached;              /* true if detached */
char p_exited;                /* true if the assoc. process terminated */
void * p_retval;              /* placeholder for return value */
int p_retcode;                /* placeholder for return code */
pthread_descr p_joining;      /* thread joining on that thread or NULL */
struct _pthread_cleanup_buffer * p_cleanup; /* cleanup functions */
char p_cancelstate;           /* cancellation state */
char p_canceltype;            /* cancellation type (deferred/async) */
char p_canceled;              /* cancellation request pending */
int * p_errnop;               /* pointer to used errno variable */
int p_errno;                  /* error returned by last system call */
int * p_h_errnop;             /* pointer to used h_errno variable */
int p_h_errno;                /* error returned by last netdb function */
char * p_in_sighandler;       /* stack address of sighandler, or NULL */
char p_sigwaiting;            /* true if a sigwait() is in progress */
struct pthread_start_args p_start_args; /* arguments for thread creation */
void ** p_specific[PTHREAD_KEY_1STLEVEL_SIZE]; /* thread-specific data */
void * p_libc_specific[_LIBC_TSD_KEY_N]; /* thread-specific data for libc */
int p_userstack;                /* nonzero if the user provided the stack */
void *p_guardaddr;                /* address of guard area or NULL */
size_t p_guardsize;                /* size of guard area */
pthread_descr p_self;                /* Pointer to this structure */
int p_nr;                     /* Index of descriptor in __pthread_handles */
};

/* The type of thread handles. */
线程句柄
typedef struct pthread_handle_struct * pthread_handle;

struct pthread_handle_struct {
struct _pthread_fastlock h_lock; /* Fast lock for sychronized access */
pthread_descr h_descr; /* Thread descriptor or NULL if invalid */
char * h_bottom; /* Lowest address in the stack thread */
};

/* The type of messages sent to the thread manager thread */
//发送给线程管理线程的请求
struct pthread_request {
pthread_descr req_thread;     /* Thread doing the request */
enum {                        /* Request kind */
    REQ_CREATE, REQ_FREE, REQ_PROCESS_EXIT, REQ_MAIN_THREAD_EXIT,
    REQ_POST, REQ_DEBUG
} req_kind;
union {                       /* Arguments for request */
    struct {                    /* For REQ_CREATE: */
      const pthread_attr_t * attr; /* thread attributes */
      void * (*fn)(void *);     /*   start function */
      void * arg;               /*   argument to start function */
      sigset_t mask;            /*   signal mask */
    } create;
    struct {                    /* For REQ_FREE: */
      pthread_t thread_id;      /*   identifier of thread to free */
    } free;
    struct {                    /* For REQ_PROCESS_EXIT: */
      int code;                 /*   exit status */
    } exit;
    void * post;                /* For REQ_POST: the semaphore */
} req_args;
};

/* One end of the pipe for sending requests to the thread manager. */
//向管理线程发送请求的管道的一端；初始化为-1表示管理线程还没有创建
int __pthread_manager_request = -1;

/* Other end of the pipe for sending requests to the thread manager. */

int __pthread_manager_reader;

//线程的堆栈大小
#define STACK_SIZE (2 * 1024 * 1024)
//线程的初始堆栈大小
#define INITIAL_STACK_SIZE (4 * PAGE_SIZE)

/* Attributes for threads. */
//线程的属性
typedef struct
{
int __detachstate;
int __schedpolicy;
struct __sched_param __schedparam;
int __inheritsched;
int __scope;
size_t __guardsize;
int __stackaddr_set;
void *__stackaddr;
size_t __stacksize;
} pthread_attr_t;

//每个进程的最大线程数
#define PTHREAD_THREADS_MAX 1024

3.2.2 Main thread and manager thread initializing

/* Thread creation */

int __pthread_create_2_1(pthread_t *thread, const pthread_attr_t *attr,
                        void * (*start_routine)(void *), void *arg)
{
pthread_descr self = thread_self();
struct pthread_request request;
if (__pthread_manager_request < 0) {    //检查是否启动线程机制
        //初始化管理线程，启动线程机制
    if (__pthread_initialize_manager() < 0) return EAGAIN;
}
request.req_thread = self;
request.req_kind = REQ_CREATE;
request.req_args.create.attr = attr;
request.req_args.create.fn = start_routine;
request.req_args.create.arg = arg;
sigprocmask(SIG_SETMASK, (const sigset_t *) NULL,
              &request.req_args.create.mask);
//向管理线程发送请求
__libc_write(__pthread_manager_request, (char *) &request, sizeof(request));
suspend(self);
if (THREAD_GETMEM(self, p_retcode) == 0)
    *thread = (pthread_t) THREAD_GETMEM(self, p_retval);
return THREAD_GETMEM(self, p_retcode);
}

int __pthread_initialize_manager(void)
{
int manager_pipe[2];
int pid;
struct pthread_request request;

/* If basic initialization not done yet (e.g. we're called from a constructor run before our constructor), do it now */
//初始化初始线程
if (__pthread_initial_thread_bos == NULL) pthread_initialize();
/* Setup stack for thread manager */建立管理线程堆栈
__pthread_manager_thread_bos = malloc(THREAD_MANAGER_STACK_SIZE);
if (__pthread_manager_thread_bos == NULL) return -1;
__pthread_manager_thread_tos =
    __pthread_manager_thread_bos + THREAD_MANAGER_STACK_SIZE;
/* Setup pipe to communicate with thread manager */
//建立与管理线程通信的管道
if (pipe(manager_pipe) == -1) {
    free(__pthread_manager_thread_bos);
    return -1;
}
/* Start the thread manager */启动管理线程
pid = __clone(__pthread_manager, (void **) __pthread_manager_thread_tos,
                CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND
                , (void *)(long)manager_pipe[0]);
if (pid == -1) {
    free(__pthread_manager_thread_bos);
    __libc_close(manager_pipe[0]);
    __libc_close(manager_pipe[1]);
    return -1;
}
__pthread_manager_request = manager_pipe[1]; /* writing end */
__pthread_manager_reader = manager_pipe[0]; /* reading end */
__pthread_manager_thread.p_pid = pid;
/* Make gdb aware of new thread manager */
if (__pthread_threads_debug && __pthread_sig_debug > 0)
    {
      raise(__pthread_sig_debug);
      /* We suspend ourself and gdb will wake us up when it is
        ready to handle us. */
      suspend(thread_self());
    }
/* Synchronize debugging of the thread manager */
request.req_kind = REQ_DEBUG;
__libc_write(__pthread_manager_request, (char *) &request, sizeof(request));
return 0;
}
//初始化初始线程
static void pthread_initialize(void)
{
struct sigaction sa;
sigset_t mask;
struct rlimit limit;
int max_stack;

/* If already done (e.g. by a constructor called earlier!), bail out */
if (__pthread_initial_thread_bos != NULL) return;
#ifdef TEST_FOR_COMPARE_AND_SWAP
/* Test if compare-and-swap is available */
__pthread_has_cas = compare_and_swap_is_available();
#endif
/* For the initial stack, reserve at least STACK_SIZE bytes of stack below the current stack address, and align that on a STACK_SIZE boundary. */
//当前堆栈下为初始堆栈留出至少STACK_SIZE，并按STACK_SIZE对齐
__pthread_initial_thread_bos =
    (char *)(((long)CURRENT_STACK_FRAME - 2 * STACK_SIZE) & ~(STACK_SIZE - 1));
/* Play with the stack size limit to make sure that no stack ever grows
     beyond STACK_SIZE minus two pages (one page for the thread descriptor
     immediately beyond, and one page to act as a guard page). */
//调整堆栈大小限制使其不能增长超过STACK_SIZE减两页（一页给线程
//描述符，一页作为保护页）
getrlimit(RLIMIT_STACK, &limit);
max_stack = STACK_SIZE - 2 * __getpagesize();
if (limit.rlim_cur > max_stack) {
    limit.rlim_cur = max_stack;
    setrlimit(RLIMIT_STACK, &limit);
}
/* Update the descriptor for the initial thread. */
__pthread_initial_thread.p_pid = __getpid();
/* If we have special thread_self processing, initialize that for the
     main thread now. */
#ifdef INIT_THREAD_SELF
INIT_THREAD_SELF(&__pthread_initial_thread, 0);
#endif
/* The errno/h_errno variable of the main thread are the global ones. */
__pthread_initial_thread.p_errnop = &_errno;
__pthread_initial_thread.p_h_errnop = &_h_errno;
#ifdef SIGRTMIN
/* Allocate the signals used. */分配使用的软中断号
__pthread_sig_restart = __libc_allocate_rtsig (1);
__pthread_sig_cancel = __libc_allocate_rtsig (1);
__pthread_sig_debug = __libc_allocate_rtsig (1);
if (__pthread_sig_restart < 0 ||
      __pthread_sig_cancel < 0 ||
      __pthread_sig_debug < 0)
    {
      /* The kernel does not support real-time signals. Use as before
        the available signals in the fixed set.
         Debugging is not supported in this case. */
      __pthread_sig_restart = DEFAULT_SIG_RESTART;
      __pthread_sig_cancel = DEFAULT_SIG_CANCEL;
      __pthread_sig_debug = 0;
    }
#endif
/* Setup signal handlers for the initial thread.
     Since signal handlers are shared between threads, these settings
     will be inherited by all other threads. */
//设置初始进程的信号处理程序
#ifndef __i386__
sa.sa_handler = pthread_handle_sigrestart;
#else
sa.sa_handler = (__sighandler_t) pthread_handle_sigrestart;
#endif
sigemptyset(&sa.sa_mask);
sa.sa_flags = 0;
__sigaction(__pthread_sig_restart, &sa, NULL);
#ifndef __i386__
sa.sa_handler = pthread_handle_sigcancel;
#else
sa.sa_handler = (__sighandler_t) pthread_handle_sigcancel;
#endif
sa.sa_flags = 0;
__sigaction(__pthread_sig_cancel, &sa, NULL);
if (__pthread_sig_debug > 0) {
    sa.sa_handler = pthread_handle_sigdebug;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    __sigaction(__pthread_sig_debug, &sa, NULL);
}
/* Initially, block __pthread_sig_restart. Will be unblocked on demand. */
sigemptyset(&mask);
sigaddset(&mask, __pthread_sig_restart);
sigprocmask(SIG_BLOCK, &mask, NULL);
/* Register an exit function to kill all other threads. */
/* Do it early so that user-registered atexit functions are called
     before pthread_exit_process. */
__on_exit(pthread_exit_process, NULL);
}

3.3 线程的创建

Manager thread 接到创建线程请求后调用下函数。
static int pthread_handle_create(pthread_t *thread, const pthread_attr_t *attr,
void * (*start_routine)(void *), void *arg,
sigset_t * mask, int father_pid)
{
size_t sseg;
int pid;
pthread_descr new_thread;
char * new_thread_bottom;
pthread_t new_thread_id;
char *guardaddr = NULL;
size_t guardsize = 0;
int pagesize = __getpagesize();

/* First check whether we have to change the policy and if yes, whether
     we can do this. Normally this should be done by examining the
     return value of the __sched_setscheduler call in pthread_start_thread
     but this is hard to implement. FIXME */
//检查是否需要调整调度策略，如果需要，是否能够做到
if (attr != NULL && attr->__schedpolicy != SCHED_OTHER && geteuid () != 0)
    return EPERM;
/* Find a free segment for the thread, and allocate a stack if needed */
//找出一个空段，如果需要再分配堆栈
for (sseg = 2; ; sseg++)
    {
      if (sseg >= PTHREAD_THREADS_MAX)
        return EAGAIN;
      if (__pthread_handles[sseg].h_descr != NULL)
        continue;
      if (pthread_allocate_stack(attr, thread_segment(sseg), pagesize,
                                 &new_thread, &new_thread_bottom,
                                 &guardaddr, &guardsize) == 0)
        break;
    }
__pthread_handles_num++;
/* Allocate new thread identifier */分配新线程的标识符
pthread_threads_counter += PTHREAD_THREADS_MAX;
new_thread_id = sseg + pthread_threads_counter;
/* Initialize the thread descriptor */初始化新线程描述符
new_thread->p_nextwaiting = NULL;
new_thread->p_tid = new_thread_id;
new_thread->p_priority = 0;
new_thread->p_lock = &(__pthread_handles[sseg].h_lock);
new_thread->p_signal = 0;
new_thread->p_signal_jmp = NULL;
new_thread->p_cancel_jmp = NULL;
new_thread->p_terminated = 0;
new_thread->p_detached = attr == NULL ? 0 : attr->__detachstate;
new_thread->p_exited = 0;
new_thread->p_retval = NULL;
new_thread->p_joining = NULL;
new_thread->p_cleanup = NULL;
new_thread->p_cancelstate = PTHREAD_CANCEL_ENABLE;
new_thread->p_canceltype = PTHREAD_CANCEL_DEFERRED;
new_thread->p_canceled = 0;
new_thread->p_errnop = &new_thread->p_errno;
new_thread->p_errno = 0;
new_thread->p_h_errnop = &new_thread->p_h_errno;
new_thread->p_h_errno = 0;
new_thread->p_in_sighandler = NULL;
new_thread->p_sigwaiting = 0;
new_thread->p_guardaddr = guardaddr;
new_thread->p_guardsize = guardsize;
new_thread->p_userstack = attr != NULL && attr->__stackaddr_set;
memset (new_thread->p_specific, '\0',
        PTHREAD_KEY_1STLEVEL_SIZE * sizeof (new_thread->p_specific[0]));
new_thread->p_self = new_thread;
new_thread->p_nr = sseg;
/* Initialize the thread handle */
__pthread_init_lock(&__pthread_handles[sseg].h_lock);
__pthread_handles[sseg].h_descr = new_thread;
__pthread_handles[sseg].h_bottom = new_thread_bottom;
/* Determine scheduling parameters for the thread */
//确定线程的调度参数
new_thread->p_start_args.schedpolicy = -1;
if (attr != NULL) {
    switch(attr->__inheritsched) {
    case PTHREAD_EXPLICIT_SCHED:
      new_thread->p_start_args.schedpolicy = attr->__schedpolicy;
      memcpy (&new_thread->p_start_args.schedparam, &attr->__schedparam,
              sizeof (struct sched_param));
      break;
    case PTHREAD_INHERIT_SCHED:
      /* schedpolicy doesn't need to be set, only get priority */
      __sched_getparam(father_pid, &new_thread->p_start_args.schedparam);
      break;
    }
    new_thread->p_priority =
      new_thread->p_start_args.schedparam.sched_priority;
}
/* Finish setting up arguments to pthread_start_thread */
//设置pthread_start_thread的参数
new_thread->p_start_args.start_routine = start_routine;
new_thread->p_start_args.arg = arg;
new_thread->p_start_args.mask = *mask;
/* Raise priority of thread manager if needed */根据需要调整管理线程的优先级
__pthread_manager_adjust_prio(new_thread->p_priority);
/* Do the cloning */创建新线程
pid = __clone(pthread_start_thread, (void **) new_thread,
                CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND |
                __pthread_sig_cancel, new_thread);
/* Check if cloning succeeded */
if (pid == -1) {
    /* Free the stack if we allocated it */
    if (attr == NULL || !attr->__stackaddr_set)
      {
        munmap((caddr_t)((char *)(new_thread+1) - INITIAL_STACK_SIZE),
               INITIAL_STACK_SIZE);
        if (new_thread->p_guardsize != 0)
        munmap(new_thread->p_guardaddr, new_thread->p_guardsize);
      }
    __pthread_handles[sseg].h_descr = NULL;
    __pthread_handles[sseg].h_bottom = NULL;
    __pthread_handles_num--;
    return errno;
}
/* Insert new thread in doubly linked list of active threads */
//将新线程插入双向链表
new_thread->p_prevlive = __pthread_main_thread;
new_thread->p_nextlive = __pthread_main_thread->p_nextlive;
__pthread_main_thread->p_nextlive->p_prevlive = new_thread;
__pthread_main_thread->p_nextlive = new_thread;
/* Set pid field of the new thread, in case we get there before the
     child starts. */
new_thread->p_pid = pid;
/* We're all set */
*thread = new_thread_id;
return 0;
}

3.4        线程的堆栈分配和管理
STACK_SIZE    2*1024*1024
INITIAL_STACK_SIZE    4*PAGE_SIZE
THREAD_MANAGER_STACK_SIZE      2*PAGE_SIZE-32

static int pthread_allocate_stack(const pthread_attr_t *attr,
                                  pthread_descr default_new_thread,
                                  int pagesize,
                                  pthread_descr * out_new_thread,
                                  char ** out_new_thread_bottom,
                                  char ** out_guardaddr,
                                  size_t * out_guardsize)
{
pthread_descr new_thread;
char * new_thread_bottom;
char * guardaddr;
size_t stacksize, guardsize;

if (attr != NULL && attr->__stackaddr_set)
    {
      /* The user provided a stack. */用户提供堆栈
      new_thread =
        (pthread_descr) ((long)(attr->__stackaddr) & -sizeof(void *)) - 1;
      new_thread_bottom = (char *) attr->__stackaddr - attr->__stacksize;
      guardaddr = NULL;
      guardsize = 0;
      __pthread_nonstandard_stacks = 1;
    }
else
    {
      /* Allocate space for stack and thread descriptor at default address */
//在缺省地址分配堆栈和描述符
      new_thread = default_new_thread;
      new_thread_bottom = (char *) new_thread - STACK_SIZE;
      if (mmap((caddr_t)((char *)(new_thread + 1) - INITIAL_STACK_SIZE), INITIAL_STACK_SIZE, PROT_READ | PROT_WRITE | PROT_EXEC, MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED | MAP_GROWSDOWN, -1, 0) == MAP_FAILED)
        /* Bad luck, this segment is already mapped. */
        return -1;
      /* We manage to get a stack. Now see whether we need a guard
         and allocate it if necessary. Notice that the default
         attributes (stack_size = STACK_SIZE - pagesize and
         guardsize = pagesize) do not need a guard page, since
         the RLIMIT_STACK soft limit prevents stacks from
         running into one another. */
//判断是否需要保护页，如果需要就分配
      if (attr == NULL ||
          attr->__guardsize == 0 ||
          (attr->__guardsize == pagesize &&
           attr->__stacksize == STACK_SIZE - pagesize))
        {
          /* We don't need a guard page. */
          guardaddr = NULL;
          guardsize = 0;
        }
      else
        {
          /* Put a bad page at the bottom of the stack */
          stacksize = roundup(attr->__stacksize, pagesize);
          if (stacksize >= STACK_SIZE - pagesize)
            stacksize = STACK_SIZE - pagesize;
          guardaddr = (void *)new_thread - stacksize;
          guardsize = attr->__guardsize;
          if (mmap ((caddr_t) guardaddr, guardsize, 0, MAP_FIXED, -1, 0)
              == MAP_FAILED)
            {
              /* We don't make this an error. */
              guardaddr = NULL;
              guardsize = 0;
            }
        }
    }
*out_new_thread = new_thread;
*out_new_thread_bottom = new_thread_bottom;
*out_guardaddr = guardaddr;
*out_guardsize = guardsize;
return 0;
}

3.5 线程的调度
Common threads 的调度和普通进程并无大的区别，创建者可以自己设定线程的优先级。但是Manager thread则需要实时响应各进程提出的请求，所以Manager thread被设置成高于其它线程的优先级，方法是在创建每个新线程时调整Manager thread的优先级。

/* Adjust priority of thread manager so that it always run at a priority
higher than all threads */

void __pthread_manager_adjust_prio(int thread_prio)
{
struct sched_param param;

if (thread_prio <= __pthread_manager_thread.p_priority) return;
param.sched_priority =
thread_prio < __sched_get_priority_max(SCHED_FIFO)
? thread_prio + 1 : thread_prio;
__sched_setscheduler(__pthread_manager_thread.p_pid, SCHED_FIFO, &param);
__pthread_manager_thread.p_priority = thread_prio;
}

[目录]

进程描述符
标题新兵笔记--ULK(C3) Process Descriptor
作者 Big John (stranger )
时间 05/19/01 06:01 PM

Process Descriptor

description：
进程描述符：也就是结构体task_struct，它有很多域，包含了一个进程的所有信息，主要有它的属性、当前的状态、它所占有的资料，还有一些指针用于把按不同的需求把它链进不同的链表中。
进程描述符与进程的kernel栈：每个进程都有个自己的kernel栈，当它进入kernel态时（比如进行系统调用），kernel会把栈指针指向当前进程的kernel栈。在2.2中，进程的描述符和kernel栈是捆在一起的，
union task_union {
struct task_struct task;
unsigned long stack[2048];
};
kernel每次分配一个进程描述符总会按task_union的大小（即8k）"顺手"分配一个kernel栈，这样做一个最重要的目的就是为了能方便的得到当前运行进程的描述符，当系统需要时，总是使用current来得到当前运行的进程，在2.0中current可能是个全局变量或全局数组（当多CPU时），这样一方面是不方便使用，另一方面，在多CPU中还得去找当前CPU才能确定当前的current（我当初看过，但忘了它是怎么找的了）。现在使用了这个结构，kernel随时可以通过栈底指针减8K就可得到描述符的地址。大家可以看到现在的current实际是一个函数get_current，它所做的是就是把esp减去8K，然后返回这个地址。
进程描述符的分配与释放：由这两个函数完成，alloc_task_struct和free_task_struct。这两个函数没什么花头，还是由于分配物理页帧的代码过大，这里也有一个缓存static struct task_struct * task_struct_stack[EXTRA_TASK_STRUCT]，它能缓存16项，分配和释放都尽量在这进行，除非它已经满了或空了，才去与分页系统打交道。
进程数组：struct task_struct *task[NR_TASKS];它是一个指针数组，其中NR_TASKS在i386下应该4090，实际可同时运行的进程应该是4092个，因为还得加上Process 0和Procces 1，这两个进程是不在进程数组中的。
当创建一个进程时，kernel会申请一片内存并把它加入task数组中。如何加入呢？出于效率考虑，task数组并不象我们平时处理那样把没有用的项置空，当要加入时顺序的去找数组中为空的项。它使用了类似于第二章中页表缓存链表的管理方法，tarray_freelist是这个链表的表头，具体操作如下：
初始化：
struct task_struct **tarray_freelist = NULL;
void __init sched_init(void)
{
。。。
int nr = NR_TASKS;
while(--nr > 0)
add_free_taskslot(&task[nr]); // 把task数组的每一项加到空闲链表中。
。。。
}
函数add_free_taskslot：
*t = (struct task_struct *) tarray_freelist; // 把当前指针当next用，指向空闲链表的第一项（可能是NULL）
tarray_freelist = t; // tarray_freelist指向当前项
函数get_free_taskslot：
tslot = tarray_freelist; // *tslot是第一个空闲项
tarray_freelist = (struct task_struct **) *tslot; // *tslot的内容是下一个空闲项
return tslot;

各种各样的进程指针：在进程描述符中有很多task_struct的指针，用于把它链进不同的链表或树中。最简单的是next_task和prev_task，它们把当前系统中的所有进程链成一条双向链表；其次是next_run和prev_run，它们把当前可运行的进程（state为TASK_RUNNING，这和current不同，并不表示它正在运行，只表示它可以被CPU调度运行而已）链成一条双向链表，不过我的源码里并没有作者所说的runqueue指针头，好象是init_task取代了runqueue的位置；pidhash_next和pidhash_pprev是用来链进以进程号索引的hash表的，因为大多调用都使用进程号标识进程，所以需要建立一个hash表来加快以进程号来查找进程的速度；最后是p_opptr,p_pptr,p_cptr,p_ysptr,p_osptr，这些是用来标识进程的父子，兄弟等树状关系的，作者书中的图已经很清楚了，不再需要我多说了。

等待队列：一般来说，当某个任务不能马上完成时，kernel不会陪着进程在那死等，它只是简单把进程挂进某个队列，然后继续运行，等这个任务完成，kernel会从队列中取出等待的进程，并把它链进可运行队列中。等待队列的结构体如下：
struct wait_queue {
struct task_struct * task;
struct wait_queue * next;
};
很简单的结构，一个进程指针，一个next指针，应用时它将会是一个环形队列，add_wait_queue加入一项，remove_wait_queue移去新旧的一项，都不是很难理解。麻烦的是它的init_waitqueue，内容为
#define WAIT_QUEUE_HEAD(x) ((struct wait_queue *)((x)-1))
static inline void init_waitqueue(struct wait_queue **q)
{
*q = WAIT_QUEUE_HEAD(q);
}
结合作者的解释，它实际上是把当前队列指针加上前面的四个字节假设为一项了，然后"这一项"的next指针指向它自己。这个方法代码倒很简单，但是我却不是很喜欢，可读性实在有点。。。如果是我，宁愿加一空项做表头。
sleep和wakeup：刚才所说的kernel把进程挂进队一般是通过sleep_on来做的，而取出队列是通过wake_up来做的。现在来看看它们是怎么运行的吧。比如你要读取软盘内容，指令发出去了，但要等很久才有回应，这时会调用sleep_on把你的进程标识为TASK_UNINTERRUPTIBLE或TASK_INTERRUPTIBLE状态，然后把进程挂在某一个队列上，然后调用schedule，这样就会调度其它状态为TASK_RUNNING的进程继续运行。当指令完成时，比如软盘的内容已经读到内存中了，这时可能会产生一个中断，这个中断会从等待队列中把你的进程取出来，标识为TASK_RUNNING，然后放入可运行队列，又过了若干时间，你的进程真的开始运行了，这时会执行sleep_on中schedule后的语句，即把进程从进程从等待队列中移出去，然后就可以执行下面的操作了，这时你要读的内容已经读出来了。

进程限制：谁都不希望某个用户的进程会占用所有的内存，所有的CPU，所有的任何资源，这样就要对进程有所限制，kernel用了这样一个结构：
struct rlimit {
long rlim_cur;
long rlim_max;
};
其中rlim_cur为进程现在用到的资源数，rlim_max为进程可用的最大资源数。
结构task_struct中有一项为struct rlimit rlim[RLIM_NLIMITS]，其中RLIM_NLIMITS为10，即你对进程可以有10种限制，这10种限制作者有讲解的，我也不说了。

question:
1、我的印象中，在get_current中，esp应该是栈顶指针，而且随时会变的，用它去减去8K，能得到正确的地址吗？

标题 Re: 新兵笔记--ULK(C3) Process Descriptor [re: Big John]
作者 lucian_yao (addict)
时间 05/20/01 09:16 AM

1应该是栈顶向下8K对齐得到task_struct指针。
2在2.4中最大进程数没有了，由于基本不用TSS结构，所以不受GDT大小限制。

标题 Re: 新兵笔记--ULK(C3) Process Descriptor [re: lucian_yao]
作者 Big John (stranger )
时间 05/22/01 04:24 PM

1、是我的错，把
__asm__("andl %%esp,%0; ":"=r" (current) : "0" (~8191UL));
中的andl看成addl了，所以百思而不得，呵。其实很简单，系统分配进程描述符时，都是偶数页对齐的，把sp后面的13位清0，当然也就是描述符的位置了。：）
2、2.4对进程没有限制，那当然就不会再用task_struct的数组了，那它是怎么组织的呢？不会是链表吧。

标题 Re: 新兵笔记--ULK(C3) Process Descriptor [re: Big John]
作者 iobject (stranger)
时间 05/29/01 04:08 PM

static inline struct task_struct * get_current(void)
{
struct task_struct *current;
__asm__("andl %%esp,%0; ":"=r" (current) : "" (~8191UL));
return current;
}
对于,%0,从语法上似乎是指current,但是这样的话这句话就说不通了,难道我对%0的理解有错吗
哪位指点一下,谢谢!

标题 Re: 新兵笔记--ULK(C3) Process Descriptor [re: iobject]
作者 Big John (stranger )
时间 05/29/01 05:33 PM

asm ("combine %2,%0" : "=r" (foo) : "0" (foo), "g" (bar));

The constraint `"0"' for operand 1 says that it must occupy the same
location as operand 0. A digit in constraint is allowed only in an
input operand and it must refer to an output operand.

这段来自gcc的info，大概意思是说，%1和%0将占用同一个寄存器，所以引用%0也就是引用%1了。
这样看来
__asm__("andl %%esp,%0; ":"=r" (current) : "0" (~8191UL));
展开应该是这样的：
movl $(~8191UL),%eax
#APP
andl %esp, %eax
#NO_APP
movl %eax,current
我也是现学现用，不知道对不对。

[目录]

init进程从内核态切换到用户态
标题 init进程如何从内核态切换到用户态。
作者 chstar (stranger )
时间 03/08/01 01:24 PM

init进程从内核态切换到用户态。

//谢谢lucian_yao 邀请，在此灌水一篇

大家都知道如何产生一个新的进程。
通过sys_fork,之后再调用sys_execve
系统初启后（核心态）的第一个用户态进程是init。
这要涉及到内层（特权级高）向外层（特权级低）转移的问题。
通常情况下，内核是不会调用用户层的代码，要想实现这逆向的转移，一般做法是在用户进程的核心栈(tss->esp0)压入用户态的SS,ESP,EFLAGS,CS,EIP,伪装成用户进程是通过陷阱门进入核心态，之后通过iret返回用户态。
那么linux 2.2.14中的用户态进程init是如何实现的？

首先在kernel_thread(init...)函数中，
利用系统调用sys_clone fork出一个内核级进程(此时要给该进程分配核心栈<-esp0),之后call init函数，init函数还会再起几个kernel_thread,然后会加载/sbin/init(通过execve调用)
在sys_execve中，要完成内核态到用户态的转移。
大体流程是sys_execve-->do_execve-->load_elf_binary()
-->do_load_elf_binary()-->do_mmap()
start_thread(reg,newip,newsp) （processor.h）
请大家关注do_mmap()及start_thread()很重要哦
do_mmap完成从文件虚拟空间到内存虚拟空间的映射。
而start_thread就是要在进程核心栈中的相应位置填入进程用户态的xss,esp and xcs,eip.
最后进程从ret_from_sys_call返回，iret指令从核心栈pop出xcs, eip完成特权及指令的转移, pop出 xss,esp，完成堆栈的切换。

以上我也是刚看代码悟出的，如有不对之处，还望高手指出。

[目录]

SET_LINKS
宏：SET_LINKS(p)将进程p插入到进程系中

struct task_struct {
struct task_struct *next_task, *prev_task;
...
struct task_struct *p_opptr, *p_pptr, *p_cptr, *p_ysptr,*p_osptr;        ...};
next_task和prev_task 为描述进程先后关系的环形队列
p_opptr        指向原始的父进程
p_pptr        指向当前的父进程
p_cptr        指向最年轻的子进程
p_ysptr        指向弟进程
p_osptr        指向兄进程

include/linux/sched.h

#define SET_LINKS(p) do {
\
        (p)->next_task = &init_task;
\ 进程p的下一个进程是初始化进程
        (p)->prev_task = init_task.prev_task;
\ 进程p的前一个进程是初始化进程的前一个进程
        init_task.prev_task->next_task = (p);
\ 进程p的进一进程指向p
        init_task.prev_task = (p);
\初始化进程的前一进程指向p; 即将进程p加入到环形进程队列的尾部
        (p)->p_ysptr = NULL; \ 进程p现在是最年轻的进程
        if (((p)->p_osptr = (p)->p_pptr->p_cptr) != NULL)
                (p)->p_osptr->p_ysptr = p;
\ 原来的最年轻进程变成p的兄进程
        (p)->p_pptr->p_cptr = p; \ 父进程指向新的子进程p
} while (0)

[目录]

REMOVE_LINKS
宏：REMOVE_LINKS(p)将进程p从进程系中删除

include/linux/sched.h
#define REMOVE_LINKS(p) do { \
        (p)->next_task->prev_task = (p)->prev_task;
\ 让进程p的下一进程指向p的前一进程
        (p)->prev_task->next_task = (p)->next_task;
\ 让进程p的前一进程指向p的下一进程
        if ((p)->p_osptr)
\ 如果进程p有兄进程,则让兄进程指向p的弟进程
        (p)->p_osptr->p_ysptr = (p)->p_ysptr;
        if ((p)->p_ysptr)
\ 如果进程p有弟进程,则让弟进程指向p的兄进程
                (p)->p_ysptr->p_osptr = (p)->p_osptr; \
        else \ 如果p没有弟进程,说明p最年轻,则让父进程指向p的兄进程                (p)->p_pptr->p_cptr = (p)->p_osptr;
\
} while (0)

[目录]

get_wchan()
get_wchan()给出了某个睡眠进程schedule()的调用点.

; arch/i386/kernel/process.c
unsigned long get_wchan(struct task_struct *p)
{
        unsigned long ebp, esp, eip;
        unsigned long stack_page;
        int count = 0;
        if (!p || p == current || p->state == TASK_RUNNING)
                return 0;
        stack_page = (unsigned long)p;
        esp = p->thread.esp; 取switch_to之前内核堆栈指针
        if (!stack_page || esp 8188+stack_page)
                return 0;
        /* include/asm-i386/system.h:switch_to() pushes ebp last. */
        ebp = *(unsigned long *) esp; 取保存在切换现场的schedule的ebp
        do {
                if (ebp 8184+stack_page)
                        return 0;
                eip = *(unsigned long *) (ebp+4);
                ; (ebp+0)为上一级函数的ebp,(ebp+4)为schedule()的返回地址
                ; kernel/sched.c编绎加了-fno-omit-frame-pointer编绎标志,就是在这儿起作用.
                ; first_sched和last_sched是schedule()函数所在的地址范围
                if (eip = last_sched)
                        return eip;
                ebp = *(unsigned long *) ebp;
        } while (count++         return 0;
}

现在的问题是,在什么情况下需要用count循环几次? 现有的代码好象不需要循环.

[目录]

sigframe的结构
struct pt_regs {
        long ebx;
        long ecx;
        long edx;
        long esi;
        long edi;
        long ebp;
        long eax;
        int xds;
        int xes;
        long orig_eax;
        long eip;
        int xcs;
        long eflags;
        long esp;
        int xss;
};
typedef void (*__sighandler_t)(int);
struct sigaction {
        __sighandler_t sa_handler; 用户的信号处理函数指针
        unsigned long sa_flags;
        void (*sa_restorer)(void); 用户自定义的信号恢复函数指针
        sigset_t sa_mask;
};

struct k_sigaction {
        struct sigaction sa;
};
struct exec_domain {
        const char *name;
        lcall7_func handler;
        unsigned char pers_low, pers_high;
        unsigned long * signal_map;
        unsigned long * signal_invmap;
        struct module * module;
        struct exec_domain *next;
};
struct sigcontext {
        unsigned short gs, __gsh;
        unsigned short fs, __fsh;
        unsigned short es, __esh;
        unsigned short ds, __dsh;
        unsigned long edi;
        unsigned long esi;
        unsigned long ebp;
        unsigned long esp;
        unsigned long ebx;
        unsigned long edx;
        unsigned long ecx;
        unsigned long eax;
        unsigned long trapno;
        unsigned long err;
        unsigned long eip;
        unsigned short cs, __csh;
        unsigned long eflags;
        unsigned long esp_at_signal;
        unsigned short ss, __ssh;
        struct _fpstate * fpstate;
        unsigned long oldmask;
        unsigned long cr2;
};
struct _fpstate {

        unsigned long cw;
        unsigned long        sw;
        unsigned long        tag;
        unsigned long        ipoff;
        unsigned long        cssel;
        unsigned long        dataoff;
        unsigned long        datasel;
        struct _fpreg        _st[8];
        unsigned short        status;
        unsigned short        magic;

        unsigned long        _fxsr_env[6];
        unsigned long        mxcsr;
        unsigned long        reserved;
        struct _fpxreg        _fxsr_st[8];
        struct _xmmreg        _xmm[8];
        unsigned long        padding[56];
};
struct sigframe
{
        char *pretcode; 指向retcode
        int sig; sa_handler的sig参数
        struct sigcontext sc; CPU状态
        struct _fpstate fpstate;如果进程使用过FPU的话保存FPU状态
        unsigned long extramask[(64 / 32 ) -1];
        char retcode[8]; "popl % eax; movl $__NR_sigreturn,% eax; int $0x80"
};
static void setup_frame(int sig, struct k_sigaction *ka,
                        sigset_t *set, struct pt_regs * regs)
{
        struct sigframe *frame;
        int err = 0;

        ;取信号帧的起始地址
        frame = get_sigframe(ka, regs, sizeof(*frame));
        ;检测frame指针是否越界
        if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame)))
                goto give_sigsegv;
        ;每个进程可以对应于不同的运行域,如果需要的话就进行相应的信号转换
        err |= __put_user((current->exec_domain
                            current->exec_domain->signal_invmap
                            sig                            ? current->exec_domain->signal_invmap[sig]
                           : sig),

        if (err)
                goto give_sigsegv;
        ;继续在用户堆栈上填充sigframe结构
        err |= setup_sigcontext( regs, set->sig[0]);
        if (err)
                goto give_sigsegv;
        ;如果系统信号集的描述字多于1个的话,在extramask在保存多出来的部分,
        ;set->sig[0]已在sigframe->sc.oldmask保存
        if (_NSIG_WORDS > 1) {
                err |= __copy_to_user(frame->extramask,
                                      sizeof(frame->extramask));
        }
        if (err)
                goto give_sigsegv;

        /* Set up to return from userspace. If provided, use a stub
           already in userspace. */
        if (ka->sa.sa_flags SA_RESTORER) {
                ; 如果用户提供了信号的恢复代码
                err |= __put_user(ka->sa.sa_restorer,
        } else {

                err |= __put_user(frame->retcode,
                /* This is popl % eax ; movl $,% eax ; int $0x80 */
                err |= __put_user(0xb858, (short *)(frame->retcode+0));
                err |= __put_user(__NR_sigreturn, (int *)(frame->retcode+2));
                err |= __put_user(0x80cd, (short *)(frame->retcode+6));
                ;popl % eax 退掉栈顶上frame->sig来与sys_sigreturn相对应
        }

if (err)
goto give_sigsegv;

        /* Set up registers for signal handler */
        regs->esp = (unsigned long) frame;
        regs->eip = (unsigned long) ka->sa.sa_handler;
        ; 一返回用户进程,信号处理代码就开始执行.
        set_fs(USER_DS);
        regs->xds = __USER_DS;
        regs->xes = __USER_DS;
        regs->xss = __USER_DS;
        regs->xcs = __USER_CS;
        regs->eflags ~TF_MASK;

#if DEBUG_SIG
        printk("SIG deliver (%s:%d): sp=%p pc=%p ra=%p\n",
                current->comm, current->pid, frame, regs->eip, frame->pretcode);
#endif
        return;

give_sigsegv:
        ;如果建立sigframe出错,忽略用户的SIGSEGV过程,发出SIGSEGV信号强制进程退出
        if (sig == SIGSEGV)
                ka->sa.sa_handler = SIG_DFL;
        force_sig(SIGSEGV, current);
}
static inline int on_sig_stack(unsigned long sp)
{
        return (sp - current->sas_ss_sp sas_ss_size);
}
static inline void *
get_sigframe(struct k_sigaction *ka, struct pt_regs * regs, size_t frame_size)
{
        unsigned long esp;

/* Default to using normal stack */
esp = regs->esp; 用户进程中的堆栈指针

        /* This is the X/Open sanctioned signal stack switching. */
        if (ka->sa.sa_flags SA_ONSTACK) {
                if (! on_sig_stack(esp))
                ; 如果esp sas_ss_sp + current->sas_ss_size)
                        esp = current->sas_ss_sp + current->sas_ss_size;
        }

        /* This is the legacy signal stack switching. */
        else if ((regs->xss 0xffff) != __USER_DS
                !(ka->sa.sa_flags SA_RESTORER)
                ka->sa.sa_restorer) {
                ; 如果ss与ds不等,没有SA_RESTORER标志,但ka->sa.sa_restorer不为0,
                ; 表示sa_restorer是一个用户定义的堆栈指针
                esp = (unsigned long) ka->sa.sa_restorer;
        }
        ; sigframe在8字节边界上对齐
        return (void *)((esp - frame_size) -8ul);
}

static int
setup_sigcontext(struct sigcontext *sc, struct _fpstate *fpstate,
struct pt_regs *regs, unsigned long mask)
{
int tmp, err = 0;

        tmp = 0;
        __asm__("movl %%gs,%0" : "=r"(tmp): "0"(tmp));
        err |= __put_user(tmp, (unsigned int *)
        __asm__("movl %%fs,%0" : "=r"(tmp): "0"(tmp));
        err |= __put_user(tmp, (unsigned int *)

        err |= __put_user(regs->xes, (unsigned int *)
        err |= __put_user(regs->xds, (unsigned int *)
        err |= __put_user(regs->edi,
        err |= __put_user(regs->esi,
        err |= __put_user(regs->ebp,
        err |= __put_user(regs->esp,
        err |= __put_user(regs->ebx,
        err |= __put_user(regs->edx,
        err |= __put_user(regs->ecx,
        err |= __put_user(regs->eax,
        err |= __put_user(current->thread.trap_no,
        err |= __put_user(current->thread.error_code,
        err |= __put_user(regs->eip,
        err |= __put_user(regs->xcs, (unsigned int *)
        err |= __put_user(regs->eflags,
        err |= __put_user(regs->esp,
        err |= __put_user(regs->xss, (unsigned int *)
        ; 每一步都很谨慎
        tmp = save_i387(fpstate);
        if (tmp         err = 1;
        else
        err |= __put_user(tmp ? fpstate : NULL,
        ; sc->fpstate为0表示该进程没有使用过FPU
        /* non-iBCS2 extensions.. */
        err |= __put_user(mask,
        err |= __put_user(current->thread.cr2,

return err;
}

sigframe就是调用用户信号处理函数时进程堆栈指针和原来被中断进程堆栈指针之间的数据块.

[目录]

rt_sigframe结构
struct rt_sigframe
{
        char *pretcode;
        int sig;
        struct siginfo *pinfo; 指向info
        void *puc; 指向uc
        struct siginfo info;
        struct ucontext uc;
        struct _fpstate fpstate;
        char retcode[8];
};
typedef struct sigaltstack {
        void *ss_sp;
        int ss_flags;
        size_t ss_size;
} stack_t;
struct ucontext {
        unsigned long        uc_flags;
        struct ucontext *uc_link;
        stack_t uc_stack;
        struct sigcontext uc_mcontext; 相当于sigframe中的sc
        sigset_t uc_sigmask;
};
typedef struct siginfo {
        int si_signo;
        int si_errno;
        int si_code;

union {
int _pad[((128 /sizeof(int)) - 3) ];

                struct {
                        pid_t _pid;
                        uid_t _uid;
                } _kill;

                struct {
                        unsigned int _timer1;
                        unsigned int _timer2;
                } _timer;

                struct {
                        pid_t _pid;
                        uid_t _uid;
                        sigval_t _sigval;
                } _rt;

                struct {
                        pid_t _pid;
                        uid_t _uid;
                        int _status;
                        clock_t _utime;
                        clock_t _stime;
                } _sigchld;

                struct {
                        void *_addr;
                } _sigfault;

                struct {
                        int _band;
                        int _fd;
                } _sigpoll;
        } _sifields;
} siginfo_t;

; setup_rt_frame与setup_frame相比多了一个siginfo_t参数
static void setup_rt_frame(int sig, struct k_sigaction *ka, siginfo_t *info,
                           sigset_t *set, struct pt_regs * regs)
{
        struct rt_sigframe *frame;
        int err = 0;

frame = get_sigframe(ka, regs, sizeof(*frame));

if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame)))
goto give_sigsegv;

        err |= __put_user((current->exec_domain
                                current->exec_domain->signal_invmap
                                sig                                ? current->exec_domain->signal_invmap[sig]
                           : sig),

        ; 初始化frame->info和frame->uc两个指针
        err |= __put_user(
        err |= __put_user(
        ; 将系统siginfo结构拷贝给用户
        err |= copy_siginfo_to_user( info);
        if (err)
                goto give_sigsegv;

        /* Create the ucontext. */
        err |= __put_user(0,
        err |= __put_user(0,
        err |= __put_user(current->sas_ss_sp,
        err |= __put_user(sas_ss_flags(regs->esp),

        err |= __put_user(current->sas_ss_size,
        err |= setup_sigcontext(
                                regs, set->sig[0]);
        err |= __copy_to_user( set, sizeof(*set));
        if (err)
                goto give_sigsegv;

        /* Set up to return from userspace. If provided, use a stub
           already in userspace. */
        if (ka->sa.sa_flags SA_RESTORER) {
                err |= __put_user(ka->sa.sa_restorer,
        } else {
                err |= __put_user(frame->retcode,
                /* This is movl $,% eax ; int $0x80 */
                err |= __put_user(0xb8, (char *)(frame->retcode+0));
                err |= __put_user(__NR_rt_sigreturn, (int *)(frame->retcode+1));
                err |= __put_user(0x80cd, (short *)(frame->retcode+5));
                ; 没有popl % eax
        }

if (err)
goto give_sigsegv;

        /* Set up registers for signal handler */
        regs->esp = (unsigned long) frame;
        regs->eip = (unsigned long) ka->sa.sa_handler;

        set_fs(USER_DS);
        regs->xds = __USER_DS;
        regs->xes = __USER_DS;
        regs->xss = __USER_DS;
        regs->xcs = __USER_CS;
        regs->eflags ~TF_MASK;

#if DEBUG_SIG
printk("SIG deliver (%s:%d): sp=%p pc=%p ra=%p\n",
current->comm, current->pid, frame, regs->eip, frame->pretcode);
#endif

return;

give_sigsegv:
        if (sig == SIGSEGV)
                ka->sa.sa_handler = SIG_DFL;
        force_sig(SIGSEGV, current);
}
int copy_siginfo_to_user(siginfo_t *to, siginfo_t *from)
{
        if (!access_ok (VERIFY_WRITE, to, sizeof(siginfo_t)))
                return -EFAULT;
        if (from->si_code                 ; 将整个siginfo_t结构拷贝到用户堆栈上的rt_sigframe中
                return __copy_to_user(to, from, sizeof(siginfo_t));
        else {
                int err;

                /* If you change siginfo_t structure, please be sure
                   this code is fixed accordingly.
                   It should never copy any pad contained in the structure
                   to avoid security leaks, but must copy the generic
                   3 ints plus the relevant union member. */
                err = __put_user(from->si_signo,
                err |= __put_user(from->si_errno,
                err |= __put_user((short)from->si_code,
                /* First 32bits of unions are always present. */
                err |= __put_user(from->si_pid,
                switch (from->si_code >> 16) {
                case __SI_FAULT >> 16:
                        break;
                case __SI_CHLD >> 16:
                        err |= __put_user(from->si_utime,
                        err |= __put_user(from->si_stime,
                        err |= __put_user(from->si_status,
                default:
                        err |= __put_user(from->si_uid,
                        break;
                /* case __SI_RT: This is not generated by the kernel as of now. */
                }
                return err;
        }
}

由此可以看出rt_sigframe是sigframe的扩展和优化,和sigframe相比,rt_sigframe多了siginfo结构,sigframe的sigcontext扩展为ucontext,用户的信号处理函数多了两个参数pinfo和uc,这样内核可以将更多的信息传递给用户.除此以外,在运行机制上它们没有什么本质的区别.

[目录]

信号队列的结构
每个进程具有一个sigpending结构所描述的信号队列,它有3个成员,head指向第一个sigqueue成员,tail指向最末的sigqueue成员的next指针,signal描述了此队列中的信号集.

static int send_signal(int sig, struct siginfo *info, struct sigpending *signals);
将信号sig和对应的消息结构info添加到信号队列signal中.
static int collect_signal(int sig, struct sigpending *list, siginfo_t *info);
返回信号sig在队列list中的信息info.

        /* Real-time signals must be queued if sent by sigqueue, or
           some other real-time mechanism. It is implementation
           defined whether kill() does so. We attempt to do so, on
           the principle of least surprise, but since kill is not
           allowed to fail with EAGAIN when low on memory we just
           make sure at least one signal gets delivered and don't
           pass on the info struct. */

        if (atomic_read(q = kmem_cache_alloc(sigqueue_cachep, GFP_ATOMIC);
        }
        // nr_queued_signals和max_queued_signals用来限制全局sigqueue成员的数目
        if (q) {
                atomic_inc(
                q->next = NULL;
                *signals->tail = q;
                signals->tail = tail总是指向最末的信号成员的next指针
                switch ((unsigned long) info) {
                        case 0: // info参数如果为0,表示信号来源于当前用户进程
                                q->info.si_signo = sig;
                                q->info.si_errno = 0;
                                q->info.si_code = SI_USER;
                                q->info.si_pid = current->pid;
                                q->info.si_uid = current->uid;
                                break;
                        case 1: // info参数如果为1,表示信号来源于内核本身
                                q->info.si_signo = sig;
                                q->info.si_errno = 0;
                                q->info.si_code = SI_KERNEL;
                                q->info.si_pid = 0;
                                q->info.si_uid = 0;
                                break;
                        default: // 否则从info指针中拷贝信号
                                copy_siginfo( info);
                                break;
                }
        } else if (sig >= SIGRTMIN info (unsigned long)info != 1
                    info->si_code != SI_USER) {
                ; 如果该信号是内核发出的实时信号,就返回错误码
                /*
                * Queue overflow, abort. We may abort if the signal was rt
                * and sent by user using something other than kill().
                */
                return -EAGAIN;
        }

        sigaddset( sig); 将sig号标记在队列的信号集上
        return 0;
}
static int collect_signal(int sig, struct sigpending *list, siginfo_t *info)
{
        if (sigismember( sig)) {
                /* Collect the siginfo appropriate to this signal. */
                struct sigqueue *q, **pp;
                pp = pp指向第一个信号成员的next指针
                while ((q = *pp) != NULL) {
                        if (q->info.si_signo == sig)
                                goto found_it;
                        pp =
                }

                /* Ok, it wasn't in the queue. We must have
                   been out of queue space. So zero out the
                   info. */
                sigdelset( sig);
                info->si_signo = sig;
                info->si_errno = 0;
                info->si_code = 0;
                info->si_pid = 0;
                info->si_uid = 0;
                return 1;

found_it:
                // 将找到信号成员从信号队列中删除
                if ((*pp = q->next) == NULL)
                        list->tail = pp;

                /* Copy the sigqueue information and free the queue entry */
                copy_siginfo(info,
                kmem_cache_free(sigqueue_cachep,q);
                atomic_dec(

                /* Non-RT signals can exist multiple times.. */
                if (sig >= SIGRTMIN) {
                        while ((q = *pp) != NULL) {
                                if (q->info.si_signo == sig)
                                        goto found_another;
                                pp =
                        }
                }

                sigdelset( sig);
found_another:
                return 1;
        }
        return 0;
}

[目录]

内核线程简介
内核线程
尽管linux是一个宏内核系统，但是kernel threads（内核线程）依然存在，以便并行地处理一些内核的“家务室”

这些任务不占用USER memory（用户空间），而仅仅使用KERNEL memory。和其他内核模块一样，它们也在高级权限（i386系统中的RING 0）下工作作。内核线程是被kernel_thread [arch/i386/kernel/process]创建的,它又通过调用著名的clone系统调用[arch/i386/kernel/process.c] (类似fork系统调用的所有功能都是由它最终实现):

int kernel_thread(int (*fn)(void *), void * arg, unsigned long flags)
{
long retval, d0;

一旦调用，我们就有了一个新的任务（Task） (一般PID都很小, 例如2,3,等) 等待一个响应很慢的资源，例如swap或者usb事件，以便同步。
下面是一些最常用的内核线程(你可以用ps x命令):

PID      COMMAND
1        init
2        keventd
3        kswapd
4        kreclaimd
5        bdflush
6        kupdated
7        kacpid
67        khubd

下面是一个典型的内核线程kswapd [mm/vmscan.c].
kswapd是被clone()建立的 [arch/i386/kernel/process.c]''

|do_initcalls
   |kswapd_init
      |kernel_thread
         |syscall fork (in assembler)

·do_initcalls [init/main.c]
·kswapd_init [mm/vmscan.c]
·kernel_thread [arch/i386/kernel/process.c]

[目录]

进程切换简介
                           TASK SWITCHING TRICK
#define switch_to(prev,next,last) do {                             \
    asm volatile("pushl %%esi\n\t"                                 \
                 "pushl %%edi\n\t"                                 \
                 "pushl %%ebp\n\t"                                 \
                 "movl %%esp,%0\n\t"        /* save ESP */         \
                 "movl %3,%%esp\n\t"        /* restore ESP */      \
                 "movl $1f,%1\n\t"          /* save EIP */         \
                 "pushl %4\n\t"             /* restore EIP */      \
                 "jmp __switch_to\n"                               \
                 "1:\t"                                            \
                 "popl %%ebp\n\t"                                  \
                 "popl %%edi\n\t"                                  \
                 "popl %%esi\n\t"                                  \
                 :"=m" (prev->thread.esp),"=m" (prev->thread.eip), \
                 "=b" (last)                                       \
                 :"m" (next->thread.esp),"m" (next->thread.eip),   \
                 "a" (prev), "d" (next),                           \
                 "b" (prev));                                      \
} while (0)

好象是一个递归？重点理解在这里:

pushl %4 把未来的EIP压入栈
jmp __switch_to 执行 __switch_to过程, 但是和call不同，我们会返回到 1 这里压入的地址。所以进入了新进程!

U S E R M O D E K E R N E L M O D E

|          |     |          |       |          |     |          |
|          |     |          | Timer |          |     |          |
|          |     | Normal | IRQ |          |     |          |
|          |     |   Exec   |------>|Timer_Int.|     |          |
|          |     |     |    |       | ..       |     |          |
|          |     |    \|/   |       |schedule()|     | Task1 Ret|
|          |     |          |       |_switch_to|<-- | Address |
|__________|     |__________|       |          | | |          |
                                     |          | |S |          |
Task1 Data/Stack   Task1 Code        |          | |w |          |
                                     |          | T|i |          |
                                     |          | a|t |          |
|          |     |          |       |          | s|c |          |
|          |     |          | Timer |          | k|h |          |
|          |     | Normal | IRQ |          | |i |          |
|          |     |   Exec   |------>|Timer_Int.| |n |          |
|          |     |     |    |       | ..       | |g |          |
|          |     |    \|/   |       |schedule()| | | Task2 Ret|
|          |     |          |       |_switch_to|<-- | Address |
|__________|     |__________|       |__________|     |__________|

Task2 Data/Stack Task2 Code Kernel Code Kernel Data/Stack

[目录]

同步机制
同步通常是为了达到多线程协同的目的而设计的一种机制，通常包含异步信号机制和互斥机制作为其实现的底层。在Linux 2.4内核中也有相应的技术实现，包括信号量、自旋锁、原子操作和等待队列，其中原子操作和等待队列又是实现信号量的底层。

一. 等待队列和异步信号

wait queue很早就作为一个基本的功能单位出现在Linux内核里了，它以队列为基础数据结构，与进程调度机制紧密结合，能够用于实现核心的异步事件通知机制。我们从它的使用范例着手，看看等待队列是如何实现异步信号功能的。

在核心运行过程中，经常会因为某些条件不满足而需要挂起当前线程，直至条件满足了才继续执行。在2.4内核中提供了一组新接口来实现这样的功能，下面的代码节选自kernel/printk.c：

    unsigned long log_size;
1: DECLARE_WAIT_QUEUE_HEAD(log_wait);...
4: spinlock_t console_lock = SPIN_LOCK_UNLOCKED;...
    int do_syslog(int type,char *buf,int len){
        ...
2:      error=wait_event_interruptible(log_wait,log_size);
        if(error)
             goto out;
        ...
5:      spin_lock_irq(&console_lock);
        ...
        log_size--;
        ...
6:        spin_unlock_irq(&console_lock);
        ...
    }
    asmlinkage int printk(const char *fmt,...){
        ...
7:        spin_lock_irqsave(&map;console_lock,flags);
        ...
        log_size++;...
8:        spin_unlock_irqrestore(&console_lock,flags);
3:      wake_up_interruptible(&map;log_wait);
        ...
    }

这段代码实现了printk调用和syslog之间的同步，syslog需要等待printk送数据到缓冲区，因此，在2:处等待log_size非0；而printk一边传送数据，一边增加log_size的值，完成后唤醒在log_wait上等待的所有线程（这个线程不是用户空间的线程概念，而是核内的一个执行序列）。执行了3:的wake_up_interruptible()后，2:处的wait_event_interruptible()返回0，从而进入syslog的实际动作。

1:是定义log_wait全局变量的宏调用。

在实际操作log_size全局变量的时候，还使用了spin_lock自旋锁来实现互斥，关于自旋锁，这里暂不作解释，但从这段代码中已经可以清楚的知道它的使用方法了。

所有wait queue使用上的技巧体现在wait_event_interruptible()的实现上，代码位于include/linux/sched.h中，前置数字表示行号：

779 #define __wait_event_interruptible(wq, condition, ret)                  \
780 do {                                                                    \
781         wait_queue_t __wait;                                            \
782         init_waitqueue_entry(&__wait, current);                         \
783                                                                         \
784         add_wait_queue(&wq, &__wait);                                   \
785         for (;;) {                                                      \
786                 set_current_state(TASK_INTERRUPTIBLE);                  \
787                 if (condition)                                          \
788                         break;                                          \
789                 if (!signal_pending(current)) {                         \
790                         schedule();                                     \
791                         continue;                                       \
792                 }                                                       \
793                 ret = -ERESTARTSYS;                                     \
794                 break;                                                  \
795         }                                                               \
796         current->state = TASK_RUNNING;                                  \
797         remove_wait_queue(&wq, &__wait);                                \
798 } while (0)
799
800 #define wait_event_interruptible(wq, condition)                         \
801 ({                                                                      \
802         int __ret = 0;                                                  \
803         if (!(condition))                                               \
804                 __wait_event_interruptible(wq, condition, __ret);       \
805         __ret;                                                          \
806 })

在wait_event_interruptible()中首先判断condition是不是已经满足，如果是则直接返回0，否则调用__wait_event_interruptible()，并用__ret来存放返回值。__wait_event_interruptible()首先定义并初始化一个wait_queue_t变量__wait，其中数据为当前进程结构current（struct task_struct），并把__wait入队。在无限循环中，__wait_event_interruptible()将本进程置为可中断的挂起状态，反复检查condition是否成立，如果成立则退出，如果不成立则继续休眠；条件满足后，即把本进程运行状态置为运行态，并将__wait从等待队列中清除掉，从而进程能够调度运行。如果进程当前有异步信号（POSIX的），则返回-ERESTARTSYS。

挂起的进程并不会自动转入运行的，因此，还需要一个唤醒动作，这个动作由wake_up_interruptible()完成，它将遍历作为参数传入的log_wait等待队列，将其中所有的元素（通常都是task_struct）置为运行态，从而可被调度到，执行__wait_event_interruptible()中的代码。

DECLARE_WAIT_QUEUE_HEAD(log_wait)经过宏展开后就是定义了一个log_wait等待队列头变量：

struct __wait_queue_head log_wait = {
lock: SPIN_LOCK_UNLOCKED,
task_list: { &map;log_wait.task_list, &log_wait.task_list }
}

其中task_list是struct list_head变量，包括两个list_head指针，一个next、一个prev，这里把它们初始化为自身，属于队列实现上的技巧，其细节可以参阅关于内核list数据结构的讨论，add_wait_queue()和remove_wait_queue()就等同于list_add()和list_del()。

wait_queue_t结构在include/linux/wait.h中定义，关键元素即为一个struct task_struct变量表征当前进程。

除了wait_event_interruptible()/wake_up_interruptible()以外，与此相对应的还有wait_event()和wake_up()接口，interruptible是更安全、更常用的选择，因为可中断的等待可以接收信号，从而挂起的进程允许被外界kill。

wait_event*()接口是2.4内核引入并推荐使用的，在此之前，最常用的等待操作是interruptible_sleep_on(wait_queue_head_t *wq)，当然，与此配套的还有不可中断版本sleep_on()，另外，还有带有超时控制的*sleep_on_timeout()。sleep_on系列函数的语义比wait_event简单，没有条件判断功能，其余动作与wait_event完全相同，也就是说，我们可以用interruptible_sleep_on()来实现wait_event_interruptible()（仅作示意〉：

do{
        interruptible_sleep_on(&log_wait);
        if(condition)
                break;
}while(1);

相对而言，这种操作序列有反复的入队、出队动作，更加耗时，而很大一部分等待操作的确是需要判断一个条件是否满足的，因此2.4才推荐使用wait_event接口。

在wake_up系列接口中，还有一类wake_up_sync()和wake_up_interruptible_sync()接口，保证调度在wake_up返回之后进行。

二. 原子操作和信号量

POSIX有信号量，SysV IPC有信号量，核内也有信号量，接口很简单，一个down()，一个up()，分别对应P操作和V操作，down()调用可能引起线程挂起，因此和sleep_on类似，也有interruptible系列接口。down意味着信号量减1，up意味着信号量加1，这两个操作显然需要互斥。在Linux 2.4中，并没有如想象中的用锁实现，而是使用了原子操作。

在include/asm/atomic.h中定义了一系列原子操作，包括原子读、原子写、原子加等等，大多是直接用汇编语句来实现的，这里就不详细解释。

我们从信号量数据结构开始，它定义在include/asm/semaphore.h中：

struct semaphore {
        atomic_t count;
        int sleepers;
        wait_queue_head_t wait;
}

down()操作可以理解为申请资源，up()操作可以理解为释放资源，因此，信号量实际表示的是资源的数量以及是否有进程正在等待。在semaphore结构中，count相当于资源计数，为正数或0时表示可用资源数，-1则表示没有空闲资源且有等待进程。而等待进程的数量并不关心。这种设计主要是考虑与信号量的原语相一致，当某个进程执行up()函数释放资源，点亮信号灯时，如果count恢复到0，则表示尚有进程在等待该资源，因此执行唤醒操作。一个典型的down()-up()流程是这样的：

    down()-->count做原子减1操作，如果结果不小于0则表示成功申请，从down()中返回；
    -->如果结果为负（实际上只可能是-1），则表示需要等待，则调用__down_fail()；
    __down_fail()调用__down()，__down()用C代码实现，要求已不如down()和__down_fail()严格，在此作实际的等待（arch/i386/kernel/semaphore.c）：

void __down(struct semaphore * sem)
{
        struct task_struct *tsk = current;
        DECLARE_WAITQUEUE(wait, tsk);
        tsk->state = TASK_UNINTERRUPTIBLE;
        add_wait_queue_exclusive(&sem->wait, &wait);

        spin_lock_irq(&semaphore_lock);
        sem->sleepers++;
        for (;;) {
                int sleepers = sem->sleepers;

                /*
                 * Add "everybody else" into it. They aren't
                 * playing, because we own the spinlock.
                 */
                if (!atomic_add_negative(sleepers - 1, &sem->count)) {
                        sem->sleepers = 0;
                        break;
                }
                sem->sleepers = 1;      /* us - see -1 above */
                spin_unlock_irq(&semaphore_lock);

                schedule();
                tsk->state = TASK_UNINTERRUPTIBLE;
                spin_lock_irq(&semaphore_lock);
        }
        spin_unlock_irq(&semaphore_lock);
        remove_wait_queue(&sem->wait, &wait);
        tsk->state = TASK_RUNNING;
        wake_up(&sem->wait);
}

__down()-->当前进程进入wait等待队列，状态为不可中断的挂起，sleepers++，如果这是第一次申请失败，则sleepers值为1，否则为2--这个设置纯粹是为了下面这句原子加而安排的。

在真正进入休眠以前，__down()还是需要判断一下是不是确实没有资源可用，因为在spin_lock之前什么都可能发生。atomic_add_negative()将sleepers-1（只可能是0或者1，分别表示仅有一个等待进程或是多个）加到count（如果有多个进程申请资源失败进入__down()，count可能为-2、-3等）之上，这个加法完成后，结果为0只可能是在sleepers等于1的时候发生（因为如果sleepers等于2，表示有多个进程执行了down()，则count必然小于-1，因此sleepers-1+count必然小于0），表示count在此之前已经变为0了，也就是说已经有进程释放了资源，因此本进程不用休眠而是获得资源退出__down()，从而也从down()中返回；如果没有进程释放资源，那么在所有等待进程的这一加法完成后，count将等于-1。因此，从down()调用外面看（无论是在down()中休眠还是获得资源离开down()），count为负时只可能为-1（此时sleepers等于1），这么设计使得up()操作只需要对count加1，判断是否为0就可以知道是否有必要执行唤醒操作__up_wakeup()了。

获得了资源的进程将把sleepers设为0，并唤醒所有其他等待进程，这个操作实际上只是起到恢复count为-1，并使它们再次进入休眠的作用，因为第一个被唤醒的等待进程执行atomic_add_negative()操作后会将count恢复为-1，然后将sleepers置为1；以后的等待进程则会像往常一样重新休眠。

将down()操作设计得如此复杂的原因和结果就是up操作相当简单。up()利用汇编原子地将count加1，如果小于等于0表示有等待进程，则调用__up_wakeup()-->__up()唤醒wait；否则直接返回。

在down()中竞争获得资源的进程并不是按照优先级排序的，只是在up()操作完成后第一个被唤醒或者正在__down()中运行而暂未进入休眠的进程成功的可能性稍高一些。

尽管可以将信号量的count初始化为1从而实现一种互斥锁（mutex），但Linux并不保证这个count不会超过1，因为up操作并不考虑count的初值，所以只能依靠程序员自己来控制不要无谓的执行up()从而破坏mutex的语义。相关的初始化接口定义在include/asm/semaphore.h中，但一般程序员可以通过sema_init()接口来初始化信号量：

#define DECLARE_MUTEX(name) __DECLARE_SEMAPHORE_GENERIC(name,1)
#define DECLARE_MUTEX_LOCKED(name) __DECLARE_SEMAPHORE_GENERIC(name,0)
static inline void sema_init (struct semaphore *sem, int val)
static inline void init_MUTEX (struct semaphore *sem)
static inline void init_MUTEX_LOCKED (struct semaphore *sem)

除了down()以外，Linux还提供了一个down_interruptible()，操作序列与down()基本相同，仅在休眠状态为可中断和信号处理上有所不同。在标准的信号量以外，还有一套读写信号量，用于将资源的读写区分开来进行同步以提高效率，采用读写锁来实现，有兴趣的可以参阅文后列出的参考资料。

三. 自旋锁

锁是一个概念，正如上面提到的mutex互斥锁仅仅是其中的一种。互斥锁被锁定时进入休眠，而系统还能正常运转，但有很多时候，锁应该不仅仅互斥访问，甚至应该让系统挂起，直至锁成功，也就是说在锁操作中"自旋"，这就是Linux中的spinlock机制。

从实现上来说，自旋锁比较简单，主要分为两部分，一部分是中断处理，一部分是自旋处理，最基础的部分在spin_lock_string和spin_unlock_string这两段汇编代码中：

#define spin_lock_string \
        "\n1:\t" \
        "lock ; decb %0\n\t" \
        "js 2f\n" \
        ".section .text.lock,\"ax\"\n" \
        "2:\t" \
        "cmpb $0,%0\n\t" \
        "rep;nop\n\t" \
        "jle 2b\n\t" \
        "jmp 1b\n" \
        ".previous"
#define spin_unlock_string \
        "movb $1,%0"

不详细解释这段汇编代码的语义，spin_lock_string对锁原子减1，循环检查锁值，直到锁值大于0；而spin_unlock_string则是对锁赋值1。spin_lock_string用于构成spin_lock()函数，spin_unlock_string用于构成spin_unlock()函数。

spin_lock()/spin_unlock()构成了自旋锁机制的基础，它们和关中断local_irq_disable()/开中断local_irq_enable()、关bh local_bh_disable()/开bh local_bh_enable()、关中断并保存状态字local_irq_save()/开中断并恢复状态字local_irq_restore()结合就形成了整套自旋锁机制，接口定义在include/linux/spinlock.h中，这里就不列举了。

实际上，以上提到的spin_lock()都是在CONFIG_SMP的前提下才会生成的，也就是说，在单处理机系统中，spin_lock()是一条空语句，因为在处理机执行它的语句时，不可能受到打扰，语句肯定是串行的。在这种简单情况下，spin_lock_irq()就只需要锁中断就可以完成任务了。在include/linux/spinlock.h中就用#ifdef CONFIG_SMP来区分两种不同的情况。

自旋锁有很多种，信号量也可以用来构成互斥锁，原子操作也有锁功能，而且还有与标准锁机制类似的读写锁变种，在不同的应用场合应该选择不同的锁，下一节就是介绍如何选择。

四. 锁的使用

1.用户上下文之间

如果所访问的共享资源仅在用户上下文中使用，最高效的办法就是使用信号量。在net/core/netfilter.c中有一处使用信号量的例子：

static DECLARE_MUTEX(nf_sockopt_mutex);
int nf_register_sockopt(struct nf_sockopt_ops *reg)
{
...
        if (down_interruptible(&nf_sockopt_mutex) != 0)
                return -EINTR;
...
out:
        up(&nf_sockopt_mutex);
        return ret;
}

2.用户上下文与bottom half之间

此时有两种情况需要使用锁，一是用户上下文被bottom half中断，二是多个处理机同时进入一个临界段。一般情况下，使用spin_lock_bh()/spin_unlock_bh()可以满足要求，它将关闭当前CPU的bottom half，然后再获取锁，直至离开临界段再释放并对bottom half重新使能。

3.用户上下文与软中断（Tasklet）之间

tasklet与bottom half的实现机制是一样的，实际上spin_lock_bh()也同时关闭了tasklet的执行，因此，在这种情况下，用户上下文与tasklet之间的同步也使用spin_lock_bh()/spin_unlock_bh()。

4.bottom half之间

bottom half本身的机制就保证了不会有多于1个的bottom half同时处于运行态，即使对于SMP系统也是如此，因此，在设计共享数据的bottom half时无需考虑互斥。

5.tasklet之间

tasklet和bottom half类似，也是受到local_bh_disable()保护的，因此，同一个tasklet不会同时在两个CPU上运行；但不同的tasklet却有可能，因此，如果需要同步不同的tasklet访问共享数据的话，就应该使用spin_lock()/spin_unlock()。正如上面提到的，这种保护仅对SMP系统有意义，UP系统中tasklet的运行不会受到另一个tasklet（不论它是否与之相同）的打扰，因此也就没有必要上锁。

6.softirq之间

softirq是实现tasklet和bottom half的基础，限制较后二者都少，允许两个softirq同时运行于不同的CPU之上，而不论它们是否来自同一个softirq代码，因此，在这种情况下，都需要用spin_lock()/spin_unlock()来同步。

7.硬中断和软中断之间

硬中断是指硬件中断的处理程序上下文，软中断包括softirq和在它基础上实现的tasklet和bottom half等，此时，为了防止硬件中断软中断的运行，同步措施必须包括关闭硬件中断，spin_lock_irq()/spin_unlock_irq()就包括这个动作。还有一对API，spin_lock_irqsave()/spin_unlock_irqrestore()，不仅关闭中断，还保存机器状态字，并在打开中断时恢复。

8.其他注意事项

首先需要提醒的是"死锁"，这在操作系统原理的课本上都做过介绍，无论是使用信号量还是使用自旋锁，都有可能产生死锁，特别是自旋锁，如果死锁在spin_lock上，整个系统就会挂起。如何避免死锁是理论课的问题，这里就不多说了。

另外一点就是尽可能短时间的锁定，因此，"对数据上锁，而不是对代码上锁"就成了一个简单的原则；在有可能的情况下，使用读写锁，而不要总是使用互斥锁；对读写排序，使用原子操作，从而完全避免使用锁，也是一个不错的设计思想。

不要在锁定状态下调用可能引起休眠的操作，以下这些操作就是目前可能因此休眠的函数：

    对用户内存的访问：copy_from_user()、copy_to_user()、get_user()、put_user()
    kmalloc(GFP_KERNEL)
    down_interruptible()和down()，如果需要在spinlock中使用信号量，可以选择down_trylock()，它不会引起挂起 printk()的灵巧设计使得它不会挂起，因此可以在任何上下文中使用。

[目录]

[ 本文件由良友·收藏家自动生成 ]

目 录

目录