%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%									%
%	Copyright (C) 1992, 1993 Michael K. Johnson,			%
%	johnsonm@sunsite.unc.edu					%
%									%
%	This file is freely copyable, but you must preserve this	%
%	copyright notice on all copies, it must only be distributed	%
%	as part of the Linux Kernel Hackers' Guide, and its use is	%
%	is subject to the conditions expressed in the copyright for	%
%	the whole guide, in the file prelim/copyright.tex		%
%									%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\chapter{The \linux\ scheduler}

{\bf [This chapter is very bad, and is included here mostly to
encourage Jim Wissner to write the better version that he has
mentioned.  Unfortunately, I have been unable to get in touch with
Jim, and so may have to re-write this myself.]}

The \linux\ scheduler is significantly different from the schedulers
in other unices, especially in its treatment of `nice level'
priorities.  Instead of scheduling processes with higher priority
first, \linux\ does round-robin scheduling, but lets higher priority
processes run both sooner and longer.  Because it uses a fairly fast
clock (100 Hz) and fairly small timeslices, and has quick response to
interrupts, this makes for a very responsive system.  Like OS/2's scheduler,
the scheduler is primarily designed to make the system feel fast to a
single user, but does much better at keeping interactive tasks from
lagging than OS/2's scheduler.

It should be noted that patches are available which give a different
scheduler that is more rugged under heavier load, which has a little
more overhead, but is closer in design to standard unix schedulers.
Contact Jim Wissner ({\tt wissner@beech.mcs.gvsu.edu}) for the patches.
{\bf[This may change.  They will probably be available in the patches
subdirectory at {\tt tsx-11.mit.edu} eventually.]}

\section{The code}

Here is a copy of the relevant code from
/usr/src/linux/kernel/sched.c, annotated and abridged.

\begin{screen}\begin{verbatim}
void schedule(void)
{
        int i,next,c;
        struct task_struct ** p;

/* check alarm, wake up any interruptible tasks that have got a signal */

        need_resched = 0;
        for(p = &LAST_TASK ; p > &FIRST_TASK ; --p) {
\end{verbatim}\end{screen}
The process table is an array of pointers to {\tt struct task\_struct}
structures.
See /usr/include/linux/sched.h for the definition of this structure.
\begin{screen}\begin{verbatim}
                if (!*p || ((*p)->state != TASK_INTERRUPTIBLE))
                        continue;
                if ((*p)->timeout && (*p)->timeout < jiffies) {
\end{verbatim}\end{screen}
If a process has a timeout and has reached it, then {\tt jiffies} (the
number of 100th's of a second since system start) will have passed
{\tt timeout}.  {\tt timeout} is set as {\tt jiffies + \em
desired\_timeout}.
\begin{screen}\begin{verbatim}
                        (*p)->timeout = 0;
                        (*p)->state = TASK_RUNNING;
                } else if ((*p)->signal & ~(*p)->blocked)
\end{verbatim}\end{screen}
If the process has been sent a signal, and is no longer blocked, then
let the process run again.
\begin{screen}\begin{verbatim}
                        (*p)->state = TASK_RUNNING;
        }

/* this is the scheduler proper: */

\end{verbatim}\end{screen}
At this point, all runnable processes have been flagged as runnable,
and we are ready to choose one to run, by running through the process
table.  What we are looking for is the process with the largest
counter.  The counter for each runnable process is incremented each
time the scheduler is called by an amount that is weighted by the
priority, which is the kernel version of the `nice' value.  (It
differs in that the priority is never negative.)
\begin{screen}\begin{verbatim}

        while (1) {
                c = -1;
                next = 0;
                i = NR_TASKS;
                p = &task[NR_TASKS];
                while (--i) {
                        if (!*--p)
\end{verbatim}\end{screen}
If there is no process in this slot then don't bother\dots
\begin{screen}\begin{verbatim}
                                continue;
                        if ((*p)->state == TASK_RUNNING && (*p)->counter > c)
                                c = (*p)->counter, next = i;
\end{verbatim}\end{screen}
If the counter is higher than any previous counter, then make the
process the next process, unless, of course, an even higher one is
encountered later in the loop.
\begin{screen}\begin{verbatim}
                }
                if (c)
                        break;
                for(p = &LAST_TASK ; p > &FIRST_TASK ; --p)
                        if (*p)
                                (*p)->counter = ((*p)->counter >> 1) +
                                                (*p)->priority;
\end{verbatim}\end{screen}
Here is where the counter is set.  It is first divided by 2, and then
the priority is added.  Note that this happens only if no process has
been found to switch to.
\begin{screen}\begin{verbatim}
        }
        sti();
        switch_to(next);
}

\end{verbatim}\end{screen}
I have truncated {\tt do\_timer()} extremely, only showing the pieces
that relate specifically to {\tt schedule()}.  For information on the
rest, see the appropriate section.  For instance, for commentary on the
{\tt itimer} mechanism see the section on {\tt itimers}.  {\bf [I
suppose I need to {\em write} that section now\dots  I will need to
put a reference here to that section.]}  I have specifically left out
all the accounting stuff, all the timer stuff, and the floppy timer.
\begin{screen}\begin{verbatim}

static void do_timer(struct pt_regs * regs)
{
        unsigned long mask;
        struct timer_struct *tp = timer_table+0;
        struct task_struct ** task_p;

        jiffies++;
\end{verbatim}\end{screen}
Here is where {\tt jiffies} is incremented.  This is all-important to
the rest of the kernel.
\begin{screen}\begin{verbatim}

        if (current == task[0] || (--current->counter)<=0) {
                current->counter=0;
                need_resched = 1;
        }
}

\end{verbatim}\end{screen}
Don't let task 0 run if anything else can run.  If task 0 is running,
the machine is idle, but don't let it be idle if anything else is
happening, so run schedudule as soon as possible.  Set the {\tt
need\_resched} flag if necessary so that schedule gets called again as
soon as possible.  Note the {\tt schedule()} is {\em not} called
anywhere from within {\tt do\_timer()}!

You are probably wondering how preemption is done, if {\tt schedule()}
is not to be called from within {\tt do\_timer()}.  What happens is
that when, in {\tt sched\_init()} in kernel/sched.c, {\tt
request\_irq()} is used to get the timer interrupt, {\tt
request\_irq()} sets up housekeeping before and after the interrupt is
serviced, as seen in {\tt <asm/irq.h>}.  One of the things that is
done after an interrupt is serviced is that all the registers are
properly set up, and {\tt ret\_from\_sys\_call()} (an assembly
language routine in kernel/sys\_call.S) is called.  That, in turn, may
call {\tt schedule()} to re-schedule the processes.  This same code is
also used after system calls and after most interrupts.  Interrupts
that are serviced often and that must be serviced quickly, such as
serial interrupts, do {\em not} call {\tt ret\_from\_sys\_call()} when
done and do as little as possible, to keep the overhead down.  In
particular, they only save the registers that C would clobber, and
assume that if the handler is going to use any others, the handler
will deal with that.  {\bf [This paragraph is poorly written and hard
to follow.  Re-write.]}