%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%									%
%	Copyright (C) 1992, 1993 Michael K. Johnson,			%
%	johnsonm@sunsite.unc.edu, and any others named in copyright	%
%	below.								%
%									%
%	This file is freely copyable, but you must preserve this	%
%	copyright notice on all copies, it must only be distributed	%
%	as part of the Linux Kernel Hackers' Guide, and its use is	%
%	is subject to the conditions expressed in the copyright for	%
%	the whole guide, in the file prelim/copyright.tex		%
%									%
%	If other copyright conditions are given below, they supercede	%
%	these conditions.						%
%									%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Copyright (C) 1993 Stanley Scalsky

\chapter{How System Calls Work}\label{sec-syscall}

{\bf [This needs to be a little re-worked and expanded upon, but I am
waiting to see if the iBCS stuff makes any impact on it as I write
other stuff.]}

This section covers first the mechanisms provided by the 386 for
handling system calls, and then shows how \linux\ uses those
mechanisms.  This is not a reference to the individual system calls:
There are very many of them, new ones are added occasionally, and they
are documented in man pages that should be on your \linux\ system.
Well, they are {\em supposed\/} to be documented and on your \linux\
system.  This is being worked on.  {\bf [Ideally, this chapter should
be part of another section, I think.  Maybe, however, it should just
be expanded.  I think it belongs somewhere near the chapter on how to
write a device driver, because it explains how to write a system call.]}

\section{What Does the 386 Provide?}
The 386 recognizes two event classes: exceptions and interrupts.  Both
cause a forced context switch to a new procedure or task.  Interrupts can
occur at unexpected times during the execution of a program and are
used to respond to signals from hardware.  Exceptions are caused by
the execution of instructions.

Two sources of interrupts are recognized by the 386: maskable
interrupts and nonmaskable interrupts.  Two sources of exceptions are
recognized by the 386: processor-detected exceptions and programmed
exceptions.

Each interrupt or exception has a number, which is referred to
by the 386 literature as the vector.  The NMI interrupt and the
processor detected exceptions have been assigned vectors in the range 0
through 31, inclusive.  The vectors for maskable interrupts are
determined by the hardware.  External interrupt controllers put the
vector on the bus during the interrupt-acknowledge cycle.  Any vector
in the range 32 through 255, inclusive, can be used for maskable
interrupts or programmed exceptions.  See figure~\ref{fig-intassign}
for a listing of all the possible interrupts and exceptions.
{\bf [Check all this out to make sure that it is right.]}

\begin{figure}[htp]
\centerline{
\begin{tabular}{|r|l|}\hline
0 & divide error\\\hline
1 & debug exception\\\hline
2 & NMI interrupt\\\hline
3 & Breakpoint\\\hline
4 &  INTO-detected Overflow\\\hline
5 &  BOUND range exceeded\\\hline
6 &  Invalid opcode\\\hline
7 &  coprocessor not available\\\hline
8 &  double fault\\\hline
9 &  coprocessor segment overrun\\\hline
10 & invalid task state segment\\\hline
11 & segment not present\\\hline
12 & stack fault\\\hline
13 & general protection\\\hline
14 & page fault\\\hline
15 & reserved\\\hline
16 & coprocessor error\\\hline
17--31 & reserved\\\hline
32--255 & maskable interrupts\\\hline
\end{tabular}}
\caption{Interrupt and Exception Assignments}\label{fig-intassign}
\end{figure}
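
The assignments in figure~\ref{fig-intassign} can be restated as a
small C function.  This is only an illustrative sketch: the names
{\tt vector\_class} and {\tt classify\_vector()} are invented here and
do not appear in the \linux\ source.
\begin{screen}\begin{verbatim}
/* Classes of vector, following the assignment table.
 * All names in this sketch are hypothetical. */
enum vector_class {
    VEC_EXCEPTION,  /* 0, 1, 3-14, 16: exceptions          */
    VEC_NMI,        /* 2: nonmaskable interrupt            */
    VEC_RESERVED,   /* 15, 17-31: reserved                 */
    VEC_AVAILABLE,  /* 32-255: maskable interrupts or
                       programmed exceptions               */
    VEC_INVALID     /* outside 0-255                       */
};

enum vector_class classify_vector(int vector)
{
    if (vector < 0 || vector > 255)
        return VEC_INVALID;
    if (vector >= 32)
        return VEC_AVAILABLE;
    if (vector == 2)
        return VEC_NMI;
    if (vector == 15 || vector >= 17)
        return VEC_RESERVED;
    return VEC_EXCEPTION;
}
\end{verbatim}\end{screen}
With this classification, vector 0x80 falls in the range that is free
for maskable interrupts or programmed exceptions.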

\begin{figure}[htb]
\centerline{
\begin{tabular}{r|l}
HIGHEST & Faults except debug faults\\
& Trap instructions INTO, INT n, INT 3 \\
& Debug traps for this instruction \\
& Debug traps for next instruction \\
& NMI interrupt \\
LOWEST & INTR interrupt \\
\end{tabular}}
\caption{Priority of simultaneous interrupts and exceptions}
\end{figure}

\section{How \linux\ Uses Interrupts and Exceptions}
Under \linux\ a system call is invoked by an {\bf exception} class
transfer: a programmed exception (software interrupt) caused by the
instruction {\tt int 0x80}.  Vector 0x80, which is initialized during
system startup, is used to transfer control to the kernel.

As of version 0.99.2 of \linux, there are 116 system calls.
Documentation for many can be found in section 2 of the man pages.
When a user invokes a system call, execution flow is as follows:

\begin{itemize}
\item Each call is vectored through a stub in libc.  Each call within the libc
library is generally a {\tt \_syscall{\sl X}()} macro, where {\sl X} is the
number of parameters used by the actual routine. Some system calls are
more complex than others because of variable length argument lists,
but even these complex system calls must use the same entry point:
they just have more parameter setup overhead.  Examples of complex
system calls include {\tt open()} and {\tt ioctl()}.

\item Each syscall macro expands to an assembly routine which sets up the
calling stack frame and calls {\tt \_system\_call()} through an
interrupt, via the instruction {\tt int \$0x80}.

For example, the setuid system call is coded as
\begin{screen}{\tt \_syscall1(int,setuid,uid\_t,uid);}\end{screen}
which expands to:
\begin{screen}\begin{verbatim}
_setuid:
  subl $4,%esp
  pushl %ebx
  movzwl 12(%esp),%eax
  movl %eax,4(%esp)
  movl $23,%eax
  movl 4(%esp),%ebx
  int $0x80
  movl %eax,%edx
  testl %edx,%edx
  jge L2
  negl %edx
  movl %edx,_errno
  movl $-1,%eax
  popl %ebx
  addl $4,%esp
  ret
L2:
  movl %edx,%eax
  popl %ebx
  addl $4,%esp
  ret
\end{verbatim}\end{screen}

The definitions of the {\tt \_syscall{\sl X}()} macros can be found in
/usr/include/linux/unistd.h, and
the user-space system call library code can be found in
/usr/src/libc/syscall/.

\item At this point no system code for the call has been executed. Not
until the {\tt int \$0x80} is executed does the call transfer to the kernel
entry point {\tt \_system\_call()}.  This entry point is the same for all
system calls. It is responsible for saving all registers, checking to make
sure a valid system call was invoked, and then ultimately transferring
control to the actual system call code via the offsets in the
{\tt \_sys\_call\_table}.  It is also responsible for calling {\tt
\_ret\_from\_sys\_call()} when the system call has been completed, but
before returning to user space.

Actual code for the {\tt system\_call} entry point can be found in
/usr/src/linux/kernel/sys\_call.S.
Actual code for many of the system calls can be found in
/usr/src/linux/kernel/sys.c, and the rest are found elsewhere.
{\tt find} is your friend.

\item After the system call has executed, {\tt \_ret\_from\_sys\_call()}
is called.  It checks to see if the scheduler should be run, and if
so, calls it.

\item Upon return from the system call, the {\tt \_syscall{\sl X}()}
macro code checks for a negative return value, and if there is one,
puts a positive copy of the return value in the global variable
{\tt \_errno}, so that it can be accessed by code like {\tt perror()},
and returns $-1$ to the caller.

\end{itemize}
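
The flow above can be modelled in ordinary user-space C.  The
following is a sketch only, not kernel code: every {\tt model\_} name
is hypothetical, the two handlers are stand-ins for real system call
routines, and returning {\tt -ENOSYS} for an out-of-range call number
is an assumption of this sketch rather than something stated above.
\begin{screen}\begin{verbatim}
#include <errno.h>

typedef long (*syscall_fn)(long arg);

/* Hypothetical handlers standing in for kernel routines. */
static long model_sys_double(long arg) { return 2 * arg; }
static long model_sys_fail(long arg)
{
    (void)arg;
    return -EPERM;   /* errors come back as negative codes */
}

/* Model of the table indexed by system call number. */
static const syscall_fn model_sys_call_table[] = {
    model_sys_double,   /* call number 0 */
    model_sys_fail,     /* call number 1 */
};
#define MODEL_NR_SYSCALLS \
    (sizeof(model_sys_call_table) / \
     sizeof(model_sys_call_table[0]))

/* Kernel side: one entry point for every call.  Check the
 * number, then dispatch through the table. */
long model_system_call(unsigned long nr, long arg)
{
    if (nr >= MODEL_NR_SYSCALLS)
        return -ENOSYS;   /* assumed error for a bad number */
    return model_sys_call_table[nr](arg);
}

/* User side: what the libc stub does once the trap returns.
 * A negative result becomes errno plus a return of -1. */
long model_stub(unsigned long nr, long arg)
{
    long ret = model_system_call(nr, arg);

    if (ret < 0) {
        errno = (int)-ret;
        return -1;
    }
    return ret;
}
\end{verbatim}\end{screen}
In this model, {\tt model\_stub(1, 0)} returns $-1$ and leaves
{\tt EPERM} in {\tt errno}, mirroring the {\tt negl}/{\tt movl}
sequence in the expanded {\tt \_setuid} stub.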

\section{How \linux\ Initializes the System Call Vectors}

The {\tt startup\_32()} code found in /usr/src/linux/boot/head.S starts
everything off by calling {\tt setup\_idt()}. This routine sets up an
IDT (Interrupt Descriptor Table) with 256 entries.  No interrupt entry
points are actually loaded by this routine, as that is done only after
paging has been enabled and the kernel has been moved to 0xC0000000.
Each of the 256 IDT entries is 8 bytes long, for a total of 2048 bytes.

When {\tt start\_kernel()} (found in /usr/src/linux/init/main.c) is
called it invokes {\tt trap\_init()} (found in
/usr/src/linux/kernel/traps.c).  {\tt trap\_init()} sets up the IDT via
the macro {\tt set\_trap\_gate()} (found in /usr/include/asm/system.h).
{\tt trap\_init()} initializes the interrupt descriptor table as shown
in figure~\ref{fig-intinit}.

\begin{figure}[htb]
\centerline{
\begin{tabular}{|r|l|}\hline
0 &  divide\_error\\\hline
1 &  debug\\\hline
2 &  nmi\\\hline
3 &  int3\\\hline
4 &  overflow\\\hline
5 &  bounds\\\hline
6 &  invalid\_op\\\hline
7 &  device\_not\_available\\\hline
8 &  double\_fault\\\hline
9 &  coprocessor\_segment\_overrun\\\hline
10 & invalid\_TSS\\\hline
11 & segment\_not\_present\\\hline
12 & stack\_segment\\\hline
13 & general\_protection\\\hline
14 & page\_fault\\\hline
15 & reserved\\\hline
16 & coprocessor\_error\\\hline
17 & alignment\_check\\\hline
18--48 & reserved\\\hline
\end{tabular}}
\caption{Initialization of interrupts}\label{fig-intinit}
\end{figure}

\noindent At this point the interrupt vector for the system calls is
not set up. It is initialized by {\tt sched\_init()} (found in
/usr/src/linux/kernel/sched.c). A call to {\tt
set\_system\_gate (0x80, \&system\_call)} sets interrupt 0x80 to be a
vector to the {\tt system\_call()} entry point.
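
To make the gate setup more concrete, here is a sketch of how a single
IDT entry is laid out on the 386 in protected mode, where each gate
descriptor occupies 8 bytes.  The structure and function names are
hypothetical and this is not the kernel's own macro.  The sketch
assumes that a trap gate uses type 15, and that a system gate differs
from an ordinary trap gate only in carrying descriptor privilege
level 3, so that user code is allowed to execute {\tt int 0x80}.
\begin{screen}\begin{verbatim}
#include <stdint.h>

/* One 386 protected-mode gate descriptor (8 bytes).
 * Names are hypothetical. */
struct gate_desc {
    uint16_t offset_low;   /* handler address, bits 15..0  */
    uint16_t selector;     /* code segment selector        */
    uint8_t  reserved;     /* must be zero                 */
    uint8_t  type_attr;    /* P, DPL, 0, 4-bit gate type   */
    uint16_t offset_high;  /* handler address, bits 31..16 */
};

#define GATE_TYPE_TRAP 0xF /* 32-bit trap gate */

struct gate_desc make_gate(uint32_t handler,
                           uint16_t selector,
                           unsigned type, unsigned dpl)
{
    struct gate_desc g;

    g.offset_low  = (uint16_t)(handler & 0xFFFF);
    g.selector    = selector;
    g.reserved    = 0;
    /* bit 7: present; bits 6-5: DPL; bits 3-0: type */
    g.type_attr   = (uint8_t)(0x80 | ((dpl & 3) << 5)
                              | (type & 0xF));
    g.offset_high = (uint16_t)(handler >> 16);
    return g;
}
\end{verbatim}\end{screen}
A gate built with {\tt make\_gate(addr, sel, GATE\_TYPE\_TRAP, 3)}
would play the role of entry 0x80, while the exception entries in
figure~\ref{fig-intinit} would use privilege level 0 under the same
assumption.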


\section{How to Add Your Own System Calls}

\begin{enumerate}

\item Create a directory under the /usr/src/linux/ directory to hold
your code.

\item Put any include files in /usr/include/sys/ and /usr/include/linux/.

\item Add the relocatable module produced by the link of your new
kernel code to the {\tt ARCHIVES} and the subdirectory to the {\tt
SUBDIRS} lines
of the top level Makefile. See fs/Makefile, target fs.o for an example.

\item Add a {\tt \#define \_\_NR\_{\sl xx}} to unistd.h to assign a
call number for your system call, where {\sl xx}, the index, is
something descriptive relating to your system call. It will be used to
set up the vector through {\tt sys\_call\_table} to invoke your code.

\item Add an entry point for your system call to the {\tt sys\_call\_table}
in sys.h. It should match the index ({\sl xx}) that you assigned in
the previous step.  The {\tt NR\_syscalls} variable will be
recalculated automatically.

\item Modify any kernel code in kernel/, fs/, mm/, etc.\ to take into
account the environment needed to support your new code.

\item Run make from the top level to produce the new kernel incorporating
your new code.
\end{enumerate}

\noindent At this point, you will have to either add a syscall to your
libraries, or use the proper {\tt \_syscall{\sl n}()} macro in your
user program for your programs to access the new system call.
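
As a rough illustration of calling the kernel by number from a user
program, the sketch below uses the C library's {\tt syscall()}
function rather than the {\tt \_syscall{\sl n}()} macros.  This is a
substitution made for this example: on present-day systems the macros
are generally no longer available to user programs, and
{\tt syscall()} is the usual replacement.  The wrapper name
{\tt raw\_getpid()} is hypothetical, and {\tt SYS\_getpid} merely
stands in for the number you would assign to your own call.
\begin{screen}\begin{verbatim}
#define _GNU_SOURCE
#include <sys/syscall.h>   /* SYS_getpid      */
#include <unistd.h>        /* syscall()       */

/* Invoke getpid by call number instead of through
 * its ordinary libc stub. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
\end{verbatim}\end{screen}
The value returned should match what the ordinary {\tt getpid()}
library call reports for the same process.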

The {\em 386DX Microprocessor Programmer's Reference Manual} is a
helpful reference, as is James Turley's {\em Advanced 80386
Programming Techniques.}  See the Annotated bibliography in
Appendix~\ref{bibliography}.