%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%									%
%	Copyright (C) 1992, 1993 Michael K. Johnson,			%
%	johnsonm@sunsite.unc.edu, and any others named in		%
%	copyright below.						%
%									%
%	This file is freely copyable, but you must preserve this	%
%	copyright notice on all copies, it must only be distributed	%
%	as part of the Linux Kernel Hackers' Guide, and its use is	%
%	is subject to the conditions expressed in the copyright for	%
%	the whole guide, in the file prelim/copyright.tex		%
%									%
%	If other copyright conditions are given below, they supercede	%
%	these conditions.						%
%									%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\chapter{Linux Memory Management}\label{linux-mm}

{\bf [This chapter needs to be made much friendlier.  I'd hate to
remove detail, but it needs to be less daunting.  Many have told me
that this is a daunting chapter, and it need not be.  I'll re-work it
later.  In the meantime, please bear with me.]}
     
% Copyright (C) 1992, 1993 krishna balasubramanian, douglas johnson,
% and Michael K. Johnson
     
\section{Overview}

The Linux memory manager implements demand paging with a copy-on-write
strategy relying on the 386's paging support. A process acquires its
page tables from its parent (during a {\tt fork()}) with the entries
marked as read-only or swapped.  Then, if the process tries to write
to that memory space, and the page is a copy-on-write page, it is
copied, and the page is marked read-write.  An {\tt exec()} results in
the reading in of a page or so from the executable.  The process then
faults in any other pages it needs.

Each process has a page directory which means it can access 1 KB of
page tables pointing to 1 MB of 4 KB pages which is 4 GB of memory.  A
process' page directory is initialized during a fork by {\tt
copy\_page\_tables()}.  The idle process has its page directory
initialized during the initialization sequence.

Each user process has a local descriptor table that contains a code
segment and data-stack segment. These user segments extend from 0 to
3~GB~(0xc0000000).  In user space linear addresses\footnote{In the
80386, linear address run from 0GB to 4GB.  A linear address points to
a particular memory location within this space.  A linear address is
{\bf not} a physical address~--- it is a virtual address.} and logical
addresses\footnote{A logical address consists of a selector and an
offset.  The selector points to a segment and the offset tells how far
into that segment the address is located.} are identical.

The kernel code and data segments are priveleged segments defined in the
global descriptor table and extend from 3~GB to 4~GB.  The swapper page
directory ({\tt swapper\_page\_dir} is set up so that logical addresses
and physical addresses are identical in kernel space.

The space above 3~GB appears in a process' page directory as pointers
to kernel page tables. This space is invisible to the process in user 
mode but the mapping becomes relevant when privileged mode is entered,
for example, to handle a system call.

Supervisor mode is entered within the context of the current process
so address translation occurs with respect to the process' page
directory but using kernel segments. This is identically the mapping
produced by using the {\tt swapper\_pg\_dir} and kernel segments as
both page directories use the same page tables in this space. Only
{\tt task[0]} (the idle task\footnote{Sometimes called the swapper
task, even though it has nothing to do with swapping in the \linux\
implementation, for historical reasons} {\bf [This should be
documented earlier in this guide\dots]}) uses the {\tt
swapper\_pg\_dir} directly.
\begin{itemize}
\item The user process' {\tt segment\_base} = 0x00, {\tt page\_dir}
      private to the process. 
\item user process makes a system call: {\tt segment\_base}=0xc0000000
      {\tt page\_dir} = same user {\tt page\_dir}.
\item {\tt swapper\_pg\_dir} contains a mapping for all physical pages
      from 0xc0000000 to 0xc0000000 + {\tt end\_mem},
      so the first 768 entries in {\tt swapper\_pg\_dir} are 0's, and
      then there are 4 or more that point to kernel page tables.
\item The user page directories have the same entries as {tt swapper\_pg\_dir}
      above 768. The first 768 entries map the user space.
\end{itemize}
The upshot is that whenever the linear address is above 0xc0000000
everything uses the same kernel page tables.

The user stack sits at the top of the user data segment and grows down.
The kernel stack is not a pretty data structure or segment that I can 
point to with a ``yon lies the kernel stack.''  A {\tt kernel\_stack\_frame}
(a page) is associated with each newly created process and is used 
whenever the kernel operates within the context of that process. Bad 
things would happen if the kernel stack were to grow below its current 
stack frame.  {\bf [Where is the kernel stack put?  I know that there
is one for every process, but where is it stored when it's not being used?]}

User pages can be stolen or swapped.  A user page is one that is mapped
below 3~GB in a user page table.  This region does not contain page 
directories or page tables.  Only dirty pages are swapped.

Minor alterations are needed in some places (tests for process memory
limits comes to mind) to provide support for programmer defined segments. 


\section{Physical memory}

Here is a map of physical memory before any user processes are
executed. The column on the left gives the {\bf starting} address of
the item, numbers in {\em italics} are approximate.  The column in the
middle names the item(s).  The column on the far right gives the
relevant routine or variable name or explains the entry.

\noindent\begin{tabular}{|r|c|l|}\hline
{\em 0x110000} & FREE & {\tt memory\_end} or {\tt high\_memory} \\\hline
         & {\tt mem\_map} & {\tt mem\_init()} \\\hline
         & {\tt inode\_table} & {\tt inode\_init()} \\\hline
         & device data & {\tt device\_init()}\dag \\\hline
0x100000 & more {\tt pg\_table}s & {\tt paging\_init()} \\\hline
0x0A0000 & RESERVED & \\\hline
{\em 0x060000} & FREE & \\\hline
         & {\tt low\_memory\_start} & \\\hline
0x006000 & kernel code $+$ data & \\\hline
         & {\tt floppy\_track\_buffer} & \\\hline
         & \parbox[t]{1.75 in}{\hfil{\tt bad\_pg\_table}\hfil\\
                               \mbox{\strut}\hfil{\tt bad\_page}\hfil} &
            \parbox[t]{3 in}{used by {\tt page\_fault\_handler}s to
                             kill processes gracefully when out of
                             memory.\strut} \\\hline
0x002000 & {\tt pg0} & the first kernel page table. \\\hline
0x001000 & {\tt swapper\_pg\_dir} & the kernel page directory. \\\hline
0x000000 & null page & \\\hline
\end{tabular}\\
\dag device-inits that acquire memory are(main.c):
{\tt profil\_buffer}, {\tt con\_init}, {\tt psaux\_init}, {\tt
rd\_init}, {\tt scsi\_dev\_init}.
\smallskip
Note that all memory not marked as FREE is RESERVED ({\tt mem\_init}).
RESERVED pages belong to the kernel and are {\bf never} freed or swapped.\\


\section{A user process' view of memory}

\begin{tabular}{|r|c|l|}\hline
0xc0000000 & The invisible kernel & reserved \\\hline
           & initial stack & \\\hline
           & room for stack growth & 4 pages \\\hline
0x60000000 & shared libraries & \\\hline
{\tt brk}  & unused & \\\hline
           & malloc memory & \\\hline
{\tt end\_data} & uninitialized data & \\\hline
{\tt end\_code} & initialized data & \\\hline
0x00000000 & text & \\\hline
\end{tabular}

Both the code segment and data segment extend all the way from 0x00 to
3~GB.  Currently the page fault handler {\tt do\_wp\_page} checks to
ensure that a process does not write to its code space.  However, by
catching the {\tt SEGV} signal, it is possible to write to code space,
causing a copy-on-write to occur.  The handler {\tt do\_no\_page}
ensures that any new pages the process acquires belong to either the
executable, a shared library, the stack, or lie within the {\tt brk}
value.

A user process can reset its {\tt brk} value by calling {\tt sbrk()}.
This is what {\tt malloc()} does when it needs to. The text and data
portions are allocated on separate pages unless one chooses the {\tt
-N} compiler option.  Shared library load addresses are currently
taken from the shared image itself. The address is between 1.5 GB and
3 GB, except in special cases.

\noindent {\bf User process Memory Allocation}\\
\begin{tabular}{|l|c|c|}\hline
                         &  swappable & shareable \\\hline
a few code pages         &  Y         & Y \\\hline
a few data pages         &  Y         & N? \\\hline
stack                    &  Y         & N \\\hline
{\tt pg\_dir}            &  N         & N \\\hline
code/data {\tt page\_table}&N         & N \\\hline
stack {\tt page\_table}  &  N         & N \\\hline
{\tt task\_struct}       &  N         & N \\\hline
{\tt kernel\_stack\_frame}& N         & N \\\hline
shlib {\tt page\_table}  &  N         & N \\\hline
a few shlib pages        &  Y         & Y? \\\hline
\end{tabular}
\medskip

{\bf [What do the question marks mean?  Do they mean that they might
go either way, or that you are not sure?]} The stack, shlibs and data
are too far removed from each other to be spanned by one page table.
All kernel {\tt page\_table}s are shared by all processes so they are
not in the list.  Only dirty pages are swapped.  Clean pages are
stolen so the process can read them back in from the executable if it
likes.  Mostly only clean pages are shared.  A dirty page ends up
shared across a fork until the parent or child chooses to write to it
again.


\section{Memory Management data in the process table}

Here is a summary of some of the data kept in the process table which
is used for memory managment:  {\bf [These should be much better
documented.  We need more details.]}

\begin{itemize}
\item Process memory limits:  {\tt ulong start\_code, end\_code,
end\_data, brk, start\_stack;}

\item Page fault counting: {\tt ulong min\_flt, maj\_flt, cmin\_flt, cmaj\_flt}

\item Local descriptor table:
{\tt struct desc\_struct ldt[32]} is the local descriptor table for task.

\item {\tt rss}: number of resident pages.

\item {\tt swappable}: if 0, then process's pages will not be swapped.

\item {\tt kernel\_stack\_page}: pointer to page allocated in fork.

\item {\tt saved\_kernel\_stack}: V86 mode stuff.


\item {\tt struct tss}
\begin{itemize}
 \item Stack segments
  \begin{description}
  \item [{\tt esp0}] kernel stack pointer ({\tt kernel\_stack\_page})
  \item [{\tt ss0}] kernel stack segment (0x10)
  \item [{\tt esp1}] $=$ {\tt ss1} $=$ {\tt esp2} $=$ {\tt ss2} $=$ 0\\
                     unused privelege levels.
  \end{description}

 \item Segment selectors: {\tt ds $=$ es $=$ fs $=$ gs $=$ ss $=$
0x17}, {\tt cs $=$ 0x0f}\\
All point to segments in the current {\tt ldt[]}.

 \item {\tt cr3}: points to the page directory for this process.

 \item {\tt ldt}: {\tt \_LDT(n)} selector for current task's LDT.
 \end{itemize}
\end{itemize}



\section{Memory initialization}

In {\tt start\_kernel()} (main.c) there are 3 variables related to memory 
initialization: 

\begin{tabular}{ll}
{\tt memory\_start} & starts out at 1~MB. Updated by device
initialization. \\
{\tt memory\_end} & end of physical memory: 8~MB, 16~MB, or whatever. \\
{\tt low\_memory\_start} & end of the kernel code and data that is
loaded initially. \\
\end{tabular}

Each device init typically takes {\tt memory\_start} and returns an updated
value if it allocates space at {\tt memory\_start} (by simply grabbing
it).  {\tt paging\_init()} initializes the page tables in the {\tt
swapper\_pg\_dir} (starting at 0xc0000000) to cover all of the
physical memory from {\tt memory\_start} to {\tt memory\_end}.
Actually the first 4~MB is done in {\tt startup\_32} (head.S).  {\tt
memory\_start} is incremented if any new {\tt page\_table}s are added.
The first page is zeroed to trap null pointer references in the
kernel.

In {\tt sched\_init()} the {\tt ldt} and {\tt tss} descriptors for
{\tt task[0]} are set in the GDT, and loaded into the TR and LDTR (the
only time it's done explicitly).  A trap gate (0x80) is set up for
{\tt system\_call()}. The nested task flag is turned off in
preparation for entering user mode. The timer is turned on.  The {\tt
task\_struct} for {\tt task[0]} appears in its entirety in {\tt
<linux/sched.h>}.

{\tt mem\_map} is then constructed by {\tt mem\_init()} to reflect the
current usage of physical pages. This is the state reflected in the
physical memory map of the previous section.

Then \linux\ moves into user mode with an {\tt iret} after pushing the
current {\tt ss}, {\tt esp}, etc. Of course the user segments for {\tt
task[0]} are mapped right over the kernel segments so execution
continues exactly where it left off.

\noindent{\tt task[0]}:\begin{itemize}
\item {\tt pg\_dir} $=$ {\tt swapper\_pg\_dir} which means the the only
      addresses mapped are in the range 3~GB to 3~GB + {\tt high\_memory}.
\item {\tt LDT[1]} $=$ user code, base=0xc0000000, size = 640K
\item {\tt LDT[2]} $=$ user data, base=0xc0000000, size = 640K
\end{itemize}

The first {\tt exec()} sets the LDT entries for {\tt task[1]} to the
user values of
base = 0x0, limit = TASK\_SIZE = 0xc0000000. Thereafter, no process 
sees the kernel segments while in user mode.

\subsection{Processes and the Memory Manager}

\noindent Memory-related work done by {\tt fork()}:
\begin{itemize}
\item Memory allocation
  \begin{itemize}
  \item 1 page for the {\tt task\_struct}.
  \item 1 page for the kernel stack.
  \item 1 for the {\tt pg\_dir} and some for {\tt pg\_table}s
        (copy\_page\_tables)
  \end{itemize}

\item Other changes
  \begin{itemize}
  \item {\tt ss0} set to kernel stack segment (0x10) to be sure?
  \item {\tt esp0} set to top of the newly allocated {\tt kernel\_stack\_page}
  \item {\tt cr3} set by {\tt copy\_page\_tables()} to point to newly
        allocated page directory.
  \item {\tt ldt = \_LDT(task\_nr)} creates new ldt descriptor.
  \item descriptors set in gdt for new tss and {\tt ldt[]}.
  \item The remaining registers are inherited from parent. 
  \end{itemize}
\end{itemize}
The processes end up sharing their code and data segments (although
they have separate local desctriptor tables, the entries point 
to the same segments). The  stack and data pages will be copied when 
the parent or child writes to them (copy-on-write).

\noindent Memory-related work done by {\tt exec()}:
\begin{itemize}
\item memory allocation
  \begin{itemize}
  \item 1 page for exec header entire file for omagic
  \item 1 page or more for stack (MAX\_ARG\_PAGES)
  \end{itemize}

\item {\tt clear\_page\_tables()} used to remove old pages.

\item {\tt change\_ldt()} sets the descriptors in the new {\tt LDT[]}
\item {\tt ldt[1]} = code base=0x00, limit=TASK\_SIZE
\item {\tt ldt[2]} = data base=0x00, limit=TASK\_SIZE\\
      These segments are DPL=3, P=1, S=1, G=1. type=a (code) or 2 (data)

\item  Up to {\tt MAX\_ARG\_PAGES} dirty pages of argv and envp are
  allocated and stashed at the top of the data segment for the newly
  created user stack.  

\item Set the instruction pointer of the caller {\tt eip = ex.a\_entry}
\item Set the stack pointer of the caller to the stack just created
      (esp = stack pointer)
      These will be popped off the stack when the caller resumes. 

\item update memory limits\\
    {\tt end\_code = ex.a\_text}\\
    {\tt end\_data = end\_code + ex.a\_data}\\
    {\tt brk = end\_data + ex.a\_bss}
\end{itemize}

Interrupts and traps are handled within the context of the current task.
In particular, the page directory of the current process is used in
address translation. The segments, however, are kernel segments so that all 
linear addresses point into kernel memory.  For example, assume a user
process invokes a system call and the kernel wants to access a
variable at address 0x01.  The linear address is 0xc0000001 (using
kernel segments) and the physical address is 0x01.  The later is
because the process' page directory maps this range exactly as {\tt
page\_pg\_dir}.

The kernel space (0xc0000000 + {\tt high\_memory}) is mapped by the
kernel page tables which are themselves part of the RESERVED memory.
They are therefore shared by all processes. During a fork {\tt
copy\_page\_tables()} treats RESERVED page tables differently. It sets
pointers in the process page directories to point to kernel page
tables and does not actually allocate new page tables as it does
normally. As an example the {\tt kernel\_stack\_page} (which sits
somewhere in the kernel space) does not need an associated {\tt
page\_table} allocated in the process' {\tt pg\_dir} to map it.

The interrupt instruction sets the stack pointer and stack segment from 
the privilege 0 values saved in the tss of the current task.
Note that the kernel stack is a really fragmented object~--- it's not
a single object, but rather a bunch of stack frames each allocated when a
process is created, and released when it exits. The kernel stack should
never grow so rapidly within a process context that it extends below the
current frame.

\section{Acquiring and Freeing Memory: Paging Policy}

When any kernel routine wants memory it ends up calling {\tt
get\_free\_page()}. This is at a lower level than {\tt kmalloc()} (in
fact {\tt kmalloc()} uses {\tt get\_free\_page()} when it needs more
memory).

{\tt get\_free\_page()} takes one parameter, a priority. Possible
values are {\tt GFP\_BUFFER}, {\tt GFP\_KERNEL}, and {\tt
GFP\_ATOMIC}.  It takes a page off of the {\tt free\_page\_list},
updates {\tt mem\_map}, zeroes the page and returns the physical
address of the page (note that {\tt kmalloc()} returns a physical
address. The logic of the mm depends on the identity map between
logical and physical addresses).

That itself is simple enough. The problem, of course, is that the {\tt
free\_page\_list} may be empty. If you did not request an atomic
operation, at this stage, you enter into the realm of page stealing
which we'll go into in a moment.  As a last resort (and for atomic
requests) a page is torn off from the {\tt secondary\_page\_list} (as
you may have guessed, when pages are freed, the {\tt
secondary\_page\_list} gets filled up first).

The actual manipulation of the {\tt page\_list}s and {\tt mem\_map}
occurs in this mysterious macro called {\tt REMOVE\_FROM\_MEM\_QUEUE()}
which you probably never want to look into. Suffice it to say that
interrupts are disabled.  {\bf [I think that this should be explained
here.  It is not {\em that} hard\dots]}

Now back to the page stealing bit. {\tt get\_free\_page()} calls {\tt
try\_to\_free\_page()} which repeatedly calls {\tt shrink\_buffers()}
and {\tt swap\_out()} in that order until it is successful in freeing a
page. The priority is increased on each successive iteration so that
these two routines run through their page stealing loops more often.

Here's one run through {\tt swap\_out()}: 
\begin{itemize}
\item Run through the process table and get a swappable task say {\sl Q.}
\item Find a user page table (not RESERVED) in {\sl Q}'s space.
\item For each {\sl page} in the table {\tt try\_to\_swap\_out( {\sl
      page} )}.
\item Quit when a page is freed.
\end{itemize}
Note that {\tt swap\_out()} (called by {\tt try\_to\_free\_page()})
maintains static variables so it may resume the search where it left
off on the previous call.

\smallskip\noindent{\tt try\_to\_swap\_out()} scans the page tables of
all user processes and enforces the stealing policy:
\begin{enumerate}
\item  Do not fiddle with RESERVED pages.
\item  Age the page if it is marked accessed (1 bit).
\item  Don't tamper with recently acquired pages
       ({\tt last\_free\_pages[]}).
\item  Leave dirty pages with {\tt map\_counts} $>$ 1 alone.
\item  Decrement the {\tt map\_count} of clean pages.
\item  Free clean pages if they are unmapped.
\item  Swap dirty pages with a {\tt map\_count} of 1.
\end{enumerate}

Of these actions, 6 and 7 will stop the process as they result in the
actual freeing of a physical page.  Action 5 results in one of the
processes losing an unshared clean page that was not accessed recently
(decrement {\sl Q}{\tt ->rss}) which is not all that bad, but the
cumulative effects of a few iterations can slow down a process
considerably.  At present, there are 6 iterations, so a page shared by
6 processes can get stolen if it is clean.

Page table entries are updated and the TLB invalidated.  {\bf [Wonder about the
latter. It seems unnecessary since accessed pages aren't offed and there
is a walk through many page tables between iterations \dots may be in case
an interrupt came along and wanted the most recently axed page?]}

The actual work of freeing the page is done by {\tt free\_page()}, the
complement of {\tt get\_free\_page()}. It ignores RESERVED pages,
updates {\tt mem\_map}, then frees the page and updates the {\tt
page\_list}s if it is unmapped. For swapping (in 6 above), {\tt
write\_swap\_page()} gets called and does nothing remarkable from the
memory management perspective.

The details of {\tt shrink\_buffers()} would take us too far afield.
Essentially it looks for free buffers, then writes out dirty buffers,
then goes at busy buffers and calls {\tt free\_page()} when its able
to free all the buffers on a page.

Note that page directories and page tables along with RESERVED pages
do not get swapped, stolen or aged. They are mapped in the process
page directory through reserved page tables. They are freed only on 
exit from the process.


\section{The page fault handlers}

When a process is created via fork, it starts out with a page directory 
and a page or so of the executable. So the page fault handler is the 
source of most of a processes' memory. 

The page fault handler {\tt do\_page\_fault()} retrieves the faulting address
from the register cr2. The error code (retrieved in sys\_call.S)
differentiates user/supervisor access and the reason for the fault~---
write protection or a missing page. The former is handled by {\tt
do\_wp\_page()}
and the latter by {\tt do\_no\_page()}.

If the faulting address is greater than TASK\_SIZE the process receives a 
SIGKILL.  {\bf [Why this check?  This can only happen in kernel mode because
of segment level protection.]}

These routines have some subtleties as they can get called from an
interrupt. You can't assume that it is the `current' task that is executing.

{\tt do\_no\_page()} handles three possible situations:
\begin{enumerate}
\item The page is swapped.
\item The page belongs to the executable or a shared library.
\item The page is missing~--- a data page that has not been allocated.
\end{enumerate}

In all cases {\tt get\_empty\_pgtable()} is called first to ensure the
existence of a page table that covers the faulting address. In case 3
{\tt get\_empty\_page()} is called to provide a page at the required
address and in case of the swapped page, {\tt swap\_in()} is called.

In case 2, the handler calls {\tt share\_page()} to see if the page is
shareable with some other process. If that fails it reads in the page
from the executable or library (It repeats the call to {\tt
share\_page()} in case another process did the same meanwhile). Any
portion of the page beyond the brk value is zeroed.

A page read in from the disk is counted as a major fault ({\tt
maj\_flt}). This happens with a {\tt swap\_in()} or when it is read
from the executable or a library.  Other cases are deemed minor faults
({\tt min\_flt}).

When a shareable page is found, it is write-protected. A process that
writes to a shared page will then have to go through {\tt
do\_wp\_page()} which does the copy-on-write.

\medskip\noindent{\tt do\_wp\_page()} does the following: 
\begin{itemize}
\item send SIGSEGV if any user process is writing to current {\tt code\_space}.
\item If the old page is not shared then just unprotect it.\\
      Else {\tt get\_free\_page()} and {\tt copy\_page()}. The page
      acquires the dirty flag from the old page.  Decrement the map
      count of the old page.
\end{itemize}


\section{Paging}

Paging is swapping on a page basis rather than by entire
processes. We will use swapping here to refer to paging, since \linux\
only pages, and does not swap, and people are more used to the word
``swap'' than ``page.''  Kernel pages are never swapped.  Clean pages
are also not written to swap. They are freed and reloaded when
required. The swapper maintains a single bit of aging info in the
{\tt PAGE\_ACCESSED} bit of the page table entries.  {\bf [What are
the maintainance details?  How is it used?]}

\linux\ supports multiple swap files or devices which may be turned on or
off by the swapon and swapoff system calls. Each swapfile or device is 
described by a {\tt struct swap\_info\_struct} (swap.c).
\begin{screen}\begin{verbatim}
static struct swap_info_struct {
      unsigned long flags;
      struct inode * swap_file;
      unsigned int swap_device;
      unsigned char * swap_map;
      char * swap_lockmap;
      int lowest_bit;
      int highest_bit;
} swap_info[MAX_SWAPFILES];
\end{verbatim}\end{screen}

The flags field ({\tt SWP\_USED} or {\tt SWP\_WRITEOK}) is used to
control access to the swap files. When {\tt SWP\_WRITEOK} is off space
will not be allocated in that file. This is used by swapoff when it
tries to unuse a file.  When swapon adds a new swap file it sets {\tt
SWP\_USED}. A static variable {\tt nr\_swapfiles} stores the number of
currently active swap files.  The fields {\tt lowest\_bit} and {\tt
highest\_bit} bound the free region in the swap file and are used to
speed up the search for free swap space.

The user program mkswap initializes a swap device or file. The first
page contains a signature (`{\tt SWAP-SPACE}') in the last 10 bytes, and
holds a bitmap. Initially 0's in the bitmap signal bad pages. A `1' in
the bitmap means the corresponding page is free. This page is never
allocated so the initialization needs to be done just once.

The syscall {\tt swapon()} is called by the user program swapon typically from 
/etc/rc. A couple of pages of memory are  allocated for {\tt swap\_map} and 
{\tt swap\_lockmap}.

{\tt swap\_map} holds a byte for each page in the swapfile. It is initialized 
from the bitmap to contain a 0 for available pages and 128 for unusable
pages. It is used to  maintain a count of swap requests on each page in
the swap file.  {\tt swap\_lockmap} holds a bit for each page that is
used to ensure mutual exclusion when reading or writing swap files.

When a page of memory is to be swapped out an index to the swap location is 
obtained by a call to {\tt get\_swap\_page()}. This index is then stored in 
bits 1--31 of the page table entry so the swapped page may be located 
by the page fault handler, {\tt do\_no\_page()} when needed.

The upper 7 bits of the index give the swapfile (or device) and the
lower 24 bits give the page number on that device. That makes as many
as 128 swapfiles, each with room for about 64~GB, but the space overhead 
due to the {\tt swap\_map} would be large. Instead the swapfile size
is limited to 16~MB, because the {\tt swap\_map} then takes 1 page.

The function {\tt swap\_duplicate()} is used by {\tt
copy\_page\_tables()} to let a child process inherit swapped pages
during a fork. It just increments the count maintained in {\tt
swap\_map} for that page. Each process will swap in a separate copy of
the page when it accesses it.

{\tt swap\_free()} decrements the count maintained in {\tt swap\_map}.
When the count drops to 0 the page can be reallocated by {\tt
get\_swap\_page()}. It is called each time a swapped page is read into
memory ({\tt swap\_in()}) or when a page is to be discarded ({\tt
free\_one\_table()}, etc.).



\section{80386 Memory Mangament}

A logical address specified in an instruction is first translated
to a linear address by the segmenting hardware. This linear address
is then translated to a physical address by the paging unit.


\subsection{Paging on the 386}

There are two levels of indirection in address translation by the paging
unit. A {\bf page directory} contains pointers to 1024 page tables.
Each {\bf page table} contains pointers to 1024 pages.
The register CR3 contains the physical base address of the page
directory and is stored as part of the TSS in the {\tt task\_struct}
and is therefore loaded on each task switch.

\medskip\noindent
A 32-bit Linear address is divided as follows:\\\smallskip
\begin{tabular}{|c|c|c|}\hline
31 \hspace{1 in} 22 & 21 \hspace{1 in} 12 & 11 \hspace{1 in} 0 \\\hline
DIR & TABLE & OFFSET \\\hline
\end{tabular}\\
\smallskip\noindent
Physical address is then computed (in hardware) as:\\
\begin{tabular}{|r|l|}\hline
CR3 + DIR           & points to the table\_base.\\\hline
table\_base + TABLE & points to the page\_base.\\\hline
physical\_address = & page\_base + OFFSET\\\hline
\end{tabular}

\medskip
Page directories (page tables) are page aligned so the lower 12 bits
are used to store useful information about the page table (page) pointed
to by the entry.

\medskip\noindent
Format for Page directory and Page table entries:\\\smallskip
\begin{tabular}{|c|c|ccccccccc|}\hline
31 \hspace{1 in} 12 & 11 \ 9 & 8 & 7 & 6 & 5 & 4 & 3 &  2  &  1  & 0 \\\hline
ADDRESS             &   OS   & 0 & 0 & D & A & 0 & 0 & U/S & R/W & P \\\hline
\end{tabular}\\
\begin{tabular}{rl}
D &    1 means page is dirty (undefined for page directory entry).\\
R/W &  0 means readonly for user.\\
U/S &  1 means user page.\\
P &    1 means page is present in memory.\\
A &    1 means page has been accessed (set to 0 by aging).\\
OS &   bits can be used for LRU etc, and are defined by the OS.\\
\end{tabular}\\
The corresponding definitions for Linux are in {\tt <linux/mm.h>}.

When a page is swapped, bits 1--31 of the page table entry are used to
mark where a page is stored in swap (bit 0 must be 0).

Paging is enabled by setting the highest bit in CR0. {\bf [in
head.S?]} At each stage of the address translation access permissions
are verified and pages not present in memory and protection violations
result in page faults.  The fault handler (in memory.c) then either
brings in a new page or unwriteprotects a page or does whatever needs
to be done.


\subsection*{Page Fault handling Information}
\begin{itemize}
\item The register CR2 contains the linear address that caused the
      last page fault.
\item Page Fault Error Code (16 bits):\\\smallskip
      \begin{tabular}{|c|l|l|}\hline
      bit & cleared            & set \\\hline\hline
      0   & page not present   & page level protection \\\hline
      1   & fault due to read  & fault due to write \\\hline
      2   & supervisor mode    & user mode \\\hline
      \end{tabular}\\\smallskip
      The rest are undefined. These are extracted in sys\_call.S.
\end{itemize}

The Translation Lookaside Buffer (TLB) is a hardware cache for
physical addresses of the most recently used virtual addresses. When a
virtual address is translated the 386 first looks in the TLB to see if
the information it needs is available. If not, it has to make a couple
of memory references to get at the page directory and then the page
table before it can actually get at the page. Three physical memory
references for address translation for every logical memory reference
would kill the system, hence the TLB.

The TLB is flushed if CR3 loaded or by task switch that changes CR0.
It is explicitly flushed in Linux by calling {\tt invalidate()} which
just reloads CR3.

\subsection{Segments in the 80386}

Segment registers are used in address translation to generate a linear
address from a logical (virtual) address.
\begin{screen}{\tt linear\_address = segment\_base + logical\_address}
\end{screen}
The linear address is then translated into a physical address by the 
paging hardware.

Each segment in the system is described by a 8 byte segment descriptor 
which contains all pertinent information (base, limit, type, privilege).

\medskip\noindent
{\bf The segments are:}\\\medskip
{\bf Regular segments}
 \begin{itemize} \item code and data segments \end{itemize}
{\bf System segments}\\
 \begin{itemize}
 \item (TSS) task state segments
 \item (LDT) local descriptor tables
 \end{itemize}

\medskip\noindent Characteristics of system segments:
\begin{itemize}
\item System segments are task specific.
\item  There is a Task State Segment (TSS) associated with each task
in the system. It contains the {\tt tss\_struct} (sched.h). The size
of the segment is that of the {\tt tss\_struct} excluding the {\tt
i387\_union} (232 bytes).  It contains all the information necessary
to restart the task.

\item The LDT's contain regular segment descriptors that are private
to a task.  In \linux\ there is one LDT per task.  There is room for
32 descriptors in the linux {\tt task\_struct}. The normal LDT
generated by Linux has a size of 24 bytes, hence room for only 3
entries as above.  Its contents are:
  \begin{itemize}
  \item LDT[0]  Null (mandatory)
  \item LDT[1]  user code segment descriptor.
  \item LDT[2]  user data/stack segment descriptor.
  \end{itemize}
\item The user segments all have base=0x00 so that the linear address
is the same as the logical address. 
\end{itemize}

To keep track of all these segments, the 386 uses a global descriptor
table (GDT) that is setup in memory by the system (located by the GDT
register).  The GDT contains a segment descriptors for each task state
segment, each local descriptor tablet and also regular segments.
The Linux GDT contains just two normal segment entries:
\begin{itemize}
\item GDT[0] is the null descriptor. 
\item GDT[1] is the kernel code segment descriptor.
\item GDT[2] is the kernel data/stack segment descriptor. 
\end{itemize}
The rest of the GDT is filled with TSS and LDT system descriptors:
\begin{itemize}
\item GDT[3]    ???
\item GDT[4] = TSS0, GDT[5] = LDT0, 
\item GDT[6] = TSS1, GDT[7] = LDT1 
\item  \dots etc \dots
\end{itemize}

\noindent {\bf Note LDT[n] != LDTn}
\begin{itemize}
\item LDT[n] = the nth descriptor in the LDT of the current task.
\item LDTn   = a descriptor in the GDT for the LDT of the nth task.
\end{itemize}

At present the GDT has a total of 256 entries or room for as many as 126 tasks.
The kernel segments have base 0xc0000000 which is where the kernel lives in 
the linear view.
Before a segment can be used, the contents of the descriptor for that 
segment must be loaded into the segment register. The 386 has a complex
set of criteria regarding access to segments so you can't simply load
a descriptor into a segment register. Also these segment registers 
have programmer invisible portions. 
The visible portion is what is usually called a segment register: cs,
ds, es, fs, gs, and ss.

The programmer loads one of these registers with a 16-bit value called
a selector. The selector uniquely identifies a segment descriptor
in one of the tables. Access is validated and the corresponding 
descriptor loaded by the hardware. 

Currently \linux\ largely ignores the (overly?) complex segment level 
protection afforded by the 386. It is biased towards the paging hardware
and the associated page level protection. 
The segment level rules that apply to user processes are 
\begin{enumerate}
\item A process cannot directly access the kernel data or code segments 
\item There is always limit checking but given that every user segment 
    goes from 0x00 to 0xc0000000 it is unlikely to apply.  {\bf [This
    has changed, and needs updating, please.]}
\end{enumerate}


\subsection{Selectors in the 80386}

A segment selector is loaded into a segment register (cs, ds, etc.) to
select one of the regular segments in the system as the one addressed
via that segment register.

\noindent Segment selector Format:\\\smallskip
\begin{tabular}{|c|c|c|}\hline
15 \hspace{1.5 in} 3 & 2 1 & 0 \\\hline
index                & TI  &  RPL \\\hline
\end{tabular}
\begin{description}
\item [TI] Table indicator: \\
0 means selector indexes into GDT\\
1 means selector indexes into LDT
\item [RPL] Privelege level.  \linux\ uses only two privelege levels.\\
0 means kernel\\
3 means user
\end{description}

\noindent
{\bf Examples:}
\begin{description}
\item [Kernel code segment]\mbox{}\\
TI=0, index=1, RPL=0, therefore selector = 0x08 (GDT[1])
\item [User data segment]\mbox{}\\
TI=1, index=2, RPL=3, therefore selector = 0x17 (LDT[2])
\end{description}

\noindent
Selectors used in \linux:\\\smallskip
\begin{tabular}{|c|l|l|l|l|l|}\hline
{\bf TI} & {\bf index} & {\bf RPL} & {\bf selector} & {\bf segment} & \\\hline
0  & 1     & 0   & 0x08     & kernel code       &  GDT[1] \\\hline
0  & 2     & 0   & 0x10     & kernel data/stack &  GDT[2] \\\hline
0  & 3     & 0   & ???      & ???               &  GDT[3] \\\hline
1  & 1     & 3   & 0x0F     & user code         &  LDT[1] \\\hline
1  & 2     & 3   & 0x17     & user data/stack   &  LDT[2] \\\hline
\end{tabular}\\\smallskip
Selectors for system segments are not to be loaded directly into
segment registers. Instead one must load the TR or LDTR.

\medskip\noindent
On entry into syscall:
\begin{itemize}
\item ds and es are set to the kernel data segment (0x10)
\item fs is set to the user data segment (0x17) and is used to access
data pointed to by arguments to the system call.
\item The stack segment and pointer are  automatically set to ss0 and
esp0 by the interrupt and the old values restored when the syscall returns.
\end{itemize}


\subsection{Segment descriptors}

There is a segment descriptor used to describe each segment in the system.
There are regular descriptors and system descriptors.
Here's a descriptor in all its glory. The strange format is essentally
to maintain compatibility with the 286. Note that it takes 8 bytes.\\\smallskip
{\small\begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|}\hline
63--54   & 55 & 54 & 53 & 52 & 51--48 & 47 & 46 & 45 & 44--40 & 39--16 &
    15--0 \\\hline
Base & G & D & R & U & Limit & P & DPL & S & TYPE & Segment Base &
    Segment Limit \\
31--24 & & & & & 19--16 & & & & & 23--0 & 15--0 \\\hline
\end{tabular}}

\medskip\noindent
{\bf Explanation:}\\
\begin{tabular}{rl}
{\bf R}   & reserved (0)\\
{\bf DPL} & 0 means kernel, 3 means user\\
{\bf G}   & 1 means 4K granularity (Always set in \linux)\\
{\bf D}   & 1 means default operand size 32bits	\\
{\bf U}   & programmer definable\\
{\bf P}   & 1 means present in physical memory\\
{\bf S}   & 0 means system segment, 1 means normal code or data segment.\\
{\bf Type} & There are many possibilities. Interpreted differently for
       system and normal descriptors.\\
\end{tabular}

\medskip\noindent
{\bf Linux system descriptors:}\\
TSS:  P=1, DPL=0, S=0, type=9, limit = 231 room for 1 {\tt tss\_struct}.\\
LDT:  P=1, DPL=0, S=0, type=2, limit = 23  room for 3 segment descriptors.\\
The base is set during {\tt fork()}. There is a TSS and LDT for each task.

\medskip\noindent
{\bf Linux regular kernel descriptors:} (head.S)\\
code: P=1, DPL=0, S=1, G=1, D=1, type=a, base=0xc0000000, limit=0x3ffff\\
data: P=1, DPL=0, S=1, G=1, D=1, type=2, base=0xc0000000, limit=0x3ffff\\

\medskip\noindent
{\bf The LDT for task[0] contains:} (sched.h)\\
code: P=1, DPL=3, S=1, G=1, D=1, type=a, base=0xc0000000, limit=0x9f \\
data: P=1, DPL=3, S=1, G=1, D=1, type=2, base=0xc0000000, limit=0x9f \\

\medskip\noindent
{\bf The default LDT for the remaining tasks:} ({\tt exec()})\\
code: P=1, DPL=3, S=1, G=1, D=1, type=a, base=0, limit= 0xbffff\\
data: P=1, DPL=3, S=1, G=1, D=1, type=2, base=0, limit= 0xbffff\\

\medskip
The size of the kernel segments is 0x40000 pages (4KB pages since G=1
= 1 Gigabyte.
The type implies that the permissions on the code segment is read-exec
and on the data segment is read-write.

\medskip\noindent
{\bf Registers associated with segmentation.}
Format of segment register: (Only the selector is programmer
visible)\\\smallskip
\begin{tabular}{|c|c|c|c|}\hline
 16-bit  &      32-bit        &    32-bit     &            \\\hline
selector & physical base addr & segment limit & attributes \\\hline
\end{tabular}\\\smallskip
The invisible portion of the segment register is more conveniently
viewed in terms of the format used in the descriptor table entries 
that the programmer sets up.
The descriptor tables have registers associated with them that are used to
locate them in memory. The GDTR (and IDTR) are initialized at startup once
the tables are defined. The LDTR is loaded on each task switch.

\medskip\noindent
{\bf Format of GDTR (and IDTR):}\\
\begin{tabular}{|c|c|}\hline
     32-bits     &  16-bits    \\\hline
Linear base addr & table limit \\\hline
\end{tabular}

\medskip
The TR and LDTR are loaded from the GDT and so have the format of the
other segment registers.  The task register (TR) contains the
descriptor for the currently executing task's TSS. The execution of a
jump to a TSS selector causes the state to be saved in the old TSS,
the TR is loaded with the new descriptor and the registers are
restored from the new TSS.  This is the process used by schedule to
switch to various user tasks.  Note that the field {\tt
tss\_struct.ldt} contains a selector for the LDT of that task. It is
used to load the LDTR. (sched.h)



\subsection{Macros used in setting up descriptors}

Some assembler macros are defined in sched.h and system.h to ease access
and setting of descriptors. Each TSS entry and LDT entry takes 8 bytes. 

\medskip\noindent
{\bf Manipulating GDT system descriptors:}
\begin{itemize}
\item {\tt \_TSS(n)},\\
      {\tt \_LDT(n)} These provide the index into the GDT for the n'th task.
\item {\tt \_LDT(n)} is stored in the the ldt field of the {\tt
      tss\_struct} by fork.

\item {\tt \_set\_tssldt\_desc(n, addr, limit, type)}\\
  {\tt ulong *n} points to the GDT entry to set (see fork.c).
  The segment base (TSS or LDT) is set to 0xc0000000 + {\tt addr}.
  Specific instances of the above are, where ltype refers to the byte
  containing P, DPL, S and type:
  \begin{dispitems}
  \item [{\tt set\_ldt\_desc(n, addr)} ltype = 0x82]
        P=1, DPL=0, S=0, type=2 means LDT entry.
        limit = 23 => room for 3 segment descriptors.
  \item [{\tt set\_tss\_desc(n, addr)} ltype = 0x89]
        P=1, DPL=0, S=0, type = 9, means available 80386 TSS 
        limit = 231 room for 1 tss\_struct.
  \end{dispitems}

\item {\tt load\_TR(n)},\\
      {\tt load\_ldt(n)} load descriptors for task
number n into the task register and ldt register. 
\item {\tt ulong get\_base (struct desc\_struct ldt)} gets the base
      from a descriptor. 
\item {\tt ulong get\_limit (ulong segment)} 
      gets the limit (size) from a segment selector. 
      Returns the size of the segment in bytes.
\item {\tt set\_base(struct desc\_struct ldt, ulong base)},\\
      {\tt set\_limit(struct desc\_struct ldt, ulong limit)}\\
      Will set the base and limit for descriptors (4K granular segments).
      The limit here is actually the size in bytes of the segment.

\item {\tt \_set\_seg\_desc(gate\_addr, type, dpl, base, limit)}\\
  Default values 0x00408000 => D=1, P=1, G=0\\
  Present, operation size is 32 bit and max size is 1M.\\
  {\tt gate\_addr} must be a {\tt (ulong *)}\\
\end{itemize}
