############################################################
#        Coding system					   #
############################################################

1. NOTATION:

coding-system: A method for encoding several character-sets.
  Represented as an Emacs Lisp object of type vector.

2. Coding-system

You specify a coding-system object for each of file I/O,
process I/O, output to the terminal, and input from the
keyboard.  Each coding-system is defined by calling
make-coding-system:
--- mule.el ---------------------------------------------
(defun make-coding-system (name type mnemonic doc &optional crlf flags)
  "Register symbol NAME as a coding-system of TYPE, MNEMONIC, DOC, CRLF, FLAGS.
 TYPE is information for encoding or decoding.  If it is one of below,
	nil: no conversion, t: automatic conversion,
	0:Internal, 1:Shift-JIS, 2:ISO2022, 3:Big5.
  the system provides appropriate code conversion facility.
 MNEMONIC: a character to be displayed on mode-line for this coding-system,
 DOC: a describing documents for the coding-system,
 CRLF (option): if non-nil, CRLF <-> LF conversion is done on I/O.
 FLAGS (option): more precise information about the coding-system,
If TYPE is 2 (ISO2022), FLAGS should be a list of:
 LC-G0, LC-G1, LC-G2, LC-G3:
	Leading character of charset initially designated to G? graphic set,
	nil means G? is not designated initially,
	lc-invalid means G? can never be designated to,
	if (- leading-char) is specified, it is designated on output,
 SHORT: non-nil - allow such as \"ESC $ B\", nil - always \"ESC $ \( B\",
 ASCII-EOL: non-nil - designate ASCII to g0 at each end of line on output,
 ASCII-CNTL: non-nil - designate ASCII to g0 before TAB and SPACE on output.
 SEVEN: non-nil - use 7-bit environment on output.
 LOCK-SHIFT: non-nil - use locking-shift (SO/SI) instead of single-shift
	or designation by escape sequence.
 USE-ROMAN: non-nil - write ASCII as JIS0201-1976-Roman.
 USE-OLDJIS: non-nil - write JIS0208-1983 as JIS0208-1976.
If TYPE is 3 (Big5), FLAGS T means Big5-ETen, NIL means Big5-HKU."
------------------------------------------------------------

As an example of its usage, mule.el contains the following
code:
--- mule.el --------------------------------------------------
;; Definitions of predefined coding-systems

(make-coding-system
 '*noconv* nil
 ?= "No conversion.")

(make-coding-system
 '*autoconv* t
 ?+ "Automatic conversion.")

(make-coding-system
 '*internal* 0
 ?= "Internal coding-system used in an Mule buffer.")

(make-coding-system
 '*sjis* 1
 ?S "Coding-system of Shift-JIS used in Japan.")

(make-coding-system
 '*sjis-dos* 1
 ?s "Coding-system of Shift-JIS with CRLF at eol."
 'crlf)

(make-coding-system
 '*junet* 2
 ?J "Coding-system used for communication with mail and news in Japan."
 nil
 (list lc-ascii lc-invalid lc-invalid lc-invalid
       'short 'ascii-eol 'ascii-cntl 'seven))

(make-coding-system
 '*oldjis* 2
 ?J "Coding-system used for old jis terminal."
 nil
 (list lc-ascii lc-invalid lc-invalid lc-invalid
       'short 'ascii-eol 'ascii-cntl 'seven nil 'use-roman 'use-oldjis))

(make-coding-system
 '*ctext* 2
 ?X "Coding-system used in X as Compound Text Encoding."
 nil
 (list lc-ascii lc-ltn1 lc-invalid lc-invalid
       nil 'ascii-eol))

(make-coding-system
 '*euc-japan* 2
 ?E "Coding-system of Japanese EUC (Extended Unix Code)."
 nil
 (list lc-ascii lc-jp lc-kana lc-jp2
       'short 'ascii-eol 'ascii-cntl))

(make-coding-system
 '*euc-korea* 2
 ?K "Coding-system of Korean EUC (Extended Unix Code)."
 nil
 (list lc-ascii lc-kr lc-invalid lc-invalid
       nil 'ascii-eol 'ascii-cntl))

(make-coding-system
 '*iso-2022-kr* 2
 ?k "Coding-System used for communication with mail in Korea."
 nil
 (list lc-ascii (- lc-kr) lc-invalid lc-invalid
       nil 'ascii-eol 'ascii-cntl 'seven 'lock-shift))
(setq *korean-mail* '*korean-mail*)
(put '*korean-mail* 'coding-system (get '*iso-2022-kr* 'coding-system))

(make-coding-system
 '*euc-china* 2
 ?C "Coding-system of Chinese EUC (Extended Unix Code)."
 nil
 (list lc-ascii lc-cn lc-invalid lc-invalid
       nil 'ascii-eol 'ascii-cntl))

(make-coding-system
 '*iso-2022-ss2-8* 2
 ?I "ISO-2022 coding system using SS2 for 96-charset in 8-bit code."
 nil
 (list lc-ascii lc-invalid nil lc-invalid
       nil 'ascii-eol 'ascii-cntl))

(make-coding-system
 '*iso-2022-ss2-7* 2
 ?I "ISO-2022 coding system using SS2 for 96-charset in 7-bit code."
 nil
 (list lc-ascii lc-invalid nil lc-invalid
       nil 'ascii-eol 'ascii-cntl 'seven))

(make-coding-system
 '*iso-2022-lock* 2
 ?i "ISO-2022 coding system using Locking-Shift for 96-charset."
 nil
 (list lc-ascii nil lc-invalid lc-invalid
       nil 'ascii-eol 'ascii-cntl 'seven))

(make-coding-system
 '*big5-eten* 3
 ?B "Coding-system of BIG5-ETen."
 nil t)

(make-coding-system
 '*big5-hku* 3
 ?B "Coding-system of BIG5-HKU."
 nil nil)

(make-coding-system
 '*big5-eten-dos* 3
 ?b "Coding-system of BIG5-ETen with CRLF at eol."
 'crlf t)

(make-coding-system
 '*big5-hku-dos* 3
 ?b "Coding-system of BIG5-HKU with CRLF at eol."
 'crlf nil)
------------------------------------------------------------
At present, there is no difference among *noconv*,
*autoconv*, and *internal* on output, and no difference
between *noconv* and *internal* on input.
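
The FLAGS lists in the definitions above are positional, so
it can help to see each value next to its name.  The sketch
below is not part of mule.el: the names my-iso2022-flag-names
and my-label-iso2022-flags are hypothetical, and plain
symbols stand in for leading-char values such as lc-ascii,
which exist only inside Mule.

```elisp
;; Hypothetical helper for reading a TYPE 2 (ISO2022) FLAGS list.
;; The order follows the make-coding-system docstring.
(defconst my-iso2022-flag-names
  '(lc-g0 lc-g1 lc-g2 lc-g3 short ascii-eol ascii-cntl seven
    lock-shift use-roman use-oldjis)
  "Positional names of the elements of an ISO2022 FLAGS list.")

(defun my-label-iso2022-flags (flags)
  "Return an alist pairing each positional name with its value in FLAGS.
Positions missing from the end of FLAGS are reported as nil."
  (let ((names my-iso2022-flag-names)
        (result nil))
    (while names
      (setq result (cons (cons (car names) (car flags)) result)
            names (cdr names)
            flags (cdr flags)))
    (nreverse result)))

;; The *junet* flags, with symbols standing in for lc-ascii
;; and lc-invalid:
(my-label-iso2022-flags
 '(ascii invalid invalid invalid short ascii-eol ascii-cntl seven))
```

Flags omitted at the end of the list, such as LOCK-SHIFT for
*junet*, simply come back as nil.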

3. Automatic conversion

We have a function for automatic detection of coding-system.
The detection is not very powerful, however: it detects only
the category of a coding-system, which is one of:
  0. INTERNAL -- Internal code used in an Emacs buffer.
  1. SJIS -- Shift-JIS.
  2. JUNET -- ISO2022, all char-sets are invoked only to GL.
  3. CTEXT -- ISO2022, char-sets invoked to GR are one-byte codes.
  4. EUC -- ISO2022, char-sets invoked to GR are two-byte codes.
  5. BIG5 -- Big5.
or ASCII only.  The automatic detection also detects whether
the text has CRLF at each end of line.

If the code detection routine finds one of the above
categories, it returns the coding-system given by the
corresponding variable:
  0. *internal-code-category*
  1. *sjis-code-category*
  2. *junet-code-category*
  3. *ctext-code-category*
  4. *euc-code-category*
  5. *big5-code-category*
or nil if ASCII only.

These are pre-defined as follows:
--- mule.el --------------------------------------------------
(setq *internal-code-category* '(*internal* . *sjis-dos*)
      *sjis-code-category* '(*sjis* . *sjis-dos*)
      *junet-code-category* '(*junet* . *sjis-dos*)
      *euc-code-category* '(*euc-japan* . *sjis-dos*)
      *ctext-code-category* '(*ctext* . *sjis-dos*)
      *big5-code-category* '(*big5-hku* . *big5-hku-dos*))
------------------------------------------------------------
The car of each pair is the coding-system for text with LF
at eol, and the cdr is the coding-system for text with CRLF
at eol.
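
Choosing between the two halves of a category pair can be
sketched as follows.  The function my-category-coding-system
and the my-... variables are hypothetical names used only
for illustration.  They copy two of the pairs shown above so
that the sketch runs without Mule loaded, and they are not
the routine Mule itself uses.

```elisp
;; Hypothetical helper, not part of mule.el: pick a coding-system
;; from a category pair of the form (LF-CODING . CRLF-CODING).
(defun my-category-coding-system (category-pair crlf-p)
  "Return the coding-system in CATEGORY-PAIR for the detected eol style.
CRLF-P non-nil means the text was found to have CRLF at eol."
  (if crlf-p
      (cdr category-pair)
    (car category-pair)))

;; Local copies of two of the predefined pairs.
(defvar my-sjis-code-category '(*sjis* . *sjis-dos*))
(defvar my-big5-code-category '(*big5-hku* . *big5-hku-dos*))

(my-category-coding-system my-sjis-code-category nil) ; => *sjis*
(my-category-coding-system my-big5-code-category t)   ; => *big5-hku-dos*
```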

If the routine cannot narrow the result down to a single
category, it returns the coding-system of the category with
the highest priority according to the variable
code-priority.  If the automatic detection finds only ASCII
text but with CRLF at eol, it returns *crlf-code-category*,
which is predefined as follows.
--- mule.el --------------------------------------------------
(setq *crlf-code-category* '(nil . *sjis-dos*))
------------------------------------------------------------

The automatic detection takes into account the user-defined
priority of each category.
------------------------------------------------------------
code-priority:
Documentation:
List of category symbols of coding-system:
 *internal-code-category*: INTERNAL, *sjis-code-category*: Shift-JIS,
 *junet-code-category*: ISO2022(JUNET), *euc-code-category*: ISO2022(EUC),
 *ctext-code-category*: ISO2022(CTEXT), *big5-code-category*: BIG5.
This priority list is used while detecting coding-system.
------------------------------------------------------------
The value of code-priority is predefined in each
language-specific file (e.g. japanese.el, chinese.el...).
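
The way a priority list can settle an ambiguous detection
may be sketched as below.  This is a minimal illustration
and not Mule's detection code: my-pick-category and
my-code-priority are hypothetical names, and the order in
my-code-priority is made up for the example rather than
taken from any language file.

```elisp
;; Hypothetical sketch: resolve several candidate categories by
;; walking a priority list from highest to lowest priority.
(defun my-pick-category (candidates priority)
  "Return the first element of PRIORITY that is also in CANDIDATES.
Return nil if no element of PRIORITY is in CANDIDATES."
  (let ((result nil))
    (while (and priority (not result))
      (if (memq (car priority) candidates)
          (setq result (car priority)))
      (setq priority (cdr priority)))
    result))

;; Illustrative order only; the real value is set per language.
(defvar my-code-priority
  '(*junet-code-category* *euc-code-category* *sjis-code-category*
    *ctext-code-category* *big5-code-category* *internal-code-category*))

;; Text that could be either Shift-JIS or EUC:
(my-pick-category '(*sjis-code-category* *euc-code-category*)
                  my-code-priority)
;; => *euc-code-category*
```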

############################################################
We also have functions for encoding to and decoding from
Internal-code.  For decoding to Type 2 (ISO2022), the
following restrictions apply:

Locking-Shift:
  Use SI and SO only when decoding with a coding-system
  whose LOCK-SHIFT and SEVEN are both t.
Single-Shift:
  Use SS2 and SS3 (if SEVEN is nil) or ESC N and ESC O (if
  SEVEN is t).
Designation:
  Designate a char-set to G0 or G1 according to its GRAPHIC
  value. (See character.text).
Invocation:
  G0 is always invoked to GL, G1 to GR (but only if SEVEN is
  nil).  G2 and G3 are invoked to GL by Single-Shift of SS2
  and SS3.
Unofficial use of ESC sequence for designation:
  If SEVEN is t, LOCK-SHIFT is nil, and designation to G2
  and G3 is prohibited, we must designate all character
  sets to G0 (and hence invoke them to GL).  To designate a
  96 char-set to G0, we use "ESC , <F>".  For instance, to
  designate ISO8859-1 to G0, we use "ESC , A".
Unofficial use of ESC sequence for composite character:
  To indicate the start and end of a composite character, we
  use ESC 0 (start) and ESC 1 (end).
############################################################
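
The designation sequences mentioned above can be built as in
the following sketch.  The function my-designation-sequence
is a hypothetical name and not a Mule function.  The
intermediate characters for G0 to G3 follow ISO 2022, except
for "," which is the unofficial G0 designation of a 96
char-set described above.  The short form is limited here to
the final characters @, A and B, as in ISO 2022.

```elisp
;; Hypothetical sketch: build an ISO 2022 designation sequence.
(defun my-designation-sequence (graphic-set chars final
                                &optional multibyte short)
  "Return the escape sequence designating a charset to GRAPHIC-SET (0-3).
CHARS is 94 or 96 and FINAL is the final character.  MULTIBYTE
non-nil means a multi-byte charset.  SHORT non-nil allows the short
form \"ESC $ F\" for a multi-byte 94 char-set designated to G0."
  (let* ((intermediate
          ;; 94 char-sets use ( ) * +, 96 char-sets use , - . /
          (aref (if (= chars 94) "()*+" ",-./") graphic-set))
         (use-short (and multibyte short
                         (= chars 94) (= graphic-set 0)
                         (memq final '(?@ ?A ?B)))))
    (concat "\e"
            (if multibyte "$" "")
            (if use-short "" (char-to-string intermediate))
            (char-to-string final))))

(my-designation-sequence 0 96 ?A)       ; ESC , A   (ISO8859-1 to G0)
(my-designation-sequence 0 94 ?B t t)   ; ESC $ B   (short form)
(my-designation-sequence 0 94 ?B t nil) ; ESC $ ( B
```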

4. Setting coding-systems for I/O.

The lowest-level functions for the various kinds of I/O are
redefined so that they can set the appropriate coding-system
for that I/O.  See the documentation of
insert-file-contents, write-region, call-process,
start-process, and open-network-stream.


5. Other related functions (not all are listed)

describe-coding-system
code-detect-region
code-convert-region
code-convert-string
s2e
e2s
define-service-kanji-code
detect-code-category


6. Big5 support

As far as I know, there are several different codes called
Big5.  The best known are Big5-ETen and Big5-HKU-form2.
Since both of them use the code range 0xa140 - 0xfefe (in
each row, columns (second byte) 0x7f - 0xa0 are skipped) and
contain more than 13000 characters, it is impossible to
treat either of them as a single character-set in the
current Mule system.  So, Mule treats them in a quite
irregular manner, as described below:

(1) Mule does not treat them as different character sets,
but as one and the same character set called Big5.
	Caution!! Big5 is a different character set from GB.

(2) Mule divides Big5 into two sub-character-sets:
	0xa140 - 0xc67e (Level 1)
	0xc6a1 - 0xfefe (Level 2)
and allocates two leading-chars lc-big5-1 and lc-big5-2 to
them.
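
The split in (2) can be sketched as a small classifier.  The
function my-big5-level is a hypothetical name and not part
of Mule.  The second-byte bounds 0x40 and 0xfe are inferred
here from the end points of the code range given above.

```elisp
;; Hypothetical sketch: classify a two-byte Big5 code by level.
;; Level 1 corresponds to lc-big5-1 and Level 2 to lc-big5-2.
(defun my-big5-level (code)
  "Return 1 or 2 for the Big5 level of the two-byte CODE, or nil.
CODE is an integer such as #xa440.  nil means CODE lies outside
both ranges or its second byte is in the skipped columns."
  (let ((second (logand code #xff)))
    (cond
     ;; Second byte must be 0x40-0x7e or 0xa1-0xfe.
     ((not (or (and (>= second #x40) (<= second #x7e))
               (and (>= second #xa1) (<= second #xfe))))
      nil)
     ((and (>= code #xa140) (<= code #xc67e)) 1)
     ((and (>= code #xc6a1) (<= code #xfefe)) 2)
     (t nil))))

(my-big5-level #xa440) ; => 1
(my-big5-level #xc6a1) ; => 2
(my-big5-level #xa480) ; => nil (second byte in skipped columns)
```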

(3) Usually, each leading-char (or character-set) has a
unique character category.  But lc-big5-1 and lc-big5-2 have
the same character category, with mnemonic 'b'.  So, the
regular expression "\\cb" matches any Big5 (Level 1 and
Level 2) character.

(4) If you specify an ISO2022-type coding-system on output,
Mule converts Big5 code using the unofficial
final-characters '0' (for Level 1) and '1' (for Level 2).

(5) You can use either ETen or HKU fonts for displaying
Big5 code.  Mule judges which kind of font is in use by
checking whether character C6A1 exists in the font.  If it
exists, the font is HKU; otherwise the font is ETen.