############################################################
#        Multilingual Character Handling		   #
############################################################

1. Type of a character

	'Type n-m' means an original n-byte code represented by
	an m-byte sequence.

  Type 1-1: ASCII characters
  Type 1-2: Characters in one-byte character-sets (e.g. ISO8859-1, Latin-1)
  Type 1-3: Characters in one-byte character-sets of private use
  Type 2-3: Characters in two-byte character-sets (e.g. JISX0208, Japanese)
  Type 2-4: Characters in two-byte character-sets of private use
  Type 3-4: Characters in three-byte character-sets of private use
  Type N:   Composite characters of variable length

We assume that each character-set satisfies the technical
requirements of ISO2022.

2. Representation of a character in buffer and string

  Type 1-1: 1-byte 'C' [C <= 0x7F] (same as previous representation) 
  Type 1-2: 2-byte sequence 'LC1 C1'
	LC1 = leading character for the character-set,
		0x81 <= LC1 <= 0x8F
	C1 = 0x80 | (original byte for the character),
		0xA0 <= C1 <= 0xFF
  Type 1-3: 3-byte sequence 'LCPRV1 LC12 C1'
	LCPRV1 = 0x9A (for one-column sets) or 0x9B (for two-column sets)
	LC12 = extended leading character,
		0xA0 <= LC12 <=0xB7 (if LCPRV1 = 0x9A) -- 24 sets
		0xB8 <= LC12 <=0xBF (if LCPRV1 = 0x9B) -- 8 sets
	C1 = same as above
  Type 2-3: 3-byte sequence 'LC2 C21 C22'
	LC2 = leading character for the character-set,
		0x90 <= LC2 <= 0x99
	C21 = 0x80 | (original first byte for the character),
	C22 = 0x80 | (original second byte for the character),
		0xA0 <= C21,C22 <= 0xFF
  Type 2-4: 4-byte sequence 'LCPRV2 LC22 C21 C22'
	LCPRV2 = 0x9C (for one-column sets) or 0x9D (for two-column sets)
	LC22 = extended leading character,
		0xC0 <= LC22 <=0xC7 (if LCPRV2 = 0x9C) -- 8 sets
		0xC8 <= LC22 <=0xDF (if LCPRV2 = 0x9D) -- 24 sets
	C21, C22 = same as above
  Type 3-4: 4-byte sequence 'LCPRV3 C31 C32 C33'
	LCPRV3 = leading character for the character-set,
		LCPRV3 = 0x9E
	C31 = 0x80 | (original first byte for the character),
		0xA0 <= C31 <= 0xBF
	C32 = 0x80 | (original second byte for the character),
	C33 = 0x80 | (original third byte for the character),
		0xA0 <= C32,C33 <= 0xFF
  Type N: n-byte sequence 'LCCMP LCN1 C11 ... LCN2 C21 ... LCNn Cn1 ...'
	All characters 'LCN1 C11 ... LCN2 C21 ... LCNn Cn1 ...'
	are displayed in the same column.
	LCCMP = 0x80
	LCN1 .. LCNn = leading character + 0x20 (0xA0 for ASCII)

Refer to Section 5 for the predefined leading characters.

Here is an example of text that mixes these types (each 0x??
stands for the actual binary byte).

"Here comes Latin-1 character of n with ~ '0x81 0xF1' and
here comes Japanese Hiragana '0x92 0xA4 0xA2'."
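
The byte layouts of Section 2 can be illustrated with a small C
sketch that builds the buffer representation of a Type 1-2 and a
Type 2-3 character.  The function names encode_type_1_2 and
encode_type_2_3 are hypothetical and are not taken from mule.c;
the leading-character values in the comments (0x81 for Latin-1,
0x92 for JIS X0208) are the ones listed in Section 5.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: store a Type 1-2 character in out[0..1].
   lc is the leading character, c the original byte.
   Returns the number of bytes written. */
size_t encode_type_1_2(unsigned char lc, unsigned char c, unsigned char *out)
{
    assert(lc >= 0x81 && lc <= 0x8F);   /* LC1 range from Section 2 */
    out[0] = lc;
    out[1] = 0x80 | c;                  /* C1 */
    return 2;
}

/* Hypothetical helper: store a Type 2-3 character in out[0..2].
   c1 and c2 are the original first and second bytes. */
size_t encode_type_2_3(unsigned char lc, unsigned char c1, unsigned char c2,
                       unsigned char *out)
{
    assert(lc >= 0x90 && lc <= 0x99);   /* LC2 range from Section 2 */
    out[0] = lc;
    out[1] = 0x80 | c1;                 /* C21 */
    out[2] = 0x80 | c2;                 /* C22 */
    return 3;
}
```

With lc = 0x81 and original byte 0x71 the first function yields
'0x81 0xF1'; with lc = 0x92 and the JIS X0208 code 0x24 0x22 the
second yields '0x92 0xA4 0xA2', matching the example text.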


3. Representation of a character object in Emacs Lisp

Emacs Lisp used to treat character objects as integers with
values less than 256.  We have extended the character object as
follows:

  Type 1-1: C [C <= 0x7F] (same as character code itself)
	0 .. 0x7F
  Type 1-2: ((LC1 & 0x7F) << 8) | C1
	0x01A0 .. 0x0FFF
  Type 1-3: ((LC12 & 0x7F) << 8) | C1
	0x20A0 .. 0x3FFF
  Type 2-3: ((LC2 & 0x7F) << 16) | (C21 << 8) | C22
	0x10A0A0 .. 0x19FFFF
  Type 2-4: ((LC22 & 0x7F) << 16) | (C21 << 8) | C22
	0x40A0A0 .. 0x5FFFFF
  Type 3-4: ((C31 - 0x40) << 16) | (C32 << 8) | C33
	0x60A0A0 .. 0x7FFFFF
  Type N:
	Can't be treated as a character object.

For instance, if '?' is followed by the Type 1-2 character '0x81
0xF1', 497 [= ((0x81 & 0x7F) << 8) | 0xF1] is returned.

Several ranges are left undefined in the above table.  They
are used internally to represent partially defined
characters:
  0x80 .. 0x9F: leading-char only
  0x10A0 .. 0x19FF: LC2 + C21 of Type 2-3
  0x40A0 .. 0x5FFF: LC22 + C21 of Type 2-4
  0x60A0 .. 0x7FFF: LCPRV3 + C31 + C32 of Type 3-4
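
The packing rules for Type 1-2 and Type 2-3 can be checked with
the following C sketch.  The function names are hypothetical; only
the bit operations come from the table in this section.

```c
#include <assert.h>

/* Type 1-2: ((LC1 & 0x7F) << 8) | C1 */
long char_object_1_2(unsigned char lc1, unsigned char c1)
{
    assert(lc1 >= 0x81 && lc1 <= 0x8F && c1 >= 0xA0);
    return ((long)(lc1 & 0x7F) << 8) | c1;
}

/* Type 2-3: ((LC2 & 0x7F) << 16) | (C21 << 8) | C22 */
long char_object_2_3(unsigned char lc2, unsigned char c21, unsigned char c22)
{
    assert(lc2 >= 0x90 && lc2 <= 0x99 && c21 >= 0xA0 && c22 >= 0xA0);
    return ((long)(lc2 & 0x7F) << 16) | ((long)c21 << 8) | c22;
}
```

For the buffer bytes '0x81 0xF1' the first function gives 0x01F1,
and for '0x92 0xA4 0xA2' the second gives 0x12A4A2 (1221794).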

4. Functions

To handle multilingual characters, we extended or added the
following functions:

In editfns.c ...

char-to-string:
Convert arg CHAR to a string containing that character.
If CHAR is a multilingual character, the corresponding multi-byte
string is returned.

Example:
	(char-to-string ?A) => "A"
	(char-to-string ?あ) => "あ"
	(char-to-string 1221794) => "あ"

string-to-char:
Convert arg STRING to a character, the first character of that string.

Example:
	(string-to-char "ABあい") => 65 (== ?A)
	(string-to-char "あい") => 1221794 (== ?あ)

sref:
Return the character in STRING at index INDEX.
INDEX starts at 0.
If INDEX does not point to a character boundary, -1 is returned.

Example:
	(sref "ABあい" 1) => 66 (== ?B)
	(sref "ABあい" 2) => 1221794 (== ?あ)
	(sref "ABあい" 3) => -1 (non character boundary)
	(sref "ABあい" 5) => 1221796 (== ?い)
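
The boundary rule behind these results can be sketched in C as
follows.  This is not the code of Fsref: the names seq_length and
sref_sketch are hypothetical, only Type 1-1, 1-2, and 2-3
characters are handled, and returning -1 for an out-of-range or
truncated index is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Length in bytes of the character that starts with byte b,
   or 0 if b cannot start a character handled by this sketch. */
static size_t seq_length(unsigned char b)
{
    if (b < 0x80) return 1;                 /* Type 1-1 (ASCII) */
    if (b >= 0x81 && b <= 0x8F) return 2;   /* Type 1-2 */
    if (b >= 0x90 && b <= 0x99) return 3;   /* Type 2-3 */
    return 0;   /* non-leading byte, private-use, or composite */
}

/* Return the character object at byte index INDEX of S (LEN bytes),
   or -1 if INDEX is not at the start of a handled character. */
long sref_sketch(const unsigned char *s, size_t len, size_t index)
{
    size_t n, i;
    long c;

    assert(s != NULL);
    if (index >= len) return -1;
    n = seq_length(s[index]);
    if (n == 0 || index + n > len) return -1;
    if (n == 1) return s[index];
    c = s[index] & 0x7F;                    /* leading char & 0x7F */
    for (i = 1; i < n; i++)
        c = (c << 8) | s[index + i];        /* append C1 or C21, C22 */
    return c;
}
```

Applied to the eight bytes of "ABあい" ('A' 'B' 0x92 0xA4 0xA2
0x92 0xA4 0xA4), the sketch reproduces the four results above.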

sset:
Store into STRING at index INDEX the character CHAR.
INDEX should point to a character that occupies the same number
of bytes as CHAR.
If it does not, nil is returned; otherwise CHAR is returned.

Example:
	(setq s "ABあい")
	(sset s 1 ?C) => ?C (s == "ACあい")
	(sset s 2 ?う) => ?う (s == "ACうい")
	(sset s 2 ?A) => ?A (s == "ACA\244\246い")
	(sset s 8 ?A) => nil (out of range)

following-char:
Return the character following point, as a number.
If mc-flag of the current buffer is not nil, the returned character
 may be a multi-byte character.

Example:  If cursor is at 'あ' of buffer "..Aあ..",
	(following-char) => 1221794 (== ?あ)
	(let (mc-flag)
	 (following-char t)) => 146 (== leading char of ?あ)

preceding-char:
Return the character preceding point, as a number.
If mc-flag of the current buffer is not nil, the returned character
 may be a multi-byte character.

Example:  If cursor is at 'A' of buffer "..あA..",
	(preceding-char) => 1221794 (== ?あ)
	(let (mc-flag)
	  (preceding-char t)) => 162 (== last byte of ?あ)

char-after:
First arg, POS, is a number.  Return the character in the current
buffer at position POS.
If POS is out of range, the value is nil.
If mc-flag of the current buffer is not nil, the returned character
 may be a multi-byte character.

Functions 'insert' and 'insert-char' also work correctly with
multilingual characters.

	(insert ?あ) -- inserts "あ" at point.

buffer-substring:
Return the contents of part of the current buffer as a string.
The two arguments specify the start and end, as character numbers.
If mc-flag of the current buffer is non-nil, the region may be widened
 to fall on character boundaries.

Example: If a buffer starts with the contents like "あいう..."
	(buffer-substring 1 2) => "あ"
	(buffer-substring 1 3) => "あ"
	(buffer-substring 2 4) => "あい"

Other functions that operate on a region also widen the range automatically.

subst-char-in-region:
From START to END, replace FROMCHAR with TOCHAR each time it occurs.
If optional arg NOUNDO is non-nil, don't record this change for undo
and don't mark the buffer as really changed.
It works with multilingual characters too, but only if the substitution
doesn't alter the length of the buffer.

Example:
	(subst-char-in-region 1 10 ?a ?b) => possible
	(subst-char-in-region 1 10 ?あ ?い) => possible
	(subst-char-in-region 1 10 ?a ?あ) => impossible

In functions 'message' and 'format', %c works well with
multilingual characters.

	(message "%c" ?あ) -- shows "あ" in echo area.

In mule.c ...

make-character:
Make a multi-byte character from ARG1, ARG2, and ARG3 (optional).
The value currently returned (this may change in the future) is:
 ((ARG1 & 0x7F) << 8) | (ARG2 | 0x80), or
 ((ARG1 & 0x7F) << 16) | ((ARG2 | 0x80) << 8) | (ARG3 | 0x80).

Example:
	(make-character lc-jp ?\244 ?\242) => 1221794 (== ?あ)

char-component:
Return a component of multi-byte character CHAR.
Second arg IDX indicates which component should be returned, as follows.
 0: leading character
 1: first code of the character
 2: second code of the character.
If the character does not have the component, 0 is returned.

Example:
	(char-component ?あ 0) => 146 (== lc-jp)
	(char-component ?あ 1) => 164
	(char-component ?あ 2) => 162
	(char-component ?A 1)  => 0
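
Assuming the character-object layout of Section 3, the
decomposition can be sketched in C as below.  The name
char_component_sketch is hypothetical, and the sketch covers only
Type 1-1, 1-2, and 2-3 characters; for anything else it returns 0.

```c
#include <assert.h>

/* Return component IDX (0: leading character, 1: first code,
   2: second code) of character object CH, or 0 if CH has no
   such component. */
int char_component_sketch(long ch, int idx)
{
    assert(idx >= 0 && idx <= 2);
    if (ch >= 0x01A0 && ch <= 0x0FFF) {             /* Type 1-2 */
        if (idx == 0) return (int)(ch >> 8) | 0x80;
        if (idx == 1) return (int)(ch & 0xFF);
        return 0;                                   /* no second code */
    }
    if (ch >= 0x10A0A0 && ch <= 0x19FFFF) {         /* Type 2-3 */
        if (idx == 0) return (int)(ch >> 16) | 0x80;
        if (idx == 1) return (int)((ch >> 8) & 0xFF);
        return (int)(ch & 0xFF);
    }
    return 0;               /* ASCII or a type not covered here */
}
```

For 1221794 (0x12A4A2) the three components come out as 146, 164,
and 162, in agreement with the example.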

char-leading-char:
Return leading character of CHAR.
If CHAR is not a multi-byte code, 0 is returned.

Example:
	(char-leading-char ?あ) => 146 (== lc-jp)
	(char-leading-char ?A) => 0

char-bytes:
Return number of bytes CHAR will occupy in a buffer.
To ask about a character set rather than a single character,
 pass its leading character as CHAR.

Example:
	(char-bytes ?あ) => 3
	(char-bytes ?A) => 1
	(char-bytes lc-jp) => 3
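
A rough C sketch of this lookup, derived only from the ranges
given in Sections 2 and 3, might look as follows.  The name
char_bytes_sketch is hypothetical, and returning 0 for values
the sketch does not cover is an assumption, not documented
behavior.

```c
#include <assert.h>

/* Number of buffer bytes for character object or leading
   character CH.  The range checks are coarse: the low bytes of a
   character object are not validated. */
int char_bytes_sketch(long ch)
{
    assert(ch >= 0);
    if (ch <= 0x7F) return 1;                         /* Type 1-1 */
    /* Leading characters of the non-private sets. */
    if (ch >= 0x81 && ch <= 0x8F) return 2;           /* LC1 */
    if (ch >= 0x90 && ch <= 0x99) return 3;           /* LC2 */
    /* Character objects, by the ranges of Section 3. */
    if (ch >= 0x01A0 && ch <= 0x0FFF) return 2;       /* Type 1-2 */
    if (ch >= 0x20A0 && ch <= 0x3FFF) return 3;       /* Type 1-3 */
    if (ch >= 0x10A0A0 && ch <= 0x19FFFF) return 3;   /* Type 2-3 */
    if (ch >= 0x40A0A0 && ch <= 0x5FFFFF) return 4;   /* Type 2-4 */
    if (ch >= 0x60A0A0 && ch <= 0x7FFFFF) return 4;   /* Type 3-4 */
    return 0;                       /* not covered by this sketch */
}
```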

char-width:
Return number of columns CHAR will occupy when displayed.
To ask about a character set rather than a single character,
 pass its leading character as CHAR.

Example:
	(char-width ?あ) => 2
	(char-width ?A) => 1
	(char-width lc-jp) => 2

chars-in-string:
Return number of characters in STRING.
Each multilingual character is also counted as one.

Example:
	(chars-in-string "ABあい") => 4
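
In the representation of Section 2, every byte after the first
byte of a character is 0xA0 or greater, so characters can be
counted by counting the bytes below 0xA0.  The C sketch below
relies on that observation; the name chars_in_string_sketch is
hypothetical, and the string is assumed to be well formed and to
contain no composite (Type N) characters.

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in the LEN bytes at S by counting the
   bytes that can start a character (ASCII or leading character). */
size_t chars_in_string_sketch(const unsigned char *s, size_t len)
{
    size_t i, count = 0;

    assert(s != NULL || len == 0);
    for (i = 0; i < len; i++)
        if (s[i] < 0xA0)
            count++;
    return count;
}
```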

char-boundary-p:
Return a non-nil value if POS is at a character boundary.
The value is:
 0: if POS is at an ASCII character or end of range,
 1: if POS is at a leading char of a 2-byte character,
 2: if POS is at a leading char of a 3-byte character.
If POS is out of range or not at a character boundary, nil is returned.


5. Character-set

character-set: such as ASCII, right half of ISO8859-1, JIS X0208, ...
  A character-set is identified by the leading character
uniquely assigned to it.

Each character-set is characterized by the following attributes:
  1. byte length of code: 1-byte or 2-byte,
	ISO8859-1, Right half of JISX0201 (Japanese Katakana) -- 1-byte
	GB2312-1980 (Chinese), JISX0208 (Japanese) -- 2-byte
  2. columns occupied on a screen: 1-column or 2-column,
	ISO8859-1, Right half of JISX0201 (Japanese Katakana) -- 1-column
	GB2312-1980 (Chinese), JISX0208 (Japanese) -- 2-column
  3. type: 94-char-set, 96-char-set, 94x94-char-set, or 96x96 char-set,
  4. graphic set: GL or GR,
  5. final character: one of '0' thru '~',
  6. displaying direction: Left-to-right or Right-to-left
  7. leading character: assigned by the system, one per character-set.

Items 3 thru 5 follow the notation of ISO2022.

Character-sets are defined by calling the function 'new-character-set'.

--- mule.c ---------------------------------------------------------
DEFUN ("new-character-set", Fnew_character_set, Snew_character_set, 8, MANY, 0,
  "Define new character set of LEADING-CHAR (1st arg).\n\
Rest of args are:\n\
 BYTE: 1, 2, or 3\n\
 COLUMNS: 1 or 2\n\
 TYPE: 0 (94 chars), 1 (96 chars), 2 (94x94 chars), or 3 (96x96 chars)\n\
 GRAPHIC: 0 (use g0 on output) or 1 (use g1 on output)\n\
 FINAL: final character of ISO escape sequence\n\
 DIRECTION: 0 (left-to-right) or 1 (right-to-left)\n\
 DOC: short description string.\n\
If LEADING-CHAR >= 0xA0, it is regarded as extended leading-char\n\
and BYTE and COLUMNS args are ignored.")
------------------------------------------------------------

The system pre-defines the following character-sets.

--- mule.el ---------------------------------------------------------
(defconst *predefined-character-set*
  (list
   ;; (cons lc '(bytes width type graphic final direction doc))
   ;; (cons lc-ascii '(0 1 0 0 ?B 0 "ASCII")) ;; Predefined in C file
   (cons lc-ltn1 '(1 1 1 1 ?A 0 "ISO8859-1 Latin-1"))
   (cons lc-ltn2 '(1 1 1 1 ?B 0 "ISO8859-2 Latin-2"))
   (cons lc-ltn3 '(1 1 1 1 ?C 0 "ISO8859-3 Latin-3"))
   (cons lc-ltn4 '(1 1 1 1 ?D 0 "ISO8859-4 Latin-4"))
   (cons lc-grk '(1 1 1 1 ?F 0 "ISO8859-7 Greek"))
   (cons lc-arb '(1 1 1 1 ?G 1 "ISO8859-6 Arabic"))
   (cons lc-hbw '(1 1 1 1 ?H 1 "ISO8859-8 Hebrew"))
   (cons lc-kana '(1 1 0 1 ?I 0 "JIS X0201 Japanese Katakana"))
   (cons lc-roman '(1 1 0 0 ?J 0 "JIS X0201 Japanese Roman"))
   (cons lc-crl '(1 1 1 1 ?L 0 "ISO8859-5 Cyrillic"))
   (cons lc-ltn5 '(1 1 1 1 ?M 0 "ISO8859-9 Latin-5"))
   (cons lc-jpold '(2 2 2 0 ?@ 0 "JIS X0208-1976 Japanese Old"))
   (cons lc-cn '(2 2 2 0 ?A 0 "GB 2312-1980 Chinese"))
   (cons lc-jp '(2 2 2 0 ?B 0 "JIS X0208 Japanese"))
   (cons lc-kr '(2 2 2 0 ?C 0 "KS C5601-1987 Korean"))
   (cons lc-jp2 '(2 2 2 0 ?D 0 "JIS X0212 Japanese Supplement"))
   (cons lc-big5-1 '(2 2 2 0 ?0 0 "Big5 Level 1"))
   (cons lc-big5-2 '(2 2 2 0 ?1 0 "Big5 Level 2"))))

(let ((c *predefined-character-set*)
      lc data)
  (while c
    (setq lc (car (car c))
	  data (cdr (car c)))
    (apply 'new-character-set lc data)
    (setq c (cdr c))))
------------------------------------------------------------

In addition, the following private character sets are predefined.
--- mule-config.el -----------------------------------------
;; REGISTRATION OF PRIVATE CHARACTER SETS

;; PinYin-ZhuYin
(setq lc-sisheng (new-private-character-set 1 1 0 0 ?0 0 "PinYin-ZhuYin"))

;; Thai TSCII
(setq lc-thai (new-private-character-set 1 1 0 0 ?1 0 "Thai TSCII"))
------------------------------------------------------------

Values of variables lc-ascii thru lc-big5-2 are also
predefined as follows:

/** The following are for 1-byte characters. **/
lc-ascii = 0x00		/* Omitted in a buffer */
lc-ltn1	= 0x81		/* Right half of ISO 8859-n */
lc-ltn2	= 0x82		/*  */
lc-ltn3	= 0x83		/*  */
lc-ltn4	= 0x84		/*  */
	  0x85		/* for future use */
lc-grk	= 0x86		/*  */
lc-arb	= 0x87		/*  */
lc-hbw	= 0x88		/*  */
lc-kana	= 0x89		/* Right half of JIS X0201-1976 */
lc-roman = 0x8A		/* Left half of JIS X0201-1976 */
	  0x8B		/* for future use */
lc-crl	= 0x8C		/* Right half of ISO 8859-5 */
lc-ltn5	= 0x8D		/*  */
	  0x8E		/* for future use */
	  0x8F		/* for future use */

/** The following are for 2-byte characters. **/
	  0x90		/* for future use */
lc-cn	= 0x91		/* For Chinese Hanzi GB2312-1980 */
lc-jp	= 0x92		/* For Japanese JIS X0208-1983 */
lc-kr	= 0x93		/* For Hangul KS C5601-1987 */
lc-jp2	= 0x94		/* For Japanese JIS X0212-1990 */
	  0x95-0x97	/* for future use */
lc-big5-1 = 0x98	/* For Big5 Level 1 */
lc-big5-2 = 0x99	/* For Big5 Level 2 */
lc-prv11 = 0x9A
lc-prv12 = 0x9B
lc-prv21 = 0x9C
lc-prv22 = 0x9D
lc-prv3 = 0x9E

/** The following are for internal use **/
lc-cmp = 0x80		/* For composite character */
lc-invalid = 0x9F
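
To show how the attributes of this section fit together, here is
a minimal C sketch of a character-set table keyed by leading
character.  The struct charset_info and the functions
find_charset and charset_columns are hypothetical and do not
reflect the actual data structures of mule.c; the three entries
are copied from *predefined-character-set* and from the
leading-character values above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record holding the attributes of one character-set. */
struct charset_info {
    unsigned char leading_char; /* leading character */
    int bytes;                  /* byte length of code */
    int columns;                /* columns occupied on a screen */
    int type;                   /* 0: 94, 1: 96, 2: 94x94, 3: 96x96 */
    int graphic;                /* 0: use g0 on output, 1: use g1 */
    char final_char;            /* final character of ISO escape sequence */
    int direction;              /* 0: left-to-right, 1: right-to-left */
    const char *doc;            /* short description string */
};

static const struct charset_info charset_table[] = {
    { 0x81, 1, 1, 1, 1, 'A', 0, "ISO8859-1 Latin-1" },   /* lc-ltn1 */
    { 0x87, 1, 1, 1, 1, 'G', 1, "ISO8859-6 Arabic" },    /* lc-arb */
    { 0x92, 2, 2, 2, 0, 'B', 0, "JIS X0208 Japanese" },  /* lc-jp */
};

/* Return the table entry for LEADING_CHAR, or NULL if there is none. */
const struct charset_info *find_charset(unsigned char leading_char)
{
    size_t i;

    for (i = 0; i < sizeof charset_table / sizeof charset_table[0]; i++)
        if (charset_table[i].leading_char == leading_char)
            return &charset_table[i];
    return NULL;
}

/* Columns occupied by one character of the set identified by
   LEADING_CHAR, which must be present in the table. */
int charset_columns(unsigned char leading_char)
{
    const struct charset_info *cs = find_charset(leading_char);

    assert(cs != NULL);
    return cs->columns;
}
```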