############################################################ # Multilingual Character Handling # ############################################################ 1. Type of a character 'Type n-m' means original n-byte code represented by m-byte Type 1-1: ASCII characters Type 1-2: Characters in one-byte character-sets (e.g. ISO8859-1, Latin-1) Type 1-3: Characters in one-byte character-sets of private use Type 2-3: Characters in two-byte character-sets (e.g. JISX0208, Japanese) Type 2-4: Characters in two-byte character-sets of private use Type 3-4: Characters in three-byte character-sets of private use Type N: Composite characters of variable length We assume that each character-set satisfies the technical requirements of ISO2022. 2. Representation of a character in buffer and string Type 1-1: 1-byte 'C' [C <= 0x7F] (same as previous representation) Type 1-2: 2-byte sequence 'LC1 C1' LC1 = leading character for the character-set, 0x81 <= LC1 <= 0x8F C1 = 0x80 | (original byte for the character)] 0xA0 <= C1 <= 0xFF Type 1-3: 3-byte sequence 'LCPRV1 LC12 C1' LCPRV1 = 0x9A (for one column) or 0x9B (for two column) LC12 = extended leading character, 0xA0 <= LC12 <=0xB7 (if LCPRV1 = 0x9A) -- 24 sets 0xB8 <= LC12 <=0xBF (if LCPRV1 = 0x9B) -- 8 sets C1 = same as above Type 2-3: 3-byte sequence 'LC2 C21 C22' LC2 = leading character for the character-set, 0x90 <= LC2 <= 0x99 C21 = 0x80 | (original first byte for the character), C22 = 0x80 | (original second byte for the character), 0xA0 <= C21,C22 <= 0xFF Type 2-4: 4-byte sequence 'LCPRV2 LC22 C21 C22' LCPRV2 = 0x9C (for one column) or 0x9D (for two column) LC22 = extended leading character, 0xC0 <= LC22 <=0xC7 (if LCPRV2 = 0x9C) -- 8 sets 0xC8 <= LC22 <=0xDF (if LCPRV2 = 0x9D) -- 24 sets C21, C22 = same as above Type 3-4: 4-byte sequence 'LC3 C31 C32 C33' LCPRV3 = leading character for the character-set, LCPRV3 = 0x9E C31 = 0x80 | (original first byte for the character), 0xA0 <= C31 <= 0xBF C32 = 0x80 | (original second byte for the character) C33 = 0x80 | (original third byte for the character) 0xA0 <= C31,C32,C33 <= 0xFF Type N: n-byte sequence 'LCCMP LCN1 C11 ... LCN2 C21 ... LCNn Cn1 ...' all characters 'LCN1 C11 ... LCN2 C21 ... LCNN CN1 ...' are displayed on the same column. LCCMP = 0x80 LCN1 .. LCNN = leading character + 0x20, but, for ASCII, 0xA0 Refer Section 5 for predefined leading characters. Here's an example of a text with mixture of these types (at the place of 0x?? comes real binary code) . "Here comes Latin-1 character of n with ~ '0x81 0xF1' and here comes Japanese Hiragana '0x92 0xA4 0xA2'." 3. Representation of a character object in Emacslisp Emacslisp treated character objects as integers of values less than 256. Now, we extended character object as follows: Type 1-1: C [C <= 7F] (same as character code itself) 0 .. 0x7F Type 1-2: ((LC1 & 0x7F) << 8) | C1 0x01A0 .. 0x0FFF Type 1-3: ((LC21 & 0x7F) << 8) | C1 0x20A0 .. 0x3FFF Type 2-3: ((LC2 & 0x7F) << 16) | (C21 << 8) | C22 0x10A0A0 .. 0x19FFFF Type 2-4: ((LC22 & 0x7F) << 16) | (C21 << 8) | C22 0x40A0A0 .. 0x5FFFFF Type 3-4: ((C31 - 0x40) << 16) | (C32 << 8) | C33 0x60A0A0 .. 0x7FFFFF Type N: Can't be treated as a character object. For instance, if '?' is followed by Type 1-2 character '0x81 0xF1', 498 [= ((0x81 - 0x80) << 8) | 0xF1] is returned. In the above table, several blocks are not defined. Those are used internally to represent partially defined characters. 0x80 .. 0x9F: leading-char only 0x10A0 .. 0x19FF: LC2 + C21 of Type 2-3 0x40A0 .. 0x5FFF: LC22 + C21 of Type 2-4 0x60A0 .. 0x7FFF: LC3 + C31 + C32 of Type 3-4 4. functions To handle multilingual characters, we extended or added the following functions: In editfns.c ... char-to-string: Convert arg CHAR to a string containing that character. If CHAR < 0, it is considered as a multilingual character, and returned a correct string. Example: (char-to-string ?A) => "A" (char-to-string ?あ) => "あ" (char-to-string 1221794) => "あ" string-to-char: Convert arg STRING to a character, the first character of that string. Example: (string-to-char "ABあい") => 65 (== ?A) (string-to-char "あい") => 1221794 (== ?あ) sref: DEFUN ("sref", Fsref, Ssref, 2, 2, 0, Return the character in STRING at index INDEX. INDEX starts at 0. If INDEX does not points to character boundary, -1 is returned. Example: (sref "ABあい" 1) => 66 (== ?b) (sref "ABあい" 2) => 1221794 (== ?あ) (sref "ABあい" 3) => -1 (non character boundary) (sref "ABあい" 5) => 1221796 (== ?い) sset: Store into STRING at index INDEX the character CHAR. INDEX should point to a character of same bytes as CHAR. If not, returns nil, else returns CHAR. Example: (setq s "ABあい") (sset s 1 ?C) => ?C (s == "ACあい") (sset s 2 ?う) => ?う (s == "ACうい") (sset s 2 ?A) => ?A (s == "ACA\244\246い") (sset s 8 ?A) => nil (out of range) following-char: Return the character following point, as a number. If mc-flag of the current buffer is not nil, the returned character may be a multi-byte character. Example: If cursor is at 'あ' of buffer "..Aあ..", (following-char) => 1221794 (== ?あ) (let (mc-flag) (following-char t)) => 146 (== leading char of ?あ) preceding-char: Return the character preceding point, as a number. If mc-flag of the current buffer is not nil, the returned character may be a multi-byte character. Example: If cursor is at 'A' of buffer "..あA..", (preceding-char) => 1221794 (== ?あ) (let (mc-flag) (preceding-char t)) => 162 (== last byte of ?あ) char-after First arg, POS, a number. Return the character in the current buffer at position POS. If POS is out of range, the value is NIL. If mc-flag of the current buffer is not nil, the returned character may be a multi-byte character. Function 'insert' and 'insert-char' also work correctly with multilingual characters. (insert ?あ) -- inserts "あ" at point. buffer-substring: Return the contents of part of the current buffer as a string. The two arguments specify the start and end, as character numbers. If mc-flag of the current buffer is non-nil, region may be widen to meet character boundary. Example: If a buffer starts with the contents like "あいう..." (buffer-substring 1 2) => "あ" (buffer-substring 1 3) => "あ" (buffer-substring 2 4) => "あい" Other functions which deal with 'region' also widen range automatically. subst-char-in-region: From START to END, replace FROMCHAR with TOCHAR each time it occurs. If optional arg NOUNDO is non-nil, don't record this change for undo and don't mark the buffer as really changed. It also works well with multilingual characters only if the substitution doesn't alter the length of buffer. Example: (subst-char-in-region 1 10 ?a ?b) => possible (subst-char-in-region 1 10 ?あ ?い) => possible (subst-char-in-region 1 10 ?a ?あ) => impossible In functions 'message' and 'format', %c works well with multilingual characters. (message "%c" ?あ) -- shows "あ" in echo area. In mule.c ... make-character: Make multi-byte character from ARG1, ARG2, and ARG3 (optional). Actually what returned is (for the moment and can be changed in the future): ((ARG1 << 8) & 0x1F) + (ARG2 | 0x80), or ((ARG1 << 16) & 0x1F) + (ARG2 | 0x80) << 8 + (ARG3 | 0x80). Example: (make-character lc-jp ?\244 ?\242) => 1221794 (== ?あ) char-component: Return a components of multi-byte character CHAR. Second arg IDX indicate which component should be returned as follows. 0: leading character 1: first code of the character 2: second code of the character. If the character does not have the components, 0 is returned. Example: (char-component ?あ 0) => 146 (== lc-jp) (char-component ?あ 1) => 164 (char-component ?あ 2) => 162 (char-component ?A 1) => 0 char-leading-char: Return leading character of CHAR. If CHAR is not a multi-byte code, 0 is returned. Example: (char-leading-char ?あ) => 146 (== lc-jp) (char-leading-char ?A) => 0 char-bytes: Return number of bytes CHAR will occupy in a buffer. You can specify a character set to be concerned by providing a leading character as CHAR. Example: (char-bytes ?あ) => 3 (char-bytes ?A) => 1 (char-bytes lc-jp) => 3 char-width: Return number of columns CHAR will occupy when displayed. You can specify a character set to be concerned by providing a leading character as CHAR. Example: (char-width ?あ) => 2 (char-width ?A) => 1 (char-width lc-jp) => 2 chars-in-string: Return number of characters in STRING. Each multilingual character is also counted as one. Example: (chars-in-string "ABあい") => 4 char-boundary-p: Return non nil value if POS is at character boundary. The value is: 0: if POS is at an ASCII character or end of range, 1: if POS is at a leading char of 2-byte character. 2: if POS is at a leading char of 3-byte character. If POS is out of range or not at character boundary, nil is returned. 5. Character-set character-set: such as ASCII, right half of ISO8859-1, JIS X0208, ... A character-set is identified by a leading character assigned to each set uniquely. Each character-set is characterized by the following attributes: 1. byte length of code: 1-byte or 2-byte, ISO8859-1, Right half of JISX0201 (Japanese Katakana) -- 1-byte GB2312-1980 (Chinese), JISX0208 (Japanese) -- 2-byte 2. columns occupied on a screen: 1-column or 2-column, ISO8859-1, Right half of JISX0201 (Japanese Katakana) -- 1-column GB2312-1980 (Chinese), JISX0208 (Japanese) -- 2-column 3. type: 94-char-set, 96-char-set, 94x94-char-set, or 96x96 char-set, 4. graphic set: GL or GR, 5. final character: one of '0' thru '~', 6. displaying direction: Left-to-right or Right-to-left 7. leading character: the system assigns one by one. 3 thru 5 are notations of ISO2022. Character-sets are defined by 'new-character-set' function call. --- mule.c --------------------------------------------------------- DEFUN ("new-character-set", Fnew_character_set, Snew_character_set, 8, MANY, 0, "Define new character set of LEADING-CHAR (1st arg).\n\ Rest of args are:\n\ BYTE: 1, 2, or 3\n\ COLUMNS: 1 or 2\n\ TYPE: 0 (94 chars), 1 (96 chars), 2 (94x94 chars), or 3 (96x96 chars)\n\ GRAPHIC: 0 (use g0 on output) or 1 (use g1 on output)\n\ FINAL: final character of ISO escape sequence\n\ DIRECTION: 0 (left-to-right) or 1 (right-to-left)\n\ DOC: short description string.\n\ If LEADING-CHAR >= 0xA0, it is regarded as extended leading-char\n\ and BYTE and COLUMNS args are ignored.") ------------------------------------------------------------ The system pre-defines the following character-sets. --- mule.el --------------------------------------------------------- (defconst *predefined-character-set* (list ;; (cons lc '(bytes width type graphic final direction doc)) ;; (cons lc-ascii '(0 1 0 0 ?B 0 "ASCII")) ;; Predefined in C file (cons lc-ltn1 '(1 1 1 1 ?A 0 "ISO8859-1 Latin-1")) (cons lc-ltn2 '(1 1 1 1 ?B 0 "ISO8859-2 Latin-2")) (cons lc-ltn3 '(1 1 1 1 ?C 0 "ISO8859-3 Latin-3")) (cons lc-ltn4 '(1 1 1 1 ?D 0 "ISO8859-4 Latin-4")) (cons lc-grk '(1 1 1 1 ?F 0 "ISO8859-7 Greek")) (cons lc-arb '(1 1 1 1 ?G 1 "ISO8859-6 Arabic")) (cons lc-hbw '(1 1 1 1 ?H 1 "ISO8859-8 Hebrew")) (cons lc-kana '(1 1 0 1 ?I 0 "JIS X0201 Japanese Katakana")) (cons lc-roman '(1 1 0 0 ?J 0 "JIS X0201 Japanese Roman")) (cons lc-crl '(1 1 1 1 ?L 0 "ISO8859-5 Cyrillic")) (cons lc-ltn5 '(1 1 1 1 ?M 0 "ISO8859-9 Latin-5")) (cons lc-jpold '(2 2 2 0 ?@ 0 "JIS X0208-1976 Japanese Old")) (cons lc-cn '(2 2 2 0 ?A 0 "GB 2312-1980 Chinese")) (cons lc-jp '(2 2 2 0 ?B 0 "JIS X0208 Japanese")) (cons lc-kr '(2 2 2 0 ?C 0 "KS C5601-1987 Korean")) (cons lc-jp2 '(2 2 2 0 ?D 0 "JIS X0212 Japanese Supplement")) (cons lc-big5-1 '(2 2 2 0 ?0 0 "Big5 Level 1")) (cons lc-big5-2 '(2 2 2 0 ?1 0 "Big5 Level 2")))) (let ((c *predefined-character-set*) lc data) (while c (setq lc (car (car c)) data (cdr (car c))) (apply 'new-character-set lc data) (setq c (cdr c)))) In addition, the following private character sets are predifined. --- mule-config.el ----------------------------------------- ;; REGISTRATION OF PRIVATE CHARACTER SETS ;; PinYin-ZhuYin (setq lc-sisheng (new-private-character-set 1 1 0 0 ?0 0 "PinYin-ZhuYin")) ;; Thai TSCII (setq lc-thai (new-private-character-set 1 1 0 0 ?1 0 "Thai TSCII")) ------------------------------------------------------------ Values of variables lc-ascii thru lc-big5-2 are also predefined as follows: /** The followings are for 1-byte characters. **/ lc-ascii = 0x00 /* Omitted in a buffer */ lc-ltn1 = 0x81 /* Right half of ISO 8859-n */ lc-ltn2 = 0x82 /* */ lc-ltn3 = 0x83 /* */ lc-ltn4 = 0x84 /* */ 0x85 /* for future use */ lc-grk = 0x86 /* */ lc-arb = 0x87 /* */ lc-hbw = 0x88 /* */ lc-kana = 0x89 /* Right half of JIS X0201-1976 */ lc-roman = 0x8A /* Left half of JIS X0201-1976 */ 0x8B /* for future use */ lc-crl = 0x8C /* Right half of ISO 8859-5 */ lc-ltn5 = 0x8D /* */ 0x8E /* for future use */ 0x8F /* for future use */ /** The followings are for 2-byte characters. **/ 0x90 /* for future use */ lc-cn = 0x91 /* For Chinese Hanzi GB2312-1980 */ lc-jp = 0x92 /* For Japanese JIS X0208-1983 */ lc-kr = 0x93 /* For Hangul KS C5601-1987 */ lc-jp2 = 0x94 /* For Japanese JIS X0212-1990 */ 0x95-0x97 /* for future use */ lc-big5-1 = 0x98 /* For Big5 Level 1 */ lc-big5-2 = 0x99 /* For Big5 Level 2 */ lc-prv11 = 0x9A lc-prv12 = 0x9B lc-prv21 = 0x9C lc-prv22 = 0x9D lc-prv3 = 0x9E /** The followings are for internal use **/ lc-cmp = 0x80 /* For composite character */ lc-invalid = 0x9F