############################################################
#                      Coding system                       #
############################################################

1. NOTATION

coding-system:
        A method for encoding several character-sets.  Represented as
        an Emacs Lisp object of type vector.

2. Coding-system

You should specify a coding-system object for each of the following:
file I/O, process I/O, output to the terminal, and input from the
keyboard.

Each coding-system is defined by calling make-coding-system:

--- mule.el ---------------------------------------------
(defun make-coding-system (name type mnemonic doc &optional crlf flags)
  "Register symbol NAME as a coding-system of TYPE, MNEMONIC, DOC, CRLF, FLAGS.
 TYPE is information for encoding or decoding.  If it is one of below,
        nil: no conversion, t: automatic conversion,
        0:Internal, 1:Shift-JIS, 2:ISO2022, 3:Big5.
 the system provides appropriate code conversion facility.
 MNEMONIC: a character to be displayed on mode-line for this coding-system,
 DOC: a describing documents for the coding-system,
 CRLF (option): if non-nil, CRLF <-> LF conversion is done on I/O.
 FLAGS (option): more precise information about the coding-system,
If TYPE is 2 (ISO2022), FLAGS should be a list of:
 LC-G0, LC-G1, LC-G2, LC-G3:
        Leading character of charset initially designated to G? graphic set,
        nil means G? is not designated initially,
        lc-invalid means G? can never be designated to,
        if (- leading-char) is specified, it is designated on output,
 SHORT: non-nil - allow such as \"ESC $ B\", nil - always \"ESC $ \( B\",
 ASCII-EOL: non-nil - designate ASCII to g0 at each end of line on output,
 ASCII-CNTL: non-nil - designate ASCII to g0 before TAB and SPACE on output.
 SEVEN: non-nil - use 7-bit environment on output.
 LOCK-SHIFT: non-nil - use locking-shift (SO/SI) instead of single-shift
        or designation by escape sequence.
 USE-ROMAN: non-nil - write ASCII as JIS0201-1976-Roman.
 USE-OLDJIS: non-nil - write JIS0208-1983 as JIS0208-1976.
If TYPE is 3 (Big5), FLAGS T means Big5-ETen, NIL means Big5-HKU."
------------------------------------------------------------

As an example of how this function is used, mule.el contains the
following code:

--- mule.el --------------------------------------------------
;; Definitions of predefined coding-systems

(make-coding-system
 '*noconv* nil ?=
 "No conversion.")

(make-coding-system
 '*autoconv* t ?+
 "Automatic conversion.")

(make-coding-system
 '*internal* 0 ?=
 "Internal coding-system used in an Mule buffer.")

(make-coding-system
 '*sjis* 1 ?S
 "Coding-system of Shift-JIS used in Japan.")

(make-coding-system
 '*sjis-dos* 1 ?s
 "Coding-system of Shift-JIS with CRLF at eol."
 'crlf)

(make-coding-system
 '*junet* 2 ?J
 "Coding-system used for communication with mail and news in Japan."
 nil
 (list lc-ascii lc-invalid lc-invalid lc-invalid
       'short 'ascii-eol 'ascii-cntl 'seven))

(make-coding-system
 '*oldjis* 2 ?J
 "Coding-system used for old jis terminal."
 nil
 (list lc-ascii lc-invalid lc-invalid lc-invalid
       'short 'ascii-eol 'ascii-cntl 'seven nil 'use-roman 'use-oldjis))

(make-coding-system
 '*ctext* 2 ?X
 "Coding-system used in X as Compound Text Encoding."
 nil
 (list lc-ascii lc-ltn1 lc-invalid lc-invalid
       nil 'ascii-eol))

(make-coding-system
 '*euc-japan* 2 ?E
 "Coding-system of Japanese EUC (Extended Unix Code)."
 nil
 (list lc-ascii lc-jp lc-kana lc-jp2
       'short 'ascii-eol 'ascii-cntl))

(make-coding-system
 '*euc-korea* 2 ?K
 "Coding-system of Korean EUC (Extended Unix Code)."
 nil
 (list lc-ascii lc-kr lc-invalid lc-invalid
       nil 'ascii-eol 'ascii-cntl))

(make-coding-system
 '*iso-2022-kr* 2 ?k
 "Coding-System used for communication with mail in Korea."
 nil
 (list lc-ascii (- lc-kr) lc-invalid lc-invalid
       nil 'ascii-eol 'ascii-cntl 'seven 'lock-shift))
(setq *korean-mail* '*korean-mail*)
(put '*korean-mail* 'coding-system
     (get '*iso-2022-kr* 'coding-system))

(make-coding-system
 '*euc-china* 2 ?C
 "Coding-system of Chinese EUC (Extended Unix Code)."
 nil
 (list lc-ascii lc-cn lc-invalid lc-invalid
       nil 'ascii-eol 'ascii-cntl))

(make-coding-system
 '*iso-2022-ss2-8* 2 ?I
 "ISO-2022 coding system using SS2 for 96-charset in 8-bit code."
 nil
 (list lc-ascii lc-invalid nil lc-invalid
       nil 'ascii-eol 'ascii-cntl))

(make-coding-system
 '*iso-2022-ss2-7* 2 ?I
 "ISO-2022 coding system using SS2 for 96-charset in 7-bit code."
 nil
 (list lc-ascii lc-invalid nil lc-invalid
       nil 'ascii-eol 'ascii-cntl 'seven))

(make-coding-system
 '*iso-2022-lock* 2 ?i
 "ISO-2022 coding system using Locking-Shift for 96-charset."
 nil
 (list lc-ascii nil lc-invalid lc-invalid
       nil 'ascii-eol 'ascii-cntl 'seven))

(make-coding-system
 '*big5-eten* 3 ?B
 "Coding-system of BIG5-ETen."
 nil t)

(make-coding-system
 '*big5-hku* 3 ?B
 "Coding-system of BIG5-HKU."
 nil nil)

(make-coding-system
 '*big5-eten-dos* 3 ?b
 "Coding-system of BIG5-ETen with CRLF at eol."
 'crlf t)

(make-coding-system
 '*big5-hku-dos* 3 ?b
 "Coding-system of BIG5-HKU with CRLF at eol."
 'crlf nil)
------------------------------------------------------------

At present, there is no difference among *noconv*, *autoconv*, and
*internal* on output, and no difference between *noconv* and
*internal* on input.

3. Automatic conversion

We have a function for automatic detection of the coding-system.
The detection is not very powerful, however: it detects only the
category of the coding-system, which is one of the following:
        0. INTERNAL -- Internal code used in an Emacs buffer.
        1. SJIS     -- Shift-JIS.
        2. JUNET    -- ISO2022, all char-sets are invoked only to GL.
        3. CTEXT    -- ISO2022, char-sets invoked to GR are one-byte codes.
        4. EUC      -- ISO2022, char-sets invoked to GR are two-byte codes.
        5. BIG5     -- Big5.
or ASCII only.

The automatic detection also detects whether the text has CRLF at the
end of each line.

If the code detection routine finds one of the above categories, it
returns the following coding-system according to the category:
        0. *internal-code-category*
        1. *sjis-code-category*
        2. *junet-code-category*
        3. *ctext-code-category*
        4. *euc-code-category*
        5. *big5-code-category*
or nil if the text is ASCII only.

These are predefined as follows:

--- mule.el --------------------------------------------------
(setq *internal-code-category* '(*internal* . *sjis-dos*)
      *sjis-code-category* '(*sjis* . *sjis-dos*)
      *junet-code-category* '(*junet* . *sjis-dos*)
      *euc-code-category* '(*euc-japan* . *sjis-dos*)
      *ctext-code-category* '(*ctext* . *sjis-dos*)
      *big5-code-category* '(*big5-hku* . *big5-hku-dos*))
------------------------------------------------------------

The car of each pair is the coding-system used when lines end with
LF, and the cdr is the coding-system used when lines end with CRLF.

If the routine cannot narrow the text down to a single category, it
returns the coding-system of the category with the highest priority
according to the variable code-priority.

If the automatic detection finds only ASCII text but with CRLF at
eol, it returns *crlf-code-category*, which is predefined as follows:

--- mule.el --------------------------------------------------
(setq *crlf-code-category* '(nil . *sjis-dos*))
------------------------------------------------------------

The automatic detection takes into account the user-defined priority
of each category:

------------------------------------------------------------
code-priority:

Documentation:
List of category symbols of coding-system:
 *internal-code-category*: INTERNAL,
 *sjis-code-category*: Shift-JIS,
 *junet-code-category*: ISO2022(JUNET),
 *euc-code-category*: ISO2022(EUC),
 *ctext-code-category*: ISO2022(CTEXT),
 *big5-code-category*: BIG5.
This priority list is used while detecting coding-system.
------------------------------------------------------------

The value of code-priority is predefined in each language-specific
file (e.g. japanese.el, chinese.el, ...).

############################################################

We also have functions for encoding/decoding to/from Internal-code.
But for decoding to Type 2 (ISO2022), we have the following
restrictions:

Locking-Shift:
        Use SI and SO only when decoding with a coding-system whose
        LOCK-SHIFT and SEVEN are both t.
Single-Shift:
        Use SS2 and SS3 (if SEVEN is nil) or ESC N and ESC O (if SEVEN
        is t).
Designation:
        Designate a char-set to G0 or G1 according to its GRAPHIC
        value.  (See character.text).
Invocation:
        G0 is always invoked to GL, G1 to GR (but only if SEVEN is
        nil).  G2 and G3 are invoked to GL by Single-Shift of SS2 and
        SS3.
Unofficial use of ESC sequence for designation:
        If SEVEN is t, LOCK-SHIFT is nil, and designation to G2 and G3
        is prohibited, we should designate all character sets to G0
        (and hence invoke them to GL).  To designate a 96 char-set to
        G0, we use "ESC ," followed by the final character of the
        char-set.  For instance, to designate ISO8859-1 to G0, we use
        "ESC , A".
Unofficial use of ESC sequence for composite character:
        To indicate the start and end of a composite character, we use
        ESC 0 (start) and ESC 1 (end).

############################################################

4. Setting coding-systems for I/O.

The lowest-level functions for the various kinds of I/O are redefined
so that they can set the appropriate coding-system for that I/O.
See the documentation of insert-file-contents, write-region,
call-process, start-process, and open-network-stream.

5. Other related functions (not all are listed)

        describe-coding-system
        code-detect-region
        code-convert-region
        code-convert-string
        s2e
        e2s
        define-service-kanji-code
        detect-code-category

6. Big5 support

As far as I know, there are several different codes called Big5.  The
most famous ones are Big5-ETen and Big5-HKU-form2.  Since both of them
use the code range 0xa140 - 0xfefe (in each row, columns (second byte)
0x7f - 0xa0 are skipped) and the number of characters is more than
13000, it's impossible to treat either of them as a single
character-set in the current Mule system.
So, Mule treats them in a quite irregular manner, as described below:

(1) Mule does not treat them as different character sets, but as the
    same character set called Big5.
    Caution!!  Big5 is a different character set from GB.

(2) Mule divides Big5 into two sub-character-sets:
        0xa140 - 0xc67e (Level 1)
        0xc6a1 - 0xfefe (Level 2)
    and allocates two leading-chars, lc-big5-1 and lc-big5-2, to them.

(3) Usually, each leading-char (or character-set) has a unique
    character category.  But lc-big5-1 and lc-big5-2 have the same
    character category, with mnemonic 'b'.  So, the regular expression
    "\\cb" matches any Big5 (Level 1 and Level 2) character.

(4) If you specify an ISO2022 type coding-system on output, Mule
    converts Big5 code using the unofficial final-characters '0' (for
    Level 1) and '1' (for Level 2).

(5) You can use fonts of either ETen or HKU for displaying Big5 code.
    Mule judges which font is used by examining the existence of
    character C6A1 in the font.  If it exists, the font is HKU;
    otherwise the font is ETen.
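
The range arithmetic in (2), together with the skipped columns noted
at the start of this section, can be sketched in Emacs Lisp as
follows.  This is only an illustration under stated assumptions:
my-big5-valid-p and my-big5-level are hypothetical names that do not
exist in mule.el, a Big5 code is assumed to be passed as a single
16-bit integer, and the #x hexadecimal read syntax assumes a
reasonably recent Emacs.  It does not show how Mule itself maps codes
to lc-big5-1 and lc-big5-2.

```elisp
;; Hypothetical helpers for illustration only; NOT part of mule.el.

(defun my-big5-valid-p (code)
  "Return non-nil if CODE lies inside the Big5 range 0xa140 - 0xfefe.
CODE is assumed to be a 16-bit integer such as #xa140."
  (let ((row (/ code 256))              ; first byte
        (col (% code 256)))             ; second byte
    (and (>= row #xa1) (<= row #xfe)
         (>= col #x40) (<= col #xfe)
         ;; Columns 0x7f - 0xa0 are skipped in every row.
         (or (< col #x7f) (> col #xa0)))))

(defun my-big5-level (code)
  "Return 1 or 2 for the sub-character-set that holds CODE, or nil.
Level 1 is 0xa140 - 0xc67e and Level 2 is 0xc6a1 - 0xfefe."
  (cond ((not (my-big5-valid-p code)) nil)
        ((<= code #xc67e) 1)
        (t 2)))
```

For example, (my-big5-level #xa440) evaluates to 1,
(my-big5-level #xc940) evaluates to 2, and (my-big5-level #xc680)
evaluates to nil because column 0x80 is one of the skipped columns.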