Handling of character sets in SP

The following description applies only to the multi-byte version of SP. In the single-byte version of SP, each character is represented both internally and in storage objects by a single byte equal to the number of the character in the document character set.

SP's entity manager converts the bytes comprising a storage object into a sequence of characters. This conversion is determined by the encoding associated with the storage object.

An encoding may be specified using the name of a mapping from sequences of characters to sequences of bytes.

An encoding can also be specified relative to the document character set. This kind of encoding maps a sequence of characters in the repertoire of the document character set into a sequence of bytes by

mapping each character to its bit combination in the document character set, and then
applying a transformation that maps sequences of bit combinations to sequences of bytes.

The transformation applied in the second step is called a bit combination transformation format (BCTF). A document character set relative encoding is specified by giving the name of a BCTF.
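The two steps can be sketched in Python as follows. This is an illustration only, not SP's implementation: the function names and the toy document character set (a dict from characters to bit combinations) are assumptions made for the example, and fixed-2 is used as the BCTF.

```python
def bctf_fixed2(bit_combinations):
    """Hypothetical BCTF: two bytes per bit combination, more significant first."""
    out = bytearray()
    for bc in bit_combinations:
        if not 0 <= bc <= 0xFFFF:
            raise ValueError(f"bit combination {bc:#x} does not fit in 2 bytes")
        out += bc.to_bytes(2, "big")
    return bytes(out)


def encode_relative(text, doc_charset, bctf):
    """Encode text relative to a document character set.

    Step 1: map each character to its bit combination via doc_charset.
    Step 2: apply the BCTF to the resulting sequence of bit combinations.
    """
    bit_combinations = [doc_charset[ch] for ch in text]
    return bctf(bit_combinations)


# Toy document character set, invented for this example.
toy_charset = {"A": 0x41, "B": 0x42, "\u3042": 0xA4A2}

encoded = encode_relative("AB\u3042", toy_charset, bctf_fixed2)
print(encoded.hex())  # 00410042a4a2
```

Keeping the character-to-number mapping separate from the BCTF mirrors the description above: the same BCTF can be reused with any document character set.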

An application receives characters from SP represented as non-negative integers. The mapping from characters to integers is determined by SP's internal character set. SP can operate in a mode in which the internal character set is the same as the document character set. (Versions of SP up to 1.1.1 always operated in this mode.) The multi-byte version of SP can also operate in a mode in which the internal character set does not vary with the document character set, but is always a fixed character set, known as the system character set; this mode of operation is called fixed character set mode.

Environment

SP's character set handling is controlled by the following environment variables:

SP_CHARSET_FIXED

If this variable is 1 or YES, then SP will operate in fixed character set mode.

SP_SYSTEM_CHARSET

This identifies the system character set. When in fixed character set mode, this character set is used as the internal character set. When not in fixed character set mode, this character set is used as the internal character set until the document character set has been read, at which point the document character set is used as the internal character set.

The only currently recognized value for this is JIS. This refers to a character set which combines JIS X 0201, JIS X 0208 and JIS X 0212 by adding 0x8080 to the codes of characters in JIS X 0208 and 0x8000 to the codes of characters in JIS X 0212.

The default system character set is Unicode 2.0.
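The arithmetic behind the JIS system character set described above can be sketched in Python. The function name and the convention of passing a two-byte JIS code packed into one integer are assumptions made for this example; the sample codes are illustrative values only.

```python
def jis_system_code(charset, code):
    """Return a character's number in the combined JIS system character set.

    charset names the source standard; code is the character's code there
    (a single byte for JIS X 0201, two 7-bit bytes packed into one integer
    for JIS X 0208 and JIS X 0212).
    """
    if charset == "JIS X 0201":
        return code            # used unchanged
    if charset == "JIS X 0208":
        return code + 0x8080   # sets the high bit of both bytes
    if charset == "JIS X 0212":
        return code + 0x8000   # sets the high bit of the first byte only
    raise ValueError(f"unknown character set: {charset}")


print(hex(jis_system_code("JIS X 0201", 0xB1)))    # 0xb1
print(hex(jis_system_code("JIS X 0208", 0x2422)))  # 0xa4a2
print(hex(jis_system_code("JIS X 0212", 0x3021)))  # 0xb021
```

Because the three offsets leave distinct patterns in the 0x80 and 0x8000 bits, the source standard of any combined code can be recovered from those two bits alone.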

SP_ENCODING

This specifies the default encoding when operating in fixed character set mode. The value must be the name of an available encoding. The default encoding cannot be document character set relative when operating in fixed character set mode.

SP_BCTF

This specifies the default encoding when not operating in fixed character set mode. The value must be the name of an available BCTF. When not operating in fixed character set mode, the default encoding is the document character set relative encoding with this BCTF. The default encoding is required to be document character set relative when not operating in fixed character set mode.
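How these variables interact can be summarized with a small Python sketch. It only models the selection rules stated above; the function name is hypothetical, the exact-match test for 1 or YES is an assumption, and SP's built-in defaults for unset variables are not modeled.

```python
import os


def default_encoding(environ=os.environ):
    """Return (kind, name) for the default encoding.

    kind is "encoding" in fixed character set mode, where SP_ENCODING
    applies, and "bctf" otherwise, where SP_BCTF applies. name is None
    when the relevant variable is unset.
    """
    fixed = environ.get("SP_CHARSET_FIXED", "") in ("1", "YES")
    if fixed:
        return ("encoding", environ.get("SP_ENCODING"))
    return ("bctf", environ.get("SP_BCTF"))


print(default_encoding({"SP_CHARSET_FIXED": "YES", "SP_ENCODING": "utf-8"}))
# ('encoding', 'utf-8')
print(default_encoding({"SP_BCTF": "euc", "SP_ENCODING": "utf-8"}))
# ('bctf', 'euc')
```

In the second call SP_ENCODING is ignored, because it is only consulted in fixed character set mode.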

The default encoding is used for file input and output, and, except under Windows 95 and Windows NT, for all other interfaces with the operating system including filenames, environment variable names, environment variable values and command line arguments.

Under Windows 95 and Windows NT there are no restrictions on the default encoding. Note that in order for non-ASCII characters to be correctly displayed on your console you must select a TrueType font, such as Lucida Console, as your console font. (This seems to work only on Windows NT.)

Under other operating systems, the default encoding must be one in which ASCII characters are represented by a single byte.

Applications built with SP may require fixed character set mode and a particular system character set; such applications will ignore the SP_SYSTEM_CHARSET and SP_CHARSET_FIXED environment variables.

Available encodings

Encoding names are case insensitive. The following named encodings are available:

utf-8: Each character is represented by a variable number of bytes according to UCS Transformation Format 8 defined in Annex P to be added by the first proposed draft amendment (PDAM 1) to ISO/IEC 10646-1:1993.
ucs-2
iso-10646-ucs-2: This is ISO/IEC 10646 with the UCS-2 transformation format. Each character is represented by 2 bytes. No special treatment is given to the byte order mark character.
unicode: Each character is represented by 2 bytes. The bytes representing the entire storage object may be preceded by a pair of bytes representing the byte order mark character (0xFEFF). The bytes representing each character are in the system byte order, unless the byte order mark character is present, in which case the order of its bytes determines the byte order. When the storage object is read, any byte order mark character is discarded.
euc-jp: This is equivalent to the Extended_UNIX_Code_Packed_Format_for_Japanese Internet charset. Each character is encoded by a variable length sequence of octets.
euc-kr: This is ASCII and KSC 5601 encoded with the EUC encoding as defined by KS C 5861-1992.
euc-cn
cn-gb
gb2312: This is ASCII and GB 2312-80 encoded with the EUC encoding. It is equivalent to the CN-GB MIME charset defined in RFC 1922.
sjis
shift_jis: This is equivalent to the Shift_JIS Internet charset. Each character is encoded by a variable length sequence of octets. This is Microsoft's standard encoding for Japanese.
big5
cn-big5: This is equivalent to the CN-Big5 MIME charset defined in RFC 1922.
is8859-n
iso-8859-n: n can be any single digit other than 0. Each character in the repertoire of ISO 8859-n is represented by a single byte.
xml: On input, this uses XML's rules to determine the encoding. On output, this uses UTF-8.
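The byte order mark handling described for the unicode encoding can be sketched in Python as follows. This is a reading of the description, not SP's code: the function name is hypothetical, and it is assumed that only a byte order mark at the very start of the storage object is recognized and discarded.

```python
import sys


def decode_unicode(data, system_order=sys.byteorder):
    """Decode bytes in the style of the "unicode" encoding into character numbers.

    system_order ("big" or "little") is used unless the data starts with
    a byte order mark, in which case the order of its bytes wins.
    """
    if len(data) % 2:
        raise ValueError("odd number of bytes")
    order, start = system_order, 0
    if data[:2] == b"\xfe\xff":      # 0xFEFF stored more significant byte first
        order, start = "big", 2
    elif data[:2] == b"\xff\xfe":    # 0xFEFF stored less significant byte first
        order, start = "little", 2
    return [int.from_bytes(data[i:i + 2], order)
            for i in range(start, len(data), 2)]


print(decode_unicode(b"\xfe\xff\x00A"))    # [65]
print(decode_unicode(b"\xff\xfeA\x00"))    # [65]
print(decode_unicode(b"\x00A", "big"))     # [65]
```

By contrast, a decoder for ucs-2 as described above would treat 0xFEFF as an ordinary character and pass it through.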

The following additional encodings are supported under Windows 95 and Windows NT:

windows: Specify this encoding when a storage object is encoded using your system's default Windows character set. This uses the so-called ANSI code page.
wunicode: This uses the unicode encoding if the storage object starts with a byte order mark and otherwise the windows encoding. If you are working with Unicode, this is probably the best value for SP_ENCODING.
ms-dos: Specify this encoding when a storage object (file) uses the OEM code page. The OEM code page for a particular machine is the code page used by FAT file systems on that machine and is the default code page for MS-DOS consoles.
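The choice made by wunicode can be approximated in Python. This is only a sketch: the function name is hypothetical, cp1252 merely stands in for whatever ANSI code page the system uses, and Python's utf-16 codec is used as a stand-in for the unicode encoding (it also interprets surrogate pairs, which the description above does not mention).

```python
def decode_wunicode(data, ansi_codec="cp1252"):
    """Decode the way "wunicode" is described.

    A leading byte order mark selects 2-byte Unicode; anything else is
    decoded with the ANSI code page (ansi_codec is an example value).
    """
    if data[:2] in (b"\xfe\xff", b"\xff\xfe"):
        # Python's utf-16 codec reads the byte order mark and drops it.
        return data.decode("utf-16")
    return data.decode(ansi_codec)


print(decode_wunicode(b"\xff\xfeA\x00"))  # A
print(decode_wunicode(b"caf\xe9"))        # café
```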

Available BCTFs

The following BCTFs are available:

identity

Each bit combination is represented by a single byte.

fixed-2

Each bit combination is represented by exactly 2 bytes, with the more significant byte first.

euc

Each bit combination is represented by a variable number of bytes depending on the values of the 0x80 and 0x8000 bits:

if neither bit is set, then the bit combination is represented by a single byte equal to the bit combination;
bit combinations with both bits set are represented by the MSB of the bit combination followed by the LSB of the bit combination;
bit combinations with just the 0x80 bit set are represented by 0x8E followed by a byte equal to the bit combination;
bit combinations with just the 0x8000 bit set are represented by 0x8F followed by the MSB of the bit combination followed by the LSB of the bit combination.
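The four euc rules can be written out as a short Python function. The sketch follows the rules exactly as worded above and has not been checked against SP's source; the function name is hypothetical, and raising an error for bit combinations that do not fit a rule's byte count is an assumption.

```python
def bctf_euc_encode(bc):
    """Encode one bit combination (0..0xFFFF) following the four euc rules."""
    if not 0 <= bc <= 0xFFFF:
        raise ValueError(f"bit combination out of range: {bc:#x}")
    msb, lsb = bc >> 8, bc & 0xFF
    has_80 = bool(bc & 0x80)
    has_8000 = bool(bc & 0x8000)
    if has_80 and has_8000:
        return bytes([msb, lsb])
    if has_8000:
        return bytes([0x8F, msb, lsb])
    # The remaining two rules emit "a byte equal to the bit combination",
    # so the bit combination itself must fit in one byte.
    if msb != 0:
        raise ValueError(f"bit combination {bc:#x} does not fit in one byte")
    if has_80:
        return bytes([0x8E, bc])
    return bytes([bc])


for sample in (0x41, 0xA4A2, 0xB1, 0xB021):
    print(hex(sample), bctf_euc_encode(sample).hex())
# 0x41 41
# 0xa4a2 a4a2
# 0xb1 8eb1
# 0xb021 8fb021
```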

sjis

A bit combination between 0 and 127 or between 161 and 223 is encoded as a single byte with the same value as the bit combination. A bit combination with the 0x8000 and 0x80 bits set is encoded by the sequence of bytes with which the SJIS encoding encodes the character whose number in JIS X 0208 added to 0x8080 is equal to the bit combination.
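A Python sketch of the sjis rule is given below. The two-byte branch uses the usual arithmetic conversion from a JIS X 0208 code to Shift_JIS bytes, which is an assumption about how the mapping is computed rather than something taken from SP; the function name is hypothetical, and bit combinations outside the two described cases are rejected.

```python
def bctf_sjis_encode(bc):
    """Encode one bit combination following the sjis description."""
    if 0 <= bc <= 127 or 161 <= bc <= 223:
        return bytes([bc])
    if 0 <= bc <= 0xFFFF and (bc & 0x8080) == 0x8080:
        jis = bc - 0x8080                  # recover the JIS X 0208 code
        j1, j2 = jis >> 8, jis & 0xFF
        if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
            raise ValueError(f"not a JIS X 0208 code: {jis:#x}")
        # Two JIS rows share one Shift_JIS lead byte.
        s1 = ((j1 + 1) >> 1) + (0x70 if j1 < 0x5F else 0xB0)
        if j1 % 2:
            # Odd row: lower half of the trail byte range, skipping 0x7F.
            s2 = j2 + 0x1F + (1 if j2 >= 0x60 else 0)
        else:
            # Even row: upper half of the trail byte range.
            s2 = j2 + 0x7E
        return bytes([s1, s2])
    raise ValueError(f"bit combination not encodable: {bc:#x}")


print(bctf_sjis_encode(0x41).hex())    # 41
print(bctf_sjis_encode(0xB1).hex())    # b1
print(bctf_sjis_encode(0xA4A2).hex())  # 82a0
```

The last call corresponds to JIS X 0208 code 0x2422, whose Shift_JIS form is the byte pair 0x82 0xA0.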

big5

A bit combination less than 0x80 is encoded as a single byte. A bit combination with the 0x8000 bit set is encoded as two bytes, the MSB of the bit combination followed by the LSB of the bit combination.
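The big5 rule is simple enough to state directly in Python. The function name is hypothetical, and rejecting bit combinations that match neither case is an assumption, since the description does not say what happens to them.

```python
def bctf_big5_encode(bc):
    """Encode one bit combination following the big5 description."""
    if 0 <= bc < 0x80:
        return bytes([bc])                 # single byte
    if 0x8000 <= bc <= 0xFFFF:             # 16-bit value with the 0x8000 bit set
        return bytes([bc >> 8, bc & 0xFF])  # MSB, then LSB
    raise ValueError(f"bit combination not encodable: {bc:#x}")


print(bctf_big5_encode(0x41).hex())    # 41
print(bctf_big5_encode(0xA440).hex())  # a440
```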

James Clark
jjc@jclark.com