Handling of character sets in SP
The following description applies only to the multi-byte version of
SP. In the single-byte version of SP, each character is represented
both internally and in storage objects by a single byte equal to the
number of the character in the document character set.
SP's entity manager converts the bytes comprising a storage object
into a sequence of characters.
This conversion is determined by the encoding associated with
the storage object.
An encoding may be specified using the name of a mapping from
sequences of characters to sequences of bytes.
An encoding can also be specified relative to the document character
set. This kind of encoding maps a sequence of characters in the
repertoire of the document character set into a sequence of bytes by
-
mapping each character to its bit
combination in the document character set, and then
-
applying a transformation that maps sequences of bit combinations
to sequences of bytes.
The transformation applied in the second step is called a bit
combination transformation format (BCTF).
A document character set relative encoding is specified by giving
the name of a BCTF.
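As an illustration, the two steps above can be sketched in Python. This is not SP code: the names encode_relative, bctf_identity, bctf_fixed_2 and toy_charset are hypothetical, and the character set is a made-up two-character table. The two transformations follow the descriptions of the identity and fixed-2 BCTFs in the list of available BCTFs in this document.

```python
from typing import Callable, Dict, Iterable

# A BCTF maps a sequence of bit combinations (integers) to bytes.
Bctf = Callable[[Iterable[int]], bytes]


def bctf_identity(bit_combinations: Iterable[int]) -> bytes:
    """One byte per bit combination; raises ValueError above 0xFF."""
    return bytes(bit_combinations)


def bctf_fixed_2(bit_combinations: Iterable[int]) -> bytes:
    """Two bytes per bit combination, more significant byte first."""
    return b"".join(bc.to_bytes(2, "big") for bc in bit_combinations)


def encode_relative(text: str, doc_charset: Dict[str, int], bctf: Bctf) -> bytes:
    # Step 1: map each character to its bit combination in the
    # document character set.
    bit_combinations = [doc_charset[ch] for ch in text]
    # Step 2: apply the BCTF to the sequence of bit combinations.
    return bctf(bit_combinations)


# Hypothetical document character set with two characters.
toy_charset = {"A": 0x41, "\u3042": 0x2422}
```

With the fixed-2 BCTF, the two characters of the toy character set encode to the four bytes 00 41 24 22; with the identity BCTF, only bit combinations up to 0xFF can be encoded.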
An application receives characters from SP represented as non-negative
integers.
The mapping from characters to integers is determined by SP's internal
character set.
SP can operate in a mode in which the internal character set is the
same as the document character set.
(Versions of SP up to 1.1.1 always operated in this mode.)
The multi-byte version of SP can also operate in a mode in which the
internal character set does not vary with the document character set,
but is always a fixed character set, known as the system character set;
this mode of operation is called fixed character set mode.
Environment
SP's character set handling is controlled by the following
environment variables:
-
SP_CHARSET_FIXED
-
If this variable is 1 or YES, then SP will operate in fixed character set mode.
-
SP_SYSTEM_CHARSET
-
This identifies the system character set. When in fixed character set
mode, this character set is used as the internal character set. When
not in fixed character set mode this character set is used as the
internal character set until the document character set has been read,
at which point the document character set is used as the internal
character set.
The only currently recognized value for this is JIS.
This refers to a character set which combines JIS X 0201, JIS X 0208
and JIS X 0212 by adding 0x8080 to the codes of characters in JIS X
0208 and 0x8000 to the codes of characters in JIS X 0212.
The default system character set is Unicode 2.0.
-
SP_ENCODING
-
This specifies the default encoding when operating in fixed character set mode.
The value must be the name of an available encoding.
The default encoding cannot be document character set relative
when operating in fixed character set mode.
-
SP_BCTF
-
This specifies the BCTF used for the default encoding when not operating in fixed character set mode.
The value must be the name of an available BCTF.
The default encoding is then the document character set relative
encoding with this BCTF; when not operating in fixed character set mode,
the default encoding is required to be document character set relative.
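The way these variables select the mode and the default encoding can be summarized in a small Python sketch. It only models the rules stated above and is not SP's implementation; CharsetConfig and read_charset_config are hypothetical names. Two points are assumptions: the values 1 and YES are compared exactly as written, and an unset variable is reported as None because the built-in default is not modeled here.

```python
import os
from typing import Mapping, NamedTuple, Optional


class CharsetConfig(NamedTuple):
    fixed_mode: bool
    system_charset: Optional[str]    # None: SP_SYSTEM_CHARSET not set
    default_encoding: Optional[str]  # None: relevant variable not set
    relative: bool                   # True if default_encoding names a BCTF


def read_charset_config(env: Mapping[str, str] = os.environ) -> CharsetConfig:
    # Assumption: "1" and "YES" are matched exactly (case-sensitive).
    fixed = env.get("SP_CHARSET_FIXED") in ("1", "YES")
    system_charset = env.get("SP_SYSTEM_CHARSET")
    if fixed:
        # Fixed character set mode: SP_ENCODING names an encoding,
        # which cannot be document character set relative.
        return CharsetConfig(True, system_charset, env.get("SP_ENCODING"), False)
    # Otherwise: SP_BCTF names the BCTF of a document character set
    # relative encoding.
    return CharsetConfig(False, system_charset, env.get("SP_BCTF"), True)
```

Note that SP_ENCODING is only consulted in fixed character set mode and SP_BCTF only outside it, so setting both is harmless.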
The default encoding is used for file input and output and, except
under Windows 95 and Windows NT, for all other interfaces with the
operating system, including filenames, environment variable names,
environment variable values and command line arguments.
Under Windows 95 and Windows NT there are no restrictions on the
default encoding. Note that in order for non-ASCII characters to be
correctly displayed on your console you must select a TrueType font,
such as Lucida Console, as your console font. (This seems to work
only on Windows NT.)
Under other operating systems, the default encoding must be one in
which ASCII characters are represented by a single byte.
Applications built with SP may require fixed character set mode and a
particular system character set; such applications will ignore the
SP_SYSTEM_CHARSET and SP_CHARSET_FIXED environment variables.
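As a worked example of the JIS system character set described under SP_SYSTEM_CHARSET, the following Python sketch computes the number of a character in the combined character set from its code in one of the three component sets. The function name jis_system_code is hypothetical, and leaving JIS X 0201 codes unchanged is an assumption drawn from the fact that only the other two sets are given an offset.

```python
def jis_system_code(component: str, code: int) -> int:
    """Number of a character in the combined JIS system character set."""
    if component == "JIS X 0201":
        return code            # assumption: used unchanged
    if component == "JIS X 0208":
        return code + 0x8080   # sets the high bit of both bytes
    if component == "JIS X 0212":
        return code + 0x8000   # sets the high bit of the first byte only
    raise ValueError(f"unknown component character set: {component}")
```

For instance, the JIS X 0208 code 0x2422 becomes 0xA4A2, and the JIS X 0212 code 0x2242 becomes 0xA242.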
Encoding names are case insensitive.
The following named encodings are available:
-
utf-8
-
Each character is represented by a variable number of bytes
according to UCS Transformation Format 8, defined in Annex P to be
added by the first proposed draft amendment (PDAM 1) to ISO/IEC
10646-1:1993.
-
ucs-2
-
iso-10646-ucs-2
-
This is ISO/IEC 10646 with the UCS-2 transformation format.
Each character is represented by 2 bytes.
No special treatment is given to the byte order mark character.
-
unicode
-
Each character is represented by 2 bytes. The bytes
representing the entire storage object may be preceded by a pair of
bytes representing the byte order mark character (0xFEFF). The bytes
representing each character are in the system byte order, unless
the byte order mark character is present, in which case the order of
its bytes determines the byte order. When the storage object is read,
any byte order mark character is discarded.
-
euc-jp
-
This is equivalent to
the Extended_UNIX_Code_Packed_Format_for_Japanese Internet charset.
Each character is encoded by a variable length sequence of octets.
-
euc-kr
-
This is ASCII and KS C 5601 encoded with the EUC encoding
as defined by KS C 5861-1992.
-
euc-cn
-
cn-gb
-
gb2312
-
This is ASCII and GB 2312-80 encoded with the EUC encoding.
It is equivalent to the CN-GB MIME charset defined in RFC 1922.
-
sjis
-
shift_jis
-
This is equivalent to the Shift_JIS Internet charset.
Each character is encoded by a variable length sequence of octets.
This is Microsoft's standard encoding for Japanese.
-
big5
-
cn-big5
-
This is equivalent to the CN-Big5 MIME charset defined in RFC 1922.
-
is8859-n
-
iso-8859-n
-
n can be any single digit other than 0. Each
character in the repertoire of ISO 8859-n is represented
by a single byte.
-
xml
-
On input, this uses XML's rules to determine the encoding.
On output, this uses UTF-8.
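The byte order rule given for the unicode encoding can be sketched in Python as follows. The function name decode_unicode_encoding and its system_order parameter are hypothetical. The sketch assumes that only a byte order mark at the very start of the storage object is recognized and discarded, and that an odd number of bytes is an error; neither point is stated above.

```python
import sys
from typing import List


def decode_unicode_encoding(data: bytes, system_order: str = sys.byteorder) -> List[int]:
    """Character numbers of a storage object in the 'unicode' encoding."""
    if len(data) % 2:
        raise ValueError("odd number of bytes")  # assumption: treated as an error
    order, start = system_order, 0
    if data[:2] == b"\xfe\xff":      # 0xFEFF stored most significant byte first
        order, start = "big", 2
    elif data[:2] == b"\xff\xfe":    # 0xFEFF stored least significant byte first
        order, start = "little", 2
    # Each character is 2 bytes; the leading byte order mark is discarded.
    return [int.from_bytes(data[i:i + 2], order) for i in range(start, len(data), 2)]
```

Without a byte order mark, the result depends on the byte order passed in, which mirrors the dependence on the system byte order described for this encoding.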
The following additional encodings are supported under Windows 95
and Windows NT:
-
windows
-
Specify this encoding when a storage object is encoded using your
system's default Windows character set.
This uses the so-called ANSI code page.
-
wunicode
-
This uses the unicode encoding if the storage object starts
with a byte order mark and otherwise the windows encoding.
If you are working with Unicode, this is probably the best value
for SP_ENCODING.
-
ms-dos
-
Specify this encoding when a storage object (file) uses the OEM code page.
The OEM code page for a particular
machine is the code page used by FAT file systems on that machine and
is the default code page for MS-DOS consoles.
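The selection rule of the wunicode encoding can be approximated in Python as shown below. This is a sketch, not SP's behavior: decode_wunicode is a hypothetical name, and the ansi_codec parameter with its cp1252 default merely stands in for the system's ANSI code page (on Windows, Python's mbcs codec refers to the actual one). Python's utf-16 codec also interprets surrogate pairs, which goes beyond the 2 bytes per character described for the unicode encoding.

```python
def decode_wunicode(data: bytes, ansi_codec: str = "cp1252") -> str:
    """Decode like 'wunicode': Unicode if a byte order mark is present."""
    if data[:2] in (b"\xff\xfe", b"\xfe\xff"):
        # Python's utf-16 codec reads the byte order mark, uses it to
        # pick the byte order, and drops it from the result.
        return data.decode("utf-16")
    # No byte order mark: fall back to the (assumed) ANSI code page.
    return data.decode(ansi_codec)
```

This is why a single setting can serve both Unicode files that begin with a byte order mark and files in the system's Windows character set.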
The following BCTFs are available:
-
identity
-
Each bit combination is represented by a single byte.
-
fixed-2
-
Each bit combination is represented by exactly 2 bytes,
with the more significant byte first.
-
euc
-
Each bit combination is represented by a variable number of bytes
depending on the values of the 0x80 and 0x8000 bits:
-
if neither bit is set, then the bit combination is
represented by a single byte equal to the bit combination;
-
bit combinations with both bits set are represented by the most
significant byte (MSB) of the bit combination followed by its least
significant byte (LSB);
-
bit combinations with just the 0x80 bit set are represented
by 0x8E followed by a byte equal to the bit combination;
-
bit combinations with just the 0x8000 bit set are represented
by 0x8F followed by the MSB of the bit combination followed
by the LSB of the bit combination.
-
sjis
-
A bit combination between 0 and 127 or between 161 and
223 is encoded as a single byte with the same value as the bit combination.
A bit combination with the 0x8000 and 0x80 bits set is encoded by the
sequence of bytes with which the SJIS encoding encodes the character
whose number in JIS X 0208 added to 0x8080 is equal to the bit
combination.
-
big5
-
A bit combination less than 0x80 is encoded as a single byte.
A bit combination with the 0x8000 bit set is encoded as two bytes,
the MSB of the bit combination followed by the LSB of the bit
combination.
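The euc, sjis and big5 BCTFs described above can be sketched in Python as follows. The sketch follows the rules literally as written here and has not been checked against SP's implementation; all function names are hypothetical. Bit combinations that the descriptions do not cover raise ValueError, which is an assumption. The arithmetic that converts a JIS X 0208 code to Shift_JIS bytes is the usual conversion formula and is not taken from this document.

```python
from typing import Iterable


def _check16(bc: int) -> None:
    if not 0 <= bc <= 0xFFFF:
        raise ValueError(f"bit combination out of range: {bc:#x}")


def bctf_euc(bit_combinations: Iterable[int]) -> bytes:
    out = bytearray()
    for bc in bit_combinations:
        _check16(bc)
        msb, lsb = bc >> 8, bc & 0xFF
        has_80, has_8000 = bool(bc & 0x80), bool(bc & 0x8000)
        if has_80 and has_8000:
            out += bytes((msb, lsb))
        elif has_8000:
            out += bytes((0x8F, msb, lsb))
        elif msb:
            # Assumption: the remaining forms carry one byte of the bit
            # combination, so a non-zero MSB cannot be represented.
            raise ValueError(f"cannot encode {bc:#x}")
        elif has_80:
            out += bytes((0x8E, lsb))
        else:
            out.append(lsb)
    return bytes(out)


def bctf_big5(bit_combinations: Iterable[int]) -> bytes:
    out = bytearray()
    for bc in bit_combinations:
        _check16(bc)
        if bc < 0x80:
            out.append(bc)
        elif bc & 0x8000:
            out += bytes((bc >> 8, bc & 0xFF))
        else:
            raise ValueError(f"cannot encode {bc:#x}")  # not covered above
    return bytes(out)


def bctf_sjis(bit_combinations: Iterable[int]) -> bytes:
    out = bytearray()
    for bc in bit_combinations:
        _check16(bc)
        if bc <= 127 or 161 <= bc <= 223:
            out.append(bc)
        elif (bc & 0x8080) == 0x8080:
            jis = bc - 0x8080                 # code in JIS X 0208
            j1, j2 = jis >> 8, jis & 0xFF
            if not (0x21 <= j1 <= 0x7E and 0x21 <= j2 <= 0x7E):
                raise ValueError(f"not a JIS X 0208 code: {jis:#x}")
            s1 = ((j1 - 0x21) >> 1) + 0x81    # two JIS rows per lead byte
            if s1 > 0x9F:
                s1 += 0x40                    # skip the single-byte range
            if j1 & 1:
                s2 = j2 + 0x1F
                if s2 >= 0x7F:
                    s2 += 1                   # 0x7F is never a trail byte
            else:
                s2 = j2 + 0x7E
            out += bytes((s1, s2))
        else:
            raise ValueError(f"cannot encode {bc:#x}")  # not covered above
    return bytes(out)
```

For example, the bit combination 0xA4A2 (JIS X 0208 code 0x2422 plus 0x8080) is encoded as the bytes A4 A2 by the euc BCTF and as 82 A0 by the sjis BCTF.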
James Clark
jjc@jclark.com