Added in API level 1

public abstract class

Charset

extends Object
implements Comparable<Charset>

java.lang.Object
↳	java.nio.charset.Charset

Class Overview

A charset is a named mapping between Unicode characters and byte sequences. Every Charset can decode, converting a byte sequence into a sequence of characters, and some can also encode, converting a sequence of characters into a byte sequence. Use the method canEncode() to find out whether a charset supports both.
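
As an illustration of the two directions described above, the following minimal sketch encodes a string and decodes the resulting bytes again. The class and method names (CharsetRoundTrip, roundTrip) are hypothetical and exist only for this example; the Charset calls are the ones listed in the summary below.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

class CharsetRoundTrip {
    /** Encodes text with the named charset, then decodes the bytes again. */
    static String roundTrip(String charsetName, String text) {
        Charset charset = Charset.forName(charsetName);
        if (!charset.canEncode()) {
            // Some charsets can only decode; encoding would not be possible.
            throw new IllegalArgumentException(charset + " does not support encoding");
        }
        ByteBuffer bytes = charset.encode(text);   // characters -> bytes
        CharBuffer chars = charset.decode(bytes);  // bytes -> characters
        return chars.toString();
    }

    public static void main(String[] args) {
        // UTF-8 can represent U+00E9, so the text survives the round trip.
        System.out.println(roundTrip("UTF-8", "h\u00e9llo"));
        // US-ASCII cannot represent U+00E9; encode(String) replaces it.
        System.out.println(roundTrip("US-ASCII", "h\u00e9llo"));
    }
}
```

The second call shows that a round trip is only lossless when the charset can represent every character in the input.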

Characters

In the context of this class, character always refers to a Java character: a Unicode code point in the range U+0000 to U+FFFF. (Java represents supplementary characters using surrogates.) Not all byte sequences will represent a character, and not all characters can necessarily be represented by a given charset. The method contains(Charset) can be used to determine whether every character representable by one charset can also be represented by another (meaning that a lossless transformation is possible from the contained to the container).
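
A small sketch of the contains(Charset) check follows. The helper name canConvertLosslessly is an assumption made for this example, and because implementations are allowed to answer conservatively, a false result only means that containment is not known.

```java
import java.nio.charset.Charset;

class CharsetContainment {
    /**
     * Returns true if the charset named 'to' reports that it contains the
     * charset named 'from', meaning text representable in 'from' can be
     * re-encoded in 'to' without loss.
     */
    static boolean canConvertLosslessly(String from, String to) {
        return Charset.forName(to).contains(Charset.forName(from));
    }

    public static void main(String[] args) {
        System.out.println(canConvertLosslessly("US-ASCII", "UTF-8"));
        System.out.println(canConvertLosslessly("UTF-8", "US-ASCII"));
    }
}
```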

Encodings

There are many possible ways to represent Unicode characters as byte sequences. See UTR#17: Unicode Character Encoding Model for detailed discussion.

The most important mappings capable of representing every character are the Unicode Transformation Format (UTF) charsets. Of those, UTF-8 and the UTF-16 family are the most common. UTF-8 (described in RFC 3629) encodes a character using 1 to 4 bytes. UTF-16 uses exactly 2 bytes per character (potentially wasting space, but allowing efficient random access into BMP text), and UTF-32 uses exactly 4 bytes per character (trading off even more space for efficient random access into text that includes supplementary characters).
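
The size differences can be checked with a short sketch like the one below. The helper byteCount is hypothetical. UTF-16BE is used rather than UTF-16 so that no byte order mark is counted, and UTF-32 is left out because it is not one of the guaranteed-available charsets.

```java
import java.nio.charset.Charset;

class EncodedLength {
    /** Number of bytes the named charset produces for the given text. */
    static int byteCount(String charsetName, String text) {
        // encode() returns a buffer whose remaining bytes are the output.
        return Charset.forName(charsetName).encode(text).remaining();
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",             // U+0041
            "\u00e9",        // U+00E9
            "\u20ac",        // U+20AC
            "\uD83D\uDE00"   // U+1F600, a surrogate pair (two Java chars)
        };
        for (String s : samples) {
            System.out.println("UTF-8: " + byteCount("UTF-8", s)
                    + " bytes, UTF-16BE: " + byteCount("UTF-16BE", s) + " bytes");
        }
    }
}
```

The supplementary character occupies four bytes in UTF-16BE because it is two Java characters, each taking two bytes.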

UTF-16 and UTF-32 encode characters directly, using their code point as a two- or four-byte integer. This means that any given UTF-16 or UTF-32 byte sequence is either big- or little-endian. To assist decoders, Unicode includes a special byte order mark (BOM) character U+FEFF used to determine the endianness of a sequence. The corresponding byte-swapped code point U+FFFE is guaranteed never to be assigned. If a UTF-16 decoder sees 0xfe, 0xff, for example, it knows it's reading a big-endian byte sequence, while 0xff, 0xfe would indicate a little-endian byte sequence.

UTF-8 can contain a BOM, but since the UTF-8 encoding of a character always uses the same byte sequence, there is no information about endianness to convey. Seeing the bytes corresponding to the UTF-8 encoding of U+FEFF (0xef, 0xbb, 0xbf) would only serve to suggest that you're reading UTF-8. Note that BOMs are decoded as the U+FEFF character and appear in the output character sequence. A disadvantage of including a BOM in UTF-8 is therefore that most applications that use UTF-8 do not expect to see one. (This is also a reason to prefer UTF-8: it's one less complication to worry about.)

Because a BOM indicates how the data that follows should be interpreted, a BOM should occur as the first character in a character sequence.

See the Byte Order Mark (BOM) FAQ for more about dealing with BOMs.
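
Because a decoded BOM shows up as U+FEFF in the output, callers that accept UTF-8 input with an optional BOM often remove it themselves. The following is a minimal sketch of that idea; Utf8Bom, decodeUtf8 and stripLeadingBom are hypothetical names, not part of this class.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class Utf8Bom {
    /** Decodes bytes as UTF-8; a leading BOM, if present, stays in the result. */
    static String decodeUtf8(byte[] bytes) {
        return Charset.forName("UTF-8").decode(ByteBuffer.wrap(bytes)).toString();
    }

    /** Drops a single leading U+FEFF if the text starts with one. */
    static String stripLeadingBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        // 0xef 0xbb 0xbf is the UTF-8 encoding of U+FEFF, followed by 'A'.
        byte[] withBom = {(byte) 0xef, (byte) 0xbb, (byte) 0xbf, 0x41};
        String decoded = decodeUtf8(withBom);
        System.out.println(decoded.length());                  // BOM counts as a char
        System.out.println(stripLeadingBom(decoded).length()); // BOM removed
    }
}
```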

Endianness and BOM behavior

The following tables show the endianness and BOM behavior of the UTF-16 variants.

This table shows what the encoder writes. "BE" means that the byte sequence is big-endian, "LE" means little-endian. "BE BOM" means a big-endian BOM (that is, 0xfe, 0xff).

Charset	Encoder writes
UTF-16BE	BE, no BOM
UTF-16LE	LE, no BOM
UTF-16	BE, with BE BOM
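
The encoder table can be reproduced with a few lines of code. This sketch prints the bytes each variant writes for the single character 'A'; the helper name hex is an assumption made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class Utf16EncoderBytes {
    /** Encodes text with the named charset and formats the bytes as hex. */
    static String hex(String charsetName, String text) {
        ByteBuffer bytes = Charset.forName(charsetName).encode(text);
        StringBuilder sb = new StringBuilder();
        while (bytes.hasRemaining()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", bytes.get() & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(hex("UTF-16BE", "A")); // 00 41
        System.out.println(hex("UTF-16LE", "A")); // 41 00
        System.out.println(hex("UTF-16", "A"));   // fe ff 00 41
    }
}
```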

The next table shows how each variant's decoder behaves when reading a byte sequence. The exact meaning of "failure" in the table is dependent on the CodingErrorAction supplied to malformedInputAction(), so "BE, failure" means "the byte sequence is treated as big-endian, and a little-endian BOM triggers the malformedInputAction".

The phrase "includes BOM" means that the output includes the U+FEFF byte order mark character.

Charset	BE BOM	LE BOM	No BOM
UTF-16BE	BE, includes BOM	BE, failure	BE
UTF-16LE	LE, failure	LE, includes BOM	LE
UTF-16	BE	LE	BE
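
The non-failure cells of the decoder table can be exercised as follows. This is only a sketch: the helper decode is hypothetical, and the "failure" cells are left out because their outcome depends on the configured CodingErrorAction.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class Utf16DecoderBehavior {
    /** Decodes the given byte values (0..255) with the named charset. */
    static String decode(String charsetName, int... values) {
        byte[] raw = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            raw[i] = (byte) values[i];
        }
        return Charset.forName(charsetName).decode(ByteBuffer.wrap(raw)).toString();
    }

    public static void main(String[] args) {
        // UTF-16 uses the BOM to pick the byte order and drops it from the output.
        System.out.println(decode("UTF-16", 0xfe, 0xff, 0x00, 0x41).length()); // BE BOM
        System.out.println(decode("UTF-16", 0xff, 0xfe, 0x41, 0x00).length()); // LE BOM
        System.out.println(decode("UTF-16", 0x00, 0x41).length());             // no BOM
        // UTF-16BE and UTF-16LE keep a matching BOM as a U+FEFF character.
        System.out.println(decode("UTF-16BE", 0xfe, 0xff, 0x00, 0x41).length());
        System.out.println(decode("UTF-16LE", 0xff, 0xfe, 0x41, 0x00).length());
    }
}
```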

Charset names

A charset has a canonical name, returned by name(). Most charsets will also have one or more aliases, returned by aliases(). A charset can be looked up by canonical name or any of its aliases using forName(String).
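
To make the relationship between name(), aliases() and forName(String) concrete, the sketch below looks up every alias a charset reports and checks that it resolves to the same charset. The class and method names are hypothetical, and the actual set of aliases varies between implementations.

```java
import java.nio.charset.Charset;

class CharsetNames {
    /** True if every alias of the named charset resolves back to that charset. */
    static boolean allAliasesResolve(String name) {
        Charset charset = Charset.forName(name);
        for (String alias : charset.aliases()) {
            if (!Charset.forName(alias).equals(charset)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        System.out.println("canonical name: " + utf8.name());
        System.out.println("aliases: " + utf8.aliases());
        System.out.println("aliases resolve: " + allAliasesResolve("UTF-8"));
    }
}
```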

Guaranteed-available charsets

The following charsets are available on every Java implementation:

ISO-8859-1
US-ASCII
UTF-16
UTF-16BE
UTF-16LE
UTF-8

All of these charsets support both decoding and encoding. The charsets whose names begin "UTF" can represent all characters, as mentioned above. The "ISO-8859-1" and "US-ASCII" charsets can only represent small subsets of these characters. Except when required to do otherwise for compatibility, new code should use one of the UTF charsets listed above. The platform's default charset is UTF-8. (This is in contrast to some older implementations, where the default charset depended on the user's locale.)

Most implementations will support hundreds of charsets. Use availableCharsets() or isSupported(String) to see what's available. If you intend to use the charset if it's available, just call forName(String) and catch the exceptions it throws if the charset isn't available.
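
The "call forName(String) and catch the exceptions" advice could look like the following sketch. The helper forNameOrFallback is a hypothetical name, and falling back to another charset is just one possible policy.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

class CharsetLookup {
    /** Returns the named charset, or the fallback if the name is illegal or unsupported. */
    static Charset forNameOrFallback(String name, Charset fallback) {
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Either the name is not a legal charset name, or no such
            // charset is available in this runtime.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        System.out.println(forNameOrFallback("US-ASCII", utf8));
        System.out.println(forNameOrFallback("x-no-such-charset-123", utf8));
    }
}
```

This avoids a separate isSupported(String) call followed by a second lookup.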

Additional charsets can be made available by configuring one or more charset providers through provider configuration files. Such files are always named "java.nio.charset.spi.CharsetProvider" and are located in the "META-INF/services" directory of one or more classpath entries. The files should be encoded in UTF-8. Each line specifies the class name of a charset provider that extends CharsetProvider. A line should end with '\r', '\n' or '\r\n'. Leading and trailing whitespace is trimmed. Blank lines are ignored, as are lines that (after trimming) start with "#", which are treated as comments. Duplicates of names already found are also ignored. Both the configuration files and the provider classes are loaded using the thread context class loader.
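
A provider named in such a file might look roughly like the sketch below. Everything specific here is an assumption for illustration: the charset name "X-EXAMPLE", its alias, and both class names are made up, and the charset simply delegates to UTF-8 instead of implementing a real mapping.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.spi.CharsetProvider;
import java.util.Collections;
import java.util.Iterator;

// Hypothetical charset that reuses the UTF-8 decoder and encoder.
class ExampleCharset extends Charset {
    private final Charset delegate = Charset.forName("UTF-8");

    ExampleCharset() {
        super("X-EXAMPLE", new String[] {"x-example-alias"});
    }

    @Override public boolean contains(Charset charset) {
        return delegate.contains(charset);
    }

    @Override public CharsetDecoder newDecoder() {
        return delegate.newDecoder();
    }

    @Override public CharsetEncoder newEncoder() {
        return delegate.newEncoder();
    }
}

// To be loaded from a provider configuration file, a real provider would
// need to be a public class with a public no-argument constructor; it is
// package-private here only to keep the sketch in one file.
class ExampleCharsetProvider extends CharsetProvider {
    private final Charset example = new ExampleCharset();

    @Override public Iterator<Charset> charsets() {
        return Collections.singletonList(example).iterator();
    }

    @Override public Charset charsetForName(String name) {
        if (name.equalsIgnoreCase(example.name())) {
            return example;
        }
        for (String alias : example.aliases()) {
            if (name.equalsIgnoreCase(alias)) {
                return example;
            }
        }
        return null; // not a charset this provider knows about
    }
}
```

The matching configuration file would then contain a single line with the provider's fully qualified class name.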

Although this class is thread-safe, the CharsetDecoder and CharsetEncoder instances it returns are inherently stateful.
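
One way to respect that distinction is to share the Charset but create a new decoder for each operation, as in this sketch. The class name StrictDecoding and the method isWellFormedUtf8 are hypothetical; the use of CodingErrorAction.REPORT is a choice made for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

class StrictDecoding {
    // A Charset can be shared freely between threads.
    private static final Charset UTF_8 = Charset.forName("UTF-8");

    /** Returns true if the bytes decode as UTF-8 without any error. */
    static boolean isWellFormedUtf8(byte[] bytes) {
        // A decoder carries state, so a fresh one is created per call
        // instead of being shared between callers.
        CharsetDecoder decoder = UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf8(new byte[] {0x41}));        // true
        System.out.println(isWellFormedUtf8(new byte[] {(byte) 0xff})); // false
    }
}
```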

Summary

Protected Constructors
	Charset(String canonicalName, String[] aliases) Constructs a `Charset` object.

Public Methods
final Set<String>	aliases() Gets the set of this charset's aliases.
static SortedMap<String, Charset>	availableCharsets() Returns an immutable case-insensitive map from canonical names to `Charset` instances.
boolean	canEncode() Returns true if this charset supports encoding, false otherwise.
final int	compareTo(Charset charset) Compares this charset with the given charset.
abstract boolean	contains(Charset charset) Determines whether this charset is a superset of the given charset.
final CharBuffer	decode(ByteBuffer buffer) Returns a new `CharBuffer` containing the characters decoded from `buffer`.
static Charset	defaultCharset() Returns the system's default charset.
String	displayName() Gets the name of this charset for the default locale.
String	displayName(Locale l) Gets the name of this charset for the specified locale.
final ByteBuffer	encode(CharBuffer buffer) Returns a new `ByteBuffer` containing the bytes encoding the characters from `buffer`.
final ByteBuffer	encode(String s) Returns a new `ByteBuffer` containing the bytes encoding the characters from `s`.
final boolean	equals(Object obj) Determines whether this charset is equal to the given object.
static Charset	forName(String charsetName) Returns a `Charset` instance for the named charset.
final int	hashCode() Gets the hash code of this charset.
final boolean	isRegistered() Indicates whether this charset is known to be registered in the IANA Charset Registry.
static boolean	isSupported(String charsetName) Determines whether the specified charset is supported by this runtime.
final String	name() Gets the canonical name of this charset.
abstract CharsetDecoder	newDecoder() Gets a new instance of a decoder for this charset.
abstract CharsetEncoder	newEncoder() Gets a new instance of an encoder for this charset.
final String	toString() Gets a string representation of this charset.

Inherited Methods

From class java.lang.Object

Object	clone() Creates and returns a copy of this `Object`.
boolean	equals(Object o) Compares this instance with the specified object and indicates if they are equal.
void	finalize() Invoked when the garbage collector has detected that this instance is no longer reachable.
final Class<?>	getClass() Returns the unique instance of `Class` that represents this object's class.
int	hashCode() Returns an integer hash code for this object.
final void	notify() Causes a thread which is waiting on this object's monitor (by means of calling one of the `wait()` methods) to be woken up.
final void	notifyAll() Causes all threads which are waiting on this object's monitor (by means of calling one of the `wait()` methods) to be woken up.
String	toString() Returns a string containing a concise, human-readable description of this object.
final void	wait() Causes the calling thread to wait until another thread calls the `notify()` or `notifyAll()` method of this object.
final void	wait(long millis, int nanos) Causes the calling thread to wait until another thread calls the `notify()` or `notifyAll()` method of this object or until the specified timeout expires.
final void	wait(long millis) Causes the calling thread to wait until another thread calls the `notify()` or `notifyAll()` method of this object or until the specified timeout expires.

From interface java.lang.Comparable

Protected Constructors

protected Charset (String canonicalName, String[] aliases)

Added in API level 1

Constructs a Charset object. Duplicated aliases are ignored.

Parameters

canonicalName	the canonical name of the charset.
aliases	an array containing all aliases of the charset. May be null.

Throws

IllegalCharsetNameException	on an illegal value being supplied for either `canonicalName` or for any element of `aliases`.

Public Methods

public final Set<String> aliases ()

Added in API level 1

Gets the set of this charset's aliases.

Returns

an unmodifiable set of this charset's aliases.

public static SortedMap<String, Charset> availableCharsets ()

Added in API level 1

Returns an immutable case-insensitive map from canonical names to Charset instances. If multiple charsets have the same canonical name, it is unspecified which is returned in the map. This method may be slow. If you know which charset you're looking for, use forName(String).

Returns

an immutable case-insensitive map from canonical names to Charset instances
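
A short sketch of how the returned map might be used follows. The helper isListed is a hypothetical name, and because the method may be slow, the example fetches the map once per call rather than in a loop.

```java
import java.nio.charset.Charset;
import java.util.SortedMap;

class AvailableCharsetListing {
    /** True if the name appears as a key in the map of available charsets. */
    static boolean isListed(String canonicalName) {
        SortedMap<String, Charset> all = Charset.availableCharsets();
        return all.containsKey(canonicalName);
    }

    public static void main(String[] args) {
        // Print every canonical name known to this runtime, in sorted order.
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
        // Keys are matched without regard to case.
        System.out.println(isListed("utf-8"));
    }
}
```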

public boolean canEncode ()

Added in API level 1

Returns true if this charset supports encoding, false otherwise.

Returns

true if this charset supports encoding, false otherwise.

public final int compareTo (Charset charset)

Added in API level 1

Compares this charset with the given charset. This comparison is based on the case insensitive canonical names of the charsets.

Parameters

charset	the charset to compare with.

Returns

a negative integer if this charset is less than the given charset, a positive integer if it is greater, or 0 if the two are equal.
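
Since Charset is Comparable, a list of charsets can be sorted directly. The sketch below is a minimal illustration with a hypothetical helper, sortedNames.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class CharsetOrdering {
    /** Looks up each name, sorts the charsets, and returns their canonical names. */
    static List<String> sortedNames(String... names) {
        List<Charset> charsets = new ArrayList<Charset>();
        for (String name : names) {
            charsets.add(Charset.forName(name));
        }
        Collections.sort(charsets); // uses Charset.compareTo
        List<String> result = new ArrayList<String>();
        for (Charset charset : charsets) {
            result.add(charset.name());
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [ISO-8859-1, US-ASCII, UTF-8]
        System.out.println(sortedNames("UTF-8", "US-ASCII", "ISO-8859-1"));
    }
}
```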

public abstract boolean contains (Charset charset)

Added in API level 1

Determines whether this charset is a superset of the given charset. A charset C1 contains charset C2 if every character representable by C2 is also representable by C1. This means that lossless conversion is possible from C2 to C1 (but not necessarily the other way round). It does not imply that the two charsets use the same byte sequences for the characters they share.

Note that this method is allowed to be conservative, and some implementations may return false when this charset does contain the other charset. Android's implementation is precise, and will always return true in such cases.

Parameters

charset	a given charset.

Returns

true if this charset is a superset of the given charset, false if it's unknown or this charset is not a superset of the given charset.

public final CharBuffer decode (ByteBuffer buffer)

Added in API level 1

Returns a new CharBuffer containing the characters decoded from buffer. This method uses CodingErrorAction.REPLACE.

Applications should generally create a CharsetDecoder using newDecoder() for performance.

Parameters

buffer	the byte buffer containing the content to be decoded.

Returns

a character buffer containing the output of the decoding.
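
The effect of CodingErrorAction.REPLACE can be seen by decoding input that contains an invalid byte. In this sketch the helper decodeUtf8 is hypothetical, and U+FFFD is the replacement character a decoder uses unless it has been configured otherwise.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;

class LenientDecoding {
    /** Decodes the given byte values (0..255) as UTF-8 using Charset.decode. */
    static String decodeUtf8(int... values) {
        byte[] raw = new byte[values.length];
        for (int i = 0; i < values.length; i++) {
            raw[i] = (byte) values[i];
        }
        return Charset.forName("UTF-8").decode(ByteBuffer.wrap(raw)).toString();
    }

    public static void main(String[] args) {
        // 0xff can never appear in well-formed UTF-8, so it is replaced
        // instead of causing an exception.
        String decoded = decodeUtf8(0x41, 0xff, 0x42);
        System.out.println(decoded.length());                // 3
        System.out.println(decoded.charAt(1) == '\uFFFD');   // true
    }
}
```

Callers that need to detect bad input rather than silently replace it would create a CharsetDecoder with newDecoder() and configure its error actions.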

public static Charset defaultCharset ()

Added in API level 1

Returns the system's default charset. This is determined during VM startup, and will not change thereafter. On Android, the default charset is UTF-8.

public String displayName ()

Added in API level 1

Gets the name of this charset for the default locale.

The default implementation returns the canonical name of this charset. Subclasses may return a localized display name.

Returns

the name of this charset for the default locale.

public String displayName (Locale l)

Added in API level 1

Gets the name of this charset for the specified locale.

The default implementation returns the canonical name of this charset. Subclasses may return a localized display name.

Parameters

l	the locale for which to get the display name

Returns

the name of this charset for the specified locale

public final ByteBuffer encode (CharBuffer buffer)

Added in API level 1

Returns a new ByteBuffer containing the bytes encoding the characters from buffer. This method uses CodingErrorAction.REPLACE.

Applications should generally create a CharsetEncoder using newEncoder() for performance.

Parameters

buffer	the character buffer containing the content to be encoded.

Returns

the result of the encoding.

public final ByteBuffer encode (String s)

Added in API level 1

Returns a new ByteBuffer containing the bytes encoding the characters from s. This method uses CodingErrorAction.REPLACE.

Applications should generally create a CharsetEncoder using newEncoder() for performance.

Parameters

s	the string to be encoded.

Returns

the result of the encoding.

public final boolean equals (Object obj)

Added in API level 1

Determines whether this charset is equal to the given object. They are considered to be equal if they have the same canonical name.

Parameters

obj	the given object to be compared with.

Returns

true if they have the same canonical name, otherwise false.

public static Charset forName (String charsetName)

Added in API level 1

Returns a Charset instance for the named charset.

Parameters

charsetName	a charset name (either canonical or an alias)

Throws

IllegalCharsetNameException	if the specified charset name is illegal.
UnsupportedCharsetException	if the desired charset is not supported by this runtime.

public final int hashCode ()

Added in API level 1

Gets the hash code of this charset.

Returns

the hash code of this charset.

public final boolean isRegistered ()

Added in API level 1

Indicates whether this charset is known to be registered in the IANA Charset Registry.

Returns

true if the charset is known to be registered, otherwise returns false.

public static boolean isSupported (String charsetName)

Added in API level 1

Determines whether the specified charset is supported by this runtime.

Parameters

charsetName	the name of the charset.

Returns

true if the specified charset is supported, otherwise false.

Throws

IllegalCharsetNameException	if the specified charset name is illegal.

public final String name ()

Added in API level 1

Gets the canonical name of this charset.

Returns

this charset's name in canonical form.

public abstract CharsetDecoder newDecoder ()

Added in API level 1

Gets a new instance of a decoder for this charset.

Returns

a new instance of a decoder for this charset.

public abstract CharsetEncoder newEncoder ()

Added in API level 1

Gets a new instance of an encoder for this charset.

Returns

a new instance of an encoder for this charset.

public final String toString ()

Added in API level 1

Gets a string representation of this charset. Usually this contains the canonical name of the charset.

Returns

a string representation of this charset.
