This is Info file pm.info, produced by Makeinfo version 1.68 from the
input file bigpm.texi.

File: pm.info, Node: URI/Bookmarks/Netscape, Next: URI/Escape, Prev: URI/Bookmarks, Up: Module List

Perl module containing routines for Netscape bookmark files
***********************************************************

NAME
====

   URI::Bookmarks::Netscape - Perl module containing routines for
Netscape bookmark files

SYNOPSIS
========

   See *Note URI/Bookmarks: URI/Bookmarks,.

DESCRIPTION
===========

   URI::Bookmarks::Netscape contains some helper routines specifically
for URI::Bookmarks objects which were originally Netscape bookmark
files.

AUTHOR
======

   Adam Spiers

SEE ALSO
========

   *Note URI/Bookmarks: URI/Bookmarks,, `URI::Bookmarks::*',
*Note URI/Bookmark: URI/Bookmark,, `URI::Bookmark::*', `perl(1)'.

File: pm.info, Node: URI/Escape, Next: URI/Find, Prev: URI/Bookmarks/Netscape, Up: Module List

Escape and unescape unsafe characters
*************************************

NAME
====

   URI::Escape - Escape and unescape unsafe characters

SYNOPSIS
========

     use URI::Escape;
     $safe = uri_escape("10% is enough\n");
     $verysafe = uri_escape("foo", "\0-\377");
     $str  = uri_unescape($safe);

DESCRIPTION
===========

   This module provides functions to escape and unescape URI strings
as defined by RFC 2396.  URIs consist of a restricted set of
characters, denoted as `uric' in RFC 2396.  The restricted set of
characters consists of digits, letters, and a few graphic symbols
chosen from those common to most of the character encodings and input
facilities available to Internet users:

     "A" .. "Z", "a" .. "z", "0" .. "9",
     ";", "/", "?", ":", "@", "&", "=", "+", "$", ",",   # reserved
     "-", "_", ".", "!", "~", "*", "'", "(", ")"

   In addition, any byte (octet) can be represented in a URI by an
escape sequence: a triplet consisting of the character "%" followed by
two hexadecimal digits.
Bytes can also be represented directly by a character using the
US-ASCII character for that octet (iff the character is part of
`uric').

   Some of the `uric' characters are reserved for use as delimiters or
as part of certain URI components.  These must be escaped if they are
to be treated as ordinary data.  Read RFC 2396 for further details.

   The functions provided (and exported by default) from this module
are:

uri_escape($string, [$unsafe])
     This function replaces all unsafe characters in the $string with
     their escape sequences and returns the result.

     The uri_escape() function takes an optional second argument that
     overrides the set of characters that are to be escaped.  The set
     is specified as a string that can be used in a regular expression
     character class (between [ ]).  E.g.:

          "\x00-\x1f\x7f-\xff"          # all control and hi-bit characters
          "a-z"                         # all lower case characters
          "^A-Za-z"                     # everything not a letter

     The default set of characters to be escaped is all those which
     are not part of the `uric' character class shown above.

uri_unescape($string,...)
     Returns a string with all %XX sequences replaced with the actual
     byte (octet).

     This does the same as:

          $string =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;

     but does not modify the string in-place as this RE would.  Using
     the uri_unescape() function instead of the RE might make the code
     look cleaner and is a few characters less to type.

     In a simple benchmark test I made I got something like 40%
     slowdown by calling the function (instead of the inline RE above)
     if a few chars were unescaped and something like 700% slowdown if
     none were.  If you are going to unescape a lot of times it might
     be a good idea to inline the RE.

     If the uri_unescape() function is passed multiple strings, then
     each one is unescaped and returned.

   The module can also export the `%escapes' hash which contains the
mapping from all 256 bytes to the corresponding escape code.  Lookup in
this hash is faster than evaluating `sprintf("%%%02X", ord($byte))'
each time.
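
   The escaping scheme described above can be sketched in core Perl
without loading the module.  This is only an illustration: the names
`%my_escapes', `sketch_escape' and `sketch_unescape' are made up for
this example and are not part of URI::Escape, and the default character
class is simply the complement of the `uric' list shown earlier.

```perl
use strict;
use warnings;

# Lookup table from each of the 256 bytes to its %XX escape, in the
# spirit of the %escapes hash (this is a local stand-in, not the
# module's own hash).
my %my_escapes = map { chr($_) => sprintf("%%%02X", $_) } 0 .. 255;

# Hypothetical helper: escape every byte matched by the character
# class $unsafe.  The default is "everything outside `uric'".
sub sketch_escape {
    my ($text, $unsafe) = @_;
    $unsafe = q{^A-Za-z0-9;/?:@&=+$,\-_.!~*'()} unless defined $unsafe;
    $text =~ s/([$unsafe])/$my_escapes{$1}/g;
    return $text;
}

# Hypothetical helper: the inline RE from the text, applied to a copy
# so that the caller's string is left untouched.
sub sketch_unescape {
    my ($text) = @_;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $text;
}

my $safe = sketch_escape("10% is enough\n");
print "$safe\n";
print sketch_escape("foo", "a-z"), "\n";
```

   Passing a second argument such as "a-z" restricts escaping to that
class, which mirrors the optional $unsafe argument of uri_escape().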

SEE ALSO
========

   *Note URI: URI,

COPYRIGHT
=========

   Copyright 1995-2000 Gisle Aas.

   This program is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

File: pm.info, Node: URI/Find, Next: URI/Heuristic, Prev: URI/Escape, Up: Module List

Find URIs in arbitrary text
***************************

NAME
====

   URI::Find - Find URIs in arbitrary text

SYNOPSIS
========

     use URI::Find;
     $how_many_found = find_uris($text, \&callback);

DESCRIPTION
===========

   This module does one thing: Finds URIs and URLs in plain text.  It
finds them quickly and it finds them all (or what URI::URL considers a
URI to be.)  It employs a series of heuristics to:

     Find schemeless URIs (i.e. www.foo.com)
     Avoid picking up trailing characters from the text
     Avoid picking up URL-like things such as perl module names.

Functions
---------

   URI::Find exports one function, find_uris().  It takes two
arguments: the first is a text string to search, the second is a
function reference.

   The function is a callback which is called on each URI found.  It is
passed two arguments: the first is a URI::URL object representing the
URI found, the second is the original text of the URI found.  The
return value of the callback will replace the original URI in the
text.

EXAMPLES
========

   Simply print the original URI text found and the normalized
representation.

     find_uris($text,
               sub {
                   my($uri, $orig_uri) = @_;
                   print "The text '$orig_uri' represents '$uri'\n";
                   return $orig_uri;
               });

   Check each URI in document to see if it exists.

     use LWP::Simple;

     find_uris($text, sub {
                          my($uri, $orig_uri) = @_;
                          if( head $uri ) {
                              print "$orig_uri is okay\n";
                          }
                          else {
                              print "$orig_uri cannot be found\n";
                          }
                          return $orig_uri;
                      });

   Wrap each URI found in an HTML anchor.

     find_uris($text, sub {
                          my($uri, $orig_uri) = @_;
                          return qq|<a href="$uri">$orig_uri</a>|;
                      });

CAVEATS, BUGS, ETC...
=====================

   RFC 2396 Appendix E suggests wrapping a URL in angle brackets, in
the form '<URI>' or '<URL:URI>', when putting URLs in plain text.
URI::Find accommodates this suggestion and considers the entire thing
(brackets and all) to be part of the URL found.  This means that when
find_uris() sees a URL wrapped in angle brackets it will hand that
entire string to your callback, not just the URL.

   NOTE:  The prototype on find_uris() is already getting annoying to
me.  I might remove it in a future version.

SEE ALSO
========

   *Note URI/URL: URI/URL,, *Note URI: URI,, RFC 2396 (especially
Appendix E)

AUTHOR
======

   Michael G Schwern with insight from Uri Gutman, Greg Bacon and Jeff
Pinyan.

File: pm.info, Node: URI/Heuristic, Next: URI/Sequin, Prev: URI/Find, Up: Module List

Expand URI using heuristics
***************************

NAME
====

   uf_uristr - Expand URI using heuristics

SYNOPSIS
========

     use URI::Heuristic qw(uf_uristr);
     $u = uf_uristr("perl");             # http://www.perl.com
     $u = uf_uristr("www.sol.no/sol");   # http://www.sol.no/sol
     $u = uf_uristr("aas");              # http://www.aas.no
     $u = uf_uristr("ftp.funet.fi");     # ftp://ftp.funet.fi
     $u = uf_uristr("/etc/passwd");      # file:/etc/passwd

DESCRIPTION
===========

   This module provides functions that expand strings into real
absolute URIs using some builtin heuristics.  Strings that already
represent absolute URIs (i.e. start with a `scheme:' part) are never
modified and are returned unchanged.  The main use of these functions
is to allow abbreviated URIs similar to what many web browsers allow
for URIs typed in by the user.

   The following functions are provided:

uf_uristr($str)
     The uf_uristr() function will try to make the string passed as
     argument into a proper absolute URI string.  The "uf_" prefix
     stands for "User Friendly".  Under MacOS, it assumes that any
     string with a common URL scheme (http, ftp, etc.) is a URL rather
     than a local path.  So don't name your volumes after common URL
     schemes and expect uf_uristr() to construct valid file: URLs on
     those volumes for you, because it won't.

uf_uri($str)
     This function works the same way as uf_uristr() but it will
     return a `URI' object.
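
   To give a feel for the kind of rules involved, here is a much
simplified sketch that reproduces only the dotted-name and path cases
from the SYNOPSIS.  It is not the module's implementation: the name
`sketch_uristr' is invented for this example, host:port strings are not
handled, and dotless names such as "perl" are not guessed at, since
that step depends on the environment and on host name lookups.

```perl
use strict;
use warnings;

# Hypothetical helper covering a few of the cases shown in the SYNOPSIS.
sub sketch_uristr {
    my ($str) = @_;
    # Strings that already start with a `scheme:' part are returned as is.
    return $str         if $str =~ /^[A-Za-z][A-Za-z0-9+.\-]*:/;
    # An absolute path is taken to be a local file.
    return "file:$str"  if $str =~ m{^/};
    # A host name starting with "ftp." suggests the ftp scheme.
    return "ftp://$str" if $str =~ /^ftp\./i;
    # Everything else is assumed to be a web address.
    return "http://$str";
}

print sketch_uristr($_), "\n"
    for "www.sol.no/sol", "ftp.funet.fi", "/etc/passwd", "http://www.perl.com/";
```

   The order of the tests matters: the scheme check comes first so that
absolute URIs are never rewritten, as the description above requires.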

ENVIRONMENT
===========

   If the hostname portion of a URI does not contain any dots, then
certain qualified guesses will be made.  These guesses are governed by
the following two environment variables.

COUNTRY
     This is the two letter country code (ISO 3166) for your location.
     If the domain name of your host ends with two letters, then it is
     taken to be the default country.  See also *Note Locale/Country:
     Locale/Country,.

URL_GUESS_PATTERN
     Contains a space separated list of URL patterns to try.  The
     string "ACME" is for some reason used as a placeholder for the
     host name in the URL provided.  Example:

          URL_GUESS_PATTERN="www.ACME.no www.ACME.se www.ACME.com"
          export URL_GUESS_PATTERN

     Specifying URL_GUESS_PATTERN disables any guessing rules based on
     country.  An empty URL_GUESS_PATTERN disables any guessing that
     involves host name lookups.

COPYRIGHT
=========

   Copyright 1997-1998, Gisle Aas

   This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

File: pm.info, Node: URI/Sequin, Next: URI/URL, Prev: URI/Heuristic, Up: Module List

Extract information from the URLs of Search-Engines
***************************************************

NAME
====

   URI::Sequin - Extract information from the URLs of Search-Engines

SYNOPSIS
========

     use URI::Sequin qw/se_extract key_extract log_extract %log_types/;

     $url = &log_extract($line_from_log_file, 'NCSA');

     $log_types{'MyLogType'} = '^(.+?) -> .+$';
     $url = &log_extract($line_from_log_file, 'MyLogType');

     $keyword_string = &key_extract($url);

     ($search_engine_name, $search_engine_url) = @{&se_extract($url)};

DESCRIPTION
===========

   This module provides three tools to aid people trying to analyse
Search-Engine URLs.  It's meant mainly for those who want to analyse
referrer logs and pick out key information about site visitors, such as
which Search-Engine and keywords they used to find the site.

   The functions and globals provided (and exported by default) from
this module are:

log_extract($log_line, 'Type')
     This will pick out the referring URL from a line of a logfile.
     The 'Type' can be one of the built-in types or a user-created
     one.  For more information, see %log_types below.  This
     subroutine accepts a scalar, and returns a scalar.

key_extract($url)
     This will try to determine the keywords used in $url.  It accepts
     a scalar and returns a scalar.  Should nothing be found, it
     returns an undefined value.

se_extract($url)
     This will try to determine the name of the Search-Engine used and
     its URL.  It accepts a scalar, and returns a reference to an array
     containing firstly the Search-Engine's name and secondly the
     Search-Engine's URL.  Should the URL appear not to be from a
     Search Query, it returns a reference to an empty array.

%log_types
     There are five built-in logfile types already in this hash.  They
     are:

        * IIS1 - Microsoft IIS 3.0 and 2.0

        * IIS2 - Microsoft IIS4.0 (W3SVC format)

        * NCSA - For APACHE, NETSCAPE and any other NCSA format logs

        * ORW - O'Reilly WebSite format

        * General - A generalised one that will work with most
          logfiles

     It's easy to add another one.  Simply add a key to the hash, with
     a value that is a regex.  Parenthesise the part that is the
     referring URL, as the script uses $1 to obtain the URL (see the
     example in the Synopsis section).

AUTHOR
======

   Peter Sergeant

COPYRIGHT
=========

   Copyright 2001 Peter Sergeant.

   This program is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.
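
   The %log_types mechanism of URI::Sequin described above, a hash of
regexes whose first parenthesised group is the referring URL, can be
imitated in a few lines of core Perl.  The sketch below is an
assumption about how such a lookup might work, not the module's code;
`%my_log_types' and `sketch_log_extract' are names invented for the
example, and the sample log line is made up.

```perl
use strict;
use warnings;

# Local stand-in for %log_types, holding the user-defined pattern from
# the SYNOPSIS.
my %my_log_types = ( MyLogType => '^(.+?) -> .+$' );

# Hypothetical helper: apply the regex registered for $type and return
# the text captured by its first group, or undef if nothing matches.
sub sketch_log_extract {
    my ($line, $type) = @_;
    my $re = $my_log_types{$type};
    return undef unless defined $re;
    return $line =~ /$re/ ? $1 : undef;
}

my $line = "http://search.example/find?q=perl+modules -> /index.html";
my $url  = sketch_log_extract($line, 'MyLogType');
print defined $url ? "$url\n" : "no referrer found\n";
```

   Because the pattern uses a non-greedy `.+?', the capture stops at
the first " -> " separator in the line.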

File: pm.info, Node: URI/URL, Next: URI/WithBase, Prev: URI/Sequin, Up: Module List

Uniform Resource Locators
*************************

NAME
====

   URI::URL - Uniform Resource Locators

SYNOPSIS
========

     $u1 = URI::URL->new($str, $base);
     $u2 = $u1->abs;

DESCRIPTION
===========

   This module is provided for backwards compatibility with modules
that depend on the interface provided by the URI::URL class that used
to be distributed with the libwww-perl library.

   The following differences compared to the `URI' class interface
exist:

   * The URI::URL module exports the url() function as an alternate
     constructor interface.

   * The constructor takes an optional $base argument.  See *Note
     URI/WithBase: URI/WithBase,.

   * The URI::URL->newlocal class method is the same as
     URI::file->new_abs.

   * URI::URL::strict(1)

   * $url->print_on method

   * $url->crack method

   * $url->full_path; same as ($uri->abs_path || "/")

   * $url->netloc; same as $uri->authority

   * $url->epath, $url->equery; same as $uri->path, $uri->query

   * $url->path and $url->query pass unescaped strings.

   * $url->path_components; same as $uri->path_segments (if you don't
     consider path segment parameters).

   * $url->params and $url->eparams methods.

   * $url->base method.  See *Note URI/WithBase: URI/WithBase,.

   * $url->abs and $url->rel have an optional $base argument.  See
     *Note URI/WithBase: URI/WithBase,.

   * $url->frag; same as $uri->fragment

   * $url->keywords; same as $uri->query_keywords

   * $url->localpath with friends map to $uri->file

   * $url->address and $url->encoded822addr; same as $uri->to for
     mailto URI.

   * $url->groupart method for news URI.

   * $url->article; same as $uri->message

SEE ALSO
========

   *Note URI: URI,, *Note URI/WithBase: URI/WithBase,

COPYRIGHT
=========

   Copyright 1998-2000 Gisle Aas.

File: pm.info, Node: URI/WithBase, Next: URI/data, Prev: URI/URL, Up: Module List

URI which remember their base
*****************************

NAME
====

   URI::WithBase - URI which remember their base

SYNOPSIS
========

     $u1 = URI::WithBase->new($str, $base);
     $u2 = $u1->abs;

     $base = $u1->base;
     $u1->base( $new_base )

DESCRIPTION
===========

   This module provides the `URI::WithBase' class.  Objects of this
class are like `URI' objects, but can keep their base too.

   The methods provided in addition to or modified from those of `URI'
are:

$uri = URI::WithBase->new($str, [$base])
     The constructor takes an optional base URI as the second
     argument.

$uri->base( [$new_base] )
     This method can be used to get or set the value of the base
     attribute.

$uri->abs( [$base_uri] )
     The $base_uri argument is now made optional as the object carries
     its base with it.

$uri->rel( [$base_uri] )
     The $base_uri argument is now made optional as the object carries
     its base with it.

SEE ALSO
========

   *Note URI: URI,

COPYRIGHT
=========

   Copyright 1998-2000 Gisle Aas.

File: pm.info, Node: URI/data, Next: URI/file, Prev: URI/WithBase, Up: Module List

URI that contain immediate data
*******************************

NAME
====

   URI::data - URI that contain immediate data

SYNOPSIS
========

     use URI;

     $u = URI->new("data:");
     $u->media_type("image/gif");
     $u->data(scalar(`cat camel.gif`));
     print "$u\n";
     open(XV, "|xv -") and print XV $u->data;

DESCRIPTION
===========

   The `URI::data' class supports `URI' objects belonging to the data
URI scheme.  The data URI scheme is specified in RFC 2397.  It allows
inclusion of small data items as "immediate" data, as if it had been
included externally.
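
   As a rough illustration of what "immediate" data looks like, the
following sketch assembles a data URI by hand and keeps whichever of
the %-escaped and base64 forms is shorter.  It is only a sketch under
stated assumptions: `sketch_data_uri' is a name invented here, the set
of characters left unescaped is a guess rather than the set URI::data
uses, and MIME::Base64 is assumed to be installed.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: build "data:<type>,<escaped>" and
# "data:<type>;base64,<encoded>" and return the shorter string.
sub sketch_data_uri {
    my ($media_type, $bytes) = @_;

    # %-escape everything except letters, digits and a few marks
    # (an assumed set, chosen for this example).
    (my $escaped = $bytes) =~
        s/([^A-Za-z0-9\-_.!~*'()])/sprintf("%%%02X", ord($1))/eg;

    my $plain = "data:$media_type,$escaped";
    my $b64   = "data:$media_type;base64," . encode_base64($bytes, "");

    return length($b64) < length($plain) ? $b64 : $plain;
}

print sketch_data_uri("", "Perl is good"), "\n";
print sketch_data_uri("image/gif", "\0\1\2\3\4\5"), "\n";
```

   Mostly textual data tends to come out shorter when escaped, while
binary data tends to come out shorter as base64.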

   Examples:

     data:,Perl%20is%20good

     data:image/gif;base64,R0lGODdhIAAgAIAAAAAAAPj8+CwAAAAAI
       AAgAAAClYyPqcu9AJyCjtIKc5w5xP14xgeO2tlY3nWcajmZZdeJcG
       Kxrmimms1KMTa1Wg8UROx4MNUq1HrycMjHT9b6xKxaFLM6VRKzI+p
       KS9XtXpcbdun6uWVxJXA8pNPkdkkxhxc21LZHFOgD2KMoQXa2KMWI
       JtnE2KizVUkYJVZZ1nczBxXlFopZBtoJ2diXGdNUymmJdFMAADs=

   `URI' objects belonging to the data scheme support the common
methods (described in *Note URI: URI,) and the following two scheme
specific methods:

$uri->media_type( [$new_media_type] )
     This method can be used to get or set the media type specified in
     the URI.  If no media type is specified, then the default
     `"text/plain;charset=US-ASCII"' is returned.

$uri->data( [$new_data] )
     This method can be used to get or set the data contained in the
     URI.  The data is passed unescaped (in binary form).  The
     decision about whether to base64 encode the data in the URI is
     taken automatically based on what encoding produces the shortest
     URI string.

SEE ALSO
========

   *Note URI: URI,

COPYRIGHT
=========

   Copyright 1995-1998 Gisle Aas.

   This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

File: pm.info, Node: URI/file, Next: URI/ldap, Prev: URI/data, Up: Module List

URI that map to local file names
********************************

NAME
====

   URI::file - URI that map to local file names

SYNOPSIS
========

     use URI::file;

     $u1 = URI->new("file:/foo/bar");
     $u2 = URI->new("foo/bar", "file");

     $u3 = URI::file->new($path);
     $u4 = URI::file->new("c:\\windows\\", "win32");

     $u1->file;
     $u1->file("mac");

DESCRIPTION
===========

   The `URI::file' class supports `URI' objects belonging to the file
URI scheme.  This scheme allows us to map the conventional file names
found on various computer systems to the URI name space.  An old
specification of the file URI scheme is found in RFC 1738.  Some older
background information is also in RFC 1630.  There are no newer
specifications as far as I know.

   If you simply want to construct file URI objects from URI strings,
use the normal `URI' constructor.  If you want to construct file URI
objects from the actual file names used by various systems, then use
one of the following `URI::file' constructors:

$u = URI::file->new( $filename, [$os] )
     Maps a file name to the file: URI name space, creates a URI
     object and returns it.  The $filename is interpreted as one
     belonging to the indicated operating system ($os), which defaults
     to the value of the $^O variable.  The $filename can be either
     absolute or relative, and the corresponding type of URI object
     for $os is returned.

$u = URI::file->new_abs( $filename, [$os] )
     Same as URI::file->new, but will make sure that the URI returned
     represents an absolute file name.  If the $filename argument is
     relative, then the name is resolved relative to the current
     directory, i.e. this constructor is really the same as:

          URI::file->new($filename)->abs(URI::file->cwd);

$u = URI::file->cwd
     Returns a file URI that represents the current working directory.
     See *Note Cwd: Cwd,.

   The following methods are supported for file URI (in addition to the
common and generic methods described in *Note URI: URI,):

$u->file( [$os] )
     This method returns a file name.  It maps from the URI name space
     to the file name space of the indicated operating system.

     It might return undef if the name can not be represented in the
     indicated file system.

$u->dir( [$os] )
     Some systems use a different form for names of directories than
     for plain files.  Use this method if you know you want to use the
     name for a directory.

   The `URI::file' module can be used to map generic file names to
names suitable for the current system.  As such, it can work as a nice
replacement for the File::Spec module.  For instance the following code
will translate the Unix style file name `Foo/Bar.pm' to a name suitable
for the local system.

     $file = URI::file->new("Foo/Bar.pm", "unix")->file;
     die "Can't map filename Foo/Bar.pm for $^O" unless defined $file;
     open(FILE, $file) || die "Can't open '$file': $!";
     # do something with FILE

MAPPING NOTES
=============

   Most computer systems today have hierarchically organized file
systems.  Mapping the names used in these systems to the generic URI
syntax allows us to work with relative file URIs that behave as they
should when resolved using the generic algorithm for URIs (specified in
RFC 2396).  Mapping a file name to the generic URI syntax involves
mapping the path separator character to "/" and encoding any reserved
characters that appear in the path segments of the file names.  If path
segments consisting of the strings "." or ".." have a different meaning
than what is specified for generic URIs, then these must be encoded as
well.

   If the file system has device, volume or drive specifications as
the root of the name space, then it makes sense to map them to the
authority field of the generic URI syntax.  This makes sure that
relative URIs can not be resolved "above" them, i.e. generally how
relative file names work in those systems.

   Another common use of the authority field is to encode the host that
this file name is valid on.  The host name "localhost" is special and
generally has the same meaning as a missing or empty authority field.
This use will be in conflict with using it as a device specification,
but can often be resolved for device specifications having characters
not legal in plain host names.

   File name to URI mapping is normally not one-to-one.  There are
usually many URIs that map to the same file name.  For instance an
authority of "localhost" maps the same as a URI with a missing or empty
authority.

   Example 1: The Mac uses ":" as path separator, but not in the same
way as generic URI.  ":foo" is a relative name.  "foo:bar" is an
absolute name.  Also path segments can contain the "/" character as
well as be literal "." or "..".
It means that we will map like this:

     Mac                   URI
     ----------            -------------------
     :foo:bar     <==>     foo/bar
     :            <==>     ./
     ::foo:bar    <==>     ../foo/bar
     :::          <==>     ../../
     foo:bar      <==>     file:/foo/bar
     foo:bar:     <==>     file:/foo/bar/
     ..           <==>     %2E%2E
     <undef>      <==      /
     foo/         <==      file:/foo%2F
     ./foo.txt    <==      file:/.%2Ffoo.txt

   Note that if you want a relative URL, you *must* begin the path with
a :.  Any path that begins with [^:] will be treated as absolute.

   Example 2: The Unix file system is easy to map as it uses the same
path separator as URIs, has a single root, and segments of "." and ".."
have the same meaning.  URIs that have the character "\0" or "/" as
part of any path segment can not be turned into valid Unix file names.

     Unix                  URI
     ----------            ------------------
     foo/bar      <==>     foo/bar
     /foo/bar     <==>     file:/foo/bar
     /foo/bar     <==      file://localhost/foo/bar
     file:         ==>     ./file:
     <undef>      <==      file:/fo%00/bar
     /            <==>     file:/

SEE ALSO
========

   *Note URI: URI,, *Note File/Spec: File/Spec,, *Note Perlport:
(perl.info)perlport,

COPYRIGHT
=========

   Copyright 1995-1998 Gisle Aas.

   This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

File: pm.info, Node: URI/ldap, Next: Unicode/CharName, Prev: URI/file, Up: Module List

LDAP Uniform Resource Locators
******************************

NAME
====

   URI::ldap - LDAP Uniform Resource Locators

SYNOPSIS
========

     use URI;

     $uri = URI->new("ldap:$uri_string");
     $dn     = $uri->dn;
     $filter = $uri->filter;
     @attr   = $uri->attributes;
     $scope  = $uri->scope;
     %extn   = $uri->extensions;

     $uri = URI->new("ldap:");  # start empty
     $uri->host("ldap.itd.umich.edu");
     $uri->dn("o=University of Michigan,c=US");
     $uri->attributes(qw(postalAddress));
     $uri->scope('sub');
     $uri->filter('(cn=Babs Jensen)');
     print $uri->as_string,"\n";

DESCRIPTION
===========

   URI::ldap provides an interface to parse an LDAP URI into its
constituent parts and also to build a URI as described in RFC 2255.

METHODS
=======

   URI::ldap supports all the generic and server methods defined by
*Note URI: URI,, plus the following.

   Each of the following methods can be used to set or get the value in
the URI.  The values are passed in unescaped form.  None of these will
return undefined values, but elements without a default can be empty.
If arguments are given then a new value will be set for the given part
of the URI.

$uri->dn( [$new_dn] )
     Set or get the *Distinguished Name* part of the URI.  The DN
     identifies the base object of the LDAP search.

$uri->attributes( [@new_attrs] )
     Set or get the list of attribute names which will be returned by
     the search.

$uri->scope( [$new_scope] )
     Set or get the scope that the search will use.  The value can be
     one of `"base"', `"one"' or `"sub"'.  If none is given in the URI
     then the return value will default to `"base"'.

$uri->_scope( [$new_scope] )
     Same as scope(), but does not default to anything.

$uri->filter( [$new_filter] )
     Set or get the filter that the search will use.  If none is given
     in the URI then the return value will default to
     `"(objectClass=*)"'.

$uri->_filter( [$new_filter] )
     Same as filter(), but does not default to anything.

$uri->extensions( [$etype => $evalue,...] )
     Set or get the extensions used for the search.  The list passed
     should be in the form etype1 => evalue1, etype2 => evalue2,...
     This is also the form of list that will be returned.

SEE ALSO
========

   RFC 2255

AUTHOR
======

   Graham Barr <`gbarr@pobox.com'>

   Slightly modified by Gisle Aas to fit into the URI distribution.

COPYRIGHT
=========

   Copyright (c) 1998 Graham Barr.  All rights reserved.  This program
is free software; you can redistribute it and/or modify it under the
same terms as Perl itself.
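
   The parts handled by the URI::ldap methods above follow the layout
ldap://host/dn?attributes?scope?filter?extensions from RFC 2255.  The
sketch below splits such a string by hand and applies the same
defaults as scope() and filter().  It is an illustration only:
`sketch_parse_ldap' and `unesc' are invented names, and extensions are
left out for brevity.

```perl
use strict;
use warnings;

# Replace %XX sequences by the bytes they stand for.
sub unesc {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $s;
}

# Hypothetical helper: split an LDAP URL into host, dn, attributes,
# scope and filter.  Splitting happens before unescaping so that an
# escaped "?" or "," inside a value is not mistaken for a separator.
sub sketch_parse_ldap {
    my ($url) = @_;
    my ($host, $rest) = $url =~ m{^ldap://([^/?]*)(?:/(.*))?$}is
        or return;
    $rest = "" unless defined $rest;

    my @part = split /\?/, $rest, 5;
    push @part, "" while @part < 5;

    return {
        host       => $host,
        dn         => unesc($part[0]),
        attributes => [ map { unesc($_) } split /,/, $part[1] ],
        scope      => $part[2] eq "" ? "base" : unesc($part[2]),
        filter     => $part[3] eq "" ? "(objectClass=*)" : unesc($part[3]),
    };
}

my $p = sketch_parse_ldap(
    "ldap://ldap.itd.umich.edu/o=University%20of%20Michigan,c=US"
    . "?postalAddress?sub?(cn=Babs%20Jensen)");
print "$p->{dn} [$p->{scope}] $p->{filter}\n";
```

   Leaving the scope or filter field empty yields "base" and
"(objectClass=*)" respectively, matching the defaults documented for
scope() and filter().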

File: pm.info, Node: Unicode/CharName, Next: Unicode/Map, Prev: URI/ldap, Up: Module List

Look up Unicode character names
*******************************

NAME
====

   Unicode::CharName - Look up Unicode character names

SYNOPSIS
========

     use Unicode::CharName qw(uname ublock);
     print uname(ord('%')), "\n";
     print ublock(0x0300), "\n";

DESCRIPTION
===========

   This module provides two functions named uname() and ublock().

   The uname() function will return the Unicode character name for the
given code (a number between 0 and 0x10FFFF).  Unicode character names
are written in upper-case ASCII letters, and are strings like:

     LATIN CAPITAL LETTER A
     LATIN SMALL LETTER A WITH RING ABOVE
     CJK UNIFIED IDEOGRAPH 7C80
     HANGUL SYLLABLE PWILH

   The ublock() function will return the name of the Unicode character
block that the given character belongs to.

SEE ALSO
========

   *Note Unicode/String: Unicode/String,

COPYRIGHT
=========

   Copyright 1997 Gisle Aas.

   This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.

   Name table extracted from the Unicode 2.0 Character Database.
Copyright (c) 1991-1996 Unicode, Inc.  All Rights reserved.

File: pm.info, Node: Unicode/Map, Next: Unicode/Map8, Prev: Unicode/CharName, Up: Module List

maps charsets from and to utf16 unicode
***************************************

NAME
====

   Unicode::Map V0.108 - maps charsets from and to utf16 unicode

SYNOPSIS
========

     use Unicode::Map();

     $Map = new Unicode::Map("ISO-8859-1");

     $utf16 = $Map -> to_unicode ("Hello world!");
       => $utf16 == "\0H\0e\0l\0l\0o\0 \0w\0o\0r\0l\0d\0!"

     $locale = $Map -> from_unicode ($utf16);
       => $locale == "Hello world!"

   A more detailed description below.

   2do: short note about perl's Unicode perspectives.

DESCRIPTION
===========

   This module converts strings from and to 2-byte Unicode UCS2 format.
All mappings happen via 2 byte UTF16 encodings, not via 1 byte UTF8
encoding.  To transform these use Unicode::String.
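
   For the special case of ISO-8859-1, whose 256 codes coincide with
the first 256 Unicode code points, the 2-byte representation shown in
the SYNOPSIS can be reproduced with core Perl alone.  The following is
a sketch of that one case, not of how Unicode::Map works internally;
`latin1_to_utf16be' and `utf16be_to_latin1' are names invented for the
example, and substituting "?" for unmappable characters is a choice
made here.

```perl
use strict;
use warnings;

# Widen each 8-bit code to a 16-bit big-endian (network order) value.
sub latin1_to_utf16be {
    my ($str) = @_;
    return pack "n*", unpack "C*", $str;
}

# Narrow each 16-bit value back to one byte; codes above 0xFF have no
# Latin-1 equivalent and are replaced by "?" in this sketch.
sub utf16be_to_latin1 {
    my ($ustr) = @_;
    return pack "C*", map { $_ < 256 ? $_ : ord "?" } unpack "n*", $ustr;
}

my $utf16 = latin1_to_utf16be("Hello world!");
printf "%d bytes\n", length $utf16;
print utf16be_to_latin1($utf16), "\n";
```

   Character sets that do not line up with Unicode in this way need a
real mapping table, which is what the binary mapfiles provide.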

   For historical reasons this module coexists with Unicode::Map8.
Please use Unicode::Map8 unless you need to care for two byte character
sets, e.g. Chinese GB2312.  Anyway, if you stick to the basic
functionality (see documentation) you can use both modules
equivalently.

   Practically this module will disappear from earth sooner or later as
Unicode mapping support needs somehow to get into perl's core.  If you
would like to work in this field please don't hesitate to contact Gisle
Aas!

   This module can't deal directly with utf8.  Use Unicode::String to
convert utf8 to utf16 and vice versa.

   Character mapping is according to the data of binary mapfiles in the
Unicode::Map hierarchy.  Binary mapfiles can also be created with this
module, enabling you to install your own specific character sets.
Refer to mkmapfile or the file REGISTRY in the Unicode::Map hierarchy.

CONVERSION METHODS
==================

   Probably these are the only methods you will need from this module.
Their usage is compatible with Unicode::Map8.

new
     $Map = new Unicode::Map("GB2312-80")

     Returns a new Map object for GB2312-80 encoding.

from_unicode
     $dest = $Map -> from_unicode ($src)

     Creates a string in locale charset representation from utf16
     encoded string $src.

to_unicode
     $dest = $Map -> to_unicode ($src)

     Creates a string in utf16 representation from $src.

to8
     Alias for from_unicode.  For compatibility with Unicode::Map8.

to16
     Alias for to_unicode.  For compatibility with Unicode::Map8.

WARNINGS
========

   You can ask Unicode::Map to issue warnings on deprecated or
incompatible usage with the constants WARN_DEFAULT, WARN_DEPRECATION or
WARN_COMPATIBILITY.  The latter two can be ORed together.

   No special warnings:

     $Unicode::Map::WARNINGS = Unicode::Map::WARN_DEFAULT

   Warnings for deprecated usage:

     $Unicode::Map::WARNINGS = Unicode::Map::WARN_DEPRECATION

   Warnings for incompatible usage:

     $Unicode::Map::WARNINGS = Unicode::Map::WARN_COMPATIBILITY

MAINTENANCE METHODS
===================

   Note: These methods are solely for the maintenance of Unicode::Map.
Using any of these methods will lead to programs incompatible with
Unicode::Map8.

alias
     @list = $Map -> alias ($csid)

     Returns a list of alias names of character set *$csid*.

mapping
     $path = $Map -> mapping ($csid)

     Returns the absolute path of binary character mapping for
     character set *$csid* according to REGISTRY file of Unicode::Map.

id
     $real_id||"" = $Map -> id ($test_id)

     Returns a valid character set identifier *$real_id*, if
     *$test_id* is a valid character set name or alias name according
     to REGISTRY file of Unicode::Map.

ids
     @ids = $Map -> ids()

     Returns a list of all character set names defined in REGISTRY
     file.

read_text_mapping
     1||0 = $Map -> read_text_mapping ($csid, $path, $style)

     Read a text mapping of style $style named *$csid* from filename
     $path.  The mapping can then be saved to a file with the method
     write_binary_mapping.  $style can be:

          style          description

          "unicode"      A text mapping as of ftp://ftp.unicode.org/MAPPINGS/
          ""             Same as "unicode"
          "reverse"      Similar to unicode, but both columns are switched
          "keld"         A text mapping as of ftp://dkuug.dk/i18n/charmaps/

src
     $path = $Map -> src ($csid)

     Returns the path of textual character mapping for character set
     *$csid* according to REGISTRY file of Unicode::Map.

style
     $path = $Map -> style ($csid)

     Returns the style of textual character mapping for character set
     *$csid* according to REGISTRY file of Unicode::Map.

write_binary_mapping
     1||0 = $Map -> write_binary_mapping ($csid, $path)

     Stores a mapping that has been loaded via method
     read_text_mapping in file $path.

DEPRECATED METHODS
==================

   Some functionality is no longer promoted.

noise
     Deprecated! Don't use any longer.

reverse_unicode
     Deprecated! Use Unicode::String::byteswap instead.

BINARY MAPPINGS
===============

   Structure of binary Mapfiles

   Unicode character mapping tables have sequences of sequential key
and sequential value codes.  This property is used to crunch the maps
easily.  n (0:

     offset  structure     value

     0x00    word          0x27b8   (magic)
     0x02    @( || )

   The mapfile ends with extended mode in main stream.

   :

     0x00    byte != 0     charsize1 (bits)
     0x01    byte          n1 number of chars for one entry
     0x02    byte          charsize2 (bits)
     0x03    byte          n2 number of chars for one entry
     0x04    @( || || entry occurs.

   :

     0x00    size=0|1|2|4  n, number of sequential characters
     size    bs1           key1
     +bs1    bs2           value1
     +bs2    bs1           key2
     +bs1    bs2           value2
     ...

   key_val_seq ends, if either file ends (n = infinite mode) or n
pairs are read.

   :

     0x00    byte          n, number of sequential characters
     0x01    bs1           key_start, first character of sequence
     1+bs1   @( || )

   A key sequence starts with a byte count telling how long the
sequence is.  It is followed by the key start code.  After this comes a
list of value sequences.  The list of value sequences ends, if sum(m)
equals n.

   :

     0x00    byte          m, number of sequential characters
     0x01    bs2           val_start, first character of sequence

   :

     0x00    byte          0
     0x01    byte          ftype
     0x02    byte          fsize, size of following structure
     0x03    fsize bytes   something

   For future extensions or private use one can insert here 1..255
byte long streams.  ftype can have values 30..255, values 0..29 are
reserved.  Modi are not fully defined now and could change.  They will
be explained later.

TO BE DONE
==========

   - Something clever, when a character has no translation.

   - Direct charset -> charset mapping.

   - Better performance.

   - Support for mappings according to RFC 1345.

SEE ALSO
========

   - File `REGISTRY' and binary mappings in directory `Unicode/Map' of
     your perl library path

   - recode(1), map(1), mkmapfile(1), Unicode::Map(3),
     Unicode::Map8(3), Unicode::String(3), Unicode::CharName(3),
     mirrorMappings(1)

   - RFC 1345

   - Mappings at Unicode consortium ftp://ftp.unicode.org/MAPPINGS/

   - Registered Internet character sets ftp://dkuug.dk/i18n/charmaps/

   - 2do: more references

AUTHOR
======

   Martin Schwartz <`martin@nacho.de'>

File: pm.info, Node: Unicode/Map8, Next: Unicode/MapUTF8, Prev: Unicode/Map, Up: Module List

Mapping table between 8-bit chars and Unicode
*********************************************

NAME
====

   Unicode::Map8 - Mapping table between 8-bit chars and Unicode

SYNOPSIS
========

     require Unicode::Map8;
     my $no_map = Unicode::Map8->new("ISO646-NO") || die;
     my $l1_map = Unicode::Map8->new("latin1")    || die;

     my $ustr = $no_map->to16("V}re norske tegn b|r {res\n");
     my $lstr = $l1_map->to8($ustr);
     print $lstr;

     print $no_map->tou("V}re norske tegn b|r {res\n")->utf8

DESCRIPTION
===========

   The Unicode::Map8 class implements efficient mapping tables between
8-bit character sets and 16 bit character sets like Unicode.  The
tables are efficient both in terms of space allocated and translation
speed.  The 16-bit strings are assumed to use network byte order.

   The following methods are available:

$m = Unicode::Map8->new( [$charset] )
     The object constructor creates new instances of the Unicode::Map8
     class.  It takes an optional argument that specifies the name of
     an 8-bit character set to initialize mappings from.  The argument
     can also be the name of a mapping file.  If the charset/file can
     not be located, then the constructor returns undef.

     If you omit the argument, then an empty mapping table is
     constructed.  You must then add mapping pairs to it using the
     addpair() method described below.

$m->addpair( $u8, $u16 );
     Adds a new mapping pair to the mapping object.  It takes two
     arguments.
     The first is the code value in the 8-bit character set and the
     second is the corresponding code value in the 16-bit character set.
     The same codes can be used multiple times (but using the same pair
     has no effect). The first definition for a code is the one that is
     used.

     Consider the following example:

          $m->addpair(0x20, 0x0020);
          $m->addpair(0x20, 0x00A0);
          $m->addpair(0xA0, 0x00A0);

     It means that the characters 0x20 and 0xA0 in the 8-bit charset map
     to themselves in the 16-bit set, but 0x00A0 in the 16-bit character
     set maps to 0x20.

$m->default_to8( $u8 )
     Set the code of the default character to use when mapping from
     16-bit to 8-bit strings. If there is no mapping pair defined for a
     character then this default is substituted by to8() and recode8().

$m->default_to16( $u16 )
     Set the code of the default character to use when mapping from 8-bit
     to 16-bit strings. If there is no mapping pair defined for a
     character then this default is used by to16(), tou() and recode8().

$m->nostrict;
     All undefined mappings are replaced with the identity mapping.
     Undefined characters are normally just removed (or replaced with the
     default if defined) when converting between character sets.

$m->to8( $ustr );
     Converts a 16-bit character string to the corresponding string in
     the 8-bit character set.

$m->to16( $str );
     Converts an 8-bit character string to the corresponding string in
     the 16-bit character set.

$m->tou( $str );
     Same as to16() but returns a Unicode::String object instead of a
     plain UCS2 string.

$m->recode8($m2, $str);
     Map the string $str from one 8-bit character set ($m) to another one
     ($m2). Since we assume we know the mappings towards the common
     16-bit encoding we can use this to convert between any of the 8-bit
     character sets.

$m->to_char16( $u8 )
     Maps a single 8-bit character code to a 16-bit code. If the 8-bit
     character is unmapped then the constant NOCHAR is returned. The
     default is not used and the callback method is not invoked.
$m->to_char8( $u16 )
     Maps a single 16-bit character code to an 8-bit code. If the 16-bit
     character is unmapped then the constant NOCHAR is returned. The
     default is not used and the callback method is not invoked.

   The following callback methods are available. You can override these
methods by creating a subclass of Unicode::Map8.

$m->unmapped_to8
     When mapping to an 8-bit character string and there is no mapping
     defined (and no default either), then this method is called as the
     last resort. It is called with a single integer argument which is
     the code of the unmapped 16-bit character. It is expected to return
     a string that will be incorporated in the 8-bit string. The default
     version of this method always returns an empty string.

     Example:

          package MyMapper;
          @ISA=qw(Unicode::Map8);

          sub unmapped_to8
          {
             my($self, $code) = @_;
             require Unicode::CharName;
             "<" . Unicode::CharName::uname($code) . ">";
          }

$m->unmapped_to16
     Likewise, when mapping to a 16-bit character string and no mapping
     is defined then this method is called. It should return a 16-bit
     string with the bytes in network byte order. The default version of
     this method always returns an empty string.

FILES
=====

   The Unicode::Map8 constructor can parse two different file formats: a
binary format and a textual format.

   The binary format is simple. It consists of a sequence of 16-bit
integer pairs in network byte order. The first pair should contain the
magic value 0xFFFE, 0x0001. Of each pair, the first value is the code of
an 8-bit character and the second is the code of the 16-bit character.
It follows from this that the first value should be less than 256.

   The textual format consists of lines, each of which is either a
comment (first non-blank character is '#'), a completely blank line or a
line with two hexadecimal numbers. The hexadecimal numbers must be
preceded by "0x" as in C and Perl. This is the same format used by the
Unicode mapping files available from .
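
   As a rough illustration of the two file formats described above, the
following sketch decodes both using only core Perl. It does not call
Unicode::Map8 itself; the helper names parse_binary_map and
parse_text_map are hypothetical, and the error handling (dying on a bad
magic value or an out-of-range 8-bit code, skipping unrecognised text
lines) is an assumption made for the example, not necessarily what the
module does.

```perl
use strict;
use warnings;

# Hypothetical helper: decode the binary format, i.e. a sequence of
# 16-bit integer pairs in network (big-endian) byte order, opened by
# the magic pair 0xFFFE, 0x0001. Returns a list of [ $u8, $u16 ].
sub parse_binary_map {
    my ($bytes) = @_;
    die "length is not a multiple of 4 bytes\n" if length($bytes) % 4;

    my @words = unpack('n*', $bytes);    # 'n' = 16-bit big-endian
    my @magic = splice(@words, 0, 2);
    die "bad magic value\n"
        unless @magic == 2 && $magic[0] == 0xFFFE && $magic[1] == 0x0001;

    my @pairs;
    while (my ($u8, $u16) = splice(@words, 0, 2)) {
        # Assumption: treat an 8-bit code of 256 or more as an error.
        die sprintf("8-bit code 0x%X out of range\n", $u8) if $u8 > 0xFF;
        push @pairs, [ $u8, $u16 ];
    }
    return @pairs;
}

# Hypothetical helper: decode the textual format. Comment lines and
# blank lines are skipped; other lines must start with two numbers
# written with a leading "0x". Anything else is ignored here.
sub parse_text_map {
    my ($text) = @_;
    my @pairs;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;
        if ($line =~ /^\s*0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)/) {
            push @pairs, [ hex($1), hex($2) ];
        }
    }
    return @pairs;
}

# Build a tiny binary map in memory and decode it again.
my $bin = pack('n*', 0xFFFE, 0x0001, 0x20, 0x0020, 0xA0, 0x00A0);
my @from_bin = parse_binary_map($bin);

# The same two pairs in the textual format.
my @from_txt = parse_text_map("# sample map\n\n0x20 0x0020\n0xA0 0x00A0\n");

printf "0x%02X -> 0x%04X\n", @$_ for @from_bin;
```

   Both helpers return the pairs in file order, so a caller that wants
the "first definition wins" behaviour described for addpair() would
have to apply that rule itself.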

   The mapping table files are installed in the `Unicode/Map8/maps'
directory somewhere in the Perl @INC path. The variable
$Unicode::Map8::MAPS_DIR is the complete path name to this directory.
Binary mapping files are stored within this directory with the suffix
*.bin*. Textual mapping files are stored with the suffix *.txt*.

   The scripts *map8_bin2txt* and *map8_txt2bin* can translate between
these mapping file formats.

   A special file called aliases within $MAPS_DIR specifies all the
alias names that can be used to denote the various character sets. The
first name on each line is the real file name and the rest are alias
names separated by spaces.

   The ``umap --list'' command can be used to list the character sets
supported.

BUGS
====

   Does not handle Unicode surrogate pairs as a single character.

SEE ALSO
========

   `umap(1)' in this node, *Note Unicode/String: Unicode/String,

COPYRIGHT
=========

   Copyright 1998 Gisle Aas.

   This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself.


File: pm.info, Node: Unicode/MapUTF8, Next: Unicode/String, Prev: Unicode/Map8, Up: Module List

Conversions to and from arbitrary character sets and UTF8
*********************************************************

NAME
====

   Unicode::MapUTF8 - Conversions to and from arbitrary character sets
and UTF8

SYNOPSIS
========

     use Unicode::MapUTF8 qw(to_utf8 from_utf8 utf8_supported_charset
                             utf8_charset_alias);

     # Convert a string in 'ISO-8859-1' to 'UTF8'
     my $output = to_utf8({ -string => 'An example', -charset => 'ISO-8859-1' });

     # Convert a string in 'UTF8' encoding to encoding 'ISO-8859-1'
     my $other  = from_utf8({ -string => 'Other text', -charset => 'ISO-8859-1' });

     # List available character set encodings
     my @character_sets = utf8_supported_charset;

     # Add a character set alias
     utf8_charset_alias({ 'ms-japanese' => 'sjis' });

     # Convert between two arbitrary (but largely compatible) charset encodings
     # (SJIS to EUC-JP)
     my $utf8_string   = to_utf8({ -string => $sjis_string, -charset => 'sjis' });
     my $euc_jp_string = from_utf8({ -string => $utf8_string, -charset => 'euc-jp' });

     # Verify that a specific character set is supported
     if (utf8_supported_charset('ISO-8859-1')) {
         # Yes
     }

DESCRIPTION
===========

   Provides an adapter layer between core routines for converting to and
from UTF8 and other encodings. In essence, it is a way to give multiple
existing Unicode modules a single common interface, so you don't have to
know the underlying implementations to do simple conversions between
UTF8 and other character set encodings. As such, it wraps the
Unicode::String, Unicode::Map8, Unicode::Map and Jcode modules in a
standardized and simple API.

   This also provides a general character set conversion operation based
on UTF8: it is possible to convert between any two compatible and
supported character sets via a simple two-step chaining of conversions.

   As with most things Perlish, if you give it a few big chunks of text
to chew on instead of lots of small ones it will handle many more
characters per second.

   By design, it can be easily extended to encompass any new charset
encoding conversion modules that arrive on the scene.

CHANGES
=======

1.08 2000.11.06
     - Added 'utf8_charset_alias' function to allow for runtime setting
     of character set aliases. Added several alternate names for 'sjis'
     (shiftjis, shift-jis, shift_jis, s-jis, and s_jis).

     Corrected 'croak' messages for 'from_utf8' functions to appropriate
     function name.

     Tightened up initialization encapsulation.

     Corrected fatal problem in jcode from unicode internals. Problem and
     fix found by Brian Wisti.

1.07 2000.11.01
     - Added 'croak' to use Carp declaration to fix error messages.
     Problem and fix found by Brian Wisti.

1.06 2000.10.30
     - Fix to handle change in stringification of overloaded objects
     between Perl 5.005 and 5.6. Problem noticed by Brian Wisti.

1.05 2000.10.23
     - Error in conversions from UTF8 to multibyte encodings corrected.

1.04 2000.10.23
     - Additional diagnostic messages added for internal error
     conditions.

1.03 2000.10.22
     - Bug fix for load time autodetection of Unicode::Map8 encodings.

1.02 2000.10.22
     - Added load time autodetection of Unicode::Map8 supported character
     set encodings.

     Fixed internal calling error for some character sets with
     'from_utf8'. Thanks goes to Ilia Lobsanov for reporting this
     problem.

1.01 2000.10.02
     - Fixed handling of empty strings and added more identification for
     error messages.

1.00 2000.09.29
     - Pre-release version.

FUNCTIONS
=========

utf8_charset_alias({ $alias => $charset });
     Used for runtime assignment of character set aliases.

     Called with no parameters, returns a hash of defined aliases and the
     character sets they map to.

     Example:

          my $aliases     = utf8_charset_alias;
          my @alias_names = keys %$aliases;

     If called with ONE parameter, returns the name of the 'real' charset
     if the alias is defined. Returns undef if it is not found in the
     aliases.

     Example:

          if (!
              utf8_charset_alias('VISCII')) {
              # No alias for this
          }

     If called with a list of 'alias' => 'charset' pairs, defines those
     aliases for use.

     Example:

          utf8_charset_alias({ 'japanese' => 'sjis', 'japan' => 'sjis' });

     Note: It will croak if a passed pair does not map to a character set
     defined in the predefined set of character encodings. It is NOT
     allowed to alias something to another alias.

     Multiple character set aliases can be set with a single call.

     To clear an alias, pass a character set mapping of undef.

     Example:

          utf8_charset_alias({ 'japanese' => undef });

     While an alias is set, the 'utf8_supported_charset' function will
     return the alias as if it were a predefined charset.

     Overriding a base defined character encoding with an alias will
     generate a warning message to STDERR.

utf8_supported_charset($charset_name);
     Returns true if the named charset is supported (including
     user-defined aliases). Returns false if it is not.

     Example:

          if (! utf8_supported_charset('VISCII')) {
              # No support yet
          }

     If called in a list context with no parameters, it will return a
     list of all supported character set names (including user-defined
     aliases).

     Example:

          my @charsets = utf8_supported_charset;

to_utf8({ -string => $string, -charset => $source_charset });
     Returns the string converted to UTF8 from the specified source
     charset.

from_utf8({ -string => $string, -charset => $target_charset});
     Returns the string converted from UTF8 to the specified target
     charset.

VERSION
=======

   1.08 2000.11.06

COPYRIGHT
=========

   Copyright September, 2000 Benjamin Franz. All rights reserved.

   This software is free software. You can redistribute it and/or modify
it under the same terms as Perl itself.

AUTHOR
======

   Benjamin Franz

TODO
====

   Regression tests for Jcode, 2-byte encodings and encoding aliases

SEE ALSO
========

   Unicode::String Unicode::Map8 Unicode::Map Jcode