This is Info file pm.info, produced by Makeinfo version 1.68 from the
input file bigpm.texi.


File: pm.info,  Node: URI/Bookmarks/Netscape,  Next: URI/Escape,  Prev: URI/Bookmarks,  Up: Module List

Perl module containing routines for Netscape bookmark files
***********************************************************

NAME
====

   URI::Bookmarks::Netscape - Perl module containing routines for Netscape
bookmark files

SYNOPSIS
========

     See URI::Bookmarks.

DESCRIPTION
===========

   URI::Bookmarks::Netscape contains some helper routines specifically for
URI::Bookmarks objects which were originally Netscape bookmark files.

AUTHOR
======

   Adam Spiers <adam@spiers.net>

SEE ALSO
========

   *Note URI/Bookmarks: URI/Bookmarks,, `URI::Bookmarks::*' in this node,
*Note URI/Bookmark: URI/Bookmark,, `URI::Bookmark::*' in this node,
`perl(1)' in this node.


File: pm.info,  Node: URI/Escape,  Next: URI/Find,  Prev: URI/Bookmarks/Netscape,  Up: Module List

Escape and unescape unsafe characters
*************************************

NAME
====

   URI::Escape - Escape and unescape unsafe characters

SYNOPSIS
========

     use URI::Escape;
     $safe = uri_escape("10% is enough\n");
     $verysafe = uri_escape("foo", "\0-\377");
     $str  = uri_unescape($safe);

DESCRIPTION
===========

   This module provides functions to escape and unescape URI strings as
defined by RFC 2396.  URIs consist of a restricted set of characters,
denoted as `uric' in RFC 2396.  The restricted set of characters consists
of digits, letters, and a few graphic symbols chosen from those common to
most of the character encodings and input facilities available to Internet
users:

     "A" .. "Z", "a" .. "z", "0" .. "9",
     ";", "/", "?", ":", "@", "&", "=", "+", "$", ",",   # reserved
     "-", "_", ".", "!", "~", "*", "'", "(", ")"

   In addition any byte (octet) can be represented in a URI by an escape
sequence; a triplet consisting of the character "%" followed by two
hexadecimal digits.  Bytes can also be represented directly by a character
using the US-ASCII character for that octet (iff the character is part of
`uric').

   Some of the `uric' characters are reserved for use as delimiters or as
part of certain URI components.  These must be escaped if they are to be
treated as ordinary data.  Read RFC 2396 for further details.

   The functions provided (and exported by default) from this module are:

uri_escape($string, [$unsafe])
     This function replaces all unsafe characters in the $string with their
     escape sequences and returns the result.

     The uri_escape() function takes an optional second argument that
     overrides the set of characters that are to be escaped.  The set is
     specified as a string that can be used in a regular expression
     character class (between [ ]).  E.g.:

          "\x00-\x1f\x7f-\xff"          # all control and hi-bit characters
          "a-z"                         # all lower case characters
          "^A-Za-z"                     # everything not a letter

     The default set of characters to be escaped is all those which are
     not part of the `uric' character class shown above.

uri_unescape($string,...)
     Returns a string with all %XX sequences replaced with the actual byte
     (octet).

     This does the same as:

          $string =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;

     but does not modify the string in-place as this RE would.  Using the
     uri_unescape() function instead of the RE might make the code look
     cleaner and is a few characters less to type.

     In a simple benchmark test I made I got something like 40% slowdown by
     calling the function (instead of the inline RE above) if a few chars
     were unescaped and something like 700% slowdown if none were.  If
     you are going to unescape a lot of times it might be a good idea to
     inline the RE.

     If the uri_unescape() function is passed multiple strings, then each
     one is unescaped and returned.

   The module can also export the `%escapes' hash which contains the
mapping from all 256 bytes to the corresponding escape code.  Lookup in
this hash is faster than evaluating `sprintf("%%%02X", ord($byte))' each
time.
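
   The lookup-table idea and the two substitutions described above can be
sketched in plain Perl.  This is only an illustration, not the module's
source: the names `%sketch_escapes', `sketch_escape' and `sketch_unescape'
are hypothetical, and the default unsafe set is an assumption derived from
the `uric' list shown earlier.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the %escapes hash: every byte => "%XX".
my %sketch_escapes = map { chr($_) => sprintf("%%%02X", $_) } 0 .. 255;

# Escape every character that matches the "unsafe" character class.
sub sketch_escape {
    my ($string, $unsafe) = @_;
    # Assumed default: anything outside the uric set listed above.
    $unsafe = q{^A-Za-z0-9;/?:@&=+$,\-_.!~*'()} unless defined $unsafe;
    $string =~ s/([$unsafe])/$sketch_escapes{$1}/g;
    return $string;
}

# Replace each %XX triplet with the byte it stands for (works on a copy).
sub sketch_unescape {
    my ($string) = @_;
    $string =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $string;
}

my $safe     = sketch_escape("10% is enough\n");
my $verysafe = sketch_escape("foo", "\0-\377");
my $str      = sketch_unescape($safe);
print "$safe\n$verysafe\n";
```

   Precomputing the table means the substitution only performs a hash
lookup per escaped byte instead of a sprintf() call.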

SEE ALSO
========

   *Note URI: URI,

COPYRIGHT
=========

   Copyright 1995-2000 Gisle Aas.

   This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.


File: pm.info,  Node: URI/Find,  Next: URI/Heuristic,  Prev: URI/Escape,  Up: Module List

Find URIs in arbitrary text
***************************

NAME
====

     URI::Find - Find URIs in arbitrary text

SYNOPSIS
========

     use URI::Find;

     $how_many_found = find_uris($text, \&callback);

DESCRIPTION
===========

   This module does one thing: it finds URIs and URLs in plain text.  It
finds them quickly and it finds them all (or what URI::URL considers a URI
to be).  It employs a series of heuristics to:

   * Find schemeless URIs (i.e. www.foo.com)

   * Avoid picking up trailing characters from the text

   * Avoid picking up URL-like things such as Perl module names

Functions
---------

   URI::Find exports one function, find_uris().  It takes two arguments:
the first is a text string to search, the second is a function reference.

   The function is a callback which is called on each URI found.  It is
passed two arguments: the first is a URI::URL object representing the URI
found, the second is the original text of the URI found.  The return
value of the callback will replace the original URI in the text.

EXAMPLES
========

   Simply print the original URI text found and the normalized
representation.

     find_uris($text,
               sub {
                   my($uri, $orig_uri) = @_;
                   print "The text '$orig_uri' represents '$uri'\n";
                   return $orig_uri;
               });

   Check each URI in document to see if it exists.

     use LWP::Simple;
     find_uris($text,
               sub {
                   my($uri, $orig_uri) = @_;
                   if( head $uri ) {
                       print "$orig_uri is okay\n";
                   }
                   else {
                       print "$orig_uri cannot be found\n";
                   }
                   return $orig_uri;
               });

   Wrap each URI found in an HTML anchor.

     find_uris($text,
               sub {
                   my($uri, $orig_uri) = @_;
                   return qq|<a href="$uri">$orig_uri</a>|;
               });

CAVEATS, BUGS, ETC...
=====================

   RFC 2396 Appendix E suggests using the form '<http://www.foo.com>' or
'<URL:http://www.foo.com>' when putting URLs in plain text.  URI::Find
accommodates this suggestion and considers the entire thing (brackets and
all) to be part of the URL found.  This means that when find_uris() sees
'<URL:http://www.foo.com>' it will hand that entire string to your
callback, not just the URL.

   NOTE:  The prototype on find_uris() is already getting annoying to me.
I might remove it in a future version.

SEE ALSO
========

   URI::URL, URI, RFC 2396 (especially Appendix E)

AUTHOR
======

   Michael G Schwern <schwern@pobox.com> with insight from Uri Gutman,
Greg Bacon and Jeff Pinyan.


File: pm.info,  Node: URI/Heuristic,  Next: URI/Sequin,  Prev: URI/Find,  Up: Module List

Expand URI using heuristics
***************************

NAME
====

   uf_uristr - Expand URI using heuristics

SYNOPSIS
========

     use URI::Heuristic qw(uf_uristr);
     $u = uf_uristr("perl");             # http://www.perl.com
     $u = uf_uristr("www.sol.no/sol");   # http://www.sol.no/sol
     $u = uf_uristr("aas");              # http://www.aas.no
     $u = uf_uristr("ftp.funet.fi");     # ftp://ftp.funet.fi
     $u = uf_uristr("/etc/passwd");      # file:/etc/passwd

DESCRIPTION
===========

   This module provides functions that expand strings into real absolute
URIs using some builtin heuristics.  Strings that already represent
absolute URIs (i.e. start with a `scheme:' part) are never modified and
are returned unchanged.  The main use of these functions is to allow
abbreviated URIs similar to what many web browsers allow for URIs typed in
by the user.

   The following functions are provided:

uf_uristr($str)
     The uf_uristr() function will try to make the string passed as
     argument into a proper absolute URI string.  The "uf_" prefix stands
     for "User Friendly".  Under MacOS, it assumes that any string with a
     common URL scheme (http, ftp, etc.) is a URL rather than a local
     path.  So don't name your volumes after common URL schemes and expect
     uf_uristr() to construct valid file: URLs on those volumes for you,
     because it won't.

uf_uri($str)
     This function works the same way as uf_uristr() but it will return a
     `URI' object.

ENVIRONMENT
===========

   If the hostname portion of a URI does not contain any dots, then
certain qualified guesses will be made.  These guesses are governed by the
following two environment variables.

COUNTRY
     This is the two letter country code (ISO 3166) for your location.  If
     the domain name of your host ends with two letters, then it is taken
     to be the default country. See also *Note Locale/Country:
     Locale/Country,.

URL_GUESS_PATTERN
     Contains a space-separated list of URL patterns to try.  The string
     "ACME" is for some reason used as a placeholder for the host name in
     the URL provided.  Example:

          URL_GUESS_PATTERN="www.ACME.no www.ACME.se www.ACME.com"
          export URL_GUESS_PATTERN

     Specifying URL_GUESS_PATTERN disables any guessing rules based on
     country.  An empty URL_GUESS_PATTERN disables any guessing that
     involves host name lookups.
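
   How a pattern list of this shape expands for a given host name can be
sketched as follows.  This is a hedged illustration only: the helper name
`sketch_guess_candidates' is hypothetical, and it shows just the "ACME"
placeholder substitution, not the host name lookups the module may perform
on each candidate.

```perl
use strict;
use warnings;

# Hypothetical helper: substitute a dotless host name for the ACME
# placeholder in each space-separated pattern.
sub sketch_guess_candidates {
    my ($host, $patterns) = @_;
    my @candidates;
    for my $pattern (split ' ', $patterns) {
        (my $candidate = $pattern) =~ s/\bACME\b/$host/g;
        push @candidates, $candidate;
    }
    return @candidates;
}

my @candidates = sketch_guess_candidates(
    "perl", "www.ACME.no www.ACME.se www.ACME.com");
print "$_\n" for @candidates;
```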

COPYRIGHT
=========

   Copyright 1997-1998, Gisle Aas

   This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.


File: pm.info,  Node: URI/Sequin,  Next: URI/URL,  Prev: URI/Heuristic,  Up: Module List

Extract information from the URLs of Search-Engines
***************************************************

NAME
====

   URI::Sequin - Extract information from the URLs of Search-Engines

SYNOPSIS
========

     use URI::Sequin qw/se_extract key_extract log_extract %log_types/;

     $url = &log_extract($line_from_log_file, 'NCSA');

     $log_types{'MyLogType'} = '^(.+?) -> .+$';
     $url = &log_extract($line_from_log_file, 'MyLogType');

     $keyword_string = &key_extract($url);

     ($search_engine_name, $search_engine_url) = @{&se_extract($url)};

DESCRIPTION
===========

   This module provides three tools to aid people trying to analyse
Search-Engine URLs. It’s meant mainly for those who want to analyse
referrer logs and pick out key information about site visitors, such as
which Search-Engine and keywords they used to find the site.

   The functions and globals provided (and exported by default) from this
module are:

log_extract($log_line, 'Type')
     This will pick out the referring URL from a line of a logfile. The
     'type' can be one of the built in types or can be a user-created one.
     For more information, see %log_types below. This subroutine accepts a
     scalar, and returns a scalar.

key_extract($url)
     This will try and determine the keywords used in $url. It accepts a
     scalar and returns a scalar. Should nothing be found, it returns an
     undefined value.

se_extract($url)
     This will try and determine the name of the Search-Engine used and
     its URL.  It accepts a scalar, and returns a reference to an array
     containing firstly the Search-Engine’s name and secondly the
     Search-Engine’s URL. Should the URL appear not to be from a Search
     Query, it returns a reference to an empty array.

%log_types
     There are five built-in logfile types already in this hash. They are:

        * IIS1 - Microsoft IIS 3.0 and 2.0

        * IIS2 - Microsoft IIS 4.0 (W3SVC format)

        * NCSA - For APACHE, NETSCAPE and any other NCSA format logs

        * ORW - O'Reilly WebSite format

        * General - A generalised one that will work with most logfiles

     It’s easy to add another one. Simply add a key to the hash, with a
     value that is a regex. Parenthesise the part that is the referring
     URL, as the script uses $1 to obtain the URL. (see the example in the
     Synopsis section).
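
   The way a custom log type regex hands back the referring URL through $1
can be sketched without the module.  The hash `%sketch_log_types', the
function `sketch_log_extract' and the sample log line are all hypothetical;
only the regex is taken from the Synopsis.

```perl
use strict;
use warnings;

# Hypothetical stand-in for %log_types with one user-defined entry.
# The parenthesised part of the regex is the referring URL.
my %sketch_log_types = ( MyLogType => '^(.+?) -> .+$' );

sub sketch_log_extract {
    my ($line, $type) = @_;
    my $pattern = $sketch_log_types{$type};
    return undef unless defined $pattern;      # unknown log type
    return $line =~ /$pattern/ ? $1 : undef;   # $1 holds the referrer
}

# Made-up log line in the "referrer -> requested path" layout.
my $referrer = sketch_log_extract(
    'http://search.example/?q=perl+uri -> /index.html', 'MyLogType');
print "$referrer\n";
```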

AUTHOR
======

   Peter Sergeant <pete_sergeant@hotmail.com>

COPYRIGHT
=========

   Copyright 2001 Peter Sergeant.

   This program is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.


File: pm.info,  Node: URI/URL,  Next: URI/WithBase,  Prev: URI/Sequin,  Up: Module List

Uniform Resource Locators
*************************

NAME
====

   URI::URL - Uniform Resource Locators

SYNOPSIS
========

     $u1 = URI::URL->new($str, $base);
     $u2 = $u1->abs;

DESCRIPTION
===========

   This module is provided for backwards compatibility with modules that
depend on the interface provided by the URI::URL class that used to be
distributed with the libwww-perl library.

   The following differences compared to the `URI' class interface exist:

   * The URI::URL module exports the url() function as an alternate
     constructor interface.

   * The constructor takes an optional $base argument.  See *Note
     URI/WithBase: URI/WithBase,.

   * The URI::URL->newlocal class method is the same as URI::file->new_abs

   * URI::URL::strict(1)

   * $url->print_on method

   * $url->crack method

   * $url->full_path; same as ($uri->abs_path || "/")

   * $url->netloc; same as $uri->authority

   * $url->epath, $url->equery; same as $uri->path, $uri->query

   * $url->path and $url->query pass unescaped strings.

   * $url->path_components; same as $uri->path_segments (if you don't
     consider path segment parameters).

   * $url->params and $url->eparams methods.

   * $url->base method.  See *Note URI/WithBase: URI/WithBase,.

   * $url->abs and $url->rel have an optional $base argument.  See *Note
     URI/WithBase: URI/WithBase,.

   * $url->frag; same as $uri->fragment

   * $url->keywords; same as $uri->query_keywords;

   * $url->localpath with friends map to $uri->file

   * $url->address and $url->encoded822addr; same as $uri->to for mailto
     URI.

   * $url->groupart method for news URI.

   * $url->article; same as $uri->message

SEE ALSO
========

   *Note URI: URI,, *Note URI/WithBase: URI/WithBase,

COPYRIGHT
=========

   Copyright 1998-2000 Gisle Aas.


File: pm.info,  Node: URI/WithBase,  Next: URI/data,  Prev: URI/URL,  Up: Module List

URI which remember their base
*****************************

NAME
====

   URI::WithBase - URI which remember their base

SYNOPSIS
========

     $u1 = URI::WithBase->new($str, $base);
     $u2 = $u1->abs;

     $base = $u1->base;
     $u1->base( $new_base )

DESCRIPTION
===========

   This module provides the `URI::WithBase' class.  Objects of this class
are like `URI' objects, but can keep their base too.

   The methods provided in addition to or modified from those of `URI' are:

$uri = URI::WithBase->new($str, [$base])
     The constructor takes an optional base URI as the second argument.

$uri->base( [$new_base] )
     This method can be used to get or set the value of the base attribute.

$uri->abs( [$base_uri] )
     The $base_uri argument is now made optional as the object carries its
     base with it.

$uri->rel( [$base_uri] )
     The $base_uri argument is now made optional as the object carries its
     base with it.

SEE ALSO
========

   *Note URI: URI,

COPYRIGHT
=========

   Copyright 1998-2000 Gisle Aas.


File: pm.info,  Node: URI/data,  Next: URI/file,  Prev: URI/WithBase,  Up: Module List

URI that contain immediate data
*******************************

NAME
====

   URI::data - URI that contain immediate data

SYNOPSIS
========

     use URI;

     $u = URI->new("data:");
     $u->media_type("image/gif");
     $u->data(scalar(`cat camel.gif`));
     print "$u\n";
     open(XV, "|xv -") and print XV $u->data;

DESCRIPTION
===========

   The `URI::data' class supports `URI' objects belonging to the data URI
scheme.  The data URI scheme is specified in RFC 2397.  It allows
inclusion of small data items as "immediate" data, as if it had been
included externally.  Examples:

     data:,Perl%20is%20good

     data:image/gif;base64,R0lGODdhIAAgAIAAAAAAAPj8+CwAAAAAI
       AAgAAAClYyPqcu9AJyCjtIKc5w5xP14xgeO2tlY3nWcajmZZdeJcG
       Kxrmimms1KMTa1Wg8UROx4MNUq1HrycMjHT9b6xKxaFLM6VRKzI+p
       KS9XtXpcbdun6uWVxJXA8pNPkdkkxhxc21LZHFOgD2KMoQXa2KMWI
       JtnE2KizVUkYJVZZ1nczBxXlFopZBtoJ2diXGdNUymmJdFMAADs=

   `URI' objects belonging to the data scheme support the common methods
(described in *Note URI: URI,) and the following two scheme specific
methods:

$uri->media_type( [$new_media_type] )
     This method can be used to get or set the media type specified in the
     URI.  If no media type is specified, then the default
     `"text/plain;charset=US-ASCII"' is returned.

$uri->data( [$new_data] )
     This method can be used to get or set the data contained in the URI.
     The data is passed unescaped (in binary form).  The decision about
     whether to base64 encode the data in the URI is taken automatically
     based on what encoding produces the shortest URI string.
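
   The "shortest encoding wins" decision can be sketched as below.  This
is an assumption-laden illustration, not the class's actual code: the
helper `sketch_data_part' is hypothetical, the set of characters left
unescaped is assumed, and the real implementation may weigh the two forms
differently.

```perl
use strict;
use warnings;
use MIME::Base64 qw(encode_base64);

# Hypothetical helper: build the part of a data: URI that follows the
# media type, choosing whichever encoding gives the shorter string.
sub sketch_data_part {
    my ($data) = @_;

    # Percent-escaped form (assumed safe set: unreserved characters).
    my $safe = q{A-Za-z0-9\-_.!~*'()};
    (my $escaped = $data) =~ s/([^$safe])/sprintf("%%%02X", ord($1))/eg;
    my $plain = ",$escaped";

    # Base64 form; the empty second argument suppresses line breaks.
    my $base64 = ";base64," . encode_base64($data, "");

    return length($plain) <= length($base64) ? $plain : $base64;
}

my $text_uri   = "data:" . sketch_data_part("Perl is good");
my $binary_uri = "data:" . sketch_data_part("\x00\x01\x02\x03\x04\x05\x06\x07");
print "$text_uri\n$binary_uri\n";
```

   Mostly printable text tends to stay percent-escaped, while binary data,
where nearly every byte would cost three characters, tends to come out
shorter in base64.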

SEE ALSO
========

   *Note URI: URI,

COPYRIGHT
=========

   Copyright 1995-1998 Gisle Aas.

   This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.


File: pm.info,  Node: URI/file,  Next: URI/ldap,  Prev: URI/data,  Up: Module List

URI that map to local file names
********************************

NAME
====

   URI::file - URI that map to local file names

SYNOPSIS
========

     use URI::file;

     $u1 = URI->new("file:/foo/bar");
     $u2 = URI->new("foo/bar", "file");

     $u3 = URI::file->new($path);
     $u4 = URI::file->new("c:\\windows\\", "win32");
     
     $u1->file;
     $u1->file("mac");

DESCRIPTION
===========

   The `URI::file' class supports `URI' objects belonging to the file URI
scheme.  This scheme allows us to map the conventional file names found on
various computer systems to the URI name space.  An old specification of
the file URI scheme is found in RFC 1738.  Some older background
information is also in RFC 1630. There are no newer specifications as far
as I know.

   If you want simply to construct file URI objects from URI strings, use
the normal `URI' constructor.  If you want to construct file URI objects
from the actual file names used by various systems, then use one of the
following `URI::file' constructors:

$u = URI::file->new( $filename, [$os] )
     Maps a file name to the file: URI name space, creates an URI object
     and returns it.  The $filename is interpreted as one belonging to the
     indicated operating system ($os), which defaults to the value of the
     $^O variable.  The $filename can be either absolute or relative, and
     the corresponding type of URI object for $os is returned.

$u = URI::file->new_abs( $filename, [$os] )
     Same as URI::file->new, but will make sure that the URI returned
     represents an absolute file name.  If the $filename argument is
     relative, then the name is resolved relative to the current directory,
     i.e. this constructor is really the same as:

          URI::file->new($filename)->abs(URI::file->cwd);

$u = URI::file->cwd
     Returns a file URI that represents the current working directory.
     See *Note Cwd: Cwd,.

   The following methods are supported for file URI (in addition to the
common and generic methods described in *Note URI: URI,):

$u->file( [$os] )
     This method returns a file name.  It maps from the URI name space to
     the file name space of the indicated operating system.

     It might return undef if the name can not be represented in the
     indicated file system.

$u->dir( [$os] )
     Some systems use a different form for names of directories than for
     plain files.  Use this method if you know you want to use the name for
     a directory.

   The `URI::file' module can be used to map generic file names to names
suitable for the current system.  As such, it can work as a nice
replacement for the File::Spec module.  For instance the following code
will translate the Unix style file name `Foo/Bar.pm' to a name suitable
for the local system.

     $file = URI::file->new("Foo/Bar.pm", "unix")->file;
     die "Can't map filename Foo/Bar.pm for $^O" unless defined $file;
     open(FILE, $file) || die "Can't open '$file': $!";
     # do something with FILE

MAPPING NOTES
=============

   Most computer systems today have hierarchically organized file systems.
Mapping the names used in these systems to the generic URI syntax allows
us to work with relative file URIs that behave as they should when
resolved using the generic algorithm for URIs (specified in RFC 2396).
Mapping a file name to the generic URI syntax involves mapping the path
separator character to "/" and encoding of any reserved characters that
appear in the path segments of the file names.  If path segments
consisting of the strings "." or ".." have a different meaning than what
is specified for generic URIs, then these must be encoded as well.

   If the file system has device, volume or drive specifications as the
root of the name space, then it makes sense to map them to the authority
field of the generic URI syntax.  This makes sure that relative URI can
not be resolved "above" them, i.e. generally how relative file names work
in those systems.

   Another common use of the authority field is to encode the host that
this file name is valid on.  The host name "localhost" is special and
generally has the same meaning as a missing or empty authority field.
This use will be in conflict with using it as a device specification, but
can often be resolved for device specifications having characters not
legal in plain host names.

   File name to URI mapping is normally not one-to-one.  There are usually
many URI that map to the same file name.  For instance an authority of
"localhost" maps the same as a URI with a missing or empty authority.

   Example 1: The Mac uses ":" as path separator, but not in the same way
as generic URI. ":foo" is a relative name.  "foo:bar" is an absolute name.
Also path segments can contain the "/" character as well as be literal
"." or "..".  It means that we will map like this:

     Mac                   URI
     ----------            -------------------
     :foo:bar     <==>     foo/bar
     :            <==>     ./
     ::foo:bar    <==>     ../foo/bar
     :::          <==>     ../../
     foo:bar      <==>     file:/foo/bar
     foo:bar:     <==>     file:/foo/bar/
     ..           <==>     %2E%2E
     <undef>      <==      /
     foo/         <==      file:/foo%2F
     ./foo.txt    <==      file:/.%2Ffoo.txt
     
     Note that if you want a relative URL, you *must* begin the path with a :.  Any
     path that begins with [^:] will be treated as absolute.

   Example 2: The Unix file system is easy to map as it uses the same path
separator as URIs, has a single root, and segments of "." and ".." have
the same meaning.  URIs that have the character "\0" or "/" as part of any
path segment can not be turned into valid Unix file names.

     Unix                  URI
     ----------            ------------------
     foo/bar      <==>     foo/bar
     /foo/bar     <==>     file:/foo/bar
     /foo/bar     <==      file://localhost/foo/bar
     file:         ==>     ./file:
     <undef>      <==      file:/fo%00/bar
     /            <==>     file:/
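
   The left-to-right direction of the Unix table can be approximated in a
few lines.  This is a sketch under stated assumptions, not the URI::file
implementation: the function `sketch_unix_to_uri' is hypothetical and the
set of characters left unescaped in a segment is assumed.

```perl
use strict;
use warnings;

# Hypothetical helper that maps a Unix file name to URI form.
sub sketch_unix_to_uri {
    my ($path) = @_;

    # Characters assumed safe inside a path segment; all others become %XX.
    my $safe = q{A-Za-z0-9\-_.!~*'():@&=+$,};
    my @segments;
    for my $segment (split m{/}, $path, -1) {
        (my $escaped = $segment) =~ s/([^$safe])/sprintf("%%%02X", ord($1))/eg;
        push @segments, $escaped;
    }
    my $uri_path = join "/", @segments;

    # Absolute names get the file: scheme.
    return "file:$uri_path" if $path =~ m{^/};
    # A relative name whose first segment contains ":" would otherwise be
    # read as "scheme:...", so it is prefixed with "./".
    return "./$uri_path" if $uri_path =~ m{^[^/]*:};
    return $uri_path;
}

print sketch_unix_to_uri($_), "\n" for "foo/bar", "/foo/bar", "file:", "/";
```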

SEE ALSO
========

   *Note URI: URI,, *Note File/Spec: File/Spec,, *Note Perlport:
(perl.info)perlport,

COPYRIGHT
=========

   Copyright 1995-1998 Gisle Aas.

   This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.


File: pm.info,  Node: URI/ldap,  Next: Unicode/CharName,  Prev: URI/file,  Up: Module List

LDAP Uniform Resource Locators
******************************

NAME
====

   URI::ldap - LDAP Uniform Resource Locators

SYNOPSIS
========

     use URI;

     $uri = URI->new("ldap:$uri_string");
     $dn     = $uri->dn;
     $filter = $uri->filter;
     @attr   = $uri->attributes;
     $scope  = $uri->scope;
     %extn   = $uri->extensions;
     
     $uri = URI->new("ldap:");  # start empty
     $uri->host("ldap.itd.umich.edu");
     $uri->dn("o=University of Michigan,c=US");
     $uri->attributes(qw(postalAddress));
     $uri->scope('sub');
     $uri->filter('(cn=Babs Jensen)');
     print $uri->as_string,"\n";

DESCRIPTION
===========

   URI::ldap provides an interface to parse an LDAP URI in its constituent
parts and also build a URI as described in RFC 2255.

METHODS
=======

   URI::ldap supports all the generic and server methods defined by *Note
URI: URI,, plus the following.

   Each of the following methods can be used to set or get the value in
the URI. The values are passed in unescaped form.  None of these will
return undefined values, but elements without a default can be empty.  If
arguments are given then a new value will be set for the given part of the
URI.

$uri->dn( [$new_dn] )
     Set or get the *Distinguished Name* part of the URI.  The DN
     identifies the base object of the LDAP search.

$uri->attributes( [@new_attrs] )
     Set or get the list of attribute names which will be returned by the
     search.

$uri->scope( [$new_scope] )
     Set or get the scope that the search will use. The value can be one of
     `"base"', `"one"' or `"sub"'. If none is given in the URI then the
     return value will default to `"base"'.

$uri->_scope( [$new_scope] )
     Same as scope(), but does not default to anything.

$uri->filter( [$new_filter] )
     Set or get the filter that the search will use. If none is given in
     the URI then the return value will default to `"(objectClass=*)"'.

$uri->_filter( [$new_filter] )
     Same as filter(), but does not default to anything.

$uri->extensions( [$etype => $evalue,...] )
     Set or get the extensions used for the search. The list passed should
     be in the form etype1 => evalue1, etype2 => evalue2,... This is also
     the form of list that will be returned.
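
   The parts these methods expose correspond to the RFC 2255 layout
`ldap://host/dn?attributes?scope?filter?extensions'.  A rough sketch of
splitting such a string, including the two defaults mentioned above, might
look like this.  The functions `sketch_parse_ldap' and `sketch_unesc' are
hypothetical and do not reflect how URI::ldap is implemented.

```perl
use strict;
use warnings;

# Undo %XX escapes in one component.
sub sketch_unesc {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $s;
}

# Hypothetical parser for ldap://host/dn?attributes?scope?filter?extensions
sub sketch_parse_ldap {
    my ($url) = @_;
    my ($host, $rest) = $url =~ m{^ldap://([^/]*)/?(.*)$}s or return;
    my ($dn, $attrs, $scope, $filter, $extn) = split /\?/, $rest, 5;
    for ($dn, $attrs, $scope, $filter, $extn) {
        $_ = "" unless defined $_;    # missing parts become empty strings
    }

    my %parts = (
        host       => $host,
        dn         => sketch_unesc($dn),
        attributes => [ map { sketch_unesc($_) } split /,/, $attrs ],
        scope      => length($scope)  ? sketch_unesc($scope)  : "base",
        filter     => length($filter) ? sketch_unesc($filter) : "(objectClass=*)",
        extensions => $extn,
    );
    return \%parts;
}

my $parts = sketch_parse_ldap(
    "ldap://ldap.itd.umich.edu/o=University%20of%20Michigan,c=US"
    . "?postalAddress?sub?(cn=Babs%20Jensen)");
print "$parts->{dn} / $parts->{scope} / $parts->{filter}\n";
```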

SEE ALSO
========

   RFC 2255

AUTHOR
======

   Graham Barr <`gbarr@pobox.com'>

   Slightly modified by Gisle Aas to fit into the URI distribution.

COPYRIGHT
=========

   Copyright (c) 1998 Graham Barr. All rights reserved. This program is
free software; you can redistribute it and/or modify it under the same
terms as Perl itself.


File: pm.info,  Node: Unicode/CharName,  Next: Unicode/Map,  Prev: URI/ldap,  Up: Module List

Look up Unicode character names
*******************************

NAME
====

   Unicode::CharName - Look up Unicode character names

SYNOPSIS
========

     use Unicode::CharName qw(uname ublock);
     print uname(ord('%')), "\n";
     print ublock(0x0300), "\n";

DESCRIPTION
===========

   This module provides two functions named uname() and ublock().  The
uname() function will return the Unicode character name for the given code
(a number between 0 and 0x10FFFF).  Unicode character names are written in
upper-case ASCII letters, and are strings like:

     LATIN CAPITAL LETTER A
     LATIN SMALL LETTER A WITH RING ABOVE
     CJK UNIFIED IDEOGRAPH 7C80
     HANGUL SYLLABLE PWILH

   The ublock() will return the name of the Unicode character block that
the given character belongs to.

SEE ALSO
========

   *Note Unicode/String: Unicode/String,

COPYRIGHT
=========

   Copyright 1997 Gisle Aas.

   This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

   Name table extracted from the Unicode 2.0 Character Database. Copyright
(c) 1991-1996 Unicode, Inc. All Rights reserved.


File: pm.info,  Node: Unicode/Map,  Next: Unicode/Map8,  Prev: Unicode/CharName,  Up: Module List

maps charsets from and to utf16 unicode
***************************************

NAME
====

   Unicode::Map V0.108 - maps charsets from and to utf16 unicode

SYNOPSIS
========

     use Unicode::Map();

     $Map = new Unicode::Map("ISO-8859-1");

     $utf16 = $Map -> to_unicode ("Hello world!");
       # => $utf16 == "\0H\0e\0l\0l\0o\0 \0w\0o\0r\0l\0d\0!"

     $locale = $Map -> from_unicode ($utf16);
       # => $locale == "Hello world!"

   A more detailed description below.

   2do: short note about perl's Unicode perspectives.

DESCRIPTION
===========

   This module converts strings from and to 2-byte Unicode UCS2 format.
All mappings happen via 2 byte UTF16 encodings, not via 1 byte UTF8
encoding. To transform these use Unicode::String.

   For historical reasons this module coexists with Unicode::Map8.  Please
use Unicode::Map8 unless you need to handle two-byte character sets,
e.g. Chinese GB2312. In any case, if you stick to the basic functionality
(see documentation) you can use both modules equivalently.

   In practice this module will disappear sooner or later, as Unicode
mapping support needs somehow to get into perl's core. If you would like
to work in this field, please don't hesitate to contact Gisle Aas!

   This module can't deal directly with utf8. Use Unicode::String to
convert utf8 to utf16 and vice versa.

   Character mapping follows the data of the binary mapfiles in the
Unicode::Map hierarchy. Binary mapfiles can also be created with this
module, enabling you to install your own specific character sets. Refer to
mkmapfile or file REGISTRY in the Unicode::Map hierarchy.

CONVERSION METHODS
==================

   Probably these are the only methods you will need from this module.
Their usage is compatible with Unicode::Map8.

new
     *$Map* = new Unicode::Map("GB2312-80")

     Returns a new Map object for GB2312-80 encoding.

from_unicode
     $dest = *$Map* -> from_unicode ($src)

     Creates a string in locale charset representation from utf16 encoded
     string $src.

to_unicode
     $dest   = *$Map* -> to_unicode ($src)

     Creates a string in utf16 representation from $src.

to8
     Alias for from_unicode. For compatibility with Unicode::Map8

to16
     Alias for to_unicode. For compatibility with Unicode::Map8
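
   For the simplest case, ISO-8859-1, the effect of the two conversions
can be imitated with pack and unpack, because each Latin-1 byte has the
same numeric value as its Unicode code point.  This sketch is not how
Unicode::Map works internally (it reads binary mapfiles); the function
names are hypothetical and big-endian 16-bit units are assumed, as in the
Synopsis output.

```perl
use strict;
use warnings;

# Each Latin-1 byte becomes one big-endian 16-bit unit, and back again.
# Only valid for text whose code points are all below 256.
sub sketch_latin1_to_utf16 { return pack "n*", unpack "C*", $_[0] }
sub sketch_utf16_to_latin1 { return pack "C*", unpack "n*", $_[0] }

my $utf16  = sketch_latin1_to_utf16("Hello world!");
my $locale = sketch_utf16_to_latin1($utf16);
print length($utf16), " bytes; round trip: $locale\n";
```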

WARNINGS
========

   You can ask Unicode::Map to issue warnings about deprecated or
incompatible usage with the constants WARN_DEFAULT, WARN_DEPRECATION or
WARN_COMPATIBILITY.  The latter two can be ORed together.

No special warnings:
     $Unicode::Map::WARNINGS = Unicode::Map::WARN_DEFAULT

Warnings for deprecated usage:
     $Unicode::Map::WARNINGS = Unicode::Map::WARN_DEPRECATION

Warnings for incompatible usage:
     $Unicode::Map::WARNINGS = Unicode::Map::WARN_COMPATIBILITY

MAINTENANCE METHODS
===================

   Note: These methods are solely for the maintenance of Unicode::Map.
Using any of these methods will lead to programs incompatible with
Unicode::Map8.

alias
     *@list* = *$Map* -> alias (*$csid*)

     Returns a list of alias names of character set *$csid*.

mapping
     $path = *$Map* -> mapping (*$csid*)

     Returns the absolute path of binary character mapping for character
     set *$csid* according to REGISTRY file of Unicode::Map.

id
     *$real_id*||"" = *$Map* -> id (*$test_id*)

     Returns a valid character set identifier *$real_id*, if *$test_id* is
     a valid character set name or alias name according to REGISTRY file of
     Unicode::Map.

ids
     *@ids* = *$Map* -> ids()

     Returns a list of all character set names defined in REGISTRY file.

read_text_mapping
     1||0 = *$Map* -> read_text_mapping (*$csid*, $path, $style)

     Read a text mapping of style $style named *$csid* from filename $path.
     The mapping then can be saved to a file with method:
     write_binary_mapping.  <$style> can be:

          style          description

          "unicode"    A text mapping as of ftp://ftp.unicode.org/MAPPINGS/
          ""           Same as "unicode"
          "reverse"    Similar to unicode, but both columns are switched
          "keld"       A text mapping as of ftp://dkuug.dk/i18n/charmaps/

src
     $path = *$Map* -> src (*$csid*)

     Returns the path of textual character mapping for character set
     *$csid* according to REGISTRY file of Unicode::Map.

style
     $style = *$Map* -> style (*$csid*)

     Returns the style of textual character mapping for character set
     *$csid* according to REGISTRY file of Unicode::Map.

write_binary_mapping
     1||0 = *$Map* -> write_binary_mapping (*$csid*, $path)

     Stores a mapping that has been loaded via method read_text_mapping in
     file $path.

DEPRECATED METHODS
==================

   Some functionality is no longer promoted.

noise
     Deprecated! Don't use any longer.

reverse_unicode
     Deprecated! Use Unicode::String::byteswap instead.

BINARY MAPPINGS
===============

   Structure of binary mapfiles

   Unicode character mapping tables tend to contain runs of sequential key
codes that map to runs of sequential value codes.  This property is used
to compress ("crunch") the maps.  A run of n (0<n<256) sequential
characters is represented as a byte count n and the first character code
key_start.  The corresponding value sequences are crunched in the same
way.  The value 0 is used to start an extended information block (which is
only partially implemented, though).

   There are two obvious ways to lay out a binary mapfile.  The first is to
write a list of all key codes, followed by a list of all value codes.  The
second, used here, appends the corresponding crunched value code lists to
each partial key code list.  This keeps value codes a little closer to
their key codes.

   *Note: the file format is still in a very fluid state.  Do not rely on
it staying as it is, on this description being free of bugs, or on all
features being implemented.*

   STRUCTURE:

<main>:
          offset  structure     value

          0x00    word          0x27b8   (magic)
          0x02    @(<extended> || <submapping>)

     The mapfile ends with extended mode <end> in main stream.

<submapping>:
          0x00    byte != 0     charsize1 (bits)
          0x01    byte          n1 number of chars for one entry
          0x02    byte          charsize2 (bits)
          0x03    byte          n2 number of chars for one entry
          0x04    @(<extended> || <key_seq> || <key_val_seq>)

          bs1=int((charsize1+7)/8), bs2=int((charsize2+7)/8)

     A submapping ends when a <mapend> entry occurs.
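
     As a small illustration of the byte-size formula above, the following
     sketch rounds a character size in bits up to whole bytes.  The helper
     name bytes_for_bits is hypothetical and not part of Unicode::Map.

```perl
use strict;
use warnings;

# Hypothetical helper: number of bytes needed to hold $bits bits,
# i.e. bs = int((charsize + 7) / 8) from the <submapping> description.
sub bytes_for_bits {
    my ($bits) = @_;
    return int(($bits + 7) / 8);
}

printf "%2d bits -> %d byte(s)\n", $_, bytes_for_bits($_) for (7, 8, 9, 16, 32);
```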

<key_val_seq>:
          0x00    size=0|1|2|4  n, number of sequential characters
          size    bs1           key1
          +bs1    bs2           value1
          +bs2    bs1           key2
          +bs1    bs2           value2
          ...

     key_val_seq ends when either the file ends (n = infinite mode) or n
     pairs have been read.

<key_seq>:
          0x00    byte          n, number of sequential characters
          0x01    bs1           key_start, first character of sequence
          1+bs1   @(<extended> || <val_seq>)

     A key sequence starts with a byte count telling how long the sequence
     is.  It is followed by the key start code.  After this comes a list of
     value sequences.  The list of value sequences ends when sum(m) equals
     n.

<val_seq>:
          0x00    byte          m, number of sequential characters
          0x01    bs2           val_start, first character of sequence
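
     The relationship between <key_seq> and <val_seq> can be sketched as
     follows.  This is only an illustration working on already parsed
     numbers: the helper name expand_key_seq and the sample codes are
     assumptions, and reading the bytes themselves is left out because the
     byte order is not stated in this description.

```perl
use strict;
use warnings;

# Hypothetical helper: expand one already-parsed <key_seq> into key => value
# pairs.  $n and $key_start come from the <key_seq> header; @val_seqs holds
# [m, val_start] pairs, one per <val_seq>.
sub expand_key_seq {
    my ($n, $key_start, @val_seqs) = @_;
    my %map;
    my $key = $key_start;
    for my $seq (@val_seqs) {
        my ($m, $val_start) = @$seq;
        # m sequential keys map to m sequential values.
        $map{ $key++ } = $val_start++ for 1 .. $m;
    }
    # The list of value sequences must cover exactly n keys: sum(m) == n.
    die "sum(m) != n\n" unless $key - $key_start == $n;
    return \%map;
}

# Five sequential keys starting at 0x41, covered by two value runs.
my $pairs = expand_key_seq(5, 0x41, [3, 0x0391], [2, 0x0410]);
printf "0x%02X => 0x%04X\n", $_, $pairs->{$_}
    for sort { $a <=> $b } keys %$pairs;
```

     One key run can thus be split across several value runs, which is why
     the value list is terminated by the running sum of m rather than by a
     separate count.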

<extended>:
          0x00    byte          0
          0x01    byte          ftype
          0x02    byte          fsize, size of following structure
          0x03    fsize bytes   something

     For future extensions or private use, streams of 1..255 bytes can be
     inserted here.  ftype can have values 30..255; values 0..29 are
     reserved.  The modes are not yet fully defined and could change.  They
     will be explained later.

TO BE DONE
==========

-
     Something clever, when a character has no translation.

-
     Direct charset -> charset mapping.

-
     Better performance.

-
     Support for mappings according to RFC 1345.

SEE ALSO
========

-
     File `REGISTRY' and binary mappings in directory `Unicode/Map' of your
     perl library path

-
     recode(1), map(1), mkmapfile(1), Unicode::Map(3), Unicode::Map8(3),
     Unicode::String(3), Unicode::CharName(3), mirrorMappings(1)

-
     RFC 1345

-
     Mappings at Unicode consortium ftp://ftp.unicode.org/MAPPINGS/

-
     Registered Internet character sets ftp://dkuug.dk/i18n/charmaps/

-
     2do: more references

AUTHOR
======

   Martin Schwartz <`martin@nacho.de'>


File: pm.info,  Node: Unicode/Map8,  Next: Unicode/MapUTF8,  Prev: Unicode/Map,  Up: Module List

Mapping table between 8-bit chars and Unicode
*********************************************

NAME
====

   Unicode::Map8 - Mapping table between 8-bit chars and Unicode

SYNOPSIS
========

     require Unicode::Map8;
     my $no_map = Unicode::Map8->new("ISO646-NO") || die;
     my $l1_map = Unicode::Map8->new("latin1")    || die;

     my $ustr = $no_map->to16("V}re norske tegn b|r {res\n");
     my $lstr = $l1_map->to8($ustr);
     print $lstr;

     print $no_map->tou("V}re norske tegn b|r {res\n")->utf8;

DESCRIPTION
===========

   The Unicode::Map8 class implements efficient mapping tables between
8-bit character sets and 16-bit character sets like Unicode.  The tables
are efficient both in terms of space allocated and translation speed.  The
16-bit strings are assumed to use network byte order.

   The following methods are available:

$m = Unicode::Map8->new( [$charset] )
     The object constructor creates new instances of the Unicode::Map8
     class.  It takes an optional argument that specifies the name of an
     8-bit character set to initialize mappings from.  The argument can
     also be the name of a mapping file.  If the charset/file cannot be
     located, then the constructor returns undef.

     If you omit the argument, then an empty mapping table is constructed.
     You must then add mapping pairs to it using the addpair() method
     described below.

$m->addpair( $u8, $u16 );
     Adds a new mapping pair to the mapping object.  It takes two
     arguments.  The first is the code value in the 8-bit character set and
     the second is the corresponding code value in the 16-bit character
     set.  The same codes can be used multiple times (but using the same
     pair has no effect).  The first definition for a code is the one that
     is used.

     Consider the following example:

          $m->addpair(0x20, 0x0020);
          $m->addpair(0x20, 0x00A0);
          $m->addpair(0xA0, 0x00A0);

     It means that the characters 0x20 and 0xA0 in the 8-bit charset map to
     themselves in the 16-bit set, but that 0x00A0 in the 16-bit character
     set maps to 0x20.
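
     The "first definition wins" rule can be modelled with two plain
     hashes.  This is a sketch of the documented behaviour only, not the
     real implementation; addpair_model, %to16 and %to8 are names invented
     for the illustration.

```perl
use strict;
use warnings;

# Plain-hash model of the documented behaviour: in each direction the
# first definition for a code is the one that is kept.
my (%to16, %to8);

sub addpair_model {
    my ($u8, $u16) = @_;
    $to16{$u8} = $u16 unless exists $to16{$u8};   # 8-bit  -> 16-bit
    $to8{$u16} = $u8  unless exists $to8{$u16};   # 16-bit -> 8-bit
}

addpair_model(0x20, 0x0020);
addpair_model(0x20, 0x00A0);
addpair_model(0xA0, 0x00A0);

printf "8-bit  0x%02X -> 0x%04X\n", $_, $to16{$_}
    for sort { $a <=> $b } keys %to16;
printf "16-bit 0x%04X -> 0x%02X\n", $_, $to8{$_}
    for sort { $a <=> $b } keys %to8;
```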

$m->default_to8( $u8 )
     Set the code of the default character to use when mapping from 16-bit
     to 8-bit strings.  If there is no mapping pair defined for a character
     then this default is substituted by to8() and recode8().

$m->default_to16( $u16 )
     Set the code of the default character to use when mapping from 8-bit
     to 16-bit strings. If there is no mapping pair defined for a character
     then this default is used by to16(), tou() and recode8().

$m->nostrict;
     All undefined mappings are replaced with the identity mapping.
     Undefined characters are normally just removed (or replaced with the
     default if defined) when converting between character sets.

$m->to8( $ustr );
     Converts a 16-bit character string to the corresponding string in the
     8-bit character set.

$m->to16( $str );
     Converts an 8-bit character string to the corresponding string in the
     16-bit character set.

$m->tou( $str );
     Same as to16() but returns a Unicode::String object instead of a plain
     UCS2 string.

$m->recode8($m2, $str);
     Map the string $str from one 8-bit character set ($m) to another one
     ($m2).  Since we assume we know the mappings towards the common 16-bit
     encoding we can use this to convert between any of the 8-bit character
     sets.

$m->to_char16( $u8 )
     Maps a single 8-bit character code to a 16-bit code.  If the 8-bit
     character is unmapped then the constant NOCHAR is returned.  The
     default is not used and the callback method is not invoked.

$m->to_char8( $u16 )
     Maps a single 16-bit character code to an 8-bit code. If the 16-bit
     character is unmapped then the constant NOCHAR is returned.  The
     default is not used and the callback method is not invoked.

   The following callback methods are available.  You can override these
methods by creating a subclass of Unicode::Map8.

$m->unmapped_to8
     When mapping to an 8-bit character string, this method is called as a
     last resort if there is no mapping defined (and no default either).
     It is called with a single integer argument, which is the code of the
     unmapped 16-bit character.  It is expected to return a string that
     will be incorporated in the 8-bit string.  The default version of this
     method always returns an empty string.

     Example:

          package MyMapper;
          @ISA=qw(Unicode::Map8);
          
          sub unmapped_to8
          {
             my($self, $code) = @_;
             require Unicode::CharName;
             "<" . Unicode::CharName::uname($code) . ">";
          }

$m->unmapped_to16
     Likewise, when mapping to a 16-bit character string and no mapping is
     defined, this method is called.  It should return a 16-bit string
     with the bytes in network byte order.  The default version of this
     method always returns an empty string.
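
   The fallback order described for to8() (mapping pair, then the
default_to8 value, then the unmapped_to8 callback) can be sketched in plain
Perl.  The function to8_model and its arguments are stand-ins invented for
this illustration; it does not use Unicode::Map8 itself.

```perl
use strict;
use warnings;

# Model of the documented fallback order for each 16-bit code.
# $table, $default and $callback are stand-ins chosen for this sketch.
sub to8_model {
    my ($table, $default, $callback, @codes) = @_;
    my $out = '';
    for my $code (@codes) {
        if (exists $table->{$code}) {
            $out .= chr $table->{$code};      # 1. a mapping pair exists
        }
        elsif (defined $default) {
            $out .= chr $default;             # 2. default_to8 was set
        }
        else {
            $out .= $callback->($code);       # 3. unmapped_to8 as last resort
        }
    }
    return $out;
}

my %table    = (0x0041 => 0x41, 0x00E5 => 0x7D);
my $callback = sub { sprintf '<U+%04X>', $_[0] };
my @codes    = (0x0041, 0x263A, 0x00E5);

print to8_model(\%table, undef, $callback, @codes), "\n";   # callback is used
print to8_model(\%table, 0x3F,  $callback, @codes), "\n";   # default is used
```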

FILES
=====

   The Unicode::Map8 constructor can parse two different file formats: a
binary format and a textual format.

   The binary format is simple.  It consists of a sequence of 16-bit
integer pairs in network byte order.  The first pair should contain the
magic value 0xFFFE, 0x0001.  Of each pair, the first value is the code of
an 8-bit character and the second is the code of the 16-bit character.  It
follows from this that the first value should be less than 256.
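
   A minimal sketch of this layout, using only pack and unpack from core
Perl: it builds an in-memory image with two made-up pairs and parses it
back.  It is meant to illustrate the description above and is not taken
from the module or from *map8_txt2bin*.

```perl
use strict;
use warnings;

# Build an in-memory image of the binary format as described above:
# 16-bit big-endian ("n") integers, starting with the magic pair.
my @pairs = ([0x20, 0x0020], [0x7D, 0x00E5]);
my $image = pack 'n*', 0xFFFE, 0x0001, map { @$_ } @pairs;

# Parse it back, checking the magic pair and the 8-bit range of each code.
my ($magic1, $magic2, @values) = unpack 'n*', $image;
die "bad magic\n" unless $magic1 == 0xFFFE && $magic2 == 0x0001;

my %parsed;
while (my ($u8, $u16) = splice @values, 0, 2) {
    die "8-bit code out of range\n" if $u8 > 0xFF;
    $parsed{$u8} = $u16;
}
printf "0x%02X 0x%04X\n", $_, $parsed{$_}
    for sort { $a <=> $b } keys %parsed;
```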

   The textual format consists of lines, each of which is either a comment
(first non-blank character is '#'), a completely blank line or a line with
two hexadecimal numbers.  The hexadecimal numbers must be preceded by "0x"
as in C and Perl.  This is the same format used by the Unicode mapping
files available from <URL:ftp://ftp.unicode.org/Public>.
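
   The line rules above can be sketched as a small parser.  The sample
lines are made up, and ignoring any text after the two numbers is a choice
of this sketch rather than documented behaviour of the module.

```perl
use strict;
use warnings;

# Sample lines in the textual format; the contents are invented.
my @lines = (
    "# 8-bit code   16-bit code",
    "",
    "0x20  0x0020",
    "   0x7D\t0x00E5   # text after the numbers is ignored by this sketch",
);

my %from_text;
for my $line (@lines) {
    next if $line =~ /^\s*(?:#|$)/;            # comment or blank line
    my ($u8, $u16) = $line =~ /^\s*0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)/
        or die "unparsable line: $line\n";
    $from_text{ hex $u8 } = hex $u16;
}
printf "0x%02X 0x%04X\n", $_, $from_text{$_}
    for sort { $a <=> $b } keys %from_text;
```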

   The mapping table files are installed in the `Unicode/Map8/maps'
directory somewhere in the Perl @INC path.  The variable
$Unicode::Map8::MAPS_DIR is the complete path name to this directory.
Binary mapping files are stored within this directory with the suffix
*.bin*.  Textual mapping files are stored with the suffix *.txt*.

   The scripts *map8_bin2txt* and *map8_txt2bin* can translate between
these mapping file formats.

   A special file called aliases within $MAPS_DIR specifies all the alias
names that can be used to denote the various character sets.  The first
name on each line is the real file name and the rest are alias names
separated by spaces.
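
   Reading such a file amounts to splitting each line on whitespace.  The
sketch below works on two invented sample lines instead of the real file,
and the hash name %real_name is an assumption of the example.

```perl
use strict;
use warnings;

# Invented sample lines; the real file lives in $Unicode::Map8::MAPS_DIR.
my @alias_lines = (
    "ISO_8859-1 latin1 l1",
    "NS_4551-1 ISO646-NO no",
);

my %real_name;
for my $line (@alias_lines) {
    my ($file, @aliases) = split ' ', $line;
    next unless defined $file;                    # skip empty lines
    $real_name{$_} = $file for $file, @aliases;   # every name resolves to the file
}
print "$_ => $real_name{$_}\n" for sort keys %real_name;
```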

   The ``umap --list'' command can be used to list the supported
character sets.

BUGS
====

   Does not handle Unicode surrogate pairs as a single character.

SEE ALSO
========

   `umap(1)' in this node, *Note Unicode/String: Unicode/String,

COPYRIGHT
=========

   Copyright 1998 Gisle Aas.

   This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.


File: pm.info,  Node: Unicode/MapUTF8,  Next: Unicode/String,  Prev: Unicode/Map8,  Up: Module List

Conversions to and from arbitrary character sets and UTF8
*********************************************************

NAME
====

   Unicode::MapUTF8 - Conversions to and from arbitrary character sets and
UTF8

SYNOPSIS
========

     use Unicode::MapUTF8 qw(to_utf8 from_utf8 utf8_supported_charset
                             utf8_charset_alias);

     # Convert a string in 'ISO-8859-1' to 'UTF8'
     my $output = to_utf8({ -string => 'An example', -charset => 'ISO-8859-1' });

     # Convert a string in 'UTF8' encoding to encoding 'ISO-8859-1'
     my $other  = from_utf8({ -string => 'Other text', -charset => 'ISO-8859-1' });

     # List available character set encodings
     my @character_sets = utf8_supported_charset;

     # Add a character set alias
     utf8_charset_alias({ 'ms-japanese' => 'sjis' });

     # Convert between two arbitrary (but largely compatible) charset encodings
     # (SJIS to EUC-JP)
     my $utf8_string   = to_utf8({ -string => $sjis_string, -charset => 'sjis' });
     my $euc_jp_string = from_utf8({ -string => $utf8_string, -charset => 'euc-jp' });

     # Verify that a specific character set is supported
     if (utf8_supported_charset('ISO-8859-1')) {
         # Yes
     }

DESCRIPTION
===========

   Provides an adapter layer between core routines for converting to and
from UTF8 and other encodings.  In essence, it gives multiple existing
Unicode modules a single common interface, so you don't have to know the
underlying implementations to do simple conversions between UTF8 and other
character set encodings.  As such, it wraps the Unicode::String,
Unicode::Map8, Unicode::Map and Jcode modules in a standardized and simple
API.

   This also provides a general character set conversion operation based
on UTF8: it is possible to convert between any two compatible and
supported character sets via a simple two-step chaining of conversions.
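
   The two-step chaining idea can be tried without this module installed.
The sketch below swaps in the core Encode module as a stand-in, so it shows
the technique (source charset to UTF8, then UTF8 to target charset) rather
than the Unicode::MapUTF8 API itself; the sample text is an assumption of
the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Three Japanese characters as Unicode, then as Shift_JIS bytes (the input).
my $text        = "\x{65E5}\x{672C}\x{8A9E}";
my $sjis_string = encode('shiftjis', $text);

# Step 1: source charset -> UTF-8.  Step 2: UTF-8 -> target charset.
my $utf8_string   = encode('UTF-8',  decode('shiftjis', $sjis_string));
my $euc_jp_string = encode('euc-jp', decode('UTF-8',    $utf8_string));

printf "sjis:   %s\n", unpack 'H*', $sjis_string;
printf "utf8:   %s\n", unpack 'H*', $utf8_string;
printf "euc-jp: %s\n", unpack 'H*', $euc_jp_string;
```

   The conversion is only lossless when every character of the source text
also exists in the target character set, which is what "compatible" means
in the paragraph above.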

   As with most things Perlish - if you give it a few big chunks of text
to chew on instead of lots of small ones it will handle many more
characters per second.

   By design, it can be easily extended to encompass any new charset
encoding conversion modules that arrive on the scene.

CHANGES
=======

   1.08 2000.11.06 - Added 'utf8_charset_alias' function to allow for
     runtime setting of character set aliases.  Added several alternate
     names for 'sjis' (shiftjis, shift-jis, shift_jis, s-jis, and s_jis).

     Corrected 'croak' messages for 'from_utf8' functions to appropriate
     function name.

     Tightened up initialization encapsulation.

     Corrected fatal problem in jcode from unicode internals.  Problem and
     fix found by Brian Wisti <wbrian2@uswest.net>.

   1.07 2000.11.01 - Added 'croak' to use Carp declaration to fix error
     messages.  Problem and fix found by Brian Wisti <wbrian2@uswest.net>.

   1.06 2000.10.30 - Fix to handle change in stringification of overloaded
     objects between Perl 5.005 and 5.6.  Problem noticed by Brian Wisti
     <wbrian2@uswest.net>.

   1.05 2000.10.23 - Error in conversions from UTF8 to multibyte encodings
     corrected.

   1.04 2000.10.23 - Additional diagnostic messages added for internal
     error conditions.

   1.03 2000.10.22 - Bug fix for load time autodetection of Unicode::Map8
     encodings.

   1.02 2000.10.22 - Added load time autodetection of Unicode::Map8
     supported character set encodings.

     Fixed internal calling error for some character sets with
     'from_utf8'.  Thanks go to Ilia Lobsanov <ilia@lobsanov.com> for
     reporting this problem.

   1.01 2000.10.02 - Fixed handling of empty strings and added more
     identification for error messages.

   1.00 2000.09.29 - Pre-release version

FUNCTIONS
=========

utf8_charset_alias({ $alias => $charset });
     Used for runtime assignment of character set aliases.

     Called with no parameters, returns a hash of defined aliases and the
     character sets they map to.

     Example:

          my $aliases     = utf8_charset_alias;
          my @alias_names = keys %$aliases;

     If called with ONE parameter, returns the name of the 'real' charset
     if the alias is defined. Returns undef if it is not found in the
     aliases.

     Example:

          if (! utf8_charset_alias('VISCII')) {
              # No alias for this
          }

     If called with a reference to a hash of 'alias' => 'charset' pairs,
     defines those aliases for use.

     Example:

          utf8_charset_alias({ 'japanese' => 'sjis', 'japan' => 'sjis' });

     Note: It will croak if a passed pair does not map to a character set
     defined in the predefined set of character encodings.  It is NOT
     allowed to alias something to another alias.

     Multiple character set aliases can be set with a single call.

     To clear an alias, pass a character set mapping of undef.

     Example:

          utf8_charset_alias({ 'japanese' => undef });

     While an alias is set, the 'utf8_supported_charset' function will
     return the alias as if it were a predefined charset.

     Overriding a base defined character encoding with an alias will
     generate a warning message to STDERR.
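
     The calling conventions above can be modelled in a few lines of plain
     Perl.  The function alias_model and the small %predefined list are
     hypothetical stand-ins written for this sketch; they mimic the
     documented rules and are not the module's code.

```perl
use strict;
use warnings;
use Carp qw(croak);

# Invented subset of predefined charsets; the real list comes from the
# wrapped modules.
my %predefined = map { $_ => 1 } qw(sjis euc-jp iso-8859-1);
my %alias;

# Hypothetical stand-in for utf8_charset_alias, following the rules above.
sub alias_model {
    my ($arg) = @_;
    return +{ %alias }  unless defined $arg;   # no parameter: all aliases
    return $alias{$arg} unless ref $arg;       # one name: its real charset
    for my $name (sort keys %$arg) {
        my $charset = $arg->{$name};
        if (!defined $charset) {               # undef clears the alias
            delete $alias{$name};
            next;
        }
        croak "'$charset' is not a predefined charset"   # also rejects aliases
            unless $predefined{$charset};
        $alias{$name} = $charset;
    }
    return;
}

alias_model({ japanese => 'sjis', japan => 'sjis' });
print join(', ', sort keys %{ alias_model() }), "\n";

alias_model({ japanese => undef });
my $state = defined alias_model('japanese') ? 'still set' : 'cleared';
print "japanese: $state\n";

eval { alias_model({ nihongo => 'japan' }) };   # alias to an alias
print "rejected: $@" if $@;
```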

utf8_supported_charset($charset_name);
     Returns true if the named charset is supported (including user
     defined aliases).

     Returns false if it is not.

     Example:

          if (! utf8_supported_charset('VISCII')) {
              # No support yet
          }

     If called in a list context with no parameters, it will return a list
     of all supported character set names (including user defined aliases).

     Example:

          my @charsets = utf8_supported_charset;

to_utf8({ -string => $string, -charset => $source_charset });
     Returns the string converted to UTF8 from the specified source
     charset.

from_utf8({ -string => $string, -charset => $target_charset});
     Returns the string converted from UTF8 to the specified target
     charset.

VERSION
=======

   1.08 2000.11.06

COPYRIGHT
=========

   Copyright September, 2000 Benjamin Franz. All rights reserved.

   This software is free software.  You can redistribute it and/or modify
it under the same terms as Perl itself.

AUTHOR
======

   Benjamin Franz <snowhare@nihongo.org>

TODO
====

   Regression tests for Jcode, 2-byte encodings and encoding aliases

SEE ALSO
========

   Unicode::String Unicode::Map8 Unicode::Map Jcode