This is Info file pm.info, produced by Makeinfo version 1.68 from the
input file bigpm.texi.


File: pm.info,  Node: Lingua/EN/Nickname,  Next: Lingua/EN/Numbers,  Prev: Lingua/EN/NameParse,  Up: Module List

Genealogical nickname matching (Liz=Beth)
*****************************************

NAME
====

   Lingua::EN::Nickname - Genealogical nickname matching (Liz=Beth)

SYNOPSIS
========

     use Lingua::EN::Nickname;

     # Equivalent first names?
     $score= nickname_eq( $firstn_0, $firstn_1 );

     # Full, expanded, name(s)
     @roots= rootname( $firstn );

DESCRIPTION
===========

   Nicknames, alternate spellings, and alternate etymological derivations
make checking first name equivalence nearly impossible.  This module will
tell you that 'Maggie', 'Peg', and 'Margaret' are all probably the same
name.
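
   As a rough illustration of the idea (not the module's actual data or
algorithm), nickname matching can be sketched as a lookup from nickname to
root name.  The table and the `toy_rootname' and `toy_nickname_eq'
functions below are hypothetical and exist only for this sketch:

```perl
use strict;
use warnings;

# Hypothetical, tiny nickname table; the real module ships its own data.
my %ROOT_OF = (
    maggie => ['margaret'],
    peg    => ['margaret'],
    liz    => ['elizabeth'],
    beth   => ['elizabeth'],
);

# Return the full name(s) a first name may derive from (lower-cased).
sub toy_rootname {
    my ($name) = @_;
    my $key = lc $name;
    return exists $ROOT_OF{$key} ? @{ $ROOT_OF{$key} } : ($key);
}

# True when two first names share at least one root name.
sub toy_nickname_eq {
    my ($first, $second) = @_;
    my %seen   = map { $_ => 1 } toy_rootname($first);
    my $shared = grep { $seen{$_} } toy_rootname($second);
    return $shared;
}

print toy_nickname_eq('Maggie', 'Peg') ? "same\n" : "different\n";
print toy_nickname_eq('Liz', 'Peg')    ? "same\n" : "different\n";
```

   A real table would also need to handle names with several plausible
roots, which is why the sketch stores each root list as an array.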

SOURCES
=======

   * USGenWeb Project  `http:' in this node

   * TNGenWeb Project  `http:' in this node

   * Chesnut Family Pages  `http:' in this node

   * Ultimate Family Tree  `http:' in this node

TODO
====

   * Hire a team of experts to provide a more scientific, statistically
     accurate Name Etymology source file.

   * Create more phonetically-based sub-regexes.

   * Detect simple monosyllabic truncation nicknames, be less certain
     about them, but match more.

   * Pay more attention to gender.

AUTHOR
======

   Brian Lalonde, <brianl@sd81.k12.wa.us>

SEE ALSO
========

   perl(1)


File: pm.info,  Node: Lingua/EN/Numbers,  Next: Lingua/EN/Numbers/Easy,  Prev: Lingua/EN/Nickname,  Up: Module List

Converts numeric values into their English string equivalents.
**************************************************************

NAME
====

   Lingua::EN::Numbers - Converts numeric values into their English string
equivalents.

SYNOPSIS
========

     ## EXAMPLE 1

     use Lingua::EN::Numbers qw(American);

     $n = Lingua::EN::Numbers->new(313721.23);
     if (defined $n) {
         $s = $n->get_string;
         print "$s\n";
     }

     ## EXAMPLE 2

     use Lingua::EN::Numbers;

     $n = Lingua::EN::Numbers->new;
     $n->parse(-1281);
     print "N = " . $n->get_string . "\n";

REQUIRES
========

   Perl 5, Exporter, Carp

DESCRIPTION
===========

   Lingua::EN::Numbers converts arbitrary numbers into human-oriented
English text. Limited support is included for parsing conventionally
formatted numbers (e.g. '3,213.23'), but no attempt has been made to handle
more complex formats. The interface is designed to allow multiple variants
of English; currently only "American" formatting is supported.

   To use the class, an instance is generated. The instance is then loaded
with a number. This can occur either during construction of the instance
or later, via a call to the parse method. The number is then analyzed and
parsed into the English text equivalent.

   The instance, now initialized, can be converted into a string, via the
*get_string* method. This method takes the parsed data and converts it
from a data structure into a formatted string. Elements of the string's
formatting can be tweaked between calls to the *get_string* function.
While such changes are unlikely, this has been done simply to provide
maximum flexibility.

METHODS
=======

Creation
--------

new Lingua::EN::Numbers $numberString
     Creates, optionally initializes, and returns a new instance.

Initialization
--------------

$number->parse $numberString
     Parses a number and (re)initializes an instance.

Output
------

$number->get_string
     Returns a formatted string based on the most recent parse.

CLASS VARIABLES
===============

$Lingua::EN::Numbers::VERSION
     The version of this class.

$Lingua::EN::Numbers::MODE
     The current locale mode. Currently only *American* is supported.

%Lingua::EN::Numbers::INPUT_GROUP_DELIMITER
     The delimiter which separates number groups.  Example:
     "1*,*321*,*323" uses the comma '*,*' as the group delimiter.

%Lingua::EN::Numbers::INPUT_DECIMAL_DELIMITER
     The delimiter which separates the main number from its decimal part.
     Example: "132.2" uses the period '.' as the decimal delimiter.

%Lingua::EN::Numbers::OUTPUT_BLOCK_DELIMITER
     A character used at output time to convert the number into a string.
     Example: One Thousand, Two-Hundred and Twenty-Two point Four.  Uses
     the space character ' ' as the block delimiter.

%Lingua::EN::Numbers::OUTPUT_GROUP_DELIMITER
     A character used at output time to convert the number into a string.
     Example: One Thousand*,* Two-Hundred and Twenty-Two point Four.  Uses
     the comma '*,*' character as the group delimiter.

%Lingua::EN::Numbers::OUTPUT_NUMBER_DELIMITER
     A character used at output time to convert the number into a string.
     Example: One Thousand, Two-Hundred and Twenty-Two point Four.  Uses
     the dash '-' character as the number delimiter.

%Lingua::EN::Numbers::OUTPUT_DECIMAL_DELIMITER
     A character used at output time to convert the number into a string.
     Example: One Thousand, Two-Hundred and Twenty-Two point Four.  Uses
     the 'point' string as the decimal delimiter.

%Lingua::EN::Numbers::NUMBER_NAMES
     A list of names for numbers.

%Lingua::EN::Numbers::SIGN_NAMES
     A list of names for positive and negative signs.

$Lingua::EN::Numbers::SIGN_POSITIVE
     A constant indicating that the current number is positive.

$Lingua::EN::Numbers::SIGN_NEGATIVE
     A constant indicating that the current number is negative.
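
   To make the role of the input delimiters concrete, here is a minimal
sketch of splitting a formatted number string into its parts.  The helper
name `toy_split_number' and its argument order are assumptions made for
this example; it is not the class's own parser:

```perl
use strict;
use warnings;

# Hypothetical helper: split a formatted number string into sign,
# integer digits, and decimal digits using the given delimiters.
sub toy_split_number {
    my ($text, $group_delim, $decimal_delim) = @_;
    $group_delim   = ',' unless defined $group_delim;
    $decimal_delim = '.' unless defined $decimal_delim;

    my $sign = ($text =~ s/^-//) ? 'negative' : 'positive';
    $text =~ s/^\+//;

    my ($whole, $fraction) = split /\Q$decimal_delim\E/, $text, 2;
    $whole    = '' unless defined $whole;
    $fraction = '' unless defined $fraction;
    $whole =~ s/\Q$group_delim\E//g;      # drop the group delimiters

    # An empty list signals a bad number format.
    return unless $whole =~ /^\d+$/ and $fraction =~ /^\d*$/;
    return ($sign, $whole, $fraction);
}

my ($sign, $whole, $fraction) = toy_split_number('3,213.23');
print "$sign $whole point $fraction\n";
```

   Quoting the delimiters with `\Q...\E' keeps a delimiter such as '.'
from being treated as a regular expression metacharacter.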

DIAGNOSTICS
===========

Error: Lingua::EN::Numbers does not support tag: '$tag'.
     (F) The module has been invoked with an invalid locale.

Error: bad number format: '$number'.
     (F) The number specified is not in a valid numeric format.

Error: bad number format: '.$number'.
     (F) The decimal portion of number specified is not in a valid numeric
     format.

AUTHOR
======

   Stephen Pandich, pandich@yahoo.com


File: pm.info,  Node: Lingua/EN/Numbers/Easy,  Next: Lingua/EN/Numbers/Ordinate,  Prev: Lingua/EN/Numbers,  Up: Module List

Hash access to Lingua::EN::Numbers objects.
*******************************************

NAME
====

   Lingua::EN::Numbers::Easy  -  Hash access to Lingua::EN::Numbers
objects.

SYNOPSIS
========

     use Lingua::EN::Numbers::Easy;

     print "$N{1} fish, $N{2} fish, blue fish, red fish";
                          # one fish, two fish, blue fish, red fish.

DESCRIPTION
===========

   `Lingua::EN::Numbers' is a module that translates numbers to English
words. Unfortunately, it has an object-oriented interface, which makes it
hard to interpolate the results into strings. `Lingua::EN::Numbers::Easy'
translates numbers to words using a tied hash, which can be interpolated.

   By default, `Lingua::EN::Numbers::Easy' exports a hash `%N' to the
importing package. Also, by default, `Lingua::EN::Numbers::Easy' uses the
British mode of `Lingua::EN::Numbers'. Both defaults can be changed by
optional arguments to the `use Lingua::EN::Numbers::Easy;' statement.

   The first argument determines the parsing mode of `Lingua::EN::Numbers'.
Currently, `Lingua::EN::Numbers' supports *British* and *American*.  The
second argument determines the name of the hash in the importing package.

     use Lingua::EN::Numbers::Easy qw /American %nums/;

   would use *American* parsing mode, and `%nums' as the tied hash.

   See also the `Lingua::EN::Numbers' man page.

   `Lingua::EN::Numbers::Easy' caches results - numbers will only be
translated once.

   Any operation on the exported hash other than a fetch throws an
exception.
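
   The tied-hash technique described above can be sketched in a few lines.
The class name `ToyNumberHash' and its four-entry word table are
hypothetical stand-ins; the real module delegates the translation to
`Lingua::EN::Numbers':

```perl
use strict;
use warnings;

package ToyNumberHash;   # hypothetical class name
use Carp ();

my %WORD = (0 => 'zero', 1 => 'one', 2 => 'two', 3 => 'three');

sub TIEHASH { my ($class) = @_; return bless { cache => {} }, $class }

# Fetches translate the key once and reuse the cached result afterwards.
sub FETCH {
    my ($self, $key) = @_;
    $self->{cache}{$key} = exists $WORD{$key} ? $WORD{$key} : $key
        unless exists $self->{cache}{$key};
    return $self->{cache}{$key};
}

# Operations other than fetching are refused (two of them shown here).
sub STORE  { Carp::croak "read-only hash" }
sub DELETE { Carp::croak "read-only hash" }

package main;

tie my %N, 'ToyNumberHash';
print "$N{1} fish, $N{2} fish\n";
```

   Because hash elements interpolate in double-quoted strings, the tie
gives function-call behaviour inside a string without any extra syntax.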

REVISION HISTORY
================

     $Log: Easy.pm,v $
     Revision 1.2  1999/11/07 15:17:34  abigail
     Worked around a bug (0 -> 'zero') in Lingua::EN::Numbers.

     Revision 1.1  1999/11/07 14:59:14  abigail
     Initial revision

AUTHOR
======

   This package was written by Abigail, abigail@delanet.com.

COPYRIGHT and LICENSE
=====================

   This package is copyright 1999 by Abigail.

   Permission is hereby granted, free of charge, to any person obtaining a
copy of this software and associated documentation files (the "Software"),
to deal in the Software without restriction, including without limitation
the rights to use, copy, modify, merge, publish, distribute, sublicense,
and/or sell copies of the Software, and to permit persons to whom the
Software is furnished to do so, subject to the following conditions:

   The above copyright notice and this permission notice shall be included
in all copies or substantial portions of the Software.

   THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS
OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
THE OPEN GROUP BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY,
WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF
OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


File: pm.info,  Node: Lingua/EN/Numbers/Ordinate,  Next: Lingua/EN/Nums2Words,  Prev: Lingua/EN/Numbers/Easy,  Up: Module List

go from cardinal number (3) to ordinal ("3rd")
**********************************************

NAME
====

   Lingua::EN::Numbers::Ordinate - go from cardinal number (3) to ordinal
("3rd")

SYNOPSIS
========

     use Lingua::EN::Numbers::Ordinate;
     print ordinate(4), "\n";
      # prints 4th
     print ordinate(-342), "\n";
      # prints -342nd

     # Example of actual use:
     ...
     for(my $i = 0; $i < @records; $i++) {
       unless(is_valid($records[$i])) {
         warn "The ", ordinate($i), " record is invalid!\n";
         next;
       }
       ...
     }

DESCRIPTION
===========

   There are two kinds of numbers in English - cardinals (1, 2, 3...), and
ordinals (1st, 2nd, 3rd...).  This library provides functions for giving
the ordinal form of a number, given its cardinal value.

FUNCTIONS
=========

ordinate(SCALAR)
     Returns a string consisting of that scalar's string form, plus the
     appropriate ordinal suffix.  Example: `ordinate(23)' returns "23rd".

     As a special case, `ordinate(undef)' and `ordinate("")' return "0th",
     not "th".

     This function is exported by default.

th(SCALAR)
     Merely an alias for `ordinate', but not exported by default.

ordsuf(SCALAR)
     Returns just the appropriate ordinal suffix for the given scalar
     numeric value.  This is what `ordinate' uses to actually do its work.
     For example, `ordsuf(3)' is "rd".

     Not exported by default.

   The above functions are all prototyped to take a scalar value, so
`ordinate(@stuff)' is the same as `ordinate(scalar @stuff)'.
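
   For readers curious about the rule involved, the following is a minimal
sketch of how an English ordinal suffix can be chosen.  The functions
`toy_ordsuf' and `toy_ordinate' are hypothetical re-implementations written
for this example, not the library's source:

```perl
use strict;
use warnings;

# Hypothetical re-implementation of the suffix rule, for illustration.
sub toy_ordsuf {
    my ($n) = @_;
    return 'th' unless defined $n and $n =~ /^-?\d+$/;   # fallthrough case
    my $last_two = abs($n) % 100;
    return 'th' if $last_two >= 11 and $last_two <= 13;  # 11th, 12th, 13th
    my $last = $last_two % 10;
    return $last == 1 ? 'st'
         : $last == 2 ? 'nd'
         : $last == 3 ? 'rd'
         :              'th';
}

sub toy_ordinate {
    my ($n) = @_;
    $n = 0 unless defined $n and length $n;   # undef and "" become 0
    return $n . toy_ordsuf($n);
}

print toy_ordinate($_), "\n" for 4, 23, 112, -342;
```

   The only irregular cases are the teens: 11, 12 and 13 take "th" even
though they end in 1, 2 and 3, which is why the last two digits are checked
before the last one.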

CAVEATS
=======

   * Note that this library knows only about numbers, not number-words.
`ordinate('seven')' might just as well be `ordinate('superglue')' or
`ordinate("\x1E\x9A")' - you'll get the fallthru case of the input string
plus "th".

   * As is unavoidable, `ordinate(0256)' returns "174th" (because ordinate
sees the value 174). Similarly, `ordinate(1E12)' returns
"1000000000000th".  Returning "trillionth" would be nice, but that's an
awfully atypical case.

   * Note that this library's algorithm (as well as the basic concept and
implementation of ordinal numbers) is totally language specific.

   To pick a trivial example, consider that in French, 1 ordinates as
"1ier", whereas 41 ordinates as "41ieme".

STILL NOT SATISFIED?
====================

   Bored of this...?

     use Lingua::EN::Numbers::Ordinate qw(ordinate th);
     ...
     print th($n), " entry processed...\n";
     ...

   Try this bit of lunacy:

     {
       my $th_object;
       sub _th () { $th_object }

       package Lingua::EN::Numbers::Ordinate::Overloader;
       my $x; # Gotta have something to bless.
       $th_object = bless \$x; # Define the object now, which _th returns
       use Carp ();
       use Lingua::EN::Numbers::Ordinate ();
       sub overordinate {
         Carp::croak "_th should be used only as postfix!" unless $_[2];
         Lingua::EN::Numbers::Ordinate::ordinate($_[1]);
       }
       use overload '&' => \&overordinate;
     }

   Then you get to do:

     print 3 & _th, "\n";
       # prints "3rd"
     
     print 1 + 2 & _th, "\n";
       # prints "3rd" too!
       # Because of the precedence of & !
     
     print _th & 3, "\n";
       # dies with: "_th should be used only as postfix!"

   Kooky, isn't it?  For more delightful deliria like this, see Damian
Conway's *Object Oriented Perl* from Manning Press.

   Kinda makes you like `th(3)', doesn't it?

COPYRIGHT
=========

   Copyright (c) 2000 Sean M. Burke.  All rights reserved.

   This library is free software; you can redistribute it and/or modify it
under the same terms as Perl itself.

AUTHOR
======

   Sean M. Burke `sburke@cpan.org'


File: pm.info,  Node: Lingua/EN/Nums2Words,  Next: Lingua/EN/Sentence,  Prev: Lingua/EN/Numbers/Ordinate,  Up: Module List

compute English verbiage from numerical values
**********************************************

NAME
====

   Nums2Words - compute English verbiage from numerical values

SYNOPSIS
========

     use Nums2Words;
     $Verbage = &num2word($Number);
     $Verbage = &num2word_ordinal($Number);
     $Verbage = &num2word_short_ordinal($Number);
     $Verbage = &num2usdollars($Number);

DESCRIPTION
===========

   To the best of my knowledge, this code has the potential for generating
US English verbiage representative of every real value from negative
infinity to positive infinity if the module's private variables
@Classifications and @Categories are filled appropriately.  This module
generates verbiage based on the thousands system.

   See http://www.quinion.demon.co.uk/words/numbers.htm for details of the
thousands system versus millions system of linguistically representing
large numbers.
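
   As an illustration of the thousands system, the sketch below names each
group of three digits with a power-of-a-thousand word.  It is a simplified
stand-in written for this example, limited to non-negative integers below
10**15; the function names and tables are hypothetical and do not reflect
the module's private variables:

```perl
use strict;
use warnings;

my @ONES = qw(zero one two three four five six seven eight nine ten
              eleven twelve thirteen fourteen fifteen sixteen seventeen
              eighteen nineteen);
my @TENS = ('', '', qw(twenty thirty forty fifty sixty seventy eighty ninety));
# Thousands-system names for successive groups of three digits.
my @GROUPS = ('', qw(thousand million billion trillion));

# Words for 0..999.
sub toy_under_thousand {
    my ($n) = @_;
    my @words;
    if ($n >= 100) {
        push @words, $ONES[ int($n / 100) ], 'hundred';
        $n %= 100;
    }
    if ($n >= 20) {
        my $tens = $TENS[ int($n / 10) ];
        $tens .= '-' . $ONES[ $n % 10 ] if $n % 10;
        push @words, $tens;
    }
    elsif ($n > 0 or !@words) {
        push @words, $ONES[$n];
    }
    return join ' ', @words;
}

# Words for a non-negative integer below 10**15.
sub toy_num2word {
    my ($n) = @_;
    return 'zero' if $n == 0;
    my @parts;
    for my $name (@GROUPS) {
        my $chunk = $n % 1000;
        $n = int($n / 1000);
        if ($chunk) {
            my $words = toy_under_thousand($chunk);
            $words .= " $name" if length $name;
            unshift @parts, $words;     # higher groups go in front
        }
        last unless $n;
    }
    return join ' ', @parts;
}

print toy_num2word(313721), "\n";
```

   Under the millions system the same digit groups would be paired, so
only the group-name table and the grouping width would change.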

   Copyright (C) 1996-1998, Lester H. Hightower, Jr.
hightowe@united-railway.com, hightowe@progressive-comp.com,
hightowe@railwayex.com

LICENSE
=======

   A license is hereby granted for anyone to reuse this Perl module in its
original, unaltered form for any purpose, including any commercial
software endeavor.  However, any modifications to this code or any
derivative work (including ports to other languages) must be submitted to
the original author, Lester H. Hightower Jr., before the modified software
or derivative work is used in any commercial application.  All
modifications or derivative works must be submitted to the author with 30
days of completion.  The author reserves the right to incorporate any
modifications or derivative work into future releases of this software.

   This software cannot be placed on a CD-ROM or similar media for
commercial distribution without the prior approval of the author.


File: pm.info,  Node: Lingua/EN/Sentence,  Next: Lingua/EN/Squeeze,  Prev: Lingua/EN/Nums2Words,  Up: Module List

Module for splitting text into sentences.
*****************************************

NAME
====

   Lingua::EN::Sentence - Module for splitting text into sentences.

SYNOPSIS
========

     use Lingua::EN::Sentence qw( get_sentences add_acronyms );

     add_acronyms(('lt','gen'));		## adding support for 'Lt. Gen.'
     my $sentences=get_sentences($text);	## Get the sentences.
     foreach my $sentence (@$sentences) {
     	## do something with $sentence
     }

DESCRIPTION
===========

   The `Lingua::EN::Sentence' module contains the function get_sentences,
which splits text into its constituent sentences, based on a regular
expression and a list of abbreviations (built in and given).

   Certain well-known exceptions, such as abbreviations, may cause
incorrect segmentations.  Some of them are already built into this code
and are handled correctly.  Still, if you find words that cause
get_sentences() to fail, you can add them to the module so that it
recognizes them.

ALGORITHM
=========

   Basically, I use a 'brute' regular expression to split the text into
sentences.  (Well, nothing is yet split - I just mark the
end-of-sentence).  Then I look into a set of rules which decide when an
end-of-sentence is justified and when it's a mistake. In case of a
mistake, the end-of-sentence mark is removed.

   What are such mistakes? Cases of abbreviations, for example. I have a
list of such abbreviations (please see the `Acronym/Abbreviations list'
section), and more general rules (for example, the abbreviations 'i.e.'
and 'e.g.' need not be in the list, as a special rule takes care of all
single-letter abbreviations).
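
   The mark-then-unmark approach described above can be sketched as
follows.  This is a simplified illustration under stated assumptions: the
marker value, the short abbreviation list and the function name
`toy_get_sentences' are all chosen for this example, and the regular
expressions are far cruder than the module's own:

```perl
use strict;
use warnings;

my $EOS = "\001";                              # assumed end-of-sentence mark
my @ABBREVIATIONS = qw(jr mr mrs ms dr prof);  # small sample list

sub toy_get_sentences {
    my ($text) = @_;

    # Step 1: mark every '.', '?' or '!' that is followed by whitespace.
    $text =~ s/([.?!])\s+/$1$EOS/g;

    # Step 2: remove marks that follow a known abbreviation ...
    my $abbrev = join '|', @ABBREVIATIONS;
    $text =~ s/\b($abbrev)\.$EOS/$1. /gi;
    # ... or a single letter, which covers forms such as 'i.e.' and 'e.g.'
    $text =~ s/\b([a-z])\.$EOS/$1. /gi;

    # Step 3: split on the surviving marks, trim, drop non-sentences.
    my @sentences = grep { /[A-Za-z0-9]/ }
                    map  { s/^\s+|\s+$//g; $_ }
                    split /$EOS/, $text;
    return \@sentences;
}

my $sentences = toy_get_sentences(
    'Dr. Smith arrived. He was late, i.e. after noon! Why?');
print "$_\n" for @$sentences;
```

   Marking first and correcting afterwards keeps each rule small: a new
exception only needs one more substitution that deletes a wrongly placed
mark.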

FUNCTIONS
=========

   All functions used should be requested in the 'use' clause. None is
exported by default.

get_sentences( $text )
     The get_sentences function takes a scalar containing ASCII text as an
     argument and returns a reference to an array of sentences that the
     text has been split into.  Returned sentences are trimmed of
     whitespace at the beginning and end.  Strings with no alphanumeric
     characters in them are not returned as sentences.

add_acronyms( @acronyms )
     This function is used for adding acronyms not supported by this code.
     Please see `Acronym/Abbreviations list' section for the
     abbreviations already supported by this module.

get_acronyms()
     This function will return the defined list of acronyms.

set_acronyms( @my_acronyms )
     This function replaces the predefined acronym list with the given list.

get_EOS()
     This function returns the value of the string used to mark the end of
     sentence. You might want to see what it is, and to make sure your
     text doesn't contain it. You can use set_EOS() to alter the
     end-of-sentence string to whatever you desire.

set_EOS( $new_EOS_string )
     This function alters the end-of-sentence string used to mark the end
     of sentences.

Acronym/Abbreviations list
==========================

   Currently supported acronym lists are:

     PEOPLE ( 'jr', 'mr', 'mrs', 'ms', 'dr', 'prof' )
     INSTITUTES ( 'dept', 'univ' )
     COMPANIES ( 'inc', 'ltd' )
     MISC ( 'vs', 'etc', 'no' )

   If I come across a good general-purpose list - I'll incorporate it into
this module.  Feel free to suggest such lists.

SEE ALSO
========

     Text::Sentence

AUTHOR
======

   Shlomo Yona <Shlomo.Yona@Siftology.com>

COPYRIGHT
=========

   Copyright (c) 2001 Siftology Inc. All rights reserved.

   This library is free software.  You can redistribute it and/or modify
it under the same terms as Perl itself.


File: pm.info,  Node: Lingua/EN/Squeeze,  Next: Lingua/EN/Summarize,  Prev: Lingua/EN/Sentence,  Up: Module List

Shorten text to minimum syllables by using hash table and vowel deletion
************************************************************************

NAME
====

   Squeeze.pm - Shorten text to minimum syllables by using hash table and
vowel deletion

REVISION
========

   $Id: Squeeze.pm,v 1.25 1998/12/04 10:00:08 jaalto Exp $

SYNOPSIS
========

     use Squeeze;            # import only function
     use Squeeze qw( :ALL ); # import all functions and variables
     use English;

     while (<>)
     {
     	print SqueezeText $ARG;
     }

DESCRIPTION
===========

   Squeeze English text into the most compact format possible, so that it
is barely readable. You should convert all text to lowercase for maximum
compression, because the optimizations have been designed mostly for
uncapitalised letters.

   `Warning: Each line is processed multiple times, so prepare for slow
conversion time'

   You can use this module e.g. to preprocess text before it is sent to
electronic media that has some maximum text size limit. For example pagers
have an arbitrary text size limit, typically 200 characters, which you want
to fill as much as possible. Alternatively you may have a GSM cellular
phone which is capable of receiving Short Messages (SMS), whose message
size limit is 160 characters. For demonstration of this module's
SqueezeText() function, the description text of this paragraph has been
converted below. See for yourself if it's readable (yes, it takes some
time to get used to). The compression ratio is typically 30-40%.

     u _n use thi mod e.g. to prprce txt bfre i_s snt to
     elrnic mda has som max txt siz lim. f_xmple pag
     hv  abitry txt siz lim, tpcly 200 chr, W/ u wnt
     to fll as mch as psbleAlternatvly u may hv GSM cllar P8
     w_s cpble of rcivng Short msg (SMS), WS/ msg siz
     lim is 160 chr. 4 demonstrton of thi mods SquezText
     fnc ,  dsc txt of thi prgra has ben cnvd_ blow
     See uself if i_s redble (Yes, it tak som T to get usdto
     compr rat is tpcly 30-40

   And if $SQZ_OPTIMIZE_LEVEL is set to non-zero:

     u_nUseThiModE.g.ToPrprceTxtBfreI_sSntTo
     elrnicMdaHasSomMaxTxtSizLim.F_xmplePag
     hvAbitryTxtSizLim,Tpcly200Chr,W/UWnt
     toFllAsMchAsPsbleAlternatvlyUMayHvGSMCllarP8
     w_sCpbleOfRcivngShortMsg(SMS),WS/MsgSiz
     limIs160Chr.4DemonstrtonOfThiModsSquezText
     fnc,DscTxtOfThiPrgraHasBenCnvd_Blow
     SeeUselfIfI_sRedble(Yes,ItTakSomTToGetUsdto
     comprRatIsTpcly30-40

   A comparison of the two shows:

     Original text   : 627 characters
     Level 0         : 433 characters    reduction 31 %
     Level 1         : 345 characters    reduction 45 %  (+14 improvement)

   There are a few grammar rules which are used to shorten some English
tokens very much:

     Word that has _ is usually a verb

     Word that has / is usually a substantive, noun,
                        pronoun or other non-verb

   For example, these tokens must be understood before text can be read.
This is not yet like Geek code, because you don't need an external parser
to understand this, but just some common sense and time to adapt yourself
to this text. *For a complete up-to-date list, you have to peek at the
source code.*

     automatically => 'acly_'

     for           => 4
     for him       => 4h
     for her       => 4h
     for them      => 4t
     for those     => 4t

     can           => _n
     does          => _s

     it is         => i_s
     that is       => t_s
     which is      => w_s
     that are      => t_r
     which are     => w_r

     less          => -/
     more          => +/
     most          => ++

     however       => h/ver
     think         => thk_

     useful        => usful

     you           => u
     your          => u/
     you'd         => u/d
     you'll        => u/l
     they          => t/
     their         => t/r

     will          => /w
     would         => /d
     with          => w/
     without       => w/o
     which         => W/
     whose         => WS/

   Time is expressed with big letters

     time          => T
     minute        => MIN
     second        => SEC
     hour          => HH
     day           => DD
     month         => MM
     year          => YY

   Other Big letter acronyms

     phone         => P8

EXAMPLES
========

   To add new words to, e.g., the word conversion hash table, define your
custom set and merge it with the existing ones. Do the same for
%SQZ_WXLATE_MULTI_HASH and $SQZ_ZAP_REGEXP, and then start using the
conversion function.

     use English;
     use Squeeze qw( :ALL );

     # Keys containing '-' must be quoted; => only auto-quotes identifiers.
     my %myExtraWordHash =
     (
           'new-word1'  => 'conversion1'
         , 'new-word2'  => 'conversion2'
         , 'new-word3'  => 'conversion3'
         , 'new-word4'  => 'conversion4'
     );

     #   First take the existing tables and merge them with my
     #   translation table

     my %myCustomWordHash =
     (
           %SQZ_WXLATE_HASH
         , %SQZ_WXLATE_EXTRA_HASH
         , %myExtraWordHash
     );

     my $myXlat = 0;                             # state flag

     while (<>)
     {
         if ( $condition )
         {
             SqueezeHashSet \%myCustomWordHash;  # Use MY conversions
             $myXlat = 1;
         }

         if ( $myXlat and $condition )
         {
             SqueezeHashSet "reset";             # Back to default table
             $myXlat = 0;
         }

         print SqueezeText $ARG;
     }

   Similarly you can redefine the multi-word translation table by supplying
another hash reference in the call to SqueezeHashSet(). To kill more text
immediately in addition to the default, just concatenate the regexps to
$SQZ_ZAP_REGEXP.

KNOWN BUGS
==========

   There may be a lot of false conversions. If you think that some word
squeezing went too far, please 1) turn on the debug, 2) send your example
text and 3) the debug log to the maintainer. To see how the conversion goes
e.g. for the word *Messages*:

     use English;
     use Lingua::EN::Squeeze;

     #   activate debug when the case-insensitive word "Messages" is found
     #   in the line.

     SqueezeDebug( 1, '(?i)Messages' );

     $ARG = "This line has some Messages in it";
     print SqueezeText $ARG;

EXPORTABLE VARIABLES
====================

   The defaults may not conquer all possible text, so you may wish to
extend the hash tables and $SQZ_ZAP_REGEXP to cope with your typical text.

$SQZ_ZAP_REGEXP
---------------

   Text to kill immediately, like "Hm, Hi, Hello...". You can only set this
once, because this regexp is compiled immediately when `SqueezeText()' is
called for the first time.

$SQZ_OPTIMIZE_LEVEL
-------------------

   This controls how optimized the text will be. Currently there are only
level 0 (default) and level 1, which squeezes out all spaces. This
improves compression by an average of 10%, but the text is harder to
read. If space is tight, use this extended compression optimization.

%SQZ_WXLATE_MULTI_HASH
----------------------

   *Multi Word* conversion hash table:  "for you" => "4u" ...

%SQZ_WXLATE_HASH
----------------

   *Single Word* conversion hash table: word => conversion. This table is
applied after %SQZ_WXLATE_MULTI_HASH has been used.

%SQZ_WXLATE_EXTRA_HASH
----------------------

   Aggressive *Single Word* conversions like: without => w/o. Applied last.
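
   The order in which the three tables are applied can be illustrated with
a small stand-alone sketch.  The miniature tables and the `toy_' functions
below are hypothetical; they only borrow a few entries from the conversion
list shown earlier and do not perform the vowel deletion that
`SqueezeText()' adds on top:

```perl
use strict;
use warnings;

# Hypothetical miniature tables using entries from the list above.
my %MULTI  = ('for you' => '4u', 'it is' => 'i_s');
my %SINGLE = (you => 'u', with => 'w/', your => 'u/');
my %EXTRA  = (without => 'w/o');

# Replace whole-word occurrences of every key in one table.
sub toy_apply_table {
    my ($text, $table) = @_;
    # Longest keys first, so that 'your' is tried before 'you'.
    my @keys = sort { length($b) <=> length($a) or $a cmp $b } keys %$table;
    for my $key (@keys) {
        $text =~ s/(?<![\w\/])\Q$key\E(?![\w\/])/$table->{$key}/g;
    }
    return $text;
}

# Multi-word table first, then single words, then the aggressive extras.
sub toy_squeeze {
    my ($text) = @_;
    $text = toy_apply_table($text, $_) for \%MULTI, \%SINGLE, \%EXTRA;
    return $text;
}

print toy_squeeze('it is for you, with or without your help'), "\n";
```

   Applying the multi-word table first matters: once "you" has become "u",
the phrase "for you" can no longer be recognised.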

INTERFACE FUNCTIONS
===================

SqueezeText($)
--------------

Description
     Squeeze text by using vowel substitutions and deletions and hash
     tables that guide text substitutions. The line is parsed multiple
     times and this will take some time.

arg1: $text
     String. Line of Text.

Return values
     String, squeezed text.

new()
-----

Description
     Return class object.

Return values
     object.

SqueezeHashSet($;$)
-------------------

Description
     Set hash tables to use for converting text. The multiple word
     conversion is done first and after that the single words conversions.

arg1: \%wordHashRef
     Reference to the hash used to convert single words.  If "reset", use
     the default hash table.

arg2: \%multiHashRef [optional]
     Reference to the hash used to convert multiple words.  If "reset",
     use the default hash table.

Return values
     None.

SqueezeControl(;$)
------------------

Description
     Select level of text squeezing: noconv, conv, med, max.

arg1: $state
     String. If nothing, set maximum squeeze level (kinda: restore
     defaults).

          noconv    Turn off squeeze
          conv      Turn on squeeze
          med       Set squeezing level to medium
          max       Set squeezing level to maximum

Return values
     None.

SqueezeDebug(;$$)
-----------------

Description
     Activate or deactivate debug.

arg1: $state [optional]
     If not given, turn debug off. If non-zero, turn debug on.  You must
     also supply regexp if you turn on debug, unless you have given it
     previously.

arg2: $regexp [optional]
     If given, use regexp to trigger debug output when debug is on.

Return values
     None.

   The author can be reached at jari.aalto@poboxes.com. The home page, via
a forwarding service, is at http://www.netforward.com/poboxes/?jari.aalto;
the absolute URL is ftp://cs.uta.fi/pub/ssjaaa/ but this may move without
notice. Prefer keeping the forwarding service link in your bookmarks.

   Latest version of this module can be found at
$CPAN/modules/by-module/Lingua/

AUTHOR
======

   Copyright (C) 1998-1999 Jari Aalto. All rights reserved. This program is
free software; you can redistribute it and/or modify it under the same
terms as Perl itself or in terms of GNU General Public Licence v2 or later.


File: pm.info,  Node: Lingua/EN/Summarize,  Next: Lingua/EN/Summarize/Filters,  Prev: Lingua/EN/Squeeze,  Up: Module List

A simple tool for summarizing bodies of English text.
*****************************************************

NAME
====

   Lingua::EN::Summarize - A simple tool for summarizing bodies of English
text.

SYNOPSIS
========

     use Lingua::EN::Summarize;
     my $summary = summarize( $text );                    # Easy, no? :-)
     my $summary = summarize( $text, maxlength => 500 );  # 500-byte summary
     my $summary = summarize( $text, filter => 'html' );  # Strip HTML formatting
     my $summary = summarize( $text, wrap => 75 );        # Wrap output to 75 col.

DESCRIPTION
===========

   This is a simple module which makes an unscientific effort at
summarizing English text. It recognizes simple patterns which look like
statements, abridges them, and concatenates them into something vaguely
resembling a summary. It needs more work on large bodies of text, but it
seems to have a decent effect on small inputs at the moment.

   Lingua::EN::Summarize exports one function, `summarize()', which takes
the text to summarize as its first argument, and any number of optional
directives in `name => value' form. The options it'll take are:

maxlength
     Specifies the maximum length, in bytes, of the generated summary.

wrap
     Prettyprints the summary output by wrapping it to the number of
     columns which you specify.

filter
     Passes the text through a filter before handing it to the summarizer.
     Currently, only two filters are implemented: `"html"', which uses
     HTML::TreeBuilder and HTML::FormatText to strip all HTML formatting
     from a document, and `"easyhtml"', which quickly (and less accurately)
     strips all HTML from a document using a simple regular expression, if
     you don't have the abovementioned modules. An `"email"' filter, for
     converting mail and news messages to easily-summarizable text, is in
     the works for the next version.

   Unlike the HTML::Summarize module (which is quite interesting, and worth
a look), this module considers its input to be plain English text, and
doesn't try to gather any information from the formatting. Thus, without
any cues from the document's format, the scheme that HTML::Summarize uses
isn't applicable here. The current scheme goes something like this:

   "Filter the text according to the user's filter option. Split the text
into discrete sentences with the Text::Sentence module, then further split
them into clauses on commas and semicolons. Keep only the ones that have a
(subject very-simple-verb object) structure. Construct the summary out of
the first sentences in the list, staying within the maxlength limit, or
under 30% of the size of the original text, whichever is smaller."

   Needless to say, this is a very simple and not terribly universally
effective scheme, but it's good enough for a first draft, and I'll bang on
it more later. Like I said, it's not a scientific approach to the problem,
but it's better than nothing (and often better than HTML::Summarize!), and
I don't really need A.I. quality output from it.
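
   A deliberately naive rendering of the quoted scheme might look like the
sketch below.  Everything in it is an assumption made for illustration:
the function name `toy_summarize', the crude sentence splitter (used here
in place of Text::Sentence) and the short verb list standing in for
"very-simple-verb" are not taken from the module:

```perl
use strict;
use warnings;

# Hypothetical, deliberately naive rendering of the quoted scheme.
sub toy_summarize {
    my ($text, %opt) = @_;
    my $limit = int(0.30 * length($text));           # 30% of the original
    $limit = $opt{maxlength}
        if defined $opt{maxlength} and $opt{maxlength} < $limit;

    # Split into sentences, then into clauses on commas and semicolons.
    my @clauses = map { split /\s*[,;]\s*/ } split /(?<=[.?!])\s+/, $text;

    # Keep clauses that look like "subject simple-verb object".
    my @statements =
        grep { /^\w+(?:\s+\w+)*\s+(?:is|are|was|were|has|have)\s+\w/ }
        @clauses;

    # Take statements from the front while they fit within the limit.
    my $summary = '';
    for my $statement (@statements) {
        $statement =~ s/[.?!]*$/./;                  # end with one period
        my $candidate = length($summary) ? "$summary $statement" : $statement;
        last if length($candidate) > $limit;
        $summary = $candidate;
    }
    return $summary;
}

my $text = 'The module is simple. It was written quickly, and nobody '
         . 'complained. Results are decent on small inputs. Please try it '
         . 'on longer documents and report what happens when you do.';
print toy_summarize($text), "\n";
print toy_summarize($text, maxlength => 30), "\n";
```

   The effective limit is the smaller of `maxlength' and 30% of the input,
matching the "whichever is smaller" wording of the scheme.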

AUTHOR
======

   Dennis Taylor, <dennis@funkplanet.com>

SEE ALSO
========

   HTML::Summarize, Text::Sentence,
http://www.vancouvertoday.com/city_guide/dining/reviews/barbers_modern_club.html


File: pm.info,  Node: Lingua/EN/Summarize/Filters,  Next: Lingua/EN/Syllable,  Prev: Lingua/EN/Summarize,  Up: Module List

Helper functions for the Summarize module
*****************************************

NAME
====

   Lingua::EN::Summarize::Filters - Helper functions for the Summarize
module

SYNOPSIS
========

     See the Lingua::EN::Summarize documentation.

DESCRIPTION
===========

   See the Lingua::EN::Summarize documentation.

AUTHOR
======

   Dennis Taylor, <dennis@funkplanet.com>

SEE ALSO
========

   Lingua::EN::Summarize (got the point yet? :-)


File: pm.info,  Node: Lingua/EN/Syllable,  Next: Lingua/ID/Nums2Words,  Prev: Lingua/EN/Summarize/Filters,  Up: Module List

Routine for estimating syllable count in words.
***********************************************

NAME
====

   Lingua::EN::Syllable - Routine for estimating syllable count in words.

SYNOPSIS
========

     use Lingua::EN::Syllable;

     $count = syllable('supercalifragilisticexpialidocious'); # 14

DESCRIPTION
===========

   Lingua::EN::Syllable::syllable() estimates the number of syllables in
the word passed to it.

   Note that it isn't entirely accurate...  it fails (by one syllable) for
about 10-15% of my /usr/dict/words.  The only way to get a 100% accurate
count is to do a dictionary lookup, so this is a small and fast alternative
where more-or-less accurate results will suffice, such as estimating the
reading level of a document.

   I welcome pointers to more accurate algorithms, since this one is
pretty quick-and-dirty.  This was designed for English (well, American at
least) words, but sometimes guesses well for other languages.
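
   To give a feel for what a quick-and-dirty estimator looks like, here is
a minimal sketch that counts vowel groups.  It is a hypothetical stand-in
named `guess_syllables', not the algorithm used by Lingua::EN::Syllable,
and its rules (dropping a final silent 'e', keeping "-le" after a
consonant) are assumptions chosen for brevity.

```perl
use strict;
use warnings;

# Hypothetical stand-in, NOT the algorithm of Lingua::EN::Syllable.
sub guess_syllables {
    my ($word) = @_;
    $word = lc $word;
    $word =~ s/[^a-z]//g;               # ignore anything that is not a letter
    return 0 if $word eq '';

    # Drop a final silent 'e', but keep endings such as "-ble" or "-tle".
    $word =~ s/e$// unless $word =~ /[^aeiouy]le$/;

    # Count one syllable per run of consecutive vowels.
    my $count = () = $word =~ /[aeiouy]+/g;
    return $count > 0 ? $count : 1;     # every word has at least one
}
```

   A counter this simple is also wrong for compounds such as "lifeboat",
where it sees the inner 'e' as a separate vowel group.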

KNOWN LIMITATIONS
=================

   Accuracy for words with non-alpha characters is somewhat undefined.  In
general, punctuation characters, et al, should be trimmed off before
handing the word to syllable(), and hyphenated compounds should be broken
into their separate parts.
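
   The clean-up described above might be sketched as follows.  The helper
name `count_token_syllables' is hypothetical; it takes the counting
routine as a code reference so that the sketch does not depend on any
particular implementation being installed.

```perl
use strict;
use warnings;

# Hypothetical helper: split a token on hyphens, trim non-letters from
# both ends of each part, and sum the counts reported by $counter.
sub count_token_syllables {
    my ($counter, $token) = @_;
    my $total = 0;
    for my $part (split /-+/, $token) {
        $part =~ s/^[^A-Za-z]+|[^A-Za-z]+$//g;
        next if $part eq '';
        $total += $counter->($part);
    }
    return $total;
}
```

   With the module loaded, the call would presumably look like
`count_token_syllables(\&Lingua::EN::Syllable::syllable, $token)'.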

   Syllables for all-digit words (e.g., "1998";  some call them "numbers")
are often counted as the number of digits.  A cooler solution would be
converting "1998" to "nineteen ninety eight" (or "one thousand nine
hundred ninety eight", or...), but that is left as an exercise for the
reader.

   Contractions are not well supported.

   Compound words (like "lifeboat") where the first word ends in a silent
'e' are counted with an extra syllable.

COPYRIGHT
=========

   Distributed under the same terms as Perl.  Contact the author with any
questions.

AUTHOR
======

   Greg Fast (gdf@imsa.edu)


File: pm.info,  Node: Lingua/ID/Nums2Words,  Next: Lingua/ID/Words2Nums,  Prev: Lingua/EN/Syllable,  Up: Module List

convert number to Indonesian verbiage.
**************************************

NAME
====

   Lingua::ID::Nums2Words - convert number to Indonesian verbiage.

SYNOPSIS
========

     use Lingua::ID::Nums2Words ;
     
     print nums2words(123)        ; # "seratus dua puluh tiga"
     print nums2words_simple(123) ; # "satu dua tiga"

DESCRIPTION
===========

   *nums2words* can currently handle real numbers, in normal or scientific
notation, up to the order of hundreds of trillions.  It also preserves the
formatting of the number string (e.g., given "1.00", *nums2words* will
pronounce the zeros).
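
   The digit-by-digit reading shown for `nums2words_simple' in the
synopsis can be imitated with a small lookup table.  This is only a
sketch under assumptions: the name `digits_to_words' is hypothetical, it
accepts plain digit strings only, and the use of "nol" for zero is an
assumption about the module's vocabulary.

```perl
use strict;
use warnings;

# Indonesian names of the digits 0..9 ("nol" for zero is an assumption
# about what the module itself prints).
my @DIGIT_NAMES = qw(nol satu dua tiga empat lima enam tujuh delapan sembilan);

# Hypothetical helper: read a string of digits one digit at a time.
sub digits_to_words {
    my ($number) = @_;
    return unless defined $number && $number =~ /^[0-9]+$/;
    return join ' ', map { $DIGIT_NAMES[$_] } split //, $number;
}
```

   Because the input is treated as a string, leading and trailing zeros
are read out as well.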

AUTHOR
======

   Steven Haryanto <sh@hhh.indoglobal.com>

SEE ALSO
========

   *Note Lingua/ID/Words2Nums: Lingua/ID/Words2Nums,


File: pm.info,  Node: Lingua/ID/Words2Nums,  Next: Lingua/IW/Logical,  Prev: Lingua/ID/Nums2Words,  Up: Module List

convert Indonesian verbiage to number.
**************************************

NAME
====

   Lingua::ID::Words2Nums - convert Indonesian verbiage to number.

SYNOPSIS
========

     use Lingua::ID::Words2Nums ;
     
     print words2nums("seratus dua puluh tiga") ; # 123
     print words2nums_simple("satu dua tiga") ;   # 123

DESCRIPTION
===========

   *words2nums* can currently handle real numbers, in normal or scientific
notation, up to the order of hundreds of trillions.

   *words2nums* will return undef if its argument contains unknown
verbiage or a syntax error.

   *words2nums* will produce unexpected results if you feed it nonsensical
verbiage.
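
   The "return undef on unknown input" behaviour is easy to picture with
a digit-by-digit converter.  The sketch below is not the module's parser;
`words_to_digits' is a hypothetical name, and "nol" for zero is an
assumption.

```perl
use strict;
use warnings;

# Indonesian digit names mapped to their values ("nol" is an assumption).
my %DIGIT_VALUE = (
    nol  => 0, satu => 1, dua   => 2, tiga    => 3, empat    => 4,
    lima => 5, enam => 6, tujuh => 7, delapan => 8, sembilan => 9,
);

# Hypothetical helper: returns the digit string, or undef as soon as an
# unknown word is seen (or when there are no words at all).
sub words_to_digits {
    my ($words) = @_;
    my $digits = '';
    for my $word (split ' ', lc $words) {
        return unless exists $DIGIT_VALUE{$word};
        $digits .= $DIGIT_VALUE{$word};
    }
    return $digits eq '' ? undef : $digits;
}
```

   A caller should therefore test the result with `defined' before using
it, since 0 is a legitimate answer.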

AUTHOR
======

   Steven Haryanto <sh@hhh.indoglobal.com>

SEE ALSO
========

   *Note Lingua/ID/Nums2Words: Lingua/ID/Nums2Words,


File: pm.info,  Node: Lingua/IW/Logical,  Next: Lingua/Ident,  Prev: Lingua/ID/Words2Nums,  Up: Module List

module for working with logical and visual Hebrew
*************************************************

NAME
====

   Lingua::IW::Logical - module for working with logical and visual Hebrew

SYNOPSIS
========

     use Lingua::IW::Logical;

     $visual  = log2vis_string($logical);

     $vistext = log2vis_text($logtext);
     $vistext = log2vis_text($logtext, $linelength);
     $vistext = log2vis_text($logtext, $linelength, $start);
     $vistext = log2vis_text($logtext, $linelength, $start, $end);

DESCRIPTION
===========

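   This module is intended to automate the task of converting logical
Hebrew to visual Hebrew.  Logical Hebrew stores characters in the order in
which they are read; visual Hebrew stores them in the left-to-right order
in which they are displayed.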

log2vis_string STRING
---------------------

   This function converts its argument from logical representation to
visual (renders it as it should be printed).

log2vis_text STRING,LENGTH,START,END
------------------------------------

   This function converts blocks of text, using `log2vis_string'.  All
arguments except the first are optional.  LENGTH defines the maximal
length of the resulting line, with a default of 80.  START defines the text
added before each line of the resulting text.  If START is undefined, the
line is padded so that the text is right-aligned.  END defines what is
added after each line of text; the default is a newline.

   Example 1:

     #!/usr/bin/perl
     use Lingua::IW::Logical;

     while(<>)
     {
       print log2vis_text($_);
     }

   This short program will convert any text in logical Hebrew to readable
visual Hebrew.

   Example 2:

     #!/usr/bin/perl
     use Lingua::IW::Logical;

     while(<>)
     {
       print log2vis_text($_,80,"<nobr>","</nobr><br>\n");
     }

   This example shows how you can HTML-ize a file in logical Hebrew, so
that you can put it on a web page.
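
   For readers curious about what happens to a single line, here is a
deliberately naive sketch.  It is not the module's parser: both function
names are hypothetical, the conversion simply reverses the characters and
then restores runs of digits, and the padding rule for an undefined START
(spaces on the left up to LENGTH) is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: reverse the characters of a line, then flip runs
# of digits (with '.' or ',' inside) back, since numbers read
# left-to-right.  Assumes one character per letter (for UTF-8 input,
# decode the text first).
sub naive_log2vis {
    my ($logical) = @_;
    my $visual = reverse $logical;
    $visual =~ s/([0-9]+(?:[.,][0-9]+)*)/scalar reverse $1/ge;
    return $visual;
}

# Hypothetical helper: apply the LENGTH, START and END conventions to
# one line that has already been converted.
sub frame_visual_line {
    my ($visual, $length, $start, $end) = @_;
    $length = 80   unless defined $length;
    $end    = "\n" unless defined $end;
    unless (defined $start) {
        # No START given: pad on the left so the text is right-aligned.
        my $pad = $length - length($visual);
        $start = ' ' x ($pad > 0 ? $pad : 0);
    }
    return $start . $visual . $end;
}
```

   Anything more subtle than plain digit runs, such as a sign or a percent
mark next to a number, is outside what this sketch attempts.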

KNOWN BUGS
==========

   Bug reports are welcome.  There are some unresolved issues with
constructions like "H-1", where H is a Hebrew letter.  The parser currently
treats this as the number "-1", while in this context it might just as well
be a dash.  Also, it seems that Word likes to write percentages as "%12.5",
which is not rendered well either.  I'm thinking about how to resolve these
issues.

AUTHOR
======

   Stanislav Malyshev (frodo@sharat.co.il)


File: pm.info,  Node: Lingua/Ident,  Next: Lingua/Ispell,  Prev: Lingua/IW/Logical,  Up: Module List

Statistical language identification
***********************************

NAME
====

   Lingua::Ident - Statistical language identification

SYNOPSIS
========

     use Lingua::Ident;
     $i    = new Lingua::Ident("filename 1", ..., "filename n");
     $lang = $i->identify("text to classify");

DESCRIPTION
===========

   This module implements a statistical language identifier.

   The filename arguments to the constructor must refer to files
containing tables of n-gram probabilities for languages.  These tables can
be generated using the trainlid(1) utility program.

RETURN VALUE
============

   The identify() method returns the value specified in the *_LANG* field
of the probabilities table of the language to which the text most likely
belongs (see `"WARNINGS"' in this node).

   This value should preferably be a POSIX locale name constructed from an
ISO 639 2-letter language code, possibly extended by an ISO 3166 2-letter
country code and a character set identifier.  Example: *de_DE.iso88591*.
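
   If the tables follow this recommendation, the returned value can be
taken apart with a single regular expression.  The helper below is a
sketch only; `parse_locale_name' is a hypothetical name and the pattern
is an assumption about how strictly the recommendation is followed.

```perl
use strict;
use warnings;

# Hypothetical helper: split a name such as "de_DE.iso88591" into its
# language, country and character set parts.  Returns a hash reference,
# or nothing if the name does not have the recommended shape.
sub parse_locale_name {
    my ($name) = @_;
    return unless defined $name
        && $name =~ /^([a-z]{2})(?:_([A-Z]{2}))?(?:\.([A-Za-z0-9-]+))?$/;
    return { language => $1, country => $2, charset => $3 };
}
```

   The country and charset entries are undef when the corresponding part
is absent.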

WARNINGS
========

   Since Lingua::Ident is based on statistics, it cannot be 100% accurate.
More precisely, Dunning (see below) reports that his implementation
achieves 92% accuracy with 50K of training text for 20-character strings
when discriminating between English and Spanish.  This implementation
should be as accurate as Dunning's.  However, not only the size but also
the quality of the training text plays a role.

   The current implementation doesn't use a threshold to determine whether
the most probable language has a high enough probability; if you try to
classify a text in a language for which there is no probability table,
identify() will still return one of the known languages, which is
necessarily incorrect.
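
   To illustrate what such a threshold could look like, here is a small
self-contained sketch of bigram scoring.  It does not use Lingua::Ident
or its table format and is much cruder than Dunning's method: all three
function names are hypothetical, and the add-one smoothing over an assumed
27-symbol alphabet is chosen purely for simplicity.

```perl
use strict;
use warnings;

# Count the character bigrams in a training text.
sub train_bigrams {
    my ($text) = @_;
    $text = lc $text;
    my %count;
    $count{ substr($text, $_, 2) }++ for 0 .. length($text) - 2;
    return \%count;
}

# Average log-probability per bigram of $text under one table, using
# add-one smoothing over an assumed 27 * 27 = 729 possible bigrams.
sub average_log_prob {
    my ($table, $text) = @_;
    $text = lc $text;
    my $n = length($text) - 1;
    return if $n < 1;
    my $total = 0;
    $total += $_ for values %$table;
    my $sum = 0;
    for my $i (0 .. $n - 1) {
        my $seen = $table->{ substr($text, $i, 2) } || 0;
        $sum += log( ($seen + 1) / ($total + 729) );
    }
    return $sum / $n;
}

# Pick the best-scoring language, but answer undef when even the best
# score falls below $threshold.
sub identify_with_threshold {
    my ($tables, $text, $threshold) = @_;
    my ($best, $best_score);
    for my $lang (sort keys %$tables) {
        my $score = average_log_prob($tables->{$lang}, $text);
        next unless defined $score;
        ($best, $best_score) = ($lang, $score)
            if !defined $best_score || $score > $best_score;
    }
    return unless defined $best_score && $best_score >= $threshold;
    return $best;
}
```

   A suitable threshold would have to be found by experiment on held-out
text; no particular value is suggested here.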

AUTHOR
======

   Lingua::Ident was developed by Michael Piotrowski <mxp@dynalabs.de>.

SEE ALSO
========

   Dunning, Ted (1994). *Statistical Identification of Language.*
Technical report CRL MCCS-94-273. Computing Research Lab, New Mexico State
University.


File: pm.info,  Node: Lingua/Ispell,  Next: Lingua/JA/Jtruncate,  Prev: Lingua/Ident,  Up: Module List

a module encapsulating access to the Ispell program.
****************************************************

NAME
====

   Lingua::Ispell.pm - a module encapsulating access to the Ispell program.

   Note: this module was previously known as Text::Ispell; if you have
Text::Ispell installed on your system, it is now obsolete and should be
replaced by Lingua::Ispell.

NOTA BENE
=========

   ispell, when reporting on misspelled words, indicates the string it was
unable to verify, as well as its starting offset in the input line.  No
such information is returned for words which are deemed to be correctly
spelled.  For example, in a line like "Can't buy a thrill", ispell simply
reports that the line contained four correctly spelled words.

   Lingua::Ispell would like to identify which substrings of the input
line are words - correctly spelled or otherwise.  It used to attempt to
split the input line into words according to the same rules ispell uses;
but that has proven to be very difficult, resulting in both slow and
error-prone code.

Consequences
------------

   Lingua::Ispell now operates only in "terse" mode.  In this mode, only
misspelled words are reported.  Words which ispell verifies as correctly
spelled are silently accepted.

   In the report structures returned by `spellcheck()', the `'term'' member
is now always identical to the `'original'' member; of the two, you should
probably use the `'term'' member.  (Also consider the `'offset'' member.)
ispell does not report this information for correctly spelled words; if at
some point in the future this capability is added to ispell, Lingua::Ispell
will be updated to take advantage of it.

   Use of the `$word_chars' variable has been removed; setting it no longer
has any effect.

   `terse_mode()' now does nothing.

SYNOPSIS
========

     # Brief:
     use Lingua::Ispell;
     Lingua::Ispell::spellcheck( $string );
     # or
     use Lingua::Ispell qw( spellcheck ); # import the function
     spellcheck( $string );

     # Useful:
     use Lingua::Ispell qw( :all );  # import all symbols
     for my $r ( spellcheck( "hello hacking perl shrdlu 42" ) ) {
       print "$r->{'type'}: $r->{'term'}\n";
     }

DESCRIPTION
===========

   Lingua::Ispell::spellcheck() takes one argument.  It must be a string,
and it should contain only printable characters.  One allowable exception
is a terminal newline, which will be chomped off anyway.  The line is fed
to a coprocess running ispell for analysis.  ispell parses the line into
"terms" according to the language-specific rules in effect.

   The result of ispell's analysis of each term is a categorization of the
term into one of six types: ok, compound, root, miss, none, and guess.
Some of these carry additional information.  The first three types are
"correctly" spelled terms, and the last three are for "incorrectly"
spelled terms.

   Lingua::Ispell::spellcheck returns a list of objects, each
corresponding to a term in the spellchecked string.  Each object is a hash
(hash-ref) with at least two entries: 'term' and 'type'.  The former
contains the term ispell is reporting on, and the latter is ispell's
determination of that term's type (see above).  For types 'ok' and 'none',
that is all the information there is.  For the type 'root', an additional
hash entry is present: 'root'.  Its value is the word which ispell
identified in the dictionary as being the likely root of the current term.
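For the type 'miss', an additional hash entry is present: 'misses'.  Its
value is a reference to an array of words which ispell identified as being
"near-misses" of the current term, when scanning the dictionary.  For the
type 'guess', the additional entry is 'guesses', a reference to an array of
root/affix guesses (see the example below).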

NOTE
----

   As mentioned above, `Lingua::Ispell::spellcheck()' currently only
reports on misspelled terms.

EXAMPLE
-------

     use Lingua::Ispell qw( spellcheck );
     Lingua::Ispell::allow_compounds(1);
     for my $r ( spellcheck( "hello hacking perl salmoning fruithammer shrdlu 42" ) ) {
       if ( $r->{'type'} eq 'ok' ) {
         # as in the case of 'hello'
         print "'$r->{'term'}' was found in the dictionary.\n";
       }
       elsif ( $r->{'type'} eq 'root' ) {
         # as in the case of 'hacking'
         print "'$r->{'term'}' can be formed from root '$r->{'root'}'\n";
       }
       elsif ( $r->{'type'} eq 'miss' ) {
         # as in the case of 'perl'
         print "'$r->{'term'}' was not found in the dictionary;\n";
         print "Near misses: @{$r->{'misses'}}\n";
       }
       elsif ( $r->{'type'} eq 'guess' ) {
         # as in the case of 'salmoning'
         print "'$r->{'term'}' was not found in the dictionary;\n";
         print "Root/affix Guesses: @{$r->{'guesses'}}\n";
       }
       elsif ( $r->{'type'} eq 'compound' ) {
         # as in the case of 'fruithammer'
         print "'$r->{'term'}' is a valid compound word.\n";
       }
       elsif ( $r->{'type'} eq 'none' ) {
         # as in the case of 'shrdlu'
         print "No match for term '$r->{'term'}'\n";
       }
       # and numbers are skipped entirely, as in the case of 42.
     }

ERRORS
------

   `Lingua::Ispell::spellcheck()' starts the ispell coprocess if the
coprocess seems not to exist.  Ordinarily this is simply the first time
it's called.

   ispell is spawned via the `IPC::Open2::open2()' function, which throws
an exception (i.e. dies) if the spawn fails.  The caller should be prepared
to catch this exception - unless, of course, the default behavior of die
is acceptable.
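
   One way to catch that exception is to wrap the call in a block `eval'.
The wrapper below is a sketch; `safe_spellcheck' is a hypothetical name,
and the module is loaded with `require' inside the `eval' so that a
missing module is reported in the same way as a failed spawn.

```perl
use strict;
use warnings;

# Hypothetical wrapper: returns (\@reports, undef) on success, or
# (undef, $error_message) if loading the module or spawning ispell died.
sub safe_spellcheck {
    my ($line) = @_;
    my @reports;
    my $ok = eval {
        require Lingua::Ispell;
        @reports = Lingua::Ispell::spellcheck($line);
        1;
    };
    return (undef, $@ || 'unknown error') unless $ok;
    return (\@reports, undef);
}
```

   The caller can then decide whether a failure is fatal or whether the
text should simply go unchecked.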

Nota Bene
---------

   The full location of the ispell executable is stored in the variable
`$Lingua::Ispell::path'.  The default value is `/usr/local/bin/ispell'.
If your ispell executable is installed anywhere else, then you must set
`$Lingua::Ispell::path' accordingly before you call
`Lingua::Ispell::spellcheck()' (or any other function in the module) for
the first time!
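
   Rather than hard-coding the location, a script might look for ispell
on the PATH first.  This is a sketch: `find_executable' is a hypothetical
helper built on the standard File::Spec module, and falling back to the
documented default is just one possible policy.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: first executable file called $name found in the
# directories of the PATH, or undef if there is none.
sub find_executable {
    my ($name) = @_;
    for my $dir (File::Spec->path) {
        my $candidate = File::Spec->catfile($dir, $name);
        return $candidate if -f $candidate && -x _;
    }
    return undef;
}

{
    no warnings 'once';
    # This must run before the first call into Lingua::Ispell.
    $Lingua::Ispell::path = find_executable('ispell')
                         || '/usr/local/bin/ispell';
}
```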

AUX FUNCTIONS
=============

add_word(word)
--------------

   Adds a word to the personal dictionary.  Be careful of capitalization.
If you want the word to be added "case-insensitively", you should call
`add_word_lc()'.

add_word_lc(word)
-----------------

   Adds a word to the personal dictionary, in lower-case form.  This
allows ispell to match it in a case-insensitive manner.

accept_word(word)
-----------------

   Similar to adding a word to the dictionary, in that it causes ispell to
accept the word as valid, but it does not actually add it to the
dictionary.  Presumably the effects of this only last for the current
ispell session, which will mysteriously end if any of the
coprocess-restarting functions are called...

parse_according_to(formatter)
-----------------------------

   Causes ispell to parse subsequent input lines according to the
specified formatter.  As of ispell v. 3.1.20, only 'tex' and 'nroff' are
supported.

set_params_by_language(language)
--------------------------------

   Causes ispell to set its internal operational parameters according to
the given language.  Legal arguments to this function, and its effects,
are currently unknown to the author of Lingua::Ispell.

save_dictionary()
-----------------

   Causes ispell to save the current state of the dictionary to its disk
file.  Presumably ispell would ordinarily only do this upon exit.

terse_mode(bool:terse)
----------------------

   *NOTE: This function has been disabled!  Lingua::Ispell now always
operates in terse mode.*

   In terse mode, ispell will not produce reports for "correct" words.
This means that the calling program will not receive results of the types
'ok', 'root', and 'compound'.

FUNCTIONS THAT RESTART ISPELL
=============================

   The following functions cause the current ispell coprocess, if any, to
terminate.  This means that all the changes to the state of ispell made by
the above functions will be lost, and their respective values reset to
their defaults.  The only function above whose effect is persistent is
save_dictionary().

   Perhaps in the future we will figure out a good way to make this state
information carry over from one instantiation of the coprocess to the next.

allow_compounds(bool)
---------------------

   When this value is set to True, compound words are accepted as legal -
as long as both words are found in the dictionary; more than two words are
always illegal.  When this value is set to False, run-together words are
considered spelling errors.
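
   The two-word rule can be pictured as trying every split point of the
term.  The sketch below only illustrates the rule as stated; it is not how
ispell itself works (ispell also knows about affixes), and the function
name and the hash-based word list are assumptions.

```perl
use strict;
use warnings;

# Hypothetical check: true if $term can be cut into exactly two parts
# that are both keys of the word list %$dict.
sub is_two_word_compound {
    my ($term, $dict) = @_;
    for my $cut (1 .. length($term) - 1) {
        my $head = lc substr($term, 0, $cut);
        my $tail = lc substr($term, $cut);
        return 1 if $dict->{$head} && $dict->{$tail};
    }
    return 0;
}
```

   A run of three dictionary words fails the test because no single cut
leaves two dictionary words.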

   The default value of this setting is dictionary-dependent, so the
caller should set it explicitly if it really matters.

make_wild_guesses(bool)
-----------------------

   This setting controls when ispell makes "wild" guesses.

   If False, ispell only makes "sane" guesses, i.e.  possible root/affix
combinations that match the current dictionary; only if it can find none
will it make "wild" guesses, which don't match the dictionary, and might
in fact be illegal words.

   If True, wild guesses are always made, along with any "sane" guesses.
This feature can be useful if the dictionary has a limited word list, or a
word list with few suffixes.

   The default value of this setting is dictionary-dependent, so the
caller should set it explicitly if it really matters.

use_dictionary([dictionary])
----------------------------

   Specifies what dictionary to use instead of the default.  Dictionary
names are actually file names, and are searched for according to the
following rule: if the name does not contain a slash, it is looked for in
the directory containing the default dictionary, typically /usr/local/lib.
Otherwise, it is used as is: if it does not begin with a slash, it is
interpreted relative to the current directory.

   If no argument is given, the default dictionary will be used.

use_personal_dictionary([dictionary])
-------------------------------------

   Specifies what personal dictionary to use instead of the default.

   Dictionary names are actually file names, and are searched for
according to the following rule: if the name begins with a slash, it is
used as is (i.e. it is an absolute path name). Otherwise, it is construed
as relative to the user's home directory ($HOME).

   If no argument is given, the default personal dictionary will be used.
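
   The two lookup rules, for main and for personal dictionaries, can be
written down as a pair of small functions.  These are illustrations of
the rules as described, not code from the module or from ispell; the
function names are hypothetical, and `/usr/local/lib' is only the typical
default mentioned above.

```perl
use strict;
use warnings;

# Main dictionary: a name without any slash is looked up in the default
# dictionary directory; anything else is used as is.
sub resolve_dictionary {
    my ($name, $default_dir) = @_;
    $default_dir = '/usr/local/lib' unless defined $default_dir;
    return $name if $name =~ m{/};
    return "$default_dir/$name";
}

# Personal dictionary: a name starting with a slash is absolute;
# anything else is taken relative to the home directory.
sub resolve_personal_dictionary {
    my ($name, $home) = @_;
    $home = $ENV{HOME} unless defined $home;
    return $name if $name =~ m{^/};
    return "$home/$name";
}
```

   Note the asymmetry: a name such as `sub/words' is relative to the
current directory in the first case but to the home directory in the
second.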

FUTURE ENHANCEMENTS
===================

   ispell options:

     -w chars
            Specify additional characters that can be part of a word.

DEPENDENCIES
============

   Lingua::Ispell uses the external program ispell, which is the
"International Ispell", available at

     http://fmg-www.cs.ucla.edu/geoff/ispell.html

   as well as various archives and mirrors, such as

     ftp://ftp.math.orst.edu/pub/ispell-3.1/

   This is a very popular program, and may already be installed on your
system.

   Lingua::Ispell also uses the standard perl modules FileHandle,
IPC::Open2, and Carp.

AUTHOR
======

   jdporter@min.net (John Porter)

COPYRIGHT
=========

   This module is free software; you may redistribute it and/or modify it
under the same terms as Perl itself.


