Return-Path: <env_14086384-227227126@hermes.sun.com>
Received: from pacific-carrier-annex.mit.edu by po10.mit.edu (8.9.2/4.7) id SAA29581; Tue, 23 Apr 2002 18:51:10 -0400 (EDT)
Received: from hermes.sun.com (hermes.sun.com [64.124.140.169])
	by pacific-carrier-annex.mit.edu (8.9.2/8.9.2) with SMTP id SAA17276
	for <alexp@mit.edu>; Tue, 23 Apr 2002 18:51:09 -0400 (EDT)
Date: Tue, 23 Apr 2002 14:51:09 GMT-08:00
From: "JDC Tech Tips" <body_14086384-227227126@hermes.sun.com>
To: alexp@mit.edu
Message-Id: <14086384-227227126@hermes.sun.com>
Subject: JDC Tech Tips, April 23, 2002 (Pattern Matching, Creating a HelpSet)
Precedence: junk
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
X-Mailer: Beyond Email 

 J  D  C    T  E  C  H    T  I  P  S


                      TIPS, TECHNIQUES, AND SAMPLE CODE



WELCOME to the Java Developer Connection(sm) (JDC) Tech Tips, 
April 23, 2002. This issue covers:

         * Pattern Matching
         * Creating a HelpSet with JavaHelp(tm) software
       
These tips were developed using Java 2 SDK, Standard Edition, 
v 1.4. 

This issue of the JDC Tech Tips is written by John Zukowski, 
president of JZ Ventures, Inc. (http://www.jzventures.com).

You can view this issue of the Tech Tips on the Web at
http://java.sun.com/jdc/JDCTechTips/2002/tt0423.html

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
PATTERN MATCHING 

The java.util.regex package is a new package in Java 2 Platform, 
Standard Edition version 1.4. The package provides a regular 
expression library. A regular expression is a pattern of 
characters that describes a set of strings, and is often used in 
pattern matching. The classes in the java.util.regex package 
let you match sequences of characters against a regular 
expression. These classes, which comprise the regular expression 
library, use the Perl 5 regular expression pattern syntax, and 
provide a much more powerful way of parsing text than was 
previously available with the java.io.StreamTokenizer and the
java.util.StringTokenizer classes.

The regular expression library has three classes: Pattern, 
Matcher, and PatternSyntaxException. Ignoring the exception class, 
what you really have is one class to define the regular 
expression you want to match (the Pattern), and another class 
(the Matcher) for searching a pattern in a given string.

Most of the work of using the regular expression library is 
understanding its pattern syntax. The actual parsing is the easy 
part. So let's look at what makes up a regular expression.

The simplest kind of regular expression is a literal. A literal 
is not simply a character within the regular expression, but a 
character that is not part of some special grouping or expression 
within the regular expression.

For instance, the literal "x" is a regular expression. Using the
literal, a matcher, and a string, you can ask "Does the regular 
expression 'x' match the entire string?" Here's an expression that
asks the question:

boolean b = Pattern.matches("x", someString);

If the pattern "x" is the string referenced by someString, 
then b is true. Otherwise, b is false. By themselves, literals are 
not that complicated to understand. Notice here that the matches 
method is called on the Pattern class, not on a Matcher. The 
Pattern class defines this static matches method as a convenience 
for when a regular expression is used just once. Normally, you 
would create a Pattern object, get a Matcher object for the 
Pattern, and then use the matches method defined by the Matcher 
class:

Pattern p = Pattern.compile("x");
Matcher m = p.matcher("sometext");
boolean b = m.matches();

The tip will cover those steps later.
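
As a minimal sketch of the two forms described above, the 
following compares the one-shot Pattern.matches call with 
a compiled Pattern that is reused. The class and method names are 
illustrative only; they are not part of the library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralMatch {
  // Convenience form: the regular expression is compiled on every call.
  static boolean matchesOnce(String regex, String input) {
    return Pattern.matches(regex, input);
  }

  // Reusable form: the caller compiles once and asks for a Matcher per input.
  static boolean matchesCompiled(Pattern pattern, String input) {
    Matcher m = pattern.matcher(input);
    return m.matches();
  }

  public static void main(String[] args) {
    Pattern p = Pattern.compile("x");
    System.out.println(matchesOnce("x", "x"));          // true
    System.out.println(matchesOnce("x", "xx"));         // false: not the entire string
    System.out.println(matchesCompiled(p, "x"));        // true
    System.out.println(matchesCompiled(p, "sometext")); // false: contains x, but is not x
  }
}
```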

Of course, regular expressions can be more complex than literals.
Adding to the complexity are wildcards and quantifiers. There is 
only one wildcard used in regular expressions. It is the period 
(.) character. A wildcard matches any single character. It 
matches a line terminator only if the pattern is compiled with 
the Pattern.DOTALL flag. The quantifier characters are the + and 
*. (Technically, the question mark is also a quantifier 
character. It allows zero or one match.) The + character placed 
after a regular expression allows for the regular expression to 
be matched one or more times. The * is like the + character, but 
works zero or more times. For instance, if you want to find a 
string with a j at the beginning, a z at the end, and at least 
one character between the two, you use the expression "j.+z". If 
there don't have to be any characters between the j and the z, 
you use "j.*z" instead.

Note that pattern matching tries to find the largest possible 
"hit" within a string. So if you request a match against the 
pattern "j.*z", using the string "jazjazjazjaz", it returns the 
entire string, not just a single "jaz". This is called "greedy 
behavior." It is the default in a regular expression unless you
specify otherwise.
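
The greedy behavior can be seen in a short sketch. The firstHit 
helper below is hypothetical. It uses the find and group methods 
of the Matcher class, which are described later in this tip. The 
reluctant quantifier *? is one way to "specify otherwise" and 
ask for the smallest possible hit.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyDemo {
  // Returns the first substring of input that matches regex, or null if none.
  static String firstHit(String regex, String input) {
    Matcher m = Pattern.compile(regex).matcher(input);
    return m.find() ? m.group() : null;
  }

  public static void main(String[] args) {
    String text = "jazjazjazjaz";
    // Greedy: .* grabs as much as it can while still ending on a z.
    System.out.println(firstHit("j.*z", text));   // jazjazjazjaz
    // Reluctant: .*? grabs as little as it can.
    System.out.println(firstHit("j.*?z", text));  // jaz
  }
}
```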

Now let's get a little more complex. By placing multiple 
expressions in parentheses, you can request a match against 
multi-character patterns. For instance, to match a j followed by 
a z, you can use the "(jz)" pattern. By itself, that doesn't buy 
you much. It is the same as "jz". But, by using parentheses, you 
can apply a quantifier to the whole group and match one or more 
"jz" sequences: "(jz)+".

Another way of working with patterns is through character 
classes. With character classes, you specify a range of possible 
characters instead of specifying individual characters. For 
instance, if you want to match against any letter from j to z, 
you specify the range j-z in square brackets: "[j-z]". You could 
also attach a quantifier to the expression, for example, 
"[j-z]+", to get an expression matching at least one character 
between j and z, inclusively.
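
To make grouping and character classes concrete, here is a small 
sketch that checks a few strings against the patterns discussed 
so far. The wholeMatch helper is an illustrative wrapper around 
Pattern.matches, not a library method.

```java
import java.util.regex.Pattern;

class GroupAndClassDemo {
  // True only if the entire input matches the regular expression.
  static boolean wholeMatch(String regex, String input) {
    return Pattern.matches(regex, input);
  }

  public static void main(String[] args) {
    // The quantifier applies to the whole group (jz).
    System.out.println(wholeMatch("(jz)+", "jzjzjz")); // true
    System.out.println(wholeMatch("(jz)+", "jzz"));    // false
    // Without parentheses, the + applies to the z alone.
    System.out.println(wholeMatch("jz+", "jzz"));      // true
    // A character class with a quantifier.
    System.out.println(wholeMatch("[j-z]+", "jkz"));   // true
    System.out.println(wholeMatch("[j-z]+", "jaz"));   // false: a is outside j-z
  }
}
```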

Certain character classes are predefined. These represent classes 
that are common, and so they have a common shorthand. Some of the
predefined character classes are:

\d    A digit: [0-9]
\D    A non-digit: [^0-9]
\s    A whitespace character: [ \t\n\x0B\f\r]
\S    A non-whitespace character: [^\s]
\w    A word character: [a-zA-Z_0-9]
\W    A non-word character: [^\w]

Notice that for character classes, ^ is used for negation of an 
expression.
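
The following sketch tries a few of the predefined classes. Note 
that each \ is doubled inside the Java string literals, because 
the compiler would otherwise treat it as the start of an escape 
sequence. The class and method names are illustrative.

```java
import java.util.regex.Pattern;

class PredefinedClassDemo {
  // True only if the entire input matches the regular expression.
  static boolean wholeMatch(String regex, String input) {
    return Pattern.matches(regex, input);
  }

  public static void main(String[] args) {
    System.out.println(wholeMatch("\\d+", "2002"));            // true
    System.out.println(wholeMatch("\\d+", "v1.4"));            // false
    System.out.println(wholeMatch("\\w+", "tech_tips4"));      // true
    System.out.println(wholeMatch("\\w+", "Tech Tips"));       // false: space is not \w
    System.out.println(wholeMatch("\\w+\\s\\w+", "Tech Tips")); // true
    System.out.println(wholeMatch("\\D+", "Tech Tips"));       // true: no digits at all
  }
}
```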

There is a second set of predefined character classes, called 
POSIX character classes. These are taken from the POSIX 
specification, and work with US-ASCII characters only:

\p{Lower}    A lower-case alphabetic character: [a-z]
\p{Upper}    An upper-case alphabetic character: [A-Z]
\p{ASCII}    All ASCII: [\x00-\x7F]
\p{Alpha}    An alphabetic character: [\p{Lower}\p{Upper}]
\p{Digit}    A decimal digit: [0-9]
\p{Alnum}    An alphanumeric character: [\p{Alpha}\p{Digit}]
\p{Punct}    Punctuation: one of !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
\p{Graph}    A visible character: [\p{Alnum}\p{Punct}]
\p{Print}    A printable character: [\p{Graph}]
\p{Blank}    A space or a tab: [ \t]
\p{Cntrl}    A control character: [\x00-\x1F\x7F]
\p{XDigit}   A hexadecimal digit: [0-9a-fA-F]
\p{Space}    A whitespace character: [ \t\n\x0B\f\r]

The final set of constructs listed here is the boundary 
matchers. Rather than matching characters, these match 
a position: the beginning or end of a line, a word, or the input.

^     The beginning of a line 
$     The end of a line 
\b    A word boundary 
\B    A non-word boundary 
\A    The beginning of the input 
\G    The end of the previous match 
\Z    The end of the input but for the final terminator, if any 
\z    The end of the input 
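
A short sketch of boundary matchers in use follows. The countHits 
helper is hypothetical. It counts matches with the find method of 
the Matcher class, which is described later in this tip.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
  // Counts how many non-overlapping matches of regex occur in input.
  static int countHits(String regex, String input) {
    Matcher m = Pattern.compile(regex).matcher(input);
    int count = 0;
    while (m.find()) {
      count++;
    }
    return count;
  }

  public static void main(String[] args) {
    String text = "java javahelp java";
    // Without boundaries, the "java" inside "javahelp" also counts.
    System.out.println(countHits("java", text));         // 3
    // \b on both sides accepts only the standalone word.
    System.out.println(countHits("\\bjava\\b", text));   // 2
    // ^ and $ anchor the match to the start and end.
    System.out.println(countHits("^JDC", "JDC Tech Tips"));   // 1
    System.out.println(countHits("Tips$", "JDC Tech Tips"));  // 1
  }
}
```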

The key thing to understand about all the character class 
expressions is the use of the \. When you compose a regular 
expression as a Java string, you must escape the \ character. 
Otherwise, the character following the \ will be treated as 
special by the javac compiler. To escape the \ character, specify 
a double \\. By placing a double \\ in the string, you are saying 
you want the actual \ character there. For instance, if you want 
to use a pattern for any string of alphanumeric characters, 
simply having a string containing \p{Alnum}* is not sufficient. 
You must escape the \ as follows:

boolean b = Pattern.matches("\\p{Alnum}*", someString);
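
To see what the doubled \\ actually produces, the sketch below 
prints the string that the regular expression library receives. 
The class name and the ALNUM constant are illustrative. The last 
line shows the related case of matching a literal \ character, 
which needs \\ in the regular expression and therefore "\\\\" in 
Java source.

```java
import java.util.regex.Pattern;

class EscapeDemo {
  // In source this is written with two backslashes; the string holds one.
  static final String ALNUM = "\\p{Alnum}*";

  public static void main(String[] args) {
    System.out.println(ALNUM);           // \p{Alnum}*
    System.out.println(ALNUM.length());  // 10
    System.out.println(Pattern.matches(ALNUM, "Java2"));   // true
    System.out.println(Pattern.matches(ALNUM, "Java 2"));  // false: space
    System.out.println(Pattern.matches(ALNUM, ""));        // true: * allows zero
    // A regular expression that matches one literal backslash.
    System.out.println(Pattern.matches("\\\\", "\\"));     // true
  }
}
```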

As the name implies, the Pattern class is for defining patterns, 
that is, it defines the regular expression you want to match. 
Instead of using matches to see if a pattern matches the whole 
string, what normally happens is you check to see if a pattern 
matches the next part of the string. 

To use a pattern you must compile it. You do this with the 
compile method.

Pattern pattern = Pattern.compile(somePattern);

Pattern compilation can take some time, and doing it once is 
wise. The matches method of the Pattern class compiles the 
pattern with each call. If you want to use a pattern many 
times, you can avoid repeated compilation by compiling it once 
into a Pattern object and then getting a Matcher from that 
object for each string you want to check.

After you compile the pattern, you can request to get a Matcher 
for a specific string.

Matcher matcher = pattern.matcher(someString);

The Matcher provides a matches method that checks against the 
entire string. The class also provides a find() method that tries
to find the next sequence, possibly not at the beginning of the 
string, that matches the pattern.
 
After you know you have a match, you can get the match with the
group method:

if (matcher.find()) {
  System.out.println(matcher.group());
}
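
Because find moves on to the next match each time it is called, 
you can put it in a loop to visit every match. The following is 
a minimal sketch. The findAll helper and its "text@position" 
output format are illustrative choices; the start method of the 
Matcher class reports where the current match begins.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindAll {
  // Lists every match of regex in input as "match@startIndex".
  static String findAll(String regex, String input) {
    Matcher matcher = Pattern.compile(regex).matcher(input);
    StringBuffer hits = new StringBuffer();
    while (matcher.find()) {
      if (hits.length() > 0) {
        hits.append(", ");
      }
      hits.append(matcher.group()).append('@').append(matcher.start());
    }
    return hits.toString();
  }

  public static void main(String[] args) {
    System.out.println(findAll("\\d+", "April 23, 2002")); // 23@6, 2002@10
  }
}
```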

You can also use the matcher as a search and replace mechanism. 
For instance, to replace all occurrences of a pattern within 
a string, you use the following expression:

String newString = matcher.replaceAll("replacement words");

Here, all occurrences of the pattern in question would be 
replaced by the replacement words.
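
Here is a small sketch of search and replace. The helper names are 
illustrative. It also uses replaceFirst, a companion method of 
the Matcher class that replaces only the first occurrence. Be 
aware that the $ and \ characters have special meaning in the 
replacement string, so a replacement containing them may not be 
inserted literally.

```java
import java.util.regex.Pattern;

class ReplaceDemo {
  // Replaces every match of regex in input.
  static String replaceEvery(String regex, String input, String replacement) {
    return Pattern.compile(regex).matcher(input).replaceAll(replacement);
  }

  // Replaces only the first match of regex in input.
  static String replaceFirstOnly(String regex, String input, String replacement) {
    return Pattern.compile(regex).matcher(input).replaceFirst(replacement);
  }

  public static void main(String[] args) {
    System.out.println(replaceEvery("j.z", "jaz jbz", "X"));      // X X
    System.out.println(replaceFirstOnly("j.z", "jaz jbz", "X"));  // X jbz
    // With no match, the input comes back unchanged.
    System.out.println(replaceEvery("lect", "no match here", "pict"));
  }
}
```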

Here's a demonstration of pattern matching. The following program 
takes three command line arguments. The first argument is a 
string to search. The second is a pattern for the search. The 
third is the replacement string. The replacement string replaces 
each occurrence of the pattern found in the search string.

import java.util.regex.*;

public class MyMatch {
  public static void main(String args[]) {

    if (args.length != 3) {
      System.out.println(
        "Pass in source string, pattern, " +
        "and replacement string");
      System.exit(-1);
    }

    String sourceString = args[0];
    String thePattern = args[1];
    String replacementString = args[2];

    Pattern pattern = Pattern.compile(thePattern);
    Matcher match = pattern.matcher(sourceString);
    if (match.find()) {
      System.out.println(
        match.replaceAll(replacementString));
    }
  }
}

For example, if you compile the program, and then run it like 
this:

java MyMatch "I want to be in lectures" "lect" "pict"

It prints:

I want to be in pictures

Notice that when you run the program, it is unnecessary to 
double a \ character that you type on the command line. That's 
because the argument reaches the program at run time and never 
passes through the javac compiler as a string literal. (Your 
command shell might apply quoting rules of its own.) For 
example, if the search string is:

"I want to be in lectures\I want to be a star" 

and you run the program with the same pattern ("lect") and
replacement string ("pict"), it prints:

I want to be in pictures\I want to be a star

For more information about pattern matching and regular 
expressions, see the technical article Regular Expressions and 
the Java Programming Language 
(http://java.sun.com/jdc/technicalArticles/releases/1.4regex/). 

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
CREATING A HELPSET WITH JAVAHELP SOFTWARE

JavaHelp software allows you to add online help to any system 
that has a Java Runtime Environment (JRE). With JavaHelp 
software, you can embed online documentation inside your 
client-side programs. This includes the obvious applets and 
applications, but you can also use JavaHelp software with
JavaBeans(tm) technology components or as standalone help 
for third-party systems. 

Getting started with the JavaHelp software is easy. Just go to
http://java.sun.com/products/javahelp/download_binary.html. You 
can download either the user-version, with a JRE, or a 
developer-centric version (as a Zip or self-extracting 
executable). There is also a JavaHelp User Guide that comes with 
the software downloads. If you view the JavaHelp User Guide, 
you'll see the JavaHelp system in action.

Once started, the Swing-based help viewer for JavaHelp presents 
information in a series of views. You'll find a Table of 
Contents, index of topics, and search. These three features 
combined are called the HelpSet and may include multiple help 
topic files. Essentially, it is your job to create the help topic 
files and the navigation files, mapping the help topic to the file 
with the necessary information. The topic files are basic HTML, 
and the navigation files are formatted in XML. You can, however, 
use a third-party tool to automatically produce the necessary files.
For example, a tool such as RoboHELP generates the necessary 
files in the JavaHelp format. See the list of tools supporting 
the JavaHelp software format at 
http://java.sun.com/products/javahelp/industry.html.

To demonstrate the JavaHelp system in action, let's create a 
"Hello, JavaHelp" HelpSet. To do this, you'll need to configure 
a special directory structure. It helps if you work in 
a subdirectory to start, so that you don't mix up the HelpSet 
files with any others. Navigation files go in the top-level 
directory, and topic and image files in subdirectories.

To get started, create a directory named help. Under help,
create a directory named Hello.

In the Hello directory, you create subdirectories for subtopics 
to hold the actual help files. For the "Hello, JavaHelp" 
demonstration, create one directory named First and another Last.

Once the directory structure is created, you can start creating 
the navigation and help files. The directory structure now looks 
as follows:

+ help
  + Hello
    + First
    + Last

The DTD for the main HelpSet file is contained in 
http://java.sun.com/products/javahelp/helpset_1_0.dtd. In it, you 
create entries for the term map as well as table of contents and 
index views. There is really no magic in the filenames. Just be 
sure the HelpSet file ends with the extension .hs. Here's what 
the HelpSet file, hello.hs, might look like, where the map is in 
Map.jhm, table of contents is in toc.xml, and index is 
in index.xml. Create this hello.hs file in the help directory.

<?xml version='1.0' encoding='ISO-8859-1' ?>
<!DOCTYPE helpset
  PUBLIC "-//Sun Microsystems Inc.//DTD JavaHelp HelpSet Version 1.0//EN"
         "http://java.sun.com/products/javahelp/helpset_1_0.dtd">

<helpset version="1.0">
  <title>Hello, JavaHelp</title>
  <maps>
    <mapref location="Map.jhm"/>
    <homeID>overview</homeID>
  </maps>
  <view>
    <name>TOC</name>
    <label>TOC</label>
    <type>javax.help.TOCView</type>
    <data>toc.xml</data>
  </view>
  <view>
    <name>Index</name>
    <label>Index</label>
    <type>javax.help.IndexView</type>
    <data>index.xml</data>
  </view>
</helpset>

For the map file, you need to create a mapping from map ID to 
files, similar to the following:

<mapID target="one" url="Hello/First/one.htm" />

Be sure the help files are specified as relative locations from 
the HelpSet. You could hard code complete paths, but then as soon 
as you JAR up the HelpSet, all paths would be wrong. Of course, 
these could be complete URLs to resources on the Web. If you want 
to have one "overview" help file at the top, and two help files
in each of the First and Last directories, your XML mapping might 
appear as follows. Create this Map.jhm file in the help directory.

<?xml version='1.0' encoding='ISO-8859-1' ?>
<!DOCTYPE map
  PUBLIC "-//Sun Microsystems Inc.//DTD JavaHelp Map Version 1.0//EN"
         "http://java.sun.com/products/javahelp/map_1_0.dtd">

<map version="1.0">
  <mapID target="overview" url="Hello/overview.htm" />
  <mapID target="one" url="Hello/First/one.htm" />
  <mapID target="two" url="Hello/First/two.htm" />
  <mapID target="three" url="Hello/Last/three.htm" />
  <mapID target="four" url="Hello/Last/four.htm" />
</map>

The table of contents and index files are next. These provide 
alternate means of working through the various help files. Again, 
these are described in XML files.

For the table of contents, each target from the map is mapped to 
text to appear in the table of contents. Create this toc.xml file 
in the help directory.

<?xml version='1.0' encoding='ISO-8859-1' ?>
<!DOCTYPE toc
  PUBLIC "-//Sun Microsystems Inc.//DTD JavaHelp TOC Version 1.0//EN"
         "http://java.sun.com/products/javahelp/toc_1_0.dtd">

<toc version="1.0">
  <tocitem image="toplevelfolder" target="overview" text="Hello, JavaHelp">
    <tocitem text="First Stuff">
      <tocitem target="one" text="The One"/>
      <tocitem target="two" text="The Second"/>
    </tocitem>
    <tocitem text="Last Stuff">
      <tocitem target="three" text="What's Third?"/>
      <tocitem target="four" text="The End"/>
    </tocitem>
  </tocitem>
</toc>

The index is just another way of presenting the data. As you 
create the index.xml file, you must list the terms in the order 
you want them presented, for example, alphabetically. Simply 
create the XML file with a set of hierarchical <indexitem> 
entries. In each <indexitem> 
entry, provide a value for the text attribute and a value for the 
target attribute. The value for the text attribute specifies what 
to display to the user in the index. The value for the target 
attribute specifies what help to display. Create this index.xml 
file in the help directory.

<?xml version='1.0' encoding='ISO-8859-1' ?>
<!DOCTYPE index
  PUBLIC "-//Sun Microsystems Inc.//DTD JavaHelp Index Version 1.0//EN"
         "http://java.sun.com/products/javahelp/index_1_0.dtd">

<index version="1.0">
  <indexitem text="The First?">
    <indexitem target="one" text="I'm One"/>
    <indexitem target="two" text="I'm Second"/>
  </indexitem>
  <indexitem text="The Last?">
    <indexitem target="three" text="We're Third!"/>
    <indexitem target="four" text="We're Last"/>
  </indexitem>
  <indexitem target="overview" text="Overview!!!"/>
</index>
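
Every target used in toc.xml and index.xml should be defined in 
Map.jhm. As a rough sketch, and not something supplied with the 
JavaHelp software, the hypothetical TargetCheck class below uses 
the regular expression library from the first tip to find 
targets that a view file uses but the map does not define. It 
assumes double-quoted attributes, as in the samples above, and it 
is not a real XML parser. It also uses generics, which are not 
available in version 1.4 of the SDK.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TargetCheck {
  // Assumption: attributes are double-quoted, as in the sample files.
  private static final Pattern MAP_TARGET =
      Pattern.compile("<mapID\\b[^>]*?\\btarget=\"([^\"]*)\"");
  private static final Pattern VIEW_TARGET =
      Pattern.compile("<(?:tocitem|indexitem)\\b[^>]*?\\btarget=\"([^\"]*)\"");

  // Returns targets used in viewText that mapText does not define.
  static List<String> undefinedTargets(String mapText, String viewText) {
    Set<String> defined = new HashSet<String>();
    Matcher m = MAP_TARGET.matcher(mapText);
    while (m.find()) {
      defined.add(m.group(1));
    }
    List<String> missing = new ArrayList<String>();
    Matcher v = VIEW_TARGET.matcher(viewText);
    while (v.find()) {
      if (!defined.contains(v.group(1))) {
        missing.add(v.group(1));
      }
    }
    return missing;
  }

  public static void main(String[] args) {
    String map = "<map version=\"1.0\">"
        + "<mapID target=\"overview\" url=\"Hello/overview.htm\" />"
        + "<mapID target=\"one\" url=\"Hello/First/one.htm\" />"
        + "</map>";
    String toc = "<toc version=\"1.0\">"
        + "<tocitem target=\"overview\" text=\"Hello, JavaHelp\">"
        + "<tocitem text=\"First Stuff\">"
        + "<tocitem target=\"one\" text=\"The One\"/>"
        + "<tocitem target=\"five\" text=\"Not in the map\"/>"
        + "</tocitem></tocitem></toc>";
    System.out.println(undefinedTargets(map, toc)); // [five]
  }
}
```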

The map file mentions five HTML files:
 
   Hello/overview.htm
   Hello/First/one.htm
   Hello/First/two.htm
   Hello/Last/three.htm
   Hello/Last/four.htm

So you must create them. Make sure to create the files in the 
appropriate Hello directory or subdirectory. Try to create the 
files with something interesting in them, for example, a few 
sentences of overview information in the overview.htm file. The 
whole directory structure now looks like this:

+ help
  hello.hs
  index.xml
  Map.jhm
  toc.xml
  + Hello
    overview.htm
    + First
      one.htm
      two.htm
    + Last
      three.htm
      four.htm
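
Before running the viewer, you might want to confirm that every 
file named in Map.jhm really exists. The hypothetical MapCheck 
class below is one way to sketch that check. It is not part of 
the JavaHelp software. It assumes the attribute order (target, 
then url) and the double quotes used in the sample Map.jhm, takes 
the help directory as an optional argument (defaulting to 
"help"), and relies on generics and file APIs that are not 
available in version 1.4 of the SDK.

```java
import java.io.File;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MapCheck {
  // Assumption: target comes before url, both double-quoted, as in the sample.
  private static final Pattern MAP_ID = Pattern.compile(
      "<mapID\\s+target=\"([^\"]*)\"\\s+url=\"([^\"]*)\"");

  // Returns the url attribute of each mapID entry, in document order.
  static List<String> extractUrls(String mapText) {
    List<String> urls = new ArrayList<String>();
    Matcher m = MAP_ID.matcher(mapText);
    while (m.find()) {
      urls.add(m.group(2));
    }
    return urls;
  }

  // Returns the mapped urls that are not files under helpDir.
  static List<String> missingFiles(File helpDir, String mapText) {
    List<String> missing = new ArrayList<String>();
    for (String url : extractUrls(mapText)) {
      if (!new File(helpDir, url).isFile()) {
        missing.add(url);
      }
    }
    return missing;
  }

  public static void main(String[] args) throws Exception {
    File helpDir = new File(args.length > 0 ? args[0] : "help");
    File mapFile = new File(helpDir, "Map.jhm");
    if (!mapFile.isFile()) {
      System.out.println("No Map.jhm found in " + helpDir);
      return;
    }
    String mapText =
        new String(Files.readAllBytes(mapFile.toPath()), "ISO-8859-1");
    System.out.println("Missing topic files: " + missingFiles(helpDir, mapText));
  }
}
```

An empty list means that each url in the map resolves to a file 
relative to the directory that holds the HelpSet.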

To test if you have everything connected properly, run the 
hsviewer utility that comes with the JavaHelp software, and have 
it load the hello.hs file. You can find the utility in the 
demos/bin (Unix) or demos\bin (Windows) subdirectory of your 
JavaHelp installation directory. For example, in Unix 
change to the demos/bin subdirectory, and enter:

hsviewer -helpset hello.hs -classpath path

Replace "path" with the path to the hello.hs HelpSet.

After starting up hsviewer, click on the Browse button to locate 
the hello.hs file. Then click on the Display button to bring up 
the help viewer. Because hello.hs has two <view> tags, you'll 
find two tabs on the left side: one for the TOC and one for the 
index. The right side will display the HTML associated with the 
item selected on the left.

You can also add a search tab. To do this, run the jhindexer 
program and add another <view> to the HelpSet. Enter the 
jhindexer command as follows in the directory that contains the 
hello.hs file. 

jhindexer Hello

If the command isn't in your path, you'll need to prefix the 
command with its full path. You can find the command in the 
javahelp/bin (Unix) or javahelp\bin (Windows) subdirectory of 
your JavaHelp installation directory.  

Here's the <view> tag you need to add to hello.hs. JavaHelpSearch 
is the name of the directory in which jhindexer saves the search 
database files.

   <view>
     <name>Search</name>
     <label>Word Search</label>
     <type>javax.help.SearchView</type>
     <data engine="com.sun.java.help.search.DefaultSearchEngine">
       JavaHelpSearch
     </data>
   </view>

For more information about JavaHelp software, see the JavaHelp
software page (http://java.sun.com/products/javahelp/). 

.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .

IMPORTANT: Please read our Terms of Use, Privacy, and Licensing 
policies:
http://www.sun.com/share/text/termsofuse.html
http://www.sun.com/privacy/ 
http://developer.java.sun.com/berkeley_license.html

* FEEDBACK
  Comments? Send your feedback on the JDC Tech Tips to: 
  jdc-webmaster@sun.com

* SUBSCRIBE/UNSUBSCRIBE
  - To subscribe, go to the subscriptions page,
    (http://developer.java.sun.com/subscription/), choose
    the newsletters you want to subscribe to and click "Update".
  - To unsubscribe, go to the subscriptions page,
    (http://developer.java.sun.com/subscription/), uncheck the
    appropriate checkbox, and click "Update".
  - To use our one-click unsubscribe facility, see the link at 
    the end of this email.
    
* ARCHIVES
You'll find the JDC Tech Tips archives at:

http://java.sun.com/jdc/TechTips/index.html


* COPYRIGHT
Copyright 2002 Sun Microsystems, Inc. All rights reserved.
901 San Antonio Road, Palo Alto, California 94303 USA.

This document is protected by copyright. For more information, see:

http://java.sun.com/jdc/copyright.html


Sun, Sun Microsystems, Java, Java Developer Connection, JavaHelp, 
and JavaBeans are trademarks or registered trademarks of 
Sun Microsystems, Inc. in the United States and other countries.


To use our one-click unsubscribe facility, select the following URL:
http://bulkmail.sun.com/unsubscribe?14086384-227227126
