Date: Tue, 13 Jun 2000 19:57:26 GMT
From: JDCTechTips@sun.com
Subject: JDC Tech Tips, June 13, 2000
To: JDCMember@sun.com
Reply-To: JDCTechTips@sun.com


 J  D  C    T  E  C  H    T  I  P  S

                      TIPS, TECHNIQUES, AND SAMPLE CODE


WELCOME to the Java Developer Connection(sm) (JDC) Tech Tips, 
June 13, 2000. This issue covers:

         * Using BreakIterator to Parse Text
         * Goto Statements and Java(tm) Programming
                  
These tips were developed using Java(tm) 2 SDK, Standard Edition, 
v 1.2.2.

You can view this issue of the Tech Tips on the Web at
http://developer.java.sun.com/developer/TechTips/2000/tt0613.html
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
USING BREAKITERATOR TO PARSE TEXT

The standard Java(tm) packages such as java.util include several 
classes that you can use to break text into words or other logical 
units. One of these classes is java.util.StringTokenizer. When you 
use StringTokenizer, you specify a set of delimiter characters; 
instances of StringTokenizer then return words delimited by these 
characters. java.io.StreamTokenizer is a class that does something 
similar.

These classes are quite useful. However, they have some limitations.
This is especially true when you're trying to parse text that 
represents human language. For example, the classes don't have 
built-in knowledge of punctuation rules, and the classes might
define a "word" as simply a string of contiguous non-whitespace 
characters.
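
As a rough illustration of this limitation, the following sketch runs
StringTokenizer with its default whitespace delimiters. The class name
TokenizerDemo and the helper method tokens are hypothetical names used
only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenizerDemo {

    // split text on whitespace, the default StringTokenizer delimiters

    static List<String> tokens(String text) {
        List<String> result = new ArrayList<String>();
        StringTokenizer st = new StringTokenizer(text);
        while (st.hasMoreTokens()) {
            result.add(st.nextToken());
        }
        return result;
    }

    public static void main(String[] args) {

        // quotes, periods, and parentheses stay attached to the words

        System.out.println(tokens("\"Testing.\" (This is a test.)"));
    }
}
```

The program prints ["Testing.", (This, is, a, test.)]. Each "word"
still carries its punctuation, because the tokenizer only knows about
the delimiter characters it was given.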

java.text.BreakIterator is a class specifically designed to parse
human language text into words, lines, and sentences. To see how it
works, here's a simple example:

    import java.text.BreakIterator;
    
    public class BreakDemo1 {
        public static void main(String args[]) {
    
            // string to be broken into sentences
    
            String str = "\"Testing.\" \"???\" (This is a test.)";
    
            // create a sentence break iterator
    
            BreakIterator brkit =
                BreakIterator.getSentenceInstance();
            brkit.setText(str);
    
            // iterate across the string
    
            int start = brkit.first();
            int end = brkit.next();
            while (end != BreakIterator.DONE) {
                String sentence = str.substring(start, end);
                System.out.println(start + " " + sentence);
                start = end;
                end = brkit.next();
            }
        }
    }

The input string is:

    "Testing." "???" (This is a test.)

Parsing this input is not trivial. For example, you might assume 
that a sentence always ends with a period. It doesn't, as the 
following two sentences show. Both of them are considered correct:

    "This is a test."

    "This is a test".

The first of these sentences is more standard relative to 
long-standing English usage.

BreakIterator applies a set of rules to handle situations such as 
this. When you run the BreakDemo1 program in the United States 
locale, the result is:

    0 "Testing." 
    11 "???" 
    17 (This is a test.)

The numbers are offsets into the string where each sentence starts. 
In other words, BreakIterator returns a series of offsets that tell 
where some particular unit (sentence, word) starts in a string. 
BreakIterator is particularly useful in applications such as word 
processing, where, for example, you might be trying to find the 
location of the next sentence in some currently displayed text.

The demo program uses default locale settings, but it could have 
requested a particular locale (from java.util.Locale), for example:

    ... BreakIterator.getSentenceInstance(Locale.GERMAN);
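
A minimal sketch of a locale-specific iterator might look like the
following. The class name LocaleBreakDemo, the helper method sentences,
and the sample German text are assumptions made for illustration; only
getSentenceInstance(Locale) comes from the BreakIterator API.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class LocaleBreakDemo {

    // collect the sentences found by an iterator for the given locale

    static List<String> sentences(String text, Locale locale) {
        BreakIterator brkit = BreakIterator.getSentenceInstance(locale);
        brkit.setText(text);
        List<String> result = new ArrayList<String>();
        int start = brkit.first();
        int end = brkit.next();
        while (end != BreakIterator.DONE) {
            result.add(text.substring(start, end));
            start = end;
            end = brkit.next();
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Das ist ein Test. Hier ist noch einer.";
        for (String s : sentences(text, Locale.GERMAN)) {
            System.out.println(s);
        }
    }
}
```

Passing the locale as a parameter keeps the iteration code the same
for every language. Only the boundary rules behind the iterator change.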

Another way you can use BreakIterator is to find line breaks,
that is, locations in text where a line could be broken for 
text formatting. Here's an example:

    import java.text.BreakIterator;
    
    public class BreakDemo2 {
        public static void main(String args[]) {
    
            // string to be broken at possible line breaks
    
            String str = "This sen-tence con-tains hyphenation.";
    
            // create a line break iterator
    
            BreakIterator brkit =
                BreakIterator.getLineInstance();
            brkit.setText(str);
    
            // iterate across the string
    
            int start = brkit.first();
            int end = brkit.next();
            while (end != BreakIterator.DONE) {
                String segment = str.substring(start, end);
                System.out.println(start + " " + segment);
                start = end;
                end = brkit.next();
            }
        }
    }

Program output is:

    0 This 
    5 sen-
    9 tence 
    15 con-
    19 tains 
    25 hyphenation.

BreakIterator applies punctuation rules about where text can be 
broken, such as between words or within a hyphenated word (but not 
between a word and a following ".").
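
One way to put these break positions to work is a simple word wrap.
The sketch below is only an illustration under some assumptions: it
fills lines greedily, it measures width in chars (trailing spaces
included), and it lets a segment longer than the width overflow onto
a line of its own. The class name WrapDemo and the method wrap are
hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class WrapDemo {

    // greedy wrap: start a new line when the next segment does not fit

    static List<String> wrap(String text, int width) {
        BreakIterator brkit = BreakIterator.getLineInstance();
        brkit.setText(text);
        List<String> lines = new ArrayList<String>();
        StringBuilder line = new StringBuilder();
        int start = brkit.first();
        int end = brkit.next();
        while (end != BreakIterator.DONE) {
            String segment = text.substring(start, end);
            if (line.length() > 0
                    && line.length() + segment.length() > width) {
                lines.add(line.toString());
                line.setLength(0);
            }
            line.append(segment);
            start = end;
            end = brkit.next();
        }
        if (line.length() > 0) {
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        String str = "This sen-tence con-tains hyphenation.";
        for (String line : wrap(str, 12)) {
            System.out.println("[" + line + "]");
        }
    }
}
```

Because the lines are built only from segments the iterator returns,
concatenating them always gives back the original text.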

You can also use BreakIterator to find word and character breaks. 
It's important to note that in finding breaks, BreakIterator 
analyzes characters independently of how they are stored. 
A "character" in a human language is not necessarily equivalent to 
a single Java 16-bit char. For example, an accented character might
be stored as a base character along with a mark. BreakIterator 
analyzes these kinds of composite characters as a single character.  
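
The following sketch shows both kinds of iterator under a couple of
assumptions: a segment counts as a word if it starts with a letter or
digit, and the sample string uses "e" followed by the combining acute
accent (\u0301). The class name WordCharDemo and its two helper
methods are hypothetical.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class WordCharDemo {

    // keep only segments that start with a letter or digit

    static List<String> words(String text) {
        BreakIterator brkit = BreakIterator.getWordInstance();
        brkit.setText(text);
        List<String> result = new ArrayList<String>();
        int start = brkit.first();
        int end = brkit.next();
        while (end != BreakIterator.DONE) {
            String piece = text.substring(start, end);
            if (Character.isLetterOrDigit(piece.charAt(0))) {
                result.add(piece);
            }
            start = end;
            end = brkit.next();
        }
        return result;
    }

    // count characters as a reader sees them, not 16-bit chars

    static int countCharacters(String text) {
        BreakIterator brkit = BreakIterator.getCharacterInstance();
        brkit.setText(text);
        int count = 0;
        brkit.first();
        while (brkit.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(words("Hello, world!"));

        // "e" plus a combining accent: two chars, one character

        String accented = "e\u0301";
        System.out.println(accented.length() + " "
            + countCharacters(accented));
    }
}
```

The program prints [Hello, world] and then 2 1. The word iterator
also reports punctuation and whitespace as separate segments, which is
why the sketch filters them out.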

One final note about BreakIterator: it's intended for use with 
human languages, not computer ones. For example, a "sentence" in 
programming language source code has little meaning.

For more information about BreakIterator, see
http://java.sun.com/products//jdk/1.2/docs/api/java/text/BreakIterator.html

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
GOTO STATEMENTS AND JAVA(TM) PROGRAMMING

Suppose you write a C/C++ program that searches a 5 x 5 array 
to find the first occurrence of a particular value. You might use 
the following approach:

    #include <stdio.h>
    
    /* 5 x 5 array of numbers */
    
    #define N 5
    static int vec[N][N] = {
        {1, 2, 3, 4, 5},
        {2, 3, 4, 5, 6},
        {3, 4, 5, 6, 7},
        {4, 5, 6, 7, 8},
        {5, 6, 7, 8, 9}
    };
    
    /* target number to be searched for */
    
    static int TARGET = 8;
    
    int main() {
        int i = 0;
        int j = 0;
        int found = 0;
    
        /* iterate through the array, looking for the target */
    
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                if (vec[i][j] == TARGET) {
                    found = 1;
                    goto done;
                }
            }
        }
    
        done:
    
        if (found) {
            printf("Found at %d %d\n", i, j);
        }

        return 0;
    }


If you run the program, you get the result:

    Found at 3 4

In this example, a loop nested in another loop is used to find
the matching array element. If the program finds the element, it
needs to "break" from the nested loops. It's not sufficient to 
simply break from the inner loop. Doing that only returns control
to the outer loop; it does not terminate both loops. So 
a goto is used to jump out of the inner loop and transfer control 
to the "done:" label. Using a goto is not the only way to solve the 
problem in C/C++, but this is one place where a goto is sometimes 
used.

Goto statements are controversial. One problem is that it's hard 
to reason about a program's control flow if you use these
statements. For example, look again at the program above. It's 
clear that the "found" test that is just after the "done:" label 
is intended for use after the loop has terminated (that is, after 
the loop terminates normally or through the goto). But there's no 
way to enforce this rule; control can be transferred to this label 
from anywhere in the function.

In the Java(tm) programming language, goto is a reserved word, but 
the language does not have a goto statement. However, there are 
alternative statements that you can use in place of goto. This tip 
demonstrates three of them: labeled break, labeled continue, and 
try...finally.

The first of these is a rewrite of the above program:

    public class ControlDemo1 {
    
        // 5 x 5 array of numbers
    
        static int vec[][] = {
            {1, 2, 3, 4, 5},
            {2, 3, 4, 5, 6},
            {3, 4, 5, 6, 7},
            {4, 5, 6, 7, 8},
            {5, 6, 7, 8, 9}
        };
        static final int N = 5;
    
        // target number to be searched for
    
        static final int TARGET = 8;
    
        public static void main(String args[]) {
            int i = 0;
            int j = 0;
            boolean found = false;
    
            // iterate through the array, looking for the target
    
            outer:
            for (i = 0; i < N; i++) {
                for (j = 0; j < N; j++) {
                    if (vec[i][j] == TARGET) {
                        found = true;
                        break outer;
                    }
                }
            }
    
            if (found) {
                System.out.println("Found at " + i + " " + j);
            }
        }
    }

The key point in this example is that break statements can be
labeled, that is, a break can designate a labeled loop. Specifying
"break outer" in the above example terminates the loop labeled 
"outer". In other words, the break statement terminates both 
loops.
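
A labeled break is not the only way to leave nested loops. As a sketch
of one alternative, you can move the search into its own method and
use return to end both loops at once. The class name SearchDemo, the
method find, and the choice of returning null for "not found" are
assumptions made for this example.

```java
class SearchDemo {

    // 5 x 5 array of numbers

    static final int[][] VEC = {
        {1, 2, 3, 4, 5},
        {2, 3, 4, 5, 6},
        {3, 4, 5, 6, 7},
        {4, 5, 6, 7, 8},
        {5, 6, 7, 8, 9}
    };

    // return {row, column} of the first match, or null if none

    static int[] find(int[][] grid, int target) {
        for (int i = 0; i < grid.length; i++) {
            for (int j = 0; j < grid[i].length; j++) {
                if (grid[i][j] == target) {
                    return new int[] {i, j};
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        int[] pos = find(VEC, 8);
        if (pos != null) {
            System.out.println("Found at " + pos[0] + " " + pos[1]);
        }
    }
}
```

This prints "Found at 3 4", the same result as the labeled break
version. The return replaces both the "found" flag and the label.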

The same idea applies to continue statements, for example:

    public class ControlDemo2 {
        public static void main(String args[]) {
    
            outer:
            for (int i = 1; i <= 3; i++) {
                for (int j = 1; j <= 3; j++) {
                    System.out.println(i + " " + j);
                    if (i == 2 && j == 2) {
                        continue outer;
                    }
                }
            }
        }
    }
    
Output here is:

    1 1
    1 2
    1 3
    2 1
    2 2
    3 1
    3 2
    3 3
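
As a slightly more practical sketch, a labeled continue can abandon
the rest of a row as soon as the row is known to be uninteresting.
The example below counts the rows that contain no negative numbers.
The class name ContinueDemo, the method countNonNegativeRows, and the
sample data are hypothetical.

```java
class ContinueDemo {

    // count the rows that contain no negative numbers

    static int countNonNegativeRows(int[][] grid) {
        int count = 0;

        outer:
        for (int i = 0; i < grid.length; i++) {
            for (int j = 0; j < grid[i].length; j++) {
                if (grid[i][j] < 0) {

                    // skip the rest of this row and the count++ below

                    continue outer;
                }
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        int[][] grid = {
            {1, 2},
            {3, -4},
            {5, 6}
        };
        System.out.println(countNonNegativeRows(grid));
    }
}
```

The program prints 2. An unlabeled continue would only move on to the
next column, and the row would still be counted.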

Break statements are normally used in loop and switch statements,
but you can also use them in any labeled block. Here's an example
that illustrates this idea:

    public class ControlDemo3 {
    
        // add two numbers together, a >= 0 and b >= 0
        // throw IllegalArgumentException if a or b out of range
    
        static int add(int a, int b) {
            block1: {
                if (a < 0) {
                    break block1;
                }
                if (b < 0) {
                    break block1;
                }
                return a + b;
            }
            throw new IllegalArgumentException("a < 0 || b < 0");
        }
    
        public static void main(String args[]) {
    
            // legal case
    
            try {
                int a = 37;
                int b = 47;
                int c = add(a, b);
                System.out.println(c);
            }
            catch (IllegalArgumentException e) {
                System.err.println(e);
            }
    
            // illegal case
    
            try {
                int a = 37;
                int b = -47;
                int c = add(a, b);
                System.out.println(c);
            }
            catch (IllegalArgumentException e) {
                System.err.println(e);
            }
        }
    }

In this example there's a block labeled "block1". The program 
handles errors by breaking out of the block. If there are no 
errors, the program returns normally from within the block. 
An error causes an exception to be thrown after the block is 
exited. Note in this example that there are other ways of 
structuring the code. For example, you might simply say:

    if (a < 0 || b < 0) {
        throw new IllegalArgumentException("a < 0 || b < 0");
    }
    return a + b;

Which approach is "correct" depends a lot on the complexity of the 
logic, and what style you prefer.

The final example illustrates the case where you'd like to perform
some actions, and then somehow gain control for cleanup processing. 
You want to do this whether the actions succeed, fail, or trigger 
an exception. This case is sometimes implemented in C/C++ by using 
a goto to jump to the end of a function, where there is some 
cleanup code.

Here's an example of how you can do this in the Java(tm) programming
language, using try...finally:

    public class ControlDemo4 {
    
        // add two numbers together, a >= 0 and b >= 0
        // throw IllegalArgumentException if a or b out of range
    
        static int traceadd(int a, int b) {
            try {
                if (a < 0 || b < 0) {
                    throw new IllegalArgumentException(
                        "a < 0 || b < 0");
                }
                return a + b;
            }
            finally {
                System.out.println("trace: leaving traceadd");
            }
        }
    
        public static void main(String args[]) {
    
            // legal case
    
            try {
                int a = 37;
                int b = 47;
                int c = traceadd(a, b);
                System.out.println(c);
            }
            catch (IllegalArgumentException e) {
                System.err.println(e);
            }
    
            // illegal case
    
            try {
                int a = 37;
                int b = -47;
                int c = traceadd(a, b);
                System.out.println(c);
            }
            catch (IllegalArgumentException e) {
                System.err.println(e);
            }
        }
    }

This example does program tracing. It prints a message when the 
traceadd method exits. The exit can be normal, through the return 
statement, or abnormal, through an exception. Using try...finally 
(no catch) like this:

    try {
        statement 1
        statement 2
        statement 3
        ...
    }
    finally {
        cleanup
    }

is a way to get control for cleanup, no matter what happens in the 
try clause.
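
To make the order of events visible, the following sketch records
each step in a list instead of printing a trace message. The class
name FinallyDemo, the method divide, and the log list are hypothetical
names chosen for this example.

```java
import java.util.ArrayList;
import java.util.List;

class FinallyDemo {

    // record of the steps taken, in order

    static final List<String> log = new ArrayList<String>();

    static int divide(int a, int b) {
        log.add("enter");
        try {

            // throws ArithmeticException when b is 0

            return a / b;
        }
        finally {
            log.add("cleanup");
        }
    }

    public static void main(String[] args) {

        // normal exit through return

        System.out.println(divide(10, 2));

        // abnormal exit through an exception

        try {
            divide(1, 0);
        }
        catch (ArithmeticException e) {
            System.out.println("caught " + e.getMessage());
        }

        System.out.println(log);
    }
}
```

The log ends up as [enter, cleanup, enter, cleanup]. The cleanup step
runs once per call, whether divide returns a value or throws.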

For further reading, see chapter 14 in "The Java(tm) Language
Specification" by James Gosling, Bill Joy, and Guy Steele
(http://java.sun.com/docs/books/jls/).


.  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .  .

- NOTE
The names on the JDC mailing list are used for internal Sun
Microsystems(tm) purposes only. To remove your name from the list,
see Subscribe/Unsubscribe below.


- FEEDBACK
Comments? Send your feedback on the JDC Tech Tips to:

jdc-webmaster@sun.com


- SUBSCRIBE/UNSUBSCRIBE
The JDC Tech Tips are sent to you because you elected to subscribe
when you registered as a JDC member. To unsubscribe from JDC email,
go to the following address and enter the email address you wish to
remove from the mailing list:

http://developer.java.sun.com/unsubscribe.html


To become a JDC member and subscribe to this newsletter go to:

http://java.sun.com/jdc/


- ARCHIVES
You'll find the JDC Tech Tips archives at:

http://developer.java.sun.com/developer/TechTips/index.html


- COPYRIGHT
Copyright 2000 Sun Microsystems, Inc. All rights reserved.
901 San Antonio Road, Palo Alto, California 94303 USA.

This document is protected by copyright. For more information, see:

http://developer.java.sun.com/developer/copyright.html


This issue of the JDC Tech Tips is written by Glen McCluskey.

JDC Tech Tips 
June 13, 2000
