The Classes
The current OROMatcher™ classes implement Perl5 regular expressions,
but future releases will include classes for other regular expression
grammars that users request. As a side note, you do not need to include the
Util or Perl5Debug classes with software you write with OROMatcher™
if you do not use those classes in your code. This can reduce the size
of your software distribution by a few kilobytes.
Perl5Pattern
Perl5Pattern implements the Pattern interface for Perl5 regular expressions.
The only reasons it is made visible to the programmer are type safety
when calling the Perl5Matcher contains(Perl5StreamInput, Perl5Pattern)
method and programmer accessibility once the class is made serializable
in a future release incorporating 1.1 features. Currently we want the
package to be usable with the 1.0.2 and 1.1.* JDKs, but we will release
a 1.1 enhanced version of the package leveraging the 1.1 features, such
as serializability, that our users want. At that point we will distribute
both 1.0.2 and 1.1 versions of the classes.
Perl5Compiler
The Perl5Compiler class creates compiled regular expressions
conforming to the Perl5 regular expression syntax. It generates
Perl5Pattern instances upon compilation to be used in conjunction with
a Perl5Matcher instance. Please refer to the
Syntax section for more information on
Perl5 regular expressions.
The Perl5Compiler compile() methods can take the following flags, which
can be combined with bitwise OR to affect the nature of the compiled pattern:
- DEFAULT_MASK
-
The default mask for the compile methods.
It is equal to 0.
The default behavior is for a regular expression to be case sensitive
and to not specify whether it is multiline or single-line. When neither
MULTILINE_MASK nor SINGLELINE_MASK is specified, the ^, $,
and . metacharacters are
interpreted according to the value of isMultiline() in Perl5Matcher.
The default behavior of Perl5Matcher is to treat the Perl5Pattern
as though MULTILINE_MASK were enabled. If isMultiline() returns false,
then the pattern is treated as though SINGLELINE_MASK were set. However,
compiling a pattern with the MULTILINE_MASK or SINGLELINE_MASK mask
will ALWAYS override whatever behavior is specified by the setMultiline()
method in Perl5Matcher.
- CASE_INSENSITIVE_MASK
-
A mask passed as an option to the compile methods
to indicate a compiled regular expression should be case insensitive.
- MULTILINE_MASK
-
A mask passed as an option to the compile methods
to indicate a compiled regular expression should treat input as having
multiple lines. This option affects the interpretation of
the ^ and $ metacharacters. When this mask is used,
the ^ metacharacter matches at the beginning of every line,
and the $ metacharacter matches at the end of every line.
Additionally, the . metacharacter will not match newlines when
an expression is compiled with MULTILINE_MASK, which is its
default behavior.
The SINGLELINE_MASK and MULTILINE_MASK should not be
used together.
- SINGLELINE_MASK
-
A mask passed as an option to the compile methods
to indicate a compiled regular expression should treat input as being
a single line. This option affects the interpretation of
the ^ and $ metacharacters. When this mask is used,
the ^ metacharacter matches at the beginning of the input,
and the $ metacharacter matches at the end of the input.
The ^ and $ metacharacters will not match at the beginning
and end of lines occurring between the beginning and end of the input.
Additionally, the . metacharacter will match newlines when
an expression is compiled with SINGLELINE_MASK, unlike its
default behavior.
The SINGLELINE_MASK and MULTILINE_MASK should not be
used together.
- EXTENDED_MASK
-
A mask passed as an option to the compile methods
to indicate a compiled regular expression should be treated as a Perl5
extended pattern (i.e., a pattern using the /x modifier). This
option tells the compiler to ignore whitespace that is not backslashed or
within a character class. It also tells the compiler to treat the
# character as a metacharacter introducing a comment as in
Perl. In other words, the # character will comment out any
text in the regular expression between it and the next newline.
The intent of this option is to allow you to divide your patterns
into more readable parts. It is provided to maintain compatibility
with Perl5 regular expressions, although it will not often
make sense to use it in Java.
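The effect of these masks can be sketched with a small runnable
program. This is an analogy only, not OROMatcher code: it swaps in the
java.util.regex package from later JDKs, whose int flags are combined
with bitwise OR in the same way. The mapping is an assumption made for
illustration: Pattern.MULTILINE stands in for MULTILINE_MASK,
Pattern.DOTALL without Pattern.MULTILINE approximates SINGLELINE_MASK,
and Pattern.COMMENTS approximates EXTENDED_MASK. The class name
MaskSketch and the helper count() are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Analogy only: java.util.regex stands in for the OROMatcher classes.
final class MaskSketch {

    // Counts how often the regex matches the input under the given flags.
    static int count(String regex, int flags, String input) {
        Matcher m = Pattern.compile(regex, flags).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "one\nTwo\nthree";

        // Multiline behavior: ^ matches at the beginning of every line,
        // and . does not match newlines.
        System.out.println(count("^\\w+", Pattern.MULTILINE, input));   // 3
        System.out.println(count("one.Two", Pattern.MULTILINE, input)); // 0

        // Single-line behavior: ^ matches only at the beginning of the
        // input, and . also matches newlines.
        System.out.println(count("^\\w+", Pattern.DOTALL, input));      // 1
        System.out.println(count("one.Two", Pattern.DOTALL, input));    // 1

        // The flags are plain ints, so they combine with bitwise OR.
        int flags = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        System.out.println(count("^t\\w+", flags, input));              // 2

        // Extended syntax: unescaped whitespace is ignored and # starts
        // a comment that runs to the next newline.
        String extended =
            "(\\d{3})  # first group of digits\n" +
            "-         # separator\n" +
            "(\\d{4})  # second group of digits";
        System.out.println(
            count(extended, Pattern.COMMENTS, "call 555-1234 now"));    // 1
    }
}
```

Note that java.util.regex splits the treatment of ^ and $ (MULTILINE)
from the treatment of . (DOTALL), whereas the masks described above
bundle them, so the correspondence is approximate.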
Perl5Matcher
The Perl5Matcher class functions according to the PatternMatcher
interface when used with Perl5Pattern instances. Perl5Matcher contains
three methods that do not appear in the PatternMatcher interface:
- setMultiline(boolean)
- Sets whether or not subsequent calls to matches() or contains()
should treat the input as consisting of multiple lines. The default
behavior is for input to be treated as consisting of multiple
lines. This method should only be called if the Perl5Pattern used for
a match was compiled without either of the
Perl5Compiler.MULTILINE_MASK or Perl5Compiler.SINGLELINE_MASK flags,
and you want to alter the behavior of how the ^ and $ metacharacters
are interpreted on the fly. The compilation options used when
compiling a pattern ALWAYS override the behavior specified by
setMultiline().
- isMultiline()
- Returns the last value set by setMultiline(). The default value
is true.
- contains(Perl5StreamInput, Perl5Pattern)
- Determines whether the contents of a Perl5StreamInput instance, starting
from the current offset of the input, contain a pattern. If a pattern match
is found, a MatchResult instance representing the first such match is
made accessible via getMatch(). The current offset of the input stream
is advanced to the end offset corresponding to the end of the
match. Consequently a subsequent call to this method will
continue searching where the last call left off. See
Perl5StreamInput
for more details.
PatternMatcherInput
The PatternMatcherInput class is used to preserve state across calls
to the contains() methods of PatternMatcher instances. It is also used
to specify that only a subregion of a string should be used as input
when looking for a pattern match. All that is meant by preserving
state is that the end offset of the last match is remembered, so that
the next match is performed from the point where the last match left
off. This offset can be read with the getCurrentOffset() method
and set with the setCurrentOffset(int) method.
You would use a PatternMatcherInput object when you want to search for
more than just the first occurrence of a pattern in a string, or when
you only want to search a subregion of the string for a match. An
example of its most common use is:
    PatternMatcher matcher;
    PatternCompiler compiler;
    Pattern pattern;
    PatternMatcherInput input;
    MatchResult result;

    compiler = new Perl5Compiler();
    matcher = new Perl5Matcher();

    try {
        pattern = compiler.compile(somePatternString);
    } catch(MalformedPatternException e) {
        System.out.println("Bad pattern.");
        System.out.println(e.getMessage());
        return;
    }

    input = new PatternMatcherInput(someStringInput);

    while(matcher.contains(input, pattern)) {
        result = matcher.getMatch();
        // Perform whatever processing on the result you want.
    }

    // Suppose we want to start searching from the beginning again with
    // a different pattern.
    // Just set the current offset to the begin offset.
    input.setCurrentOffset(input.getBeginOffset());

    // Second search omitted

    // Suppose we're done with this input, but want to search another string.
    // There's no need to create another PatternMatcherInput instance.
    // We can just use the setInput() method.
    input.setInput(aNewInputString);
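The bookkeeping described above (a subregion plus a remembered current
offset) can also be sketched in a self-contained form. This is not the
PatternMatcherInput implementation: the Input class, its fields, and
the next() and all() helpers are hypothetical, and java.util.regex from
later JDKs is swapped in for the OROMatcher matcher so that the sketch
runs on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OffsetSketch {

    // Hypothetical stand-in for the state the text attributes to
    // PatternMatcherInput: a subregion and a remembered offset.
    static final class Input {
        final String text;
        final int begin;  // first index of the subregion
        final int end;    // index one past the subregion
        int current;      // where the next search starts

        Input(String text, int begin, int length) {
            this.text = text;
            this.begin = begin;
            this.end = begin + length;
            this.current = begin;
        }
    }

    // Searches from the current offset and, on success, moves the offset
    // to the end of the match. Returns null when nothing more is found.
    static String next(Input in, Pattern p) {
        if (in.current > in.end) {
            return null;
        }
        Matcher m = p.matcher(in.text).region(in.current, in.end);
        if (!m.find()) {
            return null;
        }
        // Step past an empty match so the loop always makes progress.
        in.current = (m.end() > m.start()) ? m.end() : m.end() + 1;
        return m.group();
    }

    // Collects every remaining match, in the style of the while loop above.
    static List<String> all(Input in, Pattern p) {
        List<String> found = new ArrayList<>();
        for (String s = next(in, p); s != null; s = next(in, p)) {
            found.add(s);
        }
        return found;
    }

    public static void main(String[] args) {
        // Only the subregion "b22 c333" (offset 3, length 8) is searched.
        Input in = new Input("a1 b22 c333 d4444", 3, 8);
        System.out.println(all(in, Pattern.compile("\\d+")));   // [22, 333]

        // Rewind to the begin offset and search with another pattern.
        in.current = in.begin;
        System.out.println(all(in, Pattern.compile("[a-z]")));  // [b, c]
    }
}
```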
Perl5StreamInput
The Perl5StreamInput class is used to look for pattern matches in an
InputStream in conjunction with the Perl5Matcher class. It is called
Perl5StreamInput instead of Perl5InputStream to stress that it is a
form of streamed input for the Perl5Matcher rather than a subclass of
InputStream. Perl5StreamInput performs special internal buffering to
accelerate pattern searches through a stream. You can determine the
size of this buffer and how it grows by using the appropriate
constructor. You should avoid using buffer increments smaller than
4096 bytes, as they will adversely affect performance.
If you want to perform line-by-line matches on an InputStream, you
should use the DataInputStream or BufferedReader class (depending on
whether you are using JDK 1.0.2 or 1.1) in conjunction with one of the
PatternMatcher methods taking a String, char[], or PatternMatcherInput
as an argument. The DataInputStream and BufferedReader readLine()
methods are implemented as native methods and are therefore more efficient
than supporting line-by-line searching within Perl5StreamInput.
In the future the programmer will be able to set this class to save
all the input it sees so that it can be accessed later. This will
avoid having to read a stream more than once for whatever reason.
For an example of how to use the Perl5StreamInput class, look at
streamInputExample.java.
Util
The Util class is a holder for useful static utility methods that can
be generically applied to Pattern and PatternMatcher instances.
This class cannot and is not meant to be instantiated.
The Util class currently contains versions of the split() and substitute()
methods inspired by Perl's split function and s operation
respectively, although they are implemented in such a way as not to
rely on the Perl5 implementations of the OROMatcher package's regular
expression interfaces. They may operate on any interface implementations
conforming to the OROMatcher API specification for the PatternMatcher,
Pattern, and MatchResult interfaces. Future versions of the class may
include additional utility methods.
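For readers unfamiliar with Perl's split function and s operation, the
behavior they inspire can be sketched as follows. This does not call
the Util class and makes no claim about its method signatures: it is
an analogy that swaps in java.util.regex from later JDKs, and the
class name SplitSubstituteSketch and both helpers are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Analogy only: shows what split and substitute operations do, using
// java.util.regex rather than the OROMatcher Util class.
final class SplitSubstituteSketch {

    // Breaks the input into the pieces found between pattern matches.
    static String[] split(String regex, String input) {
        return Pattern.compile(regex).split(input);
    }

    // Replaces every match of the pattern with a literal replacement.
    static String substitute(String regex, String replacement, String input) {
        return Pattern.compile(regex)
                      .matcher(input)
                      .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        // Split on commas surrounded by optional whitespace.
        for (String field : split("\\s*,\\s*", "red , green,blue")) {
            System.out.println(field);
        }

        // Replace each run of digits with a single '#'.
        System.out.println(substitute("\\d+", "#", "room 12, floor 3"));
    }
}
```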
A grep method is not included for two reasons:
- The details of reading a line at a time from an input stream
differ in JDK 1.0.2 and JDK 1.1, making it difficult to
retain compatibility across both Java releases.
- Grep-style processing is trivial for the programmer to implement
in a while loop. Rarely does anyone want to retrieve all
occurrences of a pattern and then process them. More often a
programmer will retrieve pattern matches and process them as they
are retrieved, which is more efficient than storing them all in a
Vector and then accessing them.
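The while loop mentioned in the second point might look like the
following minimal sketch. It assumes a JDK with BufferedReader and,
so that it runs without the OROMatcher classes, swaps in
java.util.regex for the pattern test; with OROMatcher the test inside
the loop would instead be a call to a PatternMatcher contains()
method. The class name GrepSketch and the grep() helper are
hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.regex.Pattern;

final class GrepSketch {

    // Reads the source a line at a time and handles each matching line
    // as soon as it is seen, instead of collecting all matches first.
    // Returns the number of matching lines.
    static int grep(Reader source, Pattern pattern, StringBuilder out) {
        int hits = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (pattern.matcher(line).find()) {
                    out.append(line).append('\n');
                    hits++;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        StringBuilder out = new StringBuilder();
        Reader source = new StringReader("alpha\nbeta\ngamma\n");
        int hits = grep(source, Pattern.compile("^[ab]"), out);
        System.out.print(out);                    // alpha and beta
        System.out.println(hits + " matching lines");
    }
}
```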
For an example of how to use the split and substitute methods, look
at splitExample.java
and substituteExample.java.
Perl5Debug
The Perl5Debug class is not intended for general use and should not be
instantiated, but is provided because some users may find the output
of its single method to be useful. The Perl5Compiler class generates a
representation of a regular expression identical to that of Perl5 in
the abstract, but not in terms of actual data structures. The
Perl5Debug class allows the bytecode program contained by a
Perl5Pattern to be printed out for comparison with the program
generated by Perl5 with the -r option. The Perl5Debug class is
provided primarily for Perl programmers used to using the Perl -r option.
Copyright © 1997 ORO, Inc. All rights reserved.
Original Reusable Objects, ORO, the ORO logo,
and "Component software for the Internet" are trademarks or registered
trademarks of ORO, Inc. in the United States and other countries.
Java is a trademark of Sun Microsystems. All other trademarks are the
property of their respective holders.