Class com.oroinc.text.perl.Perl5Util
java.lang.Object
|
+----com.oroinc.text.perl.Perl5Util
- public final class Perl5Util
- extends Object
- implements MatchResult
This is a utility class implementing the 3 most common Perl5 operations
involving regular expressions:
- [m]/pattern/[i][m][s][x]
- s/pattern/replacement/[g][i][m][o][s][x]
- split
As with Perl, any non-alphanumeric character can be used in lieu of
the slashes.
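To make the expression format concrete, the following is a minimal sketch of how a match expression of the form [m]/pattern/[i][m][s][x] could be taken apart. It is not the parser used by Perl5Util: the class MatchExpressionSketch and its parse() method are hypothetical, and the result is compiled with java.util.regex as a stand-in for the OROMatcher compiler, with the options mapped to the nearest java.util.regex flags.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of OROMatcher or Perl5Util.
final class MatchExpressionSketch {
    /** Splits "[m]<d>pattern<d>[imsx]" into pattern and options and compiles it. */
    static Pattern parse(String expr) {
        // The leading 'm' is optional.
        int start = expr.startsWith("m") ? 1 : 0;
        if (expr.length() < start + 2) {
            throw new IllegalArgumentException("expression too short: " + expr);
        }
        // Any non-alphanumeric character may serve as the delimiter.
        char delim = expr.charAt(start);
        if (Character.isLetterOrDigit(delim)) {
            throw new IllegalArgumentException("bad delimiter in: " + expr);
        }
        // Find the closing delimiter, skipping backslash-escaped characters.
        int end = -1;
        for (int i = start + 1; i < expr.length(); i++) {
            char c = expr.charAt(i);
            if (c == '\\') {
                i++; // skip the escaped character
            } else if (c == delim) {
                end = i;
                break;
            }
        }
        if (end < 0) {
            throw new IllegalArgumentException("unterminated pattern: " + expr);
        }
        String regex = expr.substring(start + 1, end);
        // Map the trailing options onto java.util.regex flags (an approximation).
        int flags = 0;
        for (char opt : expr.substring(end + 1).toCharArray()) {
            switch (opt) {
                case 'i': flags |= Pattern.CASE_INSENSITIVE; break;
                case 'm': flags |= Pattern.MULTILINE; break;
                case 's': flags |= Pattern.DOTALL; break;
                case 'x': flags |= Pattern.COMMENTS; break;
                default:
                    throw new IllegalArgumentException("unknown option: " + opt);
            }
        }
        return Pattern.compile(regex, flags);
    }
}
```

With this sketch, parse("m#a/b#") and parse("/a\\/b/") both yield a pattern that matches the text a/b, which shows why choosing a different delimiter avoids backslashing.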
The objective of the class is to minimize the amount of code a Java
programmer using OROMatcher™ has to write to achieve the same results
as Perl by
transparently handling regular expression compilation, caching, and
matching. A second objective is to use the same Perl pattern matching
syntax to ease the task of Perl programmers transitioning to Java
(this also reduces the number of parameters to a method).
All state-affecting methods are synchronized, so you do not need to
maintain explicit locks in multithreaded programs. This philosophy
differs from that of the OROMatcher™ package, where you are expected
either to maintain explicit locks or, preferably, to create separate
compiler and matcher instances for each thread.
To use this class, first create an instance using the default constructor
or initialize the instance with a PatternCache of your choosing using
the alternate constructor. The default cache used by Perl5Util is a
PatternCacheLRU of capacity GenericPatternCache.DEFAULT_CAPACITY. You may
want to create a cache with a different capacity, a different
cache replacement policy, or even devise your own PatternCache
implementation. The PatternCacheLRU is probably the best general purpose
pattern cache, but your specific application may be better served by
a different cache replacement policy. You should remember that you can
front-load a cache with all the patterns you will be using before
initializing a Perl5Util instance, or you can just let Perl5Util
fill the cache as you use it.
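To illustrate what a least-recently-used replacement policy means in practice, here is a minimal sketch. It is not the PatternCacheLRU implementation: the class LruPatternCacheSketch and its methods are hypothetical, and it caches java.util.regex.Pattern objects as a stand-in for compiled OROMatcher patterns.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical illustration of an LRU pattern cache; not an OROMatcher class.
final class LruPatternCacheSketch {
    private final Map<String, Pattern> cache;
    private int compilations = 0;

    LruPatternCacheSketch(final int capacity) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Pattern> eldest) {
                // Evict the least recently used entry once capacity is exceeded.
                return size() > capacity;
            }
        };
    }

    /** Returns the cached pattern, compiling and storing it on a miss. */
    synchronized Pattern get(String regex) {
        Pattern p = cache.get(regex);
        if (p == null) {
            p = Pattern.compile(regex);
            compilations++;
            cache.put(regex, p);
        }
        return p;
    }

    /** Number of cache misses that required a compilation. */
    synchronized int compilations() {
        return compilations;
    }

    /** Reports whether a pattern is cached, without changing its recency. */
    synchronized boolean isCached(String regex) {
        return cache.containsKey(regex);
    }
}
```

Front-loading such a cache amounts to calling get() once for every expression you expect to use before handing the cache to its consumer.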
You might use the class as follows:
    Perl5Util util = new Perl5Util();
    String line;
    DataInputStream input;
    PrintStream output;
    // Initialization of input and output omitted
    while ((line = input.readLine()) != null) {
        // First find the line with the string we want to substitute because
        // it is cheaper than blindly substituting each line.
        if (util.match("/HREF=\"description1.html\"/", line)) {
            line = util.substitute("s/description1\\.html/about1.html/", line);
        }
        output.println(line);
    }
A couple of things to remember when using this class are that the
match() methods have the same meaning as contains() in OROMatcher™
and =~ m/pattern/ in Perl. The methods are named match
to more closely associate them with Perl and to differentiate them
from matches() in OROMatcher™.
A further thing to keep in mind is that the
MalformedPerl5PatternException class is derived from RuntimeException
which means you DON'T have to catch it. The reasoning behind this is that
you will detect your regular expression mistakes as you write and debug
your program when a MalformedPerl5PatternException is thrown during a test
run. However, we STRONGLY recommend that you ALWAYS catch
MalformedPerl5PatternException whenever you deal with a DYNAMICALLY
created pattern. Relying on a fatal MalformedPerl5PatternException being
thrown to detect errors while debugging is only useful for dealing with
static patterns, that is actual pregenerated strings present in your
program. Patterns created from user input or some other dynamic method
CANNOT be relied upon to be correct and MUST be handled by catching
MalformedPerl5PatternException for your programs to be robust.
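The same discipline applies to any regular expression library that reports malformed patterns through an unchecked exception. The sketch below uses java.util.regex, whose PatternSyntaxException is also a RuntimeException, as a stand-in; with Perl5Util you would catch MalformedPerl5PatternException around the match() call instead. The class DynamicPatternSketch and the method safeContains() are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper showing how to guard a dynamically created pattern.
final class DynamicPatternSketch {
    /**
     * Returns true if userRegex is found somewhere in input.
     * Returns false if there is no match or if userRegex is malformed.
     */
    static boolean safeContains(String userRegex, String input) {
        try {
            return Pattern.compile(userRegex).matcher(input).find();
        } catch (PatternSyntaxException e) {
            // A pattern built from user input cannot be trusted to be valid,
            // so treat a malformed pattern as "no match" instead of crashing.
            return false;
        }
    }
}
```

A real program would probably report the malformed pattern back to the user rather than silently returning false; the point is only that the exception is handled.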
Finally, as a convenience Perl5Util implements
the com.oroinc.text.regex.MatchResult interface found in the
OROMatcher™ package. The methods
are merely wrappers which call the corresponding method of the last
MatchResult found by a match or substitution (or even a split, but
this isn't particularly useful). That MatchResult can also be
accessed directly with getMatch().
Copyright © 1997 Original Reusable Objects, Inc. All rights reserved.
- See Also:
- MalformedPerl5PatternException, PatternCache, PatternCacheLRU, MatchResult
-
SPLIT_ALL
- A constant passed to the split() methods indicating
that all occurrences of a pattern should be used to split a string.
-
Perl5Util()
- Default constructor for Perl5Util.
-
Perl5Util(PatternCache)
- A secondary constructor for Perl5Util.
-
begin(int)
- Returns the begin offset of the subgroup of the last match found
relative to the beginning of the match.
-
beginOffset(int)
- Returns an offset marking the beginning of the last pattern match
found relative to the beginning of the input from which the match
was extracted.
-
end(int)
- Returns the end offset of the subgroup of the last match found
relative to the beginning of the match.
-
endOffset(int)
- Returns an offset marking the end of the last pattern match found
relative to the beginning of the input from which the match was
extracted.
-
getMatch()
- Returns the last match found by a call to a match(), substitute(), or
split() method.
-
group(int)
- Returns the contents of the parenthesized subgroups of the last match
found according to the behavior dictated by the MatchResult interface.
-
groups()
-
-
length()
- Returns the length of the last match found.
-
match(String, char[])
- Searches for the first pattern match somewhere in a character array
taking a pattern specified in Perl5 native format:
[m]/pattern/[i][m][s][x]
The m
prefix is optional and the meaning of the optional
trailing options are:
- i
- case insensitive match
- m
- treat the input as consisting of multiple lines
- s
- treat the input as consisting of a single line
- x
- enable extended expression syntax incorporating whitespace
and comments
As with Perl, any non-alphanumeric character can be used in lieu of
the slashes.
-
match(String, PatternMatcherInput)
- Searches for the next pattern match somewhere in a
com.oroinc.text.regex.PatternMatcherInput instance, taking
a pattern specified in Perl5 native format:
[m]/pattern/[i][m][s][x]
The m
prefix is optional and the meaning of the optional
trailing options are:
- i
- case insensitive match
- m
- treat the input as consisting of multiple lines
- s
- treat the input as consisting of a single line
- x
- enable extended expression syntax incorporating whitespace
and comments
As with Perl, any non-alphanumeric character can be used in lieu of
the slashes.
-
match(String, Perl5StreamInput)
- Searches for the next pattern match somewhere in a
com.oroinc.text.regex.Perl5StreamInput instance, taking
a pattern specified in Perl5 native format:
[m]/pattern/[i][m][s][x]
The m
prefix is optional and the meaning of the optional
trailing options are:
- i
- case insensitive match
- m
- treat the input as consisting of multiple lines
- s
- treat the input as consisting of a single line
- x
- enable extended expression syntax incorporating whitespace
and comments
As with Perl, any non-alphanumeric character can be used in lieu of
the slashes.
-
match(String, String)
- Searches for the first pattern match in a String taking
a pattern specified in Perl5 native format:
[m]/pattern/[i][m][s][x]
The m
prefix is optional and the meaning of the optional
trailing options are:
- i
- case insensitive match
- m
- treat the input as consisting of multiple lines
- s
- treat the input as consisting of a single line
- x
- enable extended expression syntax incorporating whitespace
and comments
As with Perl, any non-alphanumeric character can be used in lieu of
the slashes.
-
postMatch()
- Returns the part of the input following the last match found.
-
postMatchCharArray()
- Returns the part of the input following the last match found as a char
array.
-
preMatch()
- Returns the part of the input preceding the last match found.
-
preMatchCharArray()
- Returns the part of the input preceding the last match found as a
char array.
-
split(String)
- Splits input in the default Perl manner, splitting on all whitespace.
-
split(String, String)
- This method is identical to calling:
split(pattern, input, SPLIT_ALL);
-
split(String, String, int)
- Splits a String into strings contained in a Vector of size no greater
than a specified limit.
-
substitute(String, String)
- Substitutes a pattern in a given input with a replacement string.
-
toString()
- Returns the same as group(0).
SPLIT_ALL
public static final int SPLIT_ALL
- A constant passed to the split() methods indicating
that all occurrences of a pattern should be used to split a string.
Perl5Util
public Perl5Util(PatternCache cache)
- A secondary constructor for Perl5Util. It initializes the Perl5Matcher
used by the class to perform matching operations, but requires the
programmer to provide a PatternCache instance for the class
to use to compile and store regular expressions. You would want to
use this constructor if you want to change the capacity or policy
of the cache used. Example uses might be:
// We know we're going to use close to 50 expressions a whole lot, so
// we create a cache of the proper size.
util = new Perl5Util(new PatternCacheLRU(50));
or
// We're only going to use a few expressions and know that second-chance
// fifo is best suited to the order in which we are using the patterns.
util = new Perl5Util(new PatternCacheFIFO2(10));
Perl5Util
public Perl5Util()
- Default constructor for Perl5Util. This initializes the Perl5Matcher
used by the class to perform matching operations and creates a
default PatternCacheLRU instance to use to compile and cache regular
expressions. The size of this cache is
GenericPatternCache.DEFAULT_CAPACITY.
match
public synchronized boolean match(String pattern,
char input[]) throws MalformedPerl5PatternException
- Searches for the first pattern match somewhere in a character array
taking a pattern specified in Perl5 native format:
[m]/pattern/[i][m][s][x]
The m
prefix is optional and the meaning of the optional
trailing options are:
- i
- case insensitive match
- m
- treat the input as consisting of multiple lines
- s
- treat the input as consisting of a single line
- x
- enable extended expression syntax incorporating whitespace
and comments
As with Perl, any non-alphanumeric character can be used in lieu of
the slashes.
If the input contains the pattern, the com.oroinc.text.regex.MatchResult
can be obtained by calling getMatch() .
However, Perl5Util implements the MatchResult interface as a wrapper
around the last MatchResult found, so you can call its methods to
access match information.
- Parameters:
- pattern - The pattern to search for.
- input - The char[] input to search.
- Returns:
- True if the input contains the pattern, false otherwise.
- Throws: MalformedPerl5PatternException
- If there is an error in
the pattern. You are not forced to catch this exception
because it is derived from RuntimeException.
match
public synchronized boolean match(String pattern,
String input) throws MalformedPerl5PatternException
- Searches for the first pattern match in a String taking
a pattern specified in Perl5 native format:
[m]/pattern/[i][m][s][x]
The m
prefix is optional and the meaning of the optional
trailing options are:
- i
- case insensitive match
- m
- treat the input as consisting of multiple lines
- s
- treat the input as consisting of a single line
- x
- enable extended expression syntax incorporating whitespace
and comments
As with Perl, any non-alphanumeric character can be used in lieu of
the slashes.
If the input contains the pattern, the com.oroinc.text.regex.MatchResult
can be obtained by calling getMatch() .
However, Perl5Util implements the MatchResult interface as a wrapper
around the last MatchResult found, so you can call its methods to
access match information.
- Parameters:
- pattern - The pattern to search for.
- input - The String input to search.
- Returns:
- True if the input contains the pattern, false otherwise.
- Throws: MalformedPerl5PatternException
- If there is an error in
the pattern. You are not forced to catch this exception
because it is derived from RuntimeException.
match
public synchronized boolean match(String pattern,
PatternMatcherInput input) throws MalformedPerl5PatternException
- Searches for the next pattern match somewhere in a
com.oroinc.text.regex.PatternMatcherInput instance, taking
a pattern specified in Perl5 native format:
[m]/pattern/[i][m][s][x]
The m
prefix is optional and the meaning of the optional
trailing options are:
- i
- case insensitive match
- m
- treat the input as consisting of multiple lines
- s
- treat the input as consisting of a single line
- x
- enable extended expression syntax incorporating whitespace
and comments
As with Perl, any non-alphanumeric character can be used in lieu of
the slashes.
If the input contains the pattern, the com.oroinc.text.regex.MatchResult
can be obtained by calling getMatch() .
However, Perl5Util implements the MatchResult interface as a wrapper
around the last MatchResult found, so you can call its methods to
access match information.
After the call to this method, the PatternMatcherInput current offset
is advanced to the end of the match, so you can use it to repeatedly
search for expressions in the entire input using a while loop as
explained in the OROMatcher™ package.
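The while loop mentioned above follows the familiar find-next idiom. The sketch below shows that idiom with java.util.regex.Matcher, whose find() likewise resumes from the end of the previous match; it is a stand-in, not OROMatcher code. With Perl5Util you would instead call match(pattern, input) repeatedly on the same PatternMatcherInput until it returns false. The class FindAllSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of searching an entire input match by match.
final class FindAllSketch {
    /** Collects the text of every successive match of regex in input. */
    static List<String> findAll(String regex, String input) {
        List<String> hits = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        // Each successful find() starts where the previous match ended.
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }
}
```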
- Parameters:
- pattern - The pattern to search for.
- input - The PatternMatcherInput to search.
- Returns:
- True if the input contains the pattern, false otherwise.
- Throws: MalformedPerl5PatternException
- If there is an error in
the pattern. You are not forced to catch this exception
because it is derived from RuntimeException.
match
public synchronized boolean match(String pattern,
Perl5StreamInput input) throws IOException, MalformedPerl5PatternException
- Searches for the next pattern match somewhere in a
com.oroinc.text.regex.Perl5StreamInput instance, taking
a pattern specified in Perl5 native format:
[m]/pattern/[i][m][s][x]
The m
prefix is optional and the meaning of the optional
trailing options are:
- i
- case insensitive match
- m
- treat the input as consisting of multiple lines
- s
- treat the input as consisting of a single line
- x
- enable extended expression syntax incorporating whitespace
and comments
As with Perl, any non-alphanumeric character can be used in lieu of
the slashes.
If the input contains the pattern, the com.oroinc.text.regex.MatchResult
can be obtained by calling getMatch() .
However, Perl5Util implements the MatchResult interface as a wrapper
around the last MatchResult found, so you can call its methods to
access match information.
See the OROMatcher™ documentation
for more details on Perl5StreamInput.
- Parameters:
- pattern - The pattern to search for.
- input - The Perl5StreamInput to search.
- Returns:
- True if the input contains the pattern, false otherwise.
- Throws: MalformedPerl5PatternException
- If there is an error in
the pattern. You are not forced to catch this exception
because it is derived from RuntimeException.
getMatch
public synchronized MatchResult getMatch()
- Returns the last match found by a call to a match(), substitute(), or
split() method. This method is only intended for retrieving the match
found by the last call to a match() method. Use it when you want to
save MatchResult instances. Otherwise, for simply accessing match
information, it is more convenient to use the Perl5Util methods
implementing the MatchResult interface.
- Returns:
- The com.oroinc.text.regex.MatchResult instance containing the
last match found.
substitute
public synchronized String substitute(String expression,
String input) throws MalformedPerl5PatternException
- Substitutes a pattern in a given input with a replacement string.
The substitution expression is specified in Perl5 native format:
s/pattern/replacement/[g][i][m][o][s][x]
The s
prefix is mandatory and the meaning of the optional
trailing options are:
- g
- Substitute all occurrences of pattern with replacement.
The default is to replace only the first occurrence.
- i
- perform a case insensitive match
- m
- treat the input as consisting of multiple lines
- o
- If variable interpolation is used, only evaluate the
interpolation once (the first time). This is equivalent
to using a numInterpolations argument of 1 in the
OROMatcher™ Util.substitute() method. The default is to compute
each interpolation independently. See the OROMatcher™
Util.substitute() method for more details on variable
interpolation in substitutions.
- s
- treat the input as consisting of a single line
- x
- enable extended expression syntax incorporating whitespace
and comments
As with Perl, any non-alphanumeric character can be used in lieu of
the slashes. This is helpful to avoid backslashing. For example,
using slashes you would have to do:
result = util.substitute("s/foo\\/bar/goo\\/\\/baz/", input);
when you could more easily write:
result = util.substitute("s#foo/bar#goo//baz#", input);
where the hashmarks are used instead of slashes.
There is a special case of backslashing that you need to pay attention
to. As demonstrated above, to denote a delimiter in the substituted
string it must be backslashed. However, this can be a problem
when you want to denote a backslash at the end of the substituted
string. For this special case, a double backslash can be used
to express a single backslash. For example:
result = util.substitute("s#/#\\\\#g", input);
This replaces all instances of a forward slash with a single backslash.
However, the following will replace all instances of /a with \\a rather
than \a:
result = util.substitute("s#/a#\\\\a#g", input);
To obtain \a as a replacement you would use:
result = util.substitute("s#/a#\\a#g", input);
In other words, a double backslash (quadrupled in the Java String)
always represents two backslashes unless the second backslash is
followed by the delimiter, in which case it represents a single
backslash.
- Parameters:
- expression - The substitution expression.
- input - The input.
- Returns:
- The input after substitutions have been performed.
- Throws: MalformedPerl5PatternException
- If there is an error in
the expression. You are not forced to catch this exception
because it is derived from RuntimeException.
split
public synchronized Vector split(String pattern,
String input,
int limit) throws MalformedPerl5PatternException
- Splits a String into strings contained in a Vector of size no greater
than a specified limit. The String is split using a regular expression
as the delimiter. The regular expression is a pattern specified
in Perl5 native format:
[m]/pattern/[i][m][s][x]
The m
prefix is optional and the meaning of the optional
trailing options are:
- i
- case insensitive match
- m
- treat the input as consisting of multiple lines
- s
- treat the input as consisting of a single line
- x
- enable extended expression syntax incorporating whitespace
and comments
As with Perl, any non-alphanumeric character can be used in lieu of
the slashes.
The limit parameter causes the string to be split on at most the first
limit - 1 pattern occurrences.
Of special note is that this split method performs EXACTLY the same
as the Perl split() function. In other words, if the split pattern
contains parentheses, additional Vector elements are created from
each of the matching subgroups in the pattern. Using an example
similar to the one from the Camel book:
split("/([,-])/", "8-12,15,18")
produces the Vector containing:
{ "8", "-", "12", ",", "15", ",", "18" }
The Util.split() method in the OROMatcher™ package does NOT
implement this particular behavior because it is intended to
be usable with Pattern instances other than Perl5Pattern.
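The subgroup behavior described above can be sketched as follows. This is an illustration written against java.util.regex, not the Perl5Util implementation: the class PerlSplitSketch is hypothetical, the pattern is passed as a bare regular expression without delimiters, a List stands in for the Vector, and edge cases such as zero-length matches and subgroups that did not participate in a match are not handled the way Perl handles them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of Perl-style split with captured delimiters.
final class PerlSplitSketch {
    /**
     * Splits input on regex. Text captured by parenthesized subgroups of
     * the delimiter is added to the result as extra elements. A positive
     * limit allows at most limit - 1 delimiter matches to be used; a
     * non-positive limit uses all of them.
     */
    static List<String> split(String regex, String input, int limit) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        int last = 0;   // end of the previous delimiter match
        int splits = 0; // number of delimiter matches used so far
        while ((limit <= 0 || splits < limit - 1) && m.find()) {
            out.add(input.substring(last, m.start()));
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) {
                    out.add(m.group(g)); // captured part of the delimiter
                }
            }
            last = m.end();
            splits++;
        }
        out.add(input.substring(last)); // remainder after the last delimiter
        return out;
    }
}
```

Calling split("([,-])", "8-12,15,18", 0) on this sketch yields the seven elements listed above, while the same call with the pattern "[,-]" yields only the four numbers.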
- Parameters:
- pattern - The regular expression to use as a split delimiter.
- input - The String to split.
- limit - The limit on the size of the returned Vector.
Values <= 0 produce the same behavior as the SPLIT_ALL constant which
causes the limit to be ignored and splits to be performed on all
occurrences of the pattern. You should use the SPLIT_ALL constant
to achieve this behavior instead of relying on the default behavior
associated with non-positive limit values.
- Returns:
- A Vector containing the substrings of the input
that occur between the regular expression delimiter occurrences. The
input will not be split into any more substrings than the specified
limit. A way of thinking of this is that only the first limit - 1
matches of the delimiting regular expression will be used to split the
input.
- Throws: MalformedPerl5PatternException
- If there is an error in
the expression. You are not forced to catch this exception
because it is derived from RuntimeException.
split
public synchronized Vector split(String pattern,
String input) throws MalformedPerl5PatternException
- This method is identical to calling:
split(pattern, input, SPLIT_ALL);
split
public synchronized Vector split(String input) throws MalformedPerl5PatternException
- Splits input in the default Perl manner, splitting on all whitespace.
This method is identical to calling:
split("/\\s+/", input);
length
public synchronized int length()
- Returns the length of the last match found.
- Returns:
- The length of the last match found.
groups
public synchronized int groups()
- Returns:
- The number of groups contained in the last match found.
This number includes the 0th group. In other words, the
result refers to the number of parenthesized subgroups plus
the entire match itself.
group
public synchronized String group(int group)
- Returns the contents of the parenthesized subgroups of the last match
found according to the behavior dictated by the MatchResult interface.
- Parameters:
- group - The pattern subgroup to return.
- Returns:
- A string containing the indicated pattern subgroup. Group
0 always refers to the entire match. If a group was never
matched, it returns null. This is not to be confused with
a group matching the null string, which will return a String
of length 0.
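The distinction between a group that never matched and a group that matched the null string is easy to overlook. The sketch below demonstrates it with java.util.regex, which draws the same distinction; it is a stand-in, and the class GroupNullSketch is a hypothetical name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of null versus empty subgroups.
final class GroupNullSketch {
    /** Returns group 1 of the first match of regex in input. */
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            throw new IllegalStateException("no match");
        }
        return m.group(1);
    }
}
```

Matching "(a)?b" against "b" leaves group 1 unmatched, so firstGroup() returns null, whereas matching "(a*)b" against "b" lets group 1 match the null string, so it returns a String of length 0.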
begin
public synchronized int begin(int group)
- Returns the begin offset of the subgroup of the last match found
relative to the beginning of the match.
- Parameters:
- group - The pattern subgroup.
- Returns:
- The offset into group 0 of the first token in the indicated
pattern subgroup. If a group was never matched or does
not exist, returns -1. Be aware that a group that matches
the null string at the end of a match will have an offset
equal to the length of the string, so you shouldn't blindly
use the offset to index an array or String.
end
public synchronized int end(int group)
- Returns the end offset of the subgroup of the last match found
relative to the beginning of the match.
- Parameters:
- group - The pattern subgroup.
- Returns:
- Returns one plus the offset into group 0 of the last token in
the indicated pattern subgroup. If a group was never matched
or does not exist, returns -1. A group matching the null
string will return its start offset.
beginOffset
public synchronized int beginOffset(int group)
- Returns an offset marking the beginning of the last pattern match
found relative to the beginning of the input from which the match
was extracted.
- Parameters:
- group - The pattern subgroup.
- Returns:
- The offset of the first token in the indicated
pattern subgroup. If a group was never matched or does
not exist, returns -1.
endOffset
public synchronized int endOffset(int group)
- Returns an offset marking the end of the last pattern match found
relative to the beginning of the input from which the match was
extracted.
- Parameters:
- group - The pattern subgroup.
- Returns:
- Returns one plus the offset of the last token in
the indicated pattern subgroup. If a group was never matched
or does not exist, returns -1. A group matching the null
string will return its start offset.
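The difference between begin()/end(), which are relative to the start of the match, and beginOffset()/endOffset(), which are relative to the start of the input, can be sketched as follows. The example uses java.util.regex as a stand-in: Matcher.start(group) and Matcher.end(group) are input-relative, so the match-relative values are obtained by subtracting the start of group 0. The class OffsetSketch is hypothetical, and unlike the methods documented above it does not return -1 for a group index that does not exist.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of match-relative versus input-relative offsets.
final class OffsetSketch {
    /**
     * Returns {begin, end, beginOffset, endOffset} for the given group of
     * the first match, all -1 if the group did not participate in the
     * match, or null if there is no match at all.
     */
    static int[] offsets(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        int beginOffset = m.start(group); // relative to the input
        int endOffset = m.end(group);     // one past the last character
        if (beginOffset < 0) {
            return new int[] {-1, -1, -1, -1};
        }
        int matchStart = m.start();       // where group 0 begins in the input
        return new int[] {
            beginOffset - matchStart,     // relative to the match
            endOffset - matchStart,
            beginOffset,
            endOffset
        };
    }
}
```

For the pattern "b(c+)d" and the input "aabccdz", the match bccd starts at input offset 2, so group 1 has match-relative offsets 1 and 3 and input-relative offsets 3 and 5.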
toString
public synchronized String toString()
- Returns the same as group(0).
- Returns:
- A string containing the entire match.
- Overrides:
- toString in class Object
preMatch
public synchronized String preMatch()
- Returns the part of the input preceding the last match found. This
method has no meaning for matches found on Perl5StreamInput forms
of input. It will always return a zero length string when used to
reference such a form of input.
- Returns:
- The part of the input preceding the last match found.
postMatch
public synchronized String postMatch()
- Returns the part of the input following the last match found. This
method has no meaning for matches found on Perl5StreamInput forms
of input. It will always return a zero length string when used to
reference such a form of input.
- Returns:
- The part of the input following the last match found.
preMatchCharArray
public synchronized char[] preMatchCharArray()
- Returns the part of the input preceding the last match found as a
char array. This method has no meaning for matches found on
Perl5StreamInput forms of input. It will always return null when used to
reference such a form of input. This method eliminates the extra
buffer copying caused by preMatch().toCharArray().
- Returns:
- The part of the input preceding the last match found as a char[].
If the result is of zero length, returns null instead of a zero
length array.
postMatchCharArray
public synchronized char[] postMatchCharArray()
- Returns the part of the input following the last match found as a char
array. This method has no meaning for matches found on Perl5StreamInput
forms of input. It will always return null when used to
reference such a form of input. This method eliminates the extra
buffer copying caused by postMatch().toCharArray().
- Returns:
- The part of the input following the last match found as a char[].
If the result is of zero length, returns null instead of a zero
length array.