Native Regex Support Comes to Java

Industry News

by Barry Johnson

on July 25, 2002

All but the most casual Linux users no doubt have encountered regular expressions, if only in a cursory manner. Regular expressions are extremely powerful and flexible pattern-matching and -replacing tools. They are not confined to the UNIX world, but they are certainly pervasive there.

Regex engines are at least partially implemented in a host of common UNIX tools, including grep, awk, vi and Emacs. Furthermore, many widely used scripting languages contain regex support. Python, Tcl, JavaScript and (most notably) Perl are among them.

I am a Perl hacker from way back, and if you are like me, you have come to rely heavily on having such powerful text-munging tools at your fingertips. In recent years, however, I have, like many developers, become more and more involved in Java development.

Java has much to recommend it as a development language, but one thing it always has lacked is native regex support. Until recently, using regular expressions in Java has meant using third-party class libraries, many of which were of inconsistent quality and poorly maintained. This shortcoming was always a huge roadblock in my acceptance of Java as a primary development language.

You can imagine, then, my glee when I learned that Sun's 1.4.0 version of the Java JDK contained java.util.regex, a full-blown, native regular-expressions package. Ironically, it took me quite a while to learn about this hidden gem. I wondered why an enhancement of this magnitude had not merited more publicity.

Better late than never of course, and Sun has jumped into the regex world with both feet. The java.util.regex package is breathtaking in the thoroughness of its implementation and, as always, is meticulously documented. Even more obscure regex arcana are present, such as lookaheads and greediness quantifiers. Some constructs (perhaps most notably character-class intersections and unions) are not even found in Perl.

The core of the regex package comprises two new classes, Pattern and Matcher. Pattern is the object representation of the pattern that is to be searched for. Matcher is the object that actually performs the search. Additionally, there is a new exception class, PatternSyntaxException, which is thrown whenever an invalid search pattern is encountered.
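To give a feel for how the three classes fit together, here is a minimal sketch. The class name RegexBasics and the helpers isValidPattern and firstMatch are my own illustrative names, not part of the package, and the pattern deliberately avoids backslashes, which need extra care in Java source.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

class RegexBasics {
    // Hypothetical helper: reports whether a regex compiles cleanly.
    static boolean isValidPattern(String regex) {
        try {
            Pattern.compile(regex);
            return true;
        } catch (PatternSyntaxException e) {
            // Thrown by compile() when the pattern is malformed.
            return false;
        }
    }

    // Hypothetical helper: returns the first match in input, or null if none.
    static String firstMatch(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);   // what to look for
        Matcher matcher = pattern.matcher(input);   // does the looking
        return matcher.find() ? matcher.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(isValidPattern("[0-9]+"));     // a well-formed pattern
        System.out.println(isValidPattern("(unclosed"));  // missing ")"
        System.out.println(firstMatch("[0-9]+", "JDK 1.4 shipped in 2002"));
    }
}
```

Because PatternSyntaxException is an unchecked exception, catching it is optional; it is mostly worth doing when the pattern comes from user input rather than from your own source code.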

Developers already fluent in regular-expression usage will find the Java package quite simple to work with. Granted, Perl diehards who are spoiled by that language's one-line matches and substitutions will have to do a little more work than they may be used to, but the process is still straightforward.

The limitations of this article do not allow for a full (or even cursory) tutorial on regular-expression usage. For those who could use a refresher, I highly recommend Jeffrey Friedl's Mastering Regular Expressions, published by O'Reilly and Associates. I would, however, like to provide a few examples to help people get started and to show how easy it can be.

As a simple but nontrivial example, let's consider a regex for matching phone numbers. It sounds pretty mundane, but such a thing can be extremely useful. Potential applications include finding phone numbers embedded in text files, validating the proper format of a phone number (for example, validating input from HTML form submission) and coercing phone numbers into a consistent format.

The task of developing a simple expression designed to match any phone number is complicated by the fact that there are multiple, equally valid ways one might choose to format a phone number. For example, the following, as well as others, may be considered equally valid representations: (212) 555-1212, 212-555-1212 and 212 555 1212.

So let's get started writing a regular expression. To keep things simple in the beginning, we will craft a regex that identifies phone numbers if they are expressed only in the following format: (nnn) nnn-nnnn.

The first step is to create a pattern object to match such a substring. Once we get it running, we can genericize if need be. A regex designed to match this pattern might look something like:

(\d{3})\s\d{3}-\d{4}

The \d sequence is used to match any digit, 0 through 9. The addition of the {n} quantifier is a convenient shorthand that instructs the regex to match a series of any three consecutive digits (\d\d\d). The \s is another useful construct that matches any character considered whitespace, such as a space, tab, newline character and so on.

Simple enough, right? A couple of things, however, need to be done to this pattern before we can actually use it in a Java program. In a Java string literal, a backslash begins an escape sequence, and the compiler processes those escapes before the regex package ever sees the string. It recognizes only a handful of them (\n and \t, for example) and rejects sequences such as \d that are meaningful only to the regex package. To allow the backslashes to be passed properly to our pattern object, we will need to escape the backslashes themselves. Furthermore, parentheses have multiple meanings in regular expressions. If we want them to be interpreted literally (as we do), we need to escape them as well. This leaves us with:

\\(\\d{3}\\)\\s\\d{3}-\\d{4}

Slightly less pretty, yes, but it does the trick. Here is how we actually might implement it in code. Remember, when using the regex package, include the line

import java.util.regex.*;

at the top of your class. The following is an excerpt from a quick-and-dirty Java program that reads lines from a text file called phone and scans each line for a phone number. When it finds one, it prints it out to the console.

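   // Besides java.util.regex.*, this excerpt needs java.io.*
   // Compile the pattern once, then reuse it for every line.
   Pattern pattern =
        Pattern.compile("\\(\\d{3}\\)\\s\\d{3}-\\d{4}");
   BufferedReader in = new BufferedReader(new FileReader("phone"));
   String s;
   while ((s = in.readLine()) != null)
   {
         Matcher matcher = pattern.matcher(s);
         // find() looks for the pattern anywhere in the line
         if (matcher.find())
         {
                // group() returns the text that was matched
                System.out.println(matcher.group());
         }
   }
   in.close();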

The code should not look especially unusual to anyone who has used regular expressions in languages such as Python or JavaScript. In these languages, as in others, the regular-expression patterns are compiled explicitly and then applied to whatever data one wishes to analyze. There is slightly more work involved than with Perl's one-step matching, but the burden is not onerous.

The find() method, as you would expect, scans the target string (s, in this case) for the next substring that matches our pattern and returns true if it finds one. The group() method, perhaps less intuitively, returns a string containing the matched text. It should be noted that the code above is good for matching only one phone number per line, because it calls find() just once. The regex package is certainly capable of finding multiple matches for each line, but the intent here is to present a few simple examples to spur further study by the reader; this article only scratches the surface.
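For readers who want to go one step further, here is a rough sketch of how multiple matches per line might be collected. It relies on the fact that each call to find() resumes where the previous match ended. The class name AllPhoneNumbers and the helper findAll are hypothetical, and the code uses generics, so it assumes a JDK newer than 1.4.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AllPhoneNumbers {
    // The strict (nnn) nnn-nnnn pattern, compiled once and reused.
    private static final Pattern PHONE =
        Pattern.compile("\\(\\d{3}\\)\\s\\d{3}-\\d{4}");

    // Hypothetical helper: collects every match in a line, not just the first.
    static List<String> findAll(String line) {
        List<String> found = new ArrayList<String>();
        Matcher matcher = PHONE.matcher(line);
        // Each successful find() picks up after the end of the previous match.
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String line = "Office: (212) 555-1212, fax: (212) 555-3434";
        for (String number : findAll(line)) {
            System.out.println(number);
        }
    }
}
```

The only real change from the single-match version is that the if becomes a while.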

That was a pretty cool, yet unforgiving, phone-number matcher. Obviously, a couple of improvements could be made. For starters, there may not be a space between the area code and the exchange. We should be able to match both cases. By using \s? in our regex, we can. The ? quantifier simply means that the whitespace character is optional. The match is triggered either way.

Secondly, it is not uncommon to see a space separating the exchange from the last four digits, rather than a hyphen. In some cases, there is no separator at all, and seven digits are simply run together. We can account for all three variations by using (-| )?. This structure is known as an alternation. The regex will match one of multiple, pipe-delimited subpatterns enclosed in parentheses. In this case, it will match either a space or a hyphen. The trailing ? means that there may be no separator at all.

Finally, the area code may not be enclosed in parentheses. Simply appending the ? quantifier to each parenthesis, however, is not a good solution, as it would allow unmatched parentheses, as in "(555" or "555)". Instead, we can force our phone number to have either both parentheses or neither, by using another alternation:

(\(\d{3}\)|\d{3})

Our final pattern, which can be substituted in the code snippet above, is a much more useful phone-number matcher.

   Pattern pattern =
        Pattern.compile("(\\(\\d{3}\\)|\\d{3})\\s?\\d{3}(-| )?\\d{4}");

Granted, further improvements are possible, but you get the idea.
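As a quick sanity check, the sketch below runs a phone-number pattern of this shape against a handful of formats. It uses Matcher's matches() method, which succeeds only if the entire string fits the pattern, so it behaves like a validator rather than a search. The class name PhoneFormats, the helper isPhoneNumber and the sample strings are my own inventions for illustration.

```java
import java.util.regex.Pattern;

class PhoneFormats {
    // Area code with both parentheses or neither, optional whitespace,
    // then the exchange, an optional hyphen or space, and four digits.
    static final Pattern PHONE =
        Pattern.compile("(\\(\\d{3}\\)|\\d{3})\\s?\\d{3}(-| )?\\d{4}");

    // Hypothetical helper: true only if the whole string is a phone number.
    static boolean isPhoneNumber(String candidate) {
        return PHONE.matcher(candidate).matches();
    }

    public static void main(String[] args) {
        String[] samples = {
            "(212) 555-1212",   // original strict format
            "212 555 1212",     // spaces only
            "(212)5551212",     // no separators after the area code
            "(212 555-1212",    // unbalanced parenthesis
            "212-555-1212"      // hyphen after the area code
        };
        for (String sample : samples) {
            System.out.println(sample + " -> " + isPhoneNumber(sample));
        }
    }
}
```

The first three samples are accepted and the unbalanced one is rejected, as intended. The last sample, 212-555-1212, is also rejected, because the pattern allows only optional whitespace after the area code. Accepting a hyphen there is one of the further improvements left to the reader.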

To take a second example, adapted from Friedl's book, consider the problem of identifying doubled words in a text file. This is a common typographical mistake, and one that a spell checker doesn't always find.

Matching a word, like many other tasks, can be accomplished in a number of different ways using regular expressions. Perhaps the most straightforward, however, is \b\w+\b. This takes advantage of a few new regex constructs. The \w character sequence matches any character that can be thought of as a word character (typically an alphanumeric character or an underscore). The + quantifier tells us that we are looking to match at least one (but possibly more) of these characters in a row. The \b construct matches a word boundary. It consumes no characters itself; it simply matches the position between a word character and a non-word character, such as whitespace or any number of different punctuation marks (including commas, periods, etc.), or the start or end of the text.

Now, how do we identify when a given word is repeated back-to-back? To accomplish this, it is helpful to take advantage of a regex tool known as backreferencing. As I mentioned previously, parentheses perform a number of different tasks in the world of regular expressions. One of these is to provide capturing groups. Capturing groups save the results of a match or a partial match to be used later, even within the same pattern. It is possible (and often desirable) to have more than one capturing group in the same regex. The results of the match of the nth capturing group can be accessed via the backreference \n. Backreferences make it easy to search for duplicated words:

\b(\w+)\s+\1\b

The parentheses form a capturing group, the first (and only) in this expression. The backreference \1, then, refers to whatever word was matched by \w+. Our regular expression thus matches a word, followed by one or more whitespace characters, followed by a second instance of the word. The trailing word boundary sequence (\b) is necessary to prevent false, embedded matches. We want to match "Paris in the the spring", but not to match "Java's regex package is the theme of this article", for example. In its final Java-ready form, our regex pattern becomes:

   Pattern pattern =
        Pattern.compile("\\b(\\w+)\\s+\\1\\b");

As a final modification, we may wish to make our match case-insensitive. For example, we might want to be alerted to something like "The the theme of this article is Java's regex package." This can be accomplished quite simply by compiling the regex with one of the Pattern class's predefined static flags, CASE_INSENSITIVE:

   Pattern pattern =
        Pattern.compile("\\b(\\w+)\\s+\\1\\b",
                        Pattern.CASE_INSENSITIVE);
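To see the doubled-word pattern at work, here is a small sketch that reports each repeated word in a piece of text. The class name DoubledWords and the helper findDoubled are hypothetical, and the code assumes a JDK with generics. It uses group(1), which returns the text captured by the first capturing group, rather than group(), which would return the whole match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoubledWords {
    // A word, some whitespace, then the same word again, ignoring case.
    private static final Pattern DOUBLED =
        Pattern.compile("\\b(\\w+)\\s+\\1\\b", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns the first copy of each doubled word found.
    static List<String> findDoubled(String text) {
        List<String> words = new ArrayList<String>();
        Matcher matcher = DOUBLED.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group(1));   // contents of the capturing group
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(findDoubled("Paris in the the spring"));
        System.out.println(findDoubled("Java's regex package is the theme of this article"));
        System.out.println(findDoubled("The the theme of this article"));
    }
}
```

The first and third sentences each yield one doubled word, while the second yields none, because the trailing \b refuses to treat the start of "theme" as a second "the".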

The topic of regular expressions is rich and complex, and Java's implementation of it is comprehensive. A thorough study of the package would be a significant undertaking, and we have only presented the tip of the iceberg. If you are new to regular expressions, you will see how quickly you come to rely on their power and flexibility. If you are a seasoned regex hacker coming from Perl or some other language, you can now confidently dive head first into the Java world without having to give up a toolset you have come to view as a necessity.

Barry Johnson is a Carolina-born and bred software consultant currently working in New York City. A nuclear physicist in a former life, his hobbies include beer, roots music, Linux, beer, dogs and politics. And beer.

email: barry@vbsweb.com
