Regular Expressions
To appreciate the power of regular expressions, let's look at a simple Perl script that helps system administrators look for authentication failures. For the following examples I used rather expressive regular expressions to show different features. You may write simpler ones to describe the same strings.
Each time someone fails to log in, syslogd writes messages to /var/log/messages that read like this:
Jul 26 16:35:25 myhost su(pam_unix)[2549]: authentication failure; logname=verdi uid=500 euid=0 tty= ruser=organtin rhost= user=root
Jul 27 14:54:36 myhost login(pam_unix)[688]: authentication failure; logname=LOGIN uid=0 euid=0 tty=tty1 ruser= rhost= user=mozart
These lines list the time at which the login attempt was made, the user who tried to log in as another user, if available, and the target user. For example, the user verdi tried to log in as root two times, while someone failed to log in as mozart from the console.
Consider the Perl script shown in Listing 1. It reads the /var/log/messages file, then identifies the lines that look interesting and extracts only the relevant information.
Listing 1. Sample Perl Script for Finding Authentication Errors
First of all, we select only the relevant lines by matching them against the regular expression /authentication failure/ shown on line 7. Everything else is discarded. Then each line is matched with a regular expression (line 8) that should be read as follows: take all the strings starting (^) with exactly three ({3}) alphabetic ([a-zA-Z]) characters, followed by a space, followed by one or more (+) characters that could be either numeric (0-9, equivalent in Perl to the metacharacter \d) or a space. After a space, an arbitrary number (*) of digits or colons must follow. The portion of the string described so far is enclosed in parentheses, so it is stored in a back reference called \1 (it is the first one). After that, any number of characters (.*) can be found before the string “logname=”. That string must be followed by any number of alphanumeric characters. Again, because there is a pair of parentheses, we store them in \2. Finally, any number of characters can be present before the string “user=”, followed by any number of alphanumeric characters. These are stored in \3.
From this example, you can see how it is possible to extract substrings from strings. You do not need to know their relative positions, as long as you can describe their appearance.
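The matching described above can also be tried outside Perl. What follows is a minimal sketch, not the script from Listing 1: it assumes the regular expression as reconstructed from the description (and from the variant quoted in the comments), uses sed -E instead of Perl, and introduces a hypothetical helper name, extract_failure, with a | separator chosen only for readability.

```shell
# Sample line copied from the log excerpt above.
line='Jul 26 16:35:25 myhost su(pam_unix)[2549]: authentication failure; logname=verdi uid=500 euid=0 tty= ruser=organtin rhost= user=root'

# extract_failure (hypothetical name) reads log lines on standard input,
# keeps only those reporting an authentication failure, and rewrites each
# one as "timestamp|logname|target user" using back references \1, \2, \3.
extract_failure() {
    grep 'authentication failure' |
    sed -E 's/^([a-zA-Z]{3} [ 0-9]+ [0-9:]*).*logname=([a-zA-Z0-9]*).*user=([a-zA-Z0-9]*)$/\1|\2|\3/'
}

printf '%s\n' "$line" | extract_failure
# prints: Jul 26 16:35:25|verdi|root
```

Note that [ 0-9]+ first grabs as much as it can and then gives characters back until the following space and time field can match, so the date and time still end up together in \1.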
Perl provides a helpful feature for working with regexps: it automatically defines variables named after the back references ($1, $2 and so on), which can be used after a regular expression has been matched. Perl also provides shorthand symbols, such as \d or \w (equivalent to [A-Za-z0-9_]), as well as POSIX-compliant symbols representing the same things (see man perlre for more information).
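The POSIX-style classes mentioned above are also understood by tools such as sed. The sketch below assumes sed -E and compares an explicit range with the equivalent [[:alnum:]] class; \d and \w are Perl notation and are not assumed to be available in sed. The variable names are illustrative only.

```shell
line='Jul 27 14:54:36 myhost login(pam_unix)[688]: authentication failure; logname=LOGIN uid=0 euid=0 tty=tty1 ruser= rhost= user=mozart'

# Explicit ranges, as in the description of the Perl regexp.
with_ranges=$(printf '%s\n' "$line" | sed -E 's/.*user=([a-zA-Z0-9]*)$/\1/')

# POSIX bracket class: [[:alnum:]] stands for letters and digits.
with_classes=$(printf '%s\n' "$line" | sed -E 's/.*user=([[:alnum:]]*)$/\1/')

printf '%s %s\n' "$with_ranges" "$with_classes"
# prints: mozart mozart
```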
Basic regular expressions are used by several other programs, like sed or grep; egrep, by contrast, uses the extended syntax.
In basic regular expressions, the metacharacters |, + and ? do not exist, and parentheses and braces need to be escaped to be interpreted as metacharacters. The ^, $ and * metacharacters follow more complicated rules (see man 7 regex for more details). In most cases, however, they behave like their extended counterparts. It is often convenient to express the regular expression in the extended format, then add the escape characters when needed.
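To make the difference concrete, here is a small sketch that writes the same pattern (three letters followed by a space) in both syntaxes. It assumes a sed that accepts -E for extended regular expressions; the variable names are illustrative only.

```shell
date_line='Jul 26 16:35:25'

# Basic syntax: parentheses and braces are escaped to act as metacharacters.
basic=$(printf '%s\n' "$date_line" | sed 's/^\([a-zA-Z]\{3\}\) .*/\1/')

# Extended syntax (-E): the same metacharacters are written unescaped.
extended=$(printf '%s\n' "$date_line" | sed -E 's/^([a-zA-Z]{3}) .*/\1/')

# In basic syntax an unescaped brace is an ordinary character, so this
# pattern matches the literal text a{3} and grep counts one matching line.
literal=$(printf 'a{3}\n' | grep -c 'a{3}')

printf '%s %s %s\n' "$basic" "$extended" "$literal"
# prints: Jul Jul 1
```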
As an example, the script shown in Listing 2 generates an HTML-formatted page to read the content of system log files using an internet browser. Besides echoing HTML tags for the headers of the page and for a table, it simply lists files in a given directory and pipes the result to sed, which transforms it using a regexp. The syntax used by sed for text substitution is rather common and is something like:
s/regexp/replacement/
where regexp is a regular expression describing the text to be replaced, and replacement is the text that takes its place.
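As a quick illustration of the substitution syntax, the sketch below masks numeric values in a fragment of the log line shown earlier. The placeholder N and the variable names are assumptions made for the example.

```shell
entry='logname=verdi uid=500 euid=0'

# Without flags, only the first match on each line is replaced.
first_only=$(printf '%s\n' "$entry" | sed 's/[0-9][0-9]*/N/')

# With the g flag, every match on the line is replaced.
all=$(printf '%s\n' "$entry" | sed 's/[0-9][0-9]*/N/g')

printf '%s\n%s\n' "$first_only" "$all"
# prints: logname=verdi uid=N euid=0
#         logname=verdi uid=N euid=N
```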
Listing 2. Example Script for Generating an HTML-Formatted Page for Reading Log Files
Essentially, the regexp describes a string composed of nine elements, each matched by an appropriate regular expression. For example, [rwxds-] lists the characters that can be found within the first element.
The latter part of the string consists of alphanumeric characters, with slashes interspersed. You may notice that the regular expression used in this case is (.*\/)(.*). The first group matches everything up to and including the last (escaped) slash, i.e., the path name. The second group matches all the following characters (the filename). The number of slashes in the path doesn't matter. Regular expressions (both basic and extended) are, in fact, said to be greedy: they try to match as many characters as possible, so the first .* extends all the way to the last slash in the string.
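The effect of greedy matching on (.*\/)(.*) can be checked with a short sketch. The path below is an illustrative assumption, as is the HTML link format; neither is taken from Listing 2.

```shell
# Illustrative path, not taken from the original script.
path='/var/log/httpd/access_log'

# The first .* is greedy, so \1 runs up to and including the last slash.
dir=$(printf '%s\n' "$path" | sed -E 's/(.*\/)(.*)/\1/')
file=$(printf '%s\n' "$path" | sed -E 's/(.*\/)(.*)/\2/')

# Both groups can be reused in the replacement, for example to build a link.
link=$(printf '%s\n' "$path" | sed -E 's/(.*\/)(.*)/<a href="\1\2">\2<\/a>/')

printf '%s\n%s\n%s\n' "$dir" "$file" "$link"
# prints: /var/log/httpd/
#         access_log
#         <a href="/var/log/httpd/access_log">access_log</a>
```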
The result of the script is written to standard output and can be redirected to a given file (by cron at fixed intervals, for example) to be shown on the Web.
Comments
Re: Regular Expressions
"...followed by a space, followed by at most two (+) characters that could be either numeric..."
Is this a mistake? I can't see how that regexp isolates 2 characters (day of the month) without matching the space and hour as well. Surely you need something like this?
$line =~ /^([a-zA-Z]{3} [ 0-9]{2} [0-9:]*).*logname=([a-zA-Z0-9]*).*user=([a-zA-Z0-9]*)$/;
Otherwise you'll chew up further spaces and digits until you hit the first ':'.