Removing Duplicate PATH Entries, Part II: the Rise of Perl

 

With apologies to Arnold and the Terminator franchise for the title, let's look one more time at removing duplicates from the PATH variable. This version was prompted by a comment on the previous post from a reader named Shaun, who asked "if you're willing to use a non-bash solution (AWK) to solve the problem, why not use Perl?" Shaun was kind enough to provide a Perl version of the code, which was good, since I'd have been hard-pressed to come up with one. It's a short piece of code, shorter than the AWK version, so it seemed like it ought to be fairly easy to pick apart. In the end, I'm not sure I'd call it easy, but it was interesting, and I thought other non-Perl programmers might find it interesting too.

 

First, let me repeat from the first post: there is no compelling reason to remove duplicates from your PATH variable; the shell will ignore the second and subsequent occurrences of a duplicate path.

The code, slightly modified, from the comment was:

PATH='/usr/bin:/bin::/usr/local/bin:/root/new folder:/bin:/usr/bin'
PATH=$(perl -E 'chomp($_=<>);say join":",grep{$_&&!$_{$_}++}split/:/' <<<$PATH)

To make that easier to read, I rearranged it a bit, added a touch of white space, and put it in a file:

chomp($_ = <>);
say join ":", grep { $_ && !$_{$_}++ } split /:/

Just to make sure nothing got messed up, I ran that, and of course, as is usual the first time you try something, it failed:

$ perl test.pl <<<'/usr/bin:/bin::/usr/local/bin:/root/new folder:/bin:/usr/bin'
syntax error at test.pl line 2, near "say join"
Execution of test.pl aborted due to compilation errors.

The culprit here is the -E option that was used in the original command, or rather the lack of it in this case. The -E option, in addition to passing the script as a string to the interpreter (as -e does), also enables some newer Perl features that aren't enabled by default. In this case, it's the say feature. To enable those newer features in a script file, you need to tell Perl to use the feature:

use feature qw(say);  # use "say" feature
chomp($_ = <>);
say join ":", grep { $_ && !$_{$_}++ } split /:/

After that, it runs as expected:

$ perl test2.pl <<<'/usr/bin:/bin::/usr/local/bin:/root/new folder:/bin:/usr/bin'
/usr/bin:/bin:/usr/local/bin:/root/new folder

So how does it work? In the broadest sense, it works pretty much the same way as the AWK version did: it splits the path on colons, uses an associative array (aka a hash in Perl) to determine whether a path element has been seen before, and then, once the final list of paths is known, joins them back together with colons. But let's step through the details a bit.

The first line (ignoring the use line) invokes the chomp function on $_=<>. Chomp is pretty straightforward; it removes a single trailing "record separator" from the end of each of its arguments. The default record separator is a newline, as it is in AWK. The argument to chomp is an assignment expression that reads one line from the null file handle <> and assigns it to Perl's "default input and pattern-searching space", which is the variable named $_. The null file handle reads, in this case, the standard input, which bash's <<< syntax fills with the value of the PATH variable. The default pattern space is equivalent to $0 in AWK: it is the data that functions operate on by default in the absence of an explicit variable/expression.

So the assignment expression $_=<> assigns the path to the pattern space, then the chomp function removes a record separator from the end of the pattern space. All that was a long-winded way of saying that the first line of code removes that pesky newline from the input. If you remember from the first post, the newline was eliminated by adding the -n option to echo.

Note: The Null File Handle

The full story of Perl's <> is a bit more involved. Things between "<" and ">" are file handles, so for example, <STDIN> refers to the standard input. But there are some magic file handles, such as <ARGV>, which refers to the input files to the script specified on the command line. The input files are read in order, line by line, as you read from <ARGV>. If no input files are specified on the command line, Perl sets $ARGV[0] to "-", which, when opened, gives you the standard input. The null file handle, <>, is the equivalent of <ARGV>.

The second part of the code is where the real work happens. Syntactically, what the code looks like is this:

say EXPR
   /    \
  /      \
join STR EXPR
        /    \
       /      \
    grep EXPR LIST
             /    \
            /      \
          split PATTERN

At the bottom of that diagram is a call to the split function:

split /:/

This should be reasonably obvious: this splits the pattern space on colons. In other words, it separates the path into individual path elements. It operates on the pattern space since no other specific data is provided. The split function produces a list for grep to work on. If you remember from the first post, the AWK version changed the record separator to a colon to split the path into individual elements; in this version, the split function is used to accomplish that.

So how come the chomp function used parentheses around its arguments and the split function doesn't? Actually, the parentheses after chomp aren't there because function calls require them; parentheses around the arguments of Perl's built-in functions are optional, so a call can be written simply as func args. The parentheses in the call to chomp are there to change the precedence of the terms in the expression. Without them, the call would be parsed as chomp($_)=<>;, which Perl rejects.

Next up in the diagram is grep. Note that this is not the command-line grep; this is the Perl grep function. Perl's grep function evaluates its EXPR for each item in the LIST, and when the expression is true, the current item from the list is added to a new list that will become the result of the grep function. In this case, the EXPR is actually a BLOCK (code in curly braces {}).

Ignoring the contents of the block for a moment, the rest of the code should also be fairly obvious: the grep function is going to produce the final list of paths, the join function is going to join them back together with colons, and the say function is going to output the final path value.

So the last part of the code to understand is the code in the block that the grep function evaluates for each path element:

$_ && !$_{$_}++

Here, as in many languages, && is the logical and operator, ! is the logical negation operator, and ++ is the post increment operator.

On the left side of the and operator is the simple expression $_, which tests to see if the pattern space is not blank. This is where empty path elements get eliminated: if the pattern space is empty, the left-hand side of the and operator is false and, therefore, the whole expression is false, and the current (empty) path element is not added to the list that the grep function is creating.

[edit: as pointed out in a comment below, this is incorrect: empty path elements should not be removed, they "mean" the current directory.  To fix it just remove the "$_ &&" from the expression. -- Mitch]

If you remember above, I said that the split function created a list that served as grep's data to process, so you might wonder how the pattern space, $_, got into the action. According to the Perl documentation, the grep function assigns each list item to a local copy of the $_ variable before evaluating its block/expression.

If you thought you were gonna get out of this unscathed, well this is where the going gets a bit tougher. On the right-hand side of the and expression is the hash (the associative array). In Perl, curly braces are also used to surround the key value when referencing an element in a hash, so {$_} is using the (local) pattern space, aka the current path element, as a key into a hash. Not too complex, but then you see what you would assume is the name of the hash: $_, but how can the pattern space variable also be the name of the hash? Well, it isn't, but once again, let's ignore that for the moment. To alleviate the cognitive dissonance that you're feeling, you can change the code and use a different name for the hash (and this actually works in the real code):

$_ && !$myhash{$_}++

My current working hypothesis is that using $_ as the hash name is the equivalent of a secret handshake down at the local lodge: club members get it, everybody else is left wondering. Just kidding, this actually was a good choice for the hash name, because it's the one piece that I spent the most time trying to grasp, and that led me down a number of useful paths. But before I get to some of that, let's finish with the code.

So the expression tests the hash element $myhash{$_} to see if it's zero. If the hash element is zero, the expression is true (due to the negation). Therefore, the current path element has not been seen before, and it is added to the list that grep is creating. If the hash element is non-zero, the path has already been seen, the expression is false, and the path element is not added to the list. Because ++ is a post-increment, the value that gets tested is the one from before the increment; the element is then incremented, so that the next time the path element is seen, the expression will be false, and the duplicate path element will be skipped. To recap, in pseudo-code, the block looks like this:

if path-element is not blank
    tmp = hash[path-element]
    hash[path-element] += 1
    if tmp == 0
        "BLOCK" is true, add path to output list
    else
        "BLOCK" is false, path already added to output

As in AWK, using a hash (aka an associative array) in Perl that does not exist yet automatically creates it. And performing arithmetic on a value that is undefined is the same as using zero. So the hash and all its elements are created as needed without any explicit code to create them.

Now let's get back to the name of that hash. The Perl documentation (perldata) states:

Perl has three built-in data types: scalars, arrays of scalars, and associative arrays of scalars, known as "hashes". A scalar is a single string (of any size, limited only by the available memory), number, or a reference to something [...]. Normal arrays are ordered lists of scalars indexed by number, starting with 0. Hashes are unordered collections of scalar values indexed by their associated string key.

In code, that looks like this:

$scalar_var = 1;                      # simple var
@list_var   = (1, 2);                 # a list
%hash_var   = (one => 1, two => 2);   # a hash

So you might think, "ok I get it, each data type uses a special character in its name". So you try this:

$v = 1;
@v = (1, 2);
%v = (one => 1, two => 2);

And sure enough, Perl likes that just fine. However, you don't actually have three variables named $v, @v and %v. What you have is three variables named v, whose names come from three different name spaces. Hopefully, some Perl language lawyer out there will point out the correct way to explain that, but for now that will work.

Just a bit more, and I can get back to the hash name conundrum of $_{$_}. Consider the following code that creates a scalar and a hash with the same name and then prints out the scalar and an element from the hash:

my $v = "scalar v";
my %v = ( a => "hash element with key a", b => "hash element with key b" );
print "v   : $v\n";
print "v{a}: $v{a}\n";

Running that, you get what you'd expect:

$ perl test.pl
v   : scalar v
v{a}: hash element with key a

But look closely at the last print statement. You were probably expecting to see %v{a} inside the string rather than $v{a}, since it's printing a value from a hash. So here is proof positive that the leading character $, @ or % is not part of the variable name. These leading characters are called "sigils" in Perl, which seems like an appropriate name given the definition of sigil:

A sign, word, or device held to have occult power in astrology or magic.

When talking about Perl, you will often see references to something being in scalar context or list context. For example, consider the following code that creates an array (a list stored in a variable) and then refers to it in scalar context:

@v = (4, 5, 6);
$u = scalar @v;
print "@v\n";
print "$u\n";

Running that gives:

$ perl test.pl
4 5 6
3

When you refer to an array in list context, you get its elements; when you refer to it in scalar context, you get its length. So in the example above where the print statement refers to $v{a}, the reference is to the hash element v{a}, and since the hash element is a single value, and since you want to print that single value, you need to refer to it in scalar context. So rather than using %v{a}, which would ask for a hash and not a scalar, you use $v{a} to get the scalar value. And actually, if you changed the script to use %v{a} in that string, you would not get a hash at all: Perl does not interpolate the % sigil inside double-quoted strings, so the text is printed literally:

$ perl test.pl
v   : scalar v
v{a}: %v{a}

Finally back to the original code:

!$_{$_}++

So here, the first use of $_ refers to a hash named _ and not to Perl's pattern space variable $_. However, as I already mentioned, the second use of $_ does refer to the pattern space; it is used as the key into the hash. After the hash element is found (or created), it is tested and negated in scalar context. And then finally, and also in scalar context, it is incremented.

That was a fairly long explanation for a pretty short piece of code, but hopefully you found it interesting. As long-winded as it was, if it didn't put you to sleep by now, that's probably good enough. As I mentioned up front, I'm not a Perl programmer, so I may have messed it up somewhere. Let me know where.

Mitch Frazier is an embedded systems programmer at Emerson Electric Co. Mitch has been a contributor to and a friend of Linux Journal since the early 2000s.
