Normalizing Filenames and Data with Bash

on October 30, 2018

URLify: convert letter sequences into safe URLs with hex equivalents.

This is my 155th column. That means I've been writing for Linux Journal for:


$ echo "155/12" | bc
12

No, wait, that's not right. Let's try that again:


$ echo "scale=2;155/12" | bc
12.91

Yeah, that many years. Almost 13 years of writing about shell scripts and lightweight programming within the Linux environment. I've covered a lot of ground, but I want to go back to something that's fairly basic and talk about filenames and the web.

It used to be that if you had filenames that had spaces in them, bad things would happen: "my mom's cookies.html" was a recipe for disaster, not good cookies—um, and not those sorts of web cookies either!

As the web evolved, however, encoding of special characters became the norm, and every web browser had to be able to manage it, for better or worse. So spaces became either "+" or "%20" sequences, and everything else that wasn't a regular alphanumeric character was replaced by its hex ASCII equivalent.

In other words, "my mom's cookies.html" turned into "my+mom%27s+cookies.html" or "my%20mom%27s%20cookies.html". Many symbols took on a second life too, so "&" and "=" and "?" all got their own meanings, which meant that they needed to be protected if they were part of an original filename too. And what about if you had a "%" in your original filename? Ah yes, the recursive nature of encoding things....

So purely as an exercise in scripting, let's write a script that converts any string you hand it into a "web-safe" sequence. Before starting, however, pull out a piece of paper and jot down how you'd solve it.

Normalizing Filenames for the Web

My strategy is going to be easy: pull the string apart into individual characters, analyze each character to identify if it's an alphanumeric, and if it's not, convert it into its hexadecimal ASCII equivalent, prefacing it with a "%" as needed.

There are a number of ways to break a string into its individual letters, but let's use Bash string variable manipulations, recalling that ${#var} returns the number of characters in variable $var, and that ${var:x:1} will return just the letter in $var at position x. Quick now, does indexing start at zero or one?
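
If you'd rather check the indexing question at the prompt than guess, here's a minimal sketch of both expansions. The variable name word and its value are just placeholders for illustration:

```bash
#!/usr/bin/env bash
# "word" is a hypothetical variable used only for this demonstration.
word="linux"

echo "length: ${#word}"      # number of characters in $word
echo "first:  ${word:0:1}"   # one character starting at position 0
echo "third:  ${word:2:1}"   # one character starting at position 2
```

The first character comes back from position 0, which is why a loop over the string should count from 0 up to, but not including, ${#word}.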

Here's my initial loop to break $input into its component letters:


input="$*"

echo "$input"

for (( counter=0 ; counter < ${#input} ; counter++ ))
do
   echo "counter = $counter -- ${input:$counter:1}"
done

Recall that $* is a shortcut for everything from the invoking command line other than the command name itself—a lazy way to let users quote the argument or not. It doesn't address special characters, but that's what quotes are for, right?
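
To see what "quote the argument or not" means in practice, here's a small sketch. The function join_args is a hypothetical stand-in for the top of the script, not part of the original code:

```bash
#!/usr/bin/env bash
# join_args is a hypothetical helper that mimics input="$*" in a script.
join_args() {
  local input="$*"   # all arguments, joined by single spaces
  echo "$input"
}

join_args "li nux?"   # one quoted argument
join_args li 'nux?'   # two separate arguments, same resulting string
```

Both calls should print the same string. The one thing "$*" can't recover is spacing the shell has already thrown away, such as two spaces between unquoted words.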

Let's give this fragmentary script a whirl with some input from the command line:


$ bash normalize.sh "li nux?"
li nux?
counter = 0 -- l
counter = 1 -- i
counter = 2 --
counter = 3 -- n
counter = 4 -- u
counter = 5 -- x
counter = 6 -- ?

There's obviously some debugging code in the script, but it's generally a good idea to leave that in until you're sure it's working as expected.

Now it's time to differentiate between characters that are acceptable within a URL and those that are not. Turning a character into a hex sequence is a bit tricky, so I'm using a sequence of fairly obscure commands. Let's start with just the command line:


$ echo '~' | xxd -ps -c1 | head -1
7e

Now, the question is whether "~" is actually the hex ASCII sequence 7e or not. A quick glance at http://www.asciitable.com confirms that, yes, 7e is indeed the ASCII code for the tilde. Prefix that with a percent sign, and the tough job of conversion is managed.
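
As an aside, the shell's own printf can produce the same value without xxd: when a numeric argument begins with a single quote, printf uses the code of the character that follows. This is a sketch for plain ASCII characters only, and char_to_hex is a hypothetical name:

```bash
#!/usr/bin/env bash
# char_to_hex is a hypothetical helper; ASCII input is assumed.
char_to_hex() {
  # The leading ' in "'$1" tells printf to use the character's numeric code;
  # %02X prints that code as two uppercase hex digits.
  printf '%02X\n' "'$1"
}

char_to_hex '~'
char_to_hex '?'
char_to_hex ' '
```

These three calls should print 7E, 3F and 20, which matches the ASCII table.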

But, how do you know what characters can be used as they are? Because of the weird way the ASCII table is organized, that's going to be three ranges: 0–9 is in one area of the table, then A–Z in a second area and a–z in a third. There's no way around it: that's three range tests.

There's a really cool way to do that in Bash too:


if [[ "$char" =~ [a-z] ]]

What's happening here is that =~ is Bash's regular expression match operator, and the bracket expression [a-z] is the pattern being tested. Since the action I want to take after each test is identical, it's easy now to implement all three tests:


if [[ "$char" =~ [a-z] ]]; then
  output="$output$char"
elif [[ "$char" =~ [A-Z] ]]; then
  output="$output$char"
elif [[ "$char" =~ [0-9] ]]; then
  output="$output$char"
else

The $output string variable is built up one character at a time until it holds the desired value.
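
Because all three branches do the same thing, the three ranges can also share a single bracket expression. Here's a sketch of that variation. The function name is_alnum is hypothetical, and the ^ and $ anchors are my addition so that exactly one character is tested:

```bash
#!/usr/bin/env bash
# is_alnum is a hypothetical helper, not part of the original script.
is_alnum() {
  # One bracket expression holds all three ASCII ranges.
  [[ "$1" =~ ^[a-zA-Z0-9]$ ]]
}

for char in l X 7 '?' ' '; do
  if is_alnum "$char"; then
    echo "'$char' is kept as-is"
  else
    echo "'$char' needs encoding"
  fi
done
```

The ranges are still three separate stretches of the ASCII table; they just live in one test instead of an if/elif chain.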

What's left? The hex output for anything that's not an otherwise acceptable character. And you've already seen how that can be implemented:


hexchar="$(echo "$char" | xxd -ps -c1 | head -1)"
output="$output%$hexchar"

A quick run through:


$ bash normalize.sh "li nux?"
li nux? translates to li%20nux%3f

See the problem? xxd emits lowercase hex digits, so the "?" comes out as %3f, and without converting the hex into uppercase, it's a bit weird looking. That's just another step in the subshell invocation:


hexchar="$(echo "$char" | xxd -ps -c1 | head -1 | \
   tr '[a-z]' '[A-Z]')"

And now, with that tweak, the output looks good:


$ bash normalize.sh "li nux?"
li nux? translates to li%20nux%3F
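
The fragments so far are never shown in one piece, so here's a sketch of how they might fit together. It's wrapped in a function with the hypothetical name normalize. To keep the example free of external commands, printf with a leading-quote argument stands in for the xxd pipeline; that swap is my assumption, and it only behaves for plain ASCII input:

```bash
#!/usr/bin/env bash
# normalize is a hypothetical name; this sketch assumes ASCII-only input.
normalize() {
  local input="$*" output="" char counter

  for (( counter=0 ; counter < ${#input} ; counter++ ))
  do
    char="${input:$counter:1}"
    if [[ "$char" =~ [a-zA-Z0-9] ]]; then
      output="$output$char"    # alphanumeric: copy unchanged
    else
      # "'$char" makes printf use the character's numeric code;
      # %%%02X prints a literal % plus two uppercase hex digits.
      output="$output$(printf '%%%02X' "'$char")"
    fi
  done

  echo "$input translates to $output"
}

normalize "li nux?"
```

Called with "li nux?", this should print the same line as the run above.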

What about a non-ASCII character like a u-umlaut or an n-tilde? Let's see what happens:


$ bash normalize.sh "Señor Günter"
Señor Günter translates to Se%C3B1or%200AG%C3BCnter

Ah, there's a bug in the script when it comes to these two-byte character sequences, because each byte of such a character needs its own percent-prefixed hex pair. In other words, it should be converted to Se%C3%B1or G%C3%BCnter (I restored the space to make it a bit easier to see what I'm talking about).

That is, the script gets the right byte sequences, but it's missing a percent sign: %C3B1 should be %C3%B1, and %C3BC should be %C3%BC.

Undoubtedly, the problem is in the hexchar assignment subshell statement:


hexchar="$(echo "$char" | xxd -ps -c1 | head -1 | \
   tr '[a-z]' '[A-Z]')"

Is it the -c1 argument to xxd? Maybe. I'm going to leave identifying and fixing the problem as an exercise for you, dear reader. And while you're fixing up the script to support two-byte characters, why not replace "%20" with "+" too?
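
If you'd like something to compare your own fix against, here's one possible approach, and not necessarily the one I'd land on. It forces the C locale so Bash walks the string byte by byte, then gives every non-alphanumeric byte its own percent sign. The name encode_bytes is hypothetical, and od stands in for xxd on the assumption that it's more widely installed:

```bash
#!/usr/bin/env bash
# encode_bytes is a hypothetical name. Assumes od and tr are available.
encode_bytes() {
  local LC_ALL=C   # in the C locale, ${#input} and ${input:i:1} work on bytes
  local input="$*" output="" byte hex i

  for (( i=0 ; i < ${#input} ; i++ ))
  do
    byte="${input:$i:1}"
    if [[ "$byte" =~ [a-zA-Z0-9] ]]; then
      output="$output$byte"    # alphanumeric: copy unchanged
    else
      # od -An -tx1 prints the byte as two hex digits with padding;
      # tr strips the padding, then uppercases the digits.
      hex="$(printf '%s' "$byte" | od -An -tx1 | tr -d ' \n' | tr 'a-f' 'A-F')"
      output="$output%$hex"
    fi
  done

  echo "$output"
}

encode_bytes "Señor Günter"
```

With UTF-8 input, each accented letter is two bytes, so each one should come out as two separate %XX sequences. Running a pipeline per byte is slow, but for filenames that hardly matters.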

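Finally, to make this maximally useful, don't forget that there are a number of symbols that are valid within URLs and don't always need to be converted, notably the set of "-_./!@#=&?". Be selective, though: "-", "_" and "." are always safe, while "/", "#", "=", "&" and "?" carry structural meaning in a URL, so leave them alone only when they're meant as URL syntax and not as part of a filename. Either way, you'll want to ensure that the ones you keep don't get hexified (is that a word?).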
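
One way to keep that list manageable is to hold the extra symbols in a single variable and test membership with a quoted pattern. This is only a sketch: extra_safe and is_passthrough are hypothetical names, and the default list is simply the set mentioned above:

```bash
#!/usr/bin/env bash
# extra_safe and is_passthrough are hypothetical names for this sketch.
extra_safe='-_./!@#=&?'   # symbols to leave alone; trim to taste

is_passthrough() {
  local char="$1"
  [[ -n "$char" ]] || return 1                 # empty string: nothing to keep
  [[ "$char" =~ ^[a-zA-Z0-9]$ ]] && return 0   # alphanumerics are always kept
  [[ "$extra_safe" == *"$char"* ]]             # quoted, so matched literally
}

for char in a / '?' ' ' %; do
  if is_passthrough "$char"; then
    echo "'$char' passes through"
  else
    echo "'$char' gets hexified"
  fi
done
```

Setting extra_safe='-._~' instead would restrict the pass-through list to the symbols that never carry special meaning in a URL.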

Dave Taylor has been hacking shell scripts on UNIX and Linux systems for a really long time. He's the author of Learning Unix for Mac OS X and Wicked Cool Shell Scripts. You can find him on Twitter as @DaveTaylor, and you can reach him through his tech Q&A site: Ask Dave Taylor.