Fun with Mail Merge and Cool Bash Arrays

Creating a sed-based file substitution tool.

A few weeks ago, I was digging through my spam folder and found an email message that started out like this:


Dear #name#
Congratulations on winning the $15.7 million lottery payout!
To learn how to claim your winnings, please...

Obviously, it was a scam (does anyone actually fall for these?), but what captured my attention was the #name# sequence. Clearly that was a fail on the part of the sender who presumably didn't know how to use AnnoyingSpamTool 1.3 or whatever the heck he or she was using.

The more general notation for bulk email and file transformations is pretty interesting, however. There are plenty of legitimate reasons to use this sort of substitution, ranging from email newsletters (like the one I send every week from AskDaveTaylor.com—check it out!) to stockholder announcements and much more.

With that as the inspiration, let's build a tool that offers just this capability.

The simple version will be a 1:1 substitution, so #name# becomes, say, "Rick Deckard", while #first# might be "Rick" and #last# might be "Deckard". Let's build on that, but let's start small.

Simple Word Substitution in Linux

There are plenty of ways to tackle the word substitution from the command line, ranging from Perl to awk, but here I'm using the original UNIX command sed (stream editor) designed for exactly this purpose. General notation for a substitution is s/old/new/, and if you tack on a g at the end, it matches every occurrence on a line, not only the first, so the full command is s/old/new/g.
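To see the difference the trailing g makes, here's a quick sketch you can paste into a terminal. The sample line and the variable names first_only and every_match are made up for illustration:

```shell
#!/usr/bin/env bash
# Illustrative only: the sample text and variable names are not from the script.
line="#first#, meet #first#"

# Without the trailing g, only the first match on each line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/#first#/Rick/')

# With g, every match on the line is replaced.
every_match=$(printf '%s\n' "$line" | sed 's/#first#/Rick/g')

echo "$first_only"
echo "$every_match"
```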

Before going further, here's a simple document that has necessary substitutions embedded:


$ cat convertme.txt
#date#

Dear #name#, I wanted to start by again thanking you for your
generous donation of #amount# in #month#. We couldn't do our
work without support from humans like you, #first#.

This year we're looking at some unexpected expenses,
particularly in Sector 5, which encompasses #state#, as you
know. I'm hoping you can start the year with an additional
contribution? Even #suggested# would be tremendously helpful.

Thanks for your ongoing support. With regards,

Rick Deckard
Society for the Prevention of Cruelty to Replicants

Scan through it, and you'll see there are a lot of substitutions to do: #date#, #name#, #amount#, #month#, #first#, #state# and #suggested#. It turns out that #date# will be replaced with the current date, and #suggested# is one that'll be calculated as the letter is processed, but that's for a bit later, so stay tuned for that.

To make life easy, a source file that's a comma-separated list allows for easy interaction with a source spreadsheet, so a sample input data file might look like this:


name,first,amount,month,state
Eldon Tyrell,Eldon,500,July,California

At its most basic, the first line defines variable names (without the # notation), and subsequent lines are a set of values for a particular donor or recipient. To start, let's read in the variable names:


while IFS=',' read -r f1 f2 f3 f4 f5 f6 f7
do
  declare -a varname=($f1 $f2 $f3 $f4 $f5 $f6 $f7)
done

Key to understanding this is to know about IFS, the internal field separator. Normally, it's white space, which is why, for example, ls my file name looks for three files called my, file and name. But you can change it, as I demonstrate by changing IFS to a comma.
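Here's a small sketch of that IFS trick in isolation, assuming a single record held in a variable rather than a file. Putting the assignment on the same line as read changes IFS for that one command only, so the rest of the script keeps the default:

```shell
#!/usr/bin/env bash
# Illustrative record; the field variable names here are assumptions.
record="Eldon Tyrell,Eldon,500,July,California"

# IFS=',' applies only to this read, so spaces inside a field survive.
IFS=',' read -r name first amount month state <<< "$record"

echo "name=$name state=$state"
```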

Those Cool Bash Arrays

I declare an array called varname that receives each of the fields read into the script. There are only five fields in use at this point, but let's support up to seven to make the resultant script a bit more flexible.

Arrays are really cool in Bash actually, but the notation is a smidge funky. That is, you can't just use $array[index], because it won't be parsed correctly, so curly braces are a necessary addition:


echo ${varname[1]}

That works just fine.
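If you're curious what actually goes wrong without the braces, this sketch shows it. Bash array indices start at 0, so index 1 is the second element. The variable names without_braces, with_braces and count are just for illustration:

```shell
#!/usr/bin/env bash
declare -a varname=(name first amount month state)

# Without braces, Bash expands $varname (element 0) and leaves "[1]" as text.
without_braces="$varname[1]"

# With braces, you get the element at index 1.
with_braces="${varname[1]}"

# ${#array[@]} is the number of elements in the array.
count="${#varname[@]}"

echo "$without_braces"
echo "$with_braces"
echo "$count"
```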

For a basic algorithm, you're going to have two parallel arrays (parallel in that their indices will match up): one that retains all the variable names, and the other that contains the values for this instance of the data entry list.

This means you'll need to differentiate between the situation when the script is reading the first line and when subsequent lines of the data file are read. Easily done:


(( lines++ ))

if [ $lines -eq 1 ] ; then   # field names
  # variable names
  declare -a varname=($f1 $f2 $f3 $f4 $f5 $f6 $f7)
else
  # values for this line (can contain spaces)
  declare -a value=("$f1" "$f2" "$f3" "$f4" "$f5"
     "$f6" "$f7")
fi

Like most code, this makes a few assumptions, but they're safe ones: variable names aren't quoted because they're always a single word, while values might contain spaces, so they're quoted in the declare statement. The rest should be easy to follow, and the (( lines++ )) notation should make you cheer. It's a nice Bash shortcut for incrementing a counter!
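Put together with the read loop, the header-versus-data logic might look like the sketch below. The datafile variable and the temporary file are assumptions added so the example runs on its own, since the fragments above don't show where the input comes from. It also uses lines=$(( lines + 1 )) rather than (( lines++ )), because the latter returns a non-zero status when the counter starts at zero, which matters if you ever run with set -e:

```shell
#!/usr/bin/env bash
# datafile is a hypothetical name; a temp file stands in for the real data.
datafile=$(mktemp)
printf '%s\n' 'name,first,amount,month,state' \
              'Eldon Tyrell,Eldon,500,July,California' > "$datafile"

lines=0
while IFS=',' read -r f1 f2 f3 f4 f5 f6 f7
do
  lines=$(( lines + 1 ))
  if [ "$lines" -eq 1 ] ; then
    # Header line: unquoted, so unused (empty) fields simply vanish.
    varname=($f1 $f2 $f3 $f4 $f5 $f6 $f7)
  else
    # Data line: quoted, so values keep their spaces (and empty slots).
    value=("$f1" "$f2" "$f3" "$f4" "$f5" "$f6" "$f7")
  fi
done < "$datafile"
rm -f "$datafile"

echo "${#varname[@]} names, ${#value[@]} values, first value: ${value[0]}"
```

Notice that value always has seven slots, some of them empty, which is why any loop over it needs to skip empty values.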

Once you're past the very first line, the script can look in varname[x] for the xth variable name, and value[x] for the value of that named variable, expressed as a series of sed-friendly substitution commands:


for ((i=0; i<${#value[*]}; i++))
do
  if [ ! -z "${value[$i]}" ] ; then
    echo "s/#${varname[$i]}#/${value[$i]}/g"
  fi
done

Which produces this:


s/#name#/Eldon Tyrell/g
s/#first#/Eldon/g
s/#amount#/500/g
s/#month#/July/g
s/#state#/California/g

That's pretty darn close to what you want actually. Let's push forward.
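One caveat before moving on: a value that contains a slash, an ampersand or a backslash will confuse sed, because those characters are special on the replacement side of s/old/new/. The script here doesn't handle that, but a minimal sketch of one way to do it follows. The escape_replacement function is a hypothetical helper, not part of the script in this article:

```shell
#!/usr/bin/env bash
# escape_replacement is a hypothetical helper, not part of the article's script.
# It puts a backslash in front of \, / and & so sed treats them literally.
# The | delimiter is used so the / inside the bracket list needs no escaping.
escape_replacement() {
  printf '%s' "$1" | sed 's|[\\/&]|\\&|g'
}

raw='AT&T 50/50'
safe=$(escape_replacement "$raw")
result=$(printf '%s\n' 'Hello #name#' | sed "s/#name#/$safe/g")

echo "$safe"
echo "$result"
```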

Working with sed

The stream editor sed is far more powerful than its modest and ancient history might suggest. It's perfect for this job, as shown above.

You could write the above lines into a temp file and invoke sed directly, but let's avoid the file I/O and turn it all into a command-line argument as necessary. That's done by simply separating each command with a semicolon, which you can do by building it in a temp variable instead:


for ((i=0; i<${#value[*]}; i++))
do
  if [ ! -z "${value[$i]}" ] ; then
    if [ -z "$SUBS" ] ; then
      SUBS="s/#${varname[$i]}#/${value[$i]}/g"
    else
      SUBS="$SUBS;s/#${varname[$i]}#/${value[$i]}/g"
    fi
  fi
done

There's undoubtedly a way to avoid the innermost if-then-else statement to omit the unnecessary ; prefix, but sometimes it's easier to have a few lines of code than yet more gobbledygook.
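For the curious, one such way is the ${var:+text} parameter expansion, which produces text only when var is set and non-empty. Here's a sketch with sample arrays filled in by hand, including an empty value to show that it's skipped. Whether it counts as gobbledygook is your call:

```shell
#!/usr/bin/env bash
# Sample arrays filled in by hand for illustration.
varname=(name first amount)
value=("Eldon Tyrell" "Eldon" "")

SUBS=""
for ((i=0; i<${#value[@]}; i++))
do
  if [ -n "${value[$i]}" ] ; then
    # ${SUBS:+$SUBS;} expands to "$SUBS;" only when SUBS is non-empty,
    # so the first command gets no leading semicolon.
    SUBS="${SUBS:+$SUBS;}s/#${varname[$i]}#/${value[$i]}/g"
  fi
done

echo "$SUBS"
```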

Otherwise, the above is a simple extension of the previous for loop. This time, it builds the entire sed command in a variable called SUBS. Here's how to test:


echo "   sed \"$SUBS\" $inputfile"

When you run this with the input data file, here's what's pushed out to the terminal:


sed "s/#name#/Eldon Tyrell/g;s/#first#/Eldon/g;
    s/#amount#/500/g;s/#month#/July/g;
    s/#state#/California/g" convertme.txt
sed "s/#name#/Rachel/g;s/#first#/Rachel/g;
    s/#amount#/100/g;s/#month#/March/g;
    s/#state#/New York/g" convertme.txt

(Note: line breaks added for formatting purposes only.)

It's actually a very small step from here to invoke the command, so let's do that:


$ sub.sh
#date#

Dear Eldon Tyrell, I wanted to start by again thanking you
for your generous donation of 500 in July. We couldn't do
our work without support from humans like you, Eldon.

This year we're looking at some unexpected expenses,
particularly in Sector 5, which encompasses California, as
you know. I'm hoping you can start the year with an
additional contribution? Even #suggested# would be
tremendously helpful.

Thanks for your ongoing support. With regards,

Rick Deckard
Society for the Prevention of Cruelty to Replicants
$

Generally, this looks good. #date# and #suggested# are still untranslated, but that's as expected. What is a bit odd is that it didn't get the second entry too. A bug.
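The full sub.sh isn't shown here, so the cause of the missing second letter can't be pinned down from the text. As a point of comparison, here's a self-contained sketch of the whole flow that does produce one letter per data line. The file names, the one-line template and the letters variable are stand-ins created so the example runs by itself, and the Rachel record is taken from the test output above. It starts SUBS fresh for every record and adds || [ -n "$f1" ] to the loop condition so that a final line with no trailing newline still gets processed. Whether either of those relates to the actual bug is purely an assumption:

```shell
#!/usr/bin/env bash
# Hypothetical stand-ins for the article's files, created so the sketch runs.
workdir=$(mktemp -d)
inputfile="$workdir/convertme.txt"
datafile="$workdir/data.csv"
printf '%s\n' 'Dear #name#, thanks for #amount# in #month#.' > "$inputfile"
# Deliberately no newline after the last record.
printf '%s\n%s\n%s' 'name,first,amount,month,state' \
  'Eldon Tyrell,Eldon,500,July,California' \
  'Rachel,Rachel,100,March,New York' > "$datafile"

lines=0
letters=""
# The "|| [ -n ... ]" keeps a final line that lacks a trailing newline.
while IFS=',' read -r f1 f2 f3 f4 f5 f6 f7 || [ -n "$f1" ]
do
  lines=$(( lines + 1 ))
  if [ "$lines" -eq 1 ] ; then
    varname=($f1 $f2 $f3 $f4 $f5 $f6 $f7)   # header: variable names
    continue
  fi
  value=("$f1" "$f2" "$f3" "$f4" "$f5" "$f6" "$f7")
  SUBS=""                                   # start fresh for every record
  for ((i=0; i<${#value[@]}; i++))
  do
    if [ -n "${value[$i]}" ] ; then
      SUBS="${SUBS:+$SUBS;}s/#${varname[$i]}#/${value[$i]}/g"
    fi
  done
  # sed reads the template by name, so it never touches the loop's stdin.
  letters+="$(sed "$SUBS" "$inputfile")"$'\n'
done < "$datafile"
rm -rf "$workdir"

printf '%s' "$letters"
```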

I'm going to stop here, however, and maybe next time, I'll add some system substitutions like #date# and figure out how to calculate #suggested#, which can be 50% of the actual donation. See you soon!

Dave Taylor has been hacking shell scripts on UNIX and Linux systems for a really long time. He's the author of Learning Unix for Mac OS X and Wicked Cool Shell Scripts. You can find him on Twitter as @DaveTaylor, and you can reach him through his tech Q&A site: Ask Dave Taylor.
