Python Scripts as a Replacement for Bash Utility Scripts

on January 16, 2013

For Linux users, the command line is a celebrated part of our entire experience. Unlike other popular operating systems, where the command line is a scary proposition for all but the most experienced veterans, in the Linux community, command-line use is encouraged. Often the command line can provide a more elegant and efficient solution when compared to doing a similar task with a graphical user interface.

As the Linux community has grown up with a dependence on the command line, UNIX shells, such as bash and zsh, have grown into extremely formidable tools that complement the UNIX shell experience. With bash and other similar shells, a number of powerful features are available, such as piping, filename wild-carding and the ability to read commands from a file called a script.

Let's look at a real-world example to demonstrate the power of the command line. Every time users log in to a service, their user names are logged to a text file. For this example, let's find out how many unique users use the service.

The series of commands in the following example show the power of more complex utilities by chaining together smaller building blocks:


$ cat names.log | sort | uniq | wc -l

The pipe symbol (|) is used to pass the standard output of one command into the standard input of the next command. In the example here, the output of cat names.txt is passed into the sort command. The output of the sort command is each line of the file rearranged in alphabetical order. This subsequently is piped into the uniq command, which removes any duplicate names. Finally, the output of uniq is passed to the wc command. wc is a counting command, and with the -l flag set, it returns the number of lines. This allows you to chain a number of commands together.

However, sometimes what is needed can become quite complex, and chaining commands together can become unwieldy. In that case, shell scripts are the answer. A shell script is a list of commands that are read by the shell and executed in order. Shell scripts also support some programming language fundamentals, such as variables, flow control and data structures. Shell scripts can be very useful for batch jobs that will be run often and repeatedly. Unfortunately, shell scripts come with some disadvantages:

Shell scripts easily can become overly complicated and unreadable to a developer wanting to improve or maintain them.
Often the syntax and interpreter for these shell scripts can be awkward and unintuitive. The more awkward the syntax, the less readable it is for the developer who must work with these scripts.
The code is generally unusable in other scripts. Code reuse among scripts tends to be difficult, and scripts tend to be very specific to a certain problem.
Libraries for advanced features, such as HTML parsing or HTTP requests, are not as easily available as they are with modern programming and scripting languages.

These problems can make shell scripting an awkward undertaking and often can lead to a lot of wasted developer time. Instead, the Python programming language can be used as a very able replacement. There are many benefits to using Python as a replacement for shell scripts:

Python is installed by default on all the major Linux distributions. Opening a command line and typing python immediately will drop you into a Python interpreter. This ubiquity makes it a sensible choice for most scripting tasks.
Python has a very easy to read and understand syntax. Its style emphasizes minimalism and clean code while allowing the developer to write in a bare-bones style that suits shell scripting.
Python is an interpreted language, meaning there is no compile stage. This makes Python an ideal language for scripting. Python also comes with a Read Eval Print Loop, which allows you to try out new code quickly in an interpreted way. This lets the developer tinker with ideas without having to write the full program out into a file.
Python is a fully featured programming language. Code reuse is simple, because Python modules easily can be imported and used in any Python script. Scripts easily can be extended or built upon.
Python has access to an excellent standard library and thousands of third-party libraries for all sorts of advanced utilities, such as parsers and request libraries. For instance, Python's standard library includes datetime libraries that allow you to parse dates into any format that you specify and compare it to other dates easily.
Python can be a simple link in the chain. Python should not replace all the bash commands. It is as powerful to write Python programs that behave in a UNIX fashion (that is, read in standard input and write to standard output) as it is to write Python replacements for existing shell commands, such as cat and sort.

Let's build on the problem that was solved earlier in this article. Besides the work already done, let's find out know how many times a certain user has logged in to the system. The uniq command simply removes duplicates but gives no information on how many duplicates there are. Instead of uniq, a Python script can be used as another command in the chain. Here's a Python program to do this (in my examples, I refer to this file as namescount.py):


#!/usr/bin/env python
import sys

if __name__ == "__main__":
    # Initialize a names dictionary as empty to start with.
    # Each key in this dictionary will be a name and the value
    # will be the number of times that name appears.
    names = {}
    # sys.stdin is a file object. All the same functions that
    # can be applied to a file object can be applied to sys.stdin.
    for name in sys.stdin.readlines():
            # Each line will have a newline on the end
            # that should be removed.
            name = name.strip()
            if name in names:
                    names[name] += 1
            else:
                    names[name] = 1

    # Iterating over the dictionary,
    # print name followed by a space followed by the
    # number of times it appeared.
    for name, count in names.iteritems():
            sys.stdout.write("%d\t%s\n" % (count, name))

Let's look at how this Python script fits into the chain of commands. First, it reads in input from standard input exposed through the sys.stdin object. Any output is written to the sys.stdout object, which is how standard output is implemented in Python. A Python dictionary (often called a hash map in other languages) is used to get a mapping from the user name to the duplicate count. To get a count of all the users, execute the following:


$ cat names.log | python namescount.py

This displays a count of how many times a user appears along with the user's name using a tab as a separator. The next thing to do is display, in order, the users who used the system most often. This can be done at the Python level, but let's implement it using the utilities that are already provided by the core UNIX utilities. Previously, I used the sort command to sort alphabetically. If the command is provided with a -rn flag, it sorts the lines numerically, in descending order. As the Python script prints to standard out, you simply can pipe the command into sort and retrieve the output you want:


$ cat names.log | python namescount.py | sort -rn

This is an example of the power of using Python as part of a chain of commands. The advantages of using Python in this scenario are as follows:

The ability to chain with tools like cat and sort. Simple utilities (reading a file line by line and sorting a file numerically) are handled by tried-and-trusted UNIX commands. These commands also are reading line by line, which means these functions can scale to files that are large in size, and they are very quick.
When some heavy-lifting is needed in the chain, a very clear, concise Python script can be written, which does what it needs to do and then offloads the responsibility to the next link in the chain.
It is a reusable module, although this example is specifically about names, if you feed this any input that contains duplicate lines, it will print out each line and the number of duplicates. Making the Python code modular allows you to apply it in a range of scenarios.

To demonstrate the power of combining Python scripts in a modular and piped fashion, let's expand further on the problem space. Let's find the top five users of the service. head is a command that allows you to specify a certain number of lines to display of the standard input it is given. Adding this to the command chain gives the following:


$ cat names.log | python namescount.py | sort -rn | head -n 5

This prints only the top five users and ignores the rest. Similarly, to get the five users who use the service least, you can use the tail command, which takes the same arguments. The result of the Python command being printed to standard output allows you to build and extend upon its functionality.

To demonstrate the modularity of this script, let's once again change the problem space. The service also generates a comma-separated value (CSV) log file that contains a list of e-mail addresses and the comments that each e-mail address made about the service. Here's an example entry:


"email@example.com", "This service is great."

The task is to provide a way for the service to send a thank-you message to the top ten users in terms of comment frequency. First, you need a script that can read and print a certain column of CSV data. The standard library of Python provides a CSV reader. The Python script below completes this goal:


#!/usr/bin/env python
# CSV module that comes with the Python standard library
import csv
import sys


if __name__ == "__main__":
    # The CSV module exposes a reader object that takes
    # a file object to read. In this example, sys.stdin.
    csvfile = csv.reader(sys.stdin)

    # The script should take one argument that is a column number.
    # Command-line arguments are accessed via sys.argv list.
    column_number = 0
    if len(sys.argv) > 1:
            column_number = int(sys.argv[1])

    # Each row in the CSV file is a list with each 
    # comma-separated value for that line.
    for row in csvfile:
            print row[column_number]

This script can parse the CSV data and return in plain text the column that is supplied as a command-line argument. It uses print instead of sys.stdout.write, as print, by default, uses standard out as its output file.

Let's add this script to the chain. The new script is chained with the others to print out a list of e-mail addresses and their comment frequencies using the command listed below (the .csv log file is assumed to be called emailcomments.csv and the new Python script, csvcolumn.py):


$ cat emailcomments.csv | python csvcolumn.py | 
 ↪python namescount.py | sort -rn | head -n 5

Next, you need a way to send an e-mail. In the Python standard library of functions, you can import smtplib, which is a module that allows you to connect to an SMTP server to send mail. Let's write a simple Python script that uses this library to send a message to each of the top ten e-mail addresses found already:


#!/usr/bin/env python
import smtplib
import sys


GMAIL_SMTP_SERVER = "smtp.gmail.com"
GMAIL_SMTP_PORT = 587

GMAIL_EMAIL = "Your Gmail Email Goes Here"
GMAIL_PASSWORD = "Your Gmail Password Goes Here"


def initialize_smtp_server():
    '''
    This function initializes and greets the smtp server.
    It logs in using the provided credentials and returns 
    the smtp server object as a result.
    '''
    smtpserver = smtplib.SMTP(GMAIL_SMTP_SERVER, GMAIL_SMTP_PORT)
    smtpserver.ehlo()
    smtpserver.starttls()
    smtpserver.ehlo()
    smtpserver.login(GMAIL_EMAIL, GMAIL_PASSWORD)
    return smtpserver


def send_thank_you_mail(email):
    to_email = email
    from_email = GMAIL_EMAIL
    subj = "Thanks for being an active commenter"
    # The header consists of the To and From and Subject lines
    # separated using a newline character
    header = "To:%s\nFrom:%s\nSubject:%s \n" % (to_email,
            from_email, subj)
    # Hard-coded templates are not best practice.
    msg_body = """
    Hi %s,

    Thank you very much for your repeated comments on our service.
    The interaction is much appreciated.

    Thank You.""" % email
    content = header + "\n" + msg_body
    smtpserver = initialize_smtp_server()
    smtpserver.sendmail(from_email, to_email, content)
    smtpserver.close()


if __name__ == "__main__":
    # for every line of input.
    for email in sys.stdin.readlines():
            send_thank_you_mail(email)

This Python script supports contacting any SMTP server, whether local or remote. For ease of use, I have included Gmail's SMTP server, and it should work, provided you give the scripts the correct Gmail credentials. The script uses the functions provided to send mail in smtplib. This again demonstrates the power of using Python at this level. Something like SMTP interaction is easy and readable in Python. Equivalent shell scripts are messy, and such libraries are not as easily accessible, if they exist at all.

In order to send the e-mails to the top ten users sorted by comment frequency, first you must isolate only the e-mail column of the output of column names. To isolate a certain column in Linux, you use the cut command. In the example below, the commands are given in two separate chains. For ease of use, I wrote the output into a temporary file, which can be loaded into the second chain. This simply makes the process more readable (the Python script for sending mail is referred to as sendemail.py):


$ cat emailcomments.csv | python csvcolumn.py | 
 ↪python namescount.py | sort -rn > /tmp/comment_freq
$ cat /tmp/comment_freq | head -n 10 | cut -f2 | 
 ↪python sendemail.py

This shows the real power of Python as a utility in a chain of bash commands such as this. Writing scripts that accept input from standard input and write any data out to standard out, allows the developer to chain commands such as these together quickly and easily with a link in the chain often being a Python program. This philosophy of designing a small application that services one purpose fits nicely with the flow of commands being used here.

Often in Python scripts that are used on the command line, arguments are used to give users options when they run a certain command. For instance, the head command takes a -n argument that takes the number following it and prints only that number of lines. Each argument that is provided to a Python script is exposed through the sys.argv array, which can be accessed by first importing sys. The code below shows how to take single words as arguments. This program is a simple adder, which takes two number arguments and adds them, and prints that out to the user. However, this format of taking in command-line arguments is rather basic. It is easy to make mistakes—for instance, pass two strings, such as hello and world, to this command, and you will start to get errors:


#!/usr/bin/env python
import sys

if __name__ == "__main__":
    # The first argument of sys.argv is always the filename,
    # meaning that the length of system arguments will be
    # more than one, when command-line arguments exist.
    if len(sys.argv) > 2:
            num1 = long(sys.argv[1])
            num2 = long(sys.argv[2])
    else:
            print "This command takes two arguments and adds them"
            print "Less than two arguments given."
            sys.exit(1)
    print "%s" % str(num1 + num2)

Thankfully, Python has a number of modules to deal with command-line arguments. My personal favorite is OptionParser. OptionParser is part of the optparse module that is provided by the standard library. OptionParser allows you to do a range of very useful things with command-line arguments:

Specify a default if a certain argument is not provided.
It supports both argument flags (either present or not) and arguments with values (-n 10000).
It supports different formats of passing arguments—for example, the difference between -n=100000 and -n 100000.

Let's use the OptionParser to enhance the sending-mail script. The original script had a lot of variables hard-coded into place, such as the SMTP details and the users' login credentials. In the code provided below, command-line arguments are used to pass in these variables:


#!/usr/bin/env python
import smtplib
import sys

from optparse import OptionParser

def initialize_smtp_server(smtpserver, smtpport, email, pwd):
    '''
    This function initializes and greets the SMTP server.
    It logs in using the provided credentials and returns the
    SMTP server object as a result.
    '''
    smtpserver = smtplib.SMTP(smtpserver, smtpport)
    smtpserver.ehlo()
    smtpserver.starttls()
    smtpserver.ehlo()
    smtpserver.login(email, pwd)
    return smtpserver


def send_thank_you_mail(email, smtpserver):
    to_email = email
    from_email = GMAIL_EMAIL
    subj = "Thanks for being an active commenter"
    # The header consists of the To and From and Subject lines
    # separated using a newline character.
    header = "To:%s\nFrom:%s\nSubject:%s \n" % (to_email,
            from_email, subj)
    # Hard-coded templates are not best practice.
    msg_body = """
    Hi %s,

    Thank you very much for your repeated comments on our service.
    The interaction is much appreciated.

    Thank You.""" % email
    content = header + "\n" + msg_body
    smtpserver.sendmail(from_email, to_email, content)


if __name__ == "__main__":
    usage = "usage: %prog [options]"
    parser = OptionParser(usage=usage)
    parser.add_option("--email", dest="email",
            help="email to login to smtp server")
    parser.add_option("--pwd", dest="pwd",
            help="password to login to smtp server")
    parser.add_option("--smtp-server", dest="smtpserver",
            help="smtp server url", default="smtp.gmail.com")
    parser.add_option("--smtp-port", dest="smtpserverport",
            help="smtp server port", default=587)
    options, args = parser.parse_args()

    if not (options.email or options.pwd):
            parser.error("Must provide both an email and a password")

    smtpserver = initialize_smtp_server(options.stmpserver,
            options.smtpserverport, options.email, options.pwd)

    # for every line of input.
    for email in sys.stdin.readlines():
            send_thank_you_mail(email, smtpserver)
    smtpserver.close()

This script shows the usefulness of OptionParser. It provides a simple, easy-to-use interface for command-line arguments, allowing you to define certain properties for each command-line option. It also allows you to specify default values. If certain arguments are not provided, it allows you to throw specific errors.

So what have you learned? Instead of replacing a series of bash commands with one Python script, it often is better to have Python do only the heavy lifting in the middle. This allows for more modular and reusable scripts, while also tapping into the power of all that Python offers. Using stdin as a file object allows Python to read input, which is piped to it from other commands, and writing to stdout allows it to continue passing the information through the piping system. Combining information like this can make for some very powerful programs. The examples I have given here are all for a fictional service that logs to a file.

As a real-world example, recently I have been working with gigabytes of CSV files that I have been converting using a Python script to a file that contains SQL commands to insert the information. To understand the sort of data I'm concerned with here, I ran the data for a single table, and the script took 23 hours to execute and generated an SQL file that was 20GB in size. The advantage of using a Python script in the fashion described in this article is that the whole file does not need to be read into memory. This means that an entire 20GB+ file can be processed one line at a time. Also it is easier to think about a problem when each step (reading, sorting, manipulation and writing) is separated into these logical steps. The guarantee that each of these commands, which are part of the core utilities of UNIX-like environment, is efficient and stable helps the entire experience to be more stable and secure.

The other benefit is that there is no hard-coded file that is read in. Often having the flexibility to pass it strings rather than the concept of files is very powerful. For instance, if 20,000 lines through a certain file, the script breaks, instead of re-running the script from the start, tail can be used to read only from the line on which the script failed.

There are a lot of aspects to Python in the shell that go beyond the scope of this article, such as the os module and the subprocess module. The os module is a standard library function that holds a lot of key operating system-level operations, such as listing directories and stating files, along with an excellent submodule os.path that deals with normalizing directories paths. The subprocess module allows Python programs to run system commands and other advanced operations, such as handling piping as described above within Python code between spawned processes. Both of these libraries are worth checking out if you intend to do any Python shell scripting.