SIGALRM Timers and Stdin Analysis

It's not hard to create functions to ensure that your script doesn't run forever. But what if you want portions to be timed while others can take as long as they need? That turns out to be trickier, as Dave explains in his latest Work the Shell.

In an earlier article, I started building out a skeleton script that would have the basic functions needed for any decent shell script you might want to create. I started with command-line argument processing with getopts, then explored status logging through syslog. Finally, I ended that column by talking about how to capture signals like Ctrl-C and invoke functions that can clean up temp files and so on before actually giving up control of your shell script.

This time, I want to explore a different facet of signal management in a shell script: having built-in timers that let you specify an allowable quantum of time for a specific function or command to complete with explicit consequences if it hangs.

When does a command hang? Often when you're tapping into a network resource. For example, you might have a script that looks up definitions by handing a query to Google via curl. If everything's running fine, it'll complete in a second or two, and you're on your way.

But if the network's off-line or Google's having a problem or any of the million other reasons that a network query can fail, what happens to your script? Does it just hang forever, relying on the curl program to have its own timeout feature? That's not good.
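
As a sketch of the kind of query I mean (the URL and the $word variable here are just placeholders, not a real API):


# a network lookup like this can hang indefinitely if the server
# never answers; $word and the URL are illustrative only
definition=$(curl --silent "http://www.google.com/search?q=define+$word")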

Alarm Timers

One of the most common alarm timer approaches is to give the entire script a specific amount of time within which it has to finish by spawning a subshell that waits that quantum, then kills its parent. Yeah, kinda Oedipal, but at least we're not poking any eyes out in this script!

The additional lines end up looking like this:


(
sleep 600           # if 10 minutes pass
kill -TERM $$       # send it a SIGTERM signal
)&

There's no "trap" involved—easy enough. Notice especially that the closing parenthesis has a trailing ampersand to ensure that the subshell is pushed into the background and runs without blocking the parent script from proceeding.

A smarter, cleaner way to do this would be for the timer child subshell to send the appropriate SIGALRM signal to the parent—a small tweak:


(
sleep 600            # if 10 minutes pass
kill -ALRM $$        # send it a SIGALRM signal
)&

If you do that, however, what do you need in the parent script to capture the SIGALRM? Let's add that, and let's set up a few functions along the way to continue the theme of useful generic additions to your scripts:


function allow_time
{
   ( echo "timer allowing $1 seconds for execution"
     sleep "$1"
     kill -ALRM $$
   ) &
}

This first function lets you easily set a time for subsequent execution, while the second presents your ALRM handler in a bit neater fashion:


function timeout_handler
{
   echo "allowable time for execution exceeded."
   exit 1
}

Note that both functions have debugging output that's probably not needed for actual production code. It's easily commented out, but running it as is will help you understand how things interact and work together.

How might this be used? Like this:


trap timeout_handler SIGALRM
allow_time 10
code that has ten seconds to complete

That would give the script ten seconds to finish.
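
Putting the pieces together, here's a minimal self-contained sketch; it assumes the allow_time and timeout_handler functions are defined above exactly as shown, and the sleep stands in for your real work:


#!/bin/bash
# minimal test harness for the timer functions above

trap timeout_handler SIGALRM

allow_time 10
sleep 5        # stands in for real work that finishes in time
echo "finished with time to spare"

Change the sleep 5 to a sleep 15 and the handler fires instead.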

The problem is, what happens if it finishes up in less time than allotted? The subshell is still out there, waiting, and it pushes out the signal to a nonexistent process, causing the following sloppy error message to show up:


sigtest.sh: line 7: kill: (10532) - No such process

There are two ways to fix this: either kill the subshell when the parent shell exits, or have the subshell test for the existence of the parent shell just before it sends the signal.

Let's do the latter. It's easier, and having the subshell float around for a few seconds in a sleep is certainly not going to be a waste of computing resources.

The easiest way to test for the existence of a specified process is to use ps and check the return code, like this:


ps $$ >/dev/null ; echo $?

If the process exists, the return code will be 0. If it's gone, the return code will be nonzero. This suggests a simple test:


if [ ! $(ps $$ > /dev/null) ]

But, that won't work, because the test needs the command's return code, not the (empty) output that's handed to the shell. The solution? Invoke the ps command first, then have the test expression examine the return code:


function allow_time
{
   ( echo "timer allowing $1 seconds for execution"
     sleep "$1"
     ps $$ > /dev/null           # is the parent still alive?
     if [ $? -eq 0 ] ; then
       kill -ALRM $$
     fi
   ) &
}

That solves that problem. But, what if you have sections of code where you want to limit your execution time followed by other sections where you don't care?

That's easy if you don't mind leaving some child processes around waiting to shoot a signal at the parent. Just use this:


trap '' SIGALRM

when you're done with the timed passage. What happens is that the timer generates a signal, but the parent script ignores it.
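
When the next timed section arrives, simply re-install the handler and start a fresh timer. A quick sketch:


trap timeout_handler SIGALRM    # re-arm the handler
allow_time 100                  # fresh timer for the next timed section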

The limitation on this, of course, is if you have code like this:


regular code
possible runaway code <-- allocate 100 seconds
cancel timer
more regular code
possible runaway code <-- allocate 100 seconds

The situation arises if the second code block is started before the first timer runs out. Imagine that you've allocated 100 seconds for the first timed block and it finishes in 90 seconds. Regular code takes five seconds, then you're in block two. Five seconds later, at the 100-second mark, the first ALRM timer triggers, rather than giving block two its full 100 seconds. Not good.

This is admittedly a bit of a corner case, but to fix it, let's reverse the decision about having child processes test for the existence of the parent before sending the signal, and instead have the parent script kill all child subshells upon completion of each timed portion. It's a bit tricky to build, because ps picks up more processes than just the timer subshell: you need to match only the subshells that are running the script, then screen out the parent's own process.

I use the following:


ps -g $$ | grep $myname | cut -f1 -d\  | grep -v $$

This generates a list of process IDs (pids) for all the subshells running, which you then can feed to kill:


pids=$(ps -g $$ | grep $myname | cut -f1 -d\  | grep -v $$)
kill $pids

The problem is that not all of those processes are still around by the time they're handed to the kill program. The solution? Ignore any "no such process" errors:


kill $pids > /dev/null 2>&1

Combined as a function, it'd look like this:


function kill_children
{
   myname=$(basename $0)
   # list our process group, keep entries running this script,
   # pull out the PIDs, then drop the parent's own PID
   pids=$(ps -g $$ | grep $myname | cut -f1 -d\  | grep -v $$)
   kill $pids > /dev/null 2>&1
}
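
With that in hand, each timed block can be bracketed explicitly. Here's a sketch; the task names are placeholders for your own code:


trap timeout_handler SIGALRM

allow_time 100
first_risky_task        # placeholder for the first timed block
kill_children           # cancel its timer immediately

untimed_work            # placeholder for code with no time limit

allow_time 100
second_risky_task       # placeholder for the second timed block
kill_children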

If you're thinking "holy cow, multiple timers in the same script is a bit of a mess", you're right. At the point where you need something of this nature, it's quite possible that a different solution would be a smarter path.
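
For what it's worth, if all you need is to bound a single external command rather than an arbitrary stretch of script, the timeout utility that ships with GNU coreutils handles that case directly (assuming your system has it; $url is a placeholder):


# give the query ten seconds; timeout exits with status 124
# if it had to kill the command
timeout 10 curl --silent "$url" > results.txt
if [ $? -eq 124 ] ; then
   echo "query timed out"
fi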

Further, I'm sure there are other ways to address this, in which case I'd be most interested in hearing from readers about whether you've encountered a situation where you need to have multiple timed portions of your code, and if so, how you managed it! Send e-mail via http://www.linuxjournal.com/contact.


______________________

Dave Taylor has been hacking shell scripts for over thirty years. Really. He's the author of the popular "Wicked Cool Shell Scripts" and can be found on Twitter as @DaveTaylor and more generally at www.DaveTaylorOnline.com.

Comments


Nice article.

M^2

Thank you so much for such a nice article. I enjoyed it. Even the comments are very informative.

improved if

Michal Papis

A cleaner version of the check for the process: you can put the command directly in the if:

if ps $$ > /dev/null
then kill -ALRM $$
fi

I was working on a similar project some time ago: https://github.com/mpapis/shell-timeouts/blob/master/build.sh

Yuk!

Kees-Jan Dijkzeul

Your construct of

  ps $$ > /dev/null
  if [ $? -eq 0 ] ; then
    kill -ALRM $$
  fi

looks kind of ugly. May I suggest

  if ps $$ > /dev/null ; then # No square braces here
    kill -ALRM $$
  fi

or perhaps even

  ps $$ > /dev/null && kill -ALRM $$

All of these have the risk that the parent process terminates after ps, but before kill, so could still trigger an error message from kill. Hence my personal favorite:

  kill -ALRM $$ 2> /dev/null


SIGALRM Timers - A more elegant solution

AlphaGeek's picture

You could use the '$!' variable in the parent shell. The value of $! is the process ID of the last background job spawned from the current shell.

In your last example:
function kill_children
{
   myname=$(basename $0)
   pids=$(ps -g $$ | grep $myname | cut -f1 -d\  | grep -v $$)
   kill $pids > /dev/null 2>&1
}

Simply record the PID of the child shell after you invoke it:

MyChild=$!

and change the function kill_children to:

function kill_child {
   ps -p ${MyChild} > /dev/null 2>&1   # does the child process still exist?
   if [ $? = 0 ]
   then
      kill -TERM ${MyChild}
   fi
}

If you have multiple sub-process shells, then change MyChild to an array and iterate across the array members.
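
A rough sketch of that array variant (untested, purely illustrative):

MyChildren=()                        # bash array of child PIDs

allow_time 100
MyChildren+=($!)                     # record each timer subshell as it's spawned

for pid in "${MyChildren[@]}" ; do
   kill -TERM ${pid} 2> /dev/null    # skip any children already gone
done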
