Reading Multiple Files with Bash


Reading files is no big deal with bash: you can redirect input to the script, pipe the output of another command into it, or open files inside the script if the file names are pre-determined. You could also use process substitution to pass in open files (command pipelines, actually) from the command line. Another option, the one I describe here, is to open the files yourself and read (or write) them as you like, as you would in other programming languages.

The mechanism used here takes advantage of bash's ability to redirect input (or output) using a specific file descriptor with the following syntax:

n<file
n>file
n>>file
n<>file

The "n" here is a small integer that specifies the file descriptor to use to open the named file. If no "n" is specified then the following defaults apply:

<file           # same as 0<file
>file           # same as 1>file
>>file          # same as 1>>file
<>file          # same as 0<>file

This is of course the standard redirection stuff that is used all the time.
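As a quick sketch of the numbered forms (the scratch file name here is arbitrary), the read-write form <> is the one you're least likely to have seen:

```shell
#!/bin/bash
# Sketch of the numbered redirection forms; the scratch file is arbitrary.
tmp=$(mktemp)
echo "hello" > "$tmp"    # >file  : fd 1 (stdout) is the default
exec 3<>"$tmp"           # 3<>file: open read-write on descriptor 3
read -r line <&3         # read a line from descriptor 3
echo "got: $line"
exec 3>&-                # close descriptor 3
rm -f "$tmp"
```

Running it prints "got: hello", showing that descriptor 3 was opened on the file and read from.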

So, given that the "n" is there, it would seem that one could easily open files and process them as needed. How to actually do it is less than obvious, but it's quite simple:

exec 7<file1
exec 8<file2

This opens file1 on file descriptor 7 for input, and file2 on file descriptor 8. Now we can read them easily with:

read data1 <&7
read data2 <&8

Notice the input redirection to read uses another special form that includes the ampersand (&) to specify that what follows is a file descriptor and not a file name.
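Once a file is open on a descriptor you can also loop over it; read's -u option is an equivalent way of naming the descriptor. A sketch (the temporary file stands in for any input file):

```shell
#!/bin/bash
# Count lines by reading descriptor 7 in a loop.
# "read -r -u 7 line" is equivalent to: read -r line <&7
tmp=$(mktemp)
printf "one\ntwo\nthree\n" > "$tmp"
exec 7<"$tmp"
count=0
while read -r -u 7 line; do
	count=$((count + 1))
done
exec 7<&-            # close the descriptor (see the comments below)
rm -f "$tmp"
echo "$count lines"
```

This prints "3 lines"; the descriptor stays open across all the reads, and each read picks up where the previous one left off.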

Use file descriptors in the range 3-9. File descriptors below 3 are used for standard input, output, and error; descriptors above 9 may be used by the shell internally.

Although there is no explicit syntax for closing a file, re-using the file descriptor will close the file before opening the new file. (08/21/2009: this is incorrect, there is a syntax for closing files, see the comments below --Mitch)

To be safe you could do the following to close the files:

exec 7</dev/null
exec 8</dev/null

The reason for the exec is so that the file is opened in the current shell and not in a sub-shell, which would close the file descriptor as soon as the command completed (not that the descriptor would be available in the calling shell anyway). It may also surprise you that n<file by itself is not a syntax error, but it isn't: bash applies the redirection to an empty command, so the file is opened and then closed again.

An example of doing all this follows:

#!/bin/bash

function readfiles()
{
	local FD1=7
	local FD2=8
	local file1=$1
	local file2=$2
	local count1=0
	local count2=0
	local eof1=0
	local eof2=0
	local data1
	local data2

	# Open files.
	# ***** 08/22/2009: See comments below for a way to avoid    *****
	# *****             hardcoding the file descriptors -- Mitch *****
	exec 7<"$file1"
	exec 8<"$file2"

	while [[ $eof1 -eq 0  ||  $eof2 -eq 0 ]]
	do
		if read -r data1 <&$FD1; then
			let count1++
			printf "%s, line %d: %s\n" "$file1" "$count1" "$data1"
		else
			eof1=1
		fi
		if read -r data2 <&$FD2; then
			let count2++
			printf "%s, line %d: %s\n" "$file2" "$count2" "$data2"
		else
			eof2=1
		fi
	done
}

echo "Reading file1 and file2"
readfiles file1 file2

echo "Reading file3 and file4"
readfiles file3 file4


# vim: tabstop=4: shiftwidth=4: noexpandtab:
# kate: tab-width 4; indent-width 4; replace-tabs false;

The function at the top reads the files; the main code processes two files, then processes two different files. Running the script produces:

$ bash readmult.sh
Reading file1 and file2
file1, line 1: f1 line 1
file2, line 1: f2 line 1
file1, line 2: f1 line 2
file2, line 2: f2 line 2
file1, line 3: f1 line 3
file2, line 3: f2 line 3
file1, line 4: f1 line 4
file2, line 4: f2 line 4
file1, line 5: f1 line 5
file2, line 5: f2 line 5
file1, line 6: f1 line 6
Reading file3 and file4
file3, line 1: f3 line 1
file4, line 1: f4 line 1
file3, line 2: f3 line 2
file4, line 2: f4 line 2
file3, line 3: f3 line 3

A similar process can be used for writing multiple output files using the n>file or n>>file syntax. This can be a time saver if you're writing a lot of data to the same file from many different places in your script.
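A sketch of the output side (the file names are arbitrary examples): open each output file once up front, then write through the descriptors from anywhere in the script:

```shell
#!/bin/bash
# Open two output files once, write to them by descriptor, then close.
out1=$(mktemp) out2=$(mktemp)
exec 4>"$out1"            # 4>file : truncate and open for writing
exec 5>>"$out2"           # 5>>file: open for appending
for i in 1 2 3; do
	echo "record $i" >&4
	echo "record $i" >&5
done
exec 4>&- 5>&-            # close both (see the comments below)
wc -l < "$out1"           # each file now holds 3 lines
rm -f "$out1" "$out2"
```

Each file is opened exactly once, no matter how many writes are directed at it.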

Attachment: readmult.tgz (634 bytes)
______________________

Mitch Frazier is an Associate Editor for Linux Journal.

Comments


No mention in the manual? Huh?

Anonymous:

You are aware that the documentation for Bash is in Texinfo?

And even then, the manual page says:

Duplicating File Descriptors
The redirection operator

[n]<&word

is used to duplicate input file descriptors. If word expands to one or
more digits, the file descriptor denoted by n is made to be a copy of
that file descriptor. If the digits in word do not specify a file
descriptor open for input, a redirection error occurs. If word evaluates to -, file descriptor n is closed.

Thanks

Mitch Frazier:

Guess I missed that part. Now that I check the man page more closely, I see that it is in there, although it doesn't explicitly state that "[n]>&-" closes "n", which is what the comment below referred to; that does appear to work as well.


From the Man Page

Mitch Frazier:

As a reference, the man page describes what exec is doing:

exec [-cl] [-a name] [command [arguments]]
[...removed...] If command is not specified, any redirections take effect in the current shell, and the return status is 0. If there is a redirection error, the return status is 1.


exec syntax

augmentedfourth:

Could you have written:

exec 7<$file1

as:

exec $FD1<$file1

?

Why would you hard-code the file descriptor value in the exec line when it was already defined in a variable? Does the exec command not like that for some reason?

It Doesn't Like It

Mitch Frazier:

Unfortunately that doesn't work, you get:

Reading file1 and file2
readmult.sh: line 17: exec: 7: not found

Exec substitutes the value of $FD1, but it doesn't then re-parse the result to see whether it's an integer; instead it assumes it's a command. The message is saying that the command 7 was not found.


Just As I Wrote That...

Mitch Frazier:

Just as I hit submit I realized how to make that work:

    # Open files.
    eval exec "$FD1<$file1"
    eval exec "$FD2<$file2"
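In bash 4.1 and later there's a third option that avoids both the hardcoding and the eval: an exec {varname}<file redirection asks the shell to allocate a free descriptor (10 or above) and store its number in the variable. A sketch (the temporary file is a stand-in for file1):

```shell
#!/bin/bash
# bash >= 4.1: let the shell pick the descriptor number.
tmp=$(mktemp)
echo "first line" > "$tmp"
exec {FD1}<"$tmp"        # bash allocates an fd and puts its number in FD1
read -r data1 <&$FD1     # read through the allocated descriptor
echo "fd $FD1: $data1"
exec {FD1}<&-            # close it the same way, via the variable
rm -f "$tmp"
```

This sidesteps the "exec: 7: not found" problem entirely, since the shell itself chooses the number.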


Benchmarks

Anonymous:

Comparing the use of file descriptors to standard redirection:

http://www.los-gatos.ca.us/davidbu/faster_sh.html

Closing file descriptors

Anonymous:

There is an explicit syntax for closing a file descriptor. If you want to close descriptor 7:

exec 7>&-

I recommend avoiding this with descriptors 0-2, since many programs will behave erratically if run with these descriptors closed.

Interesting

Mitch Frazier:

Thanks. That does appear to do what you describe. You can test it with the script:

#!/bin/bash

exec 7>junk
echo JUNK >&7
lsof -p $$ | grep -v mem
exec 7>&-
echo
lsof -p $$ | grep -v mem

Which should produce something like:

COMMAND   PID  USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
sh      27781 mitch  cwd    DIR    9,0    20480 12517486 /home/mitch/tmp
sh      27781 mitch  rtd    DIR    9,0     4096        2 /
sh      27781 mitch  txt    REG    9,0   725048  5316638 /bin/bash
sh      27781 mitch    0u   CHR  136,1      0t0        3 /dev/pts/1
sh      27781 mitch    1u   CHR  136,1      0t0        3 /dev/pts/1
sh      27781 mitch    2u   CHR  136,1      0t0        3 /dev/pts/1
sh      27781 mitch    7w   REG    9,0        5 12517390 /home/mitch/tmp/junk
sh      27781 mitch  255r   REG    9,0       99 12519090 /home/mitch/tmp/j.sh

COMMAND   PID  USER   FD   TYPE DEVICE SIZE/OFF     NODE NAME
sh      27781 mitch  cwd    DIR    9,0    20480 12517486 /home/mitch/tmp
sh      27781 mitch  rtd    DIR    9,0     4096        2 /
sh      27781 mitch  txt    REG    9,0   725048  5316638 /bin/bash
sh      27781 mitch    0u   CHR  136,1      0t0        3 /dev/pts/1
sh      27781 mitch    1u   CHR  136,1      0t0        3 /dev/pts/1
sh      27781 mitch    2u   CHR  136,1      0t0        3 /dev/pts/1
sh      27781 mitch  255r   REG    9,0       99 12519090 /home/mitch/tmp/j.sh

As you can see, in the second output from lsof the file on file descriptor 7 is now closed.

The most interesting thing about this is that it doesn't appear to be in the man page anywhere; the closest thing I see is:

Similarly, the redirection operator
     [n]>&digit-
moves the file descriptor digit to file descriptor n, or the standard output (file descriptor 1) if n is not specified.

So I guess it's a special form of that. If you check back, leave a note as to where you found that documented. Thanks again.
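For what it's worth, that "move" form can be sketched as follows (the scratch file name is arbitrary): after exec 8>&7-, descriptor 8 refers to the file and descriptor 7 has been closed.

```shell
#!/bin/bash
# [n]>&digit- moves a descriptor: duplicate it, then close the original.
tmp=$(mktemp)
exec 7>"$tmp"       # open the file for writing on fd 7
exec 8>&7-          # move fd 7 to fd 8; fd 7 is now closed
echo "moved" >&8    # writes now go through fd 8
exec 8>&-           # close fd 8
cat "$tmp"
rm -f "$tmp"
```

The closing-only syntax discussed above is then just this move with nothing on the right to move to.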


It's also more efficient

Anonymous:

Just an additional note... Using file descriptors is also more efficient (reducing processing time by about 5x). This is because the file is not opened/closed implicitly between operations.
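A rough sketch of how you might check that yourself (the iteration count and file names are arbitrary, and the exact speedup will vary by system):

```shell
#!/bin/bash
# Compare N appends that reopen the file each time against N writes
# through a descriptor opened once with exec.
N=1000
f1=$(mktemp); f2=$(mktemp)

time for ((i = 0; i < N; i++)); do
	echo "line $i" >> "$f1"   # open/append/close on every iteration
done

exec 4>>"$f2"                     # open once...
time for ((i = 0; i < N; i++)); do
	echo "line $i" >&4        # ...write through the open descriptor
done
exec 4>&-

rm -f "$f1" "$f2"
```

Both loops write the same data; the second one avoids an open and a close per write.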
