An Illusory Intertwingling of Reason and Response

Tech: Yes, I’m a geek. I admit it. At least I’m not a nerd!


Thursday, May 04, 2006

Parsing a List of Lines in Bash

Parsing newline-delimited data records in bash is simple, if you have this odd redirect up your sleeve.

Working on my current shell-script project, a scheduling utility driven by the BSD calendar, I found myself needing to parse some input files linewise. See, I had been reading in the event data files (one for each record), translating newlines to tildes, and cutting the resultant data string on tildes (since cut doesn't like cutting on newlines, it would seem) to obtain my data fields. However, this added up to almost half a second of runtime per record. I mean, I didn't expect bash to be the world's fastest string parser, but sometimes enough is quite simply enough.

Okay, let me put in the code here so people don't lose themselves in the article, and I'll explain in a moment.

# This shell script echoes individual lines from the file specified
# usage: . <scriptname> [file to parse]

# IFS= keeps leading and trailing whitespace; -r keeps backslashes literal
while IFS= read -r line; do
	echo "$line"
done < "$1"

The magic here is in that last line: done < "$1"
Because of the odd mechanics of shell substitution and token parsing, for line in $(cat $1); do . . . ; done won't work. The shell splits the output of $(cat $1) into words, so you'd end up executing the loop once per word, whether the separator is a space, a tab, or a newline. What we need is some way to ensure that each line is passed as a distinct entity through the loop.
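To make the difference concrete, here is a small sketch that counts loop passes both ways. The sample file and its two lines are made up for the demo, and the counter names are mine.

```shell
#!/usr/bin/env bash
# Build a throwaway two-line sample file (contents invented for this demo).
sample=$(mktemp)
printf 'dentist at noon\nlunch with Bob\n' > "$sample"

# Word-splitting loop: one pass per whitespace-separated word.
for_count=0
for word in $(cat "$sample"); do
	for_count=$((for_count + 1))
done

# read loop: one pass per line.
read_count=0
while IFS= read -r line; do
	read_count=$((read_count + 1))
done < "$sample"

echo "for: $for_count passes, while read: $read_count passes"
rm -f "$sample"
```

With two lines of three words each, the for loop makes six passes and the read loop makes two.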

That's what read is here for. read is a shell built-in (in bash, anyway, and POSIX specifies it for other shells too) that takes a single line of STDIN and assigns it to the variable named as its argument, like so:

usage: read varname
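
As a quick illustration of that usage, the sketch below feeds read a single line through a here-document standing in for STDIN. The sample line and the variable names (whole, day, what) are made up. Given more than one variable, read splits the line on whitespace and the last variable collects whatever is left over.

```shell
#!/usr/bin/env bash
# One variable: the whole line lands in it.
read -r whole <<EOF
2006-05-04 dentist at noon
EOF

# Several variables: the line is split on whitespace, and the last
# variable takes the remainder of the line.
read -r day what <<EOF
2006-05-04 dentist at noon
EOF

echo "whole=[$whole] day=[$day] what=[$what]"
```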

But in a complex script, it can be difficult to track down where the interpreter believes STDIN, STDOUT, and STDERR are in the code path. In this case, if you try piping the file in, like so:

cat $1 | while read line; do . . . ; done

or

while cat $1 read line; do . . . ; done

or even using a standard shell redirect, as:

while read line < $1; do . . . ; done

you'll be in for some surprises. The piped version does read each line, but bash runs the loop in a subshell, so any variables set inside it are gone once the loop ends. The second hands read and line to cat as file names. The third reopens the file on every pass, so read sees the first line forever. It turns out that STDIN for read can be set after the loop controlled by it, simply by redirecting the STDIN of the entire loop to the desired file.
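
Here is a rough sketch of how the piped form and the redirect-on-read form misbehave next to the whole-loop redirect, assuming bash with its default settings (no lastpipe). The sample file and counter names are invented for the demo, and the redirect-on-read loop is capped at three passes so it terminates.

```shell
#!/usr/bin/env bash
sample=$(mktemp)
printf 'one\ntwo\nthree\n' > "$sample"

# Piped loop: the body runs in a subshell, so the count is lost afterwards.
piped=0
cat "$sample" | while IFS= read -r line; do
	piped=$((piped + 1))
done

# Redirect on read itself: the file is reopened on every pass, so read
# keeps returning the first line. Capped at three passes for the demo.
seen=""
passes=0
while IFS= read -r line < "$sample"; do
	seen="$seen$line "
	passes=$((passes + 1))
	[ "$passes" -ge 3 ] && break
done

# Redirect on the whole loop: the file is opened once, the loop runs in
# the current shell, and the count survives.
redirected=0
while IFS= read -r line; do
	redirected=$((redirected + 1))
done < "$sample"

echo "piped=$piped seen=[$seen] redirected=$redirected"
rm -f "$sample"
```

Under those assumptions, piped stays at 0, seen holds the first line three times over, and redirected ends at 3.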

No, please don't ask me why! I don't think anyone knows why anything is the way it is in bash. There are fundamental programmatic reasons why it is necessary to sacrifice a goat at midnight to get your script to run properly.

Oh, by the way. Skipping all the utility invocations I had been using before cut my parser runtime by nearly two thirds . . . and I only had to hack at it for two hours!
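
For the curious, here is a hedged sketch of the before and after. The record layout and field names (title, day, hour) are invented for illustration, since the real event files aren't shown here. The old way flattens the file with tr and pulls fields out with cut. The new way redirects the file into a group of reads, one per field, with no external utilities at all.

```shell
#!/usr/bin/env bash
# Hypothetical record file: one field per line.
record=$(mktemp)
printf 'Dentist\n2006-05-04\n12:00\n' > "$record"

# Old way: flatten newlines to tildes, then cut each field back out.
flat=$(tr '\n' '~' < "$record")
old_title=$(printf '%s' "$flat" | cut -d '~' -f 1)
old_day=$(printf '%s' "$flat" | cut -d '~' -f 2)

# New way: redirect the file into the whole group; each read takes one line.
{
	IFS= read -r title
	IFS= read -r day
	IFS= read -r hour
} < "$record"

echo "old: $old_title on $old_day"
echo "new: $title on $day at $hour"
rm -f "$record"
```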