Sunday, July 22, 2012

Simple bash scripts to achieve parallel processing of shell tasks

We've all had the experience of scripting commands which we know will take a long time to finish, and wishing for an easy way to run them in parallel. On occasion I have written special-purpose scripts to manage parallel processes for a long-running task, e.g., copying large numbers of files across machines. But this is such a general problem that it makes more sense to have a general solution.

To solve this problem I wrote parallel, a bash script which takes an optional integer argument (default 5) telling it how many worker processes it should spawn. It reads standard input for a series of commands, one per line, which it then distributes to its subprocesses. There must be no ordering dependency between these commands, since the order of their execution will not be known in advance. The subprocesses execute these commands until no work is left.
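As a usage sketch, here is how the input stream might be built. The command list below is illustrative (the file names are made up); the generated stream would then be piped into the script, e.g. gen_cmds | parallel 4.

```shell
# Build a stream of independent commands, one per line, suitable for
# piping into the parallel script described in this post.
gen_cmds() {
        for f in a b c; do
                echo "gzip $f.log"
        done
}
gen_cmds
```

Each line must be a complete, self-contained command, since any worker may pick it up at any time.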

Here is the code.

#!/bin/bash
n=$1

if [ -z "$n" ]; then
        n=5
else
        shift
fi

# Fall back to /tmp when $TMP is unset.
t=${TMP:-/tmp}/parallel.$$

mkdir -p "$t/in"
mkdir -p "$t/work"
mkdir -p "$t/out"
x=0

# Read one command per line from stdin and queue each as a numbered file.
while read -r cmd; do
        echo "$cmd" > "$t/in/$x"
        x=`expr $x + 1`
done

# Spawn the workers, each with a unique ID, and wait for them all.
while [ "$n" -gt 0 ]; do
        parallel.1proc "$n" "$t" &
        n=`expr $n - 1`
done
wait
cat "$t"/out/*

rm -r "$t"
The script creates a temporary directory to keep track of the work. Under that directory is a subdirectory named in, where we keep the commands waiting to be executed. The subdirectory out is where we keep the output from the individual commands. The subdirectory work is where we keep track of the work which is currently being done.
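Note that naming the scratch directory after the process ID is predictable, and the script depends on a temporary-directory variable being set. A slightly more defensive way to create the scratch area, as a sketch, is to let mktemp pick a unique name and fall back to /tmp:

```shell
# Sketch: create a unique scratch directory even when $TMP is unset.
# mktemp -d creates the directory atomically with an unpredictable suffix.
t=$(mktemp -d "${TMP:-/tmp}/parallel.XXXXXX")
mkdir -p "$t/in" "$t/work" "$t/out"
ls "$t"
```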

A second script, parallel.1proc, does the actual work in each process. It expects an integer ID and the path of the temporary directory. It loops, repeatedly grabbing a command from the in directory, recording what it is doing in the work directory, and putting the output into the out directory. Here is the code:

#!/bin/bash
id=$1
t=$2

echo parallel.1proc $id $t starting...
x=0
while true; do
        current_cmd_fn=$t/work/$id.$x
        # Keep trying to claim a command until we win a race or the queue empties.
        while [ ! -f "$current_cmd_fn" ]; do
                # Pick a candidate offset by our ID to reduce collisions.
                next_cmd_base=`ls "$t/in" | tail -n "$id" | head -1`
                if [ -z "$next_cmd_base" ]; then
                        echo parallel.1proc $id done
                        exit 0
                fi
                next_cmd_fn=$t/in/$next_cmd_base
                # mv is atomic within a filesystem; only one worker's claim succeeds.
                mv "$next_cmd_fn" "$current_cmd_fn" 2>/dev/null
        done
        (
        date
        cat "$current_cmd_fn"
        eval "$(cat "$current_cmd_fn")"
        date
        ) > "$t/out/$id.$x"

        x=`expr $x + 1`
done
The in directory is logically the work queue. There is a race between the multiple copies of parallel.1proc as they attempt to move individual command files from the in directory to the work directory, and the scheme hinges on each process being able to tell whether it won that race. When a process attempts to move a command file, it constructs a destination filename which incorporates its unique ID. So if, for example, the destination file incorporating the ID 4 exists after the move attempt, then subprocess 4 knows that it was the lucky winner which succeeded in grabbing the work, and it can go ahead and execute it. Any other process which was trying to grab the same command will realize that it failed, since the file named by its own current_cmd_fn variable will not exist. A losing subprocess simply retries until it either succeeds in grabbing a command or the supply of commands runs out, at which point it exits.
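The winner-detection trick relies on mv of a file within one filesystem being a single atomic rename: for a given source file, exactly one rename succeeds. A minimal sketch of the pattern, with two contenders claiming the same job file one after the other (the names here are illustrative, not from the scripts above):

```shell
# Two workers race to claim the same job file via mv; rename is atomic
# within a filesystem, so exactly one destination file exists afterward.
q=$(mktemp -d)
echo 'echo hello' > "$q/job"

mv "$q/job" "$q/claimed.1" 2>/dev/null          # worker 1 claims the job
mv "$q/job" "$q/claimed.2" 2>/dev/null || true  # worker 2 tries the same job; this fails

winners=0
for id in 1 2; do
        if [ -f "$q/claimed.$id" ]; then
                echo "worker $id won the race"
                winners=`expr $winners + 1`
        fi
done
echo "winners: $winners"
rm -r "$q"
```

Checking for the existence of its own destination file is how each worker learns the outcome, with no locks or extra coordination needed.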
