Sunday, August 12, 2012

Bash scripts to distribute parallel processing across machines

When a single host does not have enough resources for the work you are doing, splitting that work across more processes on the same host will not improve performance. The work has to be distributed across multiple hosts instead. Below are changes to the parallel-execution script framework I discussed earlier, which now supports execution across machines:
#!/bin/bash
# arguments: [N] [HOST1 HOST2 HOST3 ...]
# execute in N parallel processes (default 5)
# if HOST1 HOST2 HOST3 ... specified,
# execute in N processes per specified host
n=5

# a leading all-digit argument is the process count;
# whatever remains is the host list
case "$1" in
        ''|*[!0-9]*)
        ;;
        *)
                n=$1
                shift
        ;;
esac
hosts=$*

if [ -z "$hosts" ]; then
        hosts='localhost'
fi

# per-run scratch directory; falls back to /tmp when TMP is unset
t=${TMP:-/tmp}/parallel.$$

mkdir -p "$t/in" "$t/work" "$t/out" "$t/hosts"

x=0

# queue each line of stdin as one command file, numbered from 0
while IFS= read -r cmd; do
        printf '%s\n' "$cmd" > "$t/in/$x"
        ((x++))
done

id=1
for host in $hosts; do
        n2=$n
        while [ "$n2" -gt 0 ]; do
                # one file per worker, naming the host that worker sends its commands to
                expand_host_name "$host" > "$t/hosts/$id"
                parallel.1proc "$id" "$t" &
                ((id++))
                ((n2--))
        done
done

# wait for every worker to finish, then print the collected output and clean up
wait
cat "$t"/out/*

rm -r "$t"
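To make the fan-out concrete, here is a minimal sketch of how worker ids map to hosts in the launch loop above. The function name `list_slots` and the host names `alpha` and `beta` are made up for illustration, and the sketch skips `expand_host_name`, so it only prints the layout and starts no workers.

```shell
#!/bin/bash
# list_slots N HOST...: print "id host" for each worker slot,
# N consecutive ids per host, numbered from 1.
# (hypothetical helper, not part of the scripts in this post)
list_slots() {
        local n=$1 id=1 host i
        shift
        for host in "$@"; do
                i=0
                while [ "$i" -lt "$n" ]; do
                        echo "$id $host"
                        id=$((id + 1))
                        i=$((i + 1))
                done
        done
}

list_slots 2 alpha beta
```

With two processes per host, ids 1 and 2 belong to `alpha` and ids 3 and 4 to `beta`, which is the same numbering the launcher records under $t/hosts.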
Each worker process still runs a separate script, parallel.1proc, with the following code:
:
id=$1
t=$2

# the launcher writes this worker's target host to $t/hosts/$id
if [ -f "$t/hosts/$id" ]; then
        host=`cat "$t/hosts/$id"`
else
        host=localhost
fi

echo "parallel.1proc $id $t starting (will direct work to $host)..."
x=0
while true; do
        current_cmd_fn=$t/work/$id.$x
        # claim the next command by moving it into work/;
        # if another worker took the file first, mv fails and we try again
        while [ ! -f "$current_cmd_fn" ]; do
                next_cmd_base=`ls.nth $id $t/in`
                if [ -z "$next_cmd_base" ]; then
                        echo parallel.1proc $id done
                        exit 0
                fi
                next_cmd_fn=$t/in/$next_cmd_base
                mv "$next_cmd_fn" "$current_cmd_fn" 2> /dev/null
        done
        (
        cmd=`cat "$current_cmd_fn"`
        echo "==============================================================================="
        echo "`date` parallel proc $id executing $cmd on $host"
        if [ "$host" = "localhost" ]; then
                eval "$cmd"
        else
                # with DISPLAY unset, ssh does not attempt X11 forwarding
                unset DISPLAY
                ssh -o StrictHostKeyChecking=no "$host" "$cmd"
        fi
        date
        ) > "$t/out/$id.$x"

        x=`expr $x + 1`
done
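The worker loop above never takes a lock: a worker owns a command once its `mv` succeeds, and because a rename within one filesystem is atomic, two workers cannot both win the same file. Here is a self-contained sketch of that claiming technique. The names `claim_all` and `run_demo` and the temporary directory layout are assumptions for illustration, and unlike parallel.1proc each worker here simply tries every queued file rather than picking one with `ls.nth`.

```shell
#!/bin/bash
# claim_all DIR ID: try to claim every file in DIR/in by renaming it into DIR/work.
# Only one worker's mv can succeed per file; the others get an error that is discarded.
claim_all() {
        local dir=$1 id=$2 f
        for f in "$dir"/in/*; do
                mv "$f" "$dir/work/$id.${f##*/}" 2> /dev/null || continue  # lost the race
        done
}

# run_demo JOBS WORKERS: queue JOBS files, race WORKERS claimers against each other,
# then report how many files ended up claimed and how many were left behind.
run_demo() {
        local jobs=$1 workers=$2 dir i
        dir=$(mktemp -d) || return 1
        mkdir "$dir/in" "$dir/work"
        i=0
        while [ "$i" -lt "$jobs" ]; do
                echo "echo job $i" > "$dir/in/$i"
                i=$((i + 1))
        done
        i=1
        while [ "$i" -le "$workers" ]; do
                claim_all "$dir" "$i" &
                i=$((i + 1))
        done
        wait
        echo "claimed=$(ls "$dir/work" | wc -l | tr -d ' ') left=$(ls "$dir/in" | wc -l | tr -d ' ')"
        rm -r "$dir"
}

run_demo 20 4
```

Which worker ends up with which job varies from run to run, but the totals should not: every queued file is claimed exactly once and none are left in the queue.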
This assumes that every host can access the disks needed for the work at hand. The approach works well in an environment where a set of NFS disks is mounted on all the hosts you want working on the task.
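One way to check that assumption before queueing any work is to ask each host whether it can see the shared directory. The sketch below is hypothetical: `check_shared_dir`, the directory `/nfs/work`, and the host names in the example call are placeholders, and it assumes key-based ssh logins so that `BatchMode=yes` fails fast instead of prompting for a password.

```shell
#!/bin/bash
# check_shared_dir DIR HOST...: report every host that cannot see DIR as a directory.
# Returns 0 only if all hosts can. "localhost" is tested directly, as in parallel.1proc.
# Assumes DIR contains no single quotes, since it is quoted for the remote shell.
check_shared_dir() {
        local dir=$1 host missing=0
        shift
        for host in "$@"; do
                if [ "$host" = "localhost" ]; then
                        [ -d "$dir" ]
                else
                        # -n keeps ssh from reading stdin; ConnectTimeout bounds the wait
                        ssh -n -o BatchMode=yes -o ConnectTimeout=5 "$host" "test -d '$dir'"
                fi || { echo "$host cannot access $dir" >&2; missing=1; }
        done
        return "$missing"
}

# example call with placeholder names:
# check_shared_dir /nfs/work localhost host1 host2
```

Running such a check first means a missing mount is reported once per host, before any work is queued, rather than as a failure inside every command sent to that host.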
