Monday, May 21, 2018

Memoization: an easy way to mock dependencies

Years ago I wrote about a simple way to get the benefits of memoization for any command-line application whose output is a function of its command-line arguments. I used the technique again this year while implementing multivcs_query, a utility for examining source code that spans multiple (and possibly incompatible) source control repositories. The tests for multivcs_query call out to all the major source control systems, including git, Subversion, Perforce, and even a variant of ClearCase. Memoization is the key to making those tests run fast, given the dispiriting slowness of some of the source control programs multivcs_query supports.*

It occurred to me a bit later that memoization could serve a broader purpose here. One of the least attractive aspects of multivcs_query is that its tests depend on the existence of such a variety of source control management systems, and on particular code lines that will surely not exist on most people's servers. I had considered writing bootstrap code to establish simple code lines with enough content and history to support the tests I need to run, but of course that would be a significant amount of work, and it would still assume the source control systems involved are installed locally -- a good bet with git, but probably not with any of the others.

But if multivcs_query uses memoizing wrappers for its interactions with every source control system, then all I need to do is seed the memoization cache with appropriate data, and my tests will work even if none of the source control systems exist on the local server. For each request, the memoization layer sees a hit in the cache and immediately returns valid results, never even attempting to run the (possibly non-existent) version control system that is theoretically involved. And that means I can pursue development work on multivcs_query on a laptop, which naturally has none of the server-side source control software installed.
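The scheme can be sketched in a few lines of shell. This is a deliberately simplified stand-in for the real wrapper (the memo_run function, the cache layout, and the no_such_vcs command are all illustrative), but it shows why a pre-seeded cache entry short-circuits the underlying tool entirely:

```shell
#!/bin/bash
# Minimal memoizing-wrapper sketch (illustrative; not the real cache.pl).
# The cache key is derived from the joined arguments; a hit returns the
# cached bytes without ever invoking the underlying command.
CACHE_DIR="${TMP:-/tmp}/memo_demo"
mkdir -p "$CACHE_DIR"

memo_run() {
        local key
        key=$(echo "$*" | cksum | cut -d' ' -f1)
        local cache_file="$CACHE_DIR/cache.$key"
        if [ ! -f "$cache_file" ]; then
                "$@" > "$cache_file"    # only runs on a cache miss
        fi
        cat "$cache_file"
}

# Seeding: write the expected output under the key the wrapper would
# compute, and the (possibly non-existent) command is never executed.
seed_key=$(echo "no_such_vcs log -r 42" | cksum | cut -d' ' -f1)
echo "r42 | fixed bug" > "$CACHE_DIR/cache.$seed_key"

memo_run no_such_vcs log -r 42   # prints: r42 | fixed bug
```

Note that no_such_vcs genuinely does not need to exist anywhere on the machine: the miss branch that would execute it is never reached.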

To make this beautiful vision a reality, several pieces of work were required.

  • I had to change cache.pl so that it no longer names its cached result files by simply concatenating the inputs -- I was immediately exceeding the maximum filename length with calls referring to multiple source code files. Instead, I follow the old method to generate a (sometimes very long) string key, and then run cksum over that key to produce a short unique ID.
  • Then, to avoid the cache contents becoming unmanageably opaque, I also save a companion file with a .cmd suffix recording the actual command that was run.
  • Now that I want all test cases that depend on these version control systems to use the cache, I frequently find myself looking at the current cache and extracting the files that need to be saved away for future successful test runs. Especially now that the cache files have names based on cksum values, it is no longer a simple matter to look at the cache directory and understand what is what. To make the situation transparent, I implemented a simple utility, cache.ls, to list the contents of the cache and propose commands for copying the relevant files to a new location (presumably the folder holding the cache contents needed for successful test runs). Here is the code for cache.ls:
    #!/bin/bash
    # cache.ls: list cache entries whose recorded command matches the
    # given pattern(s), and propose cp commands for saving them off.
    search_args="$*"
    
    if [ -z "$search_args" ]; then
            search_args=.
    fi
    
    for cf in "$TMP"/cache*; do
            case "$cf" in *.cmd) continue ;; esac   # skip the .cmd companions
            if grepm $search_args < "$cf.cmd"; then
                    cat "$cf"
                    echo EOD
                    echo "cp -p $cf* ."
                    echo '----------------------------------------------------------------------------------------------'
            fi
    done
    
  • Finally, it is important to initialize the cache with the appropriate data when running on a host for the first time. In my test wrapper, I added the following code to accomplish this task:
    if [ ! -f $TMP/CACHE_SEEDED_FOR_TESTS ]; then
            echo "Initializing cache data for test runs on this host:"
            echo "cp -p test/cache_seed/* $TMP..."
            if ! cp -p test/cache_seed/* $TMP; then
                    echo "$0: cp -p test/cache_seed/* $TMP failed, exiting..." 1>&2
                    exit 1
            fi
            if ! touch $TMP/CACHE_SEEDED_FOR_TESTS; then
                    echo "$0: touch $TMP/CACHE_SEEDED_FOR_TESTS failed, exiting..." 1>&2
                    exit 1
            fi
    fi
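The naming scheme from the first two bullets can be sketched directly in shell. The example key string here is made up, and TMP defaults to /tmp purely for illustration:

```shell
# Derive a short, filename-safe cache ID from an arbitrarily long key
# string, then record the original key in a .cmd companion file so the
# cache directory stays inspectable.
TMP="${TMP:-/tmp}"
key='git log --oneline -- src/a.c src/b.c src/c.c'   # can be very long

id=$(echo "$key" | cksum)        # output: "<checksum> <byte count>"
id=${id%% *}                     # keep only the checksum field
cache_file="$TMP/cache.$id"

echo "$key" > "$cache_file.cmd"  # human-readable record of the command
# ...the command's output would then be written to "$cache_file" itself...
```

However long the key grows, the filename stays a fixed, short `cache.<checksum>`, and cache.ls can always recover the original command from the companion file.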
    
So that's how it works. For completeness, here is the updated cache.pl code, using cksum to generate the cache filenames:
use strict;
use IO::File;

my $__trace = 0;

sub get_cached_output_path
{
  my($extra_key, $s) = @_;

  # The full key can be very long (it may mention many source files), so
  # reduce it to a short unique ID with cksum rather than using the key
  # directly as a filename.
  my $key = $extra_key . $s;

  my $fn_base = `echo $key | cksum`;
  chomp $fn_base;
  $fn_base =~ s/ .*//;   # keep only the checksum, drop the byte count

  my $fn = "$ENV{'TMP'}/cache." . $fn_base;

  # Record the original command in a .cmd companion file so the cache
  # directory stays human-readable (see cache.ls above).
  my $f = IO::File->new("$fn.cmd", "w") or die "cannot write $fn.cmd: $!";
  $f->write($key);
  $f->close();

  return $fn;
}

my @argv = @ARGV;
my $extra_key = $ENV{"CACHE_EXTRA_ARG"};
$extra_key = "" if !defined $extra_key;

if (defined $argv[1] && $argv[1] eq "-cache-clear")
{
  my $cached_output_stem = get_cached_output_path($extra_key, $argv[0]);
  die "empty output stem" unless $cached_output_stem;
  my $cmd = "rm -f $cached_output_stem* 2> /dev/null";
  print "$cmd\n" if $__trace;
  print `$cmd`;
  exit(0);
}


my $cmd = join('" "', @argv);

# Rebuild the command line with each argument double-quoted (dropping any
# empty trailing arguments), then strip the quotes back off arguments that
# are simple words or paths and need no protection.
$cmd =~ s/(" ")*$//g;
$cmd = '"' . $cmd . '"';
$cmd =~ s/"([\w_#,\.\/]+)"/$1/g;

print "cmd=$cmd\n" if $__trace;

my $cached_output = get_cached_output_path($extra_key, $cmd);

if (-f $cached_output)
{
  print "using existing $cached_output\n" if $__trace;
}
else
{
  my $cmd_with_redirects = "$cmd > $cached_output 2> $cached_output.err";
  `$cmd_with_redirects`;
  if ($__trace)
  {
    print "Executed $cmd_with_redirects\n";
  }
  if ( -z "$cached_output.err" )   # i.e. nothing was written to stderr
  {
    if ($__trace)
    {
      print "No error output, so deleting $cached_output.err\n";
    }
    unlink "$cached_output.err";
  }
}
print `cat $cached_output`;
if (-f "$cached_output.err" )
{
  print STDERR `cat $cached_output.err`;
  # assume trouble if there was output to stderr, and remove the cached output:
  unlink "$cached_output.err";
  unlink $cached_output;
}
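One detail of cache.pl worth calling out is that final block: any output on stderr is taken to mean the run failed, so the cached copy is discarded and the next invocation re-executes the command for real. That policy, in isolation, looks like this (the demo command and file names are illustrative):

```shell
TMP="${TMP:-/tmp}"
out="$TMP/cache.demo"

# Run a command with stdout and stderr captured separately, as cache.pl does.
sh -c 'echo ok; echo oops >&2' > "$out" 2> "$out.err"

if [ -s "$out.err" ]; then
        # stderr output => assume trouble: surface it, then drop the cached
        # result so a bad entry is never served from the cache later.
        cat "$out.err" >&2
        rm -f "$out" "$out.err"
fi
```

The cost of a false positive (a chatty-but-successful command) is only a redundant re-run on the next call, while the cost of caching a genuine failure would be persistently wrong test results, so erring toward invalidation is the safe trade.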

* It really is strange -- svn in particular has about two seconds of overhead for me, no matter how simple the call. I'm guessing this is some sort of pathological misconfiguration of the local Subversion server, but I don't control it, and it is tangential enough to the central purpose of multivcs_query that I can't justify launching a campaign to improve it. But thanks to memoization, I don't have to care too much.
