Monday, December 26, 2022

GPT as a learning tool

I've really liked using GPT as a tutor. I recently needed to learn about SSO/SAML2, and it was such a luxury to be able to pose questions to GPT, and in particular to be able to express my understanding of the protocol and have GPT either confirm or correct it. I think it is pretty clear that this capacity to evaluate our expressed understanding of a concept is going to really accelerate our learning.

Normally when I learn some new concept online, I spend a lot of extra time with supplementary reading, looking to confirm or disprove the model of that concept in my mind. But in this case with GPT I could skip this step and just summarize what I thought I knew about the topic and ask GPT to compare what I was saying with its own understanding.

I also really like GPT's ability to confirm whether a given technique in an engineering implementation is common. A lot of managing risk in an engineering project comes down to keeping your integration points on the beaten path: it is best to connect to shared tools in the same way other users connect to them. That way you are more likely to be in line with the design and less subject to surprises -- and less likely to be caught out with an awkward upgrade path as the software evolves.

Friday, September 9, 2022

Running ssh tunneled X clients after su

When you ssh to a machine and then run an X client, authentication is based on having a valid entry in your ~/.Xauthority file, which your ssh client initializes automatically if you run ssh with -X or -Y. But if you su to another user, that authorization cannot be found, and the X client initialization will fail.

To manage this situation, I wrote a pair of scripts and aliases. X.auth_save saves the authentication away to /tmp:

#!/bin/bash
. X.auth_saved.inc
auth=`xauth -f $HOME/.Xauthority list | tail -1`
echo $auth > $x_auth_saved
chmod 777    $x_auth_saved
echo "Saved X authority $auth to $x_auth_saved"

Restore as a new user with X.auth_restore:
#!/bin/bash
. X.auth_saved.inc
if [ -r $x_auth_saved ]; then
        echo "OK found \"$x_auth_saved\"" 1>&2
else
        echo "FAIL could not find \"$x_auth_saved\"" 1>&2
        exit 1
fi
auth=`cat $x_auth_saved`
echo "Restored X authority $auth from $x_auth_saved"
touch $HOME/.Xauthority
if ! xauth add $auth; then
        echo "FAIL: xauth add $auth failed, exiting..." 1>&2
        exit 1
else
        echo "OK xauth add $auth"
fi

Lastly, both scripts source X.auth_saved.inc, which holds the shared file name:
x_auth_saved=/tmp/X.auth_saved
Since the names are unwieldy, I aliased them to xas and xar:
alias xas=X.auth_save
alias xar=X.auth_restore
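Putting it all together, a session looks something like this (the user names, display, and cookie below are made up, and the target user needs the xar alias defined too):

```
you@host$ xas
Saved X authority host/unix:10 MIT-MAGIC-COOKIE-1 4f2a9c to /tmp/X.auth_saved
you@host$ su - builder
Password:
builder@host$ xar
OK found "/tmp/X.auth_saved"
Restored X authority host/unix:10 MIT-MAGIC-COOKIE-1 4f2a9c from /tmp/X.auth_saved
OK xauth add host/unix:10 MIT-MAGIC-COOKIE-1 4f2a9c
builder@host$ xclock &
```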

Monday, May 21, 2018

Memoization: an easy way to mock dependencies

Years ago I wrote about a simple method to get the benefits of memoization for any commandline application whose outputs are a function of its commandline arguments. I used this technique this year while implementing multivcs_query, a utility for examining software source which spans multiple (and possibly incompatible) source code repositories. The tests for multivcs_query call out to all the major source control repository types, including git, subversion, perforce and even a variant of ClearCase. Memoization is key for the tests to run fast because of the dispiriting slowness of some of the source control programs multivcs_query supports.*
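The core of the technique can be sketched in a few lines of shell. This is only an illustration of the idea, not the actual cache.pl; the memo function and the $TMP/cache.* layout here are my own:

```shell
#!/bin/bash
# Illustrative memoization wrapper: cache a command's stdout keyed by
# a cksum of its arguments (a sketch, not the post's cache.pl).
TMP=${TMP:-/tmp}

memo() {
        local id out
        id=$(echo "$*" | cksum | cut -d' ' -f1)
        out="$TMP/cache.$id"
        echo "$*" > "$out.cmd"          # companion file recording the command
        if [ ! -f "$out" ]; then
                "$@" > "$out"           # cache miss: run the command once
        fi
        cat "$out"                      # replay the saved output
}

memo echo hello    # first call actually runs echo
memo echo hello    # second call is served entirely from the cache
```

Any command whose output is a pure function of its arguments can be wrapped this way, which is what makes the slow version control queries tolerable on the second and subsequent calls.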

It occurred to me a bit later that memoization could serve a broader purpose in this instance. One of the least attractive aspects of multivcs_query is its testing dependency on the existence of such a variety of source control management systems, and on particular code lines which will surely not exist on most people's servers. I had thought about writing some sort of bootstrap code to establish simple code lines with enough content and history to support the tests I need to run, but of course that would be a significant amount of work, and it would still retain the unfortunate assumption that the source control systems involved are even installed on the local system -- a good bet with git, but probably not with any of the others. But if multivcs_query uses memoizing wrappers for its interactions with every source control system it uses, then all I need to do is seed the memoization cache with appropriate data and my tests will work even if none of the source control systems exist on the local server. For each request, the memoization layer sees a hit in the cache and immediately returns valid results, never even attempting to run the (possibly non-existent) version control system that is theoretically involved. And that means I can pursue development work on multivcs_query on a laptop which naturally has none of the server-side source control software installed.
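Seeding then amounts to writing the expected outputs under their cksum-derived names before the first test run. A hypothetical sketch -- the cache key format, repository URL, and svn output below are all made up for illustration:

```shell
#!/bin/bash
# Hypothetically seed one cache entry so the wrapped tool is never invoked.
TMP=${TMP:-/tmp}
key='svn log -l 1 http://svn.example.com/repo'       # assumed cache key
id=$(echo "$key" | cksum | cut -d' ' -f1)            # cksum-derived file name
printf 'r12345 | alice | 2018-01-01 | seeded\n' > "$TMP/cache.$id"
echo "$key" > "$TMP/cache.$id.cmd"   # companion .cmd file keeps the cache legible
```

With entries like this in place, a lookup keyed on the same string hits the cache immediately, whether or not svn is installed on the host.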

To make this beautiful vision a reality, several pieces of work were required.

  • I had to change cache.pl not to generate its cached result files by simply concatenating the inputs -- I was immediately exceeding the filename maximum length with calls referring to multiple source code files. So instead what I do is follow the old method to generate a (sometimes very long) string key, and then just call cksum with that key to make a unique ID.
  • Then, to avoid cache contents becoming unmanageably opaque, I also save a companion file with a .cmd suffix recording the actual command that was run.
  • Now that I want all test cases which are based on these version control system dependencies to use the cache, I am frequently looking at the current cache and extracting the appropriate files needed to be saved away for future successful test runs. Especially now that the cache files have names based on cksum values, it is no longer a simple matter to look at the cache directory and understand what is what. To make this situation transparent, I have implemented a simple utility cache.ls to list the contents of the cache and propose commands to copy the relevant files to a new location (presumably the folder holding the seed cache contents needed for successful test runs). Here is the code for cache.ls:
    #!/bin/bash
    search_args=$*
    
    if [ -z "$search_args" ]; then
            search_args=.
    fi
    
    for cf in `ls $TMP/cache* | grep -v -E '\.(cmd|err)$'`; do
            if cat $cf.cmd | grepm $search_args; then
                    cat $cf
                    echo EOD
                    echo "cp -p $cf* ."
                    echo '----------------------------------------------------------------------------------------------'
            fi
    done
    
  • Finally, it is important to initialize the cache with the appropriate data when running on a host for the first time. In my test wrapper, I added the following code to perform this one-time seeding:
    if [ ! -f $TMP/CACHE_SEEDED_FOR_TESTS ]; then
            echo "Initializing cache data for test runs on this host:"
            echo "cp -p test/cache_seed/* $TMP..."
            if !  cp -p test/cache_seed/* $TMP; then
                    echo "$0: cp -p test/cache_seed/* $TMP failed, exiting..." 1>&2
                    exit 1
            fi
            if ! touch $TMP/CACHE_SEEDED_FOR_TESTS; then
                    echo "$0: touch $TMP/CACHE_SEEDED_FOR_TESTS failed, exiting..." 1>&2
                    exit 1
            fi
    fi
    
So that's how it works. For completeness, following is the updated cache.pl code using cksum to generate the cache filenames:
use strict;
use IO::File;

my $__trace = 0;

sub get_cached_output_path
{
  my($extra_key, $s) = @_;

  my $key = $extra_key . $s;
  
  my $fn_base = `echo $key | cksum`;
  chomp $fn_base;
  $fn_base =~ s/ .*//;
  
  my $fn = "$ENV{'TMP'}/cache." . $fn_base;
  
  my $f = new IO::File("$fn.cmd", "w");
  $f->write($key);
  $f->close();

  return $fn;
}

my @argv = @ARGV;
my $extra_key = $ENV{"CACHE_EXTRA_ARG"};
$extra_key = "" if !defined $extra_key;

if ($argv[1] eq "-cache-clear")
{
  my $cached_output_stem = get_cached_output_path($extra_key, $argv[0]);
  die "empty output stem" unless $cached_output_stem;
  my $cmd = "rm -f $cached_output_stem* 2> /dev/null";
  print "$cmd\n" if $__trace;
  print `$cmd`;
  exit(0);
}


# Reassemble the command line, quoting each argument:
my $cmd = join('" "', @argv);

# Trim empty trailing arguments and complete the outer quoting:
$cmd =~ s/(" ")*$//g;
$cmd = '"' . $cmd . '"';

# Remove the quotes around simple tokens that don't need them:
$cmd =~ s/"([\w_#,\.\/]+)"/$1/g;

print "cmd=$cmd\n" if $__trace;

my $cached_output = get_cached_output_path($extra_key, $cmd);

if (-f $cached_output)
{
  print "using existing $cached_output\n" if $__trace;
}
else
{
  my $cmd_with_redirects = "$cmd > $cached_output 2> $cached_output.err";
  `$cmd_with_redirects`;
  if ($__trace)
  {
    print "Executed $cmd_with_redirects\n";
  }
  if ( `cat $cached_output.err` eq '' )
  {
    if ($__trace)
    {
      print "No error output, so deleting $cached_output.err\n";
    }
    unlink "$cached_output.err";
  }
}
print `cat $cached_output`;
if (-f "$cached_output.err" )
{
  print STDERR `cat $cached_output.err`;
  # assume trouble if there was output to stderr, and remove the cached output:
  unlink "$cached_output.err";
  unlink $cached_output;
}

* It really is strange -- svn in particular has about a 2 second overhead for me no matter how simple my call. I'm guessing this is some sort of pathological misconfiguration of the local subversion server, but I don't control it and it is tangential enough to the central purpose of multivcs_query that I can't justify launching a campaign to improve it. But thanks to memoization, I don't have to care too much.

Friday, March 18, 2016

simple grep replacement for sorted data

The idea here is that if you are looking at a file whose lines are sorted, and you have some idea what the lines you're interested in will start with, you should be able to pull out your target efficiently, even if the file is large.

For example, many log files start with a timestamp composed in a sortable order (i.e., year, month, day, hour, minute, second), and you may well have a fairly precise idea of what you want (e.g., from 12:02:05 on March 3, 2016 and the 10 seconds following). Normal grep will scan the entire file, which is a painful exercise if the file is many gigabytes. sgrep, in contrast, will seek to the interesting region and then search for a regular expression pattern only within that region.

So for the example given above, one would search as follows if looking at an Artifactory request log:

        sgrep 20160303120205 20160303120215 some_regex_pattern /private/artifactory/logs/request.log
Here's the code to do it; for me it turned searches that took minutes into sub-second lookups:
class Sgrep
        attr_accessor :patt
        attr_accessor :beginning_of_significance
        attr_accessor :ending_of_significance
        attr_accessor :fn
        attr_accessor :f
        def initialize(beginning_of_significance, ending_of_significance, patt, fn)
                self.beginning_of_significance = beginning_of_significance
                self.ending_of_significance = ending_of_significance
                self.patt = Regexp.new(patt)
                self.fn = fn
                if !File.readable?(fn)
                        STDERR.puts "sgrep: #{fn}: No such file or directory"
                        exit(1)
                end
                self.f = File.open(fn, "r")
                puts "sgrep looking for \"#{patt}\" in file #{fn}, bounded by the significant region starting with \"#{beginning_of_significance}\" and ending with \"#{ending_of_significance}\"..." if Sgrep.trace
        end
        def search_sequentially_from(pos)
                if pos > 0
                        f.seek(pos-1, IO::SEEK_SET)
                        self.seek_next_line
                else
                        f.seek(pos, IO::SEEK_SET)
                end
                while !self.f.eof? do
                        next_line_start = self.f.tell
                        line = self.f.gets
                        if line.start_with?(self.beginning_of_significance) || line > self.beginning_of_significance
                                f.seek(next_line_start, IO::SEEK_SET)
                                puts "searched sequentially to #{next_line_start} (seeing #{line})" if Sgrep.trace
                                return
                        end
                end
        end
        def seek_beginning_of_significance()
                lower_bound = 0
                upper_bound = File.size(self.fn)
                puts "lower_bound=#{lower_bound}, upper_bound=#{upper_bound}" if Sgrep.trace
                while upper_bound > lower_bound do
                        midpoint = (lower_bound + ((upper_bound - lower_bound) / 2)).to_i
                        f.seek(midpoint, IO::SEEK_SET)
                        self.seek_next_line
                        next_line_start = self.f.tell
                        line = self.f.gets
                        puts "see #{line.chomp}, lower_bound=#{lower_bound}, upper_bound=#{upper_bound}, midpoint=#{midpoint}, next_line_start=#{next_line_start}" if Sgrep.trace
                        if line.start_with?(self.beginning_of_significance)
                                puts "match" if Sgrep.trace
                                if upper_bound > next_line_start
                                        upper_bound = next_line_start
                                else
                                        break
                                end
                        elsif line < self.beginning_of_significance
                                puts "under" if Sgrep.trace
                                lower_bound = self.f.tell+1
                        else
                                puts "over" if Sgrep.trace
                                upper_bound = midpoint-1
                        end
                end
                self.search_sequentially_from(lower_bound)
        end
        def seek_next_line()
                while !self.f.eof? do
                        if self.f.getc == "\n"
                                return
                        end
                end
        end
        def significant_lines
                while !self.f.eof? do
                        line = f.gets
                        if line.start_with?(self.ending_of_significance) || line < self.ending_of_significance
                                yield line
                        else
                                return
                        end
                end
                return
        end
        def search()
                #return `grep "#{self.patt}" #{self.fn}`
                self.seek_beginning_of_significance

                exit_code = 1
                self.significant_lines do | line |
                        if self.patt.match(line)
                                exit_code = 0
                                print line
                        end
                end
                return exit_code
        end
        class << self
                attr_accessor :trace
        end
end

j = 0
while ARGV[j] && ARGV[j].start_with?("-") do
        case ARGV[j]
        when "-v"
                Sgrep.trace = true
        end
        j += 1
end
beginning_of_significance = ARGV[j]
ending_of_significance = ARGV[j+1]
patt = ARGV[j+2]
fn = ARGV[j+3]
g = Sgrep.new(beginning_of_significance, ending_of_significance, patt, fn)
exit(g.search())
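For comparison, the same selection can be approximated with a plain awk scan. This forfeits sgrep's seek-to-region speed advantage, and the boundary handling differs slightly from sgrep's; the log contents below are made up:

```shell
#!/bin/bash
# Build a tiny sorted log in the post's timestamp format:
printf '20160303120201 GET /a\n20160303120207 GET /b\n20160303120230 GET /c\n' > /tmp/request.log
# Keep only the significant region, then filter by pattern:
awk -v lo=20160303120205 -v hi=20160303120215 '$1 >= lo && $1 < hi' /tmp/request.log | grep 'GET /b'
# → 20160303120207 GET /b
```

On a multi-gigabyte file the awk version still reads every line, which is exactly the cost sgrep's binary search avoids.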

Sunday, June 7, 2015

Generating the simplest possible nginx load-balancing config

I recently had to put together a load-balancing configuration for nginx and wished I had a shell script to generate the very simplest possible nginx setup. I think I have it now:

:
if [ -z "$1" ]; then
        echo "Usage: $0 URI HOST1 HOST2 ..."
        exit 1
fi
uri="$1"
shift
hosts=$@
out=/etc/nginx/nginx.conf
if [ ! -f $out.bak ]; then
        mv $out $out.bak
fi
cat <<EOF > $out
events {
}
http {
        upstream configuration_servers {
EOF

for h in $hosts; do
        echo "                server $h:8080;"
done >> $out

cat <<EOF >> $out
        }
        server {
                listen       80;
                location $uri {
                proxy_pass http://configuration_servers$uri;
                }
        }
}
EOF

cat $out

service nginx restart

url=localhost:80$uri
if ! curl $url; then
        echo "$0: curl $url failed, exiting..." 1>&2
        exit 1
fi
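For reference, invoking the script with, say, URI /api and hosts host1 host2 (all example values) writes an nginx.conf along these lines:

```
events {
}
http {
        upstream configuration_servers {
                server host1:8080;
                server host2:8080;
        }
        server {
                listen       80;
                location /api {
                proxy_pass http://configuration_servers/api;
                }
        }
}
```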

Monday, January 12, 2015

Sending strangers and anonymous callers to voicemail in Google Voice

I normally only write about software development in this blog, but I can't resist adding a software configuration recipe that I have found useful for cutting back on unwanted callers getting through to me via Google Voice and my Android phone. This problem recently became much worse when I acquired a new phone number which had previously been used by a Bay Area woman named Melody, who apparently ran up lots of debts.

I can understand that you would need a won't-take-no-for-an-answer kind of personality to work as a debt collector, but wow, those people are not fun to talk to. But I didn't want to just block strangers and anonymous callers, since once in a while those calls are legitimate. Doctors' offices, for example, typically call anonymously in order to protect the privacy of their patients. So what I really wanted was to send all those folks directly to voicemail.

Sounds simple enough, and I was readily assured that this was possible, but I never did catch up with an explicit recipe to do it.

So here is one:

  • Browse to Google Voice settings
  • On the Phones tab, disable all of your devices (i.e., uncheck the checkbox associated with each)
  • On the Voicemail & Text tab, click the edit button for the special Callers group "All Contacts"
  • Enable all devices that you want to ring when one of your contacts calls you
That's it!

Sunday, May 25, 2014

Network storage performance spanning multiple clouds

In previous posts I've looked at different factors in performance for s3 and Google storage. I looked first at the viability of s3 as a web application data store, and then compared s3's performance for this purpose with Google storage. For all of my benchmarks I allocated virtual machines on the farm hosted by the cloud storage provider, i.e., to test s3 I used an ec2 box, and to test Google storage I used a Google compute engine box. This approach implies that one would always use a single provider for both storage and VMs (and presumably all other cloud functions, e.g., load balancing, etc.), but there are some drawbacks to relying on a single source for all of these services.

Using a single source makes you vulnerable to outages at that source. I almost never hear this discussed, I believe because it is considered a legitimate excuse for your application to be down if Amazon is down. Hey, if Netflix puts up with it, then clearly this isn't a problem faced only by idiots, right? But wouldn't it be nice to be sufficiently decoupled that a major outage upstream doesn't sink your application?

The second drawback is the inability to compare prices between providers. Although Amazon and Google and others post prices for various things, for many essential aspects of the service it is practically impossible to compare without having a running instance of your application on the infrastructure of each provider. The only exception to this is if your need is particularly simple; if, for example, you just need to store large amounts of data persistently without a care for performance, then you can just look up the price sheets and see the price per gigabyte for a provider's cheapest storage. But for most applications, the needs are more varied and complex and there is a large performance component that must be evaluated. The different cloud providers run different hardware, run their servers on networks with different capabilities, and charge for a dizzying variety of different metrics which make apples-to-apples comparisons practically impossible.

A trivial example of this billing complexity can be seen in the invoice spreadsheet Amazon sends out for even the simplest configuration. A single VM consumes resources which are described by a spreadsheet containing hundreds of rows. I think in Amazon's case this is an intentional obfuscation of the costs designed to impede the commodification of their business that might follow from arranging charges that could be compared.

My vote is to avoid the quagmire of a deep analysis of the cloud services invoices; an easier approach is to install your application everywhere, measure the work being done by each cloud provider from within your application, and later reconcile that record of work against the costs incurred. Since any serious web application will need to track performance and volume of work anyway, it is little extra work to measure these quantities across different server farms. Then it is just a matter of evaluating the work done versus the expense paid, and one can have a precise, real-world reckoning of the relative values of the different cloud providers.

Of course determining the relative value of cloud offerings is not the only question one faces in determining how to deploy. It is also an open question whether the performance yielded by mixing services across multiple clouds will be viable. In my benchmarks it appears that cloud storage accessed from ec2 or gce is adequate for an interactive application. To give some context to the numbers I also benchmarked performance from my home over a Comcast line; the results with this last option were not great (though not crazily bad either).

The numbers generally follow the pattern that we've already seen, with a couple of interesting variations. The biggest surprise for me was that small and medium sized s3 reads executed faster on Google machines than on Amazon's own ec2. That result really doesn't pass the smell test; I wonder whether some temporary network anomaly combined with the short duration of the tests to yield this not-very-credible finding. But I feel more comfortable answering the broader question of the viability of mixing cloud services across providers based on these numbers: although there is a performance hit to spanning cloud providers, it is not significant for small data sizes. Here are the results:

rate/s  t (s/op)  op type  storage type  host type  chunk size
46.5    0.02      read     gs            gce        small
36.0    0.03      read     gs            gce        medium
20.5    0.05      read     gs            gce        large
14.7    0.07      read     s3            gce        small
14.1    0.07      read     s3            gce        medium
11.4    0.09      read     gs            ec2        medium
10.3    0.10      read     s3            ec2        small
10.3    0.10      read     gs            ec2        small
10.2    0.10      read     s3            ec2        medium
8.1     0.12      read     s3            comcast    small
8.0     0.12      read     gs            comcast    small
7.5     0.13      read     gs            ec2        large
7.0     0.14      write    s3            gce        small
6.2     0.16      write    s3            gce        medium
6.2     0.16      read     s3            gce        large
6.1     0.16      write    gs            gce        small
5.9     0.17      read     gs            comcast    medium
5.4     0.18      write    gs            gce        medium
5.2     0.19      write    s3            ec2        small
4.8     0.21      read     s3            comcast    medium
4.6     0.22      write    s3            comcast    small
4.5     0.22      write    s3            ec2        medium
4.5     0.22      write    gs            gce        large
4.3     0.23      read     s3            ec2        large
3.4     0.30      write    s3            gce        large
3.1     0.32      write    gs            comcast    small
2.9     0.35      write    s3            ec2        large
2.5     0.40      write    gs            ec2        medium
2.4     0.41      write    gs            comcast    medium
2.3     0.44      read     gs            comcast    large
2.1     0.47      write    gs            ec2        large
1.9     0.53      write    s3            comcast    medium
1.8     0.55      write    gs            ec2        small
1.6     0.64      read     s3            comcast    large
0.6     1.69      write    gs            comcast    large
0.4     2.25      write    s3            comcast    large