02 May 2014

Integrating Integrity, part 1

One of the most important parts of any backup solution is being able to identify when files have become corrupt due to a failing disk. Ideally, we'd be able to identify impending failure before the excrement hits the rotational cooling device, but we don't always have that luxury. I intend to cover things like S.M.A.R.T. disk checks in a later post; for today, I want to address per-file integrity checking. Because the only thing worse than having no backup is having a backup of the already-corrupt data.

md5sum has been my go-to tool for this in the past. The 128-bit checksums it generates are far less likely to miss corruption than the old 32-bit CRC32 method, and it's ubiquitous. While it is important to realise that MD5 is not a cryptographically secure checksum and cannot protect against malicious tampering, it is a very effective way to check a file for accidental damage. However, while the venerable md5sum command works perfectly well, I want to make my own version with a few improvements I keep finding myself wishing for.
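As a quick sanity check that Perl's Digest::MD5 - the module I'll build on below - produces exactly the same hex digests as md5sum, a throwaway snippet like this does the job. Slurping the whole file is fine for a one-off test, but not something the real script will do with big files:-
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Read the file raw, hash the lot, print an md5sum-style line.
open(my $fh, '<:raw', $ARGV[0]) or die "Couldn't open $ARGV[0], $!\n";
my $data = do { local $/; <$fh> };
print md5_hex($data), "  $ARGV[0]\n";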


Eternal Vigilance

To start things off, I'm going to write a script (in Perl 5, naturally) called vigil.pl. Its job will be to create and check .md5sum files containing the sums of all the files in a given directory. So why not just use the actual md5sum program? Well, for starters, when md5sum verifies a checksum file it simply processes one file per line, printing whether the checksum matches and moving on to the next file. This means that when checking a bunch of small files, the list scrolls by so quickly you can't see which files failed, unless you redirect the output to a file and review it later. It also means that for really large files, you get no indication of progress until it finishes. I aim to correct that, and produce a summary at the end of each run that tells me how many files passed, how many failed, and which ones. We could also display a progress bar that estimates the time until the whole job is completed.

Here's the 'boilerplate' code of my first cut of the program:-
#!/usr/bin/perl -CSDA

use strict;
use warnings;
use utf8;
use Getopt::Long;
use Digest::MD5;
use Term::ANSIColor;

# Define our command-line arguments.
my %opts = ( 'blocksize' => 16384 );
GetOptions(\%opts, "verify=s", "create=s", "update=s", "files", "blocksize=s", "help!");

# Fix up --blocksize argument if any; accepts plain bytes or k/m suffixes, e.g. 65536, 64k or 4m.
my %suffix_multiplier = ('' => 1, 'k' => 1024, 'm' => 1024 * 1024);
$opts{blocksize} = ($opts{blocksize} =~ /^(\d+)([km]?)$/i) ? $1 * $suffix_multiplier{lc $2} : 0;
die "--blocksize must be a number greater than 0, optionally with a k or m suffix!\n" unless $opts{blocksize} > 0;


sub help
{
   print "Usage:-\n";
   print "  $0 --verify filename.md5sum\n";
   print "  $0 --create filename.md5sum\n";
   print "  $0 --update filename.md5sum\n";
   print "  $0 --files file_to_check ...\n";
   print "\n";
   print "     --blocksize <size>{k,m}\n";
   exit 1;
}

...

sub main
{
   if ($opts{files}) {
      # Just print out the sums of all the files given as @ARGV.
      display_file_sums(@ARGV);
   } else {
      help();
   }
}


help() if $opts{help};
main();
We turn on UTF-8 handling with the -CSDA switch on the shebang line: S makes the standard IO streams UTF-8, D makes UTF-8 the default layer for other file handles, and A decodes the command-line arguments in @ARGV. I've probably forgotten something important here, but oh well; getting something done and building up momentum is more important for me right now. The modules we're using are Getopt::Long, Digest::MD5, and Term::ANSIColor. Similar to writing comments before I write functions, I like to sketch out what the command-line options should look like as a way of thinking through how the program will be used.
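If I preferred to spell that out inside the script rather than on the shebang line, something like this would be roughly equivalent - just a sketch; vigil.pl sticks with the flag:-
use open qw( :std :encoding(UTF-8) );        # the S and D parts: std handles plus default layers for new ones
use Encode qw(decode);
@ARGV = map { decode('UTF-8', $_) } @ARGV;   # the A part: decode the command-line arguments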

To get the ball rolling, I want to implement a simple --files option that will treat all other arguments as filenames for which we want to compute and print an MD5 sum. This way we can check that we're actually computing the checksums properly. I'm going to use the same basic idea as the file-renaming script I made: populate a @plan list with things we should be doing, then act on that plan. Each entry in the @plan list will be a hash reference, with keys for the filename, the size, and the correct md5 if one is known. Getting the size of each file in advance is important, so that we can accurately estimate our progress and time until completion.
# Plan entries are hashrefs with keys: filename, size (bytes) and correct_md5 (optional).
# run_md5_file later adds: computed_md5, start_time (unix timestamp), elapsed_time (secs) and elapsed_bytes.
# This way we can attempt an accurate time estimate.
sub create_plan_for_filename
{
   my ($filename, $correct_md5) = @_;

   my $plan_entry = {};
   $plan_entry->{filename} = $filename;

   if (-f $filename) {
      $plan_entry->{size} = -s $filename;
   } else {
      # Doesn't exist or is a directory or a socket or something. Die?
   }

   # If the md5 is supplied by the caller, then we're checking a file. If not, we're probably creating one.
   if (defined $correct_md5) {
      $plan_entry->{correct_md5} = $correct_md5;
   }

   return $plan_entry;
}


# No matter whether we're verifying or creating a checksum, we're basically gonna run the same thing over all the
# files given to us in the plan: stream bytes through md5, report progress as we do so.
# Only once the plan has been executed do we *really* need to report which files don't match, although a running
# alert for failing files would be nice too.
sub run_md5_file
{
   my ($plan_entry, $progress_fn) = @_;

   # We use the OO interface to Digest::MD5 so we can feed it data a chunk at a time.
   my $md5 = Digest::MD5->new();
   my $current_bytes_read = 0;
   my $buffer;
   $plan_entry->{start_time} = time();
   $plan_entry->{elapsed_time} = 0;
   $plan_entry->{elapsed_bytes} = 0;

   # 3 argument form of open() allows us to specify 'raw' directly instead of using binmode and is a bit more modern.
   open(my $fh, '<:raw', $plan_entry->{filename}) or die "Couldn't open file $plan_entry->{filename}, $!\n";

   # Read the file in chunks and feed into md5.
   while ($current_bytes_read = read($fh, $buffer, $opts{blocksize})) {
      $md5->add($buffer);
      $plan_entry->{elapsed_bytes} += $current_bytes_read;
      $plan_entry->{elapsed_time} = time() - $plan_entry->{start_time};
      &$progress_fn($plan_entry->{elapsed_bytes});
   }
   # The loop will exit as soon as read() returns 0 or undef. 0 is normal EOF, undef indicates an error.
   die "Error while reading $plan_entry->{filename}, $!\n" if ( ! defined $current_bytes_read);

   close($fh) or die "Couldn't close file $plan_entry->{filename}, $!\n";

   # We made it out of the file alive. Store the md5 we computed. Note that this resets the Digest::MD5 object.
   $plan_entry->{computed_md5} = $md5->hexdigest();
}
I could use Moose and make a proper class for a "plan entry" - I wish I could think of a better name for them but I can't - but this is a simple script and a simple hashref will do. The create_plan_for_filename sub can take a filename and optionally the correct md5 as recorded by a .md5sum file somewhere, and return a hash reference complete with keys describing the initial state of the entry. We can pass these entries to the run_md5_file sub, which will create a Digest::MD5 object, feed it data, and then set additional keys in the plan entry - the computed md5 and how long it took. For run_md5_file, we also pass in a subroutine reference via the $progress_fn scalar variable. This is a callback function that we can give to run_md5_file for it to report back on progress as it loops through the file data.
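Just to make the shape of the data concrete, a finished plan entry might end up looking something like this - the filename and values here are entirely made up:-
my $example_entry = {
   filename      => 'photos.tar',                          # hypothetical file
   size          => 1_048_576,                             # gathered up front by create_plan_for_filename
   correct_md5   => '0123456789abcdef0123456789abcdef',    # only present when verifying against a stored sum
   computed_md5  => '0123456789abcdef0123456789abcdef',    # filled in by run_md5_file
   start_time    => 1_398_971_520,                         # unix timestamp when hashing started
   elapsed_time  => 3,                                     # seconds spent hashing
   elapsed_bytes => 1_048_576,                             # bytes fed to Digest::MD5 so far
};
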
sub display_file_sums
{
   my (@filenames) = @_;

   # Initialise our plan.
   my @plan = map { create_plan_for_filename($_) } @filenames;

   # Do the sums.
   foreach my $plan_entry (@plan) {
      print STDERR "Computing MD5 of $plan_entry->{filename}: ";
      run_md5_file($plan_entry, sub { my $progress = shift; print STDERR "."; } );
      print STDERR " done\n";
   }

   # Show the user.
   print_plan_sums(@plan);
}


sub print_plan_sums
{
   my (@plan) = @_;
   foreach my $plan_entry (@plan) {
      next unless $plan_entry->{computed_md5};

      my $sumcolour = "cyan";
      if ($plan_entry->{correct_md5}) {
         $sumcolour = ( $plan_entry->{computed_md5} eq $plan_entry->{correct_md5} ) ? "green" : "red";
      }

      print colored($plan_entry->{computed_md5}, $sumcolour) . "  " . $plan_entry->{filename}, "\n";
   }

   #### TODO: Show files with errors. Separate sub?
}
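That TODO is where the pass/fail summary I promised earlier will eventually live. As a rough sketch of the kind of thing I have in mind - not wired in anywhere yet, and print_plan_failures is just a placeholder name:-
sub print_plan_failures
{
   my (@plan) = @_;
   # Only entries that had a known-good md5 have actually been 'checked'.
   my @checked = grep { $_->{correct_md5} && $_->{computed_md5} } @plan;
   my @failed  = grep { $_->{computed_md5} ne $_->{correct_md5} } @checked;

   printf "%d of %d checked files OK.\n", scalar(@checked) - scalar(@failed), scalar(@checked);
   print colored("FAILED: $_->{filename}", "red"), "\n" for @failed;
}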
display_file_sums (coming up with pithy subroutine and variable names is hard, ok?) is responsible for handling the "compute the checksum of these given files and just show me what they are" action we invoke with --files. We can easily convert our list of @filenames into a list of plan entries using Perl's map built-in and the create_plan_for_filename sub we made earlier. Then we act on each entry of that plan, computing the checksum and reporting progress via a very hastily-written subroutine that just prints a dot character for every block of data that gets processed:
sub { my $progress = shift; print STDERR "."; }
Finally we print things out to the terminal in print_plan_sums, adding some code to colour things prettily depending on whether the computed checksums match the known-good checksums - which, for now, they never can, because we have no code to load the known-good sums in yet. Now that this last function is written, we can run the program and check that it produces the correct md5 for files:-

james@yang(): ~/bin
$ ./vigil.pl --files vigil.pl 
Computing MD5 of vigil.pl: . done
0b8673f3a80536d3cf1fcc7477b68bd6  vigil.pl

james@yang(): ~/bin
$ md5sum vigil.pl
0b8673f3a80536d3cf1fcc7477b68bd6  vigil.pl
So far, so good!

Making Progress

The first thing I want to do is add a proper progress bar. A stream of dots tells us that the program hasn't crashed or stalled, but it otherwise isn't super helpful. I'm extremely tempted to write my own custom progress bar code, but for the moment let's fight NIH syndrome and use a Perl module someone else has written. Term::ProgressBar seems handy; let's try that one. Using it is pretty simple: we add use Term::ProgressBar; to the list of modules at the top of the script, then modify the display_file_sums sub to initialise the progress bar and supply an appropriate callback function to update it instead of our initial "stream of dots" version:-
sub display_file_sums
{
   my (@filenames) = @_;

   # Initialise our plan.
   my @plan = map { create_plan_for_filename($_) } @filenames;

   # How big is this going to be?
   my $total_size = get_plan_size(@plan);
   my $total_progress_so_far = 0;      # updated after each file is processed.

   # Initialise the progress bar.
   my $bar = Term::ProgressBar->new({
            name => "Computing MD5",
            count => $total_size,
            ETA => 'linear',
         });
   $bar->minor(0);      # No asterisks used for progress animation.
   my $next_bar_update = 0;      # used to ensure we don't spend more time drawing the bar than doing actual work.

   # Do the sums.
   foreach my $plan_entry (@plan) {
      $bar->message($plan_entry->{filename});
      run_md5_file($plan_entry,
            sub {
               my $progress = $total_progress_so_far + $_[0];
               if ($progress > $next_bar_update) {
                  $next_bar_update = $bar->update($progress);
               }
            });

      $total_progress_so_far += $plan_entry->{size};
      $next_bar_update = $bar->update($total_progress_so_far);
   }
   $bar->update($total_progress_so_far);
   print STDERR "\n";

   # Show the user.
   print_plan_sums(@plan);
}


sub get_plan_size
{
   my (@plan) = @_;
   my $total_size = 0;
   foreach my $entry (@plan) {
      $total_size += $entry->{size} // 0;
   }
   return $total_size;
}
The only other addition is a small subroutine to calculate the total size (in bytes) of our plan. Once we have that, we can initialise our $bar with Term::ProgressBar->new, and the body of our callback now looks like this:-
my $progress = $total_progress_so_far + $_[0];
if ($progress > $next_bar_update) {
   $next_bar_update = $bar->update($progress);
}
The awesome thing about using anonymous subs as callbacks like this in Perl is that while $progress and $_[0] are local to the subroutine, it can still reference variables from the enclosing lexical scope - it doesn't lose access to them just because it's being called as &$progress_fn() from run_md5_file. We don't need to set up any complex system for exposing $total_progress_so_far, $next_bar_update and $bar to it; we just use them as we naturally would.

Random side-note: the anonymous sub is actually a closure - it captures the variables it uses (such as $total_progress_so_far) when it is created. So even if we returned it from another sub and those variables fell out of scope, the sub would keep its own reference to them alive. It's nice how it just does what you'd expect rather than referencing something that no longer exists.
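A minimal toy example of that behaviour - nothing to do with vigil.pl itself, just a counter:-
sub make_counter
{
   my $count = 0;                      # lexical that goes out of scope when make_counter returns...
   return sub { return ++$count; };    # ...but the returned sub keeps it alive
}

my $counter = make_counter();
print $counter->(), "\n";    # prints 1
print $counter->(), "\n";    # prints 2 - $count still exists, private to this closure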

Here's what the new progress bar looks like:-
james@yang(): /yang/data/random_backups
$ vigil.pl --files sumomo-20140216-svnrdump.dump 
sumomo-20140216-svnrdump.dump                                                                                                  
Computing MD5:  62% [=========================================                        ]0m03s Left
Not bad! And for a 1.5G file, decently fast. Not as fast as the (presumably C) md5sum program, but at least this one tells me how much longer I'll need to wait! Our Perl version is probably I/O bound, and could be tuned by adjusting the block size used - which is why I added --blocksize as a parameter. But premature optimisation is the devil, and it's better to get all the features sorted out first. This post is already pretty long, and the basic functionality works; I'll make another post later where we add things like checking a whole file full of md5 checksums. Go to part 2.
