30 May 2014

Integrating Integrity, part 2

Last post, I made a Perl script to generate MD5 checksums for me while displaying a progress bar. Now I want to expand it so that it can generate a .md5sum file listing the MD5 of everything in a given directory, or check every file in such a list to see whether its actual MD5 matches the 'correct' one. I will also set things up so that any checksum mismatches or other errors are reported at the end of the run, so that they aren't pushed off the terminal's scrollback buffer when working with a large list of files.


Making a list, checking it twice

The file format I will support initially is the kind created by md5sum my file names > file.md5sum. This way, I can use vigil.pl to check any of these files I come across, and if I'm somehow without my Perl version I can always fall back on venerable ol' md5sum. Contrary to what the manpage suggests, the format of these files is the hexadecimal md5 checksum, followed by exactly two spaces, then the file name. The manpage suggests a single space character is used for text files, and an asterisk for binary files, but I've never seen the asterisk variant in the wild - it might only matter on systems that have a separate mode for opening text files, like MS Windows. I'll just stick with the spaces, and make some simple subroutines to slurp a .md5sum file into a list of hashrefs, and to write out a .md5sum file based on the list of hashrefs we have in our @plan.
sub load_md5sum_file
{
   my ($filename) = @_;
   my @plan;

   # Yep, we're assuming everything is utf8. Things which are not utf8 cause you pain. Horrible icky pain.
   open(my $fh, '<:utf8', $filename) or die "Couldn't open '$filename' : $!\n";
   my $linenum = 0;
   while (my $line = <$fh>) {
      chomp $line;
      $linenum++;
      if ($line =~ /^(?<md5>\p{ASCII_Hex_Digit}{32})  (?<filename>.*)$/) {
         # Checksum and filename compatible with md5sum output.
         push @plan, create_plan_for_filename($+{filename}, $+{md5});

      } elsif ($line =~ /^(?<md5>\p{ASCII_Hex_Digit}{32}) (?<filename>.*)$/) {
         # Checksum and filename compatible with md5sum's manpage but not valid for the actual program.
         # We'll use it, but complain.
         print STDERR colored("Warning: ", 'bold red'), colored("md5sum entry '", 'red'), $line, colored("' on line $linenum of file $filename is using only one space, not two - this doesn't match the output of the actual md5sum program!", 'red'), "\n";
         push @plan, create_plan_for_filename($+{filename}, $+{md5});

      } elsif ($line =~ /^\s*$/) {
         # Blank line, ignore.

      } else {
         # No idea. Best not to keep quiet, it could be a malformed checksum line and we don't want to just quietly skip the file if so.
         print STDERR colored("Warning: ", 'bold red'), colored("Unrecognised md5sum entry '", 'red'), $line, colored("' on line $linenum of file $filename.", 'red'), "\n";
         push @plan, { error => "Unrecognised md5sum entry" };
      }
   }
   close($fh) or die "Couldn't close '$filename' : $!\n";

   return @plan;
}


sub save_md5sum_file
{
   my ($filename, @plan) = @_;

   open(my $fh, '>:utf8', $filename) or die "Couldn't write to '$filename' : $!\n";
   foreach my $plan_entry (@plan) {
      next unless $plan_entry->{correct_md5};
      next unless $plan_entry->{filename};
      print $fh "$plan_entry->{correct_md5}  $plan_entry->{filename}\n";
   }
   close($fh) or die "Couldn't close '$filename' : $!\n";
}
Loading the file basically just iterates over each line of the file with <$fh> and checks the line against a regular expression that captures the fields we want. Here's the expression in all its Perly glory:-
^(?<md5>\p{ASCII_Hex_Digit}{32})  (?<filename>.*)$
I'm using a feature introduced back in Perl 5.10 - named capture groups. Instead of having a bunch of parens capturing various parts of your pattern and then having to count them out to determine which is $1, $2 etc, you can simply give them a name and refer to them as $+{name}. Furthermore, I'm using \p which lets you test for any character matching a particular Unicode property - it makes things a little more explicit about what I'm looking to match.
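
As a small standalone illustration of named captures, here is a minimal sketch that applies the same expression to a single line. The helper name parse_md5sum_line and the sample line are made up for this example and are not part of vigil.pl.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration: parse one line of md5sum output.
# Returns (md5, filename), or an empty list if the line does not match.
sub parse_md5sum_line {
    my ($line) = @_;
    if ($line =~ /^(?<md5>\p{ASCII_Hex_Digit}{32})  (?<filename>.*)$/) {
        # %+ holds the named captures from the last successful match.
        return ($+{md5}, $+{filename});
    }
    return;
}

my ($md5, $name) = parse_md5sum_line('d41d8cd98f00b204e9800998ecf8427e  empty file.txt');
print "md5=$md5 name=$name\n";
```

Because the filename group is just .* after the two spaces, names containing spaces come through intact.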

With our new ability to load and save .md5sum files, I moved on to writing a verify_file_sums sub that would go through each file referenced in that document and check it, again with a progress bar. However, it's basically a carbon copy of display_file_sums with a few minor changes. This kind of code duplication is no good, especially as we'll be making a create_file_sums which also does the same basic run through a list of files while reporting progress back to the user. We need to rip out the essence of these three functions and make something generic that can do the bulk of the work. Since they're all basically calling run_md5_file repeatedly while managing the progress bar, I guess I'll call it run_md5_files_with_progress. Yeah, inspiration isn't flowing strongly today.
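
As a rough illustration of the shape such a generic runner might take (this is not the actual vigil.pl code), here is a stripped-down sketch. The _sketch suffix, the plain "n/total" counter standing in for the real progress bar, and the use of Digest::MD5's addfile in place of run_md5_file are all assumptions made for the example; the plan fields filename, computed_md5 and error follow the names used elsewhere in this post.

```perl
use strict;
use warnings;
use Digest::MD5;

# Illustrative stand-in for the generic runner: walk the plan, checksum each
# file, and record either the result or the error on the plan entry itself.
sub run_md5_files_with_progress_sketch {
    my ($label, @plan) = @_;
    my $done = 0;
    foreach my $entry (@plan) {
        $done++;
        next unless defined $entry->{filename};
        print STDERR "\r$label: $done/" . scalar(@plan);
        if (open(my $fh, '<:raw', $entry->{filename})) {
            $entry->{computed_md5} = Digest::MD5->new->addfile($fh)->hexdigest;
            close($fh);
        } else {
            # Keep the failure on the entry so it can be reported at the end of the run.
            $entry->{error} = "Couldn't open file: $!";
        }
    }
    print STDERR "\n" if @plan;
}
```

Since the plan is a list of hashrefs, the caller sees the computed_md5 and error attributes on its own entries after the call, which is what lets the display, verify and create subs share one loop.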

One Ring to find them

With run_md5_files_with_progress taking care of the essence of doing all the checks and reporting progress back to the user, the subroutines that get called based on the kind of command-line action requested become much smaller and leaner. verify_file_sums just needs to build a plan by calling load_md5sum_file, and the original display_file_sums is just mapping the filenames from @ARGV. The act of creating a brand new .md5sum file from a big list of directories and files is more complicated. We want to be able to supply a directory name and have the program traverse that directory recursively, including all the files it finds along the way. To accomplish this, we pull in the Perl module File::Find.

File::Find's usage is pretty simple. You call find() in one of several ways to supply some options, a list of directories and files, and most importantly a 'wanted' function to be used as a callback. In our case, we want to define it as an anonymous subroutine written right there in-line with the call to find() so that I can easily reference the @plan I want to build up. The 'wanted' function will be called by File::Find at each file and directory it traverses. Here's what the code that deals with the --create option looks like now:-
sub create_file_sums
{
   my ($md5sum_filename, @dirs_and_files_to_sum) = @_;

   # Initialise our plan from the files and directories given on the commandline.
   # Traversing a big tree recursively could take a while, so let the user know.
   print STDERR colored("Constructing plan...", 'yellow'), "\n";
   my @plan;
   find( {
            # We pass in an anonymous sub to File::Find for it to run on every file and directory
            # as it traverses the tree. Note that the parameter name, 'wanted', is a misnomer;
            # File::Find will not do anything with the return value of our sub.
            wanted => sub {
               # Apparently File::Find doesn't handle utf8 filenames.
               my $fullpathname = $File::Find::name;
               utf8::decode($fullpathname);

               if (-e $fullpathname && ! -d $fullpathname) {
                  push @plan, create_plan_for_filename($fullpathname);
               }
            },
            follow_skip => 2,    # If we get duplicate dirs, files etc. during find, ignore and carry on.
            no_chdir => 1,       # We want our current dir consistent after Find does its work.
         },
         @dirs_and_files_to_sum
      );
   @plan = prune_plan_duplicates(@plan);
   print STDERR colored("Complete. Plan contains " . scalar @plan . " items.", 'yellow'), "\n";

   # Go sum all those files.
   run_md5_files_with_progress("Creating MD5", @plan);

   # Then write the computed MD5s to the md5sum file given with --create.
   save_md5sum_file($md5sum_filename, @plan);
}


sub prune_plan_duplicates
{
   my (@plan) = @_;
   my @uniq_plan;
   my %names;
   foreach my $plan_entry (@plan) {
      next unless $plan_entry->{filename};
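      # Warn only once per file, on its second appearance. The parentheses matter: == binds tighter than //.
      if (($names{$plan_entry->{filename}} // 0) == 1) {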
         print STDERR colored("Warning: File '$plan_entry->{filename}' mentioned multiple times. Duplicate entries in plan will be ignored.", 'bold yellow'), "\n";
      }
      push @uniq_plan, $plan_entry unless $names{$plan_entry->{filename}};
      $names{$plan_entry->{filename}}++;
   }
   return @uniq_plan;
}
There were a few problems I quickly ran into while testing this which are addressed in the code above. Firstly, one must make sure to run utf8::decode() on the file names that File::Find gives you. Thanks to the -CSDA options at the top of our script, Perl is already decoding @ARGV filenames as UTF-8 for us, but not so the names we get while traversing the directory tree. Obviously, if you are on a filesystem which is not using UTF-8 and instead uses some other encoding, my script will fail horribly. I've decided that "Assume a sane system using UTF-8 throughout" is as far as I want to go down the locale rabbit-hole for the time being. Things get much, much more complicated when your filenames from the filesystem can be in different encodings than the .md5sum files you load and the arguments you get on the command-line and so on. I'm making this script for me and also in the hopes that it could be useful for others, but not necessarily everyone everywhere ever.
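
To see what utf8::decode() actually changes, here is a tiny sketch using a made-up filename. The byte string is what a UTF-8 filesystem would typically hand back for a name containing "é".

```perl
use strict;
use warnings;

# Bytes as they might come back from the filesystem: "é" encoded as UTF-8.
my $name = "caf\xC3\xA9.txt";
my $length_as_bytes = length($name);

# utf8::decode works in place and returns true if the bytes were well-formed UTF-8.
my $ok = utf8::decode($name);
my $length_as_chars = length($name);

print "bytes: $length_as_bytes, characters: $length_as_chars, decoded ok: ",
      ($ok ? "yes" : "no"), "\n";
```

Before decoding, Perl sees nine separate bytes; afterwards it sees eight characters, one of which is é.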

The second problem vexed me for a little while; I was getting the list of filenames properly, but then my MD5 steps were all failing, claiming the files didn't exist. What was going on? Well, by default, File::Find uses chdir to change the current working directory while it does its thing; it sets Perl's magic default variable $_ to the basename of the file and $File::Find::name to the full name including the path. But it doesn't necessarily tidy up after itself once it's run, and so my MD5 checks weren't running from the directory they should have been in. Setting the option no_chdir avoids this behavior; we always get the full path, and we never change directory.
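
The effect of no_chdir can be checked in isolation with a minimal sketch like the one below. The helper name list_files_no_chdir is hypothetical and exists only for this example.

```perl
use strict;
use warnings;
use Cwd qw(getcwd);
use File::Find;

# Hypothetical helper: list regular files under the given directories without
# letting File::Find change the working directory along the way.
sub list_files_no_chdir {
    my (@dirs) = @_;
    my @files;
    find({
        # With no_chdir set, $File::Find::name can be tested directly,
        # because the path is still valid from where we started.
        wanted   => sub { push @files, $File::Find::name if -f $File::Find::name },
        no_chdir => 1,
    }, @dirs);
    @files = sort @files;
    return @files;
}
```

Calling getcwd() before and after list_files_no_chdir('some/dir') should give the same answer, and every returned name includes the directory it was found under.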

Lastly, I observed that if I gave --create a directory and also mentioned a few files within that directory, the files would be added to the plan multiple times and be checksummed multiple times. This is unnecessary, so prune_plan_duplicates is there to check for that and warn the user about it. I've also written some fairly straightforward subs to summarise things at the end of the program: print_plan_pass_fail_summary gives a quick recount of how many files were verified successfully and how many, and which, failed. print_plan_errors checks to see if there's an 'error' attribute set on any files, and if so, lists them grouped by error. I'll omit listing them here because they're pretty much just doing some foreach loops and maybe building a hash.
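
For a feel of what "a foreach loop and a hash" amounts to, here is a guess at how the grouping by error could be done. It is a sketch only: the sub name group_plan_errors, the sample entries and the output layout are invented for the example, not taken from the real print_plan_errors.

```perl
use strict;
use warnings;

# Illustrative only: collect filenames under each distinct 'error' message.
sub group_plan_errors {
    my (@plan) = @_;
    my %by_error;
    foreach my $entry (@plan) {
        next unless defined $entry->{error};
        push @{ $by_error{ $entry->{error} } }, $entry->{filename} // '(no filename)';
    }
    return %by_error;
}

my %errors = group_plan_errors(
    { filename => 'a.txt', error => 'File not found' },
    { filename => 'b.txt' },
    { filename => 'c.txt', error => 'File not found' },
);
foreach my $error (sort keys %errors) {
    print STDERR "$error:\n";
    print STDERR "   $_\n" for @{ $errors{$error} };
}
```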

Staying up to date

So now that we can create an .md5sum file based on a directory of files, and verify that .md5sum file afterwards, what else is there to do? One use-case I foresee is a situation where you have a file full of md5sums, but some of the files referenced have been removed, or new ones have been added to the directory it is responsible for and you want to include their checksums. In this case, you could use --create and clobber the old file with new data, but that could take a considerable length of time and you'd want to --verify first to make sure the files aren't corrupt - else you'd be recording that corrupt checksum as being valid.

Naturally, the act of updating the .md5sum file will require both loading the file contents and building a plan based on recursively scanning a set of directories. It looks like it's already time to refactor our File::Find call out into its own subroutine so we can avoid duplicated code.

Once we've done that, we can load both the .md5sum file and the directory tree and compare them, alerting the user to the differences and updating the file:-
sub update_file_sums
{
   my ($md5sum_filename, @dirs_and_files_to_sum) = @_;

   # We need to load both the contents of our .md5sum file and build an index of the files in given directories,
   # so that we can compare them.
   my @official_plan = load_md5sum_file($md5sum_filename);
   my @actual_plan = build_plan_from_directories(@dirs_and_files_to_sum);

   my %official_filenames = map { $_->{filename}, $_ } grep { defined $_->{filename} } @official_plan;
   my %actual_filenames = map { $_->{filename}, $_ } grep { defined $_->{filename} } @actual_plan;

   my @new_files = grep { $_->{filename} && ! defined $official_filenames{$_->{filename}} } @actual_plan;
   my @removed_files = grep { $_->{filename} && ! defined $actual_filenames{$_->{filename}} } @official_plan;

   if (scalar @removed_files == scalar keys %official_filenames) {
      print STDERR colored("All files mentioned in '$md5sum_filename' would be removed - refusing to act, something doesn't look right here.", 'bold red'), "\n";
      print STDERR colored("If you really want to do this, use --create to create a brand-new .md5sum file instead.", 'bold red'), "\n";
      exit 1;
   }
   if (@removed_files == 0 && @new_files == 0) {
      print STDERR colored("No changes necessary.", 'bold green'), "\n";
      exit 0;
   }

   if (@removed_files) {
      print STDERR colored(scalar @removed_files . " Removed files:-", 'red'), "\n";
      print_plan_sums(@removed_files);
   }
   if (@new_files) { 
      print STDERR colored(scalar @new_files . " New files:-", 'cyan'), "\n";
   }
   # Go sum the new files.
   run_md5_files_with_progress("Updating MD5", @new_files);

   # Remove the removed files and add the new files. Leave unchanged entries as per the original.
   my @updated_plan;
   foreach my $plan_entry (@official_plan, @new_files) {
      # Skip files that can no longer be found.
      next unless $plan_entry->{filename} && defined $actual_filenames{$plan_entry->{filename}};

      # If there isn't a {correct_md5} attribute, we should set it from the {computed_md5}.
      $plan_entry->{correct_md5} //= $plan_entry->{computed_md5};

      push @updated_plan, $plan_entry;
   }

   print STDERR colored("Saving file '$md5sum_filename'...", 'bold green');
   save_md5sum_file($md5sum_filename, @updated_plan);
   print STDERR colored(" Done!", 'bold green'), "\n";
}
As we're making heavy use of various lists of things, the natural tools to help us manipulate them are Perl's map and grep functions. The data gets processed right-to-left; for instance, we take the @official_plan we loaded from disk, and grep that list to include only the entries that actually have a filename - since the file may be malformed. Then we take the output of grep and run it through map: the block we give it turns each entry into a pair of values, and lists of paired-up values are suitable for assignment to a hash. The hashes will be used to easily test whether or not a file exists in our .md5sum file or actually on disk in our File::Find results.
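
The same grep-then-map pattern can be tried on toy data. In this sketch the two plans and their filenames are invented; only the shape of the map and grep calls mirrors update_file_sums.

```perl
use strict;
use warnings;

# Two toy plans; the third official entry has no filename, as a malformed line would.
my @official_plan = ({ filename => 'a.txt' }, { filename => 'b.txt' }, { error => 'Unrecognised md5sum entry' });
my @actual_plan   = ({ filename => 'b.txt' }, { filename => 'c.txt' });

# grep runs first and drops entries without a filename; map then emits
# (key, value) pairs, which is exactly what hash assignment expects.
my %official = map { $_->{filename} => $_ } grep { defined $_->{filename} } @official_plan;
my %actual   = map { $_->{filename} => $_ } grep { defined $_->{filename} } @actual_plan;

# Anything on disk but not in the file is new; anything in the file but not on disk is gone.
my @new_files     = grep { defined $_->{filename} && !exists $official{ $_->{filename} } } @actual_plan;
my @removed_files = grep { defined $_->{filename} && !exists $actual{ $_->{filename} } } @official_plan;

print "New: ",     join(', ', map { $_->{filename} } @new_files),     "\n";
print "Removed: ", join(', ', map { $_->{filename} } @removed_files), "\n";
```

With this data, c.txt comes out as new and a.txt as removed, while the malformed entry is ignored on both sides.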

After wrangling our data into lists of new and removed files, we only need to generate the checksums of the new files, assign those checksums as being "correct", and save the .md5sum file. It is at this point I find a bug - I've been saving the {computed_md5} attribute in save_md5sum_file, but what I really intended is to save the {correct_md5}. Not a huge difference when you're creating a brand new checksum file from scratch, but when you're partially updating one it becomes important to get it right. A quick fix for the update and creation subs looks like this:-
$plan_entry->{correct_md5} //= $plan_entry->{computed_md5};
This line is using the defined-or assignment operator introduced back in Perl 5.10. This operator is my best friend. Defined-or is great for including "default values" when you're pulling data from a source that might be undefined - something like my $dir = $ARGV[0] // "."; will let you easily add default fallback values to your code. Just like many other operators, you can combine it with an assignment - just as $a = $a + $b becomes $a += $b, you can apply //= to something. The line above is a terse way of saying "Assign the correct_md5 attribute its own value, unless its value is undefined, in which case assign it the computed_md5". In other languages this could be annoyingly verbose code.
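
Here is a short sketch of the operator at work, using made-up plan entries with placeholder checksum strings. It also shows why // is usually a safer choice than || for defaults.

```perl
use strict;
use warnings;

# Hypothetical plan entries: one loaded from a file, one freshly summed.
my %loaded = (correct_md5 => 'aaaa', computed_md5 => 'bbbb');
my %fresh  = (computed_md5 => 'cccc');

# //= only assigns when the left-hand side is undefined.
$loaded{correct_md5} //= $loaded{computed_md5};   # keeps its existing value
$fresh{correct_md5}  //= $fresh{computed_md5};    # picks up the computed value

# Unlike ||, // treats defined-but-false values such as 0 as present.
my $count           = 0;
my $with_or         = $count || 10;   # falls through to the default
my $with_defined_or = $count // 10;   # keeps the 0

print "loaded: $loaded{correct_md5}, fresh: $fresh{correct_md5}\n";
print "||: $with_or, //: $with_defined_or\n";
```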

Anyway, I'm running out of steam and this project has made for two large posts already. It works well enough; we've found a bunch of cases where it could have broken, and made sure to avoid that. It could be faster, but it's not slow, and the additional features more than make up for things.

I've set up a page where you can download this script in its final form, along with my other larger scripts from these blog posts. Get them here: Script Toolbox
