18 November 2013

Naming Names, Part 2

Remember how, in the last post, I invited you all to join me "next week" for part 2? What a funny joke that was! In the meantime I've been busy being sick, getting sucked back into WoW thanks to a "gift" from a "friend", and have been dealing with hilariously bad hardware failures. But I'm finally back to finish what I started, because I owe you all that much.

In Part 1, we converted a crufty old shell script of mine to its Perl 5 equivalent, and then built it up a little to be more smart about how it goes about renaming files.

In Part two, we go nuts with feature-creep. Follow me along for the ride.



Getting into an Argument

Our little script is quite good so far, but there's one thing it doesn't do yet: the thing it is designed to do, rename files. Up until now I've left out the final rename call so that I can test that the program works without messing up my test files each time. Since I'm fairly confident it won't break now, let's add in this core functionality.

But I still want to be able to test substitution commands without renaming files. It's pretty essential to be able to preview the effects of a major renaming operation. To get the best of both worlds, let's add a command-line option that supresses the rename action and merely previews what the script would do.

Parsing command-line options by hand is a tedious and thankless task. Naturally, Perl has a module to do all the hard work for us: Getopt::Long. I'm pretty sure it comes as standard with any recent Perl distribution, and if for some reason it's not available on some strange system I'm using then I am sad and don't want to use that system.

use Getopt::Long;

my $dryrun = 0;
GetOptions("dry-run|dryrun" => \$dryrun);

Parsing out a simple --dry-run flag from our argument list is as simple as loading the module, setting up a variable, and calling GetOptions(). I'm never any good at remembering if options have hyphens in them or not, so we use the | character to set up an alias; the user can type --dry-run or --dryrun and we get the same value stored in our $dryrun variable.

Now, I know that in the last post I said I didn't want to get into Perl references, because it's more of an advanced topic, and I'm not sure I can explain them without confusing people. However, to properly use the GetOptions() function, we do need to touch on them a little. The function takes a big list of key/value pairs where the key is what the option should be named and how it behaves, and the value is a variable of ours that will be filled in appropriately. But that means we can't just pass in the variable as $dryrun - that passes in the value of the $dryrun variable, and doesn't allow GetOptions() to modify it in any way. Normally, that's what you want from function parameters: You give it a set of values but you don't want any weird side effects at your end of the function call.

So instead of passing the $dryrun variable's value, we need to somehow provide GetOptions() with a reference to our variable, explicitly allowing the function to change the same value we are keeping in our code under the name $dryrun. In Perl, the way to get at a reference to a variable instead of the variable's value is by putting the backslash character in front of the usual sigil - $var becomes \$var, @list becomes \@list, %hash becomes \%hash. GetOptions() is expecting a reference, so it is free to change its value accordingly and we continue to use the variable as normal.

Here's the new subroutines for showing or performing our @plan:-

   my @plan = map { $_, $filename_mapping{$_} } @filenames; # pairs of oldname, newname.
   if ($dryrun) {
      show_plan(@plan);
   } else {
      execute_plan(@plan);
   }
}


sub show_plan
{
   my @plan = @_;
   print "Proposed changes:-\n";
   while (@plan) {
      my $orig_name = shift @plan;
      my $new_name = shift @plan;
      if ($orig_name ne $new_name) {
         print "$orig_name -> $new_name\n";
      } else {
         print "$orig_name : unchanged.\n";
      }
   }
}


sub execute_plan
{
   my @plan = @_;
   print "Renaming files:-\n";
   while (@plan) {
      my $orig_name = shift @plan;
      my $new_name = shift @plan;
      if ($orig_name ne $new_name) {
         print "$orig_name -> $new_name\n";
         rename $orig_name, $new_name or die "Failed to rename $orig_name : $!\n";
      } else {
         print "$orig_name : unchanged.\n";
      }
   }
}

Observe that we are finally calling rename $orig_name, $new_name - our program finally does what it was designed to do!

So far, we only need to use --dry-run as a flag - did the user specify it, or did they not? More complex command line arguments can be specified with Getopt::Long, such as options which take an additional text parameter, or which can be used multiple times, but we shouldn't need those capabilities just yet. See the documentation on metacpan or use perldoc Getopt::Long for more information.

Since it's been a little while since the last post, here's the full text of the script so far:-

#!/usr/bin/perl -CSDA

use strict;
use warnings;
use Getopt::Long;

my $dryrun = 0;
GetOptions("dry-run|dryrun" => \$dryrun);

sub help
{
   print "Usage: $0 [--dry-run] '<substitutions>' filename(s)\n";
   print "  e.g. $0 's/from/to/g;' myfiles/*\n";
   exit 1;
}


sub main
{
   my $substitutions = shift @_;
   my @filenames = @_;

   my %filename_mapping;    # original name => new name
   my %reverse_mapping;     # new name => original name
   foreach my $orig_filename (@filenames) {
      my $new_filename = transform_filename($substitutions, $orig_filename);
      $filename_mapping{$orig_filename} = $new_filename;
      $reverse_mapping{$new_filename} = $orig_filename;
   }

   # Check there are no filename collisions; the number of unique output filenames
   # should be equal to the number of unique input filenames.
   if (keys %filename_mapping != keys %reverse_mapping) {
      die "Given substitution commands result in non-unique filenames, refusing to clobber your files.\n";
   }

   my @plan = map { $_, $filename_mapping{$_} } @filenames; # pairs of oldname, newname.
   if ($dryrun) {
      show_plan(@plan);
   } else {
      execute_plan(@plan);
   }
}


sub show_plan
{
   my @plan = @_;
   print "Proposed changes:-\n";
   while (@plan) {
      my $orig_name = shift @plan;
      my $new_name = shift @plan;
      if ($orig_name ne $new_name) {
         print "$orig_name -> $new_name\n";
      } else {
         print "$orig_name : unchanged.\n";
      }
   }
}


sub execute_plan
{
   my @plan = @_;
   print "Renaming files:-\n";
   while (@plan) {
      my $orig_name = shift @plan;
      my $new_name = shift @plan;
      if ($orig_name ne $new_name) {
         print "$orig_name -> $new_name\n";
         rename $orig_name, $new_name or die "Failed to rename $orig_name : $!\n";
      } else {
         print "$orig_name : unchanged.\n";
      }
   }
}


sub transform_filename
{
   my ($substitutions, $orig_filename) = @_;

   $_ = $orig_filename;
   my $rval = eval $substitutions;
   my $new_filename = $_;
   die "Error in substitutions: $@" if $@;

   return $new_filename;
}


help() if (@ARGV < 2);
main(@ARGV);

Living on the Edge

90% of a good programmer's time is spent thinking about edge cases. The other 90% is accounted for by a complete inability to come up with accurate time estimates.

There are a few situations that we might find ourselves in when using this script where we might try to be clever and end up shooting ourselves in the foot. Say we had a bunch of sequentially-numbered files. For whatever reason, we need those numbers incremented by one. We might try something like the following:-

james@qi: ~/Ubuntu One/blogpost/example2
$ ls
IMG_0013.JPG  IMG_0014.JPG  IMG_0015.JPG  IMG_0016.JPG  INVISIBLE BIKE.JPG

james@qi: ~/Ubuntu One/blogpost/example2
$ ../filenamesed6.pl --dry-run 'if ($_ =~ /IMG_(\d+)\.JPG/) { my $num = $1 + 1; $_ = sprintf("IMG_%04d.JPG", $num); }' *.JPG
Proposed changes:-
IMG_0013.JPG -> IMG_0014.JPG
IMG_0014.JPG -> IMG_0015.JPG
IMG_0015.JPG -> IMG_0016.JPG
IMG_0016.JPG -> IMG_0017.JPG
INVISIBLE BIKE.JPG : unchanged.

The transformation we are applying here is a little more complex than the simple substitutions we've been doing up until now; it is the benefit of converting our script to use Perl (we get to use arbitrary Perl commands!) and also the drawback (I can do more complex things now... which is complex!). My little one-liner there can be written slightly more clearly as:-

if ($_ =~ /IMG_(\d+)\.JPG/) {
   my $num = $1 + 1;
   $_ = sprintf("IMG_%04d.JPG", $num);
}

This matches only the IMG_ files, capturing the sequence of digits in the name. It adds one to that, and reformats the filename using sprintf - four digits with leading zeroes. So this will bump up the number for each of those files by one. Pretty clever, huh? Except we're shooting ourselves in the foot. Take a look at the proposed changes again; it intends to rename image #13 to #14, then rename image #14 to #15. Except after that first step, IMG_0013.JPG will have been moved over the top of IMG_0014.JPG and clobbered it. By the end, you'll be left with only one photo left, IMG_0017.JPG, and it will have the contents of the former #13. We need to make the script smarter so that it can handle this case.

Here's the approach I'm taking to re-order the plan before we execute it:-

sub reorder_plan
{
   my @plan = @_;
   my %mapping = @_;
   my @new_plan;
   my $frustration = 0;
   # Go through the plan, pushing pairs on to @new_plan only if it's safe to do that step.
   # It is safe to do that step only if the destination name does not appear in 
   # @plan's remaining list of source names. (or if filename is unchanged)
   while (@plan) {
      my $orig_name = shift @plan;
      my $new_name = shift @plan;
      if ($orig_name eq $new_name || ! defined $mapping{$new_name}) {
         # This one is safe. Add it to the new plan.
         push @new_plan, $orig_name, $new_name;
         delete $mapping{$orig_name};
         $frustration = 0;
      } else {
         # This one is not safe to do (yet). Put it at the back of the current plan.
         push @plan, $orig_name, $new_name;
         $frustration++;
         # Is the whole thing impossible?
         if ($frustration * 2 >= @plan) {
            die "I can't figure out how to order this plan without clobbering things mid-process!\n";
         }
      }
   }

   return @new_plan;
}

It takes a list of filename pairs (the @plan) and returns a list of pairs (@new_plan). The %mapping hash is also assigned to the @_ list of arguments, which fills in the hash based on those pairs. I keep a count of the algorithm's $frustration so that it won't get stuck in an infinite loop if asked to sort out some horrible circular mapping.

The body of the loop is fairly straightforward; removing a pair of names from the start of the @plan, it checks to see if the $new_name from this action would clobber any of the remaining original names on the plan (via the handy %mapping hash we built). If it's safe, it pushes that pair onto the back of @new_plan; if not, it goes on the back of the current @plan list that we are currently processing. That way, we can give it another try after we've cleared some space.

Let's try it again on the "incrementing photo numbers" example:-

james@qi: ~/Ubuntu One/blogpost/example2
$ ../filenamesed6.pl --dry-run 'if ($_ =~ /IMG_(\d+)\.JPG/) { my $num = $1 + 1; $_ = sprintf("IMG_%04d.JPG", $num); }' *.JPG
Proposed changes:-
IMG_0016.JPG -> IMG_0017.JPG
INVISIBLE BIKE.JPG : unchanged.
IMG_0015.JPG -> IMG_0016.JPG
IMG_0014.JPG -> IMG_0015.JPG
IMG_0013.JPG -> IMG_0014.JPG

Excellent! It's figured out it can do the renaming in reverse order and won't accidentally murder your vacation photos. Here's the final version of the script without any promises for future posts on the matter:-

#!/usr/bin/perl -CSDA

use strict;
use warnings;
use Getopt::Long;

my $dryrun = 0;
GetOptions("dry-run|dryrun" => \$dryrun);

sub help
{
   print "Usage: $0 [--dry-run] '<substitutions>' filename(s)\n";
   print "  e.g. $0 's/from/to/g;' myfiles/*\n";
   exit 1;
}


sub main
{
   my $substitutions = shift @_;
   my @filenames = @_;

   my %filename_mapping;    # original name => new name
   my %reverse_mapping;     # new name => original name
   foreach my $orig_filename (@filenames) {
      my $new_filename = transform_filename($substitutions, $orig_filename);
      $filename_mapping{$orig_filename} = $new_filename;
      $reverse_mapping{$new_filename} = $orig_filename;
   }

   # Check there are no filename collisions; the number of unique output filenames
   # should be equal to the number of unique input filenames.
   if (keys %filename_mapping != keys %reverse_mapping) {
      die "Given substitution commands result in non-unique filenames, refusing to clobber your files.\n";
   }

   my @plan = map { $_, $filename_mapping{$_} } @filenames; # pairs of oldname, newname.
   # Re-order the plan to make it safe against stepping on our own toes while renaming.
   @plan = reorder_plan(@plan);
   # Show it or do it.
   if ($dryrun) {
      show_plan(@plan);
   } else {
      execute_plan(@plan);
   }
}


sub show_plan
{
   my @plan = @_;
   print "Proposed changes:-\n";
   while (@plan) {
      my $orig_name = shift @plan;
      my $new_name = shift @plan;
      if ($orig_name ne $new_name) {
         print "$orig_name -> $new_name\n";
      } else {
         print "$orig_name : unchanged.\n";
      }
   }
}


sub reorder_plan
{
   my @plan = @_;
   my %mapping = @_;
   my @new_plan;
   my $frustration = 0;
   # Go through the plan, pushing pairs on to @new_plan only if it's safe to do that step.
   # It is safe to do that step only if the destination name does not appear in 
   # @plan's remaining list of source names. (or if filename is unchanged)
   while (@plan) {
      my $orig_name = shift @plan;
      my $new_name = shift @plan;
      if ($orig_name eq $new_name || ! defined $mapping{$new_name}) {
         # This one is safe. Add it to the new plan.
         push @new_plan, $orig_name, $new_name;
         delete $mapping{$orig_name};
         $frustration = 0;
      } else {
         # This one is not safe to do (yet). Put it at the back of the current plan.
         push @plan, $orig_name, $new_name;
         $frustration++;
         # Is the whole thing impossible?
         if ($frustration * 2 >= @plan) {
            die "I can't figure out how to order this plan without clobbering things mid-process!\n";
         }
      }
   }

   return @new_plan;
}


sub execute_plan
{
   my @plan = @_;
   print "Renaming files:-\n";
   while (@plan) {
      my $orig_name = shift @plan;
      my $new_name = shift @plan;
      if ($orig_name ne $new_name) {
         print "$orig_name -> $new_name\n";
         rename $orig_name, $new_name or die "Failed to rename $orig_name : $!\n";
      } else {
         print "$orig_name : unchanged.\n";
      }
   }
}


sub transform_filename
{
   my ($substitutions, $orig_filename) = @_;

   $_ = $orig_filename;
   my $rval = eval $substitutions;
   my $new_filename = $_;
   die "Error in substitutions: $@" if $@;

   return $new_filename;
}


help() if (@ARGV < 2);
main(@ARGV);

Real World Usage

Now that I've finally completed this post, I can use the script for my own purposes. I've been wanting to reorganise the scans of various invoices I got from the health problems I had back in 2012. Originally, I named my files according to what I had done, when it was, and maybe some ancillary information. Here's a sample:-

ANAS-20120426-invoice.png
ANAS-20120427-receipt.png
ENT-20120404-invoice-amended.jpg
ENT-20120404-receipt-sans-docket-needs-referral.png
ENT-20120404-receipt-with-docket-needs-referral.png
HCF-20120813-ANAS+CT-rebate.png
Medicare Claims 20120509 & 20120711.png
MKPh-20120604-cortisol.png
MMI-20120330-receipt.png
MMI-20120403-receipt.png
MMI-20120426-the-missing-receipt.pdf
MMI-20120427-invoice1.jpg
MMI-20120427-invoice2.jpg
MMI-20120427-receipt1.jpg
MMI-20120427-receipt2.jpg
MMI-20120614-reminder.jpg
MQH-20120426-inpatient-excess.png
MQNS-20120402-receipt.jpg
MQOP-20120521-receipt.png
MQPh-20120427-receipt.png

However, having to deal with taxes and medicare claims and the ridiculous "financial year" system, I find I'd much rather have all these files named by date first, then all the rest. If only I'd made some sort of Perl 5 script that allowed me to use arbitrary Perl 5 regexps and expressions to rename files simply and easily!

My plan here is to match files with a -20120427 style date-stamp in them, capture that and replace it with the empty string, then prepend the date and a hyphen to the name. Here's the result:-

james@qi: ~/Ubuntu One/blogpost/example3
$ ../filenamesed6.pl 'if (s/-(\d{8})//) { $_ = $1 . "-" . $_; }' *
Renaming files:-
ANAS-20120426-invoice.png -> 20120426-ANAS-invoice.png
ANAS-20120427-receipt.png -> 20120427-ANAS-receipt.png
ENT-20120404-invoice-amended.jpg -> 20120404-ENT-invoice-amended.jpg
ENT-20120404-receipt-sans-docket-needs-referral.png -> 20120404-ENT-receipt-sans-docket-needs-referral.png
ENT-20120404-receipt-with-docket-needs-referral.png -> 20120404-ENT-receipt-with-docket-needs-referral.png
HCF-20120813-ANAS+CT-rebate.png -> 20120813-HCF-ANAS+CT-rebate.png
Medicare Claims 20120509 & 20120711.png : unchanged.
MKPh-20120604-cortisol.png -> 20120604-MKPh-cortisol.png
MMI-20120330-receipt.png -> 20120330-MMI-receipt.png
MMI-20120403-receipt.png -> 20120403-MMI-receipt.png
MMI-20120426-the-missing-receipt.pdf -> 20120426-MMI-the-missing-receipt.pdf
MMI-20120427-invoice1.jpg -> 20120427-MMI-invoice1.jpg
MMI-20120427-invoice2.jpg -> 20120427-MMI-invoice2.jpg
MMI-20120427-receipt1.jpg -> 20120427-MMI-receipt1.jpg
MMI-20120427-receipt2.jpg -> 20120427-MMI-receipt2.jpg
MMI-20120614-reminder.jpg -> 20120614-MMI-reminder.jpg
MQH-20120426-inpatient-excess.png -> 20120426-MQH-inpatient-excess.png
MQNS-20120402-receipt.jpg -> 20120402-MQNS-receipt.jpg
MQOP-20120521-receipt.png -> 20120521-MQOP-receipt.png
MQPh-20120427-receipt.png -> 20120427-MQPh-receipt.png

james@qi: ~/Ubuntu One/blogpost/example3
$ ls
20120330-MMI-receipt.png
20120402-MQNS-receipt.jpg
20120403-MMI-receipt.png
20120404-ENT-invoice-amended.jpg
20120404-ENT-receipt-sans-docket-needs-referral.png
20120404-ENT-receipt-with-docket-needs-referral.png
20120426-ANAS-invoice.png
20120426-MMI-the-missing-receipt.pdf
20120426-MQH-inpatient-excess.png
20120427-ANAS-receipt.png
20120427-MMI-invoice1.jpg
20120427-MMI-invoice2.jpg
20120427-MMI-receipt1.jpg
20120427-MMI-receipt2.jpg
20120427-MQPh-receipt.png
20120521-MQOP-receipt.png
20120604-MKPh-cortisol.png
20120614-MMI-reminder.jpg
20120813-HCF-ANAS+CT-rebate.png
Medicare Claims 20120509 & 20120711.png

Now I can very easily see my files in date order (the date of the event rather than the date the file was created), and instead of renaming them all by hand I just had to work a little regexp magic. This alone was worth the effort of making the script - and I get to reuse it again and again for any future filename wrangling I want to do.

I hope it can also be of use for everyone reading.

No comments:

Post a Comment