15 May 2013

Naming Names, Part 1

I frequently go off on tangents. While watching some youtube videos I'd downloaded, I got the notion that it'd be neater to see them all grouped into the teams they are on. I was about to use my trusty, crusty, old bash function to rename them when it occurred to me that it was past due for an update. For reference, here is the function I've been using in my profile since Forever:-
function filenamesed () {
   if [ "$1" = "" -o "$2" = "" ]; then
      echo "Usage: filenamesed 'sedcmds' filename(s)"
      return
   fi
   SEDCMD="$1"
   shift
   while [ "$1" != "" ]; do
      F="$1"
      NF=`echo "$F" | sed "$SEDCMD"`
      if [ "$F" != "$NF" ]; then
         echo "$F -> $NF"
         mv -i "$F" "$NF"
      fi
      shift
   done
}
I won't explain it in depth. Suffice to say that it takes a GNU sed command as its first argument, and a bunch of files as other arguments, and then applies that sed command to the filenames. For example, when I want to change all the underscores in filenames into spaces, I would run:-
filenamesed 's/_/ /g;' videos/youtube/*.mp4
So if the original goal was "Watch some MindCrack UHC" and Tangent #1 was "rename those files", Tangent #2 is "update your crusty old shell script" and we finally come to the maximally-tangental task of bringing you lovely people along for the ride. Let's do this in Perl 5, piecemeal, and explain our thought processes along the way.



Shebangs, shebangs

The first step is to put the usual Perl 5 boilerplate that isn't exactly essential but is certainly recommended; warnings and strict. The #! line (called the shebang line) at the top tells Linux what interpreter should be used to run this script; for my personal scripts, I usually just plug #!/usr/bin/perl in there and call it a day. However, some bizarre UNIX systems might have perl installed in /usr/local/bin/ instead, so if we want to be properly portable, we can ask the env program to select the appropriate perl to use. I mean, assuming they have env installed...
#!/usr/bin/env perl

use strict;
use warnings;
Now, let's make a very basic first-cut of the script that is essentially a transliteration of the bash version.
if (@ARGV < 2) {
   print "Usage: $0 '<substitutions>' filename(s)\n";
   print "  e.g. $0 's/from/to/g;' myfiles/*\n";
   exit 1;
}
The first thing the script does is check that it has been run properly, with at least two arguments. In perl, we can check the @ARGV list to see what arguments were passed to our script. Comparing @ARGV to a scalar (a simple value) here makes @ARGV evaluate to the number of items in the list rather than the list's values. If we don't have enough, print some usage instructions and exit. Note we are using the $0 special variable to refer to the name of our script, exactly as the user invoked it.
my $substitutions = shift @ARGV;
foreach my $orig_filename (@ARGV) {
   $_ = $orig_filename;
   my $rval = eval $substitutions;
   my $new_filename = $_;
Shifting the first entry off the front of the @ARGV list gets us the substitution commands we want to perform, and then we can simply loop over the filename entries that remain. Rather than piping the name through the sed program, we want to give the user the full, awesome, and terrifying power of Perl itself via the command line. To do that, we are going to use the eval function, which will run the text given on the command line as perl 5 code and see what that does to the filename. Setting $_ prior to doing the eval puts our original filename into the "default variable" that gets acted on in certain situations in Perl. We pull it out of $_ again immediately afterwards; the actual return value of eval is not as useful to us here, and will probably just be '1' if the s/// pattern matched.
   die "Error in substitutions: $@" if $@;

   if ($new_filename ne $orig_filename) {
      print "$orig_filename -> $new_filename\n";
   } else {
      print "$orig_filename : unchanged.\n";
   }
}
After an eval, the special $@ variable gets set if there were any errors. We don't want users to be able to shoot themselves in the foot, so it's probably best to just exit out as early as possible with the error message. Finally, assuming the new filename is different, we can display the name change and move on to the next filename. I'm not doing any actual file renaming yet; this will require lots of testing and it is simpler to just see what the program would do rather than replace my test files each time.
#!/usr/bin/env perl

use strict;
use warnings;

if (@ARGV < 2) {
   print "Usage: $0 '<substitutions>' filename(s)\n";
   print "  e.g. $0 's/from/to/g;' myfiles/*\n";
   exit 1;
}

my $substitutions = shift @ARGV;
foreach my $orig_filename (@ARGV) {
   $_ = $orig_filename;
   my $rval = eval $substitutions;
   my $new_filename = $_;
   die "Error in substitutions: $@" if $@;

   if ($new_filename ne $orig_filename) {
      print "$orig_filename -> $new_filename\n";
   } else {
      print "$orig_filename : unchanged.\n";
   }
}

At this point, we have a viable replacement for the original bash script, with a few modern conveniences that perl gives us with its advanced regexes and the ability to use more arbitrary transformations. Let's test it out a bit:-
james@qi: ~/Ubuntu One/blogpost/example
$ ls
0511.avidyazen.MindCrack_Ultra_Hardcore_-_S10E03_-_Contact!-Vje7iwlB3Ko.mp4
0511.kurtjmac.Minecraft_MindCrack_-_UHC_S10E03_-_The_Trench-79xpiEor-P8.mp4
0511.mhykol.Mindcrack_UHC_X_-_Episode_3-wSGmxqqCMBc.mp4
けしからん猫の垂直跳びには敵わない。 NOTHING CAN RIVAL THE JUMPING CAT!.mp4

james@qi: ~/Ubuntu One/blogpost/example
$ ../filenamesed1.pl 's/(?=kurt|avidya|mhykol)/TEAM SOBRIETY./; s/_/ /g;' *.mp4
0511.avidyazen.MindCrack_Ultra_Hardcore_-_S10E03_-_Contact!-Vje7iwlB3Ko.mp4 -> 0511.TEAM SOBRIETY.avidyazen.MindCrack Ultra Hardcore - S10E03 - Contact!-Vje7iwlB3Ko.mp4
0511.kurtjmac.Minecraft_MindCrack_-_UHC_S10E03_-_The_Trench-79xpiEor-P8.mp4 -> 0511.TEAM SOBRIETY.kurtjmac.Minecraft MindCrack - UHC S10E03 - The Trench-79xpiEor-P8.mp4
0511.mhykol.Mindcrack_UHC_X_-_Episode_3-wSGmxqqCMBc.mp4 -> 0511.TEAM SOBRIETY.mhykol.Mindcrack UHC X - Episode 3-wSGmxqqCMBc.mp4
けしからん猫の垂直跳びには敵わない。 NOTHING CAN RIVAL THE JUMPING CAT!.mp4 : unchanged.
So, I can use Perl 5's fancy zero-width-look-ahead expressions (the (?=pattern) regex), and it tags and tidies my filenames up fine. It even output the Unicode filename correctly... or did it? Let's try another transformation which should add a space between each character. The pattern s/\X\K/ /g; is quite 'clever' in that it matches all the grapheme clusters (Unicode speak for 'final combined character that a user would see on screen') with \X, and would replace them with a space ...except for the magic \K part of the pattern which means "keep everything matched so far intact". A guide to regular expressions would be the subject of many, many lengthy blog posts, so just take my word for it and let's see how it performs:-
james@qi: ~/Ubuntu One/blogpost/example
$ ../filenamesed1.pl 's/\X\K/ /g;' *.mp4
0511.avidyazen.MindCrack_Ultra_Hardcore_-_S10E03_-_Contact!-Vje7iwlB3Ko.mp4 -> 0 5 1 1 . a v i d y a z e n . M i n d C r a c k _ U l t r a _ H a r d c o r e _ - _ S 1 0 E 0 3 _ - _ C o n t a c t ! - V j e 7 i w l B 3 K o . m p 4 
0511.kurtjmac.Minecraft_MindCrack_-_UHC_S10E03_-_The_Trench-79xpiEor-P8.mp4 -> 0 5 1 1 . k u r t j m a c . M i n e c r a f t _ M i n d C r a c k _ - _ U H C _ S 1 0 E 0 3 _ - _ T h e _ T r e n c h - 7 9 x p i E o r - P 8 . m p 4 
0511.mhykol.Mindcrack_UHC_X_-_Episode_3-wSGmxqqCMBc.mp4 -> 0 5 1 1 . m h y k o l . M i n d c r a c k _ U H C _ X _ - _ E p i s o d e _ 3 - w S G m x q q C M B c . m p 4 
けしからん猫の垂直跳びには敵わない。 NOTHING CAN RIVAL THE JUMPING CAT!.mp4 -> � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � � N O T H I N G   C A N   R I V A L   T H E   J U M P I N G   C A T ! . m p 4 
Well, it worked fine for the regular-ol'ASCII filenames, but it totally mangled the Japanese filename. What gives?

Unicode of Conduct

The thing is, the script only appeared to process and print out the Japanese filename correctly on the first run. It was really only due to me running a fully UTF-8 system that it worked at all; the filename is encoded as a UTF-8 byte stream, which was passed to perl, which didn't need to do much to it, printed it out again unmolested, which was interpreted by my terminal as UTF-8. It was only when we tried to treat those bytes as characters that things broke. Such is the way of proper Unicode support - it can seem fine until suddenly it isn't. Every programmer should be aware of how to do Unicode properly, and yet the problem is so intrinsically complicated that it would require another series of lengthy, boring blog posts to explain it all. Perl 5 has pretty comprehensive support of the Unicode spec, but it still takes a fair bit of magic to enable it because of all the legacy systems it has to support.

With my previous perl programs, I've been content to just put this preamble at the top of my files, which covers most of the basics; use of Unicode characters in the source itself, outputting UTF-8 to the screen, and certain string-manipulation functions like length working with characters rather than bytes:-
use utf8;
binmode(STDOUT, ":utf8");
binmode(STDERR, ":utf8");
use feature 'unicode_strings';
This has been fine for me so far, but only because my programs were minding their own business, printing some Unicode themselves, and not interacting too much with the system. Also, my own OS uses UTF-8 throughout; for legacy systems, one has to consider the locales that might be in use and the different character encodings they use. And again, dealing with locales requires lengthy discourse and loss of sanity so let's not touch on it too much. Suffice to say that if I wanted to do things properly, I'd have to look at the user's locale and use that to decode the filenames given to us by the system from whatever character encoding system it uses into something Perl can use internally as Unicode.

Screw that, if it's not using UTF-8 then I don't really want to be using it. It's certainly a bit much for this blog post to cover for a little bash-to-perl walkthrough. In stark contrast to my portability concerns in using #!/usr/bin/env, let's just tell perl to assume all input from the commandline is UTF-8, and assume the terminal is fine to handle UTF-8 as output.

We could do this using a module I just discovered called utf8::all, a nice handy little thing that goes above and beyond what the standard 'use utf8' pragma gives us. However, it doesn't come as standard, and I needed to install libutf8-all-perl on my xubuntu installation. Naturally, that's easy and painless for me to do, but this is a script I can see myself using on lots of systems. I'd prefer to avoid dependencies if I can in this case.

There's also another way to quickly turn on various bits of Unicode support (It wouldn't be Perl if there was only one way of doing it ;)) . The solution is to use -C, a convenient command-line switch to the perl interpreter. The downside is, due to how the shebang line works, I can't get it to play nicely with env. So screw it!, this is my script and I can demand a nice sane UTF-8 locale and perl installed in /usr/bin. People with weird systems can adjust it as appropriate. Changing the shebang line to read:-
#!/usr/bin/perl -CSDA
is sufficient to fix our immediate Unicode problems.
james@qi: ~/Ubuntu One/blogpost/example
$ ../filenamesed2.pl 's/\X\K/ /g;' *CAT*
けしからん猫の垂直跳びには敵わない。 NOTHING CAN RIVAL THE JUMPING CAT!.mp4 -> け し か ら ん 猫 の 垂 直 跳 び に は 敵 わ な い 。   N O T H I N G   C A N   R I V A L   T H E   J U M P I N G   C A T ! . m p 4 
Now that we've got the basics working, what's next? The script itself is still quite primitive. The biggest problem at the moment is that if we give it a substitution that removes some important, unique, part of a filename it will happily rename (read: move) all files onto the same name, clobbering all but the last.
james@qi: ~/Ubuntu One/blogpost/example
$ ../filenamesed2.pl 's/\d+\.\K.*\.mp4//g;' *.mp4
0511.avidyazen.MindCrack_Ultra_Hardcore_-_S10E03_-_Contact!-Vje7iwlB3Ko.mp4 -> 0511.
0511.kurtjmac.Minecraft_MindCrack_-_UHC_S10E03_-_The_Trench-79xpiEor-P8.mp4 -> 0511.
0511.mhykol.Mindcrack_UHC_X_-_Episode_3-wSGmxqqCMBc.mp4 -> 0511.
けしからん猫の垂直跳びには敵わない。 NOTHING CAN RIVAL THE JUMPING CAT!.mp4 : unchanged.
That would be bad. The script needs the wisdom to check what operations are about to be performed, and refuse to perform them if it would result in a situation like this.

A Unique Problem

We essentially want to check that the number of unique filenames going into our program is the same as the number of unique filenames coming out. The first thing to do is separate out the code that determines a file's new name from the code which does the renaming.

Well, since I haven't put the rename step in yet,... Done! That was easy.

More seriously, I'm thinking of moving code into separate functions - termed 'subroutines' in perl - in the perl script to make the code more modular. A subroutine definition in perl is easy, and looks like this:-
sub transform_filename
{
   my ($substitutions, $orig_filename) = @_;

   $_ = $orig_filename;
   my $rval = eval $substitutions;
   my $new_filename = $_;
   die "Error in substitutions: $@" if $@;

   return $new_filename;
}
The majority of our subroutine is unchanged from the previous iteration of our code, except for the first line. When we want to call the transform_filename subroutine, we pass it two parameters - the code to perform the changes, and the filename to work on. In Perl 5, subroutine parameters get passed in via the special @_ list. The my ($a, $b) = @_ construct is a neat way of telling perl that you want to assign the @_ list to the list you're talking about on the left-hand side of the = sign - and that list contains two variables in the first two positions. Perl does the right thing and takes the first two items from @_ and puts them in the appropriate variables according to their positions.

I'll go ahead and add subs for the rest of the code too, which will make things cleaner to expand upon later:-
#!/usr/bin/perl -CSDA

use strict;
use warnings;

sub help
{
   print "Usage: $0 '<substitutions>' filename(s)\n";
   print "  e.g. $0 's/from/to/g;' myfiles/*\n";
   exit 1;
}


sub main
{
   my $substitutions = shift @_;
   foreach my $orig_filename (@_) {
      my $new_filename = transform_filename($substitutions, $orig_filename);

      if ($new_filename ne $orig_filename) {
         print "$orig_filename -> $new_filename\n";
      } else {
         print "$orig_filename : unchanged.\n";
      }
   }
}


sub transform_filename
{
   my ($substitutions, $orig_filename) = @_;

   $_ = $orig_filename;
   my $rval = eval $substitutions;
   my $new_filename = $_;
   die "Error in substitutions: $@" if $@;

   return $new_filename;
}


help() if (@ARGV < 2);
main(@ARGV);
Now the body of the script defines a help(), main(), and transform_filename(). As the last step, we either jump into help() or main() as appropriate.

#HashTag

Things are looking tidier, but we're still not checking the uniqueness of the filenames. The easiest way to do this in Perl is to put all the names into a hash. A hash is a series of key => value pairs, sometimes known as maps, tables, or dictionaries in other languages. In perl, it is a built-in type that uses the % sigil. We will make two hashes, one that maps the original filename to the corresponding changed version, and one which does the reverse.
sub main
{
   my $substitutions = shift @_;
   my @filenames = @_;

   my %filename_mapping;    # original name => new name
   my %reverse_mapping;     # new name => original name
   foreach my $orig_filename (@filenames) {
      my $new_filename = transform_filename($substitutions, $orig_filename);
      $filename_mapping{$orig_filename} = $new_filename;
      $reverse_mapping{$new_filename} = $orig_filename;
   }
We still iterate over all the provided filenames as before, calling transform_filename() to get the changed version, but now we put those filenames into the hashes. The first time we do this:-
$filename_mapping{$orig_filename} = $new_filename;
Perl will set up an entry in the %filename_mapping hash for the original filename. The value we assign to that key is the new filename. Then we do the opposite for the %reverse_mapping hash. As we keep looping over filenames, if the same filename gets used as a hash key more than once, all that will happen is that the old entry will get replaced with the new entry; no new key will be created in the hash.

As an aside, this is one part of Perl 5 guaranteed to throw off people who are unused to the language; we declared the hashes with the % sigil, but when we accessed their values, there was a $ sigil at the front. This is because in Perl 5, the sigils are not directly bound to the type of the variable, but to the type of data you want to get from the variable. %foo means a hash called foo, a list of key => value pairs. $foo{bar} means the value of the entry 'bar' in the 'foo' hash, which is a scalar value, and so uses the $ sigil.

Anyway, we have two hashes, each keyed with the old and new filenames. If there is a problem with the transformed filenames overlapping each other, checking the number of keys in each hash will reveal it.
   
   # Check there are no filename collisions; the number of unique output filenames
   # should be equal to the number of unique input filenames.
   if (keys %filename_mapping != keys %reverse_mapping) {
      die "Given substitution commands result in non-unique filenames, refusing to clobber your files.\n";
   }

   show_plan(%filename_mapping);
}
Here is more Perl type magic. The keys function takes a hash, and pulls out just the key names as a list. In this case, we are comparing the return value of keys with !=, which is perl's numerically-not-equal operator. This enforces scalar context on the operation, which causes keys to report the number of keys in each hash rather then the full list of keys. Functions behaving differently based on scalar or list context is another defining, and initially off-putting, feature of Perl that I personally love.

Finally, I've also turned the act of printing out our planned changes into a subroutine:-
sub show_plan
{
   my @plan = @_;
   while (@plan) {
      my $orig_name = shift @plan;
      my $new_name = shift @plan;
      if ($orig_name ne $new_name) {
         print "$orig_name -> $new_name\n";
      } else {
         print "$orig_name : unchanged.\n";
      }
   }
}
It's pretty simple; it takes a list of filename pairs (in @plan for readability), and as long as there are entries in the list, it shifts them off two at a time into $orig_name and $new_name.

But where are we getting this list of filename pairs in the first place? Weren't we calling this subroutine as show_plan(%filename_mapping)? Don't be alarmed. The %filename_mapping hash is conceptually just a list of key => value pairs, after all - and that is what gets passed into show_plan's argument list. Perl likes lists, and makes it very easy to flatten things into a list by default. If we wanted to pass the hash over to show_plan as some sort of distinct object, well, there's ways to do that - but there's no way I'm getting into references in this walkthrough. Let's keep it as simple (hah, "simple", he says) as possible.

Check it works so far:-
james@jin: ~/Ubuntu One/blogpost/example
$ ../filenamesed4.pl 's/\d+\.\K.*\.mp4//g;' *.mp4
Given substitution commands result in non-unique filenames, refusing to clobber your files.
Good little script, good.

Wait, I wanted more scary Perl cleverness

Oh, okay, fine. Let's check what the output of the script is with a simple substitution that we know works:-
james@jin: ~/Ubuntu One/blogpost/example
$ ../filenamesed4.pl 's/(mindcrack|ultra|hardcore|uhc)//ig; s/_+/ /g;' *.mp4
0511.avidyazen.MindCrack_Ultra_Hardcore_-_S10E03_-_Contact!-Vje7iwlB3Ko.mp4 -> 0511.avidyazen. - S10E03 - Contact!-Vje7iwlB3Ko.mp4
けしからん猫の垂直跳びには敵わない。 NOTHING CAN RIVAL THE JUMPING CAT!.mp4 : unchanged.
0511.mhykol.Mindcrack_UHC_X_-_Episode_3-wSGmxqqCMBc.mp4 -> 0511.mhykol. X - Episode 3-wSGmxqqCMBc.mp4
0511.kurtjmac.Minecraft_MindCrack_-_UHC_S10E03_-_The_Trench-79xpiEor-P8.mp4 -> 0511.kurtjmac.Minecraft - S10E03 - The Trench-79xpiEor-P8.mp4
Okay, that works... except something's strange. Before, the names were all in alphanumerical order due to the shell expanding '*.mp4' in that order. Now, the names are shuffled around a bit. Why is that?

That's because when we provided the list to show_plan, we gave it the keys and values directly from the %filename_mapping hash. The order that you get the keys back in this case is not defined, because hashes don't keep their entries in any kind of order internally. We could sort the list, but it's a list of pairs, remember! We'd be destroying that association by just blindly sorting them all. We could pass the hash in by reference and let show_plan do the sorting, but like I said, I don't want to get into references just now. It would be nice if we had a way to create the list of pairs based on the order the @filenames were originally given to us on the command line.

Is Perl a good language for aribtrarily transforming lists of strings from one representation to another? for juggling data chainsaws while riding a unicycle? I don't know, you guys...

So we have a list of original @filenames. We also have a hash that stores the transformed filename for each original: %filename_mapping. We want to map the @filenames list into a new, bigger list that alternates between old and new names. Here's what we do:-
show_plan(map { $_, $filename_mapping{$_} } @filenames);
You read it sort of right-to-left; the input to map is the original @filenames in the original order that we wanted. Each entry of that list gets transformed by the code block { $_, $filename_mapping{$_} }. There's that magic default variable $_ again. The code is just mentioning a list of two things: the current entry being processed, and the result of looking up that entry in the %filename_mapping hash. map does this for everything in @filenames in turn, returns the new list and it gets sent off to show_plan like we wanted.

Here's the full script so far:-
#!/usr/bin/perl -CSDA

use strict;
use warnings;

sub help
{
   print "Usage: $0 '<substitutions>' filename(s)\n";
   print "  e.g. $0 's/from/to/g;' myfiles/*\n";
   exit 1;
}


sub main
{
   my $substitutions = shift @_;
   my @filenames = @_;

   my %filename_mapping;    # original name => new name
   my %reverse_mapping;     # new name => original name
   foreach my $orig_filename (@filenames) {
      my $new_filename = transform_filename($substitutions, $orig_filename);
      $filename_mapping{$orig_filename} = $new_filename;
      $reverse_mapping{$new_filename} = $orig_filename;
   }

   # Check there are no filename collisions; the number of unique output filenames
   # should be equal to the number of unique input filenames.
   if (keys %filename_mapping != keys %reverse_mapping) {
      die "Given substitution commands result in non-unique filenames, refusing to clobber your files.\n";
   }

   show_plan(map { $_, $filename_mapping{$_} } @filenames);
}


sub show_plan
{
   my @plan = @_;
   while (@plan) {
      my $orig_name = shift @plan;
      my $new_name = shift @plan;
      if ($orig_name ne $new_name) {
         print "$orig_name -> $new_name\n";
      } else {
         print "$orig_name : unchanged.\n";
      }
   }
}


sub transform_filename
{
   my ($substitutions, $orig_filename) = @_;

   $_ = $orig_filename;
   my $rval = eval $substitutions;
   my $new_filename = $_;
   die "Error in substitutions: $@" if $@;

   return $new_filename;
}


help() if (@ARGV < 2);
main(@ARGV);

More!

I originally envisioned writing this up in an afternoon, fixing up my old script and making a little blog about it as a bonus. It has quickly transformed into this huge monster of a post, and I still have more ideas on how to improve the script. So what we'll do is, I'll continue writing in the background and make a Part 2 post sometime next week. It'll also give you a breather after my massive wall of text.

(Edit: It's six months late,  but damnit I wanted to finally complete this. Part 2 is here.)

Despite its length, this post has been slightly more newbie-oriented than my other mentions of Perl, but it's still nowhere near simple enough to serve as a beginner's tutorial on learning the language with zero prior knowledge. I'm not sure how I would teach Perl programming from first principles, but perhaps some readers are curious about it thanks to me writing this monstrosity. To those wanting to learn more, I highly recommend the book Modern Perl, which is available in a free ebook version. It neatly side-steps a lot of the scary cruft that Perl 5 has accumulated over the years, and highlights the really neat modern conveniences that are available in recent versions.

No comments:

Post a Comment