31 July 2014

Find Bug!

No-one is perfect. Sometimes you'll write code and think it works fine; you've tested all the edge-cases you can think of and nothing seems amiss. It won't be until months later that you're using that code and see it stumble over something it shouldn't. Something that couldn't possibly go wrong just did, right before your eyes.

In my case, the vigil.pl program I wrote in Integrating Integrity stumbled on some directories with Unicode in their names. It's a good thing I noticed it, because I'd started to rely on my little program more and more recently. Let's figure out what happened and fix the bug.

First thing to do is reproduce the bug. Here's a trimmed-down set of example directories that it was having trouble with:-

./new
./new/chinesepod_QW0305pb.mp3
./new/chinesepod_QW0302pb.mp3
./new/feed.url
./new/chinesepod_QW0304pb.mp3
./new/chinesepod_QW0301pb.mp3
./new/chinesepod_QW0303pb.mp3
./0.Newbie 菜鸟
./0.Newbie 菜鸟/chinesepod_intro7_The_End_of_the_Beginning.mp3
./0.Newbie 菜鸟/feed.url
./1.Elementary 初级
./1.Elementary 初级/chinesepod543_B102_20070402.mp3
./1.Elementary 初级/chinesepod537_B101_20070327.mp3
./1.Elementary 初级/feed.url
./1.Elementary 初级/chinesepod533_B100_20070322.mp3

I was attempting to make a .md5sum indexing all these directories, and was running the following:-

vigil.pl --create chinesepod.md5sum .

Incidentally, ChinesePod is a fantastic resource for learning Mandarin. Anyway, this all looked pretty straightforward - I'd tested to make sure I was getting Unicode filenames done properly before, and I just wanted to recursively checksum everything in the '.' directory - the current directory. However, I discovered that vigil.pl was passing over the Newbie and Elementary lessons, and only indexing the files in the 'new' directory. What was going on?

Unicode is Hard (to do right)

It looked like a Unicode related problem. Even though I'd tested to make sure files with Unicode characters were loaded correctly, processed correctly, and written out correctly, I clearly had a problem with directory names. But my code wasn't directly responsible for recursively traversing the tree - I'd left that in the capable hands of File::Find. But there's no way such a core module would have such a critical bug, right? So surely it's something dumb I've done and it's my fault.

Let's add a little bit of debugging to our find() invocation, just as a sanity check:-
   find( {
            # We pass in an anonymous sub to File::Find for it to run on every file and directory
            # as it traverses the tree. Note that the parameter name, 'wanted', is a misnomer;
            # File::Find will not do anything with the return value of our sub.
            wanted => sub {
               # Apparently File::Find doesn't handle utf8 filenames.
               my $fullpathname = $File::Find::name;
               utf8::decode($fullpathname);

               utf8::decode($_);
               print STDERR "Visited: $_\n";

               if (-e $fullpathname && ! -d $fullpathname) {
                  push @plan, create_plan_for_filename($fullpathname);
               }
            },
            follow_skip => 2,    # If we get duplicate dirs, files etc. during find, ignore and carry on.
            no_chdir => 1,       # We want our current dir consistent after Find does its work.
         },
         @dirs_and_files_to_sum
      );

Sure enough, it shows that our wanted subroutine got called for all the directories, even the ones with Chinese in their names, but oddly enough not any of the files within. Curious!

Constructing plan...
Visited: .
Visited: ./0.Newbie 菜鸟
Visited: ./1.Elementary 初级
Visited: ./new
Visited: ./new/chinesepod_QW0301pb.mp3
Visited: ./new/chinesepod_QW0302pb.mp3
Visited: ./new/chinesepod_QW0303pb.mp3
Visited: ./new/chinesepod_QW0304pb.mp3
Visited: ./new/chinesepod_QW0305pb.mp3
Visited: ./new/feed.url
Complete. Plan contains 6 items.

Try as I might, I couldn't figure out what was going wrong. It should be fine, my brain kept saying, and so determining the cause of the problem looked as impossible as the bug itself. My code must be doing something weird to File::Find to cause it to misbehave. It's time to write a small test program and see if we can replicate the results in that:-

#!/usr/bin/perl -CSDA

use strict;
use warnings;
use utf8;
use File::Find;

find( {
         wanted => sub {
               my $n = $File::Find::name;
               utf8::decode($n);
               print "$n";
               if (-e $n && ! -d $n) {
                  print " - will checksum\n";
               } else {
                  print "\n";
               }
            },
         no_chdir => 1
      }, ".");

And of course, it ... works perfectly?! What?

.
./0.Newbie 菜鸟
./0.Newbie 菜鸟/chinesepod_intro7_The_End_of_the_Beginning.mp3 - will checksum
./0.Newbie 菜鸟/feed.url - will checksum
./1.Elementary 初级
./1.Elementary 初级/chinesepod533_B100_20070322.mp3 - will checksum
./1.Elementary 初级/chinesepod537_B101_20070327.mp3 - will checksum
./1.Elementary 初级/chinesepod543_B102_20070402.mp3 - will checksum
./1.Elementary 初级/feed.url - will checksum
./new
./new/chinesepod_QW0301pb.mp3 - will checksum
./new/chinesepod_QW0302pb.mp3 - will checksum
./new/chinesepod_QW0303pb.mp3 - will checksum
./new/chinesepod_QW0304pb.mp3 - will checksum
./new/chinesepod_QW0305pb.mp3 - will checksum
./new/feed.url - will checksum

Unicode is Easy (to muck up)

This was maddening. The test program worked fine, so what was it doing differently? After a bit of talking things over with my flatmate ("Rubber-Duck Debugging") and standing on the balcony staring off into the distance, one possibility occurred to me. In the test program, we are hard-coding the directory to scan as ".", but in vigil.pl we take it from @ARGV, the list of command-line arguments that Getopt::Long didn't touch. The 'A' in -CSDA, the perl command-line switch that turns additional Unicode support on, tells perl to treat @ARGV as UTF-8, so our arguments arrive already decoded into character strings. File::Find, as we've seen before, doesn't support Unicode strings. A quick tweak to our test program, to use @ARGV instead of ".", confirms it:-

.
./0.Newbie 菜鸟
./1.Elementary 初级
./new
./new/chinesepod_QW0301pb.mp3 - will checksum
./new/chinesepod_QW0302pb.mp3 - will checksum
./new/chinesepod_QW0303pb.mp3 - will checksum
./new/chinesepod_QW0304pb.mp3 - will checksum
./new/chinesepod_QW0305pb.mp3 - will checksum
./new/feed.url - will checksum

So what's going on here?

Perl 5 can represent strings in two different ways. The first is just a bunch of bytes - Perl neither knows nor cares if they happen to also be characters in some encoding, and the various string manipulation functions work on the bytes. That's great if you're only working in ASCII, but modern times demand modern strings. The second representation Perl uses is strings of characters, and these aren't limited to the 0-255 range. Internally, I believe it uses UTF-8 to represent them, but it's best not to think too hard about this and instead consider them to be pure abstract Unicode strings.
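
To make the distinction concrete, here's a minimal sketch (mine, not part of vigil.pl) that holds the same text both ways. The variable names are made up for illustration; the three bytes are the UTF-8 encoding of 菜:-

```perl
use strict;
use warnings;

# Three raw bytes: the UTF-8 encoding of 菜 (U+83DC).
my $as_bytes = "\xE8\x8F\x9C";

# Copy the bytes, then decode the copy in place into a character string.
my $as_chars = $as_bytes;
utf8::decode($as_chars);

# length() counts bytes for the first string and characters for the second.
printf "bytes: length %d\n", length $as_bytes;
printf "chars: length %d, code point U+%04X\n",
    length $as_chars, ord $as_chars;
```

The byte string has a length of 3 and the character string a length of 1, even though both hold the same piece of text.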

File::Find, however, only deals with the byte sequence strings. It gets its filenames from the filesystem, which could be using all sorts of weird encodings, and as long as it supplies these same byte sequences to the system calls that let it traverse the filesystem, everything should be fine. It hands the filenames as bytes to our wanted() function, we arrogantly assume it's UTF-8 and decode it as such, and everything works fine.

Except we are passing in one of our command-line arguments as the base directory to traverse, and we're explicitly decoding them as UTF-8 so that variable will be a Unicode Character String. I can only speculate that File::Find is concatenating or otherwise manipulating the name of the directories it looks at with that variable, and character string plus byte string equals madness. It worked fine on the 'new' directory because utf8 flag or no, everything sits nicely in the ASCII range. But when it attempts to construct a path using the other directories, it ends up with something it or the OS doesn't consider a valid path.
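
Here's a small sketch of what I suspect is happening, with File::Find taken out of the picture entirely. It assumes the base directory arrives flagged as a character string (utf8::upgrade is standing in for what -CA does to @ARGV) and that the directory entry comes back from the filesystem as raw bytes. The variable names are hypothetical, and whether File::Find hits exactly this path internally is my guess:-

```perl
use strict;
use warnings;

# Base directory as a character string. utf8::upgrade() is a stand-in here
# for the decoding that -CA applies to @ARGV.
my $base = ".";
utf8::upgrade($base);

# Directory entry as raw bytes, the way readdir would return it:
# the UTF-8 encoding of 菜.
my $entry = "\xE8\x8F\x9C";

# Joining the two upgrades $entry as if each byte were a Latin-1 character.
my $joined = "$base/$entry";

# What the path should look like on disk: plain bytes throughout.
my $wanted_bytes = "./$entry";

# Approximation of what reaches the OS: the internal UTF-8 form of $joined.
my $sent_bytes = $joined;
utf8::encode($sent_bytes);

printf "wanted: %s\n", join ' ', map { sprintf '%02X', ord } split //, $wanted_bytes;
printf "sent:   %s\n", join ' ', map { sprintf '%02X', ord } split //, $sent_bytes;
```

The "wanted" path ends in the three bytes E8 8F 9C, while the "sent" path ends in six bytes, because each original byte has been encoded a second time. A path like that names a directory that doesn't exist, which would explain why nothing inside it gets visited.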

We can test a quick fix by chopping the 'A' off our perl invocation, and sure enough it works. A neater solution is to utf8::encode() the directory names just before they go to File::Find; this will convert them back into byte sequences and keep all the non-Unicode parts of my program clearly walled off. Again, I'm assuming the filesystem is always UTF-8, and I shouldn't, but I don't even know where to begin adding support for multiple encodings. This'll do.
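
A minimal sketch of that fix, under the assumptions already stated: every path handed in is a decoded character string (as @ARGV is under -CA) and the filesystem uses UTF-8. The helper name paths_as_bytes and the temporary directory layout are made up for this demonstration, so the real vigil.pl code may well look different:-

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Hypothetical helper: return UTF-8 encoded byte-string copies of decoded paths.
# Only call it on character strings; encoding raw bytes would double-encode them.
sub paths_as_bytes {
    my @paths = @_;
    utf8::encode($_) for @paths;
    return @paths;
}

# Throwaway tree: one directory named 菜鸟 (as raw UTF-8 bytes) holding one file.
my $root = tempdir( CLEANUP => 1 );
my $sub  = "$root/\xE8\x8F\x9C\xE9\xB8\x9F";
mkdir $sub or die "mkdir failed: $!";
open my $fh, '>', "$sub/feed.url" or die "open failed: $!";
close $fh;

# Simulate a decoded command-line argument: decode, then force the utf8 flag on.
my $arg = $root;
utf8::decode($arg);
utf8::upgrade($arg);

# Wall the character string off again just before File::Find sees it.
my @files;
find( {
        wanted   => sub { push @files, $File::Find::name if -f $File::Find::name },
        no_chdir => 1,
      },
      paths_as_bytes($arg)
);

print scalar(@files), " file(s) found\n";
```

Doing the encoding in one small helper right at the find() call keeps everything upstream working in character strings, and means there's only one place to revisit if I ever stop assuming UTF-8.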

Flagging Errors

But wait a minute. Reverse up a little there. We were feeding File::Find the string literal '.' and it was working fine - but it shouldn't have worked! The string '.' should be every bit as Unicode and character-based as our decoded @ARGV! While I'm happy I found a solution that fixes the program, I'm not particularly happy that I've rocked my understanding of how things are supposedly working in the process. Is there a way to test how Perl is treating my string literal?

To figure this out, I did a bit of research. Perl has a utf8 flag for strings and it gets turned on when Perl is certain that it is dealing with decoded Unicode characters and not any kind of byte representation. It's also interesting to note that Goal #1 of Perl 5's Unicode support is that "Old byte-oriented programs should not spontaneously break on the old byte-oriented data they used to work on." So perhaps Perl wasn't treating '.' as Unicode because there weren't any non-ASCII characters in there. Let's do some further testing:-

#!/usr/bin/perl -CSDA

use strict;
use warnings;
use utf8;
use File::Find;

print "\@ARGV is ", join(', ', @ARGV), " and ";
print utf8::is_utf8($ARGV[0])? "has" : "does not have";
print " the utf8 flag set.\n";

my $dir = ".";
print "My \$dir variable is '$dir' and ";
print utf8::is_utf8($dir)? "has" : "does not have";
print " the utf8 flag set.\n";

my $otherdir = "中文";
print "My \$otherdir variable is '$otherdir' and ";
print utf8::is_utf8($otherdir)? "has" : "does not have";
print " the utf8 flag set.\n";

find( {
         wanted => sub {
               my $n = $File::Find::name;
               utf8::decode($n);
               print "$n";
               if (-e $n && ! -d $n) {
                  print " - will checksum\n";
               } else {
                  print "\n";
               }
            },
         no_chdir => 1
      }, @ARGV);

This gives us the output:-

@ARGV is . and has the utf8 flag set.
My $dir variable is '.' and does not have the utf8 flag set.
My $otherdir variable is '中文' and has the utf8 flag set.
.
./0.Newbie 菜鸟
./1.Elementary 初级
./new
./new/chinesepod_QW0301pb.mp3 - will checksum
./new/chinesepod_QW0302pb.mp3 - will checksum
./new/chinesepod_QW0303pb.mp3 - will checksum
./new/chinesepod_QW0304pb.mp3 - will checksum
./new/chinesepod_QW0305pb.mp3 - will checksum
./new/feed.url - will checksum

This does seem to support my theory. Two string literals in a utf8 source file, one gets the flag and one doesn't. That's a little awkward, but I guess in the name of older programs not breaking it may have been necessary. Am I missing something, though? Is there a way to ensure all literals are promoted to Unicode no matter what, and should I be using it if it exists?

That, I think, is a question for another day. It will involve much meditation and perhaps consultation of user tchrist's fantastic post on Unicode and Perl on StackOverflow. Seriously, it's a great read and shows just why this whole Unicode business gets so complicated.

The fixed version of vigil.pl is available on the Script Toolbox download page.
