30 June 2014

Decimating Directories

Decimating Directories

Whenever you set up some automated system that produces files, there's always that nagging fear that you'll forget about it and it will run rampant, filling up your hard drive with clutter. One good example is using motion as a security system - you want to keep the most recent video clips in case you need to refer back to them, but there's little point in keeping the oldest ones.

Keeping only the most recent n videos and deleting the rest could be problematic, because the individual files could be large. Keeping anything younger than a certain number of days is no good, because there could be a burst of activity that creates a lot of files. So we want to make a script that will trim a directory of files down to a specific size.



This'll be a short one. No need to overthink things, no grand architecture, let's just make a script to trim a directory down to size. Skipping on past the usual boilerplate (You can download the complete script later if you really want a look), let's flesh out the options we want this utility to offer:-

my %opts;
GetOptions(\%opts, 'help!', 'delete!', 'move=s', 'size=s', 'atime!', 'mtime!', 'ctime!', 'basename!', 'pathname!', 'verbose!');

sub help
{
   print STDERR <<EOF;
Usage: $0 --size <size>{K,M,G} dirname/ --delete
       $0 --size <size>{K,M,G} dirname/ --move destdir/

       Omitting --delete and --move does a dry-run.

       --atime    : Sort by last access time
       --mtime    : Sort by last modified time (default)
       --ctime    : Sort by last inode change time
       --basename : Sort by file basename
       --pathname : Sort by file pathname

       --verbose : Show the list of files with timestamps

Sorts files contained in the given directory according to age, and deletes
the oldest files until the total size of the directory is under the given
size limit.
EOF
   return 1;
}
Since it's designed to run in an automated fashion, it's not going to print any output by default, and it's not going to take any action either, unless explicitly specified by --delete or --move. Sorting files by their "mtime" is the most likely use-case, so it's the default, but I do want to leave the door open for other sorting options in case I need them.

On to the main() subroutine. I'm going to be using two 'global' variables, variables declared outside the lexical scope of any of these subroutines which can be accessed by any of them. @FILES will be our list of hash-refs, each one representing a file we are potentially culling and containing keys storing the file name, size, and modification times. $TIME_ATTR just contains the name of the attribute we want to sort by. Generally speaking, it's good programming practice to avoid magic globals that get modified via spooky-action-at-a-distance. However, this is a small script and the alternative would be to generate our collating and sorting functions in a very roundabout way using closures. We don't really need that complexity for a simple little script.

The main() sub needs to check over the command-line arguments, then use File::Find to populate @FILES. Rather than write the entire find callback in-line, this time I feel it's big enough to put in its own subroutine. The second main step that needs to happen is sorting @FILES by the chosen attribute - I'm using the variant of Perl's builtin sort that allows you to specify the name of a subroutine to use to do the sorting.

sub main
{
   # We accept exactly one argument not covered by Getopt::Long : which dir to operate on.
   my $dir = shift @ARGV;
   die "Need to supply a directory name!" unless (defined $dir && -d $dir);
   die "Need to supply a --size!" unless (defined $opts{size});
   die "--move option needs a directory to move to!" if ($opts{move} && ! -d $opts{move});

   # Adjust options based on command line.
   my $size = parse_size($opts{size});
   $TIME_ATTR = "atime" if ($opts{atime});
   $TIME_ATTR = "mtime" if ($opts{mtime});
   $TIME_ATTR = "ctime" if ($opts{ctime});
   $TIME_ATTR = "basename" if ($opts{basename});
   $TIME_ATTR = "pathname" if ($opts{pathname});

   # Collate file info. This fills in @FILES.
   find({ wanted => \&find_callback, no_chdir => 1 }, $dir);

   # Sort by mtime (or whatever was set in $TIME_ATTR)
   my @sorted_files = reverse sort by_time_attr @FILES;  # highest mtime first (oldest last)

   # Make a cut if you go above the limit.
   my ($keep_files, $cull_files) = split_files_after_limit($size, @sorted_files);

   print "Keeping ", scalar @$keep_files, " files:-\n" if ($opts{verbose});
   print_files(@$keep_files) if ($opts{verbose});

   if ($opts{move}) {
      print "\nMoving ", scalar @$cull_files, " files:-\n" if ($opts{verbose});
      print_files(@$cull_files) if ($opts{verbose});
      move_files(@$cull_files);

   } elsif ($opts{delete}) {
      print "\nDeleting ", scalar @$cull_files, " files:-\n" if ($opts{verbose});
      print_files(@$cull_files) if ($opts{verbose});
      delete_files(@$cull_files);

   } else {
      print "\nWould Cull ", scalar @$cull_files, " files:-\n" if ($opts{verbose});
      print_files(@$cull_files) if ($opts{verbose});

   }
}
We further delegate the actual splitting of the list of files to another subroutine, split_files_after_limit(), and finally act on --delete or --move if they were specified.

So, what do the finding and sorting callbacks look like?

sub find_callback
{
   # We want to use File::Find's no_chdir option, so we want to figure out the basename by ourselves.
   my $pathname = $File::Find::name;
   utf8::decode($pathname);
   my $basename = basename($pathname);

   # Don't match dotfiles.
   return if ($basename =~ /^\./);
   # We only care about files, not dirs.
   return unless (-f $pathname);

   # Collect stats.
   my ($dev,$ino,$mode,$nlink,$uid,$gid,$rdev,$size,$atime,$mtime,$ctime,$blksize,$blocks) = lstat($pathname);
   push @FILES, {
            pathname => $pathname,
            basename => $basename,
            size => $size,
            atime => $atime,
            mtime => $mtime,
            ctime => $ctime,
         };
}


sub by_time_attr
{
   if ($TIME_ATTR eq 'basename' || $TIME_ATTR eq 'pathname') {
      $a->{$TIME_ATTR} cmp $b->{$TIME_ATTR};
   } else {
      $a->{$TIME_ATTR} <=> $b->{$TIME_ATTR};
   }
}
I've learned my lesson with File::Find, and make sure to use the no_chdir flag, so we always get the filename complete with path elements. We still need to explicitly decode the filename as utf8 ... brave users of non-utf8 systems can hack the file up themselves, I'm sure. We return early if the file is a hidden 'dot' file or if it's not a normal file. Otherwise, we record the information about the file that we're interested in and move on.

Sorting is pretty easy. I'd have done it in-line, but there is one little gotcha: if I wanted to sort by the basename of the file (because sometimes the file names are timestamps with some special meaning, or a sequence number, for example) then I need to compare the sorting arguments with cmp, perl's stringwise comparison operator. Otherwise, we're dealing with a numeric comparison and need to use the <=> numeric comparison operator. See perldoc perlop for more information.

What else? Well, we need to write the subroutine to split a list of files into the two groups, which is straightforward enough; just loop through them and keep a running total of the size.

sub split_files_after_limit
{
   my ($limit, @files) = @_;
   my (@keep, @cull);
   my $running_total = 0;

   foreach my $file (@files) {
      $running_total += $file->{size};

      # If this file is the one that puts us over the limit, it and everything afterwards
      # goes in 'cull'.
      if ($running_total > $limit) {
         push @cull, $file;
      } else {
         push @keep, $file;
      }
   }

   return (\@keep, \@cull);
}
Moving the files is easily accomplished with the File::Copy module, and deleting is accomplished with the builtin unlink function. I hope this script proves useful to people - you can download it on the Script Toolbox downloads page.

No comments:

Post a Comment