"Find a safe part and use it as an anchor"
-- Brian Eno and Peter Schmidt, Oblique Strategies
In my previous The Perl Journal article, "Resource Locking with Semaphore Files", I showed some of the problems that can appear when concurrently running processes try to access the same file (or other scarce resource) at the same time; as a solution, I proposed semaphore files based on Unix flock. However, that solution depended on all the processes in question being on the same machine, and on that machine supporting a basically Unix-like file-locking model. In this article, I'll deal with how to do resource locking when we can't assume any of those things -- when the processes might be on different machines with no common filesystem, and when possibly none of those machines are Unixes.
Many Ways To Do It
The thing that makes file locking work correctly is that no two processes can hold a lock on the same file at the same time -- if several try, then only one is allowed to succeed. That's not a characteristic of just file locking, though:
- In many database systems, if several simultaneously connected processes try to create a record with the same unique identifier, then only one is allowed to do so, and the rest are told that it wasn't possible. (There's a sketch of this approach just after this list.)
- In most filesystems, the same is true of deleting a file: if several processes try to delete the same file at the same time, only one is actually told that it succeeded in deleting it.
- Similarly, in any even reasonably sane filesystem, if several processes try to make a directory "foo" (which doesn't already exist) at the same time, then only one of them actually succeeds, and the others are told that they can't mkdir "foo", as "foo" already exists.
- In many filesystem models, you can differentiate opening an existing file from creating one that wasn't there. So if several processes try, even at the very same moment, to create a new file named "foo.txt", only one of them actually succeeds, and the others are told that it can't be done, because such a file already exists. (This approach, too, is sketched below.)
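Here are minimal sketches of those two approaches. The create-a-new-file approach uses sysopen with the O_EXCL flag, which makes the does-it-exist test and the creation a single atomic operation (the filename here is just an illustration):

use strict;
use Fcntl qw(O_WRONLY O_CREAT O_EXCL);

# Try to create "foo.txt", but only if it doesn't already exist.
# With O_EXCL, at most one concurrent caller can succeed.
if( sysopen(my $fh, "foo.txt", O_WRONLY | O_CREAT | O_EXCL) ) {
  print "I created foo.txt, so I hold the lock.\n";
  close $fh;
}
else {
  print "Someone beat me to it: $!\n";
}

And the database-flavored approach looks something like this, assuming DBI, a MySQL-ish server, and a locks table with a unique name column -- all hypothetical, since none of them appear anywhere else in this article:

use DBI;
# Assumes a table created with something like:
#   CREATE TABLE locks (name VARCHAR(64) PRIMARY KEY)
my $dbh = DBI->connect('dbi:mysql:kaosdb', 'someuser', 'somepass',
                       { PrintError => 0 })
 or die "Can't connect: $DBI::errstr";

# Only one concurrent process can insert a given primary key:
until( $dbh->do("INSERT INTO locks (name) VALUES ('big_config')") ) {
  sleep 2;  # Someone else holds the lock; wait, then try again.
}
# ... manipulate the locked resource here ...
$dbh->do("DELETE FROM locks WHERE name = 'big_config'");  # Release it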
When doing resource locking over a network, many people use some variation of the first solution, involving databases. This approach is inconvenient unless you have a network-accessible database system already set up; and it's not possible unless you know which of its operations the database system will treat like the ones above. Instead, the third way -- the mkdir approach -- is the one I prefer, and it works like this:
There could be several concurrent processes that want to manipulate some resource (like reading in a file called "/stuff/big_config.txt", making some changes, and then re-writing it). And if it would be bad for more than one of them to be manipulating that resource at the same time, then before any process deals with that resource, it must try to mkdir "/stuff/big_config_lock". If that mkdir operation succeeds, then that process has permission to manipulate that other resource, the "/stuff/big_config.txt" file. But if a process calls mkdir "/stuff/big_config_lock", and it doesn't succeed, then we take this to mean that it already exists, and some other process must be in the middle of accessing the "/stuff/big_config.txt" resource. Or in Perl terms:
sub wait_until_lock_dir {
  my $dir = $_[0];
  until( mkdir $dir ) { sleep 2 }
  return;
}

sub unlock_dir {
  my $dir = $_[0];
  warn "Couldn't rmdir $dir: $!" unless rmdir $dir;
  return;
}

wait_until_lock_dir("/stuff/big_config_lock");
print "OK, I've got the lock!\n";
... do whatever with /stuff/big_config.txt ...
unlock_dir("/stuff/big_config_lock");
Compared to the Unix file-locking model, this model has the following disadvantage. Suppose, under the Unix file-locking model, that a process opens and locks a semaphore file, and then dies while manipulating the resource. The operating system will close the file, and thereby end the lock; so under that model, it's impossible for a file to be locked unless there is a running process that holds the lock. But suppose that instead of a semaphore file, we're using the existence of a directory to mean "that resource is in use". If a process starts up, successfully makes the directory, and then dies while manipulating the resource, then there's nothing to automatically make that directory go away. Unless something comes along and deletes it, every process that tries to deal with that resource will end up waiting forever for it to be available. This problem can be coped with in various ways, as I'll describe later in this article.
This use of mkdir as a locking mechanism has two advantages over normal Unix-ish flock. First off, it works under OSs other than Unix! Second off, it works across ftp! That is, instead of making the directory on the local filesystem with mkdir($dirname), you log into an ftp server and call $ftp->mkdir($dirname). The point of doing this is that the ftp server can then essentially serve as a "lock server" for any processes that can access it.
Using ftp mkdir for Resource Locking
So let's use a concrete example here, one that should be familiar to all our readers. You're the Chief Technical Officer for KAOS, the worldwide consortium of supervillains. Among your many evil duties is coordinating the exchange of the staff phone number directory, phone.tsv. In an ideally evil world, you would have all the evil resources needed to build a solid network. But for years now, all available funds have been diverted to the construction of the giant orbiting death-ray, leaving you with only a ramshackle mess of legacy systems.
Notably, the chief of Henchman Resources, Generalissima Mok, personally updates the phone directory on a daily basis, as old staff members are killed for their incompetence and others are promoted. She has learned to do this using FileMaker on an old Apple laptop running MacOS 7. You once meekly suggested that she use a Web interface for maintaining the phone database, and the Generalissima reacted by torturing you with electric shocks until you promised to forget the whole idea.
She did, however, let you write and install a MacPerl program on her laptop to ftp the phone.tsv file from her Mac to the main server, and she dutifully runs the program after every evil update to the database. The meat of that program looks like this:
use Net::FTP;  # The class for FTP connections

my($host, $user, $pass, $file) = (
  'ftp.kaos.int', 'phoneacc', 'dmb4ever', 'phone.tsv'
);

my $conn = Net::FTP->new($host)
 || die "Can't ftp-connect to $host: $@";
$conn->login($user, $pass)
 || die "Can't log in as $user\@$host";
$conn->put($file)
 || die "Can't ftp-put $file";
$conn->quit();
undef $conn;
And that program does a fine job of ftp'ing up the file when run. The problem you quickly run into is that others might be downloading phone.tsv from the server just as she's uploading it, resulting in their getting a file that's truncated or garbled. This upsets people, because if they fail to find some friend's name in the phone directory, they assume he has been killed for incompetence. This leads to a tense work environment. You solve the problem of unnerving partial transfers by using a lock directory, like so:
...
$conn->login($user, $pass)
 || die "Can't log in as $user\@$host";

my $lock_dir = 'phone_lock';
until( $conn->mkdir($lock_dir) ) { sleep 3 }

# Send it to the remote server
$conn->put($file)
 || die "Can't put ftp file $file";

$conn->rmdir($lock_dir);
$conn->quit();
undef $conn;
The until line waits until this is the one and only process that succeeds in making the directory phone_lock -- whose existence then signals that this process is the only one accessing the file. The only other new command we've added here is the rmdir, which deletes phone_lock once we're done with the resource, signaling that we're no longer staking a claim on it.
On the various hosts that need their own copy of phone.tsv, you write Perl programs to do the downloading. They look exactly like the above program, except that the put line is replaced with this:
# Get it from the remote server
$conn->get($file)
 || die "Can't get ftp file $file";
So if one of those processes logs in while Generalissima Mok is uploading her phone.tsv, it will get as far as the until line that tries to create phone_lock. But that mkdir command will fail, causing the loop body of sleep 3 to be executed; then it'll try again, and keep trying, until the Generalissima has finished the upload and rmdir'd the phone_lock directory. Then, when this downloader process next calls mkdir, it will succeed, and the download can proceed.
Or, going the other way, if the Generalissima's uploader process logs in while some other process is downloading phone.tsv, it will go into a sleep loop until that process is finished and rmdir's the directory.
There are three problems with this system:
First, the system dutifully stops Generalissima Mok from uploading a phone.tsv at the same moment that anyone is downloading it. But it does this by stopping any two processes, of any kind, from accessing the file at the same time. This is broader than necessary, since it also stops two processes from downloading the file at the same time -- which is perfectly harmless. That the system prevents this is a problem only in that it slows things down somewhat; but I see no way around it.
Second, what if a process can't rmdir the lock directory? I've seen this happen only in one particular case, where a server would let a user mkdir a directory under an a+rwx directory owned by someone else, but wouldn't let the same user rmdir the directory he'd just created! The solution is to run an interactive ftp session before you set up your lockdir system, to make sure that your ftp host doesn't behave that way. If it does, then put the lock directory under a directory that the account itself owns, so that the account can rmdir it as needed. This should be a simple matter of setting $lock_dir to something different, like this:
my $lock_dir = '/evilusers/phoneacc/phone_lock';
And third, what if a process successfully makes the directory phone_lock, then starts downloading or uploading phone.tsv, but then the ftp connection dies because of a network outage? In that case, phone_lock would stick around forever. As I pointed out, this isn't a problem with Unix flock, but it's sure a problem here. More and more processes would accumulate, one by one connecting to the remote server and waiting literally forever for phone_lock to be removed. I see no general solution to this problem; but in these circumstances, I have used two makeshift solutions. First, I make the clients give up after a while, by changing the until loop to this:
my $give_up_after = time() + 10 * 60;
  # Give up 10 minutes from now
until( $conn->mkdir($lock_dir) ) {
  sleep 3;
  die "Tired of waiting for $lock_dir!"
   if time() > $give_up_after;
}
Second, I decide that a lock should not be any good once it's beyond a certain age. The simplest way to do this is to crontab a program to run every few minutes on the ftp server, forcibly removing the lock directory if it exists but is old. This is done on the presumption that any lock directory that old must be the result of a process that didn't clean up after itself, since no legitimate operation could take that long.¹ The program to run on the ftp server is a simple one:
#!/usr/bin/perl
# Delete phone_lock if it's from earlier than six minutes ago
use strict;
my $dir = "/evil/ftp/phone_lock";
my $six_mins_ago = time() - 6 * 60;
if( -e $dir and (stat(_))[9] < $six_mins_ago ) {
  rmdir $dir or warn "Couldn't rmdir $dir: $!";
}
exit;
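For instance, on a server with a Vixie-style cron, a crontab entry like the following (with a purely made-up path to the script above) would run it every five minutes:

*/5 * * * * /evil/ftp/bin/zap_old_phone_lock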
I arbitrarily chose ten minutes as how long any process would wait for a lock, and six minutes as the maximum lifetime of a lock directory. You'd need to come up with your own figures, which may be much longer or shorter, depending on how long it could reasonably take any process to transfer phone.tsv, and on how many processes are ever likely to be queued up waiting for a lock at any one time.
Incidentally, you can write a program like the one above that kills old lock directories, and have it run not locally, but over ftp. The mechanics of asking the age of a directory over ftp are very implementation-dependent, though, and this is left as an exercise for the reader.
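If you want a head start on that exercise, here is one hedged sketch. It works only with ftp servers whose MDTM command will answer for a directory; on servers that won't (and there are many), Net::FTP's mdtm method just returns undef, and this program harmlessly does nothing:

use strict;
use Net::FTP;

# Delete the remote phone_lock directory if it's more than six
# minutes old -- workable only if this server's MDTM command is
# willing to report a modification time for a directory.
my $ftp = Net::FTP->new('ftp.kaos.int')
 or die "Can't connect: $@";
$ftp->login('phoneacc', 'dmb4ever') or die "Can't log in";

my $mtime = $ftp->mdtm('phone_lock');  # undef if MDTM balks
if( defined $mtime and $mtime < time() - 6 * 60 ) {
  $ftp->rmdir('phone_lock') or warn "Couldn't rmdir phone_lock";
}
$ftp->quit();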
Locksmith Processes
The real trouble in the above scenario is the possibility of processes creating a phone_lock directory but failing to remove it. The program I showed previously, which removes too-old phone_lock directories, will eventually catch all such cases. But in some situations we can do better. A program going away without cleaning up its phone_lock directory might be the result of network trouble, and we can't do much about that. But often it's the result of the program throwing an exception. That's not very likely in the above program, because the only thing happening between the mkdir and the rmdir is a single get or put operation. But suppose that the program were more complex: pulling down one file, running some sort of elaborate conversion on it, and then sending it back up. And suppose that that conversion could run out of system memory, which isn't a trappable exception -- the OS will simply kill the process. In that case, there's no way for the process itself to be sure to rmdir that lock directory.
However, if we have mostly Unix-like² process control where the program is running, we can delegate the whole business of obtaining and releasing the lock to a very simple independent subprocess, which I call a "locksmith". All that a locksmith process needs to do is get the lock (or report that it can't), and then wait around for one of two things to happen: either the parent process's signaling that it wants the lock released, or the parent process's untimely demise. When either happens, the locksmith process tidily removes the lock directory.
In very Perlish fashion, "there's more than one way to do it", but this is the way I implement it: the parent process opens a one-way pipe from netlock, which is what I've named my locksmith program. The parent passes a profile name as a command-line parameter, which netlock uses to look up the information about how to log in; netlock then prints one line: either something starting with "OK", to signal that it got the lock, or anything else (consisting of an error message), to signal that it couldn't get the lock. After that, it just waits either for the parent process to die (which can be discerned with getppid() == 1) or for the parent process to close its pipe from the child (which can be detected by a SIGPIPE signal being thrown when the child tries to write to that pipe).
The netlock program is simple enough, introducing few genuinely new constructs:
use strict;
# netlock.  Call me like:  "netlock phonestuff"

my %profile = (
  'phonestuff' => [
    'phoneacc', 'dmb4ever', 'ftp.kaos.int', 'phone_lock'
  ],
);

# - - - - - - - - - - - - - - - -
use Net::FTP;
my $profile = $ARGV[0];
$| = 1;  # Don't buffer my STDOUT!

sub bye { print @_, "\n"; exit }
  # Routine to complain that we couldn't get a lock

bye "What profile?" unless defined $profile;
my $auth = $profile{$profile};
bye "I don't know the profile '$profile'!" unless $auth;
my($user, $pass, $host, $dir) = @$auth;

my $ftp = Net::FTP->new($host)
  or bye "Can't connect to $host";
$ftp->login($user, $pass)
  or bye "Can't log into account $user\@$host";

my $quit_flag;
$SIG{'PIPE'} = sub { $quit_flag = 1 };

my $locked;
until( $quit_flag or 1 == getppid ) {
  if( $ftp->mkdir($dir) ) { $locked = 1; last; }
  sleep 4;  # Don't totally hammer the ftp server!
}
exit unless $locked;  # If we aborted early somehow.

print "OK\n";  # Tell the parent that we got the lock!

# Wait around until the parent says to release, or dies.
until( $quit_flag or 1 == getppid ) {
  $ftp->pwd();  # Don't let the connection time out.
  sleep 5;
  print STDOUT '.';
}

$ftp->rmdir($dir);
$ftp->quit();
exit;
This may seem a bit mysterious, but there are only three unusual things here. First, there's getppid, which returns the process ID of the parent process (i.e., the process for which netlock is acting as locksmith). A basic fact to note is that it returns 1 if and only if the original parent process has died. Second, there is the $SIG{'PIPE'} = sub {$quit_flag = 1} line. This simply tells Perl "from now on, in this process, if you get a SIGPIPE signal, call this routine, which consists just of setting $quit_flag to 1". And third, we have that mysterious loop that prints a period to STDOUT every five seconds. Since the parent never reads anything past the first line, you might think that printing the dots does no good. However, we are printing them for their side effect: if we print anything to STDOUT (our pipe to our parent process) and there's no SIGPIPE signal, then our channel to our parent is alive and well. But if the parent has closed the pipe, then printing to it will throw a SIGPIPE, which will cause $quit_flag to be set to 1, which will cause the loop to bail out and the program to rmdir the lock directory across ftp, and then quit.
As large as that netlock program is, it encapsulates the whole business of getting a lock, so that the parent process need only call it like so:
open LOCKER, "/path/to/netlock phonestuff |"
 or die "Couldn't pipe from 'netlock phonestuff' ?!";
my $status = <LOCKER>;
$status = '[netlock startup error]' unless defined $status;
die "$status from netlock phonestuff"
 unless $status =~ m/^OK/s;
and then when we want to release the lock, we simply call close(LOCKER).
Object Interface to Locksmith Processes
Just as the last article ended with a class that implemented a simple interface to flock semaphore files, I'll end this article with a class that implements a simple interface to ftp lock directories, as accessed and tended by a netlock locksmith process.
package NetLock;

sub new {
  my($class, $tag) = @_;
  die "You forgot to specify a tag!"
   unless defined $tag and length $tag;
  open my $fh, "perl -S netlock $tag |"
   or die "Couldn't open netlock $tag: $!";
  my $status = <$fh>;
  $status = '[netlock startup error]' unless defined $status;
  die "$status from netlock $tag" unless $status =~ m/^OK/s;
  return bless { 'fh' => $fh }, ref($class) || $class;
}

sub unlock {
  my $fh = delete $_[0]{'fh'};
  return 0 unless defined $fh;
  close($fh);
}

1;
An object of class NetLock represents a lock on a resource as obtained across ftp by a locksmith process that will delete that lock either when this process asks it to, or when this process dies. Such an object is made by calling NetLock->new('tagname') where tagname is a string, such as "phonestuff", that is in netlock's %profile hash of account data. The lock stays around either for as long as the object is around, or until we dismiss the lock by calling $lock->unlock, which will return false if we've already closed it, or true in the case where it was just now closed.
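So a minimal use of the class (assuming you've saved it as NetLock.pm, and that the netlock in your path knows the "phonestuff" profile) looks like this:

use strict;
use NetLock;

my $lock = NetLock->new('phonestuff');
  # new() blocks until our locksmith subprocess gets the lock
  # (or dies with the locksmith's error message).

# ... safely manipulate the remote phone.tsv here ...

$lock->unlock;
  # Or just let $lock go out of scope -- either way the pipe
  # closes, and the locksmith rmdir's the remote lock directory.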
Sean M. Burke (sburke@cpan.org) quit his job at KAOS to work on a book for O'Reilly. It's called Perl & LWP and should be out this summer. Sean would like to thank Arnar Hrafnkelsson, Uri Guttman, Conrad Heiney, Jim Tranowski, and Ronald Schmidt for help and encouragement on the locksmith part of this article.
¹ Incidentally, this doesn't solve the quite serious problem of an aborted upload leaving a truncated phone.tsv on the server. To deal with this, the usual approach is to upload your phone.tsv as a randomly named temp file, then make the lock directory, delete the remote server's old phone.tsv, rename your newly uploaded temp file to "phone.tsv", and then rmdir the lock directory. Corresponding solutions apply in the equally serious case of aborted downloads leaving a truncated local "phone.tsv".
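A hedged sketch of that dance, reusing $conn, $file, and $lock_dir from the uploader program earlier (the temp-file naming here is just an illustration):

my $temp = 'tmp' . $$ . '_' . time();  # A sloppily unique temp name
$conn->put($file, $temp)
 || die "Can't put $file as $temp";

until( $conn->mkdir($lock_dir) ) { sleep 3 }

$conn->delete($file);        # Out with the old phone.tsv...
$conn->rename($temp, $file)  # ...and in with the new.
 || die "Can't rename $temp to $file";

$conn->rmdir($lock_dir);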
² And by "Unix-like", I do not mean MSWindows. However, as this article goes to press, Ronald Schmidt has found a brilliantly simple way to get this working under MSWindows: instead of setting a signal handler and checking getppid (neither of which is currently implemented in Perl for MSWindows), you just check the return value of the print STDOUT '.'; if it's a false value, then the pipe has broken, and the loop should be exited!
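A sketch of what that change to netlock's final loop might look like (untested here, with the rest of the program unchanged):

# Instead of relying on SIGPIPE and getppid, notice when print
# starts returning false -- meaning our pipe to the parent is gone:
until( $quit_flag ) {
  $ftp->pwd();  # Don't let the connection time out.
  sleep 5;
  $quit_flag = 1 unless print STDOUT '.';
}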