"When is it for?"
-- Brian Eno and Peter Schmidt, Oblique Strategies
The worst kinds of bugs are the ones that don't appear during development, but then appear, seemingly at random, in real use. In the case of a complicated program running on several different platforms, such problems are not too surprising; but the first time I ran into one was in a very simple program that ran on the same machine I developed it on. It was a simple SSI counter for a Web page, and it looked like this:
open COUNTER, "<counter.dat" or die "Can't read-open: $!";
my $hits = <COUNTER>;
close(COUNTER);
++$hits;
print "Hits on this page: $hits\n";
open COUNTER, ">counter.dat" or die "Can't write-open: $!";
print COUNTER $hits;
close(COUNTER);

I got it going, and everything seemed to work fine:
% perl -cw counter.pl
counter.pl syntax OK
% echo 0 > counter.dat ; chmod a+rw counter.dat
% perl -w counter.pl
Hits on this page: 1
% perl -w counter.pl
Hits on this page: 2
% perl -w counter.pl
Hits on this page: 3

I tested it in an .shtml Web page and, in the browser, it merrily displayed "Hits on this page: 4", then on reloading displayed "Hits on this page: 5", and so on. When the Web page was put on a public site, it dutifully started reporting "Hits on this page: 249", and I'd check back later and see "Hits on this page: 634", and everything seemed fine. But then I'd look back later and see "Hits on this page: 45". Something was clearly amiss, but I could see absolutely nothing wrong with the tiny counter program. So, I sought the advice of others, and they pointed out to me the problem that I will now explain to you.
We as programmers are used to putting ourselves in the shoes of our program, and relating to it as an individual: "What file should I open now? What do I do if I can't open that file? What do I do if that other program went and deleted that file?" and so on. But this handy metaphor breaks down when we need to imagine other simultaneous instances of our program following the same set of instructions. And that's just how the above counter program was getting into trouble. In testing, I never had two instances of the program running at once; but once the counter was on a publicly visible Web page, there were eventually two instances of the counter running at once, with various unfortunate results.
Problems with Simultaneous Instances
Imagine that two people, at about the same instant, are accessing the Web page with the counter discussed above. This leads the Web server to start up an instance of counter.pl for each user, at slightly different times. Let's suppose that the content of counter.dat at the beginning is the number "1000" and trace what each instance does.
Instance 1                       Instance 2
-----------------                -----------------
open COUNTER, "<counter.dat"
 or die "Can't read-open: $!";
my $hits = <COUNTER>;
close(COUNTER);

So instance 1 has read "1000" into $hits. Then:
                                 open COUNTER, "<counter.dat"
                                  or die "Can't read-open: $!";
                                 my $hits = <COUNTER>;
                                 close(COUNTER);

Instance 2 has read "1000" into $hits. Then:
++$hits;
print "Hits on this page: $hits\n";
                                 ++$hits;
                                 print "Hits on this page: $hits\n";

Each instance increments its $hits, each gets 1001, and each displays that figure to its respective user. Then:
open COUNTER, ">counter.dat"
 or die "Can't write-open: $!";
print COUNTER $hits;
close(COUNTER);

Instance 1 has updated counter.dat to 1001, and then ends. Then finally:
                                 open COUNTER, ">counter.dat"
                                  or die "Can't write-open: $!";
                                 print COUNTER $hits;
                                 close(COUNTER);

Instance 2 has updated counter.dat to 1001. The problem is that this is incorrect: even though we served the page twice, the counter ends up only 1 hit greater. And beyond that, we just told two different users that they were both the 1001st viewer of this page, when one was really the 1002nd.
Here's a more drastic case: imagine that the two instances are slightly more out of phase. Suppose instance 1 is writing the value "1501" to counter.dat as instance 2 is starting up and reading it:
Instance 1                       Instance 2
-----------------                -----------------
open COUNTER, ">counter.dat"
 or die "Can't write-open: $!";
                                 open COUNTER, "<counter.dat"
                                  or die "Can't read-open: $!";
                                 my $hits = <COUNTER>;
print COUNTER $hits;
close(COUNTER);

There, instance 1 overwrites counter.dat (with a zero-length file), but just as it's about to write the new value of its $hits, instance 2 opens that zero-length file and reads from it into its $hits. Reading from a zero-length file is just like reading from the end of any file: it returns undef. Then, instance 1 writes "1501" to counter.dat and ends. But instance 2 is still working:
                                 ++$hits;
                                 print "Hits on this page: $hits\n";
                                 open COUNTER, ">counter.dat"
                                  or die "Can't write-open: $!";
                                 print COUNTER $hits;
                                 close(COUNTER);

It has incremented $hits, and incrementing an undef value gives you 1. It then tells the user "Hits on this page: 1", and updates counter.dat with a new value: 1. Our counter just went from 1501 to 1!
Each program was perfectly following its own instructions, but together they managed to be wrong. I had tacitly assumed that this case, where two instances coincide, would never happen; but I never actually put anything in place to stop it from happening. Or maybe I'd assumed it could happen, but that the chances were astronomical; after all, "it's just a stupid Web page counter anyway". But anything worth doing is worth doing right, and what needed doing here was making sure that the above scenarios couldn't happen. Moreover, the way to keep this counter program from losing its count is also the way we keep more important data from being lost in other programs: file locking, a UNIX OS feature meant to help in just these sorts of cases.
A first hack at using file locking would change the program to read like this:
use Fcntl ':flock';  # import LOCK_* constants

open COUNTER, "<counter.dat" or die "Can't read-open: $!";
flock COUNTER, LOCK_EX;  # So only one instance gets to access this at a time!
my $hits = <COUNTER>;
close(COUNTER);
++$hits;
print "Hits on this page: $hits\n";
open COUNTER, ">counter.dat" or die "Can't write-open: $!";
flock COUNTER, LOCK_EX;  # So only one instance gets to access this at a time!
print COUNTER $hits;
close(COUNTER);

So, when a given program instance calls "flock FH, LOCK_EX" on a given filehandle, it is signaling, via the operating system, that it wants exclusive access to that file; and if some other process has called "flock FH, LOCK_EX" first, then our instance will wait around until that process is done. Similarly, once we get a lock on the file, if any other process calls "flock FH, LOCK_EX", the OS will make it wait until we're done. The way the above program signals that it's done is by calling close on the filehandle.
Although it could have called flock COUNTER, LOCK_UN, it's enough to just close it, because of these important facts about locking in the basic UNIX file model:
- You can't lock a file until you've already opened it.
- When you close a file, you give up any lock you have on it.
- If a process ends while it has a file open, the file gets closed.
- So the only way a file can be locked at any moment is if a process has opened it, then locked it, and hasn't yet closed it (either explicitly, or by ending).
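These facts can be seen in action with a small sketch (assuming a UNIX-ish system with fork, and a hypothetical lock file called demo.lock). The child process locks the file and then exits while still holding the lock, never calling LOCK_UN or close; yet the parent's flock succeeds as soon as the child ends, because ending closes the file and closing releases the lock:

```perl
use Fcntl ':flock';

my $pid = fork();
defined $pid or die "Can't fork: $!";
if ($pid == 0) {                       # child
  open my $fh, ">", "demo.lock" or die "Can't open demo.lock: $!";
  flock $fh, LOCK_EX or die "Can't lock: $!";
  sleep 2;     # hold the lock for a while...
  exit 0;      # ...then end: the file is closed, so the lock is released
}
sleep 1;       # give the child time to grab the lock first
open my $fh, ">", "demo.lock" or die "Can't open demo.lock: $!";
my $start = time;
flock $fh, LOCK_EX or die "Can't lock: $!";  # blocks until the child ends
print "Got the lock after about ", time - $start, " second(s)\n";
waitpid $pid, 0;
unlink "demo.lock";
```

Note that the parent never has to know how the child gave up the lock; the OS releases it for any process that closes the file or ends.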
Unfortunately, this means trouble for our flock-using code. Notably, there can still be a problem with instances being out of phase -- since we can't lock a file without already having opened it, things can still happen in that brief moment between opening the file and locking it. Consider when one instance is updating counter.dat just as another new instance is about to read it:
Instance 1                       Instance 2
-----------------                -----------------
open COUNTER, ">counter.dat"
 or die "Can't write-open: $!";
                                 open COUNTER, "<counter.dat"
                                  or die "Can't read-open: $!";
                                 flock COUNTER, LOCK_EX;
                                 my $hits = <COUNTER>;
                                 close(COUNTER);
flock COUNTER, LOCK_EX;

There, the OS dutifully kept two instances at once from having an exclusive lock on the file. But the locking is too late, because instance 1, just by opening the file, has already overwritten counter.dat with a zero-length file, just as instance 2 was about to read it. So we're back to the same problem that existed before we had any flock calls at all: two processes accessing a file that we wish only one process at a time could access.
Semaphore Files
There are various special solutions to problems like the above, but the most general one involves semaphore files. The line of reasoning behind them goes like this: Since you can't lock a file until you've already opened it, any content you have in locked files still isn't safe. So just don't have any content at all in a locked file. However, we do have content we need to protect, namely the data in counter.dat. But that just means we can't use that as the file we go locking. Instead, we'll use some other file, never with any content of interest, whose only purpose will be to be a thing that different instances can lock for as long as they want access to counter.dat. The file that we lock but never store anything in, we call a semaphore file.
The way we actually use a semaphore file is by opening it and locking it before we access some other real resource (like a counter file), and then not closing the semaphore file until we're done with the real resource. So, we can go back to our original program and make it safe by just adding code at the beginning to open a semaphore file, and one line at the end to close it:
use Fcntl ':flock';  # import LOCK_* constants

open SEM, ">counter.sem" or die "Can't write-open counter.sem: $!";
flock SEM, LOCK_EX;

open COUNTER, "<counter.dat" or die "Can't read-open counter.dat: $!";
my $hits = <COUNTER>;
close(COUNTER);
++$hits;
print "Hits on this page: $hits\n";
open COUNTER, ">counter.dat" or die "Can't write-open counter.dat: $!";
print COUNTER $hits;
close(COUNTER);

close(SEM);

This avoids all the problems we saw earlier. Because the above program doesn't do anything with counter.dat until it has an exclusive lock on counter.sem, and doesn't give up that lock until it's done, there can be only one instance of the above program accessing counter.dat at a time.
It can still happen that some other program alters counter.dat without first locking counter.sem -- so don't do that! As long as every process locks the appropriate semaphore file while it's working on a given resource, all is well. All you need to do is settle on some correspondence between file(s) and the semaphore file that controls access to them. It's a purely arbitrary choice, but when naming a semaphore file for a resource file.ext, I tend to name the semaphore file file.sem, file.ext.sem, or file.ext_S. As with any arbitrary decision, I advise picking one style and sticking with it; clearly the whole purpose of this is defeated if one program looks to counter.sem as the semaphore file while another looks to counter.dat_S.
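Whatever naming convention you settle on, it's worth centralizing it in one routine that every program calls, so that the correspondence is defined in exactly one place. A minimal sketch, assuming the file.ext.sem style (the routine name here is just illustrative):

```perl
# Given a resource filespec, return the name of the semaphore file
# that controls access to it -- one definition, used by every program.
sub sem_name_for {
  my $filespec = shift(@_) || die "What filespec?";
  return "$filespec.sem";    # e.g., "counter.dat" -> "counter.dat.sem"
}
```

Then a program that wants to work on counter.dat would lock the file named by sem_name_for('counter.dat'), and no two programs can disagree about which semaphore file that is.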
Semaphore Objects
With our simple counter program, our simplistic but effective approach was just to bracket our program with this code:
use Fcntl ':flock';  # import LOCK_* constants

open SEM, ">counter.sem" or die "Can't write-open counter.sem: $!";
flock SEM, LOCK_EX;

...do things...

close(SEM);

...do anything else that doesn't involve counter.sem...

That works quite well when our program is simple and involves just one semaphore file: all we need to do is close(SEM) once we're done with counter.sem, or whatever resource the SEM filehandle denotes a lock for. However, when a given program involves a lot of different files (each of which requires its own semaphore file, and which are being locked and unlocked in arbitrary orders), you can't just put them all in one global filehandle called "SEM". Instead, you can use lexical filehandles, via the Perl 5.6 "open my $fh, ..." syntax, as here:
{
  use Fcntl ':flock';  # import LOCK_* constants
  open my $sem, ">dodad.sem" or die "Can't write-open dodad.sem: $!";
  flock $sem, LOCK_EX;

  ...things dealing with the resource that dodad.sem denotes a lock on...

  close($sem);
}

In fact, the close($sem) call there isn't strictly necessary: assuming you haven't copied the object from $sem into any other variable in memory, then when the program hits the end of the block where my $sem was declared, Perl will delete that variable's value from memory. Then, seeing that it was the only copy of that filehandle object, Perl will implicitly close the file, releasing the lock.
The benefit of using my'd filehandles instead of globals is that they prevent namespace collisions: you could have other my $sem variables declared in other scopes in this program, and they wouldn't interfere with this one. But creating each semaphore object would still require the same repetitive open and flock calls, and needless repetition is no friend of programmers. We might as well wrap it all up in a function:
sub sem {
  my $filespec = shift(@_) || die "What filespec?";
  open my $fh, ">", $filespec
   or die "Can't open semaphore file $filespec: $!";
  chmod 0666, $filespec;  # assuming you want it a+rw
  use Fcntl 'LOCK_EX';
  flock $fh, LOCK_EX;
  return $fh;
}

Then, whenever you want a semaphore lock on a file, you need only call:
my $sem = sem('/wherever/locks/thing.sem');
All you then do with the object in $sem is keep it around as long as you need the lock on that semaphore file; or you can explicitly release the lock with just a close($sem).
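One natural variation, sketched here as an assumption rather than part of the program above, is a non-blocking version of sem(): adding the LOCK_NB flag to the flock call makes flock return false immediately instead of waiting, so a process that can't get the lock right away can go do something else. The routine name sem_nb is just illustrative:

```perl
# Hypothetical non-blocking variant of sem(): returns a locked
# filehandle on success, or undef if another process holds the lock.
sub sem_nb {
  my $filespec = shift(@_) || die "What filespec?";
  use Fcntl qw(:flock);    # LOCK_EX, LOCK_NB, etc.
  open my $fh, ">", $filespec
   or die "Can't open semaphore file $filespec: $!";
  flock($fh, LOCK_EX | LOCK_NB)
   or return undef;        # somebody else has it locked right now
  return $fh;
}
```

A caller might then write: my $sem = sem_nb('thing.sem') or warn "Busy, trying later\n"; -- instead of silently waiting as the blocking sem() does.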
If you were an OOP fan, you could even wrap this up in a proper class, an object of which denotes an exclusive lock on a given semaphore file. A minimal class would look like this:
package Sem;

sub new {
  my $class = shift(@_);
  use Carp ();
  my $filespec = shift(@_) || Carp::croak("What filespec?");
  open my $fh, ">", $filespec
   or Carp::croak("Can't open semaphore file $filespec: $!");
  chmod 0666, $filespec;  # assuming you want it a+rw
  use Fcntl 'LOCK_EX';
  flock $fh, LOCK_EX;
  return bless {'fh' => $fh}, ref($class) || $class;
}

sub unlock {
  close(delete $_[0]{'fh'} or return 0);
  return 1;
}

1;  # End of module

Then you need only create the proper semaphore objects, like so:
use Sem;
my $sem = Sem->new('/wherever/locks/thing.sem');

...later...

$sem->unlock;

Conclusion
If you've got a data file that's only ever manipulated by one program, and you're sure you'll never run multiple simultaneous instances of that program, then you don't need semaphore files. But you need semaphore files in all other cases, that is, where you have a file or other resource that is accessed by potentially simultaneous processes (whether different programs, or instances of the same program) and that resource could suffer from uncontrolled simultaneous access.
In this article, I've assumed that the programs for which you need semaphore files are all running on the same machine, that that machine runs UNIX (or something with the same basic locking semantics), and that the filesystem you're putting the semaphore files on is not NFS (which often doesn't implement locking properly). In my next The Perl Journal article, I'll discuss what to do if you need semaphore files, but either you're not under UNIX, or the processes you need to coordinate are running on several different machines.
Sean M. Burke (sburke@cpan.org) lives in New Mexico, where he mostly does data-munging for Native language preservation projects.