2024 π Daylatest newsbuy art
And she looks like the moon. So close and yet, so far.Future Islandsaim highmore quotes
very clickable
data + munging

The Perl Journal

Volumes 1–6 (1996–2002)

Code tarballs available for issues 1–21.

I reformatted the CD-ROM contents. Some things may still be a little wonky — oh, why hello there <FONT> tag. Syntax highlighting is iffy. Please report any glaring issues.

The Perl Journal
#13
Spring 1999
vol 4
num 1
Perl News
What's new in the Perl community.
Five Quick Hacks: Downloading Web Pages
Grabbing web page contents without a browser.
The Perl Scripts Archive
The CPAN now stores scripts as well as modules.
Web Databases the Genome Project Way
A free database system tailored for the web.
XML
The successor to HTML.
Bricolage: Memoization
Speeding up Perl functions.
Review: The O'Reilly Perl Resource Kit
Localizing Your Perl Programs
Adapting programs for languages other than English.
Building Your Own Perl for Win32
How to compile and install your own Perl from the source code.
The rezrov Infocom Game Interpreter
A maximally portable game engine, with a few back doors.
Perl and MIDI: Simple Languages, Easy Music
How a little language for generating music was created.
Perl Heresies: Building Objects Out Of Arrays
Most people build objects out of hashes. Here's why you shouldn't.
Controlling Modems with Win32::SerialPort
Let your programs talk to any serial device.
Contest: The Solitaire 500
Speed matters. How fast can your program play a simple card game?
TPJ One Liners
Mark-Jason Dominus (1999) Bricolage: Memoization. The Perl Journal, vol 4(1), issue #13, Spring 1999.

Bricolage: Memoization

Speeding up Perl functions.

Mark-Jason Dominus


Caching is a useful general technique that sometimes makes programs run faster. It does this by exchanging space for time: Caching tries to save previously computed results in an attempt to reuse them later rather than recomputing them.

Caching is useful in all kinds of situations, including almost any kind of searching (cache the results of the search so that you can skip it next time), HTML generation (cache the results of the generation process so that you don't have to generate the page next time), and numeric computation (cache the results of the computation). As a specific example, consider a function for preparing graphic images to be printed on an industrial offset printer. Four-color printing processes, commonly used in magazines and color printers, use four colors of ink: cyan, magenta, yellow, and black, collectively referred to as CMYK. Graphics files, however, usually record colors by indicating the desired levels of red, green, and blue light (RGB). The RGB format has to be converted to CMYK values prior to printing.

Each pixel needs to have its color converted; this typically requires a fairly expensive calculation. But in many graphics formats, such as GIF, there are few colors relative to the number of pixels. A naive approach would compute the CMYK values for each color repeatedly.

To speed up the calculation of CMYK values, we save each CMYK set in a data structure called a cache. Later, if we use the function to compute the CMYK values for the same color, the function looks in the cache to see if it already knows the right values. If it finds them there, it returns them immediately, avoiding the lengthy calculation. If not, it computes them, installs them in the cache, and returns them to the caller.

Caching therefore trades space for time. A function that uses caching may run faster than one without, but it uses up more memory. The basic strategy looks like this:

sub fake_function {
   if (@_ looks familiar) {
       return cached value from the cache;
   } else {
      $val = real_function(@_);
      store $val in the cache;
      return $val;
   }
}

Then, when you call fake_function() instead of real_function(), it takes care of managing the cache and returning the values that real_function() would have returned had you called it directly. If real_function() takes a long time to run, this can yield a substantial savings.

Obviously this is inappropriate for many functions. For example, you wouldn't like it if the time or rand function used caching, because then they would return the same time or the same random number every time you called them. But for many functions the result depends only on the arguments, and not on anything in the outside world like the time of day. Some of these functions are candidates for caching strategies.

Here's a common example: the factorial function. (The factorial of n is the number of different orders for a list of n items.) This function accepts an argument which is an integer, at least 0. The factorial is defined as follows:

sub factorial {
   my $n = shift;
   if ($n == 0) { return 1 }
   else { return $n * factorial($n-1) }
}

factorial() is suitable for caching because it's what is called a pure function. A pure function is one with no side effects, whose return value depends only on the values of its arguments. It's easy to see that no matter how many times we compute factorial(7) the result will always be the same. This is just the property that makes it suitable for caching: Since it'll always be the same, we'll remember the result from the first time we call it, and return that same result on the subsequent calls.

It's easy to take the factorial() function above and turn it into a version with caching. This process is called memoization. Since the argument to factorial() is always a non-negative integer, we'll just use an ordinary Perl array as the cache. When we compute factorial($n), we'll store the result in $cache[$n]. Then when we need to see if we've computed factorial($n) previously, we'll just look to see if $cache[$n] is defined. The code is simple:

1 { my @cache;
2   sub factorial {
3       my $n = shift;
4       return $cache[$n] if defined $cache[$n];
5
6       my $result;
7       if ($n == 0) { $result = 1 }
8       else { $result $n * factorial($n-1) }
9
10      $cache[$n] = $result;
11      return $result;
12   }
13 }

Line 1 sets up the cache array; the my confines it to the block that ends on line 13, so it's only visible to the factorial() function. After the function gets the argument on line 3, it checks on line 4 to see if it already knows the factorial. If so, it returns the result immediately.

Something interesting and a little subtle is going on here: The memoization is a bigger win than it might appear at first. Suppose we've used factorial(12) to compute the factorial of 12 and that we need to compute factorial(13). $cache[13] isn't defined yet, so control passes to line 8, which wants to compute 13 * factorial(12), so it makes a recursive call. But the recursive call for factorial(12) returns immediately, because $cache[12] is already defined. We got factorial(13) almost immediately, even though the value wasn't in the cache yet, because we had the result of a partial computation in the cache.

Now similarly, let's ask for factorial(11). Even though we never explicitly asked for this before, the result is already in the cache, because it was computed and recorded in the process of computing factorial(12).

Memoization not only speeds up calls for questions that we've asked before, it sometimes also speeds up the computations for questions that we haven't asked before.

Here's an example where memoizing turns into a huge win for this reason: the Fibonacci function. In 1202 the Italian mathematician Leonardo of Pisa, also called 'Fibonacci', asked how quickly a colony of immortal rabbits would increase if every month each pair of adult rabbits spawned a new baby pair that took another month to grow up to be adults themselves.

It's not hard to discover that if f(n) is the number of pairs of rabbits in month n, then

f(n) = n 		(if n < 2)
f(n) = f(n-1) + f(n-2) 	(otherwise)

The outputs of this function--1, 1, 2, 3, 5, 8, 13, 21, and so on--are the famous Fibonacci Numbers. The code for the function to compute them is very simple:

# Compute nth Fibonacci number
sub fib {
    my $n = shift;
    if ($n < 2) { return $n }
    else { return fib($n-1) + fib($n-2) }
}

However, this simple function is very slow. It takes a long time to compute fib(20), because it first wants to compute fib(19) and fib(18), and add to the results. But to compute fib(19), it first has to compute fib(18) and fib(17), and then when it is done it comes back and computes fib(18) all over again even though the answer is the same as it was before. Both of the times that it wants to compute fib(18), it has to compute fib(17) from scratch, and then it has to do that again each time it wants to compute fib(19). This function does so much recomputing of old results that it takes a really long time to run. fib(20) makes about 22,000 recursive calls, to compute and recompute things that it already computed.

Here's a memoized version:

{ my @cache;
  sub fib {
      my $n = shift;
      return $cache[$n] if defined $cache[$n];

      my $result;
      if ($n < 2) { $result = $n }
      else { $result = fib($n-1) + fib($n-2) }

      $cache[$n] = $result;
      return $result;
  }
}

Memoizing here is a big win; you reduce those 22,000 function calls to about 40. Even if you never ask for fib(20) again, you get a huge benefit.

We can write the memoized version a little more tidily this way:

1 { my @cache;
2   BEGIN { @cache = (0, 1) }
3   sub fib {
4       my $n = shift;
5       return $cache[$n] if defined $cache[$n];
6
7       return $cache[$n] = fib($n-1) + fib($n-2);
8   }
9 }

There are two tricks here. One is that by initializing the cache to contain the two special cases n=0 and n=1, we can eliminate the code that deals with those special cases. The other trick is that we can eliminate the $result variable: We compute the result, stick it right into the cache, and return it, all on line 7.

Recursive Functions

Lots of people want to advise you to avoid recursion, because recursive functions are often so much less efficient than iterative ones. But for some problems, recursion leads to much simpler and more natural code, and it can be hard to decide whether to trade off efficiency for simplicity.

Memoization often improves the performance of recursive functions to the point that you don't care about the small additional gain you might get by rewriting the function iteratively. So before you rewrite, try memoizing.

Automatic Memoization

Civilization advances by extending the number of important operations which we can perform without thinking.

­Alfred North Whitehead

Like all tools, memoization is better when you don't have to think about it. I like memoization, so I wrote a module that memoizes functions automatically. Once you have the module, you can say

use Memoize;
memoize 'fib';

and it'll automatically memoize the fib() function. You can try memoizing a function to see if you get a performance increase, and if not you can turn it off again without wasting any time actually modifying the code or thinking about how to organize the cache. Very handy, and available from CPAN.

Internals of Memoize

Usually when I write a Bricolage module, I try to make it readable, even if that means leaving out features or sacrificing efficiency. But I wrote Memoize.pm a long time ago, and it's a three-hundred-line monster full of all sorts of bells and whistles. So for this article, I wrote a little tiny version of Memoize.pm. It leaves out the fancy features and weighs in at a mere 31 lines of significant code.

There are two functions here. _check_cache() is private, and is responsible for maintaining the cached function values and returning them when appropriate. memoize() is called directly by the user of the package, whom we'll call 'you', and sets things up so that when you think you're calling your own function, you're really calling _check_cache() instead.

We'll see memoize() first, because it's called first. It's responsible for memoizing a function and setting up the cache. Suppose you call memoize 'fib' in the main package. What does memoize() do to set up the cache and arrange to have the cache checked at the right time?

The first thing it needs to do is find the actual function and get a reference to it. This occurs on lines 14­19. If the argument you passed to memoize() was already a reference to a function, we find that out on line 13 and skip this part. Supposing that what we got is a function name, we first find out the name of the calling package with caller (line 14). Then we put the function name together with the package name to get the full name of the function, something like main::fib. If the name contained ::, we assume that it already has its package name attached and skip that part. We save the complete name in $funcname (we'll need it again later) and replace the name of the function in $function with a reference to the function itself on line 19. Turning a name into a reference this way is usually something people do by accident, so strict will abort the program if we do it. But here it's really on purpose, so we shut off strict 'refs' for just that one line.

Now $function has a reference to the function you wanted to memoize, and $funcname has the full name of that function, if you told us what the name was.

Now let's jump ahead a little. You might like to memoize more than one function at a time, so we need some way of keeping the caches for separate functions separate. It would be terrible if you memoized fib() and fact() and then fib(7) returned 5040 instead of 13.

Each cache will be an anonymous hash. The keys will look like function arguments, and the values will be function return values. The cache hashes will be stored in another hash, keyed by function name. Given the name of a function, we can look in the main hash to find the cache hash for that function. This main hash is %memotable, declared on line 8. %memotable is the most important data structure in the module.

Actually I fibbed a little. The keys of %memotable aren't function names, because memoize() might not know the function name. You can ask memoize() to memoize a reference to a function, like this:

memoize \&fib;

in which case the name won't be available to memoize(). You can also memoize an anonymous function:

memoize sub { gethostbyname(@_) || 'Sorry.' };

Here, there's no name at all. But how can memoize() identify a function if it doesn't have a name? We'll play a sly trick. If we have a reference to a function, we can convert it to a string, and we will get something that looks like CODE(0x811f744). Two references to the same function always yield the same string, and references to different functions always yield different strings. This means that the 'stringified' references serve to identify the functions. We'll use these stringified references as the keys into %memotable.

We're now ready to create a new function that will replace fib() when fib() becomes memoized. This happens on line 22. The new function won't have a name, but it's very simple:

sub { _check_cache("CODE(0x811f744)", @_) }

We'll call this a stub, because it's so small. CODE(0x811f744) here is the stringified version of $function, which as you'll recall is a reference to fib(). Every time you memoize a new function, we'll create a new stub, each one slightly different from the others. Each will call _check_cache() with a slightly different $_[0] that uniquely identifies the memoized function it was called for.

All the stub really does is pass along the memoized function's arguments to _check_cache(), with an extra argument slapped on the front to say which function was originally called. _check_cache() can use the first argument to know which cache to consult and which real function to call to populate the cache. When it sees CODE(0x811f744)it'll remember that that was the string that corresponded to fib(), and it'll consult fib()'s private cache and call the real fib() function if necessary.

Lines 23­26 fill in the %memotable. In addition to the cache, which is initially empty, the %memotable records the location of the original, unmemoized function so that _check_cache() can call it if it needs to.

Line 30 is the real magic. If we know the name of the function we are memoizing, we reach into the Perl symbol table and replace that function with the stub that will call _check_cache(). Afterwards, when you try to call fib(), you'll really be calling the stub, which will hand off control to _check_cache(). We saved a reference to the original function in the %memotable so that _check_cache() will be able to find it again when it needs to. Tampering with the symbol table is another evil that use strict wants to protect us from, so we turn it off again.

Finally we return the stub, which is your entry to the memoized version of the original function. You will need this if we weren't able to install it into the symbol table, which would happen if you didn't tell memoize() the name of the memoized function. For example, just saying this:

memoize sub { gethostbyname(@_) || 'Sorry.' };

is useless, because the memoized version of the function is lost. You need to be able to say this:

$fast_gethostbyname = memoize sub 
             { gethostbyname(@_) || 'Sorry.' };
$name = $fast_gethostbyname->('tpj.com');

This would be impossible if memoize() didn't return the stub.

Compared with all the tricky things we've been doing so far, the code to actually call a memoized function is very simple. There's only one strange trick, and because it isn't really very strange, I'm going to explain it up front so we can run right over it when we meet it on the road.

Cached function return values are stored in the cache hash, and we need some way to turn the function's arguments into a hash key so that we can look up the cached return value associated with a particular argument list. But the arguments are a list, and hash keys can only be strings, not lists. The method we use to turn a list into a string has some problems, but in practice the problems don't come up very often. You can see how we turn an argument list into a hash key on line 44:

44 my $argstr = join $;, @_;

$; is a special Perl variable whose value is normally a string containing the single unlikely and unprintable character 'control-backslash'.

Joining the arguments into a single string with join is a little risky, because it doesn't always turn different lists into different strings. For example, the two argument lists ("x$;","y") and ("x","$;y") both turn into the string "x$;$;y" under this transformation. But it works properly almost all the time. It always works when there's only a single argument, and it always works when none of the arguments contain control-backslash, so for real applications it's almost good enough, and for a demonstration it is good enough.

With that out of the way, let's suppose that fib() has been memoized, and the user asks for fib(7). The name fib() is no longer attached to the real fib() function, but instead it's attached to the stub. The stub, as we saw, looks something like this:

sub { _check_cache("CODE(0x811f744)", @_) }

The stub calls _check_cache() with two arguments: First, the stringified version of the reference to the real fib(); second, the number 7. The first thing _check_cache() does is grab the stringified reference (line 37), which is its key into %memotab. If there isn't any entry in the %memotab for that key, it aborts the program (lines 38­41), but that should never happen, because we installed the entry back on line 23.

Supposing that the entry exists, _check_cache() retrieves the cache hash on line 43. On line 44, it turns the argument list into a hash key. (Thump!) _check_cache() then checks the cache. If there is a cached return value available, _check_cache() returns it immediately. (Line 45.) Otherwise, it retrieves a reference to the original, unmemoized version of the function (line 47), invokes it, stores the result in the cache, and returns the result. (Line 48.)

That was a lot of magic, but the end result is that it's totally transparent for you. The web site has a sample demonstration program that looks like this:

print "fib(20) = ", fib(20), "\n";
memoize 'fib'; # Make fib go faster
print "fib(20) = ", fib(20), "\n";

memoize 'fib' is supposed to magically make fib() go faster, and that's exactly what it does. The goal was to advance civilization by extending the number of important operations which you could perform without thinking, and we certainly did meet the goal.

MiniMemoize.pm

1 package MiniMemoize;
2 use Exporter;
3 @ISA = qw(Exporter);
4 @EXPORT = qw(memoize);
5 use Carp;
6 use strict;
7
8 my %memotable;
9
10 sub memoize {
11   my $function = shift;
12   my $funcname;
13   if (ref $function eq '') {
14     my $caller = caller;
15     # Convert to code reference
16     $function = $caller . "::$function" 
          unless $function =~ /::/;
17     $funcname = $function;
18     no strict 'refs';
19     $function = \&$function;
20   }
21
22 my $stub = eval qq{sub {
      _check_cache("$function", \@_) }};
23 $memotable{$function} =
24     { original => $function,
25     cache => { },
26     };
27
28
29     { no strict 'refs';
30     *{$funcname} = $stub if defined $funcname;
31   }
32   $stub;
33 }
34
35
36 sub _check_cache {
37   my $what_func = shift;
38   unless (exists $memotable{$what_func}) {
39     # This "should never happen"
40     croak("Tried to check cache of non-memoized
              function '$what_func'; aborting");
41     }
42
43   my $cache = $memotable{$what_func}{cache};
44   my $argstr = join $;, @_;
45   if (exists $cache->{$argstr}) 
        {return $cache->{$argstr}}
46
47   my $real_function = 
        $memotable{$what_func}{original};
48   $cache->{$argstr} = $real_function->(@_);
49 }
50
51 1;

Some Other Applications of Memoization

Persistent Cache. Storing the cache in memory is handy and can speed up a function. But when your process exits, the cache is lost. Since the function is pure, its values will be the same the next time you run the program, and it's a shame to have the later invocation recompute the values that were already computed by earlier invocations.

If the cache is in a file instead of in memory, your program will never have to compute the function twice for the same arguments. After the first time, it'll just retrieve the value that it saved on the disk.

You can even populate the cache in advance, with a throwaway program that runs over the weekend and does nothing but call the memoized function with different arguments. Then when you come back on Monday you'll find that any program that calls your slow function is faster, because the formerly slow function now returns almost instantaneously.

You could write the code to do this yourself and stick it into all your programs. But the real Memoize module on CPAN already has an option to record the cache results on disk, and using it is simple:

use DB_File;
Memoize 'fib', SCALAR_CACHE =>
   [ TIE, DB_File, $filename, O_RDWR | O_CREAT, 0666];

or

use Memoize::Storable;
# Memoize::Storable is a front-end on Storable that
# makes it suitable for use with Memoize.
Memoize 'fib', SCALAR_CACHE =>
                [TIE, Memoize::Storable, $filename];

This is a very powerful and useful feature, and the programming required to support it couldn't be simpler: The memoize() function just ties the cache hash before it installs it in the %memotable.

Profiling Execution Speed

If your program is too slow, you will need to speed it up by making the slow functions faster. But it is notoriously difficult to decide which functions are slowing your program down! If you guess wrong, you might waste effort trying to speed up a function that only contributes to 1% of the program's entire run time; at best this will make the program 1% faster.

The first thing you should always do in this situation is to

profile the program to find out which parts are really taking the most time, and then concentrate your efforts on just those parts. Profiling can be a pain, and the Perl profiler, Devel::DProf, isn't as well-developed as one might like.

You can use memoizing as an alternative, and it's sometimes preferable. Suppose you have a function f() which you think is a possible candidate to be rewritten to be faster. Run your program three times: Once with no memoizing, once with f() memoized (this will populate the cache), and once with f() memoized and the cache already populated. This last run will simulate the speed of the program with f()'s contributions removed. If the run time with the cache populated is 98% of the run time with no memoizing, then no possible rewriting of f() can speed up the program more than about 2%--so you'll know to look elsewhere.

The Orcish Maneuver

Memoizing is very useful for sort comparison functions, which tend to get called over and over again with the same arguments, and which are almost always pure.

Suppose you have a bunch of strings in the form Jan 14, 1994 and you want to sort them into chronological order. Here's the obvious way:

%m2n =
  ( jan => 0, feb => 1, mar => 2,
    apr => 3, may => 4, jun => 5,
    jul => 6, aug => 7, sep => 8,
    oct => 9, nov => 10, dec => 11, );

sub compare_dates {
  my ($am, $ad, $ay) = ($a =~ /(\w{3}) (\d+), (\d+)/);
  my ($bm, $bd, $by) = ($b =~ /(\w{3}) (\d+), (\d+)/);

             $ay <=> $by
  || $m2n{lc $am} <=> $m2n{lc $bm}
  || $ad <=> $bd;
}

sort compare_dates @datestrings;

Now, suppose @datestrings contains 1,000 of these strings. You're going to make about 8,700 calls to compare_dates(), so about 17,500 pattern matches, and that means that each date string was parsed and split an average of 17.5 times, the last 16.5 of which were a complete waste of time.

One way out of this is to use the Schwartzian Transform, which builds a list of data structures that contain both the numeric and printable versions of the dates, sorts this list by the numeric versions, and then throws away the numeric version leaving only the printables ones in the result. Another way is to define an auxiliary function that turns a date into a number, and then memoize it:

use Memoize;

sub compare_dates {
    to_number($a) <=> to_number($b);
}

# Convert "Nov 5, 1605" to "16051105"
sub to_number {
    my ($m, $d, $y) = ($_[0] =~ /(\w{3}) (\d+), (\d+)/);

    sprintf("%04d%02d%02d", $y, $m2n{$m}, $d);
}

memoize 'to_number';

Now you only do 1,000 pattern matches, which is a big improvement. The other 16,500 times, the result is already in the cache.

In sort comparators, you often need more speed, and the slow part of this comparator is the two calls to to_number(). We've replaced 17,500 pattern matches with 17,500 calls to to_number(), which is an improvement, but not as much of an improvement as we'd like. So instead of using the Memoize module, we can just inline the memoization, like this:

{ my %cache;

  sub compare_dates {
    ($cache{$a} ||= to_number($a)) <=>
    ($cache{$b} ||= to_number($b))
  }
}

||= is a Perl operator that suits this application perfectly. It gets the value of the expression on the left and returns it, unless it's a false value, in which case it evaluates the thing on the right, assigns it to the thing on the left, and returns it. If the numeric version is already in the cache, we get it immediately; otherwise we compute it and put it in the cache. Result: Exactly 1,000 calls to to_number().

Joseph Hall, author of Effective Perl Programming, dubbed this the Orcish Maneuver, because the notable features of this approach are the || and the cache.

Dynamic Programming

You might have heard of 'dynamic programming'; it's a technique in which you break down a problem into smaller subproblems, solve them separately, and then build up the solutions into a solution to the big problem. The merge sort is a good example. For a more interesting example, we'll look at the partition problem.

In the partition problem, you have a list of numbers. You want to divide the list into two groups so that the sum of the numbers in each group is the same.

If you learned about dynamic programming in school (assuming you went to school) you probably spent a lot of time working out the details of how to represent and locate the solutions to the subproblems. But you can think of memoization as an automatic dynamic programming technique.

We're going to solve the partition problem by writing a recursive function. Suppose the list has five elements, whose sum is 30. Then what we really need to do is find some subset of those elements whose sum is 15. If we can do that, or prove there is no such subset, we've solved the partition problem.

Now, suppose the first element is an 8. This 8 might be part of the subset we want to find, or it might not. If it is part of the subset, then the remaining elements of the subset must add up to 7; if it isn't, then they add up to 15. So throw away the 8, and recursively inquire if some subset of the remaining elements can be made to add up to 7 or 15.

Repeat this until you've thrown away all but one of the elements, at which point the solution is trivial; it's either equal to the target sum, or it isn't.

Our recursive function T() will take a list of numbers and a target sum, and it'll try to see whether there's a subset of the numbers that adds up to the target sum. If it can find such a subset, it returns a list of them, and otherwise, it returns undef. Here's the code:

# Take a list of numbers and a target sum.
# Return a sublist that adds up to the target,
# if possible, or undef otherwise.

sub T {
    my ($list, $target) = @_;
    my $answer;
    return undef if $target < 0;

    if (@$list == 0) {
        return ($target == 0) ? [] : undef
    }

    my ($first, @rest) = @$list;

    # Does the rest of the list contain a solution?
    $solution = T(\@rest, $target);
    return $solution if defined $solution;

    # Try the other way
    $solution = T(\@rest, $target - $first);
    return [$first, @$solution] if defined $solution;

    # Nope.
    return undef;
}

Now let's ask it if it can find a way to split the elements of (8,2,7,3,10) into two lists, each of which adds up to 15. The call looks like this:

T([8,2,7,3,10], 15)

And sure enough, the function returns (2,3,10).

Actually this function is too slow to use for large problems; it takes exponential time just like the Fibonacci number function did. And again, we can solve the problem by memoizing. A sample run of the program on a list of 21 numbers took 90 seconds and called T() 4,194,303 times; the memoized version took 0.42 seconds and called T() only 1,521 times.

There's a slight problem here that we didn't discuss before: The caching strategy that I showed earlier in the article doesn't work. It stores the cached value in a hash, keyed by join $;, @_;. In this case, the two arguments to T() are a reference to a list of numbers, and a target sum. When we use the reference in the join statement, we get something like ARRAY(0xae050). These strings are always different for different arrays, even if the contents of the arrays are the same. This means that _check_cache() can't tell that the arguments are the same even when they are.

The CPAN Memoize module has a feature that solves this problem: It lets you override the function that converts argument lists to hash keys, so you can tell the memoizer explicitly what parts of the arguments aren't important. This lets us say that in a call like T([1,2,3], 7), the 1,2, and 3 are important, but the identity of the array that happens to contain them is not important. Here's what the call to memoize() looks like:

memoize 'T', NORMALIZER =>
                  sub { my ($arr, $t) = @_;
                        join ' ', @$arr, $t;
		      };

The NORMALIZER is the function that converts argument lists to hash keys. In this case, it extracts the numbers from the array and joins those together into a string. The results of the call T([1,2,3], 7) will be stored in the hash under the key '1 2 3 7'.

When Memoizing Doesn't Work

Disaster will ensue unless the function that you memoize must be pure. That means that its return value must depend only on its arguments, and it must not have any side effects.

To see what happens when the function depends on something other than its arguments, consider this:

# Return the current time of day as "5:24:32"
sub time { my ($s, $m, $h) = localtime;
           sprintf "%d:$02d:%02d", $s, $m, $h;
         }

If you memoize this, it'll return the correct time the first time you call it, but after that the time will be in the cache and whenever you call it you'll get the cached value.

To see what happens when the function has side effects, consider this:

sub squawk {
  my $a = shift;
  print STDERR "There were only $a fields in the record!\n";
}

Suppose that the main program is reading a file whose records have 15 fields each. It uses this function to deliver an error message each time it encounters a record with the wrong number of fields. The first time you see a record with only 14 fields, it works properly. The second time, the memoizer sees that the argument is the same and returns the cached return value (which is 1, not that you cared) without actually calling the function--so no message comes out.

There's one other kind of function that you must avoid memoizing, and the problem is a little more subtle. If the function returns a reference to a structure that will be modified by its callers, you must not memoize it. To see why not, consider this function:

sub iota {
  my $n = shift;
  return [ 1 .. $n ];
}

iota() returns a reference to a list of the numbers from 1 to n, where n is its argument. If iota() is memoized, the cached value will be a reference to the list.

Suppose that iota() is memoized, and the caller does something like this:

$i10 = iota(10);
$j10 = iota(10);

$i10 and $j10 look like they're both arrays containing the numbers from 1 to 10. But they're not really; they're both references to the same array! If iota() hadn't been memoized, they would be references to different arrays. Now the punchline:

pop @$i10;

This modifies $j10 also, because it's the same array. If iota() weren't memoized, $j10 would have been unaffected.

Sometimes there are exceptions to these perils. For example, the gethostbyname() function often takes a long time, because Perl has to open a network socket, send data to the name server, and wait for the reply to come back. Memoizing gethostbyname() is technically incorrect, because the address data might change between calls. But in practice, Internet addresses don't change very quickly, and memoizing gethostbyname() doesn't lead to any real problems except in long-running programs.

Historical Note

The word 'memoize' was coined by Donald Michie in 1968.

Bibliography

"Improving the Performance of AI Software: Payoffs and Pitfalls in Using Automatic Memoization". Hall, Marty and James Mayfield. Proceedings of the Sixth International Symposium on Artificial Intelligence, Monterrey, Mexico, September 1993.

"Memo Functions and Machine Learning". Michie, Donald. Nature, 218, #1, April 1968, pp. 19­22.

Structure and Interpretation of Computer Programs. Abelson, Hal, and Gerald Jay Sussman. MIT Press, Cambridge, 1985. pp. 218­219.

The Memoize module is available from CPAN, and also from https://www.plover.com/~mjd/perl/Memoize/. The source code for the MiniMemoize module, and the other programs described in this article, is available from the TPJ web site and from https://www.plover.com/~mjd/perl/MiniMemoize/.


Mark-Jason Dominus runs a consulting business in Philadelphia and sometimes teaches classes elsewhere. He will be giving tutorials on Regular Expressions, WWW Security, and 'Tricks of the Wizards' at the O'Reilly Perl conference in August. His promised article on 'Seven Useful Uses of local' didn't fit in this issue; watch for it in the summer instead. He likes to get mail, so send him some at mjd-tpj@plover.com.

Martin Krzywinski | contact | Canada's Michael Smith Genome Sciences CentreBC Cancer Research CenterBC CancerPHSA
Google whack “vicissitudinal corporealization”
{ 10.9.234.152 }