
The Perl Journal

Volumes 1–6 (1996–2002)


The Perl Journal #15, Fall 1999 (vol 4, num 3)
Tkil (1999) A Web Spider...In One Line?. The Perl Journal, vol 4(3), issue #15, Fall 1999.

A Web Spider...In One Line?

Using HTML::, LWP::, and HTTP:: modules to traverse links.

Tkil


URLs

libwww-perl (LWP):           https://www.linpro.no/lwp/
Web spiders (robots):         https://info.webcrawler.com/mak/projects/robots/robots.html

Today, someone on the IRC #perl channel was asking some confused questions. We finally managed to figure out that he was trying to write a web robot, or "spider", in Perl. Which is a grand idea, except that:

  1. Perfectly good spiders have already been written and are freely available at https://info.webcrawler.com/mak/projects/robots/robots.html.

  2. A Perl-based web spider is probably not an ideal project for a novice Perl programmer. Work your way up to it.

Having said that, I immediately pictured a one-line Perl robot. It wouldn't do much, but it would be amusing. After a few abortive attempts, I ended up with this monster, which requires Perl 5.005. I've split it onto separate lines for easier reading.

perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -lwe '
    $ua = LWP::UserAgent->new; 
    while (my $link = shift @ARGV) { 
        print STDERR "working on $link"; 
        HTML::LinkExtor->new( 
          sub { 
            my ($t, %a) = @_; 
            my @links = map { url($_, $link)->abs() } 
                       grep { defined } @a{qw/href img/}; 
            print STDERR "+ $_" foreach @links;
            push @ARGV, @links;
          } ) -> parse( 
           do { 
               my $r = $ua->simple_request 
                 (HTTP::Request->new("GET", $link)); 
               $r->content_type eq "text/html" ? $r->content : ""; 
           } 
         ) 
     }' https://slinky.scrye.com/~tkil/ 

I actually edited this on a single line; I use shell-mode inside of Emacs, so it wasn't that much of a terror. Here's the one-line version.

perl -MLWP::UserAgent -MHTML::LinkExtor -MURI::URL -lwe 
'$ua = LWP::UserAgent->new; while (my $link = shift @ARGV) { 
print STDERR "working on $link";HTML::LinkExtor->new( sub
{ my ($t, %a) = @_; my @links = map { url($_, $link)->abs()
} grep { defined } @a{qw/href img/}; print STDERR "+ $_"
foreach @links; push @ARGV, @links} )->parse(do { my $r =
$ua->simple_request (HTTP::Request->new("GET", $link)); 
$r->content_type eq "text/html" ? $r->content : ""; } )
}' https://slinky.scrye.com/~tkil/ 

After getting an ego-raising chorus of groans from the hapless onlookers in #perl, I thought I'd try to identify some cute things I did with this code that might actually be instructive to TPJ readers.

Callbacks and Closures

Many modules are designed to do grunt work. In this case, HTML::LinkExtor (a specialized version of HTML::Parser) knows how to look through an HTML document and find links. Once it finds them, however, it needs to know what to do with them.

This is where "callbacks" come in. They're well-known in GUI circles, since interfaces need to know what to do when one presses a button or selects a menu item. Here, HTML::LinkExtor needs to know what to do with links (all tags, actually) when it finds them.

My callback is an anonymous subroutine reference:

      sub { 
          my ($t, %a) = @_; 
          my @links = map { url($_, $link)->abs() } 
                        grep { defined } @a{qw/href img/}; 
          print STDERR "+ $_" foreach @links;
          push @ARGV, @links;
      } 

I didn't notice until later that $link is actually scoped just outside of this subroutine (in the while loop), making this subroutine look almost like a closure. It's not a classical closure, since it doesn't define its own storage, but it does use a lexical value far away from where it is defined. (Enough justification for a section title!)
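For contrast, here is a minimal sketch of a callback that does keep its own storage. It uses only core Perl; fake_parse and make_collector are hypothetical names invented for this illustration, and fake_parse merely stands in for a parser that calls back with a tag name followed by attribute pairs.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parser: invokes $callback once per tag,
# passing the tag name followed by its attribute key/value pairs.
sub fake_parse {
    my ($callback, @tags) = @_;
    $callback->(@$_) for @tags;
}

# Hypothetical helper: returns a callback that closes over both $page
# and its own private @found array, plus a reference to that array.
sub make_collector {
    my ($page) = @_;
    my @found;
    my $callback = sub {
        my ($tag, %attr) = @_;
        push @found, "$page -> $attr{href}" if defined $attr{href};
    };
    return ($callback, \@found);
}

my ($cb, $found) = make_collector("https://example.com/");
fake_parse(
    $cb,
    [ a   => (href => "one.html") ],
    [ img => (src  => "x.png") ],
    [ a   => (href => "two.html") ],
);
print "$_\n" for @$found;
```

The anonymous sub keeps $page and @found alive after make_collector returns, which is the "own storage" that the one-liner's callback does without.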

Cascading Arrows

It's amusing to note that, aside from debugging output, the while loop consists of a single statement. The arrow operator (->) only cares about the value of the left hand side; this is the heart of the Perl/Tk idiom:

    my $button = $main->Button( ... )->pack(); 

We use a similar approach, except we don't keep a copy of the created reference (which is stored in $button above):

    HTML::LinkExtor->new(...)->parse(...); 

This is a nice shortcut to use whenever you want to create an object for a single use.
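The same chaining can be tried without Tk or LWP installed. The sketch below assumes a hypothetical Counter class, made up for this illustration, whose add method returns the object so that further arrows can follow it.

```perl
use strict;
use warnings;

package Counter;    # hypothetical class, for illustration only

sub new {
    my ($class, %args) = @_;
    my $start = exists $args{start} ? $args{start} : 0;
    return bless { n => $start }, $class;
}

# Returns the object itself so that calls can be chained.
sub add {
    my ($self, $k) = @_;
    $self->{n} += $k;
    return $self;
}

sub value { return $_[0]{n} }

package main;

# The object is created, used, and discarded in one statement;
# no variable ever holds a reference to it.
my $total = Counter->new(start => 10)->add(5)->add(7)->value;
print "$total\n";
```

Each arrow is applied to whatever the expression on its left returned, so the chain works only as long as every intermediate method returns an object.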

Using Modules with One-Liners

From my first thought of this one-liner, I knew I'd be using modules from the libwww-perl (LWP) library. The first few iterations of this "one-liner" used LWP::Simple, which explicitly states that it should be ideal for one-liners. The -M flag loads a module straight from the command line, which keeps one-liners short. LWP::Simple fetched the files just fine. I used something like:

    HTML::LinkExtor->new(...)->parse( get $link ); 

Here, get() is a function provided by LWP::Simple; it returns the contents of a given URL.

Unfortunately, I needed to check the Content-Type of the returned data. The first version merrily tried to parse .tar.gz files and got confused:

working on ./dist/irchat/irchat-3.03.tar.gz
Use of uninitialized value at
    /usr/lib/perl5/site_perl/5.005/LWP/Protocol.pm line 104.
Use of uninitialized value at 
    /usr/lib/perl5/site_perl/5.005/LWP/Protocol.pm line 107.
Use of uninitialized value at 
    /usr/lib/perl5/site_perl/5.005/LWP/Protocol.pm line 82. 

Ooops.

Switching to the "industrial strength" LWP::UserAgent module allowed me to check the Content-Type of the fetched page. Using this information, together with the HTTP::Response module and a quick ?: construct, I could parse either the HTML content or an empty string.
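The Content-Type check can be exercised without touching the network by building HTTP::Response objects by hand. This is only a sketch: it assumes HTTP::Response (distributed with libwww-perl) is installed, html_or_empty is a hypothetical helper name, and the two responses are invented test data.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: return the body only when the response is HTML,
# otherwise an empty string (the same ?: test the one-liner uses).
sub html_or_empty {
    my ($r) = @_;
    return $r->content_type eq "text/html" ? $r->content : "";
}

# Invented responses standing in for what a user agent would return.
my $page = HTTP::Response->new(
    200, "OK",
    [ "Content-Type" => "text/html; charset=utf-8" ],
    "<a href='x.html'>x</a>",
);
my $tarball = HTTP::Response->new(
    200, "OK",
    [ "Content-Type" => "application/x-gzip" ],
    "not really gzip data",
);

my $html    = html_or_empty($page);
my $skipped = html_or_empty($tarball);
print "HTML body length: ", length($html), "\n";
print "Tarball body length: ", length($skipped), "\n";
```

In scalar context content_type returns just the lowercased media type, without parameters such as charset, so the comparison against "text/html" still matches the first response.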

The End

Whenever I write a one-liner, I find it interesting to think about it in different ways. While I was writing it, I was mostly thinking from the bottom up; some of the complex nesting is a result of this. For example, the callback routine is fairly hairy, but once I had it written, I could change the data source from LWP::Simple::get() to LWP::UserAgent and the content() method of HTTP::Response quite easily.

Obviously, this spider does nothing more than visit HTML pages and try to grab all the links off of each one. It could be more polite (but see the LWP::RobotUA module for some of that) and it could be smarter about which links to visit. In particular, there's no sense of which pages have already been visited; a tied DBM of visited pages would solve that nicely.
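One way the visited-pages idea might look is sketched below. It assumes SDBM_File (a DBM implementation bundled with Perl) is available, stores the DBM in a temporary directory, and replaces real fetching with a hypothetical in-memory %links_of table so that it runs offline.

```perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR, O_CREAT
use SDBM_File;                  # DBM implementation bundled with Perl
use File::Temp qw(tempdir);

# Keep the DBM files in a scratch directory for this demonstration.
my $dir = tempdir(CLEANUP => 1);
tie my %seen, 'SDBM_File', "$dir/visited", O_RDWR | O_CREAT, 0666
    or die "Cannot tie visited-pages DBM: $!";

# Hypothetical link graph standing in for pages fetched from the web.
my %links_of = (
    "a.html" => [ "b.html", "c.html" ],
    "b.html" => [ "a.html", "c.html" ],
    "c.html" => [ "a.html" ],
);

my @queue = ("a.html");
my @order;
while (defined(my $link = shift @queue)) {
    next if $seen{$link}++;     # skip pages already visited
    push @order, $link;
    push @queue, @{ $links_of{$link} || [] };
}
print "visited: @order\n";

untie %seen;
```

Because %seen is tied to a file, a real spider could point it at a permanent path instead of a temporary directory and remember visited pages across runs.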

Even with these limitations, I'm impressed at the power expressed by that "one" line. Kudos for that go to Gisle Aas (the author of LWP) and to Larry Wall, for making a language that does all the boring stuff for us. Thanks Gisle and Larry!


Tkil can be found at tkil@scrye.com. He lives in Fort Collins, Colorado, with a small pod of computers, a wall full of CDs, some neglected juggling toys, a closetful of neuroses, bunches of books, and a string of Christmas lights for illumination. He enjoys playing with Perl, C++, and Unix, and sometimes even manages to get paid for it. The rest of his time is wasted on IRC EFNet's #perl channel.