Martin Krzywinski - The Perl Journal - Canada's Michael Smith Genome Sciences Centre

Volumes 1–6 (1996–2002)

Code tarballs available for issues 1–21.

I reformatted the CD-ROM contents. Some things may still be a little wonky — oh, why hello there <FONT> tag. Syntax highlighting is iffy. Please report any glaring issues.

The Perl Journal

Winter 1996

vol 1

num 4

A Subjective Look at Object Oriented Programming

A guarded introduction to OOP.

Mike Stok

Randomness

The generation and use of random numbers.

Jon Orwant

The Perl Purity Test

Are you a wizard, a guru, or merely a user?

Jeff Okamoto

Using Usenet from Perl

(Thank you for not spamming.)

Graham Barr

New Modules

Recent additions to the CPAN.

Jon Orwant

CGI Programming: The LWP Library

How to write your own browsers, robots, and more.

Lincoln Stein

Perl/Tk: The Grid Geometry Manager

How to lay out widgets.

Steve Lidie

use Lovecraft qw(cthulhu);

A Lovecraftian homage to the new Camel book.

Charlie Stross

Lincoln Stein (1996) CGI Programming: The LWP Library. The Perl Journal, vol 1(4), issue #4, Winter 1996.

How to write your own browsers, robots, and more.

Lincoln Stein

In previous columns I've focused on the Web from the server's point of view. We've talked about how the CGI protocol works, how to write server scripts, and how to maintain long-running transactions across the Web. But what about the client side of the story? Does Perl offer any support for those of us who wish to write our own Web-creeping robots, remote syntax verifiers, database accessors, or even full-fledged graphical browsers? Naturally it does, and the name of this support is LWP.

LWP (Library for WWW access in Perl) is a collection of modules written by Martijn Koster and Gisle Aas. It derives in part from the Perl 4 libwww-perl library created by Roy Fielding. To understand what LWP can do, consider the tasks your average Web browser is called upon to perform:

read and parse a URL

connect to a remote server using the protocol appropriate for the URL (e.g. HTTP, GOPHER, FTP)

negotiate with the server for the requested document, providing authentication when necessary

interpret the retrieved document's headers

parse and display the document's HTML content

The LWP library provides support for all of the tasks listed above, and several others, including handling proxy servers. In its simplest form, you can use LWP to fetch remote URLs from within a Perl script. With more effort, you can write an entirely Perl-based Web browser. In fact, the Perl/Tk library comes complete with a crude but functional graphical browser based on LWP.

The LWP modules are divided along the following categories:

URI::*              URL creation and parsing
HTML::*             HTML creation, parsing and formatting
HTTP::*             The HTTP protocol 
LWP::UserAgent      Object-oriented interface to the library 	
LWP::Simple         Procedural interface to the library 
LWP::Protocol::*    Interfaces to various protocols

To illustrate what you can do with LWP, I've written a Perl script, get_weather, that fetches and prints the current weather report. You could run this script from an hourly cron job and incorporate the result into an HTML page, or use it to produce the text for a scrolling marquee applet (and produce a special effect that does something useful for a change!).

The US National Oceanic and Atmospheric Administration (NOAA) runs a series of Web servers that provide constantly updated weather reports and weather maps. Its servers were designed for interactive use through fill-out forms; by changing the form, you can select among the cities that NOAA monitors, and choose among a variety of text and graphical reports. By reverse-engineering its forms, I was able to determine that you can obtain a basic weather report by passing the CGI script

https://www.nnic.noaa.gov/cgi-bin/netcast.do-it

a query string that looks like this (all one line):

state=<state>&city=on&area=Local+Forecast&match=Strong+Match&html=text+only+format

Everything in the string is constant except for the <state> parameter, which despite its name should be one of NOAA's three-letter city abbreviations (e.g. "BOS" for Boston, "NYC" for New York City). You can learn the list of abbreviations by browsing NOAA's site. When you fetch this URL you'll receive a short HTML page that contains the weather report plus a few graphics and links to NOAA's other pages.
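To make the shape of that query string concrete, here is a minimal sketch that assembles it from name/value pairs using only core Perl. The helper names form_escape and build_query are hypothetical and do not appear in the article or in LWP; the sketch also assumes the values are plain ASCII. In a real script the URI module's query_form method can do the same job.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the article): escape one form value the
# way an HTML form submission does. Letters, digits, and - _ . ~ pass
# through, spaces become "+", and every other character becomes %XX.
sub form_escape {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~ ])/sprintf("%%%02X", ord($1))/ge;
    $text =~ tr/ /+/;
    return $text;
}

# Hypothetical helper: join name/value pairs into a query string,
# preserving the order in which they were given.
sub build_query {
    my @pairs = @_;
    my @parts;
    while (my ($name, $value) = splice(@pairs, 0, 2)) {
        push @parts, form_escape($name) . '=' . form_escape($value);
    }
    return join('&', @parts);
}

# The city code comes from the command line, defaulting to Boston.
my $city  = shift || 'BOS';
my $query = build_query(
    state => $city,
    city  => 'on',
    area  => 'Local Forecast',
    match => 'Strong Match',
    html  => 'text only format',
);
print "$query\n";
```

Run without arguments, this prints state=BOS&city=on&area=Local+Forecast&match=Strong+Match&html=text+only+format, which is the string shown above with BOS filled in for <state>.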

Listing 1 shows the code for get_weather, which fetches the current weather report from the NOAA server. You invoke it from the command line with the city code as its argument (default "BOS"). An example of the script's output appears in Listing 2.

Thanks to the LWP library, the code is very straightforward. Lines 04-06 load the components of the LWP library that we need. In addition to the LWP::UserAgent module, which provides URL-fetching functionality, we import routines from the HTML::Parse and HTML::FormatText modules. The first provides the ability to create a parse tree from HTML text, and the second turns the parse tree into pretty-printed text.

Lines 08-11 set up various globals for the script. The city is read from the command line, and globals for the server URL and its CGI parameters are defined.

The interesting part begins in lines 13-17, where we connect to the NOAA server, send the query, and retrieve the result. First we create an LWP::UserAgent object, which is essentially a virtual browser. Next we create an HTTP::Request object to hold information about the URL we're requesting. We initialize the request object with the string 'GET' to indicate we want to make a GET request, and with the URL we want to fetch. The actual connection and data transfer occur in line 15, where we invoke the UserAgent's request() method and receive an HTTP::Response object as the result. Lastly we check the transaction's result code by calling the response object's is_success() method and die with an informative error message if there was a problem.

We now have an HTTP::Response object in hand. It contains the HTTP status code, the various MIME headers that are transmitted along with the document, and the document itself. In lines 19-24 we extract the document, reformat it, and print it out. First we extract the HTML document using the response object's content() method, and immediately pass its result to the parse_html() function. Next we create a new HTML formatter object. LWP provides several types of formatters, including one that generates PostScript, but we're interested in HTML::FormatText, which creates pretty-printed ASCII text. We then pass the parse tree to the formatter's format() method, effectively stripping all HTML tags from the text and returning HTML entity codes to their original characters.
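The parts of a response object can be explored without touching the network. The following is a small sketch that builds an HTTP::Response by hand and reads back its status code, one header, and its body. The status, header, and HTML body are made-up sample data, not output from the NOAA server, and the sketch assumes the HTTP::Response module that ships with LWP is installed.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Response;

# Build a response by hand so the accessors can be tried offline.
# Arguments: status code, status message, headers, body (sample data).
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/html' ],
    '<html><body><p>Sunny &amp; cool.</p></body></html>',
);

# The same accessors work on a response returned by a user agent.
print "Status code:  ", $response->code, "\n";
print "Content type: ", $response->header('Content-Type'), "\n";
print "Succeeded\n" if $response->is_success;
print "Body length:  ", length($response->content), " bytes\n";
```

Because the object is constructed locally, this is handy for testing the parsing and formatting steps against a saved page before pointing the script at a live server.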

The script isn't quite done, however, because the pretty-printed page still contains cruft such as the links to NOAA's home page that we don't want in the final output. The last part of the script splits the pretty-printed text into an array of lines and extracts all the text between the two rows of hyphens that NOAA uses to delimit the weather report.
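The extraction step relies on Perl's range operator used in scalar context, which can be tried on its own. The sketch below applies the same idea to a small made-up page; the helper name extract_report and the sample text are illustrative assumptions, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the lines from the dated header row
# (hyphens followed by text) through the closing row of hyphens.
sub extract_report {
    my ($text) = @_;
    my @report;
    foreach my $line (split /\n/, $text) {
        # In scalar context ".." acts as a flip-flop: it turns true when
        # the left pattern matches and stays true until the right pattern
        # matches, so both boundary lines are included.
        push @report, $line if $line =~ /^\s*-+\w/ .. $line =~ /^\s*-+$/;
    }
    return @report;
}

# Made-up sample text standing in for the formatted page.
my $page = <<'END';
NOAA home page
   -----410 AM EST TUESDAY OCTOBER 29 1996-----
   .Today...sunny...windy and cool.
   ------------------------------------------
Other forecasts
END

print "$_\n" for extract_report($page);
```

This prints only the three lines from the dated header through the closing row of hyphens. The opening row is recognized because its hyphens are followed by a word character, while the closing row consists of hyphens alone.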

This example only gives a taste of what you can do with LWP. The LWP library distribution is itself a good source for ideas. Among the sample application programs that accompany it is a Web mirror application that can be used to replicate a tree of Web pages, updating the local copies only if they are out of date with respect to the remote ones. Other parts of the library include the basic components required to write your own Web-crawling robots. LWP is distributed under the Perl Artistic License and can be downloaded from any CPAN archive.

To find an archive near you, visit https://www.perl.com/CPAN/ or one of the CPAN sites listed.

listing 1

get_weather: A Perl script that fetches the current weather report.

01 #!/usr/bin/perl 
02 # File: get_weather 
03 
04 use LWP::UserAgent; 
05 use HTML::Parse; 
06 use HTML::FormatText; 
07 
08 # options 
09 $CITY = shift || 'BOS'; 
10 $URL = 'https://www.nnic.noaa.gov/cgi-bin/netcast.do-it'; 
11 $OPTIONS =
     'city=on&area=Local+Forecast&match=Strong+Match&html=text+only+format'; 
12 
13 $agent = new LWP::UserAgent; 
14 $request = new HTTP::Request('GET', "$URL?state=$CITY&$OPTIONS"); 
15 $response = $agent->request($request); 
16 die "Couldn't get URL. Status code = ", $response->code 
17     unless $response->is_success; 
18 
19 $parse_tree = parse_html($response->content); 
20 $formatter = new HTML::FormatText; 
21 @lines = split("\n",$formatter->format($parse_tree)); 
22 foreach (@lines) { 
23     print $_, "\n" if /^\s*[-]+\w/../^\s*[-]+$/; 
24 }

listing 2

Output from get_weather

   --------------410 AM EST TUESDAY OCTOBER 29 1996--------------
   .Today...sunny...windy and cool. High in the mid 50s. Northwest wind
   20 to 30 mph...diminishing late. 
   .Tonight...clear and cool. Some clouds late. Low 35 to 40. Northwest
   wind becoming light and variable. 
   .Wednesday...becoming cloudy. A 50 percent chance of afternoon showers. 
   High in the mid 50s. 
   ------------------------------------------------------------------