LWP (Library for WWW access in Perl) is a collection of modules written by Martijn Koster and Gisle Aas. It derives in part from the Perl 4 libwww-perl library created by Roy Fielding. To understand what LWP can do, consider the tasks your average Web browser is called upon to perform:
- read and parse a URL
- connect to a remote server using the protocol appropriate for the URL (e.g. HTTP, Gopher, FTP)
- negotiate with the server for the requested document, providing authentication when necessary
- interpret the retrieved document's headers
- parse and display the document's HTML content
The LWP library provides support for all of the tasks listed above, and several others, including handling proxy servers. In its simplest form, you can use LWP to fetch remote URLs from within a Perl script. With more effort, you can write an entirely Perl-based Web browser. In fact, the Perl/Tk library comes complete with a crude but functional graphical browser based on LWP.
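As a minimal sketch of that simplest form, the procedural LWP::Simple interface can fetch a document in a single call. The wrapper name fetch_page() is hypothetical and only adds error handling around get(); the URL comes from the command line.

```perl
use strict;
use warnings;
use LWP::Simple qw(get);

# Fetch a URL and return its content.
# get() returns undef on any failure without saying why,
# so this (hypothetical) wrapper turns that into an exception.
sub fetch_page {
    my ($url) = @_;
    my $content = get($url);
    die "Couldn't fetch $url\n" unless defined $content;
    return $content;
}

# Usage: perl fetch_page.pl URL
print fetch_page($ARGV[0]) if @ARGV;
```

Because get() reports no status code or headers, this style suits quick scripts; anything that needs to inspect the response is better served by the object-oriented interface.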
The LWP modules are divided along the following categories:
URI::*              URL creation and parsing
HTML::*             HTML creation, parsing and formatting
HTTP::*             The HTTP protocol
LWP::UserAgent      Object-oriented interface to the library
LWP::Simple         Procedural interface to the library
LWP::Protocol::*    Interfaces to various protocols
To illustrate what you can do with LWP, I've written a Perl script, get_weather, that fetches and prints the current weather report. You could run this script from an hourly cron job and incorporate the result into an HTML page, or use it to produce the text for a scrolling marquee applet (and produce a special effect that does something useful for a change!).
The US National Oceanic and Atmospheric Administration (NOAA) runs a series of Web servers that provide constantly updated weather reports and weather maps. Its servers were designed for interactive use through fill-out forms; by changing the form, you can select among the cities that NOAA monitors and choose among a variety of text and graphical reports. By reverse-engineering its forms, I was able to determine that you can obtain a basic weather report by passing the CGI script
https://www.nnic.noaa.gov/CGI-bin/netcast.do-it
a query string that looks like this (all one line):
state=<state>&city=on&area=Local+Forecast&match=Strong+Match&html=text+only+format
Everything in the string is constant except for the <state> parameter, which despite its name should be one of NOAA's three-letter city abbreviations (e.g. "BOS" for Boston, "NYC" for New York City). You can learn the list of abbreviations by browsing NOAA's site. When you fetch this URL you'll receive a short HTML page that contains the weather report plus a few graphics and links to NOAA's other pages.
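One way to assemble that URL without escaping the query string by hand is the URI module's query_form() method. The sketch below is an illustration rather than the get_weather code itself: build_forecast_url() is a hypothetical helper, and the parameter names and values are simply the ones quoted above.

```perl
use strict;
use warnings;
use URI;

# Build the forecast URL for one NOAA city code.
# build_forecast_url() is a hypothetical helper; the parameters
# are copied from the query string shown in the text.
sub build_forecast_url {
    my ($city) = @_;
    my $uri = URI->new('https://www.nnic.noaa.gov/CGI-bin/netcast.do-it');
    # query_form() escapes each value and writes spaces as "+".
    $uri->query_form(
        state => $city,    # despite its name, a city code such as "BOS"
        city  => 'on',
        area  => 'Local Forecast',
        match => 'Strong Match',
        html  => 'text only format',
    );
    return $uri->as_string;
}

my $city = shift(@ARGV) || 'BOS';    # same default as get_weather
print build_forecast_url($city), "\n";
```

Passing the parameters as a list keeps them in the order given, so the result has the same shape as the hand-written query string.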
Further down the page you'll see the code for get_weather, which fetches the current weather report from the NOAA server. You invoke it from the command line with the city code as its argument (default "BOS"). An example of the script's output is shown further below.
Thanks to the LWP library, the code is very straightforward. Lines 04-06 load the components of the LWP library that we need. In addition to the LWP::UserAgent module, which provides URL-fetching functionality, we import routines from the HTML::Parse and HTML::FormatText modules. The first provides the ability to create a parse tree from HTML text, and the second turns the parse tree into pretty-printed text.
Lines 08-11 set up various globals for the script. The city is read from the command line, and globals for the server URL and its CGI parameters are defined.
The interesting part begins in lines 13-17, where we connect to the NOAA server, send the query, and retrieve the result. First we create an LWP::UserAgent object, which is essentially a virtual browser. Next we create an HTTP::Request object to hold information about the URL we're requesting. We initialize the request object with the string 'GET' to indicate that we want to make a GET request, and with the URL we want to fetch. The actual connection and data transfer occur in line 15, where we invoke the UserAgent's request() method and receive an HTTP::Response object as the result. Lastly we check the transaction's result code by calling the response object's is_success() method, and die with an informative error message if there was a problem.
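The request cycle just described can be sketched as follows. This is not the get_weather listing: fetch_response() is a hypothetical wrapper, the URL is taken from the command line, and the success check is written as is_success(), the spelling used by current LWP releases.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Fetch $url and return the HTTP::Response object, or die with the
# server's status line. fetch_response() is a hypothetical wrapper.
sub fetch_response {
    my ($url) = @_;
    my $ua       = LWP::UserAgent->new;              # the "virtual browser"
    my $request  = HTTP::Request->new(GET => $url);  # method and URL
    my $response = $ua->request($request);           # connection and transfer happen here
    die "Request for $url failed: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response;
}

# Usage: perl fetch_response.pl URL
print fetch_response($ARGV[0])->content if @ARGV;
```

The status_line() method returns the numeric code together with its message (for example "404 Not Found"), which is usually enough for a useful error message.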
We now have an HTTP::Response object in hand. It contains the HTTP status code, the various MIME headers that are transmitted along with the document, and the document itself. In lines 19-24 we extract the document, reformat it, and print it out. First we extract the HTML document using the response object's content() method, and immediately pass its result to the parse_html() function. Next we create a new HTML formatter object. LWP provides several types of formatters, including one that generates PostScript, but we're interested in HTML::FormatText, which creates pretty-printed ASCII text. We then pass the parse tree to the formatter's format() method, effectively stripping all HTML tags from the text and converting HTML entities back to the characters they represent.
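The parse-and-format step might look like the sketch below. Note one substitution: it builds the parse tree with HTML::TreeBuilder, which the HTML::Parse documentation recommends for new code, rather than with parse_html(). The helper name html_to_text(), the margin settings, and the sample HTML are assumptions made for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Convert an HTML string to plain text.
# html_to_text() is a hypothetical helper; the margins are arbitrary.
sub html_to_text {
    my ($html) = @_;
    my $tree      = HTML::TreeBuilder->new_from_content($html);  # build the parse tree
    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 72);
    my $text      = $formatter->format($tree);  # tags dropped, entities decoded
    $tree->delete;                              # release the parse tree
    return $text;
}

# Made-up sample input, not real NOAA output.
print html_to_text('<h1>Boston</h1><p>Fair &amp; mild this afternoon.</p>');
```

In a full script the argument would be the string returned by the response object's content() method.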
The script isn't quite done, however, because the pretty-printed page still contains cruft such as the links to NOAA's home page that we don't want in the final output. The last part of the script splits the pretty-printed text into an array of lines and extracts all the text between the two rows of hyphens that NOAA uses to delimit the weather report.
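The final extraction needs nothing beyond core Perl. The sketch below assumes that a delimiter is a line consisting only of three or more hyphens (plus optional surrounding whitespace); the exact layout of NOAA's pages is not shown here, so treat that pattern and the helper name extract_report() as assumptions.

```perl
use strict;
use warnings;

# Return the lines found between the first two rows of hyphens.
# Assumption: a delimiter row contains nothing but hyphens (3 or more).
sub extract_report {
    my ($text) = @_;
    my @report;
    my $inside = 0;
    for my $line (split /\n/, $text) {
        if ($line =~ /^\s*-{3,}\s*$/) {
            last if $inside;    # second row of hyphens: report is complete
            $inside = 1;        # first row of hyphens: report starts next
            next;
        }
        push @report, $line if $inside;
    }
    return @report;
}

# Made-up sample text standing in for the formatted page.
my $page = join "\n",
    'NOAA Home Page', '----------', 'BOSTON: Fair, high near 70.',
    'Tonight: Clear.', '----------', 'Other forecasts';
print "$_\n" for extract_report($page);
```

If the closing row of hyphens is missing, this version returns everything after the opening row, which may or may not be the behavior you want.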
This example only gives a taste of what you can do with LWP. The LWP library distribution is itself a good source for ideas. Among the sample application programs that accompany it is a Web mirror application that can be used to replicate a tree of Web pages, updating the local copies only if they are out of date with respect to the remote ones. Other parts of the library include the basic components required to write your own Web-crawling robots. LWP is distributed under the Perl Artistic License and can be downloaded from any CPAN archive.
To find an archive near you, visit https://www.perl.com/CPAN/ or one of the CPAN sites listed.
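The update-only-if-stale behavior of the mirror application mentioned above is also available for a single document through the mirror() function in LWP::Simple. The sketch below is a simplified illustration, not the bundled mirror program: refresh_copy() is a hypothetical wrapper, and the URL and local file name come from the command line.

```perl
use strict;
use warnings;
use LWP::Simple qw(mirror);
use HTTP::Status qw(RC_NOT_MODIFIED is_success);

# Refresh the local copy of $url in $file, transferring the document
# only if the remote one is newer. refresh_copy() is a hypothetical wrapper.
sub refresh_copy {
    my ($url, $file) = @_;
    my $code = mirror($url, $file);    # sends If-Modified-Since when $file exists
    return 'unchanged' if $code == RC_NOT_MODIFIED;
    return 'updated'   if is_success($code);
    die "Couldn't mirror $url: HTTP status $code\n";
}

# Usage: perl refresh_copy.pl URL LOCAL_FILE
print refresh_copy(@ARGV), "\n" if @ARGV == 2;
```

Replicating a whole tree of pages, as the sample application does, would additionally require parsing each page for links and repeating this step for every one.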