Modules used: LWP, Text::Wrap (CPAN)
Sure, the web has all kinds of wonderful graphical services. Sit down in front of your computer, go clicky-clicky, and worlds of information are at your fingertips. The problem is, sometimes it's nice not to have to sit in front of a web browser to visit sites. Maybe you'd prefer to have those web pages mailed to you--call it poor man's push technology. Or maybe you'd like to download a lot of information from a huge number of web pages, and you don't want to open them all one by one. Or maybe you'd like to write a robot that scours the web for information. Enter the LWP bundle (sometimes called libwww-perl), which contains two modules that can download web pages for you: LWP::Simple and LWP::UserAgent.
My colleague Dan Gruhl submitted five tiny but exquisite programs to TPJ, all using LWP to automatically download information from a web service. Instead of sprinkling these around the magazine as "TPJ One-Liners", I've collected all five here with a bit of explanation for each.
The first thing to notice is that all five programs look alike. Each uses an LWP module (LWP::Simple in the first three, LWP::UserAgent in the last two) to store the HTML from a web page in Perl's default scalar variable $_. Then they use a series of s/// substitutions to discard the extraneous HTML. The remaining text--the part we're interested in--is displayed on the screen, although it could nearly as easily have been sent as email with the various Mail modules on the CPAN.
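Mailing the result instead of printing it takes only a few extra lines. Here's a sketch using the Mail::Send module from the MailTools distribution on the CPAN (the address and subject are placeholders, and any of the other Mail modules would do just as well):

```perl
use Mail::Send;

# Assume $_ already holds the cleaned-up text, as in the programs below.
my $msg = Mail::Send->new(
    To      => 'you@example.com',    # placeholder address
    Subject => 'Your daily web fix', # placeholder subject
);
my $fh = $msg->open;    # opens a pipe to a mailer (sendmail by default)
print $fh $_;
$fh->close;             # the message is sent when the handle is closed
```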
Downloading currency exchange rates
The currency program converts money from one currency into another, using the exchange rates on www.oanda.com. Here's how to find out what $15.82 is worth in Euros:
$ currency 15.82 USD EUR
--> 15.82 US Dollar = 14.1452 Euro
The LWP::Simple module has a function that makes retrieving web pages easy: get(). When given a URL, get() returns the text of that web page as one long string. In currency, get() is fed a URL for oanda.com containing the three arguments provided to the program: $ARGV[0], $ARGV[1], and $ARGV[2], which correspond to 15.82, USD, and EUR in the sample run above. The resulting web page is stored in $_, after which four s/// substitutions discard unwanted data.
#!/usr/bin/perl -w

# Currency converter.
# Usage: currency [amount] [from curr] [to curr]

use LWP::Simple;

$_ = get("https://www.oanda.com/converter/classic?value=$ARGV[0]&exch=$ARGV[1]&expr=$ARGV[2]");

s/^.*<!-- conversion result starts//s;
s/<!-- conversion result ends.*$//s;
s/<[^>]+>//g;
s/[ \n]+/ /gs;

print $_, "\n";
The first s/// removes all text before the HTML comment <!-- conversion result starts; the tail of that comment (-->) becomes the arrow that you see in the output. The second s/// removes all text after the conversion result. The third s/// dumbly removes all tags in the text that remains, and the final s/// replaces consecutive spaces and newlines with a single space each.
Downloading Weather Information
Weather information is downloaded from www.intellicast.com in much the same way as currency information is downloaded from www.oanda.com. The URL is different, the s/// substitutions are different (except for the HTML tag remover), but the basic operation is the same. As an added treat, weather.pl uses the Text::Wrap module to format the output to 76 columns (reformatted for TPJ):
$ weather bos
WHDH WEATHER FORECAST
Wednesday February 17, 1999 at 8:11 AM
---------------------------------------

A stormier pattern is definitely in the making. Two significant storms will impact our area during the next 5 days. The first one will bring rain at first Wednesday night and especially Thursday, but as colder air works down from the higher levels of the atmosphere, a mix with or change to heavy wet snow may occur before that storm is done with us later Thursday or Thursday night. However, enormous potential rests with the second storm. As of now, it would appear the time table could fall in the Saturday afternoon into early Sunday time frame. The rain/snow line is sure to be a challenge with that storm, but at this time it appears that most of our area will receive mainly snow, with Cape Cod and the Islands falling closest to the rain/snow line. Please stay tuned to our updates all week long. Todd and Harv
Todd and Harv say more, but I've truncated their output to save space.
#!/usr/bin/perl

# Prints the weather for a given airport code
#
# Examples: weather bos
#           weather sfo

use LWP::Simple;
use Text::Wrap;

$_ = get("https://www.intellicast.com/weather/$ARGV[0]/content.shtml");

s/^.*<BLOCKQUOTE>//s;
s/<\/BLOCKQUOTE>//s;
s/<[^>]+>//g;
s/\n\n\n+/\n\n/g;
s/\©.*//s;

print wrap('', '', $_);    # default: 76 columns
Downloading News Stories
The CNN home page displays the top news story; cnn formats and displays it using Text::Wrap. I sandwiched Dan's code in a while loop that sleeps for five minutes (300 seconds) and retrieves the top story again. If the new story (as usual, stored in $_) is different than the old story ($old), it's printed.
#!/usr/bin/perl

use LWP::Simple;
use Text::Wrap;

while (sleep 300) {
    $_ = get("https://www.cnn.com");
    s/^.*Top Table//s;
    s/<[^>]+>//g;
    s/FULL STORY.*$//s;
    s/^.*>\s+//s;
    s/\n\n+/\n\n/g;
    if ($old ne $_) { print wrap('', '', $_); $old = $_ }
}
Because sleep returns the number of seconds slept, sleep 300 will always return a true value, and so this while loop will never exit.
Completing U.S. Postal Addresses
There's a TPJ subscriber in Cambridge who hasn't been getting his issues. When each issue goes to press, I FTP my mailing list to a professional mail house that takes care of all the presorting and bagging and labeling that the US Post Office requires--an improvement over the days when I addressed every issue myself in a cloud of Glu-Stik vapors.
The problem is that whether I like it or not, the mail house fixes addresses that seem incorrect. 'Albequerque' becomes 'Albuquerque', and 'Somervile' becomes 'Somerville'. That's great, as long as the rules for correcting addresses--developed by the post office--work. They usually do, but occasionally a correct address is "fixed" to an incorrect address. That's what happened to this subscriber.
The address program pretends to be a user typing information into the fields of the post office's address correction page at https://www.usps.com/ncsc/. That page asks for six fields: company (left blank for residential addresses), urbanization (valid only for Puerto Rico), street, city, state, and zip. You need to provide the street, and either the zip or the city and state. Regardless of which information you provide, the site responds with everything:
$ address company "The Perl Journal" urbanization "" street "Boxx 54" city "" state "" zip "02101"
PO BOX 54
BOSTON MA 02101-0054
Carrier Route : B001
County : SUFFOLK
Delivery Point : 54
Check Digit : 8
Note that I deliberately inserted a spelling error: Boxx.
One inconvenience of address is that you have to supply placeholders for all the fields, even the ones you're leaving blank.
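If the placeholders get tedious, a tiny wrapper can fill in the blanks before handing the pairs to address. This is a hypothetical convenience script of my own (call it addr); the defaulting scheme is not part of Dan's program:

```perl
#!/usr/bin/perl -w
# addr: supply empty placeholders for the fields you omit.
# Usage: addr street "Boxx 54" zip "02101"

my %args = (
    company      => '',
    urbanization => '',
    street       => '',
    city         => '',
    state        => '',
    zip          => '',
    @ARGV,    # user-supplied field/value pairs override the defaults
);

# Rebuild the full argument list in the order address expects, then run it.
exec 'address', map { ($_, $args{$_}) } qw(company urbanization street city state zip);
```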
This program is a bit trickier than the three you've seen so far. It doesn't use LWP::Simple, but instead two other modules from the LWP bundle: LWP::UserAgent and HTTP::Request::Common. That's because LWP::Simple can handle only HTTP GET queries. This web site uses a POST query, and so Dan used the more sophisticated LWP::UserAgent module, which has an object-oriented interface.
First, an LWP::UserAgent object, $ua, is created with new(), and then its request() method is invoked to POST the address data to the web page. If the POST was successful, the is_success() method returns true, and the page contents can then be found in the _content attribute of the response object, $resp. (The documented way to get at the contents is the content() method; Dan reaches straight into the underlying hash.) The address is extracted as the _content is being stored in $_, and two more s/// substitutions remove unneeded data.
#!/usr/bin/perl -w

# Need *either* state *or* zip

use LWP::UserAgent;
use HTTP::Request::Common;

$ua = new LWP::UserAgent;

$resp = $ua->request(POST 'https://www.usps.com/cgi-bin/zip4/zip4inq', [@ARGV]);

exit -1 unless $resp->is_success;

($_ = $resp->{_content}) =~ s/^.*address is:<p>\n//si;
s/Version .*$//s;
s/<[^>]+>//g;

print;
You can use address to determine the zip code given an address, or to find out your own nine-digit zip code, or even to find out who's on the same mail carrier route as you. If you type in the address of the White House, you'll learn that the First Lady has her own zip code, 20500-0002.
Downloading Stock Quotes
Salomon Smith Barney's web site is one of many with free 15-minute delayed stock quotes. To find the stock price for Yahoo, you'd provide stock with its ticker symbol, yhoo:
$ stock.pl yhoo
Yahoo Inc
Symbol: YHOO
Last Price: $134 11/16  at 9:39am
Chg.: +1 5/16
Bid: $135 1/16
Ask: $135 1/8
Like address, stock needs the LWP::UserAgent module because it's making a POST query.
Just because LWP::UserAgent has an OO interface doesn't mean that the program has to spend an entire line creating an object and explicitly storing it ($object = new Class), although that was undoubtedly what Gisle Aas envisioned when he wrote the interface. Here, Dan's preoccupation with brevity shows, as he invokes an object's method in the same statement that creates the object: (new LWP::UserAgent)->request(...).
#!/usr/bin/perl

# Pulls a stock quote from Salomon Smith Barney's web site.
#
# Usage: stock.pl ibm
#
# or whatever stock ticker symbol you like.

use LWP::UserAgent;
use HTTP::Request::Common;

$response = (new LWP::UserAgent)->request(POST
    'https://www.smithbarney.com/cgi-bin/benchopen/quoteget',
    [ search_type => "1", search_string => "$ARGV[0]" ]);

exit -1 unless $response->is_success;

$_ = $response->{_content};

s/<[^>]+>//g;
s/^.*recent close[^a-zA-Z0-9]+//s;

@t = split(/\n\n+/);
print shift(@t), "\n";

@h = split(/\n/, shift(@t));

foreach (@h) {
    ($f = shift(@t)) =~ s/\n//g;
    next unless $f =~ /\d/;    # Skip fields without digits
    print $_, ": ", $f, " ";
    print " at ", shift(@t) if /L/;
    print "\n";
}
Whee
These aren't robust programs. They were dashed off in a couple of minutes for one person's pleasure, and they most certainly will break as the companies in charge of these pages change the formats of their web pages or the URLs needed to access them.
We don't care. When that happens, these scripts will break, we'll notice that, and we'll amend them accordingly. Sure, each of these programs could be made much more flexible. They could be primed to adapt to changes in the HTML, the way a human would if the information were moved around on the web page. Then the s/// expressions would fail, and the programs could expend some effort trying to understand the HTML using a more intelligent parsing scheme, perhaps using the HTML::Parse or Parse::RecDescent modules. If the URL became invalid, the scripts might start at the site home page and pretend to be a naive user looking for his weather or news or stock fix. A smart enough script could start at Yahoo and follow links until it found what it was looking for, but so far no one has been smart enough to write a script like that.
Of course, the time needed to create and test such programs would be much longer than making quick, brittle, and incremental changes to the code already written. No, it's not rocket science--it's not even computer science--but it gets the job done.
Jon Orwant and Dan Gruhl are members of the MIT Media Laboratory Electronic Publishing Group. When Jon isn't creating TPJ, he's writing Perl programs that write Perl programs that play games. Dan writes programs that hide messages in paper money and search for meaning in large text databases.