
The Perl Journal

Volumes 1–6 (1996–2002)

Code tarballs available for issues 1–21.

I reformatted the CD-ROM contents. Some things may still be a little wonky — oh, why hello there <FONT> tag. Syntax highlighting is iffy. Please report any glaring issues.

The Perl Journal
#16
Winter 1999
vol 4
num 4
Letters
Microsoft vs. ActiveState, and precedence.
Perl News
What's new in the Perl community.
Searching for Rhymes with Perl
"I chanced upon a lovely toad..."
Just Another Perl Haiku
Humanizing error messages with Coy.pm.
Managing Streaming Audio
Serving MP3s with Apache.
Hacking the Perl Core
How you can help change Perl.
CS-Web: A Lightweight Summarizer for HTML
Providing capsule descriptions of web pages.
Review: Object Oriented Perl
Recursive Traversal of an FTP Site
Moving data between file trees.
Win32 Module Installation
Using Perl as a make-replacement.
Taking Perl to the Java Virtual Machine
Creating Java class files with Perl.
Review: Perl Programmer's Reference and Annotated Archives
Dynamic DNS Updates With Perl
Remotely updating Internet machine names.
The Second Perl Quiz Show
The Perl Poetry Contest
Tony Rose and Ave Wrigley (1999) CS-Web: A Lightweight Summarizer for HTML. The Perl Journal, vol 4(4), issue #16, Winter 1999.

CS-Web: A Lightweight Summarizer for HTML

Providing capsule descriptions of web pages.

Tony Rose and Ave Wrigley


Packages Used
HTML::Summary                                  CPAN
HTML::TreeBuilder                              CPAN
Text::Sentence                                 Part of HTML::Summary

Canon, like many other large companies, is a multi-national organization with multiple (26) web sites, each managed by a different part of the company. This is a problem for the typical Canon customer, who knows nothing about Canon's internal organization and simply wants to find information about their cameras or download a new printer driver. They need a single clear way to find what they want.

CS-Web: a search engine for Canon's web space

In 1997 we wrote CS-Web, a set of Perl programs to collect information from all of Canon's web sites, index it, and make it searchable from the web. We wrote our own solution because at the time the available products were designed either for searching the entire web (such as AltaVista) or for searching a single web site.

CS-Web consists of a robot, a database, and a web interface (written in mod_perl). The robot traverses all of Canon's web sites and stores a description of each page in the database. The search engine queries the database and gives you a list of candidate documents, and their descriptions. You can try CS-Web for yourself: it is linked from the main "gateway" page for Canon (https://www.canon.com/). Alternatively, you can access it directly at https://csweb.cre.canon.co.uk/.

CS-Web presented a variety of challenges, many of which would make suitable war stories for TPJ. However, for this article, we will focus on one crucial problem: generating the summary of an HTML document.

META tags

Unlike some other search engines, CS-Web doesn't index the full text of the document. Instead, it indexes document descriptions. When web page authors use the META tag's NAME and CONTENT attributes to encode information about the document, CS-Web will use it. However, when that information isn't provided, CS-Web tries to boil down the text of the document into a summary.
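
For illustration, here is a minimal sketch of pulling an author-supplied description out of a META tag with HTML::TreeBuilder. The page markup is made up, and look_down() is assumed to be available in your version of HTML::Element:

    use HTML::TreeBuilder;

    # hypothetical page markup with author-supplied metadata
    my $html_document =
        '<html><head>' .
        '<meta name="description" content="Drivers for Canon printers.">' .
        '</head><body>Body text here.</body></html>';

    my $html_tree = new HTML::TreeBuilder;
    $html_tree->parse( $html_document );

    # look for a META tag whose NAME attribute is "description"
    my $meta = $html_tree->look_down( '_tag' => 'meta', 'name' => 'description' );
    my $description = $meta ? $meta->attr( 'content' ) : undef;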

One important limitation on a document's summary is length: each web page corresponds to a row in a relational database table, and the size of each field in each row is fixed in advance. This is because fixed-width fields are much quicker to search than variable-width fields. You'll see later how this length constraint introduced its own problems.
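
As a hedged illustration of that constraint (the schema, driver, and connection details here are invented for the example, not CS-Web's actual database):

    use DBI;

    # hypothetical connection string and credentials
    my $dbh = DBI->connect( 'dbi:mysql:csweb', 'username', 'password' )
        or die $DBI::errstr;

    # fixed-width CHAR columns keep every row the same size, which is why
    # the generated summary must fit a fixed byte length
    $dbh->do( q{
        CREATE TABLE page (
            url         VARCHAR(255) NOT NULL,
            title       CHAR(100),
            description CHAR(200)
        )
    } );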

Had we been deploying CS-Web across the web at large, we would quickly have found that very few web page authors consistently provide accurate metadata. In fact, deliberately misleading metadata is often included by unscrupulous authors to enhance the page's prominence in search engine results.

However, the outlook for CS-Web was a little more promising. Since Canon's webmasters are generally working together, we could expect a certain level of integrity, and assume that they were not trying to deceive the CS-Web robot. In turn, the CS-Web robot could acknowledge this trust: if it found a page description within a META tag, it would accept the tag as legitimate.

So much for metadata, but how could we generate a text description from the raw HTML when no metadata are present? By encoding some natural language processing techniques in Perl. The result was HTML::Summary.

HTML::Summary

HTML::Summary is available from the CPAN (ftp://ftp.perl.org/pub/CPAN/authors/id/A/AW/AWRIGLEY/); version 0.016 is described in this article. First, you create an HTML::Summary object, using the new() method. You can provide configuration parameters as arguments to new():

    my $html_summarizer = new HTML::Summary LENGTH => 200;

The LENGTH parameter is the maximum length in bytes for the generated summary. Next, you need an HTML::Element object corresponding to the HTML page that you want to summarize. You can generate one with HTML::TreeBuilder:

    my $html_tree = new HTML::TreeBuilder;
    $html_tree->parse( $html_document );

$html_document is a string containing the HTML of the web page; this could have been read in from a file, or returned as the contents of an HTTP request, such as through LWP::Simple's get() method.

Finally, you call the generate() method of the HTML::Summary object, with the HTML::Element object as an argument. That returns the summary of the page as a string:

    $html_summary = $html_summarizer->generate( $html_tree );
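
Putting those pieces together, a minimal end-to-end sketch might look like the following; the URL is hypothetical and error handling is omitted:

    use LWP::Simple qw( get );
    use HTML::TreeBuilder;
    use HTML::Summary;

    # fetch a page (hypothetical URL), build a parse tree, and summarize it
    my $html_document = get( 'http://www.example.com/index.html' );
    my $html_tree     = new HTML::TreeBuilder;
    $html_tree->parse( $html_document );

    my $html_summarizer = new HTML::Summary LENGTH => 200;
    my $html_summary    = $html_summarizer->generate( $html_tree );
    print "$html_summary\n";

    $html_tree->delete;    # free the memory used by the parse tree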

That's how you use it. But how does it work?

The Summarization Algorithm

One of the main tasks before us was generating a good abstract of fixed length from arbitrary text. This is known to be an important and difficult problem, and a quality solution requires sophisticated natural language techniques that can analyze the structure of the original text, identify key phrases and concepts, and regenerate these in a more succinct format.

Luckily for us, there are some quick and dirty ways to generate summaries. We only needed to provide a gist of the original for someone browsing the CS-Web search results. In addition, for retrieval purposes, we wanted the summary to contain representative keywords.

One advantage that we had over people trying to generate summaries from plain text is that HTML pages contain markup information -- the HTML tags. Markup tells us about the structure of the content, and often about its relative importance as well. For example, it is usually clear in HTML pages where paragraphs begin and end, and when important text is italicized, emboldened, or made into a heading.

The HTML::Summary module uses a technique known as the location method of text summarization. This consists of identifying important sentences (based primarily on their location in the text), and concatenating them together to produce an abstract. A simple example of this would be to take the first sentence of every paragraph in an article and string them together. This can sometimes be surprisingly effective:

   Canon, like many other large companies, is a multi-national
   organization with multiple (26) web sites, each managed by a 
   different part of the company. In 1997 we wrote CS-Web, a 
   set of Perl programs to collect information from all of 
   Canon's web sites, index it, and make it searchable from the 
   web. CS-Web consists of a robot, a database, and a web 
   interface (written in mod_perl). CS-Web presented a variety 
   of challenges, many of which would make suitable war stories 
   for TPJ.
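
For concreteness, here is a rough sketch of that naive first-sentence-of-each-paragraph idea. This is not HTML::Summary's actual implementation, just the simple version it adapts; $html_document and the split_sentences() import are assumed from the earlier examples:

    use HTML::TreeBuilder;
    use Text::Sentence qw( split_sentences );

    my $html_tree = new HTML::TreeBuilder;
    $html_tree->parse( $html_document );

    # collect the first sentence of every paragraph, in document order
    my @first_sentences;
    foreach my $paragraph ( $html_tree->find_by_tag_name( 'p' ) ) {
        my @sentences = split_sentences( $paragraph->as_text );
        push( @first_sentences, $sentences[0] ) if @sentences;
    }
    my $naive_summary = join( ' ', @first_sentences );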

The text summarization method used in HTML::Summary is an adaptation of the location method. It works as follows:

• Split into sentences

First, the text is split into sentences. (More about this later.)

• Score the sentences

The sentences are scored according to what HTML element they appear in, and whether or not they are the first sentence in that element. The algorithm here is pretty simple: each element has a score. The first sentence in that element gets this score; the rest of the sentences get nothing.

• Sort the sentences by score

The sentences are stored in an array of hashes. Each hash corresponds to a sentence, and contains information about the text in the sentence, its length, the HTML element it appeared in, the score given to it, and its original order in the text.

    $summary[ scalar( @summary ) ] = {
        'text'          => $text,
        'length'        => length( $text ),
        'tag'           => $tag,
        'score'         => $score,
        'order'         => scalar( @summary ),
    };

The scores, as described above, are based on the HTML element that the sentences appear in. These scores are stored in a global hash:

    my %ELEMENT_SCORES = (
        'p'         => 100,
        'h1'        => 90,
        'h2'        => 80,
        'h3'        => 70,
    );

These scores were arrived at by empirical investigation; we have no theoretical justification for them.

• Truncate the list of sentences

Calculate how many of the highest-scoring sentences are needed to meet or exceed the requested summary length (a sketch of this scoring-and-truncation pipeline appears after this list).

• Sort the sentences by original order again

Having remembered the original sentence order in the text in the hash for that sentence, we can now re-sort the sentences in that order.

• Concatenate the sentences to create the summary

Spaces are added between the sentences, since whitespace was stripped when the sentences were split.

• Truncate the summary at the requested length

This last step assumes that if you want a summary of 200 characters, 201 characters are not acceptable, even if it means chopping the summary off mid-sentence. This is what we wanted in CS-Web. Maybe in other applications a less severe approach would be appropriate; it's easy to add more options to HTML::Summary, so let us know what you think.
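
Here is a hedged sketch of the scoring, sorting, and truncation steps described above. It shows only the shape of the algorithm, not the HTML::Summary source; @summary is assumed to hold the sentence hashes shown earlier, and $length is the requested summary length:

    # sort by score, highest first
    my @by_score = sort { $b->{score} <=> $a->{score} } @summary;

    # keep just enough of the best sentences to reach the requested length
    my @kept;
    my $so_far = 0;
    foreach my $sentence ( @by_score ) {
        last if $so_far >= $length;
        push( @kept, $sentence );
        $so_far += $sentence->{length};
    }

    # restore original document order, join with spaces, then truncate hard
    my $summary = join( ' ',
        map  { $_->{text} }
        sort { $a->{order} <=> $b->{order} } @kept
    );
    $summary = substr( $summary, 0, $length );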

Sentence Splitting

Now for the nitty gritty. The remainder of this article focuses on just one aspect of HTML::Summary: splitting the element contents into sentences. Japanese character encodings were a particular problem for CS-Web; our approach is described in a sidebar: Truncating Japanese Text.

The task of splitting text into sentences seemed like a more general problem than its application to text summarization, so this is contained in a separate module: Text::Sentence. For the moment, this is distributed as part of the HTML::Summary package, but we would be interested to learn if there is any interest in using this module independently.

Text::Sentence is basically just a regex. It has a non-object-oriented interface that exports one function, split_sentences(), which takes the text to be split into sentences as an argument and returns a list of the sentences.

  sub split_sentences
  {
      my $text = shift;

      return () unless $text;

The function first checks if there really is any text to split into sentences; if not, it just returns an empty list.

  # $capital_letter is a character set; to account for locale, this
  # includes all letters for which lc is different from that letter

  my $capital_letter =
      '[' .
          join( '',
              grep { lc( $_ ) ne ( $_ ) }
              map { chr( $_ ) } ord( "A" ) .. ord( "\xff" )
          ) .
      ']'
  ;

Although it would be more efficient to compute this regex component once at the package level, doing it in split_sentences() allows the user to change locales between calls.

The next few lines build up the components of the regex that split the text into sentences. The first of these components is the capital letter found at the start of a sentence. Instead of using the character class [A-Z] as you would normally, Text::Sentence accounts for locale-specific capital letters. For example, in French, a capital A acute (Á) won't be matched by [A-Z]. The method used in Text::Sentence makes use of the fact that lc is sensitive to locale settings, and returns a lowercase version of all capitalized characters. For more information on how Perl handles locales, see the perllocale documentation.
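
To see the locale sensitivity that this relies on, here is a small sketch; the locale name is hypothetical and availability varies from system to system:

    use POSIX qw( locale_h );
    use locale;

    # hypothetical locale name; use one installed on your system
    setlocale( LC_CTYPE, 'fr_FR.ISO8859-1' );

    # under a Latin-1 French locale, lc() maps capital A acute (0xC1) to
    # lowercase a acute (0xE1), so chr(0xC1) ends up in $capital_letter
    # even though the character class [A-Z] would never match it
    print "locale-aware\n" if lc( chr( 0xC1 ) ) ne chr( 0xC1 );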

        @PUNCTUATION = ( '\.', '\!', '\?' );

The @PUNCTUATION array is a global variable in Text::Sentence containing any punctuation used to indicate the end of a sentence. The fact that it's a global means that you're encouraged to change it. For example, you might want to add locale-specific punctuation for the Spanish '¡':

    my $html_summarizer = new HTML::Summary LENGTH => 200;
    push( @Text::Sentence::PUNCTUATION, chr( 161 ) );

Back to split_sentences():

  # this needs to be alternation, not a character class, 
  # because of multibyte characters
  my $punctuation = '(?:' . join( '|', @PUNCTUATION ) . ')';

As mentioned above, one of our concerns was dealing with multibyte character encodings (see the sidebar on Truncating Japanese Text). Japanese punctuation characters may be more than one character long. For example, an exclamation point in the EUC Japanese encoding is "\xA1\xAA".
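
For example, you might register that EUC exclamation mark as end-of-sentence punctuation like this (a hedged sketch: the bytes are added as a literal regex fragment, and the alternation above is what makes a multibyte entry safe):

    # "\xA1\xAA" is the full-width exclamation mark in EUC-JP;
    # @PUNCTUATION holds regex fragments, so the raw bytes act as a literal
    push( @Text::Sentence::PUNCTUATION, "\xA1\xAA" );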

  # return $text if there is no punctuation ...
  return $text unless $text =~ /$punctuation/;

If there isn't any end-of-sentence punctuation in the text, then we might as well return the text now.

 my $opt_start_quote = q/['"]?/;
 my $opt_close_quote = q/['"]?/;
 
 # these are distinguished because (eventually!) I would like to do 
 # locale stuff on quote  characters
 
 my $opt_start_bracket = q/[[({]?/; # }{
 my $opt_close_bracket = q/[\])}]?/;
 

Sentences sometimes have quotation marks or parentheses which come before the capital letter at the beginning, or after the full stop (period, question mark, or exclamation point) at the end. For example:

    Larry said "let there be light!" (And there was.)

is two sentences; the first ends after the second double quote. However:

    Larry said "let there be light!" (and there was).

is one sentence. Here is the regex in all its glory:


  my @sentences = $text =~ /
  (
                          # sentences start with ...
      $opt_start_quote    # an optional start quote
      $opt_start_bracket  # an optional start bracket
      $capital_letter     # a capital letter ...
      .+?                 # at least some (non-greedy) chars ...
      $punctuation        # ... followed by any one of !?.
      $opt_close_quote    # an optional close quote
      $opt_close_bracket  # and an optional close bracket
  )
  (?=                     # with lookahead that it is followed by ...
      (?:                      # either ...
          \s+                  # some whitespace ...
          $opt_start_quote     # an optional start quote
          $opt_start_bracket   # an optional start bracket
          $capital_letter      # an uppercase word character 
                               # (for locale sensitive matching)
      |                        # or ...
          \n\n                 # two (or more) newlines
                               # (i.e. a new paragraph)
      |                        # or ...
          \s*$                 # optional whitespace and by end of string
      )
  )
  /gxs
  ;
  return @sentences if @sentences;
  return ( $text );
}

This regex makes use of the lookahead feature of regular expressions. In this case, it allows us to specify that a sentence must not only start with a capital letter, and end in a full stop, but that there must be another capital letter which follows the full stop. The only exception to this is when the sentence is at the end of a paragraph.

The lookahead accounts for the whitespace between sentences, so it's not part of the matched patterns that end up in the @sentences array. That's why concatenating the sentences won't give you back the exact original text.

The main problem with trying to split text into sentences is that there are several uses for periods, such as abbreviations:

Dr. Livingstone, I presume.

This phrase counts as two sentences according to Text::Sentence -- the first sentence is three characters long. The performance of Text::Sentence could be improved by taking into account special cases like honorifics (Mr., Mrs., Dr.), common abbreviations (e.g., etc., i.e.), and so on. However, as with many natural language problems, this obeys the law of diminishing returns; a little bit of effort will do a decent 90% job, but that last 10% is pretty difficult. For our purposes the 90% is good enough.
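
A quick check of that behaviour, assuming split_sentences() can be imported by name:

    use Text::Sentence qw( split_sentences );

    my @sentences = split_sentences( 'Dr. Livingstone, I presume.' );

    # with the regex above this yields two "sentences":
    #   'Dr.'
    #   'Livingstone, I presume.'
    print scalar( @sentences ), " sentences\n";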

Conclusion

We chose to use Perl for CS-Web because of the obvious benefits: the LWP modules for web programming, DBD/DBI, mod_perl, and so on. We found that Perl is also a useful tool for doing natural language work. Its text processing features, rapid development cycle, and ability to generate complex data structures on the fly make it particularly appropriate.

A lot of interesting work in natural language research involves analyzing corpus data: collecting statistics about language use over large databases of typical usage. The web is an obvious rich source of this type of data, and in view of this, it is a little surprising how few tools and modules appeared to be available in Perl for this field. Certainly, when we were working on the Text::Sentence regex, we posted something to a language processing mailing list, and there seemed to be quite a lot of interest in what we were doing, as well as extensive Perl expertise in that community. Hopefully natural language processing will become yet another nut for Perl to crack!


Tony Rose has a background in natural language processing and has published widely in the area of speech and language technology. He joined Canon Research Centre in November 1996 and is now manager of the Information Retrieval Department. Prior to this position he worked at Hewlett-Packard Laboratories on language modelling for speech recognition. He then completed a research fellowship at BT Labs on Internet agents for information retrieval. His research interests include natural language interfaces and information retrieval.

Ave Wrigley is webmaster at www.itn.co.uk. He has been developing web applications in Perl for the last five years, and is the author of WWW::Sitemap and co-author of WWW::Robot and HTML::Summary. He can be contacted at Ave.Wrigley@itn.co.uk.


Truncating Japanese Text

Canon is a Japanese company, with Japanese text on many of its web pages. Japanese text is usually encoded in one of several possible multibyte encoding schemes, and some of these schemes use variable numbers of bytes to represent single Japanese characters, or intermingle Japanese and regular ASCII characters. This was a problem.

The summaries generated by HTML::Summary are truncated at a fixed length, and this length is specified in bytes, not characters. If Japanese text is truncated at an arbitrary byte length, this might mean truncation in the middle of a character.

Worse, our page abstracts can appear in result listings for keyword searches. If a page summary broken mid-character is inserted into running text, the byte immediately following the summary could be interpreted as the next byte of the previously uncompleted Japanese character, upsetting the character boundaries for the rest of the text.
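
A tiny sketch of what goes wrong; the byte values below are two EUC-JP hiragana characters, and substr() counts bytes, not characters, in a pre-Unicode Perl:

    # two EUC-JP hiragana characters, four bytes in total
    my $jtext = "\xA4\xA2\xA4\xA4";

    # naive byte-wise truncation splits the second character in half,
    # leaving a stray lead byte that corrupts whatever text follows
    my $broken = substr( $jtext, 0, 3 );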

The HTML::Summary package includes another supporting module, Lingua::JA::Jtruncate, which addresses this problem. Lingua::JA::Jtruncate contains just one subroutine, jtruncate(), used as follows:

    use Lingua::JA::Jtruncate qw( jtruncate );
    $truncated_jtext = jtruncate( $jtext, $length );

where $jtext is some Japanese text that you want to truncate, $length is the maximum truncation length, and $truncated_jtext is the result. Here's how it works.

First, some regexes are defined that match characters in each of the three main Japanese coding schemes: EUC, Shift-JIS, and JIS.

    %euc_code_set = (
        ASCII_JIS_ROMAN     => '[\x00-\x7f]',
        JIS_X_0208_1997     => '[\xa1-\xfe][\xa1-\xfe]',
        HALF_WIDTH_KATAKANA => '\x8e[\xa0-\xdf]',
        JIS_X_0212_1990     => '\x8f[\xa1-\xfe][\xa1-\xfe]',
    );
    %sjis_code_set = (
        ASCII_JIS_ROMAN     => '[\x21-\x7e]',
        HALF_WIDTH_KATAKANA => '[\xa1-\xdf]',
        TWO_BYTE_CHAR       =>
                          '[\x81-\x9f\xe0-\xef][\x40-\x7e\x80-\xfc]',
    );
    %jis_code_set = (
        TWO_BYTE_ESC        => 
            '(?:' .
            join( '|',
                '\x1b\x24\x40',
                '\x1b\x24\x42',
                '\x1b\x26\x40\x1b\x24\x42',
                '\x1b\x24\x28\x44',
            ) .
            ')'
        ,
        TWO_BYTE_CHAR => '(?:[\x21-\x7e][\x21-\x7e])',
        ONE_BYTE_ESC => '(?:\x1b\x28[\x4a\x48\x42\x49])',
        ONE_BYTE_CHAR       =>
            '(?:' .
            join( '|', 
                '[\x21-\x5f]',                    
                 # JIS7 Half width katakana
                '\x0f[\xa1-\xdf]*\x0e',             
                 # JIS8 Half width katakana
                '[\x21-\x7e]',                      
                 # ASCII / JIS-Roman
            ) .
            ')'
    );
    %char_re = (
        'euc' => '(?:' . join( '|', values %euc_code_set ) . ')',
        'sjis' => '(?:' . join( '|', values %sjis_code_set ) . ')',
        'jis' => '(?:' . join( '|', values %jis_code_set ) . ')',
    );

Each of the regexes in %char_re matches one character encoded in the scheme corresponding to the keys of the hash.
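
For example, the EUC regex can be used to count characters rather than bytes (a hedged sketch, assuming %char_re is in scope):

    # count EUC-JP characters (not bytes) in $text
    my @chars      = ( $text =~ /$char_re{ 'euc' }/g );
    my $char_count = scalar( @chars );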

Now for the definition of the jtruncate() subroutine; first, some fairly obvious sanity checks:

    sub jtruncate
    {
        my $text            = shift;
        my $length          = shift;

        # sanity checks (test for an undefined length before comparing it)
        return undef if not defined $length;
        return undef if $length < 0;
        return '' if $length == 0;
        return $text if length( $text ) <= $length;

Now we save the original text; this is used later if the truncation process fails for some reason.

        my $orig_text = $text;

Now we use Lingua::JA::Jcode::getcode() to detect the character encoding. Lingua::JA::Jcode::getcode() is a simple wrapper around the jcode.pl Perl library for Japanese character code conversion. Kazumasa Utashiro kindly agreed to let us distribute the code with HTML::Summary.

        my $encoding = Lingua::JA::Jcode::getcode( \$text );

If getcode returns undef, or a value other than euc, sjis, or jis, then it has either failed to detect the encoding, or detected that it is not one of those that we are interested in. We then take the brute force approach, using substr.

        if ( not defined $encoding 
                     or $encoding !~ /^(?:euc|s?jis)$/ ){
            return substr( $text, 0, $length );
        }

The actual truncation of the string is done in chop_jchars() - more about this later.

        $text = chop_jchars( $text, $length, $encoding );

chop_jchars() returns undef on failure. If we have failed to truncate the Japanese text properly we resort to substr again. We had to decide whether it was more important to meet the $length constraint or risk returning a Japanese string with broken character encoding. We chose the former:

        return substr( $orig_text, 0, $length ) 
                                    unless defined $text;

Next, a special case: JIS encoding uses escape sequences to shift in and out of single-byte and multi-byte modes. If the truncation process leaves the text ending in multi-byte mode, we need to add the single-byte escape sequence. Therefore, we truncate (at least) three more bytes from the JIS-encoded string, so we have room to add the single-byte escape sequence without going over the $length limit.

        if ( $encoding eq 'jis' and 
            $text =~ /$jis_code_set{ TWO_BYTE_CHAR }$/) {
            $text = chop_jchars( $text, $length - 3, 
                                             $encoding );
            return substr( $orig_text, 0, $length ) 
                                     unless defined $text;
            $text .= "\x1b\x28\x42";
        }
And we’re done!
        return $text;
    }

Now for chop_jchars(), which simply lops off Japanese characters from the end of the string until it is shorter than the requested length. It’s pretty ugly, and slow for large strings truncated to small values, but it does the job!

    sub chop_jchars
    {
        my $text = shift;
        my $length = shift;
        my $encoding = shift;
        while( length( $text ) > $length ) {
            return undef 
            unless $text =~ s!$char_re{ $encoding }$!!o;
        }
        return $text;
    }