
The Perl Journal

Volumes 1–6 (1996–2002)


Damian Conway (2000) Lazy Text Formatting. The Perl Journal, vol 5(4), issue #20, Winter 2000.

Lazy Text Formatting

Damian Conway


Packages Used  
Text::Autoformat CPAN

Don't you just hate getting an email that's been for
matted
for the wrong number of columns? It's an unprovoked ass
ault
on your poor visual cortex. And it's a thoughtless insult, to
o.
It screams: "Hey, you aren't even worth the eight keystr
okes
it would take me to correctly set my editor's autowrap!"

> And, of course, it only gets worse when quoted email is
involved. > Even when someone tries to do the right
thing, they just end > up frying more of your neurons as
you attempt to untangle > the mess that most text formatters
make of the standard > quoting conventions. It's no fun trying
to separate the meaning > from the massage.

What the world needs is a text reformatter that looks at the contents -- and context -- of the ASCII it's munging, and then Does The Right Thing automagically.

Text::Autoformat

And that's exactly what the Text::Autoformat module gives you. Specifically, it provides a subroutine named autoformat that wraps text to fixed margins. However, unlike other text wrapping modules (such as Text::Wrap, Text::Correct, or Text::Reflow), autoformat reformats its input by analyzing the text's structure: identifying and rearranging independent paragraphs by looking for visual gaps, list bullets, changes in quoting, centering, and underlining.

If you're happy to live with autoformat's reasonable defaults, then reformatting a single paragraph (taking it from STDIN and printing it to STDOUT) is no more complicated than this:

    use Text::Autoformat;
    autoformat;

The default width of the reformatted text is from column 1 to column 72, but it's very easy to change that (and a plethora of other defaults) by giving autoformat the appropriate options:

    autoformat {left=>8, right=>64};

Or the equivalent, but often more convenient, alternative:

    autoformat {left=>8, width=>57};

If autoformat's first argument isn't a hash reference, that argument is stringified and used as the text to be formatted. For example:

    autoformat $msg_text;

Likewise, if it's called in a non-void (scalar or list) context, autoformat returns the formatted text, rather than printing it to STDOUT.

Normally, autoformat only reformats the first paragraph it encounters, and leaves the remainder of the text unaltered. This behavior seems odd initially, until you realize that the single most common use of autoformat is in the following one-liner:

    perl -MText::Autoformat -e'autoformat'

And that the obvious thing to do with this one-liner is to map it onto a convenient keystroke in your text editor, thereby providing intelligent, single-key, paragraph-at-a-time reformatting. For example, if you're a vi user, you might add this to your .exrc file:

    map f !G perl -MText::Autoformat -eautoformat

That is: map the f key to grab every line from the current editing position to the end of the file and filter it through Perl. Then, to provide that filter, the Text::Autoformat module is loaded and autoformat is called.

If autoformat's default were to reformat everything it was sent, then you'd have to write:

    map f !} perl -MText::Autoformat -eautoformat

and you'd be stuck with vi's much less sophisticated understanding of what constitutes a paragraph. More on that shortly.

Of course, the real power of the module is best seen when it operates on multiple paragraphs simultaneously. To convince autoformat to do that -- to reflow every paragraph you send it -- you need to ask explicitly, with another option:

    autoformat { all=>1 };

Which leads to the obvious "just-fix-it-all-up-for-me-would-ya" editor macro:

    map F !G perl -MText::Autoformat -e'autoformat{all=>1}'

What is this thing called "paragraph"?

The autoformat subroutine gives the illusion of understanding the structure of an input text because it has a series of very good heuristics (i.e. guesses) for locating and separating paragraphs.

Most text formatters -- and many text editors -- define a paragraph to be a sequence of characters terminated by two or more consecutive newlines. Indeed, this is Perl's notion of a paragraph (which you can grab with a single readline by setting the $/ variable to an empty string, as described in the perlvar man page).

That's very annoying, because it doesn't cope with how real people write paragraphed text. Real people leave spaces and tabs on "empty" lines. Real people (and many web browsers) bunch up lists of bulleted and numbered points with no whitespace at all between them. Real people quote email messages, which transforms formerly empty lines into non-empty \n\t>\n sequences.

Because real people do such things, autoformat understands all these notions of a paragraph. Even when they're all used at once. Even when they're used inside one another (for example, quoting a list of bulleted points).
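
To see why Perl's built-in notion of a paragraph falls short, here is a minimal sketch that uses only core Perl. The sample text and both helper subroutines are invented for illustration; the "forgiving" split is merely one plausible way to treat whitespace-only lines as separators, not the code that Text::Autoformat itself uses.

```perl
use strict;
use warnings;

# Two "blank" lines separate three paragraphs, but the second blank
# line secretly contains a space.
my $text = "First paragraph.\n\nSecond paragraph.\n \nThird paragraph.\n";

# Perl's paragraph mode: an empty $/ makes readline return everything
# up to the next run of truly empty lines.
sub paragraphs_by_perl {
    my ($input) = @_;
    open my $fh, '<', \$input or die "Cannot read string: $!";
    local $/ = "";
    my @paras = <$fh>;
    close $fh;
    return @paras;
}

# A more forgiving split: whitespace-only lines also count as separators.
sub paragraphs_by_regex {
    my ($input) = @_;
    return grep { /\S/ } split /\n[ \t]*\n(?:[ \t]*\n)*/, $input;
}

my @by_perl  = paragraphs_by_perl($text);
my @by_regex = paragraphs_by_regex($text);
printf "paragraph mode: %d, forgiving split: %d\n",
    scalar(@by_perl), scalar(@by_regex);
```

Paragraph mode glues the second and third paragraphs together, because a line holding a single space is not empty as far as $/ is concerned.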

Quote, unquote, requote...

For example, one of Text::Autoformat's most useful paragraphing heuristics is that any sequence of lines beginning with standard "quoter" characters is a single piece of quoted text, in which the quoters should be preserved and only the text to the right of them reflowed.

The standard quoters that autoformat recognizes are nested combinations of the characters:

        !  #  %  =  |  :  >

Angle brackets can also be preceded by alphabetic characters. So, for example, autoformat would take a series of paragraphs like this:

        > ! > calling map in a void context is the sign
        > ! > of a sick mind
        > !
        > ! I don't see why.
        > Me either, I regularly do it and I'm still
        > quite sane. I often split in a void context
        > too, but there's a bug in Perl that seems to
        > cause that to mess up $_[0], $_[1], etc.
        > ! > Sigh. Have you bothered to read the man
        > ! > page on split??? Yes, I know I wrote this
        > ! > before that reply: it's a miracle.

and reformat them like so:

        > ! > calling map in a void context is
        > ! > the sign of a sick mind
        > !
        > ! I don't see why.
        > Me either, I regularly do it and I'm
        > still quite sane. I often split in a
        > void context too, but there's a bug 
        > in Perl that seems to cause that to 
        > mess up $_[0], $_[1], etc.
        > ! > Sigh. Have you bothered to read
        > ! > the man page on split??? Yes, I
        > ! > know I wrote this before that
        > ! > reply: it's a miracle.

And that's the whole point. By understanding the structural conventions of typical plaintext, autoformat can reflow it logically, rather than physically.
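
The quoting heuristic can be approximated in a few lines of core Perl. This is only a sketch: the $QUOTER pattern and the split_quoter and group_by_quoter subroutines are hypothetical names, written from the description above rather than taken from the module's source, and the real rules are likely more subtle.

```perl
use strict;
use warnings;

# One quoter: ">" optionally preceded by letters, or one of the
# punctuation quoters, followed by optional spacing.
my $QUOTER = qr/(?:[A-Za-z]*>|[!#%=|:])[ \t]*/;

# Split a line into its quoting prefix and the text to be reflowed.
sub split_quoter {
    my ($line) = @_;
    return ($1, $2) if $line =~ /^(\s*(?:$QUOTER)+)(.*)$/;
    return ('', $line);
}

# Group consecutive lines that share the same quoters (ignoring spacing).
sub group_by_quoter {
    my @groups;
    for my $line (@_) {
        my ($quoter, $body) = split_quoter($line);
        (my $key = $quoter) =~ s/\s+//g;
        push @groups, { key => $key, lines => [] }
            if !@groups || $groups[-1]{key} ne $key;
        push @{ $groups[-1]{lines} }, $body;
    }
    return @groups;
}

my @mail = (
    q{> ! > calling map in a void context is the sign},
    q{> ! > of a sick mind},
    q{> !},
    q{> ! I don't see why.},
    q{> Me either, I regularly do it and I'm still},
    q{> ! > Sigh. Have you bothered to read the man},
);

printf "%-4s %d line(s)\n", $_->{key}, scalar @{ $_->{lines} }
    for group_by_quoter(@mail);
```

Each group could then be reflowed on its own and have its quoter glued back onto every output line, which is the "logical" reflowing described above.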

Number one (with a bullet)

Often plaintext will include lists that are either bulleted with punctuation characters, simply numbered (i.e. 1., 2., 3., etc.), or hierarchically numbered (1, 1.1, 1.2, 1.3, 2, 2.1, etc.). Whether or not it is physically separated from each of its neighbors, each bulleted item is implicitly a separate paragraph and needs to be formatted individually, with the appropriate indentation.

autoformat takes care of that formatting. It detects unordered bullets (the characters * . + -), special markers that ought to be outdented (such as NB:, p.s., etc.), Arabic and Roman numerals, single alphabetic letters, and hierarchical combinations of these (for example: 2.a(ix)).

Besides adjusting the left margin so that the marker is outdented from the paragraph text, autoformat renumbers each numbered point sequentially (using the first number as its starting point). For example, given the following text:

        You're wrong for the following reasons:
                1. I'm right.
                1.a. I'm *always* right
                1. Even if you were right, you have the order
        wrong.
                1.x. You suggested:
                        > D. Analyze the problem carefully
                        > C. Design the algorithm appropriately
                        > A. Code solution systematically
                        > E. Test thoroughly
                        > B. Ship eventually
                1.n. The proper sequence is:
                        A. Code solution expediently
                        B. Ship immediately
                        E. Test sporadically (charge user for
        maintenance)
                        F. Release "upgrade" periodically (charge 
        user again)

autoformat {all => 1} produces:

        You're wrong for the following reasons:
                1. I'm right.
                1.a. I'm *always* right
                2. Even if you were right, you have the
                   order wrong.
                2.a. You suggested:
                        > D. Analyze the problem carefully
                        > C. Design the algorithm
                        >    appropriately
                        > A. Code solution systematically
                        > E. Test thoroughly
                        > B. Ship eventually
                2.b. The proper sequence is:
                        A. Code solution expediently
                        B. Ship immediately
                        C. Test sporadically (charge user 
                           for maintenance)
                        D. Release "upgrade" periodically
                           (charge user again)

Notice that autoformat got the hierarchical ordering correct, and that it didn't renumber the quoted list, even though it reflowed the text within the quoted section. That makes sense, since renumbering the quoted list might change its meaning in a way that reformatting wouldn't.

The autoformat subroutine also handles renumbering of lists marked with Roman numerals. For example, the list:

   Examples of the five declensions are:
           i. terra, terra, terram, terrae, terrae, 
   terra
           v. modus, mode, modum, modi, modo, modo
           x. nomen, nomen, nomen, nominis, nomini,
   nomine
           ix. portus, portus, portum, portus, portui,
   portu
           mmmclxiv. dies, dies, diem, diei, diei, die

would be reformatted thus:

   Examples of the five declensions are:
             i.	terra, terra, terram, terrae,
            	terrae, terra
            ii.	modus, mode, modum, modi,
               	modo, modo
           iii.	nomen, nomen, nomen, nominis,
              	nomini, nomine
            iv.	portus, portus, portum, portus,
              	portui, portu
             v.	dies, dies, diem, diei, diei,
               	die

autoformat is even smart enough to right-justify the numbers, so as to align the paragraph bodies cleanly.
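
The renumbering and right-justification shown above can be sketched in core Perl. The to_roman and roman_markers subroutines are hypothetical helpers written for this illustration; they only regenerate the markers and say nothing about how the module parses the original ones.

```perl
use strict;
use warnings;

# Convert a positive integer to a lower-case Roman numeral.
sub to_roman {
    my ($n) = @_;
    my @table = (
        [1000, 'm'], [900, 'cm'], [500, 'd'], [400, 'cd'],
        [100,  'c'], [90,  'xc'], [50,  'l'], [40,  'xl'],
        [10,   'x'], [9,   'ix'], [5,   'v'], [4,   'iv'], [1, 'i'],
    );
    my $roman = '';
    for my $pair (@table) {
        my ($value, $glyph) = @$pair;
        while ($n >= $value) { $roman .= $glyph; $n -= $value }
    }
    return $roman;
}

# Produce $count sequential markers starting at $start, padded on the
# left so that the paragraph bodies line up.
sub roman_markers {
    my ($start, $count) = @_;
    my @markers = map { to_roman($_) . '.' } $start .. $start + $count - 1;
    my $width = 0;
    for my $marker (@markers) {
        $width = length($marker) if length($marker) > $width;
    }
    return map { sprintf '%*s', $width, $_ } @markers;
}

print "$_ terra, terra, terram ...\n" for roman_markers(1, 5);
```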

Of course, automatically handling lists of letters and lists of Roman numerals presents an interesting challenge. A list such as:

    I. Put cat in box.
    M. Close lid.
    P. Activate Geiger counter.

should obviously be reordered as I...J...K, whereas:

    I. Put cat in box.
    M. Close lid.
    XLI. Activate Geiger counter.

should clearly become I...II...III.

But what about:

    I. Put cat in box.
    M. Close lid.
    L. Activate Geiger counter.

The autoformat subroutine resolves this ambiguity by always interpreting a list with alphabetic bullets as being English letters, unless the full list contains only valid Roman numerals, and at least one of those numerals is two or more characters long. So the final example above would become I...J...K -- as you might have expected.
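
That rule is compact enough to sketch directly. The is_roman and bullet_style subroutines below are hypothetical names for this illustration, and the validity pattern is a standard one for numerals up to 3999, assumed here rather than taken from the module.

```perl
use strict;
use warnings;

# True if $s is a well-formed Roman numeral (either case, 1 to 3999).
sub is_roman {
    my ($s) = @_;
    return length($s)
        && $s =~ /^M{0,3}(?:CM|CD|D?C{0,3})(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})$/i;
}

# Decide how to read a list of alphabetic bullets: letters by default,
# Roman numerals only if every bullet is a valid numeral and at least
# one of them is two or more characters long.
sub bullet_style {
    my @bullets = @_;
    my $all_roman = !grep { !is_roman($_) } @bullets;
    my $has_long  = grep { length($_) >= 2 } @bullets;
    return ($all_roman && $has_long) ? 'roman' : 'letter';
}

print join(' ', @$_), ' => ', bullet_style(@$_), "\n"
    for [qw(I M P)], [qw(I M XLI)], [qw(I M L)];
```

The three calls correspond to the three cat-in-box lists above, and come out as letter, roman, and letter respectively.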

Famous next words

Literary quotations present a different challenge from quoted email. A typical formatter would re-render the following quotation:

      "We are all of us in the gutter, but some of us 
    are looking at the stars"                                 
                                    -- Oscar Wilde
                                       English playwright

like so:

      "We are all of us in the gutter, but some
      of us are looking at the stars" -- Oscar
      Wilde English playwright

But autoformat recognizes the quotation structure and preserves both indentation and attribution:

      "We are all of us in the gutter,
       but some of us are looking 
       at the stars"
                               -- Oscar Wilde
                                  English playwright

It even outdents the leading quotation mark nicely.

Suffer not the widow to abide alone

Did you notice that in the previous example, autoformat broke the second line earlier than it needed to? It did that because, if the full margin width had been used, the formatting would have left the last line oddly short:

      "We are all of us in the gutter,
       but some of us are looking at the 
       stars"
                               -- Oscar Wilde
                                  English playwright

Typographical misdemeanors of this type (known as widows) are heavily frowned upon in typesetting circles. They look ugly in plaintext too, so autoformat avoids them with a kind of Dickensian artful dodge: stealing extra words from earlier lines in a paragraph, to provide the widowed word with adequate company.

The heuristic used is that final lines must be at least ten characters long. If the last line is too short, the paragraph's right margin is reduced by one column, and the paragraph is reformatted. This process iterates until either the last line exceeds nine characters or the margins have been narrowed by 10% of their original separation. In the latter case, the reformatter gives up and just uses its original formatting.
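
The loop just described might look something like this. It is a sketch under simplifying assumptions: wrap_greedy is a deliberately naive word wrapper standing in for the module's real formatter, both subroutine names are invented here, and margins are reduced to a single width.

```perl
use strict;
use warnings;

# Naive greedy wrap: pack as many words as fit onto each line.
sub wrap_greedy {
    my ($text, $width) = @_;
    my @lines = ('');
    for my $word (split ' ', $text) {
        if ($lines[-1] eq '') {
            $lines[-1] = $word;
        }
        elsif (length($lines[-1]) + 1 + length($word) <= $width) {
            $lines[-1] .= " $word";
        }
        else {
            push @lines, $word;
        }
    }
    return @lines;
}

# Narrow the width one column at a time until the last line has at
# least ten characters, giving up after a 10% reduction.
sub wrap_without_widow {
    my ($text, $width) = @_;
    my @original  = wrap_greedy($text, $width);
    my $min_width = $width - int($width * 0.10);
    for (my $w = $width; $w >= $min_width; $w--) {
        my @lines = wrap_greedy($text, $w);
        return @lines if @lines == 1 || length($lines[-1]) >= 10;
    }
    return @original;    # no luck: fall back to the first attempt
}

my $quote = '"We are all of us in the gutter, but some of us '
          . 'are looking at the stars"';
print "$_\n" for wrap_without_widow($quote, 34);
```

At a width of 34 the naive wrap strands stars" on a line of its own; two columns narrower, the word "the" moves down to keep it company.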

Justification and sentencing

The autoformat subroutine can also take an option that tells it how the reformatted text should be justified. For example:

    autoformat {justify => 'right'};

The alternative values for this option are: 'left' (the default), 'right', 'centre' (or 'center'), and 'full'.

Full justification is interesting in a fixed-width medium like plaintext because it usually results in uneven spacing between words. Typically, text formatters provide this by distributing the extra spaces into the first available gaps of each line:

    R3> Now  is  the  Winter  of our discontent made
    R3> glorious Summer by this son of York. And all
    R3> the  clouds  that  lour'd  upon our house In
    R3> the deep bosom of the ocean buried.

This produces an odd visual effect, so autoformat reverses the strategy and inserts the extra spaces into the last gaps of each line (which most readers find less disconcerting):

    R3> Now is the Winter  of  our  discontent  made
    R3> glorious Summer by this son of York. And all
    R3> the clouds that lour'd  upon  our  house  In
    R3> the deep bosom of the ocean buried.
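
Distributing the padding from the right-hand end is easy to sketch. The justify_full subroutine below is a hypothetical helper that handles a single unquoted line; it illustrates the strategy and is not the module's own code.

```perl
use strict;
use warnings;

# Pad a line to $width by widening inter-word gaps, giving any leftover
# spaces to the rightmost gaps first.
sub justify_full {
    my ($line, $width) = @_;
    my @words = split ' ', $line;
    my $gaps  = @words - 1;
    my $extra = $width - length(join ' ', @words);
    return join(' ', @words) if $gaps < 1 || $extra <= 0;

    my $base = int($extra / $gaps);    # extra spaces every gap receives
    my $rest = $extra % $gaps;         # leftovers for the rightmost gaps
    my $out  = $words[0];
    for my $i (1 .. $gaps) {
        my $pad = 1 + $base + ($i > $gaps - $rest ? 1 : 0);
        $out .= (' ' x $pad) . $words[$i];
    }
    return $out;
}

print justify_full('Now is the Winter of our discontent made', 44), "\n";
print justify_full("the clouds that lour'd upon our house In", 44), "\n";
```

Both lines are four characters short of the 44-column measure and have seven gaps, so the last four gaps in each are doubled, as in the second sample above.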

Even if explicit centering is not specified via the {justify => 'centre'} option, autoformat will automatically detect centered paragraphs and preserve their justification. It does this by examining each line of the paragraph and asking itself: "If this line were part of a centered paragraph, where would the midpoint have been?"

By making the same estimate for every line in the paragraph, and then comparing the estimates, autoformat can deduce whether all the lines are centered with respect to the same axis of symmetry (with an allowance of ±1 to cater for the inevitable integer rounding). If a common axis of symmetry is detected, autoformat assumes that the lines are supposed to remain centered, and automatically switches on center-justification for that paragraph.
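
A rough sketch of that test follows. The looks_centered subroutine is a hypothetical name, and the sketch assumes space-only indentation. It also refuses to call a paragraph centered when every line has the same indent, which is a guard added here, not something the description above states.

```perl
use strict;
use warnings;

# Guess whether a paragraph is centered: every non-blank line must share
# (to within one column) the same axis of symmetry.
sub looks_centered {
    my @axes;       # twice each line's midpoint, to stay in integers
    my %indents;    # distinct indentation widths seen
    for my $line (@_) {
        next unless $line =~ /\S/;
        my ($indent, $body) = $line =~ /^([ ]*)(.*?)\s*$/;
        $indents{ length($indent) }++;
        push @axes, 2 * length($indent) + length($body);
    }
    return 0 if @axes < 2 || keys(%indents) < 2;
    my ($min, $max) = (sort { $a <=> $b } @axes)[0, -1];
    return ($max - $min <= 2) ? 1 : 0;    # midpoints within 1 column
}

my @centered = (
    'Lazy Text Formatting',
    '         by',
    '   Damian Conway',
);
my @flush_left = ('Lazy Text Formatting', 'by', 'Damian Conway');

print 'centered block:   ', looks_centered(@centered),   "\n";
print 'flush-left block: ', looks_centered(@flush_left), "\n";
```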

You can also optionally perform case conversions on the text being processed, using the case option. The alternatives are 'upper', 'lower', 'title', and 'highlight'. Title casing capitalizes the first letter of each word:

    The Strange And Gruesome Case Of The Tab-indented 
    Python.

and highlight casing does the same, except that it ignores trivial words:

    The Strange and Gruesome Case of the Tab-indented 
    Python.

A fifth alternative is {case => 'sentence'}. This mode attempts to produce correctly-cased sentences: first letter in upper-case, subsequent words in lower-case (unless that word is originally in mixed case). For example, the paragraph:

    POVERTY, MISERY, FRIENDLESSNESS, ETC. are ever
    the lot of the VisualBasic hacker. 'tis an
    immutable law of Nature! Whom the GODS would
    DESTROY, they FIRST force to code Word MACROS.

under {case => 'sentence'} becomes:

    Poverty, misery, friendlessness, etc. are ever
    the lot of the VisualBasic hacker. 'Tis an
    immutable law of Nature! Whom the gods would
    destroy, they first force to code Word macros.

Note that autoformat is clever enough to recognize that the period in abbreviations such as etc. is not a sentence terminator, and that the first capitalizable letter of 'tis is the t, and that words like VisualBasic and Nature should retain their existing capitalizations.
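
The two simplest modes, title and highlight casing, can be sketched as follows. The %TRIVIAL stop list and both subroutine names are assumptions made for this example; the module's actual list of trivial words is not given here and is probably longer.

```perl
use strict;
use warnings;

# Hypothetical list of "trivial" words that highlight casing leaves alone.
my %TRIVIAL = map { $_ => 1 } qw(a an and at by for in of on or the to);

# Title case: capitalize the first letter of every whitespace-separated word.
sub title_case {
    my ($text) = @_;
    return join ' ', map { ucfirst(lc($_)) } split ' ', $text;
}

# Highlight case: the same, except that trivial words stay in lower case
# (the first word is always capitalized).
sub highlight_case {
    my ($text) = @_;
    my @words = map { lc($_) } split ' ', $text;
    for my $i (0 .. $#words) {
        $words[$i] = ucfirst $words[$i]
            unless $i > 0 && $TRIVIAL{ $words[$i] };
    }
    return join ' ', @words;
}

my $headline = 'the strange and gruesome case of the tab-indented python.';
print title_case($headline),     "\n";
print highlight_case($headline), "\n";
```

Sentence casing is considerably harder, since it has to know about abbreviations, leading apostrophes, and deliberately mixed-case words, so it is not attempted here.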

Once and future features

There is an endless list of other smart things Text::Autoformat could be extended to do. Here's a short preview of some coming attractions...

Columns. A future release of Text::Autoformat will recognize columns within a paragraph and allow the user to independently control their layout and justification, even under margin adjustments. For example, given:

	Name	Mark	Comment
	====	====	=======
	Pat	99	Unusually high score. Suspect?
	Kim	72	Solid performance
	Leslie	51	Just scraped through this time

you'll be able to call:

    autoformat { justify => ['left', 'centre', 'left'],
                 width  => [undef, undef, 20] };

and produce:

	Name	Mark	Comment
	====	====	=======
	Pat	99	Unusually high
			score. Suspect?
	Kim	72	Solid performance
	Leslie	51	Just scraped through
			this time

Transliteration. autoformat will eventually provide smart 8-to-7 bit transliteration (the way the Text::StripHigh module does now), so that text like:

        • This example's © Erwin Schrödinger,
          Nº42(±1) UnÇertainté Straße, Østland.

could be transformed into this:

        * This example's (c) Erwin Schroedinger,
          No42(+/-1) Uncertainte' Strasse, Ostland.

Mail headers. autoformat was originally developed as a lazy way to clean up incoming and outgoing email. It does that exceptionally well, so long as you keep it away from the headers. Sendmail doesn't take kindly to autoformat's misguided efforts with them:

        To: Jon Orwant
        <orwant@oreilly.com> From:
        damian@conway.org Subject: Re:
        When's the next meeting of the
        Secret Perl Cabal? References:
        <200011100411.PAA17166@indy05-
        .csse.monash.edu.au>

A future version of the module will detect mail headers and either leave them alone or wrap them intelligently.

Mark-up. Another irritation is that autoformat blindly attempts to reformat HTML, pod, Perl code, and many other things it should just ignore. The very next release of Text::Autoformat will have a "leave-it-the-hell-alone" option that causes autoformat to disregard any (non-bulleted) text that is indented. Later versions may also be able to automatically diagnose marked-up sections of text -- and perhaps code examples -- and just magically skip them.

Configurability. Currently, the list of abbreviations and "stop words" that autoformat knows about is fixed, as are the set of quoter characters, and list bullets. This should obviously be user-configurable, and will be in a forthcoming release.

How much would you expect to pay?

Meanwhile, despite these niggles, Text::Autoformat does a remarkably good job at what it was designed for: making ASCII text reformatting as easy as (in)humanly possible.

So you no longer have any excuse for sending email that slops over the margin.

Damian Conway is an autonomous semi-intelligent coding device owned and operated on behalf of the Perl community by Tony Bowden, Marjan Bace, Kit Cosper, Chris DiBona, Randal L. Schwartz, Jon Orwant, Piers D. Cawley, Joel Hall, Stephen Barton, Leon Brocard, Dave Cross, Garrett Goebel, Andy Wardley, Kevin Lenzo, Richard Clamp, Daniel Chetlin, Scott Drassinower, Marcel Grunauer, Greg Mccarroll, Ken McGlothlen, Jasmine Merced, Dr. Karl Kleine, Chris Heller, Bob Badour, Brian Katzung, Robert Partington, James Carter, Collin Starkweather, Kyle Drake, Gregory Marton, Daniel Yacob, Richard Rodger Jostraca, Alan Jaffray, Jeffrey Seifert, Michael Graham, Warren Young, Derek Lane, Clark Cooper, Eric Larson, Tytus Mapp, Paul Sherman, Colin Meyer, Kenneth Robson, Richard Dice, Rand Bamberg, Ramki Balasubramanian, James Lee Evanston, Robin Houston, Martin Heinsdorf, Christopher Taranto, Lindsay Davies, Matt Gittins, Anton Guselnikov, Dan Boorstein, Jon Orwant, Reinhard Engels, Adekunle Olonoh, Michael E. Meyers, Christopher Conrad, Jonathan Stowe, The Long Valley Perl Mongers, Michael King, David Rolsky, Amanda Gilbert, Steven McDougall, Jim Baker, Alex Fiore, Neil Kandalgaonkar, Mark Zweifel, Michael Smith, Matija Grabnar, Stuart McDow, Michael Röschter, Bruno Nicoletti, Kurtis DeMaagd, David Schmitt, Larry Emmett, Darren W. Aldredge, Dean Wilson, Chris Winters, Terry Nightingale, Benjamin Holzman, Jonathan King, Per Jonas BrØmsØ Nielsen, T. Alex Beamish, Kellan Elliott-McCrea, Laurent Julliard, James Donnelly, Paul Trader, Casper Warming, Alvin and Jenna Sim, Mike Lavin, Brad Bowman, Jacob Morzinski, Keith Calvert Ivey, Clinton Pierce, Jim Parker, John Cavanaugh, Kurt von Tiehl, Paul Hamingson, Darin Dugan, Mark Fowler, Hugh Kennedy, John Birney, Scott Cluett, Benjamin Reed, Glenn Maciag, Alex Farber, David Storrs, Philip Newton, Andy Stritof, John Callender, Stray Toaster, Marc Majcher, Marc Kerr, Edward Almasy, Frank J. 
Tobin, Greg Cope, Bruce Winter, Richard Bond, Ian Bach, Robert Blackwell, Nigel Wetters, Steve Rushe, Augie De Blieck Jr, Sanford Redlich, Jon Scarborough, Tom Tarka, David James Morgan, Hampton Maxwell, Rich Gibson, Nic Doye, Briac Pilpre, and Jarrett Alexander.
