In the old days, if you wrote a program to perform data manipulation on some file, there were standard operations that had to be implemented to access the file's data. Your program would have to open the file, read each record and process it, decide what to do with the newly manipulated data, and close the file. Perl doesn't let you avoid any of these steps, but by employing some of Perl's unique features, you can express your programs much more concisely, and they'll be faster, too.
In this article, we'll take a simple task and show how familiarity with Perl idioms can reduce the size and complexity of the solution. Our task is to display the lines of a file that are neither comments nor blank, and here's our first attempt:
    #!/usr/bin/perl -w

    # Obtain filename from the first argument.
    $file = $ARGV[0];

    # Open the file - if it can't be opened, terminate program
    # and print an error message.
    open INFILE, $file or die("\nCannot open file $!\n");

    # For each record in the file, read it in and process it.
    while (defined($line = <INFILE>)) {

        # Grab the first one and two characters of each line.
        $firstchar = substr($line,0,1);
        $firsttwo  = substr($line,0,2);

        # If the line does NOT begin with a #! (we want to see
        # any bang operators) but the first character does begin
        # with a # (we don't want to see any # comments), skip it.
        if ($firsttwo ne "#!" && $firstchar eq "#") { next; }

        # Or, if the line consists of only a newline (i.e. it's
        # a blank line), skip it.
        elsif ($firstchar eq "\n") { next; }

        # Otherwise display the line to standard output (i.e.
        # your terminal).
        else { print $line; }

        # Proceed to next record.
    }

    # When finished processing records, be a good programmer
    # and close the input file.
    close INFILE;

This script works just fine, but it's pretty large: you have to look at a lot of lines to figure out what it does. Let's streamline this code step-by-step until we're left with the bare essentials.
First, while (<>) opens the files provided on the command line and reads input lines without you having to explicitly assign them to a variable. Let's remove the comments and change the Perl script to use this feature.
    #!/usr/bin/perl -w

    while (<>) {
        $firstchar = substr($_,0,1);
        $firsttwo  = substr($_,0,2);
        if ($firsttwo ne "#!" && $firstchar eq "#") { next; }
        elsif ($firstchar eq "\n") { next; }
        else { print $_; }
    }

As each line is read, it is stored in the scalar $_. We changed our calls to substr (which extracts or replaces part of a string) and the print statement to use this internal variable.
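To watch the implicit assignment to $_ in isolation, here is a minimal sketch that reads from an in-memory filehandle rather than from files named on the command line. The sample text and the sub name first_chars are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the first character of every line in $text.
sub first_chars {
    my ($text) = @_;
    my @chars;

    # Open a read-only filehandle on the string itself (an in-memory file).
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

    # Localize $_ so the caller's copy is untouched. The bare <$fh> in
    # the while condition stores each line in $_.
    local $_;
    while (<$fh>) {
        push @chars, substr($_, 0, 1);
    }
    close $fh;
    return @chars;
}

print join(',', first_chars("# comment\nprint 1;\n")), "\n";
```

The same loop body works unchanged with while (<>), which takes its input from the files in @ARGV (or standard input if there are none).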
We can even make the while loop implicit as well. The -n switch wraps your program inside a loop: LINE: while (<>) { your code }.
So we can shorten our little program even more:
    #!/usr/bin/perl -wn

    $firstchar = substr($_,0,1);
    $firsttwo  = substr($_,0,2);
    if ($firsttwo ne "#!" && $firstchar eq "#") { next; }
    elsif ($firstchar eq "\n") { next; }
    else { print $_; }

In Perl, there's more than one way to do nearly anything, even good old conditionals. We can use an alternate form (and the fact that our loop is now named LINE) to rewrite our program with even less punctuation:
    #!/usr/bin/perl -wn

    $firstchar = substr($_,0,1);
    $firsttwo  = substr($_,0,2);
    next LINE if $firsttwo ne "#!" && $firstchar eq "#";
    next LINE if $firstchar eq "\n";
    print $_;

Each 'next LINE' is executed only when the condition after its if is true.
The intermediate variables $firstchar and $firsttwo make sense if they're going to be used repeatedly, but for our program they aren't. Here they only add assignments and a little memory. So let's eliminate them by calling substr directly in the comparisons:
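As a small illustration of the statement-modifier form, the sketch below puts it next to the block form it replaces. The sub name skip_comments and the sample lines are hypothetical, and a labeled for loop stands in for the while loop that -n supplies.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: return the lines that do not start with '#',
# skipping the others with a labeled next.
sub skip_comments {
    my @kept;
    LINE: for my $line (@_) {
        # Block form of the test:
        #     if (substr($line, 0, 1) eq "#") { next LINE; }
        # Modifier form of the same test:
        next LINE if substr($line, 0, 1) eq "#";
        push @kept, $line;
    }
    return @kept;
}

print skip_comments("# note\n", "print 1;\n");
```

The label is optional here because there is only one loop, but naming it makes the target of next explicit.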
    #!/usr/bin/perl -wn

    next LINE if substr($_,0,2) ne "#!" && substr($_,0,1) eq "#";
    next LINE if substr($_,0,1) eq "\n";
    print $_;

Our Perl program is now down to three lines of code (not counting the #! line). By combining the two ifs into one compound if, we can reduce the program to two lines:

    #!/usr/bin/perl -wn

    next LINE if (substr($_,0,2) ne "#!" && substr($_,0,1) eq "#") ||
                 substr($_,0,1) eq "\n";
    print $_;
That 'next LINE' statement won't fit in one column, but as usual There's Always More Than One Way To Shorten It. Using the match (m//) operator, you can construct regular expressions, which determine whether a string matches a pattern. Some simple regular expressions relevant to our task:
    m/^#!/    Does the string begin (^) with '#!'?
    m/^#/     Does the string begin (^) with '#'?
    m/^\n$/   Does the string begin (^) with a newline (\n) and end ($) with it too?

The =~ and !~ operators are used to test whether the pattern on the right applies to the string on the left. $string =~ /^#/ is true if $string begins with a #, and $string !~ /^#/ is true if it doesn't. The program can now be shortened even further:

    #!/usr/bin/perl -wn

    next LINE if ($_ !~ m/^#!/ && $_ =~ m/^#/) || $_ =~ m/^\n$/;
    print $_;
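To try the =~ and !~ operators on their own, here is a minimal sketch that sorts a few sample strings with the three patterns listed earlier. The sub name classify and its labels are invented for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: label one line using the three patterns.
sub classify {
    my ($string) = @_;
    return 'blank'   if $string =~ m/^\n$/;   # nothing but a newline
    return 'code'    if $string !~ m/^#/;     # does not begin with '#'
    return 'shebang' if $string =~ m/^#!/;    # begins with '#!'
    return 'comment';                         # begins with '#' but not '#!'
}

for my $sample ("#!/usr/bin/perl\n", "# note\n", "\n", "print 1;\n") {
    (my $shown = $sample) =~ s/\n/\\n/;       # make the newline visible
    printf "%-20s %s\n", $shown, classify($sample);
}
```

Only the 'comment' and 'blank' lines are the ones our program skips.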
What if there are blank lines with whitespace preceding the newline? Then m/^\n$/ won't match, and the line will be displayed, which isn't what we want to happen. Inside a pattern, Perl can test for a whitespace character with \s, which matches spaces, tabs, carriage returns, form feeds, and newlines. You can also specify how many of something you want with a quantifier. The quantifiers are:
* 0 or more times
+ 1 or more times
? 0 or 1 time
{x,y} at least x but not more than y times
Since we might have any amount of extraneous whitespace, even none, * fits the bill. \s* means zero or more whitespace characters. Added into the matching operator, our program now reads:
    #!/usr/bin/perl -wn

    next LINE if ($_ !~ m/^#!/ && $_ =~ m/^#/) || $_ =~ m/^\s*\n$/;
    print $_;
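To see what \s* buys us, the following sketch runs the old and new blank-line patterns over a few sample lines. The helper names is_blank_strict and is_blank_loose are hypothetical and exist only for the comparison.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helpers for comparing the two blank-line tests.
sub is_blank_strict { my ($line) = @_; return $line =~ /^\n$/    ? 1 : 0; }
sub is_blank_loose  { my ($line) = @_; return $line =~ /^\s*\n$/ ? 1 : 0; }

for my $line ("\n", "   \n", "\t \n", "  x\n") {
    (my $shown = $line) =~ s/\n/\\n/;   # make the newline visible
    $shown =~ s/\t/\\t/g;               # and any tabs
    printf "%-8s strict=%d loose=%d\n", $shown,
        is_blank_strict($line), is_blank_loose($line);
}
```

Both tests agree on a bare newline and on a line containing text; they differ only on lines made up of spaces or tabs.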
Perl often uses $_ as a default variable for its operators. It does this both with pattern matches and print:
    #!/usr/bin/perl -wn

    next LINE if (!m/^#!/ && m/^#/) || m/^\s*\n$/;
    print;

Because our patterns are delimited by slashes, we can also leave off the m in m//:

    #!/usr/bin/perl -wn

    next LINE if (!/^#!/ && /^#/) || /^\s*\n$/;
    print;
We can combine these two lines into one by using unless:
    #!/usr/bin/perl -wn

    print unless !/^#!/ && /^#/ || /^\s*\n$/;
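Dropping the parentheses is safe because ! binds more tightly than &&, which in turn binds more tightly than ||. As a quick check, this sketch wraps the condition in two hypothetical subs, keep_line and keep_line_explicit, and compares them on some made-up lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true if the one-liner would print this line.
sub keep_line {
    local $_ = shift;
    return 0 if !/^#!/ && /^#/ || /^\s*\n$/;
    return 1;
}

# The same test with the grouping spelled out.
sub keep_line_explicit {
    local $_ = shift;
    return 0 if ((!/^#!/) && /^#/) || /^\s*\n$/;
    return 1;
}

for my $line ("#!/usr/bin/perl\n", "# note\n", "  \n", "print 1;\n") {
    die "precedence mismatch"
        unless keep_line($line) == keep_line_explicit($line);
    print $line if keep_line($line);
}
```

Only the #! line and the print statement survive the filter.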
Finally, we can execute this program directly from the command line, with the -e flag. We can even trim the semicolon because it's the last statement of a block.
% perl -wne 'print unless !/^#!/ && /^#/ || /^\s*\n$/'
The result is a script that is starting to look like the others here in TPJ. Once you get used to these idioms, you'll spill out streamlined code like this without thinking. There are probably some Perl hackers out there who will come up with further optimizations to this code.
Have fun!
Art Ramos (aramos@sunyorange.edu) is a Systems Analyst for Orange County Community College in Middletown, New York. He discovered Perl after a Prime system was replaced with an IBM RS6000 forcing him to dive into AIX. Since then, he has been working with both AIX and Linux. He lives in Middletown with his wife Katie and his daughters, Jessica and Allison.