Real World Perl I

Those of you who have ever taken a class in programming, know that many of the class assignments seem to have no real-world applications. The assignment below is one that solved a real problem quickly and easily.

The Problem

We need to create a list of words for a computer game. The words should be grouped by the number of letters in each word. For our purposes, we wanted to collect words that are 4, 5 and 6 letters long. These will be the legal words the player may use to guess the secret word.

The Task

Where to Get a List of Words

On many modern computer systems, such as Unix, Linux, or Mac OS/X (links to Unix, Linux, Mac OS/X) a common set of system files come with the computer. One of these is a dictionary that can be found at /usr/dict/words or /usr/share/dict/words. It is a simple list of English words of varying length sorted in alphabetical order. On some systems the words may be in other languages, but English is the language used on many Unix-like systems.

Find the Word Length

In Perl, as in many computer languages, text words are a collection of characters stored in memory sequentially and often referred to as a string of characters or, simply, a string.  As it turns out, Perl has a handy function, length, which returns the length of a string. So, to find the length of a particular word, we need to read the list of words and get the length of each word. If it matches our criteria of 4, 5 or 6 letters long, we want to remember it.

Starting the Program

The first thing we need to do is to read the dictionary file and get the word we want to examine.  This will be a loop, which reads all the words in the file.



while ( $line = <stdin> ) {
}

The while statement checks a condition and then executes the statements (if any) that appear between the open brace “{” and the close brace “}”. In this example, there are no statements to execute.  The condition is the result of the assignment of data from the file handle called “stdin” (link to file handles, stdin). The stdin file handle is connected to standard input on a Unix system and is the standard place to get input for the majority of programs on Unix systems.  As this is the most common use, Perl allows for a shorthand notation “<>”:



while ( $line = <> ) {
}

Getting the Word

In this example, we now have a line of text in our variable, $line. This is not a word.  We know the input file contains a list of words, one per line. However, when we read a line of input, the text file contains a special end of line character (EOL) or sequence of characters. (See End of Line in the Glossary section for more details). Perl provides two functions, chop and chomp (added in Perl5). The chop function removes the last character from the end of a string of characters, returning the deleted character as the value of function. The chomp function is more selective (and safer than chop) as it only removes the last character if it is an end of line character or sequence.

Our program now becomes:


while ( $line = <> ) {
    chomp( $word = $line );
}


Note that the chomp function surrounds the assignment of $line to $word. This is very important. We want to remove the last character if it is an EOL character, and we want the variable $word to contain the result.  In effect, we are actually assigning a copy of the contents of $line to $word, and then calling the chomp function on $word.

Checking the Length

Again, we only want to keep words that are 4, 5 or 6 letters long. The next step is to compare the length of the word against our criteria. However, notice we want words of three different lengths.  One of the most powerful features of Unix systems is reusability. Instead of writing special code to check for a range of word lengths, and saving the words in separate files, it is better to write code which checks one length and stores those that match in one file, and then reuse the same code with a different length and a different file. So let us work with words of length 4 to start.


while ( $line = <> ) {
    chomp( $word = $line );
     if ( length($word) == 4 ) {
         # save the word
    }
}


Save the Words

Now for the next length, we modify the code to use a variable. Instead of hardcoding the length, we can set the length we want to check and then change the length for each run of the program. Since we are expecting to change the length with each run of the program, we can save the words we want by sending them to the standard output or simply printing them using print. Notice we are adding a newline “\n” when we print it. Of course, if you are using a non-Unix system, Perl should used the correct end of line (link to End of Line in glossary) character(s).


$len = 4;
while ( $line = <> ) {
    chomp( $word = $line );
    if ( length($word) == $len ) {
        print $word, "\n";
    }
}


The program is now complete.


Real World Perl  II

Running the Program

We are working on a program which reads a list of words, one word per line, and creates a new list of words that are 4, 5, or 6 letters long.  The program is complete and can be easily run from the command line by typing:



perl myprogram < /usr/share/dict/words

the results will appear on the terminal, in the terminal window, or your console log depending on what system you are running the program on.  If you want to control where the output goes, type instead:



perl myprogram < /usr/share/dict/words > myoutput

The Unix command line uses the less than “<” and greater than “>” symbols to direct or redirect input and output respectively to a program.  This allows the user to specify the input and output files for a particular program when the program is run, instead of having to hard-code the names in the program each time.  This feature allows programs to be reused with different data without any modifications required to the program.

Making the Program Executable and Reusable

You will notice that there wasn’t any cryptic Unix-like script identification in my program.  That’s because Perl doesn’t require it to run a program.  However, if you want to execute the program directly by double clicking or using the program name as a command from the command line, you have to add the script identifier line at the beginning.  This line varies on some systems and some implementations, but the normal line would be:



#! /usr/bin/perl

This tells Unix-like and some other systems where to find the perl runtime program on your computer. To make this program a little more usable, I like to use one of Perl’s option switches “-s”, especially while I’m testing my programs out, and “-w” to keep me honest and to catch syntax issues. The “-s” option lets me define and set variables on the command line.

Competed Program

My completed program:


#! /usr/bin/perl -ws
if ( ! defined $len ) {
    $len = 4;
}
while ( $line = <> ) {
    chomp( $word = $line );
    if ( length($word) == $len ) {
        print $word, "\n";
    }
}


The changed program checks to see if the $len variable is defined, and if not, sets the default value to 4.  To execute the program, I set the permissions on the file, myprogram, to be executable with the Unix command: chmod



chmod +x myprogram

Then, I enter the commands:



myprogram -len=4 < /usr/share/dict/words > myoutput-4
myprogram -len=5 < /usr/share/dict/words > myoutput-5
myprogram -len=6 < /usr/share/dict/words > myoutput-6

and I have three files, myoutput-4, myoutput-5, and myoutput-6, each containing words that are 4, 5 or 6 letters long.  And the best feature? If I want a list of 3 letter words or 10 letter words, I just execute the program again with a different option.


Real World Perl  III

Extra Credit

Now that we have our lists of words separated by word length, there’s the matter of creating a separate list of secret words from the larger lists of words.  What we want to allow in the secret words is only those words which are all different letters with no duplicates in the same word.  For example, FIRE, APHID or CHROME are acceptable, but FREE, SPOOK, and RECORD are unacceptable.

The Task

Describe the Problem

This is easy. Take a word and count its letters. Check to see if there are any duplicate letters. If we find a duplicate letter, the word cannot be used. Simple for you or I to do. What about a computer program?

How to Find Duplicate Letters

One method is for a program to compare every letter against every other letter. The brute force way of doing this is to take a letter and compare it against all other letters skipping the comparison against itself. A slightly faster method would be to compare each letter against the remaining letters, since the preceding letters have already been compared to it. A third method is to identify each letter by number and enter the numbers into a table and if the same number is entered more than once, you have a duplicate. For short words, either of the first methods work well, and are easy to program.  However, the third option works with words of any length and is the easiest to program.

Compare Every Letter Against Every Other

The first method, is a simple set of two loops, comparing each letter against every other letter. This actually compares each letter against every other letter twice.


01	$duplicate = 0;
02	foreach $i ( 0..(length($word)-1) ) {
03		foreach $j ( 0..(length($word)-1) ) {
04			next if ( $j == $i );
05			if ( substr($word,$i,1) eq substr($word,$j,1) ) {
06				$duplicate = 1;
07			}
08		}
09	}


First we set the variable $duplicate to 0 to indicate not found, and in line 06 we set it to 1 if a duplicate is found. That’s the easy part.  We introduce the foreach construct for a loop using index $i which iterates over the range or set of elements following.  Perl allows you to specify a range of 1 to 5 as (1..5), or n to m as ( n..m ). Arrays and strings are zero (0) based, so we need to compare from letter 0 to letter length-1.  Next we do the same for an inner loop using index $j. Then inside the inner loop we introduce the next function. What next does is break out of the current iteration of a loop. In this instance we call next if the two indices are equal, thus avoiding comparing the same character.  The comparison is obtained by taking each character in the string we want to compare using the substr (sub-string) function.

Compare Every Letter Against Every Other, Once


01	$duplicate = 0;
02	foreach $i ( 0..(length($word)-1) ) {
03		foreach $j ( ($i+1)..(length($word)-1) ) {
04			if ( substr($word,$i,1) eq substr($word,$j,1) ) {
05				$duplicate = 1;
06			}
07		}
08	}


This method is shorter by one whole line!  Because we start the inner loop one character past the current index, we never compare the same letters twice, so we don’t need to check to see if they match.  Otherwise the rest of the method is identical to the first. One thing to watch out for, is to not change the indices inside either loop as this could cause unpredictable results.

Arrays or Hash Tables

Perl comes with several types of data structures, including two built-in types of tables. The first is called an array, and contains consecutive entries. An array is a simple list. You can start the list at any number and end at any higher numeric number, such as 0 to 3, 45 to 99, or -3 to +3. The default in Perl is 0 but can be changed by the user for all arrays or just a single array.  It’s best to leave it at 0 unless you have a good reason to change it.

Using an Array


01	$duplicate = 0;
02	@letters();
03	foreach $i ( ("A"-"A")..("Z"-"A") ) { $letters[$i] = 0; }
04	foreach $i ( 0..(length($word)-1) ) {
05		$letter = substr($word,$i,1);
06		if ( $letters[$letter-"A"] == 1 ) { $duplicate = 1; }
07		$letters[$letter-"A"] = 1;
08	}

Whoa!  Wait, it just looks unusual.  Some of what you see is advanced, but the rest is just documenting what is going on instead of hiding the details.  First, we create an empty array in line 2.  To document what we are doing, we set the first 26 elements of the array to 0.  This corresponds to the range 0..25 as defined by "A"-"A" (0), to "Z"-"A" (25).  This might not work as expected in other alphabets, but works fine for ASCII text, upper and lower case characters.  We are assuming our word list is all upper case for this example. Line 04 begins the real work with a simple loop once per letter in the word. We grab the letter and then use it as the index, offset by subtracting the "A" value again so that "A" will be 0, "B" will be 1, etc. If the array value has been set to 1, we know we've been here before with this letter and can say we found a duplicate. Otherwise we set the array value to 1.  If we complete the loop without a collision, we did not find any duplicates.

Using a Hash Table

The other kind of table in Perl is called a hash table.  This is because it is a special database structure where elements are allocated to positions based on a numerical calculation. This calculation is called a hashing function or just a hash for short. It sounds complicated, but it isn't as far as we're concerned.  Think of it as magic that separates sets of items to keep them separate but easily accessible.  I like to tell people that hash tables allow you to store things as a collection without having to know the exact order they are in the database.  The hash is the index that lets you locate the specific item without having to compare the value against all the others in order to retrieve the information you want.
01	$duplicate = 0;
02	%letters();
03	foreach $i ( 0..(length($word)-1) ) {
04		$letter = substr($word,$i,1);
05		if ( $letters{$letter} == 1 ) { $duplicate = 1; }
06		$letters{$letter} = 1;
07	}


Line 02 defines the hash table.  It looks almost like an array. An array uses the prefix character '@' and a hash table uses the prefix character '%'. In line 05 we check to see if we have already set a value for this letter.  If we have, we found a duplicate.  If not, we proceed to the next line which sets the value for this letter to 1. The rest of the code is identical to that used by an array.

Differences Between  Arrays and Hash Tables

One major difference you see is that I did not have to pre-initialize the hash table. The other is that individual elements are represented using open brace "{" and close brace "}" instead of open bracket "[" and close bracket "]". The most important change is that we do not have to work or think of elements beginning at an index of 0, or n, or any start point.  You can put two items into a hash table with widely different index values and it will take up the same amount of storage, two entries.

The syntax is different, but the has table method is identical to the array method. It is a little cleaner to code and to maintain that code later on. Another good thing about hash tables is that they can hold a significantly large number of items and quickly get to the necessary data quickly.  This means little when you use it for a 5 letter table, but when you are working with 500,000 items it is significant.


Real World Perl  IV

Putting It All Together

To recap, we've been working on a program to read a list of words and pull out the words that are a specified length.  For my purposes they were 4, 5 and 6 characters long. As an extra credit assignment, we wanted to extract those words which contained no duplicate characters.  For example, FREE is not acceptable because it has the letter "E" twice, but FIRE is acceptable.


#! /usr/bin/perl -ws
if ( ! defined $len ) { $len = 4; }
while ( $line = <> ) {
    chomp( $word = $line );
    if ( length($word) == $len ) {
        tr/a-z/A-Z/;
        if ( defined $secret ) {
            $duplicate = 0;
            %letters();
            foreach $i ( 0..(length($word)-1) ) {
                $letter = substr($word,$i,1);
                if ( $letters{$letter} == 1 ) {
                    $duplicate = 1;
                }
                $letters{$letter} = 1;
            }
            if ( $duplicate != 1 ) {
                print $word, "\n";
            }
        }
        else {
            print $word, "\n";
        }
    }
}


We also made an assumption that the words in the dictionary file were upper case in the earlier examples on how to detect duplicate letters.  In our final version above, we added the tr statement which translates any occurrences of the first set or range of characters with those in the second set or range. This translation converts any lower case letters into upper case letters; (a-z) becomes (A-Z).

Instead of writing two completely different programs, I decided to combine them, so I could use the original set of words and the same program to produce the list of usable guess words and the usable list of secret words.

Running the Program

To run the program, we simply use commands similar to the original program:



myprogram -len=4 < /usr/share/dict/words > myoutput-4
myprogram -len=4 -secret < /usr/share/dict/words > secret-4
myprogram -len=5 < /usr/share/dict/words > myoutput-5
myprogram -len=5 -secret < /usr/share/dict/words > secret-5
myprogram -len=6 < /usr/share/dict/words > myoutput-6
myprogram -len=6 -secret < /usr/share/dict/words > secret-6

The -secret option to myprogram defines the variable $secret and the program merely tests for whether $secret is defined before executing the duplicate letters test. If the variable is not defined, the test is not made and the any word matching the word length is written to the output.  if the variable is defined, the test is made, and only those words with unique, non-duplicating letters are written to the output file.

Personal Comments

My own implementation of this program was slightly more cryptic, by combining some elements such as using "++" to increment a counter instead of setting it to 1, using the substring value of $word directly instead of using a separate $letter variable, and an added test to eliminate words in the various dictionaries I used that contained numbers in front of some words.  I also prefer the constructs:



statement if condition;
$variable = value if $field == $test;

instead of



if ( condition ) { statement; }
if ( $field == $test ) { $variable = value; }

But, that is a personal preference, and it actually make it harder to read the code if you are not used to using those conventions or that style. Just for reference, my original program is included below:


#!  /usr/bin/perl -sw
$len = 5 if ! defined( $len );
 #print "Len = $len\n" if $debug;
while ( $line = <> ) {
    $line =~ s/\r\n//g;
    $f = substr($line,0,1);
    #	print "F $f\n" if $debug;
    next if (($f ge "0") && ($f le "9"));
    chomp($line);
    #	print "Line: $line ",length($line)," $len\n" if $debug;
    chomp($line);
    #	print "Line: $line ",length($line)," $len\n" if $debug;
    if ( length($line) == $len ) {
        %chars = ( );
        for $i (0..length($line)-1) {
        #		print "<",substr($line,$i,1),"> " if $debug;
            last if $chars{substr($line,$i,1)}++;
        }
        #		print "\n" if $debug;
        $ch = keys %chars;
        print $line,"\n" if $ch == $len;
    }
}


I also tend to leave debugging code in many of my quick programs, but comment them out when I'm finished, just so perl doesn't try to interpret them unnecessarily. The final dictionary used for this project contained over 17,000 5-letter words and over 7,000 secret words. I also used two constructs that we did not discuss in my recreation of this project. I use global substitution to remove some line endings, and instead of keeping a separate $duplicate variable, I check the number of unique characters found in the word by comparing the number of keys in my data array against the number of characters in the word.



$line =~ s/\r\n//g;
$ch = keys %chars;
print $line,"\n" if $ch == $len;

Just remember, "There's More Than One Way To Do It!"


Real World Perl  IV

Glossary

Array

A sequential list of items.  In Perl the items can be individual numbers, strings, or even references to other arrays, hash tables, and user-defined class objects. The main structure of an array is simply a list which can be sequentially read or written or indexed directly to the nth element.

Hash Table

An in-memory data structure which acts like a simple relational database to allow quick access to specific data storage indexed by keys instead of direct sequential location. More powerful than an Array because retrieval is based on a calculation performed on the key to quickly locate a data element instead of searching sequentially through a list.  An array can be used as the final storage for a hash table.  That is how hash tables were originally created in other languages. Perl has them built-in to the syntax of the language.

EOL, End of Line

The end of line character (EOL) is different on different computers. Unix systems use the ASCII (link to definition of ASCII) character LF (line feed), also known as NL (new line) and depicted as "\n". MS-DOS and Windows (links to each) use two characters, a carriage return followed by a line feed CR+LF, or CRLF. Mac OS/X uses a single carriage return CR or a newline NL depending on the program. In the 1982 movie, Tron, the MCP or Master Control Program ended all of its conversations with the phrase "End of Line". (link to IMDB for Tron info, also link to Tron Wiki for quote about MCP using End Of Line; verify no about.com links can be used before linking offsite.)

CR, carriage return

The carriage return character is named after the function on typewriters where the operator pushed a lever to reset the position of the carriage to the return position so the operator could enter another line of text.  In ASCII, this is the character represented by octal '015', decimal 13, or hexadecimal 0x0D. It can be entered on most keyboards as control-M, ^M, or CTRL-M by holding down the control key and the M key simultaneously. However, some systems will translate this as the RETURN key and translate it to newline or the correct EOL sequence for that system. This is the end of line character on the Apple Macintosh MacOS operating systems prior to System 10.0 or OS/X.

CRLF, CR+LF, carriage-return line-feed

The combination of the characters CR (carriage return) and LF (linefeed) appearing together sequentially is called a CRLF. This is the end of line sequence of characters for many computer systems including MS-DOS, PC-DOS, TOPS-10, TOPS-20, and Windows. Some other systems use LF followed by CR, or LFCR, but it is rare.

LF, linefeed, line feed

The line feed character is named after the function to feed the paper up one line while maintaining the current position on the paper. This function exists on manual typewriters and many printing devices. In ASCII, this is the character represented by octal '012', decimal 10, or hexadecimal 0x0A. It can be entered directly on many keyboards as control-J, ^J or CTRL-J by holding down the control key and the J key simultaneously. This is the end of line character on most variants of Unix or Linux.

NL, newline, new line

A newline character, taken from the line feed function, but in computer terms it also resets the carriage or other mechanisms to be ready for the next new line of text. In ASCII, this is the character represented by octal '012', decimal 10, or hexadecimal 0x0A. This is the end of line character on most variants of Unix or Linux.