Thursday 26 December 2019

Using code to find all the anagrams in a book.

While reading "War and Peace" by Leo Tolstoy the other day I wondered how many anagrams were in the book. Well that's not quite true I was actually reading the classic CS book "Programming Pearls" by Jon Bentley and was intrigued by an early gem described in the book. Thinking about how to solve a problem before leaping into a programming solution can have great run-time and resource benefits. An example was finding anagrams in a larger text and books don't come much bigger than War and Peace by Leo Tolstoy.

Materials

The electronic text for many older books can be found over at the excellent Gutenberg Project. Amongst the collection is War and Peace in epub, HTML and plain text formats. The text version was downloaded and reviewed to remove some foreign accented characters. The code below uses perl 5, version 22, subversion 1 (v5.22.1) on a Mac, but other text processing languages are available.

How big is this task?

War and Peace is a big book; when printed it runs to 1440 pages. In electronic form the book holds about as much data as a medium resolution picture: roughly 3.2 million characters ( including spaces and punctuation ).

$ wc ../../WarAndPeace/WarAndPeace_PlainText.txt
   64656  563405 3208085 ../../WarAndPeace/WarAndPeace_PlainText.txt
   |      |      |
   Lines  Words  Characters ( inc spaces )

This is easily within the scope of an in-memory solution.

How to solve

At first glance a simplistic approach of "Store every word and then compare each word against every other to see if they are anagrams" comes to mind. However, this could involve comparing each of 563,403 words with 563,403 others, making a total of 563,403 ^ 2 = 317,422,940,409: that's over 300 billion comparisons. At a million comparisons a second that is

317,422,940,409 / 1,000,000 / 60 / 60 ≈ 88.2 hours.
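
For the curious, here is a minimal sketch of that naive approach ( the is_anagram helper and the short word list are illustrative, not part of the final program ):

# Naive pairwise check: tally the letters of one word, untally the other.
my @words = qw(stone notes live evil tones vile);

sub is_anagram {
    my ($x, $y) = @_;
    return 0 if length($x) != length($y);    # different lengths never match
    my %tally;
    $tally{$_} += 1 for split //, $x;        # count letters of the first word
    $tally{$_} -= 1 for split //, $y;        # cancel letters of the second word
    return !grep { $_ != 0 } values %tally;  # anagrams iff every tally is zero
}

# Compare every word against every other word - the 88 hour plan.
for my $i (0 .. $#words) {
    for my $j ($i + 1 .. $#words) {
        print "$words[$i] / $words[$j]\n" if is_anagram($words[$i], $words[$j]);
    }
}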

Even a moderately parallel approach would not help much, though it would be usable as each anagram-check micro task is independent of the others. Just remembering which words have been checked before provides a great saving: there are only 17,523 unique words in the tome, requiring about 307 million comparisons.

17523 * 17523 = 307,055,529

This simple approach also ignores the useful clues we know about words that are anagrams of each other.
  1. The words must have the same number of letters. Longer or shorter words can be eliminated from comparison.
  2. All actual letters provided must form part of the anagram. Rare letters (Q Z X Y K etc.) in the clue must be in the answer.
  3. The subject matter of the anagrams is sometimes provided.
Rule 3 gives us little help as this is not a linguistic task.

Rule 1 and common sense suggest that the book could be split up into lists of unique words of the same length, with just those words compared against each other.

This would reduce the number of comparisons from 307 million to about 36 million, easily achievable in a short time on any computer.
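
As a rough sketch ( %byLength and %unique are illustrative names, not structures from the final program ), the bucketing takes only a few lines:

# Bucket the unique words by length so only same-length words are compared.
my %byLength;
foreach my $w (keys %unique) {
    push @{ $byLength{ length $w } }, $w;  # e.g. $byLength{5} holds all 5-letter words
}
# The pairwise anagram check then only needs to run inside each bucket.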

#Results Word length table text
#Length = Count
0 = 1
1 = 35
2 = 136
3 = 430
4 = 1280
5 = 1912
6 = 2527
7 = 2876
8 = 2697
9 = 2191
10 = 1509
11 = 948
12 = 533
13 = 285
14 = 133
15 = 52
16 = 9
17 = 4
18 = 1
136 * 136 + 430 * 430 + 1280 * 1280 + 1912 * 1912 + 2527 * 2527 + 2876 * 2876 + 2697 * 2697 + 2191 * 2191 + 1509 * 1509 + 948 * 948 + 533 * 533 + 285 * 285 + 133 * 133 + 52 * 52 + 9 * 9 + 4 * 4 + 1 * 1 +  0 = 35790525

Total Unique Words Count = 17523

The 0 and 1 length words listed above are just fragments of punctuation or unique single letters that cannot be anagrams and can all be ignored.

Rule 2 gives us the best hint for an even more efficient solution that could be used for larger books or collections of books. The quickest solution would compare each unique word only once to see if it is an anagram of any other word in the book. All the unique words in the book are already stored in memory to enable repeated comparisons, so why not use the index to that word structure to filter and isolate words that are anagrams of each other? Storing the words in a dictionary-like structure, under headings with the same word length, rare letters and count of each letter, would drastically reduce the number of comparisons required.

Simplifying the anagram hunter to a general solution: take each word, reorder its letters alphabetically, and use the result as a "key" to group all words with the same generated key together. The key is itself a new anagram of the word, with the same number and frequency of letters. Any other word with the same number and frequency of letters will generate the same key and will therefore cluster together with its matching anagrams.

Example
The words "notes" and "stone" when split into letters, sorted alphabetically both become "enost". When any word generates the same key "enost" another anagram of "notes" and "stone" will have been found. Later in the book "tones" appears, which is then added to the anagram sub-list.

Key
|       count of words with same key
|       |   | List of matching anagrams of each other
enost = 3 = tones,notes,stone
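
You can try the key generation at the command line:

$ perl -e 'print join("", sort split //, $_), "\n" for qw(notes stone tones)'
enost
enost
enost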

To make this process work, all punctuation is stripped from the book and replaced with a space character, and all letters are converted to lower case before making the key. Keeping a counter for each word as we process them also allows a concordance to be generated.
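
For example, the tidy-up substitution used in the program below turns a made-up sample line into plain lower-case words ( note the leading space, which is where the length 0 "word" in the table above comes from ):

$ perl -e '$_ = q{"Well, Prince--so it begins!"}; s/[\W ]+/ /g; print lc, "\n"'
 well prince so it begins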

Results

#Results Anagrams found in text
#Key = count = words
aelst = 5 = least,stale,steal,tales,stael
opst = 5 = post,stop,spot,pots,tops
acers = 5 = cares,scare,sacre,acres,races
adeerr = 4 = reread,reared,dearer,reader
aemn = 4 = mean,name,mane,amen
adegnr = 4 = ranged,danger,garden,grande
eilv = 4 = live,vile,veil,evil
eilms = 4 = smile,miles,limes,slime
aelp = 4 = pale,leap,plea,peal
.....snip


#Results frequent words found in text
#Word = Count
the = 34550
and = 22234
to = 16675
of = 14895
a = 10554
he = 10001
in = 8979
that = 8190
his = 7984
was = 7359
..... snip





Code


# Mon 23 Dec 2019 23:53:32 GMT
# find the anagrams from lines of things
# Prog Pearls 2.1C
# Only keep Unique words 

sub mkey {
    #make key of characters in word in alpha order
    return join "",sort split //,$_[0];
}

$maxL =0;
while (<>) {
    s/[\W ]+/ /g;       # Tidy up text and drop punctuation
    foreach $d ( split (/ /) ) {  #get each word on line
        $d = lc $d;         #  make lower case
        $k = &mkey($d); #  make key from word
        $done{$d} +=1;      #  Count the number of times we have seen this word 
        if ($done{$d} == 1) {   # This is a new word
            $U +=1 ;            # count the number of unique words
            $kcnt{$k} +=1;      # count number of words with this key
            $wds{$k} .= $d."!"; # Save this word in list of words with same anagram key
            $lengthTable[length($d)] += 1; #Count unique words by length
            if ( length($d) > $maxL ) { $maxL = length($d); }  # track the longest word seen
        }
    }
}

# In number of anagrams found order
print "#Results Anagrams found in text\n#Key = count = words\n";
foreach $i (sort { $kcnt{$b} <=> $kcnt{$a} } keys %wds) {
    if ($kcnt{$i} > 1) {
        print "$i = $kcnt{$i} = ",join ",",split "!",$wds{$i};
        print "\n";
    }
}

# in most found order
print "#Results frequent words found in text\n#Word = Count\n";
foreach $i ( sort  { $done{$b} <=> $done{$a} } keys %done ) {
    print "$i = $done{$i}\n";
}

#Length table
print "#Results Word length table text\n#Length = Count\n";
foreach $i ( 0..$maxL ) {
    print ("$i = $lengthTable[$i]\n");
}
foreach $i ( 2..$maxL ) {
    if ($lengthTable[$i] > 0) {
        print ("$lengthTable[$i] * $lengthTable[$i] + ");
        $T += $lengthTable[$i] * $lengthTable[$i];
    }
}

print (" 0 = $T\nTotal Unique Words Count = $U\n");



Explanation


In Perl, scalar variables start with a $ and dictionary ( hash ) structures start with a %. A single entry in a dictionary is addressed as $DictionaryName{ $key }. The +=1 phrase adds one to the variable preceding it. Let's look at just the interesting lines to get a flavour of how the program generates the answers.
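
A tiny standalone illustration of the syntax:

my %seen;              # an empty dictionary ( hash )
$seen{"stone"} += 1;   # creates the entry for key "stone" with value 1
$seen{"stone"} += 1;   # seen again, value is now 2
print $seen{"stone"};  # prints 2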

There are three main dictionary structures in the program, each of which holds a collection of key and content pairs:



Name   Key                       Value
done   A single word $d          A number incremented each time that word is seen.
kcnt   Key generated from word   A number incremented each time a word generates this key.
wds    Key generated from word   List of words that all have the same key: anagrams of each other.

Making the key

sub mkey {
    return join "",sort split //,$_[0];
}

Working from the end of the line towards the return value.

$_[0]     is the word to make the key from
split //  slices the word into a list of characters
sort      puts the list into alphabetical order
join ""   creates the key returned from the sorted list of characters

Inside the main foreach loop, when a unique word is found:

$d                              is the current word
$k                              is the generated key
$done{$d} += 1;                 Adds 1 to the "done" dictionary used to hold the count of times each word is seen
$lengthTable[length($d)] += 1;  Counts the unique words of the same length into lengthTable
$wds{$k} .= $d."!";             Appends the word and an "!" to the list of all the words in %wds that have the same key
$kcnt{$k} += 1;                 Adds 1 to the dictionary used to hold the counts of words with the same $k key

Reporting out the results


The anagram lists held in %wds ( with their counts in %kcnt ) and the word frequency dictionary %done are printed using very similar loops.



foreach $i (sort { $kcnt{$b} <=> $kcnt{$a} } keys %wds) { print... }


The parts of the output loop are:

keys %wds                          Generates the list of keys in the dictionary of stored words %wds
sort { $kcnt{$b} <=> $kcnt{$a} }   Sorts that list of keys using a code fragment to compare the counts held against each key
foreach $i                         Processes each entry in the now sorted list



Finally, the word length table and the total unique word count are printed.


Exercises 

Can you update the program to save and output the words found by length? The longest words found are:


16 = circumstantially, disillusionments, enthusiastically, incomprehensible, irresponsibility, melodramatically, misunderstanding, responsibilities, superciliousness
17 = contemporaneously, misunderstandings, superstitiousness, unapproachability

18 = characteristically
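
One way to tackle this is to keep the words themselves alongside the counts. A sketch ( @lengthWords is a new illustrative structure, not in the program above ):

# In the unique-word branch, next to the $lengthTable count:
push @{ $lengthWords[ length($d) ] }, $d;

# After the length table, print the longest word lists:
foreach $i ( 16..$maxL ) {
    print "$i = ", join( ", ", sort @{ $lengthWords[$i] } ), "\n";
}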

Look at the shortest words found and improve the initial text processing of each line to reduce the noise in the answers.

Are there any words split across lines using hyphenation? If so, fix this.


