Thursday 26 December 2019

Using code to find all the anagrams in a book.

While reading "War and Peace" by Leo Tolstoy the other day I wondered how many anagrams were in the book. Well that's not quite true I was actually reading the classic CS book "Programming Pearls" by Jon Bentley and was intrigued by an early gem described in the book. Thinking about how to solve a problem before leaping into a programming solution can have great run-time and resource benefits. An example was finding anagrams in a larger text and books don't come much bigger than War and Peace by Leo Tolstoy.

Materials

The electronic text for many older books can be found over at the excellent Gutenberg Project. Amongst the collection is War and Peace in epub, HTML and plain text formats. The text version was downloaded and reviewed to remove some foreign accented characters. The code below uses perl 5, version 22, subversion 1 (v5.22.1) on a Mac, but other text processing languages are available.

How big is this task?

War and Peace is a big book; when printed it runs to 1440 pages. In electronic form the book holds about as much data as a medium resolution picture: roughly 3.2 million characters ( including spaces and punctuation ).

$ wc ../../WarAndPeace/WarAndPeace_PlainText.txt
   64656  563405 3208085 ../../WarAndPeace/WarAndPeace_PlainText.txt
   |      |      |
   Lines  Words  Characters ( inc spaces )

This is easily within the scope of an in-memory solution.

How to solve

At first glance a simplistic approach of "Store every word and then compare each word against every other to see if they are anagrams" comes to mind. However, this could involve comparing each of 563,403 words with 563,403 others, making a total of 563,403 ^ 2 = 317,422,940,409: that's over 300 billion comparisons. At a million comparisons a second that is

317,422,940,409 / 1,000,000 / 60 / 60 ≈ 88.2 hours.
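
For the curious, here is a minimal sketch of that naive approach ( the is_anagram helper and the short word list are illustrative, not part of the final program ):

# Naive pairwise check: tally the letters of one word, untally the other.
my @words = qw(stone notes live evil tones vile);

sub is_anagram {
    my ($x, $y) = @_;
    return 0 if length($x) != length($y);    # different lengths never match
    my %tally;
    $tally{$_} += 1 for split //, $x;        # count letters of the first word
    $tally{$_} -= 1 for split //, $y;        # cancel letters of the second word
    return !grep { $_ != 0 } values %tally;  # anagrams iff every tally is zero
}

# Compare every word against every other word - the 88 hour plan.
for my $i (0 .. $#words) {
    for my $j ($i + 1 .. $#words) {
        print "$words[$i] / $words[$j]\n" if is_anagram($words[$i], $words[$j]);
    }
}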

Even a moderately parallel approach would not help much, though it would be usable as each anagram-check micro task is independent of the others. Just remembering which words have been checked before provides a great saving: there are only 17,523 unique words in the tome, requiring about 307 million comparisons.

17523 * 17523 = 307,055,529

This simple approach also ignores the useful clues we know about words that are anagrams of each other.
  1. The words must have the same number of letters. Longer or shorter words can be eliminated from comparison.
  2. All actual letters provided must form part of the anagram. Rare letters (Q Z X Y K etc.) in the clue must be in the answer.
  3. The subject matter of the anagrams is sometimes provided.
Rule 3 gives us little help as this is not a linguistic task.

Rule 1 and common sense suggest that the book could be split up into lists of unique words of the same length, with just those words compared against each other.

This would reduce the number of comparisons from 307 million to about 36 million, easily achievable in a short time on any computer.
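
As a rough sketch ( %byLength and %unique are illustrative names, not structures from the final program ), the bucketing takes only a few lines:

# Bucket the unique words by length so only same-length words are compared.
my %byLength;
foreach my $w (keys %unique) {
    push @{ $byLength{ length $w } }, $w;  # e.g. $byLength{5} holds all 5-letter words
}
# The pairwise anagram check then only needs to run inside each bucket.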

#Results Word length table text
#Length = Count
0 = 1
1 = 35
2 = 136
3 = 430
4 = 1280
5 = 1912
6 = 2527
7 = 2876
8 = 2697
9 = 2191
10 = 1509
11 = 948
12 = 533
13 = 285
14 = 133
15 = 52
16 = 9
17 = 4
18 = 1
136 * 136 + 430 * 430 + 1280 * 1280 + 1912 * 1912 + 2527 * 2527 + 2876 * 2876 + 2697 * 2697 + 2191 * 2191 + 1509 * 1509 + 948 * 948 + 533 * 533 + 285 * 285 + 133 * 133 + 52 * 52 + 9 * 9 + 4 * 4 + 1 * 1 +  0 = 35790525

Total Unique Words Count = 17523

The 0 and 1 length words listed above are just fragments of punctuation or unique single letters that cannot be anagrams and can all be ignored.

Rule 2 gives us the best hint for an even more efficient solution that could be used for larger books or collections of books. The quickest solution would compare each unique word only once to see if it is an anagram of any other word in the book. All the unique words in the book are already stored in memory to enable repeated comparisons, so why not use the index to that word structure to filter and isolate words that are anagrams of each other? Storing the words in a dictionary-like structure, under headings with the same word length, rare letters and count of each letter, would drastically reduce the number of comparisons required.

Simplifying the anagram hunter to a general solution: take each word, reorder its letters alphabetically, and use the result as a "key" to group all words with the same generated key together. The key is itself a new anagram of the word, with the same number and frequency of letters. Any other word with the same number and frequency of letters will generate the same key and will therefore cluster together with its matching anagrams.

Example
The words "notes" and "stone" when split into letters, sorted alphabetically both become "enost". When any word generates the same key "enost" another anagram of "notes" and "stone" will have been found. Later in the book "tones" appears, which is then added to the anagram sub-list.

Key
|       count of words with same key
|       |   | List of matching anagrams of each other
enost = 3 = tones,notes,stone
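
You can try the key generation at the command line:

$ perl -e 'print join("", sort split //, $_), "\n" for qw(notes stone tones)'
enost
enost
enost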

To make this process work, all punctuation is stripped from the book and replaced with a space character, and all letters are converted to lower case before making the key. Keeping a counter for each word as we process them also allows a concordance to be generated.
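
For example, the tidy-up substitution used in the program below turns a made-up sample line into plain lower-case words ( note the leading space, which is where the length 0 "word" in the table above comes from ):

$ perl -e '$_ = q{"Well, Prince--so it begins!"}; s/[\W ]+/ /g; print lc, "\n"'
 well prince so it begins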

Results

#Results Anagrams found in text
#Key = count = words
aelst = 5 = least,stale,steal,tales,stael
opst = 5 = post,stop,spot,pots,tops
acers = 5 = cares,scare,sacre,acres,races
adeerr = 4 = reread,reared,dearer,reader
aemn = 4 = mean,name,mane,amen
adegnr = 4 = ranged,danger,garden,grande
eilv = 4 = live,vile,veil,evil
eilms = 4 = smile,miles,limes,slime
aelp = 4 = pale,leap,plea,peal
.....snip


#Results frequent words found in text
#Word = Count
the = 34550
and = 22234
to = 16675
of = 14895
a = 10554
he = 10001
in = 8979
that = 8190
his = 7984
was = 7359
..... snip





Code


# Mon 23 Dec 2019 23:53:32 GMT
# find the anagrams from lines of things
# Prog Pearls 2.1C
# Only keep Unique words 

sub mkey {
    #make key of characters in word in alpha order
    return join "",sort split //,$_[0];
}

$maxL =0;
while (<>) {
    s/[\W ]+/ /g;       # Tidy up text and drop punctuation
    foreach $d ( split (/ /) ) {  #get each word on line
        $d = lc $d;         #  make lower case
        $k = &mkey($d); #  make key from word
        $done{$d} +=1;      #  Count the number of times we have seen this word 
        if ($done{$d} == 1) {   # This is a new word
            $U +=1 ;            # count the number of unique words
            $kcnt{$k} +=1;      # count number of words with this key
            $wds{$k} .= $d."!"; # Save this word in list of words with same anagram key
            $lengthTable[length($d)] += 1; #Count unique words by length
            if ( length($d) > $maxL ) { $maxL = length($d); }  # track the longest word seen
        }
    }
}

# In number of anagrams found order
print "#Results Anagrams found in text\n#Key = count = words\n";
foreach $i (sort { $kcnt{$b} <=> $kcnt{$a} } keys %wds) {
    if ($kcnt{$i} > 1) {
        print "$i = $kcnt{$i} = ",join ",",split "!",$wds{$i};
        print "\n";
    }
}

# in most found order
print "#Results frequent words found in text\n#Word = Count\n";
foreach $i ( sort  { $done{$b} <=> $done{$a} } keys %done ) {
    print "$i = $done{$i}\n";
}

#Length table
print "#Results Word length table text\n#Length = Count\n";
foreach $i ( 0..$maxL ) {
    print ("$i = $lengthTable[$i]\n");
}
foreach $i ( 2..$maxL ) {
    if ($lengthTable[$i] > 0) {
        print ("$lengthTable[$i] * $lengthTable[$i] + ");
        $T += $lengthTable[$i] * $lengthTable[$i];
    }
}

print (" 0 = $T\nTotal Unique Words Count = $U\n");



Explanation


In Perl, scalar variables start with a $ and dictionary ( hash ) structures start with a %. A single entry in a dictionary is addressed as $DictionaryName{ $key }. The +=1 phrase adds one to the variable preceding it. Let's look at just the interesting lines to get a flavour of how the program generates the answers.
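
A tiny standalone illustration of the syntax:

my %seen;              # an empty dictionary ( hash )
$seen{"stone"} += 1;   # creates the entry for key "stone" with value 1
$seen{"stone"} += 1;   # seen again, value is now 2
print $seen{"stone"};  # prints 2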

There are three main dictionary structures in the program, each of which holds a collection of key and content pairs:



Name   Key                       Value
done   A single word $d          A number incremented each time that word is seen.
kcnt   Key generated from word   A number incremented each time a word generates this key.
wds    Key generated from word   List of words that all have the same key: anagrams of each other.

Making the key

sub mkey {
    return join "",sort split //,$_[0];
}

Working from the end of the line towards the return value.

$_[0]     is the word to make the key from
split //  slices the word into a list of characters
sort      puts the list into alphabetical order
join ""   creates the key returned from the sorted list of characters

Inside the main foreach loop, when a unique word is found:

$d                              is the current word
$k                              is the generated key
$done{$d} += 1;                 Adds 1 to the "done" dictionary used to hold the count of times each word is seen
$lengthTable[length($d)] += 1;  Counts the unique words of the same length into lengthTable
$wds{$k} .= $d."!";             Appends the word and an "!" to the list of all the words in %wds that have the same key
$kcnt{$k} += 1;                 Adds 1 to the dictionary used to hold the counts of words with the same $k key

Reporting out the results


The anagram lists held in %wds ( with their counts in %kcnt ) and the word frequency dictionary %done are printed using very similar loops.



foreach $i (sort { $kcnt{$b} <=> $kcnt{$a} } keys %wds) { print... }


The parts of the output loop are:

keys %wds                          Generates the list of keys in the dictionary of stored words %wds
sort { $kcnt{$b} <=> $kcnt{$a} }   Sorts that list of keys using a code fragment to compare the counts held against each key
foreach $i                         Processes each entry in the now sorted list



Finally, the word length table and the total unique word count are printed.


Exercises 

Can you update the program to save and output the words found by length? The longest words found are:


16 = circumstantially, disillusionments, enthusiastically, incomprehensible, irresponsibility, melodramatically, misunderstanding, responsibilities, superciliousness
17 = contemporaneously, misunderstandings, superstitiousness, unapproachability

18 = characteristically
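
One way to tackle this is to keep the words themselves alongside the counts. A sketch ( @lengthWords is a new illustrative structure, not in the program above ):

# In the unique-word branch, next to the $lengthTable count:
push @{ $lengthWords[ length($d) ] }, $d;

# After the length table, print the longest word lists:
foreach $i ( 16..$maxL ) {
    print "$i = ", join( ", ", sort @{ $lengthWords[$i] } ), "\n";
}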

Look at the shortest words found and improve the initial text processing of each line to reduce the noise in the answers.

Are there any words split across lines using hyphenation? If so, fix this.


