Thursday 26 December 2019

Using code to find all the anagrams in a book.

While reading "War and Peace" by Leo Tolstoy the other day I wondered how many anagrams were in the book. Well that's not quite true I was actually reading the classic CS book "Programming Pearls" by Jon Bentley and was intrigued by an early gem described in the book. Thinking about how to solve a problem before leaping into a programming solution can have great run-time and resource benefits. An example was finding anagrams in a larger text and books don't come much bigger than War and Peace by Leo Tolstoy.

Materials

The electronic text for many older books can be found over at the excelent  Guttenberg project. Amongst the collection is War and Peace in epub, html and text format. The text version was downloaded and reviewed to remove some foreign accented characters. Using perl 5, version 22, subversion 1 (v5.22.1) on a Mac but other text processing languages are available.

How big is this task ?

War and peace is a big book, when printed runs to 1440 pages. In electronic form the book has about as much data as a medium resolution picture about 3.5 million characters ( including spaces and punctuation )

$ wc ../../WarAndPeace/WarAndPeace_PlainText.txt
   64656  563405 3208085 ../../WarAndPeace/WarAndPeace_PlainText.txt
   |      |Words |
   Lines         Characters ( inc spaces )

This is easily within scope of an in memory solution.

How to solve

At first glance a simplistic approach of "Store every word and then compare each word against each other to see if they are anagrams" comes to mind. How ever this could involve comparing over 563,403 words with 563,403 others making a total of 563,403 ^ 2 = 317,422,940,409 thats over 300 trillion comparisons.  At a million comparisons a second that is

317422940409 /1000000 /60 /60 = 88.1 hours.

Even using a moderately parallel approach would not help much but would be usable as each anagram check micro task is independent of others. Just remembering which words have been checked before would provide a great saving as there are just 17,523 unique words in the tome making 307 million comparisons required.

17523 * 17523 = 307,055,529

This simple approach also ignores the useful clues we know about words that are anagrams of each other.
  1. The words must have the same number of letters. Longer or shorter words can be eliminated from comparison.
  2. All actual letters provided must form part of the anagram. Rare letters (Q Z X Y K etc.) in the clue must be in the answer.
  3. The subject matter of the anagrams is sometimes provided.
Rule 3 gives us little help as this is a not a linguistic task.

Rule 1 and common sense suggests that the book could be split up into lists of unique words of the same length and just those words compared with each other.

This would reduce the number of comparisons from 307 million to about 35 million easily achievable in a short time on any computer.

#Results Word length table text
#Length = Count
0 = 1
1 = 35
2 = 136
3 = 430
4 = 1280
5 = 1912
6 = 2527
7 = 2876
8 = 2697
9 = 2191
10 = 1509
11 = 948
12 = 533
13 = 285
14 = 133
15 = 52
16 = 9
17 = 4
18 = 1
136 * 136 + 430 * 430 + 1280 * 1280 + 1912 * 1912 + 2527 * 2527 + 2876 * 2876 + 2697 * 2697 + 2191 * 2191 + 1509 * 1509 + 948 * 948 + 533 * 533 + 285 * 285 + 133 * 133 + 52 * 52 + 9 * 9 + 4 * 4 + 1 * 1 +  0 = 35790525

Total Unique Words Count = 17523

The 0 and 1 length words listed above are just fragments of punctuation or unique single letters that cannot be anagrams and can all be ignored.

Rule 2 gives us the best hint for an even more efficient solution that could be used for larger books or collections of books. Having to compare each unique word once to see if that word is an anagram of any other word in the book would be the quickest solution. All the unique words in the book are stored in memory to enable repeated comparisons so why not use the index to that word structure to filter and isolate words that are anagrams of each other. Storing the words in a dictionary like structure under headings with the same word length, rare letters and number of each letter would drastically reduce the number of comparisons required.

Simplifying the anagram hunter to a general solution has each word taken and the letters reordered alphabetically and used as a "key" to group all words with the same generated key together. The key is a new anagram of the word with the same number and frequency of letters. All other words with the same number and frequency of letters will generate the same key and will therefore cluster together as  matching anagrams. 

Example
The words "notes" and "stone" when split into letters, sorted alphabetically both become "enost". When any word generates the same key "enost" another anagram of "notes" and "stone" will have been found. Later in the book "tones" appears, which is then added to the anagram sub-list.

Key
|       count of words with same key
|       |   | List of matching angrams of each other
enost = 3 = tones,notes,stone

To make this process work all punctuation is stripped from the book being replaced with a space character and all letters are converted to lower case before making the key. As we process the words keeping a counter for each word pattern allows a concordance to be generated.

Results

#Results Anagrams found in text
#Key = count = words
aelst = 5 = least,stale,steal,tales,stael
opst = 5 = post,stop,spot,pots,tops
acers = 5 = cares,scare,sacre,acres,races
adeerr = 4 = reread,reared,dearer,reader
aemn = 4 = mean,name,mane,amen
adegnr = 4 = ranged,danger,garden,grande
eilv = 4 = live,vile,veil,evil
eilms = 4 = smile,miles,limes,slime
aelp = 4 = pale,leap,plea,peal
.....snip


#Results frequent words found in text
#Word = Count
the = 34550
and = 22234
to = 16675
of = 14895
a = 10554
he = 10001
in = 8979
that = 8190
his = 7984
was = 7359
..... snip





Code


# Mon 23 Dec 2019 23:53:32 GMT
# find the anagrams from lines of things
# Prog Pearls 2.1C
# Only keep Unique words 

sub mkey () {
    #make key of characters in word in alpha order
    return join "",sort split //,$_[0];
}

$maxL =0;
while (<>) {
    s/[\W ]+/ /g#Tidy up text and drop punctuation
    foreach $d ( split (/ /) ) {  #get each word on line
        $d = lc $d;         #  make lower case
        $k = &mkey($d); #  make key from word
        $done{$d} +=1;      #  Count the number of times we have seen this word 
        if ($done{$d} == 1) {   # This is a new word
            $U +=1 ;            # count the number of unique words
            $kcnt{$k} +=1;      # count number of words with this key
            $wds{$k} .= $d."!"; # Save this word in list of words with same anagram key
            $lengthTable[length($d)] += 1; #Count unique words by length
            if ( length($d) > $maxL)
                { $maxL = length($d) };
        }
    }
}

# In number of anagrams found order
print "#Results Anagrams found in text\n#Key = count = words\n";
foreach $i (sort { $kcnt{$b} <=> $kcnt{$a} } keys %wds) {
    if ($kcnt{$i} > 1) {
        print "$i = $kcnt{$i} = ",join ",",split "!",$wds{$i};
        print "\n";
    }
}

# in most found order
print "#Results frequent words found in text\n#Word = Count\n";
foreach $i ( sort  { $done{$b} <=> $done{$a} } keys %done ) {
    print "$i = $done{$i}\n";
}

#Length table
print "#Results Word length table text\n#Length = Count\n";
foreach $i ( 0..$maxL) {
        print ("$i = $lengthTable[$i]\n");

}
foreach $i ( 2..$maxL) {
        if ($lengthTable[$i] > 0 ) {
                print ("$lengthTable[$i] * $lengthTable[$i] + ");
                $T += $lengthTable[$i] * $lengthTable[$i] ;
        }
}

print (" 0 = $T\nTotal Unique Words Count = $U\n");



Explanation


In Perl variables start with a $ and dictionary structures start with a % An single entry in a dictionary is addressed as $DictionaryName{ $key }. The +=1 phrase add one to the variable preceding. Looking here at just the interesting lines to get a flavour of how the program generates the answers.

There are three main dictionary structures in the program each of which has a collection of key and content pairs:



Name       KeyValues
doneA single word $d A number incremented each time that word is seen.
kcntkey generated from wordA number incremented each time a word generates this key
wdskey generated from wordList of words that all have the same key. Anagrams of each other.


Making the key

sub mkey () {
    return join "",sort split //,$_[0];
}

Working from the end of the line towards the return value.

 $_[0]     is the word to make the key
split //  slices the word in to a list of characters
sort      puts the list into alphabetical order
join ""   creates the key returned from the sorted list of characters

Inside the main foreach loop when a unique word is found

 $d is the current word
 $k is the generated key
 $done{$d} +=1Adds 1 to the "done" dictionary used to hold the count of words seen
 $lengthTable[length($d)] += 1; Counts the words of the same into lengthTable

 $wds{$k} .= $d."!";  Appends the word and an "!" to the list of all the words in %wds that have the same key 



 $kcnt{$k} +=1; Adds 1 to the dictionary used to hold the counts of words with the same $k key.

Reporting out the results


The anagrams found in dictionary %kcnt and words counting structure %wds are printed using very similar methods.



foreach $i (sort { $kcnt{$b} <=> $kcnt{$a} } keys %wds) { print... }


The output loop parts are


keys %wds     Generate a list of the keys in the dictionary of stored words %wds

sort { $kcnt{$b} <=> $kcnt{$a} }

              list of keys is sorted using a code fragment to access the values held with each key

foreach $i     Process each entry in the now sorted list.



The length of words table and total word count are printed.


Exercises 

Can you update the program to save and output the words found by length ? The longest words found are :


16 = circumstantially, disillusionments, enthusiastically, incomprehensible, irresponsibility, melodramatically, misunderstanding, responsibilities, superciliousness
17 = contemporaneously, misunderstandings, superstitiousness, unapproachability

18 = characteristically

Look at the shortest words found and improve the initial line text processing to reduce the numbers and noise in the answers. 

Are there any words split across lines using hyphenation ? if so fix this. 



Wednesday 25 December 2019

Six really great things about Swift - a modern programming language.


The languages people use guide how they think in both literature, science and programming. French and Welsh may good for poetry but my interest has always been in programming languages; how they influence the problem solving process and get the programming job done. From a career start in the old school languages of COBOL, Fortran and Basic it has always been obvious that the language used to describe the solution to a problem has a big influence on the accuracy and ease of the problem outcome.

A career in software technical support has provided a deep familiarity with broken software and solutions built on technical and scripting languages. Having a bit more time to research I thought it would be good to see what a really modern programming environment has to offer. Looking with a focus on improving the reliability of solutions built using modern software techniques.  Swift the Apple supported language behind iPhone apps was chosen being young, well supported, comprehensive and summed up in the tagline :
"Swift makes it easy to write software that is incredibly fast and safe by design."

Just a few months of familiarity with Swift have confirmed very positive initial impressions. The features listed combine in the language to deliver a rich and effective programming environment.

  1. Data types and structures are enforced but in a good way being flexible including the let/var choice and generic functions and extensibility.
  2. Parameter madness is tamed using named typed parameters and distinct result typing.
  3. Eco system provides easy integration with the OS and support tools. Xcode or just command line or including a playground.
  4. Documentation and understanding, hints and tips are built in. Books, online docs, error messages and tool tips.
  5. Memory non-management - threads unravelled and access to external features.
  6. The language and environment has rapidly evolved to maturity.

Example Code

The following simple example demonstrates just a very few of the language features of Swift. See below for the explanation.
The program demonstrates two methods of rotating the contents of an array of mixed items. The Array 'start' is set up with a mix of strings, numbers, boolean and an array. 'start' is then passed to funcs that rotate the contents by a specified number of positions then printed.  Two rotation func techniques are used.


// Sun 22 Dec 2019 16:39:47 GMT
// Programmining pearls 2.1 b
// Rotate an array of mixed elements

import Foundation

func RotateTradional (_ arr:[T], by: Int) -> [T] {
    var rev:[T] = []
    var rby = by % arr.count  // only rotate once
    if (rby == 0 || rby == arr.count)
        { return arr } // not moving or exact wraparround
    
    if (by < 0 )  { rby = arr.count + by } // handle -ve rotation in a positive way
    
    assert( rby > 0 && rby < arr.count, "Bad rotation \(rby) on \(arr.count)" )
    
    for z in (rby ..< arr.count+rby)
        { rev.append( arr[ z % arr.count ] ) }
    return rev
}

func Reverse (_ brr:[T], _ lft:Int, _ rgt:Int) -> [T] {
    var rev:[T] = []
    
    // reverse the array elements from lft to rgt-1
    // print ("#reverseIN  \(brr) \(lft) \(rgt)")
    
    if ( lft == rgt-1 ) // only 1 element to reverse
         { rev.append( brr[lft] ) }
    else {
          for z in stride(from:rgt, to:lft, by: -1 )
            { rev.append( brr[z-1] ) }
    }
    
    print ("#reverseOut \(brr) \(lft) \(rgt) is \(rev)")
    return rev
}

func Rotate (_ arr:[T], by: Int) -> [T] {
     // using a two partial reversal technique

    var rby = by % arr.count  // only rotate once
    
    if (rby == 0 || rby == arr.count)
        { return arr } // not moving or exact wraparround
    
    if (by < 0 ) { rby = arr.count + by } // handle -ve rotation in a positive way
    
    assert( rby > 0 && rby < arr.count, "Bad rotation \(rby) on \(arr.count)" )
    
    return Reverse( Reverse(arr,0,rby) + Reverse(arr,rby,arr.count), 0,arr.count)
}

//===================  Start Here   ===========================
// any old stuff in the array including a literal array

let start:[Any] = [ "alpha","b","c",5,"e",[99,3.2],"g",true,"I","omega" ]

print ("Start with     \(start)")
print ("Reversed start \(Reverse(start,0,start.count))")

print ("Rotated by Tradional method");
for i in ( -2 ... start.count + 2 ) {
    let temp = RotateTradional(start, by: i)
    print("\(String(format:"%2d", i)) = \(temp) has \(temp.count)")
}

print ("Rotated by partial reverse method");
for i in ( -2 ... start.count + 2 ) {
    let temp = Rotate(start, by: i)
    print("\(String(format:"%2d", i)) = \(temp) has \(temp.count)")
}

SampleOutput

Output from the above program with the debug output lines (starting with #) removed. 

$ swift RotateString.swift | grep -v '#'
Start with     ["alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega"]
Reversed start ["omega", "I", true, "g", [99.0, 3.2], "e", 5.0, "c", "b", "alpha"]
Rotated by Tradional method
-2 = ["I", "omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true] has 10
-1 = ["omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I"] has 10
 0 = ["alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega"] has 10
 1 = ["b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega", "alpha"] has 10
 2 = ["c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega", "alpha", "b"] has 10
 3 = [5.0, "e", [99.0, 3.2], "g", true, "I", "omega", "alpha", "b", "c"] has 10
 4 = ["e", [99.0, 3.2], "g", true, "I", "omega", "alpha", "b", "c", 5.0] has 10
 5 = [[99.0, 3.2], "g", true, "I", "omega", "alpha", "b", "c", 5.0, "e"] has 10
 6 = ["g", true, "I", "omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2]] has 10
 7 = [true, "I", "omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g"] has 10
 8 = ["I", "omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true] has 10
 9 = ["omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I"] has 10
10 = ["alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega"] has 10
11 = ["b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega", "alpha"] has 10
12 = ["c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega", "alpha", "b"] has 10
Rotated by partial reverse method
-2 = ["I", "omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true] has 10
-1 = ["omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I"] has 10
 0 = ["alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega"] has 10
 1 = ["b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega", "alpha"] has 10
 2 = ["c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega", "alpha", "b"] has 10
 3 = [5.0, "e", [99.0, 3.2], "g", true, "I", "omega", "alpha", "b", "c"] has 10
 4 = ["e", [99.0, 3.2], "g", true, "I", "omega", "alpha", "b", "c", 5.0] has 10
 5 = [[99.0, 3.2], "g", true, "I", "omega", "alpha", "b", "c", 5.0, "e"] has 10
 6 = ["g", true, "I", "omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2]] has 10
 7 = [true, "I", "omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g"] has 10
 8 = ["I", "omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true] has 10
 9 = ["omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I"] has 10
10 = ["alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega"] has 10
11 = ["b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega", "alpha"] has 10
12 = ["c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega", "alpha", "b"] has 10

Sample output showing Debug lines starting with # with line gaps added for clarity


Rotated by partial reverse method

#reverseOut ["alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega"] 0 8 is [true, "g", [99.0, 3.2], "e", 5.0, "c", "b", "alpha"]

#reverseOut ["alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega"] 8 10 is ["omega", "I"]

#reverseOut [true, "g", [99.0, 3.2], "e", 5.0, "c", "b", "alpha", "omega", "I"] 0 10 is ["I", "omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true]
-2 = ["I", "omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true] has 10

#reverseOut ["alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega"] 0 9 is ["I", true, "g", [99.0, 3.2], "e", 5.0, "c", "b", "alpha"]
#reverseOut ["alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega"] 9 10 is ["omega"]
#reverseOut ["I", true, "g", [99.0, 3.2], "e", 5.0, "c", "b", "alpha", "omega"] 0 10 is ["omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I"]
-1 = ["omega", "alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I"] has 10

 0 = ["alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega"] has 10
#reverseOut ["alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega"] 0 1 is ["alpha"]
#reverseOut ["alpha", "b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega"] 1 10 is ["omega", "I", true, "g", [99.0, 3.2], "e", 5.0, "c", "b"]
#reverseOut ["alpha", "omega", "I", true, "g", [99.0, 3.2], "e", 5.0, "c", "b"] 0 10 is ["b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega", "alpha"]

 1 = ["b", "c", 5.0, "e", [99.0, 3.2], "g", true, "I", "omega", "alpha"] has 10

Lest have a look at some of the feature that make Swift such a great language. Starting out with a few language features followed by the environmental factors that help to make the rich Swift ecosystem.

1a) Data types are enforced but in a good way.

The ability to use bits and bytes to represent multiple forms of data has been a blessing and source of errors throughout out the history of programming. Knowing what the contents of a specific memory location represent and enforcing the correct usage has driven improvements in both system security and program integrity. Any memory location can be a program instruction, integer, floating point number, string of characters, set of bits, reference to another thing, or even the internal representation of the pixels displayed on the screen. Improvements in detecting when a location is being used in the wrong way has been a fundamental part of software engineering over the last few decades.

Borrowing the concept of constants and variables from Pascal, Swift makes a good impression right from the outset by using "let" and "var" to declare fixed and changeable storage locations.

The types follow the variables around. When a new variable is declared and assigned a value from another with a known type the new variable becomes that type.

In addition the strict typing of variables functions and classes can be written to handle any type presented to them. We can even have arrays where each element is a different type.



let start:[Any] = [ "alpha","b","c",5,"e",[99,3.2],"g",true,"I","omega" ]

1b) Rich interchangeable data structures

Swift has a rich set of flexible built in data structures that can be combined and extended.  Enumerations, arrays, dictionary, sets and tuples. Structures and classes with functions provide the basic tool of object oriented programming.  Extensions to classes using modern techniques of protocols and closures is also supported. 

One advantage of having a rich set of data structures is that the language can be framed to include data structure handling tools such as iterators.


for i in f {
  print ( "\(i)" )
}

"f" could be declared as a set, array or dictionary or enumeration but this loop will print each element contained in "f".

1c) Uncertainty is allowed

yes/no true/false 1/0 are great as far as they go but sometimes two possible answers to a question are not enough.  The option of "Thank you for that great question but that whole thing didn't work out for me.' is sometimes required. At the statement level a try  { "something" } catch { "error handled here" } is provided but a real innovation is the provisional of optional values. For example converting a String to an Int works well when the string is made up of digits but what number is expected when the string contains "three" or "soixante-trois"  ?  Swift, as strongly typed language, cleverly uses an optional types Int? as the return data type from such functions.  The options can be unwound by forcing an error if the conversion was not successful or carried on to a point in the program where the conversion failure can be handled.

2) Subroutine Parameter madness is tamed

Start as declared above is an array of type "Any" and has been initiated with a mix of strings character, integers, an array and a boolean. The Rotate functions are written as generic functions  that can operate on array of any data type. This can be done as the function only handle array elements and not the contents of the array elements.

func RotateTradional (_ arr:[T], by: Int) -> [T] {

This function is passed a generic array "arr", an integer called "by" and will eventually return an array of the same type as arr whatever that happened to be.

By using named parameters and distinct result typing interface confusion is avoided. Parameters are normally treated as read only inside a function but where "mutating" parameters are needed these have to be annotated with "inout:".

The Swift parameter design avoids the madness that arrises from passing pointers to items as parameters to functions then having the function change the pointer or altering dereferenced content.  

3) Eco system integration with OS and support tools.

Everything from a Mobile app to server processing code or just a command line tool can be built using either Xcode as the rich development environment or using just a text editor.

4) Documentation for definitive understanding.

Hints and tips are built in. Compiler and runtime error messages often contain a hint to the solution as well as a reasonable description of the fault. Full free eBooks are obtainable on both the language and application environment. Xcode has the usual expected tool tips and ability to enrich the code building experence.

5) Memory non-management - threads unravelled and access to external features.

With Swifts strong data typing comes with strict scoping and memory reference counting. Malloc, heap and stack management are no longer required in the application. Memory management is handled automatically by the run time environment. 

6) Language has rapidly evolved to maturity.

Swift, launched in 2014, has evolved over 5 versions (as of 2020) to a good state of maturity. Whilst there was a wobble between version 3 & 4 on the way strings/ arrays of characters are handled forward compatibility of source code has been reasonable across versions.