Friday, 3 November 2017

Technical Problem to solve - "Five in a bed" Key matching across lists.

I have a technical problem that needs to be solved as part of a larger project that has been in the works for a while. Code development needed.

Starting conditions
Five text files {A..E} that each contain one number on each line.
Each line is a Hex value that has 8 bits set within a 40bit value.
EG: Each file will look like these lines ( order is not important )

The files have between a 5000 and 500,000 lines. On average about 30,000 lines.

Find one or more solution groups of lines, one from each file, that when OR’d together have all 40 bits set (0xFFFFF) but when AND’d together have 0 bit set (0x00000). The answer would deliver results as
repeated for each solution set found. Have the ability to stop after first solution or provide multiple answers is desirable.

Outline size of task - ( updated )
A simplistic approach would take all the lines from the first file and compare with all lines from second file and keep the matching pairs that are then compared against 3rd file and so on. This scales badly as up to 30,000 ^ 5 = 24,300,000,000,000,000,000,000 worst case comparisons may be needed. Further thought reduced this number as any failed comparison between the first two keys would remove the need for further comparisons for that key pair.  Also when 4 keys have been matched locating the last key becomes a trivial look-up for existence in the last key file. To get 4 keys we can generate keys pairs from files A&B and separately from files C&D then just merge the key results. This process would need 3 comparisons making worst case about 30,000 ^ 3 = 27,000,000,000,000 comparisons to find all the matching sets. As we only need 1 matching result 5 key set set, then if we can use a progressive algorithm that does not need all the results from one stage before starting the next stage, the actual number of comparisons would be much less.

Test runs show that from files with 8,172 and 438,761 entries  517,183,384 matching pairs are generated (approx 14% hit rate). The normal size of the files is 30,000 entries.  If this 14% success ratio persists then we would have in the normal case

(( 30000 * 30000 )* 2 * 0.14 ) 

comparisons and final stage lookups.  Which seems do-able.

Structure of answer
Read and prepare file data into data structures.

One possible solution would be to run matches between files A&B and separately files C&D to find second stage candidates. Join the results lists and look-up against file E for final matches. This is one approach but I am open to other possibilities.

It is expected that some sort of pre-organisation of the file keys would be used to reduce the need to compare each key with each other.  Possible some sort of Bloom filter or key grouping ( This is the needed magic bit. )

Technology to be used.
Perl and/or c++
Multi-tasking via threads or standard libraries encouraged 8 or 12 way processors available.
8GB memory, 5TB HDD and/or 250GB SSD.
Typical time to solution 5 minutes or better.
Any c++ should be "standards compliant" and self contained at src level and (highly desirable) to be adjustable to work on W10 or Mac or Ubuntu.
Code should be structured so that it can be trivially built into a larger program or as stand alone utility.

Done a perl prototype.
Usage {-v n -p n} -x n Afile Bfile Cfile Dfile Efile
Each file has a list of numbers 1 each on a line.
Find the lines in the files that have no overlapping bits, IE when & togther make 0. 
-v n  Set verbose level n default 0; 
-p n  Dump partial results after n keys and save mem 10000 is recomended for 2.5GB usage  
      memMode is set from -p and mode =4
-x {1,2,3,4} Run using files as intermediate result holders
            Run Mode x=1 work on Afile Bfile (default) , x=2 work on Cfile Dfile ,x=3 Murge previous done runs with Efile
    Run Mode x=4 Do all stages in this single process, sets memMode=0;
-x 5         Run as in memory solver all stages.

Hint: For shorter run times put the file on the command line in increasing size order.

Normal run would be -x 1 fileA fileB fileC fileD fileE & -x 2 fileA fileB fileC fileD fileE &
wait # till both have finished -x 3 fileA fileB fileC fileD fileE # which will pick up the results and do final murge.

or -x 4 fileA fileB fileC fileD fileE # may run out of memory on big sets

No comments: