COM**************************************************************************** COM COM This is an adaptation of the code was written in the March of 2006 by CAATS (AKA David Coderre) and originally posted to the ACL USER FORUM. COM This code is designed to compare two the results from two tables and only identifies duplicates if the COM match is from the second table. It was later uploaded onto the TexasACL.COM. COM COM Fuzzy Logic used to search for duplicates. The code is designed to be usable on any table. COM**************************************************************************** COM Ok, several people have asked about how to use CAATS script to compare two different tables. Yesterday, I had the need to do just that, so I modified CAATS script to do so. I had CAATS and GJH take a look at it, and they both caught the same error. But here you go. This script should apply CAATS Fuzzy logic query to the results. A couple of notes: COM 1) The resulting file name will be "COMPARE" plus the first six letters of both files compared. COM 2) A new field is generated called "matching" which will tell you if the match is exact or partial. This comparison is between the two ORIGINAL data fields. (The only normalization between the two is to ensure they are the same size for merging purposes.) If you get the result "EXACT" the dup-key will be 0. You can get cases where the dup-key is still 0, but this field will read "PARTIAL" that is because the ORIGINAL document had different spacing/capitolization/punctuation/etc. Visually they look identical, but they aren't. COM 3) I added a new dialogue box that lists some of the limitations on this approach. It's a great tool, but imperfect. COM ********************* set safety off DIALOG (DIALOG TITLE "User Dialog" WIDTH 460 HEIGHT 394 ) (BUTTONSET TITLE "&OK;&Cancel" AT 370 12 DEFAULT 1 ) (TEXT TITLE "On both tables that you are going to be asked you will asked for 3 pieces of information." AT 24 28 WIDTH 329 HEIGHT 37 ) (TEXT TITLE "First, you will be asked for the field you want to compare. For example, you may be comparing addresses, in this case you would select the field on both tables for addresses." AT 48 88 WIDTH 376 HEIGHT 69 ) (TEXT TITLE "Second, you will be asked for a field whose data you wish to capture. Again, if you were comparing addresses, this field might be the entities name." AT 48 184 WIDTH 381 HEIGHT 52 ) (TEXT TITLE "Third, you will be asked to enter a unique identier. This could be a vendor ID, Social Security Number, or line number. Ideally anything that provides unique reference back to the original file." AT 60 256 WIDTH 369 HEIGHT 68 ) (TEXT TITLE "original cc: CAATS" AT 312 340 ) (TEXT TITLE "Modified CC: Porter" AT 312 364 ) DIALOG (DIALOG TITLE "User Dialog" WIDTH 471 HEIGHT 541 ) (BUTTONSET TITLE "&OK;&Cancel" AT 370 12 DEFAULT 1 ) (TEXT TITLE "This script has two known scope limitations that the user should be aware of before making any conclusions." AT 24 28 WIDTH 320 HEIGHT 50 ) (TEXT TITLE "First, it is virtually impossible to compare every line on one table with every line on another table. Particularly when dealing with larger tables and millions of records." AT 24 100 WIDTH 430 HEIGHT 53 ) (TEXT TITLE "Thus this script compares each line with up to two lines from the other table. The script does it's comparisons in alphabetical order, then it repeats the script in reverse alphabetical order." AT 24 172 WIDTH 429 HEIGHT 50 ) (TEXT TITLE "The script may miss some similar names if another name interferes with the alphabetical ordering occurs." AT 24 244 WIDTH 430 HEIGHT 35 ) (TEXT TITLE "Second, the final results are demostrated on a one-to-one basis between the two tables. This script does not identify duplicates within a table and does not perform a many-to-many comparison." AT 24 292 WIDTH 429 HEIGHT 48 ) (TEXT TITLE "The script will only tell you that at least one potential match occurs between the two tables. Due dilligence on the user's behalf would be to determine if there are other potential duplicates involving the same data." AT 24 364 WIDTH 428 HEIGHT 63 ) COMMENT The basic principle utilized in this script was CC by CAATS. It was modified for usage on multiple tables by Porter Broyles. DIALOG (DIALOG TITLE "User Dialog" WIDTH 484 HEIGHT 347 ) (BUTTONSET TITLE "&OK;&Cancel" AT 360 24 DEFAULT 1 ) (TEXT TITLE "Fuzy logic test for duplicates" AT 24 40 WIDTH 147 HEIGHT 43 ) (TEXT TITLE "Select table to open" AT 24 112 ) (ITEM TITLE "f" TO "_t_main" AT 120 144 WIDTH 205 HEIGHT 130 DEFAULT "%_t_main%") (TEXT TITLE "Select table to compare with" AT 24 196 ) (ITEM TITLE "f" TO "_T_compare" AT 120 228 WIDTH 199 DEFAULT "%_t_compare%") open %_t_main% DELETE Main OK DELETE Other_field OK DELETE ID OK DELETE Source OK DELETE _f_main_other OK DELETE _f_main_ID OK DELETE testfield OK DIALOG (DIALOG TITLE "User Dialog" WIDTH 482 HEIGHT 309 ) (BUTTONSET TITLE "&OK;&Cancel" AT 370 12 DEFAULT 1 ) (TEXT TITLE "Select field to test for duplicates" AT 36 124 WIDTH 163 HEIGHT 36 ) (ITEM TITLE "C" TO "_f_main_test" AT 240 132 WIDTH 202 HEIGHT 130 DEFAULT"%_f_main_test%" ) (TEXT TITLE "Set threshold (considered duplicate if number of characters difference is less than or equal to)" AT 24 28 WIDTH 184 HEIGHT 76 ) (EDIT TO "_f_threshold" AT 240 48 WIDTH 20 HEIGHT 24 DEFAULT "%_f_threshold%" ) (TEXT TITLE "Select field - to list if duplicate record" AT 36 184 WIDTH 180 HEIGHT 31 ) (ITEM TITLE "CND" TO "_f_main_other1" AT 240 180 WIDTH 223 HEIGHT 133 DEFAULT "%_f_main_other1%") (ITEM TITLE "CND" TO "_f_main_ID1" AT 240 252 WIDTH 224 DEFAULT "%_f_main_ID1%") (TEXT TITLE "Select field- unique identifier for record" AT 36 244 WIDTH 174 HEIGHT 40 ) IF FTYPE("%_f_main_other1%") = "N" DEFINE FIELD _f_main_other COMPUTED ALLTRIM(String(%_f_main_other1%,50)) If FTYPE("%_f_main_other1%") = "D" DEFINE FIELD _f_main_other COMPUTED ALLTRIM(SUBSTRING(DATE(%_f_main_other1%),1,50)) IF FTYPE("%_f_main_other1%") = "C" DEFINE FIELD _f_main_other COMPUTED ALLTRIM(SUBSTRING(%_f_main_other1%,1,50)) IF FTYPE("%_f_main_ID1%") = "N" DEFINE FIELD _f_main_ID COMPUTED ALLTRIM(String(%_f_main_ID1%,50)) IF FTYPE("%_f_main_ID1%") = "C" DEFINE FIELD _f_main_ID COMPUTED ALLTRIM(SUBSTRING(%_f_main_ID1%,1,50)) DEFINE FIELD testfield COMPUTED ALLTRIM(SUBSTRING(%_f_Main_test%,1,50)) open %_t_compare% testname = "Compare " + SUBSTRING("%_t_main%",1,6)+ " " + Substring("%_t_compare%",1,6) DELETE Main OK DELETE Other_field OK DELETE ID OK DELETE Source OK DELETE _f_compare_other OK DELETE _f_compare_ID OK DELETE testfield OK DIALOG (DIALOG TITLE "User Dialog" WIDTH 479 HEIGHT 304 ) (BUTTONSET TITLE "&OK;&Cancel" AT 370 12 DEFAULT 1 ) (TEXT TITLE "Select field to test for duplicates" AT 36 124 WIDTH 163 HEIGHT 36 ) (ITEM TITLE "C" TO "_f_compare_test" AT 240 132 WIDTH 202 HEIGHT 130 DEFAULT"%_f_compare_test%" ) (TEXT TITLE "Select field - to list if duplicate record" AT 36 184 WIDTH 180 HEIGHT 31 ) (ITEM TITLE "CND" TO "_f_compare_other1" AT 240 180 WIDTH 223 HEIGHT 133 DEFAULT "%_f_compare_other1%" ) (ITEM TITLE "CND" TO "_f_compare_ID1" AT 240 252 WIDTH 224 DEFAULT "%_f_compare_ID1%" ) (TEXT TITLE "Select field- to list if duplicate record" AT 36 244 WIDTH 174 HEIGHT 40 ) IF FTYPE("%_f_compare_other1%") = "N" DEFINE FIELD _f_compare_other COMPUTED ALLTRIM(String(%_f_compare_other1%,50)) If FTYPE("%_f_compare_other1%") = "D" DEFINE FIELD _f_compare_other COMPUTED ALLTRIM(SUBSTRING(DATE(%_f_compare_other1%),1,50)) IF FTYPE("%_f_compare_other1%") = "C" DEFINE FIELD _f_compare_other COMPUTED ALLTRIM(SUBSTRING(%_f_compare_other1%,1,50)) IF FTYPE("%_f_compare_ID1%") = "N" DEFINE FIELD _f_compare_ID COMPUTED ALLTRIM(String(%_f_compare_ID1%,50)) IF FTYPE("%_f_compare_ID1%") = "C" DEFINE FIELD _f_compare_ID COMPUTED ALLTRIM(SUBSTRING(%_f_compare_ID1%,1,50)) DEFINE FIELD testfield COMPUTED ALLTRIM(SUBSTRING(%_f_Compare_test%,1,50)) open "%_t_main%" DEFINE FIELD Main COMPUTED ALLTRIM(SUBSTRING(%_f_Main_test%,1,50)) DEFINE FIELD Other_field COMPUTED ALLTRIM(SUBSTRING(_f_Main_other,1,50)) DEFINE FIELD ID COMPUTED ALLTRIM(SUBSTRING(_f_main_ID,1,50)) DEFINE FIELD Source COMPUTED ALLTRIM(Substring("%_t_main%",1,50)) EXTRACT FIELDS Main Other_field ID Source testfield TO "Fuzzy_Dupe_Master" OPEN open %_t_compare% DEFINE FIELD Main COMPUTED SUBSTRING(%_f_compare_test%,1,50) DEFINE FIELD Other_field COMPUTED SUBSTRING(_f_compare_other,1,50) DEFINE FIELD ID COMPUTED SUBSTRING(_f_compare_ID,1,50) DEFINE FIELD Source COMPUTED Substring("%_t_compare%",1,50) EXTRACT FIELDS Main Other_field ID Source testfield TO "Fuzzy_Dupe_Master" append OPEN open Fuzzy_Dupe_Master DELETE Dup_test1 OK DELETE Dup_test2 OK DELETE RDup_test1 OK DELETE RDup_test2 OK DELETE recnum OK DELETE main_v2 OK DELETE Other_field_v2 OK DELETE ID_v2 OK DELETE Source_v2 OK DELETE Fuzzy_Dup.fil OK DELETE format Fuzzy_Dup OK DELETE testfield_v2 OK DEFINE field recnum Computed recno() DEFINE field main_v2 COMPUTED recoffset(Main,1) DEFINE field Other_field_v2 COMPUTED recoffset(Other_field,1) DEFINE FIELD testfield_v2 COMPUTED recoffset(testfield ,1) DEFINE field Source_v2 COMPUTED recoffset(Source,1) DEFINE field ID_v2 COMPUTED recoffset(ID,1) DEFINE field Dup_test1 Computed INCLUDE(UPPER(Main),'QWERTYUIOPLKJHGFDSAZXCVBNM1234567890') DEFINE field Dup_test2 COMPUTED recoffset(Dup_test1,1) DEFINE field RDup_test1 COMPUTED Alltrim(REVERSE(Dup_test1)) DEFINE field RDup_test2 COMPUTED alltrim(REVERSE(Dup_test2)) SORT on Dup_test1 to temp OPEN EXTRACT FIELDS testfield testfield_v2 Dup_test1 Dup_test2 ID ID_v2 Main main_v2 Other_field Other_field_v2 RDup_test1 RDup_test2 recnum Source Source_v2 IF Source <> Source_v2 TO "temp1" OPEN loop_max=LEN(Main) loop_ctr=1 Ldup_key=0 Rdup_key=0 GROUP LOOP while loop_ctr<=loop_max LDup_key=Ldup_key+1 if substr(dup_test1,loop_ctr,1)<>substr(dup_test2,loop_ctr,1) loop_ctr=loop_ctr+1 END extract recnum testfield testfield_v2 Main Main_v2 Other_field Other_field_v2 ID ID_v2 Source Source_v2 LDup_key RDup_key 1 as 'L_cnt' 0 as 'R_cnt' to fuzzy_temp Ldup_key=0 Rdup_key=0 loop_ctr=1 END OPEN fuzzy_dupe_master SORT on RDup_test1 to temp OPEN EXTRACT FIELDS testfield testfield_v2 Dup_test1 Dup_test2 ID ID_v2 Main main_v2 Other_field Other_field_v2 RDup_test1 RDup_test2 recnum Source Source_v2 IF Source <> Source_v2 TO "temp1" OPEN OPEN fuzzy_dupe_Master SORT on RDup_test1 to temp OPEN loop_ctr=1 Ldup_key=0 Rdup_key=0 GROUP LOOP while loop_ctr<=loop_max RDup_key=Rdup_key+1 if substr(Rdup_test1,loop_ctr,1)<>substr(Rdup_test2,loop_ctr,1) loop_ctr=loop_ctr+1 END extract recnum testfield testfield_v2 Main Main_v2 Other_field Other_field_v2 ID ID_v2 Source Source_v2 Ldup_key RDup_key 0 as 'L_cnt' 1 as 'R_cnt' to fuzzy_temp LDup_key=0 Rdup_key=0 loop_ctr=1 END OPEN Fuzzy_temp Summarize on Main Main_v2 OTHER testfield testfield_v2 Other_field Other_field_v2 ID ID_v2 source source_v2 recnum Acc Ldup_key RDup_key L_cnt R_cnt to Fuzzy_temp2 PRESORT OPEN DEFINE field Dup_Key Computed MIN(LDup_Key,RDup_key) if l_cnt>0 and r_cnt>0 Ldup_key if L_cnt>0 RDup_key DEFINE FIELD matching COMPUTED "EXACT" if testfield = testfield_v2 "Not Exact" EXTRACT Matching testfield testfield_v2 Main Main_v2 Other_field Other_field_v2 ID ID_v2 recnum Dup_Key Source Source_v2 if Dup_Key<=%_f_threshold% to "%testname%" OPEN %_t_main% DELETE Fuzzy_TEMP.fil OK DELETE format Fuzzy_TEMP OK DELETE Fuzzy_TEMP2.fil OK DELETE format Fuzzy_TEMP2 OK DELETE TEMP.fil OK DELETE format TEMP OK DELETE TEMP1.fil OK DELETE format TEMP1 OK DELETE Dup_test1 OK DELETE Dup_test2 OK DELETE RDup_test1 OK DELETE RDup_test2 OK DELETE recnum OK DELETE Main_v2 OK DELETE Other_field_v2 OK DELETE loop_Ctr OK DELETE loop_max OK OPEN %_t_compare% DELETE Fuzzy_TEMP.fil OK DELETE format Fuzzy_TEMP OK DELETE Fuzzy_TEMP2.fil OK DELETE format Fuzzy_TEMP2 OK DELETE TEMP.fil OK DELETE format TEMP OK DELETE TEMP1.fil OK DELETE format TEMP1 OK DELETE Dup_test1 OK DELETE Dup_test2 OK DELETE RDup_test1 OK DELETE RDup_test2 OK DELETE recnum OK DELETE Main_v2 OK DELETE Other_field_v2 OK DELETE loop_Ctr OK DELETE loop_max OK DELETE Fuzzy_Dupe_Master.fil OK DELETE format Fuzzy_Dupe_Master OK set safety on