Duplicate FieldsWhen ACL performs a duplicate search, it does so on a very simplistic methodology. It will look for fields that are exact duplicates. Change one character, capitalize a single letter, add a space and ACL will fail to recognize two otherwise identical tables. To improve the possibility of identifying possible duplicated fields, there are few tricks one can utilize:
1) Normalize the data.
The first step one can utilize is to normalize the data. In normalizing data, one elimiates extraneous characters such as spaces, and non-alpha/numeric characters. One should also capitalizes alpha characters. This will ensure that the fields match:
1201 E. Main St.
1201 E MAIN ST
2)
Hamming DistanceThe Hamming Distance is a method for looking for potential duplicates. The code looks for common characters in the same position:
Porter T Broyles
Porter B Broiles
| 0 | 0
| 0 | 0
| 0
| 0
| 1
| 0
| 0
| 0
| 1
| 0
| 0
| 0
|
| Match | Match | Match | Match | Match | Match | No Match
| Match | Match | Match | No Match | Match | Match | Match |
| P | O
| R
| T
| E
| R
| T
| B
| R
| O
| Y
| L
| E
| S
|
| P | O
| R
| T
| E
| R
| B
| B
| R
| O
| I
| L
| E
| E
|
By eliminating the spacing and capitalizing every character, you can determine how many characters do not match. The lower the value, the more likely the match. In this case, the two fields having a Hamming Distance of 2 because every character but two match. This test is often run with the fields going both forwards and backwards. This is to help ensure that the impact of titles or leading characters (such as Mr Porter B Broyles) are minimized.
3)
Q-GramA Q-Gram looks at a sequence of characters rather than individual letters. This methodology looks for common characters within a certain range of one another. It requires three pieces of information:
First, how many characters are in common.
Second, how close these characters are to one another.
Third, what percentage of match you are looking for.
Using this methodology, one would be able too compare the names Porter T Broyles and Mr. Porter B. Broyles as such:
MR
| RP
| PO
| OR
| RT
| TE
| ER
| RB
| BB
| BR
| RO
| OY
| YL
| LE
| ES
|
|
| PO
| OR
| RT
| TE
| ER
| RT
| TB
| BR
| RO
| OY
| YL
| LE
| ES
|
no
| no
| yes
| yes | yes | yes | yes | no
| no
| yes | yes | yes | yes | yes | yes |
0
| 0
| 1
| 1
| 1
| 1
| 1
| 0
| 0
| 1
| 1
| 1
| 1
| 1
| 1
|
Using the above data, there are 11 segments that match out of a possible 13 segments (the total possible is based upon the shorter of the two possible combinations.) This results in an 85% match between the two fields.
4)
Levenschtein Distance Levenschtein Distance looks at the total number of character changes necessary to get two fields to match. The metric is named after Vladimir Levenshtein who formulated the concept in 1965. It is often used in applications that need to determine how similar, or different, two strings are. When comparing two fields of identical length, it is no different than a Hamming Distant, but when the two fields are of disparate length, it measures the number of different key strokes necessary to create different different fields. For example, the Levenschtein Distance between John Smith and Jon Smythe would be 3.
1) Delete the "h" from John Smith.
2) Change the "i" to "y".
3) Add the "e" at the end of Smythe.