Previous detail | Up | Next detail |
Detail 6.1 | Step 6 overview | |
Detail 6.2: Choosing which predicates to use
This section discusses some factors to consider when choosing predicates to use for measuring field correlations.
String comparison summary
String comparison methods differ in how well they detect various types of similarity between fields. Table 1 shows which of seven methods detect that two strings are similar in the presence of specific types of data-entry errors; an X marks a method that is effective for that type of difference.
Table 1: Effectiveness of string comparison methods
Type of difference | Soundex | Nysiis | Metaphone | Double metaphone | Jaro-Winkler | Edit distance | Longest common substring |
Phonetic misspelling | X | X | X | X | X | | |
Typo | | | | | | X | |
Start more important | | | | | X | | |
Reversed parts | | | | | | | X |
Missing part | | | | | | | X |
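To make the "Phonetic misspelling" row of Table 1 concrete, the following is a minimal sketch of the standard American Soundex rules. It is only an illustration: the function name `soundex` and the misspelling `Msrtha` are invented for this example, and the Soundex variant used to produce the tables on this page may differ in edge cases.

```python
# Minimal sketch of American Soundex (illustrative; not necessarily the
# variant used to produce the tables on this page).

# Map each consonant to its Soundex digit. Vowels, H, W and Y have no digit.
_CODES = {}
for _digit, _letters in enumerate(["BFPV", "CGJKQSXZ", "DT", "L", "MN", "R"], start=1):
    for _letter in _letters:
        _CODES[_letter] = str(_digit)


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of name, or "" if it has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    code = letters[0]
    prev = _CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _CODES.get(c, "")
        # Emit a digit only if it differs from the previous coded letter.
        if digit and digit != prev:
            code += digit
        # H and W do not separate letters with the same digit; vowels do.
        if c not in "HW":
            prev = digit
    # Pad with zeros or truncate to one letter plus three digits.
    return (code + "000")[:4]


# A phonetic misspelling keeps the same code ...
print(soundex("Mustafa"), soundex("Mustapha"))  # M231 M231
# ... whereas a keyboard typo (hypothetical example) usually changes it.
print(soundex("Martha"), soundex("Msrtha"))     # M630 M263
```

Two names are treated as phonetically similar when their codes are equal, which is why a phonetic method flags Mustafa/Mustapha but generally not a pair that differs by an arbitrary typo.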
String comparison examples
The following table shows the results that the seven methods give for some example pairs of names. For the four phonetic methods, an X marks a pair that the method treats as matching; for Jaro-Winkler, edit distance and longest common substring, the table shows the computed value.
Table 2: Examples of string comparison results
First string | Second string | Soundex | Nysiis | Metaphone | Double metaphone | Jaro-Winkler | Edit distance | Longest common substring |
Mustafa | Mustapha | X | X | X | X | 0.89 | 2 | 0.71 |
Maxwell | Michelle | X | | | | 0.80 | 4 | 0.43 |
Carlo | Carlos | | X | | | 1.00 | 1 | 1.00 |
Nachanon | Natchanon | | | X | | 0.93 | 1 | 0.75 |
Jolanda | Yolanda | | | | X | 0.93 | 1 | 0.86 |
Li | Lilian | | | | | 1.00 | 4 | 0.00 |
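The "Edit distance" column of Table 2 can be reproduced with the standard Levenshtein distance, which counts single-character insertions, deletions and substitutions. The sketch below assumes that definition; the function name `edit_distance` is illustrative, and the other columns (which depend on implementation-specific normalization) are not reproduced here.

```python
# Minimal sketch of Levenshtein edit distance using two rolling rows.

def edit_distance(a: str, b: str) -> int:
    """Number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


# Name pairs copied from Table 2.
PAIRS = [
    ("Mustafa", "Mustapha"),
    ("Maxwell", "Michelle"),
    ("Carlo", "Carlos"),
    ("Nachanon", "Natchanon"),
    ("Jolanda", "Yolanda"),
    ("Li", "Lilian"),
]

for first, second in PAIRS:
    print(f"{first:<9} {second:<10} {edit_distance(first, second)}")
```

Unlike the phonetic codes, the result is a count rather than a match/no-match flag, so using it as a predicate requires choosing a threshold.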
Combinations of string comparison functions
Table 3 shows how effectively various combinations of string comparison methods reduce the amount of human review required to achieve 99.5% accuracy in a particular matching experiment. If no string comparison methods had been used, 62.8% of the pairs in this sample would have had to be reviewed to achieve the desired 99.5% matching accuracy; that is, without reviewing 62.8% of the pairs, the number of false negatives would have been too high. When Nysiis, edit distance, Soundex or Jaro-Winkler is used by itself, the required review rate drops to 62.1%, 19.7%, 17.9% and 12.8%, respectively. The lowest review rates, from 1.5% to 3.4%, occur when three or four of the methods are used in combination.
Table 3: The effect of combining string comparison methods
Nysiis | Edit distance | Soundex | Jaro-Winkler | Human review rate |
| | | | 62.8% |
X | | | | 62.1% |
| X | | | 19.7% |
| | X | | 17.9% |
| | | X | 12.8% |
X | X | X | | 3.4% |
| X | X | X | 2.3% |
X | X | | X | 1.7% |
X | X | X | X | 1.6% |
X | | X | X | 1.5% |
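As a rough aid to reading Table 3, the sketch below ranks the combinations by review rate and expresses each one relative to the 62.8% baseline. The rates are copied from the table; the names `TABLE_3`, `BASELINE` and `best` are illustrative only.

```python
# Review rates (per cent) copied from Table 3, keyed by the methods used.
TABLE_3 = [
    ((), 62.8),
    (("Nysiis",), 62.1),
    (("Edit distance",), 19.7),
    (("Soundex",), 17.9),
    (("Jaro-Winkler",), 12.8),
    (("Nysiis", "Edit distance", "Soundex"), 3.4),
    (("Edit distance", "Soundex", "Jaro-Winkler"), 2.3),
    (("Nysiis", "Edit distance", "Jaro-Winkler"), 1.7),
    (("Nysiis", "Edit distance", "Soundex", "Jaro-Winkler"), 1.6),
    (("Nysiis", "Soundex", "Jaro-Winkler"), 1.5),
]

BASELINE = 62.8  # review rate with no string comparison methods

# List the combinations from lowest to highest review rate.
for methods, rate in sorted(TABLE_3, key=lambda row: row[1]):
    label = " + ".join(methods) or "(none)"
    print(f"{label:<50} {rate:5.1f}%  {BASELINE / rate:5.1f}x less review")

# The combination with the lowest review rate in this experiment.
best = min(TABLE_3, key=lambda row: row[1])
```

In this experiment the lowest rate comes from a three-method combination (Nysiis, Soundex and Jaro-Winkler at 1.5%), slightly below the 1.6% obtained with all four methods, and roughly 42 times less review than the baseline.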