
Detail 6.2: Choosing which predicates to use

This section discusses some factors to consider when choosing predicates to use for measuring field correlations.

String comparison summary

String comparison methods differ in how well they detect various kinds of similarity between fields. The following table shows which of seven methods are effective at detecting that two strings are similar in the presence of specific types of data-entry errors.

Table 1: Effectiveness of string comparison methods
Error type | Methods marked X (of Soundex, Nysiis, Metaphone, Double metaphone, Jaro-Winkler, Edit distance, Longest common substring)
Phonetic misspelling | XXXXX
Typo | X
Start more important | X
Reversed parts | X
Missing part | X
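
To illustrate how a phonetic method tolerates misspellings, the following is a minimal sketch of the standard American Soundex code. It is an illustration only: the function name `soundex` is hypothetical, and the implementation used to produce the tables in this section may differ in its details.

```python
def soundex(name: str) -> str:
    """Return the four-character American Soundex code for a name.

    Illustrative sketch; not necessarily the variant used for the tables.
    """
    # Map consonants to Soundex digits; vowels, H, W and Y have no digit.
    codes = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            codes[ch] = digit

    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""

    result = letters[0]                 # the first letter is kept as-is
    prev = codes.get(letters[0], "")    # digit of the previous coded letter
    for ch in letters[1:]:
        if ch in "HW":
            continue                    # H and W do not separate equal codes
        digit = codes.get(ch, "")
        if digit and digit != prev:
            result += digit             # new consonant sound
        prev = digit                    # a vowel resets prev to ""
    return (result + "000")[:4]         # pad or truncate to four characters


if __name__ == "__main__":
    for a, b in [("Maxwell", "Michelle"), ("Mustafa", "Mustapha"),
                 ("Carlo", "Carlos")]:
        print(a, soundex(a), b, soundex(b), soundex(a) == soundex(b))
```

Two names are treated as phonetically similar when their codes are equal. Because the code always starts with the first letter of the name, this sketch cannot match names that differ in their first letter.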

String comparison examples

The following table shows the results that the seven methods give for several example pairs of similar names.

Table 2: Examples of string comparison results
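String 1 | String 2 | Phonetic methods marked X (of Soundex, Nysiis, Metaphone, Double metaphone) | Jaro-Winkler | Edit distance | Longest common substring
Mustafa | Mustapha | XXXX | 0.89 | 2 | 0.71
Maxwell | Michelle | X | 0.80 | 4 | 0.43
Carlo | Carlos | X | 1.00 | 1 | 1.00
Nachanon | Natchanon | X | 0.93 | 1 | 0.75
Jolanda | Yolanda | X | 0.93 | 1 | 0.86
Li | Lilian | (none) | 1.00 | 4 | 0.00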
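
The edit distance and longest common substring columns can be explored with a short sketch. The code below assumes the usual Levenshtein definition of edit distance (insertions, deletions and substitutions each cost 1) and returns the raw length of the longest common substring. The function names are hypothetical, and how the table normalizes the substring length into a score between 0 and 1 is not stated here, so no normalization is attempted.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum single-character edits turning a into b."""
    prev = list(range(len(b) + 1))      # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest run of characters that appears in both strings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            # Extend the run ending at the previous characters, or reset to 0.
            run = prev[j - 1] + 1 if ca == cb else 0
            curr.append(run)
            best = max(best, run)
        prev = curr
    return best


PAIRS = [("Mustafa", "Mustapha"), ("Maxwell", "Michelle"), ("Carlo", "Carlos"),
         ("Nachanon", "Natchanon"), ("Jolanda", "Yolanda"), ("Li", "Lilian")]

if __name__ == "__main__":
    for a, b in PAIRS:
        print(a, b, edit_distance(a, b), longest_common_substring_length(a, b))
```

Under these assumptions, the edit distances for the six pairs agree with the edit distance column of Table 2. Dividing the substring length by the length of the shorter string gives values that agree with five of the six rows but not with Li and Lilian, so the normalization used for the table remains an open assumption.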

Combinations of string comparison functions

Table 3 shows how various combinations of string comparison methods reduce the amount of human review required to achieve 99.5% accuracy in one particular matching experiment. With no string comparison methods, 62.8% of the pairs in this sample would have had to be reviewed to reach the desired 99.5% matching accuracy; that is, without reviewing 62.8% of the pairs, the number of false negatives would have been too high. When Nysiis, edit distance, Soundex or Jaro-Winkler is used by itself, the required review rate drops to 62.1%, 19.7%, 17.9% and 12.8%, respectively. The least review is needed when three or four of the methods are used in combination, which brings the review rate down to between 1.5% and 3.4%.

Table 3: The effect of combining string comparison methods
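Methods used (of Nysiis, Edit distance, Soundex, Jaro-Winkler) | Human review rate
None | 62.8%
Nysiis only | 62.1%
Edit distance only | 19.7%
Soundex only | 17.9%
Jaro-Winkler only | 12.8%
Three of the four methods | 3.4%
Three of the four methods | 2.3%
Three of the four methods | 1.7%
All four methods | 1.6%
Three of the four methods | 1.5%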
