Detail 6.2: Choosing which predicates to use

This section discusses some factors to consider when choosing which predicates to use for measuring field correlations.

String comparison summary

String comparison methods differ in how well they detect various types of similarity between fields. The following table shows which of seven methods are effective at detecting that two strings are similar in the presence of specific types of data-entry errors. An X marks a method that is effective for that type of error.

Table 1: Effectiveness of string comparison methods
Error type	Soundex	Nysiis	Metaphone	Double metaphone	Jaro-Winkler	Edit distance	Longest common substring
Phonetic misspelling	X	X	X	X	X
Typo						X
Start more important					X
Reversed parts							X
Missing part							X
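
To illustrate the phonetic-misspelling row, the following is a minimal Python sketch of the standard American Soundex coding. It is not taken from the software described here, whose implementation may differ in detail; the function name `soundex` and the helper table `CODES` are chosen for illustration only. Two names are treated as a phonetic match when their codes are equal.

```python
# Illustrative sketch of American Soundex; names here are hypothetical,
# not part of the software described in this document.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name: str) -> str:
    """Return the 4-character Soundex code of a name ('' if no letters)."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (encoded + "000")[:4]  # pad with zeros, truncate to 4 characters


print(soundex("Mustafa"), soundex("Mustapha"))  # M231 M231
print(soundex("Carlo"), soundex("Carlos"))      # C640 C642
```

Because Soundex always keeps the first letter, a pair such as Jolanda and Yolanda can never share a code, however similar the names sound.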

String comparison examples

Table 2: Examples of string comparison results
String 1	String 2	Soundex	Nysiis	Metaphone	Double metaphone	Jaro-Winkler	Edit distance	Longest common substring
Mustafa	Mustapha	X	X	X	X	0.89	2	0.71
Maxwell	Michelle	X				0.80	4	0.43
Carlo	Carlos		X			1.00	1	1.00
Nachanon	Natchanon			X		0.93	1	0.75
Jolanda	Yolanda				X	0.93	1	0.86
Li	Lilian					1.00	4	0.00
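
The edit distance column of Table 2 can be reproduced with the classic Levenshtein distance, which counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into the other. The sketch below is a plain dynamic-programming version written for illustration; the function name `edit_distance` is an assumption, and the software described here may compute the value differently (for example, with different case handling).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (case-sensitive)."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


pairs = [("Mustafa", "Mustapha"), ("Maxwell", "Michelle"), ("Carlo", "Carlos"),
         ("Nachanon", "Natchanon"), ("Jolanda", "Yolanda"), ("Li", "Lilian")]
for left, right in pairs:
    print(left, right, edit_distance(left, right))
```

Note that edit distance is an absolute count, not a normalized score: Li and Lilian are four edits apart even though one name is a prefix of the other.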

Combinations of string comparison functions

Table 3 shows how effective various combinations of string comparison methods are at reducing the amount of human review required to achieve 99.5% accuracy in a particular matching experiment. If no string comparison methods had been used, 62.8% of the pairs in this sample would have required review to reach the desired 99.5% matching accuracy; that is, with less review the number of false negatives would have been too high. When Nysiis, edit distance, Soundex or Jaro-Winkler is used by itself, the required review rate drops to 62.1%, 19.7%, 17.9% and 12.8%, respectively. The review rate is lowest when three or four of the methods are used in combination: the combinations in Table 3 need review of between 1.5% and 3.4% of the pairs, and the lowest rate (1.5%) comes from Nysiis, Soundex and Jaro-Winkler together.

Table 3: The effect of combining string comparison methods
Nysiis	Edit distance	Soundex	Jaro-Winkler	Human review rate
				62.8%
X				62.1%
	X			19.7%
		X		17.9%
			X	12.8%
X	X	X		3.4%
	X	X	X	2.3%
X	X		X	1.7%
X	X	X	X	1.6%
X		X	X	1.5%
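
To make the comparison in Table 3 concrete, the following Python sketch records the review rates from the table and computes how much each combination reduces review relative to the 62.8% baseline. The data are copied from Table 3; the names `BASELINE`, `review_rates` and `relative_reduction` are illustrative assumptions and not part of the software described here.

```python
# Review rates (per cent of pairs) copied from Table 3.
BASELINE = 62.8  # no string comparison methods used

review_rates = {
    ("Nysiis",): 62.1,
    ("Edit distance",): 19.7,
    ("Soundex",): 17.9,
    ("Jaro-Winkler",): 12.8,
    ("Nysiis", "Edit distance", "Soundex"): 3.4,
    ("Edit distance", "Soundex", "Jaro-Winkler"): 2.3,
    ("Nysiis", "Edit distance", "Jaro-Winkler"): 1.7,
    ("Nysiis", "Edit distance", "Soundex", "Jaro-Winkler"): 1.6,
    ("Nysiis", "Soundex", "Jaro-Winkler"): 1.5,
}


def relative_reduction(rate: float, baseline: float = BASELINE) -> float:
    """Fraction of the baseline review effort that a combination saves."""
    return (baseline - rate) / baseline


# List combinations from least to most review required.
for methods, rate in sorted(review_rates.items(), key=lambda item: item[1]):
    saved = relative_reduction(rate)
    print(f"{' + '.join(methods):55s} {rate:5.1f}%  saves {saved:.1%}")

best = min(review_rates, key=review_rates.get)
```

On these figures, Jaro-Winkler alone removes roughly four fifths of the baseline review effort, and the best three-method combination removes nearly all of it. These numbers come from one experiment and may not carry over to other data sets.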
