Previous detail	Up	Next detail
	Step 7 overview	Step 7.2

Step 7.1: Initial analysis

The following table shows the 20 most common signatures of MATCH pairs in the current IIS Deduplication Test Kit. The first column is the number of times that a signature occurs. The next 14 columns show the predicate firing patterns for the fields named in the column headers. Within each field, the predicates are listed in the order described in section 3 of Step 6.

REFERENCE PAIR CORRELATIONS (6 predicates/field)

Predicate order for each field:
     ExactMatch, Nysiis, Soundex, EditDistance, Jaro-Winkler, Difference
     |Nysiis
     | Soundex
     |  EditDistance
     |   Jaro-Winkler
     |    Difference
     |    |
     *    *
     110100

     111110 Exact match (Strings with at least 3 characters)
     110100 Exact match (Strings with fewer than 3 characters, including empty strings)
     0xxxx0 Approximate match
     000001 Different
     000000 Null in one or both values

CNT  DOB    FirstN LastN  Middle MomFN  MomLN  Maiden MomMN  Sex    Suffix VacCod VacDat VacMfr VacNam
===  ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
 5:  011110 111110 111110 111110 111110 111110 111110 111110 111110 000000 010100 011000 000001 000001
 3:  111110 111110 111110 111110 111110 111110 111110 000000 111110 000000 111110 111110 111110 111110
 3:  111110 111110 011110 000000 111110 011110 111110 111110 111110 000000 010100 011000 000001 000001
 3:  111110 011110 111110 000000 111110 111110 111110 111110 111110 000000 010100 011000 000001 000001
 3:  111110 000001 111110 000000 111110 111110 111110 111110 111110 000000 010100 011000 000001 000001
 2:  111110 111110 111110 000000 111110 111110 111110 000000 111110 000000 111110 111110 111110 111110
 2:  111110 111110 011110 111110 111110 011110 111110 111110 111110 000000 010100 011000 000001 000001
 2:  111110 111110 000001 111110 111110 111110 111110 111110 111110 000000 111110 011100 111110 111110
 2:  111110 111110 000001 111110 111110 111110 111110 111110 111110 000000 010100 011000 111110 000001
 2:  111110 111110 000001 111110 111110 000001 111110 111110 111110 000000 010100 011000 000001 000001
 2:  111110 010110 111110 111110 111110 111110 111110 111110 111110 000000 111110 011000 111110 111110
 2:  111110 000010 111110 000000 111110 111110 111110 111110 111110 000000 010100 011000 111110 000001
 2:  111110 000010 111110 000000 111110 111110 111110 111110 111110 000000 010100 011000 000001 000001
 2:  111110 000001 111110 000000 111110 111110 111110 111110 111110 000000 011100 011000 000001 000001
 2:  111110 000001 111110 000000 111110 111110 111110 000000 111110 000000 010100 011000 000001 000001
 2:  011110 111110 111110 111110 111110 111110 000000 000000 111110 000000 011100 011000 000001 000001
 2:  011110 111110 111110 111110 011100 111110 111110 000000 111110 000000 010100 011000 000001 000001
 2:  011110 111110 111110 000000 111110 111110 111110 000000 111110 000000 111110 011000 111110 111110
 1:  111110 111110 111110 111110 111110 111110 111110 111110 111110 000000 111110 111110 000001 111110
 1:  111110 111110 111110 111110 111110 111110 111110 111110 111110 000000 010100 111110 111110 000001
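
As a minimal sketch of how a row of this table can be read programmatically, the Python below splits one row into a count and a per-field pattern, then maps each pattern to the legend categories listed above the table. The names `parse_row` and `classify_field` are hypothetical and are not part of the Test Kit or App 3. Patterns that the legend does not list are reported as "other", which is an assumption of this sketch.

```python
from typing import Dict, Tuple

# Column order copied from the table header above.
FIELDS = [
    "DOB", "FirstN", "LastN", "Middle", "MomFN", "MomLN", "Maiden",
    "MomMN", "Sex", "Suffix", "VacCod", "VacDat", "VacMfr", "VacNam",
]


def parse_row(line: str) -> Tuple[int, Dict[str, str]]:
    """Split a row such as ' 5:  011110 111110 ...' into (count, patterns)."""
    count_text, sep, rest = line.partition(":")
    patterns = rest.split()
    if not sep or len(patterns) != len(FIELDS):
        raise ValueError(f"expected a count and {len(FIELDS)} patterns: {line!r}")
    return int(count_text), dict(zip(FIELDS, patterns))


def classify_field(bits: str) -> str:
    """Map one 6-character firing pattern to a legend category."""
    if len(bits) != 6 or set(bits) - {"0", "1"}:
        raise ValueError(f"expected six 0/1 characters, got {bits!r}")
    if bits in ("111110", "110100"):
        return "exact"
    if bits == "000001":
        return "different"
    if bits == "000000":
        return "null"
    # 0xxxx0 with at least one approximate predicate firing.
    if bits[0] == "0" and bits[5] == "0":
        return "approximate"
    # Assumption: anything the legend does not cover is flagged, not guessed.
    return "other"


row = (" 5:  011110 111110 111110 111110 111110 111110 111110 "
       "111110 111110 000000 010100 011000 000001 000001")
count, patterns = parse_row(row)
categories = {field: classify_field(bits) for field, bits in patterns.items()}
print(count, categories["DOB"], categories["Suffix"], categories["VacMfr"])
# prints: 5 approximate null different
```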

In all, this initial set of predicates produces 224 distinct signatures for the 244 MATCH pairs in the reference set. In contrast, App 3 of the CDC Examples, which is configured to generate pairs as described in Steps 1 through 6, generates many more distinct signatures. As one might expect, the number of distinct signatures increases with the number of generated pairs, and for a given number of generated pairs the count varies somewhat from run to run. The growth is sublinear. For example, Figure 1 shows that for one trial run of App 3, the number of distinct signatures grew from 161 for 200 pairs to 119,668 for 1M pairs.
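
A growth curve of this kind can be measured with a single pass over the generated pairs. The sketch below is illustrative only: `distinct_counts` is a hypothetical helper, and it assumes each generated pair has already been reduced to a signature string. It is not the code App 3 uses.

```python
from typing import Dict, Iterable


def distinct_counts(signatures: Iterable[str],
                    checkpoints: Iterable[int]) -> Dict[int, int]:
    """Return {sample size: number of distinct signatures seen so far}."""
    wanted = set(checkpoints)
    seen = set()
    results = {}
    for n, signature in enumerate(signatures, start=1):
        seen.add(signature)
        if n in wanted:
            # Record the running distinct count at this sample size.
            results[n] = len(seen)
    return results


# Toy input; real signatures would be the concatenated field patterns.
sample = ["a", "b", "a", "c", "a", "b"]
print(distinct_counts(sample, [2, 4, 6]))
# prints: {2: 2, 4: 3, 6: 3}
```

Checkpoints larger than the number of signatures supplied are simply absent from the result.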

Figure 1: Number of distinct signatures in a collection of generated pairs

Furthermore, the overlap of generated signatures with the set of reference signatures grows even more slowly. Figure 2 shows that the overlap increases logarithmically, from about 25 for a sample size of 10k to about 92 for a sample size of 1M. Extrapolating from these results, it would take a sample size of about 9M to generate all 224 distinct signatures in the reference sample.
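
The overlap for a single sample is a set intersection. The following sketch assumes both collections are available as signature strings; `reference_coverage` is a hypothetical name chosen for illustration.

```python
from typing import Iterable, Set, Tuple


def reference_coverage(generated: Iterable[str],
                       reference: Iterable[str]) -> Tuple[int, float, Set[str]]:
    """Return (overlap count, fraction of reference covered, missing signatures)."""
    ref = set(reference)
    if not ref:
        raise ValueError("reference set must not be empty")
    hit = ref & set(generated)
    return len(hit), len(hit) / len(ref), ref - hit


# Toy data: two of the four reference signatures appear in the sample.
overlap, fraction, missing = reference_coverage(
    ["a", "b", "b", "x"], ["a", "b", "c", "d"])
print(overlap, fraction, sorted(missing))
# prints: 2 0.5 ['c', 'd']
```

Returning the missing signatures as well as the count makes it easy to see which reference patterns the generator has not yet produced.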

Figure 2: Overlap of generated signatures with reference signatures

Of course, generating 9M pairs in order to reproduce a reference set of 224 signatures is a particularly inefficient way to do things. If nothing else, the process is painfully slow: on a laptop with a 2 GHz Intel Core i7 processor, it would take App 3 about an hour to generate 9M pairs, and most of these pairs would be very different from the pairs in the reference set.

To get a sense of how the generated pairs differ from the reference pairs, and therefore where time is being wasted, it helps to boil the signatures discussed above down into a few broader categories. We will do the following:

Ignore correlations in the vaccination-related fields.
Ignore correlations in fields that match exactly.
Lump together Soundex, Nysiis, EditDistance and Jaro-Winkler correlations into a single category called approximate match.
Count just the number of fields that are approximate matches or completely different.
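
The four steps above can be sketched as a single function over one signature. This is a minimal illustration, assuming a signature is held as a mapping from the field names in the table header to 6-character patterns. The name `count_nonexact_fields` is hypothetical, and treating null patterns and any pattern not in the legend as "not counted" is an assumption of the sketch.

```python
from typing import Mapping

# Field names taken from the table header; these four are vaccination-related.
VACCINATION_FIELDS = {"VacCod", "VacDat", "VacMfr", "VacNam"}


def is_approximate(bits: str) -> bool:
    """0xxxx0 with at least one of the four middle predicates firing."""
    return bits[0] == "0" and bits[5] == "0" and "1" in bits[1:5]


def count_nonexact_fields(signature: Mapping[str, str]) -> int:
    """Count fields that are approximate matches or completely different."""
    count = 0
    for field, bits in signature.items():
        if field in VACCINATION_FIELDS:
            continue  # step 1: ignore vaccination-related fields
        if is_approximate(bits) or bits == "000001":
            count += 1  # steps 3 and 4: one lumped category, just counted
        # Exact matches (step 2), nulls, and unlisted patterns are not counted.
    return count


signature = {
    "DOB": "111110", "FirstN": "111110", "LastN": "000001", "Middle": "111110",
    "MomFN": "111110", "MomLN": "000001", "Maiden": "111110", "MomMN": "111110",
    "Sex": "111110", "Suffix": "000000", "VacCod": "010100", "VacDat": "011000",
    "VacMfr": "000001", "VacNam": "000001",
}
print(count_nonexact_fields(signature))
# prints: 2
```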

If we make these drastic simplifications, and then compare the reference pairs to a set of pairs that are randomly generated by App 3, we find statistics similar to the ones shown in Figure 3. The statistics show that App 3 spends about half (49%) of its time generating pairs that are identical or that differ only by nulls, whereas only 7% of the reference pairs are identical or differ only by nulls. Conversely, App 3 spends too little of its time (10%) generating pairs in which 2 or 3 fields approximately match or are completely different, whereas these kinds of pairs account for 42% of the reference pairs.
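
Percentages of this kind follow from a simple tally of the per-pair counts. The sketch below assumes each pair has already been reduced to the number of fields that are approximate matches or completely different. The helper name `percent_by_count` and the cap of 3 for the top bucket are hypothetical choices, not taken from Figure 3.

```python
from collections import Counter
from typing import Dict, Iterable


def percent_by_count(counts: Iterable[int], cap: int = 3) -> Dict[int, float]:
    """Return {bucket: percentage of pairs}, folding counts above cap into cap."""
    buckets = Counter(min(c, cap) for c in counts)
    total = sum(buckets.values())
    if total == 0:
        raise ValueError("no pairs supplied")
    return {k: 100.0 * buckets[k] / total for k in range(cap + 1)}


# Toy input, not data from the reference set or from App 3.
print(percent_by_count([0, 0, 1, 2, 5]))
# prints: {0: 40.0, 1: 20.0, 2: 20.0, 3: 20.0}
```

Running the same function over the reference pairs and over a generated sample gives two distributions that can be compared bucket by bucket.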

Figure 3: Percentages of pairs with fields that are approximate matches or completely different
