Previous detail | Up | Next detail |
Step 7 overview | Step 7.2 |
The following table shows the 20 most common signatures of MATCH pairs in the current IIS Deduplication Test Kit. The first column (CNT) is the number of times that a signature occurs. Each of the next 14 columns shows the predicate firing pattern for the field named in the column heading. Within each pattern, the predicates are listed in the order described in section 3 of Step 6.
REFERENCE PAIR CORRELATIONS (6 predicates/field)

Predicate order for each field:
  ExactMatch, Nysiis, Soundex, EditDistance, Jaro-Winkler, Difference

  111110  Exact match (strings with at least 3 characters)
  110100  Exact match (strings with fewer than 3 characters, including empty strings)
  0xxxx0  Approximate match
  000001  Different
  000000  Null in one or both values

CNT DOB    FirstN LastN  Middle MomFN  MomLN  Maiden MomMN  Sex    Suffix VacCod VacDat VacMfr VacNam
=== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ====== ======
 5: 011110 111110 111110 111110 111110 111110 111110 111110 111110 000000 010100 011000 000001 000001
 3: 111110 111110 111110 111110 111110 111110 111110 000000 111110 000000 111110 111110 111110 111110
 3: 111110 111110 011110 000000 111110 011110 111110 111110 111110 000000 010100 011000 000001 000001
 3: 111110 011110 111110 000000 111110 111110 111110 111110 111110 000000 010100 011000 000001 000001
 3: 111110 000001 111110 000000 111110 111110 111110 111110 111110 000000 010100 011000 000001 000001
 2: 111110 111110 111110 000000 111110 111110 111110 000000 111110 000000 111110 111110 111110 111110
 2: 111110 111110 011110 111110 111110 011110 111110 111110 111110 000000 010100 011000 000001 000001
 2: 111110 111110 000001 111110 111110 111110 111110 111110 111110 000000 111110 011100 111110 111110
 2: 111110 111110 000001 111110 111110 111110 111110 111110 111110 000000 010100 011000 111110 000001
 2: 111110 111110 000001 111110 111110 000001 111110 111110 111110 000000 010100 011000 000001 000001
 2: 111110 010110 111110 111110 111110 111110 111110 111110 111110 000000 111110 011000 111110 111110
 2: 111110 000010 111110 000000 111110 111110 111110 111110 111110 000000 010100 011000 111110 000001
 2: 111110 000010 111110 000000 111110 111110 111110 111110 111110 000000 010100 011000 000001 000001
 2: 111110 000001 111110 000000 111110 111110 111110 111110 111110 000000 011100 011000 000001 000001
 2: 111110 000001 111110 000000 111110 111110 111110 000000 111110 000000 010100 011000 000001 000001
 2: 011110 111110 111110 111110 111110 111110 000000 000000 111110 000000 011100 011000 000001 000001
 2: 011110 111110 111110 111110 011100 111110 111110 000000 111110 000000 010100 011000 000001 000001
 2: 011110 111110 111110 000000 111110 111110 111110 000000 111110 000000 111110 011000 111110 111110
 1: 111110 111110 111110 111110 111110 111110 111110 111110 111110 000000 111110 111110 000001 111110
 1: 111110 111110 111110 111110 111110 111110 111110 111110 111110 000000 010100 111110 111110 000001
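To make the legend concrete, here is a minimal Python sketch that decodes one table row into per-field categories. The function names `classify_field` and `decode_row`, the category labels, and the catch-all "other" label are assumptions made for illustration; they are not part of the Test Kit or the CDC Examples.

```python
import re

# Field order taken from the table heading above.
FIELDS = ["DOB", "FirstN", "LastN", "Middle", "MomFN", "MomLN", "Maiden",
          "MomMN", "Sex", "Suffix", "VacCod", "VacDat", "VacMfr", "VacNam"]


def classify_field(bits):
    """Map a six-character firing pattern to a legend category (hypothetical helper)."""
    if not re.fullmatch(r"[01]{6}", bits):
        raise ValueError(f"expected six 0/1 characters, got {bits!r}")
    # Check the specific patterns first: 000000 would otherwise match 0xxxx0.
    if bits == "000000":
        return "null"
    if bits == "000001":
        return "different"
    if bits in ("111110", "110100"):
        return "exact"
    if bits[0] == "0" and bits[5] == "0":
        return "approximate"
    return "other"  # assumption: label for patterns the legend does not cover


def decode_row(row):
    """Split a table row into its count and a field -> category mapping."""
    count, *patterns = row.split()
    if len(patterns) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} patterns, got {len(patterns)}")
    categories = {f: classify_field(p) for f, p in zip(FIELDS, patterns)}
    return int(count.rstrip(":")), categories


# First row of the table above.
first_row = ("5: 011110 111110 111110 111110 111110 111110 111110 "
             "111110 111110 000000 010100 011000 000001 000001")
count, categories = decode_row(first_row)
print(count, categories["DOB"], categories["Suffix"], categories["VacMfr"])
# prints: 5 approximate null different
```

Read this way, the most common reference signature is a pair whose names agree exactly, whose birth dates match only approximately, and whose vaccine manufacturer and vaccine name differ.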
In all, this initial set of predicates produces 224 distinct signatures for the 244 MATCH pairs in the reference set. In contrast, App 3 of the CDC Examples, which is configured to generate pairs as described in Steps 1 through 6 so far, generates many more distinct signatures. As one might expect, the number of distinct signatures increases with the number of generated pairs, although for a given number of pairs the count varies somewhat from run to run. The growth is sublinear. For example, the following chart shows that for one trial run of App 3, the number of signatures grew from 161 for 200 pairs to 119,668 for 1M pairs.
Number of distinct signatures in a collection of generated pairs
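A curve like this one can be tallied with a running set. The sketch below assumes each generated pair has already been reduced to a signature string; `distinct_signature_counts` is a hypothetical helper, not something App 3 provides, and the sample data is a toy list rather than real output.

```python
def distinct_signature_counts(signatures, checkpoints):
    """Return {n: number of distinct signatures among the first n pairs}."""
    wanted = set(checkpoints)
    seen = set()
    counts = {}
    for n, signature in enumerate(signatures, start=1):
        seen.add(signature)
        if n in wanted:
            counts[n] = len(seen)
    return counts


# Toy stand-ins for real signatures such as "111110 011110 ... 000001".
toy_signatures = ["A", "B", "A", "C", "B", "D"]
print(distinct_signature_counts(toy_signatures, [2, 4, 6]))
# prints: {2: 2, 4: 3, 6: 4}
```

Because repeated signatures add nothing to the set, the count can only grow as fast as the number of pairs and usually grows more slowly, which is the sublinear behavior described above.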
Furthermore, the overlap of generated signatures with the set of reference signatures grows even more slowly. Figure 2 shows that the overlap increases logarithmically, from about 25 for a sample size of 10k to about 92 for a sample size of 1M. Extrapolating from these results, it would take a sample size of about 9M to generate all 224 distinct signatures in the reference sample.
Overlap of generated signatures with reference signatures
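The overlap itself is a set intersection. As a rough sketch, again assuming signatures are plain strings and using a hypothetical helper name, the following reports how many reference signatures a generated sample has reproduced and which ones are still missing.

```python
def reference_overlap(generated, reference):
    """Return (number of reference signatures found, sorted list of missing ones)."""
    reference_set = set(reference)
    found = reference_set & set(generated)
    missing = sorted(reference_set - found)
    return len(found), missing


# Toy data: four reference signatures, five generated pairs.
toy_reference = ["A", "B", "C", "D"]
toy_generated = ["A", "X", "A", "C", "Y"]
print(reference_overlap(toy_generated, toy_reference))
# prints: (2, ['B', 'D'])
```

Listing the missing signatures, rather than only counting the overlap, shows which kinds of reference pairs the generator rarely or never produces.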
Of course, generating 9M pairs in order to reproduce a reference set of 224 signatures is a particularly inefficient way to do things. If nothing else, the process is stupidly slow: on a laptop with a 2 GHz Intel Core i7 processor, it would take App3 about an hour to generate 9M pairs, and most of these pairs would be very different from the pairs in the reference set.
To get a sense of how the generated pairs differ from the reference pairs, and therefore where time is being wasted, it helps to boil down the signatures discussed above into a few broader categories. We will do the following:
If we make these drastic simplifications, and then compare the reference pairs to a set of pairs that are randomly generated by App3, we find statistics similar to the ones shown in Figure 3. The statistics show that App3 is spending about half (49%) of its time generating pairs that are identical or that differ only by nulls, whereas only 7% of the reference pairs are identical or differ only by nulls. Conversely, App3 is spending too little of its time (10%) generating pairs in which 2 or 3 fields approximately match or are completely different, whereas these kinds of pairs account for 42% of the reference pairs.
Percentages of pairs with fields that are approximate matches or completely different
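A breakdown of this kind could be computed along the following lines. This is only a sketch under stated assumptions: it classifies each field with the legend from the signature table, treats every pattern other than exact or null as an approximate match, and counts how many fields per pair are approximate or different. The exact simplifications used for Figure 3 may differ, and the helper names and toy three-field signatures are made up for illustration.

```python
from collections import Counter


def field_category(bits):
    """Collapse a six-character firing pattern into a broad category."""
    if bits == "000000":
        return "null"
    if bits == "000001":
        return "different"
    if bits in ("111110", "110100"):
        return "exact"
    return "approximate"  # assumption: everything else counts as approximate


def mismatch_count(signature):
    """Number of fields that approximately match or are completely different."""
    return sum(field_category(bits) in ("approximate", "different")
               for bits in signature.split())


def mismatch_distribution(signatures):
    """Percentage of pairs for each number of mismatching fields."""
    tally = Counter(mismatch_count(s) for s in signatures)
    total = sum(tally.values())
    return {k: 100.0 * v / total for k, v in sorted(tally.items())}


# Toy three-field signatures, not real Test Kit data.
toy_pairs = [
    "111110 000000 111110",  # identical apart from a null: 0 mismatches
    "111110 011110 000001",  # one approximate, one different: 2 mismatches
    "111110 111110 111110",  # identical: 0 mismatches
    "000001 111110 111110",  # one different: 1 mismatch
]
print(mismatch_distribution(toy_pairs))
# prints: {0: 50.0, 1: 25.0, 2: 25.0}
```

Running the same tally over the reference pairs and over a generated sample gives two distributions that can be compared bucket by bucket, with the zero bucket corresponding to pairs that are identical or differ only by nulls.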