Previous stepUpNext step
Step 3Tutorial overviewStep 5

Step 4: Create pair generators

This part of the tutorial describes how to create pair generators using a record generator. The goal is to create generators that produce three types of pairs:

  • Pairs in which both records correspond to the same real-world entity, even when some field values are different between the records.
  • Pairs in which the records correspond to different real-world entities, even when some field values are the same.
  • Pairs in which one or both records have an ambiguous correspondence to a real-world entity.

It is important to note that these definitions are at best incomplete, if not inconsistent, because they depend on a context, the "real world", which isn't quantified here. For example, it might seem intuitive to define some subset of fields as indispensable for the task of matching two children in an immunization registry. Suppose that first name, last name, date of birth and mother's maiden name were considered to be "core" fields; that is, if two records match exactly on all four of these fields, then the records correspond to the same child. As intuitive and clear as this criterion is, it is not valid if an immunization registry monitors a large enough population. There are documented cases, for at least one large metropolitan registry, in which two different children have the same first and last names, their mothers have the same maiden and last names, and the children were born on the same day in the same hospital. Most immunization registries have similar war stories about children who are difficult to distinguish.

We could take the approach that 8 fields are necessary and sufficient to distinguish the children in an immunization registry:

  • All name fields for a child (first, last and middle) excluding suffix
  • The child's date of birth
  • All name fields for the child's mother (first, last, middle and maiden)

There are two obvious problems with this approach: one that is a low-probability problem, and another that can be addressed in the next step, Step 5, when we start modifying generated pairs. The first, low-probability problem is that even these eight fields may be insufficient to distinguish children in some situations. For example, if an immunization registry were to cover a sufficiently large population, there might be identity collisions among children who have names that are relatively common for that population and whose mothers also have relatively common names. This is a low-probability issue that we will ignore. If two records match on all eight of the fields specified above, and all eight fields are non-null and valid (a criterion we'll leave undefined for now), then we'll say that, by definition, the two records represent the same child.
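The eight-field rule above can be sketched in a few lines of Python. This is only an illustration, not part of the tutorial's tooling: the field names in `CORE_FIELDS` are hypothetical, and `is_valid` is a stand-in that merely rejects null or blank values, since the tutorial leaves "valid" undefined.

```python
from typing import Mapping, Optional

# Hypothetical field names; the tutorial does not define a schema here.
CORE_FIELDS = (
    "child_first", "child_middle", "child_last",
    "child_dob",
    "mother_first", "mother_middle", "mother_last", "mother_maiden",
)

Record = Mapping[str, Optional[str]]


def is_valid(value: Optional[str]) -> bool:
    """Placeholder validity check (an assumption): non-null and non-blank."""
    return value is not None and value.strip() != ""


def same_child_by_definition(a: Record, b: Record) -> bool:
    """True only if all eight core fields are valid on both records and equal."""
    return all(
        is_valid(a.get(field)) and is_valid(b.get(field)) and a[field] == b[field]
        for field in CORE_FIELDS
    )
```

Note that a single null or blank core field on either record makes the function return `False`. That does not mean the records differ, only that the by-definition rule does not apply to them.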

The second problem with this approach is that real immunization registries almost never receive records in which all eight of these fields are non-null and valid for any child. From this perspective, the 8-field definition is useless, since there would be few distinguishable children in any registry. This is a valid criticism of our 8-field criterion, but it can be fudged by agreeing that certain correlations of field values are extremely high-probability markers for matches, differs or holds. We could make this solution precise by separating the issue of generating realistic correlations from the issue of deciding whether those correlations represent matches, differs or holds.

For now, we'll take the imprecise but useful approach of defining certain field combinations as nearly certain matches, differs or holds. In the next step, we'll introduce data-omission and data-entry errors to increase the number of field correlations that we can generate. As data-omission and data-entry errors are introduced, it will become less intuitive whether the modified pairs represent matches, differs or holds. We'll have to add a new category of agreement, INDETERMINATE, to model the increasing uncertainty about the modified pairs. In the end, with enough modification, all pairs become INDETERMINATE examples of matches, differs or holds. But that's OK, because what we're really after are examples of field correlations that match the correlations in real data.

In other words, we will use the three categories of pairs described above as a way of organizing our work, but then throw away the decisions that we attach to the synthetic pairs that we produce.

How does one create examples of pairs that fall into the three categories described above? By telling stories. People are very good at recognizing when pairs fall into any of these categories, and they can usually explain their reasoning through short "stories" about the pairs that they evaluate. For example, a pair might be marked as a differ "because those records are brothers", or another pair might be marked as a hold "because even though the children's names match, the mother's name is missing on one of the records and that's a pretty common child's name." We'll reverse this pattern, by using stories to specify which fields in a pair of records should match or not in order for the pair to represent a match, differ or hold.
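One way to picture a story used in reverse is as a small specification: which fields should differ between the two records, which should be missing on one of them, and which decision the story implies. The sketch below is a minimal illustration under assumptions of our own. The `Story` class, the field names and the `make_pair` helper are all hypothetical, and they are not the generator API that this tutorial uses.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional, Tuple

Record = Dict[str, Optional[str]]


@dataclass(frozen=True)
class Story:
    """Hypothetical spec: how the second record deviates from the first."""
    name: str
    decision: str                              # "match", "differ" or "hold"
    differing: FrozenSet[str] = frozenset()    # fields given different values
    missing: FrozenSet[str] = frozenset()      # fields blanked on one record


# "Those records are brothers": same family fields, different child fields.
BROTHERS = Story(
    "brothers", "differ",
    differing=frozenset({"child_first", "child_dob"}),
)

# "Names match, but the mother's name is missing on one record."
COMMON_NAME_MOTHER_MISSING = Story(
    "common name, mother missing", "hold",
    missing=frozenset({"mother_first", "mother_last", "mother_maiden"}),
)


def make_pair(base: Record, other: Record, story: Story) -> Tuple[Record, Record, str]:
    """Build a pair from `base`, borrowing differing values from `other`."""
    second = dict(base)
    for field in story.differing:
        second[field] = other[field]
    for field in story.missing:
        second[field] = None
    return dict(base), second, story.decision
```

Here `base` and `other` would be two independently generated records. Every field that a story does not mention is copied unchanged, so the pair agrees on those fields.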
