Overview

Figure 1 shows the data flow in a data generator.

Figure 1: Data flow in a data generator

In the first part of the data flow, records and groups of records are generated. In the next part, pairs of records are generated and each pair is labeled with a generation decision:

  • Match

    The records in the pair represent the same real-world entity. In the CDC example, the records represent the same person.

  • Differ

    The records in the pair represent different real-world entities.

  • Hold

    The records in the pair are too ambiguous, or contain too little data, to determine whether they represent the same real-world entity. In the CDC example, a record that contains only a first and last name usually does not identify a unique person, because it is not uncommon for two people in a population to share the same name (unless the name is very unusual).
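The three generation decisions can be pictured with a minimal Python sketch. The `Record` fields, the `entity_id` bookkeeping, and the rule that a record without a birth date is too sparse to decide on are all assumptions made for illustration; they are not taken from the data generator itself.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class GenerationDecision(Enum):
    MATCH = "match"
    DIFFER = "differ"
    HOLD = "hold"


@dataclass(frozen=True)
class Record:
    # Hypothetical id of the real-world entity this record was generated from.
    entity_id: int
    first_name: str
    last_name: str
    # None means the field was not generated for this record.
    birth_date: Optional[str] = None


def label_generated_pair(a: Record, b: Record) -> GenerationDecision:
    """Assign a generation decision to a pair of generated records."""
    # Assumed rule: a name alone is too little data to decide on.
    if a.birth_date is None or b.birth_date is None:
        return GenerationDecision.HOLD
    # The generator knows which entity each record came from.
    if a.entity_id == b.entity_id:
        return GenerationDecision.MATCH
    return GenerationDecision.DIFFER
```

Because the generator knows which entity each record was derived from, the match and differ labels need no comparison of field values. Only the hold case depends on how much data the records carry.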

After pairs are generated and labeled, they may be modified to introduce data entry errors. (This step is not shown as a distinct element in the diagram.) After a pair is modified, it is labeled with a modification decision:

  • Match

    The records in the pair still represent the same real-world entity. In the CDC example, this means that the modifications are mild enough that a human reviewer would (or should) still mark the pair as representing the same person.

  • Differ

    The records in the pair still represent different real-world entities.

  • Hold

    The records in the pair are still ambiguous representations of real-world entities.

  • Indeterminate

    The effect of the modification(s) is unclear. The pair might now be a match, a differ, or a hold, depending on how a human reviewer could, would, or should judge whether the records represent the same real-world entity.

After the second part of the data flow, every pair is tagged with both a generation decision and a modification decision. The two kinds of decision are distinct, and both are preserved on a synthesized pair.
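One way to picture a pair that carries both decisions is the following Python sketch. The class and function names, the character transposition used as a stand-in for a data entry error, and the `mild` flag that chooses between carrying the generation decision over and marking the pair indeterminate are hypothetical; the source does not describe how the generator decides this.

```python
from dataclasses import dataclass, replace
from enum import Enum
from typing import Dict, Optional


class GenerationDecision(Enum):
    MATCH = "match"
    DIFFER = "differ"
    HOLD = "hold"


class ModificationDecision(Enum):
    MATCH = "match"
    DIFFER = "differ"
    HOLD = "hold"
    INDETERMINATE = "indeterminate"


@dataclass(frozen=True)
class SynthesizedPair:
    left: Dict[str, str]
    right: Dict[str, str]
    generation: GenerationDecision
    # Unset until the pair has passed through the modification step.
    modification: Optional[ModificationDecision] = None


def transpose(value: str, i: int) -> str:
    """Swap characters i and i + 1 to simulate a keying error."""
    if i < 0 or i + 1 >= len(value):
        return value
    chars = list(value)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def modify_pair(pair: SynthesizedPair, field: str, i: int,
                mild: bool = True) -> SynthesizedPair:
    """Return a modified copy; the generation decision is left untouched."""
    right = dict(pair.right)
    right[field] = transpose(right[field], i)
    if mild:
        # Assumption: a mild edit does not change what a reviewer would decide.
        decision = ModificationDecision[pair.generation.name]
    else:
        decision = ModificationDecision.INDETERMINATE
    return replace(pair, right=right, modification=decision)
```

Keeping the two decisions in separate fields, rather than overwriting one with the other, means a later step can still tell what the pair was generated as, regardless of how the modification was judged.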

In the last step of the data flow, the correlations within the synthesized pairs may be compared to the correlations found in real data records, and the synthesized pairs may be filtered based on that comparison. The correlations are measured and compared using a correlation library from the ChoiceMaker project. In addition, whether or not the pairs are filtered, the synthesized pairs are usually stored in some persistent form, such as entries in a database, a text file (for example, CSV or XML), or a binary file such as a spreadsheet.
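As an illustration of the persistence step only, the sketch below writes pairs to CSV with Python's standard `csv` module and reads them back. The column names are hypothetical and do not reflect the generator's actual output format; the correlation comparison is not shown because it relies on the ChoiceMaker library.

```python
import csv
from typing import Dict, Iterable, List, TextIO

# Hypothetical column layout: both decisions are stored with each pair.
FIELDNAMES = [
    "pair_id",
    "generation_decision",
    "modification_decision",
    "left_last_name",
    "right_last_name",
]


def write_pairs_csv(rows: Iterable[Dict[str, str]], out: TextIO) -> None:
    """Write one CSV row per synthesized pair, preceded by a header row."""
    writer = csv.DictWriter(out, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(rows)


def read_pairs_csv(src: TextIO) -> List[Dict[str, str]]:
    """Read pairs back as dictionaries keyed by column name."""
    return [dict(row) for row in csv.DictReader(src)]
```

When writing to a real file rather than an in-memory buffer, open it with `newline=""`, as the `csv` module documentation recommends.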