Welcome


This tutorial describes how to set up a duplicate generator based on data from the Centers for Disease Control and Prevention (CDC) Deduplication Test Case for immunization information systems. The test data can be downloaded from the CDC site, and it is also included as the DupTestData.csv file in the resources directory of the adg-cdc1-example project. All the code described in this tutorial is part of that project (except that the package names have been changed from net.sf.adatagenerator.ex.cdc.* to net.sf.adatagenerator.ex.cdc1.*).

Seven steps

At a high level, setting up a generator involves seven steps:

  • Step 1: Define what a record is

    The first step is to define the interface and implementation of a Java bean that represents a data record. It is usually necessary to extend this definition with an interface that includes fields specific to the process of data synthesis.

  • Step 2: Create field generators

    The second step is to create generators for the field values of records.

  • Step 3: Create a record generator

    In the third step, field generators are combined into a record generator.

  • Step 4: Create pair generators

    The fourth step is to create generators for pairs of records. A good way to design pair generators is to think of "stories" for how records might be different and yet represent the same person, or conversely, for how records might be similar and yet represent different persons.

  • Step 5: Create error generators

    In the fifth step, data-entry errors are generated among the pairs of records. As with creating pairs, a good way to create error generators is to think of stories about what could go wrong when a record for a child is created.

  • Step 6: Measure field correlations

    The sixth step is to measure the correlations of field values among the records of a pair, and to do this for all the generated pairs. In contrast to the previous step, in which the task was to figure out what could go wrong, the task here is to measure what did go wrong.

  • Step 7: Compare the generated pairs to real data

    The last step is to compare the correlations of the generated pairs to correlations in real data (or the CDC test data, in this case) and to select the generated pairs that best match the real data.
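
To make the seven steps concrete before the detailed walkthrough, the following is a minimal, self-contained Java sketch of how they might fit together. Every name in it (DuplicateGeneratorSketch, Person, Pair, choice, dropLastLetter, and so on) is hypothetical and chosen for illustration; none of it is the adatagenerator API or the code in the adg-cdc1-example project. The name lists and the 0.9 reference rate are placeholders, not values taken from the CDC test data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// All names here are hypothetical; this is not the adatagenerator API.
final class DuplicateGeneratorSketch {

    // Step 1: a record is a simple immutable bean.
    static final class Person {
        final String firstName;
        final String lastName;
        final int birthYear;

        Person(String firstName, String lastName, int birthYear) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.birthYear = birthYear;
        }
    }

    // Two records plus a synthesis-specific label saying whether they match.
    static final class Pair {
        final Person left;
        final Person right;
        final boolean samePerson;

        Pair(Person left, Person right, boolean samePerson) {
            this.left = left;
            this.right = right;
            this.samePerson = samePerson;
        }
    }

    // Step 2: a field generator picks one value for one field.
    static Supplier<String> choice(Random rng, String... values) {
        return () -> values[rng.nextInt(values.length)];
    }

    // Step 3: a record generator combines field generators.
    static Supplier<Person> personGenerator(Random rng) {
        Supplier<String> first = choice(rng, "Ana", "Ben", "Cara", "Dev");
        Supplier<String> last = choice(rng, "Lee", "Ortiz", "Smith");
        return () -> new Person(first.get(), last.get(), 2000 + rng.nextInt(10));
    }

    // Step 5: an error generator; this one drops the last letter of the first name.
    static Person dropLastLetter(Person p) {
        String name = p.firstName.substring(0, p.firstName.length() - 1);
        return new Person(name, p.lastName, p.birthYear);
    }

    // Step 4: the "same child entered twice" story, with an error applied to the copy.
    static List<Pair> duplicatePairs(Supplier<Person> people, UnaryOperator<Person> error, int n) {
        List<Pair> pairs = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            Person original = people.get();
            pairs.add(new Pair(original, error.apply(original), true));
        }
        return pairs;
    }

    // Step 6: measure how often a field agrees across the two records of a pair.
    static double firstNameAgreement(List<Pair> pairs) {
        long agree = pairs.stream()
                .filter(p -> p.left.firstName.equals(p.right.firstName))
                .count();
        return pairs.isEmpty() ? 0.0 : (double) agree / pairs.size();
    }

    // Step 7: keep the batch whose measured rate is closest to a reference rate.
    static List<Pair> closestBatch(List<List<Pair>> batches, double referenceRate) {
        List<Pair> best = null;
        double bestGap = Double.MAX_VALUE;
        for (List<Pair> batch : batches) {
            double gap = Math.abs(firstNameAgreement(batch) - referenceRate);
            if (gap < bestGap) {
                best = batch;
                bestGap = gap;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Supplier<Person> people = personGenerator(new Random(42));
        List<Pair> clean = duplicatePairs(people, UnaryOperator.identity(), 100);
        List<Pair> typos = duplicatePairs(people, DuplicateGeneratorSketch::dropLastLetter, 100);
        // 0.9 is a placeholder reference rate, not a figure from the CDC test data.
        List<Pair> best = closestBatch(Arrays.asList(clean, typos), 0.9);
        System.out.println("Agreement in selected batch: " + firstNameAgreement(best));
    }
}
```

In this toy setup the error-free batch always agrees on first name and the truncated batch never does, so the selection in main picks the error-free batch. A realistic generator would mix several pair stories and error types and would compare many fields, not a single agreement rate.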
