This tutorial describes how to set up a duplicate generator based on data from the Centers for Disease Control (CDC) Deduplication Test Case for immunization information systems. The test data may be downloaded from the CDC site, but it is also included as the DupTestData.csv file in the resources directory of the adg-cdc1-example project. All the code described in this tutorial is part of that project (except that the package names have been changed from net.sf.adatagenerator.ex.cdc.* to net.sf.adatagenerator.ex.cdc1.*).
At a high level, setting up a generator involves seven steps:
The first step is to define the interface and implementation of a Java bean that represents a data record. It is usually necessary to extend this definition with an interface that includes fields specific to the process of data synthesis.
The second step is to create generators for the field values of records.
In the third step, field generators are combined into a record generator.
The fourth step is to create generators for pairs of records. A good way to design pair generators is to think of "stories" for how records might be different and yet represent the same person, or conversely, for how records might be similar and yet represent different persons.
In the fifth step, data-entry errors are generated among the pairs of records. As with creating pairs, a good way to design error generators is to think of stories about what could go wrong when a record for a child is created.
The sixth step is to measure the correlations of field values among the records of a pair, and to do this for all the generated pairs. In contrast to the previous step, in which the task was to figure out what could go wrong, the task here is to measure what did go wrong.
The last step is to compare the correlations of the generated pairs to correlations in real data (or the CDC test data, in this case) and to select the generated pairs that best match the real data.
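To make the seven steps concrete, here is a minimal, self-contained Java sketch of how they might fit together. It does not use the adatagenerator API. Every name in it (PipelineSketch, Patient, PatientBean, fieldGenerator, and so on) is hypothetical, the two-field record is not the CDC schema, and a simple last-name agreement rate stands in for the correlation measures that the later steps describe.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// Illustrative only: all names here are hypothetical, not part of adatagenerator.
final class PipelineSketch {

    // Step 1: bean interface and implementation for a (much simplified) record.
    interface Patient {
        String getFirstName();
        String getLastName();
    }

    static final class PatientBean implements Patient {
        private final String firstName;
        private final String lastName;
        PatientBean(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
        public String getFirstName() { return firstName; }
        public String getLastName() { return lastName; }
    }

    // Step 2: a field generator that picks a value at random from a fixed list.
    static Supplier<String> fieldGenerator(Random rng, String... values) {
        return () -> values[rng.nextInt(values.length)];
    }

    // Step 3: a record generator built from field generators.
    static Supplier<Patient> recordGenerator(Supplier<String> first, Supplier<String> last) {
        return () -> new PatientBean(first.get(), last.get());
    }

    // Step 4: a pair generator for the story "the same person was entered twice".
    // The second record is the first one passed through an error generator (step 5).
    static Patient[] duplicatePair(Supplier<Patient> records, UnaryOperator<Patient> error) {
        Patient original = records.get();
        return new Patient[] { original, error.apply(original) };
    }

    // Step 5: an error generator that swaps the first two letters of the last name.
    static Patient transposeLastName(Patient p) {
        String s = p.getLastName();
        if (s.length() < 2) {
            return p;
        }
        String swapped = "" + s.charAt(1) + s.charAt(0) + s.substring(2);
        return new PatientBean(p.getFirstName(), swapped);
    }

    // Step 6 (simplified): fraction of pairs whose last names agree exactly.
    static double lastNameAgreement(List<Patient[]> pairs) {
        if (pairs.isEmpty()) {
            return 0.0;
        }
        int agree = 0;
        for (Patient[] pair : pairs) {
            if (pair[0].getLastName().equals(pair[1].getLastName())) {
                agree++;
            }
        }
        return (double) agree / pairs.size();
    }

    // Step 7 (simplified): pick the candidate set of pairs whose agreement rate
    // is closest to a target rate measured on reference data.
    static List<Patient[]> closestTo(double target, List<List<Patient[]>> candidates) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("no candidates");
        }
        List<Patient[]> best = candidates.get(0);
        for (List<Patient[]> candidate : candidates) {
            if (Math.abs(lastNameAgreement(candidate) - target)
                    < Math.abs(lastNameAgreement(best) - target)) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Random rng = new Random(42);
        Supplier<Patient> records = recordGenerator(
                fieldGenerator(rng, "ANNA", "BEN", "CARLA"),
                fieldGenerator(rng, "SMITH", "JONES", "LEE"));
        List<Patient[]> pairs = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            // Corrupt roughly one pair in five; leave the rest as exact duplicates.
            UnaryOperator<Patient> error = UnaryOperator.identity();
            if (rng.nextInt(5) == 0) {
                error = PipelineSketch::transposeLastName;
            }
            pairs.add(duplicatePair(records, error));
        }
        System.out.println("Last-name agreement: " + lastNameAgreement(pairs));
    }
}
```

The sketch keeps each step as a small function so that the stories from the fourth and fifth steps can be swapped in and out independently. Note that it selects among whole candidate sets of pairs rather than among individual pairs, which is a simplification of the last step.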