
Step 2: Create field generators

Field generators are factories for the values held by a record. Typically, a field generator creates random values whose frequencies match the distribution observed in real records. For example, if real records are stored in a SQL database, then one might calculate the relative frequencies of values in a FIRST_NAME column by grouping the values and computing their counts. To avoid low-frequency values that might identify particular patients in a population, one should exclude values that occur fewer times than some threshold, say 100.

SELECT FIRST_NAME, COUNT(*) FROM PERSON
    GROUP BY FIRST_NAME HAVING COUNT(*) >= 100;

In the case of the CDC test data, the records are stored in a CSV file, so one has to either load the records into a SQL database or calculate value frequencies by some other means. The next page briefly describes how to use Linux text utilities to calculate value frequencies on CSV files. The 10 most frequent values for the first name column are shown below, along with the number of times each value appears in that column.

EMILY,12
MICHAEL,12
DANIEL,11
HANNAH,11
SAMANTHA,11
CHRISTOPHER,10
MADISON,10
JOSE,9
SARAH,9
JOSEPH,8
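
As an alternative to a SQL database or text utilities, counts like these could also be computed with a short script. The following is a minimal Python sketch, not part of the tutorial's tooling; the function name `value_frequencies`, the `min_count` parameter, and the in-memory sample data are all hypothetical and exist only for illustration.

```python
import csv
import io
from collections import Counter

def value_frequencies(csv_file, column, min_count=1):
    """Count how often each value appears in one column of a CSV file.

    Values that occur fewer than min_count times are dropped, mirroring
    the HAVING COUNT(*) >= threshold clause of the SQL query.
    """
    reader = csv.DictReader(csv_file)  # assumes the file has a header row
    counts = Counter(row[column] for row in reader)
    # most_common() sorts by descending count
    return [(value, n) for value, n in counts.most_common() if n >= min_count]

# Hypothetical sample data standing in for a real CSV file.
sample = io.StringIO(
    "FIRST_NAME,LAST_NAME\n"
    "EMILY,A\nMICHAEL,B\nEMILY,C\nJOSE,D\nEMILY,E\nMICHAEL,F\n"
)
frequencies = value_frequencies(sample, "FIRST_NAME", min_count=2)
print(frequencies)  # [('EMILY', 3), ('MICHAEL', 2)]
```

For a real file, one would pass an open file object (for example `open(path, newline="")`) in place of the `StringIO` sample.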

A simple way to create a random value generator for this distribution is to store copies of the values in a list, with each value repeated a number of times equal to its count. Choosing values uniformly at random from the list then produces the desired distribution. However, this design wastes memory, because the list grows with the total number of records rather than with the number of distinct values. Instead, we use a class, DefaultFrequencyBasedGenerator, that efficiently provides this kind of functionality. This class is used to create frequency-based field value generators for CDC records.
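
To illustrate how frequency-based sampling can avoid the repeated-value list, here is a minimal Python sketch that stores each value once alongside a running total of counts and then locates a random draw by binary search. This is one common technique and is only an assumption about the general idea; it is not the implementation of DefaultFrequencyBasedGenerator, and the class name `FrequencyBasedGenerator` and method `next_value` are hypothetical.

```python
import bisect
import itertools
import random

class FrequencyBasedGenerator:
    """Hypothetical sketch: draws values in proportion to their counts."""

    def __init__(self, frequencies, rng=None):
        # frequencies: list of (value, count) pairs with positive counts
        self._values = [value for value, _ in frequencies]
        # Running totals, e.g. counts 12, 12, 11 become 12, 24, 35
        self._cumulative = list(
            itertools.accumulate(count for _, count in frequencies)
        )
        self._total = self._cumulative[-1]
        self._rng = rng or random.Random()

    def next_value(self):
        # Pick r with 0 <= r < total, then find the first running total
        # that is strictly greater than r.
        r = self._rng.randrange(self._total)
        return self._values[bisect.bisect_right(self._cumulative, r)]

# Counts taken from the first-name table in this step.
first_names = [
    ("EMILY", 12), ("MICHAEL", 12), ("DANIEL", 11), ("HANNAH", 11),
    ("SAMANTHA", 11), ("CHRISTOPHER", 10), ("MADISON", 10),
    ("JOSE", 9), ("SARAH", 9), ("JOSEPH", 8),
]
generator = FrequencyBasedGenerator(first_names, rng=random.Random(42))
draws = [generator.next_value() for _ in range(10000)]
```

Memory use here is proportional to the number of distinct values, and each draw costs a binary search over the running totals.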

