Examples

The adatagenerator-examples project contains three examples:

Febrl example

adg-febrl-example is a rewritten version of the original Febrl data generator.

More details to follow...

CDC Example 1

adg-cdc1-example is a generator that mimics the duplicated-data test set published by the US Centers for Disease Control and Prevention (CDC) immunization registry project. It runs as a command line application. It generates synthetic CDC pairs for various types of groups and collects statistics on the field correlations within the pairs. A tutorial describes how the CDC1 example was created. There are four example applications in this module:

  • App1

    Shows how to create and connect sources and sinks of pairs.

  • App2

    Shows how to collect statistics on the field correlations within generated pairs.

  • App3

    Provides a more realistic example of pair generation and correlation analysis.

  • DbApp3

    Extracts MATCH pairs from tables of patient records and record pairs in a test database.

App1

App1 shows how to create and connect sources and sinks of pairs. The application generates synthetic CDC pairs from various groups and writes them to two files: an XML file and an Excel spreadsheet.

The application takes two optional command line parameters:

  • Total number of pairs to be generated. Must be non-negative and less than 15,000. (default: 100)
  • Absolute base name of an output file; for example /tmp/samples (default: a temporary file). Note that the extensions .xml and .xls will be appended to the base name.
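The two optional parameters could be resolved roughly as in the following sketch. The class name App1Args, its fields, and the temporary-file prefix are hypothetical and are not taken from the App1 source. Only the default of 100, the bound of 15,000, and the appended .xml and .xls extensions come from the description above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;

// Hypothetical helper; the real App1 may parse its arguments differently.
class App1Args {
    static final int DEFAULT_COUNT = 100;   // default stated in the parameter list
    static final int COUNT_LIMIT = 15_000;  // exclusive upper bound stated in the parameter list

    final int count;
    final String xmlFile;
    final String xlsFile;

    App1Args(int count, String baseName) {
        if (count < 0 || count >= COUNT_LIMIT) {
            throw new IllegalArgumentException(
                "count must be non-negative and less than " + COUNT_LIMIT + ": " + count);
        }
        this.count = count;
        // The extensions are appended to the base name, not substituted.
        this.xmlFile = baseName + ".xml";
        this.xlsFile = baseName + ".xls";
    }

    static App1Args parse(String[] args) {
        int count = args.length > 0 ? Integer.parseInt(args[0]) : DEFAULT_COUNT;
        String base;
        if (args.length > 1) {
            base = args[1];
        } else {
            try {
                // Assumed naming scheme for the default temporary base name.
                base = Files.createTempFile("adg-cdc1-", "").toString();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        return new App1Args(count, base);
    }

    public static void main(String[] args) {
        App1Args parsed = parse(args);
        System.out.println(parsed.count + " pairs -> " + parsed.xmlFile + ", " + parsed.xlsFile);
    }
}
```

With the arguments 250 and /tmp/samples, this sketch resolves the output files to /tmp/samples.xml and /tmp/samples.xls.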

Basic usage:

  java -cp adg-cdc1-example-0.0.1-SNAPSHOT.jar net.sf.adatagenerator.ex.cdc1.app.App1

Figure 1 shows a sample of the Excel format. The XML format is the same as for App2 (see below).

Figure 1

Sample Excel output

App2

App2 generates synthetic CDC pairs for various types of groups and collects statistics on the field correlations within the pairs.

The application takes three optional command line parameters:

  • Total number of pairs to be generated. Must be non-negative and less than MAX_COUNT (default: 100).
  • Absolute name of an output XML file for pairs; for example /tmp/sample.xml (default: a temporary file).
  • Absolute name of an output text file for correlation counts; for example /tmp/sample_counts.txt (default: a temporary file).

Basic usage:

  java -cp adg-cdc1-example-0.0.1-SNAPSHOT.jar net.sf.adatagenerator.ex.cdc1.app.App2

There are two output files: an XML file containing pairs and a text file containing information about correlations among the pairs. The XML file uses ChoiceMaker markup, although other output formats could be used.

<?xml version="1.0" encoding="UTF-8"?>
<ChoiceMakerMarkedRecordPairs>
  <MarkedRecordPair decision="differ">
    <cdc1Record DOB="19961121" FirstName="MAURICE" LastName="HALL-CARLSEN" MiddleName=""
      MomFirst="GABRIELLE" MomLast="FETHERSTON" MomMaiden="PETERSON" MomMiddle="HARDY"
      Sex="F" Suffix="" VacCode="20" VacDate="19961127" VacMfr="WAL" VacName="HIB-HbOC" />
    <cdc1Record DOB="19960116" FirstName="WILLIAM" LastName="NAMY-NANIK-CARLSEN" MiddleName="JOY"
      MomFirst="SUZANNE" MomLast="ABBOTT" MomMaiden="" MomMiddle=""
      Sex="F" Suffix="" VacCode="47" VacDate="19970524" VacMfr="PMC" VacName="IPV" />
  </MarkedRecordPair>
  <MarkedRecordPair decision="match">
    <cdc1Record DOB="19971012" FirstName="sharen" LastName="CANNES" MiddleName="MARIA"
      MomFirst="MARGARET" MomLast="MILTON" MomMaiden="JOHNSON" MomMiddle="KIMBERLY"
      Sex="F" Suffix="" VacCode="2" VacDate="19970205" VacMfr="PMC" VacName="MMR" />
    <cdc1Record DOB="19971012" FirstName="sharon" LastName="CANNES" MiddleName="MARIA"
      MomFirst="MARGARET" MomLast="MILTON" MomMaiden="JOHNSON" MomMiddle="KIMBERLY"
      Sex="F" Suffix="" VacCode="2" VacDate="19970205" VacMfr="PMC" VacName="MMR" />
  </MarkedRecordPair>
  <!-- ... -->
</ChoiceMakerMarkedRecordPairs>
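As a rough illustration of consuming this markup, the sketch below counts pairs by their decision attribute using the standard Java DOM parser. The class DecisionCounter is hypothetical and is not part of adatagenerator or ChoiceMaker. The element and attribute names are the ones visible in the sample above, and the embedded XML is an abbreviated copy of that sample.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical reader for the pair markup shown above.
class DecisionCounter {
    /** Counts MarkedRecordPair elements by the value of their decision attribute. */
    static Map<String, Integer> countDecisions(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList pairs = doc.getElementsByTagName("MarkedRecordPair");
            Map<String, Integer> counts = new TreeMap<>();
            for (int i = 0; i < pairs.getLength(); i++) {
                String decision = ((Element) pairs.item(i)).getAttribute("decision");
                counts.merge(decision, 1, Integer::sum);
            }
            return counts;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse pair markup", e);
        }
    }

    public static void main(String[] args) {
        // Abbreviated copy of the sample; most record attributes are omitted.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
            + "<ChoiceMakerMarkedRecordPairs>"
            + "<MarkedRecordPair decision=\"differ\">"
            + "<cdc1Record FirstName=\"MAURICE\"/><cdc1Record FirstName=\"WILLIAM\"/>"
            + "</MarkedRecordPair>"
            + "<MarkedRecordPair decision=\"match\">"
            + "<cdc1Record FirstName=\"sharen\"/><cdc1Record FirstName=\"sharon\"/>"
            + "</MarkedRecordPair>"
            + "</ChoiceMakerMarkedRecordPairs>";
        System.out.println(countDecisions(xml)); // prints {differ=1, match=1}
    }
}
```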

The counts file has one row for each distinct correlation signature among the pairs. Each signature is preceded by a count of the number of pairs that display the signature.

56: <1111111111111111111111111111111111111110001>
39: <1111111111111111111111111110000000000001111>
13: <1111111111111111111111111110000001110001111>
12: <1111111111111111111111111110001110000001111>
10: <0000001110001110000000000000000000000000001>
 6: <0000001110000000000000000000000000000000001>
 5: <0000001110001110000000000000000001110000001>
 5: <0000001110001110000000001110000000000000001>
 5: <1111111111111111111111111111110000000001111>
 4: <0000001110001110000000000000001110000000001>
 4: <1111111111111111111111111111110001110001111>
 3: <0000001110000000000000000000001110000000001>
 2: <0000000000000000000000000000000001110000001>
 2: <0000001110000000000000000001110000000000001>
 2: <0000001110000000000000001110000000000000001>
 2: <0000001110001110000000000001110000000000001>
 2: <0000001110001110000000000001110001110000001>
 2: <0000001110001110000000001110000001110000001>
 2: <0000001110001110001110000000000000000000001>
 2: <0001111110001110000000000000000001110000001>
 2: <1111111111111111111111111110001111110001111>
 1: <0000000000000000000000001110000000000000001>
 1: <0000000000000000000000001110001110000000001>
 1: <0000000000001110000000000000000000000000001>
 1: <0000000000001110000000000001110000000000001>
 1: <0000001110000000000000000000000001110000001>
 1: <0000001110000000000000000000001111110000001>
 1: <0000001110000000000000000001110001110000001>
 1: <0000001110000000000000001110000001110000001>
 1: <0000001110000000000000001110001110000000001>
 1: <0000001110001110000000000000001111110000001>
 1: <0000001110001110000000000001111111110000001>
 1: <0000001110001110000000001110001110000000001>
 1: <0000001110001110000001111110000000000000001>
 1: <0001111110000000000000000000000000000000001>
 1: <0001111110000000000000000001110000000000001>
 1: <0001111110001110000000000000000000000000001>
 1: <0001111110001110000000000000001110000000001>
 1: <0001111110001110000000000000001111110000001>
 1: <1111111111111111111111111110001110001111111>
 1: <1111111111111111111111111111111110000001111>
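Rows of this shape can be produced by tallying one signature string per pair. The sketch below is a minimal illustration, not the App2 implementation. The class SignatureTally is hypothetical, and the ordering (count descending, then signature ascending) is only inferred from the sample rows above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical tally of correlation signatures, one input string per pair.
class SignatureTally {
    /** Returns one formatted row per distinct signature, most frequent first. */
    static List<String> rows(List<String> signatures) {
        // TreeMap keeps the distinct signatures in ascending order.
        Map<String, Integer> counts = new TreeMap<>();
        for (String signature : signatures) {
            counts.merge(signature, 1, Integer::sum);
        }
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // List.sort is stable, so equal counts keep the ascending signature order.
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> rows = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : entries) {
            rows.add(String.format("%2d: <%s>", entry.getValue(), entry.getKey()));
        }
        return rows;
    }

    public static void main(String[] args) {
        // Toy signatures for illustration only; real signatures are much longer.
        List<String> toy = Arrays.asList("101", "001", "101", "111");
        for (String row : rows(toy)) {
            System.out.println(row);
        }
    }
}
```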

App3

App3 is similar to App2. It generates synthetic CDC pairs for various types of groups and collects statistics on the field correlations within the pairs. The main difference is that App3 uses more realistic record manglers and computes a more realistic correlation signature for pairs.

The application takes three optional command line parameters:

  • Total number of pairs to be generated. Must be non-negative and less than MAX_COUNT (default: 100)
  • Absolute name of an output XML file for pairs; for example /tmp/sample.xml (default: a temporary file).
  • Absolute name of an output text file for correlation counts; for example /tmp/sample_counts.txt (default: a temporary file)

Basic usage:

  java -cp adg-cdc1-example-0.0.1-SNAPSHOT.jar net.sf.adatagenerator.ex.cdc1.app.App3

There are two output files: an XML file containing pairs and a text file containing information about correlations among the pairs. The format of the XML file is identical to the format of the XML file produced by App2. The counts file is similar to the one for App2, except that at the top of the file it prints the predicates used to compute a correlation signature. The predicates are listed in the order in which they are used in the signature computation.

aExactMatchDOB aNysissDOB aSoundexDOB aEditDistanceDOB aJaroDOB aDifferentDOB
aExactMatchFirstName aNysissFirstName aSoundexFirstName aEditDistanceFirstName aJaroFirstName aDifferentFirstName
aExactMatchLastName aNysissLastName aSoundexLastName aEditDistanceLastName aJaroLastName aDifferentLastName
aExactMatchMiddleName aNysissMiddleName aSoundexMiddleName aEditDistanceMiddleName aJaroMiddleName aDifferentMiddleName
aExactMatchMomFirst aNysissMomFirst aSoundexMomFirst aEditDistanceMomFirst aJaroMomFirst aDifferentMomFirst
aExactMatchMomLast aNysissMomLast aSoundexMomLast aEditDistanceMomLast aJaroMomLast aDifferentMomLast
aExactMatchMomMaiden aNysissMomMaiden aSoundexMomMaiden aEditDistanceMomMaiden aJaroMomMaiden aDifferentMomMaiden
aExactMatchMomMiddle aNysissMomMiddle aSoundexMomMiddle aEditDistanceMomMiddle aJaroMomMiddle aDifferentMomMiddle
aExactMatchSex aNysissSex aSoundexSex aEditDistanceSex aJaroSex aDifferentSex
aExactMatchSuffix aNysissSuffix aSoundexSuffix aEditDistanceSuffix aJaroSuffix aDifferentSuffix
aExactMatchVacCode aNysissVacCode aSoundexVacCode aEditDistanceVacCode aJaroVacCode aDifferentVacCode
aExactMatchVacDate aNysissVacDate aSoundexVacDate aEditDistanceVacDate aJaroVacDate aDifferentVacDate
aExactMatchVacMfr aNysissVacMfr aSoundexVacMfr aEditDistanceVacMfr aJaroVacMfr aDifferentVacMfr
aExactMatchVacName aNysissVacName aSoundexVacName aEditDistanceVacName aJaroVacName aDifferentVacName 

1: < 111110 111110 111110 111110 000001 111110 000001 110100 111110 110100 111110 011110 111110 111110 >
1: < 111110 000001 111110 000001 111110 011110 111110 110100 111110 110100 111110 011000 111110 111110 >
1: < 011110 111110 111110 111110 111110 111110 111110 001110 111110 110100 010100 011000 000001 000001 >
1: < 111110 111110 000001 000001 111110 111110 110100 111110 111110 110100 010100 011000 000001 000001 >
1: < 111110 111110 011100 111110 111110 011100 000001 111110 111110 110100 010100 011000 000001 001100 >
...
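Because the predicates are listed in signature order, a row can be decoded back into predicate names by position. The sketch below assumes that a 1 bit means the predicate held for the pair and that the spaces inside the brackets only separate the per-field groups. Both are assumptions, and the class SignatureDecoder is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical decoder for one row of the App3 counts file.
class SignatureDecoder {
    /** Returns the predicates whose bit is 1, in signature order. */
    static List<String> firedPredicates(List<String> predicateNames, String row) {
        int open = row.indexOf('<');
        int close = row.lastIndexOf('>');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("no <...> signature in: " + row);
        }
        // Drop the grouping spaces so each character is one predicate bit.
        String bits = row.substring(open + 1, close).replace(" ", "");
        if (bits.length() != predicateNames.size()) {
            throw new IllegalArgumentException(
                bits.length() + " bits for " + predicateNames.size() + " predicates");
        }
        List<String> fired = new ArrayList<>();
        for (int i = 0; i < bits.length(); i++) {
            if (bits.charAt(i) == '1') { // assumption: 1 means the predicate held
                fired.add(predicateNames.get(i));
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        // Only the DOB predicates and a single bit group, to keep the example short.
        List<String> names = Arrays.asList(
            "aExactMatchDOB", "aNysissDOB", "aSoundexDOB",
            "aEditDistanceDOB", "aJaroDOB", "aDifferentDOB");
        System.out.println(firedPredicates(names, "1: < 011110 >"));
    }
}
```

A full row would be decoded the same way by passing all 84 predicate names (14 fields, 6 predicates each) together with the complete signature.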

DbApp3

DbApp3 performs the same processing as App3 (and App2) except that it uses a MySQL database as a source of pairs, rather than generating them. A script that creates and populates the database is included in the source code for the CDC 1 example.

The application has three modes of operation, which are selected by the first command line parameter:

  • processTestCases

    This is the default mode of operation, which runs if no command line parameter is specified. In this mode, the DbApp3 application pulls pairs from the database and processes them in the same way that App3 does. The output is written to temporary files, which are reported as the application executes.

    Basic usage:

      java -cp adg-cdc1-example-0.0.1-SNAPSHOT.jar \
         -Dhibernate.connection.url="jdbc:mysql://some-host/some-database" \
         -Dhibernate.connection.username="some-mysql-account" \
         -Dhibernate.connection.password="some-password" \
         net.sf.adatagenerator.ex.cdc1.app.DbApp3 processTestCases
    
      OR
    
      java -cp adg-cdc1-example-0.0.1-SNAPSHOT.jar \
         -Dhibernate.connection.url="jdbc:mysql://some-host/some-database" \
         -Dhibernate.connection.username="some-mysql-account" \
         -Dhibernate.connection.password="some-password" \
         net.sf.adatagenerator.ex.cdc1.app.DbApp3
  • listPatients

    In this mode, the DbApp3 application lists the entries in the patient table.

    Basic usage:

      java -cp adg-cdc1-example-0.0.1-SNAPSHOT.jar \
         net.sf.adatagenerator.ex.cdc1.app.DbApp3 listPatients
  • listTestCases

    In this mode, the DbApp3 application lists the entries in the match_test_case table.

    Basic usage:

      java -cp adg-cdc1-example-0.0.1-SNAPSHOT.jar \
         net.sf.adatagenerator.ex.cdc1.app.DbApp3 listTestCases
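The mode selection described above could look roughly like the following sketch. The class DbApp3Mode and its methods are hypothetical and do not reflect the actual DbApp3 source. Only the three mode names, the default mode, and the hibernate.connection.* property names are taken from the text and commands above.

```java
// Hypothetical sketch of mode selection; the real DbApp3 may differ.
class DbApp3Mode {
    // Constant names mirror the command line spellings exactly.
    enum Mode { processTestCases, listPatients, listTestCases }

    /** Picks the mode from the first argument, defaulting to processTestCases. */
    static Mode modeOf(String[] args) {
        if (args.length == 0) {
            return Mode.processTestCases;
        }
        // valueOf throws IllegalArgumentException for an unknown mode name.
        return Mode.valueOf(args[0]);
    }

    /** Summarizes the connection settings passed as -D system properties. */
    static String describeConnection() {
        String url = System.getProperty("hibernate.connection.url", "<not set>");
        String user = System.getProperty("hibernate.connection.username", "<not set>");
        // The password property is deliberately not read or printed here.
        return user + " @ " + url;
    }

    public static void main(String[] args) {
        System.out.println("mode: " + modeOf(args));
        System.out.println("connection: " + describeConnection());
    }
}
```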

CDC Example 2

adg-cdc2-example will be a generator that extends the CDC data set by including patient addresses. This example is still in the planning stages.

More details to follow...