The adatagenerator-examples project contains three examples:
adg-febrl-example is a rewritten version of the original Febrl data generator.
More details to follow...
adg-cdc1-example is a generator that mimics the duplicated data test set published by the immunization registry project of the US Centers for Disease Control and Prevention (CDC). It runs as a command line application. It generates synthetic CDC pairs for various types of groups and collects statistics on the field correlations within the pairs. A tutorial describes how the CDC1 example was created. There are four example applications in this module.
App1: Shows how to create and connect sources and sinks of pairs.
App2: Shows how to collect statistics on the field correlations within generated pairs.
App3: Provides a more realistic example of pair generation and correlation analysis.
DbApp3: Extracts MATCH pairs from tables of patient records and record pairs in a test database.
App1 shows how to create and connect sources and sinks of pairs. The application generates synthetic CDC pairs from various groups and writes them to two files: an XML file and an Excel spreadsheet.
The application takes two optional command line parameters:
Basic usage:
java -cp adg-cdc1-example-0.0.1-SNAPSHOT.jar net.sf.adatagenerator.ex.cdc1.app.App1
Figure 1 shows a sample of the Excel format. The XML format is the same as for App2 (see below).
Figure 1: Sample Excel output
App2 generates synthetic CDC pairs for various types of groups and collects statistics on the field correlations within the pairs.
The application takes three optional command line parameters:
Basic usage:
java -cp adg-cdc1-example-0.0.1-SNAPSHOT.jar net.sf.adatagenerator.ex.cdc1.app.App2
There are two output files: an XML file containing pairs and a text file containing information about correlations among the pairs. The XML file uses ChoiceMaker markup, although other output formats could be used.
<?xml version="1.0" encoding="UTF-8"?>
<ChoiceMakerMarkedRecordPairs>
  <MarkedRecordPair decision="differ">
    <cdc1Record DOB="19961121" FirstName="MAURICE" LastName="HALL-CARLSEN" MiddleName="" MomFirst="GABRIELLE" MomLast="FETHERSTON" MomMaiden="PETERSON" MomMiddle="HARDY" Sex="F" Suffix="" VacCode="20" VacDate="19961127" VacMfr="WAL" VacName="HIB-HbOC" />
    <cdc1Record DOB="19960116" FirstName="WILLIAM" LastName="NAMY-NANIK-CARLSEN" MiddleName="JOY" MomFirst="SUZANNE" MomLast="ABBOTT" MomMaiden="" MomMiddle="" Sex="F" Suffix="" VacCode="47" VacDate="19970524" VacMfr="PMC" VacName="IPV" />
  </MarkedRecordPair>
  <MarkedRecordPair decision="match">
    <cdc1Record DOB="19971012" FirstName="sharen" LastName="CANNES" MiddleName="MARIA" MomFirst="MARGARET" MomLast="MILTON" MomMaiden="JOHNSON" MomMiddle="KIMBERLY" Sex="F" Suffix="" VacCode="2" VacDate="19970205" VacMfr="PMC" VacName="MMR" />
    <cdc1Record DOB="19971012" FirstName="sharon" LastName="CANNES" MiddleName="MARIA" MomFirst="MARGARET" MomLast="MILTON" MomMaiden="JOHNSON" MomMiddle="KIMBERLY" Sex="F" Suffix="" VacCode="2" VacDate="19970205" VacMfr="PMC" VacName="MMR" />
  </MarkedRecordPair>
  <!-- ... -->
</ChoiceMakerMarkedRecordPairs>
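A quick way to sanity-check a generated pairs file is to tally the decision attribute of each MarkedRecordPair element. The following is a minimal sketch using only the JDK's built-in DOM parser; the class name MarkedPairTally and its tally method are hypothetical and are not part of the adatagenerator API. It assumes the element and attribute names shown in the sample above.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of adatagenerator.
class MarkedPairTally {

    /** Counts MarkedRecordPair elements by the value of their decision attribute. */
    static Map<String, Integer> tally(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList pairs = doc.getElementsByTagName("MarkedRecordPair");
            Map<String, Integer> counts = new TreeMap<>();
            for (int i = 0; i < pairs.getLength(); i++) {
                Element pair = (Element) pairs.item(i);
                counts.merge(pair.getAttribute("decision"), 1, Integer::sum);
            }
            return counts;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse pairs XML", e);
        }
    }

    public static void main(String[] args) {
        // Abbreviated records; real output carries the full attribute set.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<ChoiceMakerMarkedRecordPairs>"
                + "<MarkedRecordPair decision=\"differ\">"
                + "<cdc1Record FirstName=\"MAURICE\"/><cdc1Record FirstName=\"WILLIAM\"/>"
                + "</MarkedRecordPair>"
                + "<MarkedRecordPair decision=\"match\">"
                + "<cdc1Record FirstName=\"sharen\"/><cdc1Record FirstName=\"sharon\"/>"
                + "</MarkedRecordPair>"
                + "</ChoiceMakerMarkedRecordPairs>";
        System.out.println(tally(xml)); // prints {differ=1, match=1}
    }
}
```

For a large output file, a streaming parser would likely be a better fit than DOM; DOM is used here only to keep the sketch short.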
The counts file has one row for each distinct correlation signature among the pairs. Each signature is preceded by a count of the number of pairs that display the signature.
56: <1111111111111111111111111111111111111110001>
39: <1111111111111111111111111110000000000001111>
13: <1111111111111111111111111110000001110001111>
12: <1111111111111111111111111110001110000001111>
10: <0000001110001110000000000000000000000000001>
6: <0000001110000000000000000000000000000000001>
5: <0000001110001110000000000000000001110000001>
5: <0000001110001110000000001110000000000000001>
5: <1111111111111111111111111111110000000001111>
4: <0000001110001110000000000000001110000000001>
4: <1111111111111111111111111111110001110001111>
3: <0000001110000000000000000000001110000000001>
2: <0000000000000000000000000000000001110000001>
2: <0000001110000000000000000001110000000000001>
2: <0000001110000000000000001110000000000000001>
2: <0000001110001110000000000001110000000000001>
2: <0000001110001110000000000001110001110000001>
2: <0000001110001110000000001110000001110000001>
2: <0000001110001110001110000000000000000000001>
2: <0001111110001110000000000000000001110000001>
2: <1111111111111111111111111110001111110001111>
1: <0000000000000000000000001110000000000000001>
1: <0000000000000000000000001110001110000000001>
1: <0000000000001110000000000000000000000000001>
1: <0000000000001110000000000001110000000000001>
1: <0000001110000000000000000000000001110000001>
1: <0000001110000000000000000000001111110000001>
1: <0000001110000000000000000001110001110000001>
1: <0000001110000000000000001110000001110000001>
1: <0000001110000000000000001110001110000000001>
1: <0000001110001110000000000000001111110000001>
1: <0000001110001110000000000001111111110000001>
1: <0000001110001110000000001110001110000000001>
1: <0000001110001110000001111110000000000000001>
1: <0001111110000000000000000000000000000000001>
1: <0001111110000000000000000001110000000000001>
1: <0001111110001110000000000000000000000000001>
1: <0001111110001110000000000000001110000000001>
1: <0001111110001110000000000000001111110000001>
1: <1111111111111111111111111110001110001111111>
1: <1111111111111111111111111111111110000001111>
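Because each row of the counts file is a count followed by a bracketed bit string, the file is easy to read back for further analysis. The sketch below is one possible reader, assuming only the "count: <bits>" row layout described above; the class CountsFileSketch and its methods are hypothetical names, not part of the example module. Lines that do not match the row layout are skipped.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of adatagenerator.
class CountsFileSketch {

    // A row looks like "56: <1111...0001>"; spaces inside the brackets are tolerated.
    private static final Pattern ROW =
            Pattern.compile("^\\s*(\\d+):\\s*<([01 ]+)>\\s*$");

    /** Maps each signature (spaces removed) to its pair count, skipping non-row lines. */
    static Map<String, Integer> parse(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = ROW.matcher(line);
            if (m.matches()) {
                counts.put(m.group(2).replace(" ", ""), Integer.parseInt(m.group(1)));
            }
        }
        return counts;
    }

    /** Sums the per-signature counts to get the number of pairs analyzed. */
    static int totalPairs(Map<String, Integer> counts) {
        int total = 0;
        for (int c : counts.values()) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        // First three rows of the sample counts file.
        List<String> lines = List.of(
                "56: <1111111111111111111111111111111111111110001>",
                "39: <1111111111111111111111111110000000000001111>",
                "13: <1111111111111111111111111110000001110001111>");
        Map<String, Integer> counts = parse(lines);
        System.out.println(counts.size() + " signatures, " + totalPairs(counts) + " pairs");
        // prints "3 signatures, 108 pairs"
    }
}
```

To process a real counts file, the lines could be loaded with java.nio.file.Files.readAllLines and passed to parse unchanged.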
App3 is similar to App2. It generates synthetic CDC pairs for various types of groups and collects statistics on the field correlations within the pairs. The main difference is that App3 uses more realistic record manglers and computes a more realistic correlation signature for pairs.
The application takes three optional command line parameters:
Basic usage:
java -cp adg-cdc1-example-0.0.1-SNAPSHOT.jar net.sf.adatagenerator.ex.cdc1.app.App3
There are two output files: an XML file containing pairs and a text file containing information about correlations among the pairs. The format of the XML file is identical to the format of the XML file produced by App2. The counts file is similar to the one for App2, except that it prints, at the top of the file, the predicates used to compute a correlation signature. The predicates are listed in the order in which they are used in the signature computation.
aExactMatchDOB aNysissDOB aSoundexDOB aEditDistanceDOB aJaroDOB aDifferentDOB
aExactMatchFirstName aNysissFirstName aSoundexFirstName aEditDistanceFirstName aJaroFirstName aDifferentFirstName
aExactMatchLastName aNysissLastName aSoundexLastName aEditDistanceLastName aJaroLastName aDifferentLastName
aExactMatchMiddleName aNysissMiddleName aSoundexMiddleName aEditDistanceMiddleName aJaroMiddleName aDifferentMiddleName
aExactMatchMomFirst aNysissMomFirst aSoundexMomFirst aEditDistanceMomFirst aJaroMomFirst aDifferentMomFirst
aExactMatchMomLast aNysissMomLast aSoundexMomLast aEditDistanceMomLast aJaroMomLast aDifferentMomLast
aExactMatchMomMaiden aNysissMomMaiden aSoundexMomMaiden aEditDistanceMomMaiden aJaroMomMaiden aDifferentMomMaiden
aExactMatchMomMiddle aNysissMomMiddle aSoundexMomMiddle aEditDistanceMomMiddle aJaroMomMiddle aDifferentMomMiddle
aExactMatchSex aNysissSex aSoundexSex aEditDistanceSex aJaroSex aDifferentSex
aExactMatchSuffix aNysissSuffix aSoundexSuffix aEditDistanceSuffix aJaroSuffix aDifferentSuffix
aExactMatchVacCode aNysissVacCode aSoundexVacCode aEditDistanceVacCode aJaroVacCode aDifferentVacCode
aExactMatchVacDate aNysissVacDate aSoundexVacDate aEditDistanceVacDate aJaroVacDate aDifferentVacDate
aExactMatchVacMfr aNysissVacMfr aSoundexVacMfr aEditDistanceVacMfr aJaroVacMfr aDifferentVacMfr
aExactMatchVacName aNysissVacName aSoundexVacName aEditDistanceVacName aJaroVacName aDifferentVacName

1: < 111110 111110 111110 111110 000001 111110 000001 110100 111110 110100 111110 011110 111110 111110 >
1: < 111110 000001 111110 000001 111110 011110 111110 110100 111110 110100 111110 011000 111110 111110 >
1: < 011110 111110 111110 111110 111110 111110 111110 001110 111110 110100 010100 011000 000001 000001 >
1: < 111110 111110 000001 000001 111110 111110 110100 111110 111110 110100 010100 011000 000001 000001 >
1: < 111110 111110 011100 111110 111110 011100 000001 111110 111110 110100 010100 011000 000001 001100 >
...
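The signature layout above (one group of bits per field, one bit per predicate, in listed order) can be illustrated with a small sketch. This is not the App3 implementation: the class SignatureSketch, its methods, and the three stand-in predicates (exact match, case-insensitive match, different) are assumptions chosen to keep the example short. The real predicates (NYSIIS, Soundex, edit distance, Jaro) are not reproduced here.

```java
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;

// Hypothetical illustration, not the App3 code.
class SignatureSketch {

    /** Stand-in predicates, applied to every field in this order. */
    static List<BiPredicate<String, String>> demoPredicates() {
        BiPredicate<String, String> exact = String::equals;
        BiPredicate<String, String> ignoreCase = String::equalsIgnoreCase;
        BiPredicate<String, String> different = (a, b) -> !a.equalsIgnoreCase(b);
        return List.of(exact, ignoreCase, different);
    }

    /** Emits one bit per (field, predicate) test, grouped by field. */
    static String signature(List<String> fields,
                            List<BiPredicate<String, String>> predicates,
                            Map<String, String> left,
                            Map<String, String> right) {
        StringBuilder sb = new StringBuilder("<");
        for (String field : fields) {
            sb.append(' ');
            // Missing fields are treated as empty strings.
            String a = left.getOrDefault(field, "");
            String b = right.getOrDefault(field, "");
            for (BiPredicate<String, String> p : predicates) {
                sb.append(p.test(a, b) ? '1' : '0');
            }
        }
        return sb.append(" >").toString();
    }

    public static void main(String[] args) {
        List<String> fields = List.of("FirstName", "LastName");
        Map<String, String> left = Map.of("FirstName", "sharen", "LastName", "CANNES");
        Map<String, String> right = Map.of("FirstName", "sharon", "LastName", "CANNES");
        System.out.println(signature(fields, demoPredicates(), left, right));
        // prints "< 001 110 >"
    }
}
```

Pairs that produce the same bit string share a signature, so counting identical strings yields rows like those in the counts file.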
DbApp3 performs the same processing as App3 (and App2) except that it uses a MySQL database as a source of pairs, rather than generating them. A script that creates and populates the database is included in the source code for the CDC1 example.
The application has three modes of operation, selected by the first command line parameter: processTestCases, listPatients, and listTestCases.
processTestCases: This is the default mode of operation, which runs if no command line parameter is specified. In this mode, the DbApp3 application pulls pairs from the database and processes them in the same way that App3 does. The output goes to temporary files, which are reported as the application executes.
Basic usage:
java -cp adg-cdc1-example-0.0.1-SNAPSHOT.jar \
  -Dhibernate.connection.url="jdbc:mysql://some-host/some-database" \
  -Dhibernate.connection.username="some-mysql-account" \
  -Dhibernate.connection.password="some-password" \
  net.sf.adatagenerator.ex.cdc1.app.DbApp3 processTestCases

OR

java -cp adg-cdc1-example-0.0.1-SNAPSHOT.jar \
  -Dhibernate.connection.url="jdbc:mysql://some-host/some-database" \
  -Dhibernate.connection.username="some-mysql-account" \
  -Dhibernate.connection.password="some-password" \
  net.sf.adatagenerator.ex.cdc1.app.DbApp3
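The -D options in the commands above set JVM system properties, which are visible to any code running in that JVM. As a rough sketch of how a launcher might fail fast when one is missing, the hypothetical class below checks for the three property names used above. Whether DbApp3 performs any such check is not stated here, so treat this purely as an illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical pre-flight check, not part of DbApp3.
class ConnectionPropsSketch {

    // Property names taken from the DbApp3 usage examples.
    static final String[] REQUIRED = {
        "hibernate.connection.url",
        "hibernate.connection.username",
        "hibernate.connection.password"
    };

    /** Returns the required property names that are absent or blank. */
    static List<String> missing(Properties props) {
        List<String> result = new ArrayList<>();
        for (String key : REQUIRED) {
            String value = props.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                result.add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // System.getProperties() includes anything passed as -Dname=value.
        List<String> missing = missing(System.getProperties());
        if (!missing.isEmpty()) {
            System.err.println("Missing -D settings: " + missing);
            System.exit(1);
        }
        System.out.println("All connection settings present");
    }
}
```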
listPatients: In this mode, the DbApp3 application lists the entries in the patient table.
Basic usage:
java -cp adg-cdc1-example-0.0.1-SNAPSHOT.jar \
  net.sf.adatagenerator.ex.cdc1.app.DbApp3 listPatients
listTestCases: In this mode, the DbApp3 application lists the entries in the match_test_case table.
Basic usage:
java -cp adg-cdc1-example-0.0.1-SNAPSHOT.jar \
  net.sf.adatagenerator.ex.cdc1.app.DbApp3 listTestCases