Components

The main trunk of the source code repository for A Data Generator is divided into two trees, adatagenerator-project and adatagenerator-examples. Each tree contains a few modules, and each module is distributed as a JAR file that contains a number of packages. Here's how things are organized.

  • A Data Generator Project

    The adatagenerator-project source code tree contains application-independent, reusable libraries. There are currently two modules located in this tree, the core ADG library and a library of Febrl-based generators.

    • ADG: Core API Module

      This module builds the adatagenerator-core JAR file, which contains the ADG programming interface, the current implementation of that interface, and a collection of utilities that are useful in implementing the interface.

      • net.sf.adatagenerator.api package

        The API package is composed of Java interfaces, abstract classes and Exception classes that define the ADG framework. The basic elements of the framework are Java classes that model field values; Java Beans that model database records; and an interface, CMPair (borrowed from ChoiceMaker), that models pairs of records. There are some fancier versions of these elements, like GeneratedBean and GeneratedPair. There are things that create or modify these elements, like the aptly named FieldGenerator and BeanModifier. And finally, there are general interfaces, CMSource and CMSink (again, borrowed from ChoiceMaker), for supplying or consuming limitless streams of values, beans and pairs.

      • net.sf.adatagenerator.core package

        The CORE package is composed of concrete classes for default implementations of the ADG framework.

      • net.sf.adatagenerator.modifiers package

        The MODIFIERS package is a specialized set of classes that implement value, bean and pair modifiers.

      • net.sf.adatagenerator.mutators package

        This is a deprecated package that was a predecessor to the modifiers package. It will be removed in a future release.

      • net.sf.adatagenerator.pred package

        This is a deprecated package whose functionality has been moved to the ChoiceMaker correlations module; see below.

      • net.sf.adatagenerator.util package

        The UTIL package contains general utilities for implementing generators and modifiers. In particular, we use the FrequencyBasedList class as the basis for most of the random value generators and modifiers used in the examples.

    • ADG: Febrl Library Module

      This module builds the adatagenerator-febrl JAR file, which contains a library of value generator and modifier routines based on the Febrl data generator. Please note that the data files in this library are covered by the ANUOS 1.1 open source license, and that the code is covered by the Eclipse Public License 1.0 wherever the ANUOS license is not applicable.

      • net.sf.adatagenerator.febrl.generators package

        This package contains Febrl-based generators for names, addresses, ages, birthdates and family roles appropriate to the demographic distributions documented by Christen and Pudjijono in Accurate Synthetic Generation of Realistic Personal Information.

      • net.sf.adatagenerator.febrl.modifiers package

        This package contains Febrl-based value modifiers for data-entry errors like character insertions, OCR mistakes, phonetic substitutions, and so forth. The routines are documented by Christen and Pudjijono in Accurate Synthetic Generation of Realistic Personal Information. The current version of this library uses the deprecated net.sf.adatagenerator.mutators package, but in a future release the library will be based on the net.sf.adatagenerator.modifiers package.

  • A Data Generator Examples

    The adatagenerator-examples source code tree contains applications of the ADG framework that are discussed in the Examples section.

    • ADG: Febrl Application

      This module builds the adg-febrl-example JAR file, which contains an application that uses the adatagenerator-febrl library. It also demonstrates some non-Febrl features that are new to this framework, such as a hierarchical group of persons (in this case, a household containing more than one family). The current version of this application uses the deprecated net.sf.adatagenerator.mutators package, but in a future release the application will be based on the net.sf.adatagenerator.modifiers package.

    • ADG: CDC 1 Application

      This module builds the adg-cdc1-example JAR file, which contains an application that produces duplicates based on data from the Centers for Disease Control (CDC) Deduplication Test Case for immunization information systems.

    • ADG: CDC 2 Application

      This module, which is still in the vapor-ware stage, will extend the adg-cdc1-example application by including additional record fields for patient addresses and facility-specific medical record numbers.
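
To make the roles described above more concrete, here is a minimal, self-contained sketch of how a generator, a modifier and a sink might fit together to produce pairs of original and altered values. Every type and method name in it (AdgSketch, ValueGenerator, ValueModifier, Sink, Pair, WeightedList, deleteOneChar, generatePairs) is a hypothetical stand-in invented for illustration. None of them is taken from the ADG or ChoiceMaker APIs, and the weighted list only approximates the idea of frequency-based value selection.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Illustration only: all names below are hypothetical, not the real ADG API.
final class AdgSketch {

    /** Supplies an endless stream of field values (the role of a generator). */
    interface ValueGenerator<T> { T next(); }

    /** Returns a possibly altered copy of a value (the role of a modifier). */
    interface ValueModifier<T> { T modify(T original); }

    /** Consumes items one at a time (the role of a sink). */
    interface Sink<T> { void put(T item); }

    /** An immutable pair of values. */
    static final class Pair<T> {
        final T first;
        final T second;
        Pair(T first, T second) { this.first = first; this.second = second; }
    }

    /** Picks values at random, in proportion to positive integer weights. */
    static final class WeightedList<T> implements ValueGenerator<T> {
        private final List<T> values = new ArrayList<>();
        private final List<Integer> cumulative = new ArrayList<>();
        private final Random random;
        private int total = 0;

        WeightedList(Random random) { this.random = random; }

        WeightedList<T> add(T value, int weight) {
            if (weight <= 0) throw new IllegalArgumentException("weight must be positive");
            total += weight;
            values.add(value);
            cumulative.add(total); // running sum of weights
            return this;
        }

        @Override public T next() {
            if (total == 0) throw new IllegalStateException("no values added");
            int r = random.nextInt(total); // 0 <= r < total
            for (int i = 0; i < values.size(); i++) {
                if (r < cumulative.get(i)) return values.get(i);
            }
            throw new AssertionError("unreachable");
        }
    }

    /** A modifier that deletes one randomly chosen character, a simple data-entry error. */
    static ValueModifier<String> deleteOneChar(Random random) {
        return s -> {
            if (s.isEmpty()) return s;
            int i = random.nextInt(s.length());
            return s.substring(0, i) + s.substring(i + 1);
        };
    }

    /** Generates n (original, modified) pairs and hands each one to the sink. */
    static void generatePairs(ValueGenerator<String> generator, ValueModifier<String> modifier,
                              int n, Sink<Pair<String>> sink) {
        for (int i = 0; i < n; i++) {
            String original = generator.next();
            sink.put(new Pair<>(original, modifier.modify(original)));
        }
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        WeightedList<String> names = new WeightedList<String>(random)
                .add("Smith", 5).add("Jones", 3).add("Nguyen", 2);
        generatePairs(names, deleteOneChar(random), 5,
                pair -> System.out.println(pair.first + " -> " + pair.second));
    }
}
```

In this sketch the sink is just a lambda that prints each pair. A sink that collects pairs into a list, or that tallies statistics about them, would plug in the same way.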

Since a goal of our framework is to measure the quality of synthetic duplicates by measuring the correlations between generated pairs of records, we've found it convenient to build on some components from the ChoiceMaker record matching project.

Correlation measurements in our framework are built with libraries from the ChoiceMaker project. Here's a brief summary of the ChoiceMaker code that we rely on.

There are two source code repositories for the ChoiceMaker project, a CVS repository and an SVN repository. We use code from the SVN repository. In the main trunk of the SVN repository, there are currently two software directory trees, choicemaker-shared and choicemaker-project.

  • ChoiceMaker Shared

    The choicemaker-shared source code tree contains application-independent, reusable libraries. The code tree consists of only one module, which is the CM Shared module.

    • CM Shared Module

      This module builds the choicemaker-shared JAR file, which contains a small, shared API package, a shared CORE package that implements the API, and a UTIL package of some generally useful utilities that are mostly unrelated to the shared API.

      • com.choicemaker.shared.api

        The shared API package is composed of the CMPair, CMSource and CMSink interfaces, plus two exception classes.

      • com.choicemaker.shared.core

        The shared CORE package contains a single class, CMDefaultPair.

      • com.choicemaker.shared.util

        The shared UTIL package contains utilities for constructing equality and hash code methods and for finding accessor and mutator methods of a Java Bean by reflection. The package also contains a very useful SynonymMap class, which was borrowed from the Apache Lucene project.

  • ChoiceMaker Project

    The choicemaker-project source code tree contains version 3.0 of the ChoiceMaker application code. This version is still at a very early stage of development, so the code tree is much sparser than the code tree for version 2.5, which is contained in the CVS repository for ChoiceMaker.

    • CM Correlation

      This module builds the choicemaker-correlation JAR file, which contains an API package and two implementation packages.

      • com.choicemaker.correlation

        This package is the API that defines how ChoiceMaker measures pair correlations. The most important type is the Predicate interface, which is a named test that evaluates a CMPair and returns a boolean result. Another important type is the PredicateCorrelator interface, which computes the correlation signature of a CMPair.

      • com.choicemaker.correlation.predicates

        This package contains the implementation of some commonly used predicates, including ExactMatchPredicate, EditDistancePredicate, SoundexPredicate and TrimmedSubstringsLowerCase.

      • com.choicemaker.correlation.util

        This package contains two very useful classes: SimpleCorrelationEvaluator, which implements the PredicateCorrelator interface, and SimpleCorrelationCounter, which acts as a CMSink of CMPair instances and keeps track of the correlation patterns exhibited by the pairs that it consumes.

    • CM General Matching

      This module builds the choicemaker-general-matching JAR file, which contains two packages.

      • com.choicemaker.matching.gen

        This package contains language-independent String comparison tests, such as edit distance, jumble distance, longest common substring and longest common subsequence.

      • com.choicemaker.matching.en

        This package contains English-specific String comparison tests, such as Soundex, NYSIIS, Jaro-Winkler, Metaphone and Double Metaphone.

    • CM Util

      This module builds the choicemaker-util JAR file, which contains just the com.choicemaker.util package.

      • com.choicemaker.util

        Currently, this package contains just the StringUtils class. We plan to add other utilities, like DateUtils, from the ChoiceMaker 2.5 code branch.
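
The idea of named predicates, correlation signatures and a counter of correlation patterns can be sketched in a few lines. The code below is only an illustration under stated assumptions: the names CorrelationSketch, NamedPredicate, signature, countSignatures and defaultPredicates are hypothetical and are not the ChoiceMaker API, pairs are reduced to two strings, and a signature is assumed to be a string with one '1' or '0' per predicate. The edit distance shown is the standard Levenshtein distance, which may differ from the implementation in the general matching module.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiPredicate;

// Illustration only: all names below are hypothetical, not the ChoiceMaker API.
final class CorrelationSketch {

    /** A named boolean test on a pair of values. */
    static final class NamedPredicate {
        final String name;
        final BiPredicate<String, String> test;
        NamedPredicate(String name, BiPredicate<String, String> test) {
            this.name = name;
            this.test = test;
        }
    }

    /** Levenshtein distance computed with two rolling rows. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp; // reuse the rows
        }
        return prev[b.length()];
    }

    /** Builds a signature with one character per predicate: '1' if it held, '0' otherwise. */
    static String signature(List<NamedPredicate> predicates, String a, String b) {
        StringBuilder sb = new StringBuilder();
        for (NamedPredicate p : predicates) sb.append(p.test.test(a, b) ? '1' : '0');
        return sb.toString();
    }

    /** Counts how often each signature occurs across a list of two-element pairs. */
    static Map<String, Integer> countSignatures(List<NamedPredicate> predicates, List<String[]> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String[] pair : pairs) {
            counts.merge(signature(predicates, pair[0], pair[1]), 1, Integer::sum);
        }
        return counts;
    }

    /** Two example predicates: exact equality, and edit distance of at most one. */
    static List<NamedPredicate> defaultPredicates() {
        List<NamedPredicate> predicates = new ArrayList<>();
        predicates.add(new NamedPredicate("exact", String::equals));
        predicates.add(new NamedPredicate("editDistance<=1", (a, b) -> editDistance(a, b) <= 1));
        return predicates;
    }

    public static void main(String[] args) {
        List<String[]> pairs = new ArrayList<>();
        pairs.add(new String[] {"Smith", "Smith"});
        pairs.add(new String[] {"Smith", "Smth"});
        pairs.add(new String[] {"Smith", "Jones"});
        System.out.println(countSignatures(defaultPredicates(), pairs));
    }
}
```

Comparing the signature counts of generated duplicates with those of real duplicates is one way to judge how realistic the synthetic data is.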