Welcome to the A Data Generator project! Thank you for visiting our site. We've built a framework for creating data sources that are useful in record matching analysis and duplicate detection. Just as importantly, we've tried to provide clear documentation about how to use and extend the framework.

We hope that you'll find this website helpful. Please send us your suggestions for improvements. And please consider joining us to continue to improve, develop and extend what's currently here.

What is A Data Generator?

A Data Generator is a framework for creating applications that generate realistic, synthetic pairs of records for evaluating record matching and de-duplication systems.

Duplicate entries can be a problem for any system that tracks real-world entities without universal, reliable identifiers. Duplicate records are a problem in patient registries, customer databases, counter-terrorism, fraud detection, counter-party financial data and so on. A major impediment in developing and evaluating record matching systems has been that the most interesting and challenging examples of duplicate records are invariably found in high-value data that is usually confidential.

The goal of this project is to provide a framework for creating sources of synthetic duplicates that simulate the complexity found in real, high-value systems. These duplicates can then be used either to develop new record matching systems or to measure the effectiveness of existing ones.

Previous work

This work is based in part on the work of other researchers. The earliest duplicate generator of which we're aware is a C program called dbgen that was written by Mauricio Hernandez for a paper that he published with Salvatore Stolfo in 1998. (See the References section.) The dbgen program is available from the RIDDLE Datasets repository. The dbgen program works by first generating a template record and then copying and mangling it by introducing deliberate errors such as transposed string values, null values, and so on. The original and mangled copy are then added to a corpus of test data.
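The template-and-mangle approach described above can be sketched in a few lines of Java. This is a minimal illustration, not the dbgen source: the class name MangleSketch, the method names, the map-based record, and the even split between null values and transposed characters are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of template-and-mangle generation; not the dbgen source.
class MangleSketch {

    // Swaps two adjacent characters at a random position.
    static String transpose(String value, Random rng) {
        if (value == null || value.length() < 2) {
            return value;
        }
        int i = rng.nextInt(value.length() - 1);
        char[] chars = value.toCharArray();
        char tmp = chars[i];
        chars[i] = chars[i + 1];
        chars[i + 1] = tmp;
        return new String(chars);
    }

    // Copies the template, then damages each field with probability errorRate.
    static Map<String, String> mangle(Map<String, String> template,
                                      double errorRate, Random rng) {
        Map<String, String> copy = new LinkedHashMap<>(template);
        for (Map.Entry<String, String> field : copy.entrySet()) {
            if (rng.nextDouble() < errorRate) {
                // Assumed error mix: half of the errors null out the field,
                // the other half transpose two adjacent characters.
                field.setValue(rng.nextBoolean()
                        ? null
                        : transpose(field.getValue(), rng));
            }
        }
        return copy;
    }

    // Adds each template and one mangled copy of it to the corpus.
    static List<Map<String, String>> buildCorpus(List<Map<String, String>> templates,
                                                 double errorRate, Random rng) {
        List<Map<String, String>> corpus = new ArrayList<>();
        for (Map<String, String> template : templates) {
            corpus.add(template);
            corpus.add(mangle(template, errorRate, rng));
        }
        return corpus;
    }

    public static void main(String[] args) {
        Map<String, String> template = new LinkedHashMap<>();
        template.put("firstName", "Maria");
        template.put("lastName", "Lopez");
        template.put("city", "Albany");

        // A fixed seed makes the generated corpus repeatable.
        List<Map<String, String>> corpus =
                buildCorpus(List.of(template), 0.5, new Random(42));
        corpus.forEach(System.out::println);
    }
}
```

Because each original is stored next to its mangled copy, the corpus records which pairs are true duplicates, which is what makes this kind of data useful for scoring a matcher.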

Peter Christen and Agus Pudjijono extended Hernandez's work in the Febrl project. The Febrl data generator allows one to create template records that are parts of groups, such as families. It also allows one to define detailed statistical models for correlations between the field values of records that belong to the same group. Christen and Pudjijono did extensive studies on the distributions of field values in groups as a part of a paper that they published in 2009. (See the References section for a full citation to this paper.)

This project builds on the dbgen and Febrl data generators in two ways. First, our data generator is written in Java and is designed to be easily reused with arbitrary data schemas. Second, and more importantly, we introduce a method for measuring the correlations found in real data and then reproducing these correlations in generated data.

We like to think of our method as Fourier Analysis for pair correlations. The spark for this idea comes from ChoiceMaker record matching software, which is based (in part) on Andrew Borthwick's patent for using machine learning to perform record matching. Under the covers, the essence of ChoiceMaker record matching is the creation (and weighting) of simple correlation tests to measure the similarity of pairs of records. (Please note that Andrew's patent is licensed by the Open Invention Network under terms designed to promote collaborative exchanges of intellectual property. See the OIN website for more details.)
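To make the idea of simple correlation tests concrete, here is a minimal Java sketch in which each test is a named yes/no question about a pair of records. It is an illustration only and does not use the ChoiceMaker API or this framework's classes: the names PairTestSketch, PairTest and firedTests, the field names, and the three sample tests are all assumptions, and the weighting step is omitted.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of pair-wise correlation tests; not the ChoiceMaker API.
class PairTestSketch {

    // A simple yes/no question asked about a pair of records.
    interface PairTest {
        boolean fires(Map<String, String> a, Map<String, String> b);
    }

    static boolean equalNonNull(String x, String y) {
        return x != null && x.equals(y);
    }

    static boolean sameInitial(String x, String y) {
        return x != null && y != null && !x.isEmpty() && !y.isEmpty()
                && Character.toLowerCase(x.charAt(0)) == Character.toLowerCase(y.charAt(0));
    }

    // Illustrative tests; a real model would define many more per field.
    static Map<String, PairTest> tests() {
        Map<String, PairTest> tests = new LinkedHashMap<>();
        tests.put("sameLastName",
                (a, b) -> equalNonNull(a.get("lastName"), b.get("lastName")));
        tests.put("sameFirstInitial",
                (a, b) -> sameInitial(a.get("firstName"), b.get("firstName")));
        tests.put("birthDateMissing",
                (a, b) -> a.get("birthDate") == null || b.get("birthDate") == null);
        return tests;
    }

    // Returns the names of the tests that fire for this pair, in definition order.
    static List<String> firedTests(Map<String, String> a, Map<String, String> b) {
        List<String> fired = new ArrayList<>();
        for (Map.Entry<String, PairTest> test : tests().entrySet()) {
            if (test.getValue().fires(a, b)) {
                fired.add(test.getKey());
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        Map<String, String> a = Map.of(
                "firstName", "Maria", "lastName", "Lopez", "birthDate", "1980-01-02");
        Map<String, String> b = Map.of(
                "firstName", "Mary", "lastName", "Lopez");
        System.out.println(firedTests(a, b));
    }
}
```

The set of tests that fire for a pair acts as a signature of how the two records are related. Counting how often each signature occurs across many pairs is one plausible way to read "measuring the correlations" in a data set.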

Please see the References list for more details about work in this area.

What's on the rest of the site

On other pages of this site you'll find a list of the components that make up the data generator framework, some examples that use the framework, and a tutorial that walks through the steps of creating a data generator. There are various fact sheets and reports listed under the Developer information section in the site menu. And there are still more resources available on our project website at SourceForge.

What's next on our agenda

For us, this project has turned out to be unexpectedly beneficial in terms of understanding the record matching problem. We think that there are lots of ways that our work can be extended. For example:

  • Stacked data

    Our method could be extended to cover "stacked" data; that is, data in which several subentities are linked to a master entity. A typical example is a record representing a person which is linked to several records representing phone numbers or current and past addresses.

  • Automation

    The process of creating a duplicate generator could be automated.

  • Efficient generation of specific correlation patterns

    The process of reproducing a given set of correlation patterns could be made much more efficient. Techniques like logistic regression or genetic programming might be used to select generation and modification parameters that preferentially produce correlation patterns close to, or identical with, a target set of correlation patterns.

  • Correlation measurements of entire databases

    Our method, combined with a technique like ChoiceMaker's patented algorithm for automated batch blocking, may offer a way to characterize the degree of duplication in real databases. Please note that ChoiceMaker's algorithm is licensed by Open Invention Network (OIN) under terms designed to promote collaborative exchanges of intellectual property. See the OIN website for more details.

  • Synthesis of large, complex databases

    Efficient production of correlation examples, combined with correlation measurements of real databases, may offer a way to simulate large, high-value databases.

  • Greater collaboration on record matching techniques

    Large databases of realistic synthetic data would allow open collaboration on improved techniques for duplicate detection.

We hope that you find these prospects exciting, and that you might be tempted to join us in exploring them and others as well. We look forward to hearing from you!