References and further reading

  1. M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 2(1):9-37, 1998.

    This paper describes the problem of merging multiple databases of information. It introduced and used the dbgen duplicate generation program.

  2. Peter Christen and Agus Pudjijono. Accurate Synthetic Generation of Realistic Personal Information. Advances in Knowledge Discovery and Data Mining: 13th Pacific-Asia Conference, PAKDD 2009, Bangkok, Thailand, pages 507-514, 2009.

    This paper presents the Febrl data generator for personal information, which allows for the generation of realistic synthetic data based on frequency tables and attribute generation rules.

  3. Andrew Borthwick. Probabilistic record linkage model derived from training data. United States Patent #6523019, 2003.

    This patent describes a method of training a record linkage system from examples of record pairs that match each other, or are different from each other, or are simply too ambiguous or sparse to make a match or differ decision. Although framed in the language of machine learning, the essence of this method is a weighted comparison of field correlations between a pair of records. This patent, and the two patents that are described next, may be licensed from the Open Invention Network under terms that are designed to promote collaborative exchanges of intellectual property.

  4. Andrew Borthwick, Martin Buechi and Arthur Goldberg. Automated database blocking and record matching. United States Patent #7152060, 2006.

    Blocking is a technique for generating a set (or block, hence the name) of records that potentially match some input (or query) record. This patent describes an automated blocking technique that is used in online or interactive systems where a fast, comprehensive and limited response is important. The block of potential matches is then passed to a matching process to evaluate which records match the input record. Applications include but are not limited to individual matching such as student identification, householding, business matching, supply chain matching, financial matching, news or text matching, and other applications. Like the patent described above, this patent may be licensed from the Open Invention Network under terms that are designed to promote collaborative exchanges of intellectual property.

  5. Andrew Borthwick, Arthur Goldberg and Put Cheung. Batch automated blocking and record matching. United States Patent #7899796, 2011.

    In contrast to online or interactive blocking, batch blocking takes a set of input records and generates sets (plural) of potentially matching records for the entire input set. Batch blocking is used in situations where throughput is important and interactive response is not relevant. Like the patents described above, this patent may be licensed from the Open Invention Network under terms that are designed to promote collaborative exchanges of intellectual property.

  6. ChoiceMaker record matching software is available under the Eclipse Public License at SourceForge.net.

    ChoiceMaker record matching software uses the techniques described in the three preceding patents.

  7. Febrl record matching software is available under the Australian National University Open Source License (ANUOS License) Version 1.1.

    Febrl record matching uses the Fellegi-Sunter model and a variety of other record matching techniques.

  8. Record linkage entry in Wikipedia.

    Last accessed for this reference on 2012-02-19.