Step 6: Measure field correlations

Five basic String correlation predicates

Other useful String correlation predicates

Non-string fields

Details

Putting it all together

Differences

Field swaps

Placeholders

Nicknames and synonyms

Previous step	Up	Next step
Step 5	Tutorial overview	Step 7

The degree of correlation between two records is measured by comparing the values of their individual fields. These field comparisons are performed by objects that implement the Predicate interface. Typically many field comparisons are performed by an ordered set of predicates, and the results are tabulated as a correlation signature of ones and zeros, with a one (1) indicating that a particular comparison test returned true and zero (0) indicating that a test returned false.

As a simple example in the context of a CDC record, a minimal set of predicates might check only whether each field in a pair of records is exactly equal. There are 14 fields in a CDC record, so the predicate set would consist of 14 individual tests that check whether last names match exactly; whether first names match exactly; and so on. Even for a correlation evaluator this simple, we need to decide whether null values are considered exact matches to each other. Consider the case of last names. If two vaccination records are both missing last names, we can't be certain whether the records refer to the same child. The correlation between the records is therefore weaker than if two valid last names were present and equal. For this reason, null last names should not be considered exact matches to each other. In general, a null value is not considered an exact match for any field value, including another null.

Given this definition of a simple correlation evaluator, if two CDC records match exactly on the children's last and first names, but all other field values are null, then the correlation signature of the pair would be:

  110000000000000

where the order of the predicates matches the order of the fields listed in Step 1 of this tutorial.
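The exact-match evaluator described above can be sketched in a few lines of plain Java. This is only an illustration under simplifying assumptions: records are modeled as String arrays, and the class and method names (ExactMatchSignatureSketch, exactMatch, signature) are hypothetical and are not part of the tutorial's Predicate API. A four-field record is used to keep the example short.

```java
class ExactMatchSignatureSketch {

    // Null-aware exact match: two missing values are NOT treated as a match.
    static boolean exactMatch(String a, String b) {
        return a != null && b != null && a.equals(b);
    }

    // One character per field, in field order: '1' for a match, '0' otherwise.
    static String signature(String[] record1, String[] record2) {
        if (record1.length != record2.length) {
            throw new IllegalArgumentException("records must have the same number of fields");
        }
        StringBuilder sb = new StringBuilder(record1.length);
        for (int i = 0; i < record1.length; i++) {
            sb.append(exactMatch(record1[i], record2[i]) ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical four-field records: last name, first name, two missing fields
        String[] r1 = {"Smith", "Anna", null, null};
        String[] r2 = {"Smith", "Anna", null, null};
        System.out.println(signature(r1, r2)); // prints 1100
    }
}
```

Note that the two null fields contribute zeros even though both records agree that the value is missing.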

A slightly more complicated example is illustrated by App2 of CDC Example 1. In this application, three predicates are used to test each field of a CDC record:

An Exact match predicate
A Trimmed, lower-case Strings predicate
A Trimmed, lower-case Substrings predicate

These predicates are applied in order to each field of a CDC record, so the signature for the same pair of records discussed above would become:

  111 111 000 000 000 000 000 000 000 000 000 000 000 000 000

where spaces have been introduced to increase readability.
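As a rough sketch of how an ordered list of predicates yields a grouped signature like the one above, the following code applies three stand-in tests to each field. The helper names are hypothetical, and the substring test assumes that "substring" means one trimmed, lower-case value contains the other; the predicates actually used by App2 may be defined differently.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.BiPredicate;

class ThreePredicateSketch {

    static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    static boolean exact(String a, String b) {
        return a.equals(b);
    }

    static boolean trimmedLowerEquals(String a, String b) {
        return normalize(a).equals(normalize(b));
    }

    // Assumption: one normalized value contains the other, and neither is empty
    static boolean trimmedLowerSubstring(String a, String b) {
        String x = normalize(a);
        String y = normalize(b);
        return !x.isEmpty() && !y.isEmpty() && (x.contains(y) || y.contains(x));
    }

    // The order of this list fixes the order of the digits within each group
    static List<BiPredicate<String, String>> predicates() {
        List<BiPredicate<String, String>> list = new ArrayList<>();
        list.add(ThreePredicateSketch::exact);
        list.add(ThreePredicateSketch::trimmedLowerEquals);
        list.add(ThreePredicateSketch::trimmedLowerSubstring);
        return list;
    }

    // Produces one space-separated group of digits per field
    static String signature(String[] r1, String[] r2) {
        List<BiPredicate<String, String>> tests = predicates();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < r1.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            boolean bothPresent = r1[i] != null && r2[i] != null;
            for (BiPredicate<String, String> p : tests) {
                sb.append(bothPresent && p.test(r1[i], r2[i]) ? '1' : '0');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] r1 = {"Smith", "Anna", null};
        String[] r2 = {"Smith", "Anna", null};
        System.out.println(signature(r1, r2)); // prints 111 111 000
    }
}
```

Because the tests run from strictest to loosest, a field that differs only in case or padding (for example "Smith " and "smith") produces the group 011 rather than 111.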

While these examples are simplistic, useful predicate sets are not much more complicated, which is what gives this technique great utility. In general, a good starting point for testing the correlation of String fields is the following set of five predicates:

An Exact match predicate
A Nysiis predicate
A Soundex predicate
An Edit distance predicate
A Jaro-Winkler predicate

As discussed in the section on choosing which predicates to use, this combination of predicates can detect meaningful String correlations with very high accuracy.
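To make one of these five tests concrete, here is a minimal sketch of an edit-distance check using the standard Levenshtein dynamic-programming recurrence. The class name, method names, and the threshold passed by the caller are assumptions made for illustration; this is not the implementation behind the EditDistancePredicate class, and its default distance is not stated on this page.

```java
class EditDistanceSketch {

    // Levenshtein distance: minimum number of single-character insertions,
    // deletions, and substitutions needed to turn s into t.
    static int levenshtein(String s, String t) {
        int[] prev = new int[t.length() + 1];
        int[] curr = new int[t.length() + 1];
        for (int j = 0; j <= t.length(); j++) {
            prev[j] = j; // distance from the empty prefix of s
        }
        for (int i = 1; i <= s.length(); i++) {
            curr[0] = i; // distance to the empty prefix of t
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                int insertOrDelete = Math.min(curr[j - 1] + 1, prev[j] + 1);
                curr[j] = Math.min(insertOrDelete, prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[t.length()];
    }

    // Null-aware test: true only if both values are present and close enough.
    // maxDistance is chosen by the caller (hypothetical parameter).
    static boolean editDistanceSimilar(String a, String b, int maxDistance) {
        return a != null && b != null && levenshtein(a, b) <= maxDistance;
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("Smith", "Smyth"));    // prints 1
        System.out.println(levenshtein("Smith", "Smithers")); // prints 3
    }
}
```

Whether Smith and Smithers count as similar therefore depends entirely on the configured threshold, which is one reason predicate configuration matters.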

It is often useful to supplement the five basic String predicates with some additional tests.

One important but somewhat counter-intuitive test of correlation is a difference predicate. In our example, we would build upon the five basic predicates to define when field values are different.

  A difference predicate
  ======================
  Two String values are not null; not exact matches; not Nysiis-similar; not Soundex-similar;
    not Edit-distance similar; and not Jaro-Winkler similar.

This predicate is not redundant with the previous five predicates because of the very first test that it exercises, namely checking that two String values are not null. A difference predicate is necessary to distinguish between null field values, which indicate that two records are weakly uncorrelated, and non-null, strongly differing field values, which indicate that two records are anti-correlated. Consider the last name field and four possible values, Smith, Smithers, Jones and null, compared in a pair of records as Smith and Smith, Smith and Smithers, Smith and null, and Smith and Jones:

Here we've assumed default configurations for the predicates, as discussed in the details section. The point of this example is that last names of Smith and Jones are strong indicators that two records refer to two different children, much stronger than the mismatches of Smith to Smithers or Smith to null. The only way to distinguish between weak mismatches and an unambiguous difference is with a difference predicate.
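The logic of a difference predicate can be sketched as follows. To keep the example short, the five basic predicates are replaced by two stand-ins: an exact match and a crude shared-prefix test that plays the role of the fuzzy predicates. The prefix test and all names here are hypothetical and do not reproduce Nysiis, Soundex, edit-distance, or Jaro-Winkler behavior.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

class DifferenceSketch {

    // Stand-in for a fuzzy predicate: the first n characters agree, ignoring case
    static boolean sharesPrefix(String a, String b, int n) {
        return a.length() >= n && b.length() >= n && a.regionMatches(true, 0, b, 0, n);
    }

    // The "match" predicates that the difference predicate negates
    static List<BiPredicate<String, String>> matchPredicates() {
        List<BiPredicate<String, String>> list = new ArrayList<>();
        list.add((a, b) -> a.equals(b));
        list.add((a, b) -> sharesPrefix(a, b, 4));
        return list;
    }

    // True only if both values are present AND no match predicate fires
    static boolean different(String a, String b) {
        if (a == null || b == null) {
            return false; // a missing value is a weak mismatch, not a difference
        }
        for (BiPredicate<String, String> p : matchPredicates()) {
            if (p.test(a, b)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(different("Smith", "Smith"));    // prints false
        System.out.println(different("Smith", "Smithers")); // prints false
        System.out.println(different("Smith", null));       // prints false
        System.out.println(different("Smith", "Jones"));    // prints true
    }
}
```

Only the Smith and Jones pair is flagged, even though the Smith and null pair also fails every match test.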

Not all tests of record correlations can rely on comparing just one field value. In real data, field values are often swapped between valid locations; for example, the first and last names of a child might be mis-entered into last and first name fields, respectively. A field-swap predicate can be used to test for this type of correlation.
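A field-swap test for first and last names might look like the sketch below. It is an assumption-laden illustration: names are compared after trimming and ignoring case, and a pair that also matches in the normal positions is not reported as a swap. The class and method names are hypothetical.

```java
class FieldSwapSketch {

    // Null-aware comparison that ignores case and surrounding whitespace
    static boolean sameName(String a, String b) {
        return a != null && b != null && a.trim().equalsIgnoreCase(b.trim());
    }

    // True if record 1's first/last names appear as record 2's last/first names
    static boolean namesSwapped(String first1, String last1, String first2, String last2) {
        boolean crossMatch = sameName(first1, last2) && sameName(last1, first2);
        boolean directMatch = sameName(first1, first2) && sameName(last1, last2);
        return crossMatch && !directMatch;
    }

    public static void main(String[] args) {
        System.out.println(namesSwapped("Anna", "Smith", "Smith", "Anna")); // prints true
        System.out.println(namesSwapped("Anna", "Smith", "Anna", "Smith")); // prints false
    }
}
```

Unlike the single-field predicates, this test needs access to two fields from each record, which is why it cannot be expressed as a comparison of one field value.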

Not all non-null values are necessarily significant when testing for correlations. For example, in immunization records, it is not uncommon for a phrase like baby, baby girl, or unknown to be used as a placeholder for the name of a newborn infant who has not been named yet. When comparing two records, a placeholder value shouldn't be regarded as a strong indication either that two records match or that they don't match; in many cases, a placeholder value acts much like a null value.

Placeholder values are typically field specific. While both baby and unknown are common placeholders for a first name field, only unknown is a common placeholder value for a last name field. Neither of these is a common placeholder value for a date field. For dates, placeholders may be an agreed-upon value (such as 1/1/1900) or an estimated value (such as the first day and first month of an estimated birth year or the first day of an estimated birth month and year). This last example illustrates a vexing problem with placeholder values: some valid values may look like placeholders. There are children born on New Year's Day, and there are probably even some children named or nicknamed Baby.

One way to deal with placeholders is to simply include a placeholder predicate along with other predicates. A less reusable approach is to validate field values against placeholder lists before using the values in other predicates.
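A placeholder test can be as simple as a lookup in a field-specific set. The sketch below uses the placeholder values mentioned above; the class name, set names, and normalization (trim and lower-case) are assumptions for illustration only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class PlaceholderSketch {

    // Field-specific placeholder lists, taken from the examples in the text
    static final Set<String> FIRST_NAME_PLACEHOLDERS =
            new HashSet<>(Arrays.asList("baby", "baby girl", "unknown"));
    static final Set<String> LAST_NAME_PLACEHOLDERS =
            new HashSet<>(Arrays.asList("unknown"));

    // True if the value, after trimming and lower-casing, is a known placeholder
    static boolean isPlaceholder(String value, Set<String> placeholders) {
        return value != null
                && placeholders.contains(value.trim().toLowerCase(Locale.ROOT));
    }

    // Pair-level test: at least one of the two values is a placeholder
    static boolean eitherIsPlaceholder(String a, String b, Set<String> placeholders) {
        return isPlaceholder(a, placeholders) || isPlaceholder(b, placeholders);
    }

    public static void main(String[] args) {
        System.out.println(isPlaceholder(" Baby ", FIRST_NAME_PLACEHOLDERS)); // prints true
        System.out.println(isPlaceholder("Baby", LAST_NAME_PLACEHOLDERS));    // prints false
    }
}
```

Including the result as one more digit in the signature leaves the other predicates untouched, so a later step can decide how much weight a match or mismatch on a placeholder deserves.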

Two field values may be correlated semantically in ways that can't be captured by similarities in spelling or phonetics. A classic example is nicknames. Jack is a nickname for John; Fritz is a nickname for Fred; and Rick is a nickname for Fredrick. None of these nickname relationships will be captured by the predicates we've discussed so far.

Synonym relationships may be important in fields other than name. For example, non-standard codes for vaccines or vaccine manufacturers could be recognized through the use of synonym dictionaries.
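Nickname and synonym relationships are usually handled with a dictionary lookup. The following sketch maps each nickname to a single canonical name, using only the three examples given above; the class name, the one-canonical-name-per-nickname design, and the normalization are simplifying assumptions. The same pattern could map non-standard vaccine codes to a standard code.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class NicknameSketch {

    // Tiny illustrative dictionary: nickname -> canonical name
    static final Map<String, String> CANONICAL = new HashMap<>();
    static {
        CANONICAL.put("jack", "john");
        CANONICAL.put("fritz", "fred");
        CANONICAL.put("rick", "fredrick");
    }

    // Returns the canonical form, or the normalized name itself if it is not listed
    static String canonical(String name) {
        String key = name.trim().toLowerCase(Locale.ROOT);
        return CANONICAL.getOrDefault(key, key);
    }

    // True if both values are present and reduce to the same canonical name
    static boolean sameCanonicalName(String a, String b) {
        return a != null && b != null && canonical(a).equals(canonical(b));
    }

    public static void main(String[] args) {
        System.out.println(sameCanonicalName("Jack", "John")); // prints true
        System.out.println(sameCanonicalName("Jack", "Fred")); // prints false
    }
}
```

Real nickname data is many-to-many (a nickname may belong to several given names), so a production dictionary would more likely map each name to a set of related names and test whether the sets overlap.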

While we've focused on String-valued fields, predicates can be designed and implemented to work with non-String fields as well. Common examples are Date fields, in which correlation is measured by closeness in time. Another example would be address fields, which can be used in some cases to compute latitude and longitude through geocoding. Latitude and longitude can then be used to measure spatial correlation.

Another technique is to convert non-String fields into String values, and then apply String predicates to the values. Again, Date fields are an example where this technique can be very useful, since dates can be mis-entered due to a variety of issues.
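Both approaches to Date fields can be sketched with the standard java.time classes. The first test measures closeness in time; the second converts the dates to eight-digit strings and counts differing positions, a simple stand-in for applying a String predicate. The class name, method names, and thresholds are hypothetical.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

class DatePredicateSketch {

    // Closeness in time: both dates present and at most maxDays apart
    static boolean withinDays(LocalDate a, LocalDate b, long maxDays) {
        return a != null && b != null
                && Math.abs(ChronoUnit.DAYS.between(a, b)) <= maxDays;
    }

    // Number of positions at which the yyyyMMdd strings differ
    static int digitMismatches(LocalDate a, LocalDate b) {
        String x = a.format(DateTimeFormatter.BASIC_ISO_DATE);
        String y = b.format(DateTimeFormatter.BASIC_ISO_DATE);
        if (x.length() != y.length()) {
            // Only possible for years outside 0000-9999; treat as fully different
            return Math.max(x.length(), y.length());
        }
        int mismatches = 0;
        for (int i = 0; i < x.length(); i++) {
            if (x.charAt(i) != y.charAt(i)) {
                mismatches++;
            }
        }
        return mismatches;
    }

    // String-style test: the two dates differ in exactly one digit
    static boolean oneDigitOff(LocalDate a, LocalDate b) {
        return a != null && b != null && digitMismatches(a, b) == 1;
    }

    public static void main(String[] args) {
        LocalDate d1 = LocalDate.of(2010, 3, 4);
        LocalDate d2 = LocalDate.of(2019, 3, 4); // plausible typo in the year
        System.out.println(withinDays(d1, d2, 30)); // prints false
        System.out.println(oneDigitOff(d1, d2));    // prints true
    }
}
```

The example pair is years apart, so a time-based test rejects it, while the String-based test recognizes a likely single-keystroke error. This is the kind of case where converting dates to Strings pays off.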

More details on available predicates and some guidelines for choosing appropriate predicates are listed on the following pages.

  // Requires java.lang.reflect.Method and, from java.util,
  // ArrayList, Arrays, Comparator and List

  // Get the declared methods of the Cdc1Record interface
  final Method[] methods = Cdc1Record.class.getDeclaredMethods();

  // Sort the methods by name. getDeclaredMethods() returns methods in no
  // particular order, so sorting gives the predicates a repeatable order.
  Comparator<Method> methodComparator = new Comparator<Method>() {
    @Override
    public int compare(Method m1, Method m2) {
      return m1.getName().compareTo(m2.getName());
    }
  };
  Arrays.sort(methods, methodComparator);

  // Create a list to hold the predicates that will be constructed
  List<Predicate<Cdc1Record>> predicates = new ArrayList<Predicate<Cdc1Record>>();

  //
  // For each method that is a field accessor, add a set of predicates to the list
  //
  for (Method m : methods) {
    if (ReflectionUtils.isFieldAccessor(m)) {

      // This list will be used to form a difference predicate below
      final List<AbstractPredicate<?, Cdc1Record>> matchCorrelations
        = new ArrayList<AbstractPredicate<?, Cdc1Record>>();

      // The stem of the field name is used to name the predicates
      final String stem = ReflectionUtils.parseStemFromMethodName(m);

      //
      // For every field, create an ExactMatchPredicate
      //
      String name = "aExactMatch" + stem;
      AbstractPredicate<?, Cdc1Record> p
        = new ExactMatchPredicate<EXPORT_TYPE_JAVA, Cdc1Record>(name, m, null);
      predicates.add(p);
      matchCorrelations.add(p);

      //
      // For every String field, create 4 fuzzy match predicates and a difference predicate
      //
      if (String.class.isAssignableFrom(m.getReturnType())) {
        
        name = "aNysiss" + stem;
        p = new NysiisPredicate<Cdc1Record>(name, m);
        predicates.add(p);
        matchCorrelations.add(p);

        name = "aSoundex" + stem;
        p = new SoundexPredicate<Cdc1Record>(name, m);
        predicates.add(p);
        matchCorrelations.add(p);

        final int maxDistance = EditDistancePredicate.DEFAULT_DISTANCE;
        name = "aEditDistance" + stem;
        p = new EditDistancePredicate<Cdc1Record>(name, m,
            maxDistance);
        predicates.add(p);
        matchCorrelations.add(p);

        name = "aJaro" + stem;
        p = new JaroWinklerPredicate<Cdc1Record>(name, m);
        predicates.add(p);
        matchCorrelations.add(p);

        // In essence, a difference predicate is true when both values are
        // non-null and none of the matchCorrelations predicates is true
        name = "aDifferent" + stem;
        p = new DifferencePredicate<EXPORT_TYPE_JAVA,Cdc1Record>(name, matchCorrelations);
        predicates.add(p);

      }
    }
  }


[Table: predicate results (columns Nysiis, Soundex, Edit-D, Difference) for the last-name pairs Smith, Smith; Smith, Smithers; Smith, null; and Smith, Jones]