Previous detailUpNext detail
Detail 6.1.1Detail 6.1Detail 6.1.3

Detail 6.1.2: Character insertions, deletions and transpositions

Edit distance

Edit distance computes the smallest number of character insertions, deletions and substitutions required to convert one string into another. There are two variations of this proceedure implemented in the ChoiceMaker libraries:

  • Type ONE

    Insertions, deletions and substitutions are the only edit steps.

  • Type TWO

    Transpositions of two characters count as one edit step.

Edit distance is an approximate string matching technique that works well both for English and non-English strings.

A EditDistancePredicate returns true if the edit distance between two strings is less than some (exclusive) maximum. A useful value for the maximum distance that works well in many applications is 3. (This value is represented by the manifest constant DEFAULT_DISTANCE). By default, the type TWO algorithm is used to calculate distances, although the type ONE variation may be specified in alternate constructor methods.

Examples:

  • Mauricio, Mark -> 5
  • Peter, Pedro -> 3
  • Agus, Angus -> 1
  • Andrew, ANDREI -> 1

Basic construction:

  final int maxDistance = EditDistancePredicate.DEFAULT_DISTANCE;
  String name = "somePredicateName";
  Method m = Cdc1Record.class.getMethod("getLastName", (Class<?>[])null);
  Predicate<Cdc1Record> p = new EditDistancePredicate<Cdc1Record>(name, m, maxDistance);

Affine gap edit distance

Jumble distance

Jaro-Winkler

Jaro-Winkler is a similarity metric that was developed by the U.S. Census. It weights agreement at the start of two strings more heavily than agreement at the end of the strings. In essence, the algorithm does the following.

  • It calculates c, the number of matching characters in the first string that are within half the length of the second string. For example, the 2nd letter of the first string must appear within positions 1 - 5 of the second string if the second string is 8 characters long.
  • It assigns a normalized similarity score to the two strings based on the following forumula
      d = (1/3)*(c/s1) + (1/3)*(c/s2) + (1/3)*(t/c)
    
      where
    
        c = number of matching characters
    
        s1 = length of the first string
    
        s2 = length of the second string
    
        t = number of transpositions

A JaroWinklerPredicate returns true if the Jaro-Winkler similarity between two strings is greater than some (exclusive) minimum. A useful value for the minimum value that works well in many applications is 0.90. (This value is represented by the manifest constant DEFAULT_SIMILARITY).

Examples:

  • Mauricio, Mark -> 0.82
  • Peter, Pedro -> 0.77
  • Agus, Angus -> 0.75
  • Andrew, ANDREI -> 0.93

Basic construction:

  String name = "somePredicateName";
  Method m = Cdc1Record.class.getMethod("getLastName", (Class<?>[])null);
  Predicate<Cdc1Record> p = new JaroWinklerPredicate<Cdc1Record>(name, m);

Previous detailUpNext detail
Detail 6.1.1Detail 6.1Detail 6.1.3