
Step 5: Create error generators

This part of the tutorial describes how to introduce data-collection and data-entry errors into pairs of records. The distinction between data-collection errors and data-entry errors is somewhat artificial; we use the terms mainly as a convenient way to organize the discussion.

  • Step 5.1: Data collection errors

    Data-collection errors occur when records are entered without a full set of valid entries for the 8 essential demographic fields listed in Step 4. Examples include records entered with a null middle name for a child or with a placeholder value for a first name.

  • Step 5.2: Data entry errors

    Data-entry errors are field values that contain mistakes such as character transpositions, phonetic substitutions, or field swaps.
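To make the two kinds of error concrete, here is a minimal sketch that is independent of the tutorial's own classes. SketchPerson, ErrorSketches and their methods are hypothetical names introduced only for this illustration; the sketch shows one data-collection error (a missing middle name) and two data-entry errors (a field swap and a character transposition).

```java
import java.util.Random;

/** Hypothetical record with three name fields; not a tutorial class. */
final class SketchPerson {
  String firstName;
  String middleName;
  String lastName;

  SketchPerson(String firstName, String middleName, String lastName) {
    this.firstName = firstName;
    this.middleName = middleName;
    this.lastName = lastName;
  }
}

/** Hypothetical helpers; each one injects a single kind of error. */
final class ErrorSketches {

  private ErrorSketches() {}

  /** Data-collection error: the middle name was never recorded. */
  static void dropMiddleName(SketchPerson p) {
    p.middleName = null;
  }

  /** Data-entry error: first and last names typed into each other's field. */
  static void swapFirstAndLast(SketchPerson p) {
    String tmp = p.firstName;
    p.firstName = p.lastName;
    p.lastName = tmp;
  }

  /** Data-entry error: swaps the characters at positions i and i + 1. */
  static String transposeAt(String value, int i) {
    if (value == null || i < 0 || i + 1 >= value.length()) {
      return value; // nothing to transpose
    }
    char[] chars = value.toCharArray();
    char tmp = chars[i];
    chars[i] = chars[i + 1];
    chars[i + 1] = tmp;
    return new String(chars);
  }

  /** Data-entry error: transposes two adjacent characters at a random spot. */
  static String transposeRandomly(String value, Random random) {
    if (value == null || value.length() < 2) {
      return value;
    }
    return transposeAt(value, random.nextInt(value.length() - 1));
  }

  public static void main(String[] args) {
    SketchPerson p = new SketchPerson("MARTHA", "ANN", "JONES");
    dropMiddleName(p);
    swapFirstAndLast(p);
    p.lastName = transposeAt(p.lastName, 2);
    // prints "JONES null MATRHA"
    System.out.println(p.firstName + " " + p.middleName + " " + p.lastName);
  }
}
```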

Java representation

Data-collection and data-entry errors are represented by a semantic hierarchy of modifier interfaces:

  • ValueModifiers

    Modify a particular type of value, such as String or Date values

    /**
     * Declares a capability for modifying a property value of a Java Bean.
     * 
     * @param <V>
     *            the value type
     */
    public interface ValueModifier<V> extends NamedInstance {
      
      /**
       * An optional hint to clients that use reflection for the type of
       * value modified by this instance.
       * @return may be null
       */
      Class<V> getValueType();
      
      /** Modifies and returns the specified value */
      V modifyValue(V value) throws ModificationException;
    
    }

    While the ValueModifier interface may be implemented directly, it is usually more convenient to build upon an abstract base class such as AbstractValueModifier or FebrlDataDrivenModifier.

  • BeanModifiers

    Modify a particular type of record, such as Cdc1Record or GeneratedCdc1Record instances

    /**
     * Declares a capability for modifying a Java Bean.
     *
     * @param <T>
     *            the bean type
     */
    public interface BeanModifier<T> extends NamedInstance {

      /**
       * Modifies and returns the specified bean. (The method name shown
       * here is assumed by analogy with ValueModifier.modifyValue and
       * PairModifier.modifyPair; consult the library source for the
       * exact signature.)
       */
      T modifyBean(T bean) throws ModificationException;

    }

    While the BeanModifier interface may be implemented directly, it is usually possible to build a BeanModifier from a ValueModifier using a helper class such as BeanFieldModifier or StringFieldModifier. Complex bean modifiers can often be built up as a composite of simpler bean modifiers using helper classes such as CompositeBeanFieldModifier or RandomBeanFieldModifier.

  • PairModifiers

    Modify a particular type of record pair, such as a ModifiablePair<Cdc1Pair> instance

    /**
     * Declares a capability for modifying a pair of Java Beans.
     * 
     * @param <T>
     *            the bean type
     * @param <P>
     *            the pair type
     */
    public interface PairModifier<T, P extends ModifiablePair<T>> extends
        NamedInstance {
    
      /**
       * An enumeration of how a processor may change a pair by processing it.
       */
      enum ModificationResult {
        MATCH, HOLD, DIFFER, INDETERMINATE;
      }
    
      /**
       * Modifies the target and updates its current status.
       * 
       * @param target
       *            the object to modify
       * @return the modified target (or a modified copy of the target)
       * @throws ModificationException
       *             if modification fails
       */
      P modifyPair(P target) throws ModificationException;
    
    }

    While the PairModifier interface may be implemented directly, it is usually possible to build a PairModifier from the base classes DefaultPairModifier or CompositePairModifier.
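The layering of the three modifier levels can be sketched with simplified stand-ins. None of the types below are the library's own: the Sketch* interfaces omit NamedInstance and ModificationException, the modifyBean method name is an assumption, and the choice to modify only the right-hand record of a pair is arbitrary. The point is only to show how a pair modifier delegates to a bean modifier, which in turn delegates to a value modifier.

```java
/** Stand-in for ValueModifier; simplified for illustration. */
interface SketchValueModifier<V> {
  V modifyValue(V value);
}

/** Stand-in for BeanModifier; the method name is an assumption. */
interface SketchBeanModifier<T> {
  T modifyBean(T bean);
}

/** Stand-in for PairModifier, using a trivial pair type. */
interface SketchPairModifier<T> {
  SketchPair<T> modifyPair(SketchPair<T> pair);
}

/** Hypothetical pair of two records. */
final class SketchPair<T> {
  final T left;
  final T right;

  SketchPair(T left, T right) {
    this.left = left;
    this.right = right;
  }
}

/** Hypothetical record with a single String field. */
final class SketchRecord {
  String lastName;

  SketchRecord(String lastName) {
    this.lastName = lastName;
  }
}

/** Value level: drops the last character of a String. */
final class TruncatingModifier implements SketchValueModifier<String> {
  public String modifyValue(String value) {
    if (value == null || value.isEmpty()) {
      return value;
    }
    return value.substring(0, value.length() - 1);
  }
}

/** Bean level: applies a String modifier to the lastName field. */
final class LastNameModifier implements SketchBeanModifier<SketchRecord> {
  private final SketchValueModifier<String> delegate;

  LastNameModifier(SketchValueModifier<String> delegate) {
    this.delegate = delegate;
  }

  public SketchRecord modifyBean(SketchRecord bean) {
    bean.lastName = delegate.modifyValue(bean.lastName);
    return bean;
  }
}

/** Pair level: modifies only the right-hand record of the pair. */
final class RightSideModifier<T> implements SketchPairModifier<T> {
  private final SketchBeanModifier<T> delegate;

  RightSideModifier(SketchBeanModifier<T> delegate) {
    this.delegate = delegate;
  }

  public SketchPair<T> modifyPair(SketchPair<T> pair) {
    return new SketchPair<T>(pair.left, delegate.modifyBean(pair.right));
  }
}

final class HierarchyDemo {
  public static void main(String[] args) {
    SketchPairModifier<SketchRecord> modifier =
        new RightSideModifier<SketchRecord>(
            new LastNameModifier(new TruncatingModifier()));
    SketchPair<SketchRecord> pair = new SketchPair<SketchRecord>(
        new SketchRecord("SMITH"), new SketchRecord("SMITH"));
    SketchPair<SketchRecord> modified = modifier.modifyPair(pair);
    // prints "SMITH / SMIT"
    System.out.println(modified.left.lastName + " / " + modified.right.lastName);
  }
}
```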

Putting it all together

Generally, one builds a CompositePairModifier as a collection of PairModifier instances, where each PairModifier is built from a collection of BeanModifier instances. In turn, each BeanModifier instance is built from a collection of ValueModifier instances. One starts at the bottom layer, with ValueModifier instances, and works up.

One useful trick, which will be exploited in Step 7, is to make the collections of ValueModifiers and BeanModifiers tunable. This can be done by using a collection class called a FrequencyBasedList, in which each element in the list is entered along with a relative frequency of occurrence. When elements are drawn randomly from the list, the retrieved elements follow the specified distribution of relative frequencies. In the example below, one FrequencyBasedList holds a flat distribution of StringFieldModifier instances, all with a relative frequency of 1 (one), and is used to build a RandomPairModifier.
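Before turning to that example, the sampling idea itself can be sketched in isolation. The WeightedChooser class below is hypothetical and is not how FrequencyBasedList is implemented; it only illustrates drawing elements so that they follow a set of relative frequencies, here by walking a list of cumulative totals.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

/** Hypothetical illustration of frequency-weighted random selection. */
final class WeightedChooser<E> {

  private final List<E> elements = new ArrayList<E>();
  private final List<Integer> cumulative = new ArrayList<Integer>();
  private int total = 0;

  WeightedChooser(Map<E, Integer> frequencies) {
    for (Map.Entry<E, Integer> entry : frequencies.entrySet()) {
      int frequency = entry.getValue();
      if (frequency <= 0) {
        throw new IllegalArgumentException("frequency must be positive");
      }
      total += frequency;
      elements.add(entry.getKey());
      cumulative.add(total); // running total up to and including this element
    }
    if (total == 0) {
      throw new IllegalArgumentException("no elements");
    }
  }

  /** Draws one element; each is chosen in proportion to its frequency. */
  E next(Random random) {
    int r = random.nextInt(total); // 0 <= r < total
    for (int i = 0; i < cumulative.size(); i++) {
      if (r < cumulative.get(i)) {
        return elements.get(i);
      }
    }
    throw new IllegalStateException("unreachable: r is always below total");
  }

  public static void main(String[] args) {
    Map<String, Integer> frequencies = new LinkedHashMap<String, Integer>();
    frequencies.put("transposition", 3);
    frequencies.put("phonetic", 1);
    WeightedChooser<String> chooser = new WeightedChooser<String>(frequencies);

    Random random = new Random(42);
    int transpositions = 0;
    for (int i = 0; i < 10000; i++) {
      if (chooser.next(random).equals("transposition")) {
        transpositions++;
      }
    }
    // The observed share should be close to 3 / (3 + 1) = 0.75
    System.out.println(transpositions / 10000.0);
  }
}
```

Raising or lowering a single frequency changes how often that kind of error is drawn without touching the rest of the configuration, which is what makes the collection tunable.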

//
// Create a set of default modifiers for String values
//
List<ValueModifier<String>> stringModifiers = new LinkedList<ValueModifier<String>>();
stringModifiers.add(new OCRTransformer());
stringModifiers.add(new PhoneticTransformer());
stringModifiers.add(new NoChangeValueModifier<String>(String.class));
stringModifiers.add(new CharInserter());
stringModifiers.add(new Transpositioner());

//
// Create bean modifiers from the String modifiers and map them to frequencies
//
Map<BeanModifier<GeneratedCdc1Record>,Integer> modifierFrequencies =
    new HashMap<BeanModifier<GeneratedCdc1Record>, Integer>();
Method[] methods = Cdc1Record.class.getDeclaredMethods();
final int RELATIVE_FREQUENCY = 1;
for (Method m : methods) {
  if (CMReflectionUtils.isFieldAccessor(m)
      && m.getReturnType().equals(String.class)) {
    for (ValueModifier<String> stringModifier : stringModifiers) {
      String nameStem = CMReflectionUtils.parseStemFromMethodName(m);
      BeanModifier<GeneratedCdc1Record> beanModifier =
        new StringFieldModifier<Cdc1Record, GeneratedCdc1Record>(
          Cdc1Record.class, nameStem, stringModifier);
      modifierFrequencies.put(beanModifier, RELATIVE_FREQUENCY);
    }
  }
}

//
// Create a frequency based list from the modifier frequencies
//
@SuppressWarnings("unchecked")
Comparator<BeanModifier<GeneratedCdc1Record>> modifierComparator =
    new DefaultNullComparator();
FrequencyBasedList<BeanModifier<GeneratedCdc1Record>> fbl =
    new FrequencyBasedList<
        BeanModifier<GeneratedCdc1Record>>(modifierFrequencies, modifierComparator);

//
// Create a random pair modifier from the frequency based list
//
final PairModifier<
  GeneratedCdc1Record,
  SynthesizedPair<GeneratedCdc1Record>> randomPairModifier =
    new RandomPairModifier<
        GeneratedCdc1Record,
        SynthesizedPair<GeneratedCdc1Record>>(fbl);

//
// Finally, from a "clean" source of pairs that is defined elsewhere,
// create a source of modified pairs that have data-collection or data-entry errors
//
CMSource<SynthesizedPair<GeneratedCdc1Record>> modifiedPairs =
    new AbstractTransformingSource<
        SynthesizedPair<GeneratedCdc1Record>,
        SynthesizedPair<GeneratedCdc1Record>>(cleanPairs) {

  @Override
  public SynthesizedPair<GeneratedCdc1Record>
    transform(SynthesizedPair<GeneratedCdc1Record> pair) throws CMException {
      try {
        return randomPairModifier.modifyPair(pair);
      } catch (ModificationException e) {
        throw new CMException(e.toString() + ": " + e.getCause(), e);
      }
  }

};
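The wrapping pattern used in the last block can be shown without the library classes. The sketch below assumes nothing about CMSource or AbstractTransformingSource; TransformingIterator is a hypothetical stand-in built on java.util.Iterator, and dropping the first character is just a placeholder for a real modifier. It illustrates how a "clean" sequence is wrapped so that each item is modified as it is read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a transforming source: wraps an iterator. */
abstract class TransformingIterator<S, T> implements Iterator<T> {

  private final Iterator<S> source;

  TransformingIterator(Iterator<S> source) {
    this.source = source;
  }

  /** Subclasses define how each clean item becomes a modified item. */
  protected abstract T transform(S item);

  public boolean hasNext() {
    return source.hasNext();
  }

  public T next() {
    return transform(source.next());
  }

  public void remove() {
    throw new UnsupportedOperationException();
  }
}

final class TransformingIteratorDemo {

  /** Reads every item through a wrapper that drops its first character. */
  static List<String> dropFirstChars(List<String> clean) {
    Iterator<String> modified =
        new TransformingIterator<String, String>(clean.iterator()) {
          @Override
          protected String transform(String item) {
            return item.isEmpty() ? item : item.substring(1);
          }
        };
    List<String> result = new ArrayList<String>();
    while (modified.hasNext()) {
      result.add(modified.next());
    }
    return result;
  }

  public static void main(String[] args) {
    // prints "[MITH, ONES]"
    System.out.println(dropFirstChars(Arrays.asList("SMITH", "JONES")));
  }
}
```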
