Class PentahoJaroWinklerDistance

java.lang.Object
org.pentaho.di.core.util.PentahoJaroWinklerDistance

public class PentahoJaroWinklerDistance extends Object
A similarity algorithm indicating the percentage of matched characters between two character sequences.

The Jaro measure is the weighted sum of the percentage of matched characters from each sequence and the percentage of transposed characters. Winkler's modification increases this measure when the initial characters of the two sequences match.

This implementation is based on the Jaro Winkler similarity algorithm from http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance.

This code has been adapted from Apache Commons Lang 3.3.

Since:
1.0
  • Field Details

    • INDEX_NOT_FOUND

      public static final int INDEX_NOT_FOUND
      Represents a failed index search.
  • Constructor Details

    • PentahoJaroWinklerDistance

      public PentahoJaroWinklerDistance()
  • Method Details

    • getJaroDistance

      public Double getJaroDistance()
    • getJaroWinklerDistance

      public Double getJaroWinklerDistance()
    • apply

      public void apply(CharSequence left, CharSequence right)
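      Computes the Jaro-Winkler distance, a similarity score between two CharSequences. The method itself returns void; the values listed below are the scores for each input pair, presumably read back afterwards through getJaroWinklerDistance().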
       distance.apply(null, null)          = IllegalArgumentException
       distance.apply("","")               = 0.0
       distance.apply("","a")              = 0.0
       distance.apply("aaapppp", "")       = 0.0
       distance.apply("frog", "fog")       = 0.93
       distance.apply("fly", "ant")        = 0.0
       distance.apply("elephant", "hippo") = 0.44
       distance.apply("hippo", "elephant") = 0.44
       distance.apply("hippo", "zzzzzzzz") = 0.0
       distance.apply("hello", "hallo")    = 0.88
       distance.apply("ABC Corporation", "ABC Corp") = 0.93
       distance.apply("D N H Enterprises Inc", "D & H Enterprises, Inc.") = 0.95
       distance.apply("My Gym Children's Fitness Center", "My Gym. Childrens Fitness") = 0.92
       distance.apply("PENNSYLVANIA", "PENNCISYLVNIA")    = 0.88
       
      Parameters:
      left - the first String, must not be null
      right - the second String, must not be null
      Throws:
      IllegalArgumentException - if either String input is null
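
To make the scores in the examples above easier to check by hand, here is a minimal sketch of how a Jaro score and its Winkler adjustment can be computed from match counts. It is not the Pentaho source: the class JaroWinklerScoreSketch, its method names, the 0.7 boost threshold, and the min(0.1, 1/maxLen) scaling factor are all assumptions modelled on the Apache Commons Lang 3.x variant that this class says it was adapted from. The counts passed in main were worked out by hand for "frog" and "fog".

```java
final class JaroWinklerScoreSketch {

    // Hypothetical helper, not part of PentahoJaroWinklerDistance.
    // Jaro score from the number of matched characters, the halved
    // transposition count, and the two input lengths.
    public static double jaro(int matches, int halfTranspositions, int leftLen, int rightLen) {
        if (matches == 0) {
            return 0.0; // no common characters: similarity is zero
        }
        double m = matches;
        return (m / leftLen + m / rightLen + (m - halfTranspositions) / m) / 3.0;
    }

    // Winkler adjustment: boosts a sufficiently high Jaro score in
    // proportion to the length of the common prefix.
    // Assumed constants: threshold 0.7, scaling factor min(0.1, 1/maxLen).
    public static double jaroWinkler(double jaro, int prefix, int maxLen) {
        if (jaro < 0.7) {
            return jaro;
        }
        return jaro + Math.min(0.1, 1.0 / maxLen) * prefix * (1.0 - jaro);
    }

    public static void main(String[] args) {
        // "frog" vs "fog": 3 matches (f, o, g), 0 transpositions,
        // common prefix "f" (length 1), longer input has 4 characters.
        double j = jaro(3, 0, 4, 3);
        double jw = jaroWinkler(j, 1, 4);
        System.out.println("jaro=" + j + " jaroWinkler=" + jw);
    }
}
```

Under these assumptions the "frog"/"fog" pair comes out at roughly 0.925 before rounding, in line with the 0.93 listed above, and "hello"/"hallo" (4 matches, prefix 1, length 5) comes out at roughly 0.88.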
    • matches

      protected static int[] matches(CharSequence first, CharSequence second)
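      Returns the counts used by the Jaro-Winkler calculation as a single array: the number of matches, the number of transpositions, the length of the common prefix, and the length of the longer string.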
      Parameters:
      first - the first string to be matched
      second - the second string to be matched
      Returns:
      mtp array containing: matches, transpositions, prefix, and max length
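
As an illustration of what such an mtp array might contain, the following sketch counts matches, transpositions, and the common prefix for two inputs. It is an assumption-laden reconstruction in the style of Apache Commons Lang 3.x, not the Pentaho source: the class name JaroMatchesSketch is hypothetical, the search window of max(maxLen / 2 - 1, 0) is assumed, and the transposition count is assumed to be stored already halved.

```java
import java.util.Arrays;

final class JaroMatchesSketch {

    // Hypothetical stand-in for the protected matches(...) method.
    // Returns {matches, halfTranspositions, prefixLength, maxLength}.
    public static int[] matches(CharSequence first, CharSequence second) {
        boolean firstLonger = first.length() > second.length();
        CharSequence max = firstLonger ? first : second;
        CharSequence min = firstLonger ? second : first;

        // Characters only count as matching within this distance (assumed window).
        int range = Math.max(max.length() / 2 - 1, 0);
        int[] matchIndexes = new int[min.length()];
        Arrays.fill(matchIndexes, -1);
        boolean[] matchFlags = new boolean[max.length()];

        int matches = 0;
        for (int mi = 0; mi < min.length(); mi++) {
            char c = min.charAt(mi);
            int from = Math.max(mi - range, 0);
            int to = Math.min(mi + range + 1, max.length());
            for (int xi = from; xi < to; xi++) {
                if (!matchFlags[xi] && c == max.charAt(xi)) {
                    matchIndexes[mi] = xi;
                    matchFlags[xi] = true;
                    matches++;
                    break;
                }
            }
        }

        // Collect the matched characters of each input in their original order.
        char[] ms1 = new char[matches];
        char[] ms2 = new char[matches];
        for (int i = 0, si = 0; i < min.length(); i++) {
            if (matchIndexes[i] != -1) {
                ms1[si++] = min.charAt(i);
            }
        }
        for (int i = 0, si = 0; i < max.length(); i++) {
            if (matchFlags[i]) {
                ms2[si++] = max.charAt(i);
            }
        }

        // Matched characters that appear in a different order are transposed.
        int outOfOrder = 0;
        for (int i = 0; i < matches; i++) {
            if (ms1[i] != ms2[i]) {
                outOfOrder++;
            }
        }

        // Length of the common prefix of the two inputs.
        int prefix = 0;
        while (prefix < min.length() && first.charAt(prefix) == second.charAt(prefix)) {
            prefix++;
        }

        return new int[] { matches, outOfOrder / 2, prefix, max.length() };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(matches("MARTHA", "MARHTA")));
    }
}
```

With this sketch, "MARTHA" and "MARHTA" share all six characters, with "T" and "H" swapped, so the array holds six matches, one (halved) transposition, a prefix of three, and a maximum length of six.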
    • reset

      public void reset()