Class PentahoJaroWinklerDistance


  • public class PentahoJaroWinklerDistance
    extends Object
    A similarity algorithm indicating the percentage of matched characters between two character sequences.

    The Jaro measure is a weighted sum of the percentage of matched characters in each sequence and the percentage of transposed characters. Winkler increased this measure for sequences whose initial characters match.

    This implementation is based on the Jaro Winkler similarity algorithm from http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance.

    This code has been adapted from Apache Commons Lang 3.3.
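
    The measure described above can be sketched from its standard textbook formulation. This is only an illustration, not the Pentaho source: the class and method names are hypothetical, the scaling factor of 0.1 and the four-character prefix cap are the conventional Winkler values and are assumed here, and the actual class may apply additional thresholds or rounding.

```java
// Illustrative sketch only; names and constants are assumptions, not the Pentaho code.
final class JaroWinklerFormulaSketch {
    // Conventional Winkler prefix scaling factor (assumed).
    static final double SCALING_FACTOR = 0.1;
    // The prefix bonus is conventionally capped at four characters (assumed).
    static final int MAX_PREFIX = 4;

    /**
     * Jaro similarity from the match count m, the transposition count t
     * (half the number of matched characters that are out of order),
     * and the lengths of the two sequences.
     */
    static double jaro(int m, int t, int len1, int len2) {
        if (m == 0) {
            return 0.0; // no matched characters means no similarity
        }
        double md = m;
        return (md / len1 + md / len2 + (md - t) / md) / 3.0;
    }

    /** Winkler adjustment: raise the Jaro score when the sequences share a prefix. */
    static double jaroWinkler(double jaro, int commonPrefix) {
        int prefix = Math.min(commonPrefix, MAX_PREFIX);
        return jaro + prefix * SCALING_FACTOR * (1.0 - jaro);
    }

    public static void main(String[] args) {
        // "hello" vs "hallo": 4 matched characters, 0 transpositions, shared prefix "h".
        double j = jaro(4, 0, 5, 5);
        double jw = jaroWinkler(j, 1);
        System.out.println("jaro=" + j + " jaroWinkler=" + jw);
    }
}
```

    For "hello" and "hallo" the sketch yields a Jaro-Winkler score of about 0.88, which agrees with the value listed for that pair under apply.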

    Since:
    1.0
    • Field Detail

      • INDEX_NOT_FOUND

        public static final int INDEX_NOT_FOUND
        Represents a failed index search.
        See Also:
        Constant Field Values
    • Constructor Detail

      • PentahoJaroWinklerDistance

        public PentahoJaroWinklerDistance()
    • Method Detail

      • getJaroDistance

        public Double getJaroDistance()
      • getJaroWinklerDistance

        public Double getJaroWinklerDistance()
      • apply

        public void apply(CharSequence left,
                          CharSequence right)
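        Computes the Jaro-Winkler distance, a similarity score between two CharSequences. The method itself returns void; the value to the right of each = below is the resulting score, which is presumably read through getJaroWinklerDistance() after the call.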
         distance.apply(null, null)          throws IllegalArgumentException
         distance.apply("","")               = 0.0
         distance.apply("","a")              = 0.0
         distance.apply("aaapppp", "")       = 0.0
         distance.apply("frog", "fog")       = 0.93
         distance.apply("fly", "ant")        = 0.0
         distance.apply("elephant", "hippo") = 0.44
         distance.apply("hippo", "elephant") = 0.44
         distance.apply("hippo", "zzzzzzzz") = 0.0
         distance.apply("hello", "hallo")    = 0.88
         distance.apply("ABC Corporation", "ABC Corp") = 0.93
         distance.apply("D N H Enterprises Inc", "D & H Enterprises, Inc.") = 0.95
         distance.apply("My Gym Children's Fitness Center", "My Gym. Childrens Fitness") = 0.92
         distance.apply("PENNSYLVANIA", "PENNCISYLVNIA")    = 0.88
         
        Parameters:
        left - the first CharSequence, must not be null
        right - the second CharSequence, must not be null
        Throws:
        IllegalArgumentException - if either input is null
      • matches

        protected static int[] matches(CharSequence first,
                                       CharSequence second)
        Returns an array holding the Jaro-Winkler match count, transposition count, common prefix length, and max length for the two strings.
        Parameters:
        first - the first string to be matched
        second - the second string to be matched
        Returns:
        the mtp array containing, in order: matches, transpositions, prefix, and max length
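
        One way such an array could be computed is sketched below. It follows the usual Jaro matching procedure and is not the Pentaho source: the class name is hypothetical, the matching window of half the longer length minus one is the conventional choice, and "max length" is assumed to mean the length of the longer input. The prefix is returned uncapped.

```java
import java.util.Arrays;

// Illustrative sketch only; not the Pentaho implementation.
final class JaroMatchesSketch {
    /** Returns {matches, transpositions, common prefix length, length of the longer input}. */
    static int[] matches(CharSequence first, CharSequence second) {
        CharSequence longer = first.length() >= second.length() ? first : second;
        CharSequence shorter = longer == first ? second : first;
        // Two equal characters only count as a match if they are at most this far apart.
        int window = Math.max(longer.length() / 2 - 1, 0);
        boolean[] usedShort = new boolean[shorter.length()];
        boolean[] usedLong = new boolean[longer.length()];
        int matches = 0;
        for (int i = 0; i < shorter.length(); i++) {
            int from = Math.max(i - window, 0);
            int to = Math.min(i + window + 1, longer.length());
            for (int j = from; j < to; j++) {
                if (!usedLong[j] && shorter.charAt(i) == longer.charAt(j)) {
                    usedShort[i] = true;
                    usedLong[j] = true;
                    matches++;
                    break;
                }
            }
        }
        // Walk the matched characters of both inputs in order and count disagreements.
        int outOfOrder = 0;
        int j = 0;
        for (int i = 0; i < shorter.length(); i++) {
            if (!usedShort[i]) {
                continue;
            }
            while (!usedLong[j]) {
                j++;
            }
            if (shorter.charAt(i) != longer.charAt(j)) {
                outOfOrder++;
            }
            j++;
        }
        // Length of the shared prefix (not capped in this sketch).
        int prefix = 0;
        while (prefix < shorter.length() && shorter.charAt(prefix) == longer.charAt(prefix)) {
            prefix++;
        }
        // Each transposition involves two out-of-order characters, hence the division.
        return new int[] {matches, outOfOrder / 2, prefix, longer.length()};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(matches("frog", "fog")));
        System.out.println(Arrays.toString(matches("MARTHA", "MARHTA")));
    }
}
```

        For "frog" and "fog" this gives 3 matches, 0 transpositions, a prefix of 1, and a max length of 4.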
      • reset

        public void reset()