Package org.pentaho.di.core.util
Class PentahoJaroWinklerDistance
java.lang.Object
org.pentaho.di.core.util.PentahoJaroWinklerDistance
A similarity algorithm indicating the percentage of matched characters between two character sequences.
The Jaro measure is the weighted sum of the percentage of matched characters from each sequence and the transposed characters. Winkler increased this measure for matching initial characters.
This implementation is based on the Jaro Winkler similarity algorithm from http://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance.
This code has been adapted from Apache Commons Lang 3.3.
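For orientation, the commonly cited textbook form of the two measures is sketched below. It is the standard definition, not a transcription of this class, and the adapted code may differ in details such as rounding or prefix handling. The symbols are notation introduced here: m is the number of matched characters, t the number of transpositions, \ell the length of the common prefix, and p a scaling factor (conventionally 0.1).

```latex
\[
d_j =
\begin{cases}
0 & \text{if } m = 0, \\[4pt]
\dfrac{1}{3}\left(\dfrac{m}{|s_1|} + \dfrac{m}{|s_2|} + \dfrac{m - t}{m}\right) & \text{otherwise,}
\end{cases}
\]
\[
d_w = d_j + \ell \, p \, (1 - d_j)
\]
```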
- Since:
- 1.0
-
Field Summary
Modifier and Type | Field | Description
static final int | INDEX_NOT_FOUND | Represents a failed index search.
Constructor Summary
-
Method Summary
Modifier and Type | Method | Description
void | apply(CharSequence left, CharSequence right) | Find the Jaro Winkler Distance which indicates the similarity score between two CharSequences.
protected static int[] | matches(CharSequence first, CharSequence second) | This method returns the Jaro-Winkler string matches, transpositions, prefix, max array.
void | reset() |
-
Field Details
-
INDEX_NOT_FOUND
public static final int INDEX_NOT_FOUND
Represents a failed index search.
-
-
Constructor Details
-
PentahoJaroWinklerDistance
public PentahoJaroWinklerDistance()
-
-
Method Details
-
getJaroDistance
-
getJaroWinklerDistance
-
apply
Find the Jaro Winkler Distance which indicates the similarity score between two CharSequences.
distance.apply(null, null) = IllegalArgumentException
distance.apply("","") = 0.0
distance.apply("","a") = 0.0
distance.apply("aaapppp", "") = 0.0
distance.apply("frog", "fog") = 0.93
distance.apply("fly", "ant") = 0.0
distance.apply("elephant", "hippo") = 0.44
distance.apply("hippo", "elephant") = 0.44
distance.apply("hippo", "zzzzzzzz") = 0.0
distance.apply("hello", "hallo") = 0.88
distance.apply("ABC Corporation", "ABC Corp") = 0.93
distance.apply("D N H Enterprises Inc", "D & H Enterprises, Inc.") = 0.95
distance.apply("My Gym Children's Fitness Center", "My Gym. Childrens Fitness") = 0.92
distance.apply("PENNSYLVANIA", "PENNCISYLVNIA") = 0.88
- Parameters:
left - the first String, must not be null
right - the second String, must not be null
- Throws:
IllegalArgumentException - if either String input is null
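As a rough check on one of the listed values, the "hello" / "hallo" pair can be worked by hand. This assumes the textbook Jaro-Winkler formula with a scaling factor of 0.1, which is an assumption about how the score is formed rather than something this page states. Both strings have length 5, four characters match (h, l, l, o), none are transposed, and the common prefix is "h" (length 1).

```latex
\[
d_j = \frac{1}{3}\left(\frac{4}{5} + \frac{4}{5} + \frac{4 - 0}{4}\right) = \frac{13}{15} \approx 0.867
\]
\[
d_w = \frac{13}{15} + 1 \cdot 0.1 \cdot \left(1 - \frac{13}{15}\right) = \frac{13.2}{15} = 0.88
\]
```

The result agrees with the 0.88 shown for this pair.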
-
matches
This method returns the Jaro-Winkler string matches, transpositions, prefix, max array.
- Parameters:
first - the first string to be matched
second - the second string to be matched
- Returns:
mtp array containing: matches, transpositions, prefix, and max length
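To make the four-element array more concrete, here is a minimal, self-contained sketch of how such counts could be computed and turned into a score. It is not the Pentaho or Commons Lang source. The class `JaroWinklerSketch` and the methods `matchCounts` and `similarity` are hypothetical names, and the sketch follows the textbook variant (prefix capped at 4 characters, scaling factor 0.1, no rounding, no null checks). Its results can therefore differ from the values listed for `apply`, particularly for strings with long shared prefixes.

```java
public class JaroWinklerSketch {

    /**
     * Hypothetical helper: returns {matches, transpositions, prefix, maxLength}.
     * Textbook variant, not the library's matches() method.
     */
    public static int[] matchCounts(CharSequence first, CharSequence second) {
        CharSequence longer;
        CharSequence shorter;
        if (first.length() >= second.length()) {
            longer = first;
            shorter = second;
        } else {
            longer = second;
            shorter = first;
        }
        // Characters only count as a match within this distance of each other.
        int range = Math.max(longer.length() / 2 - 1, 0);
        boolean[] shortMatched = new boolean[shorter.length()];
        boolean[] longMatched = new boolean[longer.length()];
        int matches = 0;
        for (int i = 0; i < shorter.length(); i++) {
            int lo = Math.max(i - range, 0);
            int hi = Math.min(i + range + 1, longer.length());
            for (int j = lo; j < hi; j++) {
                if (!longMatched[j] && shorter.charAt(i) == longer.charAt(j)) {
                    shortMatched[i] = true;
                    longMatched[j] = true;
                    matches++;
                    break;
                }
            }
        }
        // Walk both matched sequences in order; out-of-order pairs are half-transpositions.
        int halfTranspositions = 0;
        int j = 0;
        for (int i = 0; i < shorter.length(); i++) {
            if (!shortMatched[i]) {
                continue;
            }
            while (!longMatched[j]) {
                j++;
            }
            if (shorter.charAt(i) != longer.charAt(j)) {
                halfTranspositions++;
            }
            j++;
        }
        // Common prefix, capped at 4 characters as in the textbook definition.
        int prefix = 0;
        int limit = Math.min(4, shorter.length());
        while (prefix < limit && first.charAt(prefix) == second.charAt(prefix)) {
            prefix++;
        }
        return new int[] {matches, halfTranspositions / 2, prefix, longer.length()};
    }

    /** Textbook Jaro-Winkler similarity built from the counts above (scaling factor 0.1). */
    public static double similarity(CharSequence first, CharSequence second) {
        int[] mtp = matchCounts(first, second);
        double m = mtp[0];
        if (m == 0) {
            return 0.0;
        }
        double jaro = (m / first.length() + m / second.length() + (m - mtp[1]) / m) / 3.0;
        return jaro + mtp[2] * 0.1 * (1.0 - jaro);
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(matchCounts("hello", "hallo")));
        System.out.println(similarity("hello", "hallo"));
    }
}
```

In this sketch the fourth element (the longer length) is carried along only to mirror the documented array layout; the textbook score does not use it.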
-
reset
public void reset()
-