String distances: string_distance
Syntax
string_distance ( method, string1, string2 )
Input parameters
method |
the distance method to apply (levenshtein, damerau_levenshtein, hamming, jaro_winkler) |
string1 |
the first string operand |
string2 |
the second string operand |
Examples of valid syntaxes
string_distance(levenshtein, "foo", "bar")
string_distance(levenshtein, DS_1, DS_2)
string_distance(damerau_levenshtein, "foo", "bar")
string_distance(hamming, "foo", "bar")
string_distance(jaro_winkler, "FOO", "BAR")
Semantics for scalar operations
Returns the distance between two strings using the specified distance method.
All distance methods are symmetric (commutative), meaning that string_distance(method, string1, string2) equals string_distance(method, string2, string1).
levenshtein: Returns the Levenshtein distance, which is the minimum number of single-character edits (insertions, deletions, or substitutions) required to change one string into another.
damerau_levenshtein: Returns the Damerau–Levenshtein distance, which extends Levenshtein distance by including transpositions of adjacent characters as a single operation.
hamming: Returns the Hamming distance, which is the number of positions at which the corresponding characters are different. Both strings must have the same length.
jaro_winkler: Returns the Jaro–Winkler distance, which is a string metric measuring an edit distance between two sequences. It is a variant of the Jaro distance metric and mainly used in the area of record linkage.
For example:
string_distance(levenshtein, "foo", "fo") gives 1string_distance(damerau_levenshtein, "bar", "baz") gives 1string_distance(hamming, "foo", "fob") gives 1string_distance(jaro_winkler, "FOO", "BAR") gives a value between 0 and 1Input parameters type
method
component<string>
| string
string1, string2
dataset { measure<string> _+ }
| component<string>
| string
Result type
result
dataset { measure<number> _+ }
| component<number>
| number
Additional Constraints
The method parameter must be one of:
levenshtein,damerau_levenshtein,hamming, orjaro_winkler.For Hamming distance, both strings must have the same length.
For operations at Data Set level, the input Data Sets must have exactly one string type Measure each.
Behaviour
The operator has the behaviour of the “Operators applicable on two Scalar Values or Data Sets or Data Set Components” (see the section “Typical behaviours of the ML Operators”). If string1 and string2 are Data Sets then string_distance returns a dataset with a single measure of type number.
Examples
Given the operand datasets DS_1 and DS_2:
Input DS_1 (see structure)
Id_1 |
Id_2 |
Me_1 |
|---|---|---|
1 |
A |
foo |
2 |
B |
bar |
Input DS_2 (see structure)
Id_1 |
Id_2 |
Me_1 |
|---|---|---|
1 |
A |
fo |
2 |
B |
baz |
Example 1
DS_r := string_distance(levenshtein, DS_1, DS_2);
results in (see structure):
Id_1 |
Id_2 |
Me_1 |
|---|---|---|
1 |
A |
1 |
2 |
B |
1 |
Example 2
DS_r := string_distance(damerau_levenshtein, DS_1, DS_2);
results in (see structure):
Id_1 |
Id_2 |
Me_1 |
|---|---|---|
1 |
A |
1 |
2 |
B |
1 |
Example 3
DS_r := DS_1[calc Me_2 := string_distance(hamming, Me_1, "fob")];
results in (see structure):
Id_1 |
Id_2 |
Me_1 |
Me_2 |
|---|---|---|---|
1 |
A |
foo |
1 |
2 |
B |
bar |
3 |