Fast batch jaro winkler distance implementation in C99.
Project description
batch_jaro_winkler
Fast batch jaro winkler distance implementation in C99 with Ruby, OCaml and Python bindings.
This project gets its performance from the pre-calculation of an optimized model in advance of the actual runtime calculations. Supports any encoding.
C99, Python >= 3.3, OCaml >= 4.0 and Ruby >= 2.1 (Warning regarding ruby versions)
Language specific parts:
Benchmark
Linear scale for dramatic effect.
2 datasets are used for the benchmark: english words and chinese words. These datasets are fairly small (~ 5 MB). The bigger the dataset, the greater the speedup you get by using batch_jaro_winkler over other libraries.
The metric MB/s
refers to the number of megabytes of the original datasets processed per second. The datasets are utf-8 encoded, so 1 byte = 1 character for english text. So if we have an english dataset of 1 million 10 characters words, our dataset is 10 MB, and batch_jaro_winkler 4 threads
is able to calculate the score for each of the original words against a new value in 10 MB / 400 MB/s = 25 ms with min_score=0.0
, and 10 MB / 1.7 GB/s = 6 ms with min_score=0.9
.
Libraries used for comparison: Levenshtein, jellyfish, hotwater, jaro_winkler, fuzzy-string-match, amatch, textdistance, py_stringmatching, jaro, pyjarowinkler
Installation
Python: pip3 install batch_jaro_winkler
OCaml: opam update && opam install batch_jaro_winkler
Ruby: gem install batch_jaro_winkler
Issues?
- You need a development version of python or ruby, with
apt-get
that would beapt-get install python3-dev
orapt-get install ruby-dev
. - You need a C compiler installed,
gcc
orclang
for example. - You need
make
for the ruby library.
Examples
import batch_jaro_winkler as bjw
candidates = ['héllo', '中国', 'hiz']
exportable_model = bjw.build_exportable_model(candidates)
runtime_model = bjw.build_runtime_model(exportable_model)
res = bjw.jaro_winkler_distance(runtime_model, 'hélloz')
# res = [('中国', 0.0), ('hiz', 0.5), ('héllo', 0.9666666388511658)]
Use of min_score
for each candidate:
candidates = [{ 'candidate': 'héllo', 'min_score': 0.99 }, { 'candidate': '中国', 'min_score': 0.0 }, { 'candidate': 'hiz', 'min_score': 0.4 }]
res = bjw.jaro_winkler_distance(runtime_model, 'hélloz')
# res = [('中国', 0.0), ('hiz', 0.5)]
Use of min_score
as runtime argument, which takes precedence over the min scores for each candidate:
candidates = [{ 'candidate': 'héllo', 'min_score': 0.99 }, { 'candidate': '中国', 'min_score': 0.0 }, { 'candidate': 'hiz', 'min_score': 0.4 }]
res = bjw.jaro_winkler_distance(runtime_model, 'hélloz', min_score=0.5)
# res = [('hiz', 0.5), ('héllo', 0.9666666388511658)]
Use of weight
, threshold
and n_best_results
:
candidates = ['héllo', '中国', 'hiz']
res = bjw.jaro_winkler_distance(runtime_model, 'hélloz', weight=0.2, threshold=0.5, n_best_results=1)
# res = [('héllo', 0.9888888597488403)]
Correctness
Different libraries calculate scores differently. The output of this project matches the output of the original implementation by Bill Winkler, George McLaughlin and Matt Jaro. See here for more details.
How to use
The exportable model
The first step is to build a model from a set of candidates. This model is a simple string (bytes
in Python) that we can store where we like: RAM, disk, a database, S3 etc.
We can optionally set a min_score
requirement for each candidate, so that a candidate is only returned at runtime if the matching score is higher than a certain value, except if we manually pass a min_score
at runtime, which takes precedence.
We need to choose how many threads we want to use for runtime calculations when we build the exportable model, as this value is used for an optimized internal representation.
The exportable model compresses very well, but decompressing at runtime might be too slow depending on your needs:
>>> import gzip
>>> exportable_model = bjw.build_exportable_model(candidates)
>>> len(exportable_model)
34738694
>>> compressed_exportable_model = gzip.compress(exportable_model)
>>> len(compressed_exportable_model)
9443218
The runtime model and runtime calculations
Once we have an exportable model, we can make runtime score calculations. A prerequisite is to build a runtime model. This is a very cheap operation, but to prevent us from doing it over and over we can reuse the model for any number of runtime calculations. Beware that while an exportable model can be used by multiple threads at the same time as we only read from it, we write to the runtime model passed when doing runtime calculations, so it isn't thread safe.
Then come the score calculations. We can perform jaro or jaro winkler distance calculations. We can set the min_score
argument so that only candidates matching with at least a certain score are returned. This argument overrides the min_score
that we may have set for each candidate when building the exportable model. A high value improves the performance, see the benchmark. We can also set the n_best_results
argument, it filters the candidates and makes the runtime function return only the best scoring candidates. A small value (< 20% of the dataset size) improves the performance, see the benchmark.
How does it work?
One possible approach to performing jaro winkler distance calculations against a set of values is to test each value one after the other. This project leverages the fact that values are known in advance to build a data structure (a model) that prevents us from looping to find matches, using a hash table linking to a list of matches instead. More often than not, you only care about matches that have a score higher than a threshold value. We benefit from this fact by skipping candidates that can't match the threshold as the calculations go on, allowing us to speed up further.
We calculate the jaro winkler distance in 2 steps. The first one calculates the number of matches for a candidate, and populates a 'flags' data structure that is used later on to calculate the number of transpositions.
These would be the flags for 'im marhta yo' as the input and 'martha' as the candidate:
input_flags = [0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
cand_flags_flags = [1, 1, 1, 1, 1, 1]
The data structure used to speed up the finding of matching characters is a hash table linking characters to their occurrences in candidates. It is stored this way:
{
'a': | cand1 | cand1 | cand1 | cand1 | nb_oc | ind1 | ind2 | ind3 | cand2 | ...,
'b': | cand1 | cand1 | cand1 | cand1 | nb_oc | ind1 | ind2 | cand2 | ...,
}
Where 'a' and 'b' are characters which appear in candidates, 'cand1' and 'cand2' are the indexes of the candidates in the model, 'nb_oc' is the number of occurrences of the character in the candidate and 'ind1', 'ind2' etc. are the indexes of the character's occurrences in the candidate.
Additionally, we can know in advance how many occurrences matches we need to satisfy the
min_score
requirement. This allows us to ignore candidates once we know that they won't
be able to match. For example, with a min_score
of 1.0:
runtime_input_len = 8, candidate_len = 8 => We know that we must have 8 matches
runtime_input_len = 9, candidate_len = 8 => We know that it is impossible that they match
With a min_score
of 0.9:
runtime_input_len = 8, candidate_len = 8 =>
(3.0 * min_score * candidate_len * runtime_input_len - (candidate_len * runtime_input_len)) / (candidate_len + runtime_input_len)
=> 6.8, we need at least 7 matches
runtime_input_len = 9, candidate_len = 8 =>
(3.0 * min_score * candidate_len * runtime_input_len - (candidate_len * runtime_input_len)) / (candidate_len + runtime_input_len)
=> 7.2, we need at least 8 matches
This data representation allows us to have an efficient runtime, which looks something like this (simplified):
for i_char, char in runtime_input:
remaining_chars = len(runtime_input) - i_char
occurrences = char_matches[char]
for candidate_ind, indexes in occurrences:
enough_matches_possible = candidates[candidate_ind].nb_matches + remaining_chars >= candidates[candidate_ind].required_nb_matches
if not enough_matches_possible:
continue
for ind in indexes:
match_in_search_range = ind >= i_char - candidates[candidate_ind].search_range and ind <= i_char + candidates[candidate_ind].search_range
if match_in_search_range:
candidates[candidate_ind].nb_matches += 1
runtime_input_flags[candidate_ind][i_char] = 1
candidates_flags[candidate_ind][ind] = 1
break
Once this is done, all that is left to do is to calculate the number of transpositions for possibly matching candidates (candidates that have a number of matches at least equal to the required number of matches) from the flags.
Things that were tried but did not improve performance. They could still bring performance gains if done right:
- Instead of iterating over all occurrences matches for a character everytime, keep a linked list (as offsets in the candidates' array) of possible candidate occurrences for a given character. We can eliminate candidate occurrences when we know a candidate can't match, or when we explored all occurrences for this candidate and this character.
- Using bits instead of uint8_t for runtime_input_flags and candidate_flags.
- Keep track of the number of potential matches left for a candidate, that way we can skip impossible candidates in try_to_match_occurrence the same way we do with nb_matches + remaining_chars < required_nb_matches
Things that were not tried:
- Sort candidates by jaro distance when building the model. This would greatly optimize cache usage, as potential candidates for a particular runtime_input would all be near one another, making memory accesses more efficient. Right now we are using an approximation of this, sorting by alphabetical order + length.
- Better split candidates across each thread's local storage so that each thread takes around the same time to finish. This would greatly improve the multi-threading performance, which caps at around 3 times faster than with 1 thread, even with 6 threads.
- If bits are used for the flags, maybe use binary operations for transpositions calculation? Seems unlikely.
- Use SIMD instructions, probably tough because there is a lot of branching right now.
- Keep track of the current best score when finding matches, to invalidate candidates that are guaranteed to have a smaller score. Same when calculating transpositions.
Python
import batch_jaro_winkler as bjw
candidates = ['héllo', '中国', 'hiz']
exportable_model = bjw.build_exportable_model(candidates)
runtime_model = bjw.build_runtime_model(exportable_model)
res = bjw.jaro_winkler_distance(runtime_model, 'hélloz')
# res = [('中国', 0.0), ('hiz', 0.5), ('héllo', 0.9666666388511658)]
-
build_exportable_model
build_exportable_model(candidates, nb_runtime_threads=1)
Parameter | Type | Comment |
---|---|---|
candidates | list-like | A list (or list-like object) containing the strings to match runtime values against. Must respect one of these 2 schemas: ['hi', 'hello'] or [{ 'candidate': 'hi', 'min_score': 0.5 }, { 'candidate': 'hello', 'min_score': 0.8 }] . If one candidate has a min_score , all of them must have one. If min_score is provided, a candidate is only returned at runtime if the matching score is higher than its min score specified here, except if we manually pass a min_score at runtime, which takes precedence. |
nb_runtime_threads | int | The number of threads to use at runtime (jaro_distance[_bytes] and jaro_winkler_distance[_bytes] ). |
Returns a bytes
object.
-
build_exportable_model_bytes
build_exportable_model_bytes(char_width, candidates, nb_runtime_threads=1)
Parameter | Type | Comment |
---|---|---|
char_width | int | Must be one of {1, 2, 4}. The width in bytes of a single character in the strings you provide in the candidates parameter. For example, if you use utf-32 , set char_width to 4. |
candidates | list-like | A list (or list-like object) containing the strings to match runtime values against. They can be encoded however you like, including in custom encodings, and including with 0 characters in the middle of the encoded strings. Must respect one of these 2 schemas: [b'hi', b'hello'] or [{ 'candidate': b'hi', 'min_score': 0.5 }, { 'candidate': b'hello', 'min_score': 0.8 }] . If one candidate has a min_score , all of them must have one. If min_score is provided, a candidate is only returned at runtime if the matching score is higher than its min score specified here, except if we manually pass a min_score at runtime, which takes precedence. |
nb_runtime_threads | int | The number of threads to use at runtime (jaro_distance[_bytes] and jaro_winkler_distance[_bytes] ). |
Returns a bytes
object.
-
build_runtime_model
build_runtime_model(exportable_model)
Parameter | Type | Comment |
---|---|---|
exportable_model | bytes | An exportable model built with build_exportable_model[_bytes] . |
Returns an object that you can then pass as argument to one of the runtime functions: jaro_distance[_bytes]
and jaro_winkler_distance[_bytes]
.
-
jaro_winkler_distance
jaro_winkler_distance(runtime_model, inp, min_score=None, weight=0.1, threshold=0.7, n_best_results=None)
Parameter | Type | Comment |
---|---|---|
runtime_model | RuntimeModel | Object returned by build_runtime_model . |
inp | str | The input to get scores for. |
min_score | float | If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model. |
weight | float | The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls how big the bonus is. This value must be >= 0.0 and <= 0.25. For the standard jaro winkler score calculation, use the default value. |
threshold | float | The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls what the minimum score should be as a condition for applying the bonus. For the standard jaro winkler score calculation, use the default value. |
n_best_results | int | Makes the function return the n_best_results best scoring candidates only. Improves performance. |
Returns a list of tuples containing the candidates and the matching scores, following this schema: [('中国', 0.0), ('hiz', 0.5)]
.
-
jaro_winkler_distance_bytes
jaro_winkler_distance_bytes(char_width, runtime_model, inp, min_score=None, weight=0.1, threshold=0.7, n_best_results=None)
Parameter | Type | Comment |
---|---|---|
char_width | int | Must be one of {1, 2, 4}. The value used must match with the char_width passed when calling build_exportable_model . The width in bytes of a single character in the inp parameter, as well as in the candidates in the exportable model. |
runtime_model | RuntimeModel | Object returned by build_runtime_model . |
inp | bytes | The input to get scores for. |
min_score | float | If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model. |
weight | float | The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls how big the bonus is. This value must be >= 0.0 and <= 0.25. For the standard jaro winkler score calculation, use the default value. |
threshold | float | The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls what the minimum score should be as a condition for applying the bonus. For the standard jaro winkler score calculation, use the default value. |
n_best_results | int | Makes the function return the n_best_results best scoring candidates only. Improves performance. |
Returns a list of tuples containing the candidates (as bytes) and the matching scores, following this schema: [(b'-N\x00\x00\xfdV\x00\x00', 0.0), (b'hiz', 0.5)]
.
-
jaro_distance
jaro_distance(runtime_model, inp, min_score=None, n_best_results=None)
Parameter | Type | Comment |
---|---|---|
runtime_model | RuntimeModel | Object returned by build_runtime_model . |
inp | bytes | The input to get scores for. |
min_score | float | If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model. |
n_best_results | int | Makes the function return the n_best_results best scoring candidates only. Improves performance. |
Returns a list of tuples containing the candidates and the matching scores, following this schema: [('中国', 0.0), ('hiz', 0.5)]
.
-
jaro_distance_bytes
jaro_distance_bytes(char_width, runtime_model, inp, min_score=None, n_best_results=None)
Parameter | Type | Comment |
---|---|---|
char_width | int | Must be one of {1, 2, 4}. The value used must match with the char_width passed when calling build_exportable_model . The width in bytes of a single character in the inp parameter, as well as in the candidates in the exportable model. |
runtime_model | RuntimeModel | Object returned by build_runtime_model . |
inp | bytes | The input to get scores for. |
min_score | float | If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model. |
n_best_results | int | Makes the function return the n_best_results best scoring candidates only. Improves performance. |
Returns a list of tuples containing the candidates and the matching scores, following this schema: [(b'-N\x00\x00\xfdV\x00\x00', 0.0), (b'hiz', 0.5)]
.
Ruby
require 'batch_jaro_winkler'
candidates = ['héllo', '中国', 'hiz']
exportable_model = BatchJaroWinkler.build_exportable_model(candidates)
runtime_model = BatchJaroWinkler.build_runtime_model(exportable_model)
res = BatchJaroWinkler.jaro_winkler_distance(runtime_model, 'hélloz')
# res = [['中国', 0.0], ['hiz', 0.5], ['héllo', 0.9666666388511658]]
-
build_exportable_model
build_exportable_model(candidates, nb_runtime_threads: 1)
Parameter | Type | Comment |
---|---|---|
candidates | array-like | An array (or array-like object) containing the strings to match runtime values against. Must respect one of these 2 schemas: ['hi', 'hello'] or [{ candidate: 'hi', min_score: 0.5 }, { candidate: 'hello', min_score: 0.8 }] . If one candidate has a min_score , all of them must have one. If min_score is provided, a candidate is only returned at runtime if the matching score is higher than its min score specified here, except if we manually pass a min_score at runtime, which takes precedence. |
nb_runtime_threads | Integer | The number of threads to use at runtime (jaro_distance[_bytes] and jaro_winkler_distance[_bytes] ). |
Returns an ascii encoded String
object.
-
build_exportable_model_bytes
build_exportable_model_bytes(char_width, candidates, nb_runtime_threads: 1)
Parameter | Type | Comment |
---|---|---|
char_width | Integer | Must be one of {1, 2, 4}. The width in bytes of a single character in the strings you provide in the candidates parameter. For example, if you use utf-32 , set char_width to 4. |
candidates | array-like | An array (or array-like object) containing the strings to match runtime values against. They can be encoded however you like, including in custom encodings, and including with 0 characters in the middle of the encoded strings. Must respect one of these 2 schemas: ['hi'.encode('utf-32le'), 'hello'.encode('utf-32le')] or [{ candidate: 'hi'.encode('utf-32le'), min_score: 0.5 }, { candidate: 'hello'.encode('utf-32le'), min_score: 0.8 }] . If one candidate has a min_score , all of them must have one. If min_score is provided, a candidate is only returned at runtime if the matching score is higher than its min score specified here, except if we manually pass a min_score at runtime, which takes precedence. |
nb_runtime_threads | Integer | The number of threads to use at runtime (jaro_distance[_bytes] and jaro_winkler_distance[_bytes] ). |
Returns an ascii encoded String
object.
-
build_runtime_model
build_runtime_model(exportable_model)
Parameter | Type | Comment |
---|---|---|
exportable_model | String | An exportable model built with build_exportable_model[_bytes] . |
Returns an object that you can then pass as argument to one of the runtime functions: jaro_distance[_bytes]
and jaro_winkler_distance[_bytes]
.
-
jaro_winkler_distance
jaro_winkler_distance(runtime_model, inp, min_score: nil, weight: 0.1, threshold: 0.7, n_best_results: nil)
Parameter | Type | Comment |
---|---|---|
runtime_model | RuntimeModel | Object returned by build_runtime_model . |
inp | String | The input to get scores for. |
min_score | Float | If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model. |
weight | Float | The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls how big the bonus is. This value must be >= 0.0 and <= 0.25. For the standard jaro winkler score calculation, use the default value. |
threshold | Float | The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls what the minimum score should be as a condition for applying the bonus. For the standard jaro winkler score calculation, use the default value. |
n_best_results | Integer | Makes the function return the n_best_results best scoring candidates only. Improves performance. |
Returns an array of arrays containing the candidates and the matching scores, following this schema: [['中国', 0.0], ['hiz', 0.5]]
.
-
jaro_winkler_distance_bytes
jaro_winkler_distance_bytes(char_width, runtime_model, inp, min_score: nil, weight: 0.1, threshold: 0.7, n_best_results: nil)
Parameter | Type | Comment |
---|---|---|
char_width | Integer | Must be one of {1, 2, 4}. The value used must match with the char_width passed when calling build_exportable_model . The width in bytes of a single character in the inp parameter, as well as in the candidates in the exportable model. |
runtime_model | RuntimeModel | Object returned by build_runtime_model . |
inp | String | The input to get scores for. |
min_score | Float | If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model. |
weight | Float | The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls how big the bonus is. This value must be >= 0.0 and <= 0.25. For the standard jaro winkler score calculation, use the default value. |
threshold | Float | The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls what the minimum score should be as a condition for applying the bonus. For the standard jaro winkler score calculation, use the default value. |
n_best_results | Integer | Makes the function return the n_best_results best scoring candidates only. Improves performance. |
Returns an array of arrays containing the candidates and the matching scores, following this schema: [['\u4E2D\u56FD', 0.0], ['hiz', 0.5]]
.
-
jaro_distance
jaro_distance(runtime_model, inp, min_score: nil, n_best_results: nil)
Parameter | Type | Comment |
---|---|---|
runtime_model | RuntimeModel | Object returned by build_runtime_model . |
inp | String | The input to get scores for. |
min_score | Float | If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model. |
n_best_results | Integer | Makes the function return the n_best_results best scoring candidates only. Improves performance. |
Returns an array of arrays containing the candidates and the matching scores, following this schema: [['中国', 0.0], ['hiz', 0.5]]
.
-
jaro_distance_bytes
jaro_distance_bytes(char_width, runtime_model, inp, min_score: nil, n_best_results: nil)
Parameter | Type | Comment |
---|---|---|
char_width | Integer | Must be one of {1, 2, 4}. The value used must match with the char_width passed when calling build_exportable_model . The width in bytes of a single character in the inp parameter, as well as in the candidates in the exportable model. |
runtime_model | RuntimeModel | Object returned by build_runtime_model . |
inp | String | The input to get scores for. |
min_score | Float | If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model. |
n_best_results | Integer | Makes the function return the n_best_results best scoring candidates only. Improves performance. |
Returns an array of arrays containing the candidates and the matching scores, following this schema: [['\u4E2D\u56FD', 0.0], ['hiz', 0.5]]
.
OCaml
test_bjw.ml
module Bjw = Batch_jaro_winkler
let () =
let encoding = Bjw.Encoding.UTF8 in
let candidates = [("héllo", None) ; ("中国", None) ; ("hiz", None)] in
let exportable_model = Bjw.build_exportable_model ~encoding candidates in
let runtime_model = Bjw.build_runtime_model exportable_model in
let res = Bjw.jaro_winkler_distance ~encoding runtime_model "hélloz" in
(* res = [("中国", 0.0) ; ("hiz", 0.5) ; ("héllo", 0.9666666388511658)] *)
List.iter (fun (candidate, score) -> Printf.printf "'%s': %f\n" candidate score) res
dune
(executable
(name test_bjw)
(libraries batch_jaro_winkler))
$ dune build
$ ./_build/default/test_bjw.exe
'héllo': 0.966667
'hiz': 0.500000
'中国': 0.000000
-
build_exportable_model
val build_exportable_model : encoding:Encoding.t -> ?nb_runtime_threads:int ->
(string * float option) list -> string
let build_exportable_model ~encoding ?nb_runtime_threads:(nb_runtime_threads=1) candidates
Parameter | Type | Comment |
---|---|---|
encoding | Encoding.t | One of {ASCII, UTF8, UTF16, UTF32, CHAR_WIDTH_1, CHAR_WIDTH_2, CHAR_WIDTH_4} . Describes the encoding of the string passed in the candidates parameter. You can use a custom fixed-width encoding with CHAR_WIDTH_1 , CHAR_WIDTH_2 and CHAR_WIDTH_4 . |
nb_runtime_threads | int | The number of threads to use at runtime (jaro_distance and jaro_winkler_distance ). |
candidates | (string * float option) list | A list containing the strings to match runtime values against, and optionally the minimum matching score required at runtime. Must respect the following schema: [("héllo", None) ; ("中国", Some 0.5)] . If one candidate has a min_score , all of them must have one. If you give a minimum score for a candidate, it is only returned at runtime if the matching score is higher than its min score specified here, except if we manually pass a min_score at runtime, which takes precedence. |
Returns a string.
-
build_runtime_model
val build_runtime_model : string -> runtime_model
let build_runtime_model exportable_model
Parameter | Type | Comment |
---|---|---|
exportable_model | string | An exportable model built with build_exportable_model . |
Returns a runtime_model
that you can then pass as argument to one of the runtime functions: jaro_distance
and jaro_winkler_distance
.
-
jaro_winkler_distance
val jaro_winkler_distance : encoding:Encoding.t -> ?min_score:float -> ?weight:float ->
?threshold:float -> ?n_best_results:int option -> runtime_model -> string -> (string * float) list
let jaro_winkler_distance ~encoding ?min_score:(min_score=(-1.0)) ?weight:(weight=0.1)
?threshold:(threshold=0.7) ?n_best_results:(n_best_results=None) runtime_model input
Parameter | Type | Comment |
---|---|---|
encoding | Encoding.t | One of {ASCII, UTF8, UTF16, UTF32, CHAR_WIDTH_1, CHAR_WIDTH_2, CHAR_WIDTH_4} . Describes the encoding of the string passed in the input parameter. You can use a custom fixed-width encoding with CHAR_WIDTH_1 , CHAR_WIDTH_2 and CHAR_WIDTH_4 . The encoding must match the one used to build the exportable model. |
min_score | float | If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model. |
weight | float | The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls how big the bonus is. This value must be >= 0.0 and <= 0.25. For the standard jaro winkler score calculation, use the default value. |
threshold | float | The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls what the minimum score should be as a condition for applying the bonus. For the standard jaro winkler score calculation, use the default value. |
n_best_results | int | Makes the function return the n_best_results best scoring candidates only. Improves performance. |
runtime_model | runtime_model | Value returned by build_runtime_model . |
input | string | The input to get scores for. |
Returns a list of string * float
tuples containing the candidates and the matching scores, following this schema: [("中国", 0.0) ; ("hiz", 0.5)]
.
-
jaro_distance
val jaro_distance : encoding:Encoding.t -> ?min_score:float -> ?n_best_results:int option ->
runtime_model -> string -> (string * float) list
let jaro_distance ~encoding ?min_score:(min_score=(-1.0)) ?n_best_results:(n_best_results=None)
runtime_model input
Parameter | Type | Comment |
---|---|---|
encoding | Encoding.t | One of {ASCII, UTF8, UTF16, UTF32, CHAR_WIDTH_1, CHAR_WIDTH_2, CHAR_WIDTH_4} . Describes the encoding of the string passed in the input parameter. You can use a custom fixed-width encoding with CHAR_WIDTH_1 , CHAR_WIDTH_2 and CHAR_WIDTH_4 . The encoding must match the one used to build the exportable model. |
min_score | float | If set, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model. |
n_best_results | int | Makes the function return the n_best_results best scoring candidates only. Improves performance. |
runtime_model | runtime_model | Value returned by build_runtime_model . |
input | string | The input to get scores for. |
Returns a list of string * float
tuples containing the candidates and the matching scores, following this schema: [("中国", 0.0) ; ("hiz", 0.5)]
.
C
The files you need are in the lib
folder.
test.c
:
#include "batch_jaro_winkler.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main(void)
{
char *candidates[] = { "hello", "hiz" };
uint32_t candidates_lengths[] = { 5, 3 };
uint32_t exportable_model_size;
uint32_t nb_results;
// char_width = 1 ; nb_candidates = 2 ; nb_runtime_threads = 1
void *exportable_model = bjw_build_exportable_model(
candidates, 1, candidates_lengths, 2, NULL, 1, &exportable_model_size
);
if (!exportable_model)
exit(1);
void *runtime_model = bjw_build_runtime_model(exportable_model);
if (!runtime_model)
exit(1);
// input_length = 5 ; min_score = -1.0 (deactivate)
// weight = 0.1 (default value for standard jaro winkler)
// threshold = 0.7 (default value for standard jaro winkler)
// n_best_results = 0 (deactivate)
bjw_result *res = bjw_jaro_winkler_distance(runtime_model, "hallo", 5, -1.0, 0.1, 0.7, 0, &nb_results);
if (!res)
exit(1);
uint32_t best_candidate_ind = 0;
for (uint32_t i_res = 0; i_res < nb_results; i_res++)
{
// Warning: candidates are not null terminated, as the meaning of bytes within candidates
// depends on the encoding, including for 0.
printf(
"{ .candidate = \"%.*s\", .score = %f }\n",
res[i_res].candidate_length, res[i_res].candidate, res[i_res].score
);
if (res[i_res].score > res[best_candidate_ind].score)
best_candidate_ind = i_res;
}
// Important: the 'candidate' field in `bjw_result` is a pointer to somewhere within the exportable model.
// If you want to keep candidates after the exportable model is being freed, you must copy the data.
// char_width = 1
char *best_candidate = malloc(res[best_candidate_ind].candidate_length * 1);
memcpy(best_candidate, res[best_candidate_ind].candidate, res[best_candidate_ind].candidate_length * 1);
uint32_t best_candidate_length = res[best_candidate_ind].candidate_length;
free(res);
bjw_free_runtime_model(runtime_model);
free(exportable_model);
printf("best candidate: \"%.*s\"\n", best_candidate_length, best_candidate);
free(best_candidate);
return (0);
}
$ ls -l
-rw-r--r-- 1 user wheel 33490 27 avr 13:12 batch_jaro_winkler.c
-rw-r--r-- 1 user wheel 1111 27 avr 13:12 batch_jaro_winkler.h
-rw-r--r-- 1 user wheel 1533 27 avr 13:12 batch_jaro_winkler_internal.h
-rw-r--r-- 1 user wheel 22514 27 avr 13:12 batch_jaro_winkler_runtime.h
-rw-r--r-- 1 user wheel 2190 27 avr 13:15 test.c
-rw-r--r-- 1 user wheel 78701 27 avr 13:12 uthash.h
$ gcc -O3 batch_jaro_winkler.c test.c
$ ./a.out
{ .candidate = "hiz", .score = 0.511111 }
{ .candidate = "hello", .score = 0.880000 }
best candidate: "hello"
-
bjw_build_exportable_model
void *bjw_build_exportable_model(
void **candidates, uint32_t char_width, uint32_t *candidates_lengths, uint32_t nb_candidates,
float *min_scores, uint32_t nb_runtime_threads, uint32_t *res_model_size
)
Parameter | Type | Comment |
---|---|---|
candidates | void** | An array of character arrays. Each character must be char_width bytes wide. They can be encoded however you like, including in custom encodings, and including with 0 characters in the middle of the encoded strings. |
char_width | uint32_t | Must be one of {1, 2, 4}. The width in bytes of a single character in the strings you provide in the candidates parameter. For example, if you use utf-32 , set char_width to 4. |
candidates_lengths | uint32_t* | Array containing the length of each candidate. If the strings are null-terminated, don't count the last byte when determining the length. |
nb_candidates | uint32_t | The number of elements in the candidates , candidates_lengths and min_scores arrays. |
min_scores | float* | Can be NULL. If provided, a candidate is only returned at runtime if the matching score is higher than its min score specified here, except if we manually pass a min_score at runtime, which takes precedence. |
nb_runtime_threads | uint32_t | The number of threads to use at runtime (bjw_jaro_distance and bjw_jaro_winkler_distance ). |
res_model_size | uint32_t* | The value is set by the function to the size in bytes of the resulting exportable model. |
Returns a buffer. You are responsible for freeing it. Warning: you need to keep it in memory as long as you want to make runtime calculations if you passed it as argument to bjw_build_runtime_model
, as the runtime functions (bjw_jaro_distance
and bjw_jaro_winkler_distance
) return pointers to the candidates' strings stored inside the exportable model.
-
bjw_build_runtime_model
void *bjw_build_runtime_model(void *exportable_model)
Parameter | Type | Comment |
---|---|---|
exportable_model | void* | An exportable model built with bjw_build_exportable_model . |
Returns a buffer that you can then pass as argument to one of the runtime functions: bjw_jaro_distance
and bjw_jaro_winkler_distance
. You must call bjw_free_runtime_model
when you are done using it.
-
bjw_free_runtime_model
void bjw_free_runtime_model(void *runtime_model)
Parameter | Type | Comment |
---|---|---|
runtime_model | void* | A runtime model built with bjw_build_runtime_model . |
Frees the runtime model.
-
bjw_jaro_winkler_distance
bjw_result *bjw_jaro_winkler_distance(
void *runtime_model, void *input, uint32_t input_length, float min_score,
float weight, float threshold, uint32_t n_best_results, uint32_t *nb_results
)
Parameter | Type | Comment |
---|---|---|
runtime_model | void* | Object returned by bjw_build_runtime_model . |
input | void* | The input to get scores for. Must be encoded in the same way as candidates. |
input_length | uint32_t | Length of the input in characters, not bytes. |
min_score | float | If >= 0.0, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model. |
weight | float | The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls how big the bonus is. This value must be >= 0.0 and <= 0.25. For the standard jaro winkler score calculation, use the default value. |
threshold | float | The jaro winkler algorithm gives a scoring bonus for matching prefixes, this parameter controls what the minimum score should be as a condition for applying the bonus. For the standard jaro winkler score calculation, use the default value. |
n_best_results | uint32_t | If > 0, the function returns the n_best_results best scoring candidates only. Improves performance. |
nb_results | uint32_t* | This value is set by the function to the number of results. |
Returns an array of *nb_results
bjw_result
. You are responsible for freeing the resulting array. Here is the definition for bjw_result
, candidate_length
is the number of characters, not bytes:
typedef struct
{
void *candidate;
float score;
uint32_t candidate_length;
} bjw_result;
-
bjw_jaro_distance
bjw_result *bjw_jaro_distance(
void *runtime_model, void *input, uint32_t input_length,
float min_score, uint32_t n_best_results, uint32_t *nb_results
)
Parameter | Type | Comment |
---|---|---|
runtime_model | void* | Object returned by bjw_build_runtime_model . |
input | void* | The input to get scores for. Must be encoded in the same way as candidates. |
input_length | uint32_t | Length of the input in characters, not bytes. |
min_score | float | If >= 0.0, the function only returns the candidates that have a matching score at least as high as this value. Improves performance. Takes precedence over the min scores that may be set for each candidate when building the exportable model. |
n_best_results | uint32_t | If > 0, the function returns the n_best_results best scoring candidates only. Improves performance. |
nb_results | uint32_t* | This value is set by the function to the number of results. |
Returns an array of *nb_results
bjw_result
. You are responsible for freeing the resulting array. Here is the definition for bjw_result
, candidate_length
is the number of characters, not bytes:
typedef struct
{
void *candidate;
float score;
uint32_t candidate_length;
} bjw_result;
Warning regarding Ruby versions
If you use older MRI versions (< 2.5.8 or between 2.6.0 and 2.6.4 included), you may experience memory leaks. It could somehow be related to MRI's string implementation, which was fixed at the end of 2019: https://github.com/ruby/ruby/compare/v2_6_4...v2_6_5#diff-7a2f2c7dfe0bf61d38272aeaf68ac768R2117. Work was done in this library to mitigate the issue, but the absence of leaks is not guaranteed. If you're interested, you can most likely reproduce the leak with this program, change utf-32le
to utf-32
to watch to memory leak disappear:
while true do
1000.times do
# random 10 characters string
str = (0...10).map{ (65 + rand(26)).chr }.join
str.encode('utf-32le')
end
GC.start(full_mark: true, immediate_sweep: true)
GC.start
end
The future
A similar approach could probably also benefit other fuzzy matching algorithms.
Something that would be really neat would be the ability to add candidates or remove candidates from an exportable model on the fly. This would allow for more flexible scenarios, where the set of candidates can change very often, as in a database for instance. It would make the project appropriate as a PostgreSQL extension for example.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
File details
Details for the file batch_jaro_winkler-0.1.6.zip
.
File metadata
- Download URL: batch_jaro_winkler-0.1.6.zip
- Upload date:
- Size: 55.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.1
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | fa508a9b3ca78a00c1a0a0f10aceacbc3cd6623b5309b4ce0d44922d2d12d64b |
|
MD5 | 9040be1ed15c4c73cf42edc99c125f6a |
|
BLAKE2b-256 | bad2f86e6cb70c673e15df6c337774077d9a38dff1346a40c38e69e54e05179d |