Stringmetrics
This is a Rust library for approximate string matching that implements simple algorithms such as Hamming distance, Levenshtein distance, Jaccard similarity, and more, as well as a competent spellchecker that handles Hunspell dictionaries.
This package comes with a library for programmatic use, as well as a command line interface. The library is usable via WASM.
Crate info: https://crates.io/crates/stringmetrics
Crate docs: https://docs.rs/stringmetrics/
Crate source: https://github.com/pluots/stringmetrics-rust
Stringmetric Algorithms
One of the main purposes of this library is to provide a variety of string
metric functions. These include a few Levenshtein implementations (including
limit/max, weighted, and generic), Jaccard index, and a Hamming implementation.
These are all found in the algorithms module.
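For readers unfamiliar with these metrics, the following is a minimal, standalone sketch of the three basic ideas using only the Rust standard library. It is not this crate's API: the function names `hamming`, `levenshtein`, and `jaccard`, their signatures, and the choice to compare by `char` are all assumptions made for illustration, and the limit/max, weighted, and generic variants are not shown.

```rust
use std::collections::HashSet;

/// Hamming distance: the number of positions at which two strings of
/// equal length differ. Returns `None` when the lengths differ.
/// (Hypothetical helper, not the crate's function.)
fn hamming(a: &str, b: &str) -> Option<usize> {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b.iter()).filter(|(x, y)| x != y).count())
}

/// Levenshtein distance with unit costs, computed with one rolling row.
/// (Hypothetical helper, not the crate's function.)
fn levenshtein(a: &str, b: &str) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // row[j] is the distance between the processed prefix of `a` and b[..j].
    let mut row: Vec<usize> = (0..=b.len()).collect();
    for (i, ca) in a.iter().enumerate() {
        // `diag` keeps the previous row's value one column to the left.
        let mut diag = row[0];
        row[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let cost = if ca == cb { 0 } else { 1 };
            let next = (diag + cost) // substitution (or match)
                .min(row[j] + 1) // insertion
                .min(row[j + 1] + 1); // deletion
            diag = row[j + 1];
            row[j + 1] = next;
        }
    }
    row[b.len()]
}

/// Jaccard index over the sets of characters in each string:
/// |A ∩ B| / |A ∪ B|. Two empty strings are treated as identical (1.0),
/// which is a convention chosen for this sketch.
fn jaccard(a: &str, b: &str) -> f64 {
    let sa: HashSet<char> = a.chars().collect();
    let sb: HashSet<char> = b.chars().collect();
    let union = sa.union(&sb).count();
    if union == 0 {
        return 1.0;
    }
    sa.intersection(&sb).count() as f64 / union as f64
}
```

With these definitions, `levenshtein("kitten", "sitting")` evaluates to 3 and `hamming("karolin", "kathrin")` to `Some(3)`. The rolling-row form keeps memory proportional to the length of the second string rather than the full matrix.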
Spellcheck
This is a spellchecker written completely in Rust. While maintaining compatibility with the venerable Hunspell dictionary format, it does not rely on Hunspell or any other underlying checker. NOTE: Spellchecker is currently in alpha.
Spellcheck functionality is found in the spellcheck module.
Functionality
NOTE: The spellcheck portion of this project is still under development and is not guaranteed to work properly. Completed and planned support includes:
- Basic prefix/suffix dictionary files
- Forbidden word handling
- [ ]
- Morphological/Phonetic handling
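To make the first two items concrete, here is a rough sketch of how suffix rules and forbidden words can interact in a Hunspell-style checker. It is an illustration under heavy simplifying assumptions and does not reflect how this crate is implemented: `SuffixRule`, `parse_entry`, `expand`, and `check` are hypothetical names, rule conditions are omitted, prefixes are not handled, and forbidden words are kept in a separate set rather than marked with a dictionary flag as Hunspell does.

```rust
use std::collections::HashSet;

/// A simplified suffix rule: remove `strip` from the end of a stem, then
/// append `add`. Real Hunspell rules also carry a condition pattern,
/// which this sketch leaves out.
struct SuffixRule {
    flag: char,
    strip: &'static str,
    add: &'static str,
}

/// Split a `.dic`-style entry such as "walk/DG" into its stem and flags.
fn parse_entry(line: &str) -> (&str, Vec<char>) {
    match line.split_once('/') {
        Some((stem, flags)) => (stem, flags.chars().collect()),
        None => (line, Vec::new()),
    }
}

/// Expand every entry into the set of surface forms it allows.
fn expand(entries: &[&str], rules: &[SuffixRule]) -> HashSet<String> {
    let mut words = HashSet::new();
    for entry in entries {
        let (stem, flags) = parse_entry(entry);
        words.insert(stem.to_string());
        // Apply each rule whose flag appears on this entry.
        for rule in rules.iter().filter(|r| flags.contains(&r.flag)) {
            if let Some(base) = stem.strip_suffix(rule.strip) {
                words.insert(format!("{}{}", base, rule.add));
            }
        }
    }
    words
}

/// A word passes if it is a known form and is not explicitly forbidden.
fn check(word: &str, words: &HashSet<String>, forbidden: &HashSet<String>) -> bool {
    words.contains(word) && !forbidden.contains(word)
}
```

For example, with a rule `S` that adds "s" and a rule `D` that strips "e" and adds "ed", the entry "bake/SD" yields "bake", "bakes", and "baked"; adding "bakes" to the forbidden set makes `check` reject it even though a rule can generate it.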
Performance
In benchmarks on an average laptop, checking a word takes approximately 40-50 ns. At 40 ns per word, that is fast enough to spellcheck the roughly one million words of the Harry Potter series in about 40 ms.
Simple benchmarks:
Spellcheck: compile dictionary
time: [127.40 ms 132.32 ms 138.44 ms]
Found 9 outliers among 100 measurements (9.00%)
9 (9.00%) high severe
Spellcheck: 1 correct word
time: [35.343 ns 35.446 ns 35.563 ns]
Found 15 outliers among 100 measurements (15.00%)
11 (11.00%) high mild
4 (4.00%) high severe
Spellcheck: 1 incorrect word
time: [46.577 ns 46.700 ns 46.853 ns]
Found 16 outliers among 100 measurements (16.00%)
6 (6.00%) high mild
10 (10.00%) high severe
Spellcheck: 15 correct words
time: [537.31 ns 552.10 ns 568.80 ns]
Found 12 outliers among 100 measurements (12.00%)
2 (2.00%) high mild
10 (10.00%) high severe
Spellcheck: 15 incorrect words
time: [741.72 ns 747.44 ns 755.19 ns]
Found 15 outliers among 100 measurements (15.00%)
4 (4.00%) high mild
11 (11.00%) high severe
Spellcheck: 188 word paragraph
time: [6.9062 us 6.9259 us 6.9485 us]
Found 11 outliers among 100 measurements (11.00%)
4 (4.00%) high mild
7 (7.00%) high severe
Note that dictionary compiling is only a one-time task after a file is loaded.
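As a back-of-the-envelope check of these numbers, the small helper below combines the one-time compile cost with the per-word cost. The function name `estimate_ms` and its parameters are made up for this illustration; the inputs come from the benchmark output above.

```rust
/// Estimated time in milliseconds to check `words` words, given a per-word
/// cost in nanoseconds and a one-time dictionary compile cost in milliseconds.
/// (Hypothetical helper for the arithmetic only.)
fn estimate_ms(words: u64, ns_per_word: f64, compile_ms: f64) -> f64 {
    // 1 ms = 1,000,000 ns
    compile_ms + words as f64 * ns_per_word / 1.0e6
}
```

Calling `estimate_ms(1_000_000, 40.0, 0.0)` gives 40 ms for the checking alone, and `estimate_ms(1_000_000, 40.0, 132.32)` gives roughly 172 ms for a first run that also compiles the dictionary.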
License
See the LICENSE file for license information. The provided license does allow for proprietary use and adaptation; that being said, I kindly suggest that if you come up with an improvement, you submit a pull request and help us all out :)
Dictionary data license
The dictionaries provided in this repository for testing purposes have been obtained under license. These files have been sourced from here: https://github.com/wooorm/dictionaries
These dictionaries are licensed under various licenses, different from that of
this project. Please see the applicable .license file within the dictionaries/ directory.
Note: this project was previously named "textdistance". Please make sure to update all references.