Stringmetrics

This is a Rust library for approximate string matching. It implements simple algorithms such as Hamming distance, Levenshtein distance, and Jaccard similarity, among others, as well as a spellchecker that reads Hunspell dictionaries.

This package comes with a library for programmatic use as well as a command-line interface. The library is also usable via WASM.

Crate info: https://crates.io/crates/stringmetrics

Crate docs: https://docs.rs/stringmetrics/

Crate source: https://github.com/pluots/stringmetrics-rust

Stringmetric Algorithms

One of the main purposes of this library is to provide a variety of string metric functions. These include a few Levenshtein implementations (including limit/max, weighted, and generic), Jaccard index, and a Hamming implementation. These are all found in the algorithms module.
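To make these metrics concrete, here is a minimal, self-contained sketch of three of them. The function names and signatures (hamming, levenshtein_limit, jaccard) are illustrative assumptions and are not taken from the crate's algorithms module; see the crate docs for the actual API.

```rust
use std::collections::HashSet;

/// Hamming distance: number of positions at which two strings differ.
/// Returns `None` when the lengths (in chars) differ, where the metric is undefined.
fn hamming(a: &str, b: &str) -> Option<usize> {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    if a.len() != b.len() {
        return None;
    }
    Some(a.iter().zip(b.iter()).filter(|(x, y)| x != y).count())
}

/// Levenshtein distance with unit costs, capped at `limit`.
/// Uses two rolling rows and stops early once the cap is certain to be hit.
fn levenshtein_limit(a: &str, b: &str, limit: usize) -> usize {
    let a: Vec<char> = a.chars().collect();
    let b: Vec<char> = b.chars().collect();
    // prev[j] = distance between the first i chars of `a` and the first j chars of `b`
    let mut prev: Vec<usize> = (0..=b.len()).collect();
    let mut curr: Vec<usize> = vec![0; b.len() + 1];
    for (i, ca) in a.iter().enumerate() {
        curr[0] = i + 1;
        for (j, cb) in b.iter().enumerate() {
            let substitution = prev[j] + usize::from(ca != cb);
            let deletion = prev[j + 1] + 1;
            let insertion = curr[j] + 1;
            curr[j + 1] = substitution.min(deletion).min(insertion);
        }
        // The final distance is never smaller than the minimum of any row,
        // so if every cell already exceeds the limit we can return the cap.
        if curr.iter().all(|&c| c > limit) {
            return limit;
        }
        std::mem::swap(&mut prev, &mut curr);
    }
    prev[b.len()].min(limit)
}

/// Jaccard similarity over the sets of characters in each string:
/// |intersection| / |union|.
fn jaccard(a: &str, b: &str) -> f64 {
    let sa: HashSet<char> = a.chars().collect();
    let sb: HashSet<char> = b.chars().collect();
    let union = sa.union(&sb).count();
    if union == 0 {
        return 1.0; // assumption: two empty sets count as identical
    }
    sa.intersection(&sb).count() as f64 / union as f64
}
```

With these definitions, hamming("karolin", "kathrin") is Some(3), levenshtein_limit("kitten", "sitting", 10) is 3, and jaccard("abc", "bcd") is 0.5.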

Spellcheck

This is a spellchecker written completely in Rust. While maintaining compatibility with the venerable Hunspell dictionary format, it does not rely on Hunspell or any other underlying checker. NOTE: Spellchecker is currently in alpha.

Spellcheck functionality is found in the spellcheck module.

Functionality

NOTE: The spellcheck portion of this project is still under development and is not guaranteed to work properly. Completed and planned features include:

  • Basic prefix/suffix dictionary files
  • Forbidden word handling
  • [ ]
  • Morphological/Phonetic handling
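As an illustration of the first two items, the sketch below applies Hunspell-style suffix rules to a stem and then filters out forbidden words. It is a simplified model, not this crate's implementation: the SuffixRule type and the expand function are hypothetical names, and the condition is a plain literal ending, whereas real affix files use character-class patterns.

```rust
use std::collections::HashSet;

/// A simplified Hunspell-style suffix rule (hypothetical type, not the crate's API).
struct SuffixRule {
    /// Characters removed from the end of the stem before adding the suffix.
    strip: &'static str,
    /// Characters appended after stripping.
    add: &'static str,
    /// The stem must end with this literal for the rule to apply.
    /// Assumption: real affix conditions are character-class patterns.
    condition: &'static str,
}

impl SuffixRule {
    /// Returns the derived word, or `None` if the rule does not apply to `stem`.
    fn apply(&self, stem: &str) -> Option<String> {
        if !stem.ends_with(self.condition) {
            return None;
        }
        let base = stem.strip_suffix(self.strip)?;
        Some(format!("{}{}", base, self.add))
    }
}

/// Expands a stem into every form the rules allow, dropping forbidden words.
fn expand(stem: &str, rules: &[SuffixRule], forbidden: &HashSet<String>) -> Vec<String> {
    let mut words = vec![stem.to_string()];
    words.extend(rules.iter().filter_map(|rule| rule.apply(stem)));
    words.retain(|word| !forbidden.contains(word));
    words
}
```

For example, a rule that strips "y" and adds "ied" turns the stem "try" into "tried", while a bare "s" rule would produce "trys"; listing "trys" as forbidden removes it from the result.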

Performance

Benchmarks on an average laptop give approximately 40-50 ns per word. That is fast enough to spellcheck the roughly one million words of the Harry Potter series in about 40-50 ms.
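The estimate is simply word count times per-word cost. A small sketch of that arithmetic follows; the function name estimated_check_time is hypothetical and only serves the illustration.

```rust
use std::time::Duration;

/// Estimated time to check `words` words at `ns_per_word` nanoseconds each.
fn estimated_check_time(words: u64, ns_per_word: u64) -> Duration {
    Duration::from_nanos(words * ns_per_word)
}
```

At 40 ns per word this gives 40 ms for one million words, and 50 ms at 50 ns per word.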

Simple benchmarks:

Spellcheck: compile dictionary
                        time:   [127.40 ms 132.32 ms 138.44 ms]
Found 9 outliers among 100 measurements (9.00%)
  9 (9.00%) high severe

Spellcheck: 1 correct word
                        time:   [35.343 ns 35.446 ns 35.563 ns]
Found 15 outliers among 100 measurements (15.00%)
  11 (11.00%) high mild
  4 (4.00%) high severe

Spellcheck: 1 incorrect word
                        time:   [46.577 ns 46.700 ns 46.853 ns]
Found 16 outliers among 100 measurements (16.00%)
  6 (6.00%) high mild
  10 (10.00%) high severe

Spellcheck: 15 correct words
                        time:   [537.31 ns 552.10 ns 568.80 ns]
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe

Spellcheck: 15 incorrect words
                        time:   [741.72 ns 747.44 ns 755.19 ns]
Found 15 outliers among 100 measurements (15.00%)
  4 (4.00%) high mild
  11 (11.00%) high severe

Spellcheck: 188 word paragraph
                        time:   [6.9062 us 6.9259 us 6.9485 us]
Found 11 outliers among 100 measurements (11.00%)
  4 (4.00%) high mild
  7 (7.00%) high severe

Note that dictionary compiling is only a one-time task after a file is loaded.
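The compile-once, check-many pattern can be sketched as follows. This is an assumption-laden illustration rather than the crate's API: the Dictionary type and its methods are hypothetical, affix flags after the slash are ignored, and lookups are case-insensitive for simplicity.

```rust
use std::collections::HashSet;

/// A toy dictionary: built once, then queried many times.
struct Dictionary {
    words: HashSet<String>,
}

impl Dictionary {
    /// One-time step: parse `.dic`-style text, where the first line holds the
    /// entry count and each following line is `word` or `word/FLAGS`.
    fn compile(dic_text: &str) -> Self {
        let words = dic_text
            .lines()
            .skip(1) // skip the entry-count line
            .filter_map(|line| line.split('/').next())
            .map(str::trim)
            .filter(|word| !word.is_empty())
            .map(str::to_lowercase)
            .collect();
        Dictionary { words }
    }

    /// Cheap per-word step: a single hash lookup.
    fn check(&self, word: &str) -> bool {
        self.words.contains(&word.to_lowercase())
    }

    /// Returns the whitespace-separated words of `text` that are not in the dictionary.
    fn misspelled<'a>(&self, text: &'a str) -> Vec<&'a str> {
        text.split_whitespace().filter(|word| !self.check(word)).collect()
    }
}
```

The expensive compile call runs once per loaded file; every later check or misspelled call reuses the same in-memory set.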

License

See the LICENSE file for license information. The provided license does allow for proprietary use and adaptation; that being said, I kindly suggest that if you come up with an improvement, you submit a pull request and help us all out :)

Dictionary data license

The dictionaries provided in this repository for testing purposes have been obtained under license. These files have been sourced from here: https://github.com/wooorm/dictionaries

These dictionaries are licensed under various licenses, different from that of this project. Please see the applicable .license file within the dictionaries/ directory.

Note: this project was previously named "textdistance". Please make sure to update all references.

Download files

Source Distribution

stringmetrics-0.1.0.tar.gz (254.7 kB)

Uploaded Source

Built Distributions

stringmetrics-0.1.0-cp310-cp310-macosx_10_7_x86_64.whl (170.2 kB)

Uploaded CPython 3.10 macOS 10.7+ x86-64

stringmetrics-0.1.0-cp39-cp39-macosx_10_7_x86_64.whl (170.3 kB)

Uploaded CPython 3.9 macOS 10.7+ x86-64