A fast, native Rust implementation of the SimString algorithm with Python bindings.
simstring_rust
A native Rust implementation of the CPMerge algorithm, designed for approximate string matching. This crate is particularly useful for natural language processing tasks that require retrieving strings from very large corpora. It currently supports both character-based and word-based n-gram feature generation, with plans to allow custom user-defined feature generation methods.
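To make "character-based n-gram features" concrete, here is a minimal sketch of how such features can be generated. The helper `char_ngrams` is hypothetical and is not part of this crate's API; it assumes the common convention of padding the string with `n - 1` copies of a marker on each side, which may differ from what the crate's extractors do internally.

```rust
/// Hypothetical helper (not the crate's API): pad `text` with `n - 1` copies
/// of `pad` on each side, then return every window of `n` characters.
fn char_ngrams(text: &str, n: usize, pad: &str) -> Vec<String> {
    assert!(n > 0, "n must be at least 1");
    let padding = pad.repeat(n - 1);
    // Work on `char`s rather than bytes so multi-byte Unicode text is split correctly.
    let padded: Vec<char> = format!("{}{}{}", padding, text, padding).chars().collect();
    padded
        .windows(n)
        .map(|window| window.iter().collect::<String>())
        .collect()
}

fn main() {
    // Bigrams of "hell" with "$" as the padding marker.
    println!("{:?}", char_ngrams("hell", 2, "$"));
}
```

Under these assumptions, "hell" yields the five bigrams `$h`, `he`, `el`, `ll`, and `l$`. The padding lets the first and last characters contribute as many features as the interior ones.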
Features
- ✅ Fast algorithm for string matching
- ✅ 100% exact retrieval
- ✅ Support for Unicode
- Support for building databases directly from text files
- Mecab-based tokenizer support
Supported String Similarity Measures
- ✅ Dice coefficient
- ✅ Jaccard coefficient
- ✅ Cosine coefficient
- ✅ Overlap coefficient
- ✅ Exact match
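For reference, the measures listed above can be written directly in terms of two feature sets X and Y. The sketch below uses the standard set-based definitions; the function names are illustrative and are not the crate's API, and it assumes both sets are non-empty (empty inputs would divide by zero).

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Number of features shared by both sets, |X ∩ Y|.
fn shared<T: Eq + Hash>(x: &HashSet<T>, y: &HashSet<T>) -> f64 {
    x.intersection(y).count() as f64
}

/// Dice: 2|X ∩ Y| / (|X| + |Y|)
fn dice<T: Eq + Hash>(x: &HashSet<T>, y: &HashSet<T>) -> f64 {
    2.0 * shared(x, y) / (x.len() + y.len()) as f64
}

/// Jaccard: |X ∩ Y| / |X ∪ Y|
fn jaccard<T: Eq + Hash>(x: &HashSet<T>, y: &HashSet<T>) -> f64 {
    let both = shared(x, y);
    both / ((x.len() + y.len()) as f64 - both)
}

/// Cosine: |X ∩ Y| / sqrt(|X| * |Y|)
fn cosine<T: Eq + Hash>(x: &HashSet<T>, y: &HashSet<T>) -> f64 {
    shared(x, y) / ((x.len() * y.len()) as f64).sqrt()
}

/// Overlap: |X ∩ Y| / min(|X|, |Y|)
fn overlap<T: Eq + Hash>(x: &HashSet<T>, y: &HashSet<T>) -> f64 {
    shared(x, y) / x.len().min(y.len()) as f64
}

/// Exact match: 1.0 only when the two feature sets are identical.
fn exact_match<T: Eq + Hash>(x: &HashSet<T>, y: &HashSet<T>) -> f64 {
    if x == y { 1.0 } else { 0.0 }
}

fn main() {
    // Padded character bigrams of "hell" and "hello", written out by hand.
    let hell = HashSet::from(["$h", "he", "el", "ll", "l$"]);
    let hello = HashSet::from(["$h", "he", "el", "ll", "lo", "o$"]);
    println!("dice    = {:.4}", dice(&hell, &hello));
    println!("jaccard = {:.4}", jaccard(&hell, &hello));
    println!("cosine  = {:.4}", cosine(&hell, &hello));
    println!("overlap = {:.4}", overlap(&hell, &hello));
    println!("exact   = {:.1}", exact_match(&hell, &hello));
}
```

The two example sets share four bigrams, so Dice is 8/11, Jaccard is 4/7, cosine is 4/√30, and overlap is 4/5.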
Installation
Add simstring_rust to your Cargo.toml:
[dependencies]
simstring_rust = "0.3.0" # change version accordingly
For the latest features, you can depend on the main branch by specifying the Git repository:
[dependencies]
simstring_rust = { git = "https://github.com/PyDataBlog/simstring_rs.git", branch = "main" }
Note: The main branch may include experimental features and breaking changes. Use with caution!
To revert to a stable version, make sure your Cargo.toml specifies a version number instead of the Git repository.
Usage
Here is a basic example of how to use simstring_rust in your Rust project:
use simstring_rust::database::HashDb;
use simstring_rust::extractors::CharacterNgrams;
use simstring_rust::measures::Cosine;
use simstring_rust::Searcher;
use std::sync::Arc;

fn main() {
    // 1. Setup the database
    let feature_extractor = Arc::new(CharacterNgrams::new(2, "$"));
    let mut db = HashDb::new(feature_extractor);

    // 2. Index some strings
    db.insert("hello".to_string());
    db.insert("help".to_string());
    db.insert("halo".to_string());
    db.insert("world".to_string());

    // 3. Search for strings
    let measure = Cosine;
    let searcher = Searcher::new(&db, measure);
    let query = "hell";
    let alpha = 0.5;

    if let Ok(results) = searcher.ranked_search(query, alpha) {
        println!("Found {} results for query '{}'", results.len(), query);
        for (item, score) in results {
            println!("- Match: '{}', Score: {:.4}", item, score);
        }
    }
}
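The `alpha` threshold in the example above does more than filter results at the end: in the SimString formulation it also bounds which candidates need to be examined at all. The sketch below shows those bounds for the cosine measure, derived from the definition cosine(X, Y) = |X ∩ Y| / sqrt(|X| · |Y|). The function names are hypothetical, and whether this crate computes the bounds in exactly this way is an assumption.

```rust
/// Smallest and largest feature-set sizes a candidate may have and still
/// reach cosine similarity `alpha` against a query with `query_size` features.
/// Follows from |X ∩ Y| <= min(|X|, |Y|).
fn cosine_size_bounds(query_size: usize, alpha: f64) -> (usize, usize) {
    assert!(alpha > 0.0 && alpha <= 1.0, "alpha must be in (0, 1]");
    let x = query_size as f64;
    let min_size = (alpha * alpha * x).ceil() as usize;
    let max_size = (x / (alpha * alpha)).floor() as usize;
    (min_size, max_size)
}

/// Minimum number of shared features a candidate of `candidate_size` needs
/// for its cosine similarity with the query to be at least `alpha`.
fn cosine_min_overlap(query_size: usize, candidate_size: usize, alpha: f64) -> usize {
    (alpha * ((query_size * candidate_size) as f64).sqrt()).ceil() as usize
}

fn main() {
    // A query with 5 features (e.g. the padded bigrams of "hell") and alpha = 0.5.
    let (min_size, max_size) = cosine_size_bounds(5, 0.5);
    println!("candidate sizes to consider: {}..={}", min_size, max_size);
    println!(
        "shared features needed for a 6-feature candidate: {}",
        cosine_min_overlap(5, 6, 0.5)
    );
}
```

With five query features and `alpha = 0.5`, only candidates with 2 to 20 features can match, and a six-feature candidate must share at least 3 features with the query. Raising `alpha` tightens both bounds, which is why stricter thresholds tend to search faster.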
Contributing
Contributions are welcome! Please open an issue or submit a pull request on GitHub.
License
This project is licensed under the MIT License.
Benchmarks
The benches/run_benches.py harness compares several language bindings (Rust, Python, Julia, Ruby, C++).
It requires git, autoconf, automake, libtool, make, python, uv, and a C++ compiler (g++) to build the C++ CLI.
The C++ sources are cloned into benches/.simstring_cpp/ and a local copy of the simstring binary is installed
under that directory. If you need to rebuild from scratch, remove benches/.simstring_cpp/ before re-running the benchmark suite.
Acknowledgements
Inspired by the SimString project.