
A fast, native Rust implementation of the SimString algorithm with Python bindings.


simstring_rust


A native Rust implementation of the CPMerge algorithm for approximate string matching. This crate is particularly useful for natural language processing tasks that require retrieving strings from very large corpora. It currently supports character-based and word-based n-gram feature generation, with plans to allow custom user-defined feature generation methods.
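To make the two feature types concrete, here is a minimal, standalone sketch of character and word n-gram extraction. The helpers `char_ngrams` and `word_ngrams` are hypothetical and are not part of this crate's API; the padding scheme (n - 1 copies of a marker on each side) is an assumption, and the crate's own extractors may treat padding and repeated n-grams differently.

```rust
/// Hypothetical helper (not the crate's API): character n-grams over `text`,
/// padded with n - 1 copies of `pad` on each side so that the start and end
/// of the string produce their own features.
fn char_ngrams(text: &str, n: usize, pad: &str) -> Vec<String> {
    if n == 0 {
        return Vec::new();
    }
    let padding = pad.repeat(n - 1);
    // Collect into chars so that multi-byte Unicode characters stay intact.
    let padded: Vec<char> = format!("{padding}{text}{padding}").chars().collect();
    padded.windows(n).map(|w| w.iter().collect()).collect()
}

/// Hypothetical helper (not the crate's API): word n-grams over
/// whitespace-separated tokens, joined with a single space.
fn word_ngrams(text: &str, n: usize) -> Vec<String> {
    if n == 0 {
        return Vec::new();
    }
    let words: Vec<&str> = text.split_whitespace().collect();
    words.windows(n).map(|w| w.join(" ")).collect()
}

fn main() {
    // prints ["$h", "he", "el", "ll", "l$"]
    println!("{:?}", char_ngrams("hell", 2, "$"));
    // prints ["the quick", "quick brown", "brown fox"]
    println!("{:?}", word_ngrams("the quick brown fox", 2));
}
```

Two strings that share many of these features are likely to be similar, which is what the similarity measures below quantify.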

Features

  • ✅ Fast algorithm for string matching
  • ✅ 100% exact retrieval
  • ✅ Support for Unicode
  • Support for building databases directly from text files
  • Mecab-based tokenizer support

Supported String Similarity Measures

  • ✅ Dice coefficient
  • ✅ Jaccard coefficient
  • ✅ Cosine coefficient
  • ✅ Overlap coefficient
  • ✅ Exact match
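For reference, the measures above can be written as set operations over the features of two strings. The sketch below uses the standard textbook definitions on plain `HashSet`s; the function names are illustrative assumptions and are not taken from this crate, whose internal implementation may differ (for example in how duplicate features are counted).

```rust
use std::collections::HashSet;
use std::hash::Hash;

/// Number of features shared by both sets, as a float.
fn common<T: Eq + Hash>(x: &HashSet<T>, y: &HashSet<T>) -> f64 {
    x.intersection(y).count() as f64
}

/// Dice: 2|X ∩ Y| / (|X| + |Y|)
fn dice<T: Eq + Hash>(x: &HashSet<T>, y: &HashSet<T>) -> f64 {
    if x.is_empty() || y.is_empty() { return 0.0; }
    2.0 * common(x, y) / (x.len() + y.len()) as f64
}

/// Jaccard: |X ∩ Y| / |X ∪ Y|
fn jaccard<T: Eq + Hash>(x: &HashSet<T>, y: &HashSet<T>) -> f64 {
    if x.is_empty() || y.is_empty() { return 0.0; }
    common(x, y) / x.union(y).count() as f64
}

/// Cosine: |X ∩ Y| / sqrt(|X| * |Y|)
fn cosine<T: Eq + Hash>(x: &HashSet<T>, y: &HashSet<T>) -> f64 {
    if x.is_empty() || y.is_empty() { return 0.0; }
    common(x, y) / ((x.len() * y.len()) as f64).sqrt()
}

/// Overlap: |X ∩ Y| / min(|X|, |Y|)
fn overlap<T: Eq + Hash>(x: &HashSet<T>, y: &HashSet<T>) -> f64 {
    if x.is_empty() || y.is_empty() { return 0.0; }
    common(x, y) / x.len().min(y.len()) as f64
}

/// Exact match: 1.0 when both feature sets are identical, 0.0 otherwise.
fn exact_match<T: Eq + Hash>(x: &HashSet<T>, y: &HashSet<T>) -> f64 {
    if x == y { 1.0 } else { 0.0 }
}

fn main() {
    // Padded character bigrams of "hell" and "hello"; they share 4 features.
    let hell = HashSet::from(["$h", "he", "el", "ll", "l$"]);
    let hello = HashSet::from(["$h", "he", "el", "ll", "lo", "o$"]);

    println!("dice    = {:.4}", dice(&hell, &hello));
    println!("jaccard = {:.4}", jaccard(&hell, &hello));
    println!("cosine  = {:.4}", cosine(&hell, &hello));
    println!("overlap = {:.4}", overlap(&hell, &hello));
    println!("exact   = {:.4}", exact_match(&hell, &hello));
}
```

All five functions return a score in the range 0 to 1, so the same threshold can be interpreted consistently whichever measure is chosen.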

Installation

Add simstring_rust to your Cargo.toml:

[dependencies]
simstring_rust = "0.3.0" # change version accordingly

For the latest features, you can depend on the main branch by specifying the Git repository:

[dependencies]
simstring_rust = { git = "https://github.com/PyDataBlog/simstring_rs.git", branch = "main" }

Note: The main branch may include experimental features and breaking changes. Use with caution!

To revert to a stable release, replace the Git dependency in your Cargo.toml with a version number.

Usage

Here is a basic example of how to use simstring_rust in your Rust project:

use simstring_rust::database::HashDb;
use simstring_rust::extractors::CharacterNgrams;
use simstring_rust::measures::Cosine;
use simstring_rust::Searcher;

use std::sync::Arc;

fn main() {
    // 1. Setup the database
    let feature_extractor = Arc::new(CharacterNgrams::new(2, "$"));
    let mut db = HashDb::new(feature_extractor);

    // 2. Index some strings
    db.insert("hello".to_string());
    db.insert("help".to_string());
    db.insert("halo".to_string());
    db.insert("world".to_string());

    // 3. Search for strings
    let measure = Cosine;
    let searcher = Searcher::new(&db, measure);
    let query = "hell";
    let alpha = 0.5;

    if let Ok(results) = searcher.ranked_search(query, alpha) {
        println!("Found {} results for query '{}'", results.len(), query);
        for (item, score) in results {
            println!("- Match: '{}', Score: {:.4}", item, score);
        }
    }
}
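The threshold `alpha` is the minimum similarity a stored string must reach to be returned. In the CPMerge approach described in the SimString paper, a threshold also restricts which feature-set sizes can possibly match and how many features must be shared. The sketch below shows those bounds for the cosine measure; the function names are hypothetical, this is not the crate's code, and the crate may compute or round these values differently.

```rust
/// Hypothetical helper: range of candidate feature-set sizes |Y| that can
/// reach cosine similarity `alpha` against a query with `query_size` features.
/// From cosine >= alpha it follows that alpha^2 * |X| <= |Y| <= |X| / alpha^2.
fn cosine_size_bounds(query_size: usize, alpha: f64) -> (usize, usize) {
    let x = query_size as f64;
    let min_size = (alpha * alpha * x).ceil() as usize;
    let max_size = (x / (alpha * alpha)).floor() as usize;
    (min_size, max_size)
}

/// Hypothetical helper: minimum number of shared features needed for a
/// candidate of `candidate_size` features to reach cosine similarity `alpha`,
/// i.e. ceil(alpha * sqrt(|X| * |Y|)).
fn cosine_min_overlap(query_size: usize, candidate_size: usize, alpha: f64) -> usize {
    (alpha * ((query_size * candidate_size) as f64).sqrt()).ceil() as usize
}

fn main() {
    // "hell" has 5 padded character bigrams: $h, he, el, ll, l$.
    let query_size = 5;
    let alpha = 0.5;

    let (min_size, max_size) = cosine_size_bounds(query_size, alpha);
    println!("candidate sizes: {min_size}..={max_size}");

    // "hello" has 6 padded bigrams; how many must it share with the query?
    let needed = cosine_min_overlap(query_size, 6, alpha);
    println!("shared features needed for a size-6 candidate: {needed}");
}
```

With these bounds, strings whose feature count falls outside the size range can be skipped without computing a score, which is the main source of the speed-up over comparing the query against every stored string. The floating-point rounding here is simplistic; thresholds that are not exactly representable may need extra care at the boundaries.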

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License.

Benchmarks

The benches/run_benches.py harness compares several language bindings (Rust, Python, Julia, Ruby, C++).

Running the harness requires:

  • git, autoconf, automake, libtool, make, python, uv, and a C++ compiler (g++) to build the C++ CLI.

The C++ sources are cloned into benches/.simstring_cpp/ and a local copy of the simstring binary is installed under that directory. If you need to rebuild from scratch, remove benches/.simstring_cpp/ before re-running the benchmark suite.

Acknowledgements

Inspired by the SimString project.


Built Distributions


  • simstring_rust-0.3.5b1-cp37-abi3-win_amd64.whl (257.8 kB): CPython 3.7+, Windows x86-64
  • simstring_rust-0.3.5b1-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB): CPython 3.7+, manylinux (glibc 2.17+) x86-64
  • simstring_rust-0.3.5b1-cp37-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (806.8 kB): CPython 3.7+, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

