A fast, native Rust implementation of the SimString algorithm with Python bindings.

Project description

simstring_rust

A native Rust implementation of the CPMerge algorithm for approximate string matching. The crate is particularly useful for natural language processing tasks that retrieve similar strings from very large corpora. It currently supports character-based and word-based n-gram feature generation, with plans to allow custom user-defined feature generation methods.
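To make the idea of n-gram features concrete, here is a minimal, standalone sketch of character and word n-gram extraction. It does not use the crate: `char_ngrams` and `word_ngrams` are hypothetical names, and the padding convention (n - 1 copies of a marker on each side) is an assumption that may differ from what the crate's extractors do.

```rust
/// Character n-grams of `text`, padded with `n - 1` copies of `pad` on each
/// side so that the first and last characters also start or end a gram.
/// Illustrative only; not the crate's extractor.
fn char_ngrams(text: &str, n: usize, pad: &str) -> Vec<String> {
    assert!(n >= 1, "n must be at least 1");
    let padding = pad.repeat(n - 1);
    // Work on `char`s rather than bytes so multi-byte Unicode text is split correctly.
    let padded: Vec<char> = format!("{}{}{}", padding, text, padding).chars().collect();
    padded.windows(n).map(|w| w.iter().collect()).collect()
}

/// Word n-grams of `text`, splitting on whitespace. Illustrative only.
fn word_ngrams(text: &str, n: usize) -> Vec<String> {
    assert!(n >= 1, "n must be at least 1");
    let words: Vec<&str> = text.split_whitespace().collect();
    words.windows(n).map(|w| w.join(" ")).collect()
}

fn main() {
    println!("{:?}", char_ngrams("hell", 2, "$"));
    println!("{:?}", word_ngrams("the quick brown fox", 2));
}
```

With n = 2 and "$" as the marker, "hell" yields the five bigrams `$h`, `he`, `el`, `ll`, `l$`. This sketch keeps repeated n-grams as they are; a real extractor may treat duplicates differently.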

Features

  • ✅ Fast algorithm for string matching
  • ✅ 100% exact retrieval
  • ✅ Support for Unicode
  • Support for building databases directly from text files (planned)
  • MeCab-based tokenizer support (planned)

Supported String Similarity Measures

  • ✅ Dice coefficient
  • ✅ Jaccard coefficient
  • ✅ Cosine coefficient
  • ✅ Overlap coefficient
  • ✅ Exact match
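For reference, the measures listed above have simple set-based definitions. The following standalone sketch computes them from the sizes of two feature sets and the size of their intersection. It does not call the crate; all function names are hypothetical, and both sets are assumed to be non-empty.

```rust
use std::collections::HashSet;

/// Number of features shared by two feature sets.
fn common<'a>(x: &HashSet<&'a str>, y: &HashSet<&'a str>) -> usize {
    x.intersection(y).count()
}

// Each measure takes |X|, |Y| and |X ∩ Y| (as `x`, `y`, `c`).
// Both sets are assumed to be non-empty, so no division by zero occurs.

fn dice(x: usize, y: usize, c: usize) -> f64 {
    2.0 * c as f64 / (x + y) as f64
}

fn jaccard(x: usize, y: usize, c: usize) -> f64 {
    c as f64 / (x + y - c) as f64
}

fn cosine(x: usize, y: usize, c: usize) -> f64 {
    c as f64 / ((x * y) as f64).sqrt()
}

fn overlap(x: usize, y: usize, c: usize) -> f64 {
    c as f64 / x.min(y) as f64
}

/// 1.0 when the two sets are identical, 0.0 otherwise.
fn exact(x: usize, y: usize, c: usize) -> f64 {
    if c == x && c == y { 1.0 } else { 0.0 }
}

fn main() {
    // Padded character bigrams of "hell" and "hello" (an assumed feature scheme).
    let x: HashSet<&str> = ["$h", "he", "el", "ll", "l$"].iter().copied().collect();
    let y: HashSet<&str> = ["$h", "he", "el", "ll", "lo", "o$"].iter().copied().collect();
    let c = common(&x, &y);

    println!("dice    = {:.4}", dice(x.len(), y.len(), c));
    println!("jaccard = {:.4}", jaccard(x.len(), y.len(), c));
    println!("cosine  = {:.4}", cosine(x.len(), y.len(), c));
    println!("overlap = {:.4}", overlap(x.len(), y.len(), c));
    println!("exact   = {:.4}", exact(x.len(), y.len(), c));
}
```

For these two sets the intersection has 4 elements, so Dice is 8/11, Jaccard is 4/7, cosine is 4/√30, and overlap is 4/5.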

Installation

Add simstring_rust to your Cargo.toml:

[dependencies]
simstring_rust = "0.3.0" # change version accordingly

For the latest features, you can depend on the main branch by specifying the Git repository:

[dependencies]
simstring_rust = { git = "https://github.com/PyDataBlog/simstring_rs.git", branch = "main" }

Note: The main branch may include experimental features and breaking changes. Use with caution!

To return to a stable release, specify a version number in your Cargo.toml instead of the Git repository.

Usage

Here is a basic example of how to use simstring_rust in your Rust project:

use simstring_rust::database::HashDb;
use simstring_rust::extractors::CharacterNgrams;
use simstring_rust::measures::Cosine;
use simstring_rust::Searcher;

use std::sync::Arc;

fn main() {
    // 1. Setup the database
    let feature_extractor = Arc::new(CharacterNgrams::new(2, "$"));
    let mut db = HashDb::new(feature_extractor);

    // 2. Index some strings
    db.insert("hello".to_string());
    db.insert("help".to_string());
    db.insert("halo".to_string());
    db.insert("world".to_string());

    // 3. Search for strings
    let measure = Cosine;
    let searcher = Searcher::new(&db, measure);
    let query = "hell";
    let alpha = 0.5;

    if let Ok(results) = searcher.ranked_search(query, alpha) {
        println!("Found {} results for query '{}'", results.len(), query);
        for (item, score) in results {
            println!("- Match: '{}', Score: {:.4}", item, score);
        }
    }
}
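The threshold `alpha` in the example above is what lets a CPMerge-style search prune candidates before scoring them. The sketch below follows the cosine bounds given in the original SimString paper, not this crate's source, so treat it as an illustration of the idea. The function names are hypothetical, and the feature counts assume plain padded character bigrams.

```rust
/// Range of candidate feature-set sizes that can still reach cosine >= alpha
/// for a query with `query_size` features: [ceil(a^2 * |X|), floor(|X| / a^2)].
fn cosine_size_bounds(query_size: usize, alpha: f64) -> (usize, usize) {
    let q = query_size as f64;
    let min = (alpha * alpha * q).ceil() as usize;
    let max = (q / (alpha * alpha)).floor() as usize;
    (min, max)
}

/// Minimum number of shared features a candidate of `candidate_size` needs
/// to reach cosine >= alpha: ceil(a * sqrt(|X| * |Y|)).
fn cosine_min_overlap(query_size: usize, candidate_size: usize, alpha: f64) -> usize {
    (alpha * ((query_size * candidate_size) as f64).sqrt()).ceil() as usize
}

fn main() {
    // "hell" has 5 padded bigrams under the assumed scheme: $h he el ll l$
    let query_size = 5;
    let alpha = 0.5;

    let (min, max) = cosine_size_bounds(query_size, alpha);
    println!("candidate sizes to consider: {}..={}", min, max);

    // Candidates with 5 features (e.g. "help") and 6 features (e.g. "hello").
    for &size in &[5usize, 6] {
        let tau = cosine_min_overlap(query_size, size, alpha);
        println!("size {} needs at least {} shared features", size, tau);
    }
}
```

Under these assumptions, only strings with 2 to 20 features need to be examined for the query "hell" at alpha = 0.5, and a candidate with 5 or 6 features must share at least 3 of them with the query.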

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License.

Acknowledgements

Inspired by the SimString project.

Download files

Source Distributions

No source distribution files are available for this release.

Built Distributions

simstring_rust-0.3.1-cp37-abi3-win_amd64.whl (249.6 kB)

Uploaded: CPython 3.7+, Windows x86-64

simstring_rust-0.3.1-cp37-abi3-manylinux_2_34_x86_64.whl (3.6 MB)

Uploaded: CPython 3.7+, manylinux: glibc 2.34+ x86-64

simstring_rust-0.3.1-cp37-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (788.7 kB)

Uploaded: CPython 3.7+, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

File details

Details for the file simstring_rust-0.3.1-cp37-abi3-win_amd64.whl.

File hashes

Hashes for simstring_rust-0.3.1-cp37-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 8a058224403e552801154579f816d48d3cad0834ccb293e37e46f6c7b5b02e88
MD5 656a8928b282fdb87787dad12963eb6a
BLAKE2b-256 097e986cf8acb0bf0deca0acde1a3cc5478f0c9ca26df8dec6c2e965fc79388a

File details

Details for the file simstring_rust-0.3.1-cp37-abi3-manylinux_2_34_x86_64.whl.

File hashes

Hashes for simstring_rust-0.3.1-cp37-abi3-manylinux_2_34_x86_64.whl
Algorithm Hash digest
SHA256 30b5b95c5a12280dc272ce24ec26f71847e6114a971a534a2126520cdb72fa8f
MD5 1d9f7d1eaf60e763e80c705f945008cd
BLAKE2b-256 b5de1d9b74b90bb5123c3c57d96bb7088215cea8ca3bb2150d72b705b0247eb9

File details

Details for the file simstring_rust-0.3.1-cp37-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl.

File hashes

Hashes for simstring_rust-0.3.1-cp37-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl
Algorithm Hash digest
SHA256 c0382954a20a4337832a00efd2d290b729c5c5ab2d4b297f925d2a05297813cb
MD5 d85eea71ff94575844ebf92ddba0f06a
BLAKE2b-256 a2165dab67736bd6bb42a8207cce506fbb4428423963fbff6a808ffc99bf523f
