A fast, native Rust implementation of the SimString algorithm with Python bindings.

simstring_rust

A native Rust implementation of the CPMerge algorithm for approximate string matching. This crate is particularly useful for natural language processing tasks that retrieve similar strings from very large corpora. It currently supports character-based and word-based n-gram feature generation, with plans to allow custom user-defined feature generation methods.
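To make "character and word-based n-gram feature generation" concrete, here is a minimal, self-contained sketch. It is not the crate's implementation: the function names `char_ngrams` and `word_ngrams` are hypothetical, and the padding scheme (n - 1 copies of a marker on each side) is an assumption chosen for illustration.

```rust
/// Character n-grams of `text`, padded with `n - 1` copies of `pad` on each
/// side so that the first and last characters also start and end a gram.
/// (Hypothetical helper; the padding scheme is an assumption.)
fn char_ngrams(text: &str, n: usize, pad: &str) -> Vec<String> {
    assert!(n > 0, "n must be positive");
    let padding = pad.repeat(n - 1);
    // Work on `char`s rather than bytes so multi-byte Unicode text is not split.
    let padded: Vec<char> = format!("{}{}{}", padding, text, padding).chars().collect();
    padded.windows(n).map(|w| w.iter().collect()).collect()
}

/// Word n-grams: runs of `n` whitespace-separated tokens joined by one space.
/// (Hypothetical helper, no padding.)
fn word_ngrams(text: &str, n: usize) -> Vec<String> {
    assert!(n > 0, "n must be positive");
    let words: Vec<&str> = text.split_whitespace().collect();
    words.windows(n).map(|w| w.join(" ")).collect()
}

fn main() {
    // Prints ["$h", "he", "el", "ll", "lo", "o$"]
    println!("{:?}", char_ngrams("hello", 2, "$"));
    // Prints ["approximate string", "string matching"]
    println!("{:?}", word_ngrams("approximate string matching", 2));
}
```

Two strings that share many of these features are considered similar; the similarity measures listed below quantify that overlap.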

Features

  • ✅ Fast algorithm for string matching
  • ✅ 100% exact retrieval
  • ✅ Support for Unicode
  • ☐ Support for building databases directly from text files (planned)
  • ☐ Mecab-based tokenizer support (planned)

Supported String Similarity Measures

  • ✅ Dice coefficient
  • ✅ Jaccard coefficient
  • ✅ Cosine coefficient
  • ✅ Overlap coefficient
  • ✅ Exact match
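For reference, the measures above can be sketched with their standard set-based definitions, where X and Y are the feature sets of two strings. This is an illustrative sketch only: the function names are hypothetical and do not reflect the crate's API, and treating features as plain sets (rather than multisets) and returning 0.0 for empty sets are assumptions.

```rust
use std::collections::HashSet;

type Features<'a> = HashSet<&'a str>;

/// Hypothetical helper: build a feature set from a slice of n-grams.
fn set<'a>(items: &[&'a str]) -> Features<'a> {
    items.iter().copied().collect()
}

/// |X ∩ Y| as a float.
fn common(x: &Features, y: &Features) -> f64 {
    x.intersection(y).count() as f64
}

/// Dice: 2|X ∩ Y| / (|X| + |Y|)
fn dice(x: &Features, y: &Features) -> f64 {
    if x.is_empty() || y.is_empty() { return 0.0; }
    2.0 * common(x, y) / (x.len() + y.len()) as f64
}

/// Jaccard: |X ∩ Y| / |X ∪ Y|
fn jaccard(x: &Features, y: &Features) -> f64 {
    if x.is_empty() || y.is_empty() { return 0.0; }
    common(x, y) / x.union(y).count() as f64
}

/// Cosine: |X ∩ Y| / sqrt(|X| * |Y|)
fn cosine(x: &Features, y: &Features) -> f64 {
    if x.is_empty() || y.is_empty() { return 0.0; }
    common(x, y) / ((x.len() * y.len()) as f64).sqrt()
}

/// Overlap: |X ∩ Y| / min(|X|, |Y|)
fn overlap(x: &Features, y: &Features) -> f64 {
    if x.is_empty() || y.is_empty() { return 0.0; }
    common(x, y) / x.len().min(y.len()) as f64
}

/// Exact match: 1.0 if the feature sets are identical, otherwise 0.0.
fn exact_match(x: &Features, y: &Features) -> f64 {
    if x == y { 1.0 } else { 0.0 }
}

fn main() {
    // Padded character bigrams of "hell" and "hello"; they share 4 features.
    let x = set(&["$h", "he", "el", "ll", "l$"]);
    let y = set(&["$h", "he", "el", "ll", "lo", "o$"]);
    println!("dice    = {:.4}", dice(&x, &y));        // 8 / 11
    println!("jaccard = {:.4}", jaccard(&x, &y));     // 4 / 7
    println!("cosine  = {:.4}", cosine(&x, &y));      // 4 / sqrt(30)
    println!("overlap = {:.4}", overlap(&x, &y));     // 4 / 5
    println!("exact   = {:.4}", exact_match(&x, &y)); // sets differ
}
```

Under these assumptions, "hell" and "hello" score roughly 0.73 on both Dice and Cosine, 0.57 on Jaccard, and 0.80 on Overlap. Scores reported by the crate may differ if it extracts or counts features differently.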

Installation

Add simstring_rust to your Cargo.toml:

[dependencies]
simstring_rust = "0.3.0" # change version accordingly

For the latest features, you can depend on the main branch by specifying the Git repository:

[dependencies]
simstring_rust = { git = "https://github.com/PyDataBlog/simstring_rs.git", branch = "main" }

Note: The main branch may include experimental features and breaking changes. Use with caution!

To revert to a stable release, make sure your Cargo.toml pins a version number instead of the Git repository.

Usage

Here is a basic example of how to use simstring_rust in your Rust project:

use simstring_rust::database::HashDb;
use simstring_rust::extractors::CharacterNgrams;
use simstring_rust::measures::Cosine;
use simstring_rust::Searcher;

use std::sync::Arc;

fn main() {
    // 1. Setup the database
    let feature_extractor = Arc::new(CharacterNgrams::new(2, "$"));
    let mut db = HashDb::new(feature_extractor);

    // 2. Index some strings
    db.insert("hello".to_string());
    db.insert("help".to_string());
    db.insert("halo".to_string());
    db.insert("world".to_string());

    // 3. Search for strings
    let measure = Cosine;
    let searcher = Searcher::new(&db, measure);
    let query = "hell";
    let alpha = 0.5;

    if let Ok(results) = searcher.ranked_search(query, alpha) {
        println!("Found {} results for query '{}'", results.len(), query);
        for (item, score) in results {
            println!("- Match: '{}', Score: {:.4}", item, score);
        }
    }
}

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License.

Acknowledgements

Inspired by the SimString project.

Download files

Source Distributions

No source distribution files available for this release.

Built Distributions

simstring_rust-0.3.3-cp37-abi3-win_amd64.whl (245.3 kB)

CPython 3.7+, Windows x86-64

simstring_rust-0.3.3-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.7 MB)

CPython 3.7+, manylinux: glibc 2.17+ x86-64

simstring_rust-0.3.3-cp37-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl (781.5 kB)

CPython 3.7+, macOS 10.12+ universal2 (ARM64, x86-64), macOS 10.12+ x86-64, macOS 11.0+ ARM64

File details

Hashes for simstring_rust-0.3.3-cp37-abi3-win_amd64.whl

Algorithm     Hash digest
SHA256        f819e0d0f8216b6997f4462be871fd4886de428d9b76dfe36c8d1d6d6ebfaa64
MD5           53c74b21fad07aa3be6b65b4b023a32c
BLAKE2b-256   5e89066533a4af8adfd14a536fca5d5fdbf809c2023d48f415ac5a0d6b9dee62

Hashes for simstring_rust-0.3.3-cp37-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl

Algorithm     Hash digest
SHA256        396a5397d73b17f7916228ae9a0af9eb7991c8f0386d992b6bfde81bf984d65c
MD5           36bd7a3dd134141f88e94a35d2447b0d
BLAKE2b-256   5a3d1f010bb56553bb02706fefd86647e642f4a3477d53fbb6be2b030d230a72

Hashes for simstring_rust-0.3.3-cp37-abi3-macosx_10_12_x86_64.macosx_11_0_arm64.macosx_10_12_universal2.whl

Algorithm     Hash digest
SHA256        bc01df58a32130ea6853e69c04ac45cee22522b0c3dde8e1e4abbee959868501
MD5           ba9e9c24ac4d7c59fd1b63905f073356
BLAKE2b-256   c281b13e1a265e47df29e7359a17a453aba7c773d44b89a23e7054fbec1d8034
