A fast, native Rust implementation of the SimString algorithm with Python bindings.

simstring_rust

A native Rust implementation of the CPMerge algorithm for approximate string matching. This crate is particularly useful for natural language processing tasks that require retrieving strings from very large corpora. It currently supports both character-based and word-based n-gram feature generation, with plans to allow custom user-defined feature generation methods.
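To make the idea of n-gram features concrete, here is a minimal sketch of character-based and word-based n-gram generation using only the standard library. The function names `char_ngrams` and `word_ngrams` are hypothetical and are not the crate's API. Padding both ends with `n - 1` copies of a marker character is an assumption modelled on the original SimString, so the crate's own extractors may produce different features.

```rust
/// Hypothetical helper, not part of simstring_rust: character n-grams of
/// `text`, padded on both sides with `n - 1` copies of `marker` (an assumed
/// padding scheme). Works on Unicode scalar values, not bytes.
fn char_ngrams(text: &str, n: usize, marker: char) -> Vec<String> {
    assert!(n >= 1, "n must be at least 1");
    let pad = std::iter::repeat(marker).take(n - 1);
    let chars: Vec<char> = pad.clone().chain(text.chars()).chain(pad).collect();
    // Slide a window of `n` characters over the padded text.
    chars
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Hypothetical helper, not part of simstring_rust: word n-grams of `text`,
/// splitting on whitespace and joining each window with a single space.
fn word_ngrams(text: &str, n: usize) -> Vec<String> {
    assert!(n >= 1, "n must be at least 1");
    let words: Vec<&str> = text.split_whitespace().collect();
    words.windows(n).map(|w| w.join(" ")).collect()
}
```

Under these assumptions, `char_ngrams("hell", 2, '$')` yields the five bigrams `$h`, `he`, `el`, `ll`, `l$`, and `word_ngrams("a b c", 2)` yields `a b` and `b c`.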

Features

  • ✅ Fast algorithm for string matching
  • ✅ 100% exact retrieval
  • ✅ Support for Unicode
  • Support for building databases directly from text files
  • Mecab-based tokenizer support

Supported String Similarity Measures

  • ✅ Dice coefficient
  • ✅ Jaccard coefficient
  • ✅ Cosine coefficient
  • ✅ Overlap coefficient
  • ✅ Exact match
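For reference, the measures listed above can be written directly in terms of two feature sets X and Y. The sketch below uses the textbook set definitions and is not the crate's implementation. All function names are hypothetical. It assumes both sets are non-empty and treats features as a plain set, so repeated n-grams are counted once.

```rust
use std::collections::HashSet;

/// Number of features present in both sets, as a float for the formulas below.
fn shared<'a>(x: &HashSet<&'a str>, y: &HashSet<&'a str>) -> f64 {
    x.intersection(y).count() as f64
}

/// Dice: 2|X ∩ Y| / (|X| + |Y|)
fn dice<'a>(x: &HashSet<&'a str>, y: &HashSet<&'a str>) -> f64 {
    2.0 * shared(x, y) / (x.len() + y.len()) as f64
}

/// Jaccard: |X ∩ Y| / |X ∪ Y|
fn jaccard<'a>(x: &HashSet<&'a str>, y: &HashSet<&'a str>) -> f64 {
    let c = shared(x, y);
    c / ((x.len() + y.len()) as f64 - c)
}

/// Cosine: |X ∩ Y| / sqrt(|X| * |Y|)
fn cosine<'a>(x: &HashSet<&'a str>, y: &HashSet<&'a str>) -> f64 {
    shared(x, y) / ((x.len() * y.len()) as f64).sqrt()
}

/// Overlap: |X ∩ Y| / min(|X|, |Y|)
fn overlap<'a>(x: &HashSet<&'a str>, y: &HashSet<&'a str>) -> f64 {
    shared(x, y) / x.len().min(y.len()) as f64
}

/// Exact match: 1.0 when the feature sets are identical, otherwise 0.0.
fn exact_match<'a>(x: &HashSet<&'a str>, y: &HashSet<&'a str>) -> f64 {
    if x == y { 1.0 } else { 0.0 }
}
```

As a worked case, take the padded bigrams of "hell" (`$h`, `he`, `el`, `ll`, `l$`) and "help" (`$h`, `he`, `el`, `lp`, `p$`). They share three features, so Dice, Cosine, and Overlap all evaluate to 0.6, while Jaccard is 3/7.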

Installation

Add simstring_rust to your Cargo.toml:

[dependencies]
simstring_rust = "0.3.0" # change version accordingly

For the latest features, you can depend on the main branch by specifying the Git repository:

[dependencies]
simstring_rust = { git = "https://github.com/PyDataBlog/simstring_rs.git", branch = "main" }

Note: The main branch may include experimental features and breaking changes. Use with caution!

To return to a stable release, specify a version number in your Cargo.toml instead of the Git repository.

Usage

Here is a basic example of how to use simstring_rust in your Rust project:

use simstring_rust::database::HashDb;
use simstring_rust::extractors::CharacterNgrams;
use simstring_rust::measures::Cosine;
use simstring_rust::Searcher;

use std::sync::Arc;

fn main() {
    // 1. Setup the database
    let feature_extractor = Arc::new(CharacterNgrams::new(2, "$"));
    let mut db = HashDb::new(feature_extractor);

    // 2. Index some strings
    db.insert("hello".to_string());
    db.insert("help".to_string());
    db.insert("halo".to_string());
    db.insert("world".to_string());

    // 3. Search for strings
    let measure = Cosine;
    let searcher = Searcher::new(&db, measure);
    let query = "hell";
    let alpha = 0.5;

    if let Ok(results) = searcher.ranked_search(query, alpha) {
        println!("Found {} results for query '{}'", results.len(), query);
        for (item, score) in results {
            println!("- Match: '{}', Score: {:.4}", item, score);
        }
    }
}
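The `alpha` threshold in the example above is what allows a CPMerge-style search to skip most of the database. In the SimString formulation, each measure implies bounds on how many features a candidate may have and how many it must share with the query. The sketch below shows those bounds for the cosine measure only. The function names are hypothetical, `0 < alpha <= 1` is assumed, floating-point rounding at exact boundaries is not handled, and whether the crate computes its bounds in exactly this way is an assumption.

```rust
/// Smallest candidate feature count that can still reach cosine >= alpha:
/// ceil(alpha^2 * |X|), where |X| is the query's feature count.
fn cosine_min_size(query_size: usize, alpha: f64) -> usize {
    (alpha * alpha * query_size as f64).ceil() as usize
}

/// Largest candidate feature count that can still reach cosine >= alpha:
/// floor(|X| / alpha^2).
fn cosine_max_size(query_size: usize, alpha: f64) -> usize {
    (query_size as f64 / (alpha * alpha)).floor() as usize
}

/// Minimum number of shared features a candidate with |Y| features needs:
/// ceil(alpha * sqrt(|X| * |Y|)).
fn cosine_min_overlap(query_size: usize, candidate_size: usize, alpha: f64) -> usize {
    (alpha * ((query_size * candidate_size) as f64).sqrt()).ceil() as usize
}
```

For a query with 5 features and `alpha = 0.5`, these give a candidate size range of 2 to 20 features, and a candidate with 5 features must share at least 3 of them with the query.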

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License.

Acknowledgements

Inspired by the SimString project.


