A fast, native Rust implementation of the SimString algorithm with Python bindings.

simstring_rust

A native Rust implementation of the CPMerge algorithm for approximate string matching. This crate is particularly useful for natural language processing tasks that need to retrieve strings from very large corpora. It currently supports both character-based and word-based n-gram feature generation, with plans to allow custom user-defined feature generation methods.
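To make the two feature types concrete, here is a minimal standalone sketch of character and word n-gram extraction. It is not the crate's own code: the function names `char_ngrams` and `word_ngrams` are hypothetical, and the padding scheme (`n - 1` copies of a marker on each side) is an assumption that may differ from what the crate does internally.

```rust
/// Character n-grams of `text`, padded on both sides with `n - 1` copies of
/// `pad` (assumed padding scheme, hypothetical helper).
/// Works on Unicode scalar values rather than bytes.
fn char_ngrams(text: &str, n: usize, pad: &str) -> Vec<String> {
    assert!(n > 0, "n must be at least 1");
    let padding = pad.repeat(n - 1);
    let padded: Vec<char> = format!("{}{}{}", padding, text, padding)
        .chars()
        .collect();
    padded
        .windows(n)
        .map(|window| window.iter().collect::<String>())
        .collect()
}

/// Word n-grams: sliding windows over whitespace-separated tokens
/// (hypothetical helper). Returns an empty vector when the text has
/// fewer than `n` words.
fn word_ngrams(text: &str, n: usize) -> Vec<String> {
    assert!(n > 0, "n must be at least 1");
    let words: Vec<&str> = text.split_whitespace().collect();
    words.windows(n).map(|window| window.join(" ")).collect()
}
```

With these definitions, `char_ngrams("hello", 2, "$")` yields `$h`, `he`, `el`, `ll`, `lo`, `o$`, and `word_ngrams("the quick brown fox", 2)` yields `the quick`, `quick brown`, `brown fox`.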

Features

  • ✅ Fast algorithm for string matching
  • ✅ 100% exact retrieval
  • ✅ Support for Unicode
  • Support for building databases directly from text files
  • Mecab-based tokenizer support

Supported String Similarity Measures

  • ✅ Dice coefficient
  • ✅ Jaccard coefficient
  • ✅ Cosine coefficient
  • ✅ Overlap coefficient
  • ✅ Exact match
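For reference, the measures listed above can be written as set operations over the feature sets of two strings. The sketch below uses their standard textbook definitions with hypothetical function names. It is not taken from the crate, which may handle details such as duplicate n-grams differently.

```rust
use std::collections::HashSet;

/// Builds a feature set from string slices (hypothetical helper for the examples).
fn feature_set(items: &[&str]) -> HashSet<String> {
    items.iter().map(|s| s.to_string()).collect()
}

/// Number of features shared by both sets.
fn shared(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    x.intersection(y).count() as f64
}

// All measures below assume both sets are non-empty.

/// Dice: 2|X ∩ Y| / (|X| + |Y|)
fn dice(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    2.0 * shared(x, y) / (x.len() + y.len()) as f64
}

/// Jaccard: |X ∩ Y| / |X ∪ Y|
fn jaccard(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    shared(x, y) / x.union(y).count() as f64
}

/// Cosine: |X ∩ Y| / sqrt(|X| * |Y|)
fn cosine(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    shared(x, y) / ((x.len() * y.len()) as f64).sqrt()
}

/// Overlap: |X ∩ Y| / min(|X|, |Y|)
fn overlap(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    shared(x, y) / x.len().min(y.len()) as f64
}

/// Exact match: 1.0 only when the two feature sets are identical.
fn exact_match(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    if x == y { 1.0 } else { 0.0 }
}
```

For example, with X = {a, b, c} and Y = {b, c, d}, two features are shared, so Dice is 4/6, Jaccard is 2/4, and both cosine and overlap are 2/3.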

Installation

Add simstring_rust to your Cargo.toml:

[dependencies]
simstring_rust = "0.3.0" # change version accordingly

For the latest features, you can depend on the main branch of the Git repository:

[dependencies]
simstring_rust = { git = "https://github.com/PyDataBlog/simstring_rs.git", branch = "main" }

Note: The main branch may include experimental features and breaking changes. Use with caution!

To return to a stable release, specify a version number in your Cargo.toml instead of the Git repository.

Usage

Here is a basic example of how to use simstring_rust in your Rust project:

use simstring_rust::database::HashDb;
use simstring_rust::extractors::CharacterNgrams;
use simstring_rust::measures::Cosine;
use simstring_rust::Searcher;

use std::sync::Arc;

fn main() {
    // 1. Setup the database
    let feature_extractor = Arc::new(CharacterNgrams::new(2, "$"));
    let mut db = HashDb::new(feature_extractor);

    // 2. Index some strings
    db.insert("hello".to_string());
    db.insert("help".to_string());
    db.insert("halo".to_string());
    db.insert("world".to_string());

    // 3. Search for strings
    let measure = Cosine;
    let searcher = Searcher::new(&db, measure);
    let query = "hell";
    let alpha = 0.5;

    if let Ok(results) = searcher.ranked_search(query, alpha) {
        println!("Found {} results for query '{}'", results.len(), query);
        for (item, score) in results {
            println!("- Match: '{}', Score: {:.4}", item, score);
        }
    }
}
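To illustrate what the threshold `alpha` means in a search like the one above, here is a brute-force baseline that scores every indexed string against the query and keeps those whose cosine similarity is at least `alpha`. It is only a sketch: the names `bigrams`, `cosine`, and `brute_force_search` are hypothetical, it assumes character bigrams padded with `$`, and it scans linearly rather than using CPMerge, so it shows the intended result set and not how the crate computes it.

```rust
use std::collections::HashSet;

/// Character bigrams of `text`, padded with one '$' on each side (assumed scheme).
fn bigrams(text: &str) -> HashSet<String> {
    let padded: Vec<char> = format!("${}$", text).chars().collect();
    padded
        .windows(2)
        .map(|window| window.iter().collect::<String>())
        .collect()
}

/// Cosine similarity between two feature sets: |X ∩ Y| / sqrt(|X| * |Y|).
fn cosine(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    x.intersection(y).count() as f64 / ((x.len() * y.len()) as f64).sqrt()
}

/// Scores every string in `corpus` against `query`, keeps those with
/// similarity >= `alpha`, and returns them sorted by descending score.
fn brute_force_search<'a>(corpus: &[&'a str], query: &str, alpha: f64) -> Vec<(&'a str, f64)> {
    let query_features = bigrams(query);
    let mut hits: Vec<(&'a str, f64)> = corpus
        .iter()
        .map(|&candidate| (candidate, cosine(&query_features, &bigrams(candidate))))
        .filter(|&(_, score)| score >= alpha)
        .collect();
    hits.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    hits
}
```

Under these assumptions, searching `["hello", "help", "halo", "world"]` for `"hell"` with `alpha = 0.5` keeps `hello` (4 shared bigrams out of 5 and 6, about 0.73) and `help` (3 shared out of 5 and 5, exactly 0.6), while `halo` (0.2) and `world` (0.0) fall below the threshold. Scores reported by the crate may differ if its feature extraction differs from this sketch.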

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License.

Acknowledgements

Inspired by the SimString project.
