A fast, native Rust implementation of the SimString algorithm with Python bindings.

simstring_rust

A native Rust implementation of the CPMerge algorithm for approximate string matching. This crate is particularly useful for natural language processing tasks that require retrieving strings from very large corpora. It currently supports character-based and word-based n-gram feature generation, with plans to allow custom user-defined feature generation methods.
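To make the two feature types concrete, here is a minimal, standalone sketch of character and word n-gram generation. The helpers `char_ngrams` and `word_ngrams` are hypothetical names used for illustration only and are not part of this crate's API; the crate's own extractors may differ in details such as how repeated n-grams are handled.

```rust
/// Hypothetical helper (not the crate's API): character n-grams of `text`,
/// padded with `pad` on both sides so that leading and trailing characters
/// appear in as many n-grams as interior ones.
fn char_ngrams(text: &str, n: usize, pad: &str) -> Vec<String> {
    assert!(n >= 1, "n must be at least 1");
    let padding = pad.repeat(n - 1);
    // Collect into chars so multi-byte Unicode characters stay intact.
    let padded: Vec<char> = format!("{padding}{text}{padding}").chars().collect();
    padded.windows(n).map(|w| w.iter().collect()).collect()
}

/// Hypothetical helper (not the crate's API): word n-grams over
/// whitespace-separated tokens.
fn word_ngrams(text: &str, n: usize) -> Vec<String> {
    assert!(n >= 1, "n must be at least 1");
    let tokens: Vec<&str> = text.split_whitespace().collect();
    tokens.windows(n).map(|w| w.join(" ")).collect()
}

fn main() {
    // Character bigrams with "$" as the padding marker.
    println!("{:?}", char_ngrams("hello", 2, "$"));
    // Word bigrams.
    println!("{:?}", word_ngrams("approximate string matching", 2));
}
```

With bigrams and `$` padding, `"hello"` is first padded to `"$hello$"` and then split into overlapping windows of two characters.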

Features

  • ✅ Fast algorithm for string matching
  • ✅ 100% exact retrieval
  • ✅ Support for Unicode
  • Support for building databases directly from text files
  • Mecab-based tokenizer support

Supported String Similarity Measures

  • ✅ Dice coefficient
  • ✅ Jaccard coefficient
  • ✅ Cosine coefficient
  • ✅ Overlap coefficient
  • ✅ Exact match
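The measures listed above all compare two strings through the overlap of their feature sets. The following standalone sketch uses the standard set-based definitions; the function names are hypothetical and do not reflect this crate's internal implementation. It assumes padded character bigrams as features, so every feature set is non-empty.

```rust
use std::collections::HashSet;

/// Hypothetical helper: padded character bigrams as a set of features.
fn bigrams(text: &str) -> HashSet<String> {
    let padded: Vec<char> = format!("${text}$").chars().collect();
    padded.windows(2).map(|w| w.iter().collect()).collect()
}

/// Number of features shared by both sets.
fn shared(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    x.intersection(y).count() as f64
}

fn dice(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    2.0 * shared(x, y) / (x.len() + y.len()) as f64
}

fn jaccard(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    let c = shared(x, y);
    c / ((x.len() + y.len()) as f64 - c)
}

fn cosine(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    shared(x, y) / ((x.len() * y.len()) as f64).sqrt()
}

fn overlap(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    shared(x, y) / x.len().min(y.len()) as f64
}

fn exact_match(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    if x == y { 1.0 } else { 0.0 }
}

fn main() {
    let (a, b) = (bigrams("hello"), bigrams("hell"));
    println!("dice    = {:.4}", dice(&a, &b));
    println!("jaccard = {:.4}", jaccard(&a, &b));
    println!("cosine  = {:.4}", cosine(&a, &b));
    println!("overlap = {:.4}", overlap(&a, &b));
    println!("exact   = {:.4}", exact_match(&a, &b));
}
```

Here `"hello"` has 6 bigrams, `"hell"` has 5, and they share 4, so for example the Jaccard coefficient is 4 / (6 + 5 - 4) = 4/7.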

Installation

Add simstring_rust to your Cargo.toml:

[dependencies]
simstring_rust = "0.3.0" # change version accordingly

For the latest features, you can depend on the main branch by specifying the Git repository:

[dependencies]
simstring_rust = { git = "https://github.com/PyDataBlog/simstring_rs.git", branch = "main" }

Note: The main branch may include experimental features and breaking changes. Use with caution!

To return to a stable release, specify a version number in your Cargo.toml instead of the Git repository.

Usage

Here is a basic example of how to use simstring_rust in your Rust project:

use simstring_rust::database::HashDb;
use simstring_rust::extractors::CharacterNgrams;
use simstring_rust::measures::Cosine;
use simstring_rust::Searcher;

use std::sync::Arc;

fn main() {
    // 1. Setup the database
    let feature_extractor = Arc::new(CharacterNgrams::new(2, "$"));
    let mut db = HashDb::new(feature_extractor);

    // 2. Index some strings
    db.insert("hello".to_string());
    db.insert("help".to_string());
    db.insert("halo".to_string());
    db.insert("world".to_string());

    // 3. Search for strings
    let measure = Cosine;
    let searcher = Searcher::new(&db, measure);
    let query = "hell";
    let alpha = 0.5;

    if let Ok(results) = searcher.ranked_search(query, alpha) {
        println!("Found {} results for query '{}'", results.len(), query);
        for (item, score) in results {
            println!("- Match: '{}', Score: {:.4}", item, score);
        }
    }
}
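The threshold `alpha` in the example above is the minimum similarity a stored string must reach to be returned. In the original SimString formulation, the threshold also bounds how many features a candidate can have and how many it must share with the query, which is what lets CPMerge skip most of the database. The sketch below shows those bounds for the cosine measure; the function names are hypothetical, and whether this crate computes the bounds in exactly this way is an assumption.

```rust
/// Smallest feature-set size a candidate can have and still reach
/// cosine similarity `alpha` against a query with `query_size` features.
fn min_candidate_size(query_size: usize, alpha: f64) -> usize {
    (alpha * alpha * query_size as f64).ceil() as usize
}

/// Largest feature-set size a candidate can have under the same condition.
fn max_candidate_size(query_size: usize, alpha: f64) -> usize {
    (query_size as f64 / (alpha * alpha)).floor() as usize
}

/// Minimum number of shared features needed for a candidate with
/// `candidate_size` features to reach cosine similarity `alpha`.
fn min_overlap(query_size: usize, candidate_size: usize, alpha: f64) -> usize {
    (alpha * ((query_size * candidate_size) as f64).sqrt()).ceil() as usize
}

fn main() {
    // "hell" with padded character bigrams has 5 features.
    let (query_size, alpha) = (5, 0.5);
    println!("candidate sizes: {}..={}",
        min_candidate_size(query_size, alpha),
        max_candidate_size(query_size, alpha));
    // "hello" has 6 features.
    println!("min overlap vs. a 6-feature string: {}",
        min_overlap(query_size, 6, alpha));
}
```

For the query `"hell"` at `alpha = 0.5`, only strings with 2 to 20 features need to be considered, and a 6-feature string such as `"hello"` must share at least 3 bigrams with the query.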

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License.

Acknowledgements

Inspired by the SimString project.
