A fast, native Rust implementation of the SimString algorithm with Python bindings.

simstring_rust

A native Rust implementation of the CPMerge algorithm for approximate string matching. This crate is particularly useful for natural language processing tasks that require retrieving strings from very large corpora. It currently supports both character-based and word-based N-gram feature generation, with plans to allow custom user-defined feature generation methods.
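
To make the two feature types concrete, here is a minimal sketch of character and word N-gram generation using only the standard library. The helpers `char_ngrams` and `word_ngrams` are hypothetical names for illustration and are not the crate's API; padding both ends with `n - 1` marker characters is an assumption borrowed from common SimString-style implementations, and the crate's own extractors may behave differently.

```rust
/// Hypothetical helper, not part of simstring_rust: split `text` into
/// character n-grams after padding both ends with `n - 1` copies of `marker`.
fn char_ngrams(text: &str, n: usize, marker: char) -> Vec<String> {
    assert!(n > 0, "n must be at least 1");
    let pad = std::iter::repeat(marker).take(n - 1);
    // Work on `char`s rather than bytes so multi-byte Unicode text is not split mid-character.
    let chars: Vec<char> = pad.clone().chain(text.chars()).chain(pad).collect();
    chars
        .windows(n)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Hypothetical helper: word n-grams over whitespace-separated tokens, without padding.
fn word_ngrams(text: &str, n: usize) -> Vec<String> {
    assert!(n > 0, "n must be at least 1");
    let words: Vec<&str> = text.split_whitespace().collect();
    words.windows(n).map(|w| w.join(" ")).collect()
}

fn main() {
    println!("{:?}", char_ngrams("hello", 2, '$'));
    println!("{:?}", word_ngrams("approximate string matching", 2));
}
```

With these assumptions, "hello" yields six bigrams, from "$h" through "o$". Two strings are then compared by how many such features they share.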

Features

  • ✅ Fast algorithm for string matching
  • ✅ 100% exact retrieval
  • ✅ Support for Unicode
  • Support for building databases directly from text files
  • Mecab-based tokenizer support

Supported String Similarity Measures

  • ✅ Dice coefficient
  • ✅ Jaccard coefficient
  • ✅ Cosine coefficient
  • ✅ Overlap coefficient
  • ✅ Exact match
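
For reference, the measures above can be sketched over plain feature sets as follows. This uses the standard set-based definitions and is not the crate's implementation; the function names are hypothetical, both sets are assumed to be non-empty, and the crate may treat repeated N-grams differently from a plain `HashSet`.

```rust
use std::collections::HashSet;

/// Number of features shared by both sets, |X ∩ Y|.
fn common(x: &HashSet<&str>, y: &HashSet<&str>) -> f64 {
    x.intersection(y).count() as f64
}

/// Dice: 2|X ∩ Y| / (|X| + |Y|)
fn dice(x: &HashSet<&str>, y: &HashSet<&str>) -> f64 {
    2.0 * common(x, y) / (x.len() + y.len()) as f64
}

/// Jaccard: |X ∩ Y| / |X ∪ Y|
fn jaccard(x: &HashSet<&str>, y: &HashSet<&str>) -> f64 {
    let c = common(x, y);
    c / ((x.len() + y.len()) as f64 - c)
}

/// Cosine: |X ∩ Y| / sqrt(|X| * |Y|)
fn cosine(x: &HashSet<&str>, y: &HashSet<&str>) -> f64 {
    common(x, y) / ((x.len() * y.len()) as f64).sqrt()
}

/// Overlap: |X ∩ Y| / min(|X|, |Y|)
fn overlap(x: &HashSet<&str>, y: &HashSet<&str>) -> f64 {
    common(x, y) / x.len().min(y.len()) as f64
}

/// Exact match: the two feature sets are identical.
fn exact_match(x: &HashSet<&str>, y: &HashSet<&str>) -> bool {
    x == y
}

fn main() {
    // Padded character bigrams of "hell" and "hello" (an assumed feature scheme).
    let x: HashSet<&str> = ["$h", "he", "el", "ll", "l$"].iter().copied().collect();
    let y: HashSet<&str> = ["$h", "he", "el", "ll", "lo", "o$"].iter().copied().collect();
    println!("dice    = {:.4}", dice(&x, &y));
    println!("jaccard = {:.4}", jaccard(&x, &y));
    println!("cosine  = {:.4}", cosine(&x, &y));
    println!("overlap = {:.4}", overlap(&x, &y));
    println!("exact   = {}", exact_match(&x, &y));
}
```

The two example sets share four features out of five and six, so all four coefficients fall strictly between 0 and 1, and the exact-match check is false.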

Installation

Add simstring_rust to your Cargo.toml:

[dependencies]
simstring_rust = "0.3.0" # change version accordingly

For the latest features, you can depend on the main branch by specifying the Git repository:

[dependencies]
simstring_rust = { git = "https://github.com/PyDataBlog/simstring_rs.git", branch = "main" }

Note: The main branch may include experimental features and breaking changes. Use with caution!

To revert to a stable release, make sure your Cargo.toml specifies a version number instead of the Git repository.

Usage

Here is a basic example of how to use simstring_rust in your Rust project:

use simstring_rust::database::HashDb;
use simstring_rust::extractors::CharacterNgrams;
use simstring_rust::measures::Cosine;
use simstring_rust::Searcher;

use std::sync::Arc;

fn main() {
    // 1. Setup the database
    let feature_extractor = Arc::new(CharacterNgrams::new(2, "$"));
    let mut db = HashDb::new(feature_extractor);

    // 2. Index some strings
    db.insert("hello".to_string());
    db.insert("help".to_string());
    db.insert("halo".to_string());
    db.insert("world".to_string());

    // 3. Search for strings
    let measure = Cosine;
    let searcher = Searcher::new(&db, measure);
    let query = "hell";
    let alpha = 0.5;

    if let Ok(results) = searcher.ranked_search(query, alpha) {
        println!("Found {} results for query '{}'", results.len(), query);
        for (item, score) in results {
            println!("- Match: '{}', Score: {:.4}", item, score);
        }
    }
}
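
To see what a thresholded, ranked search computes, it can help to compare against a brute-force scan. The sketch below is a reference baseline only, not the crate's CPMerge implementation. It assumes padded character bigrams as features, an inclusive threshold (score >= alpha), and results sorted by descending score; `bigrams`, `cosine_sim`, and `brute_force_search` are hypothetical names.

```rust
use std::collections::HashSet;

/// Hypothetical helper: padded character bigrams of `text`, collected as a set.
fn bigrams(text: &str, marker: char) -> HashSet<String> {
    let chars: Vec<char> = std::iter::once(marker)
        .chain(text.chars())
        .chain(std::iter::once(marker))
        .collect();
    chars
        .windows(2)
        .map(|w| w.iter().collect::<String>())
        .collect()
}

/// Cosine coefficient between two feature sets: |X ∩ Y| / sqrt(|X| * |Y|).
fn cosine_sim(x: &HashSet<String>, y: &HashSet<String>) -> f64 {
    if x.is_empty() || y.is_empty() {
        return 0.0;
    }
    let shared = x.intersection(y).count() as f64;
    shared / ((x.len() * y.len()) as f64).sqrt()
}

/// Linear-scan reference search: score every string, keep those with
/// score >= alpha, and sort the survivors by descending score.
fn brute_force_search(corpus: &[&str], query: &str, alpha: f64) -> Vec<(String, f64)> {
    let q = bigrams(query, '$');
    let mut hits: Vec<(String, f64)> = corpus
        .iter()
        .map(|s| (s.to_string(), cosine_sim(&q, &bigrams(s, '$'))))
        .filter(|(_, score)| *score >= alpha)
        .collect();
    hits.sort_by(|a, b| b.1.total_cmp(&a.1));
    hits
}

fn main() {
    let corpus = ["hello", "help", "halo", "world"];
    for (item, score) in brute_force_search(&corpus, "hell", 0.5) {
        println!("- Match: '{}', Score: {:.4}", item, score);
    }
}
```

Under these assumptions the scan keeps "hello" and "help" and drops "halo" and "world", whose scores fall below 0.5. The scan touches every indexed string on every query, which is the cost an indexed approach such as CPMerge is designed to avoid on large corpora.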

Contributing

Contributions are welcome! Please open an issue or submit a pull request on GitHub.

License

This project is licensed under the MIT License.

Benchmarks

The benches/run_benches.py harness compares several language bindings (Rust, Python, Julia, Ruby, C++).

Prerequisites:

  • git, autoconf, automake, libtool, make, python, uv, and a C++ compiler (g++) to build the C++ CLI.

The C++ sources are cloned into benches/.simstring_cpp/ and a local copy of the simstring binary is installed under that directory. If you need to rebuild from scratch, remove benches/.simstring_cpp/ before re-running the benchmark suite.

Acknowledgements

Inspired by the SimString project.
