Corpus Matcher

Fast substring matching in large text corpora using optimized Levenshtein distance algorithms.

This library provides efficient fuzzy string matching capabilities, particularly useful for finding the best matching substring within large text corpora. It implements both a quick heuristic approach and a thorough search algorithm to balance speed and accuracy.

Features

Dual Algorithm Approach: Quick heuristic matching for speed, with fallback to thorough search for accuracy
Parallel Processing: Leverages joblib for multi-threaded processing
Caching: Built-in result caching using joblib Memory
Case Sensitivity Control: Optional case-sensitive or case-insensitive matching
Configurable Parameters: Adjustable step factors and search granularity
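The parallel-processing and caching ideas in the list above can be illustrated with a small standalone sketch. It does not use the library's code or joblib: `ThreadPoolExecutor` and `functools.lru_cache` from the standard library stand in for joblib's parallel and `Memory` features, `difflib.SequenceMatcher` stands in for a Levenshtein ratio, and the names `split_chunks` and `best_chunk` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher
from functools import lru_cache
from typing import List, Tuple


def split_chunks(corpus: str, size: int) -> List[str]:
    """Cut the corpus into fixed-size pieces (hypothetical helper)."""
    return [corpus[i:i + size] for i in range(0, len(corpus), size)] or [""]


@lru_cache(maxsize=256)  # stand-in for on-disk result caching; keyed on the arguments
def best_chunk(query: str, corpus: str, size: int = 40, workers: int = 4) -> Tuple[str, float]:
    """Score every chunk against the query concurrently and return the best one."""
    chunks = split_chunks(corpus, size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so scores[i] belongs to chunks[i]
        scores = list(pool.map(lambda c: SequenceMatcher(None, query, c).ratio(), chunks))
    index = max(range(len(chunks)), key=scores.__getitem__)
    return chunks[index], scores[index]
```

Calling `best_chunk` twice with the same arguments returns the second result from the cache without rescoring, which is the behaviour result caching is meant to provide for repeated queries.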

Installation

pip install corpus-matcher

Quick Start

from corpus_matcher import find_best_substring_match

# Basic usage
query = "machine learning algorithms"
corpus = "This document discusses various machine learning algorithms and their applications in data science."

result = find_best_substring_match(query, corpus)

print(f"Best matches: {result.matches}")
print(f"Similarity ratio: {result.ratio}")
print(f"Distance: {result.distance}")
print(f"Quick match used: {result.quick_match_used}")

Advanced Usage

from corpus_matcher import find_best_substring_match

# Case-insensitive matching with custom parameters
result = find_best_substring_match(
    query="PYTHON programming",
    corpus="Learn python programming from basics to advanced concepts",
    case_sensitive=False,
    step_factor=300,  # Below the default of 500; higher values search more thoroughly
    n_jobs=4  # Use 4 parallel jobs
)

print(f"Matches: {result.matches}")
print(f"Ratio: {result.ratio:.3f}")

API Reference

`find_best_substring_match(query, corpus, case_sensitive=True, step_factor=500, n_jobs=-1)`

Find the best matching substring(s) in a corpus for a given query.

Parameters:

query (str): The text to search for
corpus (str): The text to search within
case_sensitive (bool, optional): Whether matching should be case-sensitive. Default: True
step_factor (int, optional): Controls search resolution. Higher values give a more thorough search. Default: 500
n_jobs (int, optional): Number of parallel jobs (-1 for all available cores). Default: -1

Returns:

MatchResult: Object containing matches, ratio, distance, and algorithm info

`MatchResult`

A dataclass containing the results of the matching operation:

matches (List[str]): List of best matching substrings
ratio (float): Levenshtein similarity ratio (0-100)
distance (float): Normalized Levenshtein distance (0-1)
quick_match_used (bool): Whether the quick algorithm was sufficient
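To make the two score fields concrete, here is a minimal pure-Python sketch. It is not the library's implementation (which depends on rapidfuzz). `MatchResultSketch` and `score` are hypothetical names. The sketch assumes the distance is the edit distance divided by the longer string's length and that the ratio is `(1 - distance) * 100`; the source does not confirm either formula.

```python
from dataclasses import dataclass
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


@dataclass
class MatchResultSketch:
    """Hypothetical stand-in mirroring the documented MatchResult fields."""
    matches: List[str]
    ratio: float            # 0-100, higher means more similar
    distance: float         # 0-1, lower means more similar
    quick_match_used: bool


def score(query: str, candidate: str) -> MatchResultSketch:
    """Assumed scoring: normalize by the longer length, then derive the ratio."""
    longest = max(len(query), len(candidate)) or 1  # avoid dividing by zero
    distance = levenshtein(query, candidate) / longest
    return MatchResultSketch([candidate], (1 - distance) * 100, distance, False)
```

Under these assumptions an exact match gives a distance of 0 and a ratio of 100, and the two fields always carry the same information on different scales.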

Algorithm Details

The library uses a two-stage approach:

Quick Match: Identifies potential regions using word-based heuristics, then performs localized search
Thorough Search: Falls back to comprehensive n-gram analysis if quick match fails

The quick stage keeps most searches fast, and the fallback to the thorough stage preserves accuracy when the heuristic does not find a good match.
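The two-stage idea can be sketched in plain Python as follows. This is an illustration only, not the library's algorithm: the word heuristic here (finding the query's first and last words verbatim), the sliding window that replaces n-gram analysis, and the `margin` and `max_quick_distance` parameters are all assumptions made for the example.

```python
from typing import Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance, kept local so the sketch is self-contained."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def best_window(query: str, corpus: str, start: int, stop: int) -> Tuple[int, str]:
    """Slide a query-length window over corpus[start:stop]; return (distance, text)."""
    width = len(query)
    last = max(start, min(stop, len(corpus)) - width)
    best = None
    for pos in range(start, last + 1):
        window = corpus[pos:pos + width]
        d = levenshtein(query, window)
        if best is None or d < best[0]:
            best = (d, window)
    return best


def quick_match(query: str, corpus: str, margin: int = 10) -> Optional[Tuple[int, str]]:
    """Stage 1: find a region bounded by the query's first and last words."""
    words = query.split()
    if not words:
        return None
    first = corpus.find(words[0])
    last = corpus.find(words[-1], first) if first != -1 else -1
    if last == -1:
        return None  # heuristic found no anchor words; caller falls back
    start = max(0, first - margin)
    stop = min(len(corpus), last + len(words[-1]) + margin)
    return best_window(query, corpus, start, stop)


def find_match_sketch(query: str, corpus: str, max_quick_distance: int = 2):
    """Return (distance, text, quick_match_used), trying the cheap stage first."""
    quick = quick_match(query, corpus)
    if quick is not None and quick[0] <= max_quick_distance:
        return quick[0], quick[1], True
    distance, text = best_window(query, corpus, 0, len(corpus))  # Stage 2
    return distance, text, False
```

The quick stage only scores windows near the anchor words, so its cost depends on the query length rather than the corpus length. The fallback scans every position and is the expensive path.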

Requirements

Python ≥ 3.8
joblib
rapidfuzz

Development

This project was developed with assistance from aider.chat.

License

GPL-v3

Version 1.0.0, released Jun 23, 2025

