Fast substring matching in large text corpora using optimized Levenshtein distance
Corpus Matcher
This library provides efficient fuzzy string matching capabilities, particularly useful for finding the best matching substring within large text corpora. It implements both a quick heuristic approach and a thorough search algorithm to balance speed and accuracy.
Features
- Dual Algorithm Approach: Quick heuristic matching for speed, with fallback to thorough search for accuracy
- Parallel Processing: Leverages joblib for multi-threaded processing
- Caching: Built-in result caching using joblib Memory
- Case Sensitivity Control: Optional case-sensitive or case-insensitive matching
- Configurable Parameters: Adjustable step factors and search granularity
Installation
pip install corpus-matcher
Quick Start
from corpus_matcher import find_best_substring_match
# Basic usage
query = "machine learning algorithms"
corpus = "This document discusses various machine learning algorithms and their applications in data science."
result = find_best_substring_match(query, corpus)
print(f"Best matches: {result.matches}")
print(f"Similarity ratio: {result.ratio}")
print(f"Distance: {result.distance}")
print(f"Quick match used: {result.quick_match_used}")
Advanced Usage
from corpus_matcher import find_best_substring_match
# Case-insensitive matching with custom parameters
result = find_best_substring_match(
    query="PYTHON programming",
    corpus="Learn python programming from basics to advanced concepts",
    case_sensitive=False,
    step_factor=300,  # Below the default of 500; higher values search more thoroughly
    n_jobs=4,  # Use 4 parallel jobs
)
print(f"Matches: {result.matches}")
print(f"Ratio: {result.ratio:.3f}")
API Reference
find_best_substring_match(query, corpus, case_sensitive=True, step_factor=500, n_jobs=-1)
Find the best matching substring(s) in a corpus for a given query.
Parameters:
- query (str): The text to search for
- corpus (str): The text to search within
- case_sensitive (bool, optional): Whether matching should be case-sensitive. Default: True
- step_factor (int, optional): Controls search resolution. Higher values = more thorough search. Default: 500
- n_jobs (int, optional): Number of parallel jobs (-1 for all available cores). Default: -1
Returns:
MatchResult: Object containing matches, ratio, distance, and algorithm info
MatchResult
A dataclass containing the results of the matching operation:
- matches (List[str]): List of best matching substrings
- ratio (float): Levenshtein similarity ratio (0-100)
- distance (float): Normalized Levenshtein distance (0-1)
- quick_match_used (bool): Whether the quick algorithm was sufficient
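To make the two score fields concrete, here is a minimal pure-Python sketch of how a normalized Levenshtein distance (0-1) and a similarity ratio (0-100) can relate. It is not the library's code: `MatchResultSketch`, `levenshtein`, `normalized_distance`, and `score` are hypothetical names, and the formula `ratio = 100 * (1 - distance)` is an assumption. The actual package relies on rapidfuzz and may compute its ratio differently.

```python
from dataclasses import dataclass
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the longer length, giving a value in [0, 1]."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


@dataclass
class MatchResultSketch:
    """Hypothetical stand-in mirroring the documented MatchResult fields."""
    matches: List[str]
    ratio: float            # 0-100, higher is more similar
    distance: float         # 0-1, lower is more similar
    quick_match_used: bool


def score(query: str, candidate: str) -> MatchResultSketch:
    """Score one candidate; the ratio formula here is an assumption."""
    d = normalized_distance(query, candidate)
    return MatchResultSketch([candidate], 100.0 * (1.0 - d), d, False)
```

Under these assumptions, an exact match has distance 0.0 and ratio 100.0, and the two fields always move in opposite directions.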
Algorithm Details
The library uses a two-stage approach:
- Quick Match: Identifies potential regions using word-based heuristics, then performs localized search
- Thorough Search: Falls back to comprehensive n-gram analysis if quick match fails
The quick stage handles the common case cheaply, and the thorough stage runs only when the quick match fails, so accuracy is preserved at the cost of extra time on harder inputs.
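The two stages can be sketched in plain Python as follows. This is an illustration of the general idea only, not the library's implementation: `two_stage_match`, `best_window`, `edit_distance`, and the `threshold` parameter are all hypothetical, the quick stage simply looks for an exact occurrence of any query word, and a fixed-width sliding window stands in for the n-gram analysis described above.

```python
from typing import Tuple


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a rolling dynamic-programming row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def best_window(query: str, text: str, start: int, stop: int) -> Tuple[float, str]:
    """Slide a len(query)-wide window over text[start:stop].

    Returns (normalized distance, substring) for the closest window.
    """
    width = len(query)
    stop = min(stop, len(text))
    last = max(start, stop - width)
    best = (float("inf"), "")
    for i in range(start, last + 1):
        window = text[i:i + width]
        d = edit_distance(query, window) / max(width, len(window), 1)
        if d < best[0]:
            best = (d, window)
    return best


def two_stage_match(query: str, corpus: str,
                    threshold: float = 0.2) -> Tuple[str, float, bool]:
    """Return (substring, normalized distance, quick_match_used)."""
    # Stage 1 (quick): find a region containing any query word verbatim,
    # then search only the neighbourhood around it.
    for word in query.split():
        pos = corpus.find(word)
        if pos != -1:
            start = max(0, pos - len(query))
            stop = min(len(corpus), pos + 2 * len(query))
            d, sub = best_window(query, corpus, start, stop)
            if d <= threshold:
                return sub, d, True
    # Stage 2 (thorough): scan the whole corpus.
    d, sub = best_window(query, corpus, 0, len(corpus))
    return sub, d, False
```

In this sketch the quick stage touches only a few dozen windows when a query word appears verbatim in the corpus, while the fallback scans every position. The real library adds parallelism, caching, and a configurable search resolution on top of this kind of structure.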
Requirements
- Python ≥ 3.8
- joblib
- rapidfuzz
Development
This project was developed with assistance from aider.chat.
License
GPL-v3
File details
Details for the file corpus_matcher-1.0.0.tar.gz.
File metadata
- Download URL: corpus_matcher-1.0.0.tar.gz
- Upload date:
- Size: 23.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c2b18364e2399dba014a2769d4ab0d2f93ed39f1bd99b677f7383bb32c8ed6c4 |
| MD5 | 54ee4e1bfc08debb8a6b1900c0c2983d |
| BLAKE2b-256 | 6e4263204a0a9306df8700d1748960619ed7b3bc672bda4aef13a2c30895ec5a |
File details
Details for the file corpus_matcher-1.0.0-py3-none-any.whl.
File metadata
- Download URL: corpus_matcher-1.0.0-py3-none-any.whl
- Upload date:
- Size: 21.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.11.11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d0ec9df0ef15d0cd6dc711f97cb90e9517b40b12d9bb440d11b7f30461a0c4a6 |
| MD5 | b20ece1b230bf4b5113c53f819061fa2 |
| BLAKE2b-256 | 0103197fa2e1254365f1b65eca0fc872e9846b89706c06b1b7bc6f19330049d5 |