Fast substring matching in large text corpora using optimized Levenshtein distance

Corpus Matcher

Fast substring matching in large text corpora using optimized Levenshtein distance algorithms.

This library provides fuzzy string matching for finding the best-matching substring within a large text corpus. It combines a quick heuristic with a thorough search algorithm to balance speed and accuracy.

Features

  • Dual Algorithm Approach: Quick heuristic matching for speed, with fallback to thorough search for accuracy
  • Parallel Processing: Uses joblib to spread the search across parallel workers
  • Caching: Built-in result caching using joblib Memory
  • Case Sensitivity Control: Optional case-sensitive or case-insensitive matching
  • Configurable Parameters: Adjustable step factors and search granularity

Installation

pip install corpus-matcher

Quick Start

from corpus_matcher import find_best_substring_match

# Basic usage
query = "machine learning algorithms"
corpus = "This document discusses various machine learning algorithms and their applications in data science."

result = find_best_substring_match(query, corpus)

print(f"Best matches: {result.matches}")
print(f"Similarity ratio: {result.ratio}")
print(f"Distance: {result.distance}")
print(f"Quick match used: {result.quick_match_used}")

Advanced Usage

from corpus_matcher import find_best_substring_match

# Case-insensitive matching with custom parameters
result = find_best_substring_match(
    query="PYTHON programming",
    corpus="Learn python programming from basics to advanced concepts",
    case_sensitive=False,
    step_factor=300,  # Below the default of 500; higher values search more thoroughly
    n_jobs=4  # Use 4 parallel jobs
)

print(f"Matches: {result.matches}")
print(f"Ratio: {result.ratio:.3f}")

API Reference

find_best_substring_match(query, corpus, case_sensitive=True, step_factor=500, n_jobs=-1)

Find the best matching substring(s) in a corpus for a given query.

Parameters:

  • query (str): The text to search for
  • corpus (str): The text to search within
  • case_sensitive (bool, optional): Whether matching should be case-sensitive. Default: True
  • step_factor (int, optional): Controls search resolution. Higher values = more thorough search. Default: 500
  • n_jobs (int, optional): Number of parallel jobs (-1 for all available cores). Default: -1

Returns:

  • MatchResult: Object containing matches, ratio, distance, and algorithm info
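
The `n_jobs` value follows the convention joblib uses, where negative numbers count back from the number of available cores. The sketch below illustrates that convention with the standard library only; `resolve_n_jobs` is a hypothetical helper, not part of corpus-matcher, and joblib's own core detection may report a different count than `os.cpu_count()`.

```python
import os


def resolve_n_jobs(n_jobs: int) -> int:
    """Translate a joblib-style n_jobs value into a concrete worker count."""
    if n_jobs == 0:
        raise ValueError("n_jobs == 0 has no meaning")
    cpu_count = os.cpu_count() or 1  # os.cpu_count() may return None
    if n_jobs < 0:
        # joblib convention: -1 -> all cores, -2 -> all cores but one, and so on
        return max(cpu_count + 1 + n_jobs, 1)
    return n_jobs
```

With this reading, the default `n_jobs=-1` uses every core, while a small positive value such as `n_jobs=4` caps the worker count.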

MatchResult

A dataclass containing the results of the matching operation:

  • matches (List[str]): List of best matching substrings
  • ratio (float): Levenshtein similarity ratio (0-100)
  • distance (float): Normalized Levenshtein distance (0-1)
  • quick_match_used (bool): Whether the quick algorithm was sufficient
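
To make the two scores concrete, here is a minimal pure-Python sketch of how a 0-1 normalized distance and a 0-100 ratio can relate. It is not the library's code: `levenshtein`, `SketchResult`, and `score` are hypothetical names, and both the normalization by the longer string and the relation `ratio = 100 * (1 - distance)` are assumptions that the package may not share.

```python
from dataclasses import dataclass
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


@dataclass
class SketchResult:  # hypothetical stand-in, not the library's MatchResult
    matches: List[str]
    ratio: float      # 0-100, higher is more similar
    distance: float   # 0-1, lower is more similar
    quick_match_used: bool


def score(query: str, candidate: str) -> SketchResult:
    longest = max(len(query), len(candidate)) or 1  # avoid dividing by zero
    distance = levenshtein(query, candidate) / longest
    # Assumption: ratio is the complement of the normalized distance, scaled to 100.
    return SketchResult([candidate], 100.0 * (1.0 - distance), distance, False)
```

Under these assumptions an exact match scores a distance of 0 and a ratio of 100, and the two fields always move in opposite directions.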

Algorithm Details

The library uses a two-stage approach:

  1. Quick Match: Identifies potential regions using word-based heuristics, then performs localized search
  2. Thorough Search: Falls back to comprehensive n-gram analysis if quick match fails

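The quick stage handles most queries cheaply, and the thorough stage preserves accuracy for queries the quick stage cannot resolve. The quick_match_used field on the result reports which stage produced the match.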
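
The two stages above can be sketched in plain Python as follows. This is an illustration of the general idea, not the library's implementation: every function name is hypothetical, the 0.2 acceptance threshold is an arbitrary assumption, and an exhaustive scan over query-length windows stands in for the n-gram analysis, whose details the description does not give.

```python
from typing import Iterable, List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


def scan(query: str, corpus: str, starts: Iterable[int]) -> Tuple[float, List[str]]:
    """Score the query-length window at each offset; keep every window tied for best."""
    best_distance, best = float("inf"), []
    for start in sorted(set(starts)):
        window = corpus[start:start + len(query)]
        d = levenshtein(query, window) / max(len(query), len(window), 1)
        if d < best_distance:
            best_distance, best = d, [window]
        elif d == best_distance and window not in best:
            best.append(window)
    return best_distance, best


def quick_match(query: str, corpus: str,
                threshold: float = 0.2) -> Optional[Tuple[float, List[str]]]:
    """Stage 1: only score windows near verbatim occurrences of a query word."""
    width = len(query)
    starts = set()
    for word in query.split():
        pos = corpus.find(word)
        while pos != -1:
            for start in range(max(0, pos - width), pos + 1):
                if start + width <= len(corpus):
                    starts.add(start)
            pos = corpus.find(word, pos + 1)
    if not starts:
        return None
    best_distance, best = scan(query, corpus, starts)
    # Assumed rule: the quick result counts only if it is close enough.
    return (best_distance, best) if best_distance <= threshold else None


def thorough_search(query: str, corpus: str) -> Tuple[float, List[str]]:
    """Stage 2: score every query-length window in the corpus."""
    last_start = max(len(corpus) - len(query), 0)
    return scan(query, corpus, range(last_start + 1))


def find_best(query: str, corpus: str) -> Tuple[List[str], float, bool]:
    """Return (matches, normalized distance, whether the quick stage sufficed)."""
    quick = quick_match(query, corpus)
    if quick is not None:
        return quick[1], quick[0], True
    distance, matches = thorough_search(query, corpus)
    return matches, distance, False
```

In this sketch a query whose words appear verbatim in the corpus is resolved by the quick stage, while a misspelled query such as "mashine lerning" finds no anchor words and falls through to the exhaustive scan.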

Requirements

  • Python ≥ 3.8
  • joblib
  • rapidfuzz

Development

This project was developed with assistance from aider.chat.

License

GPL-v3
