Gap-tolerant fuzzy phrase matching in documents with Jaro-Winkler similarity
Project description
Fuzzy Finder
Features
- Fuzzy Matching: Find phrases even with typos, OCR errors, or missing words
- Gap Tolerance: Allows gaps between matched tokens (configurable)
- Batch Search: Search multiple phrases at once with position-matched results
- Score Threshold: Filter out low-quality matches automatically
- Exact Offsets: Returns precise character positions in original text
- Multi-factor Scoring: Combines similarity, coverage, sequence order, gaps, and transpositions
- Zero Dependencies: Pure Python, uses only standard library
Installation
pip install fuzzy-finder
Or copy the fuzzy_finder/ directory into your project.
Quick Start
from fuzzy_finder import FuzzyFinder
# Initialize with document text
document = """
Владимир Туров рассказывает о налоговой оптимизации
для предпринимателей. В этом посте разбираем вычеты НДС.
"""
finder = FuzzyFinder(document)
# Find exact phrase
result = finder.find("налоговой оптимизации")
print(f"Found: {result.found}") # True
print(f"Position: [{result.start_offset}:{result.end_offset}]")
# Find phrase with typo
result = finder.find("налоговай оптимизацыи") # typos!
print(f"Found: {result.found}") # True (fuzzy match)
# Get matched text
if result.found:
    matched = document[result.start_offset:result.end_offset]
    print(f"Matched: '{matched}'")
Batch Search API
Search multiple phrases at once. Results array positions match input positions.
from fuzzy_finder import FuzzyFinder
finder = FuzzyFinder(document)
# Search multiple phrases
phrases = ["налоговой оптимизации", "вычеты НДС", "unknown phrase"]
results = finder.search(
    phrases=phrases,
    options={
        "score_threshold": 0.4,  # Filter matches below this score
        "find_all": True,        # Find all occurrences (not just first)
        "coefficients": {        # Optional: custom scoring weights
            "w_similarity": 1.0,
            "w_coverage": 0.3,
            "w_gap_penalty": -0.01,
        },
    },
)
# Results structure:
# - len(results) == len(phrases) # Always same length!
# - results[i] corresponds to phrases[i]
# - Each element is:
# - None: not found or below threshold
# - SearchResult: single match
# - List[SearchResult]: multiple matches (when find_all=True)
for i, (phrase, res) in enumerate(zip(phrases, results)):
    if res is None:
        print(f"[{i}] '{phrase}': NOT FOUND")
    elif isinstance(res, list):
        print(f"[{i}] '{phrase}': {len(res)} matches")
        for match in res:
            print(f"  [{match.start_offset}:{match.end_offset}] score={match.score:.3f}")
    else:
        print(f"[{i}] '{phrase}': found at [{res.start_offset}:{res.end_offset}]")
SearchResult Fields
@dataclass
class SearchResult:
    found: bool         # True if matched
    start_offset: int   # First char position in raw text
    end_offset: int     # Last char position (exclusive)
    score: float        # Normalized score 0.0-1.0
    raw_score: float    # Raw score before normalization
    matched_text: str   # Actual matched text from document
    debug_info: dict    # Detailed scoring breakdown
SearchOptions
options = {
    "score_threshold": 0.0,  # Min score (0.0-1.0), below = None
    "find_all": True,        # Find all occurrences or first only
    "min_offset": 0,         # Start position in document
    "direction": "forward",  # "forward" or "backward"
    "coefficients": {...},   # Custom ScoringCoefficients (optional)
}
API Reference
FuzzyFinder (High-level)
finder = FuzzyFinder(text)
# Batch search (recommended for multiple phrases)
results = finder.search(phrases, options)
# Find first occurrence
result = finder.find(phrase, min_offset=0, direction="forward")
# Find all occurrences
results = finder.find_all(phrase)
# Calculate string similarity
score = finder.similarity("word1", "word2") # 0.0-1.0
Low-level Functions
from fuzzy_finder import (
    tokenize_text,            # Tokenize document into word list with positions
    tokenize_marker,          # Extract normalized tokens from phrase
    find_end_marker,          # Find phrase, pick match nearest AFTER reference
    find_start_marker,        # Find phrase, pick match nearest BEFORE reference
    jaro_winkler_similarity,  # String similarity (0.0-1.0)
)
# Tokenize once, search many times
word_list = tokenize_text(document)
result = find_end_marker(
    marker="искомая фраза",
    text=document,
    word_list=word_list,
    min_char_offset=0,
)
MarkerMatch Result
@dataclass
class MarkerMatch:
    found: bool         # Whether a match was found
    marker: str         # Original search phrase
    start_offset: int   # First char position in raw text
    end_offset: int     # Last char position (exclusive)
    debug_info: dict    # Scoring details, gaps, etc.
Matching Algorithm
- Tokenize phrase into normalized words/numbers
- Find all positions of first token (exact + fuzzy)
- Extend each candidate by matching subsequent tokens with gap constraints
- Score candidates using weighted formula
- Disambiguate by proximity to reference position
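The steps above can be condensed into a runnable sketch. This is an illustration of the pipeline, not the library's code: difflib.SequenceMatcher stands in for Jaro-Winkler as the token similarity, and skips, transpositions, and reference-based disambiguation are omitted for brevity (find_phrase, token_sim, and their parameters are names invented for this sketch):

```python
import re
from difflib import SequenceMatcher


def token_sim(a: str, b: str) -> float:
    # Stand-in for the library's Jaro-Winkler token similarity.
    return SequenceMatcher(None, a, b).ratio()


def find_phrase(phrase: str, text: str, max_gap: int = 50, min_sim: float = 0.8):
    """Return (start_offset, end_offset, score) of the best match, or None."""
    # Step 1: tokenize text and phrase into normalized words with offsets.
    words = [(m.group().lower(), m.start()) for m in re.finditer(r"\w+", text)]
    tokens = phrase.lower().split()
    # Step 2: every word resembling the first token starts a candidate.
    candidates = [i for i, (w, _) in enumerate(words)
                  if token_sim(w, tokens[0]) >= min_sim]
    best = None
    for start in candidates:
        # Step 3: extend by matching the remaining tokens in order, allowing
        # at most max_gap chars between starts of consecutive matched words.
        scores = [token_sim(words[start][0], tokens[0])]
        pos, ok = start, True
        for tok in tokens[1:]:
            nxt = None
            for j in range(pos + 1, len(words)):
                if words[j][1] - words[pos][1] > max_gap:
                    break  # gap constraint violated, stop extending
                if token_sim(words[j][0], tok) >= min_sim:
                    nxt = j
                    break
            if nxt is None:
                ok = False
                break
            scores.append(token_sim(words[nxt][0], tok))
            pos = nxt
        if not ok:
            continue
        # Step 4: score the candidate (here simply the mean token
        # similarity; the real formula is a larger weighted sum).
        score = sum(scores) / len(scores)
        # Step 5: keep the best candidate (the library additionally
        # disambiguates by proximity to a reference position).
        if best is None or score > best[2]:
            best = (words[start][1], words[pos][1] + len(words[pos][0]), score)
    return best
```

For example, find_phrase("quick fox", "the quick brown fox jumps") matches the span "quick brown fox", tolerating the unmatched word inside the gap.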
Scoring Formula
score = W_SIMILARITY × Σ(token_scores) +
        W_COVERAGE × tokens_found +
        W_SEQUENTIAL × sequential_bonus +
        W_GAP_PENALTY × avg_gap +
        W_SKIP_PENALTY × skips +
        W_TRANSPOSITION × transpositions
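In Python the formula is a plain weighted sum. In the sketch below, the w_similarity, w_coverage, and w_gap_penalty defaults are taken from the batch-search example above; the other three weights are illustrative placeholders, not the library's actual defaults:

```python
def raw_score(token_scores, tokens_found, sequential_bonus,
              avg_gap, skips, transpositions,
              w_similarity=1.0, w_coverage=0.3, w_sequential=0.2,
              w_gap_penalty=-0.01, w_skip_penalty=-0.1, w_transposition=-0.05):
    # Penalty weights are negative, so gaps, skips, and transpositions
    # subtract from the score while similarity and coverage add to it.
    return (w_similarity * sum(token_scores)
            + w_coverage * tokens_found
            + w_sequential * sequential_bonus
            + w_gap_penalty * avg_gap
            + w_skip_penalty * skips
            + w_transposition * transpositions)
```

Two perfectly matched tokens in order with an average gap of 5 chars and no skips score 1.0·2 + 0.3·2 + 0.2·1 − 0.01·5 = 2.75 before normalization.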
Configuration
from fuzzy_finder import MatchConfig
# Default values
config = MatchConfig(
    max_single_gap=50,        # Max chars between consecutive tokens
    max_total_skips=2,        # Max tokens that can be skipped
    max_consecutive_skips=1,  # Max tokens skipped in a row
    min_tokens_required=3,    # Minimum tokens that must match
)
Use Cases
- Document Search: Find sections in long documents
- OCR Post-processing: Match text despite recognition errors
- Plagiarism Detection: Find similar passages
- Data Extraction: Locate markers in structured documents
- Log Analysis: Find patterns with variations
License
MIT
Download files
Source Distribution
vibe_finder-1.0.0.tar.gz (23.4 kB)
Built Distribution
vibe_finder-1.0.0-py3-none-any.whl (20.9 kB)
File details
Details for the file vibe_finder-1.0.0.tar.gz.
File metadata
- Download URL: vibe_finder-1.0.0.tar.gz
- Upload date:
- Size: 23.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 74889e65a2709f142e9341dfdd7434a31cdfb23aaca0f44e23b8b73ab65e960a |
| MD5 | c0c149f9d876b3226efee61a9dd5692d |
| BLAKE2b-256 | 93ad785b596a6d9934198f9270491cf41fef8d9c6869413a0462dcd7a1b6a9da |
File details
Details for the file vibe_finder-1.0.0-py3-none-any.whl.
File metadata
- Download URL: vibe_finder-1.0.0-py3-none-any.whl
- Upload date:
- Size: 20.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8eba87abe2473091189158fa27febcfd88723c958f6a613da5fce21d93d2e7f3 |
| MD5 | 1d65200130bb113c10c00a96f2211393 |
| BLAKE2b-256 | f9849c8d484fd4365f77ce19db9d817b9dca2e65cbbb800cdb6e657398743e6a |