Gap-tolerant fuzzy phrase matching in documents with Jaro-Winkler similarity
Project description
Fuzzy Finder
Features
- Fuzzy Matching: Find phrases even with typos, OCR errors, or missing words
- Gap Tolerance: Allows gaps between matched tokens (configurable)
- Batch Search: Search multiple phrases at once with position-matched results
- Score Threshold: Filter out low-quality matches automatically
- Exact Offsets: Returns precise character positions in original text
- Multi-factor Scoring: Combines similarity, coverage, sequence order, gaps, and transpositions
- Zero Dependencies: Pure Python, uses only standard library
Installation
pip install fuzzy-finder
Or copy the fuzzy_finder/ directory into your project.
Quick Start
from fuzzy_finder import FuzzyFinder
# Initialize with document text
document = """
Владимир Туров рассказывает о налоговой оптимизации
для предпринимателей. В этом посте разбираем вычеты НДС.
"""
finder = FuzzyFinder(document)
# Find exact phrase
result = finder.find("налоговой оптимизации")
print(f"Found: {result.found}") # True
print(f"Position: [{result.start_offset}:{result.end_offset}]")
# Find phrase with typo
result = finder.find("налоговай оптимизацыи") # typos!
print(f"Found: {result.found}") # True (fuzzy match)
# Get matched text
if result.found:
    matched = document[result.start_offset:result.end_offset]
    print(f"Matched: '{matched}'")
Batch Search API
Search multiple phrases at once. Results array positions match input positions.
from fuzzy_finder import FuzzyFinder
finder = FuzzyFinder(document)
# Search multiple phrases
phrases = ["налоговой оптимизации", "вычеты НДС", "unknown phrase"]
results = finder.search(
    phrases=phrases,
    options={
        "score_threshold": 0.4,  # Filter matches below this score
        "find_all": True,        # Find all occurrences (not just first)
        "coefficients": {        # Optional: custom scoring weights
            "w_similarity": 1.0,
            "w_coverage": 0.3,
            "w_gap_penalty": -0.01,
        },
    },
)
# Results structure:
# - len(results) == len(phrases) # Always same length!
# - results[i] corresponds to phrases[i]
# - Each element is:
# - None: not found or below threshold
# - SearchResult: single match
# - List[SearchResult]: multiple matches (when find_all=True)
for i, (phrase, res) in enumerate(zip(phrases, results)):
    if res is None:
        print(f"[{i}] '{phrase}': NOT FOUND")
    elif isinstance(res, list):
        print(f"[{i}] '{phrase}': {len(res)} matches")
        for match in res:
            print(f"  [{match.start_offset}:{match.end_offset}] score={match.score:.3f}")
    else:
        print(f"[{i}] '{phrase}': found at [{res.start_offset}:{res.end_offset}]")
SearchResult Fields
@dataclass
class SearchResult:
    found: bool         # True if matched
    start_offset: int   # First char position in raw text
    end_offset: int     # Last char position (exclusive)
    score: float        # Normalized score 0.0-1.0
    raw_score: float    # Raw score before normalization
    matched_text: str   # Actual matched text from document
    debug_info: dict    # Detailed scoring breakdown
SearchOptions
options = {
    "score_threshold": 0.0,  # Min score (0.0-1.0), below = None
    "find_all": True,        # Find all occurrences or first only
    "min_offset": 0,         # Start position in document
    "direction": "forward",  # "forward" or "backward"
    "coefficients": {...},   # Custom ScoringCoefficients (optional)
}
API Reference
FuzzyFinder (High-level)
finder = FuzzyFinder(text)
# Batch search (recommended for multiple phrases)
results = finder.search(phrases, options)
# Find first occurrence
result = finder.find(phrase, min_offset=0, direction="forward")
# Find all occurrences
results = finder.find_all(phrase)
# Calculate string similarity
score = finder.similarity("word1", "word2") # 0.0-1.0
Low-level Functions
from fuzzy_finder import (
    tokenize_text,            # Tokenize document into word list with positions
    tokenize_marker,          # Extract normalized tokens from phrase
    find_end_marker,          # Find phrase, pick match nearest AFTER reference
    find_start_marker,        # Find phrase, pick match nearest BEFORE reference
    jaro_winkler_similarity,  # String similarity (0.0-1.0)
)
# Tokenize once, search many times
word_list = tokenize_text(document)
result = find_end_marker(
    marker="искомая фраза",
    text=document,
    word_list=word_list,
    min_char_offset=0,
)
MarkerMatch Result
@dataclass
class MarkerMatch:
    found: bool         # Whether a match was found
    marker: str         # Original search phrase
    start_offset: int   # First char position in raw text
    end_offset: int     # Last char position (exclusive)
    debug_info: dict    # Scoring details, gaps, etc.
Matching Algorithm
- Tokenize phrase into normalized words/numbers
- Find all positions of first token (exact + fuzzy)
- Extend each candidate by matching subsequent tokens with gap constraints
- Score candidates using weighted formula
- Disambiguate by proximity to reference position
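The steps above can be condensed into a runnable sketch. This is an illustration of the pipeline, not the library's code: difflib.SequenceMatcher stands in for Jaro-Winkler as the token similarity, and skips, transpositions, and reference-based disambiguation are omitted for brevity (find_phrase, token_sim, and their parameters are names invented for this sketch):

```python
import re
from difflib import SequenceMatcher


def token_sim(a: str, b: str) -> float:
    # Stand-in for the library's Jaro-Winkler token similarity.
    return SequenceMatcher(None, a, b).ratio()


def find_phrase(phrase: str, text: str, max_gap: int = 50, min_sim: float = 0.8):
    """Return (start_offset, end_offset, score) of the best match, or None."""
    # Step 1: tokenize text and phrase into normalized words with offsets.
    words = [(m.group().lower(), m.start()) for m in re.finditer(r"\w+", text)]
    tokens = phrase.lower().split()
    # Step 2: every word resembling the first token starts a candidate.
    candidates = [i for i, (w, _) in enumerate(words)
                  if token_sim(w, tokens[0]) >= min_sim]
    best = None
    for start in candidates:
        # Step 3: extend by matching the remaining tokens in order, allowing
        # at most max_gap chars between starts of consecutive matched words.
        scores = [token_sim(words[start][0], tokens[0])]
        pos, ok = start, True
        for tok in tokens[1:]:
            nxt = None
            for j in range(pos + 1, len(words)):
                if words[j][1] - words[pos][1] > max_gap:
                    break  # gap constraint violated, stop extending
                if token_sim(words[j][0], tok) >= min_sim:
                    nxt = j
                    break
            if nxt is None:
                ok = False
                break
            scores.append(token_sim(words[nxt][0], tok))
            pos = nxt
        if not ok:
            continue
        # Step 4: score the candidate (here simply the mean token
        # similarity; the real formula is a larger weighted sum).
        score = sum(scores) / len(scores)
        # Step 5: keep the best candidate (the library additionally
        # disambiguates by proximity to a reference position).
        if best is None or score > best[2]:
            best = (words[start][1], words[pos][1] + len(words[pos][0]), score)
    return best
```

For example, find_phrase("quick fox", "the quick brown fox jumps") matches the span "quick brown fox", tolerating the unmatched word inside the gap.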
Scoring Formula
score = W_SIMILARITY × Σ(token_scores) +
        W_COVERAGE × tokens_found +
        W_SEQUENTIAL × sequential_bonus +
        W_GAP_PENALTY × avg_gap +
        W_SKIP_PENALTY × skips +
        W_TRANSPOSITION × transpositions
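In Python the formula is a plain weighted sum. In the sketch below, the w_similarity, w_coverage, and w_gap_penalty defaults are taken from the batch-search example above; the other three weights are illustrative placeholders, not the library's actual defaults:

```python
def raw_score(token_scores, tokens_found, sequential_bonus,
              avg_gap, skips, transpositions,
              w_similarity=1.0, w_coverage=0.3, w_sequential=0.2,
              w_gap_penalty=-0.01, w_skip_penalty=-0.1, w_transposition=-0.05):
    # Penalty weights are negative, so gaps, skips, and transpositions
    # subtract from the score while similarity and coverage add to it.
    return (w_similarity * sum(token_scores)
            + w_coverage * tokens_found
            + w_sequential * sequential_bonus
            + w_gap_penalty * avg_gap
            + w_skip_penalty * skips
            + w_transposition * transpositions)
```

Two perfectly matched tokens in order with an average gap of 5 chars and no skips score 1.0·2 + 0.3·2 + 0.2·1 − 0.01·5 = 2.75 before normalization.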
Configuration
from fuzzy_finder import MatchConfig
# Default values
config = MatchConfig(
    max_single_gap=50,        # Max chars between consecutive tokens
    max_total_skips=2,        # Max tokens that can be skipped
    max_consecutive_skips=1,  # Max tokens skipped in a row
    min_tokens_required=3,    # Minimum tokens that must match
)
Use Cases
- Document Search: Find sections in long documents
- OCR Post-processing: Match text despite recognition errors
- Plagiarism Detection: Find similar passages
- Data Extraction: Locate markers in structured documents
- Log Analysis: Find patterns with variations
License
MIT
Download files
Source Distribution
vibe_finder-1.0.0.tar.gz (23.4 kB)
Built Distribution
vibe_finder-1.0.0-py3-none-any.whl (20.9 kB)
File details
Details for the file vibe_finder-1.0.0.tar.gz.
File metadata
- Download URL: vibe_finder-1.0.0.tar.gz
- Upload date:
- Size: 23.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 74889e65a2709f142e9341dfdd7434a31cdfb23aaca0f44e23b8b73ab65e960a |
| MD5 | c0c149f9d876b3226efee61a9dd5692d |
| BLAKE2b-256 | 93ad785b596a6d9934198f9270491cf41fef8d9c6869413a0462dcd7a1b6a9da |
File details
Details for the file vibe_finder-1.0.0-py3-none-any.whl.
File metadata
- Download URL: vibe_finder-1.0.0-py3-none-any.whl
- Upload date:
- Size: 20.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.10.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8eba87abe2473091189158fa27febcfd88723c958f6a613da5fce21d93d2e7f3 |
| MD5 | 1d65200130bb113c10c00a96f2211393 |
| BLAKE2b-256 | f9849c8d484fd4365f77ce19db9d817b9dca2e65cbbb800cdb6e657398743e6a |