
Vibe Finder

Gap-tolerant fuzzy phrase matching for OCR documents and AI/LLM-generated text search.

Find phrases even when text is altered, prettified, or contains OCR errors - without agentic multi-iteration approaches.

Use Cases

  • OCR Document Search: Find text in scanned documents where OCR introduced errors, typos, or character substitutions
  • AI/LLM Output Matching: Locate phrases in large texts when AI/LLM altered, rephrased, or "prettified" the original wording
  • Single-Pass Alternative: Replace slow agentic multi-iteration search loops with one fast fuzzy search call
  • Robust Text Extraction: Get exact character offsets even when source text doesn't match query exactly

Features

  • Fuzzy Matching: Find phrases even with typos, OCR errors, or missing words
  • Gap Tolerance: Allows gaps between matched tokens (configurable)
  • Batch Search: Search multiple phrases at once with position-matched results
  • Score Threshold: Filter out low-quality matches automatically
  • Exact Offsets: Returns precise character positions in original text
  • Multi-factor Scoring: Combines similarity, coverage, sequence order, gaps, and transpositions
  • Zero Dependencies: Pure Python, uses only standard library

Installation

pip install vibe-finder

Or copy the vibe_finder/ directory into your project.

Quick Start

from vibe_finder import FuzzyFinder

# Initialize with document text
document = """
Vladimir Turov talks about tax optimization
for entrepreneurs. In this post we cover VAT deductions.
"""

finder = FuzzyFinder(document)

# Find exact phrase
result = finder.find("tax optimization")
print(f"Found: {result.found}")  # True
print(f"Position: [{result.start_offset}:{result.end_offset}]")

# Find phrase with typos
result = finder.find("tex optimizatoin")  # typos!
print(f"Found: {result.found}")  # True (fuzzy match)

# Get matched text
if result.found:
    matched = document[result.start_offset:result.end_offset]
    print(f"Matched: '{matched}'")

Batch Search API

Search multiple phrases at once. The results list is position-aligned with the input: results[i] corresponds to phrases[i].

from vibe_finder import FuzzyFinder

finder = FuzzyFinder(document)

# Search multiple phrases
phrases = ["tax optimization", "VAT deductions", "unknown phrase"]
results = finder.search(
    phrases=phrases,
    options={
        "score_threshold": 0.4,   # Filter matches below this score
        "find_all": True,         # Find all occurrences (not just first)
        "coefficients": {         # Optional: custom scoring weights
            "w_similarity": 1.0,
            "w_coverage": 0.3,
            "w_gap_penalty": -0.01,
        }
    }
)

# Results structure:
# - len(results) == len(phrases)  # Always same length!
# - results[i] corresponds to phrases[i]
# - Each element is:
#   - None: not found or below threshold
#   - SearchResult: single match
#   - List[SearchResult]: multiple matches (when find_all=True)

for i, (phrase, res) in enumerate(zip(phrases, results)):
    if res is None:
        print(f"[{i}] '{phrase}': NOT FOUND")
    elif isinstance(res, list):
        print(f"[{i}] '{phrase}': {len(res)} matches")
        for match in res:
            print(f"     [{match.start_offset}:{match.end_offset}] score={match.score:.3f}")
    else:
        print(f"[{i}] '{phrase}': found at [{res.start_offset}:{res.end_offset}]")

SearchResult Fields

@dataclass
class SearchResult:
    found: bool           # True if matched
    start_offset: int     # First char position in raw text
    end_offset: int       # Last char position (exclusive)
    score: float          # Normalized score 0.0-1.0
    raw_score: float      # Raw score before normalization
    matched_text: str     # Actual matched text from document
    debug_info: dict      # Detailed scoring breakdown

SearchOptions

options = {
    "score_threshold": 0.0,    # Min score (0.0-1.0), below = None
    "find_all": True,          # Find all occurrences or first only
    "min_offset": 0,           # Start position in document
    "direction": "forward",    # "forward" or "backward"
    "coefficients": {...}      # Custom ScoringCoefficients (optional)
}

API Reference

FuzzyFinder (High-level)

finder = FuzzyFinder(text)

# Batch search (recommended for multiple phrases)
results = finder.search(phrases, options)

# Find first occurrence
result = finder.find(phrase, min_offset=0, direction="forward")

# Find all occurrences
results = finder.find_all(phrase)

# Calculate string similarity
score = finder.similarity("word1", "word2")  # 0.0-1.0

Low-level Functions

from vibe_finder import (
    tokenize_text,      # Tokenize document into word list with positions
    tokenize_marker,    # Extract normalized tokens from phrase
    find_end_marker,    # Find phrase, pick match nearest AFTER reference
    find_start_marker,  # Find phrase, pick match nearest BEFORE reference
    jaro_winkler_similarity,  # String similarity (0.0-1.0)
)

# Tokenize once, search many times
word_list = tokenize_text(document)

result = find_end_marker(
    marker="target phrase",
    text=document,
    word_list=word_list,
    min_char_offset=0
)
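
The low-level API exposes jaro_winkler_similarity as the per-token similarity metric. For orientation, the classic Jaro-Winkler measure can be sketched as follows; this is an illustrative, self-contained reimplementation, not the library's actual code:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: count matches inside a sliding window,
    then discount transpositions among the matched characters."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(len1, len2) // 2 - 1
    match1 = [False] * len1
    match2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len2, i + window + 1)
        for j in range(lo, hi):
            if not match2[j] and s2[j] == c:
                match1[i] = match2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order
    t = 0
    k = 0
    for i in range(len1):
        if match1[i]:
            while not match2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a common prefix (max 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)
```

The prefix boost is what makes the metric forgiving of OCR errors in word endings while staying strict about word beginnings, which suits token-level matching.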

MarkerMatch Result

@dataclass
class MarkerMatch:
    found: bool           # Whether match was found
    marker: str           # Original search phrase
    start_offset: int     # First char position in raw text
    end_offset: int       # Last char position in raw text
    debug_info: dict      # Scoring details, gaps, etc.

Matching Algorithm

  1. Tokenize phrase into normalized words/numbers
  2. Find all positions of first token (exact + fuzzy)
  3. Extend each candidate by matching subsequent tokens with gap constraints
  4. Score candidates using weighted formula
  5. Disambiguate by proximity to reference position
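
Steps 1-3 can be illustrated with a simplified sketch that uses exact token matching and a single max_gap limit in place of the library's fuzzy matching and full MatchConfig; all names here are illustrative, not the library API:

```python
import re

def tokenize(text):
    """Step 1: split text into (token, start_offset) pairs."""
    return [(m.group(0).lower(), m.start()) for m in re.finditer(r"\w+", text)]

def find_candidates(phrase, text, max_gap=50):
    """Steps 2-3: seed candidates on the first phrase token, then extend
    each one forward, requiring every next token within max_gap chars."""
    words = tokenize(text)
    tokens = [t for t, _ in tokenize(phrase)]
    if not tokens:
        return []
    candidates = []
    for seed, (word, pos) in enumerate(words):
        if word != tokens[0]:
            continue
        end = pos + len(word)
        matched = 1
        i = seed + 1
        for target in tokens[1:]:
            # Scan forward, tolerating gap words, but abandon the
            # candidate once the gap limit is exceeded
            while i < len(words) and words[i][1] - end <= max_gap:
                if words[i][0] == target:
                    end = words[i][1] + len(words[i][0])
                    matched += 1
                    i += 1
                    break
                i += 1
            else:
                break
        if matched == len(tokens):
            candidates.append((pos, end))
    return candidates
```

For example, searching "quick fox" in "the quick brown fox ..." yields one candidate span covering "quick brown fox": the unmatched word "brown" is absorbed as a tolerated gap.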

Scoring Formula

score = W_SIMILARITY × Σ(token_scores) +
        W_COVERAGE × tokens_found +
        W_SEQUENTIAL × sequential_bonus +
        W_GAP_PENALTY × avg_gap +
        W_SKIP_PENALTY × skips +
        W_TRANSPOSITION × transpositions
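
Plugging in the coefficients from the batch-search example (w_similarity=1.0, w_coverage=0.3, w_gap_penalty=-0.01) and placeholder values for the remaining weights, a candidate's raw score can be computed like this; the placeholder weights are assumptions, not the library's defaults:

```python
def raw_score(token_scores, avg_gap, skips, transpositions,
              sequential_bonus=1.0,
              w_similarity=1.0, w_coverage=0.3, w_sequential=0.2,
              w_gap_penalty=-0.01, w_skip_penalty=-0.1,
              w_transposition=-0.1):
    """Weighted sum of the six factors in the formula above;
    tokens_found is taken as len(token_scores)."""
    return (w_similarity * sum(token_scores)
            + w_coverage * len(token_scores)
            + w_sequential * sequential_bonus
            + w_gap_penalty * avg_gap
            + w_skip_penalty * skips
            + w_transposition * transpositions)

# Two matched tokens (similarities 0.96 and 1.0), average gap of 7 chars,
# no skips or transpositions:
score = raw_score([0.96, 1.0], avg_gap=7, skips=0, transpositions=0)
# 1.0*1.96 + 0.3*2 + 0.2*1.0 - 0.01*7 = 2.69
```

The similarity and coverage terms reward strong, complete matches, while the gap, skip, and transposition terms subtract for sloppiness, so a scattered match can never outscore a tight one.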

Configuration

from vibe_finder import MatchConfig

# Default values
config = MatchConfig(
    max_single_gap=50,       # Max chars between consecutive tokens
    max_total_skips=2,       # Max tokens that can be skipped
    max_consecutive_skips=1, # Max tokens skipped in a row
    min_tokens_required=3    # Minimum tokens that must match
)

Additional Use Cases

  • Document Search: Find sections in long documents
  • OCR Post-processing: Match text despite recognition errors
  • Plagiarism Detection: Find similar passages
  • Data Extraction: Locate markers in structured documents
  • Log Analysis: Find patterns with variations

License

MIT

