
Vibe Finder

Gap-tolerant fuzzy phrase matching for OCR documents and AI/LLM-generated text search.

Find phrases even when text is altered, prettified, or contains OCR errors - without agentic multi-iteration approaches.

Use Cases

  • OCR Document Search: Find text in scanned documents where OCR introduced errors, typos, or character substitutions
  • AI/LLM Output Matching: Locate phrases in large texts when AI/LLM altered, rephrased, or "prettified" the original wording
  • Single-Pass Alternative: Replace slow agentic multi-iteration search loops with one fast fuzzy search call
  • Robust Text Extraction: Get exact character offsets even when source text doesn't match query exactly
  • Topic Detection: Find keyword hotspots to detect where topics are discussed (NEW in v1.2)

Features

  • Fuzzy Matching: Find phrases even with typos, OCR errors, or missing words
  • Gap Tolerance: Allows gaps between matched tokens (configurable)
  • Batch Search: Search multiple phrases at once with position-matched results
  • Score Threshold: Filter out low-quality matches automatically
  • Exact Offsets: Returns precise character positions in original text
  • Multi-factor Scoring: Combines similarity, coverage, sequence order, gaps, and transpositions
  • Keyword Hotspots: Find regions where keywords concentrate (CloudFinder, NEW in v1.2)
  • Zero Dependencies: Pure Python, uses only standard library

Installation

pip install vibe-finder

Quick Start

from vibe_finder import FuzzyFinder

# Initialize with document text (Russian sample, roughly: "Vladimir Turov
# talks about tax optimization for entrepreneurs. In this post we break
# down VAT deductions.")
document = """
Владимир Туров рассказывает о налоговой оптимизации
для предпринимателей. В этом посте разбираем вычеты НДС.
"""

finder = FuzzyFinder(document)

# Find exact phrase
result = finder.find("налоговой оптимизации")
print(f"Found: {result.found}")  # True
print(f"Position: [{result.start_offset}:{result.end_offset}]")

# Find phrase with typo
result = finder.find("налоговай оптимизацыи")  # typos!
print(f"Found: {result.found}")  # True (fuzzy match)

# Get matched text
if result.found:
    matched = document[result.start_offset:result.end_offset]
    print(f"Matched: '{matched}'")

Batch Search API

Search multiple phrases in one call. The results list is position-aligned with the input: results[i] corresponds to phrases[i].

from vibe_finder import FuzzyFinder

finder = FuzzyFinder(document)

# Search multiple phrases
phrases = ["налоговой оптимизации", "вычеты НДС", "unknown phrase"]
results = finder.search(
    phrases=phrases,
    options={
        "score_threshold": 0.4,   # Filter matches below this score
        "find_all": True,         # Find all occurrences (not just first)
        "coefficients": {         # Optional: custom scoring weights
            "w_similarity": 1.0,
            "w_coverage": 0.3,
            "w_gap_penalty": -0.01,
        }
    }
)

# Results structure:
# - len(results) == len(phrases)  # Always same length!
# - results[i] corresponds to phrases[i]
# - Each element is:
#   - None: not found or below threshold
#   - SearchResult: single match
#   - List[SearchResult]: multiple matches (when find_all=True)

for i, (phrase, res) in enumerate(zip(phrases, results)):
    if res is None:
        print(f"[{i}] '{phrase}': NOT FOUND")
    elif isinstance(res, list):
        print(f"[{i}] '{phrase}': {len(res)} matches")
        for match in res:
            print(f"     [{match.start_offset}:{match.end_offset}] score={match.score:.3f}")
    else:
        print(f"[{i}] '{phrase}': found at [{res.start_offset}:{res.end_offset}]")

SearchResult Fields

@dataclass
class SearchResult:
    found: bool           # True if matched
    start_offset: int     # First char position in raw text
    end_offset: int       # Last char position (exclusive)
    score: float          # Normalized score 0.0-1.0
    raw_score: float      # Raw score before normalization
    matched_text: str     # Actual matched text from document
    debug_info: dict      # Detailed scoring breakdown
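
A result is typically consumed by slicing the original document with its offsets. Here is a small stand-in class (not the library's own, just mirroring the fields listed above) to illustrate, assuming end_offset is exclusive as in the Quick Start slice:

```python
from dataclasses import dataclass, field

@dataclass
class SearchResult:  # stand-in mirroring the fields listed above
    found: bool
    start_offset: int
    end_offset: int
    score: float
    raw_score: float
    matched_text: str
    debug_info: dict = field(default_factory=dict)

document = "fuzzy phrase matching for noisy text"
r = SearchResult(True, 6, 21, 0.97, 3.78, "phrase matching")
if r.found and r.score >= 0.4:
    # end_offset is exclusive, so a plain slice recovers the span
    assert document[r.start_offset:r.end_offset] == r.matched_text
```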

SearchOptions

options = {
    "score_threshold": 0.0,    # Min score (0.0-1.0), below = None
    "find_all": True,          # Find all occurrences or first only
    "min_offset": 0,           # Start position in document
    "direction": "forward",    # "forward" or "backward"
    "coefficients": {...}      # Custom ScoringCoefficients (optional)
}
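
The interplay of score_threshold and find_all determines the shape of each result slot (None, single match, or list). A pure-Python sketch of that documented behavior, not the library's actual code:

```python
from typing import List, Union

def shape_result(matches: List[dict], score_threshold: float,
                 find_all: bool) -> Union[None, dict, List[dict]]:
    """Map raw candidate matches for one phrase onto the documented
    result shape: None, a single match, or a list of matches."""
    kept = [m for m in matches if m["score"] >= score_threshold]
    if not kept:
        return None          # not found, or everything fell below threshold
    if find_all:
        return kept          # every surviving occurrence
    return kept[0]           # first occurrence only

candidates = [{"score": 0.9}, {"score": 0.3}]
print(shape_result(candidates, 0.4, True))    # [{'score': 0.9}]
print(shape_result(candidates, 0.95, True))   # None
```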

API Reference

FuzzyFinder (High-level)

finder = FuzzyFinder(text)

# Batch search (recommended for multiple phrases)
results = finder.search(phrases, options)

# Find first occurrence
result = finder.find(phrase, min_offset=0, direction="forward")

# Find all occurrences
results = finder.find_all(phrase)

# Calculate string similarity
score = finder.similarity("word1", "word2")  # 0.0-1.0

Low-level Functions

from vibe_finder import (
    tokenize_text,      # Tokenize document into word list with positions
    tokenize_marker,    # Extract normalized tokens from phrase
    find_end_marker,    # Find phrase, pick match nearest AFTER reference
    find_start_marker,  # Find phrase, pick match nearest BEFORE reference
    jaro_winkler_similarity,  # String similarity (0.0-1.0)
)

# Tokenize once, search many times
word_list = tokenize_text(document)

result = find_end_marker(
    marker="искомая фраза",   # Russian: "target phrase"
    text=document,
    word_list=word_list,
    min_char_offset=0
)
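
jaro_winkler_similarity implements the standard Jaro-Winkler string metric. For reference, here is a minimal pure-Python version of that classic algorithm (illustrative only; the package ships its own implementation):

```python
def jaro(s: str, t: str) -> float:
    """Jaro similarity: shared characters within a sliding window,
    discounted by transpositions."""
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    if ls == 0 or lt == 0:
        return 0.0
    window = max(ls, lt) // 2 - 1
    s_hit, t_hit = [False] * ls, [False] * lt
    matches = 0
    for i, c in enumerate(s):
        for j in range(max(0, i - window), min(i + window + 1, lt)):
            if not t_hit[j] and t[j] == c:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions, k = 0, 0
    for i in range(ls):
        if not s_hit[i]:
            continue
        while not t_hit[k]:
            k += 1
        if s[i] != t[k]:
            transpositions += 1
        k += 1
    transpositions //= 2
    return (matches / ls + matches / lt
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s: str, t: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Boost the Jaro score for strings sharing a common prefix."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s, t):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(round(jaro_winkler("MARTHA", "MARHTA"), 3))  # 0.961
```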

MarkerMatch Result

@dataclass
class MarkerMatch:
    found: bool           # Whether match was found
    marker: str           # Original search phrase
    start_offset: int     # First char position in raw text
    end_offset: int       # Last char position in raw text (exclusive)
    debug_info: dict      # Scoring details, gaps, etc.

Matching Algorithm

  1. Tokenize phrase into normalized words/numbers
  2. Find all positions of first token (exact + fuzzy)
  3. Extend each candidate by matching subsequent tokens with gap constraints
  4. Score candidates using weighted formula
  5. Disambiguate by proximity to reference position
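
A simplified, dependency-free sketch of steps 1-3 (tokenize, seed on the first token, extend under a gap constraint), using difflib in place of the library's Jaro-Winkler matcher; the real implementation adds skips, transposition handling, and the full scoring pass:

```python
import re
from difflib import SequenceMatcher

def tokenize(text):
    """Step 1: lowercase word tokens with character offsets."""
    return [(m.group(0).lower(), m.start(), m.end())
            for m in re.finditer(r"\w+", text)]

def find_candidates(phrase, text, max_single_gap=50, min_ratio=0.8):
    sim = lambda a, b: SequenceMatcher(None, a, b).ratio()
    tokens = tokenize(text)
    query = [t for t, _, _ in tokenize(phrase)]
    if not query:
        return []
    results = []
    for i, (word, start, seed_end) in enumerate(tokens):
        # Step 2: seed a candidate wherever the first query token matches
        if sim(word, query[0]) < min_ratio:
            continue
        end, qi = seed_end, 1
        # Step 3: extend by matching later tokens under the gap limit
        for w, ws, we in tokens[i + 1:]:
            if qi == len(query) or ws - end > max_single_gap:
                break
            if sim(w, query[qi]) >= min_ratio:
                end, qi = we, qi + 1
        if qi == len(query):              # all query tokens matched
            results.append((start, end))
    return results

doc = "fuzzy phrase matchng for noisy OCR text"
print(find_candidates("phrase matching", doc))  # [(6, 20)]
```

Note how the deliberately misspelled "matchng" still extends the candidate, because its similarity to "matching" clears the ratio threshold.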

Scoring Formula

score = W_SIMILARITY × Σ(token_scores) +
        W_COVERAGE × tokens_found +
        W_SEQUENTIAL × sequential_bonus +
        W_GAP_PENALTY × avg_gap +
        W_SKIP_PENALTY × skips +
        W_TRANSPOSITION × transpositions
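
Plugging in the example weights from the batch-search options above (w_similarity=1.0, w_coverage=0.3, w_gap_penalty=-0.01) and, purely for illustration, zero for the remaining weights, a candidate's raw score works out as:

```python
# Hypothetical candidate: 3 query tokens, all matched
token_scores = [1.0, 0.93, 1.0]   # per-token similarity (sum = 2.93)
tokens_found = 3                  # coverage term
avg_gap = 5.0                     # average chars between matched tokens

w_similarity, w_coverage, w_gap_penalty = 1.0, 0.3, -0.01

raw_score = (w_similarity * sum(token_scores)   # 2.93
             + w_coverage * tokens_found        # +0.90
             + w_gap_penalty * avg_gap)         # -0.05
print(round(raw_score, 2))  # 3.78
```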

Configuration

from vibe_finder import MatchConfig

# Default values
config = MatchConfig(
    max_single_gap=50,       # Max chars between consecutive tokens
    max_total_skips=2,       # Max tokens that can be skipped
    max_consecutive_skips=1, # Max tokens skipped in a row
    min_tokens_required=3    # Minimum tokens that must match
)

Additional Use Cases

  • Document Search: Find sections in long documents
  • OCR Post-processing: Match text despite recognition errors
  • Plagiarism Detection: Find similar passages
  • Data Extraction: Locate markers in structured documents
  • Log Analysis: Find patterns with variations

License

MIT

