
Intelligent fuzzy keyword search with lemmatization for investigative journalists. Local processing ensures source confidentiality.


Fuzzy Context Finder

A Python utility for investigative journalists and researchers that performs intelligent fuzzy keyword searching within documents and extracts customizable context around matched terms. Perfect for text analysis, document investigation, and content exploration where approximate matches and surrounding context are critical.

🔒 Privacy & Security First

All processing happens locally on your machine. No documents, content, or search terms are ever sent to third-party servers or cloud services. This makes Fuzzy Context Finder the ideal choice for investigative journalists who must protect source confidentiality and maintain document security. Whether you're working with leaked documents, confidential sources, or sensitive investigations, your materials never leave your control.

Features

  • 🔍 Intelligent Matching Strategies:

    • Lemmatization: Matches word families (coercion → coerced, coercing)
    • Fuzzy Matching: Catches typos (prescription → perscription)
    • Prefix Protection: Prevents false matches (prescription ≠ description)
    • Short Word Family Mode: Matches related words (sex → sexual, sexually)
  • 📄 Flexible Context Extraction: Capture customizable amounts of text before, after, or around matched terms

  • 🎯 Accuracy-First Design: Built for investigative journalism where missing a match is not an option

  • ⚡ Performance Optimized: Caching and efficient algorithms for large documents
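A rough, stdlib-only sketch of how fuzzy scoring and prefix protection interact. This is illustrative only — the package itself uses rapidfuzz for scoring and spaCy for lemmas; `similarity` and `prefix_ok` here are hypothetical stand-ins:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Percent similarity; a rough stand-in for rapidfuzz scoring."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def prefix_ok(term: str, candidate: str, min_prefix_length: int = 3) -> bool:
    """Prefix protection: the first characters must match exactly."""
    n = min(min_prefix_length, len(term), len(candidate))
    return term[:n].lower() == candidate[:n].lower()

# Fuzzy scoring alone accepts both the typo and the lookalike...
print(similarity("prescription", "perscription") > 80)  # True
print(similarity("prescription", "description") > 80)   # also True
# ...which is why prefix protection matters: it keeps the word
# family ("prescribed") while rejecting the lookalike.
print(prefix_ok("prescription", "prescribed"))   # True ("pre" == "pre")
print(prefix_ok("prescription", "description"))  # False ("pre" != "des")
```

Note that fuzzy similarity by itself cannot distinguish a harmless typo from a different word with the same ending; the prefix check is what breaks the tie.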

Installation

1. Create a Virtual Environment (Recommended)

On Mac/Linux:

python3 -m venv venv
source venv/bin/activate

On Windows:

python -m venv venv
venv\Scripts\activate

2. Install Required Packages

pip install pandas rapidfuzz regex spacy

3. Download spaCy Language Model

python -m spacy download en_core_web_sm

4. Install the Package

From PyPI:

pip install fuzzy-context-finder

From Source:

git clone https://github.com/yourusername/fuzzy_context_finder.git
cd fuzzy_context_finder
pip install -e .

Requirements

Create a requirements.txt file:

pandas>=2.0.0
rapidfuzz>=3.0.0
regex>=2023.0.0
spacy>=3.7.0

Install all at once:

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Quick Start

from fuzzy_context_finder import keyword_context_finder

# Your document content
content = """
The doctor wrote a prescription for antibiotics.
The patient's opioid addiction complicated treatment.
There were allegations of coercion and coerced consent.
"""

# Search terms
search_terms = ["prescription", "addict", "coercion"]

# Find matches with context
results = keyword_context_finder(
    content=content,
    terms=search_terms,
    file_name="medical_report.txt",
    words_around=20,
    match_threshold=80,
    use_lemmatization=True
)

# View results
if results is not None:
    print(results[["Matched Term", "Original Term", "Match Type"]])
    
    # Save to CSV
    results.to_csv("investigation_results.csv", index=False)

Usage Examples

Example 1: Basic Investigation

content = "The investigation revealed sexual misconduct and abuse of power."
search_terms = ["sex", "abuse"]

results = keyword_context_finder(
    content=content,
    terms=search_terms,
    file_name="investigation.txt",
    words_around=15
)

# Matches: "sexual" (from "sex") and "abuse"

Example 2: Multiple Documents

import os

documents = ["report1.txt", "report2.txt", "report3.txt"]
search_terms = ["fraud", "embezzlement", "misconduct"]
all_results = []

for doc_path in documents:
    with open(doc_path, 'r', encoding='utf-8') as f:
        content = f.read()
    
    results = keyword_context_finder(
        content=content,
        terms=search_terms,
        file_name=doc_path,
        match_threshold=85,
        use_lemmatization=True
    )
    
    if results is not None:
        all_results.append(results)

# Combine all results
import pandas as pd
combined_results = pd.concat(all_results, ignore_index=True)
combined_results.to_csv("all_findings.csv", index=False)

Example 3: Adjusting for Precision vs Recall

High Precision (fewer false positives):

results = keyword_context_finder(
    content=content,
    terms=search_terms,
    file_name="document.txt",
    match_threshold=90,  # Stricter threshold
    require_prefix_match=True,  # Require prefix to match
    min_prefix_length=4,  # Longer prefix requirement
    use_lemmatization=True
)

High Recall (catch more variations):

results = keyword_context_finder(
    content=content,
    terms=search_terms,
    file_name="document.txt",
    match_threshold=75,  # Lower threshold
    require_prefix_match=False,  # Allow different prefixes
    use_lemmatization=True,
    word_family_mode=True
)

Parameters

Required Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| content | str | The text content to search |
| terms | list | List of search terms (e.g., ["fraud", "corruption"]) |
| file_name | str | Name/identifier for the document |

Context Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| words_before | int | 250 | Number of words to capture before the matched term |
| words_after | int | 250 | Number of words to capture after the matched term |
| words_around | int | 50 | Number of words to capture around the term (split evenly) |
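As a sketch of the even split, a hypothetical helper (not the library's internal implementation) might look like:

```python
def context_window(words, match_index, words_around=50):
    """Capture roughly words_around words split evenly around the match."""
    half = words_around // 2
    start = max(0, match_index - half)           # clamp at document start
    end = min(len(words), match_index + half + 1)  # clamp at document end
    return " ".join(words[start:end])

words = "one two three target four five six".split()
print(context_window(words, 3, words_around=4))
# → "two three target four five"
```

Near a document boundary there are simply fewer words on one side, so the window is clamped rather than padded.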

Matching Strategy Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| match_threshold | int | 80 | Minimum similarity score (0-100) for fuzzy matching |
| require_prefix_match | bool | True | Require word beginnings to match (prevents "prescription"/"description" confusion) |
| min_prefix_length | int | 3 | Number of starting characters that must match exactly |
| use_lemmatization | bool | True | Use spaCy lemmatization to match word families |
| word_family_mode | bool | True | Enable prefix matching for short words |
| family_mode_max_length | int | 4 | Max term length for family mode (terms of this length or shorter use prefix matching) |
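How word_family_mode and family_mode_max_length interact can be sketched as follows (`choose_strategy` is a hypothetical illustration, not the package's actual routing logic):

```python
def choose_strategy(term, word_family_mode=True, family_mode_max_length=4):
    """Short terms get prefix ("family") matching; longer terms go
    through lemmatization and fuzzy scoring instead."""
    if word_family_mode and len(term) <= family_mode_max_length:
        return "family"
    return "lemma/fuzzy"

print(choose_strategy("sex"))       # family (3 chars <= 4)
print(choose_strategy("coercion"))  # lemma/fuzzy
```

The rationale: lemmatization and fuzzy scoring are unreliable on very short terms (almost everything is within a few edits of "sex"), so a plain prefix test is both safer and faster there.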

Match Types Explained

The Match Type column in results shows how each match was found:

  • lemma: Matched through lemmatization (e.g., "coercion" → "coerced")
  • family: Matched through prefix for short words (e.g., "sex" → "sexual")
  • fuzzy: Matched through fuzzy string matching (e.g., typos)
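Fuzzy hits are usually the likeliest false positives, so you might triage results by Match Type first. The DataFrame below is a toy stand-in with made-up values, shaped like the columns described above:

```python
import pandas as pd

# Toy results frame mirroring the Match Type column (values invented).
results = pd.DataFrame({
    "Matched Term": ["coerced", "sexual", "perscription"],
    "Original Term": ["coercion", "sex", "prescription"],
    "Match Type": ["lemma", "family", "fuzzy"],
})

# Review fuzzy matches first — they carry the most false-positive risk.
fuzzy_hits = results[results["Match Type"] == "fuzzy"]
print(fuzzy_hits["Matched Term"].tolist())  # ['perscription']
```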

Output Format

Returns a pandas DataFrame with columns:

| Column | Description |
|--------|-------------|
| File Name | Name of the file/document |
| Page Number | Page number (always 1 for plain text content) |
| Matched Term | The actual word found in the text |
| Original Term | Your search term that matched |
| Similarity Score | Similarity score (0-100) |
| Match Type | How the match was found (lemma/family/fuzzy) |
| Search Term with N Words Context | Context around the term |
| Previous N Words (Including Term) | Text before the match (includes the matched term) |
| Next N Words (Including Term) | Text after the match (includes the matched term) |
| Character Position | Position of the match in the original text |
| Word Index | Word position of the match in the document |
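Because the output is a plain pandas DataFrame, standard pandas operations apply. For instance, tallying hits per file and search term (toy data below; column names follow the table above):

```python
import pandas as pd

# Invented results spanning two files, shaped like the output format.
results = pd.DataFrame({
    "File Name": ["report1.txt", "report1.txt", "report2.txt"],
    "Original Term": ["fraud", "bribery", "fraud"],
    "Similarity Score": [100, 92, 88],
})

# Count how often each search term hit in each file.
summary = results.groupby(["File Name", "Original Term"]).size()
print(summary.loc[("report1.txt", "fraud")])  # 1
```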

Real-World Use Cases

Investigative Journalism

# Track all mentions of a subject and related terms
search_terms = ["corruption", "bribery", "kickback", "fraud"]
# Lemmatization catches: corrupted, bribes, bribing, fraudulent, etc.

Legal Discovery

# Find all contract-related terms
search_terms = ["contract", "agreement", "terms", "breach"]
# High threshold to avoid false positives in legal documents

Medical Research

# Track medication mentions and variations
search_terms = ["oxycodone", "fentanyl", "prescription", "opioid"]
# Catches: prescriptions, prescribed, opioids, etc.

Troubleshooting

spaCy Model Not Found

If you get an error about missing spaCy model:

python -m spacy download en_core_web_sm

No Matches Found

  1. Lower the threshold: Try match_threshold=70
  2. Check your search terms: Use root words (e.g., "coerce" instead of "coercion")
  3. Disable prefix matching: Set require_prefix_match=False
  4. Check lemmatization: Print lemmas to debug:
   import spacy
   nlp = spacy.load("en_core_web_sm")
   doc = nlp("coercion coerced")
   for token in doc:
       print(f"{token.text} → {token.lemma_}")

Too Many False Positives

  1. Increase threshold: Try match_threshold=90
  2. Enable prefix matching: Set require_prefix_match=True
  3. Increase prefix length: Set min_prefix_length=4

Performance Tips

  • Large documents: Process in chunks if > 1MB
  • Many search terms: Runtime scales as O(words × terms)
  • Caching: Repeated words are cached for efficiency
  • Disable unused features: Turn off lemmatization if not needed
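A minimal chunking sketch for large documents (`chunk_words` is a hypothetical helper; the overlap keeps a match near a chunk boundary from being split across two chunks and missed in both):

```python
def chunk_words(text, chunk_size=5000, overlap=200):
    """Yield overlapping chunks of roughly chunk_size words each."""
    words = text.split()
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + chunk_size])

# A 12,000-word document yields three overlapping chunks.
text = " ".join(f"w{i}" for i in range(12000))
print(len(list(chunk_words(text))))  # 3
```

Each chunk can then be passed to keyword_context_finder separately and the resulting DataFrames concatenated, as in the multiple-documents example above.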

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Submit a pull request

License

MIT License - see LICENSE file for details

Author

Created for investigative journalists who need accurate, comprehensive document analysis.

Changelog

Version 2.0.0

  • NEW: Improved acronym handling - now correctly matches A.I., U.S., C.E.O., Ph.D., etc.
  • NEW: Acronyms with and without periods now match (searching "AI" finds both "AI" and "A.I.")
  • Improved tokenization pattern to preserve acronyms as complete tokens
  • Added spaCy lemmatization for better word family matching
  • Added prefix matching to prevent false positives
  • Added multiple matching strategies (lemma/family/fuzzy)
  • Improved tokenization to handle punctuation
  • Added caching for performance

Version 1.0.0

  • Initial release with basic fuzzy matching



Download files

Download the file for your platform.

Source Distribution

fuzzy_context_finder-2.0.0.tar.gz (8.8 kB)

Built Distribution

fuzzy_context_finder-2.0.0-py3-none-any.whl (8.5 kB)

File details

Details for the file fuzzy_context_finder-2.0.0.tar.gz.

File metadata

  • Download URL: fuzzy_context_finder-2.0.0.tar.gz
  • Upload date:
  • Size: 8.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for fuzzy_context_finder-2.0.0.tar.gz
Algorithm Hash digest
SHA256 485266e4c377ef1cd96c97d845deaab93d1eb0d2800809cf16447afba951ff38
MD5 89764723207730fee64fc131304a9e0b
BLAKE2b-256 9696b532ca42462f2c44457e0723d9e528beb4dbdd01566bc252a11653fb15b6

File details

Details for the file fuzzy_context_finder-2.0.0-py3-none-any.whl.

File metadata

File hashes

Hashes for fuzzy_context_finder-2.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7924c8a680e9cbf7ca713749bd0e53150bfb1880520c3dcec6c58140a38119d4
MD5 9f309ea11a1fa2a10a35e578c6e7e79f
BLAKE2b-256 c119e504c1d003c1bf76f150dacf39733168c18abd6b48cbeb40f277129e6da3
