
Intelligent fuzzy keyword search with lemmatization for investigative journalists. Local processing ensures source confidentiality.


Fuzzy Context Finder

A Python utility for investigative journalists and researchers that performs intelligent fuzzy keyword searching within documents and extracts customizable context around matched terms. Perfect for text analysis, document investigation, and content exploration where approximate matches and surrounding context are critical.

🔒 Privacy & Security First

All processing happens locally on your machine. No documents, content, or search terms are ever sent to third-party servers or cloud services. This makes Fuzzy Context Finder the ideal choice for investigative journalists who must protect source confidentiality and maintain document security. Whether you're working with leaked documents, confidential sources, or sensitive investigations, your materials never leave your control.

Features

  • 🔍 Intelligent Matching Strategies:

    • Lemmatization: Matches word families (coercion → coerced, coercing)
    • Fuzzy Matching: Catches typos (prescription → perscription)
    • Prefix Protection: Prevents false matches (prescription ≠ description)
    • Short Word Family Mode: Matches related words (sex → sexual, sexually)
  • 📄 Flexible Context Extraction: Capture customizable amounts of text before, after, or around matched terms

  • 🎯 Accuracy-First Design: Built for investigative journalism where missing a match is not an option

  • ⚡ Performance Optimized: Caching and efficient algorithms for large documents
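A rough, stdlib-only sketch of how fuzzy scoring and prefix protection interact. This is illustrative only — the package itself uses rapidfuzz for scoring and spaCy for lemmas; `similarity` and `prefix_ok` here are hypothetical stand-ins:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Percent similarity; a rough stand-in for rapidfuzz scoring."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def prefix_ok(term: str, candidate: str, min_prefix_length: int = 3) -> bool:
    """Prefix protection: the first characters must match exactly."""
    n = min(min_prefix_length, len(term), len(candidate))
    return term[:n].lower() == candidate[:n].lower()

# Fuzzy scoring alone accepts both the typo and the lookalike...
print(similarity("prescription", "perscription") > 80)  # True
print(similarity("prescription", "description") > 80)   # also True
# ...which is why prefix protection matters: it keeps the word
# family ("prescribed") while rejecting the lookalike.
print(prefix_ok("prescription", "prescribed"))   # True ("pre" == "pre")
print(prefix_ok("prescription", "description"))  # False ("pre" != "des")
```

Note that fuzzy similarity by itself cannot distinguish a harmless typo from a different word with the same ending; the prefix check is what breaks the tie.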

Installation

1. Create a Virtual Environment (Recommended)

On Mac/Linux:

python3 -m venv venv
source venv/bin/activate

On Windows:

python -m venv venv
venv\Scripts\activate

2. Install Required Packages

pip install pandas rapidfuzz regex spacy

3. Download spaCy Language Model

python -m spacy download en_core_web_sm

4. Install the Package

From PyPI:

pip install fuzzy-context-finder

From Source:

git clone https://github.com/yourusername/fuzzy_context_finder.git
cd fuzzy_context_finder
pip install -e .

Requirements

Create a requirements.txt file:

pandas>=2.0.0
rapidfuzz>=3.0.0
regex>=2023.0.0
spacy>=3.7.0

Install all at once:

pip install -r requirements.txt
python -m spacy download en_core_web_sm

Quick Start

from fuzzy_context_finder import keyword_context_finder

# Your document content
content = """
The doctor wrote a prescription for antibiotics.
The patient's opioid addiction complicated treatment.
There were allegations of coercion and coerced consent.
"""

# Search terms
search_terms = ["prescription", "addict", "coercion"]

# Find matches with context
results = keyword_context_finder(
    content=content,
    terms=search_terms,
    file_name="medical_report.txt",
    words_around=20,
    match_threshold=80,
    use_lemmatization=True
)

# View results
if results is not None:
    print(results[["Matched Term", "Original Term", "Match Type"]])
    
    # Save to CSV
    results.to_csv("investigation_results.csv", index=False)

Usage Examples

Example 1: Basic Investigation

content = "The investigation revealed sexual misconduct and abuse of power."
search_terms = ["sex", "abuse"]

results = keyword_context_finder(
    content=content,
    terms=search_terms,
    file_name="investigation.txt",
    words_around=15
)

# Matches: "sexual" (from "sex") and "abuse"

Example 2: Multiple Documents

import os

documents = ["report1.txt", "report2.txt", "report3.txt"]
search_terms = ["fraud", "embezzlement", "misconduct"]
all_results = []

for doc_path in documents:
    with open(doc_path, 'r', encoding='utf-8') as f:
        content = f.read()
    
    results = keyword_context_finder(
        content=content,
        terms=search_terms,
        file_name=doc_path,
        match_threshold=85,
        use_lemmatization=True
    )
    
    if results is not None:
        all_results.append(results)

# Combine all results
import pandas as pd
combined_results = pd.concat(all_results, ignore_index=True)
combined_results.to_csv("all_findings.csv", index=False)

Example 3: Adjusting for Precision vs Recall

High Precision (fewer false positives):

results = keyword_context_finder(
    content=content,
    terms=search_terms,
    file_name="document.txt",
    match_threshold=90,  # Stricter threshold
    require_prefix_match=True,  # Require prefix to match
    min_prefix_length=4,  # Longer prefix requirement
    use_lemmatization=True
)

High Recall (catch more variations):

results = keyword_context_finder(
    content=content,
    terms=search_terms,
    file_name="document.txt",
    match_threshold=75,  # Lower threshold
    require_prefix_match=False,  # Allow different prefixes
    use_lemmatization=True,
    word_family_mode=True
)

Parameters

Required Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| content | str | The text content to search |
| terms | list | List of search terms (e.g., ["fraud", "corruption"]) |
| file_name | str | Name/identifier for the document |

Context Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| words_before | int | 250 | Number of words to capture before the matched term |
| words_after | int | 250 | Number of words to capture after the matched term |
| words_around | int | 50 | Number of words to capture around the term (split evenly) |
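As a sketch of the even split, a hypothetical helper (not the library's internal implementation) might look like:

```python
def context_window(words, match_index, words_around=50):
    """Capture roughly words_around words split evenly around the match."""
    half = words_around // 2
    start = max(0, match_index - half)           # clamp at document start
    end = min(len(words), match_index + half + 1)  # clamp at document end
    return " ".join(words[start:end])

words = "one two three target four five six".split()
print(context_window(words, 3, words_around=4))
# → "two three target four five"
```

Near a document boundary there are simply fewer words on one side, so the window is clamped rather than padded.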

Matching Strategy Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| match_threshold | int | 80 | Minimum similarity score (0-100) for fuzzy matching |
| require_prefix_match | bool | True | Require word beginnings to match (prevents "prescription"/"description" confusion) |
| min_prefix_length | int | 3 | Number of starting characters that must match exactly |
| use_lemmatization | bool | True | Use spaCy lemmatization to match word families |
| word_family_mode | bool | True | Enable prefix matching for short words |
| family_mode_max_length | int | 4 | Max term length for family mode (terms of this length or shorter use prefix matching) |
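How word_family_mode and family_mode_max_length interact can be sketched as follows (`choose_strategy` is a hypothetical illustration, not the package's actual routing logic):

```python
def choose_strategy(term, word_family_mode=True, family_mode_max_length=4):
    """Short terms get prefix ("family") matching; longer terms go
    through lemmatization and fuzzy scoring instead."""
    if word_family_mode and len(term) <= family_mode_max_length:
        return "family"
    return "lemma/fuzzy"

print(choose_strategy("sex"))       # family (3 chars <= 4)
print(choose_strategy("coercion"))  # lemma/fuzzy
```

The rationale: lemmatization and fuzzy scoring are unreliable on very short terms (almost everything is within a few edits of "sex"), so a plain prefix test is both safer and faster there.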

Match Types Explained

The Match Type column in results shows how each match was found:

  • lemma: Matched through lemmatization (e.g., "coercion" → "coerced")
  • family: Matched through prefix for short words (e.g., "sex" → "sexual")
  • fuzzy: Matched through fuzzy string matching (e.g., typos)
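Fuzzy hits are usually the likeliest false positives, so you might triage results by Match Type first. The DataFrame below is a toy stand-in with made-up values, shaped like the columns described above:

```python
import pandas as pd

# Toy results frame mirroring the Match Type column (values invented).
results = pd.DataFrame({
    "Matched Term": ["coerced", "sexual", "perscription"],
    "Original Term": ["coercion", "sex", "prescription"],
    "Match Type": ["lemma", "family", "fuzzy"],
})

# Review fuzzy matches first — they carry the most false-positive risk.
fuzzy_hits = results[results["Match Type"] == "fuzzy"]
print(fuzzy_hits["Matched Term"].tolist())  # ['perscription']
```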

Output Format

Returns a pandas DataFrame with columns:

| Column | Description |
|--------|-------------|
| File Name | Name of the file/document |
| Page Number | Page number (always 1 for plain text content) |
| Matched Term | The actual word found in the text |
| Original Term | Your search term that matched |
| Similarity Score | Similarity score (0-100) |
| Match Type | How the match was found (lemma/family/fuzzy) |
| Search Term with N Words Context | Context around the term |
| Previous N Words (Including Term) | Text before the match (includes the matched term) |
| Next N Words (Including Term) | Text after the match (includes the matched term) |
| Character Position | Position of the match in the original text |
| Word Index | Word position of the match in the document |
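Because the output is a plain pandas DataFrame, standard pandas operations apply. For instance, tallying hits per file and search term (toy data below; column names follow the table above):

```python
import pandas as pd

# Invented results spanning two files, shaped like the output format.
results = pd.DataFrame({
    "File Name": ["report1.txt", "report1.txt", "report2.txt"],
    "Original Term": ["fraud", "bribery", "fraud"],
    "Similarity Score": [100, 92, 88],
})

# Count how often each search term hit in each file.
summary = results.groupby(["File Name", "Original Term"]).size()
print(summary.loc[("report1.txt", "fraud")])  # 1
```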

Real-World Use Cases

Investigative Journalism

# Track all mentions of a subject and related terms
search_terms = ["corruption", "bribery", "kickback", "fraud"]
# Lemmatization catches: corrupted, bribes, bribing, fraudulent, etc.

Legal Discovery

# Find all contract-related terms
search_terms = ["contract", "agreement", "terms", "breach"]
# High threshold to avoid false positives in legal documents

Medical Research

# Track medication mentions and variations
search_terms = ["oxycodone", "fentanyl", "prescription", "opioid"]
# Catches: prescriptions, prescribed, opioids, etc.

Troubleshooting

spaCy Model Not Found

If you get an error about missing spaCy model:

python -m spacy download en_core_web_sm

No Matches Found

  1. Lower the threshold: Try match_threshold=70
  2. Check your search terms: Use root words (e.g., "coerce" instead of "coercion")
  3. Disable prefix matching: Set require_prefix_match=False
  4. Check lemmatization: Print lemmas to debug:
   import spacy
   nlp = spacy.load("en_core_web_sm")
   doc = nlp("coercion coerced")
   for token in doc:
       print(f"{token.text} → {token.lemma_}")

Too Many False Positives

  1. Increase threshold: Try match_threshold=90
  2. Enable prefix matching: Set require_prefix_match=True
  3. Increase prefix length: Set min_prefix_length=4

Performance Tips

  • Large documents: Process in chunks if > 1MB
  • Many search terms: Runtime scales as O(words × terms)
  • Caching: Repeated words are cached for efficiency
  • Disable unused features: Turn off lemmatization if not needed
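A minimal chunking sketch for large documents (`chunk_words` is a hypothetical helper; the overlap keeps a match near a chunk boundary from being split across two chunks and missed in both):

```python
def chunk_words(text, chunk_size=5000, overlap=200):
    """Yield overlapping chunks of roughly chunk_size words each."""
    words = text.split()
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + chunk_size])

# A 12,000-word document yields three overlapping chunks.
text = " ".join(f"w{i}" for i in range(12000))
print(len(list(chunk_words(text))))  # 3
```

Each chunk can then be passed to keyword_context_finder separately and the resulting DataFrames concatenated, as in the multiple-documents example above.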

Contributing

Contributions welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new features
  4. Submit a pull request

License

MIT License - see LICENSE file for details

Author

Created for investigative journalists who need accurate, comprehensive document analysis.

Changelog

Version 2.0.0

  • NEW: Improved acronym handling - now correctly matches A.I., U.S., C.E.O., Ph.D., etc.
  • NEW: Acronyms with and without periods now match (searching "AI" finds both "AI" and "A.I.")
  • Improved tokenization pattern to preserve acronyms as complete tokens
  • Added spaCy lemmatization for better word family matching
  • Added prefix matching to prevent false positives
  • Added multiple matching strategies (lemma/family/fuzzy)
  • Improved tokenization to handle punctuation
  • Added caching for performance

Version 1.0.0

  • Initial release with basic fuzzy matching



Download files

Download the file for your platform.

Source Distribution

fuzzy_context_finder-2.0.0.tar.gz (8.8 kB)

Built Distribution

fuzzy_context_finder-2.0.0-py3-none-any.whl (8.5 kB)

File details

Details for the file fuzzy_context_finder-2.0.0.tar.gz.

File metadata

  • Download URL: fuzzy_context_finder-2.0.0.tar.gz
  • Upload date:
  • Size: 8.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.2

File hashes

Hashes for fuzzy_context_finder-2.0.0.tar.gz
Algorithm Hash digest
SHA256 485266e4c377ef1cd96c97d845deaab93d1eb0d2800809cf16447afba951ff38
MD5 89764723207730fee64fc131304a9e0b
BLAKE2b-256 9696b532ca42462f2c44457e0723d9e528beb4dbdd01566bc252a11653fb15b6

File details

Details for the file fuzzy_context_finder-2.0.0-py3-none-any.whl.

File metadata

File hashes

Hashes for fuzzy_context_finder-2.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 7924c8a680e9cbf7ca713749bd0e53150bfb1880520c3dcec6c58140a38119d4
MD5 9f309ea11a1fa2a10a35e578c6e7e79f
BLAKE2b-256 c119e504c1d003c1bf76f150dacf39733168c18abd6b48cbeb40f277129e6da3
