Fuzzy Context Finder
Intelligent fuzzy keyword search with lemmatization for investigative journalists. Local processing ensures source confidentiality.
A Python utility for investigative journalists and researchers that performs intelligent fuzzy keyword searching within documents and extracts customizable context around matched terms. Perfect for text analysis, document investigation, and content exploration where approximate matches and surrounding context are critical.
🔒 Privacy & Security First
All processing happens locally on your machine. No documents, content, or search terms are ever sent to third-party servers or cloud services. This makes Fuzzy Context Finder the ideal choice for investigative journalists who must protect source confidentiality and maintain document security. Whether you're working with leaked documents, confidential sources, or sensitive investigations, your materials never leave your control.
Features
- 🔍 Intelligent Matching Strategies (see the sketch below):
  - Lemmatization: Matches word families (coercion → coerced, coercing)
  - Fuzzy Matching: Catches typos (prescription → perscription)
  - Prefix Protection: Prevents false matches (prescription ≠ description)
  - Short Word Family Mode: Matches related words (sex → sexual, sexually)
- 📄 Flexible Context Extraction: Capture customizable amounts of text before, after, or around matched terms
- 🎯 Accuracy-First Design: Built for investigative journalism where missing a match is not an option
- ⚡ Performance Optimized: Caching and efficient algorithms for large documents
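The matching strategies are easiest to see in isolation. The snippet below is an illustrative sketch that uses spaCy and rapidfuzz directly; it is not the package's internal code, and the printed scores are approximate.
# Illustrative only: how the lemma, fuzzy, and prefix checks behave on their own.
import spacy
from rapidfuzz import fuzz

nlp = spacy.load("en_core_web_sm")

# Lemmatization: inflected forms reduce to the root "coerce".
doc = nlp("They coerced him and kept coercing others.")
print([t.lemma_ for t in doc if t.text.startswith("coerc")])  # expected: ['coerce', 'coerce']

# Fuzzy matching: a typo still scores highly.
print(fuzz.ratio("prescription", "perscription"))  # ~92

# Prefix protection: "description" also scores highly against "prescription",
# so requiring the first few characters to match exactly filters it out.
print(fuzz.ratio("prescription", "description"))  # ~87
print("description".startswith("prescription"[:3]))  # False -> rejected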
Installation
1. Create a Virtual Environment (Recommended)
On Mac/Linux:
python3 -m venv venv
source venv/bin/activate
On Windows:
python -m venv venv
venv\Scripts\activate
2. Install Required Packages
pip install pandas rapidfuzz regex spacy
3. Download spaCy Language Model
python -m spacy download en_core_web_sm
4. Install the Package
From PyPI (when published):
pip install fuzzy-context-finder
From Source:
git clone https://github.com/yourusername/fuzzy_context_finder.git
cd fuzzy_context_finder
pip install -e .
Requirements
Create a requirements.txt file:
pandas>=2.0.0
rapidfuzz>=3.0.0
regex>=2023.0.0
spacy>=3.7.0
Install all at once:
pip install -r requirements.txt
python -m spacy download en_core_web_sm
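As a quick sanity check after installation, the following should run without errors; it only imports the package and loads the language model:
# Verify that the package and the spaCy model are both available.
import spacy
from fuzzy_context_finder import keyword_context_finder

spacy.load("en_core_web_sm")
print("fuzzy_context_finder is ready")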
Quick Start
from fuzzy_context_finder import keyword_context_finder
# Your document content
content = """
The doctor wrote a prescription for antibiotics.
The patient's opioid addiction complicated treatment.
There were allegations of coercion and coerced consent.
"""
# Search terms
search_terms = ["prescription", "addict", "coercion"]
# Find matches with context
results = keyword_context_finder(
    content=content,
    terms=search_terms,
    file_name="medical_report.txt",
    words_around=20,
    match_threshold=80,
    use_lemmatization=True
)
# View results
if results is not None:
    print(results[["Matched Term", "Original Term", "Match Type"]])

    # Save to CSV
    results.to_csv("investigation_results.csv", index=False)
Usage Examples
Example 1: Basic Investigation
content = "The investigation revealed sexual misconduct and abuse of power."
search_terms = ["sex", "abuse"]
results = keyword_context_finder(
    content=content,
    terms=search_terms,
    file_name="investigation.txt",
    words_around=15
)
# Matches: "sexual" (from "sex") and "abuse"
Example 2: Multiple Documents
import os
documents = ["report1.txt", "report2.txt", "report3.txt"]
search_terms = ["fraud", "embezzlement", "misconduct"]
all_results = []
for doc_path in documents:
    with open(doc_path, 'r', encoding='utf-8') as f:
        content = f.read()

    results = keyword_context_finder(
        content=content,
        terms=search_terms,
        file_name=doc_path,
        match_threshold=85,
        use_lemmatization=True
    )

    if results is not None:
        all_results.append(results)
# Combine all results
import pandas as pd
combined_results = pd.concat(all_results, ignore_index=True)
combined_results.to_csv("all_findings.csv", index=False)
Example 3: Adjusting for Precision vs Recall
High Precision (fewer false positives):
results = keyword_context_finder(
    content=content,
    terms=search_terms,
    file_name="document.txt",
    match_threshold=90,          # Stricter threshold
    require_prefix_match=True,   # Require prefix to match
    min_prefix_length=4,         # Longer prefix requirement
    use_lemmatization=True
)
High Recall (catch more variations):
results = keyword_context_finder(
    content=content,
    terms=search_terms,
    file_name="document.txt",
    match_threshold=75,          # Lower threshold
    require_prefix_match=False,  # Allow different prefixes
    use_lemmatization=True,
    word_family_mode=True
)
Parameters
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| content | str | The text content to search |
| terms | list | List of search terms (e.g., ["fraud", "corruption"]) |
| file_name | str | Name/identifier for the document |
Context Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| words_before | int | 250 | Number of words to capture before the matched term |
| words_after | int | 250 | Number of words to capture after the matched term |
| words_around | int | 50 | Number of words to capture around the term (split evenly) |
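words_around splits the context evenly, while words_before and words_after let you shape it asymmetrically. A short sketch using the documented parameters; the search term and file name here are made up for the example:
# Asymmetric context: a long run-up before each match, a short tail after it.
results = keyword_context_finder(
    content=content,
    terms=["settlement"],
    file_name="deposition.txt",
    words_before=100,
    words_after=25
)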
Matching Strategy Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| match_threshold | int | 80 | Minimum similarity score (0-100) for fuzzy matching |
| require_prefix_match | bool | True | Requires word beginnings to match (prevents "prescription"/"description" confusion) |
| min_prefix_length | int | 3 | Number of starting characters that must match exactly |
| use_lemmatization | bool | True | Use spaCy lemmatization to match word families |
| word_family_mode | bool | True | Enable prefix matching for short words |
| family_mode_max_length | int | 4 | Max term length for family mode (terms of this length or shorter use prefix matching) |
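The two short-word settings work together: word_family_mode only applies to terms no longer than family_mode_max_length. A sketch with the documented parameter names; the search term and file name are illustrative:
# "war" is 3 characters, so with the default family_mode_max_length=4 it is
# matched by prefix and should also find words such as "wartime" and "warfare".
results = keyword_context_finder(
    content=content,
    terms=["war"],
    file_name="archive.txt",
    word_family_mode=True,
    family_mode_max_length=4
)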
Match Types Explained
The Match Type column in results shows how each match was found:
- lemma: Matched through lemmatization (e.g., "coercion" → "coerced")
- family: Matched through prefix for short words (e.g., "sex" → "sexual")
- fuzzy: Matched through fuzzy string matching (e.g., typos)
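Because the results are an ordinary pandas DataFrame (columns described in the next section), you can split them by match type when reviewing, for example:
# Review fuzzy matches (often typos) separately from lemma/family hits.
fuzzy_hits = results[results["Match Type"] == "fuzzy"]
lemma_hits = results[results["Match Type"] == "lemma"]
print(fuzzy_hits[["Matched Term", "Original Term", "Similarity Score"]])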
Output Format
Returns a pandas DataFrame with columns:
| Column | Description |
|---|---|
| File Name | Name of the file/document |
| Page Number | Page number (always 1 for plain text content) |
| Matched Term | The actual word found in the text |
| Original Term | Your search term that matched |
| Similarity Score | Similarity score (0-100) |
| Match Type | How the match was found (lemma/family/fuzzy) |
| Search Term with N Words Context | Context around the matched term |
| Previous N Words (Including Term) | Text before the match (includes the matched term) |
| Next N Words (Including Term) | Text after the match (includes the matched term) |
| Character Position | Position of the match in the original text |
| Word Index | Word position of the match in the document |
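Post-processing the DataFrame is straightforward. For example, to put the strongest matches first and export a compact report (using the column names listed above):
# Sort the strongest matches first and keep a compact set of columns for review.
report = results.sort_values("Similarity Score", ascending=False)
report = report[["File Name", "Matched Term", "Original Term",
                 "Similarity Score", "Match Type", "Character Position"]]
report.to_csv("review_report.csv", index=False)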
Real-World Use Cases
Investigative Journalism
# Track all mentions of a subject and related terms
search_terms = ["corruption", "bribery", "kickback", "fraud"]
# Lemmatization catches: corrupted, bribes, bribing, fraudulent, etc.
Legal Discovery
# Find all contract-related terms
search_terms = ["contract", "agreement", "terms", "breach"]
# High threshold to avoid false positives in legal documents
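A complete call for that scenario might look like this; the file name and exact threshold are illustrative:
results = keyword_context_finder(
    content=content,
    terms=["contract", "agreement", "terms", "breach"],
    file_name="discovery_batch.txt",
    match_threshold=90,         # strict, to keep false positives down
    require_prefix_match=True,
    use_lemmatization=True      # still catches "breached", "agreements", ...
)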
Medical Research
# Track medication mentions and variations
search_terms = ["oxycodone", "fentanyl", "prescription", "opioid"]
# Catches: prescriptions, prescribed, opioids, etc.
Troubleshooting
spaCy Model Not Found
If you get an error about missing spaCy model:
python -m spacy download en_core_web_sm
No Matches Found
- Lower the threshold: Try match_threshold=70
- Check your search terms: Use root words (e.g., "coerce" instead of "coercion")
- Disable prefix matching: Set require_prefix_match=False
- Check lemmatization: Print lemmas to debug:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("coercion coerced")
for token in doc:
    print(f"{token.text} → {token.lemma_}")
Too Many False Positives
- Increase the threshold: Try match_threshold=90
- Enable prefix matching: Set require_prefix_match=True
- Increase the prefix length: Set min_prefix_length=4
Performance Tips
- Large documents: Process in chunks if > 1MB (see the sketch below)
- Many search terms: Results scale with O(words × terms)
- Caching: Repeated words are cached for efficiency
- Disable unused features: Turn off lemmatization if not needed
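A minimal chunking sketch for the first tip, assuming you split on word count and overlap chunks so matches near a boundary keep their full context. The chunk sizes, big_document_text, and search_terms are placeholders:
def iter_chunks(text, chunk_words=5000, overlap_words=300):
    """Yield overlapping word-based chunks of a large document."""
    words = text.split()
    step = chunk_words - overlap_words
    for start in range(0, len(words), step):
        yield " ".join(words[start:start + chunk_words])
        if start + chunk_words >= len(words):
            break

all_results = []
for i, chunk in enumerate(iter_chunks(big_document_text)):
    chunk_results = keyword_context_finder(
        content=chunk,
        terms=search_terms,
        file_name=f"big_document_part_{i}.txt"
    )
    if chunk_results is not None:
        all_results.append(chunk_results)
Note that a match falling inside an overlap can be reported twice (once per chunk), so deduplicate if exact counts matter.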
Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new features
- Submit a pull request
License
MIT License - see LICENSE file for details
Author
Created for investigative journalists who need accurate, comprehensive document analysis.
Support
- Issues: https://github.com/yourusername/fuzzy_context_finder/issues
- Documentation: https://github.com/yourusername/fuzzy_context_finder/wiki
Changelog
Version 2.0.0
- NEW: Improved acronym handling - now correctly matches A.I., U.S., C.E.O., Ph.D., etc.
- NEW: Acronyms with and without periods now match (searching "AI" finds both "AI" and "A.I.")
- Improved tokenization pattern to preserve acronyms as complete tokens
- Added spaCy lemmatization for better word family matching
- Added prefix matching to prevent false positives
- Added multiple matching strategies (lemma/family/fuzzy)
- Improved tokenization to handle punctuation
- Added caching for performance
Version 1.0.0
- Initial release with basic fuzzy matching