search for keywords and their context
Keyword Context Finder
Searching for context doesn't have to be a chore!
A Python utility for finding keywords and their surrounding context in MangoCR markdown files. This tool supports fuzzy matching and provides flexible context extraction around matched terms.
Features
- Fuzzy string matching for approximate keyword finding
- Customizable context window sizes (before, after, and around matches)
- Page number tracking for MangoCR formatted documents
- Adjustable similarity threshold for matches
- Returns results in a pandas DataFrame for easy analysis
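The fuzzy matching and similarity threshold listed above can be illustrated with a small sketch. It is not the package's implementation: it swaps in the standard library's `difflib.SequenceMatcher` for rapidfuzz so that it runs without extra installs, and the helper names `similarity` and `fuzzy_find` are hypothetical. Scores from rapidfuzz may differ from the ones computed here.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0-100 similarity score between two strings (case-insensitive)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_find(words, term, threshold=80):
    """Return (index, word, score) for every word scoring at or above threshold."""
    hits = []
    for i, word in enumerate(words):
        score = similarity(word, term)
        if score >= threshold:
            hits.append((i, word, round(score, 1)))
    return hits


# Both the plural form and the misspelling clear an 80-point threshold.
words = "some keywords and a keywrd typo".split()
hits = fuzzy_find(words, "keyword", threshold=80)
for index, word, score in hits:
    print(index, word, score)
```

Raising the threshold toward 100 keeps only near-exact matches, while lowering it admits more distant variants and more false positives.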
Installation
pip install pandas rapidfuzz
Dependencies
- Python 3.6+
- pandas
- rapidfuzz
- regex
Usage
from fuzzy_context_finder import keyword_context_finder
# Example usage
content = """
## document_page_1
This is the content of page one with some keywords.
## document_page_2
More content on page two with different keywords.
"""
search_terms = ["keyword", "content"]
file_name = "example_document.pdf"
results = keyword_context_finder(
    content=content,
    terms=search_terms,
    file_name=file_name,
    words_before=250,
    words_after=250,
    words_around=50,
    match_threshold=80,
)
Parameters
- content (str): Document content with pages separated by MangoCR markers (## filename_page_number)
- terms (list): List of search terms to find in the document
- file_name (str): Name of the file being processed
- words_before (int, default=250): Number of words to capture before the term
- words_after (int, default=250): Number of words to capture after the term
- words_around (int, default=50): Number of words to capture around the term
- match_threshold (int, default=80): Minimum similarity score (0-100) for fuzzy matching
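To make the three window parameters concrete, here is a minimal sketch of how such windows could be sliced from a list of words. The function name `context_windows` is hypothetical, and treating `words_around` as "that many words on each side of the match" is an assumption rather than documented behavior. Clamping with `max` and `min` is one simple way to keep windows inside the document bounds.

```python
def context_windows(words, index, words_before=250, words_after=250, words_around=50):
    """Slice three context windows around words[index], clamped to the list bounds."""
    start_before = max(0, index - words_before)
    end_after = min(len(words), index + 1 + words_after)
    # Assumption: words_around counts words on each side of the match.
    start_around = max(0, index - words_around)
    end_around = min(len(words), index + 1 + words_around)
    return {
        "before": " ".join(words[start_before:index]),
        "after": " ".join(words[index + 1:end_after]),
        "around": " ".join(words[start_around:end_around]),
    }


words = "This is the content of page one with some keywords.".split()
ctx = context_windows(words, index=3, words_before=2, words_after=2, words_around=1)
print(ctx["before"])  # is the
print(ctx["after"])   # of page
print(ctx["around"])  # the content of
```

With the default sizes and a short page, the windows simply shrink to whatever text is available.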
Return Value
Returns a pandas DataFrame with the following columns:
- File Name
- Page Marker
- Page Number
- Matched Term
- Original Term
- Similarity Score
- Search Term with Context (configurable width)
- Previous Words Context
- Next Words Context
Returns None if no matches are found.
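Because the function may return None instead of an empty DataFrame, calling code needs a guard before using DataFrame methods. The sketch below shows one way to do that. The helper name `rows_or_empty` is hypothetical, and the only pandas call it relies on is `DataFrame.to_dict(orient="records")`.

```python
def rows_or_empty(results):
    """Return result rows as a list of dicts, or [] when the finder returned None."""
    if results is None:
        return []
    # DataFrame.to_dict(orient="records") yields one dict per row.
    return results.to_dict(orient="records")


# With no matches, downstream loops simply do nothing.
for row in rows_or_empty(None):
    print(row["Matched Term"], row["Similarity Score"])
```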
Example Output
>>> results.head()
File Name Page Marker Page Number Matched Term Original Term Similarity Score ...
0 example.md document_p_1 1 keyword keyword 100 ...
Document Format Requirements
The tool expects documents to follow the MangoCR format with page markers:
## filename_page_1
Content for page 1
## filename_page_2
Content for page 2
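The marker layout above can be split into pages with a regular expression. This is a sketch under stated assumptions, not the package's own parser: the pattern `PAGE_MARKER` and the function `split_pages` are hypothetical, and the pattern assumes each marker sits alone on a line in the form `## <name>_page_<number>`.

```python
import re

# Assumed marker shape: a line "## <name>_page_<number>".
PAGE_MARKER = re.compile(r"^##[ \t]+(\S+_page_(\d+))[ \t]*$", re.MULTILINE)


def split_pages(content):
    """Return (marker, page_number, text) tuples, skipping empty pages."""
    matches = list(PAGE_MARKER.finditer(content))
    pages = []
    for i, m in enumerate(matches):
        # A page runs from the end of its marker to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(content)
        text = content[m.end():end].strip()
        if text:
            pages.append((m.group(1), int(m.group(2)), text))
    return pages


content = """
## filename_page_1
Content for page 1
## filename_page_2
Content for page 2
## filename_page_3
"""
pages = split_pages(content)
for marker, number, text in pages:
    print(marker, number, text)
```

The third marker has no text after it, so that page is dropped, which mirrors the skipping of empty pages.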
Error Handling
- Empty pages are automatically skipped
- Returns None if no matches are found
- Handles out-of-bounds context windows gracefully
Contributing
Feel free to open issues or submit pull requests with improvements.
License
MIT License