search for keywords and their context

Keyword Context Finder

Searching for context doesn't have to be a chore!

A Python utility for finding keywords and their surrounding context in MangoCR markdown files. This tool supports fuzzy matching and provides flexible context extraction around matched terms.

Features

  • Fuzzy string matching for approximate keyword finding
  • Customizable context window sizes (before, after, and around matches)
  • Page number tracking for MangoCR formatted documents
  • Adjustable similarity threshold for matches
  • Returns results in a pandas DataFrame for easy analysis

Installation

pip install pandas rapidfuzz regex

Dependencies

  • Python 3.6+
  • pandas
  • rapidfuzz
  • regex

Usage

from fuzzy_context_finder import keyword_context_finder

# Example usage
content = """
## document_page_1
This is the content of page one with some keywords.
## document_page_2
More content on page two with different keywords.
"""

search_terms = ["keyword", "content"]
file_name = "example_document.pdf"

results = keyword_context_finder(
    content=content,
    terms=search_terms,
    file_name=file_name,
    words_before=250,
    words_after=250,
    words_around=50,
    match_threshold=80
)

Parameters

  • content (str): Document content with pages separated by MangoCR markers (## filename_page_number)
  • terms (list): List of search terms to find in the document
  • file_name (str): Name of the file being processed
  • words_before (int, default=250): Number of words to capture before the term
  • words_after (int, default=250): Number of words to capture after the term
  • words_around (int, default=50): Number of words to capture around the term
  • match_threshold (int, default=80): Minimum similarity score (0-100) for fuzzy matching
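To illustrate how match_threshold filters candidates, here is a minimal sketch of threshold-based fuzzy matching. It is not the package's implementation: it swaps in the standard library's difflib.SequenceMatcher for rapidfuzz so that it runs without extra installs, and the helper names similarity and fuzzy_hits are hypothetical. Scores from rapidfuzz may differ from the ones this stand-in produces.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity on a 0-100 scale (difflib stand-in, not rapidfuzz)."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fuzzy_hits(words, term, match_threshold=80):
    """Return (index, word, score) for each word scoring at or above the threshold."""
    hits = []
    for i, word in enumerate(words):
        score = similarity(word, term)
        if score >= match_threshold:
            hits.append((i, word, score))
    return hits

words = "This page has some keywords and a keywrod typo".split()
hits = fuzzy_hits(words, "keyword", match_threshold=80)
for index, word, score in hits:
    print(index, word, round(score, 1))
```

Raising the threshold toward 100 keeps only near-exact matches; lowering it admits more typos and variants at the cost of more false positives.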

Return Value

Returns a pandas DataFrame with the following columns:

  • File Name
  • Page Marker
  • Page Number
  • Matched Term
  • Original Term
  • Similarity Score
  • Search Term with Context (configurable width)
  • Previous Words Context
  • Next Words Context

Returns None if no matches are found.
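Because the function returns None rather than an empty DataFrame when nothing matches, callers should check for None before touching the result. A minimal sketch of such a guard follows; describe_results is a hypothetical helper, not part of the package, and it relies only on len(), which works on a DataFrame (row count) as well as on the plain list used here for illustration.

```python
def describe_results(results):
    """Summarize a result object, treating None as 'no matches'."""
    if results is None:
        return "no matches"
    return f"{len(results)} match(es)"

# A plain list stands in for the DataFrame in this illustration.
print(describe_results(None))
print(describe_results(["row 1", "row 2"]))
```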

Example Output

>>> results.head()
    File Name   Page Marker  Page Number  Matched Term  Original Term  Similarity Score  ...
0  example.md  document_p_1            1       keyword        keyword               100  ...

Document Format Requirements

The tool expects documents to follow the MangoCR format with page markers:

## filename_page_1
Content for page 1
## filename_page_2
Content for page 2
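
The marker format above can be parsed with a regular expression. The following is a minimal sketch under stated assumptions, not the package's actual parser: split_pages and PAGE_MARKER are hypothetical names, and the pattern assumes each marker sits on its own line and ends in _page_ followed by a number.

```python
import re

# Assumed marker shape: "## <name>_page_<number>" on a line of its own.
PAGE_MARKER = re.compile(r"^##[ \t]+(\S+_page_(\d+))[ \t]*$", re.MULTILINE)

def split_pages(content):
    """Return (marker, page_number, text) tuples, skipping empty pages."""
    pages = []
    matches = list(PAGE_MARKER.finditer(content))
    for i, m in enumerate(matches):
        start = m.end()
        # A page runs until the next marker, or to the end of the document.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(content)
        text = content[start:end].strip()
        if text:
            pages.append((m.group(1), int(m.group(2)), text))
    return pages

sample = """
## filename_page_1
Content for page 1
## filename_page_2
Content for page 2
"""
pages = split_pages(sample)
for marker, number, text in pages:
    print(marker, number, text)
```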

Error Handling

  • Empty pages are automatically skipped
  • Returns None if no matches are found
  • Handles out-of-bounds context windows gracefully
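
One common way to handle out-of-bounds context windows is to clamp the window to the bounds of the word list, so a match near the start or end of a page simply yields a shorter context. The sketch below assumes whitespace tokenization; context_window is a hypothetical helper and may not reflect how the package does it internally.

```python
def context_window(words, index, before, after):
    """Return (previous_words, next_words) around words[index], clamped to the list."""
    start = max(0, index - before)             # never reach before the first word
    end = min(len(words), index + 1 + after)   # never reach past the last word
    return words[start:index], words[index + 1:end]

words = "This is the content of page one with some keywords.".split()
# Asking for 250 words before a match at position 3 returns only the 3 that exist.
prev_words, next_words = context_window(words, 3, before=250, after=2)
print(prev_words, next_words)
```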

Contributing

Feel free to open issues or submit pull requests with improvements.

License

MIT License
