extract-pdf-highlighted-text

Extract text that has been highlighted in PDF documents.

How it works

  • Locates all highlight annotations on each page using PyPDF2.
  • Computes the bounding box of each highlight annotation.
  • Uses pdfminer.six to determine the locations of all visible characters on the page.
  • For each annotation, selects the characters whose bounding boxes overlap the annotation's bounding box, measured by intersection over union (IoU).
  • Groups the matched characters and prints the highlighted text in reading order.
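
The matching step can be illustrated with a small, library-free sketch. It is not the package's actual code: the names Box, quad_points_to_box, iou, chars_in_highlight and the min_iou parameter are hypothetical, and the coordinates are made up. It assumes the highlight geometry arrives as a PDF /QuadPoints array (eight numbers per quadrilateral) and that each character comes with an axis-aligned box. Because a single character is much smaller than a highlight, its IoU with the highlight is small, so the sketch accepts any positive overlap by default; the threshold the package uses is not stated here.

```python
from typing import Iterable, NamedTuple, Sequence, Tuple


class Box(NamedTuple):
    """Axis-aligned rectangle in PDF user space (origin at bottom-left)."""
    x0: float
    y0: float
    x1: float
    y1: float


def quad_points_to_box(quad_points: Sequence[float]) -> Box:
    """Bounding box of a /QuadPoints array (x, y pairs, 8 numbers per quad)."""
    xs = quad_points[0::2]
    ys = quad_points[1::2]
    return Box(min(xs), min(ys), max(xs), max(ys))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter_w = min(a.x1, b.x1) - max(a.x0, b.x0)
    inter_h = min(a.y1, b.y1) - max(a.y0, b.y0)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a.x1 - a.x0) * (a.y1 - a.y0)
    area_b = (b.x1 - b.x0) * (b.y1 - b.y0)
    return inter / (area_a + area_b - inter)


def chars_in_highlight(
    chars: Iterable[Tuple[str, Box]],
    highlight: Box,
    min_iou: float = 0.0,  # hypothetical threshold; any overlap counts by default
) -> str:
    """Join the characters whose box overlaps the highlight with IoU > min_iou."""
    return "".join(text for text, box in chars if iou(box, highlight) > min_iou)


# Made-up data: one highlight quad and three character boxes.
highlight = quad_points_to_box([100, 710, 130, 710, 100, 700, 130, 700])
chars = [
    ("H", Box(100, 700, 110, 710)),
    ("i", Box(110, 700, 115, 710)),
    ("x", Box(300, 700, 310, 710)),  # far to the right of the highlight
]
print(chars_in_highlight(chars, highlight))  # prints "Hi"
```

In a real run the character boxes would come from pdfminer.six and the annotation geometry from PyPDF2, as described in the list above.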

Installation

pip install extract-pdf-highlighted-text

After installation, run it as extract_pdf_highlighted_text.

Dependencies:

  • PyPDF2 (for annotation geometry)
  • pdfminer.six (for text locations)

Usage

extract_pdf_highlighted_text your_file.pdf

The script will print each extracted highlight in reading order.
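
How the tool decides reading order is not documented here. One common approach, shown below as a hedged sketch, is to group characters into lines by their vertical position and then sort each line from left to right. The function in_reading_order, the line_tolerance parameter, and the sample coordinates are all hypothetical; the sketch assumes PDF coordinates, where y increases upward.

```python
from typing import Iterable, List, Tuple

# (text, x0, y0): a character and the lower-left corner of its box.
Char = Tuple[str, float, float]


def in_reading_order(chars: Iterable[Char], line_tolerance: float = 2.0) -> List[str]:
    """Group characters into lines (top to bottom), each read left to right."""
    lines = []  # each entry is [line_y, [(x0, text), ...]]
    # Visit characters from the top of the page down (largest y first).
    for text, x0, y0 in sorted(chars, key=lambda c: -c[2]):
        if lines and abs(lines[-1][0] - y0) <= line_tolerance:
            # Close enough to the current line's y: same line.
            lines[-1][1].append((x0, text))
        else:
            lines.append([y0, [(x0, text)]])
    # Within each line, order by x and join the characters.
    return ["".join(text for _, text in sorted(items)) for _, items in lines]


# Made-up data: two lines whose characters arrive out of order.
sample = [("b", 110, 700.5), ("a", 100, 700), ("d", 110, 688), ("c", 100, 688)]
print(in_reading_order(sample))  # prints ['ab', 'cd']
```

The tolerance absorbs small baseline differences between characters on the same line; multi-column layouts would need more than this.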

Example Output

This is a highlighted passage.

Another highlighted bit here.

Limitations

  • Does not support image-based PDFs (no OCR).
  • Accuracy depends on the quality of the PDF and on the software that produced it.

Contributing

Contributions are welcome! Please submit pull requests or open issues on the GitHub repository.

License

This project is licensed under the MIT License.
