
Extract and summarize highlights from PDF files.


📘 pdf_highlight_extractor

Extract highlighted text from PDF files using PyMuPDF.

This lightweight utility reads highlighted text from a PDF and returns it together with the page number and highlight color of each highlight. It is useful for summarizing annotated documents, research papers, or ebooks.


🔧 Installation

Install from PyPI:

pip install pdf-highlight-extractor

🚀 Usage

from pdf_highlight_extractor.reader import extract_highlights

highlights = extract_highlights("sample.pdf")

for h in highlights:
    print(f"Page {h['page']} | Color: {h['color']} | Text: {h['text']}")

📝 Output Example

Page 2 | Color: (1.0, 1.0, 0.0) | Text: This is a highlighted phrase
Page 5 | Color: (0.0, 1.0, 0.0) | Text: Another important note
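
Because each result carries its color as a tuple, highlights can be grouped by color after extraction. The snippet below is a minimal sketch, assuming every entry is a dict with `page`, `color`, and `text` keys and that `color` is a tuple of three floats, as in the output above. The `COLOR_NAMES` table and the `group_by_color` helper are hypothetical and not part of the package.

```python
from collections import defaultdict

# Hypothetical lookup table for common highlight colors; extend as needed.
COLOR_NAMES = {
    (1.0, 1.0, 0.0): "yellow",
    (0.0, 1.0, 0.0): "green",
}


def group_by_color(highlights):
    """Group (page, text) pairs under a readable color name."""
    grouped = defaultdict(list)
    for h in highlights:
        # Round so tiny float differences do not split one color into several groups.
        rgb = tuple(round(c, 2) for c in h["color"])
        # Fall back to the tuple's string form when the color has no name.
        name = COLOR_NAMES.get(rgb, str(rgb))
        grouped[name].append((h["page"], h["text"]))
    return dict(grouped)


# Sample data shaped like the output example above.
sample = [
    {"page": 2, "color": (1.0, 1.0, 0.0), "text": "This is a highlighted phrase"},
    {"page": 5, "color": (0.0, 1.0, 0.0), "text": "Another important note"},
]

for name, items in group_by_color(sample).items():
    print(name)
    for page, text in items:
        print(f"  p.{page}: {text}")
```

In practice, `sample` would be replaced with the list returned by `extract_highlights`.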

🧠 Features

  • ✅ Extract text from highlights
  • ✅ Get page number and highlight color
  • ✅ Fallback extraction if highlight text is not directly stored
  • ✅ Simple API for automation or personal use
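
The fallback mentioned above can be pictured roughly as follows. This is a sketch of one plausible approach with PyMuPDF, not the package's actual implementation: it first tries the text stored in the annotation and, if that is empty, reads the page text under the annotation's rectangle. The function names `highlights_on_page` and `sketch_extract_highlights` are hypothetical.

```python
HIGHLIGHT_TYPE = 8  # PyMuPDF's numeric code for highlight annotations


def highlights_on_page(page, page_number):
    """Collect highlight records from one PyMuPDF page object."""
    records = []
    for annot in page.annots() or []:
        if annot.type[0] != HIGHLIGHT_TYPE:
            continue  # skip notes, underlines, and other annotation types
        # Some PDF apps store the highlighted text in the annotation itself.
        text = (annot.info.get("content") or "").strip()
        if not text:
            # Fallback: read the page text lying under the annotation rectangle.
            text = page.get_text("text", clip=annot.rect).strip()
        color = annot.colors.get("stroke")
        records.append({
            "page": page_number,
            "color": tuple(color) if color else None,
            "text": text,
        })
    return records


def sketch_extract_highlights(pdf_path):
    """Walk every page of a PDF and gather its highlights (1-based page numbers assumed)."""
    import fitz  # PyMuPDF; imported here so the helper above has no hard dependency

    results = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc):
            results.extend(highlights_on_page(page, index + 1))
    return results
```

Clipping to the annotation rectangle is a simplification: a highlight that spans several lines has a bounding box that can also cover neighboring words that were not highlighted.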

🧪 Example PDF

You can test the tool using any PDF with highlights created in:

  • Adobe Acrobat Reader
  • Preview (macOS)
  • Xodo or other PDF apps

📦 Requirements

  • Python 3.7+
  • PyMuPDF (automatically installed)

For development only, install the package in editable mode from a local checkout:

pip install -e .


Source Distribution

pdf_highlight_extractor-0.1.2.tar.gz (3.2 kB view details)

Uploaded Source

Built Distribution


pdf_highlight_extractor-0.1.2-py3-none-any.whl (3.0 kB view details)

Uploaded Python 3

File details

Details for the file pdf_highlight_extractor-0.1.2.tar.gz.

File metadata

  • Download URL: pdf_highlight_extractor-0.1.2.tar.gz
  • Upload date:
  • Size: 3.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.2

File hashes

Hashes for pdf_highlight_extractor-0.1.2.tar.gz
Algorithm Hash digest
SHA256 4a00054e752bc1cb39e7f3f545d607eccf25c24fc43d793d3c62d357267a89a2
MD5 1eb38964f70d489fe680f19c01fc0671
BLAKE2b-256 4d1917b95ced3b64f33722fa0ad8183e1b7d54de634f2ce58835f587ba618cd9


File details

Details for the file pdf_highlight_extractor-0.1.2-py3-none-any.whl.

File metadata

File hashes

Hashes for pdf_highlight_extractor-0.1.2-py3-none-any.whl
Algorithm Hash digest
SHA256 3e184e6e03fb1d86823a0f07b39ea587573e1eff6d9e1e9d11549f5311c56f97
MD5 e5035d4239680c4639708297feb7813a
BLAKE2b-256 9d595578cb7bb7312ff983ee45dba033d22080053c70c8340a66df84f4696d3f

