Extract OCR text and annotations from PDF files
# -------------------------
# Documentation
# -------------------------
"""
PDF Annot Extractor
===================
A Python package for extracting:
- OCR text from PDF pages
- PDF annotations (comments, highlights, etc.)
INSTALLATION
pip install .
USAGE
1) Python API
from pdf_annot_extractor import PDFTextAnnotationExtractor
# Basic usage (output defaults to current folder)
extractor = PDFTextAnnotationExtractor("file.pdf")
# Save OCR text (one file per page)
extractor.save_text()
# Export annotations to Excel
extractor.export_annotations_excel()
2) Custom Output Folder
extractor = PDFTextAnnotationExtractor("file.pdf", "output/")
extractor.save_text()
extractor.export_annotations_excel("output/result.xlsx")
3) Directory Processing
from pdf_annot_extractor import process_directory
results = process_directory("pdfs/")
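The README does not say what process_directory returns or whether it descends into subfolders. As a rough illustration of the directory-scanning step only, the sketch below collects the PDF files found directly inside a folder. The find_pdfs helper is hypothetical and is not part of pdf_annot_extractor.

```python
from pathlib import Path


def find_pdfs(directory):
    """Return the PDF files directly inside `directory`, sorted by path.

    Hypothetical helper: it does not recurse into subfolders and matches
    the .pdf extension case-insensitively.
    """
    return sorted(
        path
        for path in Path(directory).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )
```

Each returned path could then be passed to PDFTextAnnotationExtractor in the same way as "file.pdf" in the examples above.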
4) CLI Usage
Extract annotations to Excel:
python extractor.py --input file.pdf --output result.xlsx
Extract text files:
python extractor.py --input file.pdf --mode text
Process directory:
python extractor.py --input pdfs/ --output all.xlsx
OPTIONS
--input : PDF file or directory
--output : Output file (optional for text mode)
--mode : Output mode, either text or excel
--lang : OCR language (default: ara)
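The options above map naturally onto an argparse parser. The following is a minimal sketch of how such a command line could be declared. It is not taken from the package's extractor.py: the build_parser name and the help strings are assumptions, and so is excel as the default mode (the README only shows that passing --output without --mode produces an .xlsx file).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the four documented options."""
    parser = argparse.ArgumentParser(
        description="Extract OCR text and annotations from PDF files"
    )
    parser.add_argument("--input", required=True, help="PDF file or directory")
    parser.add_argument(
        "--output", default=None, help="Output file (optional for text mode)"
    )
    # Assumption: excel is the default mode; the README does not state one.
    parser.add_argument("--mode", choices=["text", "excel"], default="excel")
    parser.add_argument("--lang", default="ara", help="OCR language (default: ara)")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--input", "my.pdf", "--mode", "text"])
print(args.input, args.mode, args.lang, args.output)
```

Restricting --mode with choices makes argparse reject any value other than text or excel before any PDF is opened.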
NOTES
- If no output folder is given, files are written to the current working directory
- Requires Tesseract OCR with the Arabic (ara) language data installed, since ara is the default OCR language
- Requires Poppler, which pdf2image uses to render PDF pages
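Both system requirements are external executables, so it can help to confirm they are on PATH before running the extractor. The sketch below is one way to do that with the standard library. The missing_system_tools helper is hypothetical, and using pdftoppm as the marker for a Poppler installation is an assumption based on it being one of the Poppler utilities that pdf2image calls.

```python
import shutil


def missing_system_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of the given executables that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_system_tools()
if missing:
    print("Missing system tools:", ", ".join(missing))
else:
    print("Tesseract and Poppler executables found on PATH")
```

This only checks that the programs exist. Running tesseract --list-langs shows whether the ara language data is installed.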
EXAMPLE
python extractor.py --input my.pdf --mode text
→ Creates:
page_001.txt
page_002.txt
...
python extractor.py --input my.pdf --output annotations.xlsx
→ Creates:
annotations.xlsx
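The text-mode example produces one zero-padded file per page. A minimal sketch of that naming scheme follows, assuming pages are numbered from 1 and padded to three digits as in page_001.txt. The page_filename and write_pages helpers are hypothetical and only illustrate the output layout, not the package's own save_text implementation.

```python
from pathlib import Path


def page_filename(page_number: int) -> str:
    """Build a name such as page_001.txt for a 1-based page number."""
    return f"page_{page_number:03d}.txt"


def write_pages(texts, output_dir="."):
    """Write one UTF-8 text file per page and return the written paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for number, text in enumerate(texts, start=1):
        path = out / page_filename(number)
        path.write_text(text, encoding="utf-8")
        written.append(path)
    return written
```

Writing with an explicit UTF-8 encoding matters here because the default OCR language is Arabic, and the platform default encoding may not be able to represent that text.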