
Extract OCR text and annotations from PDF files


PDF Annot Extractor

===================

A Python package for extracting:

- OCR text from PDF pages

- PDF annotations (comments, highlights, etc.)


INSTALLATION


From the root of the project source tree, run:

pip install .


USAGE


1) Python API


from pdf_annot_extractor import PDFTextAnnotationExtractor

# Basic usage (output defaults to current folder)

extractor = PDFTextAnnotationExtractor("file.pdf")

# Save OCR text (one file per page)

extractor.save_text()

# Export annotations to Excel

extractor.export_annotations_excel()

2) Custom Output Folder


extractor = PDFTextAnnotationExtractor("file.pdf", "output/")

extractor.save_text()

extractor.export_annotations_excel("output/result.xlsx")

3) Directory Processing


from pdf_annot_extractor import process_directory

results = process_directory("pdfs/")
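The description does not say how process_directory finds files or what it returns. As a minimal sketch, assuming it handles every .pdf file directly inside the folder, the discovery step could look like the following. find_pdfs is a hypothetical helper for illustration, not part of the package.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files directly inside `folder`, sorted by path.

    Hypothetical helper: the real process_directory may recurse into
    subfolders or filter files differently.
    """
    root = Path(folder)
    if not root.is_dir():
        raise NotADirectoryError(f"not a directory: {root}")
    # Compare the extension case-insensitively so "SCAN.PDF" is included.
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Each returned path could then be passed to PDFTextAnnotationExtractor in a loop, as in the single-file examples above.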

4) CLI Usage


Extract annotations to Excel:

python extractor.py --input file.pdf --output result.xlsx

Extract text files:

python extractor.py --input file.pdf --mode text

Process directory:

python extractor.py --input pdfs/ --output all.xlsx


OPTIONS


--input : PDF file or directory

--output : Output file (optional for text mode)

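--mode : text | excel (the Excel examples above omit --mode, which suggests excel is the default)

--lang : OCR language code passed to Tesseract (default: ara, Arabic)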
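The options above map naturally onto argparse. The sketch below is not the package's actual parser. It assumes --input is required and that excel is the default mode, which the description implies but does not state. build_parser is a hypothetical name.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (a sketch only)."""
    parser = argparse.ArgumentParser(prog="extractor.py")
    # Assumption: --input is mandatory; every documented example passes it.
    parser.add_argument("--input", required=True, help="PDF file or directory")
    parser.add_argument("--output", default=None,
                        help="Output file (optional for text mode)")
    # Assumption: excel is the default, since the Excel examples omit --mode.
    parser.add_argument("--mode", choices=["text", "excel"], default="excel")
    parser.add_argument("--lang", default="ara", help="OCR language")
    return parser


args = build_parser().parse_args(["--input", "file.pdf", "--mode", "text"])
print(args.mode, args.lang, args.output)  # text ara None
```

Restricting --mode with choices makes argparse reject any value other than text or excel before any PDF is opened.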


NOTES


- If no output folder is specified, files are written to the current working directory

- Requires Tesseract OCR to be installed with the Arabic language pack (ara, the default --lang value)

- Requires Poppler, which pdf2image uses to render PDF pages
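Both external requirements are command-line programs, so a missing one can be detected before processing starts. The sketch below checks PATH for tesseract and for pdftoppm, the Poppler executable that pdf2image calls by default. missing_tools is a hypothetical helper, not part of the package.

```python
import shutil


def missing_tools(tools=("tesseract", "pdftoppm")):
    """Return the names of required executables that are not on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = missing_tools()
if missing:
    print("Missing executables:", ", ".join(missing))
else:
    print("Tesseract and Poppler executables found on PATH")
```

This only confirms that the programs exist. To confirm the Arabic language pack as well, run tesseract --list-langs and look for ara in the output.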


EXAMPLE


python extractor.py --input my.pdf --mode text

→ Creates:

page_001.txt

page_002.txt

...

python extractor.py --input my.pdf --output annotations.xlsx

→ Creates:

annotations.xlsx
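The page_001.txt, page_002.txt pattern shown for text mode is consistent with a 1-based page number zero-padded to three digits. A minimal sketch of that naming scheme follows. page_filename is a hypothetical helper, and the behaviour beyond page 999 is an assumption.

```python
from pathlib import Path


def page_filename(page_number, output_dir="."):
    """Return the text-file path for a 1-based page number."""
    if page_number < 1:
        raise ValueError("page numbers start at 1")
    # :03d zero-pads to three digits; longer numbers are not truncated.
    return Path(output_dir) / f"page_{page_number:03d}.txt"


print(page_filename(1).name)   # page_001.txt
print(page_filename(12).name)  # page_012.txt
```

Zero-padding keeps the files in page order when a folder listing is sorted by name.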


Download files

Source Distribution

trt_secnd_attempt-0.0.2.tar.gz (3.2 kB)

Uploaded Source

Built Distribution

trt_secnd_attempt-0.0.2-py3-none-any.whl (4.8 kB)

Uploaded Python 3

File details

Details for the file trt_secnd_attempt-0.0.2.tar.gz.

File metadata

  • Download URL: trt_secnd_attempt-0.0.2.tar.gz
  • Upload date:
  • Size: 3.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.2

File hashes

Hashes for trt_secnd_attempt-0.0.2.tar.gz
Algorithm Hash digest
SHA256 b53a6a73b9c141e6f2b12b50e3829c12d882fbc4fe8a114b280f1f16a6abfa3f
MD5 c5d7b3ab4cf058ae1d9388038d379ff6
BLAKE2b-256 a91de29e2e89e930c6c862095022539216b27a855be22df5556c9ad81dc0cd8a

File details

Details for the file trt_secnd_attempt-0.0.2-py3-none-any.whl.

File metadata

  • Download URL: trt_secnd_attempt-0.0.2-py3-none-any.whl
  • Upload date:
  • Size: 4.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.11.2

File hashes

Hashes for trt_secnd_attempt-0.0.2-py3-none-any.whl
Algorithm Hash digest
SHA256 a8c3dd6b1c022822667a44ba5a6610eba9c81f2aa3e205d16a22e01267ea8e55
MD5 3e1c3702c6ebb2cd52b7fb3bedc38db6
BLAKE2b-256 69b82f5f6064c0c68c2fb71dee3bd59756d17a2d7db96d5cc3555fcb4b3ee17a
