OCR library with advanced PDF to text, layout visuals, and audio generation

Project description

ocr_pdf2txt

This library extracts text from PDF files using OCR and automatically locates its Poppler and Tesseract dependencies. It can also visualize the recognized text, generate audio, and detect broad semantic topics.

Features

  • Cross-platform (macOS, Windows, Linux) with automatic detection of Tesseract.
  • HTML visualization of recognized text on each page.
  • Audio file generation for reading PDF content aloud.
  • Semantic topic detection leveraging spaCy’s named entity recognition.
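
As a rough illustration of how named entities can be turned into broad topics, the sketch below counts entity labels and reports the most frequent ones. This is not the library's documented implementation. The helper names top_entity_labels and extract_entities, and the choice of the en_core_web_sm model, are assumptions made for this example only.

```python
from collections import Counter


def top_entity_labels(entities, n=3):
    """Count (text, label) pairs by label and return the n most common labels."""
    counts = Counter(label for _text, label in entities)
    return counts.most_common(n)


def extract_entities(text, model="en_core_web_sm"):
    """Run spaCy NER on text and return (entity text, entity label) pairs.

    Assumes spaCy and the named model are installed; the model name is an
    assumption for this sketch, not something the package prescribes.
    """
    import spacy  # imported lazily so the counting helper works without spaCy

    nlp = spacy.load(model)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]
```

With spaCy installed, top_entity_labels(extract_entities(recognized_text)) would give a coarse picture of whether a document is mostly about places, organizations, people, and so on.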

Installation

pip install ocr_pdf2txt

Usage

from ocr_pdf2txt import ocr_pdf_to_text

pdf_path = "sample.pdf"
output_folder = "output_dir"

ocr_pdf_to_text(
    pdf_path=pdf_path,
    output_folder=output_folder,
    visualize=True,      # Show OCR overlay in HTML
    audio_output=True,   # Generate an MP3 of recognized text
    semantic_topics=True # Print out recognized semantic topics
)

Make sure Tesseract and Poppler are installed on your machine. If you run into issues, check the installation documentation for your operating system.
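
One way to confirm that both tools are reachable before running the OCR is a small pre-flight check, sketched below. It is independent of ocr_pdf2txt and assumes a typical install where Tesseract provides a tesseract executable and Poppler provides pdftoppm on your PATH. The function name find_missing_tools is hypothetical.

```python
import shutil

# Executable names are assumptions about a typical install:
# Tesseract ships `tesseract`; Poppler ships `pdftoppm` (among others).
REQUIRED_TOOLS = {"Tesseract": "tesseract", "Poppler": "pdftoppm"}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools whose executable cannot be found on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing dependencies:", ", ".join(missing))
    else:
        print("Tesseract and Poppler were found on PATH.")
```

A tool installed outside PATH will be reported as missing by this check even if the library's own detection can still locate it.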

License

MIT. See LICENSE for more information.
