OCR library with advanced PDF to text, layout visuals, and audio generation
ocr_pdf2txt
This library extracts text from PDF files using OCR and automatically locates its Poppler and Tesseract dependencies. It can also render the recognized text as an HTML visualization, generate audio, and detect broad semantic topics.
Features
- Cross-platform (macOS, Windows, Linux) with automatic detection of Tesseract.
- HTML visualization of recognized text on each page.
- Audio file generation for reading PDF content aloud.
- Semantic topic detection leveraging spaCy’s named entity recognition.
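As a rough illustration of the last feature, the sketch below shows one way broad topics could be derived from spaCy's named entities. It is not taken from this library's source: the helper names `entities_from_text` and `group_topics`, the `en_core_web_sm` model choice, and the "most frequent mentions per label" heuristic are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def entities_from_text(text, model="en_core_web_sm"):
    """Return (entity text, entity label) pairs found by spaCy NER.

    Hypothetical helper: requires spaCy and the named model to be installed.
    """
    import spacy  # imported lazily so group_topics works without spaCy

    nlp = spacy.load(model)
    return [(ent.text, ent.label_) for ent in nlp(text).ents]


def group_topics(entity_pairs, top_n=3):
    """Group entity mentions by label and keep the most frequent ones.

    Hypothetical heuristic: a "topic" here is simply an entity label
    (for example ORG or GPE) with its most common mentions.
    """
    counts = defaultdict(Counter)
    for text, label in entity_pairs:
        counts[label][text.strip()] += 1
    return {
        label: [mention for mention, _ in counter.most_common(top_n)]
        for label, counter in counts.items()
    }
```

With spaCy installed, `group_topics(entities_from_text(recognized_text))` would return a dictionary keyed by entity label. The grouping step itself depends only on the standard library.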
Installation
pip install ocr_pdf2txt
Usage
from ocr_pdf2txt import ocr_pdf_to_text

pdf_path = "sample.pdf"
output_folder = "output_dir"

ocr_pdf_to_text(
    pdf_path=pdf_path,
    output_folder=output_folder,
    visualize=True,        # Show OCR overlay in HTML
    audio_output=True,     # Generate an MP3 of recognized text
    semantic_topics=True,  # Print out recognized semantic topics
)
Make sure you have Tesseract and Poppler installed on your machine. Check documentation for your operating system if you run into issues.
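If you are unsure whether both tools are reachable, a quick check along the following lines may help. This is a minimal sketch and not part of ocr_pdf2txt: the helper names `find_ocr_tools` and `missing_tools` are made up for the example, and it assumes Tesseract is exposed as the `tesseract` executable and that Poppler can be recognized by its `pdftoppm` utility on your `PATH`.

```python
import shutil


def find_ocr_tools(which=shutil.which):
    """Map each required tool to its resolved executable path, or None.

    Hypothetical helper: `which` is injectable so the lookup can be
    exercised without the tools actually being installed.
    """
    # Assumption: Poppler is detected via its `pdftoppm` command-line utility.
    executables = {"tesseract": "tesseract", "poppler": "pdftoppm"}
    return {name: which(exe) for name, exe in executables.items()}


def missing_tools(found):
    """Return the sorted names of tools that were not found on PATH."""
    return sorted(name for name, path in found.items() if path is None)
```

Calling `missing_tools(find_ocr_tools())` returns an empty list when both executables are on your `PATH`. Otherwise it names the tool to install or add to `PATH`.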
License
MIT. See LICENSE for more information.
File details
Details for the file ocr_pdf2txt-0.1.1.tar.gz.
File metadata
- Download URL: ocr_pdf2txt-0.1.1.tar.gz
- Upload date:
- Size: 4.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f8dfb29753429aa92e48961f747f8365ee152f653131a80f052e4f4091e5b886 |
| MD5 | 2ea68e8494072fc6b676b5b2fb1ffb30 |
| BLAKE2b-256 | 851611aa36a926f8676ed81aaedca4fab1bb02b98eb97347fa9710c66ea3ac2d |
File details
Details for the file ocr_pdf2txt-0.1.1-py3-none-any.whl.
File metadata
- Download URL: ocr_pdf2txt-0.1.1-py3-none-any.whl
- Upload date:
- Size: 4.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d0b2af3fa80a42c61f530d333314afe1ee58e0cf110607c7ec9deee1656c68c1 |
| MD5 | 76859cdc041e306e434b2a3f6d574dee |
| BLAKE2b-256 | dec3da9a57234360af6dbf601654c290534ae4b05e46960aab83e2d02e59b54d |