Python package to convert PDF to text using OCR
py-ocr-pdf
This project has been designed to allow you to OCR PDF files regardless of whether the PDF contains text or images.
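To make the idea concrete, here is a minimal sketch of the kind of pipeline that pdftoppm and tesseract (the two tools listed under OS Dependencies) make possible: render every page to an image, then OCR each image. This is not the package's own API. The function names pdftoppm_command, tesseract_command and ocr_pdf are hypothetical and exist only for illustration, as are the 300 DPI and "eng" defaults. Because every page is rasterised first, the same steps apply whether the PDF holds text or scanned images.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Build the pdftoppm call that renders each page to a PNG file."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(output_prefix)]


def tesseract_command(image_path, lang="eng"):
    """Build the tesseract call that prints recognised text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def page_number(image_path):
    """Extract the page number from a name such as page-03.png."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Hypothetical helper: rasterise every page, OCR it, join the text."""
    with tempfile.TemporaryDirectory() as tmp:
        prefix = Path(tmp) / "page"
        subprocess.run(pdftoppm_command(pdf_path, prefix, dpi), check=True)
        pages = []
        # Sort numerically so page 10 does not come before page 2.
        for image in sorted(Path(tmp).glob("page-*.png"), key=page_number):
            result = subprocess.run(
                tesseract_command(image, lang),
                check=True,
                capture_output=True,
                text=True,
            )
            pages.append(result.stdout)
    return "\n".join(pages)
```

With both tools installed, calling ocr_pdf("scan.pdf") on a file of your own would return the recognised text as one string.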
Python Support
This project only actively supports current Python versions, Python 3.10 to 3.14.
Installation
You can install this package from PyPI with pip:
pip install py-ocr-pdf
OS Dependencies
- poppler-utils
- tesseract-ocr
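Before running an OCR job, it can help to confirm that both command-line tools are reachable. The snippet below is a small sketch using only the standard library. The helper name missing_tools is hypothetical, and it assumes the executables are called pdftoppm and tesseract, which are the names the poppler-utils and tesseract-ocr packages install.

```python
import shutil


def missing_tools(tools=("pdftoppm", "tesseract")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("pdftoppm and tesseract are both available")
```

An empty list means both dependencies were found. Anything else names the tool that still needs installing.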
Linux PDF OCR Support
- Install the following
sudo apt-get install poppler-utils tesseract-ocr
macOS PDF OCR Support
This project uses pdftoppm and tesseract, so you need to install poppler and tesseract with Homebrew.
brew install poppler tesseract
Windows OS PDF OCR Support
On Windows you can install pdftoppm by following the instructions here:
- Go to https://github.com/oschwartz10612/poppler-windows
- Navigate to the latest release
- Download the zip
- Unzip and save the files in a new folder
- Make the location of pdftoppm known to your system, for example by adding the folder that contains it to your PATH
Found a Bug?
Issues are tracked via GitHub issues at the project issue page.
Have A Feature Request?
Feature requests can be raised by creating an issue on the project issue page. Please start the issue title with "Feature Request -".
Testing
To run the tests use
coverage erase && \
python -W error::DeprecationWarning -W error::PendingDeprecationWarning -m coverage run --parallel -m pytest --ds tests.settings && \
coverage combine && \
coverage report
Compiling Requirements
Run pip install pip-tools, then run python requirements/compile.py to generate the various requirements files.
I use two local virtualenvs to build the requirements, one running Python 3.8 and the other running Python 3.11.
Building
This project uses hatchling as its build backend.
python -m build --sdist
tox
Contributing
- Check for open issues at the project issue page or open a new issue to start a discussion about a feature or bug.
- Fork the repository on GitHub to start making changes.
- Clone the repository
- Initialise pre-commit by running
pre-commit install
- Install requirements from one of the requirement files
- Add a test case to show that the bug is fixed or the feature is implemented correctly.
- Test using
python -W error::DeprecationWarning -W error::PendingDeprecationWarning -m coverage run --parallel -m pytest --ds tests.settings
- Create a pull request, tagging the issue, and bug me until I can merge your pull request. Also, don't forget to add yourself to AUTHORS.
Download files
Source Distribution
File details
Details for the file py_ocr_pdf-0.0.1.tar.gz.
File metadata
- Download URL: py_ocr_pdf-0.0.1.tar.gz
- Upload date:
- Size: 21.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 29a6a0928d081dc449d8e7eff56bfe008e6c70f79828dfd8def8e95e4685298a |
| MD5 | 91c1b5926b0ff9807136e03889534f58 |
| BLAKE2b-256 | 897579228490f98d298d374c35a5c920186aa4df54e8c5510cdd3401c89c625a |
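If you download the source distribution by hand, you can compare it against the published SHA256 digest. The following is a minimal sketch using the standard library's hashlib. The helper names sha256_of and matches_published_digest are hypothetical, and the file path is whatever location you saved the archive to.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_published_digest("py_ocr_pdf-0.0.1.tar.gz", "29a6a0928d081dc449d8e7eff56bfe008e6c70f79828dfd8def8e95e4685298a") should return True for an intact download of that file.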