Python package to convert PDF to text using OCR

Project description

py-ocr-pdf

This project lets you OCR PDF files regardless of whether the PDF contains text or images.

Python Support

This project only actively supports current Python versions: Python 3.10 to 3.14.

Installation

You can install this package with pip:

pip install py-ocr-pdf

OS Dependencies

  • poppler-utils
  • tesseract-ocr
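Both tools need to be discoverable on your PATH before OCR can work. The following is a minimal sketch of a pre-flight check, assuming the executables are named pdftoppm (shipped by poppler-utils) and tesseract. The helper name missing_tools is hypothetical and is not part of py-ocr-pdf.

```python
import shutil

# Executable names assumed from the dependency list above:
# poppler-utils ships pdftoppm, tesseract-ocr ships tesseract.
REQUIRED_TOOLS = ("pdftoppm", "tesseract")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing OS dependencies: " + ", ".join(missing))
    else:
        print("All OS dependencies found.")
```

If the list comes back non-empty, install the missing packages using the instructions for your operating system.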

Linux PDF OCR Support

  • Install the dependencies with sudo apt-get install poppler-utils tesseract-ocr

Mac OS PDF OCR Support

This project uses pdftoppm and tesseract, so you need to install both poppler (which provides pdftoppm) and tesseract. With Homebrew:

brew install poppler tesseract
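To illustrate why both tools are needed, here is a rough sketch of a pdftoppm plus tesseract pipeline driven from Python. It is not the py-ocr-pdf implementation or API: the function names, the 300 DPI default, and the "eng" language default are all assumptions made for illustration.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_command(pdf_path, output_prefix, dpi=300):
    """Build the pdftoppm call that renders each page to a PNG file."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(output_prefix)]


def tesseract_command(image_path, lang="eng"):
    """Build the tesseract call that prints recognised text to stdout."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def page_number(image_path):
    """Extract the page number from a name such as page-03.png."""
    return int(Path(image_path).stem.rsplit("-", 1)[1])


def ocr_pdf(pdf_path, dpi=300, lang="eng"):
    """Render every page with pdftoppm, OCR it with tesseract, join the text."""
    with tempfile.TemporaryDirectory() as workdir:
        prefix = Path(workdir) / "page"
        subprocess.run(pdftoppm_command(pdf_path, prefix, dpi), check=True)
        pages = []
        # Sort numerically so page 10 does not come before page 2.
        for image in sorted(Path(workdir).glob("page-*.png"), key=page_number):
            result = subprocess.run(
                tesseract_command(image, lang),
                check=True,
                capture_output=True,
                text=True,
            )
            pages.append(result.stdout)
        return "\n".join(pages)
```

Calling ocr_pdf("scan.pdf") would return the recognised text of a hypothetical scan.pdf, provided both executables are installed and on your PATH.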

Windows OS PDF OCR Support

On Windows you can install pdftoppm with the following steps:

  1. Go to https://github.com/oschwartz10612/poppler-windows
  2. Navigate to the latest release
  3. Download the zip
  4. Unzip and save the files in a new folder
  5. Make sure pdftoppm can be found in that folder, for example by adding the folder that contains pdftoppm.exe to your PATH
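After unzipping, it can help to confirm where pdftoppm.exe actually ended up. The sketch below searches a folder for the executable and, optionally, prepends its location to PATH for the current Python process only. It assumes that pdftoppm is looked up via PATH, which this README does not confirm, and both function names are hypothetical.

```python
import os
from pathlib import Path


def find_pdftoppm(unzipped_folder):
    """Return the folder containing pdftoppm.exe, or None if it is not found."""
    for candidate in Path(unzipped_folder).rglob("pdftoppm.exe"):
        return candidate.parent
    return None


def add_to_path(folder):
    """Prepend a folder to PATH for the current process only."""
    os.environ["PATH"] = str(folder) + os.pathsep + os.environ.get("PATH", "")
```

For a permanent change, edit the PATH environment variable in the Windows system settings instead.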

Found a Bug?

Issues are tracked via GitHub issues at the project issue page.

Have A Feature Request?

Feature requests can be raised by creating an issue on the project issue page. Please start the issue title with "Feature Request -".

Testing

To run the tests use

coverage erase && \
python -W error::DeprecationWarning -W error::PendingDeprecationWarning -m coverage run --parallel -m pytest --ds tests.settings && \
coverage combine && \
coverage report
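The two -W flags in the command above turn DeprecationWarning and PendingDeprecationWarning into errors, so a test that triggers either warning fails instead of passing silently. A small sketch of the same behaviour using the standard warnings module follows. The names legacy_helper and run_strict are hypothetical and exist only for this illustration.

```python
import warnings


def legacy_helper():
    """Hypothetical function that still works but warns about deprecation."""
    warnings.warn("legacy_helper is deprecated", DeprecationWarning, stacklevel=2)
    return "ok"


def run_strict(func):
    """Call func with deprecation warnings escalated to exceptions."""
    with warnings.catch_warnings():
        # Comparable to passing -W error::DeprecationWarning and
        # -W error::PendingDeprecationWarning on the command line.
        warnings.simplefilter("error", DeprecationWarning)
        warnings.simplefilter("error", PendingDeprecationWarning)
        try:
            return ("passed", func())
        except (DeprecationWarning, PendingDeprecationWarning) as exc:
            return ("failed", str(exc))
```

Here run_strict(legacy_helper) reports a failure, while a function that emits no deprecation warnings passes unchanged.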

Compiling Requirements

Run pip install pip-tools, then run python requirements/compile.py to generate the various requirements files. I use two local virtualenvs to build the requirements, one running Python 3.8 and the other running Python 3.11.

Building

This project uses hatchling. To build a source distribution, run

python -m build --sdist

tox

Contributing

  • Check for open issues at the project issue page or open a new issue to start a discussion about a feature or bug.
  • Fork the repository on GitHub to start making changes.
  • Clone the repository.
  • Initialise pre-commit by running pre-commit install.
  • Install requirements from one of the requirement files.
  • Add a test case to show that the bug is fixed or the feature is implemented correctly.
  • Test using python -W error::DeprecationWarning -W error::PendingDeprecationWarning -m coverage run --parallel -m pytest --ds tests.settings
  • Create a pull request that tags the issue, and bug me until I can merge it. Also, don't forget to add yourself to AUTHORS.

Source Distribution

py_ocr_pdf-0.0.1.tar.gz (21.5 kB)

Uploaded Source

File details

Details for the file py_ocr_pdf-0.0.1.tar.gz.

File metadata

  • Download URL: py_ocr_pdf-0.0.1.tar.gz
  • Upload date:
  • Size: 21.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.5

File hashes

Hashes for py_ocr_pdf-0.0.1.tar.gz
Algorithm Hash digest
SHA256 29a6a0928d081dc449d8e7eff56bfe008e6c70f79828dfd8def8e95e4685298a
MD5 91c1b5926b0ff9807136e03889534f58
BLAKE2b-256 897579228490f98d298d374c35a5c920186aa4df54e8c5510cdd3401c89c625a
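The published digests can be used to check a downloaded copy of the source distribution. A minimal sketch using the standard hashlib module is shown below. The SHA256 value is the one listed above, while the function names and the chunk size are illustrative choices.

```python
import hashlib

# SHA256 digest published above for py_ocr_pdf-0.0.1.tar.gz.
EXPECTED_SHA256 = "29a6a0928d081dc449d8e7eff56bfe008e6c70f79828dfd8def8e95e4685298a"


def sha256_of_file(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file without loading it all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest equals the expected value."""
    return sha256_of_file(path) == expected.lower()
```

For example, matches_published_hash("py_ocr_pdf-0.0.1.tar.gz") should return True for an unmodified download.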