A PDF pipeline to convert, OCR, and merge documents.
ocr-my-mess
A complete and modular Python pipeline to convert, OCR, and merge all your documents into a single, searchable PDF.
Pre-built Binaries (Simplest Method)
For the quickest start, pre-built executables for Windows, macOS, and Linux are available for download from the GitHub Releases page. These executables are standalone and do not require Python or any other dependencies to be installed on your system. Simply download the appropriate version for your operating system, extract it, and run.
Features
- Recursive Conversion: Traverses a directory to find all supported files (images, office documents, archives, existing PDFs).
- OCR Processing: Applies OCR to all documents using ocrmypdf to make them text-searchable.
- Hierarchical Merging: Merges all generated PDFs into a single file with a table of contents that mirrors the original folder structure.
- Dual Interfaces: Usable as both a powerful Command-Line Interface (ocr-my-mess-cli) and a simple Graphical User Interface (ocr-my-mess-gui).
- Cross-Platform: Packaged with PyInstaller to run on Windows, macOS, and Linux.
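The recursive discovery and the folder-mirroring table of contents described above can be pictured with a small sketch. This is not the project's actual code: the extension set and the function names (find_supported_files, build_outline, format_outline) are hypothetical, and a real merge step would attach PDF bookmarks rather than print indented text.

```python
from pathlib import Path

# Hypothetical extension set; the real project may support a different list.
SUPPORTED_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg", ".tiff", ".docx", ".odt", ".zip"}


def find_supported_files(root: Path) -> list[Path]:
    """Return supported files under root, as sorted paths relative to root."""
    return sorted(
        path.relative_to(root)
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )


def build_outline(relative_paths: list[Path]) -> dict:
    """Nest files under their folders: folders map to dicts, files map to None."""
    outline: dict = {}
    for rel in relative_paths:
        node = outline
        for folder in rel.parts[:-1]:
            node = node.setdefault(folder, {})
        node[rel.name] = None
    return outline


def format_outline(outline: dict, depth: int = 0) -> list[str]:
    """Render the nested outline as indented lines, one per entry."""
    lines = []
    for name, child in outline.items():
        lines.append("  " * depth + name)
        if child is not None:
            lines.extend(format_outline(child, depth + 1))
    return lines
```

Calling format_outline(build_outline(find_supported_files(Path("/path/to/docs")))) would list each folder once, with its files indented beneath it, which is the shape a hierarchical table of contents needs.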
Installation
There are two main ways to install ocr-my-mess: from Conda or from PyPI.
From Conda (Recommended)
This is the easiest and most reliable way to get started. The Conda environment, defined in the config/conda/environment.yml file, includes all Python dependencies as well as external binaries like Tesseract, Unpaper, and jbig2dec. These Conda builds are often more recent and performant than the ones provided by your operating system's package manager.
- Create and activate the Conda environment:
conda env create -f environment.yml
conda activate ocr-my-mess
- Run the application:
Once the environment is activated, you can run the application directly.
ocr-my-mess
For development: If you want to modify the source code, you can install the project in editable mode after activating the environment:
pip install -e .
Note on LibreOffice: The Conda environment does not include LibreOffice. If you need to convert office documents, you must install it separately on your system (see the PyPI installation section for instructions).
From PyPI
This method requires you to install system dependencies manually before installing the Python package.
- Install System Dependencies
This project relies on several external programs. Please install them using your system's package manager.
Linux (Debian/Ubuntu):
sudo apt-get update
sudo apt-get install -y tesseract-ocr unpaper jbig2dec libreoffice
macOS:
brew install tesseract unpaper jbig2dec
brew install --cask libreoffice
Windows: Installation on Windows is more complex. We recommend using the official ocrmypdf Docker image if possible. Otherwise, you will need to install the following dependencies manually:
- Tesseract OCR: choco install tesseract
- LibreOffice: choco install libreoffice
- Unpaper and jbig2dec: These are not readily available on Chocolatey. Please refer to the ocrmypdf documentation for installation instructions.
Optional Dependencies:
jbig2enc: For better PDF compression. See the ocrmypdf documentation for installation.
- Install ocr-my-mess from PyPI
pip install ocr-my-mess
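Because the PyPI route leaves the external programs to you, it can help to confirm they are on PATH before the first run. The check below is a minimal sketch, not part of ocr-my-mess: the executable names are assumptions (LibreOffice is commonly exposed as soffice, but your installation may use a different name), and missing_tools is a hypothetical helper.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
# LibreOffice is commonly exposed as "soffice".
REQUIRED_TOOLS = ("tesseract", "unpaper", "jbig2dec", "soffice")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All external tools were found.")
```

The which parameter exists only so the lookup can be swapped out in tests; in normal use the default shutil.which is enough.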
Usage
The application can be run in two modes:
- Command-Line Interface (CLI): If you provide any arguments.
- Graphical User Interface (GUI): If you run it without any arguments.
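The rule above amounts to a single check on the argument list. As a rough sketch, with choose_mode being a hypothetical name rather than the project's real entry point:

```python
import sys


def choose_mode(argv: list[str]) -> str:
    """Return "cli" when arguments follow the program name, otherwise "gui"."""
    return "cli" if len(argv) > 1 else "gui"


if __name__ == "__main__":
    # sys.argv[0] is the program name, so only later items count as arguments.
    print(f"Would start in {choose_mode(sys.argv)} mode")
```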
Command-Line Interface (CLI)
The CLI provides several commands, including run, convert, and merge.
# General help
ocr-my-mess --help
# Get version
ocr-my-mess -v
# Run the full pipeline on a directory
ocr-my-mess run --input /path/to/docs --output /path/to/final.pdf --lang en+fr
# Just convert and OCR all documents in a folder
ocr-my-mess convert --input-dir /path/to/docs --output-dir /path/to/output
# Just merge all PDFs in a folder into a single file
ocr-my-mess merge --input-dir /path/to/output --output-file /path/to/final.pdf
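For readers who want to see how the three subcommands and their flags fit together, the following argparse sketch mirrors the invocations above. It is an illustration only: the project may use a different CLI library, build_parser is a hypothetical name, and the default value for --lang is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose subcommands mirror the documented invocations."""
    parser = argparse.ArgumentParser(prog="ocr-my-mess")
    sub = parser.add_subparsers(dest="command", required=True)

    # Full pipeline: convert, OCR, then merge into one PDF.
    run = sub.add_parser("run", help="convert, OCR, and merge in one pass")
    run.add_argument("--input", required=True)
    run.add_argument("--output", required=True)
    run.add_argument("--lang", default="en")  # default is an assumption

    # Convert and OCR only, writing one PDF per source document.
    convert = sub.add_parser("convert", help="convert and OCR documents")
    convert.add_argument("--input-dir", required=True)
    convert.add_argument("--output-dir", required=True)

    # Merge only, combining existing PDFs into a single file.
    merge = sub.add_parser("merge", help="merge PDFs into a single file")
    merge.add_argument("--input-dir", required=True)
    merge.add_argument("--output-file", required=True)
    return parser
```

Parsing ["merge", "--input-dir", "out", "--output-file", "final.pdf"] with this parser yields a namespace whose command is "merge" and whose output_file is "final.pdf", which a dispatcher could then route to the matching pipeline step.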
Graphical User Interface (GUI)
For a more visual approach, you can launch the GUI by running the command without any arguments.
ocr-my-mess
This will open a window allowing you to:
- Select input and output directories.
- Choose OCR languages.
- Run the full pipeline.
- See live logs and progress.
Development
Running Tests
To ensure everything is working correctly, run the automated tests:
pytest
Building Executables
This project uses PyInstaller to create a standalone executable. A build script is provided in the scripts/ directory.
# Build the executable
python scripts/build.py
The executable will be located in the dist/ directory.
Note: When running the GUI from the executable on Windows or macOS, a console window will appear alongside the main application window. This is expected behavior.