A PDF pipeline to convert, OCR, and merge documents.

ocr-my-mess

A complete and modular Python pipeline to convert, OCR, and merge all your documents into a single, searchable PDF.

Features

  • Recursive Conversion: Traverses a directory to find all supported files (images, office documents, archives, existing PDFs).
  • OCR Processing: Applies OCR to all documents using ocrmypdf to make them text-searchable.
  • Hierarchical Merging: Merges all generated PDFs into a single file with a table of contents that mirrors the original folder structure.
  • Dual Interfaces: Usable as both a powerful Command-Line Interface (ocr-my-mess-cli) and a simple Graphical User Interface (ocr-my-mess-gui).
  • Cross-Platform: Packaged with PyInstaller to run on Windows, macOS, and Linux.
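To illustrate the hierarchical merging idea, here is a minimal sketch of how a set of relative PDF paths could be grouped into a nested outline that mirrors the folder structure. It is not taken from the project's source: the function names `build_outline` and `render_outline` are hypothetical, and the real merger may build its table of contents differently.

```python
from pathlib import PurePosixPath


def build_outline(pdf_paths):
    """Group relative PDF paths into a nested dict that mirrors the folders.

    Folders map to dicts, documents map to None (hypothetical layout).
    """
    root = {}
    for raw in sorted(pdf_paths):
        parts = PurePosixPath(raw).parts
        node = root
        for folder in parts[:-1]:
            # Create the folder level on first sight, then descend into it.
            node = node.setdefault(folder, {})
        node[parts[-1]] = None  # leaf entry: one document
    return root


def render_outline(node, depth=0):
    """Flatten the nested dict into indented lines, one per outline entry."""
    lines = []
    for name, child in node.items():
        lines.append("  " * depth + name)
        if child is not None:
            lines.extend(render_outline(child, depth + 1))
    return lines


outline = build_outline(["a/b.pdf", "a/c/d.pdf", "e.pdf"])
print("\n".join(render_outline(outline)))
```

Each folder becomes one outline level and each PDF becomes an entry under its parent folder, which is the shape a PDF bookmark tree would need.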

Installation

Using Conda (Recommended)

This is the easiest way to get started, because Conda installs all dependencies, including Python itself and the latest version of Tesseract.

# 1. Create the conda environment
conda env create -f environment.yml

# 2. Activate the environment
conda activate ocr-my-mess

# 3. Install the project in editable mode
pip install -e .

Using Pip

# 1. Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate

# 2. Install dependencies
pip install -r requirements.txt

# 3. Install the project in editable mode
pip install -e .

Note: For office document conversion, you must have LibreOffice installed and available in your system's PATH.
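A quick way to confirm that LibreOffice is reachable is to look it up on the PATH from Python. The sketch below assumes the launcher is named `soffice` or `libreoffice`, which are the usual names but may differ on your system. The helper `find_libreoffice` is hypothetical and not part of ocr-my-mess.

```python
import shutil


def find_libreoffice(candidates=("soffice", "libreoffice")):
    """Return the path of the first LibreOffice launcher found on PATH, else None.

    The candidate names are assumptions; adjust them for your installation.
    """
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None


location = find_libreoffice()
if location is None:
    print("LibreOffice not found on PATH; office document conversion will likely fail.")
else:
    print(f"LibreOffice launcher found at: {location}")
```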

Usage

The application can be run in two modes:

  • Command-Line Interface (CLI): If you provide any arguments.
  • Graphical User Interface (GUI): If you run it without any arguments.
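The mode selection described above can be sketched as a check on the argument list. This is only an illustration of the rule "arguments mean CLI, no arguments mean GUI". The function `choose_mode` is hypothetical, and the project's actual entry point may be organized differently.

```python
import sys


def choose_mode(argv):
    """Return "cli" when any argument follows the program name, else "gui"."""
    # argv[0] is the program name, so real arguments start at index 1.
    return "cli" if len(argv) > 1 else "gui"


print(choose_mode(["ocr-my-mess"]))            # no arguments
print(choose_mode(["ocr-my-mess", "--help"]))  # one argument
print(choose_mode(sys.argv))                   # the current process
```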

Command-Line Interface (CLI)

The CLI provides several commands, including run, convert and merge.

# General help
ocr-my-mess --help

# Get version
ocr-my-mess -v

# Run the full pipeline on a directory
ocr-my-mess run --input /path/to/docs --output /path/to/final.pdf --lang en+fr

# Just convert and OCR all documents in a folder
ocr-my-mess convert --input-dir /path/to/docs --output-dir /path/to/output

# Just merge all PDFs in a folder into a single file
ocr-my-mess merge --input-dir /path/to/output --output-file /path/to/final.pdf
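If you want to drive the pipeline from a Python script, one option is to call the CLI through `subprocess`. The sketch below only uses the `run` command and the `--input`, `--output`, and `--lang` flags shown above. The helper names are hypothetical, and treating `--lang` as optional is an assumption.

```python
import subprocess


def build_run_command(input_dir, output_pdf, lang=None):
    """Build the argument list for `ocr-my-mess run` (flags as shown above)."""
    cmd = ["ocr-my-mess", "run", "--input", str(input_dir), "--output", str(output_pdf)]
    if lang:
        # Assumption: --lang may be omitted to use the tool's default language.
        cmd += ["--lang", lang]
    return cmd


def run_pipeline(input_dir, output_pdf, lang=None):
    """Run the full pipeline; raises CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_run_command(input_dir, output_pdf, lang), check=True)


print(build_run_command("/path/to/docs", "/path/to/final.pdf", "en+fr"))
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.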

Graphical User Interface (GUI)

For a more visual approach, you can launch the GUI by running the command without any arguments.

ocr-my-mess

This will open a window allowing you to:

  • Select input and output directories.
  • Choose OCR languages.
  • Run the full pipeline.
  • See live logs and progress.

Development

Running Tests

To ensure everything is working correctly, run the automated tests:

pytest

Building Executables

This project uses PyInstaller to create a standalone executable. A build script is provided in the scripts/ directory.

# Build the executable
python scripts/build.py

The executable will be located in the dist/ directory.

Note: When running the GUI from the executable on Windows or macOS, a console window will appear alongside the main application window. This is expected behavior.
