A Python package to compare files (PDF, docx, images) and generate reports in txt, html, or PDF format

PDFCompare

PDFCompare is a Python package for comparing files of several types, including PDF, DOCX, and scanned images. It generates detailed difference reports that can be exported in TXT, HTML, or PDF format. The package uses PyMuPDF to parse PDFs, python-docx to parse DOCX files, and pytesseract for OCR on images. As of version 0.2.0, it also preprocesses images with OpenCV to improve OCR accuracy.

Features

  • Compare multiple file types: PDF, DOCX, and scanned image files.
  • Export comparison reports: Generate and save reports in TXT, HTML, or PDF formats.
  • OCR for image files: Supports text extraction from scanned PDFs or images using pytesseract with advanced preprocessing.
  • Advanced image preprocessing: Leverage OpenCV for binarization, noise removal, and other image enhancements to improve OCR accuracy.
  • Easy-to-use CLI: Run comparisons via the command line or integrate into your own Python applications.

Installation

Python Requirements

  • Python 3.7+

External Dependencies

The following external dependencies are required for OCR, image preprocessing, and PDF report generation:

  1. Tesseract OCR: For extracting text from images or scanned PDFs.
  2. wkhtmltopdf: For converting HTML reports into PDFs.
  3. OpenCV: For image preprocessing before OCR.

Installing Tesseract

Linux (Debian/Ubuntu)

sudo apt-get update
sudo apt-get install tesseract-ocr

macOS

If you have Homebrew installed, run:

brew install tesseract

Windows

Download the Tesseract installer from the official Tesseract repository and follow the installation instructions.

Installing wkhtmltopdf

Linux (Debian/Ubuntu)

sudo apt-get update
sudo apt-get install wkhtmltopdf

macOS

Using Homebrew:

brew install wkhtmltopdf

Windows

Download the Windows installer from the official wkhtmltopdf downloads page and run it.

Installing OpenCV

To install OpenCV for image preprocessing, run:

pip install opencv-python

Installing the pdfcompare Package

Once all dependencies are installed, you can install pdfcompare via pip:

pip install pdfcompare
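
Before running a comparison, it can help to confirm that the external tools are actually reachable. The following is a minimal sketch, not part of pdfcompare: the function name is hypothetical, and it assumes the executables are named tesseract and wkhtmltopdf (as in the install commands above) and that OpenCV is importable as cv2.

```python
import importlib.util
import shutil

# Assumed names: the executables installed by the commands above, and
# "cv2", the import name provided by the opencv-python package.
REQUIRED_EXECUTABLES = ("tesseract", "wkhtmltopdf")
REQUIRED_MODULES = ("cv2",)


def find_missing_dependencies(executables=REQUIRED_EXECUTABLES,
                              modules=REQUIRED_MODULES):
    """Return the names of executables and modules that cannot be found."""
    # shutil.which returns None when the program is not on PATH.
    missing = [name for name in executables if shutil.which(name) is None]
    # find_spec returns None when a top-level module is not installed.
    missing += [name for name in modules
                if importlib.util.find_spec(name) is None]
    return missing


if __name__ == "__main__":
    missing = find_missing_dependencies()
    print("Missing:", ", ".join(missing) if missing else "none")
```

An empty result only means the tools were found; it does not check their versions.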

Usage

Command-Line Interface (CLI)

pdfcompare provides an intuitive command-line interface for comparing files and generating reports.

Basic Syntax

pdfcompare file1 file2 --output txt
pdfcompare file1 file2 --output html
pdfcompare file1 file2 --output pdf

Example

pdfcompare document1.pdf document2.docx --output html

This command compares document1.pdf and document2.docx, and saves the comparison result as an HTML report.

Options

  • file1, file2: Paths to the files you want to compare.
  • --output: Specify the format for the report (options: txt, html, pdf).
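
If you want to drive the CLI from a script, one option is to build the command shown above and run it with subprocess. This is a sketch under assumptions: build_command and run_comparison are hypothetical helpers, and the sketch assumes the pdfcompare executable is on PATH and exits with a non-zero status on failure, which this README does not state.

```python
import subprocess

# The three report formats listed under Options.
VALID_FORMATS = ("txt", "html", "pdf")


def build_command(file1, file2, output_format="html"):
    """Build the argument list for: pdfcompare file1 file2 --output FORMAT."""
    if output_format not in VALID_FORMATS:
        raise ValueError(
            f"unsupported format {output_format!r}; choose from {VALID_FORMATS}"
        )
    return ["pdfcompare", str(file1), str(file2), "--output", output_format]


def run_comparison(file1, file2, output_format="html"):
    """Run the CLI; raises CalledProcessError if it exits non-zero (assumed)."""
    return subprocess.run(
        build_command(file1, file2, output_format),
        check=True,
        capture_output=True,
        text=True,
    )
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with paths that contain spaces.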

Advanced Image Preprocessing for OCR

The pdfcompare package now supports advanced image preprocessing using OpenCV to improve OCR accuracy. This includes steps like binarization, noise removal, and other enhancements before performing text extraction.
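
To make these steps concrete, here is a dependency-free sketch of grayscale conversion, median-filter noise removal, and fixed-threshold binarization on a tiny image stored as nested lists. It is an illustration only, not pdfcompare's implementation: the function names, the 3x3 window, and the threshold of 128 are assumptions. In OpenCV, comparable operations are available as cv2.cvtColor, cv2.medianBlur, and cv2.threshold.

```python
from statistics import median_low


def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to 0-255 luminance values."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_image
    ]


def median_filter(gray, radius=1):
    """Replace each pixel with the median of its neighbourhood.

    Windows are truncated at the image border rather than padded.
    """
    height, width = len(gray), len(gray[0])
    filtered = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                gray[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            # median_low always returns a value that occurs in the window.
            row.append(median_low(window))
        filtered.append(row)
    return filtered


def binarize(gray, threshold=128):
    """Map pixels at or above the threshold to 255 and the rest to 0."""
    return [[255 if pixel >= threshold else 0 for pixel in row] for row in gray]


# A white 3x3 image with one black "speck" in the centre.
WHITE, BLACK = (255, 255, 255), (0, 0, 0)
image = [[WHITE] * 3, [WHITE, BLACK, WHITE], [WHITE] * 3]
cleaned = binarize(median_filter(to_grayscale(image)))
```

In this example the isolated speck is removed by the median filter, so every pixel in cleaned is 255.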

Programmatic Usage

pdfcompare can be used as a Python module within your code.

from pdfcompare.cli import compare_files

file1 = "path/to/file1.pdf"
file2 = "path/to/file2.docx"
output_format = "pdf"  # Choose from 'txt', 'html', or 'pdf'

compare_files(file1, file2, output_format)

To extract text from an image directly:

from pdfcompare.file_handlers.image_handler import extract_text

text = extract_text("path/to/your/image.png")
print(text)
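
If you need reports in more than one format, a small wrapper can call the comparison function once per format. The sketch below is hypothetical: compare_in_formats is not part of pdfcompare, and it takes the comparison function as an argument so that it only assumes the compare_files(file1, file2, output_format) call shape shown above.

```python
def compare_in_formats(file1, file2, compare, formats=("txt", "html", "pdf")):
    """Call compare(file1, file2, fmt) for each format.

    Returns a dict mapping each format that failed to the exception raised,
    so one failing format does not stop the others from being generated.
    """
    failures = {}
    for fmt in formats:
        try:
            compare(file1, file2, fmt)
        except Exception as exc:  # broad on purpose: record and continue
            failures[fmt] = exc
    return failures
```

With the package installed, you would pass compare_files (imported from pdfcompare.cli) as the compare argument and inspect the returned dict for any formats that failed.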

Testing

To run unit tests, first install the development dependencies, and then use:

python -m unittest discover tests/

Test coverage:

  • Text extraction: From PDFs, DOCX files, and images.
  • File comparison logic: Verifies that differences between file contents are detected accurately and consistently.
  • Report generation: Tests for TXT, HTML, and PDF formats.
  • Image preprocessing: Tests the effectiveness of OpenCV preprocessing for OCR.
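
As an illustration of what a comparison-logic test could look like, here is a self-contained unittest sketch. It does not reproduce the project's actual tests: diff_lines is a hypothetical stand-in built on the standard library's difflib, not pdfcompare's comparison function.

```python
import difflib
import unittest


def diff_lines(text_a, text_b):
    """Hypothetical stand-in: return only the added and removed lines."""
    delta = difflib.ndiff(text_a.splitlines(), text_b.splitlines())
    # ndiff prefixes removed lines with "- " and added lines with "+ ".
    return [line for line in delta if line.startswith(("- ", "+ "))]


class ComparisonLogicTest(unittest.TestCase):
    def test_identical_texts_have_no_differences(self):
        self.assertEqual(diff_lines("alpha\nbeta", "alpha\nbeta"), [])

    def test_changed_line_is_reported(self):
        self.assertEqual(
            diff_lines("alpha\nbeta", "alpha\ngamma"),
            ["- beta", "+ gamma"],
        )
```

Saved under tests/ with a name such as test_comparison.py (an assumed file name), a module like this would be picked up by the discover command above.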

Contributing

  1. Fork the repository.
  2. Create your feature branch (git checkout -b feature/your-feature).
  3. Commit your changes (git commit -am 'Add new feature').
  4. Push to the branch (git push origin feature/your-feature).
  5. Open a new Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Changelog

Version 0.2.0

  • Added advanced image preprocessing (grayscale, binarization, and noise removal) using OpenCV to improve OCR accuracy.
  • Modularized the extract_text function for better maintainability.

Upgrading

To upgrade to the latest version:

pip install pdfcompare --upgrade
