A Python package to compare files (PDF, docx, images) and generate reports in txt, html, or PDF format
PDFCompare
PDFCompare is a Python package designed for comparing multiple file types, including PDF, DOCX, and scanned images. It generates detailed difference reports that can be exported in TXT, HTML, and PDF formats. The package utilizes PyMuPDF for parsing PDFs, pytesseract for OCR on images, and python-docx for DOCX parsing. Additionally, it now includes advanced image preprocessing for improved OCR accuracy using OpenCV.
Features
- Compare multiple file types: PDF, DOCX, and scanned image files.
- Export comparison reports: Generate and save reports in TXT, HTML, or PDF formats.
- OCR for image files: Supports text extraction from scanned PDFs or images using pytesseract with advanced preprocessing.
- Advanced image preprocessing: Leverage OpenCV for binarization, noise removal, and other image enhancements to improve OCR accuracy.
- Easy-to-use CLI: Run comparisons via the command line or integrate into your own Python applications.
Installation
Python Requirements
- Python 3.7+
External Dependencies
The following external dependencies are required for OCR, image preprocessing, and PDF report generation:
- Tesseract OCR: For extracting text from images or scanned PDFs.
- wkhtmltopdf: For converting HTML reports into PDFs.
- OpenCV: For image preprocessing before OCR.
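To see which of these dependencies are already available before installing anything, a small check along the following lines can help. This is a sketch and not part of pdfcompare: the helper names `missing_tools` and `missing_modules` are hypothetical, and it assumes the executables are installed under their usual names (`tesseract`, `wkhtmltopdf`) and that OpenCV is importable as `cv2`.

```python
import importlib.util
import shutil

# Usual executable names for the external tools (assumption: default installs).
REQUIRED_TOOLS = ("tesseract", "wkhtmltopdf")
# OpenCV's Python bindings are imported as "cv2".
REQUIRED_MODULES = ("cv2",)


def missing_tools(names=REQUIRED_TOOLS):
    """Return the executables that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


def missing_modules(names=REQUIRED_MODULES):
    """Return the top-level Python modules that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing executables:", missing_tools() or "none")
print("Missing Python modules:", missing_modules() or "none")
```

On Windows, an installer may not add the tool to PATH automatically, so a tool can be reported as missing even though it is installed.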
Installing Tesseract
Linux (Debian/Ubuntu)
sudo apt-get update
sudo apt-get install tesseract-ocr
macOS
If you have Homebrew installed, run:
brew install tesseract
Windows
Download the Tesseract installer from the official repository and follow the installation instructions.
Installing wkhtmltopdf
Linux (Debian/Ubuntu)
sudo apt-get update
sudo apt-get install wkhtmltopdf
macOS
Using Homebrew:
brew install wkhtmltopdf
Windows
Download the Windows installer from the official wkhtmltopdf website and install it.
Installing OpenCV
To install OpenCV for image preprocessing, run:
pip install opencv-python
Installing the pdfcompare Package
Once all dependencies are installed, you can install pdfcompare via pip:
pip install pdfcompare
Usage
Command-Line Interface (CLI)
pdfcompare provides an intuitive command-line interface for comparing files and generating reports.
Basic Syntax
pdfcompare file1 file2 --output txt
pdfcompare file1 file2 --output html
pdfcompare file1 file2 --output pdf
Example
pdfcompare document1.pdf document2.docx --output html
This command compares document1.pdf and document2.docx, and saves the comparison result as an HTML report.
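To give a rough idea of what a text comparison and its TXT or HTML report can look like once text has been extracted, here is a minimal sketch built on Python's standard `difflib` module. It is an illustration only: pdfcompare's internal comparison logic is not documented here and may work differently, and the function names `diff_as_text` and `diff_as_html` are hypothetical.

```python
import difflib


def diff_as_text(old_lines, new_lines, old_name="file1", new_name="file2"):
    """Return a unified diff of two lists of lines as a single string."""
    diff = difflib.unified_diff(
        old_lines, new_lines, fromfile=old_name, tofile=new_name, lineterm=""
    )
    return "\n".join(diff)


def diff_as_html(old_lines, new_lines, old_name="file1", new_name="file2"):
    """Return a complete HTML page with a side-by-side comparison table."""
    return difflib.HtmlDiff().make_file(old_lines, new_lines, old_name, new_name)


# Sample extracted text (made up for this example).
old = ["Invoice 42", "Total: 100 EUR"]
new = ["Invoice 42", "Total: 120 EUR"]

print(diff_as_text(old, new))
```

The HTML variant returns a full page as a string, which could then be written to a file or passed to a converter such as wkhtmltopdf.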
Options
- file1, file2: Paths to the files you want to compare.
- --output: Specify the format for the report (options: txt, html, pdf).
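For readers who want to mirror this interface in their own wrapper script, the options above map naturally onto `argparse`. The following is a sketch under stated assumptions, not pdfcompare's actual parser: the `build_parser` name is hypothetical, and the default of `txt` is an assumption because the real default is not documented here.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the documented arguments."""
    parser = argparse.ArgumentParser(
        prog="pdfcompare",
        description="Compare two files and write a difference report.",
    )
    parser.add_argument("file1", help="Path to the first file")
    parser.add_argument("file2", help="Path to the second file")
    parser.add_argument(
        "--output",
        choices=["txt", "html", "pdf"],
        default="txt",  # assumption: the real default is not documented
        help="Report format",
    )
    return parser


args = build_parser().parse_args(
    ["document1.pdf", "document2.docx", "--output", "html"]
)
print(args.file1, args.file2, args.output)
```

Using `choices` makes the parser reject any format other than the three listed, with a usage message instead of a traceback.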
Advanced Image Preprocessing for OCR
The pdfcompare package now supports advanced image preprocessing using OpenCV to improve OCR accuracy. This includes steps like binarization, noise removal, and other enhancements before performing text extraction.
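To make these steps concrete, the dependency-free sketch below shows what noise removal and binarization do to a tiny grayscale image stored as nested lists. It is purely illustrative: pdfcompare performs these steps with OpenCV (which offers, for example, `cv2.medianBlur` and `cv2.threshold` for this kind of work), and the exact calls, parameters, and ordering it uses are not documented here. The function names, the fixed threshold of 128, and the denoise-then-binarize order are assumptions.

```python
from statistics import median_low


def median_filter(image):
    """Replace each pixel with the median of its 3x3 neighbourhood.

    Neighbours outside the image are ignored, so edge pixels use fewer values.
    """
    height, width = len(image), len(image[0])
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            row.append(median_low(window))
        result.append(row)
    return result


def binarize(image, threshold=128):
    """Map each grayscale pixel (0-255) to pure black (0) or white (255)."""
    return [[255 if pixel >= threshold else 0 for pixel in row] for row in image]


# A bright page with a single dark speck of noise in the centre.
noisy = [
    [200, 200, 200],
    [200, 10, 200],
    [200, 200, 200],
]

cleaned = binarize(median_filter(noisy))
print(cleaned)
```

The median filter removes the isolated speck because it is outvoted by its bright neighbours, and binarization then leaves a clean two-level image, which is generally easier for an OCR engine to read.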
Programmatic Usage
pdfcompare can be used as a Python module within your code.
# Compare two files and generate a report
from pdfcompare.cli import compare_files
file1 = "path/to/file1.pdf"
file2 = "path/to/file2.docx"
output_format = "pdf"  # Choose from 'txt', 'html', or 'pdf'
compare_files(file1, file2, output_format)

# Extract text from an image using OCR
from pdfcompare.file_handlers.image_handler import extract_text
text = extract_text("path/to/your/image.png")
print(text)
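When calling the package from your own code, it can be useful to validate the arguments first so that mistakes surface before any parsing or OCR work begins. The helper below is a hypothetical sketch that is not part of pdfcompare; it only relies on the three output formats documented above.

```python
from pathlib import Path

# The report formats listed in the CLI options.
SUPPORTED_OUTPUTS = {"txt", "html", "pdf"}


def check_inputs(file1, file2, output_format):
    """Raise a descriptive error if the arguments are obviously invalid."""
    if output_format not in SUPPORTED_OUTPUTS:
        raise ValueError(f"Unsupported output format: {output_format!r}")
    for path in (file1, file2):
        if not Path(path).is_file():
            raise FileNotFoundError(f"No such file: {path}")


# Intended use, immediately before the real comparison call:
#   check_inputs(file1, file2, output_format)
#   compare_files(file1, file2, output_format)
```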
Testing
To run unit tests, first install the development dependencies, and then use:
python -m unittest discover tests/
Coverage of Tests:
- Text extraction: From PDFs, DOCX files, and images.
- File comparison logic: Verifies that differences between file contents are detected accurately and consistently.
- Report generation: Tests for TXT, HTML, and PDF formats.
- Image preprocessing: Tests the effectiveness of OpenCV preprocessing for OCR.
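As an illustration of the style of test that `python -m unittest discover` picks up, here is a small self-contained test case. It does not reproduce the project's real tests, which live in tests/ and are not shown here: `changed_lines` is a hypothetical stand-in for the comparison logic, built on the standard `difflib` module.

```python
import difflib
import unittest


def changed_lines(old_lines, new_lines):
    """Stand-in comparison: return the lines that were added or removed."""
    return [
        line
        for line in difflib.ndiff(old_lines, new_lines)
        if line.startswith(("+ ", "- "))
    ]


class ChangedLinesTest(unittest.TestCase):
    def test_identical_inputs_have_no_changes(self):
        self.assertEqual(changed_lines(["a", "b"], ["a", "b"]), [])

    def test_modified_line_is_reported(self):
        result = changed_lines(["a", "b"], ["a", "c"])
        # Order-independent check: one removal and one addition.
        self.assertCountEqual(result, ["- b", "+ c"])
```

Saved as a file whose name starts with `test` inside tests/, a case like this would be discovered and run by the command above.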
Contributing
- Fork the repository.
- Create your feature branch (git checkout -b feature/your-feature).
- Commit your changes (git commit -am 'Add new feature').
- Push to the branch (git push origin feature/your-feature).
- Open a new Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Changelog
Version 0.2.0
- Added advanced image preprocessing (grayscale, binarization, and noise removal) using OpenCV to improve OCR accuracy.
- Modularized the extract_text function for better maintainability.
Installation
To install the latest version:
pip install pdfcompare --upgrade