DocForge 🔨
Forge perfect documents from any format with precision, power, and simplicity.
DocForge is a document processing toolkit with a modular architecture. It covers OCR, PDF optimization, batch processing, and document analysis, and is available both as a Python API and as a command-line tool.
✨ Features
- 🔍 OCR Processing: Convert scanned PDFs to searchable documents with precision
- 🗜️ Smart Optimization: Reduce file sizes without compromising quality
- ⚙️ Batch Processing: Handle hundreds of documents efficiently
- 🔧 Document Analysis: Extract insights and metadata
- 🎯 Modular Design: Use only what you need, extend easily
🚀 Why DocForge?
- Battle-tested OCR algorithms with Windows compatibility
- Advanced optimization techniques from real-world usage
- Memory-efficient batch processing for large-scale operations
- Clean, modular codebase that's easy to understand and extend
- Comprehensive error handling and logging
- Both programmatic API and command-line interface
📦 Installation
Option 1: Install from PyPI
pip install docforge
Option 2: Install from source
git clone https://github.com/oscar2song/docforge.git
cd docforge
pip install -e .
System Dependencies
Ubuntu/Debian:
sudo apt-get install tesseract-ocr poppler-utils
macOS:
brew install tesseract poppler
Windows: Download Tesseract from: https://github.com/tesseract-ocr/tesseract
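Before running OCR, it can help to confirm that the system dependencies are reachable. The following is a minimal sketch, not part of DocForge itself. It assumes the executables are named `tesseract` and `pdftoppm` (one of the tools shipped with Poppler) and that they are expected on `PATH`; the helper name `find_missing_tools` is hypothetical.

```python
import shutil

# Assumed executable names: `tesseract` for Tesseract OCR and `pdftoppm`
# for Poppler. Adjust the tuple if your platform names them differently.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


missing = find_missing_tools()
if missing:
    print("Missing system dependencies:", ", ".join(missing))
else:
    print("All system dependencies found.")
```

On Windows, an installed Tesseract may still be reported as missing until its install directory is added to `PATH`.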
🎯 Quick Start
Command Line Interface
After installation, use the docforge command:
# Get help
docforge --help
# OCR a scanned PDF
docforge enhanced-ocr -i scanned_document.pdf -o searchable_document.pdf
# Batch OCR processing
docforge enhanced-batch-ocr -i scanned_folder/ -o searchable_folder/
# Standard OCR processing
docforge ocr -i document.pdf -o output.pdf --language eng
# Standard batch OCR processing
docforge batch-ocr -i input_folder/ -o output_folder/
# Test the interface
docforge test-rich
# Run performance benchmarks
docforge benchmark --test-files document.pdf
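The commands above can also be driven from a script. The sketch below builds the argument list for `docforge ocr` using only the flags shown in this section (`-i`, `-o`, `--language`) and runs it with `subprocess`. The helper names `build_ocr_command` and `run_ocr` are hypothetical, and the sketch assumes the CLI exits with a non-zero status on failure, which this README does not state.

```python
import subprocess


def build_ocr_command(input_pdf, output_pdf, language=None):
    """Build the argument list for the `docforge ocr` command."""
    command = ["docforge", "ocr", "-i", str(input_pdf), "-o", str(output_pdf)]
    if language is not None:
        command += ["--language", language]
    return command


def run_ocr(input_pdf, output_pdf, language=None):
    """Run `docforge ocr`; raises CalledProcessError on a non-zero exit code.

    Assumption: the CLI signals failure through its exit status.
    """
    command = build_ocr_command(input_pdf, output_pdf, language)
    return subprocess.run(command, check=True)


print(build_ocr_command("document.pdf", "output.pdf", language="eng"))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.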
Programmatic API
from docforge import DocumentProcessor

# Initialize the processor
processor = DocumentProcessor(verbose=True)

# OCR a scanned PDF
result = processor.ocr_pdf(
    "scanned_document.pdf",
    "searchable_document.pdf",
    language="eng",
)

# Optimize PDF size
result = processor.optimize_pdf(
    "large_document.pdf",
    "optimized_document.pdf",
    optimization_type="aggressive",
)

# Batch processing
result = processor.batch_ocr_pdfs(
    "scanned_folder/",
    "searchable_folder/",
)
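When one bad file should not stop a whole run, the single-file call can be wrapped in a loop. This is a minimal sketch rather than DocForge's own batch logic: the helper `ocr_folder` is hypothetical, it relies only on the `ocr_pdf(input, output, language=...)` call shown above, and it assumes that failures surface as Python exceptions, which this README does not confirm.

```python
from pathlib import Path


def ocr_folder(processor, input_dir, output_dir, language="eng"):
    """OCR every PDF in input_dir one at a time, collecting failures.

    `processor` is any object with an `ocr_pdf(input, output, language=...)`
    method, for example a DocumentProcessor instance.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    succeeded, failed = [], {}
    for pdf in sorted(input_dir.glob("*.pdf")):
        target = output_dir / pdf.name
        try:
            processor.ocr_pdf(str(pdf), str(target), language=language)
        except Exception as exc:  # assumption: errors are raised as exceptions
            failed[pdf.name] = str(exc)
        else:
            succeeded.append(pdf.name)
    return succeeded, failed
```

A call such as `ocr_folder(processor, "scanned_folder/", "searchable_folder/")` returns the list of processed file names and a dict mapping each failed file name to its error message.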
🏗️ Architecture
DocForge is built with a clean, modular architecture:
docforge/
├── core/ # Core processing engine
├── pdf/ # PDF operations (proven implementations)
├── cli/ # Command-line interface
├── utils/ # Shared utilities
└── main.py # CLI entry point
📋 Available Commands
| Command | Description |
|---|---|
| enhanced-ocr | OCR with advanced performance optimization |
| enhanced-batch-ocr | Batch OCR with intelligent performance optimization |
| ocr | Standard OCR processing |
| batch-ocr | Standard batch OCR processing |
| optimize | PDF optimization |
| pdf-to-word | PDF to Word conversion |
| split-pdf | Split PDF documents |
| benchmark | Run performance benchmarks |
| perf-stats | Display performance statistics |
| test-rich | Test Rich CLI interface |
🧪 Examples
Run the examples to see DocForge in action:
# Basic usage examples (if you have example files)
python examples/basic_usage.py
# Test the CLI interface
docforge test-rich
# Test error handling
docforge test-errors
# Test validation system
docforge test-validation
🤝 Contributing
We welcome contributions! The modular architecture makes it easy to add new features.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
🗺️ Roadmap
- ✅ Core PDF processing with proven implementations
- ✅ OCR and optimization capabilities
- ✅ Command-line interface
- ✅ Comprehensive documentation
- 📄 Word document processing (Word ↔ PDF conversion)
- 🎨 Modern GUI interface
- 🚀 Performance optimizations
- 📊 Excel and PowerPoint support
- 🤖 AI-powered document analysis
- 🌐 Web interface
📄 License
This project is licensed under the MIT License.
🏆 Acknowledgments
Built with proven implementations and enhanced with modern architecture for the open source community.
⭐ If DocForge helped you, please give it a star! ⭐
Built by craftsmen, for craftsmen. 🔨
File details
Details for the file docforge-0.1.0.tar.gz.
File metadata
- Download URL: docforge-0.1.0.tar.gz
- Upload date:
- Size: 80.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 36e6e2603953995b6c98ae23a1a4955ed4c231c5bbc19e5ae94a911175d2b40d |
| MD5 | c48d61a539b46a3d219b326d90a14528 |
| BLAKE2b-256 | a86e42ae2199b8df49765285aa752664b05042967b7a1ccd68a5aeb79b1d8e62 |
File details
Details for the file docforge-0.1.0-py3-none-any.whl.
File metadata
- Download URL: docforge-0.1.0-py3-none-any.whl
- Upload date:
- Size: 83.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.9.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 28194596e3dac1ae07affd3e443f40bee3281ebcf4d179a1d737bd394333f8e1 |
| MD5 | 8d0e645d7b7b6898b23cc2723b5c5088 |
| BLAKE2b-256 | 729c161b79acbd08c7fc7b82268dba7a79bbcbde8a5404291b2acdcbd2080f32 |
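A downloaded file can be checked against the SHA256 digests listed above. The sketch below uses only Python's standard `hashlib`; the helper names `file_digest` and `matches_published_digest` are hypothetical, and the file path in the usage note assumes the archive was saved under its published name in the current directory.

```python
import hashlib


def file_digest(path, algorithm="sha256", chunk_size=65536):
    """Return the hex digest of a file, reading it in chunks."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def matches_published_digest(path, expected_hex, algorithm="sha256"):
    """Compare a file's digest with a published hex digest."""
    return file_digest(path, algorithm) == expected_hex.strip().lower()
```

For example, `matches_published_digest("docforge-0.1.0.tar.gz", "36e6e2603953995b6c98ae23a1a4955ed4c231c5bbc19e5ae94a911175d2b40d")` should return `True` for an intact copy of the source distribution.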