DocForge 🔨

Forge perfect documents from any format with precision, power, and simplicity.

DocForge is a document processing toolkit with a modular architecture. It covers OCR, PDF optimization, batch processing, and document analysis, and it can be used from Python or from the command line.

✨ Features

🔍 OCR Processing: Convert scanned PDFs to searchable documents with precision
🗜️ Smart Optimization: Reduce file sizes without compromising quality
⚙️ Batch Processing: Handle hundreds of documents efficiently
🔧 Document Analysis: Extract insights and metadata
🎯 Modular Design: Use only what you need, extend easily

🚀 Why DocForge?

Battle-tested OCR algorithms with Windows compatibility
Advanced optimization techniques from real-world usage
Memory-efficient batch processing for large-scale operations
Clean, modular codebase that's easy to understand and extend
Comprehensive error handling and logging
Both programmatic API and command-line interface

📦 Installation

Option 1: Install from PyPI

pip install docforge

Option 2: Install from source

git clone https://github.com/oscar2song/docforge.git
cd docforge
pip install -e .

System Dependencies

Ubuntu/Debian:

sudo apt-get install tesseract-ocr poppler-utils

macOS:

brew install tesseract poppler

Windows: Download Tesseract from: https://github.com/tesseract-ocr/tesseract
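
Before running OCR, it can help to confirm that the system tools are reachable from Python. The following is a minimal sketch, not part of DocForge: `missing_tools` is a hypothetical helper, and probing for `pdftoppm` (one of the executables shipped with poppler) is an assumption, since the README does not say which poppler binary DocForge relies on.

```python
import shutil

# Executables the system packages above are expected to provide.
# Using "pdftoppm" as the probe for poppler is an assumption of this sketch.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]
```

An empty list from `missing_tools()` means both executables were found on `PATH`; any names returned point at a package that still needs installing.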

🎯 Quick Start

Command Line Interface

After installation, use the docforge command:

# Get help
docforge --help

# OCR a scanned PDF
docforge enhanced-ocr -i scanned_document.pdf -o searchable_document.pdf

# Batch OCR processing
docforge enhanced-batch-ocr -i scanned_folder/ -o searchable_folder/

# Standard OCR processing
docforge ocr -i document.pdf -o output.pdf --language eng

# Standard batch OCR processing
docforge batch-ocr -i input_folder/ -o output_folder/

# Test the interface
docforge test-rich

# Run performance benchmarks
docforge benchmark --test-files document.pdf
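
The commands above can also be driven from a script. This is a rough sketch that only uses the `ocr` subcommand and the `-i`, `-o`, and `--language` flags shown above; `build_ocr_command` and `run_ocr` are hypothetical helper names, and the sketch assumes the CLI exits with a non-zero status on failure, which the README does not confirm.

```python
import subprocess


def build_ocr_command(input_pdf, output_pdf, language="eng"):
    """Build the argument list for the `docforge ocr` call shown above."""
    return [
        "docforge", "ocr",
        "-i", str(input_pdf),
        "-o", str(output_pdf),
        "--language", language,
    ]


def run_ocr(input_pdf, output_pdf, language="eng"):
    """Run the command; raises CalledProcessError on a non-zero exit status."""
    cmd = build_ocr_command(input_pdf, output_pdf, language)
    return subprocess.run(cmd, check=True, capture_output=True, text=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.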

Programmatic API

from docforge import DocumentProcessor

# Initialize the processor
processor = DocumentProcessor(verbose=True)

# OCR a scanned PDF
result = processor.ocr_pdf(
    "scanned_document.pdf",
    "searchable_document.pdf",
    language="eng"
)

# Optimize PDF size
result = processor.optimize_pdf(
    "large_document.pdf",
    "optimized_document.pdf",
    optimization_type="aggressive"
)

# Batch processing
result = processor.batch_ocr_pdfs(
    "scanned_folder/",
    "searchable_folder/"
)
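
When a folder is processed repeatedly, it may be useful to skip files that already have an output. The sketch below is one possible approach, not DocForge's own batch logic: `ocr_new_pdfs` is a hypothetical helper, and it assumes only that the processor exposes `ocr_pdf(input, output, language=...)` as in the example above. Nothing is assumed about the return value of `ocr_pdf`.

```python
from pathlib import Path


def ocr_new_pdfs(processor, input_dir, output_dir, language="eng"):
    """OCR each PDF in input_dir that has no counterpart in output_dir.

    `processor` is any object with an ocr_pdf(input, output, language=...)
    method. Returns the list of output paths that were passed to ocr_pdf.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    requested = []
    for src in sorted(input_dir.glob("*.pdf")):
        dst = output_dir / src.name
        if dst.exists():
            continue  # an output with this name already exists, so skip it
        processor.ocr_pdf(str(src), str(dst), language=language)
        requested.append(dst)
    return requested
```

With a `DocumentProcessor` instance, a call would look like `ocr_new_pdfs(processor, "scanned_folder/", "searchable_folder/")`.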

🏗️ Architecture

DocForge is built with a clean, modular architecture:

docforge/
├── core/           # Core processing engine
├── pdf/            # PDF operations (proven implementations)  
├── cli/            # Command-line interface
├── utils/          # Shared utilities
└── main.py         # CLI entry point

📋 Available Commands

Command	Description
`enhanced-ocr`	OCR with advanced performance optimization
`enhanced-batch-ocr`	Batch OCR with intelligent performance optimization
`ocr`	Standard OCR processing
`batch-ocr`	Standard batch OCR processing
`optimize`	PDF optimization
`pdf-to-word`	PDF to Word conversion
`split-pdf`	Split PDF documents
`benchmark`	Run performance benchmarks
`perf-stats`	Display performance statistics
`test-rich`	Test the Rich CLI interface
`test-errors`	Test error handling
`test-validation`	Test the validation system

🧪 Examples

Run the examples to see DocForge in action:

# Basic usage examples (if you have example files)
python examples/basic_usage.py

# Test the CLI interface
docforge test-rich

# Test error handling
docforge test-errors

# Test validation system  
docforge test-validation

🤝 Contributing

We welcome contributions! The modular architecture makes it easy to add new features.

Fork the repository
Create a feature branch
Make your changes
Add tests if applicable
Submit a pull request

🗺️ Roadmap

✅ Core PDF processing with proven implementations
✅ OCR and optimization capabilities
✅ Command-line interface
✅ Comprehensive documentation
📄 Word document processing (Word ↔ PDF conversion)
🎨 Modern GUI interface
🚀 Performance optimizations
📊 Excel and PowerPoint support
🤖 AI-powered document analysis
🌐 Web interface

📄 License

This project is licensed under the MIT License.

🏆 Acknowledgments

Built with proven implementations and enhanced with modern architecture for the open source community.

⭐ If DocForge helped you, please give it a star! ⭐

Built by craftsmen, for craftsmen. 🔨

Project details

Version 0.1.0, released Jul 13, 2025

Download files

Source Distribution

docforge-0.1.0.tar.gz (80.3 kB), uploaded Jul 13, 2025

Built Distribution

docforge-0.1.0-py3-none-any.whl (83.1 kB), uploaded Jul 13, 2025, Python 3

File details

Details for the file docforge-0.1.0.tar.gz.

File metadata

Download URL: docforge-0.1.0.tar.gz
Upload date: Jul 13, 2025
Size: 80.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.6

File hashes

Hashes for docforge-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`36e6e2603953995b6c98ae23a1a4955ed4c231c5bbc19e5ae94a911175d2b40d`
MD5	`c48d61a539b46a3d219b326d90a14528`
BLAKE2b-256	`a86e42ae2199b8df49765285aa752664b05042967b7a1ccd68a5aeb79b1d8e62`

File details

Details for the file docforge-0.1.0-py3-none-any.whl.

File metadata

Download URL: docforge-0.1.0-py3-none-any.whl
Upload date: Jul 13, 2025
Size: 83.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.9.6

File hashes

Hashes for docforge-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`28194596e3dac1ae07affd3e443f40bee3281ebcf4d179a1d737bd394333f8e1`
MD5	`8d0e645d7b7b6898b23cc2723b5c5088`
BLAKE2b-256	`729c161b79acbd08c7fc7b82268dba7a79bbcbde8a5404291b2acdcbd2080f32`
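
The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only the Python standard library; `sha256_of` and `matches_sha256` are hypothetical helper names, and the expected value should be copied from the SHA256 row for the file in question.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_sha256(path, expected_hex):
    """Compare the file's digest with a published one, ignoring case and spaces."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_sha256("docforge-0.1.0.tar.gz", expected)` should return `True` when `expected` holds the SHA256 value listed for the source distribution and the download is intact.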
