DocForge 🔨

Forge perfect documents from any format with precision, power, and simplicity.

DocForge is a document processing toolkit with a modular architecture. It covers OCR for scanned PDFs, file-size optimization, batch processing, and document analysis, and it is usable both as a Python API and as a command-line tool.

✨ Features

  • 🔍 OCR Processing: Convert scanned PDFs to searchable documents with precision
  • 🗜️ Smart Optimization: Reduce file sizes without compromising quality
  • ⚙️ Batch Processing: Handle hundreds of documents efficiently
  • 🔧 Document Analysis: Extract insights and metadata
  • 🎯 Modular Design: Use only what you need, extend easily

🚀 Why DocForge?

  • Battle-tested OCR algorithms with Windows compatibility
  • Advanced optimization techniques from real-world usage
  • Memory-efficient batch processing for large-scale operations
  • Clean, modular codebase that's easy to understand and extend
  • Comprehensive error handling and logging
  • Both programmatic API and command-line interface

📦 Installation

Option 1: Install from PyPI

pip install docforge

Option 2: Install from source

git clone https://github.com/oscar2song/docforge.git
cd docforge
pip install -e .

System Dependencies

Ubuntu/Debian:

sudo apt-get install tesseract-ocr poppler-utils

macOS:

brew install tesseract poppler

Windows: Download Tesseract from https://github.com/tesseract-ocr/tesseract
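
To confirm that the system dependencies are reachable before running OCR, a small check like the one below may help. It is a minimal sketch and not part of DocForge: it assumes that the `tesseract` executable and Poppler's `pdftoppm` are the binaries that must be on PATH, and the helper name `missing_tools` is hypothetical.

```python
import shutil

# Assumption: these are the executables DocForge needs on PATH.
# `tesseract` is the Tesseract CLI; `pdftoppm` ships with Poppler.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("All system dependencies found.")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use, call `missing_tools()` with no arguments.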

🎯 Quick Start

Command Line Interface

After installation, use the docforge command:

# Get help
docforge --help

# OCR a scanned PDF
docforge enhanced-ocr -i scanned_document.pdf -o searchable_document.pdf

# Batch OCR processing
docforge enhanced-batch-ocr -i scanned_folder/ -o searchable_folder/

# Standard OCR processing
docforge ocr -i document.pdf -o output.pdf --language eng

# Standard batch OCR processing
docforge batch-ocr -i input_folder/ -o output_folder/

# Test the interface
docforge test-rich

# Run performance benchmarks
docforge benchmark --test-files document.pdf
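
The same commands can be driven from a script. The sketch below is one possible way to do that, not an official wrapper: it uses only the `enhanced-ocr` and `ocr` subcommands and the `-i`/`-o` flags shown above, the helper names `build_ocr_command` and `run_ocr` are hypothetical, and it assumes that `docforge` exits with a non-zero status on failure.

```python
import subprocess


def build_ocr_command(input_pdf, output_pdf, enhanced=True):
    """Build the argument list for a single-file DocForge OCR run."""
    subcommand = "enhanced-ocr" if enhanced else "ocr"
    return ["docforge", subcommand, "-i", str(input_pdf), "-o", str(output_pdf)]


def run_ocr(input_pdf, output_pdf, enhanced=True):
    """Run the command and report success based on the exit status.

    Assumption: docforge returns a non-zero exit status when OCR fails.
    """
    completed = subprocess.run(build_ocr_command(input_pdf, output_pdf, enhanced))
    return completed.returncode == 0
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.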

Programmatic API

from docforge import DocumentProcessor

# Initialize the processor
processor = DocumentProcessor(verbose=True)

# OCR a scanned PDF
result = processor.ocr_pdf(
    "scanned_document.pdf",
    "searchable_document.pdf",
    language="eng"
)

# Optimize PDF size
result = processor.optimize_pdf(
    "large_document.pdf",
    "optimized_document.pdf",
    optimization_type="aggressive"
)

# Batch processing
result = processor.batch_ocr_pdfs(
    "scanned_folder/",
    "searchable_folder/"
)
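
When one bad file should not abort a whole run, the per-file `ocr_pdf` call can be wrapped in a loop. The following is a minimal sketch under stated assumptions: `ocr_folder` is a hypothetical helper, the processor is any object exposing `ocr_pdf(input, output, language=...)` as in the example above, and failures are assumed to surface as exceptions (the structure of the returned `result` is not documented here, so it is not inspected).

```python
from pathlib import Path


def ocr_folder(processor, input_dir, output_dir, language="eng"):
    """OCR every .pdf file in input_dir, collecting failures instead of stopping.

    Returns (succeeded, failed): a list of file names, and a list of
    (file name, error message) pairs.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    succeeded, failed = [], []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        target = output_dir / pdf.name
        try:
            processor.ocr_pdf(str(pdf), str(target), language=language)
            succeeded.append(pdf.name)
        except Exception as exc:  # assumption: errors are raised, not returned
            failed.append((pdf.name, str(exc)))
    return succeeded, failed
```

With DocForge installed, this could be called as `ocr_folder(DocumentProcessor(verbose=True), "scanned_folder/", "searchable_folder/")`. For large folders, the built-in `batch_ocr_pdfs` shown above is likely the better choice.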

🏗️ Architecture

DocForge is built with a clean, modular architecture:

docforge/
├── core/           # Core processing engine
├── pdf/            # PDF operations (proven implementations)  
├── cli/            # Command-line interface
├── utils/          # Shared utilities
└── main.py         # CLI entry point

📋 Available Commands

Command             Description
enhanced-ocr        OCR with advanced performance optimization
enhanced-batch-ocr  Batch OCR with intelligent performance optimization
ocr                 Standard OCR processing
batch-ocr           Standard batch OCR processing
optimize            PDF optimization
pdf-to-word         PDF to Word conversion
split-pdf           Split PDF documents
benchmark           Run performance benchmarks
perf-stats          Display performance statistics
test-rich           Test Rich CLI interface

🧪 Examples

Run the examples to see DocForge in action:

# Basic usage examples (if you have example files)
python examples/basic_usage.py

# Test the CLI interface
docforge test-rich

# Test error handling
docforge test-errors

# Test validation system  
docforge test-validation

🤝 Contributing

We welcome contributions! The modular architecture makes it easy to add new features.

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests if applicable
  5. Submit a pull request

🗺️ Roadmap

  • ✅ Core PDF processing with proven implementations
  • ✅ OCR and optimization capabilities
  • ✅ Command-line interface
  • ✅ Comprehensive documentation
  • 📄 Word document processing (Word ↔ PDF conversion)
  • 🎨 Modern GUI interface
  • 🚀 Performance optimizations
  • 📊 Excel and PowerPoint support
  • 🤖 AI-powered document analysis
  • 🌐 Web interface

📄 License

This project is licensed under the MIT License.

🏆 Acknowledgments

Built with proven implementations and enhanced with modern architecture for the open source community.


If DocForge helped you, please give it a star!

Built by craftsmen, for craftsmen. 🔨
