Opinionated and Sophisticated Document Region Analyzer.
Docproc
A Python-based document region analyzer and content extraction tool.
[!WARNING]
This project is under active development, so most features are not implemented yet. This README is written to convey the project's scope.
Overview
Docproc is an opinionated document region analyzer that helps extract text, equations and images from documents, with handwriting extraction planned (see Roadmap). It provides both a library interface and a command-line tool.
Installation
# Using pip
pip install docproc
Usage
As a Command-line Tool
# Basic usage
docproc input.pdf
# Specify output format and file
docproc input.pdf -w csv -o output.csv
docproc input.pdf -w sqlite -o database.db
docproc input.pdf -w json -o output.json
# Extract only specific region types
docproc input.pdf --regions text equation
docproc input.pdf -r text image # Short form
# Enable verbose logging
docproc input.pdf -v
Supported output formats:
- CSV (default)
- SQLite
- JSON
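The flags shown above can also be assembled programmatically when docproc is driven from a script. The following is a minimal sketch, not part of docproc itself: `build_docproc_command` is a hypothetical helper name, and it only uses the options documented in this README (`-w`, `-o`, `--regions`, `-v`). Since most features are still under development, treat the actual invocation as untested.

```python
import shlex
from typing import List, Optional, Sequence

# Output formats listed in this README.
SUPPORTED_WRITERS = ("csv", "sqlite", "json")


def build_docproc_command(
    input_path: str,
    writer: Optional[str] = None,
    output: Optional[str] = None,
    regions: Optional[Sequence[str]] = None,
    verbose: bool = False,
) -> List[str]:
    """Build an argument list for the docproc CLI (hypothetical helper)."""
    if writer is not None and writer not in SUPPORTED_WRITERS:
        raise ValueError(f"unsupported writer: {writer!r}")

    cmd = ["docproc", input_path]
    if writer is not None:
        cmd += ["-w", writer]
    if output is not None:
        cmd += ["-o", output]
    if regions:
        cmd += ["--regions", *regions]
    if verbose:
        cmd.append("-v")
    return cmd


if __name__ == "__main__":
    cmd = build_docproc_command(
        "input.pdf", writer="json", output="output.json", regions=["text", "equation"]
    )
    print(shlex.join(cmd))
    # To actually run it (requires docproc to be installed):
    # import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.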
As a Library
from docproc.doc.analyzer import DocumentAnalyzer
from docproc.writer import CSVWriter
# Using context manager (recommended)
with DocumentAnalyzer("input.pdf", CSVWriter, output_path="output.csv") as analyzer:
    regions = analyzer.detect_regions()
    analyzer.export_regions()
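As a rough illustration of why the context-manager form is recommended, the sketch below uses toy stand-ins to show the general Python pattern: cleanup in `__exit__` runs even when the body raises. `ToyAnalyzer` and `ToyWriter` are hypothetical classes written for this example. They are not docproc's implementation, and docproc's actual cleanup behaviour is an assumption here.

```python
class ToyWriter:
    """Hypothetical stand-in for a writer that holds an open output resource."""

    def __init__(self, output_path: str) -> None:
        self.output_path = output_path
        self.closed = False

    def close(self) -> None:
        self.closed = True


class ToyAnalyzer:
    """Hypothetical stand-in mirroring the call shape used in the README."""

    def __init__(self, source: str, writer_cls, output_path: str) -> None:
        self.source = source
        self.writer = writer_cls(output_path)

    def __enter__(self) -> "ToyAnalyzer":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        # Always release the writer, whether or not the body raised.
        self.writer.close()
        return False  # do not swallow exceptions


if __name__ == "__main__":
    try:
        with ToyAnalyzer("input.pdf", ToyWriter, output_path="output.csv") as analyzer:
            raise RuntimeError("simulated failure during analysis")
    except RuntimeError:
        pass
    print(analyzer.writer.closed)
```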
Roadmap
The following features are planned for upcoming releases:
- Handwriting Recognition: Detect and extract handwritten content from documents
Development
# Install the project and its dependencies into a local environment
uv sync
Contributing
Pull requests are welcome. Please ensure tests pass before submitting.
Contact
For any questions, feedback or suggestions, please contact the author at hi@rithul.dev
Download files
File details
Details for the file docproc-1.0.0.tar.gz.
File metadata
- Download URL: docproc-1.0.0.tar.gz
- Upload date:
- Size: 21.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fd81917b079de50f15457ce777899bcf3bc48b2e8f72de2dfc90e1a5ee52b2b6 |
| MD5 | 6413b48274542045d44d08897789a72a |
| BLAKE2b-256 | 174cd679a2f54ec1df9578e7ab57d9a21f7b48bc87b0a0b34788bd9c1b093ba4 |
File details
Details for the file docproc-1.0.0-py3-none-any.whl.
File metadata
- Download URL: docproc-1.0.0-py3-none-any.whl
- Upload date:
- Size: 28.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.6.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 27e8800aea473abf42d8556af86d9453988cd05e2abd3fde1b13651036c33381 |
| MD5 | e85dfc8d67e550c476aa3243f52ced59 |
| BLAKE2b-256 | 95a614adbef3bcb1161b68db6dc2eea270e2cd0a4b06752fca1d8fef4cc7ea43 |
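The published digests can be used to check a downloaded file before installing it. The following is a minimal sketch using only the standard library. The helper names `sha256_of` and `matches_published_digest` are hypothetical, and the expected values are the SHA256 digests listed for the two 1.0.0 files above.

```python
import hashlib
from pathlib import Path

# SHA256 digests as listed in the file details for release 1.0.0.
EXPECTED_SHA256 = {
    "docproc-1.0.0.tar.gz":
        "fd81917b079de50f15457ce777899bcf3bc48b2e8f72de2dfc90e1a5ee52b2b6",
    "docproc-1.0.0-py3-none-any.whl":
        "27e8800aea473abf42d8556af86d9453988cd05e2abd3fde1b13651036c33381",
}


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str) -> bool:
    """True only if the file name is known and its digest matches the listed one."""
    expected = EXPECTED_SHA256.get(Path(path).name)
    return expected is not None and sha256_of(path) == expected


if __name__ == "__main__":
    import sys

    for file_path in sys.argv[1:]:
        status = "OK" if matches_published_digest(file_path) else "MISMATCH"
        print(f"{file_path}: {status}")
```

pip can perform the same check itself through its hash-checking mode (`--require-hashes` with a requirements file), which may be preferable for automated installs.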