
Opinionated and Sophisticated Document Region Analyzer.


Docproc

A Python-based document region analyzer and content extraction tool.

[!WARNING]
The project is under active development, so most features are not implemented yet. This README describes the intended scope of the project.

Overview

Docproc is an opinionated document region analyzer that helps extract text, equations, images and handwriting from documents. It provides both a library interface and a command-line tool.

Installation

# Using pip
pip install docproc

Usage

As a Command-line Tool

# Basic usage
docproc input.pdf

# Specify output format and file
docproc input.pdf -w csv -o output.csv
docproc input.pdf -w sqlite -o database.db
docproc input.pdf -w json -o output.json

# Extract only specific region types
docproc input.pdf --regions text equation
docproc input.pdf -r text image  # Short form

# Enable verbose logging
docproc input.pdf -v
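
When docproc is driven from a script (for example, to process a folder of PDFs), it can help to assemble the argument list in one place. The sketch below uses only the flags listed above (`-w`, `-o`, `--regions`, `-v`). The helper name `build_docproc_command` is hypothetical, and combining all flags in a single call is an assumption, since the README shows them separately.

```python
import shlex
from typing import List, Optional, Sequence


def build_docproc_command(
    input_path: str,
    writer: Optional[str] = None,
    output_path: Optional[str] = None,
    regions: Optional[Sequence[str]] = None,
    verbose: bool = False,
) -> List[str]:
    """Assemble a docproc argument list from the flags shown in the README."""
    command = ["docproc", input_path]
    if writer is not None:
        command += ["-w", writer]  # output format: csv, sqlite or json
    if output_path is not None:
        command += ["-o", output_path]  # output file
    if regions:
        command += ["--regions", *regions]  # e.g. text, equation, image
    if verbose:
        command.append("-v")  # verbose logging
    return command


command = build_docproc_command(
    "input.pdf",
    writer="json",
    output_path="output.json",
    regions=["text", "equation"],
)
# Print the command as a single shell-quoted string for inspection.
print(shlex.join(command))
```

Once docproc is installed, the resulting list can be passed to `subprocess.run(command, check=True)`.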

Supported output formats:

  • CSV (default)
  • SQLite
  • JSON
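
The README does not document the schema that the SQLite writer produces, so a first look at `database.db` is best done without assuming table or column names. The following sketch relies only on Python's standard `sqlite3` module: it lists whatever tables exist and counts their rows. The function name `count_rows_per_table` is illustrative and not part of docproc.

```python
import sqlite3
from typing import Dict


def count_rows_per_table(database_path: str) -> Dict[str, int]:
    """Return {table_name: row_count} for every user table in the database."""
    # Note: sqlite3.connect creates an empty file if the path does not exist.
    connection = sqlite3.connect(database_path)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
        # Skip SQLite's internal bookkeeping tables.
        table_names = [name for (name,) in rows if not name.startswith("sqlite_")]

        counts: Dict[str, int] = {}
        for name in table_names:
            # Quote the identifier so unusual table names are handled safely.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = connection.execute(
                f"SELECT COUNT(*) FROM {quoted}"
            ).fetchone()[0]
        return counts
    finally:
        connection.close()
```

Calling `count_rows_per_table("database.db")` after a run with `-w sqlite` shows which tables were written and how many rows each holds.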

As a Library

from docproc.doc.analyzer import DocumentAnalyzer
from docproc.writer import CSVWriter

# Using context manager (recommended)
with DocumentAnalyzer("input.pdf", CSVWriter, output_path="output.csv") as analyzer:
    regions = analyzer.detect_regions()
    analyzer.export_regions()
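
The README recommends the context-manager form but does not say what the analyzer does on exit. The general guarantee that `with` provides in Python can be illustrated with a stand-in class. `StubAnalyzer` below is hypothetical and is not docproc's `DocumentAnalyzer`; it only shows that cleanup runs even when the body raises.

```python
class StubAnalyzer:
    """Hypothetical stand-in used for illustration; NOT docproc's implementation."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.is_open = False

    def __enter__(self) -> "StubAnalyzer":
        self.is_open = True  # acquire the resource (e.g. open the document)
        return self

    def __exit__(self, exc_type, exc_value, traceback) -> bool:
        self.is_open = False  # runs on normal exit and when the body raises
        return False  # do not swallow the exception


analyzer = StubAnalyzer("input.pdf")
try:
    with analyzer:
        raise ValueError("simulated failure during analysis")
except ValueError:
    pass

# The resource was released despite the error inside the with-block.
print(analyzer.is_open)  # False
```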

Roadmap

The following features are planned for upcoming releases:

  • Handwriting Recognition: Detect and extract handwritten content from documents

Development

# Install the project and its dependencies into a local virtual environment
uv sync

Contributing

Pull requests are welcome. Please ensure tests pass before submitting.

Contact

For questions, feedback, or suggestions, please contact the author at hi@rithul.dev.
