Generate Document Running Workflow (GENDORUWO): a document extraction framework

GENDORUWO 👻

Generate Document Running Workflow

GENDORUWO is a lightweight document text extraction framework built in Python.

Features

  • 📄 PDF Text Extraction: Extract text content from PDF documents
  • 📝 DOCX Text Extraction: Extract text from Word documents
  • 📊 XLSX Text Extraction: Extract text from Excel spreadsheets
  • 📎 Multi-format Support: Extensible architecture for adding more document formats
  • Workflow-based: Define extraction workflows via YAML configuration

Installation

pip install gendoruwo

Quick Start

CLI Usage

# Extract text from a single document
gendoruwo extract document.pdf

# Run a workflow
gendoruwo run workflow.yaml

# Validate a workflow file
gendoruwo validate workflow.yaml

# Initialize a sample workflow
gendoruwo init

Python API

from gendoruwo import Gendoruwo

gd = Gendoruwo()

# Extract text from a document
text = gd.extract("document.pdf")
print(text)
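
For more than one file, the single-document call above can be wrapped in a small loop. The following is a minimal sketch, not part of GENDORUWO itself: `extract_many` is a hypothetical helper, and it assumes only what the example above shows, namely that `gd.extract(path)` takes a path and returns the extracted text. Because the exceptions raised by `extract` are not documented here, the helper catches `Exception` broadly and records the message per file.

```python
from typing import Callable, Dict, Iterable, Tuple


def extract_many(
    extract: Callable[[str], str],
    paths: Iterable[str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Call `extract` on each path; return (texts, errors), both keyed by path.

    `extract` is any callable that maps a file path to its text, for example
    the `extract` method of a `Gendoruwo` instance (assumed signature).
    """
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            texts[path] = extract(path)
        except Exception as exc:  # exception types are an unknown, so catch broadly
            errors[path] = f"{type(exc).__name__}: {exc}"
    return texts, errors
```

With the package installed, this could be called as `texts, errors = extract_many(Gendoruwo().extract, ["docs/contract1.pdf", "docs/contract2.docx"])`. Passing the extractor in as a callable keeps one unreadable file from aborting the whole batch and makes the helper easy to test with a stub.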

Workflow YAML Format

name: "Extract Contracts"
input:
  paths:
    - "docs/contract1.pdf"
    - "docs/contract2.docx"
  recursive: true
output:
  directory: "extracted/"
  format: "text"  # text | markdown | json
options:
  encoding: "utf-8"
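
The structure of the sample workflow can also be checked from Python before it is run. The sketch below is an illustration based only on the sample above, not the rules that `gendoruwo validate` applies, which are not documented here. `check_workflow` is a hypothetical helper that takes the workflow as an already-parsed mapping (for instance, the result of PyYAML's `yaml.safe_load`), and it assumes that `input.paths` and `output.directory` are required and that `output.format` must be one of the three values listed in the comment.

```python
from typing import Any, List, Mapping

# Taken from the comment in the sample workflow: text | markdown | json
ALLOWED_FORMATS = {"text", "markdown", "json"}


def check_workflow(config: Mapping[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed workflow (empty if none).

    Which keys are required is an assumption drawn from the sample file.
    """
    problems: List[str] = []

    input_section = config.get("input")
    paths = input_section.get("paths") if isinstance(input_section, Mapping) else None
    if not isinstance(paths, list) or not paths:
        problems.append("input.paths: expected a non-empty list")

    output_section = config.get("output")
    if not isinstance(output_section, Mapping):
        problems.append("output: expected a mapping")
    else:
        if not isinstance(output_section.get("directory"), str):
            problems.append("output.directory: expected a string")
        if output_section.get("format") not in ALLOWED_FORMATS:
            problems.append("output.format: expected text, markdown, or json")

    return problems


# The sample workflow from above, written as the mapping a YAML parser would produce.
sample = {
    "name": "Extract Contracts",
    "input": {
        "paths": ["docs/contract1.pdf", "docs/contract2.docx"],
        "recursive": True,
    },
    "output": {"directory": "extracted/", "format": "text"},
    "options": {"encoding": "utf-8"},
}
problems = check_workflow(sample)
```

Returning a list of messages rather than raising on the first problem lets a caller report every issue in a workflow file at once.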

Development

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
flake8 src/ tests/

# Run tox
tox

License

MIT License - see LICENSE for details.
