Generate Document Running Workflow (GENDORUWO): a document extraction framework

GENDORUWO 👻

Generate Document Running Workflow

GENDORUWO is a lightweight document text extraction framework built in Python.

Features

  • 📄 PDF Text Extraction: Extract text content from PDF documents
  • 📝 DOCX Text Extraction: Extract text from Word documents
  • 📊 XLSX Text Extraction: Extract text from Excel spreadsheets
  • 📎 Multi-format Support: Extensible architecture for adding more document formats
  • Workflow-based: Define extraction workflows via YAML configuration
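
As an illustration of what an extensible, multi-format design can look like, the sketch below dispatches on file extension through a small registry. It is not GENDORUWO's internal code: `register`, `extract_text`, and the `.txt` handler are hypothetical names chosen for this example, and the real package may organize its extractors differently.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: file suffix -> function that returns the file's text.
_EXTRACTORS: Dict[str, Callable[[Path], str]] = {}


def register(suffix: str):
    """Decorator that registers an extractor for one file suffix."""
    def decorator(func: Callable[[Path], str]) -> Callable[[Path], str]:
        _EXTRACTORS[suffix.lower()] = func
        return func
    return decorator


@register(".txt")
def _extract_plain_text(path: Path) -> str:
    # Plain text needs no parsing; PDF/DOCX/XLSX handlers would be
    # registered the same way under their own suffixes.
    return path.read_text(encoding="utf-8")


def extract_text(path: str) -> str:
    """Look up the extractor for the file's suffix and run it."""
    p = Path(path)
    try:
        extractor = _EXTRACTORS[p.suffix.lower()]
    except KeyError:
        raise ValueError(f"unsupported format: {p.suffix or '(none)'}") from None
    return extractor(p)
```

With this shape, supporting a new format means adding one decorated function; the dispatch code in `extract_text` stays unchanged.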

Installation

pip install gendoruwo

Quick Start

CLI Usage

# Extract text from a single document
gendoruwo extract document.pdf

# Run a workflow
gendoruwo run workflow.yaml

# Validate a workflow file
gendoruwo validate workflow.yaml

# Initialize a sample workflow
gendoruwo init

Python API

from gendoruwo import Gendoruwo

gd = Gendoruwo()

# Extract text from a document
text = gd.extract("document.pdf")
print(text)
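
When several documents need extracting, one unreadable file should not abort the whole batch. The helper below is a minimal sketch of that pattern. `extract_many` is a hypothetical function, not part of GENDORUWO; it accepts any callable that maps a path to text, so the only GENDORUWO call it relies on is the `extract` method shown above. Which exceptions `extract` raises is not documented here, so the sketch catches `Exception` broadly.

```python
from typing import Callable, Dict, Iterable, Tuple


def extract_many(
    extract: Callable[[str], str],
    paths: Iterable[str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run `extract` on each path; return (texts, errors), both keyed by path."""
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            texts[path] = extract(path)
        except Exception as exc:  # broad on purpose: exception types are unknown
            errors[path] = f"{type(exc).__name__}: {exc}"
    return texts, errors


# Usage with GENDORUWO (requires `pip install gendoruwo`):
#   from gendoruwo import Gendoruwo
#   texts, errors = extract_many(Gendoruwo().extract, ["a.pdf", "b.docx"])
```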

Workflow YAML Format

name: "Extract Contracts"
input:
  paths:
    - "docs/contract1.pdf"
    - "docs/contract2.docx"
  recursive: true
output:
  directory: "extracted/"
  format: "text"  # text | markdown | json
options:
  encoding: "utf-8"
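
To make the structure of a workflow file concrete, the sketch below checks a workflow that has already been parsed into a Python dict (for example with a YAML library). It is an assumption-laden illustration, not the logic behind `gendoruwo validate`: `check_workflow` is a hypothetical name, and treating `name`, `input.paths`, and `output.format` as required is a guess based only on the example above.

```python
from typing import Any, Dict, List

# Format values taken from the comment in the example workflow.
ALLOWED_FORMATS = ("text", "markdown", "json")


def check_workflow(workflow: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems: List[str] = []

    name = workflow.get("name")
    if not isinstance(name, str) or not name:
        problems.append("name must be a non-empty string")

    input_section = workflow.get("input")
    paths = input_section.get("paths") if isinstance(input_section, dict) else None
    if not isinstance(paths, list) or not paths:
        problems.append("input.paths must be a non-empty list")

    output_section = workflow.get("output")
    fmt = output_section.get("format") if isinstance(output_section, dict) else None
    if fmt not in ALLOWED_FORMATS:
        problems.append("output.format must be one of: " + ", ".join(ALLOWED_FORMATS))

    return problems


# The example workflow above, as the dict a YAML parser would produce.
workflow = {
    "name": "Extract Contracts",
    "input": {
        "paths": ["docs/contract1.pdf", "docs/contract2.docx"],
        "recursive": True,
    },
    "output": {"directory": "extracted/", "format": "text"},
    "options": {"encoding": "utf-8"},
}

print(check_workflow(workflow))
```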

Development

# Install in development mode
pip install -e ".[dev]"

# Run tests
pytest

# Run linting
flake8 src/ tests/

# Run tox
tox

License

MIT License - see LICENSE for details.

Download files

Source Distribution

gendoruwo-0.0.1.tar.gz (24.3 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2
  • SHA256: d8c92c878645248f86394358a992908440d1986da4fb2e23ab98a11c6bc4b706
  • MD5: a3c514890020b3d2d3480536c2b5be25
  • BLAKE2b-256: 1723192065efdd45879be8b228ae03c2f11688bfa2ec453f6d04173c150bb2c1

Built Distribution

gendoruwo-0.0.1-py3-none-any.whl (4.9 kB)

  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.2
  • SHA256: 06d80e526bcc19ef1bb489aa6ec2caf86849d83d5dadf55113876878f12bb35c
  • MD5: 7133c7684fa35482f0f0e4a0bf01bf70
  • BLAKE2b-256: 1481bcdba9172409dd0c23ebef7c19b823130b19be6ec1b30173853fe9430fb5