Generate Document Running Workflow (GENDORUWO): a document extraction framework
GENDORUWO 👻
Generate Document Running Workflow
GENDORUWO is a lightweight document text extraction framework built in Python.
Features
- 📄 PDF Text Extraction: Extract text content from PDF documents
- 📝 DOCX Text Extraction: Extract text from Word documents
- 📊 XLSX Text Extraction: Extract text from Excel spreadsheets
- 📎 Multi-format Support: Extensible architecture for adding more document formats
- ⚡ Workflow-based: Define extraction workflows via YAML configuration
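One common way to get the kind of extensibility described above is a registry keyed by file suffix. The sketch below is illustrative only and is not taken from GENDORUWO's source: `EXTRACTORS`, `register`, `extract_txt`, and the module-level `extract` are hypothetical names, and a plain-text extractor stands in for the real PDF, DOCX, and XLSX handlers.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: lowercase file suffix -> function returning the file's text.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {}


def register(suffix: str):
    """Decorator that registers an extractor for a suffix such as '.txt'."""
    def decorator(func: Callable[[Path], str]) -> Callable[[Path], str]:
        EXTRACTORS[suffix.lower()] = func
        return func
    return decorator


@register(".txt")
def extract_txt(path: Path) -> str:
    # Stand-in extractor; a real one would parse PDF, DOCX, or XLSX content.
    return path.read_text(encoding="utf-8")


def extract(path: str) -> str:
    """Dispatch to the extractor registered for the file's suffix."""
    p = Path(path)
    try:
        extractor = EXTRACTORS[p.suffix.lower()]
    except KeyError:
        raise ValueError(f"unsupported format: {p.suffix!r}") from None
    return extractor(p)
```

With this shape, supporting a new format means adding one decorated function; the dispatch code does not change.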
Installation
pip install gendoruwo
Quick Start
CLI Usage
# Extract text from a single document
gendoruwo extract document.pdf
# Run a workflow
gendoruwo run workflow.yaml
# Validate a workflow file
gendoruwo validate workflow.yaml
# Initialize a sample workflow
gendoruwo init
Python API
from gendoruwo import Gendoruwo
gd = Gendoruwo()
# Extract text from a document
text = gd.extract("document.pdf")
print(text)
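To process several files, the same call can be wrapped in a small loop. This is a minimal sketch under stated assumptions: `extract_many` is a hypothetical helper, not part of GENDORUWO, and it relies only on `extract(path)` returning a string as shown above. The exceptions that `extract` may raise are not documented here, so the helper catches `Exception` broadly and records the message.

```python
from typing import Callable, Dict, Iterable, Tuple


def extract_many(
    extract: Callable[[str], str],
    paths: Iterable[str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run `extract` on each path and return (texts, errors), both keyed by path."""
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        try:
            texts[path] = extract(path)
        except Exception as exc:  # assumption: real exception types are unknown
            errors[path] = f"{type(exc).__name__}: {exc}"
    return texts, errors
```

Assuming the API above, it could be called as `texts, errors = extract_many(Gendoruwo().extract, ["a.pdf", "b.docx"])`, so that one unreadable file does not stop the rest of the batch.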
Workflow YAML Format
name: "Extract Contracts"
input:
  paths:
    - "docs/contract1.pdf"
    - "docs/contract2.docx"
  recursive: true
output:
  directory: "extracted/"
  format: "text"  # text | markdown | json
options:
  encoding: "utf-8"
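The shape of the workflow above can be checked before running it. The following is a rough sketch, not the logic behind `gendoruwo validate`: `check_workflow` and `ALLOWED_FORMATS` are hypothetical names, the rules are inferred only from the sample file, and the function takes an already parsed mapping (for example from a YAML loader) so that it needs no third-party packages.

```python
from typing import Any, Dict, List

# Taken from the comment in the sample workflow; assumed to be the full set.
ALLOWED_FORMATS = ("text", "markdown", "json")


def check_workflow(cfg: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed workflow mapping."""
    problems: List[str] = []

    name = cfg.get("name")
    if not isinstance(name, str) or not name:
        problems.append("name: expected a non-empty string")

    inp = cfg.get("input")
    paths = inp.get("paths") if isinstance(inp, dict) else None
    if not isinstance(paths, list) or not paths:
        problems.append("input.paths: expected a non-empty list")
    elif not all(isinstance(p, str) for p in paths):
        problems.append("input.paths: every entry should be a string")

    out = cfg.get("output")
    if not isinstance(out, dict):
        problems.append("output: expected a mapping")
    else:
        if not isinstance(out.get("directory"), str):
            problems.append("output.directory: expected a string")
        fmt = out.get("format")
        if fmt is not None and fmt not in ALLOWED_FORMATS:
            problems.append(f"output.format: {fmt!r} is not one of {ALLOWED_FORMATS}")

    return problems
```

An empty list means the mapping matches the sample's structure; it says nothing about whether the listed files exist.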
Development
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
# Run linting
flake8 src/ tests/
# Run tox
tox
License
MIT License - see LICENSE for details.