Generate Document Running Workflow (GENDORUWO): a document extraction framework
GENDORUWO 👻
Generate Document Running Workflow
GENDORUWO is a lightweight document text extraction framework built in Python.
Features
- 📄 PDF Text Extraction: Extract text content from PDF documents
- 📝 DOCX Text Extraction: Extract text from Word documents
- 📊 XLSX Text Extraction: Extract text from Excel spreadsheets
- 📎 Multi-format Support: Extensible architecture for adding more document formats
- ⚡ Workflow-based: Define extraction workflows via YAML configuration
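The "extensible architecture" bullet does not say how new formats are added. One common way to structure this is a registry that maps file extensions to extractor functions. The sketch below illustrates that pattern only; it is not GENDORUWO's actual plugin API, and `EXTRACTORS`, `register`, `extract_txt`, and `extract_text` are hypothetical names.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: file extension -> function that returns the file's text.
EXTRACTORS: Dict[str, Callable[[Path], str]] = {}


def register(extension: str):
    """Decorator that registers an extractor for one file extension."""
    def decorator(func: Callable[[Path], str]) -> Callable[[Path], str]:
        EXTRACTORS[extension.lower()] = func
        return func
    return decorator


@register(".txt")
def extract_txt(path: Path) -> str:
    # Plain text needs no parsing; a PDF or DOCX extractor would call a parser here.
    return path.read_text(encoding="utf-8")


def extract_text(path) -> str:
    """Dispatch to the extractor registered for the file's extension."""
    path = Path(path)
    try:
        extractor = EXTRACTORS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"unsupported format: {path.suffix!r}") from None
    return extractor(path)
```

With this layout, supporting another format means writing one function and decorating it; the dispatch code stays unchanged.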
Installation
pip install gendoruwo
Quick Start
CLI Usage
# Extract text from a single document
gendoruwo extract document.pdf
# Run a workflow
gendoruwo run workflow.yaml
# Validate a workflow file
gendoruwo validate workflow.yaml
# Initialize a sample workflow
gendoruwo init
Python API
from gendoruwo import Gendoruwo
gd = Gendoruwo()
# Extract text from a document
text = gd.extract("document.pdf")
print(text)
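For more than one file, the single-document call above can be wrapped in a small loop. The helper below is a sketch, not part of the package: `extract_many` is a hypothetical name, and it assumes the extract callable takes a path string, returns the text as a string, and raises an exception on failure. The callable is passed in as an argument so the helper can be tried without GENDORUWO installed.

```python
from pathlib import Path
from typing import Callable, Dict, Iterable


def extract_many(paths: Iterable[str], extract: Callable[[str], str]) -> Dict[str, str]:
    """Call `extract` on each path and collect the results by file name.

    Failures are recorded as empty strings so one bad file does not stop the batch.
    """
    results: Dict[str, str] = {}
    for path in paths:
        name = Path(path).name
        try:
            results[name] = extract(str(path))
        except Exception as exc:  # assumption: the library's exception types are unknown
            print(f"skipped {path}: {exc}")
            results[name] = ""
    return results
```

Under those assumptions, a call would look like `extract_many(["docs/contract1.pdf", "docs/contract2.docx"], Gendoruwo().extract)`.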
Workflow YAML Format
name: "Extract Contracts"
input:
  paths:
    - "docs/contract1.pdf"
    - "docs/contract2.docx"
  recursive: true
output:
  directory: "extracted/"
  format: "text"  # text | markdown | json
options:
  encoding: "utf-8"
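A workflow file like the one above can be sanity-checked before it is run. The sketch below works on an already parsed workflow (a plain dictionary, such as a YAML loader would produce) and is not the logic behind `gendoruwo validate`. It assumes `paths` is nested under `input`, that `directory` and `format` are nested under `output`, and that all four fields are required; `check_workflow` and `ALLOWED_FORMATS` are hypothetical names.

```python
from typing import Any, Dict, List

# Taken from the comment in the example workflow: text | markdown | json
ALLOWED_FORMATS = {"text", "markdown", "json"}


def check_workflow(workflow: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a parsed workflow; empty means OK."""
    problems: List[str] = []

    if not workflow.get("name"):
        problems.append("missing 'name'")

    # `or {}` guards against a section that is present but empty (parsed as None).
    paths = (workflow.get("input") or {}).get("paths")
    if not isinstance(paths, list) or not paths:
        problems.append("'input.paths' must be a non-empty list")

    output = workflow.get("output") or {}
    if not output.get("directory"):
        problems.append("missing 'output.directory'")
    if output.get("format") not in ALLOWED_FORMATS:
        problems.append("'output.format' must be one of text, markdown, json")

    return problems
```

Returning a list of messages rather than raising on the first error lets a caller report every problem in one pass.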
Development
# Install in development mode
pip install -e ".[dev]"
# Run tests
pytest
# Run linting
flake8 src/ tests/
# Run tox
tox
License
MIT License - see LICENSE for details.
Download files
Source Distribution
gendoruwo-0.0.2.tar.gz (24.2 kB)
Built Distribution
gendoruwo-0.0.2-py3-none-any.whl (4.9 kB)
File details
Details for the file gendoruwo-0.0.2.tar.gz.
File metadata
- Download URL: gendoruwo-0.0.2.tar.gz
- Upload date:
- Size: 24.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | be947d57659913e7c0b996fa5fda25dfddc05a7e897642c667575ba4b0b13c4c |
| MD5 | b18becfbae814e1ef7b8c48a6797eed2 |
| BLAKE2b-256 | f86ead12796d2952ad4a5d913235a21f43c0c899b7c6e51b024444148d3f7633 |
File details
Details for the file gendoruwo-0.0.2-py3-none-any.whl.
File metadata
- Download URL: gendoruwo-0.0.2-py3-none-any.whl
- Upload date:
- Size: 4.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.15
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 658ae80ac5410f5339a3bf6bca7115799a9b0a12ab9081bf08a96144303650e5 |
| MD5 | fe0bfc7117cff885ed285ceaa8e5cf9b |
| BLAKE2b-256 | 829ffbd7d411a5edc2e0db43b7d622ec577acfedcf0a66e2cc7f54b45934572c |
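The published digests can be used to check a downloaded file. The following is a minimal sketch using only the standard library's `hashlib`; `sha256_of` and `matches` are hypothetical helper names, not part of GENDORUWO.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: str, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published one, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches("gendoruwo-0.0.2.tar.gz", "be947d57659913e7c0b996fa5fda25dfddc05a7e897642c667575ba4b0b13c4c")` should return `True` for an intact copy of the source distribution.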