Document components for the Sayou Data Platform
sayou-document
The Universal Document Parsing Gateway for Sayou Fabric.
sayou-document is a high-fidelity parsing engine that converts diverse document formats (PDF, DOCX, PPTX, XLSX, Images) into a unified, structured Document Object Model (DOM).
Unlike simple text extractors, it preserves the semantic structure of documents (headers, tables, charts, and layout coordinates), which makes it well suited to RAG (Retrieval-Augmented Generation) and layout-aware LLM applications.
💡 Core Philosophy
"One Interface, High Fidelity."
We abstract away the complexity of proprietary file formats. Whether it's a slide deck or a spreadsheet, sayou-document normalizes it into a consistent Document > Page > Element hierarchy.
- Smart Routing: Automatically detects file types (and converts images to PDF if needed) to select the optimal parser.
- Hybrid Extraction: Combines native text extraction with OCR fallback for scanned pages or images.
- Strict Schema: Outputs data strictly adhering to Pydantic models, ready for the next pipeline stage (Refinery).
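The Document > Page > Element hierarchy can be pictured with a minimal sketch. The library's real models are Pydantic; standard-library dataclasses stand in here only so the example runs without dependencies. The names `file_name`, `doc_type`, `page_count`, `pages`, `elements`, and `text` come from the Quick Start, while `kind`, `bbox`, and `page_num` are assumptions made for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Element:
    # `kind` and `bbox` are hypothetical names for the element type
    # and its layout coordinates (x0, y0, x1, y1).
    kind: str
    text: str = ""
    bbox: Optional[Tuple[float, float, float, float]] = None


@dataclass
class Page:
    page_num: int  # hypothetical field
    elements: List[Element] = field(default_factory=list)


@dataclass
class Document:
    file_name: str
    doc_type: str
    pages: List[Page] = field(default_factory=list)

    @property
    def page_count(self) -> int:
        return len(self.pages)


doc = Document(
    file_name="quarterly_report.pdf",
    doc_type="pdf",
    pages=[
        Page(
            page_num=1,
            elements=[
                Element(kind="header", text="Q3 Results", bbox=(72.0, 72.0, 540.0, 96.0))
            ],
        )
    ],
)

# Every format ends up in the same nested shape, so it serializes uniformly.
doc_json = json.dumps(asdict(doc), indent=2)
```

Because every parser would emit this one shape, downstream stages can walk `pages` and `elements` without caring whether the input was a slide deck or a spreadsheet.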
flowchart LR
    File[Raw File] --> Pipeline[Document Pipeline]
    Pipeline -->|PDF/Image| OCR[PDF Parser + OCR]
    Pipeline -->|Office| Office[DOCX/PPTX/XLSX Parser]
    OCR --> DOM[Structured Document Model]
    Office --> DOM
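The routing step in the flowchart might look roughly like the sketch below. This is not the library's implementation: `choose_route`, the extension sets, and the returned labels are hypothetical, and the only file-format fact relied on is that PDF files begin with the `%PDF-` signature.

```python
from pathlib import Path

# Hypothetical routing tables; the labels mirror the branches of the flowchart.
OFFICE_EXTENSIONS = {".docx", ".pptx", ".xlsx"}
IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}


def choose_route(file_name: str, file_bytes: bytes) -> str:
    """Return 'pdf', 'image', or 'office' for a file, or raise ValueError."""
    suffix = Path(file_name).suffix.lower()
    # Check the content signature first, so a PDF with a missing or
    # misleading extension is still routed to the PDF parser.
    if file_bytes.startswith(b"%PDF-") or suffix == ".pdf":
        return "pdf"
    if suffix in IMAGE_EXTENSIONS:
        return "image"  # would be converted to PDF first, then OCR'd
    if suffix in OFFICE_EXTENSIONS:
        return "office"
    raise ValueError(f"Unsupported file type: {file_name!r}")
```

Raising on unknown types, instead of guessing, keeps unsupported files from silently producing an empty document.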
📦 Installation
pip install sayou-document
# For OCR support (requires Tesseract installed on OS)
pip install "sayou-document[ocr]"
⚡ Quick Start
The DocumentPipeline handles file detection, conversion, and parsing automatically.
import os

from sayou.document.pipeline import DocumentPipeline


def run_demo():
    # 1. Initialize the pipeline (with optional OCR)
    pipeline = DocumentPipeline(use_default_ocr=True)
    pipeline.initialize()

    # 2. Parse a file (PDF, Word, Excel, PPT, or image)
    file_path = "quarterly_report.pdf"
    # file_path = "scan_image.png"  # Images are auto-converted to PDF and OCR'd
    with open(file_path, "rb") as f:
        file_bytes = f.read()
    doc = pipeline.run(file_bytes, os.path.basename(file_path))

    if doc:
        print(f"File: {doc.file_name} ({doc.doc_type})")
        print(f"Pages: {doc.page_count}")

        # 3. Access structured data (guard against documents with no pages)
        if doc.pages and doc.pages[0].elements:
            first_element = doc.pages[0].elements[0]
            print(f"Content Preview: {first_element.text[:100]}...")

        # Export to JSON
        print(doc.model_dump_json(indent=2))


if __name__ == "__main__":
    run_demo()
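Once a document is parsed, a common next step for RAG is to flatten it into text chunks. The helper below is a minimal sketch that relies only on the `pages`, `elements`, and `text` attributes shown in the Quick Start; `collect_text` is a hypothetical name, and the assumption that some elements may carry empty or missing text is ours. A `SimpleNamespace` stand-in replaces the real document so the example runs without the library installed.

```python
from types import SimpleNamespace


def collect_text(doc) -> list:
    """Return (page_index, text) pairs for every element with non-empty text."""
    chunks = []
    for page_index, page in enumerate(doc.pages):
        for element in page.elements:
            # getattr guards against elements (e.g. images) that may carry no text.
            text = getattr(element, "text", None)
            if text and text.strip():
                chunks.append((page_index, text.strip()))
    return chunks


# Stand-in object shaped like the parsed document, for illustration only.
fake_doc = SimpleNamespace(
    pages=[
        SimpleNamespace(
            elements=[SimpleNamespace(text="Revenue grew 12%."), SimpleNamespace(text="  ")]
        ),
        SimpleNamespace(
            elements=[SimpleNamespace(text=None), SimpleNamespace(text="Outlook")]
        ),
    ]
)
chunks = collect_text(fake_doc)
```

Keeping the page index alongside each chunk lets a retrieval layer cite where in the source file an answer came from.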
🔑 Key Components
Parsers
- PdfParser: Extracts text, images, and TOC from PDFs using PyMuPDF. Supports full-page OCR for scanned documents.
- DocxParser: Parses Word documents, preserving heading levels and table structures.
- PptxParser: Extracts text frames, notes, and tables from slides.
- ExcelParser: Converts sheets into table elements and extracts embedded images.
Converters & OCR
- ImageToPdfConverter: Automatically converts JPG/PNG images to PDF to leverage the robust PDF parsing pipeline.
- TesseractOCR: (Optional) Provides OCR capabilities for handling scanned content and embedded images.
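The hybrid approach of native extraction with an OCR fallback can be sketched as a small decision function. This is an illustration under stated assumptions, not the library's logic: `extract_page_text`, the `run_ocr` callable, and the `min_chars` threshold are all hypothetical, and the OCR engine is injected so the sketch runs without Tesseract.

```python
from typing import Callable, Tuple


def extract_page_text(
    native_text: str,
    run_ocr: Callable[[], str],
    min_chars: int = 20,
) -> Tuple[str, str]:
    """Return (text, source), where source is 'native' or 'ocr'."""
    cleaned = native_text.strip()
    # A page with almost no embedded text is treated as scanned.
    # The threshold is an arbitrary guess for this sketch.
    if len(cleaned) >= min_chars:
        return cleaned, "native"
    ocr_text = run_ocr().strip()
    # Keep whichever result carries more text, so a failed OCR pass
    # never erases the little native text that was found.
    if len(ocr_text) > len(cleaned):
        return ocr_text, "ocr"
    return cleaned, "native"
```

Passing OCR in as a callable means the expensive step runs only when the native text falls below the threshold.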
🤝 Contributing
We welcome contributions of new parsers (e.g., HwpParser for Korean documents, HtmlParser) and enhanced OCR integrations (e.g., Google Vision API).
📜 License
Apache 2.0 License © 2025 Sayouzone