
Document components for the Sayou Data Platform


sayou-document


Overview

The Universal Document Parsing Gateway for Sayou Fabric.

sayou-document is a high-fidelity parsing engine that converts diverse document formats (PDF, DOCX, PPTX, XLSX, Images) into a unified, structured Document Object Model (DOM).

Unlike simple text extractors, it preserves the semantic structure of documents (headers, tables, charts, and layout coordinates), which makes it well suited to RAG applications that need layout awareness.


1. Architecture & Role

The Document engine acts as a normalizer. It accepts raw file bytes and applies the optimal Parser Strategy to output a structured SayouDocument.

graph LR
    File[Raw File] --> Pipeline[Document Pipeline]
    
    subgraph Parsers
        PDF[PDF Parser + OCR]
        Office[Office Parser]
        Img[Image Converter]
    end
    
    Pipeline -->|Type Detection| Parsers
    Parsers --> DOM[Structured DOM]
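
To make the shape of the structured output more concrete, here is a minimal sketch of a Document > Page > Element tree. The attribute names file_name, page_count, pages, elements, text, and category mirror the usage examples in this README; the class names, the page_num field, the bbox layout, and the "heading" category value are assumptions made for illustration and are not the library's actual classes.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SketchElement:
    category: str  # "table" appears in this README; other values are assumed
    text: str = ""
    # Layout coordinates; the (x0, y0, x1, y1) shape is an assumption.
    bbox: Optional[Tuple[float, float, float, float]] = None


@dataclass
class SketchPage:
    page_num: int  # hypothetical field name
    elements: List[SketchElement] = field(default_factory=list)


@dataclass
class SketchDocument:
    file_name: str
    pages: List[SketchPage] = field(default_factory=list)

    @property
    def page_count(self) -> int:
        # Derived from the page list so the two can never disagree.
        return len(self.pages)


doc = SketchDocument(
    file_name="example.pdf",
    pages=[SketchPage(1, [SketchElement("heading", "Q1 Results")])],
)
print(doc.page_count, doc.pages[0].elements[0].text)
```

Whatever the input format, downstream code only has to walk pages and then elements.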

1.1. Core Features

  • Smart Routing: Automatically detects the file type from its signature (the leading bytes of the file) and selects the matching parser.
  • Hybrid Extraction: Combines native text extraction for digital PDFs with OCR fallback for scanned images.
  • Strict Schema: Outputs a standardized hierarchy (Document > Page > Element) regardless of input format.
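
Signature-based routing of the kind described under Smart Routing can be sketched with the standard library alone. This is an illustration of the general technique, not the package's own detection code: the function name detect_strategy and the lookup tables are hypothetical, and only the returned strategy keys (pdf, docx, pptx, xlsx, image) come from this README.

```python
import io
import zipfile

# Leading bytes ("magic numbers") for the non-ZIP formats.
_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "image"),  # PNG
    (b"\xff\xd8\xff", "image"),       # JPEG
    (b"II*\x00", "image"),            # TIFF, little-endian
    (b"MM\x00*", "image"),            # TIFF, big-endian
]

# DOCX, PPTX, and XLSX are ZIP archives; a top-level folder tells them apart.
_OOXML_FOLDERS = {"word/": "docx", "ppt/": "pptx", "xl/": "xlsx"}


def detect_strategy(data: bytes):
    """Return a strategy key for the given bytes, or None if unrecognized."""
    for magic, key in _SIGNATURES:
        if data.startswith(magic):
            return key
    if data.startswith(b"PK\x03\x04"):  # ZIP local file header
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = archive.namelist()
        except zipfile.BadZipFile:
            return None
        for folder, key in _OOXML_FOLDERS.items():
            if any(name.startswith(folder) for name in names):
                return key
    return None


print(detect_strategy(b"%PDF-1.7\n"))
```

Looking at content rather than the file extension means a mislabeled upload still reaches the right parser.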

2. Supported Formats

sayou-document supports the following file types out-of-the-box.

Format | Strategy Key | Description
PDF | pdf | Extracts text, images, and TOC using PyMuPDF. Supports OCR.
Word | docx | Parses DOCX files, preserving heading levels and lists.
PowerPoint | pptx | Extracts text frames, speaker notes, and tables from slides.
Excel | xlsx | Converts sheets into table elements and extracts embedded charts.
Image | image | Auto-converts JPG/PNG/TIFF to PDF, then applies OCR.

3. Installation

pip install sayou-document

# For OCR support (requires Tesseract installed on OS)
pip install "sayou-document[ocr]"
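
Because the OCR extra depends on a Tesseract binary installed at the OS level, it can help to check for it before enabling OCR. The sketch below uses only the standard library; the helper name tesseract_available is hypothetical, and it assumes the executable is named tesseract and is discoverable on PATH.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if the named executable can be found on PATH."""
    return shutil.which(binary) is not None


if not tesseract_available():
    print("Tesseract not found on PATH; install it before enabling OCR.")
```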

4. Usage

The DocumentPipeline orchestrates file detection and parsing. Every input goes through the process method, which takes the raw file bytes and a metadata dictionary.

Case A: PDF Parsing (Standard)

Processes a PDF file to extract structured text and layout info.

import os
from sayou.document import DocumentPipeline

file_path = "quarterly_report.pdf"
with open(file_path, "rb") as f:
    file_bytes = f.read()

doc = DocumentPipeline.process(
    data=file_bytes,
    metadata={"filename": os.path.basename(file_path)}
)

# Inspect the parsed result
print(f"File: {doc.file_name}, Pages: {doc.page_count}")
print(f"First Element: {doc.pages[0].elements[0].text}")
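
For a RAG pipeline, the parsed document usually needs to be flattened into text chunks that keep their page of origin. The following is a minimal sketch that relies only on the attributes used in the example above (file_name, pages, elements, text); the helper iter_chunks and the chunk dictionary layout are hypothetical, and a SimpleNamespace stand-in replaces the real document so the snippet runs without the library installed.

```python
from types import SimpleNamespace


def iter_chunks(doc):
    """Yield one dict per non-empty element, tagged with its page index."""
    for page_index, page in enumerate(doc.pages):
        for element in page.elements:
            text = (getattr(element, "text", "") or "").strip()
            if text:
                yield {"file": doc.file_name, "page": page_index, "text": text}


# Stand-in object with the same attribute names as the example above.
fake_doc = SimpleNamespace(
    file_name="quarterly_report.pdf",
    pages=[
        SimpleNamespace(elements=[
            SimpleNamespace(text="Revenue grew."),
            SimpleNamespace(text="   "),  # whitespace-only, skipped
        ]),
        SimpleNamespace(elements=[SimpleNamespace(text="Outlook")]),
    ],
)

chunks = list(iter_chunks(fake_doc))
print(len(chunks))
```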

Case B: Office Documents (Word/Excel)

Parses Office formats while preserving table structures.

from sayou.document import DocumentPipeline

with open("salary_table.xlsx", "rb") as f:
    file_bytes = f.read()

doc = DocumentPipeline.process(
    data=file_bytes,
    metadata={"filename": "salary_table.xlsx"}
)

# Access tables
tables = [e for p in doc.pages for e in p.elements if e.category == "table"]
print(f"Extracted {len(tables)} tables.")
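
The same traversal generalizes to a per-category summary, which is a quick way to see what a parse produced. This sketch assumes only the pages, elements, and category attributes used above; the helper category_counts is hypothetical, the "text" category value is an assumption, and the stand-in object exists only so the snippet runs on its own.

```python
from collections import Counter
from types import SimpleNamespace


def category_counts(doc):
    """Count elements per category across all pages."""
    return Counter(e.category for p in doc.pages for e in p.elements)


# Stand-in for a parsed workbook: two sheets, three elements in total.
fake_doc = SimpleNamespace(
    pages=[
        SimpleNamespace(elements=[
            SimpleNamespace(category="table"),
            SimpleNamespace(category="text"),
        ]),
        SimpleNamespace(elements=[SimpleNamespace(category="table")]),
    ]
)

counts = category_counts(fake_doc)
print(counts["table"])
```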

Case C: Image with OCR

Automatically handles image conversion and OCR processing.

from sayou.document import DocumentPipeline

# Initialize with OCR enabled
pipeline = DocumentPipeline(config={"use_ocr": True, "ocr_lang": "eng"})

with open("scanned_receipt.png", "rb") as f:
    file_bytes = f.read()

doc = pipeline.process(
    data=file_bytes,
    metadata={"filename": "scanned_receipt.png"}
)

print(f"OCR Result: {doc.pages[0].elements[0].text}")

5. Configuration Keys

Customize the parsing behavior via the config dictionary.

  • use_ocr: (bool) Enable OCR for scanned pages or images.
  • ocr_lang: (str) Tesseract language code (default: eng+kor).
  • extract_images: (bool) Whether to extract embedded images to disk.
  • table_strategy: (str) fast (text-based) or accurate (vision-based).
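
A small helper can catch misspelled keys or invalid values before the dictionary reaches the pipeline. This is a sketch under stated assumptions: the four key names, the two table_strategy values, and the eng+kor default come from the list above, while build_config and every other default value are hypothetical placeholders.

```python
ALLOWED_TABLE_STRATEGIES = {"fast", "accurate"}

# Only the ocr_lang default ("eng+kor") is documented above;
# the other defaults here are placeholders chosen for the sketch.
SKETCH_DEFAULTS = {
    "use_ocr": False,
    "ocr_lang": "eng+kor",
    "extract_images": False,
    "table_strategy": "fast",
}


def build_config(**overrides):
    """Merge overrides into the defaults, rejecting unknown keys and values."""
    unknown = set(overrides) - set(SKETCH_DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    config = {**SKETCH_DEFAULTS, **overrides}
    if config["table_strategy"] not in ALLOWED_TABLE_STRATEGIES:
        raise ValueError(f"Invalid table_strategy: {config['table_strategy']!r}")
    return config


config = build_config(use_ocr=True, table_strategy="accurate")
print(config)
```

The resulting dictionary could then be passed as the config argument shown in Case C.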

6. License

Apache 2.0 License © 2026 Sayouzone

7. Plugin List

Plugin Example Description
Docx Data
Excel Data
PPTX Data
PDF Data
