Document components for the Sayou Data Platform

sayou-document

Overview

The Universal Document Parsing Gateway for Sayou Fabric.

sayou-document is a high-fidelity parsing engine that converts diverse document formats (PDF, DOCX, PPTX, XLSX, Images) into a unified, structured Document Object Model (DOM).

Unlike simple text extractors, it preserves the semantic structure of documents (headers, tables, charts, and layout coordinates), which makes it well suited to RAG applications that require layout awareness.

1. Architecture & Role

The Document engine acts as a normalizer. It accepts raw file bytes and applies the optimal Parser Strategy to output a structured SayouDocument.

graph LR
    File[Raw File] --> Pipeline[Document Pipeline]
    
    subgraph Parsers
        PDF[PDF Parser + OCR]
        Office[Office Parser]
        Img[Image Converter]
    end
    
    Pipeline -->|Type Detection| Parsers
    Parsers --> DOM[Structured DOM]
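
The "Type Detection" step in the diagram can be illustrated with a small content-sniffing routine. This is only a sketch of the general technique, not the library's actual routing code: the function name `detect_strategy` and the lookup tables are hypothetical, and the returned strings simply reuse the strategy keys listed in section 2. It relies on well-known file signatures (PDF, PNG, JPEG, TIFF) and on the fact that DOCX, XLSX, and PPTX files are ZIP archives with distinct top-level folders.

```python
import io
import zipfile
from typing import Optional

# Leading byte signatures for the non-ZIP formats shown in the diagram.
_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "image"),  # PNG
    (b"\xff\xd8\xff", "image"),       # JPEG
    (b"II*\x00", "image"),            # TIFF, little-endian
    (b"MM\x00*", "image"),            # TIFF, big-endian
]

# OOXML files are ZIP archives; a top-level folder tells them apart.
_OOXML_FOLDERS = {"word/": "docx", "xl/": "xlsx", "ppt/": "pptx"}


def detect_strategy(data: bytes) -> Optional[str]:
    """Guess a strategy key from file content; return None if unknown.

    Hypothetical helper for illustration, not part of sayou-document.
    """
    for magic, key in _SIGNATURES:
        if data.startswith(magic):
            return key
    if data.startswith(b"PK\x03\x04"):
        try:
            with zipfile.ZipFile(io.BytesIO(data)) as archive:
                names = archive.namelist()
        except zipfile.BadZipFile:
            return None
        for folder, key in _OOXML_FOLDERS.items():
            if any(name.startswith(folder) for name in names):
                return key
    return None
```

Sniffing content rather than trusting the file extension means a renamed or extension-less upload can still be routed to a suitable parser.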

1.1. Core Features

Smart Routing: Detects the file type from its signature and selects the matching parser.
Hybrid Extraction: Combines native text extraction for digital PDFs with OCR fallback for scanned images.
Strict Schema: Outputs a standardized hierarchy (Document > Page > Element) regardless of input format.
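
To make the Document > Page > Element hierarchy concrete, here is a minimal sketch using dataclasses. It is an illustration under assumptions, not the library's real schema: the attribute names `file_name`, `page_count`, `pages`, `elements`, `text`, and `category` are taken from the usage examples in this README, while the class names, `number`, and `bbox` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Element:
    category: str   # e.g. "table"; other category values are assumptions
    text: str = ""
    # Hypothetical layout coordinates (x0, y0, x1, y1).
    bbox: Optional[Tuple[float, float, float, float]] = None


@dataclass
class Page:
    number: int     # hypothetical 1-based page number
    elements: List[Element] = field(default_factory=list)


@dataclass
class Document:
    file_name: str
    pages: List[Page] = field(default_factory=list)

    @property
    def page_count(self) -> int:
        return len(self.pages)


if __name__ == "__main__":
    doc = Document(
        file_name="report.pdf",
        pages=[Page(number=1, elements=[Element(category="text", text="Q1 Summary")])],
    )
    print(doc.page_count, doc.pages[0].elements[0].text)
```

Because every input format maps onto the same three levels, downstream code can walk pages and elements without caring whether the source was a PDF, a spreadsheet, or a scanned image.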

2. Supported Formats

sayou-document supports the following file types out of the box.

Format	Strategy Key	Description
PDF	`pdf`	Extracts text, images, and TOC using `PyMuPDF`. Supports OCR.
Word	`docx`	Parses DOCX files, preserving heading levels and lists.
PowerPoint	`pptx`	Extracts text frames, speaker notes, and tables from slides.
Excel	`xlsx`	Converts sheets into table elements and extracts embedded charts.
Image	`image`	Auto-converts JPG/PNG/TIFF to PDF, then applies OCR.
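
The hybrid approach behind the PDF and Image rows (native text first, OCR as a fallback) can be sketched independently of any particular backend. The function below is a hypothetical illustration: the name `extract_page_text`, the injected callables, and the `min_chars` threshold are assumptions, and the library may decide when to fall back in a different way.

```python
from typing import Callable, Tuple


def extract_page_text(
    page: object,
    native_extract: Callable[[object], str],
    ocr_extract: Callable[[object], str],
    min_chars: int = 20,  # assumed threshold for "this page has real text"
) -> Tuple[str, str]:
    """Return (text, source), where source is "native" or "ocr".

    Native extraction is tried first; OCR runs only when the native
    result is shorter than min_chars after stripping whitespace.
    """
    text = native_extract(page).strip()
    if len(text) >= min_chars:
        return text, "native"
    return ocr_extract(page).strip(), "ocr"


if __name__ == "__main__":
    # Stand-in extractors; a real setup would wrap a PDF library and an OCR engine.
    scanned_page = {"native": "", "ocr": "Total: 42.00 USD"}
    print(extract_page_text(
        scanned_page,
        native_extract=lambda p: p["native"],
        ocr_extract=lambda p: p["ocr"],
    ))
```

Keeping the extractors as parameters makes the fallback rule easy to test without installing an OCR engine.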

3. Installation

pip install sayou-document

# For OCR support (requires Tesseract installed on OS)
pip install "sayou-document[ocr]"
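
Because the OCR extra depends on a Tesseract installation at the OS level, it can help to check for the binary before enabling OCR. The snippet below is a small sketch using only the standard library; the helper name `binary_available` is hypothetical, and it assumes the executable is on `PATH` under its usual name `tesseract`.

```python
import shutil


def binary_available(name: str = "tesseract") -> bool:
    """Return True if an executable with this name is found on PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    if binary_available():
        print("Tesseract found; OCR can be enabled.")
    else:
        print("Tesseract not found on PATH; install it before enabling OCR.")
```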

4. Usage

The DocumentPipeline orchestrates file detection and parsing. Its process method is the single entry point: it takes the raw file bytes plus metadata and returns the parsed document.

Case A: PDF Parsing (Standard)

Processes a PDF file to extract structured text and layout info.

import os
from sayou.document import DocumentPipeline

file_path = "quarterly_report.pdf"
with open(file_path, "rb") as f:
    file_bytes = f.read()

doc = DocumentPipeline.process(
    data=file_bytes,
    metadata={"filename": os.path.basename(file_path)}
)

# Inspect the result
print(f"File: {doc.file_name}, Pages: {doc.page_count}")
print(f"First Element: {doc.pages[0].elements[0].text}")
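
For RAG-style use, the parsed result usually needs to be flattened into text per page. The helper below is a sketch that relies only on the attributes used in the example above (`pages`, `elements`, `text`); the function name `iter_page_texts` is hypothetical, and it assumes every element exposes a `text` attribute that may be empty.

```python
from typing import Any, Iterator, Tuple


def iter_page_texts(doc: Any) -> Iterator[Tuple[int, str]]:
    """Yield (1-based page index, joined text) for each page of a parsed document."""
    for index, page in enumerate(doc.pages, start=1):
        # Skip elements with no text attribute or only whitespace.
        parts = [(getattr(element, "text", "") or "").strip() for element in page.elements]
        yield index, "\n".join(part for part in parts if part)
```

Calling `dict(iter_page_texts(doc))` on the `doc` from Case A would give a page-number-to-text mapping that can be passed to a chunker.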

Case B: Office Documents (Word/Excel)

Parses Office formats while preserving table structures.

from sayou.document import DocumentPipeline

with open("salary_table.xlsx", "rb") as f:
    file_bytes = f.read()

doc = DocumentPipeline.process(
    data=file_bytes,
    metadata={"filename": "salary_table.xlsx"}
)

# Access tables
tables = [e for p in doc.pages for e in p.elements if e.category == "table"]
print(f"Extracted {len(tables)} tables.")
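
The same traversal generalizes from tables to any element category. As a sketch, the hypothetical helper `count_categories` below tallies elements by their `category` attribute; apart from `"table"`, which appears in the example above, the category values a real document contains are not specified in this README.

```python
from collections import Counter
from typing import Any


def count_categories(doc: Any) -> Counter:
    """Count elements per category across all pages of a parsed document."""
    return Counter(
        element.category
        for page in doc.pages
        for element in page.elements
    )
```

A quick `count_categories(doc).most_common()` is a convenient sanity check that a spreadsheet produced table elements rather than plain text.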

Case C: Image with OCR

Automatically handles image conversion and OCR processing.

from sayou.document import DocumentPipeline

# Initialize with OCR enabled
pipeline = DocumentPipeline(config={"use_ocr": True, "ocr_lang": "eng"})

with open("scanned_receipt.png", "rb") as f:
    file_bytes = f.read()

doc = pipeline.process(
    data=file_bytes,
    metadata={"filename": "scanned_receipt.png"}
)

print(f"OCR Result: {doc.pages[0].elements[0].text}")

5. Configuration Keys

Customize the parsing behavior via the config dictionary.

use_ocr: (bool) Enable OCR for scanned pages or images.
ocr_lang: (str) Tesseract language code (default: eng+kor).
extract_images: (bool) Whether to extract embedded images to disk.
table_strategy: (str) fast (text-based) or accurate (vision-based).
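
Since the config is a plain dictionary, a typo in a key or value can go unnoticed. The sketch below validates a config before it is passed to the pipeline. It is not part of the library: `build_config` is a hypothetical helper, and apart from the documented `ocr_lang` default of `eng+kor`, the default values shown are assumptions.

```python
from typing import Any, Dict

# Only the ocr_lang default comes from the README; the others are assumed.
DEFAULTS: Dict[str, Any] = {
    "use_ocr": False,
    "ocr_lang": "eng+kor",
    "extract_images": False,
    "table_strategy": "fast",
}
_EXPECTED_TYPES = {
    "use_ocr": bool,
    "ocr_lang": str,
    "extract_images": bool,
    "table_strategy": str,
}
_TABLE_STRATEGIES = {"fast", "accurate"}


def build_config(**overrides: Any) -> Dict[str, Any]:
    """Merge overrides into the defaults and reject unknown keys or bad values."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown config keys: {sorted(unknown)}")
    config = {**DEFAULTS, **overrides}
    for key, expected in _EXPECTED_TYPES.items():
        if not isinstance(config[key], expected):
            raise TypeError(f"{key} must be of type {expected.__name__}")
    if config["table_strategy"] not in _TABLE_STRATEGIES:
        raise ValueError("table_strategy must be 'fast' or 'accurate'")
    return config


if __name__ == "__main__":
    print(build_config(use_ocr=True, ocr_lang="eng"))
```

The resulting dictionary could then be passed as `DocumentPipeline(config=...)`, as in Case C.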

6. License

7. Plugin List

Plugin
`Docx Parser`
`Excel Parser`
`PPTX Parser`
`PDF Parser`
