Refinery components for the Sayou Data Platform
sayou-refinery
Overview
The Universal Data Cleaning & Normalization Engine for Sayou Fabric.
sayou-refinery acts as the "Cleaning Plant" in your data pipeline. It transforms heterogeneous raw data (JSON Documents, HTML, DB Records) into a standardized stream of SayouBlocks.
It ensures that downstream components (like Chunkers or LLMs) receive clean, uniform data regardless of whether the source was a messy web scrape or a structured database row.
1. Architecture & Role
Refinery operates in two distinct stages to guarantee data quality: Normalization (Shape Shifting) and Processing (Hygiene).
```mermaid
graph LR
    Raw[Raw Input] --> Pipeline[Refinery Pipeline]

    subgraph Stage1 [Normalization]
        Doc[Doc Normalizer]
        Html[Html Normalizer]
        Json[Json Normalizer]
    end

    subgraph Stage2 [Processing Chain]
        Space[Whitespace]
        PII[PII Masker]
        Link[Link Extractor]
    end

    Pipeline --> Stage1
    Stage1 --> Stage2
    Stage2 --> Blocks[Clean SayouBlocks]
```
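To make the two-stage flow concrete, here is a minimal standalone sketch. It does not use the sayou-refinery API: `Block`, `normalize_doc`, `collapse_whitespace`, `NORMALIZERS`, `PROCESSORS`, and `refine` are hypothetical names chosen for illustration, and the real SayouBlock type may carry different fields.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Block:
    """Hypothetical stand-in for a SayouBlock: a type tag plus text content."""
    type: str
    content: str
    metadata: dict[str, Any] = field(default_factory=dict)


def normalize_doc(raw: dict) -> list[Block]:
    """Stage 1 (Normalization): flatten a parsed document into a linear block list."""
    return [
        Block(type=el.get("type", "text"), content=el.get("text", ""))
        for page in raw.get("pages", [])
        for el in page.get("elements", [])
    ]


def collapse_whitespace(block: Block) -> Block:
    """Stage 2 (Processing): collapse runs of whitespace and trim both ends."""
    return Block(block.type, " ".join(block.content.split()), block.metadata)


# Assumed registries: one normalizer per strategy key, then an ordered processor chain.
NORMALIZERS: dict[str, Callable[[Any], list[Block]]] = {"standard_doc": normalize_doc}
PROCESSORS: list[Callable[[Block], Block]] = [collapse_whitespace]


def refine(data: Any, strategy: str) -> list[Block]:
    """Pick a normalizer by strategy, then run every block through the chain."""
    blocks = NORMALIZERS[strategy](data)
    for processor in PROCESSORS:
        blocks = [processor(b) for b in blocks]
    return blocks


demo = refine(
    {"pages": [{"elements": [{"type": "text", "text": "  Hello   world  "}]}]},
    strategy="standard_doc",
)
print(demo[0].content)  # Hello world
```

Keeping shape changes (Stage 1) apart from text hygiene (Stage 2) means a processor only ever sees one block type, whatever the input format was.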
1.1. Core Features
- Normalization: Flattens complex structures (Nested JSON, HTML Trees) into a linear list of blocks.
- Hygiene: Removes invisible characters, normalizes Unicode, and fixes broken encoding.
- Safety: Automatically masks sensitive information (PII) like emails or phone numbers before they reach the LLM.
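The hygiene and safety steps can be approximated with the standard library alone. The sketch below is an illustration, not the library's implementation: `clean_text`, `mask_pii`, the regular expressions, and the `[PHONE]` placeholder are assumptions (only `[EMAIL]` appears in the usage example further down), and the phone pattern is deliberately loose.

```python
import re
import unicodedata

# Zero-width and BOM characters that often survive scraping or copy/paste.
INVISIBLE = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone pattern for illustration; real PII detection needs locale-aware rules.
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def clean_text(text: str) -> str:
    """Normalize Unicode to NFKC (e.g. ligatures to plain letters), drop invisibles."""
    text = unicodedata.normalize("NFKC", text)
    return INVISIBLE.sub("", text)


def mask_pii(text: str) -> str:
    """Replace emails and phone-like digit runs with placeholder tokens."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


sample = "Reach\u200b me at admin@sayou.ai or +1 (555) 010-9999, \ufb01ne?"
print(mask_pii(clean_text(sample)))
# Reach me at [EMAIL] or [PHONE], fine?
```

Cleaning runs before masking so that invisible characters embedded in an address cannot hide it from the pattern.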
2. Available Strategies
sayou-refinery provides strategies tailored to specific input formats.
| Strategy Key | Target Format | Description |
|---|---|---|
| standard_doc | Sayou Document | [Default] Converts parsed document dictionaries into Markdown blocks. Applies standard text cleaning. |
| html | Web Pages | Strips HTML tags, extracts links, and converts the DOM tree into readable text blocks. |
| json | API/DB Records | Flattens JSON objects into key-value pairs or text representations. |
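As an illustration of what "flattening JSON into key-value pairs" can mean, the sketch below walks a nested object and emits dotted-path keys. The `flatten` helper and the path notation (`user.tags[0]`) are assumptions made for this example; the json strategy may use a different key format.

```python
from typing import Any


def flatten(obj: Any, prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts and lists into a single dict keyed by dotted paths."""
    items: dict[str, Any] = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            items.update(flatten(value, f"{prefix}.{key}" if prefix else str(key)))
    elif isinstance(obj, list):
        for index, value in enumerate(obj):
            items.update(flatten(value, f"{prefix}[{index}]"))
    else:
        # Leaf value: record it under the path accumulated so far.
        items[prefix] = obj
    return items


record = {"user": {"name": "Kim", "tags": ["a", "b"]}, "active": True}
flat = flatten(record)
lines = [f"{key}: {value}" for key, value in flat.items()]
print("\n".join(lines))
# user.name: Kim
# user.tags[0]: a
# user.tags[1]: b
# active: True
```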
3. Installation
```bash
pip install sayou-refinery
```
4. Usage
The RefineryPipeline orchestrates the normalization and processing chain.
Case A: Document Cleaning (Standard)
Cleans messy OCR output or parsed document text.
```python
from sayou.refinery import RefineryPipeline

# A parsed document: metadata plus pages, each holding a list of elements.
raw_doc = {
    "metadata": {"title": "Test Doc"},
    "pages": [{
        "elements": [
            {"type": "text", "text": "Contact: admin@sayou.ai "},
            {"type": "text", "text": "Generic Whitespace Error"}
        ]
    }]
}

blocks = RefineryPipeline.process(
    data=raw_doc,
    strategy="standard_doc"
)

for block in blocks:
    print(f"[{block.type}] {block.content}")

# Output: [text] Contact: [EMAIL]
# Output: [text] Generic Whitespace Error
```
Case B: HTML Processing
Converts web content into clean text while preserving hyperlinks.
```python
from sayou.refinery import RefineryPipeline

raw_html = """
<html>
    <body>
        <h1>Welcome</h1>
        <p>Click <a href='https://sayou.ai'>here</a>.</p>
    </body>
</html>
"""

blocks = RefineryPipeline.process(
    data=raw_html,
    strategy="html"
)

# Result:
# [heading] Welcome
# [text] Click here (Link: https://sayou.ai)
```
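For readers curious how an HTML-to-blocks conversion of this kind might work, here is a rough standard-library sketch built on `html.parser.HTMLParser`. It is not the library's html strategy: `BlockExtractor` is a hypothetical class, it handles only headings, paragraphs, and links, and its output keeps trailing punctuation that the real pipeline may treat differently.

```python
from html.parser import HTMLParser
from typing import Optional


class BlockExtractor(HTMLParser):
    """Collect (type, text) pairs from headings and paragraphs, keeping link targets."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self.blocks: list[tuple[str, str]] = []
        self._kind: Optional[str] = None  # type of the block being collected
        self._parts: list[str] = []       # text fragments of the current block
        self._href: Optional[str] = None  # target of the currently open link

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS or tag == "p":
            self._kind = "heading" if tag in self.HEADINGS else "text"
            self._parts = []
        elif tag == "a" and self._kind:
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        # Text outside a heading or paragraph (e.g. indentation) is ignored.
        if self._kind:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            # Append the link target right after the anchor text.
            self._parts.append(f" (Link: {self._href})")
            self._href = None
        elif (tag in self.HEADINGS or tag == "p") and self._kind:
            self.blocks.append((self._kind, "".join(self._parts).strip()))
            self._kind = None


parser = BlockExtractor()
parser.feed("<h1>Welcome</h1><p>Click <a href='https://sayou.ai'>here</a>.</p>")
for kind, text in parser.blocks:
    print(f"[{kind}] {text}")
# [heading] Welcome
# [text] Click here (Link: https://sayou.ai).
```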
5. Configuration Keys
Customize the cleaning processors via the config dictionary.
- mask_pii: (bool) Mask emails, phone numbers, and IP addresses.
- normalize_whitespace: (bool) Collapse multiple spaces and trim lines.
- extract_links: (bool) Extract `<a>` tags or markdown links into metadata.
- remove_stopwords: (bool) Filter out common stopwords (optional).
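The source does not show how the config dictionary is handed to `RefineryPipeline`, so the sketch below stays standalone and only illustrates the idea of boolean flags switching processors on and off. `apply_config` and `MD_LINK` are hypothetical; only the flag names `normalize_whitespace` and `extract_links` come from the list above.

```python
import re

# Markdown inline links of the form [label](url).
MD_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def apply_config(text: str, config: dict) -> tuple[str, dict]:
    """Run only the processors whose flag is enabled; return (text, metadata)."""
    metadata: dict = {}
    if config.get("extract_links", False):
        # Move link targets into metadata and keep only the label in the text.
        metadata["links"] = [url for _label, url in MD_LINK.findall(text)]
        text = MD_LINK.sub(r"\1", text)
    if config.get("normalize_whitespace", False):
        # Collapse runs of spaces and trim each line; line breaks are preserved.
        text = "\n".join(" ".join(line.split()) for line in text.splitlines())
    return text, metadata


config = {"normalize_whitespace": True, "extract_links": True}
text, meta = apply_config("See   [docs](https://sayou.ai)  \n\n  for details.", config)
print(repr(text))  # 'See docs\n\nfor details.'
print(meta)        # {'links': ['https://sayou.ai']}
```

Flags that are absent default to off here; whether the library's own defaults are on or off is not stated in the source.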
6. License
Apache 2.0 License © 2026 Sayouzone
7. Plugin List
| Plugin | Example | Description |
|---|---|---|
| Doc Refinery | ▶ | |
| HTML Refinery | ▶ | |
| Json Refinery | ▶ | |
| Record Refinery | ▶ | |