Refinery components for the Sayou Data Platform
Project description
sayou-refinery
sayou-refinery is the central data transformation and cleansing library for the Sayou Data Platform. It acts as the "smelter" in the data pipeline, taking raw extracted data and turning it into clean, usable content for downstream tasks like chunking, embedding, and RAG.
Philosophy: Transformation, Not Extraction
sayou-refinery does not extract data from files; that is the job of sayou-document or sayou-connector.
Instead, sayou-refinery's sole responsibility is transformation. It cleans, interprets, filters, and reformats data structures so that they are well structured and optimized for consumption by LLMs.
🚀 Key Features
sayou-refinery has a dual role:
- Document Refining: It "interprets" the rich, high-fidelity JSON output from sayou-document and transforms it into LLM-friendly formats like Markdown (ContentBlock objects).
- DataAtom Refining: It cleans and processes streams of DataAtom objects (from sayou-connector or sayou-wrapper) to reduce noise and enhance insights for RAG.
Core Components
- Doc Refiners (doc/): Specialized tools for transforming sayou-document output.
  - DocToMarkdownRefiner: The (Tier 2) engine that converts a document JSON into a list of ContentBlock objects (Markdown, Images), interpreting raw_attributes to create semantic structure (like headings and lists).
- Atom Processors (processor/): (1:1) Cleans or transforms single DataAtoms.
  - e.g., Deduplicator (removes duplicates), TextCleaner (strips HTML).
- Atom Aggregators (aggregator/): (N:M) Summarizes or combines multiple DataAtoms into new ones.
  - e.g., AverageAggregator (calculates averages from time-series data).
- Atom Mergers (merger/): (N+E:N) Enriches DataAtoms with external data.
  - e.g., KeyBasedMerger (joins atoms with a CSV or database lookup).
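The three atom-level shapes above (1:1, N:M, N+E:N) can be illustrated with a minimal sketch. This is not the sayou-refinery API: plain dicts stand in for DataAtoms, and the function names `clean_text`, `average_by_sensor`, and `merge_by_key` are hypothetical, chosen only to show how the input and output counts differ.

```python
import re
from statistics import mean

# Assumption: a plain dict plays the role of a DataAtom in this sketch.

def clean_text(atom: dict) -> dict:
    """1:1 processor: return a copy of one atom with HTML tags stripped."""
    text = re.sub(r"<[^>]+>", "", atom["text"])
    return {**atom, "text": text.strip()}

def average_by_sensor(atoms: list[dict]) -> list[dict]:
    """N:M aggregator: collapse many readings into one average per sensor."""
    groups: dict[str, list[float]] = {}
    for atom in atoms:
        groups.setdefault(atom["sensor"], []).append(atom["value"])
    return [{"sensor": s, "avg": mean(v)} for s, v in groups.items()]

def merge_by_key(atoms: list[dict], lookup: dict, key: str) -> list[dict]:
    """N+E:N merger: enrich each atom with fields from an external lookup."""
    return [{**atom, **lookup.get(atom[key], {})} for atom in atoms]

cleaned = clean_text({"text": "<p>Hello <b>world</b></p>"})

readings = [
    {"sensor": "a", "value": 1.0},
    {"sensor": "a", "value": 3.0},
    {"sensor": "b", "value": 10.0},
]
averaged = average_by_sensor(readings)  # three atoms in, two atoms out

# The lookup dict stands in for a CSV file or database table.
merged = merge_by_key(readings, {"a": {"room": "lab"}}, key="sensor")
```

In this sketch the merger returns exactly as many atoms as it receives, while the aggregator is free to return fewer.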
📦 Installation
pip install sayou-refinery
⚡ Quickstart
sayou-refinery provides different tools for different data types.
1. Refining a Document (from sayou-document)
This example shows how sayou-rag uses refinery to process a document.
import json
from sayou.refinery.processor.doc_to_markdown import DocToMarkdownRefiner
# 1. Load the JSON output from sayou-document
# (This assumes doc_data is a dict from doc.model_dump())
with open("my_document_output.json", "r", encoding="utf-8") as f:
    doc_data = json.load(f)
# 2. Initialize the Tier 2 refiner (default engine)
refiner = DocToMarkdownRefiner()
refiner.initialize()
# 3. Refine the dict into ContentBlocks (MD, image data, etc.)
content_blocks = refiner.refine(doc_data)
# 4. (Application Logic) Assemble and save the Markdown
final_markdown = []
for block in content_blocks:
    if block.type == "md":
        final_markdown.append(block.content)
# (Add logic here to save images (block.type == "image_base64") and link them)
output = "\n\n".join(final_markdown)
# print(output)
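The image-handling step left open in the example above could look roughly like the following sketch. It makes several assumptions that the source does not confirm: `Block` is a hypothetical stand-in for ContentBlock with only `type` and `content` fields, image content is treated as a bare base64 string, and every image is written with a `.png` extension.

```python
import base64
import tempfile
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Block:
    """Hypothetical stand-in for ContentBlock; only type and content are assumed."""
    type: str
    content: str

def assemble_markdown(blocks: list[Block], image_dir: Path) -> str:
    """Join "md" blocks and replace each image block with a Markdown image link."""
    image_dir.mkdir(parents=True, exist_ok=True)
    parts = []
    image_count = 0
    for block in blocks:
        if block.type == "md":
            parts.append(block.content)
        elif block.type == "image_base64":
            image_count += 1
            # Assumption: content is bare base64 (no data-URI prefix).
            path = image_dir / f"image_{image_count}.png"
            path.write_bytes(base64.b64decode(block.content))
            parts.append(f"![image {image_count}]({path.as_posix()})")
    return "\n\n".join(parts)

# Demo with placeholder bytes rather than a real image.
with tempfile.TemporaryDirectory() as tmp:
    demo = [
        Block("md", "# Title"),
        Block("image_base64", base64.b64encode(b"not-a-real-png").decode("ascii")),
    ]
    markdown = assemble_markdown(demo, Path(tmp) / "images")
    saved = (Path(tmp) / "images" / "image_1.png").read_bytes()
```

Writing the images to disk and linking them by path keeps the Markdown small; embedding them as data URIs is an alternative if the output must be a single file.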
2. Refining DataAtoms
This example shows how to clean a list of DataAtoms.
from sayou.core.atom import DataAtom
from sayou.refinery.core.context import RefineryContext
from sayou.refinery.processor.deduplicator import Deduplicator
# 1. Prepare DataAtoms (e.g., from sayou-connector)
atoms = [
    DataAtom("source_A", "item", {"id": "123", "data": "A"}),
    DataAtom("source_B", "item", {"id": "456", "data": "B"}),
    DataAtom("source_C", "item", {"id": "123", "data": "C_dupe"}),
]
context = RefineryContext(atoms=atoms)
# 2. Initialize the Tier 2 processor
# We want to deduplicate based on the 'id' field in the payload
deduper = Deduplicator()
deduper.initialize(key_field="payload.id")
# 3. Process the context
refined_context = deduper.process(context)
# refined_context.atoms will now only contain the first two atoms
# print(len(refined_context.atoms)) # Output: 2
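To make the dotted key_field idea concrete, here is a minimal, self-contained sketch of first-seen-wins deduplication on a path such as "payload.id". It does not reproduce the Deduplicator implementation: `Atom`, `resolve`, and `deduplicate` are hypothetical names, and keeping atoms that lack the key is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Atom:
    """Hypothetical stand-in for DataAtom: source, type, and a payload dict."""
    source: str
    type: str
    payload: dict = field(default_factory=dict)

def resolve(obj: Any, dotted_path: str) -> Any:
    """Walk a dotted path through attributes and dict keys; None if missing."""
    for part in dotted_path.split("."):
        obj = obj.get(part) if isinstance(obj, dict) else getattr(obj, part, None)
        if obj is None:
            return None
    return obj

def deduplicate(atoms: list[Atom], key_field: str) -> list[Atom]:
    """Keep the first atom seen for each key value, preserving input order."""
    seen = set()
    unique = []
    for atom in atoms:
        key = resolve(atom, key_field)
        if key is not None:
            if key in seen:
                continue  # a later atom with an already-seen key is dropped
            seen.add(key)
        unique.append(atom)  # atoms without the key are kept (assumption)
    return unique

atoms = [
    Atom("source_A", "item", {"id": "123", "data": "A"}),
    Atom("source_B", "item", {"id": "456", "data": "B"}),
    Atom("source_C", "item", {"id": "123", "data": "C_dupe"}),
]
unique = deduplicate(atoms, "payload.id")
```

As in the quickstart, the atom from source_C shares an id with the atom from source_A and is the one removed.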
🗺️ Roadmap
- Implementing more Tier 2 Aggregator templates (e.g., SumAggregator, TimeSeriesResampler).
- Developing Tier 3 plugins for advanced HTML-to-Markdown conversion.
🤝 Contributing
We welcome contributions! If you are interested in building new refiner plugins, please check our contributing guidelines (TODO) and open an issue.
📜 License
This project is licensed under the Apache 2.0 License.
Download files
File details
Details for the file sayou_refinery-0.1.1.tar.gz.
File metadata
- Download URL: sayou_refinery-0.1.1.tar.gz
- Upload date:
- Size: 18.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 9c0c7f89daaec3504e83f61089bf31d3870ea1b2032f5321aa0a1839de3b4943 |
| MD5 | 8ceec65039542af6119e88e78a75a1d6 |
| BLAKE2b-256 | 8456ef3d0118164192aecb63dd0ac4fa6f4619cb245953e4535cc5dbd1c5f400 |
File details
Details for the file sayou_refinery-0.1.1-py3-none-any.whl.
File metadata
- Download URL: sayou_refinery-0.1.1-py3-none-any.whl
- Upload date:
- Size: 23.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d60ecdc02e6c12e2c1ed8ef81677edbdd178c434ca5c6813e8f4fed5a13f42b9 |
| MD5 | 57c01a3d3e384f82c079bf1bd3cced9c |
| BLAKE2b-256 | 831bcbef671c553e3d1d2d6c626a756ffa1ee7a50a7426cc6403d289b6faf27a |