industrial processing plant for data ingestion
coreason-refinery
coreason-refinery is the industrial processing plant for data ingestion at CoReason-AI. It transforms raw documents (PDF, Excel, PPTX) into semantically structured, machine-readable Markdown, ready for RAG (Retrieval-Augmented Generation) pipelines.
Unlike traditional RAG tools that blindly chunk text, coreason-refinery is Structure-Aware. It preserves the logical hierarchy of documents (headers, sections) and ensures complex artifacts like tables and equations are kept intact.
Features
- Multi-Modal Parsing:
  - PDF: Vision-first parsing to preserve table grids and layout.
  - Excel/CSV: Treated as relational data, converted to Markdown tables with header preservation.
  - PowerPoint: Flattens slides into linear narratives (planned support for speaker notes).
- Semantic Segmentation:
  - Splits documents by logical headers (#, ##) rather than character counts.
  - Table Rescue: Never splits tables mid-row; merges tables spanning multiple pages.
- Context Injection ("Breadcrumbs"):
  - Enriches every chunk with its hierarchical path (e.g., Context: Protocol > Section 4 > Toxicity).
  - Ensures LLMs understand the specific scope of any given text snippet.
- GxP Compliance Ready:
  - Designed for lineage tracking and metadata enrichment.
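To make header-based segmentation and breadcrumbs concrete, here is a minimal standalone sketch. It is not the coreason-refinery implementation: `Chunk` and `split_by_headers` are hypothetical names used only for illustration, and the sketch assumes the input is already Markdown with `#`-style headers. Because it only ever cuts at header lines, a table under a header stays in one chunk.

```python
import re
from dataclasses import dataclass

# Matches Markdown ATX headers such as "# Title" or "### Sub-section".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:  # hypothetical stand-in, not the library's chunk model
    breadcrumb: str  # e.g. "Protocol > Section 4 > Toxicity"
    text: str        # body text under that header, tables included


def split_by_headers(markdown: str) -> list[Chunk]:
    """Split Markdown at header lines and tag each chunk with its header path."""
    path: list[tuple[int, str]] = []  # stack of (level, title)
    body: list[str] = []
    chunks: list[Chunk] = []

    def flush() -> None:
        # Emit the accumulated body (if any) under the current header path.
        text = "\n".join(body).strip()
        if text:
            crumb = " > ".join(title for _, title in path)
            chunks.append(Chunk(breadcrumb=crumb, text=text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous header
            level = len(match.group(1))
            # Drop headers at the same or a deeper level before pushing this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

This sketch does not handle `#` lines inside fenced code, oversized sections, or tables that span pages; those cases need the additional logic the feature list describes.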
Installation
pip install coreason-refinery
Note: This package requires Python 3.12+.
Usage
Here is how to initialize and run a refinery job:
import uuid

from coreason_refinery.models import IngestionJob, IngestionConfig
from coreason_refinery.pipeline import RefineryPipeline

# 1. Configure the Job
config = IngestionConfig(
    chunk_strategy="HEADER",
    segment_len=500
)
job = IngestionJob(
    id=uuid.uuid4(),
    source_file_path="path/to/document.pdf",
    file_type="auto",  # Infers PDF/Excel/CSV
    config=config,
    status="PROCESSING"
)

# 2. Run the Pipeline
pipeline = RefineryPipeline()
chunks = pipeline.process(job)

# 3. Inspect Results
for chunk in chunks:
    print(f"--- Chunk ID: {chunk.id} ---")
    print(chunk.text)
    print(f"Metadata: {chunk.metadata}")
Architecture
The pipeline consists of three main stages:
- The Cracker (Parsing): Routes files to specialized parsers (UnstructuredPdfParser, ExcelParser) to extract atomic elements.
- The Cutter (Segmentation): Reassembles elements into RefinedChunks based on document structure, applying "Rolling Context" to preserve hierarchy.
- The Enricher (Metadata): (Planned) Adds lineage and semantic tags.
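The routing idea behind the Cracker stage, together with the CSV-to-Markdown conversion mentioned under Features, can be sketched with the standard library alone. Everything here is an assumption made for illustration: `pick_parser`, `PARSER_BY_SUFFIX`, and `csv_to_markdown` are hypothetical names, and the package's real parsers (UnstructuredPdfParser, ExcelParser) may select and convert files differently.

```python
import csv
import io
from pathlib import Path


def csv_to_markdown(csv_text: str) -> str:
    """Render CSV text as a Markdown table, using the first row as the header."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    if not rows:
        return ""
    header, *data = rows
    width = len(header)

    def fmt(cells: list[str]) -> str:
        # Pad or trim ragged rows to the header width and escape pipe characters.
        cells = (cells + [""] * width)[:width]
        return "| " + " | ".join(c.strip().replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "|" + "|".join(["---"] * width) + "|"]
    lines.extend(fmt(row) for row in data)
    return "\n".join(lines)


# Hypothetical routing table: file extension -> parser label.
PARSER_BY_SUFFIX = {".pdf": "pdf", ".xlsx": "excel", ".csv": "csv", ".pptx": "pptx"}


def pick_parser(path: str) -> str:
    """Choose a parser label from the file extension, case-insensitively."""
    suffix = Path(path).suffix.lower()
    try:
        return PARSER_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or path!r}") from None
```

Keeping the header row as the first table row is what lets a downstream segmenter treat each table as one self-describing unit.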
License
This software is proprietary and dual-licensed under the Prosperity Public License 3.0.
Commercial use beyond a 30-day trial requires a separate license.
See LICENSE for details.