industrial processing plant for data ingestion

These details have not been verified by PyPI

Project links

License
- Other/Proprietary License
Operating System
- OS Independent
Programming Language
- Python :: 3.12

Project description

coreason-refinery

coreason-refinery is the industrial processing plant for data ingestion at CoReason-AI. It transforms raw documents (PDF, Excel, PPTX) into semantically structured, machine-readable Markdown, ready for RAG (Retrieval-Augmented Generation) pipelines.

Unlike traditional RAG tools that blindly chunk text, coreason-refinery is Structure-Aware. It preserves the logical hierarchy of documents (headers, sections) and ensures complex artifacts like tables and equations are kept intact.

Features

Multi-Modal Parsing:
- PDF: Vision-first parsing to preserve table grids and layout.
- Excel/CSV: Treated as relational data, converted to Markdown tables with header preservation.
- PowerPoint: Flattens slides into linear narratives (planned support for speaker notes).
Semantic Segmentation:
- Splits documents by logical headers (#, ##) rather than character counts.
- Table Rescue: Never splits tables mid-row; merges tables spanning multiple pages.
Context Injection ("Breadcrumbs"):
- Enriches every chunk with its hierarchical path (e.g., Context: Protocol > Section 4 > Toxicity).
- Ensures LLMs understand the specific scope of any given text snippet.
GxP Compliance Ready:
- Designed for lineage tracking and metadata enrichment.
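The header-based splitting and breadcrumb injection listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the library's implementation: `Chunk`, `split_by_headers`, and the `" > "` separator are hypothetical names chosen for this example, and the sketch ignores tables and fenced code blocks.

```python
import re
from dataclasses import dataclass

# Matches Markdown ATX headers such as "# Title" or "### Subsection".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    breadcrumb: str  # hierarchical path, e.g. "Protocol > Section 4 > Toxicity"
    text: str        # body text found under that header


def split_by_headers(markdown: str) -> list[Chunk]:
    """Split Markdown at header lines and tag each chunk with its header path."""
    path: list[str] = []   # current header stack, outermost header first
    body: list[str] = []   # lines collected since the last header
    chunks: list[Chunk] = []

    def flush() -> None:
        # Emit the collected body (if any) under the current header path.
        text = "\n".join(body).strip()
        if text:
            chunks.append(Chunk(" > ".join(path), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headers at the same or a deeper level, then push this one.
            del path[level - 1:]
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks


if __name__ == "__main__":
    sample = "# Protocol\nIntro\n## Section 4\n### Toxicity\nDose limits."
    for chunk in split_by_headers(sample):
        print(f"Context: {chunk.breadcrumb}\n{chunk.text}\n")
```

Because each chunk carries its full header path, a retrieved snippet such as "Dose limits." stays attributable to its section even when it is read in isolation.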

Installation

pip install coreason-refinery

Note: This package requires Python 3.12+.

Usage

Here is how to initialize and run a refinery job:

import uuid
from coreason_refinery.models import IngestionJob, IngestionConfig
from coreason_refinery.pipeline import RefineryPipeline

# 1. Configure the Job
config = IngestionConfig(
    chunk_strategy="HEADER",
    segment_len=500
)

job = IngestionJob(
    id=uuid.uuid4(),
    source_file_path="path/to/document.pdf",
    file_type="auto",  # Infers PDF/Excel/CSV
    config=config,
    status="PROCESSING"
)

# 2. Run the Pipeline
pipeline = RefineryPipeline()
chunks = pipeline.process(job)

# 3. Inspect Results
for chunk in chunks:
    print(f"--- Chunk ID: {chunk.id} ---")
    print(chunk.text)
    print(f"Metadata: {chunk.metadata}")

Architecture

The pipeline consists of three main stages:

The Cracker (Parsing): Routes files to specialized parsers (UnstructuredPdfParser, ExcelParser) to extract atomic elements.
The Cutter (Segmentation): Reassembles elements into RefinedChunks based on document structure, applying "Rolling Context" to preserve hierarchy.
The Enricher (Metadata): (Planned) Adds lineage and semantic tags.
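The routing step of the first stage might look roughly like the sketch below, which dispatches on the file extension. Everything here is an assumption made for illustration: `parse_pdf`, `parse_spreadsheet`, `PARSERS`, and `crack` are hypothetical stand-ins, not the package's API, and the real parsers (UnstructuredPdfParser, ExcelParser) may be selected differently.

```python
from pathlib import Path
from typing import Callable


# Hypothetical stand-ins for the real parsers; each returns a list of elements.
def parse_pdf(path: Path) -> list[str]:
    return [f"pdf elements from {path.name}"]


def parse_spreadsheet(path: Path) -> list[str]:
    return [f"table elements from {path.name}"]


# Assumed mapping from file suffix to parser; the real routing table is not
# documented in this README.
PARSERS: dict[str, Callable[[Path], list[str]]] = {
    ".pdf": parse_pdf,
    ".xlsx": parse_spreadsheet,
    ".csv": parse_spreadsheet,
}


def crack(source: str) -> list[str]:
    """Pick a parser from the file suffix (case-insensitive) and run it."""
    path = Path(source)
    suffix = path.suffix.lower()
    try:
        parser = PARSERS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix or '(none)'}") from None
    return parser(path)
```

Keeping the mapping in a dictionary means a new format only needs one new entry and one parser function, which is one plausible reason to separate routing from parsing.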

License

This software is proprietary and dual-licensed. It is available under the Prosperity Public License 3.0; commercial use beyond a 30-day trial requires a separate license. See LICENSE for details.

Release history

- 0.2.2 (Jan 28, 2026)
- 0.2.1 (Jan 28, 2026), this version
- 0.2.0 (Jan 25, 2026)
- 0.1.1 (Jan 25, 2026)
- 0.1.0 (Jan 20, 2026)

Download files

Source Distribution

coreason_refinery-0.2.1.tar.gz (15.4 kB)

Uploaded Jan 28, 2026 Source

Built Distribution

coreason_refinery-0.2.1-py3-none-any.whl (19.8 kB)

Uploaded Jan 28, 2026 Python 3

File details

Details for the file coreason_refinery-0.2.1.tar.gz.

File metadata

Download URL: coreason_refinery-0.2.1.tar.gz
Upload date: Jan 28, 2026
Size: 15.4 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for coreason_refinery-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`d7ed1ca1ef0e18388c0d74e10cb0bc923b4b40a69c80651cde7887b6be8629a6`
MD5	`b505e5b91ddd0d517802227775a9db97`
BLAKE2b-256	`7fee1f23bb44d23ac56ce2244c8891e575c72ea013d4173bd0450e973f867c89`
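
A downloaded file can be checked against the SHA256 digest listed above using only the Python standard library. The sketch below is one way to do it; the helper names `sha256_of` and `matches_digest` are made up for this example, and the local file path is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path

# SHA256 digest published above for coreason_refinery-0.2.1.tar.gz.
EXPECTED_SHA256 = "d7ed1ca1ef0e18388c0d74e10cb0bc923b4b40a69c80651cde7887b6be8629a6"


def sha256_of(path: Path, block_size: int = 1 << 16) -> str:
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_digest(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Calling `matches_digest(Path("coreason_refinery-0.2.1.tar.gz"), EXPECTED_SHA256)` on the saved archive should return `True` if the download is intact; the same helper works for the wheel with its own digest.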

Provenance

The following attestation bundles were made for coreason_refinery-0.2.1.tar.gz:

Publisher: publish.yml on CoReason-AI/coreason-refinery

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: coreason_refinery-0.2.1.tar.gz
- Subject digest: d7ed1ca1ef0e18388c0d74e10cb0bc923b4b40a69c80651cde7887b6be8629a6
- Sigstore transparency entry: 868175929
- Sigstore integration time: Jan 28, 2026
Source repository:
- Permalink: CoReason-AI/coreason-refinery@7aca6aa403ea7c781865910b52e5437cef7e246a
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/CoReason-AI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7aca6aa403ea7c781865910b52e5437cef7e246a
- Trigger Event: release

File details

Details for the file coreason_refinery-0.2.1-py3-none-any.whl.

File metadata

Download URL: coreason_refinery-0.2.1-py3-none-any.whl
Upload date: Jan 28, 2026
Size: 19.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for coreason_refinery-0.2.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`471c68d6e3589c6004cb21d351d4305d066066691f2d2785244d78e4c9a64f7d`
MD5	`26c4b362496712567fa3280e4cd60f47`
BLAKE2b-256	`ec6e6317396099eff1e8ad5423c0baa096cf04d2f58b90591658afa0f0661485`

Provenance

The following attestation bundles were made for coreason_refinery-0.2.1-py3-none-any.whl:

Publisher: publish.yml on CoReason-AI/coreason-refinery

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: coreason_refinery-0.2.1-py3-none-any.whl
- Subject digest: 471c68d6e3589c6004cb21d351d4305d066066691f2d2785244d78e4c9a64f7d
- Sigstore transparency entry: 868175930
- Sigstore integration time: Jan 28, 2026
Source repository:
- Permalink: CoReason-AI/coreason-refinery@7aca6aa403ea7c781865910b52e5437cef7e246a
- Branch / Tag: refs/tags/v0.2.1
- Owner: https://github.com/CoReason-AI
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@7aca6aa403ea7c781865910b52e5437cef7e246a
- Trigger Event: release

coreason-refinery 0.2.1
