industrial processing plant for data ingestion

coreason-refinery

coreason-refinery is the industrial processing plant for data ingestion at CoReason-AI. It transforms raw documents (PDF, Excel, PPTX) into semantically structured, machine-readable Markdown, ready for RAG (Retrieval-Augmented Generation) pipelines.

Unlike traditional RAG tools that chunk text by fixed character counts, coreason-refinery is structure-aware: it preserves the logical hierarchy of documents (headers, sections) and keeps complex artifacts such as tables and equations intact.

Features

  • Multi-Modal Parsing:
    • PDF: Vision-first parsing to preserve table grids and layout.
    • Excel/CSV: Treated as relational data, converted to Markdown tables with header preservation.
    • PowerPoint: Flattens slides into a linear narrative (speaker-notes support is planned).
  • Semantic Segmentation:
    • Splits documents by logical headers (#, ##) rather than character counts.
    • Table Rescue: Never splits tables mid-row; merges tables spanning multiple pages.
  • Context Injection ("Breadcrumbs"):
    • Enriches every chunk with its hierarchical path (e.g., Context: Protocol > Section 4 > Toxicity).
    • Ensures LLMs understand the specific scope of any given text snippet.
  • GxP Compliance Ready:
    • Designed for lineage tracking and metadata enrichment.
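
To make the header-based segmentation and breadcrumb ideas above concrete, here is a minimal, standalone sketch. It is not the library's implementation: `Chunk`, `split_by_headers`, and the exact `Context:` prefix format are assumptions chosen for illustration, and the sketch ignores edge cases such as `#` lines inside code fences.

```python
import re
from dataclasses import dataclass

# Matches Markdown ATX headers such as "# Title" or "### Subsection".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    breadcrumb: str  # hierarchical path, e.g. "Protocol > Section 4 > Toxicity"
    text: str        # section body with the context line prepended


def split_by_headers(markdown: str) -> list[Chunk]:
    """Split Markdown at header lines and tag each chunk with its header path."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (header level, title)
    body: list[str] = []

    def flush() -> None:
        # Emit the section collected so far, if it has any text.
        text = "\n".join(body).strip()
        if text:
            crumb = " > ".join(title for _, title in path)
            prefix = f"Context: {crumb}\n\n" if crumb else ""
            chunks.append(Chunk(crumb, prefix + text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the section that was open before this header
            level = len(match.group(1))
            # Pop headers at the same or deeper level, then push the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Protocol\nIntro.\n## Section 4\n### Toxicity\nDose limits.\n## Section 5\nFollow-up."
for chunk in split_by_headers(doc):
    print(chunk.breadcrumb)
```

Keeping the header path on a stack means a chunk's breadcrumb always reflects its enclosing sections, regardless of how deep the nesting goes.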

Installation

pip install coreason-refinery

Note: This package requires Python 3.12+.

Usage

Here is how to initialize and run a refinery job:

import uuid
from coreason_refinery.models import IngestionJob, IngestionConfig
from coreason_refinery.pipeline import RefineryPipeline

# 1. Configure the Job
config = IngestionConfig(
    chunk_strategy="HEADER",
    segment_len=500
)

job = IngestionJob(
    id=uuid.uuid4(),
    source_file_path="path/to/document.pdf",
    file_type="auto",  # Infers PDF/Excel/CSV
    config=config,
    status="PROCESSING"
)

# 2. Run the Pipeline
pipeline = RefineryPipeline()
chunks = pipeline.process(job)

# 3. Inspect Results
for chunk in chunks:
    print(f"--- Chunk ID: {chunk.id} ---")
    print(chunk.text)
    print(f"Metadata: {chunk.metadata}")
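
Once chunks are produced, a common next step is to hand them to an indexer. The sketch below writes chunk-like objects to JSON Lines. It assumes only the `id`, `text`, and `metadata` attributes used in the loop above, and further assumes that `metadata` is a mapping; `chunks_to_jsonl` is a hypothetical helper, not part of coreason-refinery.

```python
import json
from types import SimpleNamespace
from typing import Any, Iterable


def chunks_to_jsonl(chunks: Iterable[Any]) -> str:
    """Serialize chunk-like objects to JSON Lines, one record per chunk."""
    lines = []
    for chunk in chunks:
        record = {
            "id": str(chunk.id),               # str() also covers UUID ids
            "text": chunk.text,
            "metadata": dict(chunk.metadata),  # assumption: metadata is a mapping
        }
        lines.append(json.dumps(record, ensure_ascii=False))
    return "\n".join(lines)


# Stand-in objects; in practice these would come from pipeline.process(job).
demo = [SimpleNamespace(id="c1", text="Dose limits.", metadata={"page": 3})]
print(chunks_to_jsonl(demo))
```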

Architecture

The pipeline consists of three main stages:

  1. The Cracker (Parsing): Routes files to specialized parsers (UnstructuredPdfParser, ExcelParser) to extract atomic elements.
  2. The Cutter (Segmentation): Reassembles elements into RefinedChunks based on document structure, applying "Rolling Context" to preserve hierarchy.
  3. The Enricher (Metadata): (Planned) Adds lineage and semantic tags.
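
The routing step of the Cracker can be pictured roughly as follows. This is a sketch under stated assumptions, not the package's code: `EXTENSION_TYPES`, `infer_file_type`, `csv_to_markdown`, `PARSERS`, and `crack` are hypothetical names, and only a CSV parser is filled in, using the standard library to show header-preserving conversion to a Markdown table.

```python
import csv
import io
from pathlib import Path
from typing import Callable

# Hypothetical extension-to-type table; the library's real routing may differ.
EXTENSION_TYPES = {
    ".pdf": "pdf",
    ".xlsx": "excel",
    ".xls": "excel",
    ".csv": "csv",
    ".pptx": "pptx",
}


def infer_file_type(path: str) -> str:
    """Guess the file type from the extension, case-insensitively."""
    suffix = Path(path).suffix.lower()
    if suffix not in EXTENSION_TYPES:
        raise ValueError(f"Unsupported file extension: {suffix!r}")
    return EXTENSION_TYPES[suffix]


def csv_to_markdown(text: str) -> str:
    """Render CSV text as a Markdown table, keeping the first row as the header."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return ""
    header, *data = rows
    width = len(header)

    def fmt(cells: list[str]) -> str:
        # Pad or trim ragged rows to the header width and escape pipes.
        padded = (cells + [""] * width)[:width]
        return "| " + " | ".join(c.replace("|", "\\|") for c in padded) + " |"

    separator = "| " + " | ".join(["---"] * width) + " |"
    return "\n".join([fmt(header), separator, *map(fmt, data)])


# Registry of parsers keyed by file type; only CSV is sketched here.
PARSERS: dict[str, Callable[[str], str]] = {"csv": csv_to_markdown}


def crack(path: str, content: str) -> str:
    """Route content to the parser registered for the file's inferred type."""
    file_type = infer_file_type(path)
    parser = PARSERS.get(file_type)
    if parser is None:
        raise NotImplementedError(f"No parser sketched for type {file_type!r}")
    return parser(content)


print(crack("doses.csv", "drug,dose\nA,5\n"))
```

A registry keyed by file type keeps the routing logic separate from the parsers, so adding a new format means registering one more callable.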

License

This software is proprietary and dual-licensed: it is available under the Prosperity Public License 3.0, and commercial use beyond a 30-day trial requires a separate license. See LICENSE for details.
