coreason-refinery

coreason-refinery is the industrial processing plant for data ingestion at CoReason-AI. It transforms raw documents (PDF, Excel, PPTX) into semantically structured, machine-readable Markdown, ready for RAG (Retrieval-Augmented Generation) pipelines.

Unlike traditional RAG tools that blindly chunk text, coreason-refinery is Structure-Aware. It preserves the logical hierarchy of documents (headers, sections) and ensures complex artifacts like tables and equations are kept intact.

Features

  • Multi-Modal Parsing:
    • PDF: Vision-first parsing to preserve table grids and layout.
    • Excel/CSV: Treated as relational data, converted to Markdown tables with header preservation.
    • PowerPoint: Flattens slides into a linear narrative (speaker-note support is planned).
  • Semantic Segmentation:
    • Splits documents by logical headers (#, ##) rather than character counts.
    • Table Rescue: Never splits tables mid-row; merges tables spanning multiple pages.
  • Context Injection ("Breadcrumbs"):
    • Enriches every chunk with its hierarchical path (e.g., Context: Protocol > Section 4 > Toxicity).
    • Ensures LLMs understand the specific scope of any given text snippet.
  • GxP Compliance Ready:
    • Designed for lineage tracking and metadata enrichment.
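
The header-based splitting and breadcrumb ideas above can be illustrated with a small, standalone sketch. It does not use coreason-refinery and is not the package's implementation: `Chunk`, `split_by_headers`, and the sample document are hypothetical names chosen for illustration, and the sketch assumes ATX-style Markdown headers with no fenced code blocks containing `#` lines.

```python
import re
from dataclasses import dataclass

# Matches ATX headers such as "# Title" or "### Sub-section".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:  # hypothetical stand-in, not the package's chunk model
    breadcrumb: str  # e.g. "Protocol > Section 4 > Toxicity"
    text: str


def split_by_headers(markdown: str) -> list[Chunk]:
    """Split Markdown at header lines and tag each chunk with its header path."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (level, title) for open headers
    body: list[str] = []

    def flush() -> None:
        # Emit the text collected so far under the current header path.
        text = "\n".join(body).strip()
        if text:
            crumb = " > ".join(title for _, title in path)
            chunks.append(Chunk(breadcrumb=crumb, text=text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close any header at the same or a deeper level before opening this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


doc = """# Protocol
Overview text.
## Section 4
### Toxicity
Dose-limiting events are listed here.
## Section 5
Follow-up schedule.
"""

for chunk in split_by_headers(doc):
    print(f"Context: {chunk.breadcrumb}")
    print(chunk.text)
```

Keeping a stack of open headers is one simple way to produce the hierarchical path: a new header closes every header at the same or a deeper level, so the stack always holds the current section's ancestry.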

Installation

pip install coreason-refinery

Note: This package requires Python 3.12+.

Usage

Here is how to initialize and run a refinery job:

import uuid
from coreason_refinery.models import IngestionJob, IngestionConfig
from coreason_refinery.pipeline import RefineryPipeline

# 1. Configure the Job
config = IngestionConfig(
    chunk_strategy="HEADER",
    segment_len=500
)

job = IngestionJob(
    id=uuid.uuid4(),
    source_file_path="path/to/document.pdf",
    file_type="auto",  # Infers PDF/Excel/CSV
    config=config,
    status="PROCESSING"
)

# 2. Run the Pipeline
pipeline = RefineryPipeline()
chunks = pipeline.process(job)

# 3. Inspect Results
for chunk in chunks:
    print(f"--- Chunk ID: {chunk.id} ---")
    print(chunk.text)
    print(f"Metadata: {chunk.metadata}")
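
Once chunks are available, a common next step in a RAG pipeline is to persist them for indexing. The sketch below writes chunks as JSON Lines. It relies only on the `id`, `text`, and `metadata` attributes used in the loop above; `chunks_to_jsonl` is a hypothetical helper, the sample metadata keys are invented for illustration, and the assumption that `metadata` is a plain dict may not hold for the real chunk model.

```python
import json
from types import SimpleNamespace
from typing import Any, Iterable


def chunks_to_jsonl(chunks: Iterable[Any]) -> str:
    """Serialize objects exposing .id, .text and .metadata to JSON Lines."""
    lines = []
    for chunk in chunks:
        record = {
            "id": str(chunk.id),  # str() also covers UUID values
            "text": chunk.text,
            "metadata": chunk.metadata,  # assumed to be dict-like
        }
        # default=str avoids failing on values json cannot encode, such as dates.
        lines.append(json.dumps(record, ensure_ascii=False, default=str))
    return "\n".join(lines)


# Stand-in objects so the sketch runs without the package installed.
# The metadata key "source" is made up for this example.
sample = [
    SimpleNamespace(id="c1", text="Overview text.", metadata={"source": "document.pdf"}),
    SimpleNamespace(id="c2", text="Follow-up schedule.", metadata={"source": "document.pdf"}),
]

print(chunks_to_jsonl(sample))
```

With the real pipeline, the list returned by `pipeline.process(job)` would take the place of `sample`, provided the attribute assumptions above hold.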

Architecture

The pipeline consists of three main stages:

  1. The Cracker (Parsing): Routes files to specialized parsers (UnstructuredPdfParser, ExcelParser) to extract atomic elements.
  2. The Cutter (Segmentation): Reassembles elements into RefinedChunks based on document structure, applying "Rolling Context" to preserve hierarchy.
  3. The Enricher (Metadata): (Planned) Adds lineage and semantic tags.
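
To make the Cracker stage more concrete, here is a minimal sketch of routing a file to a parser by extension, with a CSV parser that emits a Markdown table and keeps the header row. Everything in it is an assumption made for illustration: `csv_to_markdown`, `PARSERS`, and `route` are hypothetical names, and the interfaces of the package's own parsers (UnstructuredPdfParser, ExcelParser) are not shown in this README and may differ.

```python
import csv
import io
from pathlib import Path
from typing import Callable


def csv_to_markdown(raw: str) -> str:
    """Render CSV text as a Markdown table, keeping the first row as the header."""
    rows = list(csv.reader(io.StringIO(raw)))
    if not rows:
        return ""
    header, *data = rows
    width = len(header)

    def fmt(cells: list[str]) -> str:
        # Pad or trim so every row has as many cells as the header; escape pipes.
        cells = (cells + [""] * width)[:width]
        return "| " + " | ".join(c.strip().replace("|", "\\|") for c in cells) + " |"

    lines = [fmt(header), "| " + " | ".join(["---"] * width) + " |"]
    lines.extend(fmt(row) for row in data)
    return "\n".join(lines)


# Hypothetical registry mapping a file extension to a parser function.
PARSERS: dict[str, Callable[[str], str]] = {".csv": csv_to_markdown}


def route(path: str) -> Callable[[str], str]:
    """Pick a parser from the file extension, case-insensitively."""
    suffix = Path(path).suffix.lower()
    try:
        return PARSERS[suffix]
    except KeyError:
        raise ValueError(f"No parser registered for {suffix!r}") from None


parser = route("doses.CSV")
print(parser("drug,dose_mg\nA,10\nB,20\n"))
```

A registry keyed by extension keeps the routing step separate from the parsers themselves, so supporting another format only means registering one more function.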

License

This software is proprietary and dual-licensed: it is available under the Prosperity Public License 3.0, and commercial use beyond a 30-day trial requires a separate license. See LICENSE for details.
