coreason-refinery

License: Prosperity 3.0 | Build Status | Code Style: Ruff | Documentation

coreason-refinery is the industrial processing plant for data ingestion at CoReason-AI. It transforms raw documents (PDF, Excel, PPTX) into semantically structured, machine-readable Markdown, ready for RAG (Retrieval-Augmented Generation) pipelines.

Unlike traditional RAG tools that blindly chunk text, coreason-refinery is Structure-Aware. It preserves the logical hierarchy of documents (headers, sections) and ensures complex artifacts like tables and equations are kept intact.
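To illustrate what keeping tables intact can mean in practice, here is a minimal sketch in plain Python. It is not coreason-refinery's implementation: the function name `split_keeping_tables` and the `max_chars` parameter are hypothetical, and it assumes Markdown tables are runs of consecutive lines starting with `|`.

```python
def split_keeping_tables(markdown: str, max_chars: int = 200) -> list[str]:
    """Pack lines into chunks of roughly max_chars, treating each table as one block.

    Hypothetical helper for illustration only; not part of coreason-refinery.
    """
    # Step 1: group lines into blocks. Consecutive table rows form one block;
    # every other non-empty line is its own block.
    blocks: list[str] = []
    table_rows: list[str] = []
    for line in markdown.splitlines():
        if line.lstrip().startswith("|"):
            table_rows.append(line)
            continue
        if table_rows:
            blocks.append("\n".join(table_rows))
            table_rows = []
        if line.strip():
            blocks.append(line)
    if table_rows:
        blocks.append("\n".join(table_rows))

    # Step 2: greedily pack blocks into chunks. A block is never split, so a
    # table larger than max_chars becomes an oversized chunk of its own.
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for block in blocks:
        if current and size + len(block) > max_chars:
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(block)
        size += len(block)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

The design choice here is to let a chunk exceed the size budget rather than cut a table mid-row. A naive character-count splitter makes the opposite trade-off.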

Features

  • Multi-Modal Parsing:
    • PDF: Vision-first parsing to preserve table grids and layout.
    • Excel/CSV: Treated as relational data, converted to Markdown tables with header preservation.
    • PowerPoint: Flattens slides into linear narratives (planned support for speaker notes).
  • Semantic Segmentation:
    • Splits documents by logical headers (#, ##) rather than character counts.
    • Table Rescue: Never splits tables mid-row; merges tables spanning multiple pages.
  • Context Injection ("Breadcrumbs"):
    • Enriches every chunk with its hierarchical path (e.g., Context: Protocol > Section 4 > Toxicity).
    • Ensures LLMs understand the specific scope of any given text snippet.
  • GxP Compliance Ready:
    • Designed for lineage tracking and metadata enrichment.
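The header-based splitting and breadcrumb ideas above can be sketched in a few lines of plain Python. This is an illustrative approximation under stated assumptions, not the library's code: `SketchChunk` and `segment_by_headers` are hypothetical names, and only ATX-style Markdown headers (`#` to `######`) are recognized.

```python
import re
from dataclasses import dataclass

# Matches ATX headers such as "## Section 4"; group 1 is the marker, group 2 the title.
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class SketchChunk:
    """Hypothetical stand-in for a refined chunk; not the library's model."""
    breadcrumb: str
    text: str


def segment_by_headers(markdown: str) -> list[SketchChunk]:
    """Split Markdown at headers and tag each chunk with its header path."""
    path: list[str] = []   # titles of the currently open headers, outermost first
    body: list[str] = []   # lines collected since the last header
    chunks: list[SketchChunk] = []

    def flush() -> None:
        # Emit the collected body (if any) under the current header path.
        text = "\n".join(body).strip()
        if text:
            chunks.append(SketchChunk(" > ".join(path), text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headers at this level or deeper, then open the new one.
            del path[level - 1:]
            path.append(match.group(2))
        else:
            body.append(line)
    flush()
    return chunks
```

With this sketch, text under `# Protocol`, `## Section 4`, `### Toxicity` gets the breadcrumb `Protocol > Section 4 > Toxicity`. Text that appears before any header gets an empty breadcrumb.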

Installation

pip install coreason-refinery

Note: This package requires Python 3.12+.

Usage

Here is how to initialize and run a refinery job:

import uuid
from coreason_refinery.models import IngestionJob, IngestionConfig
from coreason_refinery.pipeline import RefineryPipeline

# 1. Configure the Job
config = IngestionConfig(
    chunk_strategy="HEADER",
    segment_len=500
)

job = IngestionJob(
    id=uuid.uuid4(),
    source_file_path="path/to/document.pdf",
    file_type="auto",  # Infers PDF/Excel/CSV
    config=config,
    status="PROCESSING"
)

# 2. Run the Pipeline
pipeline = RefineryPipeline()
chunks = pipeline.process(job)

# 3. Inspect Results
for chunk in chunks:
    print(f"--- Chunk ID: {chunk.id} ---")
    print(chunk.text)
    print(f"Metadata: {chunk.metadata}")

Architecture

The pipeline consists of three main stages:

  1. The Cracker (Parsing): Routes files to specialized parsers (UnstructuredPdfParser, ExcelParser) to extract atomic elements.
  2. The Cutter (Segmentation): Reassembles elements into RefinedChunks based on document structure, applying "Rolling Context" to preserve hierarchy.
  3. The Enricher (Metadata): (Planned) Adds lineage and semantic tags.
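As a rough picture of the routing done in the first stage, the sketch below dispatches a file to a parser by extension. Everything in it is an assumption made for illustration: the `Parser` protocol, the stub parser classes, the `PARSERS` registry, and the `crack` function are hypothetical and do not reflect the actual interfaces of UnstructuredPdfParser or ExcelParser.

```python
from pathlib import Path
from typing import Protocol


class Parser(Protocol):
    """Hypothetical parser interface: turn a file into a list of text elements."""
    def parse(self, path: str) -> list[str]: ...


class StubPdfParser:
    """Stand-in for a PDF parser; returns a placeholder element."""
    def parse(self, path: str) -> list[str]:
        return [f"pdf element from {path}"]


class StubTableParser:
    """Stand-in for an Excel/CSV parser; returns a placeholder element."""
    def parse(self, path: str) -> list[str]:
        return [f"table element from {path}"]


# Registry mapping lowercase file extensions to parser instances (assumed layout).
PARSERS: dict[str, Parser] = {
    ".pdf": StubPdfParser(),
    ".xlsx": StubTableParser(),
    ".csv": StubTableParser(),
}


def crack(path: str) -> list[str]:
    """Route a file to the parser registered for its extension."""
    suffix = Path(path).suffix.lower()
    try:
        parser = PARSERS[suffix]
    except KeyError:
        raise ValueError(f"No parser registered for {suffix!r}") from None
    return parser.parse(path)
```

A registry keyed by extension keeps the routing logic separate from the parsers, so a new format can be supported by adding one entry.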

License

This software is proprietary and dual-licensed. It is available under the Prosperity Public License 3.0; commercial use beyond a 30-day trial requires a separate license. See LICENSE for details.
