


coreason-refinery


coreason-refinery is the industrial processing plant for data ingestion at CoReason-AI. It transforms raw documents (PDF, Excel, PPTX) into semantically structured, machine-readable Markdown, ready for RAG (Retrieval-Augmented Generation) pipelines.

Unlike RAG tools that chunk text at fixed character counts regardless of content, coreason-refinery is structure-aware: it preserves the logical hierarchy of documents (headers, sections) and keeps complex artifacts such as tables and equations intact.

Features

  • Multi-Modal Parsing:
    • PDF: Vision-first parsing to preserve table grids and layout.
    • Excel/CSV: Treated as relational data, converted to Markdown tables with header preservation.
    • PowerPoint: Flattens slides into a linear narrative (speaker-note support is planned).
  • Semantic Segmentation:
    • Splits documents by logical headers (#, ##) rather than character counts.
    • Table Rescue: Never splits tables mid-row; merges tables spanning multiple pages.
  • Context Injection ("Breadcrumbs"):
    • Enriches every chunk with its hierarchical path (e.g., Context: Protocol > Section 4 > Toxicity).
    • Ensures LLMs understand the specific scope of any given text snippet.
  • GxP Compliance Ready:
    • Designed for lineage tracking and metadata enrichment.
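
To make header-based segmentation and breadcrumbs concrete, here is a minimal sketch written independently of the package. It assumes the input is already Markdown; `SketchChunk` and `split_by_headers` are hypothetical names for illustration, not part of the coreason-refinery API. Because the text is only ever cut at header lines, a table under a header stays in one chunk.

```python
import re
from dataclasses import dataclass

# Matches Markdown ATX headers such as "# Title" or "### Subsection".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class SketchChunk:
    """Hypothetical stand-in; not the library's chunk model."""
    breadcrumb: str
    text: str


def split_by_headers(markdown: str) -> list[SketchChunk]:
    """Split Markdown at header lines and tag each chunk with its header path."""
    chunks: list[SketchChunk] = []
    path: list[tuple[int, str]] = []  # stack of (level, title)
    body: list[str] = []

    def flush() -> None:
        # Emit the text collected so far under the current header path.
        text = "\n".join(body).strip()
        if text:
            crumb = " > ".join(title for _, title in path)
            chunks.append(SketchChunk(breadcrumb=crumb, text=text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headers at the same or a deeper level before pushing the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


doc = """# Protocol
Overview text.
## Section 4
### Toxicity
| Dose | Effect |
| --- | --- |
| 10 mg | None |
"""

for chunk in split_by_headers(doc):
    print(f"Context: {chunk.breadcrumb}")
    print(chunk.text)
```

This sketch does not handle `#` lines inside fenced code blocks or tables that span pages; it only illustrates the splitting and breadcrumb idea described above.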

Installation

pip install coreason-refinery

Note: This package requires Python 3.12+.

Usage

Here is how to initialize and run a refinery job:

import uuid
from coreason_refinery.models import IngestionJob, IngestionConfig
from coreason_refinery.pipeline import RefineryPipeline

# 1. Configure the Job
config = IngestionConfig(
    chunk_strategy="HEADER",
    segment_len=500
)

job = IngestionJob(
    id=uuid.uuid4(),
    source_file_path="path/to/document.pdf",
    file_type="auto",  # Infers PDF/Excel/CSV
    config=config,
    status="PROCESSING"
)

# 2. Run the Pipeline
pipeline = RefineryPipeline()
chunks = pipeline.process(job)

# 3. Inspect Results
for chunk in chunks:
    print(f"--- Chunk ID: {chunk.id} ---")
    print(chunk.text)
    print(f"Metadata: {chunk.metadata}")
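
A common next step is to persist the chunks for a downstream RAG indexer. The sketch below writes one JSON record per chunk; it relies only on the `id`, `text`, and `metadata` attributes used in the loop above and assumes `metadata` is a JSON-serializable mapping, which this README does not confirm. `StandInChunk` and `chunks_to_jsonl` are hypothetical names so the example runs without the package installed.

```python
import json
import uuid
from dataclasses import dataclass, field
from typing import Any, Iterable


@dataclass
class StandInChunk:
    """Hypothetical stand-in exposing the id, text, and metadata attributes used above."""
    id: uuid.UUID
    text: str
    metadata: dict[str, Any] = field(default_factory=dict)


def chunks_to_jsonl(chunks: Iterable[Any]) -> str:
    """Serialize chunks to JSON Lines, one record per chunk."""
    lines = []
    for chunk in chunks:
        record = {"id": str(chunk.id), "text": chunk.text, "metadata": chunk.metadata}
        # default=str covers values such as UUIDs or dates that json cannot encode natively.
        lines.append(json.dumps(record, ensure_ascii=False, default=str))
    return "\n".join(lines)


sample = [
    StandInChunk(
        uuid.uuid4(),
        "Context: Protocol > Section 4 > Toxicity\nNo adverse effects.",
        {"page": 12},
    ),
]
print(chunks_to_jsonl(sample))
```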

Architecture

The pipeline consists of three main stages:

  1. The Cracker (Parsing): Routes files to specialized parsers (UnstructuredPdfParser, ExcelParser) to extract atomic elements.
  2. The Cutter (Segmentation): Reassembles elements into RefinedChunks based on document structure, applying "Rolling Context" to preserve hierarchy.
  3. The Enricher (Metadata): (Planned) Adds lineage and semantic tags.
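
The routing idea in the first stage can be sketched as a lookup from file suffix to parser. The example below is an illustration under stated assumptions, not the package's implementation: `PARSERS`, `crack`, and `csv_to_markdown` are hypothetical names, and only a CSV parser is shown, using the standard library to produce a Markdown table with the first row as the header.

```python
import csv
import io
from pathlib import Path
from typing import Callable


def csv_to_markdown(raw: str) -> str:
    """Render CSV text as a Markdown table, using the first row as the header."""
    rows = [row for row in csv.reader(io.StringIO(raw)) if row]
    if not rows:
        return ""
    header, *data = rows

    def fmt(cells: list[str]) -> str:
        # Pad short rows, trim long ones, and escape pipes so cells cannot break the grid.
        padded = (cells + [""] * len(header))[: len(header)]
        return "| " + " | ".join(c.strip().replace("|", "\\|") for c in padded) + " |"

    separator = "| " + " | ".join("---" for _ in header) + " |"
    return "\n".join([fmt(header), separator, *(fmt(r) for r in data)])


# Hypothetical registry mapping file suffixes to parser callables.
PARSERS: dict[str, Callable[[str], str]] = {".csv": csv_to_markdown}


def crack(path: str, raw: str) -> str:
    """Pick a parser from the file suffix and apply it to the raw content."""
    suffix = Path(path).suffix.lower()
    try:
        parser = PARSERS[suffix]
    except KeyError:
        raise ValueError(f"No parser registered for {suffix!r}") from None
    return parser(raw)


print(crack("doses.CSV", "Dose,Effect\n10 mg,None\n50 mg,Mild\n"))
```

A registry keeps the routing step separate from the parsers themselves, so supporting another format means adding one entry rather than editing the dispatch logic.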

License

This software is proprietary and dual-licensed. It is available under the Prosperity Public License 3.0; commercial use beyond a 30-day trial requires a separate commercial license. See LICENSE for details.


