industrial processing plant for data ingestion

coreason-refinery

coreason-refinery is the industrial processing plant for data ingestion at CoReason-AI. It transforms raw documents (PDF, Excel, PPTX) into semantically structured, machine-readable Markdown, ready for RAG (Retrieval-Augmented Generation) pipelines.

Unlike traditional RAG tools that blindly chunk text, coreason-refinery is Structure-Aware. It preserves the logical hierarchy of documents (headers, sections) and ensures complex artifacts like tables and equations are kept intact.

Features

  • Multi-Modal Parsing:
    • PDF: Vision-first parsing to preserve table grids and layout.
    • Excel/CSV: Treated as relational data, converted to Markdown tables with header preservation.
    • PowerPoint: Flattens slides into linear narratives (planned support for speaker notes).
  • Semantic Segmentation:
    • Splits documents by logical headers (#, ##) rather than character counts.
    • Table Rescue: Never splits tables mid-row; merges tables spanning multiple pages.
  • Context Injection ("Breadcrumbs"):
    • Enriches every chunk with its hierarchical path (e.g., Context: Protocol > Section 4 > Toxicity).
    • Ensures LLMs understand the specific scope of any given text snippet.
  • GxP Compliance Ready:
    • Designed for lineage tracking and metadata enrichment.
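
The header-based splitting and breadcrumb injection described above can be illustrated with a small standalone sketch. It does not use coreason-refinery's internals: the Chunk dataclass and split_by_headers function are hypothetical names invented for this example, and the sketch assumes the input is already Markdown with ATX-style headers (#, ##). Because it only cuts at header lines, a table under a header always stays in one chunk. It does not handle "#" lines inside fenced code blocks or tables that span pages.

```python
import re
from dataclasses import dataclass

# Matches ATX headers such as "# Title" or "### Sub-section".
HEADER_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


@dataclass
class Chunk:
    """Hypothetical chunk type for this sketch (not the library's RefinedChunk)."""
    breadcrumb: str
    text: str


def split_by_headers(markdown: str) -> list[Chunk]:
    """Split Markdown at header lines and tag each chunk with its header path."""
    chunks: list[Chunk] = []
    path: list[tuple[int, str]] = []  # stack of (level, title) for open headers
    body: list[str] = []

    def flush() -> None:
        # Emit the text collected under the current header path, if any.
        text = "\n".join(body).strip()
        if text:
            crumb = " > ".join(title for _, title in path)
            chunks.append(Chunk(breadcrumb=crumb, text=text))
        body.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the chunk that belongs to the previous header
            level = len(match.group(1))
            # Drop headers at the same or a deeper level before pushing the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


doc = """# Protocol
Intro text.
## Section 4
### Toxicity
| Dose | Effect |
|------|--------|
| 10mg | None   |
## Section 5
Closing notes.
"""

for chunk in split_by_headers(doc):
    print(f"Context: {chunk.breadcrumb}")
    print(chunk.text)
    print()
```

The stack of open headers is what produces the hierarchical path: a new header pops every entry at its own level or deeper, so the table above is labelled "Protocol > Section 4 > Toxicity" while the closing notes fall back to "Protocol > Section 5".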

Installation

pip install coreason-refinery

Note: This package requires Python 3.12+.

Usage

Here is how to initialize and run a refinery job:

import uuid
from coreason_refinery.models import IngestionJob, IngestionConfig
from coreason_refinery.pipeline import RefineryPipeline

# 1. Configure the Job
config = IngestionConfig(
    chunk_strategy="HEADER",
    segment_len=500
)

job = IngestionJob(
    id=uuid.uuid4(),
    source_file_path="path/to/document.pdf",
    file_type="auto",  # Infers PDF/Excel/CSV
    config=config,
    status="PROCESSING"
)

# 2. Run the Pipeline
pipeline = RefineryPipeline()
chunks = pipeline.process(job)

# 3. Inspect Results
for chunk in chunks:
    print(f"--- Chunk ID: {chunk.id} ---")
    print(chunk.text)
    print(f"Metadata: {chunk.metadata}")
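
Once chunks have been inspected, a common next step is to persist them for a downstream RAG indexer. The following is a minimal sketch, not part of the package: chunks_to_jsonl is a hypothetical helper, and it assumes only that each chunk exposes the id, text, and metadata attributes used in the loop above. Whether metadata is a plain dict is also an assumption; anything the json module cannot encode natively is stringified through default=str.

```python
import json
from collections.abc import Iterable
from typing import Any


def chunks_to_jsonl(chunks: Iterable[Any]) -> str:
    """Serialize objects exposing .id, .text and .metadata as JSON Lines."""
    lines = []
    for chunk in chunks:
        record = {
            "id": str(chunk.id),  # str() covers UUIDs and other id types
            "text": chunk.text,
            "metadata": chunk.metadata,
        }
        # default=str is a fallback for values json cannot encode natively.
        lines.append(json.dumps(record, ensure_ascii=False, default=str))
    return "\n".join(lines)
```

Writing one JSON object per line keeps each chunk independently readable, so a consumer can stream the file without loading every chunk into memory.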

Architecture

The pipeline consists of three main stages:

  1. The Cracker (Parsing): Routes files to specialized parsers (UnstructuredPdfParser, ExcelParser) to extract atomic elements.
  2. The Cutter (Segmentation): Reassembles elements into RefinedChunks based on document structure, applying "Rolling Context" to preserve hierarchy.
  3. The Enricher (Metadata): (Planned) Adds lineage and semantic tags.
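
The routing idea behind the Cracker stage can be sketched as a lookup from file extension to parser function. This is an illustration under stated assumptions, not the package's implementation: PARSERS, crack, parse_csv, and parse_text are hypothetical names, and the real UnstructuredPdfParser and ExcelParser are not reproduced here. The CSV parser shows the "Markdown table with header preservation" behaviour in its simplest form, assuming UTF-8 input and rows of equal length.

```python
import csv
import io
from pathlib import Path
from typing import Callable

Parser = Callable[[bytes], str]


def _row(cells: list[str]) -> str:
    # Escape pipes so cell content cannot break the table grid.
    return "| " + " | ".join(c.replace("|", "\\|") for c in cells) + " |"


def parse_csv(data: bytes) -> str:
    """Render CSV bytes as a Markdown table, using the first row as the header."""
    rows = list(csv.reader(io.StringIO(data.decode("utf-8"), newline="")))
    if not rows:
        return ""
    header, *body = rows
    lines = [_row(header), _row(["---"] * len(header))]
    lines += [_row(row) for row in body]
    return "\n".join(lines)


def parse_text(data: bytes) -> str:
    """Pass plain text or Markdown through unchanged."""
    return data.decode("utf-8")


# Hypothetical registry: file suffix -> parser function.
PARSERS: dict[str, Parser] = {
    ".csv": parse_csv,
    ".md": parse_text,
    ".txt": parse_text,
}


def crack(filename: str, data: bytes) -> str:
    """Route a file to the parser registered for its extension."""
    suffix = Path(filename).suffix.lower()
    try:
        parser = PARSERS[suffix]
    except KeyError:
        raise ValueError(f"No parser registered for {suffix!r}") from None
    return parser(data)


print(crack("doses.csv", b"Dose,Effect\n10mg,None\n"))
```

Keeping the registry as plain data means a new format is supported by adding one entry, and an unsupported file fails loudly instead of being parsed as the wrong type.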

License

This software is proprietary and dual-licensed under the Prosperity Public License 3.0. Commercial use beyond a 30-day trial requires a separate license. See LICENSE for details.
