coralbricks-context-prep
Build-time context preparation for agentic AI: clean → chunk → embed → enrich → hydrate. Plain Python, no orchestrator, no parser layer — just primitives over records.
coralbricks-context-prep is the open-source context-prep layer of the Coral Bricks platform. It ships a small, opinionated set of primitives over records that turn the data you already have (in pandas / duckdb / parquet / a queue) into agent-ready memory: cleaned text, chunks, embeddings, and an optional knowledge graph.
Why another RAG library?
Most RAG / context libraries try to do three things at once:
1. Load files (PDF, HTML, parquet, S3, ...).
2. Transform them (clean, chunk, embed, NER, summarise, ...).
3. Orchestrate the pipeline (DAG, retries, scheduling, parallelism).
coralbricks-context-prep only does #2.
- No file loaders. Use pandas, duckdb, pyarrow, or your warehouse client. They are better at IO than we will ever be.
- No orchestrator. Use Airflow, Prefect, Dagster, Ray, or Spark. They are better at scheduling than we will ever be.
- Just the transformations. Pure Python functions over list[dict] records that work the same in a Jupyter cell, a Prefect task, or a Spark UDF.
This makes the library tiny, testable, easy to drop into existing data pipelines, and easy to scale: every primitive is per-record.
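Because every primitive is a pure function over dict records, "scaling" is just mapping the same function over partitions. A stdlib sketch of the shape (the `redact` step is a made-up stand-in for any per-record transform, not a library verb):

```python
def redact(record: dict) -> dict:
    # Hypothetical per-record transform; every real verb has this same shape:
    # dict in, dict out, no shared state.
    return {**record, "text": record["text"].replace("secret", "[redacted]")}

records = [{"id": "doc-1", "text": "a secret plan"}]

# In a notebook cell: a plain list comprehension.
out = [redact(r) for r in records]
# In Spark / Ray / multiprocessing: the identical function maps over
# partitions unchanged, because it touches nothing outside its record.
```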
Install
pip install coralbricks-context-prep # core (zero hard deps)
pip install 'coralbricks-context-prep[chunkers]' # + tiktoken
pip install 'coralbricks-context-prep[cleaners]' # + trafilatura
pip install 'coralbricks-context-prep[embed-api]' # + openai, requests
pip install 'coralbricks-context-prep[embed-st]' # + sentence-transformers, torch
pip install 'coralbricks-context-prep[embed-bedrock]' # + boto3
pip install 'coralbricks-context-prep[graph]' # + pyarrow
pip install 'coralbricks-context-prep[enrichers-spacy]' # + spaCy
pip install 'coralbricks-context-prep[all]' # everything above
60-second quickstart
from coralbricks.context_prep import clean, chunk, embed, enrich, hydrate
# Records you already loaded with whatever tool you like.
records = [
    {"id": "doc-1", "text": "<html><body><p>$AAPL is up.</p></body></html>"},
    {"id": "doc-2", "text": "<html><body><p>$MSFT and $AAPL rallied today.</p></body></html>"},
]
cleaned = clean(records) # trafilatura main-content extraction
chunks = chunk(cleaned, strategy="sliding_token", target_tokens=512)
vectors = embed(chunks, model="coral_embed") # or openai, bedrock, st:bge-m3, ...
enriched = enrich(cleaned, extractors=["tickers", "dates", "urls"])
graph = hydrate(enriched, graph="news") # entity co-occurrence graph
# Push to your vector store / graph DB / object store of choice.
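The sliding_token strategy used above windows the token stream with overlap so no sentence is stranded at a chunk boundary. A rough stdlib sketch of just the windowing logic, using whitespace tokens instead of tiktoken (function and parameter names here are illustrative, not the library's):

```python
def sliding_windows(tokens: list[str], target: int, overlap: int) -> list[list[str]]:
    # Advance by (target - overlap) so consecutive chunks share `overlap` tokens.
    step = target - overlap
    return [tokens[i:i + target] for i in range(0, len(tokens), step)]

tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = sliding_windows(tokens, target=4, overlap=2)
# chunks[0] ends with the same two tokens chunks[1] starts with.
```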
Verbs
| Verb | What it does |
|---|---|
| clean(records) | Trafilatura main-content extraction. No custom rules. |
| chunk(records, strategy=...) | fixed_token, sliding_token, recursive_character, or sentence. |
| embed(chunks, model=...) | Coral, OpenAI, Bedrock, sentence-transformers, DeepInfra. Parquet output optional. |
| enrich(records, extractors=...) | Regex extractors (tickers/urls/dates/money/...) + spaCy NER. |
| hydrate(records, graph=...) | Triple extraction → deduplicated nodes/edges. Parquet output optional. |
| join(left, right, on=...) | Hash join over dict records (small data; for big, use DuckDB). |
Every verb returns an Artifact describing what it produced
(record_count, lineage, materialised payload), and accepts
dry_run=True for composing recipes without executing them.
Records, not classes
The universal record shape across pandas, duckdb, parquet, JSON, and message queues is the dict, so that is the only shape the library accepts:
{"id": "doc-1", "text": "...", "source": None, "metadata": {}}
source and metadata are optional. We do not ship a custom
Document class; normalize_records() does the minimal coercion in
one place so verbs stay simple.
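What "minimal coercion" means in practice: fill in the optional keys and leave the rest alone. A per-record sketch of the idea (not the library's actual normalize_records implementation):

```python
def normalize_record(raw: dict) -> dict:
    # Coerce to the universal shape; id and text are required, the rest defaults.
    return {
        "id": str(raw["id"]),
        "text": raw.get("text", "") or "",
        "source": raw.get("source"),
        "metadata": raw.get("metadata") or {},
    }

rec = normalize_record({"id": 7, "text": "hello"})
# → {"id": "7", "text": "hello", "source": None, "metadata": {}}
```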
At scale
For million-row jobs, drop the Recipe runner and call the primitives
directly inside your orchestrator's tasks:
# in a Ray / Prefect / Spark / Airflow task:
from coralbricks.context_prep.chunkers import chunk_text
from coralbricks.context_prep.embedders import create_embedder
embedder = create_embedder("st:BAAI/bge-m3", dimension=1024)
def transform_partition(rows):
    out = []
    for row in rows:
        chunks = chunk_text(row["text"], strategy="sliding_token", target_tokens=512)
        texts = [c.text for c in chunks]
        vectors, _ = embedder.embed_texts(texts)
        out.extend(
            {"doc_id": row["id"], **c.to_dict(), "vector": v}
            for c, v in zip(chunks, vectors)
        )
    return out
For embedding jobs that should land on disk instead of in RAM, write
each shard to its own vectors.parquet (fixed-size float32 list
column — Qdrant, pgvector, LanceDB, and DuckDB ingest it directly):
from coralbricks.context_prep.embedders import write_vectors_parquet
vectors, _ = embedder.embed_texts(texts)
write_vectors_parquet(
    chunks, vectors, output_dir=f"/scratch/run/shard={i:05d}",
    model=embedder.get_model_name(),
    dimension=embedder.get_dimension(),
)
(Or just call embed(chunks, output_dir="/scratch/run", embedder=embedder)
inside a single-process pipeline.)
For knowledge graphs, build per-shard graphs in parallel and combine
them with merge_graphs(*partials):
from coralbricks.context_prep.graph import hydrate_graph, merge_graphs
partials = pool.map(lambda shard: hydrate_graph(shard, extractors), shards)
full = merge_graphs(*partials) # nodes deduped, edge weights summed
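The reduce step (nodes deduped, edge weights summed) is essentially a set union plus a Counter merge. A stdlib sketch, assuming a simplified graph shape of (node-id set, edge-to-weight Counter) rather than the library's actual partial-graph type:

```python
from collections import Counter

def merge_graph_parts(*parts):
    # Each part: (set of node ids, Counter mapping (src, dst) -> weight).
    nodes, edges = set(), Counter()
    for part_nodes, part_edges in parts:
        nodes |= part_nodes       # dedupe nodes by id
        edges.update(part_edges)  # sum weights for repeated edges
    return nodes, edges

a = ({"AAPL", "MSFT"}, Counter({("AAPL", "MSFT"): 2}))
b = ({"AAPL", "TSLA"}, Counter({("AAPL", "MSFT"): 1, ("AAPL", "TSLA"): 1}))
nodes, edges = merge_graph_parts(a, b)
# → 3 distinct nodes; edge ("AAPL", "MSFT") now carries weight 3
```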
Comparison to other tools
| Tool | Loaders | Transforms | Orchestrator | Vector store | License |
|---|---|---|---|---|---|
| LangChain | yes | yes | partial | yes | MIT |
| LlamaIndex | yes | yes | partial | yes | MIT |
| Unstructured.io | yes | yes | no | no | Apache |
| coralbricks-context-prep | no | yes | no | no | Apache |
If you want batteries-included, use LangChain or LlamaIndex. If you
already have a data stack and want a small, sharp library that does
one job well, use coralbricks-context-prep.
Examples
See examples/:
- rag_quickstart.py — end-to-end clean → chunk → embed.
- knowledge_graph.py — build an entity graph from news.
- distributed_hydrate.py — hydrate_graph + merge_graphs reduce.
Documentation
- docs/ARCHITECTURE.md — design principles & verb model.
- docs/EXTENDING.md — write your own chunker / extractor / embedder.
- CHANGELOG.md — release history.
Contributing
See CONTRIBUTING.md. Issues and PRs welcome on GitHub.
License
Apache 2.0 © Coral Bricks AI.