
coralbricks-context-prep

Build-time context preparation for agentic AI. Clean → chunk → embed → enrich → hydrate. Plain Python, no orchestrator, no parser layer.


coralbricks-context-prep is the open-source context-prep layer of the Coral Bricks platform. It ships a small, opinionated set of primitives over records: turn the data you already have (in pandas / duckdb / parquet / a queue) into agent-ready memory — cleaned text, chunks, embeddings, and an optional knowledge graph.

Why another RAG library?

Most RAG / context libraries try to do three things at once:

  1. Load files (PDF, HTML, parquet, S3, ...).
  2. Transform them (clean, chunk, embed, NER, summarise, ...).
  3. Orchestrate the pipeline (DAG, retries, scheduling, parallelism).

coralbricks-context-prep only does #2.

  • No file loaders. Use pandas, duckdb, pyarrow, or your warehouse client. They are better at IO than we will ever be.
  • No orchestrator. Use Airflow, Prefect, Dagster, Ray, or Spark. They are better at scheduling than we will ever be.
  • Just the transformations. Pure Python functions over list[dict] records that work the same in a Jupyter cell, a Prefect task, or a Spark UDF.

This makes the library tiny, testable, easy to drop into existing data pipelines, and easy to scale: every primitive is per-record.
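
To make the per-record contract concrete, here is a hedged sketch of a primitive in this style: a pure function over list[dict] records. The strip_tags function below is illustrative only (the real clean verb uses trafilatura, not a regex), but the shape is the point: no classes, no runner, just records in and records out.

```python
import re

def strip_tags(records):
    """Illustrative per-record primitive: a pure function over list[dict].

    Because each record is handled independently, the same function runs
    unchanged in a Jupyter cell, a Prefect task, or a Spark/Ray partition map.
    """
    out = []
    for rec in records:
        text = re.sub(r"<[^>]+>", " ", rec["text"])   # naive tag removal
        out.append({**rec, "text": " ".join(text.split())})
    return out

docs = [{"id": "doc-1", "text": "<p>$AAPL is up.</p>"}]
strip_tags(docs)  # → [{"id": "doc-1", "text": "$AAPL is up."}]
```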

Install

pip install coralbricks-context-prep                     # core (zero hard deps)
pip install 'coralbricks-context-prep[chunkers]'         # + tiktoken
pip install 'coralbricks-context-prep[cleaners]'         # + trafilatura
pip install 'coralbricks-context-prep[embed-api]'        # + openai, requests
pip install 'coralbricks-context-prep[embed-st]'         # + sentence-transformers, torch
pip install 'coralbricks-context-prep[embed-bedrock]'    # + boto3
pip install 'coralbricks-context-prep[graph]'            # + pyarrow
pip install 'coralbricks-context-prep[enrichers-spacy]'  # + spaCy
pip install 'coralbricks-context-prep[all]'              # everything above

60-second quickstart

from coralbricks.context_prep import clean, chunk, embed, enrich, hydrate

# Records you already loaded with whatever tool you like.
records = [
    {"id": "doc-1", "text": "<html><body><p>$AAPL is up.</p></body></html>"},
    {"id": "doc-2", "text": "<html><body><p>$MSFT and $AAPL rallied today.</p></body></html>"},
]

cleaned  = clean(records)                        # trafilatura main-content extraction
chunks   = chunk(cleaned, strategy="sliding_token", target_tokens=512)
vectors  = embed(chunks,  model="coral_embed")   # or openai, bedrock, st:bge-m3, ...
enriched = enrich(cleaned, extractors=["tickers", "dates", "urls"])
graph    = hydrate(enriched, graph="news")       # entity co-occurrence graph

# Push to your vector store / graph DB / object store of choice.
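
To illustrate what the sliding_token strategy in the quickstart does, here is a sketch over plain whitespace tokens. The real chunker counts tiktoken tokens; the function name and the overlap parameter here are assumptions for illustration, not the library's API.

```python
def sliding_chunks(text, target_tokens=512, overlap=64):
    """Sketch of a sliding-window chunker over whitespace tokens.

    Windows of target_tokens tokens advance by (target_tokens - overlap),
    so consecutive chunks share `overlap` tokens of context.
    """
    tokens = text.split()
    step = target_tokens - overlap
    chunks = []
    for start in range(0, max(len(tokens), 1), step):
        window = tokens[start:start + target_tokens]
        if not window:
            break
        chunks.append(" ".join(window))
        if start + target_tokens >= len(tokens):
            break
    return chunks

sliding_chunks("a b c d e f g h", target_tokens=4, overlap=1)
# → ['a b c d', 'd e f g', 'g h']
```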

Verbs

Verb                             What it does
clean(records)                   Trafilatura main-content extraction. No custom rules.
chunk(records, strategy=...)     fixed_token, sliding_token, recursive_character, or sentence.
embed(chunks, model=...)         Coral, OpenAI, Bedrock, sentence-transformers, DeepInfra. Parquet output optional.
enrich(records, extractors=...)  Regex extractors (tickers/urls/dates/money/...) + spaCy NER.
hydrate(records, graph=...)      Triple extraction → deduplicated nodes/edges. Parquet output optional.
join(left, right, on=...)        Hash join over dict records (small data; for big data, use DuckDB).

Every verb returns an Artifact describing what it produced (record_count, lineage, materialised payload), and accepts dry_run=True for composing recipes without executing them.
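
The contract of the join verb (a hash join over dict records) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the library's implementation; in particular, the inner-join semantics and the right-side-wins collision behaviour below are assumptions.

```python
def hash_join(left, right, on):
    """Inner hash join over dict records: index the right side by key,
    then probe the index once per left record."""
    index = {}
    for r in right:
        index.setdefault(r[on], []).append(r)
    return [
        {**l, **r}                   # on key collision, right-side fields win
        for l in left
        for r in index.get(l[on], [])
    ]

hash_join(
    [{"id": "doc-1", "text": "..."}],
    [{"id": "doc-1", "tickers": ["AAPL"]}],
    on="id",
)  # → [{"id": "doc-1", "text": "...", "tickers": ["AAPL"]}]
```

Building the index once makes the probe O(1) per left record, which is why this beats a nested loop even on modest inputs; past a few million rows, the README's advice to hand the join to DuckDB stands.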

Records, not classes

The universal record shape across pandas, duckdb, parquet, JSON, and message queues is the dict. So that is the only shape the library accepts:

{"id": "doc-1", "text": "...", "source": None, "metadata": {}}

source and metadata are optional. We do not ship a custom Document class; normalize_records() does the minimal coercion in one place so verbs stay simple.
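
As a sketch of the kind of coercion this implies (the actual normalize_records() may differ; the function name below and the coercion of id to str are assumptions for illustration):

```python
def normalize(records):
    """Sketch of minimal coercion to the canonical record shape:
    require id and text, fill the optional source/metadata defaults."""
    out = []
    for rec in records:
        if "id" not in rec or "text" not in rec:
            raise ValueError("record needs 'id' and 'text'")
        out.append({
            "id": str(rec["id"]),            # assumption: ids coerced to str
            "text": rec["text"],
            "source": rec.get("source"),
            "metadata": rec.get("metadata") or {},
        })
    return out

normalize([{"id": 42, "text": "hello"}])
# → [{"id": "42", "text": "hello", "source": None, "metadata": {}}]
```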

At scale

For million-row jobs, drop the Recipe runner and call the primitives directly inside your orchestrator's tasks:

# in a Ray / Prefect / Spark / Airflow task:
from coralbricks.context_prep.chunkers import chunk_text
from coralbricks.context_prep.embedders import create_embedder

embedder = create_embedder("st:BAAI/bge-m3", dimension=1024)

def transform_partition(rows):
    out = []
    for row in rows:
        chunks = chunk_text(row["text"], strategy="sliding_token", target_tokens=512)
        texts = [c.text for c in chunks]
        vectors, _ = embedder.embed_texts(texts)
        out.extend(
            {"doc_id": row["id"], **c.to_dict(), "vector": v}
            for c, v in zip(chunks, vectors)
        )
    return out

For embedding jobs that should land on disk instead of in RAM, write each shard to its own vectors.parquet (fixed-size float32 list column — Qdrant, pgvector, LanceDB, and DuckDB ingest it directly):

from coralbricks.context_prep.embedders import write_vectors_parquet

vectors, _ = embedder.embed_texts(texts)
write_vectors_parquet(
    chunks, vectors, output_dir=f"/scratch/run/shard={i:05d}",
    model=embedder.get_model_name(),
    dimension=embedder.get_dimension(),
)

(Or just call embed(chunks, output_dir="/scratch/run", embedder=embedder) inside a single-process pipeline.)

For knowledge graphs, build per-shard graphs in parallel and combine them with merge_graphs(*partials):

from coralbricks.context_prep.graph import hydrate_graph, merge_graphs

partials = pool.map(lambda shard: hydrate_graph(shard, extractors), shards)
full = merge_graphs(*partials)   # nodes deduped, edge weights summed
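
The merge step ("nodes deduped, edge weights summed") is simple enough to sketch. The dict-based graph shape below is an illustrative assumption, not the library's actual node/edge types, and the function is renamed to make that clear.

```python
def merge_partial_graphs(*partials):
    """Combine per-shard graphs: union nodes by id, sum edge weights.

    Assumed graph shape (illustrative only):
    {"nodes": {node_id: attrs}, "edges": {(src, dst): weight}}
    """
    nodes, edges = {}, {}
    for g in partials:
        for node_id, attrs in g["nodes"].items():
            nodes.setdefault(node_id, attrs)          # dedupe: first occurrence wins
        for key, weight in g["edges"].items():
            edges[key] = edges.get(key, 0) + weight   # edge weights summed
    return {"nodes": nodes, "edges": edges}
```

Because node dedup and weight summation are both associative, shards can be merged in any order, which is what makes the per-shard parallel build safe.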

Comparison to other tools

Tool                      Loaders  Transforms  Orchestrator  Vector store  License
LangChain                 yes      yes         partial       yes           MIT
LlamaIndex                yes      yes         partial       yes           MIT
Unstructured.io           yes      yes         no            no            Apache
coralbricks-context-prep  no       yes         no            no            Apache

If you want batteries-included, use LangChain or LlamaIndex. If you already have a data stack and want a small, sharp library that does one job well, use coralbricks-context-prep.

Examples

See the examples/ directory of the repository.

Contributing

See CONTRIBUTING.md. Issues and PRs welcome on GitHub.

License

Apache 2.0 © Coral Bricks AI.
