coralbricks-context-prep
Build-time context preparation for agentic AI: clean → chunk → embed → enrich → hydrate. Plain Python, no orchestrator, no parser layer — just primitives over records.
coralbricks-context-prep is the open-source context-prep layer of the Coral Bricks platform. It ships a small, opinionated set of primitives over records that turn the data you already have (in pandas / duckdb / parquet / a queue) into agent-ready memory: cleaned text, chunks, embeddings, and an optional knowledge graph.
Why another RAG library?
Most RAG / context libraries try to do three things at once:
1. Load files (PDF, HTML, parquet, S3, ...).
2. Transform them (clean, chunk, embed, NER, summarise, ...).
3. Orchestrate the pipeline (DAG, retries, scheduling, parallelism).
coralbricks-context-prep only does #2.
- No file loaders. Use pandas, duckdb, pyarrow, or your warehouse client. They are better at IO than we will ever be.
- No orchestrator. Use Airflow, Prefect, Dagster, Ray, or Spark. They are better at scheduling than we will ever be.
- Just the transformations. Pure Python functions over list[dict] records that work the same in a Jupyter cell, a Prefect task, or a Spark UDF.
This makes the library tiny, testable, easy to drop into existing data pipelines, and easy to scale: every primitive is per-record.
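Because every primitive is a pure function over dict records, "scaling" is just mapping the same function over partitions. A stdlib sketch of the shape (the `redact` step is a made-up stand-in for any per-record transform, not a library verb):

```python
def redact(record: dict) -> dict:
    # Hypothetical per-record transform; every real verb has this same shape:
    # dict in, dict out, no shared state.
    return {**record, "text": record["text"].replace("secret", "[redacted]")}

records = [{"id": "doc-1", "text": "a secret plan"}]

# In a notebook cell: a plain list comprehension.
out = [redact(r) for r in records]
# In Spark / Ray / multiprocessing: the identical function maps over
# partitions unchanged, because it touches nothing outside its record.
```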
Install
pip install coralbricks-context-prep # core (zero hard deps)
pip install 'coralbricks-context-prep[chunkers]' # + tiktoken
pip install 'coralbricks-context-prep[cleaners]' # + trafilatura
pip install 'coralbricks-context-prep[embed-api]' # + openai, requests
pip install 'coralbricks-context-prep[embed-st]' # + sentence-transformers, torch
pip install 'coralbricks-context-prep[embed-bedrock]' # + boto3
pip install 'coralbricks-context-prep[graph]' # + pyarrow
pip install 'coralbricks-context-prep[enrichers-spacy]' # + spaCy
pip install 'coralbricks-context-prep[all]' # everything above
60-second quickstart
from coralbricks.context_prep import clean, chunk, embed, enrich, hydrate
# Records you already loaded with whatever tool you like.
records = [
    {"id": "doc-1", "text": "<html><body><p>$AAPL is up.</p></body></html>"},
    {"id": "doc-2", "text": "<html><body><p>$MSFT and $AAPL rallied today.</p></body></html>"},
]
cleaned = clean(records) # trafilatura main-content extraction
chunks = chunk(cleaned, strategy="sliding_token", target_tokens=512)
vectors = embed(chunks, model="coral_embed") # or openai, bedrock, st:bge-m3, ...
enriched = enrich(cleaned, extractors=["tickers", "dates", "urls"])
graph = hydrate(enriched, graph="news") # entity co-occurrence graph
# Push to your vector store / graph DB / object store of choice.
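The sliding_token strategy used above windows the token stream with overlap so no sentence is stranded at a chunk boundary. A rough stdlib sketch of just the windowing logic, using whitespace tokens instead of tiktoken (function and parameter names here are illustrative, not the library's):

```python
def sliding_windows(tokens: list[str], target: int, overlap: int) -> list[list[str]]:
    # Advance by (target - overlap) so consecutive chunks share `overlap` tokens.
    step = target - overlap
    return [tokens[i:i + target] for i in range(0, len(tokens), step)]

tokens = "the quick brown fox jumps over the lazy dog".split()
chunks = sliding_windows(tokens, target=4, overlap=2)
# chunks[0] ends with the same two tokens chunks[1] starts with.
```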
Verbs
| Verb | What it does |
|---|---|
| clean(records) | Trafilatura main-content extraction. No custom rules. |
| chunk(records, strategy=...) | fixed_token, sliding_token, recursive_character, or sentence. |
| embed(chunks, model=...) | Coral, OpenAI, Bedrock, sentence-transformers, DeepInfra. Parquet output optional. |
| enrich(records, extractors=...) | Regex extractors (tickers/urls/dates/money/...) + spaCy NER. |
| hydrate(records, graph=...) | Triple extraction → deduplicated nodes/edges. Parquet output optional. |
| join(left, right, on=...) | Hash join over dict records (small data; for big, use DuckDB). |
Every verb returns an Artifact describing what it produced
(record_count, lineage, materialised payload), and accepts
dry_run=True for composing recipes without executing them.
Records, not classes
The universal record shape across pandas, duckdb, parquet, JSON, and message queues is the dict, so that is the only shape the library accepts:
{"id": "doc-1", "text": "...", "source": None, "metadata": {}}
source and metadata are optional. We do not ship a custom
Document class; normalize_records() does the minimal coercion in
one place so verbs stay simple.
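What "minimal coercion" means in practice: fill in the optional keys and leave the rest alone. A per-record sketch of the idea (not the library's actual normalize_records implementation):

```python
def normalize_record(raw: dict) -> dict:
    # Coerce to the universal shape; id and text are required, the rest defaults.
    return {
        "id": str(raw["id"]),
        "text": raw.get("text", "") or "",
        "source": raw.get("source"),
        "metadata": raw.get("metadata") or {},
    }

rec = normalize_record({"id": 7, "text": "hello"})
# → {"id": "7", "text": "hello", "source": None, "metadata": {}}
```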
At scale
For million-row jobs, drop the Recipe runner and call the primitives
directly inside your orchestrator's tasks:
# in a Ray / Prefect / Spark / Airflow task:
from coralbricks.context_prep.chunkers import chunk_text
from coralbricks.context_prep.embedders import create_embedder
embedder = create_embedder("st:BAAI/bge-m3", dimension=1024)
def transform_partition(rows):
    out = []
    for row in rows:
        chunks = chunk_text(row["text"], strategy="sliding_token", target_tokens=512)
        texts = [c.text for c in chunks]
        vectors, _ = embedder.embed_texts(texts)
        out.extend(
            {"doc_id": row["id"], **c.to_dict(), "vector": v}
            for c, v in zip(chunks, vectors)
        )
    return out
For embedding jobs that should land on disk instead of in RAM, write
each shard to its own vectors.parquet (fixed-size float32 list
column — Qdrant, pgvector, LanceDB, and DuckDB ingest it directly):
from coralbricks.context_prep.embedders import write_vectors_parquet
vectors, _ = embedder.embed_texts(texts)
write_vectors_parquet(
    chunks, vectors, output_dir=f"/scratch/run/shard={i:05d}",
    model=embedder.get_model_name(),
    dimension=embedder.get_dimension(),
)
(Or just call embed(chunks, output_dir="/scratch/run", embedder=embedder)
inside a single-process pipeline.)
For knowledge graphs, build per-shard graphs in parallel and combine
them with merge_graphs(*partials):
from coralbricks.context_prep.graph import hydrate_graph, merge_graphs
partials = pool.map(lambda shard: hydrate_graph(shard, extractors), shards)
full = merge_graphs(*partials) # nodes deduped, edge weights summed
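The reduce step (nodes deduped, edge weights summed) is essentially a set union plus a Counter merge. A stdlib sketch, assuming a simplified graph shape of (node-id set, edge-to-weight Counter) rather than the library's actual partial-graph type:

```python
from collections import Counter

def merge_graph_parts(*parts):
    # Each part: (set of node ids, Counter mapping (src, dst) -> weight).
    nodes, edges = set(), Counter()
    for part_nodes, part_edges in parts:
        nodes |= part_nodes       # dedupe nodes by id
        edges.update(part_edges)  # sum weights for repeated edges
    return nodes, edges

a = ({"AAPL", "MSFT"}, Counter({("AAPL", "MSFT"): 2}))
b = ({"AAPL", "TSLA"}, Counter({("AAPL", "MSFT"): 1, ("AAPL", "TSLA"): 1}))
nodes, edges = merge_graph_parts(a, b)
# → 3 distinct nodes; edge ("AAPL", "MSFT") now carries weight 3
```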
Comparison to other tools
| Tool | Loaders | Transforms | Orchestrator | Vector store | License |
|---|---|---|---|---|---|
| LangChain | yes | yes | partial | yes | MIT |
| LlamaIndex | yes | yes | partial | yes | MIT |
| Unstructured.io | yes | yes | no | no | Apache |
| coralbricks-context-prep | no | yes | no | no | Apache |
If you want batteries-included, use LangChain or LlamaIndex. If you
already have a data stack and want a small, sharp library that does
one job well, use coralbricks-context-prep.
Examples
See examples/:
- rag_quickstart.py — end-to-end clean → chunk → embed.
- knowledge_graph.py — build an entity graph from news.
- distributed_hydrate.py — hydrate_graph + merge_graphs reduce.
Documentation
- docs/ARCHITECTURE.md — design principles & verb model.
- docs/EXTENDING.md — write your own chunker / extractor / embedder.
- CHANGELOG.md — release history.
Contributing
See CONTRIBUTING.md. Issues and PRs welcome on GitHub.
License
Apache 2.0 © Coral Bricks AI.