With CocoIndex, users declare the transformation; CocoIndex creates and maintains an index and keeps the derived index up to date as sources change, with minimal computation and changes.

Project description

Enterprise corpus (codebase, Slack, meeting notes, and documentation) flows continuously through the CocoIndex incremental sync engine into a production AI agent with always-fresh context. Only the Δ (delta) is reprocessed on every change.

Your agents deserve fresh context.

Star us ❤️ → Star CocoIndex on GitHub  ·  cocoindex.io  ·  CocoIndex documentation  ·  Join the CocoIndex Discord community

CocoIndex turns codebases, meeting notes, inboxes, Slack, PDFs, and videos into live, continuously fresh context for your AI agents and LLM apps to reason over, with minimal incremental processing. Get your production AI agent ready in 10 minutes with reliable, continuously fresh data: no stale batches, no context gap.

Incremental · only the delta  ·  Any scale · parallel by default  ·  Declarative · Python, 5 min




Built with CocoIndex ❤️

CocoIndex-code: flagship MCP server for AI coding agents. An AST-aware incremental semantic code index that keeps live call graphs, symbols, vectors, and chunks fresh on every commit. 70% fewer tokens per turn, 80-90% cache hits on re-index, sub-second freshness. Supports Python, TypeScript, Rust, and Go. Features: Δ-only incremental processing, semantic search by meaning (not grep), call graphs and blast-radius analysis, and a global repo view for duplicates and architecture. Build coding agents (generate, refactor) and code-review agents (catch, approve). One install, and Claude Code, Cursor, and other MCP-aware agents see your whole repository instantly.

See all 20+ examples · updated every week →


Get started

pip install -U --pre cocoindex     # v1 is in preview — the --pre flag is required

Declare what should be in your target — CocoIndex keeps it in sync forever, recomputing only the Δ.

import cocoindex as coco
from cocoindex.connectors import localfs, postgres
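from cocoindex.ops.text import RecursiveSplitter

# NOTE: `PG` and `embed` are not defined in this snippet. `PG` is the Postgres
# handle passed to mount_table_target; `embed` is your text-embedding function.
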
@coco.fn(memo=True)                          # ← cached by hash(input) + hash(code)
async def index_file(file, table):
    for chunk in RecursiveSplitter().split(await file.read_text()):
        table.declare_row(text=chunk.text, embedding=embed(chunk.text))

@coco.fn
async def main(src):
    table = await postgres.mount_table_target(PG, table_name="docs")
    table.declare_vector_index(column="embedding")
    await coco.mount_each(index_file, localfs.walk_dir(src).items(), table)

coco.App(coco.AppConfig(name="docs"), main, src="./docs").update_blocking()

Run once to backfill. Re-run anytime — only the changed files re-embed.

Full quickstart: pip install, declare sources and targets, run the incremental engine  ·  Learn the concept: sources, targets, flows, incremental engine, and data lineage



React — for data engineering

The CocoIndex mental model: Target = F(Source). It is a persistent-state-driven dataflow: you declare the desired target state, and the engine keeps it in sync with the latest source data and code, at low latency and low cost. Source files (.py, .md, .pdf, .ts) flow through your Python transformation F into a live target index; only the Δ is reprocessed on every change, and every target entry traces back to its exact source byte. Four core properties: Python, not a DAG; declare target state; lineage end-to-end; and incremental at any scale. Your code is as simple as the one-off version, and the engine does the rest.
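To make Target = F(Source) concrete, here is a minimal sketch in plain Python. It is not CocoIndex's API or implementation: `reconcile` and `split_lines` are hypothetical names, and dictionaries stand in for the source and the target store. It shows only the declarative half of the idea (compute the desired target, then apply the difference). Unlike the real engine, it re-evaluates F on every run and saves only the target writes.

```python
from typing import Callable, Dict, Iterable, Tuple

Row = Tuple[str, str]  # (row key, row value); a hypothetical row shape


def reconcile(
    source: Dict[str, str],
    target: Dict[str, str],
    f: Callable[[str, str], Iterable[Row]],
) -> Dict[str, int]:
    """Bring `target` in line with F(source) and report what changed."""
    # Desired state: the rows F declares for the current source.
    desired: Dict[str, str] = {}
    for name, content in source.items():
        for key, value in f(name, content):
            desired[key] = value

    stats = {"upserted": 0, "deleted": 0, "unchanged": 0}
    # Write only rows that are new or whose value differs.
    for key, value in desired.items():
        if target.get(key) == value:
            stats["unchanged"] += 1
        else:
            target[key] = value
            stats["upserted"] += 1
    # Retire rows that F no longer declares.
    for key in list(target):
        if key not in desired:
            del target[key]
            stats["deleted"] += 1
    return stats


def split_lines(name: str, content: str) -> Iterable[Row]:
    """Toy transformation F: one upper-cased target row per non-empty line."""
    for i, line in enumerate(content.splitlines()):
        if line.strip():
            yield f"{name}#{i}", line.strip().upper()


if __name__ == "__main__":
    source = {"a.md": "hello\nworld", "b.md": "foo"}
    target: Dict[str, str] = {}
    print(reconcile(source, target, split_lines))  # first run: every row is new
    source["b.md"] = "bar"
    print(reconcile(source, target, split_lines))  # only b.md's row is rewritten
```

The caller never issues inserts or deletes directly; it states what the target should contain and the diff falls out of the comparison.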

What happens when either side changes: CocoIndex tracks per-row provenance so the Δ propagates at minimum cost. There are two scenarios. Source change: one file (b.md) is edited and only the one target entry derived from it re-syncs. Code change: the transformation function F is rewritten from v1 to v2 and only the entries whose outputs depend on the changed code re-run.
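The two scenarios can be illustrated with a cache keyed by hash(code) plus hash(input), as the quickstart comment describes. This is a simplified sketch, not the engine's mechanism: `MemoRunner` and `code_fingerprint` are hypothetical, and the fingerprint covers only the function's own bytecode, constants, and referenced names, not the helpers it calls.

```python
import hashlib
from typing import Callable, Dict


def code_fingerprint(fn: Callable) -> str:
    """Hash the compiled body of `fn` (bytecode, constants, referenced names)."""
    code = fn.__code__
    payload = repr((code.co_code, code.co_consts, code.co_names)).encode()
    return hashlib.sha256(payload).hexdigest()


class MemoRunner:
    """Runs a transformation only when its code or its input has changed."""

    def __init__(self) -> None:
        self.cache: Dict[str, str] = {}
        self.runs = 0  # number of times a transformation actually executed

    def run(self, fn: Callable[[str], str], data: str) -> str:
        input_hash = hashlib.sha256(data.encode()).hexdigest()
        key = code_fingerprint(fn) + ":" + input_hash
        if key not in self.cache:
            self.runs += 1
            self.cache[key] = fn(data)
        return self.cache[key]


def transform_v1(text: str) -> str:
    return text.upper()


def transform_v2(text: str) -> str:
    return text.lower()


if __name__ == "__main__":
    runner = MemoRunner()
    files = {"a.md": "Hello", "b.md": "World"}

    for content in files.values():
        runner.run(transform_v1, content)
    print(runner.runs)  # both files processed once

    files["b.md"] = "World!"  # source change: only b.md misses the cache
    for content in files.values():
        runner.run(transform_v1, content)
    print(runner.runs)

    for content in files.values():  # code change: v2 has a new fingerprint
        runner.run(transform_v2, content)
    print(runner.runs)
```

Because the key combines both hashes, an unchanged file under unchanged code is always a cache hit, while either kind of change invalidates exactly the entries that depend on it.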

See the React ↔ CocoIndex mental model →



Incremental engine for long-horizon agents

Data transformation for any engineer, designed for AI workloads —
with a smart incremental engine for always-fresh, explainable data.

Learn the concept →

CocoIndex's Python-native transformation flows connect 8 source categories (Codebases, Meeting Notes, Web · APIs, File System · Blob Stores, Databases, Message Queues, Images · Video, Voice · Transcripts) through the incremental engine to 6 target stores (Relational DB, Data Warehouse, Vector DB, Graph DB, Message Queue, Feature Store). A shared flow.py pipeline, sketched as a function decorated with @coco.fn that splits src into chunks and writes embed(chunks) to a target row, illustrates the behavior: unchanged src hits the cache, while changed src re-runs split() and re-embeds only the Δ. The persistent data-pipeline control plane runs eight always-on subsystems: live caching, pipeline catalog, version tracking, continuous learning, lineage, task scheduling, metrics collection, and failure management.



Why incremental?

Your agents are only as good as the data they see.
Batch pipelines drift stale. CocoIndex stays live — and only runs the Δ.

Four core benefits of CocoIndex's incremental engine:

Sub-second fresh: source changes propagate to the target in under a second, so agents see the world as it is, not as it was yesterday.
10× cheaper at scale: in a 10,000-row corpus where the Δ is 0.1%, only that thin slice re-runs and the other 99.9% stays cached, so you pay a fraction of the compute, embedding, and LLM bill.
Explainable by default: every vector, row, or graph node in the target traces back to its exact source byte (for example, handbook.md L42), for debuggable, auditable, regulator-friendly AI pipelines.
Production-grade: a Rust core with retries, exponential back-off, dead-letter queues, and no-data-loss guarantees, production-ready for long-horizon AI agents.
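The row counts in the cost example can be worked through directly. The sketch below is plain arithmetic, not a measurement: `delta_estimate` is a hypothetical helper, the per-row cost is an arbitrary unit, and fixed overhead such as change detection is ignored, so the result is an upper bound on the saving in per-row work.

```python
from typing import Dict


def delta_estimate(total_rows: int, changed_fraction: float, cost_per_row: int) -> Dict[str, int]:
    """Compare per-row work for a full recompute against a delta-only run."""
    changed = round(total_rows * changed_fraction)
    return {
        "rows_reprocessed": changed,
        "rows_cached": total_rows - changed,
        "full_recompute_cost": total_rows * cost_per_row,
        "incremental_cost": changed * cost_per_row,
    }


if __name__ == "__main__":
    # 10,000 rows with a 0.1% delta, at one arbitrary cost unit per row.
    print(delta_estimate(10_000, 0.001, 1))
```

For the figures quoted above, 10 rows are reprocessed and 9,990 are served from cache.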



What can you build?

See all 20+ examples · updated every week →

Working starters from the examples tree — clone, plug your source, ship.

Real-time code index: walk a git repo, AST-chunk source files, embed with sentence-transformers, upsert to pgvector / LanceDB, incremental on every commit.

PDF → RAG index: ingest PDFs from local, S3, or GDrive, extract and chunk text, embed chunks, upsert to pgvector / LanceDB. Classic retrieval-augmented-generation stack, incremental.

HN trending topics: pull Hacker News threads via Algolia, recursively parse comments, LLM-extract topics with Gemini 2.5 Flash, rank by weighted hit count (thread=5, comment=1), store in Postgres. Incremental.

Conversation → knowledge graph: an LLM extracts people, topics, decisions, and action items from transcripts and upserts them into Neo4j / Kuzu. Live graph, incremental.

Multi-repo summarization: walk N git repos, extract structure, LLM-summarize per repo plus a rolled-up org summary, refresh on every push.

Structured extraction: BAML / DSPy typed schema extraction from forms, PDFs, intakes, and invoices into Postgres / warehouse. Incremental.

Podcast → knowledge graph: transcribe YouTube / Spotify audio with speaker diarization, LLM-extract speakers and statements, resolve entities across episodes, store in SurrealDB / Neo4j.

CSV → Kafka live: watch a folder of CSV files and publish each row as a JSON message to a Kafka topic via CocoIndex's Kafka target connector. Incremental, sub-second, no producer loop.
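The weighted hit count in the HN trending topics starter (thread=5, comment=1) can be sketched as follows. Only the two weights come from the description above; `rank_topics` and the `(topic, where)` input shape are assumptions made for illustration, and the example project may structure this differently.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Weights quoted in the example description: a thread-level mention counts 5,
# a comment-level mention counts 1.
WEIGHTS = {"thread": 5, "comment": 1}


def rank_topics(mentions: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Rank topics from (topic, where) pairs, where `where` is "thread" or "comment"."""
    scores: Counter = Counter()
    for topic, where in mentions:
        scores[topic] += WEIGHTS[where]
    # Highest score first; ties broken by topic name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    mentions = [
        ("rust", "thread"),
        ("python", "comment"),
        ("python", "comment"),
        ("rust", "comment"),
        ("llm", "thread"),
    ]
    print(rank_topics(mentions))
```

A single thread-level mention outranks up to four comment-level mentions, which keeps topics that merely recur in replies from dominating the list.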


Share what you build

Building something with CocoIndex? We want to see it.
Tag @cocoindex_io on X or drop a link in #showcase on Discord. We'll boost it. 🥥



Community

Join the CocoIndex Discord community  ·  Subscribe to the CocoIndex YouTube channel  ·  Read the CocoIndex blog  ·  Follow @cocoindex_io on X



We love Contributors

We are so excited to meet you.
Every typo fix, new connector, doc tweak, or full-on rewrite makes CocoIndex better.
Come hang out — big PRs and small ones, both welcome.

📝 Read the contributing guide  ·  🐛 good first issues  ·  💬 Say hi on Discord



CocoIndex Enterprise

Four headline stats for PB-scale incremental indexing: PB-scale corpus, incrementally indexed; 10× fewer LLM embedding calls vs. full recompute; 100% lineage coverage, with every byte traceable; Δ only, always just the delta.

Large corpus — built for enterprise scale.

Incremental compute is the only way to keep large corpora fresh without re-embedding them every cycle.
CocoIndex scales from a single repo to petabyte-scale stores — parallel by default, delta-only by design.


Process once. Reconcile forever.

When a source changes, CocoIndex identifies the affected records, propagates the change
across joins and lookups, updates the target, and retires stale rows —
without touching anything that didn't change.
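A small sketch of the "identify affected records, update the target, retire stale rows" steps, using a provenance map from each source to the target rows it produced. This is an illustration under stated assumptions, not the engine's implementation: `LineageIndex` and `paragraph_rows` are hypothetical names, the target is an in-memory dict, and propagation across joins and lookups is not modeled.

```python
from typing import Callable, Dict, List, Optional, Set, Tuple

Row = Tuple[str, str]  # (row key, row value)


class LineageIndex:
    """Tracks which target rows came from which source, so stale rows can be retired."""

    def __init__(self, transform: Callable[[str, str], List[Row]]) -> None:
        self.transform = transform
        self.target: Dict[str, str] = {}
        self.rows_by_source: Dict[str, Set[str]] = {}

    def apply_change(self, source_id: str, content: Optional[str]) -> Dict[str, List[str]]:
        """Apply one source change; `content=None` means the source was deleted."""
        old_keys = self.rows_by_source.get(source_id, set())
        new_rows = {} if content is None else dict(self.transform(source_id, content))

        # Write only rows that are new or whose value changed.
        written = []
        for key, value in new_rows.items():
            if self.target.get(key) != value:
                self.target[key] = value
                written.append(key)

        # Retire rows this source used to produce but no longer does.
        retired = sorted(old_keys - set(new_rows))
        for key in retired:
            del self.target[key]

        if new_rows:
            self.rows_by_source[source_id] = set(new_rows)
        else:
            self.rows_by_source.pop(source_id, None)
        return {"written": sorted(written), "retired": retired}


def paragraph_rows(source_id: str, content: str) -> List[Row]:
    """Toy transformation: one row per blank-line-separated paragraph."""
    paragraphs = [p.strip() for p in content.split("\n\n") if p.strip()]
    return [(f"{source_id}#{i}", p) for i, p in enumerate(paragraphs)]


if __name__ == "__main__":
    index = LineageIndex(paragraph_rows)
    print(index.apply_change("a.md", "one\n\ntwo\n\nthree"))
    print(index.apply_change("a.md", "one\n\nTWO"))  # one rewrite, one retirement
    print(index.apply_change("a.md", None))          # source deleted
```

Rows belonging to other sources are never read or written during a change, which is what keeps the work proportional to the delta.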


Built on a Rust engine.

The core is Rust — production-grade from day zero.
Parallel chunking, zero-copy transforms where possible, and failure isolation
so one bad record doesn't stall the flow.
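Failure isolation, together with the retries, exponential back-off, and dead-letter handling mentioned earlier in this README, can be pictured with the Python sketch below. It illustrates the pattern only and says nothing about how the Rust core implements it; `process_isolated` and its parameters are hypothetical.

```python
import time
from typing import Callable, Iterable, List, Tuple


def process_isolated(
    records: Iterable[str],
    handler: Callable[[str], str],
    max_attempts: int = 3,
    base_delay: float = 0.01,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Process each record independently; failures never stop the remaining records."""
    results: List[str] = []
    dead_letters: List[Tuple[str, str]] = []  # (record, error) pairs for later inspection
    for record in records:
        for attempt in range(1, max_attempts + 1):
            try:
                results.append(handler(record))
                break
            except Exception as exc:
                if attempt == max_attempts:
                    # Out of retries: park the record instead of raising.
                    dead_letters.append((record, repr(exc)))
                else:
                    # Exponential back-off: base, 2x base, 4x base, ...
                    sleep(base_delay * 2 ** (attempt - 1))
    return results, dead_letters


if __name__ == "__main__":
    def handler(record: str) -> str:
        if record == "bad":
            raise ValueError("cannot parse record")
        return record.upper()

    done, parked = process_isolated(["ok", "bad", "also ok"], handler)
    print(done)    # the good records were still processed
    print(parked)  # the bad record is kept with its error
```

Passing `sleep` as a parameter keeps the function testable without real delays.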



Explore CocoIndex Enterprise → cocoindex.io/enterprise



Apache 2.0 · © CocoIndex contributors 🥥

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cocoindex-1.0.0a50.tar.gz (360.1 kB)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

cocoindex-1.0.0a50-cp314-cp314t-win_amd64.whl (8.0 MB)

Uploaded CPython 3.14t · Windows x86-64

cocoindex-1.0.0a50-cp314-cp314t-manylinux_2_28_x86_64.whl (7.9 MB)

Uploaded CPython 3.14t · manylinux: glibc 2.28+ x86-64

cocoindex-1.0.0a50-cp314-cp314t-manylinux_2_28_aarch64.whl (7.6 MB)

Uploaded CPython 3.14t · manylinux: glibc 2.28+ ARM64

cocoindex-1.0.0a50-cp314-cp314t-macosx_11_0_arm64.whl (7.7 MB)

Uploaded CPython 3.14t · macOS 11.0+ ARM64

cocoindex-1.0.0a50-cp311-abi3-win_amd64.whl (8.0 MB)

Uploaded CPython 3.11+ · Windows x86-64

cocoindex-1.0.0a50-cp311-abi3-manylinux_2_28_x86_64.whl (7.9 MB)

Uploaded CPython 3.11+ · manylinux: glibc 2.28+ x86-64

cocoindex-1.0.0a50-cp311-abi3-manylinux_2_28_aarch64.whl (7.6 MB)

Uploaded CPython 3.11+ · manylinux: glibc 2.28+ ARM64

cocoindex-1.0.0a50-cp311-abi3-macosx_11_0_arm64.whl (7.8 MB)

Uploaded CPython 3.11+ · macOS 11.0+ ARM64

cocoindex-1.0.0a50-cp311-abi3-macosx_10_12_x86_64.whl (7.7 MB)

Uploaded CPython 3.11+ · macOS 10.12+ x86-64

File details

Details for the file cocoindex-1.0.0a50.tar.gz.

File metadata

  • Download URL: cocoindex-1.0.0a50.tar.gz
  • Upload date:
  • Size: 360.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.13.1

File hashes

Hashes for cocoindex-1.0.0a50.tar.gz
Algorithm Hash digest
SHA256 d3fb4c8b408c4b40c1178c1a2d1d077ec8f8b6b4f0b32a445b651446ad330c02
MD5 5c94799dbcf3bcc27a8465d6785c251c
BLAKE2b-256 fbfa5f59ba776b1c58d5ef98706229b2990a60beee314060823a33207512fcdb

See more details on using hashes here.
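A downloaded file can be checked against the SHA256 digest listed above with the Python standard library. This is a minimal sketch: `sha256_of` and `verify` are hypothetical helper names, and the path in the usage example assumes the source distribution sits in the current directory.

```python
import hashlib
from pathlib import Path
from typing import Union

# SHA256 digest listed above for cocoindex-1.0.0a50.tar.gz.
EXPECTED_SHA256 = "d3fb4c8b408c4b40c1178c1a2d1d077ec8f8b6b4f0b32a445b651446ad330c02"


def sha256_of(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify(path: Union[str, Path], expected: str = EXPECTED_SHA256) -> bool:
    """True when the file's digest matches the expected hex string."""
    return sha256_of(path) == expected.strip().lower()


if __name__ == "__main__":
    archive = Path("cocoindex-1.0.0a50.tar.gz")  # assumed download location
    if archive.exists():
        print("digest matches" if verify(archive) else "digest MISMATCH")
    else:
        print(f"{archive} not found; download it first")
```

pip can perform the same check at install time when a requirements file pins `--hash=sha256:...` entries.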

File details

Details for the file cocoindex-1.0.0a50-cp314-cp314t-win_amd64.whl.

File hashes

Hashes for cocoindex-1.0.0a50-cp314-cp314t-win_amd64.whl
Algorithm Hash digest
SHA256 8e08a181bcf630db4107ca9be04b41e7df1a85505936a717571823e4b603c884
MD5 cf9f65135a63d812c604d1c9353a9f36
BLAKE2b-256 98582aaf4429402f6b0aeb2c1db396038a4358e8616bc5db05120f933b611291

File details

Details for the file cocoindex-1.0.0a50-cp314-cp314t-manylinux_2_28_x86_64.whl.

File hashes

Hashes for cocoindex-1.0.0a50-cp314-cp314t-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 901ac1040fc2a2ce0ec01fe0027d90f8e4589402bf0bcd6b8c421cbae688ebc8
MD5 d89c32e3d9795c1e39fb70e27f81954e
BLAKE2b-256 deb98d5b05740d0d7c4afc6b8a8931e750c99eb61fb9b48f4885a8e21e800b46

File details

Details for the file cocoindex-1.0.0a50-cp314-cp314t-manylinux_2_28_aarch64.whl.

File hashes

Hashes for cocoindex-1.0.0a50-cp314-cp314t-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 cfdaa644c11cbf74022c72ddcadde06629dd3d9039ad61fa70d8a7b0a2253c5c
MD5 b89a6944538355f1787182cca93ab6ae
BLAKE2b-256 962605ccb15f6dddeb410cd947a24f77f93c7c901a2b8be2f7e5a104f942b0f5

File details

Details for the file cocoindex-1.0.0a50-cp314-cp314t-macosx_11_0_arm64.whl.

File hashes

Hashes for cocoindex-1.0.0a50-cp314-cp314t-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 2fd291e6ca00eec23f27a91151971b53b86b6f5102904e6bc7633f28b09a661d
MD5 1c6f3d7ff912ce60acf573e47d608442
BLAKE2b-256 7521f9ef399b14e70355f57e7e0a1cea1f6bc81744db6c6669f766df169fa7c4

File details

Details for the file cocoindex-1.0.0a50-cp311-abi3-win_amd64.whl.

File hashes

Hashes for cocoindex-1.0.0a50-cp311-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 dc38d43e2514e275b492e25b7ee220afa266eee84c7852f86842486a911b7dfd
MD5 5d11c1e9f97763ab1f15d54e1f21592a
BLAKE2b-256 0b9e266e115958fea04bea4327fb8a9413120ad3e50a88737e5c41ee83114cbe

File details

Details for the file cocoindex-1.0.0a50-cp311-abi3-manylinux_2_28_x86_64.whl.

File hashes

Hashes for cocoindex-1.0.0a50-cp311-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 21410f51be8972947c6eb600622ac337d69e707360155e5e1230edc78f46624c
MD5 953d11921cda63a034448046507f4bf1
BLAKE2b-256 37410989111b9985c93acf1f15a40b23c08f4531bc75b93a905c14e69786e8f9

File details

Details for the file cocoindex-1.0.0a50-cp311-abi3-manylinux_2_28_aarch64.whl.

File hashes

Hashes for cocoindex-1.0.0a50-cp311-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 94a556091c4522620ebedda2a920ae35b1c40423a60da6cc556285c16fafec1a
MD5 efc3bb75a2dffad596e09de582d144ef
BLAKE2b-256 679bf1a3008e0825b4eccf6e076b842c36d5025439c24e08df29a28a2dab5f2c

File details

Details for the file cocoindex-1.0.0a50-cp311-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for cocoindex-1.0.0a50-cp311-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 bc235f04da609a446aded52811f9330613cffec955b86efdbd0810f3aad2f298
MD5 d76b171be47e4a9994e8b6aac793a511
BLAKE2b-256 1c69dab12e5a0dc2c5df5b20f5af220c0da5ea57f125528a782b2f01efb2a0f2

File details

Details for the file cocoindex-1.0.0a50-cp311-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for cocoindex-1.0.0a50-cp311-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 94bb0a53ffe2a10c7bcb1c71320d1fead458112d574eba76201b09dca0ad6059
MD5 b46d6e551e07ca4a45e81ed0f66112f0
BLAKE2b-256 1b560ef1ff18c196a5a7dbcac63ee91bfccc04aecbcf9769fc8c7c55f81a95cb
