With CocoIndex, you declare the transformation; CocoIndex creates and maintains the index and keeps it up to date as the source changes, with minimal recomputation and changes.

CocoIndex

Data transformation for AI

An ultra-performant data transformation framework for AI, with a core engine written in Rust. It supports incremental processing and data lineage out of the box, offers exceptional developer velocity, and is production-ready from day 0.

⭐ Drop a star to help us grow!



CocoIndex makes it effortless to transform data with AI and keep source data and targets in sync. Whether you're building a vector index, creating knowledge graphs for context engineering, or performing any custom data transformation, CocoIndex goes beyond SQL.


CocoIndex Features


Exceptional velocity

Just declare the transformation as a dataflow in about 100 lines of Python:

# import
data['content'] = flow_builder.add_source(...)

# transform
data['out'] = (
    data['content']
    .transform(...)
    .transform(...)
)

# collect data
collector.collect(...)

# export to db, vector db, graph db ...
collector.export(...)

CocoIndex follows the dataflow programming model. Each transformation creates a new field based solely on its input fields, without hidden state or value mutation. All data before and after each transformation is observable, with lineage out of the box.

In particular, developers don't explicitly mutate data by creating, updating, and deleting records. They only need to define the transformation (formula) for a set of source data.
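
To make the dataflow idea concrete, here is a minimal plain-Python sketch. It does not use the CocoIndex API; the `derive` helper and the field names are hypothetical and exist only for illustration. Each step returns a new record with one extra field computed only from an existing field, so the input is never mutated and every intermediate value stays observable.

```python
from typing import Any, Callable, Mapping

Record = Mapping[str, Any]


def derive(record: Record, out_field: str, in_field: str,
           fn: Callable[[Any], Any]) -> dict[str, Any]:
    """Return a copy of `record` with `out_field` set to fn(record[in_field]).

    The input record is left untouched, and the new field depends only on
    the named input field (no hidden state).
    """
    if out_field in record:
        raise ValueError(f"field {out_field!r} already exists")
    return {**record, out_field: fn(record[in_field])}


source = {"content": "hello dataflow world"}
step1 = derive(source, "upper", "content", str.upper)
step2 = derive(step1, "word_count", "content", lambda s: len(s.split()))

# Every intermediate record is still available for inspection.
print(source)
print(step1)
print(step2)
```

Because no step overwrites an earlier field, the chain of records doubles as a simple lineage trail: each field can be traced back to the field it was derived from.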

Plug-and-Play Building Blocks

Native built-ins for different sources, targets, and transformations. A standardized interface makes switching between components a one-line code change, as easy as assembling building blocks.
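
The one-line switch can be illustrated with a small plain-Python sketch. It is not CocoIndex's actual interface; `Target`, `InMemoryTarget`, `JsonLinesTarget`, and `export` are hypothetical names. The point is only that when every component follows the same interface, the calling code does not change when a component is swapped.

```python
import io
import json
from typing import Iterable, Protocol


class Target(Protocol):
    """Hypothetical shared interface that every export target implements."""

    def write(self, rows: Iterable[dict]) -> None: ...


class InMemoryTarget:
    """Keeps exported rows in a Python list."""

    def __init__(self) -> None:
        self.rows: list[dict] = []

    def write(self, rows: Iterable[dict]) -> None:
        self.rows.extend(rows)


class JsonLinesTarget:
    """Writes each exported row as one JSON line to a text stream."""

    def __init__(self, stream: io.TextIOBase) -> None:
        self.stream = stream

    def write(self, rows: Iterable[dict]) -> None:
        for row in rows:
            self.stream.write(json.dumps(row) + "\n")


def export(rows: Iterable[dict], target: Target) -> None:
    """Export code depends only on the shared interface."""
    target.write(rows)


rows = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]

target = InMemoryTarget()  # swap this one line to change the destination
# target = JsonLinesTarget(io.StringIO())
export(rows, target)
```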

Data Freshness

CocoIndex keeps source data and targets in sync effortlessly.

Incremental Processing

It has out-of-the-box support for incremental indexing:

  • minimal recomputation on source or logic changes;
  • (re-)processing of only the necessary portions, reusing the cache when possible.
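
One common way to get this behavior is to fingerprint each source item together with a version of the transformation logic and recompute only when the fingerprint changes. The sketch below shows that idea in plain Python. It is an illustration under that assumption, not a description of how CocoIndex implements it internally; `IncrementalIndex`, `fingerprint`, and `logic_version` are hypothetical names.

```python
import hashlib
from typing import Callable


def fingerprint(content: str, logic_version: str) -> str:
    """Hash the source content together with the transformation version."""
    h = hashlib.sha256()
    h.update(logic_version.encode("utf-8"))
    h.update(b"\0")
    h.update(content.encode("utf-8"))
    return h.hexdigest()


class IncrementalIndex:
    def __init__(self, transform: Callable[[str], str], logic_version: str) -> None:
        self.transform = transform
        self.logic_version = logic_version
        self._fingerprints: dict[str, str] = {}
        self.outputs: dict[str, str] = {}

    def update(self, sources: dict[str, str]) -> list[str]:
        """Sync outputs with `sources`; return the keys that were recomputed."""
        recomputed = []
        for key, content in sources.items():
            fp = fingerprint(content, self.logic_version)
            if self._fingerprints.get(key) != fp:
                # New or changed item: recompute and remember its fingerprint.
                self.outputs[key] = self.transform(content)
                self._fingerprints[key] = fp
                recomputed.append(key)
        for key in list(self.outputs):
            if key not in sources:
                # Source item was deleted: drop the derived output too.
                del self.outputs[key]
                del self._fingerprints[key]
        return recomputed


index = IncrementalIndex(str.upper, logic_version="v1")
first = index.update({"a.md": "alpha", "b.md": "beta"})    # both items are new
second = index.update({"a.md": "alpha", "b.md": "beta!"})  # only b.md changed
third = index.update({"a.md": "alpha"})                    # b.md was removed
print(first, second, third, index.outputs)
```

Changing `logic_version` changes every fingerprint, which forces a full recomputation; that mirrors the "logic change" case in the list above.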

Quick Start

If you're new to CocoIndex, we recommend checking out

Setup

  1. Install the CocoIndex Python library:

pip install -U cocoindex

  2. Install Postgres if you don't have it. CocoIndex uses it for incremental processing.

  3. (Optional) Install the Claude Code skill for an enhanced development experience. Run these commands in Claude Code:

/plugin marketplace add cocoindex-io/cocoindex-claude
/plugin install cocoindex-skills@cocoindex
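
Since CocoIndex depends on Postgres, it can help to sanity-check the connection string before running a flow. The sketch below only parses a URL with the standard library. The environment variable name `COCOINDEX_DATABASE_URL` and the default URL are assumptions made for this example (confirm the exact setting in the CocoIndex documentation), and `describe_database_url` is a hypothetical helper.

```python
import os
from urllib.parse import urlsplit


def describe_database_url(url: str) -> dict[str, object]:
    """Validate a Postgres URL and return its host, port, and database name."""
    parts = urlsplit(url)
    if parts.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"expected a postgres:// URL, got scheme {parts.scheme!r}")
    database = parts.path.lstrip("/")
    if not parts.hostname or not database:
        raise ValueError("URL must include a host and a database name")
    # 5432 is the default Postgres port when none is given.
    return {"host": parts.hostname, "port": parts.port or 5432, "database": database}


# Assumed variable name and a made-up local default, for illustration only.
url = os.environ.get(
    "COCOINDEX_DATABASE_URL",
    "postgres://cocoindex:cocoindex@localhost/cocoindex",
)
print(describe_database_url(url))
```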

Define data flow

Follow the Quick Start Guide to define your first indexing flow. An example flow looks like this:

import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.targets.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])

It defines an index flow like this:

[Figure: Data Flow]
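
To show what the chunking parameters and the cosine metric in the example flow mean, here is a toy, dependency-free sketch of the same three stages: split, embed, rank. Everything in it is a simplification and an assumption: `sliding_chunks` uses fixed-width character windows rather than the syntax-aware splitting of `SplitRecursively`, `toy_embed` counts letters instead of calling a SentenceTransformer model, and `search` ranks rows in Python rather than through a Postgres vector index.

```python
import math
from collections import Counter


def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Split `text` into windows of `chunk_size` characters that overlap by
    `chunk_overlap` characters. Returns (start_offset, chunk_text) pairs."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return chunks


def toy_embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: counts of the letters a-z."""
    counts = Counter(ch for ch in text.lower() if "a" <= ch <= "z")
    return [float(counts[chr(ord("a") + i)]) for i in range(26)]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query: str, rows: list[dict], k: int = 1) -> list[dict]:
    """Rank rows by cosine similarity to the query embedding, best first."""
    q = toy_embed(query)
    return sorted(rows, key=lambda r: cosine_similarity(q, r["embedding"]),
                  reverse=True)[:k]


document = "aaaa bbbb cccc dddd"
rows = [
    {"location": offset, "text": chunk, "embedding": toy_embed(chunk)}
    for offset, chunk in sliding_chunks(document, chunk_size=9, chunk_overlap=4)
]
best = search("ccd", rows, k=1)[0]
print(best["location"], repr(best["text"]))
```

The overlap means neighboring chunks share some text, so a passage that straddles a chunk boundary still appears whole in at least one chunk. In the example flow the same trade-off is controlled by `chunk_size=2000` and `chunk_overlap=500`.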

🚀 Examples and demo

Example | Description
Text Embedding | Index text documents with embeddings for semantic search
Code Embedding | Index code embeddings for semantic search
PDF Embedding | Parse PDFs and index text embeddings for semantic search
PDF Elements Embedding | Extract text and images from PDFs; embed text with SentenceTransformers and images with CLIP; store in Qdrant for multimodal search
Manuals LLM Extraction | Extract structured information from a manual using an LLM
Amazon S3 Embedding | Index text documents from Amazon S3
Azure Blob Storage Embedding | Index text documents from Azure Blob Storage
Google Drive Text Embedding | Index text documents from Google Drive
Meeting Notes to Knowledge Graph | Extract structured meeting info from Google Drive and build a knowledge graph
Docs to Knowledge Graph | Extract relationships from Markdown documents and build a knowledge graph
Embeddings to Qdrant | Index documents in a Qdrant collection for semantic search
Embeddings to LanceDB | Index documents in a LanceDB collection for semantic search
FastAPI Server with Docker | Run the semantic search server in a Dockerized FastAPI setup
Product Recommendation | Build real-time product recommendations with an LLM and a graph database
Image Search with Vision API | Generate detailed captions for images using a vision model, embed them, and enable live-updating semantic search via FastAPI, served on a React frontend
Face Recognition | Recognize faces in images and build an embedding index
Paper Metadata | Index papers in PDF files and build metadata tables for each paper
Multi Format Indexing | Build a visual document index from PDFs and images with ColPali for semantic search
Custom Source HackerNews | Index HackerNews threads and comments, using CocoIndex Custom Source
Custom Output Files | Convert Markdown files to HTML files and save them to a local directory, using CocoIndex Custom Targets
Patient Intake Form Extraction | Use an LLM to extract structured data from patient intake forms with different formats
HackerNews Trending Topics | Extract trending topics from HackerNews threads and comments, using CocoIndex Custom Source and an LLM
Patient Intake Form Extraction with BAML | Extract structured data from patient intake forms using BAML
Patient Intake Form Extraction with DSPy | Extract structured data from patient intake forms using DSPy

More are coming, so stay tuned 👀!

📖 Documentation

For detailed documentation, visit CocoIndex Documentation, including a Quickstart guide.

🤝 Contributing

We love contributions from our community ❤️. For details on contributing or running the project for development, check out our contributing guide.

👥 Community

Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited for community contributions of all kinds, whether it's code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.

Join our community here:

Support us

We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ on our GitHub repo to stay tuned and help us grow.

License

CocoIndex is Apache 2.0 licensed.


