With CocoIndex, users declare the transformation; CocoIndex creates and maintains the index and keeps it up to date as the source changes, with minimal recomputation and minimal changes to the target.

Project description

CocoIndex

Data transformation for AI

An ultra-performant data transformation framework for AI, with a core engine written in Rust. It supports incremental processing and data lineage out of the box, offers exceptional developer velocity, and is production-ready from day 0.

⭐ Drop a star to help us grow!



CocoIndex makes it effortless to transform data with AI and to keep source data and targets in sync. Whether you're building a vector index, creating knowledge graphs for context engineering, or performing any custom data transformation, CocoIndex goes beyond SQL.



Exceptional velocity

Just declare the transformation as a dataflow in ~100 lines of Python.

# import
data['content'] = flow_builder.add_source(...)

# transform (parenthesized so the chained calls are valid Python)
data['out'] = (
    data['content']
    .transform(...)
    .transform(...)
)

# collect data
collector.collect(...)

# export to db, vector db, graph db ...
collector.export(...)

CocoIndex follows the dataflow programming model. Each transformation creates a new field based solely on its input fields, with no hidden state and no value mutation. All data before and after each transformation is observable, with lineage out of the box.

In particular, developers don't explicitly mutate data by creating, updating, and deleting records. They only need to define the transformation (formula) for a set of source data.
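
To make the dataflow idea concrete, here is a minimal, self-contained sketch in plain Python. It does not use the CocoIndex API; the `Row` class and its `derive` method are hypothetical names invented for illustration. Each derived field is computed only from named input fields, existing fields are never overwritten, and the inputs of every field are recorded as lineage.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Row:
    """A record whose fields are only ever added, never overwritten."""

    fields: dict[str, Any] = field(default_factory=dict)
    lineage: dict[str, tuple[str, ...]] = field(default_factory=dict)

    def derive(self, name: str, fn: Callable[..., Any], *inputs: str) -> None:
        # Refuse to overwrite: every field is defined exactly once.
        if name in self.fields:
            raise ValueError(f"field {name!r} is already defined")
        # The new value depends only on the named input fields.
        self.fields[name] = fn(*(self.fields[i] for i in inputs))
        # Remember which fields this one was computed from.
        self.lineage[name] = inputs


row = Row(fields={"content": "Hello dataflow"})
row.derive("lower", str.lower, "content")
row.derive("words", str.split, "lower")

print(row.fields["words"])   # ['hello', 'dataflow']
print(row.lineage["words"])  # ('lower',)
```

Because nothing is mutated, the value of every intermediate field (`content`, `lower`, `words`) stays available for inspection after the flow has run.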

Plug-and-Play Building Blocks

Native built-ins for different sources, targets, and transformations. A standardized interface makes switching between components a one-line code change, as easy as assembling building blocks.
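
As a rough illustration of why a standardized interface makes components interchangeable, the sketch below defines a tiny target interface in plain Python. The names `Target`, `InMemoryTarget`, `LogTarget`, and `export` are hypothetical and are not part of CocoIndex; the point is only that the calling code stays the same when the component is swapped.

```python
from typing import Protocol


class Target(Protocol):
    """Hypothetical interface that every export target implements."""

    def write(self, key: str, value: str) -> None: ...


class InMemoryTarget:
    """Keeps exported rows in a dictionary."""

    def __init__(self) -> None:
        self.rows: dict[str, str] = {}

    def write(self, key: str, value: str) -> None:
        self.rows[key] = value


class LogTarget:
    """Records exported rows as text lines."""

    def __init__(self) -> None:
        self.lines: list[str] = []

    def write(self, key: str, value: str) -> None:
        self.lines.append(f"{key}={value}")


def export(rows: dict[str, str], target: Target) -> None:
    # The export logic only relies on the shared `write` method.
    for key, value in rows.items():
        target.write(key, value)


rows = {"a.md": "alpha", "b.md": "beta"}

mem = InMemoryTarget()
export(rows, mem)   # switching targets is a change to this one argument

log = LogTarget()
export(rows, log)

print(mem.rows)
print(log.lines)
```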

Data Freshness

CocoIndex keeps source data and targets in sync effortlessly.

Incremental Processing

It has out-of-the-box support for incremental indexing:

  • Minimal recomputation when the source or the transformation logic changes.
  • Only the necessary portions are (re-)processed; cached results are reused when possible.
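
The general idea behind incremental processing can be sketched with content fingerprints. The code below is an illustration under stated assumptions, not CocoIndex's actual engine: `IncrementalIndex` is a hypothetical class that stores a SHA-256 hash per source row and re-runs the transformation only for rows that are new or changed, while removing derived rows whose source has disappeared.

```python
import hashlib
from typing import Callable


class IncrementalIndex:
    """Recomputes a derived value only when its source content changes."""

    def __init__(self, transform: Callable[[str], str]) -> None:
        self.transform = transform
        self.fingerprints: dict[str, str] = {}  # source key -> content hash
        self.target: dict[str, str] = {}        # source key -> derived value
        self.recomputed = 0                     # counts real transform calls

    def sync(self, source: dict[str, str]) -> None:
        # Drop derived rows whose source rows have disappeared.
        for key in list(self.target):
            if key not in source:
                del self.target[key]
                del self.fingerprints[key]
        # Recompute only rows that are new or whose content changed.
        for key, content in source.items():
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            if self.fingerprints.get(key) != digest:
                self.target[key] = self.transform(content)
                self.fingerprints[key] = digest
                self.recomputed += 1


index = IncrementalIndex(str.upper)
index.sync({"a.md": "alpha", "b.md": "beta"})   # 2 transforms
index.sync({"a.md": "alpha", "b.md": "gamma"})  # 1 transform: only b.md changed
index.sync({"b.md": "gamma"})                   # 0 transforms: a.md was removed
print(index.recomputed, index.target)           # 3 {'b.md': 'GAMMA'}
```

This sketch only reacts to source changes. Handling a change in the transformation logic would additionally require the fingerprint to cover some version of the transformation itself.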

Quick Start

If you're new to CocoIndex, we recommend checking out the Quick Start Guide in the CocoIndex Documentation.

Setup

  1. Install the CocoIndex Python library:
pip install -U cocoindex
  2. Install Postgres if you don't have it. CocoIndex uses it for incremental processing.

  3. (Optional) Install the Claude Code skill for an enhanced development experience. Run these commands in Claude Code:

/plugin marketplace add cocoindex-io/cocoindex-claude
/plugin install cocoindex-skills@cocoindex

Define data flow

Follow the Quick Start Guide to define your first indexing flow. An example flow looks like this:

import cocoindex


@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.targets.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])

It defines an index flow like this:

Data Flow
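
To give a feel for what the `chunk_size` and `chunk_overlap` arguments in the flow above control, here is a deliberately naive sliding-window chunker in plain Python. It is not the algorithm behind `SplitRecursively`, which this sketch makes no attempt to reproduce; `chunk_with_overlap` is a hypothetical helper that only shows how consecutive chunks share `chunk_overlap` characters.

```python
def chunk_with_overlap(
    text: str, chunk_size: int, chunk_overlap: int
) -> list[tuple[int, str]]:
    """Split text into fixed-size chunks; returns (start_offset, chunk) pairs."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    # Each chunk starts this many characters after the previous one.
    step = chunk_size - chunk_overlap
    chunks: list[tuple[int, str]] = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        # Stop once a chunk has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # [(0, 'abcd'), (2, 'cdef'), (4, 'efgh'), (6, 'ghij')]
```

The overlap means text near a chunk boundary appears in two chunks, so a sentence that straddles a boundary is still embedded with some of its surrounding context.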

🚀 Examples and demo

Example | Description
Text Embedding | Index text documents with embeddings for semantic search
Code Embedding | Index code embeddings for semantic search
PDF Embedding | Parse PDF and index text embeddings for semantic search
PDF Elements Embedding | Extract text and images from PDFs; embed text with SentenceTransformers and images with CLIP; store in Qdrant for multimodal search
Manuals LLM Extraction | Extract structured information from a manual using LLM
Amazon S3 Embedding | Index text documents from Amazon S3
Azure Blob Storage Embedding | Index text documents from Azure Blob Storage
Google Drive Text Embedding | Index text documents from Google Drive
Meeting Notes to Knowledge Graph | Extract structured meeting info from Google Drive and build a knowledge graph
Docs to Knowledge Graph | Extract relationships from Markdown documents and build a knowledge graph
Embeddings to Qdrant | Index documents in a Qdrant collection for semantic search
Embeddings to LanceDB | Index documents in a LanceDB collection for semantic search
FastAPI Server with Docker | Run the semantic search server in a Dockerized FastAPI setup
Product Recommendation | Build real-time product recommendations with LLM and graph database
Image Search with Vision API | Generate detailed captions for images using a vision model, embed them, and enable live-updating semantic search served via FastAPI with a React frontend
Face Recognition | Recognize faces in images and build embedding index
Paper Metadata | Index papers in PDF files, and build metadata tables for each paper
Multi Format Indexing | Build visual document index from PDFs and images with ColPali for semantic search
Custom Source HackerNews | Index HackerNews threads and comments, using CocoIndex Custom Source
Custom Output Files | Convert markdown files to HTML files and save them to a local directory, using CocoIndex Custom Targets
Patient Intake Form Extraction | Use LLM to extract structured data from patient intake forms with different formats
HackerNews Trending Topics | Extract trending topics from HackerNews threads and comments, using CocoIndex Custom Source and LLM
Patient Intake Form Extraction with BAML | Extract structured data from patient intake forms using BAML
Patient Intake Form Extraction with DSPy | Extract structured data from patient intake forms using DSPy

More are coming, so stay tuned 👀!

📖 Documentation

For detailed documentation, visit CocoIndex Documentation, including a Quickstart guide.

🤝 Contributing

We love contributions from our community ❤️. For details on contributing or running the project for development, check out our contributing guide.

👥 Community

Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited for community contributions of all kinds, whether it's code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.

Join our community here:

Support us

We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ on our GitHub repo to stay tuned and help us grow.

License

CocoIndex is Apache 2.0 licensed.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cocoindex-0.3.38.tar.gz (485.2 kB)

Uploaded: Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

cocoindex-0.3.38-cp314-cp314t-win_amd64.whl (19.7 MB)

Uploaded: CPython 3.14t, Windows x86-64

cocoindex-0.3.38-cp314-cp314t-manylinux_2_28_x86_64.whl (19.1 MB)

Uploaded: CPython 3.14t, manylinux: glibc 2.28+ x86-64

cocoindex-0.3.38-cp314-cp314t-manylinux_2_28_aarch64.whl (18.5 MB)

Uploaded: CPython 3.14t, manylinux: glibc 2.28+ ARM64

cocoindex-0.3.38-cp314-cp314t-macosx_11_0_arm64.whl (17.9 MB)

Uploaded: CPython 3.14t, macOS 11.0+ ARM64

cocoindex-0.3.38-cp311-abi3-win_amd64.whl (19.7 MB)

Uploaded: CPython 3.11+, Windows x86-64

cocoindex-0.3.38-cp311-abi3-manylinux_2_28_x86_64.whl (19.1 MB)

Uploaded: CPython 3.11+, manylinux: glibc 2.28+ x86-64

cocoindex-0.3.38-cp311-abi3-manylinux_2_28_aarch64.whl (18.5 MB)

Uploaded: CPython 3.11+, manylinux: glibc 2.28+ ARM64

cocoindex-0.3.38-cp311-abi3-macosx_11_0_arm64.whl (18.0 MB)

Uploaded: CPython 3.11+, macOS 11.0+ ARM64

cocoindex-0.3.38-cp311-abi3-macosx_10_12_x86_64.whl (18.7 MB)

Uploaded: CPython 3.11+, macOS 10.12+ x86-64

File details

Details for the file cocoindex-0.3.38.tar.gz.

File metadata

  • Download URL: cocoindex-0.3.38.tar.gz
  • Upload date:
  • Size: 485.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.13.1

File hashes

Hashes for cocoindex-0.3.38.tar.gz
Algorithm Hash digest
SHA256 6079dba018397894b7efcdd71f3c01455330f24450ce24a075465287621194ad
MD5 1f01eaf2975c5713b67a1f610ee4c061
BLAKE2b-256 6d340870141df1ee7e77d22715b14b87e5e2d1fd9fc8792ec9966b0016febf25

See more details on using hashes here.
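
A downloaded file can be checked against a published SHA256 digest with Python's standard `hashlib` module. The sketch below is one way to do it; `sha256_of_file` and `matches_published_hash` are hypothetical helper names, and the example assumes the source distribution has been saved in the current directory under its original file name.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with a published one, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_of_file(path), expected_hex.strip().lower())


# SHA256 listed above for cocoindex-0.3.38.tar.gz.
expected = "6079dba018397894b7efcdd71f3c01455330f24450ce24a075465287621194ad"
sdist = Path("cocoindex-0.3.38.tar.gz")  # assumed download location
if sdist.exists():
    print("hash OK" if matches_published_hash(sdist, expected) else "hash MISMATCH")
```

Reading in blocks keeps memory use constant, which matters for the wheel files of roughly 18 to 20 MB listed on this page.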

File details

Details for the file cocoindex-0.3.38-cp314-cp314t-win_amd64.whl.

File hashes

Hashes for cocoindex-0.3.38-cp314-cp314t-win_amd64.whl
Algorithm Hash digest
SHA256 18108a092c425e65010b607a2203cff593b6914a5ca94ab02827b3491cb46cff
MD5 c0528ba54dbd6a91c4cd8c5247e337c4
BLAKE2b-256 6a33fad9a2452ce94da7a59aaad1e651619218f00b36b76d408929dd8f71266d

File details

Details for the file cocoindex-0.3.38-cp314-cp314t-manylinux_2_28_x86_64.whl.

File hashes

Hashes for cocoindex-0.3.38-cp314-cp314t-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 57c753e54fb66d72cccd7624ade55348d1c996535b1038c2aff7aa111c1937bd
MD5 6f466bd3ecf943703dc4e6839d9f7235
BLAKE2b-256 ed39ff7fcf69e2c2e67108a680b3313aaba85a43706a5d6a0395923ade598a65

File details

Details for the file cocoindex-0.3.38-cp314-cp314t-manylinux_2_28_aarch64.whl.

File hashes

Hashes for cocoindex-0.3.38-cp314-cp314t-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 6be7785ee12eb90a7acee485dc2ec7f26830e2e4148241d1dbd5a1d1c16e126b
MD5 c6fe7967639d436f1471140c44c2b015
BLAKE2b-256 e6edc832328280e94f342916547263b111a33f735cdebfb042d49b935910a19c

File details

Details for the file cocoindex-0.3.38-cp314-cp314t-macosx_11_0_arm64.whl.

File hashes

Hashes for cocoindex-0.3.38-cp314-cp314t-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 d8eec33f5f864675ec1bb180f858433af61f86fd39c86ac13694bcacf90c4a45
MD5 4275672b3a7d0e43c19113931de9e7f5
BLAKE2b-256 2dc3ad23a6cd3c398eb354ba3536d70b913338e8553ec0c82bf6677ffedc5e3d

File details

Details for the file cocoindex-0.3.38-cp311-abi3-win_amd64.whl.

File hashes

Hashes for cocoindex-0.3.38-cp311-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 24f4754258e45de1c75e707b92abc288ed67a05fa29b4d9f4ab1450825871348
MD5 606ff4a86cfacaf4c5c468ef2ab32532
BLAKE2b-256 8c46a23128d04fa3e7cbc2e3d1dfaff411e0138d151311300d9e958bbd005e61

File details

Details for the file cocoindex-0.3.38-cp311-abi3-manylinux_2_28_x86_64.whl.

File hashes

Hashes for cocoindex-0.3.38-cp311-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 c9c55d74f1926b75d2b2848b68e1e04d280427114cd0ee4423364895e1430bc7
MD5 53766c8357abb905ef3bc01fad948cfe
BLAKE2b-256 56154c4dffbd7adcf4e689a44409f7c1912f881430e9928d3b259ba77ab119ff

File details

Details for the file cocoindex-0.3.38-cp311-abi3-manylinux_2_28_aarch64.whl.

File hashes

Hashes for cocoindex-0.3.38-cp311-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 05be98de81eb2fd3e6b60ef0c66e1ad8d5c7421385c6e4fed6af5ce6fd8cd597
MD5 7b0392e8c26843bbc2b21e4724e4b98b
BLAKE2b-256 87a9b30da08d6f8ef30adf4e71e926ddc4141fb44b512ae83e92c17ca429ec3c

File details

Details for the file cocoindex-0.3.38-cp311-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for cocoindex-0.3.38-cp311-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8d286993edf064b50f9ef9578ce1eb4d53963063173a4cf2e71289ced10df051
MD5 fdb181a3f1ddb49bc0a8f5da6d0d832e
BLAKE2b-256 51aa4e5e226958996f2a38099628f9df0bc304f005a2b0303245ed0a406a2831

File details

Details for the file cocoindex-0.3.38-cp311-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for cocoindex-0.3.38-cp311-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 8432961d9e614218af8f0d6a22423eaeac46d97ae23eba82438a437a8ca1007f
MD5 8bd44daf02d8bd2fce9a005c4329b02e
BLAKE2b-256 f96fb022a9bd16866b7c5c40401a05f7e680f4df3c47860ca7d073db5711b846
