With CocoIndex, users declare the transformation; CocoIndex creates and maintains an index and keeps it up to date as the source changes, with minimal computation and changes.

CocoIndex

Extract, Transform, Index Data. Easy and Fresh. 🌴

CocoIndex is an ultra-performant data transformation framework with its core engine written in Rust. It aims to make it easy to prepare fresh data for AI, whether that means creating embeddings, building knowledge graphs, or performing other data transformations, and to take real-time data pipelines beyond traditional SQL.

CocoIndex Features

The philosophy is to have the framework handle source updates so that developers only need to define a series of data transformations, an approach inspired by spreadsheets.

Dataflow programming

Unlike a workflow orchestration framework, where data is usually opaque, CocoIndex treats data and data operations as first-class citizens. CocoIndex follows the dataflow programming model: each transformation creates a new field based solely on input fields, with no hidden state and no value mutation. All data before and after each transformation is observable, with lineage out of the box.

In particular, users don't explicitly mutate data by creating, updating, and deleting. Instead, they define a transformation or formula for a set of source data, and the framework takes care of data operations such as when to create, update, or delete.

# import
data['content'] = flow_builder.add_source(...)

# transform (parentheses allow the chained calls to span several lines)
data['out'] = (
    data['content']
    .transform(...)
    .transform(...)
)

# collect data
collector.collect(...)

# export to db, vector db, graph db ...
collector.export(...)
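
To make the declarative idea concrete, here is a minimal sketch in plain Python of how a framework could derive create, update, and delete operations by comparing the rows a formula declares with the rows currently in the target. It is not CocoIndex's engine or API; `plan_operations`, `declared`, and `existing` are hypothetical names used only for illustration.

```python
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def plan_operations(
    declared: Dict[str, Row], existing: Dict[str, Row]
) -> List[Tuple[str, str]]:
    """Return (operation, key) pairs that turn `existing` into `declared`."""
    ops: List[Tuple[str, str]] = []
    for key, row in declared.items():
        if key not in existing:
            ops.append(("create", key))
        elif existing[key] != row:
            ops.append(("update", key))
    for key in existing:
        if key not in declared:
            ops.append(("delete", key))
    return ops


# Hypothetical data: the formula is "upper-case the content of every source file".
source = {"a.md": "hello", "b.md": "world"}
declared = {name: {"out": text.upper()} for name, text in source.items()}
existing = {
    "a.md": {"out": "HELLO"},  # already up to date
    "b.md": {"out": "old"},    # stale value
    "c.md": {"out": "GONE"},   # source file no longer exists
}

print(plan_operations(declared, existing))
```

The user only writes the formula that produces `declared`; the create, update, and delete decisions fall out of the comparison.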

Data Freshness

CocoIndex takes data freshness to the next level. Incremental processing is one of the core values it provides.

Incremental Processing

The framework takes care of:

  • Change data capture.
  • Figuring out exactly what needs to be updated, and updating only that without recomputing everything.

This makes it fast to reflect source updates in the target store. If you are concerned about surfacing stale data to AI agents and are spending a lot of effort on infrastructure to reduce latency, the framework handles it for you.
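
As a rough illustration of incremental processing, the sketch below recomputes a derived value only when the content fingerprint of its source changes, and drops derived rows whose source has disappeared. It is a simplified stand-in written in plain Python, not how CocoIndex is implemented; the `IncrementalIndex` class and its methods are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List


class IncrementalIndex:
    """Recomputes derived values only for sources whose content changed."""

    def __init__(self, transform: Callable[[str], str]) -> None:
        self.transform = transform
        self.fingerprints: Dict[str, str] = {}  # source key -> content hash
        self.target: Dict[str, str] = {}        # source key -> derived value

    def sync(self, source: Dict[str, str]) -> List[str]:
        """Bring the target up to date; return the keys that were recomputed."""
        recomputed: List[str] = []
        for key, content in source.items():
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            if self.fingerprints.get(key) != digest:
                self.target[key] = self.transform(content)
                self.fingerprints[key] = digest
                recomputed.append(key)
        # Drop derived rows whose source no longer exists.
        for key in list(self.target):
            if key not in source:
                del self.target[key]
                del self.fingerprints[key]
        return recomputed


index = IncrementalIndex(transform=str.upper)
first = index.sync({"a.md": "hello", "b.md": "world"})
second = index.sync({"a.md": "hello", "b.md": "world!"})
print(first, second)
```

On the second sync only `b.md` is recomputed, because the fingerprint of `a.md` is unchanged.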

Quick Start:

If you're new to CocoIndex, we recommend checking out the Quick Start Guide in the CocoIndex Documentation.

Setup

  1. Install CocoIndex Python library
     pip install -U cocoindex
  2. Install Postgres if you don't have one. CocoIndex uses it for incremental processing.
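
After installing, one way to confirm that the package is visible to your interpreter is to query its installed version with the standard library. This is only a convenience sketch; the helper name `installed_version` is hypothetical, and it does not check the Postgres setup.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("cocoindex")
if version:
    print(f"cocoindex {version} is installed")
else:
    print("cocoindex is not installed; run: pip install -U cocoindex")
```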

Define data flow

Follow the Quick Start Guide to define your first indexing flow. An example flow looks like this:

import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])

It defines an index flow like this:

[Figure: Data Flow]
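
To give a feel for two ideas the flow relies on, chunking with overlap and cosine similarity, here is a small plain-Python sketch. It is not the algorithm used by `SplitRecursively` (which is not shown in this document), and treating `chunk_size` and `chunk_overlap` as character counts is an assumption; `chunk_text` and `cosine_similarity` are hypothetical helpers.

```python
import math
from typing import List, Sequence


def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Split `text` into windows of `chunk_size` characters that overlap
    by `chunk_overlap` characters (a naive sliding window)."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
```

Overlap means neighbouring chunks share some text, so a sentence that straddles a boundary still appears whole in at least one chunk; cosine similarity is the metric the example flow requests for its vector index.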

🚀 Examples and demo

| Example | Description |
| --- | --- |
| Text Embedding | Index text documents with embeddings for semantic search |
| Code Embedding | Index code embeddings for semantic search |
| PDF Embedding | Parse PDF and index text embeddings for semantic search |
| Manuals LLM Extraction | Extract structured information from a manual using LLM |
| Amazon S3 Embedding | Index text documents from Amazon S3 |
| Google Drive Text Embedding | Index text documents from Google Drive |
| Docs to Knowledge Graph | Extract relationships from Markdown documents and build a knowledge graph |
| Embeddings to Qdrant | Index documents in a Qdrant collection for semantic search |
| FastAPI Server with Docker | Run the semantic search server in a Dockerized FastAPI setup |
| Product Recommendation | Build real-time product recommendations with LLM and graph database |
| Image Search with Vision API | Generate detailed captions for images using a vision model, embed them, and enable live-updating semantic search via FastAPI, served on a React frontend |

More are coming, so stay tuned 👀!

📖 Documentation

For detailed documentation, visit CocoIndex Documentation, including a Quickstart guide.

🤝 Contributing

We love contributions from our community ❤️. For details on contributing or running the project for development, check out our contributing guide.

👥 Community

Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited for community contributions of all kinds, whether it's code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.

Join our community here:

Support us:

We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ on our GitHub repo to stay tuned and help us grow.

License

CocoIndex is Apache 2.0 licensed.
