
With CocoIndex, users declare the transformation; CocoIndex creates and maintains an index and keeps the derived index up to date as the source changes, with minimal computation and changes.

Project description

CocoIndex

Extract, Transform, Index Data. Easy and Fresh. 🌴


CocoIndex is a high-performance data transformation framework with its core engine written in Rust. It aims to make it easy to prepare fresh data for AI, whether that means creating embeddings, building knowledge graphs, or performing other data transformations, and to take real-time data pipelines beyond traditional SQL.

CocoIndex Features

The philosophy is to have the framework handle source updates, so that developers only need to define a series of data transformations, much like formulas in a spreadsheet.

Dataflow programming

Unlike a workflow orchestration framework, where data is usually opaque, CocoIndex treats data and data operations as first-class citizens. CocoIndex follows the dataflow programming model: each transformation creates a new field based solely on its input fields, without hidden state or value mutation. All data before and after each transformation is observable, with lineage out of the box.

In particular, users don't explicitly mutate data by creating, updating, and deleting it. Instead, they declare, for a set of source data, the transformation or formula to apply. The framework decides when to create, update, or delete derived data.

# import
data['content'] = flow_builder.add_source(...)

# transform (parentheses allow the chained calls to span several lines)
data['out'] = (
    data['content']
    .transform(...)
    .transform(...)
)

# collect data
collector.collect(...)

# export to db, vector db, graph db ...
collector.export(...)
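
To make the "no hidden state, no mutation" idea concrete, here is a minimal plain-Python sketch. It does not use the CocoIndex API; the `derive` helper and the field names are hypothetical and exist only for illustration. Each step returns a new row whose added field depends only on the named input fields, so every intermediate row stays observable.

```python
from typing import Callable, Mapping

Row = Mapping[str, object]


def derive(row: Row, out_field: str, fn: Callable[..., object], *in_fields: str) -> dict:
    """Return a new row with `out_field` computed only from `in_fields`.

    Hypothetical helper for illustration; not part of CocoIndex.
    """
    if out_field in row:
        # Fields are only ever added, never overwritten.
        raise ValueError(f"field {out_field!r} already exists")
    new_row = dict(row)  # copy, so the input row is left untouched
    new_row[out_field] = fn(*(row[name] for name in in_fields))
    return new_row


source = {"content": "Hello CocoIndex"}
step1 = derive(source, "lower", str.lower, "content")
step2 = derive(step1, "words", str.split, "lower")
```

Because `source` and `step1` are never modified, the lineage of `words` (from `lower`, from `content`) can be read directly from the chain of calls.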

Data Freshness

As a data framework, CocoIndex treats data freshness as a primary concern. Incremental processing is one of the core values it provides.

Incremental Processing

The framework takes care of:

  • Change data capture.
  • Figuring out exactly what needs to be updated, and updating only that, without recomputing everything.

This makes it fast to reflect source updates in the target store. If you are concerned about surfacing stale data to AI agents and are spending a lot of effort on infrastructure to reduce latency, the framework handles that for you.
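
The general idea of incremental processing can be sketched in plain Python. This is not how CocoIndex implements it internally (the source does not describe that); it is a simplified illustration that assumes content fingerprints are enough to detect changes. The `IncrementalIndex` class and its method names are hypothetical.

```python
import hashlib
from typing import Callable, Dict, List


def fingerprint(content: str) -> str:
    """Stable digest used to detect whether a source item changed."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class IncrementalIndex:
    """Hypothetical sketch: recompute only the entries whose source changed."""

    def __init__(self, transform: Callable[[str], str]) -> None:
        self.transform = transform
        self.fingerprints: Dict[str, str] = {}  # source key -> last seen digest
        self.target: Dict[str, str] = {}        # source key -> derived value

    def sync(self, source: Dict[str, str]) -> Dict[str, List[str]]:
        changes: Dict[str, List[str]] = {"created": [], "updated": [], "deleted": []}
        for key, content in source.items():
            digest = fingerprint(content)
            previous = self.fingerprints.get(key)
            if previous == digest:
                continue  # unchanged: skip the transformation entirely
            changes["created" if previous is None else "updated"].append(key)
            self.target[key] = self.transform(content)
            self.fingerprints[key] = digest
        for key in list(self.target):
            if key not in source:
                # The source item disappeared, so drop its derived data too.
                changes["deleted"].append(key)
                del self.target[key]
                del self.fingerprints[key]
        return changes


index = IncrementalIndex(str.upper)
first = index.sync({"a.md": "alpha", "b.md": "beta"})
second = index.sync({"a.md": "alpha", "b.md": "BETA v2", "c.md": "gamma"})
third = index.sync({"a.md": "alpha"})
```

In the second call only `b.md` and `c.md` are transformed, and in the third call nothing is recomputed at all; the caller never issues an explicit create, update, or delete.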

Quick Start:

If you're new to CocoIndex, we recommend checking out the CocoIndex Documentation and its Quick Start Guide.

Setup

  1. Install the CocoIndex Python library:
     pip install -U cocoindex
  2. Install Postgres if you don't have it. CocoIndex uses it for incremental processing.

Define data flow

Follow the Quick Start Guide to define your first indexing flow. An example flow looks like this:

import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])

It defines an index flow like this:

[Figure: Data Flow]
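
The flow above exports rows with `filename`, `location`, `text`, and `embedding` fields and indexes them by cosine similarity. As a rough illustration of what that metric ranks, here is a plain-Python sketch that scores a few toy rows against a query vector. It does not call CocoIndex or Postgres; the `top_k` helper and the two-dimensional embeddings are made up for the example.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], rows: List[Dict], k: int = 2) -> List[Tuple[float, str]]:
    """Hypothetical helper: return the k (score, text) pairs most similar to the query."""
    scored = [(cosine_similarity(query, row["embedding"]), row["text"]) for row in rows]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy rows shaped like the collected fields; real embeddings have hundreds of dimensions.
rows = [
    {"filename": "a.md", "location": "0", "text": "apples", "embedding": [1.0, 0.0]},
    {"filename": "a.md", "location": "1", "text": "bananas", "embedding": [0.0, 1.0]},
    {"filename": "b.md", "location": "0", "text": "apple pie", "embedding": [1.0, 1.0]},
]
results = top_k([1.0, 0.0], rows, k=2)
```

In a real deployment the vector index in the target store performs this ranking; the sketch only shows the scoring rule.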

🚀 Examples and demo

| Example | Description |
| --- | --- |
| Text Embedding | Index text documents with embeddings for semantic search |
| Code Embedding | Index code embeddings for semantic search |
| PDF Embedding | Parse PDF and index text embeddings for semantic search |
| Manuals LLM Extraction | Extract structured information from a manual using LLM |
| Amazon S3 Embedding | Index text documents from Amazon S3 |
| Google Drive Text Embedding | Index text documents from Google Drive |
| Docs to Knowledge Graph | Extract relationships from Markdown documents and build a knowledge graph |
| Embeddings to Qdrant | Index documents in a Qdrant collection for semantic search |
| FastAPI Server with Docker | Run the semantic search server in a Dockerized FastAPI setup |
| Product Recommendation | Build real-time product recommendations with LLM and graph database |
| Image Search with Vision API | Generate detailed captions for images using a vision model, embed them, and enable live-updating semantic search served via FastAPI with a React frontend |

More are coming, so stay tuned 👀!

📖 Documentation

For detailed documentation, visit CocoIndex Documentation, including a Quickstart guide.

🤝 Contributing

We love contributions from our community ❤️. For details on contributing or running the project for development, check out our contributing guide.

👥 Community

Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited for community contributions of all kinds, whether it's code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.

Join our community here:

Support us:

We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ at our GitHub repo to stay tuned and help us grow.

License

CocoIndex is Apache 2.0 licensed.
