With CocoIndex, you declare the transformation; CocoIndex creates and maintains the index and keeps it up to date as the source changes, with minimal recomputation and minimal changes to the target.

CocoIndex

Extract, Transform, Index Data. Easy and Fresh. 🌴

CocoIndex is an ultra-performant data transformation framework with its core engine written in Rust. It aims to make it easy to prepare fresh data for AI, whether by creating embeddings, building knowledge graphs, or performing other data transformations, and to take real-time data pipelines beyond traditional SQL.

CocoIndex Features

The philosophy is to have the framework handle source updates so that developers only need to define a series of data transformations, an approach inspired by spreadsheets.

Dataflow programming

Unlike a workflow orchestration framework, where data is usually opaque, CocoIndex treats data and data operations as first-class citizens. CocoIndex follows the dataflow programming model. Each transformation creates a new field based solely on its input fields, without hidden state or value mutation. All data before and after each transformation is observable, with lineage out of the box.

In particular, users don't explicitly mutate data by creating, updating, and deleting records. Instead, they define the transformation or formula for a set of source data, and the framework takes care of the data operations, such as when to create, update, or delete.

# import
data['content'] = flow_builder.add_source(...)

# transform (parentheses allow the chained calls to span several lines)
data['out'] = (
    data['content']
    .transform(...)
    .transform(...)
)

# collect data
collector.collect(...)

# export to db, vector db, graph db ...
collector.export(...)
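
To make the "new field from input fields, no mutation" idea concrete, here is a minimal sketch in plain Python. It is not the CocoIndex API: the `Row` class and its `derive` method are hypothetical names invented for this illustration. Each step returns a new row with one extra field, so every intermediate value stays observable.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple


# Hypothetical illustration only: Row and derive are NOT part of CocoIndex.
@dataclass(frozen=True)
class Row:
    fields: Tuple[Tuple[str, Any], ...]  # immutable (name, value) pairs

    def get(self, name: str) -> Any:
        return dict(self.fields)[name]

    def derive(self, new_name: str, source_name: str,
               fn: Callable[[Any], Any]) -> "Row":
        """Return a new Row with one extra field computed only from an existing field."""
        value = fn(self.get(source_name))
        return Row(self.fields + ((new_name, value),))


source = Row((("content", "Hello  World"),))

# Chain two transformations; neither one mutates `source`.
result = (
    source
    .derive("normalized", "content", lambda s: " ".join(s.split()))
    .derive("length", "normalized", len)
)

print(dict(result.fields))
```

Because the original row is never modified, the lineage of each derived field can be read directly from the chain of `derive` calls.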

Data Freshness

As a data framework, CocoIndex takes data freshness to the next level. Incremental processing is one of its core values.

Incremental Processing

The framework takes care of:

  • Change data capture.
  • Figuring out exactly what needs to be updated, and updating only that without recomputing everything.

This makes it fast to reflect source updates in the target store. If you are concerned about surfacing stale data to AI agents and are spending a lot of effort on infrastructure to reduce latency, the framework handles that for you.
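
The general idea can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not how CocoIndex implements it: the function names, the in-memory `state` dictionary, and the use of SHA-256 content fingerprints are all assumptions made for the example.

```python
import hashlib
from typing import Callable, Dict


def fingerprint(content: str) -> str:
    """Stable fingerprint of a source item's content."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def incremental_update(source: Dict[str, str],
                       state: Dict[str, str],
                       target: Dict[str, str],
                       transform: Callable[[str], str]) -> Dict[str, int]:
    """Bring `target` in line with `source`, recomputing only changed items.

    `state` maps each key to the fingerprint seen on the previous run.
    """
    stats = {"created": 0, "updated": 0, "deleted": 0, "skipped": 0}

    for key, content in source.items():
        fp = fingerprint(content)
        if state.get(key) == fp:
            stats["skipped"] += 1  # unchanged: no recomputation
            continue
        stats["updated" if key in state else "created"] += 1
        target[key] = transform(content)
        state[key] = fp

    # Anything tracked but no longer present in the source is deleted.
    for key in list(state):
        if key not in source:
            del state[key]
            target.pop(key, None)
            stats["deleted"] += 1

    return stats


state: Dict[str, str] = {}
target: Dict[str, str] = {}
first = incremental_update({"a.md": "x", "b.md": "y"}, state, target, str.upper)
second = incremental_update({"a.md": "x", "b.md": "z"}, state, target, str.upper)
third = incremental_update({"a.md": "x"}, state, target, str.upper)
print(first, second, third, target)
```

On the second run only `b.md` is transformed again, and on the third run `b.md` is removed from the target without touching `a.md`.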

Quick Start:

If you're new to CocoIndex, we recommend checking out

Setup

  1. Install the CocoIndex Python library:
pip install -U cocoindex
  2. Install Postgres if you don't have it. CocoIndex uses it for incremental processing.

Define data flow

Follow the Quick Start Guide to define your first indexing flow. An example flow looks like this:

import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])

It defines an index flow like this:

[Figure: Data Flow]
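
The flow above exports rows with `filename`, `location`, `text`, and `embedding` fields and requests a cosine-similarity vector index. As a rough sketch of what that metric computes, the following plain-Python example ranks a few hand-written rows against a query vector. It does not use CocoIndex or Postgres; the helper names `cosine_similarity` and `top_k` and the tiny 2-dimensional vectors are assumptions made purely for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], rows: List[Dict],
          k: int = 3) -> List[Tuple[float, Dict]]:
    """Return the k rows most similar to `query`, best match first."""
    scored = [(cosine_similarity(query, row["embedding"]), row) for row in rows]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy rows shaped like the collected fields; real embeddings have many more dimensions.
rows = [
    {"filename": "a.md", "location": "0-10", "text": "alpha", "embedding": [1.0, 0.0]},
    {"filename": "a.md", "location": "10-20", "text": "beta", "embedding": [0.0, 1.0]},
    {"filename": "b.md", "location": "0-10", "text": "gamma", "embedding": [1.0, 1.0]},
]

for score, row in top_k([1.0, 0.1], rows, k=2):
    print(f"{score:.3f}", row["filename"], row["location"], row["text"])
```

A vector index answers the same kind of query without scanning every row, which is why the export step declares the index up front.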

🚀 Examples and demo

Example: Description

  • Text Embedding: Index text documents with embeddings for semantic search
  • Code Embedding: Index code embeddings for semantic search
  • PDF Embedding: Parse PDF and index text embeddings for semantic search
  • Manuals LLM Extraction: Extract structured information from a manual using LLM
  • Amazon S3 Embedding: Index text documents from Amazon S3
  • Google Drive Text Embedding: Index text documents from Google Drive
  • Docs to Knowledge Graph: Extract relationships from Markdown documents and build a knowledge graph
  • Embeddings to Qdrant: Index documents in a Qdrant collection for semantic search
  • FastAPI Server with Docker: Run the semantic search server in a Dockerized FastAPI setup
  • Product Recommendation: Build real-time product recommendations with LLM and graph database
  • Image Search with Vision API: Generate detailed captions for images using a vision model, embed them, and enable live-updating semantic search served via FastAPI with a React frontend

More are coming, so stay tuned 👀!

📖 Documentation

For detailed documentation, visit CocoIndex Documentation, including a Quickstart guide.

🤝 Contributing

We love contributions from our community ❤️. For details on contributing or running the project for development, check out our contributing guide.

👥 Community

Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited for community contributions of all kinds, whether it's code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.

Join our community here:

Support us:

We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ on our GitHub repo to stay tuned and help us grow.

License

CocoIndex is Apache 2.0 licensed.
