With CocoIndex, users declare the transformation; CocoIndex creates and maintains an index, and keeps the derived index up to date as the source changes, with minimal computation and changes.

Project description

CocoIndex

Extract, Transform, Index Data. Easy and Fresh. 🌴

CocoIndex is an ultra-performant data transformation framework with its core engine written in Rust. It aims to make it easy to prepare fresh data for AI, whether by creating embeddings, building knowledge graphs, or performing other data transformations, and to take real-time data pipelines beyond traditional SQL.

CocoIndex Features

The philosophy is to have the framework handle source updates and have developers worry only about defining a series of data transformations, an approach inspired by spreadsheets.

Dataflow programming

Unlike a workflow orchestration framework, where data is usually opaque, CocoIndex treats data and data operations as first-class citizens. CocoIndex follows the dataflow programming model. Each transformation creates a new field based solely on input fields, without hidden state or value mutation. All data before and after each transformation is observable, with lineage out of the box.

In particular, users don't explicitly mutate data by creating, updating, and deleting. Rather, they define a transformation or formula for a set of source data. The framework takes care of data operations, such as deciding when to create, update, or delete.

# import
data['content'] = flow_builder.add_source(...)

# transform (parentheses let the chained calls span several lines)
data['out'] = (
    data['content']
    .transform(...)
    .transform(...)
)

# collect data
collector.collect(...)

# export to db, vector db, graph db ...
collector.export(...)
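To make the dataflow idea concrete, here is a minimal sketch in plain Python. It is not CocoIndex's API: the `Flow` class and its `derive`, `run`, and `lineage` methods are hypothetical names invented for illustration. It only shows the two properties described above: each step derives a new field from an existing one without mutating anything, and lineage can be read directly off the declared steps.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical mini-model of a dataflow; NOT the CocoIndex API.
Row = Dict[str, object]
Step = Tuple[str, str, Callable[[object], object]]


class Flow:
    """Holds an ordered list of (output_field, input_field, pure_function)."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def derive(self, out_field: str, in_field: str,
               fn: Callable[[object], object]) -> "Flow":
        # Declaring a step only records the formula; nothing is computed yet.
        self.steps.append((out_field, in_field, fn))
        return self

    def run(self, source: Row) -> Row:
        # Work on a copy so the source row is never mutated.
        row = dict(source)
        for out_field, in_field, fn in self.steps:
            if out_field in row:
                raise ValueError(f"field {out_field!r} already exists")
            # Each step adds a new field and never overwrites an existing one.
            row[out_field] = fn(row[in_field])
        return row

    def lineage(self, field: str) -> List[str]:
        # Walk back from a field to the source field it was derived from.
        parents = {out: inp for out, inp, _ in self.steps}
        chain = [field]
        while chain[-1] in parents:
            chain.append(parents[chain[-1]])
        return chain


flow = (
    Flow()
    .derive("lower", "content", lambda s: str(s).lower())
    .derive("words", "lower", lambda s: str(s).split())
)
result = flow.run({"content": "Hello Dataflow World"})
print(result["words"])        # ['hello', 'dataflow', 'world']
print(flow.lineage("words"))  # ['words', 'lower', 'content']
```

Because every intermediate field stays in the row, the data before and after each step remains inspectable, which is the observability property the section describes.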

Data Freshness

As a data framework, CocoIndex takes data freshness to the next level. Incremental processing is one of its core values.

Incremental Processing

The framework takes care of:

  • Change data capture.
  • Figuring out exactly what needs to be updated, and updating only that without recomputing everything.

This makes it fast to reflect source updates in the target store. If you are concerned about surfacing stale data to AI agents and are spending a lot of effort on infrastructure to optimize latency, the framework handles it for you.
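The idea of incremental processing can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how CocoIndex is implemented: the `IncrementalIndex` class, the `fingerprint` helper, and the use of a SHA-256 content hash to detect changes are all choices made for this example. The sketch recomputes only the entries whose content changed and deletes entries whose source disappeared.

```python
import hashlib
from typing import Callable, Dict, List, Tuple


def fingerprint(content: str) -> str:
    # Content hash used to detect whether a source item changed.
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


class IncrementalIndex:
    """Hypothetical illustration; not the CocoIndex engine."""

    def __init__(self, transform: Callable[[str], str]) -> None:
        self.transform = transform
        self.seen: Dict[str, str] = {}    # key -> fingerprint last processed
        self.target: Dict[str, str] = {}  # key -> derived value (the "index")

    def sync(self, source: Dict[str, str]) -> Tuple[int, int, int]:
        """Align the target with the source; return (created, updated, deleted)."""
        created = updated = deleted = 0
        for key, content in source.items():
            fp = fingerprint(content)
            if self.seen.get(key) == fp:
                continue  # unchanged: skip the (possibly expensive) transform
            if key in self.seen:
                updated += 1
            else:
                created += 1
            self.target[key] = self.transform(content)
            self.seen[key] = fp
        # Remove derived rows whose source item no longer exists.
        for key in list(self.seen):
            if key not in source:
                del self.seen[key]
                del self.target[key]
                deleted += 1
        return created, updated, deleted


calls: List[str] = []


def shout(text: str) -> str:
    calls.append(text)  # record each real computation
    return text.upper()


index = IncrementalIndex(shout)
print(index.sync({"a.md": "alpha", "b.md": "beta"}))   # (2, 0, 0)
print(index.sync({"a.md": "alpha", "b.md": "BETA!"}))  # (0, 1, 0)
print(index.sync({"a.md": "alpha"}))                   # (0, 0, 1)
print(len(calls))                                      # 3
```

Across the three syncs the transform runs only three times, even though five source items were presented in total; the unchanged `a.md` is never recomputed.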

Quick Start:

If you're new to CocoIndex, we recommend checking out the CocoIndex Documentation, including the Quick Start Guide.

Setup

  1. Install the CocoIndex Python library:
pip install -U cocoindex
  2. Install Postgres if you don't have it. CocoIndex uses it for incremental processing.

Define data flow

Follow the Quick Start Guide to define your first indexing flow. An example flow looks like:

import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.storages.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])

It defines an index flow like this:

[Figure: Data Flow]
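Two parameters in the example flow are easier to read with a small illustration: `chunk_overlap` and the cosine similarity metric. The sketch below is plain Python with hypothetical helper names (`chunk_with_overlap`, `cosine_similarity`). It is a simplified stand-in: it uses fixed-width character windows, which is an assumption made here and not a description of how `SplitRecursively` actually splits text.

```python
import math
from typing import List, Sequence


def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Fixed-width sliding window; consecutive chunks share `chunk_overlap` chars."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_a * norm_b)


chunks = chunk_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)                                      # ['abcd', 'cdef', 'efgh', 'ghij']
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0
```

Overlap means text near a chunk boundary appears in both neighboring chunks, so a query that matches a passage straddling the boundary can still retrieve it.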

🚀 Examples and demo

| Example | Description |
| --- | --- |
| Text Embedding | Index text documents with embeddings for semantic search |
| Code Embedding | Index code embeddings for semantic search |
| PDF Embedding | Parse PDF and index text embeddings for semantic search |
| Manuals LLM Extraction | Extract structured information from a manual using LLM |
| Amazon S3 Embedding | Index text documents from Amazon S3 |
| Google Drive Text Embedding | Index text documents from Google Drive |
| Docs to Knowledge Graph | Extract relationships from Markdown documents and build a knowledge graph |
| Embeddings to Qdrant | Index documents in a Qdrant collection for semantic search |
| FastAPI Server with Docker | Run the semantic search server in a Dockerized FastAPI setup |
| Product Recommendation | Build real-time product recommendations with LLM and graph database |
| Image Search with Vision API | Generate detailed captions for images using a vision model, embed them, and enable live-updating semantic search served via FastAPI with a React frontend |

More are coming, so stay tuned 👀!

📖 Documentation

For detailed documentation, visit CocoIndex Documentation, including a Quickstart guide.

🤝 Contributing

We love contributions from our community ❤️. For details on contributing or running the project for development, check out our contributing guide.

👥 Community

Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited for community contributions of all kinds, whether code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.

Join our community here:

Support us:

We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ at our GitHub repo to stay tuned and help us grow.

License

CocoIndex is Apache 2.0 licensed.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cocoindex-0.1.51.tar.gz (6.0 MB), uploaded as Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

| File | Size | Interpreter | Platform |
| --- | --- | --- | --- |
| cocoindex-0.1.51-pp311-pypy311_pp73-manylinux_2_28_aarch64.whl | 13.7 MB | PyPy | manylinux: glibc 2.28+ ARM64 |
| cocoindex-0.1.51-cp313-cp313t-manylinux_2_28_aarch64.whl | 13.6 MB | CPython 3.13t | manylinux: glibc 2.28+ ARM64 |
| cocoindex-0.1.51-cp313-cp313-win_amd64.whl | 13.5 MB | CPython 3.13 | Windows x86-64 |
| cocoindex-0.1.51-cp313-cp313-manylinux_2_28_x86_64.whl | 14.2 MB | CPython 3.13 | manylinux: glibc 2.28+ x86-64 |
| cocoindex-0.1.51-cp313-cp313-manylinux_2_28_aarch64.whl | 13.7 MB | CPython 3.13 | manylinux: glibc 2.28+ ARM64 |
| cocoindex-0.1.51-cp313-cp313-macosx_11_0_arm64.whl | 13.5 MB | CPython 3.13 | macOS 11.0+ ARM64 |
| cocoindex-0.1.51-cp313-cp313-macosx_10_12_x86_64.whl | 14.0 MB | CPython 3.13 | macOS 10.12+ x86-64 |
| cocoindex-0.1.51-cp312-cp312-win_amd64.whl | 13.5 MB | CPython 3.12 | Windows x86-64 |
| cocoindex-0.1.51-cp312-cp312-manylinux_2_28_x86_64.whl | 14.2 MB | CPython 3.12 | manylinux: glibc 2.28+ x86-64 |
| cocoindex-0.1.51-cp312-cp312-manylinux_2_28_aarch64.whl | 13.7 MB | CPython 3.12 | manylinux: glibc 2.28+ ARM64 |
| cocoindex-0.1.51-cp312-cp312-macosx_11_0_arm64.whl | 13.4 MB | CPython 3.12 | macOS 11.0+ ARM64 |
| cocoindex-0.1.51-cp312-cp312-macosx_10_12_x86_64.whl | 14.0 MB | CPython 3.12 | macOS 10.12+ x86-64 |
| cocoindex-0.1.51-cp311-cp311-win_amd64.whl | 13.5 MB | CPython 3.11 | Windows x86-64 |
| cocoindex-0.1.51-cp311-cp311-manylinux_2_28_x86_64.whl | 14.2 MB | CPython 3.11 | manylinux: glibc 2.28+ x86-64 |
| cocoindex-0.1.51-cp311-cp311-manylinux_2_28_aarch64.whl | 13.7 MB | CPython 3.11 | manylinux: glibc 2.28+ ARM64 |
| cocoindex-0.1.51-cp311-cp311-macosx_11_0_arm64.whl | 13.5 MB | CPython 3.11 | macOS 11.0+ ARM64 |
| cocoindex-0.1.51-cp311-cp311-macosx_10_12_x86_64.whl | 14.0 MB | CPython 3.11 | macOS 10.12+ x86-64 |

File details

Details for the file cocoindex-0.1.51.tar.gz.

File metadata

  • Download URL: cocoindex-0.1.51.tar.gz
  • Upload date:
  • Size: 6.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.8.7

File hashes

Hashes for cocoindex-0.1.51.tar.gz
Algorithm Hash digest
SHA256 0a0bd7e53918278d0ac8a66596c1ee71ad187b67a886be155ac4fc3a84a63a46
MD5 7d6b9b5abe6fb33da0fb30c4ed818258
BLAKE2b-256 5e717f7749bc65c58eb6c758701c2625fd53322bf19a2ff5f4810919831db946

See more details on using hashes here.

Hashes for cocoindex-0.1.51-pp311-pypy311_pp73-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 6cfce87759e4b19d13b65dce678fe4d8363433a81508f844eaf232d26deb27eb
MD5 b3893f59268e3749ceba8ab237209c4a
BLAKE2b-256 de83955f5fb78db259fa9b5b1e64c5c905c913c60cf32b6378994e6d3919b7e7

Hashes for cocoindex-0.1.51-cp313-cp313t-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 ddcc4d4fa0bdfc5b70ce74a4d08a35645dce1ce8b94cd2bcb00d36b2c9765735
MD5 8a62283783236a2050614953f26edf1f
BLAKE2b-256 92f756de15a02eef61754e7fbe2d5dea54bff1fd36262cf8c5c00ce98e0ac76f

Hashes for cocoindex-0.1.51-cp313-cp313-win_amd64.whl
Algorithm Hash digest
SHA256 817c4132fb2dd5f17f4bfddeac540a36b02be2c665724f2bfd48bd8502c266f3
MD5 b96256a089b71c1a16a1b7ccf02c65ed
BLAKE2b-256 32a3d108795e9ec657a494f959ac97b16b175a3b7dfef2236da3aaeaddec212e

Hashes for cocoindex-0.1.51-cp313-cp313-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 4d6f31f92dcd8d9062379a6894530903358eb79956a3920512bd1f0ff191c88f
MD5 d30a7b2f998e6e3941e3e91796f9cba2
BLAKE2b-256 ff022dd48c2b0e0ed23fadfb81dcb1ff8c291efe23573ded31e6a79025bc97cb

Hashes for cocoindex-0.1.51-cp313-cp313-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 25f7d5c47506124f99320642907a9f326bb9791576f8d64fd2e841d24355c286
MD5 61d8810990e94046c844580a745b6c6d
BLAKE2b-256 f57609ba42bd3a06724ce784438f89bd590e42b64a02f96460550f3897ba44ec

Hashes for cocoindex-0.1.51-cp313-cp313-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 61374f65368d0558ae383b94d9c5d478f9d9ca40b667d6cf7014764b91511dd5
MD5 87e07390e0acbe320c95300d40bb1d6a
BLAKE2b-256 87771c67aee9bf7389dd020e83b7a4750cb241cb8af484e1fefcb498a671c8c3

Hashes for cocoindex-0.1.51-cp313-cp313-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 035103e5d289209cb795514148d92a028a77f56d456a2c18eecead1261e4d5d5
MD5 9e144fe95f760a208efcbfb094997c3d
BLAKE2b-256 f18c1aebe4f3cd52a6901a6dca2a1dff4106b744c594880cadc6d4a5a32371ee

Hashes for cocoindex-0.1.51-cp312-cp312-win_amd64.whl
Algorithm Hash digest
SHA256 43b66f4ab04615b3ef588e17c554e9f77bee2c3f1b75bebb3f881e3713f1dbaf
MD5 e7bd4386160f058e8c2de06de51b60c1
BLAKE2b-256 d2a5fd832c0f675ea2a9aa15df0dc0bb8989d3586ade71e42d87753518bd143a

Hashes for cocoindex-0.1.51-cp312-cp312-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 6cd8e588c3abbaee51fe5c8c58c1530bb0e500037e67789b44684c54e166eefe
MD5 84bb7a56215aeead2bdc2516b041ba76
BLAKE2b-256 a5528f9d91f740bfa00eb32af7aacf6d33bdd0e9f9c20efa7adf2ae0bb7f21f3

Hashes for cocoindex-0.1.51-cp312-cp312-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 426915fddc3e552d5e43205a11756a08d27d2cd6daa0adfdc2b5f0ed5e51c95a
MD5 8b074788cca8484e5a51664d847402e0
BLAKE2b-256 60872814f626c0d657f7f57215ce3a9f82f83f52e726d15187f3139ac8b55ffb

Hashes for cocoindex-0.1.51-cp312-cp312-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 11c34b370c1b75b322bf076c6b5363509aa743971947a92d8168eaf7dbab8403
MD5 ec74b100b80f4fa80d7bf442cff6d73d
BLAKE2b-256 5f3be48743e8e35b90ac3f09fd1ab3e7481beb58b554fccfdc650cc558a813c1

Hashes for cocoindex-0.1.51-cp312-cp312-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 9d07561fa840f711e11663f0419686d4197177f11ed2a2f004f02e4b31972f99
MD5 6dc2448acf3688e334bf5ee7e38f7f4d
BLAKE2b-256 c882888c0a7e50b4058a757b20c95e6dcd5435b01dee3c431524f7424d585915

Hashes for cocoindex-0.1.51-cp311-cp311-win_amd64.whl
Algorithm Hash digest
SHA256 8d2dd8e512628e30dff3e2ab933eee3372d80b4e90b76b27053860c06e63b7dc
MD5 39b7968491be8c917e2e96f6d7d15ccd
BLAKE2b-256 75a1738f1f8149143236bb6c19305060d4997ee6ed63939a767108a9fc41fb5a

Hashes for cocoindex-0.1.51-cp311-cp311-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 fdc0c8b628f1f0299d273ff87becc99efc40c7192855141a51754e9e206abeeb
MD5 dbcec4ce6e72395069861d253379ba30
BLAKE2b-256 7cf2205ea69b11eec552d22d9df623172bb9ab3f836d85c858fd6a4b1fb6104e

Hashes for cocoindex-0.1.51-cp311-cp311-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 8b46ae308c5f28cf5428168405c1f80322c8fd772008b3289b27286f86c98ffe
MD5 bddf175409adfd1f96606ea317869f3c
BLAKE2b-256 2d43460b424e973d4304bc19418a7a50613fb8db81c487dc67f0217e8398c2e1

Hashes for cocoindex-0.1.51-cp311-cp311-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 5c96ad167c72bf05426a7eafbc183541b7a107f86ef7490fc23c53952c2e435a
MD5 50ffcaeae85d449e7cd34e57a3a7b2da
BLAKE2b-256 af62a161646ecf378f888b9d77a6dc14c35379690e4144568a407b60a8462e90

Hashes for cocoindex-0.1.51-cp311-cp311-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 6bbfc8c93035227942b43231dfd1db23df01fc038d43d8622e676d9a67669218
MD5 e07c7a29eeea38f6349f22a207eeff34
BLAKE2b-256 1b4aadcc999a6931f2f92425391efdc601bf184a03c016b92540ac430b9080b6
