With CocoIndex, you declare the transformation; CocoIndex creates and maintains the index and keeps it up to date as the source changes, with minimal recomputation and minimal changes to the target.

Project description

CocoIndex

Data transformation for AI

An ultra-performant data transformation framework for AI, with its core engine written in Rust. It supports incremental processing and data lineage out of the box, offers exceptional developer velocity, and is production-ready from day 0.

⭐ Drop a star to help us grow!


CocoIndex makes it effortless to transform data with AI and to keep source data and targets in sync. Whether you're building a vector index for RAG, creating knowledge graphs, or performing any other custom data transformation, it goes beyond SQL.


Exceptional velocity

Just declare the transformations as a dataflow in ~100 lines of Python.

# import
data['content'] = flow_builder.add_source(...)

# transform (parentheses let the method chain span several lines)
data['out'] = (
    data['content']
    .transform(...)
    .transform(...)
)

# collect data
collector.collect(...)

# export to db, vector db, graph db ...
collector.export(...)

CocoIndex follows the dataflow programming model. Each transformation creates a new field based solely on its input fields, with no hidden state and no value mutation. All data before and after each transformation is observable, with lineage out of the box.

In particular, developers don't explicitly mutate data by creating, updating, or deleting it. They only need to define the transformations (formulas) for a set of source data.
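
The dataflow idea described here can be illustrated without CocoIndex itself. The following is a minimal plain-Python sketch, not CocoIndex's API: `Row`, `derive`, and the field names are hypothetical, chosen only to show that each new field is computed purely from existing fields and that lineage falls out of recording which inputs each field used.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass(frozen=True)
class Row:
    """An immutable record: field values plus, for each derived field, its inputs."""
    values: dict[str, Any]
    lineage: dict[str, tuple[str, ...]] = field(default_factory=dict)


def derive(row: Row, name: str, inputs: tuple[str, ...], fn: Callable[..., Any]) -> Row:
    """Return a new Row with `name` computed purely from `inputs`; `row` is untouched."""
    if name in row.values:
        raise ValueError(f"field {name!r} already exists; fields are never overwritten")
    value = fn(*(row.values[i] for i in inputs))
    return Row({**row.values, name: value}, {**row.lineage, name: inputs})


source = Row({"content": "Hello  dataflow world"})
step1 = derive(source, "normalized", ("content",), lambda s: " ".join(s.split()).lower())
step2 = derive(step1, "word_count", ("normalized",), lambda s: len(s.split()))

print(step2.values["word_count"])   # 3
print(step2.lineage)                # which inputs produced each derived field
```

Because every step returns a new record, the data before and after each transformation stays available for inspection, which is the property the dataflow model relies on.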

Plug-and-Play Building Blocks

Native built-ins for different sources, targets, and transformations. A standardized interface makes switching between components a one-line code change, as easy as assembling building blocks.
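
To make the "one-line switch" idea concrete, here is a small plain-Python sketch of a standardized interface. It does not use CocoIndex's real classes: `Target`, `InMemoryTarget`, `JsonLinesTarget`, and `run_pipeline` are hypothetical names used only to show how a shared interface lets one component replace another without touching the pipeline.

```python
import json
from typing import Protocol


class Target(Protocol):
    """Hypothetical standardized interface that every export target implements."""
    def write(self, rows: list[dict]) -> None: ...


class InMemoryTarget:
    """Stores exported rows in a Python list."""
    def __init__(self) -> None:
        self.rows: list[dict] = []

    def write(self, rows: list[dict]) -> None:
        self.rows.extend(rows)


class JsonLinesTarget:
    """Stores exported rows as JSON-encoded lines."""
    def __init__(self) -> None:
        self.lines: list[str] = []

    def write(self, rows: list[dict]) -> None:
        self.lines.extend(json.dumps(r, sort_keys=True) for r in rows)


def run_pipeline(docs: list[str], target: Target) -> None:
    """The pipeline only knows the Target interface, not the concrete backend."""
    rows = [{"text": d, "length": len(d)} for d in docs]
    target.write(rows)


target = InMemoryTarget()          # swap to JsonLinesTarget() to change the backend
run_pipeline(["alpha", "be"], target)
print(target.rows)
```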

Data Freshness

CocoIndex keeps source data and targets in sync effortlessly.

Incremental Processing

It has out-of-the-box support for incremental indexing:

  • minimal recomputation when the source or the logic changes;
  • (re-)processing of only the necessary portions, reusing the cache when possible.
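
One common way to get this behavior is to fingerprint each source row together with a version of the transformation logic, and to recompute only when the fingerprint changes. The sketch below illustrates that general technique in plain Python; it is an assumption for illustration, not a description of how CocoIndex implements incremental processing, and `IncrementalIndex` and `logic_version` are hypothetical names.

```python
import hashlib
from typing import Callable


class IncrementalIndex:
    """Recompute a derived value only when the source content or the logic version changes."""

    def __init__(self, transform: Callable[[str], str], logic_version: str) -> None:
        self.transform = transform
        self.logic_version = logic_version
        self.cache: dict[str, tuple[str, str]] = {}   # key -> (fingerprint, output)
        self.recomputed = 0

    def _fingerprint(self, content: str) -> str:
        data = f"{self.logic_version}\x00{content}".encode("utf-8")
        return hashlib.sha256(data).hexdigest()

    def update(self, sources: dict[str, str]) -> dict[str, str]:
        # Drop outputs whose source rows no longer exist.
        for key in set(self.cache) - set(sources):
            del self.cache[key]
        # Recompute only rows that are new or whose fingerprint changed.
        for key, content in sources.items():
            fp = self._fingerprint(content)
            cached = self.cache.get(key)
            if cached is None or cached[0] != fp:
                self.cache[key] = (fp, self.transform(content))
                self.recomputed += 1
        return {key: out for key, (_, out) in self.cache.items()}


index = IncrementalIndex(str.upper, logic_version="v1")
index.update({"a.md": "hello", "b.md": "world"})    # both rows computed
index.update({"a.md": "hello", "b.md": "World!"})   # only b.md recomputed
print(index.recomputed)  # 3
```

Including the logic version in the fingerprint means a change to the transformation invalidates cached outputs, while unchanged rows under unchanged logic are reused.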

Quick Start:

If you're new to CocoIndex, we recommend checking out

Setup

  1. Install the CocoIndex Python library:
pip install -U cocoindex
  2. Install Postgres if you don't have it. CocoIndex uses it for incremental processing.
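
Before running a flow, it can help to sanity-check the Postgres connection string. The sketch below only parses a URL with the standard library; it does not connect to the database. The environment variable name `COCOINDEX_DATABASE_URL` and the example URL are assumptions for illustration, so check the CocoIndex documentation for the exact setting; `check_postgres_url` is a hypothetical helper.

```python
import os
from urllib.parse import urlparse


def check_postgres_url(url: str) -> dict[str, object]:
    """Parse a Postgres connection URL and return its parts, or raise ValueError."""
    parsed = urlparse(url)
    if parsed.scheme not in ("postgres", "postgresql"):
        raise ValueError(f"expected a postgres:// URL, got scheme {parsed.scheme!r}")
    if not parsed.hostname:
        raise ValueError("connection URL has no host")
    return {
        "host": parsed.hostname,
        "port": parsed.port or 5432,          # 5432 is the default Postgres port
        "database": parsed.path.lstrip("/"),
        "user": parsed.username,
    }


# Assumed variable name and example value; adjust both to your own setup.
url = os.environ.get(
    "COCOINDEX_DATABASE_URL", "postgres://cocoindex:cocoindex@localhost/cocoindex"
)
print(check_postgres_url(url))
```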

Define data flow

Follow the Quick Start Guide to define your first indexing flow. An example flow looks like this:

import cocoindex

@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.targets.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])

It defines an index flow like this:

Data Flow
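
To see what the flow's stages do conceptually, the following plain-Python sketch mirrors them with simple stand-ins. It is not CocoIndex code: `split_with_overlap` is a fixed-size sliding window (not the `SplitRecursively` algorithm), `toy_embed` is a letter-count vector (not a SentenceTransformer model), the index is an in-memory dict rather than Postgres, and `location` is a start offset chosen for illustration. All function names are hypothetical.

```python
import math


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[tuple[int, str]]:
    """Fixed-size sliding window; returns (start_offset, chunk_text) pairs."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break
    return chunks


def toy_embed(text: str) -> list[float]:
    """Stand-in embedding: counts of the letters a-z (26 dimensions)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_index(docs: dict[str, str], chunk_size: int, chunk_overlap: int) -> dict:
    """Rows keyed by (filename, location), mirroring primary_key_fields in the flow."""
    index = {}
    for filename, content in docs.items():
        for location, text in split_with_overlap(content, chunk_size, chunk_overlap):
            index[(filename, location)] = {"text": text, "embedding": toy_embed(text)}
    return index


def search(index: dict, query: str, top_k: int = 1) -> list[tuple[str, int]]:
    """Rank chunk keys by cosine similarity between the query and chunk embeddings."""
    q = toy_embed(query)
    ranked = sorted(index, key=lambda k: cosine_similarity(q, index[k]["embedding"]), reverse=True)
    return ranked[:top_k]


docs = {"a.md": "zebra zebra zebra", "b.md": "quick quick quick"}
index = build_index(docs, chunk_size=8, chunk_overlap=2)
print(search(index, "zebra"))
```

The overlap between consecutive chunks keeps text near a chunk boundary searchable from both sides, which is the reason the flow sets `chunk_overlap` alongside `chunk_size`.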

🚀 Examples and demo

Example | Description
Text Embedding | Index text documents with embeddings for semantic search
Code Embedding | Index code embeddings for semantic search
PDF Embedding | Parse PDF and index text embeddings for semantic search
Manuals LLM Extraction | Extract structured information from a manual using LLM
Amazon S3 Embedding | Index text documents from Amazon S3
Azure Blob Storage Embedding | Index text documents from Azure Blob Storage
Google Drive Text Embedding | Index text documents from Google Drive
Docs to Knowledge Graph | Extract relationships from Markdown documents and build a knowledge graph
Embeddings to Qdrant | Index documents in a Qdrant collection for semantic search
FastAPI Server with Docker | Run the semantic search server in a Dockerized FastAPI setup
Product Recommendation | Build real-time product recommendations with LLM and graph database
Image Search with Vision API | Generate detailed captions for images using a vision model, embed them, and enable live-updating semantic search served via FastAPI with a React frontend
Face Recognition | Recognize faces in images and build an embedding index
Paper Metadata | Index papers in PDF files, and build metadata tables for each paper
Multi Format Indexing | Build a visual document index from PDFs and images with ColPali for semantic search
Custom Output Files | Convert markdown files to HTML files and save them to a local directory, using CocoIndex Custom Targets
Patient intake form extraction | Use LLM to extract structured data from patient intake forms with different formats

More are coming, stay tuned 👀!

📖 Documentation

For detailed documentation, visit CocoIndex Documentation, including a Quickstart guide.

🤝 Contributing

We love contributions from our community ❤️. For details on contributing or running the project for development, check out our contributing guide.

👥 Community

Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited for community contributions of all kinds, whether code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.

Join our community here:

Support us:

We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ on our GitHub repo to stay tuned and help us grow.

License

CocoIndex is Apache 2.0 licensed.

Project details


This version

0.2.6

Download files

Source Distribution

  • cocoindex-0.2.6.tar.gz (29.9 MB): Source

Built Distributions

  • cocoindex-0.2.6-cp311-abi3-win_amd64.whl (16.0 MB): CPython 3.11+, Windows x86-64
  • cocoindex-0.2.6-cp311-abi3-manylinux_2_28_x86_64.whl (16.8 MB): CPython 3.11+, manylinux (glibc 2.28+) x86-64
  • cocoindex-0.2.6-cp311-abi3-manylinux_2_28_aarch64.whl (16.1 MB): CPython 3.11+, manylinux (glibc 2.28+) ARM64
  • cocoindex-0.2.6-cp311-abi3-macosx_11_0_arm64.whl (15.9 MB): CPython 3.11+, macOS 11.0+ ARM64
  • cocoindex-0.2.6-cp311-abi3-macosx_10_12_x86_64.whl (16.5 MB): CPython 3.11+, macOS 10.12+ x86-64

File details

Details for the file cocoindex-0.2.6.tar.gz.

File metadata

  • Download URL: cocoindex-0.2.6.tar.gz
  • Upload date:
  • Size: 29.9 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.9.4

File hashes

Hashes for cocoindex-0.2.6.tar.gz
Algorithm Hash digest
SHA256 e6f6b5b39f469f4f17fa6d2aca40525b49456852337bb4b8abdc811794c57cc7
MD5 80f41534134b8d3946402728bbdcfd92
BLAKE2b-256 49e7e1bbfac5083d5864e16f67f4fc6ac0d640885ca36e410abad5b850adaec2

See more details on using hashes here.
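
A downloaded file can be checked against the listed digest with the standard library. This is a minimal sketch: the local file path is an assumption, and `sha256_of_file` and `verify_download` are hypothetical helper names. The expected value is the SHA256 listed for cocoindex-0.2.6.tar.gz.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, block_size: int = 1 << 20) -> str:
    """Stream the file in blocks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_download(path: Path, expected_sha256: str) -> bool:
    """Compare the file's SHA256 hex digest with the published one."""
    return sha256_of_file(path) == expected_sha256.strip().lower()


# SHA256 listed for cocoindex-0.2.6.tar.gz
EXPECTED = "e6f6b5b39f469f4f17fa6d2aca40525b49456852337bb4b8abdc811794c57cc7"
archive = Path("cocoindex-0.2.6.tar.gz")   # assumed local download location
if archive.exists():
    print("hash OK" if verify_download(archive, EXPECTED) else "hash MISMATCH")
```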

File details

Details for the file cocoindex-0.2.6-cp311-abi3-win_amd64.whl.

File hashes

Hashes for cocoindex-0.2.6-cp311-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 670653764ae9d87f2dfe82dc94d8dda7c2bd4fa32f7b842727d0355d5123c7db
MD5 ba0c3d077a1a14473a699f17d6c9f349
BLAKE2b-256 74dc72b659c797ca08000bb2ddf091bf38ecf07d81bf93f1024bcfe8526541a9

File details

Details for the file cocoindex-0.2.6-cp311-abi3-manylinux_2_28_x86_64.whl.

File hashes

Hashes for cocoindex-0.2.6-cp311-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 fce90c576dd9e436f7dfd7f6a14927b07e2705824226b5d9eaa95354fafae8b9
MD5 9dd1a1ac52468c1cbf694d206bf757aa
BLAKE2b-256 8229e09ad0cde3cf335363a2b10e79cd8da401819b032e07afa106011ee50870

File details

Details for the file cocoindex-0.2.6-cp311-abi3-manylinux_2_28_aarch64.whl.

File hashes

Hashes for cocoindex-0.2.6-cp311-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 5372e6d9e69eac163602e0fa656299c3180516256b7be89439e0d1bf3290e010
MD5 e5dc9fe95bcbe0f9aa289c7b75000581
BLAKE2b-256 fce6b63cf014447b5f1182fdc40cefc7641c355cae55d86254b1a6de5487af99

File details

Details for the file cocoindex-0.2.6-cp311-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for cocoindex-0.2.6-cp311-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 895784e26266e899af85548ef73c627e6e02bcaea2aa9ffa82f8856a0be3947c
MD5 587d0a8e72110e7935fa75853388ba0c
BLAKE2b-256 3b9cb5de00b8b0e085db9789216ea265e9542a3f0cbadeae16b485ac2b8f2258

File details

Details for the file cocoindex-0.2.6-cp311-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for cocoindex-0.2.6-cp311-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 4289680c94da1234c8b3f1dd214f4bb8c66232d2d2da90fe8d6e21836e092967
MD5 39dbd30bd1e158a17c88aa1c5b647a78
BLAKE2b-256 591d2d8625f60f743c80f7e898f8dac85958e57f8d9539254fa9ca8220e349e4
