With CocoIndex, users declare the transformation; CocoIndex creates and maintains the index and keeps it up to date as the source changes, with minimal computation and changes.

Project description

CocoIndex

Data transformation for AI

Ultra-performant data transformation framework for AI, with the core engine written in Rust. Supports incremental processing and data lineage out of the box. Exceptional developer velocity. Production-ready at day 0.

⭐ Drop a star to help us grow!


CocoIndex makes it effortless to transform data with AI and to keep source data and targets in sync. Whether you're building a vector index for RAG, creating knowledge graphs, or performing any custom data transformation, CocoIndex goes beyond SQL.


Exceptional velocity

Just declare the transformation as a dataflow in ~100 lines of Python.

# import
data['content'] = flow_builder.add_source(...)

# transform (parentheses allow the chained calls to span several lines)
data['out'] = (
    data['content']
    .transform(...)
    .transform(...)
)

# collect data
collector.collect(...)

# export to db, vector db, graph db ...
collector.export(...)

CocoIndex follows the dataflow programming model. Each transformation creates a new field based solely on its input fields, without hidden state or value mutation. All data before and after each transformation is observable, with lineage out of the box.

In particular, developers don't explicitly mutate data by creating, updating, and deleting it. They only need to define the transformation (formula) for a set of source data.
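To make the "new field from input fields, no mutation" idea concrete, here is a minimal sketch in plain Python. It does not use CocoIndex; the `Row` class and its `derive` method are hypothetical names invented for this illustration, and the lineage tracking is a simplified stand-in for what the framework provides.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Row:
    """Hypothetical row type: fields are only ever added, never overwritten."""
    fields: dict[str, object] = field(default_factory=dict)
    lineage: dict[str, tuple[str, ...]] = field(default_factory=dict)

    def derive(self, name: str, inputs: tuple[str, ...],
               fn: Callable[..., object]) -> None:
        # A derived field is defined exactly once, purely from its inputs.
        if name in self.fields:
            raise ValueError(f"field {name!r} is already defined")
        self.fields[name] = fn(*(self.fields[i] for i in inputs))
        # Record which fields this one was computed from.
        self.lineage[name] = inputs


row = Row(fields={"content": "Hello CocoIndex"})
row.derive("lowered", ("content",), str.lower)
row.derive("n_words", ("lowered",), lambda s: len(s.split()))

print(row.fields["n_words"])   # 2
print(row.lineage["n_words"])  # ('lowered',)
```

Because no step overwrites an earlier field, every intermediate value stays inspectable, and the recorded inputs are enough to trace any output back to the source field.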

Plug-and-Play Building Blocks

Native built-ins for different sources, targets, and transformations. A standardized interface makes switching between components a one-line code change, as easy as assembling building blocks.
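The one-line switch relies on components sharing an interface. The sketch below shows that pattern in plain Python; `Target`, `InMemoryTarget`, `PrintTarget`, and `export` are hypothetical names for illustration only and are not CocoIndex APIs.

```python
from typing import Protocol


class Target(Protocol):
    """Hypothetical common interface that every target implements."""
    def write(self, rows: list[dict]) -> None: ...


class InMemoryTarget:
    def __init__(self) -> None:
        self.rows: list[dict] = []

    def write(self, rows: list[dict]) -> None:
        self.rows.extend(rows)


class PrintTarget:
    def write(self, rows: list[dict]) -> None:
        for row in rows:
            print(row)


def export(rows: list[dict], target: Target) -> None:
    # The pipeline only depends on the shared interface.
    target.write(rows)


# Switching the destination is a change to this single line,
# e.g. `target = PrintTarget()`.
target = InMemoryTarget()
export([{"id": 1, "text": "hello"}], target)
```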

Data Freshness

CocoIndex keeps source data and targets in sync effortlessly.

Incremental Processing

It has out-of-box support for incremental indexing:

  • Minimal recomputation on source or logic changes.
  • (Re-)processing of only the necessary portions, reusing the cache when possible.
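The two points above can be illustrated with a small fingerprint-based cache. This is only a sketch of the general technique, not CocoIndex's engine: `IncrementalIndex`, `fingerprint`, and `logic_version` are hypothetical names, and the real system persists its state (the setup section notes that it uses Postgres) rather than keeping it in memory.

```python
import hashlib
from typing import Callable


def fingerprint(*parts: str) -> str:
    """Hash the logic version together with the source content."""
    digest = hashlib.sha256()
    for part in parts:
        digest.update(part.encode("utf-8"))
        digest.update(b"\x00")  # separator so ("ab", "c") != ("a", "bc")
    return digest.hexdigest()


class IncrementalIndex:
    def __init__(self, transform: Callable[[str], str], logic_version: str):
        self.transform = transform
        self.logic_version = logic_version
        self.cache: dict[str, tuple[str, str]] = {}  # key -> (fingerprint, output)
        self.recomputed = 0

    def update(self, source: dict[str, str]) -> dict[str, str]:
        # Rows that vanished from the source are removed from the index.
        for key in list(self.cache):
            if key not in source:
                del self.cache[key]
        # Rows are recomputed only if their content or the logic changed.
        for key, content in source.items():
            fp = fingerprint(self.logic_version, content)
            cached = self.cache.get(key)
            if cached is None or cached[0] != fp:
                self.cache[key] = (fp, self.transform(content))
                self.recomputed += 1
        return {key: output for key, (_, output) in self.cache.items()}


index = IncrementalIndex(str.upper, logic_version="v1")
index.update({"a.md": "alpha", "b.md": "beta"})   # both rows computed
index.update({"a.md": "alpha", "b.md": "BETA!"})  # only b.md recomputed
print(index.recomputed)  # 3
```

Including the logic version in the fingerprint means a change to the transformation invalidates cached outputs in the same way a change to the source does.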

Quick Start:

If you're new to CocoIndex, we recommend checking out

Setup

  1. Install the CocoIndex Python library:
pip install -U cocoindex
  2. Install Postgres if you don't have one. CocoIndex uses it for incremental processing.

Define data flow

Follow the Quick Start Guide to define your first indexing flow. An example flow looks like:

import cocoindex


@cocoindex.flow_def(name="TextEmbedding")
def text_embedding_flow(flow_builder: cocoindex.FlowBuilder, data_scope: cocoindex.DataScope):
    # Add a data source to read files from a directory
    data_scope["documents"] = flow_builder.add_source(cocoindex.sources.LocalFile(path="markdown_files"))

    # Add a collector for data to be exported to the vector index
    doc_embeddings = data_scope.add_collector()

    # Transform data of each document
    with data_scope["documents"].row() as doc:
        # Split the document into chunks, put into `chunks` field
        doc["chunks"] = doc["content"].transform(
            cocoindex.functions.SplitRecursively(),
            language="markdown", chunk_size=2000, chunk_overlap=500)

        # Transform data of each chunk
        with doc["chunks"].row() as chunk:
            # Embed the chunk, put into `embedding` field
            chunk["embedding"] = chunk["text"].transform(
                cocoindex.functions.SentenceTransformerEmbed(
                    model="sentence-transformers/all-MiniLM-L6-v2"))

            # Collect the chunk into the collector.
            doc_embeddings.collect(filename=doc["filename"], location=chunk["location"],
                                   text=chunk["text"], embedding=chunk["embedding"])

    # Export collected data to a vector index.
    doc_embeddings.export(
        "doc_embeddings",
        cocoindex.targets.Postgres(),
        primary_key_fields=["filename", "location"],
        vector_indexes=[
            cocoindex.VectorIndexDef(
                field_name="embedding",
                metric=cocoindex.VectorSimilarityMetric.COSINE_SIMILARITY)])

It defines an index flow like this:

Data Flow
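The flow above passes `chunk_size=2000` and `chunk_overlap=500` to `SplitRecursively`. As a rough intuition for how a size and an overlap interact, here is a fixed-window sketch in plain Python. It is deliberately simpler than `SplitRecursively`, which takes a `language` argument and is not reproduced here; `sliding_chunks` is a hypothetical helper, and measuring sizes in characters is an assumption of this sketch.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: list[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

Each window starts `chunk_size - chunk_overlap` characters after the previous one, so neighbouring chunks repeat some text. The overlap keeps a passage that straddles a boundary fully contained in at least one chunk, at the cost of embedding more text overall.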

🚀 Examples and demo

Example | Description
Text Embedding | Index text documents with embeddings for semantic search
Code Embedding | Index code embeddings for semantic search
PDF Embedding | Parse PDF and index text embeddings for semantic search
Manuals LLM Extraction | Extract structured information from a manual using LLM
Amazon S3 Embedding | Index text documents from Amazon S3
Azure Blob Storage Embedding | Index text documents from Azure Blob Storage
Google Drive Text Embedding | Index text documents from Google Drive
Docs to Knowledge Graph | Extract relationships from Markdown documents and build a knowledge graph
Embeddings to Qdrant | Index documents in a Qdrant collection for semantic search
Embeddings to LanceDB | Index documents in a LanceDB collection for semantic search
FastAPI Server with Docker | Run the semantic search server in a Dockerized FastAPI setup
Product Recommendation | Build real-time product recommendations with LLM and graph database
Image Search with Vision API | Generate detailed captions for images using a vision model, embed them, and enable live-updating semantic search via FastAPI, served on a React frontend
Face Recognition | Recognize faces in images and build an embedding index
Paper Metadata | Index papers in PDF files and build metadata tables for each paper
Multi Format Indexing | Build a visual document index from PDFs and images with ColPali for semantic search
Custom Output Files | Convert markdown files to HTML files and save them to a local directory, using CocoIndex Custom Targets
Patient intake form extraction | Use LLM to extract structured data from patient intake forms with different formats

More are coming, so stay tuned 👀!

📖 Documentation

For detailed documentation, visit CocoIndex Documentation, including a Quickstart guide.

🤝 Contributing

We love contributions from our community ❤️. For details on contributing or running the project for development, check out our contributing guide.

👥 Community

Welcome with a huge coconut hug 🥥⋆。˚🤗. We are super excited about community contributions of all kinds, whether code improvements, documentation updates, issue reports, feature requests, or discussions in our Discord.

Join our community here:

Support us:

We are constantly improving, and more features and examples are coming soon. If you love this project, please drop us a star ⭐ at our GitHub repo to stay tuned and help us grow.

License

CocoIndex is Apache 2.0 licensed.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

cocoindex-0.2.17.tar.gz (30.0 MB)

Uploaded: Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

cocoindex-0.2.17-cp311-abi3-win_amd64.whl (16.7 MB)

Uploaded: CPython 3.11+, Windows x86-64

cocoindex-0.2.17-cp311-abi3-manylinux_2_28_x86_64.whl (17.4 MB)

Uploaded: CPython 3.11+, manylinux: glibc 2.28+ x86-64

cocoindex-0.2.17-cp311-abi3-manylinux_2_28_aarch64.whl (16.7 MB)

Uploaded: CPython 3.11+, manylinux: glibc 2.28+ ARM64

cocoindex-0.2.17-cp311-abi3-macosx_11_0_arm64.whl (16.5 MB)

Uploaded: CPython 3.11+, macOS 11.0+ ARM64

cocoindex-0.2.17-cp311-abi3-macosx_10_12_x86_64.whl (17.2 MB)

Uploaded: CPython 3.11+, macOS 10.12+ x86-64

File details

Details for the file cocoindex-0.2.17.tar.gz.

File metadata

  • Download URL: cocoindex-0.2.17.tar.gz
  • Upload date:
  • Size: 30.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: maturin/1.9.5

File hashes

Hashes for cocoindex-0.2.17.tar.gz
Algorithm Hash digest
SHA256 98f1c3c7574d92ee7ff40994815234c000284dd603f35b9f3afa9855a7172cb9
MD5 76751e02aa5551e9828732039a1f9367
BLAKE2b-256 7e235b5509bf3f5eb4b9a4f52fc2463e3c4b57e2ce6268bc96073c6b5c137a58

See more details on using hashes here.
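As a sketch of how a published digest can be checked against a downloaded file, the following uses only Python's standard `hashlib`. The helper names `sha256_of` and `matches` are hypothetical, and the demo hashes a temporary file rather than a real distribution.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path, block_size: int = 1 << 20) -> str:
    """Return the hex SHA256 digest of a file, read in blocks to bound memory use."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with a published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"
    sample.write_bytes(b"hello")
    ok = matches(sample, hashlib.sha256(b"hello").hexdigest())

print(ok)  # True
```

For the source distribution listed above, the call would be `matches(Path("cocoindex-0.2.17.tar.gz"), "98f1c3c7574d92ee7ff40994815234c000284dd603f35b9f3afa9855a7172cb9")`, assuming the file has been downloaded to the current directory.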

File details

Details for the file cocoindex-0.2.17-cp311-abi3-win_amd64.whl.

File hashes

Hashes for cocoindex-0.2.17-cp311-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 497428a2e8de4d328f5f629f6f9140a545ded495707ae955faf239b552328c79
MD5 5b8f6c8bfcf06ffbb9e0d44c12f09925
BLAKE2b-256 20d33b4393d3bd832b1f914422f915891ed28eb3de456cda39271b2565e5fdc0

File details

Details for the file cocoindex-0.2.17-cp311-abi3-manylinux_2_28_x86_64.whl.

File hashes

Hashes for cocoindex-0.2.17-cp311-abi3-manylinux_2_28_x86_64.whl
Algorithm Hash digest
SHA256 b789444c6e15dd71d51c572d7bc11d79e1d831f55dfbf310e993c98518ada2f6
MD5 86d024d9b1cca969263d2bf707ced07f
BLAKE2b-256 f75813f4c0c616d218fcfd988ab59e32eff62e1e3f933bee3a17802f13fca2c9

File details

Details for the file cocoindex-0.2.17-cp311-abi3-manylinux_2_28_aarch64.whl.

File hashes

Hashes for cocoindex-0.2.17-cp311-abi3-manylinux_2_28_aarch64.whl
Algorithm Hash digest
SHA256 6368f497617ffc004a92111ac609a5061f5a62c3c2781e474095b0b84a0a385c
MD5 2aa1d28b37ef34f603375611b580a6e3
BLAKE2b-256 b87bf9572ab8751f2ad4898700d85981ea976860b89fe84eb9122ec8e385a18a

File details

Details for the file cocoindex-0.2.17-cp311-abi3-macosx_11_0_arm64.whl.

File hashes

Hashes for cocoindex-0.2.17-cp311-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 ca866d185d2dea01ebc79363e43c703d73aa291786c8f46e37bbb1f47837b258
MD5 25faa0679a24cce74a51ad2ffbb08590
BLAKE2b-256 7f43e7cd3e9312672372ab521d20a2551b5aa4b4e6f9ac7e41d454ab74b5f699

File details

Details for the file cocoindex-0.2.17-cp311-abi3-macosx_10_12_x86_64.whl.

File hashes

Hashes for cocoindex-0.2.17-cp311-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 03b896b3f2aefc0deccd0675892f246866e8831654cf98fa7c27fb844c70b2bc
MD5 4aae8a0adf132b890a6520743b10092b
BLAKE2b-256 ac65dc4a01693638c3e67d91e59bc6834c7d00266fdd78168b71f8f1c14adc0e
