LlamaIndex x LanceDB MultiModal AI Lakehouse

This package integrates the multi-modal functionalities of LanceDB with LlamaIndex.

To install it, you can run:

pip install llama-index-indices-managed-lancedb

And you can then use it in your scripts as an index!

You can use it for text or images, and you can also employ it as a base for a retriever and a query engine.

Text

You can use LanceDB with text in the following way:

from llama_index.indices.managed.lancedb import LanceDBMultiModalIndex

# use it with a local database
local_index = LanceDBMultiModalIndex(
    uri="lancedb/data",
    text_embedding_model="sentence-transformers",
    embedding_model_kwargs={"name": "all-MiniLM-L6-v2"},
    table_name="documents",
)
# use a remote connection
remote_index = LanceDBMultiModalIndex(
    uri="db://***",
    region="us-east-1",
    api_key="***",
    text_embedding_model="sentence-transformers",
    embedding_model_kwargs={"name": "all-MiniLM-L6-v2"},
    table_name="remote_documents",
)


# After instantiating the index with the primary constructor (__init__), you always have to connect it:
## 1. If you set use_async = True:
async def connect_lancedb_index():
    await documents_index.acreate_index()


## 2. If you set use_async = False (this is the default behavior):
local_index.create_index()

# load it from documents (async constructor)
from llama_index.core.schema import Document

document_data = [
    Document(text="This is an example document"),
    Document(text="This is an example document 1"),
]
documents_index = await LanceDBMultiModalIndex.from_documents(
    documents=document_data,
    uri="lancedb/documents",
    text_embedding_model="sentence-transformers",
    embedding_model_kwargs={"name": "all-MiniLM-L6-v2"},
    table_name="from_documents",
    indexing="NO_INDEXING",
    use_async=True,
)
## load it from different type of data, e.g. PyArrow tables, Pandas/Polars DataFrames or list of dictionaries (async constructor)
import pandas as pd
import numpy as np

data = pd.DataFrame(
    {
        "text": ["## Hello world", "This is a test"],
        "id": ["1", "2"],
        "metadata": ['{"type": "text/markdown"}', '{"type": "text/plain"}'],
        "vector": [
            np.random.random(384).tolist(),
            np.random.random(384).tolist(),
        ],
    }
)
data_index = await LanceDBMultiModalIndex.from_data(
    data=data,
    uri="lancedb/documents",
    text_embedding_model="sentence-transformers",
    embedding_model_kwargs={"name": "all-MiniLM-L6-v2"},
    table_name="from_data",
    indexing="HNSW_PQ",
    use_async=True,
)

We should notice three things here:

1. You can choose your own text embedding model among the ones supported by LanceDB.
2. The schema for a text table is defined as follows:

class TextSchema(LanceModel):
    id: str
    metadata: str  # deserializable
    text: str
    vector: List[List[float]]

In this schema, the text field is the source field from which the embedding model produces a vector, whereas the vector field must match the dimensions of the vectors produced by the embedding model.

3. You can define whether or not you want to index your table, and how to index it. Take a look at the LanceDB docs to see which indexing strategies are available.
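The two constraints on a text row (a metadata string that can be deserialized, and a vector whose length matches the embedding model's output) can be checked before inserting data. The following is a minimal sketch using only the standard library. `validate_text_row` and `EXPECTED_DIM` are hypothetical names, not part of the package, and 384 is simply the vector length used in the DataFrame example.

```python
import json
from typing import Any, Dict

# Assumption: 384 is the vector length used with all-MiniLM-L6-v2 in this README.
EXPECTED_DIM = 384


def validate_text_row(row: Dict[str, Any], dim: int = EXPECTED_DIM) -> Dict[str, Any]:
    """Check one row against the text table layout and return its parsed metadata."""
    for field in ("id", "metadata", "text"):
        if not isinstance(row.get(field), str):
            raise ValueError(f"field {field!r} must be a string")
    # The metadata column is stored as a string, so it must parse as JSON.
    try:
        metadata = json.loads(row["metadata"])
    except json.JSONDecodeError as exc:
        raise ValueError(f"metadata is not valid JSON: {exc}") from exc
    vector = row.get("vector")
    # Assumption: a missing vector is left for the embedding model to compute.
    if vector is not None and len(vector) != dim:
        raise ValueError(f"vector has {len(vector)} values, expected {dim}")
    return metadata


row = {
    "id": "1",
    "metadata": '{"type": "text/markdown"}',
    "text": "## Hello world",
    "vector": [0.0] * EXPECTED_DIM,
}
print(validate_text_row(row))
```

Running a check like this on each row before building the DataFrame surfaces a malformed metadata string or a wrong vector length early, rather than at insertion time.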

Important: the following examples use only sync methods. If you set use_async = True, you need to use the corresponding async methods instead.
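The async constructors in this README are called with a top-level await, which works in a notebook but is a syntax error in a plain Python script. Below is a minimal sketch of how async methods can be driven from a script with asyncio.run. `FakeAsyncIndex` is a hypothetical stand-in so that the snippet runs without LanceDB; only the method name acreate_index is taken from the README.

```python
import asyncio


class FakeAsyncIndex:
    """Hypothetical stand-in for an index created with use_async=True."""

    def __init__(self) -> None:
        self.connected = False

    async def acreate_index(self) -> None:
        # A real index would open its connection here; this one only flips a flag.
        await asyncio.sleep(0)
        self.connected = True


async def main() -> FakeAsyncIndex:
    index = FakeAsyncIndex()
    # Async methods must be awaited from inside a coroutine.
    await index.acreate_index()
    return index


# asyncio.run starts an event loop, runs the coroutine to completion, and closes the loop.
index = asyncio.run(main())
print(index.connected)
```

The same pattern should apply to any other async call: gather the awaits in one coroutine and hand it to asyncio.run once, at the entry point of the script.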

Once you have instantiated and connected the LanceDB index, you can:

Add or delete nodes

local_index.insert_nodes(
    documents=[
        Document(text="Hello world", id_="1"),
        Document(text="How are you?", id_="2"),
    ],
)

# add from data
local_index.insert_data(
    data=pd.DataFrame(
        {
            "text": ["Hello world", "How are you?"],
            "id": ["1", "2"],
            "metadata": [
                '{"type": "text/markdown"}',
                '{"type": "text/plain"}',
            ],
        }
    ),
)

local_index.delete_nodes(["1", "2"])
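The metadata column holds JSON strings, which are easy to get wrong when written by hand. As a small sketch, rows for insert_data could be built from plain dictionaries with json.dumps. `build_rows` is a hypothetical helper, and passing a list of dictionaries is an assumption based on the data types this README says are accepted.

```python
import json
from typing import Any, Dict, List


def build_rows(items: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Turn {"id", "text", "metadata"} items into rows with JSON-encoded metadata."""
    rows = []
    for item in items:
        rows.append(
            {
                "id": str(item["id"]),
                "text": item["text"],
                # json.dumps guarantees the string can be deserialized later.
                "metadata": json.dumps(item.get("metadata", {})),
            }
        )
    return rows


rows = build_rows(
    [
        {"id": 1, "text": "Hello world", "metadata": {"type": "text/markdown"}},
        {"id": 2, "text": "How are you?", "metadata": {"type": "text/plain"}},
    ]
)
print(rows[0]["metadata"])
```

The resulting list can also be wrapped in a DataFrame first if you prefer to pass tabular data.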

Retrieve

retriever = local_index.as_retriever()
nodes = retriever.retrieve(query_str="Hello world!")
print(nodes)

Query

query_engine = local_index.as_query_engine()
response = query_engine.query(query_str="Hello world!")
print(response.response)

Images

images_index = LanceDBMultiModalIndex(
    uri="lancedb/images",
    multi_modal_embedding_model="open-clip",
    table_name="images",
)

# initialize from documents
from llama_index.core.schema import ImageDocument

images_index = await LanceDBMultiModalIndex.from_documents(
    documents=[
        ImageDocument(
            image_url="http://farm1.staticflickr.com/53/167798175_7c7845bbbd_z.jpg",
            metadata={"label": "cat"},
        ),
        ImageDocument(
            image_url="http://farm1.staticflickr.com/134/332220238_da527d8140_z.jpg",
            metadata={"label": "cat"},
        ),
        ImageDocument(
            image_url="http://farm9.staticflickr.com/8387/8602747737_2e5c2a45d4_z.jpg",
            metadata={"label": "dog"},
        ),
    ],
    uri="lancedb/images",
    multi_modal_embedding_model="open-clip",
    table_name="images",
)

# initialize from data
labels = ["dog", "horse", "horse"]
uris = [
    "http://farm5.staticflickr.com/4092/5017326486_1f46057f5f_z.jpg",
    "http://farm9.staticflickr.com/8216/8434969557_d37882c42d_z.jpg",
    "http://farm6.staticflickr.com/5142/5835678453_4f3a4edb45_z.jpg",
]
ids = [
    "1",
    "2",
    "3",
]
metadata = [
    '{"mimetype": "image/jpeg"}',
    '{"mimetype": "image/jpeg"}',
    '{"mimetype": "image/jpeg"}',
]

import requests

# download the raw bytes of each image
image_bytes = [requests.get(uri).content for uri in uris]

data = pd.DataFrame(
    {
        "id": ids,
        "label": labels,
        "image_uri": uris,
        "image_bytes": image_bytes,
        "metadata": metadata,
    }
)

images_index = await LanceDBMultiModalIndex.from_data(
    data=data,
    uri="lancedb/images",
    multi_modal_embedding_model="open-clip",
    table_name="images",
)

As before, you can choose your multi-modal embedding model and the indexing strategy, but this time the schema is slightly different:

class MultiModalSchema(LanceModel):
    id: str
    metadata: str  # deserializable
    label: str
    image_uri: str  # image uri as the source
    image_bytes: bytes  # image bytes as the source
    vector: List[List[float]]  # vector column
    vec_from_bytes: List[
        List[float]
    ]  # Another vector column (uses only bytes as source)

In this case, the source fields for the embedding model are image_uri and image_bytes.
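Each row of the image table needs one value per column, so it can help to check that all columns have the same length before building the DataFrame. A stray trailing comma, for instance, turns a list into a one-element tuple. This is a minimal sketch with a hypothetical `check_columns` helper, not something the package provides.

```python
from typing import Dict, Sequence


def check_columns(columns: Dict[str, Sequence]) -> int:
    """Return the shared column length, or raise if the columns disagree."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return next(iter(lengths.values()), 0)


ids = ["1", "2", "3"]
labels = ["dog", "horse", "horse"]
metadata = ['{"mimetype": "image/jpeg"}'] * 3
# Note the trailing comma: this is a tuple holding one list, so its length is 1.
wrapped = (metadata,)

print(check_columns({"id": ids, "label": labels, "metadata": metadata}))
try:
    check_columns({"id": ids, "metadata": wrapped})
except ValueError as exc:
    print("rejected:", exc)
```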

You can use the index just as you would for text, with one key difference when retrieving or querying: you query with images!

query_engine = images_index.as_query_engine()
# query_image can be a URL, an ImageBlock, an ImageDocument or a PIL Image
response = query_engine.query(
    query_image="http://farm6.staticflickr.com/5142/5835678453_4f3a4edb45_z.jpg"
)
# you can also use an image path
response = query_engine.query(
    query_image_path="/Users/user/images/hello_world.jpg"
)

Extra features

You can initialize the index from an existing table, setting table_exists = True in the constructor methods.
Some methods (such as insert or delete_ref_doc_id) work only for adding or deleting a single node.
If you set use_async = True you cannot use synchronous methods, and vice versa!

Project details

Release history: version 0.1.0, released Jul 4, 2025

