This library uses a universal format for vector datasets to easily export and import data from all vector databases.
See the Contributing section to add support for your favorite vector database.
## Supported Vector Databases
(Request support for a VectorDB by voting/commenting here: https://github.com/AI-Northstar-Tech/vector-io/discussions/38)
| Vector Database | Import | Export |
|---|---|---|
| Pinecone | ✅ | ✅ |
| Qdrant | ✅ | ✅ |
| Milvus | ✅ | ✅ |
| GCP Vertex AI Vector Search | ✅ | ✅ |
| KDB.AI | ✅ | ✅ |
| Azure AI Search | 🔜 | 🔜 |
| Rockset | 🔜 | 🔜 |
| Vespa | ⏳ | ⏳ |
| Weaviate | ⏳ | ⏳ |
| MongoDB Atlas | ⏳ | ⏳ |
| Epsilla | ⏳ | ⏳ |
| txtai | ⏳ | ⏳ |
| Redis Search | ⏳ | ⏳ |
| OpenSearch | ⏳ | ⏳ |
| Activeloop Deep Lake | ❌ | ❌ |
| Anari AI | ❌ | ❌ |
| Apache Cassandra | ❌ | ❌ |
| ApertureDB | ❌ | ❌ |
| Chroma | ❌ | ❌ |
| ClickHouse | ❌ | ❌ |
| CrateDB | ❌ | ❌ |
| DataStax Astra DB | ❌ | ❌ |
| Elasticsearch | ❌ | ❌ |
| LanceDB | ❌ | ❌ |
| Marqo | ❌ | ❌ |
| Meilisearch | ❌ | ❌ |
| MyScale | ❌ | ❌ |
| Neo4j | ❌ | ❌ |
| Nuclia DB | ❌ | ❌ |
| OramaSearch | ❌ | ❌ |
| pgvector | ❌ | ❌ |
| Turbopuffer | ❌ | ❌ |
| Typesense | ❌ | ❌ |
| USearch | ❌ | ❌ |
| Vald | ❌ | ❌ |
| Apache Solr | ❌ | ❌ |
## Universal Vector Dataset Format (VDF) specification

A VDF dataset consists of two parts: a VDF_META.json file, and Parquet files/folders holding the metadata and vectors.

VDF_META.json is a JSON file with the following schema:

```typescript
interface Index {
  namespace: string;
  total_vector_count: number;
  exported_vector_count: number;
  dimensions: number;
  model_name: string;
  vector_columns: string[];
  data_path: string;
  metric: 'Euclid' | 'Cosine' | 'Dot';
}

interface VDFMeta {
  version: string;
  file_structure: string[];
  author: string;
  exported_from: 'pinecone' | 'qdrant'; // others as they are added
  indexes: {
    [key: string]: Index[];
  };
  exported_at: string;
}
```
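For illustration, a minimal VDF_META.json conforming to this schema might look like the following (all names, paths, and counts are made-up placeholders):

```json
{
  "version": "0.0.31",
  "file_structure": ["VDF_META.json", "my-index/i1.parquet"],
  "author": "example-user",
  "exported_from": "pinecone",
  "indexes": {
    "my-index": [
      {
        "namespace": "default",
        "total_vector_count": 10000,
        "exported_vector_count": 10000,
        "dimensions": 768,
        "model_name": "hkunlp/instructor-xl",
        "vector_columns": ["vector"],
        "data_path": "my-index/i1.parquet",
        "metric": "Cosine"
      }
    ]
  },
  "exported_at": "2024-01-01T00:00:00+00:00"
}
```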
## Installation

### Using pip

```bash
pip install vdf-io
```

### From source

```bash
git clone https://github.com/AI-Northstar-Tech/vector-io.git
cd vector-io
pip install -r requirements.txt
```
## Export Script

```bash
export_vdf --help
```

```text
usage: export_vdf [-h] [-m MODEL_NAME] [--max_file_size MAX_FILE_SIZE]
                  [--push_to_hub | --no-push_to_hub] [--public | --no-public]
                  {pinecone,qdrant,kdbai,milvus,vertexai_vectorsearch} ...

Export data from various vector databases to the VDF format for vector datasets

options:
  -h, --help            show this help message and exit
  -m MODEL_NAME, --model_name MODEL_NAME
                        Name of model used
  --max_file_size MAX_FILE_SIZE
                        Maximum file size in MB (default: 1024)
  --push_to_hub, --no-push_to_hub
                        Push to hub
  --public, --no-public
                        Make dataset public (default: False)

Vector Databases:
  Choose the vector database to export data from

  {pinecone,qdrant,kdbai,milvus,vertexai_vectorsearch}
    pinecone            Export data from Pinecone
    qdrant              Export data from Qdrant
    kdbai               Export data from KDB.AI
    milvus              Export data from Milvus
    vertexai_vectorsearch
                        Export data from Vertex AI Vector Search
```
## Import script

```bash
import_vdf --help
```

```text
usage: import_vdf [-h] [-d DIR] [-s | --subset | --no-subset]
                  [--create_new | --no-create_new]
                  {milvus,pinecone,qdrant,vertexai_vectorsearch,kdbai} ...

Import data from VDF to a vector database

options:
  -h, --help            show this help message and exit
  -d DIR, --dir DIR     Directory to import
  -s, --subset, --no-subset
                        Import a subset of data (default: False)
  --create_new, --no-create_new
                        Create a new index (default: False)

Vector Databases:
  Choose the vector database to import data into

  {milvus,pinecone,qdrant,vertexai_vectorsearch,kdbai}
    milvus              Import data to Milvus
    pinecone            Import data to Pinecone
    qdrant              Import data to Qdrant
    vertexai_vectorsearch
                        Import data to Vertex AI Vector Search
    kdbai               Import data to KDB.AI
```
## Re-embed script

This script re-embeds a vector dataset: it takes a directory containing a dataset in the VDF format and re-embeds it using a new model. You can also specify the name of the column containing the text to be embedded.

```bash
reembed.py --help
```

```text
usage: reembed.py [-h] -d DIR [-m NEW_MODEL_NAME] [-t TEXT_COLUMN]

Reembed a vector dataset

options:
  -h, --help            show this help message and exit
  -d DIR, --dir DIR     Directory of vector dataset in the VDF format
  -m NEW_MODEL_NAME, --new_model_name NEW_MODEL_NAME
                        Name of new model to be used
  -t TEXT_COLUMN, --text_column TEXT_COLUMN
                        Name of the column containing text to be embedded
```
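For example, to re-embed an exported dataset with a different model (the directory, model, and column names below are illustrative, not defaults of the tool):

```bash
reembed.py -d ./my_vdf_dataset -m sentence-transformers/all-MiniLM-L6-v2 -t text
```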
## Examples

```bash
export_vdf -m hkunlp/instructor-xl --push_to_hub pinecone --environment gcp-starter
```

Follow the prompts to select the index and ID range to export.
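A matching import, pointing at the directory the export produced, could then be run with the documented `-d` flag and subcommand (the directory name below is illustrative):

```bash
import_vdf -d ./vdf_pinecone_export pinecone
```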
## Contributing

### Adding a new vector database

If you wish to add an import/export implementation for a new vector database, you must implement both sides for the same database. Please fork the repo and send a PR containing both the import and export scripts.

Steps to add a new vector database (ABC):

Export:

1. Add a new subparser in export_vdf_cli.py for the new vector database. Add database-specific arguments to the subparser, such as the URL of the database, any authentication tokens, etc.
2. Add a new file in src/vdf_io/export_vdf/ for the new vector database. This file should define a class ExportABC which inherits from ExportVDF.
3. Specify a DB_NAME_SLUG for the class.
4. The class should implement the get_data() function to download points (in a batched manner) with all their metadata from the specified index of the vector database. This data should be stored in a series of Parquet files/folders, with the metadata stored in a JSON file following the schema above. A structural sketch of such a class follows this list.
5. Use the script to export data from an example index of the vector database and verify that the data is exported correctly.
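The sketch below shows only the shape of an export class; the real base-class interface lives in src/vdf_io/export_vdf/ and may differ, and the client calls, file layout, and metadata values here are placeholders:

```python
import json
import os

import pandas as pd


class ExportABC:  # in the real repo this would inherit from vdf_io's ExportVDF
    DB_NAME_SLUG = "abc"  # used by the CLI to route to this exporter

    def __init__(self, args):
        # args: parsed CLI arguments (database URL, auth token, index name, ...)
        self.args = args

    def get_data(self):
        out_dir = "abc_export/my-index"  # hypothetical output layout
        os.makedirs(out_dir, exist_ok=True)
        # Page through the index in batches; fetch_batches() stands in for
        # whatever pagination the ABC client library provides.
        for i, batch in enumerate(self.fetch_batches()):
            # Each row holds the point id, the vector, and any metadata fields.
            pd.DataFrame(batch).to_parquet(f"{out_dir}/i{i + 1}.parquet")
        # Write VDF_META.json following the schema shown earlier
        # (values here are placeholders, not a complete VDFMeta).
        meta = {
            "version": "0.0.31",
            "exported_from": self.DB_NAME_SLUG,
            "indexes": {"my-index": []},
            "exported_at": "2024-01-01T00:00:00+00:00",
        }
        with open("abc_export/VDF_META.json", "w") as f:
            json.dump(meta, f, indent=2)

    def fetch_batches(self):
        # Placeholder: yield pages of {"id", "vector", metadata...} dicts
        # fetched from the database.
        yield [{"id": "0", "vector": [0.1, 0.2], "text": "hello"}]
```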
Import:

1. Add a new subparser in import_vdf_cli.py for the new vector database. Add database-specific arguments to the subparser, such as the URL of the database, any authentication tokens, etc.
2. Add a new file in src/vdf_io/import_vdf/ for the new vector database. This file should define a class ImportABC which inherits from ImportVDF. It should implement the upsert_data() function to upload points from a VDF dataset (in a batched manner) with all their metadata to the specified index of the vector database. All metadata about the dataset should be read from the VDF_META.json file in the VDF folder. A sketch of this pattern follows this list.
3. Use the script to import data from the example VDF dataset exported in the previous step and verify that the data is imported correctly.
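A matching import class, again only as a sketch of the pattern (the real ImportVDF base class defines the actual interface; the batch size and client call are placeholders):

```python
import json

import pandas as pd


class ImportABC:  # in the real repo this would inherit from vdf_io's ImportVDF
    DB_NAME_SLUG = "abc"

    def __init__(self, vdf_dir):
        self.vdf_dir = vdf_dir  # directory produced by an export

    def upsert_data(self):
        # All metadata about the dataset comes from VDF_META.json.
        with open(f"{self.vdf_dir}/VDF_META.json") as f:
            meta = json.load(f)
        for index_name, namespaces in meta["indexes"].items():
            for ns in namespaces:
                df = pd.read_parquet(f"{self.vdf_dir}/{ns['data_path']}")
                # Upsert in batches; upsert_batch() stands in for the ABC
                # client's bulk-write call.
                for start in range(0, len(df), 1000):
                    self.upsert_batch(index_name, df.iloc[start:start + 1000])

    def upsert_batch(self, index_name, batch):
        # Placeholder for the real client call.
        print(f"would upsert {len(batch)} rows into index '{index_name}'")
```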
### Changing the VDF specification

If you wish to change the VDF specification, please open an issue to discuss the change before sending a PR.

### Efficiency improvements

If you wish to improve the efficiency of the import/export scripts, please fork the repo and send a PR.

### Questions

If you have any questions, please open an issue on the repo or message Dhruv Anand on LinkedIn.