This library uses a universal format for vector datasets to make it easy to export and import data between vector databases. Pinecone and Qdrant are currently supported.

See the Contributing section to add support for your favorite vector database.

Universal Vector Dataset Format (VDF) specification

  1. VDF_META.json: a JSON file with the following schema:

interface Index {
  namespace: string;
  total_vector_count: number;
  exported_vector_count: number;
  dimensions: number;
  model_name: string;
  vector_columns: string[];
  data_path: string;
  metric: 'Euclid' | 'Cosine' | 'Dot';
}

interface VDFMeta {
  version: string;
  file_structure: string[];
  author: string;
  exported_from: 'pinecone' | 'qdrant'; // others when they are added
  indexes: {
    [key: string]: Index[];
  };
  exported_at: string;
}
  2. Parquet files/folders for metadata and vectors.
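
To make the schema concrete, here is a minimal sketch of a structural check for a VDF_META.json document. It is not part of the library: the helper names (`check_fields`, `validate_vdf_meta`) and every value in `sample_meta` are made up for illustration, and only the field names and the allowed `metric` values come from the interfaces above.

```python
import json

# Allowed values of Index.metric, taken from the schema above.
ALLOWED_METRICS = {"Euclid", "Cosine", "Dot"}

# Required keys and their Python types, mirroring the Index interface.
INDEX_FIELDS = {
    "namespace": str,
    "total_vector_count": int,
    "exported_vector_count": int,
    "dimensions": int,
    "model_name": str,
    "vector_columns": list,
    "data_path": str,
    "metric": str,
}

# Required keys and their Python types, mirroring the VDFMeta interface.
META_FIELDS = {
    "version": str,
    "file_structure": list,
    "author": str,
    "exported_from": str,
    "indexes": dict,
    "exported_at": str,
}


def check_fields(obj, fields, where):
    """Return a list of problems found in obj against the expected fields."""
    problems = []
    for key, expected in fields.items():
        if key not in obj:
            problems.append(f"{where}: missing '{key}'")
        elif not isinstance(obj[key], expected):
            problems.append(f"{where}: '{key}' should be {expected.__name__}")
    return problems


def validate_vdf_meta(meta):
    """Hypothetical helper: check a parsed VDF_META.json against the schema."""
    problems = check_fields(meta, META_FIELDS, "VDFMeta")
    indexes = meta.get("indexes")
    if isinstance(indexes, dict):
        for index_name, entries in indexes.items():
            for i, entry in enumerate(entries):
                where = f"indexes[{index_name!r}][{i}]"
                problems += check_fields(entry, INDEX_FIELDS, where)
                if entry.get("metric") not in ALLOWED_METRICS:
                    problems.append(f"{where}: unknown metric {entry.get('metric')!r}")
    return problems


# Illustrative values only; none of them come from a real export.
sample_meta = {
    "version": "0.0.8",
    "file_structure": ["my-index/"],
    "author": "example-user",
    "exported_from": "pinecone",
    "indexes": {
        "my-index": [
            {
                "namespace": "",
                "total_vector_count": 1000,
                "exported_vector_count": 1000,
                "dimensions": 768,
                "model_name": "hkunlp/instructor-xl",
                "vector_columns": ["vector"],
                "data_path": "my-index/",
                "metric": "Cosine",
            }
        ]
    },
    "exported_at": "2023-11-01T00:00:00Z",
}

# Round-trip through JSON, as the file on disk would be.
loaded = json.loads(json.dumps(sample_meta))
problems = validate_vdf_meta(loaded)
print(problems or "VDF_META.json looks structurally valid")
```

The check only covers key presence, basic types, and the metric enum. It does not verify that `data_path` exists or that the Parquet files match `dimensions`.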

Installation

git clone https://github.com/AI-Northstar-Tech/vector-io.git
cd vector-io
pip install -r requirements.txt

Export script

./export_vdf.py --help

usage: export.py [-h] [-m MODEL_NAME] [--max_file_size MAX_FILE_SIZE]
                 [--push_to_hub | --no-push_to_hub]
                 {pinecone,qdrant} ...

Export data from a vector database to VDF

options:
  -h, --help            show this help message and exit
  -m MODEL_NAME, --model_name MODEL_NAME
                        Name of model used
  --max_file_size MAX_FILE_SIZE
                        Maximum file size in MB (default: 1024)
  --push_to_hub, --no-push_to_hub
                        Push to hub

Vector Databases:
  Choose the vectors database to export data from

  {pinecone,qdrant}
    pinecone            Export data from Pinecone
    qdrant              Export data from Qdrant

./export_vdf.py pinecone --help
usage: export.py pinecone [-h] [-e ENVIRONMENT] [-i INDEX]
                          [-s ID_RANGE_START]
                          [--id_range_end ID_RANGE_END]
                          [-f ID_LIST_FILE]
                          [--modify_to_search MODIFY_TO_SEARCH]

options:
  -h, --help            show this help message and exit
  -e ENVIRONMENT, --environment ENVIRONMENT
                        Environment of Pinecone instance
  -i INDEX, --index INDEX
                        Name of index to export
  -s ID_RANGE_START, --id_range_start ID_RANGE_START
                        Start of id range
  --id_range_end ID_RANGE_END
                        End of id range
  -f ID_LIST_FILE, --id_list_file ID_LIST_FILE
                        Path to id list file
  --modify_to_search MODIFY_TO_SEARCH
                        Allow modifying data to search

./export_vdf.py qdrant --help
usage: export.py qdrant [-h] [-u URL] [-c COLLECTIONS]

options:
  -h, --help            show this help message and exit
  -u URL, --url URL     Location of Qdrant instance
  -c COLLECTIONS, --collections COLLECTIONS
                        Names of collections to export
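
The help output above has the shape of an argparse parser with global options followed by one subcommand per database. The sketch below shows how a command line with that shape could be declared and parsed. It is an illustration, not the project's actual source: the flag names and help strings are copied from the output above, while the `dest` name `vector_database`, the `int` type for `--max_file_size`, and the `False` default for `--push_to_hub` are assumptions. `argparse.BooleanOptionalAction` requires Python 3.9 or later.

```python
import argparse


def build_parser():
    """Hypothetical reconstruction of the export CLI from its --help text."""
    parser = argparse.ArgumentParser(
        prog="export.py",
        description="Export data from a vector database to VDF",
    )
    # Global options must appear before the subcommand on the command line.
    parser.add_argument("-m", "--model_name", help="Name of model used")
    parser.add_argument(
        "--max_file_size",
        type=int,  # assumption: the help text only says "in MB"
        default=1024,
        help="Maximum file size in MB (default: 1024)",
    )
    parser.add_argument(
        "--push_to_hub",
        action=argparse.BooleanOptionalAction,  # also generates --no-push_to_hub
        default=False,  # assumption: the default is not shown in the help text
        help="Push to hub",
    )

    subparsers = parser.add_subparsers(
        title="Vector Databases",
        dest="vector_database",  # assumed attribute name
        required=True,
    )

    pinecone = subparsers.add_parser("pinecone", help="Export data from Pinecone")
    pinecone.add_argument("-e", "--environment", help="Environment of Pinecone instance")
    pinecone.add_argument("-i", "--index", help="Name of index to export")
    pinecone.add_argument("-s", "--id_range_start", help="Start of id range")
    pinecone.add_argument("--id_range_end", help="End of id range")
    pinecone.add_argument("-f", "--id_list_file", help="Path to id list file")
    pinecone.add_argument("--modify_to_search", help="Allow modifying data to search")

    qdrant = subparsers.add_parser("qdrant", help="Export data from Qdrant")
    qdrant.add_argument("-u", "--url", help="Location of Qdrant instance")
    qdrant.add_argument("-c", "--collections", help="Names of collections to export")
    return parser


# Parse an example argument list instead of sys.argv.
args = build_parser().parse_args(
    ["-m", "hkunlp/instructor-xl", "--push_to_hub", "pinecone", "--environment", "gcp-starter"]
)
print(args.vector_database, args.environment, args.push_to_hub)
```

Because `-m` and `--push_to_hub` belong to the top-level parser, they go before the database name, and database-specific flags such as `--environment` go after it.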

Import script

./import_vdf.py --help
usage: import_vdf.py [-h] [-d DIR] {pinecone,qdrant} ...

Import data from VDF to a vector database

options:
  -h, --help         show this help message and exit
  -d DIR, --dir DIR  Directory to import

Vector Databases:
  Choose the vectors database to export data from

  {pinecone,qdrant}
    pinecone         Import data to Pinecone
    qdrant           Import data to Qdrant

./import_vdf.py pinecone --help
usage: import_vdf.py pinecone [-h] [-e ENVIRONMENT]

options:
  -h, --help            show this help message and exit
  -e ENVIRONMENT, --environment ENVIRONMENT
                        Pinecone environment

./import_vdf.py qdrant --help
usage: import_vdf.py qdrant [-h] [-u URL]

options:
  -h, --help         show this help message and exit
  -u URL, --url URL  Qdrant url
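
As a rough picture of what the directory passed with `--dir` contains, the sketch below reads VDF_META.json from an export directory and lists where each index's Parquet data would live. It is not the importer's actual code: `list_index_data` is a hypothetical helper, the metadata written to the temporary directory is made up, and resolving `data_path` relative to the export directory is an assumption.

```python
import json
import tempfile
from pathlib import Path


def list_index_data(vdf_dir):
    """Yield (index name, namespace, data directory) for each Index entry."""
    root = Path(vdf_dir)
    meta = json.loads((root / "VDF_META.json").read_text())
    for index_name, entries in meta["indexes"].items():
        for entry in entries:
            # Assumption: data_path is relative to the export directory.
            yield index_name, entry["namespace"], root / entry["data_path"]


with tempfile.TemporaryDirectory() as tmp:
    # Write a tiny, made-up VDF_META.json so the example is self-contained.
    meta = {
        "indexes": {
            "my-index": [{"namespace": "ns1", "data_path": "my-index/ns1"}]
        }
    }
    (Path(tmp) / "VDF_META.json").write_text(json.dumps(meta))

    found = [
        (name, namespace, path.relative_to(tmp).as_posix())
        for name, namespace, path in list_index_data(tmp)
    ]

print(found)
```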

Examples

./export_vdf.py -m hkunlp/instructor-xl --push_to_hub pinecone --environment gcp-starter

Follow the prompt to select the index and id range to export.

Contributing

Adding a new vector database

If you wish to add an import/export implementation for a new vector database, you must also implement the other side of the import/export for the same database. Please fork the repo and send a PR for both the import and export scripts.

Changing the VDF specification

If you wish to change the VDF specification, please open an issue to discuss the change before sending a PR.

Efficiency improvements

If you wish to improve the efficiency of the import/export scripts, please fork the repo and send a PR.

Questions

If you have any questions, please open an issue on the repo or message Dhruv Anand on LinkedIn.

This version

0.0.8
