This library uses a universal format for vector datasets to easily export and import data from all vector databases.
Project description
This library uses a universal format for vector datasets to easily export and import data from all vector databases.
See the Contributing section to add support for your favorite vector database.
Universal Vector Dataset Format (VDF) specification
VDF_META.json: It is a json file with the following schema:
interface Index { namespace: string; total_vector_count: number; exported_vector_count: number; dimensions: number; model_name: string; vector_columns: string[]; data_path: string; metric: 'Euclid' | 'Cosine' | 'Dot'; } interface VDFMeta { version: string; file_structure: string[]; author: string; exported_from: 'pinecone' | 'qdrant'; // others when they are added indexes: { [key: string]: Index[]; }; exported_at: string; }
Parquet files/folders for metadata and vectors.
Installation
git clone https://github.com/AI-Northstar-Tech/vector-io.git
cd vector-io
pip install -r requirements.txt
Export Script
./export_vdf.py --help
usage: export.py [-h] [-m MODEL_NAME] [--max_file_size MAX_FILE_SIZE]
[--push_to_hub | --no-push_to_hub]
{pinecone,qdrant} ...
Export data from a vector database to VDF
options:
-h, --help show this help message and exit
-m MODEL_NAME, --model_name MODEL_NAME
Name of model used
--max_file_size MAX_FILE_SIZE
Maximum file size in MB (default: 1024)
--push_to_hub, --no-push_to_hub
Push to hub
Vector Databases:
Choose the vectors database to export data from
{pinecone,qdrant}
pinecone Export data from Pinecone
qdrant Export data from Qdrant
./export_vdf.py pinecone --help
usage: export.py pinecone [-h] [-e ENVIRONMENT] [-i INDEX]
[-s ID_RANGE_START]
[--id_range_end ID_RANGE_END]
[-f ID_LIST_FILE]
[--modify_to_search MODIFY_TO_SEARCH]
options:
-h, --help show this help message and exit
-e ENVIRONMENT, --environment ENVIRONMENT
Environment of Pinecone instance
-i INDEX, --index INDEX
Name of index to export
-s ID_RANGE_START, --id_range_start ID_RANGE_START
Start of id range
--id_range_end ID_RANGE_END
End of id range
-f ID_LIST_FILE, --id_list_file ID_LIST_FILE
Path to id list file
--modify_to_search MODIFY_TO_SEARCH
Allow modifying data to search
./export_vdf.py qdrant --help
usage: export.py qdrant [-h] [-u URL] [-c COLLECTIONS]
options:
-h, --help show this help message and exit
-u URL, --url URL Location of Qdrant instance
-c COLLECTIONS, --collections COLLECTIONS
Names of collections to export
Import script
./import_vdf.py --help
usage: import_vdf.py [-h] [-d DIR] {pinecone,qdrant} ...
Import data from VDF to a vector database
options:
-h, --help show this help message and exit
-d DIR, --dir DIR Directory to import
Vector Databases:
Choose the vectors database to export data from
{pinecone,qdrant}
pinecone Import data to Pinecone
qdrant Import data to Qdrant
./import_vdf.py pinecone --help
usage: import_vdf.py pinecone [-h] [-e ENVIRONMENT]
options:
-h, --help show this help message and exit
-e ENVIRONMENT, --environment ENVIRONMENT
Pinecone environment
./import_vdf.py qdrant --help
usage: import_vdf.py qdrant [-h] [-u URL]
options:
-h, --help show this help message and exit
-u URL, --url URL Qdrant url
Examples
./export_vdf.py -m hkunlp/instructor-xl --push_to_hub pinecone --environment gcp-starter
Follow the prompt to select the index and id range to export.
Contributing
Adding a new vector database
If you wish to add an import/export implementation for a new vector database, you must also implement the other side of the import/export for the same database. Please fork the repo and send a PR for both the import and export scripts.
Changing the VDF specification
If you wish to change the VDF specification, please open an issue to discuss the change before sending a PR.
Efficiency improvements
If you wish to improve the efficiency of the import/export scripts, please fork the repo and send a PR.
Questions
If you have any questions, please open an issue on the repo or message Dhruv Anand on LinkedIn
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file vdf_io-0.0.10.tar.gz
.
File metadata
- Download URL: vdf_io-0.0.10.tar.gz
- Upload date:
- Size: 42.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 65aac98b91be85f0a81b66c0d4267f48835836a2712c8a42437fea714168d049 |
|
MD5 | b0ac1b2ce22d2edf54fded6c0f22ac4b |
|
BLAKE2b-256 | 43e6085da196adda905e464a55b1d351cd529ebe1003fd0b02f4b9bbc1e180d3 |
File details
Details for the file vdf_io-0.0.10-py3-none-any.whl
.
File metadata
- Download URL: vdf_io-0.0.10-py3-none-any.whl
- Upload date:
- Size: 73.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.11.6
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 702fe11e68463ac9a477a099ca1066a69cab26d73b31b89ce33f845c2c063483 |
|
MD5 | c6925acb584cb1e9c0adfa36b19a579a |
|
BLAKE2b-256 | 911427825fad83db79e1b004c98ee09f0d97cc1c68a11d674f097766a45361c0 |