MinVectorDB

MinVectorDB is a lightweight, serverless, locally deployed vector database implemented in pure Python. It offers clear and concise Python APIs and aims to lower the barrier to using vector databases.

A pure Python-implemented, lightweight, serverless, locally deployed vector database.

⚡ Serverless, simple parameters, simple API.

⚡ Fast, memory-efficient, easily scales to millions of vectors.

⚡ Supports cosine similarity and L2 distance, uses FLAT for exhaustive search or IVF-FLAT for inverted indexing.

⚡ Friendly caching technology stores recently queried vectors for accelerated access.

⚡ Based on a generic Python software stack, platform-independent, highly versatile.
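To make the two supported metrics concrete, here is a minimal standard-library sketch of cosine similarity and L2 distance. The function names and sample vectors are illustrative assumptions, not part of the MinVectorDB API; the point is only that the two metrics can rank the same candidates differently.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (higher means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def l2_distance(a, b):
    """Euclidean distance between a and b (lower means more similar)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical sample data: one vector points the same way as the query
# but is much longer, the other is close in space but slightly rotated.
query = [1.0, 0.0]
same_direction = [10.0, 0.0]
nearby = [0.9, 0.5]

# Cosine similarity ignores length, so it prefers same_direction.
print(cosine_similarity(query, same_direction) > cosine_similarity(query, nearby))
# L2 distance is sensitive to length, so it prefers nearby.
print(l2_distance(query, nearby) < l2_distance(query, same_direction))
```

Both comparisons print True, which is why the choice of distance matters when a collection is created.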

WARNING: MinVectorDB is actively being updated, and API backward compatibility is not guaranteed. You should use version numbers as a strong constraint during deployment to avoid unnecessary feature conflicts and errors. Although our goal is to enable brute force search or inverted indexing on billion-scale vectors, we currently still recommend using it on a scale of millions of vectors or less for the best experience.
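One way to act on this warning is to pin the version at install time (for example with `pip install MinVectorDB==0.3.0`) and to check the installed version at startup. The sketch below uses the standard `importlib.metadata` module; the helper names and the `EXPECTED_VERSION` constant are assumptions made for illustration.

```python
from importlib import metadata

# Assumed pin: the version this page documents.
EXPECTED_VERSION = "0.3.0"

def installed_version(package="MinVectorDB"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None

def matches_pin(installed, expected=EXPECTED_VERSION):
    """True only when the installed version is exactly the pinned one."""
    return installed == expected

found = installed_version()
if not matches_pin(found):
    print(f"Expected MinVectorDB {EXPECTED_VERSION}, found {found}")
```

An exact match is deliberately strict here, since the warning says backward compatibility is not guaranteed between releases.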

MinVectorDB is a vector database implemented purely in Python, designed to be lightweight, serverless, and easy to deploy locally. It offers straightforward and clear Python APIs, aiming to lower the entry barrier for using vector databases. In response to user needs and to enhance its practicality, we are planning to introduce new features, including but not limited to:

Optimizing Global Search Performance: We are focusing on algorithm and data structure enhancements to speed up searches across the database, enabling faster retrieval of vector data.
Enhancing Cluster Search with Inverted Indexes: Utilizing inverted index technology, we aim to refine the cluster search process for better search efficiency and precision.
Refining Clustering Algorithms: By improving our clustering algorithms, we intend to offer more precise and efficient data clustering to support complex queries.
Facilitating Vector Modifications and Deletions: We will introduce features to modify and delete vectors, allowing for more flexible data management.
Implementing Rollback Strategies: To increase database robustness and data security, rollback strategies will be added, helping users recover from incorrect operations or system failures easily.

MinVectorDB focuses on achieving 100% recall, prioritizing recall accuracy over high-speed search performance. This approach ensures that users can reliably retrieve all relevant vector data, making MinVectorDB particularly suitable for applications that require responses within hundreds of milliseconds.
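Recall here can be read as the fraction of the true nearest neighbours that a query actually returns. A small sketch of that measurement follows; the function name `recall_at_k` is a hypothetical helper, not something the library provides.

```python
def recall_at_k(returned_ids, true_ids):
    """Fraction of the true top-k neighbours present in the returned result."""
    true_set = set(true_ids)
    if not true_set:
        return 1.0  # nothing to find, so nothing was missed
    return len(true_set & set(returned_ids)) / len(true_set)

# An exhaustive search returns every true neighbour, so recall is 1.0.
print(recall_at_k([2, 9, 1], [2, 9, 1]))
# A search that misses one of two true neighbours has recall 0.5.
print(recall_at_k([1, 2], [5, 1]))
```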

While the project has not yet been benchmarked against other systems, we believe these planned features will significantly enhance MinVectorDB's capabilities in managing and retrieving vector data, addressing a wide range of user needs.

Install

pip install MinVectorDB

Quick Start

Environment setup (optional; these settings can be set only once per instance and must be set before instantiation)

import os

# logger settings
# logger level: DEBUG, INFO, WARNING, ERROR, CRITICAL
os.environ['MVDB_LOG_LEVEL'] = 'INFO'  # default: INFO, Options are 'DEBUG'/'INFO'/'WARNING'/'ERROR'/'CRITICAL'

# log path
os.environ['MVDB_LOG_PATH'] = './min_vec_db.log'  # default: None

# whether to truncate log file
os.environ['MVDB_TRUNCATE_LOG'] = 'True'  # default: True

# whether to add time to log
os.environ['MVDB_LOG_WITH_TIME'] = 'False'  # default: False

# clustering settings
# kmeans epochs
os.environ['MVDB_KMEANS_EPOCHS'] = '500'  # default: 100

# query cache size
os.environ['MVDB_QUERY_CACHE_SIZE'] = '10000'  # default: 10000

# specify the number of chunks in the memory cache
os.environ['MVDB_DATALOADER_BUFFER_SIZE'] = '20'  # default: '40', must be an integer-like string

import min_vec
print("MinVectorDB version is: ", min_vec.__version__)
print("MinVectorDB all configs: ", '\n - ' + '\n - '.join([f'{k}: {v}' for k, v in min_vec.get_all_configs().items()]))

MinVectorDB version is:  0.3.0
MinVectorDB all configs:  
 - MVDB_LOG_LEVEL: INFO
 - MVDB_LOG_PATH: ./min_vec_db.log
 - MVDB_TRUNCATE_LOG: True
 - MVDB_LOG_WITH_TIME: False
 - MVDB_KMEANS_EPOCHS: 500
 - MVDB_QUERY_CACHE_SIZE: 10000
 - MVDB_DATALOADER_BUFFER_SIZE: 20
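Because `os.environ` only accepts strings, numeric settings such as `MVDB_KMEANS_EPOCHS` have to be written as integer-like strings. The helper below is a hypothetical convenience (it is not part of `min_vec`) that validates a value before storing it; it would need to run before the library is instantiated, as noted above.

```python
import os

def set_int_option(name, value):
    """Validate that value is integer-like, then store it as a string."""
    number = int(str(value))  # raises ValueError for input such as 'many' or '20.5'
    os.environ[name] = str(number)
    return os.environ[name]

# Environment variable names are taken from the settings listed above.
set_int_option("MVDB_DATALOADER_BUFFER_SIZE", 20)
set_int_option("MVDB_QUERY_CACHE_SIZE", "10000")

print(os.environ["MVDB_DATALOADER_BUFFER_SIZE"])
```

Failing early with a `ValueError` keeps a malformed setting from reaching the library at all.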

Create a collection

from min_vec import MinVectorDB

# Specify database root directory
my_db = MinVectorDB(root_path='my_vec_db')

MinVectorDB - INFO - Successful initialization of MinVectorDB in root_path: /Users/guobingming/projects/MinVectorDB/my_vec_db

collection = my_db.require_collection("test_collection", 4, drop_if_exists=True)

MinVectorDB - INFO - Creating collection test_collection with: 
//    dim=4, collection='test_collection', 
//    n_clusters=16, chunk_size=100000,
//    distance='cosine', index_mode='IVF-FLAT', 
//    dtypes='float32', use_cache=True, 
//    scaler_bits=8, n_threads=10
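The log above shows the collection defaulting to `index_mode='IVF-FLAT'` with `n_clusters=16`. As a rough illustration of the general IVF-FLAT idea (not MinVectorDB's implementation), the sketch below assigns each vector to its nearest centroid and, at query time, scans only the closest clusters. The centroids are hand-picked here, whereas a real index would learn them with k-means; the `n_probe` parameter name, the sample data, and the use of L2 distance instead of cosine are all assumptions.

```python
import math

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def build_inverted_lists(vectors, centroids):
    """Map each centroid index to the ids of the vectors closest to it."""
    lists = {c: [] for c in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda c: l2(vec, centroids[c]))
        lists[nearest].append(vec_id)
    return lists

def ivf_search(query, vectors, centroids, lists, k, n_probe):
    """Exhaustively scan only the n_probe clusters nearest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: l2(query, centroids[c]))
    candidates = [vec_id for c in order[:n_probe] for vec_id in lists[c]]
    candidates.sort(key=lambda vec_id: l2(query, vectors[vec_id]))
    return candidates[:k]

# Hypothetical 2-D data and hand-picked centroids.
vectors = {1: [0.0, 0.2], 2: [0.1, 0.0], 3: [0.9, 1.0], 4: [1.0, 0.9], 5: [0.45, 0.6]}
centroids = [[0.0, 0.0], [1.0, 1.0]]
lists = build_inverted_lists(vectors, centroids)

query = [0.4, 0.4]
print(ivf_search(query, vectors, centroids, lists, k=2, n_probe=1))  # [1, 2]
print(ivf_search(query, vectors, centroids, lists, k=2, n_probe=2))  # [5, 1]
```

Probing a single cluster misses vector 5, the true nearest neighbour, because it was assigned to the other centroid. Probing every cluster is equivalent to an exhaustive FLAT search and recovers it.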

Add vectors

with collection.insert_session():
    id = collection.add_item(vector=[0.01, 0.34, 0.74, 0.31], id=1, field={'field': 'test_1', 'order': 0})
    id = collection.add_item(vector=[0.36, 0.43, 0.56, 0.12], id=2, field={'field': 'test_1', 'order': 1})
    id = collection.add_item(vector=[0.03, 0.04, 0.10, 0.51], id=3, field={'field': 'test_2', 'order': 2})
    id = collection.add_item(vector=[0.11, 0.44, 0.23, 0.24], id=4, field={'field': 'test_2', 'order': 3})
    id = collection.add_item(vector=[0.91, 0.43, 0.44, 0.67], id=5, field={'field': 'test_2', 'order': 4})
    id = collection.add_item(vector=[0.92, 0.12, 0.56, 0.19], id=6, field={'field': 'test_3', 'order': 5})
    id = collection.add_item(vector=[0.18, 0.34, 0.56, 0.71], id=7, field={'field': 'test_1', 'order': 6})
    id = collection.add_item(vector=[0.01, 0.33, 0.14, 0.31], id=8, field={'field': 'test_2', 'order': 7})
    id = collection.add_item(vector=[0.71, 0.75, 0.91, 0.82], id=9, field={'field': 'test_3', 'order': 8})
    id = collection.add_item(vector=[0.75, 0.44, 0.38, 0.75], id=10, field={'field': 'test_1', 'order': 9})

# If you do not use the insert_session function, you need to manually call the commit function to submit the data
# collection.commit()

print(id)
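The relationship between `insert_session` and `commit` described in the comment above can be pictured with a context manager. The toy class below is purely hypothetical and is not MinVectorDB's code; in particular, it assumes the commit happens only when the block finishes without an exception, which the source does not confirm.

```python
from contextlib import contextmanager

class ToyCollection:
    """Illustrative stand-in that buffers items until commit() is called."""

    def __init__(self):
        self.pending = []
        self.committed = []

    def add_item(self, vector, id):
        self.pending.append((id, vector))
        return id

    def commit(self):
        self.committed.extend(self.pending)
        self.pending.clear()

    @contextmanager
    def insert_session(self):
        yield self
        # Reached only if the with-block raised no exception (an assumption).
        self.commit()

toy = ToyCollection()
with toy.insert_session():
    toy.add_item(vector=[0.01, 0.34, 0.74, 0.31], id=1)
    toy.add_item(vector=[0.36, 0.43, 0.56, 0.12], id=2)

print(len(toy.committed), len(toy.pending))  # 2 0
```

Without the session, the same two `add_item` calls would leave both items in `pending` until `commit()` is called explicitly.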

Query

collection.query(vector=[0.36, 0.43, 0.56, 0.12], k=3)

(array([2, 9, 1]), Array([0.99822044, 0.9201999 , 0.8585187 ], dtype=float32))
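For comparison, the same top-3 can be reproduced with an exhaustive (FLAT-style) cosine search written in plain Python over the ten vectors added above. This is an independent sketch, not the library's code path, and `flat_search` is a hypothetical name.

```python
import heapq
import math

# The ten vectors inserted in the "Add vectors" step, keyed by id.
VECTORS = {
    1: [0.01, 0.34, 0.74, 0.31], 2: [0.36, 0.43, 0.56, 0.12],
    3: [0.03, 0.04, 0.10, 0.51], 4: [0.11, 0.44, 0.23, 0.24],
    5: [0.91, 0.43, 0.44, 0.67], 6: [0.92, 0.12, 0.56, 0.19],
    7: [0.18, 0.34, 0.56, 0.71], 8: [0.01, 0.33, 0.14, 0.31],
    9: [0.71, 0.75, 0.91, 0.82], 10: [0.75, 0.44, 0.38, 0.75],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def flat_search(query, vectors, k):
    """Score every stored vector and keep the k most similar."""
    scored = ((cosine_similarity(query, vec), vec_id) for vec_id, vec in vectors.items())
    top = heapq.nlargest(k, scored)
    return [vec_id for _, vec_id in top], [score for score, _ in top]

ids, scores = flat_search([0.36, 0.43, 0.56, 0.12], VECTORS, k=3)
print(ids)  # [2, 9, 1]
```

The ids match the library's result. The exact similarities computed this way are slightly higher than the reported ones (vector 2 is identical to the query, so its exact score is 1.0); the source does not explain the difference, so any explanation, such as reduced-precision storage, would be a guess.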

print(collection.query_report_)

* - MOST RECENT QUERY REPORT -
| - Database Shape: (10, 4)
| - Query Time: 0.00125 s
| - Query Distance: cosine
| - Query K: 3
| - Top 3 Results ID: [2 9 1]
| - Top 3 Results Similarity: [0.99822  0.9202   0.858519]
* - END OF REPORT -

collection.status_report_['DATABASE STATUS REPORT']

{'Database shape': (10, 4),
 'Database last_commit_time': datetime.datetime(2024, 4, 23, 21, 16, 38, 764711),
 'Database commit status': True,
 'Database index_mode': 'IVF-FLAT',
 'Database distance': 'cosine',
 'Database use_cache': True,
 'Database status': 'ACTIVE'}

Use Filter

import operator

from min_vec.structures.filter import Filter, FieldCondition, MatchField, IDCondition, MatchID


collection.query(
    vector=[0.36, 0.43, 0.56, 0.12], 
    k=10, 
    query_filter=Filter(
        must=[
            FieldCondition(key='field', matcher=MatchField('test_1')),  # Support for filtering fields
        ], 
        any=[

            FieldCondition(key='order', matcher=MatchField(8, comparator=operator.ge)),
            IDCondition(MatchID([1, 2, 3, 4, 5])),  # Support for filtering IDs
        ]
    )
)

print(collection.query_report_)

* - MOST RECENT QUERY REPORT -
| - Database Shape: (10, 4)
| - Query Time: 0.00237 s
| - Query Distance: cosine
| - Query K: 10
| - Top 10 Results ID: [ 2  1  4  5 10  3]
| - Top 10 Results Similarity: [0.99822    0.858519   0.85362    0.812733   0.783597   0.34614798]
* - END OF REPORT -
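The report lists only six ids even though k=10, because the filter leaves fewer than ten candidates. The general filter-then-rank pattern can be sketched in plain Python as below. The `keep` predicate is a hypothetical stand-in and deliberately does not model how `Filter`, `must`, and `any` combine, which the source does not spell out; it only restricts the search to ids 1 through 5.

```python
import math

# The ten vectors inserted in the "Add vectors" step, keyed by id.
VECTORS = {
    1: [0.01, 0.34, 0.74, 0.31], 2: [0.36, 0.43, 0.56, 0.12],
    3: [0.03, 0.04, 0.10, 0.51], 4: [0.11, 0.44, 0.23, 0.24],
    5: [0.91, 0.43, 0.44, 0.67], 6: [0.92, 0.12, 0.56, 0.19],
    7: [0.18, 0.34, 0.56, 0.71], 8: [0.01, 0.33, 0.14, 0.31],
    9: [0.71, 0.75, 0.91, 0.82], 10: [0.75, 0.44, 0.38, 0.75],
}

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def filtered_search(query, vectors, k, keep):
    """Drop ids rejected by keep(), then rank the survivors by similarity."""
    scored = [(cosine_similarity(query, vec), vec_id)
              for vec_id, vec in vectors.items() if keep(vec_id)]
    scored.sort(reverse=True)
    return [vec_id for _, vec_id in scored[:k]]

allowed = {1, 2, 3, 4, 5}
filtered_ids = filtered_search([0.36, 0.43, 0.56, 0.12], VECTORS, k=10,
                               keep=lambda vec_id: vec_id in allowed)
print(filtered_ids)  # [2, 1, 4, 5, 3]
```

Only five ids come back for k=10, in the same relative order as those ids appear in the report above.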

Drop a collection

print("Collection list before dropping:", my_db.show_collections())
my_db.drop_collection("test_collection")
print("Collection list after dropping:", my_db.show_collections())

Collection list before dropping: ['test_collection']
Collection list after dropping: []

Drop the database

my_db.drop_database()
my_db

DELETED MinVectorDB(root_path='/Users/guobingming/projects/MinVectorDB/my_vec_db')

What's Next

Release history

0.3.5: May 10, 2024
0.3.4: May 9, 2024
0.3.3: May 8, 2024
0.3.2: Apr 26, 2024
0.3.1: Apr 24, 2024
0.3.0: Apr 23, 2024 (this version)
0.2.7: Apr 17, 2024
0.2.6: Apr 16, 2024
0.2.5: Apr 15, 2024
0.2.4: Apr 2, 2024
0.2.3: Mar 7, 2024
0.2.2: Feb 26, 2024
0.2.1: Feb 23, 2024
0.2.0: Feb 23, 2024
0.1.5: Jan 29, 2024
0.1.4: Jan 29, 2024
0.1.3: Jan 26, 2024
0.1.2: Jan 25, 2024
0.1.1: Jan 25, 2024
0.1.0: Jan 16, 2024
0.0.13: Jan 8, 2024
0.0.12: Jan 7, 2024
0.0.11: Jan 7, 2024
0.0.10: Jan 5, 2024
0.0.9: Jan 2, 2024
0.0.8: Dec 22, 2023
0.0.7: Dec 21, 2023
0.0.6: Dec 19, 2023
0.0.5: Dec 19, 2023
0.0.4: Dec 18, 2023
0.0.3: Dec 17, 2023
0.0.2: Dec 17, 2023
0.0.1: Dec 17, 2023

Download files

Source Distribution: minvectordb-0.3.0.tar.gz (35.5 kB), uploaded Apr 23, 2024

Built Distribution: MinVectorDB-0.3.0-py3-none-any.whl (79.0 kB), uploaded Apr 23, 2024, Python 3

Hashes for minvectordb-0.3.0.tar.gz

Algorithm	Hash digest
SHA256	`ef4014fe3219ad3954b223c972a56ea4a9c85a56961f53256f3763816ae4c092`
MD5	`24bc8f9259c4ddf1220b0a14f230c000`
BLAKE2b-256	`9871271881229ba589f5e05c18e1f63138557d4f30729b97cc1020c2688b47f6`

Hashes for MinVectorDB-0.3.0-py3-none-any.whl

Algorithm	Hash digest
SHA256	`aa4655108a9b95a6366d9735f9268cc8f063c1f13670597bae6064ac1f73d98b`
MD5	`bf816c23ccd18b2ca8493adb87dbbe67`
BLAKE2b-256	`2a4052e4b02677d043a587084836e24c110228ea67a38098f5234fa670f476f6`