A Python project aimed at extracting embeddings from textual data and performing semantic search.
MiniVectorDB
MiniVectorDB is a Python project for extracting embeddings from text and performing semantic search. It is a simple system that pairs a small, quantized ONNX model with a FAISS index for fast similarity search. Because the model is small, quantized, and runs on ONNX Runtime, embedding extraction is fast.
Model on Hugging Face: universal-sentence-encoder-multilingual-3-onnx-quantized
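To make the idea of similarity search over embeddings concrete, here is a minimal brute-force sketch in plain Python. It is not how MiniVectorDB works internally (the project uses FAISS), and the choice of cosine similarity as the score is an assumption made for illustration. The function names and the toy 3-dimensional vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query, stored, k):
    """Return the k (id, score) pairs most similar to `query`, best first.

    `stored` maps an identifier to its vector.
    """
    scored = [(item_id, cosine_similarity(query, vec))
              for item_id, vec in stored.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy 3-dimensional vectors standing in for real 512-dimensional embeddings.
toy_vectors = {
    1: [1.0, 0.0, 0.0],
    2: [0.8, 0.6, 0.0],
    3: [0.0, 0.0, 1.0],
}

results = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
for item_id, score in results:
    print("ID:", item_id, "Score:", round(score, 3))
```

A scan like this compares the query against every stored vector, which is fine for a handful of items. An index library such as FAISS exists to make the same kind of lookup fast on much larger collections.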
Installation
pip install minivectordb
Supported Languages
["en", "pt", "ar", "zh", "fr", "de", "it", "ja", "ko", "nl", "ps", "es", "th", "tr", "ru"]
Usage
from minivectordb.embedding_model import EmbeddingModel
from minivectordb.vector_database import VectorDatabase
vector_db = VectorDatabase(embedding_size = 512)
model = EmbeddingModel()
# Text identifier and sentences
sentences = [
    (1, "I like dogs"),
    (2, "I like cats"),
    (3, "The king has three kids"),
    (4, "The queen has one daughter"),
    (5, "Programming is cool"),
    (6, "Software development is cool"),
    (7, "I like to ride my bicycle"),
    (8, "I like to ride my scooter"),
    (9, "The sky is blue"),
    (10, "The ocean is blue")
]
for id, sentence in sentences:
    sentence_embedding = model.extract_embeddings(sentence)
    vector_db.store_embedding(id, sentence_embedding)
## Semantic Search
query_embedding = model.extract_embeddings("I like cats")
search_results = vector_db.find_most_similar(query_embedding, k = 5)
ids, distances = search_results
for id, dist in zip(ids, distances):
    print("ID:", id, "Distance:", dist)
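# Output (despite the "Distance" label, higher values mean a closer match;
# the sentence identical to the query scores 1.0):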
# ID: 2 Distance: 1.0
# ID: 1 Distance: 0.7593117
# ID: 8 Distance: 0.42757708
# ID: 7 Distance: 0.41723043
# ID: 5 Distance: 0.27484077
##################################################################
query_embedding = model.extract_embeddings("I am a programmer")
search_results = vector_db.find_most_similar(query_embedding, k = 5)
ids, distances = search_results
for id, dist in zip(ids, distances):
    print("ID:", id, "Distance:", dist)
# Output:
# ID: 5 Distance: 0.6494667
# ID: 6 Distance: 0.47456568
# ID: 1 Distance: 0.31276548
# ID: 2 Distance: 0.28922778
# ID: 7 Distance: 0.21100259
# Save the database to disk
vector_db.persist_to_disk()
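In the usage example above, the search returns only IDs and scores, so the caller has to keep its own mapping from IDs back to the original sentences. The following is a small sketch of that lookup in plain Python. The helper `describe_results` and the `id_to_text` dictionary are hypothetical and not part of MiniVectorDB; the IDs and scores are copied from the first example output.

```python
def describe_results(ids, scores, id_to_text):
    """Pair each returned ID with its original sentence and score."""
    return [(int(i), id_to_text[int(i)], float(s))
            for i, s in zip(ids, scores)]

# Caller-side mapping from identifier to sentence (hypothetical helper data).
id_to_text = {
    1: "I like dogs",
    2: "I like cats",
    8: "I like to ride my scooter",
}

# IDs and scores copied from the first example output.
rows = describe_results([2, 1, 8], [1.0, 0.7593117, 0.42757708], id_to_text)
for item_id, text, score in rows:
    print(f"{score:.3f}  [{item_id}] {text}")
```

Building `id_to_text` directly from the `sentences` list with `dict(sentences)` would work the same way, since each entry is an `(id, sentence)` pair.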
License
This project is licensed under the MIT License.