MiniVectorDB
This is a Python project aimed at extracting embeddings from textual data and performing semantic search. It's a simple yet powerful system combining a small quantized ONNX model with FAISS indexing for fast similarity search. As the model is small and also running in ONNX runtime with quantization, we get lightning fast speed.
Model link on Hugging Face: universal-sentence-encoder-multilingual-3-onnx-quantized
Features
- Embedding Model: Load the ONNX model to extract embeddings from text.
- Vector Database: Store and manage textual embeddings, perform fast similarity searches using FAISS.
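Conceptually, the similarity search boils down to comparing a query embedding against every stored vector and returning the closest matches. A minimal pure-Python sketch of cosine-similarity nearest-neighbor search (illustrative only — the library delegates this work to a FAISS index, which scales far better):

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector norms
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def find_most_similar(query, stored, k=2):
    # stored: dict mapping id -> embedding vector
    scored = [(id_, cosine_similarity(query, vec)) for id_, vec in stored.items()]
    # Highest similarity first
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

stored = {
    1: [1.0, 0.0, 0.0],
    2: [0.9, 0.1, 0.0],
    3: [0.0, 1.0, 0.0],
}
print(find_most_similar([1.0, 0.0, 0.0], stored, k=2))
# Closest ids come first: 1 (exact match), then 2
```

This brute-force scan is O(n) per query; FAISS provides the same nearest-neighbor semantics with optimized index structures.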
Getting Started
Prerequisites
- Python 3.11
- ONNX Runtime + Extensions
- FAISS
- NumPy
- pytest (for testing)
Alternatively, install all dependencies at once with pip install -r requirements.txt
Installation
pip install minivectordb
Usage
from minivectordb.embedding_model import EmbeddingModel
from minivectordb.vector_database import VectorDatabase
vector_db = VectorDatabase(embedding_size=512)
model = EmbeddingModel()
# Text identifier and sentences
sentences = [
(1, "I like dogs"),
(2, "I like cats"),
(3, "The king has three kids"),
(4, "The queen has one daughter"),
(5, "Programming is cool"),
(6, "Software development is cool"),
(7, "I like to ride my bicycle"),
(8, "I like to ride my scooter"),
(9, "The sky is blue"),
(10, "The ocean is blue")
]
for id, sentence in sentences:
    sentence_embedding = model.extract_embeddings(sentence)
    vector_db.store_embedding(id, sentence_embedding)
# Semantic search
query_embedding = model.extract_embeddings("I like cats")
search_results = vector_db.find_most_similar(query_embedding, k=5)
ids, distances = search_results
for id, dist in zip(ids, distances):
    print("ID:", id, "Distance:", dist)
# Output:
# ID: 2 Distance: 1.0
# ID: 1 Distance: 0.7593117
# ID: 8 Distance: 0.42757708
# ID: 7 Distance: 0.41723043
# ID: 5 Distance: 0.27484077
##################################################################
query_embedding = model.extract_embeddings("I am a programmer")
search_results = vector_db.find_most_similar(query_embedding, k=5)
ids, distances = search_results
for id, dist in zip(ids, distances):
    print("ID:", id, "Distance:", dist)
# Output:
# ID: 5 Distance: 0.6494667
# ID: 6 Distance: 0.47456568
# ID: 1 Distance: 0.31276548
# ID: 2 Distance: 0.28922778
# ID: 7 Distance: 0.21100259
# Save the database to disk
vector_db.persist_to_disk()
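Conceptually, persisting a vector store amounts to serializing the id-to-embedding mapping (and the index) so it can be reloaded later. A minimal stdlib sketch of that idea using pickle — this illustrates the concept only and is not the library's actual on-disk format:

```python
import io
import pickle

# Toy id -> embedding mapping standing in for a vector store
embeddings = {1: [0.1, 0.2, 0.3], 2: [0.4, 0.5, 0.6]}

# Serialize to an in-memory buffer (swap in a real file opened in
# binary mode to persist to disk)
buffer = io.BytesIO()
pickle.dump(embeddings, buffer)

# Deserialize and verify the round trip
buffer.seek(0)
restored = pickle.load(buffer)
assert restored == embeddings
```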
Testing
Ensure you have pytest and pytest-cov installed. Run the tests using:
pytest --cov=minivectordb
For detailed coverage reports:
pytest --cov=minivectordb --cov-report=term-missing
Contributing
- Fork the repository on GitHub.
- Clone your fork locally: git clone https://github.com/yourusername/your-repo-name.git
- Create a new branch for your feature or fix: git checkout -b your-branch-name
- Commit your changes and push to your fork: git push origin your-branch-name
- Create a new Pull Request from your fork to the main repository.
License
This project is licensed under the MIT License.