Dimensia is a Python library for managing document embeddings and performing efficient similarity-based searches using various distance metrics. It supports creating collections, adding documents, and querying text data with customizable vector-based search.
Project description
Dimensia - Vector Database for Document Embeddings
Dimensia is a lightweight vector database designed for managing and querying documents using embeddings. It supports creating collections, adding documents with optional metadata, performing semantic searches, and retrieving detailed document information.
Features
- Embeddings: Generates document embeddings using the sentence-transformers library, providing vector representations of text.
- Collections: Allows you to create, manage, and query collections of documents.
- Search: Performs efficient searches for the top-k most relevant documents based on a query, using cosine, Euclidean, or Manhattan distance metrics.
- Concurrency: Uses thread pooling to handle document ingestion concurrently for efficient processing.
- Metadata: Supports storing metadata with documents for enriched queries.
- Duplicate Detection: Identifies and skips duplicate documents based on content hashes.
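The three distance metrics listed above each take a few lines of plain Python. The sketch below is illustrative only, not Dimensia's internal code; it assumes vectors are plain lists of floats:

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 means the vectors point the same way
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def euclidean_distance(a, b):
    # Straight-line (L2) distance between the two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # Sum of absolute coordinate differences (L1 distance)
    return sum(abs(x - y) for x, y in zip(a, b))
```

Cosine distance ignores vector magnitude and compares direction only, which is why it is the usual default for text embeddings; Euclidean and Manhattan distances are sensitive to magnitude as well.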
Installation
You can install Dimensia via pip:

```
pip install dimensia
```
Usage
```python
from dimensia import Dimensia

# Initialize the database
db = Dimensia(db_path="dimensia_db")
print("Database initialized.")

# Set the embedding model (example: a transformer model for NLP tasks)
embedding_model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"
db.set_embedding_model(embedding_model_name)
print(f"Embedding model '{embedding_model_name}' set successfully.")

# Prepare documents to ingest
documents = [
    {"content": "The advancements in deep learning have revolutionized AI applications.", "metadata": {}},
    {"content": "Natural Language Processing models are increasingly effective in understanding text.", "metadata": {}},
    {"content": "Recent research shows that transformers outperform traditional neural networks in NLP.", "metadata": {}},
    {"content": "Machine learning models are being used in healthcare for predictive analysis.", "metadata": {}},
    {"content": "Reinforcement learning is transforming robotics and autonomous systems.", "metadata": {}},
]

# Define the collection name for research articles
collection_name = "research_articles"

# Check if the collection exists, and create it if not
if collection_name not in db.get_collections():
    db.create_collection(collection_name)
    print(f"Collection '{collection_name}' created successfully.")
else:
    print(f"Collection '{collection_name}' already exists.")

# Add documents (research articles) to the collection with metadata
db.add_documents(collection_name, documents)
print(f"Documents added to collection '{collection_name}' successfully.")

# Retrieve and print collection information (number of documents, vector size, etc.)
info = db.get_collection_info(collection_name)
print(f"Collection info for '{collection_name}': {info}")

# Get the structure of the collection (document count, vector size)
structure = db.get_structure(collection_name)
print(f"Structure for collection '{collection_name}': {structure}")

# Retrieve the vector size (embedding dimensionality) for the collection
vector_size = db.get_vector_size(collection_name)
print(f"Vector size for collection '{collection_name}': {vector_size}")

# Search for relevant research articles using a query
query = "How transformers are applied in NLP"
results = db.search(query, collection_name, top_k=3, metric="cosine")
print(f"Search results for query '{query}':")
for result in results:
    print(f"Score: {result['score']}, Document: {result['document']['content']}")

# Retrieve a specific document by ID (e.g., ID 1)
document_id = 1
document = db.get_document(collection_name, document_id)
print(f"Retrieved document with ID {document_id}: {document}")

# Get all documents in the collection, sorted by document ID
all_docs = db.get_all_docs(collection_name)
print(f"All documents in '{collection_name}':")
for doc in all_docs["documents"]:
    print(f"ID: {doc['id']}, Content: {doc['content']}, Metadata: {doc['metadata']}")
```
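Conceptually, the ingest-with-dedup and top-k search steps above can be pictured with a minimal in-memory store. This is a pure-Python sketch under stated assumptions, not Dimensia's actual implementation: the `ToyVectorStore` class, its SHA256-based dedup, and the toy scoring are illustrative only.

```python
import hashlib
import math

def cosine_similarity(a, b):
    # Dot product over the product of norms; 0.0 if either vector is all zeros
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class ToyVectorStore:
    """Minimal in-memory sketch: ingest with duplicate detection, then top-k cosine search."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # maps text -> list[float]
        self.docs = []             # stored documents with their vectors
        self.seen_hashes = set()   # content hashes used for dedup

    def add_documents(self, documents):
        for doc in documents:
            # Duplicate detection: skip any document whose content hash was already seen
            digest = hashlib.sha256(doc["content"].encode("utf-8")).hexdigest()
            if digest in self.seen_hashes:
                continue
            self.seen_hashes.add(digest)
            self.docs.append({
                "id": len(self.docs) + 1,
                "content": doc["content"],
                "vector": self.embed_fn(doc["content"]),
            })

    def search(self, query, top_k=3):
        # Embed the query, score every stored vector, return the top-k by similarity
        q = self.embed_fn(query)
        scored = [(cosine_similarity(q, d["vector"]), d) for d in self.docs]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [{"score": s, "document": d} for s, d in scored[:top_k]]
```

Even a toy bag-of-words embedding function is enough to exercise this sketch; a real deployment would plug in sentence-transformers vectors instead.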
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
We welcome contributions to improve Dimensia! Please fork the repository, make your changes, and submit a pull request.
For further details, refer to the GitHub repository.
Support
If you encounter any issues or have questions, please don't hesitate to open an issue on our GitHub issues page. We welcome feedback, bug reports, and feature requests!
We strive to respond as quickly as possible to all issues and questions.
Project details
File details
Details for the file dimensia-0.1.2.tar.gz.
File metadata
- Download URL: dimensia-0.1.2.tar.gz
- Upload date:
- Size: 8.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | d821b7c7c42ee1ed6cc88096ceed5f00dacaed868ccdeb402543c45036754cab
MD5 | 41d19e811a14f75b42b0eb45c41e7623
BLAKE2b-256 | a9ae86a6ef91edf6173b954f51b29b25174a29dbc6e714bdb02be91af1927ff2
File details
Details for the file dimensia-0.1.2-py3-none-any.whl.
File metadata
- Download URL: dimensia-0.1.2-py3-none-any.whl
- Upload date:
- Size: 7.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1cc4391cc46e90ed0bdc332f5c496073bad367da547ff1f3ccb2c6fd71ccd5c9
MD5 | ce235e93afc10aada0833e0f3e20489d
BLAKE2b-256 | 4ab58bd5b0f170c111087166f2a63e921d62a42ccfc5982a195d92632662b1df
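After downloading a file, its SHA256 digest can be checked locally against the value published above. A small helper (hypothetical, not part of Dimensia) might look like:

```python
import hashlib

def sha256_of_file(path, chunk_size=8192):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the result against the digest published on PyPI, e.g.:
# sha256_of_file("dimensia-0.1.2.tar.gz") should match the SHA256 row above
```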