Dimensia is a Python library for managing document embeddings and performing efficient similarity-based searches using various distance metrics. It supports creating collections, adding documents, and querying text data with customizable vector-based search.
Project description
Dimensia - Vector Database for Document Embeddings
Dimensia is a lightweight vector database designed for managing and querying documents using embeddings. It supports creating collections, adding documents with optional metadata, performing semantic searches, and retrieving detailed document information.
Features
- Embeddings: Generates document embeddings using the sentence-transformers library, providing vector representations of text.
- Collections: Allows you to create, manage, and query collections of documents.
- Search: Performs efficient searches for the top-k most relevant documents based on a query, using cosine, Euclidean, or Manhattan distance metrics.
- Concurrency: Uses thread pooling to handle document ingestion concurrently for efficient processing.
- Metadata: Supports storing metadata with documents for enriched queries.
- Duplicate Detection: Identifies and skips duplicate documents based on content hashes.
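The three distance metrics listed above each take a few lines of plain Python. The sketch below is illustrative only, not Dimensia's internal code; it assumes vectors are plain lists of floats:

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 means the vectors point the same way
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def euclidean_distance(a, b):
    # Straight-line (L2) distance between the two vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan_distance(a, b):
    # Sum of absolute coordinate differences (L1 distance)
    return sum(abs(x - y) for x, y in zip(a, b))
```

Cosine distance ignores vector magnitude and compares direction only, which is why it is the usual default for text embeddings; Euclidean and Manhattan distances are sensitive to magnitude as well.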
Installation
You can install Dimensia via pip:

```
pip install dimensia
```
Usage
```python
from dimensia import Dimensia

# Initialize the database
db = Dimensia(db_path="dimensia_db")
print("Database initialized.")

# Set the embedding model (example: a transformer model for NLP tasks)
embedding_model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"
db.set_embedding_model(embedding_model_name)
print(f"Embedding model '{embedding_model_name}' set successfully.")

# Prepare documents to ingest
documents = [
    {"content": "The advancements in deep learning have revolutionized AI applications.", "metadata": {}},
    {"content": "Natural Language Processing models are increasingly effective in understanding text.", "metadata": {}},
    {"content": "Recent research shows that transformers outperform traditional neural networks in NLP.", "metadata": {}},
    {"content": "Machine learning models are being used in healthcare for predictive analysis.", "metadata": {}},
    {"content": "Reinforcement learning is transforming robotics and autonomous systems.", "metadata": {}},
]

# Define the collection name for research articles
collection_name = "research_articles"

# Check if the collection exists, and create it if not
if collection_name not in db.get_collections():
    db.create_collection(collection_name)
    print(f"Collection '{collection_name}' created successfully.")
else:
    print(f"Collection '{collection_name}' already exists.")

# Add documents (research articles) to the collection with metadata
db.add_documents(collection_name, documents)
print(f"Documents added to collection '{collection_name}' successfully.")

# Retrieve and print collection information (number of documents, vector size, etc.)
info = db.get_collection_info(collection_name)
print(f"Collection info for '{collection_name}': {info}")

# Get the structure of the collection (document count, vector size)
structure = db.get_structure(collection_name)
print(f"Structure for collection '{collection_name}': {structure}")

# Retrieve the vector size (embedding dimensionality) for the collection
vector_size = db.get_vector_size(collection_name)
print(f"Vector size for collection '{collection_name}': {vector_size}")

# Search for relevant research articles using a query
query = "How transformers are applied in NLP"
results = db.search(query, collection_name, top_k=3, metric="cosine")
print(f"Search results for query '{query}':")
for result in results:
    print(f"Score: {result['score']}, Document: {result['document']['content']}")

# Retrieve a specific document by ID (e.g., ID 1)
document_id = 1
document = db.get_document(collection_name, document_id)
print(f"Retrieved document with ID {document_id}: {document}")

# Get all documents in the collection, sorted by document ID
all_docs = db.get_all_docs(collection_name)
print(f"All documents in '{collection_name}':")
for doc in all_docs["documents"]:
    print(f"ID: {doc['id']}, Content: {doc['content']}, Metadata: {doc['metadata']}")
```
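Conceptually, the ingest-with-dedup and top-k search steps above can be pictured with a minimal in-memory store. This is a pure-Python sketch under stated assumptions, not Dimensia's actual implementation: the `ToyVectorStore` class, its SHA256-based dedup, and the toy scoring are illustrative only.

```python
import hashlib
import math

def cosine_similarity(a, b):
    # Dot product over the product of norms; 0.0 if either vector is all zeros
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class ToyVectorStore:
    """Minimal in-memory sketch: ingest with duplicate detection, then top-k cosine search."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn   # maps text -> list[float]
        self.docs = []             # stored documents with their vectors
        self.seen_hashes = set()   # content hashes used for dedup

    def add_documents(self, documents):
        for doc in documents:
            # Duplicate detection: skip any document whose content hash was already seen
            digest = hashlib.sha256(doc["content"].encode("utf-8")).hexdigest()
            if digest in self.seen_hashes:
                continue
            self.seen_hashes.add(digest)
            self.docs.append({
                "id": len(self.docs) + 1,
                "content": doc["content"],
                "vector": self.embed_fn(doc["content"]),
            })

    def search(self, query, top_k=3):
        # Embed the query, score every stored vector, return the top-k by similarity
        q = self.embed_fn(query)
        scored = [(cosine_similarity(q, d["vector"]), d) for d in self.docs]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [{"score": s, "document": d} for s, d in scored[:top_k]]
```

Even a toy bag-of-words embedding function is enough to exercise this sketch; a real deployment would plug in sentence-transformers vectors instead.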
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contributing
We welcome contributions to improve Dimensia! Please fork the repository, make your changes, and submit a pull request.
For further details, refer to the GitHub repository.
Support
If you encounter any issues or have questions, please don't hesitate to open an issue on our GitHub issues page. We welcome feedback, bug reports, and feature requests!
We strive to respond as quickly as possible to all issues and questions.
Project details
File details
Details for the file dimensia-0.1.2.tar.gz.
File metadata
- Download URL: dimensia-0.1.2.tar.gz
- Upload date:
- Size: 8.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | d821b7c7c42ee1ed6cc88096ceed5f00dacaed868ccdeb402543c45036754cab
MD5 | 41d19e811a14f75b42b0eb45c41e7623
BLAKE2b-256 | a9ae86a6ef91edf6173b954f51b29b25174a29dbc6e714bdb02be91af1927ff2
File details
Details for the file dimensia-0.1.2-py3-none-any.whl.
File metadata
- Download URL: dimensia-0.1.2-py3-none-any.whl
- Upload date:
- Size: 7.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.6
File hashes
Algorithm | Hash digest
---|---
SHA256 | 1cc4391cc46e90ed0bdc332f5c496073bad367da547ff1f3ccb2c6fd71ccd5c9
MD5 | ce235e93afc10aada0833e0f3e20489d
BLAKE2b-256 | 4ab58bd5b0f170c111087166f2a63e921d62a42ccfc5982a195d92632662b1df
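After downloading a file, its SHA256 digest can be checked locally against the value published above. A small helper (hypothetical, not part of Dimensia) might look like:

```python
import hashlib

def sha256_of_file(path, chunk_size=8192):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Compare the result against the digest published on PyPI, e.g.:
# sha256_of_file("dimensia-0.1.2.tar.gz") should match the SHA256 row above
```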