
Dimensia is a Python library for managing document embeddings and performing efficient similarity search using a choice of distance metrics. It supports creating collections, adding documents, and querying long-form text with customizable vector-based search.

Project description

Dimensia - Vector Database for Document Embeddings

Dimensia is a lightweight vector database designed for managing and querying documents using embeddings. It supports creating collections, adding documents with optional metadata, performing semantic searches, and retrieving detailed document information.

Features

  • Embeddings: Generates document embeddings using the sentence-transformers library, providing vector representations of text (see the sketch after this list).
  • Collections: Allows you to create, manage, and query collections of documents.
  • Search: Performs efficient searches for the top-k most relevant documents based on a query, using cosine, Euclidean, or Manhattan distance metrics (also sketched after this list).
  • Concurrency: Uses a thread pool to ingest documents concurrently for faster processing.
  • Metadata: Supports storing metadata with documents for enriched queries.
  • Duplicate Detection: Identifies and skips duplicate documents based on content hashes.
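
The embedding step comes from the sentence-transformers library. As a point of reference, here is a minimal standalone sketch of that vectorization step, using the same model as the usage example below (the 384-dimension output is a property of that particular model, not of Dimensia):

from sentence_transformers import SentenceTransformer

# Load the same model used in the usage example below
model = SentenceTransformer("sentence-transformers/paraphrase-MiniLM-L6-v2")

# encode() maps each string to a dense vector (384 dimensions for this model)
vectors = model.encode(["Transformers are effective for NLP tasks."])
print(vectors.shape)  # (1, 384)

The three distance metrics rank neighbors differently. The sketch below shows how each score is conventionally computed with NumPy; it is independent of Dimensia's internals and only illustrates the math behind the metric names:

import numpy as np

def cosine_distance(a, b):
    # 1 minus cosine similarity: 0 for identical directions, 2 for opposite
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def euclidean_distance(a, b):
    # Straight-line (L2) distance between the two vectors
    return np.linalg.norm(a - b)

def manhattan_distance(a, b):
    # Sum of absolute coordinate differences (L1)
    return np.sum(np.abs(a - b))

Smaller distances mean closer matches under all three metrics; cosine ignores vector magnitude, which is usually the right choice for text embeddings.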

Installation

You can install Dimensia via pip:

pip install dimensia

Usage

from dimensia import Dimensia

# Initialize the database
db = Dimensia(db_path="dimensia_db")
print("Database initialized.")

# Set the embedding model (example: a transformer model for NLP tasks)
embedding_model_name = "sentence-transformers/paraphrase-MiniLM-L6-v2"
db.set_embedding_model(embedding_model_name)
print(f"Embedding model '{embedding_model_name}' set successfully.")

# Prepare the documents to ingest
documents = [
    {"content": "The advancements in deep learning have revolutionized AI applications.", "metadata": {}},
    {"content": "Natural Language Processing models are increasingly effective in understanding text.", "metadata": {}},
    {"content": "Recent research shows that transformers outperform traditional neural networks in NLP.", "metadata": {}},
    {"content": "Machine learning models are being used in healthcare for predictive analysis.", "metadata": {}},
    {"content": "Reinforcement learning is transforming robotics and autonomous systems.", "metadata": {}},
]

# Define collection name for research articles
collection_name = "research_articles"

# Check if collection exists, create it if not
if collection_name not in db.get_collections():
    db.create_collection(collection_name)
    print(f"Collection '{collection_name}' created successfully.")
else:
    print(f"Collection '{collection_name}' already exists.")

# Add the documents (research articles) to the collection; metadata is left empty here
db.add_documents(collection_name, documents)
print(f"Documents added to collection '{collection_name}' successfully.")

# Retrieve and print collection information (number of documents, vector size, etc.)
info = db.get_collection_info(collection_name)
print(f"Collection info for '{collection_name}': {info}")

# Get structure of the collection (document count, vector size)
structure = db.get_structure(collection_name)
print(f"Structure for collection '{collection_name}': {structure}")

# Retrieve vector size (size of embeddings) for the collection
vector_size = db.get_vector_size(collection_name)
print(f"Vector size for collection '{collection_name}': {vector_size}")

# Search for relevant research articles using a query
query = "How transformers are applied in NLP"
results = db.search(query, collection_name, top_k=3, metric="cosine")
print(f"Search results for query '{query}':")
for result in results:
    print(f"Score: {result['score']}, Document: {result['document']['content']}")

# Retrieve a specific document by ID (e.g., ID 1)
document_id = 1
document = db.get_document(collection_name, document_id)
print(f"Retrieved document with ID {document_id}: {document}")

# Get all documents in the collection, sorted by document ID
all_docs = db.get_all_docs(collection_name)
print(f"All documents in '{collection_name}':")
for doc in all_docs["documents"]:
    print(f"ID: {doc['id']}, Content: {doc['content']}, Metadata: {doc['metadata']}")

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contributing

We welcome contributions to improve Dimensia! Please fork the repository, make your changes, and submit a pull request.

For further details, refer to the GitHub repository.

Support

If you encounter any issues or have questions, please don't hesitate to open an issue on our GitHub issues page. We welcome feedback, bug reports, and feature requests!

We strive to respond as quickly as possible to all issues and questions.

Download files

Download the file for your platform.

Source Distribution

dimensia-0.1.2.tar.gz (8.1 kB)


Built Distribution

dimensia-0.1.2-py3-none-any.whl (7.8 kB)


File details

Details for the file dimensia-0.1.2.tar.gz.

File metadata

  • Download URL: dimensia-0.1.2.tar.gz
  • Size: 8.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for dimensia-0.1.2.tar.gz:

  • SHA256: d821b7c7c42ee1ed6cc88096ceed5f00dacaed868ccdeb402543c45036754cab
  • MD5: 41d19e811a14f75b42b0eb45c41e7623
  • BLAKE2b-256: a9ae86a6ef91edf6173b954f51b29b25174a29dbc6e714bdb02be91af1927ff2


File details

Details for the file dimensia-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: dimensia-0.1.2-py3-none-any.whl
  • Size: 7.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.1 CPython/3.12.6

File hashes

Hashes for dimensia-0.1.2-py3-none-any.whl:

  • SHA256: 1cc4391cc46e90ed0bdc332f5c496073bad367da547ff1f3ccb2c6fd71ccd5c9
  • MD5: ce235e93afc10aada0833e0f3e20489d
  • BLAKE2b-256: 4ab58bd5b0f170c111087166f2a63e921d62a42ccfc5982a195d92632662b1df

