
A vector database management module for ThothAI Project


Thoth Vector Database Manager v0.6.2

A high-performance vector database manager built on Haystack v2. It provides external embedding providers and centralized embedding management across four production-ready backends.

MCP Server Support

This project is configured with MCP (Model Context Protocol) servers for enhanced AI-assisted development:

  • Context7: Enhanced context management
  • Serena: IDE assistance and development support

See docs/MCP_SETUP.md for details.

Features

NEW in v0.6.0: External Embedding Providers

  • OpenAI, Cohere, Mistral: Support for major external embedding APIs
  • Cost-Effective: Pay-per-use model with intelligent caching
  • High-Quality Embeddings: State-of-the-art embedding models
  • Unified Management: Centralized ExternalEmbeddingManager

Core Features

  • Multi-backend support: Qdrant, Chroma, PostgreSQL pgvector, Milvus
  • Haystack v2 integration: Uses haystack-ai v2.12.0+ as an abstraction layer
  • Centralized embeddings: No more client-side embedding management
  • Memory optimization: Intelligent caching and lazy loading
  • API compatibility: Backward compatible with existing APIs
  • Type safety: Full type hints and Pydantic validation
  • Production-ready: Comprehensive testing and robust error handling

Installation

Recommended: uv Package Manager

This project uses uv for fast, reliable Python package management. Install uv first:

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

No Dependency Conflicts

Since version 0.4.0, all four supported databases can be installed together without dependency conflicts:

All Databases (Recommended)

# Install all supported backends (Qdrant, Chroma, PgVector, Milvus)
# Quote the extras so that shells such as zsh do not expand the brackets
uv add "thoth-vdbmanager[all]"

Individual Backends

# Individual backend installation
uv add "thoth-vdbmanager[qdrant]"    # Qdrant support
uv add "thoth-vdbmanager[chroma]"    # Chroma support
uv add "thoth-vdbmanager[pgvector]"  # PostgreSQL pgvector support
uv add "thoth-vdbmanager[milvus]"    # Milvus support

Development Installation

# For development with all backends and testing tools
uv add "thoth-vdbmanager[all,test,dev]"

pip Installation (Also Supported)

If you prefer pip, replace uv add with pip install in any of the commands above:

# Example with pip
pip install "thoth-vdbmanager[all]"

Breaking Changes in v0.4.0

  • Removed: Weaviate and Pinecone support (no longer maintained)
  • Updated: Now requires haystack-ai v2.12.0+ (not compatible with legacy haystack)
  • Improved: All remaining databases work together without conflicts

Architecture

The library is built on a clean architecture with:

  • Core: Base interfaces and document types
  • Adapters: Backend-specific implementations using Haystack
  • Factory: Unified creation interface
  • Compatibility: Legacy API support

External Embedding Providers (NEW in v0.6.0)

Setup External Embeddings

Configure your external embedding provider using environment variables:

# OpenAI (recommended)
export EMBEDDING_PROVIDER=openai
export EMBEDDING_API_KEY=sk-your-openai-key
export EMBEDDING_MODEL=text-embedding-3-small

# Cohere
export EMBEDDING_PROVIDER=cohere  
export EMBEDDING_API_KEY=your-cohere-key
export EMBEDDING_MODEL=embed-multilingual-v3.0

# Mistral
export EMBEDDING_PROVIDER=mistral
export EMBEDDING_API_KEY=your-mistral-key
export EMBEDDING_MODEL=mistral-embed
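
Before creating a store, it can help to confirm that all three variables are set. The helper below is a minimal sketch of such a check; load_embedding_config is a hypothetical name, not part of the library, and the returned keys simply mirror the provider, api_key and model settings used elsewhere in this README.

```python
import os

# Providers named in this README; the helper itself is hypothetical.
SUPPORTED_PROVIDERS = {"openai", "cohere", "mistral"}
REQUIRED_VARS = ("EMBEDDING_PROVIDER", "EMBEDDING_API_KEY", "EMBEDDING_MODEL")


def load_embedding_config(env=None) -> dict:
    """Read the three EMBEDDING_* variables and fail early if one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    provider = env["EMBEDDING_PROVIDER"].strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unsupported embedding provider: {provider!r}")
    return {
        "provider": provider,
        "api_key": env["EMBEDDING_API_KEY"],
        "model": env["EMBEDDING_MODEL"],
    }


# A plain dict stands in for os.environ so the example runs anywhere.
config = load_embedding_config(
    {
        "EMBEDDING_PROVIDER": "openai",
        "EMBEDDING_API_KEY": "sk-your-openai-key",
        "EMBEDDING_MODEL": "text-embedding-3-small",
    }
)
```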

Using External Embeddings

from thoth_vdbmanager import ExternalVectorStoreFactory, ColumnNameDocument

# Create store with external embeddings
store = ExternalVectorStoreFactory.create_from_env(
    backend="qdrant",
    collection="my_collection",
    host="localhost",
    port=6333
)

# Add document - embeddings generated via API
doc = ColumnNameDocument(
    table_name="users",
    column_name="email",
    column_description="User email address",
    value_description="Valid email format"
)
store.add_column_description(doc)

# Search - query embeddings generated via API
results = store.search_similar(
    query="user email address",
    doc_type="column_name", 
    top_k=5
)

Available External Providers

Provider | Models                                         | Dimensions                 | Features
OpenAI   | text-embedding-3-small, text-embedding-3-large | 1536 (small), 3072 (large) | High quality, multilingual
Cohere   | embed-multilingual-v3.0, embed-english-v3.0    | 1024                       | Optimized for search
Mistral  | mistral-embed                                  | 1024                       | European provider
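
Because vector dimensions differ between models, a collection created for one model generally cannot store vectors from another. The sketch below encodes the table as a lookup and checks a collection's dimension against it; check_dimension is a hypothetical helper, not a library function.

```python
# Dimensions copied from the provider table; the helper is hypothetical.
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "embed-multilingual-v3.0": 1024,
    "embed-english-v3.0": 1024,
    "mistral-embed": 1024,
}


def check_dimension(model: str, collection_dim: int) -> None:
    """Raise if a collection's vector size does not match the model's output."""
    expected = EMBEDDING_DIMENSIONS.get(model)
    if expected is None:
        raise KeyError(f"Unknown embedding model: {model!r}")
    if expected != collection_dim:
        raise ValueError(
            f"{model} produces {expected}-dim vectors, "
            f"but the collection expects {collection_dim}"
        )


check_dimension("text-embedding-3-small", 1536)  # passes silently
```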

Cost Optimization with Caching

# Enable intelligent caching to reduce API calls
embedding_config = {
    'provider': 'openai',
    'api_key': 'sk-your-key',
    'model': 'text-embedding-3-small',
    'enable_cache': True,    # Enable caching
    'cache_size': 10000      # Cache up to 10k embeddings
}

store = ExternalVectorStoreFactory.create(
    backend="qdrant",
    embedding_config=embedding_config,
    collection="cached_collection",
    host="localhost",
    port=6333
)
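
To see why caching lowers cost, consider the toy example below. It is an illustrative LRU cache placed in front of a stand-in embedding function, assuming nothing about how the library implements enable_cache or cache_size internally; CachedEmbedder and fake_embed are hypothetical names.

```python
from collections import OrderedDict


class CachedEmbedder:
    """LRU cache in front of an embedding function (illustrative only)."""

    def __init__(self, embed_fn, cache_size: int = 10000) -> None:
        self._embed_fn = embed_fn
        self._cache = OrderedDict()
        self._cache_size = cache_size
        self.api_calls = 0  # counts how often the underlying function ran

    def embed(self, text: str) -> list:
        if text in self._cache:
            self._cache.move_to_end(text)  # mark as most recently used
            return self._cache[text]
        self.api_calls += 1
        vector = self._embed_fn(text)
        self._cache[text] = vector
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict the least recently used entry
        return vector


def fake_embed(text: str) -> list:
    """Stand-in for a paid API call; returns a deterministic toy vector."""
    return [float(len(text)), float(sum(map(ord, text)) % 97)]


embedder = CachedEmbedder(fake_embed, cache_size=2)
for query in ["user email", "user email", "order total", "user email"]:
    embedder.embed(query)

print(embedder.api_calls)  # 2: only the two distinct texts reached the "API"
```

Repeated queries are common in schema-search workloads, so even a modest cache can remove a large share of billable calls.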

Quick Start

External Embedding API (Recommended)

import os
from thoth_vdbmanager import ExternalVectorStoreFactory, ColumnNameDocument, SqlDocument, EvidenceDocument

# Set up external embedding provider
os.environ['EMBEDDING_PROVIDER'] = 'openai'
os.environ['EMBEDDING_API_KEY'] = 'sk-your-openai-key'
os.environ['EMBEDDING_MODEL'] = 'text-embedding-3-small'

# Create a vector store with external embeddings
store = ExternalVectorStoreFactory.create_from_env(
    backend="qdrant",
    collection="my_collection",
    host="localhost",
    port=6333
)

# Add documents
column_doc = ColumnNameDocument(
    table_name="users",
    column_name="email",
    original_column_name="user_email",
    column_description="User email address",
    value_description="Valid email format"
)

doc_id = store.add_column_description(column_doc)

# Search documents using external API embeddings
results = store.search_similar(
    query="user email",
    doc_type="column_name",
    top_k=5
)

Available Classes

from thoth_vdbmanager import (
    VectorStoreFactory,      # Main factory for creating stores
    ColumnNameDocument,      # Column metadata documents
    SqlDocument,            # SQL example documents
    EvidenceDocument,       # Evidence/hint documents
    ThothType,              # Document type enumeration
    VectorStoreInterface    # Base interface for all stores
)

Configuration

Qdrant

store = VectorStoreFactory.create(
    backend="qdrant",
    collection="my_collection",
    host="localhost",
    port=6333,
    api_key="your-api-key",  # Optional
    embedding_dim=384,  # Optional
    hnsw_config={"m": 16, "ef_construct": 100}
)

Chroma (Multiple Modes)

Memory Mode (Recommended for Testing):

store = VectorStoreFactory.create(
    backend="chroma",
    collection="my_collection",
    mode="memory"  # Fast, isolated, no persistence
)

Filesystem Mode:

store = VectorStoreFactory.create(
    backend="chroma",
    collection="my_collection",
    mode="filesystem",
    persist_path="./chroma_db"
)

Server Mode (Production):

store = VectorStoreFactory.create(
    backend="chroma",
    collection="my_collection",
    mode="server",
    host="localhost",
    port=8000
)

See Chroma Configuration Guide for detailed setup instructions

PostgreSQL pgvector

store = VectorStoreFactory.create(
    backend="pgvector",
    collection="my_table",
    connection_string="postgresql://user:pass@localhost:5432/dbname"
)

Milvus (Multiple Modes)

Lite Mode (Recommended for Testing):

store = VectorStoreFactory.create(
    backend="milvus",
    collection="my_collection",
    mode="lite",
    connection_uri="./milvus.db"  # File-based storage
)

Server Mode (Production):

store = VectorStoreFactory.create(
    backend="milvus",
    collection="my_collection",
    mode="server",
    host="localhost",
    port=19530
)

See Milvus Configuration Guide for detailed setup instructions

Performance Optimizations

Memory Usage

  • Lazy initialization: Embedders and connections are initialized on first use
  • Singleton pattern: Same configuration reuses existing instances
  • Batch processing: Efficient bulk operations
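
The three techniques listed above can be sketched in a few lines. This is only an illustration of the general patterns, under the assumption that instances are keyed by their configuration; LazyClient and chunked are hypothetical names, not the library's internals.

```python
import threading


class LazyClient:
    """Defers an expensive connection until the first real use (illustrative)."""

    _instances = {}
    _lock = threading.Lock()

    def __init__(self, host: str, port: int) -> None:
        self.host, self.port = host, port
        self._connection = None  # nothing is opened yet

    @classmethod
    def get(cls, host: str, port: int) -> "LazyClient":
        """Singleton per configuration: the same (host, port) returns the same object."""
        key = (host, port)
        with cls._lock:
            if key not in cls._instances:
                cls._instances[key] = cls(host, port)
            return cls._instances[key]

    @property
    def connection(self) -> str:
        if self._connection is None:  # first access triggers initialization
            self._connection = f"connected to {self.host}:{self.port}"
        return self._connection


def chunked(items: list, batch_size: int):
    """Yield fixed-size slices so bulk inserts need fewer round trips."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


a = LazyClient.get("localhost", 6333)
b = LazyClient.get("localhost", 6333)
```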

Performance Tuning

# Optimize for specific use cases
store = VectorStoreFactory.create(
    backend="qdrant",
    collection="optimized",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # 384-dim, fast
    hnsw_config={"m": 32, "ef_construct": 200}  # Better search quality
)

Testing

# Run all tests
pytest

# Run specific backend tests
pytest tests/test_qdrant.py -v

# Run with coverage
pytest --cov=vdbmanager tests/

Migration Guide

From v0.3.x to v0.4.0

Breaking Changes

  • Removed databases: Weaviate and Pinecone are no longer supported
  • Haystack version: Now requires haystack-ai v2.12.0+ (not compatible with legacy haystack)
  • Dependencies: All remaining databases can now be installed together without conflicts

Migration Steps

1. Update installation:

# Old installation (v0.3.x)
pip install "thoth-vdbmanager[all-safe]"  # Avoided conflicts

# New installation (v0.4.0)
pip install "thoth-vdbmanager[all]"  # No conflicts!

2. Update code (if using removed databases):

# If you were using Weaviate - migrate to Qdrant or Chroma
# Old code (v0.3.x)
store = VectorStoreFactory.create(
    backend="weaviate",  # No longer supported
    collection="MyCollection",
    url="http://localhost:8080"
)

# New code (v0.4.0) - migrate to similar database
store = VectorStoreFactory.create(
    backend="qdrant",  # Recommended alternative
    collection="my_collection",
    host="localhost",
    port=6333
)

3. Existing supported databases work unchanged:

# This code works exactly the same in v0.4.0
store = VectorStoreFactory.create(
    backend="qdrant",  # Still supported
    collection="my_docs",
    host="localhost",
    port=6333
)

API Reference

Core Classes

VectorStoreFactory

# Create store
store = VectorStoreFactory.create(backend, collection, **kwargs)

# From config
config = {"backend": "qdrant", "params": {...}}
store = VectorStoreFactory.from_config(config)

# List backends
backends = VectorStoreFactory.list_backends()

Document Types

  • ColumnNameDocument: Column metadata
  • SqlDocument: SQL examples
  • EvidenceDocument: General evidence/hints

Methods

  • add_column_description(doc): Add column metadata
  • add_sql(doc): Add SQL example
  • add_evidence(doc): Add evidence/hint
  • search_similar(query, doc_type, top_k=5, score_threshold=0.7): Semantic search
  • get_document(doc_id): Retrieve by ID
  • bulk_add_documents(docs): Batch insert
  • get_collection_info(): Get stats
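
The top_k and score_threshold parameters of search_similar interact: weak matches are filtered out first, then at most top_k results are returned. The sketch below shows that behaviour on toy vectors. It assumes cosine similarity and an inclusive threshold, neither of which is confirmed by this README, and search_similar_sketch is a hypothetical stand-in rather than the library method.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search_similar_sketch(query_vec, docs, top_k=5, score_threshold=0.7):
    """Rank docs by cosine similarity, drop weak matches, keep the best top_k."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in docs.items()]
    kept = [(score, doc_id) for score, doc_id in scored if score >= score_threshold]
    kept.sort(reverse=True)
    return [doc_id for _, doc_id in kept[:top_k]]


# Toy two-dimensional "embeddings" keyed by table.column
docs = {
    "users.email": [1.0, 0.0],
    "users.contact": [0.9, 0.1],
    "orders.total": [0.0, 1.0],
}
hits = search_similar_sketch([1.0, 0.0], docs, top_k=2)
```

Here orders.total is orthogonal to the query and falls below the threshold, so it is excluded even though top_k alone would not have been the limiting factor with a larger value.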

Troubleshooting

Common Issues

Connection Errors

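# Check that the Qdrant service is reachable on its default REST port
import requests
response = requests.get("http://localhost:6333", timeout=5)
print(response.status_code)  # 200 means the service answered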
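
For backends that do not expose an HTTP endpoint on their main port, a plain TCP check using only the standard library can serve the same purpose. The sketch below assumes the default ports that appear in the configuration examples of this README; is_port_open is a hypothetical helper.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Default ports taken from the configuration examples in this README.
DEFAULT_PORTS = {"qdrant": 6333, "chroma": 8000, "pgvector": 5432, "milvus": 19530}

for name, port in DEFAULT_PORTS.items():
    reachable = is_port_open("localhost", port, timeout=0.5)
    print(f"{name}: {'reachable' if reachable else 'unreachable'}")
```

An open port only shows that something is listening; it does not prove that the service behind it is healthy.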

Memory Issues

# Use smaller embedding model
store = VectorStoreFactory.create(
    backend="qdrant",
    collection="my_collection",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2"  # 384-dim
)

Performance Issues

# Tune HNSW parameters
store = VectorStoreFactory.create(
    backend="qdrant",
    collection="my_collection",
    hnsw_config={"m": 16, "ef_construct": 100}
)

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new functionality
  4. Ensure all tests pass
  5. Submit a pull request

License

MIT License - see LICENSE file for details.

Directory Structure

thoth_vdbmanager/
├── vdbmanager/
│   ├── core/                    # Base interfaces and document types
│   │   ├── base.py             # Core document classes and interfaces
│   │   └── __init__.py
│   ├── adapters/               # Backend-specific implementations
│   │   ├── haystack_adapter.py # Base Haystack adapter
│   │   ├── qdrant_adapter.py   # Qdrant implementation
│   │   ├── chroma_adapter.py   # Chroma implementation
│   │   ├── pgvector_adapter.py # PostgreSQL pgvector
│   │   └── milvus_adapter.py   # Milvus implementation
│   ├── factory.py              # Unified creation interface
│   └── __init__.py            # Public API exports
├── test_e2e_vectordb/          # End-to-end tests
├── pyproject.toml              # Project configuration
└── README.md                   # This file

Quick API Reference

Main API

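from thoth_vdbmanager import VectorStoreFactory, ColumnNameDocument

# Create any backend
store = VectorStoreFactory.create(
    backend="qdrant",
    collection="my_docs",
    host="localhost",
    port=6333
)

# Build a document to store (same fields as in the Quick Start example)
column_doc = ColumnNameDocument(
    table_name="users",
    column_name="email",
    original_column_name="user_email",
    column_description="User email address",
    value_description="Valid email format"
)

# Use the methods
doc_id = store.add_column_description(column_doc)
results = store.search_similar("user email", "column_name")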

Ready to use with Haystack v2 and 4 production-ready vector databases!

