A vector database management module for ThothAI Project
Thoth Vector Database Manager v0.6.2
A high-performance, Haystack v2-based vector database manager with external embedding providers and centralized embedding management for 4 production-ready backends.
MCP Server Support
This project is configured with MCP (Model Context Protocol) servers for enhanced AI-assisted development:
- Context7: Enhanced context management
- Serena: IDE assistance and development support
See docs/MCP_SETUP.md for details.
Features
New in v0.6.0: External Embedding Providers
- OpenAI, Cohere, Mistral: Support for major external embedding APIs
- Cost-Effective: Pay-per-use model with intelligent caching
- High-Quality Embeddings: State-of-the-art embedding models
- Unified Management: Centralized ExternalEmbeddingManager
Core Features
- Multi-backend support: Qdrant, Chroma, PostgreSQL pgvector, Milvus
- Haystack v2 integration: Uses haystack-ai v2.12.0+ as an abstraction layer
- Centralized embeddings: No more client-side embedding management
- Memory optimization: Intelligent caching and lazy loading
- API compatibility: Backward compatible with existing APIs
- Type safety: Full type hints and Pydantic validation
- Production-ready: Comprehensive testing and robust error handling
Installation
Recommended: uv Package Manager
This project uses uv for fast, reliable Python package management. Install uv first:
# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh
No Dependency Conflicts
Version 0.4.0 resolves all dependency conflicts! All 4 supported databases can now be installed together:
All Databases (Recommended)
# Install all supported backends (Qdrant, Chroma, PgVector, Milvus)
uv add "thoth-vdbmanager[all]"
Individual Backends
# Individual backend installation
uv add "thoth-vdbmanager[qdrant]"    # Qdrant support
uv add "thoth-vdbmanager[chroma]"    # Chroma support
uv add "thoth-vdbmanager[pgvector]"  # PostgreSQL pgvector support
uv add "thoth-vdbmanager[milvus]"    # Milvus support
Development Installation
# For development with all backends and testing tools
uv add "thoth-vdbmanager[all,test,dev]"
pip Installation (Also Supported)
If you prefer pip, all commands work by replacing uv add with pip install:
# Example with pip
pip install "thoth-vdbmanager[all]"
Breaking Changes in v0.4.0
- Removed: Weaviate and Pinecone support (no longer maintained)
- Updated: Now requires haystack-ai v2.12.0+ (not compatible with legacy haystack)
- Improved: All remaining databases work together without conflicts
Architecture
The library is built on a clean architecture with:
- Core: Base interfaces and document types
- Adapters: Backend-specific implementations using Haystack
- Factory: Unified creation interface
- Compatibility: Legacy API support
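The layering above (a shared interface, per-backend adapters, and a factory that picks one by name) can be sketched in a few lines. This is an illustration only: `StoreInterface`, `InMemoryAdapter`, and `StoreFactory` are hypothetical names and do not reproduce the classes that ship in thoth_vdbmanager.

```python
from abc import ABC, abstractmethod


class StoreInterface(ABC):
    """Core layer: the contract every backend adapter must satisfy."""

    @abstractmethod
    def add(self, doc_id, text):
        ...

    @abstractmethod
    def get(self, doc_id):
        ...


class InMemoryAdapter(StoreInterface):
    """Adapter layer: a toy backend that keeps documents in a dict."""

    def __init__(self, collection, **kwargs):
        self.collection = collection
        self.options = kwargs  # backend-specific settings such as host or port
        self._docs = {}

    def add(self, doc_id, text):
        self._docs[doc_id] = text

    def get(self, doc_id):
        return self._docs.get(doc_id)


class StoreFactory:
    """Factory layer: maps a backend name to its adapter class."""

    _adapters = {}

    @classmethod
    def register(cls, backend, adapter_cls):
        cls._adapters[backend] = adapter_cls

    @classmethod
    def create(cls, backend, collection, **kwargs):
        try:
            adapter_cls = cls._adapters[backend]
        except KeyError:
            raise ValueError(f"Unknown backend: {backend!r}") from None
        return adapter_cls(collection, **kwargs)

    @classmethod
    def list_backends(cls):
        return sorted(cls._adapters)


StoreFactory.register("memory", InMemoryAdapter)
```

Callers only ever name a backend string, so adding a backend means registering one more adapter without touching calling code.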
External Embedding Providers (New in v0.6.0)
Setup External Embeddings
Configure your external embedding provider using environment variables:
# OpenAI (recommended)
export EMBEDDING_PROVIDER=openai
export EMBEDDING_API_KEY=sk-your-openai-key
export EMBEDDING_MODEL=text-embedding-3-small
# Cohere
export EMBEDDING_PROVIDER=cohere
export EMBEDDING_API_KEY=your-cohere-key
export EMBEDDING_MODEL=embed-multilingual-v3.0
# Mistral
export EMBEDDING_PROVIDER=mistral
export EMBEDDING_API_KEY=your-mistral-key
export EMBEDDING_MODEL=mistral-embed
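Only one provider should be exported at a time, since all three examples set the same variables. A small pre-flight check can catch a missing or misspelled variable before any API call is made. The sketch below uses only the standard library; `load_embedding_env` and `EmbeddingEnv` are hypothetical helpers for illustration, not part of thoth_vdbmanager.

```python
import os
from dataclasses import dataclass

# Provider names taken from the examples above.
SUPPORTED_PROVIDERS = {"openai", "cohere", "mistral"}
REQUIRED_VARS = ("EMBEDDING_PROVIDER", "EMBEDDING_API_KEY", "EMBEDDING_MODEL")


@dataclass(frozen=True)
class EmbeddingEnv:
    provider: str
    api_key: str
    model: str


def load_embedding_env(environ=None):
    """Read the three EMBEDDING_* variables and fail early if one is missing.

    Hypothetical helper; pass a dict as `environ` to check values in tests.
    """
    env = os.environ if environ is None else environ
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    provider = env["EMBEDDING_PROVIDER"].strip().lower()
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"Unsupported EMBEDDING_PROVIDER: {provider!r}")
    return EmbeddingEnv(provider, env["EMBEDDING_API_KEY"], env["EMBEDDING_MODEL"])
```

Calling `load_embedding_env()` with no argument reads the real process environment.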
Using External Embeddings
import os
from thoth_vdbmanager import ExternalVectorStoreFactory, ColumnNameDocument
# Create store with external embeddings
store = ExternalVectorStoreFactory.create_from_env(
    backend="qdrant",
    collection="my_collection",
    host="localhost",
    port=6333
)
# Add document - embeddings generated via API
doc = ColumnNameDocument(
    table_name="users",
    column_name="email",
    column_description="User email address",
    value_description="Valid email format"
)
store.add_column_description(doc)
# Search - query embeddings generated via API
results = store.search_similar(
    query="user email address",
    doc_type="column_name",
    top_k=5
)
Available External Providers
| Provider | Models | Dimensions | Features |
|---|---|---|---|
| OpenAI | text-embedding-3-small, text-embedding-3-large | 1536, 3072 | High quality, multilingual |
| Cohere | embed-multilingual-v3.0, embed-english-v3.0 | 1024 | Optimized for search |
| Mistral | mistral-embed | 1024 | European provider |
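Because each model produces vectors of a fixed size, a lookup table built from the rows above makes it easy to verify a vector before it is written to a collection. This is a sketch: `EMBEDDING_DIMENSIONS`, `expected_dimension`, and `check_vector` are illustrative names, not library API.

```python
# Dimensions copied from the table above (OpenAI sizes listed in model order).
EMBEDDING_DIMENSIONS = {
    "text-embedding-3-small": 1536,
    "text-embedding-3-large": 3072,
    "embed-multilingual-v3.0": 1024,
    "embed-english-v3.0": 1024,
    "mistral-embed": 1024,
}


def expected_dimension(model):
    """Return the vector size for a model listed in the table."""
    try:
        return EMBEDDING_DIMENSIONS[model]
    except KeyError:
        raise ValueError(f"Unknown embedding model: {model!r}") from None


def check_vector(model, vector):
    """Raise ValueError if the vector length does not match the model."""
    expected = expected_dimension(model)
    if len(vector) != expected:
        raise ValueError(
            f"{model} produces {expected}-dim vectors, got {len(vector)}"
        )
```

A check like this is most useful when switching models on an existing collection, where the stored vector size and the new model's output size may differ.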
Cost Optimization with Caching
# Enable intelligent caching to reduce API calls
embedding_config = {
    'provider': 'openai',
    'api_key': 'sk-your-key',
    'model': 'text-embedding-3-small',
    'enable_cache': True,  # Enable caching
    'cache_size': 10000    # Cache up to 10k embeddings
}
store = ExternalVectorStoreFactory.create(
    backend="qdrant",
    embedding_config=embedding_config,
    collection="cached_collection",
    host="localhost",
    port=6333
)
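To see why a bounded cache lowers API cost, consider a minimal least-recently-used cache wrapped around an embedding function. The eviction policy the library actually applies is not documented here, so LRU is an assumption, and `EmbeddingCache` and `embed_fn` are hypothetical names used only for this sketch.

```python
from collections import OrderedDict


class EmbeddingCache:
    """Bounded LRU cache in front of an embedding function (illustrative)."""

    def __init__(self, embed_fn, cache_size=10000):
        self._embed_fn = embed_fn      # callable: text -> vector (the paid API call)
        self._cache_size = cache_size  # maximum number of cached texts
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def embed(self, text):
        if text in self._entries:
            # Mark the entry as most recently used and skip the API call.
            self._entries.move_to_end(text)
            self.hits += 1
            return self._entries[text]
        self.misses += 1
        vector = self._embed_fn(text)
        self._entries[text] = vector
        if len(self._entries) > self._cache_size:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return vector
```

Every hit is one API request saved, so repeated queries such as common column descriptions are billed only once while they remain in the cache.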
Quick Start
External Embedding API (Recommended)
import os
from thoth_vdbmanager import ExternalVectorStoreFactory, ColumnNameDocument, SqlDocument, EvidenceDocument
# Set up external embedding provider
os.environ['EMBEDDING_PROVIDER'] = 'openai'
os.environ['EMBEDDING_API_KEY'] = 'sk-your-openai-key'
os.environ['EMBEDDING_MODEL'] = 'text-embedding-3-small'
# Create a vector store with external embeddings
store = ExternalVectorStoreFactory.create_from_env(
    backend="qdrant",
    collection="my_collection",
    host="localhost",
    port=6333
)
# Add documents
column_doc = ColumnNameDocument(
    table_name="users",
    column_name="email",
    original_column_name="user_email",
    column_description="User email address",
    value_description="Valid email format"
)
doc_id = store.add_column_description(column_doc)
# Search documents using external API embeddings
results = store.search_similar(
    query="user email",
    doc_type="column_name",
    top_k=5
)
Available Classes
from thoth_vdbmanager import (
    VectorStoreFactory,    # Main factory for creating stores
    ColumnNameDocument,    # Column metadata documents
    SqlDocument,           # SQL example documents
    EvidenceDocument,      # Evidence/hint documents
    ThothType,             # Document type enumeration
    VectorStoreInterface   # Base interface for all stores
)
Configuration
Qdrant
store = VectorStoreFactory.create(
    backend="qdrant",
    collection="my_collection",
    host="localhost",
    port=6333,
    api_key="your-api-key",  # Optional
    embedding_dim=384,       # Optional
    hnsw_config={"m": 16, "ef_construct": 100}
)
Chroma (Multiple Modes)
Memory Mode (Recommended for Testing):
store = VectorStoreFactory.create(
    backend="chroma",
    collection="my_collection",
    mode="memory"  # Fast, isolated, no persistence
)
Filesystem Mode:
store = VectorStoreFactory.create(
    backend="chroma",
    collection="my_collection",
    mode="filesystem",
    persist_path="./chroma_db"
)
Server Mode (Production):
store = VectorStoreFactory.create(
    backend="chroma",
    collection="my_collection",
    mode="server",
    host="localhost",
    port=8000
)
See the Chroma Configuration Guide for detailed setup instructions.
PostgreSQL pgvector
store = VectorStoreFactory.create(
    backend="pgvector",
    collection="my_table",
    connection_string="postgresql://user:pass@localhost:5432/dbname"
)
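A malformed connection string tends to surface only as a driver error at connect time. It can be checked up front with the standard library, as in the sketch below. `describe_pg_dsn` is a hypothetical helper, not part of thoth_vdbmanager, and it deliberately leaves the password out of its result so the output is safe to log.

```python
from urllib.parse import urlsplit


def describe_pg_dsn(dsn):
    """Split a postgresql:// URL into its parts and validate the basics."""
    parts = urlsplit(dsn)
    if parts.scheme not in ("postgresql", "postgres"):
        raise ValueError(f"Unexpected scheme: {parts.scheme!r}")
    database = parts.path.lstrip("/")
    if not parts.hostname or not database:
        raise ValueError("Connection string needs a host and a database name")
    return {
        "user": parts.username,
        "host": parts.hostname,
        "port": parts.port or 5432,  # 5432 is the PostgreSQL default port
        "database": database,
    }
```

For the connection string shown above, this yields the user `user`, host `localhost`, port `5432`, and database `dbname`.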
Milvus (Multiple Modes)
Lite Mode (Recommended for Testing):
store = VectorStoreFactory.create(
    backend="milvus",
    collection="my_collection",
    mode="lite",
    connection_uri="./milvus.db"  # File-based storage
)
Server Mode (Production):
store = VectorStoreFactory.create(
    backend="milvus",
    collection="my_collection",
    mode="server",
    host="localhost",
    port=19530
)
See the Milvus Configuration Guide for detailed setup instructions.
Performance Optimizations
Memory Usage
- Lazy initialization: Embedders and connections are initialized on first use
- Singleton pattern: Same configuration reuses existing instances
- Batch processing: Efficient bulk operations
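The first two points, lazy initialization and reuse of instances for identical configuration, can be illustrated with a short standard-library sketch. `LazyResource` and `get_shared` are hypothetical names and the library's internals may differ. The sketch assumes all configuration values are hashable and that a single loader is used per registry.

```python
import threading


class LazyResource:
    """Defers an expensive load (embedder, connection) until first use."""

    def __init__(self, loader):
        self._loader = loader
        self._value = None
        self._loaded = False
        self._lock = threading.Lock()

    def get(self):
        if not self._loaded:
            with self._lock:
                # Re-check inside the lock so only one thread runs the loader.
                if not self._loaded:
                    self._value = self._loader()
                    self._loaded = True
        return self._value


_instances = {}
_instances_lock = threading.Lock()


def get_shared(loader, **config):
    """Return one LazyResource per distinct configuration."""
    # Sorting makes the key independent of keyword-argument order.
    key = tuple(sorted(config.items()))
    with _instances_lock:
        if key not in _instances:
            _instances[key] = LazyResource(lambda: loader(**config))
        return _instances[key]
```

Two calls with the same keyword arguments return the same wrapper, and the loader runs at most once, on the first `get()`.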
Performance Tuning
# Optimize for specific use cases
store = VectorStoreFactory.create(
    backend="qdrant",
    collection="optimized",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2",  # 384-dim, fast
    hnsw_config={"m": 32, "ef_construct": 200}  # Better search quality
)
Testing
# Run all tests
pytest
# Run specific backend tests
pytest tests/test_qdrant.py -v
# Run with coverage
pytest --cov=vdbmanager tests/
Migration Guide
From v0.3.x to v0.4.0
Breaking Changes
- Removed databases: Weaviate and Pinecone are no longer supported
- Haystack version: Now requires haystack-ai v2.12.0+ (not compatible with legacy haystack)
- Dependencies: All remaining databases can now be installed together without conflicts
Migration Steps
1. Update installation:
# Old installation (v0.3.x)
pip install "thoth-vdbmanager[all-safe]"  # Avoided conflicts
# New installation (v0.4.0)
pip install "thoth-vdbmanager[all]"  # No conflicts!
2. Update code (if using removed databases):
# If you were using Weaviate - migrate to Qdrant or Chroma
# Old code (v0.3.x)
store = VectorStoreFactory.create(
    backend="weaviate",  # No longer supported
    collection="MyCollection",
    url="http://localhost:8080"
)
# New code (v0.4.0) - migrate to similar database
store = VectorStoreFactory.create(
    backend="qdrant",  # Recommended alternative
    collection="my_collection",
    host="localhost",
    port=6333
)
3. Existing supported databases work unchanged:
# This code works exactly the same in v0.4.0
store = VectorStoreFactory.create(
    backend="qdrant",  # Still supported
    collection="my_docs",
    host="localhost",
    port=6333
)
API Reference
Core Classes
VectorStoreFactory
# Create store
store = VectorStoreFactory.create(backend, collection, **kwargs)
# From config
config = {"backend": "qdrant", "params": {...}}
store = VectorStoreFactory.from_config(config)
# List backends
backends = VectorStoreFactory.list_backends()
Document Types
- ColumnNameDocument: Column metadata
- SqlDocument: SQL examples
- EvidenceDocument: General evidence/hints
Methods
- add_column_description(doc): Add column metadata
- add_sql(doc): Add SQL example
- add_evidence(doc): Add evidence/hint
- search_similar(query, doc_type, top_k=5, score_threshold=0.7): Semantic search
- get_document(doc_id): Retrieve by ID
- bulk_add_documents(docs): Batch insert
- get_collection_info(): Get stats
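The `top_k` and `score_threshold` parameters of `search_similar` interact: the threshold discards weak matches, and `top_k` caps how many of the remaining ones are returned. The sketch below shows one plausible reading using cosine similarity over in-memory vectors. The similarity metric and the order of filtering inside each backend are assumptions, and `rank_similar` is a hypothetical function, not the library's implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_similar(query_vec, candidates, top_k=5, score_threshold=0.7):
    """Return up to top_k (doc_id, score) pairs scoring at least the threshold.

    `candidates` maps a document ID to its embedding vector.
    """
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in candidates.items()
    ]
    kept = [item for item in scored if item[1] >= score_threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:top_k]
```

Under this reading, a query can return fewer than `top_k` results, or none at all, when few documents clear the threshold.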
Troubleshooting
Common Issues
Connection Errors
# Check service availability (the timeout keeps the call from hanging)
import requests
requests.get("http://localhost:6333", timeout=5)  # Qdrant
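Not every backend answers plain HTTP on its port (PostgreSQL, for instance, does not), so a TCP-level probe is a useful first check that works for any of them. The helper below is a standard-library sketch; `is_port_open` is a hypothetical name.

```python
import socket


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

With the ports used in the configuration examples, `is_port_open("localhost", 6333)` probes Qdrant, `8000` Chroma in server mode, `5432` PostgreSQL, and `19530` Milvus. An open port only shows that something is listening, not that the service is healthy.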
Memory Issues
# Use smaller embedding model
store = VectorStoreFactory.create(
    backend="qdrant",
    collection="my_collection",
    embedding_model="sentence-transformers/all-MiniLM-L6-v2"  # 384-dim
)
Performance Issues
# Tune HNSW parameters
store = VectorStoreFactory.create(
    backend="qdrant",
    collection="my_collection",
    hnsw_config={"m": 16, "ef_construct": 100}
)
Contributing
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
License
MIT License - see LICENSE file for details.
Directory Structure
thoth_vdbmanager/
├── vdbmanager/
│   ├── core/                   # Base interfaces and document types
│   │   ├── base.py             # Core document classes and interfaces
│   │   └── __init__.py
│   ├── adapters/               # Backend-specific implementations
│   │   ├── haystack_adapter.py # Base Haystack adapter
│   │   ├── qdrant_adapter.py   # Qdrant implementation
│   │   ├── chroma_adapter.py   # Chroma implementation
│   │   ├── pgvector_adapter.py # PostgreSQL pgvector
│   │   └── milvus_adapter.py   # Milvus implementation
│   ├── factory.py              # Unified creation interface
│   └── __init__.py             # Public API exports
├── test_e2e_vectordb/          # End-to-end tests
├── pyproject.toml              # Project configuration
└── README.md                   # This file
Quick API Reference
Main API
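from thoth_vdbmanager import VectorStoreFactory, ColumnNameDocument
# Create any backend
store = VectorStoreFactory.create(
    backend="qdrant",
    collection="my_docs",
    host="localhost",
    port=6333
)
# Build a document to store (same fields as in the Quick Start example)
column_doc = ColumnNameDocument(
    table_name="users",
    column_name="email",
    original_column_name="user_email",
    column_description="User email address",
    value_description="Valid email format"
)
# Use the methods
doc_id = store.add_column_description(column_doc)
results = store.search_similar("user email", "column_name")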
Ready to use with Haystack v2 and 4 production-ready vector databases!