Facades for VectorDBs
vd - Vector Database Facades
A unified, Pythonic interface for interacting with various vector databases. The vd package abstracts away the specifics of each database's API to offer a consistent, database-agnostic interface for semantic search operations.
Features
Core Features
- Unified API: Single interface for multiple vector database backends
- Backend Discovery: Easy-to-use tools to find, install, and use different vector databases
- Pythonic Design: Collections behave like MutableMapping (dict-like)
- Flexible Document Input: Support for strings, tuples, and Document objects
- Powerful Filtering: MongoDB-style query syntax for metadata filtering
- Automatic Embeddings: Seamless integration with embedding models via imbed
- Pluggable Backends: Easy to add new vector database backends
- Helpful Error Messages: Get installation instructions when backends aren't available
- Type-Safe: Full type hints and protocol-based design
- Well-Tested: Comprehensive test suite with >90% coverage
Extended Features
- Command-Line Interface: Full-featured CLI for common operations
- Configuration Management: YAML/TOML config files with profiles and environment variables
- Backend Comparison: Compare and get recommendations for backends based on your needs
- Import/Export: Support for JSONL, JSON, and directory formats
- Migration: Move collections between backends with progress tracking
- Analytics: Collection statistics, validation, duplicate detection, outlier analysis
- Text Preprocessing: Clean and chunk text with multiple strategies
- Health Checks: Monitor backend health and benchmark performance
- Advanced Search: Multi-query search, similarity search, reciprocal rank fusion
Installation
# Basic installation (includes memory backend)
pip install vd
# With ChromaDB support
pip install vd[chromadb]
# With all optional dependencies
pip install vd[all]
Quick Start
import vd
# Connect to a backend (memory backend for quick prototyping)
client = vd.connect('memory')
# Create a collection
docs = client.create_collection('my_documents')
# Add documents (simple!)
docs['doc1'] = "Machine learning is a subset of AI"
docs['doc2'] = "Deep learning uses neural networks"
docs['doc3'] = "Python is great for data science"
# Search with semantic similarity
results = docs.search("artificial intelligence", limit=2)
for result in results:
    print(f"{result['id']}: {result['text']} (score: {result['score']:.3f})")
Core Concepts
Backends
vd supports multiple vector database backends:
- memory: In-memory storage (always available, great for testing)
- chroma: ChromaDB (requires pip install chromadb)
More backends coming soon (Pinecone, Weaviate, Qdrant, Milvus, FAISS)!
# List currently registered backends
print(vd.list_backends())
# Connect to different backends
memory_client = vd.connect('memory')
chroma_client = vd.connect('chroma', persist_directory='./data')
Backend Discovery
vd makes it easy to discover and install vector database backends:
import vd
# View all backends with a nicely formatted table
vd.print_backends_table()
# List only backends that are currently available (installed)
available = vd.list_available_backends()
print(f"Available: {available}")
# Get detailed information about a specific backend
info = vd.get_backend_info('chroma')
print(info['description'])
print(info['features'])
# Get installation instructions
instructions = vd.get_install_instructions('chroma')
print(instructions)
# List ALL possible backends (including planned ones)
all_backends = vd.list_all_backends(include_planned=True)
When you try to connect to a backend that's not installed, you'll get helpful error messages:
>>> vd.connect('chroma')
ValueError: Backend 'chroma' is not available.
To install it:
pip install vd[chromadb]
Or run: vd.get_install_instructions('chroma') for more details.
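An error like this can be produced by checking whether a backend's optional dependency is importable before using it. The following is a minimal sketch of that idea using only the standard library; `INSTALL_HINTS` and `ensure_available` are hypothetical names for illustration, not part of vd's API.

```python
import importlib.util

# Hypothetical mapping: backend name -> (module to look for, install command).
INSTALL_HINTS = {
    'chroma': ('chromadb', 'pip install vd[chromadb]'),
}


def ensure_available(backend, hints=INSTALL_HINTS):
    """Raise a ValueError with an install hint if the backend's module is missing."""
    module_name, install_cmd = hints[backend]
    # find_spec returns None when a top-level module cannot be found.
    if importlib.util.find_spec(module_name) is None:
        raise ValueError(
            f"Backend '{backend}' is not available.\n"
            f"To install it:\n    {install_cmd}"
        )
```

Calling `ensure_available('chroma')` returns silently when `chromadb` is importable and raises otherwise.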
Collections
Collections are MutableMapping objects that store searchable documents:
# Create a collection
docs = client.create_collection('articles')
# Dict-like operations
docs['doc1'] = "Some text" # Add
doc = docs['doc1'] # Retrieve
del docs['doc1'] # Delete
len(docs) # Count
for doc_id in docs:                # Iterate
    print(doc_id)
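To show what the MutableMapping contract provides, here is a minimal sketch of a dict-like collection built on `collections.abc.MutableMapping`. `ToyCollection` is a hypothetical stand-in and stores plain text only; vd's real collections also handle embeddings and search.

```python
from collections.abc import MutableMapping


class ToyCollection(MutableMapping):
    """Hypothetical dict-like document store (not vd's actual class)."""

    def __init__(self):
        self._docs = {}

    def __getitem__(self, doc_id):
        return self._docs[doc_id]

    def __setitem__(self, doc_id, text):
        self._docs[doc_id] = text

    def __delitem__(self, doc_id):
        del self._docs[doc_id]

    def __iter__(self):
        return iter(self._docs)

    def __len__(self):
        return len(self._docs)


toy = ToyCollection()
toy['doc1'] = "Some text"
toy.update({'doc2': "More text"})  # update() is inherited from MutableMapping
```

Implementing these five methods is enough to inherit `update`, `pop`, `setdefault`, `keys`, `items`, and `in` checks for free.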
Documents
Multiple ways to specify documents:
# String (simple text)
docs['id1'] = "Just some text"
# Tuple: (text, metadata)
docs['id2'] = ("Article text", {'category': 'tech', 'year': 2024})
# Tuples in batch operations: (text, id) or (text, metadata)
docs.add_documents([
    ("First article", "custom_id_1"),
    ("Second article", {'author': 'Alice'}),
])
# Document object (full control)
doc = vd.Document(
    id='id3',
    text='Article text',
    metadata={'category': 'science'},
    vector=[0.1, 0.2, ...]  # Optional pre-computed embedding
)
docs.upsert(doc)
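One way to accept all of these input shapes is to normalize them into a single document type, telling an ID from metadata by the type of the tuple element. The sketch below assumes that rule; `Doc` and `normalize` are hypothetical names, and vd's own normalization may differ.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Doc:
    """Hypothetical stand-in for vd.Document (vector omitted for brevity)."""
    text: str
    id: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def normalize(item) -> Doc:
    """Turn a str, tuple, or Doc into a Doc."""
    if isinstance(item, Doc):
        return item
    if isinstance(item, str):
        return Doc(text=item)
    if isinstance(item, tuple) and item and isinstance(item[0], str):
        text, *extras = item
        doc = Doc(text=text)
        for extra in extras:
            if isinstance(extra, str):     # assumed: a str after the text is an ID
                doc.id = extra
            elif isinstance(extra, dict):  # assumed: a dict is metadata
                doc.metadata = extra
            else:
                raise TypeError(f"Unsupported tuple element: {extra!r}")
        return doc
    raise TypeError(f"Unsupported document input: {item!r}")
```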
Searching
Powerful search with filtering and transformation:
# Basic search
results = docs.search("machine learning", limit=5)
# With metadata filter
results = docs.search(
    "neural networks",
    filter={'category': 'AI', 'year': {'$gte': 2020}}
)
# With egress function (transform results)
texts = docs.search(
    "data science",
    limit=10,
    egress=vd.text_only  # Just return the text
)
# Available egress functions
vd.text_only(result) # Returns just the text
vd.id_only(result) # Returns just the ID
vd.id_and_score(result) # Returns (id, score)
vd.id_text_score(result) # Returns (id, text, score)
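An egress function is just a callable applied to each result. The sketch below assumes results are dicts with `id`, `text`, and `score` keys, as in the Quick Start; the function bodies and `apply_egress` are illustrative, not vd's implementation.

```python
def text_only(result):
    """Keep only the document text."""
    return result['text']


def id_and_score(result):
    """Keep the (id, score) pair."""
    return (result['id'], result['score'])


def apply_egress(results, egress=None):
    """Yield each result, transformed by egress when one is given."""
    for result in results:
        yield egress(result) if egress is not None else result


raw = [
    {'id': 'doc1', 'text': 'Machine learning', 'score': 0.91},
    {'id': 'doc2', 'text': 'Deep learning', 'score': 0.84},
]
texts = list(apply_egress(raw, egress=text_only))
```

Any function that takes one result and returns a value can be used the same way, so custom projections need no special support.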
Filtering
MongoDB-style filter syntax:
# Equality
docs.search("query", filter={'category': 'tech'})
# Comparison operators
docs.search("query", filter={'year': {'$gte': 2020}})
docs.search("query", filter={'views': {'$lt': 1000}})
# List membership
docs.search("query", filter={'tags': {'$in': ['python', 'ai']}})
# Logical operators
docs.search("query", filter={
    '$and': [
        {'year': {'$gte': 2020}},
        {'category': 'tech'}
    ]
})
Supported operators:
- $eq: Equal
- $ne: Not equal
- $gt: Greater than
- $gte: Greater than or equal
- $lt: Less than
- $lte: Less than or equal
- $in: In list
- $and: Logical AND
- $or: Logical OR
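To make the semantics of these operators concrete, here is a minimal sketch of how such a filter could be evaluated against one document's metadata. It is not vd's implementation: `matches` is a hypothetical helper, it handles scalar metadata values only, and it assumes a missing field never matches an operator condition.

```python
import operator

# Operator name -> function(field_value, operand)
_COMPARATORS = {
    '$eq': operator.eq,
    '$ne': operator.ne,
    '$gt': operator.gt,
    '$gte': operator.ge,
    '$lt': operator.lt,
    '$lte': operator.le,
    '$in': lambda value, options: value in options,
}


def matches(metadata: dict, filter: dict) -> bool:
    """Return True if metadata satisfies every clause of the filter."""
    for key, condition in filter.items():
        if key == '$and':
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == '$or':
            if not any(matches(metadata, sub) for sub in condition):
                return False
        elif isinstance(condition, dict):
            # Operator form, e.g. {'year': {'$gte': 2020}}
            if key not in metadata:
                return False  # assumption: missing fields never match
            for op, operand in condition.items():
                if not _COMPARATORS[op](metadata[key], operand):
                    return False
        elif metadata.get(key) != condition:
            # Plain equality form, e.g. {'category': 'tech'}
            return False
    return True


meta = {'category': 'tech', 'year': 2021, 'views': 500}
```

Multiple keys at the same level combine as an implicit AND, which is why `{'category': 'AI', 'year': {'$gte': 2020}}` requires both conditions.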
Advanced Usage
Custom Embedding Models
# Use a specific embedding model
client = vd.connect('memory', embedding_model='text-embedding-3-large')
# Use a custom embedding function
def my_embedder(text: str) -> list[float]:
    # Your embedding logic here
    return [...]

client = vd.connect('memory', embedding_model=my_embedder)
Batch Operations
# Batch add for efficiency
docs.add_documents([
    "Document 1",
    ("Document 2", {'category': 'tech'}),
    ("Document 3", "custom_id", {'year': 2024}),
], batch_size=100)
Collection Management
# List collections
for name in client.list_collections():
    print(name)
# Get existing collection
docs = client.get_collection('my_docs')
# Delete collection
client.delete_collection('old_docs')
Pre-computed Vectors
# If you already have embeddings
doc = vd.Document(
    id='doc1',
    text='Some text',
    vector=[0.1, 0.2, 0.3, ...],  # Your pre-computed embedding
)
docs['doc1'] = doc
# Search with pre-computed query vector
query_vector = [0.15, 0.25, 0.35, ...]
results = docs.search(query_vector, limit=5)
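Searching with a query vector comes down to ranking stored vectors by similarity to it. The sketch below uses cosine similarity as an assumed metric (this README does not say which metric each backend uses); `cosine_similarity` and `top_k` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vector, vectors, limit=5):
    """Return the `limit` best (doc_id, score) pairs, highest score first."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vec))
        for doc_id, vec in vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


vectors = {'a': [1.0, 0.0], 'b': [0.0, 1.0], 'c': [1.0, 1.0]}
hits = top_k([1.0, 0.0], vectors, limit=2)
```

This brute-force scan is linear in the number of documents; dedicated vector databases use approximate indexes to avoid that cost at scale.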
Architecture
The vd package is designed with several key principles:
- Protocol-based: Uses Python protocols for type safety without tight coupling
- Separation of Concerns: Embedding, storage, and search are independent
- Progressive Enhancement: Same code works from in-memory to distributed databases
- Facade Pattern: Provides a consistent interface across different backends
Project Structure
vd/
├── __init__.py # Public API
├── base.py # Core protocols and types
├── util.py # Utility functions and factory
├── backends/ # Backend implementations
│ ├── __init__.py
│ ├── memory.py # In-memory backend
│ └── chroma.py # ChromaDB backend
└── tests/ # Comprehensive test suite
Development
Running Tests
# Install development dependencies
pip install -e .[dev]
# Run tests
pytest tests/ -v
# Run tests with coverage
pytest tests/ --cov=vd --cov-report=html
Adding a New Backend
- Create a new file in vd/backends/
- Implement the backend class inheriting from BaseBackend
- Implement a collection class with the MutableMapping interface
- Register the backend with @register_backend('backend_name')
- Add tests in tests/
Example:
from vd.base import BaseBackend
from vd.util import register_backend
@register_backend('mydb')
class MyDBBackend(BaseBackend):
    def create_collection(self, name, **kwargs):
        # Implementation
        pass

    # ... other methods
Design Philosophy
The vd package follows these design principles:
- Favor functional over object-oriented where appropriate
- Use Mapping/MutableMapping abstractions for intuitive interfaces
- Leverage existing packages (dol, imbed) for core functionality
- Optional dependencies for backends (graceful degradation)
- Progressive enhancement: Scale from prototypes to production seamlessly
Integration with i2mint Ecosystem
vd is designed to work seamlessly with the i2mint ecosystem:
- dol: Provides the underlying Mapping/Store patterns
- imbed: Handles embedding generation and management
- i2: Signature manipulation for consistent interfaces
- oa: OpenAI API integration for embeddings
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
MIT
Links
- GitHub: https://github.com/i2mint/vd
- Documentation: Coming soon
- PyPI: Coming soon
Command-Line Interface
vd includes a comprehensive CLI for common operations:
# List available backends
vd backends
vd backends --planned # Include planned backends
# Get installation instructions
vd install chroma
# Check backend health
vd health memory
# Export a collection
vd export memory my_docs -o backup.jsonl
vd export memory my_docs -o backup.json -f json
# Import a collection
vd import chroma my_docs -i backup.jsonl
# View collection statistics
vd stats memory my_docs
vd stats memory my_docs -v # Verbose output
# Validate a collection
vd validate memory my_docs
# Migrate between backends
vd migrate memory source_docs chroma target_docs
# Benchmark search performance
vd benchmark memory my_docs -q "test query" --queries 100
Configuration Management
Manage backend configurations with YAML or TOML files:
import vd
# Connect using a configuration file
client = vd.connect_from_config('vd.yaml')
# Use a specific profile
client = vd.connect_from_config('vd.yaml', profile='production')
# Create example configuration
config_yaml = vd.create_example_config('yaml')
vd.save_config(config_yaml, 'vd.yaml')
Example vd.yaml:
profiles:
  default:
    backend: memory
  dev:
    backend: memory
  prod:
    backend: chroma
    persist_directory: ./vector_db
Environment variable overrides:
- VD_PROFILE: Select profile (default: 'default')
- VD_BACKEND: Override backend name
- VD_EMBEDDING_MODEL: Override embedding model
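The override order can be pictured as: pick a profile, then let environment variables replace individual settings. The sketch below assumes that precedence and that the embedding model is stored under an `embedding_model` key; `resolve_settings` is a hypothetical helper, not a vd function.

```python
import os


def resolve_settings(config: dict, environ=None) -> dict:
    """Merge the selected profile with environment variable overrides."""
    environ = os.environ if environ is None else environ
    profile = environ.get('VD_PROFILE', 'default')
    settings = dict(config['profiles'][profile])  # copy so the config stays untouched
    if 'VD_BACKEND' in environ:
        settings['backend'] = environ['VD_BACKEND']
    if 'VD_EMBEDDING_MODEL' in environ:
        settings['embedding_model'] = environ['VD_EMBEDDING_MODEL']
    return settings


config = {
    'profiles': {
        'default': {'backend': 'memory'},
        'prod': {'backend': 'chroma', 'persist_directory': './vector_db'},
    }
}
prod = resolve_settings(config, environ={'VD_PROFILE': 'prod'})
```

Passing `environ` explicitly keeps the function easy to test; leaving it out reads from the real process environment.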
Backend Comparison and Recommendation
Choose the right backend for your needs:
import vd
# Compare backends
vd.print_comparison(['memory', 'chroma', 'pinecone'])
# Get recommendations based on requirements
vd.print_recommendation(
    dataset_size='medium',           # small, medium, large, very_large
    persistence_required=True,
    cloud_required=False,
    budget='free',                   # free, low, medium, high
    performance_priority='balanced'  # speed, scalability, balanced
)
# Get backend characteristics
chars = vd.get_backend_characteristics()
print(chars['chroma']['use_cases'])
Import/Export
Export and import collections in multiple formats:
import vd
# Export to JSONL (recommended for large collections)
vd.export_collection(docs, 'backup.jsonl', format='jsonl')
# Export to JSON
vd.export_collection(docs, 'backup.json', format='json')
# Export to directory (one file per document)
vd.export_collection(docs, './backup_dir', format='directory')
# Import from file
vd.import_collection(docs, 'backup.jsonl')
vd.import_collection(docs, 'backup.jsonl', skip_existing=True)
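JSONL stores one JSON object per line, so a collection can be streamed in and out without loading the whole file at once. The sketch below shows a round trip over a plain dict, assuming records with `id` and `text` fields; the actual record layout vd writes is not documented here, and `dump_jsonl` and `load_jsonl` are hypothetical helpers.

```python
import io
import json


def dump_jsonl(docs, fp):
    """Write one {'id', 'text'} record per line."""
    for doc_id, text in docs.items():
        fp.write(json.dumps({'id': doc_id, 'text': text}) + '\n')


def load_jsonl(fp, target, skip_existing=False):
    """Read records into target; return how many were written."""
    added = 0
    for line in fp:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        if skip_existing and record['id'] in target:
            continue
        target[record['id']] = record['text']
        added += 1
    return added


buffer = io.StringIO()
dump_jsonl({'doc1': 'First', 'doc2': 'Second'}, buffer)
buffer.seek(0)
restored = {'doc1': 'Already here'}
added = load_jsonl(buffer, restored, skip_existing=True)
```

With `skip_existing=True`, documents already in the target keep their current content and only new IDs are added.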
Migration
Move collections between backends:
import vd
# Migrate a collection
source = source_client.get_collection('docs')
target = target_client.create_collection('docs')
stats = vd.migrate_collection(
    source,
    target,
    batch_size=100,
    preserve_vectors=True,  # Keep existing embeddings
    progress_callback=lambda cur, tot: print(f"{cur}/{tot}")
)
# Migrate entire client (all collections)
vd.migrate_client(
    source_client,
    target_client,
    collection_names=['docs1', 'docs2']  # Optional filter
)
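Batched migration with a progress callback can be sketched over any two mappings. This is an illustration of the pattern only: `copy_in_batches` is a hypothetical helper, the returned keys are made up, and vector preservation is left out.

```python
from itertools import islice


def batched(iterable, size):
    """Yield lists of up to `size` items from iterable."""
    iterator = iter(iterable)
    while batch := list(islice(iterator, size)):
        yield batch


def copy_in_batches(source, target, batch_size=100, progress_callback=None):
    """Copy every item from source to target, reporting progress per batch."""
    total = len(source)
    done = 0
    for keys in batched(list(source), batch_size):
        for key in keys:
            target[key] = source[key]
        done += len(keys)
        if progress_callback is not None:
            progress_callback(done, total)
    return {'migrated': done, 'total': total}


progress = []
source = {f'doc{i}': f'text {i}' for i in range(5)}
target = {}
stats = copy_in_batches(
    source, target, batch_size=2,
    progress_callback=lambda cur, tot: progress.append((cur, tot)),
)
```

The callback receives the running count and the total after each batch, which matches the `cur`/`tot` arguments used above.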
Collection Analytics
Analyze and validate collections:
import vd
# Get collection statistics
stats = vd.collection_stats(docs)
print(f"Total: {stats['total_documents']}")
print(f"Avg length: {stats['avg_text_length']}")
print(f"Metadata fields: {stats['metadata_fields']}")
# Metadata distribution
dist = vd.metadata_distribution(docs, 'category')
# Find duplicate or near-duplicate documents
duplicates = vd.find_duplicates(docs, threshold=0.95)
# Find outliers (dissimilar documents)
outliers = vd.find_outliers(docs, threshold=0.3)
# Sample collection
random_sample = vd.sample_collection(docs, n=10, method='random')
diverse_sample = vd.sample_collection(docs, n=10, method='diverse')
# Validate collection integrity
report = vd.validate_collection(docs)
if not report['valid']:
    for issue in report['issues']:
        print(f"Issue: {issue}")
Text Preprocessing
Clean and chunk text before adding to collections:
import vd
# Clean text
clean = vd.clean_text(
    text,
    lowercase=True,
    remove_extra_whitespace=True,
    remove_urls=True,
    remove_emails=True
)
# Chunk text
chunks = vd.chunk_text(
    text,
    chunk_size=500,
    overlap=50,
    strategy='sentences'  # chars, words, sentences, paragraphs
)
# Chunk documents with metadata preservation
chunked_docs = vd.chunk_documents(
    documents,
    chunk_size=500,
    id_template='{doc_id}_chunk_{chunk_num}',
    preserve_metadata=True
)
# Extract metadata from text
metadata = vd.extract_metadata(
    text,
    extract_title=True,
    extract_length=True,
    extract_word_count=True
)
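To illustrate how `chunk_size` and `overlap` interact, here is a minimal sketch of character-based chunking, where each chunk starts `chunk_size - overlap` characters after the previous one. `chunk_chars` is a hypothetical helper; vd's own strategies (words, sentences, paragraphs) are not reproduced here.

```python
def chunk_chars(text, chunk_size=500, overlap=50):
    """Split text into chunks of up to chunk_size characters that share `overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    if not text:
        return []
    step = chunk_size - overlap
    # Stop early so the last chunk is never just a repeat of the previous overlap.
    stop = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, stop, step)]


chunks = chunk_chars("abcdefghij", chunk_size=4, overlap=1)
```

Here `chunks` is `['abcd', 'defg', 'ghij']`: each chunk repeats the last character of the one before it.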
Health Checks and Benchmarking
Monitor and benchmark performance:
import vd
# Check backend health
health = vd.health_check_backend('chroma', persist_directory='./data')
print(f"Status: {health['status']}")
print(f"Available: {health['available']}")
# Check collection health
health = vd.health_check_collection(docs)
# Benchmark search performance
results = vd.benchmark_search(
    docs,
    query="test query",
    n_queries=100,
    limit=10
)
print(f"Avg latency: {results['avg_latency']*1000:.2f}ms")
print(f"P95: {results['p95']*1000:.2f}ms")
print(f"Throughput: {results['queries_per_second']:.1f} queries/sec")
# Benchmark insertion
results = vd.benchmark_insert(docs, n_documents=100, batch_size=10)
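The reported figures can be derived from per-query timings. The sketch below measures a search callable with `time.perf_counter` and computes the same three metrics; it reuses the result keys shown above, but `benchmark` itself is a hypothetical helper and the percentile method vd uses is an assumption.

```python
import statistics
import time


def benchmark(search_fn, query, n_queries=100):
    """Time n_queries calls (at least 2) and summarize latency in seconds."""
    latencies = []
    for _ in range(n_queries):
        start = time.perf_counter()
        search_fn(query)
        latencies.append(time.perf_counter() - start)
    total = sum(latencies)
    return {
        'avg_latency': total / n_queries,
        # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
        'p95': statistics.quantiles(latencies, n=20)[-1],
        'queries_per_second': n_queries / total if total else float('inf'),
    }


corpus = ["machine learning", "deep learning", "data science"]
report = benchmark(lambda q: [t for t in corpus if q in t], "learning", n_queries=50)
```

Latencies are in seconds, so multiplying by 1000 gives milliseconds as in the print statements above.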
Advanced Search
Enhanced search capabilities:
import vd
# Multi-query search
results = vd.multi_query_search(
    docs,
    queries=["AI", "machine learning"],
    limit=10,
    combine='best'  # interleave, concatenate, union, best
)
# Find similar documents
similar = vd.search_similar_to_document(
    docs,
    doc_id='doc1',
    limit=10,
    exclude_self=True
)
# Reciprocal Rank Fusion (combine multiple rankings)
results1 = list(docs.search("query1"))
results2 = list(docs.search("query2"))
combined = vd.reciprocal_rank_fusion([results1, results2])
# Deduplicate results
unique = vd.deduplicate_results(results, key='id', keep='first')
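Reciprocal rank fusion scores each document by summing 1 / (k + rank) over every ranking it appears in, so documents ranked well by several queries rise to the top. The sketch below implements that formula with the commonly used constant k = 60; `rrf` is a hypothetical helper, and the constant and return format of `vd.reciprocal_rank_fusion` are not confirmed by this README.

```python
from collections import defaultdict


def rrf(rankings, k=60):
    """Fuse ranked result lists into (doc_id, score) pairs, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, result in enumerate(ranking, start=1):
            scores[result['id']] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


ranking1 = [{'id': 'a'}, {'id': 'b'}, {'id': 'c'}]
ranking2 = [{'id': 'b'}, {'id': 'c'}, {'id': 'a'}]
fused = rrf([ranking1, ranking2])
```

In this example 'b' comes first (ranks 2 and 1), then 'a' (ranks 1 and 3), then 'c' (ranks 3 and 2). Only ranks are used, so the raw similarity scores from different queries never need to be comparable.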
Roadmap
- Import/Export (JSONL, JSON, directory)
- Migration between backends
- Collection analytics and validation
- Text preprocessing and chunking
- Health checks and benchmarking
- Advanced search (multi-query, RRF, similarity)
- Configuration file support (YAML, TOML)
- Backend comparison and recommendation
- Command-line interface
- Additional backends (Pinecone, Weaviate, Qdrant, FAISS)
- Async support
- Hybrid search (vector + keyword)
- Comprehensive documentation site
Examples
See the demo scripts for comprehensive examples:
- example_usage.py - Basic usage and core features
- demo_backend_discovery.py - Backend discovery features
- demo_config.py - Configuration management
- demo_comparison.py - Backend comparison and recommendation
- demo_utilities.py - Import/export, migration, analytics, and more