Cross-platform vector database engine with pluggable adapters
CrossVector
A unified Python library for vector database operations with pluggable backends and embedding providers.
CrossVector provides a consistent, high-level API across multiple vector databases (AstraDB, ChromaDB, Milvus, PgVector) and embedding providers (OpenAI, Gemini), allowing you to switch between backends without rewriting your application code.
⚠️ Beta Status
WARNING: CrossVector is currently in BETA. Do not use it in production until the 1.0 release.
- API may change without notice
- Database schemas may evolve
- Features are still being tested and refined
Recommended for:
- ✅ Prototyping and experimentation
- ✅ Development and testing environments
- ✅ Learning vector databases
Not recommended for:
- ❌ Production applications
- ❌ Mission-critical systems
Features
🔌 Pluggable Architecture
- 4 Vector Databases: AstraDB, ChromaDB, Milvus, PgVector
- 2 Embedding Providers: OpenAI, Gemini
- Switch backends without code changes
🎯 Unified API
- Consistent interface across all adapters
- Django-style get, get_or_create, update_or_create semantics
- Flexible document input formats: str, dict, or VectorDocument
🔍 Advanced Querying
- Query DSL: Type-safe filter composition with Q objects
- Universal operators: $eq, $ne, $gt, $gte, $lt, $lte, $in, $nin
- Nested metadata: Dot-notation paths for hierarchical data
- Metadata-only search: Query without vector similarity (where supported)
🚀 Performance Optimized
- Automatic batch embedding generation
- Bulk operations: bulk_create, bulk_update, upsert
- Configurable batch sizes and conflict resolution
🛡️ Type-Safe & Validated
- Full Pydantic validation
- Structured exceptions with detailed context
- Centralized logging with configurable levels
⚙️ Flexible Configuration
- Environment variable support via .env
- Multiple primary key strategies: UUID, hash-based, int64, custom
- Optional text storage to optimize space
Installation
Core Package (Minimal)
pip install crossvector
With Specific Backends
# AstraDB + OpenAI
pip install crossvector[astradb,openai]
# ChromaDB + OpenAI
pip install crossvector[chromadb,openai]
# Milvus + Gemini
pip install crossvector[milvus,gemini]
# PgVector + OpenAI
pip install crossvector[pgvector,openai]
All Backends and Providers
# Everything
pip install crossvector[all]
# All databases, plus OpenAI embeddings
pip install crossvector[all-dbs,openai]
# AstraDB, plus all embedding providers
pip install crossvector[astradb,all-embeddings]
Quick Start
Basic Usage
from crossvector import VectorEngine
from crossvector.embeddings.openai import OpenAIEmbeddingAdapter
from crossvector.dbs.pgvector import PgVectorAdapter
# Initialize engine
engine = VectorEngine(
    embedding=OpenAIEmbeddingAdapter(model_name="text-embedding-3-small"),
    db=PgVectorAdapter(),
    collection_name="my_documents",
    store_text=True
)
# Create documents (flexible input formats)
doc1 = engine.create(text="Python is a programming language")
doc2 = engine.create({"text": "Artificial intelligence", "metadata": {"category": "tech"}})
doc3 = engine.create(text="Machine learning basics", metadata={"level": "beginner"})
print(f"Created documents: {doc1.id}, {doc2.id}, {doc3.id}")
# Search by text (automatic embedding generation)
results = engine.search("programming languages", limit=5)
for doc in results:
    print(f"[{doc.metadata.get('score', 0):.3f}] {doc.text}")
# Search by vector (skip embedding step)
vector = engine.embedding.get_embeddings(["my query"])[0]
results = engine.search(vector, limit=3)
# Get document by ID
doc = engine.get(doc1.id)
print(f"Retrieved: {doc.text}")
# Count documents
total = engine.count()
print(f"Total documents: {total}")
# Delete documents
engine.delete(doc1.id)
engine.delete([doc2.id, doc3.id]) # Batch delete
Flexible Input Formats
CrossVector accepts multiple document input formats for maximum convenience:
# String input (text only)
doc1 = engine.create("Simple text document")
# Dict input with metadata
doc2 = engine.create({
    "text": "Document with metadata",
    "metadata": {"source": "api", "author": "user123"}
})
# Keyword input with metadata as kwargs
doc3 = engine.create(
    text="Document with inline metadata",
    source="web",
    category="blog"
)
# VectorDocument instance
from crossvector import VectorDocument
doc4 = engine.create(
    VectorDocument(
        id="custom-id",
        text="Full control document",
        metadata={"priority": "high"}
    )
)
# Provide pre-computed vector (skip embedding)
doc5 = engine.create(
    text="Document with vector",
    vector=[0.1, 0.2, ...],  # 1536-dim for OpenAI
    metadata={"source": "external"}
)
Django-Style Operations
# Get or create pattern
doc, created = engine.get_or_create(
    text="My document",
    metadata={"topic": "AI"}
)
if created:
    print("Created new document")
else:
    print("Document already exists")
# Update or create pattern
doc, created = engine.update_or_create(
    {"id": "doc-123"},
    text="Updated content",
    defaults={"metadata": {"updated": True}}
)
# Get with metadata filters
doc = engine.get(source="api", status="active") # Must return exactly one
# Bulk operations
docs = [
    {"text": "Doc 1", "metadata": {"idx": 1}},
    {"text": "Doc 2", "metadata": {"idx": 2}},
    {"text": "Doc 3", "metadata": {"idx": 3}},
]
created_docs = engine.bulk_create(docs, batch_size=100)
# Upsert (insert or update)
docs = engine.upsert([
    {"id": "doc-1", "text": "Updated doc 1"},
    {"id": "doc-2", "text": "New doc 2"},
])
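The batch_size argument above suggests that bulk operations process documents in fixed-size groups. As a rough illustration of that idea only, the sketch below splits a list into batches; chunked is a hypothetical helper written for this example and is not part of CrossVector.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def chunked(items: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive slices of `items`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be a positive integer")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


docs = [{"text": f"Doc {i}"} for i in range(5)]
batches = list(chunked(docs, batch_size=2))
# Three batches: two full ones and a final partial one.
print([len(batch) for batch in batches])  # [2, 2, 1]
```

Each batch could then be sent to the embedding provider in a single request, which is presumably what keeps the number of API calls low for large inputs.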
Advanced Querying
Query DSL with Q Objects
CrossVector provides a powerful Query DSL for composing complex filters:
from crossvector.querydsl.q import Q
# Simple equality
results = engine.search("AI", where=Q(category="tech"))
# Comparison operators
results = engine.search(
    "articles",
    where=Q(score__gte=0.8) & Q(views__lt=1000)
)
# Range queries
results = engine.search(
    "products",
    where=Q(price__gte=100) & Q(price__lte=500)
)
# IN / NOT IN
results = engine.search(
    "users",
    where=Q(role__in=["admin", "moderator"]) & Q(status__ne="banned")
)
# Boolean combinations
high_quality = Q(rating__gte=4.5) & Q(reviews__gte=10)
featured = Q(featured__eq=True)
results = engine.search("items", where=high_quality | featured)
# Negation
results = engine.search("posts", where=~Q(status="archived"))
# Nested metadata (dot notation)
results = engine.search(
    "documents",
    where=Q(info__lang__eq="en") & Q(info__tier__eq="gold")
)
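To make the lookup syntax above more concrete, here is a minimal sketch of how Django-style keyword lookups and the &, | and ~ operators could be turned into a nested dict. MiniQ is a toy class written for this example, not the real crossvector Q. The $and, $or and $not keys, and the mapping of double underscores to dot-notation paths, are assumptions made for illustration.

```python
from typing import Any, Dict

# Lookup suffixes that correspond to the universal operators in this README.
_OPS = {"eq", "ne", "gt", "gte", "lt", "lte", "in", "nin"}


class MiniQ:
    """Toy filter node; NOT the real crossvector Q class."""

    def __init__(self, **lookups: Any) -> None:
        self.tree: Dict[str, Any] = {}
        for key, value in lookups.items():
            parts = key.split("__")
            # A trailing known operator is stripped; otherwise default to eq.
            op = parts.pop() if (len(parts) > 1 and parts[-1] in _OPS) else "eq"
            # Remaining parts form a dot-notation path (assumed convention).
            self.tree[".".join(parts)] = {f"${op}": value}

    @classmethod
    def _wrap(cls, tree: Dict[str, Any]) -> "MiniQ":
        node = cls()
        node.tree = tree
        return node

    def __and__(self, other: "MiniQ") -> "MiniQ":
        return self._wrap({"$and": [self.tree, other.tree]})  # assumed key

    def __or__(self, other: "MiniQ") -> "MiniQ":
        return self._wrap({"$or": [self.tree, other.tree]})  # assumed key

    def __invert__(self) -> "MiniQ":
        return self._wrap({"$not": self.tree})  # assumed key


combined = MiniQ(score__gte=0.8) & MiniQ(info__lang__eq="en")
print(combined.tree)
# {'$and': [{'score': {'$gte': 0.8}}, {'info.lang': {'$eq': 'en'}}]}
```

Overloading the operators keeps filter expressions composable: each combination returns a new node, so partial filters such as high_quality above can be stored and reused.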
Universal Filter Format
You can also use dict-based filters with universal operators:
# Equality and comparison
results = engine.search("query", where={
    "category": {"$eq": "tech"},
    "score": {"$gt": 0.8},
    "views": {"$lte": 1000}
})
# IN / NOT IN
results = engine.search("query", where={
    "status": {"$in": ["active", "pending"]},
    "priority": {"$nin": ["low"]}
})
# Nested paths
results = engine.search("query", where={
    "user.role": {"$eq": "admin"},
    "user.verified": {"$eq": True}
})
# Multiple conditions (implicit AND)
results = engine.search("query", where={
    "category": {"$eq": "blog"},
    "published": {"$eq": True},
    "views": {"$gte": 100}
})
Metadata-Only Search
Search by metadata filters without vector similarity:
# Find all documents with specific metadata
docs = engine.search(
    query=None,  # No vector search
    where={"status": {"$eq": "published"}},
    limit=50
)
# Complex metadata queries
docs = engine.search(
    query=None,
    where=Q(category="tech") & Q(featured=True) & Q(score__gte=0.9),
    limit=100
)
Supported Operators
All backends support these universal operators:
| Operator | Description | Example |
|---|---|---|
| $eq | Equal to | {"age": {"$eq": 25}} or Q(age=25) |
| $ne | Not equal to | {"status": {"$ne": "inactive"}} or Q(status__ne="inactive") |
| $gt | Greater than | {"score": {"$gt": 0.8}} or Q(score__gt=0.8) |
| $gte | Greater than or equal | {"price": {"$gte": 100}} or Q(price__gte=100) |
| $lt | Less than | {"age": {"$lt": 18}} or Q(age__lt=18) |
| $lte | Less than or equal | {"priority": {"$lte": 5}} or Q(priority__lte=5) |
| $in | In array | {"role": {"$in": ["admin", "mod"]}} or Q(role__in=["admin", "mod"]) |
| $nin | Not in array | {"status": {"$nin": ["banned"]}} or Q(status__nin=["banned"]) |
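The intended meaning of each operator can be spelled out with a small in-memory evaluator. This is a reference sketch only: matches and OPERATORS are names invented here, real backends evaluate filters natively, and the treatment of missing fields is an assumption.

```python
import operator
from typing import Any, Callable, Dict

# One Python predicate per universal operator.
OPERATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata: Dict[str, Any], where: Dict[str, Dict[str, Any]]) -> bool:
    """Return True if `metadata` satisfies every condition (implicit AND)."""
    for field, conditions in where.items():
        if field not in metadata:
            return False  # assumption: a missing field never matches
        for op, expected in conditions.items():
            if not OPERATORS[op](metadata[field], expected):
                return False
    return True


doc = {"age": 25, "role": "admin", "status": "active"}
print(matches(doc, {"age": {"$gte": 18}, "role": {"$in": ["admin", "mod"]}}))  # True
print(matches(doc, {"status": {"$nin": ["active", "banned"]}}))  # False
```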
Configuration
Environment Variables
Create a .env file in your project root:
# OpenAI
OPENAI_API_KEY=sk-...
# Gemini
GOOGLE_API_KEY=AI...
# AstraDB
ASTRA_DB_APPLICATION_TOKEN=AstraCS:...
ASTRA_DB_API_ENDPOINT=https://...
ASTRA_DB_COLLECTION_NAME=vectors
# ChromaDB (Cloud)
CHROMA_API_KEY=...
CHROMA_TENANT=...
CHROMA_DATABASE=...
# ChromaDB (Self-hosted)
CHROMA_HOST=localhost
CHROMA_PORT=8000
# Milvus
MILVUS_API_ENDPOINT=https://...
MILVUS_API_KEY=...
# PgVector
PGVECTOR_HOST=localhost
PGVECTOR_PORT=5432
PGVECTOR_DBNAME=vector_db
PGVECTOR_USER=postgres
PGVECTOR_PASSWORD=postgres
# Vector settings
VECTOR_STORE_TEXT=true
VECTOR_METRIC=cosine
VECTOR_SEARCH_LIMIT=10
PRIMARY_KEY_MODE=uuid
LOG_LEVEL=INFO
Primary Key Strategies
CrossVector supports multiple primary key generation strategies:
from crossvector.settings import settings
# UUID (default) - random UUID
settings.PRIMARY_KEY_MODE = "uuid"
# Hash text - deterministic from text content
settings.PRIMARY_KEY_MODE = "hash_text"
# Hash vector - deterministic from vector values
settings.PRIMARY_KEY_MODE = "hash_vector"
# Sequential int64
settings.PRIMARY_KEY_MODE = "int64"
# Auto - hash text if available, else hash vector, else UUID
settings.PRIMARY_KEY_MODE = "auto"
# Custom factory function
settings.PRIMARY_KEY_FACTORY = "mymodule.generate_custom_id"
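As an illustration of a deterministic, hash-based key, the sketch below derives an ID from the text and falls back to the vector. The function name mirrors the dotted path in the example above, but its signature is an assumption: this README does not state which arguments CrossVector passes to a custom factory.

```python
import hashlib
from typing import Optional, Sequence


def generate_custom_id(text: Optional[str] = None,
                       vector: Optional[Sequence[float]] = None) -> str:
    """Return a deterministic hex ID derived from the text, else the vector.

    The (text, vector) signature is hypothetical.
    """
    if text is not None:
        payload = text.encode("utf-8")
    elif vector is not None:
        # repr() gives a stable textual form for Python floats.
        payload = ",".join(repr(float(x)) for x in vector).encode("utf-8")
    else:
        raise ValueError("need text or vector to derive an ID")
    return hashlib.sha256(payload).hexdigest()


first = generate_custom_id(text="Python is a programming language")
second = generate_custom_id(text="Python is a programming language")
print(first == second)  # True: same content, same ID
print(len(first))       # 64 hex characters
```

Deterministic IDs make repeated inserts of the same content map to the same key, which is what lets get_or_create and upsert detect duplicates without a separate lookup table.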
Backend-Specific Features
Backend Capabilities
Different backends have varying feature support:
| Feature | AstraDB | ChromaDB | Milvus | PgVector |
|---|---|---|---|---|
| Vector Search | ✅ | ✅ | ✅ | ✅ |
| Metadata-Only Search | ✅ | ✅ | ❌ | ✅ |
| Nested Metadata | ✅ | ✅* | ❌ | ✅ |
| Numeric Comparisons | ✅ | ✅ | ✅ | ✅ |
| Text Storage | ✅ | ✅ | ✅ | ✅ |
*ChromaDB supports nested metadata via dot-notation when metadata is flattened.
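Flattening here means collapsing nested dicts into a single level whose keys are dot-separated paths. A minimal sketch of that transformation follows; flatten is a hypothetical helper written for this example, and the adapter's actual implementation may differ.

```python
from typing import Any, Dict


def flatten(metadata: Dict[str, Any], prefix: str = "") -> Dict[str, Any]:
    """Collapse nested dicts into a single level with dot-separated keys."""
    flat: Dict[str, Any] = {}
    for key, value in metadata.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse, carrying the path built so far.
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


nested = {"user": {"role": "admin", "verified": True}, "views": 100}
print(flatten(nested))
# {'user.role': 'admin', 'user.verified': True, 'views': 100}
```

After flattening, a filter on "user.role" is an ordinary lookup on a top-level key, so a store that only supports flat metadata can still answer it.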
AstraDB
from crossvector.dbs.astradb import AstraDBAdapter
db = AstraDBAdapter()
engine = VectorEngine(embedding=embedding, db=db)
# Features:
# - Serverless, auto-scaling
# - Native JSON metadata support
# - Nested field queries with dot notation
# - Metadata-only search
ChromaDB
from crossvector.dbs.chroma import ChromaAdapter
# Cloud mode
db = ChromaAdapter() # Uses CHROMA_API_KEY from env
# Self-hosted mode
db = ChromaAdapter() # Uses CHROMA_HOST/PORT from env
# Local persistence mode
db = ChromaAdapter() # Uses CHROMA_PERSIST_DIR from env
engine = VectorEngine(embedding=embedding, db=db)
# Features:
# - Multiple deployment modes (cloud/HTTP/local)
# - Automatic client fallback
# - Flattened metadata with dot-notation support
Milvus
from crossvector.dbs.milvus import MilvusAdapter
db = MilvusAdapter()
engine = VectorEngine(embedding=embedding, db=db)
# Features:
# - High performance at scale
# - Automatic index creation
# - Boolean expression filters
# - Requires vector for all searches (no metadata-only)
PgVector
from crossvector.dbs.pgvector import PgVectorAdapter
db = PgVectorAdapter()
engine = VectorEngine(embedding=embedding, db=db)
# Features:
# - PostgreSQL extension
# - JSONB metadata storage
# - Nested field support with #>> operator
# - Automatic numeric type casting
# - Metadata-only search
# - Auto-creates database if missing
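To show how a universal filter might map onto JSONB with the #>> operator and numeric casting, here is a hedged sketch that builds a parameterized WHERE clause. compile_where, SQL_OPS and the metadata column name are assumptions for illustration and are not the adapter's actual compiler. Only the six comparison operators are handled.

```python
import re
from typing import Any, Dict, List, Tuple

SQL_OPS = {"$eq": "=", "$ne": "<>", "$gt": ">", "$gte": ">=", "$lt": "<", "$lte": "<="}


def compile_where(where: Dict[str, Dict[str, Any]],
                  column: str = "metadata") -> Tuple[str, List[Any]]:
    """Build a parameterized WHERE clause over a JSONB column."""
    clauses: List[str] = []
    params: List[Any] = []
    for field, conditions in where.items():
        # Field names are interpolated into SQL, so restrict them strictly.
        if not re.fullmatch(r"[A-Za-z0-9_]+(\.[A-Za-z0-9_]+)*", field):
            raise ValueError(f"unsafe field name: {field!r}")
        # "user.role" becomes the path literal {user,role} for #>>.
        path = "{" + field.replace(".", ",") + "}"
        for op, value in conditions.items():
            expr = f"({column} #>> '{path}')"
            # #>> yields text, so cast before comparing non-text values.
            if isinstance(value, bool):
                expr += "::boolean"
            elif isinstance(value, (int, float)):
                expr += "::numeric"
            clauses.append(f"{expr} {SQL_OPS[op]} %s")
            params.append(value)
    return " AND ".join(clauses), params


sql, params = compile_where({"user.role": {"$eq": "admin"}, "score": {"$gte": 0.8}})
print(sql)
# (metadata #>> '{user,role}') = %s AND (metadata #>> '{score}')::numeric >= %s
print(params)  # ['admin', 0.8]
```

Values travel separately as parameters (the %s placeholder style used by psycopg), so only validated field names ever reach the SQL string.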
Embedding Providers
OpenAI
from crossvector.embeddings.openai import OpenAIEmbeddingAdapter
# Default model (text-embedding-3-small, 1536 dims)
embedding = OpenAIEmbeddingAdapter()
# Larger model (text-embedding-3-large, 3072 dims)
embedding = OpenAIEmbeddingAdapter(model_name="text-embedding-3-large")
# Legacy model (text-embedding-ada-002, 1536 dims)
embedding = OpenAIEmbeddingAdapter(model_name="text-embedding-ada-002")
Gemini
from crossvector.embeddings.gemini import GeminiEmbeddingAdapter
# Default model (gemini-embedding-001)
embedding = GeminiEmbeddingAdapter()
# With custom dimensions (768, 1536, 3072)
embedding = GeminiEmbeddingAdapter(
    model_name="gemini-embedding-001",
    dim=1536
)
# With task type
embedding = GeminiEmbeddingAdapter(
    task_type="retrieval_document"  # or "retrieval_query", "semantic_similarity"
)
Error Handling
CrossVector provides structured exceptions with detailed context:
from crossvector.dbs.pgvector import PgVectorAdapter
from crossvector.exceptions import (
    DoesNotExist,
    MultipleObjectsReturned,
    DocumentExistsError,
    MissingFieldError,
    InvalidFieldError,
    CollectionNotFoundError,
    MissingConfigError,
)
# Catch specific errors
try:
    doc = engine.get(id="nonexistent")
except DoesNotExist as e:
    print(f"Document not found: {e.details}")
# Multiple results when expecting one
try:
    doc = engine.get(status="active")  # Multiple matches
except MultipleObjectsReturned as e:
    print(f"Multiple documents matched: {e.details}")
# Missing configuration
try:
    db = PgVectorAdapter()
except MissingConfigError as e:
    print(f"Missing config: {e.details['config_key']}")
    print(f"Hint: {e.details['hint']}")
# Invalid field or operator
try:
    results = engine.search("query", where={"field": {"$regex": "pattern"}})
except InvalidFieldError as e:
    print(f"Unsupported operator: {e.message}")
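The pattern of an exception that carries a message plus a details dict can be sketched in a few lines. DetailedError and MissingConfig below are toy classes written for this example. They mirror the attributes used above (message, details) but are not the classes defined in crossvector.exceptions.

```python
from typing import Any, Dict


class DetailedError(Exception):
    """Toy base exception carrying a message plus a `details` dict."""

    def __init__(self, message: str, **details: Any) -> None:
        super().__init__(message)
        self.message = message
        self.details: Dict[str, Any] = details


class MissingConfig(DetailedError):
    """Raised when a required setting is absent."""


try:
    raise MissingConfig(
        "PGVECTOR_HOST is not set",
        config_key="PGVECTOR_HOST",
        hint="Add PGVECTOR_HOST to your .env file",
    )
except DetailedError as exc:
    print(exc.message)                # PGVECTOR_HOST is not set
    print(exc.details["config_key"])  # PGVECTOR_HOST
```

Keeping context in a dict rather than in the message string lets callers branch on specific keys instead of parsing text.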
Logging
Configure logging via environment variable:
LOG_LEVEL=DEBUG # DEBUG, INFO, WARNING, ERROR, CRITICAL
Or programmatically:
from crossvector.settings import settings
settings.LOG_LEVEL = "DEBUG"
# Logs include:
# - Engine initialization
# - Embedding generation
# - Database operations
# - Query compilation
# - Error details
Testing
Real Environment Tests (Opt-in)
Integration tests that exercise real backends live under scripts/tests/ to avoid running in GitHub Actions by default.
- Location: scripts/tests/
- Run manually when services/credentials are available
Static defaults used in tests:
- AstraDB collection: test_crossvector
- Chroma collection: test_crossvector
- Milvus collection: test_crossvector
- PgVector table: test_crossvector
Run examples:
pytest scripts/tests -q
pytest scripts/tests/test_pgvector.py -q
Environment setup examples:
# OpenAI (embeddings)
export OPENAI_API_KEY=sk-...
export OPENAI_EMBEDDING_MODEL=text-embedding-3-small
# AstraDB
export ASTRA_DB_APPLICATION_TOKEN=AstraCS:...
export ASTRA_DB_API_ENDPOINT=https://...apps.astra.datastax.com
# Chroma (local/cloud)
export CHROMA_HOST=api.trychroma.com
export CHROMA_API_KEY=ck-...
export CHROMA_TENANT=...
export CHROMA_DATABASE=Test
# Milvus
export MILVUS_API_ENDPOINT=http://localhost:19530
export MILVUS_API_TOKEN=...
# PgVector
export PGVECTOR_HOST=localhost
export PGVECTOR_PORT=5432
export PGVECTOR_DBNAME=vectordb
export PGVECTOR_USER=postgres
export PGVECTOR_PASSWORD=postgres
Run tests with pytest:
# All tests
pytest tests/
# Specific test file
pytest tests/test_engine.py
# With coverage
pytest tests/ --cov=crossvector --cov-report=html
# Integration tests (requires real backends)
python scripts/backend.py --backend pgvector --embedding-provider openai
python scripts/backend.py --backend astradb --embedding-provider openai
python scripts/backend.py --backend milvus --embedding-provider openai
python scripts/backend.py --backend chroma --embedding-provider openai
Examples
Full CRUD Example
from crossvector import VectorEngine
from crossvector.embeddings.openai import OpenAIEmbeddingAdapter
from crossvector.dbs.astradb import AstraDBAdapter
from crossvector.querydsl.q import Q
# Initialize
engine = VectorEngine(
    embedding=OpenAIEmbeddingAdapter(),
    db=AstraDBAdapter(),
    collection_name="articles"
)
# Create
article1 = engine.create(
    text="Introduction to Python programming",
    metadata={"category": "tutorial", "level": "beginner", "views": 1500}
)
article2 = engine.create(
    text="Advanced machine learning techniques",
    metadata={"category": "tutorial", "level": "advanced", "views": 3200}
)
article3 = engine.create(
    text="Best practices for API design",
    metadata={"category": "guide", "level": "intermediate", "views": 2100}
)
# Search with filters
results = engine.search(
    "machine learning tutorials",
    where=Q(category="tutorial") & Q(level__in=["beginner", "intermediate"]),
    limit=5
)
# Update
article1.metadata["views"] = 2000
engine.update(article1)
# Batch update
updates = [
    {"id": article2.id, "metadata": {"featured": True}},
    {"id": article3.id, "metadata": {"featured": True}},
]
engine.bulk_update(updates)
# Get or create
doc, created = engine.get_or_create(
    text="Python best practices",
    metadata={"category": "guide", "level": "intermediate"}
)
# Delete
engine.delete(article1.id)
engine.delete([article2.id, article3.id])
# Count
total = engine.count()
print(f"Total articles: {total}")
Switching Backends
# Same code works across all backends - just swap the adapter
# PgVector
from crossvector.dbs.pgvector import PgVectorAdapter
engine = VectorEngine(embedding=embedding, db=PgVectorAdapter())
# ChromaDB
from crossvector.dbs.chroma import ChromaAdapter
engine = VectorEngine(embedding=embedding, db=ChromaAdapter())
# Milvus
from crossvector.dbs.milvus import MilvusAdapter
engine = VectorEngine(embedding=embedding, db=MilvusAdapter())
# AstraDB
from crossvector.dbs.astradb import AstraDBAdapter
engine = VectorEngine(embedding=embedding, db=AstraDBAdapter())
# All operations remain the same!
results = engine.search("query", limit=10)
Architecture
Component Overview
┌─────────────────────────────────────────────────────────────┐
│                      VectorEngine                           │
│  (Unified API, automatic embedding, flexible input)         │
└───────────────────┬──────────────────┬──────────────────────┘
                    │                  │
        ┌───────────▼──────────┐   ┌───▼──────────────────┐
        │  EmbeddingAdapter    │   │  VectorDBAdapter     │
        │  (OpenAI, Gemini)    │   │  (Astra, Chroma...)  │
        └──────────────────────┘   └──────────┬───────────┘
                                              │
                                   ┌──────────▼──────────┐
                                   │   WhereCompiler     │
                                   │ (Query DSL → SQL)   │
                                   └─────────────────────┘
Query Processing Flow
User Input (Q or dict)
↓
Normalize to Universal Dict Format
↓
Backend-Specific Compiler
↓
Native Filter (SQL, Milvus expr, Chroma dict)
↓
Database Query
↓
VectorDocument Results
Roadmap
- v1.0 Stable Release
  - API freeze and backwards compatibility guarantee
  - Production-ready documentation
  - Performance benchmarks
- Additional Backends
  - Pinecone
  - Weaviate
  - Qdrant
  - MongoDB
  - Elasticsearch
  - OpenSearch
- Enhanced Features
  - Hybrid search (vector + keyword)
  - Reranking support (Cohere, Jina)
  - Async/await support
  - Streaming search results
  - Pagination helpers
- Developer Experience
  - CLI tool for management
  - Migration utilities
  - Schema validation and linting
  - Interactive query builder
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
See CONTRIBUTING.md for detailed guidelines.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Changelog
See CHANGELOG.md for version history and migration guides.
Support
- Issues: GitHub Issues
- Documentation: GitHub Wiki
- Discussions: GitHub Discussions
Acknowledgments
- Built with Pydantic for validation
- Inspired by Django ORM's elegant API design
- Thanks to all vector database and embedding providers for their excellent SDKs
Made with ❤️ by the Two Farm