# VittoriaDB Python SDK
VittoriaDB is a simple, embedded, zero-configuration vector database designed for local AI development and production deployments. This Python SDK provides a clean, intuitive interface to interact with VittoriaDB servers with automatic binary management.
## Key Features

- **Zero Configuration**: Works immediately after installation with sensible defaults
- **Automatic Embeddings**: Server-side text vectorization with multiple model support
- **Document Processing**: Built-in support for PDF, DOCX, TXT, MD, and HTML files
- **Auto Binary Management**: Automatically downloads and manages VittoriaDB binaries
- **High Performance**: HNSW indexing provides sub-millisecond search times
- **Pythonic API**: Clean, intuitive Python interface with type hints
- **Dual Mode**: Works with existing servers or auto-starts local instances
## Installation

```bash
pip install vittoriadb
```

The package automatically downloads the appropriate VittoriaDB binary for your platform during installation.
## Quick Start

### Basic Usage
```python
import vittoriadb

# Auto-starts VittoriaDB server and connects
db = vittoriadb.connect()

# Create a collection
collection = db.create_collection(
    name="documents",
    dimensions=384,
    metric="cosine"
)

# Insert vectors with metadata
collection.insert(
    id="doc1",
    vector=[0.1, 0.2, 0.3] * 128,  # 384 dimensions
    metadata={"title": "My Document", "category": "tech"}
)

# Search for similar vectors
results = collection.search(
    vector=[0.1, 0.2, 0.3] * 128,
    limit=5,
    include_metadata=True
)

for result in results:
    print(f"ID: {result.id}, Score: {result.score:.4f}")
    print(f"Metadata: {result.metadata}")

# Close connection
db.close()
```
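The `metric="cosine"` setting above means result scores are cosine similarities, where identical directions score 1.0. A minimal pure-Python illustration of how such a score is computed (not VittoriaDB's actual implementation, just the math behind the metric):

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors divided by the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

v1 = [0.1, 0.2, 0.3] * 128  # 384 dimensions, as in the example above
v2 = [0.3, 0.2, 0.1] * 128

print(cosine_similarity(v1, v1))  # identical vectors score ~1.0
print(cosine_similarity(v1, v2))  # similar but rotated vectors score lower
```

Higher scores mean closer vectors; this is why the search results above are sorted by `result.score` descending.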
### Automatic Text Embeddings (NEW!)
```python
import vittoriadb
from vittoriadb.configure import Configure

# Connect to VittoriaDB
db = vittoriadb.connect()

# Create collection with automatic embeddings
collection = db.create_collection(
    name="smart_docs",
    dimensions=384,
    vectorizer_config=Configure.Vectors.auto_embeddings()  # Server-side embeddings!
)

# Insert text directly - embeddings generated automatically!
collection.insert_text(
    id="article1",
    text="Artificial intelligence is transforming how we process data.",
    metadata={"category": "AI", "source": "blog"}
)

# Batch insert multiple texts
texts = [
    {
        "id": "article2",
        "text": "Machine learning enables computers to learn from data.",
        "metadata": {"category": "ML"}
    },
    {
        "id": "article3",
        "text": "Vector databases provide efficient similarity search.",
        "metadata": {"category": "database"}
    }
]
collection.insert_text_batch(texts)

# Search with natural language queries
results = collection.search_text(
    query="artificial intelligence and machine learning",
    limit=3
)

for result in results:
    print(f"Score: {result.score:.4f}")
    print(f"Text: {result.metadata['text'][:100]}...")

db.close()
```
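Conceptually, `search_text` embeds the query server-side and then runs an ordinary vector search over the stored embeddings. The toy sketch below illustrates that two-step pipeline with a fake hash-based "embedding" (`toy_embed` is purely illustrative; real embeddings come from the configured vectorizer and capture semantics, not character overlap):

```python
import hashlib
import math

def toy_embed(text, dims=32):
    # Fake embedding: hash character trigrams into a fixed-size vector.
    # Stands in for the server-side model in this illustration only.
    vec = [0.0] * dims
    lowered = text.lower()
    for i in range(len(lowered) - 2):
        bucket = int(hashlib.md5(lowered[i:i + 3].encode()).hexdigest(), 16) % dims
        vec[bucket] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def toy_search_text(query, docs, limit=3):
    # Step 1: embed the query. Step 2: rank stored docs by cosine similarity.
    q = toy_embed(query)
    scored = []
    for doc_id, text in docs.items():
        d = toy_embed(text)
        score = sum(x * y for x, y in zip(q, d))  # vectors are unit-norm
        scored.append((doc_id, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

docs = {
    "article2": "Machine learning enables computers to learn from data.",
    "article3": "Vector databases provide efficient similarity search.",
}
print(toy_search_text("machine learning", docs, limit=2))
```

The server does the same dance with real model embeddings, which is why a query like "artificial intelligence and machine learning" can match text that shares no exact keywords.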
### Document Upload and Processing
```python
import vittoriadb
from vittoriadb.configure import Configure

db = vittoriadb.connect()

# Create collection with vectorizer for automatic processing
collection = db.create_collection(
    name="knowledge_base",
    dimensions=384,
    vectorizer_config=Configure.Vectors.auto_embeddings()
)

# Upload and process documents automatically
result = collection.upload_file(
    file_path="research_paper.pdf",
    chunk_size=600,
    chunk_overlap=100,
    metadata={"source": "research", "year": "2024"}
)

print(f"Processed {result['chunks_created']} chunks")
print(f"Inserted {result['chunks_inserted']} vectors")

# Search the uploaded content
results = collection.search_text(
    query="machine learning algorithms",
    limit=5
)

db.close()
```
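Chunking happens server-side, but the `chunk_size`/`chunk_overlap` arithmetic is easy to reason about: each chunk advances by `chunk_size - chunk_overlap` characters, so consecutive chunks share an overlap. This pure-Python sliding-window sketch is an assumption about the exact algorithm, shown for intuition only:

```python
def sliding_chunks(text, chunk_size=600, chunk_overlap=100):
    # Each new chunk starts chunk_size - chunk_overlap characters after
    # the previous one, so adjacent chunks share chunk_overlap characters.
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks

text = "x" * 1500
chunks = sliding_chunks(text)
print(len(chunks))      # 3 chunks: [0:600], [500:1100], [1000:1500]
print(len(chunks[-1]))  # last chunk is 500 characters
```

The overlap keeps sentences that straddle a chunk boundary searchable from both sides, at the cost of some duplicated storage.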
## Vectorizer Configuration

VittoriaDB supports multiple vectorizer backends for automatic embedding generation:

### Sentence Transformers (Default)

```python
from vittoriadb.configure import Configure

config = Configure.Vectors.sentence_transformers(
    model="all-MiniLM-L6-v2",
    dimensions=384
)
```
### OpenAI Embeddings

```python
config = Configure.Vectors.openai_embeddings(
    api_key="your-openai-api-key",
    model="text-embedding-ada-002",
    dimensions=1536
)
```
### HuggingFace Models

```python
config = Configure.Vectors.huggingface_embeddings(
    api_key="your-hf-token",  # Optional for public models
    model="sentence-transformers/all-MiniLM-L6-v2",
    dimensions=384
)
```
### Local Ollama

```python
config = Configure.Vectors.ollama_embeddings(
    model="nomic-embed-text",
    dimensions=768,
    base_url="http://localhost:11434"
)
```
## Document Processing

VittoriaDB supports automatic processing of various document formats:

| Format | Extension | Status | Features |
|---|---|---|---|
| Plain Text | `.txt` | ✅ Fully Supported | Direct text processing |
| Markdown | `.md` | ✅ Fully Supported | Frontmatter parsing |
| HTML | `.html` | ✅ Fully Supported | Tag stripping, metadata |
| PDF | `.pdf` | ✅ Fully Supported | Multi-page text extraction |
| DOCX | `.docx` | ✅ Fully Supported | Properties, text extraction |
```python
# Upload multiple document types
for file_path in ["doc.pdf", "guide.docx", "readme.md"]:
    result = collection.upload_file(
        file_path=file_path,
        chunk_size=500,
        metadata={"batch": "docs_2024"}
    )
    print(f"Processed {file_path}: {result['chunks_inserted']} chunks")
```
## Advanced Configuration

### Collection Configuration

```python
# High-performance HNSW configuration
collection = db.create_collection(
    name="large_dataset",
    dimensions=1536,
    metric="cosine",
    index_type="hnsw",
    config={
        "m": 32,                 # HNSW connections per node
        "ef_construction": 400,  # Construction search width
        "ef_search": 100         # Search width
    },
    vectorizer_config=Configure.Vectors.openai_embeddings(api_key="your-key")
)
```
### Connection Options

```python
# Connect to an existing server
db = vittoriadb.connect(
    url="http://localhost:8080",
    auto_start=False
)

# Auto-start with custom configuration
db = vittoriadb.connect(
    auto_start=True,
    port=9090,
    data_dir="./my_vectors"
)
```
### Search with Filtering

```python
# Search with metadata filters
results = collection.search(
    vector=query_vector,
    limit=10,
    filter={"category": "technology", "year": 2024},
    include_metadata=True
)

# Text search with filters
results = collection.search_text(
    query="machine learning",
    limit=5,
    filter={"source": "research"}
)
```
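Judging from the examples above, metadata filters behave as exact-match conditions combined with AND; that reading is an assumption here, so check the server docs for richer operators. A pure-Python sketch of that matching rule:

```python
def matches(metadata, conditions):
    # A record matches when every filter key is present in its metadata
    # with exactly the same value (AND semantics across conditions).
    return all(metadata.get(key) == value for key, value in conditions.items())

records = [
    {"id": "a", "metadata": {"category": "technology", "year": 2024}},
    {"id": "b", "metadata": {"category": "technology", "year": 2023}},
    {"id": "c", "metadata": {"category": "science", "year": 2024}},
]

hits = [r["id"] for r in records
        if matches(r["metadata"], {"category": "technology", "year": 2024})]
print(hits)  # ['a']
```

Note that a filter narrows the candidate set; similarity ranking still applies within the records that pass it.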
## Performance and Scalability

- **Insert Speed**: >10,000 vectors/second with flat indexing, >5,000/second with HNSW
- **Search Speed**: Sub-millisecond search times for 1M vectors using HNSW
- **Memory Usage**: <100MB for 100,000 vectors (384 dimensions)
- **Scalability**: Tested up to 1 million vectors; supports up to 2,048 dimensions
## Development

### Installation for Development

```bash
git clone https://github.com/antonellof/VittoriaDB.git
cd VittoriaDB/sdk/python

# Install in development mode
pip install -e .

# Or use the development script
./install-dev.sh
```
### Building and Publishing

**One-Command Deploy:**

```bash
# Deploy to Test PyPI
./deploy.sh test

# Deploy to Production PyPI
./deploy.sh
```
The deploy script automatically:
- Cleans build artifacts
- Installs build dependencies
- Builds the package
- Validates the package
- Uploads to PyPI
## API Reference

### VittoriaDB Class

- `connect(url=None, auto_start=True, **kwargs)` - Connect to VittoriaDB
- `create_collection(name, dimensions, metric="cosine", vectorizer_config=None)` - Create a collection
- `get_collection(name)` - Get an existing collection
- `list_collections()` - List all collections
- `delete_collection(name)` - Delete a collection
- `health()` - Get server health status
- `close()` - Close the connection

### Collection Class

- `insert(id, vector, metadata=None)` - Insert a single vector
- `insert_batch(vectors)` - Insert multiple vectors
- `insert_text(id, text, metadata=None)` - Insert text (auto-vectorized)
- `insert_text_batch(texts)` - Insert multiple texts (auto-vectorized)
- `search(vector, limit=10, filter=None)` - Vector similarity search
- `search_text(query, limit=10, filter=None)` - Text search (auto-vectorized)
- `upload_file(file_path, chunk_size=500, **kwargs)` - Upload and process a document
- `get(id)` - Get a vector by ID
- `delete(id)` - Delete a vector by ID
- `count()` - Get the total vector count
## Contributing
We welcome contributions! Please see our Contributing Guide for details.
## License
This project is licensed under the MIT License - see the LICENSE file for details.
## Links
- Documentation: https://vittoriadb.dev
- GitHub: https://github.com/antonellof/VittoriaDB
- PyPI: https://pypi.org/project/vittoriadb/
- Issues: https://github.com/antonellof/VittoriaDB/issues
## What's Next?

- **Hybrid Search**: Combine vector and keyword search
- **Authentication**: User management and access control
- **Distributed Mode**: Multi-node clustering support
- **Analytics**: Query performance monitoring and optimization
- **More Vectorizers**: Support for additional embedding models
Happy building with VittoriaDB!
## File Details

### vittoriadb-0.1.0.tar.gz (Source)

- Size: 17.7 kB
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.5

| Algorithm | Hash digest |
|---|---|
| SHA256 | `032fa1cfcff30b7429d81385b8aea981cc94866eef42f6f5ba207d5c541b373d` |
| MD5 | `7ecec56dbdf88465a5dd2b1bf6f77924` |
| BLAKE2b-256 | `68a03770aee5a0462f3ba99db568b6abb92300c6903b984d71b66b57ca2faaff` |

### vittoriadb-0.1.0-py3-none-any.whl (Python 3)

- Size: 14.5 kB
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.5

| Algorithm | Hash digest |
|---|---|
| SHA256 | `0b64e557a02bb591faa9aedd4fd445995a6ab0d8134c560a99c4210456cfe22b` |
| MD5 | `2441b33a0dd7eac0b7f8c11016f87a6b` |
| BLAKE2b-256 | `0eec1b17a65191e92cfcefba9c457d3ede19afdc431da83eab8e98bf98fad0b8` |