Simple embedded vector database for local AI development with automatic embeddings

VittoriaDB Python SDK

VittoriaDB is a simple, embedded, zero-configuration vector database designed for local AI development and production deployments. This Python SDK provides a clean, intuitive interface to VittoriaDB servers and manages the server binary for you.

🚀 Key Features

  • 🎯 Zero Configuration: Works immediately after installation with sensible defaults
  • 🤖 Automatic Embeddings: Server-side text vectorization with multiple model support
  • 📄 Document Processing: Built-in support for PDF, DOCX, TXT, MD, and HTML files
  • 🔧 Auto Binary Management: Automatically downloads and manages VittoriaDB binaries
  • ⚡ High Performance: HNSW indexing provides sub-millisecond search times
  • 🐍 Pythonic API: Clean, intuitive Python interface with type hints
  • 🔌 Dual Mode: Works with existing servers or auto-starts local instances

📦 Installation

pip install vittoriadb

The package automatically downloads the appropriate VittoriaDB binary for your platform during installation.

🚀 Quick Start

Basic Usage

import vittoriadb

# Auto-starts VittoriaDB server and connects
db = vittoriadb.connect()

# Create a collection
collection = db.create_collection(
    name="documents",
    dimensions=384,
    metric="cosine"
)

# Insert vectors with metadata
collection.insert(
    id="doc1",
    vector=[0.1, 0.2, 0.3] * 128,  # 384 dimensions
    metadata={"title": "My Document", "category": "tech"}
)

# Search for similar vectors
results = collection.search(
    vector=[0.1, 0.2, 0.3] * 128,
    limit=5,
    include_metadata=True
)

for result in results:
    print(f"ID: {result.id}, Score: {result.score:.4f}")
    print(f"Metadata: {result.metadata}")

# Close connection
db.close()
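
The `metric="cosine"` setting above ranks vectors by direction rather than length. As a rough illustration of the quantity involved, here is a plain-Python sketch of the textbook cosine similarity formula. It is not VittoriaDB's internal code; the function name `cosine_similarity` is made up for this example, and how the server turns this value into `result.score` is not stated in this README.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

query = [0.1, 0.2, 0.3] * 128      # the 384-dimensional vector used above
scaled = [2 * x for x in query]    # same direction, twice the length
print(cosine_similarity(query, scaled))  # very close to 1.0
```

Because only direction matters, scaling a vector does not change its similarity to the query.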

Automatic Text Embeddings (🚀 NEW!)

import vittoriadb
from vittoriadb.configure import Configure

# Connect to VittoriaDB
db = vittoriadb.connect()

# Create collection with automatic embeddings
collection = db.create_collection(
    name="smart_docs",
    dimensions=384,
    vectorizer_config=Configure.Vectors.auto_embeddings()  # 🎯 Server-side embeddings!
)

# Insert text directly - embeddings generated automatically!
collection.insert_text(
    id="article1",
    text="Artificial intelligence is transforming how we process data.",
    metadata={"category": "AI", "source": "blog"}
)

# Batch insert multiple texts
texts = [
    {
        "id": "article2",
        "text": "Machine learning enables computers to learn from data.",
        "metadata": {"category": "ML"}
    },
    {
        "id": "article3", 
        "text": "Vector databases provide efficient similarity search.",
        "metadata": {"category": "database"}
    }
]
collection.insert_text_batch(texts)

# Search with natural language queries
results = collection.search_text(
    query="artificial intelligence and machine learning",
    limit=3
)

for result in results:
    print(f"Score: {result.score:.4f}")
    print(f"Text: {result.metadata['text'][:100]}...")

db.close()

Document Upload and Processing

import vittoriadb
from vittoriadb.configure import Configure

db = vittoriadb.connect()

# Create collection with vectorizer for automatic processing
collection = db.create_collection(
    name="knowledge_base",
    dimensions=384,
    vectorizer_config=Configure.Vectors.auto_embeddings()
)

# Upload and process documents automatically
result = collection.upload_file(
    file_path="research_paper.pdf",
    chunk_size=600,
    chunk_overlap=100,
    metadata={"source": "research", "year": "2024"}
)

print(f"Processed {result['chunks_created']} chunks")
print(f"Inserted {result['chunks_inserted']} vectors")

# Search the uploaded content
results = collection.search_text(
    query="machine learning algorithms",
    limit=5
)

db.close()
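
The `chunk_size` and `chunk_overlap` arguments control how a document is split before embedding. The sketch below shows one common way such a sliding window works. It assumes sizes are counted in characters, which this README does not confirm, and the server's real splitter may behave differently (for example by respecting sentence boundaries). The function name `split_into_chunks` is hypothetical and not part of the SDK.

```python
def split_into_chunks(text, chunk_size=600, chunk_overlap=100):
    """Split text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample = "x" * 1500
print([len(chunk) for chunk in split_into_chunks(sample)])  # [600, 600, 500]
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of producing more chunks per document.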

๐ŸŽ›๏ธ Vectorizer Configuration

VittoriaDB supports multiple vectorizer backends for automatic embedding generation:

Sentence Transformers (Default)

from vittoriadb.configure import Configure

config = Configure.Vectors.sentence_transformers(
    model="all-MiniLM-L6-v2",
    dimensions=384
)

OpenAI Embeddings

config = Configure.Vectors.openai_embeddings(
    api_key="your-openai-api-key",
    model="text-embedding-ada-002",
    dimensions=1536
)

HuggingFace Models

config = Configure.Vectors.huggingface_embeddings(
    api_key="your-hf-token",  # Optional for public models
    model="sentence-transformers/all-MiniLM-L6-v2",
    dimensions=384
)

Local Ollama

config = Configure.Vectors.ollama_embeddings(
    model="nomic-embed-text",
    dimensions=768,
    base_url="http://localhost:11434"
)
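
Each configuration above pairs a model with a `dimensions` value, and the Quick Start examples give the collection the same number. A small lookup like the one below can catch a mismatch before a collection is created. This is only a sketch: `KNOWN_DIMENSIONS` and `check_dimensions` are hypothetical helpers, not part of the SDK, and the table contains only the model and dimension pairs quoted in this section.

```python
# Model names and sizes copied from the configuration examples in this README.
KNOWN_DIMENSIONS = {
    "all-MiniLM-L6-v2": 384,
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "text-embedding-ada-002": 1536,
    "nomic-embed-text": 768,
}

def check_dimensions(model, collection_dimensions):
    """Raise if the collection size differs from the size listed for the model."""
    if model not in KNOWN_DIMENSIONS:
        raise KeyError(f"no dimension listed for model {model!r}")
    expected = KNOWN_DIMENSIONS[model]
    if expected != collection_dimensions:
        raise ValueError(
            f"{model} is listed with {expected} dimensions, "
            f"but the collection uses {collection_dimensions}"
        )
    return expected

print(check_dimensions("nomic-embed-text", 768))  # 768
```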

📄 Document Processing

VittoriaDB supports automatic processing of various document formats:

Format      Extension  Status              Features
Plain Text  .txt       ✅ Fully Supported  Direct text processing
Markdown    .md        ✅ Fully Supported  Frontmatter parsing
HTML        .html      ✅ Fully Supported  Tag stripping, metadata
PDF         .pdf       ✅ Fully Supported  Multi-page text extraction
DOCX        .docx      ✅ Fully Supported  Properties, text extraction

# Upload multiple document types
for file_path in ["doc.pdf", "guide.docx", "readme.md"]:
    result = collection.upload_file(
        file_path=file_path,
        chunk_size=500,
        metadata={"batch": "docs_2024"}
    )
    print(f"Processed {file_path}: {result['chunks_inserted']} chunks")
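
Before looping over a mixed folder, it can help to separate files whose extensions appear in the format list above from those that do not. The sketch below does that with the standard library only. `SUPPORTED_EXTENSIONS` and `split_by_support` are hypothetical names, and the case-insensitive match is an assumption; this README does not say how the server treats upper-case extensions.

```python
from pathlib import Path

# Extensions copied from the format list in this README.
SUPPORTED_EXTENSIONS = {".txt", ".md", ".html", ".pdf", ".docx"}

def split_by_support(paths):
    """Return (supported, skipped) lists based on each path's file extension."""
    supported, skipped = [], []
    for path in paths:
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            skipped.append(path)
    return supported, skipped

supported, skipped = split_by_support(
    ["doc.pdf", "guide.DOCX", "readme.md", "slides.pptx"]
)
print(supported)  # ['doc.pdf', 'guide.DOCX', 'readme.md']
print(skipped)    # ['slides.pptx']
```

The `supported` list can then be passed to the `upload_file` loop shown above.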

🔧 Advanced Configuration

Collection Configuration

# High-performance HNSW configuration
collection = db.create_collection(
    name="large_dataset",
    dimensions=1536,
    metric="cosine",
    index_type="hnsw",
    config={
        "m": 32,                # HNSW connections per node
        "ef_construction": 400,  # Construction search width
        "ef_search": 100        # Search width
    },
    vectorizer_config=Configure.Vectors.openai_embeddings(api_key="your-key")
)

Connection Options

# Connect to existing server
db = vittoriadb.connect(
    url="http://localhost:8080",
    auto_start=False
)

# Auto-start with custom configuration
db = vittoriadb.connect(
    auto_start=True,
    port=9090,
    data_dir="./my_vectors"
)
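
The examples in this README call `db.close()` at the end of the script, which is skipped if an exception is raised earlier. One way to guarantee the call is `contextlib.closing` from the standard library, which works with any object that has a `close()` method. The helper below is a sketch: `run_with_db` is a hypothetical name, and it takes the connect function as an argument so that it does not depend on any SDK behaviour beyond `close()`.

```python
from contextlib import closing

def run_with_db(connect, work, **connect_kwargs):
    """Open a connection, run work(db), and close the connection even on error."""
    with closing(connect(**connect_kwargs)) as db:
        return work(db)
```

With the SDK installed, a call might look like `run_with_db(vittoriadb.connect, lambda db: db.health(), auto_start=True, port=9090)`.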

Search with Filtering

# Search with metadata filters
results = collection.search(
    vector=query_vector,
    limit=10,
    filter={"category": "technology", "year": 2024},
    include_metadata=True
)

# Text search with filters
results = collection.search_text(
    query="machine learning",
    limit=5,
    filter={"source": "research"}
)
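
To make the filter examples concrete, the sketch below shows what an exact-match metadata filter would do, assuming that every key in the filter must be present with an equal value. The server's actual matching rules, including whether it converts between types, are not described in this README, and `matches_filter` is a hypothetical name.

```python
def matches_filter(metadata, conditions):
    """Return True when every key in conditions is in metadata with an equal value."""
    return all(
        key in metadata and metadata[key] == value
        for key, value in conditions.items()
    )

record = {"category": "technology", "year": 2024, "source": "research"}
print(matches_filter(record, {"category": "technology", "year": 2024}))  # True
print(matches_filter(record, {"year": "2024"}))  # False: 2024 is not "2024"
print(matches_filter(record, {}))  # True: an empty filter matches everything
```

Under strict equality the type matters: the upload example earlier stores `"year": "2024"` as a string, while the filter above uses the integer `2024`, so it is worth keeping metadata types consistent between insert and search.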

📊 Performance and Scalability

  • Insert Speed: >10,000 vectors/second with flat indexing, >5,000 with HNSW
  • Search Speed: Sub-millisecond search times for 1M vectors using HNSW
  • Memory Usage: <100MB for 100,000 vectors (384 dimensions)
  • Scalability: Tested up to 1 million vectors, supports up to 2,048 dimensions

๐Ÿ› ๏ธ Development

Installation for Development

git clone https://github.com/antonellof/VittoriaDB.git
cd VittoriaDB/sdk/python

# Install in development mode
pip install -e .

# Or use the development script
./install-dev.sh

Building and Publishing

🚀 One-Command Deploy:

# Deploy to Test PyPI
./deploy.sh test

# Deploy to Production PyPI
./deploy.sh

The deploy script automatically:

  • Cleans build artifacts
  • Installs build dependencies
  • Builds the package
  • Validates the package
  • Uploads to PyPI

📋 API Reference

VittoriaDB Class

  • connect(url=None, auto_start=True, **kwargs) - Connect to VittoriaDB
  • create_collection(name, dimensions, metric="cosine", vectorizer_config=None) - Create collection
  • get_collection(name) - Get existing collection
  • list_collections() - List all collections
  • delete_collection(name) - Delete collection
  • health() - Get server health status
  • close() - Close connection

Collection Class

  • insert(id, vector, metadata=None) - Insert single vector
  • insert_batch(vectors) - Insert multiple vectors
  • insert_text(id, text, metadata=None) - Insert text (auto-vectorized)
  • insert_text_batch(texts) - Insert multiple texts (auto-vectorized)
  • search(vector, limit=10, filter=None) - Vector similarity search
  • search_text(query, limit=10, filter=None) - Text search (auto-vectorized)
  • upload_file(file_path, chunk_size=500, **kwargs) - Upload and process document
  • get(id) - Get vector by ID
  • delete(id) - Delete vector by ID
  • count() - Get total vector count
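
For large imports, the batch methods listed above can be fed in fixed-size slices instead of one very large list. The helper below is a generic sketch; `in_batches` is a hypothetical name, and whether the server imposes a maximum batch size is not stated in this README.

```python
def in_batches(items, batch_size):
    """Yield consecutive slices of items, each with at most batch_size entries."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Records in the same shape as the insert_text_batch example in the Quick Start.
texts = [
    {"id": f"article{i}", "text": f"Sample text {i}", "metadata": {"category": "demo"}}
    for i in range(7)
]
print([len(batch) for batch in in_batches(texts, 3)])  # [3, 3, 1]
```

Each slice could then be passed on with `collection.insert_text_batch(batch)` inside a loop.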

๐Ÿค Contributing

We welcome contributions! Please see our Contributing Guide for details.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🚀 What's Next?

  • 🔍 Hybrid Search: Combine vector and keyword search
  • 🔐 Authentication: User management and access control
  • 🌐 Distributed Mode: Multi-node clustering support
  • 📊 Analytics: Query performance monitoring and optimization
  • 🎯 More Vectorizers: Support for additional embedding models

Happy building with VittoriaDB! 🚀
