
Valori

Author: Varshith
Team: Valori
Contact: varshith.gudur17@gmail.com


A high-performance vector database library for Python that provides efficient storage, indexing, and search capabilities for high-dimensional vectors.

[!IMPORTANT] Valori is in early development, so bugs and breaking changes are expected. Please use the issues page to report bugs or request features.

Features

  • 🚀 High Performance: Optimized for speed with multiple indexing algorithms
  • 📄 Document Parsing: Support for PDF, Office, text, and advanced parsing with Docling
  • 🔄 Processing Pipeline: Complete document processing with cleaning, chunking, and embedding
  • 💾 Multiple Storage Backends: Memory, disk, and hybrid storage options
  • 🔍 Advanced Indexing: Flat, HNSW, and IVF indices for different use cases
  • 🗜️ Vector Quantization: Scalar and product quantization for memory efficiency
  • 💾 Persistence: Tensor-based and incremental persistence strategies
  • 🏭 Production Ready: Comprehensive logging, monitoring, and error handling
  • 🐍 Python Native: Pure Python implementation with NumPy integration
  • 📊 Extensible: Plugin architecture for custom components

Installation

Install Valori using pip:

pip install valori

Or install from source:

git clone https://github.com/varshith-Git/valori.git
cd valori
pip install -e .

Quick Start

import numpy as np
from valori import VectorDBClient
from valori.storage import MemoryStorage
from valori.indices import FlatIndex
from valori.processors import ProcessingPipeline

# Create components
storage = MemoryStorage({})
index = FlatIndex({"metric": "cosine"})

# Create client
client = VectorDBClient(storage, index)
client.initialize()

# Process documents
pipeline_config = {
    "parsers": {"text": {"chunk_size": 1000}},
    "processors": {
        "cleaning": {"normalize_whitespace": True},
        "chunking": {"strategy": "semantic"},
        "embedding": {"model_name": "sentence-transformers/all-MiniLM-L6-v2"}
    }
}

pipeline = ProcessingPipeline(pipeline_config)
pipeline.initialize()

# Process a document
result = pipeline.process_document("document.pdf")
embedding = np.array(result["embedding"]).reshape(1, -1)

# Store in vector database
inserted_ids = client.insert(embedding, [result["metadata"]])

# Search for similar documents
query_text = "machine learning"
query_result = pipeline.process_text(query_text)
query_embedding = np.array(query_result["embedding"])

results = client.search(query_embedding, k=5)
for i, result in enumerate(results):
    print(f"{i+1}. Document: {result['metadata']['file_name']}")

# Clean up
client.close()
pipeline.close()

Components

Storage Backends

Memory Storage: Fast but not persistent

from valori.storage import MemoryStorage
storage = MemoryStorage({})

Disk Storage: Persistent but slower

from valori.storage import DiskStorage
storage = DiskStorage({"data_dir": "./my_vectordb"})

Hybrid Storage: Combines memory and disk for optimal performance

from valori.storage import HybridStorage
storage = HybridStorage({
    "memory": {},
    "disk": {"data_dir": "./my_vectordb"},
    "memory_limit": 10000
})

Index Types

Flat Index: Exhaustive search, accurate but slower for large datasets

from valori.indices import FlatIndex
index = FlatIndex({"metric": "cosine"})  # or "euclidean"
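
Under the hood, a flat index is simply an exhaustive scan over every stored vector. The following sketch in plain NumPy (illustrative only, independent of valori's internals) shows what a cosine-metric flat search computes:

```python
import numpy as np

def flat_cosine_search(query, vectors, k=5):
    """Brute-force cosine search: score the query against every stored vector."""
    # Normalize rows so a dot product equals cosine similarity
    v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = v @ q
    top = np.argsort(-scores)[:k]  # indices of the k highest similarities
    return list(zip(top.tolist(), scores[top].tolist()))

rng = np.random.default_rng(0)
data = rng.normal(size=(100, 8)).astype(np.float32)
hits = flat_cosine_search(data[3], data, k=3)
# The best match for data[3] is data[3] itself, with similarity ~1.0
```

Because every vector is scored, results are exact, but query time grows linearly with the dataset; that is the trade-off the approximate indices below address.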

HNSW Index: Fast approximate search for large datasets

from valori.indices import HNSWIndex
index = HNSWIndex({
    "metric": "cosine",
    "m": 16,
    "ef_construction": 200,
    "ef_search": 50
})

IVF Index: Clustering-based index for large datasets

from valori.indices import IVFIndex
index = IVFIndex({
    "metric": "cosine",
    "n_clusters": 100,
    "n_probes": 10
})

LSH Index: Locality sensitive hashing for high-dimensional data

from valori.indices import LSHIndex
index = LSHIndex({
    "metric": "cosine",
    "num_hash_tables": 10,
    "hash_size": 16,
    "num_projections": 64,
    "threshold": 0.3
})

Annoy Index: Approximate nearest neighbors with random projection trees

from valori.indices import AnnoyIndex
index = AnnoyIndex({
    "metric": "angular",
    "num_trees": 10,
    "search_k": -1
})

# Add vectors, then build
index.add(vectors, metadata)
index.build()  # Required for Annoy

Document Parsing

Parse various document formats:

Text and PDF Parsing:

from valori.parsers import TextParser, PDFParser

# Parse text files
text_parser = TextParser({"encoding": "auto", "chunk_size": 1000})
result = text_parser.parse("document.txt")

# Parse PDF files
pdf_parser = PDFParser({"extract_tables": True, "chunk_size": 1000})
result = pdf_parser.parse("document.pdf")

Advanced Parsing with Docling:

from valori.parsers import DoclingParser

# Docling for advanced, layout-aware parsing
docling_parser = DoclingParser({"extract_tables": True, "preserve_layout": True})

Document Processing Pipeline

Complete Processing Pipeline:

from valori.processors import ProcessingPipeline

pipeline_config = {
    "parsers": {"text": {"chunk_size": 1000}},
    "processors": {
        "cleaning": {"normalize_whitespace": True, "remove_html": True},
        "chunking": {"strategy": "semantic", "chunk_size": 1000},
        "embedding": {"model_name": "sentence-transformers/all-MiniLM-L6-v2"}
    }
}

pipeline = ProcessingPipeline(pipeline_config)
pipeline.initialize()

# Process document end-to-end
result = pipeline.process_document("document.pdf")

Quantization

Reduce memory usage with vector quantization:

Scalar Quantization:

from valori.quantization import ScalarQuantizer
quantizer = ScalarQuantizer({"bits": 8})
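
To illustrate the idea in plain NumPy (a sketch of the technique, not valori's internals): 8-bit scalar quantization maps each float32 value onto one of 256 levels between the observed minimum and maximum, storing one byte per value instead of four:

```python
import numpy as np

def scalar_quantize(x, bits=8):
    """Map float32 values onto 2**bits integer levels (min/max calibration)."""
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** bits - 1
    codes = np.round((x - lo) / (hi - lo) * levels).astype(np.uint8)
    return codes, lo, hi

def scalar_dequantize(codes, lo, hi, bits=8):
    """Reconstruct approximate floats from the integer codes."""
    levels = 2 ** bits - 1
    return codes.astype(np.float32) / levels * (hi - lo) + lo

x = np.linspace(-1.0, 1.0, 512, dtype=np.float32)
codes, lo, hi = scalar_quantize(x)
x_hat = scalar_dequantize(codes, lo, hi)
# codes take 1 byte per value instead of 4; reconstruction error is
# bounded by half a quantization step, (hi - lo) / 255 / 2
```

The cost is a small, bounded reconstruction error, which is usually acceptable for similarity search.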

Product Quantization:

from valori.quantization import ProductQuantizer
quantizer = ProductQuantizer({"m": 8, "k": 256})

SAQ Quantization:

from valori.quantization import SAQQuantizer
quantizer = SAQQuantizer({
    "total_bits": 128,
    "n_segments": 8,
    "adjustment_iters": 3,
    "rescore_top_k": 50
})

Advanced Usage

Complete Setup with All Components

import numpy as np
from valori import VectorDBClient
from valori.storage import HybridStorage
from valori.indices import HNSWIndex
from valori.quantization import ProductQuantizer
from valori.persistence import TensorPersistence

# Create all components
storage = HybridStorage({
    "memory": {},
    "disk": {"data_dir": "./vectordb_data"},
    "memory_limit": 10000
})

index = HNSWIndex({
    "metric": "cosine",
    "m": 32,
    "ef_construction": 400,
    "ef_search": 100
})

quantizer = ProductQuantizer({
    "m": 16,
    "k": 256
})

# Alternatively, swap in a SAQQuantizer here (see Quantization above)

persistence = TensorPersistence({
    "data_dir": "./vectordb_persistence",
    "compression": True
})

# Create client
client = VectorDBClient(storage, index, quantizer, persistence)
client.initialize()

# Your vector operations here...
client.close()

Production Setup

import json
from valori.utils.logging import setup_logging

# Setup logging
setup_logging({
    "level": "INFO",
    "log_to_file": True,
    "log_file": "Valori.log"
})

# Load configuration
with open("config.json", "r") as f:
    config = json.load(f)

# Initialize with production config
client = VectorDBClient.from_config(config)
client.initialize()

# Your production code here...
client.close()

Examples

Check out the examples/ directory for comprehensive examples:

  • basic_usage.py - Basic operations and concepts
  • document_processing.py - Complete document parsing and processing workflow
  • advanced_indexing.py - LSH and Annoy indexing algorithms comparison
  • advanced_quantization.py - Quantization techniques and performance
  • production_setup.py - Production deployment and monitoring

Documentation

Full documentation is included in the docs/ folder of this repository. Key entry points:

  • Getting started (tutorial): docs/getting_started.rst
  • Quickstart guide: docs/quickstart.rst
  • API reference: docs/api.rst

If a documentation site is published for this project, it will be linked from the project landing page. To build the docs locally:

cd docs
make html
# Output will be in docs/_build/html

You can also open the source rst files directly in the repo if you prefer to read them without building HTML.

Development

Setup Development Environment

# Clone the repository
git clone https://github.com/varshith-Git/valori.git
cd valori

# Setup development environment
bash scripts/install_dev.sh

# Activate virtual environment
source venv/bin/activate

Running Tests

# Run all tests
bash scripts/run_tests.sh

# Run with coverage
bash scripts/run_tests.sh --coverage

# Run specific tests
bash scripts/run_tests.sh tests/test_storage.py

Code Quality

# Format code
black src/ tests/

# Lint code
flake8 src/ tests/

# Type checking
mypy src/

# Security checks
safety check
bandit -r src/

Building Documentation

cd docs
make html

Benchmarking

# Run benchmarks
python scripts/benchmark.py

# Quick benchmarks
python scripts/benchmark.py --quick

Performance

Valori is designed for high performance:

  • Memory Efficiency: Up to 75% memory reduction with quantization
  • Search Speed: Sub-millisecond search times for small datasets
  • Scalability: Handles millions of vectors with appropriate indexing
  • Flexibility: Choose the right components for your use case
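
The 75% figure follows directly from code widths: 8-bit quantization replaces each 4-byte float32 component with a 1-byte code. A back-of-envelope check (corpus size here is a hypothetical example; 384 is the dimensionality of the all-MiniLM-L6-v2 embeddings used above):

```python
# One million 384-dimensional embeddings, float32 vs. 8-bit codes
n, dim = 1_000_000, 384
float32_bytes = n * dim * 4   # raw float32 storage
uint8_bytes = n * dim * 1     # 8-bit scalar-quantized codes
reduction = 1 - uint8_bytes / float32_bytes
print(f"{float32_bytes / 2**30:.2f} GiB -> {uint8_bytes / 2**30:.2f} GiB "
      f"({reduction:.0%} smaller)")
# → 1.43 GiB -> 0.36 GiB (75% smaller)
```

Product quantization and SAQ can push the ratio further by encoding whole sub-vectors rather than individual components.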

Contributing

We welcome contributions! Please see our Contributing Guide for details.

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Support

For questions, bug reports, or feature requests, please open an issue on the GitHub issues page or email varshith.gudur17@gmail.com.

Roadmap

  • GPU acceleration support
  • Distributed deployment
  • More indexing algorithms (LSH and Annoy are now available)
  • REST API server
  • Web UI for database management
  • Integration with popular ML frameworks

Citation

If you use Valori in your research, please cite:

@software{valori2025,
  title={Valori: A High-Performance Vector Database for Python},
  author={Varshith},
  year={2025},
  url={https://github.com/varshith-Git/valori}
}

Made with ❤️ by the Valori Team
