
VecStream


A lightweight, efficient vector database with similarity search capabilities, designed for machine learning and AI applications.

Features

  • Fast similarity search using optimized indexing
  • HNSW indexing for significantly improved search performance
  • Vector collections/namespaces for organizing different types of embeddings
  • Metadata filtering for fine-grained search control
  • Efficient binary storage format for vectors and metadata
  • Automatic text embedding with sentence-transformers
  • Rich command-line interface with beautiful output
  • Cross-platform support (Windows, macOS, Linux)
  • Customizable storage locations
  • Metadata support for enhanced document management
  • Built-in text similarity search

Installation

pip install vecstream

Quick Start

Using the CLI

# Add a document
vecstream add "Machine learning is transforming technology" doc1

# Search for similar documents
vecstream search "AI and machine learning" --k 3

# Search with metadata filtering
vecstream search "cloud computing" --filter '{"category": "ai", "year": 2023}'

# Get document by ID
vecstream get doc1

# View database information
vecstream info

# Create and use a collection
vecstream create_collection research
vecstream add "Neural networks research" doc2 --collection research

# Use custom storage location
vecstream add "Custom storage test" doc3 --db-path "./my_vectors"

# Remove a document
vecstream remove doc1

Using the Python API

from vecstream.collections import CollectionManager
from vecstream.binary_store import BinaryVectorStore

# Using collections for different vector types
manager = CollectionManager("./vector_db")
research_collection = manager.create_collection("research")
products_collection = manager.create_collection("products")

# Add vectors with metadata to collections
research_collection.add_vector(
    id="paper1",
    vector=[1.0, 0.0, 0.0],
    metadata={"topic": "AI", "year": 2023, "author": "Smith"}
)

# Search with metadata filtering
results = research_collection.search_similar(
    query=[1.0, 0.0, 0.0],
    k=5,
    filter_metadata={"year": 2023, "topic": "AI"}
)

# Basic binary store usage (compatible with earlier versions)
store = BinaryVectorStore("./vector_db")

# Add vectors with metadata
store.add_vector(
    id="doc1",
    vector=[1.0, 0.0, 0.0],
    metadata={"text": "Example document", "tags": ["test"]}
)

# Search similar vectors
results = store.search_similar([1.0, 0.0, 0.0], k=5)

# Get vector with metadata
vector, metadata = store.get_vector_with_metadata("doc1")

Storage Locations

By default, VecStream stores its data in:

  • Windows: %APPDATA%/VecStream/store/
  • macOS/Linux: ~/.vecstream/store/

You can specify a custom storage location using the --db-path option in CLI commands or by passing the path to CollectionManager or BinaryVectorStore.
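
As a rough illustration of the rule above, the sketch below resolves a store directory the way the documented defaults describe. The function names `default_store_dir` and `resolve_store_dir` are hypothetical and not part of VecStream, and the fallback to the home directory when `APPDATA` is unset is an assumption.

```python
import os
import sys
from pathlib import Path
from typing import Mapping, Optional


def default_store_dir(platform: str = sys.platform,
                      env: Optional[Mapping[str, str]] = None,
                      home: Optional[Path] = None) -> Path:
    """Return the documented default store directory for a platform.

    Hypothetical helper for illustration; not VecStream's own code.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else home
    if platform.startswith("win"):
        # Documented Windows location: %APPDATA%/VecStream/store/
        # Assumption: fall back to the home directory if APPDATA is unset.
        base = Path(env.get("APPDATA", str(home)))
        return base / "VecStream" / "store"
    # Documented macOS/Linux location: ~/.vecstream/store/
    return home / ".vecstream" / "store"


def resolve_store_dir(db_path: Optional[str] = None) -> Path:
    """Prefer an explicit path (as with --db-path), else the default."""
    return Path(db_path) if db_path else default_store_dir()
```

Passing `platform`, `env`, and `home` explicitly keeps the helper easy to check on any machine, for example `default_store_dir("linux", {}, Path("/home/u"))`.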

Storage Format

VecStream uses an efficient binary storage format:

  • Vectors: NumPy .npy format for fast access
  • Metadata: JSON format for flexibility
  • Automatic compression and optimization
  • Collections organized in subdirectories
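
To make the layout above more concrete, here is a minimal sketch of writing one collection as a NumPy .npy file plus a JSON metadata file inside its own subdirectory. The file names `vectors.npy` and `metadata.json`, the JSON layout, and the helper names are assumptions for illustration; VecStream's actual on-disk layout may differ.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Sequence, Tuple

import numpy as np


def save_collection(root: str, name: str, ids: List[str],
                    vectors: Sequence[Sequence[float]],
                    metadata: Dict[str, Dict[str, Any]]) -> Path:
    """Write vectors as .npy and metadata as JSON under root/name/."""
    col_dir = Path(root) / name  # one subdirectory per collection
    col_dir.mkdir(parents=True, exist_ok=True)
    # Row i of the array belongs to ids[i].
    np.save(col_dir / "vectors.npy", np.asarray(vectors, dtype=np.float32))
    with open(col_dir / "metadata.json", "w", encoding="utf-8") as fh:
        json.dump({"ids": ids, "metadata": metadata}, fh)
    return col_dir


def load_collection(root: str, name: str) -> Tuple[List[str], np.ndarray, Dict[str, Any]]:
    """Read back what save_collection wrote."""
    col_dir = Path(root) / name
    vectors = np.load(col_dir / "vectors.npy")
    with open(col_dir / "metadata.json", encoding="utf-8") as fh:
        payload = json.load(fh)
    return payload["ids"], vectors, payload["metadata"]
```

Keeping the ID order in the JSON file is one simple way to map array rows back to documents.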

CLI Features

The command-line interface provides:

  • Vector Management: Add, get, update, and remove vectors with the add, get, and remove commands
  • Similarity Search: Fast vector search with the search command; set the number of nearest neighbors to return with --k
  • HNSW Indexing: Significantly faster search performance for large datasets (up to 100x faster)
  • Collections: Organize vectors by type with collection create, collection list, and other commands
  • Metadata Filtering: Filter search results with --filter '{"key": "value"}' syntax
  • Nested Filters: Support for dot notation in filters like --filter '{"details.color": "red"}'
  • Beautiful UI: Rich, colored output and progress indicators for long operations
  • Database Stats: View detailed database information with the info command
  • Custom Storage: Specify storage locations with the --db-path option
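
The filter syntax above can be pictured with a small sketch that parses a --filter style JSON string and checks it against a metadata dictionary, following dot notation into nested objects. The helpers `lookup` and `matches` are hypothetical, and the semantics shown (exact equality, all keys must match) are an assumption rather than a description of VecStream's internals.

```python
import json
from typing import Any, Mapping

_MISSING = object()  # sentinel for "key not present"


def lookup(metadata: Mapping[str, Any], dotted_key: str) -> Any:
    """Follow a dotted key such as "details.color" through nested dicts."""
    current: Any = metadata
    for part in dotted_key.split("."):
        if not isinstance(current, Mapping) or part not in current:
            return _MISSING
        current = current[part]
    return current


def matches(metadata: Mapping[str, Any], flt: Mapping[str, Any]) -> bool:
    """Assumed semantics: every filter key must equal the stored value."""
    return all(lookup(metadata, key) == expected for key, expected in flt.items())


# Parse a filter exactly as it would be typed after --filter.
flt = json.loads('{"details.color": "red"}')
docs = {
    "doc1": {"category": "ai", "details": {"color": "red"}},
    "doc2": {"category": "ai", "details": {"color": "blue"}},
    "doc3": {"category": "ai"},
}
hits = [doc_id for doc_id, meta in docs.items() if matches(meta, flt)]
```

In this sketch a document without the nested key simply fails the filter, and a metadata key that itself contains a literal dot would not be matched.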

Python API

The Python API offers:

  • HNSW Indexing: Fast approximate nearest-neighbor search with customizable parameters:
    from vecstream.hnsw_index import HNSWIndex
    index = HNSWIndex(dim=128, M=16, ef_construction=200)
    
  • Collections: Organize vectors with the CollectionManager:
    from vecstream.collections import CollectionManager
    manager = CollectionManager("./vector_db", use_hnsw=True)
    collection = manager.create_collection("images")
    
  • Metadata Filtering: Fine-grained search control:
    results = collection.search_similar(query, filter_metadata={"category": "electronics"})
    
  • Nested Filtering: Access nested properties with dot notation:
    results = collection.search_similar(query, filter_metadata={"details.color": "black"})
    
  • Binary Storage: Efficient serialization for large datasets:
    from vecstream.binary_store import BinaryVectorStore
    store = BinaryVectorStore("./vector_db")
    
  • Vector Operations: Direct access to similarity calculations, normalization, and more
  • Type Safety: Strong typing and error handling with descriptive exceptions
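
For readers unfamiliar with the similarity calculations and normalization mentioned above, the following pure-Python sketch shows an exact (brute-force) top-k search by cosine similarity. This README does not state which metric VecStream uses, so cosine similarity is an assumption here, and the function names are hypothetical. An HNSW index approximates this kind of ranking without scoring every stored vector.

```python
import math
from typing import Dict, List, Sequence, Tuple


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product of the two unit-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def top_k(query: Sequence[float],
          vectors: Dict[str, Sequence[float]],
          k: int) -> List[Tuple[str, float]]:
    """Score every stored vector and return the k most similar IDs."""
    scored = [(vec_id, cosine_similarity(query, vec))
              for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


stored = {"a": [1.0, 0.0, 0.0], "b": [0.0, 1.0, 0.0], "c": [1.0, 1.0, 0.0]}
results = top_k([1.0, 0.0, 0.0], stored, k=2)
```

The brute-force scan costs one similarity computation per stored vector, which is the cost an approximate index is meant to avoid on large datasets.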

Requirements

  • Python 3.8 or higher
  • NumPy
  • SciPy
  • sentence-transformers
  • Rich (for CLI)
  • Click (for CLI)

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Version History

  • 0.3.0 (2024-03-XX)

    • Added HNSW indexing for faster similarity search
    • Added collections/namespaces for organizing vectors
    • Added metadata filtering for search results
    • Improved CLI with collection management commands
    • Performance optimizations
  • 0.2.0 (2024-03-XX)

    • Added binary vector store
    • Improved persistent storage
    • Enhanced CLI functionality
    • Added metadata support
  • 0.1.0 (2024-03-XX)

    • Initial release
    • Basic vector storage and search functionality
    • CLI interface
    • Client-server architecture

Documentation

  • API Reference: Complete reference of VecStream's classes, methods, and CLI commands
  • Advanced Usage: Detailed examples and best practices for using VecStream

Key Features

  • HNSW Indexing: Fast approximate nearest neighbor search for large datasets (see API Reference, Usage Examples)
  • Collections: Organize vectors with metadata for better organization (see API Reference, Usage Examples)
  • Metadata Filtering: Filter search results using metadata properties (see API Reference, Usage Examples)
  • Binary Storage: Efficient storage format for large vector datasets (see API Reference, Usage Examples)
