
Educational vector database built from first principles to understand how vector search really works.


M2VDB - Understanding Vector Search Through Real Implementations

This project is simply me trying to understand vector search and vector databases from first principles, while having fun building something end-to-end that feels like a real vector DB. I’ve worked as an applied scientist on AI systems with retrieval, yet I never really understood how vector databases actually work. Until now :)

✨ Features

🧱 Index Implementations

  • Brute Force (Python)
  • Brute Force (Rust)
  • Product Quantization (PQ)
  • Inverted File (IVF)
  • More Rust ports coming...
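As a point of reference for the indexes listed above, brute-force search is the simplest of them: compare the query against every stored vector and keep the closest ones. The following is a minimal standalone sketch of that idea in plain Python, not the project's actual implementation; the function names and the dictionary layout are assumptions made for illustration.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def brute_force_search(vectors, query, top_k):
    """Exact search: score every stored vector and keep the k closest.

    `vectors` maps an id to its vector (hypothetical layout).
    Returns (id, distance) pairs, closest first.
    """
    scored = ((vec_id, euclidean(vec, query)) for vec_id, vec in vectors.items())
    return heapq.nsmallest(top_k, scored, key=lambda pair: pair[1])


vectors = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
hits = brute_force_search(vectors, [0.9, 0.8], top_k=2)
print([vec_id for vec_id, _ in hits])  # ['b', 'a']
```

Because every vector is scored, recall is always 1.0 and the cost grows linearly with the collection size, which is the trade-off that approximate indexes such as PQ and IVF try to improve on.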

🌐 API

  • REST API with FastAPI
  • Python SDK client & CLI
  • Docker & persistence support
  • MCP Server (planned) for the memes

📊 Benchmarking

  • Benchmarks on multiple datasets (SIFT1M, FastText, more coming)
  • Latency, recall, build time, memory, QPS
  • Caching benchmark runs & JSON results
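To make the latency and QPS metrics concrete, here is one way they could be measured. This is a sketch under assumptions, not the project's benchmark harness: `measure` and `percentile` are hypothetical helpers, and p99 is computed with the nearest-rank method, which may differ from what the real benchmarks use.

```python
import math
import time


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure(search_fn, queries):
    """Run search_fn once per query and return (queries per second, p99 latency in ms)."""
    latencies_ms = []
    start = time.perf_counter()
    for query in queries:
        t0 = time.perf_counter()
        search_fn(query)
        latencies_ms.append((time.perf_counter() - t0) * 1000.0)
    elapsed = time.perf_counter() - start
    return len(queries) / elapsed, percentile(latencies_ms, 99)


# Stand-in workload; a real run would call an index's search method instead.
qps, p99_ms = measure(lambda n: sum(range(n)), [10_000] * 200)
print(f"QPS={qps:.0f}  p99={p99_ms:.3f} ms")
```

QPS here is single-threaded throughput (queries divided by wall-clock time), so it is roughly the inverse of the mean latency rather than of the p99.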

🗺️ Roadmap

  • More Indexes: Implement HNSW (Python first, Rust when I'm bored).
  • Comparative Benchmarks: Add FAISS baselines to compare my implementations.
  • Experiments: Hyperparameter sweeps for PQ (and others) with visualization/graphs.
  • Configuration: Better config management for running benchmark sweeps.
  • Memory Benchmarking: Improve memory measurement to track non-Python indexes.
  • MCP Server: Model Context Protocol integration (because why not?).

⚡️ Quick Start

Installation

Option 1: From PyPI (Recommended)

pip install m2vdb
# or with uv
uv pip install m2vdb

Option 2: From Source

git clone https://github.com/mmilunovic/m2vdb.git
cd m2vdb
uv sync

Optional: Enable Rust Indexes

For maximum performance, you can build optional Rust extensions:

# Install Rust if you don't have it
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build Rust indexes
cd rust
maturin develop --release
cd ..

Start the Server

Using Docker

docker-compose up -d

Using CLI Command

# Basic usage
m2vdb-server

# Custom port
m2vdb-server --port 8080

# With persistent storage (when implemented)
m2vdb-server --data-dir /path/to/data

# Development mode with auto-reload
m2vdb-server --reload

💡 Tip: Once the server is running, visit http://localhost:8000/docs for the interactive API documentation (Swagger UI) to explore endpoints and test requests directly from your browser.

Use the Client

from m2vdb import M2VDBClient

# 1. Connect
client = M2VDBClient(api_key="sk-test-user1", host="http://localhost:8000")

# 2. Create Index
index = client.create_index(
    name="demo", 
    dimension=3, 
    metric="cosine",
    index_type="brute_force"  # Options: "brute_force", "pq", "ivf", "rust_brute_force" (if built)
)

# 3. Insert Data
index.upsert(
    vectors=[
        {"id": "A", "vector": [1.0, 0.0, 0.0], "metadata": {"label": "Red"}},
        {"id": "B", "vector": [0.0, 1.0, 0.0], "metadata": {"label": "Green"}},
    ]
)

# 4. Search
results = index.query(
    vector=[0.9, 0.1, 0.0],
    top_k=1
)
print(results) # Matches "A" (Red)
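To see why the query above matches "A", the cosine similarities can be worked out by hand. The snippet below is a standalone check in plain Python and does not use the m2vdb client; `cosine_similarity` is a helper written for this example.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


query = [0.9, 0.1, 0.0]
sim_a = cosine_similarity(query, [1.0, 0.0, 0.0])  # about 0.994
sim_b = cosine_similarity(query, [0.0, 1.0, 0.0])  # about 0.110
print(f"A: {sim_a:.3f}, B: {sim_b:.3f}")
```

The query points almost entirely along the first axis, so its similarity to "A" is close to 1 while its similarity to "B" is close to 0, and a `top_k=1` search returns "A".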

Using Rust Indexes (Optional)

If you've built the Rust extensions, you can use them for significantly better performance:

from m2vdb import Collection, HAS_RUST

# Check if Rust is available
print(f"Rust indexes available: {HAS_RUST}")

# Use Rust brute force index (roughly 3-5x faster than Python in the benchmarks below)
db = Collection(
    dimension=128,
    metric="euclidean",
    index_type="rust_brute_force"  # Requires Rust extensions
)

# Or use it via the client
index = client.create_index(
    name="fast-demo",
    dimension=128,
    metric="euclidean", 
    index_type="rust_brute_force"
)
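Since the Rust extensions are optional, code that should run either way can pick the index type from the `HAS_RUST` flag. This is a small sketch of that pattern; `pick_index_type` is a hypothetical helper, not part of m2vdb, and the `ImportError` fallback is only there so the snippet also runs where m2vdb is not installed.

```python
try:
    # HAS_RUST is the flag exposed by m2vdb, as shown in the snippet above.
    from m2vdb import HAS_RUST
except ImportError:
    # m2vdb is not installed here; treat the Rust backend as unavailable.
    HAS_RUST = False


def pick_index_type(has_rust):
    """Hypothetical helper: prefer the Rust index when the extension was built."""
    return "rust_brute_force" if has_rust else "brute_force"


index_type = pick_index_type(HAS_RUST)
print(f"Using index_type={index_type!r}")
```

The resulting string can then be passed as `index_type` to `Collection` or `client.create_index`.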

Performance comparison (1M vectors, 128D):

  • Python BruteForce: ~5 QPS
  • Rust BruteForce: ~25 QPS (5x faster!)

📊 Benchmarks

All results below were generated on a MacBook Air M4, 16GB RAM, with:

  • 1,000,000 base vectors
  • 1,000 queries
  • k = 10
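With k = 10, the Recall@10 column reports how many of the true 10 nearest neighbours an index returns, averaged over the queries. A minimal sketch of that computation follows; the function names are illustrative and the project's benchmark code may organise it differently.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the true top-k neighbours that appear in the returned top-k."""
    truth = set(exact_ids[:k])
    found = set(approx_ids[:k])
    return len(truth & found) / len(truth)


def mean_recall_at_k(all_approx, all_exact, k):
    """Average recall@k over a batch of queries."""
    scores = [recall_at_k(a, e, k) for a, e in zip(all_approx, all_exact)]
    return sum(scores) / len(scores)


exact = [1, 2, 3, 4]   # ground truth, e.g. from a brute-force search
approx = [1, 3, 9, 2]  # what an approximate index returned
print(recall_at_k(approx, exact, k=4))  # 0.75
```

Order within the top-k does not matter for this metric, which is why exact indexes always score 1.000.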

SIFT1M (1M vectors, 128D)

Index                          Build(ms)  Index(MB)  Bytes/Vec  QPS  p99(ms)  Recall@10
PyBruteForce-euclidean               746      649.0        681    5   204.02      1.000
RustBruteForce-euclidean             698        N/A        N/A   25    40.31      1.000
IVF(auto)-euclidean                5,453      657.7        690   25    56.67      0.995
FAISS-Flat-euclidean                 707        N/A        N/A  111     9.02      1.000
PQ(m=8,k=256)-euclidean         425,167*      191.5        201   19    51.56      0.332
FAISS-PQ(m=8,k=256)-euclidean      4,906        N/A        N/A  461     2.17      0.323
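The Bytes/Vec column appears to be Index(MB) divided by the number of vectors, with MB read as mebibytes (1024² bytes). That reading is an assumption, but it reproduces the figures in the SIFT1M table, as this short calculation shows. The float32 and one-byte-code sizes at the end are likewise assumptions about storage, included only for comparison.

```python
N_VECTORS = 1_000_000
BYTES_PER_MIB = 1024 ** 2


def bytes_per_vector(index_mb, n_vectors=N_VECTORS):
    """Convert an index size in MiB to average bytes per stored vector."""
    return index_mb * BYTES_PER_MIB / n_vectors


# Index(MB) values copied from the SIFT1M table.
sift_rows = [("PyBruteForce", 649.0), ("IVF(auto)", 657.7), ("PQ(m=8,k=256)", 191.5)]
for name, index_mb in sift_rows:
    print(f"{name}: {round(bytes_per_vector(index_mb))} bytes/vec")

# Raw payload sizes for comparison (assumed storage formats).
raw_float32_bytes = 128 * 4  # 128-D vector stored as float32 -> 512 bytes
pq_code_bytes = 8 * 1        # m=8 codes of one byte each (k=256) -> 8 bytes
print(raw_float32_bytes, pq_code_bytes)
```

The measured figures are larger than these raw payloads, so they evidently include more than the vector data or codes alone; the source does not break down the remainder.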

FASTTEXT (sampled 1M vectors, 300D)

Index                          Build(ms)  Index(MB)  Bytes/Vec  QPS  p99(ms)  Recall@10
PyBruteForce-cosine                  707     1305.1       1369    3   310.86      1.000
RustBruteForce-cosine              1,074        N/A        N/A    8   128.29      1.000
IVF(auto)-cosine                  14,812     1310.0       1374   21    59.95      0.951
FAISS-Flat-cosine                  1,273        N/A        N/A   45    22.33      1.000
PQ(m=10,k=256)-cosine           559,221*      199.5        209   18    56.49      0.283
FAISS-PQ(m=10,k=256)-cosine        7,208        N/A        N/A  291     3.44      0.253
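The PQ rows trade recall for a much smaller index. As a rough illustration of why, here is a toy product quantizer in plain Python, assuming the usual meaning of the parameters: `m` is the number of sub-vectors each vector is split into and `k` is the number of centroids learned per sub-space. It is a teaching sketch, not the project's implementation; all function names are made up for this example, and it assumes the dimension is divisible by `m`.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(vec, m):
    """Cut a vector into m equal-length sub-vectors."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]


def kmeans(points, k, iters, rng):
    """Plain Lloyd's k-means; an empty cluster keeps its previous centroid."""
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            buckets[nearest].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def train(data, m, k, iters=5, seed=0):
    """Learn one codebook of k centroids per sub-space."""
    rng = random.Random(seed)
    return [kmeans([split(v, m)[s] for v in data], k, iters, rng) for s in range(m)]


def encode(vec, codebooks):
    """Replace each sub-vector with the index of its nearest centroid."""
    return [
        min(range(len(book)), key=lambda c: sq_dist(sub, book[c]))
        for sub, book in zip(split(vec, len(codebooks)), codebooks)
    ]


def decode(code, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    return [x for c, book in zip(code, codebooks) for x in book[c]]


def adc_distance(query, code, codebooks):
    """Asymmetric distance: exact query sub-vectors against quantized codes."""
    subs = split(query, len(codebooks))
    return sum(sq_dist(sub, book[c]) for sub, book, c in zip(subs, codebooks, code))


rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(200)]
codebooks = train(data, m=4, k=16)
code = encode(data[0], codebooks)
print(code)  # four centroid indices, one per sub-space
```

With k = 256 each index fits in one byte, so a vector is stored as just `m` bytes. Distances are then computed against the centroids rather than the original vectors, which is where the recall loss comes from. Practical implementations usually precompute a per-query table of sub-vector-to-centroid distances instead of recomputing them for every code as this sketch does.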

To reproduce these results, run:

uv run python benchmarks/run_benchmarks.py

📜 License

MIT. If you actually use it I'll be flattered 🥹
