Educational vector database built from first principles to understand how vector search really works.


M2VDB - Understanding Vector Search Through Real Implementations

This project is my attempt to understand vector search and vector databases from first principles, while having fun building something end-to-end that feels like a real vector DB. I’ve worked as an applied scientist on AI systems with retrieval, and yet I never really understood how vector databases actually work. Until now :)

✨ Features

🧱 Index Implementations

  • Brute Force (Python)
  • Brute Force (Rust)
  • Product Quantization (PQ)
  • Inverted File (IVF)
  • More Rust ports coming...
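To make the index types above concrete, here is a minimal pure-Python sketch contrasting an exhaustive (brute force) scan with an IVF-style search that only scans the lists of the nearest centroids. It is not m2vdb's actual code: the function names, the `nprobe` parameter, and the hand-picked centroids are assumptions for illustration (a real IVF index would normally learn its centroids with k-means).

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def brute_force_search(vectors, query, top_k):
    """Score every stored vector against the query and keep the closest top_k ids."""
    scored = sorted((euclidean(vec, query), vec_id) for vec_id, vec in vectors.items())
    return [vec_id for _, vec_id in scored[:top_k]]

def build_ivf(vectors, centroids):
    """Assign each vector to the inverted list of its nearest centroid."""
    lists = {i: {} for i in range(len(centroids))}
    for vec_id, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda i: euclidean(vec, centroids[i]))
        lists[nearest][vec_id] = vec
    return lists

def ivf_search(lists, centroids, query, top_k, nprobe):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: euclidean(query, centroids[i]))
    candidates = {}
    for i in order[:nprobe]:
        candidates.update(lists[i])
    return brute_force_search(candidates, query, top_k)

# Toy data: two well-separated clusters and hand-picked centroids (hypothetical).
vectors = {"a": [0.0, 0.1], "b": [0.2, 0.0], "c": [5.0, 5.1], "d": [5.2, 4.9]}
centroids = [[0.0, 0.0], [5.0, 5.0]]
query = [0.1, 0.0]

exact = brute_force_search(vectors, query, top_k=2)
lists = build_ivf(vectors, centroids)
approx = ivf_search(lists, centroids, query, top_k=2, nprobe=1)
print(exact, approx)
```

With `nprobe=1` the IVF search touches only half of the toy data yet returns the same neighbours here. On real data, probing fewer lists trades recall for speed.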

🌐 API

  • Minimal FastAPI server
  • Resource stats
  • MCP server planned (for the memes)

📊 Benchmarking

  • Benchmarks on multiple datasets (SIFT1M, FastText, more coming)
  • Latency, recall, build time, memory, QPS
  • Caching benchmark runs & JSON results
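As a rough illustration of the recall metric listed above, the sketch below computes recall@k in the usual way: the fraction of the true k nearest neighbours that an index actually returns, averaged over queries. The function name and input layout are assumptions, and the project's benchmark script may compute it differently.

```python
def recall_at_k(retrieved, ground_truth, k):
    """Average fraction of the true top-k ids that appear in the returned top-k."""
    hits = 0
    for got, expected in zip(retrieved, ground_truth):
        hits += len(set(got[:k]) & set(expected[:k]))
    return hits / (k * len(ground_truth))

# Two queries: the first misses one true neighbour, the second finds all three.
retrieved = [[1, 2, 3], [4, 5, 6]]
ground_truth = [[1, 2, 9], [4, 5, 6]]
score = recall_at_k(retrieved, ground_truth, k=3)
print(round(score, 3))
```

An exact index such as brute force scores 1.0 by construction, so it makes a natural source of ground truth for approximate indexes.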

🗺️ Roadmap

  • More Indexes: Implement HNSW (Python first, Rust when I'm bored).
  • Comparative Benchmarks: Add FAISS baselines to compare my implementations.
  • Experiments: Hyperparameter sweeps for PQ (and others) with visualization/graphs.
  • Configuration: Better config management for running benchmark sweeps.
  • Memory Benchmarking: Improve memory measurement to track non-Python indexes.
  • MCP Server: Model Context Protocol integration (because why not?).
  • Rust Ports: Porting more index types to Rust for speed.

⚡️ Quick Start

Installation

Option 1: From PyPI (Recommended)

pip install m2vdb
# or with uv
uv pip install m2vdb

Option 2: From Source

git clone https://github.com/mmilunovic/m2vdb.git
cd m2vdb
uv sync

Optional: Enable Rust Indexes

For maximum performance, you can build optional Rust extensions:

# Install Rust if you don't have it
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Build Rust indexes
cd rust
maturin develop --release
cd ..

Start the Server

Using Docker

docker-compose up -d

Using CLI Command

# Basic usage
m2vdb-server

# Custom port
m2vdb-server --port 8080

# With persistent storage (when implemented)
m2vdb-server --data-dir /path/to/data

# Development mode with auto-reload
m2vdb-server --reload

💡 Tip: Once the server is running, visit http://localhost:8000/docs for the interactive API documentation (Swagger UI) to explore endpoints and test requests directly from your browser.

Use the Client

from m2vdb import M2VDBClient

# 1. Connect
client = M2VDBClient(api_key="sk-test-user1", host="http://localhost:8000")

# 2. Create Index
index = client.create_index(
    name="demo", 
    dimension=3, 
    metric="cosine",
    index_type="brute_force"  # Options: "brute_force", "pq", "ivf", "rust_brute_force" (if built)
)

# 3. Insert Data
index.upsert(
    vectors=[
        {"id": "A", "vector": [1.0, 0.0, 0.0], "metadata": {"label": "Red"}},
        {"id": "B", "vector": [0.0, 1.0, 0.0], "metadata": {"label": "Green"}},
    ]
)

# 4. Search
results = index.query(
    vector=[0.9, 0.1, 0.0],
    top_k=1
)
print(results) # Matches "A" (Red)
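To see why the query above matches "A", the cosine similarities can be worked out by hand. This is a standalone sketch that does not call m2vdb; `cosine_similarity` is a helper written for this example.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

query = [0.9, 0.1, 0.0]
sim_a = cosine_similarity(query, [1.0, 0.0, 0.0])  # vector "A" (Red)
sim_b = cosine_similarity(query, [0.0, 1.0, 0.0])  # vector "B" (Green)
print(round(sim_a, 3), round(sim_b, 3))
```

The query scores about 0.994 against "A" and about 0.110 against "B", so "A" is the top result for `top_k=1`.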

Using Rust Indexes (Optional)

If you've built the Rust extensions, you can use them for better performance (roughly 3-5x higher brute-force QPS in the benchmarks below):

from m2vdb import VectorDatabase, HAS_RUST

# Check if Rust is available
print(f"Rust indexes available: {HAS_RUST}")

# Use Rust brute force index (about 5x faster than Python on SIFT1M)
db = VectorDatabase(
    dimension=128,
    metric="euclidean",
    index_type="rust_brute_force"  # Requires Rust extensions
)

# Or use it via the client (reuses `client` from the "Use the Client" example)
index = client.create_index(
    name="fast-demo",
    dimension=128,
    metric="euclidean", 
    index_type="rust_brute_force"
)
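Since the Rust extensions are optional, one possible pattern is to pick the index type at runtime and fall back to the Python implementation when Rust is unavailable. The helper `pick_index_type` is hypothetical and not part of m2vdb; only `HAS_RUST` and the two index type strings come from the examples above.

```python
try:
    from m2vdb import HAS_RUST  # flag shown in the example above
except ImportError:
    HAS_RUST = False  # m2vdb is not installed here; treat Rust as unavailable

def pick_index_type(has_rust: bool) -> str:
    """Prefer the Rust brute force index, otherwise use the pure-Python one."""
    return "rust_brute_force" if has_rust else "brute_force"

index_type = pick_index_type(HAS_RUST)
print(f"Using index type: {index_type}")
```

The resulting string can then be passed as `index_type` to `VectorDatabase(...)` or `client.create_index(...)`.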

Performance comparison (1M vectors, 128D):

  • Python BruteForce: ~5 QPS
  • Rust BruteForce: ~25 QPS (5x faster!)

📊 Benchmarks

All results below were generated on a MacBook Air M4, 16GB RAM, with:

  • 1,000,000 base vectors
  • 1,000 queries
  • k = 10

SIFT1M (1M vectors, 128D)

Index                          Build (ms)  Index (MB)  Bytes/Vec  QPS  p99 (ms)  Recall@10
PyBruteForce-euclidean                746       649.0        681    5    204.02      1.000
RustBruteForce-euclidean              698         N/A        N/A   25     40.31      1.000
IVF(auto)-euclidean                 5,453       657.7        690   25     56.67      0.995
FAISS-Flat-euclidean                  707         N/A        N/A  111      9.02      1.000
PQ(m=8,k=256)-euclidean          425,167*       191.5        201   19     51.56      0.332
FAISS-PQ(m=8,k=256)-euclidean       4,906         N/A        N/A  461      2.17      0.323

FASTTEXT (sampled 1M vectors, 300D)

Index                          Build (ms)  Index (MB)  Bytes/Vec  QPS  p99 (ms)  Recall@10
PyBruteForce-cosine                   707      1305.1       1369    3    310.86      1.000
RustBruteForce-cosine               1,074         N/A        N/A    8    128.29      1.000
IVF(auto)-cosine                   14,812      1310.0       1374   21     59.95      0.951
FAISS-Flat-cosine                   1,273         N/A        N/A   45     22.33      1.000
PQ(m=10,k=256)-cosine            559,221*       199.5        209   18     56.49      0.283
FAISS-PQ(m=10,k=256)-cosine         7,208         N/A        N/A  291      3.44      0.253
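The Bytes/Vec column appears to be the index size divided by the number of vectors, with MB read as 1024 × 1024 bytes. That reading is an assumption inferred from the numbers, not something the benchmark script is known to state. The sketch below checks it against the SIFT1M rows and compares the result with the raw size of one float32 vector.

```python
MIB = 1024 * 1024        # assumed meaning of "MB" in the tables
N_VECTORS = 1_000_000    # base vectors in each benchmark

def bytes_per_vector(index_mb):
    """Convert a reported index size in MB to rounded bytes per stored vector."""
    return round(index_mb * MIB / N_VECTORS)

py_brute = bytes_per_vector(649.0)  # PyBruteForce-euclidean
ivf = bytes_per_vector(657.7)       # IVF(auto)-euclidean
pq = bytes_per_vector(191.5)        # PQ(m=8,k=256)-euclidean
raw_float32 = 128 * 4               # one 128D float32 vector, no overhead

print(py_brute, ivf, pq, raw_float32)
```

These three values match the SIFT1M table (681, 690, 201). The FastText rows agree up to rounding of the reported MB figure. For comparison, a raw 128D float32 vector takes 512 bytes.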

To reproduce these results, run:

uv run python benchmarks/run_benchmarks.py

📜 License

MIT. If you actually use it I'll be flattered 🥹
