Seamless-RAG

Vector Search & Structured-Data RAG Toolkit for MariaDB

Turn any MariaDB table into a searchable vector store. Query results come back in TOON v3 tabular format — a compact wire format that saves 10-55% of tokens (vs compact JSON) when feeding structured data to LLMs or agents.



Quick Start for Judges

Evaluating this submission? Start with the Judges' Testing Guide — four progressive paths from inspect-only (5 min, no install) to full test suite (15 min). The full docs site is at https://mariadb-hackathon-my-2026.github.io/seamless-rag/.

90-second verification:

git clone https://github.com/SunflowersLwtech/seamless-rag.git
cd seamless-rag
docker compose up -d --wait
docker compose exec app seamless-rag demo

A 90-second screencast of the same flow lives at docs/assets/demo.gif (also demo.mp4).


MariaDB Features We Use

This project is MariaDB-native end-to-end. The pipeline depends on features that landed in MariaDB 11.7.2+ and is explicitly tuned for them: no external vector store, no shadow index, no application-side ANN.

| Feature | What we use it for | Where in the code |
|---|---|---|
| `VECTOR(N)` column type | First-class storage for 384-dim float32 embeddings, no BLOB workaround | `storage/mariadb.py:112` (schema), `storage/mariadb.py:227` (auto-add column for arbitrary tables) |
| `VECTOR INDEX … DISTANCE=cosine` (HNSW) | Sub-linear similarity search; we tune `mhnsw_ef_search = 100` per session for recall/latency trade-off | `storage/mariadb.py:114`, `storage/mariadb.py:337` |
| `VEC_DISTANCE_COSINE` | Distance function in `ORDER BY` so the planner picks the HNSW index | `storage/mariadb.py:343`, `storage/mariadb.py:362` |
| Native binary protocol via mariadb-connector-python | `array.array('f', embedding)` is sent verbatim: no `VEC_FromText` round-trip, no string parsing | `storage/mariadb.py:254`, `storage/mariadb.py:288` (batch insert) |
| CTE for context windowing | Single round-trip retrieval: closest chunks plus their neighbours by `chunk_order` | `storage/mariadb.py:341` (`WITH closest AS …`) |
| Hybrid SQL filter + vector `ORDER BY` | `seamless-rag ask "waterproof watches" --where "price < 50"`: SQL pre-filter narrows the candidate set, vector ranks within | `storage/mariadb.py:315` (validated WHERE), `storage/mariadb.py:362` (combined query) |
| Connection pool + autocommit | `mariadb.ConnectionPool` with per-call lease, isolation-aware so the watcher never sees stale snapshots | `storage/mariadb.py:158`, `storage/mariadb.py:178` |
| Foreign keys + composite index | `chunks.document_id REFERENCES documents(id)` plus `INDEX idx_doc_order(document_id, chunk_order)` so the CTE neighbour-join stays index-only | `storage/mariadb.py:117-118` |
| Auto-schema for arbitrary tables | `seamless-rag embed --table products --columns name,category` adds a `VECTOR(N)` column and HNSW index to your existing table without touching its other columns | `storage/mariadb.py:227-232` |
| Bare `VEC_DISTANCE()` auto-pick (MariaDB-only) | When the index has `DISTANCE=cosine`, plain `VEC_DISTANCE(...)` reads it from the index and applies cosine; no other RDBMS does this. Demonstrated live by `seamless-rag schema` | `storage/mariadb.py:431` (`compare_vec_distance`), `tests/integration/test_vector_operations.py` (1e-6 equivalence assertion) |
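The CTE context-windowing row above can be sketched in SQL. The query below is an illustration under assumed table and column names (`chunks`, `embedding`, `content`), not the exact statement in storage/mariadb.py:341:

```python
# Illustrative sketch of single-round-trip context-window retrieval:
# a CTE finds the closest chunks via the HNSW index, then a self-join
# pulls each hit's neighbours by chunk_order. Names are assumptions.
def context_window_sql(top_k: int = 5, window: int = 1) -> str:
    return f"""
    WITH closest AS (
        SELECT document_id, chunk_order
        FROM chunks
        ORDER BY VEC_DISTANCE_COSINE(embedding, ?)
        LIMIT {top_k}
    )
    SELECT c.document_id, c.chunk_order, c.content
    FROM chunks AS c
    JOIN closest AS w
      ON c.document_id = w.document_id
     AND c.chunk_order BETWEEN w.chunk_order - {window} AND w.chunk_order + {window}
    ORDER BY c.document_id, c.chunk_order
    """

print(context_window_sql())
```

Because the distance call sits in the CTE's `ORDER BY`, the planner can serve it from the HNSW index; the neighbour join then only touches the composite `(document_id, chunk_order)` index.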

See it for yourself in 5 seconds: seamless-rag schema pretty-prints SHOW CREATE TABLE chunks (highlighting vector(384) and VECTOR KEY ... DISTANCE=cosine), SHOW INDEX FROM chunks (with the VECTOR row called out), and runs a side-by-side VEC_DISTANCE() vs VEC_DISTANCE_COSINE() query so you can verify auto-pick parity yourself.

Tested against MariaDB 11.8 (the version shipped in the official mariadb:11.8 Docker image). 11/11 integration tests pass against the real server, exercising every feature above — see tests/integration/test_vector_operations.py.

Without MariaDB's VECTOR + HNSW, this project would either need a sidecar vector DB (Chroma/Qdrant/pgvector) or a from-scratch ANN implementation. Neither would be MariaDB-native, neither would benefit from the same indexes that already serve OLTP traffic.


Why

LLMs and agents consume structured data as context. The standard approach — dumping JSON — wastes tokens on repeated field names and structural characters:

[{"id":1,"name":"Widget","category":"Tools","price":29.99,"stock":150,"supplier":"Acme","rating":4.5},
 {"id":2,"name":"Gadget","category":"Tools","price":19.99,"stock":300,"supplier":"Acme","rating":4.2}]

TOON tabular writes field names once, values as compact rows:

[2]{id,name,category,price,stock,supplier,rating}:
  1,Widget,Tools,29.99,150,Acme,4.5
  2,Gadget,Tools,19.99,300,Acme,4.2

Measured on real public datasets (full benchmark):

| Dataset (query type) | Rows | JSON Tokens | TOON Tokens | Savings |
|---|---|---|---|---|
| MovieLens — top rated movies (7 cols) | 100 | 6,540 | 5,019 | 23.3% |
| MovieLens — metadata only (4 cols) | 100 | 2,258 | 1,364 | 39.6% |
| SF Restaurant — violations (9 cols) | 100 | 7,071 | 4,326 | 38.8% |
| SF Restaurant — high risk (9 cols) | 50 | 3,437 | 2,076 | 39.6% |

Savings scale with row count and stabilize at the dataset's natural ceiling:

| Rows | MovieLens (7 cols) | Restaurant (9 cols) |
|---|---|---|
| 10 | 21.7% | 34.6% |
| 50 | 22.0% | 38.2% |
| 100 | 24.1% | 38.8% |
| 500 | 29.0% | 38.9% |

TOON is not magic — it shines on structured tabular data with many columns and short values, which is exactly what comes out of database queries. All measurements use compact JSON (separators=(",",":")) as baseline.
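The comparison can be reproduced in miniature. The sketch below uses character counts as a rough proxy for tokens (the project's benchmark uses tiktoken) and a deliberately simplified tabular encoder, not the project's TOONEncoder:

```python
import json

# Two rows of structured data, as they might come back from a SELECT.
rows = [
    {"id": 1, "name": "Widget", "category": "Tools", "price": 29.99},
    {"id": 2, "name": "Gadget", "category": "Tools", "price": 19.99},
]

# Baseline: compact JSON, the same baseline the benchmarks use.
compact_json = json.dumps(rows, separators=(",", ":"))

# Simplified tabular encoding: field names written once, values as rows.
fields = list(rows[0])
tabular = f"[{len(rows)}]{{{','.join(fields)}}}:\n" + "\n".join(
    "  " + ",".join(str(r[f]) for f in fields) for r in rows
)

# Field names appear once in the tabular form, once per row in JSON.
print(len(compact_json), len(tabular))
```

The gap widens as rows are added, since the per-row overhead of the tabular form is just a delimiter per value.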

Where It Fits

For structured database data, the industry uses two retrieval approaches. Seamless-RAG bridges both to LLMs:

"Q3 revenue by region?"           "Find products similar to X"
        │                                    │
   Text-to-SQL                        Vector Search
   (LLM generates SQL)              (cosine similarity)
        │                                    │
        └──────────┬─────────────────────────┘
                   ▼
           MariaDB executes
                   ▼
           list[dict] results
                   ▼
        Seamless-RAG → TOON format     ← saves 20-40% tokens
                   ▼
           LLM / Agent consumes
  • Precise queries ("revenue > 1M"): write SQL directly, use seamless-rag export to TOON-format the results
  • Semantic queries ("similar products"): use seamless-rag ask for vector search on text columns
  • Hybrid ("waterproof watches under $50"): seamless-rag ask --where "price < 50" combines both

Seamless-RAG is a format + embedding bridge, not a replacement for SQL.

Quick Start

Your data is already in MariaDB. Seamless-RAG adds vectors and TOON.

Prerequisites — the mariadb Python connector is a C extension. Install the system library first:

  • macOS: brew install mariadb-connector-c
  • Debian/Ubuntu: sudo apt install libmariadb-dev
  • Or use Docker (see Quick Start for Judges) and skip pip entirely
pip install "seamless-rag[mariadb,embeddings]"  # from PyPI
# or, from source:
# pip install -e ".[mariadb,embeddings]"

docker compose up -d                            # MariaDB 11.8

seamless-rag init                               # create VECTOR columns + HNSW index
seamless-rag embed --table products --column description  # embed existing rows
seamless-rag ask "Which products are most relevant?"      # vector search → TOON → LLM
seamless-rag export "SELECT id, name, price FROM products LIMIT 20"  # SQL → TOON

Optional extras: [gemini], [openai], [ollama], [web] (Gradio UI), or [all] for everything.

No file loading, no document chunking — data lives in MariaDB, Seamless-RAG bridges it to vectors and LLMs.

CLI Commands

seamless-rag init              Create VECTOR columns + HNSW index
seamless-rag embed             Bulk-embed existing table rows (core workflow)
seamless-rag watch             Auto-embed new inserts in real time (Rich live)
seamless-rag ask <question>    Vector search → TOON context → LLM answer
seamless-rag export <sql>      Any SELECT → TOON format
seamless-rag benchmark         JSON vs TOON token/cost comparison
seamless-rag web               Gradio web UI (localhost-only by default)
seamless-rag demo              End-to-end demo with sample data
seamless-rag ingest <path>     Convenience: load text files for quick testing

Multi-column embedding — embed multiple columns for richer semantics:

# Single column (default)
seamless-rag embed --table products --column description

# Multi-column — values concatenated for richer vector search
seamless-rag embed --table products --columns "name,category,price,rating"
# Internally: "Widget — Tools — 29.99 — 4.5"

# Now "cheap high-rated tools" matches on price AND rating, not just description
seamless-rag ask "cheap high-rated tools" --where "price < 50"

Global options: --host, --port, --database, --provider, --model, --log-level

As Agent Tools

Seamless-RAG commands work as agent tools. An LLM agent can call these to interact with MariaDB:

# Agent tool: search MariaDB and get compact context
result = rag.ask("quarterly revenue by region", top_k=10)
# result.context_toon → compact tabular format for next LLM call
# result.savings_pct → token savings vs compact JSON

# Agent tool: export any SQL query as TOON
toon = rag.export("SELECT region, revenue, quarter FROM sales")
# Feed to next agent step with minimal token overhead

# Agent tool: multi-column embed for richer search
rag.embed_table("products", text_column=["name", "category", "price"])
# "Widget — Tools — 29.99" → vector search matches name AND price

In a 20-step agent workflow querying a database at each step (measured on real data):

| Dataset | JSON (20 steps) | TOON (20 steps) | Tokens Saved | Cost Saved |
|---|---|---|---|---|
| MovieLens (7 cols, 50 rows/step) | 73,680 | 58,760 | 14,920 | $0.037 |
| Restaurant (9 cols, 50 rows/step) | 69,640 | 42,640 | 27,000 | $0.068 |
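The cost column follows from simple arithmetic. The sketch below assumes GPT-4o input pricing of $2.50 per million tokens, which is an assumption on our part; the project's TokenBenchmark may use different rates:

```python
# Assumed rate: $2.50 per 1M input tokens (GPT-4o input pricing assumption).
PRICE_PER_TOKEN = 2.50 / 1_000_000

for name, json_tok, toon_tok in [
    ("MovieLens", 73_680, 58_760),
    ("Restaurant", 69_640, 42_640),
]:
    saved = json_tok - toon_tok
    # Tokens saved over 20 steps, converted to dollars at the assumed rate.
    print(f"{name}: {saved} tokens saved, about ${saved * PRICE_PER_TOKEN:.3f}")
```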

Python API

from seamless_rag import SeamlessRAG

with SeamlessRAG(host="localhost", database="mydb") as rag:
    rag.init()
    rag.ingest("research.txt", ["chunk1...", "chunk2..."])

    # Single-column embed (default)
    rag.embed_table("articles", text_column="content")

    # Multi-column embed — richer semantics
    rag.embed_table("products", text_column=["name", "category", "price"])

    # Semantic search with hybrid filter
    result = rag.ask("affordable tools", where="price < 50", mmr=True)
    print(result.answer)           # LLM-generated answer
    print(result.context_toon)     # compact context
    print(f"Saved {result.savings_pct:.0f}% tokens")

Pluggable Providers

Both embedding and LLM layers use typing.Protocol — no base class needed:

| Layer | Providers | Default |
|---|---|---|
| Embedding | SentenceTransformers, Gemini, OpenAI, Ollama | SentenceTransformers (local, free) |
| LLM | Ollama, Gemini, OpenAI | Ollama (local, free) |

Switch via env vars: EMBEDDING_PROVIDER=gemini LLM_PROVIDER=openai seamless-rag ask "..."

See the Providers guide for adding custom providers.
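Because the interfaces are `typing.Protocol` classes, any object with matching methods qualifies; no inheritance is required. The sketch below illustrates the pattern with a hypothetical `embed` signature, which may differ from the real EmbeddingProvider interface:

```python
from typing import Protocol, runtime_checkable

# Illustrative Protocol: the method name and signature here are assumptions,
# shown only to demonstrate structural (duck-typed) provider matching.
@runtime_checkable
class EmbeddingProvider(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class HashEmbedder:
    """Toy provider: deterministic pseudo-embeddings, no model download."""
    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(hash(t) % 1000) / 1000.0] for t in texts]

provider: EmbeddingProvider = HashEmbedder()
# Structural check: HashEmbedder never subclasses EmbeddingProvider.
assert isinstance(provider, EmbeddingProvider)
```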

Architecture

seamless-rag CLI / Python API / Agent Tools
    │
    ├── EmbeddingProvider (Protocol)     ← 4 built-in, add your own
    ├── LLMProvider (Protocol)           ← 3 built-in, add your own
    ├── VectorStore (Protocol)           ← MariaDB with connection pool
    │     └── VECTOR(N) + HNSW index + VEC_DISTANCE_COSINE
    ├── AutoEmbedder                     ← batch + watch, multi-column concat
    ├── RAGEngine                        ← search → TOON → LLM (retry) → benchmark
    ├── TOONEncoder                      ← full v3 spec (166/166)
    └── TokenBenchmark                   ← tiktoken + GPT-4o cost calc

Test Results

538 tests passing (100%)
  lint:        100%
  unit:        100% (338/338)
  spec:        100% (166/166 TOON v3 conformance)
  integration: 100% (17/17)
  eval:        100%

Security

  • SQL injection prevention: WHERE filters and SELECT queries validated via sqlglot AST parsing — blocks writes, DDL, subqueries, and dangerous functions (SLEEP, BENCHMARK, LOAD_FILE)
  • Web UI: binds 127.0.0.1 by default; --share requires auth via SEAMLESS_WEB_USER / SEAMLESS_WEB_PASSWORD; error messages never leak server internals
  • LLM calls: context truncated to 20K chars; retry with jitter for transient errors; rate-limit detection
  • Identifiers: all table/column names validated against ^[A-Za-z_][A-Za-z0-9_]*$
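The identifier rule in the last bullet can be sketched as a small validator. `safe_identifier` is a hypothetical helper name for illustration, not the project's actual function:

```python
import re

# Allow-list from the security notes: identifiers must start with a letter
# or underscore and contain only letters, digits, and underscores.
IDENT_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def safe_identifier(name: str) -> str:
    """Return name unchanged if it is a safe SQL identifier, else raise."""
    if not IDENT_RE.match(name):
        raise ValueError(f"invalid SQL identifier: {name!r}")
    return name

print(safe_identifier("products"))
# safe_identifier("products; DROP TABLE users") would raise ValueError.
```

Validating identifiers this way complements the sqlglot AST checks: values go through bound parameters, identifiers through the allow-list.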

Built for the MariaDB Ecosystem

MariaDB

  • MariaDB 11.7+ VECTOR columns, HNSW indexes, VEC_DISTANCE_COSINE
  • Native binary protocol via mariadb-connector-python (array.array float32, no VEC_FromText round-trip)
  • Connection pooling with unique pool names for concurrent instances
  • Version validation (>= 11.7.2) on init

Where this fits in the MariaDB ecosystem:

| Layer | Project | Relationship |
|---|---|---|
| Database | MariaDB Server 11.7+ | Required runtime — VECTOR/HNSW lands in 11.7.2 |
| Driver | mariadb-connector-python | Used directly via `mariadb.ConnectionPool` |
| Knowledge Base | VEC_DISTANCE_COSINE, VECTOR INDEX | Reference docs for the SQL we generate |
| Hackathons | MariaDB Python Hackathon 2025 winners | Sister projects (Adaptive Query Optimizer, Apache Airflow integration) |
| Distribution | PyPI: seamless-rag | `pip install seamless-rag[mariadb]` |

License

Copyright 2026 LiuWei (SunflowersLwtech)
Licensed under the Apache License, Version 2.0

See LICENSE | CONTRIBUTING | Documentation



Download files

Download the file for your platform.

Source Distribution

seamless_rag-0.1.2.tar.gz (2.1 MB)


Built Distribution


seamless_rag-0.1.2-py3-none-any.whl (60.2 kB)


File details

Details for the file seamless_rag-0.1.2.tar.gz.

File metadata

  • Download URL: seamless_rag-0.1.2.tar.gz
  • Size: 2.1 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for seamless_rag-0.1.2.tar.gz

| Algorithm | Hash digest |
|---|---|
| SHA256 | 82f8f285814e5f77a8dcb8d3cedc301b269c32dadc20d2e6cf43d73e6ac698cb |
| MD5 | abe4e16eed2fe37ff16558ba418bdb30 |
| BLAKE2b-256 | 0474fa0ed15e9736792ce796232b7192f0e11c002d77f00acad561e3163015ad |


File details

Details for the file seamless_rag-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: seamless_rag-0.1.2-py3-none-any.whl
  • Size: 60.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.13

File hashes

Hashes for seamless_rag-0.1.2-py3-none-any.whl

| Algorithm | Hash digest |
|---|---|
| SHA256 | 5036327eaa6870f5bab6e1f03ce8c703ba70b1c73c1ad5096a1dcf915ac50b32 |
| MD5 | f832995e31bce472e7d7baa323113a23 |
| BLAKE2b-256 | aa9fa814f3ddddabe35b7819dc3e25aaf1772ec25a2e0726d9ab948ac7998eaa |

