TOON-Native Auto-Embedding & RAG Toolkit for MariaDB — VECTOR(N), HNSW, VEC_DISTANCE_COSINE, with TOON tabular output that saves 10-55% LLM tokens vs JSON.
Project description
Seamless-RAG
Vector Search & Structured-Data RAG Toolkit for MariaDB
Turn any MariaDB table into a searchable vector store. Query results come back in TOON v3 tabular format — a compact wire format that saves 10-55% of tokens (vs compact JSON) when feeding structured data to LLMs or agents.
Quick Start for Judges
Evaluating this submission? Start with the Judges' Testing Guide — four progressive paths from inspect-only (5 min, no install) to full test suite (15 min). The full docs site is at https://mariadb-hackathon-my-2026.github.io/seamless-rag/.
90-second verification:
git clone https://github.com/SunflowersLwtech/seamless-rag.git
cd seamless-rag
docker compose up -d --wait
docker compose exec app seamless-rag demo
A 90-second screencast of the same flow lives at docs/assets/demo.gif (also demo.mp4):
MariaDB Features We Use
This project is MariaDB-native end-to-end. The pipeline only works because of features that landed in MariaDB 11.7.2+ and is explicitly tuned for them — no external vector store, no shadow index, no application-side ANN.
| Feature | What we use it for | Where in the code |
|---|---|---|
VECTOR(N) column type |
First-class storage for 384-dim float32 embeddings, no BLOB workaround |
storage/mariadb.py:112 (schema), storage/mariadb.py:227 (auto-add column for arbitrary tables) |
VECTOR INDEX … DISTANCE=cosine (HNSW) |
Sub-linear similarity search; we tune mhnsw_ef_search = 100 per session for recall/latency trade-off |
storage/mariadb.py:114, storage/mariadb.py:337 |
VEC_DISTANCE_COSINE |
Distance function in ORDER BY so the planner picks the HNSW index |
storage/mariadb.py:343, storage/mariadb.py:362 |
Native binary protocol via mariadb-connector-python |
array.array('f', embedding) is sent verbatim — no VEC_FromText round-trip, no string parsing |
storage/mariadb.py:254, storage/mariadb.py:288 (batch insert) |
| CTE for context windowing | Single round-trip retrieval: closest chunks plus their neighbours by chunk_order |
storage/mariadb.py:341 (WITH closest AS …) |
| Hybrid SQL filter + vector ORDER BY | seamless-rag ask "waterproof watches" --where "price < 50" — SQL pre-filter narrows the candidate set, vector ranks within |
storage/mariadb.py:315 (validated WHERE), storage/mariadb.py:362 (combined query) |
| Connection pool + autocommit | mariadb.ConnectionPool with per-call lease, isolation-aware so the watcher never sees stale snapshots |
storage/mariadb.py:158, storage/mariadb.py:178 |
| Foreign keys + composite index | chunks.document_id REFERENCES documents(id) plus INDEX idx_doc_order(document_id, chunk_order) so the CTE neighbour-join stays index-only |
storage/mariadb.py:117-118 |
| Auto-schema for arbitrary tables | seamless-rag embed --table products --columns name,category adds a VECTOR(N) column and HNSW index to your existing table without touching its other columns |
storage/mariadb.py:227-232 |
Bare VEC_DISTANCE() auto-pick (MariaDB-only) |
When the index has DISTANCE=cosine, plain VEC_DISTANCE(...) reads it from the index and applies cosine — no other RDBMS does this. Demonstrated live by seamless-rag schema |
storage/mariadb.py:431 (compare_vec_distance), tests/integration/test_vector_operations.py (1e-6 equivalence assertion) |
See it for yourself in 5 seconds: seamless-rag schema pretty-prints SHOW CREATE TABLE chunks (highlighting vector(384) and VECTOR KEY ... DISTANCE=cosine), SHOW INDEX FROM chunks (with the VECTOR row called out), and runs a side-by-side VEC_DISTANCE() vs VEC_DISTANCE_COSINE() query so you can verify auto-pick parity yourself.
Tested against MariaDB 11.8 (the version shipped in the official mariadb:11.8 Docker image). 11/11 integration tests pass against the real server, exercising every feature above — see tests/integration/test_vector_operations.py.
Without MariaDB's VECTOR + HNSW, this project would either need a sidecar vector DB (Chroma/Qdrant/pgvector) or a from-scratch ANN implementation. Neither would be MariaDB-native, neither would benefit from the same indexes that already serve OLTP traffic.
Why
LLMs and agents consume structured data as context. The standard approach — dumping JSON — wastes tokens on repeated field names and structural characters:
[{"id":1,"name":"Widget","category":"Tools","price":29.99,"stock":150,"supplier":"Acme","rating":4.5},
{"id":2,"name":"Gadget","category":"Tools","price":19.99,"stock":300,"supplier":"Acme","rating":4.2}]
TOON tabular writes field names once, values as compact rows:
[2,]{id,name,category,price,stock,supplier,rating}:
1,Widget,Tools,29.99,150,Acme,4.5
2,Gadget,Tools,19.99,300,Acme,4.2
Measured on real public datasets (full benchmark):
| Dataset (query type) | Rows | JSON Tokens | TOON Tokens | Savings |
|---|---|---|---|---|
| MovieLens — top rated movies (7 cols) | 100 | 6,540 | 5,019 | 23.3% |
| MovieLens — metadata only (4 cols) | 100 | 2,258 | 1,364 | 39.6% |
| SF Restaurant — violations (9 cols) | 100 | 7,071 | 4,326 | 38.8% |
| SF Restaurant — high risk (9 cols) | 50 | 3,437 | 2,076 | 39.6% |
Savings scale with row count and stabilize at the dataset's natural ceiling:
| Rows | MovieLens (7 cols) | Restaurant (9 cols) |
|---|---|---|
| 10 | 21.7% | 34.6% |
| 50 | 22.0% | 38.2% |
| 100 | 24.1% | 38.8% |
| 500 | 29.0% | 38.9% |
TOON is not magic — it shines on structured tabular data with many columns and short values, which is exactly what comes out of database queries. All measurements use compact JSON (separators=(",",":")) as baseline.
Where It Fits
For structured database data, the industry uses two retrieval approaches. Seamless-RAG bridges both to LLMs:
"Q3 revenue by region?" "Find products similar to X"
│ │
Text-to-SQL Vector Search
(LLM generates SQL) (cosine similarity)
│ │
└──────────┬─────────────────────────┘
▼
MariaDB executes
▼
list[dict] results
▼
Seamless-RAG → TOON format ← saves 20-40% tokens
▼
LLM / Agent consumes
- Precise queries ("revenue > 1M"): write SQL directly, use
seamless-rag exportto TOON-format the results - Semantic queries ("similar products"): use
seamless-rag askfor vector search on text columns - Hybrid ("waterproof watches under $50"):
seamless-rag ask --where "price < 50"combines both
Seamless-RAG is a format + embedding bridge, not a replacement for SQL.
Quick Start
Your data is already in MariaDB. Seamless-RAG adds vectors and TOON.
Prerequisites — the mariadb Python connector is a C extension. Install the system library first:
- macOS:
brew install mariadb-connector-c - Debian/Ubuntu:
sudo apt install libmariadb-dev - Or use Docker (see Quick Start for Judges) and skip pip entirely
pip install "seamless-rag[mariadb,embeddings]" # from PyPI
# or, from source:
# pip install -e ".[mariadb,embeddings]"
docker compose up -d # MariaDB 11.8
seamless-rag init # create VECTOR columns + HNSW index
seamless-rag embed --table products --column description # embed existing rows
seamless-rag ask "Which products are most relevant?" # vector search → TOON → LLM
seamless-rag export "SELECT id, name, price FROM products LIMIT 20" # SQL → TOON
Optional extras: [gemini], [openai], [ollama], [web] (Gradio UI), or [all] for everything.
No file loading, no document chunking — data lives in MariaDB, Seamless-RAG bridges it to vectors and LLMs.
CLI Commands
seamless-rag init Create VECTOR columns + HNSW index
seamless-rag embed Bulk-embed existing table rows (core workflow)
seamless-rag watch Auto-embed new inserts in real time (Rich live)
seamless-rag ask <question> Vector search → TOON context → LLM answer
seamless-rag export <sql> Any SELECT → TOON format
seamless-rag benchmark JSON vs TOON token/cost comparison
seamless-rag web Gradio web UI (localhost-only by default)
seamless-rag demo End-to-end demo with sample data
seamless-rag ingest <path> Convenience: load text files for quick testing
Multi-column embedding — embed multiple columns for richer semantics:
# Single column (default)
seamless-rag embed --table products --column description
# Multi-column — values concatenated for richer vector search
seamless-rag embed --table products --columns "name,category,price,rating"
# Internally: "Widget — Tools — 29.99 — 4.5"
# Now "cheap high-rated tools" matches on price AND rating, not just description
seamless-rag ask "cheap high-rated tools" --where "price < 50"
Global options: --host, --port, --database, --provider, --model, --log-level
As Agent Tools
Seamless-RAG commands work as agent tools. An LLM agent can call these to interact with MariaDB:
# Agent tool: search MariaDB and get compact context
result = rag.ask("quarterly revenue by region", top_k=10)
# result.context_toon → compact tabular format for next LLM call
# result.savings_pct → token savings vs compact JSON
# Agent tool: export any SQL query as TOON
toon = rag.export("SELECT region, revenue, quarter FROM sales")
# Feed to next agent step with minimal token overhead
# Agent tool: multi-column embed for richer search
rag.embed_table("products", text_column=["name", "category", "price"])
# "Widget — Tools — 29.99" → vector search matches name AND price
In a 20-step agent workflow querying a database at each step (measured on real data):
| Dataset | JSON (20 steps) | TOON (20 steps) | Tokens Saved | Cost Saved |
|---|---|---|---|---|
| MovieLens (7 cols, 50 rows/step) | 73,680 | 58,760 | 14,920 | $0.037 |
| Restaurant (9 cols, 50 rows/step) | 69,640 | 42,640 | 27,000 | $0.068 |
Python API
from seamless_rag import SeamlessRAG
with SeamlessRAG(host="localhost", database="mydb") as rag:
rag.init()
rag.ingest("research.txt", ["chunk1...", "chunk2..."])
# Single-column embed (default)
rag.embed_table("articles", text_column="content")
# Multi-column embed — richer semantics
rag.embed_table("products", text_column=["name", "category", "price"])
# Semantic search with hybrid filter
result = rag.ask("affordable tools", where="price < 50", mmr=True)
print(result.answer) # LLM-generated answer
print(result.context_toon) # compact context
print(f"Saved {result.savings_pct:.0f}% tokens")
Pluggable Providers
Both embedding and LLM layers use typing.Protocol — no base class needed:
| Layer | Providers | Default |
|---|---|---|
| Embedding | SentenceTransformers, Gemini, OpenAI, Ollama | SentenceTransformers (local, free) |
| LLM | Ollama, Gemini, OpenAI | Ollama (local, free) |
Switch via env vars: EMBEDDING_PROVIDER=gemini LLM_PROVIDER=openai seamless-rag ask "..."
See the Providers guide for adding custom providers.
Architecture
seamless-rag CLI / Python API / Agent Tools
│
├── EmbeddingProvider (Protocol) ← 4 built-in, add your own
├── LLMProvider (Protocol) ← 3 built-in, add your own
├── VectorStore (Protocol) ← MariaDB with connection pool
│ └── VECTOR(N) + HNSW index + VEC_DISTANCE_COSINE
├── AutoEmbedder ← batch + watch, multi-column concat
├── RAGEngine ← search → TOON → LLM (retry) → benchmark
├── TOONEncoder ← full v3 spec (166/166)
└── TokenBenchmark ← tiktoken + GPT-4o cost calc
Test Results
538 tests passing (100%)
lint: 100%
unit: 100% (338/338)
spec: 100% (166/166 TOON v3 conformance)
integration: 100% (17/17)
eval: 100%
Security
- SQL injection prevention: WHERE filters and SELECT queries validated via sqlglot AST parsing — blocks writes, DDL, subqueries, and dangerous functions (SLEEP, BENCHMARK, LOAD_FILE)
- Web UI: binds
127.0.0.1by default;--sharerequires auth viaSEAMLESS_WEB_USER/SEAMLESS_WEB_PASSWORD; error messages never leak server internals - LLM calls: context truncated to 20K chars; retry with jitter for transient errors; rate-limit detection
- Identifiers: all table/column names validated against
^[A-Za-z_][A-Za-z0-9_]*$
Built for the MariaDB Ecosystem
- MariaDB 11.7+ VECTOR columns, HNSW indexes, VEC_DISTANCE_COSINE
- Native binary protocol via
mariadb-connector-python(array.array float32, noVEC_FromTextround-trip) - Connection pooling with unique pool names for concurrent instances
- Version validation (>= 11.7.2) on init
Where this fits in the MariaDB ecosystem:
| Layer | Project | Relationship |
|---|---|---|
| Database | MariaDB Server 11.7+ | Required runtime — VECTOR/HNSW lands in 11.7.2 |
| Driver | mariadb-connector-python | Used directly via mariadb.ConnectionPool |
| Knowledge Base | VEC_DISTANCE_COSINE, VECTOR INDEX | Reference docs for the SQL we generate |
| Hackathons | MariaDB Python Hackathon 2025 winners | Sister projects (Adaptive Query Optimizer, Apache Airflow integration) |
| Distribution | PyPI: seamless-rag |
pip install seamless-rag[mariadb] |
License
Copyright 2026 LiuWei (SunflowersLwtech)
Licensed under the Apache License, Version 2.0
See LICENSE | CONTRIBUTING | Documentation
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file seamless_rag-0.1.3.tar.gz.
File metadata
- Download URL: seamless_rag-0.1.3.tar.gz
- Upload date:
- Size: 2.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
33a166d760f3e042f453c25246a159e26f00301f4d175be8f26eb7be60f3ed5b
|
|
| MD5 |
5856817a08e6018410e35b89d475d0a2
|
|
| BLAKE2b-256 |
9dc8ae720cc8d88ff5d46a954a3151ab97e7a074976d6a099ba756fde6e30a0e
|
File details
Details for the file seamless_rag-0.1.3-py3-none-any.whl.
File metadata
- Download URL: seamless_rag-0.1.3-py3-none-any.whl
- Upload date:
- Size: 60.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
b66caa1ff9fb0402d0e3ba7a535bafebada84bfa90879e37c92486ac1fef8fcb
|
|
| MD5 |
b1f3537fe182260caeab7d1273db14b7
|
|
| BLAKE2b-256 |
0b02fcde2fa96f2e8330bffe34bdcbf3bddd6d564702a36705589b5f06795293
|