File-based semantic layer for AI agents: YAML in git, validated and compiled by Rust
semlay
A file-based semantic layer for AI agents. Define tables, columns, metrics, joins, and business context in simple YAML files. Store them in git, validate them in CI, compile them to a single context bundle, and deploy it to S3 — no running service required. Rust core, Python bindings.
your-repo/
├── semantic.yml # project manifest
└── models/
├── orders.yml # one table per file: source, columns, metrics, joins
└── customers.yml
Why
AI agents writing SQL against raw schemas guess at business logic and get it wrong. Semantic layers fix that, but the existing ones are SaaS-attached, dbt-coupled, or API-centric platforms you have to operate. semlay is just files:
- Physical format is first-class. Declare whether a dataset is Delta Lake, Iceberg, Parquet, CSV, or a Postgres table. SQL generation uses it: DuckDB queries against Parquet render as read_parquet(...) and run with zero setup.
- Agent context is first-class. Metrics carry context (pitfalls, caveats), example_queries, valid time grains, and deprecation notes pointing at replacements. The linter warns when context is missing.
- Structured errors. A bad query returns machine-readable JSON so agents self-correct instead of hallucinating (see below).
- Git-native deployment. Validate on PR, compile on merge, push one JSON bundle to S3. Agents load the bundle; nothing to host.
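To make the "physical format" point concrete, here is a minimal Python sketch of how a declared format could select a DuckDB table function. Only the Parquet mapping (read_parquet) is confirmed by the generated SQL shown later in this README. The other entries use DuckDB functions that exist, but whether semlay emits them is an assumption, and render_duckdb_source is a hypothetical name, not part of semlay's API.

```python
# Hypothetical helper for illustration; not semlay's implementation.
DUCKDB_READERS = {
    "parquet": "read_parquet",  # confirmed by the README's generated SQL
    "csv": "read_csv",          # assumption
    "json": "read_json",        # assumption
    "delta": "delta_scan",      # assumption; needs DuckDB's delta extension
    "iceberg": "iceberg_scan",  # assumption; needs DuckDB's iceberg extension
}


def render_duckdb_source(fmt: str, location: str, alias: str) -> str:
    """Render a file-based source as a DuckDB table function with an alias."""
    reader = DUCKDB_READERS.get(fmt)
    if reader is None:
        raise ValueError(f"no DuckDB reader for format {fmt!r}")
    escaped = location.replace("'", "''")  # escape single quotes in the path
    return f"{reader}('{escaped}') AS \"{alias}\""


print(render_duckdb_source("parquet", "data/orders.parquet", "orders"))
```

Keeping the mapping in one table means a new format is a one-line change, and an unsupported format fails loudly instead of producing SQL that cannot run.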
The spec in one file
# models/orders.yml
version: 1
source:
format: parquet # delta | iceberg | parquet | csv | txt | json
location: data/orders.parquet
# ...or a database: # postgres | mysql | snowflake | bigquery | databricks
# format: postgres
# connection: warehouse # defined in semantic.yml; DSN comes from an env var
# schema: analytics
# relation: fct_orders # physical name, if it differs from table.name
table:
name: orders
description: One row per order line item from the order management system.
grain: order line item # what one row means — agents need this
owner: data-eng@example.com
status: certified # certified | experimental | deprecated
columns:
- name: order_id
type: int64 # Arrow type names: int64, float64, decimal(18,2),
primary_key: true # string, bool, date, timestamp, time, binary...
description: Natural key from the OMS.
- name: email
type: string
classification: pii # public | internal | restricted | pii
- name: status
type: string
description: One of placed, shipped, returned, cancelled.
synonyms: [order state] # how humans actually ask for it
- name: ordered_at
type: timestamp
time_dimension: true # default time axis for grain queries
metrics:
- name: gross_revenue
expr: SUM(amount)
description: Pre-discount revenue in USD.
context: |
Excludes refunds — join refunds for net revenue. Cancelled lines are
excluded by the built-in filter.
example_queries: # anchor agent question-matching
- gross revenue by region last 30 days
filters:
- status <> 'cancelled' # always applied, per-metric
valid_grains: [day, week, month, quarter, year] # hour is rejected
relationships:
- to: customers
type: many_to_one # one_to_one | one_to_many | many_to_one | many_to_many
on: customer_id # or columns: {local: ..., remote: ...}
context: |
The OMS backfills late order edits up to 48 hours; treat the last two days
as provisional when reporting.
JSON Schemas for editor autocomplete and CI validation live in
schemas/.
Four complete example projects live in examples/ — Parquet
with runnable sample data, Snowflake with connections and deprecation,
Postgres with views and explicit join columns, and a Delta/Iceberg/CSV
lakehouse with two-hop joins. See examples/README.md.
CLI walkthrough
All commands run against examples/shop, which ships with
sample Parquet data.
Validate (hard errors) and lint (weak agent context):
$ semlay validate examples/shop
ok: 2 model(s), 0 errors
$ semlay lint examples/shop
ok: 2 model(s), 0 warning(s)
On a broken project, errors name the file and the fix:
error: models/bad.yml: source format `Postgres` requires `connection`
error: models/bad.yml: column `c`: unknown column type `varchar` (expected one of int8..int64, ...)
warning: models/bad.yml: metric `m` has no `context`; agents will guess at caveats
Export agent context — the whole project as compact markdown for a system
prompt, CLAUDE.md, or RAG ingestion. This is how agents consume the layer
with zero tooling on their side:
$ semlay context examples/shop -o semantic-context.md
The output covers every table (grain, source, deprecations), column tables with PII flags and synonyms, and each metric's expression, always-applied filters, valid grains, caveats, and example questions.
Generate SQL from a semantic query:
$ semlay sql examples/shop -m gross_revenue,order_count -d region -g month
SELECT
date_trunc('month', "orders"."ordered_at") AS "ordered_at_month",
"customers"."region" AS "region",
SUM("orders"."amount") FILTER (WHERE "orders"."status" <> 'cancelled') AS "gross_revenue",
COUNT(DISTINCT "orders"."order_id") AS "order_count"
FROM read_parquet('data/orders.parquet') AS "orders"
LEFT JOIN read_parquet('data/customers.parquet') AS "customers"
ON "orders"."customer_id" = "customers"."customer_id"
GROUP BY 1, 2
ORDER BY 1, 2
Note what the layer did: qualified every column, applied gross_revenue's
built-in filter to its aggregate only (cancelled orders still count in
order_count), planned the join from the relationship graph, and rendered the
Parquet source as a DuckDB reader so the SQL runs as-is. Dialects: duckdb,
postgres, databricks, snowflake. Dialect differences are handled per
construct — Snowflake has no FILTER clause, so the same metric renders
there as SUM(CASE WHEN status <> 'cancelled' THEN amount END).
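The per-construct dialect handling described above can be sketched in a few lines of Python. This is an illustration only: render_filtered_sum is a hypothetical function, and semlay itself works on a sqlparser AST in Rust rather than on strings.

```python
# Dialects that accept FILTER (WHERE ...) on aggregates, per this README.
SUPPORTS_FILTER_CLAUSE = {"duckdb", "postgres", "databricks"}


def render_filtered_sum(column: str, predicate: str, dialect: str) -> str:
    """Render SUM(column) restricted to rows matching predicate."""
    if dialect in SUPPORTS_FILTER_CLAUSE:
        return f"SUM({column}) FILTER (WHERE {predicate})"
    if dialect == "snowflake":
        # No FILTER clause: non-matching rows become NULL, which SUM ignores.
        return f"SUM(CASE WHEN {predicate} THEN {column} END)"
    raise ValueError(f"unsupported dialect: {dialect}")


for dialect in ("duckdb", "snowflake"):
    print(dialect, "->", render_filtered_sum("amount", "status <> 'cancelled'", dialect))
```

Both forms scope the filter to a single aggregate, which is why cancelled orders can be excluded from one metric and still counted in another within the same query.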
Top-N queries sort by any metric or dimension:
$ semlay sql examples/shop -m gross_revenue -d region --sort gross_revenue:desc -l 5
Structured errors — typos and bad grains come back as JSON agents can act on:
$ semlay sql examples/shop -m gross_revenu
{
"code": "unknown_metric",
"name": "gross_revenu",
"suggestions": ["gross_revenue"]
}
$ semlay sql examples/shop -m gross_revenue -g hour
{
"code": "incompatible_grain",
"metric": "gross_revenue",
"grain": "hour",
"valid": ["day", "week", "month", "quarter", "year"]
}
Other codes: unknown_dimension (with suggestions), ambiguous_dimension
(with qualified candidates), no_join_path, metrics_span_tables,
no_time_dimension, unknown_sort_field (with the sortable fields).
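As a rough illustration of how such errors can be produced, the Python sketch below checks a query against a small metric catalog and returns error objects with the same shape as the two examples above. The catalog, the check_query function, and the use of difflib for suggestions are assumptions made for this sketch; the README does not say how semlay computes suggestions.

```python
import difflib

# Hypothetical catalog; in semlay this information comes from the YAML models.
GRAINS = ["day", "week", "month", "quarter", "year"]
METRICS = {
    "gross_revenue": {"valid_grains": GRAINS},
    "order_count": {"valid_grains": GRAINS},
}


def check_query(metric, grain=None):
    """Return None when the query is valid, else a structured error dict."""
    if metric not in METRICS:
        return {
            "code": "unknown_metric",
            "name": metric,
            # Closest known names by similarity (stand-in for semlay's matcher).
            "suggestions": difflib.get_close_matches(metric, list(METRICS), n=3),
        }
    valid = METRICS[metric]["valid_grains"]
    if grain is not None and grain not in valid:
        return {
            "code": "incompatible_grain",
            "metric": metric,
            "grain": grain,
            "valid": valid,
        }
    return None


print(check_query("gross_revenu"))
print(check_query("gross_revenue", grain="hour"))
```

The point of the shape is that every error carries the data needed for the next attempt (suggestions, valid grains), so an agent does not have to parse a human-readable message.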
Fan-out protection. Relationship cardinality is enforced, not decorative.
If reaching a dimension requires traversing a one-to-many join, the row
multiplication would silently inflate SUMs — the classic semantic-layer
correctness bug. semlay refuses with a structured error instead:
{
"code": "fan_out_join",
"from": "orders",
"to": "line_items"
}
(The message tells the agent to define the metric on the many side instead.)
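The check can be pictured as a path search over the relationship graph followed by a cardinality test on each hop. The Python sketch below is a simplified model under stated assumptions: the graph contents and plan_joins are hypothetical, edges are followed only in their declared direction, and the "from"/"to" fields on no_join_path are guesses (the README lists only the code).

```python
from collections import deque

# Hypothetical relationship graph: (from_table, to_table) -> cardinality.
RELATIONSHIPS = {
    ("orders", "customers"): "many_to_one",
    ("customers", "regions"): "many_to_one",
    ("orders", "line_items"): "one_to_many",
}

# Cardinalities that multiply base-table rows and would inflate SUMs.
FAN_OUT = {"one_to_many", "many_to_many"}


def plan_joins(base, target):
    """Breadth-first search for a join path; return hops or an error dict."""
    queue = deque([(base, [])])
    seen = {base}
    while queue:
        table, path = queue.popleft()
        if table == target:
            for src, dst, kind in path:
                if kind in FAN_OUT:
                    return {"code": "fan_out_join", "from": src, "to": dst}
            return path
        for (src, dst), kind in RELATIONSHIPS.items():
            if src == table and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [(src, dst, kind)]))
    return {"code": "no_join_path", "from": base, "to": target}


print(plan_joins("orders", "regions"))     # two many_to_one hops
print(plan_joins("orders", "line_items"))  # refused: fan_out_join
```

Refusing at planning time is the important design choice here: the inflated SUM would otherwise be a valid query with a wrong answer, which neither the engine nor the agent would notice.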
Expression safety. Metric expressions and filters are parsed as a single
SQL expression and re-rendered from the AST. Statements can't be smuggled in:
status = 'x'; DROP TABLE orders is rejected as trailing input, not
truncated, not executed.
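To show the "reject, don't truncate" behavior in isolation, here is a deliberately simplified Python stand-in. It is not what semlay does: semlay parses the expression with sqlparser and re-renders the AST, while this sketch only scans for a statement separator outside single-quoted string literals. It ignores comments and quoted identifiers, and reject_trailing_statements is a hypothetical name.

```python
def reject_trailing_statements(expr: str) -> str:
    """Return expr unchanged, or raise if a ';' appears outside a string literal."""
    in_string = False
    for offset, ch in enumerate(expr):
        if ch == "'":
            # A doubled quote ('') toggles twice, so escaped quotes stay in-string.
            in_string = not in_string
        elif ch == ";" and not in_string:
            raise ValueError(f"trailing input at offset {offset}: {expr[offset:]!r}")
    if in_string:
        raise ValueError("unterminated string literal")
    return expr


print(reject_trailing_statements("status <> 'cancelled'"))
try:
    reject_trailing_statements("status = 'x'; DROP TABLE orders")
except ValueError as err:
    print("rejected:", err)
```

A real parser is the stronger guarantee, because re-rendering from the AST means only constructs the parser understood can reach the generated SQL.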
Compile the whole project into one deployable context bundle:
$ semlay compile examples/shop -o bundle.json
The bundle is the project with refs resolved plus a join graph — everything an
agent needs in one fetch. See
examples/deploy-bundle.yml for a GitHub Actions
workflow that validates on PR and ships the bundle to S3 on merge.
MCP server
Serve the project to any MCP client (Claude Code, Claude Desktop, Cursor) over stdio — no network, no auth, nothing to host:
$ semlay mcp examples/shop
Claude Code registration:
claude mcp add shop-semantics -- semlay mcp /path/to/your/semantic-layer
Six tools: search (names, synonyms, descriptions, example questions),
list_tables, get_table (full per-table context), list_metrics,
generate_sql, get_context. Tool errors return the same structured JSON as
the CLI with isError: true, so the model reads unknown_metric +
suggestions and retries — the self-correction loop is covered by an
integration test that drives the real binary over stdio.
The server reloads YAML from disk on every call: edit a model, the agent sees it on its next tool call.
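The self-correction loop on the client side might look like the Python sketch below. The envelope (result.content as a list of text items plus isError) follows the MCP tool-result format. That the structured error is carried as JSON in the first text item, and that generate_sql takes a "metrics" argument, are assumptions for illustration; next_arguments is a hypothetical helper.

```python
import json


def next_arguments(arguments, response):
    """Return corrected tool arguments for a retry, or None if no retry applies."""
    result = response.get("result", {})
    if not result.get("isError"):
        return None  # call succeeded; nothing to correct
    # Assumption: the structured error is JSON in the first text content item.
    error = json.loads(result["content"][0]["text"])
    if error.get("code") == "unknown_metric" and error.get("suggestions"):
        best = error["suggestions"][0]
        fixed = [best if m == error["name"] else m for m in arguments["metrics"]]
        return {**arguments, "metrics": fixed}
    return None  # other error codes need different handling


error_json = json.dumps(
    {"code": "unknown_metric", "name": "gross_revenu", "suggestions": ["gross_revenue"]}
)
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {"content": [{"type": "text", "text": error_json}], "isError": True},
}
print(next_arguments({"metrics": ["gross_revenu"], "dimensions": ["region"]}, response))
```

In practice the model performs this step itself by reading the error; the sketch only makes explicit which fields it relies on.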
Python
import semlay
semlay.validate("examples/shop") # [] when clean, else [{file, message}]
semlay.lint("examples/shop") # warnings, same shape
bundle = semlay.compile_bundle("examples/shop") # dict
md = semlay.context_markdown("examples/shop") # agent-ready markdown
sql = semlay.generate_sql(
"examples/shop",
{
"metrics": ["gross_revenue"],
"dimensions": ["region"],
"grain": "month",
"order_by": [{"field": "gross_revenue", "desc": True}],
"limit": 5,
},
dialect="duckdb",
)
# Structured errors raise ValueError carrying JSON:
import json
try:
semlay.generate_sql("examples/shop", {"metrics": ["gross_revenu"]})
except ValueError as e:
err = json.loads(str(e))
assert err["code"] == "unknown_metric"
assert err["suggestions"] == ["gross_revenue"]
Build locally with maturin develop --manifest-path python/Cargo.toml.
How it works
crates/semlay-core spec types (serde), loader, validator, linter, bundle compiler
crates/semlay-sql semantic query -> resolved plan -> dialect SQL (sqlparser AST)
crates/semlay-mcp MCP stdio server (hand-rolled JSON-RPC; no async runtime)
crates/semlay-cli `semlay` binary
python/ PyO3 bindings (abi3, Python >= 3.9)
schemas/ generated JSON Schemas for the YAML files
Column types map onto Apache Arrow's type system — one type model across every
physical format. Metric expressions and filters are parsed with sqlparser; bare
column references are qualified against the base table, and per-metric filters
attach as FILTER (WHERE ...) on aggregate AST nodes, which DuckDB, Postgres,
and Spark/Databricks all support.
Tests
The test suite doubles as a tour of the behavior:
- crates/semlay-core/tests/validate_lint_tests.rs: one test per validation rule and lint.
- crates/semlay-sql/tests/sql_tests.rs: golden SQL, dialect quoting, error codes.
- crates/semlay-sql/tests/edge_tests.rs: two-hop joins, ambiguous dimensions, ratio metrics filtering both aggregates, database sources.
- crates/semlay-cli/tests/cli_tests.rs: the real binary end to end, including executing generated SQL in DuckDB and asserting the numbers.
- python/tests/test_semlay.py: bindings.
- crates/semlay-mcp/tests/protocol_tests.rs: MCP handshake, tool listing, the agent self-correction loop.
cargo test --workspace # 65 tests
pytest python/tests # 10 tests (after maturin develop)
Limitations (current, by design of v0)
- Metrics in one query must live on one table. Cross-table metric math (revenue from orders divided by spend from ads) is not composed yet; query each side and combine downstream.
- Derived metrics can't reference other metrics. Ratios work as plain expressions (SUM(a) / COUNT(DISTINCT b)); metric_a / metric_b reuse is planned.
- Query filters are scoped to the base table's columns. Filter on joined tables by qualifying dimensions instead.
- No derived dimensions (CASE bucketing, concatenations); model them as columns upstream for now.
- week grain follows each engine's week-start convention; semlay does not normalize it.
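For the first limitation, "combine downstream" can be as small as a keyed division in Python. The sketch below assumes each side has already been queried separately and reduced to a mapping from a shared key (here a month string) to a value. The numbers are made up for illustration, and ratio_by_key is a hypothetical helper.

```python
def ratio_by_key(numerators, denominators):
    """Divide matching keys; skip keys with a missing or zero denominator."""
    out = {}
    for key, num in numerators.items():
        den = denominators.get(key)
        if den:  # skips both None (missing key) and 0 (division by zero)
            out[key] = num / den
    return out


# Illustrative values only: one result per side, keyed by month.
revenue_by_month = {"2024-01": 1200.0, "2024-02": 900.0}
spend_by_month = {"2024-01": 300.0, "2024-02": 0.0, "2024-03": 50.0}

print(ratio_by_key(revenue_by_month, spend_by_month))  # {'2024-01': 4.0}
```

Both sides need to be queried at the same grain for the keys to line up, and if that grain is week, the week-start caveat above applies to each engine involved.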
Status
Early. Spec v1, validator, linter, context bundle (with provenance), agent context export, SQL generation with fan-out protection and top-N, MCP server, Python bindings. Planned next: schema drift detection against real Parquet/Delta/Iceberg metadata (Arrow-based), OSI (Open Semantic Interchange) import/export.
License
Apache-2.0