File-based semantic layer for AI agents: YAML in git, validated and compiled by Rust

Project description

semlay

A file-based semantic layer for AI agents. Define tables, columns, metrics, joins, and business context in simple YAML files. Store them in git, validate them in CI, compile them to a single context bundle, and deploy it to S3 — no running service required. Rust core, Python bindings.

your-repo/
├── semantic.yml          # project manifest
└── models/
    ├── orders.yml        # one table per file: source, columns, metrics, joins
    └── customers.yml

Why

AI agents writing SQL against raw schemas guess at business logic and get it wrong. Semantic layers fix that, but the existing ones are SaaS-attached, dbt-coupled, or API-centric platforms you have to operate. semlay is just files:

Physical format is first-class. Declare whether a dataset is Delta Lake, Iceberg, Parquet, CSV, or a Postgres table. SQL generation uses it — DuckDB queries against Parquet render as read_parquet(...) and run with zero setup.
Agent context is first-class. Metrics carry context (pitfalls, caveats), example_queries, valid time grains, and deprecation notes pointing at replacements. The linter warns when context is missing.
Structured errors. A bad query returns machine-readable JSON so agents self-correct instead of hallucinating (see below).
Git-native deployment. Validate on PR, compile on merge, push one JSON bundle to S3. Agents load the bundle; nothing to host.

The spec in one file

# models/orders.yml
version: 1
source:
  format: parquet            # delta | iceberg | parquet | csv | txt | json
  location: data/orders.parquet
  # ...or a database:        # postgres | mysql | snowflake | bigquery | databricks
  # format: postgres
  # connection: warehouse    # defined in semantic.yml; DSN comes from an env var
  # schema: analytics
  # relation: fct_orders     # physical name, if it differs from table.name

table:
  name: orders
  description: One row per order line item from the order management system.
  grain: order line item     # what one row means — agents need this
  owner: data-eng@example.com
  status: certified          # certified | experimental | deprecated

columns:
  - name: order_id
    type: int64              # Arrow type names: int64, float64, decimal(18,2),
    primary_key: true        # string, bool, date, timestamp, time, binary...
    description: Natural key from the OMS.
  - name: email
    type: string
    classification: pii      # public | internal | restricted | pii
  - name: status
    type: string
    description: One of placed, shipped, returned, cancelled.
    synonyms: [order state]  # how humans actually ask for it
  - name: ordered_at
    type: timestamp
    time_dimension: true     # default time axis for grain queries

metrics:
  - name: gross_revenue
    expr: SUM(amount)
    description: Pre-discount revenue in USD.
    context: |
      Excludes refunds — join refunds for net revenue. Cancelled lines are
      excluded by the built-in filter.
    example_queries:                 # anchor agent question-matching
      - gross revenue by region last 30 days
    filters:
      - status <> 'cancelled'        # always applied, per-metric
    valid_grains: [day, week, month, quarter, year]   # hour is rejected

relationships:
  - to: customers
    type: many_to_one        # one_to_one | one_to_many | many_to_one | many_to_many
    on: customer_id          # or columns: {local: ..., remote: ...}

context: |
  The OMS backfills late order edits up to 48 hours; treat the last two days
  as provisional when reporting.

JSON Schemas for editor autocomplete and CI validation live in schemas/.

Four complete example projects live in examples/ — Parquet with runnable sample data, Snowflake with connections and deprecation, Postgres with views and explicit join columns, and a Delta/Iceberg/CSV lakehouse with two-hop joins. See examples/README.md.

CLI walkthrough

All commands run against examples/shop, which ships with sample Parquet data.

Validate (hard errors) and lint (weak agent context):

$ semlay validate examples/shop
ok: 2 model(s), 0 errors

$ semlay lint examples/shop
ok: 2 model(s), 0 warning(s)

On a broken project, errors name the file and the fix:

error: models/bad.yml: source format `Postgres` requires `connection`
error: models/bad.yml: column `c`: unknown column type `varchar` (expected one of int8..int64, ...)
warning: models/bad.yml: metric `m` has no `context`; agents will guess at caveats
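
As a rough illustration of the two hard errors above, the rules can be sketched in plain Python. This is a hypothetical simplification, not semlay's Rust validator: the function name `validate_model`, the dict-shaped model, and the shortened type list are assumptions made for the example.

```python
# Hypothetical, simplified re-creation of two validation rules; not semlay's code.
FILE_FORMATS = {"delta", "iceberg", "parquet", "csv", "txt", "json"}
DATABASE_FORMATS = {"postgres", "mysql", "snowflake", "bigquery", "databricks"}
# Assumed subset of the Arrow-style type names listed in the spec comments.
KNOWN_TYPES = {"int64", "float64", "string", "bool", "date", "timestamp", "time", "binary"}


def validate_model(file, model):
    """Return a list of {file, message} dicts; an empty list means the model is clean."""
    errors = []
    source = model.get("source", {})
    fmt = source.get("format")
    if fmt in DATABASE_FORMATS and "connection" not in source:
        errors.append({"file": file, "message": f"source format `{fmt}` requires `connection`"})
    elif fmt not in FILE_FORMATS | DATABASE_FORMATS:
        errors.append({"file": file, "message": f"unknown source format `{fmt}`"})
    for column in model.get("columns", []):
        ctype = column.get("type", "")
        # decimal(p,s) is parameterised, so only its prefix is checked here.
        if ctype not in KNOWN_TYPES and not ctype.startswith("decimal("):
            errors.append(
                {"file": file, "message": f"column `{column.get('name')}`: unknown column type `{ctype}`"}
            )
    return errors


bad = {"source": {"format": "postgres"}, "columns": [{"name": "c", "type": "varchar"}]}
for err in validate_model("models/bad.yml", bad):
    print(f"error: {err['file']}: {err['message']}")
```

The return shape mirrors the `[{file, message}]` list that the Python bindings document for `semlay.validate`.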

Export agent context — the whole project as compact markdown for a system prompt, CLAUDE.md, or RAG ingestion. This is how agents consume the layer with zero tooling on their side:

$ semlay context examples/shop -o semantic-context.md

The output covers every table (grain, source, deprecations), column tables with PII flags and synonyms, and each metric's expression, always-applied filters, valid grains, caveats, and example questions.

Generate SQL from a semantic query:

$ semlay sql examples/shop -m gross_revenue,order_count -d region -g month

SELECT
  date_trunc('month', "orders"."ordered_at") AS "ordered_at_month",
  "customers"."region" AS "region",
  SUM("orders"."amount") FILTER (WHERE "orders"."status" <> 'cancelled') AS "gross_revenue",
  COUNT(DISTINCT "orders"."order_id") AS "order_count"
FROM read_parquet('data/orders.parquet') AS "orders"
LEFT JOIN read_parquet('data/customers.parquet') AS "customers"
  ON "orders"."customer_id" = "customers"."customer_id"
GROUP BY 1, 2
ORDER BY 1, 2

Note what the layer did: qualified every column, applied gross_revenue's built-in filter to its aggregate only (cancelled orders still count in order_count), planned the join from the relationship graph, and rendered the Parquet source as a DuckDB reader so the SQL runs as-is. Dialects: duckdb, postgres, databricks, snowflake. Dialect differences are handled per construct — Snowflake has no FILTER clause, so the same metric renders there as SUM(CASE WHEN status <> 'cancelled' THEN amount END).
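
The per-dialect handling of a filtered aggregate can be sketched as follows. semlay renders SQL from a sqlparser AST in Rust; the string-based helper below, including the name `render_filtered_sum` and the capability table, is an assumption made purely for illustration.

```python
# Illustrative only: a string-based stand-in for AST-level rendering.
SUPPORTS_FILTER_CLAUSE = {
    "duckdb": True,
    "postgres": True,
    "databricks": True,
    "snowflake": False,
}


def render_filtered_sum(column, predicate, dialect):
    """Render SUM(column) restricted to rows matching predicate for one dialect."""
    if dialect not in SUPPORTS_FILTER_CLAUSE:
        raise ValueError(f"unknown dialect: {dialect}")
    if SUPPORTS_FILTER_CLAUSE[dialect]:
        return f"SUM({column}) FILTER (WHERE {predicate})"
    # No FILTER clause: fold the predicate into the aggregate argument.
    # Rows that fail the predicate yield NULL, which SUM ignores.
    return f"SUM(CASE WHEN {predicate} THEN {column} END)"


for dialect in ("duckdb", "snowflake"):
    print(dialect, "->", render_filtered_sum("amount", "status <> 'cancelled'", dialect))
```

Because the filter lives inside each aggregate rather than in a WHERE clause, other metrics in the same query keep seeing every row.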

Top-N queries sort by any metric or dimension:

$ semlay sql examples/shop -m gross_revenue -d region --sort gross_revenue:desc -l 5

Structured errors — typos and bad grains come back as JSON agents can act on:

$ semlay sql examples/shop -m gross_revenu
{
  "code": "unknown_metric",
  "name": "gross_revenu",
  "suggestions": ["gross_revenue"]
}

$ semlay sql examples/shop -m gross_revenue -g hour
{
  "code": "incompatible_grain",
  "metric": "gross_revenue",
  "grain": "hour",
  "valid": ["day", "week", "month", "quarter", "year"]
}

Other codes: unknown_dimension (with suggestions), ambiguous_dimension (with qualified candidates), no_join_path, metrics_span_tables, no_time_dimension, unknown_sort_field (with the sortable fields).
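
A minimal sketch of how the two error payloads above could be produced. How semlay actually ranks suggestions is not documented here; this version uses `difflib.get_close_matches` from the Python standard library as a stand-in, and the `METRICS` catalog and `check_query` function are hypothetical.

```python
import difflib
import json

# Hypothetical in-memory metric catalog; in semlay these come from the YAML models.
METRICS = {
    "gross_revenue": {"valid_grains": ["day", "week", "month", "quarter", "year"]},
    "order_count": {"valid_grains": ["day", "week", "month", "quarter", "year"]},
}


def check_query(metric, grain=None):
    """Return None when the request is valid, else a structured error dict."""
    if metric not in METRICS:
        return {
            "code": "unknown_metric",
            "name": metric,
            # Assumption: fuzzy string matching picks the suggestions.
            "suggestions": difflib.get_close_matches(metric, list(METRICS), n=3, cutoff=0.8),
        }
    valid = METRICS[metric]["valid_grains"]
    if grain is not None and grain not in valid:
        return {"code": "incompatible_grain", "metric": metric, "grain": grain, "valid": valid}
    return None


print(json.dumps(check_query("gross_revenu"), indent=2))
print(json.dumps(check_query("gross_revenue", grain="hour"), indent=2))
```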

Fan-out protection. Relationship cardinality is enforced, not decorative. If reaching a dimension requires traversing a one-to-many join, the row multiplication would silently inflate SUMs — the classic semantic-layer correctness bug. semlay refuses with a structured error instead:

{
  "code": "fan_out_join",
  "from": "orders",
  "to": "line_items"
}

(The message tells the agent to define the metric on the many side instead.)
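
The cardinality check can be pictured as a walk along the planned join path. The sketch below is an assumption-laden illustration: the `RELATIONSHIPS` table and `check_join_path` are hypothetical, treating `many_to_many` as a fan-out is this example's choice (the text above only names one-to-many), and the fields on the `no_join_path` payload are guessed.

```python
# Hypothetical relationship table keyed by (from_table, to_table).
RELATIONSHIPS = {
    ("orders", "customers"): "many_to_one",
    ("orders", "line_items"): "one_to_many",
}
# Cardinalities that can repeat a base-table row (many_to_many is an assumption).
FAN_OUT_TYPES = {"one_to_many", "many_to_many"}


def check_join_path(path):
    """Walk consecutive table pairs; return an error dict for the first unsafe hop."""
    for src, dst in zip(path, path[1:]):
        kind = RELATIONSHIPS.get((src, dst))
        if kind is None:
            # Field names here are guessed; only the code is documented.
            return {"code": "no_join_path", "from": src, "to": dst}
        if kind in FAN_OUT_TYPES:
            return {"code": "fan_out_join", "from": src, "to": dst}
    return None


print(check_join_path(["orders", "customers"]))
print(check_join_path(["orders", "line_items"]))
```

A many-to-one hop keeps one output row per base row, so SUMs over the base table stay correct; a one-to-many hop repeats base rows once per match, which is what the refusal guards against.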

Expression safety. Metric expressions and filters are parsed as a single SQL expression and re-rendered from the AST, so a second statement cannot be smuggled in: status = 'x'; DROP TABLE orders is rejected as trailing input rather than truncated or executed.
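
The same parse-one-expression-then-re-render idea can be demonstrated with Python's own `ast` module. This is an analogy only: it parses Python expression syntax, not SQL, and `normalize_expression` is a hypothetical name. semlay does the equivalent with sqlparser in Rust.

```python
import ast


def normalize_expression(source):
    """Parse source as exactly one expression and re-render it from the AST."""
    try:
        # mode="eval" accepts a single expression; a trailing statement is a syntax error.
        tree = ast.parse(source, mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"rejected: {exc.msg}") from exc
    # The output is rebuilt from the tree, so nothing outside the AST survives.
    return ast.unparse(tree.body)


print(normalize_expression("SUM( amount )"))

try:
    normalize_expression("status == 'x'; drop_table('orders')")
except ValueError as err:
    print(err)
```

Because the output is regenerated from the parsed tree, whatever the parser did not accept never reaches the generated query.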

Compile the whole project into one deployable context bundle:

$ semlay compile examples/shop -o bundle.json

The bundle is the project with refs resolved plus a join graph — everything an agent needs in one fetch. See examples/deploy-bundle.yml for a GitHub Actions workflow that validates on PR and ships the bundle to S3 on merge.
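
One way to picture the join graph is as an adjacency list searched breadth-first for the shortest join path. The bundle's actual layout is not shown here, so the `EDGES` tuples, the `regions` table, and both function names below are hypothetical.

```python
from collections import deque

# Hypothetical relationship edges: (from_table, to_table, join_column).
EDGES = [
    ("orders", "customers", "customer_id"),
    ("customers", "regions", "region_id"),
]


def build_graph(edges):
    """Turn edge tuples into a directed adjacency list."""
    graph = {}
    for src, dst, column in edges:
        graph.setdefault(src, []).append((dst, column))
        graph.setdefault(dst, [])
    return graph


def find_join_path(graph, start, goal):
    """Breadth-first search; return a list of (from, to, column) hops, or None."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        table, hops = queue.popleft()
        if table == goal:
            return hops
        for nxt, column in graph.get(table, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + [(table, nxt, column)]))
    return None


graph = build_graph(EDGES)
print(find_join_path(graph, "orders", "regions"))
```

Edges are followed only in their declared direction in this sketch, so a search from `regions` back to `orders` finds no path.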

MCP server

Serve the project to any MCP client (Claude Code, Claude Desktop, Cursor) over stdio — no network, no auth, nothing to host:

$ semlay mcp examples/shop

Claude Code registration:

claude mcp add shop-semantics -- semlay mcp /path/to/your/semantic-layer

Six tools: search (names, synonyms, descriptions, example questions), list_tables, get_table (full per-table context), list_metrics, generate_sql, get_context. Tool errors return the same structured JSON as the CLI with isError: true, so the model reads unknown_metric + suggestions and retries — the self-correction loop is covered by an integration test that drives the real binary over stdio.

The server reloads YAML from disk on every call: edit a model, the agent sees it on its next tool call.

Python

import semlay

semlay.validate("examples/shop")        # [] when clean, else [{file, message}]
semlay.lint("examples/shop")            # warnings, same shape
bundle = semlay.compile_bundle("examples/shop")   # dict

md = semlay.context_markdown("examples/shop")  # agent-ready markdown

sql = semlay.generate_sql(
    "examples/shop",
    {
        "metrics": ["gross_revenue"],
        "dimensions": ["region"],
        "grain": "month",
        "order_by": [{"field": "gross_revenue", "desc": True}],
        "limit": 5,
    },
    dialect="duckdb",
)

# Structured errors raise ValueError carrying JSON:
import json
try:
    semlay.generate_sql("examples/shop", {"metrics": ["gross_revenu"]})
except ValueError as e:
    err = json.loads(str(e))
    assert err["code"] == "unknown_metric"
    assert err["suggestions"] == ["gross_revenue"]

Build locally with maturin develop --manifest-path python/Cargo.toml.
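
The structured-error handling shown above lends itself to a small retry wrapper. The sketch below is one possible shape, not part of semlay: `generate_with_retry` and `fake_generate` are hypothetical, and the fake stands in for `semlay.generate_sql` so the example runs without the package installed.

```python
import json


def generate_with_retry(generate, project, query, max_attempts=3):
    """Call generate(project, query); on unknown_metric, apply the first suggestion and retry."""
    query = dict(query, metrics=list(query["metrics"]))  # work on a copy
    for _ in range(max_attempts):
        try:
            return generate(project, query)
        except ValueError as exc:
            err = json.loads(str(exc))
            if err.get("code") != "unknown_metric" or not err.get("suggestions"):
                raise  # nothing this wrapper knows how to repair
            fixed = err["suggestions"][0]
            query["metrics"] = [fixed if m == err["name"] else m for m in query["metrics"]]
    raise RuntimeError("gave up after repeated unknown_metric errors")


def fake_generate(project, query):
    """Stand-in for semlay.generate_sql that knows a single metric."""
    for metric in query["metrics"]:
        if metric != "gross_revenue":
            payload = {"code": "unknown_metric", "name": metric, "suggestions": ["gross_revenue"]}
            raise ValueError(json.dumps(payload))
    return "-- sql for " + ", ".join(query["metrics"])


print(generate_with_retry(fake_generate, "examples/shop", {"metrics": ["gross_revenu"]}))
```

With the bindings installed, passing `semlay.generate_sql` as the first argument should behave the same way, since it takes the project path and the query dict in that order.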

How it works

crates/semlay-core   spec types (serde), loader, validator, linter, bundle compiler
crates/semlay-sql    semantic query -> resolved plan -> dialect SQL (sqlparser AST)
crates/semlay-mcp    MCP stdio server (hand-rolled JSON-RPC; no async runtime)
crates/semlay-cli    `semlay` binary
python/              PyO3 bindings (abi3, Python >= 3.9)
schemas/             generated JSON Schemas for the YAML files

Column types map onto Apache Arrow's type system — one type model across every physical format. Metric expressions and filters are parsed with sqlparser; bare column references are qualified against the base table, and per-metric filters attach as FILTER (WHERE ...) on aggregate AST nodes, which DuckDB, Postgres, and Spark/Databricks all support.

Tests

The test suite doubles as a tour of the behavior:

crates/semlay-core/tests/validate_lint_tests.rs — one test per validation rule and lint.
crates/semlay-sql/tests/sql_tests.rs — golden SQL, dialect quoting, error codes.
crates/semlay-sql/tests/edge_tests.rs — two-hop joins, ambiguous dimensions, ratio metrics filtering both aggregates, database sources.
crates/semlay-cli/tests/cli_tests.rs — the real binary end to end, including executing generated SQL in DuckDB and asserting the numbers.
python/tests/test_semlay.py — bindings.
crates/semlay-mcp/tests/protocol_tests.rs — MCP handshake, tool listing, the agent self-correction loop.

cargo test --workspace          # 65 tests
pytest python/tests             # 10 tests (after maturin develop)

Limitations (current, by design of v0)

Metrics in one query must live on one table. Cross-table metric math (revenue from orders divided by spend from ads) is not composed yet; query each side and combine downstream.
Derived metrics can't reference other metrics. Ratios work as plain expressions (SUM(a) / COUNT(DISTINCT b)); metric_a / metric_b reuse is planned.
Query filters are scoped to the base table's columns. Filter on joined tables by qualifying dimensions instead.
No derived dimensions (CASE bucketing, concatenations) — model them as columns upstream for now.
week grain follows each engine's week-start convention; semlay does not normalize it.

Status

Early. Implemented so far: spec v1, validator, linter, context bundle (with provenance), agent context export, SQL generation with fan-out protection and top-N, MCP server, and Python bindings. Planned next: schema drift detection against real Parquet/Delta/Iceberg metadata (Arrow-based) and OSI (Open Semantic Interchange) import/export.

License

Apache-2.0

Project details

This version

0.1.0

Jul 3, 2026

Download files

Source Distribution

semlay-0.1.0.tar.gz (64.3 kB)

Uploaded Jul 3, 2026 Source

Built Distributions

semlay-0.1.0-cp39-abi3-win_amd64.whl (1.7 MB)

Uploaded Jul 3, 2026 CPython 3.9+, Windows x86-64

semlay-0.1.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.0 MB)

Uploaded Jul 3, 2026 CPython 3.9+, manylinux: glibc 2.17+ x86-64

semlay-0.1.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (1.9 MB)

Uploaded Jul 3, 2026 CPython 3.9+, manylinux: glibc 2.17+ ARM64

semlay-0.1.0-cp39-abi3-macosx_11_0_arm64.whl (1.7 MB)

Uploaded Jul 3, 2026 CPython 3.9+, macOS 11.0+ ARM64

semlay-0.1.0-cp39-abi3-macosx_10_12_x86_64.whl (1.8 MB)

Uploaded Jul 3, 2026 CPython 3.9+, macOS 10.12+ x86-64

File details

Details for the file semlay-0.1.0.tar.gz.

File metadata

Download URL: semlay-0.1.0.tar.gz
Upload date: Jul 3, 2026
Size: 64.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.14.1

File hashes

Hashes for semlay-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`5be43f4ce1ce1866afb2b8d99f2bea32864fdfd85285d4ea41243382abd88a13`
MD5	`81cf49c0909f484e53418aa31ac00119`
BLAKE2b-256	`9f4515bf1ca103aa5e70f9889dad9ccf8cf6f7dbb417edd04cfa61d16c37a9e6`

File details

Details for the file semlay-0.1.0-cp39-abi3-win_amd64.whl.

File metadata

Download URL: semlay-0.1.0-cp39-abi3-win_amd64.whl
Upload date: Jul 3, 2026
Size: 1.7 MB
Tags: CPython 3.9+, Windows x86-64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.14.1

File hashes

Hashes for semlay-0.1.0-cp39-abi3-win_amd64.whl
Algorithm	Hash digest
SHA256	`daf31b21a14ba662a9d320f817af4461c02b0a4076f281a8f7836d0f814fb30c`
MD5	`add5493430231b2469fc2ff63f88c3f4`
BLAKE2b-256	`0e51e8d966e8a74bda14f7106f641c7f7e05941ba426fff0375d8802e80ce71c`

File details

Details for the file semlay-0.1.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: semlay-0.1.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Jul 3, 2026
Size: 2.0 MB
Tags: CPython 3.9+, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.14.1

File hashes

Hashes for semlay-0.1.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`be104997cde3ea10dff48b6e5dfc1dcb67e179848b4f56fd6371c23f5464555c`
MD5	`ac4c07c2b7e147f26c6d23951dfbb5b9`
BLAKE2b-256	`a7c4b6f5f62b1815958de20018a0c0206d50155ee92e1fc8561ef1a63f80a6b0`

File details

Details for the file semlay-0.1.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

Download URL: semlay-0.1.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Upload date: Jul 3, 2026
Size: 1.9 MB
Tags: CPython 3.9+, manylinux: glibc 2.17+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.14.1

File hashes

Hashes for semlay-0.1.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm	Hash digest
SHA256	`a7116159f186d10995403fb697e902e73453f2f8c5dbd38d8cec6bdf2f804230`
MD5	`3c6684702ced674d2b6262675862aaed`
BLAKE2b-256	`57cae79c2cb879e59363dec4e708eb26d6fa6de59d529ed38bb8e590bb5dd860`

File details

Details for the file semlay-0.1.0-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

Download URL: semlay-0.1.0-cp39-abi3-macosx_11_0_arm64.whl
Upload date: Jul 3, 2026
Size: 1.7 MB
Tags: CPython 3.9+, macOS 11.0+ ARM64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.14.1

File hashes

Hashes for semlay-0.1.0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm	Hash digest
SHA256	`874e6d6b82ebf53517a82248513f59ad4cf6281bcc76679224e903bb1e787ae3`
MD5	`1efcfcab4df4d3b47175f1eae666288e`
BLAKE2b-256	`8c57eb21cac9180232377ab2d2806424663bdb1430dd408857e4aebfaf429fa0`

File details

Details for the file semlay-0.1.0-cp39-abi3-macosx_10_12_x86_64.whl.

File metadata

Download URL: semlay-0.1.0-cp39-abi3-macosx_10_12_x86_64.whl
Upload date: Jul 3, 2026
Size: 1.8 MB
Tags: CPython 3.9+, macOS 10.12+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.14.1

File hashes

Hashes for semlay-0.1.0-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm	Hash digest
SHA256	`4015a4423fca9bedd8a445897c628d5cd7d4dbc3dfc40e103e277f30b9c90644`
MD5	`dda2b01f8416b23a7df234d5a09cd36e`
BLAKE2b-256	`44cbc04d6079f3fa22abd234e9825492188f84246d44c1514503420088358a38`
