File-based semantic layer for AI agents: YAML in git, validated and compiled by Rust

semlay

A file-based semantic layer for AI agents. Define tables, columns, metrics, joins, and business context in simple YAML files. Store them in git, validate them in CI, compile them to a single context bundle, and deploy it to S3 — no running service required. Rust core, Python bindings.

your-repo/
├── semantic.yml          # project manifest
└── models/
    ├── orders.yml        # one table per file: source, columns, metrics, joins
    └── customers.yml

Why

AI agents writing SQL against raw schemas guess at business logic and get it wrong. Semantic layers fix that, but the existing ones are SaaS-attached, dbt-coupled, or API-centric platforms you have to operate. semlay is just files:

  • Physical format is first-class. Declare whether a dataset is Delta Lake, Iceberg, Parquet, CSV, or a Postgres table. SQL generation uses it — DuckDB queries against Parquet render as read_parquet(...) and run with zero setup.
  • Agent context is first-class. Metrics carry context (pitfalls, caveats), example_queries, valid time grains, and deprecation notes pointing at replacements. The linter warns when context is missing.
  • Structured errors. A bad query returns machine-readable JSON so agents self-correct instead of hallucinating (see below).
  • Git-native deployment. Validate on PR, compile on merge, push one JSON bundle to S3. Agents load the bundle; nothing to host.

The spec in one file

# models/orders.yml
version: 1
source:
  format: parquet            # delta | iceberg | parquet | csv | txt | json
  location: data/orders.parquet
  # ...or a database:        # postgres | mysql | snowflake | bigquery | databricks
  # format: postgres
  # connection: warehouse    # defined in semantic.yml; DSN comes from an env var
  # schema: analytics
  # relation: fct_orders     # physical name, if it differs from table.name

table:
  name: orders
  description: One row per order line item from the order management system.
  grain: order line item     # what one row means — agents need this
  owner: data-eng@example.com
  status: certified          # certified | experimental | deprecated

columns:
  - name: order_id
    type: int64              # Arrow type names: int64, float64, decimal(18,2),
    primary_key: true        # string, bool, date, timestamp, time, binary...
    description: Natural key from the OMS.
  - name: email
    type: string
    classification: pii      # public | internal | restricted | pii
  - name: status
    type: string
    description: One of placed, shipped, returned, cancelled.
    synonyms: [order state]  # how humans actually ask for it
  - name: ordered_at
    type: timestamp
    time_dimension: true     # default time axis for grain queries

metrics:
  - name: gross_revenue
    expr: SUM(amount)
    description: Pre-discount revenue in USD.
    context: |
      Excludes refunds — join refunds for net revenue. Cancelled lines are
      excluded by the built-in filter.
    example_queries:                 # anchor agent question-matching
      - gross revenue by region last 30 days
    filters:
      - status <> 'cancelled'        # always applied, per-metric
    valid_grains: [day, week, month, quarter, year]   # hour is rejected

relationships:
  - to: customers
    type: many_to_one        # one_to_one | one_to_many | many_to_one | many_to_many
    on: customer_id          # or columns: {local: ..., remote: ...}

context: |
  The OMS backfills late order edits up to 48 hours; treat the last two days
  as provisional when reporting.

JSON Schemas for editor autocomplete and CI validation live in schemas/.

Four complete example projects live in examples/ — Parquet with runnable sample data, Snowflake with connections and deprecation, Postgres with views and explicit join columns, and a Delta/Iceberg/CSV lakehouse with two-hop joins. See examples/README.md.

CLI walkthrough

All commands run against examples/shop, which ships with sample Parquet data.

Validate (hard errors) and lint (weak agent context):

$ semlay validate examples/shop
ok: 2 model(s), 0 errors

$ semlay lint examples/shop
ok: 2 model(s), 0 warning(s)

On a broken project, errors name the file and the fix:

error: models/bad.yml: source format `postgres` requires `connection`
error: models/bad.yml: column `c`: unknown column type `varchar` (expected one of int8..int64, ...)
warning: models/bad.yml: metric `m` has no `context`; agents will guess at caveats
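
To make the two error rules above concrete, here is a minimal Python sketch of what such checks could look like. It is not semlay's validator: `validate_model`, `DATABASE_FORMATS`, and `KNOWN_TYPES` are hypothetical names, the type set is only the partial list shown in the spec comments, and the messages are simplified.

```python
# Hypothetical, simplified re-creation of two validation rules.
# Formats and types are taken from the spec comments; the real lists may be longer.
DATABASE_FORMATS = {"postgres", "mysql", "snowflake", "bigquery", "databricks"}
KNOWN_TYPES = {"int64", "float64", "string", "bool", "date", "timestamp", "time", "binary"}


def validate_model(file, model):
    """Return a list of {file, message} dicts; the list is empty when the model is clean."""
    errors = []
    source = model.get("source", {})
    fmt = source.get("format")
    # Database-backed sources need a named connection.
    if fmt in DATABASE_FORMATS and "connection" not in source:
        errors.append({"file": file, "message": f"source format `{fmt}` requires `connection`"})
    for col in model.get("columns", []):
        ctype = col.get("type", "")
        # Accept parameterised decimals such as decimal(18,2).
        if ctype not in KNOWN_TYPES and not ctype.startswith("decimal("):
            errors.append(
                {"file": file, "message": f"column `{col.get('name')}`: unknown column type `{ctype}`"}
            )
    return errors


bad = {"source": {"format": "postgres"}, "columns": [{"name": "c", "type": "varchar"}]}
for err in validate_model("models/bad.yml", bad):
    print(f"error: {err['file']}: {err['message']}")
```

The return shape mirrors the `[{file, message}]` list that the Python bindings document for `validate`.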

Export agent context — the whole project as compact markdown for a system prompt, CLAUDE.md, or RAG ingestion. This is how agents consume the layer with zero tooling on their side:

$ semlay context examples/shop -o semantic-context.md

The output covers every table (grain, source, deprecations), column tables with PII flags and synonyms, and each metric's expression, always-applied filters, valid grains, caveats, and example questions.

Generate SQL from a semantic query:

$ semlay sql examples/shop -m gross_revenue,order_count -d region -g month
SELECT
  date_trunc('month', "orders"."ordered_at") AS "ordered_at_month",
  "customers"."region" AS "region",
  SUM("orders"."amount") FILTER (WHERE "orders"."status" <> 'cancelled') AS "gross_revenue",
  COUNT(DISTINCT "orders"."order_id") AS "order_count"
FROM read_parquet('data/orders.parquet') AS "orders"
LEFT JOIN read_parquet('data/customers.parquet') AS "customers"
  ON "orders"."customer_id" = "customers"."customer_id"
GROUP BY 1, 2
ORDER BY 1, 2

Note what the layer did: qualified every column, applied gross_revenue's built-in filter to its aggregate only (cancelled orders still count in order_count), planned the join from the relationship graph, and rendered the Parquet source as a DuckDB reader so the SQL runs as-is. Dialects: duckdb, postgres, databricks, snowflake. Dialect differences are handled per construct — Snowflake has no FILTER clause, so the same metric renders there as SUM(CASE WHEN status <> 'cancelled' THEN amount END).
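
The per-dialect handling of a filtered aggregate can be pictured with a small Python sketch. This is string templating for illustration only; semlay itself works on a sqlparser AST, and `render_filtered_sum` is a hypothetical helper, not part of its API.

```python
def render_filtered_sum(column, predicate, dialect):
    """Render SUM(column) restricted by predicate for one dialect (string-based sketch)."""
    if dialect in {"duckdb", "postgres", "databricks"}:
        # These engines accept the FILTER clause on aggregates.
        return f"SUM({column}) FILTER (WHERE {predicate})"
    if dialect == "snowflake":
        # No FILTER clause: rows failing the predicate become NULL, which SUM ignores.
        return f"SUM(CASE WHEN {predicate} THEN {column} END)"
    raise ValueError(f"unsupported dialect: {dialect}")


for dialect in ("duckdb", "snowflake"):
    print(dialect, "->", render_filtered_sum("amount", "status <> 'cancelled'", dialect))
```

Both renderings give the same result because SUM skips NULL inputs, so the CASE form drops exactly the rows the FILTER form excludes.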

Top-N queries sort by any metric or dimension:

$ semlay sql examples/shop -m gross_revenue -d region --sort gross_revenue:desc -l 5

Structured errors — typos and bad grains come back as JSON agents can act on:

$ semlay sql examples/shop -m gross_revenu
{
  "code": "unknown_metric",
  "name": "gross_revenu",
  "suggestions": ["gross_revenue"]
}

$ semlay sql examples/shop -m gross_revenue -g hour
{
  "code": "incompatible_grain",
  "metric": "gross_revenue",
  "grain": "hour",
  "valid": ["day", "week", "month", "quarter", "year"]
}

Other codes: unknown_dimension (with suggestions), ambiguous_dimension (with qualified candidates), no_join_path, metrics_span_tables, no_time_dimension, unknown_sort_field (with the sortable fields).
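
As a rough illustration of how errors of this shape could be produced, the Python sketch below builds the two payloads shown above. The catalogue, the `check_query` helper, and the use of `difflib` for suggestions are all assumptions; semlay's actual matching algorithm is not documented here.

```python
import difflib

# Hypothetical catalogue: metric name -> valid time grains.
# The grains for order_count are invented for the example.
VALID_GRAINS = {
    "gross_revenue": ["day", "week", "month", "quarter", "year"],
    "order_count": ["day", "week", "month"],
}


def check_query(metric, grain=None):
    """Return None when the request is valid, else a structured error dict."""
    if metric not in VALID_GRAINS:
        return {
            "code": "unknown_metric",
            "name": metric,
            # Fuzzy-match the typo against known metric names.
            "suggestions": difflib.get_close_matches(metric, list(VALID_GRAINS), n=3),
        }
    if grain is not None and grain not in VALID_GRAINS[metric]:
        return {
            "code": "incompatible_grain",
            "metric": metric,
            "grain": grain,
            "valid": VALID_GRAINS[metric],
        }
    return None


print(check_query("gross_revenu"))
print(check_query("gross_revenue", "hour"))
```

Returning a dict with a stable `code` field lets a caller branch on the error type instead of parsing a message string.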

Fan-out protection. Relationship cardinality is enforced, not decorative. If reaching a dimension requires traversing a one-to-many join, the row multiplication would silently inflate SUMs — the classic semantic-layer correctness bug. semlay refuses with a structured error instead:

{
  "code": "fan_out_join",
  "from": "orders",
  "to": "line_items"
}

(The message tells the agent to define the metric on the many side instead.)
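
The cardinality check can be sketched in a few lines of Python. The path representation and the `check_join_path` helper are hypothetical, and treating many_to_many as a fan-out is an assumption made for this example.

```python
# Cardinalities that multiply rows on the base side (many_to_many is an assumption).
FAN_OUT = {"one_to_many", "many_to_many"}


def check_join_path(path):
    """path is a list of (from_table, to_table, cardinality) hops, base table first.

    Returns a structured error for the first hop that would multiply rows, else None.
    """
    for src, dst, cardinality in path:
        if cardinality in FAN_OUT:
            return {"code": "fan_out_join", "from": src, "to": dst}
    return None


print(check_join_path([("orders", "customers", "many_to_one")]))
print(check_join_path([("orders", "line_items", "one_to_many")]))
```

A many_to_one hop keeps one output row per base row, so aggregates stay correct; a one_to_many hop repeats each base row once per match, which is what inflates a SUM.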

Expression safety. Metric expressions and filters are parsed as a single SQL expression and re-rendered from the AST. Extra statements can't be smuggled in: status = 'x'; DROP TABLE orders is rejected as trailing input, neither truncated nor executed.
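
The "reject, don't truncate" behaviour can be illustrated with a much simpler stand-in than a real parser. The Python sketch below only scans for a statement terminator outside single-quoted string literals; it is an assumption-laden toy (`reject_trailing_input` is a hypothetical name) and does not replace parsing to an AST.

```python
def reject_trailing_input(expr):
    """Raise ValueError if a ';' appears outside a single-quoted SQL string literal."""
    in_string = False
    i = 0
    while i < len(expr):
        ch = expr[i]
        if in_string:
            if ch == "'":
                if i + 1 < len(expr) and expr[i + 1] == "'":
                    i += 1  # '' is an escaped quote inside the literal
                else:
                    in_string = False
        elif ch == "'":
            in_string = True
        elif ch == ";":
            # Refuse the whole expression rather than keeping the part before ';'.
            raise ValueError(f"trailing input at offset {i}: {expr[i:]!r}")
        i += 1
    return expr


print(reject_trailing_input("note = 'a;b'"))  # semicolon inside a literal is fine
try:
    reject_trailing_input("status = 'x'; DROP TABLE orders")
except ValueError as exc:
    print("rejected:", exc)
```

The key design point carried over from the text is that the input is refused outright; nothing before the semicolon is salvaged and passed on.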

Compile the whole project into one deployable context bundle:

$ semlay compile examples/shop -o bundle.json

The bundle is the project with refs resolved plus a join graph — everything an agent needs in one fetch. See examples/deploy-bundle.yml for a GitHub Actions workflow that validates on PR and ships the bundle to S3 on merge.
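
One way a consumer might use a join graph is a breadth-first search for the shortest join path between two tables. The Python sketch below assumes a plain adjacency mapping; the bundle's real layout is not described here, and the table names and `find_join_path` helper are hypothetical.

```python
from collections import deque


def find_join_path(graph, start, goal):
    """Breadth-first search; returns the list of tables from start to goal, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # a caller could map this to a no_join_path error


# Hypothetical graph: table -> tables it declares a relationship to.
graph = {"orders": ["customers"], "customers": ["regions"], "regions": []}
print(find_join_path(graph, "orders", "regions"))
print(find_join_path(graph, "regions", "orders"))
```

Breadth-first order means the first path found uses the fewest hops, which keeps generated joins as short as the graph allows.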

MCP server

Serve the project to any MCP client (Claude Code, Claude Desktop, Cursor) over stdio — no network, no auth, nothing to host:

$ semlay mcp examples/shop

Claude Code registration:

claude mcp add shop-semantics -- semlay mcp /path/to/your/semantic-layer

Six tools: search (names, synonyms, descriptions, example questions), list_tables, get_table (full per-table context), list_metrics, generate_sql, get_context. Tool errors return the same structured JSON as the CLI with isError: true, so the model reads unknown_metric + suggestions and retries — the self-correction loop is covered by an integration test that drives the real binary over stdio.

The server reloads YAML from disk on every call: edit a model, the agent sees it on its next tool call.
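
The self-correction loop can be sketched from the client side in Python. Everything here is an assumption for illustration: `call_tool` stands in for whatever MCP client function is in use, the result shape follows the common MCP tool-result layout (a `content` list of text items plus `isError`), and the `metrics` argument name mirrors the Python query dict rather than a documented tool schema.

```python
import json


def call_with_retry(call_tool, arguments, max_attempts=2):
    """Call generate_sql; on unknown_metric, swap in the first suggestion and retry."""
    for _ in range(max_attempts):
        result = call_tool("generate_sql", arguments)
        text = result["content"][0]["text"]
        if not result.get("isError"):
            return text
        err = json.loads(text)
        if err.get("code") != "unknown_metric" or not err.get("suggestions"):
            raise RuntimeError(text)
        # Replace only the misspelled metric, keep the rest of the request.
        fixed = [err["suggestions"][0] if m == err["name"] else m for m in arguments["metrics"]]
        arguments = {**arguments, "metrics": fixed}
    raise RuntimeError("gave up after retries")


def fake_call_tool(name, arguments):
    """Stand-in for a real MCP client call, used only to exercise the loop."""
    if "gross_revenu" in arguments["metrics"]:
        err = {"code": "unknown_metric", "name": "gross_revenu", "suggestions": ["gross_revenue"]}
        return {"isError": True, "content": [{"type": "text", "text": json.dumps(err)}]}
    return {"isError": False, "content": [{"type": "text", "text": "SELECT ..."}]}


print(call_with_retry(fake_call_tool, {"metrics": ["gross_revenu"], "dimensions": ["region"]}))
```

In practice the model itself performs this retry after reading the error; the loop above just makes the control flow explicit.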

Python

import semlay

semlay.validate("examples/shop")        # [] when clean, else [{file, message}]
semlay.lint("examples/shop")            # warnings, same shape
bundle = semlay.compile_bundle("examples/shop")   # dict

md = semlay.context_markdown("examples/shop")  # agent-ready markdown

sql = semlay.generate_sql(
    "examples/shop",
    {
        "metrics": ["gross_revenue"],
        "dimensions": ["region"],
        "grain": "month",
        "order_by": [{"field": "gross_revenue", "desc": True}],
        "limit": 5,
    },
    dialect="duckdb",
)

# Structured errors raise ValueError carrying JSON:
import json
try:
    semlay.generate_sql("examples/shop", {"metrics": ["gross_revenu"]})
except ValueError as e:
    err = json.loads(str(e))
    assert err["code"] == "unknown_metric"
    assert err["suggestions"] == ["gross_revenue"]

Build locally with maturin develop --manifest-path python/Cargo.toml.

How it works

crates/semlay-core   spec types (serde), loader, validator, linter, bundle compiler
crates/semlay-sql    semantic query -> resolved plan -> dialect SQL (sqlparser AST)
crates/semlay-mcp    MCP stdio server (hand-rolled JSON-RPC; no async runtime)
crates/semlay-cli    `semlay` binary
python/              PyO3 bindings (abi3, Python >= 3.9)
schemas/             generated JSON Schemas for the YAML files

Column types map onto Apache Arrow's type system — one type model across every physical format. Metric expressions and filters are parsed with sqlparser; bare column references are qualified against the base table, and per-metric filters attach as FILTER (WHERE ...) on aggregate AST nodes, which DuckDB, Postgres, and Spark/Databricks all support.

Tests

The test suite doubles as a tour of the behavior:

cargo test --workspace          # 65 tests
pytest python/tests             # 10 tests (after maturin develop)

Limitations (current, by design of v0)

  • Metrics in one query must live on one table. Cross-table metric math (revenue from orders divided by spend from ads) is not composed yet; query each side and combine downstream.
  • Derived metrics can't reference other metrics. Ratios work as plain expressions (SUM(a) / COUNT(DISTINCT b)); metric_a / metric_b reuse is planned.
  • Query filters are scoped to the base table's columns. Filter on joined tables by qualifying dimensions instead.
  • No derived dimensions (CASE bucketing, concatenations) — model them as columns upstream for now.
  • week grain follows each engine's week-start convention; semlay does not normalize it.
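
To see why the week-start convention matters, the Python sketch below truncates the same date to the start of its week under a Monday-first and a Sunday-first convention. The `week_start` helper is hypothetical and makes no claim about which engine uses which convention.

```python
from datetime import date, timedelta


def week_start(d, first_day="monday"):
    """Truncate a date to the first day of its week under the given convention."""
    if first_day == "monday":
        offset = d.weekday()            # Monday == 0
    else:
        offset = (d.weekday() + 1) % 7  # Sunday == 0
    return d - timedelta(days=offset)


wednesday = date(2024, 1, 3)
print(week_start(wednesday, "monday"))  # 2024-01-01
print(week_start(wednesday, "sunday"))  # 2023-12-31
```

The same row lands in different weekly buckets, and here even in different years, so weekly numbers from two engines may not line up unless the convention is aligned upstream.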

Status

Early. Implemented so far: spec v1, validator, linter, context bundle (with provenance), agent context export, SQL generation with fan-out protection and top-N, MCP server, and Python bindings. Planned next: schema drift detection against real Parquet/Delta/Iceberg metadata (Arrow-based) and OSI (Open Semantic Interchange) import/export.

License

Apache-2.0
