Skip to main content

Recovers SQL lineage that OpenMetadata's parser misses — MSSQL stored procedures, BigQuery procedural SQL, and 20+ dialects — using Gudu SQLFlow.

Project description

gsp-openmetadata-sidecar

Recover SQL lineage that OpenMetadata's parser silently drops — MSSQL stored procedures, BigQuery procedural SQL, MERGE column lineage, temp-table hops, and 20+ dialects — using Gudu SQLFlow, then push the recovered lineage back to OpenMetadata via PUT /api/v1/lineage.

Optionally, auto-create any missing Database / DatabaseSchema / Table entities before emitting lineage, so SQL referencing not-yet-ingested tables still produces complete graphs.

No changes to OpenMetadata code. Public REST API only.

Quick start

pip install gsp-openmetadata-sidecar
gsp-openmetadata-sidecar --dry-run --sql "CREATE PROC p AS BEGIN INSERT INTO t2 SELECT a, b FROM t1 END"

This command:

  • parses the SQL with SQLFlow's free anonymous tier
  • prints the extracted lineage edges (t1 → t2) to stdout
  • does not contact OpenMetadata
  • does not require sidecar.yaml — if the file is absent, built-in defaults are used

If you see upstream → downstream edges, the tool works in your environment. For more realistic fixtures — MSSQL stored procedures, BigQuery procedural SQL, MERGE column lineage — see examples/ in the repo:

gsp-openmetadata-sidecar --sql-file examples/mssql_stored_procedure.sql --dry-run

Everything below is about graduating from that first --dry-run to a useful live deployment.

Choose your path

The sidecar has three onboarding paths with different prerequisites. Pick the row that matches your goal:

Goal What you need Shortest command
Evaluate lineage extraction A SQL file or inline SQL gsp-openmetadata-sidecar --sql-file proc.sql --dry-run
Push lineage to an existing OM catalog Above + --om-server, --om-token, --service-name (+ --database-name if SQL uses short names) See Push lineage to OpenMetadata
Push lineage and auto-create missing tables Above + --default-database/--default-schema + bot with Create RBAC See Auto-create missing entities

If you're new to OpenMetadata, you will also want to skim the OpenMetadata primer once — most first-time friction comes from misunderstanding how OpenMetadata identifies tables, not from bugs in this sidecar.

When to use this

OpenMetadata uses a three-parser chain (sqlglot → sqlfluff → sqlparse via collate-sqllineage) that silently loses lineage on procedural and dialect-specific SQL. Use this sidecar when you see any of:

  • MSSQL stored procedures (BEGIN…END, DECLARE, CREATE PROCEDURE)
  • MSSQL square-bracket identifiers ([dbo].[table])
  • Temp-table intermediates (#tempTable), breaking multi-hop lineage
  • Multi-statement scripts (DECLARE, IF/THEN, CREATE TEMP TABLE)
  • MERGE INTO … WHEN MATCHED … WHEN NOT MATCHED column lineage
  • Cross-database MSSQL queries
  • Lineage against tables that don't exist in OpenMetadata yet (staging, ephemeral, not-yet-ingested sources)

See Related issues for upstream tickets.

Install

From PyPI:

pip install gsp-openmetadata-sidecar

From source (recommended for contributors and environments with an externally managed Python):

python3 -m venv .venv
. .venv/bin/activate
pip install -e .
gsp-openmetadata-sidecar --help

Requirements:

  • Python 3.9+
  • Network access to the SQLFlow API (for anonymous, authenticated, self_hosted modes)
  • Java 8+ and a licensed gsqlparser-*-shaded.jar (for local_jar mode only)

Push lineage to OpenMetadata

To push lineage (instead of dry-running), you need:

  1. An OpenMetadata instance with the target tables already ingested — or --auto-create-entities enabled
  2. A JWT token (Settings → Bots in the OpenMetadata UI)
  3. The database service name as registered in OpenMetadata
gsp-openmetadata-sidecar \
  --sql-file stored_proc.sql \
  --om-server http://localhost:8585/api \
  --om-token "eyJ..." \
  --service-name mssql_prod \
  --database-name SalesDB \
  --schema-name dbo

Or via a config file. A minimal sidecar.yaml for a live run:

sqlflow:
  mode: anonymous
  db_vendor: dbvmssql

openmetadata:
  server: http://localhost:8585/api
  token: "eyJ..."
  service_name: mssql_prod
  database_name: SalesDB
  schema_name: dbo
gsp-openmetadata-sidecar --config sidecar.yaml --sql-file stored_proc.sql

See examples/sidecar.yaml.example for the full annotated version, including authenticated / self_hosted / local_jar and auto-create knobs.

The flags you'll actually type

Flag What it does
--sql-file / --sql Input SQL. One file path, or an inline string.
--dry-run Skip the write to OpenMetadata; print extracted lineage instead.
--om-server OpenMetadata API base URL. Must include /api.
--om-token Bot JWT token. From Settings → Bots in the OM UI.
--service-name Database service name as registered in OM. First FQN segment.
--database-name / --schema-name Fallbacks for FQN assembly when SQL uses short names.
--auto-create-entities Opt in to creating missing Database / DatabaseSchema / Table before emitting.

All other flags are listed in the Configuration reference.

Which defaults do I need?

The sidecar has two separate layers of defaults with similar-looking names. Confusing them is the most common first-run mistake:

Layer Flags Purpose
SQLFlow (parse-time) --default-server / --default-database / --default-schema Help SQLFlow understand the SQL. Fed as request fields so SELECT * FROM Customers is parsed as SalesDB.dbo.Customers instead of a placeholder.
Sidecar (FQN-assembly) --service-name / --database-name / --schema-name Help the sidecar build OpenMetadata FQNs. Fill in missing parts when SQLFlow returns fewer than 3 segments.

Rules of thumb:

  • If your SQL uses fully-qualified names ([SalesDB].[dbo].[Customers]), you need only --service-name.
  • If your SQL uses bare table names, set both pairs to the same values so SQLFlow parses correctly and the sidecar can build a valid 4-part FQN.
  • --default-* flags are ignored in local_jar mode (with a warning) — the JAR-based DataFlowAnalyzer CLI does not accept them.

For the full mechanics, see FQN resolution and default database/schema and SQLFlow default qualifiers (parse-time) below.

Common first-run mistakes

A short list of the failure modes most new users hit:

  • --om-server missing the /api suffix. It must be http://host:port/api, not just http://host:port.
  • Confusing --default-database with --database-name. See the cheat-sheet above.
  • Expecting --dry-run to contact OpenMetadata. It does not. Drop --dry-run for a live run.
  • Expecting local_jar mode to honor --default-*. It doesn't — SQLFlow's JAR CLI doesn't accept those.
  • Expecting lineage push to create tables by itself. It doesn't. Either ingest structure first, or pass --auto-create-entities.
  • Bot token missing EditLineage. PUT /api/v1/lineage returns 403. Add the permission or grant the bot the DataConsumer role.
  • Expecting column-level lineage after an auto-create run. Auto-created tables are intentionally skeletal (no columns), so column lineage is suppressed on edges touching them. See Why isn't there column lineage after auto-create? below.

Common commands

# Inline SQL (no file required):
gsp-openmetadata-sidecar --sql "CREATE PROC p AS BEGIN INSERT INTO t2 SELECT a, b FROM t1 END" --dry-run

# BigQuery procedural SQL with DECLARE, CREATE TEMP TABLE:
gsp-openmetadata-sidecar --sql-file examples/bigquery_procedural.sql --db-vendor dbvbigquery --dry-run

# Table-level lineage only (skip column mappings):
gsp-openmetadata-sidecar --sql-file proc.sql --no-column-lineage --dry-run

# Print raw SQLFlow JSON response to stdout:
gsp-openmetadata-sidecar --sql-file proc.sql --json --dry-run

# Verbose/debug logging:
gsp-openmetadata-sidecar --sql-file proc.sql --dry-run -v

# Preview what entities would be auto-created (zero writes):
gsp-openmetadata-sidecar --sql-file proc.sql --auto-create-entities --dry-run

# Live run with auto-create, conservative cap, fail on any create failure:
gsp-openmetadata-sidecar --sql-file proc.sql --auto-create-entities \
  --max-entities-to-create 10 --on-create-failure abort

# Show version:
gsp-openmetadata-sidecar --version

Auto-create missing entities (opt-in)

By default, the sidecar only writes lineage edges. If a referenced table isn't found in OpenMetadata, it logs Skipping lineage: upstream table not found: …, skips that edge, and continues. This is the safe baseline.

With --auto-create-entities (or openmetadata.auto_create_entities: true in YAML, or GSP_OM_AUTO_CREATE_ENTITIES=true), the sidecar runs a pre-pass planner, creates any missing Database / DatabaseSchema / Table via POST in strict order, and then emits the lineage. The DatabaseService is never auto-created.

When to reach for it:

  • SQL references staging / ephemeral / not-yet-onboarded tables
  • You can't wait for a metadata ingestion run to populate the catalog
  • You want complete lineage graphs now, and are willing to accept skeletal (column-less) table entities that a connector can enrich later

Two hard prerequisites

  1. Set parse-time defaults. Auto-create refuses to run unless --default-database + --default-schema are set (or the openmetadata.* equivalents). Without them, SQLFlow can return partial identifiers that would otherwise materialize as ghost entities at non-4-part FQNs.
  2. The bot needs Create RBAC on Database, DatabaseSchema, and Table, scoped to the configured service. See the operator guide for a recommended role shape.

The minimum safe command

# 1. Dry-run to preview what would be created.
gsp-openmetadata-sidecar \
  --config sidecar.yaml \
  --sql-file my_etl.sql \
  --auto-create-entities \
  --dry-run

# 2. Live run with a conservative cap.
gsp-openmetadata-sidecar \
  --config sidecar.yaml \
  --sql-file my_etl.sql \
  --auto-create-entities \
  --on-create-failure abort \
  --max-entities-to-create 10

For safety invariants, rollout recipe, and stop-ship criteria before rolling this out on a real OpenMetadata instance, see docs/auto-create-operator-guide.md.

Why isn't there column lineage after auto-create?

This is the single biggest surprise on a first auto-create run, so call it out up front:

Auto-created tables have no columns. The sidecar POSTs them with columns: [] because SQLFlow never connected to the source database and therefore doesn't know the column types, lengths, nullability, or ordering. Inventing column metadata on admin-curated entities would be worse than leaving the tables skeletal — an ingestion connector that later scans the source cannot reliably diff-and-merge against placeholder columns.

Column lineage is then suppressed on every edge touching a skeletal endpoint. You'll see this in the run summary:

Lineage emission complete: 2 emitted, 0 skipped (column lineage suppressed on 2 edges, 0 column-pair(s) filtered)

Table-level arrows appear in the OM UI as usual; the thin column-to-column lines inside them do not.

To get column lineage back, populate the columns on the tables and re-run the sidecar. Normal path:

  1. Run OM's native structural ingestion (Airflow DAG or the Ingestion container) against the source. It fills columns[] on the existing skeletal entities.
  2. Re-run the sidecar on the same SQL. It finds the tables already exist (via exact FQN lookup), sees they now have columns, and emits full column-level columnsLineage.

For a smoke test without a real source database, you can PATCH columns onto the skeletal tables directly:

curl -X PATCH "$GSP_OM_SERVER/v1/tables/name/mssql_prod.sales.dbo.invoices" \
  -H "Authorization: Bearer $GSP_OM_TOKEN" \
  -H "Content-Type: application/json-patch+json" \
  -d '[{"op":"replace","path":"/columns","value":[
        {"name":"InvoiceDate","dataType":"DATETIME"},
        {"name":"CustomerId","dataType":"BIGINT"},
        {"name":"Amount","dataType":"NUMERIC"}]}]'

Then re-run the sidecar (no --auto-create-entities needed; the tables already exist). Column casing doesn't have to match SQLFlow's output — the sidecar does case-insensitive matching via a lowercase→canonical map.

Backend modes

Mode Auth Rate limit Data stays… Best for
anonymous (default) None 50/day per IP Gudu cloud Quick evaluation
authenticated user_id + secret_key 10k/month Gudu cloud Regular use
self_hosted user_id + secret_key (token exchange) Unlimited Your network Production / air-gapped
local_jar None (local process) Unlimited Your machine Offline / no Docker
# Anonymous (default — no signup needed):
gsp-openmetadata-sidecar --sql-file proc.sql --dry-run

# Authenticated — BOTH user_id and secret_key are required:
GSP_BACKEND_MODE=authenticated \
GSP_SQLFLOW_USER_ID=your-user-id \
GSP_SQLFLOW_SECRET_KEY=your-secret-key \
  gsp-openmetadata-sidecar --sql-file proc.sql --dry-run

# Equivalent with CLI flags:
gsp-openmetadata-sidecar --mode authenticated \
  --user-id your-user-id --secret-key your-secret-key \
  --sql-file proc.sql --dry-run

# Self-hosted SQLFlow Docker (token-exchange protocol, same creds pattern):
gsp-openmetadata-sidecar --mode self_hosted \
  --sqlflow-url http://localhost:8165/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson \
  --user-id gudu --secret-key 0123456789 \
  --sql-file proc.sql --dry-run

# Local JAR (no network; requires Java and a licensed JAR):
gsp-openmetadata-sidecar --mode local_jar \
  --jar-path /path/to/gsqlparser-shaded.jar \
  --sql-file proc.sql --dry-run

OpenMetadata primer (for first-time users)

If you've never used OpenMetadata before, this section gives you the vocabulary the rest of this README assumes. Skim it before you start passing flags like --service-name or --om-token.

The entity hierarchy

OpenMetadata stores metadata as entities. Every table, column, dashboard, pipeline, and user is an entity. For SQL lineage you care about three entity types, arranged as a strict 4-level hierarchy:

Database Service   (e.g. "mssql_prod")      — a connection to a source system
    └── Database   (e.g. "SalesDB")         — a logical database within that service
        └── Schema (e.g. "dbo")             — a schema/namespace within the database
            └── Table (e.g. "Customers")    — the actual table (with columns inside)

Key points for newcomers:

  • Service is not the server. A service is an OpenMetadata-registered connection. You choose the name when you set it up in the OM UI (Settings → Services → Databases → Add). It is usually named after the source system (e.g. mssql_prod, snowflake_analytics), but it is an opaque identifier — not the SQL Server hostname or DSN.
  • Database ≠ schema. For MSSQL, SalesDB.dbo.Customers has database SalesDB and schema dbo. For PostgreSQL or Snowflake, same structure. For MySQL there's effectively no separate schema level, so OM stores a synthetic schema called default.
  • Columns live inside the table entity, not as standalone entities with their own URLs. Column lineage is expressed within a table-to-table lineage edge.

Fully-Qualified Name (FQN)

An FQN is how OpenMetadata uniquely identifies an entity across the whole catalog. For tables the shape is:

service.database.schema.table

Concrete example:

mssql_prod.SalesDB.dbo.Customers

Things worth knowing up front:

  • FQNs are case-preserving but the API lookup is case-sensitive by default. mssql_prod.SalesDB.dbo.Customers and mssql_prod.salesdb.dbo.customers are treated as different FQNs by the direct lookup endpoint GET /api/v1/tables/name/{fqn}. This sidecar handles that by falling back to a case-insensitive search — but it's still the single biggest source of "why can't it find my table?" confusion.
  • Segments containing dots or special characters are quoted. mssql_prod."My.Database".dbo."Order Details" is a valid FQN. The sidecar does not currently emit quoted segments, so avoid dots in names if you can.
  • The service segment is chosen by whoever registered the service, not derived from the SQL. This is why the sidecar needs --service-name — SQLFlow can tell you the SQL said SalesDB.dbo.Customers, but only you know that SalesDB sits under the OM service named mssql_prod.

Ingestion vs lineage

OpenMetadata distinguishes two kinds of metadata:

  1. Structural ingestion — creates the table entities themselves by inspecting the source system's information_schema (or equivalent). Usually runs on a schedule as an Airflow DAG or as the OM Ingestion container. Produces rows in tables, databases, schemas, etc.
  2. Lineage ingestion — creates directed edges between existing table entities. Produces rows in entity_relationship with relation type upstream.

By default, lineage ingestion does not create tables. This sidecar can optionally auto-create missing Database / DatabaseSchema / Table entities before emission (see Auto-create missing entities) — but in the default mode, if SalesDB.dbo.Customers doesn't already exist as a table entity, you cannot attach lineage to it.

When the sidecar logs Skipping lineage: upstream table not found: mssql.salesdb.dbo.customers, it means: "I asked OM for that FQN and got a 404 (even after case-insensitive fallback). The SQL references a table that OM doesn't know about yet."

Table-level vs column-level lineage

  • Table-level lineage — a directed edge between two table entities. Answers "does data flow from A to B?". Persisted as a single row per edge.
  • Column-level lineage — on top of the table edge, a list of column mappings: "column A.orderId feeds column B.orderId, and columns A.firstName + A.lastName feed column B.fullName". Persisted inside the edge's lineageDetails.columnsLineage JSON field.

OpenMetadata's UI renders column lineage as thin lines between column pills inside each table card. You only get those lines if the sidecar populated columnsLineage. Turn it off with --no-column-lineage if you only want table-level arrows.

Authentication: Bots and JWT tokens

OpenMetadata's REST API uses bearer-token auth. For automated tools (like this sidecar), the recommended pattern is:

  1. In the OM UI: Settings → Bots → Add new bot (e.g. lineage-sidecar-bot).
  2. Generate a JWT token for that bot. Copy it immediately — OM only shows it once.
  3. Pass it to the sidecar via --om-token or the GSP_OM_TOKEN env var.

Bot tokens are long-lived by default (OM rotates them only when you regenerate). Treat them like passwords. The token carries the bot's role — make sure the bot has the DataConsumer role plus EditLineage permission, otherwise the PUT /api/v1/lineage call will return 403.

The lineage API this sidecar uses

The sidecar pushes edges via:

PUT /api/v1/lineage
{
  "edge": {
    "fromEntity": { "id": "<upstream-uuid>",   "type": "table" },
    "toEntity":   { "id": "<downstream-uuid>", "type": "table" },
    "lineageDetails": {
      "sqlQuery": "CREATE PROC ...",
      "source":   "QueryLineage",
      "columnsLineage": [ { "fromColumns": [...], "toColumn": "..." } ]
    }
  }
}

Note that both sides are identified by entity UUID, not FQN. The sidecar resolves FQN → UUID itself (step [4] in the pipeline diagram below). That lookup is where FQN correctness pays off: a wrong FQN means a 404 means a skipped edge.

How these concepts map to sidecar flags

OM concept Sidecar flag Notes
Database service name --service-name First FQN segment. Default: mssql.
Database --database-name Fills FQN when SQL has 1- or 2-part names. No default.
Schema --schema-name Fills FQN when SQL has 1-part names. Default: dbo.
OM server URL --om-server E.g. http://localhost:8585/api (note the /api suffix).
Bot JWT token --om-token From Settings → Bots.
Column lineage toggle --column-lineage / --no-column-lineage Default: on.

How it works

The sidecar is a bridge between SQLFlow (which understands SQL) and OpenMetadata (which stores metadata). SQL syntax parsing is delegated to SQLFlow — the sidecar decides how the file is chunked, maps the result to OpenMetadata entities, and pushes lineage edges.

Pipeline

SQL file / inline SQL
        |
        v
  [1] SQLFlow backend ── parses SQL, returns table & column relationships
        |                 (sidecar sends raw SQL + dialect, nothing else)
        v
  [2] Lineage mapper ── extracts upstream/downstream table pairs + column mappings
        |                from SQLFlow's JSON response
        v
  [3] FQN builder ───── converts SQLFlow table names (e.g. SALESDB.DBO.CUSTOMERS)
        |                to OpenMetadata FQNs (e.g. mssql.salesdb.dbo.customers)
        |                using --service-name, --database-name, --schema-name
        v
  [4] Entity lookup ─── looks up each FQN in OpenMetadata to get entity UUIDs
        |                (exact match first, then case-insensitive search fallback)
        v
  [4b] (optional) Auto-create planner — only when --auto-create-entities is set.
        |                Groups missing endpoints into a Database → Schema → Table
        |                plan, enforces a safety cap, runs a preflight probe, then
        |                POSTs each tier in strict order. Never touches existing
        |                entities; never creates a DatabaseService.
        v
  [5] Lineage push ──── sends edges to OpenMetadata via PUT /api/v1/lineage
                         with column-level detail (suppressed on edges whose
                         endpoints are skeletal / column-less; unknown-column
                         pairs are filtered pre-emit).

Where do table/column names come from?

All table and column names come from SQLFlow, not from the sidecar — and not from an LLM. SQLFlow is a deterministic SQL parser (Gudu Soft's General SQL Parser). The sidecar's Python is a mapper/translator, not a parser.

SQLFlow's JSON response has two branches the sidecar reads:

  • dbobjs — a tree of servers → databases → schemas → tables[]/views[]. This is SQLFlow's inferred catalog from the SQL (with --default-server/database/schema applied at parse time). src/gsp_openmetadata_sidecar/lineage_mapper.py::_build_id_to_fqn walks it to produce {sqlflow_entity_id: "server.db.schema.table"}.
  • relationships[] — column-level flow edges, each with a target and sources[]. lineage_mapper.py::extract_lineage filters to persistent effects (create_view, insert, merge, …), resolves intermediate result-sets (RS-*, MERGE-INSERT-*) transitively back to real tables, and emits TableLineage(upstream_table, downstream_table, column_mappings[]) dataclasses.

The upstream_table / downstream_table strings are the tables that need to exist in OpenMetadata. emitter.py then completes them to 4-part FQNs via _build_fqn (using --service-name, --database-name, --schema-name as fallbacks) and does GET /api/v1/tables/name/{fqn} to resolve them to UUIDs.

Example — given this SQL:

CREATE VIEW [ReportDB].[dbo].[vw_CustomerOrders] AS
SELECT [SalesDB].[dbo].[Customers].[CustomerID],
       [SalesDB].[dbo].[Orders].[OrderDate]
FROM [SalesDB].[dbo].[Customers]
JOIN [SalesDB].[dbo].[Orders] ON ...

SQLFlow returns relationships like:

  • SALESDB.DBO.CUSTOMERS.CUSTOMERIDREPORTDB.DBO.VW_CUSTOMERORDERS.CUSTOMERID
  • SALESDB.DBO.ORDERS.ORDERDATEREPORTDB.DBO.VW_CUSTOMERORDERS.ORDERDATE

The sidecar reads these and builds lineage edges.

Implication for auto-creation of missing entities: SQLFlow's JSON is authoritative for the identity and hierarchy (service/db/schema/table) of every referenced table, because those are what it parsed from the SQL. It is not authoritative for source-table columns, types, descriptions, or owners — SQLFlow never connected to the source database. The auto-create path therefore creates tables with an empty columns array and leaves column-level enrichment to a real ingestion connector.

What happens when tables don't exist in OpenMetadata?

There are two modes, chosen per-run:

Default (feature off): the sidecar only creates lineage edges. If a referenced table isn't found, it logs Skipping lineage: upstream table not found: mssql.salesdb.dbo.customers, skips that edge, and keeps going for the rest. This is the safe baseline — nothing is ever written to the catalog beyond lineage.

Opt-in (--auto-create-entities): the sidecar runs a pre-pass planner, creates any missing Database / DatabaseSchema / Table via POST (never PUT, never DatabaseService), enforces a hard safety cap, and then emits the lineage. Column lineage is suppressed on edges touching freshly-created or column-less endpoints — the sidecar never invents columns. See docs/auto-create-operator-guide.md for safety invariants, RBAC requirements, and the rollout recipe.

In either mode, running OpenMetadata's native metadata ingestion first remains the recommended way to populate the catalog — auto-create is a fallback for tables ingestion can't see (ephemeral staging, ad-hoc sources, pre-onboarding pilots).

Case-insensitive entity matching

SQLFlow uppercases identifiers for case-insensitive databases (MSSQL, etc.), but OpenMetadata may store them in mixed case (e.g. Customers not CUSTOMERS). The sidecar:

  1. First tries an exact FQN lookup: GET /api/v1/tables/name/mssql.salesdb.dbo.customers
  2. On 404, falls back to the search API (case-insensitive)
  3. Picks the best match (exact case-insensitive FQN match wins)

FQN resolution and default database/schema

OpenMetadata identifies tables by FQN: service.database.schema.table. The sidecar builds FQNs from SQLFlow's output using the defaults you provide.

How defaults work

When SQL references a table without a full database.schema.table path, SQLFlow returns only the parts present in the SQL. The sidecar fills in missing parts from your --database-name and --schema-name settings:

Parts in SQL SQLFlow returns Sidecar fills in from defaults
customers (1-part) CUSTOMERS database from --database-name, schema from --schema-name
dbo.customers (2-part) DBO.CUSTOMERS database from --database-name
SalesDB.dbo.customers (3-part) SALESDB.DBO.CUSTOMERS nothing — all parts present

Important: --database-name and --schema-name are NOT sent to SQLFlow. They are used only on the sidecar side for FQN assembly. SQLFlow receives only the raw SQL text and the --db-vendor dialect flag.

Defaults

Setting Default When used
--service-name mssql Always prepended as the first FQN segment
--database-name (none) Used when SQL has 1-part or 2-part table names
--schema-name dbo Used when SQL has 1-part table names

Examples

SQL reference Config Resolved FQN
customers --service-name mssql --database-name SalesDB --schema-name dbo mssql.salesdb.dbo.customers
dbo.customers --service-name mssql --database-name SalesDB mssql.salesdb.dbo.customers
SalesDB.dbo.customers --service-name mssql mssql.salesdb.dbo.customers
[Sales].[dbo].[Invoices] --service-name mssql mssql.sales.dbo.invoices

Square brackets, backticks, and quotes are automatically stripped. All table/database/schema names from SQL are lowercased during FQN construction (with case-insensitive search fallback).

SQLFlow default qualifiers (parse-time)

--default-server, --default-database, and --default-schema are sent to SQLFlow as the defaultServer / defaultDatabase / defaultSchema request fields. They influence how SQLFlow parses unqualified references — before lineage ever reaches the sidecar. This is a different layer from --database-name / --schema-name, which only fill in the OM FQN on the sidecar side. Set both pairs (or neither) depending on which layer needs the default.

Layer Flags What they do
SQLFlow (parse-time) --default-server / --default-database / --default-schema SQLFlow populates the server/database/schema of its output tree so unqualified SQL (e.g. SELECT * FROM Customers) resolves to a real qualified table instead of the placeholder DEFAULT_SERVER.DEFAULT.DEFAULT.Customers.
Sidecar (FQN-assembly) --service-name / --database-name / --schema-name Sidecar builds OpenMetadata FQNs from SQLFlow's output. Fills in missing parts when SQLFlow returned fewer than 3 segments.

Typical pairing:

gsp-openmetadata-sidecar \
  --sql "SELECT * FROM Customers" \
  --default-database SalesDB --default-schema dbo \
  --service-name mssql_prod --database-name SalesDB --schema-name dbo \
  --dry-run

Without the --default-* flags, SQLFlow returns CUSTOMERS as a one-part name and the sidecar fills in from --database-name / --schema-name. With them, SQLFlow's tree itself carries SalesDB.dbo.Customers, the mapper prefers that qualified form, and the sidecar-side defaults become load-bearing only for references SQLFlow couldn't qualify at all.

The three flags are only applied by the HTTP backends. In local_jar mode they are ignored with a warning, because the DataFlowAnalyzer CLI does not expose them.

Configuration reference

All settings can be provided via CLI flags, environment variables, or a YAML config file. Priority: CLI > env vars > YAML > defaults.

Required

Setting CLI flag Env var YAML key Notes
Backend mode --mode GSP_BACKEND_MODE sqlflow.mode anonymous | authenticated | self_hosted | local_jar
User ID --user-id GSP_SQLFLOW_USER_ID sqlflow.user_id Required for authenticated and (usually) self_hosted
Secret key --secret-key GSP_SQLFLOW_SECRET_KEY sqlflow.secret_key Required for authenticated and (usually) self_hosted
JAR path --jar-path GSP_JAR_PATH sqlflow.jar_path Required for local_jar

Common

Setting CLI flag Env var YAML key
SQLFlow URL --sqlflow-url GSP_SQLFLOW_URL sqlflow.url
SQL dialect --db-vendor GSP_DB_VENDOR sqlflow.db_vendor
SQLFlow default server --default-server GSP_DEFAULT_SERVER sqlflow.default_server
SQLFlow default database --default-database GSP_DEFAULT_DATABASE sqlflow.default_database
SQLFlow default schema --default-schema GSP_DEFAULT_SCHEMA sqlflow.default_schema
SQL file --sql-file GSP_SQL_FILE input.sql_file
Inline SQL --sql GSP_SQL_TEXT input.sql_text
OM server --om-server GSP_OM_SERVER openmetadata.server
OM token --om-token GSP_OM_TOKEN openmetadata.token
Service name --service-name GSP_OM_SERVICE_NAME openmetadata.service_name
Database name --database-name GSP_OM_DATABASE_NAME openmetadata.database_name
Schema name --schema-name GSP_OM_SCHEMA_NAME openmetadata.schema_name
Column lineage --column-lineage / --no-column-lineage GSP_COLUMN_LINEAGE openmetadata.column_lineage

Advanced

Setting CLI flag Env var YAML key Notes
Java executable --java-bin GSP_JAVA_BIN sqlflow.java_bin local_jar only; defaults to java on PATH
Dry run --dry-run Skip the write to OpenMetadata
JSON output --json Print raw SQLFlow response to stdout
Verbose logging -v / --verbose Enable DEBUG-level logs
Auto-create entities --auto-create-entities / --no-auto-create-entities GSP_OM_AUTO_CREATE_ENTITIES openmetadata.auto_create_entities Opt-in; see Auto-create missing entities
Create-failure policy --on-create-failure {abort,skip-edge} GSP_OM_ON_CREATE_FAILURE openmetadata.on_create_failure Default abort. 401/403 always fatal regardless.
Safety cap --max-entities-to-create N GSP_OM_MAX_ENTITIES_TO_CREATE openmetadata.max_entities_to_create Default 100. Plan aborts before any write if exceeded.
Audit tag GSP_OM_AUTO_CREATED_TAG_FQN openmetadata.auto_created_tag_fqn Best-effort PATCH each auto-created entity with this tag.

Input handling

The sidecar does not do SQL lineage parsing itself, but it does decide how input files are chunked before calling SQLFlow:

  • Procedural files — if the file contains DECLARE, BEGIN, IF … THEN, CALL, LOOP, EXCEPTION WHEN, END LOOP, END IF, or WHILE, the entire file is sent as a single statement. Splitting on semicolons would break the procedural block.
  • Non-procedural files — the file is split on ; and each statement is sent to SQLFlow independently.
  • Inline SQL (--sql) — always sent as a single statement.

Statement-boundary detection is heuristic: it keys off keyword presence, not full SQL parsing. If a non-procedural file has lineage issues from the split, try running it through the --sql path as one block.

Troubleshooting and limitations

ValueError: sqlflow.user_id and sqlflow.secret_key are both required when mode is 'authenticated' authenticated mode needs both credentials. Set GSP_SQLFLOW_USER_ID + GSP_SQLFLOW_SECRET_KEY, or pass --user-id + --secret-key.

Anonymous API rate limit exceeded The anonymous tier is 50 calls/day per IP. The tool exits with code 2 on this error. Switch to authenticated (10k/month) or self_hosted (unlimited).

Skipping lineage: upstream table not found: … That FQN does not exist in OpenMetadata. Run OpenMetadata ingestion first, create the table entity, or re-run with --auto-create-entities (see Auto-create missing entities). The sidecar keeps going and emits the other edges it can resolve — a single file can partially succeed.

auto_create_entities=true requires sqlflow.default_database … AND sqlflow.default_schema … Auto-create refuses to run without defaults because partial SQL references (e.g. bare customers) would otherwise synthesize ghost entities at non-4-part FQNs. Set --default-database + --default-schema (or the openmetadata.database_name / schema_name equivalents).

Plan would create N entities; max_entities_to_create=M The pre-pass planner counted more missing entities than --max-entities-to-create (default 100) allows. Re-run with --dry-run to review the full tree, confirm the count matches your expectation, then raise --max-entities-to-create explicitly.

OpenMetadata rejected the minimal create payload … at preflight Auto-create's first write returned HTTP 400. Your OpenMetadata version may have tightened payload validation beyond what the sidecar expects. See docs/entity-emission-api-evidence.md for the payload shapes the sidecar sends. File an issue with the OM version + 400 body.

Auto-create refuses foreign service 'X' (configured: 'Y') A lineage FQN named an OpenMetadata service other than the one you configured. Either the SQL or an upstream emitter labeled the lineage incorrectly. The sidecar is single-service by design; run once per service.

Multi-statement files can partially succeed. Per-statement errors are logged and counted, but processing continues through the rest of the file. The process still exits with code 1 if any statement failed.

Large SQL is truncated in the stored sqlQuery. OpenMetadata's lineageDetails.sqlQuery field is capped at 10,000 characters when the sidecar builds the payload. The lineage edges themselves are unaffected.

local_jar: jar not found at … The sidecar does not bundle the SQLFlow JAR. Point --jar-path (or GSP_JAR_PATH) at a licensed gsqlparser-*-shaded.jar.

JVM cold-start with local_jar. Each call spawns a fresh java process (~0.5–1 s overhead). Fine for ad-hoc files; not ideal for log ingestion with hundreds of statements.

Related issues

This tool is motivated by (and has been tested against) the following upstream OpenMetadata threads:

  • #16737 — MSSQL stored procedure lineage not reflected
  • #25299CREATE PROCEDURE / BEGIN-END / temp tables fail
  • #16710 — SQL Server ingestion fails on stored procs with %
  • #17586 — MSSQL procedure lineage not picked up
  • Discussion #23717 — Cross-database MSSQL lineage

License

Apache-2.0. See LICENSE.

This tool calls the Gudu SQLFlow service for SQL parsing. The SQLFlow service is proprietary software by Gudu Software. See the SQLFlow documentation for service terms.

Related

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

gsp_openmetadata_sidecar-0.1.0.tar.gz (87.2 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

gsp_openmetadata_sidecar-0.1.0-py3-none-any.whl (51.8 kB view details)

Uploaded Python 3

File details

Details for the file gsp_openmetadata_sidecar-0.1.0.tar.gz.

File metadata

  • Download URL: gsp_openmetadata_sidecar-0.1.0.tar.gz
  • Upload date:
  • Size: 87.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.3

File hashes

Hashes for gsp_openmetadata_sidecar-0.1.0.tar.gz
Algorithm Hash digest
SHA256 a451c50d78516c3ef5cacaa4703c519b79245796cd682423a8711e6127eb0291
MD5 148c82ed14c9c8f3bf21158f30935a9a
BLAKE2b-256 8f2313c0d7c1fa1d41a77e14f00ec1f882755ac2e718e49e754d1c80509c52ab

See more details on using hashes here.

File details

Details for the file gsp_openmetadata_sidecar-0.1.0-py3-none-any.whl.

File metadata

File hashes

Hashes for gsp_openmetadata_sidecar-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 685b47ef22d2fd782e67d40a7044cc2bbc7acff894bad75b56088712b5d3a007
MD5 dabf491f5497101354f6ef8caf610185
BLAKE2b-256 85a6411de97f483e3eb4f172b3139fd6dc705432bc7f3cc01c140b31e8278630

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page