Recovers SQL lineage that OpenMetadata's parser misses — MSSQL stored procedures, BigQuery procedural SQL, and 20+ dialects — using Gudu SQLFlow.
Project description
gsp-openmetadata-sidecar
Recover SQL lineage that OpenMetadata's parser silently drops — MSSQL stored procedures, BigQuery procedural SQL, MERGE column lineage, temp-table hops, and 20+ dialects — using Gudu SQLFlow, then push the recovered lineage back to OpenMetadata via PUT /api/v1/lineage.
Optionally, auto-create any missing Database / DatabaseSchema / Table entities before emitting lineage, so SQL referencing not-yet-ingested tables still produces complete graphs.
No changes to OpenMetadata code. Public REST API only.
Quick start
pip install gsp-openmetadata-sidecar
gsp-openmetadata-sidecar --dry-run --sql "CREATE PROC p AS BEGIN INSERT INTO t2 SELECT a, b FROM t1 END"
This command:
- parses the SQL with SQLFlow's free anonymous tier
- prints the extracted lineage edges (
t1 → t2) to stdout - does not contact OpenMetadata
- does not require
sidecar.yaml— if the file is absent, built-in defaults are used
If you see upstream → downstream edges, the tool works in your environment. For more realistic fixtures — MSSQL stored procedures, BigQuery procedural SQL, MERGE column lineage — see examples/ in the repo:
gsp-openmetadata-sidecar --sql-file examples/mssql_stored_procedure.sql --dry-run
Everything below is about graduating from that first --dry-run to a useful live deployment.
Choose your path
The sidecar has three onboarding paths with different prerequisites. Pick the row that matches your goal:
| Goal | What you need | Shortest command |
|---|---|---|
| Evaluate lineage extraction | A SQL file or inline SQL | gsp-openmetadata-sidecar --sql-file proc.sql --dry-run |
| Push lineage to an existing OM catalog | Above + --om-server, --om-token, --service-name (+ --database-name if SQL uses short names) |
See Push lineage to OpenMetadata |
| Push lineage and auto-create missing tables | Above + --default-database/--default-schema + bot with Create RBAC |
See Auto-create missing entities |
If you're new to OpenMetadata, you will also want to skim the OpenMetadata primer once — most first-time friction comes from misunderstanding how OpenMetadata identifies tables, not from bugs in this sidecar.
When to use this
OpenMetadata uses a three-parser chain (sqlglot → sqlfluff → sqlparse via collate-sqllineage) that silently loses lineage on procedural and dialect-specific SQL. Use this sidecar when you see any of:
- MSSQL stored procedures (
BEGIN…END,DECLARE,CREATE PROCEDURE) - MSSQL square-bracket identifiers (
[dbo].[table]) - Temp-table intermediates (
#tempTable), breaking multi-hop lineage - Multi-statement scripts (
DECLARE,IF/THEN,CREATE TEMP TABLE) MERGE INTO … WHEN MATCHED … WHEN NOT MATCHEDcolumn lineage- Cross-database MSSQL queries
- Lineage against tables that don't exist in OpenMetadata yet (staging, ephemeral, not-yet-ingested sources)
See Related issues for upstream tickets.
Install
From PyPI:
pip install gsp-openmetadata-sidecar
From source (recommended for contributors and environments with an externally managed Python):
python3 -m venv .venv
. .venv/bin/activate
pip install -e .
gsp-openmetadata-sidecar --help
Requirements:
- Python 3.9+
- Network access to the SQLFlow API (for
anonymous,authenticated,self_hostedmodes) - Java 8+ and a licensed
gsqlparser-*-shaded.jar(forlocal_jarmode only)
Push lineage to OpenMetadata
To push lineage (instead of dry-running), you need:
- An OpenMetadata instance with the target tables already ingested — or
--auto-create-entitiesenabled - A JWT token (Settings → Bots in the OpenMetadata UI)
- The database service name as registered in OpenMetadata
gsp-openmetadata-sidecar \
--sql-file stored_proc.sql \
--om-server http://localhost:8585/api \
--om-token "eyJ..." \
--service-name mssql_prod \
--database-name SalesDB \
--schema-name dbo
Or via a config file. A minimal sidecar.yaml for a live run:
sqlflow:
mode: anonymous
db_vendor: dbvmssql
openmetadata:
server: http://localhost:8585/api
token: "eyJ..."
service_name: mssql_prod
database_name: SalesDB
schema_name: dbo
gsp-openmetadata-sidecar --config sidecar.yaml --sql-file stored_proc.sql
See examples/sidecar.yaml.example for the full annotated version, including authenticated / self_hosted / local_jar and auto-create knobs.
The flags you'll actually type
| Flag | What it does |
|---|---|
--sql-file / --sql |
Input SQL. One file path, or an inline string. |
--dry-run |
Skip the write to OpenMetadata; print extracted lineage instead. |
--om-server |
OpenMetadata API base URL. Must include /api. |
--om-token |
Bot JWT token. From Settings → Bots in the OM UI. |
--service-name |
Database service name as registered in OM. First FQN segment. |
--database-name / --schema-name |
Fallbacks for FQN assembly when SQL uses short names. |
--auto-create-entities |
Opt in to creating missing Database / DatabaseSchema / Table before emitting. |
All other flags are listed in the Configuration reference.
Which defaults do I need?
The sidecar has two separate layers of defaults with similar-looking names. Confusing them is the most common first-run mistake:
| Layer | Flags | Purpose |
|---|---|---|
| SQLFlow (parse-time) | --default-server / --default-database / --default-schema |
Help SQLFlow understand the SQL. Fed as request fields so SELECT * FROM Customers is parsed as SalesDB.dbo.Customers instead of a placeholder. |
| Sidecar (FQN-assembly) | --service-name / --database-name / --schema-name |
Help the sidecar build OpenMetadata FQNs. Fill in missing parts when SQLFlow returns fewer than 3 segments. |
Rules of thumb:
- If your SQL uses fully-qualified names (
[SalesDB].[dbo].[Customers]), you need only--service-name. - If your SQL uses bare table names, set both pairs to the same values so SQLFlow parses correctly and the sidecar can build a valid 4-part FQN.
--default-*flags are ignored inlocal_jarmode (with a warning) — the JAR-basedDataFlowAnalyzerCLI does not accept them.
For the full mechanics, see FQN resolution and default database/schema and SQLFlow default qualifiers (parse-time) below.
Common first-run mistakes
A short list of the failure modes most new users hit:
--om-servermissing the/apisuffix. It must behttp://host:port/api, not justhttp://host:port.- Confusing
--default-databasewith--database-name. See the cheat-sheet above. - Expecting
--dry-runto contact OpenMetadata. It does not. Drop--dry-runfor a live run. - Expecting
local_jarmode to honor--default-*. It doesn't — SQLFlow's JAR CLI doesn't accept those. - Expecting lineage push to create tables by itself. It doesn't. Either ingest structure first, or pass
--auto-create-entities. - Bot token missing
EditLineage.PUT /api/v1/lineagereturns 403. Add the permission or grant the bot theDataConsumerrole. - Expecting column-level lineage after an auto-create run. Auto-created tables are intentionally skeletal (no columns), so column lineage is suppressed on edges touching them. See Why isn't there column lineage after auto-create? below.
Common commands
# Inline SQL (no file required):
gsp-openmetadata-sidecar --sql "CREATE PROC p AS BEGIN INSERT INTO t2 SELECT a, b FROM t1 END" --dry-run
# BigQuery procedural SQL with DECLARE, CREATE TEMP TABLE:
gsp-openmetadata-sidecar --sql-file examples/bigquery_procedural.sql --db-vendor dbvbigquery --dry-run
# Table-level lineage only (skip column mappings):
gsp-openmetadata-sidecar --sql-file proc.sql --no-column-lineage --dry-run
# Print raw SQLFlow JSON response to stdout:
gsp-openmetadata-sidecar --sql-file proc.sql --json --dry-run
# Verbose/debug logging:
gsp-openmetadata-sidecar --sql-file proc.sql --dry-run -v
# Preview what entities would be auto-created (zero writes):
gsp-openmetadata-sidecar --sql-file proc.sql --auto-create-entities --dry-run
# Live run with auto-create, conservative cap, fail on any create failure:
gsp-openmetadata-sidecar --sql-file proc.sql --auto-create-entities \
--max-entities-to-create 10 --on-create-failure abort
# Show version:
gsp-openmetadata-sidecar --version
Auto-create missing entities (opt-in)
By default, the sidecar only writes lineage edges. If a referenced table isn't found in OpenMetadata, it logs Skipping lineage: upstream table not found: …, skips that edge, and continues. This is the safe baseline.
With --auto-create-entities (or openmetadata.auto_create_entities: true in YAML, or GSP_OM_AUTO_CREATE_ENTITIES=true), the sidecar runs a pre-pass planner, creates any missing Database / DatabaseSchema / Table via POST in strict order, and then emits the lineage. The DatabaseService is never auto-created.
When to reach for it:
- SQL references staging / ephemeral / not-yet-onboarded tables
- You can't wait for a metadata ingestion run to populate the catalog
- You want complete lineage graphs now, and are willing to accept skeletal (column-less) table entities that a connector can enrich later
Two hard prerequisites
- Set parse-time defaults. Auto-create refuses to run unless
--default-database+--default-schemaare set (or theopenmetadata.*equivalents). Without them, SQLFlow can return partial identifiers that would otherwise materialize as ghost entities at non-4-part FQNs. - The bot needs
CreateRBAC onDatabase,DatabaseSchema, andTable, scoped to the configured service. See the operator guide for a recommended role shape.
The minimum safe command
# 1. Dry-run to preview what would be created.
gsp-openmetadata-sidecar \
--config sidecar.yaml \
--sql-file my_etl.sql \
--auto-create-entities \
--dry-run
# 2. Live run with a conservative cap.
gsp-openmetadata-sidecar \
--config sidecar.yaml \
--sql-file my_etl.sql \
--auto-create-entities \
--on-create-failure abort \
--max-entities-to-create 10
For safety invariants, rollout recipe, and stop-ship criteria before rolling this out on a real OpenMetadata instance, see docs/auto-create-operator-guide.md.
Why isn't there column lineage after auto-create?
This is the single biggest surprise on a first auto-create run, so call it out up front:
Auto-created tables have no columns. The sidecar POSTs them with columns: [] because SQLFlow never connected to the source database and therefore doesn't know the column types, lengths, nullability, or ordering. Inventing column metadata on admin-curated entities would be worse than leaving the tables skeletal — an ingestion connector that later scans the source cannot reliably diff-and-merge against placeholder columns.
Column lineage is then suppressed on every edge touching a skeletal endpoint. You'll see this in the run summary:
Lineage emission complete: 2 emitted, 0 skipped (column lineage suppressed on 2 edges, 0 column-pair(s) filtered)
Table-level arrows appear in the OM UI as usual; the thin column-to-column lines inside them do not.
To get column lineage back, populate the columns on the tables and re-run the sidecar. Normal path:
- Run OM's native structural ingestion (Airflow DAG or the Ingestion container) against the source. It fills
columns[]on the existing skeletal entities. - Re-run the sidecar on the same SQL. It finds the tables already exist (via exact FQN lookup), sees they now have columns, and emits full column-level
columnsLineage.
For a smoke test without a real source database, you can PATCH columns onto the skeletal tables directly:
curl -X PATCH "$GSP_OM_SERVER/v1/tables/name/mssql_prod.sales.dbo.invoices" \
-H "Authorization: Bearer $GSP_OM_TOKEN" \
-H "Content-Type: application/json-patch+json" \
-d '[{"op":"replace","path":"/columns","value":[
{"name":"InvoiceDate","dataType":"DATETIME"},
{"name":"CustomerId","dataType":"BIGINT"},
{"name":"Amount","dataType":"NUMERIC"}]}]'
Then re-run the sidecar (no --auto-create-entities needed; the tables already exist). Column casing doesn't have to match SQLFlow's output — the sidecar does case-insensitive matching via a lowercase→canonical map.
Backend modes
| Mode | Auth | Rate limit | Data stays… | Best for |
|---|---|---|---|---|
anonymous (default) |
None | 50/day per IP | Gudu cloud | Quick evaluation |
authenticated |
user_id + secret_key |
10k/month | Gudu cloud | Regular use |
self_hosted |
user_id + secret_key (token exchange) |
Unlimited | Your network | Production / air-gapped |
local_jar |
None (local process) | Unlimited | Your machine | Offline / no Docker |
# Anonymous (default — no signup needed):
gsp-openmetadata-sidecar --sql-file proc.sql --dry-run
# Authenticated — BOTH user_id and secret_key are required:
GSP_BACKEND_MODE=authenticated \
GSP_SQLFLOW_USER_ID=your-user-id \
GSP_SQLFLOW_SECRET_KEY=your-secret-key \
gsp-openmetadata-sidecar --sql-file proc.sql --dry-run
# Equivalent with CLI flags:
gsp-openmetadata-sidecar --mode authenticated \
--user-id your-user-id --secret-key your-secret-key \
--sql-file proc.sql --dry-run
# Self-hosted SQLFlow Docker (token-exchange protocol, same creds pattern):
gsp-openmetadata-sidecar --mode self_hosted \
--sqlflow-url http://localhost:8165/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson \
--user-id gudu --secret-key 0123456789 \
--sql-file proc.sql --dry-run
# Local JAR (no network; requires Java and a licensed JAR):
gsp-openmetadata-sidecar --mode local_jar \
--jar-path /path/to/gsqlparser-shaded.jar \
--sql-file proc.sql --dry-run
OpenMetadata primer (for first-time users)
If you've never used OpenMetadata before, this section gives you the vocabulary the rest of this README assumes. Skim it before you start passing flags like --service-name or --om-token.
The entity hierarchy
OpenMetadata stores metadata as entities. Every table, column, dashboard, pipeline, and user is an entity. For SQL lineage you care about three entity types, arranged as a strict 4-level hierarchy:
Database Service (e.g. "mssql_prod") — a connection to a source system
└── Database (e.g. "SalesDB") — a logical database within that service
└── Schema (e.g. "dbo") — a schema/namespace within the database
└── Table (e.g. "Customers") — the actual table (with columns inside)
Key points for newcomers:
- Service is not the server. A service is an OpenMetadata-registered connection. You choose the name when you set it up in the OM UI (Settings → Services → Databases → Add). It is usually named after the source system (e.g.
mssql_prod,snowflake_analytics), but it is an opaque identifier — not the SQL Server hostname or DSN. - Database ≠ schema. For MSSQL,
SalesDB.dbo.Customershas databaseSalesDBand schemadbo. For PostgreSQL or Snowflake, same structure. For MySQL there's effectively no separate schema level, so OM stores a synthetic schema calleddefault. - Columns live inside the table entity, not as standalone entities with their own URLs. Column lineage is expressed within a table-to-table lineage edge.
Fully-Qualified Name (FQN)
An FQN is how OpenMetadata uniquely identifies an entity across the whole catalog. For tables the shape is:
service.database.schema.table
Concrete example:
mssql_prod.SalesDB.dbo.Customers
Things worth knowing up front:
- FQNs are case-preserving but the API lookup is case-sensitive by default.
mssql_prod.SalesDB.dbo.Customersandmssql_prod.salesdb.dbo.customersare treated as different FQNs by the direct lookup endpointGET /api/v1/tables/name/{fqn}. This sidecar handles that by falling back to a case-insensitive search — but it's still the single biggest source of "why can't it find my table?" confusion. - Segments containing dots or special characters are quoted.
mssql_prod."My.Database".dbo."Order Details"is a valid FQN. The sidecar does not currently emit quoted segments, so avoid dots in names if you can. - The service segment is chosen by whoever registered the service, not derived from the SQL. This is why the sidecar needs
--service-name— SQLFlow can tell you the SQL saidSalesDB.dbo.Customers, but only you know thatSalesDBsits under the OM service namedmssql_prod.
Ingestion vs lineage
OpenMetadata distinguishes two kinds of metadata:
- Structural ingestion — creates the table entities themselves by inspecting the source system's information_schema (or equivalent). Usually runs on a schedule as an Airflow DAG or as the OM Ingestion container. Produces rows in
tables,databases,schemas, etc. - Lineage ingestion — creates directed edges between existing table entities. Produces rows in
entity_relationshipwith relation typeupstream.
By default, lineage ingestion does not create tables. This sidecar can optionally auto-create missing Database / DatabaseSchema / Table entities before emission (see Auto-create missing entities) — but in the default mode, if SalesDB.dbo.Customers doesn't already exist as a table entity, you cannot attach lineage to it.
When the sidecar logs Skipping lineage: upstream table not found: mssql.salesdb.dbo.customers, it means: "I asked OM for that FQN and got a 404 (even after case-insensitive fallback). The SQL references a table that OM doesn't know about yet."
Table-level vs column-level lineage
- Table-level lineage — a directed edge between two table entities. Answers "does data flow from A to B?". Persisted as a single row per edge.
- Column-level lineage — on top of the table edge, a list of column mappings: "column
A.orderIdfeeds columnB.orderId, and columnsA.firstName + A.lastNamefeed columnB.fullName". Persisted inside the edge'slineageDetails.columnsLineageJSON field.
OpenMetadata's UI renders column lineage as thin lines between column pills inside each table card. You only get those lines if the sidecar populated columnsLineage. Turn it off with --no-column-lineage if you only want table-level arrows.
Authentication: Bots and JWT tokens
OpenMetadata's REST API uses bearer-token auth. For automated tools (like this sidecar), the recommended pattern is:
- In the OM UI: Settings → Bots → Add new bot (e.g.
lineage-sidecar-bot). - Generate a JWT token for that bot. Copy it immediately — OM only shows it once.
- Pass it to the sidecar via
--om-tokenor theGSP_OM_TOKENenv var.
Bot tokens are long-lived by default (OM rotates them only when you regenerate). Treat them like passwords. The token carries the bot's role — make sure the bot has the DataConsumer role plus EditLineage permission, otherwise the PUT /api/v1/lineage call will return 403.
The lineage API this sidecar uses
The sidecar pushes edges via:
PUT /api/v1/lineage
{
"edge": {
"fromEntity": { "id": "<upstream-uuid>", "type": "table" },
"toEntity": { "id": "<downstream-uuid>", "type": "table" },
"lineageDetails": {
"sqlQuery": "CREATE PROC ...",
"source": "QueryLineage",
"columnsLineage": [ { "fromColumns": [...], "toColumn": "..." } ]
}
}
}
Note that both sides are identified by entity UUID, not FQN. The sidecar resolves FQN → UUID itself (step [4] in the pipeline diagram below). That lookup is where FQN correctness pays off: a wrong FQN means a 404 means a skipped edge.
How these concepts map to sidecar flags
| OM concept | Sidecar flag | Notes |
|---|---|---|
| Database service name | --service-name |
First FQN segment. Default: mssql. |
| Database | --database-name |
Fills FQN when SQL has 1- or 2-part names. No default. |
| Schema | --schema-name |
Fills FQN when SQL has 1-part names. Default: dbo. |
| OM server URL | --om-server |
E.g. http://localhost:8585/api (note the /api suffix). |
| Bot JWT token | --om-token |
From Settings → Bots. |
| Column lineage toggle | --column-lineage / --no-column-lineage |
Default: on. |
How it works
The sidecar is a bridge between SQLFlow (which understands SQL) and OpenMetadata (which stores metadata). SQL syntax parsing is delegated to SQLFlow — the sidecar decides how the file is chunked, maps the result to OpenMetadata entities, and pushes lineage edges.
Pipeline
SQL file / inline SQL
|
v
[1] SQLFlow backend ── parses SQL, returns table & column relationships
| (sidecar sends raw SQL + dialect, nothing else)
v
[2] Lineage mapper ── extracts upstream/downstream table pairs + column mappings
| from SQLFlow's JSON response
v
[3] FQN builder ───── converts SQLFlow table names (e.g. SALESDB.DBO.CUSTOMERS)
| to OpenMetadata FQNs (e.g. mssql.salesdb.dbo.customers)
| using --service-name, --database-name, --schema-name
v
[4] Entity lookup ─── looks up each FQN in OpenMetadata to get entity UUIDs
| (exact match first, then case-insensitive search fallback)
v
[4b] (optional) Auto-create planner — only when --auto-create-entities is set.
| Groups missing endpoints into a Database → Schema → Table
| plan, enforces a safety cap, runs a preflight probe, then
| POSTs each tier in strict order. Never touches existing
| entities; never creates a DatabaseService.
v
[5] Lineage push ──── sends edges to OpenMetadata via PUT /api/v1/lineage
with column-level detail (suppressed on edges whose
endpoints are skeletal / column-less; unknown-column
pairs are filtered pre-emit).
Where do table/column names come from?
All table and column names come from SQLFlow, not from the sidecar — and not from an LLM. SQLFlow is a deterministic SQL parser (Gudu Soft's General SQL Parser). The sidecar's Python is a mapper/translator, not a parser.
SQLFlow's JSON response has two branches the sidecar reads:
dbobjs— a tree ofservers → databases → schemas → tables[]/views[]. This is SQLFlow's inferred catalog from the SQL (with--default-server/database/schemaapplied at parse time).src/gsp_openmetadata_sidecar/lineage_mapper.py::_build_id_to_fqnwalks it to produce{sqlflow_entity_id: "server.db.schema.table"}.relationships[]— column-level flow edges, each with atargetandsources[].lineage_mapper.py::extract_lineagefilters to persistent effects (create_view,insert,merge, …), resolves intermediate result-sets (RS-*,MERGE-INSERT-*) transitively back to real tables, and emitsTableLineage(upstream_table, downstream_table, column_mappings[])dataclasses.
The upstream_table / downstream_table strings are the tables that need to exist in OpenMetadata. emitter.py then completes them to 4-part FQNs via _build_fqn (using --service-name, --database-name, --schema-name as fallbacks) and does GET /api/v1/tables/name/{fqn} to resolve them to UUIDs.
Example — given this SQL:
CREATE VIEW [ReportDB].[dbo].[vw_CustomerOrders] AS
SELECT [SalesDB].[dbo].[Customers].[CustomerID],
[SalesDB].[dbo].[Orders].[OrderDate]
FROM [SalesDB].[dbo].[Customers]
JOIN [SalesDB].[dbo].[Orders] ON ...
SQLFlow returns relationships like:
SALESDB.DBO.CUSTOMERS.CUSTOMERID→REPORTDB.DBO.VW_CUSTOMERORDERS.CUSTOMERIDSALESDB.DBO.ORDERS.ORDERDATE→REPORTDB.DBO.VW_CUSTOMERORDERS.ORDERDATE
The sidecar reads these and builds lineage edges.
Implication for auto-creation of missing entities: SQLFlow's JSON is authoritative for the identity and hierarchy (service/db/schema/table) of every referenced table, because those are what it parsed from the SQL. It is not authoritative for source-table columns, types, descriptions, or owners — SQLFlow never connected to the source database. The auto-create path therefore creates tables with an empty columns array and leaves column-level enrichment to a real ingestion connector.
What happens when tables don't exist in OpenMetadata?
There are two modes, chosen per-run:
Default (feature off): the sidecar only creates lineage edges. If a referenced table isn't found, it logs Skipping lineage: upstream table not found: mssql.salesdb.dbo.customers, skips that edge, and keeps going for the rest. This is the safe baseline — nothing is ever written to the catalog beyond lineage.
Opt-in (--auto-create-entities): the sidecar runs a pre-pass planner, creates any missing Database / DatabaseSchema / Table via POST (never PUT, never DatabaseService), enforces a hard safety cap, and then emits the lineage. Column lineage is suppressed on edges touching freshly-created or column-less endpoints — the sidecar never invents columns. See docs/auto-create-operator-guide.md for safety invariants, RBAC requirements, and the rollout recipe.
In either mode, running OpenMetadata's native metadata ingestion first remains the recommended way to populate the catalog — auto-create is a fallback for tables ingestion can't see (ephemeral staging, ad-hoc sources, pre-onboarding pilots).
Case-insensitive entity matching
SQLFlow uppercases identifiers for case-insensitive databases (MSSQL, etc.), but OpenMetadata may store them in mixed case (e.g. Customers not CUSTOMERS). The sidecar:
- First tries an exact FQN lookup:
GET /api/v1/tables/name/mssql.salesdb.dbo.customers - On 404, falls back to the search API (case-insensitive)
- Picks the best match (exact case-insensitive FQN match wins)
FQN resolution and default database/schema
OpenMetadata identifies tables by FQN: service.database.schema.table. The sidecar builds FQNs from SQLFlow's output using the defaults you provide.
How defaults work
When SQL references a table without a full database.schema.table path, SQLFlow returns only the parts present in the SQL. The sidecar fills in missing parts from your --database-name and --schema-name settings:
| Parts in SQL | SQLFlow returns | Sidecar fills in from defaults |
|---|---|---|
customers (1-part) |
CUSTOMERS |
database from --database-name, schema from --schema-name |
dbo.customers (2-part) |
DBO.CUSTOMERS |
database from --database-name |
SalesDB.dbo.customers (3-part) |
SALESDB.DBO.CUSTOMERS |
nothing — all parts present |
Important: --database-name and --schema-name are NOT sent to SQLFlow. They are used only on the sidecar side for FQN assembly. SQLFlow receives only the raw SQL text and the --db-vendor dialect flag.
Defaults
| Setting | Default | When used |
|---|---|---|
--service-name |
mssql |
Always prepended as the first FQN segment |
--database-name |
(none) | Used when SQL has 1-part or 2-part table names |
--schema-name |
dbo |
Used when SQL has 1-part table names |
Examples
| SQL reference | Config | Resolved FQN |
|---|---|---|
customers |
--service-name mssql --database-name SalesDB --schema-name dbo |
mssql.salesdb.dbo.customers |
dbo.customers |
--service-name mssql --database-name SalesDB |
mssql.salesdb.dbo.customers |
SalesDB.dbo.customers |
--service-name mssql |
mssql.salesdb.dbo.customers |
[Sales].[dbo].[Invoices] |
--service-name mssql |
mssql.sales.dbo.invoices |
Square brackets, backticks, and quotes are automatically stripped. All table/database/schema names from SQL are lowercased during FQN construction (with case-insensitive search fallback).
SQLFlow default qualifiers (parse-time)
--default-server, --default-database, and --default-schema are sent to SQLFlow as the defaultServer / defaultDatabase / defaultSchema request fields. They influence how SQLFlow parses unqualified references — before lineage ever reaches the sidecar. This is a different layer from --database-name / --schema-name, which only fill in the OM FQN on the sidecar side. Set both pairs (or neither) depending on which layer needs the default.
| Layer | Flags | What they do |
|---|---|---|
| SQLFlow (parse-time) | --default-server / --default-database / --default-schema |
SQLFlow populates the server/database/schema of its output tree so unqualified SQL (e.g. SELECT * FROM Customers) resolves to a real qualified table instead of the placeholder DEFAULT_SERVER.DEFAULT.DEFAULT.Customers. |
| Sidecar (FQN-assembly) | --service-name / --database-name / --schema-name |
Sidecar builds OpenMetadata FQNs from SQLFlow's output. Fills in missing parts when SQLFlow returned fewer than 3 segments. |
Typical pairing:
gsp-openmetadata-sidecar \
--sql "SELECT * FROM Customers" \
--default-database SalesDB --default-schema dbo \
--service-name mssql_prod --database-name SalesDB --schema-name dbo \
--dry-run
Without the --default-* flags, SQLFlow returns CUSTOMERS as a one-part name and the sidecar fills in from --database-name / --schema-name. With them, SQLFlow's tree itself carries SalesDB.dbo.Customers, the mapper prefers that qualified form, and the sidecar-side defaults become load-bearing only for references SQLFlow couldn't qualify at all.
The three flags are only applied by the HTTP backends. In local_jar mode they are ignored with a warning, because the DataFlowAnalyzer CLI does not expose them.
Configuration reference
All settings can be provided via CLI flags, environment variables, or a YAML config file. Priority: CLI > env vars > YAML > defaults.
Required
| Setting | CLI flag | Env var | YAML key | Notes |
|---|---|---|---|---|
| Backend mode | --mode |
GSP_BACKEND_MODE |
sqlflow.mode |
anonymous | authenticated | self_hosted | local_jar |
| User ID | --user-id |
GSP_SQLFLOW_USER_ID |
sqlflow.user_id |
Required for authenticated and (usually) self_hosted |
| Secret key | --secret-key |
GSP_SQLFLOW_SECRET_KEY |
sqlflow.secret_key |
Required for authenticated and (usually) self_hosted |
| JAR path | --jar-path |
GSP_JAR_PATH |
sqlflow.jar_path |
Required for local_jar |
Common
| Setting | CLI flag | Env var | YAML key |
|---|---|---|---|
| SQLFlow URL | --sqlflow-url |
GSP_SQLFLOW_URL |
sqlflow.url |
| SQL dialect | --db-vendor |
GSP_DB_VENDOR |
sqlflow.db_vendor |
| SQLFlow default server | --default-server |
GSP_DEFAULT_SERVER |
sqlflow.default_server |
| SQLFlow default database | --default-database |
GSP_DEFAULT_DATABASE |
sqlflow.default_database |
| SQLFlow default schema | --default-schema |
GSP_DEFAULT_SCHEMA |
sqlflow.default_schema |
| SQL file | --sql-file |
GSP_SQL_FILE |
input.sql_file |
| Inline SQL | --sql |
GSP_SQL_TEXT |
input.sql_text |
| OM server | --om-server |
GSP_OM_SERVER |
openmetadata.server |
| OM token | --om-token |
GSP_OM_TOKEN |
openmetadata.token |
| Service name | --service-name |
GSP_OM_SERVICE_NAME |
openmetadata.service_name |
| Database name | --database-name |
GSP_OM_DATABASE_NAME |
openmetadata.database_name |
| Schema name | --schema-name |
GSP_OM_SCHEMA_NAME |
openmetadata.schema_name |
| Column lineage | --column-lineage / --no-column-lineage |
GSP_COLUMN_LINEAGE |
openmetadata.column_lineage |
Advanced
| Setting | CLI flag | Env var | YAML key | Notes |
|---|---|---|---|---|
| Java executable | --java-bin |
GSP_JAVA_BIN |
sqlflow.java_bin |
local_jar only; defaults to java on PATH |
| Dry run | --dry-run |
— | — | Skip the write to OpenMetadata |
| JSON output | --json |
— | — | Print raw SQLFlow response to stdout |
| Verbose logging | -v / --verbose |
— | — | Enable DEBUG-level logs |
| Auto-create entities | --auto-create-entities / --no-auto-create-entities |
GSP_OM_AUTO_CREATE_ENTITIES |
openmetadata.auto_create_entities |
Opt-in; see Auto-create missing entities |
| Create-failure policy | --on-create-failure {abort,skip-edge} |
GSP_OM_ON_CREATE_FAILURE |
openmetadata.on_create_failure |
Default abort. 401/403 always fatal regardless. |
| Safety cap | --max-entities-to-create N |
GSP_OM_MAX_ENTITIES_TO_CREATE |
openmetadata.max_entities_to_create |
Default 100. Plan aborts before any write if exceeded. |
| Audit tag | — | GSP_OM_AUTO_CREATED_TAG_FQN |
openmetadata.auto_created_tag_fqn |
Best-effort PATCH each auto-created entity with this tag. |
Input handling
The sidecar does not do SQL lineage parsing itself, but it does decide how input files are chunked before calling SQLFlow:
- Procedural files — if the file contains
DECLARE,BEGIN,IF … THEN,CALL,LOOP,EXCEPTION WHEN,END LOOP,END IF, orWHILE, the entire file is sent as a single statement. Splitting on semicolons would break the procedural block. - Non-procedural files — the file is split on
;and each statement is sent to SQLFlow independently. - Inline SQL (
--sql) — always sent as a single statement.
Statement-boundary detection is heuristic: it keys off keyword presence, not full SQL parsing. If a non-procedural file has lineage issues from the split, try running it through the --sql path as one block.
Troubleshooting and limitations
ValueError: sqlflow.user_id and sqlflow.secret_key are both required when mode is 'authenticated'
authenticated mode needs both credentials. Set GSP_SQLFLOW_USER_ID + GSP_SQLFLOW_SECRET_KEY, or pass --user-id + --secret-key.
Anonymous API rate limit exceeded
The anonymous tier is 50 calls/day per IP. The tool exits with code 2 on this error. Switch to authenticated (10k/month) or self_hosted (unlimited).
Skipping lineage: upstream table not found: …
That FQN does not exist in OpenMetadata. Run OpenMetadata ingestion first, create the table entity, or re-run with --auto-create-entities (see Auto-create missing entities). The sidecar keeps going and emits the other edges it can resolve — a single file can partially succeed.
auto_create_entities=true requires sqlflow.default_database … AND sqlflow.default_schema …
Auto-create refuses to run without defaults because partial SQL references (e.g. bare customers) would otherwise synthesize ghost entities at non-4-part FQNs. Set --default-database + --default-schema (or the openmetadata.database_name / schema_name equivalents).
Plan would create N entities; max_entities_to_create=M
The pre-pass planner counted more missing entities than --max-entities-to-create (default 100) allows. Re-run with --dry-run to review the full tree, confirm the count matches your expectation, then raise --max-entities-to-create explicitly.
OpenMetadata rejected the minimal create payload … at preflight
Auto-create's first write returned HTTP 400. Your OpenMetadata version may have tightened payload validation beyond what the sidecar expects. See docs/entity-emission-api-evidence.md for the payload shapes the sidecar sends. File an issue with the OM version + 400 body.
Auto-create refuses foreign service 'X' (configured: 'Y')
A lineage FQN named an OpenMetadata service other than the one you configured. Either the SQL or an upstream emitter labeled the lineage incorrectly. The sidecar is single-service by design; run once per service.
Multi-statement files can partially succeed. Per-statement errors are logged and counted, but processing continues through the rest of the file. The process still exits with code 1 if any statement failed.
Large SQL is truncated in the stored sqlQuery. OpenMetadata's lineageDetails.sqlQuery field is capped at 10,000 characters when the sidecar builds the payload. The lineage edges themselves are unaffected.
local_jar: jar not found at …
The sidecar does not bundle the SQLFlow JAR. Point --jar-path (or GSP_JAR_PATH) at a licensed gsqlparser-*-shaded.jar.
JVM cold-start with local_jar. Each call spawns a fresh java process (~0.5–1 s overhead). Fine for ad-hoc files; not ideal for log ingestion with hundreds of statements.
Related issues
This tool is motivated by (and has been tested against) the following upstream OpenMetadata threads:
- #16737 — MSSQL stored procedure lineage not reflected
- #25299 —
CREATE PROCEDURE/BEGIN-END/ temp tables fail - #16710 — SQL Server ingestion fails on stored procs with
% - #17586 — MSSQL procedure lineage not picked up
- Discussion #23717 — Cross-database MSSQL lineage
License
Apache-2.0. See LICENSE.
This tool calls the Gudu SQLFlow service for SQL parsing. The SQLFlow service is proprietary software by Gudu Software. See the SQLFlow documentation for service terms.
Related
- gsp-datahub-sidecar — same concept for DataHub
- Gudu SQLFlow — the SQL lineage engine powering this tool
- General SQL Parser — the SQL parser library by Gudu Software
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file gsp_openmetadata_sidecar-0.1.0.tar.gz.
File metadata
- Download URL: gsp_openmetadata_sidecar-0.1.0.tar.gz
- Upload date:
- Size: 87.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
a451c50d78516c3ef5cacaa4703c519b79245796cd682423a8711e6127eb0291
|
|
| MD5 |
148c82ed14c9c8f3bf21158f30935a9a
|
|
| BLAKE2b-256 |
8f2313c0d7c1fa1d41a77e14f00ec1f882755ac2e718e49e754d1c80509c52ab
|
File details
Details for the file gsp_openmetadata_sidecar-0.1.0-py3-none-any.whl.
File metadata
- Download URL: gsp_openmetadata_sidecar-0.1.0-py3-none-any.whl
- Upload date:
- Size: 51.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
685b47ef22d2fd782e67d40a7044cc2bbc7acff894bad75b56088712b5d3a007
|
|
| MD5 |
dabf491f5497101354f6ef8caf610185
|
|
| BLAKE2b-256 |
85a6411de97f483e3eb4f172b3139fd6dc705432bc7f3cc01c140b31e8278630
|