gsp-datahub-sidecar

Recover SQL lineage that DataHub's sqlglot parser misses — across BigQuery, MSSQL, Snowflake, Oracle, and 20+ dialects — using Gudu SQLFlow.

The problem

DataHub uses sqlglot for SQL lineage extraction. sqlglot silently loses lineage on:

  • BigQuery procedural SQL — DECLARE, IF/THEN, CALL, CREATE TEMP TABLE (#11654)
  • BigQuery dbt deduplication macros — array_agg()[offset(0)] with struct unpacking (#11670)
  • MSSQL stored procedures — CASE...END breaks split_statements (#12606)
  • MSSQL identifier casing — lowercased URNs break lineage links (#13792, #11322)
  • Power BI queries — -- comments silently drop JOINs from lineage (#11251)

The solution

This sidecar runs alongside your existing DataHub ingestion. It re-parses the SQL statements that sqlglot failed on using Gudu SQLFlow (which handles procedural SQL and dialect-specific syntax natively), then emits the recovered lineage to DataHub via the REST API.

DataHub ingestion (unchanged)          gsp-datahub-sidecar (this tool)
  audit log -> sqlglot -> lineage         |
                    |                     |  re-parse failed SQL with SQLFlow
                    v                     |  emit recovered lineage via REST
              "Command fallback"          v
              (lineage lost)        DataHub GMS (lineage restored)
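The recovery loop above can be sketched in a few lines of Python. This is an illustrative model, not the sidecar's actual internals — the log-marker regex and the shape of DataHub's fallback log lines are assumptions for the sketch:

```python
import re

# Hypothetical sketch of the recovery loop: scan a DataHub ingestion log for
# statements that fell back to "Command fallback" parsing, collecting the SQL
# so it can be re-parsed with SQLFlow. The marker regex is a placeholder —
# real DataHub log formats vary by version.
FALLBACK_MARKER = re.compile(r"Failed to parse.*?sql=(?P<sql>.+)$")

def extract_failed_sql(log_lines):
    """Yield the SQL text of every statement sqlglot gave up on."""
    for line in log_lines:
        m = FALLBACK_MARKER.search(line)
        if m:
            yield m.group("sql").strip()

log = [
    "INFO  parsed OK sql=SELECT 1",
    "WARN  Failed to parse statement, falling back "
    "sql=DECLARE x INT64; CREATE TEMP TABLE t AS SELECT a FROM s",
]
failed = list(extract_failed_sql(log))
```

Each recovered statement is then sent to SQLFlow and the resulting lineage emitted to DataHub via the REST API, as described above.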

Install

pip install gsp-datahub-sidecar

If you hit error: externally-managed-environment on Ubuntu 23.04+ / Debian 12+, see Troubleshooting.

Quick start

Sample SQL files are included in the examples/ directory — try them immediately after install:

# BigQuery procedural SQL (DataHub Issue #11654):
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql --dry-run

# BigQuery dbt deduplication macro (DataHub Issue #11670):
gsp-datahub-sidecar --sql-file examples/bigquery_dbt_dedup.sql --dry-run

# MSSQL stored procedure with CASE...END (DataHub Issue #12606):
gsp-datahub-sidecar --sql-file examples/mssql_stored_procedure.sql --db-vendor dbvmssql --dry-run

# MSSQL view with mixed-case identifiers (DataHub Issues #13792, #11322):
gsp-datahub-sidecar --sql-file examples/mssql_case_sensitivity.sql --db-vendor dbvmssql --dry-run

# Power BI query with comments (DataHub Issue #11251):
gsp-datahub-sidecar --sql-file examples/powerbi_comments.sql --db-vendor dbvmssql --dry-run

# Oracle CREATE VIEW with subqueries:
gsp-datahub-sidecar --sql-file examples/oracle_create_view.sql --db-vendor dbvoracle --dry-run

# Analyze inline SQL:
gsp-datahub-sidecar --sql "DECLARE x INT; CREATE VIEW v AS SELECT a FROM t" --dry-run

# Parse DataHub ingestion logs and emit lineage:
gsp-datahub-sidecar --config sidecar.yaml --log-file /var/log/datahub/ingest.log

Four backend modes

| Mode | Auth | Limit | Data location | Use case |
|---|---|---|---|---|
| anonymous (default) | None | 50/day per IP | SQL sent to api.gudusoft.com | Quick evaluation |
| authenticated | userId + secretKey (token exchange) | 10k/month | SQL sent to api.gudusoft.com | Extended evaluation |
| self_hosted | userId + secretKey (token exchange) | Unlimited | SQL stays in your VPC | Production |
| local_jar | None (local subprocess) | Governed by your JAR license | SQL never leaves the process | Air-gapped / CI (bring your own licensed JAR) |

The authenticated and self_hosted tiers use the same SQLFlow token-exchange protocol and hit the same exportFullLineageAsJson endpoint — only the host differs (api.gudusoft.com vs. your Docker). The local_jar tier skips HTTP entirely and invokes SQLFlow's bundled DataFlowAnalyzer CLI via java; it produces the same Dataflow JSON model as the cloud path, so lineage extraction is bit-for-bit equivalent.

Anonymous (default, zero setup)

gsp-datahub-sidecar --sql-file queries.sql --dry-run

Authenticated (sign up for a free key)

Sign up at docs.gudusoft.com/sign-up to get a userId and secretKey, then:

gsp-datahub-sidecar --mode authenticated \
  --user-id YOUR_USER_ID \
  --secret-key YOUR_SECRET_KEY \
  --sql-file queries.sql

The sidecar handles SQLFlow's token-exchange flow automatically.
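The two steps of that flow, with the endpoint paths and form fields documented in this README, look roughly like this (request construction only — the HTTP calls are stubbed out, and the credentials are placeholders):

```python
# Sketch of SQLFlow's two-step token exchange as the sidecar performs it.
# Endpoint paths and form-field names are taken from this README's curl
# examples; everything else is illustrative.
BASE = "https://api.gudusoft.com"

def token_request(user_id, secret_key):
    # Step 1: trade userId + secretKey for a short-lived token (form-encoded).
    return (f"{BASE}/api/gspLive_backend/user/generateToken",
            {"userId": user_id, "secretKey": secret_key})

def lineage_request(user_id, token, sql, dbvendor="dbvbigquery"):
    # Step 2: call the lineage endpoint with userId + token — NOT secretKey,
    # and NOT a Bearer header.
    return (f"{BASE}/api/gspLive_backend/sqlflow/generation/sqlflow"
            "/exportFullLineageAsJson",
            {"userId": user_id, "token": token,
             "dbvendor": dbvendor, "sqltext": sql,
             "showRelationType": "fdd"})

url, form = lineage_request("YOUR_USER_ID", "tok123", "SELECT a FROM t;")
```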

Shell escaping. Cloud userIds typically contain | (e.g. auth0|605f3cfe...), which bash treats as a pipe. Always single-quote the value — or escape the | with a backslash — so the shell doesn't split the command:

# Good: single quotes prevent shell expansion
--user-id 'auth0|605f3cfe4e52d60069bcb258'

# Also fine: backslash-escaped pipe
--user-id auth0\|605f3cfe4e52d60069bcb258

# BAD: bash reads this as a pipe and breaks the command
--user-id auth0|605f3cfe4e52d60069bcb258

The same applies to any value containing |, $, `, &, ;, (, ), <, >, *, ?, #, !, or whitespace — prefer single quotes by default for --user-id, --secret-key, and --sql.

Self-hosted (production)

Deploy SQLFlow Docker in your VPC, grab your userId and secretKey from the SQLFlow web UI (e.g. http://localhost:8165/), then:

gsp-datahub-sidecar --mode self_hosted \
  --user-id YOUR_USER_ID \
  --secret-key YOUR_SECRET_KEY \
  --sql-file queries.sql

Local JAR (no network, no Docker) — bring your own licensed JAR

The sidecar does not ship gsqlparser-*-shaded.jar. Gudu SQLFlow's core parser JAR is a commercial product licensed separately from this sidecar (Apache 2.0). Point --jar-path at a JAR you are licensed to run — for example the one built from your gsp_java_core checkout. See sqlflow.gudusoft.com for licensing.

With a licensed gsqlparser-*-shaded.jar on disk, the sidecar can invoke it directly as a subprocess — no HTTP round-trip, no Docker, SQL never leaves the process. Requires a JRE 8+ on PATH.

gsp-datahub-sidecar --mode local_jar \
  --jar-path /path/to/your/gsqlparser-4.1.0.13-shaded.jar \
  --sql-file queries.sql

Equivalent config / env:

  • sqlflow.mode: local_jar + sqlflow.jar_path: /path/to/...jar in sidecar.yaml
  • GSP_BACKEND_MODE=local_jar GSP_JAR_PATH=/path/to/...jar

Use a non-default JVM with --java-bin /opt/jdk21/bin/java (or GSP_JAVA_BIN).

This path uses the exact same Dataflow JSON model as the cloud exportFullLineageAsJson endpoint, so lineage output is identical. The one cost is JVM cold-start (~0.5–1 s per call), so for log-file ingestion of many statements, prefer self_hosted or batch upstream.
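Under the hood, the local_jar path amounts to building and running a java subprocess. A minimal sketch — note that the /f and /t flags shown are placeholders for the JAR's actual CLI arguments, which depend on your gsqlparser version; only the java -jar invocation shape is taken from this README:

```python
import subprocess

# Illustrative sketch of the local_jar path: invoke the licensed shaded JAR
# as a subprocess so SQL never leaves the machine. The "/f" (input file) and
# "/t" (dialect) flags below are assumed placeholders — check your JAR's own
# help output for the real argument names.
def build_command(java_bin, jar_path, sql_file, dbvendor):
    return [java_bin, "-jar", jar_path,
            "/f", sql_file,    # placeholder: input-file flag
            "/t", dbvendor]    # placeholder: dialect flag

cmd = build_command("java", "/opt/gsqlparser-shaded.jar",
                    "queries.sql", "dbvbigquery")
# subprocess.run(cmd, capture_output=True, text=True)  # pays the JVM cold-start
```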

--sqlflow-url is optional in self-hosted mode and defaults to:

http://localhost:8165/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson

Pass --sqlflow-url only when your SQLFlow Docker is not on localhost:8165 — e.g. a remote host, a different port, or behind an nginx reverse proxy. Point it at the full exportFullLineageAsJson path:

# Remote host / non-default port
gsp-datahub-sidecar --mode self_hosted \
  --sqlflow-url http://sqlflow.internal:9100/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson \
  --user-id YOUR_USER_ID \
  --secret-key YOUR_SECRET_KEY \
  --sql-file queries.sql

You can also set this in sidecar.yaml (sqlflow.url) or via the GSP_SQLFLOW_URL environment variable.

Configuration

Copy sidecar.yaml.example to sidecar.yaml and edit. All settings can also be overridden with environment variables:

| Env var | Config key | Description |
|---|---|---|
| GSP_BACKEND_MODE | sqlflow.mode | anonymous, authenticated, self_hosted, or local_jar |
| GSP_SQLFLOW_URL | sqlflow.url | Override the SQLFlow API URL |
| GSP_SQLFLOW_USER_ID | sqlflow.user_id | SQLFlow userId — required for authenticated and self_hosted modes |
| GSP_SQLFLOW_SECRET_KEY | sqlflow.secret_key | SQLFlow secretKey — required for authenticated and self_hosted modes |
| GSP_DB_VENDOR | sqlflow.db_vendor | SQL dialect (default: dbvbigquery) |
| GSP_DATAHUB_SERVER | datahub.server | DataHub GMS URL |
| GSP_DATAHUB_TOKEN | datahub.token | DataHub auth token |
| GSP_COLUMN_LINEAGE | datahub.column_lineage | Emit column-level lineage (default true) |
| GSP_JAR_PATH | sqlflow.jar_path | Path to gsqlparser-*-shaded.jar — required for local_jar mode |
| GSP_JAVA_BIN | sqlflow.java_bin | Java executable for local_jar mode (default: java on PATH) |
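A minimal sidecar.yaml covering the keys above might look like the following (all values are placeholders — copy sidecar.yaml.example for the authoritative template):

```yaml
# Illustrative sidecar.yaml — keys from the table above, placeholder values.
sqlflow:
  mode: self_hosted
  url: http://localhost:8165/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson
  user_id: YOUR_USER_ID
  secret_key: YOUR_SECRET_KEY
  db_vendor: dbvbigquery
datahub:
  server: http://datahub-gms:8080
  token: YOUR_DATAHUB_TOKEN
  column_lineage: true
```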

Dry run vs. live mode

--dry-run does everything except send lineage to DataHub — safe to run anytime, no DataHub server needed:

# Dry run: parse SQL, extract lineage, show what would be sent (no DataHub needed)
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql --dry-run

Without --dry-run, lineage is written to a running DataHub GMS. Point to your DataHub server with --datahub-server:

# Emit lineage to DataHub running on your cluster:
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql \
  --datahub-server http://datahub-gms:8080

# If DataHub has authentication enabled, add a token:
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql \
  --datahub-server http://datahub-gms:8080 \
  --datahub-token eyJhbGciOi...

Or set these in sidecar.yaml / environment variables so you don't repeat them:

export GSP_DATAHUB_SERVER=http://datahub-gms:8080
export GSP_DATAHUB_TOKEN=eyJhbGciOi...
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql

| Step | --dry-run | live (no flag) |
|---|---|---|
| Call SQLFlow API to parse SQL | Yes | Yes |
| Extract lineage from response | Yes | Yes |
| Build DataHub MCPs | Yes | Yes |
| Send MCPs to DataHub GMS | No (logs what it would send) | Yes |

Table-level vs. column-level lineage

By default, the sidecar emits both table-level upstream lineage and column-level (fine-grained) lineage — DataHub will show both table arrows and per-column arrows in the Lineage tab.

To emit table-level lineage only (skip fineGrainedLineages):

gsp-datahub-sidecar --sql-file queries.sql --no-column-lineage

Equivalent config / env:

  • datahub.column_lineage: false in sidecar.yaml
  • GSP_COLUMN_LINEAGE=false

--dry-run prints up to 5 column mappings per MCP so you can verify what would be sent.
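For orientation, a fine-grained lineage entry inside DataHub's upstreamLineage aspect has roughly this JSON shape — field names per DataHub's metadata model, URNs illustrative:

```python
# Sketch of one fineGrainedLineages entry as it appears in the
# upstreamLineage aspect. The dataset/column names are examples only.
def field_urn(dataset_urn, column):
    # schemaField URNs nest the dataset URN plus the column path.
    return f"urn:li:schemaField:({dataset_urn},{column})"

src = "urn:li:dataset:(urn:li:dataPlatform:bigquery,project.dataset.view_name,PROD)"
dst = "urn:li:dataset:(urn:li:dataPlatform:bigquery,temp_table,PROD)"

fine_grained = {
    "upstreamType": "FIELD_SET",
    "upstreams": [field_urn(src, "a")],
    "downstreamType": "FIELD",
    "downstreams": [field_urn(dst, "a")],
}
```

With --no-column-lineage, the sidecar simply omits these entries and emits the table-level upstreams only.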

Verify lineage in DataHub

After emitting lineage (live mode, no --dry-run), verify it in three ways:

1. DataHub Web UI (recommended)

Open DataHub in your browser (e.g. http://datahub-frontend:9002), search for the downstream table name, and click the Lineage tab. You should see arrows connecting upstream tables to downstream tables.

For the BigQuery procedural example, search for temp_table or final_output:

project.dataset.view_name  ──>  temp_table       (6 columns)
temp_table_delta           ──>  final_output     (5 columns)

2. DataHub CLI

# Check lineage for a specific dataset:
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:bigquery,temp_table,PROD)" --aspect upstreamLineage

3. DataHub GMS REST API

# Query the lineage aspect directly:
curl -s "http://datahub-gms:8080/aspects/urn%3Ali%3Adataset%3A(urn%3Ali%3AdataPlatform%3Abigquery%2Ctemp_table%2CPROD)?aspect=upstreamLineage" | python3 -m json.tool

If the lineage appears, the sidecar successfully recovered what sqlglot missed.

Fresh DataHub vs. existing DataHub

Fresh DataHub (no existing metadata)

The sidecar works on a completely empty DataHub — DataHub auto-creates dataset entities when it receives lineage MCPs. You'll see the lineage graph immediately. However, the tables will appear as minimal shells:

| | With platform ingestion | Sidecar only (no prior metadata) |
|---|---|---|
| Lineage arrows in graph | Yes | Yes |
| Column-level lineage | Yes | Yes |
| Table/column names | Full display names | URN-derived names |
| Column types & descriptions | Yes | No |
| Table schema / field list | Yes | No |
| Row counts / statistics | Yes | No |
| Platform icon | Yes | Yes |

This is fine for a demo or evaluation — the lineage visualization proves that GSP recovers what sqlglot misses. The missing metadata would normally come from DataHub's platform ingestion (BigQuery, MSSQL, etc.).

Existing DataHub (with metadata and sqlglot-generated lineage)

This is the real production scenario. The sidecar adds to the lineage that DataHub's platform ingestion already created — it does not replace or conflict with it.

How it works:

  • DataHub's platform ingestion runs first and creates lineage for all SQL that sqlglot can parse (standard SELECT, INSERT, CREATE VIEW, etc.)
  • The sidecar runs after and emits lineage only for the SQL that sqlglot failed on (procedural blocks, dialect-specific syntax, complex expressions, etc.)
  • DataHub merges both into a single lineage graph per dataset

The result is a more complete lineage graph — the existing lineage stays intact, and the sidecar fills in the gaps.

Important: URN matching

For the sidecar's lineage to connect with existing DataHub entities, the dataset URNs must match exactly. This means the table names in the sidecar output must match the names DataHub's BigQuery ingestion created.

Things to check:

  1. Case sensitivity: DataHub ingestion often lowercases URNs when convert_urns_to_lowercase: true is set (common for BigQuery). The sidecar also lowercases by default, so this should match. For MSSQL, where identifier casing matters (see #13792), verify that the sidecar's output casing aligns with your DataHub URNs.

  2. Schema-qualified names: Tables are typically ingested as database.schema.table (MSSQL) or project.dataset.table (BigQuery). The sidecar uses the table names as they appear in the SQL. If your SQL uses bracket-quoted names like [schema].[table] or backtick-quoted names like `project.dataset.table`, the sidecar strips the quotes and preserves the full path. Verify that the resulting URN matches what DataHub already has.

  3. Platform and environment: The sidecar defaults to platform: bigquery and env: PROD. Set --datahub-platform mssql (or snowflake, oracle, etc.) and --datahub-env on the CLI — or their equivalents in sidecar.yaml / environment variables — to match your DataHub setup.

  4. Temp tables: The sidecar emits lineage for temp tables (e.g. temp_table, final_output). If your DataHub's dataset_pattern.deny excludes temp tables (common in BigQuery ingestion configs), the temp table entities will be created by the sidecar but won't have schema metadata. This is expected — the lineage through them is still valuable.
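The normalization rules in the checklist above can be sketched as a pair of helpers — these names are illustrative, not the sidecar's API:

```python
# Sketch of the URN-matching steps described above: strip bracket/backtick
# quoting, optionally lowercase, then build the dataset URN. Helper names
# are hypothetical.
def normalize_table(name, lowercase=True):
    # [schema].[table] -> schema.table ; `proj.ds.t` -> proj.ds.t
    cleaned = name.replace("[", "").replace("]", "").replace("`", "")
    return cleaned.lower() if lowercase else cleaned

def dataset_urn(table, platform="bigquery", env="PROD"):
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{table},{env})"

# MSSQL: preserve casing so the URN matches what ingestion created (see #13792).
urn = dataset_urn(normalize_table("[Sales].[Orders]", lowercase=False),
                  platform="mssql")
```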

The sidecar will NOT:

  • Overwrite or delete existing lineage created by DataHub's ingestion
  • Modify any existing dataset metadata (schemas, descriptions, tags)
  • Interfere with DataHub's ingestion schedule or stateful ingestion

Troubleshooting

error: externally-managed-environment when running pip install

On Ubuntu 23.04+ / Debian 12+ (Python 3.11+), the system Python is PEP 668-protected to prevent pip from clobbering OS packages. You'll see:

error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try apt install
    python3-xyz, where xyz is the package you are trying to
    install.

Pick one of the following:

Option 1 — venv (recommended)

apt install -y python3-venv python3-full
python3 -m venv ~/sqlflow-venv
~/sqlflow-venv/bin/pip install gsp-datahub-sidecar
~/sqlflow-venv/bin/gsp-datahub-sidecar --help

Activate the venv (source ~/sqlflow-venv/bin/activate) or invoke the binary by its full path.

Option 2 — pipx (isolated, auto-added to PATH)

apt install -y pipx
pipx ensurepath
pipx install gsp-datahub-sidecar

Option 3 — override (quick, not recommended on a server)

pip install --break-system-packages gsp-datahub-sidecar

This can conflict with apt-managed Python packages. Use venv or pipx for anything long-lived.

Self-hosted SQLFlow returned HTTP 400 / 401 Invalid user or token, access deny

The self-hosted SQLFlow Docker uses a two-step auth flow:

  1. Exchange userId + secretKey for a short-lived token at /api/gspLive_backend/user/generateToken (form-encoded).
  2. Call the lineage endpoint with userId + token (form-encoded, not JSON, not secretKey, not a Bearer header).

See SQLFlow's reference: GenerateToken.py.

Newer sidecar versions do this automatically — upgrade if you still see HTTP 400 or 401 Invalid user or token:

pip install --upgrade gsp-datahub-sidecar

Then pass your SQLFlow Docker credentials (get them from the SQLFlow web UI at http://<host>:8165/). In self-hosted mode --sqlflow-url defaults to http://localhost:8165/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson, so you can omit it when SQLFlow is on localhost:

gsp-datahub-sidecar --mode self_hosted \
  --user-id YOUR_USER_ID \
  --secret-key YOUR_SECRET_KEY \
  --sql-file queries.sql

If SQLFlow is on a different host/port, add --sqlflow-url pointing at the full exportFullLineageAsJson path, e.g.:

--sqlflow-url http://sqlflow.internal:9100/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson

You can verify the flow directly with curl first:

# 1. Generate a token
TOKEN=$(curl -sS -X POST \
  -H "Content-Type: application/x-www-form-urlencoded" \
  --data-urlencode 'userId=YOUR_USER_ID' \
  --data-urlencode 'secretKey=YOUR_SECRET_KEY' \
  http://localhost:8165/api/gspLive_backend/user/generateToken \
  | python3 -c 'import json,sys;print(json.load(sys.stdin)["token"])')

# 2. Call the lineage endpoint with userId + token (not secretKey)
curl -sS -X POST \
  -H "Content-Type: application/x-www-form-urlencoded" \
  --data-urlencode 'userId=YOUR_USER_ID' \
  --data-urlencode "token=$TOKEN" \
  --data-urlencode 'dbvendor=dbvbigquery' \
  --data-urlencode 'sqltext=SELECT a FROM t;' \
  --data-urlencode 'showRelationType=fdd' \
  http://localhost:8165/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson

If curl returns {"code":200,...} but the sidecar still fails, double-check that you upgraded past the version where the token flow was added.

FileNotFoundError: SQL file not found: examples/bigquery_procedural.sql

pip install gsp-datahub-sidecar only installs the Python package — it does not copy the examples/ directory onto your filesystem. You'll see:

FileNotFoundError: SQL file not found: examples/bigquery_procedural.sql

Pick one:

Option A — download a single example

curl -L -O https://raw.githubusercontent.com/gudusoftware/gsp-datahub-sidecar/main/examples/bigquery_procedural.sql
curl -L -O https://raw.githubusercontent.com/gudusoftware/gsp-datahub-sidecar/main/examples/oracle_create_view.sql

gsp-datahub-sidecar --sql-file bigquery_procedural.sql --dry-run

Option B — clone the repo for all examples

git clone https://github.com/gudusoftware/gsp-datahub-sidecar.git
cd gsp-datahub-sidecar
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql --dry-run

Option C — pass SQL inline (no file needed)

gsp-datahub-sidecar --sql "DECLARE x INT64; CREATE VIEW v AS SELECT a FROM t" --dry-run

Option D — point at your own SQL file

gsp-datahub-sidecar --sql-file /path/to/your/query.sql --dry-run

Licensing

This sidecar (glue code) is Apache 2.0 licensed. Gudu SQLFlow is a commercial product by Gudu Software. The anonymous tier provides free evaluation access. For production use, deploy the self-hosted SQLFlow Docker with a license, or supply your own licensed gsqlparser-*-shaded.jar to --mode local_jar — the sidecar does not bundle or redistribute the SQLFlow JAR.
