Recovers BigQuery procedural-SQL lineage that DataHub's sqlglot parser misses, using Gudu SQLFlow.
Project description
gsp-datahub-sidecar
Recover SQL lineage that DataHub's sqlglot parser misses — across BigQuery, MSSQL, Snowflake, Oracle, and 20+ dialects — using Gudu SQLFlow.
The problem
DataHub uses sqlglot for SQL lineage extraction. sqlglot silently loses lineage on:
- BigQuery procedural SQL — `DECLARE`, `IF/THEN`, `CALL`, `CREATE TEMP TABLE` (#11654)
- BigQuery dbt deduplication macros — `array_agg()[offset(0)]` with struct unpacking (#11670)
- MSSQL stored procedures — `CASE...END` breaks `split_statements` (#12606)
- MSSQL identifier casing — lowercased URNs break lineage links (#13792, #11322)
- Power BI queries — `--` comments silently drop JOINs from lineage (#11251)
The solution
This sidecar runs alongside your existing DataHub ingestion. It re-parses the SQL statements that sqlglot failed on using Gudu SQLFlow (which handles procedural SQL and dialect-specific syntax natively), then emits the recovered lineage to DataHub via the REST API.
```
DataHub ingestion (unchanged)          gsp-datahub-sidecar (this tool)

audit log -> sqlglot -> lineage
                |
                | (parse failure)      re-parse failed SQL with SQLFlow
                v                      emit recovered lineage via REST
       "Command fallback"                          |
       (lineage lost)                              v
                                       DataHub GMS (lineage restored)
```
Install
pip install gsp-datahub-sidecar
If you hit error: externally-managed-environment on Ubuntu 23.04+ / Debian 12+, see Troubleshooting.
Quick start
Sample SQL files are included in the examples/ directory — try them immediately after install:
# BigQuery procedural SQL (DataHub Issue #11654):
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql --dry-run
# BigQuery dbt deduplication macro (DataHub Issue #11670):
gsp-datahub-sidecar --sql-file examples/bigquery_dbt_dedup.sql --dry-run
# MSSQL stored procedure with CASE...END (DataHub Issue #12606):
gsp-datahub-sidecar --sql-file examples/mssql_stored_procedure.sql --db-vendor dbvmssql --dry-run
# MSSQL view with mixed-case identifiers (DataHub Issues #13792, #11322):
gsp-datahub-sidecar --sql-file examples/mssql_case_sensitivity.sql --db-vendor dbvmssql --dry-run
# Power BI query with comments (DataHub Issue #11251):
gsp-datahub-sidecar --sql-file examples/powerbi_comments.sql --db-vendor dbvmssql --dry-run
# Oracle CREATE VIEW with subqueries:
gsp-datahub-sidecar --sql-file examples/oracle_create_view.sql --db-vendor dbvoracle --dry-run
# Analyze inline SQL:
gsp-datahub-sidecar --sql "DECLARE x INT; CREATE VIEW v AS SELECT a FROM t" --dry-run
# Parse DataHub ingestion logs and emit lineage:
gsp-datahub-sidecar --config sidecar.yaml --log-file /var/log/datahub/ingest.log
Four backend modes
| Mode | Auth | Limit | Data location | Use case |
|---|---|---|---|---|
| `anonymous` (default) | None | 50/day per IP | SQL sent to api.gudusoft.com | Quick evaluation |
| `authenticated` | userId + secretKey (token-exchange) | 10k/month | SQL sent to api.gudusoft.com | Extended evaluation |
| `self_hosted` | userId + secretKey (token-exchange) | Unlimited | SQL stays in your VPC | Production |
| `local_jar` | None (local subprocess) | Governed by your JAR license | SQL never leaves the process | Air-gapped / CI (bring your own licensed JAR) |
The `authenticated` and `self_hosted` tiers use the same SQLFlow token-exchange protocol and hit the same `exportFullLineageAsJson` endpoint — only the host differs (api.gudusoft.com vs. your Docker deployment). The `local_jar` tier skips HTTP entirely and invokes SQLFlow's bundled DataFlowAnalyzer CLI via `java`; it produces the same Dataflow JSON model as the cloud path, so lineage extraction is bit-for-bit equivalent.
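The token-exchange protocol itself is just two form-encoded POSTs. A minimal sketch of the request construction, using only the standard library (the endpoint paths and form fields mirror the curl walkthrough in Troubleshooting; the helper names here are illustrative, not the sidecar's actual API):

```python
import urllib.parse

def token_request(base: str, user_id: str, secret_key: str):
    """Step 1: build the generateToken POST (form-encoded, not JSON)."""
    url = base + "/api/gspLive_backend/user/generateToken"
    body = urllib.parse.urlencode({"userId": user_id, "secretKey": secret_key}).encode()
    return url, body

def lineage_request(base: str, user_id: str, token: str, sql: str,
                    dbvendor: str = "dbvbigquery"):
    """Step 2: build the lineage POST with userId + token (never secretKey)."""
    url = base + "/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson"
    body = urllib.parse.urlencode({
        "userId": user_id,
        "token": token,
        "dbvendor": dbvendor,
        "sqltext": sql,
        "showRelationType": "fdd",  # fdd = field-level (fine-grained) dataflow
    }).encode()
    return url, body
```

Either request can then be sent with `urllib.request.urlopen` or any HTTP client; only the `base` host changes between the cloud and self-hosted tiers.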
Anonymous (default, zero setup)
gsp-datahub-sidecar --sql-file queries.sql --dry-run
Authenticated (sign up for a free key)
Sign up at docs.gudusoft.com/sign-up to get a userId and secretKey, then:
gsp-datahub-sidecar --mode authenticated \
--user-id YOUR_USER_ID \
--secret-key YOUR_SECRET_KEY \
--sql-file queries.sql
The sidecar handles SQLFlow's token-exchange flow automatically.
Shell escaping. Cloud userIds typically contain `|` (e.g. `auth0|605f3cfe...`), which bash treats as a pipe. Always single-quote the value — or escape the `|` with a backslash — so the shell doesn't split the command:

```
# Good: single quotes prevent shell expansion
--user-id 'auth0|605f3cfe4e52d60069bcb258'
# Also fine: backslash-escaped pipe
--user-id auth0\|605f3cfe4e52d60069bcb258
# BAD: bash reads this as a pipe and breaks the command
--user-id auth0|605f3cfe4e52d60069bcb258
```

The same applies to any value containing `|`, `$`, `` ` ``, `&`, `;`, `(`, `)`, `<`, `>`, `*`, `?`, `#`, `!`, or whitespace — prefer single quotes by default for `--user-id`, `--secret-key`, and `--sql`.
Self-hosted (production)
Deploy SQLFlow Docker in your VPC, grab your userId and secretKey from the SQLFlow web UI (e.g. http://localhost:8165/), then:
gsp-datahub-sidecar --mode self_hosted \
--user-id YOUR_USER_ID \
--secret-key YOUR_SECRET_KEY \
--sql-file queries.sql
Local JAR (no network, no Docker) — bring your own licensed JAR
The sidecar does not ship `gsqlparser-*-shaded.jar`. Gudu SQLFlow's core parser JAR is a commercial product licensed separately from this sidecar (Apache 2.0). Point `--jar-path` at a JAR you are licensed to run — for example the one built from your `gsp_java_core` checkout. See sqlflow.gudusoft.com for licensing.
With a licensed gsqlparser-*-shaded.jar on disk, the sidecar can invoke it directly as a subprocess — no HTTP round-trip, no Docker, SQL never leaves the process. Requires a JRE 8+ on PATH.
gsp-datahub-sidecar --mode local_jar \
--jar-path /path/to/your/gsqlparser-4.1.0.13-shaded.jar \
--sql-file queries.sql
Equivalent config / env:
- `sqlflow.mode: local_jar` + `sqlflow.jar_path: /path/to/...jar` in `sidecar.yaml`
- `GSP_BACKEND_MODE=local_jar GSP_JAR_PATH=/path/to/...jar`
Use a non-default JVM with --java-bin /opt/jdk21/bin/java (or GSP_JAVA_BIN).
This path uses the exact same Dataflow JSON model as the cloud exportFullLineageAsJson endpoint, so lineage output is identical. The one cost is JVM cold-start (~0.5–1 s per call), so for log-file ingestion of many statements, prefer self_hosted or batch upstream.
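Under the hood, the local path is just a `java` subprocess per call. A sketch of what that invocation looks like — note that the entry-point class name and flags below are assumptions for illustration; check the CLI of your licensed JAR for the exact syntax:

```python
def local_jar_command(jar_path: str, sql_file: str,
                      dbvendor: str = "dbvbigquery",
                      java_bin: str = "java") -> list[str]:
    """Build the subprocess command for local_jar mode.

    NOTE: the DataFlowAnalyzer class path and the /f, /t, /json flags
    are illustrative guesses, not a documented contract of this sidecar.
    """
    return [
        java_bin, "-cp", jar_path,
        "gudu.sqlflow.lineage.DataFlowAnalyzer",  # assumed entry point
        "/f", sql_file,    # input SQL file
        "/t", dbvendor,    # dialect, e.g. dbvmssql
        "/json",           # emit the Dataflow JSON model
    ]

# Then run it, e.g.:
#   subprocess.run(local_jar_command(...), capture_output=True, text=True, check=True)
```

Each call pays the JVM cold-start mentioned above, which is why the command is built once per statement rather than reusing a warm process.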
--sqlflow-url is optional in self-hosted mode and defaults to:
http://localhost:8165/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson
Pass --sqlflow-url only when your SQLFlow Docker is not on localhost:8165 — e.g. a remote host, a different port, or behind an nginx reverse proxy. Point it at the full exportFullLineageAsJson path:
# Remote host / non-default port
gsp-datahub-sidecar --mode self_hosted \
--sqlflow-url http://sqlflow.internal:9100/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson \
--user-id YOUR_USER_ID \
--secret-key YOUR_SECRET_KEY \
--sql-file queries.sql
You can also set this in sidecar.yaml (sqlflow.url) or via the GSP_SQLFLOW_URL environment variable.
Configuration
Copy sidecar.yaml.example to sidecar.yaml and edit. All settings can also be overridden with environment variables:
| Env var | Config key | Description |
|---|---|---|
| `GSP_BACKEND_MODE` | `sqlflow.mode` | `anonymous`, `authenticated`, `self_hosted`, or `local_jar` |
| `GSP_SQLFLOW_URL` | `sqlflow.url` | Override the SQLFlow API URL |
| `GSP_SQLFLOW_USER_ID` | `sqlflow.user_id` | SQLFlow userId — required for `authenticated` and `self_hosted` modes |
| `GSP_SQLFLOW_SECRET_KEY` | `sqlflow.secret_key` | SQLFlow secretKey — required for `authenticated` and `self_hosted` modes |
| `GSP_DB_VENDOR` | `sqlflow.db_vendor` | SQL dialect (default: `dbvbigquery`) |
| `GSP_DATAHUB_SERVER` | `datahub.server` | DataHub GMS URL |
| `GSP_DATAHUB_TOKEN` | `datahub.token` | DataHub auth token |
| `GSP_COLUMN_LINEAGE` | `datahub.column_lineage` | Emit column-level lineage (default: `true`) |
| `GSP_JAR_PATH` | `sqlflow.jar_path` | Path to `gsqlparser-*-shaded.jar` — required for `local_jar` mode |
| `GSP_JAVA_BIN` | `sqlflow.java_bin` | Java executable for `local_jar` mode (default: `java` on PATH) |
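The precedence rule (environment variable beats config file) can be sketched as follows — the mapping comes from the table above, but the `resolve` helper itself is illustrative, not the sidecar's internal code:

```python
# Maps each GSP_* env var to its sidecar.yaml dotted key (per the table above).
ENV_TO_KEY = {
    "GSP_BACKEND_MODE": "sqlflow.mode",
    "GSP_SQLFLOW_URL": "sqlflow.url",
    "GSP_SQLFLOW_USER_ID": "sqlflow.user_id",
    "GSP_SQLFLOW_SECRET_KEY": "sqlflow.secret_key",
    "GSP_DB_VENDOR": "sqlflow.db_vendor",
    "GSP_DATAHUB_SERVER": "datahub.server",
    "GSP_DATAHUB_TOKEN": "datahub.token",
    "GSP_COLUMN_LINEAGE": "datahub.column_lineage",
    "GSP_JAR_PATH": "sqlflow.jar_path",
    "GSP_JAVA_BIN": "sqlflow.java_bin",
}

def resolve(config: dict, env: dict) -> dict:
    """Environment variables override sidecar.yaml values; config fills the rest."""
    merged = dict(config)
    for var, key in ENV_TO_KEY.items():
        if var in env:
            merged[key] = env[var]
    return merged
```

So exporting `GSP_BACKEND_MODE=self_hosted` wins over `sqlflow.mode: anonymous` in `sidecar.yaml`, while untouched keys keep their file values.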
Dry run vs. live mode
--dry-run does everything except send lineage to DataHub — safe to run anytime, no DataHub server needed:
# Dry run: parse SQL, extract lineage, show what would be sent (no DataHub needed)
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql --dry-run
Without --dry-run, lineage is written to a running DataHub GMS. Point to your DataHub server with --datahub-server:
# Emit lineage to DataHub running on your cluster:
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql \
--datahub-server http://datahub-gms:8080
# If DataHub has authentication enabled, add a token:
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql \
--datahub-server http://datahub-gms:8080 \
--datahub-token eyJhbGciOi...
Or set these in sidecar.yaml / environment variables so you don't repeat them:
export GSP_DATAHUB_SERVER=http://datahub-gms:8080
export GSP_DATAHUB_TOKEN=eyJhbGciOi...
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql
| Step | --dry-run | live (no flag) |
|---|---|---|
| Call SQLFlow API to parse SQL | Yes | Yes |
| Extract lineage from response | Yes | Yes |
| Build DataHub MCPs | Yes | Yes |
| Send MCPs to DataHub GMS | No (logs what it would send) | Yes |
Table-level vs. column-level lineage
By default, the sidecar emits both table-level upstream lineage and column-level (fine-grained) lineage — DataHub will show both table arrows and per-column arrows in the Lineage tab.
To emit table-level lineage only (skip fineGrainedLineages):
gsp-datahub-sidecar --sql-file queries.sql --no-column-lineage
Equivalent config / env:
- `datahub.column_lineage: false` in `sidecar.yaml`
- `GSP_COLUMN_LINEAGE=false`
--dry-run prints up to 5 column mappings per MCP so you can verify what would be sent.
Verify lineage in DataHub
After emitting lineage (live mode, no --dry-run), verify it in three ways:
1. DataHub Web UI (recommended)
Open DataHub in your browser (e.g. http://datahub-frontend:9002), search for the downstream table name, and click the Lineage tab. You should see arrows connecting upstream tables to downstream tables.
For the BigQuery procedural example, search for temp_table or final_output:
project.dataset.view_name ──> temp_table (6 columns)
temp_table_delta ──> final_output (5 columns)
2. DataHub CLI
# Check lineage for a specific dataset:
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:bigquery,temp_table,PROD)" --aspect upstreamLineage
3. DataHub GMS REST API
# Query the lineage aspect directly:
curl -s "http://datahub-gms:8080/aspects/urn%3Ali%3Adataset%3A(urn%3Ali%3AdataPlatform%3Abigquery%2Ctemp_table%2CPROD)?aspect=upstreamLineage" | python3 -m json.tool
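The fiddly part of the curl call above is percent-encoding the URN inside the URL path. A small helper for building that URL with the standard library (the function name is illustrative; the URN and encoding format match the curl example):

```python
import urllib.parse

def lineage_aspect_url(gms: str, platform: str, table: str, env: str = "PROD") -> str:
    """URL for fetching a dataset's upstreamLineage aspect from DataHub GMS.

    The URN is percent-encoded in the path (':' -> %3A, ',' -> %2C);
    parentheses stay literal.
    """
    urn = f"urn:li:dataset:(urn:li:dataPlatform:{platform},{table},{env})"
    return f"{gms}/aspects/{urllib.parse.quote(urn, safe='()')}?aspect=upstreamLineage"
```

Fetching that URL (e.g. with curl or `urllib.request`) returns the aspect JSON when lineage exists.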
If the lineage appears, the sidecar successfully recovered what sqlglot missed.
Fresh DataHub vs. existing DataHub
Fresh DataHub (no existing metadata)
The sidecar works on a completely empty DataHub — DataHub auto-creates dataset entities when it receives lineage MCPs. You'll see the lineage graph immediately. However, the tables will appear as minimal shells:
| | With platform ingestion | Sidecar only (no prior metadata) |
|---|---|---|
| Lineage arrows in graph | Yes | Yes |
| Column-level lineage | Yes | Yes |
| Table/column names | Full display names | URN-derived names |
| Column types & descriptions | Yes | No |
| Table schema / field list | Yes | No |
| Row counts / statistics | Yes | No |
| Platform icon | Yes | Yes |
This is fine for a demo or evaluation — the lineage visualization proves that GSP recovers what sqlglot misses. The missing metadata would normally come from DataHub's platform ingestion (BigQuery, MSSQL, etc.).
Existing DataHub (with metadata and sqlglot-generated lineage)
This is the real production scenario. The sidecar adds to the lineage that DataHub's platform ingestion already created — it does not replace or conflict with it.
How it works:
- DataHub's platform ingestion runs first and creates lineage for all SQL that sqlglot can parse (standard SELECT, INSERT, CREATE VIEW, etc.)
- The sidecar runs after and emits lineage only for the SQL that sqlglot failed on (procedural blocks, dialect-specific syntax, complex expressions, etc.)
- DataHub merges both into a single lineage graph per dataset
The result is a more complete lineage graph — the existing lineage stays intact, and the sidecar fills in the gaps.
Important: URN matching
For the sidecar's lineage to connect with existing DataHub entities, the dataset URNs must match exactly. This means the table names in the sidecar output must match the names DataHub's BigQuery ingestion created.
Things to check:
- Case sensitivity: DataHub ingestion often lowercases URNs when `convert_urns_to_lowercase: true` is set (common for BigQuery). The sidecar also lowercases by default, so this should match. For MSSQL, where identifier casing matters (see #13792), verify that the sidecar's output casing aligns with your DataHub URNs.
- Schema-qualified names: Tables are typically ingested as `database.schema.table` (MSSQL) or `project.dataset.table` (BigQuery). The sidecar uses the table names as they appear in the SQL. If your SQL uses bracket-quoted names like `[schema].[table]` or backtick-quoted names like `` `project.dataset.table` ``, the sidecar strips the quotes and preserves the full path. Verify that the resulting URN matches what DataHub already has.
- Platform and environment: The sidecar defaults to `platform: bigquery` and `env: PROD`. Set `--datahub-platform mssql` (or `snowflake`, `oracle`, etc.) and `--datahub-env` in `sidecar.yaml` or via environment variables to match your DataHub setup.
- Temp tables: The sidecar emits lineage for temp tables (e.g. `temp_table`, `final_output`). If your DataHub's `dataset_pattern.deny` excludes temp tables (common in BigQuery ingestion configs), the temp table entities will be created by the sidecar but won't have schema metadata. This is expected — the lineage through them is still valuable.
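Putting the casing and quote-stripping rules above together, URN construction can be sketched like this — a hypothetical helper showing the default behaviour described in the checklist, not the sidecar's actual implementation:

```python
def dataset_urn(table: str, platform: str = "bigquery", env: str = "PROD",
                lowercase: bool = True) -> str:
    """Build a DataHub dataset URN the way the checklist above describes.

    Backtick/bracket quoting is stripped and, matching the
    convert_urns_to_lowercase default, the name is lowercased unless you
    opt out (as you might for case-sensitive MSSQL identifiers).
    """
    name = table.replace("`", "").replace("[", "").replace("]", "")
    if lowercase:
        name = name.lower()
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"
```

If the URN this produces differs from what `datahub get` shows for the existing entity, the sidecar's lineage will attach to a new shell entity instead of the ingested one.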
The sidecar will NOT:
- Overwrite or delete existing lineage created by DataHub's ingestion
- Modify any existing dataset metadata (schemas, descriptions, tags)
- Interfere with DataHub's ingestion schedule or stateful ingestion
Troubleshooting
error: externally-managed-environment when running pip install
On Ubuntu 23.04+ / Debian 12+ (Python 3.11+), the system Python is PEP 668-protected to prevent pip from clobbering OS packages. You'll see:
error: externally-managed-environment
× This environment is externally managed
╰─> To install Python packages system-wide, try apt install
python3-xyz, where xyz is the package you are trying to
install.
Pick one of the following:
Option 1 — venv (recommended)
apt install -y python3-venv python3-full
python3 -m venv ~/sqlflow-venv
~/sqlflow-venv/bin/pip install gsp-datahub-sidecar
~/sqlflow-venv/bin/gsp-datahub-sidecar --help
Activate the venv (source ~/sqlflow-venv/bin/activate) or invoke the binary by its full path.
Option 2 — pipx (isolated, auto-added to PATH)
apt install -y pipx
pipx ensurepath
pipx install gsp-datahub-sidecar
Option 3 — override (quick, not recommended on a server)
pip install --break-system-packages gsp-datahub-sidecar
This can conflict with apt-managed Python packages. Use venv or pipx for anything long-lived.
Self-hosted SQLFlow returned HTTP 400 / 401 Invalid user or token, access deny
The self-hosted SQLFlow Docker uses a two-step auth flow:
1. Exchange `userId` + `secretKey` for a short-lived `token` at `/api/gspLive_backend/user/generateToken` (form-encoded).
2. Call the lineage endpoint with `userId` + `token` (form-encoded, not JSON, not `secretKey`, not a Bearer header).
See SQLFlow's reference: GenerateToken.py.
Newer sidecar versions do this automatically — upgrade if you still see HTTP 400 or 401 Invalid user or token:
pip install --upgrade gsp-datahub-sidecar
Then pass your SQLFlow Docker credentials (get them from the SQLFlow web UI at http://<host>:8165/). In self-hosted mode --sqlflow-url defaults to http://localhost:8165/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson, so you can omit it when SQLFlow is on localhost:
gsp-datahub-sidecar --mode self_hosted \
--user-id YOUR_USER_ID \
--secret-key YOUR_SECRET_KEY \
--sql-file queries.sql
If SQLFlow is on a different host/port, add --sqlflow-url pointing at the full exportFullLineageAsJson path, e.g.:
--sqlflow-url http://sqlflow.internal:9100/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson
You can verify the flow directly with curl first:
# 1. Generate a token
TOKEN=$(curl -sS -X POST \
-H "Content-Type: application/x-www-form-urlencoded" \
--data-urlencode 'userId=YOUR_USER_ID' \
--data-urlencode 'secretKey=YOUR_SECRET_KEY' \
http://localhost:8165/api/gspLive_backend/user/generateToken \
| python3 -c 'import json,sys;print(json.load(sys.stdin)["token"])')
# 2. Call the lineage endpoint with userId + token (not secretKey)
curl -sS -X POST \
-H "Content-Type: application/x-www-form-urlencoded" \
--data-urlencode 'userId=YOUR_USER_ID' \
--data-urlencode "token=$TOKEN" \
--data-urlencode 'dbvendor=dbvbigquery' \
--data-urlencode 'sqltext=SELECT a FROM t;' \
--data-urlencode 'showRelationType=fdd' \
http://localhost:8165/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson
If curl returns {"code":200,...} but the sidecar still fails, double-check that you upgraded past the version where the token flow was added.
FileNotFoundError: SQL file not found: examples/bigquery_procedural.sql
pip install gsp-datahub-sidecar only installs the Python package — it does not copy the examples/ directory onto your filesystem. You'll see:
FileNotFoundError: SQL file not found: examples/bigquery_procedural.sql
Pick one:
Option A — download a single example
curl -L -O https://raw.githubusercontent.com/gudusoftware/gsp-datahub-sidecar/main/examples/bigquery_procedural.sql
curl -L -O https://raw.githubusercontent.com/gudusoftware/gsp-datahub-sidecar/main/examples/oracle_create_view.sql
gsp-datahub-sidecar --sql-file bigquery_procedural.sql --dry-run
Option B — clone the repo for all examples
git clone https://github.com/gudusoftware/gsp-datahub-sidecar.git
cd gsp-datahub-sidecar
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql --dry-run
Option C — pass SQL inline (no file needed)
gsp-datahub-sidecar --sql "DECLARE x INT64; CREATE VIEW v AS SELECT a FROM t" --dry-run
Option D — point at your own SQL file
gsp-datahub-sidecar --sql-file /path/to/your/query.sql --dry-run
Licensing
This sidecar (glue code) is Apache 2.0 licensed. Gudu SQLFlow is a commercial product by Gudu Software. The anonymous tier provides free evaluation access. For production use, deploy the self-hosted SQLFlow Docker with a license, or supply your own licensed gsqlparser-*-shaded.jar to --mode local_jar — the sidecar does not bundle or redistribute the SQLFlow JAR.
Project details
Release history
Download files
Source Distribution
Built Distribution
File details
Details for the file gsp_datahub_sidecar-0.1.0.tar.gz.
File metadata
- Download URL: gsp_datahub_sidecar-0.1.0.tar.gz
- Upload date:
- Size: 52.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | d21771a001ae320eff5dd97190a725aed1c5b7653ad6e86c2f02298f4649ce5b |
| MD5 | 0f759c25bc2c47c4f728ff77951bd7e9 |
| BLAKE2b-256 | b3ecc54a9d668385b0ac14721c4e79905ef3cbdb9c5b8c246a71f674ed73db5b |
File details
Details for the file gsp_datahub_sidecar-0.1.0-py3-none-any.whl.
File metadata
- Download URL: gsp_datahub_sidecar-0.1.0-py3-none-any.whl
- Upload date:
- Size: 33.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3b00841255a2be850a36a5e4f21f1d763c1ac1f413e8c9c67c038b301050e1a7 |
| MD5 | b4f2eaf87477bb997a6cc3befe825497 |
| BLAKE2b-256 | f640038e909ed037f1f03a163190cafe89cce83cfc4eebc64e2f725bcf73969e |