gsp-datahub-sidecar

Recover SQL lineage that DataHub's sqlglot parser misses — across BigQuery, MSSQL, Snowflake, Oracle, and 20+ dialects — using Gudu SQLFlow.

The problem

DataHub uses sqlglot for SQL lineage extraction. sqlglot silently loses lineage on:

  • BigQuery procedural SQL — DECLARE, IF/THEN, CALL, CREATE TEMP TABLE (#11654)
  • BigQuery dbt deduplication macros — array_agg()[offset(0)] with struct unpacking (#11670)
  • MSSQL stored procedures — CASE...END breaks split_statements (#12606)
  • MSSQL identifier casing — lowercased URNs break lineage links (#13792, #11322)
  • Power BI queries — -- comments silently drop JOINs from lineage (#11251)

The solution

This sidecar runs alongside your existing DataHub ingestion. It re-parses the SQL statements that sqlglot failed on using Gudu SQLFlow (which handles procedural SQL and dialect-specific syntax natively), then emits the recovered lineage to DataHub via the REST API.

DataHub ingestion (unchanged)          gsp-datahub-sidecar (this tool)
  audit log -> sqlglot -> lineage         |
                    |                     |  re-parse failed SQL with SQLFlow
                    v                     |  emit recovered lineage via REST
              "Command fallback"          v
              (lineage lost)        DataHub GMS (lineage restored)
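The recovery loop above can be sketched in a few lines of Python. This is an illustrative model, not the sidecar's actual internals — the log-marker regex and the shape of DataHub's fallback log lines are assumptions for the sketch:

```python
import re

# Hypothetical sketch of the recovery loop: scan a DataHub ingestion log for
# statements that fell back to "Command fallback" parsing, collecting the SQL
# so it can be re-parsed with SQLFlow. The marker regex is a placeholder —
# real DataHub log formats vary by version.
FALLBACK_MARKER = re.compile(r"Failed to parse.*?sql=(?P<sql>.+)$")

def extract_failed_sql(log_lines):
    """Yield the SQL text of every statement sqlglot gave up on."""
    for line in log_lines:
        m = FALLBACK_MARKER.search(line)
        if m:
            yield m.group("sql").strip()

log = [
    "INFO  parsed OK sql=SELECT 1",
    "WARN  Failed to parse statement, falling back "
    "sql=DECLARE x INT64; CREATE TEMP TABLE t AS SELECT a FROM s",
]
failed = list(extract_failed_sql(log))
```

Each recovered statement is then sent to SQLFlow and the resulting lineage emitted to DataHub via the REST API, as described above.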

Install

pip install gsp-datahub-sidecar

If you hit error: externally-managed-environment on Ubuntu 23.04+ / Debian 12+, see Troubleshooting.

Quick start

Sample SQL files are included in the examples/ directory — try them immediately after install:

# BigQuery procedural SQL (DataHub Issue #11654):
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql --dry-run

# BigQuery dbt deduplication macro (DataHub Issue #11670):
gsp-datahub-sidecar --sql-file examples/bigquery_dbt_dedup.sql --dry-run

# MSSQL stored procedure with CASE...END (DataHub Issue #12606):
gsp-datahub-sidecar --sql-file examples/mssql_stored_procedure.sql --db-vendor dbvmssql --dry-run

# MSSQL view with mixed-case identifiers (DataHub Issues #13792, #11322):
gsp-datahub-sidecar --sql-file examples/mssql_case_sensitivity.sql --db-vendor dbvmssql --dry-run

# Power BI query with comments (DataHub Issue #11251):
gsp-datahub-sidecar --sql-file examples/powerbi_comments.sql --db-vendor dbvmssql --dry-run

# Oracle CREATE VIEW with subqueries:
gsp-datahub-sidecar --sql-file examples/oracle_create_view.sql --db-vendor dbvoracle --dry-run

# Analyze inline SQL:
gsp-datahub-sidecar --sql "DECLARE x INT; CREATE VIEW v AS SELECT a FROM t" --dry-run

# Parse DataHub ingestion logs and emit lineage:
gsp-datahub-sidecar --config sidecar.yaml --log-file /var/log/datahub/ingest.log

Four backend modes

| Mode | Auth | Limit | Data location | Use case |
|---|---|---|---|---|
| anonymous (default) | None | 50/day per IP | SQL sent to api.gudusoft.com | Quick evaluation |
| authenticated | userId + secretKey (token exchange) | 10k/month | SQL sent to api.gudusoft.com | Extended evaluation |
| self_hosted | userId + secretKey (token exchange) | Unlimited | SQL stays in your VPC | Production |
| local_jar | None (local subprocess) | Governed by your JAR license | SQL never leaves the process | Air-gapped / CI (bring your own licensed JAR) |

The authenticated and self_hosted tiers use the same SQLFlow token-exchange protocol and hit the same exportFullLineageAsJson endpoint — only the host differs (api.gudusoft.com vs. your Docker). The local_jar tier skips HTTP entirely and invokes SQLFlow's bundled DataFlowAnalyzer CLI via java; it produces the same Dataflow JSON model as the cloud path, so lineage extraction is bit-for-bit equivalent.

Anonymous (default, zero setup)

gsp-datahub-sidecar --sql-file queries.sql --dry-run

Authenticated (sign up for a free key)

Sign up at docs.gudusoft.com/sign-up to get a userId and secretKey, then:

gsp-datahub-sidecar --mode authenticated \
  --user-id YOUR_USER_ID \
  --secret-key YOUR_SECRET_KEY \
  --sql-file queries.sql

The sidecar handles SQLFlow's token-exchange flow automatically.
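The two steps of that flow, with the endpoint paths and form fields documented in this README, look roughly like this (request construction only — the HTTP calls are stubbed out, and the credentials are placeholders):

```python
# Sketch of SQLFlow's two-step token exchange as the sidecar performs it.
# Endpoint paths and form-field names are taken from this README's curl
# examples; everything else is illustrative.
BASE = "https://api.gudusoft.com"

def token_request(user_id, secret_key):
    # Step 1: trade userId + secretKey for a short-lived token (form-encoded).
    return (f"{BASE}/api/gspLive_backend/user/generateToken",
            {"userId": user_id, "secretKey": secret_key})

def lineage_request(user_id, token, sql, dbvendor="dbvbigquery"):
    # Step 2: call the lineage endpoint with userId + token — NOT secretKey,
    # and NOT a Bearer header.
    return (f"{BASE}/api/gspLive_backend/sqlflow/generation/sqlflow"
            "/exportFullLineageAsJson",
            {"userId": user_id, "token": token,
             "dbvendor": dbvendor, "sqltext": sql,
             "showRelationType": "fdd"})

url, form = lineage_request("YOUR_USER_ID", "tok123", "SELECT a FROM t;")
```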

Shell escaping. Cloud userIds typically contain | (e.g. auth0|605f3cfe...), which bash treats as a pipe. Always single-quote the value — or escape the | with a backslash — so the shell doesn't split the command:

# Good: single quotes prevent shell expansion
--user-id 'auth0|605f3cfe4e52d60069bcb258'

# Also fine: backslash-escaped pipe
--user-id auth0\|605f3cfe4e52d60069bcb258

# BAD: bash reads this as a pipe and breaks the command
--user-id auth0|605f3cfe4e52d60069bcb258

The same applies to any value containing |, $, `, &, ;, (, ), <, >, *, ?, #, !, or whitespace — prefer single quotes by default for --user-id, --secret-key, and --sql.

Self-hosted (production)

Deploy SQLFlow Docker in your VPC, grab your userId and secretKey from the SQLFlow web UI (e.g. http://localhost:8165/), then:

gsp-datahub-sidecar --mode self_hosted \
  --user-id YOUR_USER_ID \
  --secret-key YOUR_SECRET_KEY \
  --sql-file queries.sql

Local JAR (no network, no Docker) — bring your own licensed JAR

The sidecar does not ship gsqlparser-*-shaded.jar. Gudu SQLFlow's core parser JAR is a commercial product licensed separately from this sidecar (Apache 2.0). Point --jar-path at a JAR you are licensed to run — for example the one built from your gsp_java_core checkout. See sqlflow.gudusoft.com for licensing.

With a licensed gsqlparser-*-shaded.jar on disk, the sidecar can invoke it directly as a subprocess — no HTTP round-trip, no Docker, SQL never leaves the process. Requires a JRE 8+ on PATH.

gsp-datahub-sidecar --mode local_jar \
  --jar-path /path/to/your/gsqlparser-4.1.0.13-shaded.jar \
  --sql-file queries.sql

Equivalent config / env:

  • sqlflow.mode: local_jar + sqlflow.jar_path: /path/to/...jar in sidecar.yaml
  • GSP_BACKEND_MODE=local_jar GSP_JAR_PATH=/path/to/...jar

Use a non-default JVM with --java-bin /opt/jdk21/bin/java (or GSP_JAVA_BIN).

This path uses the exact same Dataflow JSON model as the cloud exportFullLineageAsJson endpoint, so lineage output is identical. The one cost is JVM cold-start (~0.5–1 s per call), so for log-file ingestion of many statements, prefer self_hosted or batch upstream.
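Under the hood, the local_jar path amounts to building and running a java subprocess. A minimal sketch — note that the /f and /t flags shown are placeholders for the JAR's actual CLI arguments, which depend on your gsqlparser version; only the java -jar invocation shape is taken from this README:

```python
import subprocess

# Illustrative sketch of the local_jar path: invoke the licensed shaded JAR
# as a subprocess so SQL never leaves the machine. The "/f" (input file) and
# "/t" (dialect) flags below are assumed placeholders — check your JAR's own
# help output for the real argument names.
def build_command(java_bin, jar_path, sql_file, dbvendor):
    return [java_bin, "-jar", jar_path,
            "/f", sql_file,    # placeholder: input-file flag
            "/t", dbvendor]    # placeholder: dialect flag

cmd = build_command("java", "/opt/gsqlparser-shaded.jar",
                    "queries.sql", "dbvbigquery")
# subprocess.run(cmd, capture_output=True, text=True)  # pays the JVM cold-start
```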

--sqlflow-url is optional in self-hosted mode and defaults to:

http://localhost:8165/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson

Pass --sqlflow-url only when your SQLFlow Docker is not on localhost:8165 — e.g. a remote host, a different port, or behind an nginx reverse proxy. Point it at the full exportFullLineageAsJson path:

# Remote host / non-default port
gsp-datahub-sidecar --mode self_hosted \
  --sqlflow-url http://sqlflow.internal:9100/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson \
  --user-id YOUR_USER_ID \
  --secret-key YOUR_SECRET_KEY \
  --sql-file queries.sql

You can also set this in sidecar.yaml (sqlflow.url) or via the GSP_SQLFLOW_URL environment variable.

Configuration

Copy sidecar.yaml.example to sidecar.yaml and edit. All settings can also be overridden with environment variables:

| Env var | Config key | Description |
|---|---|---|
| GSP_BACKEND_MODE | sqlflow.mode | anonymous, authenticated, self_hosted, or local_jar |
| GSP_SQLFLOW_URL | sqlflow.url | Override the SQLFlow API URL |
| GSP_SQLFLOW_USER_ID | sqlflow.user_id | SQLFlow userId — required for authenticated and self_hosted modes |
| GSP_SQLFLOW_SECRET_KEY | sqlflow.secret_key | SQLFlow secretKey — required for authenticated and self_hosted modes |
| GSP_DB_VENDOR | sqlflow.db_vendor | SQL dialect (default: dbvbigquery) |
| GSP_DATAHUB_SERVER | datahub.server | DataHub GMS URL |
| GSP_DATAHUB_TOKEN | datahub.token | DataHub auth token |
| GSP_COLUMN_LINEAGE | datahub.column_lineage | Emit column-level lineage (default true) |
| GSP_JAR_PATH | sqlflow.jar_path | Path to gsqlparser-*-shaded.jar — required for local_jar mode |
| GSP_JAVA_BIN | sqlflow.java_bin | Java executable for local_jar mode (default: java on PATH) |
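A minimal sidecar.yaml covering the keys above might look like the following (all values are placeholders — copy sidecar.yaml.example for the authoritative template):

```yaml
# Illustrative sidecar.yaml — keys from the table above, placeholder values.
sqlflow:
  mode: self_hosted
  url: http://localhost:8165/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson
  user_id: YOUR_USER_ID
  secret_key: YOUR_SECRET_KEY
  db_vendor: dbvbigquery
datahub:
  server: http://datahub-gms:8080
  token: YOUR_DATAHUB_TOKEN
  column_lineage: true
```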

Dry run vs. live mode

--dry-run does everything except send lineage to DataHub — safe to run anytime, no DataHub server needed:

# Dry run: parse SQL, extract lineage, show what would be sent (no DataHub needed)
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql --dry-run

Without --dry-run, lineage is written to a running DataHub GMS. Point to your DataHub server with --datahub-server:

# Emit lineage to DataHub running on your cluster:
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql \
  --datahub-server http://datahub-gms:8080

# If DataHub has authentication enabled, add a token:
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql \
  --datahub-server http://datahub-gms:8080 \
  --datahub-token eyJhbGciOi...

Or set these in sidecar.yaml / environment variables so you don't repeat them:

export GSP_DATAHUB_SERVER=http://datahub-gms:8080
export GSP_DATAHUB_TOKEN=eyJhbGciOi...
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql

| Step | --dry-run | live (no flag) |
|---|---|---|
| Call SQLFlow API to parse SQL | Yes | Yes |
| Extract lineage from response | Yes | Yes |
| Build DataHub MCPs | Yes | Yes |
| Send MCPs to DataHub GMS | No (logs what it would send) | Yes |

Table-level vs. column-level lineage

By default, the sidecar emits both table-level upstream lineage and column-level (fine-grained) lineage — DataHub will show both table arrows and per-column arrows in the Lineage tab.

To emit table-level lineage only (skip fineGrainedLineages):

gsp-datahub-sidecar --sql-file queries.sql --no-column-lineage

Equivalent config / env:

  • datahub.column_lineage: false in sidecar.yaml
  • GSP_COLUMN_LINEAGE=false

--dry-run prints up to 5 column mappings per MCP so you can verify what would be sent.
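For orientation, a fine-grained lineage entry inside DataHub's upstreamLineage aspect has roughly this JSON shape — field names per DataHub's metadata model, URNs illustrative:

```python
# Sketch of one fineGrainedLineages entry as it appears in the
# upstreamLineage aspect. The dataset/column names are examples only.
def field_urn(dataset_urn, column):
    # schemaField URNs nest the dataset URN plus the column path.
    return f"urn:li:schemaField:({dataset_urn},{column})"

src = "urn:li:dataset:(urn:li:dataPlatform:bigquery,project.dataset.view_name,PROD)"
dst = "urn:li:dataset:(urn:li:dataPlatform:bigquery,temp_table,PROD)"

fine_grained = {
    "upstreamType": "FIELD_SET",
    "upstreams": [field_urn(src, "a")],
    "downstreamType": "FIELD",
    "downstreams": [field_urn(dst, "a")],
}
```

With --no-column-lineage, the sidecar simply omits these entries and emits the table-level upstreams only.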

Verify lineage in DataHub

After emitting lineage (live mode, no --dry-run), verify it in three ways:

1. DataHub Web UI (recommended)

Open DataHub in your browser (e.g. http://datahub-frontend:9002), search for the downstream table name, and click the Lineage tab. You should see arrows connecting upstream tables to downstream tables.

For the BigQuery procedural example, search for temp_table or final_output:

project.dataset.view_name  ──>  temp_table       (6 columns)
temp_table_delta           ──>  final_output     (5 columns)

2. DataHub CLI

# Check lineage for a specific dataset:
datahub get --urn "urn:li:dataset:(urn:li:dataPlatform:bigquery,temp_table,PROD)" --aspect upstreamLineage

3. DataHub GMS REST API

# Query the lineage aspect directly:
curl -s "http://datahub-gms:8080/aspects/urn%3Ali%3Adataset%3A(urn%3Ali%3AdataPlatform%3Abigquery%2Ctemp_table%2CPROD)?aspect=upstreamLineage" | python3 -m json.tool

If the lineage appears, the sidecar successfully recovered what sqlglot missed.

Fresh DataHub vs. existing DataHub

Fresh DataHub (no existing metadata)

The sidecar works on a completely empty DataHub — DataHub auto-creates dataset entities when it receives lineage MCPs. You'll see the lineage graph immediately. However, the tables will appear as minimal shells:

| | With platform ingestion | Sidecar only (no prior metadata) |
|---|---|---|
| Lineage arrows in graph | Yes | Yes |
| Column-level lineage | Yes | Yes |
| Table/column names | Full display names | URN-derived names |
| Column types & descriptions | Yes | No |
| Table schema / field list | Yes | No |
| Row counts / statistics | Yes | No |
| Platform icon | Yes | Yes |

This is fine for a demo or evaluation — the lineage visualization proves that GSP recovers what sqlglot misses. The missing metadata would normally come from DataHub's platform ingestion (BigQuery, MSSQL, etc.).

Existing DataHub (with metadata and sqlglot-generated lineage)

This is the real production scenario. The sidecar adds to the lineage that DataHub's platform ingestion already created — it does not replace or conflict with it.

How it works:

  • DataHub's platform ingestion runs first and creates lineage for all SQL that sqlglot can parse (standard SELECT, INSERT, CREATE VIEW, etc.)
  • The sidecar runs after and emits lineage only for the SQL that sqlglot failed on (procedural blocks, dialect-specific syntax, complex expressions, etc.)
  • DataHub merges both into a single lineage graph per dataset

The result is a more complete lineage graph — the existing lineage stays intact, and the sidecar fills in the gaps.

Important: URN matching

For the sidecar's lineage to connect with existing DataHub entities, the dataset URNs must match exactly. This means the table names in the sidecar output must match the names DataHub's BigQuery ingestion created.

Things to check:

  1. Case sensitivity: DataHub ingestion often lowercases URNs when convert_urns_to_lowercase: true is set (common for BigQuery). The sidecar also lowercases by default, so this should match. For MSSQL, where identifier casing matters (see #13792), verify that the sidecar's output casing aligns with your DataHub URNs.

  2. Schema-qualified names: Tables are typically ingested as database.schema.table (MSSQL) or project.dataset.table (BigQuery). The sidecar uses the table names as they appear in the SQL. If your SQL uses bracket-quoted names like [schema].[table] or backtick-quoted names like `project.dataset.table`, the sidecar strips the quotes and preserves the full path. Verify that the resulting URN matches what DataHub already has.

  3. Platform and environment: The sidecar defaults to platform: bigquery and env: PROD. Set --datahub-platform mssql (or snowflake, oracle, etc.) and --datahub-env on the CLI — or their equivalents in sidecar.yaml / environment variables — to match your DataHub setup.

  4. Temp tables: The sidecar emits lineage for temp tables (e.g. temp_table, final_output). If your DataHub's dataset_pattern.deny excludes temp tables (common in BigQuery ingestion configs), the temp table entities will be created by the sidecar but won't have schema metadata. This is expected — the lineage through them is still valuable.
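The normalization rules in the checklist above can be sketched as a pair of helpers — these names are illustrative, not the sidecar's API:

```python
# Sketch of the URN-matching steps described above: strip bracket/backtick
# quoting, optionally lowercase, then build the dataset URN. Helper names
# are hypothetical.
def normalize_table(name, lowercase=True):
    # [schema].[table] -> schema.table ; `proj.ds.t` -> proj.ds.t
    cleaned = name.replace("[", "").replace("]", "").replace("`", "")
    return cleaned.lower() if lowercase else cleaned

def dataset_urn(table, platform="bigquery", env="PROD"):
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{table},{env})"

# MSSQL: preserve casing so the URN matches what ingestion created (see #13792).
urn = dataset_urn(normalize_table("[Sales].[Orders]", lowercase=False),
                  platform="mssql")
```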

The sidecar will NOT:

  • Overwrite or delete existing lineage created by DataHub's ingestion
  • Modify any existing dataset metadata (schemas, descriptions, tags)
  • Interfere with DataHub's ingestion schedule or stateful ingestion

Troubleshooting

error: externally-managed-environment when running pip install

On Ubuntu 23.04+ / Debian 12+ (Python 3.11+), the system Python is PEP 668-protected to prevent pip from clobbering OS packages. You'll see:

error: externally-managed-environment

× This environment is externally managed
╰─> To install Python packages system-wide, try apt install
    python3-xyz, where xyz is the package you are trying to
    install.

Pick one of the following:

Option 1 — venv (recommended)

apt install -y python3-venv python3-full
python3 -m venv ~/sqlflow-venv
~/sqlflow-venv/bin/pip install gsp-datahub-sidecar
~/sqlflow-venv/bin/gsp-datahub-sidecar --help

Activate the venv (source ~/sqlflow-venv/bin/activate) or invoke the binary by its full path.

Option 2 — pipx (isolated, auto-added to PATH)

apt install -y pipx
pipx ensurepath
pipx install gsp-datahub-sidecar

Option 3 — override (quick, not recommended on a server)

pip install --break-system-packages gsp-datahub-sidecar

This can conflict with apt-managed Python packages. Use venv or pipx for anything long-lived.

Self-hosted SQLFlow returned HTTP 400 / 401 Invalid user or token, access deny

The self-hosted SQLFlow Docker uses a two-step auth flow:

  1. Exchange userId + secretKey for a short-lived token at /api/gspLive_backend/user/generateToken (form-encoded).
  2. Call the lineage endpoint with userId + token (form-encoded, not JSON, not secretKey, not a Bearer header).

See SQLFlow's reference: GenerateToken.py.

Newer sidecar versions do this automatically — upgrade if you still see HTTP 400 or 401 Invalid user or token:

pip install --upgrade gsp-datahub-sidecar

Then pass your SQLFlow Docker credentials (get them from the SQLFlow web UI at http://<host>:8165/). In self-hosted mode --sqlflow-url defaults to http://localhost:8165/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson, so you can omit it when SQLFlow is on localhost:

gsp-datahub-sidecar --mode self_hosted \
  --user-id YOUR_USER_ID \
  --secret-key YOUR_SECRET_KEY \
  --sql-file queries.sql

If SQLFlow is on a different host/port, add --sqlflow-url pointing at the full exportFullLineageAsJson path, e.g.:

--sqlflow-url http://sqlflow.internal:9100/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson

You can verify the flow directly with curl first:

# 1. Generate a token
TOKEN=$(curl -sS -X POST \
  -H "Content-Type: application/x-www-form-urlencoded" \
  --data-urlencode 'userId=YOUR_USER_ID' \
  --data-urlencode 'secretKey=YOUR_SECRET_KEY' \
  http://localhost:8165/api/gspLive_backend/user/generateToken \
  | python3 -c 'import json,sys;print(json.load(sys.stdin)["token"])')

# 2. Call the lineage endpoint with userId + token (not secretKey)
curl -sS -X POST \
  -H "Content-Type: application/x-www-form-urlencoded" \
  --data-urlencode 'userId=YOUR_USER_ID' \
  --data-urlencode "token=$TOKEN" \
  --data-urlencode 'dbvendor=dbvbigquery' \
  --data-urlencode 'sqltext=SELECT a FROM t;' \
  --data-urlencode 'showRelationType=fdd' \
  http://localhost:8165/api/gspLive_backend/sqlflow/generation/sqlflow/exportFullLineageAsJson

If curl returns {"code":200,...} but the sidecar still fails, double-check that you upgraded past the version where the token flow was added.

FileNotFoundError: SQL file not found: examples/bigquery_procedural.sql

pip install gsp-datahub-sidecar only installs the Python package — it does not copy the examples/ directory onto your filesystem. You'll see:

FileNotFoundError: SQL file not found: examples/bigquery_procedural.sql

Pick one:

Option A — download a single example

curl -L -O https://raw.githubusercontent.com/gudusoftware/gsp-datahub-sidecar/main/examples/bigquery_procedural.sql
curl -L -O https://raw.githubusercontent.com/gudusoftware/gsp-datahub-sidecar/main/examples/oracle_create_view.sql

gsp-datahub-sidecar --sql-file bigquery_procedural.sql --dry-run

Option B — clone the repo for all examples

git clone https://github.com/gudusoftware/gsp-datahub-sidecar.git
cd gsp-datahub-sidecar
gsp-datahub-sidecar --sql-file examples/bigquery_procedural.sql --dry-run

Option C — pass SQL inline (no file needed)

gsp-datahub-sidecar --sql "DECLARE x INT64; CREATE VIEW v AS SELECT a FROM t" --dry-run

Option D — point at your own SQL file

gsp-datahub-sidecar --sql-file /path/to/your/query.sql --dry-run

Licensing

This sidecar (glue code) is Apache 2.0 licensed. Gudu SQLFlow is a commercial product by Gudu Software. The anonymous tier provides free evaluation access. For production use, deploy the self-hosted SQLFlow Docker with a license, or supply your own licensed gsqlparser-*-shaded.jar to --mode local_jar — the sidecar does not bundle or redistribute the SQLFlow JAR.
