HiveSQL and SparkSQL optimized SQL asset extraction, dependency scanning, and table-level lineage analysis.
Project description
SQL Asset Graph
SQL Asset Graph is a command-line tool optimized for HiveSQL and SparkSQL asset analysis, covering SQL extraction, table usage scanning, table-level lineage generation, lineage cycle analysis, and graph-style lineage queries.
The current release is deliberately tuned for HiveSQL and SparkSQL-style warehouse scripts rather than broad multi-dialect parsing. In repository-scale Hive/Spark SQL projects with dynamic table templates, embedded SQL, and stable CSV output contracts, it provides stronger practical detection coverage than generic open-source lineage tools that primarily target isolated statements or broad ANSI-style dialect support.
It is designed for repository-style SQL assets instead of isolated ad hoc statements, and it supports extracting embedded SQL from Python files, replacing placeholders before analysis, exporting stable CSV outputs, and querying upstream or downstream table relationships from generated lineage data.
Quick Start
Install from the current repository:
pip install -e .
After installation, you can use either entrypoint:
python -m sql_asset_graph.main --help
sql-asset-graph --help
Generate table usage information from a directory of SQL files:
sql-asset-graph table-usage -i ./sql_dir -o ./output/table_usage.csv
Expected CSV header:
file_name,access_type,table_name
Generate direct table lineage:
sql-asset-graph lineage -i ./sql_dir -o ./output/table_lineage.csv
Expected CSV header:
file_name,statement_index,statement_type,target_table,source_table,unresolved_dynamic_tables
Analyze table-level lineage cycles from lineage output:
sql-asset-graph lineage-cycles ./output/table_lineage.csv
Cycle CSV header:
cycle_id,cycle_length,sequence_index,table_name
For lightweight shell usage, lineage-cycles also accepts - as the input path and reads table_lineage.csv content from stdin. Legacy alias: analyze.
Current Focus
SQL Asset Graph currently works best for HiveSQL and SparkSQL-style batch SQL workflows, especially when SQL lives in repositories together with Python orchestration scripts.
- Focused on Hive-style DML and lineage paths such as
INSERT OVERWRITE,CREATE TABLE AS SELECT, andCREATE VIEW AS SELECT - Optimized for repository-scale SQL processing instead of one-off interactive parsing
- Suitable when SQL is extracted from Python first and then passed through placeholder replacement, table usage export, lineage, and lineage cycle analysis
Current Advantages
Compared with generic SQL lineage tooling, the current version is strongest in HiveSQL and SparkSQL repository workflows.
- HiveSQL and SparkSQL-style workflow focus.
- Stable CSV outputs for downstream automation.
- Conservative handling of dynamic table templates.
- End-to-end repository workflow from extraction to lineage query.
- A default native Hive-oriented backend rather than a thin wrapper over a generic parser.
In practical terms, that means the current version is a better fit when you care more about predictable detection for HiveSQL/SparkSQL warehouse scripts than about supporting every SQL dialect equally.
What It Does
SQL Asset Graph provides an end-to-end workflow for repository-based SQL analysis.
- Extract SQL fragments from Python files.
- Replace placeholder variables in SQL files.
- Scan SQL files to identify read and write table usage.
- Generate direct table-level lineage rows.
- Analyze table-level lineage cycles.
- Query upstream, downstream, and cyclic relationships from lineage CSV output.
The current implementation is intentionally specialized: it favors HiveSQL and SparkSQL-style repository assets, stable output contracts, and conservative lineage behavior over broad multi-engine dialect coverage.
Typical Use Cases
SQL Asset Graph is a good fit when you need one or more of the following:
- Batch analysis of SQL files stored in a repository.
- Preprocessing SQL that is embedded in Python scripts.
- Stable CSV outputs that can be consumed by downstream tools.
- Table-level lineage for HiveSQL or SparkSQL-style SQL workflows.
- Lightweight lineage graph queries without introducing a separate service.
Command Overview
The CLI exposes six main commands.
extract-sql
Extract SQL strings from Python files or directories.
python -m sql_asset_graph.main extract-sql /path/to/file.py
python -m sql_asset_graph.main extract-sql /path/to/python_dir -o ./output
python -m sql_asset_graph.main extract-sql /path/to/file.py --format json
python -m sql_asset_graph.main extract-sql /path/to/file.py --format csv
Legacy alias: extract
fill-placeholder
Replace placeholders in SQL files using values from a constants module.
python -m sql_asset_graph.main fill-placeholder input.sql -c path/to/constants.py
python -m sql_asset_graph.main fill-placeholder input.sql -c path/to/constants.py -s
cat input.sql | python -m sql_asset_graph.main fill-placeholder - -c path/to/constants.py
Legacy alias: replace
table-usage
Scan SQL files and export table read/write usage.
python -m sql_asset_graph.main table-usage -i ./sample.sql
python -m sql_asset_graph.main table-usage -i ./sql_dir -o ./output/table_usage.csv
python -m sql_asset_graph.main table-usage -i ./sql_dir --format json -o ./output/table_usage.json
cat sample.sql | python -m sql_asset_graph.main table-usage -i - --source-name sample.sql
cat sample.sql | python -m sql_asset_graph.main table-usage -i - --format csv --source-name sample.sql
Legacy alias: scan
Output header:
file_name,access_type,table_name
lineage
Generate direct table-level lineage.
python -m sql_asset_graph.main lineage -i ./sample.sql
python -m sql_asset_graph.main lineage -i ./sql_dir -o ./output/table_lineage.csv
python -m sql_asset_graph.main lineage -i ./sql_dir --format json -o ./output/table_lineage.json
cat sample.sql | python -m sql_asset_graph.main lineage -i - --source-name sample.sql
cat sample.sql | python -m sql_asset_graph.main lineage -i - --format csv --source-name sample.sql
Output header:
file_name,statement_index,statement_type,target_table,source_table,unresolved_dynamic_tables
lineage-graph
Query lineage relationships from generated CSV output.
python -m sql_asset_graph.main lineage-graph output/table_lineage.csv --upstream APP.TARGET_Y
python -m sql_asset_graph.main lineage-graph output/table_lineage.csv --downstream APP.SOURCE_X
python -m sql_asset_graph.main lineage-graph output/table_lineage.csv --cycles
cat output/table_lineage.csv | python -m sql_asset_graph.main lineage-graph - --upstream APP.TARGET_Y
lineage-cycles
Analyze table-level lineage cycles from table_lineage.csv.
python -m sql_asset_graph.main lineage-cycles output/table_lineage.csv
python -m sql_asset_graph.main lineage-cycles output/table_lineage.csv --format json -o cycles.json
cat output/table_lineage.csv | python -m sql_asset_graph.main lineage-cycles -
Legacy alias: analyze
Cycle CSV header:
cycle_id,cycle_length,sequence_index,table_name
Output Files
The tool produces plain files that are easy to inspect or integrate into other workflows.
*_extracted_sql_*.sql: extracted SQL collected from Python sources*_extracted_sql_*.json: structured extracted SQL records*_extracted_sql_*.csv: tabular extracted SQL recordstable_usage_*.csv: table read/write usage rows with headerfile_name,access_type,table_nametable_usage_*.json: structured table read/write usage payloadtable_lineage_*.csv: direct table lineage rows with headerfile_name,statement_index,statement_type,target_table,source_table,unresolved_dynamic_tablestable_lineage_*.json: structured table lineage payloadtable_lineage_cycles_*.csv: detected table-level lineage cycles with headercycle_id,cycle_length,sequence_index,table_nametable_lineage_cycles_*.json: structured table-level lineage cycle payload
Current Scope
SQL Asset Graph currently focuses on table-level lineage.
- It does not provide column-level lineage.
- It is optimized for HiveSQL and SparkSQL-style repository processing rather than broad multi-dialect SQL coverage.
- It is especially suitable when you need both CSV/JSON exports and follow-up lineage analysis.
- Dynamic table templates are treated conservatively and may be reported separately instead of being forced into guessed lineage edges.
Requirements
- Python 3.9+
- Standard library only
License
See LICENSE.
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file sql_asset_graph-0.1.0.tar.gz.
File metadata
- Download URL: sql_asset_graph-0.1.0.tar.gz
- Upload date:
- Size: 28.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ae3e5af89a718dc495d278bf22276ed1d06a700f62f3613b2ad280a1fd443384
|
|
| MD5 |
894fbf63d30662e2ee76abd4cd7b4bb1
|
|
| BLAKE2b-256 |
eba52cd5412155aac0493f296c1457ef9cd9a972726c76b64bee60bc39e98815
|
File details
Details for the file sql_asset_graph-0.1.0-py3-none-any.whl.
File metadata
- Download URL: sql_asset_graph-0.1.0-py3-none-any.whl
- Upload date:
- Size: 32.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.4
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
3e83fff1318009fed41e9c75723ce86d1ac5da64b6ba3cf6e210f1bf9b9929b0
|
|
| MD5 |
e7302174927d7303d1ea2b24be65e9be
|
|
| BLAKE2b-256 |
db9caf790f6f22244c68206ccea2b86f89902ca7b3d42168003351f9cd3a5ac0
|