LineageScope
Universal static analyzer for data pipelines. Point it at a Git repository to analyze SQL, dbt, Airflow, Spark, and data contracts without a database or cloud account.
Full documentation: kirannarayanak.github.io/lineagescope (MkDocs, deployed from main).
Quickstart
pip install lineagescope
lineagescope scan . --format json
lineagescope ci --threshold 70 --path .
Install from PyPI with pip install lineagescope; the CLI command is lineagescope. (The unrelated CPU tool pipescope is a different package on PyPI.)
(From a checkout, use pip install -e ".[dev]".)
Demo (terminal GIF)
The GIF above is a stylized preview (generated with python scripts/generate_demo_gif.py, requires Pillow). For a pixel-perfect terminal capture, record with asciinema and convert with agg; see docs/demo/README.md. The repo includes docs/demo/lineagescope-demo.cast as a sample cast you can pass to agg.
Architecture
flowchart LR
subgraph ingest [Ingest]
W[Directory walk]
P[Parsers: SQL, dbt, Airflow, Spark, ODCS]
end
subgraph graph [Graph]
G[Pipeline graph NetworkX]
end
subgraph analyze [Analyzers]
A1[Dead assets]
A2[Tests and docs]
A3[Complexity]
A4[Ownership]
A5[Contracts]
A6[Cost hotspots]
end
subgraph out [Output]
T[Rich terminal]
J[JSON]
H[HTML and D3]
end
W --> P --> G
G --> A1 & A2 & A3 & A4 & A5 & A6
A1 & A2 & A3 & A4 & A5 & A6 --> T & J & H
Requirements
- Python 3.11+
Install
From PyPI (after the package is published):
pip install lineagescope
From a source checkout:
pip install .
For local development, use the editable install in Setup below.
Setup
python -m venv venv
# Windows:
venv\Scripts\activate
# Linux/macOS:
# source venv/bin/activate
pip install -e ".[dev]"
pre-commit install
Development
Run the test suite and linter locally:
pytest
ruff check lineagescope tests
Documentation site (MkDocs + Material):
pip install -e ".[docs]"
mkdocs serve
# open http://127.0.0.1:8000/lineagescope/ (Ctrl+C to stop the dev server)
# or: LINEAGESCOPE_DOCS_SITE_URL=http://127.0.0.1:8000/ mkdocs serve → http://127.0.0.1:8000/
Optional: pre-commit run --all-files runs the same hooks as on commit (Ruff, YAML/TOML checks, whitespace).
Publishing a release: see RELEASING.md and CHANGELOG.md.
Windows terminal
On Windows, LineageScope reconfigures stdout/stderr to UTF-8 when supported so Rich tables and paths render correctly. For best results, use Windows Terminal or PowerShell 7+; you can also set PYTHONUTF8=1 in the environment or run chcp 65001 in legacy consoles.
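The stream reconfiguration described above can be approximated with the standard library. This is a minimal sketch under stated assumptions, not LineageScope's actual code: the helper name `ensure_utf8` is hypothetical, and the demonstration uses an in-memory stream rather than the real console.

```python
import io


def ensure_utf8(stream) -> bool:
    """Switch a text stream to UTF-8 if it supports reconfigure().

    Returns True when the stream was reconfigured, False otherwise.
    (Hypothetical helper; the real tool may do this differently.)
    """
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        return False  # e.g. a replaced stream object without reconfigure()
    try:
        reconfigure(encoding="utf-8", errors="replace")
    except (ValueError, OSError):
        return False
    return True


# Demonstration on an in-memory stream that starts out ASCII-only.
demo = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
changed = ensure_utf8(demo)
demo.write("ok → ✓")  # would raise UnicodeEncodeError under ASCII
demo.flush()
```

In a real program the same helper would be called on `sys.stdout` and `sys.stderr` at startup.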
Usage
lineagescope --help
lineagescope scan .
lineagescope scan path/to/repo --dialect postgres
lineagescope scan . --format json
lineagescope scan . --exclude node_modules,venv,.venv,.git
Terminal mode shows a Rich progress bar while walking the tree, loading dbt projects, parsing files, and reading contracts. Unreadable or failing files are skipped with a warning (see parse_warnings in JSON or the yellow panel in the terminal).
Analyzer tuning (optional):
lineagescope scan . --dead-asset-whitelist my_export,legacy_sink
lineagescope scan . --test-coverage-critical-deps 15
JSON output (--format json)
Top-level keys include:
| Key | Purpose |
|---|---|
| `assets`, `edges` | Parsed inventory and lineage (source → target) |
| `graph` | `node_count`, `edge_count`, `is_directed_acyclic` |
| `analytics` | Graph metrics plus per-analyzer blocks (see below) |
| `findings` | Combined issues from all analyzers (categories vary by rule) |
| `scores` | Integer 0–100 per dimension (see below) |
| `parse_warnings` | Optional list of human-readable skip messages (parse/read failures) |
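As a rough illustration of consuming this shape, the sketch below pulls a few headline values out of a scan report. The payload is hand-written and hypothetical: only the key names come from the table above, the `category` field on findings is taken from the jq examples in the CI section, and `summarize` is an illustrative helper, not part of LineageScope.

```python
import json

# Hypothetical, hand-written payload; real output contains more data.
SAMPLE = """
{
  "assets": [], "edges": [],
  "graph": {"node_count": 3, "edge_count": 2, "is_directed_acyclic": true},
  "analytics": {},
  "findings": [{"category": "dead_asset"}, {"category": "contract_missing_column"}],
  "scores": {"dead_assets": 80, "contracts": 95, "complexity": 40}
}
"""


def summarize(report: dict) -> dict:
    """Collect headline values, tolerating the optional parse_warnings key."""
    graph = report.get("graph", {})
    return {
        "nodes": graph.get("node_count"),
        "edges": graph.get("edge_count"),
        "is_dag": graph.get("is_directed_acyclic"),
        "finding_count": len(report.get("findings", [])),
        # parse_warnings is optional, so treat a missing key as "no warnings".
        "warning_count": len(report.get("parse_warnings") or []),
        "scores": dict(report.get("scores", {})),
    }


summary = summarize(json.loads(SAMPLE))
print(summary)
```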
analytics (high level): in addition to graph, orphan, and cycle style metrics, you get:
| Block | Role |
|---|---|
| `dead_asset_analysis` | Dead / sink analysis details |
| `test_coverage`, `test_coverage_analysis` | Test presence and downstream risk |
| `documentation_coverage`, `documentation_coverage_analysis` | Docs vs assets |
| `complexity_analysis` | SQL/graph complexity and percentile flags |
| `ownership_analysis` | Ownership coverage and staleness counts |
| `contract_compliance_analysis` | ODCS contracts vs asset columns/types |
| `cost_hotspot_analysis` | Static SQL cost patterns × downstream impact |
analytics["contract_compliance_analysis"] fields:
| Field | Meaning |
|---|---|
| `contract_compliance_score` | Same integer as `scores["contracts"]` |
| `compliant_contracts` / `total_contracts` | Contracts with no column/type drift vs denominator |
| `compliance_ratio` | Ratio, or JSON `null` when `total_contracts` is 0 |
analytics["ownership_analysis"] fields:
| Field | Meaning |
|---|---|
| `ownership_score` | Same integer as `scores["ownership"]` |
| `assets_with_owner` / `total_count` | Scoped assets with a resolved owner vs denominator |
| `coverage_ratio` | `assets_with_owner / total_count`, or JSON `null` when `total_count` is 0 |
| `no_owner_count` | Assets with no CODEOWNERS, dbt `meta.owner`, or git author |
| `stale_count` | Assets whose file's last commit is older than ~6 months |
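The `coverage_ratio` rule (a ratio, or JSON `null` when the denominator is 0) can be sketched as below. The function is an illustration of the documented behaviour, not the tool's implementation. The same pattern would apply to `compliance_ratio` in the contract block.

```python
import json
from typing import Optional


def coverage_ratio(assets_with_owner: int, total_count: int) -> Optional[float]:
    """Return assets_with_owner / total_count, or None when total_count is 0.

    None serializes to JSON null, so consumers must handle it explicitly.
    """
    if total_count == 0:
        return None
    return assets_with_owner / total_count


block = {
    "assets_with_owner": 0,
    "total_count": 0,
    "coverage_ratio": coverage_ratio(0, 0),
}
encoded = json.dumps(block)
print(encoded)
```

A dashboard or CI gate reading this field should check for `null` before comparing against a threshold.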
scores
| Key | Interpretation |
|---|---|
| `dead_assets`, `test_coverage`, `documentation`, `ownership`, `contracts`, `cost_hotspots` | Higher is better (for `cost_hotspots`: fewer heavy SQL patterns) |
| `complexity` | Higher = more complex (more structural/SQL weight) |
findings categories (non-exhaustive): dead_asset, missing_test, weak_test_coverage, missing_documentation, complexity flags, no_owner, stale_asset, contract_asset_not_found, contract_missing_column, contract_extra_column, contract_type_mismatch, cost_hotspot.
analytics["cost_hotspot_analysis"] fields:
| Field | Meaning |
|---|---|
| `cost_hotspot_score` | Same integer as `scores["cost_hotspots"]` (100 = no hotspots) |
| `total_pattern_instances` | Sum of pattern counts across flagged SQL assets |
| `max_weighted_impact` | Largest `pattern_count × (1 + 0.12 × min(50, downstream))` |
| `ranked` | Top assets by weighted impact (name, patterns, downstream count, score) |
Use this shape for CI gates and dashboards.
Ownership
LineageScope resolves an owner per asset in this order: CODEOWNERS (path match, last matching line wins) → dbt meta.owner on models and source tables in schema.yml → git last commit author on that file. Synthetic SQL query-block assets are excluded from ownership scoring.
- CODEOWNERS: place `.github/CODEOWNERS` or `CODEOWNERS` at the repository root (or under the scan root if not using git). Patterns follow GitHub-style globs (`*`, `**`, `/`).
- dbt: under `models:` entries use `meta.owner`; under `sources:` → `tables:` use `meta.owner` per table.
- Stale: `stale_asset` findings use the last commit timestamp from `git log` when the repo is available; assets outside git or without history are not flagged as stale by date.
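The resolution order above can be sketched as a simple fallback chain. This is an illustration only: `codeowners_match` and `resolve_owner` are hypothetical names, and `fnmatchcase` is a simplified stand-in for GitHub-style glob matching (it does not reproduce the real CODEOWNERS semantics for `**` or leading `/`).

```python
from fnmatch import fnmatchcase
from typing import Optional


def codeowners_match(path: str, rules: list[tuple[str, str]]) -> Optional[str]:
    """Return the owner of the last rule whose pattern matches path."""
    owner = None
    for pattern, rule_owner in rules:
        if fnmatchcase(path, pattern):
            owner = rule_owner  # later lines override earlier ones
    return owner


def resolve_owner(
    path: str,
    rules: list[tuple[str, str]],
    dbt_owner: Optional[str] = None,
    git_author: Optional[str] = None,
) -> Optional[str]:
    """Apply the documented order: CODEOWNERS, then dbt meta.owner, then git author."""
    return codeowners_match(path, rules) or dbt_owner or git_author


# Hypothetical rules in file order (pattern, owner).
rules = [("*.sql", "@data-team"), ("models/finance/*", "@finance")]

finance = resolve_owner("models/finance/revenue.sql", rules)
core = resolve_owner("models/core/users.sql", rules)
dag = resolve_owner("dags/etl.py", rules, git_author="alice")
unowned = resolve_owner("dags/etl.py", rules)
```

Here `unowned` resolves to `None`, which corresponds to the case that would produce a `no_owner` finding.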
Terminal mode prints an Ownership score panel alongside the other Rich panels.
Data contracts (ODCS)
LineageScope discovers YAML files that look like Open Data Contract Standard documents (for example dataContractSpecification plus dataset, or kind: DataContract with a top-level schema: list). Table and column identifiers are read from name, then physicalName, then logicalName when the earlier keys are absent. Each contract table with at least one column definition is compared to a matching asset by name (exact, case-insensitive, or last path segment such as schema.table → table).
- Columns: Contract column names are compared to `Asset.columns` (from `CREATE TABLE`/`VIEW` SQL or dbt `schema.yml`). Missing or extra columns emit findings; extras are info severity.
- Types: When both the contract and the asset declare a type (`logicalType` / `physicalType` / legacy `type`, vs dbt `data_type` or SQL `CREATE` column types), LineageScope normalizes families (e.g. `int` ↔ `integer`, `varchar` ↔ `string`) and flags warning mismatches.
- Score: `(compliant_contracts / total_contracts) * 100`, where a contract is compliant when it resolves to an asset and produces no contract findings. Contracts with no column list are ignored for the denominator.
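The matching and comparison rules above can be sketched roughly as follows. Everything here is an assumption made for illustration: the function names are hypothetical, the family table contains only the two pairs named above, stripping a length suffix such as `(255)` is a guess, column names are compared exactly, and last-segment matching is applied to both sides. The finding category names are the ones listed in the JSON section.

```python
from typing import Optional

# Only the pairs mentioned in the text; the real mapping is likely larger.
TYPE_FAMILIES = {"int": "integer", "integer": "integer",
                 "varchar": "string", "string": "string"}


def type_family(declared: str) -> str:
    """Lowercase, drop a suffix such as (255), then map to a family."""
    base = declared.strip().lower().split("(", 1)[0].strip()
    return TYPE_FAMILIES.get(base, base)


def find_asset(table: str, asset_names: list[str]) -> Optional[str]:
    """Try exact, then case-insensitive, then last-path-segment matching."""
    if table in asset_names:
        return table
    lowered = {name.lower(): name for name in asset_names}
    if table.lower() in lowered:
        return lowered[table.lower()]
    segment = table.rsplit(".", 1)[-1].lower()
    for name in asset_names:
        if name.rsplit(".", 1)[-1].lower() == segment:
            return name
    return None


def compare_columns(contract: dict[str, str], asset: dict[str, str]) -> list[tuple[str, str]]:
    """Return (category, column) pairs for missing, mismatched, and extra columns."""
    findings = []
    for col, declared in contract.items():
        if col not in asset:
            findings.append(("contract_missing_column", col))
        elif declared and asset[col] and type_family(declared) != type_family(asset[col]):
            findings.append(("contract_type_mismatch", col))
    for col in asset:
        if col not in contract:
            findings.append(("contract_extra_column", col))
    return findings


contract = {"id": "int", "email": "varchar(255)", "created_at": "timestamp"}
asset = {"id": "integer", "email": "string", "legacy_flag": "boolean"}
matched = find_asset("analytics.orders", ["Orders", "customers"])
drift = compare_columns(contract, asset)
```

In this example `id` and `email` agree after normalization, so only the missing and extra columns are reported.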
Terminal mode includes a Contract compliance panel.
Cost hotspots
LineageScope scans each .sql model/table/view asset (excluding synthetic query-block rows), reads the file once per path, and runs static checks from detect_cost_patterns() in sql_parser:
- SELECT_STAR
- CROSS_JOIN
- MISSING_WHERE_CLAUSE: DELETE/UPDATE without WHERE
- SELECT_WITHOUT_WHERE: top-level SELECT with FROM but no WHERE
- NO_LIMIT: top-level SELECT with FROM but no LIMIT/FETCH
- MISSING_PARTITION_FILTER: the asset references a table tagged with partition_key in dbt meta (see below) and the WHERE clause does not reference that column
Downstream weighting: For each flagged asset, LineageScope computes len(nx.descendants(graph, asset)) in the lineage graph and weighted impact = pattern_count × (1 + 0.12 × min(50, downstream_count)). The ranked list in analytics is sorted by this weighted impact descending.
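The weighting can be reproduced without NetworkX. The sketch below is a standard-library approximation: `count_descendants` stands in for `len(nx.descendants(graph, asset))` using a breadth-first walk, and the edge list and pattern counts are made-up sample data.

```python
from collections import deque


def count_descendants(edges: list[tuple[str, str]], start: str) -> int:
    """Count every node reachable from start, excluding start itself."""
    children: dict[str, list[str]] = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)
    seen: set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child != start and child not in seen:
                seen.add(child)
                queue.append(child)
    return len(seen)


def weighted_impact(pattern_count: int, downstream_count: int) -> float:
    """pattern_count × (1 + 0.12 × min(50, downstream_count)), as documented."""
    return pattern_count * (1 + 0.12 * min(50, downstream_count))


# Hypothetical lineage (source → target) and pattern counts per flagged asset.
edges = [("raw", "stg"), ("stg", "mart"), ("stg", "report"), ("other", "mart")]
pattern_counts = {"raw": 2, "other": 3, "report": 1}

ranked = sorted(
    (
        (name, weighted_impact(count, count_descendants(edges, name)))
        for name, count in pattern_counts.items()
    ),
    key=lambda item: item[1],
    reverse=True,
)
```

Because of the `min(50, ...)` cap, an asset with hundreds of downstream nodes is weighted the same as one with exactly 50.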
dbt: Set meta.partition_key on a model or source table in schema.yml to tag that logical table name; the analyzer maps asset.name → partition column for partition checks.
Terminal mode includes a Cost hotspots panel.
CI
On GitHub, .github/workflows/ci.yml runs Ruff (ruff check .) and pytest with coverage (pytest --cov=lineagescope) on Python 3.11 for every push and pull_request.
JSON gates (jq)
LineageScope does not exit with a non-zero status when findings are present. In CI, use --format json and assert on scores or findings (for example with jq, available on GitHub-hosted runners).
Minimum contract compliance score (90):
lineagescope scan . --format json | jq -e '.scores.contracts >= 90'
Fail if any contract-related finding exists:
lineagescope scan . --format json | jq -e '[.findings[] | select(.category | test("^contract_"))] | length == 0'
Multiple checks in one step (write JSON once):
lineagescope scan . --format json > lineagescope-scan.json
jq -e '.scores.contracts >= 90' lineagescope-scan.json
jq -e '.scores.dead_assets >= 80' lineagescope-scan.json
jq -e '[.findings[] | select(.category | test("^contract_"))] | length == 0' lineagescope-scan.json
jq -e exits with status 1 when the filter yields false or null, which fails the shell step. On Windows without jq, use PowerShell (ConvertFrom-Json) or run these checks in GitHub Actions / WSL where jq is available.
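Where jq is not available, the same gates can be written in Python. This is a sketch, not a shipped feature: `gate_failures` and `main` are hypothetical helpers, the thresholds mirror the jq examples above, and the only finding field assumed is `category`.

```python
import json
import re
import sys


def gate_failures(report: dict, min_scores: dict[str, int],
                  banned_category: str = r"^contract_") -> list[str]:
    """Return human-readable failures; an empty list means every gate passed."""
    failures = []
    scores = report.get("scores", {})
    for key, minimum in min_scores.items():
        value = scores.get(key)
        if value is None or value < minimum:
            failures.append(f"scores.{key}={value} is below {minimum}")
    pattern = re.compile(banned_category)
    hits = [f for f in report.get("findings", [])
            if pattern.search(f.get("category", ""))]
    if hits:
        failures.append(f"{len(hits)} finding(s) match {banned_category}")
    return failures


def main(path: str) -> int:
    """Read a saved scan (e.g. lineagescope-scan.json) and return an exit code."""
    with open(path, encoding="utf-8") as handle:
        report = json.load(handle)
    failures = gate_failures(report, {"contracts": 90, "dead_assets": 80})
    for line in failures:
        print(f"FAIL: {line}", file=sys.stderr)
    return 1 if failures else 0


# Hypothetical report: passes the contracts gate, fails the other two checks.
sample = {
    "scores": {"contracts": 95, "dead_assets": 70},
    "findings": [{"category": "contract_type_mismatch"}, {"category": "dead_asset"}],
}
sample_failures = gate_failures(sample, {"contracts": 90, "dead_assets": 80})
```

In a CI step, a small wrapper script would call `sys.exit(main("lineagescope-scan.json"))` after writing the JSON once, so a non-zero exit fails the step just as `jq -e` does.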
License
MIT. See LICENSE.