Universal static analyzer for data pipelines

LineageScope

Universal static analyzer for data pipelines. Point it at a Git repository to analyze SQL, dbt, Airflow, Spark, and data contracts without a database or cloud account.

Full documentation: kirannarayanak.github.io/lineagescope (MkDocs, deployed from main).

Quickstart

pip install lineagescope
lineagescope scan . --format json
lineagescope ci --threshold 70 --path .

Install from PyPI with pip install lineagescope; the CLI command is lineagescope. (The unrelated CPU tool pipescope is a different package on PyPI.)

(From a checkout, use pip install -e ".[dev]".)

Demo (terminal GIF)

LineageScope scan demo

The GIF above is a stylized preview (generated with python scripts/generate_demo_gif.py, requires Pillow). For a pixel-perfect terminal capture, record with asciinema and convert with agg; see docs/demo/README.md. The repo includes docs/demo/lineagescope-demo.cast as a sample cast you can pass to agg.

Architecture

flowchart LR
  subgraph ingest [Ingest]
    W[Directory walk]
    P[Parsers: SQL, dbt, Airflow, Spark, ODCS]
  end
  subgraph graph [Graph]
    G[Pipeline graph NetworkX]
  end
  subgraph analyze [Analyzers]
    A1[Dead assets]
    A2[Tests and docs]
    A3[Complexity]
    A4[Ownership]
    A5[Contracts]
    A6[Cost hotspots]
  end
  subgraph out [Output]
    T[Rich terminal]
    J[JSON]
    H[HTML and D3]
  end
  W --> P --> G
  G --> A1 & A2 & A3 & A4 & A5 & A6
  A1 & A2 & A3 & A4 & A5 & A6 --> T & J & H

Requirements

  • Python 3.11+

Install

From PyPI:

pip install lineagescope

From a source checkout:

pip install .

For local development, use the editable install in Setup below.

Setup

python -m venv venv
# Windows:
venv\Scripts\activate
# Linux/macOS:
# source venv/bin/activate

pip install -e ".[dev]"
pre-commit install

Development

Run the test suite and linter locally:

pytest
ruff check lineagescope tests

Documentation site (MkDocs + Material):

pip install -e ".[docs]"
mkdocs serve
# open http://127.0.0.1:8000/lineagescope/  (Ctrl+C to stop the dev server)
# or: LINEAGESCOPE_DOCS_SITE_URL=http://127.0.0.1:8000/ mkdocs serve  → http://127.0.0.1:8000/

Optional: pre-commit run --all-files runs the same hooks as on commit (Ruff, YAML/TOML checks, whitespace).

Publishing a release: see RELEASING.md and CHANGELOG.md.

Windows terminal

On Windows, LineageScope reconfigures stdout/stderr to UTF-8 when supported so Rich tables and paths render correctly. For best results, use Windows Terminal or PowerShell 7+; you can also set PYTHONUTF8=1 in the environment or run chcp 65001 in legacy consoles.
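As an illustration of the technique described above (not LineageScope's actual code), a minimal sketch of switching a text stream to UTF-8 when the stream supports it. The helper name `force_utf8` is hypothetical; `reconfigure()` is the standard `io.TextIOWrapper` method.

```python
import io
import sys


def force_utf8(stream) -> bool:
    """Switch a text stream to UTF-8 if it supports reconfigure().

    Returns True when the stream was reconfigured, False otherwise.
    """
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        # Streams such as io.StringIO have no reconfigure(); leave them alone.
        return False
    try:
        reconfigure(encoding="utf-8", errors="replace")
    except (ValueError, OSError):
        # The stream refused the change; keep the existing encoding.
        return False
    return True


if sys.platform == "win32":
    for _stream in (sys.stdout, sys.stderr):
        force_utf8(_stream)
```

Streams that have been replaced or redirected may not expose `reconfigure()`, which is why the sketch returns False instead of raising.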

Usage

lineagescope --help
lineagescope scan .
lineagescope scan path/to/repo --dialect postgres
lineagescope scan . --format json
lineagescope scan . --exclude node_modules,venv,.venv,.git

Terminal mode shows a Rich progress bar while walking the tree, loading dbt projects, parsing files, and reading contracts. Unreadable or failing files are skipped with a warning (see parse_warnings in JSON or the yellow panel in the terminal).

Analyzer tuning (optional):

lineagescope scan . --dead-asset-whitelist my_export,legacy_sink
lineagescope scan . --test-coverage-critical-deps 15

JSON output (--format json)

Top-level keys include:

| Key | Purpose |
| --- | --- |
| assets, edges | Parsed inventory and lineage (source → target) |
| graph | node_count, edge_count, is_directed_acyclic |
| analytics | Graph metrics plus per-analyzer blocks (see below) |
| findings | Combined issues from all analyzers (categories vary by rule) |
| scores | Integer 0–100 per dimension (see below) |
| parse_warnings | Optional list of human-readable skip messages (parse/read failures) |

analytics (high level): in addition to graph, orphan, and cycle metrics, you get:

| Block | Role |
| --- | --- |
| dead_asset_analysis | Dead / sink analysis details |
| test_coverage, test_coverage_analysis | Test presence and downstream risk |
| documentation_coverage, documentation_coverage_analysis | Docs vs assets |
| complexity_analysis | SQL/graph complexity and percentile flags |
| ownership_analysis | Ownership coverage and staleness counts |
| contract_compliance_analysis | ODCS contracts vs asset columns/types |
| cost_hotspot_analysis | Static SQL cost patterns × downstream impact |

analytics["contract_compliance_analysis"] fields:

| Field | Meaning |
| --- | --- |
| contract_compliance_score | Same integer as scores["contracts"] |
| compliant_contracts / total_contracts | Contracts with no column/type drift vs denominator |
| compliance_ratio | Ratio, or JSON null when total_contracts is 0 |

analytics["ownership_analysis"] fields:

| Field | Meaning |
| --- | --- |
| ownership_score | Same integer as scores["ownership"] |
| assets_with_owner / total_count | Scoped assets with a resolved owner vs denominator |
| coverage_ratio | assets_with_owner / total_count, or JSON null when total_count is 0 |
| no_owner_count | Assets with no CODEOWNERS, dbt meta.owner, or git author |
| stale_count | Assets whose file's last commit is older than ~6 months |

scores

| Key | Interpretation |
| --- | --- |
| dead_assets, test_coverage, documentation, ownership, contracts, cost_hotspots | Higher is better (for cost_hotspots, higher means fewer heavy SQL patterns) |
| complexity | Higher = more complex (more structural/SQL weight) |

findings categories (non-exhaustive): dead_asset, missing_test, weak_test_coverage, missing_documentation, complexity flags, no_owner, stale_asset, contract_asset_not_found, contract_missing_column, contract_extra_column, contract_type_mismatch, cost_hotspot.

analytics["cost_hotspot_analysis"] fields:

| Field | Meaning |
| --- | --- |
| cost_hotspot_score | Same integer as scores["cost_hotspots"] (100 = no hotspots) |
| total_pattern_instances | Sum of pattern counts across flagged SQL assets |
| max_weighted_impact | Largest pattern_count × (1 + 0.12 × min(50, downstream)) |
| ranked | Top assets by weighted impact (name, patterns, downstream count, score) |

Use this shape for CI gates and dashboards.
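For dashboards, the shape above can be reduced to a few numbers. The following is a sketch only: the sample report is hypothetical and heavily trimmed, and `summarize` is an illustrative helper, not part of LineageScope. It assumes each finding carries a `category` key, as the jq examples in this README do.

```python
import json
from collections import Counter

# Hypothetical, trimmed-down scan output; a real report contains more keys.
SAMPLE = json.loads("""
{
  "scores": {"dead_assets": 85, "contracts": 100, "complexity": 40},
  "findings": [
    {"category": "dead_asset"},
    {"category": "contract_missing_column"},
    {"category": "dead_asset"}
  ],
  "parse_warnings": ["skipped broken.sql"]
}
""")


def summarize(report: dict) -> dict:
    """Collapse a scan report into scores, finding counts, and skipped files."""
    counts = Counter(
        finding.get("category", "unknown")
        for finding in report.get("findings", [])
    )
    return {
        "scores": dict(report.get("scores", {})),
        "findings_by_category": dict(counts),
        # parse_warnings is optional, so treat a missing key as an empty list.
        "skipped_files": len(report.get("parse_warnings") or []),
    }


summary = summarize(SAMPLE)
```

In practice the report would come from `lineagescope scan . --format json` rather than an inline string.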

Ownership

LineageScope resolves an owner per asset in this order: CODEOWNERS (path match, last matching line wins) → dbt meta.owner on models and source tables in schema.yml → git last commit author on that file. Synthetic SQL query-block assets are excluded from ownership scoring.

  • CODEOWNERS: place .github/CODEOWNERS or CODEOWNERS at the repository root (or under the scan root if not using git). Patterns follow GitHub-style globs (*, **, /).
  • dbt: under models: entries use meta.owner; under sources: → tables: use meta.owner per table.
  • Stale: stale_asset findings use the last commit timestamp from git log when the repo is available; assets outside git or without history are not flagged as stale by date.
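The resolution order above can be sketched as a simple fallback chain. This is an illustration under stated assumptions, not LineageScope's implementation: `codeowners_owner` and `resolve_owner` are hypothetical names, and `fnmatch` is a simplification that does not reproduce GitHub's full glob semantics.

```python
from fnmatch import fnmatch
from typing import Optional


def codeowners_owner(path: str, rules: list[tuple[str, str]]) -> Optional[str]:
    """Return the owner from the last matching (pattern, owner) rule, if any."""
    owner = None
    for pattern, candidate in rules:
        if fnmatch(path, pattern):  # simplified matching, not GitHub's exact rules
            owner = candidate  # later rules override earlier ones
    return owner


def resolve_owner(
    path: str,
    rules: list[tuple[str, str]],
    dbt_meta_owner: Optional[str] = None,
    git_last_author: Optional[str] = None,
) -> Optional[str]:
    """CODEOWNERS first, then dbt meta.owner, then the last git commit author."""
    return codeowners_owner(path, rules) or dbt_meta_owner or git_last_author


# Hypothetical rules in file order.
RULES = [("*", "@data-team"), ("models/marts/*", "@analytics")]
owner = resolve_owner("models/marts/orders.sql", RULES, dbt_meta_owner="alice")
```

An asset for which all three sources are empty resolves to `None`, which corresponds to the no_owner case.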

Terminal mode prints an Ownership score panel alongside the other Rich panels.

Data contracts (ODCS)

LineageScope discovers YAML files that look like Open Data Contract Standard documents (for example dataContractSpecification plus dataset, or kind: DataContract with a top-level schema: list). Table and column identifiers are read from name, then physicalName, then logicalName when the earlier keys are absent. Each contract table with at least one column definition is compared to a matching asset by name (exact, case-insensitive, or last path segment such as schema.table → table).

  • Columns: Contract column names are compared to Asset.columns (from CREATE TABLE/VIEW SQL or dbt schema.yml). Missing or extra columns emit findings; extras are info severity.
  • Types: When both the contract and the asset declare a type (logicalType / physicalType / legacy type, vs dbt data_type or SQL CREATE column types), LineageScope normalizes families (e.g. int → integer, varchar → string) and flags mismatches at warning severity.
  • Score: (compliant_contracts / total_contracts) * 100, where a contract is compliant when it resolves to an asset and produces no contract findings. Contracts with no column list are ignored for the denominator.
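A rough sketch of the matching and ratio logic described above. All function names are hypothetical, and two details are assumptions rather than documented behavior: that the last-segment comparison is case-insensitive and applied to both sides, and that "missing" means a column in the contract but not in the asset.

```python
from typing import Optional


def match_asset(contract_table: str, asset_names: list[str]) -> Optional[str]:
    """Match by exact name, then case-insensitively, then by last path segment."""
    if contract_table in asset_names:
        return contract_table
    lowered = {name.lower(): name for name in asset_names}
    if contract_table.lower() in lowered:
        return lowered[contract_table.lower()]
    tail = contract_table.rsplit(".", 1)[-1].lower()  # schema.table -> table
    for name in asset_names:
        if name.rsplit(".", 1)[-1].lower() == tail:
            return name
    return None


def column_drift(contract_cols: list[str], asset_cols: list[str]) -> tuple[list[str], list[str]]:
    """Return (missing, extra): contract-only columns and asset-only columns."""
    contract, asset = set(contract_cols), set(asset_cols)
    return sorted(contract - asset), sorted(asset - contract)


def compliance_ratio(compliant: int, total: int) -> Optional[float]:
    """compliant / total, or None (JSON null) when there are no contracts."""
    return None if total == 0 else compliant / total
```

For example, `match_asset("analytics.orders", ["orders"])` falls through to the last-segment rule and returns `"orders"`.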

Terminal mode includes a Contract compliance panel.

Cost hotspots

LineageScope scans each .sql model/table/view asset (excluding synthetic query-block rows), reads the file once per path, and runs static checks from detect_cost_patterns() in sql_parser: SELECT_STAR, CROSS_JOIN, MISSING_WHERE_CLAUSE (DELETE/UPDATE without WHERE), SELECT_WITHOUT_WHERE (top-level SELECT with FROM but no WHERE), NO_LIMIT (top-level SELECT with FROM but no LIMIT/FETCH), plus MISSING_PARTITION_FILTER when an asset references a table tagged with partition_key in dbt meta (see below) and the WHERE clause does not reference that column.

Downstream weighting: For each flagged asset, LineageScope computes len(nx.descendants(graph, asset)) in the lineage graph and weighted impact = pattern_count × (1 + 0.12 × min(50, downstream_count)). The ranked list in analytics is sorted by this weighted impact descending.
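The weighting formula can be worked through on a toy graph. This sketch uses a small pure-Python traversal in place of `nx.descendants` so it runs without NetworkX; the edge list and pattern counts are made up for illustration.

```python
from collections import deque


def descendants(edges: list[tuple[str, str]], start: str) -> set[str]:
    """All nodes reachable from start (excluding start); stands in for nx.descendants."""
    children: dict[str, list[str]] = {}
    for source, target in edges:
        children.setdefault(source, []).append(target)
    seen: set[str] = set()
    queue = deque(children.get(start, []))
    while queue:
        node = queue.popleft()
        if node in seen or node == start:
            continue
        seen.add(node)
        queue.extend(children.get(node, []))
    return seen


def weighted_impact(pattern_count: int, downstream_count: int) -> float:
    """pattern_count × (1 + 0.12 × min(50, downstream_count))."""
    return pattern_count * (1 + 0.12 * min(50, downstream_count))


# Hypothetical lineage: raw -> stg -> {mart_a, mart_b}
EDGES = [("raw", "stg"), ("stg", "mart_a"), ("stg", "mart_b")]
PATTERN_COUNTS = {"raw": 2, "stg": 1}  # flagged assets and their pattern counts

ranked = sorted(
    (
        (name, weighted_impact(count, len(descendants(EDGES, name))))
        for name, count in PATTERN_COUNTS.items()
    ),
    key=lambda item: item[1],
    reverse=True,
)
```

Because of the `min(50, ...)` cap, the multiplier never exceeds 1 + 0.12 × 50 = 7, so an asset with hundreds of downstream nodes is weighted the same as one with 50.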

dbt: Set meta.partition_key on a model or source table in schema.yml to tag that logical table name; the analyzer maps asset.name → partition column for partition checks.

Terminal mode includes a Cost hotspots panel.

CI

On GitHub, .github/workflows/ci.yml runs Ruff (ruff check .) and pytest with coverage (pytest --cov=lineagescope) on Python 3.11 for every push and pull_request.

JSON gates (jq)

LineageScope does not exit with a non-zero status when findings are present. In CI, use --format json and assert on scores or findings (for example with jq, available on GitHub-hosted runners).

Minimum contract compliance score (90):

lineagescope scan . --format json | jq -e '.scores.contracts >= 90'

Fail if any contract-related finding exists:

lineagescope scan . --format json | jq -e '[.findings[] | select(.category | test("^contract_"))] | length == 0'

Multiple checks in one step (write JSON once):

lineagescope scan . --format json > lineagescope-scan.json
jq -e '.scores.contracts >= 90' lineagescope-scan.json
jq -e '.scores.dead_assets >= 80' lineagescope-scan.json
jq -e '[.findings[] | select(.category | test("^contract_"))] | length == 0' lineagescope-scan.json

jq -e exits with status 1 when the filter yields false or null, which fails the shell step. On Windows without jq, use PowerShell (ConvertFrom-Json) or run these checks in GitHub Actions / WSL where jq is available.
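Where jq is not available, the same gates can be expressed in Python, which is already present wherever LineageScope is installed. This is a sketch: `gate_failures` is a hypothetical helper, and the thresholds mirror the jq examples above.

```python
import re


def gate_failures(
    report: dict,
    min_scores: dict[str, int],
    banned_category: str = r"^contract_",
) -> list[str]:
    """Return human-readable gate failures; an empty list means the gates pass."""
    failures = []
    scores = report.get("scores", {})
    for key, minimum in min_scores.items():
        value = scores.get(key)
        # A missing score is treated as a failure, like jq -e on null.
        if value is None or value < minimum:
            failures.append(f"scores.{key}={value!r} is below {minimum}")
    pattern = re.compile(banned_category)
    hits = [
        finding
        for finding in report.get("findings", [])
        if pattern.search(finding.get("category", ""))
    ]
    if hits:
        failures.append(f"{len(hits)} finding(s) match {banned_category!r}")
    return failures


# Hypothetical report, trimmed to the keys the gates read.
REPORT = {
    "scores": {"contracts": 95, "dead_assets": 70},
    "findings": [{"category": "contract_type_mismatch"}],
}
failures = gate_failures(REPORT, {"contracts": 90, "dead_assets": 80})
```

In a CI step, load the report with `json.load` from lineagescope-scan.json, print each failure, and call `sys.exit(1)` when the list is non-empty.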

License

MIT. See LICENSE.
