Universal static analyzer for data pipelines

These details have not been verified by PyPI

Project links

Project description

PipeScope

Universal static analyzer for data pipelines. Point it at a Git repository to analyze SQL, dbt, Airflow, Spark, and data contracts without a database or cloud account.

Full documentation: kirannarayanak.github.io/PipeScope (MkDocs, deployed from main).

Quickstart

pip install lineagescope
pipescope scan . --format json
pipescope ci --threshold 70 --path .

Install the package as lineagescope on PyPI; the command is still pipescope. (The name pipescope on PyPI is a different project.)

(From a checkout, use pip install -e ".[dev]" instead of pip install lineagescope.)

Demo (terminal GIF)

PipeScope scan demo

The GIF above is a stylized preview (generated with python scripts/generate_demo_gif.py, requires Pillow). For a pixel-perfect terminal capture, record with asciinema and convert with agg; see docs/demo/README.md. The repo includes docs/demo/pipescope-demo.cast as a sample cast you can pass to agg.

Architecture

flowchart LR
  subgraph ingest [Ingest]
    W[Directory walk]
    P[Parsers: SQL, dbt, Airflow, Spark, ODCS]
  end
  subgraph graph [Graph]
    G[Pipeline graph NetworkX]
  end
  subgraph analyze [Analyzers]
    A1[Dead assets]
    A2[Tests and docs]
    A3[Complexity]
    A4[Ownership]
    A5[Contracts]
    A6[Cost hotspots]
  end
  subgraph out [Output]
    T[Rich terminal]
    J[JSON]
    H[HTML and D3]
  end
  W --> P --> G
  G --> A1 & A2 & A3 & A4 & A5 & A6
  A1 & A2 & A3 & A4 & A5 & A6 --> T & J & H

Requirements

Python 3.11+

Install

From PyPI (after the package is published):

pip install lineagescope

From a source checkout:

pip install .

For local development, use the editable install in Setup below.

Setup

python -m venv venv
# Windows:
venv\Scripts\activate
# Linux/macOS:
# source venv/bin/activate

pip install -e ".[dev]"
pre-commit install

Development

Run the test suite and linter locally:

pytest
ruff check pipescope tests

Documentation site (MkDocs + Material):

pip install -e ".[docs]"
mkdocs serve
# open http://127.0.0.1:8000/PipeScope/  (Ctrl+C to stop the dev server)
# or: PIPESCOPE_DOCS_SITE_URL=http://127.0.0.1:8000/ mkdocs serve  → http://127.0.0.1:8000/

Optional: pre-commit run --all-files runs the same hooks as on commit (Ruff, YAML/TOML checks, whitespace).

Publishing a release: see RELEASING.md and CHANGELOG.md.

Windows terminal

On Windows, PipeScope reconfigures stdout/stderr to UTF-8 when supported so Rich tables and paths render correctly. For best results, use Windows Terminal or PowerShell 7+; you can also set PYTHONUTF8=1 in the environment or run chcp 65001 in legacy consoles.

Usage

pipescope --help
pipescope scan .
pipescope scan path/to/repo --dialect postgres
pipescope scan . --format json
pipescope scan . --exclude node_modules,venv,.venv,.git

Terminal mode shows a Rich progress bar while walking the tree, loading dbt projects, parsing files, and reading contracts. Unreadable or failing files are skipped with a warning (see parse_warnings in JSON or the yellow panel in the terminal).

Analyzer tuning (optional):

pipescope scan . --dead-asset-whitelist my_export,legacy_sink
pipescope scan . --test-coverage-critical-deps 15

JSON output (`--format json`)

Top-level keys include:

Key	Purpose
`assets`, `edges`	Parsed inventory and lineage (`source` → `target`)
`graph`	`node_count`, `edge_count`, `is_directed_acyclic`
`analytics`	Graph metrics plus per-analyzer blocks (see below)
`findings`	Combined issues from all analyzers (categories vary by rule)
`scores`	Integer 0–100 per dimension (see below)
`parse_warnings`	Optional list of human-readable skip messages (parse/read failures)

analytics (high level) — In addition to graph/orphan/cycle style metrics, you get:

Block	Role
`dead_asset_analysis`	Dead / sink analysis details
`test_coverage`, `test_coverage_analysis`	Test presence and downstream risk
`documentation_coverage`, `documentation_coverage_analysis`	Docs vs assets
`complexity_analysis`	SQL/graph complexity and percentile flags
`ownership_analysis`	Ownership coverage and staleness counts
`contract_compliance_analysis`	ODCS contracts vs asset columns/types
`cost_hotspot_analysis`	Static SQL cost patterns × downstream impact

analytics["contract_compliance_analysis"] fields:

Field	Meaning
`contract_compliance_score`	Same integer as `scores["contracts"]`
`compliant_contracts` / `total_contracts`	Contracts with no column/type drift vs denominator
`compliance_ratio`	Ratio, or JSON `null` when `total_contracts` is 0

analytics["ownership_analysis"] fields:

Field	Meaning
`ownership_score`	Same integer as `scores["ownership"]`
`assets_with_owner` / `total_count`	Scoped assets with a resolved owner vs denominator
`coverage_ratio`	`assets_with_owner / total_count`, or JSON `null` when `total_count` is 0
`no_owner_count`	Assets with no CODEOWNERS, dbt `meta.owner`, or git author
`stale_count`	Assets whose file last commit is older than ~6 months

scores

Key	Interpretation
`dead_assets`, `test_coverage`, `documentation`, `ownership`, `contracts`, `cost_hotspots`	Higher is better (fewer heavy SQL patterns)
`complexity`	Higher = more complex (more structural/SQL weight)

findings categories (non-exhaustive): dead_asset, missing_test, weak_test_coverage, missing_documentation, complexity flags, no_owner, stale_asset, contract_asset_not_found, contract_missing_column, contract_extra_column, contract_type_mismatch, cost_hotspot.

analytics["cost_hotspot_analysis"] fields:

Field	Meaning
`cost_hotspot_score`	Same integer as `scores["cost_hotspots"]` (100 = no hotspots)
`total_pattern_instances`	Sum of pattern counts across flagged SQL assets
`max_weighted_impact`	Largest `pattern_count × (1 + 0.12 × min(50, downstream))`
`ranked`	Top assets by weighted impact (name, patterns, downstream count, score)

Use this shape for CI gates and dashboards.

Ownership

PipeScope resolves an owner per asset in this order: CODEOWNERS (path match, last matching line wins) → dbt meta.owner on models and source tables in schema.yml → git last commit author on that file. Synthetic SQL query-block assets are excluded from ownership scoring.

CODEOWNERS: place .github/CODEOWNERS or CODEOWNERS at the repository root (or under the scan root if not using git). Patterns follow GitHub-style globs (*, **, /).
dbt: under models: entries use meta.owner; under sources: → tables: use meta.owner per table.
Stale: stale_asset findings use the last commit timestamp from git log when the repo is available; assets outside git or without history are not flagged as stale by date.

Terminal mode prints an Ownership score panel alongside the other Rich panels.

Data contracts (ODCS)

PipeScope discovers YAML files that look like Open Data Contract Standard documents (for example dataContractSpecification plus dataset, or kind: DataContract with a top-level schema: list). Table and column identifiers are read from name, then physicalName, then logicalName when the earlier keys are absent. Each contract table with at least one column definition is compared to a matching asset by name (exact, case-insensitive, or last path segment such as schema.table → table).

Columns: Contract column names are compared to Asset.columns (from CREATE TABLE/VIEW SQL or dbt schema.yml). Missing or extra columns emit findings; extras are info severity.
Types: When both the contract and the asset declare a type (logicalType / physicalType / legacy type, vs dbt data_type or SQL CREATE column types), PipeScope normalizes families (e.g. int ↔ integer, varchar ↔ string) and flags warning mismatches.
Score: (compliant_contracts / total_contracts) * 100, where a contract is compliant when it resolves to an asset and produces no contract findings. Contracts with no column list are ignored for the denominator.

Terminal mode includes a Contract compliance panel.

Cost hotspots

PipeScope scans each .sql model/table/view asset (excluding synthetic query-block rows), reads the file once per path, and runs static checks from detect_cost_patterns() in sql_parser: SELECT_STAR, CROSS_JOIN, MISSING_WHERE_CLAUSE (DELETE/UPDATE without WHERE), SELECT_WITHOUT_WHERE (top-level SELECT with FROM but no WHERE), NO_LIMIT (top-level SELECT with FROM but no LIMIT/FETCH), plus MISSING_PARTITION_FILTER when an asset references a table tagged with partition_key in dbt meta (see below) and the WHERE clause does not reference that column.

Downstream weighting: For each flagged asset, PipeScope computes len(nx.descendants(graph, asset)) in the lineage graph and weighted impact = pattern_count × (1 + 0.12 × min(50, downstream_count)). The ranked list in analytics is sorted by this weighted impact descending.

dbt: Set meta.partition_key on a model or source table in schema.yml to tag that logical table name; the analyzer maps asset.name → partition column for partition checks.

Terminal mode includes a Cost hotspots panel.

CI

On GitHub, .github/workflows/ci.yml runs Ruff (ruff check .) and pytest with coverage (pytest --cov=pipescope) on Python 3.11 for every push and pull_request.

JSON gates (jq)

PipeScope does not exit with a non-zero status when findings are present. In CI, use --format json and assert on scores or findings (for example with jq, available on GitHub-hosted runners).

Minimum contract compliance score (90):

pipescope scan . --format json | jq -e '.scores.contracts >= 90'

Fail if any contract-related finding exists:

pipescope scan . --format json | jq -e '[.findings[] | select(.category | test("^contract_"))] | length == 0'

Multiple checks in one step (write JSON once):

pipescope scan . --format json > pipescope-scan.json
jq -e '.scores.contracts >= 90' pipescope-scan.json
jq -e '.scores.dead_assets >= 80' pipescope-scan.json
jq -e '[.findings[] | select(.category | test("^contract_"))] | length == 0' pipescope-scan.json

jq -e exits with status 1 when the filter yields false or null, which fails the shell step. On Windows without jq, use PowerShell (ConvertFrom-Json) or run these checks in GitHub Actions / WSL where jq is available.

License

MIT. See LICENSE.

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.0

Apr 11, 2026

This version

0.1.3

Apr 11, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lineagescope-0.1.3.tar.gz (177.6 kB view details)

Uploaded Apr 11, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

lineagescope-0.1.3-py3-none-any.whl (60.0 kB view details)

Uploaded Apr 11, 2026 Python 3

File details

Details for the file lineagescope-0.1.3.tar.gz.

File metadata

Download URL: lineagescope-0.1.3.tar.gz
Upload date: Apr 11, 2026
Size: 177.6 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for lineagescope-0.1.3.tar.gz
Algorithm	Hash digest
SHA256	`71654cd346d11373865cca5d44055b386f948f6e660a7299de1b27e648caae70`
MD5	`3fab741a2852a8b707c31fd757cba30c`
BLAKE2b-256	`152a55ee0197e80d8194e28afbef85a49cddf99316f1c6ee372f61e195d678eb`

See more details on using hashes here.

File details

Details for the file lineagescope-0.1.3-py3-none-any.whl.

File metadata

Download URL: lineagescope-0.1.3-py3-none-any.whl
Upload date: Apr 11, 2026
Size: 60.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.10

File hashes

Hashes for lineagescope-0.1.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a7cd20b5a578284f27a1a0d52d8573ab2613aca1f92a124bc0405e0b4ca4d5bd`
MD5	`4b70c949d9d05bb206505a3b3e52f08b`
BLAKE2b-256	`73c4665b5d3f9c5bc8d9e8ba72d3124ab864f17479070a1b6d126cef0a55ab3a`

See more details on using hashes here.

lineagescope 0.1.3

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Project description

PipeScope

Quickstart

Demo (terminal GIF)

Architecture

Requirements

Install

Setup

Development

Windows terminal

Usage

JSON output (--format json)

Ownership

Data contracts (ODCS)

Cost hotspots

CI

JSON gates (jq)

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

JSON output (`--format json`)