
Column-level data lineage and SVG/DOT/HTML visualisation from Apache Spark Catalyst LogicalPlan JSON dumps.


catalyst-column-lineage


Pure-Python, dependency-free column-level data lineage extractor for Apache Spark. It consumes a Catalyst LogicalPlan JSON dump (the format emitted by LogicalPlan.toJSON / JsonProtocol) and produces:

  • a structured LineageResult data model (per-column sources and the reconstructed expression that produced each column),
  • a self-contained SVG visualisation,
  • a Graphviz DOT description,
  • an HTML page embedding both the SVG and the JSON dump,
  • a compact plain-text summary.

The package walks the plan tree bottom-up, resolves Catalyst ExprIds through CTERelationDef / CTERelationRef remappings, and renders SQL-like expressions for aliases, casts, literals, aggregates, window functions, generators, and any unknown expression via a generic fallback.
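The ExprId remapping step can be illustrated with a simplified sketch (not the package's internals; Catalyst ExprIds are reduced here to bare integers):

```python
# Simplified illustration: a CTERelationRef column points at a CTERelationDef
# output, which in turn points at a base-relation attribute. Resolution just
# follows the remap chain to its end (with a guard against cycles).
def resolve_expr_id(expr_id, remap):
    seen = set()
    while expr_id in remap and expr_id not in seen:
        seen.add(expr_id)
        expr_id = remap[expr_id]
    return expr_id

remap = {103: 7, 7: 1}  # ref column 103 -> CTE definition 7 -> base column 1
assert resolve_expr_id(103, remap) == 1
```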

Install

pip install catalyst-column-lineage

The import name is spark_lineage (the distribution name is catalyst-column-lineage):

import spark_lineage

No runtime dependencies. Python 3.9+.

Quick start

Every entry point accepts the plan in any of these shapes:

  • str / bytes: JSON text
  • list[dict]: Spark's native pre-order plan list
  • dict (single node): one plan node with a "class" field
  • dict wrapper ({"plan": [...]}, {"logicalPlan": [...]} or {"nodes": [...]}): the wrapped plan list
  • LineageResult: already analysed; returned unchanged
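Spark's pre-order plan list flattens the tree so that each node is immediately followed by its children. A minimal illustration of rebuilding such a list (the two-field nodes are simplified stand-ins; a real LogicalPlan.toJSON dump carries many more fields per node):

```python
# Hand-written stand-in for a pre-order plan list: a Project over a LogicalRelation.
plan = [
    {"class": "org.apache.spark.sql.catalyst.plans.logical.Project", "num-children": 1},
    {"class": "org.apache.spark.sql.execution.datasources.LogicalRelation", "num-children": 0},
]

def rebuild(nodes):
    """Consume a pre-order node list front-to-back and rebuild the tree."""
    it = iter(nodes)
    def take():
        node = dict(next(it))
        node["children"] = [take() for _ in range(node.pop("num-children"))]
        return node
    return take()

tree = rebuild(plan)
assert tree["children"][0]["class"].endswith("LogicalRelation")
```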
import json
from spark_lineage import build_lineage, to_svg, to_html

with open("plan.json") as f:
    plan = json.load(f)

result = build_lineage(plan)
print(result.plan_kind)          # QUERY | DDL | DML | NAMESPACE
print(result.operation)          # short class name of the root plan
print(result.target)             # TargetTable for DDL/DML, else None
print(result.source_tables)      # ['spark_catalog.default.employees', ...]

for col in result.columns:
    print(col.name, col.transformation_type, col.transformation)
    for src in col.sources:
        print("    <-", src.fqn)

# One-shot helpers — pass the same plan in any supported form:
to_svg(plan, path="lineage.svg")
to_html(plan, path="lineage.html", title="Employees lineage")

Methods on LineageResult

result = build_lineage(plan)

result.to_dict()            # plain dict
result.to_json(indent=2)    # str (JSON)
result.to_text()            # str (human-readable summary)
result.to_svg()             # str (SVG)
result.to_dot()             # str (Graphviz DOT)
result.to_html()            # str (HTML embedding the SVG)

# Generic save: format inferred from extension (.svg/.dot/.html/.json/.txt).
result.save("lineage.svg")
result.save("lineage.html")
result.save("forced.bin", format="dot")
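The extension-driven inference behind save can be sketched like this (an illustration of the behaviour described above, not the package's actual code; infer_format is a hypothetical helper):

```python
from pathlib import Path

# Hypothetical sketch: map a file extension to an output format, letting an
# explicit format= argument override the inference (as save() allows above).
FORMATS = {".svg": "svg", ".dot": "dot", ".html": "html", ".json": "json", ".txt": "text"}

def infer_format(path, format=None):
    if format is not None:
        return format
    ext = Path(path).suffix.lower()
    if ext not in FORMATS:
        raise ValueError(f"cannot infer output format from extension {ext!r}")
    return FORMATS[ext]
```

With this logic, saving "forced.bin" with format="dot" succeeds because the explicit format wins, while an unknown extension without format= raises.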

Standalone helpers

from spark_lineage import to_json, to_text, to_svg, to_dot, to_html, save

to_json(plan)                       # dict
to_text(plan)                       # str
to_svg(plan, path="out.svg")
to_dot(plan, path="out.dot")
to_html(plan, path="out.html")
save(plan, "out/lineage.svg")       # extension-driven

Low-level building blocks

from spark_lineage import (
    plan_from_json, analyze,                              # parse + analyse separately
    render_svg, render_dot, render_html, render_text,     # operate on LineageResult
)

node    = plan_from_json(plan)        # PlanNode tree
result  = analyze(node)               # LineageResult
svg     = render_svg(result)

Command-line interface

The package installs two equivalent console scripts: catalyst-column-lineage and spark-lineage. You can also run it as a module: python -m spark_lineage.

# Single file → stdout
spark-lineage plan.json

# Whole directory → per-file *.lineage.json under lineage_out/
spark-lineage --dir test_json --out lineage_out

# Other formats
spark-lineage --dir test_json --out lineage_text --format text
spark-lineage --dir test_json --out lineage_svg  --format svg
spark-lineage --dir test_json --out lineage_html --format html
spark-lineage --dir test_json --out lineage_dot  --format dot

SVG visualisation

--format svg (or to_svg(...)) produces a self-contained SVG (no external CSS / fonts / scripts) that lays out source tables on the left and output / CTE panels on the right with colour-coded Bézier edges:

┌───────────────────────────┐                         ┌────────────────────────────┐
│ spark_catalog.default.    │                         │ Output       QUERY :: Sort │
│  employees                │                         │ ────────────────────────── │
│  active                   ├──── IDENTITY ──────────►│ IDENTITY  emp_id :: int    │
│  dept_id                  ├╮                        │           ...              │
│  emp_id                   ├╪──── GROUPING ─────────►│ GROUPING  dept_id          │
│  emp_name                 ├╯                        │ AGGREGATE total_salary     │
│  hire_date                │  ╲ AGGREGATE ──────────►│           SUM(...)         │
│  salary                   ├───╯                     │           ...              │
└───────────────────────────┘                         └────────────────────────────┘

Edge / badge colour reflects transformation_type:

Colour   Meaning
Gray     IDENTITY (passthrough)
Blue     EXPRESSION (computed)
Orange   AGGREGATE (SUM/AVG/...)
Purple   WINDOW (OVER (...))
Green    LITERAL
Red      GROUPING (GROUPING SETS slot)
Teal     GENERATOR (EXPLODE/...)
Slate    OUTER_REFERENCE (correlated ref)

Every output row carries an SVG <title> tooltip with the untruncated expression and full source list, so opening the SVG in a browser gives full on-hover detail even when the inline rendering is truncated.
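Because the tooltips are ordinary SVG <title> elements, any XML parser can read them back. A minimal sketch on a hand-written fragment (standing in for real to_svg() output):

```python
import xml.etree.ElementTree as ET

# Hand-written fragment using the same <title> convention; not actual package output.
svg = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<g><title>AGGREGATE total_salary = SUM(salary)</title></g>'
    '</svg>'
)

# <title> elements live in the SVG namespace, so iterate with the qualified tag.
SVG_NS = "{http://www.w3.org/2000/svg}"
tooltips = [t.text for t in ET.fromstring(svg).iter(SVG_NS + "title")]
```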

Output schema

LineageResult.to_dict() returns:

{
  "plan_kind": "QUERY",                 // QUERY | DDL | DML | NAMESPACE | UNKNOWN
  "operation": "Sort",                  // short class of the root plan node
  "target_table": null,                 // {catalog, database, table, fqn} for DDL/DML
  "source_tables": ["spark_catalog.default.employees"],
  "columns": [
    {
      "ordinal": 0,
      "name": "dept_id",
      "data_type": "int",
      "transformation_type": "IDENTITY", // IDENTITY | EXPRESSION | AGGREGATE | WINDOW | LITERAL | GROUPING | GENERATOR | OUTER_REFERENCE
      "transformation": "dept_id",
      "sources": [
        { "catalog": "spark_catalog", "database": "default",
          "table": "employees", "column": "dept_id",
          "fqn": "spark_catalog.default.employees.dept_id" }
      ]
    }
  ],
  "ctes": { /* same shape as columns, keyed by CTE name */ },
  "notes":  []
}
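A payload in this shape is straightforward to post-process, for example by flattening it into (source column, output column) edges for a lineage store. The sample dict below simply mirrors the schema above:

```python
# Sample payload mirroring the documented LineageResult.to_dict() shape.
result = {
    "plan_kind": "QUERY",
    "operation": "Sort",
    "target_table": None,
    "source_tables": ["spark_catalog.default.employees"],
    "columns": [
        {
            "ordinal": 0,
            "name": "dept_id",
            "data_type": "int",
            "transformation_type": "IDENTITY",
            "transformation": "dept_id",
            "sources": [
                {"catalog": "spark_catalog", "database": "default",
                 "table": "employees", "column": "dept_id",
                 "fqn": "spark_catalog.default.employees.dept_id"},
            ],
        },
    ],
    "ctes": {},
    "notes": [],
}

# Flatten per-column sources into (source_fqn, output_column) edges.
edges = [
    (src["fqn"], col["name"])
    for col in result["columns"]
    for src in col["sources"]
]
assert edges == [("spark_catalog.default.employees.dept_id", "dept_id")]
```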

Capabilities

The analyser covers the common Catalyst node types:

  • Leaves: LogicalRelation, HiveTableRelation, UnresolvedRelation, OneRowRelation, LocalRelation, Range.
  • Schema-changing: Project, Aggregate, Window, Generate, Expand (GROUPING SETS / ROLLUP / CUBE), Pivot, Unpivot.
  • Single-child passthroughs: Filter, Sort, Distinct, Repartition*, Sample, Limit/GlobalLimit/LocalLimit, RebalancePartitions, Coalesce, etc.
  • Multi-input: Join (Inner / LeftOuter / RightOuter / FullOuter / LeftSemi / LeftAnti / Cross), Union, Intersect, Except.
  • CTEs: WithCTE, CTERelationDef, CTERelationRef (with ExprId remapping).
  • DDL / DML: CreateTable*, ReplaceTable*, CreateView*, AlterView*, InsertIntoStatement, AppendData, OverwriteByExpression, MergeIntoTable, UpdateTable, DeleteFromTable, DropTable, RenameTable, TruncateTable, namespace ops.
  • Expressions: Alias, AttributeReference, Literal, Cast, If/CaseWhen, WindowExpression + WindowSpecDefinition + SpecifiedWindowFrame, AggregateExpression, OuterReference, generic fallback for arbitrary functions.

Unknown nodes are handled gracefully: lineage is propagated through them as a passthrough where the schema is preserved, and a note is added so nothing silently goes missing.

Versioning & stability

Semantic versioning. The public API consists of every name re-exported from the top-level spark_lineage module — see spark_lineage.__all__.

License

MIT — see LICENSE.
