catalyst-column-lineage

Column-level data lineage and SVG/DOT/HTML visualisation from Apache Spark Catalyst LogicalPlan JSON dumps.

Pure-Python, dependency-free column-level data lineage extractor for Apache Spark. It consumes a Catalyst LogicalPlan JSON dump (the format emitted by LogicalPlan.toJSON / JsonProtocol) and produces:

  • a structured LineageResult data model (per-column sources and the reconstructed expression that produced each column),
  • a self-contained SVG visualisation,
  • a Graphviz DOT description,
  • an HTML page embedding both the SVG and the JSON dump,
  • a compact plain-text summary.

The package walks the plan tree bottom-up, resolves Catalyst ExprIds through CTERelationDef / CTERelationRef remappings, and renders SQL-like expressions for aliases, casts, literals, aggregates, window functions, generators, and any unknown expression via a generic fallback.
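The ExprId remapping can be pictured as a substitution table: each CTERelationRef re-labels the attribute ids of its CTERelationDef, and a lookup chases the table until it reaches an id owned by a base relation. A minimal sketch of that idea in plain Python (illustrative only — `resolve_expr_id` and the integer ids below are not the package's actual API):

```python
def resolve_expr_id(expr_id, remap):
    """Follow CTE ref -> def remappings until an id with no further mapping remains."""
    seen = set()
    while expr_id in remap and expr_id not in seen:
        seen.add(expr_id)          # guard against accidental cycles
        expr_id = remap[expr_id]
    return expr_id

# CTERelationRef attribute #10 maps to the CTERelationDef's attribute #3,
# which in turn aliases the base relation's attribute #1.
remap = {10: 3, 3: 1}
print(resolve_expr_id(10, remap))   # resolves through two hops to 1
print(resolve_expr_id(7, remap))    # ids outside any CTE pass through unchanged
```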

Install

pip install catalyst-column-lineage

The import name is spark_lineage (the distribution name is catalyst-column-lineage):

import spark_lineage

No runtime dependencies. Python 3.9+.

Quick start

Every entry point accepts the plan in any of these shapes:

Input                 Notes
str / bytes           JSON text
list[dict]            Spark's native pre-order plan list
dict (single node)    one plan node with a "class" field
dict (wrapper)        {"plan": [...]}, {"logicalPlan": [...]} or {"nodes": [...]}
LineageResult         already analysed; returned unchanged

import json
from spark_lineage import build_lineage, to_svg, to_html

with open("plan.json") as f:
    plan = json.load(f)

result = build_lineage(plan)
print(result.plan_kind)          # QUERY | DDL | DML | NAMESPACE | UNKNOWN
print(result.operation)          # short class name of the root plan
print(result.target)             # TargetTable for DDL/DML, else None
print(result.source_tables)      # ['spark_catalog.default.employees', ...]

for col in result.columns:
    print(col.name, col.transformation_type, col.transformation)
    for src in col.sources:
        print("    <-", src.fqn)

# One-shot helpers — pass the same plan in any supported form:
to_svg(plan, path="lineage.svg")
to_html(plan, path="lineage.html", title="Employees lineage")
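Internally, accepting all of the input shapes listed above boils down to a small normalisation step. A rough sketch of such logic (a hypothetical `normalize_plan` helper, not the package's actual code; the LineageResult passthrough case is omitted):

```python
import json

def normalize_plan(plan):
    """Reduce any accepted input shape to a pre-order list of plan-node dicts."""
    if isinstance(plan, (str, bytes)):
        plan = json.loads(plan)
    if isinstance(plan, dict):
        for key in ("plan", "logicalPlan", "nodes"):
            if key in plan:
                return plan[key]     # unwrap {"plan": [...]}-style wrappers
        return [plan]                # a single node with a "class" field
    return plan                      # already a list of nodes

assert normalize_plan('[{"class": "Project"}]') == [{"class": "Project"}]
assert normalize_plan({"plan": [{"class": "Sort"}]}) == [{"class": "Sort"}]
assert normalize_plan({"class": "Filter"}) == [{"class": "Filter"}]
```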

Methods on LineageResult

result = build_lineage(plan)

result.to_dict()            # plain dict
result.to_json(indent=2)    # str (JSON)
result.to_text()            # str (human-readable summary)
result.to_svg()             # str (SVG)
result.to_dot()             # str (Graphviz DOT)
result.to_html()            # str (HTML embedding the SVG)

# Generic save: format inferred from extension (.svg/.dot/.html/.json/.txt).
result.save("lineage.svg")
result.save("lineage.html")
result.save("forced.bin", format="dot")

Standalone helpers

from spark_lineage import to_json, to_text, to_svg, to_dot, to_html, save

to_json(plan)                       # dict
to_text(plan)                       # str
to_svg(plan, path="out.svg")
to_dot(plan, path="out.dot")
to_html(plan, path="out.html")
save(plan, "out/lineage.svg")       # extension-driven

Low-level building blocks

from spark_lineage import (
    plan_from_json, analyze,                              # parse + analyse separately
    render_svg, render_dot, render_html, render_text,     # operate on LineageResult
)

node    = plan_from_json(plan)        # PlanNode tree
result  = analyze(node)               # LineageResult
svg     = render_svg(result)

Command-line interface

The package installs two equivalent console scripts: catalyst-column-lineage and spark-lineage. You can also run it as a module: python -m spark_lineage.

# Single file → stdout
spark-lineage plan.json

# Whole directory → per-file *.lineage.json under lineage_out/
spark-lineage --dir test_json --out lineage_out

# Other formats
spark-lineage --dir test_json --out lineage_text --format text
spark-lineage --dir test_json --out lineage_svg  --format svg
spark-lineage --dir test_json --out lineage_html --format html
spark-lineage --dir test_json --out lineage_dot  --format dot

SVG visualisation

--format svg (or to_svg(...)) produces a self-contained SVG (no external CSS, fonts, or scripts) that lays out source tables on the left and output / CTE panels on the right, connected by colour-coded Bézier edges:

┌───────────────────────────┐                         ┌────────────────────────────┐
│ spark_catalog.default.    │                         │ Output       QUERY :: Sort │
│  employees                │                         │ ────────────────────────── │
│  active                   ├──── IDENTITY ──────────►│ IDENTITY  emp_id :: int    │
│  dept_id                  ├╮                        │           ...              │
│  emp_id                   ├╪──── GROUPING ─────────►│ GROUPING  dept_id          │
│  emp_name                 ├╯                        │ AGGREGATE total_salary     │
│  hire_date                │  ╲ AGGREGATE ──────────►│           SUM(...)         │
│  salary                   ├───╯                     │           ...              │
└───────────────────────────┘                         └────────────────────────────┘

Edge / badge colour reflects transformation_type:

Colour   Meaning
Gray     IDENTITY (passthrough)
Blue     EXPRESSION (computed)
Orange   AGGREGATE (SUM/AVG/...)
Purple   WINDOW (OVER (...))
Green    LITERAL
Red      GROUPING (GROUPING SETS slot)
Teal     GENERATOR (EXPLODE/...)
Slate    OUTER_REFERENCE (correlated reference)

Every output row carries an SVG <title> tooltip with the untruncated expression and full source list, so opening the SVG in a browser gives full on-hover detail even when the inline rendering is truncated.
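The tooltip mechanism relies only on standard SVG: a <title> child of an element is rendered by browsers as hover text. A minimal illustration of the pattern (not the package's renderer; `row_with_tooltip` is a made-up name):

```python
from xml.sax.saxutils import escape

def row_with_tooltip(name, expression, sources, max_len=30):
    """One output row: truncated inline label, full detail inside an SVG <title>."""
    inline = expression if len(expression) <= max_len else expression[:max_len - 3] + "..."
    tooltip = escape(name + " = " + expression + "\nsources:\n  " + "\n  ".join(sources))
    return ("<g><title>" + tooltip + "</title>"
            "<text>" + escape(name + ": " + inline) + "</text></g>")

row = row_with_tooltip(
    "total_salary",
    "SUM(CAST(salary AS DOUBLE)) FILTER (WHERE active)",
    ["spark_catalog.default.employees.salary"],
)
```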

Output schema

LineageResult.to_dict() returns:

{
  "plan_kind": "QUERY",                 // QUERY | DDL | DML | NAMESPACE | UNKNOWN
  "operation": "Sort",                  // short class of the root plan node
  "target_table": null,                 // {catalog, database, table, fqn} for DDL/DML
  "source_tables": ["spark_catalog.default.employees"],
  "columns": [
    {
      "ordinal": 0,
      "name": "dept_id",
      "data_type": "int",
      "transformation_type": "IDENTITY", // IDENTITY | EXPRESSION | AGGREGATE | WINDOW | LITERAL | GROUPING | GENERATOR | OUTER_REFERENCE
      "transformation": "dept_id",
      "sources": [
        { "catalog": "spark_catalog", "database": "default",
          "table": "employees", "column": "dept_id",
          "fqn": "spark_catalog.default.employees.dept_id" }
      ]
    }
  ],
  "ctes": { /* same shape as columns, keyed by CTE name */ },
  "notes":  []
}
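A common way to consume this schema is to flatten it into (source column → output column) edges, e.g. for loading into a lineage store. Given a dict of the shape above, a small sketch:

```python
def lineage_edges(doc):
    """Yield (source_fqn, output_column, transformation_type) triples."""
    for col in doc["columns"]:
        for src in col["sources"]:
            yield src["fqn"], col["name"], col["transformation_type"]

doc = {
    "columns": [
        {"name": "dept_id", "transformation_type": "IDENTITY",
         "sources": [{"fqn": "spark_catalog.default.employees.dept_id"}]},
    ],
}
print(list(lineage_edges(doc)))
```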

Capabilities

The analyser handles all of the common Catalyst node types:

  • Leaves: LogicalRelation, HiveTableRelation, UnresolvedRelation, OneRowRelation, LocalRelation, Range.
  • Schema-changing: Project, Aggregate, Window, Generate, Expand (GROUPING SETS / ROLLUP / CUBE), Pivot, Unpivot.
  • Single-child passthroughs: Filter, Sort, Distinct, Repartition*, Sample, Limit/GlobalLimit/LocalLimit, RebalancePartitions, Coalesce, etc.
  • Multi-input: Join (Inner / LeftOuter / RightOuter / FullOuter / LeftSemi / LeftAnti / Cross), Union, Intersect, Except.
  • CTEs: WithCTE, CTERelationDef, CTERelationRef (with ExprId remapping).
  • DDL / DML: CreateTable*, ReplaceTable*, CreateView*, AlterView*, InsertIntoStatement, AppendData, OverwriteByExpression, MergeIntoTable, UpdateTable, DeleteFromTable, DropTable, RenameTable, TruncateTable, namespace ops.
  • Expressions: Alias, AttributeReference, Literal, Cast, If/CaseWhen, WindowExpression + WindowSpecDefinition + SpecifiedWindowFrame, AggregateExpression, OuterReference, generic fallback for arbitrary functions.

Unknown nodes are handled gracefully: lineage is propagated through them as a passthrough where the schema is preserved, and a note is added so nothing silently goes missing.

Versioning & stability

Semantic versioning. The public API consists of every name re-exported from the top-level spark_lineage module — see spark_lineage.__all__.

License

MIT — see LICENSE.
