
Column-level data lineage and SVG/DOT/HTML visualisation from Apache Spark Catalyst LogicalPlan JSON dumps.


catalyst-column-lineage


Pure-Python, dependency-free column-level data lineage extractor for Apache Spark. It consumes a Catalyst LogicalPlan JSON dump (the format emitted by LogicalPlan.toJSON / JsonProtocol) and produces:

  • a structured LineageResult data model (per-column sources and the reconstructed expression that produced each column),
  • a self-contained SVG visualisation,
  • a Graphviz DOT description,
  • an HTML page embedding both the SVG and the JSON dump,
  • a compact plain-text summary.

The package walks the plan tree bottom-up, resolves Catalyst ExprIds through CTERelationDef / CTERelationRef remappings, and renders SQL-like expressions for aliases, casts, literals, aggregates, window functions, generators, and any unknown expression via a generic fallback.
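The ExprId remapping step can be illustrated with a simplified sketch (not the package's internals; Catalyst ExprIds are reduced here to bare integers):

```python
# Simplified illustration: a CTERelationRef column points at a CTERelationDef
# output, which in turn points at a base-relation attribute. Resolution just
# follows the remap chain to its end (with a guard against cycles).
def resolve_expr_id(expr_id, remap):
    seen = set()
    while expr_id in remap and expr_id not in seen:
        seen.add(expr_id)
        expr_id = remap[expr_id]
    return expr_id

remap = {103: 7, 7: 1}  # ref column 103 -> CTE definition 7 -> base column 1
assert resolve_expr_id(103, remap) == 1
```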

Install

pip install catalyst-column-lineage

The import name is spark_lineage (the distribution name is catalyst-column-lineage):

import spark_lineage

No runtime dependencies. Python 3.9+.

Quick start

Every entry point accepts the plan in any of these shapes:

  • str / bytes: JSON text
  • list[dict]: Spark's native pre-order plan list
  • dict (single node): one plan node with a "class" field
  • dict wrapper ({"plan": [...]}, {"logicalPlan": [...]} or {"nodes": [...]}): the wrapped plan list
  • LineageResult: already analysed; returned unchanged
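Spark's pre-order plan list flattens the tree so that each node is immediately followed by its children. A minimal illustration of rebuilding such a list (the two-field nodes are simplified stand-ins; a real LogicalPlan.toJSON dump carries many more fields per node):

```python
# Hand-written stand-in for a pre-order plan list: a Project over a LogicalRelation.
plan = [
    {"class": "org.apache.spark.sql.catalyst.plans.logical.Project", "num-children": 1},
    {"class": "org.apache.spark.sql.execution.datasources.LogicalRelation", "num-children": 0},
]

def rebuild(nodes):
    """Consume a pre-order node list front-to-back and rebuild the tree."""
    it = iter(nodes)
    def take():
        node = dict(next(it))
        node["children"] = [take() for _ in range(node.pop("num-children"))]
        return node
    return take()

tree = rebuild(plan)
assert tree["children"][0]["class"].endswith("LogicalRelation")
```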
import json
from spark_lineage import build_lineage, to_svg, to_html

with open("plan.json") as f:
    plan = json.load(f)

result = build_lineage(plan)
print(result.plan_kind)          # QUERY | DDL | DML | NAMESPACE
print(result.operation)          # short class name of the root plan
print(result.target)             # TargetTable for DDL/DML, else None
print(result.source_tables)      # ['spark_catalog.default.employees', ...]

for col in result.columns:
    print(col.name, col.transformation_type, col.transformation)
    for src in col.sources:
        print("    <-", src.fqn)

# One-shot helpers — pass the same plan in any supported form:
to_svg(plan, path="lineage.svg")
to_html(plan, path="lineage.html", title="Employees lineage")

Methods on LineageResult

result = build_lineage(plan)

result.to_dict()            # plain dict
result.to_json(indent=2)    # str (JSON)
result.to_text()            # str (human-readable summary)
result.to_svg()             # str (SVG)
result.to_dot()             # str (Graphviz DOT)
result.to_html()            # str (HTML embedding the SVG)

# Generic save: format inferred from extension (.svg/.dot/.html/.json/.txt).
result.save("lineage.svg")
result.save("lineage.html")
result.save("forced.bin", format="dot")
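The extension-driven inference behind save can be sketched like this (an illustration of the behaviour described above, not the package's actual code; infer_format is a hypothetical helper):

```python
from pathlib import Path

# Hypothetical sketch: map a file extension to an output format, letting an
# explicit format= argument override the inference (as save() allows above).
FORMATS = {".svg": "svg", ".dot": "dot", ".html": "html", ".json": "json", ".txt": "text"}

def infer_format(path, format=None):
    if format is not None:
        return format
    ext = Path(path).suffix.lower()
    if ext not in FORMATS:
        raise ValueError(f"cannot infer output format from extension {ext!r}")
    return FORMATS[ext]
```

With this logic, saving "forced.bin" with format="dot" succeeds because the explicit format wins, while an unknown extension without format= raises.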

Standalone helpers

from spark_lineage import to_json, to_text, to_svg, to_dot, to_html, save

to_json(plan)                       # dict
to_text(plan)                       # str
to_svg(plan, path="out.svg")
to_dot(plan, path="out.dot")
to_html(plan, path="out.html")
save(plan, "out/lineage.svg")       # extension-driven

Low-level building blocks

from spark_lineage import (
    plan_from_json, analyze,                              # parse + analyse separately
    render_svg, render_dot, render_html, render_text,     # operate on LineageResult
)

node    = plan_from_json(plan)        # PlanNode tree
result  = analyze(node)               # LineageResult
svg     = render_svg(result)

Command-line interface

The package installs two equivalent console scripts: catalyst-column-lineage and spark-lineage. You can also run it as a module: python -m spark_lineage.

# Single file → stdout
spark-lineage plan.json

# Whole directory → per-file *.lineage.json under lineage_out/
spark-lineage --dir test_json --out lineage_out

# Other formats
spark-lineage --dir test_json --out lineage_text --format text
spark-lineage --dir test_json --out lineage_svg  --format svg
spark-lineage --dir test_json --out lineage_html --format html
spark-lineage --dir test_json --out lineage_dot  --format dot

SVG visualisation

--format svg (or to_svg(...)) produces a self-contained SVG (no external CSS / fonts / scripts) that lays out source tables on the left and output / CTE panels on the right with colour-coded Bézier edges:

┌───────────────────────────┐                         ┌────────────────────────────┐
│ spark_catalog.default.    │                         │ Output       QUERY :: Sort │
│  employees                │                         │ ────────────────────────── │
│  active                   ├──── IDENTITY ──────────►│ IDENTITY  emp_id :: int    │
│  dept_id                  ├╮                        │           ...              │
│  emp_id                   ├╪──── GROUPING ─────────►│ GROUPING  dept_id          │
│  emp_name                 ├╯                        │ AGGREGATE total_salary     │
│  hire_date                │  ╲ AGGREGATE ──────────►│           SUM(...)         │
│  salary                   ├───╯                     │           ...              │
└───────────────────────────┘                         └────────────────────────────┘

Edge / badge colour reflects transformation_type:

Colour   Meaning
Gray     IDENTITY (passthrough)
Blue     EXPRESSION (computed)
Orange   AGGREGATE (SUM/AVG/...)
Purple   WINDOW (OVER (...))
Green    LITERAL
Red      GROUPING (GROUPING SETS slot)
Teal     GENERATOR (EXPLODE/...)
Slate    OUTER_REFERENCE (correlated ref)

Every output row carries an SVG <title> tooltip with the untruncated expression and full source list, so opening the SVG in a browser gives full on-hover detail even when the inline rendering is truncated.
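Because the tooltips are ordinary SVG <title> elements, any XML parser can read them back. A minimal sketch on a hand-written fragment (standing in for real to_svg() output):

```python
import xml.etree.ElementTree as ET

# Hand-written fragment using the same <title> convention; not actual package output.
svg = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<g><title>AGGREGATE total_salary = SUM(salary)</title></g>'
    '</svg>'
)

# <title> elements live in the SVG namespace, so iterate with the qualified tag.
SVG_NS = "{http://www.w3.org/2000/svg}"
tooltips = [t.text for t in ET.fromstring(svg).iter(SVG_NS + "title")]
```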

Output schema

LineageResult.to_dict() returns:

{
  "plan_kind": "QUERY",                 // QUERY | DDL | DML | NAMESPACE | UNKNOWN
  "operation": "Sort",                  // short class of the root plan node
  "target_table": null,                 // {catalog, database, table, fqn} for DDL/DML
  "source_tables": ["spark_catalog.default.employees"],
  "columns": [
    {
      "ordinal": 0,
      "name": "dept_id",
      "data_type": "int",
      "transformation_type": "IDENTITY", // IDENTITY | EXPRESSION | AGGREGATE | WINDOW | LITERAL | GROUPING | GENERATOR | OUTER_REFERENCE
      "transformation": "dept_id",
      "sources": [
        { "catalog": "spark_catalog", "database": "default",
          "table": "employees", "column": "dept_id",
          "fqn": "spark_catalog.default.employees.dept_id" }
      ]
    }
  ],
  "ctes": { /* same shape as columns, keyed by CTE name */ },
  "notes":  []
}
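A payload in this shape is straightforward to post-process, for example by flattening it into (source column, output column) edges for a lineage store. The sample dict below simply mirrors the schema above:

```python
# Sample payload mirroring the documented LineageResult.to_dict() shape.
result = {
    "plan_kind": "QUERY",
    "operation": "Sort",
    "target_table": None,
    "source_tables": ["spark_catalog.default.employees"],
    "columns": [
        {
            "ordinal": 0,
            "name": "dept_id",
            "data_type": "int",
            "transformation_type": "IDENTITY",
            "transformation": "dept_id",
            "sources": [
                {"catalog": "spark_catalog", "database": "default",
                 "table": "employees", "column": "dept_id",
                 "fqn": "spark_catalog.default.employees.dept_id"},
            ],
        },
    ],
    "ctes": {},
    "notes": [],
}

# Flatten per-column sources into (source_fqn, output_column) edges.
edges = [
    (src["fqn"], col["name"])
    for col in result["columns"]
    for src in col["sources"]
]
assert edges == [("spark_catalog.default.employees.dept_id", "dept_id")]
```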

Capabilities

The analyser covers the common Catalyst node types:

  • Leaves: LogicalRelation, HiveTableRelation, UnresolvedRelation, OneRowRelation, LocalRelation, Range.
  • Schema-changing: Project, Aggregate, Window, Generate, Expand (GROUPING SETS / ROLLUP / CUBE), Pivot, Unpivot.
  • Single-child passthroughs: Filter, Sort, Distinct, Repartition*, Sample, Limit/GlobalLimit/LocalLimit, RebalancePartitions, Coalesce, etc.
  • Multi-input: Join (Inner / LeftOuter / RightOuter / FullOuter / LeftSemi / LeftAnti / Cross), Union, Intersect, Except.
  • CTEs: WithCTE, CTERelationDef, CTERelationRef (with ExprId remapping).
  • DDL / DML: CreateTable*, ReplaceTable*, CreateView*, AlterView*, InsertIntoStatement, AppendData, OverwriteByExpression, MergeIntoTable, UpdateTable, DeleteFromTable, DropTable, RenameTable, TruncateTable, namespace ops.
  • Expressions: Alias, AttributeReference, Literal, Cast, If/CaseWhen, WindowExpression + WindowSpecDefinition + SpecifiedWindowFrame, AggregateExpression, OuterReference, generic fallback for arbitrary functions.

Unknown nodes are handled gracefully: lineage is propagated through them as a passthrough where the schema is preserved, and a note is added so nothing silently goes missing.

Versioning & stability

Semantic versioning. The public API consists of every name re-exported from the top-level spark_lineage module — see spark_lineage.__all__.

License

MIT — see LICENSE.
