Extract column-level lineage from Polars LazyFrame transformations.

polars-lineage

Extract column-level lineage from Polars LazyFrame transformations and emit deterministic lineage artifacts in multiple formats.

Install and Setup

Install the core package (Python API only):

pip install polars-lineage

Install with CLI support:

pip install "polars-lineage[cli]"

For local development:

uv sync --dev

Python API (LazyFrame First)

import polars as pl
from polars_lineage import extract_lazyframe_lineage

lazyframe = pl.DataFrame({"a": [1, 2], "b": [3, 4]}).lazy().select(
    [
        pl.col("a").alias("x"),
        (pl.col("a") + pl.col("b")).alias("sum"),
    ]
)

payloads = extract_lazyframe_lineage(
    lazyframe,
    {
        "sources": {"orders": "svc.db.raw.orders"},
        "destination_table": "svc.db.curated.metrics",
    },
)

print(payloads)

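extract_lazyframe_lineage(...) remains backward-compatible and returns OpenMetadata payloads by default.

For format-aware output, use extract_lazyframe_lineage_formatted(...) with output_format; to get the typed document directly, use extract_lazyframe_lineage_document(...):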

import polars as pl
from polars_lineage import (
    LineageDocument,
    extract_lazyframe_lineage_document,
    extract_lazyframe_lineage_formatted,
)

lazyframe = pl.DataFrame({"a": [1], "b": [2]}).lazy().select(
    [(pl.col("a") + pl.col("b")).alias("sum")]
)

json_document: LineageDocument = extract_lazyframe_lineage_document(
    lazyframe,
    {
        "sources": {"orders": "svc.db.raw.orders"},
        "destination_table": "svc.db.curated.metrics",
    },
)

markdown_report = extract_lazyframe_lineage_formatted(
    lazyframe,
    {
        "sources": {"orders": "svc.db.raw.orders"},
        "destination_table": "svc.db.curated.metrics",
    },
    output_format="markdown",
)

print(json_document.model_dump())
print(markdown_report)

If you want a strongly typed Pydantic model for consumer code, use extract_lazyframe_lineage_document(...):

import polars as pl
from polars_lineage import LineageDocument, extract_lazyframe_lineage_document

lazyframe = pl.DataFrame({"a": [1]}).lazy().select([pl.col("a").alias("x")])

document: LineageDocument = extract_lazyframe_lineage_document(
    lazyframe,
    {
        "sources": {"orders": "svc.db.raw.orders"},
        "destination_table": "svc.db.curated.metrics",
    },
)

print(document.model_dump())

Example with multiple input sources (join):

import polars as pl
from polars_lineage import extract_lazyframe_lineage

left = pl.DataFrame({"id": [1, 2], "a": [10, 20]}).lazy()
right = pl.DataFrame({"id": [1, 2], "b": [3, 4]}).lazy()

lazyframe = left.join(right, on="id", how="left").with_columns(
    (pl.col("a") + pl.col("b")).alias("total")
)

payloads = extract_lazyframe_lineage(
    lazyframe,
    {
        "sources": {
            "left": "svc.db.raw.left_table",
            "right": "svc.db.raw.right_table",
        },
        "destination_table": "svc.db.curated.joined_metrics",
    },
)

print(payloads)

The mapping argument can be either:

  • a MappingConfig
  • a dict with sources and destination_table
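
When the dict form is used, it can help to define the mapping once and check its shape before passing it to the extraction functions. The sketch below is a minimal illustration that uses only the standard library; MappingDict and check_mapping are hypothetical names for this example and are not part of polars-lineage.

```python
from typing import TypedDict


class MappingDict(TypedDict):
    """Hypothetical type describing the dict form of the mapping argument."""

    sources: dict[str, str]  # alias -> fully qualified source table
    destination_table: str   # fully qualified destination table


def check_mapping(mapping: dict) -> MappingDict:
    """Fail early if a required key is missing or has the wrong shape."""
    missing = {"sources", "destination_table"} - set(mapping)
    if missing:
        raise ValueError(f"mapping is missing keys: {sorted(missing)}")
    if not isinstance(mapping["sources"], dict) or not mapping["sources"]:
        raise ValueError("'sources' must be a non-empty dict of alias -> table")
    if not isinstance(mapping["destination_table"], str):
        raise ValueError("'destination_table' must be a string")
    return MappingDict(
        sources=dict(mapping["sources"]),
        destination_table=mapping["destination_table"],
    )


# Define the mapping once and reuse it across extraction calls.
mapping = check_mapping(
    {
        "sources": {"orders": "svc.db.raw.orders"},
        "destination_table": "svc.db.curated.metrics",
    }
)
```

The checked dict can then be passed wherever the examples above pass a literal mapping dict.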

Metadata-on-LazyFrame Pattern

After importing polars_lineage, pl.LazyFrame gets an add_metadata(...) helper.

Supported forms:

  • explicit lineage mapping:
    • source="svc.db.raw.orders" (single source), or
    • sources={"left": "...", "right": "..."} (multi-source)
    • optional destination_table="svc.db.curated.result"
  • metadata mode:
    • name="orders", source_type="postgres", source_url="postgres://..."
    • destination table is auto-derived unless provided

Example with metadata attached directly to LazyFrame definitions:

import polars as pl
import polars_lineage  # registers LazyFrame.add_metadata

df_order = (
    pl.DataFrame({"a": [1], "b": [2]})
    .lazy()
    .add_metadata(
        name="orders",
        source_type="postgres",
        source_url="postgres://myserver/svc.db.raw.orders",
    )
)

df_account = (
    pl.DataFrame({"a": [1], "b": [2]})
    .lazy()
    .add_metadata(
        name="account",
        source_type="rest",
        source_url="https://account/list",
    )
)

lineage = (
    df_account.join(df_order, on="a", how="inner")
    .select([(pl.col("a") + pl.col("b")).alias("sum")])
    .extract_lineage()
)

print(lineage)

Equivalent single-source style with explicit source:

import polars as pl
import polars_lineage

lineage = (
    pl.DataFrame({"a": [1], "b": [2]})
    .lazy()
    .add_metadata(
        source="svc.db.raw.orders",
        destination_table="svc.db.curated.order_metrics",
    )
    .select([(pl.col("a") + pl.col("b")).alias("sum")])
    .extract_lineage()
)

print(lineage)

Example with group-by aggregation lineage:

import polars as pl
from polars_lineage import extract_lazyframe_lineage

lazyframe = (
    pl.DataFrame({"customer_id": [1, 1, 2], "amount": [10, 15, 8]})
    .lazy()
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total_amount"))
)

payloads = extract_lazyframe_lineage(
    lazyframe,
    {
        "sources": {"payments": "svc.db.raw.payments"},
        "destination_table": "svc.db.curated.customer_totals",
    },
)

print(payloads)

CLI Usage

Install the optional CLI extra, then run the extract command:

pip install "polars-lineage[cli]"
polars-lineage extract --mapping mapping.yml --out lineage.json

Choose an output format with --format:

# Existing default behavior
polars-lineage extract --mapping mapping.yml --out lineage-openmetadata.json --format openmetadata

# Strongly typed custom JSON document
polars-lineage extract --mapping mapping.yml --out lineage.json --format json

# Human-readable report
polars-lineage extract --mapping mapping.yml --out lineage.md --format markdown

mapping.yml example:

sources:
  left: svc.db.raw.left_table
  right: svc.db.raw.right_table
destination_table: svc.db.curated.final_table
plan_path: ./plan.txt

Notes:

  • plan_path may be relative; relative paths are resolved against the mapping file's location.
  • The CLI reads a pre-generated Polars explain plan from disk (for example, text saved from LazyFrame.explain()).
  • For join plans, mapping.sources must include left and right aliases.
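
The first and last notes can be pictured with a small standard-library sketch. It is only an illustration of the described behavior, not the CLI's actual code; resolve_plan_path and missing_join_aliases are hypothetical helper names, and the /project/config path is made up for the example.

```python
from pathlib import Path


def resolve_plan_path(mapping_file: Path, plan_path: str) -> Path:
    """Resolve plan_path against the directory that contains the mapping file."""
    candidate = Path(plan_path)
    if candidate.is_absolute():
        return candidate
    return mapping_file.parent / candidate


def missing_join_aliases(sources: dict[str, str]) -> list[str]:
    """Return the join aliases (left/right) that are absent from sources."""
    return [alias for alias in ("left", "right") if alias not in sources]


# Values mirror the mapping.yml example above.
mapping_file = Path("/project/config/mapping.yml")
plan_file = resolve_plan_path(mapping_file, "./plan.txt")

sources = {
    "left": "svc.db.raw.left_table",
    "right": "svc.db.raw.right_table",
}
missing = missing_join_aliases(sources)
```

A pre-flight check like this can surface a misplaced plan file or a missing alias before the CLI is invoked.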

Wrapper Notes

  • LineageLazyFrame preserves metadata through most chained LazyFrame operations.
  • Joining two wrapped frames merges source metadata automatically (left/right).
  • extract_lineage() runs the same extraction pipeline as extract_lazyframe_lineage(...).
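
The behavior described in these notes resembles a delegating wrapper: calls are forwarded to the wrapped frame, frame results are re-wrapped so the metadata survives, and a join combines both sides' sources under left/right aliases. The toy sketch below shows that general pattern with stand-in classes. ToyFrame and ToyLineageFrame are hypothetical and make no claim about how LineageLazyFrame is implemented; the accounts table name is also invented for the example.

```python
class ToyFrame:
    """Stand-in for a lazy frame; records the operations applied to it."""

    def __init__(self, ops: tuple = ()) -> None:
        self.ops = ops

    def select(self, *columns: str) -> "ToyFrame":
        return ToyFrame(self.ops + (f"select({', '.join(columns)})",))

    def join(self, other: "ToyFrame", on: str) -> "ToyFrame":
        return ToyFrame(self.ops + (f"join(on={on})",))


class ToyLineageFrame:
    """Hypothetical wrapper that carries source metadata alongside a frame."""

    def __init__(self, frame: ToyFrame, sources: dict) -> None:
        self._frame = frame
        self.sources = sources

    def join(self, other: "ToyLineageFrame", on: str) -> "ToyLineageFrame":
        # Merge single-source metadata from both sides under left/right aliases.
        merged = {
            "left": next(iter(self.sources.values())),
            "right": next(iter(other.sources.values())),
        }
        return ToyLineageFrame(self._frame.join(other._frame, on=on), merged)

    def __getattr__(self, name: str):
        # Only reached for attributes the wrapper does not define itself.
        attr = getattr(self._frame, name)
        if not callable(attr):
            return attr

        def forward(*args, **kwargs):
            result = attr(*args, **kwargs)
            # Re-wrap frame results so the metadata survives the chain.
            if isinstance(result, ToyFrame):
                return ToyLineageFrame(result, self.sources)
            return result

        return forward


orders = ToyLineageFrame(ToyFrame(), {"orders": "svc.db.raw.orders"})
accounts = ToyLineageFrame(ToyFrame(), {"accounts": "svc.db.raw.accounts"})
result = accounts.join(orders, on="id").select("id", "total")
```

In this toy, result.sources holds the left and right tables and result.ops records the join followed by the select, which is the property the notes describe for chained operations.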

Current Capabilities

  • Projection lineage (select, with_columns)
  • Literals and aliases
  • Basic expression dependency extraction (arithmetic, casts, conditional-like patterns)
  • Transitive dependency resolution
  • Join-aware attribution with explicit left/right mapping aliases
  • Group-by lineage covering aggregation expressions and grouping keys
  • Deterministic OpenMetadata payload export
  • Deterministic custom JSON export via typed LineageDocument model
  • Deterministic Markdown lineage rendering
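
To make "transitive dependency resolution" concrete: when one derived column is built from another derived column, lineage should point back to the original source columns. The sketch below illustrates the idea on a plain dict of direct dependencies. It is not the library's implementation; resolve_source_columns and the column names are hypothetical.

```python
def resolve_source_columns(column: str, direct_deps: dict[str, set[str]]) -> set[str]:
    """Walk direct dependencies until only non-derived (source) columns remain."""
    resolved: set[str] = set()
    visited: set[str] = set()
    stack = [column]
    while stack:
        current = stack.pop()
        if current in visited:
            continue  # already expanded; also guards against cycles
        visited.add(current)
        inputs = direct_deps.get(current)
        if inputs is None:
            resolved.add(current)  # not derived, so it is a source column
        else:
            stack.extend(inputs)
    return resolved


# Each derived column maps to the columns its expression reads directly.
direct_deps = {
    "sum": {"a", "b"},             # sum = a + b
    "double_sum": {"sum"},         # double_sum = sum * 2
    "report": {"double_sum", "c"}, # report = double_sum + c
    "one": set(),                  # literal: no input columns
}

report_sources = sorted(resolve_source_columns("report", direct_deps))
```

Here report resolves to the source columns a, b and c, while a literal such as one resolves to no columns. The sketch does not model a column being redefined under its own name in a later step.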

Output Formats

  • openmetadata: existing OpenMetadata AddLineageRequest-style payload list
  • json: custom typed JSON document
    • top-level: destination_table, edges[]
    • edge: source_table, destination_table, columns[]
    • column: to_column, from_columns, function, confidence
  • markdown: human-readable lineage table report
  • openlineage: planned follow-up
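
The json layout listed above can be pictured with a few plain dataclasses. This is only a sketch of the documented field names: the real model is the Pydantic LineageDocument, the field types here are assumptions, and the function and confidence values are illustrative placeholders rather than values the library is known to emit.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ColumnSketch:
    to_column: str
    from_columns: list
    function: str    # assumed type
    confidence: str  # assumed type


@dataclass(frozen=True)
class EdgeSketch:
    source_table: str
    destination_table: str
    columns: list


@dataclass(frozen=True)
class DocumentSketch:
    destination_table: str
    edges: list


doc = DocumentSketch(
    destination_table="svc.db.curated.metrics",
    edges=[
        EdgeSketch(
            source_table="svc.db.raw.orders",
            destination_table="svc.db.curated.metrics",
            columns=[
                ColumnSketch(
                    to_column="sum",
                    from_columns=["a", "b"],
                    function="a + b",    # placeholder value
                    confidence="exact",  # placeholder value
                )
            ],
        )
    ],
)

# Sorting keys keeps the serialized text stable across runs.
text = json.dumps(asdict(doc), sort_keys=True, indent=2)
```

Serializing with sorted keys is one simple way to get byte-for-byte repeatable output, which is the property "deterministic" refers to in the capability list.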

Current Constraints

  • Multiple joins in one parsed plan are rejected.
  • Join mappings must include left and right source aliases.
  • Outside of joins, overlapping columns whose source is ambiguous are rejected with a clear error.
  • LazyFrame.add_metadata(...) is added dynamically, so IDEs and mypy may need type stubs to discover the method.

Development

uv run pytest
uv run ruff check .
uv run mypy
uv build

Download files

Source Distribution

polars_lineage-0.1.2.tar.gz (13.5 kB)

Uploaded Source

Built Distribution

polars_lineage-0.1.2-py3-none-any.whl (21.1 kB)

Uploaded Python 3

File details

Details for the file polars_lineage-0.1.2.tar.gz.

File metadata

  • Download URL: polars_lineage-0.1.2.tar.gz
  • Upload date:
  • Size: 13.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for polars_lineage-0.1.2.tar.gz
Algorithm    Hash digest
SHA256       c7b6b1ec663d4319148630e013d2061c6593b06809bb4220e872fcd7f689d903
MD5          a15d6f1fee52b851c6752f9b7d220ed1
BLAKE2b-256  2b224ce6011f03b150aa1a1dfdb2392e7bff3e1ce7bc0758c9cd797cbbd12bcb

Provenance

The following attestation bundles were made for polars_lineage-0.1.2.tar.gz:

Publisher: release.yml on davzucky/polars-lineage

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file polars_lineage-0.1.2-py3-none-any.whl.

File metadata

  • Download URL: polars_lineage-0.1.2-py3-none-any.whl
  • Upload date:
  • Size: 21.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for polars_lineage-0.1.2-py3-none-any.whl
Algorithm    Hash digest
SHA256       9a2309cd6c110be824d025c4f6c6fd15597d83999ddbf71b1e9136d3d7b760b4
MD5          13109ffb8918cf15eab126233ab21051
BLAKE2b-256  4a8ae717f25d3692d19ce4933daeea524f222ccf1d1835434fb86e325775ff47

Provenance

The following attestation bundles were made for polars_lineage-0.1.2-py3-none-any.whl:

Publisher: release.yml on davzucky/polars-lineage

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
