polars-lineage
Extract column-level lineage from Polars LazyFrame transformations and emit deterministic lineage artifacts in multiple formats.
Install and Setup
Install API-only dependencies:
pip install polars-lineage
Install with CLI support:
pip install "polars-lineage[cli]"
For local development:
uv sync --dev
Python API (LazyFrame First)
import polars as pl
from polars_lineage import extract_lazyframe_lineage

lazyframe = pl.DataFrame({"a": [1, 2], "b": [3, 4]}).lazy().select(
    [
        pl.col("a").alias("x"),
        (pl.col("a") + pl.col("b")).alias("sum"),
    ]
)
payloads = extract_lazyframe_lineage(
    lazyframe,
    {
        "sources": {"orders": "svc.db.raw.orders"},
        "destination_table": "svc.db.curated.metrics",
    },
)
print(payloads)
extract_lazyframe_lineage(...) stays backward-compatible and returns OpenMetadata payloads by default.
For format-aware output, use extract_lazyframe_lineage_formatted(...):
import polars as pl
from polars_lineage import (
    LineageDocument,
    extract_lazyframe_lineage_document,
    extract_lazyframe_lineage_formatted,
)

lazyframe = pl.DataFrame({"a": [1], "b": [2]}).lazy().select(
    [(pl.col("a") + pl.col("b")).alias("sum")]
)
json_document: LineageDocument = extract_lazyframe_lineage_document(
    lazyframe,
    {
        "sources": {"orders": "svc.db.raw.orders"},
        "destination_table": "svc.db.curated.metrics",
    },
)
markdown_report = extract_lazyframe_lineage_formatted(
    lazyframe,
    {
        "sources": {"orders": "svc.db.raw.orders"},
        "destination_table": "svc.db.curated.metrics",
    },
    output_format="markdown",
)
print(json_document.model_dump())
print(markdown_report)
If you want a strongly typed Pydantic model for consumer code, use:
import polars as pl
from polars_lineage import LineageDocument, extract_lazyframe_lineage_document

lazyframe = pl.DataFrame({"a": [1]}).lazy().select([pl.col("a").alias("x")])
document: LineageDocument = extract_lazyframe_lineage_document(
    lazyframe,
    {
        "sources": {"orders": "svc.db.raw.orders"},
        "destination_table": "svc.db.curated.metrics",
    },
)
print(document.model_dump())
Example with multiple input sources (join):
import polars as pl
from polars_lineage import extract_lazyframe_lineage

left = pl.DataFrame({"id": [1, 2], "a": [10, 20]}).lazy()
right = pl.DataFrame({"id": [1, 2], "b": [3, 4]}).lazy()
lazyframe = left.join(right, on="id", how="left").with_columns(
    (pl.col("a") + pl.col("b")).alias("total")
)
payloads = extract_lazyframe_lineage(
    lazyframe,
    {
        "sources": {
            "left": "svc.db.raw.left_table",
            "right": "svc.db.raw.right_table",
        },
        "destination_table": "svc.db.curated.joined_metrics",
    },
)
print(payloads)
The mapping argument can be either:
- a MappingConfig
- a dict with sources and destination_table
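Since the dict form is plain data, it can be built once and shared across several extract calls instead of being repeated inline. The following is a minimal sketch, not part of polars-lineage: MappingDict and build_mapping are hypothetical names, and the checks cover only the two keys named above.

```python
from typing import Dict, TypedDict


class MappingDict(TypedDict):
    """Hypothetical type describing the dict form of the mapping argument."""

    sources: Dict[str, str]
    destination_table: str


def build_mapping(sources: Dict[str, str], destination_table: str) -> MappingDict:
    """Build the dict form once so several extract calls can share it."""
    if not sources:
        raise ValueError("sources must contain at least one alias")
    if not destination_table:
        raise ValueError("destination_table must be a non-empty string")
    # Copy the sources so later changes made by the caller do not leak in.
    return {"sources": dict(sources), "destination_table": destination_table}


mapping = build_mapping(
    {"orders": "svc.db.raw.orders"},
    "svc.db.curated.metrics",
)
print(mapping)
```

The resulting dict has the same shape as the inline mappings in the examples above, so it could be passed wherever those are used.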
Metadata-on-LazyFrame Pattern
After importing polars_lineage, pl.LazyFrame gets an add_metadata(...) helper.
Supported forms:
- explicit lineage mapping:
  - source="svc.db.raw.orders" (single source), or sources={"left": "...", "right": "..."} (multi-source)
  - optional destination_table="svc.db.curated.result"
- metadata mode:
  - name="orders", source_type="postgres", source_url="postgres://..."
  - destination table is auto-derived unless provided
Example with metadata attached directly to LazyFrame definitions:
import polars as pl
import polars_lineage  # registers LazyFrame.add_metadata

df_order = (
    pl.DataFrame({"a": [1], "b": [2]})
    .lazy()
    .add_metadata(
        name="orders",
        source_type="postgres",
        source_url="postgres://myserver/svc.db.raw.orders",
    )
)
df_account = (
    pl.DataFrame({"a": [1], "b": [2]})
    .lazy()
    .add_metadata(
        name="account",
        source_type="rest",
        source_url="https://account/list",
    )
)
lineage = (
    df_account.join(df_order, on="a", how="inner")
    .select([(pl.col("a") + pl.col("b")).alias("sum")])
    .extract_lineage()
)
print(lineage)
Equivalent single-source style with explicit source:
import polars as pl
import polars_lineage

lineage = (
    pl.DataFrame({"a": [1], "b": [2]})
    .lazy()
    .add_metadata(
        source="svc.db.raw.orders",
        destination_table="svc.db.curated.order_metrics",
    )
    .select([(pl.col("a") + pl.col("b")).alias("sum")])
    .extract_lineage()
)
print(lineage)
Example with group-by aggregation lineage:
import polars as pl
from polars_lineage import extract_lazyframe_lineage

lazyframe = (
    pl.DataFrame({"customer_id": [1, 1, 2], "amount": [10, 15, 8]})
    .lazy()
    .group_by("customer_id")
    .agg(pl.col("amount").sum().alias("total_amount"))
)
payloads = extract_lazyframe_lineage(
    lazyframe,
    {
        "sources": {"payments": "svc.db.raw.payments"},
        "destination_table": "svc.db.curated.customer_totals",
    },
)
print(payloads)
CLI Usage
Install the optional CLI extra first:
pip install "polars-lineage[cli]"
Then run an extraction against a mapping file:
polars-lineage extract --mapping mapping.yml --out lineage.json
Choose an output format with --format:
# Existing default behavior
polars-lineage extract --mapping mapping.yml --out lineage-openmetadata.json --format openmetadata
# Strongly typed custom JSON document
polars-lineage extract --mapping mapping.yml --out lineage.json --format json
# Human-readable report
polars-lineage extract --mapping mapping.yml --out lineage.md --format markdown
mapping.yml example:
sources:
  left: svc.db.raw.left_table
  right: svc.db.raw.right_table
destination_table: svc.db.curated.final_table
plan_path: ./plan.txt
Notes:
- plan_path can be relative to the mapping file location.
- CLI reads a pre-generated Polars explain plan from disk.
- For join plans, mapping.sources must include left and right aliases.
Wrapper Notes
- LineageLazyFrame preserves metadata through most chained LazyFrame operations.
- Joining two wrapped frames merges source metadata automatically (left/right).
- extract_lineage() runs the same extraction pipeline as extract_lazyframe_lineage(...).
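The idea of carrying metadata through chained operations and merging it on a join can be illustrated with a small stand-in. This is a conceptual sketch only: TaggedFrame is a hypothetical class that tracks column names and sources, and it does not reflect how LineageLazyFrame is actually implemented.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class TaggedFrame:
    """Stand-in for a metadata-carrying wrapper; not the real LineageLazyFrame."""

    columns: List[str]
    sources: Dict[str, str] = field(default_factory=dict)

    def select(self, *names: str) -> "TaggedFrame":
        # A chained operation returns a new wrapper that keeps the metadata.
        return TaggedFrame(list(names), dict(self.sources))

    def join(self, other: "TaggedFrame", on: str) -> "TaggedFrame":
        # Each single-source frame is re-keyed under the left/right aliases.
        left_name = next(iter(self.sources.values()))
        right_name = next(iter(other.sources.values()))
        merged = self.columns + [c for c in other.columns if c != on]
        return TaggedFrame(merged, {"left": left_name, "right": right_name})


orders = TaggedFrame(["id", "a"], {"orders": "svc.db.raw.left_table"})
accounts = TaggedFrame(["id", "b"], {"account": "svc.db.raw.right_table"})
joined = orders.join(accounts, on="id").select("a", "b")
print(joined.sources)
```

The point of the pattern is that the caller never rebuilds the sources mapping by hand after a join; the wrapper derives it from the two inputs.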
Current Capabilities
- Projection lineage (select, with_columns)
- Literals and aliases
- Basic expression dependency extraction (arithmetic, casts, conditional-like patterns)
- Transitive dependency resolution
- Join-aware attribution with explicit left/right mapping aliases
- Group-by aggregation expression and key coverage
- Deterministic OpenMetadata payload export
- Deterministic custom JSON export via typed LineageDocument model
- Deterministic Markdown lineage rendering
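Transitive dependency resolution means that a column derived from an intermediate column is traced back to the original source columns. The sketch below illustrates the idea on plain dictionaries; resolve_to_sources is a hypothetical helper written for this example and is not the library's implementation.

```python
from typing import Dict, List, Set


def resolve_to_sources(
    steps: List[Dict[str, Set[str]]], source_columns: Set[str]
) -> Dict[str, Set[str]]:
    """Trace each column back to the source columns it depends on."""
    # Every source column starts out depending only on itself.
    known: Dict[str, Set[str]] = {col: {col} for col in source_columns}
    for step in steps:
        updated = dict(known)
        for out_col, direct_deps in step.items():
            resolved: Set[str] = set()
            for dep in direct_deps:
                # Swap an intermediate column for its own resolved sources.
                resolved |= known[dep]
            updated[out_col] = resolved
        known = updated
    return known


# Each step maps a new column to the columns its expression reads,
# similar to successive with_columns calls.
steps = [
    {"sum": {"a", "b"}},   # sum = a + b
    {"double": {"sum"}},   # double = sum * 2
    {"label": set()},      # label = a literal, so no inputs
]
lineage = resolve_to_sources(steps, {"a", "b"})
print(sorted(lineage["double"]))  # ['a', 'b']
```

Here double never reads a or b directly, yet its resolved lineage is both of them, while the literal column resolves to an empty set.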
Output Formats
- openmetadata: existing OpenMetadata AddLineageRequest-style payload list
- json: custom typed JSON document
  - top-level: destination_table, edges[]
  - edge: source_table, destination_table, columns[]
  - column: to_column, from_columns, function, confidence
- markdown: human-readable lineage table report
- openlineage: planned follow-up
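The field names listed for the json format can be mirrored with plain dataclasses to see how the three levels nest. This is only a sketch of the shape: the class names are hypothetical, the field types are assumptions, and the sample values for function and confidence are invented placeholders. The real model is the Pydantic LineageDocument.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ColumnLineage:
    to_column: str
    from_columns: List[str]
    function: Optional[str]  # assumed type; placeholder value below
    confidence: str          # assumed type; placeholder value below


@dataclass
class Edge:
    source_table: str
    destination_table: str
    columns: List[ColumnLineage]


@dataclass
class Document:
    destination_table: str
    edges: List[Edge]


doc = Document(
    destination_table="svc.db.curated.metrics",
    edges=[
        Edge(
            source_table="svc.db.raw.orders",
            destination_table="svc.db.curated.metrics",
            columns=[ColumnLineage("sum", ["a", "b"], "a + b", "exact")],
        )
    ],
)
# Sorted keys make the serialized text identical for identical input.
text = json.dumps(asdict(doc), sort_keys=True, indent=2)
print(text)
```

Serializing with sorted keys is one common way to get deterministic output; how the library achieves determinism is not described here.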
Current Constraints
- Multiple joins in one parsed plan are rejected.
- Join mappings must include left and right source aliases.
- Ambiguous non-join overlapping columns are rejected with clear errors.
- For static type checking, dynamically added LazyFrame.add_metadata(...) may require stubs for full IDE/mypy method discovery.
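The last constraint arises because a method attached at import time is invisible to static analysis. Besides stub files, one possible workaround in consumer code is a typing.Protocol combined with cast. The sketch below demonstrates the mechanism on a stand-in class: SupportsAddMetadata and PlainFrame are hypothetical, the keyword names are taken from the metadata-mode example, and the return type is an assumption.

```python
from typing import Any, Dict, Optional, Protocol, cast


class SupportsAddMetadata(Protocol):
    """Hypothetical protocol; keyword names follow the metadata-mode example."""

    def add_metadata(
        self,
        *,
        name: Optional[str] = None,
        source_type: Optional[str] = None,
        source_url: Optional[str] = None,
    ) -> Any: ...


class PlainFrame:
    """Stand-in for a class that receives a method at import time."""


def _add_metadata(self: PlainFrame, **metadata: Optional[str]) -> Dict[str, str]:
    # Keep only the keyword arguments that were actually provided.
    return {key: value for key, value in metadata.items() if value is not None}


# Attaching the method at runtime is what hides it from type checkers.
setattr(PlainFrame, "add_metadata", _add_metadata)

# cast tells the type checker to treat the object as having the method.
frame = cast(SupportsAddMetadata, PlainFrame())
result = frame.add_metadata(name="orders", source_type="postgres")
print(result)
```

The cast has no runtime effect; it only gives the type checker and IDE a signature to work with, so it must be kept in sync with the real method by hand.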
Development
uv run pytest
uv run ruff check .
uv run mypy
uv build