polars-config-meta
A Polars plugin for persistent DataFrame-level metadata.
polars-config-meta offers a simple way to store and propagate Python-side metadata for Polars DataFrames and LazyFrames. It achieves this by:
- Registering a custom config_meta namespace on each DataFrame and LazyFrame.
- Keeping an internal dictionary keyed by id(df), with automatic weak-reference cleanup to avoid memory leaks.
- Automatically patching common Polars methods (with_columns, select, filter, etc.) so that metadata is preserved even when using regular Polars syntax.
- Providing a "fallthrough" mechanism so you can write df.config_meta.some_polars_method(...) and have the resulting new DataFrame automatically inherit the old metadata, useful either to note the metadata transfer explicitly or as a backup if a method was not monkeypatched (please file a bug report if you find any!).
- Optionally embedding that metadata in file-level Parquet metadata when you call df.config_meta.write_parquet(...), and retrieving it with read_parquet_with_meta(...) (eager) or scan_parquet_with_meta(...) (lazy).
Installation
pip install polars-config-meta[polars]
On older CPUs add the polars-lts-cpu extra:
pip install polars-config-meta[polars-lts-cpu]
For Parquet file-level metadata reading and writing, add the pyarrow extra:
pip install polars-config-meta[pyarrow]
Key Points
- Automatic Metadata Preservation (New in v0.2.0!): By default, the plugin patches common Polars DataFrame methods (with_columns, select, filter, sort, etc.) to automatically preserve metadata. This means both of these will preserve metadata:
  df.with_columns(...) ← regular Polars method (automatically patched)
  df.config_meta.with_columns(...) ← through the namespace
This behavior can be configured globally (see Configuration below).
- Weak-Reference Based: We store metadata in class-level dictionaries keyed by id(df) and hold a weakref to the DataFrame. Once the DataFrame is garbage-collected, the metadata is removed too.
- Works with DataFrames and LazyFrames: The plugin supports both eager (DataFrame) and lazy (LazyFrame) execution modes.
- Parquet Integration:
  df.config_meta.write_parquet("file.parquet") automatically embeds the plugin metadata into the Arrow schema's metadata.
  read_parquet_with_meta("file.parquet") reads the file, extracts that metadata, and reattaches it to the returned DataFrame.
  scan_parquet_with_meta("file.parquet") scans the file, extracts that metadata, and reattaches it to the returned LazyFrame.
- Chainable Operations: Since metadata is preserved across transformations, you can chain multiple operations:

result = (
    df.config_meta.set(owner="Alice")
    .with_columns(doubled=pl.col("a") * 2)
    .filter(pl.col("doubled") > 5)
    .select(["doubled"])
)
# Metadata is preserved throughout the chain!
Basic Usage
import polars as pl
import polars_config_meta # this registers the plugin
df = pl.DataFrame({"a": [1, 2, 3]})
df.config_meta.set(owner="Alice", confidence=0.95)
# Both of these preserve metadata (auto-patching is enabled by default):
df2 = df.with_columns(doubled=pl.col("a") * 2)
print(df2.config_meta.get_metadata())
# -> {'owner': 'Alice', 'confidence': 0.95}
df3 = df.config_meta.with_columns(tripled=pl.col("a") * 3)
print(df3.config_meta.get_metadata())
# -> {'owner': 'Alice', 'confidence': 0.95}
# Chain operations - metadata flows through:
df4 = (
df.with_columns(squared=pl.col("a") ** 2)
.filter(pl.col("squared") > 4)
.select(["a", "squared"])
)
print(df4.config_meta.get_metadata())
# -> {'owner': 'Alice', 'confidence': 0.95}
# Write to Parquet, storing the metadata in file-level metadata:
df4.config_meta.write_parquet("output.parquet")
# Later, read it back:
from polars_config_meta import read_parquet_with_meta
df_in = read_parquet_with_meta("output.parquet")
print(df_in.config_meta.get_metadata())
# -> {'owner': 'Alice', 'confidence': 0.95}
Configuration
The plugin provides a ConfigMetaOpts class to control automatic metadata preservation behavior:
from polars_config_meta import ConfigMetaOpts
# Disable automatic metadata preservation for regular DataFrame methods
ConfigMetaOpts.disable_auto_preserve()
df = pl.DataFrame({"a": [1, 2, 3]})
df.config_meta.set(owner="Alice")
df2 = df.with_columns(doubled=pl.col("a") * 2)
print(df2.config_meta.get_metadata())
# -> {} (metadata NOT preserved with regular methods)
df3 = df.config_meta.with_columns(tripled=pl.col("a") * 3)
print(df3.config_meta.get_metadata())
# -> {'owner': 'Alice'} (still works via namespace!)
# Re-enable automatic preservation
ConfigMetaOpts.enable_auto_preserve()
df4 = df.with_columns(quadrupled=pl.col("a") * 4)
print(df4.config_meta.get_metadata())
# -> {'owner': 'Alice'} (metadata preserved again)
Configuration Options
- ConfigMetaOpts.enable_auto_preserve(): Enable automatic metadata preservation for regular DataFrame methods (this is the default behavior).
- ConfigMetaOpts.disable_auto_preserve(): Disable automatic preservation. Only df.config_meta.<method>() will preserve metadata.
Note: The df.config_meta.<method>() syntax always preserves metadata, regardless of the configuration setting.
API Reference
Setting and Retrieving Metadata
- df.config_meta.set(**kwargs): Set metadata key-value pairs.
  df.config_meta.set(owner="Alice", confidence=0.95, version=2)

- df.config_meta.get_metadata(): Get all metadata as a dictionary.
  metadata = df.config_meta.get_metadata()
  # -> {'owner': 'Alice', 'confidence': 0.95, 'version': 2}

- df.config_meta.update(mapping): Update metadata from a dictionary.
  df.config_meta.update({"confidence": 0.99, "validated": True})

- df.config_meta.merge(*dfs): Merge metadata from other DataFrames.
  df3.config_meta.merge(df1, df2)
  # df3 now has metadata from both df1 and df2

- df.config_meta.clear_metadata(): Remove all metadata for this DataFrame.
  df.config_meta.clear_metadata()
Parquet I/O
- df.config_meta.write_parquet(file_path, **kwargs): Write the DataFrame to Parquet with embedded metadata.
  df.config_meta.write_parquet("output.parquet")

- read_parquet_with_meta(file_path, **kwargs): Read a Parquet file with metadata (eager).
  from polars_config_meta import read_parquet_with_meta
  df = read_parquet_with_meta("output.parquet")

- scan_parquet_with_meta(file_path, **kwargs): Scan a Parquet file with metadata (lazy).
  from polars_config_meta import scan_parquet_with_meta
  lf = scan_parquet_with_meta("output.parquet")
Automatic Method Forwarding
Any Polars DataFrame/LazyFrame method can be called through df.config_meta.<method>():
# All of these preserve metadata:
df.config_meta.with_columns(new_col=pl.col("a") * 2)
df.config_meta.select(["a", "b"])
df.config_meta.filter(pl.col("a") > 0)
df.config_meta.sort("a")
df.config_meta.unique()
df.config_meta.drop(["col1"])
df.config_meta.rename({"old": "new"})
# ... and many more!
Common Patterns
Setting Metadata on Creation
df = pl.DataFrame({"a": [1, 2, 3]})
df.config_meta.set(
source="user_upload",
timestamp="2025-01-15",
validated=False
)
Chaining Operations
result = (
df.with_columns(normalized=pl.col("value") / pl.col("value").sum())
.filter(pl.col("normalized") > 0.1)
.sort("normalized", descending=True)
)
# Metadata flows through the entire chain
Merging Metadata from Multiple Sources
df1.config_meta.set(source="api", quality="high")
df2.config_meta.set(source="cache", timestamp="2025-01-15")
df3 = pl.concat([df1, df2])
df3.config_meta.merge(df1, df2)
# df3 now has: {'source': 'cache', 'quality': 'high', 'timestamp': '2025-01-15'}
# Note: Later DataFrames' values override earlier ones
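The override order follows ordinary left-to-right dictionary merging. A minimal sketch of the semantics using plain dicts (illustrative only, not the plugin's internal code):

```python
# Sketch of the merge semantics using plain dicts (not the plugin's internals):
meta1 = {"source": "api", "quality": "high"}
meta2 = {"source": "cache", "timestamp": "2025-01-15"}

# Later mappings override earlier ones, as in a left-to-right dict merge:
merged = {**meta1, **meta2}
print(merged)
# -> {'source': 'cache', 'quality': 'high', 'timestamp': '2025-01-15'}
```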
Persistent Storage with Parquet
# Save with metadata
df.config_meta.set(lineage="raw_data", version=1)
df.config_meta.write_parquet("data_v1.parquet")
# Load with metadata
df_loaded = read_parquet_with_meta("data_v1.parquet")
print(df_loaded.config_meta.get_metadata())
# -> {'lineage': 'raw_data', 'version': 1}
How It Works
Automatic Patching
When you first access .config_meta on any DataFrame, the plugin automatically patches common Polars methods like:
- with_columns, select, filter, sort, unique, drop, rename, cast
- drop_nulls, fill_null, fill_nan
- head, tail, sample, slice, limit
- reverse, rechunk, clone, clear
- ... and more
These patched methods automatically copy metadata from the source DataFrame to the result DataFrame.
Storage and Garbage Collection
Internally, the plugin stores metadata in a global dictionary, _df_id_to_meta, keyed by id(df),
and also keeps a weakref to each DataFrame. As soon as a DataFrame is out of scope and
garbage-collected, the entry in _df_id_to_meta is automatically removed. This prevents memory
leaks and keeps the plugin usage simple.
Method Interception
When you call df.config_meta.some_method(...):
- The plugin checks if some_method exists on the plugin itself (like set, get_metadata, write_parquet).
- If not, it forwards the call to the underlying DataFrame's method.
- If the result is a new DataFrame/LazyFrame, it automatically copies the metadata.
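This kind of fallthrough is typically built on __getattr__, which Python only calls when normal attribute lookup fails. A self-contained sketch (all names are stand-ins for the plugin's internals):

```python
_meta = {}  # illustrative registry: id(frame) -> metadata dict

class Frame:
    """Stand-in for a Polars DataFrame."""
    def filter(self, predicate):
        return Frame()

class ConfigMeta:
    """Stand-in for the namespace object returned by df.config_meta."""
    def __init__(self, frame):
        self._frame = frame

    def set(self, **kwargs):  # a plugin method: handled directly, no fallthrough
        _meta.setdefault(id(self._frame), {}).update(kwargs)

    def get_metadata(self):
        return _meta.get(id(self._frame), {})

    def __getattr__(self, name):
        # Called only for names ConfigMeta doesn't define: fall through to the frame.
        method = getattr(self._frame, name)
        def wrapper(*args, **kwargs):
            result = method(*args, **kwargs)
            if isinstance(result, Frame):  # new frame: copy the metadata across
                _meta[id(result)] = dict(_meta.get(id(self._frame), {}))
            return result
        return wrapper

df = Frame()
ns = ConfigMeta(df)
ns.set(owner="Alice")
df2 = ns.filter(lambda row: True)  # forwarded to Frame.filter
print(ConfigMeta(df2).get_metadata())
# -> {'owner': 'Alice'}
```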
Caveats
- Python-Layer Only: This is purely at the Python layer. Polars doesn't guarantee stable IDs or official hooks for such metadata.
- Metadata is Ephemeral (Unless Saved): Metadata is stored in memory and tied to DataFrame object IDs. It won't survive serialization unless you explicitly use df.config_meta.write_parquet() and read_parquet_with_meta().
- Other Formats Not Supported: Currently, only the Parquet format supports automatic metadata embedding/extraction. For CSV, Arrow, IPC, etc., you'd need to implement your own serialization logic.
- Configuration is Global: The ConfigMetaOpts settings apply globally to all DataFrames in your Python session.
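For formats without embedded-metadata support, one workable pattern is a JSON sidecar file next to the data file. This is a sketch of a do-it-yourself workaround, not part of the plugin; the helper names are hypothetical:

```python
import json
import os
import tempfile
from pathlib import Path

def write_meta_sidecar(data_path, meta):
    """Hypothetical helper: store metadata in a .meta.json file beside the data file."""
    Path(str(data_path) + ".meta.json").write_text(json.dumps(meta))

def read_meta_sidecar(data_path):
    """Hypothetical helper: load sidecar metadata, or {} if none exists."""
    sidecar = Path(str(data_path) + ".meta.json")
    return json.loads(sidecar.read_text()) if sidecar.exists() else {}

# Usage with a plain CSV file standing in for df.write_csv(...) output:
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "data.csv")
Path(csv_path).write_text("a\n1\n2\n")
write_meta_sidecar(csv_path, {"owner": "Alice"})
print(read_meta_sidecar(csv_path))
# -> {'owner': 'Alice'}
```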
Contributing
- Issues & Discussions: Please open a GitHub issue for bugs, ideas, or questions.
- Pull Requests: PRs are welcome! This plugin is a community-driven approach to persist DataFrame-level metadata in Polars.
Polars Development
There is ongoing work to support file-level metadata in Polars' own Parquet writer; see this PR for details. Once that lands, this plugin may be able to integrate more seamlessly.
License
This project is licensed under the MIT License.