Simplified work with partitions, based on the Polars library
Project description
polars_partitions
GitHub: polars_partitions
Description
This library is not a replacement for Polars. Its main goal is to simplify working with partitions (writing, reading, filtering) by creating a Table of Contents file (hereinafter "TOC").
Write Partition
polars_parquet.wr_partition()
polars_parquet.wr_partition(
df: DataFrame,
columns: array | string,
output_path: str
)
Parameters
df
Polars DataFrame
columns
Column name or array of column names to partition by
output_path
Path to save to
Writing TOC
polars_parquet.wr_toc()
polars_parquet.wr_toc(
df: DataFrame,
columns: array | string,
output_path: str
)
Parameters
df
Polars DataFrame on which the partitions are based
columns
Array of columns to create partitions for
output_path
Path to save to
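To make the idea of a TOC concrete, the sketch below builds one in plain Python by collecting the distinct combinations of partition-column values found in the data. It is an illustration only: `build_toc` is a hypothetical helper, not part of polars_partitions, and the format of the file that `wr_toc` actually writes is not described here.

```python
from datetime import date


def build_toc(rows, columns):
    """Return the distinct partition-key combinations, in first-seen order.

    Hypothetical helper for illustration; not the library's implementation.
    """
    if isinstance(columns, str):  # accept a single column name as well as a list
        columns = [columns]
    seen = set()
    toc = []
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            toc.append(dict(zip(columns, key)))
    return toc


rows = [
    {"col1": date(2024, 1, 1), "col2": "A2", "col3": 1},
    {"col1": date(2024, 1, 1), "col2": "A2", "col3": 2},
    {"col1": date(2024, 1, 2), "col2": "A2", "col3": 3},
    {"col1": date(2024, 1, 3), "col2": "B2", "col3": 4},
]

toc = build_toc(rows, ["col1", "col2"])
for entry in toc:
    print(entry)  # one line per distinct (col1, col2) combination
```

A listing of this kind is small compared with the data itself, which is why it can be consulted cheaply before any partition is opened.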
Reading TOC
polars_parquet.rd_toc()
polars_parquet.rd_toc(
output_path: str,
filters: dict = None,
btwn: str = None
)
Parameters
output_path
Path the partitions were saved to
filters
Dictionary whose keys are column names and whose values are arrays of values to filter by
btwn
Used together with filters. Takes the name of the column to which a between filter is applied; the first two values of that column's array in filters are used as the range bounds.
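The difference between a plain filter and a between filter can be sketched in plain Python. `filter_toc` below is a hypothetical helper, not the library's code, and it assumes that both range bounds are inclusive, which the description above does not state.

```python
from datetime import date


def filter_toc(toc, filters=None, btwn=None):
    """Keep the TOC entries that match `filters`.

    Hypothetical helper. For the column named by `btwn`, the first two
    filter values are used as range bounds (inclusive, by assumption);
    every other column is matched by membership in its array.
    """
    if not filters:
        return list(toc)
    kept = []
    for entry in toc:
        ok = True
        for column, values in filters.items():
            if column == btwn:
                low, high = values[0], values[1]
                ok = low <= entry[column] <= high
            else:
                ok = entry[column] in values
            if not ok:
                break
        if ok:
            kept.append(entry)
    return kept


toc = [
    {"col1": date(2024, 1, 1), "col2": "A2"},
    {"col1": date(2024, 1, 2), "col2": "A2"},
    {"col1": date(2024, 1, 3), "col2": "B2"},
]
filters = {"col1": [date(2024, 1, 1), date(2024, 1, 3)]}

as_list = filter_toc(toc, filters)                 # membership: Jan 1 and Jan 3 only
as_range = filter_toc(toc, filters, btwn="col1")   # range: Jan 1 through Jan 3
print(len(as_list), len(as_range))
```

With the same `filters` dictionary, the membership reading skips January 2, while the range reading includes it.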
Read Partition
polars_parquet.rd_partition()
polars_parquet.rd_partition(
output_path: str,
columns: array | string = "*",
filters: dict = None,
btwn: str = None
) → LazyFrame
Parameters
output_path
Path to the parquet file or to the partitions folder
columns
Array of columns to return
filters
Dictionary whose keys are column names and whose values are arrays of values to filter by
btwn
Used together with filters. Takes the name of the column to which a between filter is applied; the first two values of that column's array in filters are used as the range bounds.
How to use (example)
from pl_partitions import polars_partitions
from datetime import date
import polars as pl
# Create a test dataset (all columns must have the same length)
df = pl.DataFrame({
    'col1': [date(2024,1,1), date(2024,1,1), date(2024,1,2), date(2024,1,2), date(2024,1,2), date(2024,1,3), date(2024,1,3), date(2024,1,3)],
    'col2': ['A2', 'A2', 'A2', 'A2', 'A2', 'A2', 'B2', 'B2'],
    'col3': [1, 2, 3, 4, 5, 6, 7, 8]
})
output_path = 'your_path/folder_name_where_to_save'
# Columns to partition by
columns = ['col1', 'col2']
pp = polars_partitions()
# Write the partitions
pp.wr_partition(df, columns, output_path)
# Read the TOC
print(pp.rd_toc(output_path))
# Read the partitions, applying a between filter on col1
filters = {'col1': [date(2024,1,1), date(2024,1,3)]}
lf = pp.rd_partition(output_path, filters=filters, btwn='col1', columns=['col1', 'col3'])
# rd_partition returns a LazyFrame, so collect() is needed to get the data
print(lf.collect())
Built Distribution
Hashes for polars_partitions-0.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 6ac92419ea46a139257bbdd1298ce68518869b846a36ef619e0fa205d9269e9e
MD5 | 87501e2f04b074535e0f3fbe154ab582
BLAKE2b-256 | aaa7f121f8f8d8cfda4f95eadaac724f578199a682b74944afab5ae23c351773