
Simplified work with partitions, built on the Polars library

Project description

polars_partitions

GitHub: polars_partitions
Version: 0.1.1

Python

pip install polars_partitions

Description

This library is not a replacement for Polars. Its main goal is to simplify working with partitions (writing, reading, and filtering) by creating a Table of Contents file (hereinafter referred to as "TOC").

Write Partition

polars_parquet.wr_partition()

polars_parquet.wr_partition(
          df: DataFrame,
          columns: array | string,
          output_path: str
)

Parameters

df
          Polars DataFrame
columns
          Array of columns on which to create partitions
output_path
          Path to save to
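
Partitioning by columns means grouping rows by each distinct combination of values in those columns and writing every group separately. The following is a minimal, library-independent sketch of that idea in plain Python. The helper name partition_rows and the Hive-style col=value directory naming are assumptions made for illustration; the layout that polars_partitions actually writes is not documented here.

```python
from collections import defaultdict
from datetime import date


def partition_rows(rows, columns):
    """Group rows (dicts) by the values of the partition columns.

    Returns a dict mapping a relative directory name to the rows that
    belong in it. The "col=value" naming is an assumed convention.
    """
    if isinstance(columns, str):  # accept a single column name as well
        columns = [columns]
    groups = defaultdict(list)
    for row in rows:
        key = "/".join(f"{c}={row[c]}" for c in columns)
        groups[key].append(row)
    return dict(groups)


rows = [
    {"col1": date(2024, 1, 1), "col2": "A2", "col3": 1},
    {"col1": date(2024, 1, 1), "col2": "A2", "col3": 2},
    {"col1": date(2024, 1, 2), "col2": "B2", "col3": 3},
]

parts = partition_rows(rows, ["col1", "col2"])
for directory, members in parts.items():
    print(directory, len(members))
```

Each key in parts corresponds to one partition; a real writer would save the rows of each group as a parquet file under that directory.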

Write TOC

polars_parquet.wr_toc()

polars_parquet.wr_toc(
          df: DataFrame,
          columns: array | string,
          output_path: str
)

Parameters

df
          Polars DataFrame on which the partitions are based
columns
          Array of columns to create partitions for
output_path
          Path to save to
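
A table of contents can be thought of as an index with one entry per partition, so a reader can find the matching partitions without walking the whole directory tree. The sketch below builds such an index in plain Python. The function name build_toc, the entry layout, and the "path" field are hypothetical; the TOC file format that polars_partitions actually uses is not described in this document.

```python
import json
from datetime import date


def build_toc(rows, columns):
    """Return one TOC entry per distinct combination of partition values."""
    if isinstance(columns, str):
        columns = [columns]
    seen = {}
    for row in rows:
        values = tuple(row[c] for c in columns)
        if values not in seen:
            seen[values] = {
                **dict(zip(columns, values)),
                # Assumed "col=value" directory naming, for illustration only
                "path": "/".join(f"{c}={v}" for c, v in zip(columns, values)),
            }
    return list(seen.values())


rows = [
    {"col1": date(2024, 1, 1), "col2": "A2", "col3": 1},
    {"col1": date(2024, 1, 1), "col2": "A2", "col3": 2},
    {"col1": date(2024, 1, 2), "col2": "B2", "col3": 3},
]

toc = build_toc(rows, ["col1", "col2"])
# default=str lets json serialize the date values
print(json.dumps(toc, default=str, indent=2))
```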

Reading TOC

polars_parquet.rd_toc()

polars_parquet.rd_toc(
          output_path: str,
          filters: dict = None,
          btwn: str = None
)

Parameters

output_path
          Path where the partitions and TOC were saved
filters
          Dictionary where each key is a column name and each value is an array of values to filter on
btwn
          Used together with filters. Takes the name of the column to apply a between filter to; the first two values of that column's array in filters are used as the bounds.

Read Partition

polars_parquet.rd_partition()

polars_parquet.rd_partition(
          output_path: str,
          columns: array | string = "*",
          filters: dict = None,
          btwn: str = None
) → LazyFrame

Parameters

output_path
          Path to the parquet file or to the partitions folder
columns
          Array of columns to return
filters
          Dictionary where each key is a column name and each value is an array of values to filter on
btwn
          Used together with filters. Takes the name of the column to apply a between filter to; the first two values of that column's array in filters are used as the bounds.
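
The interaction between filters and btwn is easier to see with a small example. The sketch below is plain Python and does not call the library. It assumes that values listed for a column act as an "is in" filter, and that for the column named by btwn the first two values are treated as inclusive lower and upper bounds. Both the helper name matches and the inclusive bounds are assumptions, not confirmed behavior of polars_partitions.

```python
from datetime import date


def matches(entry, filters=None, btwn=None):
    """Return True if a row (or TOC entry) passes the filters.

    Assumed semantics: a column's values form an "is in" filter, except
    for the column named by `btwn`, where the first two values are used
    as inclusive lower and upper bounds.
    """
    for column, values in (filters or {}).items():
        if column == btwn:
            low, high = values[0], values[1]
            if not (low <= entry[column] <= high):
                return False
        elif entry[column] not in values:
            return False
    return True


entries = [
    {"col1": date(2024, 1, 1), "col2": "A2"},
    {"col1": date(2024, 1, 2), "col2": "A2"},
    {"col1": date(2024, 1, 3), "col2": "B2"},
]
filters = {"col1": [date(2024, 1, 1), date(2024, 1, 3)]}

as_list = [e for e in entries if matches(e, filters)]
as_range = [e for e in entries if matches(e, filters, btwn="col1")]
print(len(as_list), len(as_range))
```

Under these assumptions the same filters dictionary selects two entries as a list of values (1 and 3 January) and three entries as a range (1 through 3 January).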

How to use (example)

import polars_partitions as plp
from datetime import date
import polars as pl

# Create a test dataset
df = pl.DataFrame({
    'col1': [date(2024,1,1), date(2024,1,1), date(2024,1,2), date(2024,1,2),
             date(2024,1,2), date(2024,1,3), date(2024,1,3), date(2024,1,3)],
    'col2': ['A2', 'A2', 'A2', 'A2', 'B2', 'B2', 'B2', 'B2'],
    'col3': [1, 2, 3, 4, 5, 6, 7, 8],
})

output_path = 'your_path/folder_name_where_to_save'
# Columns to partition by
columns = ['col1', 'col2']

pp = plp.polars_partitions()

# Write the partitions
pp.wr_partition(df, columns, output_path)

# Read TOC
# print(pp.rd_toc(output_path))

# Read partitions and apply filters
# filters = {'col1':[date(2024,1,1),date(2024,1,3)]}
# df = pp.rd_partition(output_path, filters=filters, btwn='col1', columns=['col1', 'col3']) 
# print(df.collect())
