Snowpark column and table statistics collection

snowpark-checkpoints-collectors

This package is in Public Preview.

The snowpark-checkpoints-collectors package offers a function for extracting information from PySpark dataframes. That information can then be used to validate the converted Snowpark dataframes and confirm that they behave the same way as the originals.

Install the library

pip install snowpark-checkpoints-collectors

This package requires PySpark to be installed in the same environment. If you do not have it, you can install PySpark alongside Snowpark Checkpoints by running the following command:

pip install "snowpark-checkpoints-collectors[pyspark]"
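Because the collector depends on PySpark being present in the same environment, it can help to check for both packages before running a collection. The following is a minimal sketch using only the standard library; the helper name is_installed is hypothetical and not part of the package.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if the named module can be located in this environment."""
    try:
        return find_spec(module_name) is not None
    except (ImportError, ValueError):
        # For dotted names, find_spec imports the parent package first and
        # raises if that parent is missing; treat that as "not installed".
        return False


if __name__ == "__main__":
    for name in ("pyspark", "snowflake.snowpark_checkpoints_collector"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

If pyspark is reported as missing, installing the package with the pyspark extra shown above should resolve it.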

Features

Schema inference collected data mode (Schema): This is the default mode, which leverages Pandera schema inference to obtain the metadata and checks that will be evaluated for the specified dataframe. This mode also collects custom data from columns of the DataFrame based on the PySpark type.
DataFrame collected data mode (DataFrame): This mode collects the data of the PySpark dataframe. The mechanism saves all data of the given dataframe in Parquet format. Using the default user Snowflake connection, it then tries to upload the Parquet files to a Snowflake temporary stage and create a table from the staged files. The file and the table both take the name of the checkpoint.

Functionalities

Collect DataFrame Checkpoint

from pyspark.sql import DataFrame as SparkDataFrame
from snowflake.snowpark_checkpoints_collector.collection_common import CheckpointMode
from typing import Optional

# Signature of the function
def collect_dataframe_checkpoint(
    df: SparkDataFrame,
    checkpoint_name: str,
    sample: Optional[float] = None,
    mode: Optional[CheckpointMode] = None,
    output_path: Optional[str] = None,
) -> None:
    ...

df: The input Spark dataframe to collect.
checkpoint_name: Name of the checkpoint schema file or dataframe.
sample: Fraction of DataFrame to sample for schema inference, defaults to 1.0.
mode: The mode in which to run the collection (CheckpointMode.SCHEMA or CheckpointMode.DATAFRAME), defaults to CheckpointMode.SCHEMA.
output_path: The output path to save the checkpoint, defaults to current working directory.
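The optional parameters can be left out so that the documented defaults apply. As a rough sketch, the helper below assembles keyword arguments for collect_dataframe_checkpoint and omits any option the caller did not set. The helper name build_collect_kwargs and the check that sample lies in (0, 1] are assumptions made for illustration, not behavior confirmed by the package.

```python
from typing import Any, Dict, Optional


def build_collect_kwargs(
    checkpoint_name: str,
    sample: Optional[float] = None,
    output_path: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble keyword arguments, leaving out options the caller did not set."""
    if not checkpoint_name:
        raise ValueError("checkpoint_name must be a non-empty string")
    # Assumption: a sampling fraction is only meaningful in the range (0, 1].
    if sample is not None and not 0.0 < sample <= 1.0:
        raise ValueError(f"sample must be in (0, 1], got {sample}")

    kwargs: Dict[str, Any] = {"checkpoint_name": checkpoint_name}
    if sample is not None:
        kwargs["sample"] = sample
    if output_path is not None:
        kwargs["output_path"] = output_path
    return kwargs


# Intended use (requires PySpark and the collectors package):
#   from snowflake.snowpark_checkpoints_collector import collect_dataframe_checkpoint
#   collect_dataframe_checkpoint(
#       pyspark_df,
#       **build_collect_kwargs("demo_checkpoint", sample=0.5, output_path="checkpoints_output"),
#   )
```

The checkpoint name demo_checkpoint and the directory checkpoints_output are placeholder values.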

Skip DataFrame Checkpoint Collection

from pyspark.sql import DataFrame as SparkDataFrame
from snowflake.snowpark_checkpoints_collector.collection_common import CheckpointMode
from typing import Optional

# Signature of the function
def xcollect_dataframe_checkpoint(
    df: SparkDataFrame,
    checkpoint_name: str,
    sample: Optional[float] = None,
    mode: Optional[CheckpointMode] = None,
    output_path: Optional[str] = None,
) -> None:
    ...

xcollect_dataframe_checkpoint has the same signature as collect_dataframe_checkpoint, so a collection call can be skipped by changing only the function name.
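Since both functions accept the same arguments, one way to switch collection on and off without editing every call site is to pick the function once and call it everywhere. The sketch below shows that pattern with a hypothetical helper, choose_collector; the import of xcollect_dataframe_checkpoint from the same module as collect_dataframe_checkpoint, shown in the comment, is an assumption.

```python
from typing import Callable


def choose_collector(
    skip: bool,
    collect_fn: Callable[..., None],
    skip_fn: Callable[..., None],
) -> Callable[..., None]:
    """Return skip_fn when collection should be skipped, otherwise collect_fn."""
    return skip_fn if skip else collect_fn


# Intended use (requires PySpark and the collectors package; the import
# location of xcollect_dataframe_checkpoint is assumed):
#   from snowflake.snowpark_checkpoints_collector import (
#       collect_dataframe_checkpoint,
#       xcollect_dataframe_checkpoint,
#   )
#   collector = choose_collector(
#       skip=True,
#       collect_fn=collect_dataframe_checkpoint,
#       skip_fn=xcollect_dataframe_checkpoint,
#   )
#   collector(pyspark_df, checkpoint_name="demo_checkpoint")
```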

Usage Example

Schema mode

from pyspark.sql import SparkSession
from snowflake.snowpark_checkpoints_collector import collect_dataframe_checkpoint
from snowflake.snowpark_checkpoints_collector.collection_common import CheckpointMode

spark_session = SparkSession.builder.getOrCreate()
sample_size = 1.0

pyspark_df = spark_session.createDataFrame(
    [("apple", 21), ("lemon", 34), ("banana", 50)], schema="fruit string, age integer"
)

collect_dataframe_checkpoint(
    pyspark_df,
    checkpoint_name="collect_checkpoint_mode_1",
    sample=sample_size,
    mode=CheckpointMode.SCHEMA,
)

Dataframe mode

from pyspark.sql import SparkSession
from snowflake.snowpark_checkpoints_collector import collect_dataframe_checkpoint
from snowflake.snowpark_checkpoints_collector.collection_common import CheckpointMode
from pyspark.sql.types import StructType, StructField, ByteType, StringType, IntegerType 

spark_schema = StructType(
    [
        StructField("BYTE", ByteType(), True),
        StructField("STRING", StringType(), True),
        StructField("INTEGER", IntegerType(), True)
    ]
)

data = [(1, "apple", 21), (2, "lemon", 34), (3, "banana", 50)]

spark_session = SparkSession.builder.getOrCreate()
pyspark_df = spark_session.createDataFrame(data, schema=spark_schema).orderBy(
    "INTEGER"
)

collect_dataframe_checkpoint(
    pyspark_df,
    checkpoint_name="collect_checkpoint_mode_2",
    mode=CheckpointMode.DATAFRAME,
)

Release history

0.4.0: Jun 30, 2025
0.3.3: May 20, 2025
0.3.2: May 12, 2025 (this version)
0.3.1: May 9, 2025
0.3.0: Apr 29, 2025
0.2.1: Apr 7, 2025
0.2.0: Mar 24, 2025
0.1.4: Mar 13, 2025
0.1.3: Feb 7, 2025
0.1.2: Feb 3, 2025
0.1.1: Jan 29, 2025
0.1.0: Jan 28, 2025
0.1.0rc3 (pre-release): Jan 27, 2025
0.1.0rc2 (pre-release): Jan 23, 2025
0.1.0rc1 (pre-release): Jan 15, 2025
