Skip to main content

Snowflake Snowpark for Python

Project description

Snowflake Snowpark Python and Snowpark pandas APIs

Build and Test codecov PyPi License Apache-2.0 Codestyle Black

The Snowpark library provides intuitive APIs for querying and processing data in a data pipeline. Using this library, you can build applications that process data in Snowflake without having to move data to the system where your application code runs.

Source code | Snowpark Python developer guide | Snowpark Python API reference | Snowpark pandas developer guide | Snowpark pandas API reference | Product documentation | Samples

Getting started

Have your Snowflake account ready

If you don't have a Snowflake account yet, you can sign up for a 30-day free trial account.

Create a Python virtual environment

You can use miniconda, anaconda, or virtualenv to create a Python 3.8, 3.9, 3.10 or 3.11 virtual environment.

For Snowpark pandas, only Python 3.9, 3.10, or 3.11 is supported.

To have the best experience when using it with UDFs, creating a local conda environment with the Snowflake channel is recommended.

Install the library to the Python virtual environment

pip install snowflake-snowpark-python

To use the Snowpark pandas API, you can optionally install the following, which installs modin in the same environment. The Snowpark pandas API provides a familiar interface for pandas users to query and process data directly in Snowflake.

pip install "snowflake-snowpark-python[modin]"

Create a session and use the Snowpark Python API

from snowflake.snowpark import Session

connection_parameters = {
  "account": "<your snowflake account>",
  "user": "<your snowflake user>",
  "password": "<your snowflake password>",
  "role": "<snowflake user role>",
  "warehouse": "<snowflake warehouse>",
  "database": "<snowflake database>",
  "schema": "<snowflake schema>"
}

session = Session.builder.configs(connection_parameters).create()
# Create a Snowpark dataframe from input data
df = session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"]) 
df = df.filter(df.a > 1)
result = df.collect()
df.show()

# -------------
# |"A"  |"B"  |
# -------------
# |3    |4    |
# -------------

Create a session and use the Snowpark pandas API

import modin.pandas as pd
import snowflake.snowpark.modin.plugin
from snowflake.snowpark import Session

CONNECTION_PARAMETERS = {
    'account': '<myaccount>',
    'user': '<myuser>',
    'password': '<mypassword>',
    'role': '<myrole>',
    'database': '<mydatabase>',
    'schema': '<myschema>',
    'warehouse': '<mywarehouse>',
}
session = Session.builder.configs(CONNECTION_PARAMETERS).create()

# Create a Snowpark pandas dataframe from input data
df = pd.DataFrame([['a', 2.0, 1],['b', 4.0, 2],['c', 6.0, None]], columns=["COL_STR", "COL_FLOAT", "COL_INT"])
df
#   COL_STR  COL_FLOAT  COL_INT
# 0       a        2.0      1.0
# 1       b        4.0      2.0
# 2       c        6.0      NaN

df.shape
# (3, 3)

df.head(2)
#   COL_STR  COL_FLOAT  COL_INT
# 0       a        2.0        1
# 1       b        4.0        2

df.dropna(subset=["COL_INT"], inplace=True)

df
#   COL_STR  COL_FLOAT  COL_INT
# 0       a        2.0        1
# 1       b        4.0        2

df.shape
# (2, 3)

df.head(2)
#   COL_STR  COL_FLOAT  COL_INT
# 0       a        2.0        1
# 1       b        4.0        2

# Save the result back to Snowflake with a row_pos column.
df.reset_index(drop=True).to_snowflake('pandas_test2', index=True, index_label=['row_pos'])

Samples

The Snowpark Python developer guide, Snowpark Python API references, Snowpark pandas developer guide, and Snowpark pandas api references have basic sample code. Snowflake-Labs has more curated demos.

Logging

Configure logging level for snowflake.snowpark for Snowpark Python API logs. Snowpark uses the Snowflake Python Connector. So you may also want to configure the logging level for snowflake.connector when the error is in the Python Connector. For instance,

import logging
for logger_name in ('snowflake.snowpark', 'snowflake.connector'):
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s'))
    logger.addHandler(ch)

Reading and writing to pandas DataFrame

Snowpark Python API supports reading from and writing to a pandas DataFrame via the to_pandas and write_pandas commands.

To use these operations, ensure that pandas is installed in the same environment. You can install pandas alongside Snowpark Python by executing the following command:

pip install "snowflake-snowpark-python[pandas]"

Once pandas is installed, you can convert between a Snowpark DataFrame and pandas DataFrame as follows:

df = session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"])
# Convert Snowpark DataFrame to pandas DataFrame
pandas_df = df.to_pandas() 
# Write pandas DataFrame to a Snowflake table and return Snowpark DataFrame
snowpark_df = session.write_pandas(pandas_df, "new_table", auto_create_table=True)

Snowpark pandas API also supports writing to pandas:

import modin.pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns=["a", "b"])
# Convert Snowpark pandas DataFrame to pandas DataFrame
pandas_df = df.to_pandas() 

Note that the above Snowpark pandas commands will work if Snowpark is installed with the [modin] option, the additional [pandas] installation is not required.

Contributing

Please refer to CONTRIBUTING.md.

Release History

1.25.0 (2024-11-13)

Snowpark Python API Updates

New Features

  • Added the following new functions in snowflake.snowpark.dataframe:
    • map
  • Added support for passing parameter include_error to Session.query_history to record queries that have error during execution.

Improvements

  • When target stage is not set in profiler, a default stage from Session.get_session_stage is used instead of raising SnowparkSQLException.
  • Allowed lower case or mixed case input when calling Session.stored_procedure_profiler.set_active_profiler.
  • Added distributed tracing using open telemetry APIs for action function in DataFrame:
    • cache_result
  • Removed opentelemetry warning from logging.

Bug Fixes

  • Fixed the pre-action and post-action query propagation when In expression were used in selects.
  • Fixed a bug that raised error AttributeError while calling Session.stored_procedure_profiler.get_output when Session.stored_procedure_profiler is disabled.

Dependency Updates

  • Added a dependency on protobuf>=5.28 and tzlocal at runtime.
  • Added a dependency on protoc-wheel-0 for the development profile.
  • Require snowflake-connector-python>=3.12.0, <4.0.0 (was >=3.10.0).

Snowpark pandas API Updates

Dependency Updates

  • Updated modin from 0.28.1 to 0.30.1.
  • Added support for all pandas 2.2.x versions.

New Features

  • Added support for Index.to_numpy.
  • Added support for DataFrame.align and Series.align for axis=0.
  • Added support for size in GroupBy.aggregate, DataFrame.aggregate, and Series.aggregate.
  • Added support for snowflake.snowpark.functions.window
  • Added support for pd.read_pickle (Uses native pandas for processing).
  • Added support for pd.read_html (Uses native pandas for processing).
  • Added support for pd.read_xml (Uses native pandas for processing).
  • Added support for aggregation functions "size" and len in GroupBy.aggregate, DataFrame.aggregate, and Series.aggregate.
  • Added support for list values in Series.str.len.

Bug Fixes

  • Fixed a bug where aggregating a single-column dataframe with a single callable function (e.g. pd.DataFrame([0]).agg(np.mean)) would fail to transpose the result.
  • Fixed bugs where DataFrame.dropna() would:
    • Treat an empty subset (e.g. []) as if it specified all columns instead of no columns.
    • Raise a TypeError for a scalar subset instead of filtering on just that column.
    • Raise a ValueError for a subset of type pandas.Index instead of filtering on the columns in the index.
  • Disable creation of scoped read only table to mitigate Disable creation of scoped read only table to mitigate TableNotFoundError when using dynamic pivot in notebook environment.
  • Fixed a bug when concat dataframe or series objects are coming from the same dataframe when axis = 1.

Improvements

  • Improve np.where with scalar x value by eliminating unnecessary join and temp table creation.
  • Improve get_dummies performance by flattening the pivot with join.

Snowpark Local Testing Updates

New Features

  • Added support for patching functions that are unavailable in the snowflake.snowpark.functions module.
  • Added support for snowflake.snowpark.functions.any_value

Bug Fixes

  • Fixed a bug where Table.update could not handle VariantType, MapType, and ArrayType data types.
  • Fixed a bug where column aliases were incorrectly resolved in DataFrame.join, causing errors when selecting columns from a joined DataFrame.
  • Fixed a bug where Table.update and Table.merge could fail if the target table's index was not the default RangeIndex.

1.24.0 (2024-10-28)

Snowpark Python API Updates

New Features

  • Updated Session class to be thread-safe. This allows concurrent DataFrame transformations, DataFrame actions, UDF and stored procedure registration, and concurrent file uploads when using the same Session object.
    • The feature is disabled by default and can be enabled by setting FEATURE_THREAD_SAFE_PYTHON_SESSION to True for account.
    • Updating session configurations, like changing database or schema, when multiple threads are using the session may lead to unexpected behavior.
    • When enabled, some internally created temporary table names returned from DataFrame.queries API are not deterministic, and may be different when DataFrame actions are executed. This does not affect explicit user-created temporary tables.
  • Added support for 'Service' domain to session.lineage.trace API.
  • Added support for copy_grants parameter when registering UDxF and stored procedures.
  • Added support for the following methods in DataFrameWriter to support daisy-chaining:
    • option
    • options
    • partition_by
  • Added support for snowflake_cortex_summarize.

Improvements

  • Improved the following new capability for function snowflake.snowpark.functions.array_remove it is now possible to use in python.
  • Disables sql simplification when sort is performed after limit.
    • Previously, df.sort().limit() and df.limit().sort() generates the same query with sort in front of limit. Now, df.limit().sort() will generate query that reads df.limit().sort().
    • Improve performance of generated query for df.limit().sort(), because limit stops table scanning as soon as the number of records is satisfied.
  • Added a client side error message for when an invalid stage location is passed to DataFrame read functions.

Bug Fixes

  • Fixed a bug where the automatic cleanup of temporary tables could interfere with the results of async query execution.
  • Fixed a bug in DataFrame.analytics.time_series_agg function to handle multiple data points in same sliding interval.
  • Fixed a bug that created inconsistent casing in field names of structured objects in iceberg schemas.

Deprecations

Snowpark pandas API Updates

New Features

  • Added support for np.subtract, np.multiply, np.divide, and np.true_divide.
  • Added support for tracking usages of __array_ufunc__.
  • Added numpy compatibility support for np.float_power, np.mod, np.remainder, np.greater, np.greater_equal, np.less, np.less_equal, np.not_equal, and np.equal.
  • Added numpy compatibility support for np.log, np.log2, and np.log10
  • Added support for DataFrameGroupBy.bfill, SeriesGroupBy.bfill, DataFrameGroupBy.ffill, and SeriesGroupBy.ffill.
  • Added support for on parameter with Resampler.
  • Added support for timedelta inputs in value_counts().
  • Added support for applying Snowpark Python function snowflake_cortex_summarize.
  • Added support for DataFrame.attrs and Series.attrs.
  • Added support for DataFrame.style.
  • Added numpy compatibility support for np.full_like

Improvements

  • Improved generated SQL query for head and iloc when the row key is a slice.
  • Improved error message when passing an unknown timezone to tz_convert and tz_localize in Series, DataFrame, Series.dt, and DatetimeIndex.
  • Improved documentation for tz_convert and tz_localize in Series, DataFrame, Series.dt, and DatetimeIndex to specify the supported timezone formats.
  • Added additional kwargs support for df.apply and series.apply ( as well as map and applymap ) when using snowpark functions. This allows for some position independent compatibility between apply and functions where the first argument is not a pandas object.
  • Improved generated SQL query for iloc and iat when the row key is a scalar.
  • Removed all joins in iterrows.
  • Improved documentation for Series.map to reflect the unsupported features.
  • Added support for np.may_share_memory which is used internally by many scikit-learn functions. This method will always return false when called with a Snowpark pandas object.

Bug Fixes

  • Fixed a bug where DataFrame and Series pct_change() would raise TypeError when input contained timedelta columns.
  • Fixed a bug where replace() would sometimes propagate Timedelta types incorrectly through replace(). Instead raise NotImplementedError for replace() on Timedelta.
  • Fixed a bug where DataFrame and Series round() would raise AssertionError for Timedelta columns. Instead raise NotImplementedError for round() on Timedelta.
  • Fixed a bug where reindex fails when the new index is a Series with non-overlapping types from the original index.
  • Fixed a bug where calling __getitem__ on a DataFrameGroupBy object always returned a DataFrameGroupBy object if as_index=False.
  • Fixed a bug where inserting timedelta values into an existing column would silently convert the values to integers instead of raising NotImplementedError.
  • Fixed a bug where DataFrame.shift() on axis=0 and axis=1 would fail to propagate timedelta types.
  • DataFrame.abs(), DataFrame.__neg__(), DataFrame.stack(), and DataFrame.unstack() now raise NotImplementedError for timedelta inputs instead of failing to propagate timedelta types.

Snowpark Local Testing Updates

Bug Fixes

  • Fixed a bug where DataFrame.alias raises KeyError for input column name.
  • Fixed a bug where to_csv on Snowflake stage fails when data contains empty strings.

1.23.0 (2024-10-09)

Snowpark Python API Updates

New Features

  • Added the following new functions in snowflake.snowpark.functions:
    • make_interval
  • Added support for using Snowflake Interval constants with Window.range_between() when the order by column is TIMESTAMP or DATE type.
  • Added support for file writes. This feature is currently in private preview.
  • Added thread_id to QueryRecord to track the thread id submitting the query history.
  • Added support for Session.stored_procedure_profiler.

Improvements

Bug Fixes

  • Fixed a bug where registering a stored procedure or UDxF with type hints would give a warning 'NoneType' has no len() when trying to read default values from function.

Snowpark pandas API Updates

New Features

  • Added support for TimedeltaIndex.mean method.
  • Added support for some cases of aggregating Timedelta columns on axis=0 with agg or aggregate.
  • Added support for by, left_by, right_by, left_index, and right_index for pd.merge_asof.
  • Added support for passing parameter include_describe to Session.query_history.
  • Added support for DatetimeIndex.mean and DatetimeIndex.std methods.
  • Added support for Resampler.asfreq, Resampler.indices, Resampler.nunique, and Resampler.quantile.
  • Added support for resample frequency W, ME, YE with closed = "left".
  • Added support for DataFrame.rolling.corr and Series.rolling.corr for pairwise = False and int window.
  • Added support for string time-based window and min_periods = None for Rolling.
  • Added support for DataFrameGroupBy.fillna and SeriesGroupBy.fillna.
  • Added support for constructing Series and DataFrame objects with the lazy Index object as data, index, and columns arguments.
  • Added support for constructing Series and DataFrame objects with index and column values not present in DataFrame/Series data.
  • Added support for pd.read_sas (Uses native pandas for processing).
  • Added support for applying rolling().count() and expanding().count() to Timedelta series and columns.
  • Added support for tz in both pd.date_range and pd.bdate_range.
  • Added support for Series.items.
  • Added support for errors="ignore" in pd.to_datetime.
  • Added support for DataFrame.tz_localize and Series.tz_localize.
  • Added support for DataFrame.tz_convert and Series.tz_convert.
  • Added support for applying Snowpark Python functions (e.g., sin) in Series.map, Series.apply, DataFrame.apply and DataFrame.applymap.

Improvements

  • Improved to_pandas to persist the original timezone offset for TIMESTAMP_TZ type.
  • Improved dtype results for TIMESTAMP_TZ type to show correct timezone offset.
  • Improved dtype results for TIMESTAMP_LTZ type to show correct timezone.
  • Improved error message when passing non-bool value to numeric_only for groupby aggregations.
  • Removed unnecessary warning about sort algorithm in sort_values.
  • Use SCOPED object for internal create temp tables. The SCOPED objects will be stored sproc scoped if created within stored sproc, otherwise will be session scoped, and the object will be automatically cleaned at the end of the scope.
  • Improved warning messages for operations that lead to materialization with inadvertent slowness.
  • Removed unnecessary warning message about convert_dtype in Series.apply.

Bug Fixes

  • Fixed a bug where an Index object created from a Series/DataFrame incorrectly updates the Series/DataFrame's index name after an inplace update has been applied to the original Series/DataFrame.
  • Suppressed an unhelpful SettingWithCopyWarning that sometimes appeared when printing Timedelta columns.
  • Fixed inplace argument for Series objects derived from other Series objects.
  • Fixed a bug where Series.sort_values failed if series name overlapped with index column name.
  • Fixed a bug where transposing a dataframe would map Timedelta index levels to integer column levels.
  • Fixed a bug where Resampler methods on timedelta columns would produce integer results.
  • Fixed a bug where pd.to_numeric() would leave Timedelta inputs as Timedelta instead of converting them to integers.
  • Fixed loc set when setting a single row, or multiple rows, of a DataFrame with a Series value.

Snowpark Local Testing Updates

Bug Fixes

  • Fixed a bug where nullable columns were annotated wrongly.
  • Fixed a bug where the date_add and date_sub functions failed for NULL values.
  • Fixed a bug where equal_null could fail inside a merge statement.
  • Fixed a bug where row_number could fail inside a Window function.
  • Fixed a bug where updates could fail when the source is the result of a join.

1.22.1 (2024-09-11)

This is a re-release of 1.22.0. Please refer to the 1.22.0 release notes for detailed release content.

1.22.0 (2024-09-10)

Snowpark Python API Updates

New Features

  • Added the following new functions in snowflake.snowpark.functions:
    • array_remove
    • ln

Improvements

  • Improved documentation for Session.write_pandas by making use_logical_type option more explicit.
  • Added support for specifying the following to DataFrameWriter.save_as_table:
    • enable_schema_evolution
    • data_retention_time
    • max_data_extension_time
    • change_tracking
    • copy_grants
    • iceberg_config A dicitionary that can hold the following iceberg configuration options:
      • external_volume
      • catalog
      • base_location
      • catalog_sync
      • storage_serialization_policy
  • Added support for specifying the following to DataFrameWriter.copy_into_table:
    • iceberg_config A dicitionary that can hold the following iceberg configuration options:
      • external_volume
      • catalog
      • base_location
      • catalog_sync
      • storage_serialization_policy
  • Added support for specifying the following parameters to DataFrame.create_or_replace_dynamic_table:
    • mode
    • refresh_mode
    • initialize
    • clustering_keys
    • is_transient
    • data_retention_time
    • max_data_extension_time

Bug Fixes

  • Fixed a bug in session.read.csv that caused an error when setting PARSE_HEADER = True in an externally defined file format.
  • Fixed a bug in query generation from set operations that allowed generation of duplicate queries when children have common subqueries.
  • Fixed a bug in session.get_session_stage that referenced a non-existing stage after switching database or schema.
  • Fixed a bug where calling DataFrame.to_snowpark_pandas without explicitly initializing the Snowpark pandas plugin caused an error.
  • Fixed a bug where using the explode function in dynamic table creation caused a SQL compilation error due to improper boolean type casting on the outer parameter.

Snowpark Local Testing Updates

New Features

  • Added support for type coercion when passing columns as input to UDF calls.
  • Added support for Index.identical.

Bug Fixes

  • Fixed a bug where the truncate mode in DataFrameWriter.save_as_table incorrectly handled DataFrames containing only a subset of columns from the existing table.
  • Fixed a bug where function to_timestamp does not set the default timezone of the column datatype.

Snowpark pandas API Updates

New Features

  • Added limited support for the Timedelta type, including the following features. Snowpark pandas will raise NotImplementedError for unsupported Timedelta use cases.
    • supporting tracking the Timedelta type through copy, cache_result, shift, sort_index, assign, bfill, ffill, fillna, compare, diff, drop, dropna, duplicated, empty, equals, insert, isin, isna, items, iterrows, join, len, mask, melt, merge, nlargest, nsmallest, to_pandas.
    • converting non-timedelta to timedelta via astype.
    • NotImplementedError will be raised for the rest of methods that do not support Timedelta.
    • support for subtracting two timestamps to get a Timedelta.
    • support indexing with Timedelta data columns.
    • support for adding or subtracting timestamps and Timedelta.
    • support for binary arithmetic between two Timedelta values.
    • support for binary arithmetic and comparisons between Timedelta values and numeric values.
    • support for lazy TimedeltaIndex.
    • support for pd.to_timedelta.
    • support for GroupBy aggregations min, max, mean, idxmax, idxmin, std, sum, median, count, any, all, size, nunique, head, tail, aggregate.
    • support for GroupBy filtrations first and last.
    • support for TimedeltaIndex attributes: days, seconds, microseconds and nanoseconds.
    • support for diff with timestamp columns on axis=0 and axis=1
    • support for TimedeltaIndex methods: ceil, floor and round.
    • support for TimedeltaIndex.total_seconds method.
  • Added support for index's arithmetic and comparison operators.
  • Added support for Series.dt.round.
  • Added documentation pages for DatetimeIndex.
  • Added support for Index.name, Index.names, Index.rename, and Index.set_names.
  • Added support for Index.__repr__.
  • Added support for DatetimeIndex.month_name and DatetimeIndex.day_name.
  • Added support for Series.dt.weekday, Series.dt.time, and DatetimeIndex.time.
  • Added support for Index.min and Index.max.
  • Added support for pd.merge_asof.
  • Added support for Series.dt.normalize and DatetimeIndex.normalize.
  • Added support for Index.is_boolean, Index.is_integer, Index.is_floating, Index.is_numeric, and Index.is_object.
  • Added support for DatetimeIndex.round, DatetimeIndex.floor and DatetimeIndex.ceil.
  • Added support for Series.dt.days_in_month and Series.dt.daysinmonth.
  • Added support for DataFrameGroupBy.value_counts and SeriesGroupBy.value_counts.
  • Added support for Series.is_monotonic_increasing and Series.is_monotonic_decreasing.
  • Added support for Index.is_monotonic_increasing and Index.is_monotonic_decreasing.
  • Added support for pd.crosstab.
  • Added support for pd.bdate_range and included business frequency support (B, BME, BMS, BQE, BQS, BYE, BYS) for both pd.date_range and pd.bdate_range.
  • Added support for lazy Index objects as labels in DataFrame.reindex and Series.reindex.
  • Added support for Series.dt.days, Series.dt.seconds, Series.dt.microseconds, and Series.dt.nanoseconds.
  • Added support for creating a DatetimeIndex from an Index of numeric or string type.
  • Added support for string indexing with Timedelta objects.
  • Added support for Series.dt.total_seconds method.
  • Added support for DataFrame.apply(axis=0).
  • Added support for Series.dt.tz_convert and Series.dt.tz_localize.
  • Added support for DatetimeIndex.tz_convert and DatetimeIndex.tz_localize.

Improvements

  • Improve concat, join performance when operations are performed on series coming from the same dataframe by avoiding unnecessary joins.
  • Refactored quoted_identifier_to_snowflake_type to avoid making metadata queries if the types have been cached locally.
  • Improved pd.to_datetime to handle all local input cases.
  • Create a lazy index from another lazy index without pulling data to client.
  • Raised NotImplementedError for Index bitwise operators.
  • Display a more clear error message when Index.names is set to a non-like-like object.
  • Raise a warning whenever MultiIndex values are pulled in locally.
  • Improve warning message for pd.read_snowflake include the creation reason when temp table creation is triggered.
  • Improve performance for DataFrame.set_index, or setting DataFrame.index or Series.index by avoiding checks require eager evaluation. As a consequence, when the new index that does not match the current Series/DataFrame object length, a ValueError is no longer raised. Instead, when the Series/DataFrame object is longer than the provided index, the Series/DataFrame's new index is filled with NaN values for the "extra" elements. Otherwise, the extra values in the provided index are ignored.
  • Properly raise NotImplementedError when ambiguous/nonexistent are non-string in ceil/floor/round.

Bug Fixes

  • Stopped ignoring nanoseconds in pd.Timedelta scalars.
  • Fixed AssertionError in tree of binary operations.
  • Fixed bug in Series.dt.isocalendar using a named Series
  • Fixed inplace argument for Series objects derived from DataFrame columns.
  • Fixed a bug where Series.reindex and DataFrame.reindex did not update the result index's name correctly.
  • Fixed a bug where Series.take did not error when axis=1 was specified.

1.21.1 (2024-09-05)

Snowpark Python API Updates

Bug Fixes

  • Fixed a bug where using to_pandas_batches with async jobs caused an error due to improper handling of waiting for asynchronous query completion.

1.21.0 (2024-08-19)

Snowpark Python API Updates

New Features

  • Added support for snowflake.snowpark.testing.assert_dataframe_equal that is a utility function to check the equality of two Snowpark DataFrames.

Improvements

  • Added support server side string size limitations.
  • Added support to create and invoke stored procedures, UDFs and UDTFs with optional arguments.
  • Added support for column lineage in the DataFrame.lineage.trace API.
  • Added support for passing INFER_SCHEMA options to DataFrameReader via INFER_SCHEMA_OPTIONS.
  • Added support for passing parameters parameter to Column.rlike and Column.regexp.
  • Added support for automatically cleaning up temporary tables created by df.cache_result() in the current session, when the DataFrame is no longer referenced (i.e., gets garbage collected). It is still an experimental feature not enabled by default, and can be enabled by setting session.auto_clean_up_temp_table_enabled to True.
  • Added support for string literals to the fmt parameter of snowflake.snowpark.functions.to_date.
  • Added support for system$reference function.

Bug Fixes

  • Fixed a bug where SQL generated for selecting * column has an incorrect subquery.
  • Fixed a bug in DataFrame.to_pandas_batches where the iterator could throw an error if certain transformation is made to the pandas dataframe due to wrong isolation level.
  • Fixed a bug in DataFrame.lineage.trace to split the quoted feature view's name and version correctly.
  • Fixed a bug in Column.isin that caused invalid sql generation when passed an empty list.
  • Fixed a bug that fails to raise NotImplementedError while setting cell with list like item.

Snowpark Local Testing Updates

New Features

  • Added support for the following APIs:
    • snowflake.snowpark.functions
      • rank
      • dense_rank
      • percent_rank
      • cume_dist
      • ntile
      • datediff
      • array_agg
    • snowflake.snowpark.column.Column.within_group
  • Added support for parsing flags in regex statements for mocked plans. This maintains parity with the rlike and regexp changes above.

Bug Fixes

  • Fixed a bug where Window Functions LEAD and LAG do not handle option ignore_nulls properly.
  • Fixed a bug where values were not populated into the result DataFrame during the insertion of table merge operation.

Improvements

  • Fix pandas FutureWarning about integer indexing.

Snowpark pandas API Updates

New Features

  • Added support for DataFrame.backfill, DataFrame.bfill, Series.backfill, and Series.bfill.
  • Added support for DataFrame.compare and Series.compare with default parameters.
  • Added support for Series.dt.microsecond and Series.dt.nanosecond.
  • Added support for Index.is_unique and Index.has_duplicates.
  • Added support for Index.equals.
  • Added support for Index.value_counts.
  • Added support for Series.dt.day_name and Series.dt.month_name.
  • Added support for indexing on Index, e.g., df.index[:10].
  • Added support for DataFrame.unstack and Series.unstack.
  • Added support for DataFrame.asfreq and Series.asfreq.
  • Added support for Series.dt.is_month_start and Series.dt.is_month_end.
  • Added support for Index.all and Index.any.
  • Added support for Series.dt.is_year_start and Series.dt.is_year_end.
  • Added support for Series.dt.is_quarter_start and Series.dt.is_quarter_end.
  • Added support for lazy DatetimeIndex.
  • Added support for Series.argmax and Series.argmin.
  • Added support for Series.dt.is_leap_year.
  • Added support for DataFrame.items.
  • Added support for Series.dt.floor and Series.dt.ceil.
  • Added support for Index.reindex.
  • Added support for DatetimeIndex properties: year, month, day, hour, minute, second, microsecond, nanosecond, date, dayofyear, day_of_year, dayofweek, day_of_week, weekday, quarter, is_month_start, is_month_end, is_quarter_start, is_quarter_end, is_year_start, is_year_end and is_leap_year.
  • Added support for Resampler.fillna and Resampler.bfill.
  • Added limited support for the Timedelta type, including creating Timedelta columns and to_pandas.
  • Added support for Index.argmax and Index.argmin.

Improvements

  • Removed the public preview warning message when importing Snowpark pandas.
  • Removed unnecessary count query from SnowflakeQueryCompiler.is_series_like method.
  • Dataframe.columns now returns native pandas Index object instead of Snowpark Index object.
  • Refactor and introduce query_compiler argument in Index constructor to create Index from query compiler.
  • pd.to_datetime now returns a DatetimeIndex object instead of a Series object.
  • pd.date_range now returns a DatetimeIndex object instead of a Series object.

Bug Fixes

  • Made passing an unsupported aggregation function to pivot_table raise NotImplementedError instead of KeyError.
  • Removed axis labels and callable names from error messages and telemetry about unsupported aggregations.
  • Fixed AssertionError in Series.drop_duplicates and DataFrame.drop_duplicates when called after sort_values.
  • Fixed a bug in Index.to_frame where the result frame's column name may be wrong where name is unspecified.
  • Fixed a bug where some Index docstrings are ignored.
  • Fixed a bug in Series.reset_index(drop=True) where the result name may be wrong.
  • Fixed a bug in Groupby.first/last ordering by the correct columns in the underlying window expression.

1.20.0 (2024-07-17)

Snowpark Python API Updates

Improvements

  • Added distributed tracing using open telemetry APIs for table stored procedure function in DataFrame:
    • _execute_and_get_query_id
  • Added support for the arrays_zip function.
  • Improves performance for binary column expression and df._in by avoiding unnecessary cast for numeric values. You can enable this optimization by setting session.eliminate_numeric_sql_value_cast_enabled = True.
  • Improved error message for write_pandas when the target table does not exist and auto_create_table=False.
  • Added open telemetry tracing on UDxF functions in Snowpark.
  • Added open telemetry tracing on stored procedure registration in Snowpark.
  • Added a new optional parameter called format_json to the Session.SessionBuilder.app_name function that sets the app name in the Session.query_tag in JSON format. By default, this parameter is set to False.

Bug Fixes

  • Fixed a bug where SQL generated for lag(x, 0) was incorrect and failed with error message argument 1 to function LAG needs to be constant, found 'SYSTEM$NULL_TO_FIXED(null)'.

Snowpark Local Testing Updates

New Features

  • Added support for the following APIs:
    • snowflake.snowpark.functions
      • random
  • Added new parameters to patch function when registering a mocked function:
    • distinct allows an alternate function to be specified for when a sql function should be distinct.
    • pass_column_index passes a named parameter column_index to the mocked function that contains the pandas.Index for the input data.
    • pass_row_index passes a named parameter row_index to the mocked function that is the 0 indexed row number the function is currently operating on.
    • pass_input_data passes a named parameter input_data to the mocked function that contains the entire input dataframe for the current expression.
    • Added support for the column_order parameter to method DataFrameWriter.save_as_table.

Bug Fixes

  • Fixed a bug that caused DecimalType columns to be incorrectly truncated to integer precision when used in BinaryExpressions.

Snowpark pandas API Updates

New Features

  • Added support for DataFrameGroupBy.all, SeriesGroupBy.all, DataFrameGroupBy.any, and SeriesGroupBy.any.
  • Added support for DataFrame.nlargest, DataFrame.nsmallest, Series.nlargest and Series.nsmallest.
  • Added support for replace and frac > 1 in DataFrame.sample and Series.sample.
  • Added support for read_excel (Uses local pandas for processing)
  • Added support for Series.at, Series.iat, DataFrame.at, and DataFrame.iat.
  • Added support for Series.dt.isocalendar.
  • Added support for Series.case_when except when condition or replacement is callable.
  • Added documentation pages for Index and its APIs.
  • Added support for DataFrame.assign.
  • Added support for DataFrame.stack.
  • Added support for DataFrame.pivot and pd.pivot.
  • Added support for DataFrame.to_csv and Series.to_csv.
  • Added partial support for Series.str.translate where the values in the table are single-codepoint strings.
  • Added support for DataFrame.corr.
  • Allow df.plot() and series.plot() to be called, materializing the data into the local client
  • Added support for DataFrameGroupBy and SeriesGroupBy aggregations first and last
  • Added support for DataFrameGroupBy.get_group.
  • Added support for limit parameter when method parameter is used in fillna.
  • Added partial support for Series.str.translate where the values in the table are single-codepoint strings.
  • Added support for DataFrame.corr.
  • Added support for DataFrame.equals and Series.equals.
  • Added support for DataFrame.reindex and Series.reindex.
  • Added support for Index.astype.
  • Added support for Index.unique and Index.nunique.
  • Added support for Index.sort_values.

Bug Fixes

  • Fixed an issue when using np.where and df.where when the scalar 'other' is the literal 0.
  • Fixed a bug regarding precision loss when converting to Snowpark pandas DataFrame or Series with dtype=np.uint64.
  • Fixed bug where values is set to index when index and columns contain all columns in DataFrame during pivot_table.

Improvements

  • Added support for Index.copy()
  • Added support for Index APIs: dtype, values, item(), tolist(), to_series() and to_frame()
  • Expand support for DataFrames with no rows in pd.pivot_table and DataFrame.pivot_table.
  • Added support for inplace parameter in DataFrame.sort_index and Series.sort_index.

1.19.0 (2024-06-25)

Snowpark Python API Updates

New Features

  • Added support for to_boolean function.
  • Added documentation pages for Index and its APIs.

Bug Fixes

  • Fixed a bug where python stored procedure with table return type fails when run in a task.
  • Fixed a bug where df.dropna fails due to RecursionError: maximum recursion depth exceeded when the DataFrame has more than 500 columns.
  • Fixed a bug where AsyncJob.result("no_result") doesn't wait for the query to finish execution.

Snowpark Local Testing Updates

New Features

  • Added support for the strict parameter when registering UDFs and Stored Procedures.

Bug Fixes

  • Fixed a bug in convert_timezone that made the setting the source_timezone parameter return an error.
  • Fixed a bug where creating DataFrame with empty data of type DateType raises AttributeError.
  • Fixed a bug that table merge fails when update clause exists but no update takes place.
  • Fixed a bug in mock implementation of to_char that raises IndexError when incoming column has nonconsecutive row index.
  • Fixed a bug in handling of CaseExpr expressions that raises IndexError when incoming column has nonconsecutive row index.
  • Fixed a bug in implementation of Column.like that raises IndexError when incoming column has nonconsecutive row index.

Improvements

  • Added support for type coercion in the implementation of DataFrame.replace, DataFrame.dropna and the mock function iff.

Snowpark pandas API Updates

New Features

  • Added partial support for DataFrame.pct_change and Series.pct_change without the freq and limit parameters.
  • Added support for Series.str.get.
  • Added support for Series.dt.dayofweek, Series.dt.day_of_week, Series.dt.dayofyear, and Series.dt.day_of_year.
  • Added support for Series.str.__getitem__ (Series.str[...]).
  • Added support for Series.str.lstrip and Series.str.rstrip.
  • Added support for DataFrameGroupBy.size and SeriesGroupBy.size.
  • Added support for DataFrame.expanding and Series.expanding for aggregations count, sum, min, max, mean, std, var, and sem with axis=0.
  • Added support for DataFrame.rolling and Series.rolling for aggregation count with axis=0.
  • Added support for Series.str.match.
  • Added support for DataFrame.resample and Series.resample for aggregations size, first, and last.
  • Added support for DataFrameGroupBy.all, SeriesGroupBy.all, DataFrameGroupBy.any, and SeriesGroupBy.any.
  • Added support for DataFrame.nlargest, DataFrame.nsmallest, Series.nlargest and Series.nsmallest.
  • Added support for replace and frac > 1 in DataFrame.sample and Series.sample.
  • Added support for read_excel (Uses local pandas for processing)
  • Added support for Series.at, Series.iat, DataFrame.at, and DataFrame.iat.
  • Added support for Series.dt.isocalendar.
  • Added support for Series.case_when except when condition or replacement is callable.
  • Added documentation pages for Index and its APIs.
  • Added support for DataFrame.assign.
  • Added support for DataFrame.stack.
  • Added support for DataFrame.pivot and pd.pivot.
  • Added support for DataFrame.to_csv and Series.to_csv.
  • Added support for Index.T.

Bug Fixes

  • Fixed a bug that causes output of GroupBy.aggregate's columns to be ordered incorrectly.
  • Fixed a bug where DataFrame.describe on a frame with duplicate columns of differing dtypes could cause an error or incorrect results.
  • Fixed a bug in DataFrame.rolling and Series.rolling so window=0 now throws NotImplementedError instead of ValueError

Improvements

  • Added support for named aggregations in DataFrame.aggregate and Series.aggregate with axis=0.
  • pd.read_csv reads using the native pandas CSV parser, then uploads data to snowflake using parquet. This enables most of the parameters supported by read_csv including date parsing and numeric conversions. Uploading via parquet is roughly twice as fast as uploading via CSV.
  • Initial work to support an pd.Index directly in Snowpark pandas. Support for pd.Index as a first-class component of Snowpark pandas is coming soon.
  • Added a lazy index constructor and support for len, shape, size, empty, to_pandas() and names. For df.index, Snowpark pandas creates a lazy index object.
  • For df.columns, Snowpark pandas supports a non-lazy version of an Index since the data is already stored locally.

1.18.0 (2024-05-28)

Snowpark Python API Updates

Improvements

  • Improved error message to remind users set {"infer_schema": True} when reading csv file without specifying its schema.
  • Improved error handling for Session.create_dataframe when called with more than 512 rows and using format or pyformat paramstyle.

Snowpark pandas API Updates

New Features

  • Added DataFrame.cache_result and Series.cache_result methods for users to persist DataFrames and Series to a temporary table lasting the duration of the session to improve latency of subsequent operations.

Bug Fixes

Improvements

  • Added partial support for DataFrame.pivot_table with no index parameter, as well as for margins parameter.
  • Updated the signature of DataFrame.shift/Series.shift/DataFrameGroupBy.shift/SeriesGroupBy.shift to match pandas 2.2.1. Snowpark pandas does not yet support the newly-added suffix argument, or sequence values of periods.
  • Re-added support for Series.str.split.

Bug Fixes

  • Fixed how we support mixed columns for string methods (Series.str.*).

Snowpark Local Testing Updates

New Features

  • Added support for the following DataFrameReader read options to file formats csv and json:
    • PURGE
    • PATTERN
    • INFER_SCHEMA with value being False
    • ENCODING with value being UTF8
  • Added support for DataFrame.analytics.moving_agg and DataFrame.analytics.cumulative_agg_agg.
  • Added support for if_not_exists parameter during UDF and stored procedure registration.

Bug Fixes

  • Fixed a bug that when processing time format, fractional second part is not handled properly.
  • Fixed a bug that caused function calls on * to fail.
  • Fixed a bug that prevented creation of map and struct type objects.
  • Fixed a bug that function date_add was unable to handle some numeric types.
  • Fixed a bug that TimestampType casting resulted in incorrect data.
  • Fixed a bug that caused DecimalType data to have incorrect precision in some cases.
  • Fixed a bug where referencing missing table or view raises confusing IndexError.
  • Fixed a bug that mocked function to_timestamp_ntz can not handle None data.
  • Fixed a bug that mocked UDFs handles output data of None improperly.
  • Fixed a bug where DataFrame.with_column_renamed ignores attributes from parent DataFrames after join operations.
  • Fixed a bug that integer precision of large value gets lost when converted to pandas DataFrame.
  • Fixed a bug that the schema of datetime object is wrong when create DataFrame from a pandas DataFrame.
  • Fixed a bug in the implementation of Column.equal_nan where null data is handled incorrectly.
  • Fixed a bug where DataFrame.drop ignore attributes from parent DataFrames after join operations.
  • Fixed a bug in mocked function date_part where Column type is set wrong.
  • Fixed a bug where DataFrameWriter.save_as_table does not raise exceptions when inserting null data into non-nullable columns.
  • Fixed a bug in the implementation of DataFrameWriter.save_as_table where
    • Append or Truncate fails when incoming data has different schema than existing table.
    • Truncate fails when incoming data does not specify columns that are nullable.

Improvements

  • Removed dependency check for pyarrow as it is not used.
  • Improved target type coverage of Column.cast, adding support for casting to boolean and all integral types.
  • Aligned error experience when calling UDFs and stored procedures.
  • Added appropriate error messages for is_permanent and anonymous options in UDFs and stored procedures registration to make it more clear that those features are not yet supported.
  • File read operation with unsupported options and values now raises NotImplementedError instead of warnings and unclear error information.

1.17.0 (2024-05-21)

Snowpark Python API Updates

New Features

  • Added support to add a comment on tables and views using the functions listed below:
    • DataFrameWriter.save_as_table
    • DataFrame.create_or_replace_view
    • DataFrame.create_or_replace_temp_view
    • DataFrame.create_or_replace_dynamic_table

Improvements

  • Improved error message to remind users to set {"infer_schema": True} when reading CSV file without specifying its schema.

Snowpark pandas API Updates

New Features

Snowpark Local Testing Updates

New Features

  • Added support for NumericType and VariantType data conversion in the mocked function to_timestamp_ltz, to_timestamp_ntz, to_timestamp_tz and to_timestamp.
  • Added support for DecimalType, BinaryType, ArrayType, MapType, TimestampType, DateType and TimeType data conversion in the mocked function to_char.
  • Added support for the following APIs:
    • snowflake.snowpark.functions:
      • to_varchar
    • snowflake.snowpark.DataFrame:
      • pivot
    • snowflake.snowpark.Session:
      • cancel_all
  • Introduced a new exception class snowflake.snowpark.mock.exceptions.SnowparkLocalTestingException.
  • Added support for casting to FloatType

Bug Fixes

  • Fixed a bug that stored procedure and UDF should not remove imports already in the sys.path during the clean-up step.
  • Fixed a bug that when processing datetime format, the fractional second part is not handled properly.
  • Fixed a bug that on Windows platform that file operations was unable to properly handle file separator in directory name.
  • Fixed a bug that on Windows platform that when reading a pandas dataframe, IntervalType column with integer data can not be processed.
  • Fixed a bug that prevented users from being able to select multiple columns with the same alias.
  • Fixed a bug that Session.get_current_[schema|database|role|user|account|warehouse] returns upper-cased identifiers when identifiers are quoted.
  • Fixed a bug that function substr and substring can not handle 0-based start_expr.

Improvements

  • Standardized the error experience by raising SnowparkLocalTestingException in error cases which is on par with SnowparkSQLException raised in non-local execution.
  • Improved error experience of Session.write_pandas method that NotImplementError will be raised when called.
  • Aligned error experience with reusing a closed session in non-local execution.

1.16.0 (2024-05-07)

New Features

  • Support stored procedure register with packages given as Python modules.
  • Added snowflake.snowpark.Session.lineage.trace to explore data lineage of snowfake objects.
  • Added support for structured type schema parsing.

Bug Fixes

  • Fixed a bug when inferring schema, single quotes are added to stage files already have single quotes.

Local Testing Updates

New Features

  • Added support for StringType, TimestampType and VariantType data conversion in the mocked function to_date.
  • Added support for the following APIs:
    • snowflake.snowpark.functions
      • get
      • concat
      • concat_ws

Bug Fixes

  • Fixed a bug that caused NaT and NaN values to not be recognized.
  • Fixed a bug where, when inferring a schema, single quotes were added to stage files that already had single quotes.
  • Fixed a bug where DataFrameReader.csv was unable to handle quoted values containing a delimiter.
  • Fixed a bug that when there is None value in an arithmetic calculation, the output should remain None instead of math.nan.
  • Fixed a bug in function sum and covar_pop that when there is math.nan in the data, the output should also be math.nan.
  • Fixed a bug that stage operation can not handle directories.
  • Fixed a bug that DataFrame.to_pandas should take Snowflake numeric types with precision 38 as int64.

1.15.0 (2024-04-24)

New Features

  • Added truncate save mode in DataFrameWrite to overwrite existing tables by truncating the underlying table instead of dropping it.
  • Added telemetry to calculate query plan height and number of duplicate nodes during collect operations.
  • Added the functions below to unload data from a DataFrame into one or more files in a stage:
    • DataFrame.write.json
    • DataFrame.write.csv
    • DataFrame.write.parquet
  • Added distributed tracing using open telemetry APIs for action functions in DataFrame and DataFrameWriter:
    • snowflake.snowpark.DataFrame:
      • collect
      • collect_nowait
      • to_pandas
      • count
      • show
    • snowflake.snowpark.DataFrameWriter:
      • save_as_table
  • Added support for snow:// URLs to snowflake.snowpark.Session.file.get and snowflake.snowpark.Session.file.get_stream
  • Added support to register stored procedures and UDxFs with a comment.
  • UDAF client support is ready for public preview. Please stay tuned for the Snowflake announcement of UDAF public preview.
  • Added support for dynamic pivot. This feature is currently in private preview.

Improvements

  • Improved the generated query performance for both compilation and execution by converting duplicate subqueries to Common Table Expressions (CTEs). It is still an experimental feature not enabled by default, and can be enabled by setting session.cte_optimization_enabled to True.

Bug Fixes

  • Fixed a bug where statement_params was not passed to query executions that register stored procedures and user defined functions.
  • Fixed a bug causing snowflake.snowpark.Session.file.get_stream to fail for quoted stage locations.
  • Fixed a bug that an internal type hint in utils.py might raise AttributeError in case the underlying module can not be found.

Local Testing Updates

New Features

  • Added support for registering UDFs and stored procedures.
  • Added support for the following APIs:
    • snowflake.snowpark.Session:
      • file.put
      • file.put_stream
      • file.get
      • file.get_stream
      • read.json
      • add_import
      • remove_import
      • get_imports
      • clear_imports
      • add_packages
      • add_requirements
      • clear_packages
      • remove_package
      • udf.register
      • udf.register_from_file
      • sproc.register
      • sproc.register_from_file
    • snowflake.snowpark.functions
      • current_database
      • current_session
      • date_trunc
      • object_construct
      • object_construct_keep_null
      • pow
      • sqrt
      • udf
      • sproc
  • Added support for StringType, TimestampType and VariantType data conversion in the mocked function to_time.

Bug Fixes

  • Fixed a bug that null filled columns for constant functions.
  • Fixed a bug that implementation of to_object, to_array and to_binary to better handle null inputs.
  • Fixed a bug that timestamp data comparison can not handle year beyond 2262.
  • Fixed a bug that Session.builder.getOrCreate should return the created mock session.

1.14.0 (2024-03-20)

New Features

  • Added support for creating vectorized UDTFs with process method.
  • Added support for dataframe functions:
    • to_timestamp_ltz
    • to_timestamp_ntz
    • to_timestamp_tz
    • locate
  • Added support for ASOF JOIN type.
  • Added support for the following local testing APIs:
    • snowflake.snowpark.functions:
      • to_double
      • to_timestamp
      • to_timestamp_ltz
      • to_timestamp_ntz
      • to_timestamp_tz
      • greatest
      • least
      • convert_timezone
      • dateadd
      • date_part
    • snowflake.snowpark.Session:
      • get_current_account
      • get_current_warehouse
      • get_current_role
      • use_schema
      • use_warehouse
      • use_database
      • use_role

Bug Fixes

  • Fixed a bug in SnowflakePlanBuilder that save_as_table does not filter column that name start with '$' and follow by number correctly.
  • Fixed a bug that statement parameters may have no effect when resolving imports and packages.
  • Fixed bugs in local testing:
    • LEFT ANTI and LEFT SEMI joins drop rows with null values.
    • DataFrameReader.csv incorrectly parses data when the optional parameter field_optionally_enclosed_by is specified.
    • Column.regexp only considers the first entry when pattern is a Column.
    • Table.update raises KeyError when updating null values in the rows.
    • VARIANT columns raise errors at DataFrame.collect.
    • count_distinct does not work correctly when counting.
    • Null values in integer columns raise TypeError.

Improvements

  • Added telemetry to local testing.
  • Improved the error message of DataFrameReader to raise FileNotFound error when reading a path that does not exist or when there are no files under the path.

1.13.0 (2024-02-26)

New Features

  • Added support for an optional date_part argument in function last_day.
  • SessionBuilder.app_name will set the query_tag after the session is created.
  • Added support for the following local testing functions:
    • current_timestamp
    • current_date
    • current_time
    • strip_null_value
    • upper
    • lower
    • length
    • initcap

Improvements

  • Added cleanup logic at interpreter shutdown to close all active sessions.
  • Closing sessions within stored procedures now is a no-op logging a warning instead of raising an error.

Bug Fixes

  • Fixed a bug in DataFrame.to_local_iterator where the iterator could yield wrong results if another query is executed before the iterator finishes due to wrong isolation level. For details, please see #945.
  • Fixed a bug that truncated table names in error messages while running a plan with local testing enabled.
  • Fixed a bug that Session.range returns empty result when the range is large.

1.12.1 (2024-02-08)

Improvements

  • Use split_blocks=True by default during to_pandas conversion, for optimal memory allocation. This parameter is passed to pyarrow.Table.to_pandas, which enables PyArrow to split the memory allocation into smaller, more manageable blocks instead of allocating a single contiguous block. This results in better memory management when dealing with larger datasets.

Bug Fixes

  • Fixed a bug in DataFrame.to_pandas that caused an error when evaluating on a Dataframe with an IntergerType column with null values.

1.12.0 (2024-01-30)

New Features

  • Exposed statement_params in StoredProcedure.__call__.
  • Added two optional arguments to Session.add_import.
    • chunk_size: The number of bytes to hash per chunk of the uploaded files.
    • whole_file_hash: By default only the first chunk of the uploaded import is hashed to save time. When this is set to True each uploaded file is fully hashed instead.
  • Added parameters external_access_integrations and secrets when creating a UDAF from Snowpark Python to allow integration with external access.
  • Added a new method Session.append_query_tag. Allows an additional tag to be added to the current query tag by appending it as a comma separated value.
  • Added a new method Session.update_query_tag. Allows updates to a JSON encoded dictionary query tag.
  • SessionBuilder.getOrCreate will now attempt to replace the singleton it returns when token expiration has been detected.
  • Added support for new functions in snowflake.snowpark.functions:
    • array_except
    • create_map
    • sign/signum
  • Added the following functions to DataFrame.analytics:
    • Added the moving_agg function in DataFrame.analytics to enable moving aggregations like sums and averages with multiple window sizes.
    • Added the cummulative_agg function in DataFrame.analytics to enable commulative aggregations like sums and averages on multiple columns.
    • Added the compute_lag and compute_lead functions in DataFrame.analytics for enabling lead and lag calculations on multiple columns.
    • Added the time_series_agg function in DataFrame.analytics to enable time series aggregations like sums and averages with multiple time windows.

Bug Fixes

  • Fixed a bug in DataFrame.na.fill that caused Boolean values to erroneously override integer values.

  • Fixed a bug in Session.create_dataframe where the Snowpark DataFrames created using pandas DataFrames were not inferring the type for timestamp columns correctly. The behavior is as follows:

    • Earlier timestamp columns without a timezone would be converted to nanosecond epochs and inferred as LongType(), but will now be correctly maintained as timestamp values and be inferred as TimestampType(TimestampTimeZone.NTZ).
    • Earlier timestamp columns with a timezone would be inferred as TimestampType(TimestampTimeZone.NTZ) and loose timezone information but will now be correctly inferred as TimestampType(TimestampTimeZone.LTZ) and timezone information is retained correctly.
    • Set session parameter PYTHON_SNOWPARK_USE_LOGICAL_TYPE_FOR_CREATE_DATAFRAME to revert back to old behavior. It is recommended that you update your code to align with correct behavior because the parameter will be removed in the future.
  • Fixed a bug that DataFrame.to_pandas gets decimal type when scale is not 0, and creates an object dtype in pandas. Instead, we cast the value to a float64 type.

  • Fixed bugs that wrongly flattened the generated SQL when one of the following happens:

    • DataFrame.filter() is called after DataFrame.sort().limit().
    • DataFrame.sort() or filter() is called on a DataFrame that already has a window function or sequence-dependent data generator column. For instance, df.select("a", seq1().alias("b")).select("a", "b").sort("a") won't flatten the sort clause anymore.
    • a window or sequence-dependent data generator column is used after DataFrame.limit(). For instance, df.limit(10).select(row_number().over()) won't flatten the limit and select in the generated SQL.
  • Fixed a bug where aliasing a DataFrame column raised an error when the DataFame was copied from another DataFrame with an aliased column. For instance,

    df = df.select(col("a").alias("b"))
    df = copy(df)
    df.select(col("b").alias("c"))  # threw an error. Now it's fixed.
    
  • Fixed a bug in Session.create_dataframe that the non-nullable field in a schema is not respected for boolean type. Note that this fix is only effective when the user has the privilege to create a temp table.

  • Fixed a bug in SQL simplifier where non-select statements in session.sql dropped a SQL query when used with limit().

  • Fixed a bug that raised an exception when session parameter ERROR_ON_NONDETERMINISTIC_UPDATE is true.

Behavior Changes (API Compatible)

  • When parsing data types during a to_pandas operation, we rely on GS precision value to fix precision issues for large integer values. This may affect users where a column that was earlier returned as int8 gets returned as int64. Users can fix this by explicitly specifying precision values for their return column.
  • Aligned behavior for Session.call in case of table stored procedures where running Session.call would not trigger stored procedure unless a collect() operation was performed.
  • StoredProcedureRegistration will now automatically add snowflake-snowpark-python as a package dependency. The added dependency will be on the client's local version of the library and an error is thrown if the server cannot support that version.

1.11.1 (2023-12-07)

Bug Fixes

  • Fixed a bug that numpy should not be imported at the top level of mock module.
  • Added support for these new functions in snowflake.snowpark.functions:
    • from_utc_timestamp
    • to_utc_timestamp

1.11.0 (2023-12-05)

New Features

  • Add the conn_error attribute to SnowflakeSQLException that stores the whole underlying exception from snowflake-connector-python.

  • Added support for RelationalGroupedDataframe.pivot() to access pivot in the following pattern Dataframe.group_by(...).pivot(...).

  • Added experimental feature: Local Testing Mode, which allows you to create and operate on Snowpark Python DataFrames locally without connecting to a Snowflake account. You can use the local testing framework to test your DataFrame operations locally, on your development machine or in a CI (continuous integration) pipeline, before deploying code changes to your account.

  • Added support for arrays_to_object new functions in snowflake.snowpark.functions.

  • Added support for the vector data type.

Dependency Updates

  • Bumped cloudpickle dependency to work with cloudpickle==2.2.1
  • Updated snowflake-connector-python to 3.4.0.

Bug Fixes

  • DataFrame column names quoting check now supports newline characters.
  • Fix a bug where a DataFrame generated by session.read.with_metadata creates inconsistent table when doing df.write.save_as_table.

1.10.0 (2023-11-03)

New Features

  • Added support for managing case sensitivity in DataFrame.to_local_iterator().
  • Added support for specifying vectorized UDTF's input column names by using the optional parameter input_names in UDTFRegistration.register/register_file and functions.pandas_udtf. By default, RelationalGroupedDataFrame.applyInPandas will infer the column names from current dataframe schema.
  • Add sql_error_code and raw_message attributes to SnowflakeSQLException when it is caused by a SQL exception.

Bug Fixes

  • Fixed a bug in DataFrame.to_pandas() where converting snowpark dataframes to pandas dataframes was losing precision on integers with more than 19 digits.
  • Fixed a bug that session.add_packages can not handle requirement specifier that contains project name with underscore and version.
  • Fixed a bug in DataFrame.limit() when offset is used and the parent DataFrame uses limit. Now the offset won't impact the parent DataFrame's limit.
  • Fixed a bug in DataFrame.write.save_as_table where dataframes created from read api could not save data into snowflake because of invalid column name $1.

Behavior change

  • Changed the behavior of date_format:
    • The format argument changed from optional to required.
    • The returned result changed from a date object to a date-formatted string.
  • When a window function, or a sequence-dependent data generator (normal, zipf, uniform, seq1, seq2, seq4, seq8) function is used, the sort and filter operation will no longer be flattened when generating the query.

1.9.0 (2023-10-13)

New Features

  • Added support for the Python 3.11 runtime environment.

Dependency updates

  • Added back the dependency of typing-extensions.

Bug Fixes

  • Fixed a bug where imports from permanent stage locations were ignored for temporary stored procedures, UDTFs, UDFs, and UDAFs.
  • Revert back to using CTAS (create table as select) statement for Dataframe.writer.save_as_table which does not need insert permission for writing tables.

New Features

  • Support PythonObjJSONEncoder json-serializable objects for ARRAY and OBJECT literals.

1.8.0 (2023-09-14)

New Features

  • Added support for VOLATILE/IMMUTABLE keyword when registering UDFs.

  • Added support for specifying clustering keys when saving dataframes using DataFrame.save_as_table.

  • Accept Iterable objects input for schema when creating dataframes using Session.create_dataframe.

  • Added the property DataFrame.session to return a Session object.

  • Added the property Session.session_id to return an integer that represents session ID.

  • Added the property Session.connection to return a SnowflakeConnection object .

  • Added support for creating a Snowpark session from a configuration file or environment variables.

Dependency updates

  • Updated snowflake-connector-python to 3.2.0.

Bug Fixes

  • Fixed a bug where automatic package upload would raise ValueError even when compatible package version were added in session.add_packages.
  • Fixed a bug where table stored procedures were not registered correctly when using register_from_file.
  • Fixed a bug where dataframe joins failed with invalid_identifier error.
  • Fixed a bug where DataFrame.copy disables SQL simplfier for the returned copy.
  • Fixed a bug where session.sql().select() would fail if any parameters are specified to session.sql()

1.7.0 (2023-08-28)

New Features

  • Added parameters external_access_integrations and secrets when creating a UDF, UDTF or Stored Procedure from Snowpark Python to allow integration with external access.
  • Added support for these new functions in snowflake.snowpark.functions:
    • array_flatten
    • flatten
  • Added support for apply_in_pandas in snowflake.snowpark.relational_grouped_dataframe.
  • Added support for replicating your local Python environment on Snowflake via Session.replicate_local_environment.

Bug Fixes

  • Fixed a bug where session.create_dataframe fails to properly set nullable columns where nullability was affected by order or data was given.
  • Fixed a bug where DataFrame.select could not identify and alias columns in presence of table functions when output columns of table function overlapped with columns in dataframe.

Behavior Changes

  • When creating stored procedures, UDFs, UDTFs, UDAFs with parameter is_permanent=False will now create temporary objects even when stage_name is provided. The default value of is_permanent is False which is why if this value is not explicitly set to True for permanent objects, users will notice a change in behavior.
  • types.StructField now enquotes column identifier by default.

1.6.1 (2023-08-02)

New Features

  • Added support for these new functions in snowflake.snowpark.functions:
    • array_sort
    • sort_array
    • array_min
    • array_max
    • explode_outer
  • Added support for pure Python packages specified via Session.add_requirements or Session.add_packages. They are now usable in stored procedures and UDFs even if packages are not present on the Snowflake Anaconda channel.
    • Added Session parameter custom_packages_upload_enabled and custom_packages_force_upload_enabled to enable the support for pure Python packages feature mentioned above. Both parameters default to False.
  • Added support for specifying package requirements by passing a Conda environment yaml file to Session.add_requirements.
  • Added support for asynchronous execution of multi-query dataframes that contain binding variables.
  • Added support for renaming multiple columns in DataFrame.rename.
  • Added support for Geometry datatypes.
  • Added support for params in session.sql() in stored procedures.
  • Added support for user-defined aggregate functions (UDAFs). This feature is currently in private preview.
  • Added support for vectorized UDTFs (user-defined table functions). This feature is currently in public preview.
  • Added support for Snowflake Timestamp variants (i.e., TIMESTAMP_NTZ, TIMESTAMP_LTZ, TIMESTAMP_TZ)
    • Added TimestampTimezone as an argument in TimestampType constructor.
    • Added type hints NTZ, LTZ, TZ and Timestamp to annotate functions when registering UDFs.

Improvements

  • Removed redundant dependency typing-extensions.
  • DataFrame.cache_result now creates temp table fully qualified names under current database and current schema.

Bug Fixes

  • Fixed a bug where type check happens on pandas before it is imported.
  • Fixed a bug when creating a UDF from numpy.ufunc.
  • Fixed a bug where DataFrame.union was not generating the correct Selectable.schema_query when SQL simplifier is enabled.

Behavior Changes

  • DataFrameWriter.save_as_table now respects the nullable field of the schema provided by the user or the inferred schema based on data from user input.

Dependency updates

  • Updated snowflake-connector-python to 3.0.4.

1.5.1 (2023-06-20)

New Features

  • Added support for the Python 3.10 runtime environment.

1.5.0 (2023-06-09)

Behavior Changes

  • Aggregation results, from functions such as DataFrame.agg and DataFrame.describe, no longer strip away non-printing characters from column names.

New Features

  • Added support for the Python 3.9 runtime environment.
  • Added support for new functions in snowflake.snowpark.functions:
    • array_generate_range
    • array_unique_agg
    • collect_set
    • sequence
  • Added support for registering and calling stored procedures with TABLE return type.
  • Added support for parameter length in StringType() to specify the maximum number of characters that can be stored by the column.
  • Added the alias functions.element_at() for functions.get().
  • Added the alias Column.contains for functions.contains.
  • Added experimental feature DataFrame.alias.
  • Added support for querying metadata columns from stage when creating DataFrame using DataFrameReader.
  • Added support for StructType.add to append more fields to existing StructType objects.
  • Added support for parameter execute_as in StoredProcedureRegistration.register_from_file() to specify stored procedure caller rights.

Bug Fixes

  • Fixed a bug where the Dataframe.join_table_function did not run all of the necessary queries to set up the join table function when SQL simplifier was enabled.
  • Fixed type hint declaration for custom types - ColumnOrName, ColumnOrLiteralStr, ColumnOrSqlExpr, LiteralType and ColumnOrLiteral that were breaking mypy checks.
  • Fixed a bug where DataFrameWriter.save_as_table and DataFrame.copy_into_table failed to parse fully qualified table names.

1.4.0 (2023-04-24)

New Features

  • Added support for session.getOrCreate.
  • Added support for alias Column.getField.
  • Added support for new functions in snowflake.snowpark.functions:
    • date_add and date_sub to make add and subtract operations easier.
    • daydiff
    • explode
    • array_distinct.
    • regexp_extract.
    • struct.
    • format_number.
    • bround.
    • substring_index
  • Added parameter skip_upload_on_content_match when creating UDFs, UDTFs and stored procedures using register_from_file to skip uploading files to a stage if the same version of the files are already on the stage.
  • Added support for DataFrameWriter.save_as_table method to take table names that contain dots.
  • Flattened generated SQL when DataFrame.filter() or DataFrame.order_by() is followed by a projection statement (e.g. DataFrame.select(), DataFrame.with_column()).
  • Added support for creating dynamic tables (in private preview) using Dataframe.create_or_replace_dynamic_table.
  • Added an optional argument params in session.sql() to support binding variables. Note that this is not supported in stored procedures yet.

Bug Fixes

  • Fixed a bug in strtok_to_array where an exception was thrown when a delimiter was passed in.
  • Fixed a bug in session.add_import where the module had the same namespace as other dependencies.

1.3.0 (2023-03-28)

New Features

  • Added support for delimiters parameter in functions.initcap().
  • Added support for functions.hash() to accept a variable number of input expressions.
  • Added API Session.RuntimeConfig for getting/setting/checking the mutability of any runtime configuration.
  • Added support managing case sensitivity in Row results from DataFrame.collect using case_sensitive parameter.
  • Added API Session.conf for getting, setting or checking the mutability of any runtime configuration.
  • Added support for managing case sensitivity in Row results from DataFrame.collect using case_sensitive parameter.
  • Added indexer support for snowflake.snowpark.types.StructType.
  • Added a keyword argument log_on_exception to Dataframe.collect and Dataframe.collect_no_wait to optionally disable error logging for SQL exceptions.

Bug Fixes

  • Fixed a bug where a DataFrame set operation(DataFrame.substract, DataFrame.union, etc.) being called after another DataFrame set operation and DataFrame.select or DataFrame.with_column throws an exception.
  • Fixed a bug where chained sort statements are overwritten by the SQL simplifier.

Improvements

  • Simplified JOIN queries to use constant subquery aliases (SNOWPARK_LEFT, SNOWPARK_RIGHT) by default. Users can disable this at runtime with session.conf.set('use_constant_subquery_alias', False) to use randomly generated alias names instead.
  • Allowed specifying statement parameters in session.call().
  • Enabled the uploading of large pandas DataFrames in stored procedures by defaulting to a chunk size of 100,000 rows.

1.2.0 (2023-03-02)

New Features

  • Added support for displaying source code as comments in the generated scripts when registering stored procedures. This is enabled by default, turn off by specifying source_code_display=False at registration.
  • Added a parameter if_not_exists when creating a UDF, UDTF or Stored Procedure from Snowpark Python to ignore creating the specified function or procedure if it already exists.
  • Accept integers when calling snowflake.snowpark.functions.get to extract value from array.
  • Added functions.reverse in functions to open access to Snowflake built-in function reverse.
  • Added parameter require_scoped_url in snowflake.snowflake.files.SnowflakeFile.open() (in Private Preview) to replace is_owner_file is marked for deprecation.

Bug Fixes

  • Fixed a bug that overwrote paramstyle to qmark when creating a Snowpark session.
  • Fixed a bug where df.join(..., how="cross") fails with SnowparkJoinException: (1112): Unsupported using join type 'Cross'.
  • Fixed a bug where querying a DataFrame column created from chained function calls used a wrong column name.

1.1.0 (2023-01-26)

New Features:

  • Added asc, asc_nulls_first, asc_nulls_last, desc, desc_nulls_first, desc_nulls_last, date_part and unix_timestamp in functions.
  • Added the property DataFrame.dtypes to return a list of column name and data type pairs.
  • Added the following aliases:
    • functions.expr() for functions.sql_expr().
    • functions.date_format() for functions.to_date().
    • functions.monotonically_increasing_id() for functions.seq8()
    • functions.from_unixtime() for functions.to_timestamp()

Bug Fixes:

Improvements

  • The session parameter PYTHON_SNOWPARK_USE_SQL_SIMPLIFIER is True after Snowflake 7.3 was released. In snowpark-python, session.sql_simplifier_enabled reads the value of PYTHON_SNOWPARK_USE_SQL_SIMPLIFIER by default, meaning that the SQL simplfier is enabled by default after the Snowflake 7.3 release. To turn this off, set PYTHON_SNOWPARK_USE_SQL_SIMPLIFIER in Snowflake to False or run session.sql_simplifier_enabled = False from Snowpark. It is recommended to use the SQL simplifier because it helps to generate more concise SQL.

1.0.0 (2022-11-01)

New Features

  • Added Session.generator() to create a new DataFrame using the Generator table function.
  • Added a parameter secure to the functions that create a secure UDF or UDTF.

0.12.0 (2022-10-14)

New Features

  • Added new APIs for async job:
    • Session.create_async_job() to create an AsyncJob instance from a query id.
    • AsyncJob.result() now accepts argument result_type to return the results in different formats.
    • AsyncJob.to_df() returns a DataFrame built from the result of this asynchronous job.
    • AsyncJob.query() returns the SQL text of the executed query.
  • DataFrame.agg() and RelationalGroupedDataFrame.agg() now accept variable-length arguments.
  • Added parameters lsuffix and rsuffix to DataFram.join() and DataFrame.cross_join() to conveniently rename overlapping columns.
  • Added Table.drop_table() so you can drop the temp table after DataFrame.cache_result(). Table is also a context manager so you can use the with statement to drop the cache temp table after use.
  • Added Session.use_secondary_roles().
  • Added functions first_value() and last_value(). (contributed by @chasleslr)
  • Added on as an alias for using_columns and how as an alias for join_type in DataFrame.join().

Bug Fixes

  • Fixed a bug in Session.create_dataframe() that raised an error when schema names had special characters.
  • Fixed a bug in which options set in Session.read.option() were not passed to DataFrame.copy_into_table() as default values.
  • Fixed a bug in which DataFrame.copy_into_table() raises an error when a copy option has single quotes in the value.

0.11.0 (2022-09-28)

Behavior Changes

  • Session.add_packages() now raises ValueError when the version of a package cannot be found in Snowflake Anaconda channel. Previously, Session.add_packages() succeeded, and a SnowparkSQLException exception was raised later in the UDF/SP registration step.

New Features:

  • Added method FileOperation.get_stream() to support downloading stage files as stream.
  • Added support in functions.ntiles() to accept int argument.
  • Added the following aliases:
    • functions.call_function() for functions.call_builtin().
    • functions.function() for functions.builtin().
    • DataFrame.order_by() for DataFrame.sort()
    • DataFrame.orderBy() for DataFrame.sort()
  • Improved DataFrame.cache_result() to return a more accurate Table class instead of a DataFrame class.
  • Added support to allow session as the first argument when calling StoredProcedure.

Improvements

  • Improved nested query generation by flattening queries when applicable.
    • This improvement could be enabled by setting Session.sql_simplifier_enabled = True.
    • DataFrame.select(), DataFrame.with_column(), DataFrame.drop() and other select-related APIs have more flattened SQLs.
    • DataFrame.union(), DataFrame.union_all(), DataFrame.except_(), DataFrame.intersect(), DataFrame.union_by_name() have flattened SQLs generated when multiple set operators are chained.
  • Improved type annotations for async job APIs.

Bug Fixes

  • Fixed a bug in which Table.update(), Table.delete(), Table.merge() try to reference a temp table that does not exist.

0.10.0 (2022-09-16)

New Features:

  • Added experimental APIs for evaluating Snowpark dataframes with asynchronous queries:
    • Added keyword argument block to the following action APIs on Snowpark dataframes (which execute queries) to allow asynchronous evaluations:
      • DataFrame.collect(), DataFrame.to_local_iterator(), DataFrame.to_pandas(), DataFrame.to_pandas_batches(), DataFrame.count(), DataFrame.first().
      • DataFrameWriter.save_as_table(), DataFrameWriter.copy_into_location().
      • Table.delete(), Table.update(), Table.merge().
    • Added method DataFrame.collect_nowait() to allow asynchronous evaluations.
    • Added class AsyncJob to retrieve results from asynchronously executed queries and check their status.
  • Added support for table_type in Session.write_pandas(). You can now choose from these table_type options: "temporary", "temp", and "transient".
  • Added support for using Python structured data (list, tuple and dict) as literal values in Snowpark.
  • Added keyword argument execute_as to functions.sproc() and session.sproc.register() to allow registering a stored procedure as a caller or owner.
  • Added support for specifying a pre-configured file format when reading files from a stage in Snowflake.

Improvements:

  • Added support for displaying details of a Snowpark session.

Bug Fixes:

  • Fixed a bug in which DataFrame.copy_into_table() and DataFrameWriter.save_as_table() mistakenly created a new table if the table name is fully qualified, and the table already exists.

Deprecations:

  • Deprecated keyword argument create_temp_table in Session.write_pandas().
  • Deprecated invoking UDFs using arguments wrapped in a Python list or tuple. You can use variable-length arguments without a list or tuple.

Dependency updates

  • Updated snowflake-connector-python to 2.7.12.

0.9.0 (2022-08-30)

New Features:

  • Added support for displaying source code as comments in the generated scripts when registering UDFs. This feature is turned on by default. To turn it off, pass the new keyword argument source_code_display as False when calling register() or @udf().
  • Added support for calling table functions from DataFrame.select(), DataFrame.with_column() and DataFrame.with_columns() which now take parameters of type table_function.TableFunctionCall for columns.
  • Added keyword argument overwrite to session.write_pandas() to allow overwriting contents of a Snowflake table with that of a pandas DataFrame.
  • Added keyword argument column_order to df.write.save_as_table() to specify the matching rules when inserting data into table in append mode.
  • Added method FileOperation.put_stream() to upload local files to a stage via file stream.
  • Added methods TableFunctionCall.alias() and TableFunctionCall.as_() to allow aliasing the names of columns that come from the output of table function joins.
  • Added function get_active_session() in module snowflake.snowpark.context to get the current active Snowpark session.

Bug Fixes:

  • Fixed a bug in which batch insert should not raise an error when statement_params is not passed to the function.
  • Fixed a bug in which column names should be quoted when session.create_dataframe() is called with dicts and a given schema.
  • Fixed a bug in which creation of table should be skipped if the table already exists and is in append mode when calling df.write.save_as_table().
  • Fixed a bug in which third-party packages with underscores cannot be added when registering UDFs.

Improvements:

  • Improved function function.uniform() to infer the types of inputs max_ and min_ and cast the limits to IntegerType or FloatType correspondingly.

0.8.0 (2022-07-22)

New Features:

  • Added keyword only argument statement_params to the following methods to allow for specifying statement level parameters:
    • collect, to_local_iterator, to_pandas, to_pandas_batches, count, copy_into_table, show, create_or_replace_view, create_or_replace_temp_view, first, cache_result and random_split on class snowflake.snowpark.Dateframe.
    • update, delete and merge on class snowflake.snowpark.Table.
    • save_as_table and copy_into_location on class snowflake.snowpark.DataFrameWriter.
    • approx_quantile, statement_params, cov and crosstab on class snowflake.snowpark.DataFrameStatFunctions.
    • register and register_from_file on class snowflake.snowpark.udf.UDFRegistration.
    • register and register_from_file on class snowflake.snowpark.udtf.UDTFRegistration.
    • register and register_from_file on class snowflake.snowpark.stored_procedure.StoredProcedureRegistration.
    • udf, udtf and sproc in snowflake.snowpark.functions.
  • Added support for Column as an input argument to session.call().
  • Added support for table_type in df.write.save_as_table(). You can now choose from these table_type options: "temporary", "temp", and "transient".

Improvements:

  • Added validation of object name in session.use_* methods.
  • Updated the query tag in SQL to escape it when it has special characters.
  • Added a check to see if Anaconda terms are acknowledged when adding missing packages.

Bug Fixes:

  • Fixed the limited length of the string column in session.create_dataframe().
  • Fixed a bug in which session.create_dataframe() mistakenly converted 0 and False to None when the input data was only a list.
  • Fixed a bug in which calling session.create_dataframe() using a large local dataset sometimes created a temp table twice.
  • Aligned the definition of function.trim() with the SQL function definition.
  • Fixed an issue where snowpark-python would hang when using the Python system-defined (built-in function) sum vs. the Snowpark function.sum().

Deprecations:

  • Deprecated keyword argument create_temp_table in df.write.save_as_table().

0.7.0 (2022-05-25)

New Features:

  • Added support for user-defined table functions (UDTFs).
    • Use function snowflake.snowpark.functions.udtf() to register a UDTF, or use it as a decorator to register the UDTF.
      • You can also use Session.udtf.register() to register a UDTF.
    • Use Session.udtf.register_from_file() to register a UDTF from a Python file.
  • Updated APIs to query a table function, including both Snowflake built-in table functions and UDTFs.
    • Use function snowflake.snowpark.functions.table_function() to create a callable representing a table function and use it to call the table function in a query.
    • Alternatively, use function snowflake.snowpark.functions.call_table_function() to call a table function.
    • Added support for over clause that specifies partition by and order by when lateral joining a table function.
    • Updated Session.table_function() and DataFrame.join_table_function() to accept TableFunctionCall instances.

Breaking Changes:

  • When creating a function with functions.udf() and functions.sproc(), you can now specify an empty list for the imports or packages argument to indicate that no import or package is used for this UDF or stored procedure. Previously, specifying an empty list meant that the function would use session-level imports or packages.
  • Improved the __repr__ implementation of data types in types.py. The unused type_name property has been removed.
  • Added a Snowpark-specific exception class for SQL errors. This replaces the previous ProgrammingError from the Python connector.

Improvements:

  • Added a lock to a UDF or UDTF when it is called for the first time per thread.
  • Improved the error message for pickling errors that occurred during UDF creation.
  • Included the query ID when logging the failed query.

Bug Fixes:

  • Fixed a bug in which non-integral data (such as timestamps) was occasionally converted to integer when calling DataFrame.to_pandas().
  • Fixed a bug in which DataFrameReader.parquet() failed to read a parquet file when its column contained spaces.
  • Fixed a bug in which DataFrame.copy_into_table() failed when the dataframe is created by reading a file with inferred schemas.

Deprecations

Session.flatten() and DataFrame.flatten().

Dependency Updates:

  • Restricted the version of cloudpickle <= 2.0.0.

0.6.0 (2022-04-27)

New Features:

  • Added support for vectorized UDFs with the input as a pandas DataFrame or pandas Series and the output as a pandas Series. This improves the performance of UDFs in Snowpark.
  • Added support for inferring the schema of a DataFrame by default when it is created by reading a Parquet, Avro, or ORC file in the stage.
  • Added functions current_session(), current_statement(), current_user(), current_version(), current_warehouse(), date_from_parts(), date_trunc(), dayname(), dayofmonth(), dayofweek(), dayofyear(), grouping(), grouping_id(), hour(), last_day(), minute(), next_day(), previous_day(), second(), month(), monthname(), quarter(), year(), current_database(), current_role(), current_schema(), current_schemas(), current_region(), current_avaliable_roles(), add_months(), any_value(), bitnot(), bitshiftleft(), bitshiftright(), convert_timezone(), uniform(), strtok_to_array(), sysdate(), time_from_parts(), timestamp_from_parts(), timestamp_ltz_from_parts(), timestamp_ntz_from_parts(), timestamp_tz_from_parts(), weekofyear(), percentile_cont() to snowflake.snowflake.functions.

Breaking Changes:

  • Expired deprecations:
    • Removed the following APIs that were deprecated in 0.4.0: DataFrame.groupByGroupingSets(), DataFrame.naturalJoin(), DataFrame.joinTableFunction, DataFrame.withColumns(), Session.getImports(), Session.addImport(), Session.removeImport(), Session.clearImports(), Session.getSessionStage(), Session.getDefaultDatabase(), Session.getDefaultSchema(), Session.getCurrentDatabase(), Session.getCurrentSchema(), Session.getFullyQualifiedCurrentSchema().

Improvements:

  • Added support for creating an empty DataFrame with a specific schema using the Session.create_dataframe() method.
  • Changed the logging level from INFO to DEBUG for several logs (e.g., the executed query) when evaluating a dataframe.
  • Improved the error message when failing to create a UDF due to pickle errors.

Bug Fixes:

  • Removed pandas hard dependencies in the Session.create_dataframe() method.

Dependency Updates:

  • Added typing-extension as a new dependency with the version >= 4.1.0.

0.5.0 (2022-03-22)

New Features

  • Added stored procedures API.
    • Added Session.sproc property and sproc() to snowflake.snowpark.functions, so you can register stored procedures.
    • Added Session.call to call stored procedures by name.
  • Added UDFRegistration.register_from_file() to allow registering UDFs from Python source files or zip files directly.
  • Added UDFRegistration.describe() to describe a UDF.
  • Added DataFrame.random_split() to provide a way to randomly split a dataframe.
  • Added functions md5(), sha1(), sha2(), ascii(), initcap(), length(), lower(), lpad(), ltrim(), rpad(), rtrim(), repeat(), soundex(), regexp_count(), replace(), charindex(), collate(), collation(), insert(), left(), right(), endswith() to snowflake.snowpark.functions.
  • Allowed call_udf() to accept literal values.
  • Provided a distinct keyword in array_agg().

Bug Fixes:

  • Fixed an issue that caused DataFrame.to_pandas() to have a string column if Column.cast(IntegerType()) was used.
  • Fixed a bug in DataFrame.describe() when there is more than one string column.

0.4.0 (2022-02-15)

New Features

  • You can now specify which Anaconda packages to use when defining UDFs.
    • Added add_packages(), get_packages(), clear_packages(), and remove_package(), to class Session.
    • Added add_requirements() to Session so you can use a requirements file to specify which packages this session will use.
    • Added parameter packages to function snowflake.snowpark.functions.udf() and method UserDefinedFunction.register() to indicate UDF-level Anaconda package dependencies when creating a UDF.
    • Added parameter imports to snowflake.snowpark.functions.udf() and UserDefinedFunction.register() to specify UDF-level code imports.
  • Added a parameter session to function udf() and UserDefinedFunction.register() so you can specify which session to use to create a UDF if you have multiple sessions.
  • Added types Geography and Variant to snowflake.snowpark.types to be used as type hints for Geography and Variant data when defining a UDF.
  • Added support for Geography geoJSON data.
  • Added Table, a subclass of DataFrame for table operations:
    • Methods update and delete update and delete rows of a table in Snowflake.
    • Method merge merges data from a DataFrame to a Table.
    • Override method DataFrame.sample() with an additional parameter seed, which works on tables but not on view and sub-queries.
  • Added DataFrame.to_local_iterator() and DataFrame.to_pandas_batches() to allow getting results from an iterator when the result set returned from the Snowflake database is too large.
  • Added DataFrame.cache_result() for caching the operations performed on a DataFrame in a temporary table. Subsequent operations on the original DataFrame have no effect on the cached result DataFrame.
  • Added property DataFrame.queries to get SQL queries that will be executed to evaluate the DataFrame.
  • Added Session.query_history() as a context manager to track SQL queries executed on a session, including all SQL queries to evaluate DataFrames created from a session. Both query ID and query text are recorded.
  • You can now create a Session instance from an existing established snowflake.connector.SnowflakeConnection. Use parameter connection in Session.builder.configs().
  • Added use_database(), use_schema(), use_warehouse(), and use_role() to class Session to switch database/schema/warehouse/role after a session is created.
  • Added DataFrameWriter.copy_into_table() to unload a DataFrame to stage files.
  • Added DataFrame.unpivot().
  • Added Column.within_group() for sorting the rows by columns with some aggregation functions.
  • Added functions listagg(), mode(), div0(), acos(), asin(), atan(), atan2(), cos(), cosh(), sin(), sinh(), tan(), tanh(), degrees(), radians(), round(), trunc(), and factorial() to snowflake.snowflake.functions.
  • Added an optional argument ignore_nulls in function lead() and lag().
  • The condition parameter of function when() and iff() now accepts SQL expressions.

Improvements

  • All function and method names have been renamed to use the snake case naming style, which is more Pythonic. For convenience, some camel case names are kept as aliases to the snake case APIs. It is recommended to use the snake case APIs.
    • Deprecated these methods on class Session and replaced them with their snake case equivalents: getImports(), addImports(), removeImport(), clearImports(), getSessionStage(), getDefaultSchema(), getDefaultSchema(), getCurrentDatabase(), getFullyQualifiedCurrentSchema().
    • Deprecated these methods on class DataFrame and replaced them with their snake case equivalents: groupingByGroupingSets(), naturalJoin(), withColumns(), joinTableFunction().
  • Property DataFrame.columns is now consistent with DataFrame.schema.names and the Snowflake database Identifier Requirements.
  • Column.__bool__() now raises a TypeError. This will ban the use of logical operators and, or, not on Column object, for instance col("a") > 1 and col("b") > 2 will raise the TypeError. Use (col("a") > 1) & (col("b") > 2) instead.
  • Changed PutResult and GetResult to subclass NamedTuple.
  • Fixed a bug which raised an error when the local path or stage location has a space or other special characters.
  • Changed DataFrame.describe() so that non-numeric and non-string columns are ignored instead of raising an exception.

Dependency updates

  • Updated snowflake-connector-python to 2.7.4.

0.3.0 (2022-01-09)

New Features

  • Added Column.isin(), with an alias Column.in_().
  • Added Column.try_cast(), which is a special version of cast(). It tries to cast a string expression to other types and returns null if the cast is not possible.
  • Added Column.startswith() and Column.substr() to process string columns.
  • Column.cast() now also accepts a str value to indicate the cast type in addition to a DataType instance.
  • Added DataFrame.describe() to summarize stats of a DataFrame.
  • Added DataFrame.explain() to print the query plan of a DataFrame.
  • DataFrame.filter() and DataFrame.select_expr() now accepts a sql expression.
  • Added a new bool parameter create_temp_table to methods DataFrame.saveAsTable() and Session.write_pandas() to optionally create a temp table.
  • Added DataFrame.minus() and DataFrame.subtract() as aliases to DataFrame.except_().
  • Added regexp_replace(), concat(), concat_ws(), to_char(), current_timestamp(), current_date(), current_time(), months_between(), cast(), try_cast(), greatest(), least(), and hash() to module snowflake.snowpark.functions.

Bug Fixes

  • Fixed an issue where Session.createDataFrame(pandas_df) and Session.write_pandas(pandas_df) raise an exception when the pandas DataFrame has spaces in the column name.
  • DataFrame.copy_into_table() sometimes prints an error level log entry while it actually works. It's fixed now.
  • Fixed an API docs issue where some DataFrame APIs are missing from the docs.

Dependency updates

0.2.0 (2021-12-02)

New Features

  • Updated the Session.createDataFrame() method for creating a DataFrame from a pandas DataFrame.
  • Added the Session.write_pandas() method for writing a pandas DataFrame to a table in Snowflake and getting a Snowpark DataFrame object back.
  • Added new classes and methods for calling window functions.
  • Added the new functions cume_dist(), to find the cumulative distribution of a value with regard to other values within a window partition, and row_number(), which returns a unique row number for each row within a window partition.
  • Added functions for computing statistics for DataFrames in the DataFrameStatFunctions class.
  • Added functions for handling missing values in a DataFrame in the DataFrameNaFunctions class.
  • Added new methods rollup(), cube(), and pivot() to the DataFrame class.
  • Added the GroupingSets class, which you can use with the DataFrame groupByGroupingSets method to perform a SQL GROUP BY GROUPING SETS.
  • Added the new FileOperation(session) class that you can use to upload and download files to and from a stage.
  • Added the DataFrame.copy_into_table() method for loading data from files in a stage into a table.
  • In CASE expressions, the functions when() and otherwise() now accept Python types in addition to Column objects.
  • When you register a UDF you can now optionally set the replace parameter to True to overwrite an existing UDF with the same name.

Improvements

  • UDFs are now compressed before they are uploaded to the server. This makes them about 10 times smaller, which can help when you are using large ML model files.
  • When the size of a UDF is less than 8196 bytes, it will be uploaded as in-line code instead of uploaded to a stage.

Bug Fixes

  • Fixed an issue where the statement df.select(when(col("a") == 1, 4).otherwise(col("a"))), [Row(4), Row(2), Row(3)] raised an exception.
  • Fixed an issue where df.toPandas() raised an exception when a DataFrame was created from large local data.

0.1.0 (2021-10-26)

Start of Private Preview

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

snowflake_snowpark_python-1.25.0.tar.gz (1.4 MB view details)

Uploaded Source

Built Distribution

File details

Details for the file snowflake_snowpark_python-1.25.0.tar.gz.

File metadata

File hashes

Hashes for snowflake_snowpark_python-1.25.0.tar.gz
Algorithm Hash digest
SHA256 184b9f6873b4c318af2727c707000998c9f6e24542036e536d86c2e314549e47
MD5 a04281c2a993758509d4ccc35d635da7
BLAKE2b-256 460b8d167516df04914fb26dd3c217c771a5ef746b2319e41ae1accc34487691

See more details on using hashes here.

File details

Details for the file snowflake_snowpark_python-1.25.0-py3-none-any.whl.

File metadata

File hashes

Hashes for snowflake_snowpark_python-1.25.0-py3-none-any.whl
Algorithm Hash digest
SHA256 0b965308eb88568cbe6865164fb5441a336d9cf55bc603291f64ebfe799018d1
MD5 a551e738b8c4d9e29f75c93e41dcc654
BLAKE2b-256 e3553cbf9670a6cc14690aa9c3cc279a5e6d8dc74ef2deb8ce6f8fb7d8268a35

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page