Amazon SageMaker FeatureStore PySpark Bindings

SageMaker FeatureStore PySpark

SageMaker FeatureStore Spark is an open-source Spark library for Amazon SageMaker FeatureStore. With this connector, you can ingest data from a Spark DataFrame into a feature group's online and offline stores. This package provides the Python (PySpark) interface.

For full documentation including Scala usage, cross-account Lake Formation access, and troubleshooting, see the GitHub repository.

Supported Versions

Component   Supported Versions
Spark       3.1, 3.2, 3.3, 3.4, 3.5
Python      3.8, 3.9, 3.10, 3.11, 3.12
EMR         emr-7.x and above

Note: Not all Python/PySpark combinations are supported. See the compatibility matrix below.

Python / PySpark Compatibility Matrix

Python \ PySpark   3.1   3.2   3.3   3.4   3.5
3.8                Yes   Yes   Yes   Yes   Yes
3.9                Yes   Yes   Yes   Yes   Yes
3.10               No    Yes   Yes   Yes   Yes
3.11               No    No    No    Yes   Yes
3.12               No    No    No    Yes   Yes
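
The matrix can be checked programmatically before a job starts. The following is a minimal sketch rather than part of the library: the names SUPPORTED, minor_version, and is_supported are hypothetical, and the table is transcribed by hand from the matrix above.

```python
# Supported (Python, PySpark) combinations, transcribed from the matrix above.
# Keys are Python minor versions; values are the PySpark minor versions marked "Yes".
SUPPORTED = {
    "3.8": {"3.1", "3.2", "3.3", "3.4", "3.5"},
    "3.9": {"3.1", "3.2", "3.3", "3.4", "3.5"},
    "3.10": {"3.2", "3.3", "3.4", "3.5"},
    "3.11": {"3.4", "3.5"},
    "3.12": {"3.4", "3.5"},
}


def minor_version(version: str) -> str:
    """Reduce a version string such as '3.5.1' to its 'major.minor' part ('3.5')."""
    major, minor = version.split(".")[:2]
    return f"{major}.{minor}"


def is_supported(python_version: str, pyspark_version: str) -> bool:
    """Return True if the combination is marked 'Yes' in the matrix."""
    allowed = SUPPORTED.get(minor_version(python_version), set())
    return minor_version(pyspark_version) in allowed
```

In a real environment the two arguments would typically come from platform.python_version() and pyspark.__version__.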

Note: PySpark versions older than 3.5 are in maintenance mode and will not receive new features. New functionality is only added for PySpark 3.5+.

Installation

Prerequisites: PySpark and NumPy must be installed in your environment.

The package is available on PyPI. It bundles pre-built JARs for each supported Spark version (3.1-3.5). At runtime, the correct JAR is automatically selected based on your installed PySpark version.

If SPARK_HOME is set, the installer copies the matching JAR into $SPARK_HOME/jars. For EMR, the path is handled automatically.

pip3 install sagemaker-feature-store-pyspark --no-binary :all:
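
After installation, it can be useful to confirm that JARs actually landed in $SPARK_HOME/jars. The sketch below only locates that directory and lists the JAR files in it. The helper names spark_jars_dir and list_installed_jars are hypothetical, and because the bundled JAR file names are not stated here, the sketch lists every JAR rather than filtering by name.

```python
import os
from pathlib import Path
from typing import List, Mapping, Optional


def spark_jars_dir(env: Mapping[str, str] = os.environ) -> Optional[Path]:
    """Return $SPARK_HOME/jars, or None when SPARK_HOME is unset or empty."""
    spark_home = env.get("SPARK_HOME")
    return Path(spark_home) / "jars" if spark_home else None


def list_installed_jars(jars_dir: Path) -> List[str]:
    """Return the sorted names of all .jar files in jars_dir (empty if it does not exist)."""
    if not jars_dir.is_dir():
        return []
    return sorted(path.name for path in jars_dir.glob("*.jar"))


if __name__ == "__main__":
    jars_dir = spark_jars_dir()
    if jars_dir is None:
        print("SPARK_HOME is not set; no JARs were copied there.")
    else:
        for name in list_installed_jars(jars_dir):
            print(name)
```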

EMR

Create a custom JAR step to install the library:

  • Jar Location: command-runner.jar
  • Arguments: sudo -E pip3 install sagemaker-feature-store-pyspark --no-binary :all:

This installs the library on the driver node only. To install it on all executor nodes as well, create an installation script and add it as a custom bootstrap action when creating the EMR cluster.

Since bootstrap actions run before EMR applications are installed, dependent JARs cannot be automatically loaded to SPARK_HOME. When submitting your application, specify dependent JARs using:

--jars `feature-store-pyspark-dependency-jars`
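
The backticks above are shell command substitution: the output of feature-store-pyspark-dependency-jars becomes the value of --jars. When the job is launched from Python instead of a shell, the same effect can be approximated as sketched below. This assumes the command is on PATH and prints the JAR list in the form --jars expects. The function names and the application path job.py are hypothetical.

```python
import subprocess
from typing import List


def dependency_jars() -> str:
    """Run the helper command and return its output.

    Assumption: the command prints the dependent JAR paths in the
    format that spark-submit's --jars option accepts.
    """
    result = subprocess.run(
        ["feature-store-pyspark-dependency-jars"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the command fails
    )
    return result.stdout.strip()


def spark_submit_command(jars: str, application: str) -> List[str]:
    """Build the spark-submit argument list with the dependent JARs attached."""
    return ["spark-submit", "--jars", jars, application]


if __name__ == "__main__":
    # "job.py" is a placeholder for your own PySpark application.
    subprocess.run(spark_submit_command(dependency_jars(), "job.py"), check=True)
```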

SageMaker Notebook

SageMaker Notebook instances may use an older version of Spark. Install a compatible version first:

# Install a version of PySpark compatible with the library (3.1 - 3.5)
!pip3 install pyspark==3.5.1

Getting Started

FeatureStoreManager is the main interface for all library operations, including data ingestion and loading feature definitions.

Ingest Data

from feature_store_pyspark.FeatureStoreManager import FeatureStoreManager

# df is an existing Spark DataFrame holding the records to ingest.
feature_group_arn = "arn:aws:sagemaker:...:feature-group/your-feature-group"
feature_store_manager = FeatureStoreManager()

# Write the records to the offline store only.
feature_store_manager.ingest_data(
    input_data_frame=df,
    feature_group_arn=feature_group_arn,
    target_stores=["OfflineStore"]
)

If target_stores is set to ["OfflineStore"], data is written directly to the offline store, bypassing the FeatureStore Runtime API and reducing WCU costs. The default is None, which ingests into both the online and offline stores.
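
Because a misspelled store name would otherwise surface only once the Spark job is running, a small pre-check can catch it earlier. This is a sketch, not library behavior: check_target_stores is a hypothetical helper, and "OnlineStore" is assumed to be the online counterpart of "OfflineStore", matching the store names used by the SageMaker API.

```python
from typing import List, Optional

# "OfflineStore" appears in the example above; "OnlineStore" is an assumption
# based on the store names used by the SageMaker FeatureStore API.
VALID_TARGET_STORES = {"OnlineStore", "OfflineStore"}


def check_target_stores(target_stores: Optional[List[str]]) -> Optional[List[str]]:
    """Validate a target_stores value before passing it to ingest_data().

    None is returned unchanged, since it selects the default behavior.
    """
    if target_stores is None:
        return None
    unknown = sorted(set(target_stores) - VALID_TARGET_STORES)
    if unknown:
        raise ValueError(f"Unknown target store(s): {unknown}")
    return list(target_stores)
```

The validated value can then be passed as target_stores=check_target_stores(["OfflineStore"]).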

Load Feature Definitions

feature_definitions = feature_store_manager.load_feature_definitions_from_schema(df)

Returns feature definitions that can be used with the CreateFeatureGroup API.
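
One way to use the result is to assemble the request for CreateFeatureGroup. The sketch below builds the request as a plain dictionary. It assumes the returned definitions are a list of dictionaries with FeatureName and FeatureType keys, which is the shape the API expects but is not confirmed here. The function name and all argument values are hypothetical.

```python
from typing import Any, Dict, List


def build_create_feature_group_request(
    feature_group_name: str,
    feature_definitions: List[Dict[str, str]],
    record_identifier_name: str,
    event_time_name: str,
    role_arn: str,
    offline_store_s3_uri: str,
) -> Dict[str, Any]:
    """Assemble keyword arguments for the SageMaker CreateFeatureGroup API.

    Assumption: each definition is a dict with 'FeatureName' and 'FeatureType'.
    """
    names = {definition["FeatureName"] for definition in feature_definitions}
    # The record identifier and event time must both be defined features.
    missing = sorted({record_identifier_name, event_time_name} - names)
    if missing:
        raise ValueError(f"Features not found in definitions: {missing}")
    return {
        "FeatureGroupName": feature_group_name,
        "RecordIdentifierFeatureName": record_identifier_name,
        "EventTimeFeatureName": event_time_name,
        "FeatureDefinitions": list(feature_definitions),
        "OfflineStoreConfig": {"S3StorageConfig": {"S3Uri": offline_store_s3_uri}},
        "RoleArn": role_arn,
    }
```

With boto3 installed, the request could be sent with boto3.client("sagemaker").create_feature_group(**request).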

Retrieve Failed Ingestion Records

failed_df = feature_store_manager.get_failed_stream_ingestion_data_frame()

Returns a DataFrame containing records that failed during ingest_data().
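
A typical pattern is to attempt ingestion and, on failure, collect the rejected records for inspection. The sketch below assumes that ingest_data() raises an exception when records fail, which is not stated above. The wrapper name ingest_and_collect_failures is hypothetical, and the manager is passed in so the function can be exercised without a Spark session.

```python
from typing import Any, Optional, Tuple


def ingest_and_collect_failures(
    manager: Any, df: Any, feature_group_arn: str
) -> Tuple[Optional[Any], Optional[Exception]]:
    """Ingest df; if ingestion raises, return the failed records and the error.

    Returns (None, None) when ingestion completes without raising.
    Assumption: ingest_data() raises when some records cannot be ingested.
    """
    try:
        manager.ingest_data(
            input_data_frame=df,
            feature_group_arn=feature_group_arn,
        )
    except Exception as error:
        # Fetch the records that failed so the caller can inspect or retry them.
        failed_df = manager.get_failed_stream_ingestion_data_frame()
        return failed_df, error
    return None, None
```

The returned DataFrame can then be examined with standard calls such as failed_df.count() or failed_df.show().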

Lake Formation Support

When your offline store's S3 location is registered with AWS Lake Formation, set use_lake_formation_credentials=True in ingest_data() (requires PySpark 3.5+):

feature_store_manager.ingest_data(
    input_data_frame=df,
    feature_group_arn=feature_group_arn,
    target_stores=["OfflineStore"],
    use_lake_formation_credentials=True
)
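
Since the parameter requires PySpark 3.5+, code that runs across several Spark versions may want to guard it. The following sketch builds the ingest_data() keyword arguments and rejects the Lake Formation option on older versions. The helper name lake_formation_ingest_kwargs is hypothetical, and raising RuntimeError is an illustrative choice, not the library's behavior.

```python
from typing import Any, Dict


def lake_formation_ingest_kwargs(
    df: Any,
    feature_group_arn: str,
    pyspark_version: str,
    use_lake_formation: bool = False,
) -> Dict[str, Any]:
    """Build keyword arguments for an offline-store ingest_data() call.

    use_lake_formation_credentials is added only when requested, and only
    if pyspark_version is 3.5 or newer.
    """
    kwargs: Dict[str, Any] = {
        "input_data_frame": df,
        "feature_group_arn": feature_group_arn,
        "target_stores": ["OfflineStore"],
    }
    if use_lake_formation:
        # Compare (major, minor) numerically so that "3.10" sorts after "3.5".
        major, minor = (int(part) for part in pyspark_version.split(".")[:2])
        if (major, minor) < (3, 5):
            raise RuntimeError(
                f"Lake Formation credentials require PySpark 3.5+, found {pyspark_version}"
            )
        kwargs["use_lake_formation_credentials"] = True
    return kwargs
```

A call site would look like feature_store_manager.ingest_data(**lake_formation_ingest_kwargs(df, feature_group_arn, pyspark.__version__, use_lake_formation=True)).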

For prerequisites, cross-account access, and troubleshooting, see the main repository README.

License

This project is licensed under the Apache-2.0 License.
