Amazon SageMaker FeatureStore PySpark Bindings
SageMaker FeatureStore PySpark
SageMaker FeatureStore Spark is an open-source Spark library for
Amazon SageMaker FeatureStore.
With this connector, you can ingest data from a Spark DataFrame into a
FeatureGroup's online and offline stores. This package provides the Python (PySpark)
interface.
For full documentation including Scala usage, cross-account Lake Formation access, and troubleshooting, see the GitHub repository.
Supported Versions
| Component | Supported Versions |
|---|---|
| Spark | 3.1, 3.2, 3.3, 3.4, 3.5 |
| Python | 3.8, 3.9, 3.10, 3.11, 3.12 |
| EMR | emr-7.x and above |
Note: Not all Python/PySpark combinations are supported. See the compatibility matrix below.
Python / PySpark Compatibility Matrix
| Python \ PySpark | 3.1 | 3.2 | 3.3 | 3.4 | 3.5 |
|---|---|---|---|---|---|
| 3.8 | Yes | Yes | Yes | Yes | Yes |
| 3.9 | Yes | Yes | Yes | Yes | Yes |
| 3.10 | No | Yes | Yes | Yes | Yes |
| 3.11 | No | No | No | Yes | Yes |
| 3.12 | No | No | No | Yes | Yes |
Note: PySpark versions older than 3.5 are in maintenance mode and will not receive new features. New functionality is only added for PySpark 3.5+.
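As a quick way to check an environment before installing, the matrix above can be transcribed into a small lookup. This is only an illustrative sketch: the names `SUPPORTED`, `minor`, and `is_supported` are hypothetical helpers, not part of the library.

```python
# Supported PySpark minor versions for each Python minor version,
# transcribed from the compatibility matrix above.
SUPPORTED = {
    "3.8": {"3.1", "3.2", "3.3", "3.4", "3.5"},
    "3.9": {"3.1", "3.2", "3.3", "3.4", "3.5"},
    "3.10": {"3.2", "3.3", "3.4", "3.5"},
    "3.11": {"3.4", "3.5"},
    "3.12": {"3.4", "3.5"},
}


def minor(version: str) -> str:
    """Reduce a version string such as '3.5.1' to 'major.minor'."""
    return ".".join(version.split(".")[:2])


def is_supported(python_version: str, pyspark_version: str) -> bool:
    """Return True if the Python/PySpark pair appears as 'Yes' in the matrix."""
    allowed = SUPPORTED.get(minor(python_version), set())
    return minor(pyspark_version) in allowed
```

In a live environment the two arguments could come from `platform.python_version()` and `pyspark.__version__`.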
Installation
Prerequisites: PySpark and NumPy must be installed in your environment.
The package is available on PyPI. It bundles pre-built JARs for each supported Spark version (3.1-3.5). At runtime, the correct JAR is automatically selected based on your installed PySpark version.
If SPARK_HOME is set, the installer copies the matching JAR into
$SPARK_HOME/jars. For EMR, the path is handled automatically.
pip3 install sagemaker-feature-store-pyspark --no-binary :all:
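To confirm that a JAR was copied after installation, one option is to list the contents of `$SPARK_HOME/jars`. The sketch below assumes the connector JAR's file name contains the substring `feature-store`; that keyword and the helper name `find_connector_jars` are assumptions for illustration, so adjust the keyword to the file name you actually see.

```python
import os
from pathlib import Path
from typing import List, Optional


def find_connector_jars(spark_home: Optional[str] = None,
                        keyword: str = "feature-store") -> List[str]:
    """List JAR file names under <spark_home>/jars that contain `keyword`.

    Falls back to the SPARK_HOME environment variable when no path is given.
    The default keyword is an assumption about the JAR's file name.
    """
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        return []
    jars_dir = Path(spark_home) / "jars"
    if not jars_dir.is_dir():
        return []
    return sorted(p.name for p in jars_dir.glob("*.jar") if keyword in p.name)
```

An empty result means either that SPARK_HOME is not set, that the directory does not exist, or that no file name matched the keyword.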
EMR
Create a custom jar step to install the library:
- Jar Location: command-runner.jar
- Arguments: sudo -E pip3 install sagemaker-feature-store-pyspark --no-binary :all:
This installs the library on the Driver node only. To distribute to all executor nodes, create an installation script and add a custom bootstrap action when creating the EMR cluster.
Since bootstrap actions run before EMR applications are installed, dependent JARs
cannot be automatically loaded to SPARK_HOME. When submitting your application,
specify dependent JARs using:
--jars `feature-store-pyspark-dependency-jars`
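When the submit command is assembled from a script rather than typed in a shell, the same `--jars` option can be built programmatically. `spark-submit` expects `--jars` as a single comma-separated list. The helper name and the JAR paths in the usage note are hypothetical; in practice the paths would come from the command shown above.

```python
from typing import List, Sequence


def build_spark_submit_command(app_path: str,
                               jar_paths: Sequence[str]) -> List[str]:
    """Build a spark-submit argument list with dependent JARs attached.

    `jar_paths` is supplied by the caller; this sketch does not discover
    the JARs itself.
    """
    command = ["spark-submit"]
    if jar_paths:
        # --jars takes one comma-separated value, not repeated flags.
        command += ["--jars", ",".join(jar_paths)]
    command.append(app_path)
    return command
```

For example, `build_spark_submit_command("app.py", ["/tmp/a.jar", "/tmp/b.jar"])` yields an argument list suitable for `subprocess.run`.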
SageMaker Notebook
SageMaker Notebook instances may use an older version of Spark. Install a compatible version first:
# Install a version of PySpark compatible with the library (3.1 - 3.5)
!pip3 install pyspark==3.5.1
Getting Started
FeatureStoreManager is the main interface for all library operations, including
data ingestion and loading feature definitions.
Ingest Data
from feature_store_pyspark.FeatureStoreManager import FeatureStoreManager

# df is an existing Spark DataFrame holding the records to ingest
feature_group_arn = "arn:aws:sagemaker:...:feature-group/your-feature-group"
feature_store_manager = FeatureStoreManager()
feature_store_manager.ingest_data(
    input_data_frame=df,
    feature_group_arn=feature_group_arn,
    target_stores=["OfflineStore"]
)
If target_stores is set to ["OfflineStore"], data is written directly to
the offline store without going through the FeatureStore Runtime API, which reduces write capacity unit (WCU) costs.
The default is None, which ingests to both the online and offline stores.
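The choice between the two modes can be wrapped in a small helper. This is a sketch, not library code: `ingest` and its `offline_only` flag are hypothetical, and `manager` is expected to be a `FeatureStoreManager` instance (any object exposing `ingest_data` with the same keyword arguments works).

```python
from typing import Any, List, Optional


def ingest(manager: Any, df: Any, feature_group_arn: str,
           offline_only: bool = False) -> Optional[List[str]]:
    """Call ingest_data, writing only to the offline store when requested.

    Returns the target_stores value that was passed, for logging or checks.
    """
    # None is the documented default and targets both stores.
    target_stores = ["OfflineStore"] if offline_only else None
    manager.ingest_data(
        input_data_frame=df,
        feature_group_arn=feature_group_arn,
        target_stores=target_stores,
    )
    return target_stores
```

A bulk backfill would call `ingest(feature_store_manager, df, feature_group_arn, offline_only=True)`, while a regular load would leave the flag at its default.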
Load Feature Definitions
feature_definitions = feature_store_manager.load_feature_definitions_from_schema(df)
Returns feature definitions that can be used with the CreateFeatureGroup API.
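One way to use the result is to assemble the core fields of a CreateFeatureGroup request. The sketch below assumes the returned definitions are a list of dicts with `FeatureName` and `FeatureType` keys, matching the shape the CreateFeatureGroup API accepts; verify this against the actual return value. The helper name is hypothetical.

```python
from typing import Any, Dict, List


def build_create_feature_group_request(
    name: str,
    feature_definitions: List[Dict[str, str]],
    record_identifier_name: str,
    event_time_name: str,
) -> Dict[str, Any]:
    """Assemble the core CreateFeatureGroup fields from feature definitions.

    Assumes each definition is a dict with a 'FeatureName' key.
    """
    names = {d["FeatureName"] for d in feature_definitions}
    for required in (record_identifier_name, event_time_name):
        if required not in names:
            raise ValueError(f"{required!r} is not among the feature definitions")
    return {
        "FeatureGroupName": name,
        "RecordIdentifierFeatureName": record_identifier_name,
        "EventTimeFeatureName": event_time_name,
        "FeatureDefinitions": list(feature_definitions),
    }
```

A real request also needs store configuration and, when an offline store is configured, an IAM role; those fields are omitted here.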
Retrieve Failed Ingestion Records
failed_df = feature_store_manager.get_failed_stream_ingestion_data_frame()
Returns a DataFrame containing records that failed during ingest_data().
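A possible pattern is to fetch the failed records only when ingestion reports a problem. The sketch below assumes that `ingest_data` raises an exception when records fail; that behavior is not stated here, so treat it as an assumption and check it against the repository documentation. The helper name is hypothetical.

```python
from typing import Any, Optional


def ingest_and_collect_failures(manager: Any, df: Any,
                                feature_group_arn: str) -> Optional[Any]:
    """Run ingest_data and return the failed-records DataFrame on error.

    Returns None when ingestion completes without raising.
    Assumption: ingest_data signals failed records by raising.
    """
    try:
        manager.ingest_data(
            input_data_frame=df,
            feature_group_arn=feature_group_arn,
        )
    except Exception:
        # Broad catch is deliberate in this sketch; narrow it in real code.
        return manager.get_failed_stream_ingestion_data_frame()
    return None
```

The returned DataFrame can then be inspected or passed to another `ingest_data` call as a retry.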
Lake Formation Support
When your offline store's S3 location is registered with
AWS Lake Formation, enable the
use_lake_formation_credentials parameter (requires PySpark 3.5+):
feature_store_manager.ingest_data(
    input_data_frame=df,
    feature_group_arn=feature_group_arn,
    target_stores=["OfflineStore"],
    use_lake_formation_credentials=True
)
For prerequisites, cross-account access, and troubleshooting, see the main repository README.
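Because the Lake Formation parameter requires PySpark 3.5+, a job that runs on several clusters may want to guard the flag with a version check. This is a minimal sketch with hypothetical helper names; only the `use_lake_formation_credentials` parameter comes from the library.

```python
from typing import Dict, Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Turn a version string such as '3.5.1' into (3, 5)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def lake_formation_options(pyspark_version: str) -> Dict[str, bool]:
    """Return ingest_data keyword arguments that enable Lake Formation.

    Raises RuntimeError when the PySpark version is older than 3.5.
    """
    if parse_major_minor(pyspark_version) < (3, 5):
        raise RuntimeError(
            "use_lake_formation_credentials requires PySpark 3.5+, "
            f"found {pyspark_version}"
        )
    return {"use_lake_formation_credentials": True}
```

The result can be unpacked into the call, for example `ingest_data(..., **lake_formation_options(pyspark.__version__))`.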
License
This project is licensed under the Apache-2.0 License.