HSFS Python SDK to interact with Hopsworks Feature Store

Project description

Hopsworks Feature Store


HSFS is the library to interact with the Hopsworks Feature Store. The library makes creating new features, feature groups and training datasets easy.

The library is environment independent and can be used in two modes:

  • Spark mode: For data engineering jobs that create and write features into the feature store or generate training datasets. It requires a Spark environment such as the one provided in the Hopsworks platform or Databricks. In Spark mode, HSFS provides bindings both for Python and JVM languages.

  • Python mode: For data science jobs that explore the features available in the feature store, generate training datasets, and feed them into a training pipeline. Python mode requires just a Python interpreter and can be used in Hopsworks from Python jobs/Jupyter kernels, as well as from Amazon SageMaker or Kubeflow.

The library configures itself automatically based on the environment it runs in. However, to connect from an external environment such as Databricks or AWS SageMaker, additional connection information, such as host and port, is required. For more information, check out the Hopsworks documentation.
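For example, a minimal connection from an external Python environment might look like the sketch below; the host value is a placeholder for your own Hopsworks instance, and port 443 (the default HTTPS port) is an assumption:

import hsfs

connection = hsfs.connection(
    host="my-instance.cloud.hopsworks.ai",  # placeholder: DNS of your instance
    port=443,                               # assumed default HTTPS port
    project="your-project",
    api_key_value="your-api-key",
)
fs = connection.get_feature_store()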

Getting Started On Hopsworks

Get started easily by registering an account on Hopsworks Serverless. Create your project and a new API key. In a new Python environment with Python 3.8 or higher, install the client library using pip:

# Get all Hopsworks SDKs: Feature Store, Model Serving and Platform SDK
pip install hopsworks
# or a minimal install with just the Feature Store SDK
pip install hsfs[python]
# if using zsh, don't forget the quotes
pip install 'hsfs[python]'
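As a quick sanity check, you can print the installed version (assuming your hsfs release exposes __version__, as recent ones do):

import hsfs
print(hsfs.__version__)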

You can then start a notebook, instantiate a connection, and get the project's feature store handle.

import hopsworks

project = hopsworks.login()  # you will be prompted for your API key
fs = project.get_feature_store()

or using hsfs directly:

import hsfs

connection = hsfs.connection(
    host="c.app.hopsworks.ai", #
    project="your-project",
    api_key_value="your-api-key",
)
fs = connection.get_feature_store()

Create a new feature group to start inserting feature values.

fg = fs.create_feature_group("rain",
                             version=1,
                             description="Rain features",
                             primary_key=["date", "location_id"],
                             online_enabled=True)

fg.save(dataframe)
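Here dataframe can be a Spark or pandas DataFrame whose columns match the feature group schema. A minimal pandas sketch (the rainfall_mm column is a hypothetical feature):

import pandas as pd

# hypothetical rain measurements; "date" and "location_id" form the primary key
dataframe = pd.DataFrame({
    "date": ["2020-10-18", "2020-10-19"],
    "location_id": [1, 2],
    "rainfall_mm": [12.3, 0.0],
})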

Upsert new data into a feature group created with time_travel_format="HUDI":

fg.insert(upsert_df)
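Here upsert_df reuses the same primary-key columns: rows whose keys already exist are updated and new keys are inserted. Continuing the hypothetical sketch above:

# corrected value for an existing key ("2020-10-19", 2) plus one new row
upsert_df = pd.DataFrame({
    "date": ["2020-10-19", "2020-10-20"],
    "location_id": [2, 1],
    "rainfall_mm": [4.2, 7.8],
})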

Retrieve the commit timeline metadata of a feature group with time_travel_format="HUDI":

fg.commit_details()
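The call returns a dictionary of commit metadata keyed by commit time; the exact fields may vary between versions, but it can be inspected like this:

# print each commit and its metadata (e.g. rows inserted/updated/deleted)
for commit_time, details in fg.commit_details().items():
    print(commit_time, details)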

"Reading feature group as of specific point in time".

fg = fs.get_feature_group("rain", 1)
fg.read("2020-10-20 07:34:11").show()

Read updates that occurred between specified points in time.

fg = fs.get_feature_group("rain", 1)
fg.read_changes("2020-10-20 07:31:38", "2020-10-20 07:34:11").show()

Join features together

feature_join = rain_fg.select_all() \
    .join(temperature_fg.select_all(), on=["date", "location_id"]) \
    .join(location_fg.select_all())
feature_join.show(5)
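Besides previewing a query with show(), you can materialize it into a dataframe; in Python mode, read() returns a pandas DataFrame:

df = feature_join.read()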

Join feature groups as of a specific point in time:

feature_join = rain_fg.select_all() \
    .join(temperature_fg.select_all(), on=["date", "location_id"]) \
    .join(location_fg.select_all()) \
    .as_of("2020-10-31")
feature_join.show(5)

Join feature groups as of different points in time:

rain_fg_q = rain_fg.select_all().as_of("2020-10-20 07:41:43")
temperature_fg_q = temperature_fg.select_all().as_of("2020-10-20 07:32:33")
location_fg_q = location_fg.select_all().as_of("2020-10-20 07:33:08")
joined_features_q = rain_fg_q.join(temperature_fg_q).join(location_fg_q)
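As before, the combined query can be previewed or read into a dataframe:

joined_features_q.show(5)
df = joined_features_q.read()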

Use the query object to create a training dataset:

td = fs.create_training_dataset("rain_dataset",
                                version=1,
                                data_format="tfrecords",
                                description="A test training dataset saved in TfRecords format",
                                splits={'train': 0.7, 'test': 0.2, 'validate': 0.1})

td.save(feature_join)
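The saved training dataset can later be retrieved by name and version and read back split by split; the split keyword below is assumed to match the split names defined above:

td = fs.get_training_dataset("rain_dataset", version=1)
train_df = td.read(split="train")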

A short introduction to the Scala API:

import com.logicalclocks.hsfs._
val connection = HopsworksConnection.builder().build()
val fs = connection.getFeatureStore()
val attendances_features_fg = fs.getFeatureGroup("games_features", 1)
attendances_features_fg.show(1)

You can find more examples on how to use the library in our hops-examples repository.

Usage

Usage data is collected to improve the quality of the library. It is turned on by default if the backend is "c.app.hopsworks.ai". To turn it off, use one of the following ways:

# use environment variable
import os
os.environ["ENABLE_HOPSWORKS_USAGE"] = "false"

# use `disable_usage_logging`
import hsfs
hsfs.disable_usage_logging()

The source code can be found in python/hsfs/usage.py.

Documentation

Documentation is available at Hopsworks Feature Store Documentation.

Issues

For general questions about the usage of Hopsworks and the Feature Store please open a topic on Hopsworks Community.

Please report any issues using the GitHub issue tracker.

Please attach the client environment from the output below to the issue:

import hopsworks
import hsfs
hopsworks.login().get_feature_store()
print(hsfs.get_env())

Contributing

If you would like to contribute to this library, please see the Contribution Guidelines.

Project details



Download files

Download the file for your platform.

Source Distribution

hsfs-3.9.0rc3.tar.gz (293.6 kB)

Uploaded Source

Built Distribution

hsfs-3.9.0rc3-py3-none-any.whl (358.9 kB)

Uploaded Python 3

File details

Details for the file hsfs-3.9.0rc3.tar.gz.

File metadata

  • Download URL: hsfs-3.9.0rc3.tar.gz
  • Upload date:
  • Size: 293.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.10.12

File hashes

Hashes for hsfs-3.9.0rc3.tar.gz

  • SHA256: c4422b5cd055c95feab65a9134b8d4b90b2bb99c907a9099b84df4172bc2186e
  • MD5: ba739ed5e8b780e72fdcb9ecd48226ab
  • BLAKE2b-256: 7d222c694c74a2b6e0acbaa22e6760174f4e6036c0545ca45b9913659fa37cd2


File details

Details for the file hsfs-3.9.0rc3-py3-none-any.whl.

File metadata

  • Download URL: hsfs-3.9.0rc3-py3-none-any.whl
  • Upload date:
  • Size: 358.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.10.12

File hashes

Hashes for hsfs-3.9.0rc3-py3-none-any.whl

  • SHA256: 22d3eb358f9e0f64c6577fbca653d59a75c7f226f7315350759d977c557c103f
  • MD5: 7b994c8c97904f278053aededa9d2961
  • BLAKE2b-256: 7d8de9f4e4d715da4f9a6646f7498e07bc5d9701b6dc72950dad4b1b8b0e5dfa

