
The Enterprise Feature Store For Machine Learning



What is Glacius?

A powerful, easy-to-use, resource-efficient feature platform for machine learning.

Key Features

  • 📈 Feature Engineering & Transformations: Define transformations and aggregations that scale with your project, from small teams to large enterprises processing petabytes of data.
  • 🚀 Low Latency Feature Serving: Access features instantly, ensuring predictions are derived from the most recent data.
  • 📈 Feature Registry: A unified view of all your machine learning features and definitions.
  • 🔄 Feature Versioning: Keep track of how your features evolve over time.

Getting Started

1. Set Up Your Workspace

Go to glacius.ai and register for an account. After signing in, navigate to "workspaces" and click "create a workspace". It will prompt you to do the following:

  • Select an AWS Region
  • Enter your desired workspace name
  • Create a cross-account execution role. This is the role Glacius will use when processing features.
  • Create a stack in your AWS account, entering the role you created in the step above as the principalARN.
  • After the stack finishes creating, enter the role ARN from the stack's output (this allows Glacius to assume the role you just created).
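As background on the cross-account role (a generic AWS pattern, not Glacius-specific documentation): the stack typically attaches a trust policy like the following, where the principal is the ARN you entered above. The account ID and role name here are placeholders:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<GLACIUS_ACCOUNT_ID>:role/<GLACIUS_PRINCIPAL_ROLE>"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}
```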

Congrats! You've set up your first Glacius workspace.

2. Generate Your API Key

After you've finished setting up your workspace, generate an API key. This is what your client will use to authenticate with our back-end infrastructure.

3. Install Glacius (Pip)

pip install glacius

And that's it! You're all set to explore the capabilities of Glacius.

Defining And Registering Features

1. Instantiate a Client

To start, let's instantiate a client. We'll specify the namespace "development" since we're just experimenting. We'll use this client later to register features and trigger jobs. Make sure to use the API key you generated earlier.

from glacius.core.client import Client

client = Client(api_key="***", namespace="development")

2. Define A Data Source

Let's define a data source, using Snowflake as an example. This table contains item interaction data for our hypothetical e-commerce app. We'll make sure to specify the timestamp_col, as this allows Glacius to perform point-in-time joins.

from glacius.core.data_sources.snowflake import SnowflakeSource

item_engagement_data_source = SnowflakeSource(
    name="global_item_engagement_data",
    description="item engagement data",
    timestamp_col="timestamp",
    table="global_item_engagement_data",
    database="gradiently",
    schema="public",
)

3. Defining Your Feature Bundle

Features are grouped into logical groups called feature bundles. A feature bundle is a logical grouping of features that share an entity and a datasource. For example, we could have a feature bundle for user features, another for item features, and another for user-item features.

Let's define our bundle here and add some aggregation features. We'll also need to specify the entity this bundle is attached to. Glacius also supports composite entities, but in this example we have a simple single entity with a single join key.

This defines the following features (total items clicked) across time windows of [1, 3, 5, 7] days, within each of these categories:

  • electronics_accessories
  • fashion_apparel
  • home_garden

from datetime import timedelta

from glacius.core.feature_bundle import FeatureBundle
from glacius.core.entity import Entity
from glacius.core.dtypes import Int32
# The import paths below for Feature, Aggregation, and the expression helpers
# are assumed to mirror the module layout above.
from glacius.core.feature import Feature
from glacius.core.aggregation import Aggregation, AggregationType
from glacius.core.expr import when, col


user_entity = Entity(keys=["user_id"])

user_bundle = FeatureBundle(
    name="user_feature_bundle",
    description="user features on item engagement data",
    source=item_engagement_data_source,
    entity=user_entity,
)

categories = ["electronics_accessories", "fashion_apparel", "home_garden"]
time_windows = [1, 3, 5, 7]

for category in categories:
    user_bundle.add_features([
        Feature(
            name=f"total_items_clicked_{category}_{t}d",
            description=f"total items clicked over {t} days",
            expr=when(col("product_category") == category).then(col("item_click")).otherwise(0),
            dtype=Int32,
            agg=Aggregation(method=AggregationType.SUM, window=timedelta(days=t)),
        )
        for t in time_windows
    ])
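As a quick sanity check on the loop above, here are the generated feature names recomputed in plain Python, independent of Glacius:

```python
# 3 categories x 4 time windows = 12 feature names in total.
categories = ["electronics_accessories", "fashion_apparel", "home_garden"]
time_windows = [1, 3, 5, 7]

names = [
    f"total_items_clicked_{category}_{t}d"
    for category in categories
    for t in time_windows
]

print(len(names))  # 12
print(names[0])    # total_items_clicked_electronics_accessories_1d
```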

4. Register Your Feature Bundle

Finally, let's register our bundle. You can now view it from the UI!

response = client.register(
    feature_bundles=[user_bundle],
    commit_msg="Added user feature bundle containing click features "
               "for electronics, fashion, and home garden",
)

Offline Features

To build offline features for training, we will need a label data source and the list of feature names you're interested in computing.

The label data source is the spine of the dataset: it includes the events you're interested in (for example, click events or item-bought events) and the timestamp of when each event occurred. This is crucial, since it lets Glacius perform a point-in-time join to compute what the features were for a given entity at that exact moment.
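To make the point-in-time join concrete, here is a toy pure-Python sketch of the idea (illustrative only, not the Glacius implementation): for each label row, attach the most recent feature value at or before the label's timestamp, so training data never leaks information from the future.

```python
from datetime import datetime

# Hypothetical precomputed feature rows: (user_id, timestamp, total_clicks_7d)
features = [
    ("u1", datetime(2024, 1, 1), 3),
    ("u1", datetime(2024, 1, 5), 8),
    ("u2", datetime(2024, 1, 2), 1),
]

# Label spine: (user_id, event_timestamp, clicked)
labels = [
    ("u1", datetime(2024, 1, 3), 1),
    ("u1", datetime(2024, 1, 6), 0),
]

def point_in_time_join(labels, features):
    """For each label row, pick the latest feature row for the same user
    whose timestamp is <= the label timestamp (None if none exists)."""
    joined = []
    for user, ts, clicked in labels:
        candidates = [f for f in features if f[0] == user and f[1] <= ts]
        value = max(candidates, key=lambda f: f[1])[2] if candidates else None
        joined.append((user, ts, clicked, value))
    return joined

rows = point_in_time_join(labels, features)
# The Jan 3 label sees only the Jan 1 feature value (3); the Jan 6 label
# sees the more recent Jan 5 value (8).
```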

1. Define the label data source.

label_datasource = SnowflakeSource(
    name="user_observation_data",
    description="user click events table and timestamp",
    timestamp_col="timestamp",
    table="user_observation_data",
    database="gradiently",
    schema="public",
)

Triggering Offline Job Via the Registry

If you are triggering jobs via the registry, you'll need to specify which namespace version you'd like to use (by default, the latest). Pinning a specific version ensures backwards compatibility for production pipelines.

job = client.get_offline_features(
    feature_names=[f.name for f in user_bundle.features],
    labels_datasource=label_datasource,
    output_path="s3://my-s3-bucket/offline_features_test_job",
    namespace_version="latest",
)

Triggering Ad Hoc Offline Job

If you'd just like to trigger the job from the feature bundles you've just defined in the notebook, you can pass them in directly as well for ad hoc runs.

job = client.get_offline_features(
    feature_bundles=[user_bundle],
    labels_datasource=label_datasource,
    output_path="s3://my-s3-bucket/offline_features_test_job",
)

Online Materialization

Glacius also allows you to materialize features into an ultra-low-latency online store for real-time serving. To materialize, simply list the features you're interested in materializing along with the namespace version.

client.materialize_features(feature_names=[f.name for f in user_bundle.features], version="latest")

Getting Online Features

To get online features, call the get_online_features API with the feature names you want and the unique IDs of the entities you're interested in.

online_features = client.get_online_features(
    feature_names=[f.name for f in user_bundle.features],
    entity_ids=[
        user_entity.id("2139083"),
        user_entity.id("92098321"),
    ],
)

If you aren't using Python for real-time inference, you can also call the REST API directly:

curl --location 'localhost:8000/online-store' \
--header 'x-api-key: ******' \
--header 'Content-Type: application/json' \
--data '{
    "namespace": "development",
    "workspace": "test-dev",
    "feature_names": ["AVG_MUSIC_STREAMING_SECS_1_24H", "AVG_MUSIC_STREAMING_SECS_2_24H", "AVG_MUSIC_STREAMING_SECS_3_24H", "AVG_MUSIC_STREAMING_SECS_4_24H", "AVG_MUSIC_STREAMING_SECS_5_24H", "AVG_MUSIC_STREAMING_SECS_6_24H"],
    "entity_ids": ["USER_ID:1671", "USER_ID:1233", "USER_ID:13821"]
}'
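For reference, the same raw HTTP request can be constructed from any language. Here is a sketch using Python's standard library, useful for seeing the exact request shape; the URL, API key, and payload values are the placeholders from the curl example above:

```python
import json
from urllib import request

URL = "http://localhost:8000/online-store"

# Same body as the curl example: six feature names and three entity IDs.
payload = {
    "namespace": "development",
    "workspace": "test-dev",
    "feature_names": [f"AVG_MUSIC_STREAMING_SECS_{i}_24H" for i in range(1, 7)],
    "entity_ids": ["USER_ID:1671", "USER_ID:1233", "USER_ID:13821"],
}

def fetch_online_features(api_key: str) -> dict:
    """POST the JSON body and return the parsed response. Only call this
    once the online-store endpoint is actually reachable."""
    req = request.Request(
        URL,
        data=json.dumps(payload).encode(),
        headers={"x-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )
    with request.urlopen(req, timeout=5) as resp:
        return json.loads(resp.read())
```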
