A scalable feature store that makes it easy to align offline and online ML systems

These details have not been verified by PyPI

Project links

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Aligned

Aligned helps define a single source of truth for logic while keeping the technology stack flexible. This is possible because it removes the need to depend on a processing engine, which leads to less code that is also more transparent. Furthermore, the declarative API makes it possible to comment, add data validation, and define feature transformations in the same location, which leads to a precise definition of the intended result.

Main advantages:

Test new features faster.
Adapt faster to new technical and business requirements.
Stop technology lock-in, like processing engines and infrastructure.
Stop vendor lock-in. Deploy to any provider that fits you.

As a result, loading model features can be done with the following code.

await store.model("titanic").features_for(entities).as_pandas()
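Because the call is awaited, it has to run inside a coroutine. The following is a minimal sketch of driving such a call from a synchronous script with `asyncio.run`. Note that `load_features` is a hypothetical stand-in for the awaited Aligned call and is not part of the library.

```python
import asyncio


async def load_features(entities: dict) -> dict:
    """Hypothetical stand-in for an awaitable feature lookup."""
    await asyncio.sleep(0)  # simulate non-blocking I/O
    # Return one placeholder feature value per requested entity.
    return {**entities, "scaled_age": [0.0 for _ in entities["passenger_id"]]}


def main() -> dict:
    entities = {"passenger_id": [1, 50, 110]}
    # asyncio.run creates an event loop, runs the coroutine, and closes the loop.
    return asyncio.run(load_features(entities))


result = main()
print(result)
```

In a notebook or another environment that already runs an event loop, the `await` expression can usually be used directly instead.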

Read the post about how the most elegant MLOps tool was created

Also check out the example repo to see how it can be used.

Aligned is still in active development, so changes are likely.

Feature Views

Write features as they should be, as data models. Then get code completion and type safety by referencing them in other features.

This makes the features lightweight, data source independent, and flexible.

class TitanicPassenger(FeatureView):

    metadata = FeatureViewMetadata(
        name="passenger",
        description="Some features from the titanic dataset",
        batch_source=FileSource.csv_at("titanic.csv"),
        stream_source=HttpStreamSource(topic_name="titanic")
    )

    passenger_id = Entity(dtype=Int32())

    # Input values
    age = (
        Float()
            .description("A float as some have decimals")
            .is_required()
            .lower_bound(0)
            .upper_bound(110)
    )

    name = String()
    sex = String().accepted_values(["male", "female"])
    survived = Bool().description("If the passenger survived")
    sibsp = Int32().lower_bound(0, is_inclusive=True).description("Number of siblings on titanic")
    cabin = String()

    # Creates two one hot encoded values
    is_male, is_female = sex.one_hot_encode(['male', 'female'])

    # Standard scale the age.
    # This will fit the scaler using a data slice from the batch source
    # limited to a maximum of 100 rows. We can also use a time constraint if wanted
    scaled_age = age.standard_scaled(limit=100)
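To make the two derived features above concrete, here is a minimal sketch in plain Python of what one-hot encoding and standard scaling compute. It is not Aligned's implementation: the helper names `one_hot` and `standard_scale` are hypothetical, and using the population standard deviation fitted on the first `limit` values is an assumption.

```python
from statistics import mean, pstdev


def one_hot(values, categories):
    """Return one boolean column per category, in the order given."""
    return [[value == category for value in values] for category in categories]


def standard_scale(values, limit=None):
    """Fit mean and std on at most `limit` leading values, then scale every value."""
    sample = values[:limit] if limit is not None else values
    mu, sigma = mean(sample), pstdev(sample)  # assumption: population std
    return [(value - mu) / sigma for value in values]


sex = ["male", "female", "female"]
is_male, is_female = one_hot(sex, ["male", "female"])

age = [20.0, 30.0, 40.0]
scaled_age = standard_scale(age, limit=100)
```

The sketch divides by the fitted standard deviation, so a constant column (standard deviation of zero) would need separate handling.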

Data sources

Aligned makes handling data sources easy, as you do not have to think about how the data is loaded. Only define where the data is, and Aligned handles the dirty work.

my_db = PostgreSQLConfig(env_var="DATABASE_URL")

class TitanicPassenger(FeatureView):

    metadata = FeatureViewMetadata(
        name="passenger",
        description="Some features from the titanic dataset",
        batch_source=my_db.table(
            "passenger",
            mapping_keys={
                "Passenger_Id": "passenger_id"
            }
        ),
        stream_source=HttpStreamSource(topic_name="titanic")
    )

    passenger_id = Entity(dtype=Int32())
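The `mapping_keys` argument above pairs a source column name with a feature name. As a rough illustration of that idea, and assuming the mapping goes from source column to feature name, the renaming could look like the sketch below. The helper `rename_columns` is hypothetical and does not reflect how Aligned does this internally.

```python
def rename_columns(rows, mapping_keys):
    """Rename source columns to feature names; unmapped columns keep their name."""
    return [
        {mapping_keys.get(column, column): value for column, value in row.items()}
        for row in rows
    ]


# Made-up rows standing in for the "passenger" table.
source_rows = [
    {"Passenger_Id": 1, "age": 22.0},
    {"Passenger_Id": 2, "age": 38.0},
]
renamed = rename_columns(source_rows, {"Passenger_Id": "passenger_id"})
```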

Fast development

Fast, iterative exploration is important in ML. This is why Aligned also makes it easy to combine and test multiple sources.

my_db = PostgreSQLConfig.localhost()

aws_bucket = AwsS3Config(...)

class SomeFeatures(FeatureView):

    metadata = FeatureViewMetadata(
        name="some_features",
        description="...",
        batch_source=my_db.table("local_features")
    )

    # Some features
    ...

class AwsFeatures(FeatureView):

    metadata = FeatureViewMetadata(
        name="aws",
        description="...",
        batch_source=aws_bucket.file_at("path/to/file.parquet")
    )

    # Some features
    ...

Model Service

You will usually need to combine multiple features for each model. This is where a ModelService comes in. Here you can define which features should be exposed.

# Uses the variable name, as the model service name.
# Can also define a custom name, if wanted.
titanic_model = ModelService(
    features=[
        TitanicPassenger.select_all(),

        # Select features with code completion
        LocationFeatures.select(lambda view: [
            view.distance_to_shore,
            view.distance_to_closest_boat
        ]),
    ]
)

Data Enrichers

In many cases, extra data is needed in order to generate some features. We therefore need some way of enriching the data. This can easily be done with Aligned's DataEnrichers.

my_db = PostgreSQLConfig.localhost()
redis = RedisConfig.localhost()

user_location = my_db.data_enricher( # Fetch all user locations
    sql="SELECT * FROM user_location"
).cache( # Cache them for one day
    ttl=timedelta(days=1),
    cache_key="user_location_cache"
).lock( # Make sure only one process fetches the data at a time
    lock_name="user_location_lock",
    redis_config=redis
)


async def distance_to_users(df: DataFrame) -> Series:
    user_location_df = await user_location.load()
    ...
    return distances

class SomeFeatures(FeatureView):

    metadata = FeatureViewMetadata(...)

    latitude = Float()
    longitude = Float()

    distance_to_users = Float().transformed(distance_to_users, using_features=[latitude, longitude])
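The cache-and-lock pattern used by the enricher above can be sketched with the standard library alone. This is an in-process illustration under stated assumptions, not Aligned's implementation: `CachedLoader` and `fetch_locations` are hypothetical names, and an `asyncio.Lock` only coordinates coroutines within one process, whereas the Redis-backed lock in the example is meant to coordinate across processes.

```python
import asyncio
import time


class CachedLoader:
    """Cache the result of an async loader for `ttl_seconds`, one fetch at a time."""

    def __init__(self, loader, ttl_seconds):
        self._loader = loader
        self._ttl = ttl_seconds
        self._lock = asyncio.Lock()
        self._value = None
        self._expires_at = None  # None means nothing has been cached yet

    async def load(self):
        async with self._lock:  # only one coroutine refreshes the cache
            now = time.monotonic()
            if self._expires_at is None or now >= self._expires_at:
                self._value = await self._loader()
                self._expires_at = time.monotonic() + self._ttl
            return self._value


fetch_count = {"n": 0}


async def fetch_locations():
    """Hypothetical slow query; counts how often it is actually executed."""
    fetch_count["n"] += 1
    await asyncio.sleep(0)
    return [("user_1", 59.9, 10.7)]


async def demo():
    cached = CachedLoader(fetch_locations, ttl_seconds=60)
    # Three concurrent loads should trigger a single fetch.
    return await asyncio.gather(cached.load(), cached.load(), cached.load())


results = asyncio.run(demo())
```

The lock is held while the loader runs, so concurrent callers wait for the first fetch and then read the cached value instead of issuing their own query.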

Access Data

You can easily create a feature store that contains all your feature definitions. This can then be used to generate data sets, set up an instance to serve features, build DAGs, etc.

store = FeatureStore.from_dir(".")

# Select all features from a single feature view
df = await store.all_for("passenger", limit=100).to_df()

Centralized Feature Store Definition

You will often want to share the features with coworkers, or split them into different stages, like staging, shadow, or production. One option is therefore to reference the storage you use, and load the FeatureStore from there.

aws_bucket = AwsS3Config(...)
store = await aws_bucket.file_at("production.json").feature_store()

# This switches from the production online store to the offline store
# Aka. the batch sources defined on the feature views
experimental_store = store.offline_store()

This JSON file can be generated by running aligned apply.

Select multiple feature views

df = await store.features_for({
    "passenger_id": [1, 50, 110]
}, features=[
    "passenger:scaled_age",
    "passenger:is_male",
    "passenger:sibsp",

    "other_features:distance_to_closest_boat",
]).to_df()
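The feature references above follow a `view:feature` pattern. As a small sketch of how such strings can be split and grouped per feature view, assuming that a single colon separates the two parts, consider the hypothetical helper `group_by_view` below. It is illustrative only and not the library's parser.

```python
from collections import defaultdict


def group_by_view(feature_refs):
    """Group "view:feature" references by feature view name."""
    grouped = defaultdict(list)
    for ref in feature_refs:
        view, _, feature = ref.partition(":")
        if not view or not feature:
            raise ValueError(f"Expected 'view:feature', got {ref!r}")
        grouped[view].append(feature)
    return dict(grouped)


requested = group_by_view([
    "passenger:scaled_age",
    "passenger:is_male",
    "passenger:sibsp",
    "other_features:distance_to_closest_boat",
])
```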

Model Service

Selecting features for a model is super simple.

df = await store.model("titanic_model").features_for({
    "passenger_id": [1, 50, 110]
}).to_df()

Feature View

If you only want to select features from a specific feature view, this is also possible.

prev_30_days = await store.feature_view("match").previous(days=30).to_df()
sample_of_20 = await store.feature_view("match").all(limit=20).to_df()
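Conceptually, the two calls above select rows by a time window and by a row limit. A minimal sketch of the time-window part in plain Python follows. The `event_timestamp` and `match_id` column names, the fixed `now`, and the inclusive window boundaries are all assumptions made for illustration, and the helper is not Aligned's implementation.

```python
from datetime import datetime, timedelta


def previous(rows, days, now, timestamp_key="event_timestamp"):
    """Keep rows whose timestamp falls within the last `days` days before `now`."""
    cutoff = now - timedelta(days=days)
    return [row for row in rows if cutoff <= row[timestamp_key] <= now]


now = datetime(2023, 3, 15)  # fixed reference time so the example is reproducible
matches = [
    {"match_id": 1, "event_timestamp": datetime(2023, 3, 10)},
    {"match_id": 2, "event_timestamp": datetime(2023, 1, 1)},
    {"match_id": 3, "event_timestamp": datetime(2023, 2, 20)},
]
prev_30_days = previous(matches, days=30, now=now)
sample_of_2 = matches[:2]  # a row limit is a plain slice in this sketch
```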

Data quality

Aligned makes sure that all features are formatted as the correct data type. In addition, Aligned makes sure that the returned features align with the defined constraints.

class TitanicPassenger(FeatureView):

    ...

    age = (
        Float()
            .is_required()
            .lower_bound(0)
            .upper_bound(110)
    )
    sibsp = Int32().lower_bound(0, is_inclusive=True)

Since our feature view has an is_required and a lower_bound constraint, the .validate(...) command will filter out the entities that do not satisfy them.

from aligned.validation.pandera import PanderaValidator

df = await store.model("titanic_model").features_for({
    "passenger_id": [1, 50, 110]
}).validate(
    PanderaValidator()  # Validates all features
).to_df()
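The filtering behavior described above can be illustrated without any validation library. The sketch below is not how the PanderaValidator works internally. It restates the constraints from the feature view as a plain dictionary and assumes that both bounds are inclusive and that a missing value only fails when the feature is required.

```python
# Constraints restated from the feature view; the dictionary layout is hypothetical.
CONSTRAINTS = {
    "age": {"required": True, "lower": 0, "upper": 110},
    "sibsp": {"required": False, "lower": 0, "upper": None},
}


def satisfies(value, rule):
    """Check one value against a required flag and optional inclusive bounds."""
    if value is None:
        return not rule["required"]
    if rule["lower"] is not None and value < rule["lower"]:
        return False
    if rule["upper"] is not None and value > rule["upper"]:
        return False
    return True


def validate(rows, constraints):
    """Keep only the rows where every constrained feature satisfies its rule."""
    return [
        row
        for row in rows
        if all(satisfies(row.get(name), rule) for name, rule in constraints.items())
    ]


rows = [
    {"passenger_id": 1, "age": 22.0, "sibsp": 1},
    {"passenger_id": 50, "age": None, "sibsp": 0},    # missing required age
    {"passenger_id": 110, "age": 47.0, "sibsp": -1},  # sibsp below lower bound
]
valid_rows = validate(rows, CONSTRAINTS)
```

With these made-up rows, only passenger 1 passes both constraints.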

Feature Server

This expects that you either run the command in your feature store repo, or have a file with a RepoReference instance. You can also set up an online source like Redis, for faster storage.

redis = RedisConfig.localhost()

aws_bucket = AwsS3Config(...)

repo_files = RepoReference(
    env_var_name="ENVIRONMENT",
    repo_paths={
        "production": aws_bucket.file_at("feature-store/production.json"),
        "shadow": aws_bucket.file_at("feature-store/shadow.json"),
        "staging": aws_bucket.file_at("feature-store/staging.json")
        # else generate the feature store from the current dir
    }
)

# Use redis as the online source, if not running locally
if repo_files.selected != "local":
    online_source = redis.online_source()
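The selection logic in the example above can be sketched as a lookup of an environment variable with a fallback. This is an assumption about the behavior, based on the comment in the example and the comparison against "local", and not a description of RepoReference itself. The helper `select_repo` and the placeholder paths are hypothetical.

```python
import os


def select_repo(env_var_name, repo_paths, environ=None):
    """Return (selected_name, path), falling back to ("local", None)."""
    environ = os.environ if environ is None else environ
    selected = environ.get(env_var_name, "local")
    if selected not in repo_paths:
        # Assumption: unset or unknown environments use the current directory.
        return "local", None
    return selected, repo_paths[selected]


repo_paths = {
    "production": "feature-store/production.json",
    "shadow": "feature-store/shadow.json",
    "staging": "feature-store/staging.json",
}

selected, path = select_repo("ENVIRONMENT", repo_paths, environ={"ENVIRONMENT": "staging"})
fallback, fallback_path = select_repo("ENVIRONMENT", repo_paths, environ={})
```

Passing `environ` explicitly keeps the sketch testable without touching the real process environment.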

Then run aligned serve, and a FastAPI server will start. Here you can push new features, which are then transformed and stored, or just fetch them.

Release history

0.0.95

May 1, 2024

0.0.94

Apr 17, 2024

0.0.93

Apr 16, 2024

0.0.92

Mar 26, 2024

0.0.91

Mar 26, 2024

0.0.90

Mar 26, 2024

0.0.89

Mar 25, 2024

0.0.88

Mar 25, 2024

0.0.87

Mar 25, 2024

0.0.86

Mar 25, 2024

0.0.85

Mar 25, 2024

0.0.84

Mar 25, 2024

0.0.83

Mar 19, 2024

0.0.82

Mar 18, 2024

0.0.81

Mar 14, 2024

0.0.80

Mar 13, 2024

0.0.79

Mar 12, 2024

0.0.78

Mar 11, 2024

0.0.77

Mar 5, 2024

0.0.76

Mar 4, 2024

0.0.75

Mar 4, 2024

0.0.74

Mar 3, 2024

0.0.73

Mar 2, 2024

0.0.72

Feb 23, 2024

0.0.71

Feb 10, 2024

0.0.70

Feb 9, 2024

0.0.69

Feb 3, 2024

0.0.68

Feb 3, 2024

0.0.67

Feb 3, 2024

0.0.66

Jan 28, 2024

0.0.65

Jan 28, 2024

0.0.64

Jan 28, 2024

0.0.63

Jan 18, 2024

0.0.62

Jan 16, 2024

0.0.61

Jan 16, 2024

0.0.60

Jan 6, 2024

0.0.59

Jan 5, 2024

0.0.58

Jan 5, 2024

0.0.57

Jan 5, 2024

0.0.56

Dec 25, 2023

0.0.55

Dec 25, 2023

0.0.54

Dec 16, 2023

0.0.53

Dec 13, 2023

0.0.52

Dec 10, 2023

0.0.51

Dec 9, 2023

0.0.50

Dec 9, 2023

0.0.49

Dec 4, 2023

0.0.48

Dec 1, 2023

0.0.47

Nov 29, 2023

0.0.46

Nov 22, 2023

0.0.45

Nov 21, 2023

0.0.44

Nov 21, 2023

0.0.43

Nov 19, 2023

0.0.42

Nov 13, 2023

0.0.41

Nov 13, 2023

0.0.40

Nov 13, 2023

0.0.39

Nov 13, 2023

0.0.38

Nov 9, 2023

0.0.37

Nov 9, 2023

0.0.36

Nov 7, 2023

0.0.35

Nov 6, 2023

0.0.34

Nov 2, 2023

0.0.33

Oct 31, 2023

0.0.32

Oct 24, 2023

0.0.31

Oct 23, 2023

0.0.30

Oct 18, 2023

0.0.29

Oct 16, 2023

0.0.28

Oct 16, 2023

0.0.27

Oct 13, 2023

0.0.26

Oct 13, 2023

0.0.25

Oct 4, 2023

0.0.24

Sep 5, 2023

0.0.23

Aug 31, 2023

0.0.22

Aug 8, 2023

0.0.21

Aug 3, 2023

0.0.20

Jun 23, 2023

0.0.19

Jun 22, 2023

0.0.18

Jun 22, 2023

0.0.17

Jun 22, 2023

0.0.16

May 25, 2023

0.0.15

May 25, 2023

0.0.14

May 22, 2023

0.0.13

May 21, 2023

0.0.12

May 21, 2023

0.0.11

May 2, 2023

0.0.10

Apr 28, 2023

0.0.10a0 pre-release

Jan 9, 2023

0.0.9

Apr 28, 2023

0.0.9a0 pre-release

Jan 3, 2023

0.0.8

Apr 28, 2023

0.0.8a0 pre-release

Dec 23, 2022

0.0.7

Mar 27, 2023

0.0.7a0 pre-release

Dec 22, 2022

0.0.6

Mar 27, 2023

0.0.6a0 pre-release

Nov 21, 2022

This version

0.0.5

Mar 15, 2023

0.0.5a0 pre-release

Nov 18, 2022

0.0.4

Mar 14, 2023

0.0.4a0 pre-release

Nov 12, 2022

0.0.3

Mar 14, 2023

0.0.3a0 pre-release

Nov 5, 2022

0.0.2

Mar 14, 2023

0.0.2a0 pre-release

Nov 4, 2022

0.0.1

Mar 14, 2023

0.0.1a0 pre-release

Oct 27, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

aligned-0.0.5.tar.gz (92.2 kB)

Uploaded Mar 15, 2023 Source

Built Distribution

aligned-0.0.5-py3-none-any.whl (117.6 kB)

Uploaded Mar 15, 2023 Python 3

Hashes for aligned-0.0.5.tar.gz
Algorithm	Hash digest
SHA256	`c1eb9aab42e02864ea1a5509b45ca81f9712fa02f953b7287fc6ee1867f13173`
MD5	`cdfc7eb44d9e97ce038130f01e495b04`
BLAKE2b-256	`cb2914e6727437f229906eeb80de83145f2a47e040101896ada5ea29d6c1d4e3`

Hashes for aligned-0.0.5-py3-none-any.whl
Algorithm	Hash digest
SHA256	`5d9b4489abc6e749fdf1b7bd744ec8ec761a59c6dbff03b71da1b330080f4c9d`
MD5	`5086337505c025f0e2d00b99227e7484`
BLAKE2b-256	`5d594cb0ca957322b23f9af65668001c610cc83c561d1f98b43a1a53c13608f1`