Leakage-safe, point-in-time feature engineering for event logs.

License: MIT

safefeat builds ML features from event data using only information available at prediction time — no future data, no silent leakage, no surprises in production.


The Problem

When you compute features like "total purchases in the last 30 days" without anchoring to a cutoff time, you accidentally include future events. Your model looks great in training — then falls apart in production.

# ❌ Leaky — uses ALL events, including future ones
features = events.groupby("user_id")["amount"].sum()
df = spine.merge(features, on="user_id")

# ✅ Safe — only uses events before each cutoff_time
X = build_features(spine, tables, spec, event_time_cols={"events": "event_time"})
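
To make the difference concrete, here is a minimal pure-Python sketch. It makes no safefeat calls, and the toy data and variable names are made up for illustration. It compares a total over all events with a total anchored to a cutoff.

```python
from datetime import date

# Hypothetical toy data: one user, one prediction point.
cutoff_time = date(2024, 1, 10)
events = [
    {"user_id": "u1", "event_time": date(2024, 1, 5), "amount": 10.0},
    {"user_id": "u1", "event_time": date(2024, 1, 6), "amount": 20.0},
    {"user_id": "u1", "event_time": date(2024, 1, 20), "amount": 500.0},  # after the cutoff
]

# Leaky: sums every event, including one that had not happened yet.
leaky_total = sum(e["amount"] for e in events)

# Point-in-time: only events at or before the cutoff contribute.
safe_total = sum(e["amount"] for e in events if e["event_time"] <= cutoff_time)

print(leaky_total)  # 530.0
print(safe_total)   # 30.0
```

The leaky total is dominated by a purchase the model could never have seen at prediction time, which is the kind of gap that shows up as a drop between training and production metrics.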

Install

pip install safefeat

How It Works

safefeat works with three components:

Component   Description
Spine       When to make predictions: one row per (entity_id, cutoff_time)
Events      Historical time-series data tied to each entity
Spec        Declarative definition of what features to compute

For each row in the spine, safefeat joins only events where event_time <= cutoff_time, then computes your features. Future events are excluded.
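
The per-row join can be pictured with a small sketch. This is plain Python written for illustration, not safefeat's implementation; `visible_events` is a hypothetical helper. The same entity appears twice in the spine and sees a different slice of its history at each cutoff.

```python
from datetime import datetime

# Hypothetical data: the same entity at two prediction points.
spine = [
    ("u1", datetime(2024, 1, 6)),
    ("u1", datetime(2024, 1, 31)),
]
events = [
    ("u1", datetime(2024, 1, 5), 10.0),
    ("u1", datetime(2024, 1, 6), 20.0),   # exactly at the first cutoff
    ("u1", datetime(2024, 1, 20), 5.0),
]

def visible_events(entity_id, cutoff_time):
    """Events for this entity with event_time <= cutoff_time."""
    return [
        (t, amount)
        for (eid, t, amount) in events
        if eid == entity_id and t <= cutoff_time
    ]

counts = [len(visible_events(eid, cutoff)) for eid, cutoff in spine]
print(counts)  # [2, 3]
```

Because the condition is `<=`, the event stamped exactly at the first cutoff is included in that row.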


Quick Start

import pandas as pd
from safefeat import build_features, WindowAgg

spine = pd.DataFrame({
    "entity_id":   ["u1", "u2"],
    "cutoff_time": ["2024-01-10", "2024-01-31"],
})

events = pd.DataFrame({
    "entity_id":  ["u1", "u1", "u2", "u2"],
    "event_time": ["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"],
    "amount":     [10.0, 20.0, 5.0, 25.0],
    "event_type": ["click", "purchase", "purchase", "click"],
})

spec = [
    WindowAgg(
        table="events",
        windows=["7D", "30D"],
        metrics={
            "*":          ["count"],
            "amount":     ["sum", "mean"],
            "event_type": ["nunique"],
        },
    )
]

X = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    allowed_lag="0s",
)

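Output columns follow the pattern {table}__{column}__{agg}__{window}, with the window written in lowercase (for example, 7d). The row count requested with "*" is the exception and is named {table}__n_events__{window}: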

events__n_events__7d               # number of events in the last 7 days
events__amount__sum__7d            # total spend in the last 7 days
events__amount__mean__7d           # average spend per event in the last 7 days
events__event_type__nunique__7d    # distinct event types seen in the last 7 days

events__n_events__30d              # number of events in the last 30 days
events__amount__sum__30d           # total spend in the last 30 days
events__amount__mean__30d          # average spend per event in the last 30 days
events__event_type__nunique__30d   # distinct event types seen in the last 30 days
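
As a rough cross-check, the same aggregates can be computed by hand for the Quick Start data. The sketch below is plain Python, not safefeat's code. `window_features` is a hypothetical helper, the window is assumed to be `cutoff - window < event_time <= cutoff`, and returning `None` for an empty window is an assumption as well.

```python
from datetime import datetime, timedelta
from statistics import mean

# Quick Start events as (entity_id, event_time, amount, event_type).
events = [
    ("u1", datetime(2024, 1, 5), 10.0, "click"),
    ("u1", datetime(2024, 1, 6), 20.0, "purchase"),
    ("u2", datetime(2024, 1, 10), 5.0, "purchase"),
    ("u2", datetime(2024, 1, 30), 25.0, "click"),
]

def window_features(entity_id, cutoff, days):
    # Assumed window boundaries: cutoff - days < event_time <= cutoff.
    start = cutoff - timedelta(days=days)
    rows = [e for e in events if e[0] == entity_id and start < e[1] <= cutoff]
    amounts = [e[2] for e in rows]
    suffix = f"{days}d"
    return {
        f"events__n_events__{suffix}": len(rows),
        f"events__amount__sum__{suffix}": sum(amounts),
        f"events__amount__mean__{suffix}": mean(amounts) if amounts else None,
        f"events__event_type__nunique__{suffix}": len({e[3] for e in rows}),
    }

u2_7d = window_features("u2", datetime(2024, 1, 31), 7)
u2_30d = window_features("u2", datetime(2024, 1, 31), 30)
print(u2_7d["events__n_events__7d"], u2_7d["events__amount__sum__7d"])      # 1 25.0
print(u2_30d["events__n_events__30d"], u2_30d["events__amount__sum__30d"])  # 2 30.0
```

For u2, the 7-day window catches only the event on 2024-01-30, while the 30-day window also reaches back to 2024-01-10.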

Demo Dataset

safefeat ships with a synthetic e-commerce dataset for experimentation:

from safefeat.datasets import load_customer_demo

events, spine = load_customer_demo()

See the customer demo examples for worked questions using this dataset.

Window aggregations

Windows support days, months, years, and unlimited history:

spec = [
    WindowAgg(
        table="events",
        windows=["7D", "30D", "3M", "1Y", None],  # None = all history before cutoff
        metrics={
            "*":          ["count"],
            "amount":     ["sum", "mean"],
            "event_type": ["nunique"],
        },
    )
]

Unit   Example   Meaning
D      "30D"     Exact days
M      "3M"      Calendar months
Y      "1Y"      Calendar years
None   None      All history before cutoff
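
The difference between exact days and calendar months matters near month ends. The sketch below uses only the standard library. `subtract_months` is a hypothetical helper, and clamping to the last day of a shorter month is an assumption about how a calendar-month window start could be computed, not a statement of what safefeat does internally.

```python
import calendar
from datetime import date, timedelta

def subtract_months(d, months):
    """Step back whole calendar months, clamping to the month's last day."""
    total = d.year * 12 + (d.month - 1) - months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

cutoff = date(2024, 5, 31)
print(cutoff - timedelta(days=90))   # 2024-03-02
print(subtract_months(cutoff, 3))    # 2024-02-29
```

A "90D" window and a "3M" window anchored at the same cutoff can therefore start on different dates, so the two are not interchangeable.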

Recency features

from safefeat import RecencyBlock

spec = [RecencyBlock(table="events")]

X = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
)
# Adds: events__recency (days since last event before cutoff_time)

Filter to a specific event type:

spec = [
    RecencyBlock(
        table="events",
        filter_col="event_type",
        filter_value="purchase",
    )
]
# Adds: events__recency__event_type_purchase
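
For intuition, recency can be worked out by hand on the Quick Start data. This is an illustrative sketch only: `recency_days` is a hypothetical helper, and both the whole-day rounding and the `None` returned when no earlier event exists are assumptions.

```python
from datetime import datetime

# Quick Start events as (entity_id, event_time, amount, event_type).
events = [
    ("u1", datetime(2024, 1, 5), 10.0, "click"),
    ("u1", datetime(2024, 1, 6), 20.0, "purchase"),
    ("u2", datetime(2024, 1, 10), 5.0, "purchase"),
    ("u2", datetime(2024, 1, 30), 25.0, "click"),
]

def recency_days(entity_id, cutoff, event_type=None):
    """Days between the cutoff and the latest matching event at or before it."""
    times = [
        t
        for (eid, t, _amount, etype) in events
        if eid == entity_id
        and t <= cutoff
        and (event_type is None or etype == event_type)
    ]
    if not times:
        return None  # assumption: no earlier event means a missing value
    return (cutoff - max(times)).days

print(recency_days("u1", datetime(2024, 1, 10)))              # 4
print(recency_days("u2", datetime(2024, 1, 31)))              # 1
print(recency_days("u2", datetime(2024, 1, 31), "purchase"))  # 21
```

Filtering changes the answer for u2: the last event of any type was one day before the cutoff, but the last purchase was 21 days before it.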

Audit report

Verify exactly which events were included and dropped for each prediction point:

X, audit = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    return_report=True,
)

events_audit = audit.tables.get("events")
print(events_audit.total_joined_pairs)    # total event-cutoff pairs considered
print(events_audit.kept_pairs)            # events before cutoff (used)
print(events_audit.dropped_future_pairs)  # events after cutoff (excluded)
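
To show what these counters measure, the sketch below recomputes them by hand for a toy spine. It is plain Python rather than the library's audit code. The local variable names simply mirror the report attributes above, and the data is hypothetical.

```python
from datetime import datetime

# Hypothetical spine and events as (entity_id, timestamp).
spine = [("u1", datetime(2024, 1, 10)), ("u2", datetime(2024, 1, 15))]
events = [
    ("u1", datetime(2024, 1, 5)),
    ("u1", datetime(2024, 1, 6)),
    ("u2", datetime(2024, 1, 10)),
    ("u2", datetime(2024, 1, 30)),  # after u2's cutoff
]

# Every (cutoff, event) pair that shares an entity.
pairs = [
    (cutoff, event_time)
    for (sid, cutoff) in spine
    for (eid, event_time) in events
    if sid == eid
]

total_joined_pairs = len(pairs)
kept_pairs = sum(1 for cutoff, t in pairs if t <= cutoff)
dropped_future_pairs = total_joined_pairs - kept_pairs

print(total_joined_pairs, kept_pairs, dropped_future_pairs)  # 4 3 1
```

A nonzero dropped count is expected whenever the event table extends past some cutoffs. It indicates that the filter did its job, not that something went wrong.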

Multiple event tables

Pass multiple tables — each with its own event time column:

from safefeat import build_features, WindowAgg, RecencyBlock

# transactions_df and logins_df are your own event DataFrames.
spec = [
    WindowAgg(table="transactions", windows=["30D"], metrics={"amount": ["sum"]}),
    WindowAgg(table="logins",       windows=["7D"],  metrics={"*": ["count"]}),
    RecencyBlock(table="transactions"),
]

X = build_features(
    spine=spine,
    tables={
        "transactions": transactions_df,
        "logins":       logins_df,
    },
    spec=spec,
    event_time_cols={
        "transactions": "transaction_time",
        "logins":       "login_time",
    },
)

The table= name is just a label — it must match a key in tables and event_time_cols, but can be anything you choose.
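
Since a mismatched label is an easy mistake to make, a small pre-flight check can catch it before features are built. The helper below is hypothetical and not part of safefeat. It takes the table labels used in a spec as plain strings and reports any that are missing from either mapping.

```python
def check_table_labels(spec_tables, tables, event_time_cols):
    """Return labels referenced by the spec but missing from either mapping."""
    problems = {}
    for label in spec_tables:
        missing = [
            name
            for name, mapping in (("tables", tables), ("event_time_cols", event_time_cols))
            if label not in mapping
        ]
        if missing:
            problems[label] = missing
    return problems

problems = check_table_labels(
    spec_tables=["transactions", "logins"],
    tables={"transactions": object(), "login": object()},  # typo: "login"
    event_time_cols={"transactions": "transaction_time", "logins": "login_time"},
)
print(problems)  # {'logins': ['tables']}
```

An empty dictionary means every label in the spec has both a table and an event time column.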


Development

pip install -e ".[dev]"
pytest -q
ruff check .

Contributing

Contributions, bug reports, and feature requests are welcome. Open an issue at github.com/AlishaAng/safefeat/issues.
