Leakage-safe, point-in-time feature engineering for event logs.
safefeat builds ML features from event data using only information available at prediction time — no future data, no silent leakage, no surprises in production.
The Problem
When you compute features like "total purchases in the last 30 days" without anchoring to a cutoff time, you accidentally include future events. Your model looks great in training — then falls apart in production.
# ❌ Leaky — uses ALL events, including future ones
features = events.groupby("user_id")["amount"].sum()
df = spine.merge(features, on="user_id")
# ✅ Safe — only uses events before each cutoff_time
X = build_features(spine, tables, spec, event_time_cols={"events": "event_time"})
Install
pip install safefeat
How It Works
safefeat works with three components:
| Component | Description |
|---|---|
| Spine | When to make predictions — one row per (entity_id, cutoff_time) |
| Events | Historical time-series data tied to each entity |
| Spec | Declarative definition of what features to compute |
For each row in the spine, safefeat joins only events where event_time <= cutoff_time, then computes your features. Future events are excluded.
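The point-in-time rule above can be sketched in plain Python. This is an illustration of the idea only, not safefeat's implementation: the helper names `parse` and `window_features` are hypothetical, the window is assumed to be `(cutoff - N days, cutoff]`, and one extra event dated after u1's cutoff has been added to show that it is ignored.

```python
from datetime import datetime, timedelta


def parse(day):
    """Parse a YYYY-MM-DD string into a datetime."""
    return datetime.strptime(day, "%Y-%m-%d")


# Spine: one row per (entity_id, cutoff_time).
spine = [("u1", parse("2024-01-10")), ("u2", parse("2024-01-31"))]

# Events: (entity_id, event_time, amount).
events = [
    ("u1", parse("2024-01-05"), 10.0),
    ("u1", parse("2024-01-06"), 20.0),
    ("u1", parse("2024-01-20"), 99.0),  # after u1's cutoff: must never be counted
    ("u2", parse("2024-01-10"), 5.0),
    ("u2", parse("2024-01-30"), 25.0),
]


def window_features(entity_id, cutoff, days):
    """Aggregate amounts for one spine row, using only events up to the cutoff."""
    start = cutoff - timedelta(days=days)  # assumed lower bound (exclusive)
    amounts = [
        amount
        for eid, event_time, amount in events
        if eid == entity_id and start < event_time <= cutoff
    ]
    total = sum(amounts)
    return {
        "count": len(amounts),
        "sum": total,
        "mean": total / len(amounts) if amounts else None,
    }


rows = {eid: window_features(eid, cutoff, days=7) for eid, cutoff in spine}
print(rows)
```

The 99.0 event never reaches u1's features because the filter is anchored to that row's own cutoff rather than to the full event table.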
Quick Start
import pandas as pd
from safefeat import build_features, WindowAgg
spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": ["2024-01-10", "2024-01-31"],
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "event_time": ["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"],
    "amount": [10.0, 20.0, 5.0, 25.0],
    "event_type": ["click", "purchase", "purchase", "click"],
})
spec = [
    WindowAgg(
        table="events",
        windows=["7D", "30D"],
        metrics={
            "*": ["count"],
            "amount": ["sum", "mean"],
            "event_type": ["nunique"],
        },
    )
]
X = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    allowed_lag="0s",
)
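Output columns follow the pattern {table}__{column}__{agg}__{window}, with the window lowercased. The "*" row count is the exception and is named {table}__n_events__{window}: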
events__n_events__7d # number of events in the last 7 days
events__amount__sum__7d # total spend in the last 7 days
events__amount__mean__7d # average spend per event in the last 7 days
events__event_type__nunique__7d # distinct event types seen in the last 7 days
events__n_events__30d # number of events in the last 30 days
events__amount__sum__30d # total spend in the last 30 days
events__amount__mean__30d # average spend per event in the last 30 days
events__event_type__nunique__30d # distinct event types seen in the last 30 days
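To anticipate the column names a spec will produce, the naming pattern can be reproduced with a small helper. This is a sketch inferred only from the column list above: `expected_columns` is a hypothetical function, not part of safefeat, and it covers string windows such as "7D" only (the name used for a `None` window is not stated here).

```python
def expected_columns(table, windows, metrics):
    """Predict output column names from the naming pattern shown above."""
    columns = []
    for window in windows:
        suffix = window.lower()  # "7D" -> "7d", as in the listed columns
        for column, aggs in metrics.items():
            for agg in aggs:
                if column == "*" and agg == "count":
                    # The row count appears as n_events rather than *__count.
                    columns.append(f"{table}__n_events__{suffix}")
                else:
                    columns.append(f"{table}__{column}__{agg}__{suffix}")
    return columns


cols = expected_columns(
    table="events",
    windows=["7D", "30D"],
    metrics={
        "*": ["count"],
        "amount": ["sum", "mean"],
        "event_type": ["nunique"],
    },
)
for name in cols:
    print(name)
```

Comparing this list against the columns of `X` is a quick way to catch a misspelled metric or window in a spec.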
Demo Dataset
safefeat ships with a synthetic e-commerce dataset for experimentation:
from safefeat.datasets import load_customer_demo
events, spine = load_customer_demo()
See the customer demo examples for worked questions using this dataset.
Window aggregations
Windows support days, months, years, and unlimited history:
spec = [
    WindowAgg(
        table="events",
        windows=["7D", "30D", "3M", "1Y", None],  # None = all history before cutoff
        metrics={
            "*": ["count"],
            "amount": ["sum", "mean"],
            "event_type": ["nunique"],
        },
    )
]
| Unit | Example | Meaning |
|---|---|---|
| D | "30D" | Exact days |
| M | "3M" | Calendar months |
| Y | "1Y" | Calendar years |
| None | None | All history before cutoff |
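The difference between exact days and calendar units matters near month ends. The sketch below computes a window start both ways using only the standard library. `subtract_months` is a hypothetical helper, and clamping to the last valid day of the target month is an assumption made for illustration; safefeat's exact handling of month-end dates may differ.

```python
import calendar
from datetime import date, timedelta


def subtract_months(day, months):
    """Step back whole calendar months, clamping to the month's last day."""
    index = day.year * 12 + (day.month - 1) - months
    year, month = divmod(index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


cutoff = date(2024, 3, 31)

start_30d = cutoff - timedelta(days=30)  # exact days
start_1m = subtract_months(cutoff, 1)    # one calendar month
start_1y = subtract_months(cutoff, 12)   # one calendar year

print(start_30d, start_1m, start_1y)
```

Here "30D" starts on 2024-03-01 while a one-month window starts on 2024-02-29, so the two windows can contain different events for the same cutoff.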
Recency features
from safefeat import RecencyBlock
spec = [RecencyBlock(table="events")]
X = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
)
# Adds: events__recency (days since last event before cutoff_time)
Filter to a specific event type:
spec = [
    RecencyBlock(
        table="events",
        filter_col="event_type",
        filter_value="purchase",
    )
]
# Adds: events__recency__event_type_purchase
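As a rough picture of what a recency feature measures, the following sketch computes days since the most recent qualifying event, using the Quick Start events. `recency_days` is a hypothetical helper, and returning `None` when no event qualifies is an assumption; the value safefeat emits in that case is not documented here.

```python
from datetime import date

# Events: (entity_id, event_date, event_type), mirroring the Quick Start data.
events = [
    ("u1", date(2024, 1, 5), "click"),
    ("u1", date(2024, 1, 6), "purchase"),
    ("u2", date(2024, 1, 10), "purchase"),
    ("u2", date(2024, 1, 30), "click"),
]


def recency_days(entity_id, cutoff, event_type=None):
    """Days between the cutoff and the latest event at or before it."""
    times = [
        event_date
        for eid, event_date, kind in events
        if eid == entity_id
        and event_date <= cutoff
        and (event_type is None or kind == event_type)
    ]
    if not times:
        return None  # assumption: no qualifying history yields a missing value
    return (cutoff - max(times)).days


print(recency_days("u1", date(2024, 1, 10)))
print(recency_days("u2", date(2024, 1, 31)))
print(recency_days("u2", date(2024, 1, 31), event_type="purchase"))
```

Filtering changes the answer for u2: the latest event overall is one day before the cutoff, but the latest purchase is three weeks before it.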
Audit report
Verify exactly which events were included and dropped for each prediction point:
X, audit = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    return_report=True,
)
events_audit = audit.tables.get("events")
print(events_audit.total_joined_pairs) # total event-cutoff pairs considered
print(events_audit.kept_pairs) # events at or before cutoff (used)
print(events_audit.dropped_future_pairs) # events after cutoff (excluded)
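To build intuition for these three counters, the sketch below recomputes them by hand on a tiny dataset that contains future events. It assumes that a "pair" is an event matched to a spine row with the same entity_id, and that kept plus dropped equals the total; `audit_counts` is a hypothetical helper, not safefeat's report logic.

```python
from datetime import date

spine = [("u1", date(2024, 1, 10)), ("u2", date(2024, 1, 31))]

events = [
    ("u1", date(2024, 1, 5)),
    ("u1", date(2024, 1, 20)),  # after u1's cutoff
    ("u2", date(2024, 1, 30)),
    ("u2", date(2024, 2, 2)),   # after u2's cutoff
]


def audit_counts(spine, events):
    """Count event-cutoff pairs and how many fall at or before the cutoff."""
    total = 0
    kept = 0
    for entity_id, cutoff in spine:
        for eid, event_time in events:
            if eid != entity_id:
                continue
            total += 1
            if event_time <= cutoff:
                kept += 1
    return {
        "total_joined_pairs": total,
        "kept_pairs": kept,
        "dropped_future_pairs": total - kept,
    }


counts = audit_counts(spine, events)
print(counts)
```

A non-zero dropped count is expected whenever the event table extends past a cutoff; it shows the exclusion happened rather than indicating an error.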
Multiple event tables
Pass multiple tables — each with its own event time column:
spec = [
    WindowAgg(table="transactions", windows=["30D"], metrics={"amount": ["sum"]}),
    WindowAgg(table="logins", windows=["7D"], metrics={"*": ["count"]}),
    RecencyBlock(table="transactions"),
]
X = build_features(
    spine=spine,
    tables={
        "transactions": transactions_df,
        "logins": logins_df,
    },
    spec=spec,
    event_time_cols={
        "transactions": "transaction_time",
        "logins": "login_time",
    },
)
The table= name is just a label — it must match a key in tables and event_time_cols, but can be anything you choose.
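Because the label has to line up in three places, a quick pre-flight check can catch typos before calling build_features. The helper below is hypothetical and works on plain dictionaries and a list of label strings; it makes no claim about how safefeat itself validates its inputs.

```python
def missing_table_keys(spec_tables, tables, event_time_cols):
    """Map each spec table label to the mappings it is missing from."""
    problems = {}
    for name in spec_tables:
        missing = [
            label
            for label, mapping in (
                ("tables", tables),
                ("event_time_cols", event_time_cols),
            )
            if name not in mapping
        ]
        if missing:
            problems[name] = missing
    return problems


# Placeholder values stand in for the real DataFrames.
tables = {"transactions": object(), "logins": object()}
event_time_cols = {"transactions": "transaction_time"}  # "logins" forgotten

problems = missing_table_keys(["transactions", "logins"], tables, event_time_cols)
print(problems)
```

An empty result means every label used in the spec has both a table and an event time column.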
Development
pip install -e ".[dev]"
pytest -q
ruff check .
Contributing
Contributions, bug reports, and feature requests are welcome. Open an issue at github.com/AlishaAng/safefeat/issues.