Leakage-safe, point-in-time feature engineering for event logs.

safefeat

License: MIT

safefeat builds ML features from event data using only information available at prediction time — no future data, no silent leakage, no surprises in production.


The Problem

If you compute a feature like "total purchases in the last 30 days" without anchoring it to a cutoff time, the aggregate includes events that occurred after the prediction point. The model looks great in training, then falls apart in production, where those future events do not exist yet.

# ❌ Leaky: uses ALL events, including future ones
features = events.groupby("user_id")["amount"].sum()
df = spine.merge(features, on="user_id")

# ✅ Safe: only uses events at or before each cutoff_time
X = build_features(spine, tables, spec, event_time_cols={"events": "event_time"})
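
To make the difference concrete, here is a minimal sketch in plain pandas (it does not call safefeat). The toy data and the `total_amount` column name are assumptions made for illustration; the point is only that the unanchored sum counts an event that happens after the cutoff.

```python
import pandas as pd

# Toy data (assumed): u1 has one event before the cutoff and one after it.
events = pd.DataFrame({
    "user_id": ["u1", "u1"],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-20"]),
    "amount": [10.0, 99.0],
})
spine = pd.DataFrame({
    "user_id": ["u1"],
    "cutoff_time": pd.to_datetime(["2024-01-10"]),
})

# Leaky: the sum ignores cutoff_time, so the 2024-01-20 event is counted.
totals = events.groupby("user_id")["amount"].sum().rename("total_amount").reset_index()
leaky = spine.merge(totals, on="user_id", how="left")

# Point-in-time: join first, keep events at or before the cutoff, then aggregate.
joined = spine.merge(events, on="user_id", how="left")
visible = joined[joined["event_time"] <= joined["cutoff_time"]]
safe = (
    visible.groupby(["user_id", "cutoff_time"])["amount"]
    .sum()
    .rename("total_amount")
    .reset_index()
)

print(leaky["total_amount"].iloc[0])  # 109.0
print(safe["total_amount"].iloc[0])   # 10.0
```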

Install

pip install safefeat

How It Works

safefeat works with three components:

Component   Description
Spine       When to make predictions: one row per (entity_id, cutoff_time)
Events      Historical time-series data tied to each entity
Spec        Declarative definition of what features to compute

For each row in the spine, safefeat joins only events where event_time <= cutoff_time, then computes your features. Future events are excluded.
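
Because the spine has one row per (entity_id, cutoff_time), the same entity can appear with several cutoffs. A small sketch of building such a spine with plain pandas follows; the monthly scoring schedule is a hypothetical choice, not something safefeat prescribes.

```python
import pandas as pd

entities = pd.DataFrame({"entity_id": ["u1", "u2"]})

# Hypothetical schedule: score every entity at the start of each month.
cutoffs = pd.DataFrame({
    "cutoff_time": pd.date_range("2024-01-01", periods=3, freq="MS"),
})

# A cross join yields one row per (entity_id, cutoff_time).
spine = entities.merge(cutoffs, how="cross")

print(spine.shape)  # (6, 2)
```

Each of the six rows would then get its own feature values, computed only from events visible at that row's cutoff.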


Quick Start

Window aggregations

import pandas as pd
from safefeat import build_features, WindowAgg

spine = pd.DataFrame({
    "entity_id":   ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})

events = pd.DataFrame({
    "entity_id":  ["u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"]),
    "amount":     [10.0, 20.0, 5.0, 25.0],
    "event_type": ["click", "purchase", "purchase", "click"],
})

spec = [
    WindowAgg(
        table="events",
        windows=["7D", "30D"],
        metrics={
            "*":          ["count"],        # total events
            "amount":     ["sum", "mean"],  # numeric aggregations
            "event_type": ["nunique"],      # distinct event types
        },
    )
]

X = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    allowed_lag="0s",
)

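Output columns follow the pattern {table}__{column}__{agg}__{window}, with the window in lowercase. The "*" count metric has no source column and appears as {table}__n_events__{window}: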

events__n_events__7d
events__amount__sum__7d
events__amount__mean__30d
events__event_type__nunique__30d
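
One way to sanity-check a window feature is to recompute it by hand in plain pandas on the Quick Start data. The sketch below assumes a window covers the half-open interval (cutoff_time - window, cutoff_time]; safefeat's exact boundary handling is not stated here, so treat that as an assumption. `window_sum` is a hypothetical helper, not part of the library.

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"]),
    "amount": [10.0, 20.0, 5.0, 25.0],
})

def window_sum(spine, events, window):
    """Sum `amount` over events in (cutoff_time - window, cutoff_time] (assumed bounds)."""
    delta = pd.Timedelta(window)
    joined = spine.merge(events, on="entity_id", how="left")
    in_window = (joined["event_time"] <= joined["cutoff_time"]) & (
        joined["event_time"] > joined["cutoff_time"] - delta
    )
    sums = (
        joined[in_window]
        .groupby(["entity_id", "cutoff_time"])["amount"]
        .sum()
        .rename(f"events__amount__sum__{window.lower()}")
        .reset_index()
    )
    # Left-merge back so spine rows with no events in the window are kept (as NaN).
    return spine.merge(sums, on=["entity_id", "cutoff_time"], how="left")

out_7d = window_sum(spine, events, "7D")
out_30d = window_sum(spine, events, "30D")
print(out_7d["events__amount__sum__7d"].tolist())    # [30.0, 25.0]
print(out_30d["events__amount__sum__30d"].tolist())  # [30.0, 30.0]
```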

Recency features

Time since the most recent event before each cutoff — useful for churn, fraud, and behavioural modelling:

from safefeat import RecencyBlock

spec = [RecencyBlock(table="events")]

X = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
)
# Adds: events__recency (days since last event before cutoff_time)

Filter by event type:

spec = [
    RecencyBlock(
        table="events",
        filter_col="event_type",
        filter_value="purchase",
    )
]
# Adds: events__recency__event_type_purchase
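
For intuition about what a recency value represents, the following plain-pandas sketch recomputes "days since the last visible event" on the Quick Start data. `days_since_last` is a hypothetical helper; whether safefeat returns whole days or fractional days, and whether an event exactly at the cutoff counts, are assumptions here.

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-06", "2024-01-10", "2024-01-30"]),
    "event_type": ["click", "purchase", "purchase", "click"],
})

def days_since_last(spine, events, event_type=None):
    """Whole days from the latest visible event to cutoff_time, per spine row."""
    if event_type is not None:
        events = events[events["event_type"] == event_type]
    joined = spine.merge(events, on="entity_id", how="left")
    # Assumed rule: an event is visible if it is at or before the cutoff.
    visible = joined[joined["event_time"] <= joined["cutoff_time"]]
    last_seen = (
        visible.groupby(["entity_id", "cutoff_time"])["event_time"]
        .max()
        .rename("last_event_time")
        .reset_index()
    )
    out = spine.merge(last_seen, on=["entity_id", "cutoff_time"], how="left")
    return (out["cutoff_time"] - out["last_event_time"]).dt.days

all_events = days_since_last(spine, events).tolist()
purchases = days_since_last(spine, events, "purchase").tolist()
print(all_events)  # [4, 1]
print(purchases)   # [4, 21]
```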

Audit report

Verify exactly which events were included and dropped for each prediction point:

X, audit = build_features(
    spine=spine,
    tables={"events": events},
    spec=spec,
    event_time_cols={"events": "event_time"},
    return_report=True,
)

events_audit = audit.tables.get("events")
print(events_audit.total_joined_pairs)    # total event-cutoff pairs considered
print(events_audit.kept_pairs)            # events before cutoff (used)
print(events_audit.dropped_future_pairs)  # events after cutoff (excluded)
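
The counts in the report can be cross-checked by hand. The sketch below assumes a "pair" is one (spine row, event) combination after joining on entity_id, which is an interpretation of the field names rather than documented behaviour. One event date is changed from the Quick Start data so that a future event actually exists.

```python
import pandas as pd

spine = pd.DataFrame({
    "entity_id": ["u1", "u2"],
    "cutoff_time": pd.to_datetime(["2024-01-10", "2024-01-31"]),
})
# Assumed data: u1's second event (2024-01-12) falls after its cutoff.
events = pd.DataFrame({
    "entity_id": ["u1", "u1", "u2", "u2"],
    "event_time": pd.to_datetime(["2024-01-05", "2024-01-12", "2024-01-10", "2024-01-30"]),
})

# Every event paired with every cutoff of the same entity.
pairs = spine.merge(events, on="entity_id", how="inner")
is_future = pairs["event_time"] > pairs["cutoff_time"]

total_pairs = len(pairs)
kept = int((~is_future).sum())
dropped_future = int(is_future.sum())
print(total_pairs, kept, dropped_future)  # 4 3 1
```

If the numbers from a hand check like this disagree with the audit report, the join keys or the time column mapping are the first things to inspect.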

Development

pip install -e ".[dev]"
pytest -q
ruff check .

Documentation

Full documentation, concepts, and API reference: 👉 https://alishaang.github.io/safefeat/
