Temporal correctness layer for ML training data

Timefence

Your ML model may be trained on the future. Find out in one command.

Website · Docs · Changelog · Contributing


Timefence finds and fixes temporal data leakage in ML training sets. No infrastructure required — runs locally, reads Parquet/CSV, and finishes in seconds.

If you build training data by joining features to labels, your model may be training on the future. A plain LEFT JOIN, or a carelessly configured merge_asof, gives each label the latest feature row, including data recorded after the event you're predicting. Offline metrics look great, production doesn't match, and nothing fails along the way: no error, no warning, no way to tell from the output alone.
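
To make the failure mode concrete, here is a minimal pure-Python sketch, independent of Timefence, that contrasts a "take the latest row" join with a point-in-time lookup. The data and the helper names `latest_row` and `point_in_time_row` are hypothetical and exist only for illustration.

```python
from bisect import bisect_left
from datetime import datetime

# Hypothetical feature history for one user: (feature_time, days_since_login).
feature_rows = [
    (datetime(2024, 1, 1), 2),
    (datetime(2024, 1, 10), 5),
    (datetime(2024, 2, 1), 40),  # recorded after the label event below
]
label_time = datetime(2024, 1, 15)  # the moment the prediction is made


def latest_row(rows):
    """Leaky join: take the newest feature row and ignore label_time."""
    return max(rows, key=lambda r: r[0])


def point_in_time_row(rows, as_of):
    """Point-in-time join: newest row with feature_time strictly before as_of."""
    rows = sorted(rows, key=lambda r: r[0])
    times = [t for t, _ in rows]
    idx = bisect_left(times, as_of)  # number of rows with time < as_of
    return rows[idx - 1] if idx > 0 else None


leaky = latest_row(feature_rows)
clean = point_in_time_row(feature_rows, label_time)
print(leaky[1], clean[1])  # 40 5
```

The leaky lookup returns the February value for a January label, and neither lookup raises an error.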

pip install timefence

Try It in 60 Seconds

timefence quickstart churn-example && cd churn-example
timefence audit data/train_LEAKY.parquet
TEMPORAL AUDIT REPORT
Scanned 5,000 rows

WARNING  LEAKAGE DETECTED in 2 of 4 features

  LEAK  rolling_spend_30d
        1,520 rows (30.4%) use feature data from the future
        Severity: HIGH

  LEAK  days_since_login
        4,909 rows (98.2%) use feature data from the future
        Severity: HIGH

  OK    user_country - clean (5,000 rows)
  OK    account_age_days - clean (5,000 rows)

Rebuild it with temporal correctness:

timefence build --labels data/labels.parquet --features features.py --output train_CLEAN.parquet
Building training set...

  Labels     5,000 rows from data/labels.parquet
  Features   4 features

  Joining with point-in-time correctness (feature_time < label_time):

  OK  user_country         5,000 / 5,000 matched
  OK  account_age_days     5,000 / 5,000 matched
  OK  rolling_spend_30d    5,000 / 5,000 matched
  OK  days_since_login     5,000 / 5,000 matched

  Written   train_CLEAN.parquet (5,000 rows, 7 cols)

Verify:

timefence audit train_CLEAN.parquet
# ALL CLEAN - no temporal leakage detected

Audit Your Existing Data

You don't need to change your pipeline. Point Timefence at any training set you already have:

timefence audit your_training_set.parquet --features features.py --keys user_id --label-time label_time

If it's clean, you'll know. If it's not, you'll see exactly which features leak, how many rows, and the severity. Takes seconds.
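
As a rough illustration of what a per-feature temporal check counts, the sketch below tallies rows whose feature timestamp is not strictly before the label time. It is not Timefence's implementation. It assumes, hypothetically, that each training row stores the source timestamp of a feature under a `<feature>_time` key.

```python
from datetime import datetime


def count_future_rows(rows, feature, label_time_key="label_time"):
    """Count rows that violate feature_time < label_time for one feature.

    Assumes (hypothetically) that each row stores the feature's source
    timestamp under the key f"{feature}_time".
    """
    leaked = sum(1 for r in rows if r[f"{feature}_time"] >= r[label_time_key])
    pct = 100.0 * leaked / len(rows) if rows else 0.0
    return leaked, pct


# Made-up rows for illustration only.
rows = [
    {"label_time": datetime(2024, 3, 1), "rolling_spend_30d_time": datetime(2024, 2, 28)},
    {"label_time": datetime(2024, 3, 1), "rolling_spend_30d_time": datetime(2024, 3, 5)},
    {"label_time": datetime(2024, 4, 1), "rolling_spend_30d_time": datetime(2024, 4, 1)},
    {"label_time": datetime(2024, 5, 1), "rolling_spend_30d_time": datetime(2024, 4, 20)},
]

leaked, pct = count_future_rows(rows, "rolling_spend_30d")
print(f"{leaked} rows ({pct:.1f}%) violate feature_time < label_time")
```

A feature row stamped at exactly the label time counts as a violation here, because the rule is a strict inequality.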

Python API

Audit any existing dataset — no sources or feature definitions needed:

import timefence

report = timefence.audit("train.parquet", keys=["user_id"], label_time="label_time")
report.assert_clean()  # raises if leakage found

Or define sources and features to build a correct dataset from scratch:

import timefence

users = timefence.Source(path="data/users.parquet", keys=["user_id"], timestamp="updated_at")
txns  = timefence.Source(path="data/txns.parquet", keys=["user_id"], timestamp="created_at")

country = timefence.Feature(source=users, columns=["country"])
spend   = timefence.Feature(source=txns, embargo="1d", name="spend_30d", sql="""
    SELECT user_id, created_at AS feature_time,
           SUM(amount) OVER (PARTITION BY user_id ORDER BY created_at
               RANGE BETWEEN INTERVAL 30 DAY PRECEDING AND CURRENT ROW) AS spend_30d
    FROM {source}
""")

labels = timefence.Labels(
    path="data/labels.parquet", keys=["user_id"],
    label_time="label_time", target=["churned"],
)

result = timefence.build(labels=labels, features=[country, spend], output="train.parquet")
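
For readers less familiar with SQL window frames, the following pure-Python sketch mirrors what the `RANGE BETWEEN INTERVAL 30 DAY PRECEDING AND CURRENT ROW` frame in the `spend_30d` feature computes: for each transaction, the sum of that user's amounts over the preceding 30 days, both ends inclusive. The function name `rolling_sum` and the sample transactions are hypothetical. The quadratic loop is chosen for clarity, not speed.

```python
from datetime import datetime, timedelta


def rolling_sum(txns, window=timedelta(days=30)):
    """For each (user_id, created_at, amount) row, sum that user's amounts
    with created_at in [row_time - window, row_time], both ends inclusive."""
    out = []
    for user_id, created_at, _ in txns:
        total = sum(
            amount
            for other_user, other_time, amount in txns
            if other_user == user_id
            and created_at - window <= other_time <= created_at
        )
        out.append((user_id, created_at, total))
    return out


# Made-up transactions for illustration only.
txns = [
    ("u1", datetime(2024, 1, 1), 10.0),
    ("u1", datetime(2024, 1, 20), 5.0),
    ("u1", datetime(2024, 2, 15), 7.0),
    ("u2", datetime(2024, 1, 5), 3.0),
]

for user_id, feature_time, spend_30d in rolling_sum(txns):
    print(user_id, feature_time.date(), spend_30d)
```

Each output row carries its own timestamp, which is what lets a later point-in-time join decide whether the value was available at a given label time.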

Add to CI

Stop leakage before it reaches production:

- run: pip install timefence && timefence audit data/train.parquet --features features.py --strict

--strict exits with code 1 on leakage. Your pipeline fails before a leaky model ever trains.

Performance

Built on DuckDB's columnar engine. Median of 3 runs after warmup (Intel i7, 16 GB):

| Scenario | Labels | Features | Build | Audit |
|---|---|---|---|---|
| Small project | 100K | 1 | 0.5s | 0.3s |
| Typical project | 100K | 10 | 1.9s | 1.7s |
| Large project | 1M | 1 | 3.0s | 2.0s |
| Large + many features | 1M | 10 | 12s | 8.5s |

Adding embargo, staleness, and splits costs seconds, not minutes.

Run benchmarks yourself
uv run python benchmarks/bench.py --quick
uv run python benchmarks/bench.py --quick --include-pandas

How It Works

Timefence generates SQL (ASOF JOIN or ROW_NUMBER) and runs it in an embedded DuckDB. No server, no JVM, no Spark. It enforces one rule — feature_time < label_time - embargo — for every row, every feature, every build. Every query is inspectable via timefence -v build or timefence explain.
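
The ROW_NUMBER approach can be sketched with Python's built-in sqlite3 module. This is not the SQL Timefence generates, and it runs on SQLite rather than DuckDB. The table names, integer timestamps, and embargo value are assumptions chosen to keep the example small. Window functions require SQLite 3.25 or newer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE labels   (user_id TEXT, label_time INTEGER);
    CREATE TABLE features (user_id TEXT, feature_time INTEGER, spend REAL);
    INSERT INTO labels   VALUES ('u1', 100), ('u2', 100);
    INSERT INTO features VALUES
        ('u1', 50, 1.0), ('u1', 95, 2.0), ('u1', 120, 3.0),
        ('u2', 99, 9.0);
""")

EMBARGO = 10  # hypothetical embargo, in the same unit as the timestamps

# Keep only feature rows that satisfy feature_time < label_time - embargo,
# rank them newest-first per label, and take the top-ranked row.
query = """
    SELECT user_id, label_time, feature_time, spend
    FROM (
        SELECT l.user_id, l.label_time, f.feature_time, f.spend,
               ROW_NUMBER() OVER (
                   PARTITION BY l.user_id, l.label_time
                   ORDER BY f.feature_time DESC
               ) AS rn
        FROM labels AS l
        LEFT JOIN features AS f
          ON f.user_id = l.user_id
         AND f.feature_time < l.label_time - :embargo
    )
    WHERE rn = 1
    ORDER BY user_id
"""
rows = con.execute(query, {"embargo": EMBARGO}).fetchall()
print(rows)  # [('u1', 100, 50, 1.0), ('u2', 100, None, None)]
```

For u1, the row at time 95 falls inside the embargo window and the row at 120 is in the future, so the row at 50 is selected. For u2, no row qualifies, and the LEFT JOIN keeps the label with a missing feature value instead of dropping it.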

All Features

Joins: Point-in-time correct. ASOF JOIN fast path, ROW_NUMBER fallback
Guardrails: Embargo, max lookback, max staleness, all configurable
Inputs: Parquet, CSV, SQL query, DataFrame
Feature modes: Column selection, SQL, Python transform
Splitting: Time-based train / validation / test splits
Caching: Feature-level cache with content-hash keys
Audit: Full rebuild-and-compare or lightweight temporal check
Reports: Severity classification. JSON manifest, HTML report, Rich terminal
CLI: quickstart, build, audit, explain, diff, inspect, catalog, doctor
Flags: -v verbose · --debug · --strict CI gate · --json · --html
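
As an illustration of the time-based splitting idea listed above, the sketch below partitions rows by label time so that validation and test data always come after the training data. It does not use Timefence's API. The function name `time_split`, its parameters, and the sample rows are hypothetical.

```python
from datetime import datetime


def time_split(rows, train_end, valid_end, time_key="label_time"):
    """Split rows into train / validation / test by label time.

    train:      time <  train_end
    validation: train_end <= time < valid_end
    test:       time >= valid_end
    """
    train, valid, test = [], [], []
    for row in rows:
        t = row[time_key]
        if t < train_end:
            train.append(row)
        elif t < valid_end:
            valid.append(row)
        else:
            test.append(row)
    return train, valid, test


# Made-up labels for illustration only.
rows = [
    {"user_id": "u1", "label_time": datetime(2024, 1, 10)},
    {"user_id": "u2", "label_time": datetime(2024, 2, 10)},
    {"user_id": "u3", "label_time": datetime(2024, 3, 10)},
    {"user_id": "u4", "label_time": datetime(2024, 4, 10)},
]

train, valid, test = time_split(
    rows, train_end=datetime(2024, 3, 1), valid_end=datetime(2024, 4, 1)
)
print(len(train), len(valid), len(test))  # 2 1 1
```

Unlike a random split, this keeps every evaluation row later in time than every training row, which matches how the model is used in production.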

What Timefence Is NOT

| Not This | Why | Use Instead |
|---|---|---|
| Feature store | No server, no online serving | Tecton, Feast |
| Data orchestrator | No scheduling, no DAGs | Airflow, Dagster |
| Data quality framework | Temporal correctness only | Great Expectations |
| ML pipeline framework | Produces training data only | MLflow, Metaflow |

One tool. One job. Temporal correctness for ML training data.


MIT License
