⚡ flashback

Git for Datasets — time-travel debugging and transformation lineage tracking for pandas & Polars.

📂 load  ──▶  🔍 filter  ──▶  ➕ with_columns  ──▶  ⏪ lag  ──▶  HEAD
                                      │
                                  (before-lag)  ◀── fb.checkout("before-lag")

Why this exists

Every ML researcher has asked: "Why did my metric change?" Nobody knows.

You ran a 6-hour training job, the Sharpe ratio dropped from 1.4 to 0.9, and somewhere between the raw tick data and the feature matrix a silent transformation introduced look-ahead bias. You have no idea where.

DVC is too heavy — it versions entire files with S3 backends, CI pipelines, and YAML configs. You don't want to learn a new orchestration system; you want to know what happened to column price_lag1 between step 3 and step 7.

Git doesn't understand columns. git diff on a Parquet file is binary noise. It cannot tell you "this .filter() removed 412 rows" or "this .with_columns() introduced a null in 3% of rows."

flashback fixes this.

It wraps your DataFrame in a thin proxy that records every transformation as a node in an in-memory Directed Acyclic Graph (DAG). Each node is identified by a deterministic SHA-256 hash of its parent node IDs, the operation and its arguments, and the schema, giving you:

  • Instant time-travel: fb.checkout("before-lag") returns the exact frame at that checkpoint with no I/O unless you ask for it.
  • Structural diffing: frame.diff(other) shows you exactly which rows were added or removed between any two checkpoints.
  • Beautiful lineage views: fb.visualize() renders a Rich-powered git-log-style tree in your terminal, or an SVG graph in Jupyter.
  • Reproducibility: identical transformations applied to identical data always produce the same node ID, because the ID is derived deterministically from the parent node, the operation, its arguments, and the schema.

Install

pip install flashback-df
# or, if you use uv (recommended):
uv add flashback-df

Requirements: Python ≥ 3.10, Polars ≥ 0.20, pandas ≥ 2.0.


Quickstart

import flashback as fb

# ── 1. Load any source (pick one) ───────────────────────────────────────────
df = fb.load("trades.parquet")          # Parquet
# df = fb.load("prices.csv")            # CSV
# df = fb.load(my_polars_df)            # existing Polars DataFrame
# df = fb.load(my_pandas_df)            # existing pandas DataFrame

# ── 2. Transform — every step is recorded automatically ─────────────────────
df = df.filter(fb.col("price") > 0)
df = df.with_columns(
    (fb.col("price") * fb.col("volume")).alias("notional")
)

# Tag a checkpoint before the next risky operation.
df = df.tag("before-lag")

df = df.lag("price", 1)               # sugar for shift(1) + tracking
df = df.rolling_mean("notional", 5)

# ── 3. Time-travel ──────────────────────────────────────────────────────────
df_clean = fb.checkout("before-lag")  # ← instant; no disk I/O

# ── 4. See what broke your Sharpe ratio ─────────────────────────────────────
fb.visualize()

Terminal output:

╭─ flashback lineage  •  4 commits  •  HEAD → rolling_mean ───────────────────────────────╮
│                                                                                         │
│  📂 LOAD  5,000 rows × 4 cols  [14:03:01]                                               │
│  │                                                                                      │
│  ├─ 🔍 filter  arg_0=...col("price")...  4,823 rows × 4 cols  #a1b2c3d4                 │
│  │                                                                                      │
│  ├─ ➕ with_columns  arg_0=...alias("notional")  4,823 rows × 5  [before-lag]  #e5f6a7  │
│  │                                                                                      │
│  ├─ ⏪ lag  column='price'  n=1  4,823 rows × 6  #b8c9d0                                │
│  │                                                                                      │
│  └─ 📈 rolling_mean  window=5  4,823 rows × 7 ● HEAD  #01e2f3a4                         │
│                                                                                         │
╰─────────────────────────────────────────────────────────────────────────────────────────╯

API Reference

fb.load(source, *, label=None, track=True)

Load a DataFrame from a file path, a Polars DataFrame, a pandas DataFrame, or an existing FlashbackFrame, and begin tracking its lineage.

Param    Type                                                 Description
source   str | pl.DataFrame | pd.DataFrame | FlashbackFrame   Data source
label    str | None                                           Human-readable root label (default: filename stem or "root")
track    bool                                                 Register with the global registry (default: True)

Supported formats: .parquet, .csv, .json, .ndjson, .ipc, .arrow
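
As a rough illustration of how a loader could dispatch on the file extension and pick the default label, the sketch below maps each supported suffix to the name of the matching Polars reader (pl.read_parquet, pl.read_csv, pl.read_json, pl.read_ndjson, pl.read_ipc). The helper resolve_source and the mapping are assumptions made for this example; they are not flashback's actual internals.

```python
from pathlib import Path

# Assumed mapping from file suffix to the name of a Polars reader function.
READERS = {
    ".parquet": "read_parquet",
    ".csv": "read_csv",
    ".json": "read_json",
    ".ndjson": "read_ndjson",
    ".ipc": "read_ipc",
    ".arrow": "read_ipc",
}


def resolve_source(source, label=None):
    """Return (reader_name, label); reader_name is None for in-memory frames."""
    if isinstance(source, (str, Path)):
        path = Path(source)
        suffix = path.suffix.lower()
        if suffix not in READERS:
            raise ValueError(f"unsupported format: {suffix or path.name!r}")
        # Default label for files: the filename stem.
        return READERS[suffix], label or path.stem
    # Anything else is treated as an already-loaded frame.
    return None, label or "root"


print(resolve_source("data/trades.parquet"))
print(resolve_source("prices.csv", label="raw-prices"))
```

An explicit label always wins; the stem and "root" are only fallbacks.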


fb.col(name)

Alias for polars.col, so you can build expressions inside transform chains without importing Polars separately:

df = df.filter(fb.col("price") > 0)

fb.commit(frame, label, *, message="")

Tag the current state of frame with a human-readable label — analogous to git tag.

df = fb.commit(df, "before-normalise", message="Raw features, no scaling")

Or use the method form:

df = df.tag("before-normalise", message="Raw features, no scaling")

fb.checkout(label, *, frame=None)

Time-travel to a named checkpoint. Returns a new FlashbackFrame at that exact state, fully materialised.

df_original = fb.checkout("before-normalise")

If frame is provided, searches only that frame's lineage. Otherwise, searches the global registry.
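
The two lookup modes can be pictured with a small stand-alone model. In this sketch, Node, Registry, and the head parameter are hypothetical names, and each node simply stores a list as its snapshot; flashback's real data structures may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    op: str
    rows: list                        # snapshot of the data at this step
    parent: Optional["Node"] = None
    label: Optional[str] = None


class Registry:
    """Hypothetical global registry mapping labels to nodes."""

    def __init__(self):
        self._labels = {}

    def tag(self, node, label):
        node.label = label
        self._labels[label] = node
        return node

    def checkout(self, label, head=None):
        if head is None:
            # Global lookup across every tagged node.
            node = self._labels.get(label)
        else:
            # Scoped lookup: walk from head back towards the root.
            node = head
            while node is not None and node.label != label:
                node = node.parent
        if node is None:
            raise KeyError(f"no checkpoint named {label!r}")
        return list(node.rows)        # return a copy, not the stored snapshot


reg = Registry()
root = Node("load", [5, -1, 7])
clean = reg.tag(Node("filter", [5, 7], parent=root), "before-lag")
lagged = Node("lag", [None, 5], parent=clean)

print(reg.checkout("before-lag"))
print(reg.checkout("before-lag", head=lagged))
```

Returning a copy keeps later edits to the checked-out data from altering the stored checkpoint.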


fb.visualize(frame=None, *, style="tree", max_width=120)

Render the transformation lineage.

  • style="tree" — rich tree with icons, timestamps, shapes, node IDs.
  • style="dag" — compact ASCII graph (git log --graph style).
  • In Jupyter, automatically falls back to an SVG/HTML widget.
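
To give a feel for the tree style, here is a plain-text renderer over a list of step dicts shaped like the output of history(). The function render_tree and its exact layout are assumptions for illustration; the real output uses Rich and carries more detail.

```python
def render_tree(steps, head_marker="● HEAD"):
    """Render steps (root first) as a git-log-style text tree."""
    lines = []
    last = len(steps) - 1
    for i, step in enumerate(steps):
        rows, cols = step["shape"]
        if i == 0:
            prefix = ""               # the root has no branch glyph
        elif i == last:
            prefix = "└─ "
        else:
            prefix = "├─ "
        text = f"{prefix}{step['op_name']}  {rows:,} rows × {cols} cols"
        if step.get("label"):
            text += f"  [{step['label']}]"
        if i == last:
            text += f"  {head_marker}"
        lines.append(text)
    return "\n".join(lines)


steps = [
    {"op_name": "load", "shape": (5000, 4), "label": None},
    {"op_name": "filter", "shape": (4823, 4), "label": "before-lag"},
    {"op_name": "lag", "shape": (4823, 5), "label": None},
]
print(render_tree(steps))
```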

FlashbackFrame.lag(column, n=1, *, alias=None)

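Shift column forward by n periods, so each row receives the value from n rows earlier, and record the step in the lineage. The result is written to a new column named <column>_lag<n> unless alias is given.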

df = df.lag("price", 1)                    # → price_lag1
df = df.lag("price", 3, alias="price_t3")  # → price_t3
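
The direction of the shift is what separates a safe lag from look-ahead bias. The plain-Python sketch below (shift is a hypothetical helper, not part of flashback) follows the same sign convention as Polars' shift: a positive n moves values forward, a negative n pulls future values back.

```python
def shift(values, n):
    """Shift a list by n positions, padding with None (n > 0 lags, n < 0 leads)."""
    size = len(values)
    if n == 0:
        return list(values)
    if abs(n) >= size:
        return [None] * size
    if n > 0:
        return [None] * n + list(values[:-n])
    return list(values[-n:]) + [None] * (-n)


prices = [10.0, 11.0, 12.0, 13.0]
lagged = shift(prices, 1)    # each row sees the previous row's price
led = shift(prices, -1)      # each row sees the next row's price: look-ahead

print(lagged)  # [None, 10.0, 11.0, 12.0]
print(led)     # [11.0, 12.0, 13.0, None]
```

A feature built from led would leak tomorrow's price into today's row, which is the kind of silent bug a tagged checkpoint before the lag helps to locate.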

FlashbackFrame.rolling_mean(column, window, *, alias=None, min_periods=None)

Rolling mean over window periods with lineage tracking.

df = df.rolling_mean("notional", 20)  # → notional_rmean20
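
For intuition about window and min_periods, here is a dependency-free rolling mean over a list. It assumes that min_periods falls back to window when omitted and that the input has no missing values; both are assumptions for this sketch, not documented flashback behaviour.

```python
def rolling_mean(values, window, min_periods=None):
    """Trailing rolling mean; None where fewer than min_periods values exist."""
    if window < 1:
        raise ValueError("window must be >= 1")
    if min_periods is None:
        min_periods = window          # assumed default
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) >= min_periods:
            out.append(sum(chunk) / len(chunk))
        else:
            out.append(None)
    return out


print(rolling_mean([1, 2, 3, 4], 2))                 # [None, 1.5, 2.5, 3.5]
print(rolling_mean([1, 2, 3, 4], 2, min_periods=1))  # [1.0, 1.5, 2.5, 3.5]
```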

FlashbackFrame.diff(other)

Structural diff between two frames. Returns a Polars DataFrame with a _diff column of "added" / "removed".

delta = df_now.diff(df_old)
print(delta.filter(fb.col("_diff") == "removed"))
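
One way to think about a row-level structural diff is as a multiset difference. The sketch below uses collections.Counter so that duplicate rows are counted rather than collapsed; diff_rows is a hypothetical helper, and treating "added" as "present in the calling frame but not in other" is an assumption about the direction.

```python
from collections import Counter


def diff_rows(now, old):
    """Return (row, tag) pairs: 'added' if only in now, 'removed' if only in old."""
    now_counts = Counter(tuple(r) for r in now)
    old_counts = Counter(tuple(r) for r in old)
    added = now_counts - old_counts      # multiset difference keeps duplicates
    removed = old_counts - now_counts
    return ([(row, "added") for row in added.elements()]
            + [(row, "removed") for row in removed.elements()])


old = [(1, "a"), (2, "b"), (3, "c")]
now = [(1, "a"), (3, "c"), (4, "d")]
for row, tag in diff_rows(now, old):
    print(tag, row)
```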

FlashbackFrame.history()

Return the full transformation chain as a list of dicts (root → HEAD):

for step in df.history():
    print(step["op_name"], step["shape"], step["label"])
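
Conceptually, producing this list amounts to walking parent pointers from HEAD back to the root and reversing the result. A minimal model, assuming a single-parent chain and a hypothetical Step class:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    op_name: str
    shape: tuple
    label: Optional[str] = None
    parent: Optional["Step"] = None


def history(head):
    """Collect steps from head back to the root, then return them root first."""
    chain = []
    node = head
    while node is not None:
        chain.append({"op_name": node.op_name, "shape": node.shape, "label": node.label})
        node = node.parent
    chain.reverse()
    return chain


root = Step("load", (5000, 4))
filtered = Step("filter", (4823, 4), label="before-lag", parent=root)
head = Step("lag", (4823, 5), parent=filtered)

for step in history(head):
    print(step["op_name"], step["shape"], step["label"])
```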

Persistence

Lineage graphs can be saved to and loaded from disk:

from flashback.storage import Storage

store = Storage(".flashback")  # or Storage.from_cwd()
store.save(df, frame_id="experiment-001")

# Later, in another session:
df = store.load("experiment-001")

The .flashback/ directory layout:

.flashback/
├── config.json
├── graphs/
│   └── experiment-001.json   # serialised DAG
└── cache/
    └── <node_id>.parquet     # materialised node snapshots
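
The graphs/ half of this layout can be approximated with the standard library alone. In the sketch below, save_graph and load_graph are hypothetical helpers and the JSON schema is an assumption; it only shows the idea of one JSON file per frame_id.

```python
import json
import tempfile
from pathlib import Path


def save_graph(root, frame_id, nodes):
    """Write the node list to <root>/graphs/<frame_id>.json and return the path."""
    path = Path(root) / "graphs" / f"{frame_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"frame_id": frame_id, "nodes": nodes}, indent=2))
    return path


def load_graph(root, frame_id):
    """Read the node list back from disk."""
    path = Path(root) / "graphs" / f"{frame_id}.json"
    return json.loads(path.read_text())["nodes"]


nodes = [
    {"id": "n0", "op": "load", "parents": []},
    {"id": "n1", "op": "filter", "parents": ["n0"]},
]
with tempfile.TemporaryDirectory() as tmp:
    store_root = Path(tmp) / ".flashback"
    saved = save_graph(store_root, "experiment-001", nodes)
    restored = load_graph(store_root, "experiment-001")
    relative = saved.relative_to(store_root).as_posix()

print(relative)
print(restored == nodes)
```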

How it works

┌──────────────────────────────────────────────────────────┐
│  FlashbackFrame                                          │
│                                                          │
│  ┌──────────────┐    intercept    ┌───────────────────┐  │
│  │  Polars API  │ ─────────────▶ │   LineageDAG      │  │
│  │  .filter()   │                │                   │  │
│  │  .sort()     │  record node   │  root ──▶ filter  │  │
│  │  .join()     │ ◀──────────── │         ──▶ sort  │  │
│  └──────────────┘                │         ──▶ join  │  │
│         │                        └───────────────────┘  │
│         ▼                                               │
│  polars.DataFrame  (unchanged; Polars still optimises)  │
└──────────────────────────────────────────────────────────┘
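
The interception step can be sketched with Python's __getattr__ hook. RecordingProxy and Rows below are stand-ins invented for this example: the proxy forwards every method call to the wrapped object and, when the result is the same type, returns a new proxy whose history has one more entry. A real implementation would also record arguments and build a DAG rather than a flat tuple.

```python
class Rows(list):
    """Toy stand-in for a DataFrame with two chainable methods."""

    def where(self, predicate):
        return Rows(r for r in self if predicate(r))

    def sorted_by(self, key):
        return Rows(sorted(self, key=key))


class RecordingProxy:
    def __init__(self, wrapped, history=()):
        self._wrapped = wrapped
        self._history = tuple(history)

    @property
    def history(self):
        return self._history

    def unwrap(self):
        return self._wrapped

    def __getattr__(self, name):
        # Only called for names not found on the proxy itself.
        attr = getattr(self._wrapped, name)
        if not callable(attr):
            return attr

        def tracked(*args, **kwargs):
            result = attr(*args, **kwargs)
            if isinstance(result, type(self._wrapped)):
                # Same type back: record the step and keep wrapping.
                return RecordingProxy(result, self._history + (name,))
            return result

        return tracked


base = RecordingProxy(Rows([3, -1, 2]))
out = base.where(lambda x: x > 0).sorted_by(lambda x: x)

print(out.history)   # ('where', 'sorted_by')
print(out.unwrap())  # [2, 3]
```

Because each call returns a new proxy, two pipelines that start from the same base keep separate histories.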

Node identity is a SHA-256 digest, truncated to 20 hex characters, of:

{
  "parents": ["<parent_node_id>"],
  "op": "filter",
  "kwargs": {"arg_0": "[(col(\"price\")) > (0)]"},
  "schema": {"id": "Int64", "price": "Float64", ...}
}

This means:

  • Identical pipelines on identical data always hash to the same node → instant cache hits.
  • Changing any argument or any upstream node produces a different hash, because each ID includes its parents' IDs → a changed step is not mistaken for a cached one.
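
A content-addressed ID of this kind can be reproduced with hashlib and json. The sketch below is one plausible scheme, not necessarily the one flashback uses: the canonical JSON encoding, keeping the leading 20 characters, and serialising the schema as ordered pairs are all assumptions.

```python
import hashlib
import json


def node_id(parents, op, kwargs, schema, length=20):
    """Hash parents, operation, arguments, and schema into a short hex ID."""
    payload = {
        "parents": list(parents),
        "op": op,
        "kwargs": kwargs,
        "schema": list(schema.items()),   # pairs preserve column order
    }
    # Sorted keys and fixed separators give the same bytes on every run.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:length]


schema = {"id": "Int64", "price": "Float64"}
a = node_id(["root"], "filter", {"arg_0": '[(col("price")) > (0)]'}, schema)
b = node_id(["root"], "filter", {"arg_0": '[(col("price")) > (0)]'}, schema)
c = node_id(["root"], "filter", {"arg_0": '[(col("price")) > (1)]'}, schema)

print(a == b)   # True: same inputs, same ID
print(a == c)   # False: one argument changed
print(len(a))   # 20
```

Because a child's payload contains its parents' IDs, a change anywhere upstream propagates to every downstream ID.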

Development

git clone https://github.com/flashback-dev/flashback
cd flashback
pip install -e ".[dev]"

# Lint
ruff check flashback tests
ruff format --check flashback tests

# Type-check
mypy flashback

# Test with coverage
pytest

The CI matrix runs across Ubuntu × macOS × Windows and Python 3.10 – 3.13 with a hard 90% coverage threshold.


Roadmap

  • Branching: fb.branch("experiment-A") for parallel pipeline exploration
  • Merge: reconcile two branches at the DAG level
  • Remote storage: push/pull lineage graphs to S3 / GCS
  • Streaming Polars: track lazy plans before .collect()
  • Notebook integration: %load_ext flashback magic with live DAG sidebar
  • Export to DVC: generate .dvc stage files from a flashback DAG

License

MIT — see LICENSE.


Built with Polars · Rich · NetworkX
