Git for Datasets — time-travel debugging and lineage tracking for pandas/Polars.
⚡ flashback
Git for Datasets — time-travel debugging and transformation lineage tracking for pandas & Polars.
📂 load ──▶ 🔍 filter ──▶ ➕ with_columns ──▶ ⏪ lag ──▶ HEAD
│
(before-lag) ◀── fb.checkout("before-lag")
Why this exists
Every ML researcher has asked: "Why did my metric change?" Nobody knows.
You ran a 6-hour training job, the Sharpe ratio dropped from 1.4 to 0.9, and somewhere between the raw tick data and the feature matrix a silent transformation introduced look-ahead bias. You have no idea where.
DVC is too heavy — it versions entire files with S3 backends, CI pipelines,
and YAML configs. You don't want to learn a new orchestration system; you
want to know what happened to column price_lag1 between step 3 and step 7.
Git doesn't understand columns. git diff on a Parquet file is binary
noise. It cannot tell you "this .filter() removed 412 rows" or "this
.with_columns() introduced a null in 3% of rows."
flashback fixes this.
It wraps your DataFrame in a zero-cost proxy that records every transformation as a node in an in-memory Directed Acyclic Graph (DAG). Each node is identified by a deterministic SHA-256 hash of its parent node IDs, the operation and its arguments, and the schema, giving you:
- Instant time-travel: fb.checkout("before-lag") returns the exact frame at that checkpoint with no I/O unless you ask for it.
- Structural diffing: frame.diff(other) shows you exactly which rows were added or removed between any two checkpoints.
- Beautiful lineage views: fb.visualize() renders a rich-powered git-log-style tree in your terminal, or an SVG graph in Jupyter.
- Reproducibility: identical transformations applied to identical data always produce the same node ID; transformations are deterministic by construction.
Install
pip install flashback-df
# or, if you use uv (recommended):
uv add flashback-df
Requirements: Python ≥ 3.10, Polars ≥ 0.20, pandas ≥ 2.0.
Quickstart
import flashback as fb
# ── 1. Load any source ──────────────────────────────────────────────────────
df = fb.load("trades.parquet") # Parquet
df = fb.load("prices.csv") # CSV
df = fb.load(my_polars_df) # existing Polars DataFrame
df = fb.load(my_pandas_df) # existing Pandas DataFrame
# ── 2. Transform — every step is recorded automatically ─────────────────────
df = df.filter(fb.col("price") > 0)
df = df.with_columns(
    (fb.col("price") * fb.col("volume")).alias("notional")
)
# Tag a checkpoint before the next risky operation.
df = df.tag("before-lag")
df = df.lag("price", 1) # sugar for shift(1) + tracking
df = df.rolling_mean("notional", 5)
# ── 3. Time-travel ──────────────────────────────────────────────────────────
df_clean = fb.checkout("before-lag") # ← instant; no disk I/O
# ── 4. See what broke your Sharpe ratio ─────────────────────────────────────
fb.visualize()
Terminal output:
╭─ flashback lineage • 4 commits • HEAD → rolling_mean ──────────────────╮
│ │
│ 📂 LOAD 5,000 rows × 4 cols [14:03:01] │
│ │ │
│ ├─ 🔍 filter arg_0=...col("price")... 4,823 rows × 4 cols #a1b2c3d4 │
│ │ │
│ ├─ ➕ with_columns arg_0=...alias("notional") 4,823 rows × 5 [before-lag] #e5f6a7 │
│ │ │
│ ├─ ⏪ lag column='price' n=1 4,823 rows × 6 #b8c9d0 │
│ │ │
│ └─ 📈 rolling_mean window=5 4,823 rows × 7 ● HEAD #01e2f3a4 │
│ │
╰─────────────────────────────────────────────────────────────────────────────╯
API Reference
fb.load(source, *, label=None, track=True)
Load a DataFrame from a file path, Polars DataFrame, or Pandas DataFrame and begin tracking its lineage.
| Param | Type | Description |
|---|---|---|
| source | str \| pl.DataFrame \| pd.DataFrame \| FlashbackFrame | Data source |
| label | str \| None | Human-readable root label (default: filename stem or "root") |
| track | bool | Register with the global registry (default: True) |
Supported formats: .parquet, .csv, .json, .ndjson, .ipc, .arrow
fb.col(name)
Alias for polars.col. Use inside transform chains for IDE-friendly imports:
df = df.filter(fb.col("price") > 0)
fb.commit(frame, label, *, message="")
Tag the current state of frame with a human-readable label — analogous to
git tag.
df = fb.commit(df, "before-normalise", message="Raw features, no scaling")
Or use the method form:
df = df.tag("before-normalise", message="Raw features, no scaling")
fb.checkout(label, *, frame=None)
Time-travel to a named checkpoint. Returns a new FlashbackFrame at that
exact state, fully materialised.
df_original = fb.checkout("before-normalise")
If frame is provided, searches only that frame's lineage. Otherwise,
searches the global registry.
fb.visualize(frame=None, *, style="tree", max_width=120)
Render the transformation lineage.
- style="tree": rich tree with icons, timestamps, shapes, node IDs.
- style="dag": compact ASCII graph (git log --graph style).
- In Jupyter, automatically falls back to an SVG/HTML widget.
FlashbackFrame.lag(column, n=1, *, alias=None)
Shift column by n periods with a tracked checkpoint.
df = df.lag("price", 1) # → price_lag1
df = df.lag("price", 3, alias="price_t3") # → price_t3
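To make the direction of the shift concrete, here is a minimal plain-Python sketch of what a lag of n periods means. The helper lag_values is hypothetical and not part of flashback; it only illustrates that a lag pulls values from earlier rows, so the first n entries have no history and become None.

```python
from typing import Optional, Sequence


def lag_values(values: Sequence[float], n: int = 1) -> list[Optional[float]]:
    """Hypothetical helper: shift values forward by n rows (a lag)."""
    if n < 0:
        raise ValueError("n must be non-negative; a negative shift would look ahead")
    if n == 0:
        return list(values)
    # Pad the front with None (no history yet) and drop the last n values
    # so the output has the same length as the input.
    head = [None] * min(n, len(values))
    return head + list(values[: max(len(values) - n, 0)])


prices = [10.0, 11.0, 12.5, 12.0]
print(lag_values(prices, 1))  # [None, 10.0, 11.0, 12.5]
```

Row t only sees the value from row t-1. A shift in the opposite direction would expose future values to the current row, which is exactly the look-ahead bias described earlier.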
FlashbackFrame.rolling_mean(column, window, *, alias=None, min_periods=None)
Rolling mean over window periods with lineage tracking.
df = df.rolling_mean("notional", 20) # → notional_rmean20
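As a rough illustration of the semantics, the sketch below computes a trailing mean in plain Python. The helper rolling_mean_values is hypothetical, and the handling of min_periods (a full window is required when it is unset) is an assumption rather than documented flashback behaviour.

```python
from typing import Optional, Sequence


def rolling_mean_values(
    values: Sequence[float], window: int, min_periods: Optional[int] = None
) -> list[Optional[float]]:
    """Hypothetical helper: trailing mean over the last `window` rows."""
    if window < 1:
        raise ValueError("window must be at least 1")
    # Assumption: with min_periods unset, a full window is required.
    required = window if min_periods is None else min_periods
    out: list[Optional[float]] = []
    for i in range(len(values)):
        # Only rows up to and including i are used, never later ones.
        chunk = values[max(0, i - window + 1) : i + 1]
        out.append(sum(chunk) / len(chunk) if len(chunk) >= required else None)
    return out


print(rolling_mean_values([1.0, 2.0, 3.0, 4.0], 2))  # [None, 1.5, 2.5, 3.5]
```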
FlashbackFrame.diff(other)
Structural diff between two frames. Returns a Polars DataFrame with a _diff
column of "added" / "removed".
import polars as pl

delta = df_now.diff(df_old)
print(delta.filter(pl.col("_diff") == "removed"))
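The idea behind a structural diff can be sketched without any DataFrame library. The helper diff_rows below is hypothetical and works on plain tuples; it labels rows as "added" or "removed" using multiset subtraction, which is one possible way to get the behaviour described above and not necessarily how flashback implements it.

```python
from collections import Counter

Row = tuple  # one row as a tuple of column values


def diff_rows(now: list[Row], old: list[Row]) -> list[tuple[str, Row]]:
    """Hypothetical helper: label rows as "added" or "removed"."""
    now_counts, old_counts = Counter(now), Counter(old)
    result: list[tuple[str, Row]] = []
    # Counter subtraction keeps only positive counts, so duplicates are handled.
    for row, count in (now_counts - old_counts).items():
        result.extend([("added", row)] * count)
    for row, count in (old_counts - now_counts).items():
        result.extend([("removed", row)] * count)
    return result


old = [(1, 9.5), (2, -1.0), (3, 10.2)]
now = [(1, 9.5), (3, 10.2), (4, 11.0)]
print(diff_rows(now, old))  # [('added', (4, 11.0)), ('removed', (2, -1.0))]
```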
FlashbackFrame.history()
Return the full transformation chain as a list of dicts (root → HEAD):
for step in df.history():
    print(step["op_name"], step["shape"], step["label"])
Persistence
Lineage graphs can be saved to and loaded from disk:
from flashback.storage import Storage
store = Storage(".flashback") # or Storage.from_cwd()
store.save(df, frame_id="experiment-001")
# Later, in another session:
df = store.load("experiment-001")
The .flashback/ directory layout:
.flashback/
├── config.json
├── graphs/
│   └── experiment-001.json   # serialised DAG
└── cache/
    └── <node_id>.parquet     # materialised node snapshots
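To show how a lineage graph could round-trip through the graphs/ directory, here is a small standard-library sketch. The functions save_graph and load_graph and the JSON shape (a frame_id plus a list of nodes) are assumptions made for illustration; the real on-disk format used by Storage is not documented here.

```python
import json
import tempfile
from pathlib import Path


def save_graph(root: Path, frame_id: str, nodes: list[dict]) -> Path:
    """Hypothetical helper: write a graph to <root>/graphs/<frame_id>.json."""
    graphs = root / "graphs"
    graphs.mkdir(parents=True, exist_ok=True)
    path = graphs / f"{frame_id}.json"
    payload = {"frame_id": frame_id, "nodes": nodes}
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path


def load_graph(root: Path, frame_id: str) -> list[dict]:
    """Hypothetical helper: read the node list back from disk."""
    path = root / "graphs" / f"{frame_id}.json"
    return json.loads(path.read_text(encoding="utf-8"))["nodes"]


nodes = [
    {"id": "n0", "op": "load", "parents": []},
    {"id": "n1", "op": "filter", "parents": ["n0"]},
]
# A temporary directory stands in for the project root.
with tempfile.TemporaryDirectory() as tmp:
    store_root = Path(tmp) / ".flashback"
    save_graph(store_root, "experiment-001", nodes)
    restored = load_graph(store_root, "experiment-001")
print(restored == nodes)  # True
```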
How it works
┌──────────────────────────────────────────────────────────┐
│                      FlashbackFrame                      │
│                                                          │
│  ┌──────────────┐   intercept     ┌───────────────────┐  │
│  │  Polars API  │ ─────────────▶  │    LineageDAG     │  │
│  │  .filter()   │                 │                   │  │
│  │  .sort()     │  record node    │  root ──▶ filter  │  │
│  │  .join()     │ ◀─────────────  │       ──▶ sort    │  │
│  └──────────────┘                 │       ──▶ join    │  │
│         │                         └───────────────────┘  │
│         ▼                                                │
│  polars.DataFrame (unchanged; Polars still optimises)    │
└──────────────────────────────────────────────────────────┘
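The intercept-and-record pattern in the diagram can be sketched in plain Python. Everything below (TrackedFrame, Node, LABELS, checkout) is hypothetical and uses lists of dicts instead of Polars; it keeps a full snapshot on every node, which is the simplest choice but not necessarily what flashback does.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any, Callable, Optional

Row = dict[str, Any]


@dataclass(frozen=True)
class Node:
    """One recorded step: the operation name, a snapshot, and its parent."""
    op_name: str
    rows: tuple[Row, ...]
    parent: Optional[Node] = None


# Hypothetical global registry mapping labels to nodes.
LABELS: dict[str, Node] = {}


class TrackedFrame:
    """Hypothetical proxy: every transformation appends a node to the lineage."""

    def __init__(self, node: Node) -> None:
        self._node = node

    @classmethod
    def load(cls, rows: list[Row]) -> TrackedFrame:
        return cls(Node("load", tuple(rows)))

    @property
    def rows(self) -> list[Row]:
        return list(self._node.rows)

    def _record(self, op_name: str, rows: list[Row]) -> TrackedFrame:
        # Intercept point: link the result of the operation to its parent node.
        return TrackedFrame(Node(op_name, tuple(rows), parent=self._node))

    def filter(self, predicate: Callable[[Row], bool]) -> TrackedFrame:
        return self._record("filter", [r for r in self._node.rows if predicate(r)])

    def with_column(self, name: str, fn: Callable[[Row], Any]) -> TrackedFrame:
        return self._record("with_column", [{**r, name: fn(r)} for r in self._node.rows])

    def tag(self, label: str) -> TrackedFrame:
        # Label the current state without changing the data.
        LABELS[label] = self._node
        return self

    def history(self) -> list[str]:
        # Walk parent pointers from HEAD back to the root, then reverse.
        names, node = [], self._node
        while node is not None:
            names.append(node.op_name)
            node = node.parent
        return names[::-1]


def checkout(label: str) -> TrackedFrame:
    """Return a frame positioned at the labelled node."""
    return TrackedFrame(LABELS[label])


df = TrackedFrame.load([{"price": 10.0, "volume": 3}, {"price": -1.0, "volume": 5}])
df = df.filter(lambda r: r["price"] > 0).tag("clean")
df = df.with_column("notional", lambda r: r["price"] * r["volume"])
print(df.history())            # ['load', 'filter', 'with_column']
print(checkout("clean").rows)  # [{'price': 10.0, 'volume': 3}]
```

Because each node only points at its parent, checking out a label is a dictionary lookup and never re-runs the pipeline.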
Node identity is a SHA-256 digest, truncated to 20 hex characters, of:
{
  "parents": ["<parent_node_id>"],
  "op": "filter",
  "kwargs": {"arg_0": "[(col(\"price\")) > (0)]"},
  "schema": {"id": "Int64", "price": "Float64", ...}
}
This means:
- Identical pipelines on identical data always hash to the same node → instant cache hits.
- Changing any argument or parent state produces a different hash → no silent collisions.
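The two properties above follow from hashing a canonical serialisation of the payload. Here is a minimal sketch with hashlib and json; the function node_id, the canonical JSON settings, and taking the leading 20 hex characters are assumptions for illustration, not flashback's confirmed implementation.

```python
import hashlib
import json


def node_id(parents: list[str], op: str, kwargs: dict[str, str], schema: dict[str, str]) -> str:
    """Hypothetical helper: derive a 20-character hex ID from a node payload."""
    payload = {"parents": parents, "op": op, "kwargs": kwargs, "schema": schema}
    # sort_keys and fixed separators make the serialisation canonical, so
    # equal payloads always produce the same bytes.
    encoded = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    # Assumption: the ID is the leading 20 hex characters of the digest.
    return hashlib.sha256(encoded).hexdigest()[:20]


schema = {"id": "Int64", "price": "Float64"}
a = node_id(["root"], "filter", {"arg_0": '[(col("price")) > (0)]'}, schema)
b = node_id(["root"], "filter", {"arg_0": '[(col("price")) > (0)]'}, schema)
c = node_id(["root"], "filter", {"arg_0": '[(col("price")) > (1)]'}, schema)
print(a == b, a == c, len(a))  # True False 20
```

Sorting keys also discards column order in the schema. An implementation that treats column order as significant would need to serialise the schema as an ordered list of name and dtype pairs instead.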
Development
git clone https://github.com/flashback-dev/flashback
cd flashback
pip install -e ".[dev]"
# Lint
ruff check flashback tests
ruff format --check flashback tests
# Type-check
mypy flashback
# Test with coverage
pytest
The CI matrix runs across Ubuntu × macOS × Windows and Python 3.10 – 3.13 with a hard 90% coverage threshold.
Roadmap
- Branching: fb.branch("experiment-A") for parallel pipeline exploration
- Merge: reconcile two branches at the DAG level
- Remote storage: push/pull lineage graphs to S3 / GCS
- Streaming Polars: track lazy plans before .collect()
- Notebook integration: %load_ext flashback magic with live DAG sidebar
- Export to DVC: generate .dvc stage files from a flashback DAG
License
MIT — see LICENSE.