Polymarket crash-recovery labeled dataset (308 trades, 80.2% WR). Public ground truth for prediction-market mean-reversion research.

cross-signal-data

The labeled Polymarket crash-recovery dataset behind an 80.2% win-rate live trading bot.

308 closed trades. Real Polymarket markets. Real entry triggers. Real outcomes. Public for anyone who wants to build their own mean-reversion bot, replicate our results, or prove us wrong.

What's in here

A single CSV (data/crashes_v1.csv) with one row per closed trade on Polymarket where the crash-recovery bot entered. Each row has:

  • The market (public Polymarket market_id and question text)
  • The signal (pre_crash_high, entry_price, drop_pct)
  • The outcome (exit_price, exit_reason, pnl_usd, is_profitable)
  • Time features (entry_hour_utc, entry_dow, hold_hours)

Stat                            Value
Total trades                    308
Profitable                      247 (80.2%)
Date range                      March 2026 – April 2026
Median hold                     ~3 hours
Avg drop_pct at entry           ~22%
Avg recovered_to_pct_of_high    ~85%

Exit reason                       Count
RECOVERY (price came back)        235
TIMEOUT_48H (held 48h, exited)    62
TIMEOUT (early TIMEOUT exit)      11

Why this exists

Most prediction-market datasets are either:

  • Synthetic (generated for academic papers, no real money behind them), or
  • Aggregate (volume, liquidity at hourly resolution — useless for tactical signals)

This is neither. It's the actual labeled examples of a single specific signal — Polymarket markets that crashed N% from a recent high — paired with the actual outcome of trading the recovery. If you want to study whether mean-reversion works on prediction markets, this is the data.
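As a rough illustration of what "crashed N% from a recent high" can mean in code, here is a minimal sketch. The function name, the window length, and the threshold are all hypothetical; the bot's actual signal definition lives in docs/methodology.md and may differ.

```python
from collections import deque


def find_crash_entries(prices, window=24, threshold_pct=20.0):
    """Flag bars whose price sits at least threshold_pct below the trailing high.

    `window` and `threshold_pct` are illustrative assumptions, not the bot's settings.
    Returns a list of (index, trailing_high, price, drop_pct) tuples.
    """
    recent = deque(maxlen=window)  # trailing prices, excluding the current bar
    hits = []
    for i, price in enumerate(prices):
        if recent:
            high = max(recent)
            if high > 0:
                drop_pct = (high - price) / high * 100
                if drop_pct >= threshold_pct:
                    hits.append((i, high, price, round(drop_pct, 2)))
        recent.append(price)
    return hits


# Made-up price path: a drop from 0.62 to 0.45, then a partial recovery.
prices = [0.60, 0.62, 0.61, 0.45, 0.50, 0.58]
hits = find_crash_entries(prices, window=3, threshold_pct=20.0)
print(hits)
```

Only the bar at index 3 fires here: the bar after it is about 19% below the trailing high, which falls just short of the assumed 20% threshold.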

Install

pip install cross-signal-data

Quick use (Python)

from cross_signal_data import load

df = load()
print(df.shape)              # (308, 19)
print(df.columns.tolist())   # full list of fields

# Filter to RECOVERY-only trades
recovered = df[df["exit_reason"] == "RECOVERY"]

# What entry-price bucket has the best win rate?
buckets = df.groupby(df["entry_price"].round(2)).agg(
    n=("trade_id", "count"),
    win_rate=("is_profitable", "mean"),
)
print(buckets)

If you don't have pandas:

from cross_signal_data import load
rows = load(as_pandas=False)  # list of dicts
print(len(rows), rows[0])

Quick use (any language)

The file is plain CSV. Just download it:

curl -o crashes_v1.csv https://raw.githubusercontent.com/LuciferForge/cross-signal-data/main/data/crashes_v1.csv
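Because the file is plain CSV, the standard library is enough to read it. The sketch below parses an in-memory sample so it runs without the download; the two rows are made up, and only a handful of the 19 columns are shown (names taken from this README).

```python
import csv
import io

# Two made-up rows; values are illustrative only, not taken from the dataset.
SAMPLE = """trade_id,entry_price,pre_crash_high,exit_reason,is_profitable
t1,0.40,0.50,RECOVERY,1
t2,0.30,0.45,TIMEOUT_48H,0
"""


def read_trades(fileobj):
    """Parse trades from an open CSV file object, converting numeric fields."""
    rows = []
    for row in csv.DictReader(fileobj):
        row["entry_price"] = float(row["entry_price"])
        row["pre_crash_high"] = float(row["pre_crash_high"])
        row["is_profitable"] = int(row["is_profitable"])
        rows.append(row)
    return rows


trades = read_trades(io.StringIO(SAMPLE))
win_rate = sum(t["is_profitable"] for t in trades) / len(trades)
print(len(trades), win_rate)
```

For the downloaded file, pass an open handle instead of the sample: open("crashes_v1.csv", newline="") and hand the result to read_trades.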

Schema

See docs/schema.md for full column-by-column documentation.

Key columns:

  • entry_price: the price-per-share when the bot entered (0–1)
  • pre_crash_high: the recent local-window high
  • drop_pct: (pre_crash_high − entry_price) / pre_crash_high × 100
  • exit_reason: RECOVERY, TIMEOUT_48H, TIMEOUT, or STOP
  • is_profitable: 1 if pnl_usd > 0 else 0
  • recovered_to_pct_of_high: exit_price / pre_crash_high × 100
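The derived columns above follow directly from the raw prices. A small worked example, using made-up inputs and a hypothetical helper name (derive_columns), shows the arithmetic; rounding to two decimals is an assumption for readability and may not match the CSV's precision.

```python
def derive_columns(pre_crash_high, entry_price, exit_price, pnl_usd):
    """Recompute the derived fields from the formulas in the schema list."""
    drop_pct = (pre_crash_high - entry_price) / pre_crash_high * 100
    recovered_to_pct_of_high = exit_price / pre_crash_high * 100
    return {
        "drop_pct": round(drop_pct, 2),
        "recovered_to_pct_of_high": round(recovered_to_pct_of_high, 2),
        "is_profitable": 1 if pnl_usd > 0 else 0,
    }


# Made-up trade: high of 0.50, entered at 0.40, exited at 0.45 with a small gain.
derived = derive_columns(pre_crash_high=0.50, entry_price=0.40,
                         exit_price=0.45, pnl_usd=1.25)
print(derived)
```

Running the same function over each row of the CSV and comparing against the stored columns is a cheap consistency check.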

Methodology

See docs/methodology.md for:

  • How the crash signal is defined
  • Entry/exit rules
  • Known biases (survivorship: only triggers that fired are recorded; a different threshold might surface different examples)
  • What's NOT in the data (slippage cost — see pnl-truthteller for the slippage layer)

Reproducibility

The script that generated this dataset is in scripts/extract.py. Anyone with the source positions.json from the bot can rerun it:

python scripts/extract.py \
    --positions /path/to/positions.json \
    --output data/crashes_v1.csv

Baseline notebook

notebooks/baseline_model.py trains a logistic regression and random forest on the dataset to predict is_profitable.

Result: ~79.9% cross-validated accuracy with simple features, essentially matching the bot's 80.2% WR. Because 80.2% of the rows are profitable, a model that always predicts is_profitable = 1 would score about the same, so these features add little beyond the base rate. Translation: most of the alpha is in the entry trigger itself (which already filters to high-WR setups), not in further feature engineering. If you want to beat this dataset, you almost certainly need features the bot doesn't currently log (orderbook depth, market category, time-to-resolution).
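To put the ~79.9% figure in context, it helps to compute the accuracy of the trivial majority-class predictor. This sketch rebuilds the label vector from the published counts (247 profitable out of 308) rather than loading the CSV; the helper name is hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Labels reconstructed from the summary table: 247 profitable, 61 not.
labels = [1] * 247 + [0] * (308 - 247)
baseline = majority_baseline_accuracy(labels)
print(f"{baseline:.3f}")  # 0.802
```

Any model worth keeping should clear this number on held-out data, not just match it.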

Top feature importances from the random forest:

Feature           Importance
drop_pct          0.254
shares            0.200
entry_price       0.174
pre_crash_high    0.171
entry_hour_utc    0.110
entry_dow         0.059

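A tentative pattern from the diurnal column (entry_hour_utc): win rate at hours 16, 21, and 22 UTC reaches ~100%, while hour 8 UTC dips to ~55%, which suggests off-peak hours may be punishing. Per-hour samples are small, so treat this as a hypothesis rather than a rule, and check the trade count for each hour before adjusting your live-firing schedule.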
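One way to see how much a small per-hour sample can be trusted is to attach a confidence interval to each hourly win rate. The sketch below uses a Wilson score interval, which is my choice here and not something the baseline notebook is known to do; the rows are made up and shaped like the list of dicts that load(as_pandas=False) is described as returning.

```python
import math
from collections import defaultdict


def wilson_interval(wins, n, z=1.96):
    """Approximate 95% Wilson score interval for a win rate of wins/n."""
    if n == 0:
        return (0.0, 1.0)
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


def win_rate_by_hour(rows):
    """Map entry_hour_utc -> (wins, trades, (low, high))."""
    tally = defaultdict(lambda: [0, 0])  # hour -> [wins, trades]
    for r in rows:
        tally[r["entry_hour_utc"]][0] += r["is_profitable"]
        tally[r["entry_hour_utc"]][1] += 1
    return {h: (w, n, wilson_interval(w, n)) for h, (w, n) in sorted(tally.items())}


# Made-up rows: hour 16 wins 5 of 5, hour 8 wins 6 of 11.
rows = (
    [{"entry_hour_utc": 16, "is_profitable": 1}] * 5
    + [{"entry_hour_utc": 8, "is_profitable": 1}] * 6
    + [{"entry_hour_utc": 8, "is_profitable": 0}] * 5
)
by_hour = win_rate_by_hour(rows)
for hour, (wins, n, (low, high)) in by_hour.items():
    print(hour, f"{wins}/{n}", f"{low:.2f}-{high:.2f}")
```

In this toy example a perfect 5-for-5 hour still has a lower bound near 57%, which overlaps heavily with the 6-of-11 hour. The same caution applies to the real per-hour figures.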

pip install "cross-signal-data[ml]"
python notebooks/baseline_model.py

Versioning

Version    Date          Trades    Notes
v1         2026-04-28    308       Initial public release

Future versions will add more trades, more features (orderbook depth at entry, market category, time-to-resolution) and possibly per-market metadata. Pin to a specific version if reproducibility matters: load(version="v1").

License

Code: MIT. Use the loader, the extraction script, and the baseline notebook however you want.

Data: MIT. Public on-chain prediction market data, transformed into a labeled dataset. Cite if you use it in research.

Citation

@dataset{cross_signal_data_2026,
    title  = {cross-signal-data: Polymarket crash-recovery labeled dataset},
    author = {LuciferForge},
    year   = {2026},
    url    = {https://github.com/LuciferForge/cross-signal-data}
}

About the author

Built by LuciferForge, running a public-audited Polymarket crash bot (308 closed trades, 80.2% WR, all data here).
