ml4t-data
High-performance market data management library with a unified multi-provider interface.
Unified market data acquisition and storage for quantitative research workflows.
Part of the ML4T Library Ecosystem
This library is one of five interconnected libraries supporting the machine learning for trading workflow described in Machine Learning for Trading.
Each library addresses a distinct stage: data infrastructure, feature engineering, signal evaluation, strategy backtesting, and live deployment.
What This Library Does
Quantitative research requires consistent, reproducible access to market data from multiple sources. ml4t-data provides:
- DataManager as the unified interface: fetch, store, update, and query across all providers
- 23 live provider adapters covering equities, crypto, futures, forex, and economic data
- Automated storage in Hive-partitioned Parquet format with metadata tracking
- Incremental updates, gap detection, and backfill via CLI
- Built-in data validation (OHLC invariants, deduplication, anomaly detection)
- Futures module for CME/ICE bulk downloads with continuous contract construction
- COT module for CFTC Commitment of Traders weekly reports
- Resilience: rate limiting and retry with exponential backoff
The goal is to support an ongoing research workflow rather than one-off downloads. Data is stored locally, tracked for freshness, and queryable with tools like DuckDB or Polars.
Installation
pip install ml4t-data
Quick Start
DataManager (Unified Interface)
from ml4t.data import DataManager
dm = DataManager()
# Fetch and store
dm.fetch("AAPL", "2020-01-01", "2024-12-31", provider="yahoo")
# Load from local storage
data = dm.load("AAPL", "2020-01-01", "2024-12-31")
# Batch load multiple symbols
prices = dm.batch_load(["AAPL", "MSFT", "GOOGL"], "2020-01-01", "2024-12-31")
# Incremental update
dm.update("AAPL")
# List what's stored
symbols = dm.list_symbols()
metadata = dm.get_metadata("AAPL")
Direct Provider Access
All providers implement the same interface:
from ml4t.data.providers import YahooFinanceProvider, CoinGeckoProvider, FREDProvider
# Equities
provider = YahooFinanceProvider()
data = provider.fetch_ohlcv("AAPL", "2020-01-01", "2024-12-31")
# Crypto
crypto = CoinGeckoProvider().fetch_ohlcv("bitcoin", "2024-01-01", "2024-12-31")
# Economic data
fred = FREDProvider().fetch_series("GDP", "2020-01-01", "2024-12-31")
Data Providers
No API Key Required
| Provider | Coverage |
|---|---|
| Yahoo Finance | US/global equities, ETFs, crypto, forex |
| CoinGecko | 10,000+ cryptocurrencies |
| FRED | 850,000 economic series |
| Fama-French | Academic factor data |
| AQR | Research factors (QMJ, BAB, HML Devil, VME, more) |
| Wiki Prices | Frozen US equities history (1962-2018) |
| Kalshi | Prediction market contracts |
| Polymarket | Prediction market history/order book snapshots |
| Binance Public | Bulk crypto data downloads |
| NASDAQ ITCH Sample | Tick-level sample data |
Authenticated or Metered APIs
| Provider | Coverage |
|---|---|
| EODHD | 60+ global exchanges |
| Tiingo | US equities with quality focus |
| Twelve Data | Multi-asset coverage |
| Databento | CME, CBOE, ICE futures/options |
| Polygon | US equities, options, forex, crypto |
| Finnhub | 70+ global exchanges |
| Binance | Crypto exchange data |
| OKX | Crypto perpetuals and funding rates |
| CryptoCompare | Crypto market data |
| OANDA | Forex broker data |
Specialized Modules
Futures
Bulk download and continuous contract construction for CME/ICE products:
from ml4t.data.futures import FuturesDownloader, ContinuousContractBuilder
# Bulk download via Databento (parent symbology)
downloader = FuturesDownloader(config)
downloader.download() # Downloads ES, NQ, CL, GC, etc.
# Build continuous contracts with configurable roll logic
builder = ContinuousContractBuilder()
continuous = builder.build(contracts_df, roll_method="volume")
Book-focused interface with profiling:
from ml4t.data.futures import FuturesDataManager
fm = FuturesDataManager.from_config("config.yaml")
fm.download_all()
data = fm.load_ohlcv("ES")
profile = fm.generate_profile("ES")
COT (Commitment of Traders)
CFTC weekly positioning data for futures markets:
from ml4t.data.cot import COTFetcher, create_cot_features, combine_cot_ohlcv_pit
fetcher = COTFetcher(config)
cot_data = fetcher.fetch_product("ES", start_year=2015, end_year=2024)
# Point-in-time combination with OHLCV (no look-ahead)
combined = combine_cot_ohlcv_pit(cot_data, ohlcv_data)
# Generate features from COT data
features = create_cot_features(cot_data)
Book Data Managers
Simplified interfaces for the ML4T book workflow:
from ml4t.data.etfs import ETFDataManager
from ml4t.data.crypto import CryptoDataManager
# 50 diversified ETFs via Yahoo Finance
etf_dm = ETFDataManager.from_config("config.yaml")
etf_dm.download_all()
aapl = etf_dm.load_ohlcv("AAPL")
# Crypto premium index via Binance Public
crypto_dm = CryptoDataManager.from_config("config.yaml")
crypto_dm.download_premium_index()
CLI for Automated Updates
# Fetch specific symbols
ml4t-data fetch -s AAPL -s MSFT -s GOOGL --provider yahoo --start 2020-01-01
# Incremental update
ml4t-data update --symbol AAPL
# Validate data quality
ml4t-data validate --symbol AAPL --anomalies
# Check storage status
ml4t-data status --detailed
# List available data
ml4t-data list-data
# Export to CSV/JSON/Excel
ml4t-data export --symbol AAPL --format-type csv --output aapl.csv
# Get symbol info
ml4t-data info --symbol AAPL
Configuration-driven batch updates:
storage:
  path: ~/data/market

datasets:
  sp500_daily:
    provider: yahoo
    symbols_file: symbols/sp500.txt
    frequency: daily
    start_date: 2015-01-01
  crypto:
    provider: coingecko
    symbols: [bitcoin, ethereum, solana]
    frequency: daily
    start_date: 2020-01-01
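How the CLI consumes this file is not shown here, but the structure is plain YAML: a dataset either lists symbols inline or points at a symbols file. A sketch of iterating the datasets with PyYAML (the dispatch to a provider is left as a print):

```python
import yaml

config_text = """
storage:
  path: ~/data/market
datasets:
  sp500_daily:
    provider: yahoo
    symbols_file: symbols/sp500.txt
    frequency: daily
    start_date: 2015-01-01
  crypto:
    provider: coingecko
    symbols: [bitcoin, ethereum, solana]
    frequency: daily
    start_date: 2020-01-01
"""

config = yaml.safe_load(config_text)
for name, ds in config["datasets"].items():
    # Inline symbol lists take precedence; otherwise defer to the file.
    symbols = ds.get("symbols") or f"file:{ds['symbols_file']}"
    print(name, ds["provider"], symbols)
```

One detail worth knowing: `yaml.safe_load` parses unquoted dates like `2015-01-01` into `datetime.date` objects, not strings.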
Storage Format
Data is stored in Hive-partitioned Parquet:
~/data/market/
├── yahoo/daily/symbol=AAPL/data.parquet
├── yahoo/daily/symbol=MSFT/data.parquet
└── coingecko/daily/symbol=bitcoin/data.parquet
Query with DuckDB or Polars:
import duckdb
result = duckdb.execute("""
SELECT * FROM read_parquet('~/data/market/yahoo/daily/**/*.parquet')
WHERE symbol IN ('AAPL', 'MSFT')
AND date >= '2024-01-01'
""").pl()
Data Validation
from ml4t.data.validation import OHLCVValidator, ValidationReport
validator = OHLCVValidator()
report = validator.validate(data)
# Checks: high >= low, high >= open/close, low <= open/close
# Detects: duplicates, gaps, anomalies
Anomaly detection:
from ml4t.data.anomaly import AnomalyManager, ReturnOutlierDetector, VolumeSpikeDetector
manager = AnomalyManager([
ReturnOutlierDetector(),
VolumeSpikeDetector(),
])
report = manager.detect(data)
Documentation
- Getting Started — quick start guide
- Configuration — YAML config reference
- Storage — Hive partitioning and backends
- Incremental Updates — update strategies and gap detection
- Data Quality — validation and anomaly detection
- CLI Reference — command-line interface
- Provider Selection Guide — choosing providers
- Creating a Provider — extending with new sources
Technical Characteristics
- Polars-based: Native Polars DataFrames throughout
- Consistent schema: All providers return the same column structure
- Async support: Async providers and batch operations for parallel downloads
- Metadata tracking: Last update timestamps, row counts, date ranges
- Resilience: Rate limiting, retry with exponential backoff, gap detection
- Multiple backends: File system, S3, and in-memory storage
- Type-safe: Full type annotations throughout
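The async batch pattern referenced above amounts to fanning out per-symbol fetches with asyncio.gather. The library's actual async provider API is not shown here, so this sketch uses a hypothetical fetch_symbol coroutine as a stand-in for a network call:

```python
import asyncio

async def fetch_symbol(symbol: str) -> tuple[str, int]:
    """Hypothetical per-symbol fetch; sleep(0) stands in for network I/O."""
    await asyncio.sleep(0)
    return symbol, 252  # pretend: one year of daily bars

async def batch_fetch(symbols: list[str]) -> dict[str, int]:
    # All fetches run concurrently; gather preserves input order.
    results = await asyncio.gather(*(fetch_symbol(s) for s in symbols))
    return dict(results)

rows = asyncio.run(batch_fetch(["AAPL", "MSFT", "GOOGL"]))
```

With real providers, concurrency is bounded by the rate limiter rather than left unbounded as here.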
Related Libraries
- ml4t-engineer: Feature engineering and technical indicators
- ml4t-diagnostic: Signal evaluation and statistical validation
- ml4t-backtest: Event-driven backtesting
- ml4t-live: Live trading with broker integration
Development
git clone https://github.com/ml4t/ml4t-data.git
cd ml4t-data
uv sync
uv run pytest tests/ -q
uv run ty check
License
MIT License - see LICENSE for details.