
High-performance market data management library with unified multi-provider interface


ml4t-data

Python 3.12+ · PyPI · License: MIT

Unified market data acquisition and storage for quantitative research workflows.

Part of the ML4T Library Ecosystem

This library is one of five interconnected libraries supporting the machine learning for trading workflow described in Machine Learning for Trading:

ML4T Library Ecosystem

Each library addresses a distinct stage: data infrastructure, feature engineering, signal evaluation, strategy backtesting, and live deployment.

What This Library Does

Quantitative research requires consistent, reproducible access to market data from multiple sources. ml4t-data provides:

  • DataManager as the unified interface: fetch, store, update, and query across all providers
  • 23 live provider adapters covering equities, crypto, futures, forex, and economic data
  • Automated storage in Hive-partitioned Parquet format with metadata tracking
  • Incremental updates, gap detection, and backfill via CLI
  • Built-in data validation (OHLC invariants, deduplication, anomaly detection)
  • Futures module for CME/ICE bulk downloads with continuous contract construction
  • COT module for CFTC Commitment of Traders weekly reports
  • Resilience: rate limiting and retry with exponential backoff

The goal is to support an ongoing research workflow rather than one-off downloads. Data is stored locally, tracked for freshness, and queryable with tools like DuckDB or Polars.
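Retry with exponential backoff, one of the resilience features listed above, follows a standard pattern. The sketch below is illustrative only and is not the library's internal implementation:

```python
import random
import time


def fetch_with_backoff(fetch, max_retries=5, base_delay=1.0):
    """Call fetch(), retrying on failure with exponentially growing delays.

    The delay doubles on each attempt, with random jitter added so that
    many clients retrying at once do not hammer the provider in lockstep.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries - 1:
                raise  # out of retries; surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```

The same shape applies whether the wrapped call is an HTTP request or a provider adapter's fetch method.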

ml4t-data Architecture

Installation

pip install ml4t-data

Quick Start

DataManager (Unified Interface)

from ml4t.data import DataManager

dm = DataManager()

# Fetch and store
dm.fetch("AAPL", "2020-01-01", "2024-12-31", provider="yahoo")

# Load from local storage
data = dm.load("AAPL", "2020-01-01", "2024-12-31")

# Batch load multiple symbols
prices = dm.batch_load(["AAPL", "MSFT", "GOOGL"], "2020-01-01", "2024-12-31")

# Incremental update
dm.update("AAPL")

# List what's stored
symbols = dm.list_symbols()
metadata = dm.get_metadata("AAPL")

Direct Provider Access

All providers implement the same interface:

from ml4t.data.providers import YahooFinanceProvider, CoinGeckoProvider, FREDProvider

# Equities
provider = YahooFinanceProvider()
data = provider.fetch_ohlcv("AAPL", "2020-01-01", "2024-12-31")

# Crypto
crypto = CoinGeckoProvider().fetch_ohlcv("bitcoin", "2024-01-01", "2024-12-31")

# Economic data
fred = FREDProvider().fetch_series("GDP", "2020-01-01", "2024-12-31")

Data Providers

No API Key Required

Provider            Coverage
------------------  -----------------------------------------------
Yahoo Finance       US/global equities, ETFs, crypto, forex
CoinGecko           10,000+ cryptocurrencies
FRED                850,000 economic series
Fama-French         Academic factor data
AQR                 Research factors (QMJ, BAB, HML Devil, VME, more)
Wiki Prices         Frozen US equities history (1962-2018)
Kalshi              Prediction market contracts
Polymarket          Prediction market history/order book snapshots
Binance Public      Bulk crypto data downloads
NASDAQ ITCH Sample  Tick-level sample data

Authenticated or Metered APIs

Provider       Coverage
-------------  -------------------------------------
EODHD          60+ global exchanges
Tiingo         US equities with quality focus
Twelve Data    Multi-asset coverage
Databento      CME, CBOE, ICE futures/options
Polygon        US equities, options, forex, crypto
Finnhub        70+ global exchanges
Binance        Crypto exchange data
OKX            Crypto perpetuals and funding rates
CryptoCompare  Crypto market data
OANDA          Forex broker data

Specialized Modules

Futures

Bulk download and continuous contract construction for CME/ICE products:

from ml4t.data.futures import FuturesDownloader, ContinuousContractBuilder

# Bulk download via Databento (parent symbology)
downloader = FuturesDownloader(config)
downloader.download()  # Downloads ES, NQ, CL, GC, etc.

# Build continuous contracts with configurable roll logic
builder = ContinuousContractBuilder()
continuous = builder.build(contracts_df, roll_method="volume")

Book-focused interface with profiling:

from ml4t.data.futures import FuturesDataManager

fm = FuturesDataManager.from_config("config.yaml")
fm.download_all()
data = fm.load_ohlcv("ES")
profile = fm.generate_profile("ES")

COT (Commitment of Traders)

CFTC weekly positioning data for futures markets:

from ml4t.data.cot import COTFetcher, create_cot_features, combine_cot_ohlcv_pit

fetcher = COTFetcher(config)
cot_data = fetcher.fetch_product("ES", start_year=2015, end_year=2024)

# Point-in-time combination with OHLCV (no look-ahead)
combined = combine_cot_ohlcv_pit(cot_data, ohlcv_data)

# Generate features from COT data
features = create_cot_features(cot_data)

Book Data Managers

Simplified interfaces for the ML4T book workflow:

from ml4t.data.etfs import ETFDataManager
from ml4t.data.crypto import CryptoDataManager

# 50 diversified ETFs via Yahoo Finance
etf_dm = ETFDataManager.from_config("config.yaml")
etf_dm.download_all()
spy = etf_dm.load_ohlcv("SPY")

# Crypto premium index via Binance Public
crypto_dm = CryptoDataManager.from_config("config.yaml")
crypto_dm.download_premium_index()

CLI for Automated Updates

# Fetch specific symbols
ml4t-data fetch -s AAPL -s MSFT -s GOOGL --provider yahoo --start 2020-01-01

# Incremental update
ml4t-data update --symbol AAPL

# Validate data quality
ml4t-data validate --symbol AAPL --anomalies

# Check storage status
ml4t-data status --detailed

# List available data
ml4t-data list-data

# Export to CSV/JSON/Excel
ml4t-data export --symbol AAPL --format-type csv --output aapl.csv

# Get symbol info
ml4t-data info --symbol AAPL

Configuration-driven batch updates:

storage:
  path: ~/data/market

datasets:
  sp500_daily:
    provider: yahoo
    symbols_file: symbols/sp500.txt
    frequency: daily
    start_date: 2015-01-01

  crypto:
    provider: coingecko
    symbols: [bitcoin, ethereum, solana]
    frequency: daily
    start_date: 2020-01-01

Storage Format

Data is stored in Hive-partitioned Parquet:

~/data/market/
├── yahoo/daily/symbol=AAPL/data.parquet
├── yahoo/daily/symbol=MSFT/data.parquet
└── coingecko/daily/symbol=bitcoin/data.parquet

Query with DuckDB or Polars:

import duckdb

result = duckdb.execute("""
    SELECT * FROM read_parquet('~/data/market/yahoo/daily/**/*.parquet')
    WHERE symbol IN ('AAPL', 'MSFT')
    AND date >= '2024-01-01'
""").pl()

Data Validation

from ml4t.data.validation import OHLCVValidator, ValidationReport

validator = OHLCVValidator()
report = validator.validate(data)
# Checks: high >= low, high >= open/close, low <= open/close
# Detects: duplicates, gaps, anomalies

Anomaly detection:

from ml4t.data.anomaly import AnomalyManager, ReturnOutlierDetector, VolumeSpikeDetector

manager = AnomalyManager([
    ReturnOutlierDetector(),
    VolumeSpikeDetector(),
])
report = manager.detect(data)

Documentation

Technical Characteristics

  • Polars-based: Native Polars DataFrames throughout
  • Consistent schema: All providers return the same column structure
  • Async support: Async providers and batch operations for parallel downloads
  • Metadata tracking: Last update timestamps, row counts, date ranges
  • Resilience: Rate limiting, retry with exponential backoff, gap detection
  • Multiple backends: File system, S3, and in-memory storage
  • Type-safe: Full type annotations throughout

Related Libraries

  • ml4t-engineer: Feature engineering and technical indicators
  • ml4t-diagnostic: Signal evaluation and statistical validation
  • ml4t-backtest: Event-driven backtesting
  • ml4t-live: Live trading with broker integration

Development

git clone https://github.com/ml4t/ml4t-data.git
cd ml4t-data
uv sync
uv run pytest tests/ -q
uv run ty check

License

MIT License - see LICENSE for details.
