binance-data

A Python library for downloading and processing historical data from Binance Vision.

Features

  • Download historical data from Binance Vision S3 bucket
  • Support for multiple asset types (spot, futures)
  • Flexible prefix-based approach for any data type
  • Output formats: Parquet (default) or CSV
  • Pandera schema validation for data integrity
  • Timestamp auto-detection (milliseconds vs nanoseconds)
  • Concurrent downloads for better performance
  • Optional retention of raw ZIP files
  • Preserve original directory structure

Installation

pip install binance-data

Or install from source:

uv pip install -e .

Quick Start

from binance_data_loader import BinanceDataDownloader

# Download BTCUSDT 1h futures data as Parquet
downloader = BinanceDataDownloader(
    prefix="data/futures/um/daily/klines/BTCUSDT/1h/",
    destination_dir="./data",
    output_format="parquet",
    keep_zip=False,
)
downloader.download()

Usage Examples

Download Futures Data

from binance_data_loader import BinanceDataDownloader

# Download USDT-Margined futures data
downloader = BinanceDataDownloader(
    prefix="data/futures/um/daily/klines/BTCUSDT/1h/",
    destination_dir="./data",
    output_format="parquet",
)
downloader.download()

# Download COIN-Margined futures data
downloader = BinanceDataDownloader(
    prefix="data/futures/cm/daily/klines/BTCUSD_PERP/1h/",
    destination_dir="./data",
    output_format="parquet",
)
downloader.download()

Download Spot Data

# Download spot data
downloader = BinanceDataDownloader(
    prefix="data/spot/daily/klines/ETHUSDT/5m/",
    destination_dir="./data",
    output_format="csv",  # Save as CSV instead of Parquet
    keep_zip=True,  # Keep raw ZIP files
)
downloader.download()

Process Existing ZIP Files

If you already have ZIP files downloaded and only want to convert them to Parquet/CSV:

from binance_data_loader import BinanceDataDownloader

# Process existing ZIP files, skip downloading
downloader = BinanceDataDownloader(
    prefix="data/spot/daily/klines/ETHUSDT/5m/",
    destination_dir="./data",
    output_format="parquet",
    skip_download=True,  # Skip downloading, only process existing ZIP files
)
downloader.download()

Filter by Date Range

Download only files within a specific date range:

from binance_data_loader import BinanceDataDownloader
from datetime import datetime, UTC, timedelta

# Download data for the last 6 months
six_months_ago = datetime.now(tz=UTC) - timedelta(days=180)

downloader = BinanceDataDownloader(
    prefix="data/futures/um/daily/klines/BTCUSDT/1h/",
    destination_dir="./data",
    output_format="parquet",
    start_date=six_months_ago,
)
downloader.download()
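Daily kline archives embed their date in the file name (e.g. BTCUSDT-1h-2024-01-01.zip, as seen later in this README), which is what makes date-range filtering possible. As an illustration only (the library applies its own filtering internally; this helper is not part of its API), a filename-based filter might look like:

```python
from datetime import datetime, timezone

def file_in_range(filename, start=None, end=None):
    """Illustrative helper: extract the date embedded in a daily kline
    file name and test it against an optional [start, end] window."""
    # The date is the last three dash-separated fields before ".zip",
    # e.g. "BTCUSDT-1h-2024-01-01.zip" -> "2024-01-01".
    stem = filename.rsplit("/", 1)[-1].removesuffix(".zip")
    date_str = "-".join(stem.split("-")[-3:])
    file_date = datetime.strptime(date_str, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    if start is not None and file_date < start:
        return False
    if end is not None and file_date > end:
        return False
    return True
```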

Available Intervals

Binance supports the following intervals:

  • Seconds: 1s
  • Minutes: 1m, 3m, 5m, 15m, 30m
  • Hours: 1h, 2h, 4h, 6h, 8h, 12h
  • Days: 1d, 3d
  • Weeks: 1w
  • Months: 1M
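All of these except 1M (one calendar month) map to a fixed number of seconds, which is often handy when sizing date windows. A small hypothetical helper (not part of the library) for the fixed-length intervals:

```python
# Illustrative helper (not part of the library's API): convert a
# fixed-length interval string such as "5m" or "2h" into seconds.
# "1M" is deliberately excluded, since a month is variable-length.
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

def interval_to_seconds(interval: str) -> int:
    value, unit = int(interval[:-1]), interval[-1]
    if unit not in _UNIT_SECONDS:
        raise ValueError(f"unsupported or variable-length interval: {interval}")
    return value * _UNIT_SECONDS[unit]
```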

Prefix Structure

The library uses a prefix-based approach where you specify the exact path to the data you want:

data/{asset_type}/{time_period}/{data_type}/{symbol}/{interval}/

For futures, the asset type segment also carries the margin type: futures/um (USDT-Margined) or futures/cm (COIN-Margined).

Examples:

  • data/futures/um/daily/klines/BTCUSDT/1h/ - BTCUSDT futures 1h klines
  • data/spot/daily/klines/ETHUSDT/5m/ - ETHUSDT spot 5m klines
  • data/futures/um/monthly/klines/BTCUSDT/1m/ - BTCUSDT futures monthly 1m klines
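Since a prefix is just these components joined together, you can assemble one programmatically. A tiny hypothetical helper (not part of the library's API); note that for futures the asset-type argument includes the margin type, e.g. "futures/um":

```python
# Hypothetical convenience helper (not part of the library): assemble a
# Binance Vision prefix from the documented path components.
def build_prefix(asset_type: str, time_period: str, data_type: str,
                 symbol: str, interval: str) -> str:
    return f"data/{asset_type}/{time_period}/{data_type}/{symbol}/{interval}/"
```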

Configuration Options

downloader = BinanceDataDownloader(
    prefix="data/futures/um/daily/klines/BTCUSDT/1h/",  # Required: data prefix
    destination_dir="./data",                           # Optional: output directory (default: "./data")
    output_format="parquet",                            # Optional: "parquet" or "csv" (default: "parquet")
    keep_zip=True,                                      # Optional: keep raw ZIP files (default: True)
    max_workers=10,                                     # Optional: concurrent download workers (default: 10)
    max_processors=4,                                   # Optional: parallel processing workers (default: 4)
    start_date=datetime(2024, 1, 1, tzinfo=UTC),        # Optional: start datetime filter (default: None)
    end_date=datetime(2024, 12, 31, tzinfo=UTC),        # Optional: end datetime filter (default: None)
    skip_download=False,                                # Optional: only process existing ZIP files (default: False)
    base_url="https://s3-ap-northeast-1.amazonaws.com/data.binance.vision",  # Optional: custom base URL
)

API Reference

BinanceDataDownloader

Main downloader class for fetching Binance Vision data.

Constructor

BinanceDataDownloader(
    prefix: str,
    destination_dir: str = "./data",
    output_format: str = "parquet",
    keep_zip: bool = True,
    max_workers: int = 10,
    max_processors: int = 4,
    start_date: datetime = None,
    end_date: datetime = None,
    skip_download: bool = False,
    base_url: str = "https://s3-ap-northeast-1.amazonaws.com/data.binance.vision",
)

Parameters:

  • prefix (str, required): Binance S3 bucket prefix for the data you want to download
  • destination_dir (str, optional): Directory where processed files will be saved. Default: "./data"
  • output_format (str, optional): Output format, either "parquet" or "csv". Default: "parquet"
  • keep_zip (bool, optional): Whether to keep raw ZIP files after processing. Default: True
  • max_workers (int, optional): Number of concurrent download workers. Default: 10
  • max_processors (int, optional): Number of parallel processing workers. Default: 4
  • start_date (datetime, optional): Start datetime for filtering files. Only downloads/converts files from this date onwards. Default: None
  • end_date (datetime, optional): End datetime for filtering files. Only downloads/converts files up to this date. Default: None
  • skip_download (bool, optional): If True, skip downloading and only process existing ZIP files. Default: False
  • base_url (str, optional): Base URL for Binance data S3 bucket

Methods

download()
download() -> Tuple[List[dict], List[dict]]

Execute the download and processing pipeline.

Returns:

  • Tuple[List[dict], List[dict]]:
    • First element: list of per-file download results, each carrying a success/failure status
    • Second element: processing results, which the example below unpacks into (successful, failed) lists

Example:

download_results, process_results = downloader.download()

# Download results
print(f"Downloaded {len([r for r in download_results if r['status'] == 'success'])} files")

# Process results: (successful, failed)
successful, failed = process_results
print(f"Processed {len(successful)} files successfully, {len(failed)} failed")

DataProcessor

Process downloaded ZIP files into Parquet or CSV format.

from binance_data_loader.processor import DataProcessor

processor = DataProcessor(output_format="parquet")
result = processor.process_zip_file(
    zip_path="data/futures/um/daily/klines/BTCUSDT/1h/BTCUSDT-1h-2024-01-01.zip",
    output_dir="./output",
    base_data_dir="./data",
)

# Process multiple files in parallel
successful, failed = processor.process_zip_files(
    zip_files=["path1.zip", "path2.zip"],
    output_dir="./output",
    base_data_dir="./data",
    max_workers=4,
)

BinanceDataMetadata

Fetch metadata about available Binance data files.

from binance_data_loader.metadata import BinanceDataMetadata

metadata = BinanceDataMetadata()
df = metadata.fetch_file_list(
    prefix="data/futures/um/daily/klines/BTCUSDT/1h/",
    stop_date="2024-01-31",  # Optional: stop at this date
)

print(f"Found {len(df)} files")
print(df.head())

Data Loading and Resampling

Loading Data

After downloading and processing data, you can easily load it for analysis:

from binance_data_loader import BinanceDataLoader
from datetime import datetime, timedelta, UTC

loader = BinanceDataLoader(data_dir="./data", data_type="spot")

# Get available date range
start, end = loader.get_date_range("ETHUSDT", "1s")
print(f"Available data from {start} to {end}")

# Load last week of data
end_time = datetime.now(tz=UTC)
start_time = end_time - timedelta(days=7)

df = loader.load(
    symbol="ETHUSDT",
    interval="1s",
    start_time=start_time,
    end_time=end_time,
)

Resampling Data

The loader supports on-the-fly resampling to higher timeframes:

# Load 1s data and resample to 5m
df_5m = loader.load(
    symbol="ETHUSDT",
    interval="1s",
    resample_to="5m",
    start_time=start_time,
    end_time=end_time,
)

# Resample to 1h
df_1h = loader.load(
    symbol="ETHUSDT",
    interval="1s",
    resample_to="1h",
    start_time=start_time,
    end_time=end_time,
)

Supported resampling intervals:

  • Seconds: 1s, 5s, 15s, 30s
  • Minutes: 1m, 3m, 5m, 15m, 30m
  • Hours: 1h, 2h, 4h, 6h, 12h
  • Days: 1d
  • Weeks: 1w
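For intuition, OHLCV resampling follows the usual aggregation rules: open = first, high = max, low = min, close = last, volume = sum. The loader itself operates on DataFrames; the following is only a minimal pure-Python sketch of that aggregation, with bars represented as dicts keyed by an open timestamp in seconds:

```python
# Illustrative only: group sorted per-bar dicts into fixed-size buckets
# and aggregate them with the standard OHLCV rules.
def resample_ohlcv(bars, bucket_seconds):
    buckets = {}
    for bar in bars:  # bars must be sorted by "open_time"
        key = bar["open_time"] - bar["open_time"] % bucket_seconds
        b = buckets.get(key)
        if b is None:
            # First bar in the bucket fixes open and open_time.
            buckets[key] = dict(bar, open_time=key)
        else:
            b["high"] = max(b["high"], bar["high"])
            b["low"] = min(b["low"], bar["low"])
            b["close"] = bar["close"]   # last bar wins
            b["volume"] += bar["volume"]
    return [buckets[k] for k in sorted(buckets)]
```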

Convenience Functions

Quick loading without class instantiation:

from datetime import datetime
from binance_data_loader import load_kline_data, get_date_range

# Get date range
start, end = get_date_range(
    data_dir="./data",
    symbol="BTCUSDT",
    data_type="spot",
    interval="1h",
)

# Load with resampling
df = load_kline_data(
    data_dir="./data",
    symbol="BTCUSDT",
    data_type="spot",
    interval="1h",
    resample_to="1d",
    start_time=datetime(2024, 1, 1),
    end_time=datetime(2024, 12, 31),
)

Working with Both Spot and Futures

from binance_data_loader import BinanceDataLoader

# Load spot data
spot_loader = BinanceDataLoader(data_dir="./data", data_type="spot")
df_spot = spot_loader.load("BTCUSDT", "1h")

# Load futures data
futures_loader = BinanceDataLoader(data_dir="./data", data_type="futures")
df_futures = futures_loader.load("BTCUSDT", "1h")

Data Schema

Kline Data

When downloading kline data, the output will contain the following columns:

Column                  Type      Description
open_time               Datetime  Open time (UTC)
open                    Float     Open price
high                    Float     High price
low                     Float     Low price
close                   Float     Close price
volume                  Float     Volume in base asset
close_time              Datetime  Close time (UTC)
quote_volume            Float     Volume in quote asset
count                   Int       Number of trades
taker_buy_volume        Float     Taker buy base asset volume
taker_buy_quote_volume  Float     Taker buy quote asset volume
ignore                  Int       Unused placeholder field from Binance

The library automatically:

  • Validates data structure using Pandera schemas
  • Detects and converts timestamp units (milliseconds/nanoseconds)
  • Ensures proper type casting
  • Validates UTC timezone
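The millisecond/nanosecond distinction can be made from order of magnitude alone, since the two units differ by a factor of 10^6. An illustrative heuristic (not the library's actual implementation):

```python
# Illustrative magnitude heuristic (not the library's real code):
# millisecond epochs for recent dates are around 1.7e12, while
# nanosecond epochs are around 1.7e18, so a single threshold
# between the two magnitudes separates them cleanly.
def detect_timestamp_unit(ts: int) -> str:
    return "ns" if ts >= 10**15 else "ms"
```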

Performance Tips

  1. Adjust Workers: Increase max_workers for faster downloads, but be mindful of your network bandwidth.
  2. Process in Parallel: Increase max_processors for faster conversion, but consider CPU resources.
  3. Use Parquet: Parquet is more efficient than CSV for large datasets and subsequent analysis.
  4. Keep ZIP: Set keep_zip=True if you need to re-process data with different settings.
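The max_workers idea follows the standard concurrent.futures pattern. A generic sketch, with fetch_one as a stand-in stub for a real HTTP download (neither function is part of the library's API):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_one(name):
    # Placeholder for an actual download; returns a result dict.
    return {"file": name, "status": "success"}

def fetch_all(names, max_workers=10):
    # Submit every download to a thread pool and collect results
    # as they complete, in whatever order they finish.
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(fetch_one, n) for n in names]
        for fut in as_completed(futures):
            results.append(fut.result())
    return results
```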

Examples

The library includes several example scripts in the examples/ folder to help you get started quickly:

Download Examples

  • examples/download_futures_data.py - Download futures (USDT-Margined) kline data

    • Download last year of BTCUSDT 1h data
    • Download 2024 ETHUSDT 5m data
    • Demonstrates date range filtering and keep_zip options
  • examples/download_spot_data.py - Download spot kline data

    • Download first week of ETHUSDT 1s data (Jan 1-7, 2024)
    • Download last month of BTCUSDT 1m data
    • Download in CSV format instead of Parquet

Loading and Resampling Examples

  • examples/load_and_resample.py - Load and resample downloaded data
    • Load spot data without resampling
    • Resample 1s data to 5m, 15m, and 1h intervals
    • Load futures data
    • Complete workflow showing load, resample, and compare different timeframes
    • Load data for specific date ranges

Running the Examples

Each example can be run directly:

# Download futures data
python examples/download_futures_data.py

# Download spot data
python examples/download_spot_data.py

# Load and resample data
python examples/load_and_resample.py

You can also modify the examples to suit your needs: change symbols, intervals, date ranges, or output formats.

Roadmap

  • Kline data download and processing
  • Parquet and CSV output formats
  • Concurrent downloads and parallel processing
  • Data loader utilities for easy reading of downloaded data
  • Resampling utilities
  • Support for other data types (aggTrades, trades, bookDepth, etc.)
  • CLI interface

Contributing

Contributions are welcome! Please feel free to submit issues and pull requests.

License

MIT License

Acknowledgments

This library is inspired by and borrows ideas from:

Support

For issues and questions, please open an issue on GitHub.
