
Gapless Crypto ClickHouse

PyPI version GitHub release Python Versions Downloads License: MIT UV Managed Tests AI Agent Ready

ClickHouse-based cryptocurrency data collection with zero-gap guarantee. 22x faster via Binance public repository with persistent database storage, USDT-margined futures support, and production-ready ReplacingMergeTree schema.

When to Use This Package

Choose gapless-crypto-clickhouse (this package) when you need:

  • Persistent database storage for multi-symbol, multi-timeframe datasets
  • Advanced SQL queries for time-series analysis, aggregations, and joins
  • USDT-margined futures support (perpetual contracts)
  • Production data pipelines with deterministic versioning and deduplication
  • Python 3.12+ modern runtime environment

Choose gapless-crypto-data (file-based) when you need:

  • Simple file-based workflows with CSV output
  • Single-symbol analysis without database overhead
  • Python 3.9-3.13 broader compatibility
  • Lightweight dependency footprint (no database required)

Both packages share the same 22x performance advantage via Binance public repository and zero-gap guarantee.

Features

  • 22x faster data collection via Binance public data repository
  • 2x faster queries with Apache Arrow optimization (v6.0.0+, 41K+ rows/s at scale)
  • Auto-ingestion: Unified query_ohlcv() API downloads missing data automatically
  • ClickHouse database with ReplacingMergeTree for deterministic deduplication
  • USDT-margined futures support (perpetual contracts via instrument_type column)
  • Zero gaps guarantee through intelligent monthly-to-daily fallback
  • Complete 16-timeframe support: 13 standard (1s, 1m, 3m, 5m, 15m, 30m, 1h, 2h, 4h, 6h, 8h, 12h, 1d) + 3 exotic (3d, 1w, 1mo)
  • 11-column microstructure format (spot) and 12-column format (futures with funding rate)
  • Advanced SQL queries for time-series analysis, multi-symbol joins, aggregations
  • Persistent storage with compression (DoubleDelta timestamps, Gorilla OHLCV)
  • AI agent ready: llms.txt + probe.py for capability discovery
  • UV-based Python tooling for modern dependency management
  • Production-ready with comprehensive test coverage

Quick Start

Installation (UV)

# Install via UV
uv add gapless-crypto-clickhouse

# Or install globally
uv tool install gapless-crypto-clickhouse

Installation (pip)

pip install gapless-crypto-clickhouse

Database Setup (ClickHouse)

For persistent storage and advanced query capabilities, set up ClickHouse:

# Start ClickHouse using Docker Compose
docker-compose up -d

# Verify ClickHouse is running
docker-compose ps

# View logs
docker-compose logs -f clickhouse

See Database Integration for complete setup guide and usage examples.

Python API (Recommended)

Function-based API

import gapless_crypto_clickhouse as gcd

# Fetch recent data with date range (CCXT-compatible timeframe parameter)
df = gcd.download("BTCUSDT", timeframe="1h", start="2024-01-01", end="2024-06-30")

# Or with limit
df = gcd.fetch_data("ETHUSDT", timeframe="4h", limit=1000)

# Backward compatibility (legacy interval parameter)
df = gcd.fetch_data("ETHUSDT", interval="4h", limit=1000)  # DeprecationWarning

# Get available symbols and timeframes
symbols = gcd.get_supported_symbols()
timeframes = gcd.get_supported_timeframes()

# Fill gaps in existing data
results = gcd.fill_gaps("./data")

# Multi-symbol batch download (concurrent execution - 10-20x faster)
results = gcd.download_multiple(
    symbols=["BTCUSDT", "ETHUSDT", "BNBUSDT", "XRPUSDT", "SOLUSDT"],
    timeframe="1h",
    start_date="2024-01-01",
    end_date="2024-06-30",
    max_workers=5  # Configure concurrency
)
# Returns: dict[str, pd.DataFrame]
# Example: btc_df = results["BTCUSDT"]

Class-based API

from gapless_crypto_clickhouse import BinancePublicDataCollector, UniversalGapFiller

# Custom collection with full control
collector = BinancePublicDataCollector(
    symbol="SOLUSDT",
    start_date="2023-01-01",
    end_date="2023-12-31"
)

result = collector.collect_timeframe_data("1h")
df = result["dataframe"]

# Manual gap detection on a previously collected CSV file
gap_filler = UniversalGapFiller()
gaps = gap_filler.detect_all_gaps("SOLUSDT_1h.csv", "1h")

Note: Unlike the parent package gapless-crypto-data, this package has never included a CLI. It provides a Python API only; see the examples above for usage patterns.

Data Structure

All functions return pandas DataFrames with complete microstructure data:

import gapless_crypto_clickhouse as gcd

# Fetch data
df = gcd.download("BTCUSDT", timeframe="1h", start="2024-01-01", end="2024-06-30")

# DataFrame columns (11-column microstructure format)
print(df.columns.tolist())
# ['date', 'open', 'high', 'low', 'close', 'volume',
#  'close_time', 'quote_asset_volume', 'number_of_trades',
#  'taker_buy_base_asset_volume', 'taker_buy_quote_asset_volume']

# Professional microstructure analysis
buy_pressure = df['taker_buy_base_asset_volume'].sum() / df['volume'].sum()
avg_trade_size = df['volume'].sum() / df['number_of_trades'].sum()
market_impact = df['quote_asset_volume'].std() / df['quote_asset_volume'].mean()

print(f"Taker buy pressure: {buy_pressure:.1%}")
print(f"Average trade size: {avg_trade_size:.4f} BTC")
print(f"Market impact volatility: {market_impact:.3f}")
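The same column arithmetic extends to time aggregation. A minimal sketch with plain pandas, using synthetic hourly values standing in for downloaded data:

```python
import pandas as pd

# Synthetic hourly bars standing in for gcd.download() output (illustrative values).
df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=48, freq="h"),
    "volume": [10.0] * 48,
    "taker_buy_base_asset_volume": [6.0] * 24 + [4.0] * 24,
})

# Resample hourly bars to daily taker buy pressure (buy volume / total volume).
daily = df.set_index("date").resample("1D").sum()
daily["buy_pressure"] = daily["taker_buy_base_asset_volume"] / daily["volume"]
print(daily["buy_pressure"].round(2).tolist())  # → [0.6, 0.4]
```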

Data Sources

The package supports two data collection methods:

  • Binance Public Repository: Pre-generated monthly ZIP files for historical data
  • Binance API: Real-time data for gap filling and recent data collection
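For orientation, the monthly ZIP files follow the public data.binance.vision naming convention. The URL pattern below describes that repository layout; the helper is illustrative and is not a function exported by this package:

```python
# Hypothetical helper illustrating the Binance public data repository layout
# (https://data.binance.vision); not part of this package's API.
def monthly_kline_url(symbol: str, timeframe: str, year: int, month: int) -> str:
    base = "https://data.binance.vision/data/spot/monthly/klines"
    return f"{base}/{symbol}/{timeframe}/{symbol}-{timeframe}-{year}-{month:02d}.zip"

print(monthly_kline_url("BTCUSDT", "1h", 2024, 1))
# → https://data.binance.vision/data/spot/monthly/klines/BTCUSDT/1h/BTCUSDT-1h-2024-01.zip
```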

🏗️ Architecture

Core Components

  • BinancePublicDataCollector: Data collection with full 11-column microstructure format
  • UniversalGapFiller: Intelligent gap detection and filling with authentic API-first validation
  • AtomicCSVOperations: Corruption-proof file operations with atomic writes
  • SafeCSVMerger: Safe merging of data files with integrity validation

Data Flow

Binance Public Data Repository → BinancePublicDataCollector → 11-Column Microstructure Format
                ↓
Gap Detection → UniversalGapFiller → Authentic API-First Validation
                ↓
AtomicCSVOperations → Final Gapless Dataset with Order Flow Metrics

🗄️ Database Integration

ClickHouse is a required component for this package. The database-first architecture enables persistent storage, advanced query capabilities, and multi-symbol analysis.

When to use:

  • File-based approach: Simple workflows, single symbols, CSV output compatibility
  • Database approach: Multi-symbol analysis, time-series queries, aggregations, production pipelines (recommended)

Quick Start with Docker Compose

The repository includes a production-ready docker-compose.yml for local development:

# Start ClickHouse (runs in background)
docker-compose up -d

# Verify container is healthy
docker-compose ps

# View initialization logs
docker-compose logs clickhouse

# Access ClickHouse client (optional)
docker exec -it gapless-clickhouse clickhouse-client

What happens on first start:

  1. Downloads ClickHouse 24.1-alpine image (~200 MB)
  2. Creates ohlcv table with ReplacingMergeTree engine (from schema.sql)
  3. Configures compression (DoubleDelta for timestamps, Gorilla for OHLCV)
  4. Sets up health checks and automatic restart

Schema auto-initialization: The schema.sql file is automatically executed via Docker's initdb.d mechanism.
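For reference, a ReplacingMergeTree table with the compression codecs described above could look roughly like the following. This is an illustrative sketch only; the shipped schema.sql is authoritative:

```sql
-- Illustrative sketch; the package's actual schema.sql may differ.
CREATE TABLE IF NOT EXISTS ohlcv (
    symbol          LowCardinality(String),
    timeframe       LowCardinality(String),
    instrument_type LowCardinality(String),
    timestamp       DateTime64(3) CODEC(DoubleDelta, ZSTD),
    open            Float64 CODEC(Gorilla, ZSTD),
    high            Float64 CODEC(Gorilla, ZSTD),
    low             Float64 CODEC(Gorilla, ZSTD),
    close           Float64 CODEC(Gorilla, ZSTD),
    volume          Float64 CODEC(Gorilla, ZSTD),
    version         UInt64
) ENGINE = ReplacingMergeTree(version)
ORDER BY (symbol, timeframe, instrument_type, timestamp);
```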

Quick Start: Unified Query API (v6.0.0+)

The recommended way to query data in v6.0.0+ is using query_ohlcv() with auto-ingestion and Apache Arrow optimization:

from gapless_crypto_clickhouse import query_ohlcv

# Query with auto-ingestion (downloads data if missing)
df = query_ohlcv(
    "BTCUSDT",
    "1h",
    "2024-01-01",
    "2024-01-31"
)
print(f"Retrieved {len(df)} rows")  # 744 rows (31 days * 24 hours)

# Multi-symbol query
df = query_ohlcv(
    ["BTCUSDT", "ETHUSDT", "SOLUSDT"],
    "1h",
    "2024-01-01",
    "2024-01-31"
)

# Futures data
df = query_ohlcv(
    "BTCUSDT",
    "1h",
    "2024-01-01",
    "2024-01-31",
    instrument_type="futures-um"
)

# Query without auto-ingestion (faster, raises if data missing)
df = query_ohlcv(
    "BTCUSDT",
    "1h",
    "2024-01-01",
    "2024-01-31",
    auto_ingest=False
)

Performance (Apache Arrow optimization):

  • 2x faster at scale: 41,272 rows/s vs 20,534 rows/s for large datasets (>8000 rows)
  • 43-57% less memory: Arrow buffers reduce memory usage for medium/large queries
  • Auto-ingestion: Downloads missing data automatically on first query
  • Best for: Analytical queries, backtesting, multi-symbol analysis (typical use case)

When to use lower-level APIs: Advanced use cases requiring custom SQL, bulk loading, or connection management.

Basic Usage Examples

Connection and Health Check

from gapless_crypto_clickhouse.clickhouse import ClickHouseConnection

# Connect to ClickHouse (reads from .env or uses defaults)
with ClickHouseConnection() as conn:
    # Verify connection
    health = conn.health_check()
    print(f"ClickHouse connected: {health}")

    # Execute simple query
    result = conn.execute("SELECT count() FROM ohlcv")
    print(f"Total rows in database: {result[0][0]:,}")

Bulk Data Ingestion

from gapless_crypto_clickhouse.clickhouse import ClickHouseConnection
from gapless_crypto_clickhouse.collectors.clickhouse_bulk_loader import ClickHouseBulkLoader

# Ingest historical data from Binance public repository
with ClickHouseConnection() as conn:
    loader = ClickHouseBulkLoader(conn, instrument_type="spot")

    # Ingest single month (e.g., January 2024)
    rows_inserted = loader.ingest_month("BTCUSDT", "1h", year=2024, month=1)
    print(f"Inserted {rows_inserted:,} rows for BTCUSDT 1h (Jan 2024)")

    # Ingest date range (e.g., Q1 2024)
    total_rows = loader.ingest_date_range(
        symbol="ETHUSDT",
        timeframe="4h",
        start_date="2024-01-01",
        end_date="2024-03-31"
    )
    print(f"Inserted {total_rows:,} rows for ETHUSDT 4h (Q1 2024)")

Zero-gap guarantee: ClickHouse uses deterministic versioning (SHA256 hash) to handle duplicate ingestion safely. Re-running ingestion commands won't create duplicates.
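The deterministic-versioning idea can be illustrated in a few lines of Python. The field names and 64-bit truncation below are assumptions made for the sketch, not the package's actual hashing scheme:

```python
import hashlib

# Hypothetical sketch: hashing a row's identity and payload yields the same
# version on every re-ingestion, so ReplacingMergeTree keeps a single copy.
def row_version(symbol: str, timeframe: str, timestamp: str, close: float) -> int:
    payload = f"{symbol}|{timeframe}|{timestamp}|{close}".encode()
    # Truncate the SHA256 digest to 64 bits to fit a UInt64 version column.
    return int.from_bytes(hashlib.sha256(payload).digest()[:8], "big")

v1 = row_version("BTCUSDT", "1h", "2024-01-01T00:00:00", 42475.23)
v2 = row_version("BTCUSDT", "1h", "2024-01-01T00:00:00", 42475.23)
assert v1 == v2  # identical input → identical version → safe re-ingestion
```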

Querying Data

from gapless_crypto_clickhouse.clickhouse import ClickHouseConnection
from gapless_crypto_clickhouse.clickhouse_query import OHLCVQuery

with ClickHouseConnection() as conn:
    query = OHLCVQuery(conn)

    # Get latest data (last 10 bars)
    df = query.get_latest("BTCUSDT", "1h", limit=10)
    print(f"Latest 10 bars:\n{df[['timestamp', 'close']]}")

    # Get specific date range
    df = query.get_range(
        symbol="BTCUSDT",
        timeframe="1h",
        start_date="2024-01-01",
        end_date="2024-01-31",
        instrument_type="spot"
    )
    print(f"January 2024: {len(df):,} bars")

    # Multi-symbol comparison
    df = query.get_multi_symbol(
        symbols=["BTCUSDT", "ETHUSDT", "SOLUSDT"],
        timeframe="1d",
        start_date="2024-01-01",
        end_date="2024-12-31"
    )
    print(f"Multi-symbol dataset: {df.shape}")

FINAL keyword: All queries automatically use FINAL to ensure deduplicated results. This adds ~10-30% overhead but guarantees data correctness.

Futures Support (ADR-0004)

# Ingest futures data (12-column format with funding rate)
with ClickHouseConnection() as conn:
    loader = ClickHouseBulkLoader(conn, instrument_type="futures")
    rows = loader.ingest_month("BTCUSDT", "1h", 2024, 1)
    print(f"Futures data: {rows:,} rows")

    # Query futures data (isolated from spot)
    query = OHLCVQuery(conn)
    df_spot = query.get_latest("BTCUSDT", "1h", instrument_type="spot", limit=10)
    df_futures = query.get_latest("BTCUSDT", "1h", instrument_type="futures", limit=10)

    print(f"Spot data: {len(df_spot)} bars")
    print(f"Futures data: {len(df_futures)} bars")

Spot/Futures isolation: The instrument_type column ensures spot and futures data coexist without conflicts.

Configuration

Environment Variables (.env file or system environment):

CLICKHOUSE_HOST=localhost        # ClickHouse server hostname
CLICKHOUSE_PORT=9000             # Native protocol port (default: 9000)
CLICKHOUSE_HTTP_PORT=8123        # HTTP interface port (default: 8123)
CLICKHOUSE_USER=default          # Username (default: 'default')
CLICKHOUSE_PASSWORD=             # Password (empty for local dev)
CLICKHOUSE_DB=default            # Database name (default: 'default')

Docker Compose defaults: The included docker-compose.yml uses these defaults, no .env file required for local development.
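Resolving these variables in Python is straightforward. A minimal sketch mirroring the defaults above (the package's own ClickHouseConnection may resolve them differently):

```python
import os

# Read connection settings from the environment, falling back to the
# documented local-development defaults.
config = {
    "host": os.getenv("CLICKHOUSE_HOST", "localhost"),
    "port": int(os.getenv("CLICKHOUSE_PORT", "9000")),
    "user": os.getenv("CLICKHOUSE_USER", "default"),
    "password": os.getenv("CLICKHOUSE_PASSWORD", ""),
    "database": os.getenv("CLICKHOUSE_DB", "default"),
}
print(config["host"], config["port"])
```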

Local Visualization Tools

Comprehensive toolchain for ClickHouse data exploration and monitoring (100% open source):

Web Interfaces:

  • CH-UI Dashboard: localhost:5521
  • ClickHouse Play: localhost:8123/play

CLI Tools:

  • clickhouse-client (official CLI with 70+ formats):
    docker exec -it gapless-clickhouse clickhouse-client
    
  • clickhouse-local (file analysis without server):
    clickhouse-local --query "SELECT * FROM file('data.csv', CSV)"
    

Performance Monitoring:

  • chdig (TUI with flamegraph visualization):
    brew install chdig
    chdig --host localhost --port 9000
    

Validation: Run automated validation suite:

bash scripts/validate-clickhouse-tools.sh

Comprehensive guides: See docs/development/ for detailed usage guides, examples, and troubleshooting.

Migration Guide

Migrating from gapless-crypto-data (file-based) to gapless-crypto-clickhouse (database-first):

See docs/CLICKHOUSE_MIGRATION.md for:

  • Architecture changes (file-based → ClickHouse)
  • Code migration examples (drop-in replacement)
  • Deployment guide (Docker Compose, production)
  • Performance characteristics (ingestion, query, deduplication)
  • Troubleshooting common issues

Key Changes:

  • Package name: gapless-crypto-data → gapless-crypto-clickhouse
  • Import paths: gapless_crypto_data → gapless_crypto_clickhouse
  • ClickHouse: a ClickHouse database is now required (Docker Compose provided)
  • Python version: 3.12+ (was 3.9-3.13)
  • API signatures: Unchanged (backwards compatible)

Rollback strategy: Continue using gapless-crypto-data for file-based workflows. Both packages maintained independently.

Production Deployment

Recommended setup:

  1. Persistent storage: Mount volumes for data durability
  2. Authentication: Set CLICKHOUSE_PASSWORD for non-localhost deployments
  3. TLS: Enable TLS for remote connections
  4. Monitoring: ClickHouse exports Prometheus metrics on port 9363
  5. Backups: Use ClickHouse Backup tool or volume snapshots

Scaling:

  • Single-node: Validated at 53.7M rows (ADR-0003), headroom to ~200M rows
  • Distributed: ClickHouse supports sharding and replication for larger datasets

See ClickHouse documentation for production deployment best practices.

🔧 Advanced Usage

Batch Processing

Simple API (Recommended)

import gapless_crypto_clickhouse as gcd

# Process multiple symbols with simple loops
symbols = ["BTCUSDT", "ETHUSDT", "SOLUSDT", "ADAUSDT"]
timeframes = ["1h", "4h"]

for symbol in symbols:
    for timeframe in timeframes:
        df = gcd.fetch_data(symbol, timeframe, start="2023-01-01", end="2023-12-31")
        print(f"{symbol} {timeframe}: {len(df)} bars collected")

Advanced API (Complex Workflows)

from gapless_crypto_clickhouse import BinancePublicDataCollector

# Initialize with custom settings
collector = BinancePublicDataCollector(
    start_date="2023-01-01",
    end_date="2023-12-31",
    output_dir="./crypto_data"
)

# Process multiple symbols with detailed control
symbols = ["BTCUSDT", "ETHUSDT", "SOLUSDT"]
for symbol in symbols:
    collector.symbol = symbol
    results = collector.collect_multiple_timeframes(["1m", "5m", "1h", "4h"])
    for timeframe, result in results.items():
        print(f"{symbol} {timeframe}: {result['stats']}")

Gap Analysis

Simple API (Recommended)

import gapless_crypto_clickhouse as gcd

# Quick gap filling for entire directory
results = gcd.fill_gaps("./data")
print(f"Processed {results['files_processed']} files")
print(f"Filled {results['gaps_filled']}/{results['gaps_detected']} gaps")
print(f"Success rate: {results['success_rate']:.1f}%")

# Gap filling for specific symbols only
results = gcd.fill_gaps("./data", symbols=["BTCUSDT", "ETHUSDT"])

Advanced API (Detailed Control)

from gapless_crypto_clickhouse import UniversalGapFiller

gap_filler = UniversalGapFiller()

# Manual gap detection and analysis
gaps = gap_filler.detect_all_gaps("BTCUSDT_1h.csv", "1h")
print(f"Found {len(gaps)} gaps")

for gap in gaps:
    duration_hours = gap['duration'].total_seconds() / 3600
    print(f"Gap: {gap['start_time']} → {gap['end_time']} ({duration_hours:.1f}h)")

# Fill specific gaps
result = gap_filler.process_file("BTCUSDT_1h.csv", "1h")

Database Query Examples

For users leveraging ClickHouse database integration:

Bulk Ingestion Pipeline

from gapless_crypto_clickhouse.clickhouse import ClickHouseConnection
from gapless_crypto_clickhouse.collectors.clickhouse_bulk_loader import ClickHouseBulkLoader

# Multi-symbol bulk ingestion for backtesting datasets
symbols = ["BTCUSDT", "ETHUSDT", "SOLUSDT", "ADAUSDT", "DOGEUSDT"]
timeframes = ["1h", "4h", "1d"]

with ClickHouseConnection() as conn:
    loader = ClickHouseBulkLoader(conn, instrument_type="spot")

    for symbol in symbols:
        for timeframe in timeframes:
            # Ingest Q1 2024 data
            rows = loader.ingest_date_range(
                symbol=symbol,
                timeframe=timeframe,
                start_date="2024-01-01",
                end_date="2024-03-31"
            )
            print(f"{symbol} {timeframe}: {rows:,} rows ingested")

# Zero-gap guarantee: Re-running this script won't create duplicates

Multi-Symbol Analysis

from gapless_crypto_clickhouse.clickhouse import ClickHouseConnection
from gapless_crypto_clickhouse.clickhouse_query import OHLCVQuery

with ClickHouseConnection() as conn:
    query = OHLCVQuery(conn)

    # Get synchronized data for all symbols (same time range)
    df = query.get_multi_symbol(
        symbols=["BTCUSDT", "ETHUSDT", "SOLUSDT"],
        timeframe="1h",
        start_date="2024-01-01",
        end_date="2024-01-31"
    )

    # Analyze cross-asset correlations
    pivot = df.pivot_table(index="timestamp", columns="symbol", values="close")
    correlation = pivot.corr()
    print(f"Correlation matrix:\n{correlation}")

    # Relative strength analysis
    for symbol in ["BTCUSDT", "ETHUSDT", "SOLUSDT"]:
        symbol_data = df[df["symbol"] == symbol]
        returns = symbol_data["close"].pct_change().sum()
        print(f"{symbol} total return: {returns:.2%}")

Advanced Time-Series Queries

from gapless_crypto_clickhouse.clickhouse import ClickHouseConnection

with ClickHouseConnection() as conn:
    # Custom SQL for advanced analytics (ClickHouse functions)
    query = """
    SELECT
        symbol,
        timeframe,
        toStartOfDay(timestamp) AS day,
        avg(close) AS avg_price,
        stddevPop(close) AS volatility,
        sum(volume) AS total_volume,
        count() AS bar_count
    FROM ohlcv FINAL
    WHERE symbol IN ('BTCUSDT', 'ETHUSDT')
      AND timeframe = '1h'
      AND timestamp >= '2024-01-01'
      AND timestamp < '2024-02-01'
    GROUP BY symbol, timeframe, day
    ORDER BY day ASC, symbol ASC
    """

    result = conn.execute(query)

    # Process results
    for row in result:
        symbol, timeframe, day, avg_price, volatility, volume, bars = row
        print(f"{day} {symbol}: avg=${avg_price:.2f}, vol={volatility:.2f}, volume={volume:,.0f}")

Hybrid Approach (File + Database)

Combine file-based collection with database querying:

import gapless_crypto_clickhouse as gcd
from gapless_crypto_clickhouse.clickhouse import ClickHouseConnection
from gapless_crypto_clickhouse.clickhouse_query import OHLCVQuery
from gapless_crypto_clickhouse.collectors.clickhouse_bulk_loader import ClickHouseBulkLoader

# Step 1: Collect to CSV files (22x faster, portable format)
df = gcd.download("BTCUSDT", timeframe="1h", start="2024-01-01", end="2024-03-31")
print(f"Downloaded {len(df):,} bars to CSV")

# Step 2: Ingest CSV to ClickHouse for analysis
with ClickHouseConnection() as conn:
    loader = ClickHouseBulkLoader(conn)
    loader.ingest_from_dataframe(df, symbol="BTCUSDT", timeframe="1h")

    # Step 3: Run advanced queries
    query = OHLCVQuery(conn)
    gaps = query.detect_gaps("BTCUSDT", "1h", "2024-01-01", "2024-03-31")
    print(f"Gap detection: {len(gaps)} gaps found")

When to use hybrid approach:

  • Initial data collection: Use file-based (faster, no database required)
  • Post-processing: Load into ClickHouse for aggregations, joins, time-series analytics
  • Archival: Keep CSV files for portability, use database for active analysis

AI Agent Integration

This package includes probe hooks (gapless_crypto_clickhouse.__probe__) that enable AI coding agents to discover functionality programmatically.

For AI Coding Agent Users

To have your AI coding agent analyze this package, use this prompt:

Analyze gapless-crypto-clickhouse using: import gapless_crypto_clickhouse; probe = gapless_crypto_clickhouse.__probe__

Execute: probe.discover_api(), probe.get_capabilities(), probe.get_task_graph()

Provide insights about cryptocurrency data collection capabilities and usage patterns.

🛠️ Development

Prerequisites

  • UV Package Manager - Install UV
  • Python 3.12+ - UV will manage Python versions automatically
  • Git - For repository cloning and version control
  • Docker & Docker Compose (Optional) - For ClickHouse database development

Development Installation Workflow

IMPORTANT: This project uses mandatory pre-commit hooks to prevent broken code from being committed. All commits are automatically validated for formatting, linting, and basic quality checks.

Step 1: Clone Repository

git clone https://github.com/terrylica/gapless-crypto-clickhouse.git
cd gapless-crypto-clickhouse

Step 2: Development Environment Setup

# Create isolated virtual environment
uv venv

# Activate virtual environment
source .venv/bin/activate  # macOS/Linux
# .venv\Scripts\activate   # Windows

# Install all dependencies (production + development)
uv sync --dev

Step 3: Verify Installation

# Run test suite
uv run pytest

Step 3a: Database Setup (Optional - ClickHouse)

If you want to develop with ClickHouse database features:

# Start ClickHouse container
docker-compose up -d

# Verify ClickHouse is running and healthy
docker-compose ps
docker-compose logs clickhouse | grep "Ready for connections"

# Test ClickHouse connection
docker exec gapless-clickhouse clickhouse-client --query "SELECT 1"

# View ClickHouse schema
docker exec gapless-clickhouse clickhouse-client --query "SHOW CREATE TABLE ohlcv"

What gets initialized:

  • ClickHouse 24.1-alpine container on ports 9000 (native) and 8123 (HTTP)
  • ohlcv table with ReplacingMergeTree engine (from schema.sql)
  • Persistent volume for data (clickhouse-data)
  • Health checks and automatic restart

Test database ingestion:

# Create a test script: test_clickhouse.py
from gapless_crypto_clickhouse.clickhouse import ClickHouseConnection
from gapless_crypto_clickhouse.collectors.clickhouse_bulk_loader import ClickHouseBulkLoader

with ClickHouseConnection() as conn:
    # Health check
    print(f"ClickHouse connected: {conn.health_check()}")

    # Test ingestion (small dataset)
    loader = ClickHouseBulkLoader(conn, instrument_type="spot")
    rows = loader.ingest_month("BTCUSDT", "1d", year=2024, month=1)
    print(f"Test ingestion: {rows} rows")

# Run test
# uv run python test_clickhouse.py

Teardown:

# Stop ClickHouse (keeps data)
docker-compose down

# Stop and delete all data (fresh start)
docker-compose down -v

Step 4: Set Up Pre-Commit Hooks (Mandatory)

# Install pre-commit hooks (prevents broken code from being committed)
uv run pre-commit install

# Test pre-commit hooks
uv run pre-commit run --all-files

Step 5: Development Tools

# Code formatting
uv run ruff format .

# Linting and auto-fixes
uv run ruff check --fix .

# Type checking
uv run mypy src/

# Run specific tests
uv run pytest tests/test_binance_collector.py -v

# Manual pre-commit validation
uv run pre-commit run --all-files

Development Commands Reference

| Task | Command |
|------|---------|
| Install dependencies | uv sync --dev |
| Setup pre-commit hooks | uv run pre-commit install |
| Add new dependency | uv add package-name |
| Add dev dependency | uv add --dev package-name |
| Run Python API | uv run python -c "import gapless_crypto_clickhouse as gcd; print(gcd.get_info())" |
| Run tests | uv run pytest |
| Format code | uv run ruff format . |
| Lint code | uv run ruff check --fix . |
| Type check | uv run mypy src/ |
| Validate pre-commit | uv run pre-commit run --all-files |
| Build package | uv build |

E2E Validation Framework

Autonomous end-to-end validation of ClickHouse web interfaces with Playwright 1.56+ and screenshot evidence.

Validate Web Interfaces:

# Full validation (static + unit + integration + e2e)
uv run scripts/run_validation.py

# E2E tests only
uv run scripts/run_validation.py --e2e-only

# CI mode (headless, no interactive prompts)
uv run scripts/run_validation.py --ci

First-Time Setup:

# Install Playwright browsers (one-time)
uv run playwright install chromium --with-deps

# Verify installation
uv run playwright --version

Test Targets:

  • CH-UI Dashboard: localhost:5521
  • ClickHouse Play: localhost:8123/play

Features:

  • Zero manual intervention (PEP 723 self-contained)
  • Screenshot capture for visual regression detection
  • Comprehensive coverage (happy path, errors, edge cases, timeouts)
  • CI/CD optimized with browser caching (30-60s speedup)

Project Structure for Development

gapless-crypto-clickhouse/
├── src/gapless_crypto_clickhouse/        # Main package
│   ├── __init__.py                 # Package exports
│   ├── collectors/                 # Data collection modules
│   └── gap_filling/                # Gap detection/filling
├── tests/                          # Test suite
├── docs/                           # Documentation
├── examples/                       # Usage examples
├── pyproject.toml                  # Project configuration
└── uv.lock                        # Dependency lock file

Building and Publishing

# Build package
uv build

# Publish to PyPI (requires API token)
uv publish

📁 Project Structure

gapless-crypto-clickhouse/
├── src/
│   └── gapless_crypto_clickhouse/
│       ├── __init__.py              # Package exports
│       ├── collectors/
│       │   ├── __init__.py
│       │   └── binance_public_data_collector.py
│       ├── gap_filling/
│       │   ├── __init__.py
│       │   ├── universal_gap_filler.py
│       │   └── safe_file_operations.py
│       └── utils/
│           └── __init__.py
├── tests/                           # Test suite
├── docs/                           # Documentation
├── pyproject.toml                  # Project configuration
├── README.md                       # This file
└── LICENSE                         # MIT License

🔍 Supported Timeframes

All 16 Binance timeframes supported for complete market coverage (13 standard + 3 exotic):

| Timeframe | Code | Description | Use Case |
|-----------|------|-------------|----------|
| 1 second | 1s | Ultra-high frequency | HFT, microstructure analysis |
| 1 minute | 1m | High resolution | Scalping, order flow |
| 3 minutes | 3m | Short-term analysis | Quick trend detection |
| 5 minutes | 5m | Common trading timeframe | Day trading signals |
| 15 minutes | 15m | Medium-term signals | Swing trading entry |
| 30 minutes | 30m | Longer-term patterns | Position management |
| 1 hour | 1h | Popular for backtesting | Strategy development |
| 2 hours | 2h | Extended analysis | Multi-timeframe confluence |
| 4 hours | 4h | Daily cycle patterns | Trend following |
| 6 hours | 6h | Quarter-day analysis | Position sizing |
| 8 hours | 8h | Third-day cycles | Risk management |
| 12 hours | 12h | Half-day patterns | Overnight positions |
| 1 day | 1d | Daily analysis | Long-term trends |
| 3 days | 3d | Multi-day patterns | Weekly trend detection |
| 1 week | 1w | Weekly analysis | Swing trading, market cycles |
| 1 month | 1mo | Monthly patterns | Long-term strategy, macro |
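For gap arithmetic it is handy to map these codes to fixed second counts. The helper below is illustrative, not part of the package API; note that 1mo is calendar-dependent (28-31 days) and has no fixed length:

```python
# Illustrative mapping of the 15 fixed-length timeframe codes to seconds.
TIMEFRAME_SECONDS = {
    "1s": 1, "1m": 60, "3m": 180, "5m": 300, "15m": 900, "30m": 1800,
    "1h": 3600, "2h": 7200, "4h": 14400, "6h": 21600, "8h": 28800,
    "12h": 43200, "1d": 86400, "3d": 259200, "1w": 604800,
    # "1mo" is intentionally absent: months vary in length.
}

def expected_bars(timeframe: str, span_seconds: int) -> int:
    """Number of complete bars a gapless dataset should contain."""
    return span_seconds // TIMEFRAME_SECONDS[timeframe]

print(expected_bars("1h", 31 * 86400))  # → 744 bars in a 31-day month
```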

⚠️ Requirements

  • Python 3.12+
  • ClickHouse server (the included Docker Compose setup works for local development)
  • pandas >= 2.0.0
  • requests >= 2.25.0
  • Stable internet connection for data downloads

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Install development dependencies (uv sync --dev)
  4. Make your changes
  5. Run tests (uv run pytest)
  6. Format code (uv run ruff format .)
  7. Commit changes (git commit -m 'Add amazing feature')
  8. Push to branch (git push origin feature/amazing-feature)
  9. Open a Pull Request

📚 API Reference

BinancePublicDataCollector

Cryptocurrency spot data collection from Binance's public data repository using pre-generated monthly ZIP files.

Key Methods

__init__(symbol, start_date, end_date, output_dir)

Initialize the collector with trading pair and date range.

collector = BinancePublicDataCollector(
    symbol="BTCUSDT",           # USDT spot pair
    start_date="2023-01-01",    # Start date (YYYY-MM-DD)
    end_date="2023-12-31",      # End date (YYYY-MM-DD)
    output_dir="./crypto_data"  # Output directory (optional)
)

collect_timeframe_data(trading_timeframe) -> Dict[str, Any]

Collect complete historical data for a single timeframe with full 11-column microstructure format.

result = collector.collect_timeframe_data("1h")
df = result["dataframe"]              # pandas DataFrame with OHLCV + microstructure
filepath = result["filepath"]         # Path to saved CSV file
stats = result["stats"]               # Collection statistics

# Access microstructure data
total_trades = df["number_of_trades"].sum()
taker_buy_ratio = df["taker_buy_base_asset_volume"].sum() / df["volume"].sum()

collect_multiple_timeframes(timeframes) -> Dict[str, Dict[str, Any]]

Collect data for multiple timeframes with comprehensive progress tracking.

results = collector.collect_multiple_timeframes(["1h", "4h"])
for timeframe, result in results.items():
    df = result["dataframe"]
    print(f"{timeframe}: {len(df):,} bars")

UniversalGapFiller

Gap detection and filling for various timeframes with 11-column microstructure format using Binance API data.

Key Methods

detect_all_gaps(csv_file, timeframe) -> List[Dict]

Automatically detect timestamp gaps in a CSV file for the given timeframe.

gap_filler = UniversalGapFiller()
gaps = gap_filler.detect_all_gaps("BTCUSDT_1h_data.csv", "1h")
print(f"Found {len(gaps)} gaps to fill")

fill_gap(csv_file, gap_info) -> bool

Fill a specific gap with authentic Binance API data.

# Fill first detected gap
success = gap_filler.fill_gap("BTCUSDT_1h_data.csv", gaps[0])
print(f"Gap filled successfully: {success}")

process_file(directory) -> Dict[str, Dict]

Batch process all CSV files in a directory for gap detection and filling.

results = gap_filler.process_file("./crypto_data/")
for filename, result in results.items():
    print(f"{filename}: {result['gaps_filled']} gaps filled")

AtomicCSVOperations

Safe atomic operations for CSV files with header preservation and corruption prevention. Uses temporary files and atomic rename operations to ensure data integrity.
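The temp-file-plus-rename technique this relies on can be sketched generically. This is the standard POSIX atomic-write pattern, not the class's exact implementation:

```python
import os
import tempfile

# Write to a temp file in the target directory, then atomically rename it
# over the destination: readers see either the old file or the new one,
# never a partially written file.
def write_atomic(path: str, text: str) -> None:
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())  # ensure bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)
        raise

write_atomic("out.csv", "date,open\n2024-01-01,42150.50\n")
```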

Key Methods

create_backup() -> Path

Create timestamped backup of original file before modifications.

from pathlib import Path
atomic_ops = AtomicCSVOperations(Path("data.csv"))
backup_path = atomic_ops.create_backup()

write_dataframe_atomic(df) -> bool

Atomically write DataFrame to CSV with integrity validation.

success = atomic_ops.write_dataframe_atomic(df)
if not success:
    atomic_ops.rollback_from_backup()

SafeCSVMerger

Safe CSV data merging with gap filling capabilities and data integrity validation. Handles temporal data insertion while maintaining chronological order.
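The core merge idea (insert gap rows, deduplicate on timestamp, restore chronological order) can be sketched with plain pandas; the real class additionally performs atomic writes and integrity validation:

```python
import pandas as pd

# Existing bars with a two-hour hole, plus the rows that fill it.
existing = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01 10:00", "2024-01-01 13:00"]),
    "close": [100.0, 103.0],
})
gap_data = pd.DataFrame({
    "date": pd.to_datetime(["2024-01-01 11:00", "2024-01-01 12:00"]),
    "close": [101.0, 102.0],
})

# Concatenate, drop duplicate timestamps, and restore chronological order.
merged = (pd.concat([existing, gap_data])
            .drop_duplicates(subset="date", keep="last")
            .sort_values("date")
            .reset_index(drop=True))
print(merged["close"].tolist())  # → [100.0, 101.0, 102.0, 103.0]
```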

Key Methods

merge_gap_data_safe(gap_data, gap_start, gap_end) -> bool

Safely merge gap data into existing CSV using atomic operations.

from datetime import datetime
merger = SafeCSVMerger(Path("eth_data.csv"))
success = merger.merge_gap_data_safe(
    gap_data,                    # DataFrame with gap data
    datetime(2024, 1, 1, 12),   # Gap start time
    datetime(2024, 1, 1, 15)    # Gap end time
)
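Maintaining chronological order during a merge amounts to inserting the gap rows at the correct position and refusing duplicate timestamps. A simplified pure-Python sketch of that behavior (assumed semantics; not the library's code):

```python
import bisect
from datetime import datetime

def merge_chronological(existing: list[tuple], gap_rows: list[tuple]) -> list[tuple]:
    """Insert (timestamp, row) pairs into a timestamp-sorted list, rejecting duplicates."""
    timestamps = [ts for ts, _ in existing]
    merged = list(existing)
    for ts, row in sorted(gap_rows):
        i = bisect.bisect_left(timestamps, ts)
        if i < len(timestamps) and timestamps[i] == ts:
            continue  # skip duplicate timestamps instead of double-counting bars
        timestamps.insert(i, ts)
        merged.insert(i, (ts, row))
    return merged

data = [(datetime(2024, 1, 1, 0), "bar0"), (datetime(2024, 1, 1, 3), "bar3")]
fill = [(datetime(2024, 1, 1, 1), "bar1"), (datetime(2024, 1, 1, 2), "bar2")]
print([ts.hour for ts, _ in merge_chronological(data, fill)])  # hours in order
```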

Output Formats

DataFrame Structure (Python API)

Returns pandas DataFrame with 11-column microstructure format:

| Column | Type | Description | Example |
| --- | --- | --- | --- |
| date | datetime64[ns] | Open timestamp | 2024-01-01 12:00:00 |
| open | float64 | Opening price | 42150.50 |
| high | float64 | Highest price | 42200.00 |
| low | float64 | Lowest price | 42100.25 |
| close | float64 | Closing price | 42175.75 |
| volume | float64 | Base asset volume | 15.250000 |
| close_time | datetime64[ns] | Close timestamp | 2024-01-01 12:59:59 |
| quote_asset_volume | float64 | Quote asset volume | 643238.125 |
| number_of_trades | int64 | Trade count | 1547 |
| taker_buy_base_asset_volume | float64 | Taker buy base volume | 7.825000 |
| taker_buy_quote_asset_volume | float64 | Taker buy quote volume | 329891.750 |
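The taker-buy columns enable simple order-flow metrics directly from a single bar. For example, the taker buy ratio (the share of base volume initiated by aggressive buyers) is the quotient of two columns; the values below are illustrative:

```python
# One bar from the 11-column format (values illustrative)
bar = {"volume": 15.25, "taker_buy_base_asset_volume": 7.825}

# Fraction of base volume traded by aggressive (taker) buyers;
# a value above 0.5 suggests net buying pressure during the bar
taker_buy_ratio = bar["taker_buy_base_asset_volume"] / bar["volume"]
print(f"taker buy ratio: {taker_buy_ratio:.3f}")
```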

CSV File Structure

CSV files include header comments with metadata followed by data:

# Binance Spot Market Data v2.5.0
# Generated: 2025-09-18T23:09:25.391126+00:00Z
# Source: Binance Public Data Repository
# Market: SPOT | Symbol: BTCUSDT | Timeframe: 1h
# Coverage: 48 bars
# Period: 2024-01-01 00:00:00 to 2024-01-02 23:00:00
# Collection: direct_download in 0.0s
# Data Hash: 5fba9d2e5d3db849...
# Compliance: Zero-Magic-Numbers, Temporal-Integrity, Official-Binance-Source
#
date,open,high,low,close,volume,close_time,quote_asset_volume,number_of_trades,taker_buy_base_asset_volume,taker_buy_quote_asset_volume
2024-01-01 00:00:00,42283.58,42554.57,42261.02,42475.23,1271.68108,2024-01-01 00:59:59,53957248.973789,47134,682.57581,28957416.819645

Metadata JSON Structure

Each CSV file ships with a companion .metadata.json file containing comprehensive metadata:

{
  "version": "v2.5.0",
  "generator": "BinancePublicDataCollector",
  "data_source": "Binance Public Data Repository",
  "symbol": "BTCUSDT",
  "timeframe": "1h",
  "enhanced_microstructure_format": {
    "total_columns": 11,
    "analysis_capabilities": [
      "order_flow_analysis",
      "liquidity_metrics",
      "market_microstructure",
      "trade_weighted_prices",
      "institutional_data_patterns"
    ]
  },
  "gap_analysis": {
    "total_gaps_detected": 0,
    "data_completeness_score": 1.0,
    "gap_filling_method": "authentic_binance_api"
  },
  "data_integrity": {
    "chronological_order": true,
    "corruption_detected": false
  }
}
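A consumer can gate ingestion on the gap-analysis and integrity fields, for instance refusing any dataset whose completeness score is below 1.0. A minimal stdlib sketch (field names taken from the structure above; the check itself is illustrative):

```python
import json

metadata_text = """{
  "symbol": "BTCUSDT",
  "gap_analysis": {"total_gaps_detected": 0, "data_completeness_score": 1.0},
  "data_integrity": {"chronological_order": true, "corruption_detected": false}
}"""

def is_ingestable(meta: dict) -> bool:
    """Accept only complete, uncorrupted, chronologically ordered datasets."""
    return (meta["gap_analysis"]["data_completeness_score"] >= 1.0
            and meta["data_integrity"]["chronological_order"]
            and not meta["data_integrity"]["corruption_detected"])

meta = json.loads(metadata_text)
print(is_ingestable(meta))  # True
```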

Streaming Output (Memory-Efficient)

For large datasets, Polars streaming provides constant memory usage:

from gapless_crypto_clickhouse.streaming import StreamingDataProcessor

processor = StreamingDataProcessor(chunk_size=10_000, memory_limit_mb=100)
for chunk in processor.stream_csv_chunks("large_dataset.csv"):
    # Process chunk with constant memory usage
    print(f"Chunk shape: {chunk.shape}")
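The constant-memory property can be approximated with a plain generator that yields fixed-size batches of rows and never holds the full file in memory. A stdlib sketch of the pattern (not the Polars-backed implementation used by the package):

```python
import csv
import io
from itertools import islice

def stream_chunks(stream, chunk_size=10_000):
    """Yield (header, rows) batches; memory stays bounded by one chunk."""
    reader = csv.reader(line for line in stream if not line.startswith("#"))
    header = next(reader)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield header, chunk

# 24 hourly rows split into batches of 10
sample = "date,close\n" + "\n".join(f"2024-01-01 {h:02d}:00:00,42000" for h in range(24))
for header, chunk in stream_chunks(io.StringIO(sample), chunk_size=10):
    print(len(chunk))  # 10, 10, 4
```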

File Naming Convention

Output files follow a consistent naming pattern:

binance_spot_{SYMBOL}-{TIMEFRAME}_{START_DATE}-{END_DATE}_v{VERSION}.csv
binance_spot_{SYMBOL}-{TIMEFRAME}_{START_DATE}-{END_DATE}_v{VERSION}.metadata.json

Examples:

  • binance_spot_BTCUSDT-1h_20240101-20240102_v2.5.0.csv
  • binance_spot_ETHUSDT-4h_20240101-20240201_v2.5.0.csv
  • binance_spot_SOLUSDT-1d_20240101-20241231_v2.5.0.csv
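The naming convention is regular enough to parse with a single regex, which is handy when cataloguing a directory of downloads. The pattern below is written from the convention above and is illustrative, not an API shipped by the package:

```python
import re

# Matches binance_spot_{SYMBOL}-{TIMEFRAME}_{START}-{END}_v{VERSION}.csv
PATTERN = re.compile(
    r"binance_spot_(?P<symbol>[A-Z0-9]+)-(?P<timeframe>[0-9a-z]+)_"
    r"(?P<start>\d{8})-(?P<end>\d{8})_v(?P<version>[\d.]+)\.csv$"
)

m = PATTERN.match("binance_spot_BTCUSDT-1h_20240101-20240102_v2.5.0.csv")
print(m.groupdict())
```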

Error Handling

All classes implement robust error handling with meaningful exceptions:

try:
    collector = BinancePublicDataCollector(symbol="INVALIDPAIR")
    result = collector.collect_timeframe_data("1h")
except ValueError as e:
    print(f"Invalid symbol format: {e}")
except ConnectionError as e:
    print(f"Network error: {e}")
except FileNotFoundError as e:
    print(f"Output directory error: {e}")
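Since `ConnectionError` signals a transient network failure, callers often wrap collection in retry-with-backoff. A generic sketch of such a wrapper (the package itself may or may not retry internally; this helper is illustrative):

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, retry_on=(ConnectionError,)):
    """Call fn, retrying transient errors with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fn()
        except retry_on:
            if attempt == attempts - 1:
                raise  # exhausted retries; surface the original error
            time.sleep(base_delay * 2 ** attempt)

# Simulated flaky call that succeeds on the third attempt
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```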

Type Hints

All public APIs include comprehensive type hints for better IDE support:

from typing import Dict, List, Optional, Any
from pathlib import Path
import pandas as pd

def collect_timeframe_data(self, trading_timeframe: str) -> Dict[str, Any]:
    # Returns dict with 'dataframe', 'filepath', and 'stats' keys
    pass

def collect_multiple_timeframes(
    self,
    timeframes: Optional[List[str]] = None
) -> Dict[str, Dict[str, Any]]:
    # Returns nested dict by timeframe
    pass

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🏢 About Eon Labs

Gapless Crypto ClickHouse is developed by Eon Labs, specializing in quantitative trading infrastructure and machine learning for financial markets.


  • UV-based - Python dependency management
  • 📊 11-Column Format - Microstructure data with order flow metrics
  • 🔒 Gap Detection - Data completeness validation and filling

Project details


Download files

Download the file for your platform.

Source Distribution

gapless_crypto_clickhouse-13.0.1.tar.gz (4.0 MB)

Uploaded Source

Built Distribution


gapless_crypto_clickhouse-13.0.1-py3-none-any.whl (141.1 kB)

Uploaded Python 3

File details

Details for the file gapless_crypto_clickhouse-13.0.1.tar.gz.

File metadata

  • Download URL: gapless_crypto_clickhouse-13.0.1.tar.gz
  • Upload date:
  • Size: 4.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.10

File hashes

Hashes for gapless_crypto_clickhouse-13.0.1.tar.gz:

  • SHA256: 0c50a4da594c2976f3ef0536f4461e1c695d4ac27cfd92226d3f624c60befd44
  • MD5: 3fc32b896b143c6b1423cf5a2a2f0486
  • BLAKE2b-256: 9b3da58bacee246a45ea634fd9a00287d817dff8951d94424903f00ddf1f682b


File details

Details for the file gapless_crypto_clickhouse-13.0.1-py3-none-any.whl.

File metadata

  • Download URL: gapless_crypto_clickhouse-13.0.1-py3-none-any.whl
  • Upload date:
  • Size: 141.1 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: uv/0.9.10

File hashes

Hashes for gapless_crypto_clickhouse-13.0.1-py3-none-any.whl:

  • SHA256: 877955401ba393eff937c71ae26830f0f84bb397318e44dfc36b13be41a97ac5
  • MD5: a935eece837cb0cff5b54fdf240f0c57
  • BLAKE2b-256: d7b006efae5c96b07dc9a1951ecbf15c67fadc52765d7d15040ec9316e71d1e9

