Skip to main content

Professional Exness forex tick data preprocessing with ClickHouse backend. Provides efficient storage with lossless precision and direct queryability.

Project description

Exness Data Preprocess v2.0.0

PyPI version Python versions License CI Downloads Code style: ruff

Professional forex tick data preprocessing with ClickHouse backend. Provides efficient storage with lossless precision, incremental updates, dual-variant storage (Raw_Spread + Standard), and 26-column OHLC schema with 10 global exchange sessions.

Features

  • ClickHouse Backend: High-performance columnar storage with ReplacingMergeTree deduplication
  • Incremental Updates: Automatic gap detection and download only missing months
  • Dual-Variant Storage: Raw_Spread (primary) + Standard (reference) in same database
  • 26-Column OHLC Schema: BID-based bars with dual spreads, tick counts, timezone tracking, and 10 global exchange sessions
  • High Performance: Vectorized session detection, SQL gap detection with complete coverage
  • Fast Queries: Sub-15ms query performance with date range filtering
  • On-Demand Resampling: Any timeframe (5m, 1h, 1d) resampled efficiently
  • Simple API: Clean Python API for all operations

Requirements

  • Python: 3.11+
  • ClickHouse: Running on localhost:8123 (local) or cloud instance

Installation

# From PyPI (when published)
pip install exness-data-preprocess

# From source
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess
pip install -e .

# Using uv (recommended)
uv pip install exness-data-preprocess

Quick Start

Python API

import exness_data_preprocess as edp

# Initialize processor (requires ClickHouse on localhost:8123)
processor = edp.ExnessDataProcessor()

# Download 3 years of EURUSD data (automatic gap detection)
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
    delete_zip=True,
)

print(f"Months added:  {result.months_added}")
print(f"Raw ticks:     {result.raw_ticks_added:,}")
print(f"Standard ticks: {result.standard_ticks_added:,}")
print(f"OHLC bars:     {result.ohlc_bars:,}")
print(f"Storage:       {result.storage_bytes:,} bytes")

# Query 1-minute OHLC bars for January 2024
df_1m = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1m",
    start_date="2024-01-01",
    end_date="2024-01-31",
)
print(df_1m.head())

# Query raw tick data for September 2024
df_ticks = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
)
print(f"Ticks: {len(df_ticks):,}")

# Clean up
processor.close()

Architecture v2.0.0

Data Flow

Exness Public Repository (monthly ZIPs, both variants)
           ↓
    Automatic Gap Detection
           ↓
Download Only Missing Months (Raw_Spread + Standard)
           ↓
ClickHouse Storage (ReplacingMergeTree deduplication)
           ↓
26-Column OHLC Generation (dual spreads, tick counts, 10 global exchange sessions)
           ↓
Query Interface (date ranges, SQL filters, on-demand resampling)

Storage Format

ClickHouse Database: exness (single database for all instruments)

Schema:

  • raw_spread_ticks table: instrument, timestamp, bid, ask (ReplacingMergeTree)
  • standard_ticks table: instrument, timestamp, bid, ask (ReplacingMergeTree)
  • ohlc_1m table: 26-column schema with instrument column (ReplacingMergeTree)

26-Column OHLC Schema:

  • Core OHLC: instrument, timestamp, open, high, low, close (BID-based)
  • Spreads: raw_spread_avg, standard_spread_avg
  • Tick Counts: tick_count_raw_spread, tick_count_standard
  • Timezone: ny_hour, london_hour, ny_session, london_session
  • Holidays: is_us_holiday, is_uk_holiday, is_major_holiday
  • Exchange Sessions: 10 global exchanges (NYSE, LSE, XSWX, XFRA, XTSE, XNZE, XTKS, XASX, XHKG, XSES)

ClickHouse Configuration

Local Mode (default):

# Start ClickHouse server
clickhouse-server

# Default connection: localhost:8123

Cloud Mode (via environment variables):

export CLICKHOUSE_MODE=cloud
export CLICKHOUSE_HOST=your-instance.clickhouse.cloud
export CLICKHOUSE_PORT=8443
export CLICKHOUSE_USER=default
export CLICKHOUSE_PASSWORD=your-password

Why ClickHouse?

  • Columnar Storage: Optimized for analytical queries on time-series data
  • ReplacingMergeTree: Automatic deduplication at merge time
  • Fast Queries: Sub-15ms performance for date range queries
  • Scalability: Handles billions of ticks efficiently
  • Cloud Ready: Same API for local and cloud deployments

Usage Examples

Example 1: Initial Download and Incremental Updates

import exness_data_preprocess as edp

processor = edp.ExnessDataProcessor()

# Initial download (3-year history)
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
    delete_zip=True,
)

# Run again - only downloads new months since last update
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
)
print(f"Months added: {result.months_added} (0 if up to date)")

processor.close()

Example 2: Check Data Coverage

coverage = processor.get_data_coverage("EURUSD")

print(f"Database:        {coverage.database}")
print(f"Raw_Spread ticks: {coverage.raw_spread_ticks:,}")
print(f"Standard ticks:  {coverage.standard_ticks:,}")
print(f"OHLC bars:       {coverage.ohlc_bars:,}")
print(f"Date range:      {coverage.earliest_date} to {coverage.latest_date}")
print(f"Days covered:    {coverage.date_range_days}")
print(f"Storage:         {coverage.storage_bytes:,} bytes")

Example 3: Query OHLC with Date Ranges

# Query 1-minute bars for January 2024
df_1m = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1m",
    start_date="2024-01-01",
    end_date="2024-01-31",
)

# Query 1-hour bars for Q1 2024 (resampled on-demand)
df_1h = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1h",
    start_date="2024-01-01",
    end_date="2024-03-31",
)

# Query daily bars for entire 2024
df_1d = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1d",
    start_date="2024-01-01",
    end_date="2024-12-31",
)

print(f"1m bars: {len(df_1m):,}")
print(f"1h bars: {len(df_1h):,}")
print(f"1d bars: {len(df_1d):,}")

Example 4: Query Ticks with Date Ranges

# Query Raw_Spread ticks for September 2024
df_raw = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
)

print(f"Raw_Spread ticks: {len(df_raw):,}")
print(f"Columns: {list(df_raw.columns)}")

# Calculate spread statistics
df_raw['Spread'] = df_raw['Ask'] - df_raw['Bid']
print(f"Mean spread: {df_raw['Spread'].mean() * 10000:.4f} pips")
print(f"Zero-spreads: {((df_raw['Spread'] == 0).sum() / len(df_raw) * 100):.2f}%")

Example 5: Query with SQL Filters

# Query only zero-spread ticks
df_zero = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-01",
    filter_sql="Bid = Ask",
)
print(f"Zero-spread ticks: {len(df_zero):,}")

# Query high-price ticks
df_high = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
    filter_sql="Bid > 1.11",
)
print(f"High-price ticks: {len(df_high):,}")

Example 6: Process Multiple Instruments

processor = edp.ExnessDataProcessor()

# Process multiple pairs
pairs = ["EURUSD", "GBPUSD", "XAUUSD"]

for pair in pairs:
    print(f"Processing {pair}...")
    result = processor.update_data(
        pair=pair,
        start_date="2023-01-01",
        delete_zip=True,
    )
    print(f"  Months added: {result.months_added}")
    print(f"  Storage: {result.storage_bytes:,} bytes")

processor.close()

Example 7: Parallel Processing

from concurrent.futures import ThreadPoolExecutor, as_completed

def process_instrument(pair, start_date):
    processor = edp.ExnessDataProcessor()
    result = processor.update_data(pair=pair, start_date=start_date, delete_zip=True)
    processor.close()
    return result

instruments = [
    ("EURUSD", "2023-01-01"),
    ("GBPUSD", "2023-01-01"),
    ("XAUUSD", "2023-01-01"),
    ("USDJPY", "2023-01-01"),
]

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {
        executor.submit(process_instrument, pair, start_date): pair
        for pair, start_date in instruments
    }

    for future in as_completed(futures):
        pair = futures[future]
        result = future.result()
        print(f"{pair}: {result.months_added} months added")

Development

Setup

# Clone repository
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess

# Install with development dependencies (using uv)
uv sync --dev

# Or with pip
pip install -e ".[dev]"

Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=exness_data_preprocess --cov-report=html

# Run specific test
uv run pytest tests/test_processor.py -v

Code Quality

# Format code
uv run ruff format .

# Lint
uv run ruff check --fix .

# Type checking
uv run mypy src/

Building

# Build package
uv build

# Test installation locally
uv tool install --editable .

Data Source

Data is sourced from Exness's public tick data repository:

  • URL: https://ticks.ex2archive.com/
  • Format: Monthly ZIP files with CSV tick data
  • Variants: Raw_Spread (zero-spreads) + Standard (market spreads)
  • Content: Timestamp, Bid, Ask prices for major forex pairs
  • Quality: Institutional ECN/STP data with microsecond precision

Technical Specifications

Database Size (3-Year History, EURUSD)

Metric Value
Raw_Spread ticks ~18.6M
Standard ticks ~19.6M
OHLC bars (1m) ~413K
Database size ~2.08 GB
Date range 2022-01-01 to 2025-01-10

Query Performance

Operation Time
Query 880K ticks (1 month) <15ms
Query 1m OHLC (1 month) <10ms
Resample to 1h (1 month) <15ms
Resample to 1d (1 year) <20ms

Architecture Benefits

Feature Benefit
ClickHouse backend Columnar storage optimized for time-series
ReplacingMergeTree Automatic deduplication at merge time
Automatic gap detection Download only missing months
Dual-variant storage Raw_Spread + Standard in same database
26-column OHLC schema Dual spreads + dual tick counts + sessions
Date range queries Efficient filtering without loading entire dataset
On-demand resampling Any timeframe in <15ms
SQL filter support Direct SQL WHERE clauses on ticks
Cloud ready Same API for local and cloud ClickHouse

Performance Optimizations (v0.5.0)

Incremental OHLC Generation - 7.3x speedup for updates:

Vectorized Session Detection - 2.2x speedup for trading hour detection:

SQL Gap Detection - Complete coverage with 46% code reduction:

  • Bug fix: Python approach missed internal gaps (41 detected vs 42 actual)
  • SQL EXCEPT operator detects ALL gaps (before + within + after existing data)
  • Code reduced from 62 lines to 34 lines (46% reduction)
  • SSoT: docs/phases/PHASE3_SQL_GAP_DETECTION_PLAN.yaml

Release Notes: See CHANGELOG.md for complete v0.5.0 details

API Reference

ExnessDataProcessor

processor = edp.ExnessDataProcessor()  # Requires ClickHouse on localhost:8123

Methods:

  • update_data(pair, start_date, force_redownload=False, delete_zip=True) - Update database with latest data, returns UpdateResult
  • query_ohlc(pair, timeframe, start_date=None, end_date=None) - Query OHLC bars, returns DataFrame
  • query_ticks(pair, variant, start_date=None, end_date=None, filter_sql=None) - Query tick data, returns DataFrame
  • get_data_coverage(pair) - Get coverage information, returns CoverageInfo
  • close() - Close ClickHouse connection

Parameters:

  • pair (str): Currency pair (e.g., "EURUSD", "GBPUSD", "XAUUSD")
  • timeframe (str): OHLC timeframe ("1m", "5m", "15m", "1h", "4h", "1d")
  • variant (str): Tick variant ("raw_spread" or "standard")
  • start_date (str): Start date in "YYYY-MM-DD" format
  • end_date (str): End date in "YYYY-MM-DD" format
  • filter_sql (str): SQL WHERE clause (e.g., "bid > 1.11 AND ask < 1.12")

Return Models (Pydantic)

UpdateResult:

  • database (str): ClickHouse database name
  • months_added (int): Number of months downloaded
  • raw_ticks_added (int): Number of raw spread ticks added
  • standard_ticks_added (int): Number of standard ticks added
  • ohlc_bars (int): Number of OHLC bars generated
  • storage_bytes (int): Total storage in bytes

CoverageInfo:

  • database (str): ClickHouse database name
  • raw_spread_ticks (int): Total raw spread tick count
  • standard_ticks (int): Total standard tick count
  • ohlc_bars (int): Total OHLC bar count
  • earliest_date (str | None): Earliest data timestamp
  • latest_date (str | None): Latest data timestamp
  • date_range_days (int): Days of data coverage
  • storage_bytes (int): Total storage in bytes

Migration from v1.x

v1.x (Legacy DuckDB):

  • Single DuckDB file per instrument: eurusd.duckdb
  • Fields: duckdb_path, duckdb_size_mb, database_exists
  • Constructor: ExnessDataProcessor(base_dir=...)

v2.0.0 (ClickHouse-only):

  • ClickHouse database: exness (all instruments)
  • Fields: database, storage_bytes (renamed)
  • Constructor: ExnessDataProcessor() (no base_dir)
  • Requires: ClickHouse running on localhost:8123

BREAKING CHANGES:

  • duckdb_pathdatabase (str, ClickHouse database name)
  • duckdb_size_mbstorage_bytes (int, bytes not MB)
  • database_exists → removed (ClickHouse always exists if connection works)
  • base_dir parameter → removed (ClickHouse manages storage)

Migration Steps:

  1. Install and start ClickHouse: brew install clickhouse && clickhouse-server
  2. Update code to use new field names (database, storage_bytes)
  3. Remove base_dir parameter from ExnessDataProcessor()
  4. Run processor.update_data(pair, start_date) to populate ClickHouse
  5. Delete old DuckDB files if no longer needed

License

MIT License - see LICENSE file for details.

Authors

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Acknowledgments

  • Exness for providing high-quality public tick data
  • ClickHouse for high-performance columnar storage with sub-15ms query performance

Additional Documentation

📚 Complete Documentation Hub - Organized guide from beginner to advanced (72+ documents)

  • Basic Usage Examples: See examples/basic_usage.py
  • Batch Processing: See examples/batch_processing.py
  • Architecture Details: See docs/UNIFIED_DUCKDB_PLAN_v2.md
  • Unit Tests: See tests/ directory

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

exness_data_preprocess-2.0.0.tar.gz (1.6 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

exness_data_preprocess-2.0.0-py3-none-any.whl (45.2 kB view details)

Uploaded Python 3

File details

Details for the file exness_data_preprocess-2.0.0.tar.gz.

File metadata

  • Download URL: exness_data_preprocess-2.0.0.tar.gz
  • Upload date:
  • Size: 1.6 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for exness_data_preprocess-2.0.0.tar.gz
Algorithm Hash digest
SHA256 f3b5abc16c9dedf8093e42db1a0c3df5e3debb2506cf5f7877708677d599f60e
MD5 a2465863cc2e852c0c20f6d0d4cd9653
BLAKE2b-256 559ad33ea474a2d6861edd58ab46c6444f592af4239c70d2edd605453bab07a9

See more details on using hashes here.

Provenance

The following attestation bundles were made for exness_data_preprocess-2.0.0.tar.gz:

Publisher: publish.yml on terrylica/exness-data-preprocess

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file exness_data_preprocess-2.0.0-py3-none-any.whl.

File metadata

File hashes

Hashes for exness_data_preprocess-2.0.0-py3-none-any.whl
Algorithm Hash digest
SHA256 9768aff6d098541d5d2fdf7cf427ee9311e17dbdc1fc1f5c01b805c545763aeb
MD5 c5ae1db1d258ad807931baa4345c923c
BLAKE2b-256 7b2b31085955559230d34b14e2ac77a84321ea15a08cb6411a4f1334051b56ee

See more details on using hashes here.

Provenance

The following attestation bundles were made for exness_data_preprocess-2.0.0-py3-none-any.whl:

Publisher: publish.yml on terrylica/exness-data-preprocess

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page