Professional Exness forex tick data preprocessing with optimal compression (Parquet Zstd-22) and DuckDB OHLC generation. Provides efficient storage (9% smaller than ZIP) with lossless precision and direct queryability.

Exness Data Preprocess v2.0.0

Professional forex tick data preprocessing with unified single-file DuckDB storage. Provides incremental updates, dual-variant storage (Raw_Spread + Standard), and Phase7 30-column OHLC schema (v1.6.0) with 10 global exchange sessions (trading hour detection) and sub-15ms query performance.

Features

  • Unified Single-File Architecture: One DuckDB file per instrument (eurusd.duckdb)
  • Incremental Updates: Automatic gap detection, so only missing months are downloaded
  • Dual-Variant Storage: Raw_Spread (primary) + Standard (reference) in same database
  • Phase7 OHLC Schema: 30-column bars (v1.6.0) with dual spreads, tick counts, normalized metrics, and 10 global exchange sessions with trading hour detection
  • High Performance: Incremental OHLC generation (7.3x speedup), vectorized session detection (2.2x speedup), SQL gap detection with complete coverage
  • Fast Queries: Date range queries with sub-15ms performance
  • On-Demand Resampling: Any timeframe (5m, 1h, 1d) resampled in <15ms
  • PRIMARY KEY Constraints: Prevents duplicate data during incremental updates
  • Simple API: Clean Python API for all operations

Installation

# From PyPI
pip install exness-data-preprocess

# From source
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess
pip install -e .

# Using uv (recommended)
uv pip install exness-data-preprocess

Quick Start

Python API

import exness_data_preprocess as edp

# Initialize processor
processor = edp.ExnessDataProcessor(base_dir="~/eon/exness-data")

# Download EURUSD data from 2022-01-01 onward (automatic gap detection)
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
    delete_zip=True,
)

print(f"Months added:  {result['months_added']}")
print(f"Raw ticks:     {result['raw_ticks_added']:,}")
print(f"Standard ticks: {result['standard_ticks_added']:,}")
print(f"OHLC bars:     {result['ohlc_bars']:,}")
print(f"Database size: {result['duckdb_size_mb']:.2f} MB")

# Query 1-minute OHLC bars for January 2024
df_1m = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1m",
    start_date="2024-01-01",
    end_date="2024-01-31",
)
print(df_1m.head())

# Query raw tick data for September 2024
df_ticks = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
)
print(f"Ticks: {len(df_ticks):,}")

Architecture v2.0.0

Data Flow

Exness Public Repository (monthly ZIPs, both variants)
           ↓
    Automatic Gap Detection
           ↓
Download Only Missing Months (Raw_Spread + Standard)
           ↓
DuckDB Single-File Storage (PRIMARY KEY prevents duplicates)
           ↓
Phase7 30-Column OHLC Generation (v1.6.0 - dual spreads, tick counts, normalized metrics, 10 global exchange sessions with trading hour detection)
           ↓
Query Interface (date ranges, SQL filters, on-demand resampling)
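The OHLC generation step in the flow above can be illustrated with a minimal pure-Python sketch. It only shows the aggregation rule for BID-only 1-minute bars with a tick count and a mean spread; the function name and the output keys are hypothetical, and the real 30-column layout is defined in schema.py, not here.

```python
from datetime import datetime, timezone


def ticks_to_1m_bars(ticks):
    """Aggregate (timestamp, bid, ask) ticks into 1-minute BID-only bars.

    `ticks` is assumed to be sorted by timestamp. The returned keys are
    illustrative names, not the package's actual column names.
    """
    bars = {}  # dict keeps insertion order, so bars stay in time order
    for ts, bid, ask in ticks:
        minute = ts.replace(second=0, microsecond=0)
        bar = bars.get(minute)
        if bar is None:
            # First tick of the minute sets open/high/low/close
            bars[minute] = {
                "timestamp": minute,
                "open": bid,
                "high": bid,
                "low": bid,
                "close": bid,
                "tick_count": 1,
                "spread_sum": ask - bid,
            }
        else:
            bar["high"] = max(bar["high"], bid)
            bar["low"] = min(bar["low"], bid)
            bar["close"] = bid  # last tick seen so far in this minute
            bar["tick_count"] += 1
            bar["spread_sum"] += ask - bid

    result = []
    for bar in bars.values():
        spread_sum = bar.pop("spread_sum")
        bar["mean_spread"] = spread_sum / bar["tick_count"]
        result.append(bar)
    return result


sample_ticks = [
    (datetime(2024, 1, 2, 12, 0, 1, tzinfo=timezone.utc), 1.1000, 1.1001),
    (datetime(2024, 1, 2, 12, 0, 30, tzinfo=timezone.utc), 1.1005, 1.1005),
    (datetime(2024, 1, 2, 12, 1, 5, tzinfo=timezone.utc), 1.0998, 1.1000),
]
for bar in ticks_to_1m_bars(sample_ticks):
    print(bar)
```

The sample produces two bars: one for 12:00 built from two ticks and one for 12:01 built from a single tick.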

Storage Format

Single File Per Instrument: ~/eon/exness-data/eurusd.duckdb

Schema:

  • raw_spread_ticks table: Timestamp (PK), Bid, Ask
  • standard_ticks table: Timestamp (PK), Bid, Ask
  • ohlc_1m table: Phase7 30-column schema (v1.6.0)
  • metadata table: Coverage tracking
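To show why a PRIMARY KEY on Timestamp makes repeated updates safe, here is a small sketch. It uses the standard-library sqlite3 module as a stand-in so it runs without extra dependencies; recent DuckDB versions accept the same INSERT OR IGNORE statement, but the way the package itself handles conflicts is not documented in this README, so treat the helper names as assumptions.

```python
import sqlite3


def create_tick_table(conn):
    # Mirrors the documented layout: Timestamp (PK), Bid, Ask.
    # TEXT/REAL are SQLite stand-ins for DuckDB's timestamp and double types.
    conn.execute(
        'CREATE TABLE raw_spread_ticks ('
        '"Timestamp" TEXT PRIMARY KEY, "Bid" REAL, "Ask" REAL)'
    )


def insert_ticks(conn, rows):
    """Insert rows, skipping timestamps that are already stored.

    Returns the number of rows actually added.
    """
    count_sql = "SELECT COUNT(*) FROM raw_spread_ticks"
    before = conn.execute(count_sql).fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO raw_spread_ticks VALUES (?, ?, ?)", rows
    )
    after = conn.execute(count_sql).fetchone()[0]
    return after - before


conn = sqlite3.connect(":memory:")
create_tick_table(conn)

first_batch = [
    ("2024-09-02 00:00:00.100", 1.1050, 1.1050),
    ("2024-09-02 00:00:00.350", 1.1051, 1.1052),
]
# The second batch overlaps the first by one tick, as a re-run update might
second_batch = [
    ("2024-09-02 00:00:00.350", 1.1051, 1.1052),
    ("2024-09-02 00:00:00.900", 1.1049, 1.1050),
]
print(insert_ticks(conn, first_batch))   # 2
print(insert_ticks(conn, second_batch))  # 1
```

Only the new tick from the overlapping batch is stored, which is the behaviour an incremental update relies on.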

Phase7 30-Column OHLC (v1.6.0):

  • Column Definitions: See schema.py - Single source of truth
  • Comprehensive Reference: See DATABASE_SCHEMA.md - Query examples and usage patterns
  • Key Features: BID-only OHLC with dual spreads (Raw_Spread + Standard), normalized spread metrics, and 10 global exchange sessions with trading hour detection (XNYS, XLON, XSWX, XFRA, XTSE, XNZE, XTKS, XASX, XHKG, XSES)
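Trading hour detection for an exchange session can be sketched as a time-zone conversion followed by a range check. The opening hours below are assumptions for illustration (regular hours only, no holidays, lunch breaks or half-days), and only two of the ten exchanges are shown; the function and dictionary names are hypothetical and not part of the package.

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

# Assumed regular trading hours in local exchange time (illustrative only)
SESSIONS = {
    "XNYS": ("America/New_York", time(9, 30), time(16, 0)),
    "XLON": ("Europe/London", time(8, 0), time(16, 30)),
}


def is_session_open(code, ts_utc):
    """Return True if the timezone-aware UTC timestamp falls inside the
    assumed weekday trading hours of the exchange identified by `code`."""
    tz_name, open_time, close_time = SESSIONS[code]
    local = ts_utc.astimezone(ZoneInfo(tz_name))
    if local.weekday() >= 5:  # Saturday or Sunday
        return False
    return open_time <= local.time() < close_time


ts = datetime(2024, 1, 2, 15, 0, tzinfo=timezone.utc)
print({code: is_session_open(code, ts) for code in SESSIONS})
```

Converting to the exchange's local time zone first means daylight saving changes are handled by the time-zone database instead of by hard-coded UTC offsets.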

Directory Structure

Default Location: ~/eon/exness-data/ (outside project workspace)

~/eon/exness-data/
├── eurusd.duckdb      # Single file for all EURUSD data
├── gbpusd.duckdb      # Single file for all GBPUSD data
├── xauusd.duckdb      # Single file for all XAUUSD data
└── temp/
    └── (temporary ZIP files)

Why Single-File Per Instrument?

  • Unified Storage: All years in one database
  • Incremental Updates: Automatic gap detection, so only missing months are downloaded
  • No Duplicates: PRIMARY KEY constraints prevent duplicate data
  • Fast Queries: Date range queries with sub-15ms performance
  • Scalability: About 2 GB per instrument for 3 years of data

Usage Examples

Example 1: Initial Download and Incremental Updates

import exness_data_preprocess as edp

processor = edp.ExnessDataProcessor()

# Initial download (full history from start_date onward)
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
    delete_zip=True,
)

# Run again - only downloads new months since last update
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
)
print(f"Months added: {result['months_added']} (0 if up to date)")

Example 2: Check Data Coverage

coverage = processor.get_data_coverage("EURUSD")

print(f"Database exists: {coverage['database_exists']}")
print(f"Raw_Spread ticks: {coverage['raw_spread_ticks']:,}")
print(f"Standard ticks:  {coverage['standard_ticks']:,}")
print(f"OHLC bars:       {coverage['ohlc_bars']:,}")
print(f"Date range:      {coverage['earliest_date']} to {coverage['latest_date']}")
print(f"Days covered:    {coverage['date_range_days']}")
print(f"Database size:   {coverage['duckdb_size_mb']:.2f} MB")

Example 3: Query OHLC with Date Ranges

# Query 1-minute bars for January 2024
df_1m = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1m",
    start_date="2024-01-01",
    end_date="2024-01-31",
)

# Query 1-hour bars for Q1 2024 (resampled on-demand)
df_1h = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1h",
    start_date="2024-01-01",
    end_date="2024-03-31",
)

# Query daily bars for entire 2024
df_1d = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1d",
    start_date="2024-01-01",
    end_date="2024-12-31",
)

print(f"1m bars: {len(df_1m):,}")
print(f"1h bars: {len(df_1h):,}")
print(f"1d bars: {len(df_1d):,}")
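For readers who want to see what resampling 1-minute bars into a coarser timeframe involves, the following pure-Python sketch applies the usual rule: first open, highest high, lowest low, last close. It is an illustration under assumptions, not the package's implementation; `resample_bars` and the dictionary keys are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def resample_bars(bars, minutes):
    """Resample time-sorted 1-minute bars into buckets of `minutes` width.

    `minutes` is assumed to divide a day evenly (5, 15, 60, 240, 1440),
    so buckets are aligned to midnight of each bar's own day.
    """
    out = {}  # insertion-ordered: bucket start -> aggregated bar
    for bar in bars:
        ts = bar["timestamp"]
        minutes_into_day = ts.hour * 60 + ts.minute
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        bucket = midnight + timedelta(
            minutes=(minutes_into_day // minutes) * minutes
        )
        agg = out.get(bucket)
        if agg is None:
            out[bucket] = {
                "timestamp": bucket,
                "open": bar["open"],
                "high": bar["high"],
                "low": bar["low"],
                "close": bar["close"],
            }
        else:
            agg["high"] = max(agg["high"], bar["high"])
            agg["low"] = min(agg["low"], bar["low"])
            agg["close"] = bar["close"]  # last bar in the bucket wins
    return list(out.values())


utc = timezone.utc
one_minute = [
    {"timestamp": datetime(2024, 1, 2, 10, 58, tzinfo=utc),
     "open": 1.10, "high": 1.12, "low": 1.09, "close": 1.11},
    {"timestamp": datetime(2024, 1, 2, 10, 59, tzinfo=utc),
     "open": 1.11, "high": 1.13, "low": 1.10, "close": 1.12},
    {"timestamp": datetime(2024, 1, 2, 11, 0, tzinfo=utc),
     "open": 1.12, "high": 1.14, "low": 1.11, "close": 1.13},
]
for bar in resample_bars(one_minute, 60):
    print(bar)
```

The sample yields two hourly bars: the 10:00 bar combines the first two minutes, and the 11:00 bar contains the third.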

Example 4: Query Ticks with Date Ranges

# Query Raw_Spread ticks for September 2024
df_raw = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
)

print(f"Raw_Spread ticks: {len(df_raw):,}")
print(f"Columns: {list(df_raw.columns)}")

# Calculate spread statistics (1 pip = 0.0001 for EURUSD, hence the factor of 10000)
df_raw['Spread'] = df_raw['Ask'] - df_raw['Bid']
print(f"Mean spread: {df_raw['Spread'].mean() * 10000:.4f} pips")
print(f"Zero-spreads: {((df_raw['Spread'] == 0).sum() / len(df_raw) * 100):.2f}%")

Example 5: Query with SQL Filters

# Query only zero-spread ticks
df_zero = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-01",
    filter_sql="Bid = Ask",
)
print(f"Zero-spread ticks: {len(df_zero):,}")

# Query high-price ticks
df_high = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
    filter_sql="Bid > 1.11",
)
print(f"High-price ticks: {len(df_high):,}")

Example 6: Process Multiple Instruments

import exness_data_preprocess as edp

processor = edp.ExnessDataProcessor()

# Process multiple pairs
pairs = ["EURUSD", "GBPUSD", "XAUUSD"]

for pair in pairs:
    print(f"Processing {pair}...")
    result = processor.update_data(
        pair=pair,
        start_date="2023-01-01",
        delete_zip=True,
    )
    print(f"  Months added: {result['months_added']}")
    print(f"  Database size: {result['duckdb_size_mb']:.2f} MB")

Example 7: Parallel Processing

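from concurrent.futures import ThreadPoolExecutor, as_completed

import exness_data_preprocess as edp

def process_instrument(pair, start_date):
    # Each worker creates its own processor; each pair has its own DuckDB file
    processor = edp.ExnessDataProcessor()
    return processor.update_data(pair=pair, start_date=start_date, delete_zip=True)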

instruments = [
    ("EURUSD", "2023-01-01"),
    ("GBPUSD", "2023-01-01"),
    ("XAUUSD", "2023-01-01"),
    ("USDJPY", "2023-01-01"),
]

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {
        executor.submit(process_instrument, pair, start_date): pair
        for pair, start_date in instruments
    }

    for future in as_completed(futures):
        pair = futures[future]
        result = future.result()
        print(f"{pair}: {result['months_added']} months added")

Development

Setup

# Clone repository
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess

# Install with development dependencies (using uv)
uv sync --dev

# Or with pip
pip install -e ".[dev]"

Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=exness_data_preprocess --cov-report=html

# Run specific test
uv run pytest tests/test_processor.py -v

Code Quality

# Format code
uv run ruff format .

# Lint
uv run ruff check --fix .

# Type checking
uv run mypy src/

Building

# Build package
uv build

# Test installation locally
uv tool install --editable .

Data Source

Data is sourced from Exness's public tick data repository:

  • URL: https://ticks.ex2archive.com/
  • Format: Monthly ZIP files with CSV tick data
  • Variants: Raw_Spread (zero-spreads) + Standard (market spreads)
  • Content: Timestamp, Bid, Ask prices for major forex pairs
  • Quality: Institutional ECN/STP data with microsecond precision

Technical Specifications

Database Size (3-Year History, EURUSD)

| Metric | Value |
|---|---|
| Raw_Spread ticks | ~18.6M |
| Standard ticks | ~19.6M |
| OHLC bars (1m) | ~413K |
| Database size | ~2.08 GB |
| Date range | 2022-01-01 to 2025-01-10 |

Query Performance

| Operation | Time |
|---|---|
| Query 880K ticks (1 month) | <15ms |
| Query 1m OHLC (1 month) | <10ms |
| Resample to 1h (1 month) | <15ms |
| Resample to 1d (1 year) | <20ms |
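Latency figures like these depend on hardware and cache state, so it can be useful to measure them locally. The helper below is a generic sketch (the name `median_latency_ms` is hypothetical and not part of the package) that times any zero-argument callable and reports the median in milliseconds.

```python
import statistics
import time


def median_latency_ms(fn, repeats=20):
    """Call `fn` `repeats` times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload so the sketch runs on its own
print(f"{median_latency_ms(lambda: sum(range(10_000))):.3f} ms")

# With a populated database, the same helper could wrap a real query, e.g.:
# median_latency_ms(lambda: processor.query_ohlc(
#     pair="EURUSD", timeframe="1m",
#     start_date="2024-01-01", end_date="2024-01-31"))
```

The median is used instead of the mean so that a single slow first call (cold cache, file open) does not dominate the result.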

Architecture Benefits

| Feature | Benefit |
|---|---|
| Single file per instrument | Unified storage, no file fragmentation |
| PRIMARY KEY constraints | Prevents duplicates during incremental updates |
| Automatic gap detection | Download only missing months |
| Dual-variant storage | Raw_Spread + Standard in same database |
| Phase7 OHLC schema | Dual spreads + dual tick counts |
| Date range queries | Efficient filtering without loading entire dataset |
| On-demand resampling | Any timeframe in <15ms |
| SQL filter support | Direct SQL WHERE clauses on ticks |

Performance Optimizations (v0.5.0)

Incremental OHLC Generation - 7.3x speedup for updates

Vectorized Session Detection - 2.2x speedup for trading hour detection

SQL Gap Detection - Complete coverage with 46% code reduction:

  • Bug fix: Python approach missed internal gaps (41 detected vs 42 actual)
  • SQL EXCEPT operator detects ALL gaps (before + within + after existing data)
  • Code reduced from 62 lines to 34 lines (46% reduction)
  • SSoT: docs/phases/PHASE3_SQL_GAP_DETECTION_PLAN.yaml
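The EXCEPT-based approach can be sketched as "expected months minus loaded months". The example below uses the standard-library sqlite3 module as a stand-in so it runs anywhere; EXCEPT has the same set-difference meaning in DuckDB. The `loaded_months` table and both function names are hypothetical, since the package's actual metadata layout is not shown in this README.

```python
import sqlite3
from datetime import date


def month_range(start, end):
    """Yield 'YYYY-MM' for every month from `start` to `end`, inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield f"{year:04d}-{month:02d}"
        month += 1
        if month > 12:
            year, month = year + 1, 1


def missing_months(conn, start, end):
    """Return months in [start, end] that are absent from loaded_months."""
    conn.execute("CREATE TEMP TABLE expected_months (month TEXT)")
    conn.executemany(
        "INSERT INTO expected_months VALUES (?)",
        [(m,) for m in month_range(start, end)],
    )
    rows = conn.execute(
        "SELECT month FROM expected_months "
        "EXCEPT SELECT month FROM loaded_months "
        "ORDER BY month"
    ).fetchall()
    conn.execute("DROP TABLE expected_months")
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loaded_months (month TEXT PRIMARY KEY)")
conn.executemany(
    "INSERT INTO loaded_months VALUES (?)",
    [("2024-02",), ("2024-03",), ("2024-05",)],
)
gaps = missing_months(conn, date(2024, 1, 1), date(2024, 6, 30))
print(gaps)  # ['2024-01', '2024-04', '2024-06']
```

The result contains a gap before the loaded range, one inside it and one after it, which is the "complete coverage" property described above: a set difference cannot miss an internal gap the way a min/max comparison can.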

Release Notes: See CHANGELOG.md for complete v0.5.0 details

API Reference

ExnessDataProcessor

processor = edp.ExnessDataProcessor(base_dir="~/eon/exness-data")

Methods:

  • update_data(pair, start_date, force_redownload=False, delete_zip=True) - Update database with latest data
  • query_ohlc(pair, timeframe, start_date=None, end_date=None) - Query OHLC bars
  • query_ticks(pair, variant, start_date=None, end_date=None, filter_sql=None) - Query tick data
  • get_data_coverage(pair) - Get coverage information

Parameters:

  • pair (str): Currency pair (e.g., "EURUSD", "GBPUSD", "XAUUSD")
  • timeframe (str): OHLC timeframe ("1m", "5m", "15m", "1h", "4h", "1d")
  • variant (str): Tick variant ("raw_spread" or "standard")
  • start_date (str): Start date in "YYYY-MM-DD" format
  • end_date (str): End date in "YYYY-MM-DD" format
  • filter_sql (str): SQL WHERE clause (e.g., "Bid > 1.11 AND Ask < 1.12")

Migration from v1.0.0

v1.0.0 (Legacy):

  • Monthly DuckDB files: eurusd_ohlc_2024_08.duckdb
  • Parquet tick storage: eurusd_ticks_2024_08.parquet
  • Functions: process_month(), process_date_range(), analyze_ticks()

v2.0.0 (Current):

  • Single DuckDB file: eurusd.duckdb
  • No Parquet files (everything in DuckDB)
  • Unified API: processor.update_data(), processor.query_ohlc(), processor.query_ticks()

Migration Steps:

  1. Run processor.update_data(pair, start_date) to create new unified database
  2. Delete old monthly files: rm eurusd_ohlc_2024_*.duckdb eurusd_ticks_2024_*.parquet
  3. Update code to use new API methods
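Before running the rm command in step 2, it may help to list exactly which legacy files would be affected. The sketch below is a dry-run helper that matches the v1.0.0 naming patterns shown above; `find_legacy_files` is a hypothetical name, and the directory holding the old files is an assumption you need to supply.

```python
from pathlib import Path


def find_legacy_files(base_dir, pair):
    """Return sorted paths of v1.0.0 monthly files for `pair` in `base_dir`.

    Matches names like eurusd_ohlc_2024_08.duckdb and
    eurusd_ticks_2024_08.parquet; the unified eurusd.duckdb is not matched.
    """
    base = Path(base_dir).expanduser()
    prefix = pair.lower()
    patterns = [f"{prefix}_ohlc_*.duckdb", f"{prefix}_ticks_*.parquet"]
    found = []
    for pattern in patterns:
        found.extend(base.glob(pattern))
    return sorted(found)


# Dry run: print candidates and delete nothing.
# The directory below is an assumption; point it at wherever the old files live.
for path in find_legacy_files("~/eon/exness-data", "EURUSD"):
    print(path)
```

Deleting can then be done deliberately, for example by calling `path.unlink()` on each entry once the new unified database has been checked with get_data_coverage().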

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Acknowledgments

  • Exness for providing high-quality public tick data
  • DuckDB for embedded OLAP capabilities with sub-15ms query performance

Additional Documentation

📚 Complete Documentation Hub - Organized guide from beginner to advanced (72+ documents)

  • Basic Usage Examples: See examples/basic_usage.py
  • Batch Processing: See examples/batch_processing.py
  • Architecture Details: See docs/UNIFIED_DUCKDB_PLAN_v2.md
  • Unit Tests: See tests/ directory
