Professional Exness forex tick data preprocessing with optimal compression (Parquet Zstd-22) and DuckDB OHLC generation. Provides efficient storage (9% smaller than ZIP) with lossless precision and direct queryability.

Exness Data Preprocess v2.0.0

Professional forex tick data preprocessing with unified single-file DuckDB storage. Provides incremental updates, dual-variant storage (Raw_Spread + Standard), and Phase7 9-column OHLC schema with sub-15ms query performance.

Features

  • Unified Single-File Architecture: One DuckDB file per instrument (eurusd.duckdb)
  • Incremental Updates: Automatic gap detection; only missing months are downloaded
  • Dual-Variant Storage: Raw_Spread (primary) + Standard (reference) in same database
  • Phase7 OHLC Schema: 9-column bars with dual spreads and dual tick counts
  • Fast Queries: Date range queries with sub-15ms performance
  • On-Demand Resampling: Any timeframe (5m, 1h, 1d) resampled in <15ms
  • PRIMARY KEY Constraints: Prevents duplicate data during incremental updates
  • Simple API: Clean Python API for all operations

Installation

# From PyPI (when published)
pip install exness-data-preprocess

# From source
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess
pip install -e .

# Using uv (recommended)
uv pip install exness-data-preprocess

Quick Start

Python API

import exness_data_preprocess as edp

# Initialize processor
processor = edp.ExnessDataProcessor(base_dir="~/eon/exness-data")

# Download 3 years of EURUSD data (automatic gap detection)
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
    delete_zip=True,
)

print(f"Months added:  {result['months_added']}")
print(f"Raw ticks:     {result['raw_ticks_added']:,}")
print(f"Standard ticks: {result['standard_ticks_added']:,}")
print(f"OHLC bars:     {result['ohlc_bars']:,}")
print(f"Database size: {result['duckdb_size_mb']:.2f} MB")

# Query 1-minute OHLC bars for January 2024
df_1m = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1m",
    start_date="2024-01-01",
    end_date="2024-01-31",
)
print(df_1m.head())

# Query raw tick data for September 2024
df_ticks = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
)
print(f"Ticks: {len(df_ticks):,}")

Architecture v2.0.0

Data Flow

Exness Public Repository (monthly ZIPs, both variants)
           ↓
    Automatic Gap Detection
           ↓
Download Only Missing Months (Raw_Spread + Standard)
           ↓
DuckDB Single-File Storage (PRIMARY KEY prevents duplicates)
           ↓
Phase7 9-Column OHLC Generation (dual spreads + dual tick counts)
           ↓
Query Interface (date ranges, SQL filters, on-demand resampling)
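
The gap-detection step in the flow above can be pictured as a set difference over months. The following is a minimal sketch under that assumption, not the package's implementation; month_range, find_missing_months, and the stored_months set are hypothetical names introduced for illustration.

```python
from datetime import date


def month_range(start: date, end: date) -> list[str]:
    """Return every month from start to end (inclusive) as 'YYYY-MM' strings."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(f"{year:04d}-{month:02d}")
        # Advance one month, rolling over into the next year after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months


def find_missing_months(start: date, end: date, stored: set[str]) -> list[str]:
    """Months in the requested range that are not yet in the database."""
    return [m for m in month_range(start, end) if m not in stored]


# Hypothetical coverage: the database already holds January and March 2024.
stored_months = {"2024-01", "2024-03"}
missing = find_missing_months(date(2024, 1, 1), date(2024, 4, 30), stored_months)
print(missing)  # ['2024-02', '2024-04']
```

Only the months returned by such a check would need to be downloaded, which is why a second update_data call on an up-to-date database adds zero months.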

Storage Format

Single File Per Instrument: ~/eon/exness-data/eurusd.duckdb

Schema:

  • raw_spread_ticks table: Timestamp (PK), Bid, Ask
  • standard_ticks table: Timestamp (PK), Bid, Ask
  • ohlc_1m table: Phase7 9-column schema
  • metadata table: Coverage tracking
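
To illustrate how a PRIMARY KEY on Timestamp keeps re-imported ticks from duplicating, here is a small sketch. It uses Python's built-in sqlite3 as a dependency-free stand-in for DuckDB, and the INSERT OR IGNORE statement is an assumption about how duplicates could be skipped; the source does not state which statement the package actually uses.

```python
import sqlite3

# In-memory SQLite database as a stand-in for eurusd.duckdb (assumption).
con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE raw_spread_ticks (
        Timestamp TEXT PRIMARY KEY,  -- one row per tick timestamp
        Bid REAL NOT NULL,
        Ask REAL NOT NULL
    )
    """
)

# Made-up sample ticks for illustration only.
ticks = [
    ("2024-09-02 00:00:00.123", 1.10450, 1.10450),
    ("2024-09-02 00:00:00.456", 1.10451, 1.10453),
]


def load(rows):
    # INSERT OR IGNORE skips rows whose Timestamp already exists.
    con.executemany("INSERT OR IGNORE INTO raw_spread_ticks VALUES (?, ?, ?)", rows)
    con.commit()


load(ticks)
load(ticks)  # simulate re-processing the same month

row_count = con.execute("SELECT COUNT(*) FROM raw_spread_ticks").fetchone()[0]
print(row_count)  # 2
```

Loading the same rows twice leaves the table at two rows, which is the behaviour the schema relies on during incremental updates.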

Phase7 9-Column OHLC:

  1. Timestamp - Bar timestamp
  2. Open - Opening price
  3. High - High price
  4. Low - Low price
  5. Close - Closing price
  6. raw_spread_avg - Average spread from Raw_Spread variant
  7. standard_spread_avg - Average spread from Standard variant
  8. tick_count_raw_spread - Tick count from Raw_Spread variant
  9. tick_count_standard - Tick count from Standard variant
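
The nine columns can be read as a per-minute aggregation over both tick tables. The sketch below is one plausible reading, not the package's code: it assumes Open/High/Low/Close come from the Raw_Spread Bid price and that each spread is Ask minus Bid averaged over the minute. The build_bars helper and the sample ticks are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def minute_of(ts: datetime) -> datetime:
    """Truncate a timestamp to the start of its minute."""
    return ts.replace(second=0, microsecond=0)


def build_bars(raw_ticks, standard_ticks):
    """Aggregate (timestamp, bid, ask) ticks into 9-column 1-minute bars.

    Assumption: Open/High/Low/Close are taken from the Raw_Spread Bid price.
    """
    raw_by_min, std_by_min = defaultdict(list), defaultdict(list)
    for ts, bid, ask in raw_ticks:
        raw_by_min[minute_of(ts)].append((ts, bid, ask))
    for ts, bid, ask in standard_ticks:
        std_by_min[minute_of(ts)].append((ts, bid, ask))

    bars = []
    for minute in sorted(raw_by_min):
        raw = sorted(raw_by_min[minute])  # chronological order within the minute
        std = std_by_min.get(minute, [])
        bids = [bid for _, bid, _ in raw]
        bars.append({
            "Timestamp": minute,
            "Open": bids[0],
            "High": max(bids),
            "Low": min(bids),
            "Close": bids[-1],
            "raw_spread_avg": sum(a - b for _, b, a in raw) / len(raw),
            # None when the Standard variant has no ticks in this minute.
            "standard_spread_avg": sum(a - b for _, b, a in std) / len(std) if std else None,
            "tick_count_raw_spread": len(raw),
            "tick_count_standard": len(std),
        })
    return bars


# Made-up ticks inside a single minute.
raw = [
    (datetime(2024, 1, 2, 0, 0, 1), 1.1000, 1.1000),
    (datetime(2024, 1, 2, 0, 0, 30), 1.1004, 1.1004),
    (datetime(2024, 1, 2, 0, 0, 59), 1.1002, 1.1002),
]
std = [(datetime(2024, 1, 2, 0, 0, 5), 1.1000, 1.1001)]
bars = build_bars(raw, std)
print(bars[0]["tick_count_raw_spread"], bars[0]["tick_count_standard"])  # 3 1
```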

Directory Structure

Default Location: ~/eon/exness-data/ (outside project workspace)

~/eon/exness-data/
├── eurusd.duckdb      # Single file for all EURUSD data
├── gbpusd.duckdb      # Single file for all GBPUSD data
├── xauusd.duckdb      # Single file for all XAUUSD data
└── temp/
    └── (temporary ZIP files)

Why Single-File Per Instrument?

  • Unified Storage: All years in one database
  • Incremental Updates: Automatic gap detection; only missing months are downloaded
  • No Duplicates: PRIMARY KEY constraints prevent duplicate data
  • Fast Queries: Date range queries with sub-15ms performance
  • Scalability: About 2 GB per instrument for 3 years of data

Usage Examples

Example 1: Initial Download and Incremental Updates

import exness_data_preprocess as edp

processor = edp.ExnessDataProcessor()

# Initial download (3-year history)
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
    delete_zip=True,
)

# Run again - only downloads new months since last update
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
)
print(f"Months added: {result['months_added']} (0 if up to date)")

Example 2: Check Data Coverage

coverage = processor.get_data_coverage("EURUSD")

print(f"Database exists: {coverage['database_exists']}")
print(f"Raw_Spread ticks: {coverage['raw_spread_ticks']:,}")
print(f"Standard ticks:  {coverage['standard_ticks']:,}")
print(f"OHLC bars:       {coverage['ohlc_bars']:,}")
print(f"Date range:      {coverage['earliest_date']} to {coverage['latest_date']}")
print(f"Days covered:    {coverage['date_range_days']}")
print(f"Database size:   {coverage['duckdb_size_mb']:.2f} MB")

Example 3: Query OHLC with Date Ranges

# Query 1-minute bars for January 2024
df_1m = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1m",
    start_date="2024-01-01",
    end_date="2024-01-31",
)

# Query 1-hour bars for Q1 2024 (resampled on-demand)
df_1h = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1h",
    start_date="2024-01-01",
    end_date="2024-03-31",
)

# Query daily bars for entire 2024
df_1d = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1d",
    start_date="2024-01-01",
    end_date="2024-12-31",
)

print(f"1m bars: {len(df_1m):,}")
print(f"1h bars: {len(df_1h):,}")
print(f"1d bars: {len(df_1d):,}")
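
The source does not show how on-demand resampling is implemented. As a rough mental model, combining 1-minute bars into a larger timeframe follows the usual OHLC rules: first Open, highest High, lowest Low, last Close, summed tick counts. The resample_to_hour helper below is a hypothetical pure-Python sketch of those rules, using made-up bars.

```python
from datetime import datetime


def resample_to_hour(bars_1m):
    """Combine chronologically sorted 1-minute bars into 1-hour bars."""
    hourly = {}
    for bar in bars_1m:
        hour = bar["Timestamp"].replace(minute=0, second=0, microsecond=0)
        if hour not in hourly:
            # The first bar of the hour seeds every field (copied, not mutated).
            hourly[hour] = {**bar, "Timestamp": hour}
            continue
        agg = hourly[hour]
        agg["High"] = max(agg["High"], bar["High"])
        agg["Low"] = min(agg["Low"], bar["Low"])
        agg["Close"] = bar["Close"]  # the last bar seen wins
        agg["tick_count_raw_spread"] += bar["tick_count_raw_spread"]
    return list(hourly.values())


bars = [
    {"Timestamp": datetime(2024, 1, 2, 9, 0), "Open": 1.10, "High": 1.12,
     "Low": 1.09, "Close": 1.11, "tick_count_raw_spread": 40},
    {"Timestamp": datetime(2024, 1, 2, 9, 1), "Open": 1.11, "High": 1.13,
     "Low": 1.10, "Close": 1.12, "tick_count_raw_spread": 25},
    {"Timestamp": datetime(2024, 1, 2, 10, 0), "Open": 1.12, "High": 1.12,
     "Low": 1.11, "Close": 1.11, "tick_count_raw_spread": 10},
]
hourly = resample_to_hour(bars)
print(len(hourly))  # 2
```

The sketch omits the two spread columns; averaging them across minutes would need a weighting choice that the source does not specify.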

Example 4: Query Ticks with Date Ranges

# Query Raw_Spread ticks for September 2024
df_raw = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
)

print(f"Raw_Spread ticks: {len(df_raw):,}")
print(f"Columns: {list(df_raw.columns)}")

# Calculate spread statistics
df_raw['Spread'] = df_raw['Ask'] - df_raw['Bid']
print(f"Mean spread: {df_raw['Spread'].mean() * 10000:.4f} pips")
print(f"Zero-spreads: {((df_raw['Spread'] == 0).sum() / len(df_raw) * 100):.2f}%")

Example 5: Query with SQL Filters

# Query only zero-spread ticks
df_zero = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-01",
    filter_sql="Bid = Ask",
)
print(f"Zero-spread ticks: {len(df_zero):,}")

# Query high-price ticks
df_high = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
    filter_sql="Bid > 1.11",
)
print(f"High-price ticks: {len(df_high):,}")

Example 6: Process Multiple Instruments

import exness_data_preprocess as edp

processor = edp.ExnessDataProcessor()

# Process multiple pairs
pairs = ["EURUSD", "GBPUSD", "XAUUSD"]

for pair in pairs:
    print(f"Processing {pair}...")
    result = processor.update_data(
        pair=pair,
        start_date="2023-01-01",
        delete_zip=True,
    )
    print(f"  Months added: {result['months_added']}")
    print(f"  Database size: {result['duckdb_size_mb']:.2f} MB")

Example 7: Parallel Processing

from concurrent.futures import ThreadPoolExecutor, as_completed

import exness_data_preprocess as edp

def process_instrument(pair, start_date):
    processor = edp.ExnessDataProcessor()
    return processor.update_data(pair=pair, start_date=start_date, delete_zip=True)

instruments = [
    ("EURUSD", "2023-01-01"),
    ("GBPUSD", "2023-01-01"),
    ("XAUUSD", "2023-01-01"),
    ("USDJPY", "2023-01-01"),
]

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {
        executor.submit(process_instrument, pair, start_date): pair
        for pair, start_date in instruments
    }

    for future in as_completed(futures):
        pair = futures[future]
        result = future.result()
        print(f"{pair}: {result['months_added']} months added")

Development

Setup

# Clone repository
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess

# Install with development dependencies (using uv)
uv sync --dev

# Or with pip
pip install -e ".[dev]"

Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=exness_data_preprocess --cov-report=html

# Run specific test
uv run pytest tests/test_processor.py -v

Code Quality

# Format code
uv run ruff format .

# Lint
uv run ruff check --fix .

# Type checking
uv run mypy src/

Building

# Build package
uv build

# Test installation locally
uv tool install --editable .

Data Source

Data is sourced from Exness's public tick data repository:

  • URL: https://ticks.ex2archive.com/
  • Format: Monthly ZIP files with CSV tick data
  • Variants: Raw_Spread (zero-spreads) + Standard (market spreads)
  • Content: Timestamp, Bid, Ask prices for major forex pairs
  • Quality: Institutional ECN/STP data with microsecond precision

Technical Specifications

Database Size (3-Year History, EURUSD)

Metric              Value
------------------  ------------------------
Raw_Spread ticks    ~18.6M
Standard ticks      ~19.6M
OHLC bars (1m)      ~413K
Database size       ~2.08 GB
Date range          2022-01-01 to 2025-01-10

Query Performance

Operation                     Time
----------------------------  ------
Query 880K ticks (1 month)    <15ms
Query 1m OHLC (1 month)       <10ms
Resample to 1h (1 month)      <15ms
Resample to 1d (1 year)       <20ms
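
Timings like these depend on hardware and cache state, so it can be worth measuring them locally. The time_call helper below is a hypothetical, minimal harness; the workload shown is a stand-in that would be replaced by a call such as a lambda wrapping processor.query_ohlc(...).

```python
import statistics
import time


def time_call(func, repeats=5):
    """Run func several times and return the median wall-clock time in milliseconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


# Stand-in workload; swap in a real query to measure it.
elapsed_ms = time_call(lambda: sum(range(100_000)))
print(f"median: {elapsed_ms:.2f} ms")
```

Taking the median over several runs reduces the effect of a cold first query.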

Architecture Benefits

Feature                       Benefit
----------------------------  --------------------------------------------------
Single file per instrument    Unified storage, no file fragmentation
PRIMARY KEY constraints       Prevents duplicates during incremental updates
Automatic gap detection       Downloads only missing months
Dual-variant storage          Raw_Spread + Standard in same database
Phase7 OHLC schema            Dual spreads + dual tick counts
Date range queries            Efficient filtering without loading entire dataset
On-demand resampling          Any timeframe in <15ms
SQL filter support            Direct SQL WHERE clauses on ticks

API Reference

ExnessDataProcessor

processor = edp.ExnessDataProcessor(base_dir="~/eon/exness-data")

Methods:

  • update_data(pair, start_date, force_redownload=False, delete_zip=True) - Update database with latest data
  • query_ohlc(pair, timeframe, start_date=None, end_date=None) - Query OHLC bars
  • query_ticks(pair, variant, start_date=None, end_date=None, filter_sql=None) - Query tick data
  • get_data_coverage(pair) - Get coverage information

Parameters:

  • pair (str): Currency pair (e.g., "EURUSD", "GBPUSD", "XAUUSD")
  • timeframe (str): OHLC timeframe ("1m", "5m", "15m", "1h", "4h", "1d")
  • variant (str): Tick variant ("raw_spread" or "standard")
  • start_date (str): Start date in "YYYY-MM-DD" format
  • end_date (str): End date in "YYYY-MM-DD" format
  • filter_sql (str): SQL WHERE clause (e.g., "Bid > 1.11 AND Ask < 1.12")
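
Since the parameters above are plain strings, a caller may want to check them before issuing a query. The check_query_args helper below is a hypothetical sketch built only from the values listed above; it is not part of the package, and the package may accept values beyond those listed.

```python
from datetime import datetime

# Values taken from the parameter list above.
TIMEFRAMES = {"1m", "5m", "15m", "1h", "4h", "1d"}
VARIANTS = {"raw_spread", "standard"}


def check_query_args(timeframe=None, variant=None, start_date=None, end_date=None):
    """Raise ValueError for arguments outside the documented values."""
    if timeframe is not None and timeframe not in TIMEFRAMES:
        raise ValueError(f"unsupported timeframe: {timeframe!r}")
    if variant is not None and variant not in VARIANTS:
        raise ValueError(f"unsupported variant: {variant!r}")
    parsed = []
    for value in (start_date, end_date):
        if value is not None:
            # strptime raises ValueError for strings that do not parse
            # as a year-month-day date.
            parsed.append(datetime.strptime(value, "%Y-%m-%d"))
    if len(parsed) == 2 and parsed[0] > parsed[1]:
        raise ValueError("start_date must not be after end_date")


# Passes silently for a documented combination.
check_query_args(timeframe="1h", variant="raw_spread",
                 start_date="2024-01-01", end_date="2024-03-31")
```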

Migration from v1.0.0

v1.0.0 (Legacy):

  • Monthly DuckDB files: eurusd_ohlc_2024_08.duckdb
  • Parquet tick storage: eurusd_ticks_2024_08.parquet
  • Functions: process_month(), process_date_range(), analyze_ticks()

v2.0.0 (Current):

  • Single DuckDB file: eurusd.duckdb
  • No Parquet files (everything in DuckDB)
  • Unified API: processor.update_data(), processor.query_ohlc(), processor.query_ticks()

Migration Steps:

  1. Run processor.update_data(pair, start_date) to create new unified database
  2. Delete old monthly files: rm eurusd_ohlc_2024_*.duckdb eurusd_ticks_2024_*.parquet
  3. Update code to use new API methods
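
Before running the rm command in step 2, it may help to list exactly which files match. The sketch below is a cautious, hypothetical helper based on the v1.0.0 file names shown above. Unlike the rm example it matches every year, not only 2024, and it never matches the unified eurusd.duckdb file.

```python
from pathlib import Path


def find_legacy_files(data_dir, pair):
    """List v1.0.0 monthly files for one pair without deleting anything."""
    prefix = pair.lower()
    patterns = [f"{prefix}_ohlc_*.duckdb", f"{prefix}_ticks_*.parquet"]
    found = []
    for pattern in patterns:
        found.extend(Path(data_dir).expanduser().glob(pattern))
    return sorted(found)


def delete_files(paths, dry_run=True):
    """Delete the given files; with dry_run=True only report what would go."""
    for path in paths:
        print(("would delete " if dry_run else "deleting ") + path.name)
        if not dry_run:
            path.unlink()
```

A typical use would be to call find_legacy_files on the old data directory, review the dry-run output of delete_files, and only then repeat the call with dry_run=False.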

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Acknowledgments

  • Exness for providing high-quality public tick data
  • DuckDB for embedded OLAP capabilities with sub-15ms query performance

Additional Documentation

  • Basic Usage Examples: See examples/basic_usage.py
  • Batch Processing: See examples/batch_processing.py
  • Architecture Details: See docs/UNIFIED_DUCKDB_PLAN_v2.md
  • Unit Tests: See tests/ directory
