Professional Exness forex tick data preprocessing with ClickHouse backend. Provides efficient storage with lossless precision and direct queryability.

These details have not been verified by PyPI

Project description

Exness Data Preprocess v2.0.0

Professional forex tick data preprocessing with ClickHouse backend. Provides efficient storage with lossless precision, incremental updates, dual-variant storage (Raw_Spread + Standard), and 26-column OHLC schema with 10 global exchange sessions.

Features

ClickHouse Backend: High-performance columnar storage with ReplacingMergeTree deduplication
Incremental Updates: Automatic gap detection and download only missing months
Dual-Variant Storage: Raw_Spread (primary) + Standard (reference) in same database
26-Column OHLC Schema: BID-based bars with dual spreads, tick counts, timezone tracking, and 10 global exchange sessions
High Performance: Vectorized session detection, SQL gap detection with complete coverage
Fast Queries: Sub-15ms query performance with date range filtering
On-Demand Resampling: Any timeframe (5m, 1h, 1d) resampled efficiently
Simple API: Clean Python API for all operations

Requirements

Python: 3.11+
ClickHouse: Running on localhost:8123 (local) or cloud instance

Installation

# From PyPI (when published)
pip install exness-data-preprocess

# From source
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess
pip install -e .

# Using uv (recommended)
uv pip install exness-data-preprocess

Quick Start

Python API

import exness_data_preprocess as edp

# Initialize processor (requires ClickHouse on localhost:8123)
processor = edp.ExnessDataProcessor()

# Download 3 years of EURUSD data (automatic gap detection)
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
    delete_zip=True,
)

print(f"Months added:  {result.months_added}")
print(f"Raw ticks:     {result.raw_ticks_added:,}")
print(f"Standard ticks: {result.standard_ticks_added:,}")
print(f"OHLC bars:     {result.ohlc_bars:,}")
print(f"Storage:       {result.storage_bytes:,} bytes")

# Query 1-minute OHLC bars for January 2024
df_1m = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1m",
    start_date="2024-01-01",
    end_date="2024-01-31",
)
print(df_1m.head())

# Query raw tick data for September 2024
df_ticks = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
)
print(f"Ticks: {len(df_ticks):,}")

# Clean up
processor.close()

Architecture v2.0.0

Data Flow

Exness Public Repository (monthly ZIPs, both variants)
           ↓
    Automatic Gap Detection
           ↓
Download Only Missing Months (Raw_Spread + Standard)
           ↓
ClickHouse Storage (ReplacingMergeTree deduplication)
           ↓
26-Column OHLC Generation (dual spreads, tick counts, 10 global exchange sessions)
           ↓
Query Interface (date ranges, SQL filters, on-demand resampling)

Storage Format

ClickHouse Database: exness (single database for all instruments)

Schema:

raw_spread_ticks table: instrument, timestamp, bid, ask (ReplacingMergeTree)
standard_ticks table: instrument, timestamp, bid, ask (ReplacingMergeTree)
ohlc_1m table: 26-column schema with instrument column (ReplacingMergeTree)

26-Column OHLC Schema:

Core OHLC: instrument, timestamp, open, high, low, close (BID-based)
Spreads: raw_spread_avg, standard_spread_avg
Tick Counts: tick_count_raw_spread, tick_count_standard
Timezone: ny_hour, london_hour, ny_session, london_session
Holidays: is_us_holiday, is_uk_holiday, is_major_holiday
Exchange Sessions: 10 global exchanges (NYSE, LSE, XSWX, XFRA, XTSE, XNZE, XTKS, XASX, XHKG, XSES)

ClickHouse Configuration

Local Mode (default):

# Start ClickHouse server
clickhouse-server

# Default connection: localhost:8123

Cloud Mode (via environment variables):

export CLICKHOUSE_MODE=cloud
export CLICKHOUSE_HOST=your-instance.clickhouse.cloud
export CLICKHOUSE_PORT=8443
export CLICKHOUSE_USER=default
export CLICKHOUSE_PASSWORD=your-password

Why ClickHouse?

Columnar Storage: Optimized for analytical queries on time-series data
ReplacingMergeTree: Automatic deduplication at merge time
Fast Queries: Sub-15ms performance for date range queries
Scalability: Handles billions of ticks efficiently
Cloud Ready: Same API for local and cloud deployments

Usage Examples

Example 1: Initial Download and Incremental Updates

import exness_data_preprocess as edp

processor = edp.ExnessDataProcessor()

# Initial download (3-year history)
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
    delete_zip=True,
)

# Run again - only downloads new months since last update
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
)
print(f"Months added: {result.months_added} (0 if up to date)")

processor.close()

Example 2: Check Data Coverage

coverage = processor.get_data_coverage("EURUSD")

print(f"Database:        {coverage.database}")
print(f"Raw_Spread ticks: {coverage.raw_spread_ticks:,}")
print(f"Standard ticks:  {coverage.standard_ticks:,}")
print(f"OHLC bars:       {coverage.ohlc_bars:,}")
print(f"Date range:      {coverage.earliest_date} to {coverage.latest_date}")
print(f"Days covered:    {coverage.date_range_days}")
print(f"Storage:         {coverage.storage_bytes:,} bytes")

Example 3: Query OHLC with Date Ranges

# Query 1-minute bars for January 2024
df_1m = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1m",
    start_date="2024-01-01",
    end_date="2024-01-31",
)

# Query 1-hour bars for Q1 2024 (resampled on-demand)
df_1h = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1h",
    start_date="2024-01-01",
    end_date="2024-03-31",
)

# Query daily bars for entire 2024
df_1d = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1d",
    start_date="2024-01-01",
    end_date="2024-12-31",
)

print(f"1m bars: {len(df_1m):,}")
print(f"1h bars: {len(df_1h):,}")
print(f"1d bars: {len(df_1d):,}")

Example 4: Query Ticks with Date Ranges

# Query Raw_Spread ticks for September 2024
df_raw = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
)

print(f"Raw_Spread ticks: {len(df_raw):,}")
print(f"Columns: {list(df_raw.columns)}")

# Calculate spread statistics
df_raw['Spread'] = df_raw['Ask'] - df_raw['Bid']
print(f"Mean spread: {df_raw['Spread'].mean() * 10000:.4f} pips")
print(f"Zero-spreads: {((df_raw['Spread'] == 0).sum() / len(df_raw) * 100):.2f}%")

Example 5: Query with SQL Filters

# Query only zero-spread ticks
df_zero = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-01",
    filter_sql="Bid = Ask",
)
print(f"Zero-spread ticks: {len(df_zero):,}")

# Query high-price ticks
df_high = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
    filter_sql="Bid > 1.11",
)
print(f"High-price ticks: {len(df_high):,}")

Example 6: Process Multiple Instruments

processor = edp.ExnessDataProcessor()

# Process multiple pairs
pairs = ["EURUSD", "GBPUSD", "XAUUSD"]

for pair in pairs:
    print(f"Processing {pair}...")
    result = processor.update_data(
        pair=pair,
        start_date="2023-01-01",
        delete_zip=True,
    )
    print(f"  Months added: {result.months_added}")
    print(f"  Storage: {result.storage_bytes:,} bytes")

processor.close()

Example 7: Parallel Processing

from concurrent.futures import ThreadPoolExecutor, as_completed

def process_instrument(pair, start_date):
    processor = edp.ExnessDataProcessor()
    result = processor.update_data(pair=pair, start_date=start_date, delete_zip=True)
    processor.close()
    return result

instruments = [
    ("EURUSD", "2023-01-01"),
    ("GBPUSD", "2023-01-01"),
    ("XAUUSD", "2023-01-01"),
    ("USDJPY", "2023-01-01"),
]

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {
        executor.submit(process_instrument, pair, start_date): pair
        for pair, start_date in instruments
    }

    for future in as_completed(futures):
        pair = futures[future]
        result = future.result()
        print(f"{pair}: {result.months_added} months added")

Development

Setup

# Clone repository
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess

# Install with development dependencies (using uv)
uv sync --dev

# Or with pip
pip install -e ".[dev]"

Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=exness_data_preprocess --cov-report=html

# Run specific test
uv run pytest tests/test_processor.py -v

Code Quality

# Format code
uv run ruff format .

# Lint
uv run ruff check --fix .

# Type checking
uv run mypy src/

Building

# Build package
uv build

# Test installation locally
uv tool install --editable .

Data Source

Data is sourced from Exness's public tick data repository:

URL: https://ticks.ex2archive.com/
Format: Monthly ZIP files with CSV tick data
Variants: Raw_Spread (zero-spreads) + Standard (market spreads)
Content: Timestamp, Bid, Ask prices for major forex pairs
Quality: Institutional ECN/STP data with microsecond precision

Technical Specifications

Database Size (3-Year History, EURUSD)

Metric	Value
Raw_Spread ticks	~18.6M
Standard ticks	~19.6M
OHLC bars (1m)	~413K
Database size	~2.08 GB
Date range	2022-01-01 to 2025-01-10

Query Performance

Operation	Time
Query 880K ticks (1 month)	<15ms
Query 1m OHLC (1 month)	<10ms
Resample to 1h (1 month)	<15ms
Resample to 1d (1 year)	<20ms

Architecture Benefits

Feature	Benefit
ClickHouse backend	Columnar storage optimized for time-series
ReplacingMergeTree	Automatic deduplication at merge time
Automatic gap detection	Download only missing months
Dual-variant storage	Raw_Spread + Standard in same database
26-column OHLC schema	Dual spreads + dual tick counts + sessions
Date range queries	Efficient filtering without loading entire dataset
On-demand resampling	Any timeframe in <15ms
SQL filter support	Direct SQL WHERE clauses on ticks
Cloud ready	Same API for local and cloud ClickHouse

Performance Optimizations (v0.5.0)

Incremental OHLC Generation - 7.3x speedup for updates:

Full regeneration: 8.05s (303K bars, 7 months)
Incremental update: 1.10s (43K new bars, 1 month)
Implementation: Optional date-range parameters for partial regeneration
Validation: docs/validation/SPIKE_TEST_RESULTS_PHASE1_2025-10-18.md

Vectorized Session Detection - 2.2x speedup for trading hour detection:

Current approach: 5.99s (302K bars, 10 exchanges)
Vectorized approach: 2.69s (302K bars, 10 exchanges)
Combined Phase 1+2: ~16x total speedup (8.05s → 0.50s)
Implementation: Pre-compute trading minutes, vectorized .isin() lookup
SSoT: docs/phases/PHASE2_SESSION_VECTORIZATION_PLAN.yaml
Validation: docs/validation/SPIKE_TEST_RESULTS_PHASE2_2025-10-18.md

SQL Gap Detection - Complete coverage with 46% code reduction:

Bug fix: Python approach missed internal gaps (41 detected vs 42 actual)
SQL EXCEPT operator detects ALL gaps (before + within + after existing data)
Code reduced from 62 lines to 34 lines (46% reduction)
SSoT: docs/phases/PHASE3_SQL_GAP_DETECTION_PLAN.yaml

Release Notes: See CHANGELOG.md for complete v0.5.0 details

API Reference

ExnessDataProcessor

processor = edp.ExnessDataProcessor()  # Requires ClickHouse on localhost:8123

Methods:

update_data(pair, start_date, force_redownload=False, delete_zip=True) - Update database with latest data, returns UpdateResult
query_ohlc(pair, timeframe, start_date=None, end_date=None) - Query OHLC bars, returns DataFrame
query_ticks(pair, variant, start_date=None, end_date=None, filter_sql=None) - Query tick data, returns DataFrame
get_data_coverage(pair) - Get coverage information, returns CoverageInfo
close() - Close ClickHouse connection

Parameters:

pair (str): Currency pair (e.g., "EURUSD", "GBPUSD", "XAUUSD")
timeframe (str): OHLC timeframe ("1m", "5m", "15m", "1h", "4h", "1d")
variant (str): Tick variant ("raw_spread" or "standard")
start_date (str): Start date in "YYYY-MM-DD" format
end_date (str): End date in "YYYY-MM-DD" format
filter_sql (str): SQL WHERE clause (e.g., "bid > 1.11 AND ask < 1.12")

Return Models (Pydantic)

UpdateResult:

database (str): ClickHouse database name
months_added (int): Number of months downloaded
raw_ticks_added (int): Number of raw spread ticks added
standard_ticks_added (int): Number of standard ticks added
ohlc_bars (int): Number of OHLC bars generated
storage_bytes (int): Total storage in bytes

CoverageInfo:

database (str): ClickHouse database name
raw_spread_ticks (int): Total raw spread tick count
standard_ticks (int): Total standard tick count
ohlc_bars (int): Total OHLC bar count
earliest_date (str | None): Earliest data timestamp
latest_date (str | None): Latest data timestamp
date_range_days (int): Days of data coverage
storage_bytes (int): Total storage in bytes

Migration from v1.x

v1.x (Legacy DuckDB):

Single DuckDB file per instrument: eurusd.duckdb
Fields: duckdb_path, duckdb_size_mb, database_exists
Constructor: ExnessDataProcessor(base_dir=...)

v2.0.0 (ClickHouse-only):

ClickHouse database: exness (all instruments)
Fields: database, storage_bytes (renamed)
Constructor: ExnessDataProcessor() (no base_dir)
Requires: ClickHouse running on localhost:8123

BREAKING CHANGES:

duckdb_path → database (str, ClickHouse database name)
duckdb_size_mb → storage_bytes (int, bytes not MB)
database_exists → removed (ClickHouse always exists if connection works)
base_dir parameter → removed (ClickHouse manages storage)

Migration Steps:

Install and start ClickHouse: brew install clickhouse && clickhouse-server
Update code to use new field names (database, storage_bytes)
Remove base_dir parameter from ExnessDataProcessor()
Run processor.update_data(pair, start_date) to populate ClickHouse
Delete old DuckDB files if no longer needed

License

MIT License - see LICENSE file for details.

Authors

Terry Li terry@eonlabs.com
Eon Labs

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Acknowledgments

Exness for providing high-quality public tick data
ClickHouse for high-performance columnar storage with sub-15ms query performance

Additional Documentation

📚 Complete Documentation Hub - Organized guide from beginner to advanced (72+ documents)

Basic Usage Examples: See examples/basic_usage.py
Batch Processing: See examples/batch_processing.py
Architecture Details: See docs/UNIFIED_DUCKDB_PLAN_v2.md
Unit Tests: See tests/ directory

Project details

These details have not been verified by PyPI

Release history Release notifications | RSS feed

2.1.0

Dec 28, 2025

This version

2.0.0

Dec 27, 2025

1.2.1

Dec 11, 2025

1.2.0

Dec 10, 2025

1.1.0

Dec 10, 2025

1.0.0

Dec 10, 2025

0.8.0

Dec 10, 2025

0.7.2

Oct 29, 2025

0.7.1

Oct 28, 2025

0.7.0

Oct 28, 2025

0.6.0

Oct 27, 2025

0.5.0

Oct 21, 2025

0.4.0

Oct 18, 2025

0.3.1

Oct 16, 2025

0.3.0

Oct 15, 2025

0.2.0

Oct 13, 2025

0.1.0

Oct 12, 2025

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

exness_data_preprocess-2.0.0.tar.gz (1.6 MB view details)

Uploaded Dec 27, 2025 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

exness_data_preprocess-2.0.0-py3-none-any.whl (45.2 kB view details)

Uploaded Dec 27, 2025 Python 3

File details

Details for the file exness_data_preprocess-2.0.0.tar.gz.

File metadata

Download URL: exness_data_preprocess-2.0.0.tar.gz
Upload date: Dec 27, 2025
Size: 1.6 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for exness_data_preprocess-2.0.0.tar.gz
Algorithm	Hash digest
SHA256	`f3b5abc16c9dedf8093e42db1a0c3df5e3debb2506cf5f7877708677d599f60e`
MD5	`a2465863cc2e852c0c20f6d0d4cd9653`
BLAKE2b-256	`559ad33ea474a2d6861edd58ab46c6444f592af4239c70d2edd605453bab07a9`

See more details on using hashes here.

Provenance

The following attestation bundles were made for exness_data_preprocess-2.0.0.tar.gz:

Publisher: publish.yml on terrylica/exness-data-preprocess

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: exness_data_preprocess-2.0.0.tar.gz
- Subject digest: f3b5abc16c9dedf8093e42db1a0c3df5e3debb2506cf5f7877708677d599f60e
- Sigstore transparency entry: 780324637
- Sigstore integration time: Dec 27, 2025
Source repository:
- Permalink: terrylica/exness-data-preprocess@b0db0850dda579746c03e8efecd9cd2f6236ff92
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/terrylica
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b0db0850dda579746c03e8efecd9cd2f6236ff92
- Trigger Event: release

File details

Details for the file exness_data_preprocess-2.0.0-py3-none-any.whl.

File metadata

Download URL: exness_data_preprocess-2.0.0-py3-none-any.whl
Upload date: Dec 27, 2025
Size: 45.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for exness_data_preprocess-2.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`9768aff6d098541d5d2fdf7cf427ee9311e17dbdc1fc1f5c01b805c545763aeb`
MD5	`c5ae1db1d258ad807931baa4345c923c`
BLAKE2b-256	`7b2b31085955559230d34b14e2ac77a84321ea15a08cb6411a4f1334051b56ee`

See more details on using hashes here.

Provenance

The following attestation bundles were made for exness_data_preprocess-2.0.0-py3-none-any.whl:

Publisher: publish.yml on terrylica/exness-data-preprocess

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: exness_data_preprocess-2.0.0-py3-none-any.whl
- Subject digest: 9768aff6d098541d5d2fdf7cf427ee9311e17dbdc1fc1f5c01b805c545763aeb
- Sigstore transparency entry: 780324638
- Sigstore integration time: Dec 27, 2025
Source repository:
- Permalink: terrylica/exness-data-preprocess@b0db0850dda579746c03e8efecd9cd2f6236ff92
- Branch / Tag: refs/tags/v2.0.0
- Owner: https://github.com/terrylica
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@b0db0850dda579746c03e8efecd9cd2f6236ff92
- Trigger Event: release

exness-data-preprocess 2.0.0

Navigation

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Project description

Exness Data Preprocess v2.0.0

Features

Requirements

Installation

Quick Start

Python API

Architecture v2.0.0

Data Flow

Storage Format

ClickHouse Configuration

Usage Examples

Example 1: Initial Download and Incremental Updates

Example 2: Check Data Coverage

Example 3: Query OHLC with Date Ranges

Example 4: Query Ticks with Date Ranges

Example 5: Query with SQL Filters

Example 6: Process Multiple Instruments

Example 7: Parallel Processing

Development

Setup

Testing

Code Quality

Building

Data Source

Technical Specifications

Database Size (3-Year History, EURUSD)

Query Performance

Architecture Benefits

Performance Optimizations (v0.5.0)

API Reference

ExnessDataProcessor

Return Models (Pydantic)

Migration from v1.x

License

Authors

Contributing

Acknowledgments

Additional Documentation

Project details

Verified details

Project links

GitHub Statistics

Maintainers

Unverified details

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

Provenance

File details

File metadata

File hashes

Provenance