Professional Exness forex tick data preprocessing with optimal compression (Parquet Zstd-22) and DuckDB OHLC generation. Provides efficient storage (9% smaller than ZIP) with lossless precision and direct queryability.

These details have not been verified by PyPI

Project description

Exness Data Preprocess v2.0.0

Professional forex tick data preprocessing with unified single-file DuckDB storage. Provides incremental updates, dual-variant storage (Raw_Spread + Standard), and Phase7 30-column OHLC schema (v1.6.0) with 10 global exchange sessions (trading hour detection) and sub-15ms query performance.

Features

Unified Single-File Architecture: One DuckDB file per instrument (eurusd.duckdb)
Incremental Updates: Automatic gap detection and download only missing months
Dual-Variant Storage: Raw_Spread (primary) + Standard (reference) in same database
Phase7 OHLC Schema: 30-column bars (v1.6.0) with dual spreads, tick counts, normalized metrics, and 10 global exchange sessions with trading hour detection
High Performance: Incremental OHLC generation (7.3x speedup), vectorized session detection (2.2x speedup), SQL gap detection with complete coverage
Fast Queries: Date range queries with sub-15ms performance
On-Demand Resampling: Any timeframe (5m, 1h, 1d) resampled in <15ms
PRIMARY KEY Constraints: Prevents duplicate data during incremental updates
Simple API: Clean Python API for all operations

Installation

# From PyPI (when published)
pip install exness-data-preprocess

# From source
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess
pip install -e .

# Using uv (recommended)
uv pip install exness-data-preprocess

Quick Start

Python API

import exness_data_preprocess as edp

# Initialize processor
processor = edp.ExnessDataProcessor(base_dir="~/eon/exness-data")

# Download 3 years of EURUSD data (automatic gap detection)
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
    delete_zip=True,
)

print(f"Months added:  {result['months_added']}")
print(f"Raw ticks:     {result['raw_ticks_added']:,}")
print(f"Standard ticks: {result['standard_ticks_added']:,}")
print(f"OHLC bars:     {result['ohlc_bars']:,}")
print(f"Database size: {result['duckdb_size_mb']:.2f} MB")

# Query 1-minute OHLC bars for January 2024
df_1m = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1m",
    start_date="2024-01-01",
    end_date="2024-01-31",
)
print(df_1m.head())

# Query raw tick data for September 2024
df_ticks = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
)
print(f"Ticks: {len(df_ticks):,}")

Architecture v2.0.0

Data Flow

Exness Public Repository (monthly ZIPs, both variants)
           ↓
    Automatic Gap Detection
           ↓
Download Only Missing Months (Raw_Spread + Standard)
           ↓
DuckDB Single-File Storage (PRIMARY KEY prevents duplicates)
           ↓
Phase7 30-Column OHLC Generation (v1.6.0 - dual spreads, tick counts, normalized metrics, 10 global exchange sessions with trading hour detection)
           ↓
Query Interface (date ranges, SQL filters, on-demand resampling)

Storage Format

Single File Per Instrument: ~/eon/exness-data/eurusd.duckdb

Schema:

raw_spread_ticks table: Timestamp (PK), Bid, Ask
standard_ticks table: Timestamp (PK), Bid, Ask
ohlc_1m table: Phase7 30-column schema (v1.6.0)
metadata table: Coverage tracking
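
The effect of the Timestamp primary key can be illustrated with a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for DuckDB so that it runs without extra dependencies. The table name mirrors the schema above, but the column types and the INSERT OR IGNORE strategy are assumptions for illustration, not the package's actual DDL or ingestion code.

```python
import sqlite3

# sqlite3 is only a stand-in for a per-instrument DuckDB file.
# Column types are assumed; the source lists only Timestamp (PK), Bid, Ask.
con = sqlite3.connect(":memory:")
con.execute(
    """
    CREATE TABLE raw_spread_ticks (
        Timestamp TEXT PRIMARY KEY,
        Bid REAL NOT NULL,
        Ask REAL NOT NULL
    )
    """
)


def insert_ticks(con, ticks):
    """Insert ticks, skipping rows whose Timestamp already exists.

    Returns the total row count after the insert.
    """
    con.executemany(
        "INSERT OR IGNORE INTO raw_spread_ticks (Timestamp, Bid, Ask) "
        "VALUES (?, ?, ?)",
        ticks,
    )
    con.commit()
    return con.execute("SELECT COUNT(*) FROM raw_spread_ticks").fetchone()[0]


batch = [
    ("2024-09-02 00:00:00.123456", 1.10510, 1.10510),
    ("2024-09-02 00:00:00.654321", 1.10511, 1.10512),
]
count_first = insert_ticks(con, batch)
# Loading the same month a second time adds no rows.
count_second = insert_ticks(con, batch)
print(count_first, count_second)
```

Because the key rejects repeated timestamps, re-running an update over months that are already stored leaves the row count unchanged.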

Phase7 30-Column OHLC (v1.6.0):

Column Definitions: See schema.py - Single source of truth
Comprehensive Reference: See DATABASE_SCHEMA.md - Query examples and usage patterns
Key Features: BID-only OHLC with dual spreads (Raw_Spread + Standard), normalized spread metrics, and 10 global exchange sessions with trading hour detection (XNYS, XLON, XSWX, XFRA, XTSE, XNZE, XTKS, XASX, XHKG, XSES)
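
As a rough illustration of what "BID-only OHLC" with tick counts means, the sketch below aggregates ticks into 1-minute bars in pure Python. The field names (open, high, low, close, mean_spread, tick_count) are hypothetical and cover only a small subset of the idea; the real 30-column definitions live in schema.py, and the package's own generation is not shown here.

```python
from datetime import datetime


def ticks_to_1m_bid_ohlc(ticks):
    """Aggregate (timestamp, bid, ask) ticks into 1-minute bars built from Bid.

    Returns a dict keyed by the bar's starting minute. Each value holds the
    Bid open/high/low/close, the mean Ask-Bid spread, and the tick count.
    """
    bars = {}
    for ts, bid, ask in sorted(ticks, key=lambda tick: tick[0]):
        minute = ts.replace(second=0, microsecond=0)
        bar = bars.get(minute)
        if bar is None:
            bars[minute] = {
                "open": bid, "high": bid, "low": bid, "close": bid,
                "spread_sum": ask - bid, "tick_count": 1,
            }
        else:
            bar["high"] = max(bar["high"], bid)
            bar["low"] = min(bar["low"], bid)
            bar["close"] = bid  # last tick seen in this minute
            bar["spread_sum"] += ask - bid
            bar["tick_count"] += 1
    for bar in bars.values():
        bar["mean_spread"] = bar.pop("spread_sum") / bar["tick_count"]
    return bars


ticks = [
    (datetime(2024, 9, 2, 0, 0, 5), 1.1050, 1.1051),
    (datetime(2024, 9, 2, 0, 0, 40), 1.1053, 1.1053),
    (datetime(2024, 9, 2, 0, 0, 59), 1.1049, 1.1050),
    (datetime(2024, 9, 2, 0, 1, 10), 1.1052, 1.1053),
]
bars = ticks_to_1m_bid_ohlc(ticks)
for minute, bar in bars.items():
    print(minute, bar)
```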

Directory Structure

Default Location: ~/eon/exness-data/ (outside project workspace)

~/eon/exness-data/
├── eurusd.duckdb      # Single file for all EURUSD data
├── gbpusd.duckdb      # Single file for all GBPUSD data
├── xauusd.duckdb      # Single file for all XAUUSD data
└── temp/
    └── (temporary ZIP files)

Why Single-File Per Instrument?

Unified Storage: All years in one database
Incremental Updates: Automatic gap detection and download only missing months
No Duplicates: PRIMARY KEY constraints prevent duplicate data
Fast Queries: Date range queries with sub-15ms performance
Scalability: Multi-year data in ~2 GB per instrument (3 years)

Usage Examples

Example 1: Initial Download and Incremental Updates

import exness_data_preprocess as edp

processor = edp.ExnessDataProcessor()

# Initial download (3-year history)
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
    delete_zip=True,
)

# Run again - only downloads new months since last update
result = processor.update_data(
    pair="EURUSD",
    start_date="2022-01-01",
)
print(f"Months added: {result['months_added']} (0 if up to date)")

Example 2: Check Data Coverage

coverage = processor.get_data_coverage("EURUSD")

print(f"Database exists: {coverage['database_exists']}")
print(f"Raw_Spread ticks: {coverage['raw_spread_ticks']:,}")
print(f"Standard ticks:  {coverage['standard_ticks']:,}")
print(f"OHLC bars:       {coverage['ohlc_bars']:,}")
print(f"Date range:      {coverage['earliest_date']} to {coverage['latest_date']}")
print(f"Days covered:    {coverage['date_range_days']}")
print(f"Database size:   {coverage['duckdb_size_mb']:.2f} MB")

Example 3: Query OHLC with Date Ranges

# Query 1-minute bars for January 2024
df_1m = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1m",
    start_date="2024-01-01",
    end_date="2024-01-31",
)

# Query 1-hour bars for Q1 2024 (resampled on-demand)
df_1h = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1h",
    start_date="2024-01-01",
    end_date="2024-03-31",
)

# Query daily bars for entire 2024
df_1d = processor.query_ohlc(
    pair="EURUSD",
    timeframe="1d",
    start_date="2024-01-01",
    end_date="2024-12-31",
)

print(f"1m bars: {len(df_1m):,}")
print(f"1h bars: {len(df_1h):,}")
print(f"1d bars: {len(df_1d):,}")
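
Conceptually, on-demand resampling groups 1-minute bars into wider buckets and takes the first open, highest high, lowest low, and last close in each bucket. The sketch below shows that rule in pure Python. The function name and tuple layout are assumptions for illustration; this is not how query_ohlc is implemented internally.

```python
from datetime import datetime, timedelta


def resample_ohlc(bars_1m, minutes):
    """Resample 1-minute OHLC bars into bars that are `minutes` wide.

    bars_1m: iterable of (timestamp, open, high, low, close), sorted by time.
    minutes: target width, assumed to divide a day evenly (e.g. 5, 60, 1440).
    """
    resampled = {}
    for ts, open_, high, low, close in bars_1m:
        midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        minute_of_day = ts.hour * 60 + ts.minute
        # Floor the timestamp to the start of its bucket.
        bucket = midnight + timedelta(minutes=(minute_of_day // minutes) * minutes)
        bar = resampled.get(bucket)
        if bar is None:
            resampled[bucket] = [open_, high, low, close]
        else:
            bar[1] = max(bar[1], high)
            bar[2] = min(bar[2], low)
            bar[3] = close  # last bar in the bucket sets the close
    return [(bucket, *values) for bucket, values in resampled.items()]


bars_1m = [
    (datetime(2024, 1, 2, 10, 0), 1.1000, 1.1005, 1.0998, 1.1002),
    (datetime(2024, 1, 2, 10, 30), 1.1002, 1.1010, 1.1001, 1.1008),
    (datetime(2024, 1, 2, 11, 0), 1.1008, 1.1009, 1.1003, 1.1004),
]
bars_1h = resample_ohlc(bars_1m, minutes=60)
for bar in bars_1h:
    print(bar)
```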

Example 4: Query Ticks with Date Ranges

# Query Raw_Spread ticks for September 2024
df_raw = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
)

print(f"Raw_Spread ticks: {len(df_raw):,}")
print(f"Columns: {list(df_raw.columns)}")

# Calculate spread statistics
df_raw['Spread'] = df_raw['Ask'] - df_raw['Bid']
print(f"Mean spread: {df_raw['Spread'].mean() * 10000:.4f} pips")
print(f"Zero-spreads: {((df_raw['Spread'] == 0).sum() / len(df_raw) * 100):.2f}%")

Example 5: Query with SQL Filters

# Query only zero-spread ticks
df_zero = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-01",
    filter_sql="Bid = Ask",
)
print(f"Zero-spread ticks: {len(df_zero):,}")

# Query high-price ticks
df_high = processor.query_ticks(
    pair="EURUSD",
    variant="raw_spread",
    start_date="2024-09-01",
    end_date="2024-09-30",
    filter_sql="Bid > 1.11",
)
print(f"High-price ticks: {len(df_high):,}")
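
One plausible way a filter_sql fragment could combine with a date range is sketched below, again using sqlite3 as a stand-in for DuckDB. The helper build_tick_query is hypothetical, and treating end_date as inclusive is an assumption based on the single-day query above. Since the fragment is interpolated into the SQL text, this pattern only suits trusted input.

```python
import sqlite3
from datetime import date, timedelta

# Map the documented variant names to the documented table names.
ALLOWED_TABLES = {"raw_spread": "raw_spread_ticks", "standard": "standard_ticks"}


def build_tick_query(variant, start_date, end_date, filter_sql=None):
    """Return (sql, params) for a date-bounded tick query (hypothetical helper)."""
    table = ALLOWED_TABLES[variant]  # raises KeyError for unknown variants
    # Assumed: end_date is inclusive, so the upper bound is the following day.
    end_exclusive = (date.fromisoformat(end_date) + timedelta(days=1)).isoformat()
    sql = (
        f"SELECT Timestamp, Bid, Ask FROM {table} "
        "WHERE Timestamp >= ? AND Timestamp < ?"
    )
    if filter_sql:
        sql += f" AND ({filter_sql})"  # trusted fragment only
    return sql + " ORDER BY Timestamp", (start_date, end_exclusive)


con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE raw_spread_ticks (Timestamp TEXT PRIMARY KEY, Bid REAL, Ask REAL)"
)
con.executemany(
    "INSERT INTO raw_spread_ticks VALUES (?, ?, ?)",
    [
        ("2024-09-01 22:05:00.100000", 1.1051, 1.1051),
        ("2024-09-01 22:05:00.900000", 1.1051, 1.1052),
        ("2024-09-02 00:00:01.000000", 1.1050, 1.1050),
    ],
)
sql, params = build_tick_query(
    "raw_spread", "2024-09-01", "2024-09-01", filter_sql="Bid = Ask"
)
zero_spread_rows = con.execute(sql, params).fetchall()
print(zero_spread_rows)
```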

Example 6: Process Multiple Instruments

import exness_data_preprocess as edp

processor = edp.ExnessDataProcessor()

# Process multiple pairs
pairs = ["EURUSD", "GBPUSD", "XAUUSD"]

for pair in pairs:
    print(f"Processing {pair}...")
    result = processor.update_data(
        pair=pair,
        start_date="2023-01-01",
        delete_zip=True,
    )
    print(f"  Months added: {result['months_added']}")
    print(f"  Database size: {result['duckdb_size_mb']:.2f} MB")

Example 7: Parallel Processing

from concurrent.futures import ThreadPoolExecutor, as_completed

import exness_data_preprocess as edp

def process_instrument(pair, start_date):
    processor = edp.ExnessDataProcessor()
    return processor.update_data(pair=pair, start_date=start_date, delete_zip=True)

instruments = [
    ("EURUSD", "2023-01-01"),
    ("GBPUSD", "2023-01-01"),
    ("XAUUSD", "2023-01-01"),
    ("USDJPY", "2023-01-01"),
]

with ThreadPoolExecutor(max_workers=4) as executor:
    futures = {
        executor.submit(process_instrument, pair, start_date): pair
        for pair, start_date in instruments
    }

    for future in as_completed(futures):
        pair = futures[future]
        result = future.result()
        print(f"{pair}: {result['months_added']} months added")

Development

Setup

# Clone repository
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess

# Install with development dependencies (using uv)
uv sync --dev

# Or with pip
pip install -e ".[dev]"

Testing

# Run all tests
uv run pytest

# Run with coverage
uv run pytest --cov=exness_data_preprocess --cov-report=html

# Run specific test
uv run pytest tests/test_processor.py -v

Code Quality

# Format code
uv run ruff format .

# Lint
uv run ruff check --fix .

# Type checking
uv run mypy src/

Building

# Build package
uv build

# Test installation locally
uv tool install --editable .

Data Source

Data is sourced from Exness's public tick data repository:

URL: https://ticks.ex2archive.com/
Format: Monthly ZIP files with CSV tick data
Variants: Raw_Spread (zero-spreads) + Standard (market spreads)
Content: Timestamp, Bid, Ask prices for major forex pairs
Quality: Institutional ECN/STP data with microsecond precision

Technical Specifications

Database Size (3-Year History, EURUSD)

Metric	Value
Raw_Spread ticks	~18.6M
Standard ticks	~19.6M
OHLC bars (1m)	~413K
Database size	~2.08 GB
Date range	2022-01-01 to 2025-01-10

Query Performance

Operation	Time
Query 880K ticks (1 month)	<15ms
Query 1m OHLC (1 month)	<10ms
Resample to 1h (1 month)	<15ms
Resample to 1d (1 year)	<20ms

Architecture Benefits

Feature	Benefit
Single file per instrument	Unified storage, no file fragmentation
PRIMARY KEY constraints	Prevents duplicates during incremental updates
Automatic gap detection	Download only missing months
Dual-variant storage	Raw_Spread + Standard in same database
Phase7 OHLC schema	Dual spreads + dual tick counts
Date range queries	Efficient filtering without loading entire dataset
On-demand resampling	Any timeframe in <15ms
SQL filter support	Direct SQL WHERE clauses on ticks

Performance Optimizations (v0.5.0)

Incremental OHLC Generation - 7.3x speedup for updates:

Full regeneration: 8.05s (303K bars, 7 months)
Incremental update: 1.10s (43K new bars, 1 month)
Implementation: Optional date-range parameters for partial regeneration
Validation: docs/validation/SPIKE_TEST_RESULTS_PHASE1_2025-10-18.md

Vectorized Session Detection - 2.2x speedup for trading hour detection:

Current approach: 5.99s (302K bars, 10 exchanges)
Vectorized approach: 2.69s (302K bars, 10 exchanges)
Combined Phase 1+2: ~16x total speedup (8.05s → 0.50s)
Implementation: Pre-compute trading minutes, vectorized .isin() lookup
SSoT: docs/phases/PHASE2_SESSION_VECTORIZATION_PLAN.yaml
Validation: docs/validation/SPIKE_TEST_RESULTS_PHASE2_2025-10-18.md
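
The idea of pre-computing trading minutes and then doing a membership lookup can be sketched without pandas. Below, a set lookup plays the role that the vectorized .isin() call plays in the description above. The session hours are illustrative fixed UTC values (roughly winter time) and are assumptions; real exchange calendars also depend on time zones, daylight saving, and holidays, which this sketch ignores.

```python
from datetime import datetime


def trading_minutes(open_hm, close_hm):
    """Return the minute-of-day values in the half-open range [open, close)."""
    start = open_hm[0] * 60 + open_hm[1]
    end = close_hm[0] * 60 + close_hm[1]
    return frozenset(range(start, end))


# Hypothetical session hours in UTC, computed once up front.
SESSIONS = {
    "XLON": trading_minutes((8, 0), (16, 30)),
    "XNYS": trading_minutes((14, 30), (21, 0)),
}


def session_flags(bar_timestamps):
    """Flag each bar as inside or outside every session via set membership."""
    minute_of_day = [ts.hour * 60 + ts.minute for ts in bar_timestamps]
    return {
        name: [minute in minutes for minute in minute_of_day]
        for name, minutes in SESSIONS.items()
    }


bars = [
    datetime(2024, 1, 2, 9, 0),
    datetime(2024, 1, 2, 15, 0),
    datetime(2024, 1, 2, 22, 0),
]
flags = session_flags(bars)
print(flags)
```

The per-bar work reduces to one lookup per exchange, which is the property the vectorized approach relies on.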

SQL Gap Detection - Complete coverage with about 45% less code:

Bug fix: Python approach missed internal gaps (41 detected vs 42 actual)
SQL EXCEPT operator detects ALL gaps (before + within + after existing data)
Code reduced from 62 lines to 34 lines (about 45% reduction)
SSoT: docs/phases/PHASE3_SQL_GAP_DETECTION_PLAN.yaml
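
The EXCEPT-based approach can be sketched as "expected months minus months present in the data". The example below uses sqlite3 as a stand-in for DuckDB. The helper names, the temporary expected_months table, and the way months are derived from the tick table are assumptions for illustration, not the package's actual query.

```python
import sqlite3


def month_range(start, end):
    """Yield 'YYYY-MM' strings from start to end, inclusive."""
    year, month = map(int, start.split("-"))
    end_year, end_month = map(int, end.split("-"))
    while (year, month) <= (end_year, end_month):
        yield f"{year:04d}-{month:02d}"
        month += 1
        if month > 12:
            year, month = year + 1, 1


def find_missing_months(con, start, end):
    """Return months in [start, end] that have no ticks (hypothetical helper)."""
    con.execute("CREATE TEMP TABLE expected_months (month TEXT PRIMARY KEY)")
    con.executemany(
        "INSERT INTO expected_months VALUES (?)",
        [(m,) for m in month_range(start, end)],
    )
    rows = con.execute(
        """
        SELECT month FROM expected_months
        EXCEPT
        SELECT substr(Timestamp, 1, 7) FROM raw_spread_ticks
        ORDER BY 1
        """
    ).fetchall()
    con.execute("DROP TABLE expected_months")
    return [row[0] for row in rows]


con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE raw_spread_ticks (Timestamp TEXT PRIMARY KEY, Bid REAL, Ask REAL)"
)
con.executemany(
    "INSERT INTO raw_spread_ticks VALUES (?, ?, ?)",
    [
        ("2024-02-01 00:00:01.000000", 1.0810, 1.0810),
        ("2024-04-01 00:00:01.000000", 1.0790, 1.0791),
    ],
)
missing = find_missing_months(con, "2024-01", "2024-05")
print(missing)
```

In this toy data set the result covers a gap before the stored data, one inside it, and one after it, which is the case a start/end-only comparison would miss.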

Release Notes: See CHANGELOG.md for complete v0.5.0 details

API Reference

ExnessDataProcessor

processor = edp.ExnessDataProcessor(base_dir="~/eon/exness-data")

Methods:

update_data(pair, start_date, force_redownload=False, delete_zip=True) - Update database with latest data
query_ohlc(pair, timeframe, start_date=None, end_date=None) - Query OHLC bars
query_ticks(pair, variant, start_date=None, end_date=None, filter_sql=None) - Query tick data
get_data_coverage(pair) - Get coverage information

Parameters:

pair (str): Currency pair (e.g., "EURUSD", "GBPUSD", "XAUUSD")
timeframe (str): OHLC timeframe ("1m", "5m", "15m", "1h", "4h", "1d")
variant (str): Tick variant ("raw_spread" or "standard")
start_date (str): Start date in "YYYY-MM-DD" format
end_date (str): End date in "YYYY-MM-DD" format
filter_sql (str): SQL WHERE clause (e.g., "Bid > 1.11 AND Ask < 1.12")

Migration from v1.0.0

v1.0.0 (Legacy):

Monthly DuckDB files: eurusd_ohlc_2024_08.duckdb
Parquet tick storage: eurusd_ticks_2024_08.parquet
Functions: process_month(), process_date_range(), analyze_ticks()

v2.0.0 (Current):

Single DuckDB file: eurusd.duckdb
No Parquet files (everything in DuckDB)
Unified API: processor.update_data(), processor.query_ohlc(), processor.query_ticks()

Migration Steps:

1. Run processor.update_data(pair, start_date) to create the new unified database
2. Delete old monthly files: rm eurusd_ohlc_2024_*.duckdb eurusd_ticks_2024_*.parquet
3. Update code to use the new API methods

License

MIT License - see LICENSE file for details.

Authors

Terry Li terry@eonlabs.com
Eon Labs

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Acknowledgments

Exness for providing high-quality public tick data
DuckDB for embedded OLAP capabilities with sub-15ms query performance

Additional Documentation

📚 Complete Documentation Hub - Organized guide from beginner to advanced (72+ documents)

Basic Usage Examples: See examples/basic_usage.py
Batch Processing: See examples/batch_processing.py
Architecture Details: See docs/UNIFIED_DUCKDB_PLAN_v2.md
Unit Tests: See tests/ directory

Release history

Version	Release date
2.1.0	Dec 28, 2025
2.0.0	Dec 27, 2025
1.2.1	Dec 11, 2025
1.2.0	Dec 10, 2025
1.1.0	Dec 10, 2025
1.0.0	Dec 10, 2025
0.8.0 (this version)	Dec 10, 2025
0.7.2	Oct 29, 2025
0.7.1	Oct 28, 2025
0.7.0	Oct 28, 2025
0.6.0	Oct 27, 2025
0.5.0	Oct 21, 2025
0.4.0	Oct 18, 2025
0.3.1	Oct 16, 2025
0.3.0	Oct 15, 2025
0.2.0	Oct 13, 2025
0.1.0	Oct 12, 2025

Download files

Source Distribution

exness_data_preprocess-0.8.0.tar.gz (1.6 MB)

Uploaded Dec 10, 2025 Source

Built Distribution

exness_data_preprocess-0.8.0-py3-none-any.whl (64.3 kB)

Uploaded Dec 10, 2025 Python 3

File details

Details for the file exness_data_preprocess-0.8.0.tar.gz.

File metadata

Download URL: exness_data_preprocess-0.8.0.tar.gz
Upload date: Dec 10, 2025
Size: 1.6 MB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for exness_data_preprocess-0.8.0.tar.gz
Algorithm	Hash digest
SHA256	`913b091bcc3e8ed961e7974533cb2eef9513024bf4d53078d6e57c395b3f2e4a`
MD5	`6590127078dd127b847133d6d2a6a328`
BLAKE2b-256	`ad6124517d43629efe77a6fcd446166d854293aa1f5691800ce67b846f94b2f3`
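
A downloaded file can be checked against the SHA256 digest listed above with Python's standard hashlib module. The helper below is a minimal sketch; the function name is hypothetical, and the demonstration runs against a small temporary file rather than the real archive.

```python
import hashlib
import os
import tempfile


def sha256_matches(path, expected_hex, chunk_size=1 << 20):
    """Return True if the file's SHA256 digest equals expected_hex."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.lower()


# Self-contained demonstration on a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
ok = sha256_matches(tmp.name, hashlib.sha256(b"abc").hexdigest())
bad = sha256_matches(tmp.name, "0" * 64)
os.remove(tmp.name)
print(ok, bad)
```

For the real source distribution, the call would take the path of the downloaded exness_data_preprocess-0.8.0.tar.gz and the SHA256 value from the table.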

Provenance

The following attestation bundles were made for exness_data_preprocess-0.8.0.tar.gz:

Publisher: publish.yml on terrylica/exness-data-preprocess

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: exness_data_preprocess-0.8.0.tar.gz
- Subject digest: 913b091bcc3e8ed961e7974533cb2eef9513024bf4d53078d6e57c395b3f2e4a
- Sigstore transparency entry: 756471399
- Sigstore integration time: Dec 10, 2025
Source repository:
- Permalink: terrylica/exness-data-preprocess@d50da709a480cff4bf0be460ed84e82d0fe5ce7c
- Branch / Tag: refs/tags/v0.8.0
- Owner: https://github.com/terrylica
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d50da709a480cff4bf0be460ed84e82d0fe5ce7c
- Trigger Event: release

File details

Details for the file exness_data_preprocess-0.8.0-py3-none-any.whl.

File metadata

Download URL: exness_data_preprocess-0.8.0-py3-none-any.whl
Upload date: Dec 10, 2025
Size: 64.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for exness_data_preprocess-0.8.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`a7fbe2a186190843e34784c5d10e32a29d542df55226a1df43e5631ed7d48e6e`
MD5	`815fc62276a7e52d3cababd05d5e71df`
BLAKE2b-256	`e6f91c644da6281588ff513a080b81c3e7f0ac388b7fd31b1a4c461033112c92`

Provenance

The following attestation bundles were made for exness_data_preprocess-0.8.0-py3-none-any.whl:

Publisher: publish.yml on terrylica/exness-data-preprocess

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: exness_data_preprocess-0.8.0-py3-none-any.whl
- Subject digest: a7fbe2a186190843e34784c5d10e32a29d542df55226a1df43e5631ed7d48e6e
- Sigstore transparency entry: 756471401
- Sigstore integration time: Dec 10, 2025
Source repository:
- Permalink: terrylica/exness-data-preprocess@d50da709a480cff4bf0be460ed84e82d0fe5ce7c
- Branch / Tag: refs/tags/v0.8.0
- Owner: https://github.com/terrylica
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@d50da709a480cff4bf0be460ed84e82d0fe5ce7c
- Trigger Event: release
