Professional Exness forex tick data preprocessing with optimal compression (Parquet Zstd-22) and DuckDB OHLC generation. Provides efficient storage (9% smaller than ZIP) with lossless precision and direct queryability.
Project description
Exness Data Preprocess v2.0.0
Professional forex tick data preprocessing with unified single-file DuckDB storage. Provides incremental updates, dual-variant storage (Raw_Spread + Standard), and Phase7 30-column OHLC schema (v1.5.0) with 10 global exchange sessions and sub-15ms query performance.
Features
- Unified Single-File Architecture: One DuckDB file per instrument (eurusd.duckdb)
- Incremental Updates: Automatic gap detection and download only missing months
- Dual-Variant Storage: Raw_Spread (primary) + Standard (reference) in same database
- Phase7 OHLC Schema: 30-column bars (v1.5.0) with dual spreads, tick counts, normalized metrics, and 10 global exchange sessions
- Fast Queries: Date range queries with sub-15ms performance
- On-Demand Resampling: Any timeframe (5m, 1h, 1d) resampled in <15ms
- PRIMARY KEY Constraints: Prevents duplicate data during incremental updates
- Simple API: Clean Python API for all operations
Installation
# From PyPI (when published)
pip install exness-data-preprocess
# From source
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess
pip install -e .
# Using uv (recommended)
uv pip install exness-data-preprocess
Quick Start
Python API
import exness_data_preprocess as edp
# Initialize processor
processor = edp.ExnessDataProcessor(base_dir="~/eon/exness-data")
# Download 3 years of EURUSD data (automatic gap detection)
result = processor.update_data(
pair="EURUSD",
start_date="2022-01-01",
delete_zip=True,
)
print(f"Months added: {result['months_added']}")
print(f"Raw ticks: {result['raw_ticks_added']:,}")
print(f"Standard ticks: {result['standard_ticks_added']:,}")
print(f"OHLC bars: {result['ohlc_bars']:,}")
print(f"Database size: {result['duckdb_size_mb']:.2f} MB")
# Query 1-minute OHLC bars for January 2024
df_1m = processor.query_ohlc(
pair="EURUSD",
timeframe="1m",
start_date="2024-01-01",
end_date="2024-01-31",
)
print(df_1m.head())
# Query raw tick data for September 2024
df_ticks = processor.query_ticks(
pair="EURUSD",
variant="raw_spread",
start_date="2024-09-01",
end_date="2024-09-30",
)
print(f"Ticks: {len(df_ticks):,}")
Architecture v2.0.0
Data Flow
Exness Public Repository (monthly ZIPs, both variants)
↓
Automatic Gap Detection
↓
Download Only Missing Months (Raw_Spread + Standard)
↓
DuckDB Single-File Storage (PRIMARY KEY prevents duplicates)
↓
Phase7 30-Column OHLC Generation (v1.5.0 - dual spreads, tick counts, normalized metrics, 10 global exchange sessions)
↓
Query Interface (date ranges, SQL filters, on-demand resampling)
Storage Format
Single File Per Instrument: ~/eon/exness-data/eurusd.duckdb
Schema:
raw_spread_tickstable: Timestamp (PK), Bid, Askstandard_tickstable: Timestamp (PK), Bid, Askohlc_1mtable: Phase7 30-column schema (v1.5.0)metadatatable: Coverage tracking
Phase7 30-Column OHLC (v1.5.0):
- Column Definitions: See
schema.py- Single source of truth - Comprehensive Reference: See
DATABASE_SCHEMA.md- Query examples and usage patterns - Key Features: BID-only OHLC with dual spreads (Raw_Spread + Standard), normalized spread metrics, and 10 global exchange sessions (XNYS, XLON, XSWX, XFRA, XTSE, XNZE, XTKS, XASX, XHKG, XSES)
Directory Structure
Default Location: ~/eon/exness-data/ (outside project workspace)
~/eon/exness-data/
├── eurusd.duckdb # Single file for all EURUSD data
├── gbpusd.duckdb # Single file for all GBPUSD data
├── xauusd.duckdb # Single file for all XAUUSD data
└── temp/
└── (temporary ZIP files)
Why Single-File Per Instrument?
- Unified Storage: All years in one database
- Incremental Updates: Automatic gap detection and download only missing months
- No Duplicates: PRIMARY KEY constraints prevent duplicate data
- Fast Queries: Date range queries with sub-15ms performance
- Scalability: Multi-year data in ~2 GB per instrument (3 years)
Usage Examples
Example 1: Initial Download and Incremental Updates
import exness_data_preprocess as edp
processor = edp.ExnessDataProcessor()
# Initial download (3-year history)
result = processor.update_data(
pair="EURUSD",
start_date="2022-01-01",
delete_zip=True,
)
# Run again - only downloads new months since last update
result = processor.update_data(
pair="EURUSD",
start_date="2022-01-01",
)
print(f"Months added: {result['months_added']} (0 if up to date)")
Example 2: Check Data Coverage
coverage = processor.get_data_coverage("EURUSD")
print(f"Database exists: {coverage['database_exists']}")
print(f"Raw_Spread ticks: {coverage['raw_spread_ticks']:,}")
print(f"Standard ticks: {coverage['standard_ticks']:,}")
print(f"OHLC bars: {coverage['ohlc_bars']:,}")
print(f"Date range: {coverage['earliest_date']} to {coverage['latest_date']}")
print(f"Days covered: {coverage['date_range_days']}")
print(f"Database size: {coverage['duckdb_size_mb']:.2f} MB")
Example 3: Query OHLC with Date Ranges
# Query 1-minute bars for January 2024
df_1m = processor.query_ohlc(
pair="EURUSD",
timeframe="1m",
start_date="2024-01-01",
end_date="2024-01-31",
)
# Query 1-hour bars for Q1 2024 (resampled on-demand)
df_1h = processor.query_ohlc(
pair="EURUSD",
timeframe="1h",
start_date="2024-01-01",
end_date="2024-03-31",
)
# Query daily bars for entire 2024
df_1d = processor.query_ohlc(
pair="EURUSD",
timeframe="1d",
start_date="2024-01-01",
end_date="2024-12-31",
)
print(f"1m bars: {len(df_1m):,}")
print(f"1h bars: {len(df_1h):,}")
print(f"1d bars: {len(df_1d):,}")
Example 4: Query Ticks with Date Ranges
# Query Raw_Spread ticks for September 2024
df_raw = processor.query_ticks(
pair="EURUSD",
variant="raw_spread",
start_date="2024-09-01",
end_date="2024-09-30",
)
print(f"Raw_Spread ticks: {len(df_raw):,}")
print(f"Columns: {list(df_raw.columns)}")
# Calculate spread statistics
df_raw['Spread'] = df_raw['Ask'] - df_raw['Bid']
print(f"Mean spread: {df_raw['Spread'].mean() * 10000:.4f} pips")
print(f"Zero-spreads: {((df_raw['Spread'] == 0).sum() / len(df_raw) * 100):.2f}%")
Example 5: Query with SQL Filters
# Query only zero-spread ticks
df_zero = processor.query_ticks(
pair="EURUSD",
variant="raw_spread",
start_date="2024-09-01",
end_date="2024-09-01",
filter_sql="Bid = Ask",
)
print(f"Zero-spread ticks: {len(df_zero):,}")
# Query high-price ticks
df_high = processor.query_ticks(
pair="EURUSD",
variant="raw_spread",
start_date="2024-09-01",
end_date="2024-09-30",
filter_sql="Bid > 1.11",
)
print(f"High-price ticks: {len(df_high):,}")
Example 6: Process Multiple Instruments
processor = edp.ExnessDataProcessor()
# Process multiple pairs
pairs = ["EURUSD", "GBPUSD", "XAUUSD"]
for pair in pairs:
print(f"Processing {pair}...")
result = processor.update_data(
pair=pair,
start_date="2023-01-01",
delete_zip=True,
)
print(f" Months added: {result['months_added']}")
print(f" Database size: {result['duckdb_size_mb']:.2f} MB")
Example 7: Parallel Processing
from concurrent.futures import ThreadPoolExecutor, as_completed
def process_instrument(pair, start_date):
processor = edp.ExnessDataProcessor()
return processor.update_data(pair=pair, start_date=start_date, delete_zip=True)
instruments = [
("EURUSD", "2023-01-01"),
("GBPUSD", "2023-01-01"),
("XAUUSD", "2023-01-01"),
("USDJPY", "2023-01-01"),
]
with ThreadPoolExecutor(max_workers=4) as executor:
futures = {
executor.submit(process_instrument, pair, start_date): pair
for pair, start_date in instruments
}
for future in as_completed(futures):
pair = futures[future]
result = future.result()
print(f"{pair}: {result['months_added']} months added")
Development
Setup
# Clone repository
git clone https://github.com/Eon-Labs/exness-data-preprocess.git
cd exness-data-preprocess
# Install with development dependencies (using uv)
uv sync --dev
# Or with pip
pip install -e ".[dev]"
Testing
# Run all tests
uv run pytest
# Run with coverage
uv run pytest --cov=exness_data_preprocess --cov-report=html
# Run specific test
uv run pytest tests/test_processor.py -v
Code Quality
# Format code
uv run ruff format .
# Lint
uv run ruff check --fix .
# Type checking
uv run mypy src/
Building
# Build package
uv build
# Test installation locally
uv tool install --editable .
Data Source
Data is sourced from Exness's public tick data repository:
- URL: https://ticks.ex2archive.com/
- Format: Monthly ZIP files with CSV tick data
- Variants: Raw_Spread (zero-spreads) + Standard (market spreads)
- Content: Timestamp, Bid, Ask prices for major forex pairs
- Quality: Institutional ECN/STP data with microsecond precision
Technical Specifications
Database Size (3-Year History, EURUSD)
| Metric | Value |
|---|---|
| Raw_Spread ticks | ~18.6M |
| Standard ticks | ~19.6M |
| OHLC bars (1m) | ~413K |
| Database size | ~2.08 GB |
| Date range | 2022-01-01 to 2025-01-10 |
Query Performance
| Operation | Time |
|---|---|
| Query 880K ticks (1 month) | <15ms |
| Query 1m OHLC (1 month) | <10ms |
| Resample to 1h (1 month) | <15ms |
| Resample to 1d (1 year) | <20ms |
Architecture Benefits
| Feature | Benefit |
|---|---|
| Single file per instrument | Unified storage, no file fragmentation |
| PRIMARY KEY constraints | Prevents duplicates during incremental updates |
| Automatic gap detection | Download only missing months |
| Dual-variant storage | Raw_Spread + Standard in same database |
| Phase7 OHLC schema | Dual spreads + dual tick counts |
| Date range queries | Efficient filtering without loading entire dataset |
| On-demand resampling | Any timeframe in <15ms |
| SQL filter support | Direct SQL WHERE clauses on ticks |
API Reference
ExnessDataProcessor
processor = edp.ExnessDataProcessor(base_dir="~/eon/exness-data")
Methods:
update_data(pair, start_date, force_redownload=False, delete_zip=True)- Update database with latest dataquery_ohlc(pair, timeframe, start_date=None, end_date=None)- Query OHLC barsquery_ticks(pair, variant, start_date=None, end_date=None, filter_sql=None)- Query tick dataget_data_coverage(pair)- Get coverage information
Parameters:
pair(str): Currency pair (e.g., "EURUSD", "GBPUSD", "XAUUSD")timeframe(str): OHLC timeframe ("1m", "5m", "15m", "1h", "4h", "1d")variant(str): Tick variant ("raw_spread" or "standard")start_date(str): Start date in "YYYY-MM-DD" formatend_date(str): End date in "YYYY-MM-DD" formatfilter_sql(str): SQL WHERE clause (e.g., "Bid > 1.11 AND Ask < 1.12")
Migration from v1.0.0
v1.0.0 (Legacy):
- Monthly DuckDB files:
eurusd_ohlc_2024_08.duckdb - Parquet tick storage:
eurusd_ticks_2024_08.parquet - Functions:
process_month(),process_date_range(),analyze_ticks()
v2.0.0 (Current):
- Single DuckDB file:
eurusd.duckdb - No Parquet files (everything in DuckDB)
- Unified API:
processor.update_data(),processor.query_ohlc(),processor.query_ticks()
Migration Steps:
- Run
processor.update_data(pair, start_date)to create new unified database - Delete old monthly files:
rm eurusd_ohlc_2024_*.duckdb eurusd_ticks_2024_*.parquet - Update code to use new API methods
License
MIT License - see LICENSE file for details.
Authors
- Terry Li terry@eonlabs.com
- Eon Labs
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
Acknowledgments
- Exness for providing high-quality public tick data
- DuckDB for embedded OLAP capabilities with sub-15ms query performance
Additional Documentation
- Basic Usage Examples: See
examples/basic_usage.py - Batch Processing: See
examples/batch_processing.py - Architecture Details: See
docs/UNIFIED_DUCKDB_PLAN_v2.md - Unit Tests: See
tests/directory
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file exness_data_preprocess-0.3.1.tar.gz.
File metadata
- Download URL: exness_data_preprocess-0.3.1.tar.gz
- Upload date:
- Size: 1.4 MB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
ac8408e5bfa186a6129f337fc3a0a2ccd46d47029ce6fd0ed03c16ffc5ff02a3
|
|
| MD5 |
26d5149ed8c0f7e77ab3c9cb0e570df1
|
|
| BLAKE2b-256 |
31e11cdd4ba1602ea8c75ad826b70e68e89353805c02e868e781c0e602028e4f
|
Provenance
The following attestation bundles were made for exness_data_preprocess-0.3.1.tar.gz:
Publisher:
publish.yml on terrylica/exness-data-preprocess
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
exness_data_preprocess-0.3.1.tar.gz -
Subject digest:
ac8408e5bfa186a6129f337fc3a0a2ccd46d47029ce6fd0ed03c16ffc5ff02a3 - Sigstore transparency entry: 611479914
- Sigstore integration time:
-
Permalink:
terrylica/exness-data-preprocess@2afa1d04d8a71428e81ce624bf7c50ae38dcb670 -
Branch / Tag:
refs/tags/v0.3.1 - Owner: https://github.com/terrylica
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@2afa1d04d8a71428e81ce624bf7c50ae38dcb670 -
Trigger Event:
release
-
Statement type:
File details
Details for the file exness_data_preprocess-0.3.1-py3-none-any.whl.
File metadata
- Download URL: exness_data_preprocess-0.3.1-py3-none-any.whl
- Upload date:
- Size: 37.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
f0f200d0b6ceaeb45d2f7c22b039d33a74f898578b1a4cf133e52b5e5389d8f9
|
|
| MD5 |
176a47c17721a2afeea61a79fc366b20
|
|
| BLAKE2b-256 |
251b44b057a1326eff47b74db822fa57b9611a7d83e74939d4632871099c9a71
|
Provenance
The following attestation bundles were made for exness_data_preprocess-0.3.1-py3-none-any.whl:
Publisher:
publish.yml on terrylica/exness-data-preprocess
-
Statement:
-
Statement type:
https://in-toto.io/Statement/v1 -
Predicate type:
https://docs.pypi.org/attestations/publish/v1 -
Subject name:
exness_data_preprocess-0.3.1-py3-none-any.whl -
Subject digest:
f0f200d0b6ceaeb45d2f7c22b039d33a74f898578b1a4cf133e52b5e5389d8f9 - Sigstore transparency entry: 611479924
- Sigstore integration time:
-
Permalink:
terrylica/exness-data-preprocess@2afa1d04d8a71428e81ce624bf7c50ae38dcb670 -
Branch / Tag:
refs/tags/v0.3.1 - Owner: https://github.com/terrylica
-
Access:
public
-
Token Issuer:
https://token.actions.githubusercontent.com -
Runner Environment:
github-hosted -
Publication workflow:
publish.yml@2afa1d04d8a71428e81ce624bf7c50ae38dcb670 -
Trigger Event:
release
-
Statement type: