DQF - Data Quality Framework
MIF-certified, production-ready data quality framework for OHLCV financial data.
DQF v1.1 validates financial time series through CORE and ADVISORY checks, produces MIF-Lite manifests with a cryptographic MIF-UID and a MIF Purity Index (MPI), and supports two operational modes: CERTIFICATION (strict, deterministic) and DIAGNOSTIC (advisory, flexible).
🎯 Why DQF Exists
The Fundamental Problem
Garbage In, Garbage Out (GIGO): No matter how sophisticated your trading algorithms or statistical models, if your input data is corrupted, all results are invalid.
Statistical Reality:
- 80% of quantitative strategies fail in production not because of flawed logic, but because of corrupted data during development
- Data quality issues are detected on average 6 months after deployment
- A single corrupted data point can invalidate months of backtesting
The Philosophy of Purification
DQF embodies the principle of systematic purification before critical operations:
Historical Precedents:
- Medicine: Hand washing before obstetric examinations (Semmelweis, 1847) - reduced maternal mortality from ~18% to ~2%
- Laboratory Science: Sterile technique before experiments - ensures reproducibility
- Software Engineering: Input validation before processing - prevents crashes
- Quantitative Finance: DQF (data purification) before analysis - guarantees validity
Cultural Parallels (methodological, not spiritual):
- Islam: Wuḍūʾ (ablution) - 7 ritual cleansings before Salat (prayer)
- Shinto: Temizuya (water purification) before entering a shrine
- Laboratory: Autoclave sterilization before cell culture
- DQF: 7 systematic checks before quantitative analysis
Core Principle: Without purification, the critical operation (analysis/trading) produces unreliable results.
✨ What DQF Does
Dual Mission
1. Validation: Detect and report data quality issues
- Identifies violations of market physics (H<L, negative volume, etc.)
- Detects statistical anomalies (extreme returns, forward-fill abuse)
- Validates structural integrity (timezone, calendar, duplicates)
2. Purification: Generate certified clean datasets
- Produces validated DataFrames with full provenance tracking
- Guarantees reproducibility (same data → same results, always)
- Enables consistent analysis across teams and time
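To make the detection side concrete, here is a minimal, hypothetical pandas sketch of the kind of market-physics screening that check C2 performs; it is an illustration in plain pandas, not DQF's internal implementation:

import pandas as pd

# Hypothetical data with deliberate violations in the second row
df = pd.DataFrame({
    "open":   [100.0, 101.0],
    "high":   [102.0, 100.5],   # row 2: high < low (impossible)
    "low":    [99.0, 101.5],
    "close":  [101.0, 100.0],
    "volume": [1_000_000, -5],  # row 2: negative volume (impossible)
})

violations = {
    "high < low":        int((df["high"] < df["low"]).sum()),
    "high < open/close": int(((df["high"] < df["open"]) | (df["high"] < df["close"])).sum()),
    "negative volume":   int((df["volume"] < 0).sum()),
    "NaN values":        int(df.isna().sum().sum()),
}
print(violations)  # {'high < low': 1, 'high < open/close': 1, 'negative volume': 1, 'NaN values': 0}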
The DQF Guarantee
When DQF certifies a dataset (report.is_certified is True):
- ✅ Data respects market physics laws
- ✅ No statistical anomalies detected
- ✅ Complete provenance chain tracked
- ✅ Dataset certified for production use
This is not just validation - it's data certification.
🔬 Core Benefits
For Quantitative Researchers
Problem: Corrupted data during backtesting → false conclusions
# Without DQF: Unknown data quality
backtest_results = strategy.run(data)  # 💥 May be invalid
paper.publish(backtest_results)        # 💥 Non-reproducible
Solution: Certified clean data → reliable backtests
# With DQF: Certified data quality
report = validator.validate(data)
if report.is_certified:
    backtest_results = strategy.run(report.cleaned_data)  # ✅ Valid
    paper.publish(backtest_results)                       # ✅ Reproducible
Benefits:
- ✅ Reproducible research (same data → same results)
- ✅ Peer review confidence (provenance tracking)
- ✅ Publication credibility (certified datasets)
For Trading Systems
Problem: Data corruption in production → catastrophic losses
# Without DQF: Unknown data quality
live_data = fetch_latest()
signal = model.predict(live_data)  # 💥 May be based on corrupted data
execute_trade(signal)              # 💥 Potential disaster
Solution: Real-time validation → safe trading
# With DQF: Real-time validation
live_data = fetch_latest()
report = validator.validate(live_data)
if report.is_certified:
    signal = model.predict(report.cleaned_data)  # ✅ Safe
    execute_trade(signal)                        # ✅ Confident
else:
    alert_team(report.all_issues)  # 🚨 Data quality issue
    halt_trading()                 # Safety first
Benefits:
- ✅ Risk mitigation (detect issues before trading)
- ✅ Regulatory compliance (audit trail)
- ✅ Post-mortem analysis (provenance tracking)
For Data Engineers
Problem: Silent data corruption in pipelines
# Without DQF: Silent failures
raw_data = extract_from_source()
transformed = apply_transformations(raw_data)  # 💥 May propagate corruption
load_to_warehouse(transformed)                 # 💥 Garbage persisted
Solution: Validation checkpoints → data integrity
# With DQF: Validated pipeline
raw_data = extract_from_source()

# Checkpoint 1: Validate raw data
raw_report = validator.validate(raw_data)
assert raw_report.is_certified

transformed = apply_transformations(raw_report.cleaned_data)

# Checkpoint 2: Validate transformed data
final_report = validator.validate(transformed)
assert final_report.is_certified

load_to_warehouse(final_report.cleaned_data)  # ✅ Only clean data persisted
Benefits:
- ✅ Early detection (issues caught immediately)
- ✅ Data lineage (full provenance chain)
- ✅ Quality metrics (SLA monitoring)
🚀 Quick Start
Installation
pip install mif-dqf
Note on package name vs. import name: the PyPI package is `mif-dqf`, but the Python import is `from dqf import ...` (not `import mif_dqf`). This is intentional: `dqf` is the canonical module namespace.
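A one-line smoke test of that wiring, using only the documented entry point:

python -c "from dqf import DQFValidator; print('dqf import OK')"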
Basic Usage
import pandas as pd
from dqf import DQFValidator, DQFConfig, DQFMode

# Load your data (timezone-aware index required)
data = pd.read_csv("spy.csv", index_col=0, parse_dates=True)
data.index = data.index.tz_localize("UTC")

# CERTIFICATION mode: strict, deterministic, calendar required
config = DQFConfig(mode=DQFMode.CERTIFICATION)
validator = DQFValidator(config)
report = validator.validate(data, calendar="NYSE")

if report.is_certified:
    print(f"✅ CERTIFIED MPI={report.purity_index:.1f}/100 gate={report.precondition_gate}")
    print(f"   UID: {report.mif_uid}")
    clean_data = report.cleaned_data  # validated DataFrame
    report.print_summary()            # human-readable summary
else:
    print(f"❌ {report.overall_status} gate={report.precondition_gate}")
    print(f"   CORE: {report.core_results}")
    print(f"   ADVISORY: {report.advisory_results}")
DIAGNOSTIC mode (no calendar required, useful for exploration):
config = DQFConfig(mode=DQFMode.DIAGNOSTIC)
report = DQFValidator(config).validate(data)
print(f"Status: {report.overall_status} MPI: {report.purity_index:.1f}")
Output (CERTIFICATION, clean data):
✅ CERTIFIED MPI=100.0/100 gate=1.0
   UID: sha256:a3f9...
📋 DQF v1.1 Checks
CORE checks: any failure → STATUS_VOID, precondition_gate = 0.0
| ID | Check | Purpose |
|---|---|---|
| PROD | Envelope seal | Output trust mechanism (always injected PASS) |
| C2 | OHLCV Integrity | Market physics (H≥L, H≥O/C, V≥0, no NaN) |
| C3 | Calendar Alignment | Declared calendar required in CERTIFICATION mode |
| C5 | Index Traceability | Unique, chronological, timezone-aware index |
ADVISORY checks: warnings → STATUS_WARNING, gate capped by MPI
| ID | Check | Purpose |
|---|---|---|
| C1 | Source Uniqueness | Single canonical source (SKIP in Phase 1; DAL pending) |
| C4 | Forward-Fill Detection | Detects interpolation abuse (consecutive repeats) |
Removed in v1.1: C6 (Sanity Tests) migrated to MIF Layer 1; C7 (Logging) replaced by PROD envelope.
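To make the MPI mechanics concrete, here is an illustrative calculation using the formula shown in the architecture diagram below, MPI = 100×(1−Σwᵢ/N); the per-check weights are assumed equal here for illustration and are not taken from the spec:

# Illustrative MPI arithmetic (equal weights assumed; see DQF_SPECIFICATION.md
# for the authoritative definition)
N = 2                   # advisory checks in v1.1: C1 and C4
failed_weights = [1.0]  # assume C4 warned (weight 1.0) and C1 did not fail
mpi = 100 * (1 - sum(failed_weights) / N)
print(mpi)              # 50.0 under these assumed weights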
📚 Complete Examples
Example 1: Research Workflow
import pandas as pd
from pathlib import Path
from dqf import DQFValidator, DQFConfig, DQFMode

# Research scenario: certifying historical data for a paper
data = pd.read_csv("spy_2020_2024.csv", index_col=0, parse_dates=True)
data.index = data.index.tz_localize("UTC")

config = DQFConfig(mode=DQFMode.CERTIFICATION)
report = DQFValidator(config).validate(data, calendar="NYSE")

if report.is_certified:
    # Save certified dataset with provenance
    report.cleaned_data.to_csv("spy_2020_2024_certified.csv")
    Path("provenance_spy.json").write_text(report.to_json())
    print(f"✅ CERTIFIED MPI={report.purity_index:.1f}/100")
    print(f"   MIF-UID: {report.mif_uid}")
else:
    print(f"❌ {report.overall_status} (gate={report.precondition_gate})")
    print(f"   CORE failures: {report.core_results}")
Example 2: Production Pipeline
import logging

import pandas as pd
from dqf import DQFValidator, DQFConfig, DQFMode

logger = logging.getLogger(__name__)

# Shared validator: reuse across calls (thread-safe for validate())
_config = DQFConfig(mode=DQFMode.CERTIFICATION, c4_warn_threshold=1)
_validator = DQFValidator(_config)

def validate_daily_data(symbol: str, calendar: str, data: pd.DataFrame) -> pd.DataFrame:
    """Certify daily data; raise on VOID."""
    report = _validator.validate(data, calendar=calendar)
    if report.overall_status == "VOID":
        logger.critical("%s: VOID core=%s", symbol, report.core_results)
        raise ValueError(f"CORE failure for {symbol} (gate=0)")
    if report.overall_status == "WARNING":
        logger.warning("%s: WARNING advisory=%s MPI=%.1f",
                       symbol, report.advisory_results, report.purity_index)
    logger.info("%s: %s MPI=%.1f UID=%s",
                symbol, report.overall_status, report.purity_index, report.mif_uid)
    return report.cleaned_data

# Usage
try:
    clean = validate_daily_data("SPY", "NYSE", raw_data)
    load_to_warehouse(clean)
except ValueError as exc:
    alert_team(str(exc))
    halt_pipeline()
Example 3: Batch Processing
import pandas as pd
from pathlib import Path
from dqf import DQFValidator, DQFConfig, DQFMode

CALENDAR = {"BTC-USD": "CRYPTO_247", "ETH-USD": "CRYPTO_247",
            "SPY": "NYSE", "GLD": "NYSE"}

config = DQFConfig(mode=DQFMode.CERTIFICATION)
validator = DQFValidator(config)

results = {}
for symbol, calendar in CALENDAR.items():
    data = pd.read_csv(f"{symbol}.csv", index_col=0, parse_dates=True)
    data.index = data.index.tz_localize("UTC")
    results[symbol] = validator.validate(data, calendar=calendar)
    print(f"{symbol}: {results[symbol].overall_status} MPI={results[symbol].purity_index:.1f}")

# Keep only certified datasets
certified = {s: r.cleaned_data for s, r in results.items() if r.is_certified}
print(f"\n{len(certified)}/{len(CALENDAR)} datasets CERTIFIED")

# Persist manifests
for symbol, report in results.items():
    Path(f"manifests/{symbol}.mif.json").write_text(report.to_json())
Example 4: Custom Check
See examples/04_custom_check.py for a complete example. Custom checks extend
BaseCheck and are added to DQFValidator._checks before calling validate().
from dqf import DQFValidator, DQFConfig, DQFMode
from dqf.checks.base import BaseCheck
from dqf.core.enums import STATUS_FAIL, STATUS_PASS

class LiquidityCheck(BaseCheck):
    """Advisory check: minimum daily volume."""

    def __init__(self, min_vol: float = 1_000_000):
        super().__init__(check_id="check_custom_liquidity",
                         check_name="Minimum Liquidity")
        self.min_vol = min_vol

    def run(self, data, **kwargs):
        low = (data["volume"] < self.min_vol).sum()
        if low:
            return self._create_result(
                status=STATUS_FAIL,
                message=f"{low} days below minimum volume ({self.min_vol:,.0f})",
                details={"low_volume_days": low},
            )
        return self._create_result(status=STATUS_PASS, message="Liquidity OK")

config = DQFConfig(mode=DQFMode.DIAGNOSTIC)
validator = DQFValidator(config)
validator._checks["C_LIQ"] = LiquidityCheck(min_vol=500_000)
report = validator.validate(data)
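Note that `_checks` is an underscore-prefixed (private) attribute: registering a custom check this way relies on an internal detail of v1.1 that may change between releases, so pin the DQF version if you depend on it.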
🏗️ Architecture
┌───────────────────────────────────────────┐
│ Input: Raw DataFrame (OHLCV)              │
│  - Potentially corrupted                  │
│  - Unknown quality                        │
└────────────────┬──────────────────────────┘
                 ▼
┌───────────────────────────────────────────┐
│ DQFValidator (mode: CERT | DIAG)          │
│  CORE checks (failure → VOID)             │
│  ┌─────────────────────────────────────┐  │
│  │ C2. OHLCV Integrity                 │  │
│  │ C3. Calendar Alignment              │  │
│  │ C5. Index Traceability              │  │
│  └─────────────────────────────────────┘  │
│  ADVISORY checks (warn → WARNING)         │
│  ┌─────────────────────────────────────┐  │
│  │ C1. Source Uniqueness (SKIP/P1)     │  │
│  │ C4. Forward-Fill Detection          │  │
│  └─────────────────────────────────────┘  │
└────────────────┬──────────────────────────┘
                 ▼
┌───────────────────────────────────────────┐
│ PROD Envelope (MIF-Lite manifest)         │
│  ┌─────────────────────────────────────┐  │
│  │ MIF-UID = SHA-256(hash+ver+cal)     │  │
│  │ MPI = 100×(1−Σwᵢ/N)                 │  │
│  │ gate = 1.0 / 0.8 / 0.0              │  │
│  └─────────────────────────────────────┘  │
└────────────────┬──────────────────────────┘
                 ▼
┌───────────────────────────────────────────┐
│ Output: DQFReport (.mif.json)             │
│  - overall_status: CERTIFIED/WARNING/VOID │
│  - purity_index: 0–100 (MPI)              │
│  - precondition_gate: 0.0/0.8/1.0         │
│  - mif_uid: sha256:...                    │
│  - cleaned_data: validated DataFrame      │
└───────────────────────────────────────────┘
Design Principles:
- Deterministic: Same data + same args → same MIF-UID (always)
- Dual mode: CERTIFICATION (strict) vs DIAGNOSTIC (advisory)
- MPI: continuous purity score replaces binary PASS/FAIL
- Production-Ready: 189/189 tests passing
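A minimal sketch of what the determinism principle means in practice (reusing the validator and data from the Quick Start; a usage illustration, not a test from the suite):

# Same data + same arguments should always yield the same MIF-UID
r1 = validator.validate(data, calendar="NYSE")
r2 = validator.validate(data, calendar="NYSE")
assert r1.mif_uid == r2.mif_uid  # deterministic certification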
📖 Documentation
- DQF Specification: Canonical architectural decisions (v1.1 design)
- API Reference: Complete API documentation
- Architecture: Design patterns and technical decisions
- Troubleshooting: Common issues and solutions
- Examples: 4 complete examples (basic, config, batch, custom)
🧪 Testing & Quality
# Run all tests
pytest tests/ -v # 189/189 passing
# Coverage
pytest tests/ --cov=dqf
# Examples
python examples/01_basic_validation.py   # ✅ Works
python examples/02_custom_config.py      # ✅ Works
python examples/03_batch_processing.py   # ✅ Works
python examples/04_custom_check.py       # ✅ Works
Quality Metrics:
- 189 tests (164 unit + 25 integration)
- 0 failures
📦 Project Structure
dqf/
├── dqf/                       # Source code
│   ├── checks/                # C1–C5 checks (v1.1.0)
│   ├── core/                  # Config, Validator, Report, PRODEnvelope
│   └── utils/                 # Calendar utilities, MPI
├── tests/                     # Test suite (189 tests)
│   ├── unit/                  # Per-module unit tests
│   └── integration/           # End-to-end pipeline tests
├── examples/                  # Complete examples (4)
├── docs/
│   ├── DQF_SPECIFICATION.md   # Canonical specification (v1.1)
│   ├── API.md                 # API reference
│   ├── ARCHITECTURE.md        # Design & patterns
│   └── TROUBLESHOOTING.md     # Common issues
├── scripts/
│   └── test_install.py        # Installation smoke test
├── pyproject.toml             # Package metadata
└── LICENSE                    # MIT License
🛠️ Development
Requirements
- Python 3.10+
- pandas >= 2.0.0
- PyYAML >= 6.0
Setup
# Clone repository
git clone https://github.com/symbioticode/mif-dqf.git
cd mif-dqf
# Install in editable mode
pip install -e .
# Install dev dependencies
pip install -e ".[dev]"
# Run tests
pytest tests/ -v
Contributing
Contributions welcome! Please see CONTRIBUTING.md for guidelines.
📊 Benchmarks
Performance (100 days of data):
Total validation time: ~0.6s
- C2 (Integrity): 0.32s
- C3 (Calendar): 0.10s
- C4 (Ffill): 0.10s
- C5 (Index): 0.08s
- PROD Envelope: <0.01s
Scalability:
- 100 days: ~0.6s
- 1,000 days: ~2.0s
- 10,000 days: ~15s
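To reproduce these numbers on your own data, a wall-clock measurement around a single validate() call is enough (a sketch reusing the Quick Start objects; absolute timings vary by machine):

import time

t0 = time.perf_counter()
report = validator.validate(data, calendar="NYSE")
elapsed = time.perf_counter() - t0
print(f"{len(data)} rows validated in {elapsed:.2f}s (status={report.overall_status})")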
🗺️ Roadmap
v1.0.0 (legacy)
- 7 checks (Source, Integrity, Calendar, Forward-Fill, Index, Sanity, Logging)
- Binary PASS/FAIL report
v1.1.0 (current) ✅
- ✅ Two operational modes: CERTIFICATION (strict) vs DIAGNOSTIC (advisory)
- ✅ CORE/ADVISORY check classification: CORE failure → VOID, gate=0
- ✅ PROD envelope produces MIF-Lite manifest (.mif.json)
- ✅ MIF Purity Index (MPI): 0–100 continuous purity score
- ✅ MIF-UID: SHA-256(data_hash + dqf_version + calendar + mode)
- ✅ C6 (Sanity) migrated to MIF Layer 1; C7 (Logging) replaced by PROD envelope
- ✅ 189/189 tests passing
v1.2.0 - Active cleaning (planned)
- Optional deterministic data transformation in CERTIFICATION mode
- Before/after diff reports
v2.0.0 - MIF integration (planned)
- DAL integration (get_certified_data())
- C1 (Source Uniqueness) activated: DAL handoff
- Full provenance chain: source → DQF → MIF
🤝 Ecosystem
DQF is the foundational layer of the MIF (Metric Integrity Framework) ecosystem.
MIF Layers 1–5  =  Metric certification & strategy validation
      ↑  (score capped if DQF precondition fails)
DAL             =  Multi-source data abstraction [planned]
      ↑
DQF             =  Data quality gate  [YOU ARE HERE]
      ↑
Raw Sources     =  Yahoo Finance, Binance, Kraken, etc.
DQF acts as a precondition_gate: if data does not pass DQF, downstream MIF scores are bounded regardless of metric quality. See DQF_SPECIFICATION.md for the full integration contract.
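As a sketch only, one plausible reading of that bound is multiplicative capping; the function name and formula below are assumptions for illustration, and the authoritative contract lives in DQF_SPECIFICATION.md:

# Hypothetical illustration of gate capping (not the DQF/MIF API)
def capped_mif_score(raw_metric_score: float, precondition_gate: float) -> float:
    return raw_metric_score * precondition_gate

print(capped_mif_score(92.0, 0.8))  # 73.6 -> WARNING data bounds the score
print(capped_mif_score(92.0, 0.0))  # 0.0  -> VOID data voids the score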
📄 License
MIT License - see LICENSE file for details.
🙏 Acknowledgments
- Methodology: Systematic purification as scientific hygiene
- Inspiration: Medical sterilization, laboratory protocols
- Cultural parallels: Islamic Wuḍūʾ, Shinto Temizuya (methodological, not spiritual)
- Tools: pandas, pytest, PyYAML
📫 Contact & Support
- Repository: github.com/symbioticode/mif-dqf
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: corail.synergia@proton.me
⭐ Star History
If DQF helps your research or trading, please consider giving it a star! โญ
Made with rigor by the DQF Team
"Data hygiene is not optional. It's the foundation of reliable quantitative analysis."