High-Performance Data Pipeline for Financial Market Data with Qlib Integration
A production-ready data pipeline for processing Polygon.io S3 flat files into optimized formats for quantitative analysis and machine learning.
Key Features
- Adaptive Processing: Automatically scales from 24GB workstations to 100GB+ servers
- 70%+ Compression: Optimized Parquet and binary formats
- Sub-Second Queries: Partitioned data lake with predicate pushdown
- Incremental Updates: Process only new data using watermarks
- Apple Silicon Optimized: 2-3x faster on M1/M2/M3 chips
- Production Ready: Monitoring, alerting, validation, and error recovery
Performance
| Mode | Memory | Baseline Throughput | With Optimizations |
|---|---|---|---|
| Streaming | < 32GB | 100K rec/s | 500K rec/s |
| Batch | 32-64GB | 200K rec/s | 1M rec/s |
| Parallel | > 64GB | 500K rec/s | 2M rec/s |
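How a mode gets picked can be sketched in a few lines. This is illustrative only: the helper name and thresholds simply mirror the table above (the real selection logic lives in the system profiler and may differ); psutil is already in the dependency list.

```python
import psutil

def select_mode(total_gb: float | None = None) -> str:
    """Illustrative mode selection mirroring the table above."""
    if total_gb is None:
        # Total system memory in GiB
        total_gb = psutil.virtual_memory().total / 1024**3
    if total_gb < 32:
        return "streaming"   # bounded memory, chunked processing
    if total_gb <= 64:
        return "batch"       # larger in-memory batches
    return "parallel"        # multi-process ingestion

print(select_mode())  # e.g. "batch" on a 48GB machine
```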
Quick Start
Prerequisites
- macOS (Apple Silicon or Intel) or Linux
- Python 3.10+
- 24GB+ RAM (recommended: 32GB+)
- 1TB+ storage (SSD recommended)
- Polygon.io account with S3 flat files access
Installation
- Install uv package manager:
curl -LsSf https://astral.sh/uv/install.sh | sh
- Clone and setup:
git clone <repository-url>
cd quantmini
# Create project structure
./create_structure.sh
# Create and activate virtual environment
uv venv
source .venv/bin/activate # On macOS/Linux
- Install dependencies:
uv pip install qlib polygon boto3 aioboto3 polars duckdb pyarrow psutil pyyaml
- Configure credentials:
cp config/credentials.yaml.example config/credentials.yaml
# Edit config/credentials.yaml with your Polygon API keys
- Run system profiler:
python -m src.core.system_profiler
# This will create config/system_profile.yaml
First Run
# Run daily pipeline (processes latest data)
python scripts/run_daily_pipeline.py
# Or backfill historical data
python scripts/run_backfill.py --start-date 2024-01-01 --end-date 2024-12-31
Project Structure
quantmini/
├── config/            # Configuration files
├── src/               # Source code
│   ├── core/          # System profiling, memory monitoring
│   ├── download/      # S3 downloaders
│   ├── ingest/        # Data ingestion (streaming/batch/parallel)
│   ├── storage/       # Parquet data lake
│   ├── features/      # Feature engineering
│   ├── transform/     # Binary format conversion
│   ├── query/         # Query engine
│   └── orchestration/ # Pipeline orchestration
├── data/              # Data storage (not in git)
│   ├── lake/          # Parquet data lake
│   ├── binary/        # Qlib binary format
│   └── metadata/      # Watermarks, indexes
├── scripts/           # Command-line scripts
├── tests/             # Test suite
└── docs/              # Documentation
Configuration
Edit config/pipeline_config.yaml to customize:
- Processing mode: adaptive, streaming, batch, or parallel
- Data types: Enable/disable stocks, options, daily, minute data
- Compression: Choose snappy (fast) or zstd (better compression)
- Features: Configure which features to compute
- Optimizations: Enable Apple Silicon, async downloads, etc.
See CONFIGURATION.md for details.
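For orientation, here is a sketch of what such a config might look like. The only keys taken from this README are optimizations.async_downloads.max_concurrent and monitoring.profiling.enabled (see Troubleshooting); everything else is an assumed layout, so defer to CONFIGURATION.md for the real schema.

```yaml
# Illustrative sketch only -- most key names are assumptions; see CONFIGURATION.md.
processing:
  mode: adaptive            # adaptive | streaming | batch | parallel
data_types:
  stocks_daily: true
  stocks_minute: true
  options_daily: false
  options_minute: false
compression: zstd           # snappy (fast) or zstd (better compression)
optimizations:
  apple_silicon: true
  async_downloads:
    max_concurrent: 8       # lower this if you hit S3 rate limits
monitoring:
  profiling:
    enabled: false          # turn on to diagnose slow runs
```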
Documentation
- Implementation Plan: 28-week roadmap
- Project Memory: Design principles and patterns
- Project Structure: Complete directory layout
- Design Doc: Architecture details
Testing
# Run all tests
pytest
# Run with coverage
pytest --cov=src --cov-report=html
# Run specific test suite
pytest tests/unit/
pytest tests/integration/
pytest tests/performance/
Monitoring
Access monitoring dashboards:
# View health status
python scripts/check_health.py
# View performance metrics
cat logs/performance/performance_metrics.json
# Generate report
python scripts/generate_report.py
Data Types
The pipeline processes four types of data from Polygon.io:
- Stock Daily Aggregates: Daily OHLCV for all US stocks
- Stock Minute Aggregates: Minute-level data per symbol
- Options Daily Aggregates: Daily options data per underlying
- Options Minute Aggregates: Minute-level options data (all contracts)
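Once ingested, any of these datasets can be read straight from the Parquet lake. Below is a minimal Polars sketch for stock daily aggregates; the partition path and the symbol/date column names are assumptions (only the OHLCV fields are stated above), so adjust to the actual lake layout.

```python
import polars as pl

# Hypothetical partition path and column names; adjust to the actual lake layout.
lake = "data/lake/stocks_daily"

df = (
    pl.scan_parquet(f"{lake}/**/*.parquet")          # lazy scan across partitions
    .filter(
        (pl.col("symbol") == "AAPL")
        & (pl.col("date") >= pl.date(2024, 1, 1))    # predicates push down to Parquet
    )
    .select("date", "open", "high", "low", "close", "volume")
    .collect()
)
print(df.tail())
```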
Architecture
S3 CSV.GZ Files
    ↓
Adaptive Ingestion (Streaming/Batch/Parallel)
    ↓
Parquet Data Lake (Partitioned)
    ↓
Feature Engineering (DuckDB/Polars)
    ↓
Qlib Binary Format (ML-Ready)
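The last stage is what Qlib consumes. A minimal sketch of loading the converted data through Qlib's standard data API, assuming the binary output sits under data/binary/ as in the project layout above:

```python
import qlib
from qlib.config import REG_US
from qlib.data import D

# Hypothetical provider path -- point it at the pipeline's Qlib binary output.
qlib.init(provider_uri="data/binary", region=REG_US)

# Standard Qlib field syntax: $open, $close, $volume, ...
df = D.features(
    instruments=["AAPL", "MSFT"],
    fields=["$open", "$close", "$volume"],
    start_time="2024-01-01",
    end_time="2024-06-30",
    freq="day",
)
print(df.head())
```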
Pipeline Stages
- Download: Async S3 downloads with connection pooling
- Ingest: Adaptive processing based on available memory
- Validate: Data quality checks
- Enrich: Feature engineering (alpha, returns, etc.)
- Convert: Transform to qlib binary format
- Query: Fast access via DuckDB/Polars
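The incremental behavior mentioned under Key Features comes from watermarks kept in data/metadata/. A minimal sketch of the idea, with a hypothetical file name and JSON format (the pipeline's own bookkeeping may differ):

```python
import json
from datetime import date, timedelta
from pathlib import Path

# Hypothetical watermark file; the pipeline keeps its own under data/metadata/.
WATERMARKS = Path("data/metadata/watermarks.json")

def load_watermark(dataset: str) -> date | None:
    """Return the last ingested date for a dataset, if any."""
    if not WATERMARKS.exists():
        return None
    stamps = json.loads(WATERMARKS.read_text())
    return date.fromisoformat(stamps[dataset]) if dataset in stamps else None

def save_watermark(dataset: str, day: date) -> None:
    """Record the most recent successfully ingested date."""
    stamps = json.loads(WATERMARKS.read_text()) if WATERMARKS.exists() else {}
    stamps[dataset] = day.isoformat()
    WATERMARKS.parent.mkdir(parents=True, exist_ok=True)
    WATERMARKS.write_text(json.dumps(stamps, indent=2))

# Only days newer than the watermark need to be downloaded and ingested.
last = load_watermark("stocks_daily") or date(2024, 1, 1)
next_day = last + timedelta(days=1)
print(f"resume ingestion at {next_day}")
# ...download + ingest next_day, then: save_watermark("stocks_daily", next_day)
```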
Security
- Never commit config/credentials.yaml (it is in .gitignore)
- Store credentials in environment variables for production (see the sketch below)
- Use AWS Secrets Manager or similar for cloud deployments
- Rotate API keys regularly
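For production, a minimal sketch of reading credentials from the environment instead of the YAML file; the variable names here are placeholders, not names the pipeline defines:

```python
import os

# Placeholder variable names -- pick whatever your deployment uses.
polygon_api_key = os.environ["POLYGON_API_KEY"]
s3_access_key = os.environ["POLYGON_S3_ACCESS_KEY_ID"]
s3_secret_key = os.environ["POLYGON_S3_SECRET_ACCESS_KEY"]
```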
Troubleshooting
Memory Errors
# Reduce memory usage
export MAX_MEMORY_GB=16
# Force streaming mode
export PIPELINE_MODE=streaming
S3 Rate Limits
# Reduce concurrent downloads
# Edit config/pipeline_config.yaml:
# optimizations.async_downloads.max_concurrent: 4
Slow Performance
# Enable profiling
# Edit config/pipeline_config.yaml:
# monitoring.profiling.enabled: true
# Run and check logs/performance/
See TROUBLESHOOTING.md for more.
Contributing
See CONTRIBUTING.md for development guidelines.
Performance Tuning
See PERFORMANCE_TUNING.md for:
- Apple Silicon optimizations
- Memory tuning
- Storage optimization
- Query performance
- Benchmarking
Roadmap
- Phase 0-4: Core pipeline (Weeks 1-10)
- Phase 5-8: Features and queries (Weeks 11-18)
- Phase 9-11: Orchestration and optimization (Weeks 19-24)
- Phase 12-14: Monitoring and production (Weeks 25-28)
See IMPLEMENTATION_PLAN.md for detailed timeline.
License
[Add your license here]
Acknowledgments
- Polygon.io: S3 flat files data source
- Qlib: Quantitative investment framework
- Polars: High-performance DataFrame library
- DuckDB: Embedded analytical database
Support
- Documentation: docs/
- Issues: GitHub Issues
- Email: your-email@example.com
Built with: Python 3.10+, uv, qlib, polygon, polars, duckdb, pyarrow
Optimized for: macOS (Apple Silicon M1/M2/M3), 24GB+ RAM, SSD storage
File details
Details for the file quantmini-0.1.0.tar.gz.
File metadata
- Download URL: quantmini-0.1.0.tar.gz
- Upload date:
- Size: 477.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a1770a61e2798db4a4a2563b8397ab3fbd58ca7591985f3b88a3a52efc505634 |
| MD5 | d0a3407db0e05ac020fab95c1b5f4d57 |
| BLAKE2b-256 | adf64712c55d4a79fd3a9eb8ecedaa5eabda62a672c1617d0e12a451c49c916d |
Provenance
The following attestation bundles were made for quantmini-0.1.0.tar.gz:
Publisher: publish.yml on nittygritty-zzy/quantmini
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: quantmini-0.1.0.tar.gz
- Subject digest: a1770a61e2798db4a4a2563b8397ab3fbd58ca7591985f3b88a3a52efc505634
- Sigstore transparency entry: 574991831
- Sigstore integration time:
- Permalink: nittygritty-zzy/quantmini@6f8fbf599adb16f2bb7567ee8e10f80c0195a512
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/nittygritty-zzy
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6f8fbf599adb16f2bb7567ee8e10f80c0195a512
- Trigger Event: release
File details
Details for the file quantmini-0.1.0-py3-none-any.whl.
File metadata
- Download URL: quantmini-0.1.0-py3-none-any.whl
- Upload date:
- Size: 72.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e468f9b8e8a222161ff31c6387b63e4609d0eb2d348ce2638e18228a3270c133 |
| MD5 | 19b58b24faadabbde1a85aed3d46eb44 |
| BLAKE2b-256 | 07c871154dcdbb5460afcde235c227f733421aede202bed44c1a25091a1c7020 |
Provenance
The following attestation bundles were made for quantmini-0.1.0-py3-none-any.whl:
Publisher: publish.yml on nittygritty-zzy/quantmini
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: quantmini-0.1.0-py3-none-any.whl
- Subject digest: e468f9b8e8a222161ff31c6387b63e4609d0eb2d348ce2638e18228a3270c133
- Sigstore transparency entry: 574991833
- Sigstore integration time:
- Permalink: nittygritty-zzy/quantmini@6f8fbf599adb16f2bb7567ee8e10f80c0195a512
- Branch / Tag: refs/tags/v0.1.0
- Owner: https://github.com/nittygritty-zzy
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@6f8fbf599adb16f2bb7567ee8e10f80c0195a512
- Trigger Event: release