High-Performance Data Pipeline for Financial Market Data with Qlib Integration

A production-ready data pipeline for processing Polygon.io S3 flat files into optimized formats for quantitative analysis and machine learning.

🎯 Key Features

  • Adaptive Processing: Automatically scales from 24GB workstations to 100GB+ servers
  • 70%+ Compression: Optimized Parquet and binary formats
  • Sub-Second Queries: Partitioned data lake with predicate pushdown
  • Incremental Updates: Process only new data using watermarks
  • Apple Silicon Optimized: 2-3x faster on M1/M2/M3 chips
  • Production Ready: Monitoring, alerting, validation, and error recovery
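The adaptive mode selection above can be pictured as a simple threshold rule over available RAM, with cutoffs matching the performance table in the next section. This is only a sketch; `choose_mode` is a hypothetical helper, not the pipeline's actual API:

```python
def choose_mode(available_gb: float) -> str:
    """Pick a processing mode from available RAM (thresholds from this README)."""
    if available_gb < 32:
        return "streaming"  # bounded memory, record-at-a-time
    if available_gb <= 64:
        return "batch"      # whole-file loads, chunked processing
    return "parallel"       # fan out across worker processes


# In practice the available figure could come from psutil (a listed dependency):
#   import psutil
#   gb = psutil.virtual_memory().available / 2**30
print(choose_mode(24))  # streaming
```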

📊 Performance

| Mode      | Memory  | Throughput | With Optimizations |
|-----------|---------|------------|--------------------|
| Streaming | < 32GB  | 100K rec/s | 500K rec/s         |
| Batch     | 32-64GB | 200K rec/s | 1M rec/s           |
| Parallel  | > 64GB  | 500K rec/s | 2M rec/s           |

🚀 Quick Start

Prerequisites

  • macOS (Apple Silicon or Intel) or Linux
  • Python 3.10+
  • 24GB+ RAM (recommended: 32GB+)
  • 1TB+ storage (SSD recommended)
  • Polygon.io account with S3 flat files access

Installation

  1. Install the uv package manager:

     curl -LsSf https://astral.sh/uv/install.sh | sh

  2. Clone the repository and set it up:

     git clone <repository-url>
     cd quantmini

     # Create project structure
     ./create_structure.sh

     # Create and activate virtual environment
     uv venv
     source .venv/bin/activate  # On macOS/Linux

  3. Install dependencies:

     uv pip install qlib polygon boto3 aioboto3 polars duckdb pyarrow psutil pyyaml

  4. Configure credentials:

     cp config/credentials.yaml.example config/credentials.yaml
     # Edit config/credentials.yaml with your Polygon API keys

  5. Run the system profiler:

     python -m src.core.system_profiler
     # This will create config/system_profile.yaml

First Run

# Run daily pipeline (processes latest data)
python scripts/run_daily_pipeline.py

# Or backfill historical data
python scripts/run_backfill.py --start-date 2024-01-01 --end-date 2024-12-31

๐Ÿ“ Project Structure

quantmini/
├── config/              # Configuration files
├── src/                 # Source code
│   ├── core/           # System profiling, memory monitoring
│   ├── download/       # S3 downloaders
│   ├── ingest/         # Data ingestion (streaming/batch/parallel)
│   ├── storage/        # Parquet data lake
│   ├── features/       # Feature engineering
│   ├── transform/      # Binary format conversion
│   ├── query/          # Query engine
│   └── orchestration/  # Pipeline orchestration
├── data/               # Data storage (not in git)
│   ├── lake/          # Parquet data lake
│   ├── binary/        # Qlib binary format
│   └── metadata/      # Watermarks, indexes
├── scripts/           # Command-line scripts
├── tests/             # Test suite
└── docs/              # Documentation

🔧 Configuration

Edit config/pipeline_config.yaml to customize:

  • Processing mode: adaptive, streaming, batch, or parallel
  • Data types: Enable/disable stocks, options, daily, minute data
  • Compression: Choose snappy (fast) or zstd (better compression)
  • Features: Configure which features to compute
  • Optimizations: Enable Apple Silicon, async downloads, etc.
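To give a feel for the file, a fragment of config/pipeline_config.yaml might look like the following. Only the two dotted keys quoted in the Troubleshooting section below are taken from this README; every other key name is an illustrative guess, so treat CONFIGURATION.md as authoritative:

```yaml
# Illustrative sketch -- see CONFIGURATION.md for the real schema
processing:
  mode: adaptive            # adaptive | streaming | batch | parallel
compression: zstd           # snappy (fast) or zstd (better compression)
optimizations:
  apple_silicon: true
  async_downloads:
    max_concurrent: 8       # lower this if you hit S3 rate limits
monitoring:
  profiling:
    enabled: false          # enable to diagnose slow runs
```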

See CONFIGURATION.md for details.

📚 Documentation

🧪 Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run specific test suite
pytest tests/unit/
pytest tests/integration/
pytest tests/performance/

๐Ÿ” Monitoring

Monitor pipeline health and performance from the command line:

# View health status
python scripts/check_health.py

# View performance metrics
cat logs/performance/performance_metrics.json

# Generate report
python scripts/generate_report.py

📊 Data Types

The pipeline processes four types of data from Polygon.io:

  1. Stock Daily Aggregates: Daily OHLCV for all US stocks
  2. Stock Minute Aggregates: Minute-level data per symbol
  3. Options Daily Aggregates: Daily options data per underlying
  4. Options Minute Aggregates: Minute-level options data (all contracts)
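To make the row shape concrete, here is a stdlib-only parse of two fabricated daily-aggregate rows. The column names follow a typical Polygon.io day-aggregates layout but are an assumption; check them against the actual flat files (which arrive as .csv.gz and would be opened with gzip.open):

```python
import csv
import io

# Two fabricated sample rows in an assumed day-aggregates layout.
SAMPLE = """\
ticker,volume,open,close,high,low,window_start,transactions
AAPL,51234567,189.50,190.12,191.00,188.75,1704171600000000000,412345
MSFT,23456789,372.10,374.58,375.20,371.90,1704171600000000000,198765
"""

def parse_day_aggs(text: str) -> list[dict]:
    """Parse day-aggregate CSV text into typed records."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "ticker": row["ticker"],
            "volume": int(row["volume"]),
            "open": float(row["open"]),
            "close": float(row["close"]),
            "high": float(row["high"]),
            "low": float(row["low"]),
        })
    return rows

records = parse_day_aggs(SAMPLE)
print(records[0]["ticker"], records[0]["close"])  # AAPL 190.12
```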

🎨 Architecture

S3 CSV.GZ Files
      ↓
Adaptive Ingestion (Streaming/Batch/Parallel)
      ↓
Parquet Data Lake (Partitioned)
      ↓
Feature Engineering (DuckDB/Polars)
      ↓
Qlib Binary Format (ML-Ready)

🚦 Pipeline Stages

  1. Download: Async S3 downloads with connection pooling
  2. Ingest: Adaptive processing based on available memory
  3. Validate: Data quality checks
  4. Enrich: Feature engineering (alpha, returns, etc.)
  5. Convert: Transform to qlib binary format
  6. Query: Fast access via DuckDB/Polars

๐Ÿ” Security

  • Never commit config/credentials.yaml (in .gitignore)
  • Store credentials in environment variables for production
  • Use AWS Secrets Manager or similar for cloud deployments
  • Rotate API keys regularly
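The env-var recommendation can be sketched as a lookup that prefers the environment and falls back to the loaded credentials file. Both POLYGON_API_KEY and the dictionary layout are illustrative names, not the project's documented interface:

```python
import os

def get_polygon_key(config: dict) -> str:
    """Prefer the environment; fall back to the parsed credentials YAML.

    The variable name and config layout here are assumptions.
    """
    key = os.environ.get("POLYGON_API_KEY")
    if key:
        return key
    try:
        return config["polygon"]["api_key"]
    except KeyError:
        raise RuntimeError("No Polygon API key configured")
```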

๐Ÿ› Troubleshooting

Memory Errors

# Reduce memory usage
export MAX_MEMORY_GB=16

# Force streaming mode
export PIPELINE_MODE=streaming

S3 Rate Limits

# Reduce concurrent downloads
# Edit config/pipeline_config.yaml:
# optimizations.async_downloads.max_concurrent: 4

Slow Performance

# Enable profiling
# Edit config/pipeline_config.yaml:
# monitoring.profiling.enabled: true

# Run and check logs/performance/

See TROUBLESHOOTING.md for more.

๐Ÿค Contributing

See CONTRIBUTING.md for development guidelines.

📈 Performance Tuning

See PERFORMANCE_TUNING.md for:

  • Apple Silicon optimizations
  • Memory tuning
  • Storage optimization
  • Query performance
  • Benchmarking

🗺️ Roadmap

  • Phase 0-4: Core pipeline (Weeks 1-10)
  • Phase 5-8: Features and queries (Weeks 11-18)
  • Phase 9-11: Orchestration and optimization (Weeks 19-24)
  • Phase 12-14: Monitoring and production (Weeks 25-28)

See IMPLEMENTATION_PLAN.md for detailed timeline.

📄 License

[Add your license here]

๐Ÿ™ Acknowledgments

  • Polygon.io: S3 flat files data source
  • Qlib: Quantitative investment framework
  • Polars: High-performance DataFrame library
  • DuckDB: Embedded analytical database

📧 Support


Built with: Python 3.10+, uv, qlib, polygon, polars, duckdb, pyarrow

Optimized for: macOS (Apple Silicon M1/M2/M3), 24GB+ RAM, SSD storage

Download files

Source distribution: quantmini-0.1.0.tar.gz (477.7 kB)
Built distribution: quantmini-0.1.0-py3-none-any.whl (72.4 kB)
File details

Details for the file quantmini-0.1.0.tar.gz.

File metadata

  • Download URL: quantmini-0.1.0.tar.gz
  • Upload date:
  • Size: 477.7 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for quantmini-0.1.0.tar.gz:

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | a1770a61e2798db4a4a2563b8397ab3fbd58ca7591985f3b88a3a52efc505634 |
| MD5         | d0a3407db0e05ac020fab95c1b5f4d57                                 |
| BLAKE2b-256 | adf64712c55d4a79fd3a9eb8ecedaa5eabda62a672c1617d0e12a451c49c916d |


Provenance

The following attestation bundles were made for quantmini-0.1.0.tar.gz:

Publisher: publish.yml on nittygritty-zzy/quantmini

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file quantmini-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: quantmini-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 72.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for quantmini-0.1.0-py3-none-any.whl:

| Algorithm   | Hash digest                                                      |
|-------------|------------------------------------------------------------------|
| SHA256      | e468f9b8e8a222161ff31c6387b63e4609d0eb2d348ce2638e18228a3270c133 |
| MD5         | 19b58b24faadabbde1a85aed3d46eb44                                 |
| BLAKE2b-256 | 07c871154dcdbb5460afcde235c227f733421aede202bed44c1a25091a1c7020 |


Provenance

The following attestation bundles were made for quantmini-0.1.0-py3-none-any.whl:

Publisher: publish.yml on nittygritty-zzy/quantmini

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
