High-Performance Medallion Architecture Data Pipeline for Financial Market Data with Qlib Integration

High-Performance Data Pipeline for Financial Market Data

A production-ready data pipeline for processing Polygon.io S3 flat files into optimized formats for quantitative analysis and machine learning.

🎯 Key Features

  • Command-Line Interface: Complete CLI for all operations (quantmini command)
  • Adaptive Processing: Automatically scales from 24GB workstations to 100GB+ servers
  • 70%+ Compression: Optimized Parquet and binary formats
  • Sub-Second Queries: Partitioned data lake with predicate pushdown
  • Incremental Updates: Process only new data using watermarks
  • Apple Silicon Optimized: 2-3x faster on M1/M2/M3 chips
  • Production Ready: Monitoring, alerting, validation, and error recovery
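
The watermark idea behind incremental updates can be sketched as below. This is a minimal illustration under stated assumptions, not quantmini's actual implementation: the JSON file layout and the helper names `load_watermark`, `save_watermark`, and `pending_dates` are hypothetical.

```python
import json
from datetime import date, timedelta
from pathlib import Path
from typing import List, Optional


def load_watermark(path: Path) -> Optional[date]:
    """Return the last processed date, or None if nothing was processed yet."""
    if not path.exists():
        return None
    return date.fromisoformat(json.loads(path.read_text())["last_processed"])


def save_watermark(path: Path, day: date) -> None:
    """Persist the newest processed date (hypothetical JSON layout)."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"last_processed": day.isoformat()}))


def pending_dates(watermark: Optional[date], start: date, end: date) -> List[date]:
    """List the dates in [start, end] that are newer than the watermark."""
    first = start if watermark is None else max(start, watermark + timedelta(days=1))
    # A non-positive span yields an empty list, i.e. nothing left to process.
    return [first + timedelta(days=i) for i in range((end - first).days + 1)]
```

A run would process each date returned by `pending_dates` and call `save_watermark` after each one succeeds, so that a rerun skips everything already ingested.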

📊 Performance

Mode        Memory     Throughput    With Optimizations
Streaming   < 32GB     100K rec/s    500K rec/s
Batch       32-64GB    200K rec/s    1M rec/s
Parallel    > 64GB     500K rec/s    2M rec/s
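
The memory thresholds in this table suggest a simple selection rule. The sketch below only illustrates that rule; the function name `select_mode` is hypothetical, and the pipeline's real adaptive logic may weigh other factors.

```python
def select_mode(memory_gb: float) -> str:
    """Map available memory (in GB) to a processing mode using the table's thresholds."""
    if memory_gb < 32:
        return "streaming"  # below 32GB: process data in small chunks
    if memory_gb <= 64:
        return "batch"      # 32-64GB: load larger batches at once
    return "parallel"       # above 64GB: fan out across workers


for gb in (24, 48, 128):
    print(gb, "GB ->", select_mode(gb))
```

The memory figure could come from `psutil.virtual_memory().total / 1024**3`, since psutil is already among the project's dependencies.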

🚀 Quick Start

Prerequisites

  • macOS (Apple Silicon or Intel) or Linux
  • Python 3.10+
  • 24GB+ RAM (recommended: 32GB+)
  • 1TB+ storage (SSD recommended)
  • Polygon.io account with S3 flat files access

Installation

  1. Install the uv package manager:
curl -LsSf https://astral.sh/uv/install.sh | sh
  2. Clone and set up:
git clone <repository-url>
cd quantmini

# Create project structure
./create_structure.sh

# Create and activate virtual environment
uv venv
source .venv/bin/activate  # On macOS/Linux
  3. Install dependencies:
uv pip install qlib polygon boto3 aioboto3 polars duckdb pyarrow psutil pyyaml
  4. Configure credentials:
cp config/credentials.yaml.example config/credentials.yaml
# Edit config/credentials.yaml with your Polygon API keys
  5. Run the system profiler:
python -m src.core.system_profiler
# This will create config/system_profile.yaml

First Run

# Initialize configuration
quantmini config init

# Edit credentials (add your Polygon.io API keys)
nano config/credentials.yaml

# Run daily pipeline
quantmini pipeline daily --data-type stocks_daily

# Or backfill historical data
quantmini pipeline run --data-type stocks_daily --start-date 2024-01-01 --end-date 2024-12-31

# Query data
quantmini data query --data-type stocks_daily \
  --symbols AAPL MSFT \
  --fields date close volume \
  --start-date 2024-01-01 --end-date 2024-01-31

See CLI.md for complete CLI documentation.

📁 Project Structure (Medallion Architecture)

quantmini/
├── config/              # Configuration files
├── src/                 # Source code
│   ├── core/           # System profiling, memory monitoring
│   ├── download/       # S3 downloaders
│   ├── ingest/         # Data ingestion (landing → bronze)
│   ├── storage/        # Parquet storage management
│   ├── features/       # Feature engineering (bronze → silver)
│   ├── transform/      # Binary conversion (silver → gold)
│   ├── query/          # Query engine
│   └── orchestration/  # Pipeline orchestration
├── data/               # Data storage (not in git)
│   ├── landing/       # Landing layer: raw source data
│   │   └── polygon-s3/  # CSV.GZ files from S3
│   ├── bronze/        # Bronze layer: validated Parquet
│   ├── silver/        # Silver layer: feature-enriched Parquet
│   ├── gold/          # Gold layer: ML-ready formats
│   │   └── qlib/      # Qlib binary format
│   └── metadata/      # Watermarks, indexes
├── scripts/           # Command-line scripts
├── tests/             # Test suite
└── docs/              # Documentation
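
To show how a partitioned layer under data/ might be addressed, here is a small sketch that lists the month partitions covering a date range. The `year=YYYY/month=MM` directory scheme and the `month_partitions` helper are assumptions made for illustration; the tree above does not specify the actual partition layout.

```python
from datetime import date
from pathlib import Path
from typing import List

LAYERS = ("landing", "bronze", "silver", "gold")


def month_partitions(root: str, layer: str, data_type: str,
                     start: date, end: date) -> List[Path]:
    """Return one directory per calendar month between start and end (inclusive)."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    paths = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        # Hypothetical Hive-style partition naming.
        paths.append(Path(root) / layer / data_type / f"year={year}" / f"month={month:02d}")
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return paths


for p in month_partitions("data", "bronze", "stocks_daily",
                          date(2023, 11, 15), date(2024, 1, 10)):
    print(p.as_posix())
```

Reading only the directories that overlap the requested range is what lets a query engine skip the rest of the data lake.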

🔧 Configuration

Edit config/pipeline_config.yaml to customize:

  • Processing mode: adaptive, streaming, batch, or parallel
  • Data types: Enable/disable stocks, options, daily, minute data
  • Compression: Choose snappy (fast) or zstd (better compression)
  • Features: Configure which features to compute
  • Optimizations: Enable Apple Silicon, async downloads, etc.
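
A quick sanity check for the options listed above might look like the following. The allowed values come from the list; the key names `processing_mode` and `compression` are hypothetical and may not match the real schema of config/pipeline_config.yaml.

```python
from typing import List

ALLOWED_MODES = {"adaptive", "streaming", "batch", "parallel"}
ALLOWED_COMPRESSION = {"snappy", "zstd"}


def validate_config(config: dict) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config passed."""
    problems = []
    mode = config.get("processing_mode")  # hypothetical key name
    if mode not in ALLOWED_MODES:
        problems.append(f"processing_mode must be one of {sorted(ALLOWED_MODES)}, got {mode!r}")
    compression = config.get("compression")  # hypothetical key name
    if compression not in ALLOWED_COMPRESSION:
        problems.append(f"compression must be one of {sorted(ALLOWED_COMPRESSION)}, got {compression!r}")
    return problems


print(validate_config({"processing_mode": "adaptive", "compression": "zstd"}))
print(validate_config({"processing_mode": "turbo", "compression": "gzip"}))
```

In practice the dictionary would come from `yaml.safe_load` on the config file, since pyyaml is one of the installed dependencies.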

See Installation Guide for configuration details.

📚 Documentation

🧪 Testing

# Run all tests
pytest

# Run with coverage
pytest --cov=src --cov-report=html

# Run specific test suite
pytest tests/unit/
pytest tests/integration/
pytest tests/performance/

🔍 Monitoring

Check pipeline health and performance from the command line:

# View health status
python scripts/check_health.py

# View performance metrics
cat logs/performance/performance_metrics.json

# Generate report
python scripts/generate_report.py

📊 Data Types

The pipeline processes four types of data from Polygon.io:

  1. Stock Daily Aggregates: Daily OHLCV for all US stocks
  2. Stock Minute Aggregates: Minute-level data per symbol
  3. Options Daily Aggregates: Daily options data per underlying
  4. Options Minute Aggregates: Minute-level options data (all contracts)

🎨 Architecture (Medallion Pattern)

Landing Layer          Bronze Layer        Silver Layer         Gold Layer
(Raw Sources)         (Validated)          (Enriched)          (ML-Ready)
     ↓                     ↓                    ↓                   ↓
S3 CSV.GZ Files  →  Validated Parquet  →  Feature-Enriched  →  Qlib Binary
  (Polygon)            (Schema Check)       (Indicators)        (Backtesting)

Adaptive Ingestion: Streaming/Batch/Parallel based on available memory
Feature Engineering: DuckDB/Polars for calculated indicators
Binary Conversion: Optimized for ML training and backtesting

🚦 Pipeline Stages (Medallion Architecture)

  1. Landing: Async S3 downloads to landing/polygon-s3/
  2. Bronze: Ingest and validate to bronze/ - schema enforcement, type checking
  3. Silver: Enrich with features to silver/ - calculated indicators, returns, alpha
  4. Gold: Convert to ML formats in gold/qlib/ - optimized for backtesting
  5. Query: Fast access via DuckDB/Polars from any layer

Data Quality Progression: Landing (raw) → Bronze (validated) → Silver (enriched) → Gold (ML-ready)
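
The Bronze and Silver steps can be illustrated in plain Python. This is a toy sketch, not the pipeline's code (which uses DuckDB/Polars): the `BRONZE_SCHEMA` columns, the `return_1d` field, and both function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical Bronze schema: column name -> expected Python type.
BRONZE_SCHEMA = {"symbol": str, "date": str, "close": float, "volume": int}


def validate_row(row: dict) -> dict:
    """Bronze step: enforce the schema and reject rows that violate it."""
    missing = BRONZE_SCHEMA.keys() - row.keys()
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    for column, expected in BRONZE_SCHEMA.items():
        if not isinstance(row[column], expected):
            raise TypeError(f"{column} should be {expected.__name__}")
    return row


def add_daily_returns(rows: List[dict]) -> List[dict]:
    """Silver step: add a close-to-close return per symbol (None for the first day)."""
    previous_close: Dict[str, float] = {}
    enriched = []
    # ISO date strings sort chronologically, so a plain sort is enough here.
    for row in sorted(rows, key=lambda r: (r["symbol"], r["date"])):
        prev = previous_close.get(row["symbol"])
        ret = None if prev is None else row["close"] / prev - 1.0
        enriched.append({**row, "return_1d": ret})
        previous_close[row["symbol"]] = row["close"]
    return enriched


raw = [
    {"symbol": "AAPL", "date": "2024-01-03", "close": 110.0, "volume": 900},
    {"symbol": "AAPL", "date": "2024-01-02", "close": 100.0, "volume": 1000},
]
silver = add_daily_returns([validate_row(r) for r in raw])
```

Validating before enriching means every feature downstream is computed on rows that already satisfy the schema.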

🔐 Security

  • Never commit config/credentials.yaml (it is listed in .gitignore)
  • Store credentials in environment variables for production
  • Use AWS Secrets Manager or similar for cloud deployments
  • Rotate API keys regularly
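
Reading credentials from environment variables could look like this. The variable names `POLYGON_S3_ACCESS_KEY` and `POLYGON_S3_SECRET_KEY` and the `load_credentials` helper are hypothetical; use whatever names your deployment defines.

```python
import os
from typing import Mapping, Optional

# Hypothetical variable names, chosen only for this example.
REQUIRED_VARS = ("POLYGON_S3_ACCESS_KEY", "POLYGON_S3_SECRET_KEY")


def load_credentials(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read credentials from the environment and fail loudly if any are absent."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        # Report only the variable names, never the secret values.
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return {
        "access_key": env["POLYGON_S3_ACCESS_KEY"],
        "secret_key": env["POLYGON_S3_SECRET_KEY"],
    }
```

Accepting an optional mapping keeps the function easy to test without touching the real process environment.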

🐛 Troubleshooting

Memory Errors

# Reduce memory usage
export MAX_MEMORY_GB=16

# Force streaming mode
export PIPELINE_MODE=streaming
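
One plausible way for a program to honor these two overrides is sketched below. It assumes `MAX_MEMORY_GB` caps the memory budget and `PIPELINE_MODE` forces a mode outright; the `resolve_mode` function is hypothetical, and the 32GB/64GB thresholds are taken from the Performance table.

```python
import os
from typing import Mapping, Optional

FORCED_MODES = ("streaming", "batch", "parallel")


def resolve_mode(detected_gb: float, env: Optional[Mapping[str, str]] = None) -> str:
    """Pick a processing mode, letting environment overrides win over detection."""
    env = os.environ if env is None else env
    forced = env.get("PIPELINE_MODE")
    if forced and forced != "adaptive":
        if forced not in FORCED_MODES:
            raise ValueError(f"unsupported PIPELINE_MODE: {forced!r}")
        return forced
    # MAX_MEMORY_GB can only lower the budget, never raise it above what exists.
    budget_gb = min(detected_gb, float(env.get("MAX_MEMORY_GB", detected_gb)))
    if budget_gb < 32:
        return "streaming"
    if budget_gb <= 64:
        return "batch"
    return "parallel"
```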

S3 Rate Limits

# Reduce concurrent downloads
# Edit config/pipeline_config.yaml:
# optimizations.async_downloads.max_concurrent: 4
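
A concurrency cap like `max_concurrent: 4` is commonly enforced with a semaphore. The sketch below shows that general technique with asyncio; `download_all` and the `fetch` callable are hypothetical stand-ins, not quantmini's downloader.

```python
import asyncio
from typing import Awaitable, Callable, List


async def download_all(keys: List[str],
                       fetch: Callable[[str], Awaitable[str]],
                       max_concurrent: int = 4) -> List[str]:
    """Run fetch(key) for every key with at most max_concurrent calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(key: str) -> str:
        # Tasks beyond the limit wait here until a slot is released.
        async with semaphore:
            return await fetch(key)

    # gather returns results in the same order as the input keys.
    return await asyncio.gather(*(bounded(key) for key in keys))
```

Lowering `max_concurrent` trades throughput for fewer simultaneous requests, which is the usual remedy when a service starts throttling.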

Slow Performance

# Enable profiling
# Edit config/pipeline_config.yaml:
# monitoring.profiling.enabled: true

# Run and check logs/performance/

See the full documentation for more troubleshooting tips.

🤝 Contributing

See Contributing Guide for development guidelines.

📄 License

MIT License - see LICENSE file for details

🙏 Acknowledgments

  • Polygon.io: S3 flat files data source
  • Qlib: Quantitative investment framework
  • Polars: High-performance DataFrame library
  • DuckDB: Embedded analytical database

📧 Support


Built with: Python 3.10+, uv, qlib, polygon, polars, duckdb, pyarrow

Optimized for: macOS (Apple Silicon M1/M2/M3), 24GB+ RAM, SSD storage
