
MYNK ETL

A comprehensive Python-based Extract-Transform-Load (ETL) pipeline framework designed for financial data processing and management. This project leverages PySpark, Apache Iceberg, Apache Kafka, and Nessie for building scalable, production-ready data pipelines.

Overview

MYNK ETL provides a modular architecture for:

  • Extracting data from multiple sources (Kafka, databases, APIs)
  • Transforming financial data with built-in support for market data operations
  • Loading processed data to data lakes and RDBMS systems
  • Managing distributed computing workflows using Apache Spark

Features

  • Modular Pipeline Architecture: Abstract base classes for extensible Extract, Transform, and Load operations
  • Multi-Source Support:
    • Kafka streaming integration with Avro schema support
    • RDBMS connectors (PostgreSQL, etc.)
    • File-based I/O
  • Data Lake Integration: Apache Iceberg support for ACID transactions and time-travel queries
  • Financial Data Processing: Specialized utilities for yfinance ticker data transformation
  • Scalable Computing: PySpark for distributed batch and streaming operations
  • Catalog Management: Nessie catalog integration for Git-like version control of data
  • Comprehensive Logging: JSON-based logging with decorators for operation tracking
  • Configuration Management: YAML-based configuration system for infrastructure and tables

Project Structure

mynk_etl/
├── extract/              # Data extraction implementations
│   ├── extract.py       # Abstract Extract base class
│   ├── kafkaExtract.py  # Kafka streaming extraction
│   └── __init__.py
├── load/                # Data loading implementations
│   ├── load.py          # Abstract Load base class
│   ├── icebergWriter.py # Apache Iceberg writer
│   ├── fileWriter.py    # File-based writer
│   ├── rdbmsWriter.py   # RDBMS writer
│   └── __init__.py
├── transform/           # Data transformation implementations
│   ├── transform.py     # Abstract Transform base class
│   ├── yfinance/        # Financial data transformations
│   │   ├── tickTransform.py
│   │   └── __init__.py
│   └── __init__.py
├── sparkUtils/          # Spark initialization and utilities
│   ├── sparkInit.py     # Spark session creation
│   ├── sparkComfunc.py  # Common Spark functions
│   └── __init__.py
├── utils/               # Utility modules
│   ├── common/
│   │   ├── constants.py     # Application-wide constants
│   │   ├── confs.py         # Configuration loading
│   │   ├── genUtils.py      # General utilities
│   │   ├── kUtils.py        # Kafka utilities
│   │   ├── marketUtils.py   # Market/financial utilities
│   │   └── __init__.py
│   └── logger/
│       ├── jsonLogging.py        # JSON log formatting
│       ├── logDecor.py           # Logging decorators
│       ├── psgLogging.py         # PostgreSQL logging
│       ├── QueueListenerHandler.py
│       └── __init__.py
├── mainCalls/
│   ├── yfinanceUtils.py     # yfinance ETL orchestration
│   └── __init__.py
├── main.py              # Main ETL orchestration
└── __init__.py

Prerequisites

  • Python: 3.11 - 3.12.10
  • PySpark: 3.5.6
  • Java Runtime (for Spark)
  • Configuration Files: tables.yml and infrastructure config (loaded via environment)

Installation

Using UV Package Manager

uv sync

Using pip

pip install -e .

Dependencies

Key dependencies include:

  • pyspark==3.5.6 - Distributed computing framework
  • confluent-kafka==2.12.2 - Kafka client
  • pynessie>=0.67.0 - Nessie catalog client
  • fastavro>=1.12.1 - Avro serialization
  • psycopg2-binary>=2.9.11 - PostgreSQL connector
  • pandas-market-calendars>=5.3.0 - Market calendar support

Configuration

Environment Variables

Set the infrastructure environment:

export INFRA_ENV=DEV  # or PROD

Configuration Files

  1. tables.yml - Define table properties and configurations per environment
  2. Infrastructure Config - Spark parameters, Kafka brokers, Nessie settings, MinIO credentials

Configuration is loaded via mynk_etl.utils.common.confs module.
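
As a hedged sketch of how an environment-keyed configuration lookup can work (the table names, keys, and `load_table_conf` helper below are illustrative, not the actual `confs` API):

```python
import os

# Illustrative stand-in for parsed tables.yml content; the real file layout
# and the loader in mynk_etl.utils.common.confs may be shaped differently.
TABLES = {
    "DEV":  {"financial.stock_ticks": {"warehouse": "s3a://dev-bucket/warehouse"}},
    "PROD": {"financial.stock_ticks": {"warehouse": "s3a://prod-bucket/warehouse"}},
}

def load_table_conf(config_key: str) -> dict:
    """Return the table configuration for the environment named by INFRA_ENV."""
    env = os.environ.get("INFRA_ENV", "DEV")
    return TABLES[env][config_key]

conf = load_table_conf("financial.stock_ticks")
```

Keying the whole configuration tree on INFRA_ENV keeps DEV and PROD settings side by side in one file, so switching environments is a single variable change.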

Usage

Command Line Interface

mynk_etl <method> <config_key> [--path /path/to/config]

Parameters:

  • method: Operation to execute (e.g., yfinanceDataLoad)
  • config_key: Configuration key in format table_group.table_name
  • path: Optional path to configuration directory (defaults to current working directory)
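
The argument shape above can be sketched with argparse (a hypothetical parser for illustration; the shipped entry point may parse its arguments differently):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Hypothetical mirror of the CLI surface described above."""
    parser = argparse.ArgumentParser(prog="mynk_etl")
    parser.add_argument("method", help="operation to execute, e.g. yfinanceDataLoad")
    parser.add_argument("config_key", help="configuration key: table_group.table_name")
    parser.add_argument("--path", default=".", help="configuration directory")
    return parser

args = build_parser().parse_args(["yfinanceDataLoad", "financial.stock_ticks"])
```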

Programmatic Usage

from mynk_etl import main_function

# Execute yfinance data load
main_function('yfinanceDataLoad', 'financial.stock_ticks')

Module Details

Extract Module

Abstract interface for data extraction from various sources. Supports both streaming and batch modes.

Transform Module

Abstract interface for data transformations, with specialized implementations for financial data (ticker transformations, temporal partitioning).

Load Module

Abstract interface for writing data to destination systems:

  • Iceberg: ACID transactions, time-travel queries
  • RDBMS: PostgreSQL and other relational databases
  • File: Parquet, CSV exports
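
A file-based writer in this spirit might look like the following minimal sketch (the `CsvLoad` class is hypothetical and only echoes the `nonStreamWriter` method name from the abstract interface; it is not the actual fileWriter implementation):

```python
import csv
import io

class CsvLoad:
    """Illustrative batch writer that serializes row dicts to CSV."""

    def nonStreamWriter(self, rows, buf):
        # Sort field names so the column order is deterministic.
        writer = csv.DictWriter(buf, fieldnames=sorted(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

buf = io.StringIO()
CsvLoad().nonStreamWriter([{"ticker": "MSFT", "close": 410.2}], buf)
csv_text = buf.getvalue()
```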

Spark Utilities

  • Session initialization with Iceberg and Kafka support
  • Common DataFrame operations and optimizations
  • Distributed computing utilities

Utils

  • Constants: Application-wide configuration and enums
  • Configuration: YAML-based config loading
  • Logging: JSON-formatted logs with operation decorators
  • Market Utils: Financial-specific utilities
  • Kafka Utils: Kafka connection and messaging utilities

Testing

Run tests using pytest:

pytest tests/

Run with multiple workers:

pytest -n auto tests/

Test Structure

  • tests/unit/ - Unit tests
  • tests/fixtures/ - Reusable test fixtures
  • tests/helper/ - Test helper utilities
  • tests/data/ - Test data files

Dev Dependencies

  • pytest>=9.0.2 - Test framework
  • pytest-mock>=3.15.1 - Mocking utilities
  • pytest-xdist>=3.8.0 - Parallel test execution

Logging

The framework provides structured JSON logging with:

  • Log decoration for tracking function execution
  • Operation-level logging
  • Multiple handlers (console, file, PostgreSQL)
  • Correlation tracking via RUN_ID
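
A minimal, stdlib-only sketch of this pattern (illustrative; the real jsonLogging and logDecor modules may differ in field names and handler wiring):

```python
import functools
import io
import json
import logging
import uuid

RUN_ID = uuid.uuid4().hex  # correlation id, echoing the README's RUN_ID

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object per line."""
    def format(self, record):
        return json.dumps({"level": record.levelname,
                           "msg": record.getMessage(),
                           "run_id": RUN_ID})

def log_call(func):
    """Decorator that logs function entry and exit (cf. logDecor)."""
    log = logging.getLogger("mynk_etl.demo")
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.info("start %s", func.__name__)
        result = func(*args, **kwargs)
        log.info("end %s", func.__name__)
        return result
    return wrapper

# Wire the formatter to an in-memory stream so the output is inspectable.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("mynk_etl.demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

@log_call
def square(x):
    return x * x

result = square(4)
records = [json.loads(line) for line in stream.getvalue().strip().splitlines()]
```

Because every record carries the same run_id, log lines from one pipeline run can be correlated across handlers (console, file, PostgreSQL).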

Architecture Patterns

Abstract Base Classes

The framework uses abstract base classes for extensibility:

from abc import ABC, abstractmethod

# Extract
class Extract(ABC):
    @abstractmethod
    def extractSparkData(self): ...

    @abstractmethod
    def extractSparkStreamData(self): ...

# Transform
class Transform(ABC):
    ...  # subclasses implement custom transformations

# Load
class Load(ABC):
    @abstractmethod
    def streamWriter(self): ...

    @abstractmethod
    def nonStreamWriter(self): ...
This design allows easy implementation of new data sources, transformations, and destinations.
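
Under these interfaces, a new source is added by subclassing Extract and implementing both abstract methods. A self-contained illustration (the `ListExtract` class is hypothetical, not part of mynk_etl):

```python
from abc import ABC, abstractmethod

class Extract(ABC):
    """Abstract extraction interface, mirroring the sketch above."""
    @abstractmethod
    def extractSparkData(self): ...

    @abstractmethod
    def extractSparkStreamData(self): ...

class ListExtract(Extract):
    """Hypothetical batch-only source used purely for illustration."""
    def __init__(self, rows):
        self.rows = rows

    def extractSparkData(self):
        return list(self.rows)

    def extractSparkStreamData(self):
        raise NotImplementedError("this illustrative source is batch-only")

src = ListExtract([{"ticker": "AAPL", "price": 189.5}])
rows = src.extractSparkData()
```

The ABC machinery enforces the contract: instantiating a subclass that omits one of the abstract methods raises TypeError at construction time rather than failing mid-pipeline.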

Performance Features

  • PySpark Optimization: Leverages Spark's distributed computing
  • Iceberg Features: ACID transactions, schema evolution, time-travel queries
  • Kafka Integration: Reliable streaming data ingestion
  • Horizontal Scalability: Designed for multi-node Spark clusters

Version

Current version: 0.1.15

Contributing

When adding new modules:

  1. Follow the modular structure (Extract → Transform → Load)
  2. Implement abstract base classes
  3. Add comprehensive docstrings
  4. Include unit tests in tests/unit/
  5. Add fixtures to tests/fixtures/ if needed
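
For step 4, a unit test for a hypothetical transformation might look like this (the class and assertions are illustrative, not part of mynk_etl):

```python
class UpperTickerTransform:
    """Hypothetical Transform subclass: normalizes ticker symbols to upper case."""
    def transform(self, rows):
        return [{**row, "ticker": row["ticker"].upper()} for row in rows]

def test_upper_ticker_transform():
    out = UpperTickerTransform().transform([{"ticker": "aapl", "price": 1.0}])
    assert out == [{"ticker": "AAPL", "price": 1.0}]

test_upper_ticker_transform()
```

Keeping transformations as pure functions of their input rows makes them testable without a live Spark session or Kafka broker.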

License

[Add your license here]

Support

For issues and questions, please refer to the project documentation or contact the development team.
