MYNK ETL
A comprehensive Python-based Extract-Transform-Load (ETL) pipeline framework designed for financial data processing and management. This project leverages PySpark, Apache Iceberg, Apache Kafka, and Nessie for building scalable, production-ready data pipelines.
Overview
MYNK ETL provides a modular architecture for:
- Extracting data from multiple sources (Kafka, databases, APIs)
- Transforming financial data with built-in support for market data operations
- Loading processed data to data lakes and RDBMS systems
- Managing distributed computing workflows using Apache Spark
Features
- Modular Pipeline Architecture: Abstract base classes for extensible Extract, Transform, and Load operations
- Multi-Source Support:
  - Kafka streaming integration with Avro schema support
  - RDBMS connectors (PostgreSQL, etc.)
  - File-based I/O
- Data Lake Integration: Apache Iceberg support for ACID transactions and time-travel queries
- Financial Data Processing: Specialized utilities for yfinance ticker data transformation
- Scalable Computing: PySpark for distributed batch and streaming operations
- Catalog Management: Nessie catalog integration for Git-like version control of data
- Comprehensive Logging: JSON-based logging with decorators for operation tracking
- Configuration Management: YAML-based configuration system for infrastructure and tables
Project Structure
mynk_etl/
├── extract/ # Data extraction implementations
│ ├── extract.py # Abstract Extract base class
│ ├── kafkaExtract.py # Kafka streaming extraction
│ └── __init__.py
├── load/ # Data loading implementations
│ ├── load.py # Abstract Load base class
│ ├── icebergWriter.py # Apache Iceberg writer
│ ├── fileWriter.py # File-based writer
│ ├── rdbmsWriter.py # RDBMS writer
│ └── __init__.py
├── transform/ # Data transformation implementations
│ ├── transform.py # Abstract Transform base class
│ ├── yfinance/ # Financial data transformations
│ │ ├── tickTransform.py
│ │ └── __init__.py
│ └── __init__.py
├── sparkUtils/ # Spark initialization and utilities
│ ├── sparkInit.py # Spark session creation
│ ├── sparkComfunc.py # Common Spark functions
│ └── __init__.py
├── utils/ # Utility modules
│ ├── common/
│ │ ├── constants.py # Application-wide constants
│ │ ├── confs.py # Configuration loading
│ │ ├── genUtils.py # General utilities
│ │ ├── kUtils.py # Kafka utilities
│ │ ├── marketUtils.py # Market/financial utilities
│ │ └── __init__.py
│ └── logger/
│ ├── jsonLogging.py # JSON log formatting
│ ├── logDecor.py # Logging decorators
│ ├── psgLogging.py # PostgreSQL logging
│ ├── QueueListenerHandler.py
│ └── __init__.py
├── mainCalls/
│ ├── yfinanceUtils.py # yfinance ETL orchestration
│ └── __init__.py
├── main.py # Main ETL orchestration
└── __init__.py
Prerequisites
- Python: 3.11 - 3.12.10
- PySpark: 3.5.6
- Java Runtime (for Spark)
- Configuration Files: tables.yml and an infrastructure config (loaded via the environment)
Installation
Using the uv Package Manager
uv sync
Using pip
pip install -e .
Dependencies
Key dependencies include:
- pyspark==3.5.6 - Distributed computing framework
- confluent-kafka==2.12.2 - Kafka client
- pynessie>=0.67.0 - Nessie catalog client
- fastavro>=1.12.1 - Avro serialization
- psycopg2-binary>=2.9.11 - PostgreSQL connector
- pandas-market-calendars>=5.3.0 - Market calendar support
Configuration
Environment Variables
Set the infrastructure environment:
export INFRA_ENV=DEV # or PROD
Configuration Files
- tables.yml - Define table properties and configurations per environment
- Infrastructure Config - Spark parameters, Kafka brokers, Nessie settings, MinIO credentials
Configuration is loaded via the mynk_etl.utils.common.confs module.
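The exact schema of these files isn't documented here, but loading is conceptually straightforward. A minimal sketch, assuming PyYAML and environment-keyed top-level sections (the function name and layout are illustrative, not the actual confs API):

```python
# Minimal sketch of environment-keyed YAML config loading.
# Assumes PyYAML; the real confs module's API and schema may differ.
import os
import yaml

def load_table_conf(path: str = "tables.yml") -> dict:
    env = os.environ.get("INFRA_ENV", "DEV")  # DEV or PROD, per the docs
    with open(path) as fh:
        conf = yaml.safe_load(fh)
    return conf[env]  # hypothetical layout: one top-level key per environment
```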
Usage
Command Line Interface
mynk_etl <method> <config_key> [--path /path/to/config]
Parameters:
- method: Operation to execute (e.g., yfinanceDataLoad)
- config_key: Configuration key in the format table_group.table_name
- path: Optional path to the configuration directory (defaults to the current working directory)
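For example, to run the yfinance load for the financial.stock_ticks configuration with config files in a local ./config directory (the path is illustrative):
mynk_etl yfinanceDataLoad financial.stock_ticks --path ./config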
Programmatic Usage
from mynk_etl import main_function
# Execute yfinance data load
main_function('yfinanceDataLoad', 'financial.stock_ticks')
Module Details
Extract Module
Abstract interface for data extraction from various sources. Supports both streaming and batch modes.
Transform Module
Abstract interface for data transformations. Specialized implementations for financial data (ticker transformations, temporal partitioning).
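As a rough illustration of temporal partitioning, the sketch below derives partition columns from a timestamp; the column names are assumptions, not the actual tickTransform schema:

```python
# Illustrative temporal partitioning for ticker data; column names
# are assumed, not taken from the actual tickTransform implementation.
from pyspark.sql import DataFrame, functions as F

def add_temporal_partitions(df: DataFrame, ts_col: str = "event_time") -> DataFrame:
    """Derive year/month/day partition columns from a timestamp column."""
    return (
        df.withColumn("year", F.year(ts_col))
          .withColumn("month", F.month(ts_col))
          .withColumn("day", F.dayofmonth(ts_col))
    )
```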
Load Module
Abstract interface for writing data to destination systems:
- Iceberg: ACID transactions, time-travel queries
- RDBMS: PostgreSQL and other relational databases
- File: Parquet, CSV exports
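For instance, a batch append to an Iceberg table can go through Spark's DataFrameWriterV2; the catalog and table identifier below are placeholders, not the framework's actual defaults:

```python
# Illustrative Iceberg batch append via DataFrameWriterV2; the
# "nessie" catalog and table identifier are placeholder names.
from pyspark.sql import DataFrame

def append_to_iceberg(df: DataFrame, table: str = "nessie.financial.stock_ticks") -> None:
    df.writeTo(table).append()  # transactional (ACID) append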
Spark Utilities
- Session initialization with Iceberg and Kafka support
- Common DataFrame operations and optimizations
- Distributed computing utilities
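A minimal sketch of what such session initialization typically looks like for Iceberg over Nessie follows; the URIs, catalog name, and warehouse path are placeholders, and sparkInit.py may set different or additional options (e.g. Kafka packages or MinIO credentials):

```python
# Sketch of a Spark session wired for Iceberg + Nessie. All endpoint
# values are placeholders; the framework's sparkInit.py may differ.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("mynk_etl")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl",
            "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://localhost:19120/api/v1")
    .config("spark.sql.catalog.nessie.ref", "main")
    .config("spark.sql.catalog.nessie.warehouse", "s3a://warehouse/")
    .getOrCreate()
)
```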
Utils
- Constants: Application-wide configuration and enums
- Configuration: YAML-based config loading
- Logging: JSON-formatted logs with operation decorators
- Market Utils: Financial-specific utilities
- Kafka Utils: Kafka connection and messaging utilities
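As an example of the market-calendar dependency in isolation (the actual marketUtils wrappers are not shown here):

```python
# pandas-market-calendars usage in isolation; marketUtils' own
# wrappers around it may differ.
import pandas_market_calendars as mcal

nyse = mcal.get_calendar("NYSE")
sessions = nyse.schedule(start_date="2024-01-02", end_date="2024-01-10")
print(sessions.index)  # trading sessions only; weekends and holidays excluded
```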
Testing
Run tests using pytest:
pytest tests/
Run with multiple workers:
pytest -n auto tests/
Test Structure
- tests/unit/ - Unit tests
- tests/fixtures/ - Reusable test fixtures
- tests/helper/ - Test helper utilities
- tests/data/ - Test data files
Dev Dependencies
- pytest>=9.0.2 - Test framework
- pytest-mock>=3.15.1 - Mocking utilities
- pytest-xdist>=3.8.0 - Parallel test execution
Logging
The framework provides structured JSON logging with:
- Log decoration for tracking function execution
- Operation-level logging
- Multiple handlers (console, file, PostgreSQL)
- Correlation tracking via RUN_ID
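A toy version of such a decorator, with RUN_ID correlation baked into each JSON record, is sketched below; the framework's logDecor and jsonLogging modules are richer than this:

```python
# Toy operation-tracking decorator with RUN_ID correlation; the real
# logDecor/jsonLogging modules add handlers, formatting, and more.
import functools
import json
import logging
import os
import uuid

RUN_ID = os.environ.get("RUN_ID", uuid.uuid4().hex)

def log_operation(func):
    """Log start/success/error of a call as JSON, correlated by RUN_ID."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        def emit(status):
            logging.info(json.dumps(
                {"run_id": RUN_ID, "op": func.__name__, "status": status}))
        emit("start")
        try:
            result = func(*args, **kwargs)
        except Exception:
            emit("error")
            raise
        emit("success")
        return result
    return wrapper
```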
Architecture Patterns
Abstract Base Classes
The framework uses abstract base classes for extensibility:
from abc import ABC, abstractmethod

# Extract
class Extract(ABC):
    @abstractmethod
    def extractSparkData(self): ...
    @abstractmethod
    def extractSparkStreamData(self): ...

# Transform
class Transform(ABC):
    # Custom transformation implementations subclass this interface
    pass

# Load
class Load(ABC):
    @abstractmethod
    def streamWriter(self): ...
    @abstractmethod
    def nonStreamWriter(self): ...
This design allows easy implementation of new data sources, transformations, and destinations.
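For example, a hypothetical batch-only JDBC source could plug into the Extract interface above like this (the option values are placeholders):

```python
# Hypothetical new source built on the Extract interface above;
# the JDBC options are placeholders, not framework defaults.
class PostgresExtract(Extract):
    def __init__(self, spark, url: str, table: str):
        self.spark, self.url, self.table = spark, url, table

    def extractSparkData(self):
        return (self.spark.read.format("jdbc")
                .option("url", self.url)        # e.g. jdbc:postgresql://host:5432/db
                .option("dbtable", self.table)
                .option("driver", "org.postgresql.Driver")
                .load())

    def extractSparkStreamData(self):
        raise NotImplementedError("JDBC sources are batch-only in this sketch")
```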
Performance Features
- PySpark Optimization: Leverages Spark's distributed computing
- Iceberg Features: ACID transactions, schema evolution, time-travel queries
- Kafka Integration: Reliable streaming data ingestion
- Horizontal Scalability: Designed for multi-node Spark clusters
Version
Current version: 0.1.15
Contributing
When adding new modules:
- Follow the modular structure (Extract → Transform → Load)
- Implement abstract base classes
- Add comprehensive docstrings
- Include unit tests in tests/unit/
- Add fixtures to tests/fixtures/ if needed
License
[Add your license here]
Support
For issues and questions, please refer to the project documentation or contact the development team.
File details
Details for the file mynk_etl-0.1.15.tar.gz.
File metadata
- Download URL: mynk_etl-0.1.15.tar.gz
- Upload date:
- Size: 21.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2ff8a715789b6df4d650df4a27e4c6dab180bc9a1c2053e8c9deda59ad16b20f |
| MD5 | c00ad4cade6ec52cc8d2d803153df01d |
| BLAKE2b-256 | 86fbc140bfc0145769a8b3cfaa5b0dd4c4b65bd831dd994bcf2e2fca92fd95f3 |
File details
Details for the file mynk_etl-0.1.15-py3-none-any.whl.
File metadata
- Download URL: mynk_etl-0.1.15-py3-none-any.whl
- Upload date:
- Size: 32.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: uv/0.11.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5eb393fff77ed56c7c8730821369f8e5853b07198078dbe373f4fa0b962bf34e |
| MD5 | ad65bbf0d673c505e84313bef46e7aa8 |
| BLAKE2b-256 | 7c56a4918e36b4e6ad13452c12c0b137c60d1b101286bc027cc1a8fe99644294 |