Samara
An extensible, configuration-driven ETL framework for data pipelines
⭐ Star this repo • 📚 Documentation • 🐛 Report Issues • 💬 Join Discussions
📥 Releases • 📝 Changelog (TBD) • 🤝 Contributing
Built by Krijn van der Burg for the Data Engineering community
Samara transforms data engineering by shifting from custom code to declarative configuration for complete ETL pipeline workflows. The framework handles all execution details while you focus on what your data should do, not how to implement it. This configuration-driven approach standardizes pipeline patterns across teams, reduces complexity for ETL jobs, improves maintainability, and makes data workflows accessible to users with limited programming experience.
The processing engine is abstracted away through configuration, making it easy to switch engines or run the same pipeline in different environments. The current version supports Apache Spark, with Polars support in development.
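Because the engine is selected per job in configuration, moving the same pipeline to another engine is intended to be a one-line change rather than a code rewrite. A sketch of that switch (the `polars` value is hypothetical until Polars support ships):

```yaml
jobs:
  - id: clean-products
    engine_type: spark # swap to e.g. polars once that engine is released
```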
⚡ Quick Start
Installation
```shell
# Clone the repository
git clone https://github.com/krijnvanderburg/Samara.git
cd Samara

# Install dependencies
poetry install
```
Run an example pipeline
```shell
python -m samara run \
    --alert-filepath="examples/yaml_products_cleanup/alert.yaml" \
    --workflow-filepath="examples/yaml_products_cleanup/job.yaml"
```
📚 Documentation
Samara's documentation guides you through installation, configuration, and development:
- Getting Started - Installation and basic concepts
- Example Pipelines - Ready-to-run examples demonstrating key features
- CLI Reference - Command-line interface options and examples
- Configuration Reference - Complete syntax guide for all configuration options
- Workflow System - ETL pipeline configuration (extracts, transforms, loads)
- Alert System - Error handling and notification configuration
- Architecture - Design principles and framework structure
- Custom Extensions - Building your own transforms
For complete documentation covering all aspects of Samara, visit the documentation home page.
🔍 Example: Product Cleanup Pipeline (YAML)
This pipeline demonstrates common data cleaning operations using the YAML configuration format:

- Drop duplicates: removes duplicate product entries from the catalog. An empty `columns: []` array means all columns are checked for duplicates, reducing 12 rows to 10 unique products.
- Type casting: converts string columns to appropriate data types. `price` is cast from string to double for numeric operations, `stock_quantity` to integer for inventory tracking, and `is_available` to boolean for logical operations.
- Column selection: projects only the relevant columns into the final output, excluding the `last_updated` field. Each column in the select list can be changed through configuration alone.
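To illustrate the semantics of these three steps, here is a minimal stand-in in plain Python; the data and helper logic are illustrative only, not Samara's API (Samara executes these steps via the configured engine, e.g. Spark):

```python
# Two identical rows stand in for the duplicated catalog entries.
rows = [
    {"product_id": "1", "product_name": "Widget", "category": "tools",
     "price": "9.99", "stock_quantity": "5", "is_available": "true",
     "last_updated": "2024-01-01"},
    {"product_id": "1", "product_name": "Widget", "category": "tools",
     "price": "9.99", "stock_quantity": "5", "is_available": "true",
     "last_updated": "2024-01-01"},  # exact duplicate
]

# Step 1: drop duplicates across all columns (columns: [] in the config).
unique = [dict(t) for t in {tuple(sorted(r.items())) for r in rows}]

# Step 2: cast string columns to typed values.
casts = {"price": float, "stock_quantity": int,
         "is_available": lambda v: v == "true"}
typed = [{k: casts.get(k, str)(v) for k, v in r.items()} for r in unique]

# Step 3: select only the output columns (drops last_updated).
keep = ["product_id", "product_name", "category", "price",
        "stock_quantity", "is_available"]
result = [{k: r[k] for k in keep} for r in typed]

print(result)
```

The same dedupe-cast-select order matters: casting after deduplication avoids typed comparisons, and selecting last keeps intermediate columns available to earlier steps.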
Configuration: examples/yaml_products_cleanup/job.yaml
Flexible Configuration: Define pipelines in YAML or JSON; both formats are fully supported and functionally equivalent. Choose whichever best fits your team's preferences.
```yaml
workflow:
  id: product-cleanup-pipeline
  description: ETL pipeline for cleaning and standardizing product catalog data
  enabled: true
  jobs:
    - id: clean-products
      description: Remove duplicates, cast types, and select relevant columns from product data
      enabled: true
      engine_type: spark

      # Extract product data from CSV file
      extracts:
        - id: extract-products
          extract_type: file
          data_format: csv
          location: examples/products_cleanup/products/
          method: batch
          options:
            delimiter: ","
            header: true
            inferSchema: false
          schema: examples/products_cleanup/products_schema.json

      # Transform the data: remove duplicates, cast types, and select columns
      transforms:
        - id: transform-clean-products
          upstream_id: extract-products
          options: {}
          functions:
            # Step 1: Remove duplicate rows based on all columns
            - function_type: dropDuplicates
              arguments:
                columns: [] # Empty array means check all columns for duplicates

            # Step 2: Cast columns to appropriate data types
            - function_type: cast
              arguments:
                columns:
                  - column_name: price
                    cast_type: double
                  - column_name: stock_quantity
                    cast_type: integer
                  - column_name: is_available
                    cast_type: boolean
                  - column_name: last_updated
                    cast_type: date

            # Step 3: Select only the columns we need for the output
            - function_type: select
              arguments:
                columns: [product_id, product_name, category, price, stock_quantity, is_available]

      # Load the cleaned data to output
      loads:
        - id: load-clean-products
          upstream_id: transform-clean-products
          load_type: file
          data_format: csv
          location: examples/products_cleanup/output
          method: batch
          mode: overwrite
          options:
            header: true
          schema_export: ""

  # Event hooks for pipeline lifecycle
  hooks:
    onStart: []
    onFailure: []
    onSuccess: []
    onFinally: []
```
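Because YAML and JSON configurations are interchangeable, any step in the workflow above could equally be written in JSON. For instance, the `dropDuplicates` step, hand-converted for illustration:

```json
{
  "function_type": "dropDuplicates",
  "arguments": {
    "columns": []
  }
}
```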
🚀 Getting Help
- Documentation: Refer to the Configuration Reference section for detailed syntax
- Examples: Explore working samples in the examples directory
- Community: Ask questions and report issues on GitHub Issues
- Source Code: Browse the implementation in the src/samara directory
🤝 Contributing
Contributions are welcome! Feel free to submit a pull request or reach out to Krijn van der Burg on LinkedIn.
📄 License
This project is licensed under the Creative Commons Attribution 4.0 International License (CC-BY-4.0) - see the LICENSE file for details.
File details
Details for the file samara-0.2.tar.gz.
File metadata
- Download URL: samara-0.2.tar.gz
- Upload date:
- Size: 92.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `780e556dd9236223b67e65366646198265aae0a3cf34f8007050c4b070f11148` |
| MD5 | `36ce202ef3c6823b6b5ed49f4f6629a4` |
| BLAKE2b-256 | `6b51a9832e9b9024ae36573db57502c3dcb7166a421a1a707ad9fc71b3fa8be5` |
File details
Details for the file samara-0.2-py3-none-any.whl.
File metadata
- Download URL: samara-0.2-py3-none-any.whl
- Upload date:
- Size: 135.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `ae7222815a3e3f930c30a4aba484536f1f30f224b85df12daf961502a9d0a87d` |
| MD5 | `648e95e71eea92f155e3eba6bd84d36c` |
| BLAKE2b-256 | `e750e15e5cc3feaeb26965a095390959d6e9df071f057688fbd173af002a4e6c` |