
Config Driven ETL Framework

Samara

A lightweight, extensible framework for configuration-driven data pipelines

Built by Krijn van der Burg for the data engineering community

⭐ Star this repo · 📚 Documentation · 🐛 Report Issues · 💬 Join Discussions

📥 Releases (TBD) · 📝 Changelog (TBD) · 🤝 Contributing


Samara transforms data engineering by shifting from custom code to declarative configuration for complete ETL pipeline workflows. The framework handles all execution details while you focus on what your data should do, not how to implement it. This configuration-driven approach standardizes pipeline patterns across teams, reduces complexity for ETL jobs, improves maintainability, and makes data workflows accessible to users with limited programming experience.

The processing engine is abstracted away through configuration, making it easy to switch engines or run the same pipeline in different environments. The current version supports Apache Spark, with Polars support in development.
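As a rough sketch of where that switch happens, each job declares its engine through the engine_type field (the same field appears in the full example further down). The note about a future "polars" value is an assumption about how the in-development support might eventually be exposed:

"jobs": [
    {
        "id": "silver",
        "engine_type": "spark", // only supported value today; a "polars" value is an assumed future option
        // ...extracts, transforms, loads...
    }
]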

⚡ Quick Start

Installation

# Clone the repository
git clone https://github.com/krijnvanderburg/config-driven-ETL-framework.git
cd config-driven-ETL-framework

# Install dependencies
poetry install

Run an example pipeline

python -m samara run \
  --alert-filepath="examples/join_select/alert.jsonc" \
  --runtime-filepath="examples/join_select/job.jsonc"

📚 Documentation

Samara's documentation guides you through installation, configuration, and development. For complete coverage of every aspect of the framework, visit the documentation home page.

🔍 Example: Customer Order Analysis

Running the Quick Start command above executes a complete pipeline that showcases Samara's key capabilities:

  • Multi-format extraction: reads from both CSV and JSON sources

    • Source options such as delimiters and headers are set directly in the configuration file
    • Schema validation ensures data type safety and consistency across all sources
  • Flexible transformation chain: transforms are applied in the order they are listed

    • First, a join combines both datasets on customer_id
    • Then a select transform projects only the needed columns
    • Each transform function is customized through its arguments
  • Configurable loading: writes results as CSV with customizable settings

    • Switch to Parquet, Delta, or other formats by changing data_format
    • The output mode (overwrite/append) is controlled by a single parameter
    • Write to multiple formats or locations by adding another load entry (see the sketch after this list)
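
For example, writing the same joined dataset to Parquet alongside the CSV would mean adding a second entry to the loads array. This is a minimal sketch only: the id and location values are invented for illustration, and the exact set of required fields should be checked against the configuration reference.

{
    "id": "load-customer-orders-parquet", // hypothetical second load entry
    "upstream_id": "transform-join-orders", // same upstream dataset as the CSV load
    "load_type": "file",
    "data_format": "parquet", // only the format (and its options) change
    "location": "examples/join_select/output_parquet", // illustrative output directory
    "method": "batch",
    "mode": "overwrite",
    "options": {},
    "schema_export": ""
}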

Configuration: examples/join_select/job.jsonc

{
    "runtime": {
        "id": "customer-orders-pipeline",
        "description": "ETL pipeline for processing customer orders data",
        "enabled": true,
        "jobs": [
            {
                "id": "silver",
                "description": "Combine customer and order source data into a single dataset",
                "enabled": true,
                "engine_type": "spark", // Specifies the processing engine to use
                "extracts": [
                    {
                        "id": "extract-customers",
                        "extract_type": "file", // Read from file system
                        "data_format": "csv", // CSV input format
                        "location": "examples/join_select/customers/", // Source directory
                        "method": "batch", // Process all files at once
                        "options": {
                            "delimiter": ",", // CSV delimiter character
                            "header": true, // First row contains column names
                            "inferSchema": false // Use provided schema instead of inferring
                        },
                        "schema": "examples/join_select/customers_schema.json" // Path to schema definition
                    },
                    {
                        "id": "extract-orders",
                        "extract_type": "file",
                        "data_format": "json", // JSON input format
                        "location": "examples/join_select/orders/",
                        "method": "batch",
                        "options": {
                            "multiLine": true, // Each JSON object may span multiple lines
                            "inferSchema": false // Use provided schema instead of inferring
                        },
                        "schema": "examples/join_select/orders_schema.json"
                    }
                ],
                "transforms": [
                    {
                        "id": "transform-join-orders",
                        "upstream_id": "extract-customers", // First input dataset from extract stage
                        "options": {},
                        "functions": [
                            {
                                "function_type": "join", // Join customers with orders
                                "arguments": { 
                                    "other_upstream_id": "extract-orders", // Second dataset to join
                                    "on": ["customer_id"], // Join key
                                    "how": "inner" // Join type (inner, left, right, full)
                                }
                            },
                            {
                                "function_type": "select", // Select only specific columns
                                "arguments": {
                                    "columns": ["name", "email", "signup_date", "order_id", "order_date", "amount"]
                                }
                            }
                        ]
                    }
                ],
                "loads": [
                    {
                        "id": "load-customer-orders",
                        "upstream_id": "transform-join-orders", // Input dataset for this load
                        "load_type": "file", // Write to file system
                        "data_format": "csv", // Output as CSV
                        "location": "examples/join_select/output", // Output directory
                        "method": "batch", // Write all data at once
                        "mode": "overwrite", // Replace existing files if any
                        "options": {
                            "header": true // Include header row with column names
                        },
                        "schema_export": "" // No schema export
                    }
                ],
                "hooks": {
                    "onStart": [], // Actions to execute before pipeline starts
                    "onFailure": [], // Actions to execute if pipeline fails
                    "onSuccess": [], // Actions to execute if pipeline succeeds
                    "onFinally": [] // Actions to execute after pipeline completes (success or failure)
                }
            }
        ]
    }
}
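
The extract entries above reference schema files such as examples/join_select/customers_schema.json rather than inferring types at read time. As a rough sketch, assuming the schema definition follows Spark's StructType JSON layout (the actual format is defined by the framework, so consult the schema files in the examples directory), a customers schema might look like the following. The field names are taken from the join key and selected columns; the types and nullability are assumptions:

{
    "type": "struct",
    "fields": [
        { "name": "customer_id", "type": "string", "nullable": false, "metadata": {} },
        { "name": "name", "type": "string", "nullable": true, "metadata": {} },
        { "name": "email", "type": "string", "nullable": true, "metadata": {} },
        { "name": "signup_date", "type": "string", "nullable": true, "metadata": {} }
    ]
}

With inferSchema set to false in the extract options, the declared schema governs how each source is read and validated.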

🚀 Getting Help

  • Documentation: Refer to the Configuration Reference section for detailed syntax
  • Examples: Explore working samples in the examples directory
  • Community: Ask questions and report issues on GitHub Issues
  • Source Code: Browse the implementation in the src/samara directory

🤝 Contributing

Contributions are welcome! Feel free to submit a pull request and message Krijn van der Burg on LinkedIn.

📄 License

This project is licensed under the Creative Commons Attribution 4.0 International License (CC-BY-4.0) - see the LICENSE file for details.
