
Config-Driven ETL Framework


Samara

An extensible framework for configuration-driven data pipelines

⭐ Star this repo · 📚 Documentation · 🐛 Report Issues · 💬 Join Discussions

📥 Releases · 📝 Changelog (TBD) · 🤝 Contributing

Built by Krijn van der Burg for the Data Engineering community


Samara transforms data engineering by shifting from custom code to declarative configuration for complete ETL pipeline workflows. The framework handles all execution details while you focus on what your data should do, not how to implement it. This configuration-driven approach standardizes pipeline patterns across teams, reduces the complexity of ETL jobs, improves maintainability, and makes data workflows accessible to users with limited programming experience.

The processing engine is abstracted away through configuration, making it easy to switch engines or run the same pipeline in different environments. The current version supports Apache Spark, with Polars support in development.
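Switching engines is, in principle, a one-line change in the job configuration. A hypothetical fragment (spark is the only valid value in the current version):

```yaml
jobs:
  - id: clean-products
    engine_type: spark  # currently the only supported engine; polars is in development
```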

⚡ Quick Start

Installation

# Clone the repository
git clone https://github.com/krijnvanderburg/Samara.git
cd Samara

# Install dependencies
poetry install

Run an example pipeline

python -m samara run \
  --alert-filepath="examples/yaml_products_cleanup/alert.yaml" \
  --workflow-filepath="examples/yaml_products_cleanup/job.yaml"

📚 Documentation

Samara's documentation guides you through installation, configuration, and development. For complete documentation covering all aspects of Samara, visit the documentation home page.

🔍 Example: Product Cleanup Pipeline (YAML)

Running this pipeline demonstrates data cleaning operations using the YAML configuration format:

  • Drop duplicates: Removes duplicate product entries from the catalog

    • An empty columns: [] array means all columns are checked for duplicates
    • Reduces 12 rows to 10 unique products
  • Type casting: Converts string columns to appropriate data types

    • price converted from string to double for numeric operations
    • stock_quantity converted to integer for inventory tracking
    • is_available converted to boolean for logical operations
  • Column selection: Projects only the relevant columns into the final output

    • Excludes the last_updated field from the final dataset
    • The select list can be modified entirely through configuration
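The three steps above can be sketched in pure Python to show their semantics. This is an illustrative sketch, not Samara's implementation (Samara runs these as Spark operations), and the sample rows are made up:

```python
# Hypothetical sample input: string-typed CSV rows, including one exact duplicate.
raw_rows = [
    {"product_id": "1", "product_name": "Widget", "category": "tools",
     "price": "9.99", "stock_quantity": "5", "is_available": "true",
     "last_updated": "2024-01-01"},
    {"product_id": "1", "product_name": "Widget", "category": "tools",
     "price": "9.99", "stock_quantity": "5", "is_available": "true",
     "last_updated": "2024-01-01"},  # exact duplicate of the row above
    {"product_id": "2", "product_name": "Gadget", "category": "tools",
     "price": "4.50", "stock_quantity": "0", "is_available": "false",
     "last_updated": "2024-02-01"},
]

# Step 1: drop duplicates across all columns (columns: []).
seen, deduped = set(), []
for row in raw_rows:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        deduped.append(row)

# Step 2: cast string columns to typed values; unlisted columns stay strings.
casts = {"price": float, "stock_quantity": int,
         "is_available": lambda v: v == "true"}
typed = [{k: casts.get(k, str)(v) for k, v in row.items()} for row in deduped]

# Step 3: select only the output columns, which drops last_updated.
selected_cols = ["product_id", "product_name", "category",
                 "price", "stock_quantity", "is_available"]
result = [{c: row[c] for c in selected_cols} for row in typed]
```

After these steps, `result` holds two typed rows without the last_updated field.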

Configuration: examples/yaml_products_cleanup/job.yaml

Flexible Configuration: Define pipelines in YAML or JSON; both formats are fully supported and functionally equivalent. Choose the format that best fits your team's preferences.
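The equivalence holds because both formats parse to the same in-memory structure. A minimal sketch, using a hypothetical JSON rendering of an extract entry (not a complete Samara config):

```python
import json

# The same keys rendered in YAML would load to an identical Python dict.
json_config = """
{
  "extracts": [
    {
      "id": "extract-products",
      "extract_type": "file",
      "data_format": "csv",
      "method": "batch"
    }
  ]
}
"""
config = json.loads(json_config)
print(config["extracts"][0]["id"])  # -> extract-products
```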

workflow:
  id: product-cleanup-pipeline
  description: ETL pipeline for cleaning and standardizing product catalog data
  enabled: true

  jobs:
    - id: clean-products
      description: Remove duplicates, cast types, and select relevant columns from product data
      enabled: true
      engine_type: spark

      # Extract product data from CSV file
      extracts:
        - id: extract-products
          extract_type: file
          data_format: csv
          location: examples/products_cleanup/products/
          method: batch
          options:
            delimiter: ","
            header: true
            inferSchema: false
          schema: examples/products_cleanup/products_schema.json

      # Transform the data: remove duplicates, cast types, and select columns
      transforms:
        - id: transform-clean-products
          upstream_id: extract-products
          options: {}
          functions:
            # Step 1: Remove duplicate rows based on all columns
            - function_type: dropDuplicates
              arguments:
                columns: []  # Empty array means check all columns for duplicates

            # Step 2: Cast columns to appropriate data types
            - function_type: cast
              arguments:
                columns:
                  - column_name: price
                    cast_type: double
                  - column_name: stock_quantity
                    cast_type: integer
                  - column_name: is_available
                    cast_type: boolean
                  - column_name: last_updated
                    cast_type: date

            # Step 3: Select only the columns we need for the output
            - function_type: select
              arguments:
                columns: [product_id, product_name, category, price, stock_quantity, is_available]

      # Load the cleaned data to output
      loads:
        - id: load-clean-products
          upstream_id: transform-clean-products
          load_type: file
          data_format: csv
          location: examples/products_cleanup/output
          method: batch
          mode: overwrite
          options:
            header: true
          schema_export: ""

      # Event hooks for pipeline lifecycle
      hooks:
        onStart: []
        onFailure: []
        onSuccess: []
        onFinally: []
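The four hook lists map naturally onto Python's try/except/else/finally control flow. A minimal sketch of that lifecycle, assuming hooks are plain callables; run_with_hooks and its signature are illustrative, not Samara's actual API:

```python
def run_with_hooks(job, hooks):
    """Run a job, firing each hook list at the matching lifecycle point."""
    events = []
    for hook in hooks.get("onStart", []):
        hook(events)
    try:
        job()
    except Exception:
        for hook in hooks.get("onFailure", []):
            hook(events)
    else:
        for hook in hooks.get("onSuccess", []):
            hook(events)
    finally:
        for hook in hooks.get("onFinally", []):
            hook(events)
    return events

# A job that succeeds fires onStart, onSuccess, then onFinally.
log = run_with_hooks(
    job=lambda: None,
    hooks={"onStart": [lambda ev: ev.append("start")],
           "onSuccess": [lambda ev: ev.append("success")],
           "onFinally": [lambda ev: ev.append("finally")]},
)
# log == ["start", "success", "finally"]

# A job that raises fires onFailure instead of onSuccess.
failed = run_with_hooks(
    job=lambda: 1 / 0,
    hooks={"onFailure": [lambda ev: ev.append("failure")],
           "onFinally": [lambda ev: ev.append("finally")]},
)
# failed == ["failure", "finally"]
```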

🚀 Getting Help

  • Documentation: Refer to the Configuration Reference section for detailed syntax
  • Examples: Explore working samples in the examples directory
  • Community: Ask questions and report issues on GitHub Issues
  • Source Code: Browse the implementation in the src/samara directory

🤝 Contributing

Contributions are welcome! Feel free to submit a pull request or message Krijn van der Burg on LinkedIn.

📄 License

This project is licensed under the Creative Commons Attribution 4.0 International License (CC-BY-4.0); see the LICENSE file for details.



Download files

Download the file for your platform.

Source Distribution

samara-0.3.tar.gz (91.3 kB)


Built Distribution


samara-0.3-py3-none-any.whl (134.8 kB)


File details

Details for the file samara-0.3.tar.gz.

File metadata

  • Download URL: samara-0.3.tar.gz
  • Upload date:
  • Size: 91.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for samara-0.3.tar.gz

  • SHA256: 4a58a29c5589483539b4c0b2ac10c4f3579fdd432baed70d0f3ddffc9c0b878e
  • MD5: 79a404efd773941a2b53ddb3e358e5dc
  • BLAKE2b-256: 2c2037573f1c91af7b57e97e54bbdd8b776149731ae193498906e44b6cfe19ef


File details

Details for the file samara-0.3-py3-none-any.whl.

File metadata

  • Download URL: samara-0.3-py3-none-any.whl
  • Upload date:
  • Size: 134.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for samara-0.3-py3-none-any.whl

  • SHA256: 8ddba5d6ca1455514dd51b90b5a739fe17f30515230ae50ad57b9d638f806b8d
  • MD5: dd079d54d8ee2626f7d59445469bc226
  • BLAKE2b-256: 6a6f69956d9cd40bb21f67c6b9e542048d9f3f4598f6e05db29b388221eb5d9f

