
BeETL is a Python package for extracting data from one datasource, transforming it, and loading it into another.


BeETL: Extensible Python/Polars-based ETL Framework


BeETL grew out of work as an integration developer, where most of the integrations we build follow the same pattern: get data here, transform it a little, put it there (with the middle step frequently missing altogether).

After building our 16th integration between the same two systems from yet another manual template, we decided to build BeETL. Each sync is currently limited to one source datasource and one destination datasource, but this will be expanded in the future. One configuration can contain multiple syncs.

Note: Even though most of the configuration below is in YAML format, you can also use JSON or a Python dictionary.
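Because the formats describe the same structure, a JSON document parses into the equivalent Python dictionary. The following is a minimal sketch using only the standard library `json` module. The config fragment is illustrative (its lists are left empty for brevity), and how BeETL itself loads configuration files is not shown here.

```python
import json

# Illustrative fragment only; the lists are left empty for brevity.
config_json = """
{
    "version": "V1",
    "sources": [],
    "sync": []
}
"""

# The same configuration written directly as a Python dictionary.
config_dict = {"version": "V1", "sources": [], "sync": []}

# Parsing the JSON text yields an equal dictionary.
parsed = json.loads(config_json)
print(parsed == config_dict)
```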

Todo:

  • Soft Delete/Hard Delete
  • Output table at end
  • Automatic column specification from data


Installation

From PyPI

pip3 install beetl

From Source

git clone https://
python3 setup.py install

Quick Start

The following is the minimum configuration needed to get started with a simple sync:

from src.beetl.beetl import Beetl, BeetlConfig

sync_config = {
    # The version of the config file, currently V1
    "version": "V1",
    
    # The datasources to move data between
    "sources": [
        {
            # The identifier for the datasource
            "name": "mysql_db",

            # The type (ex. Sqlserver, Rest, Itop)
            "type": "Mysql",

            # The connection settings for the datasource (connection string or host/user/password)
            "connection": {
              "settings": {
                "connection_string": "mysql://user:password@host:3306/database"
              }
            }
        },
        {
            "name": "postgres_db",
            "type": "Postgres",
            "connection": {
              "settings": {
                "connection_string": "postgresql://user:password@host:5432/database"
              }
            }
        }
    ],
    # The configuration for the sync(s) to run
    "sync": [
        {
            # The source and destination identifiers
            "source": "mysql_db",
            "destination": "postgres_db",

            # The configuration for source/destination
            "sourceConfig": {
                # The query with data to fetch
                "query": "SELECT field1, field2, field3 FROM table1",
                
                # The column descriptions for the query
                "columns": [
                    {
                        # The name of the column/field
                        "name": "field1",

                        # The data type
                        "type": "Int32",

                        # Whether the column is considered unique
                        # (unique cols will be used for comparison)
                        "unique": True
                    },
                    {
                        "name": "field2",
                        "type": "Utf8",
                        "unique": False
                    },
                    {
                        "name": "field3",
                        "type": "Utf8",
                        "unique": False
                    }
                ]
            },
            "destinationConfig": {
                # The table to insert data into
                "table": "table1",

                # The columns to insert data into
                "columns": [
                    {
                        # The name of the column/field
                        "name": "field1",

                        # The data type
                        "type": "Int32",

                        # Whether the column is considered unique
                        # (unique cols will be used for comparison)
                        "unique": True
                    },
                    {
                        "name": "field2",
                        "type": "Utf8",
                        "unique": False
                    },
                    {
                        "name": "field3",
                        "type": "Utf8",
                        "unique": False,
                        
                        # Will be created on insert, but not updated
                        "skip_update": True
                    }
                ]
            },
            "sourceTransformers": {},
            "insertionTransformers": {}
        }
    ]
}
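To illustrate what the `unique` and `skip_update` flags mean in the column definitions above, here is a minimal plain-Python sketch of a key-based comparison. It is not BeETL's implementation (BeETL is Polars-based). The function name `plan_changes` and the sample rows are hypothetical, and the sketch assumes rows are dictionaries keyed by column name.

```python
def plan_changes(source_rows, destination_rows, columns):
    """Split source rows into inserts and updates by comparing on unique columns."""
    # Columns flagged as unique form the comparison key.
    key_cols = [c["name"] for c in columns if c.get("unique")]
    # Columns checked for updates: not part of the key and not marked skip_update.
    update_cols = [
        c["name"]
        for c in columns
        if not c.get("unique") and not c.get("skip_update")
    ]

    def key(row):
        return tuple(row[name] for name in key_cols)

    existing = {key(row): row for row in destination_rows}
    inserts, updates = [], []
    for row in source_rows:
        current = existing.get(key(row))
        if current is None:
            # Key not found in the destination: insert the full row.
            inserts.append(row)
        elif any(row[name] != current[name] for name in update_cols):
            # Key found and an updatable column differs: carry key + updatable columns.
            updates.append({name: row[name] for name in key_cols + update_cols})
    return inserts, updates


columns = [
    {"name": "field1", "type": "Int32", "unique": True},
    {"name": "field2", "type": "Utf8", "unique": False},
    {"name": "field3", "type": "Utf8", "unique": False, "skip_update": True},
]
source = [
    {"field1": 1, "field2": "a", "field3": "x"},
    {"field1": 2, "field2": "b", "field3": "y"},
    {"field1": 3, "field2": "c", "field3": "z"},
]
destination = [
    {"field1": 1, "field2": "a", "field3": "changed"},  # differs only in a skip_update column
    {"field1": 2, "field2": "old", "field3": "y"},      # field2 differs
]

inserts, updates = plan_changes(source, destination, columns)
print(inserts)
print(updates)
```

In this sketch, the row with `field1 == 1` produces no update because its only difference is in `field3`, which is marked `skip_update`.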

Secrets from Environment Variables

If you want to keep your secrets in environment variables instead of in the YAML configuration file, you can save them as a JSON object in an environment variable and replace the "sources" section with the "sourcesFromEnv" setting.

Note that the "sources" and "sourcesFromEnv" options are mutually exclusive.

Python:

sync_config = {
    # The version of the config file, currently V1
    "version": "V1",

    # Fetch source configuration from environment variable BEETL_SOURCES
    "sourcesFromEnv": "BEETL_SOURCES",

    # The configuration for the sync(s) to run
    "sync": [
        .....

YAML:

version: "V1"
sourcesFromEnv: "BEETL_SOURCES"
sync:
  - ......

JSON:

{
    "version": "V1",
    "sourcesFromEnv": "BEETL_SOURCES",
    "sync": [
        ......

The format of the sources configuration is the same as the one normally under the "sources"-section:

[
    {
        # The identifier for the datasource
        "name": "mysql_db",

        # The type (ex. Sqlserver, Rest, Itop)
        "type": "Mysql",

        # The connection settings for the datasource (connection string or host/user/password)
        "connection": {
            "settings": {
                "connection_string": "mysql://user:password@host:3306/database"
            }
        }
    },
    {
        "name": "postgres_db",
        "type": "Postgres",
        "connection": {
            "settings": {
                "connection_string": "postgresql://user:password@host:5432/database"
            }
        }
    }
]
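One way to produce and consume such a variable is to serialize the list with `json.dumps` and parse it back with `json.loads`. The following is a minimal sketch using only the standard library. The helper `resolve_sources` is hypothetical and only illustrates the mutual-exclusivity rule; it is not part of BeETL's API. Note that the JSON stored in the variable cannot contain the `#` comments shown in the listing above.

```python
import json
import os

sources = [
    {
        "name": "mysql_db",
        "type": "Mysql",
        "connection": {
            "settings": {
                "connection_string": "mysql://user:password@host:3306/database"
            }
        },
    },
]

# For demonstration the variable is set in-process; normally it would come
# from the shell or a secret store.
os.environ["BEETL_SOURCES"] = json.dumps(sources)


def resolve_sources(config):
    """Hypothetical helper: return the source list from either config option."""
    if "sources" in config and "sourcesFromEnv" in config:
        raise ValueError('"sources" and "sourcesFromEnv" are mutually exclusive')
    if "sourcesFromEnv" in config:
        # The setting names the environment variable that holds the JSON list.
        return json.loads(os.environ[config["sourcesFromEnv"]])
    return config["sources"]


config = {"version": "V1", "sourcesFromEnv": "BEETL_SOURCES", "sync": []}
resolved = resolve_sources(config)
print(resolved[0]["name"])
```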
