
pandas-etl


A utility for running YAML config as Pandas based ETL code

Installing ⏬

pip install pandas-etl

Usage 📝

YAML Config:

my-run.yaml

variables:
  server: MY_SERVER_NAME.MYDOMAIN.COM
  database: MY_DATABASE

preFlight:
  script: |
    import pandas as pd

connections:
  my_source: postgresql+psycopg2://${var.server}/${var.database}

steps:

- name:           source-df # Unique step name within the pipeline
  description:    Read from my PostgreSQL server
  function:       pd.read_sql # Call a static function
  args:           # Keyword arguments passed to the function
    sql:          |
                  SELECT int_column, date_column
                  FROM test_data
    con:          ${ conn.my_source }
    index_col:    int_column
    parse_dates:  { "date_column": { "format": "%d/%m/%y" } }

- name:           Grouper for date column
  function:       pd.Grouper
  args:
    key:          date_column
    freq:         W-MON

- name:           group-data
  description:    Group data by int and date columns every week
  function:       ${ steps['source-df'].output.groupby }
  args:
    by:           ${steps['Grouper for date column'].output}
    axis:         columns
    dropna:       false

- name:           aggregate-data
  function:       ${ steps['group-data'].output.max }

- name:           save-data
  function:       ${ steps['aggregate-data'].output.to_csv }
  args:
                  - ./my-aggregated-data.csv

Running this YAML config:

python -m pandas_etl --file "./my-run.yaml"
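Conceptually, each step is just a pandas call applied to the previous step's output. A simplified plain-pandas sketch of what the pipeline above does (the connection string is resolved by hand for illustration, and the axis option from the group-data step is omitted):

import pandas as pd

# ${var.server} / ${var.database} resolved manually for this sketch
con = "postgresql+psycopg2://MY_SERVER_NAME.MYDOMAIN.COM/MY_DATABASE"

# source-df
df = pd.read_sql(
    sql="SELECT int_column, date_column FROM test_data",
    con=con,
    index_col="int_column",
    parse_dates={"date_column": {"format": "%d/%m/%y"}},
)

# Grouper for date column
grouper = pd.Grouper(key="date_column", freq="W-MON")

# group-data and aggregate-data
aggregated = df.groupby(by=grouper, dropna=False).max()

# save-data
aggregated.to_csv("./my-aggregated-data.csv")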

YAML Config (short-hand format)

Functions can be written in short-hand to improve readability and minimize the overall size of the config file: each step collapses to a single mapping from the function to its arguments.

Below is a working example that also uses imports (--imports) and variable overrides (--var):

my-run.yaml

preFlight:
  script: |
    import pandas as pd

imports:
- ./etl_definition_folder/variables/postgresql_database_variables.yaml

connections:
  my_database: postgresql+psycopg2://${var.username}:${var.password}@${var.server}:${var.postgresql_port}/${var.database}

steps:

- pd.read_sql:
    sql:          |
                  SELECT int_column, date_column
                  FROM test_data
    con:          ${ conn.my_database }
    index_col:    int_column
    parse_dates:  { "date_column": { "format": "%d/%m/%y" } }

- pd.Grouper:
    key:          date_column
    freq:         W-MON

- ${ steps['pd.read_sql'].output.groupby }:
    by:           ${steps['pd.Grouper'].output}
    axis:         columns
    dropna:       false

- ${ steps['pd.read_sql.groupby'].output.max }:

- ${ steps['pd.read_sql.groupby.max'].output.to_csv }:
    path_or_buf:  ./my-aggregated-data.csv

Imported variable files:

postgresql_database_variables.yaml

variables:
  server:   MY_SERVER_NAME.MYDOMAIN.COM
  database: MY_DATABASE

postgresql_database-secret_variables.yaml

variables:
  username: postgres
  password: password

Running this YAML config from the command line:

python -m pandas_etl --file "./my-run.yaml" --imports "./etl_definition_folder/variables/secrets/postgresql_database-secret_variables.yaml" --var "postgresql_port=9999"
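With both variable files imported and the --var override applied, the connection string in my-run.yaml should resolve to something like:

postgresql+psycopg2://postgres:password@MY_SERVER_NAME.MYDOMAIN.COM:9999/MY_DATABASE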

