Skip to main content

Useful ETL functions for Python

Project description

Useful ETL functions for Python

etl-toolbox is a Python library of simple but powerful functions for ETL and data cleaning. It contains tools that are useful for nearly any ETL pipeline, with a specific focus on the data variety challenges that arise when compiling data from many sources.

GitHub Build Status Coverage Read the Docs PyPI Version Supported Python versions License

Features

  • Standardize various null-indicating values ('blank', 'none', 'null', etc) to Python’s None

  • Trim, condense, and standardize whitespace with a single function

  • Locate and rename column labels in messy files

Quick Start

Installation

Install from PyPI using pip:

$ pip install etl_toolbox

Usage

>>> import pandas as pd
>>>
>>> df = pd.read_csv('./test_data/bad-data.csv')
>>> df  # doctest:+SKIP
         Unnamed: 0           Unnamed: 1            Unnamed: 2    Unnamed: 3 Unnamed: 4
0      created by:   Brookcub Industries  for testing purposes           NaN        NaN
1             date:           2020-06-07                3 rows  some columns        NaN
2               NaN                  NaN                   NaN           NaN        NaN
3             Cust.             EML-addr                On          phn-nmbr       col5
4     Golden jackal    c.aureus@mail.com              03/04/14      333-4444      blank
5  Pie, rufous tree                 none                   NaN      222-3333      empty
6   Vulture, bengal              blocked              06/01/15      777-7777       none
7       Arctic tern  s_paradise@mail.com              01/28/16           NaN        NaN
8   Eurasian badger  meles@othermail.net         notavailable            NaN        NaN
9   Grant's gazelle   grant@randmail.com                     -           NaN        NaN

Find and standardize column labels using a dictionary of the expected values:

>>> from etl_toolbox.dataframe_functions import find_column_labels
>>> from etl_toolbox.mapping_functions import map_labels
>>>
>>> fingerprint_map = {
...     'cust': 'Name',
...     'emladdr': 'Email',
...     'on': 'Date',
...     'phnnmbr': 'Phone'
... }
>>>
>>> find_column_labels(df, fingerprint_map)
>>> df
              Cust.             EML-addr         On      phn-nmbr   col5
0     Golden jackal    c.aureus@mail.com       03/04/14  333-4444  blank
1  Pie, rufous tree                 none            NaN  222-3333  empty
2   Vulture, bengal              blocked       06/01/15  777-7777   none
3       Arctic tern  s_paradise@mail.com       01/28/16       NaN    NaN
4   Eurasian badger  meles@othermail.net  notavailable        NaN    NaN
5   Grant's gazelle   grant@randmail.com              -       NaN    NaN
>>>
>>> df.columns = map_labels(df.columns, fingerprint_map)
>>> df
               Name                Email           Date     Phone      -
0     Golden jackal    c.aureus@mail.com       03/04/14  333-4444  blank
1  Pie, rufous tree                 none            NaN  222-3333  empty
2   Vulture, bengal              blocked       06/01/15  777-7777   none
3       Arctic tern  s_paradise@mail.com       01/28/16       NaN    NaN
4   Eurasian badger  meles@othermail.net  notavailable        NaN    NaN
5   Grant's gazelle   grant@randmail.com              -       NaN    NaN

Standardize null values and remove empty rows/columns:

>>> from etl_toolbox.dataframe_functions import dataframe_clean_null
>>>
>>> dataframe_clean_null(df)
>>> df
               Name                Email      Date     Phone
0     Golden jackal    c.aureus@mail.com  03/04/14  333-4444
1  Pie, rufous tree                  NaN       NaN  222-3333
2   Vulture, bengal                  NaN  06/01/15  777-7777
3       Arctic tern  s_paradise@mail.com  01/28/16       NaN
4   Eurasian badger  meles@othermail.net       NaN       NaN
5   Grant's gazelle   grant@randmail.com       NaN       NaN

Or clean individual data values:

>>> from etl_toolbox.cleaning_functions import clean_whitespace
>>>
>>> clean_whitespace(''' 123   abc 456
...                               def\t\t 789\t''')
'123 abc 456 def 789'

Documentation

Full documentation is hosted at etl-toolbox.readthedocs.io.

Contributing

Contributions are appreciated! There are multiple ways to contribute:

Bug Reports

Bug reports help make this library more robust. A good bug report should include:

  1. A clear description of the problem (the expected behavior vs the actual behavior)

  2. A minimal, reproducible example (see the Stack Overflow guide)

  3. The platform and versions involved (operating system, Python version, etl-toolbox version, pandas/numpy version if applicable, etc)

Submit bug reports with the issue tracker on GitHub.

Feature Requests

Open an issue to discuss features you’d like to see added to etl-toolbox.

Pull Requests

Follow these steps for submitting pull requests:

  1. Find an issue or feature on the issue tracker.

  2. Fork this repository on GitHub and make changes in a branch.

  3. Add tests to confirm that the bugfix/feature works as expected.

  4. Run the entire test suite and coverage report with pytest --doctest-modules --doctest-glob=*.rst --cov=etl_toolbox --ignore=docs/conf.py. Make sure text coverage is 100% and all tests are passing.

  5. Submit a pull request.

The code style for etl-toolbox mostly follows PEP8. A linter like Flake8 is recommended for double checking new contributions.

Release History

  • 0.0.3

    • Fix multiple bugs in merge_columns_by_label() that occurred with certain inputs

    • Change merge_columns_by_label() to remove None and np.nan values from merged columns

    • Change find_column_labels() to check whether the existing column labels fit the match criteria before searching rows

    • Change map_labels() to return '-' instead of None for unmapped labels

    • Change clean_whitespace() to return non-string inputs unaltered instead of raising an exception

  • 0.0.2

    • Add GitHub continuous integration

    • Add project links and badges to readme and PyPI metadata

    • Fix bug in merge_columns_by_label() that raises a ValueError if df has multiple columns labeled None

  • 0.0.1

    • Initial release

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

etl-toolbox-0.0.3.tar.gz (16.0 kB view hashes)

Uploaded Source

Built Distribution

etl_toolbox-0.0.3-py3-none-any.whl (18.3 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page