Useful ETL functions for Python
Project description
Useful ETL functions for Python
etl-toolbox is a Python library of simple but powerful functions for ETL and data cleaning. It contains tools that are useful for nearly any ETL pipeline, with a specific focus on the data variety challenges that arise when compiling data from many sources.
Features
Standardize various null-indicating values ('blank', 'none', 'null', etc) to Python’s None
Trim, condense, and standardize whitespace with a single function
Locate and rename column labels in messy files
Quick Start
Installation
$ pip install etl_toolbox
Usage
>>> import pandas as pd >>> >>> df = pd.read_csv('./test_data/bad-data.csv') >>> df # doctest:+SKIP Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 0 created by: Brookcub Industries for testing purposes NaN NaN 1 date: 2020-06-07 3 rows some columns NaN 2 NaN NaN NaN NaN NaN 3 Cust. EML-addr On phn-nmbr col5 4 Golden jackal c.aureus@mail.com 03/04/14 333-4444 blank 5 Pie, rufous tree none NaN 222-3333 empty 6 Vulture, bengal blocked 06/01/15 777-7777 none 7 Arctic tern s_paradise@mail.com 01/28/16 NaN NaN 8 Eurasian badger meles@othermail.net notavailable NaN NaN 9 Grant's gazelle grant@randmail.com - NaN NaN
Find and standardize column labels using a dictionary of the expected values:
>>> from etl_toolbox.dataframe_functions import find_column_labels >>> from etl_toolbox.mapping_functions import map_labels >>> >>> fingerprint_map = { ... 'cust': 'Name', ... 'emladdr': 'Email', ... 'on': 'Date', ... 'phnnmbr': 'Phone' ... } >>> >>> find_column_labels(df, fingerprint_map) >>> df Cust. EML-addr On phn-nmbr col5 0 Golden jackal c.aureus@mail.com 03/04/14 333-4444 blank 1 Pie, rufous tree none NaN 222-3333 empty 2 Vulture, bengal blocked 06/01/15 777-7777 none 3 Arctic tern s_paradise@mail.com 01/28/16 NaN NaN 4 Eurasian badger meles@othermail.net notavailable NaN NaN 5 Grant's gazelle grant@randmail.com - NaN NaN >>> >>> df.columns = map_labels(df.columns, fingerprint_map) >>> df Name Email Date Phone - 0 Golden jackal c.aureus@mail.com 03/04/14 333-4444 blank 1 Pie, rufous tree none NaN 222-3333 empty 2 Vulture, bengal blocked 06/01/15 777-7777 none 3 Arctic tern s_paradise@mail.com 01/28/16 NaN NaN 4 Eurasian badger meles@othermail.net notavailable NaN NaN 5 Grant's gazelle grant@randmail.com - NaN NaN
Standardize null values and remove empty rows/columns:
>>> from etl_toolbox.dataframe_functions import dataframe_clean_null >>> >>> dataframe_clean_null(df) >>> df Name Email Date Phone 0 Golden jackal c.aureus@mail.com 03/04/14 333-4444 1 Pie, rufous tree NaN NaN 222-3333 2 Vulture, bengal NaN 06/01/15 777-7777 3 Arctic tern s_paradise@mail.com 01/28/16 NaN 4 Eurasian badger meles@othermail.net NaN NaN 5 Grant's gazelle grant@randmail.com NaN NaN
Or clean individual data values:
>>> from etl_toolbox.cleaning_functions import clean_whitespace >>> >>> clean_whitespace(''' 123 abc 456 ... def\t\t 789\t''') '123 abc 456 def 789'
Documentation
Full documentation is hosted at etl-toolbox.readthedocs.io.
Contributing
Contributions are appreciated! There are multiple ways to contribute:
Bug Reports
Bug reports help make this library more robust. A good bug report should include:
A clear description of the problem (the expected behavior vs the actual behavior)
A minimal, reproducible example (see the Stack Overflow guide)
The platform and versions involved (operating system, Python version, etl-toolbox version, pandas/numpy version if applicable, etc)
Submit bug reports with the issue tracker on GitHub.
Feature Requests
Open an issue to discuss features you’d like to see added to etl-toolbox.
Pull Requests
Follow these steps for submitting pull requests:
Find an issue or feature on the issue tracker.
Fork this repository on GitHub and make changes in a branch.
Add tests to confirm that the bugfix/feature works as expected.
Run the entire test suite and coverage report with pytest --doctest-modules --doctest-glob=*.rst --cov=etl_toolbox --ignore=docs/conf.py. Make sure text coverage is 100% and all tests are passing.
Submit a pull request.
The code style for etl-toolbox mostly follows PEP8. A linter like Flake8 is recommended for double checking new contributions.
Release History
0.0.3
Fix multiple bugs in merge_columns_by_label() that occurred with certain inputs
Change merge_columns_by_label() to remove None and np.nan values from merged columns
Change find_column_labels() to check whether the existing column labels fit the match criteria before searching rows
Change map_labels() to return '-' instead of None for unmapped labels
Change clean_whitespace() to return non-string inputs unaltered instead of raising an exception
0.0.2
Add GitHub continuous integration
Add project links and badges to readme and PyPI metadata
Fix bug in merge_columns_by_label() that raises a ValueError if df has multiple columns labeled None
0.0.1
Initial release
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Hashes for etl_toolbox-0.0.3-py3-none-any.whl
Algorithm | Hash digest | |
---|---|---|
SHA256 | be3df3cb8637cd4fa5348dd2be1a2213adf0a9a2bea54920c6b20b18dbfacc32 |
|
MD5 | c3c1500bd0e9c6ba221b5ab3da122ba9 |
|
BLAKE2b-256 | 42eb17fc17d316605c8ec847ec22550a8ad503e0b8fc90faada482052db13185 |