Useful ETL functions for Python
Project description
Useful ETL functions for Python
etl-toolbox is a Python library of simple but powerful functions for ETL and data cleaning. It contains tools that are useful for nearly any ETL pipeline, with a specific focus on the data variety challenges that arise when compiling data from many sources.
Features
Standardize various null-indicating values ('blank', 'none', 'null', etc) to Python’s None
Trim, condense, and standardize whitespace with a single function
Locate and rename column labels in messy files
Quick Start
Installation
$ pip install etl_toolbox
Usage
>>> import pandas as pd >>> >>> df = pd.read_csv('./test_data/bad-data.csv') >>> df # doctest:+SKIP Unnamed: 0 Unnamed: 1 Unnamed: 2 Unnamed: 3 Unnamed: 4 0 created by: Brookcub Industries for testing purposes NaN NaN 1 date: 2020-06-07 3 rows some columns NaN 2 NaN NaN NaN NaN NaN 3 Cust. EML-addr On phn-nmbr col5 4 Golden jackal c.aureus@mail.com 03/04/14 333-4444 blank 5 Pie, rufous tree none NaN 222-3333 empty 6 Vulture, bengal blocked 06/01/15 777-7777 none 7 Arctic tern s_paradise@mail.com 01/28/16 NaN NaN 8 Eurasian badger meles@othermail.net notavailable NaN NaN 9 Grant's gazelle grant@randmail.com - NaN NaN
Find and standardize column labels using a dictionary of the expected values:
>>> from etl_toolbox.dataframe_functions import find_column_labels >>> from etl_toolbox.mapping_functions import map_labels >>> >>> fingerprint_map = { ... 'cust': 'Name', ... 'emladdr': 'Email', ... 'on': 'Date', ... 'phnnmbr': 'Phone' ... } >>> >>> find_column_labels(df, fingerprint_map) >>> df Cust. EML-addr On phn-nmbr col5 0 Golden jackal c.aureus@mail.com 03/04/14 333-4444 blank 1 Pie, rufous tree none NaN 222-3333 empty 2 Vulture, bengal blocked 06/01/15 777-7777 none 3 Arctic tern s_paradise@mail.com 01/28/16 NaN NaN 4 Eurasian badger meles@othermail.net notavailable NaN NaN 5 Grant's gazelle grant@randmail.com - NaN NaN >>> >>> df.columns = map_labels(df.columns, fingerprint_map) >>> df Name Email Date Phone - 0 Golden jackal c.aureus@mail.com 03/04/14 333-4444 blank 1 Pie, rufous tree none NaN 222-3333 empty 2 Vulture, bengal blocked 06/01/15 777-7777 none 3 Arctic tern s_paradise@mail.com 01/28/16 NaN NaN 4 Eurasian badger meles@othermail.net notavailable NaN NaN 5 Grant's gazelle grant@randmail.com - NaN NaN
Standardize null values and remove empty rows/columns:
>>> from etl_toolbox.dataframe_functions import dataframe_clean_null >>> >>> dataframe_clean_null(df) >>> df Name Email Date Phone 0 Golden jackal c.aureus@mail.com 03/04/14 333-4444 1 Pie, rufous tree NaN NaN 222-3333 2 Vulture, bengal NaN 06/01/15 777-7777 3 Arctic tern s_paradise@mail.com 01/28/16 NaN 4 Eurasian badger meles@othermail.net NaN NaN 5 Grant's gazelle grant@randmail.com NaN NaN
Or clean individual data values:
>>> from etl_toolbox.cleaning_functions import clean_whitespace >>> >>> clean_whitespace(''' 123 abc 456 ... def\t\t 789\t''') '123 abc 456 def 789'
Documentation
Full documentation is hosted at etl-toolbox.readthedocs.io.
Contributing
Contributions are appreciated! There are multiple ways to contribute:
Bug Reports
Bug reports help make this library more robust. A good bug report should include:
A clear description of the problem (the expected behavior vs the actual behavior)
A minimal, reproducible example (see the Stack Overflow guide)
The platform and versions involved (operating system, Python version, etl-toolbox version, pandas/numpy version if applicable, etc)
Submit bug reports with the issue tracker on GitHub.
Feature Requests
Open an issue to discuss features you’d like to see added to etl-toolbox.
Pull Requests
Follow these steps for submitting pull requests:
Find an issue or feature on the issue tracker.
Fork this repository on GitHub and make changes in a branch.
Add tests to confirm that the bugfix/feature works as expected.
Run the entire test suite and coverage report with pytest --doctest-modules --doctest-glob=*.rst --cov=etl_toolbox --ignore=docs/conf.py. Make sure text coverage is 100% and all tests are passing.
Submit a pull request.
The code style for etl-toolbox mostly follows PEP8. A linter like Flake8 is recommended for double checking new contributions.
Release History
0.0.3
Fix multiple bugs in merge_columns_by_label() that occurred with certain inputs
Change merge_columns_by_label() to remove None and np.nan values from merged columns
Change find_column_labels() to check whether the existing column labels fit the match criteria before searching rows
Change map_labels() to return '-' instead of None for unmapped labels
Change clean_whitespace() to return non-string inputs unaltered instead of raising an exception
0.0.2
Add GitHub continuous integration
Add project links and badges to readme and PyPI metadata
Fix bug in merge_columns_by_label() that raises a ValueError if df has multiple columns labeled None
0.0.1
Initial release
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
File details
Details for the file etl-toolbox-0.0.3.tar.gz
.
File metadata
- Download URL: etl-toolbox-0.0.3.tar.gz
- Upload date:
- Size: 16.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.8.3rc1
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | 1fb46cc845ae64aebe90fc9acf7cf13dd63b5dafe08886f99fd500caa412baa2 |
|
MD5 | df93730b872393f18630f41b7e93f6df |
|
BLAKE2b-256 | a7ce272243daa04ac25906b09d05542aa499afa1234a10ab62b0e02e9e70580b |
File details
Details for the file etl_toolbox-0.0.3-py3-none-any.whl
.
File metadata
- Download URL: etl_toolbox-0.0.3-py3-none-any.whl
- Upload date:
- Size: 18.3 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.24.0 setuptools/50.3.0 requests-toolbelt/0.9.1 tqdm/4.46.0 CPython/3.8.3rc1
File hashes
Algorithm | Hash digest | |
---|---|---|
SHA256 | be3df3cb8637cd4fa5348dd2be1a2213adf0a9a2bea54920c6b20b18dbfacc32 |
|
MD5 | c3c1500bd0e9c6ba221b5ab3da122ba9 |
|
BLAKE2b-256 | 42eb17fc17d316605c8ec847ec22550a8ad503e0b8fc90faada482052db13185 |