Data Preprocessing Tools
dptools: data preprocessing functions for Python
Overview
The dptools Python package provides helper functions that simplify common data processing tasks in a data science pipeline, including feature engineering, data aggregation, handling missing values and more.
The package currently encompasses the following functions:
- Feature engineering:
  - add_date_features(): add date-based features
  - add_text_features(): add text-based features
  - aggregate_data(): aggregate data and add aggregation-based features
  - encode_factors(): perform label or dummy encoding of categorical features
- Data processing:
  - split_nested_features(): split features nested in a single column
  - fill_missings(): replace missing values with specific values
  - print_missings(): print information on features with missing values
  - print_factor_levels(): print levels of categorical features
- Data cleaning:
  - find_correlated_features(): find features with a high pairwise correlation
  - find_constant_features(): find features with a single unique value
- Import and versioning:
  - read_csv_with_json(): read CSV with columns in JSON format
  - save_csv_version(): save CSV with an automatically assigned version number to prevent overwriting
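To make the data cleaning entries concrete, here is a minimal sketch in plain pandas and NumPy of what "constant" and "highly correlated" features mean. It does not call dptools and is not the package's implementation; the toy columns, the 0.95 threshold and all variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# toy data (illustrative only, not from the package)
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],  # perfectly correlated with 'a'
    'c': [5, 3, 4, 1, 2],
    'd': [7, 7, 7, 7, 7],   # constant
})

# constant features: columns with a single unique value
constant = [col for col in df.columns if df[col].nunique(dropna=False) == 1]

# correlated features: absolute Pearson correlation above an assumed threshold;
# only the upper triangle is kept so that each pair is reported once
threshold = 0.95
corr = df.drop(columns=constant).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
correlated = [(row, col)
              for row in upper.index
              for col in upper.columns
              if upper.loc[row, col] > threshold]

print(constant)
print(correlated)
```

Constant columns are dropped before computing correlations because their correlation with any other column is undefined.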
Installation
The latest stable release can be installed from PyPI:
pip install dptools
You may also install the development version from GitHub:
pip install git+https://github.com/kozodoi/dptools.git
After the installation, you can import the included functions:
from dptools import *
Examples
This section contains a few examples of using functions from dptools for different data preprocessing tasks. Please refer to the function docstrings for further examples.
Creating a toy data set
First, let us create a toy data frame to demonstrate the package functionality.
# import dependencies
import pandas as pd
import numpy as np
# create data frame
data = {'age': [27, np.nan, 30, 25, np.nan],
        'height': [170, 168, 173, 177, 165],
        'gender': ['female', 'male', np.nan, 'male', 'female'],
        'income': ['high', 'medium', 'low', 'low', 'no income']}
df = pd.DataFrame(data)
age | height | gender | income |
---|---|---|---|
27.0 | 170 | female | high |
NaN | 168 | male | medium |
30.0 | 173 | NaN | low |
25.0 | 177 | male | low |
NaN | 165 | female | no income |
Aggregating features
# aggregating the data
from dptools import aggregate_data
df_new = aggregate_data(df, group_var = 'gender', num_stats = ['mean', 'max'], fac_stats = 'mode')
gender | age_mean | age_max | height_mean | height_max | income_mode |
---|---|---|---|---|---|
female | 27.0 | 27.0 | 167.5 | 170 | 'high' |
male | 25.0 | 25.0 | 172.5 | 177 | 'low' |
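For readers who want to see where these numbers come from, the table above can be reproduced with a plain pandas groupby. This is a sketch under stated assumptions, not how aggregate_data() is implemented internally; in particular, breaking ties in the mode by taking the first value in sorted order is an assumption that happens to match the output shown.

```python
import numpy as np
import pandas as pd

# same toy data frame as above
df = pd.DataFrame({
    'age': [27, np.nan, 30, 25, np.nan],
    'height': [170, 168, 173, 177, 165],
    'gender': ['female', 'male', np.nan, 'male', 'female'],
    'income': ['high', 'medium', 'low', 'low', 'no income'],
})

# numeric statistics per group; groupby drops the row where gender is NaN
num = df.groupby('gender')[['age', 'height']].agg(['mean', 'max'])
num.columns = ['_'.join(col) for col in num.columns]  # age_mean, age_max, ...

# mode of the categorical column; Series.mode() returns sorted values,
# so ties resolve to the first value in sorted order (an assumption)
fac = (df.groupby('gender')['income']
         .agg(lambda s: s.mode().iloc[0])
         .rename('income_mode'))

df_agg = num.join(fac).reset_index()
print(df_agg)
```

Note that the row with a missing gender does not contribute to either group.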
Creating text-based features
# creating text-based features
from dptools import add_text_features
df_new = add_text_features(df, text_vars = 'income')
age | height | gender | income_word_count | income_char_count | income_tfidf_0 | ... | income_tfidf_3 |
---|---|---|---|---|---|---|---|
27.0 | 170 | female | 1 | 4 | 1.0 | ... | 0.0 |
NaN | 168 | male | 1 | 6 | 0.0 | ... | 1.0 |
30.0 | 173 | NaN | 1 | 3 | 0.0 | ... | 0.0 |
25.0 | 177 | male | 1 | 3 | 0.0 | ... | 0.0 |
NaN | 165 | female | 2 | 9 | 0.0 | ... | 0.0 |
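The word and character counts in the table above can be checked with pandas string methods. The sketch below is independent of dptools and covers only those two columns; the TF-IDF columns are not reproduced because the source does not state how they are computed.

```python
import pandas as pd

# the 'income' column from the toy data frame
income = pd.Series(['high', 'medium', 'low', 'low', 'no income'], name='income')

features = pd.DataFrame({
    # number of whitespace-separated tokens
    'income_word_count': income.str.split().str.len(),
    # number of characters, including spaces
    'income_char_count': income.str.len(),
})
print(features)
```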
Working with missings
# print statistics on missing values
from dptools import print_missings
print_missings(df)
feature | Total | Percent |
---|---|---|
age | 2 | 0.4 |
gender | 1 | 0.2 |
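The same summary can be derived directly in pandas, which makes it clear that Percent is the share of rows with a missing value (0.4 means 2 of 5 rows). This is a minimal sketch that does not use dptools; the variable names and the sort order are assumptions.

```python
import numpy as np
import pandas as pd

# same toy data frame as above
df = pd.DataFrame({
    'age': [27, np.nan, 30, 25, np.nan],
    'height': [170, 168, 173, 177, 165],
    'gender': ['female', 'male', np.nan, 'male', 'female'],
    'income': ['high', 'medium', 'low', 'low', 'no income'],
})

total = df.isnull().sum()   # number of missing values per column
percent = total / len(df)   # share of rows that are missing

missings = pd.DataFrame({'Total': total, 'Percent': percent})
# keep only columns with at least one missing value, largest count first
missings = missings[missings['Total'] > 0].sort_values('Total', ascending=False)
print(missings)
```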
Data versioning
# first call saves df as 'data_v1.csv'
from dptools import save_csv_version
save_csv_version('data.csv', df, index = False)
# second call saves df as 'data_v2.csv' as data_v1.csv already exists
save_csv_version('data.csv', df, index = False)
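The versioning behavior described in the comments above can be sketched as follows. The helper save_versioned is hypothetical and written only for illustration; it is not the implementation of save_csv_version(), and picking the first unused version number is an assumption based on the behavior described here.

```python
import os
import tempfile

import pandas as pd


def save_versioned(path, df, **kwargs):
    """Hypothetical helper: write df to '<name>_v<N><ext>' using the first unused N."""
    base, ext = os.path.splitext(path)
    version = 1
    # increase the version number until the target file name is free
    while os.path.exists(f'{base}_v{version}{ext}'):
        version += 1
    target = f'{base}_v{version}{ext}'
    df.to_csv(target, **kwargs)
    return target


df = pd.DataFrame({'height': [170, 168]})

# write into a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    first = save_versioned(os.path.join(tmp, 'data.csv'), df, index=False)
    second = save_versioned(os.path.join(tmp, 'data.csv'), df, index=False)
    saved = sorted(os.listdir(tmp))

print(saved)
```

Because an existing file is never opened for writing, earlier versions stay untouched.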
Dependencies
Installation requires Python 3.7+ and the following packages:
Feedback
If you need help with the included data preprocessing functions or want to report an issue, please do so on the project's GitHub page.