Data Preprocessing Tools
dptools: data preprocessing functions for Python
Overview
The dptools Python package provides helper functions to simplify common data processing tasks in a data science pipeline, including feature engineering, data aggregation, working with missing values and more.
The package currently encompasses the following functions:
- Feature engineering:
  - add_date_features(): create date and time-based features
  - add_text_features(): create text-based features (including counts and TF-IDF)
  - aggregate_data(): aggregate data and create features based on aggregated statistics
  - encode_factors(): perform label or dummy encoding of categorical features
- Data processing:
  - split_nested_features(): split features nested in a single column
  - fill_missings(): replace missings with specific values
  - correct_colnames(): correct column names to be unique and remove foreign symbols
  - print_missings(): print information on features with missing values
  - print_factor_levels(): print levels of categorical features
- Data cleaning:
  - find_correlated_features(): identify features with a high pairwise correlation
  - find_constant_features(): identify features with a single unique value
- Import and versioning:
  - read_csv_with_json(): read CSV where some columns are in JSON format
  - save_csv_version(): save CSV with an automatically assigned version to prevent overwriting
Installation
The latest stable release of dptools can be installed from PyPI:
pip install dptools
You may also install the development version from GitHub:
pip install git+https://github.com/kozodoi/dptools.git
After the installation, you can import the included functions:
from dptools import *
Examples
This section contains a few examples of using functions from dptools for different data preprocessing tasks. Please refer to the docstrings of the implemented functions for further examples.
Creating a toy data set
First, let us create a toy data frame to demonstrate the package functionality.
# import dependencies
import pandas as pd
import numpy as np
# create data frame
data = {'age': [27, np.nan, 30, 25, np.nan],
        'height': [170, 168, 173, 177, 165],
        'gender': ['female', 'male', np.nan, 'male', 'female'],
        'income': ['high', 'medium', 'low', 'low', 'no income']}
df = pd.DataFrame(data)
age | height | gender | income |
---|---|---|---|
27.0 | 170 | female | high |
NaN | 168 | male | medium |
30.0 | 173 | NaN | low |
25.0 | 177 | male | low |
NaN | 165 | female | no income |
Aggregating features
# aggregating the data
from dptools import aggregate_data
df_new = aggregate_data(df, group_var = 'gender', num_stats = ['mean', 'max'], fac_stats = 'mode')
gender | age_mean | age_max | height_mean | height_max | income_mode |
---|---|---|---|---|---|
female | 27.0 | 27.0 | 167.5 | 170 | 'high' |
male | 25.0 | 25.0 | 172.5 | 177 | 'low' |
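For readers who want to see what this kind of aggregation computes, here is a minimal plain-pandas sketch that reproduces the table above on the toy data. It is not the dptools implementation: the column-naming scheme and the choice of the first (sorted) mode for ties are assumptions made to match the output shown.

```python
import numpy as np
import pandas as pd

# same toy data as in the example above
df = pd.DataFrame({
    'age': [27, np.nan, 30, 25, np.nan],
    'height': [170, 168, 173, 177, 165],
    'gender': ['female', 'male', np.nan, 'male', 'female'],
    'income': ['high', 'medium', 'low', 'low', 'no income'],
})

# numeric statistics per group; rows with a missing group key are dropped by groupby
num = df.groupby('gender')[['age', 'height']].agg(['mean', 'max'])
# flatten the two-level column index, e.g. ('age', 'mean') -> 'age_mean'
num.columns = ['_'.join(col) for col in num.columns]

# most frequent category per group; ties are resolved by taking the first sorted mode (assumption)
fac = (df.groupby('gender')['income']
         .agg(lambda s: s.mode().iloc[0])
         .rename('income_mode'))

agg = num.join(fac).reset_index()
print(agg)
```

Note that the row with a missing gender does not contribute to any group, which is why only two rows appear in the result.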
Creating text-based features
# creating text-based features
from dptools import add_text_features
df_new = add_text_features(df, text_vars = 'income')
age | height | gender | income_word_count | income_char_count | income_tfidf_0 | ... | income_tfidf_3 |
---|---|---|---|---|---|---|---|
27.0 | 170 | female | 1 | 4 | 1.0 | ... | 0.0 |
NaN | 168 | male | 1 | 6 | 0.0 | ... | 1.0 |
30.0 | 173 | NaN | 1 | 3 | 0.0 | ... | 0.0 |
25.0 | 177 | male | 1 | 3 | 0.0 | ... | 0.0 |
NaN | 165 | female | 2 | 9 | 0.0 | ... | 0.0 |
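The two count columns in the table can be reproduced with pandas string methods, as in the sketch below. This is only an illustration of how such counts can be derived, not the package's own code; the TF-IDF columns are left out, and the assumption that character counts include spaces is inferred from the value 9 for 'no income'.

```python
import pandas as pd

income = pd.Series(['high', 'medium', 'low', 'low', 'no income'], name='income')

features = pd.DataFrame({
    # number of whitespace-separated tokens
    'income_word_count': income.str.split().str.len(),
    # number of characters, spaces included (assumption based on the table above)
    'income_char_count': income.str.len(),
})
print(features)
```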
Working with missings
# print statistics on missing values
from dptools import print_missings
print_missings(df)
 | Total | Percent |
---|---|---|
age | 2 | 0.4 |
gender | 1 | 0.2 |
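The same summary can be derived with plain pandas, which may help when interpreting the Percent column (it is a fraction of rows, not a value out of 100). The sketch below is an illustration under that reading and is not taken from the dptools source.

```python
import numpy as np
import pandas as pd

# same toy data as in the example above
df = pd.DataFrame({
    'age': [27, np.nan, 30, 25, np.nan],
    'height': [170, 168, 173, 177, 165],
    'gender': ['female', 'male', np.nan, 'male', 'female'],
    'income': ['high', 'medium', 'low', 'low', 'no income'],
})

total = df.isnull().sum()  # missing values per column
summary = pd.DataFrame({'Total': total, 'Percent': total / len(df)})

# keep only columns that actually contain missing values, most affected first
summary = summary[summary['Total'] > 0].sort_values('Total', ascending=False)
print(summary)
```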
Finding correlated features
# displays one correlated feature from each pair
from dptools import find_correlated_features
feats = find_correlated_features(df, cutoff = 0.4, method = 'spearman')
feats
Found 1 correlated features.
['age']
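To make the cutoff concrete, the following sketch computes the absolute Spearman correlation of the two numeric columns with pandas and flags one feature per pair above the threshold. It is a hedged illustration rather than the package's algorithm; in particular, reporting the first feature of each pair is an assumption chosen to match the output above.

```python
import numpy as np
import pandas as pd

# numeric columns of the toy data
df = pd.DataFrame({
    'age': [27, np.nan, 30, 25, np.nan],
    'height': [170, 168, 173, 177, 165],
})

cutoff = 0.4
# absolute pairwise correlations, computed on rows where both values are present
corr = df.corr(method='spearman').abs()

# keep only the upper triangle so that each pair is considered once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# report the first feature of every pair above the cutoff (assumed convention)
flagged = [row for row in upper.index if (upper.loc[row] > cutoff).any()]
print(flagged)
```

On the three rows where both age and height are observed, the absolute Spearman correlation is 0.5, which exceeds the cutoff of 0.4.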
Data versioning
# first call saves df as 'data_v1.csv'
from dptools import save_csv_version
save_csv_version('data.csv', df, index = False)
# second call saves df as 'data_v2.csv' as data_v1.csv already exists
save_csv_version('data.csv', df, index = False)
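The versioning behaviour described in the comments can be sketched with the standard library alone. The helper next_version_path below is hypothetical and not part of dptools; it assumes the naming pattern name_vN.ext seen in the comments above and simply picks the first N that is not taken.

```python
import os
import tempfile


def next_version_path(path):
    """Return 'name_vN.ext' for the smallest N >= 1 that does not exist yet."""
    base, ext = os.path.splitext(path)
    version = 1
    while os.path.exists(f'{base}_v{version}{ext}'):
        version += 1
    return f'{base}_v{version}{ext}'


# demo in a temporary directory so that no real files are touched
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'data.csv')
    first = next_version_path(target)    # ends with data_v1.csv
    open(first, 'w').close()             # simulate the first save
    second = next_version_path(target)   # ends with data_v2.csv

print(os.path.basename(first), os.path.basename(second))
```

With a data frame at hand, the returned path could be passed on to pandas, for example df.to_csv(next_version_path('data.csv'), index=False).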
Dependencies
Installation requires Python 3.7+ and the following packages:
Feedback
If you need help with the included data preprocessing functions or want to report an issue, please do so on the project's GitHub page.