Data Preprocessing Tools

dptools: data preprocessing functions for Python

Overview

The dptools Python package provides helper functions that simplify common data processing tasks in a data science pipeline, including feature engineering, data aggregation, handling missing values and more.

The package currently includes the following functions:

Feature engineering:
- add_date_features(): create date- and time-based features
- add_text_features(): create text-based features (including counts and TF-IDF)
- aggregate_data(): aggregate data and create features based on aggregated statistics
- encode_factors(): perform label or dummy encoding of categorical features

Data processing:
- split_nested_features(): split features nested in a single column
- fill_missings(): replace missing values with specified values
- correct_colnames(): make column names unique and remove foreign symbols
- print_missings(): print information on features with missing values
- print_factor_levels(): print levels of categorical features

Data cleaning:
- find_correlated_features(): identify features with a high pairwise correlation
- find_constant_features(): identify features with a single unique value

Import and versioning:
- read_csv_with_json(): read a CSV file in which some columns are in JSON format
- save_csv_version(): save a CSV file with an automatically assigned version to prevent overwriting
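To give a feel for what a "constant feature" check involves, here is a minimal sketch in plain pandas. It is not the dptools implementation: the toy frame and the choice to count NaN as a value (`dropna=False`) are assumptions made only for illustration.

```python
import pandas as pd

# hypothetical toy frame: 'country' holds a single unique value
toy = pd.DataFrame({'id': [1, 2, 3],
                    'country': ['DE', 'DE', 'DE'],
                    'score': [0.1, 0.5, 0.9]})

# a column is treated as constant if it has exactly one unique value
# (assumption: NaN counts as a value, hence dropna=False)
constant_cols = [col for col in toy.columns
                 if toy[col].nunique(dropna=False) == 1]
print(constant_cols)
```

Such columns carry no information for a model and are usually safe to drop.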

Installation

The latest stable release of dptools can be installed from PyPI:

pip install dptools

You may also install the development version from GitHub:

pip install git+https://github.com/kozodoi/dptools.git

After the installation, you can import the included functions:

from dptools import *

Examples

This section shows a few examples of using dptools functions for different data preprocessing tasks. Please refer to the docstrings of the individual functions for further examples.

Creating a toy data set

First, let us create a toy data frame to demonstrate the package functionality.

# import dependencies
import pandas as pd
import numpy as np

# create data frame
data = {'age': [27, np.nan, 30, 25, np.nan],
        'height': [170, 168, 173, 177, 165],
        'gender': ['female', 'male', np.nan, 'male', 'female'],
        'income': ['high', 'medium', 'low', 'low', 'no income']}
df = pd.DataFrame(data)

age	height	gender	income
27.0	170	female	high
NaN	168	male	medium
30.0	173	NaN	low
25.0	177	male	low
NaN	165	female	no income

Aggregating features

# aggregating the data
from dptools import aggregate_data
df_new = aggregate_data(df, group_var = 'gender', num_stats = ['mean', 'max'], fac_stats = 'mode')

gender	age_mean	age_max	height_mean	height_max	income_mode
female	27.0	27.0	167.5	170	'high'
male	25.0	25.0	172.5	177	'low'
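For readers who want to see what this aggregation amounts to, the table above can be reproduced with a plain pandas groupby. This is a sketch under assumptions, not the code dptools runs internally: in particular, taking the first value returned by `Series.mode()` as the tie-breaker is an assumption that happens to match the output shown.

```python
import numpy as np
import pandas as pd

# same toy data frame as above
df = pd.DataFrame({'age': [27, np.nan, 30, 25, np.nan],
                   'height': [170, 168, 173, 177, 165],
                   'gender': ['female', 'male', np.nan, 'male', 'female'],
                   'income': ['high', 'medium', 'low', 'low', 'no income']})

# numeric statistics per group; rows with a missing gender are dropped by groupby
num_agg = df.groupby('gender')[['age', 'height']].agg(['mean', 'max'])
num_agg.columns = ['_'.join(col) for col in num_agg.columns]

# most frequent income per group; ties resolved by taking the first mode (assumption)
fac_agg = (df.groupby('gender')['income']
             .agg(lambda s: s.mode().iloc[0])
             .rename('income_mode'))

df_agg = num_agg.join(fac_agg).reset_index()
print(df_agg)
```

Note that the row with a missing gender does not form its own group, which is why the result has only two rows.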

Creating text-based features

# creating text-based features
from dptools import add_text_features
df_new = add_text_features(df, text_vars = 'income')

age	height	gender	income_word_count	income_char_count	income_tfidf_0	...	income_tfidf_3
27.0	170	female	1	4	1.0	...	0.0
NaN	168	male	1	6	0.0	...	1.0
30.0	173	NaN	1	3	0.0	...	0.0
25.0	177	male	1	3	0.0	...	0.0
NaN	165	female	2	9	0.0	...	0.0
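The word and character counts in the table can be checked by hand with pandas string methods. The sketch below covers only the two count columns and leaves out TF-IDF; it assumes words are split on whitespace and that spaces count as characters, which is consistent with the values shown above but is not taken from the dptools source.

```python
import pandas as pd

# the text column from the toy data frame
income = pd.Series(['high', 'medium', 'low', 'low', 'no income'], name='income')

counts = pd.DataFrame({
    # number of whitespace-separated tokens
    'income_word_count': income.str.split().str.len(),
    # number of characters, including spaces
    'income_char_count': income.str.len(),
})
print(counts)
```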

Working with missings

# print statistics on missing values
from dptools import print_missings
print_missings(df)

	Total	Percent
age	2	0.4
gender	1	0.2
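The same summary can be derived directly in pandas, which helps when the counts are needed as a data frame rather than as printed output. This is an illustrative sketch, not the dptools implementation; the column names `Total` and `Percent` simply mirror the output above.

```python
import numpy as np
import pandas as pd

# same toy data frame as above
df = pd.DataFrame({'age': [27, np.nan, 30, 25, np.nan],
                   'height': [170, 168, 173, 177, 165],
                   'gender': ['female', 'male', np.nan, 'male', 'female'],
                   'income': ['high', 'medium', 'low', 'low', 'no income']})

# count and share of missing values per column
missings = pd.DataFrame({'Total': df.isnull().sum(),
                         'Percent': df.isnull().mean()})

# keep only the columns that actually contain missing values
missings = missings[missings['Total'] > 0]
print(missings)
```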

Finding correlated features

# return one feature from each highly correlated pair
from dptools import find_correlated_features
feats = find_correlated_features(df, cutoff = 0.4, method = 'spearman')
feats

Found 1 correlated features.

['age']
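To see why a pair is flagged at this cutoff, the Spearman correlation between the two numeric columns can be computed with pandas. The sketch assumes the cutoff is compared against the absolute correlation; that reading is consistent with the output above but is an assumption about dptools rather than a documented fact.

```python
import numpy as np
import pandas as pd

# numeric columns of the toy data frame
df = pd.DataFrame({'age': [27, np.nan, 30, 25, np.nan],
                   'height': [170, 168, 173, 177, 165]})

# pairwise Spearman correlation; rows with a missing age are skipped for this pair
corr = df.corr(method='spearman')
rho = corr.loc['age', 'height']

# assumption: a pair is flagged when the absolute correlation exceeds the cutoff
cutoff = 0.4
is_flagged = abs(rho) > cutoff
print(rho, is_flagged)
```

Only three rows have both values present, so the estimate rests on very little data; on a real data set a higher cutoff is usually more sensible.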

Data versioning

# first call saves df as 'data_v1.csv'
from dptools import save_csv_version
save_csv_version('data.csv', df, index = False)

# second call saves df as 'data_v2.csv' because 'data_v1.csv' already exists
save_csv_version('data.csv', df, index = False)
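The versioning idea can be sketched with the standard library alone: look for the first `_vN` suffix that is not taken yet. The helper `next_version_path` below is hypothetical and written only for illustration; it is not part of dptools and may differ from how save_csv_version() picks the version.

```python
import tempfile
from pathlib import Path


def next_version_path(path):
    """Return the first 'name_vN.ext' path that does not exist yet (hypothetical helper)."""
    path = Path(path)
    version = 1
    while True:
        candidate = path.with_name(f"{path.stem}_v{version}{path.suffix}")
        if not candidate.exists():
            return candidate
        version += 1


# demonstrate in a temporary directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp:
    first = next_version_path(Path(tmp) / "data.csv")
    first.write_text("age,height\n27,170\n")  # simulate the first save
    second = next_version_path(Path(tmp) / "data.csv")
    names = (first.name, second.name)

print(names)
```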

Dependencies

Installation requires Python 3.7+ and the following packages:

Feedback

If you need help with the included data preprocessing functions or want to report an issue, please use the project's GitHub page.

Release history

- 0.4.2 (this version): Apr 19, 2022
- 0.4.1: Mar 27, 2022
- 0.4.0: Jul 29, 2020
- 0.3.11: Jul 29, 2020
- 0.3.10: May 21, 2020
- 0.3.9: May 6, 2020
- 0.3.8: May 3, 2020
- 0.3.7: May 3, 2020
- 0.3.6: May 2, 2020
- 0.3.5: Apr 20, 2020
- 0.3.4: Apr 20, 2020
- 0.3.3: Apr 20, 2020
- 0.3.2: Apr 20, 2020
- 0.3.1: Apr 16, 2020
- 0.3.0: Apr 16, 2020
- 0.2.3: Apr 16, 2020
- 0.2.2: Apr 15, 2020
- 0.2.1: Apr 15, 2020
- 0.2.0: Apr 15, 2020
- 0.1.0: Apr 15, 2020

