Data Preprocessing Tools

dptools: data preprocessing functions for Python


Project status: Active. The project has reached a stable, usable state and is being actively developed.


Overview

The dptools Python package provides helper functions that simplify common data processing tasks in a data science pipeline, including feature engineering, data aggregation, and handling missing values.

The package currently encompasses the following functions:

  • Feature engineering:
    • add_date_features(): add date-based features
    • add_text_features(): add text-based features
    • aggregate_data(): aggregate data and add aggregation-based features
    • encode_factors(): perform label or dummy encoding of categorical features
  • Data processing:
    • split_nested_features(): split features nested in a single column
    • fill_missings(): replace missings with specific values
    • print_missings(): print information on features with missing values
    • print_factor_levels(): print levels of categorical features
  • Data cleaning:
    • find_correlated_features(): find features with a high pairwise correlation
    • find_constant_features(): find features with a single unique value
  • Import and versioning:
    • read_csv_with_json(): read CSV with columns in JSON format
    • save_csv_version(): save CSV with an automatically assigned version number to prevent overwriting
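
As a rough illustration of what the two data cleaning helpers look for, the following sketch reproduces the idea in plain pandas. It does not call dptools and is not the package's implementation; the toy frame, the `cutoff` value, and all variable names are assumptions made for this example.

```python
import pandas as pd

# toy frame (assumed for illustration only)
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [2, 4, 6, 8],   # perfectly correlated with 'a'
    'c': [5, 5, 5, 5],   # constant
    'd': [1, 0, 1, 0],
})

# constant features: columns with exactly one unique value
constant = [col for col in toy.columns if toy[col].nunique(dropna=False) == 1]

# correlated features: pairs whose absolute Pearson correlation exceeds a cutoff
cutoff = 0.9  # hypothetical threshold
corr = toy.drop(columns=constant).corr().abs()
correlated = [
    (x, y)
    for i, x in enumerate(corr.columns)
    for y in corr.columns[i + 1:]
    if corr.loc[x, y] > cutoff
]

print(constant)
print(correlated)
```

Dropping the constant columns before computing correlations avoids the undefined (NaN) correlations that a zero-variance column would otherwise produce.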

Installation

The latest stable release can be installed from PyPI:

pip install dptools

You may also install the development version from GitHub:

pip install git+https://github.com/kozodoi/dptools.git

After the installation, you can import the included functions:

from dptools import *

Examples

This section contains a few examples of using functions from dptools for different data preprocessing tasks. Please refer to the function docstrings for further examples.

Creating a toy data set

First, let us create a toy data frame to demonstrate the package functionality.

# import dependencies
import pandas as pd
import numpy as np

# create data frame
data = {'age': [27, np.nan, 30, 25, np.nan], 
        'height': [170, 168, 173, 177, 165], 
        'gender': ['female', 'male', np.nan, 'male', 'female'],
        'income': ['high', 'medium', 'low', 'low', 'no income']}
df = pd.DataFrame(data)
 age  height  gender     income
27.0     170  female       high
 NaN     168    male     medium
30.0     173     NaN        low
25.0     177    male        low
 NaN     165  female  no income

Aggregating features

# aggregating the data
from dptools import aggregate_data
df_new = aggregate_data(df, group_var = 'gender', num_stats = ['mean', 'max'], fac_stats = 'mode')   
gender  age_mean  age_max  height_mean  height_max  income_mode
female      27.0     27.0        167.5         170       'high'
  male      25.0     25.0        172.5         177        'low'
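
For comparison, the numeric part of this result can be approximated with a plain pandas `groupby`. This is only a sketch of the underlying idea, not the code dptools runs; it leaves out the categorical `income_mode` column, and the name `agg` is chosen for this example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [27, np.nan, 30, 25, np.nan],
    'height': [170, 168, 173, 177, 165],
    'gender': ['female', 'male', np.nan, 'male', 'female'],
})

# groupby drops rows whose group key is NaN by default,
# and mean/max skip NaN values inside each group
agg = df.groupby('gender')[['age', 'height']].agg(['mean', 'max'])

# flatten the two-level column index into names such as 'age_mean'
agg.columns = ['_'.join(col) for col in agg.columns]
agg = agg.reset_index()

print(agg)
```

The row with a missing `gender` does not appear in the output because it belongs to no group.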

Creating text-based features

# creating text-based features
from dptools import add_text_features
df_new = add_text_features(df, text_vars = 'income')
 age  height  gender  income_word_count  income_char_count  income_tfidf_0  ...  income_tfidf_3
27.0     170  female                  1                  4             1.0  ...             0.0
 NaN     168    male                  1                  6             0.0  ...             1.0
30.0     173     NaN                  1                  3             0.0  ...             0.0
25.0     177    male                  1                  3             0.0  ...             0.0
 NaN     165  female                  2                  9             0.0  ...             0.0
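
The two count columns can be reproduced with the pandas string accessor, which may help when checking the output by hand. This sketch is an assumption about what the counts mean (whitespace-separated words, characters including spaces), not the package's implementation, and it deliberately omits the TF-IDF columns.

```python
import pandas as pd

income = pd.Series(['high', 'medium', 'low', 'low', 'no income'], name='income')

features = pd.DataFrame({
    # number of whitespace-separated tokens in each value
    'income_word_count': income.str.split().str.len(),
    # number of characters, spaces included
    'income_char_count': income.str.len(),
})

print(features)
```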

Working with missings

# print statistics on missing values
from dptools import print_missings
print_missings(df)
        Total  Percent
age         2      0.4
gender      1      0.2
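
The same summary can be derived directly in pandas, which shows that `Percent` here is a fraction of rows rather than a value out of 100. The sketch below is a plain pandas equivalent written for illustration, not the body of `print_missings()`; the name `summary` is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [27, np.nan, 30, 25, np.nan],
    'height': [170, 168, 173, 177, 165],
    'gender': ['female', 'male', np.nan, 'male', 'female'],
    'income': ['high', 'medium', 'low', 'low', 'no income'],
})

missing = df.isnull()
summary = pd.DataFrame({
    'Total': missing.sum(),     # number of missing values per column
    'Percent': missing.mean(),  # share of rows that are missing
})

# keep only the columns that actually contain missing values
summary = summary[summary['Total'] > 0]

print(summary)
```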

Data versioning

# first call saves df as 'data_v1.csv'
from dptools import save_csv_version
save_csv_version('data.csv', df, index = False)

# second call saves df as 'data_v2.csv' as data_v1.csv already exists
save_csv_version('data.csv', df, index = False)
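
The versioning behaviour described in the comments above can be sketched with the standard library alone. The helper `next_version_path` below is hypothetical and written for this example; it is not part of dptools and only mirrors the `_v1`, `_v2` naming pattern shown here.

```python
import tempfile
from pathlib import Path


def next_version_path(path):
    """Return the first '<stem>_v<N><suffix>' path that does not exist yet."""
    path = Path(path)
    version = 1
    while True:
        candidate = path.with_name(f"{path.stem}_v{version}{path.suffix}")
        if not candidate.exists():
            return candidate
        version += 1


# demonstrate in a temporary directory so no real files are touched
with tempfile.TemporaryDirectory() as tmp:
    first = next_version_path(Path(tmp) / 'data.csv')
    first.write_text('a,b\n1,2\n')   # stand-in for writing the CSV
    second = next_version_path(Path(tmp) / 'data.csv')
    names = (first.name, second.name)

print(names)
```

Because the version number is chosen by probing for existing files, earlier versions are never overwritten.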

Dependencies

Installation requires Python 3.7+ and the following packages:

Feedback

If you need help with the included data preprocessing functions or want to report an issue, please do so on the corresponding GitHub page.
