PyPreProc
PyPreProc is a Python package that you can use for preprocessing and feature engineering during machine learning development projects. It uses Pandas and makes it quicker and easier to correct, convert, cluster and create data.
Setup
PyPreProc can be installed from PyPI using pip3 install pypreproc.
Examples
To use PyPreProc, simply import the modules you need and load your data into a Pandas DataFrame.
import pandas as pd
from pypreproc import correct, convert, cluster, customer, create, helper
df = pd.read_csv('data.csv')
Correcting data
The cols_to_strip_characters() function takes a list of columns and a character or string, strips it from the column data, and returns the modified DataFrame.
strip_cols = ['name', 'address']
df = correct.cols_to_strip_characters(df, strip_cols, '$')
The cols_to_drop() function drops a list of columns and returns the modified DataFrame.
drop_cols = ['email', 'shoe_size']
df = correct.cols_to_drop(df, drop_cols)
The col_names_to_lower() function converts all Pandas column names to lowercase.
df = correct.col_names_to_lower(df)
The cols_to_rename() function lets you use a Python dictionary to rename specific column names in your Pandas DataFrame.
rename_cols = {'old_name1': 'new_name1', 'old_name2': 'new_name2'}
df = correct.cols_to_rename(df, rename_cols)
Converting data
The cols_to_slugify() function "slugifies" data into a continuous string, stripping special characters and replacing spaces with underscores. This is very useful for one-hot encoding, as the column values become column names that are easier to reference.
slugify_cols = ['country', 'manufacturer']
df = convert.cols_to_slugify(df, slugify_cols)
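Conceptually, slugification is just a chain of string operations (a minimal sketch of the idea, not the package's actual implementation):
# Lowercase, strip special characters, replace spaces with underscores
# e.g. 'New York (USA)' becomes 'new_york_usa'
df['country'] = (df['country'].str.lower()
                 .str.replace(r'[^\w\s]', '', regex=True)
                 .str.strip()
                 .str.replace(r'\s+', '_', regex=True))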
The cols_to_float() function converts column values to float.
float_cols = ['price', 'weight']
df = convert.cols_to_float(df, float_cols)
The cols_to_int() function converts column values to int.
int_cols = ['age', 'children']
df = convert.cols_to_int(df, int_cols)
The cols_to_datetime() function converts column values to datetime.
date_cols = ['date']
df = convert.cols_to_datetime(df, date_cols)
The cols_to_negative() function converts column values to a negative value.
neg_cols = ['score']
df = convert.cols_to_negative(df, neg_cols)
Clustering data
The kmeans_cluster() function makes it easy to use unsupervised learning algorithms in your supervised machine learning model, which can often yield great improvements.
To use this function, pass the DataFrame and the column you wish to cluster, provide a name for the new column of cluster data, define the number of clusters to create, and provide a value to use if a NaN is returned.
# Definitions added for readability
column = 'offspring'
cluster_name = 'fecundity'
n_clusters = 5
fillna_value = 0
df = cluster.kmeans_cluster(df, column, cluster_name, n_clusters, fillna_value)
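For reference, clustering a single column with k-means in scikit-learn looks roughly like this (a sketch of the general technique, not necessarily how kmeans_cluster() is implemented internally):
from sklearn.cluster import KMeans

# Fill NaNs, reshape the single column to 2D and assign a cluster label per row
values = df['offspring'].fillna(0).to_numpy().reshape(-1, 1)
df['fecundity'] = KMeans(n_clusters=5, random_state=0).fit_predict(values)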
Customer data
The rfm_model() function uses the excellent Lifetimes package to perform simple RFM modelling. This examines the Recency, Frequency and Monetary values of customers and returns data that identify the value of each customer and their propensity to purchase again.
# Definitions added for readability
customer_column = 'customer_id'
date_column = 'order_date'
monetary_column = 'order_value'
df = customer.rfm_model(df, customer_column, date_column, monetary_column)
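If you want to see the per-customer summary Lifetimes itself produces, its utility function works along these lines (a sketch using the Lifetimes API directly; rfm_model() presumably wraps something similar):
from lifetimes.utils import summary_data_from_transaction_data

# One row per customer with frequency, recency, T and monetary_value columns
rfm = summary_data_from_transaction_data(
    df, customer_id_col='customer_id', datetime_col='order_date',
    monetary_value_col='order_value')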
Creating data
The create module includes a wide range of functions for creating new features from existing data in your Pandas DataFrame. These are useful when investigating which features might correlate with your model's target and could improve model performance.
The cols_to_log() function takes a list of columns and provides their log values in new columns prefixed with log_, so you can compare them against the non-transformed data. This method is best for data where the column values do not include zeros.
log_cols = ['engine_size', 'output']
df = create.cols_to_log(df, log_cols)
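The underlying transformation is simply a natural log applied column-wise (a sketch of the equivalent operation, using the log_ prefix convention described above):
import numpy as np

# np.log is undefined at zero, hence the caveat about zero values
df['log_engine_size'] = np.log(df['engine_size'])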
The cols_to_log1p() function works like cols_to_log() but adds 1, allowing it to work on columns that contain zero values.
log_cols = ['fins', 'gill_rakers']
df = create.cols_to_log1p(df, log_cols)
The cols_to_log_max_root() function converts column data to log values using the maximum value as the log max and returns new columns of prefixed data. For use with data where the column values include zeros.
log_cols = ['fins', 'gill_rakers']
df = create.cols_to_log_max_root(df, log_cols)
The cols_to_tanh() function takes a list of columns and returns their hyperbolic tangent in a new column.
cols = ['fins', 'gill_rakers']
df = create.cols_to_tanh(df, cols)
The cols_to_sigmoid() function takes a list of columns, maps the data points to values between 0 and 1 using a sigmoid function, and returns new columns of prefixed data.
cols = ['fins', 'gill_rakers']
df = create.cols_to_sigmoid(df, cols)
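The sigmoid itself is the standard logistic function (a worked sketch of the mapping; the output column name here is illustrative):
import numpy as np

# sigmoid(x) = 1 / (1 + e^-x) squashes any real value into (0, 1)
df['sigmoid_fins'] = 1 / (1 + np.exp(-df['fins']))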
The cols_to_cube_root() function takes a list of columns and returns their cube root in new columns of prefixed data.
cols = ['fins', 'gill_rakers']
df = create.cols_to_cube_root(df, cols)
The cols_to_cube_root_normalize() function takes a list of columns and returns their normalised cube root, so all values are between 0 and 1.
cols = ['fins', 'gill_rakers']
df = create.cols_to_cube_root_normalize(df, cols)
The cols_to_percentile() function converts data points to their percentile-linearized values and returns new columns of prefixed data.
cols = ['fins', 'gill_rakers']
df = create.cols_to_percentile(df, cols)
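Percentile values can be expressed with pandas' rank (a sketch of the equivalent idea, with an illustrative column name; the package's exact method may differ):
# rank(pct=True) maps each value to its percentile in (0, 1]
df['percentile_fins'] = df['fins'].rank(pct=True)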
The cols_to_normalize() function normalizes data points to values between 0 and 1 and returns new columns of prefixed data.
cols = ['fins', 'gill_rakers']
df = create.cols_to_normalize(df, cols)
The cols_to_log1p_normalize() function log+1 normalizes data points to values between 0 and 1 and returns new columns of prefixed data. It's best for use with columns where the data contain zeros.
cols = ['fins', 'gill_rakers']
df = create.cols_to_log1p_normalize(df, cols)
The cols_to_one_hot() function one-hot encodes column values and creates new columns containing the one-hot encoded data. For example, if you have a column containing two values (fish or bird) it will return a 1 or 0 for bird and a 1 or 0 for fish.
It's designed for use with low-cardinality data (in which there are only a small number of unique values within the column).
cols = ['class', 'genus']
df = create.cols_to_one_hot(df, cols)
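Plain pandas offers the same idea via get_dummies (a sketch of the general technique; cols_to_one_hot() may differ in column naming and NaN handling):
# Creates one 0/1 column per unique value, e.g. class_fish and class_bird
df = pd.get_dummies(df, columns=['class', 'genus'])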
The cols_to_reduce_uniques() function takes a dictionary of columns and thresholds, and reduces the number of unique values in each column by assigning those that fall below the threshold to "others".
cols = {'col1': 1000, 'col2': 3000}
df = create.cols_to_reduce_uniques(df, cols)
Grouped data
The cols_to_count_uniques() function counts the number of unique column values when grouping by another column and returns new columns in the original DataFrame. For example, grouping by the region column and examining data in the cars and children columns would return new columns called unique_cars_by_region and unique_children_by_region.
df = create.cols_to_count_uniques(df, 'region', ['cars', 'children'])
The cols_to_count() function counts the number of column values when grouping by another column and returns new columns in the original DataFrame.
df = create.cols_to_count(df, 'region', 'cars')
The cols_to_sum() function sums column values when grouping by another column and returns new columns in the original DataFrame.
df = create.cols_to_sum(df, 'region', 'cars')
The get_rolling_average() function returns the rolling average for a column based on a grouping over X previous periods. For example, the rolling average order value for a customer over their past three visits.
df = create.get_rolling_average(df, 'group_col', 'avg_col', 5, 'sort_col')
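In plain pandas the same idea looks roughly like this (a sketch with illustrative column names; the function's exact sorting and window handling may differ):
# Rolling mean of the last 5 rows within each group, in sort order
df = df.sort_values('sort_col')
df['rolling_avg'] = (df.groupby('group_col')['avg_col']
                     .transform(lambda s: s.rolling(5, min_periods=1).mean()))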
Dates
The get_days_since_date() function returns a new column containing the date difference in days between two dates. For example, the number of days since a last dose.
df = create.get_days_since_date(df, 'date_before', 'date_after', 'days_since_last_dose')
The get_dates() function takes a single date column and returns a range of new columns, including day, month, year, year_month, week_number, day_number, day_name, month_name, mysql_date and quarter, which are often more useful in modeling than a very granular date.
df = create.get_dates(df, 'visit_date')
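Most of these columns come straight from pandas' dt accessor (a sketch of a few of the derived columns, assuming the column parses as a datetime):
df['visit_date'] = pd.to_datetime(df['visit_date'])
df['day'] = df['visit_date'].dt.day
df['month'] = df['visit_date'].dt.month
df['day_name'] = df['visit_date'].dt.day_name()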
Other features
The get_grouped_stats() function groups data by a column, returns summary statistics for a list of columns, and adds prefixed data to new columns. These include mean, median, std, max and min.
df = create.get_grouped_stats(df, 'species', ['dorsal_fin_rays', 'lateral_line_scales'])
The get_feature_interactions() function combines multiple features to create new features based on 2, 3 or 4 unique combinations.
df = create.get_feature_interactions(df, ['fins','scales','gill_rakers'], 3)
The get_binned_data() function performs a simple binning operation on a column and returns a column of binned data in the DataFrame.
# Definitions added for readability
column = 'orders'
name = 'orders_bin'
bins = 5
df = create.get_binned_data(df, column, name, bins)
The sum_columns() function returns a single value based on the sum of a column.
value = create.sum_columns(df, 'revenue')
The get_diff() function returns the difference between two column values.
value = create.get_diff(df, 'order_value', 'aov')
The get_previous_cumulative_sum() function gets the previous cumulative sum of a column based on a group. For example, the running total of orders placed by a given customer at time X. It does not include the current value.
previous_orders = create.get_previous_cumulative_sum(df, 'group_column', 'sum_column', 'sort_column')
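A common way to express this in pandas is a grouped cumulative sum with the current row subtracted (a sketch of the technique, with an illustrative output column name):
# Cumulative sum within each group, minus the current row's own value
df = df.sort_values('sort_column')
df['prev_cumsum'] = df.groupby('group_column')['sum_column'].cumsum() - df['sum_column']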
The get_cumulative_sum() function returns the cumulative sum of a grouped column and includes the current value.
total_orders = create.get_cumulative_sum(df, 'group_column', 'sum_column', 'sort_column')
The get_previous_cumulative_count() function counts cumulative column values based on a grouping, not including the current value.
previous_orders = create.get_previous_cumulative_count(df, 'group_column', 'count_column', 'sort_column')
The get_previous_value() function groups by a column, returns the previous value of another column, and assigns it to a new column. For example, the previous value of a customer's order.
df = create.get_previous_value(df, 'customer_id', 'order_value')
The get_probability_ratio() function groups a Pandas DataFrame by a given column and returns the probability ratio of the target variable for that grouping. It's a useful way of using target data to improve model performance, with less likelihood of introducing data leakage.
df = create.get_probability_ratio(df, 'group_column', 'target_column')
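For a binary target, the probability ratio per group is typically p / (1 - p) (a sketch of the general encoding, assuming a 0/1 target; real implementations usually guard against p = 1):
# Per-group probability that the target is 1, turned into an odds ratio
p = df.groupby('group_column')['target_column'].transform('mean')
df['prob_ratio'] = p / (1 - p)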
The get_mean_encoding() function groups a Pandas DataFrame by a given column and returns the mean of the target variable for that grouping. For example, if your model's target variable is "revenue", what is the mean revenue for people by "country"?
df = create.get_mean_encoding(df, 'group_column', 'target_column')
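In pandas terms this is a grouped transform (a sketch of the standard technique; the output column name is illustrative):
# Replace each group label with the mean target value for that group
df['mean_enc'] = df.groupby('group_column')['target_column'].transform('mean')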
The get_frequency_rank() function returns the frequency rank of a categorical variable to assign to a new Pandas DataFrame column. It takes the value count of each categorical value and then ranks them across the DataFrame. Items with equal value counts are assigned equal ranking. This is a monotonic transformation.
df['freq_rank_species'] = create.get_frequency_rank(df, 'lateral_line_pores')
The get_conversion_rate() function returns the conversion rate for a column.
df['cr'] = create.get_conversion_rate(df, 'sessions', 'orders')
Helpers
The get_unique_rows() function de-dupes rows in a Pandas DataFrame and returns a new DataFrame containing only the unique rows. This simple function keeps the last value.
df = helper.get_unique_rows(df, ['col1', 'col2'], 'sort_column')
The select() helper function provides a very quick and easy way to filter a Pandas DataFrame. It takes five values: df (the DataFrame), column_name (the name of the column you want to search), operator (the search operator you want to use: endswith, startswith, contains, isin or is), where (the value or values to match), and an optional exclude parameter (True or False) which defines whether the search includes or excludes the matching data.
ends = helper.select(df, 'genus', 'endswith', where='chromis', exclude=False)
sw = helper.select(df, 'genus', 'startswith', where='Haplo', exclude=False)
contains = helper.select(df, 'genus', 'contains', where='cich', exclude=False)
isin = helper.select(df, 'genus', 'isin', where=['cich','theraps'], exclude=False)
_is = helper.select(df, 'genus', 'is', where='Astronotus', exclude=False)