
Rare-Label-One-Hot Encoder with Threshold Implementation and Python Package.


Rare-Label-One-Hot Encoder


About

Want to one-hot encode train and test sets that contain rare labels while still giving importance to the top entries? No worries!

The Rare-Label-One-Hot Encoder Python package is here to help!

It is a categorical encoder, mostly intended for classical machine learning algorithms, that one-hot encodes a feature with high cardinality and rare labels in the train and test sets.
It sets a threshold (which can be user-defined) for keeping the top categories/entries and treats the rest (the least significant) as others. It also handles rare labels when mapping the feature from train to test and vice versa.
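The idea described above can be sketched with plain pandas. This is a minimal illustration under stated assumptions, not the RLOHE implementation: the helper name `rare_label_one_hot`, the `OTHERS` label, and the toy data are all hypothetical, and the top categories are assumed to be learned from the train set only.

```python
import pandas as pd

def rare_label_one_hot(train, test, feature, top_n, prefix, other="OTHERS"):
    """Hypothetical helper (not part of RLOHE): keep the `top_n` most
    frequent train categories of `feature` and pool everything else."""
    # Assumption: the top categories are learned from the train set only.
    top = train[feature].value_counts().nlargest(top_n).index
    levels = pd.CategoricalDtype(categories=list(top) + [other])

    def encode(df):
        # Labels outside the top set (including ones unseen in train) are pooled.
        pooled = df[feature].where(df[feature].isin(top), other).astype(levels)
        # A fixed category list gives train and test identical dummy columns.
        dummies = pd.get_dummies(pooled, prefix=prefix)
        return pd.concat([df.drop(columns=feature), dummies], axis=1)

    return encode(train), encode(test)

# Toy data: 'c' is rare in train, 'd' appears only in test.
train = pd.DataFrame({"dept": ["a", "a", "a", "b", "b", "c"], "cost": range(6)})
test = pd.DataFrame({"dept": ["a", "d", "b"], "cost": range(3)})
enc_train, enc_test = rare_label_one_hot(train, test, "dept", top_n=2, prefix="dept")
print(enc_test)
```

Fixing the category list before calling `pd.get_dummies` is what keeps the train and test frames aligned even when a label is missing from one of them.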

You can choose the top entries by one of two criteria: level, which takes the top entries up to the threshold set, or amount, which takes every entry above the threshold as a top entry and treats the rest as others.
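To make the difference between the two criteria concrete, here is a small pandas sketch. It is only an illustration: `top_entries` is a hypothetical helper, and reading "amount" as the number of rows per category is an assumption, since the description does not say exactly what is compared against the threshold.

```python
import pandas as pd

def top_entries(values, threshold, criterion="level"):
    """Hypothetical helper (not part of RLOHE): categories treated as 'top'."""
    counts = values.value_counts()
    if criterion == "level":
        # Keep the `threshold` most frequent categories.
        return list(counts.nlargest(threshold).index)
    if criterion == "amount":
        # Assumption: 'amount' is read here as the row count of each category.
        return list(counts[counts > threshold].index)
    raise ValueError("criterion must be 'level' or 'amount'")

# Toy feature: hr x5, it x3, ops x2, legal x1.
departments = pd.Series(["hr"] * 5 + ["it"] * 3 + ["ops"] * 2 + ["legal"])
by_level = top_entries(departments, 3, "level")    # the three most frequent labels
by_amount = top_entries(departments, 2, "amount")  # labels with more than 2 rows
print(by_level, by_amount)
```

With the same data, the level criterion always returns a fixed number of labels, while the amount criterion returns however many labels clear the cutoff.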

Rare-Label-One-Hot Encoder is available as RLOHE in PyPI.

Installation

Run one of the following in your terminal to install RLOHE:

1. Install the package using pip:

pip install RLOHE

OR

pip3 install RLOHE

2. Clone the repository and install it in editable mode:

git clone https://github.com/rahulbordoloi/Rare-Label-One-Hot-Enocder/
cd Rare-Label-One-Hot-Enocder
pip install -e .

Usage

The RLOHE package contains two functions:

  • TopLabeledEntries : Reports an analysis of the top labeled entries of two given DataFrames.
  • RareLabelOneHotEncoder : Returns rare-label one-hot encoded DataFrames according to the threshold set and its criterion of segregation.

As a sanity check, it is advised to run TopLabeledEntries first to inspect the top entries and how well they are represented in each dataset before encoding.
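The kind of check meant here can be approximated by hand with pandas. The sketch below is not what TopLabeledEntries prints; `top_entry_coverage` and the toy data are hypothetical and only show one way to compare how the top train categories are represented in each set.

```python
import pandas as pd

def top_entry_coverage(train, test, feature, top_n):
    """Hypothetical helper (not part of RLOHE): share of rows that each
    top train category covers in the train and test sets."""
    top = train[feature].value_counts().nlargest(top_n).index
    train_share = train[feature].value_counts(normalize=True).reindex(top)
    # A top train category may be absent from test, hence the fillna.
    test_share = test[feature].value_counts(normalize=True).reindex(top).fillna(0.0)
    return pd.DataFrame({"train_share": train_share, "test_share": test_share})

train = pd.DataFrame({"dept": ["a", "a", "a", "b", "b", "c"]})
test = pd.DataFrame({"dept": ["a", "d", "b", "b"]})
coverage = top_entry_coverage(train, test, "dept", top_n=2)
print(coverage)
```

A large gap between the two columns suggests that the chosen threshold keeps categories that matter in one set but not in the other.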

Arguments

1. For the TopLabeledEntries function:

Parameter            Description
train                The train dataset.
test                 The test dataset.
feature_name         The feature on which encoding is to be done.
threshold            The cutoff used to separate the top entries from the rest.
criterion            Either level or amount; decides how the top entries are picked. Check the reference for more information.
secondary_feature    Another feature whose amount statistics are reported with respect to the primary feature.
verbose              Controls output to the console.
return_dataframe     Whether a DataFrame is to be returned.

2. For the RareLabelOneHotEncoder function:

Parameter            Description
train                The train dataset.
test                 The test dataset.
feature_name         The feature on which encoding is to be done.
threshold            The cutoff used to separate the top entries from the rest.
criterion            Either level or amount; decides how the top entries are picked. Check the reference for more information.
verbose              Controls output to the console.
prefix_name          The prefix added in front of each new encoded feature.

Reference
* level : Takes the top entries of the feature up to the threshold and treats the rest as BELOW.
* amount : Takes the entries of the feature that are above the threshold and treats the rest as BELOW.

Run this script in order to get the Top Entries according to a given threshold!

# Importing Libraries
import RLOHE as encoder
import pandas as pd

# Main Method
if __name__ == '__main__':

    # Reading in Dataset
    train = pd.read_csv('https://raw.githubusercontent.com/rahulbordoloi/Rare-Label-One-Hot-Enocder/main/Data/Train_Data.csv')
    test = pd.read_csv('https://raw.githubusercontent.com/rahulbordoloi/Rare-Label-One-Hot-Enocder/main/Data/Test_Data.csv')

    # Displaying out the Top Entries According to the Threshold set.
    encoder.TopLabeledEntries(train, test, feature_name = 'department_info', threshold = 10, secondary_feature = 'cost_to_pay')

Run this script in order to get the Rare Label One-Hot Encoded DataFrames according to a given threshold!

# Importing Libraries
import RLOHE as encoder
import pandas as pd

# Main Method
if __name__ == '__main__':

    # Reading in Dataset
    train = pd.read_csv('https://raw.githubusercontent.com/rahulbordoloi/Rare-Label-One-Hot-Enocder/main/Data/Train_Data.csv')
    test = pd.read_csv('https://raw.githubusercontent.com/rahulbordoloi/Rare-Label-One-Hot-Enocder/main/Data/Test_Data.csv')

    # Rare Label One-Hot Encoder [Level Wise]
    encodedTrain, encodedTest = encoder.RareLabelOneHotEncoder(train, test, feature_name = 'department_info', threshold = 10,
                                                       criterion = 'level', prefix_name = 'dept')
  • Check out the Rare Label One-Hot Encoder implementation in Google Colab.

Developing Rare Label One Hot Encoder

To install RLOHE along with the tools you need to develop and run tests, execute the following in your virtualenv:

$ pip install -e .[dev]

Contact Author

Name : Rahul Bordoloi
Website : https://rahulbordoloi.me
Email : rahulbordoloi24@gmail.com

