Custom implementation of tfidf for imbalanced datasets

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3.6

Project description

Weighted-Class-Tfidf

Inspiration behind WCBTFIDF

Standard tfidf models select their features (the number is set by the max_features param) using term frequency alone. On an imbalanced dataset this can cause problems, because the selected words come mostly from the majority class. As a result, the minority class is under-represented in the matrix returned by tfidf.

Solution

To tackle this problem, we break the tfidf process down class-wise. Let us consider an example to understand what WCBTFIDF does under the hood.

Assume a dataset having two labels 0 and 1. 0 is present in 80% of the records and 1 is present in 20% of the records.

If we run standard tfidf on this (with, for example, 300 features), it will pick the top 300 words by frequency across both classes. There is a very high chance that the selected words will come mostly from class 0, and we run the risk of severely under-representing class 1.
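The effect described above can be illustrated with a small sketch. It uses plain word counts as a simplified stand-in for frequency-based feature selection; the toy documents and variable names are made up for illustration and are not part of the package.

```python
from collections import Counter

# Toy corpus (hypothetical): four majority-class documents, one minority-class document.
majority = ["great film great cast", "great plot", "great film", "fine cast"]
minority = ["toxic rant"]

# Count every word across the whole corpus, ignoring class labels.
term_counts = Counter(word for doc in majority + minority for word in doc.split())

# Keep only the 3 most frequent words, similar in spirit to a small max_features.
top_terms = [term for term, _ in term_counts.most_common(3)]
print(top_terms)
```

None of the minority-class words make it into the selected terms, because each of them appears only once in the corpus.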

wcbtfidf first calculates a weight for each label. The weight here refers to how many features it should select from that class.

Since class 0 is present in 80% of the records, wcbtfidf will pick 240 features from class 0 and 60 features from class 1.
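The arithmetic behind this split can be sketched as follows. The function name allocate_features is hypothetical, and the floor-division rounding is an assumption; the package may round differently.

```python
from collections import Counter


def allocate_features(labels, max_features):
    """Split max_features across classes in proportion to class frequency."""
    counts = Counter(labels)
    total = len(labels)
    # Integer arithmetic avoids floating-point rounding surprises.
    return {label: count * max_features // total for label, count in counts.items()}


# 80% of the records carry label 0, 20% carry label 1.
labels = [0] * 80 + [1] * 20
budget = allocate_features(labels, max_features=300)
print(budget)  # {0: 240, 1: 60}
```

With floor division, the per-class counts can add up to slightly less than max_features when the proportions do not divide evenly.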

So, essentially, we run tfidf class-wise on labels 0 and 1, with max_features set to 240 and 60 respectively.

After that, we combine the vocabularies from both classes into a single list. This is easy to do, since tfidf exposes a vocabulary_ attribute that stores the vocab.

Finally, this combined vocab is used as a fixed vocabulary in another tfidf model that is run on the entire data. By fixing the vocab for the final tfidf, we ensure that only this set of words is scored.
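The steps above can be sketched with scikit-learn's TfidfVectorizer. This is a minimal illustration of the idea, not the package's actual implementation: the helper name class_weighted_vocab, the floor-division weighting, and the toy data are all assumptions.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer


def class_weighted_vocab(texts, labels, max_features):
    """Fit tfidf per class with a proportional feature budget, then merge the vocabularies."""
    counts = Counter(labels)
    total = len(labels)
    vocab = set()
    for label, count in counts.items():
        # Assumed weighting: share of the budget proportional to class frequency, at least 1.
        k = max(1, count * max_features // total)
        class_texts = [t for t, y in zip(texts, labels) if y == label]
        tfidf = TfidfVectorizer(max_features=k).fit(class_texts)
        vocab.update(tfidf.vocabulary_)  # the dict keys are the selected terms
    # A set removes terms picked by more than one class; a fixed vocabulary must be duplicate-free.
    return sorted(vocab)


# Toy data (hypothetical): four records with label 0, one with label 1.
texts = ["good movie", "good film", "good plot", "good cast", "awful mess"]
labels = [0, 0, 0, 0, 1]

vocab = class_weighted_vocab(texts, labels, max_features=5)

# The final model scores only the combined vocabulary, over the entire data.
final_tfidf = TfidfVectorizer(vocabulary=vocab)
matrix = final_tfidf.fit_transform(texts)
print(vocab, matrix.shape)
```

Because the minority class gets its own budget, at least one of its words is guaranteed a column in the final matrix.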

To put it simply, the 300 features chosen by wcbtfidf are intended to be a better representation of the overall data than the features chosen by a standard tfidf model.

Results

In the experiments conducted, wcbtfidf performed better than standard tfidf models. The results have been put into a notebook under the demos folder.

Data Sources

IMDB Dataset

Toxic Classification Dataset

Sentiment140 Dataset

Article Link

Click here

Tutorial

# Import the class
from wcbtfidf import Wcbtfidf

# xtrain and xtest are each a single pandas column containing the text data;
# ytrain and ytest hold the categorical output label.

# Initialize the object
wcbtfidf = Wcbtfidf(max_features=100)
# Fit on the training set
wcbtfidf.fit(xtrain, ytrain)
# Transform the test set
test_df = wcbtfidf.transform(xtest)
# Get the combined vocab
wcbtfidf.combine_vocab
# Get the class-wise vocab
wcbtfidf.class_wise_vocab

# Custom per-class feature counts are also supported
wcbtfidf = Wcbtfidf(max_features=100, custom_weights={0: 20, 1: 80})  # lets you manage how many features to assign to each class

Release history

- 1.0.3: Nov 12, 2022 (this version)
- 1.0.2: Sep 3, 2022
- 1.0.1: Sep 3, 2022
- 1.0.0: Sep 3, 2022

Download files

Source Distribution

Weighted Class Tfidf-1.0.3.tar.gz (5.0 kB view details)

Uploaded Nov 12, 2022 Source

Built Distribution

Weighted_Class_Tfidf-1.0.3-py3-none-any.whl (4.9 kB view details)

Uploaded Nov 12, 2022 Python 3

File details

Details for the file Weighted Class Tfidf-1.0.3.tar.gz.

File metadata

Download URL: Weighted Class Tfidf-1.0.3.tar.gz
Upload date: Nov 12, 2022
Size: 5.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.8.5

File hashes

Hashes for Weighted Class Tfidf-1.0.3.tar.gz
Algorithm	Hash digest
SHA256	`1ae8fcb775bfdf6a85936d6eccf9b6657b1fc5e4c57af6b39d2fa5874a9059cf`
MD5	`d38c44b92befe2767f918b7522bc71fe`
BLAKE2b-256	`cbf8701026c004d4b30b07ba51c9e562819842f983dac11eba54cf6b4c60b9ae`

File details

Details for the file Weighted_Class_Tfidf-1.0.3-py3-none-any.whl.

File metadata

Download URL: Weighted_Class_Tfidf-1.0.3-py3-none-any.whl
Upload date: Nov 12, 2022
Size: 4.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/4.0.1 CPython/3.8.5

File hashes

Hashes for Weighted_Class_Tfidf-1.0.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`6525e99e3915c8611ecfb6ffd7d3d6fe9510d63ffb0972df51e1f855d3091b24`
MD5	`f0265a6390cfc8b2c18f9e5caeae3077`
BLAKE2b-256	`34d36192ab44773d3aea1ede8431304531ab6626e4f6a551bfc6e88eb094796b`
