Implementation of various filter-method algorithms for feature selection on text features
Project description
What is it?
TextFeatureSelection is a Python package that provides feature selection for text tokens through the filter method of feature selection; a threshold can then be set to decide which words to include. There are 4 methods to aid feature selection.
- Chi-square: Measures the lack of independence between a term (t) and a class (c). It has a natural value of zero if t and c are independent; a higher value indicates a stronger dependence. It is not reliable for low-frequency terms.
- Mutual information: Rare terms receive a higher score than common terms. For multi-class problems, the MI value is calculated for all categories and the Max(MI) value across all categories is taken at the word level.
- Proportional difference: Measures how close two numbers are to becoming equal. It helps find unigrams that occur mostly in one class of documents or the other.
- Information gain: Measures the discriminatory power of the word.
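As a hedged illustration of the first metric (the textbook formula, not necessarily the package's internal implementation), the chi-square score for one term against one class can be computed from a 2x2 contingency table: A = documents in the class containing the term, B = documents outside the class containing the term, C = documents in the class without the term, D = documents outside the class without the term.

```python
# Sketch of the chi-square statistic for a single term and class,
# computed from 2x2 contingency counts. Illustrative only; the
# package's internal computation may differ.

def chi_square(A, B, C, D):
    """chi2 = N * (A*D - C*B)^2 / ((A+C)(B+D)(A+B)(C+D))."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    if denom == 0:
        return 0.0  # degenerate table, treat as independent
    return N * (A * D - C * B) ** 2 / denom

# A term that appears only in one class is strongly dependent on it:
print(chi_square(3, 0, 0, 3))  # -> 6.0
# A term spread evenly across classes is independent (score 0):
print(chi_square(2, 2, 1, 1))  # -> 0.0
```

Note the zero score for the evenly spread term, matching the "natural value of zero if t and c are independent" described above.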
Input parameters
- target: List object containing the category labels. For more than two categories, there is no need to dummy-code; instead provide label-encoded values as a list object.
- input_doc_list: List object containing the text; each element of the list is a text corpus. There is no need to tokenize, as the text will be tokenized in the module while processing. target and input_doc_list should have the same length.
- stop_words: Words for which you do not want metric values calculated. Default is blank.
- metric_list: List object naming the metrics to be calculated. There are 4 metrics that can be computed: 'MI', 'CHI', 'PD', 'IG'. You can specify one or more of them as a list object. Default is ['MI','CHI','PD','IG']. Chi-square (CHI), mutual information (MI), proportional difference (PD) and information gain (IG) are the 4 metrics calculated for each tokenized word from the corpus to aid the user in feature selection.
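Conceptually, the stop_words parameter drops the listed tokens before any metric is computed. A minimal sketch of that filtering step (the package's actual tokenizer is not shown here; the lowercase whitespace split below is an assumption made for illustration):

```python
# Hedged sketch: whitespace tokenization with stop-word removal.
# The module's real tokenizer may differ; this only illustrates the
# effect of passing stop_words.

def tokenize(doc, stop_words=()):
    return [w for w in doc.lower().split() if w not in stop_words]

docs = ['i am very happy', 'i just had lunch']
vocab = sorted({w for d in docs for w in tokenize(d, stop_words=['i', 'am'])})
print(vocab)  # -> ['had', 'happy', 'just', 'lunch', 'very']
```

Words listed in stop_words ('i' and 'am' here) never reach the metric computation, so no scores are produced for them.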
How to use it?
from TextFeatureSelection import TextFeatureSelection

# Multiclass classification problem
input_doc_list = ['i am very happy', 'i just had an awesome weekend', 'this is a very difficult terrain to trek. i wish i stayed back at home.', 'i just had lunch', 'Do you want chips?']
target = ['Positive', 'Positive', 'Negative', 'Neutral', 'Neutral']
fsOBJ = TextFeatureSelection(target=target, input_doc_list=input_doc_list)
result_df = fsOBJ.getScore()
print(result_df)

# Binary classification
input_doc_list = ['i am content with this location', 'i am having the time of my life', 'you cannot learn machine learning without linear algebra', 'i want to go to mars']
target = [1, 1, 0, 1]
fsOBJ = TextFeatureSelection(target=target, input_doc_list=input_doc_list)
result_df = fsOBJ.getScore()
print(result_df)
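getScore() returns per-word metric values, and as described above a threshold can then be applied to decide which words to keep. The scores and score names below are hypothetical (inspect result_df.columns for the actual output); this is only a generic thresholding sketch:

```python
# Hedged sketch: keeping only words whose metric value meets a threshold.
# The scores below are made-up examples, not real getScore() output.

scores = {'happy': 1.6, 'difficult': 1.2, 'lunch': 0.4, 'very': 0.1}

def select_features(word_scores, threshold):
    """Return the words whose score is at least `threshold`, sorted."""
    return sorted(w for w, s in word_scores.items() if s >= threshold)

print(select_features(scores, threshold=1.0))  # -> ['difficult', 'happy']
```

The same filtering can be done on the returned DataFrame directly (e.g. with a boolean mask on the chosen metric column) once the column names are known.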
Where to get it?
pip install TextFeatureSelection
References
- A Comparative Study on Feature Selection in Text Categorization by Yiming Yang and Jan O. Pedersen
- Entropy based feature selection for text categorization by Christine Largeron, Christophe Moulin, Mathias Géry
- Categorical Proportional Difference: A Feature Selection Method for Text Categorization by Mondelle Simeon, Robert J. Hilderman
- Feature Selection and Weighting Methods in Sentiment Analysis by Tim O'Keefe and Irena Koprinska