
Implementation of various algorithms for feature selection for text features, based on filter method

Project description

What is it?

TextFeatureSelection is a Python package that performs feature selection on text tokens using the filter method. It scores every token with up to four metrics, and you can apply a threshold to those scores to decide which words to keep.

  • Chi-square It measures the lack of independence between a term (t) and a class (c). Its value is zero when t and c are independent; the higher the value, the more strongly the term depends on the class. It is not reliable for low-frequency terms.

  • Mutual information Rare terms tend to receive a higher score than common terms. For multi-class problems, the MI value is calculated for every category and the maximum, Max(MI), across all categories is taken for each word.

  • Proportional difference It measures how close two numbers are to being equal. It helps find unigrams that occur mostly in one class of documents or the other.

  • Information gain It measures the discriminatory power of a word.
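To make the four metrics concrete, here is a minimal sketch using common textbook formulations over a 2x2 document-count table for one term and one class. The function names and the a/b/c/d notation are assumptions made for illustration; the package's internal formulas may differ in detail.

```python
import math

# Counts for one term t and one class c (notation assumed for this sketch):
#   a = documents in c that contain t      b = documents outside c that contain t
#   c = documents in c without t           d = documents outside c without t

def chi_square(a, b, c, d):
    """Chi-square statistic; zero when term and class are independent."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def mutual_information(a, b, c, d):
    """Pointwise mutual information between term presence and the class."""
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # term never co-occurs with the class
    return math.log(a * n / ((a + c) * (a + b)))

def max_mutual_information(tables):
    """Multi-class case: take the maximum MI over per-class count tables."""
    return max(mutual_information(*table) for table in tables)

def proportional_difference(a, b):
    """+1 or -1 when the term occurs in only one class, 0 when evenly split."""
    return (a - b) / (a + b) if (a + b) else 0.0

def entropy(counts):
    """Shannon entropy in bits of a list of counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((k / total) * math.log2(k / total) for k in counts if k)

def information_gain(a, b, c, d):
    """Reduction in class entropy from knowing whether the term is present."""
    n = a + b + c + d
    h_class = entropy([a + c, b + d])
    h_given_term = ((a + b) / n) * entropy([a, b]) + ((c + d) / n) * entropy([c, d])
    return h_class - h_given_term
```

With these definitions, a term spread evenly over both classes (a=b=c=d=1) scores zero on every metric, while a term found only in one class (a=2, b=0, c=0, d=2) gets a chi-square of 4.0, a proportional difference of 1.0 and an information gain of 1.0 bit.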

Input parameters

  • target List object containing the category label of each document. For more than one category, there is no need to dummy code; provide label-encoded values as a list object instead.
  • input_doc_list List object containing the text. Each element of the list is one text document. No need to tokenize, as the module tokenizes the text while processing. target and input_doc_list must have the same length.
  • stop_words Words for which you do not want metric values calculated. Default is blank.
  • metric_list List object naming the metrics to calculate. Four metrics are available: 'MI', 'CHI', 'PD' and 'IG'. You can specify one or more of them as a list object. Default is ['MI','CHI','PD','IG']. Chi-square (CHI), Mutual information (MI), Proportional difference (PD) and Information gain (IG) are calculated for each tokenized word in the corpus to aid feature selection.
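The sketch below illustrates how inputs shaped like these parameters could be turned into the per-class document counts that the metrics rely on. It is not the package's code: the helper name document_frequencies and the lowercase, regex-based tokenizer are assumptions made for this example.

```python
import re
from collections import Counter, defaultdict

def document_frequencies(target, input_doc_list, stop_words=None):
    """Count, for every token, how many documents of each class contain it."""
    if len(target) != len(input_doc_list):
        raise ValueError("target and input_doc_list must have the same length")
    stop = set(stop_words or [])
    counts = defaultdict(Counter)  # token -> Counter({label: number of documents})
    for label, doc in zip(target, input_doc_list):
        # Assumed tokenizer: lowercase the text and split on non-word characters.
        tokens = set(re.findall(r"\w+", doc.lower())) - stop
        for token in tokens:
            counts[token][label] += 1
    return counts

counts = document_frequencies(
    target=["Positive", "Neutral"],
    input_doc_list=["i am very happy", "i just had lunch"],
    stop_words=["i"],
)
```

Labels are used as they are, so string or integer categories both work without dummy coding, and the stop word "i" never enters the counts.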

How to use it?

from TextFeatureSelection import TextFeatureSelection

#Multiclass classification problem
input_doc_list=['i am very happy','i just had an awesome weekend','this is a very difficult terrain to trek. i wish i stayed back at home.','i just had lunch','Do you want chips?']
target=['Positive','Positive','Negative','Neutral','Neutral']
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)


#Binary classification
input_doc_list=['i am content with this location','i am having the time of my life','you cannot learn machine learning without linear algebra','i want to go to mars']
target=[1,1,0,1]
fsOBJ=TextFeatureSelection(target=target,input_doc_list=input_doc_list)
result_df=fsOBJ.getScore()
print(result_df)
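Once the scores are available, applying a threshold is a simple filter. The sketch below uses plain Python records with made-up scores so that it runs without the package; the keys "word", "CHI" and "IG" and the helper select_words are assumptions, not the actual column names of result_df.

```python
def select_words(rows, metric, threshold):
    """Keep the words whose score for the given metric meets the threshold."""
    return [row["word"] for row in rows if row[metric] >= threshold]

# Illustrative numbers only, not output produced by the package.
rows = [
    {"word": "happy", "CHI": 2.5, "IG": 0.42},
    {"word": "lunch", "CHI": 0.3, "IG": 0.05},
    {"word": "awesome", "CHI": 1.9, "IG": 0.31},
]

selected = select_words(rows, metric="CHI", threshold=1.0)
```

The right threshold depends on the metric and the corpus, so it is usually chosen by inspecting the score distribution first.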

Where to get it?

pip install TextFeatureSelection
