Package to extract interesting details about text.

These details have not been verified by PyPI

Project description

textfeatureinfo

Description

In Natural Language Processing, it is common for users to engineer their own features from a given text, but some features are difficult to extract without additional Python tools. This Python package provides functions that allow data scientists to extract summary information from plain text for use in feature engineering or other data science projects. Our package, textfeatureinfo, reports the number of punctuation marks in a text, the average word length, and the percentage of fully capitalised words. It can also manipulate text data by removing stop words, which simplifies later processing steps.

Our package and functions are inspired by a lab in DSCI 573 (Feature and Model Selection), a course in the UBC MDS program, and are tailored to our own experience and interests.

Function Details

count_punc: This function counts and returns the number of punctuation marks in a given text.
avg_word_len: This function calculates and returns the average length of the words in a given text.
perc_cap_words: This function calculates the percentage of fully capitalised words in a given text.
remove_stop_words: This function finds and removes the stop words in a text and returns a list of the remaining words.
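
To make two of these features concrete, here is a minimal standard-library sketch of how a punctuation count and a percentage of fully capitalised words might be computed. The names count_punctuation and percent_fully_capitalised are hypothetical stand-ins, not the package's functions, and the choices made here (ASCII punctuation only, whitespace tokenisation) are assumptions; the package's own results may differ.

```python
import string


def count_punctuation(text: str) -> int:
    """Count characters found in string.punctuation (ASCII punctuation only)."""
    return sum(1 for ch in text if ch in string.punctuation)


def percent_fully_capitalised(text: str) -> float:
    """Return the percentage of words written entirely in upper case.

    Words are split on whitespace and stripped of surrounding punctuation.
    This is an illustrative assumption, not the package's definition.
    """
    words = [w.strip(string.punctuation) for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    if not words:
        return 0.0
    upper = sum(1 for w in words if w.isupper())
    return 100 * upper / len(words)


print(count_punctuation("Hello, World!"))
print(percent_fully_capitalised("THIS is a SPAm MESSage."))
```

In this sketch only "THIS" counts as fully capitalised; mixed-case words such as "SPAm" do not.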

Python Ecosystem

In the field of text feature engineering, we are cognisant that there are well-established packages in the Python ecosystem, specifically nltk, spaCy and gensim. For punctuation, nltk.tokenize together with the FreqDist class from nltk.probability can be used to count the words and punctuation marks in a string. For average word length, nltk.word_tokenize() splits a string into a list of tokens from which the average can be computed. For fully capitalised words, these functions provide a means to isolate such words but not to count them directly. For stop words, several modules are available. The gensim.parsing.preprocessing module has the function remove_stopwords(), which removes from a string the stop words listed in its docstring. nltk.corpus provides the stopwords corpus, which can be used to filter stop words out of a list of tokens. spaCy similarly stores a list of English stop words in sp.Defaults.stop_words, where sp is a loaded English pipeline.

Based on our experience in our previous module, each of the tasks above requires several lines of code. For example, to calculate the average word length, we need to strip the punctuation, count the total number of characters, and then divide by the number of words present. As such, we seek to simplify these tasks into functions that users, including ourselves, can call in one line of code.
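
The multi-step calculation described above can be sketched with the standard library alone. The function name average_word_length is hypothetical and this is only an illustration of the manual steps, not the implementation used by avg_word_len; it assumes ASCII punctuation and whitespace-separated words.

```python
import string


def average_word_length(text: str) -> float:
    """Average number of characters per word after removing punctuation."""
    # Step 1: strip ASCII punctuation from the text.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    # Step 2: split into words on whitespace.
    words = cleaned.split()
    if not words:
        return 0.0
    # Step 3: count the total characters and divide by the number of words.
    total_chars = sum(len(word) for word in words)
    return total_chars / len(words)


print(average_word_length("Hello, World!"))
```

Wrapping these steps in a single function is the kind of one-line convenience the package aims to provide.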

Installation

$ pip install textfeatureinfo

Usage

To use the package, follow these steps:

Create a new conda environment:

conda create --name textfeatureinfo python=3.9 -y

Activate the conda environment:

conda activate textfeatureinfo

Install the package:

pip install textfeatureinfo

Open Python:

python

At the Python prompt, type the following to import all the functions:

from textfeatureinfo import textfeatureinfo
from textfeatureinfo.textfeatureinfo import count_punc
from textfeatureinfo.textfeatureinfo import avg_word_len
from textfeatureinfo.textfeatureinfo import perc_cap_words
from textfeatureinfo.textfeatureinfo import remove_stop_words

You can then call the functions as follows:

count_punc("Hello, World!")
avg_word_len("Hello, World!")
perc_cap_words("THIS is a SPAm MESSage.")
remove_stop_words("Tomorrow is a big day!")

Contributing

Interested in contributing? Check out the contributing guidelines. Please note that this project is released with a Code of Conduct. By contributing to this project, you agree to abide by its terms.

License

textfeatureinfo was created by Kiran, Jacqueline, Paniz, and Lynn. It is licensed under the terms of the MIT license.

Credits

textfeatureinfo was created with cookiecutter and the py-pkgs-cookiecutter template.

Project details

Release history

0.2.8 (this version): Jan 29, 2022
0.2.7: Jan 28, 2022
0.2.6: Jan 28, 2022
0.2.5: Jan 27, 2022
0.2.1: Jan 27, 2022
0.2.0: Jan 27, 2022

Download files

Source Distribution

textfeatureinfo-0.2.8.tar.gz (5.5 kB)

Uploaded Jan 29, 2022 Source

Built Distribution

textfeatureinfo-0.2.8-py3-none-any.whl (5.4 kB)

Uploaded Jan 29, 2022 Python 3

File details

Details for the file textfeatureinfo-0.2.8.tar.gz.

File metadata

Download URL: textfeatureinfo-0.2.8.tar.gz
Upload date: Jan 29, 2022
Size: 5.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.10

File hashes

Hashes for textfeatureinfo-0.2.8.tar.gz
Algorithm	Hash digest
SHA256	`5a05ab5b0d230950b0f1b6e20bc71dc0f2d0a90211ecffc7da89141d659d55b8`
MD5	`082d6ea48bf44a7a855b7d3c03e4f909`
BLAKE2b-256	`a8710f2fc43e2187fa9b909e75d8ea4896f8469c0cc0573513e2c99b485ac425`

File details

Details for the file textfeatureinfo-0.2.8-py3-none-any.whl.

File metadata

Download URL: textfeatureinfo-0.2.8-py3-none-any.whl
Upload date: Jan 29, 2022
Size: 5.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/3.7.1 importlib_metadata/4.10.1 pkginfo/1.8.2 requests/2.27.1 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.10

File hashes

Hashes for textfeatureinfo-0.2.8-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b5a61fd10a34f8a41051932ddc82e083e6e6f68758372740dd4cdcc8ec0e8b6b`
MD5	`9bb6a1f74502de54197507af4fecdd85`
BLAKE2b-256	`ac1d813f060e8e4f7aedd833773f6df104a2c6fc52c0dfaf88439cde0c9dfe46`
