SimpleText
A package to manage textual data in a simple fashion.
Install with:
pip install SimpleText
1) The preprocess function
This function takes a string as input and returns a list of tokens. Its parameters switch individual preprocessing steps on or off, so a string can be pre-processed in a single call.
Parameters:
text
(string): a string of text
n_grams
(tuple, default = (1,1)): specifies the range of n-gram sizes to generate, e.g. (1,2) produces unigrams and bigrams, (2,2) produces only bigrams
remove_accents
(boolean, default = False): removes accents
lower
(boolean, default = False): lowercases text
remove_less_than
(int, default = 0): removes words with fewer than X letters
remove_more_than
(int, default = 20): removes words with more than X letters
remove_punct
(boolean, default = False): removes punctuation
remove_alpha
(boolean, default = False): removes non-alphabetic tokens
remove_stopwords
(boolean, default = False): removes stopwords
remove_custom_stopwords
(list, default = [ ]): removes custom stopwords
lemma
(boolean, default = False): lemmatises tokens (via the WordNet Lemmatizer)
stem
(boolean, default = False): stems tokens (via the Porter Stemming algorithm)
remove_url
(boolean): removes URLs from the text
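To make the n_grams range concrete, here is a minimal sketch in plain Python. The helper `ngrams_in_range` is hypothetical and is not part of SimpleText; it only illustrates how a (low, high) tuple could select n-gram sizes, assuming unigrams stay as strings and longer n-grams become tuples.

```python
def ngrams_in_range(tokens, n_range):
    """Hypothetical helper: collect every n-gram with low <= n <= high."""
    low, high = n_range
    result = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Assumption: unigrams are kept as plain strings, longer n-grams as tuples.
            result.append(gram[0] if n == 1 else tuple(gram))
    return result


tokens = ["last", "week", "went"]
print(ngrams_in_range(tokens, (1, 2)))
# ['last', 'week', 'went', ('last', 'week'), ('week', 'went')]
print(ngrams_in_range(tokens, (2, 2)))
# [('last', 'week'), ('week', 'went')]
```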
In the example below we preprocess the string by:
- lowercasing letters
- removing punctuation
- removing stop words
- removing words with more than 15 letters or fewer than 1 letter
from SimpleText.preprocessor import preprocess
text = 'Last week, I went to the shops.'
preprocess(text, n_grams=(1, 1), remove_accents=False, lower=True, remove_less_than=1,
           remove_more_than=15, remove_punct=True, remove_alpha=False, remove_stopwords=True,
           remove_custom_stopwords=[], lemma=False, stem=False, remove_url=False)
The output would be:
['last', 'went', 'shops', 'week']
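For readers who want to see what these steps amount to, the following is a rough sketch using only the standard library. It is not SimpleText's implementation: `simple_preprocess` and `demo_stopwords` are hypothetical names, the stopword set is a tiny illustrative one, and the length bounds are assumed to be inclusive.

```python
import string


def simple_preprocess(text, stopwords, min_len=1, max_len=15):
    """Hypothetical stand-in for the steps listed above; not the package's code."""
    text = text.lower()                                                # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # drop punctuation
    tokens = text.split()                                              # whitespace tokenisation
    tokens = [t for t in tokens if t not in stopwords]                 # drop stopwords
    # Assumption: both length bounds are inclusive.
    return [t for t in tokens if min_len <= len(t) <= max_len]


demo_stopwords = {"i", "to", "the"}  # illustrative only, not the package's list
result = simple_preprocess("Last week, I went to the shops.", demo_stopwords)
print(result)  # ['last', 'week', 'went', 'shops']
```

The sketch yields the same four tokens as the output above, in input order.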
In this second example we process the string by:
- generating unigrams and bigrams
- stemming
- removing the URL
- removing accents
- lowercasing letters
from SimpleText.preprocessor import preprocess
text = "I'm loving the weather this year in españa! https://en.tutiempo.net/spain.html"
preprocess(text, n_grams=(1, 2), remove_accents=True, lower=True, remove_less_than=0,
           remove_more_than=20, remove_punct=False, remove_alpha=False, remove_stopwords=False,
           remove_custom_stopwords=[], lemma=False, stem=True, remove_url=True)
This outputs:
["i'm", 'love', 'the', 'weather', 'thi', 'year', 'in', 'espana!', ("i'm", 'loving'), ('loving', 'the'), ('the', 'weather'),
('weather', 'this'), ('this', 'year'), ('year', 'in'), ('in', 'espana!')]
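The URL and accent removal in this example can be approximated with the standard library, as sketched below. The function names `strip_accents_sketch` and `strip_url_sketch` and the URL regular expression are assumptions made for illustration; the package may implement these steps differently.

```python
import re
import unicodedata


def strip_accents_sketch(text):
    """Hypothetical helper: decompose characters, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_url_sketch(text):
    """Hypothetical helper: remove http(s) URLs with a simple, assumed pattern."""
    return re.sub(r"https?://\S+", "", text).strip()


text = "I'm loving the weather this year in españa! https://en.tutiempo.net/spain.html"
cleaned = strip_accents_sketch(strip_url_sketch(text)).lower()
print(cleaned)  # i'm loving the weather this year in espana!
```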
2) Individually preprocessing text
Alternatively, each preprocessing step can be applied on its own, without calling the whole preprocess
function. The functions available are:
from SimpleText.preprocessor import (lowercase, strip_accents, strip_punctuation, strip_url,
    tokenise, strip_alpha_numeric_characters, strip_stopwords, strip_min_max_tokens,
    lemantization, stemming, get_ngrams)
lowercase("Hi again") # outputs "hi again"
strip_accents("Hi ágain") # outputs "Hi again"
strip_punctuation("Hi again!") # outputs "Hi again"
strip_url("Hi again https://example.example.com/example/example") # outputs "Hi again"
tokenise("Hi again") # outputs ["Hi", "again"]
strip_alpha_numeric_characters(["Hi", "again", "@", "#", "*"]) # outputs ["Hi", "again"]
strip_stopwords(["Hi", "again"], ["Hi"]) # outputs ["again"]
strip_min_max_tokens(["consult", "consulting", "a"], 2, 8) # outputs ['consult']
lemantization(["bats", "feet"]) # outputs ["bat", "foot"]
stemming(["consult", "consultant", "consulting"]) # outputs ["consult", "consult", "consult"]
get_ngrams("hi all I'm", (1,3)) # outputs [('hi', 'all'), ('all', "I'm"), ('hi', 'all', "I'm")]
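Because each step is an ordinary function, the steps can be chained into a custom pipeline. The sketch below shows one way to do that; `run_pipeline` and `drop_punctuation` are hypothetical, and standard-library stand-ins are used in place of the package functions so the snippet runs without SimpleText installed. Functions such as lowercase, strip_punctuation and tokenise could presumably be dropped into the same list.

```python
import string
from functools import reduce


def run_pipeline(value, steps):
    """Hypothetical helper: feed the output of each step into the next one."""
    return reduce(lambda acc, step: step(acc), steps, value)


def drop_punctuation(text):
    """Stand-in for a punctuation-stripping step."""
    return text.translate(str.maketrans("", "", string.punctuation))


steps = [
    str.lower,                                               # stand-in for lowercasing
    drop_punctuation,                                        # stand-in for punctuation removal
    str.split,                                               # stand-in for tokenising
    lambda tokens: [t for t in tokens if t not in {"hi"}],   # stand-in for stopword removal
]
print(run_pipeline("Hi again!", steps))  # ['again']
```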
Hashes for SimpleText-1.0.3-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 481b957dc8b2757230da3967713f1ba086c8716d45096808bf14cbfc51890766
MD5 | 15306495fd821487911fc253c526e223
BLAKE2b-256 | 4d174f8a00ea0d005abb80985cc7dcae63cce165bfb2d43eb006ae37e018057c