A package to manage textual data in a simple fashion.
SimpleText
Install with:
pip install SimpleText
1) The preprocess function
This function takes a string as an input and outputs a list of tokens. There are several parameters in the function to help quickly pre-process a string.
Parameters:
- text (string): a string of text
- n_grams (tuple, default = (1,1)): specifies the range of n-grams to generate, e.g. (1,2) gives unigrams and bigrams, (2,2) gives only bigrams
- remove_accents (boolean, default = False): removes accents
- lower (boolean, default = False): lowercases text
- remove_less_than (int, default = 0): removes words shorter than X letters
- remove_more_than (int, default = 20): removes words longer than X letters
- remove_punct (boolean, default = False): removes punctuation
- remove_alpha (boolean, default = False): removes non-alphabetic tokens
- remove_stopwords (boolean, default = False): removes stopwords
- remove_custom_stopwords (list, default = [ ]): removes custom stopwords
- lemma (boolean, default = False): lemmatises tokens (via the WordNet Lemmatizer algorithm)
- stem (boolean, default = False): stems tokens (via the Porter Stemming algorithm)
- remove_url (boolean): removes URLs
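To make the n_grams range concrete, here is a minimal plain-Python sketch of how a (low, high) tuple could expand into n-grams. The helper name ngrams_in_range is hypothetical and is not part of SimpleText; the package's own implementation may differ.

```python
from typing import List, Tuple, Union

Token = Union[str, Tuple[str, ...]]


def ngrams_in_range(tokens: List[str], n_range: Tuple[int, int]) -> List[Token]:
    """Hypothetical helper (not SimpleText's code): all n-grams for low <= n <= high."""
    low, high = n_range
    result: List[Token] = []
    for n in range(low, high + 1):
        # Slide a window of width n over the token list.
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Unigrams stay plain strings; longer n-grams become tuples.
            result.append(gram[0] if n == 1 else tuple(gram))
    return result


tokens = ["the", "weather", "this", "year"]
print(ngrams_in_range(tokens, (1, 2)))  # unigrams followed by bigrams
print(ngrams_in_range(tokens, (2, 2)))  # bigrams only
```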
In the example below we preprocess the string by:
- lowercasing letters
- removing punctuation
- removing stop words
- removing words with more than 15 letters or fewer than 1 letter
from SimpleText.preprocessor import preprocess
text = 'Last week, I went to the shops.'
preprocess(text, n_grams=(1, 1), remove_accents=False, lower=True, remove_less_than=1,
remove_more_than=15, remove_punct=True, remove_alpha=False, remove_stopwords=True,
remove_custom_stopwords=[], lemma=False, stem=False, remove_url=False)
The output would be:
['last', 'went', 'shops', 'week']
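For readers who want to see what these four steps amount to, the following is a rough standard-library approximation. It is a sketch under assumptions, not the package's code: the stopword set is a tiny illustrative list, the function name simple_preprocess is hypothetical, and the length bounds are assumed to be inclusive.

```python
import string

# Tiny illustrative stopword list (an assumption; a real list would be much longer).
STOPWORDS = {"i", "to", "the", "a", "an", "and", "of", "in"}


def simple_preprocess(text, min_len=1, max_len=15):
    """Hypothetical stand-in: lowercase, strip punctuation, drop stopwords, filter by length."""
    text = text.lower()
    # Delete every ASCII punctuation character.
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    tokens = [t for t in tokens if t not in STOPWORDS]
    # Keep tokens whose length lies within the (assumed inclusive) bounds.
    return [t for t in tokens if min_len <= len(t) <= max_len]


print(simple_preprocess("Last week, I went to the shops."))
# ['last', 'week', 'went', 'shops']
```

This sketch keeps tokens in their original order, so the ordering can differ from the output shown above.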
In this second example we process the string by:
- generating unigrams and bigrams
- stemming
- removing the url
- removing accents
- lowercasing letters
from SimpleText.preprocessor import preprocess
text = "I'm loving the weather this year in españa! https://en.tutiempo.net/spain.html"
preprocess(text, n_grams=(1, 2), remove_accents=True, lower=True, remove_less_than=0,
remove_more_than=20, remove_punct=False, remove_alpha=False, remove_stopwords=False,
remove_custom_stopwords=[], lemma=False, stem=True, remove_url=True)
This outputs:
["i'm",'love','the','weather','thi','year','in','espana!',("i'm", 'loving'),('loving', 'the'),('the', 'weather'),
('weather', 'this'),('this', 'year'),('year', 'in'),('in', 'espana!')]
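The accent and URL steps in this example can be approximated with the standard library alone. The sketch below is an assumption about how such steps are commonly written, not SimpleText's actual implementation; the function names and the URL pattern are hypothetical.

```python
import re
import unicodedata

# Simple pattern (an assumption): anything starting with http:// or https:// up to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def strip_accents_sketch(text):
    """Decompose characters, then drop the combining marks (e.g. the tilde of 'ñ')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_url_sketch(text):
    """Remove URLs and trim the whitespace left behind at the ends."""
    return URL_PATTERN.sub("", text).strip()


text = "I'm loving the weather this year in españa! https://en.tutiempo.net/spain.html"
print(strip_accents_sketch(strip_url_sketch(text)).lower())
# i'm loving the weather this year in espana!
```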
2) Individually preprocessing text
Alternatively, each preprocessing step can be applied on its own, without calling the full preprocess
function. The functions available are:
from SimpleText.preprocessor import (lowercase, strip_accents, strip_punctuation, strip_url,
    tokenise, strip_alpha_numeric_characters, strip_stopwords, strip_min_max_tokens,
    lemantization, stemming, get_ngrams)
lowercase("Hi again") # outputs "hi again"
strip_accents("Hi ágain") # outputs "Hi again"
strip_punctuation("Hi again!") # outputs "Hi again"
strip_url("Hi again https://example.example.com/example/example") # outputs "Hi again"
tokenise("Hi again") # outputs ["Hi", "again"]
strip_alpha_numeric_characters(["Hi", "again", "@", "#", "*"]) # outputs ["Hi", "again"]
strip_stopwords(["Hi", "again"], ["Hi"]) # outputs ["again"]
strip_min_max_tokens(["consult", "consulting", "a"], 2, 8) # outputs ['consult']
lemantization(["bats", "feet"]) # outputs ["bat", "foot"]
stemming(["consult", "consultant", "consulting"]) # outputs ["consult", "consult", "consult"]
get_ngrams("hi all I'm", (1,3)) # outputs [('hi', 'all'), ('all', "I'm"), ('hi', 'all', "I'm")]
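Individual steps like these are often chained into a custom pipeline. The sketch below shows one way to do that with a small hypothetical helper, apply_steps. It uses standard-library stand-ins so that it runs without the package installed; the length filter mimics the strip_min_max_tokens example above and assumes inclusive bounds.

```python
from functools import reduce


def apply_steps(value, steps):
    """Hypothetical helper: feed the value through each step in order."""
    return reduce(lambda acc, step: step(acc), steps, value)


# Stand-in steps (assumptions, not SimpleText functions):
steps = [
    str.lower,                                            # lowercase the string
    str.split,                                            # whitespace tokenisation
    lambda toks: [t for t in toks if 2 <= len(t) <= 8],   # keep tokens of 2 to 8 letters
]

print(apply_steps("Consult consulting A", steps))
# ['consult']
```

Functions that need extra arguments, such as a stopword list, can be wrapped in a lambda in the same way as the length filter here.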