
SimpleText


A package to manage textual data in a simple fashion.

Install with:

pip install SimpleText

1) The preprocess function

This function takes a string as input and outputs a list of tokens. Several parameters control how the string is pre-processed.

Parameters:

text (string): a string of text

n_grams (tuple, default = (1,1)): specifies the range of n-grams e.g. (1,2) would be unigrams and bigrams, (2,2) would be just bigrams

remove_accents (boolean, default = False): removes accents

lower (boolean, default = False): lowercases text

remove_less_than (int, default = 0): removes tokens with fewer than X letters

remove_more_than (int, default = 20): removes tokens with more than X letters

remove_punct (boolean, default = False): removes punctuation

remove_alpha (boolean, default = False): removes non-alphabetic tokens

remove_stopwords (boolean, default = False): removes stopwords

remove_custom_stopwords (list, default = [ ]): removes custom stopwords

lemma (boolean, default = False): lemmatises tokens (via the WordNet lemmatizer)

stem (boolean, default = False): stems tokens (via the Porter stemming algorithm)

remove_url (boolean, default = False): removes URLs

In the example below we preprocess the string by:

  • lowercasing letters
  • removing punctuation
  • removing stop words
  • removing tokens with more than 15 letters or fewer than 1 letter

from SimpleText.preprocessor import preprocess

text = 'Last week, I went to the shops.'

preprocess(text, n_grams=(1, 1), remove_accents=False, lower=True, remove_less_than=1,
           remove_more_than=15, remove_punct=True, remove_alpha=False, remove_stopwords=True,
           remove_custom_stopwords=[], lemma=False, stem=False, remove_url=False)

The output would be:

['last', 'went', 'shops', 'week']

In this second example we process the string by:

  • generating unigrams and bigrams
  • stemming
  • removing the url
  • removing accents
  • lowercasing letters

from SimpleText.preprocessor import preprocess

text = "I'm loving the weather this year in españa! https://en.tutiempo.net/spain.html"

preprocess(text, n_grams=(1, 2), remove_accents=True, lower=True, remove_less_than=0,
           remove_more_than=20, remove_punct=False, remove_alpha=False, remove_stopwords=False,
           remove_custom_stopwords=[], lemma=False, stem=True, remove_url=True)

This outputs:

["i'm", 'love', 'the', 'weather', 'thi', 'year', 'in', 'espana!', ("i'm", 'loving'), ('loving', 'the'),
 ('the', 'weather'), ('weather', 'this'), ('this', 'year'), ('year', 'in'), ('in', 'espana!')]
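The mixed output shape above (plain strings for unigrams, tuples for longer n-grams) can be illustrated with a rough pure-Python sketch. The ngram_expand function below is a hypothetical stand-in written for this page, not SimpleText's own implementation:

```python
def ngram_expand(tokens, n_grams=(1, 1)):
    # Hypothetical sketch: for each n in the inclusive range, emit every run
    # of n consecutive tokens. Unigrams stay plain strings; longer n-grams
    # become tuples, matching the output shape shown above.
    lo, hi = n_grams
    out = []
    for n in range(lo, hi + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            out.append(gram[0] if n == 1 else gram)
    return out

ngram_expand(['hi', 'all'], (1, 2))
# ['hi', 'all', ('hi', 'all')]
```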

2) Individually preprocessing text

Alternatively, each preprocessing step can be applied individually without using the whole preprocess function. The functions available are:

from SimpleText.preprocessor import (lowercase, strip_accents, strip_punctuation, strip_url,
                                     tokenise, strip_alpha_numeric_characters, strip_stopwords,
                                     strip_min_max_tokens, lemantization, stemming, get_ngrams)

lowercase("Hi again") # outputs "hi again"

strip_accents("Hi ágain") # outputs "Hi again"

strip_punctuation("Hi again!") # outputs "Hi again"

strip_url("Hi again https://example.example.com/example/example") # outputs "Hi again"

tokenise("Hi again") # outputs ["Hi", "again"]

strip_alpha_numeric_characters(["Hi", "again", "@", "#", "*"]) # outputs ["Hi", "again"]

strip_stopwords(["Hi", "again"], ["Hi"]) # outputs ["again"]

strip_min_max_tokens(["consult", "consulting", "a"], 2, 8) # outputs ['consult']

lemantization(["bats", "feet"]) # outputs ["bat", "foot"]

stemming(["consult", "consultant", "consulting"]) # outputs ["consult", "consult", "consult"]

get_ngrams("hi all I'm", (1,3)) # outputs [('hi', 'all'), ('all', "I'm"), ('hi', 'all', "I'm")]
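These individual steps can also be chained by hand. Below is a rough, library-free sketch of such a pipeline; the helper implementations are simplified stand-ins written for illustration, not SimpleText's own code:

```python
import string

def lowercase(text):
    return text.lower()

def strip_punctuation(text):
    # Remove all ASCII punctuation characters
    return text.translate(str.maketrans('', '', string.punctuation))

def tokenise(text):
    # Simple whitespace tokenisation
    return text.split()

def strip_stopwords(tokens, stopwords):
    return [t for t in tokens if t not in stopwords]

text = 'Last week, I went to the shops.'
tokens = strip_stopwords(tokenise(strip_punctuation(lowercase(text))),
                         ['i', 'to', 'the'])
# tokens == ['last', 'week', 'went', 'shops']
```

Composing small functions this way makes it easy to reorder steps or drop one without touching the rest of the pipeline.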

Download files

Download the file for your platform.

Source Distribution

SimpleText-1.0.3.tar.gz (6.3 kB)

Uploaded Source

Built Distribution

SimpleText-1.0.3-py3-none-any.whl (5.2 kB)

Uploaded Python 3

File details

Details for the file SimpleText-1.0.3.tar.gz.

File metadata

  • Download URL: SimpleText-1.0.3.tar.gz
  • Size: 6.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for SimpleText-1.0.3.tar.gz
Algorithm Hash digest
SHA256 9a1890ae2a7b8cfa5ff6f39fccc6e179da8bab594ce84bd2e61402a86031246a
MD5 864f3232bf0565c356c3fbe0b0c3e8e8
BLAKE2b-256 5eef73df2b7a3fa5b92f89abe8f97d43ee61119182a756400e2f399893be4e66


File details

Details for the file SimpleText-1.0.3-py3-none-any.whl.

File metadata

  • Download URL: SimpleText-1.0.3-py3-none-any.whl
  • Size: 5.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/3.2.0 pkginfo/1.5.0.1 requests/2.22.0 setuptools/50.3.2 requests-toolbelt/0.9.1 tqdm/4.36.1 CPython/3.7.4

File hashes

Hashes for SimpleText-1.0.3-py3-none-any.whl
Algorithm Hash digest
SHA256 481b957dc8b2757230da3967713f1ba086c8716d45096808bf14cbfc51890766
MD5 15306495fd821487911fc253c526e223
BLAKE2b-256 4d174f8a00ea0d005abb80985cc7dcae63cce165bfb2d43eb006ae37e018057c

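One way to verify a downloaded distribution against the published SHA256 digest, using only the Python standard library (the local file path in the comment is an assumption):

```python
import hashlib

def sha256_of(path):
    # Stream the file in chunks so large archives aren't loaded into memory at once
    h = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            h.update(chunk)
    return h.hexdigest()

# Example (assumed local path): compare against the SHA256 value above
# sha256_of('SimpleText-1.0.3.tar.gz')
```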
