
Sanitise text while keeping your sanity


# Saniti

**Sanitise lists of text documents quickly, easily and whilst maintaining your sanity**

The aim is to streamline turning lists of documents into consistent outputs: you simply specify the list of texts and define the sanitisation pipeline.

### Usage:

**As a function-ish**

```
import saniti

original_text = ["I like to moves it, move its", "I likeing to move it!", "the of"]

# Sanitise the text while initialising the class
text = saniti.saniti(original_text, ["token", "destop", "depunct", "unempty", "stem", "out_corp_dict"])
print(text.text)

# Output:
# {'dictionary': <gensim.corpora.dictionary.Dictionary object at 0x000002BA9F5FFEF0>, 'corpus': [[(0, 1), (1, 1), (2, 2)], [(0, 1), (1, 1), (2, 1)], []]}
```

**As a class**

```
import saniti

original_text = ["I like to moves it, move its", "I likeing to move it!", "the of"]

sani1 = saniti.saniti()  # initialise the sanitising class
text = sani1.process(original_text, ["token", "destop", "depunct", "unempty", "lemma", "out_tag_doc"])  # sanitise the text
print(text)

# Output:
# [TaggedDocument(words=['I', 'like', 'move', 'move'], tags=['I like move move']), TaggedDocument(words=['I', 'likeing', 'move'], tags=['I likeing move']), TaggedDocument(words=[], tags=[''])]
```

## Pipeline Components

* "token" - tokenise texts
* "depunct" - remove punctuation
* "unempty" - remove empty words within documents
* "lemma" - lemmatize text
* "destop" - remove stopwords
* "stem" - stem texts
* "out_tag_doc" - turns the texts into gensim tagged documents for Doc2Vec
* "out_corp_dict" - turns the texts into gensim corpus and dictionary
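
To illustrate the idea of a pipeline defined as an ordered list of component names, here is a minimal sketch of how such names could be dispatched to functions. It is not saniti's actual implementation: `STEPS`, `run_pipeline`, the helper functions and the tiny `STOPWORDS` set are all hypothetical, and only the standard library is used in place of the NLTK and gensim components.

```python
import string

# Hypothetical stand-ins for the real components; saniti itself may
# behave differently (e.g. a full stopword list, a proper tokeniser).
STOPWORDS = {"i", "to", "it", "the", "of"}  # tiny illustrative stopword set


def tokenise(docs):
    """Split each document string into whitespace-separated tokens."""
    return [doc.split() for doc in docs]


def depunct(docs):
    """Strip leading and trailing punctuation from every token."""
    return [[tok.strip(string.punctuation) for tok in doc] for doc in docs]


def unempty(docs):
    """Drop tokens that are empty strings."""
    return [[tok for tok in doc if tok] for doc in docs]


def destop(docs):
    """Drop tokens found in the stopword set (case-insensitive)."""
    return [[tok for tok in doc if tok.lower() not in STOPWORDS] for doc in docs]


# Map each component name to the function that implements it.
STEPS = {"token": tokenise, "depunct": depunct, "unempty": unempty, "destop": destop}


def run_pipeline(docs, pipeline):
    """Apply the named steps to docs in the order given."""
    for name in pipeline:
        if name not in STEPS:
            raise KeyError(f"unknown pipeline step: {name!r}")
        docs = STEPS[name](docs)
    return docs


result = run_pipeline(
    ["I like to moves it, move its", "the of"],
    ["token", "depunct", "unempty", "destop"],
)
print(result)
```

In this sketch the order of names matters: each step receives the output of the previous one, so a tokenising step has to come before any step that works on tokens.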

Source distribution: saniti-0.1.51.tar.gz (3.1 kB)