Sanitise text while keeping your sanity
# Saniti
**Sanitise lists of text documents quickly, easily and whilst maintaining your sanity**
The aim is to streamline turning lists of documents into consistent outputs: you simply specify the list of texts and define the sanitisation pipeline.
## Usage
**As a function-ish**
```python
import saniti
original_text = ["I like to moves it, move its", "I likeing to move it!", "the of"]
# Sanitise the text while initialising the class
text = saniti.saniti(original_text, ["token", "destop", "depunct", "unempty", "stem", "out_corp_dict"])
print(text.text)
# Output:
# {'dictionary': <gensim.corpora.dictionary.Dictionary object at 0x000002BA9F5FFEF0>, 'corpus': [[(0, 1), (1, 1), (2, 2)], [(0, 1), (1, 1), (2, 1)], []]}
```
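To help read the `corpus` value above, here is a minimal pure-Python sketch of a bag-of-words encoding, where each document becomes a list of `(token_id, count)` pairs. It is not saniti's or gensim's implementation: the `build_bow` helper, the first-seen id assignment, and the token lists (chosen to resemble what the pipeline might produce after stemming) are all assumptions made for illustration.

```python
from collections import Counter


def build_bow(docs):
    """Encode tokenised documents as lists of (token_id, count) pairs.

    Hypothetical helper: ids are handed out in first-seen order, which is
    an assumption of this sketch and not necessarily gensim's rule.
    """
    token_to_id = {}
    corpus = []
    for tokens in docs:
        # Register any token that has not been seen before.
        for tok in tokens:
            token_to_id.setdefault(tok, len(token_to_id))
        # Count how often each token id occurs in this document.
        counts = Counter(token_to_id[tok] for tok in tokens)
        corpus.append(sorted(counts.items()))
    return token_to_id, corpus


# Assumed token lists, standing in for already-sanitised documents.
docs = [["i", "like", "move", "move"], ["i", "like", "move"], []]
token_to_id, corpus = build_bow(docs)
print(corpus)
```

In this sketch the pair `(2, 2)` in the first document means the token with id 2 (`"move"`) occurs twice, and the empty third document encodes to an empty list.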
**As a class**
```python
import saniti
original_text = ["I like to moves it, move its", "I likeing to move it!", "the of"]
sani1 = saniti.saniti()  # initialise the sanitising class
# Sanitise the text
text = sani1.process(original_text, ["token", "destop", "depunct", "unempty", "lemma", "out_tag_doc"])
print(text)
# Output:
# [TaggedDocument(words=['I', 'like', 'move', 'move'], tags=['I like move move']), TaggedDocument(words=['I', 'likeing', 'move'], tags=['I likeing move']), TaggedDocument(words=[], tags=[''])]
```
## Pipeline Components
* "token" - tokenise texts
* "depunct" - remove punctuation
* "unempty" - remove empty words within documents
* "lemma" - lemmatize text
* "destop" - remove stopwords
* "stem" - stem texts
* "out_tag_doc" - turn the texts into gensim tagged documents for Doc2Vec
* "out_corp_dict" - turn the texts into a gensim corpus and dictionary
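
The idea of naming pipeline steps as strings can be sketched in plain Python as a lookup table of functions applied in order. This is only an illustration of the pattern under stated assumptions, not saniti's actual code: `STEPS`, `run_pipeline`, the tiny `STOPWORDS` set, and the whitespace tokeniser are all hypothetical stand-ins.

```python
import string

# Tiny illustrative stopword list (assumption; a real pipeline would use a fuller one).
STOPWORDS = {"the", "of", "to", "it", "its"}


def tokenise(docs):
    # Split each raw string into whitespace-separated tokens.
    return [doc.split() for doc in docs]


def depunct(docs):
    # Strip ASCII punctuation characters from every token.
    table = str.maketrans("", "", string.punctuation)
    return [[tok.translate(table) for tok in doc] for doc in docs]


def destop(docs):
    # Drop tokens whose lowercase form is in the stopword list.
    return [[tok for tok in doc if tok.lower() not in STOPWORDS] for doc in docs]


def unempty(docs):
    # Drop tokens that became empty strings in earlier steps.
    return [[tok for tok in doc if tok] for doc in docs]


# Hypothetical mapping from component name to the function implementing it.
STEPS = {"token": tokenise, "depunct": depunct, "destop": destop, "unempty": unempty}


def run_pipeline(texts, pipeline):
    """Apply the named steps to the texts, in the order given."""
    result = texts
    for name in pipeline:
        result = STEPS[name](result)
    return result


cleaned = run_pipeline(
    ["I like to move it, move it!", "the of"],
    ["token", "depunct", "destop", "unempty"],
)
print(cleaned)
```

In this sketch the order of steps matters: running `"depunct"` before `"destop"` lets a token such as `it,` match the stopword `it` once its comma has been stripped.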