Sanitise text while keeping your sanity

# Saniti

**Sanitise lists of text documents quickly, easily and whilst maintaining your sanity**

The aim is to streamline turning lists of documents into consistent outputs: you simply specify the list of texts and define the sanitisation pipeline.

## Usage

**As a function-ish**

```
import saniti

original_text = ["I like to moves it, move its", "I likeing to move it!", "the of"]

# Sanitise the text while initialising the class
text = saniti.saniti(original_text, ["token", "destop", "depunct", "unempty", "stem", "out_corp_dict"])
print(text.text)

# Output:
# {'dictionary': <gensim.corpora.dictionary.Dictionary object at 0x000002BA9F5FFEF0>, 'corpus': [[(0, 1), (1, 1), (2, 2)], [(0, 1), (1, 1), (2, 1)], []]}
```
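
To make the `corpus` value in that output easier to read: each document becomes a list of `(token_id, count)` pairs. The sketch below rebuilds that shape with the standard library only. It is not how saniti or gensim work internally; `build_dictionary` and `to_bow` are hypothetical helpers, the token lists are an assumed result of the earlier pipeline steps, and gensim may assign ids in a different order.

```python
from collections import Counter


def build_dictionary(docs):
    """Give every distinct token an integer id in first-seen order (hypothetical helper)."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Turn one tokenised document into sorted (token_id, count) pairs."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())


# Assumed tokens after tokenising, stopword removal and stemming
docs = [["i", "like", "move", "move"], ["i", "like", "move"], []]

token2id = build_dictionary(docs)
corpus = [to_bow(doc, token2id) for doc in docs]
print(corpus)
```

The third document is empty because both of its words are stopwords, so its bag-of-words list is empty too.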

**As a class**

```
import saniti

original_text = ["I like to moves it, move its", "I likeing to move it!", "the of"]

sani1 = saniti.saniti()  # initialise the sanitising class
# Sanitise the text
text = sani1.process(original_text, ["token", "destop", "depunct", "unempty", "lemma", "out_tag_doc"])
print(text)

# Output:
# [TaggedDocument(words=['I', 'like', 'move', 'move'], tags=['I like move move']), TaggedDocument(words=['I', 'likeing', 'move'], tags=['I likeing move']), TaggedDocument(words=[], tags=[''])]
```
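
The tagged-document output pairs each word list with a list of tags, and in the printed example the single tag looks like the words joined by spaces. A rough stand-in for that shape, using only the standard library, might look like this. The `TaggedDocument` namedtuple and `tag_documents` here are illustrative assumptions that mirror the printed output, not saniti's or gensim's own code.

```python
from collections import namedtuple

# Stand-in with the same two fields as the objects printed above (assumption)
TaggedDocument = namedtuple("TaggedDocument", ["words", "tags"])


def tag_documents(docs):
    """Wrap each token list, using the space-joined tokens as its only tag."""
    return [TaggedDocument(words=doc, tags=[" ".join(doc)]) for doc in docs]


# Assumed tokens after the earlier pipeline steps
docs = [["I", "like", "move", "move"], ["I", "likeing", "move"], []]
tagged = tag_documents(docs)
print(tagged[0])
```

Note that an empty document still yields a tagged document, with an empty word list and an empty-string tag.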

## Pipeline Components

* "token" - tokenise texts
* "depunct" - remove punctuation
* "unempty" - remove empty words within documents
* "lemma" - lemmatize text
* "destop" - remove stopwords
* "stem" - stem texts
* "out_tag_doc" - turn the texts into gensim tagged documents for Doc2Vec
* "out_corp_dict" - turn the texts into a gensim corpus and dictionary
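
As a mental model for how a list of component names becomes a processing chain, here is a minimal sketch that maps each name to a function and applies them in the order given. Everything in it is an assumption for illustration: the `STEPS` table, `run_pipeline`, the regex tokeniser and the tiny stopword set are hypothetical and are not saniti's implementation.

```python
import re
import string

STOPWORDS = {"the", "of", "to", "it", "its"}  # tiny illustrative list, not saniti's
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def token(docs):
    # Split each text into words and standalone punctuation marks
    return [re.findall(r"\w+|[^\w\s]", doc) for doc in docs]


def destop(docs):
    # Drop words found in the stopword set
    return [[w for w in doc if w.lower() not in STOPWORDS] for doc in docs]


def depunct(docs):
    # Strip punctuation characters; pure-punctuation tokens become ""
    return [[w.translate(PUNCT_TABLE) for w in doc] for doc in docs]


def unempty(docs):
    # Remove the empty strings left behind by depunct
    return [[w for w in doc if w] for doc in docs]


STEPS = {"token": token, "destop": destop, "depunct": depunct, "unempty": unempty}


def run_pipeline(texts, pipeline):
    """Apply each named step in order, feeding the result into the next step."""
    result = texts
    for name in pipeline:
        result = STEPS[name](result)
    return result


texts = ["I like to moves it, move its", "I likeing to move it!", "the of"]
cleaned = run_pipeline(texts, ["token", "destop", "depunct", "unempty"])
print(cleaned)
```

In this sketch "depunct" leaves empty strings where punctuation tokens used to be, which is why "unempty" follows it.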

Source Distribution

saniti-0.1.51.tar.gz (3.1 kB)

File hashes

Hashes for saniti-0.1.51.tar.gz:

* SHA256: ae890077f4ede610cda3b63013199467e17d1d840047c1731d5692cd83e2f863
* MD5: 5065011ce3d617152812be09513f661a
* BLAKE2b-256: 5f282b23e2d8a83dbe4a9260859988ddacbce2c6a553c89358e54b64b9722a25
