A package to clean raw text

cleantext

cleantext is an open-source Python library for cleaning raw text data. Source code for the library can be found here.

Features

cleantext has two main methods:

  • clean: to clean raw text and return the cleaned text
  • clean_words: to clean raw text and return a list of clean words

cleantext can apply all, or a selected combination of the following cleaning operations:

  • Remove extra white spaces
  • Convert the entire text into a uniform lowercase
  • Remove digits from the text
  • Remove punctuation from the text
  • Remove stop words, with a choice of stop-word language (stop words are the most common words in a language, which carry little meaning on their own, such as is, am, the, this, are)
  • Stem the words (stemming reduces inflected forms of a word to a common stem; for example, stemming run, runs, running yields run, run, run)
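
The first four operations in the list work on individual characters and can be approximated with the standard library alone. The following is a minimal sketch of the idea, not cleantext's actual implementation. The function name basic_clean is hypothetical, and its keyword arguments only mirror the option names documented under Usage.

```python
import re
import string


def basic_clean(text, extra_spaces=True, lowercase=True, numbers=True, punct=True):
    """Hypothetical stand-in illustrating the operations; not cleantext's code."""
    if lowercase:
        text = text.lower()
    if numbers:
        # Drop every decimal digit.
        text = re.sub(r"\d", "", text)
    if punct:
        # Drop ASCII punctuation such as ! $ % + * /
        text = text.translate(str.maketrans("", "", string.punctuation))
    if extra_spaces:
        # Collapse runs of whitespace and trim both ends.
        text = " ".join(text.split())
    return text


print(basic_clean("  Hello,   World 2024!  "))  # hello world
```

Whitespace is collapsed last in this sketch because removing digits and punctuation can leave new runs of spaces behind.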

Installation

cleantext requires Python 3 and NLTK to execute.

To install with pip, run:

pip install cleantext
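
After installing, it can be useful to confirm that both requirements are importable before running anything else. The helper below is a small sketch using only the standard library. The name missing_modules is hypothetical, and the check only looks for the top-level import names cleantext and nltk. It does not verify versions or any NLTK data.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "cleantext" and "nltk" are the import names of the two requirements.
missing = missing_modules(["cleantext", "nltk"])
print("missing:", missing or "none")
```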

Usage

  • Import the library:
import cleantext
  • Choose a method:

To return the text in a string format,

cleantext.clean("your_raw_text_here", all=True)

To return a list of words from the text,

cleantext.clean_words("your_raw_text_here", all=True)

To choose a specific set of cleaning operations,

cleantext.clean_words("your_raw_text_here",
    extra_spaces=True,   # Remove extra white spaces
    stemming=True,       # Stem the words
    stopwords=True,      # Remove stop words
    lowercase=True,      # Convert to lowercase
    numbers=True,        # Remove all digits
    punct=True,          # Remove all punctuation
    stp_lang='english'   # Language for stop words
)
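
When the same set of operations is applied to many texts, the options can be stored once and unpacked into each call. The sketch below assumes nothing beyond the keyword names shown above. clean_batch, OPTIONS, and demo_cleaner are hypothetical names, and demo_cleaner is only a stand-in that lets the example run without cleantext installed.

```python
def clean_batch(texts, cleaner, **options):
    """Apply one cleaning callable with the same options to every text."""
    return [cleaner(text, **options) for text in texts]


# Option names as documented in this README; reuse them across calls.
OPTIONS = {"extra_spaces": True, "lowercase": True, "numbers": True, "punct": True}


def demo_cleaner(text, extra_spaces=False, lowercase=False, **_ignored):
    # Hypothetical stand-in: honours only two of the options.
    if lowercase:
        text = text.lower()
    if extra_spaces:
        text = " ".join(text.split())
    return text


print(clean_batch(["Hello   World", "  A  B "], demo_cleaner, **OPTIONS))
# ['hello world', 'a b']
```

With the library installed, passing cleantext.clean or cleantext.clean_words as the cleaner argument in place of demo_cleaner should apply the real operations.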

Examples

import cleantext
cleantext.clean('This is A s$ample !!!! tExt3% to   cleaN566556+2+59*/133', extra_spaces=True, lowercase=True, numbers=True, punct=True)

returns:

'this is a sample text to clean'

import cleantext
cleantext.clean_words('This is A s$ample !!!! tExt3% to   cleaN566556+2+59*/133', all=True)

returns:

['sampl', 'text', 'clean']
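
Only three words remain in this result because, with all=True, stop-word removal and stemming run in addition to the character-level cleaning. The sketch below illustrates the stop-word step alone on the already cleaned string. DEMO_STOP_WORDS is a tiny hypothetical set chosen for this example, and a real stop-word list such as NLTK's English one is much longer. remove_stop_words is likewise a hypothetical helper, not part of cleantext.

```python
# Tiny illustrative stop-word set (an assumption for this example only).
DEMO_STOP_WORDS = {"this", "is", "a", "to", "the", "am", "are"}


def remove_stop_words(text, stop_words=DEMO_STOP_WORDS):
    """Split on whitespace and keep only the words not in stop_words."""
    return [word for word in text.split() if word not in stop_words]


print(remove_stop_words("this is a sample text to clean"))
# ['sample', 'text', 'clean']
```

Stemming is what then shortens sample to sampl in the documented output. This sketch does not attempt that step.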

License

MIT

For questions, issues, bug reports, and suggestions, please visit here.

Download files

Source Distribution

cleantext-1.1.0.tar.gz (2.6 kB view details)

Uploaded Source

Built Distribution

cleantext-1.1.0-py3-none-any.whl (3.7 kB view details)

Uploaded Python 3
