An open-source Python package to clean raw text data

Project description

cleantext

cleantext is an open-source Python package to clean raw text data. Source code for the library can be found here.

Features

cleantext has two main methods:

  • clean: to clean raw text and return the cleaned text
  • clean_words: to clean raw text and return a list of clean words

cleantext can apply all, or a selected combination of the following cleaning operations:

  • Remove extra white spaces
  • Convert the entire text into a uniform lowercase
  • Remove digits from the text
  • Remove punctuation from the text
  • Remove or replace the part of text with custom regex
  • Remove stop words, and choose a language for the stop-word list (stop words are the most common words in a language, which carry little meaning on their own, such as is, am, the, this, are, etc.)
  • Stem the words (stemming reduces inflected forms of a word to a common stem. For example, stemming the words run, runs, running results in run, run, run)
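
To make the first four operations concrete, here is a minimal sketch that approximates them with the Python standard library only. It is not the package's actual implementation; the function name `basic_clean` is hypothetical, and the order of the steps is an assumption.

```python
import re
import string


def basic_clean(text: str) -> str:
    """Approximate lowercase, digit, punctuation and white-space cleaning.

    Hypothetical helper for illustration; not part of cleantext.
    """
    # Convert the entire text to lowercase
    text = text.lower()
    # Remove digits
    text = re.sub(r"\d+", "", text)
    # Remove ASCII punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse runs of white space and trim both ends
    return " ".join(text.split())


print(basic_clean("This is A s$ample !!!! tExt3% to   cleaN566556+2+59*/133"))
# this is a sample text to clean
```

Stop-word removal and stemming are left out of this sketch because they depend on language resources that cleantext obtains through NLTK.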

Installation

cleantext requires Python 3 and NLTK to execute.

To install using pip, run

pip install cleantext

Usage

  • Import the library:
import cleantext
  • Choose a method:

To return the text in a string format,

cleantext.clean("your_raw_text_here") 

To return a list of words from the text,

cleantext.clean_words("your_raw_text_here") 

To choose a specific set of cleaning operations,

cleantext.clean_words("your_raw_text_here",
    clean_all=False,  # Set to True to execute all cleaning operations
    extra_spaces=True,  # Remove extra white spaces
    stemming=True,  # Stem the words
    stopwords=True,  # Remove stop words
    lowercase=True,  # Convert to lowercase
    numbers=True,  # Remove all digits
    punct=True,  # Remove all punctuation
    reg='<regex>',  # Remove parts of text based on regex
    reg_replace='<replace_value>',  # String to replace the regex matches with
    stp_lang='english'  # Language for stop words
)

Examples

import cleantext
cleantext.clean('This is A s$ample !!!! tExt3% to   cleaN566556+2+59*/133', extra_spaces=True, lowercase=True, numbers=True, punct=True)

returns,

'this is a sample text to clean'

import cleantext
cleantext.clean_words('This is A s$ample !!!! tExt3% to   cleaN566556+2+59*/133')

returns,

['sampl', 'text', 'clean']
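
The result above is shorter than the input because the stop words (this, is, a, to) are dropped and the remaining words are stemmed (sample becomes sampl). The stop-word step can be sketched as follows. The stop-word set here is a deliberately tiny, hypothetical one for illustration; the package presumably draws its per-language lists from NLTK. Stemming is not reproduced here.

```python
from typing import Iterable, List

# Hypothetical, hand-picked stop words for illustration only.
STOP_WORDS = {"this", "is", "a", "to", "the", "am", "are"}


def remove_stop_words(words: Iterable[str]) -> List[str]:
    """Keep only the words that are not in STOP_WORDS (case-insensitive)."""
    return [word for word in words if word.lower() not in STOP_WORDS]


print(remove_stop_words("this is a sample text to clean".split()))
# ['sample', 'text', 'clean']
```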

from cleantext import clean
text = "my id, name1@dom1.com and your, name2@dom2.in"
clean(text, reg=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", reg_replace='email', clean_all=False)

returns,

"my id, email and your, email"
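
The effect of reg and reg_replace in the example above can be reproduced with the standard re module. This is a sketch of the idea and not necessarily how cleantext applies the pattern internally; `replace_matches` and `EMAIL_PATTERN` are hypothetical names.

```python
import re

# Same pattern as in the example above
EMAIL_PATTERN = r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+"


def replace_matches(text: str, pattern: str, replacement: str = "") -> str:
    """Replace every match of pattern in text with replacement."""
    return re.sub(pattern, replacement, text)


text = "my id, name1@dom1.com and your, name2@dom2.in"
print(replace_matches(text, EMAIL_PATTERN, "email"))
# my id, email and your, email
```

In this sketch the pattern lists lowercase letters only, so addresses containing uppercase letters are left unchanged unless the text is lowercased first or the pattern is compiled with re.IGNORECASE.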

License

MIT

For questions, issues, bug reports, and suggestions, please visit here.
