An open-source Python package to clean raw text data
cleantext
cleantext is an open-source Python package to clean raw text data. Source code for the library can be found here.
Features
cleantext has two main functions:
- clean: to clean raw text and return the cleaned text
- clean_words: to clean raw text and return a list of clean words
cleantext can apply all, or a selected combination of the following cleaning operations:
- Remove extra white spaces
- Convert the entire text into a uniform lowercase
- Remove digits from the text
- Remove punctuation from the text
- Remove or replace parts of the text that match a custom regex
- Remove stop words, and choose a language for stop words (stop words are the most common words in a language that carry little meaning on their own, such as is, am, the, this, are, etc.)
- Stem the words (stemming reduces inflected forms of a word to a common root. For example, stemming the words run, runs, running results in run, run, run)
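To make the first four operations concrete, here is a minimal sketch written with only the Python standard library. It is not cleantext's implementation; the function name basic_clean is hypothetical, and the library may handle edge cases differently.

```python
import re
import string


def basic_clean(text: str) -> str:
    """Illustrative only: lowercase, drop digits and punctuation, collapse spaces."""
    text = text.lower()                       # uniform lowercase
    text = re.sub(r"\d+", "", text)           # remove digits
    # Remove ASCII punctuation characters such as $ ! % + * /
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"\s+", " ", text).strip()  # collapse extra white space
    return text


print(basic_clean("This is A s$ample !!!! tExt3% to cleaN566556+2+59*/133"))
# prints: this is a sample text to clean
```

White space is collapsed last in this sketch because removing digits and punctuation can leave new runs of spaces behind.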
Installation
cleantext requires Python 3 and NLTK to execute.
To install using pip, use
pip install cleantext
Usage
- Import the library:
import cleantext
- Choose a method:
To return the text in a string format,
cleantext.clean("your_raw_text_here")
To return a list of words from the text,
cleantext.clean_words("your_raw_text_here")
To choose a specific set of cleaning operations,
cleantext.clean_words("your_raw_text_here",
    clean_all=False, # If True, execute all cleaning operations
    extra_spaces=True, # Remove extra white spaces
    stemming=True, # Stem the words
    stopwords=True, # Remove stop words
    lowercase=True, # Convert to lowercase
    numbers=True, # Remove all digits
    punct=True, # Remove all punctuation
    reg='<regex>', # Remove parts of text based on regex
    reg_replace='<replace_value>', # String to replace the regex used in reg
    stp_lang='english' # Language for stop words
)
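As a rough mental model, the reg and reg_replace options act like a regular-expression substitution. The sketch below uses Python's standard re.sub rather than cleantext itself; the helper name replace_matches and the date pattern are hypothetical, and the library's internals may differ.

```python
import re

# Hypothetical pattern for ISO-style dates such as 2022-01-05.
DATE_PATTERN = r"\d{4}-\d{2}-\d{2}"


def replace_matches(text: str, pattern: str, replacement: str = "") -> str:
    """Replace every match of `pattern` in `text` with `replacement`.

    With the default empty replacement, matches are simply removed.
    """
    return re.sub(pattern, replacement, text)


text = "released 2022-01-05, updated 2022-02-10"
print(replace_matches(text, DATE_PATTERN, "date"))
# prints: released date, updated date
```

Leaving the replacement empty removes each match, which corresponds to the "remove" half of the "remove or replace" behaviour described in the feature list.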
Examples
import cleantext
cleantext.clean('This is A s$ample !!!! tExt3% to cleaN566556+2+59*/133', extra_spaces=True, lowercase=True, numbers=True, punct=True)
returns,
'this is a sample text to clean'
import cleantext
cleantext.clean_words('This is A s$ample !!!! tExt3% to cleaN566556+2+59*/133')
returns,
['sampl', 'text', 'clean']
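The result above is shorter than the input because the stop words (this, is, a, to) are dropped and the stemmer shortens sample to sampl. The stop-word step can be sketched with the standard library alone. The stop list below is a tiny hand-picked assumption, the name content_words is hypothetical, and stemming is deliberately left out, so the output keeps sample unstemmed.

```python
import re
import string

# Tiny hand-picked stop list for illustration; real stop-word lists are much larger.
TOY_STOP_WORDS = frozenset({"a", "an", "the", "this", "is", "am", "are", "to"})


def content_words(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, strip digits and punctuation, then drop stop words."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # str.split() with no arguments also discards extra white space.
    return [word for word in text.split() if word not in stop_words]


words = content_words("This is A s$ample !!!! tExt3% to cleaN566556+2+59*/133")
print(words)
# prints: ['sample', 'text', 'clean']
```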
from cleantext import clean
text = "my id, name1@dom1.com and your, name2@dom2.in"
clean(text, reg=r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", reg_replace='email', clean_all=False)
returns,
"my id, email and your, email"
License
MIT
For any questions, issues, bugs, and suggestions please visit here
File details
Details for the file cleantext-1.1.4.tar.gz.
File metadata
- Download URL: cleantext-1.1.4.tar.gz
- Upload date:
- Size: 4.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 854003de912406d8d821623774b307dc6f0626fd9fac0bdc5d24864ee3f37578
MD5 | f41366f4393aba6490e635c51936453c
BLAKE2b-256 | 9e39883774dadb46a8ea348ddbdc9dfdb9aaa1a104825e65ee9ebe9a375f46e0
File details
Details for the file cleantext-1.1.4-py3-none-any.whl.
File metadata
- Download URL: cleantext-1.1.4-py3-none-any.whl
- Upload date:
- Size: 4.9 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/3.7.1 importlib_metadata/4.10.0 pkginfo/1.8.2 requests/2.26.0 requests-toolbelt/0.9.1 tqdm/4.62.3 CPython/3.9.9
File hashes
Algorithm | Hash digest
---|---
SHA256 | 138a658a8084796793910c876140002435ffc7ce51a9abf28d2a6b059a7a4d13
MD5 | 90047d93770255bb806a85916528a017
BLAKE2b-256 | dfd0bd954cf316c1d3a605a9bc29d2cf2bbd388b82d2626b60ab92e8d18457a3