A lightweight Python package to clean English text by removing HTML tags, URLs, emojis, digits, and punctuation.
Project description
txtcleanen
txtcleanen is a simple Python package for cleaning English text. It removes HTML tags, URLs, emojis, numbers, and punctuation, and collapses extra whitespace, to prepare text for Natural Language Processing tasks.
Features
- Remove HTML tags
- Remove URLs
- Remove emojis
- Remove digits and punctuation
- Normalize Unicode text
- Compact multiple spaces into one
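The steps listed above can be sketched with the standard library alone. This is an illustrative approximation, not the package's actual implementation: the function name `clean_text_sketch`, the regular expressions, the emoji ranges, and the order of the steps are all assumptions made for this example.

```python
import re
import string
import unicodedata

# Hypothetical patterns for illustration; the package's own rules may differ.
HTML_TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMOJI_RE = re.compile(
    "["
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, extended symbols
    "\U00002600-\U000027BF"  # miscellaneous symbols and dingbats
    "\U0001F1E6-\U0001F1FF"  # regional indicator (flag) letters
    "\uFE0F"                 # emoji variation selector
    "]+"
)
DIGIT_PUNCT_RE = re.compile(r"[\d" + re.escape(string.punctuation) + r"]")
SPACE_RE = re.compile(r"\s+")


def clean_text_sketch(text: str) -> str:
    """Apply the feature list in order; each match is replaced by a space."""
    text = unicodedata.normalize("NFKC", text)  # normalize Unicode forms
    text = HTML_TAG_RE.sub(" ", text)           # remove HTML tags
    text = URL_RE.sub(" ", text)                # remove URLs before punctuation
    text = EMOJI_RE.sub(" ", text)              # remove emojis
    text = DIGIT_PUNCT_RE.sub(" ", text)        # remove digits and punctuation
    return SPACE_RE.sub(" ", text).strip()      # collapse runs of whitespace


print(clean_text_sketch("Hello <b>World!</b> Visit https://example.com now!"))
# Hello World Visit now
```

URLs are removed before punctuation in this sketch because stripping the punctuation first would break a URL into fragments that the URL pattern no longer matches.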
Installation
pip install txtcleanen
Example
import txtcleanen
text = "Hello <b>World!</b> Visit https://example.com now!"
clean_text = txtcleanen(text)
print(clean_text)
# Output: "Hello World Visit now"
Download files
Source Distribution
txtcleanen-1.0.0.tar.gz (2.7 kB)
Built Distribution
txtcleanen-1.0.0-py3-none-any.whl (3.2 kB)
File details
Details for the file txtcleanen-1.0.0.tar.gz.
File metadata
- Download URL: txtcleanen-1.0.0.tar.gz
- Upload date:
- Size: 2.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8a08a89db8320859549a703429a78fa34c5eb82f0f833bed022a4c980a3451d8 |
| MD5 | 9431ebe7c5a8041bddee910991576cec |
| BLAKE2b-256 | 46591248952ffaa9001bd2764217b9c1e0e9d1b931e1202ee69e5ccf50fd1aed |
File details
Details for the file txtcleanen-1.0.0-py3-none-any.whl.
File metadata
- Download URL: txtcleanen-1.0.0-py3-none-any.whl
- Upload date:
- Size: 3.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.5
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0cbcfc51b582d57082a45732d821d0bb9485557694cec51cd6752f78592c23c2 |
| MD5 | 0d25819fcd265c00081201ff99805ae0 |
| BLAKE2b-256 | 98b1197cfb3607e51967f8152a0e041950c2d26516f293831f58531ffe63f23b |
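The digests listed above can be used to check a downloaded file before installing it. The following is a minimal sketch using Python's standard `hashlib` module; the helper name `sha256_matches` and its parameters are assumptions made for this example, not part of txtcleanen.

```python
import hashlib
from pathlib import Path


def sha256_matches(path: str, expected_hex: str, chunk_size: int = 65536) -> bool:
    """Return True if the file's SHA256 digest equals expected_hex.

    Hypothetical helper for illustration. The file is read in chunks so
    that large downloads do not have to fit in memory.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    # Published digests are lowercase hex; tolerate stray whitespace and case.
    return digest.hexdigest() == expected_hex.strip().lower()
```

To use it, pass the local path of the downloaded archive and the SHA256 value from the matching table. A result of `False` suggests the file is corrupted or is not the one that was published.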