Clean text of extra spaces and special symbols, as in the CLIP model.
Cleantextclip
A library for preparing text for machine learning and NLP tasks. It originated from the text preparation used for the CLIP model, with a few more rules added.
Installation
pip install -U ternaus_cleantext
Cleans text similarly to, but more strictly than, the CLIP model:
- Escapes HTML characters
- Removes html tags
- Removes URLs
- Removes extra white spaces
- Converts text to lower case
from ternaus_cleantext.ternaus_cleantext import clean_text
print(clean_text("This is a test https://ternaus.com <b>bold</b>"))
returns
this is a test bold
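To make the rules listed above concrete, here is a minimal standard-library sketch of a comparable cleaning pipeline. It is not the library's implementation: the function name `clean_text_sketch` and the three regular expressions are hypothetical, and the sketch assumes that "escapes HTML characters" means decoding HTML entities with `html.unescape`, as CLIP's tokenizer does in its basic cleaning step.

```python
import html
import re

# Hypothetical patterns for illustration; the library's actual rules may differ.
TAG_RE = re.compile(r"<[^>]+>")                # anything that looks like an HTML tag
URL_RE = re.compile(r"https?://\S+|www\.\S+")  # http(s) and bare www. links
WS_RE = re.compile(r"\s+")                     # any run of whitespace


def clean_text_sketch(text: str) -> str:
    """Approximate the listed rules using only the standard library."""
    # Decode entities such as &amp; (applied twice, as CLIP does, to handle double encoding).
    text = html.unescape(html.unescape(text))
    # Drop tags first so that URLs inside attributes disappear with them.
    text = TAG_RE.sub(" ", text)
    # Drop any URLs left in the visible text.
    text = URL_RE.sub(" ", text)
    # Collapse whitespace runs to single spaces and trim the ends.
    text = WS_RE.sub(" ", text).strip()
    return text.lower()


print(clean_text_sketch("This is a test https://ternaus.com <b>bold</b>"))
```

On the example input from this page, the sketch prints the same result as shown above, `this is a test bold`. Edge cases (malformed tags, URLs without a scheme, non-Latin scripts) may be handled differently by the real `clean_text`.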
Download files
Source Distribution
File details
Details for the file ternaus_cleantext-0.0.1.tar.gz.
File metadata
- Download URL: ternaus_cleantext-0.0.1.tar.gz
- Upload date:
- Size: 5.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/4.0.2 CPython/3.10.8
File hashes
Algorithm | Hash digest |
---|---|
SHA256 | 29dbf62943c1717b65c108ce62a824eb00757544f19e8148ceb6a0590b3a5a1b |
MD5 | 6d5bf098a8a6aa66cf3da0106b80e9ad |
BLAKE2b-256 | 0e71fdcf492b444a973555001a9b73173215902e4cd7ea49ee4cf8666d8b70d1 |