Clean text from extra spaces and special symbols as in the CLIP model.

Cleantextclip

A library for preparing text for machine learning and NLP tasks. It started from the text preprocessing used for the CLIP model, with a few more rules added.

Installation

pip install -U ternaus_cleantext

Cleans text in a way similar to the CLIP model, but stricter:

  1. Escapes HTML characters
  2. Removes HTML tags
  3. Removes URLs
  4. Removes extra white spaces
  5. Converts text to lower case

from ternaus_cleantext.ternaus_cleantext import clean_text
print(clean_text("This is a test https://ternaus.com <b>bold</b>"))

prints: this is a test bold
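
To make the five steps above concrete, here is a minimal sketch of how such a pipeline could be written with only the Python standard library. It is not the library's actual implementation: the function name clean_text_sketch and the three regular expressions are assumptions made for illustration, and the HTML-character step is assumed to follow CLIP's approach of decoding entities with html.unescape.

```python
import html
import re

# Hypothetical patterns for illustration; the real library may use different rules.
_TAG_RE = re.compile(r"<[^>]+>")                  # anything that looks like an HTML tag
_URL_RE = re.compile(r"https?://\S+|www\.\S+")    # http(s) links and bare www. links
_WS_RE = re.compile(r"\s+")                       # any run of whitespace


def clean_text_sketch(text: str) -> str:
    """Illustrative re-creation of the five listed steps (not the library's code)."""
    # 1. Decode HTML entities twice, as CLIP's preprocessing does (assumption).
    text = html.unescape(html.unescape(text))
    # 2. Replace HTML tags with a space so neighbouring words stay separated.
    text = _TAG_RE.sub(" ", text)
    # 3. Remove URLs.
    text = _URL_RE.sub(" ", text)
    # 4. Collapse runs of whitespace and trim both ends.
    text = _WS_RE.sub(" ", text).strip()
    # 5. Convert to lower case.
    return text.lower()


if __name__ == "__main__":
    print(clean_text_sketch("This is a test https://ternaus.com <b>bold</b>"))
```

Tags are stripped before URLs in this sketch so that a URL sitting inside an attribute, such as an href, is removed together with its tag rather than swallowing the text that follows it.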

Source Distribution

ternaus_cleantext-0.0.1.tar.gz (5.3 kB)

File details

Details for the file ternaus_cleantext-0.0.1.tar.gz.

File metadata

  • Download URL: ternaus_cleantext-0.0.1.tar.gz
  • Upload date:
  • Size: 5.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/4.0.2 CPython/3.10.8

File hashes

Hashes for ternaus_cleantext-0.0.1.tar.gz

Algorithm    Hash digest
SHA256       29dbf62943c1717b65c108ce62a824eb00757544f19e8148ceb6a0590b3a5a1b
MD5          6d5bf098a8a6aa66cf3da0106b80e9ad
BLAKE2b-256  0e71fdcf492b444a973555001a9b73173215902e4cd7ea49ee4cf8666d8b70d1