Cleans text of extra spaces and special symbols, as in the CLIP model.

Cleantextclip

A library for preparing text for machine learning and NLP tasks. It originated from the text preprocessing used for the CLIP model, with a few additional rules.

Installation

pip install -U ternaus_cleantext

Cleans text similarly to, but more strictly than, the CLIP model:

  1. Unescapes HTML entities
  2. Removes HTML tags
  3. Removes URLs
  4. Removes extra whitespace
  5. Converts text to lower case

from ternaus_cleantext.ternaus_cleantext import clean_text
print(clean_text("This is a test https://ternaus.com <b>bold</b>"))

Output: this is a test bold
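
To make the five steps concrete, here is a minimal standard-library sketch of the same kind of pipeline. It is not the library's implementation: the name `clean_text_sketch` and all three regular expressions are assumptions chosen for illustration, and step 1 is interpreted here as decoding HTML entities with `html.unescape`. The real `clean_text` may use different rules and a different order.

```python
import html
import re

# Hypothetical patterns for illustration; the library's own rules may differ.
_TAG_RE = re.compile(r"<[^>]+>")               # anything that looks like an HTML tag
_URL_RE = re.compile(r"(?:https?://|www\.)\S+")  # http(s) or www-prefixed URLs
_WS_RE = re.compile(r"\s+")                    # runs of whitespace


def clean_text_sketch(text: str) -> str:
    """Approximate the listed cleaning steps (assumed order)."""
    text = html.unescape(text)            # 1. decode entities such as &amp;
    text = _TAG_RE.sub(" ", text)         # 2. drop HTML tags
    text = _URL_RE.sub(" ", text)         # 3. drop URLs
    text = _WS_RE.sub(" ", text).strip()  # 4. collapse extra whitespace
    return text.lower()                   # 5. lower-case the result


if __name__ == "__main__":
    print(clean_text_sketch("This is a test https://ternaus.com <b>bold</b>"))
```

Tags and URLs are replaced with a space rather than an empty string so that neighbouring words do not get glued together; the whitespace step then collapses the leftover gaps.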

