A comprehensive text cleaning and preprocessing pipeline.

SqueakyCleanText

SqueakyCleanText is a handy text cleaning package designed to sanitize text for classical machine learning models and language models (such as BERT, RoBERTa) without altering the meaning of the text.

Features

  • Text sanitization for classical ML models and language models.
  • Removes unnecessary characters and normalizes text.
  • Supports Named Entity Recognition (NER).
  • Identifies the language of the text.
  • Provides cleaned text with stopwords removed.
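To make the stopword-removal feature concrete, here is a minimal standalone sketch of the general idea. It is not the package's implementation: the `STOPWORDS` set and the `remove_stopwords` helper are hypothetical names chosen for illustration, and the real pipeline also applies NER and other normalization steps that are not reproduced here.

```python
import string

# Tiny illustrative stopword set (an assumption); the package's actual list is not shown in this README.
STOPWORDS = {"a", "an", "the", "is", "my", "of", "and", "to"}


def remove_stopwords(text: str) -> str:
    """Strip ASCII punctuation, lowercase, and drop tokens found in STOPWORDS."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    tokens = no_punct.lower().split()
    return " ".join(token for token in tokens if token not in STOPWORDS)


print(remove_stopwords("Hello, My name is John!"))
# hello name john
```

Unlike this sketch, the package's own output in the usage example drops the person name as well, which suggests that entity handling happens before stopword removal.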

Installation

To install SqueakyCleanText, use the following pip command:

pip install SqueakyCleanText

Usage

Here's a simple example to demonstrate how to use the SqueakyCleanText package:

from sct import sct

# Initialize the TextCleaner
sx = sct.TextCleaner()

# Process the text
# lmtext: text for language models; cmtext: text for classical ML; language: detected language
lmtext, cmtext, language = sx.process("Hello, My name is John!")

# Output the result
print(lmtext, cmtext, language)
# Hello, My name is hello name ENGLISH

API

sct.TextCleaner

process(text: str) -> Tuple[str, str, str]

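Processes the input text and returns a tuple containing, in order:

  • Text for language models (lmtext in the usage example): cleaned text with punctuation and unnecessary characters removed.
  • Text for classical ML models (cmtext): cleaned text with stopwords removed.
  • The detected language of the text (language).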
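Since process returns a plain three-element tuple, applying it to many texts is a matter of unpacking each result. The sketch below assumes only the process(text) signature documented above; `clean_batch` and `FakeCleaner` are hypothetical names, and `FakeCleaner` is a stand-in so the example runs without installing the package or downloading models.

```python
from typing import Dict, Iterable, List, Protocol, Tuple


class Cleaner(Protocol):
    """Anything exposing process(text) -> (lmtext, cmtext, language)."""

    def process(self, text: str) -> Tuple[str, str, str]: ...


def clean_batch(cleaner: Cleaner, texts: Iterable[str]) -> List[Dict[str, str]]:
    """Run the cleaner over each text and collect the three outputs by name."""
    rows = []
    for text in texts:
        lmtext, cmtext, language = cleaner.process(text)
        rows.append({"lmtext": lmtext, "cmtext": cmtext, "language": language})
    return rows


class FakeCleaner:
    """Stand-in for sct.TextCleaner (an assumption for this demo, not the real cleaner)."""

    def process(self, text: str) -> Tuple[str, str, str]:
        stripped = text.strip()
        return stripped, stripped.lower(), "ENGLISH"


rows = clean_batch(FakeCleaner(), ["  Hello World  "])
print(rows[0]["cmtext"])
# hello world
```

In real use, an instance of sct.TextCleaner would presumably be passed in place of `FakeCleaner`, since it exposes the same process method.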

TODO

  • Add the ability to change the NER models from the config file, supporting AutoModel and AutoTokenizer.
  • Expand language support for stopwords.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request or open an issue.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

  • Thanks to the contributors and the community for their support.

Download files

Source Distribution

SqueakyCleanText-0.1.1.tar.gz (12.6 kB)

Uploaded Source

Built Distribution

SqueakyCleanText-0.1.1-py3-none-any.whl (16.2 kB)

Uploaded Python 3

File details

Details for the file SqueakyCleanText-0.1.1.tar.gz.

File metadata

  • Download URL: SqueakyCleanText-0.1.1.tar.gz
  • Size: 12.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/5.1.0 CPython/3.10.14

File hashes

Hashes for SqueakyCleanText-0.1.1.tar.gz
Algorithm Hash digest
SHA256 6b4956b295e345ed7a54e5a607dd44d908f901b898752748a26e9e6b955c3da9
MD5 68d55fea7cfc272a4937b4da40b9aba8
BLAKE2b-256 305e6301e3aa06b1c787d6acf57923f35129f25eacaf0ca2f652e84ac30dae16
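A downloaded archive can be checked against the SHA256 digest listed above using only the standard library. This is a minimal sketch: `sha256_of` and `matches` are hypothetical helper names, and the local path in the usage lines assumes the file was saved to the current directory.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 65536) -> str:
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path: Path, expected_hex: str) -> bool:
    """Compare a file's SHA256 digest with an expected hex string, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Assumed local filename; the check only runs if the file is present.
archive = Path("SqueakyCleanText-0.1.1.tar.gz")
expected = "6b4956b295e345ed7a54e5a607dd44d908f901b898752748a26e9e6b955c3da9"
if archive.exists():
    print("hash OK" if matches(archive, expected) else "hash MISMATCH")
```

pip can perform a similar check itself when installing from a requirements file that pins hashes.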

File details

Details for the file SqueakyCleanText-0.1.1-py3-none-any.whl.

File hashes

Hashes for SqueakyCleanText-0.1.1-py3-none-any.whl
Algorithm Hash digest
SHA256 ccbfda05fdeb1513a77e28e222582c74bde99d5677f8c45803a5c691d9843589
MD5 fe129456c14840a4d25cc05d7ac07e68
BLAKE2b-256 71fbf95b0203f30f2735d0acde8225e20acd8ef5baee549f21afe9016b94a1b6
