
A comprehensive text cleaning and preprocessing pipeline.


SqueakyCleanText


In the world of machine learning and natural language processing, clean and well-structured text data is crucial for building effective downstream models and managing token limits in language models.

SqueakyCleanText helps you achieve this by addressing the most common text issues and doing most of the cleanup work for you.

Key Features

  • Encoding Issues: Corrects text encoding problems.
  • HTML and URLs: Removes unnecessarily long HTML tags and URLs, or replaces them with special tokens.
  • Contact Information: Strips emails, phone numbers, and other contact details, or replaces them with special tokens.
  • Isolated Characters: Eliminates isolated letters or symbols that add no value.
  • NER Support: Uses a soft-voting ensemble technique to handle named entities such as location, person, and organisation names, which can be replaced with special tokens if they are not needed in the text.
  • Stopwords and Punctuation: For statistical models, optimizes text by removing stopwords, special symbols, and punctuation.
  • Currency Symbols: Replaces all currency symbols with their alphabetical equivalents.
  • Whitespace Normalization: Removes unnecessary whitespace.
  • Language Detection: Detects the language of the processed text if needed for downstream tasks.
  • Language Support: Supports English, Dutch, German, and Spanish.
  • Dual Output: Provides text for both language-model processing and statistical-model processing.

Benefits for Statistical Models

When working with statistical models, further optimization is often required, such as removing stopwords, special symbols, and punctuation. SqueakyCleanText offers functionality to streamline this process, ensuring that your text data is in optimal shape for classification and other downstream tasks.
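
For example, the statistical-model output can be dropped straight into a bag-of-words pipeline. The sketch below is purely illustrative; scikit-learn, the toy documents, and the labels are assumptions made for this example, not part of SqueakyCleanText.

# Illustrative sketch: feeding the statistical-model output into a
# bag-of-words classifier (scikit-learn and the toy data are assumptions)
from sct import sct
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sx = sct.TextCleaner()

raw_docs = [
    "Call me at +1-555-123-4567 about the invoice!",
    "Let's grab coffee at Starbucks tomorrow :)",
]
labels = [1, 0]  # hypothetical labels, e.g. business vs. casual

# Keep only the statistical-model text (second element of the returned tuple)
clean_docs = [sx.process(doc)[1] for doc in raw_docs]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(clean_docs, labels)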

Advantages of the ensemble NER process

Relying on a single model for Named Entity Recognition is not ideal, as there is a high chance it will miss entities altogether. Combining language-specific NER models makes the ensemble better suited to the text and reduces the chance of missing an entity. The package's NER pipeline also includes a chunking mechanism, so entities are still recognised even when the text is longer than the model's token limit.
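
To make the soft-voting idea concrete, here is a minimal, self-contained sketch of how predictions from several NER models can be combined by averaging confidence scores per text span. It illustrates the general technique only and is not the package's internal implementation.

# Generic soft-voting over NER predictions (illustration only, not the
# package's internal code). Each model contributes {span: {label: confidence}}.
from collections import defaultdict

def soft_vote(predictions):
    # Collect every model's confidence for each (span, label) pair
    scores = defaultdict(lambda: defaultdict(list))
    for model_pred in predictions:
        for span, label_scores in model_pred.items():
            for label, conf in label_scores.items():
                scores[span][label].append(conf)
    # For each span, keep the label with the highest mean confidence
    return {
        span: max(label_scores, key=lambda lbl: sum(label_scores[lbl]) / len(label_scores[lbl]))
        for span, label_scores in scores.items()
    }

# Two hypothetical models disagree about "Jane"; averaging resolves it
model_a = {"Jane": {"PERSON": 0.9}, "Starbucks": {"ORGANISATION": 0.8}}
model_b = {"Jane": {"PERSON": 0.7, "LOCATION": 0.2}, "Starbucks": {"ORGANISATION": 0.6}}
print(soft_vote([model_a, model_b]))
# {'Jane': 'PERSON', 'Starbucks': 'ORGANISATION'}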

By automating these text cleaning steps, SqueakyCleanText ensures your data is prepared efficiently and effectively, saving time and improving model performance.

Installation

To install SqueakyCleanText, use the following pip command:

pip install SqueakyCleanText

Usage

A few examples of how to use the SqueakyCleanText package.

Example texts:

english_text = "Hey John Doe, wanna grab some coffee at Starbucks on 5th Avenue? I'm feeling a bit tired after last night's party at Jane's place. BTW, I can't make it to the meeting at 10:00 AM. LOL! Call me at +1-555-123-4567 or email me at john.doe@example.com. Check out this cool website: https://www.example.com."

dutch_text = "Hé Jan Jansen, wil je wat koffie halen bij Starbucks op de 5e Avenue? Ik voel me een beetje moe na het feest van gisteravond bij Annes huis. Btw, ik kan niet naar de vergadering om 10:00 uur. LOL! Bel me op +31-6-1234-5678 of mail me op jan.jansen@voorbeeld.com. Kijk eens naar deze coole website: https://www.voorbeeld.com."
  • Using it with its default config settings:
# The first import will take a bit of time, so please have patience
from sct import sct

# Initialize the TextCleaner
sx = sct.TextCleaner()

# Process the text
# lmtext   : Text for Language Models
# cmtext   : Text for Classical/Statistical ML
# language : Processed text language

#### --- English Text
lmtext, cmtext, language = sx.process(english_text)
print(f"Language Model Text : {lmtext}")
print(f"Statistical Model Text : {cmtext}")
print(f"Language of the Text : {language}")

# Output the result
# Language Model Text : Hey <PERSON> wanna grab some coffee at Starbucks on <LOCATION> I'm feeling a bit tired after last night's party at <PERSON>'s place. BTW, can't make it to the meeting at <NUMBER><NUMBER> AM. LOL! Call me at <PHONE> or email me at <EMAIL> Check out this cool website: <URL>
# Statistical Model Text : hey person wanna grab coffee starbucks location im feeling bit tired last nights party persons place btw cant make meeting numbernumber am lol call phone email email check cool website url
# Language of the Text : ENGLISH

#### --- Dutch Text
lmtext, cmtext, language = sx.process(dutch_text)
print(f"Language Model Text : {lmtext}")
print(f"Statistical Model Text : {cmtext}")
print(f"Language of the Text : {language}")

# Output the result
# Language Model Text : He <PERSON> wil je wat koffie halen bij <ORGANISATION> op de <LOCATION> Ik voel me een beetje moe na het feest van gisteravond bij Annes huis. Btw, ik kan niet naar de vergadering om <NUMBER><NUMBER> uur. LOL! Bel me op <NUMBER><NUMBER><PHONE> of mail me op <EMAIL> Kijk eens naar deze coole website: <URL>
# Statistical Model Text : he person koffie halen organisation location voel beetje moe feest gisteravond annes huis btw vergadering numbernumber uur lol bel numbernumberphone mail email kijk coole website url
# Language of the Text : DUTCH
  • Using the config to toggle any piece of the package's functionality; let's take NER as an example:
from sct import sct, config

config.CHECK_NER_PROCESS = False
sx = sct.TextCleaner()

lmtext, cmtext, language = sx.process(english_text)
print(f"Language Model Text : {lmtext}")
print(f"Statistical Model Text : {cmtext}")
print(f"Language of the Text : {language}")

# Output the result
# Language Model Text : Hey John Doe, wanna grab some coffee at Starbucks on 5th Avenue? I'm feeling a bit tired after last night's party at Jane's place. BTW, can't make it to the meeting at <NUMBER><NUMBER> AM. LOL! Call me at <PHONE> or email me at <EMAIL> Check out this cool website: <URL>
# Statistical Model Text : hey john doe wanna grab coffee starbucks 5th avenue im feeling bit tired last nights party janes place btw cant make meeting numbernumber am lol call phone email email check cool website url
# Language of the Text : ENGLISH
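
When cleaning a whole corpus, reuse a single TextCleaner instance so the underlying models are loaded only once. The snippet below is just a usage sketch built on the process() API shown above; it assumes english_text and dutch_text are the example strings defined earlier.

# Usage sketch: reuse one TextCleaner across many documents
from sct import sct

sx = sct.TextCleaner()

documents = [english_text, dutch_text]  # any iterable of raw strings

results = [sx.process(doc) for doc in documents]
for lmtext, cmtext, language in results:
    print(language, ":", lmtext[:60], "...")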

API

sct.TextCleaner

process(text: str) -> Tuple[str, str, str]

Processes the input text and returns a tuple containing:

  • Cleaned text prepared for language models, with unnecessary characters removed.
  • Cleaned text prepared for statistical models, with stopwords, special symbols, and punctuation also removed.
  • Detected language of the text.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request or open an issue.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgements

The package took inspiration from the following repository:

