Python package to clean raw tweets for ML applications

Project description

tidyX

tidyX is a Python package designed for cleaning and preprocessing text for machine learning applications, especially for text written in Spanish and originating from social networks. This library provides a complete pipeline to remove unwanted characters, normalize text, group similar terms, etc. to facilitate NLP applications.

For a deeper dive into the package, visit our website.

Installation

Install the package using pip:

pip install tidyX

Make sure you have the necessary dependencies installed. If you plan on lemmatizing, you'll need spaCy along with the appropriate language models. For Spanish lemmatization, we recommend downloading the es_core_news_sm model:

python -m spacy download es_core_news_sm

For English lemmatization, we suggest the en_core_web_sm model:

python -m spacy download en_core_web_sm 

To see a full list of available models for different languages, visit spaCy's documentation.
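Before lemmatizing, it can help to confirm that the models are actually present. The following is a minimal sketch, not part of tidyX: it relies on the fact that spaCy models installed through `spacy download` are regular Python packages, and the helper name `missing_models` is hypothetical.

```python
from importlib.util import find_spec


def missing_models(model_names):
    """Return the names that cannot be found as installed Python packages."""
    # spaCy models are pip-installed packages, so find_spec() can locate them
    # without importing spaCy or loading the model.
    return [name for name in model_names if find_spec(name) is None]


missing = missing_models(["es_core_news_sm", "en_core_web_sm"])
if missing:
    print("Not installed (use python -m spacy download <name>):", ", ".join(missing))
else:
    print("All requested spaCy models are installed.")
```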

Features

  • Standardize Text Pipeline: The preprocess() method provides an all-encompassing solution for quickly and effectively standardizing input strings, with a particular focus on tweets. It transforms the input to lowercase, strips accents (and emojis, if specified), and removes URLs, hashtags, and certain special characters. Additionally, it offers the option to delete stopwords in a specified language, trims extra spaces, extracts mentions, and removes 'RT' prefixes from retweets.

from tidyX import TextPreprocessor as tp

# Raw tweet example
raw_tweet = "RT @user: Check out this link: https://example.com 🌍 #example 😃"

# Applying the preprocess method
cleaned_text = tp.preprocess(raw_tweet)

# Printing the cleaned text
print("Cleaned Text:", cleaned_text)

Output:

Cleaned Text: check out this link

To remove English stopwords, simply add the parameters remove_stopwords=True and language_stopwords="english":

from tidyX import TextPreprocessor as tp

# Raw tweet example
raw_tweet = "RT @user: Check out this link: https://example.com 🌍 #example 😃"

# Applying the preprocess method with additional parameters
cleaned_text = tp.preprocess(raw_tweet, remove_stopwords=True, language_stopwords="english")

# Printing the cleaned text
print("Cleaned Text:", cleaned_text)

Output:

Cleaned Text: check link

For a more detailed explanation of the customizable steps of the function, visit the official preprocess() documentation.
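To make the individual steps concrete, here is a rough standard-library approximation of what such a pipeline does. It is an illustrative sketch only, not tidyX's implementation: the function name `simple_clean` and the regular expressions are assumptions, and it omits stopword removal and mention extraction.

```python
import re
import unicodedata


def simple_clean(text: str) -> str:
    """Approximate the cleaning steps described above (illustration only)."""
    text = re.sub(r"^RT\s+", "", text)          # drop the retweet prefix
    text = re.sub(r"https?://\S+", " ", text)   # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)        # remove mentions and hashtags
    text = text.lower()
    # Strip accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Keep only ASCII letters and whitespace; this also drops emojis and punctuation.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Collapse repeated whitespace and trim the ends.
    return re.sub(r"\s+", " ", text).strip()


print(simple_clean("RT @user: Check out this link: https://example.com 🌍 #example 😃"))
```

The order of the steps matters in this sketch: URLs are removed before punctuation, otherwise fragments such as "https" and "example" would survive as ordinary words.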

  • Stemming and Lemmatizing: One of the foundational steps in preparing text for NLP applications is bringing words to a common base or root. This library provides both stemmer() and lemmatizer() functions to perform this task across various languages.

  • Group similar terms: When working with a corpus sourced from social networks, it's common to encounter texts with grammatical errors or words that aren't formally included in dictionaries. These irregularities can pose challenges when creating Term Frequency matrices for NLP algorithms. To address this, we developed the create_bol() function, which allows you to create specific bags of terms to cluster related terms.

  • Remove unwanted elements: such as special characters, extra spaces, accents, emojis, URLs, and Twitter mentions, among others.

  • Dependency Parsing Visualization: Incorporates visualization tools that enable the display of dependency parses, facilitating linguistic analysis and feature engineering.

  • Much more!
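The idea behind grouping similar terms can be illustrated with a small standard-library sketch. This is not how create_bol() works internally; it swaps in `difflib.SequenceMatcher` similarity with a greedy grouping rule, and the function name `group_similar` and the 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def group_similar(terms, threshold=0.8):
    """Greedy grouping: a term joins the first bag whose lead term is similar enough."""
    bags = []
    for term in terms:
        for bag in bags:
            # Compare against the first term of the bag, which acts as its representative.
            if SequenceMatcher(None, term, bag[0]).ratio() >= threshold:
                bag.append(term)
                break
        else:
            # No existing bag was close enough, so this term starts a new one.
            bags.append([term])
    return bags


print(group_similar(["gobierno", "govierno", "gobiernos", "casa", "caza"]))
```

With this threshold, the misspelling "govierno" and the plural "gobiernos" land in the same bag as "gobierno", while "casa" and "caza" stay apart because short words need a near-exact match to reach 0.8.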

Contributing

Contributions to enhance tidyX are welcome! Feel free to open issues for bug reports, feature requests, or submit pull requests.


Download files

Source Distribution

tidyX-1.6.5.tar.gz (5.4 MB)

Uploaded Source

Built Distribution

tidyX-1.6.5-py3-none-any.whl (25.1 kB)

Uploaded Python 3
