WrdSmth: Your Python Text Preprocessing Toolkit

WrdSmth is a Python library for text preprocessing. Whether you're working on Natural Language Processing (NLP) tasks, data analysis, or machine learning projects, it provides tools to clean, tokenize, stem, lemmatize, and vectorize your text data.

The full documentation is available on GitHub.

Key Features:

  • Cleaning: Remove unwanted characters, HTML tags, punctuation, and extra whitespace.
  • Tokenization: Split text into individual words or sentences.
  • Stemming: Reduce words to their base form (stem).
  • Lemmatization: Convert words to their canonical form (lemma).
  • Vectorization: Transform text into numerical vectors using TF-IDF.
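
To make the vectorization step concrete, here is a minimal sketch of textbook TF-IDF using only the standard library. It is not WrdSmth's implementation: the function name tfidf_sketch, the whitespace tokenization, and the unsmoothed tf * log(N / df) weighting are all assumptions made for illustration, and a library may well use a smoothed or normalized variant.

```python
import math
from collections import Counter


def tfidf_sketch(documents):
    """Textbook TF-IDF; a hypothetical helper, not part of WrdSmth."""
    # Naive whitespace tokenization, lowercased (an assumption of this sketch).
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vocab = sorted(doc_freq)

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        row = []
        for term in vocab:
            tf = counts[term] / total if total else 0.0
            idf = math.log(n_docs / doc_freq[term])
            row.append(tf * idf)
        vectors.append(row)
    return vocab, vectors


vocab, vectors = tfidf_sketch(["the cat sat", "the dog barked"])
print(vocab)
print(vectors[0])
```

A term that occurs in every document (here "the") gets an IDF of zero under this weighting, so it contributes nothing to any vector.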

Easy to Use:

WrdSmth offers a simple and intuitive API, making it easy to integrate into your existing projects. Just install it with pip:

pip install WrdSmth

Usage

1. Cleaning Text

The clean_text function provides various options for cleaning text data.

from WrdSmth.cleaning import clean_text

text = "This is an example text with <br> HTML tags, punctuation!@#$%^&*(), numbers 123, a URL https://www.example.com and an email example@example.com."

# Clean text with all default options
cleaned_text = clean_text(text)
print(cleaned_text)
# Output: this is an example text with html tags punctuation numbers 123 a url httpswwwexamplecom and an email exampleexamplecom

Parameters:

  • text (str): Text to be cleaned.
  • remove_html (bool, optional): Remove HTML tags. Defaults to True.
  • remove_punctuation (bool, optional): Remove punctuation. Defaults to True.
  • lowercase (bool, optional): Convert text to lowercase. Defaults to True.
  • remove_extra_spaces (bool, optional): Remove extra spaces. Defaults to True.
  • remove_numbers (bool, optional): Remove numbers. Defaults to False.
  • replace_urls (bool, optional): Replace URLs with a placeholder. Defaults to False.
  • replace_emails (bool, optional): Replace email addresses with a placeholder. Defaults to False.
  • custom_regex (str, optional): Custom regular expression pattern to remove. Defaults to None.
  • normalize_unicode (bool, optional): Normalize Unicode characters. Defaults to False.
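
To show how these flags could interact, the following is a rough standard-library sketch of a cleaning pipeline that mirrors most of the parameter names above. It is an illustration only, not WrdSmth's code: the function name clean_text_sketch, the regular expressions, the order of the steps, and the URL and EMAIL placeholder strings are assumptions, and normalize_unicode is left out for brevity.

```python
import re


def clean_text_sketch(
    text,
    remove_html=True,
    remove_punctuation=True,
    lowercase=True,
    remove_extra_spaces=True,
    remove_numbers=False,
    replace_urls=False,
    replace_emails=False,
    custom_regex=None,
):
    """Illustrative stand-in for clean_text; NOT WrdSmth's implementation."""
    # Replace URLs and emails first, before punctuation removal mangles them.
    if replace_urls:
        text = re.sub(r"https?://\S+", " URL ", text)  # placeholder is hypothetical
    if replace_emails:
        text = re.sub(r"\S+@\S+\.\w+", " EMAIL ", text)  # placeholder is hypothetical
    if remove_html:
        text = re.sub(r"<[^>]+>", " ", text)
    if custom_regex:
        text = re.sub(custom_regex, "", text)
    if remove_punctuation:
        # Drop everything that is neither a word character nor whitespace.
        text = re.sub(r"[^\w\s]", "", text)
    if remove_numbers:
        text = re.sub(r"\d+", "", text)
    if lowercase:
        text = text.lower()
    if remove_extra_spaces:
        text = re.sub(r"\s+", " ", text).strip()
    return text


sample = "Visit <b>our</b> site https://www.example.com, or mail example@example.com! 42 times."
print(clean_text_sketch(sample))
print(clean_text_sketch(sample, remove_numbers=True, replace_urls=True, replace_emails=True))
```

In this sketch the order matters: with the defaults, a URL loses its punctuation and collapses into a single token, whereas enabling replace_urls swaps it for a placeholder before punctuation is stripped.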

2. Tokenization

The tokenize_text function offers various tokenization methods:

from WrdSmth.tokenization import tokenize_text

text = "This is a sentence. This is another sentence."

# Word tokenization
word_tokens = tokenize_text(text, method='word')
print(word_tokens)
# Output: ['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']

# Sentence tokenization
sentence_tokens = tokenize_text(text, method='sentence')
print(sentence_tokens)
# Output: ['This is a sentence.', 'This is another sentence.']

Parameters:

  • text (str): Text to be tokenized.
  • method (str, optional): Tokenization method ('word', 'sentence', 'regex', 'custom'). Defaults to 'word'.
  • language (str, optional): Language of the text. Defaults to 'english'. If None, the language will be detected automatically.
  • n_gram_range (tuple, optional): Minimum and maximum n-gram lengths (for 'word' method). Defaults to (1, 1).
  • regex_pattern (str, optional): Regular expression pattern for tokenization (for 'regex' method). Defaults to None.
  • remove_stopwords (bool, optional): Whether to remove stop words. Defaults to False.
  • stopwords (list, optional): List of stop words to remove. Defaults to None (uses NLTK's English stop words).
  • lowercase (bool, optional): Whether to lowercase the tokens. Defaults to False.
  • custom_tokenizer (callable, optional): Custom tokenizer function. Defaults to None.
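
As a rough picture of what the word and sentence methods, n_gram_range, and stop-word removal might do, here is a small standard-library sketch. It is not WrdSmth's implementation: the function names, the regular expressions, the space-joined n-gram format, and the tiny stop-word set (a stand-in for NLTK's list) are all assumptions.

```python
import re

# Tiny stand-in for a real stop-word list such as NLTK's (assumption).
DEFAULT_STOPWORDS = {"a", "an", "the", "is", "are", "this", "that"}


def tokenize_sketch(text, n_gram_range=(1, 1), remove_stopwords=False,
                    stopwords=None, lowercase=False):
    """Illustrative word tokenizer; NOT WrdSmth's implementation."""
    # Words become tokens; each punctuation mark becomes its own token.
    tokens = re.findall(r"\w+|[^\w\s]", text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if remove_stopwords:
        stop = {w.lower() for w in (stopwords or DEFAULT_STOPWORDS)}
        tokens = [t for t in tokens if t.lower() not in stop]
    # Build n-grams for every length in the inclusive range.
    low, high = n_gram_range
    return [
        " ".join(tokens[i:i + n])
        for n in range(low, high + 1)
        for i in range(len(tokens) - n + 1)
    ]


def split_sentences_sketch(text):
    """Split after '.', '!' or '?' followed by whitespace (a naive rule)."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


text = "This is a sentence. This is another sentence."
print(tokenize_sketch(text))
print(split_sentences_sketch(text))
print(tokenize_sketch("a b c", n_gram_range=(1, 2)))
```

The naive sentence rule breaks on abbreviations such as "Dr. Smith", which is one reason real tokenizers rely on trained models or language-specific rules.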

and more...
