Your Python Text Preprocessing Toolkit

WrdSmth: Your Python Text Preprocessing Toolkit

WrdSmth is a Python library for text preprocessing. Whether you're working on Natural Language Processing (NLP) tasks, data analysis, or machine learning projects, WrdSmth provides tools to clean, tokenize, stem, lemmatize, and vectorize your text data.

The full documentation is available on GitHub.

Key Features:

Cleaning: Remove unwanted characters, HTML tags, punctuation, and extra whitespace.
Tokenization: Split text into individual words or sentences.
Stemming: Reduce words to their base form (stem).
Lemmatization: Convert words to their canonical form (lemma).
Vectorization: Transform text into numerical vectors using TF-IDF.
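
To make the vectorization feature concrete, here is a minimal sketch of TF-IDF weighting in plain Python. It is not WrdSmth's implementation: the function name tf_idf and the sample documents are hypothetical, and the sketch uses the plain unsmoothed formula (term frequency times log of inverse document frequency). Libraries such as scikit-learn apply smoothing and normalization by default, so their numbers will differ.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return (vocabulary, vectors) for a list of tokenized documents.

    Hypothetical helper for illustration only; not part of WrdSmth.
    """
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: how many documents contain each term.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Unsmoothed inverse document frequency.
    idf = {tok: math.log(n_docs / df[tok]) for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        vectors.append([counts[tok] / total * idf[tok] for tok in vocab])
    return vocab, vectors


docs = [
    ["clean", "text", "data"],
    ["text", "tokens"],
    ["clean", "tokens", "tokens"],
]
vocab, vectors = tf_idf(docs)
print(vocab)
for vec in vectors:
    print([round(x, 3) for x in vec])
```

Each document becomes one vector with one entry per vocabulary term. Terms that appear in fewer documents (here "data") receive a larger weight than terms shared across documents.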

Easy to Use:

WrdSmth offers a simple and intuitive API, making it easy to integrate into your existing projects. Just install it with pip:

pip install WrdSmth

Usage

1. Cleaning Text

The clean_text function provides various options for cleaning text data.

from WrdSmth.cleaning import clean_text

text = "This is an example text with <br> HTML tags, punctuation!@#$%^&*(), numbers 123, a URL https://www.example.com and an email example@example.com."

# Clean text with all default options
cleaned_text = clean_text(text)
print(cleaned_text)
# Expected output with the documented defaults: this is an example text with html tags punctuation numbers 123 a url httpswwwexamplecom and an email exampleexamplecom

Parameters:

text (str): Text to be cleaned.
remove_html (bool, optional): Remove HTML tags. Defaults to True.
remove_punctuation (bool, optional): Remove punctuation. Defaults to True.
lowercase (bool, optional): Convert text to lowercase. Defaults to True.
remove_extra_spaces (bool, optional): Remove extra spaces. Defaults to True.
remove_numbers (bool, optional): Remove numbers. Defaults to False.
replace_urls (bool, optional): Replace URLs with a placeholder. Defaults to False.
replace_emails (bool, optional): Replace email addresses with a placeholder. Defaults to False.
custom_regex (str, optional): Custom regular expression pattern to remove. Defaults to None.
normalize_unicode (bool, optional): Normalize Unicode characters. Defaults to False.
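
The parameters above interact: with the defaults, punctuation removal also strips the separators inside URLs and email addresses, which is why replacing them with placeholders is useful. The following is a stand-alone sketch of that behaviour using only the standard library. It is not WrdSmth's code; the function clean_sketch, the regular expressions, and the placeholder strings URLTOKEN and EMAILTOKEN are all assumptions made for illustration.

```python
import re
import string
import unicodedata

# Illustrative patterns only; WrdSmth's actual patterns may differ.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
TAG_RE = re.compile(r"<[^>]+>")


def clean_sketch(text, replace_urls=False, replace_emails=False,
                 remove_numbers=False, normalize_unicode=False):
    """Hypothetical stand-in that mimics the documented options."""
    if normalize_unicode:
        text = unicodedata.normalize("NFKC", text)
    # Placeholders must be inserted before punctuation is stripped,
    # otherwise the URL and email patterns no longer match.
    if replace_urls:
        text = URL_RE.sub(" URLTOKEN ", text)
    if replace_emails:
        text = EMAIL_RE.sub(" EMAILTOKEN ", text)
    text = TAG_RE.sub(" ", text)
    if remove_numbers:
        text = re.sub(r"\d+", " ", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = text.lower()
    return " ".join(text.split())  # collapse extra whitespace


text = ("This is an example text with <br> HTML tags, punctuation!@#$%^&*(), "
        "numbers 123, a URL https://www.example.com and an email example@example.com.")

print(clean_sketch(text))
# this is an example text with html tags punctuation numbers 123 a url httpswwwexamplecom and an email exampleexamplecom

print(clean_sketch(text, replace_urls=True, replace_emails=True, remove_numbers=True))
# this is an example text with html tags punctuation numbers a url urltoken and an email emailtoken
```

The order of operations is the main design point: placeholder substitution runs first, and the placeholders are purely alphabetic so that later punctuation removal leaves them intact.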

2. Tokenization

The tokenize_text function offers various tokenization methods:

from WrdSmth.tokenization import tokenize_text

text = "This is a sentence. This is another sentence."

# Word tokenization
word_tokens = tokenize_text(text, method='word')
print(word_tokens)
# Output: ['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']

# Sentence tokenization
sentence_tokens = tokenize_text(text, method='sentence')
print(sentence_tokens)
# Output: ['This is a sentence.', 'This is another sentence.']

Parameters:

text (str): Text to be tokenized.
method (str, optional): Tokenization method ('word', 'sentence', 'regex', 'custom'). Defaults to 'word'.
language (str, optional): Language of the text. Defaults to 'english'. If None, the language will be detected automatically.
n_gram_range (tuple, optional): Minimum and maximum n-gram lengths (for 'word' method). Defaults to (1, 1).
regex_pattern (str, optional): Regular expression pattern for tokenization (for 'regex' method). Defaults to None.
remove_stopwords (bool, optional): Whether to remove stop words. Defaults to False.
stopwords (list, optional): List of stop words to remove. Defaults to None (uses NLTK's English stop words).
lowercase (bool, optional): Whether to lowercase the tokens. Defaults to False.
custom_tokenizer (callable, optional): Custom tokenizer function. Defaults to None.
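
To illustrate how options such as n_gram_range, remove_stopwords, lowercase, and regex_pattern could combine, here is a small sketch in plain Python. It does not call WrdSmth and does not claim to match its output format: the function tokenize_sketch, the tiny stop word set, and the choice to return n-grams as space-joined strings are assumptions for illustration.

```python
import re

# Tiny illustrative stop word set; NLTK's English list is much larger.
STOPWORDS = {"this", "is", "a", "another"}


def tokenize_sketch(text, n_gram_range=(1, 1), remove_stopwords=False,
                    lowercase=False, regex_pattern=r"\w+"):
    """Hypothetical regex tokenizer with optional stop word removal and n-grams."""
    if lowercase:
        text = text.lower()
    tokens = re.findall(regex_pattern, text)
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    min_n, max_n = n_gram_range
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of width n over the token list.
        grams.extend(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return grams


print(tokenize_sketch("Clean text data", n_gram_range=(1, 2)))
# ['Clean', 'text', 'data', 'Clean text', 'text data']

text = "This is a sentence. This is another sentence."
print(tokenize_sketch(text, lowercase=True, remove_stopwords=True))
# ['sentence', 'sentence']
```

Note that stop words are removed before n-grams are built, so bigrams in this sketch can join words that were not adjacent in the original text.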

and more...

Project details

Release history

Version	Release date
0.1.6 (this version)	Aug 26, 2024
0.1.5	Aug 26, 2024
0.1.4.1	Aug 26, 2024
0.1.4	Aug 26, 2024
0.1.3.1	Aug 26, 2024
0.1.3	Aug 26, 2024
0.1.2	Aug 26, 2024
0.1.1	Aug 26, 2024
0.1.0	Aug 26, 2024

Download files

Source Distribution

wrdsmth-0.1.6.tar.gz (7.2 kB)

Uploaded Aug 26, 2024 Source

Built Distribution

WrdSmth-0.1.6-py3-none-any.whl (8.1 kB)

Uploaded Aug 26, 2024 Python 3

File details

Details for the file wrdsmth-0.1.6.tar.gz.

File metadata

Download URL: wrdsmth-0.1.6.tar.gz
Upload date: Aug 26, 2024
Size: 7.2 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for wrdsmth-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`e7cf18bd882d1579ee3551654b466b5295fd7c9801e6347c0f2d9d0c77e4e03f`
MD5	`c280ab49fe73417e68d1d966143431ba`
BLAKE2b-256	`7d27cf48685cded9d5b1aa7e9c8f0e1635fd3b97bfce8629af70dd4e78c5b03b`

File details

Details for the file WrdSmth-0.1.6-py3-none-any.whl.

File metadata

Download URL: WrdSmth-0.1.6-py3-none-any.whl
Upload date: Aug 26, 2024
Size: 8.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/5.1.1 CPython/3.12.3

File hashes

Hashes for WrdSmth-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`518635319a776f4abf1e4603c1465e3a783638f7f87765ae541cd8a343a61d4b`
MD5	`92c20754333b11c7c62ad1c3fd5cba4a`
BLAKE2b-256	`b3ac304bbfcbcb87622f64d1441fb93978267728ec71aea92a209ac4502c61a9`
