WrdSmth: Your Python Text Preprocessing Toolkit
WrdSmth is a versatile Python library designed to streamline your text preprocessing workflow. Whether you're working on Natural Language Processing (NLP) tasks, data analysis, or machine learning projects, WrdSmth provides a comprehensive suite of tools to clean, transform, and prepare your text data for optimal results.
The full documentation is available on GitHub.
Key Features:
- Cleaning: Remove unwanted characters, HTML tags, punctuation, and extra whitespace.
- Tokenization: Split text into individual words or sentences.
- Stemming: Reduce words to their base form (stem).
- Lemmatization: Convert words to their canonical form (lemma).
- Vectorization: Transform text into numerical vectors using TF-IDF.
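To make the vectorization feature concrete, here is a minimal standard-library sketch of textbook TF-IDF weighting (term frequency times log(N / document frequency)). It is an illustration only: the function name `tfidf` is hypothetical, and WrdSmth's own vectorizer may use a different weighting or smoothing scheme.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (illustrative only)."""
    # Naive whitespace tokenization, chosen to keep the sketch short.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            # Term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


docs = ["the cat sat", "the dog sat", "the cat ran"]
vectors = tfidf(docs)
print(vectors[1])
```

A term that appears in every document (here "the") gets weight 0, while a term unique to one document (here "dog") gets the highest weight in that document.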
Easy to Use:
WrdSmth offers a simple and intuitive API, making it easy to integrate into your existing projects. Just install it with pip:
pip install WrdSmth
Usage
1. Cleaning Text
The clean_text function provides various options for cleaning text data.
from WrdSmth.cleaning import clean_text
text = "This is an example text with <br> HTML tags, punctuation!@#$%^&*(), numbers 123, a URL https://www.example.com and an email example@example.com."
# Clean text with all default options
cleaned_text = clean_text(text)
print(cleaned_text)
# Output: this is an example text with html tags numbers 123 a url httpwwwexamplecom and an email exampleexamplecom
Parameters:
- text (str): Text to be cleaned.
- remove_html (bool, optional): Remove HTML tags. Defaults to True.
- remove_punctuation (bool, optional): Remove punctuation. Defaults to True.
- lowercase (bool, optional): Convert text to lowercase. Defaults to True.
- remove_extra_spaces (bool, optional): Remove extra spaces. Defaults to True.
- remove_numbers (bool, optional): Remove numbers. Defaults to False.
- replace_urls (bool, optional): Replace URLs with a placeholder. Defaults to False.
- replace_emails (bool, optional): Replace email addresses with a placeholder. Defaults to False.
- custom_regex (str, optional): Custom regular expression pattern to remove. Defaults to None.
- normalize_unicode (bool, optional): Normalize Unicode characters. Defaults to False.
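The interaction between these options is easier to see in code. The following is a rough standard-library sketch of how several of them could be combined. It is not WrdSmth's implementation: `sketch_clean_text`, the regular expressions, and the placeholder strings `URL` and `EMAIL` are all assumptions made for illustration.

```python
import re
import string


def sketch_clean_text(text, remove_numbers=False, replace_urls=False,
                      replace_emails=False):
    """Hypothetical stand-in for clean_text; not WrdSmth's implementation."""
    # Replace URLs and emails first, before punctuation removal mangles them.
    if replace_urls:
        text = re.sub(r"https?://\S+", " URL ", text)
    if replace_emails:
        text = re.sub(r"\S+@\S+", " EMAIL ", text)
    # Strip HTML tags (a simple pattern, not a full HTML parser).
    text = re.sub(r"<[^>]+>", " ", text)
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    if remove_numbers:
        text = re.sub(r"\d+", " ", text)
    text = text.lower()
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


sample = "Visit <b>us</b> at https://www.example.com or mail info@example.com, room 42!"
print(sketch_clean_text(sample, remove_numbers=True,
                        replace_urls=True, replace_emails=True))
# Output: visit us at url or mail email room
```

The ordering is the main design point in this sketch: if URLs and email addresses are not replaced before punctuation is stripped, they collapse into single tokens such as httpswwwexamplecom.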
2. Tokenization
The tokenize_text function offers various tokenization methods:
from WrdSmth.tokenization import tokenize_text
text = "This is a sentence. This is another sentence."
# Word tokenization
word_tokens = tokenize_text(text, method='word')
print(word_tokens)
# Output: ['This', 'is', 'a', 'sentence', '.', 'This', 'is', 'another', 'sentence', '.']
# Sentence tokenization
sentence_tokens = tokenize_text(text, method='sentence')
print(sentence_tokens)
# Output: ['This is a sentence.', 'This is another sentence.']
Parameters:
- text (str): Text to be tokenized.
- method (str, optional): Tokenization method ('word', 'sentence', 'regex', 'custom'). Defaults to 'word'.
- language (str, optional): Language of the text. Defaults to 'english'. If None, the language will be detected automatically.
- n_gram_range (tuple, optional): Minimum and maximum n-gram lengths (for 'word' method). Defaults to (1, 1).
- regex_pattern (str, optional): Regular expression pattern for tokenization (for 'regex' method). Defaults to None.
- remove_stopwords (bool, optional): Whether to remove stop words. Defaults to False.
- stopwords (list, optional): List of stop words to remove. Defaults to None (uses NLTK's English stop words).
- lowercase (bool, optional): Whether to lowercase the tokens. Defaults to False.
- custom_tokenizer (callable, optional): Custom tokenizer function. Defaults to None.
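As a rough illustration of what regex tokenization, stop-word removal, and an n-gram range mean in practice, here is a standard-library sketch. It is not WrdSmth's implementation: `sketch_tokenize` is a hypothetical name, and both the use of `re.findall` and the choice to return n-grams as space-joined strings are assumptions.

```python
import re


def sketch_tokenize(text, regex_pattern=r"\w+", n_gram_range=(1, 1),
                    stopwords=None, lowercase=False):
    """Hypothetical stand-in for tokenize_text; not WrdSmth's implementation."""
    # Every match of the pattern becomes one token.
    tokens = re.findall(regex_pattern, text)
    if lowercase:
        tokens = [t.lower() for t in tokens]
    if stopwords:
        stop = {w.lower() for w in stopwords}
        tokens = [t for t in tokens if t.lower() not in stop]
    # Build n-grams for every length from min_n to max_n inclusive.
    min_n, max_n = n_gram_range
    ngrams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams


print(sketch_tokenize("This is a sentence.", n_gram_range=(1, 2), lowercase=True))
# Output: ['this', 'is', 'a', 'sentence', 'this is', 'is a', 'a sentence']
print(sketch_tokenize("This is a sentence.", stopwords=["is", "a"]))
# Output: ['This', 'sentence']
```

With the range (1, 2) the result contains all unigrams followed by all bigrams; the default (1, 1) yields plain word tokens.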
File details
Details for the file wrdsmth-0.1.6.tar.gz.
File metadata
- Download URL: wrdsmth-0.1.6.tar.gz
- Upload date:
- Size: 7.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e7cf18bd882d1579ee3551654b466b5295fd7c9801e6347c0f2d9d0c77e4e03f |
| MD5 | c280ab49fe73417e68d1d966143431ba |
| BLAKE2b-256 | 7d27cf48685cded9d5b1aa7e9c8f0e1635fd3b97bfce8629af70dd4e78c5b03b |
File details
Details for the file WrdSmth-0.1.6-py3-none-any.whl.
File metadata
- Download URL: WrdSmth-0.1.6-py3-none-any.whl
- Upload date:
- Size: 8.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/5.1.1 CPython/3.12.3
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 518635319a776f4abf1e4603c1465e3a783638f7f87765ae541cd8a343a61d4b |
| MD5 | 92c20754333b11c7c62ad1c3fd5cba4a |
| BLAKE2b-256 | b3ac304bbfcbcb87622f64d1441fb93978267728ec71aea92a209ac4502c61a9 |