A Python package for automating text preprocessing tasks.
TextPreprocessor
TextPreprocessor is a comprehensive Python library for text preprocessing in NLP tasks. It includes a suite of features such as tokenization, punctuation removal, stopword removal, lemmatization, spell correction, and more. This package is designed to streamline and simplify text preprocessing for data analysis, machine learning, and natural language processing projects.
Features
- Tokenization
- Stopword removal (with customizable stopwords)
- Punctuation removal
- Special character removal
- URL and HTML tag removal
- Lowercasing
- Lemmatization (WordNet-based)
- Spell correction
- Modular preprocessing pipeline
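To illustrate the pipeline idea (this is a standalone sketch, not the package's actual implementation), steps like these can be composed as plain functions that each take and return a string:

```python
import re
import string

def lowercase(text: str) -> str:
    return text.lower()

def remove_punctuation(text: str) -> str:
    # Strip ASCII punctuation characters.
    return text.translate(str.maketrans("", "", string.punctuation))

def remove_extra_whitespace(text: str) -> str:
    # Collapse runs of whitespace left behind by earlier steps.
    return re.sub(r"\s+", " ", text).strip()

def run_pipeline(text: str, steps) -> str:
    # Apply each step in order; the output of one step feeds the next.
    for step in steps:
        text = step(text)
    return text

result = run_pipeline("Hello,   World!", [lowercase, remove_punctuation, remove_extra_whitespace])
print(result)  # hello world
```

The modular design means any subset of steps can be run, in any order, which is the same flexibility the `steps` argument of `preprocess` provides.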
Installation
Ensure the following dependencies are installed:
- Python 3.7 or higher
- Required Python packages:
pip install nltk pyspellchecker pandas
Additionally, download required NLTK resources:
import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
Usage
Initialization
Import the TextPreprocessor class and initialize it. Optionally, pass a list of custom stopwords:
from textPreprocessingToolkit import TextPreprocessor
# Initialize with optional custom stopwords
tpt = TextPreprocessor(custom_stopwords=["example", "test"])
Preprocessing Text
Preprocess a single piece of text by specifying a sequence of preprocessing steps:
text = "Hello! This is an <b>example</b> sentence. Visit https://example.com for more info!"
processed_text = tpt.preprocess(
    text,
    steps=[
        "lowercase",
        "remove_punctuation",
        "remove_special_characters",
        "remove_url",
        "remove_html_tags",
        "correct_spellings",
        "lemmatize_text",
    ]
)
print("Processed Text:", processed_text)
Output:
Processed Text: hello this is an example sentence visit for more info
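The package's exact URL and HTML-tag handling isn't shown here, but a minimal stdlib sketch of what `remove_url` and `remove_html_tags` steps typically do looks like this (the regexes are illustrative assumptions, not the library's own):

```python
import re

def remove_url(text: str) -> str:
    # Drop http(s):// and www-style URLs up to the next whitespace.
    return re.sub(r"(https?://\S+|www\.\S+)", "", text)

def remove_html_tags(text: str) -> str:
    # Drop anything that looks like an HTML tag, e.g. <b> ... </b>.
    return re.sub(r"<[^>]+>", "", text)

text = "Hello! This is an <b>example</b> sentence. Visit https://example.com for more info!"
cleaned = remove_html_tags(remove_url(text))
print(cleaned)
```

Note that step order matters: removing punctuation before removing URLs would mangle the URL so the URL regex no longer matches, so URL/tag removal is best run early in a pipeline.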
Batch Processing
You can preprocess a batch of texts and view a summary:
texts = [
    "NLP preprocessing includes tokenization, lemmatization, and stemming.",
    "Special characters like @, $, %, &, should be removed!",
    "Spelling erorrs in this sentense should be fixed.",
]
tpt.head(texts, n=3)
This will display a table (in Jupyter or IPython environments) with the following columns:
- Original Text
- Processed Text
- Word Count
- Character Count
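A summary table like this can be approximated independently of the package (this sketch uses plain Python and a placeholder `process` function standing in for the real preprocessing):

```python
def summarize(texts, process=lambda t: t.lower()):
    # Build one row per text: original, processed, word count, character count.
    rows = []
    for original in texts:
        processed = process(original)
        rows.append({
            "Original Text": original,
            "Processed Text": processed,
            "Word Count": len(processed.split()),
            "Character Count": len(processed),
        })
    return rows

rows = summarize(["Hello World!", "NLP is fun."])
for row in rows:
    print(row)
```

In the package itself, `head` renders these rows as a styled table in Jupyter/IPython rather than printing dictionaries.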
Modular Methods
You can also use individual methods for specific preprocessing tasks:
text = "Check for spelling erorrs in this sentense."
print("Tokenized:", tpt.tokenize(text))
print("Spell-corrected:", tpt.correct_spellings(text))
print("Lemmatized:", tpt.lemmatize_text(text))
Class Documentation
TextPreprocessor
Initialization:
TextPreprocessor(custom_stopwords: Optional[List[str]] = None)
custom_stopwords: (Optional) A list of additional stopwords to remove.
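Conceptually, custom stopwords are merged with a base stopword set before filtering. This is an illustrative sketch (the base set here is a tiny stand-in for NLTK's English stopword list):

```python
# Tiny stand-in for the NLTK English stopword list.
BASE_STOPWORDS = {"a", "an", "the", "is", "this"}

def build_stopwords(custom_stopwords=None):
    # Union the base set with any user-supplied words, case-insensitively.
    stopwords = set(BASE_STOPWORDS)
    if custom_stopwords:
        stopwords.update(w.lower() for w in custom_stopwords)
    return stopwords

def remove_stopwords(tokens, stopwords):
    return [t for t in tokens if t.lower() not in stopwords]

sw = build_stopwords(custom_stopwords=["example", "test"])
filtered = remove_stopwords(["this", "is", "an", "example", "sentence"], sw)
print(filtered)  # ['sentence']
```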
Methods:
- `preprocess(text: str, steps: Optional[List[str]] = None) -> str`: Preprocesses the input text according to the specified pipeline steps.
- `tokenize(text: str) -> List[str]`: Tokenizes text into words.
- `remove_punctuation(text: str) -> str`: Removes punctuation from the text.
- `remove_stopwords(tokens: List[str]) -> List[str]`: Removes stopwords from a tokenized list.
- `remove_special_characters(text: str) -> str`: Removes non-alphanumeric characters from the text.
- `correct_spellings(text: str) -> str`: Corrects misspellings in the text.
- `lemmatize_text(text: str) -> str`: Lemmatizes text using WordNet.
Logging
The package includes built-in logging for debugging and tracking progress. Logs are displayed for each preprocessing step completed.
Contributing
Contributions are welcome! To contribute:
- Fork the repository.
- Create a feature branch (`git checkout -b feature-branch`).
- Commit your changes (`git commit -m 'Add feature'`).
- Push to the branch (`git push origin feature-branch`).
- Open a pull request.
License
This project is licensed under the MIT License. See the LICENSE file for details.
Author
Developed by [Your Name]. Feel free to reach out for suggestions or collaboration!
Feedback
If you encounter any issues or have suggestions for improvement, please open an issue on GitHub or contact jaiswalgaurav863@gmail.com.
Download files
- Source distribution: tptk-0.0.8.tar.gz
- Built distribution: TPTK-0.0.8-py3-none-any.whl
File details
Details for the file tptk-0.0.8.tar.gz.
File metadata
- Download URL: tptk-0.0.8.tar.gz
- Upload date:
- Size: 5.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `1da05bb681234fbfa263c5a2a2bd025c7924b64a0a292eeb72608315ef0990dd` |
| MD5 | `c3136140e35c4a1be35d6a67e33a17a0` |
| BLAKE2b-256 | `93a6e7235e2adb2db391dd4e38b46cd461214d3508f6b557bfced17ad2f58ba7` |
File details
Details for the file TPTK-0.0.8-py3-none-any.whl.
File metadata
- Download URL: TPTK-0.0.8-py3-none-any.whl
- Upload date:
- Size: 5.1 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.0.1 CPython/3.8.10
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `4aff156a39279e2dab3cc33f66c358cef98db3165a9bc969870599541b31c344` |
| MD5 | `05e42f80cd949934af8988d96fb1104d` |
| BLAKE2b-256 | `51c0ed15bb16ee893308eca872b7d0a3300e97995a5a110ff25dee1d2f01a312` |