NLPToolkitX
A simple, configurable NLP preprocessing toolkit.
A lightweight yet powerful Natural Language Processing (NLP) preprocessing toolkit with configurable options for tokenization, lemmatization, negation scope handling, slang expansion, emoji demojization, and more. Designed for quick integration into ML/NLP pipelines. Includes optional GPU acceleration with PyTorch for certain operations.
Installation
pip install NLPToolkitX
Notes:
- First-time use will download required NLTK data packages automatically.
- If you want GPU acceleration for vectorization and encoding, install with: pip install NLPToolkitX[torch]
Quick Start
from NLPToolkitX import (
    PreprocessConfig,
    process_text,
    process_dataframe,
    validate_config,
    contractions_source,
    has_torch,
    build_vocab,
    texts_to_sequences,
    pad_sequences,
    label_encode,
    one_hot_encode,
)  # import as needed; all shown for demo purposes
cfg = PreprocessConfig(
    lowercase=True,
    strip_html=True,
    urls="remove",          # keep | remove | mask
    mentions="mask",        # keep | remove | mask
    hashtags="split",       # keep | remove | split
    numbers="mask",         # keep | remove | mask → replaces digits with "NUM"
    emojis="demojize",      # keep | remove | demojize
    contractions=True,      # expand "don't" → "do not"
    accents=True,
    repeats_to=2,           # collapse repeated characters: soooo → soo
    punctuation="remove",   # keep | remove | space
    tokenize="smart",       # simple | smart
    stopwords=None,         # use default list if None
    negation_scope=True,    # adds _NEG after not/never/no up to a short scope
    lemmatize=True,
    stem=False,
    slang_dict={            # optional: inline slang mapping
        "idk": "i do not know",
        "brb": "be right back",
        "imo": "in my opinion",
    },
)
text = "BRB, idk what's going on 😂! Check this out: https://example.com @Kazuma-sama #ExplosionMagic I won't say I'm not impressed!!! 100%"
processed = process_text(text, cfg)
print(processed)
Output Example:
['brb', 'not', 'know_NEG', 'going_NEG', 'facewithtearsofjoy_NEG', 'check', 'usersama', 'explosion', 'magic', 'not', 'say_NEG', 'not', 'impressed_NEG', 'num_NEG']
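The _NEG markers above come from negation-scope handling. To illustrate the idea only (a plain-Python sketch, not the library's actual implementation), tokens following a negator get a suffix until a short scope runs out:

```python
NEGATORS = {"not", "no", "never"}

def mark_negation(tokens, scope=3):
    """Append _NEG to up to `scope` tokens following a negator."""
    out, remaining = [], 0
    for tok in tokens:
        if tok in NEGATORS:
            out.append(tok)
            remaining = scope  # open a new negation scope
        elif remaining > 0:
            out.append(tok + "_NEG")
            remaining -= 1
        else:
            out.append(tok)
    return out

print(mark_negation(["i", "do", "not", "know", "what", "is", "going", "on"]))
# ['i', 'do', 'not', 'know_NEG', 'what_NEG', 'is_NEG', 'going', 'on']
```

The real negation_scope option may use a different scope length or terminate the scope at punctuation; this sketch only shows the general pattern.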
Configuration Options
| Parameter | Type | Description |
|---|---|---|
| contractions_source | str | 'pypi' or 'local' source for contractions expansion. |
| tokenize | str | 'simple' or 'smart' tokenization. |
| negation_scope | bool | Add _NEG suffix to words in negation scope. |
| lemmatize | bool | Enable lemmatization (requires NLTK). |
| stem | bool | Enable stemming (requires NLTK). |
| urls | str | 'keep', 'remove', or 'mask' URLs. |
| mentions | str | 'keep', 'remove', or 'mask' mentions (@user). |
| hashtags | str | 'keep', 'remove', or 'split' to break a hashtag into words. |
| numbers | str | 'keep', 'mask', or 'remove' numbers. |
| emojis | str | 'keep', 'demojize', or 'remove' emojis. |
| punctuation | str | 'keep', 'remove', or 'space' punctuation. |
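As an example of one of these options, hashtags='split' breaks a CamelCase hashtag like #ExplosionMagic into separate words. A minimal sketch of that kind of splitting (illustrative only; the library may use a different heuristic):

```python
import re

def split_hashtag(tag):
    """Split '#ExplosionMagic' into lowercase words on CamelCase boundaries."""
    body = tag.lstrip("#")
    words = re.findall(r"[A-Z][a-z0-9]*|[a-z0-9]+", body)
    return [w.lower() for w in words]

print(split_hashtag("#ExplosionMagic"))  # ['explosion', 'magic']
```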
DataFrame Example
You can process multiple rows at once:
import pandas as pd
texts = [
    "Not ever say never",
    "Numbers like 123 are masked",
    "Laughing Loud soo good",
]
df = pd.DataFrame({'text': texts})
df['tokens'] = df['text'].apply(lambda x: process_text(x, cfg))
print(df)
Using Vectorization
from NLPToolkitX import vectorize_texts
corpus = [
    "Explosion magic is the best magic",
    "Kazuma-sama is amazing",
]
vectors, vocab = vectorize_texts(corpus)
print(vectors.shape)
print(vocab)
Note:
If PyTorch is installed, vectorize_texts can run on GPU for faster processing. Otherwise, it will run on CPU.
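The build_vocab, texts_to_sequences, and pad_sequences helpers imported in Quick Start follow the usual vocabulary/sequence pattern. A self-contained sketch of how such helpers typically behave (plain-Python stand-ins for demonstration, not the library's actual code):

```python
def build_vocab(token_lists, pad="<pad>", unk="<unk>"):
    """Map tokens to integer ids, reserving 0 for padding and 1 for unknowns."""
    vocab = {pad: 0, unk: 1}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

def texts_to_sequences(token_lists, vocab):
    """Replace each token with its id, falling back to the unknown id."""
    return [[vocab.get(t, vocab["<unk>"]) for t in tokens] for tokens in token_lists]

def pad_sequences(seqs, maxlen):
    """Right-pad (or truncate) every sequence to exactly maxlen ids."""
    return [seq[:maxlen] + [0] * (maxlen - len(seq[:maxlen])) for seq in seqs]

corpus = [["explosion", "magic"], ["kazuma", "sama", "is", "amazing"]]
vocab = build_vocab(corpus)
padded = pad_sequences(texts_to_sequences(corpus, vocab), maxlen=3)
print(padded)  # [[2, 3, 0], [4, 5, 6]]
```

The library's versions may differ in padding id, unknown handling, or truncation side; treat this as the general shape of the API, not its exact behavior.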
Custom Slang Dictionary
You can load your own slang mappings:
from NLPToolkitX import load_slang_dictionary
load_slang_dictionary("slang.txt") # one slang mapping per line: word=replacement
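Assuming the word=replacement format above, parsing such a file reduces to splitting each line on the first '='. A minimal sketch of a parser for that format (illustrative only, not the library's loader):

```python
def parse_slang_lines(lines):
    """Parse 'word=replacement' lines into a dict, skipping blanks and malformed lines."""
    slang = {}
    for line in lines:
        line = line.strip()
        if not line or "=" not in line:
            continue
        word, replacement = line.split("=", 1)
        slang[word.strip()] = replacement.strip()
    return slang

def parse_slang_file(path):
    with open(path, encoding="utf-8") as fh:
        return parse_slang_lines(fh)

print(parse_slang_lines(["idk=i do not know", "brb=be right back"]))
# {'idk': 'i do not know', 'brb': 'be right back'}
```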
Optional Dependencies & Warnings
If PyTorch (torch) is not installed, certain functions like label_encode and one_hot_encode will fall back to slower CPU-based processing.
When falling back, the system will display a tip:
[Tip] Install torch for faster GPU-accelerated encoding: pip install NLPToolkitX[torch]
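When torch is absent, encoders like these fall back to plain-Python logic. A minimal CPU sketch of label encoding and one-hot encoding (illustrative stand-ins with the same names, not the library's actual label_encode / one_hot_encode):

```python
def label_encode(labels):
    """Map each distinct label to an integer id, in order of first appearance."""
    mapping = {}
    ids = []
    for lab in labels:
        if lab not in mapping:
            mapping[lab] = len(mapping)
        ids.append(mapping[lab])
    return ids, mapping

def one_hot_encode(ids, num_classes):
    """Turn integer ids into one-hot rows as plain lists."""
    return [[1 if j == i else 0 for j in range(num_classes)] for i in ids]

ids, mapping = label_encode(["pos", "neg", "pos"])
print(one_hot_encode(ids, len(mapping)))  # [[1, 0], [0, 1], [1, 0]]
```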
Troubleshooting
- Negation scope markers (_NEG) are intentional for better sentiment/context detection.
- Masked numbers/URLs appear as num or url in tokens.
- Windows users: if you see CRLF warnings in Git, run: git config core.autocrlf true
Performance Tips
- Reuse the same PreprocessConfig instance for speed.
- Use batch processing for large datasets.
- Masking instead of removing can help preserve sentence structure.
- Install PyTorch for GPU acceleration.
License
MIT License. See LICENSE file for details.
Download files
Source Distribution
Built Distribution
File details
Details for the file nlptoolkitx-0.1.1.tar.gz.
File metadata
- Download URL: nlptoolkitx-0.1.1.tar.gz
- Upload date:
- Size: 12.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 8b1fa7591767d6ef707f29cab11a6271f432f25823c2a5dc68ba86b99a255949 |
| MD5 | dec887fbcdf596e737fde610639ad217 |
| BLAKE2b-256 | 2f54bf655813aa227ddf4e9ddf78611f02ad04ab1c97c5d6de048a0851e38677 |
File details
Details for the file nlptoolkitx-0.1.1-py3-none-any.whl.
File metadata
- Download URL: nlptoolkitx-0.1.1-py3-none-any.whl
- Upload date:
- Size: 13.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.10.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 299b6c4ddcb45e1bd4fe108dd99ffc1458857f4a8433ffcbf4e64adf30c538b9 |
| MD5 | 9ea714a9c59bd475a176b08c26501aa9 |
| BLAKE2b-256 | cf2864551239fa0b2dd3b5fcfc801b29b635c0fe21b393f05b49b0ed30153463 |