
NLPToolkitX

A simple, configurable NLP preprocessing toolkit.

A lightweight yet powerful Natural Language Processing (NLP) preprocessing toolkit with configurable options for tokenization, lemmatization, negation scope handling, slang expansion, emoji demojization, and more. Designed for quick integration into ML/NLP pipelines. Includes optional GPU acceleration with PyTorch for certain operations.


Installation

pip install NLPToolkitX

Notes:

  • First-time use will download required NLTK data packages automatically.

  • If you want GPU acceleration for vectorization and encoding, install with:

    pip install NLPToolkitX[torch]
    

Quick Start

from NLPToolkitX import (
    PreprocessConfig,
    process_text,
    process_dataframe,
    validate_config,
    contractions_source,
    has_torch,
    build_vocab,
    texts_to_sequences,
    pad_sequences,
    label_encode,
    one_hot_encode,
)                           # import only what you need; all public names shown for reference

cfg = PreprocessConfig(
    lowercase=True,
    strip_html=True,
    urls="remove",           # keep | remove | mask
    mentions="mask",         # keep | remove | mask
    hashtags="split",        # keep | remove | split
    numbers="mask",          # keep | remove | mask  → replaces digits with "NUM"
    emojis="demojize",       # keep | remove | demojize
    contractions=True,        # expand "don't" → "do not"
    accents=True,
    repeats_to=2,             # collapse repeated chars: soooo → soo, cooool → cool
    punctuation="remove",    # keep | remove | space
    tokenize="smart",        # simple | smart
    stopwords=None,           # use default list if None
    negation_scope=True,      # adds _NEG after not/never/no up to short scope
    lemmatize=True,
    stem=False,
    slang_dict={              # optional: inline slang mapping
        "idk": "i do not know",
        "brb": "be right back",
        "imo": "in my opinion",
    },
)

text = "BRB, idk what's going on 😂! Check this out: https://example.com @Kazuma-sama #ExplosionMagic I won't say I'm not impressed!!! 100%"

processed = process_text(text, cfg)
print(processed)

Output Example:

['brb', 'not', 'know_NEG', 'going_NEG', 'facewithtearsofjoy_NEG', 'check', 'usersama', 'explosion', 'magic', 'not', 'say_NEG', 'not', 'impressed_NEG', 'num_NEG']
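The `_NEG` markers in the output come from negation scope handling. The idea can be illustrated with a minimal standalone sketch (this is not the library's implementation; the trigger words and scope length of 3 here are assumptions):

```python
NEGATORS = {"not", "no", "never"}

def mark_negation(tokens, scope=3):
    """Append _NEG to tokens that fall inside a negation scope."""
    out, remaining = [], 0
    for tok in tokens:
        if tok in NEGATORS:
            out.append(tok)
            remaining = scope      # open a new scope after the negator
        elif remaining > 0:
            out.append(tok + "_NEG")
            remaining -= 1
        else:
            out.append(tok)
    return out

print(mark_negation(["i", "will", "not", "say", "i", "am", "impressed"]))
# → ['i', 'will', 'not', 'say_NEG', 'i_NEG', 'am_NEG', 'impressed']
```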

Configuration Options

Parameter            Type  Description
contractions_source  str   'pypi' or 'local' source for contraction expansion.
tokenize             str   'simple' or 'smart' tokenization.
negation_scope       bool  Add a _NEG suffix to words inside a negation scope.
lemmatize            bool  Enable lemmatization (requires NLTK).
stem                 bool  Enable stemming (requires NLTK).
urls                 str   'keep', 'remove', or 'mask' URLs.
mentions             str   'keep', 'remove', or 'mask' mentions (@user).
hashtags             str   'keep', 'remove', or 'split' (break a hashtag into words).
numbers              str   'keep', 'remove', or 'mask' numbers ('mask' → NUM).
emojis               str   'keep', 'remove', or 'demojize'.
punctuation          str   'keep', 'remove', or 'space'.
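For example, the 'split' option for hashtags breaks a camel-cased hashtag into words. A rough sketch of that behaviour (a simple regex heuristic, not necessarily the library's exact rule):

```python
import re

def split_hashtag(tag):
    """Split a #CamelCase hashtag into lowercase words."""
    words = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", tag.lstrip("#"))
    return [w.lower() for w in words]

print(split_hashtag("#ExplosionMagic"))  # → ['explosion', 'magic']
```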

DataFrame Example

You can process multiple rows at once:

import pandas as pd

texts = [
    "Not ever say never",
    "Numbers like 123 are masked",
    "Laughing Loud soo good"
]

df = pd.DataFrame({'text': texts})
df['tokens'] = df['text'].apply(lambda x: process_text(x, cfg))
print(df)

Using Vectorization

from NLPToolkitX import vectorize_texts

corpus = [
    "Explosion magic is the best magic",
    "Kazuma-sama is amazing"
]

vectors, vocab = vectorize_texts(corpus)
print(vectors.shape)
print(vocab)

Note: If PyTorch is installed, vectorize_texts can run on GPU for faster processing. Otherwise, it will run on CPU.
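The import list also exposes build_vocab, texts_to_sequences, and pad_sequences for turning tokens into fixed-length integer sequences. Their exact signatures aren't documented here, but the usual semantics can be sketched in plain Python (assumed behaviour: id 0 is reserved for padding, out-of-vocabulary tokens are dropped):

```python
def build_vocab(token_lists):
    """Map each unique token to an integer id; 0 is reserved for padding."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

def texts_to_sequences(token_lists, vocab):
    """Replace tokens with their ids, dropping unknown tokens."""
    return [[vocab[t] for t in tokens if t in vocab] for tokens in token_lists]

def pad_sequences(seqs, maxlen):
    """Right-pad (or truncate) every sequence to maxlen with 0s."""
    return [seq[:maxlen] + [0] * (maxlen - len(seq)) for seq in seqs]

corpus = [["explosion", "magic", "best", "magic"], ["kazuma", "amazing"]]
vocab = build_vocab(corpus)
seqs = texts_to_sequences(corpus, vocab)
print(pad_sequences(seqs, maxlen=4))  # → [[1, 2, 3, 2], [4, 5, 0, 0]]
```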


Custom Slang Dictionary

You can load your own slang mappings:

from NLPToolkitX import load_slang_dictionary

load_slang_dictionary("slang.txt")  # one slang mapping per line: word=replacement
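The expected file format is one word=replacement pair per line. A sketch of a loader for that format (hypothetical helpers, not the library's code; the comment-skipping behaviour is an assumption):

```python
def parse_slang_lines(lines):
    """Turn 'word=replacement' lines into a dict, skipping blanks and comments."""
    slang = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        word, replacement = line.split("=", 1)
        slang[word.strip().lower()] = replacement.strip()
    return slang

def load_slang_file(path):
    with open(path, encoding="utf-8") as fh:
        return parse_slang_lines(fh)

print(parse_slang_lines(["idk=i do not know", "brb = be right back"]))
# → {'idk': 'i do not know', 'brb': 'be right back'}
```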

Optional Dependencies & Warnings

If PyTorch (torch) is not installed, functions such as label_encode and one_hot_encode fall back to slower CPU-based processing. When falling back, the library prints a tip:

[Tip] Install torch for faster GPU-accelerated encoding: pip install NLPToolkitX[torch]
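What these encoders compute can be shown in a few lines of plain Python (CPU path only; the names mirror the library's but the implementations below are illustrative, and the first-seen ordering is an assumption):

```python
def label_encode(labels):
    """Map each unique label to an integer id, in first-seen order."""
    mapping, ids = {}, []
    for lab in labels:
        ids.append(mapping.setdefault(lab, len(mapping)))
    return ids, mapping

def one_hot_encode(labels):
    """One row per label, with a single 1 in the column for that label's id."""
    ids, mapping = label_encode(labels)
    return [[1 if j == i else 0 for j in range(len(mapping))] for i in ids]

print(one_hot_encode(["pos", "neg", "pos"]))  # → [[1, 0], [0, 1], [1, 0]]
```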

Troubleshooting

  • Negation scope markers (_NEG) are intentional for better sentiment/context detection.

  • Masked numbers/URLs appear as num or url in tokens.

  • Windows users: If you see CRLF warnings in Git, run:

    git config core.autocrlf true
    

Performance Tips

  • Reuse the same PreprocessConfig instance for speed.
  • Use batch processing for large datasets.
  • Masking instead of removing can help preserve sentence structure.
  • Install PyTorch for GPU acceleration.

License

MIT License. See LICENSE file for details.
