
NLPToolkitX

A simple, configurable NLP preprocessing toolkit.

A lightweight yet powerful Natural Language Processing (NLP) preprocessing toolkit with configurable options for tokenization, lemmatization, negation scope handling, slang expansion, emoji demojization, and more. Designed for quick integration into ML/NLP pipelines. Includes optional GPU acceleration with PyTorch for certain operations.


Installation

pip install NLPToolkitX

Notes:

  • First-time use will download required NLTK data packages automatically.

  • If you want GPU acceleration for vectorization and encoding, install with:

    pip install NLPToolkitX[torch]
    

Quick Start

from NLPToolkitX import (
    PreprocessConfig,
    process_text,
    process_dataframe,
    validate_config,
    contractions_source,
    has_torch,
    build_vocab,
    texts_to_sequences,
    pad_sequences,
    label_encode,
    one_hot_encode,
)                           # import only what you need; all public names shown for reference

cfg = PreprocessConfig(
    lowercase=True,
    strip_html=True,
    urls="remove",           # keep | remove | mask
    mentions="mask",         # keep | remove | mask
    hashtags="split",        # keep | remove | split
    numbers="mask",          # keep | remove | mask  → replaces digits with "NUM"
    emojis="demojize",       # keep | remove | demojize
    contractions=True,        # expand "don't" → "do not"
    accents=True,
    repeats_to=2,             # collapse repeated chars: soooo → soo, cooool → cool
    punctuation="remove",    # keep | remove | space
    tokenize="smart",        # simple | smart
    stopwords=None,           # use default list if None
    negation_scope=True,      # adds _NEG after not/never/no up to short scope
    lemmatize=True,
    stem=False,
    slang_dict={              # optional: inline slang mapping
        "idk": "i do not know",
        "brb": "be right back",
        "imo": "in my opinion",
    },
)

text = "BRB, idk what's going on 😂! Check this out: https://example.com @Kazuma-sama #ExplosionMagic I won't say I'm not impressed!!! 100%"

processed = process_text(text, cfg)
print(processed)

Output Example:

['brb', 'not', 'know_NEG', 'going_NEG', 'facewithtearsofjoy_NEG', 'check', 'usersama', 'explosion', 'magic', 'not', 'say_NEG', 'not', 'impressed_NEG', 'num_NEG']
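The `_NEG` markers in the output come from negation scope handling. The idea can be illustrated with a minimal standalone sketch (this is not the library's implementation; the trigger words and scope length of 3 here are assumptions):

```python
NEGATORS = {"not", "no", "never"}

def mark_negation(tokens, scope=3):
    """Append _NEG to tokens that fall inside a negation scope."""
    out, remaining = [], 0
    for tok in tokens:
        if tok in NEGATORS:
            out.append(tok)
            remaining = scope      # open a new scope after the negator
        elif remaining > 0:
            out.append(tok + "_NEG")
            remaining -= 1
        else:
            out.append(tok)
    return out

print(mark_negation(["i", "will", "not", "say", "i", "am", "impressed"]))
# → ['i', 'will', 'not', 'say_NEG', 'i_NEG', 'am_NEG', 'impressed']
```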

Configuration Options

Parameter            Type  Description
contractions_source  str   'pypi' or 'local' source for contraction expansion.
tokenize             str   'simple' or 'smart' tokenization.
negation_scope       bool  Add a _NEG suffix to words inside a negation scope.
lemmatize            bool  Enable lemmatization (requires NLTK).
stem                 bool  Enable stemming (requires NLTK).
urls                 str   'keep', 'remove', or 'mask' URLs.
mentions             str   'keep', 'remove', or 'mask' mentions (@user).
hashtags             str   'keep', 'remove', or 'split' (break a hashtag into words).
numbers              str   'keep', 'remove', or 'mask' numbers ('mask' → NUM).
emojis               str   'keep', 'remove', or 'demojize'.
punctuation          str   'keep', 'remove', or 'space'.
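For example, the 'split' option for hashtags breaks a camel-cased hashtag into words. A rough sketch of that behaviour (a simple regex heuristic, not necessarily the library's exact rule):

```python
import re

def split_hashtag(tag):
    """Split a #CamelCase hashtag into lowercase words."""
    words = re.findall(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", tag.lstrip("#"))
    return [w.lower() for w in words]

print(split_hashtag("#ExplosionMagic"))  # → ['explosion', 'magic']
```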

DataFrame Example

You can process multiple rows at once:

import pandas as pd

texts = [
    "Not ever say never",
    "Numbers like 123 are masked",
    "Laughing Loud soo good"
]

df = pd.DataFrame({'text': texts})
df['tokens'] = df['text'].apply(lambda x: process_text(x, cfg))
print(df)

Using Vectorization

from NLPToolkitX import vectorize_texts

corpus = [
    "Explosion magic is the best magic",
    "Kazuma-sama is amazing"
]

vectors, vocab = vectorize_texts(corpus)
print(vectors.shape)
print(vocab)

Note: If PyTorch is installed, vectorize_texts can run on GPU for faster processing. Otherwise, it will run on CPU.
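The import list also exposes build_vocab, texts_to_sequences, and pad_sequences for turning tokens into fixed-length integer sequences. Their exact signatures aren't documented here, but the usual semantics can be sketched in plain Python (assumed behaviour: id 0 is reserved for padding, out-of-vocabulary tokens are dropped):

```python
def build_vocab(token_lists):
    """Map each unique token to an integer id; 0 is reserved for padding."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab

def texts_to_sequences(token_lists, vocab):
    """Replace tokens with their ids, dropping unknown tokens."""
    return [[vocab[t] for t in tokens if t in vocab] for tokens in token_lists]

def pad_sequences(seqs, maxlen):
    """Right-pad (or truncate) every sequence to maxlen with 0s."""
    return [seq[:maxlen] + [0] * (maxlen - len(seq)) for seq in seqs]

corpus = [["explosion", "magic", "best", "magic"], ["kazuma", "amazing"]]
vocab = build_vocab(corpus)
seqs = texts_to_sequences(corpus, vocab)
print(pad_sequences(seqs, maxlen=4))  # → [[1, 2, 3, 2], [4, 5, 0, 0]]
```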


Custom Slang Dictionary

You can load your own slang mappings:

from NLPToolkitX import load_slang_dictionary

load_slang_dictionary("slang.txt")  # one slang mapping per line: word=replacement
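The expected file format is one word=replacement pair per line. A sketch of a loader for that format (hypothetical helpers, not the library's code; the comment-skipping behaviour is an assumption):

```python
def parse_slang_lines(lines):
    """Turn 'word=replacement' lines into a dict, skipping blanks and comments."""
    slang = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        word, replacement = line.split("=", 1)
        slang[word.strip().lower()] = replacement.strip()
    return slang

def load_slang_file(path):
    with open(path, encoding="utf-8") as fh:
        return parse_slang_lines(fh)

print(parse_slang_lines(["idk=i do not know", "brb = be right back"]))
# → {'idk': 'i do not know', 'brb': 'be right back'}
```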

Optional Dependencies & Warnings

If PyTorch (torch) is not installed, functions such as label_encode and one_hot_encode fall back to slower CPU-based processing. When falling back, the library prints a tip:

[Tip] Install torch for faster GPU-accelerated encoding: pip install NLPToolkitX[torch]
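What these encoders compute can be shown in a few lines of plain Python (CPU path only; the names mirror the library's but the implementations below are illustrative, and the first-seen ordering is an assumption):

```python
def label_encode(labels):
    """Map each unique label to an integer id, in first-seen order."""
    mapping, ids = {}, []
    for lab in labels:
        ids.append(mapping.setdefault(lab, len(mapping)))
    return ids, mapping

def one_hot_encode(labels):
    """One row per label, with a single 1 in the column for that label's id."""
    ids, mapping = label_encode(labels)
    return [[1 if j == i else 0 for j in range(len(mapping))] for i in ids]

print(one_hot_encode(["pos", "neg", "pos"]))  # → [[1, 0], [0, 1], [1, 0]]
```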

Troubleshooting

  • Negation scope markers (_NEG) are intentional for better sentiment/context detection.

  • Masked numbers/URLs appear as num or url in tokens.

  • Windows users: If you see CRLF warnings in Git, run:

    git config core.autocrlf true
    

Performance Tips

  • Reuse the same PreprocessConfig instance for speed.
  • Use batch processing for large datasets.
  • Masking instead of removing can help preserve sentence structure.
  • Install PyTorch for GPU acceleration.

License

MIT License. See LICENSE file for details.
