nahiarhdNLP

Advanced Indonesian Natural Language Processing Library

nahiarhdNLP is an advanced Python library for Indonesian Natural Language Processing (NLP), providing easy-to-use tools for text preprocessing, normalization, tokenization, stemming, spell correction, and customizable pipelines.


Installation

pip install nahiarhdNLP

Features

  • Preprocessing: Clean text by removing HTML tags, URLs, stopwords, slang, emoji, mentions, hashtags, numbers, punctuation, extra spaces, and special characters.
  • Tokenization: Split sentences into tokens/words (see the sketch after this list).
  • Stemming: Convert words to their root form (using Sastrawi).
  • Spell Correction: Automatic spelling correction.
  • Pipeline: Chain multiple preprocessing functions easily.
  • Normalization: Replace slang, emoji, and informal words with formal equivalents.
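
The tokenizer's exported function name isn't shown on this page, so as a rough illustration of the tokenization step, here is a minimal regex-based word tokenizer in plain Python (a sketch of the idea, not nahiarhdNLP's actual API):

import re
from typing import List

def simple_tokenize(text: str) -> List[str]:
    # Illustrative sketch only: words become tokens, punctuation stays separate.
    return re.findall(r"\w+|[^\w\s]", text)

print(simple_tokenize("siapa yg nanya?"))  # ['siapa', 'yg', 'nanya', '?']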

Quick Usage Example

Basic Preprocessing

from nahiarhdNLP import preprocessing

text = "Halooo emg siapa yg nanya? 😀 <a href='#'>link</a> @user #trending 123"
cleaned = preprocessing.cleaning.text_cleaner.clean_text(text)
print(cleaned)  # the cleaning steps listed under Features are applied

Custom Preprocessing Pipeline

from nahiarhdNLP.preprocessing import (
    pipeline, remove_html, remove_url, remove_mentions, remove_hashtags,
    remove_numbers, replace_word_elongation, emoji_to_words, replace_slang,
    remove_stopwords, remove_punctuation, remove_extra_spaces, to_lowercase
)

custom_pipe = pipeline([
    remove_html, remove_url, remove_mentions, remove_hashtags, remove_numbers,
    replace_word_elongation, emoji_to_words, replace_slang, remove_stopwords,
    remove_punctuation, remove_extra_spaces, to_lowercase
])

result = custom_pipe("Halooo emg siapa yg nanya? 😀 <a href='#'>link</a> @user #trending 123")
print(result)
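
How pipeline chains these steps isn't spelled out here; presumably each function is applied to the text in order, left to right. A minimal sketch of that assumed composition behavior:

from typing import Callable, Iterable

def compose_steps(steps: Iterable[Callable[[str], str]]) -> Callable[[str], str]:
    # Assumed behavior: each step transforms the text in turn.
    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text
    return run

demo = compose_steps([str.strip, str.lower])
print(demo("  Halo Dunia  "))  # "halo dunia"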

Spell Correction

from nahiarhdNLP.preprocessing import correct_spelling
print(correct_spelling("sya suka mkn nasi"))  # "saya suka makan nasi"
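
The correction algorithm itself isn't documented on this page. As a concept illustration only, a naive dictionary lookup reproduces the example above (the word list below is made up for this sketch):

# Illustrative only; not the library's actual method.
CORRECTIONS = {"sya": "saya", "mkn": "makan"}

def naive_correct(text: str) -> str:
    return " ".join(CORRECTIONS.get(word, word) for word in text.split())

print(naive_correct("sya suka mkn nasi"))  # "saya suka makan nasi"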

Stemming

from nahiarhdNLP.preprocessing import stem_text
print(stem_text("bermain-main dengan senang"))  # "main dengan senang"
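
Because stemming is backed by Sastrawi (see Features), the same result should be reproducible with PySastrawi's factory API directly:

from Sastrawi.Stemmer.StemmerFactory import StemmerFactory

factory = StemmerFactory()
stemmer = factory.create_stemmer()
print(stemmer.stem("bermain-main dengan senang"))  # expected: "main dengan senang"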

Requirements

  • Python 3.7+
  • pandas, fsspec, huggingface_hub, sastrawi, datasets, rich

Testing

pytest tests/

Directory Structure

nahiarhdNLP/
├── main.py
├── requirements.txt
├── README.md
├── src/
│   ├── preprocessing/
│   └── mydatasets/
└── tests/

Contribution

Contributions are welcome! Please fork the repository, create a new branch, and submit a pull request.


License

MIT License


Acknowledgments

  • Stopwords dataset from HuggingFace
  • Emoji dataset from HuggingFace
  • Slang dataset from HuggingFace
  • Sastrawi for Indonesian stemming
