Advanced Indonesian Natural Language Processing Library
nahiarhdNLP
nahiarhdNLP is an advanced Python library for Indonesian Natural Language Processing (NLP), providing easy-to-use tools for text preprocessing, normalization, tokenization, stemming, spell correction, and customizable pipelines.
Installation
```bash
pip install nahiarhdNLP
```
Features
- Preprocessing: Clean text from HTML, URLs, stopwords, slang, emoji, mentions, hashtags, numbers, punctuation, extra spaces, and special characters.
- Tokenization: Split sentences into tokens/words.
- Stemming: Convert words to their root form (using Sastrawi).
- Spell Correction: Automatic spelling correction.
- Pipeline: Chain multiple preprocessing functions easily.
- Normalization: Replace slang, emoji, and informal words with formal equivalents.
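The tokenization feature has no example below, so here is a rough illustration of what splitting a sentence into tokens involves. This is a plain regex sketch, not the library's tokenizer:

```python
import re

def tokenize_sketch(sentence):
    # Lowercase, then pull out runs of word characters; a real Indonesian
    # tokenizer would also handle clitics, hyphenation, and punctuation rules.
    return re.findall(r"\w+", sentence.lower())

print(tokenize_sketch("Saya suka makan nasi!"))  # ['saya', 'suka', 'makan', 'nasi']
```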
Quick Usage Example
Basic Preprocessing
```python
from nahiarhdNLP import preprocessing

text = "Halooo emg siapa yg nanya? 😀 <a href='#'>link</a> @user #trending 123"
cleaned = preprocessing.cleaning.text_cleaner.clean_text(text)
print(cleaned)
```
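The exact behavior of `clean_text` depends on the library's configuration; as a rough sketch of the kind of cleanup involved (the regexes below are illustrative, not the library's actual rules):

```python
import re

def clean_text_sketch(text):
    """Illustrative cleanup only; not the library's implementation."""
    text = re.sub(r"<[^>]+>", " ", text)       # strip HTML tags
    text = re.sub(r"https?://\S+", " ", text)  # strip URLs
    text = re.sub(r"[@#]\w+", " ", text)       # strip mentions and hashtags
    text = re.sub(r"\d+", " ", text)           # strip numbers
    return " ".join(text.split())              # collapse extra spaces

print(clean_text_sketch("Halo <a href='#'>link</a> @user #trending 123"))  # Halo link
```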
Custom Preprocessing Pipeline
```python
from nahiarhdNLP.preprocessing import (
    pipeline, remove_html, remove_url, remove_mentions, remove_hashtags,
    remove_numbers, replace_word_elongation, emoji_to_words, replace_slang,
    remove_stopwords, remove_punctuation, remove_extra_spaces, to_lowercase
)

custom_pipe = pipeline([
    remove_html, remove_url, remove_mentions, remove_hashtags, remove_numbers,
    replace_word_elongation, emoji_to_words, replace_slang, remove_stopwords,
    remove_punctuation, remove_extra_spaces, to_lowercase
])

result = custom_pipe("Halooo emg siapa yg nanya? 😀 <a href='#'>link</a> @user #trending 123")
print(result)
```
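Conceptually, a pipeline like this is left-to-right function composition: each step takes a string and returns a string. A minimal sketch of the idea (not the library's actual `pipeline` implementation):

```python
def make_pipeline(*steps):
    """Compose text-processing functions, applied left to right."""
    def run(text):
        for step in steps:
            text = step(text)
        return text
    return run

# Toy steps standing in for the library's preprocessing functions
lowercase = str.lower
collapse_spaces = lambda s: " ".join(s.split())

pipe = make_pipeline(lowercase, collapse_spaces)
print(pipe("  Halooo   DUNIA  "))  # "halooo dunia"
```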
Spell Correction
```python
from nahiarhdNLP.preprocessing import correct_spelling

print(correct_spelling("sya suka mkn nasi"))  # "saya suka makan nasi"
```
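The library's correction algorithm isn't documented here; a common approach is to look up edit-distance-1 candidates in a known vocabulary. A toy sketch with a hypothetical four-word vocabulary (transpositions and frequency ranking omitted for brevity):

```python
def edits1(word):
    """All strings one delete, insert, or replace away from `word`."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    inserts = [a + c + b for a, b in splits for c in letters]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    return set(deletes + inserts + replaces)

VOCAB = {"saya", "suka", "makan", "nasi"}  # toy vocabulary, not the library's data

def correct(word):
    if word in VOCAB:
        return word
    matches = VOCAB & edits1(word)
    return min(matches) if matches else word  # deterministic tie-break

print(" ".join(correct(w) for w in "sya suka makan nsi".split()))  # saya suka makan nasi
```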
Stemming
```python
from nahiarhdNLP.preprocessing import stem_text

print(stem_text("bermain-main dengan senang"))  # "main dengan senang"
```
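Under the hood the library delegates to Sastrawi's affix-stripping rules. As a very rough toy sketch of the idea (reduplication collapse plus a tiny prefix list, nowhere near Sastrawi's full rule set):

```python
def toy_stem(word):
    """Toy root-form reduction; Sastrawi's real rules are far richer."""
    # Collapse reduplicated forms: "bermain-main" -> "bermain"
    if "-" in word:
        head, tail = word.split("-", 1)
        if head.endswith(tail):
            word = head
    # Strip a few common Indonesian prefixes (illustrative list only)
    for prefix in ("ber", "me", "di", "ter"):
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            return word[len(prefix):]
    return word

print(" ".join(toy_stem(w) for w in "bermain-main dengan senang".split()))  # main dengan senang
```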
Requirements
- Python 3.7+
- pandas, fsspec, huggingface_hub, sastrawi, datasets, rich
Testing
```bash
pytest tests/
```
Directory Structure
```
nahiarhdNLP/
├── main.py
├── requirements.txt
├── README.md
├── src/
│   ├── preprocessing/
│   └── mydatasets/
└── tests/
```
Contribution
Contributions are welcome! Please fork the repository, create a new branch, and submit a pull request.
License
MIT License
Acknowledgments
- Stopwords dataset from Hugging Face
- Emoji dataset from Hugging Face
- Slang dataset from Hugging Face
- Sastrawi for Indonesian stemming
Project details
File details
Details for the file nahiarhdnlp-1.0.1.tar.gz.
File metadata
- Download URL: nahiarhdnlp-1.0.1.tar.gz
- Upload date:
- Size: 555.1 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 1e637ab001ac5c5ce7c865734fe4484d7da3761f768950f441efbf2d79ff8de9 |
| MD5 | 46954a4ad82670ce516487db426249db |
| BLAKE2b-256 | d986afddc908ca1396880c3086e8860d4ebeaaab5a69599ecfdfdfef7149be4b |
File details
Details for the file nahiarhdnlp-1.0.1-py3-none-any.whl.
File metadata
- Download URL: nahiarhdnlp-1.0.1-py3-none-any.whl
- Upload date:
- Size: 563.4 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | e1f4e9e31d5ddb86ffd00049e82e1f29c2d23f1f07fc6fe04a2e65a04bd377e5 |
| MD5 | f9bf48684609d378b75cf68dd5c77cbf |
| BLAKE2b-256 | f4538bb90233900b5937286f1ae2b9d3e9604253fb20645976106e2bca288328 |