nahiarhdNLP - Advanced Indonesian Natural Language Processing Library
An advanced Indonesian Natural Language Processing library with text preprocessing, slang normalization, emoji conversion, spelling correction, and more.
🚀 Installation
```bash
pip install nahiarhdNLP
```
📦 Import Library
```python
# Import the main package
import nahiarhdNLP

# Import the preprocessing module
from nahiarhdNLP import preprocessing

# Import the datasets module
from nahiarhdNLP import mydatasets

# Or import specific functions
from nahiarhdNLP.preprocessing import preprocess, remove_html, replace_slang
```
Usage Examples
1. 🎯 All-in-One Preprocess Function
```python
from nahiarhdNLP import preprocessing

# Complete preprocessing with a single function
teks = "Halooo emg siapa yg nanya? 😀"
hasil = preprocessing.preprocess(teks)
print(hasil)
# Output: "halo wajah_gembira"
```
2. 🧹 TextCleaner - Cleaning Text
```python
from nahiarhdNLP.preprocessing import TextCleaner

cleaner = TextCleaner()

# Remove URLs
url_text = "kunjungi https://google.com sekarang!"
clean_result = cleaner.clean_urls(url_text)
print(clean_result)
# Output: "kunjungi sekarang!"

# Remove mentions
mention_text = "Halo @user123 apa kabar?"
clean_result = cleaner.clean_mentions(mention_text)
print(clean_result)
# Output: "Halo apa kabar?"
```
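For intuition, URL and mention cleaning of this kind usually boils down to a pair of regex substitutions. A minimal standalone sketch (illustrative only, not the library's actual implementation):

```python
import re

def clean_urls(text: str) -> str:
    # Drop http(s):// and www. URLs, then collapse leftover double spaces.
    text = re.sub(r"(https?://\S+|www\.\S+)", "", text)
    return re.sub(r"\s{2,}", " ", text).strip()

def clean_mentions(text: str) -> str:
    # Drop @username tokens.
    text = re.sub(r"@\w+", "", text)
    return re.sub(r"\s{2,}", " ", text).strip()

print(clean_urls("kunjungi https://google.com sekarang!"))  # kunjungi sekarang!
print(clean_mentions("Halo @user123 apa kabar?"))           # Halo apa kabar?
```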
3. ✏️ SpellCorrector - Spelling Correction
```python
from nahiarhdNLP.preprocessing import SpellCorrector

spell = SpellCorrector()

# Correct a single word
word = "mencri"
corrected = spell.correct(word)
print(corrected)
# Output: "mencuri"

# Correct a sentence
sentence = "saya mencri informsi"
corrected = spell.correct_sentence(sentence)
print(corrected)
# Output: "saya mencuri informasi"
```
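The corrector's behavior can be approximated with fuzzy matching against a vocabulary. A toy sketch using Python's difflib (the vocabulary here is made up; the library ships its own Indonesian lexicon and correction algorithm):

```python
import difflib

# Tiny illustrative vocabulary; the real corrector uses a full Indonesian lexicon.
VOCAB = ["saya", "mencuri", "informasi", "suka", "makan"]

def correct(word: str) -> str:
    # Pick the closest known word; fall back to the input if nothing is close enough.
    matches = difflib.get_close_matches(word, VOCAB, n=1, cutoff=0.7)
    return matches[0] if matches else word

def correct_sentence(sentence: str) -> str:
    return " ".join(correct(w) for w in sentence.split())

print(correct_sentence("saya mencri informsi"))
# Output: "saya mencuri informasi"
```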
4. 🚫 StopwordRemover - Removing Stopwords
```python
from nahiarhdNLP.preprocessing import StopwordRemover

stopword = StopwordRemover()

# Remove stopwords
text = "saya suka makan nasi goreng"
result = stopword.remove_stopwords(text)
print(result)
# Output: "suka makan nasi goreng"
```
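Conceptually, stopword removal is a set-membership filter over tokens. A self-contained sketch with a hand-picked five-word stopword list (the library loads a full Indonesian list from its datasets):

```python
# Illustrative stopword excerpt, not the library's full list.
STOPWORDS = {"saya", "yang", "di", "ke", "dan"}

def remove_stopwords(text: str) -> str:
    # Keep only tokens that are not in the stopword set.
    return " ".join(w for w in text.split() if w.lower() not in STOPWORDS)

print(remove_stopwords("saya suka makan nasi goreng"))
# Output: "suka makan nasi goreng"
```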
5. 🔄 SlangNormalizer - Slang Normalization
```python
from nahiarhdNLP.preprocessing import SlangNormalizer

slang = SlangNormalizer()

# Normalize slang words
text = "gw lg di rmh"
result = slang.normalize(text)
print(result)
# Output: "saya lagi di rumah"
```
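Slang normalization is essentially a per-token dictionary lookup. A sketch with a hand-picked excerpt (the actual dictionary is loaded from the package datasets):

```python
# Small excerpt of a slang dictionary, for illustration only.
SLANG = {"gw": "saya", "lg": "lagi", "rmh": "rumah", "emg": "memang", "yg": "yang"}

def normalize(text: str) -> str:
    # Replace each token that has a dictionary entry; pass others through.
    return " ".join(SLANG.get(w.lower(), w) for w in text.split())

print(normalize("gw lg di rmh"))
# Output: "saya lagi di rumah"
```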
6. 😀 EmojiConverter - Emoji Conversion
```python
from nahiarhdNLP.preprocessing import EmojiConverter

emoji = EmojiConverter()

# Emoji to text
emoji_text = "😀 😂 😍"
text_result = emoji.emoji_to_text_convert(emoji_text)
print(text_result)
# Output: "wajah_gembira wajah_tertawa wajah_bercinta"

# Text to emoji
text = "wajah_gembira"
emoji_result = emoji.text_to_emoji_convert(text)
print(emoji_result)
# Output: "😀"
```
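Both directions of the conversion can be driven by a single mapping. A sketch with three entries mirroring the outputs above (the library's full emoji dictionary comes from its datasets):

```python
# Two-way mapping between emoji and Indonesian word labels (small excerpt).
EMOJI_TO_WORD = {"😀": "wajah_gembira", "😂": "wajah_tertawa", "😍": "wajah_bercinta"}
WORD_TO_EMOJI = {v: k for k, v in EMOJI_TO_WORD.items()}

def emoji_to_text(text: str) -> str:
    # Map each emoji token to its word label; pass unknown tokens through.
    return " ".join(EMOJI_TO_WORD.get(t, t) for t in text.split())

def text_to_emoji(text: str) -> str:
    return " ".join(WORD_TO_EMOJI.get(t, t) for t in text.split())

print(emoji_to_text("😀 😂 😍"))   # wajah_gembira wajah_tertawa wajah_bercinta
print(text_to_emoji("wajah_gembira"))  # 😀
```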
7. 🔪 Tokenizer - Tokenization
```python
from nahiarhdNLP.preprocessing import Tokenizer

tokenizer = Tokenizer()

# Tokenize text
text = "ini contoh tokenisasi"
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['ini', 'contoh', 'tokenisasi']
```
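A word-level tokenizer of this sort can be as small as one regex. A sketch (the library's tokenizer may handle punctuation and emoticons differently):

```python
import re

def tokenize(text: str) -> list:
    # Extract runs of word characters; punctuation is dropped.
    return re.findall(r"\w+", text)

print(tokenize("ini contoh tokenisasi"))
# Output: ['ini', 'contoh', 'tokenisasi']
```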
8. 🛠️ Individual Functions
```python
from nahiarhdNLP.preprocessing import (
    remove_html, remove_url, remove_mentions,
    replace_slang, emoji_to_words, correct_spelling,
)

# Remove HTML
html_text = "website <a href='https://google.com'>google</a>"
clean_text = remove_html(html_text)
print(clean_text)
# Output: "website google"

# Remove URLs
url_text = "kunjungi https://google.com sekarang!"
clean_text = remove_url(url_text)
print(clean_text)
# Output: "kunjungi sekarang!"

# Remove mentions
mention_text = "Halo @user123 apa kabar?"
clean_text = remove_mentions(mention_text)
print(clean_text)
# Output: "Halo apa kabar?"

# Normalize slang
slang_text = "emg siapa yg nanya?"
normal_text = replace_slang(slang_text)
print(normal_text)
# Output: "memang siapa yang bertanya?"

# Convert emoji
emoji_text = "😀 😂 😍"
text_result = emoji_to_words(emoji_text)
print(text_result)
# Output: "wajah_gembira wajah_tertawa wajah_bercinta"

# Correct spelling
spell_text = "saya mencri informsi"
corrected = correct_spelling(spell_text)
print(corrected)
# Output: "saya mencuri informasi"
```
9. 📊 Dataset Loader
```python
from nahiarhdNLP.mydatasets import DatasetLoader

loader = DatasetLoader()

# Load stopwords
stopwords = loader.load_stopwords_dataset()
print(f"Number of stopwords: {len(stopwords)}")

# Load the slang dictionary
slang_dict = loader.load_slang_dataset()
print(f"Number of slang entries: {len(slang_dict)}")

# Load the emoji dictionary
emoji_dict = loader.load_emoji_dataset()
print(f"Number of emoji: {len(emoji_dict)}")
```
10. 🔄 Custom Pipelines
```python
from nahiarhdNLP.preprocessing import pipeline, replace_word_elongation, replace_slang

# Build a custom pipeline
custom_pipeline = pipeline([
    replace_word_elongation,
    replace_slang,
])

# Run the pipeline
text = "Knp emg gk mw makan kenapaaa???"
result = custom_pipeline(text)
print(result)
# Output: "mengapa memang tidak mau makan mengapa???"
```
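Under the hood, a pipeline like this is just left-to-right function composition. A generic sketch (the toy steps below stand in for replace_word_elongation and replace_slang, which are not reimplemented here):

```python
from functools import reduce

def pipeline(steps):
    # Compose the steps so each one's output feeds the next.
    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), steps, text)
    return run

# Toy stand-ins for the real preprocessing functions:
collapse_elongation = lambda t: t.replace("aaa", "a")
lowercase = str.lower

custom = pipeline([collapse_elongation, lowercase])
print(custom("Kenapaaa???"))
# Output: "kenapa???"
```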
⚙️ Preprocess Parameters
The preprocess() function accepts these optional parameters:
```python
result = nahiarhdNLP.preprocessing.preprocess(
    text="Halooo emg siapa yg nanya? 😀",
    remove_html_tags=True,        # strip HTML tags
    remove_urls=True,             # strip URLs
    remove_stopwords_flag=True,   # remove stopwords
    replace_slang_flag=True,      # normalize slang
    replace_elongation=True,      # collapse word elongation
    convert_emoji=True,           # convert emoji
    correct_spelling_flag=False,  # spelling correction (slow)
    stem_text_flag=False,         # stemming
    to_lowercase=True,            # lowercase
)
```
🚨 Error Handling
```python
try:
    from nahiarhdNLP import preprocessing
    result = preprocessing.preprocess("test")
except ImportError:
    print("Package nahiarhdNLP is not installed")
    print("Install it with: pip install nahiarhdNLP")
except Exception as e:
    print(f"Error: {e}")
```
💡 Usage Tips
- For quick preprocessing: use `preprocess()` with its default parameters
- For full control: use the individual classes (`TextCleaner`, `SpellCorrector`, etc.)
- For customization: use `pipeline()` with the functions you want
- For spelling correction: enable `correct_spelling_flag=True` (slower)
- For stemming: enable `stem_text_flag=True` (requires installing Sastrawi)
- For best performance: datasets are cached automatically after the first download
- For development: fallback data is used if HuggingFace is down
⚡ Performance & Caching
nahiarhdNLP uses a caching system to speed up dataset loading:
- First run: the dataset is downloaded from HuggingFace (3-4 seconds)
- Subsequent runs: it is loaded from the local cache (0.003 seconds)
- The cache is stored in `~/.nahiarhdNLP/cache/`
- Fallback data is used if HuggingFace is down
```python
from nahiarhdNLP.mydatasets.loaders import DatasetLoader

loader = DatasetLoader()

# Clear the cache if needed
loader.clear_cache()

# Inspect cache info
cache_info = loader.get_cache_info()
print(cache_info)
```
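The cache-then-fallback flow described above can be sketched as follows; `load_slang`, `fetch`, and the file layout here are illustrative, not the library's internals:

```python
import json
from pathlib import Path

FALLBACK_SLANG = {"gw": "saya"}  # minimal built-in fallback data

def load_slang(fetch, cache_dir: Path) -> dict:
    # fetch: a callable that downloads the dataset (e.g. from HuggingFace).
    cache_file = cache_dir / "slang.json"
    if cache_file.exists():
        # Fast path: read the local cache.
        return json.loads(cache_file.read_text(encoding="utf-8"))
    try:
        data = fetch()  # slow path: network download
        cache_dir.mkdir(parents=True, exist_ok=True)
        cache_file.write_text(json.dumps(data), encoding="utf-8")
        return data
    except OSError:
        # Download failed: fall back to bundled data.
        return FALLBACK_SLANG
```

In the library itself the cache lives under `~/.nahiarhdNLP/cache/`, as noted above.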
📦 Dependencies
This package requires:
- `datasets` - loading datasets from HuggingFace
- `sastrawi` - stemming (optional)
- `pandas` - data processing
- `rich` - output formatting
- `fsspec` - file system operations
- `huggingface_hub` - HuggingFace access
File details
Details for the file nahiarhdnlp-1.0.4.tar.gz.
File metadata
- Download URL: nahiarhdnlp-1.0.4.tar.gz
- Upload date:
- Size: 5.7 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `7cb4f0f196c98f721ed44b700ed9ba1688ae12a765c0120796cddec46adc81e9` |
| MD5 | `27fd3ee5aead148ea1a90fe1adee8be5` |
| BLAKE2b-256 | `73748c7cf459db8030643e549dc1612565a56a84a110e9bf2fe6cf1d16cfe060` |
File details
Details for the file nahiarhdnlp-1.0.4-py3-none-any.whl.
File metadata
- Download URL: nahiarhdnlp-1.0.4-py3-none-any.whl
- Upload date:
- Size: 4.8 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | `93b76d1edefea0da4b98dc47973e136eb59ebb235589bbfd4708df594bb48b8a` |
| MD5 | `ae426a1c40c0a8aadf715138d71ad3b2` |
| BLAKE2b-256 | `1d3114bd00dc0ece18cbdb3c6dd985a662ab16d6c60924a4b19b543ead7b7206` |