
nahiarhdNLP - Advanced Indonesian Natural Language Processing Library

An advanced Indonesian Natural Language Processing library with text preprocessing, slang normalization, emoji conversion, spelling correction, and more.

🚀 Installation

pip install nahiarhdNLP

📦 Import Library

# Import the main package
import nahiarhdNLP

# Import the preprocessing module
from nahiarhdNLP import preprocessing

# Import the datasets module
from nahiarhdNLP import mydatasets

# Or import specific functions
from nahiarhdNLP.preprocessing import preprocess, remove_html, replace_slang

Usage Examples

1. 🎯 All-in-One Preprocess Function

from nahiarhdNLP import preprocessing

# Full preprocessing with a single function
teks = "Halooo emg siapa yg nanya? 😀"
hasil = preprocessing.preprocess(teks)
print(hasil)
# Output: "halo wajah_gembira"

2. 🧹 TextCleaner - Cleaning Text

from nahiarhdNLP.preprocessing import TextCleaner

cleaner = TextCleaner()

# Remove URLs
url_text = "kunjungi https://google.com sekarang!"
clean_result = cleaner.clean_urls(url_text)
print(clean_result)
# Output: "kunjungi  sekarang!"

# Remove mentions
mention_text = "Halo @user123 apa kabar?"
clean_result = cleaner.clean_mentions(mention_text)
print(clean_result)
# Output: "Halo  apa kabar?"

3. ✏️ SpellCorrector - Spelling Correction

from nahiarhdNLP.preprocessing import SpellCorrector

spell = SpellCorrector()

# Correct a single word
word = "mencri"
corrected = spell.correct(word)
print(corrected)
# Output: "mencuri"

# Correct a sentence
sentence = "saya mencri informsi"
corrected = spell.correct_sentence(sentence)
print(corrected)
# Output: "saya mencuri informasi"

4. 🚫 StopwordRemover - Removing Stopwords

from nahiarhdNLP.preprocessing import StopwordRemover

stopword = StopwordRemover()

# Remove stopwords
text = "saya suka makan nasi goreng"
result = stopword.remove_stopwords(text)
print(result)
# Output: "suka makan nasi goreng"

5. 🔄 SlangNormalizer - Slang Normalization

from nahiarhdNLP.preprocessing import SlangNormalizer

slang = SlangNormalizer()

# Normalize slang words
text = "gw lg di rmh"
result = slang.normalize(text)
print(result)
# Output: "saya lagi di rumah"

6. 😀 EmojiConverter - Emoji Conversion

from nahiarhdNLP.preprocessing import EmojiConverter

emoji = EmojiConverter()

# Emoji to text
emoji_text = "😀 😂 😍"
text_result = emoji.emoji_to_text_convert(emoji_text)
print(text_result)
# Output: "wajah_gembira wajah_tertawa wajah_bercinta"

# Text to emoji
text = "wajah_gembira"
emoji_result = emoji.text_to_emoji_convert(text)
print(emoji_result)
# Output: "😀"

7. 🔪 Tokenizer - Tokenization

from nahiarhdNLP.preprocessing import Tokenizer

tokenizer = Tokenizer()

# Tokenize text
text = "ini contoh tokenisasi"
tokens = tokenizer.tokenize(text)
print(tokens)
# Output: ['ini', 'contoh', 'tokenisasi']
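
Word tokenization as shown above can be approximated with a single regular expression. An illustrative sketch, not the library's tokenizer:

```python
import re

def tokenize_sketch(text: str) -> list:
    # Extract runs of word characters; punctuation is dropped
    return re.findall(r"\w+", text)

print(tokenize_sketch("ini contoh tokenisasi"))
# ['ini', 'contoh', 'tokenisasi']
```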

8. 🛠️ Individual Functions

from nahiarhdNLP.preprocessing import (
    remove_html, remove_url, remove_mentions,
    replace_slang, emoji_to_words, correct_spelling
)

# Remove HTML
html_text = "website <a href='https://google.com'>google</a>"
clean_text = remove_html(html_text)
print(clean_text)
# Output: "website google"

# Remove URLs
url_text = "kunjungi https://google.com sekarang!"
clean_text = remove_url(url_text)
print(clean_text)
# Output: "kunjungi  sekarang!"

# Remove mentions
mention_text = "Halo @user123 apa kabar?"
clean_text = remove_mentions(mention_text)
print(clean_text)
# Output: "Halo  apa kabar?"

# Normalize slang
slang_text = "emg siapa yg nanya?"
normal_text = replace_slang(slang_text)
print(normal_text)
# Output: "memang siapa yang bertanya?"

# Convert emoji
emoji_text = "😀 😂 😍"
text_result = emoji_to_words(emoji_text)
print(text_result)
# Output: "wajah_gembira wajah_tertawa wajah_bercinta"

# Correct spelling
spell_text = "saya mencri informsi"
corrected = correct_spelling(spell_text)
print(corrected)
# Output: "saya mencuri informasi"

9. 📊 Dataset Loader

from nahiarhdNLP.mydatasets import DatasetLoader

loader = DatasetLoader()

# Load stopwords
stopwords = loader.load_stopwords_dataset()
print(f"Stopword count: {len(stopwords)}")

# Load the slang dictionary
slang_dict = loader.load_slang_dataset()
print(f"Slang count: {len(slang_dict)}")

# Load the emoji dictionary
emoji_dict = loader.load_emoji_dataset()
print(f"Emoji count: {len(emoji_dict)}")

10. 🔄 Custom Pipeline

from nahiarhdNLP.preprocessing import pipeline, replace_word_elongation, replace_slang

# Build a custom pipeline
custom_pipeline = pipeline([
    replace_word_elongation,
    replace_slang
])

# Run the pipeline
text = "Knp emg gk mw makan kenapaaa???"
result = custom_pipeline(text)
print(result)
# Output: "mengapa memang tidak mau makan mengapa???"

⚙️ Preprocess Parameters

The preprocess() function accepts optional parameters:

result = nahiarhdNLP.preprocessing.preprocess(
    text="Halooo emg siapa yg nanya? 😀",
    remove_html_tags=True,      # Strip HTML tags
    remove_urls=True,           # Remove URLs
    remove_stopwords_flag=True, # Remove stopwords
    replace_slang_flag=True,    # Normalize slang
    replace_elongation=True,    # Fix word elongation
    convert_emoji=True,         # Convert emoji
    correct_spelling_flag=False,# Spelling correction (slow)
    stem_text_flag=False,       # Stemming
    to_lowercase=True           # Lowercase
)

🚨 Error Handling

try:
    from nahiarhdNLP import preprocessing
    result = preprocessing.preprocess("test")
except ImportError:
    print("Package nahiarhdNLP belum terinstall")
    print("Install dengan: pip install nahiarhdNLP")
except Exception as e:
    print(f"Error: {e}")

💡 Usage Tips

  1. For quick preprocessing: use preprocess() with the default parameters
  2. For full control: use the individual classes (TextCleaner, SpellCorrector, etc.)
  3. For customization: build a pipeline() from the functions you need
  4. For spelling correction: enable correct_spelling_flag=True (but it is slower)
  5. For stemming: enable stem_text_flag=True (requires installing Sastrawi)
  6. For optimal performance: datasets are cached automatically after the first download
  7. For development: fallback data is used if HuggingFace is down

⚡ Performance & Caching

nahiarhdNLP uses a caching system to speed up dataset loading:

  • First run: the dataset is downloaded from HuggingFace (3-4 seconds)
  • Subsequent runs: it is loaded from the local cache (0.003 seconds)
  • The cache is stored in ~/.nahiarhdNLP/cache/
  • Fallback data is used if HuggingFace is down

from nahiarhdNLP.mydatasets.loaders import DatasetLoader

loader = DatasetLoader()

# Clear the cache if needed
loader.clear_cache()

# Check cache info
cache_info = loader.get_cache_info()
print(cache_info)
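
The download-then-cache pattern described above can be sketched as follows. This is an illustrative sketch using a JSON file in a temporary directory; the library's actual cache layout under ~/.nahiarhdNLP/cache/ may differ:

```python
import json
import tempfile
from pathlib import Path

def load_with_cache(name: str, fetch, cache_dir: Path):
    # Serve from the local cache when present; otherwise fetch and store
    cache_file = cache_dir / f"{name}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())  # fast local load
    data = fetch()                                 # slow remote download
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(data))
    return data

tmp = Path(tempfile.mkdtemp())
first = load_with_cache("slang", lambda: {"gw": "saya"}, tmp)   # "downloads"
second = load_with_cache("slang", lambda: {}, tmp)              # served from cache
print(second)  # {'gw': 'saya'}
```

The second call never invokes its fetch function, which is why repeat loads drop from seconds to milliseconds.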

📦 Dependencies

This package requires:

  • datasets - loads datasets from HuggingFace
  • sastrawi - stemming (optional)
  • pandas - data processing
  • rich - output formatting
  • fsspec - file system operations
  • huggingface_hub - HuggingFace Hub access
