
🇮🇩 nahiarhdNLP

Advanced Indonesian Natural Language Processing Library

PyPI version Python Version License: MIT Downloads

Lightweight, powerful, and easy-to-use Indonesian text preprocessing library

Installation • Quick Start • Features • Examples • Documentation




🌟 Overview

nahiarhdNLP is a Python library designed specifically for preprocessing Indonesian text. It provides a wide range of functions for cleaning, normalizing, and processing text easily and efficiently.

✨ Key Features

  • 🔧 Configurable Pipeline - Build custom text processing workflows
  • 🧹 Comprehensive Text Cleaning - Remove HTML, URLs, mentions, hashtags, emojis, and more
  • 📝 Text Normalization - Emoji conversion, spell correction, slang normalization
  • 🔤 Linguistic Processing - Stemming, stopword removal, tokenization
  • 🔄 Text Replacement - Replace emails, links, and mentions with tokens
  • 📊 Built-in Datasets - Indonesian stopwords, slang dictionary, emoji mappings
  • ⚡ High Performance - Lazy loading and optimized processing
  • 🎯 Easy to Use - Simple, intuitive API

📦 Installation

Using pip

pip install nahiarhdNLP

From source

git clone https://github.com/raihanhd12/nahiarhdNLP.git
cd nahiarhdNLP
pip install -e .

Requirements

  • Python >= 3.8
  • pandas >= 1.3.0
  • sastrawi >= 1.0.1
  • rich >= 12.0.0

🚀 Quick Start

from nahiarhdNLP.preprocessing import Pipeline

# Create a pipeline with configuration
config = {
    "clean_html": True,
    "clean_mentions": True,
    "remove_urls": True,
    "stopword": True
}

pipeline = Pipeline(config)

# Process text
text = "Haii @user!! Cek website kita di https://example.com ya ๐Ÿ˜Š"
result = pipeline.process(text)

print(result)
# Output: "Haii Cek website kita ya ๐Ÿ˜Š"

🎯 Features

🧹 Text Cleaning

Feature               Description                  Config Key
--------------------  ---------------------------  ---------------------
HTML Removal          Remove HTML tags             clean_html
URL Removal           Remove complete URLs         remove_urls
URL Cleaning          Remove URL protocols only    clean_urls
Mention Removal       Remove @mentions             remove_mentions
Mention Cleaning      Remove @ but keep username   clean_mentions
Hashtag Removal       Remove #hashtags             remove_hashtags
Hashtag Cleaning      Remove # but keep tag text   clean_hashtags
Emoji Removal         Remove all emojis            remove_emoji
Punctuation Removal   Remove punctuation marks     remove_punctuation
Number Removal        Remove all numbers           remove_numbers
Email Removal         Remove email addresses       remove_emails
Phone Removal         Remove phone numbers         remove_phones
Currency Removal      Remove currency symbols      remove_currency
Special Char Removal  Remove special characters    remove_special_chars
Extra Spaces          Normalize whitespace         remove_extra_spaces
Repeated Chars        Normalize repeated chars     remove_repeated_chars
Whitespace Cleaning   Clean tabs, newlines, etc.   remove_whitespace

๐Ÿ“ Text Normalization

Feature                      Description                                Config Key
---------------------------  -----------------------------------------  -------------------------
Emoji to Text                Convert emojis to text                     emoji_to_text
Text to Emoji                Convert text to emojis                     text_to_emoji
Spell Correction (Word)      Correct spelling & slang (single word)     spell_corrector_word
Spell Correction (Sentence)  Correct spelling & slang (full sentence)   spell_corrector_sentence
Lowercase                    Convert to lowercase                       remove_lowercase

🔤 Linguistic Processing

Feature           Description                  Config Key
----------------  ---------------------------  ----------
Stemming          Reduce words to root form    stem
Stopword Removal  Remove Indonesian stopwords  stopword
Tokenization      Split text into tokens       tokenizer

🔄 Text Replacement

Feature            Description                   Config Key
-----------------  ----------------------------  -------------
Email Replacement  Replace emails with <email>   replace_email
Link Replacement   Replace URLs with <link>      replace_link
User Replacement   Replace mentions with <user>  replace_user

💡 Comprehensive Examples

1. Pipeline Configuration

Example 1.1: Basic Pipeline

from nahiarhdNLP.preprocessing import Pipeline

# Configure pipeline
config = {
    "clean_html": True,
    "clean_mentions": True,
    "remove_urls": True
}

pipeline = Pipeline(config)

# Input
text = "Hello <b>World</b>! Mention @user123 and visit https://example.com"

# Process
result = pipeline.process(text)

print(f"Input : {text}")
print(f"Output: {result}")

Output:

Input : Hello <b>World</b>! Mention @user123 and visit https://example.com
Output: Hello World! Mention user123 and visit

Example 1.2: Social Media Text Cleaning

from nahiarhdNLP.preprocessing import Pipeline

config = {
    "clean_html": True,
    "clean_mentions": True,
    "clean_hashtags": True,
    "remove_urls": True,
    "remove_emoji": True,
    "remove_extra_spaces": True
}

pipeline = Pipeline(config)

# Input - Typical social media post
text = """
Haiii gengs!! 😍😍
Jangan lupa follow @nahiarhdNLP ya!
Cek website kita di https://github.com/nahiarhd
#NLP #IndonesianNLP #TextProcessing 🚀
"""

result = pipeline.process(text)

print("=" * 60)
print("INPUT:")
print(text)
print("=" * 60)
print("OUTPUT:")
print(result)
print("=" * 60)

Output:

============================================================
INPUT:

Haiii gengs!! 😍😍
Jangan lupa follow @nahiarhdNLP ya!
Cek website kita di https://github.com/nahiarhd
#NLP #IndonesianNLP #TextProcessing 🚀

============================================================
OUTPUT:
Haiii gengs!! Jangan lupa follow nahiarhdNLP ya! Cek website kita di NLP IndonesianNLP TextProcessing
============================================================

Example 1.3: Update Pipeline Configuration

from nahiarhdNLP.preprocessing import Pipeline

# Initial configuration
config = {"clean_html": True, "remove_urls": True}
pipeline = Pipeline(config)

text = "<p>Visit https://example.com for more info</p>"

print(f"Initial Output: {pipeline.process(text)}")
# Output: Visit for more info

# Update configuration
pipeline.update_config({"remove_punctuation": True})

print(f"Updated Output: {pipeline.process(text)}")
# Output: Visit for more info

# Check enabled steps
print(f"Enabled steps: {pipeline.get_enabled_steps()}")
# Output: ['clean_html', 'remove_urls', 'remove_punctuation']

Example 1.4: Feature Discovery

from nahiarhdNLP.preprocessing import Pipeline

# Get all available features
all_features = Pipeline.get_available_steps()

print("All Available Features:")
for feature_name, description in sorted(all_features.items()):
    print(f"  {feature_name:25} - {description}")

print(f"\nTotal Features: {len(all_features)}")

# Get features organized by category
features_by_category = Pipeline.get_available_steps_by_category()

print("\nFeatures by Category:")
for category, feature_names in features_by_category.items():
    print(f"\n{category}:")
    for feature_name in feature_names:
        description = all_features.get(feature_name, "No description")
        print(f"  {feature_name:25} - {description}")

Output:

All Available Features:
  clean_hashtags           - Remove # symbol but keep tag text
  clean_html               - Remove HTML tags from text
  clean_mentions           - Remove @ symbol but keep username
  clean_urls               - Remove URL protocols (http://, https://) but keep domain
  emoji_to_text            - Convert emojis to Indonesian text description
  remove_currency          - Remove currency symbols
  remove_emails            - Remove email addresses
  remove_emoji             - Remove all emoji characters
  ... (28 features total)

Total Features: 28

Features by Category:

HTML & Tags:
  clean_html               - Remove HTML tags from text

URLs:
  remove_urls              - Remove complete URLs from text
  clean_urls               - Remove URL protocols (http://, https://) but keep domain

... (8 categories total)

2. Text Cleaning

Example 2.1: HTML Tag Removal

from nahiarhdNLP.preprocessing import Pipeline

config = {"clean_html": True}
pipeline = Pipeline(config)

# Test various HTML tags
examples = [
    "<p>This is a paragraph</p>",
    "<div class='container'>Content here</div>",
    "Normal text <b>bold text</b> <i>italic</i>",
    "<script>alert('test')</script>Clean text"
]

for text in examples:
    result = pipeline.process(text)
    print(f"Input : {text}")
    print(f"Output: {result}")
    print("-" * 60)

Output:

Input : <p>This is a paragraph</p>
Output: This is a paragraph
------------------------------------------------------------
Input : <div class='container'>Content here</div>
Output: Content here
------------------------------------------------------------
Input : Normal text <b>bold text</b> <i>italic</i>
Output: Normal text bold text italic
------------------------------------------------------------
Input : <script>alert('test')</script>Clean text
Output: Clean text
------------------------------------------------------------

Example 2.2: URL Processing

from nahiarhdNLP.preprocessing import Pipeline

# Remove URLs completely
config_remove = {"remove_urls": True}
pipeline_remove = Pipeline(config_remove)

# Clean URLs (remove protocol only)
config_clean = {"clean_urls": True}
pipeline_clean = Pipeline(config_clean)

text = "Visit https://github.com and http://example.com for more info"

print(f"Original     : {text}")
print(f"Remove URLs  : {pipeline_remove.process(text)}")
print(f"Clean URLs   : {pipeline_clean.process(text)}")

Output:

Original     : Visit https://github.com and http://example.com for more info
Remove URLs  : Visit and for more info
Clean URLs   : Visit github.com and example.com for more info

Example 2.3: Mention & Hashtag Processing

from nahiarhdNLP.preprocessing import Pipeline

text = "Hey @john_doe and @jane! Check out #Python #MachineLearning #AI"

# Remove mentions and hashtags
config_remove = {"remove_mentions": True, "remove_hashtags": True}
pipeline_remove = Pipeline(config_remove)

# Clean mentions and hashtags (keep text)
config_clean = {"clean_mentions": True, "clean_hashtags": True}
pipeline_clean = Pipeline(config_clean)

print(f"Original        : {text}")
print(f"Remove @#       : {pipeline_remove.process(text)}")
print(f"Clean @# (keep) : {pipeline_clean.process(text)}")

Output:

Original        : Hey @john_doe and @jane! Check out #Python #MachineLearning #AI
Remove @#       : Hey and ! Check out
Clean @# (keep) : Hey john_doe and jane! Check out Python MachineLearning AI

Example 2.4: Emoji Handling

from nahiarhdNLP.preprocessing import Pipeline

config = {"remove_emoji": True}
pipeline = Pipeline(config)

examples = [
    "I love Python ๐Ÿโค๏ธ",
    "Great work! ๐Ÿ‘๐Ÿ˜Š๐ŸŽ‰",
    "Weather today โ˜€๏ธ๐ŸŒง๏ธโ›ˆ๏ธ",
]

for text in examples:
    result = pipeline.process(text)
    print(f"Input : {text}")
    print(f"Output: {result}")
    print()

Output:

Input : I love Python 🐍❤️
Output: I love Python

Input : Great work! 👍😊🎉
Output: Great work!

Input : Weather today ☀️🌧️⛈️
Output: Weather today

Example 2.5: Repeated Characters Normalization

from nahiarhdNLP.preprocessing import Pipeline

config = {"remove_repeated_chars": True}
pipeline = Pipeline(config)

examples = [
    "Haiiiii guys!!!",
    "Kangennnnn bangetttt",
    "Wowwwww kerennn",
    "Makasiiih yaaaa"
]

for text in examples:
    result = pipeline.process(text)
    print(f"Input : {text}")
    print(f"Output: {result}")

Output:

Input : Haiiiii guys!!!
Output: Haiii guys!!

Input : Kangennnnn bangetttt
Output: Kangenn bangett

Input : Wowwwww kerennn
Output: Wowww kerenn

Input : Makasiiih yaaaa
Output: Makasiih yaa

3. Text Normalization

Example 3.1: Emoji Conversion

from nahiarhdNLP.preprocessing.normalization.emoji import EmojiConverter

emoji = EmojiConverter()
emoji._load_data()

# Emoji to Text
text_with_emoji = "Hari ini cuaca cerah โ˜€๏ธ dan saya senang ๐Ÿ˜Š"
result = emoji.emoji_to_text_convert(text_with_emoji)
print(f"Emoji to Text:")
print(f"Input : {text_with_emoji}")
print(f"Output: {result}")
print()

# Text to Emoji (example - depends on your emoji dataset)
text = "saya senang wajah tersenyum"
result = emoji.text_to_emoji_convert(text)
print(f"Text to Emoji:")
print(f"Input : {text}")
print(f"Output: {result}")

Output:

Emoji to Text:
Input : Hari ini cuaca cerah ☀️ dan saya senang 😊
Output: Hari ini cuaca cerah matahari dan saya senang wajah_tersenyum

Text to Emoji:
Input : saya senang wajah tersenyum
Output: saya senang 😊

Example 3.2: Spell Correction & Slang Normalization

from nahiarhdNLP.preprocessing.normalization.spell_corrector import SpellCorrector

spell = SpellCorrector()

# Single word correction
words = ["sya", "tdk", "gk", "org", "yg", "dgn"]
print("Word Correction:")
for word in words:
    corrected = spell.correct_word(word)
    print(f"  {word:10s} โ†’ {corrected}")

print("\n" + "="*60 + "\n")

# Sentence correction
sentences = [
    "gw lg di rmh",
    "gmn kabar lo?",
    "knp gk dtg?",
    "jgn lupa ya"
]

print("Sentence Correction:")
for sent in sentences:
    corrected = spell.correct_sentence(sent)
    print(f"Input : {sent}")
    print(f"Output: {corrected}")
    print()

Output:

Word Correction:
  sya        → saya
  tdk        → tidak
  gk         → tidak
  org        → orang
  yg         → yang
  dgn        → dengan

============================================================

Sentence Correction:
Input : gw lg di rmh
Output: gue lagi di rumah

Input : gmn kabar lo?
Output: gimana kabar kamu?

Input : knp gk dtg?
Output: kenapa tidak datang?

Input : jgn lupa ya
Output: jangan lupa ya

Example 3.3: Complete Text Normalization Pipeline

from nahiarhdNLP.preprocessing import Pipeline

# Comprehensive normalization pipeline
config = {
    "clean_html": True,
    "clean_mentions": True,
    "clean_hashtags": True,
    "remove_urls": True,
    "remove_emoji": True,
    "remove_extra_spaces": True,
    "remove_repeated_chars": True,
    "spell_corrector_sentence": True,
    "remove_lowercase": True
}

pipeline = Pipeline(config)

# Messy Indonesian text
text = """
Haiii @temans!! 😍 Kmrn gw udh coba apps baruu loh di https://example.com
#KerenBanget #Recommended Gkkkk nyesel dehhhh!!! 🚀🚀
"""

result = pipeline.process(text)

print("=" * 70)
print("ORIGINAL TEXT:")
print(text)
print("=" * 70)
print("NORMALIZED TEXT:")
print(result)
print("=" * 70)

Output:

======================================================================
ORIGINAL TEXT:

Haiii @temans!! 😍 Kmrn gw udh coba apps baruu loh di https://example.com
#KerenBanget #Recommended Gkkkk nyesel dehhhh!!! 🚀🚀

======================================================================
NORMALIZED TEXT:
haiii temans!! kemarin gue sudah coba apps baruu loh di kerenbangett recommendedd gkk nyesell dehh!!!
======================================================================

4. Linguistic Processing

Example 4.1: Stemming

from nahiarhdNLP.preprocessing.linguistic.stemmer import Stemmer

stemmer = Stemmer()

# Test various Indonesian words
words = [
    "bermain",      # playing
    "berlari",      # running
    "kebahagiaan",  # happiness
    "pembelajaran", # learning
    "menyenangkan", # enjoyable
    "berkomunikasi" # communicate
]

print("Indonesian Stemming:")
print(f"{'Word':<20} โ†’ {'Stem'}")
print("-" * 40)
for word in words:
    stem = stemmer.stem(word)
    print(f"{word:<20} โ†’ {stem}")

print("\n" + "="*60 + "\n")

# Sentence stemming
sentences = [
    "Saya sedang belajar pemrograman Python",
    "Mereka bermain bola di lapangan",
    "Kebahagiaan adalah kunci kesuksesan"
]

print("Sentence Stemming:")
for sent in sentences:
    stemmed = stemmer.stem(sent)
    print(f"Input : {sent}")
    print(f"Output: {stemmed}")
    print()

Output:

Indonesian Stemming:
Word                 → Stem
----------------------------------------
bermain              → main
berlari              → lari
kebahagiaan          → bahagia
pembelajaran         → ajar
menyenangkan         → senang
berkomunikasi        → komunikasi

============================================================

Sentence Stemming:
Input : Saya sedang belajar pemrograman Python
Output: saya sedang ajar program python

Input : Mereka bermain bola di lapangan
Output: mereka main bola di lapang

Input : Kebahagiaan adalah kunci kesuksesan
Output: bahagia adalah kunci sukses

Example 4.2: Stopword Removal

from nahiarhdNLP.preprocessing.linguistic.stopword import StopwordRemover

stopword = StopwordRemover()
stopword._load_data()

# Test sentences
sentences = [
    "Saya sedang belajar bahasa pemrograman Python untuk data science",
    "Mereka akan pergi ke pasar besok pagi",
    "Ini adalah contoh kalimat dengan banyak stopwords yang harus dihapus"
]

print("Stopword Removal:")
print("=" * 70)
for sent in sentences:
    cleaned = stopword.remove_stopwords(sent)
    print(f"Original: {sent}")
    print(f"Cleaned : {cleaned}")
    print("-" * 70)

Output:

Stopword Removal:
======================================================================
Original: Saya sedang belajar bahasa pemrograman Python untuk data science
Cleaned : belajar bahasa pemrograman Python data science
----------------------------------------------------------------------
Original: Mereka akan pergi ke pasar besok pagi
Cleaned : pasar besok pagi
----------------------------------------------------------------------
Original: Ini adalah contoh kalimat dengan banyak stopwords yang harus dihapus
Cleaned : contoh kalimat stopwords dihapus
----------------------------------------------------------------------

Example 4.3: Complete Linguistic Pipeline

from nahiarhdNLP.preprocessing import Pipeline

# Linguistic processing pipeline
config = {
    "remove_lowercase": True,
    "stopword": True,
    "stem": True,
    "remove_extra_spaces": True
}

pipeline = Pipeline(config)

texts = [
    "Saya sedang mengembangkan aplikasi pembelajaran online",
    "Mereka bermain musik dengan sangat menyenangkan",
    "Kebahagiaan adalah perjalanan bukan tujuan"
]

print("Complete Linguistic Processing:")
print("=" * 70)
for text in texts:
    result = pipeline.process(text)
    print(f"Original : {text}")
    print(f"Processed: {result}")
    print("-" * 70)

Output:

Complete Linguistic Processing:
======================================================================
Original : Saya sedang mengembangkan aplikasi pembelajaran online
Processed: kembang aplikasi ajar online
----------------------------------------------------------------------
Original : Mereka bermain musik dengan sangat menyenangkan
Processed: main musik senang
----------------------------------------------------------------------
Original : Kebahagiaan adalah perjalanan bukan tujuan
Processed: bahagia jalan tuju
----------------------------------------------------------------------

Example 4.4: Tokenization

from nahiarhdNLP.preprocessing.tokenization.tokenizer import Tokenizer

tokenizer = Tokenizer()

texts = [
    "Ini adalah contoh kalimat sederhana",
    "Python, Java, dan JavaScript adalah bahasa pemrograman",
    "Email: test@example.com, Website: https://example.com"
]

print("Tokenization Examples:")
print("=" * 70)
for text in texts:
    tokens = tokenizer.tokenize(text)
    print(f"Text  : {text}")
    print(f"Tokens: {tokens}")
    print("-" * 70)

Output:

Tokenization Examples:
======================================================================
Text  : Ini adalah contoh kalimat sederhana
Tokens: ['Ini', 'adalah', 'contoh', 'kalimat', 'sederhana']
----------------------------------------------------------------------
Text  : Python, Java, dan JavaScript adalah bahasa pemrograman
Tokens: ['Python', ',', 'Java', ',', 'dan', 'JavaScript', 'adalah', 'bahasa', 'pemrograman']
----------------------------------------------------------------------
Text  : Email: test@example.com, Website: https://example.com
Tokens: ['Email', ':', 'test@example.com', ',', 'Website', ':', 'https://example.com']
----------------------------------------------------------------------

5. Text Replacement

Example 5.1: Email, Link, and Mention Replacement

from nahiarhdNLP.preprocessing import Pipeline

# Configure replacement pipeline
config = {
    "replace_email": True,
    "replace_link": True,
    "replace_user": True
}

pipeline = Pipeline(config)

examples = [
    "Contact me at john.doe@gmail.com for more info",
    "Visit https://github.com/nahiarhd for the code",
    "Thanks @john and @jane for your help!",
    "Email: info@company.com | Web: https://company.com | Twitter: @company"
]

print("Text Replacement:")
print("=" * 70)
for text in examples:
    result = pipeline.process(text)
    print(f"Input : {text}")
    print(f"Output: {result}")
    print("-" * 70)

Output:

Text Replacement:
======================================================================
Input : Contact me at john.doe@gmail.com for more info
Output: Contact me at <email> for more info
----------------------------------------------------------------------
Input : Visit https://github.com/nahiarhd for the code
Output: Visit <link> for the code
----------------------------------------------------------------------
Input : Thanks @john and @jane for your help!
Output: Thanks <user> and <user> for your help!
----------------------------------------------------------------------
Input : Email: info@company.com | Web: https://company.com | Twitter: @company
Output: Email: <email> | Web: <link> | Twitter: <user>
----------------------------------------------------------------------

Example 5.2: Data Anonymization Pipeline

from nahiarhdNLP.preprocessing import Pipeline

# Complete anonymization pipeline
config = {
    "replace_email": True,
    "replace_link": True,
    "replace_user": True,
    "remove_phones": True,
    "clean_html": True
}

pipeline = Pipeline(config)

# Sensitive data example
text = """
<div class="contact">
Customer: @johndoe
Email: john.doe@email.com
Phone: +62-812-3456-7890
Website: https://customer-site.com
</div>
"""

result = pipeline.process(text)

print("DATA ANONYMIZATION")
print("=" * 70)
print("ORIGINAL:")
print(text)
print("=" * 70)
print("ANONYMIZED:")
print(result)
print("=" * 70)

Output:

DATA ANONYMIZATION
======================================================================
ORIGINAL:

<div class="contact">
Customer: @johndoe
Email: john.doe@email.com
Phone: +62-812-3456-7890
Website: https://customer-site.com
</div>

======================================================================
ANONYMIZED:
Customer: <user> Email: <email> Phone: Website: <link>
======================================================================

6. Dataset Loaders

Example 6.1: Loading Built-in Datasets

from nahiarhdNLP.datasets import DatasetLoader

loader = DatasetLoader()

# Load stopwords
stopwords = loader.load_stopwords_dataset()
print("📚 Stopwords Dataset:")
print(f"   Total words: {len(stopwords)}")
print(f"   Sample: {stopwords[:10]}")
print()

# Load slang dictionary
slang_dict = loader.load_slang_dataset()
print("💬 Slang Dictionary:")
print(f"   Total entries: {len(slang_dict)}")
print("   Sample mappings:")
for slang, formal in list(slang_dict.items())[:5]:
    print(f"      {slang:10s} → {formal}")
print()

# Load emoji dictionary
emoji_dict = loader.load_emoji_dataset()
print("😊 Emoji Dictionary:")
print(f"   Total emojis: {len(emoji_dict)}")
print("   Sample mappings:")
for emoji, text in list(emoji_dict.items())[:5]:
    print(f"      {emoji:5s} → {text}")
print()

# Load wordlist
wordlist = loader.load_wordlist_dataset()
print("📖 Wordlist Dataset:")
print(f"   Total words: {len(wordlist)}")
print(f"   Sample: {wordlist[:10]}")

Output:

📚 Stopwords Dataset:
   Total words: 758
   Sample: ['ada', 'adalah', 'adanya', 'adapun', 'agak', 'agaknya', 'agar', 'akan', 'akankah', 'akhir']

💬 Slang Dictionary:
   Total entries: 3592
   Sample mappings:
      gw         → gue
      lo         → kamu
      gak        → tidak
      yg         → yang
      dgn        → dengan

😊 Emoji Dictionary:
   Total emojis: 1800
   Sample mappings:
      😀     → wajah_tersenyum
      😁     → wajah_gembira
      😂     → tertawa_terbahak
      🤣     → tertawa_guling
      😃     → senyum_lebar

📖 Wordlist Dataset:
   Total words: 28526
   Sample: ['a', 'aa', 'aaa', 'aaai', 'aai', 'aak', 'aal', 'aalim', 'aam', 'aan']
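
Example 6.2: Using a Dataset Outside the Pipeline

Because the loaders return plain Python structures (lists and dicts), they can also feed custom logic. The drop_stopwords helper below is an illustrative sketch, not part of the library:

from nahiarhdNLP.datasets import DatasetLoader

loader = DatasetLoader()
stopword_set = set(loader.load_stopwords_dataset())  # list -> set for fast membership tests

def drop_stopwords(tokens):
    # Illustrative helper: keep only tokens that are not Indonesian stopwords
    return [t for t in tokens if t.lower() not in stopword_set]

tokens = "Saya sedang belajar bahasa pemrograman Python".split()
print(drop_stopwords(tokens))
# Expected, matching the stopword removal example above:
# ['belajar', 'bahasa', 'pemrograman', 'Python']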

โš™๏ธ Pipeline Configuration Options

Complete Configuration Reference

config = {
    # ===== TEXT CLEANING =====
    # HTML & Tags
    "clean_html": True,              # Remove HTML tags

    # URLs
    "remove_urls": True,             # Remove complete URLs
    "clean_urls": True,              # Remove URL protocols (http://, https://)

    # Social Media
    "remove_mentions": True,         # Remove @mentions completely
    "clean_mentions": True,          # Remove @ but keep username
    "remove_hashtags": True,         # Remove #hashtags completely
    "clean_hashtags": True,          # Remove # but keep tag text

    # Content Removal
    "remove_emoji": True,            # Remove emoji characters
    "remove_punctuation": True,      # Remove punctuation marks
    "remove_numbers": True,          # Remove numbers
    "remove_emails": True,           # Remove email addresses
    "remove_phones": True,           # Remove phone numbers
    "remove_currency": True,         # Remove currency symbols

    # Text Cleaning
    "remove_special_chars": True,    # Remove special characters
    "remove_extra_spaces": True,     # Normalize whitespace
    "remove_repeated_chars": True,   # Normalize repeated characters (e.g., "haiiii" โ†’ "haii")
    "remove_whitespace": True,       # Clean tabs, newlines, etc.
    "remove_lowercase": True,        # Convert to lowercase

    # ===== TEXT NORMALIZATION =====
    "emoji_to_text": True,           # Convert emojis to text description
    "text_to_emoji": True,           # Convert text to emojis
    "spell_corrector_word": True,    # Correct spelling for single words
    "spell_corrector_sentence": True, # Correct spelling for sentences

    # ===== LINGUISTIC PROCESSING =====
    "stem": True,                    # Apply stemming (reduce to root form)
    "stopword": True,                # Remove stopwords
    "tokenizer": True,               # Tokenize text

    # ===== TEXT REPLACEMENT =====
    "replace_email": True,           # Replace emails with <email>
    "replace_link": True,            # Replace URLs with <link>
    "replace_user": True,            # Replace mentions with <user>
}

Configuration Tips

  1. For Social Media: Use clean_* instead of remove_* to keep the text content
  2. For Formal Text: Use spell_corrector_sentence to normalize slang
  3. For ML/NLP: Combine stem, stopword, and remove_lowercase
  4. For Anonymization: Use replace_* options

📖 API Documentation

Pipeline Class

class Pipeline:
    """
    Configurable text preprocessing pipeline for Indonesian text.

    Args:
        config (dict): Dictionary of preprocessing steps {step_name: True/False}

    Methods:
        process(text: str) -> str: Process text through the pipeline
        update_config(new_config: dict) -> None: Update pipeline configuration
        get_enabled_steps() -> list: Get list of enabled processing steps
        __call__(text: str) -> str: Allow pipeline to be called as a function

    Example:
        >>> config = {"clean_html": True, "stopword": True}
        >>> pipeline = Pipeline(config)
        >>> result = pipeline.process("<p>Saya sedang belajar NLP</p>")
        >>> # or use as callable
        >>> result = pipeline("<p>Saya sedang belajar NLP</p>")
    """

Available Processing Steps

See Pipeline Configuration Options for complete list.
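
Since Pipeline implements __call__ (see the docstring above), a configured instance can be used directly as a function, e.g. in map-style code. A small sketch using only steps documented above; the output is illustrative:

from nahiarhdNLP.preprocessing import Pipeline

pipeline = Pipeline({"clean_html": True, "remove_extra_spaces": True})

docs = ["<p>Halo  dunia</p>", "Teks   <b>kedua</b>"]
cleaned = [pipeline(doc) for doc in docs]  # __call__ delegates to process()
print(cleaned)
# Illustrative output: ['Halo dunia', 'Teks kedua']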


๐Ÿ› ๏ธ Development

Running Tests

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Run tests with coverage
pytest --cov=nahiarhdNLP --cov-report=html

# Run specific test file
pytest nahiarhdNLP/tests/test_pipeline.py

Code Formatting

# Format code with black
black nahiarhdNLP/

# Sort imports with isort
isort nahiarhdNLP/

# Lint with flake8
flake8 nahiarhdNLP/

Building Package

# Install build tools
pip install build twine

# Build distributions
python -m build

# Upload to TestPyPI
twine upload --repository testpypi dist/*

# Upload to PyPI
twine upload dist/*
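
To sanity-check a TestPyPI upload before the real release, the package can be installed from the test index (a hedged example; dependencies may still need to come from the main index):

# Install the TestPyPI build, pulling dependencies from the main PyPI index
pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ nahiarhdNLP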

๐Ÿค Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow PEP 8 style guide
  • Add tests for new features
  • Update documentation
  • Add examples for new functionality

๐Ÿ“ License

This project is licensed under the MIT License - see the LICENSE file for details.


👤 Author

Raihan Hidayatullah Djunaedi


๐Ÿ™ Acknowledgments

  • Sastrawi - Indonesian stemming library
  • Indonesian NLP Community - For datasets and inspiration
  • All contributors who helped improve this library

📊 Project Statistics

GitHub stars GitHub forks GitHub issues GitHub pull requests


Made with โค๏ธ for Indonesian NLP Community

⬆ Back to Top
