🇮🇩 nahiarhdNLP
Advanced Indonesian Natural Language Processing Library
Lightweight, powerful, and easy-to-use Indonesian text preprocessing library
Installation • Quick Start • Features • Examples • Documentation
📋 Table of Contents
- Overview
- Installation
- Quick Start
- Features
- Comprehensive Examples
- Pipeline Configuration Options
- API Documentation
- Development
- Contributing
- License
📖 Overview
nahiarhdNLP is a Python library designed specifically for preprocessing Indonesian-language text. It provides a wide range of functions for cleaning, normalizing, and processing text easily and efficiently.
✨ Key Features
- 🔧 Configurable Pipeline - Build custom text processing workflows
- 🧹 Comprehensive Text Cleaning - Remove HTML, URLs, mentions, hashtags, emojis, and more
- 🔄 Text Normalization - Emoji conversion, spell correction, slang normalization
- 🔤 Linguistic Processing - Stemming, stopword removal, tokenization
- 🔁 Text Replacement - Replace emails, links, and mentions with tokens
- 📚 Built-in Datasets - Indonesian stopwords, slang dictionary, emoji mappings
- ⚡ High Performance - Lazy loading and optimized processing
- 🎯 Easy to Use - Simple, intuitive API
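The lazy loading mentioned above can be sketched roughly like this (an illustrative pattern, not the library's actual internals; the class and the inline dataset are made up for the example):

```python
class LazyStopwordRemover:
    """Illustrative lazy loader: the dataset is read only on first use."""

    def __init__(self):
        self._stopwords = None  # nothing is loaded at construction time

    def _load_data(self):
        if self._stopwords is None:
            # The real library would read its bundled dataset here;
            # a tiny inline sample stands in for it.
            self._stopwords = {"yang", "di", "dan", "adalah"}

    def remove_stopwords(self, text: str) -> str:
        self._load_data()  # triggers the one-time load
        return " ".join(w for w in text.split() if w.lower() not in self._stopwords)

remover = LazyStopwordRemover()
print(remover.remove_stopwords("Python adalah bahasa yang populer"))  # -> "Python bahasa populer"
```

Constructing many objects stays cheap this way; the cost of reading a dataset is paid only by the first call that actually needs it.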
📦 Installation
Using pip
pip install nahiarhdNLP
From source
git clone https://github.com/raihanhd12/nahiarhdNLP.git
cd nahiarhdNLP
pip install -e .
Requirements
- Python >= 3.8
- pandas >= 1.3.0
- sastrawi >= 1.0.1
- rich >= 12.0.0
🚀 Quick Start
from nahiarhdNLP.preprocessing import Pipeline

# Create a pipeline with configuration
config = {
    "clean_html": True,
    "clean_mentions": True,
    "remove_urls": True,
    "stopword": True,
}
pipeline = Pipeline(config)

# Process text
text = "Haii @user!! Cek website kita di https://example.com ya 😊"
result = pipeline.process(text)
print(result)
# Output: "Haii Cek website kita ya 😊"
🎯 Features
🧹 Text Cleaning
| Feature | Description | Config Key |
|---|---|---|
| HTML Removal | Remove HTML tags | clean_html |
| URL Removal | Remove complete URLs | remove_urls |
| URL Cleaning | Remove URL protocols only | clean_urls |
| Mention Removal | Remove @mentions | remove_mentions |
| Mention Cleaning | Remove @ but keep username | clean_mentions |
| Hashtag Removal | Remove #hashtags | remove_hashtags |
| Hashtag Cleaning | Remove # but keep tag text | clean_hashtags |
| Emoji Removal | Remove all emojis | remove_emoji |
| Punctuation Removal | Remove punctuation marks | remove_punctuation |
| Number Removal | Remove all numbers | remove_numbers |
| Email Removal | Remove email addresses | remove_emails |
| Phone Removal | Remove phone numbers | remove_phones |
| Currency Removal | Remove currency symbols | remove_currency |
| Special Char Removal | Remove special characters | remove_special_chars |
| Extra Spaces | Normalize whitespace | remove_extra_spaces |
| Repeated Chars | Normalize repeated chars | remove_repeated_chars |
| Whitespace Cleaning | Clean tabs, newlines, etc. | remove_whitespace |
🔄 Text Normalization
| Feature | Description | Config Key |
|---|---|---|
| Emoji to Text | Convert emojis to text | emoji_to_text |
| Text to Emoji | Convert text to emojis | text_to_emoji |
| Spell Correction (Word) | Correct spelling & slang (single word) | spell_corrector_word |
| Spell Correction (Sentence) | Correct spelling & slang (full sentence) | spell_corrector_sentence |
| Lowercase | Convert to lowercase | remove_lowercase |
🔤 Linguistic Processing
| Feature | Description | Config Key |
|---|---|---|
| Stemming | Reduce words to root form | stem |
| Stopword Removal | Remove Indonesian stopwords | stopword |
| Tokenization | Split text into tokens | tokenizer |
🔁 Text Replacement
| Feature | Description | Config Key |
|---|---|---|
| Email Replacement | Replace emails with <email> | replace_email |
| Link Replacement | Replace URLs with <link> | replace_link |
| User Replacement | Replace mentions with <user> | replace_user |
💡 Comprehensive Examples
1. Pipeline Configuration
Example 1.1: Basic Pipeline
from nahiarhdNLP.preprocessing import Pipeline
# Configure pipeline
config = {
    "clean_html": True,
    "clean_mentions": True,
    "remove_urls": True,
}
pipeline = Pipeline(config)
# Input
text = "Hello <b>World</b>! Mention @user123 and visit https://example.com"
# Process
result = pipeline.process(text)
print(f"Input : {text}")
print(f"Output: {result}")
Output:
Input : Hello <b>World</b>! Mention @user123 and visit https://example.com
Output: Hello World! Mention user123 and visit
Example 1.2: Social Media Text Cleaning
from nahiarhdNLP.preprocessing import Pipeline
config = {
    "clean_html": True,
    "clean_mentions": True,
    "clean_hashtags": True,
    "remove_urls": True,
    "remove_emoji": True,
    "remove_extra_spaces": True,
}
pipeline = Pipeline(config)
# Input - Typical social media post
text = """
Haiii gengs!! 👋🎉
Jangan lupa follow @nahiarhdNLP ya!
Cek website kita di https://github.com/nahiarhd
#NLP #IndonesianNLP #TextProcessing 🚀
"""
result = pipeline.process(text)
print("=" * 60)
print("INPUT:")
print(text)
print("=" * 60)
print("OUTPUT:")
print(result)
print("=" * 60)
Output:
============================================================
INPUT:
Haiii gengs!! 👋🎉
Jangan lupa follow @nahiarhdNLP ya!
Cek website kita di https://github.com/nahiarhd
#NLP #IndonesianNLP #TextProcessing 🚀
============================================================
OUTPUT:
Haiii gengs!! Jangan lupa follow nahiarhdNLP ya! Cek website kita di NLP IndonesianNLP TextProcessing
============================================================
Example 1.3: Update Pipeline Configuration
from nahiarhdNLP.preprocessing import Pipeline
# Initial configuration
config = {"clean_html": True, "remove_urls": True}
pipeline = Pipeline(config)
text = "<p>Visit https://example.com for more info</p>"
print(f"Initial Output: {pipeline.process(text)}")
# Output: Visit for more info
# Update configuration
pipeline.update_config({"remove_punctuation": True})
print(f"Updated Output: {pipeline.process(text)}")
# Output: Visit for more info
# Check enabled steps
print(f"Enabled steps: {pipeline.get_enabled_steps()}")
# Output: ['clean_html', 'remove_urls', 'remove_punctuation']
Example 1.4: Feature Discovery
from nahiarhdNLP.preprocessing import Pipeline

# Get all available features
all_features = Pipeline.get_available_steps()
print("All Available Features:")
for feature_name, description in sorted(all_features.items()):
    print(f"  {feature_name:25} - {description}")
print(f"\nTotal Features: {len(all_features)}")

# Get features organized by category
features_by_category = Pipeline.get_available_steps_by_category()
print("\nFeatures by Category:")
for category, feature_names in features_by_category.items():
    print(f"\n{category}:")
    for feature_name in feature_names:
        description = all_features.get(feature_name, "No description")
        print(f"  {feature_name:25} - {description}")
Output:
All Available Features:
clean_hashtags - Remove # symbol but keep tag text
clean_html - Remove HTML tags from text
clean_mentions - Remove @ symbol but keep username
clean_urls - Remove URL protocols (http://, https://) but keep domain
emoji_to_text - Convert emojis to Indonesian text description
remove_currency - Remove currency symbols
remove_emails - Remove email addresses
remove_emoji - Remove all emoji characters
... (28 features total)
Total Features: 28
Features by Category:
HTML & Tags:
clean_html - Remove HTML tags from text
URLs:
remove_urls - Remove complete URLs from text
clean_urls - Remove URL protocols (http://, https://) but keep domain
... (8 categories total)
2. Text Cleaning
Example 2.1: HTML Tag Removal
from nahiarhdNLP.preprocessing import Pipeline

config = {"clean_html": True}
pipeline = Pipeline(config)

# Test various HTML tags
examples = [
    "<p>This is a paragraph</p>",
    "<div class='container'>Content here</div>",
    "Normal text <b>bold text</b> <i>italic</i>",
    "<script>alert('test')</script>Clean text",
]
for text in examples:
    result = pipeline.process(text)
    print(f"Input : {text}")
    print(f"Output: {result}")
    print("-" * 60)
Output:
Input : <p>This is a paragraph</p>
Output: This is a paragraph
------------------------------------------------------------
Input : <div class='container'>Content here</div>
Output: Content here
------------------------------------------------------------
Input : Normal text <b>bold text</b> <i>italic</i>
Output: Normal text bold text italic
------------------------------------------------------------
Input : <script>alert('test')</script>Clean text
Output: Clean text
------------------------------------------------------------
Example 2.2: URL Processing
from nahiarhdNLP.preprocessing import Pipeline
# Remove URLs completely
config_remove = {"remove_urls": True}
pipeline_remove = Pipeline(config_remove)
# Clean URLs (remove protocol only)
config_clean = {"clean_urls": True}
pipeline_clean = Pipeline(config_clean)
text = "Visit https://github.com and http://example.com for more info"
print(f"Original : {text}")
print(f"Remove URLs : {pipeline_remove.process(text)}")
print(f"Clean URLs : {pipeline_clean.process(text)}")
Output:
Original : Visit https://github.com and http://example.com for more info
Remove URLs : Visit and for more info
Clean URLs : Visit github.com and example.com for more info
Example 2.3: Mention & Hashtag Processing
from nahiarhdNLP.preprocessing import Pipeline
text = "Hey @john_doe and @jane! Check out #Python #MachineLearning #AI"
# Remove mentions and hashtags
config_remove = {"remove_mentions": True, "remove_hashtags": True}
pipeline_remove = Pipeline(config_remove)
# Clean mentions and hashtags (keep text)
config_clean = {"clean_mentions": True, "clean_hashtags": True}
pipeline_clean = Pipeline(config_clean)
print(f"Original : {text}")
print(f"Remove @# : {pipeline_remove.process(text)}")
print(f"Clean @# (keep) : {pipeline_clean.process(text)}")
Output:
Original : Hey @john_doe and @jane! Check out #Python #MachineLearning #AI
Remove @# : Hey and ! Check out
Clean @# (keep) : Hey john_doe and jane! Check out Python MachineLearning AI
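The difference between the remove_* and clean_* behaviors above comes down to whether a regex substitution keeps the captured text. A minimal sketch (these patterns are illustrative assumptions, not the library's exact regexes):

```python
import re

text = "Hey @john_doe and @jane! Check out #Python #AI"

# remove_mentions: drop the whole token, including the username
print(re.sub(r"@\w+", "", text))
# clean_mentions: drop only the @ and keep the captured username
print(re.sub(r"@(\w+)", r"\1", text))  # -> "Hey john_doe and jane! Check out #Python #AI"
# clean_hashtags works the same way, with # instead of @
print(re.sub(r"#(\w+)", r"\1", text))  # -> "Hey @john_doe and @jane! Check out Python AI"
```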
Example 2.4: Emoji Handling
from nahiarhdNLP.preprocessing import Pipeline

config = {"remove_emoji": True}
pipeline = Pipeline(config)

examples = [
    "I love Python 🐍❤️",
    "Great work! 👏👏👏",
    "Weather today ☀️🌧️⛈️",
]
for text in examples:
    result = pipeline.process(text)
    print(f"Input : {text}")
    print(f"Output: {result}")
    print()
Output:
Input : I love Python 🐍❤️
Output: I love Python
Input : Great work! 👏👏👏
Output: Great work!
Input : Weather today ☀️🌧️⛈️
Output: Weather today
Example 2.5: Repeated Characters Normalization
from nahiarhdNLP.preprocessing import Pipeline

config = {"remove_repeated_chars": True}
pipeline = Pipeline(config)

examples = [
    "Haiiiii guys!!!",
    "Kangennnnn bangetttt",
    "Wowwwww kerennn",
    "Makasiiih yaaaa",
]
for text in examples:
    result = pipeline.process(text)
    print(f"Input : {text}")
    print(f"Output: {result}")
Output:
Input : Haiiiii guys!!!
Output: Haiii guys!!
Input : Kangennnnn bangetttt
Output: Kangenn bangett
Input : Wowwwww kerennn
Output: Wowww kerenn
Input : Makasiiih yaaaa
Output: Makasiih yaa
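A common implementation of this kind of normalization is a backreference regex that caps character runs. A sketch (the library's exact rule may differ; note its sample output above keeps three letters in "Haiii", while this sketch caps runs at two):

```python
import re

def squeeze_repeats(text: str, max_run: int = 2) -> str:
    # Replace any character repeated more than max_run times with max_run copies.
    return re.sub(r"(.)\1{%d,}" % max_run, r"\1" * max_run, text)

print(squeeze_repeats("Haiiiii guys!!!"))  # -> "Haii guys!!"
print(squeeze_repeats("Makasiiih yaaaa"))  # -> "Makasiih yaa"
```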
3. Text Normalization
Example 3.1: Emoji Conversion
from nahiarhdNLP.preprocessing.normalization.emoji import EmojiConverter

emoji = EmojiConverter()
emoji._load_data()

# Emoji to Text
text_with_emoji = "Hari ini cuaca cerah ☀️ dan saya senang 😊"
result = emoji.emoji_to_text_convert(text_with_emoji)
print("Emoji to Text:")
print(f"Input : {text_with_emoji}")
print(f"Output: {result}")
print()

# Text to Emoji (example - depends on your emoji dataset)
text = "saya senang wajah tersenyum"
result = emoji.text_to_emoji_convert(text)
print("Text to Emoji:")
print(f"Input : {text}")
print(f"Output: {result}")
Output:
Emoji to Text:
Input : Hari ini cuaca cerah ☀️ dan saya senang 😊
Output: Hari ini cuaca cerah matahari dan saya senang wajah_tersenyum
Text to Emoji:
Input : saya senang wajah tersenyum
Output: saya senang 😊
Example 3.2: Spell Correction & Slang Normalization
from nahiarhdNLP.preprocessing.normalization.spell_corrector import SpellCorrector

spell = SpellCorrector()

# Single word correction
words = ["sya", "tdk", "gk", "org", "yg", "dgn"]
print("Word Correction:")
for word in words:
    corrected = spell.correct_word(word)
    print(f"  {word:10s} → {corrected}")

print("\n" + "=" * 60 + "\n")

# Sentence correction
sentences = [
    "gw lg di rmh",
    "gmn kabar lo?",
    "knp gk dtg?",
    "jgn lupa ya",
]
print("Sentence Correction:")
for sent in sentences:
    corrected = spell.correct_sentence(sent)
    print(f"Input : {sent}")
    print(f"Output: {corrected}")
    print()
Output:
Word Correction:
  sya        → saya
  tdk        → tidak
  gk         → tidak
  org        → orang
  yg         → yang
  dgn        → dengan
============================================================
Sentence Correction:
Input : gw lg di rmh
Output: gue lagi di rumah
Input : gmn kabar lo?
Output: gimana kabar kamu?
Input : knp gk dtg?
Output: kenapa tidak datang?
Input : jgn lupa ya
Output: jangan lupa ya
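Slang normalization like this is essentially a per-token dictionary lookup against the bundled slang dataset. A minimal sketch with a hand-inlined sample (a real implementation would also strip punctuation before lookup, so tokens like "lo?" map correctly):

```python
# Tiny hand-inlined sample of the slang -> formal mappings shown above
# (the bundled dataset has a few thousand entries).
SLANG = {
    "gw": "gue", "lg": "lagi", "rmh": "rumah", "gmn": "gimana",
    "lo": "kamu", "knp": "kenapa", "gk": "tidak", "dtg": "datang",
}

def normalize_slang(sentence: str) -> str:
    # Look each whitespace token up in the dictionary; unknown tokens pass through.
    return " ".join(SLANG.get(tok, tok) for tok in sentence.split())

print(normalize_slang("gw lg di rmh"))  # -> "gue lagi di rumah"
```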
Example 3.3: Complete Text Normalization Pipeline
from nahiarhdNLP.preprocessing import Pipeline
# Comprehensive normalization pipeline
config = {
    "clean_html": True,
    "clean_mentions": True,
    "clean_hashtags": True,
    "remove_urls": True,
    "remove_emoji": True,
    "remove_extra_spaces": True,
    "remove_repeated_chars": True,
    "spell_corrector_sentence": True,
    "remove_lowercase": True,
}
pipeline = Pipeline(config)

# Messy Indonesian text
text = """
Haiii @temans!! 😍 Kmrn gw udh coba apps baruu loh di https://example.com
#KerenBanget #Recommended Gkkkk nyesel dehhhh!!! 👍👍
"""
result = pipeline.process(text)
print("=" * 70)
print("ORIGINAL TEXT:")
print(text)
print("=" * 70)
print("NORMALIZED TEXT:")
print(result)
print("=" * 70)
Output:
======================================================================
ORIGINAL TEXT:
Haiii @temans!! 😍 Kmrn gw udh coba apps baruu loh di https://example.com
#KerenBanget #Recommended Gkkkk nyesel dehhhh!!! 👍👍
======================================================================
NORMALIZED TEXT:
haiii temans!! kemarin gue sudah coba apps baruu loh di kerenbangett recommendedd gkk nyesell dehh!!!
======================================================================
4. Linguistic Processing
Example 4.1: Stemming
from nahiarhdNLP.preprocessing.linguistic.stemmer import Stemmer

stemmer = Stemmer()

# Test various Indonesian words
words = [
    "bermain",        # playing
    "berlari",        # running
    "kebahagiaan",    # happiness
    "pembelajaran",   # learning
    "menyenangkan",   # enjoyable
    "berkomunikasi",  # communicate
]
print("Indonesian Stemming:")
print(f"{'Word':<20} → {'Stem'}")
print("-" * 40)
for word in words:
    stem = stemmer.stem(word)
    print(f"{word:<20} → {stem}")

print("\n" + "=" * 60 + "\n")

# Sentence stemming
sentences = [
    "Saya sedang belajar pemrograman Python",
    "Mereka bermain bola di lapangan",
    "Kebahagiaan adalah kunci kesuksesan",
]
print("Sentence Stemming:")
for sent in sentences:
    stemmed = stemmer.stem(sent)
    print(f"Input : {sent}")
    print(f"Output: {stemmed}")
    print()
Output:
Indonesian Stemming:
Word                 → Stem
----------------------------------------
bermain              → main
berlari              → lari
kebahagiaan          → bahagia
pembelajaran         → ajar
menyenangkan         → senang
berkomunikasi        → komunikasi
============================================================
Sentence Stemming:
Input : Saya sedang belajar pemrograman Python
Output: saya sedang ajar program python
Input : Mereka bermain bola di lapangan
Output: mereka main bola di lapang
Input : Kebahagiaan adalah kunci kesuksesan
Output: bahagia adalah kunci sukses
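The stemmer relies on Sastrawi (a declared dependency), which implements the full Nazief-Adriani algorithm. As a rough intuition for what affix stripping does, here is a deliberately naive, dependency-free sketch that handles only the simple cases above:

```python
# Deliberately naive affix stripping, for intuition only: strip at most one
# known prefix, then one known suffix, keeping the stem at least 3 letters.
def naive_stem(word: str) -> str:
    for prefix in ("ber", "ter", "per", "ke", "se", "di"):
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in ("kan", "an"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word

print(naive_stem("bermain"))      # -> "main"
print(naive_stem("berlari"))      # -> "lari"
print(naive_stem("kebahagiaan"))  # -> "bahagia"
```

Forms like "menyenangkan" need the morphophonemic rules (e.g. "meny-" restoring an initial "s") that a real stemmer provides, which is why the library defers to Sastrawi.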
Example 4.2: Stopword Removal
from nahiarhdNLP.preprocessing.linguistic.stopword import StopwordRemover

stopword = StopwordRemover()
stopword._load_data()

# Test sentences
sentences = [
    "Saya sedang belajar bahasa pemrograman Python untuk data science",
    "Mereka akan pergi ke pasar besok pagi",
    "Ini adalah contoh kalimat dengan banyak stopwords yang harus dihapus",
]
print("Stopword Removal:")
print("=" * 70)
for sent in sentences:
    cleaned = stopword.remove_stopwords(sent)
    print(f"Original: {sent}")
    print(f"Cleaned : {cleaned}")
    print("-" * 70)
Output:
Stopword Removal:
======================================================================
Original: Saya sedang belajar bahasa pemrograman Python untuk data science
Cleaned : belajar bahasa pemrograman Python data science
----------------------------------------------------------------------
Original: Mereka akan pergi ke pasar besok pagi
Cleaned : pasar besok pagi
----------------------------------------------------------------------
Original: Ini adalah contoh kalimat dengan banyak stopwords yang harus dihapus
Cleaned : contoh kalimat stopwords dihapus
----------------------------------------------------------------------
Example 4.3: Complete Linguistic Pipeline
from nahiarhdNLP.preprocessing import Pipeline
# Linguistic processing pipeline
config = {
    "remove_lowercase": True,
    "stopword": True,
    "stem": True,
    "remove_extra_spaces": True,
}
pipeline = Pipeline(config)

texts = [
    "Saya sedang mengembangkan aplikasi pembelajaran online",
    "Mereka bermain musik dengan sangat menyenangkan",
    "Kebahagiaan adalah perjalanan bukan tujuan",
]
print("Complete Linguistic Processing:")
print("=" * 70)
for text in texts:
    result = pipeline.process(text)
    print(f"Original : {text}")
    print(f"Processed: {result}")
    print("-" * 70)
Output:
Complete Linguistic Processing:
======================================================================
Original : Saya sedang mengembangkan aplikasi pembelajaran online
Processed: kembang aplikasi ajar online
----------------------------------------------------------------------
Original : Mereka bermain musik dengan sangat menyenangkan
Processed: main musik senang
----------------------------------------------------------------------
Original : Kebahagiaan adalah perjalanan bukan tujuan
Processed: bahagia jalan tuju
----------------------------------------------------------------------
Example 4.4: Tokenization
from nahiarhdNLP.preprocessing.tokenization.tokenizer import Tokenizer

tokenizer = Tokenizer()

texts = [
    "Ini adalah contoh kalimat sederhana",
    "Python, Java, dan JavaScript adalah bahasa pemrograman",
    "Email: test@example.com, Website: https://example.com",
]
print("Tokenization Examples:")
print("=" * 70)
for text in texts:
    tokens = tokenizer.tokenize(text)
    print(f"Text : {text}")
    print(f"Tokens: {tokens}")
    print("-" * 70)
Output:
Tokenization Examples:
======================================================================
Text : Ini adalah contoh kalimat sederhana
Tokens: ['Ini', 'adalah', 'contoh', 'kalimat', 'sederhana']
----------------------------------------------------------------------
Text : Python, Java, dan JavaScript adalah bahasa pemrograman
Tokens: ['Python', ',', 'Java', ',', 'dan', 'JavaScript', 'adalah', 'bahasa', 'pemrograman']
----------------------------------------------------------------------
Text : Email: test@example.com, Website: https://example.com
Tokens: ['Email', ':', 'test@example.com', ',', 'Website', ':', 'https://example.com']
----------------------------------------------------------------------
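Note how emails and URLs survive as single tokens while punctuation is split off words. A hedged regex sketch of that behavior (not the library's actual tokenizer):

```python
import re

# Order matters in the alternation: URLs and emails are matched first so they
# survive as single tokens, then words, then any leftover punctuation character.
TOKEN_RE = re.compile(r"https?://\S+|\S+@\S+\.\w+|\w+|[^\w\s]")

def tokenize(text: str):
    return TOKEN_RE.findall(text)

print(tokenize("Email: test@example.com, Website: https://example.com"))
# -> ['Email', ':', 'test@example.com', ',', 'Website', ':', 'https://example.com']
```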
5. Text Replacement
Example 5.1: Email, Link, and Mention Replacement
from nahiarhdNLP.preprocessing import Pipeline
# Configure replacement pipeline
config = {
    "replace_email": True,
    "replace_link": True,
    "replace_user": True,
}
pipeline = Pipeline(config)

examples = [
    "Contact me at john.doe@gmail.com for more info",
    "Visit https://github.com/nahiarhd for the code",
    "Thanks @john and @jane for your help!",
    "Email: info@company.com | Web: https://company.com | Twitter: @company",
]
print("Text Replacement:")
print("=" * 70)
for text in examples:
    result = pipeline.process(text)
    print(f"Input : {text}")
    print(f"Output: {result}")
    print("-" * 70)
Output:
Text Replacement:
======================================================================
Input : Contact me at john.doe@gmail.com for more info
Output: Contact me at <email> for more info
----------------------------------------------------------------------
Input : Visit https://github.com/nahiarhd for the code
Output: Visit <link> for the code
----------------------------------------------------------------------
Input : Thanks @john and @jane for your help!
Output: Thanks <user> and <user> for your help!
----------------------------------------------------------------------
Input : Email: info@company.com | Web: https://company.com | Twitter: @company
Output: Email: <email> | Web: <link> | Twitter: <user>
----------------------------------------------------------------------
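Replacements like these are plain regex substitutions where order matters: emails must be replaced before mentions, or the @ inside an address would be misread as a mention. A minimal sketch (the patterns are assumptions, not the library's exact ones):

```python
import re

def anonymize(text: str) -> str:
    text = re.sub(r"\S+@\S+\.\w+", "<email>", text)  # emails first (they contain @)
    text = re.sub(r"https?://\S+", "<link>", text)   # then full URLs
    text = re.sub(r"@\w+", "<user>", text)           # then bare @mentions
    return text

print(anonymize("Thanks @john and @jane for your help!"))
# -> "Thanks <user> and <user> for your help!"
```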
Example 5.2: Data Anonymization Pipeline
from nahiarhdNLP.preprocessing import Pipeline
# Complete anonymization pipeline
config = {
    "replace_email": True,
    "replace_link": True,
    "replace_user": True,
    "remove_phones": True,
    "clean_html": True,
}
pipeline = Pipeline(config)

# Sensitive data example
text = """
<div class="contact">
Customer: @johndoe
Email: john.doe@email.com
Phone: +62-812-3456-7890
Website: https://customer-site.com
</div>
"""
result = pipeline.process(text)
print("DATA ANONYMIZATION")
print("=" * 70)
print("ORIGINAL:")
print(text)
print("=" * 70)
print("ANONYMIZED:")
print(result)
print("=" * 70)
Output:
DATA ANONYMIZATION
======================================================================
ORIGINAL:
<div class="contact">
Customer: @johndoe
Email: john.doe@email.com
Phone: +62-812-3456-7890
Website: https://customer-site.com
</div>
======================================================================
ANONYMIZED:
Customer: <user> Email: <email> Phone: Website: <link>
======================================================================
6. Dataset Loaders
Example 6.1: Loading Built-in Datasets
from nahiarhdNLP.datasets import DatasetLoader

loader = DatasetLoader()

# Load stopwords
stopwords = loader.load_stopwords_dataset()
print("📚 Stopwords Dataset:")
print(f"  Total words: {len(stopwords)}")
print(f"  Sample: {stopwords[:10]}")
print()

# Load slang dictionary
slang_dict = loader.load_slang_dataset()
print("💬 Slang Dictionary:")
print(f"  Total entries: {len(slang_dict)}")
print("  Sample mappings:")
for slang, formal in list(slang_dict.items())[:5]:
    print(f"    {slang:10s} → {formal}")
print()

# Load emoji dictionary
emoji_dict = loader.load_emoji_dataset()
print("😀 Emoji Dictionary:")
print(f"  Total emojis: {len(emoji_dict)}")
print("  Sample mappings:")
for emoji, text in list(emoji_dict.items())[:5]:
    print(f"    {emoji:5s} → {text}")
print()

# Load wordlist
wordlist = loader.load_wordlist_dataset()
print("📖 Wordlist Dataset:")
print(f"  Total words: {len(wordlist)}")
print(f"  Sample: {wordlist[:10]}")
Output:
📚 Stopwords Dataset:
  Total words: 758
  Sample: ['ada', 'adalah', 'adanya', 'adapun', 'agak', 'agaknya', 'agar', 'akan', 'akankah', 'akhir']
💬 Slang Dictionary:
  Total entries: 3592
  Sample mappings:
    gw         → gue
    lo         → kamu
    gak        → tidak
    yg         → yang
    dgn        → dengan
😀 Emoji Dictionary:
  Total emojis: 1800
  Sample mappings:
    😊    → wajah_tersenyum
    😄    → wajah_gembira
    😂    → tertawa_terbahak
    🤣    → tertawa_guling
    😁    → senyum_lebar
📖 Wordlist Dataset:
  Total words: 28526
  Sample: ['a', 'aa', 'aaa', 'aaai', 'aai', 'aak', 'aal', 'aalim', 'aam', 'aan']
⚙️ Pipeline Configuration Options
Complete Configuration Reference
config = {
    # ===== TEXT CLEANING =====
    # HTML & Tags
    "clean_html": True,               # Remove HTML tags

    # URLs
    "remove_urls": True,              # Remove complete URLs
    "clean_urls": True,               # Remove URL protocols (http://, https://)

    # Social Media
    "remove_mentions": True,          # Remove @mentions completely
    "clean_mentions": True,           # Remove @ but keep username
    "remove_hashtags": True,          # Remove #hashtags completely
    "clean_hashtags": True,           # Remove # but keep tag text

    # Content Removal
    "remove_emoji": True,             # Remove emoji characters
    "remove_punctuation": True,       # Remove punctuation marks
    "remove_numbers": True,           # Remove numbers
    "remove_emails": True,            # Remove email addresses
    "remove_phones": True,            # Remove phone numbers
    "remove_currency": True,          # Remove currency symbols

    # Text Cleaning
    "remove_special_chars": True,     # Remove special characters
    "remove_extra_spaces": True,      # Normalize whitespace
    "remove_repeated_chars": True,    # Normalize repeated characters (e.g., "haiiii" → "haii")
    "remove_whitespace": True,        # Clean tabs, newlines, etc.
    "remove_lowercase": True,         # Convert to lowercase

    # ===== TEXT NORMALIZATION =====
    "emoji_to_text": True,            # Convert emojis to text description
    "text_to_emoji": True,            # Convert text to emojis
    "spell_corrector_word": True,     # Correct spelling for single words
    "spell_corrector_sentence": True, # Correct spelling for sentences

    # ===== LINGUISTIC PROCESSING =====
    "stem": True,                     # Apply stemming (reduce to root form)
    "stopword": True,                 # Remove stopwords
    "tokenizer": True,                # Tokenize text

    # ===== TEXT REPLACEMENT =====
    "replace_email": True,            # Replace emails with <email>
    "replace_link": True,             # Replace URLs with <link>
    "replace_user": True,             # Replace mentions with <user>
}
Configuration Tips
- For Social Media: Use clean_* instead of remove_* to keep the text content
- For Formal Text: Use spell_corrector_sentence to normalize slang
- For ML/NLP: Combine stem, stopword, and remove_lowercase
- For Anonymization: Use the replace_* options
📘 API Documentation
Pipeline Class
class Pipeline:
    """
    Configurable text preprocessing pipeline for Indonesian text.

    Args:
        config (dict): Dictionary of preprocessing steps {step_name: True/False}

    Methods:
        process(text: str) -> str: Process text through the pipeline
        update_config(new_config: dict) -> None: Update pipeline configuration
        get_enabled_steps() -> list: Get list of enabled processing steps
        __call__(text: str) -> str: Allow pipeline to be called as a function

    Example:
        >>> config = {"clean_html": True, "stopword": True}
        >>> pipeline = Pipeline(config)
        >>> result = pipeline.process("<p>Saya sedang belajar NLP</p>")
        >>> # or use as callable
        >>> result = pipeline("<p>Saya sedang belajar NLP</p>")
    """
Available Processing Steps
See Pipeline Configuration Options for complete list.
🛠️ Development
Running Tests
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Run tests with coverage
pytest --cov=nahiarhdNLP --cov-report=html
# Run specific test file
pytest nahiarhdNLP/tests/test_pipeline.py
Code Formatting
# Format code with black
black nahiarhdNLP/
# Sort imports with isort
isort nahiarhdNLP/
# Lint with flake8
flake8 nahiarhdNLP/
Building Package
# Install build tools
pip install build twine
# Build distributions
python -m build
# Upload to TestPyPI
twine upload --repository testpypi dist/*
# Upload to PyPI
twine upload dist/*
🤝 Contributing
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
Development Guidelines
- Follow PEP 8 style guide
- Add tests for new features
- Update documentation
- Add examples for new functionality
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
👤 Author
Raihan Hidayatullah Djunaedi
- Email: raihanhd.dev@gmail.com
- GitHub: @raihanhd12
🙏 Acknowledgments
- Sastrawi - Indonesian stemming library
- Indonesian NLP Community - For datasets and inspiration
- All contributors who helped improve this library
Made with ❤️ for the Indonesian NLP Community