Advanced Indonesian Natural Language Processing Library

nahiarhdNLP

nahiarhdNLP is an advanced Python library that aims to simplify your NLP projects, with features that are kept up to date.

Installation

nahiarhdNLP can be installed with pip:

$ pip install nahiarhdNLP

Or clone this repository:

git clone https://github.com/nahiarhdNLP/nahiarhdNLP.git
cd nahiarhdNLP
pip install -r requirements.txt

Preprocessing

The nahiarhdNLP.preprocessing module provides common functions for preparing and transforming raw text data for use in a specific context. The examples below import from src.preprocessing, which matches the repository layout shown under Directory Structure.

Generics

remove_html

Removes HTML tags from the text.

>>> from src.preprocessing import remove_html
>>> remove_html("website <a href='https://google.com'>google</a>")
"website google"
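
For readers curious how tag stripping can work, here is a minimal standard-library sketch. It is not the library's implementation: strip_tags is a hypothetical name, and the regex approach is an assumption that suits simple markup rather than malformed HTML.

```python
import re

# Hypothetical helper, not part of nahiarhdNLP.
# Matches anything that looks like an HTML tag: "<", non-">" characters, ">".
_TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text: str) -> str:
    """Remove HTML tags and keep the text between them."""
    return _TAG_RE.sub("", text)


print(strip_tags("website <a href='https://google.com'>google</a>"))
```

A real HTML parser is the safer choice when the input may contain comments, scripts, or unbalanced tags.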

remove_url

Removes URLs from the text.

>>> from src.preprocessing import remove_url
>>> remove_url("retrieved from https://gist.github.com/gruber/8891611")
"retrieved from "
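
The same idea can be sketched for URLs. The pattern below is an assumption chosen for illustration (it treats anything starting with http://, https://, or www. as a URL), and strip_urls is a hypothetical name, not the library's function.

```python
import re

# Hypothetical pattern: a scheme or "www." followed by non-whitespace characters.
_URL_RE = re.compile(r"(?:https?://|www\.)\S+")


def strip_urls(text: str) -> str:
    """Remove URL-like substrings; surrounding spaces are left untouched."""
    return _URL_RE.sub("", text)


print(repr(strip_urls("retrieved from https://gist.github.com/gruber/8891611")))
```

Note that the trailing space before the URL survives, as it does in the example output above.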

remove_stopwords

Stopwords are words that are ignored during processing and are usually kept in stop lists. A stop list contains common words that serve a grammatical function but carry little meaning of their own.

Removes stopwords from the text. The Indonesian stopword list comes from a HuggingFace dataset.

>>> from src.preprocessing import remove_stopwords
>>> remove_stopwords("siapa yang suruh makan?!!")
"suruh makan?!!"
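
A minimal sketch of stop-list filtering, assuming whitespace tokenization and a tiny hand-written stop list. STOPWORDS and drop_stopwords are hypothetical names; the library's own list is much larger and loaded from a dataset.

```python
# Tiny hypothetical stop list, for illustration only.
STOPWORDS = {"siapa", "yang", "dan", "di"}


def drop_stopwords(text: str, stopwords=STOPWORDS) -> str:
    """Keep only whitespace-separated tokens that are not in the stop list."""
    kept = [token for token in text.split() if token.lower() not in stopwords]
    return " ".join(kept)


print(drop_stopwords("siapa yang suruh makan?!!"))
```

Because tokens are compared whole, a stopword with punctuation attached (for example "yang,") would not be removed by this sketch.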

replace_slang

Replaces slang words with their formal equivalents without changing their meaning. The Indonesian slang word list comes from a HuggingFace dataset.

>>> from src.preprocessing import replace_slang
>>> replace_slang("emg siapa yg nanya?")
"memang siapa yang bertanya?"
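
Slang normalization of this kind is often a dictionary lookup per word. The sketch below assumes a three-entry mapping (SLANG) and a hypothetical function name (normalize_slang); it only illustrates the lookup and does not reflect how the library stores or applies its list.

```python
import re

# Hypothetical three-entry mapping, for illustration only.
SLANG = {"emg": "memang", "yg": "yang", "nanya": "bertanya"}


def normalize_slang(text: str, mapping=SLANG) -> str:
    """Replace each word found in the mapping; leave everything else as-is."""
    def swap(match: re.Match) -> str:
        word = match.group(0)
        return mapping.get(word.lower(), word)

    # \w+ isolates words, so attached punctuation such as "?" is preserved.
    return re.sub(r"\w+", swap, text)


print(normalize_slang("emg siapa yg nanya?"))
```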

replace_word_elongation

Word elongation is the practice of adding repeated letters to a word, usually at the end.

Handling word elongation:

>>> from src.preprocessing import replace_word_elongation
>>> replace_word_elongation("kenapaaa?")
"kenapaa?"
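
One rule that reproduces the output above is to collapse any character repeated three or more times down to two. That rule is inferred from the example, not confirmed from the library's source, and collapse_elongation is a hypothetical name.

```python
import re

# (.)\1{2,} matches one character followed by two or more copies of itself.
_ELONGATION_RE = re.compile(r"(.)\1{2,}")


def collapse_elongation(text: str) -> str:
    """Shorten runs of 3+ identical characters to exactly two."""
    return _ELONGATION_RE.sub(r"\1\1", text)


print(collapse_elongation("kenapaaa?"))
```

Keeping two characters rather than one avoids damaging words that legitimately contain a doubled letter.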

Individual Cleaning Functions

remove_mentions

Removes mentions (@username) from the text.

>>> from src.preprocessing import remove_mentions
>>> remove_mentions("Halo @user123 dan @admin, apa kabar?")
"Halo dan , apa kabar?"

remove_hashtags

Removes hashtags (#tag) from the text.

>>> from src.preprocessing import remove_hashtags
>>> remove_hashtags("Hari ini #senin #libur #weekend")
"Hari ini"
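
Mentions and hashtags share the same shape (a marker followed by word characters), so both can be sketched with one pattern. The function name strip_social_tokens and the whitespace cleanup at the end are assumptions made for this illustration.

```python
import re

# "@" or "#" followed by one or more word characters.
_SOCIAL_RE = re.compile(r"[@#]\w+")


def strip_social_tokens(text: str) -> str:
    """Remove @mentions and #hashtags, then tidy the leftover spaces."""
    without_tokens = _SOCIAL_RE.sub("", text)
    # split()/join collapses the gaps left behind and trims both ends.
    return " ".join(without_tokens.split())


print(strip_social_tokens("Halo @user123 dan @admin, apa kabar?"))
print(strip_social_tokens("Hari ini #senin #libur #weekend"))
```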

remove_numbers

Removes numbers from the text.

>>> from src.preprocessing import remove_numbers
>>> remove_numbers("Saya berumur 25 tahun dan punya 3 anak")
"Saya berumur tahun dan punya anak"

remove_punctuation

Removes punctuation from the text.

>>> from src.preprocessing import remove_punctuation
>>> remove_punctuation("Halo, apa kabar?! Semoga sehat selalu...")
"Halo apa kabar Semoga sehat selalu"
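
A common way to drop punctuation in plain Python is a translation table. This sketch assumes ASCII punctuation only (string.punctuation); strip_punctuation is a hypothetical name and the library may treat other symbols differently.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Delete ASCII punctuation; letters, digits, and spaces are kept."""
    return text.translate(_PUNCT_TABLE)


print(strip_punctuation("Halo, apa kabar?! Semoga sehat selalu..."))
```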

remove_extra_spaces

Removes extra spaces from the text.

>>> from src.preprocessing import remove_extra_spaces
>>> remove_extra_spaces("Halo    dunia   yang    indah")
"Halo dunia yang indah"

remove_special_chars

Removes special characters that are not alphanumeric or whitespace. In the example below, parentheses and exclamation marks are left in place.

>>> from src.preprocessing import remove_special_chars
>>> remove_special_chars("Halo @#$%^&*() dunia!!!")
"Halo () dunia!!!"

remove_whitespace

Cleans up whitespace characters (tabs, newlines, and so on).

>>> from src.preprocessing import remove_whitespace
>>> remove_whitespace("Halo\n\tdunia\r\nyang indah")
"Halo dunia yang indah"
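
Both whitespace helpers above can be approximated with a single split-and-join, since str.split() with no arguments treats any run of spaces, tabs, or newlines as one separator. normalize_whitespace is a hypothetical name used only for this sketch.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into one space and trim the ends."""
    return " ".join(text.split())


print(normalize_whitespace("Halo\n\tdunia\r\nyang indah"))
print(normalize_whitespace("Halo    dunia   yang    indah"))
```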

to_lowercase

Converts the text to lowercase.

>>> from src.preprocessing import to_lowercase
>>> to_lowercase("HALO Dunia Yang INDAH")
"halo dunia yang indah"

Emoji

Preprocesses text that contains emoji.

emoji_to_words

Converts emoji in a text into the words that describe them.

>>> from src.preprocessing import emoji_to_words
>>> emoji_to_words("emoji 😀😁")
"emoji wajah_gembira wajah_gembira_dengan_mata_bahagia"
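
Conceptually this is a table lookup from emoji to name. The sketch below assumes a two-entry table (EMOJI_NAMES) copied from the example output and a hypothetical function name (emoji_to_names); the real table is credited to a HuggingFace dataset in the Acknowledgments.

```python
# Hypothetical two-entry table, taken from the example output above.
EMOJI_NAMES = {
    "😀": "wajah_gembira",
    "😁": "wajah_gembira_dengan_mata_bahagia",
}


def emoji_to_names(text: str, table=EMOJI_NAMES) -> str:
    """Replace each known emoji with its name, separated by single spaces."""
    for symbol, name in table.items():
        # Pad with spaces so adjacent emoji do not run together.
        text = text.replace(symbol, f" {name} ")
    return " ".join(text.split())


print(emoji_to_names("emoji 😀😁"))
```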

words_to_emoji

Converts emoji code words back into emoji.

>>> from src.preprocessing import words_to_emoji
>>> words_to_emoji("emoji wajah_gembira")
"emoji 😀"

Pipelining

Builds a pipeline from a sequence of preprocessing functions:

>>> from src.preprocessing import pipeline, replace_word_elongation, replace_slang
>>> pipe = pipeline([replace_word_elongation, replace_slang])
>>> pipe("Knp emg gk mw makan kenapaaa???")
"Kenapa memang tidak mau makan kenapa??"
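
A pipeline like this is function composition: each step receives the previous step's output. The sketch below shows one way to build that with functools.reduce. make_pipeline is a hypothetical name, and the steps used here are plain str methods rather than the library's functions.

```python
from functools import reduce
from typing import Callable, Iterable

TextStep = Callable[[str], str]


def make_pipeline(steps: Iterable[TextStep]) -> TextStep:
    """Return a function that applies each step to the text, in order."""
    ordered = list(steps)  # freeze the sequence so the pipeline is reusable

    def run(text: str) -> str:
        return reduce(lambda acc, step: step(acc), ordered, text)

    return run


pipe = make_pipeline([str.strip, str.lower])
print(pipe("  Halo Dunia  "))
```

Order matters: a step that lowercases text before a case-sensitive lookup will change what that lookup can match.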

Preprocessing All-in-One

The preprocess function provides complete preprocessing with a range of options:

>>> from src.preprocessing import preprocess
>>> preprocess("Halooo emg siapa yg nanya? 😀")
"halo wajah_gembira"

With customization options:

>>> from src.preprocessing import preprocess
>>> preprocess(
...     "Halooo emg siapa yg nanya? 😀",
...     remove_html_tags=True,
...     remove_urls=True,
...     remove_stopwords_flag=True,
...     replace_slang_flag=True,
...     replace_elongation=True,
...     convert_emoji=True,
...     to_lowercase=True
... )
"halo wajah_gembira"

Additional Functions

Tokenization

>>> from src.preprocessing import tokenize
>>> tokenize("Saya suka makan nasi")
['Saya', 'suka', 'makan', 'nasi']
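
For comparison, a simple regex tokenizer can be sketched in a few lines. The pattern is an assumption (words as runs of word characters, each punctuation mark as its own token), and simple_tokenize is a hypothetical name; the library's tokenizer may split differently.

```python
import re

# A token is either a run of word characters or a single non-space symbol.
_TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def simple_tokenize(text: str) -> list:
    """Split text into word and punctuation tokens."""
    return _TOKEN_RE.findall(text)


print(simple_tokenize("Saya suka makan nasi"))
```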

Text Cleaning

>>> from src.preprocessing import clean_text
>>> clean_text("Halooo!!! @user #trending https://example.com 😀")
"haloo!!"

Spell Correction

>>> from src.preprocessing import correct_spelling
>>> correct_spelling("sya suka mkn nasi")
"saya suka makan nasi"

Stemming

>>> from src.preprocessing import stem_text
>>> stem_text("bermain-main dengan senang")
"main dengan senang"

Advanced Usage

Using the Classes Directly

If you need finer control, you can use the classes directly:

from src.preprocessing import TextCleaner, StopwordRemover, SlangNormalizer

# Initialize with custom options
cleaner = TextCleaner(
    remove_urls=True,
    remove_mentions=True,
    remove_hashtags=True,
    lowercase=True
)

stopword_remover = StopwordRemover(language="indonesian")
slang_normalizer = SlangNormalizer(language="indonesian")

# Use
text = "Halooo @user ini contoh teks!!! https://example.com"
cleaned = cleaner.clean(text)
no_stopwords = stopword_remover.remove_stopwords(cleaned)
formal = slang_normalizer.normalize(no_stopwords)

Custom Pipeline

from src.preprocessing import pipeline
from src.preprocessing import (
    remove_html,
    remove_url,
    replace_word_elongation,
    emoji_to_words,
    replace_slang,
    remove_stopwords,
    # Individual functions
    remove_mentions,
    remove_hashtags,
    remove_numbers,
    remove_punctuation,
    remove_extra_spaces,
    to_lowercase
)

# Build a custom pipeline
custom_pipe = pipeline([
    remove_html,
    remove_url,
    remove_mentions,
    remove_hashtags,
    remove_numbers,
    replace_word_elongation,
    emoji_to_words,
    replace_slang,
    remove_stopwords,
    remove_punctuation,
    remove_extra_spaces,
    to_lowercase
])

# Use
result = custom_pipe("Halooo emg siapa yg nanya? 😀 <a href='#'>link</a> @user #trending 123")
print(result)

Using Individual Functions

from src.preprocessing import (
    remove_mentions,
    remove_hashtags,
    remove_numbers,
    remove_punctuation,
    remove_extra_spaces,
    to_lowercase
)

# Use the individual functions
text = "Halo @user123 #trending! Saya berumur 25 tahun..."
text = remove_mentions(text)  # "Halo #trending! Saya berumur 25 tahun..."
text = remove_hashtags(text)  # "Halo ! Saya berumur 25 tahun..."
text = remove_numbers(text)   # "Halo ! Saya berumur tahun..."
text = remove_punctuation(text)  # "Halo  Saya berumur tahun"
text = remove_extra_spaces(text)  # "Halo Saya berumur tahun"
text = to_lowercase(text)     # "halo saya berumur tahun"

Requirements

Python 3.7+
datasets
requests
rich (for nicer output)

For optional features:

Sastrawi (for stemming): pip install Sastrawi
pyspellchecker (for spell correction): pip install pyspellchecker

Testing

To run all tests:

pytest tests/

Demo

To see a full demo of the library:

python main.py

Directory Structure

nahiarhdNLP/
├── main.py                    # Demo application
├── requirements.txt           # Dependencies
├── README.md                  # Documentation
├── src/
│   ├── __init__.py           # Main module
│   ├── preprocessing/         # Preprocessing module
│   │   ├── __init__.py       # Exports all functions
│   │   ├── utils.py          # Main wrapper functions
│   │   ├── cleaning/         # Text cleaning
│   │   ├── normalization/    # Text normalization
│   │   ├── linguistic/       # Linguistic processing
│   │   └── tokenization/     # Tokenization
│   └── mydatasets/           # Dataset loader
└── tests/                    # Unit tests

Contributing

Contributions are very welcome! Fork this repository, create a new branch, and submit a pull request.

License

MIT License

Acknowledgments

Stopwords dataset from HuggingFace
Emoji dataset from HuggingFace
Slang dataset from HuggingFace
Sastrawi for Indonesian stemming

Release history

1.5.3

Jan 8, 2026

1.5.2

Jan 8, 2026

1.5.1

Dec 17, 2025

1.5

Dec 17, 2025

1.4.11

Dec 17, 2025

1.4.10

Dec 17, 2025

1.4.9

Dec 16, 2025

1.4.8

Dec 16, 2025

1.4.6

Sep 3, 2025

1.4.5

Sep 2, 2025

1.4.4

Sep 2, 2025

1.4.3

Sep 2, 2025

1.4.2

Sep 2, 2025

1.4.1

Sep 2, 2025

1.4.0

Sep 2, 2025

1.3.2

Jul 28, 2025

1.3.1

Jul 28, 2025

1.2.6

Jul 28, 2025

1.2.5

Jul 25, 2025

1.2.4

Jul 24, 2025

1.2.3

Jul 24, 2025

1.2.2

Jul 24, 2025

1.2.1

Jul 24, 2025

1.2.0

Jul 24, 2025

1.1.1

Jul 24, 2025

1.1.0

Jul 24, 2025

1.0.7

Jul 18, 2025

1.0.6

Jul 18, 2025

1.0.5

Jul 18, 2025

1.0.4

Jul 18, 2025

1.0.3

Jul 18, 2025

1.0.2

Jul 18, 2025

1.0.1

Jul 17, 2025

This version

1.0.0

Jul 17, 2025

Download files

Source Distribution

nahiarhdnlp-1.0.0.tar.gz (560.5 kB)

Uploaded Jul 17, 2025 Source

Built Distribution

nahiarhdnlp-1.0.0-py3-none-any.whl (565.3 kB)

Uploaded Jul 17, 2025 Python 3

File details

Details for the file nahiarhdnlp-1.0.0.tar.gz.

File metadata

Download URL: nahiarhdnlp-1.0.0.tar.gz
Upload date: Jul 17, 2025
Size: 560.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.6

File hashes

Hashes for nahiarhdnlp-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`b50329a5b22d8d9a4c3fcfbc0e94b64dd808fc06453c6a7a5ec06b441e3e992e`
MD5	`1810130c41496f9f46d43b2d873f58aa`
BLAKE2b-256	`faa3f78967ea7f1e7587039b4bd8183fcebff2211cb0aa1449c6ae8cf2f61d69`

File details

Details for the file nahiarhdnlp-1.0.0-py3-none-any.whl.

File metadata

Download URL: nahiarhdnlp-1.0.0-py3-none-any.whl
Upload date: Jul 17, 2025
Size: 565.3 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.6

File hashes

Hashes for nahiarhdnlp-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`509676ab3eea78ad00d70be09776c5c860ff5dd4fceddf3dabf3f722dd9dd723`
MD5	`6956ecb33e5ced9011767f3ce6ad48d4`
BLAKE2b-256	`77f06e27fe9b5f9df02c1bf9cee2eddb747ac4e6f3c528400f622575126e1fe8`
