kumparan's NLP Services

nlp-id is a collection of modules that provide Natural Language Processing functions for Bahasa Indonesia. This repository contains all source code related to kumparan's NLP services.

Installation

To install nlp-id, use the following command:

$ pip install nlp-id

Usage

This section explains how to use the lemmatizer, tokenizer, POS tagger, and stopword utilities.

Lemmatizer

The lemmatizer returns the root word of every word in a sentence.

from nlp_id.lemmatizer import Lemmatizer 
lemmatizer = Lemmatizer() 
lemmatizer.lemmatize('Saya sedang mencoba') 
# saya sedang coba

Tokenizer

The tokenizer converts text into tokens such as words, punctuation, numbers, dates, emails, and URLs. This repository provides two tokenizers: a standard tokenizer and a phrase tokenizer. The standard tokenizer splits the text into tokens in which every word token is a single word. A token that starts with ku- or ends with -ku, -mu, -nya, -lah, or -kah is split when that affix is a personal pronoun or particle.

from nlp_id.tokenizer import Tokenizer 
tokenizer = Tokenizer() 
tokenizer.tokenize('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.') 
# ['Lionel', 'Messi', 'pergi', 'ke', 'pasar', 'di', 'daerah', 'Jakarta', 'Pusat', '.']

tokenizer.tokenize('Lionel Messi pergi ke rumahmu di daerah Jakarta Pusat.') 
# ['Lionel', 'Messi', 'pergi', 'ke', 'rumah', 'mu', 'di', 'daerah', 'Jakarta', 'Pusat', '.']
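The suffix-splitting rule described above can be pictured with a toy sketch. This is not nlp-id's implementation: the suffix list is taken from the description, while `KNOWN_ROOTS`, `split_clitic`, and `toy_tokenize` are hypothetical names, and the tiny root-word set only stands in for whatever check the library uses to decide whether an ending really is a pronoun or particle. The ku- prefix is left out to keep the sketch short.

```python
import re

# Hypothetical mini lexicon (assumption); a real tokenizer would need a far larger resource.
KNOWN_ROOTS = {"rumah", "buku"}
SUFFIXES = ("ku", "mu", "nya", "lah", "kah")


def split_clitic(token):
    """Split a trailing pronoun/particle only when the remainder is a known root."""
    lower = token.lower()
    for suffix in SUFFIXES:
        root = lower[: -len(suffix)]
        if lower.endswith(suffix) and root in KNOWN_ROOTS:
            return [token[: -len(suffix)], token[-len(suffix):]]
    return [token]


def toy_tokenize(text):
    """Separate words from punctuation, then apply the suffix rule to each piece."""
    tokens = []
    for piece in re.findall(r"\w+|[^\w\s]", text):
        tokens.extend(split_clitic(piece))
    return tokens


print(toy_tokenize("Lionel Messi pergi ke rumahmu di daerah Jakarta Pusat."))
# ['Lionel', 'Messi', 'pergi', 'ke', 'rumah', 'mu', 'di', 'daerah', 'Jakarta', 'Pusat', '.']
print(toy_tokenize("sekolah"))
# ['sekolah']
```

The second call shows why a lexicon check matters: "sekolah" ends in -lah, but "seko" is not a known root, so the word stays whole.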

The phrase tokenizer tokenizes the text into separate tokens where the word tokens are phrases (single or multi-word tokens).

from nlp_id.tokenizer import PhraseTokenizer 
tokenizer = PhraseTokenizer() 
tokenizer.tokenize('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.') 
# ['Lionel Messi', 'pergi', 'ke', 'pasar', 'di', 'daerah', 'Jakarta Pusat', '.']

POS Tagger

The POS tagger assigns a part-of-speech tag to each token in a text. This repository provides two kinds of tagging, both available as methods of PosTag: standard POS tagging (get_pos_tag), where every token is a single word, and phrase POS tagging (get_phrase_tag), where tokens are phrases (single or multi-word tokens).

from nlp_id.postag import PosTag
postagger = PosTag() 
postagger.get_pos_tag('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.') 
# [('Lionel', 'NNP'), ('Messi', 'NNP'), ('pergi', 'VB'), ('ke', 'IN'), ('pasar', 'NN'), ('di', 'IN'), ('daerah', 'NN'),
#  ('Jakarta', 'NNP'), ('Pusat', 'NNP'), ('.', 'SYM')]

postagger.get_phrase_tag('Lionel Messi pergi ke pasar di daerah Jakarta Pusat.') 
# [('Lionel Messi', 'NP'), ('pergi', 'VP'), ('ke', 'IN'), ('pasar', 'NN'), ('di', 'IN'), ('daerah', 'NN'),
#  ('Jakarta Pusat', 'NP'), ('.', 'SYM')]
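To see how word-level and phrase-level tokens relate, the sketch below merges consecutive NNP tokens from the standard output into a single NP token. It is an illustration only and does not claim to be how nlp-id builds phrases; `merge_proper_nouns` is a hypothetical helper, and the input is the standard tagger output shown above, typed in by hand so the snippet runs without the library.

```python
from itertools import groupby

# Standard tagger output from the example above, copied by hand.
tagged = [
    ("Lionel", "NNP"), ("Messi", "NNP"), ("pergi", "VB"), ("ke", "IN"),
    ("pasar", "NN"), ("di", "IN"), ("daerah", "NN"),
    ("Jakarta", "NNP"), ("Pusat", "NNP"), (".", "SYM"),
]


def merge_proper_nouns(tagged_tokens):
    """Join each run of consecutive NNP tokens into one ('...', 'NP') token."""
    merged = []
    for tag, group in groupby(tagged_tokens, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag == "NNP":
            merged.append((" ".join(words), "NP"))
        else:
            merged.extend((word, tag) for word in words)
    return merged


print(merge_proper_nouns(tagged))
# [('Lionel Messi', 'NP'), ('pergi', 'VB'), ('ke', 'IN'), ('pasar', 'NN'),
#  ('di', 'IN'), ('daerah', 'NN'), ('Jakarta Pusat', 'NP'), ('.', 'SYM')]
```

The result matches the phrase output for the two names but keeps 'pergi' as VB, whereas get_phrase_tag labels it VP, so the real phrase tagger clearly does more than this merge.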

Description of tagset used for POS Tagger:

No.	Tag	Description	Example
1	ADV	Adverbs. Includes adverb, modal, and auxiliary verb	sangat, hanya, justru, boleh, harus, mesti
2	CC	Coordinating conjunction. Coordinating conjunction links two or more syntactically equivalent parts of a sentence. Coordinating conjunction can link independent clauses, phrases, or words.	dan, tetapi, atau
3	DT	Determiner/article. A grammatical unit which limits the potential referent of a noun phrase, whose basic role is to mark noun phrases as either definite or indefinite.	para, sang, si, ini, itu, nya
4	FW	Foreign word. A foreign word is a word that comes from a foreign language and is not yet included in the Indonesian dictionary	workshop, business, e-commerce
5	IN	Preposition. A preposition links a word or phrase to the constituent in front of that preposition, forming a prepositional phrase.	dalam, dengan, di, ke
6	JJ	Adjective. Adjectives are words which describe, modify, or specify some properties of the head noun of the phrase	bersih, panjang, jauh, marah
7	NEG	Negation	tidak, belum, jangan
8	NN	Noun. Nouns are words which refer to human, animal, thing, concept, or understanding	meja, kursi, monyet, perkumpulan
9	NNP	Proper Noun. Proper noun is a specific name of a person, thing, place, event, etc.	Indonesia, Jakarta, Piala Dunia, Idul Fitri, Jokowi
10	NUM	Number. Includes cardinal and ordinal number	9876, 2019, 0,5, empat
11	PR	Pronoun. Includes personal pronoun and demonstrative pronoun	saya, kami, kita, kalian, ini, itu, nya, yang
12	RP	Particle. Particle which confirms interrogative, imperative, or declarative sentences	pun, lah, kah
13	SC	Subordinating Conjunction. Subordinating conjunction links two or more clauses and one of the clauses is a subordinate clause.	sejak, jika, seandainya, dengan, bahwa
14	SYM	Symbols and Punctuations	+,%,@
15	UH	Interjection. Interjection expresses feeling or state of mind and has no relation with other words syntactically.	ayo, nah, ah
16	VB	Verb. Includes transitive verbs, intransitive verbs, active verbs, passive verbs, and copulas.	tertidur, bekerja, membaca
17	ADJP	Adjective Phrase. A group of words headed by an adjective that describes a noun or a pronoun	sangat tinggi
18	DP	Date Phrase. Date written with whitespaces	1 Januari 2020
19	NP	Noun Phrase. A phrase that has a noun (or indefinite pronoun) as its head	Jakarta Pusat, Lionel Messi
20	NUMP	Number Phrase.	10 juta
21	VP	Verb Phrase. A syntactic unit composed of at least one verb and its dependents	tidak makan
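Tags 17 to 21 in the table are described as phrases, so one simple use of the tagset is to separate phrase-level tokens from word-level ones. The sketch below assumes that reading of the table; `PHRASE_TAGS` and `split_by_level` are hypothetical names, and the input is the get_phrase_tag output from the earlier example, typed in by hand.

```python
# Tags whose description in the table ends in "Phrase" (assumption based on the table).
PHRASE_TAGS = {"ADJP", "DP", "NP", "NUMP", "VP"}

# Output of get_phrase_tag from the example above, copied by hand.
phrase_tagged = [
    ("Lionel Messi", "NP"), ("pergi", "VP"), ("ke", "IN"), ("pasar", "NN"),
    ("di", "IN"), ("daerah", "NN"), ("Jakarta Pusat", "NP"), (".", "SYM"),
]


def split_by_level(tagged_tokens):
    """Return (word_level, phrase_level) lists, preserving the original order."""
    words = [pair for pair in tagged_tokens if pair[1] not in PHRASE_TAGS]
    phrases = [pair for pair in tagged_tokens if pair[1] in PHRASE_TAGS]
    return words, phrases


words, phrases = split_by_level(phrase_tagged)
print(phrases)
# [('Lionel Messi', 'NP'), ('pergi', 'VP'), ('Jakarta Pusat', 'NP')]
```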

Stopword

nlp-id also provides a list of Indonesian stopwords.

from nlp_id.stopword import StopWord 
stopword = StopWord() 
stopword.get_stopword() 
# [{list_of_nlp_id_stopword}]

Stopword Removal is used to remove every Indonesian stopword from the given text.

from nlp_id.stopword import StopWord 
text = "Lionel Messi pergi Ke pasar di area Jakarta Pusat" # single sentence
stopword = StopWord() 
stopword.remove_stopword(text)
# Lionel Messi pergi pasar area Jakarta Pusat  

paragraph = "Lionel Messi pergi Ke pasar di area Jakarta Pusat itu. Sedangkan Cristiano Ronaldo ke pasar Di area Jakarta Selatan. Dan mereka tidak bertemu begini-begitu."
stopword.remove_stopword(paragraph)
# Lionel Messi pergi pasar area Jakarta Pusat. Cristiano Ronaldo pasar area Jakarta Selatan. bertemu.
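The idea behind stopword removal can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's code: `TOY_STOPWORDS` is a hypothetical three-word set rather than the list returned by get_stopword(), and `remove_stopwords` is a hypothetical helper that compares words case-insensitively, in line with the outputs above where "Ke" and "Di" are removed.

```python
# Hypothetical, deliberately tiny stopword set (assumption); the real list is much longer.
TOY_STOPWORDS = {"ke", "di", "itu"}


def remove_stopwords(text, stopwords=TOY_STOPWORDS):
    """Drop whitespace-separated words whose lowercase form is a stopword."""
    kept = [word for word in text.split() if word.lower() not in stopwords]
    return " ".join(kept)


print(remove_stopwords("Lionel Messi pergi Ke pasar di area Jakarta Pusat"))
# Lionel Messi pergi pasar area Jakarta Pusat
```

Because the sketch splits on whitespace only, a stopword followed directly by punctuation (such as "itu." in the paragraph example) would not be matched; handling that would require tokenizing first.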

Training and Evaluation

Our model is trained using stories from kumparan as the dataset. We managed to get ~93% accuracy on our test set.
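The text does not say how the accuracy figure was computed. One common definition for tagging tasks is token-level accuracy: the share of tokens whose predicted tag equals the reference tag. The sketch below shows that definition under this assumption; `token_accuracy` is a hypothetical helper and the tag lists are made-up toy data, not kumparan's test set.

```python
def token_accuracy(gold_tags, predicted_tags):
    """Fraction of positions where the predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return correct / len(gold_tags)


# Toy data (hypothetical): one of five tags is wrong.
gold = ["NNP", "NNP", "VB", "IN", "NN"]
predicted = ["NNP", "NNP", "VB", "IN", "NNP"]
print(token_accuracy(gold, predicted))
# 0.8
```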

Citation

Release history

Version	Release date
0.1.22.0 (this version)	May 13, 2026
0.1.21.0	Mar 31, 2026
0.1.20.0	Aug 26, 2025
0.1.19.0	May 13, 2025
0.1.18.0	Aug 19, 2024
0.1.17.0	Aug 13, 2024
0.1.16.0	Jul 4, 2024
0.1.15.0	Jul 11, 2023
0.1.14.0	Jun 1, 2023
0.1.13.0	Nov 3, 2022
0.1.12.0	Aug 4, 2021
0.1.11.0	Jul 29, 2021
0.1.10.3	Jul 9, 2021
0.1.10.2	Jul 5, 2021
0.1.10.1	Jun 30, 2021
0.1.10.0	Feb 25, 2021
0.1.9.8	Jul 16, 2020
0.1.9.7	Jul 7, 2020
0.1.9.6	Jun 17, 2020
0.1.9.5	Apr 21, 2020
0.1.9.4	Apr 21, 2020
0.1.9.3	Apr 7, 2020
0.1.9.2	Apr 7, 2020
0.1.9.1	Jan 23, 2020
0.1.9	Jan 16, 2020
0.1.8.8	Jan 15, 2020
0.1.8.7	Jan 7, 2020
0.1.8.6	Jan 7, 2020
0.1.8.5	Jan 6, 2020
0.1.8.4	Jan 3, 2020
0.1.8.3	Jan 3, 2020
0.1.8.2	Dec 13, 2019
0.1.8.1	Dec 12, 2019
0.1.8	Dec 11, 2019
0.1.7.2	Dec 11, 2019
0.1.7.1	Dec 10, 2019
0.1.7	Dec 10, 2019
0.1.6	Dec 10, 2019
0.1.5	Nov 26, 2019
0.1.4	Nov 22, 2019
0.1.3	Nov 22, 2019
0.1.2	Nov 22, 2019
0.1.1	Nov 22, 2019
0.1.0	Nov 21, 2019
0.0.1	Nov 21, 2019

Download files

Source Distribution

nlp_id-0.1.22.0.tar.gz (7.9 MB)

Uploaded May 13, 2026 Source

Built Distribution

nlp_id-0.1.22.0-py3-none-any.whl (8.1 MB)

Uploaded May 13, 2026 Python 3

File details

Details for the file nlp_id-0.1.22.0.tar.gz.

File metadata

Download URL: nlp_id-0.1.22.0.tar.gz
Upload date: May 13, 2026
Size: 7.9 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.11.9 Darwin/25.3.0

File hashes

Hashes for nlp_id-0.1.22.0.tar.gz
Algorithm	Hash digest
SHA256	`659d4c0afd29715a88aaf3630a0200b6849ee5e4829ee08402d9725a3524c9c9`
MD5	`60126d2dda2701be0969c153bd08bcaa`
BLAKE2b-256	`4645f7f9d4a764af0001844482f71c16efa906112360dd247cb3263e8f5b36db`
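A downloaded archive can be checked against the SHA256 digest listed above with Python's standard hashlib module. The sketch below is one way to do it; `sha256_of` and `matches_expected` are hypothetical helper names, and the local file path passed in is up to the reader.

```python
import hashlib

# SHA256 digest of nlp_id-0.1.22.0.tar.gz as listed in the table above.
EXPECTED_SHA256 = "659d4c0afd29715a88aaf3630a0200b6849ee5e4829ee08402d9725a3524c9c9"


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True when the file's SHA256 digest equals the expected hex string."""
    return sha256_of(path) == expected.lower()
```

For example, `matches_expected("nlp_id-0.1.22.0.tar.gz")` should return True for an intact copy of the source distribution, assuming the file sits in the current directory.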

File details

Details for the file nlp_id-0.1.22.0-py3-none-any.whl.

File metadata

Download URL: nlp_id-0.1.22.0-py3-none-any.whl
Upload date: May 13, 2026
Size: 8.1 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: poetry/2.1.3 CPython/3.11.9 Darwin/25.3.0

File hashes

Hashes for nlp_id-0.1.22.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e6acdbcc0b072dd5c63677543a851ad244f9af97cae201b8a0be788787ee833a`
MD5	`8bcaec1f681c74fc67c2c9115647e829`
BLAKE2b-256	`c81f4ba6fae4518577912902d46a59b01c3bf37006e84180318589a1167c9d35`
