Education toolkit for Bahasa Indonesia NLP
Elang is an acronym that combines the phrases Education (E) and Language Understanding (Lang). It is an education-centric toolkit that demonstrates the ideas behind many Natural Language Processing strategies used commercially today, including word embeddings and pre-trained Bahasa Indonesia models for transfer learning. The Quick Start and Documentation help you get started in 5 minutes.
Elang also means "eagle" in Bahasa Indonesia, and the elang Jawa (Javan hawk-eagle) is the national bird of Indonesia, more commonly referred to as Garuda.
The package provides a collection of utility functions and tools that interface with gensim and scikit-learn, as well as curated negative lists for Bahasa Indonesia (kata kasar / vulgar words, stopwords, etc.) and useful preprocessing functions.
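To make the idea of a curated word list concrete, here is a minimal sketch of stopword filtering in plain Python. The `STOPWORDS` set and the `remove_stopwords` helper are hypothetical names invented for illustration; they are not elang's actual lists or API, and the handful of words shown is only a stand-in for a real curated list.

```python
import re

# Hypothetical, tiny stand-in for a curated stopword list; not elang's actual list.
STOPWORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}


def remove_stopwords(text, stopwords=STOPWORDS):
    """Lowercase the text, split it into alphabetic tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [token for token in tokens if token not in stopwords]


sentence = "Burung elang itu terbang tinggi di langit dan hinggap di pohon"
print(remove_stopwords(sentence))
# ['burung', 'elang', 'terbang', 'tinggi', 'langit', 'hinggap', 'pohon']
```

A vulgar-word filter could follow the same pattern with a different set, which is why the word list is passed in as a parameter rather than hard-coded in the function body.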
Quick Demo
Install elang:
pip install elang
Four lines of code are enough to visualize a word embedding:
from elang.plot.utils import plot2d
from gensim.models import Word2Vec
model = Word2Vec.load("path.to.model")
plot2d(model)
# output:
It even looks like a soaring eagle with its outstretched wings!
Scikit-Learn Compatibility
Because the dimensionality reduction procedure is handled by the underlying sklearn code, you can use any of the valid parameters in the function call and they will be handed off to the underlying method. Common examples are the perplexity, n_iter and random_state parameters:
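from elang.plot.utils import plot2d
from gensim.models import Word2Vec

model = Word2Vec.load("path.to.model")
# Take the 14 words most similar to "bca" and keep only the words, not the scores
bca = model.wv.most_similar("bca", topn=14)
similar_bca = [w[0] for w in bca]
# Extra keyword arguments are passed through to the underlying sklearn method
plot2d(
    model,
    method="TSNE",
    targets=similar_bca,
    perplexity=20,
    early_exaggeration=50,
    n_iter=2000,
    random_state=0,
)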
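The hand-off described above can be pictured with a short sketch. The `reduce_to_2d` helper below is a hypothetical illustration of forwarding keyword arguments to scikit-learn; it is not elang's actual implementation, and it uses random vectors as a stand-in for real word vectors. The accepted keyword names depend on the installed scikit-learn version, so the sketch passes only `perplexity` and `random_state`.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Hypothetical helper for illustration only; not elang's actual code.
_REDUCERS = {"PCA": PCA, "TSNE": TSNE}


def reduce_to_2d(vectors, method="TSNE", **kwargs):
    """Project vectors to 2-D, forwarding extra keyword arguments to scikit-learn."""
    try:
        reducer_cls = _REDUCERS[method.upper()]
    except KeyError:
        raise ValueError(f"unknown method: {method!r}") from None
    # Any keyword the caller supplies is handed to the sklearn estimator unchanged
    reducer = reducer_cls(n_components=2, **kwargs)
    return reducer.fit_transform(np.asarray(vectors))


rng = np.random.default_rng(0)
fake_vectors = rng.normal(size=(30, 50))  # stand-in for 30 word vectors of dimension 50
coords = reduce_to_2d(fake_vectors, method="TSNE", perplexity=5, random_state=0)
print(coords.shape)  # (30, 2)
```

Because the helper does not inspect the keyword arguments itself, an invalid name surfaces as an error from scikit-learn rather than being silently ignored. Note that t-SNE requires `perplexity` to be smaller than the number of points, which is why the sketch uses 5 for 30 vectors.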