Modern NLP for Indonesian and regional languages
Project description
BASA
Modern NLP preprocessing for Indonesian and regional languages.
BASA is a lightweight, zero-dependency preprocessing library designed for real-world Indonesian text — the kind found on Twitter/X, TikTok, WhatsApp, Shopee reviews, and Discord. It normalizes informal slang, collapses expressive character repetition, reduces punctuation noise, and optionally corrects typos, all through a single clean API.
from basa import normalize
normalize("GW GKKKK NGERTIII BNGTTTT!!!!!")
# → 'saya tidak mengerti banget!'
Why BASA?
Indonesian social media text is notoriously difficult to process with standard NLP tools:
| Raw input | After normalize() |
|---|---|
gw gk ngerti bngt sihhhh!!! |
saya tidak mengerti banget sih! |
kmrn gamau makan krn baper bgt |
kemarin tidak mau makan karena bawa perasaan banget |
otw gan, rekber dlu ya!!!!! |
dalam perjalanan saudara, rekening bersama dulu ya! |
GW GKKKK NGERTIII BNGTTTT!!!!! |
saya tidak mengerti banget! |
Standard tokenizers and language models often fail on this kind of input because they see "gkkkk", "bngtttt", and "ngertiii" as unknown tokens. BASA normalizes them first.
Installation
pip install basa
Requires Python 3.10+
Quick Start
One-liner (recommended for most use cases)
from basa import normalize
normalize("gw gk ngerti bngt sihhhh!!!")
# → 'saya tidak mengerti banget sih!'
Zero-config alias
from basa import quick
quick("GW GKKKK NGERTIII BNGTTTT!!!!!")
# → 'saya tidak mengerti banget!'
quick() is a thin alias for normalize() with all defaults applied. Use it when you want the shortest possible call.
Batch processing
from basa import normalize
texts = [
"gw gk ngerti",
"lu udh makan??",
"kmrn gamau pergi krn baper bgt",
]
normalize(texts)
# → ['saya tidak mengerti', 'kamu sudah makan?', 'kemarin tidak mau pergi karena bawa perasaan banget']
API Reference
normalize(text, **options)
Normalize informal Indonesian text. Accepts a single string or a list of strings.
normalize(
text: Union[str, List[str]],
apply_slang: bool = True,
apply_typo: bool = False,
lowercase: bool = True,
normalize_punctuation: bool = True,
normalize_whitespace: bool = True,
) -> Union[str, List[str]]
Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
text |
str or List[str] |
— | Input text or list of texts. |
apply_slang |
bool |
True |
Expand slang and reduce expressive repeated characters (e.g. "bngtttt" → "banget"). |
apply_typo |
bool |
False |
Correct misspelled words using Levenshtein distance. Opt-in — requires a vocabulary to be loaded first. |
lowercase |
bool |
True |
Lowercase the text before processing. Set False for NER and case-sensitive pipelines. |
normalize_punctuation |
bool |
True |
Collapse repeated punctuation marks ("!!!!!" → "!"). |
normalize_whitespace |
bool |
True |
Strip leading/trailing whitespace and collapse internal multiple spaces. |
Processing pipeline (in order)
1. lowercase → "GW GK NGERTI" → "gw gk ngerti"
2. slang normalization → "gkkkk" → "gk" → "tidak"
3. typo correction → "mkan" → "makan" (opt-in)
4. punctuation → "!!!!!" → "!"
5. whitespace cleanup → " a b " → "a b"
Examples
# Preserve case for NER tasks
normalize("Jokowi pergi ke Jakarta", lowercase=False)
# → 'Jokowi pergi ke Jakarta'
# Disable slang (pass through raw tokens)
normalize("gw gk ngerti", apply_slang=False)
# → 'gw gk ngerti'
# Enable typo correction (requires vocab)
from basa import typo
typo.add_to_vocab({"makan", "minum", "pergi"})
normalize("saya mkan dan mnum", apply_typo=True)
# → 'saya makan dan minum'
quick(text)
Zero-config alias for normalize() with all default settings.
from basa import quick
quick("gw gamau pergi krn mager")
# → 'saya tidak mau pergi karena malas bergerak'
typo — Typo Corrector
BASA's typo corrector is vocabulary-driven and opt-in by default. You supply the vocabulary; BASA finds the closest match using Levenshtein distance.
from basa import typo
# Load your domain vocabulary
typo.add_to_vocab({"makan", "minum", "masak", "pergi", "datang"})
typo.correct("mkan") # → 'makan'
typo.correct("mnm") # → 'minum'
typo.correct("ok") # → 'ok' (too short, skipped by default)
# Correct a full sentence
typo.correct_text("saya mkan dan mnm")
# → 'saya makan dan minum'
# Get multiple suggestions
typo.suggest("mkan", top_k=3)
# → ['makan', 'masak', 'minum']
Why is apply_typo=False by default?
Typo correction is destructive when applied blindly. Without the right vocabulary, domain-specific terms like xgboost, lightgbm, or rekber would be mangled. BASA follows the principle of conservative by default, destructive features opt-in.
Vocabulary management
from basa import typo
typo.add_to_vocab({"kata", "lain"}) # add words
typo.remove_from_vocab({"kata"}) # remove words
typo.clear_vocab() # reset entirely
len(typo) # vocab size
"makan" in typo # membership check
# Check cache statistics (useful for profiling)
typo.cache_info()
# → {'hits': 120, 'misses': 35, 'size': 35}
Typo corrector options
from basa.core.typo import TypoCorrector
corrector = TypoCorrector(
vocab={"makan", "minum"},
min_word_length=4, # tokens shorter than this are skipped (default: 4)
min_confidence=0.5, # minimum correction confidence in [0, 1] (default: 0.5)
)
slang — Slang Normalizer
Access the underlying slang engine directly for fine-grained control.
from basa.core.slang import slang, SlangNormalizer
# Use the singleton
slang.normalize("gw gamau pergi krn lg baper bgt")
# → 'saya tidak mau pergi karena sedang bawa perasaan banget'
# Custom dictionary (extend or override defaults)
custom = SlangNormalizer(custom_mapping={
"gaskeun": "ayo lakukan",
"jancok": "ekspresi",
})
custom.normalize("gaskeun bro!")
# → 'ayo lakukan bro!'
# Batch normalize
slang.normalize_batch(["gw makan", "lu minum"])
# → ['saya makan', 'kamu minum']
Slang dictionary categories
The built-in dictionary covers 1,200+ entries across 27 categories:
| Category | Count | Examples |
|---|---|---|
| Pronouns | 34 | gw → saya, lu → kamu, doi → dia, ane → saya, sampeyan → kamu |
| Kinship & address | 19 | kk → kakak, ortu → orang tua, bokap → ayah, nyokap → ibu |
| Negation | 22 | ga, gak, nggak, kagak, jangan, belom → tidak/belum/jangan |
| Compound negation | 30 | gamau → tidak mau, gabisa → tidak bisa, gatau → tidak tahu |
| Conjunctions | 51 | yg → yang, krn → karena, stlh → setelah, sblm → sebelum |
| Verbs | 98 | udah → sudah, ngerti → mengerti, nyari → mencari, ngobrol → mengobrol |
| Adjectives & adverbs | 111 | bgt → banget, lmyn → lumayan, pdhl → padahal, kyknya → sepertinya |
| Question words | 16 | gmn → bagaimana, knp → kenapa, kumaha → bagaimana (Sunda) |
| Greetings & responses | 73 | makasih, tq, sori, yoi, bye, tengkyu → terima kasih/maaf/dll |
| Temporal & location | 26 | skrg → sekarang, kmrn → kemarin, mgg → minggu, wktu → waktu |
| Internet slang | 50 | otw → dalam perjalanan, btw → omong-omong, lowkey → diam-diam |
| E-commerce & finance | 30 | ongkir → ongkos kirim, cod → bayar di tempat, duit → uang |
| Youth / Gen-Z | 24 | mager → malas bergerak, baper → bawa perasaan, gabut → tidak ada kegiatan |
| Discourse markers | 11 | mksdnya → maksudnya, pokoknya, intinya, menurutku → menurut saya |
| Javanese extended | 35 | mangan → makan, turu → tidur, apik → bagus, okeh → banyak |
| Sundanese extended | 27 | abdi → saya, geus → sudah, tiasa → bisa, atuh → dong |
| Nouns | 46 | hp → handphone, temen → teman, matkul → mata kuliah, kantor → kantor |
| Health | 42 | dmm → demam, opname → rawat inap, gws → lekas sembuh, isoman → isolasi mandiri |
| Emotions & expressions | 80 | galau, bete, ghosting, crush, pdkt → pendekatan, mupeng |
| Food & drink | 55 | nasgor → nasi goreng, kopsu → kopi susu, laper → lapar, kenyang |
| Clothing & fashion | 47 | ootd → pakaian hari ini, thrifting → belanja baju bekas, hoodie |
| Transportation | 36 | ojol → ojek online, krl → kereta rel listrik, macet, nebeng → menumpang |
| Religion & culture | 56 | alhamdulillah, bismillah, bukber → buka bersama, ultah → ulang tahun |
| Education extended | 69 | bimbel → bimbingan belajar, ospek → orientasi studi, maba → mahasiswa baru |
| Work & office | 85 | wfh → kerja dari rumah, deadline → batas waktu, lembur, resign |
| Numbers & quantity | 35 | 1rb → seribu, 5jt → lima juta, rata2 → rata-rata, tiba2 → tiba-tiba |
| Compound expressions | 57 | ngapain → sedang apa, kapan2 → kapan-kapan, otw ke → dalam perjalanan ke |
Extending the dictionary
You can add domain-specific entries at any time without subclassing:
from basa.core.slang import slang, SlangNormalizer
# Add a single entry to the shared singleton
slang.add("npwp", "nomor pokok wajib pajak")
slang.lookup("npwp") # → 'nomor pokok wajib pajak'
slang.remove("npwp") # → True
len(slang) # total entries
# Add multiple entries at once (single regex recompile)
slang.bulk_add({
"ktp": "kartu tanda penduduk",
"sim": "surat izin mengemudi",
"stnk": "surat tanda nomor kendaraan",
})
# Or create a fully isolated custom normalizer
custom = SlangNormalizer(custom_mapping={
"jancok": "ekspresi",
"cuk": "ekspresi",
})
custom.normalize("jancok, mantul tenan!")
# → 'ekspresi, mantap betul sungguh!'
# Export the full mapping for inspection or persistence
mapping = slang.export() # → Dict[str, str]
slang.reset() # restore built-in defaults
Real-World Use Cases
Preprocessing for sentiment analysis
from basa import normalize
reviews = [
"produknya bagus bgt tp ongkirnya mahal bgt!!!",
"gw kecewa bngt, barang ga sesuai deskripsi smskali",
"rekber dlu gan, takut kena tipu",
]
clean = normalize(reviews)
# Pass clean into your sentiment model
Preprocessing for a custom NLP pipeline
from basa import normalize, typo
# Load your domain vocabulary (e.g., from a word list file)
with open("vocab.txt") as f:
domain_vocab = set(f.read().splitlines())
typo.add_to_vocab(domain_vocab)
def preprocess(text: str) -> str:
return normalize(text, apply_typo=True)
preprocess("gw mkan siang tdi di wrng padang")
# → 'saya makan siang tadi di warung padang'
NER pipeline (preserve casing)
from basa import normalize
text = "Jokowi blg bhw pemerintah akan bantu UMKM"
normalize(text, lowercase=False)
# → 'Jokowi bilang bahwa pemerintah akan bantu UMKM'
Design Philosophy
BASA is built around three principles:
-
Conservative by default. Only safe, lossless transforms are enabled out of the box. Destructive features (like typo correction) require explicit opt-in.
-
No bundled vocabularies for correction. Every domain has different vocabulary needs — fintech, e-commerce, ML, healthcare. Callers supply their own word list via
typo.add_to_vocab(). -
Zero required dependencies for core preprocessing. The
normalize()andslangmodules use only the Python standard library. The optionaltransformers,torch, andpydanticdependencies are only required for advanced modules (basa.translate,basa.evaluate).
Development
Setup
git clone https://github.com/Muanai/basa.git
cd basa
python -m venv .venv
.venv\Scripts\activate # Windows
# source .venv/bin/activate # macOS / Linux
pip install -e ".[dev]"
Running tests
pytest tests/ -v
Optional extras
pip install -e ".[serving]" # FastAPI serving
pip install -e ".[evaluation]" # ROUGE, BERTScore, seqeval
pip install -e ".[dev]" # pytest, ruff, black, mypy
Roadmap
| Version | Status | Features |
|---|---|---|
| v0.1 | ✅ Current | normalize(), quick(), slang (1,600+ entries, 27 categories), typo corrector |
| v0.2 | 🔜 Planned | BK-Tree / SymSpell for faster typo correction at large vocab sizes |
| v0.3 | 🔜 Planned | Emoji handling, remove_emoji flag |
| v0.4 | 🔜 Planned | Tokenizer module (basa.tokenize) |
| v1.0 | 🔜 Planned | Stable API, full docs site, PyPI release |
Contributing
Contributions are welcome! In particular:
- Slang dictionary additions — if you spot a common slang word that's missing, open a PR adding it to the appropriate category in
src/basa/core/slang.py. - Bug reports — please include the exact input string and the unexpected output.
- Performance improvements — especially for the typo correction module.
Please open an issue before submitting large changes.
License
MIT © 2026 Muanai Khalifah Revindo
Project details
Release history Release notifications | RSS feed
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file basa-0.1.0b1.tar.gz.
File metadata
- Download URL: basa-0.1.0b1.tar.gz
- Upload date:
- Size: 40.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
4fbf59fd0e34b1d3479e0e5d652f70b23c4c4236559a868959a78cd7d6506aa2
|
|
| MD5 |
3fed5bd9fcce1b703e5f51eee23359ab
|
|
| BLAKE2b-256 |
bbc4c2620726df8d9375bbca514d4c43222a9de7ca93f8457f7e3c677b0568a5
|
File details
Details for the file basa-0.1.0b1-py3-none-any.whl.
File metadata
- Download URL: basa-0.1.0b1-py3-none-any.whl
- Upload date:
- Size: 39.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.6
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
af2b732308f9ddec8d918a630c24464d138a4464906215f277c71e2a502ac931
|
|
| MD5 |
9603d10835cf47755a65aa54706715f4
|
|
| BLAKE2b-256 |
220bba4601cf5c9da2479d63f41fbd9b428e79104be3822372771e0782107923
|