An Indonesian text-processing library for the e-commerce domain (cleaning, PII masking, review mining, pipelines).
Leksara
Description
Leksara is a Python toolkit that streamlines the preprocessing and cleaning of Indonesian text data for Data Scientists and Machine Learning Engineers. It targets messy, noisy Indonesian text from domains such as e-commerce reviews, social media posts, and chat conversations, handling Indonesian-specific challenges like slang, regional expressions, informal abbreviations, and mixed-language content, alongside standard cleaning steps such as punctuation and stopword removal. This makes it a practical tool for Indonesian text analysis and machine-learning model preparation.
Key Features
- Basic Cleaning Pipeline: A straightforward pipeline to clean raw text data by handling common tasks like punctuation removal, casing normalization, and stopword filtering.
- Advanced Customization: Users can create custom cleaning pipelines tailored to specific datasets, including support for regex pattern matching, stemming, and custom dictionaries.
- Preset Options: Includes predefined cleaning presets for various domains like e-commerce, allowing for one-click cleaning.
- Slang and Informal Text Handling: Users can define their own custom dictionaries for slang terms and informal language, especially useful for Indonesian text.
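Custom slang handling usually amounts to a token-level dictionary lookup. The sketch below is not Leksara's actual API; it is a minimal pure-Python illustration of the idea, with a tiny hand-made dictionary (Leksara bundles a much larger one as a resource file):

```python
import re

# Tiny illustrative slang dictionary (hypothetical entries, not Leksara's data).
SLANG = {
    "brg": "barang",
    "krg": "kurang",
    "bgs": "bagus",
    "ga": "tidak",
}

def normalize_slang(text, slang=SLANG):
    """Replace known slang tokens with their formal Indonesian forms."""
    tokens = re.findall(r"\w+|\S", text.lower())
    return " ".join(slang.get(tok, tok) for tok in tokens)

print(normalize_slang("Kualitasnya krg bgs"))  # kualitasnya kurang bagus
```

Because the lookup is plain data, swapping in a domain-specific dictionary (e.g. loaded from JSON) requires no code changes.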
Usage Examples
Basic Usage: Basic Cleaning Pipeline
This example demonstrates how to clean e-commerce product reviews using a pre-built preset.
```python
from Leksara import Leksara

df['cleaned_review'] = Leksara(df['review_text'], preset='ecommerce_review')
print(df[['review_id', 'cleaned_review']])
```
Input Data (df):
| review_id | review_text |
|---|---|
| 1 | <p>brgnya ORI & pengiriman cepat. Mantulll</p> |
| 2 | Kualitasnya krg bgs, ga sesuai ekspektasi... |
Output Data:
| review_id | cleaned_review |
|---|---|
| 1 | barang nya original pengiriman cepat mantap |
| 2 | kualitasnya kurang bagus tidak sesuai ekspektasi |
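The preset chains several simple steps (tag removal, case normalization, punctuation and stopword removal). As a rough pure-Python equivalent of those steps, under the assumption that this is broadly what the preset does internally (function and variable names here are mine, not Leksara's):

```python
import re
import string

# Tiny illustrative stopword set; Leksara bundles a full Indonesian list.
STOPWORDS = {"yang", "dan", "di", "ke"}

def basic_clean(text):
    text = re.sub(r"<[^>]+>", " ", text)  # strip HTML tags
    text = text.lower()                   # case normalization
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)

print(basic_clean("<p>Pengiriman cepat dan aman!</p>"))
# pengiriman cepat aman
```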
Advanced Usage: Custom Cleaning Pipeline
Customize the pipeline to mask phone numbers and normalize whitespace in chat logs.
```python
from Leksara import Leksara
from Leksara.functions import to_lowercase, normalize_whitespace
from Leksara.patterns import MASK_PHONE_NUMBER

custom_pipeline = {
    'patterns': [MASK_PHONE_NUMBER],
    'functions': [to_lowercase, normalize_whitespace],
}

df['safe_message'] = Leksara(df['chat_message'], pipeline=custom_pipeline)
print(df[['chat_id', 'safe_message']])
```
Input Data (df):
| chat_id | chat_message |
|---|---|
| 101 | Hi kak, pesanan saya INV/123 blm sampai. No HP saya 081234567890 |
| 102 | Tolong dibantu ya sis, thanks |
Output Data:
| chat_id | safe_message |
|---|---|
| 101 | hi kak, pesanan saya inv/123 blm sampai. no hp saya [PHONE_NUMBER] |
| 102 | tolong dibantu ya sis, thanks |
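A pattern like MASK_PHONE_NUMBER presumably boils down to a regex substitution. A standalone sketch of that idea; the regex here is my own approximation of Indonesian mobile-number formats, not Leksara's actual pattern:

```python
import re

# Indonesian mobile numbers commonly start with 08 or +62/62 followed by
# more digits; this pattern is an illustrative approximation only.
PHONE_RE = re.compile(r"(?:\+62|62|0)8\d{8,11}")

def mask_phone(text, token="[PHONE_NUMBER]"):
    """Replace anything matching the phone pattern with a mask token."""
    return PHONE_RE.sub(token, text)

print(mask_phone("No HP saya 081234567890"))
# No HP saya [PHONE_NUMBER]
```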
Goals & Objectives
- Provide an intuitive and adaptable cleaning tool for Indonesian text, focusing on domains like e-commerce.
- Enable Data Scientists and ML Engineers to clean and preprocess text with minimal effort.
- Allow for deep customization through configuration options and the use of custom dictionaries.
Success Metrics
- On-time Delivery: Targeted release by October 15, 2025.
- Processing Speed: Clean a 10,000-row Pandas Series in under 5 seconds.
- Cleaning Accuracy: Achieve over 95% accuracy for core cleaning functions.
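The processing-speed target is straightforward to verify with a timing harness. A sketch using synthetic data and a stand-in cleaner (the real pipeline does far more work, so this only shows the measurement pattern, not Leksara's actual throughput):

```python
import time
import string

def clean(text):
    # Stand-in cleaner: lowercase + strip punctuation.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

rows = ["Brgnya ORI, pengiriman CEPAT!!!" for _ in range(10_000)]
start = time.perf_counter()
cleaned = [clean(r) for r in rows]
elapsed = time.perf_counter() - start
print(f"{len(cleaned)} rows in {elapsed:.3f}s")
```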
Folder Structure
Below is the recommended folder structure for organizing the project:
```
[Leksara]/
├── pyproject.toml            # packaging & deps (nltk, etc.)
├── requirements.txt          # runtime deps (nltk, pandas, etc.)
├── README.md                 # overview & usage
├── leksara/                  # main package
│   ├── __init__.py           # public API surface
│   ├── version.py            # package version
│   ├── core/
│   │   ├── chain.py          # pipeline/CLI entry (matches pyproject scripts)
│   │   ├── logging.py        # logging/benchmark utilities
│   │   └── presets.py        # pipeline presets
│   ├── frames/
│   │   └── cartboard.py      # data-frame helpers
│   ├── functions/            # granular modules
│   │   ├── __init__.py
│   │   ├── cleaner/
│   │   │   ├── __init__.py
│   │   │   └── basic.py      # remove_tags, case_normal, remove_stopwords, etc.
│   │   ├── patterns/
│   │   │   ├── __init__.py
│   │   │   └── pii.py        # PII maskers (email/phone, etc.)
│   │   └── review/
│   │       ├── __init__.py
│   │       └── advanced.py   # advanced review functions
│   ├── resources/            # bundled support data
│   │   ├── acronyms.csv
│   │   ├── contractions.json
│   │   ├── slang_dict.json
│   │   └── stopwords/
│   │       └── id.txt        # Indonesian stopwords (additions/abbreviations)
│   ├── tests/
│   │   ├── test_chain.py
│   │   ├── test_cleaner_basic.py
│   │   ├── test_patterns_pii.py
│   │   └── test_review_advanced.py
│   └── utils/
│       ├── lang.py
│       ├── regexes.py
│       ├── text.py           # text helpers
│       └── whitelist.py
└── notebooks/
    └── leksara_quickstart.ipynb  # quickstart & demo
```
Milestones
| Sprint | Dates | Goal |
|---|---|---|
| 1 | Aug 18 – Aug 22 | Project Kickoff, Discovery, Set up repository |
| 2 | Aug 22 – Aug 29 | Build Core Cleaning Engine |
| 3 | Aug 29 – Sep 5 | Develop Configurable Features |
| 4 | Sep 5 – Sep 12 | Implement Advanced Customization |
| 5 | Sep 12 – Sep 19 | Refine API |
| 6 | Sep 19 – Sep 26 | Optimize System |
| 7 | Sep 26 – Oct 3 | Finalize Documentation |
| 8 | Oct 3 – Oct 10 | Prepare for Launch |
Requirements
- Python 3.x
- Pandas
Install
```shell
pip install leksara
```
Contributors
- Vivian & Zahra – Document Owners
- Salsa – UI/UX Designer
- Aufi, Althaf, Rhendy, Adit – Data Science Team
- Alya, Vivin – Data Analyst Team
For more details on the features and usage, refer to the official documentation.