Indonesian-language text processing library for the e-commerce domain (cleaning, PII masking, review mining, pipeline).
Leksara
Leksara is an Indonesian-language text preparation toolkit for data teams who need production-ready cleaning, masking, and normalization pipelines. The library bundles linguistic resources, a preset-driven orchestration layer, and modular helpers so you can audit raw text, remediate sensitive content, and standardize noisy reviews without rebuilding the stack for every project.
Feature Highlights
- CartBoard review intake – Inspect raw datasets from chatbots or marketplaces, generate column-level flags (PII, non-alphabetical noise, ratings), and capture metadata for monitoring.
- Composable cleaning utilities – `leksara.function` re-exports the building blocks (HTML stripping, casing, stopwords, punctuation, emoji, numeric cleanup) for ad-hoc preprocessing.
- PII masking and redaction – Regex-backed detectors for Indonesian phone numbers, emails, addresses, and national IDs with configurable replacement modes and conflict handling.
- Review-focused normalization – Slang and acronym expansion, contraction repair, elongated text trimming, rating extraction, stemming/normalization tuned for Bahasa Indonesia.
- ReviewChain orchestrator – Run pipelines functionally with `leksara(...)` or via the `ReviewChain` class, mix presets with custom steps, and benchmark per-stage performance.
- Resource-driven customization – Ship your own dictionaries and regex rules or extend the bundled JSON/CSV files to adapt the cleaner to new verticals.
Deep dives for each module live in docs/features.md together with API tables, dependencies, and ready-to-run recipes.
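To make the PII masking bullet concrete, the following is a minimal standard-library sketch of regex-based masking for emails and Indonesian mobile numbers. It is not Leksara's implementation: the patterns, the `mask_pii` helper, the placeholder tokens, and the `"remove"` mode are all assumptions made for illustration, and the rules bundled with the library may differ.

```python
import re

# Illustrative patterns only; these are NOT the rules shipped with Leksara.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Indonesian mobile numbers: +62 / 62 / 0 prefix, then 8, then 7-11 more digits,
# each optionally preceded by a space or hyphen.
PHONE_RE = re.compile(r"(?<!\w)(?:\+62|62|0)[\s-]?8(?:[\s-]?\d){7,11}(?!\d)")


def mask_pii(text: str, mode: str = "replace") -> str:
    """Hypothetical helper: mode="replace" inserts a placeholder token,
    mode="remove" drops the match entirely."""
    if mode not in {"replace", "remove"}:
        raise ValueError(f"unsupported mode: {mode!r}")
    email_token = "[EMAIL]" if mode == "replace" else ""
    phone_token = "[PHONE]" if mode == "replace" else ""
    text = EMAIL_RE.sub(email_token, text)
    text = PHONE_RE.sub(phone_token, text)
    # Collapse any double spaces left behind by removed matches.
    return re.sub(r"\s{2,}", " ", text).strip()


print(mask_pii("Email saya user@mail.id, WA 0812-3456-7890"))
# Email saya [EMAIL], WA [PHONE]
```

Emails are masked before phone numbers here so that digits inside an address are never mistaken for a phone number; a real detector would also need explicit conflict handling between overlapping matches.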
Quickstart
1. Install
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install leksara
Optional extras and troubleshooting tips are listed in docs/installation.md.
2. Clean a review column in-place
import pandas as pd
from leksara import leksara

df = pd.DataFrame(
    {
        "review_id": [101, 102],
        "review_text": [
            "<p>Barangnya mantul!!! Email saya user@mail.id, WA 0812-3456-7890</p>",
            "Kualitasnya ⭐⭐⭐⭐, pengiriman 4/5. Hubungi +62 812 8888 7777",
        ],
    }
)

# Apply the ecommerce review preset
df["clean_text"] = leksara(df["review_text"], preset="ecommerce_review")
print(df[["review_id", "clean_text"]])
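For readers who want a feel for the kind of work a review preset performs, here is a standard-library sketch of three of the normalizations named in the feature list: HTML stripping, elongated text trimming, and rating extraction. The function names, regexes, and the order of steps are assumptions for illustration; the actual `ecommerce_review` preset may apply different rules in a different order.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")
# Three or more repeats of the same letter or punctuation mark (digits are left alone).
ELONGATED_RE = re.compile(r"([^\W\d_]|[!?.,])\1{2,}")
RATING_RE = re.compile(r"\b([1-5])\s*/\s*5\b")


def strip_html(text: str) -> str:
    """Replace tags with spaces, then decode entities such as &amp;."""
    return html.unescape(TAG_RE.sub(" ", text))


def trim_elongation(text: str) -> str:
    """Collapse runs like 'uuu' or '!!!' to a single character."""
    return ELONGATED_RE.sub(lambda m: m.group(1), text)


def extract_rating(text: str):
    """Return an explicit 'n/5' score if present, else a 1-5 star-emoji count, else None."""
    match = RATING_RE.search(text)
    if match:
        return int(match.group(1))
    stars = text.count("⭐")
    return stars if 1 <= stars <= 5 else None


def normalize(text: str) -> str:
    text = trim_elongation(strip_html(text).lower())
    return re.sub(r"\s+", " ", text).strip()


print(normalize("<p>Barangnya mantuuul!!!</p>"))  # barangnya mantul!
print(extract_rating("Kualitasnya ⭐⭐⭐⭐, pengiriman 4/5"))  # 4
```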
3. Audit raw text with CartBoard
from leksara.frames.cartboard import get_flags, get_stats
flags = get_flags(df, text_column="review_text")
stats = get_stats(df, text_column="review_text")
print(flags[["review_id", "pii_flag", "rating_flag", "non_alphabetical_flag"]])
print(stats.iloc[0]["stats"]) # nested histogram of noise sources
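The flag columns printed above can be pictured with a small standalone sketch. The heuristics below (a combined PII regex, a rating regex, and a non-alphabetic character ratio with a 0.3 threshold) are assumptions chosen for illustration and operate on plain dictionaries rather than a DataFrame; they are not how `get_flags` is implemented.

```python
import re

# Hypothetical detectors; the shipped CartBoard rules may differ.
PII_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+|(?:\+62|0)[\s-]?8[\d\s-]{7,}")
RATING_RE = re.compile(r"\b[1-5]\s*/\s*5\b|⭐")


def non_alpha_ratio(text: str) -> float:
    """Share of non-whitespace characters that are not letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(not c.isalpha() for c in chars) / len(chars)


def flag_rows(rows, text_key="review_text", noise_threshold=0.3):
    """Return a copy of each row with three boolean flag fields added."""
    flagged = []
    for row in rows:
        text = row[text_key]
        flagged.append(
            {
                **row,
                "pii_flag": bool(PII_RE.search(text)),
                "rating_flag": bool(RATING_RE.search(text)),
                "non_alphabetical_flag": non_alpha_ratio(text) > noise_threshold,
            }
        )
    return flagged


rows = [
    {"review_id": 101, "review_text": "Barang bagus, email user@mail.id"},
    {"review_id": 102, "review_text": "Pengiriman 4/5 ⭐⭐⭐⭐"},
]
for row in flag_rows(rows):
    print(row["review_id"], row["pii_flag"], row["rating_flag"])
```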
4. Compose a tailored pipeline
from leksara import ReviewChain
from leksara.function import (
    case_normal,
    remove_punctuation,
    remove_stopwords,
    replace_email,
    replace_phone,
)

chain = ReviewChain.from_steps(
    patterns=[(replace_phone, {"mode": "replace"}), (replace_email, {"mode": "replace"})],
    functions=[case_normal, remove_stopwords, remove_punctuation],
)
cleaned, metrics = chain.transform(df["review_text"], benchmark=True)
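The idea of running ordered steps and collecting per-stage timings can be sketched without the library. The `run_chain` function below is a hypothetical stand-in, not `ReviewChain`: it applies plain `str -> str` callables in order and, when `benchmark=True`, also returns a dictionary of elapsed seconds per step. The structure of the real `metrics` object is not specified here and may look different.

```python
import time
from typing import Callable, Iterable, List

Step = Callable[[str], str]


def run_chain(texts: Iterable[str], steps: List[Step], benchmark: bool = False):
    """Apply each step to every text in order; optionally time each stage."""
    results = list(texts)
    timings = {}
    for step in steps:
        start = time.perf_counter()
        results = [step(text) for text in results]
        # Keyed by function name, so steps should have distinct names.
        timings[step.__name__] = time.perf_counter() - start
    return (results, timings) if benchmark else results


def collapse_spaces(text: str) -> str:
    return " ".join(text.split())


cleaned, timings = run_chain(
    ["  Barang   BAGUS  "], [collapse_spaces, str.lower], benchmark=True
)
print(cleaned)        # ['barang bagus']
print(list(timings))  # ['collapse_spaces', 'lower']
```

Timing each stage over the whole batch, rather than per text, keeps the measurement overhead small and makes slow steps easy to spot.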
Documentation Map
| Topic | When to read | Location |
|---|---|---|
| Installation & environment | You are provisioning a workstation or CI agent | docs/installation.md |
| Feature deep dives | You need behavioral details, configuration knobs, or per-feature dependencies | docs/features.md |
| Public API reference | You want signatures, argument descriptions, and return payload formats | docs/api.md |
| Worked examples | You prefer copy/paste recipes for notebooks or pipelines | docs/examples.md |
| Dependency matrix | You must vet optional packages or align with enterprise policies | docs/dependencies.md |
| Contributing | You plan to submit patches, run tests, or build docs | docs/contributing.md |
How Leksara Fits Together
- Pipelines – `leksara(...)` is a convenience wrapper around `ReviewChain`; both accept raw sequences (list/Series) and return cleaned text plus optional benchmarking details.
- Frames layer – `CartBoard` and friends operate on review tables, deriving flags, statistics, and noise diagnostics suitable for dashboards.
- Functions layer – The `leksara.function` module mirrors the implementation modules under `leksara/functions` so you can cherry-pick individual cleaners without touching internals.
- Resources – Regex rules and dictionaries stored under `leksara/resources/` drive PII detection, slang resolution, and whitelist protection. Update these files to specialise the toolkit.
- Logging & benchmarking – `leksara.core.logging` ships opt-in helpers to emit step-level logs, while `benchmark=True` collects timing metadata for throughput tuning.
Architectural notes, data contracts, and extension points for each layer are captured in docs/features.md.
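As an illustration of the resources idea, the sketch below loads a slang dictionary from JSON text and applies it token by token while leaving whitelisted words untouched. The JSON layout (a flat object mapping slang to its expansion), the sample entries, and the helper names are assumptions; the files bundled under `leksara/resources/` may use a different schema.

```python
import json
import re


def load_slang(json_text: str) -> dict:
    """Parse a flat {"slang": "expansion"} JSON object, lowercasing the keys."""
    return {key.lower(): value for key, value in json.loads(json_text).items()}


def expand_slang(text: str, slang: dict, whitelist=frozenset()) -> str:
    """Replace known slang tokens; tokens in the whitelist are kept verbatim."""

    def swap(match):
        word = match.group(0)
        key = word.lower()
        if key in whitelist:
            return word
        return slang.get(key, word)

    return re.sub(r"\w+", swap, text)


# Hypothetical dictionary content for demonstration.
slang = load_slang('{"gk": "tidak", "bgt": "banget", "yg": "yang"}')
print(expand_slang("barang yg ini gk bagus bgt", slang))
# barang yang ini tidak bagus banget
```

The whitelist check runs before the dictionary lookup, so a protected token such as a brand name survives even when it collides with a slang entry.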
Contributing & Support
- Read `docs/contributing.md` before opening a pull request. It covers environment setup, style, testing, and documentation requirements.
- File issues on GitHub with reproducible examples; include the preset, optional dependencies, and OS details when reporting pipeline differences.
- Commercial or large-scale users should build automated smoke tests around `ReviewChain` to detect upstream dictionary or regex changes.
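One possible shape for such a smoke test is a golden-file comparison: a fixed set of raw inputs with the outputs you accepted at upgrade time, re-checked on every dependency bump. The harness below is a hypothetical sketch that works with any `str -> str` callable; `stand_in_cleaner` is a placeholder, and in practice you would pass a function that wraps your own pipeline.

```python
from typing import Callable, Dict, List


def smoke_test(cleaner: Callable[[str], str], golden: Dict[str, str]) -> List[str]:
    """Return human-readable mismatches; an empty list means everything matched."""
    failures = []
    for raw, expected in golden.items():
        actual = cleaner(raw)
        if actual != expected:
            failures.append(f"{raw!r}: expected {expected!r}, got {actual!r}")
    return failures


def stand_in_cleaner(text: str) -> str:
    # Placeholder for a real pipeline call; lowercases and collapses whitespace.
    return " ".join(text.lower().split())


golden = {
    "Barang   BAGUS": "barang bagus",
    "Pengiriman Cepat": "pengiriman cepat",
}
failures = smoke_test(stand_in_cleaner, golden)
print("OK" if not failures else "\n".join(failures))
```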
Leksara is licensed under the terms specified in LICENSE.