Indonesian-language text processing library for the e-commerce domain (cleaning, PII masking, review mining, pipelines).
LEKSARA
Transforming Text, Empowering Insights Instantly
Indonesian-language text preparation toolkit for production review pipelines: clean, mask, and normalize in a single pass.
The library ships linguistic resources, preset-driven orchestration, and modular helpers so data teams can audit raw text, remediate sensitive content, and standardize noisy reviews without rebuilding the pipeline from scratch.
Why Leksara
| Capability | What you get | Key modules |
|---|---|---|
| CartBoard review intake | Dataset audits with PII flags, rating detection, and noise diagnostics ready for dashboards. | leksara.frames.cartboard |
| Composable cleaning utilities | Reusable HTML stripping, casing, stopword, emoji, punctuation, and numeric cleanup helpers. | leksara.function |
| PII masking & redaction | Regex-backed replacement for Indonesian phones, emails, addresses, and national IDs with configurable modes. | leksara.pattern plus leksara/resources/regex_patterns/ |
| Review-focused normalization | Slang/acronym expansion, contraction repair, rating extraction, and elongation trimming for Bahasa Indonesia. | leksara.functions.review |
| ReviewChain orchestrator | leksara(...) wrapper and ReviewChain class for preset pipelines, benchmarking, and hybrid custom flows. | leksara.core.chain |
| Resource-driven customization | Drop in your own dictionaries and regex rules to adapt cleaners for new verticals. | leksara/resources/ |
Deep dives live in docs/features.md alongside API tables, dependencies, and ready-to-run notebooks.
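To make the PII masking row above concrete, here is a minimal standalone sketch of regex-backed replacement with a configurable mode. It uses only Python's `re` module and does not reproduce Leksara's own rules: the two regexes, the `mask_pii` helper, and the `"remove"` mode are assumptions made for illustration. The `[EMAIL]` and `[PHONE_NUMBER]` placeholders and the `"replace"` mode name are borrowed from the Quickstart.

```python
import re

# Illustrative patterns only; Leksara ships its own rules under
# leksara/resources/regex_patterns/, which are not reproduced here.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Indonesian mobile numbers: "+62" or "0" prefix, then "8", then 7-11 more
# digits, each optionally preceded by a space or dash.
PHONE_RE = re.compile(r"(?:\+62|0)[\s-]?8(?:[\s-]?\d){7,11}")

PLACEHOLDERS = {"email": "[EMAIL]", "phone": "[PHONE_NUMBER]"}


def mask_pii(text: str, mode: str = "replace") -> str:
    """Mask emails and phone numbers.

    mode="replace" swaps each match for a placeholder token;
    mode="remove" deletes the match (hypothetical mode for this sketch).
    """
    if mode not in {"replace", "remove"}:
        raise ValueError(f"unsupported mode: {mode!r}")
    email_token = PLACEHOLDERS["email"] if mode == "replace" else ""
    phone_token = PLACEHOLDERS["phone"] if mode == "replace" else ""
    text = EMAIL_RE.sub(email_token, text)
    text = PHONE_RE.sub(phone_token, text)
    # Collapse whitespace left behind by removals.
    return re.sub(r"\s+", " ", text).strip()


masked = mask_pii("Email saya user@mail.id, WA 0812-3456-7890")
print(masked)
```

Running the email pattern before the phone pattern is a design choice for this sketch; with these two patterns the order does not change the result, but it matters once patterns can overlap.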
Quickstart
- Create and activate a virtual environment

  ```powershell
  python -m venv .venv
  .\.venv\Scripts\Activate.ps1
  pip install leksara
  ```

  Optional extras and troubleshooting tips are documented in docs/installation.md.

- Clean a review column with the ecommerce preset

  ```python
  import pandas as pd
  from leksara import leksara

  # Two raw reviews containing HTML, PII, and rating hints.
  df = pd.DataFrame(
      {
          "review_id": [101, 102],
          "review_text": [
              "<p>Barangnya mantul!!! Email saya user@mail.id, WA 0812-3456-7890</p>",
              "Kualitasnya ⭐⭐⭐⭐, pengiriman 4/5. Hubungi +62 812 8888 7777",
          ],
      }
  )

  # Run the preset pipeline over the whole column.
  df["clean_text"] = leksara(df["review_text"], preset="ecommerce_review")
  print(df[["review_id", "clean_text"]])
  ```

  | review_id | clean_text |
  |---|---|
  | 101 | barang mantap email [EMAIL] wa [PHONE_NUMBER] |
  | 102 | kualitas 4.0 kirim 4.0 hubung [PHONE_NUMBER] |

- Audit raw text with CartBoard

  ```python
  from leksara.frames.cartboard import get_flags, get_stats

  flags = get_flags(df, text_column="review_text")
  stats = get_stats(df, text_column="review_text")

  print(flags[["review_id", "pii_flag", "rating_flag", "non_alphabetical_flag"]])
  print(stats.iloc[0]["stats"])  # nested histogram of noise sources
  ```
- Compose a tailored pipeline with ReviewChain

  ```python
  from leksara import ReviewChain
  from leksara.function import case_normal, remove_punctuation, remove_stopwords
  from leksara.pattern import replace_email, replace_phone

  # PII patterns are listed separately from the general cleaning functions.
  chain = ReviewChain.from_steps(
      patterns=[
          (replace_phone, {"mode": "replace"}),
          (replace_email, {"mode": "replace"}),
      ],
      functions=[case_normal, remove_stopwords, remove_punctuation],
  )

  cleaned, metrics = chain.transform(df["review_text"], benchmark=True)
  ```

  The returned metrics include per-step timings, so you can spot bottlenecks or confirm that PII masks run before downstream cleaners.
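The idea of ordered steps with per-step timings can be sketched without the library. The snippet below is a plain-Python analogue, not ReviewChain's implementation: `run_steps`, the two toy steps, and the shape of the returned timing dictionary are all assumptions made for illustration.

```python
import re
import time
from typing import Callable, Dict, Iterable, List, Tuple

Step = Callable[[str], str]


def mask_email(text: str) -> str:
    # Toy masking step (illustrative regex, not Leksara's).
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "[EMAIL]", text)


def collapse_whitespace(text: str) -> str:
    # Toy cleaning step: squeeze runs of whitespace into single spaces.
    return " ".join(text.split())


def run_steps(
    texts: Iterable[str], steps: List[Step]
) -> Tuple[List[str], Dict[str, float]]:
    """Apply steps in order to every text and record seconds spent per step."""
    current = list(texts)
    timings: Dict[str, float] = {}
    for step in steps:
        start = time.perf_counter()
        current = [step(t) for t in current]
        timings[step.__name__] = time.perf_counter() - start
    return current, timings


cleaned, timings = run_steps(
    ["Email   saya  user@mail.id"],
    [mask_email, collapse_whitespace],  # masking first, so cleaners never see raw PII
)
print(cleaned)
print(list(timings))  # key order mirrors execution order
```

Because Python dictionaries preserve insertion order, reading the keys of the timing dictionary is a quick way to check that masking ran before the cleaners.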
How Leksara Fits Together
| Layer | Purpose | Entry points |
|---|---|---|
| Pipelines | High-level orchestration that accepts raw sequences and returns cleaned text plus optional benchmarks. | leksara(...), ReviewChain |
| Frames | Review-table utilities for bulk audits and dashboard-friendly stats. | leksara.frames.cartboard |
| Functions | Composable cleaning helpers mirrored from the implementation modules. | leksara.function |
| Patterns | Opt-in masking utilities with regex rules for PII. | leksara.pattern |
| Resources | Dictionaries, regex rules, and whitelists that drive domain knowledge. | leksara/resources/ |
| Logging & Benchmarking | Optional hooks for throughput tuning and step-level visibility. | leksara.core.logging, benchmark=True |
Architectural notes, data contracts, and extension points for each layer are captured in docs/features.md.
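As an illustration of how a resource file can drive a cleaner in the Resources layer, the sketch below applies a slang dictionary to word tokens. The dictionary contents, the JSON layout, and the `load_dictionary` and `expand_slang` helpers are hypothetical; consult docs/features.md for the resource formats Leksara actually uses.

```python
import json
import re
from typing import Dict

# Hypothetical slang resource; a real project would load this from a file.
SLANG_JSON = '{"mantul": "mantap", "brg": "barang", "bgt": "banget"}'


def load_dictionary(raw: str) -> Dict[str, str]:
    """Parse a JSON object into a lookup table with lowercase keys."""
    return {key.lower(): value for key, value in json.loads(raw).items()}


def expand_slang(text: str, slang: Dict[str, str]) -> str:
    """Replace whole-word slang tokens, leaving unknown words untouched."""

    def swap(match: re.Match) -> str:
        word = match.group(0)
        return slang.get(word.lower(), word)

    return re.sub(r"\w+", swap, text)


slang = load_dictionary(SLANG_JSON)
print(expand_slang("brg mantul bgt!", slang))
```

Swapping in a dictionary for a different vertical changes the cleaner's behavior without touching `expand_slang`, which is the point of keeping domain knowledge in resources rather than in code.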
Documentation Map
| Topic | When to read | Location |
|---|---|---|
| Installation & environment | Provisioning a workstation or CI agent | docs/installation.md |
| Feature deep dives | Behavioral details, configuration knobs, and dependencies | docs/features.md |
| Public API reference | Signatures, argument descriptions, return payloads | docs/api.md |
| Worked examples | Copy/paste recipes for notebooks or pipelines | docs/examples.md |
| Dependency matrix | Optional packages and enterprise alignment | docs/dependencies.md |
| Contributing guide | Environment, style, testing, documentation expectations | docs/contributing.md |
Contributing & Support
- Read docs/contributing.md before opening a pull request; it outlines environment setup, code style, testing, and documentation requirements.
- File issues on GitHub with reproducible examples, presets used, optional dependencies, and OS details when reporting pipeline differences.
- Commercial or large-scale users should wrap ReviewChain with automated smoke tests to detect upstream dictionary or regex changes early.
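One way to set up such a smoke test is to pin a handful of golden input/output pairs and fail when the pipeline's output drifts. The sketch below is library-agnostic: `find_regressions`, the `GOLDEN` cases, and the stand-in `clean` function are hypothetical, and in practice `clean` would be replaced by a call into your own ReviewChain wrapper.

```python
from typing import Callable, Dict, List, Tuple


def find_regressions(
    clean: Callable[[str], str], golden: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (input, expected, actual) for every golden case that no longer matches."""
    failures = []
    for raw, expected in golden.items():
        actual = clean(raw)
        if actual != expected:
            failures.append((raw, expected, actual))
    return failures


def clean(text: str) -> str:
    # Stand-in cleaner so the sketch runs without Leksara installed.
    return " ".join(text.lower().split())


# Pinned expectations; regenerate deliberately when resources change.
GOLDEN = {
    "  Barang   BAGUS ": "barang bagus",
    "Pengiriman Cepat": "pengiriman cepat",
}

failures = find_regressions(clean, GOLDEN)
assert not failures, f"pipeline output drifted: {failures}"
print("smoke test passed")
```

Returning the full list of mismatches, rather than stopping at the first one, makes it easier to tell whether a dictionary or regex update affected a single case or the whole preset.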
Leksara is licensed under the terms specified in LICENSE.