Indonesian-language text processing library for the e-commerce domain (cleaning, PII masking, review mining, pipelines).

Leksara

Leksara is an Indonesian-language text preparation toolkit for data teams who need production-ready cleaning, masking, and normalization pipelines. The library bundles linguistic resources, a preset-driven orchestration layer, and modular helpers so you can audit raw text, remediate sensitive content, and standardize noisy reviews without rebuilding the stack for every project.


Feature Highlights

  • CartBoard review intake – Inspect raw datasets from chatbots or marketplaces, generate column-level flags (PII, non-alphabetical noise, ratings), and capture metadata for monitoring.
  • Composable cleaning utilities – leksara.function re-exports the building blocks (HTML stripping, casing, stopwords, punctuation, emoji, numeric cleanup) for ad-hoc preprocessing.
  • PII masking and redaction – Regex-backed detectors for Indonesian phone numbers, emails, addresses, and national IDs with configurable replacement modes and conflict handling.
  • Review-focused normalization – Slang and acronym expansion, contraction repair, elongated text trimming, rating extraction, stemming/normalization tuned for Bahasa Indonesia.
  • ReviewChain orchestrator – Run pipelines functionally with leksara(...) or via the ReviewChain class, mix presets with custom steps, and benchmark per-stage performance.
  • Resource-driven customization – Ship your own dictionaries and regex rules or extend the bundled JSON/CSV files to adapt the cleaner to new verticals.
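
To make the PII masking bullet concrete, here is a minimal sketch of regex-backed masking with a replacement mode, written with the standard library only. The patterns, the placeholder tokens, and the `mask_pii` helper are illustrative assumptions; they are not Leksara's bundled rules or its API.

```python
import re

# Illustrative patterns only (assumptions, not Leksara's shipped regex rules).
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Indonesian mobile numbers: +62, 62 or 0 prefix, an 8, then 8-11 more digits
# that may be separated by spaces or hyphens.
PHONE_RE = re.compile(r"(?:\+62|62|0)[\s-]?8(?:[\s-]?\d){8,11}")


def mask_pii(text: str, mode: str = "replace") -> str:
    """Mask emails and phone numbers in `text`.

    mode="replace" substitutes a placeholder token,
    mode="remove" deletes the match entirely.
    """
    if mode not in {"replace", "remove"}:
        raise ValueError(f"unsupported mode: {mode!r}")
    email_token = "[EMAIL]" if mode == "replace" else ""
    phone_token = "[PHONE]" if mode == "replace" else ""
    # Emails are handled first so digits inside an address are never
    # mistaken for a phone number.
    text = EMAIL_RE.sub(email_token, text)
    return PHONE_RE.sub(phone_token, text)


print(mask_pii("Email saya user@mail.id, WA 0812-3456-7890"))
# Email saya [EMAIL], WA [PHONE]
```

Running the email pattern before the phone pattern is one simple way to handle conflicts between detectors; a real rule set would need a broader policy for overlapping matches.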

Deep dives for each module live in docs/features.md together with API tables, dependencies, and ready-to-run recipes.


Quickstart

1. Install

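# Create and activate a virtual environment (Windows PowerShell shown;
# on macOS/Linux run: source .venv/bin/activate)
python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install leksara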

Optional extras and troubleshooting tips are listed in docs/installation.md.
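
After installing, it can help to confirm which release ended up in the active environment. The sketch below uses only the standard library's `importlib.metadata`; the `installed_version` helper is a hypothetical convenience and not part of Leksara.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints a version string when the install succeeded, otherwise None.
print(installed_version("leksara"))
```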

2. Clean a review column

import pandas as pd
from leksara import leksara

df = pd.DataFrame(
    {
        "review_id": [101, 102],
        "review_text": [
            "<p>Barangnya mantul!!! Email saya user@mail.id, WA 0812-3456-7890</p>",
            "Kualitasnya ⭐⭐⭐⭐, pengiriman 4/5. Hubungi +62 812 8888 7777",
        ],
    }
)

# Apply the ecommerce review preset
df["clean_text"] = leksara(df["review_text"], preset="ecommerce_review")
print(df[["review_id", "clean_text"]])
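
The sample rows above contain elongated punctuation, star emoji, and a "4/5" score. As a rough illustration of what rating extraction and elongated-text trimming can involve, here is a standard-library sketch. The regexes and the `extract_rating` and `trim_elongation` helpers are assumptions made for this example and do not describe what the `ecommerce_review` preset actually does.

```python
import re
from typing import Optional

STAR = "\u2b50"  # the star emoji used in the sample review
# One or more stars, each optionally followed by a variation selector.
STAR_RE = re.compile(r"(?:\u2b50\ufe0f?)+")
# Scores written as "n/5" with n between 0 and 5.
FRACTION_RE = re.compile(r"\b([0-5])\s*/\s*5\b")
# A letter or one of ! ? . repeated three or more times; digits are left alone.
ELONGATION_RE = re.compile(r"([^\W\d_]|[!?.])\1{2,}")


def extract_rating(text: str) -> Optional[int]:
    """Return a 0-5 rating from star emoji or an "n/5" score, else None."""
    stars = STAR_RE.search(text)
    if stars:
        return min(stars.group().count(STAR), 5)
    fraction = FRACTION_RE.search(text)
    if fraction:
        return int(fraction.group(1))
    return None


def trim_elongation(text: str, max_repeat: int = 1) -> str:
    """Collapse runs of three or more identical characters to `max_repeat`."""
    return ELONGATION_RE.sub(lambda m: m.group(1) * max_repeat, text)


print(extract_rating("Kualitasnya " + STAR * 4 + ", pengiriman 4/5"))  # 4
print(trim_elongation("Barangnya mantuuul!!!"))  # Barangnya mantul!
```

The three-repeat threshold keeps legitimate doubled letters such as those in "maaf" intact, and excluding digits avoids mangling phone numbers like "8888".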

3. Audit raw text with CartBoard

from leksara.frames.cartboard import get_flags, get_stats

flags = get_flags(df, text_column="review_text")
stats = get_stats(df, text_column="review_text")

print(flags[["review_id", "pii_flag", "rating_flag", "non_alphabetical_flag"]])
print(stats.iloc[0]["stats"])  # nested histogram of noise sources
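
For intuition about what column-level flags represent, the following sketch derives similar flags for plain dictionaries using the standard library. The patterns, the 0.3 noise threshold, and the `flag_rows` helper are hypothetical; CartBoard's real detection logic and output schema are documented in docs/features.md and may differ.

```python
import re
from typing import Any, Dict, List

# Illustrative detectors (assumptions, not CartBoard's rules).
PII_RE = re.compile(
    r"[\w.+-]+@[\w-]+\.[\w.-]+"              # email-like token
    r"|(?:\+62|0)[\s-]?8(?:[\s-]?\d){8,11}"  # Indonesian mobile number
)
RATING_RE = re.compile(r"\u2b50|\b[0-5]\s*/\s*5\b")


def non_alphabetical_ratio(text: str) -> float:
    """Share of non-whitespace characters that are not letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(not c.isalpha() for c in chars) / len(chars)


def flag_rows(
    rows: List[Dict[str, Any]], text_column: str, noise_threshold: float = 0.3
) -> List[Dict[str, Any]]:
    """Return copies of `rows` with three boolean flag fields added."""
    flagged = []
    for row in rows:
        text = str(row.get(text_column, ""))
        flagged.append(
            {
                **row,
                "pii_flag": bool(PII_RE.search(text)),
                "rating_flag": bool(RATING_RE.search(text)),
                "non_alphabetical_flag": non_alphabetical_ratio(text) > noise_threshold,
            }
        )
    return flagged


rows = [
    {"review_id": 1, "review_text": "Barang bagus sekali"},
    {"review_id": 2, "review_text": "WA 0812-3456-7890, 4/5"},
]
for row in flag_rows(rows, text_column="review_text"):
    print(row["review_id"], row["pii_flag"], row["rating_flag"], row["non_alphabetical_flag"])
# 1 False False False
# 2 True True True
```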

4. Compose a tailored pipeline

from leksara import ReviewChain
from leksara.function import (
    case_normal,
    remove_punctuation,
    remove_stopwords,
    replace_email,
    replace_phone,
)

chain = ReviewChain.from_steps(
    patterns=[(replace_phone, {"mode": "replace"}), (replace_email, {"mode": "replace"})],
    functions=[case_normal, remove_stopwords, remove_punctuation],
)

# benchmark=True additionally returns per-stage timing metadata
cleaned, metrics = chain.transform(df["review_text"], benchmark=True)
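
Conceptually, a chain like this applies each step to every text in order and, when benchmarking is on, records how long each stage took. The sketch below shows that idea with the standard library only. `run_chain` and the two stand-in steps are hypothetical and do not reflect how ReviewChain is implemented or what its metrics payload looks like.

```python
import string
import time
from typing import Callable, Dict, Iterable, List, Tuple

Step = Callable[[str], str]


def lowercase(text: str) -> str:
    """Stand-in for a casing step."""
    return text.lower()


def strip_punctuation(text: str) -> str:
    """Stand-in for a punctuation-removal step (ASCII punctuation only)."""
    return text.translate(str.maketrans("", "", string.punctuation))


def run_chain(
    texts: Iterable[str], steps: List[Step], benchmark: bool = False
) -> Tuple[List[str], Dict[str, float]]:
    """Apply `steps` in order; return cleaned texts and per-step seconds.

    The timing dict stays empty unless benchmark=True.
    """
    cleaned = list(texts)
    timings: Dict[str, float] = {}
    for step in steps:
        start = time.perf_counter()
        cleaned = [step(text) for text in cleaned]
        if benchmark:
            timings[step.__name__] = time.perf_counter() - start
    return cleaned, timings


cleaned, timings = run_chain(
    ["Barang BAGUS!!!"], [lowercase, strip_punctuation], benchmark=True
)
print(cleaned)          # ['barang bagus']
print(sorted(timings))  # ['lowercase', 'strip_punctuation']
```

Timing whole stages rather than individual rows keeps the measurement overhead small, which matters when the goal is throughput tuning.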

Documentation Map

Topic | When to read | Location
Installation & environment | You are provisioning a workstation or CI agent | docs/installation.md
Feature deep dives | You need behavioral details, configuration knobs, or per-feature dependencies | docs/features.md
Public API reference | You want signatures, argument descriptions, and return payload formats | docs/api.md
Worked examples | You prefer copy/paste recipes for notebooks or pipelines | docs/examples.md
Dependency matrix | You must vet optional packages or align with enterprise policies | docs/dependencies.md
Contributing | You plan to submit patches, run tests, or build docs | docs/contributing.md

How Leksara Fits Together

  • Pipelines – leksara(...) is a convenience wrapper around ReviewChain; both accept raw sequences (list/Series) and return cleaned text plus optional benchmarking details.
  • Frames layer – CartBoard and friends operate on review tables, deriving flags, statistics, and noise diagnostics suitable for dashboards.
  • Functions layer – The leksara.function module mirrors the implementation modules under leksara/functions so you can cherry-pick individual cleaners without touching internals.
  • Resources – Regex rules and dictionaries stored under leksara/resources/ drive PII detection, slang resolution, and whitelist protection. Update these files to specialize the toolkit.
  • Logging & benchmarking – leksara.core.logging ships opt-in helpers to emit step-level logs, while benchmark=True collects timing metadata for throughput tuning.

Architectural notes, data contracts, and extension points for each layer are captured in docs/features.md.
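
To illustrate the resource-driven idea, the sketch below loads a small slang dictionary from JSON and applies it word by word. The JSON layout, the sample entries, and the `load_slang` and `expand_slang` helpers are assumptions for this example; the actual file names and schemas under leksara/resources/ are not reproduced here.

```python
import json
import re
from typing import Dict


def load_slang(json_text: str) -> Dict[str, str]:
    """Parse a flat {"slang": "expansion"} JSON object, lowercasing the keys."""
    mapping = json.loads(json_text)
    return {str(key).lower(): str(value) for key, value in mapping.items()}


def expand_slang(text: str, slang: Dict[str, str]) -> str:
    """Replace whole words found in `slang`; other words pass through unchanged."""

    def swap(match: "re.Match[str]") -> str:
        word = match.group(0)
        return slang.get(word.lower(), word)

    return re.sub(r"\b\w+\b", swap, text)


# Hypothetical custom dictionary for a new vertical.
custom = '{"mantul": "mantap betul", "gk": "tidak", "brg": "barang"}'
slang = load_slang(custom)
print(expand_slang("brg mantul, gk nyesel", slang))
# barang mantap betul, tidak nyesel
```

Keeping the mapping in a data file rather than in code is what lets a team adapt the cleaner to a new domain without touching the pipeline itself.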


Contributing & Support

  • Read docs/contributing.md before opening a pull request. It covers environment setup, style, testing, and documentation requirements.
  • File issues on GitHub with reproducible examples; include the preset, optional dependencies, and OS details when reporting pipeline differences.
  • Commercial or large-scale users should build automated smoke tests around ReviewChain to detect upstream dictionary or regex changes.
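
A smoke test of the kind suggested above can be as simple as comparing current pipeline output against a small set of approved input/output pairs. The sketch below shows the pattern with a stand-in cleaner so it runs without Leksara installed; `find_regressions`, `stand_in_cleaner`, and the golden pairs are hypothetical, and in practice the callable would wrap your own ReviewChain or leksara(...) call.

```python
from typing import Callable, Dict, List


def find_regressions(clean: Callable[[str], str], golden: Dict[str, str]) -> List[str]:
    """Return one message per golden input whose cleaned output has changed."""
    failures = []
    for raw, expected in golden.items():
        actual = clean(raw)
        if actual != expected:
            failures.append(f"{raw!r}: expected {expected!r}, got {actual!r}")
    return failures


def stand_in_cleaner(text: str) -> str:
    """Placeholder pipeline: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


# Approved outputs, captured once and stored with the project.
GOLDEN = {
    "Barang  BAGUS": "barang bagus",
    "  cepat sampai ": "cepat sampai",
}

failures = find_regressions(stand_in_cleaner, GOLDEN)
print(failures)  # []
```

Running such a check after every dependency upgrade surfaces dictionary or regex changes as explicit diffs instead of silent shifts in downstream metrics.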

Leksara is licensed under the terms specified in LICENSE.

Release files

Version 0.1.1 is published as a source distribution and a built distribution (wheel):

File | Type | Size | Tags
leksara-0.1.1.tar.gz | Source distribution | 49.1 kB | Source
leksara-0.1.1-py3-none-any.whl | Built distribution | 56.9 kB | Python 3

Both files were uploaded via twine/6.1.0 CPython/3.13.7 using Trusted Publishing.

File hashes

Hashes for leksara-0.1.1.tar.gz
  • SHA256: 07565d67139f4630593426c1fe7f9b72c8ba691164729b1e8398d5bbc29d4464
  • MD5: a4ec699ee168c9b9e0a9945f9bd98b77
  • BLAKE2b-256: fb0c199564d01df977b0134eda482c88e160b9070411b11ec6a4a68d158a2196

Hashes for leksara-0.1.1-py3-none-any.whl
  • SHA256: ab1b613d11d9413a50ad70647a8c29c30004600cb1e062caac907d9c259c61e1
  • MD5: de63b47352852d6e2952a9e9ec2def7f
  • BLAKE2b-256: 4111e447595880fa380b1796eb02a4439a63cf4cbeaa1871b9d26101a0f8d212

Provenance

Attestation bundles were made for both files. Publisher: python-publish.yml on RedEye1605/Leksara. Values shown here reflect the state when the release was signed and may no longer be current.
