
Indonesian text-processing library for the e-commerce domain (cleaning, PII masking, review mining, pipeline).


Leksara


Leksara is an Indonesian-language text preparation toolkit for data teams who need production-ready cleaning, masking, and normalization pipelines. The library bundles linguistic resources, a preset-driven orchestration layer, and modular helpers so you can audit raw text, remediate sensitive content, and standardize noisy reviews without rebuilding the stack for every project.


Feature Highlights

  • CartBoard review intake – Inspect raw datasets from chatbots or marketplaces, generate column-level flags (PII, non-alphabetical noise, ratings), and capture metadata for monitoring.
  • Composable cleaning utilities – leksara.function re-exports the building blocks (HTML stripping, casing, stopwords, punctuation, emoji, numeric cleanup) for ad-hoc preprocessing.
  • PII masking and redaction – Regex-backed detectors for Indonesian phone numbers, emails, addresses, and national IDs with configurable replacement modes and conflict handling.
  • Review-focused normalization – Slang and acronym expansion, contraction repair, elongated text trimming, rating extraction, stemming/normalization tuned for Bahasa Indonesia.
  • ReviewChain orchestrator – Run pipelines functionally with leksara(...) or via the ReviewChain class, mix presets with custom steps, and benchmark per-stage performance.
  • Resource-driven customization – Ship your own dictionaries and regex rules or extend the bundled JSON/CSV files to adapt the cleaner to new verticals.
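
To make the PII masking bullet concrete, here is a minimal sketch of regex-backed masking with a configurable replacement mode. It uses only the standard library. The patterns, the `mask_pii` name, and the `[EMAIL]`/`[PHONE]` tokens are illustrative assumptions, not Leksara's bundled rules or API.

```python
import re

# Illustrative patterns only (assumptions); Leksara's bundled rules may differ.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Indonesian mobile numbers: a +62, 62, or 0 prefix, then 8, then 8-11 more
# digits, each optionally preceded by a space or hyphen.
PHONE_RE = re.compile(r"(?<!\d)(?:\+62|62|0)[\s-]?8(?:[\s-]?\d){8,11}")


def mask_pii(text: str, mode: str = "replace") -> str:
    """Mask emails and phone numbers in `text`.

    mode="replace" swaps each match for a placeholder token;
    mode="remove" deletes the match entirely.
    """
    if mode not in {"replace", "remove"}:
        raise ValueError(f"unsupported mode: {mode!r}")
    email_token = "[EMAIL]" if mode == "replace" else ""
    phone_token = "[PHONE]" if mode == "replace" else ""
    # Emails first, so digits inside an address are never treated as a phone.
    text = EMAIL_RE.sub(email_token, text)
    return PHONE_RE.sub(phone_token, text)


print(mask_pii("Email saya user@mail.id, WA 0812-3456-7890"))
```

Running detectors in a fixed order is one simple way to handle overlapping matches; a production rule set would also need address and national ID patterns.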

Deep dives for each module live in docs/features.md together with API tables, dependencies, and ready-to-run recipes.


Quickstart

1. Install

python -m venv .venv
# Windows PowerShell; on macOS/Linux run: source .venv/bin/activate
.\.venv\Scripts\Activate.ps1
pip install leksara

Optional extras and troubleshooting tips are listed in docs/installation.md.
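
After installing, one way to confirm which version landed in the active environment is to query the package metadata. This sketch uses the standard library's `importlib.metadata`; the `installed_version` helper is a hypothetical name introduced here, not part of Leksara.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if the install succeeded, otherwise None.
print(installed_version("leksara"))
```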

2. Clean a review column in-place

import pandas as pd
from leksara import leksara

df = pd.DataFrame(
    {
        "review_id": [101, 102],
        "review_text": [
            "<p>Barangnya mantul!!! Email saya user@mail.id, WA 0812-3456-7890</p>",
            "Kualitasnya ⭐⭐⭐⭐, pengiriman 4/5. Hubungi +62 812 8888 7777",
        ],
    }
)

# Apply the ecommerce review preset
df["clean_text"] = leksara(df["review_text"], preset="ecommerce_review")
print(df[["review_id", "clean_text"]])

3. Audit raw text with CartBoard

from leksara.frames.cartboard import get_flags, get_stats

flags = get_flags(df, text_column="review_text")
stats = get_stats(df, text_column="review_text")

print(flags[["review_id", "pii_flag", "rating_flag", "non_alphabetical_flag"]])
print(stats.iloc[0]["stats"])  # nested histogram of noise sources
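
For intuition about what row-level flags like these could capture, the sketch below derives three booleans per review with plain regular expressions. It is not CartBoard's implementation: the patterns, the `flag_row` helper, and the 50% letter threshold for the non-alphabetical flag are all assumptions made for illustration.

```python
import re

# Assumed patterns: an email, or an Indonesian mobile number (+62 or 0 prefix).
PII_RE = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"
    r"|(?<!\d)(?:\+62|0)[\s-]?8(?:[\s-]?\d){8,11}"
)
# Assumed rating cues: "4/5"-style scores or a run of star emoji.
RATING_RE = re.compile(r"\b[1-5]\s*/\s*5\b|⭐+")


def flag_row(text: str) -> dict:
    """Return illustrative PII, rating, and noise flags for one review."""
    letters = sum(ch.isalpha() for ch in text)
    non_space = sum(not ch.isspace() for ch in text)
    return {
        "pii_flag": bool(PII_RE.search(text)),
        "rating_flag": bool(RATING_RE.search(text)),
        # Assumption: "noisy" means under half the visible characters are letters.
        "non_alphabetical_flag": non_space > 0 and letters / non_space < 0.5,
    }


for review in ["Bagus banget", "pengiriman 4/5, WA 0812-3456-7890", "12345 !!! 678"]:
    print(review, flag_row(review))
```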

4. Compose a tailored pipeline

from leksara import ReviewChain
from leksara.function import (
    case_normal,
    remove_punctuation,
    remove_stopwords,
    replace_email,
    replace_phone,
)

chain = ReviewChain.from_steps(
    patterns=[(replace_phone, {"mode": "replace"}), (replace_email, {"mode": "replace"})],
    functions=[case_normal, remove_stopwords, remove_punctuation],
)

cleaned, metrics = chain.transform(df["review_text"], benchmark=True)
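
The idea of running ordered steps while collecting per-stage timings can be sketched without the library. The following is a conceptual stand-in, not ReviewChain's internals: `run_chain`, `lowercase`, and `strip_punctuation` are hypothetical names, and the shape of the timing dictionary is an assumption.

```python
import string
import time
from typing import Callable, Dict, Iterable, List

Step = Callable[[str], str]

_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def lowercase(text: str) -> str:
    return text.lower()


def strip_punctuation(text: str) -> str:
    return text.translate(_PUNCT_TABLE)


def run_chain(texts: Iterable[str], steps: List[Step], benchmark: bool = False):
    """Apply each step to every text in order.

    With benchmark=True, also return seconds spent per step, keyed by
    the step's function name.
    """
    current = list(texts)
    timings: Dict[str, float] = {}
    for step in steps:
        start = time.perf_counter()
        current = [step(t) for t in current]
        timings[step.__name__] = time.perf_counter() - start
    return (current, timings) if benchmark else current


cleaned, timings = run_chain(
    ["Halo, Dunia!"], [lowercase, strip_punctuation], benchmark=True
)
print(cleaned)  # ['halo dunia']
print(timings)
```

Timing each step over the whole batch, rather than per row, keeps the measurement overhead small and shows which stage dominates throughput.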

Documentation Map

Topic | When to read | Location
Installation & environment | You are provisioning a workstation or CI agent | docs/installation.md
Feature deep dives | You need behavioral details, configuration knobs, or per-feature dependencies | docs/features.md
Public API reference | You want signatures, argument descriptions, and return payload formats | docs/api.md
Worked examples | You prefer copy/paste recipes for notebooks or pipelines | docs/examples.md
Dependency matrix | You must vet optional packages or align with enterprise policies | docs/dependencies.md
Contributing | You plan to submit patches, run tests, or build docs | docs/contributing.md

How Leksara Fits Together

  • Pipelines – leksara(...) is a convenience wrapper around ReviewChain; both accept raw sequences (list/Series) and return cleaned text plus optional benchmarking details.
  • Frames layer – CartBoard and friends operate on review tables, deriving flags, statistics, and noise diagnostics suitable for dashboards.
  • Functions layer – The leksara.function module mirrors the implementation modules under leksara/functions so you can cherry-pick individual cleaners without touching internals.
  • Resources – Regex rules and dictionaries stored under leksara/resources/ drive PII detection, slang resolution, and whitelist protection. Update these files to specialise the toolkit.
  • Logging & benchmarking – leksara.core.logging ships opt-in helpers to emit step-level logs, while benchmark=True collects timing metadata for throughput tuning.
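
As a rough illustration of the resource-driven approach described in the Resources bullet, the sketch below loads a slang dictionary from JSON and applies it with a whitelist of protected words. The JSON shape, the sample entries, and the `load_slang`/`expand_slang` helpers are assumptions for demonstration; the actual file layout under leksara/resources/ may differ.

```python
import json
import re
from typing import Dict, FrozenSet

# Stand-in for a dictionary file; in practice this would be read from disk.
SLANG_JSON = '{"gk": "tidak", "bgt": "banget", "brg": "barang"}'


def load_slang(json_text: str) -> Dict[str, str]:
    """Parse a slang-to-standard mapping and lowercase its keys."""
    return {key.lower(): value for key, value in json.loads(json_text).items()}


def expand_slang(
    text: str, slang: Dict[str, str], protected: FrozenSet[str] = frozenset()
) -> str:
    """Replace known slang words, leaving whitelisted words untouched."""

    def swap(match: "re.Match[str]") -> str:
        word = match.group(0)
        if word.lower() in protected:
            return word
        return slang.get(word.lower(), word)

    return re.sub(r"\w+", swap, text)


slang = load_slang(SLANG_JSON)
print(expand_slang("Brg gk bagus bgt", slang))  # barang tidak bagus banget
print(expand_slang("gk bgt", slang, protected=frozenset({"bgt"})))  # tidak bgt
```

Keeping the mapping in a data file rather than in code means a new vertical can be supported by editing the dictionary alone.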

Architectural notes, data contracts, and extension points for each layer are captured in docs/features.md.


Contributing & Support

  • Read docs/contributing.md before opening a pull request. It covers environment setup, style, testing, and documentation requirements.
  • File issues on GitHub with reproducible examples; include the preset, optional dependencies, and OS details when reporting pipeline differences.
  • Commercial or large-scale users should build automated smoke tests around ReviewChain to detect upstream dictionary or regex changes.
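
A smoke test of the kind suggested in the last bullet can be as simple as comparing current output against recorded expectations. The sketch below is library-agnostic: `find_regressions` is a hypothetical helper, and the `clean` callable is a stand-in that you would replace with a wrapper around your own ReviewChain call.

```python
from typing import Callable, Dict, List, Tuple


def find_regressions(
    clean: Callable[[str], str], golden: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (raw, expected, actual) for every input whose output drifted."""
    mismatches = []
    for raw, expected in golden.items():
        actual = clean(raw)
        if actual != expected:
            mismatches.append((raw, expected, actual))
    return mismatches


def stand_in_cleaner(text: str) -> str:
    # Placeholder for the real pipeline; only trims and lowercases.
    return text.strip().lower()


golden = {"  Halo Dunia ": "halo dunia", "MANTUL": "mantul"}
print(find_regressions(stand_in_cleaner, golden))  # []
```

Recording the golden outputs once per release and failing the build on any mismatch makes dictionary or regex changes visible before they reach production data.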

Leksara is licensed under the terms specified in LICENSE.
