
Indonesian-language text processing library for the e-commerce domain (cleaning, PII masking, review mining, pipeline).


LEKSARA

Transforming Text, Empowering Insights Instantly


Indonesian-language text preparation toolkit for production review pipelines: clean, mask, and normalize in a single pass.

The library ships linguistic resources, preset-driven orchestration, and modular helpers so data teams can audit raw text, remediate sensitive content, and standardize noisy reviews without rebuilding the pipeline from scratch.




Why Leksara

| Capability | What you get | Key modules |
| --- | --- | --- |
| CartBoard review intake | Dataset audits with PII flags, rating detection, and noise diagnostics ready for dashboards. | leksara.frames.cartboard |
| Composable cleaning utilities | Reusable HTML stripping, casing, stopword, emoji, punctuation, and numeric cleanup helpers. | leksara.function |
| PII masking & redaction | Regex-backed replacement for Indonesian phones, emails, addresses, and national IDs with configurable modes. | leksara.pattern plus leksara/resources/regex_patterns/ |
| Review-focused normalization | Slang/acronym expansion, contraction repair, rating extraction, and elongation trimming for Bahasa Indonesia. | leksara.functions.review |
| ReviewChain orchestrator | leksara(...) wrapper and ReviewChain class for preset pipelines, benchmarking, and hybrid custom flows. | leksara.core.chain |
| Resource-driven customization | Drop in your own dictionaries and regex rules to adapt cleaners for new verticals. | leksara/resources/ |

Deep dives live in docs/features.md alongside API tables, dependencies, and ready-to-run notebooks.
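
To make the "regex-backed replacement with configurable modes" idea concrete, here is a minimal standalone sketch that uses only Python's `re` module. The two patterns, the `mask_pii` helper, and the `"remove"` mode name are assumptions made for illustration; they are not the rules or the API that Leksara ships (those live in leksara.pattern and leksara/resources/regex_patterns/).

```python
import re

# Illustrative patterns only; NOT the rules shipped with Leksara.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Indonesian mobile numbers: "+62" or "0" prefix, then 8xx, with optional spaces or dashes.
PHONE_RE = re.compile(r"(?:\+62|0)[\s-]?8\d{1,3}(?:[\s-]?\d{3,4}){2,3}")


def mask_pii(text: str, mode: str = "replace") -> str:
    """Mask emails and phone numbers in a single string.

    mode="replace" substitutes a placeholder token; mode="remove" deletes the match.
    The "remove" mode name is an assumption made for this sketch.
    """
    if mode not in {"replace", "remove"}:
        raise ValueError(f"unsupported mode: {mode!r}")
    email_token = "[EMAIL]" if mode == "replace" else ""
    phone_token = "[PHONE_NUMBER]" if mode == "replace" else ""
    text = EMAIL_RE.sub(email_token, text)
    text = PHONE_RE.sub(phone_token, text)
    # Collapse whitespace left behind by removed matches.
    return re.sub(r"\s+", " ", text).strip()


print(mask_pii("Email saya user@mail.id, WA 0812-3456-7890"))
print(mask_pii("Hubungi +62 812 8888 7777", mode="remove"))
```

Masking emails before phone numbers keeps digit runs inside an address from being misread as a phone number. Production rules usually need more care around landline formats and separators.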


Quickstart

  1. Create and activate a virtual environment, then install Leksara

    python -m venv .venv
    .\.venv\Scripts\Activate.ps1   # Windows PowerShell; on Linux/macOS use: source .venv/bin/activate
    pip install leksara
    

    Optional extras and troubleshooting tips are documented in docs/installation.md.

  2. Clean a review column with the ecommerce_review preset

    import pandas as pd
    from leksara import leksara
    
    df = pd.DataFrame(
        {
            "review_id": [101, 102],
            "review_text": [
                "<p>Barangnya mantul!!! Email saya user@mail.id, WA 0812-3456-7890</p>",
                "Kualitasnya ⭐⭐⭐⭐, pengiriman 4/5. Hubungi +62 812 8888 7777",
            ],
        }
    )
    
    df["clean_text"] = leksara(df["review_text"], preset="ecommerce_review")
    print(df[["review_id", "clean_text"]])
    
    | review_id | clean_text |
    | --- | --- |
    | 101 | barang mantap email [EMAIL] wa [PHONE_NUMBER] |
    | 102 | kualitas 4.0 kirim 4.0 hubung [PHONE_NUMBER] |

  3. Audit raw text with CartBoard

    from leksara.frames.cartboard import get_flags, get_stats
    
    flags = get_flags(df, text_column="review_text")
    stats = get_stats(df, text_column="review_text")
    
    print(flags[["review_id", "pii_flag", "rating_flag", "non_alphabetical_flag"]])
    print(stats.iloc[0]["stats"])  # nested histogram of noise sources
    
  4. Compose a tailored pipeline with ReviewChain

    from leksara import ReviewChain
    from leksara.function import case_normal, remove_punctuation, remove_stopwords
    from leksara.pattern import replace_email, replace_phone
    
    chain = ReviewChain.from_steps(
        patterns=[
            (replace_phone, {"mode": "replace"}),
            (replace_email, {"mode": "replace"}),
        ],
        functions=[case_normal, remove_stopwords, remove_punctuation],
    )
    
    cleaned, metrics = chain.transform(df["review_text"], benchmark=True)
    

    metrics includes per-step timings so you can spot bottlenecks or confirm PII masks run before downstream cleaners.
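
The per-step timing idea can be sketched in plain Python without the library. The `run_steps` helper and the two stand-in steps below are hypothetical and only show one way such metrics could be collected; they do not describe how ReviewChain is implemented or what shape its `metrics` object has.

```python
import time
from typing import Callable, Iterable


def run_steps(texts: Iterable[str], steps: list[Callable[[str], str]]):
    """Apply each step to every text, in order, and record seconds spent per step."""
    current = list(texts)
    timings: list[tuple[str, float]] = []
    for step in steps:
        start = time.perf_counter()
        current = [step(text) for text in current]
        timings.append((step.__name__, time.perf_counter() - start))
    return current, timings


def mask_digits(text: str) -> str:
    # Stand-in for a PII masking step.
    return "".join("#" if ch.isdigit() else ch for ch in text)


def lowercase(text: str) -> str:
    # Stand-in for a casing step.
    return text.lower()


cleaned, timings = run_steps(["WA 0812 Saya"], [mask_digits, lowercase])
print(cleaned)
for name, seconds in timings:
    print(f"{name}: {seconds:.6f}s")
```

Because the timings are appended in execution order, the same list that exposes slow steps also shows whether masking ran before the downstream cleaners.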


How Leksara Fits Together

| Layer | Purpose | Entry points |
| --- | --- | --- |
| Pipelines | High-level orchestration that accepts raw sequences and returns cleaned text plus optional benchmarks. | leksara(...), ReviewChain |
| Frames | Review-table utilities for bulk audits and dashboard-friendly stats. | leksara.frames.cartboard |
| Functions | Composable cleaning helpers mirrored from the implementation modules. | leksara.function |
| Patterns | Opt-in masking utilities with regex rules for PII. | leksara.pattern |
| Resources | Dictionaries, regex rules, and whitelists that drive domain knowledge. | leksara/resources/ |
| Logging & Benchmarking | Optional hooks for throughput tuning and step-level visibility. | leksara.core.logging, benchmark=True |

Architectural notes, data contracts, and extension points for each layer are captured in docs/features.md.
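
As a rough picture of how a dictionary resource can drive normalization, the sketch below combines elongation trimming with slang expansion. The in-memory `SLANG` dictionary and the `trim_elongation` and `normalize` helpers are assumptions for illustration; Leksara's own resource file formats and function names may differ.

```python
import re
from typing import Optional

# Illustrative entries; a real deployment would load these from its own resource files.
SLANG = {"mantul": "mantap", "bgt": "banget", "gk": "tidak"}


def trim_elongation(word: str) -> str:
    """Collapse any character repeated three or more times to a single occurrence."""
    return re.sub(r"(.)\1{2,}", r"\1", word)


def normalize(text: str, slang: Optional[dict[str, str]] = None) -> str:
    """Lowercase, trim elongated words, then expand slang token by token."""
    slang = SLANG if slang is None else slang
    tokens = re.findall(r"\w+", text.lower())
    trimmed = [trim_elongation(token) for token in tokens]
    return " ".join(slang.get(token, token) for token in trimmed)


print(normalize("Barangnya mantuuul bgt, maaf gk sempat foto"))
# Adapting to a new vertical only requires passing a different dictionary.
print(normalize("ori bgt", {**SLANG, "ori": "original"}))
```

Trimming only runs of three or more characters leaves legitimate double letters such as "maaf" intact, at the cost of missing two-letter elongations.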


Documentation Map

| Topic | When to read | Location |
| --- | --- | --- |
| Installation & environment | Provisioning a workstation or CI agent | docs/installation.md |
| Feature deep dives | Behavioral details, configuration knobs, and dependencies | docs/features.md |
| Public API reference | Signatures, argument descriptions, return payloads | docs/api.md |
| Worked examples | Copy/paste recipes for notebooks or pipelines | docs/examples.md |
| Dependency matrix | Optional packages and enterprise alignment | docs/dependencies.md |
| Contributing guide | Environment, style, testing, documentation expectations | docs/contributing.md |

Contributing & Support

  • Read docs/contributing.md before opening a pull request; it outlines environment setup, code style, testing, and documentation requirements.
  • File issues on GitHub with reproducible examples, presets used, optional dependencies, and OS details when reporting pipeline differences.
  • Commercial or large-scale users should wrap ReviewChain with automated smoke tests to detect upstream dictionary or regex changes early.
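
One possible shape for the smoke tests mentioned in the last bullet is a small set of golden cases checked against whatever cleaning callable you deploy. The `smoke_test` helper, the `GOLDEN_CASES` list, and the `fake_clean` stand-in are hypothetical; in practice you would pass a function that runs one string through your own ReviewChain.

```python
from typing import Callable

# Golden cases: raw input, substrings that must appear, substrings that must not appear.
GOLDEN_CASES = [
    ("Email saya user@mail.id", ["[EMAIL]"], ["user@mail.id"]),
    ("WA 0812-3456-7890", ["[PHONE_NUMBER]"], ["0812"]),
]


def smoke_test(clean: Callable[[str], str], cases=GOLDEN_CASES) -> list[str]:
    """Return human-readable failures; an empty list means every case passed."""
    failures = []
    for raw, must_contain, must_not_contain in cases:
        cleaned = clean(raw)
        for needle in must_contain:
            if needle not in cleaned:
                failures.append(f"{raw!r}: expected {needle!r} in {cleaned!r}")
        for needle in must_not_contain:
            if needle in cleaned:
                failures.append(f"{raw!r}: leaked {needle!r} in {cleaned!r}")
    return failures


def fake_clean(text: str) -> str:
    # Stand-in for the real pipeline; swap in a call to your own chain.
    return text.replace("user@mail.id", "[EMAIL]").replace("0812-3456-7890", "[PHONE_NUMBER]")


print(smoke_test(fake_clean))            # no failures expected from the stand-in
print(len(smoke_test(lambda text: text)))  # an identity "cleaner" fails every check
```

Running a check like this in CI after each dependency upgrade surfaces dictionary or regex changes as explicit failures instead of silent drift in cleaned output.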

Leksara is licensed under the terms specified in LICENSE.
