
An Indonesian text-processing library for the e-commerce domain (cleaning, PII masking, review mining, pipelines).

Leksara

Description

Leksara is a Python toolkit designed to streamline the preprocessing and cleaning of Indonesian text data for Data Scientists and Machine Learning Engineers. It focuses on handling messy and noisy Indonesian text from various domains such as e-commerce reviews, social media posts, and chat conversations. The tool helps clean text by handling Indonesian-specific challenges like slang words, regional expressions, informal abbreviations, and mixed language content, while also providing standard cleaning features like punctuation and stopword removal. This makes it an essential tool for Indonesian text analysis and machine learning model preparation.

Key Features

  • Basic Cleaning Pipeline: A straightforward pipeline to clean raw text data by handling common tasks like punctuation removal, casing normalization, and stopword filtering.
  • Advanced Customization: Users can create custom cleaning pipelines tailored to specific datasets, including support for regex pattern matching, stemming, and custom dictionaries.
  • Preset Options: Includes predefined cleaning presets for various domains like e-commerce, allowing for one-click cleaning.
  • Slang and Informal Text Handling: Users can define their own custom dictionaries for slang terms and informal language, especially useful for Indonesian text.
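As an illustration of the custom-dictionary feature, here is a minimal slang-normalization sketch; the dictionary entries and the `normalize_slang` helper are illustrative stand-ins, not Leksara's bundled resources or API:

```python
# Illustrative slang dictionary; Leksara bundles its own slang_dict.json.
SLANG = {
    "brgnya": "barangnya",
    "krg": "kurang",
    "bgs": "bagus",
    "ga": "tidak",
    "blm": "belum",
}

def normalize_slang(text: str, mapping: dict[str, str] = SLANG) -> str:
    # Replace each whitespace-separated token that appears in the mapping.
    return " ".join(mapping.get(tok, tok) for tok in text.split())

print(normalize_slang("kualitasnya krg bgs ga sesuai"))
# kualitasnya kurang bagus tidak sesuai
```

A real pipeline would also need to handle punctuation attached to tokens and casing before lookup.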

Usage Examples

Basic Usage: Basic Cleaning Pipeline

This example demonstrates how to clean e-commerce product reviews using a pre-built preset.

from Leksara import Leksara

# df is a pandas DataFrame with 'review_id' and 'review_text' columns (see below)
df['cleaned_review'] = Leksara(df['review_text'], preset='ecommerce_review')
print(df[['review_id', 'cleaned_review']])

Input Data (df):

review_id  review_text
1          <p>brgnya ORI & pengiriman cepat. Mantulll 👍</p>
2          Kualitasnya krg bgs, ga sesuai ekspektasi...

Output Data:

review_id  cleaned_review
1          barang nya original pengiriman cepat mantap
2          kualitasnya kurang bagus tidak sesuai ekspektasi

Advanced Usage: Custom Cleaning Pipeline

Customize the pipeline to mask phone numbers and normalize whitespace in chat logs.

from Leksara import Leksara
from Leksara.functions import to_lowercase, normalize_whitespace
from Leksara.patterns import MASK_PHONE_NUMBER

custom_pipeline = {
    'patterns': [MASK_PHONE_NUMBER],
    'functions': [to_lowercase, normalize_whitespace]
}

df['safe_message'] = Leksara(df['chat_message'], pipeline=custom_pipeline)
print(df[['chat_id', 'safe_message']])

Input Data (df):

chat_id  chat_message
101      Hi kak, pesanan saya INV/123 blm sampai. No HP saya 081234567890
102      Tolong dibantu ya sis, thanks

Output Data:

chat_id  safe_message
101      hi kak, pesanan saya inv/123 blm sampai. no hp saya [PHONE_NUMBER]
102      tolong dibantu ya sis, thanks
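The masking step above can be approximated with a short regex. `PHONE_RE` and `mask_phone` below are hypothetical stand-ins, not Leksara's actual `MASK_PHONE_NUMBER` pattern:

```python
import re

# Indonesian mobile numbers commonly start with 08 or +628 and run
# 10-13 digits in total; this is a rough sketch, not an exhaustive rule.
PHONE_RE = re.compile(r"(?:\+62|0)8\d{8,11}")

def mask_phone(text: str) -> str:
    return PHONE_RE.sub("[PHONE_NUMBER]", text)

print(mask_phone("No HP saya 081234567890"))
# No HP saya [PHONE_NUMBER]
```

A production pattern would also cover separators like dashes and spaces inside the number.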

Goals & Objectives

  • Provide an intuitive and adaptable cleaning tool for Indonesian text, focusing on domains like e-commerce.
  • Enable Data Scientists and ML Engineers to clean and preprocess text with minimal effort.
  • Allow for deep customization through configuration options and the use of custom dictionaries.

Success Metrics

  • On-time Delivery: Targeted release by October 15, 2025.
  • Processing Speed: Clean a 10,000-row Pandas Series in under 5 seconds.
  • Cleaning Accuracy: Achieve over 95% accuracy for core cleaning functions.
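The processing-speed target can be sanity-checked with a rough benchmark. `simple_clean` below is a stand-in for the real pipeline (lowercase, strip punctuation, collapse whitespace), not Leksara's implementation:

```python
import re
import time
import pandas as pd

def simple_clean(text: str) -> str:
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

reviews = pd.Series(["Brgnya ORI & pengiriman cepat!!!"] * 10_000)
start = time.perf_counter()
cleaned = reviews.map(simple_clean)
elapsed = time.perf_counter() - start
print(f"Cleaned {len(cleaned)} rows in {elapsed:.2f} s")
```

Real per-row cost depends on how many pipeline stages (slang lookup, stemming, PII regexes) are enabled.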

Folder Structure

Below is the recommended folder structure for organizing the project:

Leksara/
├── pyproject.toml                  # packaging & deps (nltk, etc.)
├── requirements.txt                # runtime deps (nltk, pandas, etc.)
├── README.md                       # overview & usage
├── leksara/                        # main package
│   ├── __init__.py                 # public API surface
│   ├── version.py                  # package version
│   ├── core/
│   │   ├── chain.py                # pipeline/CLI entry (per pyproject scripts)
│   │   ├── logging.py              # logging/benchmark utilities
│   │   └── presets.py              # pipeline presets
│   ├── frames/
│   │   └── cartboard.py            # DataFrame helpers
│   ├── functions/                  # granular modules
│   │   ├── __init__.py
│   │   ├── cleaner/
│   │   │   ├── __init__.py
│   │   │   └── basic.py            # remove_tags, case_normal, remove_stopwords, etc.
│   │   ├── patterns/
│   │   │   ├── __init__.py
│   │   │   └── pii.py              # PII maskers (email/phone, etc.)
│   │   └── review/
│   │       ├── __init__.py
│   │       └── advanced.py         # advanced review functions
│   ├── resources/                  # bundled supporting data
│   │   ├── acronyms.csv
│   │   ├── contractions.json
│   │   ├── slang_dict.json
│   │   └── stopwords/
│   │       └── id.txt              # Indonesian stopwords (additions/abbreviations)
│   ├── tests/
│   │   ├── test_chain.py
│   │   ├── test_cleaner_basic.py
│   │   ├── test_patterns_pii.py
│   │   └── test_review_advanced.py
│   └── utils/
│       ├── lang.py
│       ├── regexes.py
│       ├── text.py                 # text helpers
│       └── whitelist.py
└── notebooks/
    └── leksara_quickstart.ipynb    # quickstart & demo

Milestones

Sprint  Dates            Goal
1       Aug 18 – Aug 22  Project kickoff, discovery, set up repository
2       Aug 22 – Aug 29  Build core cleaning engine
3       Aug 29 – Sep 5   Develop configurable features
4       Sep 5 – Sep 12   Implement advanced customization
5       Sep 12 – Sep 19  Refine API
6       Sep 19 – Sep 26  Optimize system
7       Sep 26 – Oct 3   Finalize documentation
8       Oct 3 – Oct 10   Prepare for launch

Requirements

  • Python 3.x
  • Pandas

Install

pip install Leksara

Contributors

  • Vivian & Zahra – Document Owners
  • Salsa – UI/UX Designer
  • Aufi, Althaf, Rhendy, Adit – Data Science Team
  • Alya, Vivin – Data Analyst Team

For more details on features and usage, refer to the official documentation.

Download files

  • Source distribution: leksara-0.0.8.tar.gz (32.6 kB)
  • Built distribution: leksara-0.0.8-py3-none-any.whl (38.2 kB)

File details: leksara-0.0.8.tar.gz

  • Size: 32.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

Hashes for leksara-0.0.8.tar.gz:

  SHA256       b2ade81369a1c746676a839d2a7862f36cc53518287c516b17e5303fcc42c7b4
  MD5          765bb79526d28af2f9d63f889dba932a
  BLAKE2b-256  14ee862be0d791ac3871b72c6f8cdff2ef82ee5975aa6333aa948e568d752722

Provenance

Attestation bundle for leksara-0.0.8.tar.gz — Publisher: python-publish.yml on RedEye1605/Leksara. Values reflect the state when the release was signed and may no longer be current.

File details: leksara-0.0.8-py3-none-any.whl

  • Size: 38.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

Hashes for leksara-0.0.8-py3-none-any.whl:

  SHA256       56c0ec1679a91102c9ea1a286a1edbd4fb99c193440d5cd66c9ab9f571c0dd19
  MD5          b4a45c6e7ef344309e26b19824fd8554
  BLAKE2b-256  77b89b2be7da8337aca3b685e61e8dbc80501ef449ff6d5fb148d6041a39236b

Provenance

Attestation bundle for leksara-0.0.8-py3-none-any.whl — Publisher: python-publish.yml on RedEye1605/Leksara. Values reflect the state when the release was signed and may no longer be current.
