An Indonesian text-processing library for the e-commerce domain (cleaning, PII masking, review mining, pipelines).

Leksara

Description

Leksara is a Python toolkit designed to streamline the preprocessing and cleaning of Indonesian text data for Data Scientists and Machine Learning Engineers. It focuses on handling messy and noisy Indonesian text from various domains such as e-commerce reviews, social media posts, and chat conversations. The tool helps clean text by handling Indonesian-specific challenges like slang words, regional expressions, informal abbreviations, and mixed language content, while also providing standard cleaning features like punctuation and stopword removal. This makes it an essential tool for Indonesian text analysis and machine learning model preparation.

Key Features

  • Basic Cleaning Pipeline: A straightforward pipeline to clean raw text data by handling common tasks like punctuation removal, casing normalization, and stopword filtering.
  • Advanced Customization: Users can create custom cleaning pipelines tailored to specific datasets, including support for regex pattern matching, stemming, and custom dictionaries.
  • Preset Options: Includes predefined cleaning presets for various domains like e-commerce, allowing for one-click cleaning.
  • Slang and Informal Text Handling: Users can define their own custom dictionaries for slang terms and informal language, especially useful for Indonesian text.
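The features above can be sketched in plain Python. This is an illustrative stand-in, not Leksara's API: the `SLANG` and `STOPWORDS` dictionaries and the `clean` function below are hypothetical examples of the kinds of steps such a cleaning pipeline chains together.

```python
import re

# Hypothetical dictionaries; Leksara bundles its own resources
# (slang_dict.json, stopwords/id.txt) with far broader coverage.
SLANG = {"brgnya": "barangnya", "krg": "kurang", "bgs": "bagus", "ga": "tidak"}
STOPWORDS = {"yang", "dan", "di"}

def clean(text: str) -> str:
    text = text.lower()                                 # casing normalization
    text = re.sub(r"[^\w\s]", " ", text)                # punctuation removal
    tokens = [SLANG.get(t, t) for t in text.split()]    # slang normalization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword filtering
    return " ".join(tokens)

print(clean("Kualitasnya krg bgs, ga sesuai ekspektasi..."))
# kualitasnya kurang bagus tidak sesuai ekspektasi
```

Each step is an independent function over a string, which is what makes the pipeline composable and easy to customize per dataset.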

Usage Examples

Basic Usage: Basic Cleaning Pipeline

This example demonstrates how to clean e-commerce product reviews using a pre-built preset.

from Leksara import Leksara

# df is a pandas DataFrame holding the review data shown below
df['cleaned_review'] = Leksara(df['review_text'], preset='ecommerce_review')
print(df[['review_id', 'cleaned_review']])

Input Data (df):

review_id | review_text
1 | <p>brgnya ORI & pengiriman cepat. Mantulll 👍</p>
2 | Kualitasnya krg bgs, ga sesuai ekspektasi...

Output Data:

review_id | cleaned_review
1 | barang nya original pengiriman cepat mantap
2 | kualitasnya kurang bagus tidak sesuai ekspektasi

Advanced Usage: Custom Cleaning Pipeline

Customize the pipeline to mask phone numbers and normalize whitespace in chat logs.

from Leksara import Leksara
from Leksara.functions import to_lowercase, normalize_whitespace
from Leksara.patterns import MASK_PHONE_NUMBER

custom_pipeline = {
    'patterns': [MASK_PHONE_NUMBER],
    'functions': [to_lowercase, normalize_whitespace]
}

df['safe_message'] = Leksara(df['chat_message'], pipeline=custom_pipeline)
print(df[['chat_id', 'safe_message']])

Input Data (df):

chat_id | chat_message
101 | Hi kak, pesanan saya INV/123 blm sampai. No HP saya 081234567890
102 | Tolong dibantu ya sis, thanks

Output Data:

chat_id | safe_message
101 | hi kak, pesanan saya inv/123 blm sampai. no hp saya [PHONE_NUMBER]
102 | tolong dibantu ya sis, thanks
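The [PHONE_NUMBER] masking shown above can be approximated with a regular expression. The pattern below is an assumption about what a masker like MASK_PHONE_NUMBER might match (Indonesian mobile numbers, which start with 08 and run 10-13 digits); the library's actual pattern may differ.

```python
import re

# Illustrative pattern: "08" followed by 8-11 more digits.
PHONE_RE = re.compile(r"\b08\d{8,11}\b")

def mask_phone(text: str) -> str:
    """Replace anything that looks like an Indonesian mobile number."""
    return PHONE_RE.sub("[PHONE_NUMBER]", text)

print(mask_phone("No HP saya 081234567890"))
# No HP saya [PHONE_NUMBER]
```

Running pattern maskers before lowercasing or punctuation removal, as the custom pipeline does, keeps digit runs intact so the regex can still find them.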

Goals & Objectives

  • Provide an intuitive and adaptable cleaning tool for Indonesian text, focusing on domains like e-commerce.
  • Enable Data Scientists and ML Engineers to clean and preprocess text with minimal effort.
  • Allow for deep customization through configuration options and the use of custom dictionaries.

Success Metrics

  • On-time Delivery: Targeted release by October 15, 2025.
  • Processing Speed: Clean a 10,000-row Pandas Series in under 5 seconds.
  • Cleaning Accuracy: Achieve over 95% accuracy for core cleaning functions.
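The processing-speed target can be checked with a small timing harness. The `clean` function here is a trivial placeholder for the real Leksara pipeline; the point is the measurement shape, not the cleaner.

```python
import time
import pandas as pd

def clean(text: str) -> str:
    # Placeholder; substitute the actual Leksara call under test.
    return text.lower().strip()

# 10,000-row Series, matching the stated benchmark size.
series = pd.Series(["Contoh ulasan produk yang perlu dibersihkan..."] * 10_000)

start = time.perf_counter()
cleaned = series.map(clean)
elapsed = time.perf_counter() - start

print(f"{len(series)} rows cleaned in {elapsed:.3f}s")  # target: under 5s
```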

Folder Structure

Below is the recommended folder structure for organizing the project:

Leksara/
├── pyproject.toml                  # packaging & deps (nltk, etc.)
├── requirements.txt                # runtime deps (nltk, pandas, etc.)
├── README.md                       # overview & usage
├── leksara/                        # main package
│   ├── __init__.py                 # public API surface
│   ├── version.py                  # package version
│   ├── core/
│   │   ├── chain.py                # pipeline/CLI entry (matches pyproject scripts)
│   │   ├── logging.py              # logging/benchmark utilities
│   │   └── presets.py              # pipeline presets
│   ├── frames/
│   │   └── cartboard.py            # DataFrame helpers
│   ├── functions/                  # granular modules
│   │   ├── __init__.py
│   │   ├── cleaner/
│   │   │   ├── __init__.py
│   │   │   └── basic.py            # remove_tags, case_normal, remove_stopwords, etc.
│   │   ├── patterns/
│   │   │   ├── __init__.py
│   │   │   └── pii.py              # PII maskers (email/phone, etc.)
│   │   └── review/
│   │       ├── __init__.py
│   │       └── advanced.py         # advanced review functions
│   ├── resources/                  # bundled supporting data
│   │   ├── acronyms.csv
│   │   ├── contractions.json
│   │   ├── slang_dict.json
│   │   └── stopwords/
│   │       └── id.txt              # Indonesian stopwords (additions/abbreviations)
│   ├── tests/
│   │   ├── test_chain.py
│   │   ├── test_cleaner_basic.py
│   │   ├── test_patterns_pii.py
│   │   └── test_review_advanced.py
│   └── utils/
│       ├── lang.py
│       ├── regexes.py
│       ├── text.py                 # text helpers
│       └── whitelist.py
└── notebooks/
    └── leksara_quickstart.ipynb    # quickstart & demo

Milestones

Sprint | Dates | Goal
1 | Aug 18 – Aug 22 | Project kickoff, discovery, set up repository
2 | Aug 22 – Aug 29 | Build core cleaning engine
3 | Aug 29 – Sep 5 | Develop configurable features
4 | Sep 5 – Sep 12 | Implement advanced customization
5 | Sep 12 – Sep 19 | Refine API
6 | Sep 19 – Sep 26 | Optimize system
7 | Sep 26 – Oct 3 | Finalize documentation
8 | Oct 3 – Oct 10 | Prepare for launch

Requirements

  • Python 3.x
  • Pandas

Install

pip install Leksara

Contributors

  • Vivian & Zahra – Document Owners
  • Salsa – UI/UX Designer
  • Aufi, Althaf, Rhendy, Adit – Data Science Team
  • Alya, Vivin – Data Analyst Team

For more details on the features and usage, refer to the official documentation.
