
An Indonesian text-processing library for the e-commerce domain (cleaning, PII masking, review mining, pipelines).

Leksara

Description

Leksara is a Python toolkit designed to streamline the preprocessing and cleaning of Indonesian text data for Data Scientists and Machine Learning Engineers. It focuses on handling messy and noisy Indonesian text from various domains such as e-commerce reviews, social media posts, and chat conversations. The tool helps clean text by handling Indonesian-specific challenges like slang words, regional expressions, informal abbreviations, and mixed language content, while also providing standard cleaning features like punctuation and stopword removal. This makes it an essential tool for Indonesian text analysis and machine learning model preparation.

Key Features

  • Basic Cleaning Pipeline: A straightforward pipeline to clean raw text data by handling common tasks like punctuation removal, casing normalization, and stopword filtering.
  • Advanced Customization: Users can create custom cleaning pipelines tailored to specific datasets, including support for regex pattern matching, stemming, and custom dictionaries.
  • Preset Options: Includes predefined cleaning presets for various domains like e-commerce, allowing for one-click cleaning.
  • Slang and Informal Text Handling: Users can define their own custom dictionaries for slang terms and informal language, especially useful for Indonesian text.
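As an illustration of the custom-dictionary feature, here is a minimal slang-normalization sketch; the dictionary entries and the `normalize_slang` helper are illustrative stand-ins, not Leksara's bundled resources or API:

```python
# Illustrative slang dictionary; Leksara bundles its own slang_dict.json.
SLANG = {
    "brgnya": "barangnya",
    "krg": "kurang",
    "bgs": "bagus",
    "ga": "tidak",
    "blm": "belum",
}

def normalize_slang(text: str, mapping: dict[str, str] = SLANG) -> str:
    # Replace each whitespace-separated token that appears in the mapping.
    return " ".join(mapping.get(tok, tok) for tok in text.split())

print(normalize_slang("kualitasnya krg bgs ga sesuai"))
# kualitasnya kurang bagus tidak sesuai
```

A real pipeline would also need to handle punctuation attached to tokens and casing before lookup.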

Usage Examples

Basic Usage: Basic Cleaning Pipeline

This example demonstrates how to clean e-commerce product reviews using a pre-built preset.

from Leksara import Leksara

# df is a pandas DataFrame with 'review_id' and 'review_text' columns (see below)
df['cleaned_review'] = Leksara(df['review_text'], preset='ecommerce_review')
print(df[['review_id', 'cleaned_review']])

Input Data (df):

review_id  review_text
1          <p>brgnya ORI & pengiriman cepat. Mantulll 👍</p>
2          Kualitasnya krg bgs, ga sesuai ekspektasi...

Output Data:

review_id  cleaned_review
1          barang nya original pengiriman cepat mantap
2          kualitasnya kurang bagus tidak sesuai ekspektasi

Advanced Usage: Custom Cleaning Pipeline

Customize the pipeline to mask phone numbers and normalize whitespace in chat logs.

from Leksara import Leksara
from Leksara.functions import to_lowercase, normalize_whitespace
from Leksara.patterns import MASK_PHONE_NUMBER

custom_pipeline = {
    'patterns': [MASK_PHONE_NUMBER],
    'functions': [to_lowercase, normalize_whitespace]
}

df['safe_message'] = Leksara(df['chat_message'], pipeline=custom_pipeline)
print(df[['chat_id', 'safe_message']])

Input Data (df):

chat_id  chat_message
101      Hi kak, pesanan saya INV/123 blm sampai. No HP saya 081234567890
102      Tolong dibantu ya sis, thanks

Output Data:

chat_id  safe_message
101      hi kak, pesanan saya inv/123 blm sampai. no hp saya [PHONE_NUMBER]
102      tolong dibantu ya sis, thanks
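The masking step above can be approximated with a short regex. `PHONE_RE` and `mask_phone` below are hypothetical stand-ins, not Leksara's actual `MASK_PHONE_NUMBER` pattern:

```python
import re

# Indonesian mobile numbers commonly start with 08 or +628 and run
# 10-13 digits in total; this is a rough sketch, not an exhaustive rule.
PHONE_RE = re.compile(r"(?:\+62|0)8\d{8,11}")

def mask_phone(text: str) -> str:
    return PHONE_RE.sub("[PHONE_NUMBER]", text)

print(mask_phone("No HP saya 081234567890"))
# No HP saya [PHONE_NUMBER]
```

A production pattern would also cover separators like dashes and spaces inside the number.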

Goals & Objectives

  • Provide an intuitive and adaptable cleaning tool for Indonesian text, focusing on domains like e-commerce.
  • Enable Data Scientists and ML Engineers to clean and preprocess text with minimal effort.
  • Allow for deep customization through configuration options and the use of custom dictionaries.

Success Metrics

  • On-time Delivery: Targeted release by October 15, 2025.
  • Processing Speed: Clean a 10,000-row Pandas Series in under 5 seconds.
  • Cleaning Accuracy: Achieve over 95% accuracy for core cleaning functions.
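The processing-speed target can be sanity-checked with a rough benchmark. `simple_clean` below is a stand-in for the real pipeline (lowercase, strip punctuation, collapse whitespace), not Leksara's implementation:

```python
import re
import time
import pandas as pd

def simple_clean(text: str) -> str:
    # Lowercase, replace punctuation with spaces, collapse whitespace.
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

reviews = pd.Series(["Brgnya ORI & pengiriman cepat!!!"] * 10_000)
start = time.perf_counter()
cleaned = reviews.map(simple_clean)
elapsed = time.perf_counter() - start
print(f"Cleaned {len(cleaned)} rows in {elapsed:.2f} s")
```

Real per-row cost depends on how many pipeline stages (slang lookup, stemming, PII regexes) are enabled.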

Folder Structure

Below is the recommended folder structure for organizing the project:

Leksara/
├── pyproject.toml                  # packaging & deps (nltk, etc.)
├── requirements.txt                # runtime deps (nltk, pandas, etc.)
├── README.md                       # overview & usage
├── leksara/                        # main package
│   ├── __init__.py                 # public API surface
│   ├── version.py                  # package version
│   ├── core/
│   │   ├── chain.py                # pipeline/CLI entry (per pyproject scripts)
│   │   ├── logging.py              # logging/benchmark utilities
│   │   └── presets.py              # pipeline presets
│   ├── frames/
│   │   └── cartboard.py            # DataFrame helpers
│   ├── functions/                  # granular modules
│   │   ├── __init__.py
│   │   ├── cleaner/
│   │   │   ├── __init__.py
│   │   │   └── basic.py            # remove_tags, case_normal, remove_stopwords, etc.
│   │   ├── patterns/
│   │   │   ├── __init__.py
│   │   │   └── pii.py              # PII maskers (email/phone, etc.)
│   │   └── review/
│   │       ├── __init__.py
│   │       └── advanced.py         # advanced review functions
│   ├── resources/                  # bundled supporting data
│   │   ├── acronyms.csv
│   │   ├── contractions.json
│   │   ├── slang_dict.json
│   │   └── stopwords/
│   │       └── id.txt              # Indonesian stopwords (additions/abbreviations)
│   ├── tests/
│   │   ├── test_chain.py
│   │   ├── test_cleaner_basic.py
│   │   ├── test_patterns_pii.py
│   │   └── test_review_advanced.py
│   └── utils/
│       ├── lang.py
│       ├── regexes.py
│       ├── text.py                 # text helpers
│       └── whitelist.py
└── notebooks/
    └── leksara_quickstart.ipynb    # quickstart & demo

Milestones

Sprint  Dates            Goal
1       Aug 18 – Aug 22  Project kickoff, discovery, set up repository
2       Aug 22 – Aug 29  Build core cleaning engine
3       Aug 29 – Sep 5   Develop configurable features
4       Sep 5 – Sep 12   Implement advanced customization
5       Sep 12 – Sep 19  Refine API
6       Sep 19 – Sep 26  Optimize system
7       Sep 26 – Oct 3   Finalize documentation
8       Oct 3 – Oct 10   Prepare for launch

Requirements

  • Python 3.x
  • Pandas

Install

pip install Leksara

Contributors

  • Vivian & Zahra – Document Owners
  • Salsa – UI/UX Designer
  • Aufi, Althaf, Rhendy, Adit – Data Science Team
  • Alya, Vivin – Data Analyst Team

For more details on features and usage, refer to the official documentation.

Download files

  • Source distribution: leksara-0.0.8.tar.gz (32.6 kB)
  • Built distribution: leksara-0.0.8-py3-none-any.whl (38.2 kB)

File details: leksara-0.0.8.tar.gz

  • Size: 32.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

Hashes for leksara-0.0.8.tar.gz:

  SHA256       b2ade81369a1c746676a839d2a7862f36cc53518287c516b17e5303fcc42c7b4
  MD5          765bb79526d28af2f9d63f889dba932a
  BLAKE2b-256  14ee862be0d791ac3871b72c6f8cdff2ef82ee5975aa6333aa948e568d752722

Provenance

Attestation bundle for leksara-0.0.8.tar.gz — Publisher: python-publish.yml on RedEye1605/Leksara. Values reflect the state when the release was signed and may no longer be current.

File details: leksara-0.0.8-py3-none-any.whl

  • Size: 38.2 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

Hashes for leksara-0.0.8-py3-none-any.whl:

  SHA256       56c0ec1679a91102c9ea1a286a1edbd4fb99c193440d5cd66c9ab9f571c0dd19
  MD5          b4a45c6e7ef344309e26b19824fd8554
  BLAKE2b-256  77b89b2be7da8337aca3b685e61e8dbc80501ef449ff6d5fb148d6041a39236b

Provenance

Attestation bundle for leksara-0.0.8-py3-none-any.whl — Publisher: python-publish.yml on RedEye1605/Leksara. Values reflect the state when the release was signed and may no longer be current.
