An autocorrect tool for Khmer addresses, built specifically for Khmer National ID cards
Autocorrect for Khmer National ID Addresses (autocorrect_kh)
This Python script (autocorrect.py) provides an autocorrection tool for Khmer addresses on Cambodian National ID cards. It processes addresses in two parts—address_1 (house, road, village) and address_2 (commune, district, province)—using dictionary-based correction and custom rules tailored to Khmer script.
Features
- Khmer Address Correction: Fixes typos and misspellings in Khmer addresses.
- Two-Part Processing: Splits addresses into address_1 (ផ្ទះ, ផ្លូវ, ភូមិ) and address_2 (ឃុំ, district, province).
- Dictionary Support: Loads correction dictionaries from text files or folders.
- Custom Logic: Handles unique Khmer terms like ផ្ទះ (house) and ផ្លូវ (road) with specific rules.
- Unicode Normalization: Ensures consistent Khmer text processing.
Requirements
- Python 3.x
- Required packages:
  - jellyfish (for Damerau-Levenshtein distance)
  - regex (for advanced pattern matching)
  - unicodedata (included in the Python standard library)
  - pkg_resources (included with setuptools)
Installation
Install the library via pip:
pip install autocorrect_kh
Or install from source:
git clone https://github.com/monykappa/autocorrect-kh.git
Usage
Autocorrect for address_1 and address_2
from autocorrect_kh import autocorrect_address_1, autocorrect_address_2
address_1 = "ផ្ទ៤១បេ ផ្លុវ៤៤៤ ភុមិ២"
address_2 = "សង្កាត់ទលទពូងទី ២ ខណ្ឌចំករមន ភ្នំពញ"
address_1_text = autocorrect_address_1(address_1) # Output: ផ្ទះ៤១បេ ផ្លូវ៤៤៤ ភូមិ២
address_2_text = autocorrect_address_2(address_2) # Output: សង្កាត់ទួលទំពូងទី២ ខណ្ឌចំការមន ភ្នំពេញ
print("Autocorrected Address:", address_1_text + " " + address_2_text)
Autocorrect Address Separately
from autocorrect_kh import autocorrect_province, autocorrect_district, autocorrect_khum, autocorrect_phum
phum_text = "កូមិត្រពាងថ្លង២"
khum_text = "សង្កាក់ច្បាអំពៅ២"
district_text = "ខណ្ឌចំករមន"
province_text = "កំពង់ចម"
# Use separate result names so the imported functions are not overwritten.
corrected_phum = autocorrect_phum(phum_text)
corrected_khum = autocorrect_khum(khum_text)
corrected_district = autocorrect_district(district_text)
corrected_province = autocorrect_province(province_text)
print(f"Original phum {phum_text} -> autocorrected phum {corrected_phum}") # Output: ភូមិត្រពាំងថ្លឹង២
print(f"Original khum {khum_text} -> autocorrected khum {corrected_khum}") # Output: សង្កាត់ច្បារអំពៅ២
print(f"Original district {district_text} -> autocorrected district {corrected_district}") # Output: ខណ្ឌចំការមន
print(f"Original province {province_text} -> autocorrected province {corrected_province}") # Output: កំពង់ចាម
How It Works
Address Breakdown
Khmer National ID addresses are split into:
- address_1: Contains ផ្ទះ (house), ផ្លូវ (road), and ភូមិ (village/phum).
- address_2: Contains ឃុំ/សង្កាត់ (commune/khum), district, and province.
Correction Flow
Address 1: House, Road, Village
- ផ្ទះ (House) and ផ្លូវ (Road):
  - Corrected using hardcoded rules (not from dictionaries) due to their unique patterns, often followed by numbers or identifiers.
  - Examples:
    - ផ្ទ១១៣ → ផ្ទះ១១៣
    - ផ្លូរបេតុង → ផ្លូវបេតុង
- ភូមិ (Village/Phum):
  - First checks and corrects the prefix ភូមិ (e.g., ភុមិ → ភូមិ).
  - Then corrects the village name after ភូមិ using the phum_dict (loaded from data/phum/).
  - Note: The phum dictionary excludes the word ភូមិ because it is inconsistently present on ID cards.
  - Example:
    - Input: ភុមិស្វយព្រៃ
    - Step 1: ភុមិ → ភូមិ
    - Step 2: ស្វយព្រៃ → ស្វាយព្រៃ (using phum_dict)
    - Output: ភូមិស្វាយព្រៃ
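The hardcoded prefix rules for address_1 can be pictured with a few regular expressions. This is only a minimal sketch under stated assumptions: the patterns, the `RULES` table, and the `fix_prefixes` helper are hypothetical and written for illustration, and the package's actual rules may differ. It uses the standard `re` module and assumes the parts of address_1 are separated by whitespace.

```python
import re

# Hypothetical rules for illustration; the package's real patterns may differ.
KHMER_CONSONANT = "[\u1780-\u17A2]"   # Khmer consonants
SHORT_OR_LONG_U = "[\u17BB\u17BC]"    # vowel signs U+17BB and U+17BC

RULES = [
    # ផ្ទ not already followed by ះ -> ផ្ទះ (house)
    (re.compile("^ផ្ទ(?!ះ)"), "ផ្ទះ"),
    # ផ្ល + one of the two vowel signs + any consonant -> ផ្លូវ (road)
    (re.compile("^ផ្ល" + SHORT_OR_LONG_U + KHMER_CONSONANT), "ផ្លូវ"),
    # any consonant + one of the two vowel signs + មិ -> ភូមិ (village prefix)
    (re.compile("^" + KHMER_CONSONANT + SHORT_OR_LONG_U + "មិ"), "ភូមិ"),
]


def fix_prefixes(address_1: str) -> str:
    """Apply the first matching prefix rule to each whitespace-separated part."""
    fixed = []
    for token in address_1.split():
        for pattern, replacement in RULES:
            token, count = pattern.subn(replacement, token, count=1)
            if count:
                break  # at most one rule per part
        fixed.append(token)
    return " ".join(fixed)


print(fix_prefixes("ផ្ទ៤១បេ ផ្លុវ៤៤៤ ភុមិ២"))
```

Only the prefix is touched here; whatever follows it (numbers, identifiers, or a village name) is left for a later dictionary step.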
Address 2: Commune, District, Province
- Corrected directly using automatically loaded dictionaries from:
  - data/khum/ for khum
  - data/district.txt for district
  - data/province.txt for province
- No prefix-specific rules; full names are matched and corrected.
- Example:
  - Input: សង្កាត់បឹងត្រុបែក ខណ្ឌចំករមន ភ្នំពញ
  - Output: សង្កាត់បឹងត្របែក ខណ្ឌចំការមន ភ្នំពេញ (corrected using dictionaries)
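The requirements list jellyfish for Damerau-Levenshtein distance, but how the package applies it is not shown here. As a rough sketch of dictionary matching by edit distance, the example below implements the optimal-string-alignment variant of that distance in pure Python and picks the closest entry for each part. The three small lists, the `max_distance` threshold, and the helper names are assumptions made for illustration, and the sketch assumes address_2 has exactly three whitespace-separated parts.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    prev2 = None
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i] + [0] * len(b)
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            cur[j] = min(prev[j] + 1,          # deletion
                         cur[j - 1] + 1,       # insertion
                         prev[j - 1] + cost)   # substitution
            # adjacent transposition
            if i > 1 and j > 1 and ca == b[j - 2] and a[i - 2] == cb:
                cur[j] = min(cur[j], prev2[j - 2] + 1)
        prev2, prev = prev, cur
    return prev[len(b)]


def closest(word, candidates, max_distance=2):
    """Return the nearest candidate, or the word unchanged if none is close."""
    best = min(candidates, key=lambda c: osa_distance(word, c), default=word)
    return best if osa_distance(word, best) <= max_distance else word


# Hypothetical stand-ins for khum_dict, district_dict and province_dict.
KHUM = ["សង្កាត់បឹងត្របែក", "សង្កាត់ទួលទំពូងទី២"]
DISTRICT = ["ខណ្ឌចំការមន", "ខណ្ឌដូនពេញ"]
PROVINCE = ["ភ្នំពេញ", "កំពង់ចាម"]


def correct_address_2(text: str) -> str:
    parts = text.split()
    if len(parts) != 3:  # assumption: commune, district, province
        return text
    dictionaries = (KHUM, DISTRICT, PROVINCE)
    return " ".join(closest(p, d) for p, d in zip(parts, dictionaries))


print(correct_address_2("សង្កាត់បឹងត្រុបែក ខណ្ឌចំករមន ភ្នំពញ"))
```

Real ID text can contain spaces inside a single component, so the whitespace split is a simplification rather than a description of how the package segments address_2.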
Separate Autocorrect Functions (v0.3.0)
The autocorrect_kh package now includes dedicated functions to autocorrect individual components of Khmer addresses. This allows you to correct specific parts of an address—such as village (phum), commune (khum), district, or province—independently, offering greater flexibility alongside the combined address correction features.
- autocorrect_phum(phum_text): Corrects village names, handling the prefix ភូមិ (phum) separately. It ensures the prefix is standardized (e.g., correcting ភុមិ to ភូមិ) and then corrects the village name using the phum_dict dictionary.
  - Example:
    - Input: កូមិត្រពាងថ្លង២
    - Output: ភូមិត្រពាំងថ្លឹង២
- autocorrect_khum(khum_text): Corrects commune (khum) names using the khum_dict dictionary.
  - Example:
    - Input: សង្កាក់ច្បាអំពៅ២
    - Output: សង្កាត់ច្បារអំពៅ២
- autocorrect_district(district_text): Corrects district names using the district_dict dictionary.
  - Example:
    - Input: ខណ្ឌចំករមន
    - Output: ខណ្ឌចំការមន
- autocorrect_province(province_text): Corrects province names using the province_dict dictionary.
  - Example:
    - Input: កំពង់ចម
    - Output: កំពង់ចាម

These functions provide consistent and accurate corrections for individual address components, whether used on their own or as part of a broader address correction workflow.
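The prefix-then-name behavior described for autocorrect_phum can be sketched in a few lines. This is an illustration only, not the package's implementation: it swaps in the standard-library `difflib` for similarity matching, and `PHUM_NAMES`, `correct_phum`, the 0.7 and 0.6 thresholds, and the handling of trailing Khmer digits are all assumptions.

```python
import re
from difflib import SequenceMatcher, get_close_matches

PREFIX = "ភូមិ"
# Hypothetical stand-in for phum_dict; entries omit the ភូមិ prefix.
PHUM_NAMES = ["ស្វាយព្រៃ", "ត្រពាំងថ្លឹង"]
TRAILING_DIGITS = re.compile("[\u17E0-\u17E9]+$")  # Khmer digits


def correct_phum(text, names=PHUM_NAMES):
    # Step 1: standardize the prefix if the first characters resemble it.
    head, rest = text[:len(PREFIX)], text[len(PREFIX):]
    if SequenceMatcher(None, head, PREFIX).ratio() >= 0.7:
        prefix = PREFIX
    else:
        prefix, rest = "", text  # no recognizable prefix on this card

    # Step 2: set any trailing number aside, then match the village name.
    match = TRAILING_DIGITS.search(rest)
    digits = match.group() if match else ""
    name = rest[:len(rest) - len(digits)]
    candidates = get_close_matches(name, names, n=1, cutoff=0.6)
    return prefix + (candidates[0] if candidates else name) + digits


print(correct_phum("ភុមិស្វយព្រៃ"))
print(correct_phum("កូមិត្រពាងថ្លង២"))
```

Keeping the prefix out of the name list mirrors the note that the phum dictionary excludes the word ភូមិ, so a name is matched the same way whether or not the card prints the prefix.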
Key Functions
Load Dictionary
- normalize_text(text): Normalizes Khmer Unicode to NFC for consistent processing.
- load_resource_text(resource_path): Loads raw text from a package resource file.
- load_autocorrect_dict_from_resource(resource_path): Loads a dictionary from a single text file (e.g., district.txt).
- load_autocorrect_dicts_from_resource(folder_resource): Loads dictionaries from a folder (e.g., data/phum/).
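As a rough picture of the normalization and loading steps listed above, the sketch below normalizes text to NFC with the standard `unicodedata` module and turns raw dictionary text into a set of entries. The one-entry-per-line file layout and the `parse_dictionary_lines` helper are assumptions made for illustration; the package's actual file format and loader may differ.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Return the NFC form so equal-looking strings compare equal."""
    return unicodedata.normalize("NFC", text)


def parse_dictionary_lines(raw: str) -> set:
    """Hypothetical parser: one dictionary entry per line, blanks skipped."""
    entries = set()
    for line in raw.splitlines():
        entry = normalize_text(line.strip())
        if entry:
            entries.add(entry)
    return entries


raw_text = "ភ្នំពេញ\n\n  កំពង់ចាម  \n"
print(parse_dictionary_lines(raw_text))
```

Normalizing both the dictionary entries and the input text the same way keeps lookups from failing on strings that differ only in their Unicode encoding.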
Autocorrect Specifically for address_1 and address_2
- autocorrect_address_1(part, dictionary=phum_dict): Corrects address_1 with custom rules.
- autocorrect_address_2(address_2_text, khum_dictionary=khum_dict, district_dictionary=district_dict, province_dictionary=province_dict): Corrects address_2 (commune, district, province) using dictionaries.
Autocorrect Phum, Khum, District, and Province Separately
- autocorrect_phum(phum_text): Corrects village names (e.g., កូមិត្រពាងថ្លង២ → ភូមិត្រពាំងថ្លឹង២).
- autocorrect_khum(khum_text): Corrects commune names using khum_dict.
- autocorrect_district(district_text): Corrects district names using district_dict.
- autocorrect_province(province_text): Corrects province names using province_dict.