Address autocorrection built specifically for Khmer National ID cards
Project description
Autocorrect for Khmer National ID Addresses (autocorrect_kh)
This Python script (autocorrect.py) provides an autocorrection tool for Khmer addresses on Cambodian National ID cards. It processes each address in two parts: address_1 (house, road, village) and address_2 (commune, district, province). Corrections combine dictionary lookups with custom rules tailored to Khmer script.
Features
- Khmer Address Correction: Fixes typos and misspellings in Khmer addresses.
- Two-Part Processing: Splits addresses into address_1 (ផ្ទះ, ផ្លូវ, ភូមិ) and address_2 (ឃុំ, district, province).
- Dictionary Support: Loads correction dictionaries from text files or folders.
- Custom Logic: Handles unique Khmer terms like ផ្ទះ (house) and ផ្លូវ (road) with specific rules.
- Unicode Normalization: Ensures consistent Khmer text processing.
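The normalization feature can be illustrated with a minimal sketch. It assumes the step amounts to a call to the standard library's unicodedata.normalize with the NFC form; the function body below is an illustrative stand-in, not the package's own code. A Latin pair is used for the visible demonstration because the composed and decomposed forms are easy to tell apart.

```python
import unicodedata


def normalize_text(text: str) -> str:
    """Return text in Unicode NFC form (illustrative stand-in)."""
    return unicodedata.normalize("NFC", text)


# "e" followed by a combining acute accent vs. the precomposed letter.
decomposed = "e\u0301"
composed = "\u00e9"

# Before normalization the two strings compare unequal.
print(decomposed == composed)
# After normalization they compare equal.
print(normalize_text(decomposed) == normalize_text(composed))
```

Normalizing both the input and the dictionary entries the same way means that two strings which render identically also compare equal during lookup.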
Requirements
- Python 3.x
- Required packages:
  - jellyfish (for Damerau-Levenshtein distance)
  - regex (for advanced pattern matching)
  - unicodedata (included in the Python standard library)
  - pkg_resources (included with setuptools)
Installation
Install the library via pip:
pip install autocorrect_kh jellyfish regex
Or install from source:
git clone https://github.com/monykappa/autocorrect-kh.git
Usage
from autocorrect_kh import autocorrect_address_1, autocorrect_address_2
address_1_text = "ផ្ទ៤១បេ ផ្លុវ៤៤៤ ភុមិ២"
address_2_text = "សង្កាត់ទលទពូងទី ២ ខណ្ឌចំករមន ភ្នំពញ"
address_1_text = autocorrect_address_1(address_1_text)
address_2_text = autocorrect_address_2(address_2_text)
print("Autocorrected Address:", address_1_text + " " + address_2_text)
How It Works
Address Breakdown
Khmer National ID addresses are split into:
- address_1: Contains ផ្ទះ (house), ផ្លូវ (road), and ភូមិ (village/phum).
- address_2: Contains ឃុំ/សង្កាត់ (commune/khum), district, and province.
Correction Flow
Address 1: House, Road, Village
- ផ្ទះ (House) and ផ្លូវ (Road):
  - Corrected using hardcoded rules (not from dictionaries) due to their unique patterns, often followed by numbers or identifiers.
  - Examples:
    - ផ្ទ១១៣ → ផ្ទះ១១៣
    - ផ្លូរបេតុង → ផ្លូវបេតុង
- ភូមិ (Village/Phum):
  - First checks and corrects the prefix ភូមិ (e.g., ភុមិ → ភូមិ).
  - Then corrects the village name after ភូមិ using the phum_dict (loaded from data/phum/).
  - Note: The phum dictionary excludes the word ភូមិ because it’s inconsistently present on ID cards.
  - Example:
    - Input: ភុមិស្វយព្រៃ
    - Step 1: ភុមិ → ភូមិ
    - Step 2: ស្វយព្រៃ → ស្វាយព្រៃ (using phum_dict)
    - Output: ភូមិស្វាយព្រៃ
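The address_1 flow above can be sketched roughly as follows. Everything in this block is an assumption made for illustration: the three prefix rules are written only to reproduce the examples in this section, PHUM_NAMES is a two-entry toy stand-in for phum_dict, and the village lookup uses difflib.get_close_matches from the standard library instead of the package's Damerau-Levenshtein matching.

```python
import re
from difflib import get_close_matches

# Toy stand-in for phum_dict; the real dictionary is loaded from data/phum/.
PHUM_NAMES = {"ស្វាយព្រៃ", "ទួលគោក"}

VILLAGE_PREFIX = "ភូមិ"

# Hypothetical prefix rules, written only to cover the examples above.
PREFIX_RULES = [
    (re.compile(r"^ផ្ទ(?!ះ)"), "ផ្ទះ"),        # house prefix missing its final ះ
    (re.compile(r"^ផ្លូរ"), "ផ្លូវ"),           # road prefix ending in រ instead of វ
    (re.compile(r"^ភុមិ"), VILLAGE_PREFIX),   # village prefix with ុ instead of ូ
]


def fix_prefix(token: str) -> str:
    """Apply the first matching prefix rule, or return the token unchanged."""
    for pattern, replacement in PREFIX_RULES:
        fixed, count = pattern.subn(replacement, token, count=1)
        if count:
            return fixed
    return token


def fix_village(token: str) -> str:
    """Correct the village prefix first, then the name that follows it."""
    token = fix_prefix(token)
    if not token.startswith(VILLAGE_PREFIX):
        return token
    name = token[len(VILLAGE_PREFIX):]
    # The dictionary holds names without the prefix, so only the name is looked up.
    matches = get_close_matches(name, sorted(PHUM_NAMES), n=1, cutoff=0.6)
    return VILLAGE_PREFIX + (matches[0] if matches else name)


print(fix_prefix("ផ្ទ១១៣"))
print(fix_prefix("ផ្លូរបេតុង"))
print(fix_village("ភុមិស្វយព្រៃ"))
```

Stripping the prefix before the lookup mirrors the note above: because the phum dictionary does not contain the word ភូមិ, only the village name itself is compared against it.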
Address 2: Commune, District, Province
- Corrected directly using automatically loaded dictionaries from:
  - data/khum/ for khum
  - data/district.txt for district
  - data/province.txt for province
- No prefix-specific rules; full names are matched and corrected.
- Example:
  - Input: សង្កាត់បឹងត្រុបែក ខណ្ឌចំករមន ភ្នំពញ
  - Output: សង្កាត់បឹងត្របែក ខណ្ឌចំការមន ភ្នំពេញ (corrected using dictionaries)
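As a rough sketch of the address_2 flow, the block below matches each component as a full name against its own dictionary. Several things here are assumptions rather than the package's behavior: the three sets are tiny toy dictionaries, the split rule (last token is the province, the one before it is the district, everything earlier is the commune) is a guess, and difflib.get_close_matches stands in for the Damerau-Levenshtein matching.

```python
from difflib import get_close_matches

# Toy dictionaries for illustration; the real ones are loaded from
# data/khum/, data/district.txt and data/province.txt.
KHUM = {"សង្កាត់បឹងត្របែក", "សង្កាត់ទន្លេបាសាក់"}
DISTRICT = {"ខណ្ឌចំការមន", "ខណ្ឌដូនពេញ"}
PROVINCE = {"ភ្នំពេញ", "សៀមរាប"}


def closest(text: str, names: set, cutoff: float = 0.6) -> str:
    """Return the most similar dictionary entry, or the input if none is close."""
    matches = get_close_matches(text, sorted(names), n=1, cutoff=cutoff)
    return matches[0] if matches else text


def correct_address_2(text: str) -> str:
    """Correct commune, district and province as whole names (prefix included)."""
    tokens = text.split()
    if len(tokens) < 3:
        return text  # not enough parts to split under the assumed layout
    # Assumed layout: <commune ...> <district> <province>
    commune = " ".join(tokens[:-2])
    district, province = tokens[-2], tokens[-1]
    return " ".join([
        closest(commune, KHUM),
        closest(district, DISTRICT),
        closest(province, PROVINCE),
    ])


print(correct_address_2("សង្កាត់បឹងត្រុបែក ខណ្ឌចំករមន ភ្នំពញ"))
```

Joining the leading tokens back into one commune string lets a commune name that itself contains a space be matched as a single entry.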
Key Functions
- normalize_text(text): Normalizes Khmer Unicode to NFC for consistent processing.
- load_resource_text(resource_path): Loads raw text from a package resource file.
- load_autocorrect_dict_from_resource(resource_path): Loads a dictionary from a single text file (e.g., district.txt).
- load_autocorrect_dicts_from_resource(folder_resource): Loads dictionaries from all .txt files in a folder (e.g., data/phum/).
- autocorrect_word(word, word_set): Corrects a word using Damerau-Levenshtein distance.
- autocorrect_address_1(part, dictionary=phum_dict): Corrects address_1 with custom rules.
- autocorrect_address_2(address_2_text, khum_dictionary=khum_dict, district_dictionary=district_dict, province_dictionary=province_dict): Corrects address_2 components using automatically loaded dictionaries.
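To make the distance-based correction concrete, here is a minimal pure-Python sketch of the idea behind autocorrect_word. It is not the package's implementation: it computes the optimal string alignment variant of Damerau-Levenshtein distance by hand so that it runs without jellyfish, and the names osa_distance, closest_word, and the max_distance threshold are hypothetical.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, substitutions and
    adjacent transpositions (optimal string alignment variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def closest_word(word: str, word_set: set, max_distance: int = 2) -> str:
    """Return the nearest entry within max_distance edits, else the word itself."""
    if word in word_set or not word_set:
        return word
    # Ties are broken alphabetically so the result is deterministic.
    best = min(word_set, key=lambda candidate: (osa_distance(word, candidate), candidate))
    return best if osa_distance(word, best) <= max_distance else word


provinces = {"ភ្នំពេញ", "សៀមរាប"}  # toy dictionary for illustration
print(closest_word("ភ្នំពញ", provinces))
```

The distance threshold keeps unrelated input from being forced onto a dictionary entry; the value of 2 is an arbitrary choice for this sketch.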