Skip to main content

An autocorrect Khmer Address and Specifically for Khmer National ID Card

Project description

Autocorrect for Khmer National ID Addresses (autocorrect_kh)

This Python script (autocorrect.py) provides an autocorrection tool for Khmer addresses on Cambodian National ID cards. It processes addresses in two parts—address_1 (house, road, village) and address_2 (commune, district, province)—using dictionary-based correction and custom rules tailored to Khmer script.

Khmer Address Correction Example

Contents

Features

  • Khmer Address Correction: Fixes typos and misspellings in Khmer addresses.
  • Two-Part Processing: Splits addresses into address_1 (ផ្ទះ, ផ្លូវ, ភូមិ) and address_2 (ឃុំ, district, province).
  • Dictionary Support: Loads correction dictionaries from text files or folders.
  • Custom Logic: Handles unique Khmer terms like ផ្ទះ (house) and ផ្លូវ (road) with specific rules.
  • Unicode Normalization: Ensures consistent Khmer text processing.

Requirements

  • Python 3.x
  • Required packages:
    • jellyfish (for Damerau-Levenshtein distance)
    • regex (for advanced pattern matching)
    • unicodedata (included in Python standard library)
    • pkg_resources (included with setuptools)

Installation

Install the library via pip:

pip install autocorrect_kh

Or Install from source

git clone https://github.com/monykappa/autocorrect-kh.git

Usage

Autocorrect for address_1 and address_2

from autocorrect_kh import autocorrect_address_1, autocorrect_address_2

address_1 = "ផ្ទ៤១បេ ផ្លុវ៤៤៤ ភុមិ២"
address_2 = "សង្កាត់ទលទពូងទី ២ ខណ្ឌចំករមន ភ្នំពញ"


address_1_text = autocorrect_address_1(address_1) # Output: ផ្ទះ៤១បេ ផ្លូវ៤៤៤ ភូមិ២
address_2_text = autocorrect_address_2(address_2) # Output: សង្កាត់ទួលទំពូងទី២ ខណ្ឌចំការមន ភ្នំពេញ

print("Autocorrected Address:", address_1_text + " " + address_2_text)

Autocorrect Address Separately

from autocorrect_kh import autocorrect_province, autocorrect_district, autocorrect_khum, autocorrect_phum

phum_text = "កូមិត្រពាងថ្លង២"
khum_text = "សង្កាក់ច្បាអំពៅ២"
district_text = "ខណ្ឌចំករមន"
province_text = "កំពង់ចម"

autocorrect_phum = autocorrect_phum(phum_text)
autocorrect_khum = autocorrect_khum(khum_text)
autocorrect_district = autocorrect_district(district_text)
autocorrect_province = autocorrect_province(province_text)

print(f"Original phum {phum_text} -> autocorrected phum {autocorrect_phum}") # Output: ភូមិត្រពាំងថ្លឹង២
print(f"Original khum {khum_text} -> autocorrected khum {autocorrect_khum}") # Output:​​ ​សង្កាត់ច្បារអំពៅ២
print(f"Original district {district_text} -> autocorrected district {autocorrect_district}") # Output:​ ខណ្ឌចំការមន
print(f"Original province {province_text} -> autocorrected province {autocorrect_province}") # Output: កំពង់ចាម

How It Works

Address Breakdown

Khmer National ID addresses are split into:

  1. address_1: Contains ផ្ទះ (house), ផ្លូវ (road), and ភូមិ (village/phum).
  2. address_2: Contains ឃុំ/សង្កាត់ (commune/khum), district, and province.

Correction Flow

Address 1: House, Road, Village

  • ផ្ទះ (House) and ផ្លូវ (Road):
    • Corrected using hardcoded rules (not from dictionaries) due to their unique patterns, often followed by numbers or identifiers.
    • Examples:
      • ផ្ទ១១៣ផ្ទះ១១៣
      • ផ្លូរបេតុងផ្លូវបេតុង
  • ភូមិ (Village/Phum):
    • First checks and corrects the prefix ភូមិ (e.g., ភុមិភូមិ).
    • Then corrects the village name after ភូមិ using the phum_dict (loaded from data/phum/).
    • Note: The phum dictionary excludes the word ភូមិ because it’s inconsistently present on ID cards.
    • Example:
      • Input: ភុមិស្វយព្រៃ
      • Step 1: ភុមិភូមិ
      • Step 2: ស្វយព្រៃស្វាយព្រៃ (using phum_dict)
      • Output: ភូមិស្វាយព្រៃ

Address 2: Commune, District, Province

  • Corrected directly using automatically loaded dictionaries from:
    • data/khum/ for khum
    • data/district.txt for district
    • data/province.txt for province
  • No prefix-specific rules; full names are matched and corrected.
  • Example:
    • Input: សង្កាត់បឹងត្រុបែក ខណ្ឌចំករមន ភ្នំពញ
    • Output: សង្កាត់បឹងត្របែក ខណ្ឌចំការមន ភ្នំពេញ (corrected using dictionaries)

Separate Autocorrect Functions (v0.3.0)

The autocorrect_kh package now includes dedicated functions to autocorrect individual components of Khmer addresses. This allows you to correct specific parts of an address—such as village (phum), commune (khum), district, or province—independently, offering greater flexibility alongside the combined address correction features.

  • autocorrect_phum(phum_text): Corrects village names, handling the prefix ភូមិ (phum) separately. It ensures the prefix is standardized (e.g., correcting ភុមិ to ភូមិ) and then corrects the village name using the phum_dict dictionary.
  • Example:
    • Input: កូមិត្រពាងថ្លង២
    • Output: ភូមិត្រពាំងថ្លឹង២
  • autocorrect_khum(khum_text): Corrects commune (khum) names using the khum_dict dictionary.
  • Example:
    • Input: សង្កាក់ច្បាអំពៅ២
    • Output: សង្កាត់ច្បារអំពៅ២
  • autocorrect_district(district_text): Corrects district names using the district_dict dictionary.
  • Example:
    • Input: ខណ្ឌចំករមន
    • Output: ខណ្ឌចំការមន
  • autocorrect_province(province_text): Corrects province names using the province_dict dictionary.
  • Example:
    • Input: កំពង់ចម
    • Output: កំពង់ចាម These functions provide consistent and accurate corrections for individual address components, whether used on their own or as part of a broader address correction workflow.

Key Functions

Load Dictionary

  • normalize_text(text): Normalizes Khmer Unicode to NFC for consistent processing.
  • load_resource_text(resource_path): Loads raw text from a package resource file.
  • load_autocorrect_dict_from_resource(resource_path): Loads a dictionary from a single text file (e.g., district.txt).
  • load_autocorrect_dicts_from_resource(folder_resource): Loads dictionaries from a folder (e.g., data/phum/).

Autocorrect Specifically for address_1 and address_2

  • autocorrect_address_1(part, dictionary=phum_dict): Corrects address_1 with custom rules.
  • autocorrect_address_2(address_2_text, khum_dictionary=khum_dict, district_dictionary=district_dict, province_dictionary=province_dict): Corrects address_2 (commune, district, province) using dictionaries.

Autocorrect Phum, Khum, District, and Province Separately

  • autocorrect_phum(phum_text): Corrects village names (e.g., កូមិត្រពាងថ្លង២ភូមិត្រពាំងថ្លឹង២).
  • autocorrect_khum(khum_text): Corrects commune names using khum_dict.
  • autocorrect_district(district_text): Corrects district names using district_dict.
  • autocorrect_province(province_text): Corrects province names using province_dict.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

autocorrect_kh-0.3.1.tar.gz (88.0 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

autocorrect_kh-0.3.1-py3-none-any.whl (107.3 kB view details)

Uploaded Python 3

File details

Details for the file autocorrect_kh-0.3.1.tar.gz.

File metadata

  • Download URL: autocorrect_kh-0.3.1.tar.gz
  • Upload date:
  • Size: 88.0 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for autocorrect_kh-0.3.1.tar.gz
Algorithm Hash digest
SHA256 458e9c0885b3d09f80a2567ae4dbc3bf1d9d6d58e5402d0ffb26175f32d83cde
MD5 67809ac48399818b8939fc69eb3371d6
BLAKE2b-256 9955a00e558dac39cded5e38295565d441ac521c209d38b396a170b325a633b8

See more details on using hashes here.

Provenance

The following attestation bundles were made for autocorrect_kh-0.3.1.tar.gz:

Publisher: pypi-publish.yml on monykappa/autocorrect-kh

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file autocorrect_kh-0.3.1-py3-none-any.whl.

File metadata

  • Download URL: autocorrect_kh-0.3.1-py3-none-any.whl
  • Upload date:
  • Size: 107.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for autocorrect_kh-0.3.1-py3-none-any.whl
Algorithm Hash digest
SHA256 0eac0039152275e3c8b336d6031b3c8febd391162a37305a5362096a64baf6a6
MD5 39274ba9d0a56fa9648c7e9b9cd465cc
BLAKE2b-256 688b09e2317cdc9fa7206ee32f2a678f157f3e9a981e7481592e17beb87854d1

See more details on using hashes here.

Provenance

The following attestation bundles were made for autocorrect_kh-0.3.1-py3-none-any.whl:

Publisher: pypi-publish.yml on monykappa/autocorrect-kh

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page