Burmese text detection and conversion toolkit for Zawgyi and Unicode
Para
Para is a small, boring, and transparent toolkit for working with Burmese text. It detects whether text is encoded in Zawgyi or Unicode and converts Zawgyi to Unicode using a rule-based approach. Para never invents a new encoding and keeps its APIs explicit.
Goals
- Be Unicode-first and never invent a new encoding.
- Offer stable, explicit APIs without side effects or magic imports.
- Provide deterministic Zawgyi vs Unicode detection.
- Convert Zawgyi to Unicode with maintainable, rule-based logic (Parabaik-style), not machine learning.
- Stay batch-friendly for spreadsheets, CSVs, and plain text.
- Avoid heavy native dependencies.
- Be honest about limitations and edge cases.
Installation
pip install paraencoder
Usage
from para.detect import is_zawgyi, detect_encoding
from para.convert import zg_to_unicode
from para.normalize import normalize_unicode
text = "\u1031\u1010\u1004\u103a" # Zawgyi-encoded string
if is_zawgyi(text):
    cleaned = zg_to_unicode(text)
    cleaned = normalize_unicode(cleaned)
CLI
Detect encoding:
echo "\u1031\u1010\u1004\u103a" | para detect
Convert Zawgyi to Unicode:
echo "\u1031\u1010\u1004\u103a" | para convert > output.txt
Convert a file (output goes to stdout unless --output is given):
para convert --input input.txt --output output.txt
Windows / PowerShell note
PowerShell's default encoding can corrupt Myanmar text in pipes. Before piping Burmese text, set the session to UTF-8:
$OutputEncoding = [System.Text.Encoding]::UTF8
[Console]::InputEncoding = [System.Text.Encoding]::UTF8
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
echo "ျမန္မာ" | para convert
Or use file-based input/output to avoid pipe issues:
para convert --input input.txt --output output.txt
API surface
- para.detect.is_zawgyi(text: str) -> bool
  - Input: text string.
  - Output: True only when the detector score prefers Zawgyi; otherwise False.
  - Guarantee: Never raises on empty/ASCII-only input; returns False for those.
- para.detect.detect_encoding(text: str) -> Literal["zawgyi", "unicode", "unknown"]
  - Input: text string.
  - Output: One of the three labels. Ties or insufficient evidence → "unknown" (no auto-conversion).
  - Guarantee: Deterministic, no network/ML, explicit tie handling.
- para.convert.zg_to_unicode(text: str, *, normalize: bool = True, force: bool = False) -> str
  - Input: text string.
  - Output: Converted Unicode string when detection prefers Zawgyi (or when force=True). Otherwise passes through (optionally normalized).
  - Guarantee: Ordered, test-backed regex rules; no Unicode→Zawgyi path; force=False avoids silent conversion on ambiguous text.
- para.normalize.normalize_unicode(text: str) -> str
  - Input: text string.
  - Output: NFC-normalized string with simple Myanmar ordering tweaks.
  - Guarantee: Idempotent on already-normalized Unicode Burmese.
- para.io.read_text(path: str, *, encoding: str = "utf-8") -> str
- para.io.write_text(path: str, data: str, *, encoding: str = "utf-8") -> None
- para.io.convert_file(...) -> str
  - Batch helpers for files; never guess encodings beyond the provided encoding argument.
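The NFC part of the normalize_unicode contract can be illustrated with the standard library alone. This is a minimal sketch, not Para's implementation: the helper names nfc and is_idempotent are hypothetical, and the Myanmar ordering tweaks mentioned above are not reproduced here.

```python
import unicodedata


def nfc(text: str) -> str:
    """Return the NFC form of text (only the NFC step, no ordering tweaks)."""
    return unicodedata.normalize("NFC", text)


def is_idempotent(text: str) -> bool:
    """Check that normalizing an already-normalized string changes nothing."""
    once = nfc(text)
    return nfc(once) == once


# U+1025 followed by U+102E is a canonically decomposed sequence that NFC
# composes into a single code point.
sample = "\u1025\u102e"
print(len(sample), len(nfc(sample)))
print(is_idempotent(sample))
```

Running a second pass over the output is a cheap property test for any normalizer that claims idempotence.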
Detection approach
Detection is deterministic and rule-based. Para scores the input with Zawgyi-specific patterns (e.g., U+1031 prefix order, U+105A, stacked medials) and Unicode-only patterns (e.g., valid ordering of medials, U+103A usage). The side with the higher score wins; ties produce "unknown". No machine learning, no network calls.
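The scoring idea described above can be sketched in a few lines. Everything here is an assumption made for illustration: the pattern lists ZAWGYI_HINTS and UNICODE_HINTS, the score helper, and the detect function are hypothetical and far smaller than whatever Para actually ships.

```python
import re
from typing import Literal

Label = Literal["zawgyi", "unicode", "unknown"]

# Hypothetical hint patterns; Para's real patterns are not listed in this document.
ZAWGYI_HINTS = [
    # U+1031 with no consonant or medial before it (visual-order typing).
    re.compile("(?<![\u1000-\u1021\u103b-\u103e])\u1031"),
    # U+105A and nearby code points that Zawgyi fonts repurpose for variant glyphs.
    re.compile("[\u105a\u1060-\u1097]"),
]
UNICODE_HINTS = [
    # Consonant, optional medials, then U+1031 (logical order).
    re.compile("[\u1000-\u1021][\u103b-\u103e]*\u1031"),
    # U+1039 used as a stacking virama directly before a consonant.
    re.compile("\u1039[\u1000-\u1021]"),
]


def score(patterns: list[re.Pattern[str]], text: str) -> int:
    """Count how many times any pattern in the list matches."""
    return sum(len(p.findall(text)) for p in patterns)


def detect(text: str) -> Label:
    """Higher score wins; a tie (including 0 vs 0) yields "unknown"."""
    zg = score(ZAWGYI_HINTS, text)
    uni = score(UNICODE_HINTS, text)
    if zg > uni:
        return "zawgyi"
    if uni > zg:
        return "unicode"
    return "unknown"


print(detect("\u1031\u1010\u1004\u103a"))
print(detect("hello"))
```

Because the result depends only on pattern counts, the same input always produces the same label, and empty or ASCII-only input falls into the tie branch without raising.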
Conversion approach
Conversion uses an ordered list of regex replacements derived from Parabaik-style mappings. The rules are explicit, unit-tested, and live in para.rules. The converter does not attempt Unicode-to-Zawgyi; it only supports Zawgyi-to-Unicode because Unicode is the target canonical encoding.
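Why the order of replacements matters can be shown with a toy rule list. This is a sketch under stated assumptions: the three rules below are a simplified illustration of common Zawgyi-to-Unicode remappings, not the contents of para.rules, and the names RULES and apply_rules are hypothetical.

```python
import re

# Toy rules, applied strictly top to bottom. Not Para's actual rule table.
RULES: list[tuple[re.Pattern[str], str]] = [
    # 1. Zawgyi U+103A (medial ya) becomes Unicode U+103B.
    (re.compile("\u103a"), "\u103b"),
    # 2. Zawgyi U+1039 (asat) becomes Unicode U+103A. This must run after
    #    rule 1, or the U+103A it produces would be rewritten again.
    (re.compile("\u1039"), "\u103a"),
    # 3. Move a leading U+1031 behind the consonant it was typed in front of.
    (re.compile("\u1031([\u1000-\u1021])"), "\\1\u1031"),
]


def apply_rules(text: str) -> str:
    """Run every rule once, in list order, over the whole string."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


converted = apply_rules("\u1031\u1010\u1004\u1039")
print([f"U+{ord(ch):04X}" for ch in converted])
```

Swapping rules 1 and 2 changes the output for any input containing both code points, which is the reason a rule list like this is kept ordered and covered by tests.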
Limitations
- Ambiguous short strings (e.g., ASCII-only) return "unknown" and pass through unchanged.
- Extremely malformed Zawgyi text may require manual cleanup.
- The converter focuses on common Zawgyi usage; rare legacy ligatures may need additional rules.
Non-goals
- Creating or endorsing any new Burmese encoding.
- Unicode-to-Zawgyi conversion.
- ML-based detection or probabilistic auto-conversion.
- Silent mutation of text when detection confidence is low; ties stay "unknown".
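The "no silent mutation" rule amounts to a small gate in front of the converter. The sketch below assumes a hypothetical helper, convert_if_zawgyi, that takes the detector and converter as plain callables so it runs without Para installed; it mirrors the documented force flag but is not Para's code.

```python
from typing import Callable


def convert_if_zawgyi(
    text: str,
    detect: Callable[[str], str],
    convert: Callable[[str], str],
    *,
    force: bool = False,
) -> str:
    """Convert only when detection says "zawgyi" or the caller forces it."""
    if force or detect(text) == "zawgyi":
        return convert(text)
    # "unicode" and "unknown" both pass through untouched.
    return text


# Stand-in callables for demonstration only.
def always_unknown(_: str) -> str:
    return "unknown"


def mark_converted(s: str) -> str:
    return f"<converted:{s}>"


print(convert_if_zawgyi("abc", always_unknown, mark_converted))
print(convert_if_zawgyi("abc", always_unknown, mark_converted, force=True))
```

Keeping the override as an explicit keyword argument means a caller has to opt in before ambiguous text is rewritten.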
Contributing
Issues and pull requests are welcome. Keep changes readable and testable.
Packaging
- Build a wheel/sdist locally: python -m pip install build then python -m build.
- Publish to PyPI (once ready): python -m pip install twine then twine upload dist/*.
- The package metadata in pyproject.toml is PyPI-ready (MIT license, explicit packages, CLI entrypoint).
License
MIT
File details
Details for the file paraencoder-0.1.1.tar.gz.
File metadata
- Download URL: paraencoder-0.1.1.tar.gz
- Upload date:
- Size: 11.0 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 08661b0b2d14c77d5884d82bb598b8f17ccec110b79dffa1fc07895ab441f1e2 |
| MD5 | 185e96f31ebf8d5880c8022331092d53 |
| BLAKE2b-256 | f752e5215d07fcf90b086bb49be83c43a8febba3a45e145b84f7a0b3f146846b |
File details
Details for the file paraencoder-0.1.1-py3-none-any.whl.
File metadata
- Download URL: paraencoder-0.1.1-py3-none-any.whl
- Upload date:
- Size: 11.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.14.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 3a851ddcaa5600f7871680cf35b088a55a02a411615a7d6d75c04afdfd58bf3d |
| MD5 | cf6ce0de310d4efa3ede29e5e2c707db |
| BLAKE2b-256 | 7f5cb3ad98cadd0081fe34e48fe0558f61e8ba5470f3268a759e683b8bff36a1 |