Burmese text detection and conversion toolkit for Zawgyi and Unicode

These details have not been verified by PyPI

Project links

Project description

Para

Para is a small, boring, and transparent toolkit for working with Burmese text. It detects whether text is encoded in Zawgyi or Unicode and converts Zawgyi to Unicode using a rule-based approach. Para never invents a new encoding and keeps its APIs explicit.

Goals

Be Unicode-first and never invent a new encoding.
Offer stable, explicit APIs without side effects or magic imports.
Provide deterministic Zawgyi vs Unicode detection.
Convert Zawgyi to Unicode with maintainable, rule-based logic (Parabaik-style), not machine learning.
Stay batch-friendly for spreadsheets, CSVs, and plain text.
Avoid heavy native dependencies.
Be honest about limitations and edge cases.

Installation

pip install paraencoder

For Office document support (.docx, .xlsx, .odt):

pip install paraencoder[office]

Supported File Formats

Plain Text (built-in, no extra dependencies)

Text files: .txt, .text, .log, .md, .rst, .asc
Web/markup: .html, .htm, .xhtml, .xml, .json, .yaml, .yml, .csv, .tsv
Documentation: .tex, .latex, .adoc, .org, .wiki, .mediawiki
Config: .ini, .cfg, .conf, .properties, .env, .toml, .lock
Source code: .py, .js, .ts, .java, .c, .cpp, .h, .cs, .php, .rb, .go, .rs, .sh, .bat, .ps1, .sql
Subtitles: .srt, .vtt, .sub
Other: .po, .pot, .texi, .man, .nfo, .readme, .eml, .mbox

Office Documents (requires `paraencoder[office]`)

Microsoft Word: .docx, .docm
Microsoft Excel: .xlsx, .xlsm
OpenDocument: .odt

Usage

from para.detect import is_zawgyi, detect_encoding
from para.convert import zg_to_unicode
from para.normalize import normalize_unicode

text = "\u1031\u1010\u1004\u103a"  # Zawgyi-encoded string
if is_zawgyi(text):
    cleaned = zg_to_unicode(text)
    cleaned = normalize_unicode(cleaned)

CLI

Detect encoding:

echo "\u1031\u1010\u1004\u103a" | para detect

Convert Zawgyi to Unicode:

echo "\u1031\u1010\u1004\u103a" | para convert > output.txt

Process a file in place (write to stdout by default):

para convert --input input.txt --output output.txt

Convert Office documents (requires paraencoder[office]):

para convert --input "Document.docx" --output "Document_Unicode.docx"
para convert --input "Spreadsheet.xlsx" --output "Spreadsheet_Unicode.xlsx"

Windows / PowerShell note

PowerShell's default encoding corrupts Myanmar text in pipes. Before piping Burmese text, set UTF-8 encoding:

$OutputEncoding = [System.Text.Encoding]::UTF8
[Console]::InputEncoding = [System.Text.Encoding]::UTF8
[Console]::OutputEncoding = [System.Text.Encoding]::UTF8
echo "ျမန္မာ" | para convert

Or use file-based input/output to avoid pipe issues:

para convert --input input.txt --output output.txt

API surface

para.detect.is_zawgyi(text: str) -> bool
- Input: text string.
- Output: True only when the detector score prefers Zawgyi; otherwise False.
- Guarantee: Never raises on empty/ASCII-only input; returns False for those.
para.detect.detect_encoding(text: str) -> Literal["zawgyi", "unicode", "unknown"]
- Input: text string.
- Output: One of the three labels. Ties or insufficient evidence → "unknown" (no auto-conversion).
- Guarantee: Deterministic, no network/ML, explicit tie handling.
para.convert.zg_to_unicode(text: str, *, normalize: bool = True, force: bool = False) -> str
- Input: text string.
- Output: Converted Unicode string when detection prefers Zawgyi (or when force=True). Otherwise passes through (optionally normalized).
- Guarantee: Ordered, test-backed regex rules; no Unicode→Zawgyi path; force=False avoids silent conversion on ambiguous text.
para.normalize.normalize_unicode(text: str) -> str
- Input: text string.
- Output: NFC-normalized string with simple Myanmar ordering tweaks.
- Guarantee: Idempotent on already-normalized Unicode Burmese.
para.io.read_text(path: str, *, encoding: str = "utf-8") -> str
para.io.write_text(path: str, data: str, *, encoding: str = "utf-8") -> None
para.io.convert_file(...) -> str
- Batch helpers for files; never guess encodings beyond the provided encoding argument.

Detection approach

Detection is deterministic and rule-based. Para scores the input with Zawgyi-specific patterns (e.g., U+1031 prefix order, U+105A, stacked medials) and Unicode-only patterns (e.g., valid ordering of medials, U+103A usage). The side with the higher score wins; ties produce "unknown". No machine learning, no network calls.

Conversion approach

Conversion uses an ordered list of regex replacements derived from Parabaik-style mappings. The rules are explicit, unit-tested, and live in para.rules. The converter does not attempt Unicode-to-Zawgyi; it only supports Zawgyi-to-Unicode because Unicode is the target canonical encoding.

Limitations

Ambiguous short strings (e.g., ASCII-only) return "unknown" and pass through unchanged.
Extremely malformed Zawgyi text may require manual cleanup.
The converter focuses on common Zawgyi usage; rare legacy ligatures may need additional rules.

Non-goals

Creating or endorsing any new Burmese encoding.
Unicode-to-Zawgyi conversion.
ML-based detection or probabilistic auto-conversion.
Silent mutation of text when detection confidence is low; ties stay "unknown".

Contributing

Issues and pull requests are welcome. Keep changes readable and testable.

Packaging

Build a wheel/sdist locally: python -m pip install build then python -m build.
Publish to PyPI (once ready): python -m pip install twine then twine upload dist/*.
The package metadata in pyproject.toml is PyPI-ready (MIT license, explicit packages, CLI entrypoint).

License

MIT

Project details

These details have not been verified by PyPI

Project links

Release history Release notifications | RSS feed

0.2.1

Feb 4, 2026

This version

0.2.0

Feb 4, 2026

0.1.1

Feb 4, 2026

0.1.0

Feb 4, 2026

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

paraencoder-0.2.0.tar.gz (14.3 kB view details)

Uploaded Feb 4, 2026 Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

The dropdown lists show the available interpreters, ABIs, and platforms. Enable javascript to be able to filter the list of wheel files.

paraencoder-0.2.0-py3-none-any.whl (14.8 kB view details)

Uploaded Feb 4, 2026 Python 3

File details

Details for the file paraencoder-0.2.0.tar.gz.

File metadata

Download URL: paraencoder-0.2.0.tar.gz
Upload date: Feb 4, 2026
Size: 14.3 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for paraencoder-0.2.0.tar.gz
Algorithm	Hash digest
SHA256	`62157fcbffbe3cc72a373d2dbf5ca251ebc88497996eaa0c4193de5c2989386f`
MD5	`936ccd68c3ca1d127f4aa05db9e0dd2f`
BLAKE2b-256	`5720022b7d0eb601594c789415cbaf36e639b176fa7f023c073e1653e95162ce`

See more details on using hashes here.

File details

Details for the file paraencoder-0.2.0-py3-none-any.whl.

File metadata

Download URL: paraencoder-0.2.0-py3-none-any.whl
Upload date: Feb 4, 2026
Size: 14.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.14.0

File hashes

Hashes for paraencoder-0.2.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b5b70cad85268e52a185c08cd3b3f077591b065a1171a7329e56df7764c23c73`
MD5	`e2924f6b94069bf6a6d25540c95cfa4c`
BLAKE2b-256	`27bb197e239bb6bbf57c9098153bb3aaaec106c01050eb9aba73fbffdc9adeea`

See more details on using hashes here.

paraencoder 0.2.0

Navigation

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Project description

Para

Goals

Installation

Supported File Formats

Plain Text (built-in, no extra dependencies)

Office Documents (requires paraencoder[office])

Usage

CLI

Windows / PowerShell note

API surface

Detection approach

Conversion approach

Limitations

Non-goals

Contributing

Packaging

License

Project details

Verified details

Maintainers

Unverified details

Project links

Meta

Classifiers

Release history Release notifications | RSS feed

Download files

Source Distribution

Built Distribution

File details

File metadata

File hashes

File details

File metadata

File hashes

Office Documents (requires `paraencoder[office]`)