ICU MessageFormat parser with whitespace-preserving AST and string reconstruction

generaltranslation-icu-messageformat-parser

A pure-Python ICU MessageFormat parser with whitespace-preserving AST and string reconstruction. Python equivalent of @formatjs/icu-messageformat-parser.

Derived from pyicumessageformat by Mike deBeaubien (MIT license).

Installation

pip install generaltranslation-icu-messageformat-parser

No dependencies. Pure Python. Requires Python 3.10+.

Quick Start

from generaltranslation_icu_messageformat_parser import Parser, print_ast

parser = Parser()
ast = parser.parse("{count, plural, one {# item} other {# items}}")
# [{'name': 'count', 'type': 'plural', 'offset': 0,
#   'options': {
#     'one': [{'type': 'number', 'name': 'count', 'hash': True}, ' item'],
#     'other': [{'type': 'number', 'name': 'count', 'hash': True}, ' items']}}]

API

Parser(options=None)

Create a parser instance with optional configuration.

Options dict keys:

| Option | Type | Default | Description |
|---|---|---|---|
| subnumeric_types | list[str] | ['plural', 'selectordinal'] | Types that support # hash replacement |
| submessage_types | list[str] | ['plural', 'selectordinal', 'select'] | Types with sub-message branches |
| maximum_depth | int | 50 | Maximum nesting depth |
| allow_tags | bool | False | Enable XML-style <tag> parsing |
| strict_tags | bool | False | Strict tag parsing mode |
| tag_prefix | str \| None | None | Required tag name prefix |
| tag_type | str | 'tag' | AST node type string for tags |
| include_indices | bool | False | Include start/end positions in AST nodes |
| loose_submessages | bool | False | Allow loose submessage parsing |
| allow_format_spaces | bool | True | Allow spaces in format strings |
| require_other | bool | True | Require other branch in plural/select |
| preserve_whitespace | bool | False | Store whitespace in _ws dict on AST nodes for lossless round-trips |
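
Because options are passed as a plain dict, a misspelled key is easy to miss. The sketch below copies the documented defaults into a constant and rejects unknown keys before the dict reaches the parser. DEFAULT_OPTIONS and build_options are hypothetical helpers written for this example, not part of the package, and whether the parser itself rejects unknown keys is not stated here.

```python
# Documented defaults, copied from the options table above.
DEFAULT_OPTIONS = {
    "subnumeric_types": ["plural", "selectordinal"],
    "submessage_types": ["plural", "selectordinal", "select"],
    "maximum_depth": 50,
    "allow_tags": False,
    "strict_tags": False,
    "tag_prefix": None,
    "tag_type": "tag",
    "include_indices": False,
    "loose_submessages": False,
    "allow_format_spaces": True,
    "require_other": True,
    "preserve_whitespace": False,
}


def build_options(**overrides):
    """Return an options dict, raising KeyError on undocumented keys.

    Hypothetical helper: it only guards against typos and returns the
    overrides, leaving every other setting to the parser's own defaults.
    """
    unknown = set(overrides) - set(DEFAULT_OPTIONS)
    if unknown:
        raise KeyError(f"unknown parser option(s): {sorted(unknown)}")
    return dict(overrides)


options = build_options(allow_tags=True, preserve_whitespace=True)
```

The resulting dict can then be passed as Parser(options).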

Parser.parse(input, tokens=None)

Parse an ICU MessageFormat string into an AST.

Args:

  • input (str): The ICU MessageFormat string to parse.
  • tokens (list | None): Optional list to populate with token objects for low-level analysis.

Returns: list — A list of AST nodes (strings and dicts).

Raises: SyntaxError on malformed input, TypeError if input is not a string.
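
When parsing messages from a translation catalog, it can be convenient to report failures rather than stop at the first one. A minimal sketch, assuming only the documented behavior that parse() raises SyntaxError or TypeError; try_parse is a hypothetical wrapper, and it accepts any object with a parse() method so it stays independent of how the parser was configured.

```python
def try_parse(parser, message):
    """Return (ast, None) on success or (None, error_text) on failure.

    `parser` is expected to be a Parser instance; only the two exception
    types documented for Parser.parse() are caught.
    """
    try:
        return parser.parse(message), None
    except (SyntaxError, TypeError) as exc:
        return None, f"{type(exc).__name__}: {exc}"


# Intended usage (requires the package to be installed):
#   from generaltranslation_icu_messageformat_parser import Parser
#   ast, error = try_parse(Parser(), "{count, plural, one {# item}")
```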

print_ast(ast)

Reconstruct an ICU MessageFormat string from an AST.

Args:

  • ast (list): The AST as returned by Parser.parse().

Returns: str — The reconstructed ICU MessageFormat string.

When the AST contains _ws whitespace metadata (from preserve_whitespace=True), reconstruction preserves whitespace exactly, so the output matches the original input unless that input contained escape sequences (see Known Limitations). Without whitespace metadata, print_ast() emits normalized spacing.
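
One way to check which messages survive a parse and print cycle unchanged is to compare each input with its reconstruction. The sketch below is an illustration only: find_lossy is a hypothetical helper, and it takes the parse and render callables as arguments so it can be used with any parser configuration.

```python
def find_lossy(parse, render, messages):
    """Return the messages whose reconstruction differs from the input.

    `parse` is expected to behave like Parser.parse and `render` like
    print_ast; both are passed in rather than imported.
    """
    return [msg for msg in messages if render(parse(msg)) != msg]


# Intended usage (requires the package to be installed):
#   from generaltranslation_icu_messageformat_parser import Parser, print_ast
#   parser = Parser({"preserve_whitespace": True})
#   lossy = find_lossy(parser.parse, print_ast, catalog_messages)
```

Messages that contain escape sequences would be expected to appear in the result, since their quoting is consumed during parsing.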

AST Node Types

String literal

Plain strings appear directly in the AST list:

parser.parse("Hello world")
# ["Hello world"]

Simple variable {name}

{"name": "username"}

Typed placeholder {name, type, style}

{"name": "amount", "type": "number", "format": "::currency/USD"}

Plural / selectordinal {n, plural, ...}

{
    "name": "count",
    "type": "plural",          # or "selectordinal"
    "offset": 0,               # offset value (0 if none)
    "options": {
        "one": [{"type": "number", "name": "count", "hash": True}, " item"],
        "other": [{"type": "number", "name": "count", "hash": True}, " items"],
        "=0": ["no items"],    # exact match keys
    }
}
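
The parser only produces the AST; choosing a branch is left to the caller. The sketch below shows one way to pick a branch from a plural node following the usual ICU order: an exact =N key is tested against the raw value first, then the plural category of the value minus the offset. pick_plural_branch and english_category are hypothetical names, and the category rule is a simplified English-only assumption rather than full CLDR data.

```python
def english_category(n):
    # Simplified English cardinal rule (assumption): 1 is "one", else "other".
    return "one" if n == 1 else "other"


def pick_plural_branch(node, value, category=english_category):
    """Return the option list of a plural node that applies to `value`."""
    options = node["options"]
    exact_key = f"={value}"
    if exact_key in options:
        # Exact matches win and are compared before the offset is applied.
        return options[exact_key]
    adjusted = value - node.get("offset", 0)
    return options.get(category(adjusted), options["other"])


plural_node = {
    "name": "count",
    "type": "plural",
    "offset": 0,
    "options": {
        "one": [{"type": "number", "name": "count", "hash": True}, " item"],
        "other": [{"type": "number", "name": "count", "hash": True}, " items"],
        "=0": ["no items"],
    },
}

print(pick_plural_branch(plural_node, 0))  # ['no items']
```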

Select {gender, select, ...}

{
    "name": "gender",
    "type": "select",
    "options": {
        "male": ["He"],
        "female": ["She"],
        "other": ["They"],
    }
}
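
A select node can be resolved with a dictionary lookup that falls back to the other branch. This is a minimal sketch over the node shape shown above; pick_select_branch is a hypothetical helper, and it assumes the other branch is present (the default when require_other is True).

```python
def pick_select_branch(node, value):
    """Return the option list matching `value`, or the "other" branch."""
    options = node["options"]
    return options.get(str(value), options["other"])


select_node = {
    "name": "gender",
    "type": "select",
    "options": {
        "male": ["He"],
        "female": ["She"],
        "other": ["They"],
    },
}

print(pick_select_branch(select_node, "female"))  # ['She']
```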

Hash # (inside plural/selectordinal)

{"type": "number", "name": "count", "hash": True}
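
Since every node is either a string or a dict, the AST can be walked with a short recursive function. The sketch below collects the variable names a message refers to, descending into the options of plural, selectordinal, and select nodes. collect_names is a hypothetical helper; it does not descend into tag nodes, whose child layout is not described here.

```python
def collect_names(ast, found=None):
    """Return the set of variable names used anywhere in `ast`."""
    if found is None:
        found = set()
    for node in ast:
        if isinstance(node, str):
            continue  # string literals carry no variables
        if "name" in node:
            found.add(node["name"])
        # Branches of plural/selectordinal/select nodes are nested ASTs.
        for branch in node.get("options", {}).values():
            collect_names(branch, found)
    return found


sample_ast = [
    "Hello ",
    {"name": "username"},
    ": ",
    {
        "name": "count",
        "type": "plural",
        "offset": 0,
        "options": {
            "one": [{"type": "number", "name": "count", "hash": True}, " item"],
            "other": [{"type": "number", "name": "count", "hash": True}, " items"],
        },
    },
]

print(sorted(collect_names(sample_ast)))  # ['count', 'username']
```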

With include_indices=True

All dict nodes gain start and end integer fields indicating byte positions in the original string.

With preserve_whitespace=True

Dict nodes gain a _ws dict storing whitespace at each structural position, enabling lossless print_ast() round-trips.

Supported ICU Features

  • Simple variable interpolation: {name}
  • Plural with CLDR categories: {n, plural, one {...} other {...}}
  • Exact match: {n, plural, =0 {...} =1 {...} other {...}}
  • Plural offset: {n, plural, offset:1 ...}
  • Selectordinal: {n, selectordinal, one {#st} two {#nd} few {#rd} other {#th}}
  • Select: {gender, select, male {...} female {...} other {...}}
  • Nested expressions: plural inside select, select inside plural, etc.
  • Typed placeholders: {amount, number}, {d, date, short}
  • ICU escape sequences: '' for literal quote, '{...}' for literal braces
  • Hash # replacement inside plural/selectordinal branches
  • XML-style tags (opt-in): <bold>text</bold>

Known Limitations

  • Escape sequences are consumed during parsing: '' becomes ' and '{...}' becomes {...} in the AST, so print_ast() cannot restore the original quoting. This matches the behavior of @formatjs/icu-messageformat-parser.
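
Callers that need to turn literal text back into valid message syntax have to add the quoting themselves. The sketch below is one possible approach based on the general ICU quoting rules, not something the package provides: escape_literal is a hypothetical helper that doubles every apostrophe and wraps the span from the first brace to the last brace in a single quoted section. It does not handle # inside plural branches or < when tags are enabled.

```python
def escape_literal(text):
    """Quote `text` so that ICU MessageFormat treats it as literal text."""
    # A doubled apostrophe is a literal apostrophe both inside and
    # outside a quoted section.
    escaped = text.replace("'", "''")
    brace_positions = [i for i, ch in enumerate(escaped) if ch in "{}"]
    if not brace_positions:
        return escaped
    first, last = brace_positions[0], brace_positions[-1] + 1
    # One quoted section covering all braces avoids adjacent quote marks
    # being read as a doubled apostrophe.
    return escaped[:first] + "'" + escaped[first:last] + "'" + escaped[last:]


print(escape_literal("a {b} c'd"))  # a '{b}' c''d
```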
