Skip to main content

High-performance PEG parsing (a port of TatSu to Rust)

Project description

CodSpeed

铁修 TieXiu

A high-performance port of TatSu to Rust.

TieXiu (铁修) is a PEG (Parsing Expression Grammar) engine that implements the flexibility and power of the original TatSu lineage into a memory-safe, high-concurrency architecture optimized for modern CPU caches.

Beta

TieXiu is functionally complete, and correct respect its predecessor TatSu. A beta period will allow for adjusting the API and its signatures to the user experience.

About

TieXiu is a tool that takes grammars in extended EBNF_ as input, and outputs memoizing (Packrat) PEG parsers as a Rust model. The classic variations of EBNF (Tomassetti, EasyExtend, Wirth) and ISO EBNF are supported as input grammar formats.

The TatSu Documentation provides a vision of where the TieXiu project is heading. A copy of the grammar syntax can be accessed locally in the SYNTAX document.

TieXiu is foremost a Rust library that is also published as a Python library with the help of PyO3/Maturin. The Rust API may return objects of types in the internal parser or tree model. The Python API has strings as input and json.dumps() compatible Python objects as output.

TatSu is a mature project with an important user base. It's difficult to make certain changes to it even if they are improvements or fixes for long-standing quirks (as well known within experienced software engineers, a long-lived quirk becomes a feature). TieXiu is an opportunity to start from scratch, with a modern approach, even if the grammar syntax and its semantics are preserved.

On The (Non-Blazing) Speed

TieXiu was started with the aim of learning Rust and applying AI Agents over a meaningful project (versus books or simple exercises). TatSu has rich parser-generator semantics that had to be replicated in Rust for completeness and compatibility.

Not a primary objective, it was expected that parsing with otpimized Rust over its runtime would run circles around the Python implementation.

It was not so.

As if implementing the semantics wasn't difficult enough on a languate so strict about memory management and with so little reasonable and efficient defaults, the first complete runs of parser generation and parsing were 3x times slower than the best Python counterpart.

It took an important amount of Rust-specific optimizations and some algorithm redesign to reach the current 1.08x speed.Rust is not friendly to the deep recursion required to parse a language like, for example, Java, and its default data structures, like Vec, don't behave well when used as short-lived containers. The complete history of optimizations that include an imported heap manager (jemalloc) figure in the Git logs. The custom allocator has since been removed for cross-platform compatibility — the system allocator is used on all platforms.

The PyO3 interface is there, but it's easier and more convenient to use TatSu directly when working with Python.

TieXiu is today a powerful PEG parser generator in Rust, so it may find a home among the rustacean community wanting to convert flat streams into semantic structures.

The Theoretical Ceiling: Why Rust Didn't "Run Circles"

The humble 1.08x speedup over highly optimized Python isn't a failure of Rust—it is a lesson in computer architecture governed by Amdahl's Law and the physical constraints of hardware.

When applied to execution cost, Amdahl's Law dictates that the total time $T$ of a process across an optimization scale $n$ is bound by a fixed, unalterable baseline $b$:

$$T(n) = b + \frac{T(1) - b}{n}$$

In an optimized PEG engine like TatSu, the algorithmic design is already incredibly refined. The serial baseline $b$ represents the unavoidable physical mechanics of processing a formal language:

  1. The Linear-Scan Lower Bound: A parser cannot predict the input structure without scanning the all the input. Every single byte must travel from RAM, through the cache hierarchy, and into a CPU register, and it must be matched against "something" as specified by the grammar.
  2. The Memory Wall: Once code optimization strips away linguistic bloat (interpreter lookups, heavy object wrapping), the execution becomes entirely Stream-Bound. The CPU spends its clock cycles waiting on memory bus bandwidth, or moving through the states of a regexp automata.

Because TatSu's core execution loop pushes those operations down into optimized C primitives and bitmaps, the remaining runtime overhead left to optimize ($T(1) - b$) is incredibly small.

Rewriting the engine in Rust completely eliminates Python's memory management friction, heavy object boxing, and Garbage Collection pauses, but it cannot bypass the physical limits of silicon. Once an engine achieves true Mechanical Sympathy with the underlying hardware, the language it is written in becomes secondary—the physics of the text stream call the shots.

OgoPEGo is a brand-new implementation of the semantics in Go. It's quite beautiful and very efficient, and it also hits the same asymptotic bound on parsing speed.

Non-Features

Most features of TatSu are available in TieXiu. Some features have not yet been implemented, and a few never will:

  • Generation of synthetic classes from grammar parameters will not be implemented in Rust.
  • Generation of source code with an object model for deifinitions in the grammar may be implemented if a way is found to make the parser or postprocessing bind the Tree output of a parse to the model.
  • Code generation of a parser recently moved in TatSu to the loading of a model of the Grammar and using it as parser. Although the generated procedural parser may produce 1.3x increased throughput in Python, supporting generated code is hard, and it complicates the internal interfaces. For Rust, TieXiu alreay knows how to load fast a Grammar model from TatSu JSON. A generated copy of the grammar model constructor could be precompiled by Rust.
  • Parsing of boolean and numeric values happens in TatSu through synthetic actions, which call the constructors for those types passing the parsed strings. For TieXiu the preferred way of transformig a tree (semantics) is through post-processing (folding), but basic numeric types and booleans could be supported.
  • Semantic actions (transformations) during parse are not implemented. Python is friendly to objects of type Any, so semantic actions during parse in TatSu can produce a tree of any type. Rust is different, and trying to have structures of an any type is not rustacean. The result of a parse is a well-defined Tree which is a small-enough enum that writing a walker for it is easy, so type transformations can be done in postprocessing by folding. See the fold modules in TieXiu for examples and useful trait definitions.
  • Interpolation and evaluation of `constant` expressions hasn't had any known use cases with TatSu. They will not be implemented in TieXiu until a use case appears.
  • The @@include directive for textual includes was always a bad idea.

API

The needs of most users are met by parsing input with the rules in a grammar and reciving the structure output as a JSON-compatible value. For other use cases, TieXiu exposes its internal model and APIs (to be docummented).

The Python API

The return values of Any are of the basic Python types, as defined in the json module documentation (see Encoders and Decoders ).

JSON Python
object dict
array list
string str
number (int) int
number (real) float
true True
false False
null None

Keyword arguments can be passed for runtime configuration. The only recognized argument as of writing is trace=.

These functions are available from package tiexiu.

def parse(grammar: str, text: str, **kwargs: Any) -> Any
def parse_grammar(grammar: str, **kwargs: Any) -> Any:
def parse_grammar_to_json(grammar: str, **kwargs: Any) -> Any:
def parse_to_json(grammar: str, text: str, **kwargs: Anyt) -> Any:
def pretty(grammar: str, **kwargs: Any) -> str:
def compile_to_json(grammar: str, **kwargs: Any) -> Any:

The Rust API

pub fn parse_grammar(grammar: &str, cfg: &CfgA) -> Result<Tree>;
pub fn parse_grammar_to_json(grammar: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn parse_grammar_to_json_string(grammar: &str, cfg: &CfgA) -> Result<String>;
pub fn parse_grammar_with<U>(cursor: U, cfg: &CfgA) -> Result<Tree>
pub fn parse_grammar_to_json_with<U>(cursor: U, cfg: &CfgA) -> Result<serde_json::Value>
pub fn compile(grammar: &str, cfg: &CfgA) -> Result<Grammar>;
pub fn compile_to_json(grammar: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn compile_to_json_string(grammar: &str, cfg: &CfgA) -> Result<String>;
pub fn compile_with<U>(cursor: U, cfg: &CfgA) -> Result<Grammar>
pub fn compile_to_json_with<U>(cursor: U, cfg: &CfgA) -> Result<serde_json::Value>
pub fn load(json: &str, _cfg: &CfgA) -> Result<Grammar>;
pub fn load_to_json(json: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn load_tree(json: &str, _cfg: &CfgA) -> Result<Tree>;
pub fn load_tree_to_json(json: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn grammar_pretty(grammar: &str, cfg: &CfgA) -> Result<String>;
pub fn pretty_tree(tree: &Tree, _cfg: &CfgA) -> Result<String>;
pub fn pretty_tree_json(tree: &Tree, _cfg: &CfgA) -> Result<String>;
pub fn parse(grammar: &str, text: &str, cfg: &CfgA) -> Result<Tree>;
pub fn parse_to_json(grammar: &str, text: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn parse_to_json_string(grammar: &str, text: &str, cfg: &CfgA) -> Result<String>;
pub fn parse_input(parser: &Grammar, text: &str, cfg: &CfgA) -> Result<Tree>;
pub fn parse_input_to_json(parser: &Grammar, text: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn parse_input_to_json_string(parser: &Grammar, text: &str, cfg: &CfgA) -> Result<String>;

Roadmap

The project is functionally complete. Comments about the implementation strategies and possible improvements are now in RODADMAP.

License

Licensed under either of:

at your option.

Contribution

Unless explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual-licensed as above, without any additional terms or conditions.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tiexiu-0.1.2.tar.gz (238.8 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

tiexiu-0.1.2-cp312-abi3-win_amd64.whl (1.2 MB view details)

Uploaded CPython 3.12+Windows x86-64

tiexiu-0.1.2-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.3 MB view details)

Uploaded CPython 3.12+manylinux: glibc 2.17+ x86-64

tiexiu-0.1.2-cp312-abi3-macosx_11_0_arm64.whl (1.2 MB view details)

Uploaded CPython 3.12+macOS 11.0+ ARM64

tiexiu-0.1.2-cp312-abi3-macosx_10_12_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.12+macOS 10.12+ x86-64

File details

Details for the file tiexiu-0.1.2.tar.gz.

File metadata

  • Download URL: tiexiu-0.1.2.tar.gz
  • Upload date:
  • Size: 238.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tiexiu-0.1.2.tar.gz
Algorithm Hash digest
SHA256 579164aa6df5816b2217ee6876b32d14a6c0bc07992d8002b733d12378e50317
MD5 587c22f15263ffecbb2c4cecfbc425a2
BLAKE2b-256 8783e22d4ffedfb1694952e83e683395c70d4a704cb3b2cf0cef354502e8f256

See more details on using hashes here.

Provenance

The following attestation bundles were made for tiexiu-0.1.2.tar.gz:

Publisher: release.yml on neogeny/tiexiu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tiexiu-0.1.2-cp312-abi3-win_amd64.whl.

File metadata

  • Download URL: tiexiu-0.1.2-cp312-abi3-win_amd64.whl
  • Upload date:
  • Size: 1.2 MB
  • Tags: CPython 3.12+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tiexiu-0.1.2-cp312-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 afdf4d011cc6eb5387ff37c1fafea2fb6ad63fafea38fb73db2f58e16c9bb0b6
MD5 845290578df6808cac7a42dc978ed5f7
BLAKE2b-256 583316fbe49ef83dcd8153e812cf4ab11ef251cf2cf5046acfb823245bd52abe

See more details on using hashes here.

Provenance

The following attestation bundles were made for tiexiu-0.1.2-cp312-abi3-win_amd64.whl:

Publisher: release.yml on neogeny/tiexiu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tiexiu-0.1.2-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for tiexiu-0.1.2-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 e542b33269585599f5419e84b6137a4217b3549bfa66173ad5eb6471cf01a118
MD5 b19153c253d2d60eead53a1ce4a28556
BLAKE2b-256 ce85da8f5ba011fd5bb637f73444ad167e1d70f38bcf1785148d05342b0c36fa

See more details on using hashes here.

Provenance

The following attestation bundles were made for tiexiu-0.1.2-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on neogeny/tiexiu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tiexiu-0.1.2-cp312-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for tiexiu-0.1.2-cp312-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 0ab64f51e784aea1f5b912c488766685812d40fde275b6fdda6c5a9c3aa98be7
MD5 1b2ef5feaaca3441be9d42bb714eeb82
BLAKE2b-256 10b94a1f1cce080bdd188fdfe20f34b2cc9a97c4f24dc1b1d2dc74d711efd183

See more details on using hashes here.

Provenance

The following attestation bundles were made for tiexiu-0.1.2-cp312-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on neogeny/tiexiu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tiexiu-0.1.2-cp312-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for tiexiu-0.1.2-cp312-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 2598a153d08e73de6e2c851a10ce38c31601ef5fb9bcb1ce0325e79a5b6fab7c
MD5 3e041bcc0f2f37ea89feb8eb0d45742c
BLAKE2b-256 688340b8d3bc1f4645b684a77d4810ee8ce99f90e1dca7bad3b49b11e321c3c0

See more details on using hashes here.

Provenance

The following attestation bundles were made for tiexiu-0.1.2-cp312-abi3-macosx_10_12_x86_64.whl:

Publisher: release.yml on neogeny/tiexiu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page