Skip to main content

High-performance PEG parsing (a port of TatSu to Rust)

Project description

CodSpeed

铁修 TieXiu

A high-performance port of TatSu to Rust.

TieXiu (铁修) is a PEG (Parsing Expression Grammar) engine that implements the flexibility and power of the original TatSu lineage into a memory-safe, high-concurrency architecture optimized for modern CPU caches.

Beta

TieXiu is functionally complete, and correct respect its predecessor TatSu. A beta period will allow for adjusting the API and its signatures to the user experience.

About

TieXiu is a tool that takes grammars in extended EBNF_ as input, and outputs memoizing (Packrat) PEG parsers as a Rust model. The classic variations of EBNF (Tomassetti, EasyExtend, Wirth) and ISO EBNF are supported as input grammar formats.

The TatSu Documentation provides a vision of where the TieXiu project is heading. A copy of the grammar syntax can be accessed locally in the SYNTAX document.

TieXiu is foremost a Rust library that is also published as a Python library with the help of PyO3/Maturin. The Rust API may return objects of types in the internal parser or tree model. The Python API has strings as input and json.dumps() compatible Python objects as output.

TatSu is a mature project with an important user base. It's difficult to make certain changes to it even if they are improvements or fixes for long-standing quirks (as well known within experienced software engineers, a long-lived quirk becomes a feature). TieXiu is an opportunity to start from scratch, with a modern approach, even if the grammar syntax and its semantics are preserved.

On The (Non-Blazing) Speed

TieXiu was started with the aim of learning Rust and applying AI Agents over a meaningful project (versus books or simple exercises). TatSu has rich parser-generator semantics that had to be replicated in Rust for completeness and compatibility.

Not a primary objective, it was expected that parsing with otpimized Rust over its runtime would run circles around the Python implementation.

It was not so.

As if implementing the semantics wasn't difficult enough on a languate so strict about memory management and with so little reasonable and efficient defaults, the first complete runs of parser generation and parsing were 3x times slower than the best Python counterpart.

It took an important amount of Rust-specific optimizations and some algorithm redesign to reach the current 1.08x speed.Rust is not friendly to the deep recursion required to parse a language like, for example, Java, and its default data structures, like Vec, don't behave well when used as short-lived containers. The complete history of optimizations that include an imported heap manager (jemalloc) figure in the Git logs. The custom allocator has since been removed for cross-platform compatibility — the system allocator is used on all platforms.

The PyO3 interface is there, but it's easier and more convenient to use TatSu directly when working with Python.

TieXiu is today a powerful PEG parser generator in Rust, so it may find a home among the rustacean community wanting to convert flat streams into semantic structures.

The Theoretical Ceiling: Why Rust Didn't "Run Circles"

The humble 1.08x speedup over highly optimized Python isn't a failure of Rust—it is a lesson in computer architecture governed by Amdahl's Law and the physical constraints of hardware.

When applied to execution cost, Amdahl's Law dictates that the total time $T$ of a process across an optimization scale $n$ is bound by a fixed, unalterable baseline $b$:

$$T(n) = b + \frac{T(1) - b}{n}$$

In an optimized PEG engine like TatSu, the algorithmic design is already incredibly refined. The serial baseline $b$ represents the unavoidable physical mechanics of processing a formal language:

  1. The Linear-Scan Lower Bound: A parser cannot predict the input structure without scanning the all the input. Every single byte must travel from RAM, through the cache hierarchy, and into a CPU register, and it must be matched against "something" as specified by the grammar.
  2. The Memory Wall: Once code optimization strips away linguistic bloat (interpreter lookups, heavy object wrapping), the execution becomes entirely Stream-Bound. The CPU spends its clock cycles waiting on memory bus bandwidth, or moving through the states of a regexp automata.

Because TatSu's core execution loop pushes those operations down into optimized C primitives and bitmaps, the remaining runtime overhead left to optimize ($T(1) - b$) is incredibly small.

Rewriting the engine in Rust completely eliminates Python's memory management friction, heavy object boxing, and Garbage Collection pauses, but it cannot bypass the physical limits of silicon. Once an engine achieves true Mechanical Sympathy with the underlying hardware, the language it is written in becomes secondary—the physics of the text stream call the shots.

OgoPEGo is a brand-new implementation of the semantics in Go. It's quite beautiful and very efficient, and it also hits the same asymptotic bound on parsing speed.

Non-Features

Most features of TatSu are available in TieXiu. Some features have not yet been implemented, and a few never will:

  • Generation of synthetic classes from grammar parameters will not be implemented in Rust.
  • Generation of source code with an object model for deifinitions in the grammar may be implemented if a way is found to make the parser or postprocessing bind the Tree output of a parse to the model.
  • Code generation of a parser recently moved in TatSu to the loading of a model of the Grammar and using it as parser. Although the generated procedural parser may produce 1.3x increased throughput in Python, supporting generated code is hard, and it complicates the internal interfaces. For Rust, TieXiu alreay knows how to load fast a Grammar model from TatSu JSON. A generated copy of the grammar model constructor could be precompiled by Rust.
  • Parsing of boolean and numeric values happens in TatSu through synthetic actions, which call the constructors for those types passing the parsed strings. For TieXiu the preferred way of transformig a tree (semantics) is through post-processing (folding), but basic numeric types and booleans could be supported.
  • Semantic actions (transformations) during parse are not implemented. Python is friendly to objects of type Any, so semantic actions during parse in TatSu can produce a tree of any type. Rust is different, and trying to have structures of an any type is not rustacean. The result of a parse is a well-defined Tree which is a small-enough enum that writing a walker for it is easy, so type transformations can be done in postprocessing by folding. See the fold modules in TieXiu for examples and useful trait definitions.
  • Interpolation and evaluation of `constant` expressions hasn't had any known use cases with TatSu. They will not be implemented in TieXiu until a use case appears.
  • The @@include directive for textual includes was always a bad idea.

API

The needs of most users are met by parsing input with the rules in a grammar and reciving the structure output as a JSON-compatible value. For other use cases, TieXiu exposes its internal model and APIs (to be docummented).

The Python API

The return values of Any are of the basic Python types, as defined in the json module documentation (see Encoders and Decoders ).

JSON Python
object dict
array list
string str
number (int) int
number (real) float
true True
false False
null None

Keyword arguments can be passed for runtime configuration. The only recognized argument as of writing is trace=.

These functions are available from package tiexiu.

def parse(grammar: str, text: str, **kwargs: Any) -> Any
def parse_grammar(grammar: str, **kwargs: Any) -> Any:
def parse_grammar_to_json(grammar: str, **kwargs: Any) -> Any:
def parse_to_json(grammar: str, text: str, **kwargs: Anyt) -> Any:
def pretty(grammar: str, **kwargs: Any) -> str:
def compile_to_json(grammar: str, **kwargs: Any) -> Any:

The Rust API

pub fn parse_grammar(grammar: &str, cfg: &CfgA) -> Result<Tree>;
pub fn parse_grammar_to_json(grammar: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn parse_grammar_to_json_string(grammar: &str, cfg: &CfgA) -> Result<String>;
pub fn parse_grammar_with<U>(cursor: U, cfg: &CfgA) -> Result<Tree>
pub fn parse_grammar_to_json_with<U>(cursor: U, cfg: &CfgA) -> Result<serde_json::Value>
pub fn compile(grammar: &str, cfg: &CfgA) -> Result<Grammar>;
pub fn compile_to_json(grammar: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn compile_to_json_string(grammar: &str, cfg: &CfgA) -> Result<String>;
pub fn compile_with<U>(cursor: U, cfg: &CfgA) -> Result<Grammar>
pub fn compile_to_json_with<U>(cursor: U, cfg: &CfgA) -> Result<serde_json::Value>
pub fn load(json: &str, _cfg: &CfgA) -> Result<Grammar>;
pub fn load_to_json(json: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn load_tree(json: &str, _cfg: &CfgA) -> Result<Tree>;
pub fn load_tree_to_json(json: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn grammar_pretty(grammar: &str, cfg: &CfgA) -> Result<String>;
pub fn pretty_tree(tree: &Tree, _cfg: &CfgA) -> Result<String>;
pub fn pretty_tree_json(tree: &Tree, _cfg: &CfgA) -> Result<String>;
pub fn parse(grammar: &str, text: &str, cfg: &CfgA) -> Result<Tree>;
pub fn parse_to_json(grammar: &str, text: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn parse_to_json_string(grammar: &str, text: &str, cfg: &CfgA) -> Result<String>;
pub fn parse_input(parser: &Grammar, text: &str, cfg: &CfgA) -> Result<Tree>;
pub fn parse_input_to_json(parser: &Grammar, text: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn parse_input_to_json_string(parser: &Grammar, text: &str, cfg: &CfgA) -> Result<String>;

Roadmap

The project is functionally complete. Comments about the implementation strategies and possible improvements are now in RODADMAP.

License

Licensed under either of:

at your option.

Contribution

Unless explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual-licensed as above, without any additional terms or conditions.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tiexiu-0.1.4.tar.gz (244.9 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

tiexiu-0.1.4-cp312-abi3-win_amd64.whl (1.0 MB view details)

Uploaded CPython 3.12+Windows x86-64

tiexiu-0.1.4-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.0 MB view details)

Uploaded CPython 3.12+manylinux: glibc 2.17+ x86-64

tiexiu-0.1.4-cp312-abi3-macosx_11_0_arm64.whl (1.2 MB view details)

Uploaded CPython 3.12+macOS 11.0+ ARM64

tiexiu-0.1.4-cp312-abi3-macosx_10_12_x86_64.whl (1.3 MB view details)

Uploaded CPython 3.12+macOS 10.12+ x86-64

File details

Details for the file tiexiu-0.1.4.tar.gz.

File metadata

  • Download URL: tiexiu-0.1.4.tar.gz
  • Upload date:
  • Size: 244.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tiexiu-0.1.4.tar.gz
Algorithm Hash digest
SHA256 fea0beec335e7a53ab87d0185cb8690f167189b2c0e98feb0ff032dde8b74281
MD5 cead3a539eb6db92fa89d2d5549390c5
BLAKE2b-256 708ba7268a632bbac4d02ef97a80b387e70aa41631d2bde12117447351e860e8

See more details on using hashes here.

Provenance

The following attestation bundles were made for tiexiu-0.1.4.tar.gz:

Publisher: release.yml on neogeny/tiexiu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tiexiu-0.1.4-cp312-abi3-win_amd64.whl.

File metadata

  • Download URL: tiexiu-0.1.4-cp312-abi3-win_amd64.whl
  • Upload date:
  • Size: 1.0 MB
  • Tags: CPython 3.12+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tiexiu-0.1.4-cp312-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 74647607daf543e4c5ac73c4daa781cc7eaba3983597e6fb8ec2d43c1d259eb5
MD5 cda40cc21c11f8b7aedf100806bf8995
BLAKE2b-256 25d06970134a5267f872820e9f1ad97cdc8aaf53a132f64e8034ec5ccd531c28

See more details on using hashes here.

Provenance

The following attestation bundles were made for tiexiu-0.1.4-cp312-abi3-win_amd64.whl:

Publisher: release.yml on neogeny/tiexiu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tiexiu-0.1.4-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for tiexiu-0.1.4-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 20666a7b3d323ce0582bfa18f0eebeb61a000b9d61d28b413418cdd336841484
MD5 c08dfd3635423246c44878112d4c2669
BLAKE2b-256 6529b4435e5ee9183360bc03b22cd8fc22c2bf9adc4a2d407f1d3cf9fbc6e7ab

See more details on using hashes here.

Provenance

The following attestation bundles were made for tiexiu-0.1.4-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on neogeny/tiexiu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tiexiu-0.1.4-cp312-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for tiexiu-0.1.4-cp312-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 8c968ac259efbd9f61432991fe668966ffec67e49512292c25aa2fa021b59677
MD5 32e019c1c982c3416ca2361aa0c96dee
BLAKE2b-256 600064436f8f313ccadf3d2dde6f5c6fbbd0878eb785049e56a7d68e11d83e7c

See more details on using hashes here.

Provenance

The following attestation bundles were made for tiexiu-0.1.4-cp312-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on neogeny/tiexiu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tiexiu-0.1.4-cp312-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for tiexiu-0.1.4-cp312-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 5cbfe44346e22b4304b12019d61abe23f242a1c119d9a726b32fb29b07ce452e
MD5 fbb55fd4aa515b9d1d8a4464de15acc7
BLAKE2b-256 42ebca780aef4baa782fce7c2ac3a39bd9da806861a9e12fab7df5f58addb028

See more details on using hashes here.

Provenance

The following attestation bundles were made for tiexiu-0.1.4-cp312-abi3-macosx_10_12_x86_64.whl:

Publisher: release.yml on neogeny/tiexiu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page