Skip to main content

High-performance PEG parsing (a port of TatSu to Rust)

Project description

CodSpeed

铁修 TieXiu

A high-performance port of TatSu to Rust.

TieXiu (铁修) is a PEG (Parsing Expression Grammar) engine that implements the flexibility and power of the original TatSu lineage into a memory-safe, high-concurrency architecture optimized for modern CPU caches.

Beta

TieXiu is functionally complete, and correct respect its predecessor TatSu. A beta period will allow for adjusting the API and its signatures to the user experience.

About

TieXiu is a tool that takes grammars in extended EBNF_ as input, and outputs memoizing (Packrat) PEG parsers as a Rust model. The classic variations of EBNF (Tomassetti, EasyExtend, Wirth) and ISO EBNF are supported as input grammar formats.

The TatSu Documentation provides a vision of where the TieXiu project is heading. A copy of the grammar syntax can be accessed locally in the SYNTAX document.

TieXiu is foremost a Rust library that is also published as a Python library with the help of PyO3/Maturin. The Rust API may return objects of types in the internal parser or tree model. The Python API has strings as input and json.dumps() compatible Python objects as output.

TatSu is a mature project with an important user base. It's difficult to make certain changes to it even if they are improvements or fixes for long-standing quirks (as well known within experienced software engineers, a long-lived quirk becomes a feature). TieXiu is an opportunity to start from scratch, with a modern approach, even if the grammar syntax and its semantics are preserved.

On The (Non-Blazing) Speed

TieXiu was started with the aim of learning Rust and applying AI Agents over a meaningful project (versus books or simple exercises). TatSu has rich parser-generator semantics that had to be replicated in Rust for completeness and compatibility.

Not a primary objective, it was expected that parsing with otpimized Rust over its runtime would run circles around the Python implementation.

It was not so.

As if implementing the semantics wasn't difficult enough on a languate so strict about memory management and with so little reasonable and efficient defaults, the first complete runs of parser generation and parsing were 3x times slower than the best Python counterpart.

It took an important amount of Rust-specific optimizations and some algorithm redesign to reach the current 1.08x speed.Rust is not friendly to the deep recursion required to parse a language like, for example, Java, and its default data structures, like Vec, don't behave well when used as short-lived containers. The complete history of optimizations that include an imported heap manager (jemalloc) figure in the Git logs. The custom allocator has since been removed for cross-platform compatibility — the system allocator is used on all platforms.

The PyO3 interface is there, but it's easier and more convenient to use TatSu directly when working with Python.

TieXiu is today a powerful PEG parser generator in Rust, so it may find a home among the rustacean community wanting to convert flat streams into semantic structures.

The Theoretical Ceiling: Why Rust Didn't "Run Circles"

The humble 1.08x speedup over highly optimized Python isn't a failure of Rust—it is a lesson in computer architecture governed by Amdahl's Law and the physical constraints of hardware.

When applied to execution cost, Amdahl's Law dictates that the total time $T$ of a process across an optimization scale $n$ is bound by a fixed, unalterable baseline $b$:

$$T(n) = b + \frac{T(1) - b}{n}$$

In an optimized PEG engine like TatSu, the algorithmic design is already incredibly refined. The serial baseline $b$ represents the unavoidable physical mechanics of processing a formal language:

  1. The Linear-Scan Lower Bound: A parser cannot predict the input structure without scanning the all the input. Every single byte must travel from RAM, through the cache hierarchy, and into a CPU register, and it must be matched against "something" as specified by the grammar.
  2. The Memory Wall: Once code optimization strips away linguistic bloat (interpreter lookups, heavy object wrapping), the execution becomes entirely Stream-Bound. The CPU spends its clock cycles waiting on memory bus bandwidth, or moving through the states of a regexp automata.

Because TatSu's core execution loop pushes those operations down into optimized C primitives and bitmaps, the remaining runtime overhead left to optimize ($T(1) - b$) is incredibly small.

Rewriting the engine in Rust completely eliminates Python's memory management friction, heavy object boxing, and Garbage Collection pauses, but it cannot bypass the physical limits of silicon. Once an engine achieves true Mechanical Sympathy with the underlying hardware, the language it is written in becomes secondary—the physics of the text stream call the shots.

OgoPEGo is a brand-new implementation of the semantics in Go. It's quite beautiful and very efficient, and it also hits the same asymptotic bound on parsing speed.

Non-Features

Most features of TatSu are available in TieXiu. Some features have not yet been implemented, and a few never will:

  • Generation of synthetic classes from grammar parameters will not be implemented in Rust.
  • Generation of source code with an object model for deifinitions in the grammar may be implemented if a way is found to make the parser or postprocessing bind the Tree output of a parse to the model.
  • Code generation of a parser recently moved in TatSu to the loading of a model of the Grammar and using it as parser. Although the generated procedural parser may produce 1.3x increased throughput in Python, supporting generated code is hard, and it complicates the internal interfaces. For Rust, TieXiu alreay knows how to load fast a Grammar model from TatSu JSON. A generated copy of the grammar model constructor could be precompiled by Rust.
  • Parsing of boolean and numeric values happens in TatSu through synthetic actions, which call the constructors for those types passing the parsed strings. For TieXiu the preferred way of transformig a tree (semantics) is through post-processing (folding), but basic numeric types and booleans could be supported.
  • Semantic actions (transformations) during parse are not implemented. Python is friendly to objects of type Any, so semantic actions during parse in TatSu can produce a tree of any type. Rust is different, and trying to have structures of an any type is not rustacean. The result of a parse is a well-defined Tree which is a small-enough enum that writing a walker for it is easy, so type transformations can be done in postprocessing by folding. See the fold modules in TieXiu for examples and useful trait definitions.
  • Interpolation and evaluation of `constant` expressions hasn't had any known use cases with TatSu. They will not be implemented in TieXiu until a use case appears.
  • The @@include directive for textual includes was always a bad idea.

API

The needs of most users are met by parsing input with the rules in a grammar and reciving the structure output as a JSON-compatible value. For other use cases, TieXiu exposes its internal model and APIs (to be docummented).

The Python API

The return values of Any are of the basic Python types, as defined in the json module documentation (see Encoders and Decoders ).

JSON Python
object dict
array list
string str
number (int) int
number (real) float
true True
false False
null None

Keyword arguments can be passed for runtime configuration. The only recognized argument as of writing is trace=.

These functions are available from package tiexiu.

def parse(grammar: str, text: str, **kwargs: Any) -> Any
def parse_grammar(grammar: str, **kwargs: Any) -> Any:
def parse_grammar_to_json(grammar: str, **kwargs: Any) -> Any:
def parse_to_json(grammar: str, text: str, **kwargs: Anyt) -> Any:
def pretty(grammar: str, **kwargs: Any) -> str:
def compile_to_json(grammar: str, **kwargs: Any) -> Any:

The Rust API

pub fn parse_grammar(grammar: &str, cfg: &CfgA) -> Result<Tree>;
pub fn parse_grammar_to_json(grammar: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn parse_grammar_to_json_string(grammar: &str, cfg: &CfgA) -> Result<String>;
pub fn parse_grammar_with<U>(cursor: U, cfg: &CfgA) -> Result<Tree>
pub fn parse_grammar_to_json_with<U>(cursor: U, cfg: &CfgA) -> Result<serde_json::Value>
pub fn compile(grammar: &str, cfg: &CfgA) -> Result<Grammar>;
pub fn compile_to_json(grammar: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn compile_to_json_string(grammar: &str, cfg: &CfgA) -> Result<String>;
pub fn compile_with<U>(cursor: U, cfg: &CfgA) -> Result<Grammar>
pub fn compile_to_json_with<U>(cursor: U, cfg: &CfgA) -> Result<serde_json::Value>
pub fn load(json: &str, _cfg: &CfgA) -> Result<Grammar>;
pub fn load_to_json(json: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn load_tree(json: &str, _cfg: &CfgA) -> Result<Tree>;
pub fn load_tree_to_json(json: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn grammar_pretty(grammar: &str, cfg: &CfgA) -> Result<String>;
pub fn pretty_tree(tree: &Tree, _cfg: &CfgA) -> Result<String>;
pub fn pretty_tree_json(tree: &Tree, _cfg: &CfgA) -> Result<String>;
pub fn parse(grammar: &str, text: &str, cfg: &CfgA) -> Result<Tree>;
pub fn parse_to_json(grammar: &str, text: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn parse_to_json_string(grammar: &str, text: &str, cfg: &CfgA) -> Result<String>;
pub fn parse_input(parser: &Grammar, text: &str, cfg: &CfgA) -> Result<Tree>;
pub fn parse_input_to_json(parser: &Grammar, text: &str, cfg: &CfgA) -> Result<serde_json::Value>;
pub fn parse_input_to_json_string(parser: &Grammar, text: &str, cfg: &CfgA) -> Result<String>;

Roadmap

The project is functionally complete. Comments about the implementation strategies and possible improvements are now in RODADMAP.

License

Licensed under either of:

at your option.

Contribution

Unless explicitly stated otherwise, any contribution intentionally submitted for inclusion in the work, as defined in the Apache-2.0 license, shall be dual-licensed as above, without any additional terms or conditions.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tiexiu-0.1.3.tar.gz (235.1 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

tiexiu-0.1.3-cp312-abi3-win_amd64.whl (977.1 kB view details)

Uploaded CPython 3.12+Windows x86-64

tiexiu-0.1.3-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.9 MB view details)

Uploaded CPython 3.12+manylinux: glibc 2.17+ x86-64

tiexiu-0.1.3-cp312-abi3-macosx_11_0_arm64.whl (1.2 MB view details)

Uploaded CPython 3.12+macOS 11.0+ ARM64

tiexiu-0.1.3-cp312-abi3-macosx_10_12_x86_64.whl (1.2 MB view details)

Uploaded CPython 3.12+macOS 10.12+ x86-64

File details

Details for the file tiexiu-0.1.3.tar.gz.

File metadata

  • Download URL: tiexiu-0.1.3.tar.gz
  • Upload date:
  • Size: 235.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tiexiu-0.1.3.tar.gz
Algorithm Hash digest
SHA256 458e967c55946bca1708d2f2e77723ea5450e0b5c96f97bfdf5b1419e3eafb24
MD5 9d581e9fc972f96dbe473006407f959a
BLAKE2b-256 e8638db79a82762646969775988b718ec0cffbdd4770c4d3b2701b3e78fe0e57

See more details on using hashes here.

Provenance

The following attestation bundles were made for tiexiu-0.1.3.tar.gz:

Publisher: release.yml on neogeny/tiexiu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tiexiu-0.1.3-cp312-abi3-win_amd64.whl.

File metadata

  • Download URL: tiexiu-0.1.3-cp312-abi3-win_amd64.whl
  • Upload date:
  • Size: 977.1 kB
  • Tags: CPython 3.12+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for tiexiu-0.1.3-cp312-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 b0705d01ed0eda57feb42a0a9e99160e9cab02f06fa22059b95b424cc3e140cc
MD5 ef3526701e1655d5da288c857f63bbb0
BLAKE2b-256 f4000ec0ede38ed11a0f6247409153676bb76c098aed00fbc61ea2301dc2ff40

See more details on using hashes here.

Provenance

The following attestation bundles were made for tiexiu-0.1.3-cp312-abi3-win_amd64.whl:

Publisher: release.yml on neogeny/tiexiu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tiexiu-0.1.3-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for tiexiu-0.1.3-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 29887413ec296fb0e7db856a79a63bba04b79e86a9812fd004308366f5d043fb
MD5 67fb97e0698ba39cb7875ea26104cf02
BLAKE2b-256 ea7ff9ffa8f8ddcc9b6b3ad6248a771121ce48f7f131fcd4115da7e37987ad0a

See more details on using hashes here.

Provenance

The following attestation bundles were made for tiexiu-0.1.3-cp312-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on neogeny/tiexiu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tiexiu-0.1.3-cp312-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for tiexiu-0.1.3-cp312-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 94652001b639a6d3529c0e41d5abb4b1760bb4ff1ef68993aa6d277885f7bbf5
MD5 0f6293cd6f9a086cab1d008ce73141e8
BLAKE2b-256 83bba9e8ba80a7d3013d492bbd007683614d9d2df4d3fd4a46e63cfc75f80cf2

See more details on using hashes here.

Provenance

The following attestation bundles were made for tiexiu-0.1.3-cp312-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on neogeny/tiexiu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file tiexiu-0.1.3-cp312-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for tiexiu-0.1.3-cp312-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 ec44016075a695738d960ee46a0d3728182f3a541986587c630297677d45389c
MD5 68467dcfda2c53d35edd8f4ee05eca24
BLAKE2b-256 cd8f92193737cc713d9609c202415c52d5e67a3dcd3666d5359a15db30270bbe

See more details on using hashes here.

Provenance

The following attestation bundles were made for tiexiu-0.1.3-cp312-abi3-macosx_10_12_x86_64.whl:

Publisher: release.yml on neogeny/tiexiu

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page