Hand-rolled Rust HogQL parser with C++-parity AST output

hogql_parser_rs

Hand-rolled Rust HogQL parser. Pratt + recursive descent. Same JSON AST shape as the C++ ANTLR parser so the two can be cross-validated query-by-query. Ships as a Python extension via maturin and is selected as the rust-json backend in posthog/hogql/parser.py.

About 15× faster than the C++ parser on parse_expr and 50–55× on parse_select, against the same input, on the same machine. The numbers come from posthog/hogql/scripts/parser_bench.py; re-run locally before and after any non-trivial change.

The C++ parser is the source of truth

When grammar, AST shape, or any visible behaviour disagrees between the two, the C++ ANTLR parser is right and this one is wrong. The C++ parser is generated from posthog/hogql/grammar/HogQLLexer.*.g4 and HogQLParser.g4 via ANTLR4. The Rust parser does not consume those grammar files; it hand-implements the same recognition behaviour.

This means any grammar change is a two-step change:

  1. Update the ANTLR grammar and rebuild the C++ parser. Get the new shape working end-to-end on cpp-json. Pin the new behaviour with regression tests (see "Tools" below).

  2. Bring the Rust parser to parity. Run the diagnostics, find the new divergences, fix them. This is the part an LLM agent can drive in a long-running loop.

Skipping step 1 produces a Rust parser that "works" but on a shape the C++ parser rejects, which means Cloud's printer / planner will reject too, because they're built on top of cpp-json's output. Get the oracle right first, then the candidate.

What's in this crate

| Path | What it does |
| --- | --- |
| src/lib.rs | PyO3 entry points (parse_expr_json, parse_select_json, parse_program_json, parse_order_expr_json, parse_full_template_string_json). Each returns a JSON string; on error the JSON is an {"error": true, ...} envelope posthog/hogql/json_ast.py decodes into HogQLSyntaxError / ExposedHogQLError. |
| src/lex.rs | Lexer. Hand-rolled state machine matching the ANTLR-generated C++ lexer's tokens + mode stack (default / template-string / HogQLX-tag / HogQLX-text). When you add a new keyword to the grammar, add it here too. |
| src/parse.rs | Parser core: Parser struct, public entry points, the Pratt expression parser (parse_expr_bp), positions (pos_obj, wrap_pos, wrap_pos_to), char-offset / line-col tables, checkpoint / restore for speculative branches. |
| src/parse/{expr,select,program,join,cte,hogqlx,template}.rs | Per-rule parsing. Most grammar changes land in one of these. |
| src/parse/bp.rs | Binding-power table + build_infix / merge_and_or / merge_concat. The precedence ladder lives here; new operators usually need an entry in infix_bp and a build_infix arm. |
| src/emit.rs | AST-node builders + position helpers (with_pos is idempotent, replace_pos overrides, no_pos reserves null keys to opt out of the wrap). When you add a new AST node, add a helper here so callers don't hand-build the JSON object. |
| src/error.rs | ParseError + the JSON error envelope. |
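
The binding-power idea behind src/parse/bp.rs can be illustrated with a toy Pratt loop. This is a minimal Python sketch, not the crate's code: the INFIX_BP table, its values, the tokenizer, and the tuple-shaped nodes are all assumptions made for illustration; the real table is infix_bp in the Rust source.

```python
import re

# Hypothetical binding powers as (left, right); higher binds tighter.
# left < right makes an operator left-associative.
INFIX_BP = {
    "or": (1, 2), "and": (3, 4),
    "+": (5, 6), "-": (5, 6),
    "*": (7, 8), "/": (7, 8),
}

def tokenize(src):
    """Split a toy expression into numbers, words, operators and parens."""
    return re.findall(r"\d+|[A-Za-z_]+|[-+*/()]", src)

def parse_expr_bp(tokens, pos=0, min_bp=0):
    """Parse tokens[pos:] and return (node, next_pos)."""
    tok = tokens[pos]
    if tok == "(":
        lhs, pos = parse_expr_bp(tokens, pos + 1, 0)
        assert tokens[pos] == ")", "expected closing paren"
        pos += 1
    else:
        lhs, pos = tok, pos + 1
    while pos < len(tokens):
        op = tokens[pos]
        if op not in INFIX_BP:
            break
        left_bp, right_bp = INFIX_BP[op]
        if left_bp < min_bp:
            break  # the operator binds too loosely for this call; let the caller take it
        rhs, pos = parse_expr_bp(tokens, pos + 1, right_bp)
        lhs = (op, lhs, rhs)
    return lhs, pos

tree, _ = parse_expr_bp(tokenize("1 + 2 * 3 - 4"))
print(tree)
```

In this sketch, adding an operator means adding one table entry, which is the same reason a binding-power change tends to generalise better than a special case in the consumer.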

Building locally

# One-time: install the rust toolchain via flox / rustup (the workspace
# Cargo.toml is at `rust/Cargo.toml`).

# Build + install the wheel into the venv (editable). Re-run after each
# rust source change.
uv pip install -e rust/hogql/parser

# Or via maturin directly (faster incremental):
maturin develop --release --manifest-path rust/hogql/parser/Cargo.toml

# Sanity check
python -c "import hogql_parser_rs; print(hogql_parser_rs.parse_expr_json('1 + 2'))"
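
The sanity check prints a JSON string. Since the entry points return an {"error": true, ...} envelope on failure instead of an AST, a caller has to inspect the decoded payload. The sketch below shows one way that check might look; the exception class and the "message" field name are hypothetical, and the real decoding lives in posthog/hogql/json_ast.py.

```python
import json

class ParseFailedSketch(Exception):
    """Hypothetical stand-in for the real exception types."""

def decode_result(payload: str):
    """Return the decoded JSON, or raise if it is an error envelope."""
    data = json.loads(payload)
    if isinstance(data, dict) and data.get("error") is True:
        # "message" is an assumed field name, not confirmed by this README.
        raise ParseFailedSketch(data.get("message", "parse failed"))
    return data

print(decode_result('{"value": 3}'))
try:
    decode_result('{"error": true, "message": "unexpected token"}')
except ParseFailedSketch as exc:
    print("rejected:", exc)
```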

maturin builds a single cp312-abi3 wheel that works on Python 3.12+. CI builds wheels for Linux x86_64/aarch64 (manylinux 2_28 + musllinux 1_2) and macOS arm64/x86_64; see .github/workflows/build-hogql-parser-rs.yml.

Publishing

The crate is pinned via the hogql-parser-rs==X.Y.Z line in the repo-root pyproject.toml. Bump the version in both:

(They must match. The PR check at .github/workflows/build-hogql-parser-rs.yml enforces this.)

Version is intentionally locked in step with common/hogql_parser (the C++ parser PyPI package) so a bump signals "both parsers move together." The publish workflow builds wheels, pushes to PyPI via trusted publishing, then opens a follow-up PR that updates the repo-root pin.

Adding a new grammar feature

Steps 1–4 are the one-time grammar-update process — done once, human-driven. Step 5 (running the parity loop below) is the long-running, agent-friendly part.

  1. Update HogQLLexer.*.g4 and HogQLParser.g4. Run pnpm grammar:build to regenerate the Python and C++ ANTLR artefacts:

    pnpm grammar:build
    

    That step requires the antlr 4.13.2 binary on PATH; instructions in posthog/hogql/grammar/README.md. The script rewrites common/hogql_parser/HogQL{Lexer,Parser}.{cpp,h,interp,tokens} and the matching Python files. Both backends now recognise the new shape.

  2. Pick the AST emission. Decide what JSON the cpp visitor should return for the new shape. Either reuse an existing AST node or add a new one in posthog/hogql/ast.py. The Python AST is shared between backends, so any new node has to land there first, otherwise posthog/hogql/json_ast.py::deserialize_ast will crash on it.

  3. Update the cpp visitor. Add the VISIT(YourNewRule) arm in common/hogql_parser/parser_json.cpp. Mirror cpp's conventions: call addPositionInfo(json, ctx) per rule unless you specifically want a position-less node (see "Position parity" below). Rebuild the cpp wheel (pip install ./common/hogql_parser).

  4. Pin the new behaviour with a regression test. Add a test (and a rust-rejects-it negative test if the grammar tightens) to the parser_test_factory suite in posthog/hogql/test/_test_parser.py. The factory runs every test against cpp-json, rust-json, and python; on a fresh grammar change the test passes on cpp and fails on rust. That failure is the starting state for the parity loop.

  5. Run the parity loop. See the next section.

The parity loop

The agent loop that brings rust to behavioural parity with cpp. Long-running; the diagnostics produce concrete diffs the agent attacks one at a time.

Before each iteration, rebuild both parser wheels from local source. uv pip install / pip install will happily resolve to the published PyPI wheel and ignore the working tree, so a fresh maturin develop and a fresh pip install ./common/hogql_parser are non-negotiable — the diagnostics test what's installed, not what's on disk.

maturin develop --release --manifest-path rust/hogql/parser/Cargo.toml
pip install --force-reinstall --no-deps ./common/hogql_parser

maturin develop writes the wheel into the active venv as editable; --force-reinstall --no-deps for the cpp wheel sidesteps pip's "already-satisfied" short-circuit when the in-tree version matches the PyPI pin. Skip these and the loop will silently chase divergences that have already been fixed in the working tree.
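
Because the diagnostics test whatever is installed, it can help to print where each parser module is loaded from before starting an iteration. The following is a rough sketch using only the standard library. The module name hogql_parser_rs and the distribution name hogql-parser-rs come from this README; the names hogql_parser and hogql-parser for the C++ package are assumptions. A matching version alone does not prove the wheel was built from the working tree, so treat the output as a hint rather than a guarantee.

```python
import importlib.util
from importlib import metadata

def describe_install(module_name, dist_name):
    """Report where a top-level module would be imported from and its installed version."""
    spec = importlib.util.find_spec(module_name)
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    return {"origin": spec.origin if spec else None, "version": version}

# The second pair of names is assumed, not taken from the README.
for module_name, dist_name in [("hogql_parser_rs", "hogql-parser-rs"),
                               ("hogql_parser", "hogql-parser")]:
    print(module_name, describe_install(module_name, dist_name))
```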

  1. Generate a new divergence. In priority order:

    • existing failing regression tests (highest signal);
    • real production queries via log_corpus_diagnostic.py / hog_corpus_diagnostic.py (the hog corpus has been at 100% for a while — usually skip);
    • PBT (pbt_diagnostic.py --rule expr|select|program);
    • thinking hard about edge cases the grammar surface invites.

    For everything other than regression tests, start with a small budget (lower --n, less thinking time) and increase until at least one divergence surfaces.

  2. Reduce + pin. Shrink each divergence to its minimal form and add it as a regression test in _test_parser.py's factory so it runs on all three backends.

  3. Read before fixing. Read the grammar AND the cpp visitor for the rule. 100% identical behaviour means knowing exactly what cpp does — guessing leads to fixes that resurface on a deeper PBT run.

  4. Fix the rust parser. Prefer general fixes that won't break on deeper nesting; a depth-0-only special case is a smell. Print a one-paragraph report for the human operator so progress is visible while the loop runs autonomously.

  5. Re-run the regression suite. Anything below the previous baseline goes back to step 1.
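
The "reduce" part of step 2 can be approximated by greedy token deletion: keep removing tokens for as long as the divergence still reproduces. This is a simplified sketch of that idea, not the algorithm the diagnostics use; the still_diverges predicate is a hypothetical stand-in for "oracle and candidate disagree on this input".

```python
def shrink(tokens, still_diverges):
    """Greedily drop single tokens while the divergence keeps reproducing."""
    current = list(tokens)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if candidate and still_diverges(candidate):
                current = candidate
                changed = True
                break  # restart the scan on the smaller input
    return current

# Hypothetical predicate: pretend the parsers disagree whenever
# "between" is followed later by "and".
def still_diverges(tokens):
    if "between" not in tokens:
        return False
    return "and" in tokens[tokens.index("between"):]

query = "select a between 1 and 2 from events".split()
print(shrink(query, still_diverges))  # ['between', 'and']
```

A reduced token list is not necessarily a valid query, so in practice the predicate would also need to confirm that the oracle still accepts the input.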

Generating divergences is the slow step. Run discovery in parallel in the background:

  • pbt_diagnostic.py --rule select
  • pbt_diagnostic.py --rule expr
  • pbt_diagnostic.py --rule program
  • log_corpus_diagnostic.py (real query corpus)
  • a research subagent grepping for cpp-vs-rust visitor differences
  • a research subagent brainstorming adversarial edge cases

Most of these can stream divergences as they're found. Once at least one known divergence is in hand, start fixing it while the parallel runs keep mining the long tail.

Tools for parity work

Every script below has the same --oracle / --candidate flag pair and defaults to cpp-json vs rust-json. The diagnostics include per-node start / end positions in the comparison by default; set CLEAR_LOCATIONS=1 to strip positions when you want a structural-only read.
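
To make the difference between a position-aware and a structural-only comparison concrete, here is a small sketch of stripping start / end keys from a nested JSON AST before comparing. The node shapes and the integer positions are invented for illustration, and the sketch assumes no AST field other than the position keys is named start or end.

```python
POSITION_KEYS = {"start", "end"}

def strip_positions(node):
    """Return a copy of a JSON-like AST without start / end keys."""
    if isinstance(node, dict):
        return {k: strip_positions(v) for k, v in node.items() if k not in POSITION_KEYS}
    if isinstance(node, list):
        return [strip_positions(item) for item in node]
    return node

def asts_match(oracle_ast, candidate_ast, clear_locations=False):
    """Compare two ASTs, optionally ignoring positions."""
    if clear_locations:
        oracle_ast = strip_positions(oracle_ast)
        candidate_ast = strip_positions(candidate_ast)
    return oracle_ast == candidate_ast

# Hypothetical ASTs that agree on structure but not on the outer end offset.
oracle_ast = {"node": "Call", "start": 0, "end": 5,
              "args": [{"node": "Const", "start": 2, "end": 3}]}
candidate_ast = {"node": "Call", "start": 0, "end": 6,
                 "args": [{"node": "Const", "start": 2, "end": 3}]}

print(asts_match(oracle_ast, candidate_ast))                        # False
print(asts_match(oracle_ast, candidate_ast, clear_locations=True))  # True
```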

Regression tests in posthog/hogql/test/

hogli test posthog/hogql/test/test_parser_cpp_json.py
hogli test posthog/hogql/test/test_parser_python.py
hogli test posthog/hogql/test/test_parser_rust_json.py

The behaviour suite + regression pins live in _test_parser.py's parser_test_factory. The three files above are thin subclasses that spawn one runnable test entry per (backend, case) combination. When you find a new divergence, add a reduced regression to the factory — it picks up all three backends automatically.
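
The factory-plus-thin-subclasses pattern can be sketched with plain unittest. Everything here is illustrative: the real parser_test_factory signature, its test cases, and the way backends are passed in are not shown in this README, and toy_parse is a made-up stand-in for a backend.

```python
import unittest

def parser_test_factory(backend_name, parse):
    """Build a TestCase class whose tests all run against one backend."""
    class ParserTests(unittest.TestCase):
        def test_addition(self):
            self.assertEqual(parse("1 + 2"), ("+", "1", "2"))

        def test_rejects_empty_input(self):
            with self.assertRaises(ValueError):
                parse("")

    ParserTests.__name__ = f"ParserTests_{backend_name}"
    return ParserTests

def toy_parse(src):
    """Hypothetical backend that only understands 'a op b'."""
    parts = src.split()
    if len(parts) != 3:
        raise ValueError("toy parser only handles 'a op b'")
    return (parts[1], parts[0], parts[2])

# One thin class per backend; each case added to the factory runs on all of them.
TestToyBackendA = parser_test_factory("toy_a", toy_parse)
TestToyBackendB = parser_test_factory("toy_b", toy_parse)
```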

Property-based testing via posthog/hogql/scripts/pbt_diagnostic.py

PYTHONPATH=. python posthog/hogql/scripts/pbt_diagnostic.py \
    --n 5000 --rule program

# Per rule:
--rule expr     # standalone column expressions
--rule select   # SELECT / SELECT-set statements
--rule program  # full Hog programs (declarations + statements + exprs)

Generates about 5,000 random grammar-surface examples per rule (the --n value above), parses each with the oracle and the candidate, buckets divergences by AST shape, and prints shrunk reproducers. Use --shrink-failures to auto-reduce each divergence to a minimal example.
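
The generate / parse twice / bucket cycle can be sketched in a few lines. This is not how pbt_diagnostic.py is implemented: the generator below only produces small arithmetic expressions, Python's own ast module stands in for the oracle, and the candidate is the same parser with a deliberate bug so that the sketch has something to find.

```python
import ast
import random
import re

def gen_expr(rng, depth=0):
    """Generate a random, fully parenthesised arithmetic expression."""
    if depth >= 3 or rng.random() < 0.3:
        return str(rng.randint(0, 9))
    op = rng.choice(["+", "-", "*"])
    return f"({gen_expr(rng, depth + 1)} {op} {gen_expr(rng, depth + 1)})"

def oracle(src):
    """Stand-in oracle: Python's parser, dumped to a string."""
    return ast.dump(ast.parse(src, mode="eval"))

def candidate(src):
    """Stand-in candidate with a deliberate bug: subtraction is read as addition."""
    return oracle(src).replace("Sub()", "Add()")

def run_pbt(n, seed=0):
    """Return {ast_shape: shortest diverging source} over n random examples."""
    rng = random.Random(seed)
    buckets = {}
    for _ in range(n):
        src = gen_expr(rng)
        expected = oracle(src)
        if expected != candidate(src):
            # Bucket by shape: the oracle AST with constant values blanked out.
            shape = re.sub(r"value=\d+", "value=_", expected)
            if shape not in buckets or len(src) < len(buckets[shape]):
                buckets[shape] = src
    return buckets

for shape, reproducer in sorted(run_pbt(200).items(), key=lambda kv: len(kv[1]))[:3]:
    print(reproducer)
```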

Real-query corpora via log_corpus_diagnostic.py / hog_corpus_diagnostic.py

# SELECT queries from the last 7 days of production traffic
# (redacted, AI-data-processing-approved teams only):
PYTHONPATH=. python posthog/hogql/scripts/log_corpus_diagnostic.py

# Hog programs from production (transformations, destinations, …):
PYTHONPATH=. python posthog/hogql/scripts/hog_corpus_diagnostic.py

Both auto-download via hogli metabase:query and cache locally under posthog/hogql/scripts/.local/. Pass --skip-download to reuse the existing dump while iterating. Failures are written one block per divergence to a .sql / .hog file the agent can chew through.

Perf bench via posthog/hogql/scripts/parser_bench.py

CANDIDATE_BACKEND=rust-json PYTHONPATH=. \
    python posthog/hogql/scripts/parser_bench.py

Runs both parsers against a fixed corpus of representative queries (small / medium / nested / pathological) and prints an oracle / candidate ratio per row. Re-run before and after any non-trivial change. If the mean parse_select ratio drops noticeably (the parse_select speedup is the headline number), find out why before landing.
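
For readers unfamiliar with how an oracle / candidate ratio is read, here is a minimal timing sketch. It does not reproduce parser_bench.py: the two parse functions are made-up stand-ins and the corpus is two toy strings. A ratio above 1.0 means the candidate is faster than the oracle.

```python
import time

def mean_seconds(parse, corpus, repeats=5):
    """Mean wall-clock seconds per query over several passes of the corpus."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        for query in corpus:
            parse(query)
        total += time.perf_counter() - start
    return total / (repeats * len(corpus))

def speedup(oracle, candidate, corpus):
    """Oracle time divided by candidate time."""
    candidate_time = max(mean_seconds(candidate, corpus), 1e-12)  # guard against zero
    return mean_seconds(oracle, corpus) / candidate_time

# Stand-in "parsers": the slow one simply does more work per query.
def slow_parse(query):
    return sorted(query * 50)

def fast_parse(query):
    return query.split()

corpus = ["select 1", "select a, b from events where a > 1"]
print(f"speedup: {speedup(slow_parse, fast_parse, corpus):.1f}x")
```

Timings vary from run to run, which is why comparing a before and an after run on the same machine is more informative than any single number.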

Shadow compare in TEST via cpp-with-rust-shadow

In TEST mode the default backend is cpp-with-rust-shadow: both backends parse, ASTs are compared, mismatches raise so the failing test points right at the offending query. In production this same mode runs at a 1% sample and only logs. Useful when a regression slips past the PBT but shows up in the suite.

from posthog.hogql.constants import HogQLParserBackend
from posthog.hogql.parser import parse_expr

parse_expr("1 + 2", backend=HogQLParserBackend.CPP_WITH_RUST_SHADOW)
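
The shadow behaviour described above (always compare and raise in TEST, sample and log in production) can be modelled as follows. This is a sketch of the control flow only; the function name, its parameters, and the logger name are assumptions, and the real implementation sits behind posthog/hogql/parser.py.

```python
import logging
import random

logger = logging.getLogger("parser_shadow_sketch")

def parse_with_shadow(src, oracle, candidate, *, test_mode,
                      sample_rate=0.01, rng=random.random):
    """Parse with the oracle; shadow-parse with the candidate and compare."""
    result = oracle(src)
    # TEST mode always compares; otherwise only a sampled fraction of calls do.
    if test_mode or rng() < sample_rate:
        shadow = candidate(src)
        if shadow != result:
            if test_mode:
                raise AssertionError(f"parser mismatch on {src!r}")
            logger.warning("parser mismatch on %r", src)
    # The oracle's result is returned either way.
    return result
```

Returning the oracle's result in both modes keeps the candidate from influencing behaviour while it is still being validated.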

Rules of thumb for the parity loop

These aren't always obvious from the diagnostics alone:

  • Prefer the generalising fix. When two implementations both pass the failing cases, pick the one that doesn't depend on the input shape. A wrap_pos call at a single emit site beats a depth-aware conditional. A change to the binding-power table beats an ad-hoc check in the consumer.

  • Position bugs hide behind structural bugs. Always run the PBT with positions on (the default); CLEAR_LOCATIONS=1 is for diagnosing structural regressions only. A 99% structural match can mask a 50% position-aware match.

  • Look at the cpp visitor before guessing. Every per-node position decision in this parser has a cpp counterpart in common/hogql_parser/parser_json.cpp. If the cpp visitor calls addPositionInfo(json, ctx) you need a wrap on the rust side; if it doesn't, you need emit::no_pos (or the helper for that node already does it).

  • Watch the perf bench. Position emission isn't free. Cache O(N) computations on Parser rather than recomputing per emit; the is_ascii_src field is the canonical example.

  • Don't fix one rule at a time at the expense of others. A one-line wrap in parse/expr.rs can move three PBT rules at once. Run all three PBTs after each change, not just the one you started with.

Position parity (the non-obvious part)

The C++ visitor decides per-node whether to emit positions via addPositionInfo(json, ctx). Some nodes are deliberately position-less (NamedArgument, ColumnsExpr in qualified-asterisk column slots, etc.) so the rust parser has to match that exactly.

Three position helpers in emit.rs cover the three cases:

| Helper | When to use |
| --- | --- |
| with_pos | Default. Adds start / end if not already set. Used by Parser::wrap_pos and wrap_pos_to. Idempotent so the outer pratt-loop wrap doesn't trample inner spans. |
| replace_pos | Override existing start / end. Used by the bare-paren grammar alts ((* REPLACE(...))) where the inner wrap captured only the inner content but cpp's grammar ctx includes the outer parens. |
| no_pos | Pre-insert start: null, end: null so the outer wrap leaves the node bare. Used for nodes cpp explicitly doesn't position (NamedArgument, ColumnExprNamedArg). |
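
How the three helpers interact is easiest to see on plain dictionaries. The Python sketch below models only the semantics stated in the table; the Rust signatures in emit.rs are not shown here, the node contents and integer positions are invented, and whether the null keys survive into the final JSON is not covered.

```python
def with_pos(node, start, end):
    """Add start / end only when the keys are absent (idempotent)."""
    node.setdefault("start", start)
    node.setdefault("end", end)
    return node

def replace_pos(node, start, end):
    """Override start / end unconditionally."""
    node["start"], node["end"] = start, end
    return node

def no_pos(node):
    """Reserve null start / end so a later with_pos leaves the node bare."""
    node["start"] = None
    node["end"] = None
    return node

inner = with_pos({"node": "Field"}, 1, 4)
print(with_pos(inner, 0, 9))                    # the inner span (1, 4) is kept
print(replace_pos(dict(inner), 0, 9))           # the outer span wins
print(with_pos(no_pos({"node": "Arg"}), 0, 9))  # stays null
```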

Two more things to keep in mind:

  • Offsets are character indices, not byte indices. cpp's getStartIndex() is char-based; rust's source slices are byte-based. Parser::pos_obj converts via byte_to_char_index for non-ASCII sources, short-circuits for ASCII. If you bypass pos_obj (e.g. hand-building a position object for a node-builder you control), you have to do the conversion yourself.

  • Column is character-position-in-line, not byte-position. Same reason. The ASCII fast path in pos_obj handles this for free; the slow path counts chars between line-start and offset.
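
The byte-versus-character distinction can be demonstrated in Python, where strings are indexed by code point but UTF-8 encodes some characters as several bytes. This sketch borrows the name byte_to_char_index from the text above but is not a port of the Rust code. It assumes "character" means Unicode code point, that the byte offset falls on a character boundary, and that columns are zero-based.

```python
def byte_to_char_index(src: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a character (code point) index."""
    if src.isascii():
        return byte_offset  # fast path: one byte per character
    # Slow path: count the characters encoded by the first byte_offset bytes.
    return len(src.encode("utf-8")[:byte_offset].decode("utf-8"))

def char_column(src: str, byte_offset: int) -> int:
    """Zero-based character column of a byte offset within its line."""
    char_index = byte_to_char_index(src, byte_offset)
    line_start = src.rfind("\n", 0, char_index) + 1
    return char_index - line_start

src = "x\né + 1"  # "é" is two bytes in UTF-8
print(byte_to_char_index(src, 7))  # 6: the "1" is the seventh character
print(char_column(src, 7))         # 4: fifth character on the second line
```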

Known long-tail divergences

The PBT for expr and select exposes adversarial grammar surface that the production corpora never see: deep nested BETWEEN low AND high chains with embedded aliases and ternaries, extreme WITHIN GROUP (ORDER BY …) shapes, multi-token-AND-merged operands. These take focused per-shape investigation; the PR description has the current numbers.

The production corpora (log_corpus_diagnostic, hog_corpus_diagnostic) stay above 90%, so anything the PBT surfaces that doesn't appear there is technically grammar-parity work but not user-visible.

Selecting from Python

from posthog.hogql.parser import parse_expr, parse_select, parse_program

ast = parse_expr("1 + event.properties.$browser", backend="rust-json")

Backends live in posthog/hogql/constants.HogQLParserBackend:

| Backend | Use case |
| --- | --- |
| cpp-json | Production default. ANTLR-based, oracle for everything below. |
| rust-json | This crate. ~15× (parse_expr) / 50–55× (parse_select) faster, behaviour identical (modulo the long tail). |
| python | Pure-Python ANTLR fallback. Slower; useful for debugging visitor changes. |
| cpp-with-rust-shadow | Default in TEST mode. Parses with cpp, shadow-parses with rust, raises on mismatch (TEST) / logs at 1% sample (prod). |
