Hand-rolled Rust HogQL parser with C++-parity AST output

hogql_parser_rs

Hand-rolled Rust HogQL parser. Pratt + recursive descent. Same JSON AST shape as the C++ ANTLR parser so the two can be cross-validated query-by-query. Ships as a Python extension via maturin and is selected as the rust-json backend in posthog/hogql/parser.py.

About 15× faster than the C++ parser on parse_expr and 50–55× on parse_select, against the same input, on the same machine. The numbers come from posthog/hogql/scripts/parser_bench.py; re-run locally before and after any non-trivial change.

The C++ parser is the source of truth

When grammar, AST shape, or any visible behaviour disagrees between the two, the C++ ANTLR parser is right and this one is wrong. The C++ parser is generated from posthog/hogql/grammar/HogQLLexer.*.g4 and HogQLParser.g4 via ANTLR4. The Rust parser does not consume those grammar files; it hand-implements the same recognition behaviour.

This means any grammar change is a two-step change:

  1. Update the ANTLR grammar and rebuild the C++ parser. Get the new shape working end-to-end on cpp-json. Pin the new behaviour with regression tests (see "Tools" below).

  2. Bring the Rust parser to parity. Run the diagnostics, find the new divergences, fix them. This is the part an LLM agent can drive in a long-running loop.

Skipping step 1 produces a Rust parser that "works", but on a shape the C++ parser rejects. Cloud's printer / planner will reject that shape too, because they're built on top of cpp-json's output. Get the oracle right first, then the candidate.

What's in this crate

| Path | What it does |
| --- | --- |
| src/lib.rs | PyO3 entry points (parse_expr_json, parse_select_json, parse_program_json, parse_order_expr_json, parse_full_template_string_json). Each returns a JSON string; on error the JSON is an {"error": true, ...} envelope posthog/hogql/json_ast.py decodes into HogQLSyntaxError / ExposedHogQLError. |
| src/lex.rs | Lexer. Hand-rolled state machine matching the ANTLR-generated C++ lexer's tokens + mode stack (default / template-string / HogQLX-tag / HogQLX-text). When you add a new keyword to the grammar, add it here too. |
| src/parse.rs | Parser core: Parser struct, public entry points, the Pratt expression parser (parse_expr_bp), positions (pos_obj, wrap_pos, wrap_pos_to), char-offset / line-col tables, checkpoint / restore for speculative branches. |
| src/parse/{expr,select,program,join,cte,hogqlx,template}.rs | Per-rule parsing. Most grammar changes land in one of these. |
| src/parse/bp.rs | Binding-power table + build_infix / merge_and_or / merge_concat. The precedence ladder lives here; new operators usually need an entry in infix_bp and a build_infix arm. |
| src/emit.rs | AST-node builders + position helpers (with_pos is idempotent, replace_pos overrides, no_pos reserves null keys to opt out of the wrap). When you add a new AST node, add a helper here so callers don't hand-build the JSON object. |
| src/error.rs | ParseError + the JSON error envelope. |
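
To make the split between the Pratt loop (parse.rs) and the binding-power table (parse/bp.rs) concrete, here is a minimal Pratt parser sketched in Python rather than Rust. The INFIX_BP values, the tuple AST, and the tokenizer are illustrative assumptions; they are not the crate's actual table or node shapes.

```python
import re

# Hypothetical binding powers: (left, right). Higher binds tighter.
# left < right makes an operator left-associative.
INFIX_BP = {
    "or": (1, 2),
    "and": (3, 4),
    "+": (5, 6),
    "-": (5, 6),
    "*": (7, 8),
    "/": (7, 8),
}

TOKEN_RE = re.compile(r"\s*(\d+|[A-Za-z_]\w*|[-+*/()])")


def tokenize(src):
    """Split the source into number, identifier, operator and paren tokens."""
    src = src.rstrip()
    tokens, i = [], 0
    while i < len(src):
        m = TOKEN_RE.match(src, i)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {i}")
        tokens.append(m.group(1))
        i = m.end()
    return tokens


def parse_expr_bp(tokens, pos, min_bp):
    """Parse an expression whose infix operators have left power >= min_bp."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    tok = tokens[pos]
    pos += 1
    if tok == "(":
        lhs, pos = parse_expr_bp(tokens, pos, 0)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        pos += 1
    elif tok == ")" or tok in INFIX_BP:
        raise SyntaxError(f"unexpected token {tok!r}")
    else:
        lhs = int(tok) if tok.isdigit() else tok

    while pos < len(tokens):
        op = tokens[pos]
        if op not in INFIX_BP:
            break  # e.g. a closing paren: let the caller handle it
        left_bp, right_bp = INFIX_BP[op]
        if left_bp < min_bp:
            break  # operator binds too loosely for this level
        rhs, pos = parse_expr_bp(tokens, pos + 1, right_bp)
        lhs = (op, lhs, rhs)
    return lhs, pos


def parse(src):
    tokens = tokenize(src)
    ast, pos = parse_expr_bp(tokens, 0, 0)
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return ast
```

In this shape, adding an operator is a one-line table change plus whatever node the builder should emit, which is the same reason the README points new operators at infix_bp and build_infix.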

Building locally

# One-time: install the rust toolchain via flox / rustup (the workspace
# Cargo.toml is at `rust/Cargo.toml`).

# Build + install the wheel into the venv (editable). Re-run after each
# rust source change.
uv pip install -e rust/hogql/parser

# Or via maturin directly (faster incremental):
maturin develop --release --manifest-path rust/hogql/parser/Cargo.toml

# Sanity check
python -c "import hogql_parser_rs; print(hogql_parser_rs.parse_expr_json('1 + 2'))"
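
The sanity check prints whatever JSON string parse_expr_json returns, which is either an AST or an error envelope. The following is a rough sketch of how a caller might tell the two apart. Only the "error": true key comes from this README; the "message" field and the SketchSyntaxError class are assumptions made for illustration, and the real decoding lives in posthog/hogql/json_ast.py.

```python
import json


class SketchSyntaxError(Exception):
    """Stand-in for the exception type the real decoder raises."""


def decode_parser_output(payload):
    """Return the decoded AST, or raise if the payload is an error envelope."""
    data = json.loads(payload)
    if isinstance(data, dict) and data.get("error") is True:
        # "message" is an assumed field name; the README only documents
        # that the envelope carries "error": true.
        raise SketchSyntaxError(data.get("message", "parse failed"))
    return data
```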

maturin builds a single cp312-abi3 wheel that works on Python 3.12+. CI builds wheels for Linux x86_64/aarch64 (manylinux 2_28 + musllinux 1_2) and macOS arm64/x86_64; see .github/workflows/build-hogql-parser-rs.yml.

Publishing

The crate is pinned via the hogql-parser-rs==X.Y.Z line in the repo-root pyproject.toml. Bump the version in both:

(They must match. The PR check at .github/workflows/build-hogql-parser-rs.yml enforces this.)

Version is intentionally locked in step with common/hogql_parser (the C++ parser PyPI package) so a bump signals "both parsers move together." The publish workflow builds wheels, pushes to PyPI via trusted publishing, then opens a follow-up PR that updates the repo-root pin.

Adding a new grammar feature

Steps 1–4 are the one-time grammar-update process — done once, human-driven. Step 5 (running the parity loop below) is the long-running, agent-friendly part.

  1. Update HogQLLexer.*.g4 and HogQLParser.g4. Run pnpm grammar:build to regenerate the Python and C++ ANTLR artefacts:

    pnpm grammar:build
    

    That step requires the antlr 4.13.2 binary on PATH; instructions in posthog/hogql/grammar/README.md. The script rewrites common/hogql_parser/HogQL{Lexer,Parser}.{cpp,h,interp,tokens} and the matching Python files. Both backends now recognise the new shape.

  2. Pick the AST emission. Decide what JSON the cpp visitor should return for the new shape. Either reuse an existing AST node or add a new one in posthog/hogql/ast.py. The Python AST is shared between backends, so any new node has to land there first, otherwise posthog/hogql/json_ast.py::deserialize_ast will crash on it.

  3. Update the cpp visitor. Add the VISIT(YourNewRule) arm in common/hogql_parser/parser_json.cpp. Mirror cpp's conventions: call addPositionInfo(json, ctx) per rule unless you specifically want a position-less node (see "Position parity" below). Rebuild the cpp wheel (pip install ./common/hogql_parser).

  4. Pin the new behaviour with a regression test. Add a test (and a rust-rejects-it negative test if the grammar tightens) to the parser_test_factory suite in posthog/hogql/test/_test_parser.py. The factory runs every test against cpp-json, rust-json, and python; on a fresh grammar change the test passes on cpp and fails on rust. That failure is the starting state for the parity loop.

  5. Run the parity loop. See the next section.

The parity loop

The agent loop that brings rust to behavioural parity with cpp. Long-running; the diagnostics produce concrete diffs the agent attacks one at a time.

Before each iteration, rebuild both parser wheels from local source. uv pip install / pip install will happily resolve to the published PyPI wheel and ignore the working tree, so a fresh maturin develop and a fresh pip install ./common/hogql_parser are non-negotiable — the diagnostics test what's installed, not what's on disk.

maturin develop --release --manifest-path rust/hogql/parser/Cargo.toml
pip install --force-reinstall --no-deps ./common/hogql_parser

maturin develop writes the wheel into the active venv as editable; --force-reinstall --no-deps for the cpp wheel sidesteps pip's "already-satisfied" short-circuit when the in-tree version matches the PyPI pin. Skip these and the loop will silently chase divergences that have already been fixed in the working tree.
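
Because the diagnostics test what is installed, it can help to confirm where a module is actually loaded from before starting an iteration. This is a small sketch using only the standard library; describe_install is a hypothetical helper, not something that ships in the repo.

```python
import importlib.metadata
import importlib.util


def describe_install(module_name, dist_name=None):
    """Report where a module would be imported from and its installed version."""
    spec = importlib.util.find_spec(module_name)
    origin = spec.origin if spec is not None else None
    try:
        version = importlib.metadata.version(dist_name or module_name)
    except importlib.metadata.PackageNotFoundError:
        version = None
    return {"module": module_name, "origin": origin, "version": version}
```

Calling describe_install("hogql_parser_rs", "hogql-parser-rs") after the rebuild should show an origin inside the active venv; an unexpected version or a missing origin suggests the rebuild did not take effect.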

  1. Generate a new divergence. In priority order:

    • existing failing regression tests (highest signal);
    • real production queries via log_corpus_diagnostic.py / hog_corpus_diagnostic.py (the hog corpus has been at 100% for a while — usually skip);
    • PBT (pbt_diagnostic.py --rule expr|select|program);
    • thinking hard about edge cases the grammar surface invites.

    For everything other than regression tests, start with a small budget (lower --n, less thinking time) and increase until at least one divergence surfaces.

  2. Reduce + pin. Shrink each divergence to its minimal form and add it as a regression test in _test_parser.py's factory so it runs on all three backends.

  3. Read before fixing. Read the grammar AND the cpp visitor for the rule. 100% identical behaviour means knowing exactly what cpp does — guessing leads to fixes that resurface on a deeper PBT run.

  4. Fix the rust parser. Prefer general fixes that won't break on deeper nesting; a depth-0-only special case is a smell. Print a one-paragraph report for the human operator so progress is visible while the loop runs autonomously.

  5. Re-run the regression suite. Anything below the previous baseline goes back to step 1.

Generating divergences is the slow step. Run discovery in parallel in the background:

  • pbt_diagnostic.py --rule select
  • pbt_diagnostic.py --rule expr
  • pbt_diagnostic.py --rule program
  • log_corpus_diagnostic.py (real query corpus)
  • a research subagent grepping for cpp-vs-rust visitor differences
  • a research subagent brainstorming adversarial edge cases

Most of these can stream divergences as they're found. Once at least one known divergence is in hand, start fixing it while the parallel runs keep mining the long tail.
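
One possible way to drive the parallel discovery runs from a single process is sketched below. The jobs are plain callables so the example stays self-contained; in practice each would presumably wrap a subprocess call to one of the diagnostic scripts, and that wrapping is an assumption. Results arrive per finished job, which is coarser than the per-divergence streaming the scripts themselves can do.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_discovery(jobs, max_workers=4):
    """Run each discovery job in a thread; yield (name, divergences) as jobs finish."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn): name for name, fn in jobs.items()}
        for future in as_completed(futures):
            yield futures[future], future.result()
```

The consumer can start fixing the first yielded divergence while the remaining jobs keep running.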

Tools for parity work

Every script below has the same --oracle / --candidate flag pair and defaults to cpp-json vs rust-json. The diagnostics include per-node start / end positions in the comparison by default; set CLEAR_LOCATIONS=1 to strip positions when you want a structural-only read.
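
As an illustration of the two comparison modes, the sketch below compares two decoded JSON ASTs with positions included, then again with start / end stripped. The node layout in the usage example is made up for the example, and the diagnostics' real comparison logic may differ.

```python
def strip_positions(node):
    """Recursively drop start / end keys for a structural-only comparison."""
    if isinstance(node, dict):
        return {
            key: strip_positions(value)
            for key, value in node.items()
            if key not in ("start", "end")
        }
    if isinstance(node, list):
        return [strip_positions(item) for item in node]
    return node


def first_divergence(oracle, candidate, path="$"):
    """Return the path of the first difference between two ASTs, or None."""
    if type(oracle) is not type(candidate):
        return path
    if isinstance(oracle, dict):
        for key in sorted(set(oracle) | set(candidate)):
            if key not in oracle or key not in candidate:
                return f"{path}.{key}"
            found = first_divergence(oracle[key], candidate[key], f"{path}.{key}")
            if found is not None:
                return found
        return None
    if isinstance(oracle, list):
        if len(oracle) != len(candidate):
            return f"{path}.length"
        for index, (left, right) in enumerate(zip(oracle, candidate)):
            found = first_divergence(left, right, f"{path}[{index}]")
            if found is not None:
                return found
        return None
    return None if oracle == candidate else path
```

A pair of trees that differ only in a nested end offset reports a path with positions on and None once positions are stripped, which is the situation the position-aware default is meant to catch.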

Regression tests in posthog/hogql/test/

hogli test posthog/hogql/test/test_parser_cpp_json.py
hogli test posthog/hogql/test/test_parser_python.py
hogli test posthog/hogql/test/test_parser_rust_json.py

The behaviour suite + regression pins live in _test_parser.py's parser_test_factory. The three files above are thin subclasses that spawn one runnable test entry per (backend, case) combination. When you find a new divergence, add a reduced regression to the factory — it picks up all three backends automatically.
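
The factory-plus-thin-subclass pattern can be sketched with plain unittest as follows. make_parser_tests, toy_parse, and the two cases are hypothetical stand-ins; the real parser_test_factory's signature and cases are not shown in this README.

```python
import unittest


def make_parser_tests(backend_name, parse):
    """Build a TestCase class whose cases all run against one backend."""

    class ParserTests(unittest.TestCase):
        def test_binary_plus(self):
            self.assertEqual(parse("1 + 2"), ["1", "+", "2"])

        def test_empty_input_is_rejected(self):
            with self.assertRaises(ValueError):
                parse("")

    ParserTests.__name__ = f"ParserTests_{backend_name}"
    ParserTests.__qualname__ = ParserTests.__name__
    return ParserTests


def toy_parse(src):
    """Toy stand-in for a backend: whitespace tokenizer that rejects empty input."""
    tokens = src.split()
    if not tokens:
        raise ValueError("empty input")
    return tokens


# One thin class per backend; a case added to the factory reaches all of them.
TestToyBackendA = make_parser_tests("toy_a", toy_parse)
TestToyBackendB = make_parser_tests("toy_b", toy_parse)
```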

Property-based testing via posthog/hogql/scripts/pbt_diagnostic.py

PYTHONPATH=. python posthog/hogql/scripts/pbt_diagnostic.py \
    --n 5000 --rule program

# Per rule:
--rule expr     # standalone column expressions
--rule select   # SELECT / SELECT-set statements
--rule program  # full Hog programs (declarations + statements + exprs)

Generates ~5 000 random grammar surface examples per rule, parses with oracle and candidate, buckets divergences by AST shape, and prints shrunk reproducers. Use --shrink-failures to auto-reduce each divergence to a minimal example.
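
To give a feel for what reducing a divergence involves, here is a greedy one-token-at-a-time shrinker. It is a generic sketch, not the algorithm pbt_diagnostic.py uses; still_fails stands in for "oracle and candidate still disagree on this input".

```python
def shrink(tokens, still_fails):
    """Greedily drop single tokens while the failure predicate keeps holding."""
    current = list(tokens)
    changed = True
    while changed:
        changed = False
        for index in range(len(current)):
            candidate = current[:index] + current[index + 1:]
            # Never shrink to an empty input, and keep only failing candidates.
            if candidate and still_fails(candidate):
                current = candidate
                changed = True
                break
    return current
```

The result is minimal only in the sense that removing any single remaining token makes the failure disappear; it is still a useful starting point for a regression pin.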

Real-query corpora via log_corpus_diagnostic.py / hog_corpus_diagnostic.py

# SELECT queries from the last 7 days of production traffic
# (redacted, AI-data-processing-approved teams only):
PYTHONPATH=. python posthog/hogql/scripts/log_corpus_diagnostic.py

# Hog programs from production (transformations, destinations, …):
PYTHONPATH=. python posthog/hogql/scripts/hog_corpus_diagnostic.py

Both auto-download via hogli metabase:query and cache locally under posthog/hogql/scripts/.local/. Pass --skip-download to reuse the existing dump while iterating. Failures are written one block per divergence to a .sql / .hog file the agent can chew through.

Perf bench via posthog/hogql/scripts/parser_bench.py

CANDIDATE_BACKEND=rust-json PYTHONPATH=. \
    python posthog/hogql/scripts/parser_bench.py

Runs both parsers against a fixed corpus of representative queries (small / medium / nested / pathological) and prints an oracle / candidate ratio per row. Re-run before and after any non-trivial change. If the mean parse_select ratio drops noticeably (the parse_select speedup is the headline number), find out why before landing.
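
For a quick one-off comparison outside the bench script, an oracle / candidate ratio can be approximated with timeit. This is a sketch under the assumption that both parsers are exposed as callables taking the source string; the repeat and number defaults are arbitrary, and parser_bench.py remains the reference measurement.

```python
import statistics
import timeit


def speedup(oracle, candidate, src, repeat=5, number=200):
    """Return mean oracle time divided by mean candidate time for one input."""
    oracle_times = timeit.repeat(lambda: oracle(src), repeat=repeat, number=number)
    candidate_times = timeit.repeat(lambda: candidate(src), repeat=repeat, number=number)
    return statistics.mean(oracle_times) / statistics.mean(candidate_times)
```

A value above 1 means the candidate was faster on that input.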

Shadow compare in TEST via cpp-with-rust-shadow

In TEST mode the default backend is cpp-with-rust-shadow: both backends parse, ASTs are compared, mismatches raise so the failing test points right at the offending query. In production this same mode runs at a 1% sample and only logs. Useful when a regression slips past the PBT but shows up in the suite.

from posthog.hogql.constants import HogQLParserBackend
parse_expr(src, backend=HogQLParserBackend.CPP_WITH_RUST_SHADOW)
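
The behaviour described above (raise in TEST, log at a sample rate in production) could look roughly like the following. parse_with_shadow, ShadowMismatch, and the rng parameter are hypothetical names for illustration; the sketch also ignores the case where the oracle itself raises, which a real implementation would need to handle.

```python
import logging
import random

logger = logging.getLogger("parser_shadow")


class ShadowMismatch(AssertionError):
    """Raised in test mode when the two backends disagree."""


def parse_with_shadow(src, oracle, candidate, *, test_mode,
                      sample_rate=0.01, rng=random.random):
    """Return the oracle's result, shadow-parsing with the candidate."""
    result = oracle(src)
    if not test_mode and rng() >= sample_rate:
        return result  # outside the production sample: skip the shadow parse
    try:
        matches = candidate(src) == result
    except Exception:
        if test_mode:
            raise
        matches = False  # in production a candidate crash counts as a mismatch
    if not matches:
        if test_mode:
            raise ShadowMismatch(f"backends disagree on {src!r}")
        logger.warning("shadow parser mismatch for %r", src)
    return result
```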

Rules of thumb for the parity loop

These aren't always obvious from the diagnostics alone:

  • Prefer the generalising fix. When two implementations both pass the failing cases, pick the one that doesn't depend on the input shape. A wrap_pos call at a single emit site beats a depth-aware conditional. A change to the binding-power table beats an ad-hoc check in the consumer.

  • Position bugs hide behind structural bugs. Always run the PBT with positions on (the default); CLEAR_LOCATIONS=1 is for diagnosing structural regressions only. A 99% structural match can mask a 50% position-aware match.

  • Look at the cpp visitor before guessing. Every per-node position decision in this parser has a cpp counterpart in common/hogql_parser/parser_json.cpp. If the cpp visitor calls addPositionInfo(json, ctx) you need a wrap on the rust side; if it doesn't, you need emit::no_pos (or the helper for that node already does it).

  • Watch the perf bench. Position emission isn't free. Cache O(N) computations on Parser rather than recomputing per emit; the is_ascii_src field is the canonical example.

  • Don't fix one rule at a time at the expense of others. A one-line wrap in parse/expr.rs can move three PBT rules at once. Run all three PBTs after each change, not just the one you started with.

Position parity (the non-obvious part)

The C++ visitor decides per-node whether to emit positions via addPositionInfo(json, ctx). Some nodes are deliberately position-less (NamedArgument, ColumnsExpr in qualified-asterisk column slots, etc.) so the rust parser has to match that exactly.

Three position helpers in emit.rs cover the three cases:

| Helper | When to use |
| --- | --- |
| with_pos | Default. Adds start / end if not already set. Used by Parser::wrap_pos and wrap_pos_to. Idempotent so the outer pratt-loop wrap doesn't trample inner spans. |
| replace_pos | Override existing start / end. Used by the bare-paren grammar alts ((* REPLACE(...))) where the inner wrap captured only the inner content but cpp's grammar ctx includes the outer parens. |
| no_pos | Pre-insert start: null, end: null so the outer wrap leaves the node bare. Used for nodes cpp explicitly doesn't position (NamedArgument, ColumnExprNamedArg). |
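
The interaction between the three helpers is easier to see on a toy model. The Python functions below mimic the described semantics on plain dicts; the real helpers are Rust functions in emit.rs, and the integer positions used here are a simplification of whatever pos_obj produces.

```python
def with_pos(node, start, end):
    """Add start / end only when the keys are absent, so repeated wraps are no-ops."""
    node.setdefault("start", start)
    node.setdefault("end", end)
    return node


def replace_pos(node, start, end):
    """Overwrite start / end regardless of what an inner wrap already set."""
    node["start"] = start
    node["end"] = end
    return node


def no_pos(node):
    """Reserve the keys with null values so a later with_pos leaves the node bare."""
    node["start"] = None
    node["end"] = None
    return node
```

Because with_pos checks for key presence rather than for a non-null value, a node that went through no_pos survives an outer wrap unchanged.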

Two more things to keep in mind:

  • Offsets are character indices, not byte indices. cpp's getStartIndex() is char-based; rust's source slices are byte-based. Parser::pos_obj converts via byte_to_char_index for non-ASCII sources, short-circuits for ASCII. If you bypass pos_obj (e.g. hand-building a position object for a node-builder you control), you have to do the conversion yourself.

  • Column is character-position-in-line, not byte-position. Same reason. The ASCII fast path in pos_obj handles this for free; the slow path counts chars between line-start and offset.
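
The byte-versus-character distinction can be demonstrated in a few lines of Python. The functions below model the conversion on a UTF-8 encoding of the source; they are not the crate's code, the 0-based column is an assumption, and unlike the crate (which caches is_ascii_src on Parser) this sketch rechecks ASCII-ness on every call.

```python
def byte_to_char_index(src, byte_offset):
    """Convert a UTF-8 byte offset into a character index."""
    if src.isascii():
        return byte_offset  # fast path: one byte per character
    # Slow path: count the characters encoded by the preceding bytes.
    # Raises UnicodeDecodeError if byte_offset splits a multi-byte character.
    return len(src.encode("utf-8")[:byte_offset].decode("utf-8"))


def column_in_line(src, byte_offset):
    """Return the 0-based character column of a byte offset within its line."""
    char_index = byte_to_char_index(src, byte_offset)
    line_start = src.rfind("\n", 0, char_index) + 1
    return char_index - line_start
```

For the source "é + 1", the "+" sits at byte offset 3 but character index 2, which is the kind of off-by-one a hand-built position object would get wrong.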

Known long-tail divergences

The PBT for expr and select exposes adversarial grammar surface that the production corpora never see: deep nested BETWEEN low AND high chains with embedded aliases and ternaries, extreme WITHIN GROUP (ORDER BY …) shapes, multi-token-AND-merged operands. These take focused per-shape investigation; the PR description has the current numbers.

The production corpora (log_corpus_diagnostic, hog_corpus_diagnostic) stay above 90%, so anything the PBT surfaces that doesn't appear there is technically grammar-parity work but not user-visible.

Selecting from Python

from posthog.hogql.parser import parse_expr, parse_select, parse_program

ast = parse_expr("1 + event.properties.$browser", backend="rust-json")

Backends live in posthog/hogql/constants.HogQLParserBackend:

| Backend | Use case |
| --- | --- |
| cpp-json | Production default. ANTLR-based, oracle for everything below. |
| rust-json | This crate. ~15× / ~50× faster, behaviour identical (modulo the long tail). |
| python | Pure-Python ANTLR fallback. Slower; useful for debugging visitor changes. |
| cpp-with-rust-shadow | Default in TEST mode. Parses with cpp, shadow-parses with rust, raises on mismatch (TEST) / logs at 1% sample (prod). |
