Hand-rolled Rust HogQL parser with C++-parity AST output

hogql_parser_rs

Hand-rolled Rust HogQL parser. Pratt + recursive descent. Same JSON AST shape as the C++ ANTLR parser so the two can be cross-validated query-by-query. Ships as a Python extension via maturin and is selected as the rust-json backend in posthog/hogql/parser.py.

About 15× faster than the C++ parser on parse_expr and 50–55× on parse_select, against the same input, on the same machine. The numbers come from posthog/hogql/scripts/parser_bench.py; re-run locally before and after any non-trivial change.

The C++ parser is the source of truth

When grammar, AST shape, or any visible behaviour disagrees between the two, the C++ ANTLR parser is right and this one is wrong. The C++ parser is generated from posthog/hogql/grammar/HogQLLexer.*.g4 and HogQLParser.g4 via ANTLR4. The Rust parser does not consume those grammar files; it hand-implements the same recognition behaviour.

This means any grammar change is a two-step change:

  1. Update the ANTLR grammar and rebuild the C++ parser. Get the new shape working end-to-end on cpp-json. Pin the new behaviour with regression tests (see "Tools" below).

  2. Bring the Rust parser to parity. Run the diagnostics, find the new divergences, fix them. This is the part an LLM agent can drive in a long-running loop.

Skipping step 1 produces a Rust parser that "works" but on a shape the C++ parser rejects, which means Cloud's printer / planner will reject too, because they're built on top of cpp-json's output. Get the oracle right first, then the candidate.

What's in this crate

| Path | What it does |
| --- | --- |
| src/lib.rs | PyO3 entry points (parse_expr_json, parse_select_json, parse_program_json, parse_order_expr_json, parse_full_template_string_json). Each returns a JSON string; on error the JSON is an {"error": true, ...} envelope posthog/hogql/json_ast.py decodes into HogQLSyntaxError / ExposedHogQLError. |
| src/lex.rs | Lexer. Hand-rolled state machine matching the ANTLR-generated C++ lexer's tokens + mode stack (default / template-string / HogQLX-tag / HogQLX-text). When you add a new keyword to the grammar, add it here too. |
| src/parse.rs | Parser core: Parser struct, public entry points, the Pratt expression parser (parse_expr_bp), positions (pos_obj, wrap_pos, wrap_pos_to), char-offset / line-col tables, checkpoint / restore for speculative branches. |
| src/parse/{expr,select,program,join,cte,hogqlx,template}.rs | Per-rule parsing. Most grammar changes land in one of these. |
| src/parse/bp.rs | Binding-power table + build_infix / merge_and_or / merge_concat. The precedence ladder lives here; new operators usually need an entry in infix_bp and a build_infix arm. |
| src/emit.rs | AST-node builders + position helpers (with_pos is idempotent, replace_pos overrides, no_pos reserves null keys to opt out of the wrap). When you add a new AST node, add a helper here so callers don't hand-build the JSON object. |
| src/error.rs | ParseError + the JSON error envelope. |
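
For readers new to Pratt parsing, the binding-power idea behind parse_expr_bp and the infix_bp table can be illustrated with a toy model. This is a minimal Python sketch, not the crate's code: the operator set, the binding-power numbers, the token regex, and the output dict shape are all assumptions chosen for illustration.

```python
import re

# Hypothetical binding powers as (left, right) pairs; higher binds tighter.
# right > left makes the operator left-associative.
INFIX_BP = {
    "or": (1, 2),
    "and": (3, 4),
    "+": (5, 6),
    "-": (5, 6),
    "*": (7, 8),
    "/": (7, 8),
}

TOKEN_RE = re.compile(r"\s*(\w+|[-+*/()])")


def tokenize(src):
    """Split the source into word, operator, and parenthesis tokens."""
    src = src.strip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.i = 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else None

    def next(self):
        tok = self.peek()
        self.i += 1
        return tok

    def parse_expr_bp(self, min_bp=0):
        tok = self.next()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok == "(":
            lhs = self.parse_expr_bp(0)
            if self.next() != ")":
                raise SyntaxError("expected ')'")
        else:
            lhs = tok  # atoms stay as raw token strings in this sketch
        while True:
            op = self.peek()
            if op not in INFIX_BP:
                break
            left_bp, right_bp = INFIX_BP[op]
            if left_bp < min_bp:
                break  # the caller's operator binds tighter; hand lhs back
            self.next()
            rhs = self.parse_expr_bp(right_bp)
            lhs = {"op": op, "left": lhs, "right": rhs}
        return lhs


def parse(src):
    parser = Parser(tokenize(src))
    tree = parser.parse_expr_bp(0)
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

In this model, adding an operator means adding one table entry, which is the same reason a binding-power change tends to generalise better than a special case in the consumer.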

Building locally

# One-time: install the rust toolchain via flox / rustup (the workspace
# Cargo.toml is at `rust/Cargo.toml`).

# Build + install the wheel into the venv (editable). Re-run after each
# rust source change.
uv pip install -e rust/hogql/parser

# Or via maturin directly (faster incremental):
maturin develop --release --manifest-path rust/hogql/parser/Cargo.toml

# Sanity check
python -c "import hogql_parser_rs; print(hogql_parser_rs.parse_expr_json('1 + 2'))"

maturin builds a single cp312-abi3 wheel that works on Python 3.12+. CI builds wheels for Linux x86_64/aarch64 (manylinux 2_28 + musllinux 1_2) and macOS arm64/x86_64; see .github/workflows/build-hogql-parser-rs.yml.
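
Because every entry point returns a JSON string and failures arrive as an {"error": true, ...} envelope rather than a Python exception, a caller that bypasses posthog/hogql/json_ast.py has to check for the envelope itself. The following is a rough sketch of that check, not the real decoder: the "message" key and the SketchSyntaxError class are assumptions made for illustration, and only the "error": true marker comes from the description above.

```python
import json


class SketchSyntaxError(Exception):
    """Hypothetical stand-in for the real HogQL exception types."""


def decode_result(payload: str):
    """Parse a JSON result string and raise if it is an error envelope."""
    data = json.loads(payload)
    if isinstance(data, dict) and data.get("error") is True:
        # "message" is an assumed field name; the real envelope may differ.
        raise SketchSyntaxError(data.get("message", "parse failed"))
    return data
```

In a checkout with the wheel installed, this would be called as decode_result(hogql_parser_rs.parse_expr_json("1 + 2")).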

Publishing

The crate is pinned via the hogql-parser-rs==X.Y.Z line in the repo-root pyproject.toml. Bump the version in both:

(They must match. The PR check at .github/workflows/build-hogql-parser-rs.yml enforces this.)

Version is intentionally locked in step with common/hogql_parser (the C++ parser PyPI package) so a bump signals "both parsers move together." The publish workflow builds wheels, pushes to PyPI via trusted publishing, then opens a follow-up PR that updates the repo-root pin.

Adding a new grammar feature

Steps 1–4 are the one-time grammar-update process — done once, human-driven. Step 5 (running the parity loop below) is the long-running, agent-friendly part.

  1. Update HogQLLexer.*.g4 and HogQLParser.g4. Run pnpm grammar:build to regenerate the Python and C++ ANTLR artefacts:

    pnpm grammar:build
    

    That step requires the antlr 4.13.2 binary on PATH; instructions in posthog/hogql/grammar/README.md. The script rewrites common/hogql_parser/HogQL{Lexer,Parser}.{cpp,h,interp,tokens} and the matching Python files. Both backends now recognise the new shape.

  2. Pick the AST emission. Decide what JSON the cpp visitor should return for the new shape. Either reuse an existing AST node or add a new one in posthog/hogql/ast.py. The Python AST is shared between backends, so any new node has to land there first, otherwise posthog/hogql/json_ast.py::deserialize_ast will crash on it.

  3. Update the cpp visitor. Add the VISIT(YourNewRule) arm in common/hogql_parser/parser_json.cpp. Mirror cpp's conventions: call addPositionInfo(json, ctx) per rule unless you specifically want a position-less node (see "Position parity" below). Rebuild the cpp wheel (pip install ./common/hogql_parser).

  4. Pin the new behaviour with a regression test. Add a test (and a rust-rejects-it negative test if the grammar tightens) into the parser_test_factory suite in posthog/hogql/test/_test_parser.py. The factory runs every test against cpp-json, rust-json, and python; on a fresh grammar change the test passes on cpp and fails on rust. That fail is the starting state for the parity loop.

  5. Run the parity loop. See the next section.

The parity loop

The agent loop that brings rust to behavioural parity with cpp. Long-running; the diagnostics produce concrete diffs the agent attacks one at a time.

Before each iteration make sure both the cpp and rust parser binaries are up to date (pip install ./common/hogql_parser, maturin develop --release --manifest-path rust/hogql/parser/Cargo.toml).

  1. Generate a new divergence. In priority order:

    • existing failing regression tests (highest signal);
    • real production queries via log_corpus_diagnostic.py / hog_corpus_diagnostic.py (the hog corpus has been at 100% for a while — usually skip);
    • PBT (pbt_diagnostic.py --rule expr|select|program);
    • thinking hard about edge cases the grammar surface invites.

    For everything other than regression tests, start with a small budget (lower --n, less thinking time) and increase until at least one divergence surfaces.

  2. Reduce + pin. Shrink each divergence to its minimal form and add it as a regression test in _test_parser.py's factory so it runs on all three backends.

  3. Read before fixing. Read the grammar AND the cpp visitor for the rule. 100% identical behaviour means knowing exactly what cpp does — guessing leads to fixes that resurface on a deeper PBT run.

  4. Fix the rust parser. Prefer general fixes that won't break on deeper nesting; a depth-0-only special case is a smell. Print a one-paragraph report for the human operator so progress is visible while the loop runs autonomously.

  5. Re-run the regression suite. Anything below the previous baseline goes back to step 1.

Generating divergences is the slow step. Run discovery in parallel in the background:

  • pbt_diagnostic.py --rule select
  • pbt_diagnostic.py --rule expr
  • pbt_diagnostic.py --rule program
  • log_corpus_diagnostic.py (real query corpus)
  • a research subagent grepping for cpp-vs-rust visitor differences
  • a research subagent brainstorming adversarial edge cases

Most of these can stream divergences as they're found. Once at least one known divergence is in hand, start fixing it while the parallel runs keep mining the long tail.
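
One way to launch the script-based discovery runs side by side is a small driver like the one below. It is a sketch under assumptions: it is run from the repo root, it only covers the four script invocations listed above (the research subagents are out of scope), and the helper names are made up for this example. It also collects output at the end rather than streaming it.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

SCRIPTS = "posthog/hogql/scripts"


def discovery_commands():
    """Build the diagnostic command lines named in the list above."""
    cmds = [
        ["python", f"{SCRIPTS}/pbt_diagnostic.py", "--rule", rule]
        for rule in ("select", "expr", "program")
    ]
    cmds.append(["python", f"{SCRIPTS}/log_corpus_diagnostic.py"])
    return cmds


def run_one(cmd):
    """Run one diagnostic with PYTHONPATH=. and capture its output."""
    env = {**os.environ, "PYTHONPATH": "."}
    proc = subprocess.run(cmd, capture_output=True, text=True, env=env)
    return cmd, proc.returncode, proc.stdout


def run_discovery(commands, runner=run_one, max_workers=4):
    """Run all commands concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(runner, commands))
```

The runner parameter exists so the driver can be exercised without the repo, for example run_discovery(discovery_commands(), runner=lambda cmd: (cmd, 0, "")).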

Tools for parity work

Every script below has the same --oracle / --candidate flag pair and defaults to cpp-json vs rust-json. The diagnostics include per-node start / end positions in the comparison by default; set CLEAR_LOCATIONS=1 to strip positions when you want a structural-only read.
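
To make the difference between a position-aware and a structural-only comparison concrete, here is a small sketch of stripping start / end keys from a decoded JSON AST before comparing. It is an illustration only and not how CLEAR_LOCATIONS is implemented; the node shapes in the usage note are invented, and the sketch assumes no AST field other than the position keys is named start or end.

```python
def strip_positions(node):
    """Return a copy of a decoded JSON AST without start / end keys."""
    if isinstance(node, dict):
        return {
            key: strip_positions(value)
            for key, value in node.items()
            if key not in ("start", "end")
        }
    if isinstance(node, list):
        return [strip_positions(item) for item in node]
    return node


def structurally_equal(oracle_ast, candidate_ast):
    """Compare two ASTs while ignoring every position."""
    return strip_positions(oracle_ast) == strip_positions(candidate_ast)
```

Two trees such as {"node": "Constant", "value": 1, "start": 0, "end": 1} and {"node": "Constant", "value": 1, "start": 0, "end": 2} compare equal structurally but not with positions on, which is the kind of bug the default mode is meant to surface.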

Regression tests in posthog/hogql/test/

hogli test posthog/hogql/test/test_parser_cpp_json.py
hogli test posthog/hogql/test/test_parser_python.py
hogli test posthog/hogql/test/test_parser_rust_json.py

The behaviour suite + regression pins live in _test_parser.py's parser_test_factory. The three files above are thin subclasses that spawn one runnable test entry per (backend, case) combination. When you find a new divergence, add a reduced regression to the factory — it picks up all three backends automatically.
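
The factory pattern itself is easy to picture in isolation. The sketch below is not the real parser_test_factory, whose signature is not shown here; make_parser_tests, the stub parser, and the expected dict are hypothetical. It shows how one set of cases can be stamped out as one unittest class per backend.

```python
import io
import unittest


def stub_parse(src):
    """Toy parser for 'a <op> b'; stands in for a real backend."""
    left, op, right = src.split()
    return {"op": op, "left": int(left), "right": int(right)}


def make_parser_tests(backend_name, parse):
    """Build a TestCase class whose cases all go through `parse`."""

    class ParserTests(unittest.TestCase):
        def test_addition(self):
            self.assertEqual(parse("1 + 2"), {"op": "+", "left": 1, "right": 2})

        def test_subtraction(self):
            self.assertEqual(parse("3 - 1"), {"op": "-", "left": 3, "right": 1})

    ParserTests.__name__ = f"ParserTests_{backend_name}"
    return ParserTests


def run_all(backends):
    """Run the shared cases once per backend and return the result."""
    suite = unittest.TestSuite()
    loader = unittest.defaultTestLoader
    for name, parse in backends.items():
        suite.addTests(loader.loadTestsFromTestCase(make_parser_tests(name, parse)))
    return unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

Adding a case to the inner class makes it run for every backend passed to run_all, which mirrors why a reduced regression added to the factory is picked up everywhere.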

Property-based testing via posthog/hogql/scripts/pbt_diagnostic.py

PYTHONPATH=. python posthog/hogql/scripts/pbt_diagnostic.py \
    --n 5000 --rule program

# Per rule:
--rule expr     # standalone column expressions
--rule select   # SELECT / SELECT-set statements
--rule program  # full Hog programs (declarations + statements + exprs)

Generates ~5 000 random grammar surface examples per rule, parses with oracle and candidate, buckets divergences by AST shape, and prints shrunk reproducers. Use --shrink-failures to auto-reduce each divergence to a minimal example.

Real-query corpora via log_corpus_diagnostic.py / hog_corpus_diagnostic.py

# SELECT queries from the last 7 days of production traffic
# (redacted, AI-data-processing-approved teams only):
PYTHONPATH=. python posthog/hogql/scripts/log_corpus_diagnostic.py

# Hog programs from production (transformations, destinations, …):
PYTHONPATH=. python posthog/hogql/scripts/hog_corpus_diagnostic.py

Both auto-download via hogli metabase:query and cache locally under posthog/hogql/scripts/.local/. Pass --skip-download to reuse the existing dump while iterating. Failures are written one block per divergence to a .sql / .hog file the agent can chew through.

Perf bench via posthog/hogql/scripts/parser_bench.py

CANDIDATE_BACKEND=rust-json PYTHONPATH=. \
    python posthog/hogql/scripts/parser_bench.py

Runs both parsers against a fixed corpus of representative queries (small / medium / nested / pathological) and prints an oracle / candidate ratio per row. Re-run before and after any non-trivial change. If the mean parse_select ratio drops noticeably (the parse_select speedup is the headline number), find out why before landing.

Shadow compare in TEST via cpp-with-rust-shadow

In TEST mode the default backend is cpp-with-rust-shadow: both backends parse, ASTs are compared, mismatches raise so the failing test points right at the offending query. In production this same mode runs at a 1% sample and only logs. Useful when a regression slips past the PBT but shows up in the suite.

from posthog.hogql.constants import HogQLParserBackend
from posthog.hogql.parser import parse_expr
src = "1 + 2"
parse_expr(src, backend=HogQLParserBackend.CPP_WITH_RUST_SHADOW)
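
The control flow of the shadow mode can be summarised in a few lines. This is a simplified model written for this explanation, not the code in posthog/hogql/parser.py: the function name, the ShadowMismatch exception, the injected rng, and the choice to treat a candidate crash as a mismatch are all assumptions.

```python
import logging
import random

logger = logging.getLogger("shadow_sketch")


class ShadowMismatch(AssertionError):
    """Hypothetical error raised when the two parsers disagree in TEST."""


def shadow_parse(src, oracle, candidate, *, test_mode, sample_rate=0.01, rng=random.random):
    """Parse with the oracle; shadow-parse with the candidate and compare."""
    expected = oracle(src)
    # Outside TEST only a sample of calls pays for the second parse.
    if not test_mode and rng() >= sample_rate:
        return expected
    try:
        actual = candidate(src)
    except Exception as exc:  # treated as a mismatch in this sketch
        actual = exc
    if actual != expected:
        if test_mode:
            raise ShadowMismatch(f"parser mismatch for {src!r}")
        logger.warning("parser mismatch for %r", src)
    return expected
```

The oracle's result is always what the caller gets back, so the candidate can only ever affect test failures and logs, never production behaviour.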

Rules of thumb for the parity loop

These aren't always obvious from the diagnostics alone:

  • Prefer the generalising fix. When two implementations both pass the failing cases, pick the one that doesn't depend on the input shape. A wrap_pos call at a single emit site beats a depth-aware conditional. A change to the binding-power table beats an ad-hoc check in the consumer.

  • Position bugs hide behind structural bugs. Always run the PBT with positions on (the default); CLEAR_LOCATIONS=1 is for diagnosing structural regressions only. A 99% structural match can mask a 50% position-aware match.

  • Look at the cpp visitor before guessing. Every per-node position decision in this parser has a cpp counterpart in common/hogql_parser/parser_json.cpp. If the cpp visitor calls addPositionInfo(json, ctx) you need a wrap on the rust side; if it doesn't, you need emit::no_pos (or the helper for that node already does it).

  • Watch the perf bench. Position emission isn't free. Cache O(N) computations on Parser rather than recomputing per emit; the is_ascii_src field is the canonical example.

  • Don't fix one rule at a time at the expense of others. A one-line wrap in parse/expr.rs can move three PBT rules at once. Run all three PBTs after each change, not just the one you started with.

Position parity (the non-obvious part)

The C++ visitor decides per-node whether to emit positions via addPositionInfo(json, ctx). Some nodes are deliberately position-less (NamedArgument, ColumnsExpr in qualified-asterisk column slots, etc.) so the rust parser has to match that exactly.

Three position helpers in emit.rs cover the three cases:

| Helper | When to use |
| --- | --- |
| with_pos | Default. Adds start / end if not already set. Used by Parser::wrap_pos and wrap_pos_to. Idempotent so the outer pratt-loop wrap doesn't trample inner spans. |
| replace_pos | Override existing start / end. Used by the bare-paren grammar alts ((* REPLACE(...))) where the inner wrap captured only the inner content but cpp's grammar ctx includes the outer parens. |
| no_pos | Pre-insert start: null, end: null so the outer wrap leaves the node bare. Used for nodes cpp explicitly doesn't position (NamedArgument, ColumnExprNamedArg). |
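
The interplay between the three helpers is easier to see on plain dicts. The model below is a Python sketch of the semantics described above, not a port of emit.rs: positions are plain integers here, and "already set" is assumed to mean "the key is present", which is what lets a null placeholder block a later wrap.

```python
def with_pos(node, start, end):
    """Add start / end only where the key is absent (idempotent)."""
    node.setdefault("start", start)
    node.setdefault("end", end)
    return node


def replace_pos(node, start, end):
    """Overwrite whatever start / end the node already carries."""
    node["start"] = start
    node["end"] = end
    return node


def no_pos(node):
    """Reserve null start / end so a later with_pos leaves them alone."""
    node["start"] = None
    node["end"] = None
    return node
```

With these definitions, with_pos(with_pos({}, 2, 5), 0, 9) keeps the inner span 2..5, replace_pos widens it on demand, and with_pos(no_pos({}), 0, 9) leaves both keys null.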

Two more things to keep in mind:

  • Offsets are character indices, not byte indices. cpp's getStartIndex() is char-based; rust's source slices are byte-based. Parser::pos_obj converts via byte_to_char_index for non-ASCII sources, short-circuits for ASCII. If you bypass pos_obj (e.g. hand-building a position object for a node-builder you control), you have to do the conversion yourself.

  • Column is character-position-in-line, not byte-position. Same reason. The ASCII fast path in pos_obj handles this for free; the slow path counts chars between line-start and offset.
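
The byte-versus-character distinction is easy to get wrong, so here is a small Python illustration of the conversion. It is a sketch and not Parser::pos_obj: the function names mirror the ones mentioned above for readability only, lines are assumed to be 1-based and columns 0-based (the ANTLR convention), and byte offsets are assumed to fall on UTF-8 code-point boundaries.

```python
def byte_to_char_index(src: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a character index."""
    if src.isascii():
        return byte_offset  # fast path: one byte per character
    return len(src.encode("utf-8")[:byte_offset].decode("utf-8"))


def line_col(src: str, byte_offset: int):
    """Return (line, column), both counted in characters."""
    char_index = byte_to_char_index(src, byte_offset)
    line_start = src.rfind("\n", 0, char_index) + 1
    line = src.count("\n", 0, char_index) + 1
    return line, char_index - line_start
```

For the source "é = 1" the equals sign sits at byte offset 3 but character index 2, because é occupies two bytes in UTF-8.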

Known long-tail divergences

The PBT for expr and select exposes adversarial grammar surface that the production corpora never see: deep nested BETWEEN low AND high chains with embedded aliases and ternaries, extreme WITHIN GROUP (ORDER BY …) shapes, multi-token-AND-merged operands. These take focused per-shape investigation; the PR description has the current numbers.

The production corpora (log_corpus_diagnostic, hog_corpus_diagnostic) stay above 90%, so anything the PBT surfaces that doesn't appear there is technically grammar-parity work but not user-visible.

Selecting from Python

from posthog.hogql.parser import parse_expr, parse_select, parse_program

ast = parse_expr("1 + event.properties.$browser", backend="rust-json")

Backends live in posthog/hogql/constants.HogQLParserBackend:

| Backend | Use case |
| --- | --- |
| cpp-json | Production default. ANTLR-based, oracle for everything below. |
| rust-json | This crate. ~15× / ~50× faster, behaviour identical (modulo the long tail). |
| python | Pure-Python ANTLR fallback. Slower; useful for debugging visitor changes. |
| cpp-with-rust-shadow | Default backend in TEST mode. Parses with cpp, shadow-parses with rust, raises on mismatch (TEST) / logs at 1% sample (prod). |
