Hand-rolled Rust HogQL parser with C++-parity AST output

hogql_parser_rs

Hand-rolled Rust HogQL parser. Pratt + recursive descent. Same JSON AST shape as the C++ ANTLR parser so the two can be cross-validated query-by-query. Ships as a Python extension via maturin and is selected as the rust-json backend in posthog/hogql/parser.py.

About 15× faster than the C++ parser on parse_expr and 50–55× on parse_select, against the same input, on the same machine. The numbers come from posthog/hogql/scripts/parser_bench.py; re-run locally before and after any non-trivial change.

The C++ parser is the source of truth

When grammar, AST shape, or any visible behaviour disagrees between the two, the C++ ANTLR parser is right and this one is wrong. The C++ parser is generated from posthog/hogql/grammar/HogQLLexer.*.g4 + HogQLParser.g4 via ANTLR4. The Rust parser does not consume those grammar files; it hand-implements the same recognition behaviour.

This means any grammar change is a two-step change:

  1. Update the ANTLR grammar and rebuild the C++ parser. Get the new shape working end-to-end on cpp-json. Pin the new behaviour with regression tests (see "Tools" below).

  2. Bring the Rust parser to parity. Run the diagnostics, find the new divergences, fix them. This is the part an LLM agent can drive in a long-running loop.

Skipping step 1 produces a Rust parser that "works", but on a shape the C++ parser rejects. Cloud's printer / planner will then reject it too, because they are built on top of cpp-json's output. Get the oracle right first, then the candidate.

What's in this crate

| Path | What it does |
| --- | --- |
| src/lib.rs | PyO3 entry points (parse_expr_json, parse_select_json, parse_program_json, parse_order_expr_json, parse_full_template_string_json). Each returns a JSON string; on error the JSON is an {"error": true, ...} envelope posthog/hogql/json_ast.py decodes into HogQLSyntaxError / ExposedHogQLError. |
| src/lex.rs | Lexer. Hand-rolled state machine matching the ANTLR-generated C++ lexer's tokens + mode stack (default / template-string / HogQLX-tag / HogQLX-text). When you add a new keyword to the grammar, add it here too. |
| src/parse.rs | Parser core: Parser struct, public entry points, the Pratt expression parser (parse_expr_bp), positions (pos_obj, wrap_pos, wrap_pos_to), char-offset / line-col tables, checkpoint / restore for speculative branches. |
| src/parse/{expr,select,program,join,cte,hogqlx,template}.rs | Per-rule parsing. Most grammar changes land in one of these. |
| src/parse/bp.rs | Binding-power table + build_infix / merge_and_or / merge_concat. The precedence ladder lives here; new operators usually need an entry in infix_bp and a build_infix arm. |
| src/emit.rs | AST-node builders + position helpers (with_pos is idempotent, replace_pos overrides, no_pos reserves null keys to opt out of the wrap). When you add a new AST node, add a helper here so callers don't hand-build the JSON object. |
| src/error.rs | ParseError + the JSON error envelope. |
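
To make the binding-power idea concrete, here is a minimal Pratt-loop sketch in Python. It is not the crate's code: INFIX_BP, tokenize, and the tuple-shaped nodes are hypothetical stand-ins for infix_bp, the lexer, and build_infix, and the operator set is deliberately tiny.

```python
import re

# Hypothetical binding powers (left, right); a higher number binds tighter.
# right > left makes every operator here left-associative.
INFIX_BP = {
    "or": (1, 2),
    "and": (3, 4),
    "+": (5, 6),
    "-": (5, 6),
    "*": (7, 8),
    "/": (7, 8),
}

TOKEN_RE = re.compile(r"\s*(\d+|\w+|[-+*/()])")


def tokenize(src):
    """Split the source into number, word, operator, and paren tokens."""
    src = src.rstrip()
    tokens, pos = [], 0
    while pos < len(src):
        m = TOKEN_RE.match(src, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def parse_expr_bp(tokens, min_bp=0):
    """Parse one expression whose operators all bind at least as tightly as min_bp."""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    tok = tokens.pop(0)
    if tok == "(":
        lhs = parse_expr_bp(tokens, 0)
        if not tokens or tokens.pop(0) != ")":
            raise SyntaxError("expected ')'")
    elif tok == ")" or tok in INFIX_BP:
        raise SyntaxError(f"unexpected token {tok!r}")
    else:
        lhs = int(tok) if tok.isdigit() else tok

    # The Pratt loop: keep folding infix operators into lhs while they bind
    # tightly enough for the caller.
    while tokens and tokens[0] in INFIX_BP:
        op = tokens[0]
        left_bp, right_bp = INFIX_BP[op]
        if left_bp < min_bp:
            break
        tokens.pop(0)
        rhs = parse_expr_bp(tokens, right_bp)
        lhs = (op, lhs, rhs)  # stand-in for a build_infix arm
    return lhs
```

In this sketch, adding an operator means adding one table entry; the loop itself stays untouched, which is the property the binding-power table is meant to preserve.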

Building locally

# One-time: install the rust toolchain via flox / rustup (the workspace
# Cargo.toml is at `rust/Cargo.toml`).

# Build + install the wheel into the venv (editable). Re-run after each
# rust source change.
uv pip install -e rust/hogql/parser

# Or via maturin directly (faster incremental):
maturin develop --release --manifest-path rust/hogql/parser/Cargo.toml

# Sanity check
python -c "import hogql_parser_rs; print(hogql_parser_rs.parse_expr_json('1 + 2'))"
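
The sanity check prints a raw JSON string. A minimal sketch of decoding such a string and surfacing the error envelope could look like this. Only the "error": true marker comes from this README; the "message" field name, ParseFailure, and decode_ast_json are assumptions for illustration, and the real decoding lives in posthog/hogql/json_ast.py.

```python
import json


class ParseFailure(Exception):
    """Stand-in for HogQLSyntaxError / ExposedHogQLError in this sketch."""


def decode_ast_json(payload):
    """Decode a parser result string, raising if it is an error envelope."""
    node = json.loads(payload)
    if isinstance(node, dict) and node.get("error") is True:
        # "message" is an assumed field name; only "error": true is documented.
        raise ParseFailure(node.get("message", "parse failed"))
    return node
```

With the wheel installed, the input would be the return value of hogql_parser_rs.parse_expr_json('1 + 2').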

maturin builds a single cp312-abi3 wheel that works on Python 3.12+. CI builds wheels for Linux x86_64/aarch64 (manylinux 2_28 + musllinux 1_2) and macOS arm64/x86_64; see .github/workflows/build-hogql-parser-rs.yml.

Publishing

The crate is pinned via the hogql-parser-rs==X.Y.Z line in the repo-root pyproject.toml. Bump the version in both:

(They must match. The PR check at .github/workflows/build-hogql-parser-rs.yml enforces this.)

Version is intentionally locked in step with common/hogql_parser (the C++ parser PyPI package) so a bump signals "both parsers move together." The publish workflow builds wheels, pushes to PyPI via trusted publishing, then opens a follow-up PR that updates the repo-root pin.

Adding a new grammar feature

The big-picture loop:

  1. Update HogQLLexer.*.g4 and HogQLParser.g4. Run pnpm grammar:build to regenerate the Python and C++ ANTLR artefacts:

    pnpm grammar:build
    

    That step requires the antlr 4.13.2 binary on PATH; instructions in posthog/hogql/grammar/README.md. The script rewrites common/hogql_parser/HogQL{Lexer,Parser}.{cpp,h,interp,tokens} and the matching Python files. Both backends now recognise the new shape.

  2. Pick the AST emission. Decide what JSON the cpp visitor should return for the new shape. Either reuse an existing AST node or add a new one in posthog/hogql/ast.py. The Python AST is shared between backends, so any new node has to land there first, otherwise posthog/hogql/json_ast.py::deserialize_ast will crash on it.

  3. Update the cpp visitor. Add the VISIT(YourNewRule) arm in common/hogql_parser/parser_json.cpp. Mirror cpp's conventions: call addPositionInfo(json, ctx) per rule unless you specifically want a position-less node (see "Position parity" below). Rebuild the cpp wheel (pip install ./common/hogql_parser).

  4. Pin the new behaviour. Add a regression test (and a rust-rejects-it negative test if the grammar tightens) in posthog/hogql/test/test_parser_regressions.py. Run on cpp-json only (you haven't done the Rust work yet); the test should pass on cpp and fail on rust. That failure is the starting state for step 5.

  5. Bring the Rust parser to parity. Add lexer keywords (if any) in src/lex.rs, then the parser shape in the matching src/parse/*.rs file. Match cpp's per-node visit behaviour: every addPositionInfo(json, ctx) on the cpp side needs a self.wrap_pos(value, start) or self.wrap_pos_to(value, start, end) on this side. Add an emit::* helper if you're building a new node shape, so callers stay declarative.

  6. Run the diagnostics. PBT, corpus checks, regression suite, perf bench. Anything below the previous baseline goes back into the loop.

Step 5 is where an LLM agent in a long-running loop (ralph loop, autoresearch, Claude Code with a wakeup schedule) does well. The diagnostics produce concrete diffs the agent can attack one at a time.

Tools for parity work

Every script below has the same --oracle / --candidate flag pair and defaults to cpp-json vs rust-json. The diagnostics include per-node start / end positions in the comparison by default; set CLEAR_LOCATIONS=1 to strip positions when you want a structural-only read.
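
As a rough sketch of what a structural-only comparison means, the helper below strips position keys from two decoded ASTs before comparing them. The "start" / "end" key names follow the description above; strip_positions, asts_match, and the shape of the position values are assumptions, not the scripts' actual code.

```python
import os

POSITION_KEYS = ("start", "end")


def strip_positions(node):
    """Return a copy of a decoded AST with every start / end key removed."""
    if isinstance(node, dict):
        return {k: strip_positions(v) for k, v in node.items() if k not in POSITION_KEYS}
    if isinstance(node, list):
        return [strip_positions(v) for v in node]
    return node


def asts_match(oracle_ast, candidate_ast, clear_locations=None):
    """Compare two ASTs, position-aware unless CLEAR_LOCATIONS=1 (or the flag) says otherwise."""
    if clear_locations is None:
        clear_locations = os.environ.get("CLEAR_LOCATIONS") == "1"
    if clear_locations:
        oracle_ast = strip_positions(oracle_ast)
        candidate_ast = strip_positions(candidate_ast)
    return oracle_ast == candidate_ast
```

Two ASTs that differ only in an end offset match with clear_locations=True and diverge with the default, position-aware comparison.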

Regression tests in posthog/hogql/test/

hogli test posthog/hogql/test/test_parser_regressions.py
hogli test posthog/hogql/test/test_parser_rust_json.py

test_parser_regressions.py pins every cpp-vs-rust divergence that has been found and fixed; one parameterised assertion runs on all three backends (cpp-json, rust-json, python). When you add a new grammar shape, add a regression here too.

test_parser_rust_json.py runs the shared _test_parser.py suite against rust-json. Catches behaviour regressions the regression file doesn't pin.

Property-based testing via posthog/hogql/scripts/pbt_diagnostic.py

PYTHONPATH=. python posthog/hogql/scripts/pbt_diagnostic.py \
    --n 5000 --rule program

# Per rule:
--rule expr     # standalone column expressions
--rule select   # SELECT / SELECT-set statements
--rule program  # full Hog programs (declarations + statements + exprs)

Generates ~5,000 random grammar-surface examples per rule, parses each with the oracle and the candidate, buckets divergences by AST shape, and prints shrunk reproducers. Use --shrink-failures to auto-reduce each divergence to a minimal example.
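
One plausible way to bucket divergences is by the first path at which the two ASTs differ. The sketch below does that on plain decoded JSON; first_diff_path, bucket_key, and bucket_divergences are hypothetical names, and the real script may key its buckets differently.

```python
import re
from collections import Counter


def first_diff_path(a, b, path="$"):
    """Return the first path where two decoded ASTs differ, or None if equal."""
    if type(a) is not type(b):
        return path
    if isinstance(a, dict):
        for key in sorted(set(a) | set(b)):
            if key not in a or key not in b:
                return f"{path}.{key}"
            found = first_diff_path(a[key], b[key], f"{path}.{key}")
            if found:
                return found
        return None
    if isinstance(a, list):
        if len(a) != len(b):
            return f"{path}[len]"
        for i, (x, y) in enumerate(zip(a, b)):
            found = first_diff_path(x, y, f"{path}[{i}]")
            if found:
                return found
        return None
    return None if a == b else path


def bucket_key(path):
    """Collapse list indices so args[0] and args[7] land in the same bucket."""
    return re.sub(r"\[\d+\]", "[]", path)


def bucket_divergences(pairs):
    """Count divergences per bucket for an iterable of (oracle_ast, candidate_ast)."""
    counts = Counter()
    for oracle_ast, candidate_ast in pairs:
        path = first_diff_path(oracle_ast, candidate_ast)
        if path is not None:
            counts[bucket_key(path)] += 1
    return counts
```

Calling bucket_divergences(pairs).most_common(1) then gives the largest bucket, which is usually the one worth attacking first.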

Real-query corpora via log_corpus_diagnostic.py / hog_corpus_diagnostic.py

# SELECT queries from the last 7 days of production traffic
# (redacted, AI-data-processing-approved teams only):
PYTHONPATH=. python posthog/hogql/scripts/log_corpus_diagnostic.py

# Hog programs from production (transformations, destinations, …):
PYTHONPATH=. python posthog/hogql/scripts/hog_corpus_diagnostic.py

Both auto-download via hogli metabase:query and cache locally under posthog/hogql/scripts/.local/. Pass --skip-download to reuse the existing dump while iterating. Failures are written one block per divergence to a .sql / .hog file the agent can chew through.

Perf bench via posthog/hogql/scripts/parser_bench.py

CANDIDATE_BACKEND=rust-json PYTHONPATH=. \
    python posthog/hogql/scripts/parser_bench.py

Runs both parsers against a fixed corpus of representative queries (small / medium / nested / pathological) and prints an oracle / candidate ratio per row. Re-run before and after any non-trivial change. If the parse_select mean ratio drops noticeably (the parse_select speedup is the headline number), find out why before landing.
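
The oracle / candidate ratio is just one mean wall-clock time divided by another. Here is a minimal sketch of that arithmetic, with a regression check that borrows the 5% tolerance from the agent-loop stop criteria further down. mean_seconds, speedup, and regressed are illustrative names and do not reflect how parser_bench.py is actually structured.

```python
import statistics
import time


def mean_seconds(parse, queries, repeats=5):
    """Mean wall-clock time of parsing the whole corpus, over several repeats."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for query in queries:
            parse(query)
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)


def speedup(oracle_parse, candidate_parse, queries, repeats=5):
    """Oracle / candidate ratio: values above 1 mean the candidate is faster."""
    oracle = mean_seconds(oracle_parse, queries, repeats)
    candidate = mean_seconds(candidate_parse, queries, repeats)
    return oracle / candidate


def regressed(baseline_ratio, current_ratio, tolerance=0.05):
    """True when the ratio fell more than `tolerance` below its baseline."""
    return current_ratio < baseline_ratio * (1 - tolerance)
```

Recording the ratio before a change and feeding both values to regressed gives a yes / no answer instead of an eyeballed table.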

Shadow compare in TEST via cpp-with-rust-shadow

In TEST mode the default backend is cpp-with-rust-shadow: both backends parse, ASTs are compared, mismatches raise so the failing test points right at the offending query. In production this same mode runs at a 1% sample and only logs. Useful when a regression slips past the PBT but shows up in the suite.

from posthog.hogql.constants import HogQLParserBackend
from posthog.hogql.parser import parse_expr
# src is any HogQL expression string, e.g. "1 + 2"
parse_expr(src, backend=HogQLParserBackend.CPP_WITH_RUST_SHADOW)
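
The control flow of shadow mode can be sketched independently of the real backends. In the sketch below, oracle and candidate are any two callables, and parse_with_shadow, ShadowMismatch, and the injectable rng are assumptions made for illustration; the raise-in-TEST / log-at-1%-in-prod split follows the description above.

```python
import logging
import random

logger = logging.getLogger("parser_shadow")


class ShadowMismatch(AssertionError):
    """Raised in test mode when the candidate AST differs from the oracle AST."""


def parse_with_shadow(src, oracle, candidate, *, test_mode, sample_rate=0.01, rng=random.random):
    """Return the oracle's result, shadow-parsing with the candidate when selected."""
    result = oracle(src)
    # In production only a sample of calls pays for the second parse.
    if not test_mode and rng() >= sample_rate:
        return result
    try:
        shadow = candidate(src)
    except Exception:
        if test_mode:
            raise
        logger.warning("shadow parser raised for %r", src, exc_info=True)
        return result
    if shadow != result:
        if test_mode:
            raise ShadowMismatch(f"shadow parse diverged for {src!r}")
        logger.warning("shadow parse diverged for %r", src)
    return result
```

The caller always receives the oracle's AST, so a candidate bug can fail a test but cannot change production behaviour.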

Example loop for an LLM agent

A long-running loop driving a single grammar-parity task looks roughly like this. Tailor for your runtime (ralph loop, autoresearch, Claude Code wakeup, etc.); the steps stay the same.

PROMPT:
  You are bringing the Rust parser to parity with the C++ parser for
  the new grammar feature `<feature description>`. The C++ parser is
  the source of truth. Each iteration:

    1. Run the PBT for the rule the feature touches:
       posthog/hogql/scripts/pbt_diagnostic.py --n 500 --rule <rule> \
         --shrink-failures \
         --write-divergences /tmp/divs.jsonl

    2. Read /tmp/divs.jsonl. Bucket divergences by the failing AST
       node type. Pick the bucket with the most members.

    3. Read 2–3 shrunk reproducers from that bucket. Look at the
       cpp output and the rust output side by side.

    4. Fix the rust parser. Prefer changes that generalise to deeper
       and more nested queries. A fix that only handles `a.b` but
       not `a.b.c` is going to lose ground on the next iteration. If
       a fix needs a special case at depth 0 only, that's a smell.

    5. Re-run the PBT. If the failing-bucket count went down without
       introducing new buckets, keep the change.

    6. Re-run the regression suite and the perf bench. Both must stay
       green / on-baseline before continuing.

    7. Commit. The commit message should call out which divergence
       class the fix targets so future work can trace the history.

  Stop when:
    - The PBT bucket is empty
    - The hog_corpus_diagnostic and log_corpus_diagnostic both stay
      at >= 90% match (run weekly)
    - The perf bench `parse_select` mean is within 5% of its
      pre-change baseline

A few rules of thumb the loop should follow that aren't always obvious from the diagnostics alone:

  • Prefer the generalising fix. When two implementations both pass the failing cases, pick the one that doesn't depend on the input shape. A wrap_pos call at a single emit site beats a depth-aware conditional. A change to the binding-power table beats an ad-hoc check in the consumer.

  • Position bugs hide behind structural bugs. Always run the PBT with positions on (the default); CLEAR_LOCATIONS=1 is for diagnosing structural regressions only. A 99% structural match can mask a 50% position-aware match.

  • Look at the cpp visitor before guessing. Every per-node position decision in this parser has a cpp counterpart in common/hogql_parser/parser_json.cpp. If the cpp visitor calls addPositionInfo(json, ctx) you need a wrap on the rust side; if it doesn't, you need emit::no_pos (or the helper for that node already does it).

  • Watch the perf bench. Position emission isn't free. Cache O(N) computations on Parser rather than recomputing per emit; the is_ascii_src field is the canonical example.

  • Don't fix one rule at a time at the expense of others. A one-line wrap in parse/expr.rs can move three PBT rules at once. Run all three PBTs after each change, not just the one you started with.

Position parity (the non-obvious part)

The C++ visitor decides per-node whether to emit positions via addPositionInfo(json, ctx). Some nodes are deliberately position-less (NamedArgument, ColumnsExpr in qualified-asterisk column slots, etc.) so the rust parser has to match that exactly.

Three position helpers in emit.rs cover the three cases:

| Helper | When to use |
| --- | --- |
| with_pos | Default. Adds start / end if not already set. Used by Parser::wrap_pos and wrap_pos_to. Idempotent so the outer pratt-loop wrap doesn't trample inner spans. |
| replace_pos | Override existing start / end. Used by the bare-paren grammar alts ((* REPLACE(...))) where the inner wrap captured only the inner content but cpp's grammar ctx includes the outer parens. |
| no_pos | Pre-insert start: null, end: null so the outer wrap leaves the node bare. Used for nodes cpp explicitly doesn't position (NamedArgument, ColumnExprNamedArg). |
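
The semantics of the three helpers fit in a few lines. This Python sketch mirrors the behaviour described in the table on plain dicts; it is not the Rust implementation, and using bare integers as positions is a simplification (the real position objects carry more than an offset).

```python
def with_pos(node, start, end):
    """Add start / end only if the keys are absent, so a second wrap is a no-op."""
    node.setdefault("start", start)
    node.setdefault("end", end)
    return node


def replace_pos(node, start, end):
    """Overwrite whatever span the node already carries."""
    node["start"] = start
    node["end"] = end
    return node


def no_pos(node):
    """Reserve null keys so a later with_pos leaves the node without a span."""
    node["start"] = None
    node["end"] = None
    return node
```

Because with_pos checks for key presence rather than for a non-null value, the null keys reserved by no_pos are enough to make the outer wrap skip the node.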

Two more things to keep in mind:

  • Offsets are character indices, not byte indices. cpp's getStartIndex() is char-based; rust's source slices are byte-based. Parser::pos_obj converts via byte_to_char_index for non-ASCII sources, short-circuits for ASCII. If you bypass pos_obj (e.g. hand-building a position object for a node-builder you control), you have to do the conversion yourself.

  • Column is character-position-in-line, not byte-position. Same reason. The ASCII fast path in pos_obj handles this for free; the slow path counts chars between line-start and offset.
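
The byte-versus-character distinction is easy to demonstrate. The sketch below shows the conversion on a UTF-8 source, with an ASCII short-circuit in the spirit of is_ascii_src; the function bodies are an illustration in Python rather than the crate's Rust code, and the 0-based column is an assumption.

```python
def byte_to_char_index(src, byte_offset, is_ascii=None):
    """Convert a UTF-8 byte offset into a character index."""
    if is_ascii is None:
        is_ascii = src.isascii()
    if is_ascii:
        # Fast path: in pure ASCII, byte and character offsets coincide.
        return byte_offset
    # Slow path: count the characters encoded by the prefix. This raises
    # UnicodeDecodeError if byte_offset falls inside a multi-byte character.
    return len(src.encode("utf-8")[:byte_offset].decode("utf-8"))


def char_column(src, char_offset):
    """Character position within the line (0-based here, as an assumption)."""
    line_start = src.rfind("\n", 0, char_offset) + 1
    return char_offset - line_start
```

For the source "héllo + 1", the "+" sits at byte offset 7 but character index 6, which is exactly the off-by-one a hand-built position object would get wrong.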

Known long-tail divergences

The PBT for expr and select exposes adversarial grammar surface that the production corpora never see: deep nested BETWEEN low AND high chains with embedded aliases and ternaries, extreme WITHIN GROUP (ORDER BY …) shapes, multi-token-AND-merged operands. These take focused per-shape investigation; the PR description has the current numbers.

The production corpora (log_corpus_diagnostic, hog_corpus_diagnostic) stay above a 90% match rate, so anything the PBT surfaces that doesn't appear there is technically grammar-parity work but not user-visible.

Selecting from Python

from posthog.hogql.parser import parse_expr, parse_select, parse_program

ast = parse_expr("1 + event.properties.$browser", backend="rust-json")

Backends live in posthog/hogql/constants.HogQLParserBackend:

| Backend | Use case |
| --- | --- |
| cpp-json | Production default. ANTLR-based, oracle for everything below. |
| rust-json | This crate. ~15× / ~50× faster, behaviour identical (modulo the long tail). |
| python | Pure-Python ANTLR fallback. Slower; useful for debugging visitor changes. |
| cpp-with-rust-shadow | Default in TEST mode. Parses with cpp, shadow-parses with rust, raises on mismatch (TEST) / logs at 1% sample (prod). |

