A Python parser for Blacklab Corpus Query Language

Installation

Not on PyPI yet, so clone from GitHub first:

git clone https://github.com/BramVanroy/bcql_py.git --depth 1
cd bcql_py
uv sync

Notes

ANTLR to generate the needed tools

Blacklab uses ANTLR to generate the parser/lexer in Java based on a g4 file. We could similarly generate Python files. However, after trying it out, I find the files obfuscated and unclear and I'm not fond of requiring an extra external library. That is not a slight to ANTLR; I am simply not familiar with the tool - I am sure it is incredibly powerful and useful if you know how to use it. To keep a clearer view of this library I therefore strive to make a Python-native implementation that is true to spec. It's also just a fun project that I do not wish to "automate away" (though I might regret that later). At a later time (TODO) I might implement functionality to cross-validate our implementation with the generated ANTLR parser and lexer. For now I will be satisfied with high coverage testing. In case of doubt I have followed the Bcql.g4 file.

If you'd like to try the ANTLR route yourself, you can try it as follows:

  1. Install the requirements (not included in our pyproject.toml file, so you'll need to install these yourself!)
uv pip install requests antlr4-tools antlr4-python3-runtime
  2. Download the Blacklab G4 definition from GitHub. You can optionally specify a --branch or --tag; the default is --branch dev.
uv run python scripts/get_bcql_g4.py
# Saved to parser/Bcql.g4
cd parser/
  3. Run ANTLR (you can update -v to the latest version if needed)
antlr4 -v 4.13.2 -Dlanguage=Python3 Bcql.g4

Design choices

Building a lexer is somewhat care-free with the exception of deciding which boundaries to use. As an example, I chose to tokenize the regex positive lookbehind (?<= as a single Token but I could have chosen to go deeper and re-use regular parens (, followed by a question mark (also used as quantifier) ?, followed by the single-token <=, also used as a mathematical operator. Such changes would make the "vocabulary" smaller but apart from that I did not see much benefit - though I am sure that there are more arguments to make both for and against a minimalist approach.
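To make the trade-off concrete, here is a minimal sketch of both tokenization strategies. It is not the lexer used in this library: the token names, the COARSE_SPEC/FINE_SPEC tables and the tokenize helper are hypothetical and only illustrate how one extra rule, listed before the shorter ones, turns (?<= into a single token.

```python
import re
from typing import List, NamedTuple, Sequence, Tuple


class Token(NamedTuple):
    kind: str
    text: str


# Hypothetical token names. Order matters: the first alternative that matches
# wins, so "(?<=" must be listed before "(", "?" and "<=".
COARSE_SPEC: List[Tuple[str, str]] = [
    ("LOOKBEHIND", r"\(\?<="),
    ("LTE", r"<="),
    ("LPAREN", r"\("),
    ("RPAREN", r"\)"),
    ("QMARK", r"\?"),
    ("OTHER", r"[^()?<=]+|[<=]"),
]

# The "minimalist" vocabulary: the same rules without the dedicated lookbehind token.
FINE_SPEC = COARSE_SPEC[1:]


def tokenize(text: str, spec: Sequence[Tuple[str, str]]) -> List[Token]:
    """Split text into tokens using the given (name, regex) rules."""
    pattern = re.compile("|".join(f"(?P<{name}>{regex})" for name, regex in spec))
    tokens: List[Token] = []
    pos = 0
    while pos < len(text):
        match = pattern.match(text, pos)
        if match is None:
            raise ValueError(f"Unexpected character at position {pos}: {text[pos]!r}")
        tokens.append(Token(match.lastgroup, match.group()))
        pos = match.end()
    return tokens


coarse = tokenize("(?<=ab)c", COARSE_SPEC)
fine = tokenize("(?<=ab)c", FINE_SPEC)
print([t.kind for t in coarse])
print([t.kind for t in fine])
```

With the coarse vocabulary the parser sees one LOOKBEHIND token; with the fine one it has to recognize the LPAREN, QMARK, LTE sequence itself, which moves work from the lexer to the parser.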

The parser, however, is a different beast entirely.

Pydantic models

Model rebuilding

In many of the models in models/*.py you will see that we have to call model_rebuild after having defined the discriminated *ConstraintExpr union. This union is needed for typing: some of the constraint nodes have operands that can be any constraint node. Pydantic needs to know about the union after all the individual classes have been defined, so we call model_rebuild on all of them at the end of the file.

If we don't do this, we'll get a Pydantic error about the forward reference not being resolved when we try to create a NotConstraint or BoolConstraint.
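The pattern can be sketched as follows, assuming Pydantic v2. This is an illustration rather than the code in models/*.py: the field names, the node_type discriminator and the TokenConstraint class are hypothetical, and only the NotConstraint and BoolConstraint class names and the ConstraintExpr union name are taken from the text above.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field


class TokenConstraint(BaseModel):
    # Hypothetical leaf node, e.g. word="cat"
    node_type: Literal["token"] = "token"
    annotation: str
    value: str


class NotConstraint(BaseModel):
    node_type: Literal["not"] = "not"
    # Forward reference: ConstraintExpr does not exist yet at this point
    operand: "ConstraintExpr"


class BoolConstraint(BaseModel):
    node_type: Literal["bool"] = "bool"
    operator: Literal["&", "|"]
    left: "ConstraintExpr"
    right: "ConstraintExpr"


# The union can only be defined once all member classes exist
ConstraintExpr = Annotated[
    Union[TokenConstraint, NotConstraint, BoolConstraint],
    Field(discriminator="node_type"),
]

# Resolve the "ConstraintExpr" forward references now that the union is defined
NotConstraint.model_rebuild()
BoolConstraint.model_rebuild()

expr = NotConstraint(
    operand=BoolConstraint(
        operator="&",
        left=TokenConstraint(annotation="word", value="cat"),
        right=TokenConstraint(annotation="pos", value="NOUN"),
    )
)
# Round-trip through a plain dict; the discriminator selects the right class
restored = NotConstraint.model_validate(expr.model_dump())
```

The discriminator field lets Pydantic pick the matching class directly when validating nested data instead of trying every member of the union in turn.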

Acknowledgements

TODO

  • Output AST as EBNF grammar?
  • Routinely validate our implementation with official Blacklab .g4 grammar?
