CSS selector syntax for python minidom and compatible DOM implementations
CSS selector syntax for python minidom and DOM implementations.
Provided an HTML file sample.html the following code will query some elements and return them as minidom Elements. In case of multiple elements (select_all) a simple python list is returned (instead of a minidom NodeList).
from xml.dom.minidom import parse from dom_query import select, select_all tree = parse("test/html/sample.html") # Title element title = select(tree, "title") # Every P element paragraphs = select_all(tree, "p") # Element with type P and ID equal to "summary" summary = select(tree, "p#summary") # Every element with class "wide" wide_elements = select_all(tree, ".wide")
Supported CSS syntax
Only a subset of CSS syntax is supported:
Compound selectors (comma separator),
element type and id,
attributes match (presence and all the other operators),
combinators (descendant, sibling, subsequent, child).
Some supported selectors:
p#abstract[lang|=en] p[data-user="john"] div > p + p, article > p + p script[type="text/data"] header > li ul, footer > li ul section h1 ~ p, article h2 ~ p
Internals and implementation
Every query is compiled and cached sor subsequent use.
The first stage is tokenization (lexer.py lexer) which is loosely based on the W3C selector lexer. The differences are mainly to make the tokenizer compatible with regular expressions and to strip every unnecessary feautures.
Then follows the parsing stage (parser.py parse) which produce a simple AST from the tokens. The parser is, just like the tokenizer, a simplified version of the standard one. It is a single function which implements a descent parser. The AST is a tuple of tuples and maps in a relatively close way the given query.
The last stage is the compiler (compiler.py compile). It translates the AST into a sequence of simple actions to be performed in order to select the matching elements. Once compiled it is saved in cache and will be reused whenever the same query is seen again.
The opcodes are executed by (vm.py execute). This function takes a starting element, a sequence of opcodes, and an api. The api is dict-like object. Every key corresponds to a function which implements an opcode. The default api is minidom_api.py api.
Every function in the api is either a filter (actual filtering of nodes) or a generator (combinators expansion). The only two opcodes which don’t follows this rule are YIELD (return elements found so far) and RESET (reload the original element node after a CSS comma).
In case of other dom implementations it should be sufficient to write a new api and pass it to execute (or select*) upon querying.
Code quality and stability
The code is far from complete. It is tested but there are minor issues (attribute match doesn’t follow the specs verbatim).
Feel free to contribute.
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Hashes for dom_query-0.0.1-py3-none-any.whl