CSS selector syntax for python minidom and compatible DOM implementations

Short example

Given an HTML file sample.html (which must be well-formed XML, since minidom uses an XML parser), the following code queries some elements and returns them as minidom Elements. When multiple elements are requested (select_all), a plain Python list is returned instead of a minidom NodeList.

from xml.dom.minidom import parse
from dom_query import select, select_all

tree = parse("test/html/sample.html")

# Title element
title = select(tree, "title")

# Every P element
paragraphs = select_all(tree, "p")

# Element with type P and ID equal to "summary"
summary = select(tree, "p#summary")

# Every element with class "wide"
wide_elements = select_all(tree, ".wide")

Supported CSS syntax

Only a subset of CSS syntax is supported:

  • selector lists (comma separator),
  • element type and ID,
  • class presence,
  • attribute matching (presence and all the other operators),
  • combinators (descendant, child, next-sibling, subsequent-sibling).

Some supported selectors:

p#abstract[lang|=en]
p[data-user="john"]
div > p + p, article > p + p
script[type="text/data"]
header > li ul, footer > li ul
section h1 ~ p, article h2 ~ p
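
For reference, the attribute operators used in selectors such as p#abstract[lang|=en] are defined by the CSS Selectors specification roughly as in the sketch below. This is not code from dom_query: attribute_matches is a hypothetical helper written only to illustrate the standard semantics.

```python
def attribute_matches(actual, operator, expected):
    """Return True if an attribute value satisfies a CSS attribute test.

    `actual` is the attribute value, or None when the attribute is absent.
    `operator` is None for a bare presence test such as [hidden].
    """
    if actual is None:
        return False
    if operator is None:
        return True  # [attr]: presence only
    if operator == "=":
        return actual == expected  # exact match
    if operator == "~=":
        # One word of a whitespace-separated list; an empty value or a
        # value containing whitespace never matches.
        if not expected or any(char.isspace() for char in expected):
            return False
        return expected in actual.split()
    if operator == "|=":
        # Exact match, or the value followed by a hyphen (e.g. "en-US").
        return actual == expected or actual.startswith(expected + "-")
    if operator not in ("^=", "$=", "*="):
        raise ValueError(f"unknown attribute operator: {operator!r}")
    if not expected:
        return False  # substring operators never match an empty value
    if operator == "^=":
        return actual.startswith(expected)
    if operator == "$=":
        return actual.endswith(expected)
    return expected in actual  # "*="
```

With this reading, [lang|=en] matches lang="en" and lang="en-US" but not lang="english".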

Internals and implementation

Every query is compiled and cached for subsequent use.
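
A minimal sketch of the compile-and-cache idea, assuming a plain memoization keyed on the query string. The name compile_query and its placeholder body are hypothetical; the actual cache in dom_query may be organized differently.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def compile_query(query):
    """Hypothetical stand-in for the lexer -> parser -> compiler pipeline."""
    # Placeholder "compilation": split the selector list on commas.
    return tuple(part.strip() for part in query.split(","))


first = compile_query("div > p, article > p")
second = compile_query("div > p, article > p")  # served from the cache
```

Because the result is looked up by query string, the second call returns the very same object without compiling again.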

Lexer

The first stage is tokenization (lexer.py lexer), which is loosely based on the W3C selector lexer. The differences mainly make the tokenizer compatible with regular expressions and strip unnecessary features.
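
To give an idea of what a regular-expression tokenizer for this subset can look like, here is a small sketch. The token names, the patterns and the tokenize function are assumptions made for illustration and are not copied from lexer.py.

```python
import re

# Order matters: "~=" must be tried before the "~" combinator,
# and "*=" before the universal selector "*".
TOKEN_SPEC = [
    ("STRING", r'"[^"]*"' + "|" + r"'[^']*'"),
    ("HASH", r"#[\w-]+"),
    ("CLASS", r"\.[\w-]+"),
    ("ATTR_OP", r"[~|^$*]?="),
    ("COMMA", r"\s*,\s*"),
    ("COMBINATOR", r"\s*[>+~]\s*"),
    ("LBRACKET", r"\["),
    ("RBRACKET", r"\]"),
    ("IDENT", r"[\w-]+"),
    ("STAR", r"\*"),
    ("SPACE", r"\s+"),  # whitespace alone is the descendant combinator
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(query):
    """Split a selector string into a list of (kind, text) tuples."""
    query = query.strip()
    tokens = []
    position = 0
    while position < len(query):
        match = MASTER.match(query, position)
        if match is None:
            raise ValueError(f"unexpected character {query[position]!r} at {position}")
        # Strip the padding around "," and combinators; keep " " for SPACE.
        text = match.group().strip() or " "
        tokens.append((match.lastgroup, text))
        position = match.end()
    return tokens
```

For example, tokenize("li ul") yields an IDENT, a SPACE and another IDENT token.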

Parser

The parsing stage follows (parser.py parse) and produces a simple AST from the tokens. Like the tokenizer, the parser is a simplified version of the standard one. It is a single function that implements a descent parser. The AST is a tuple of tuples that mirrors the structure of the given query fairly closely.
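
The following sketch shows one possible tuple-of-tuples AST for a reduced subset (type, ID, class, combinators and commas). Note that it uses a simple single-pass loop rather than a descent parser, and that the token kinds and the AST layout are assumptions, not the format produced by parser.py.

```python
def parse(tokens):
    """Build a tuple-based AST from (kind, text) tokens.

    The result is a tuple of selectors. Each selector alternates compounds
    and combinators; a compound is a tuple of (test, argument) pairs and a
    combinator is the pair ("combinator", symbol).
    """
    selectors = []
    steps = []
    compound = []

    def close_compound():
        if not compound:
            raise ValueError("expected a simple selector")
        steps.append(tuple(compound))
        compound.clear()

    for kind, text in tokens:
        if kind == "IDENT":
            compound.append(("type", text))
        elif kind == "HASH":
            compound.append(("id", text[1:]))  # drop the leading "#"
        elif kind == "CLASS":
            compound.append(("class", text[1:]))  # drop the leading "."
        elif kind in ("COMBINATOR", "SPACE"):
            close_compound()
            steps.append(("combinator", text))
        elif kind == "COMMA":
            close_compound()
            selectors.append(tuple(steps))
            steps.clear()
        else:
            raise ValueError(f"unsupported token: {kind}")
    close_compound()
    selectors.append(tuple(steps))
    return tuple(selectors)


# Tokens for the query "div > p#summary, .wide"
ast = parse([
    ("IDENT", "div"), ("COMBINATOR", ">"), ("IDENT", "p"), ("HASH", "#summary"),
    ("COMMA", ","), ("CLASS", ".wide"),
])
```

Here ast contains two selectors, one per comma-separated part of the query, and each keeps its compounds and combinators in source order.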

Compiler

The last stage is the compiler (compiler.py compile). It translates the AST into a sequence of simple actions (opcodes) to be performed in order to select the matching elements. The compiled sequence is saved in the cache and reused whenever the same query is seen again.
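
As an illustration of this step, the sketch below flattens a small tuple-based AST into a linear sequence of (opcode, argument) pairs. Only YIELD and RESET are opcode names mentioned in this document; every other opcode name, the AST layout and the function compile_ast are assumptions.

```python
# Hypothetical opcode names for combinators and simple tests.
COMBINATOR_OPCODES = {
    " ": "DESCENDANTS",
    ">": "CHILDREN",
    "+": "NEXT_SIBLING",
    "~": "SUBSEQUENT_SIBLINGS",
}
TEST_OPCODES = {"type": "MATCH_TYPE", "id": "MATCH_ID", "class": "MATCH_CLASS"}


def compile_ast(ast):
    """Translate a tuple-based selector AST into a flat opcode sequence."""
    opcodes = []
    for index, selector in enumerate(ast):
        if index > 0:
            # A comma starts over from the original element.
            opcodes.append(("RESET", None))
        # The first compound may match anywhere below the starting element.
        opcodes.append(("DESCENDANTS", None))
        for step in selector:
            if step[0] == "combinator":
                opcodes.append((COMBINATOR_OPCODES[step[1]], None))
            else:
                for test, argument in step:
                    opcodes.append((TEST_OPCODES[test], argument))
        opcodes.append(("YIELD", None))
    return tuple(opcodes)


# Assumed AST for the query "div > p, .wide"
ast = (
    ((("type", "div"),), ("combinator", ">"), (("type", "p"),)),
    ((("class", "wide"),),),
)
program = compile_ast(ast)
```

The resulting program is a plain tuple, so it is hashable and cheap to keep in a cache.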

Virtual machine

The opcodes are executed by the virtual machine (vm.py execute). This function takes a starting element, a sequence of opcodes, and an api. The api is a dict-like object in which every key corresponds to a function that implements an opcode. The default api is minidom_api.py api.
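
The loop below is a rough sketch of such a virtual machine running on minidom. It keeps the (element, opcodes, api) argument order described above, but the opcode names other than YIELD and RESET, the (nodes, argument) calling convention and the choice to handle YIELD and RESET inside the loop are assumptions, not the behaviour of vm.py.

```python
from xml.dom.minidom import parseString


def descendants(nodes, _argument):
    """Generator: expand every node into all of its descendant elements."""
    for node in nodes:
        for child in node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                yield child
                yield from descendants([child], None)


def children(nodes, _argument):
    """Generator: expand every node into its direct child elements."""
    for node in nodes:
        for child in node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                yield child


def match_type(nodes, name):
    """Filter: keep only the elements with the given tag name."""
    return (node for node in nodes if node.tagName == name)


api = {"DESCENDANTS": descendants, "CHILDREN": children, "MATCH_TYPE": match_type}


def execute(element, opcodes, api):
    """Run the opcodes from `element` and return the matching elements."""
    results = []
    current = [element]
    for name, argument in opcodes:
        if name == "YIELD":
            # Collect the matches, skipping elements already found.
            for node in current:
                if node not in results:
                    results.append(node)
        elif name == "RESET":
            current = [element]
        else:
            current = list(api[name](current, argument))
    return results


document = parseString(
    "<html><body><div><p>a</p><span><p>b</p></span></div><p>c</p></body></html>"
)
# Hand-written program for the query "div > p, span"
program = (
    ("DESCENDANTS", None), ("MATCH_TYPE", "div"),
    ("CHILDREN", None), ("MATCH_TYPE", "p"), ("YIELD", None),
    ("RESET", None),
    ("DESCENDANTS", None), ("MATCH_TYPE", "span"), ("YIELD", None),
)
found = execute(document, program, api)
```

In this run found holds the first paragraph (the only p that is a direct child of the div) followed by the span element.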

DOM API

Every function in the api is either a filter (actual filtering of nodes) or a generator (combinator expansion). The only two opcodes which don’t follow this rule are YIELD (return the elements found so far) and RESET (reload the original element node after a CSS comma).

For other DOM implementations it should be sufficient to write a new api and pass it to execute (or select*) when querying.
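
As a sketch of what such an api might contain, here are two generators and two filters written for xml.etree.ElementTree instead of minidom. The key names and the (nodes, argument) signature are assumptions; the keys and signatures that execute really expects have to be taken from minidom_api.py.

```python
import xml.etree.ElementTree as ET


def descendants(nodes, _argument):
    """Generator: every element below each node, in document order."""
    for node in nodes:
        iterator = node.iter()
        next(iterator)  # iter() starts with the node itself, skip it
        yield from iterator


def children(nodes, _argument):
    """Generator: the direct child elements of each node."""
    for node in nodes:
        yield from node


def match_type(nodes, name):
    """Filter: keep the elements with the given tag."""
    return (node for node in nodes if node.tag == name)


def match_class(nodes, name):
    """Filter: keep the elements whose class attribute lists `name`."""
    return (node for node in nodes if name in node.get("class", "").split())


# Hypothetical key names, one per opcode.
etree_api = {
    "DESCENDANTS": descendants,
    "CHILDREN": children,
    "MATCH_TYPE": match_type,
    "MATCH_CLASS": match_class,
}

root = ET.fromstring(
    '<div><p class="wide big">a</p><section><p>b</p></section></div>'
)
wide = list(match_class(descendants([root], None), "wide"))
```

ElementTree elements carry no parent or sibling pointers, so the sibling combinators would need extra bookkeeping (for example a child-to-parent map built once per tree) that this sketch leaves out.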

Code quality and stability

The code is far from complete. It is tested, but there are minor issues (attribute matching doesn’t follow the specs verbatim).

Feel free to contribute.

Download files

Files for dom-query, version 0.0.1

Filename                           Size     File type  Python version
dom_query-0.0.1-py3-none-any.whl   13.0 kB  Wheel      py3
dom_query-0.0.1.tar.gz             11.0 kB  Source     None
