CSS selector syntax for Python minidom and compatible DOM implementations

dom_query lets you query Python minidom documents, and compatible DOM implementations, with CSS selector syntax.

Short example

Given an HTML file sample.html, the following code queries some elements and returns them as minidom Element objects. Because minidom is an XML parser, the file must be well-formed XML. When several elements can match (select_all), the result is a plain Python list rather than a minidom NodeList.

from xml.dom.minidom import parse
from dom_query import select, select_all

tree = parse("test/html/sample.html")

# Title element
title = select(tree, "title")

# Every P element
paragraphs = select_all(tree, "p")

# Element with type P and ID equal to "summary"
summary = select(tree, "p#summary")

# Every element with class "wide"
wide_elements = select_all(tree, ".wide")

Supported CSS syntax

Only a subset of CSS syntax is supported:

  • selector lists (comma separator),

  • element type and id,

  • class presence,

  • attribute matching (presence and all the other operators),

  • combinators (descendant, child, next-sibling +, subsequent-sibling ~).

Some supported selectors:

p#abstract[lang|=en]
p[data-user="john"]
div > p + p, article > p + p
script[type="text/data"]
header > li ul, footer > li ul
section h1 ~ p, article h2 ~ p
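
As a reference for the attribute operators, here is a minimal sketch of how the CSS selectors specification defines each match. It is not the library's code: the function name attribute_matches is hypothetical, and the library's own matching may differ from these rules in edge cases.

```python
def attribute_matches(value, operator, expected):
    """Return True if an attribute value satisfies a CSS attribute operator.

    Hypothetical helper following the CSS selectors specification;
    it does not reproduce dom_query's implementation.
    """
    if operator == "=":    # exact match
        return value == expected
    if operator == "~=":   # one of the whitespace-separated words
        return expected in value.split()
    if operator == "|=":   # exact match, or prefix followed by a hyphen
        return value == expected or value.startswith(expected + "-")
    if operator == "^=":   # prefix (an empty string never matches)
        return bool(expected) and value.startswith(expected)
    if operator == "$=":   # suffix (an empty string never matches)
        return bool(expected) and value.endswith(expected)
    if operator == "*=":   # substring (an empty string never matches)
        return bool(expected) and expected in value
    raise ValueError(f"unknown operator: {operator}")
```

Under these rules p#abstract[lang|=en] accepts lang="en" and lang="en-US" but rejects lang="english".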

Internals and implementation

Every query is compiled and cached for subsequent use.

Lexer

The first stage is tokenization (lexer.py lexer), which is loosely based on the W3C selector lexer. The differences mainly make the tokenizer compatible with regular expressions and strip every unnecessary feature.
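
To illustrate what a regular-expression tokenizer for selectors can look like, here is a minimal sketch. The token names (IDENT, HASH, CLASS, and so on), the pattern, and the tokenize function are assumptions made for this example; they are not taken from lexer.py.

```python
import re

# One named group per token kind. Order matters: "~=" must be tried
# as a MATCH operator before "~" is tried as punctuation.
TOKEN_RE = re.compile(r"""
    (?P<SPACE>\s+)
  | (?P<HASH>\#[\w-]+)
  | (?P<CLASS>\.[\w-]+)
  | (?P<IDENT>[\w-]+)
  | (?P<STRING>"[^"]*"|'[^']*')
  | (?P<MATCH>[~|^$*]?=)
  | (?P<PUNCT>[\[\],>+~*])
""", re.VERBOSE)


def tokenize(selector):
    """Split a selector string into (kind, text) pairs."""
    tokens = []
    pos = 0
    while pos < len(selector):
        match = TOKEN_RE.match(selector, pos)
        if match is None:
            raise ValueError(
                f"unexpected character {selector[pos]!r} at position {pos}")
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens


if __name__ == "__main__":
    for token in tokenize('p[data-user="john"]'):
        print(token)
```

Whitespace is kept as a token here because, between two compound selectors, it is the descendant combinator.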

Parser

Next comes the parsing stage (parser.py parse), which produces a simple AST from the tokens. Like the tokenizer, the parser is a simplified version of the standard one. It is a single function that implements a descent parser. The AST is a tuple of tuples that mirrors the structure of the given query fairly closely.
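
The following sketch shows one possible tuple-of-tuples AST for a small selector subset (type, id, class, combinators, commas). The AST shape, the node names, and this loop-based parse function are assumptions for illustration; the real parser.py works on lexer tokens and may structure its output differently.

```python
import re

# A combinator or comma (with optional surrounding spaces), bare
# whitespace (descendant combinator), or a simple selector.
TOKEN_RE = re.compile(r"\s*([>+~,])\s*|(\s+)|([#.]?[\w-]+)")
KINDS = {"#": "id", ".": "class"}


def parse(selector):
    """Parse a selector into a tuple of selectors, each a tuple of
    (combinator, compound) steps, each compound a tuple of (kind, value)."""
    selectors, steps, compound = [], [], []
    combinator = " "  # the first compound is reached by descent from the root
    text = selector.strip()
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r}")
        symbol, _space, simple = match.groups()
        pos = match.end()
        if simple:
            if simple[0] in KINDS:
                compound.append((KINDS[simple[0]], simple[1:]))
            else:
                compound.append(("type", simple))
            continue
        # A combinator or a comma closes the current compound selector.
        if not compound:
            raise ValueError("combinator without a left-hand selector")
        steps.append((combinator, tuple(compound)))
        compound = []
        if symbol == ",":
            selectors.append(tuple(steps))
            steps, combinator = [], " "
        else:
            combinator = symbol or " "
    if not compound:
        raise ValueError("selector ends unexpectedly")
    steps.append((combinator, tuple(compound)))
    selectors.append(tuple(steps))
    return tuple(selectors)
```

Because the result contains only tuples and strings, it is immutable and hashable, which makes it convenient to compare in tests or to use as a dictionary key.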

Compiler

The last stage is the compiler (compiler.py compile). It translates the AST into a sequence of simple actions (opcodes) to be performed in order to select the matching elements. Once compiled, the sequence is saved in a cache and reused whenever the same query is seen again.
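
A compile-and-cache step could be sketched as follows. Only the opcode names YIELD and RESET come from this document; the other opcode names, the AST shape, and the use of functools.lru_cache are assumptions. The sketch also keys the cache on the AST rather than on the query string, purely to keep the example self-contained.

```python
from functools import lru_cache

# Hypothetical opcode names for the four combinators.
COMBINATOR_OPCODES = {
    " ": "DESCENDANT",
    ">": "CHILD",
    "+": "NEXT_SIBLING",
    "~": "SUBSEQUENT_SIBLING",
}


@lru_cache(maxsize=None)
def compile_ast(ast):
    """Flatten a tuple-of-tuples AST into a tuple of opcodes.

    The AST is assumed to be a tuple of selectors, each a tuple of
    (combinator, compound) steps, each compound a tuple of (kind, value).
    """
    opcodes = []
    for index, steps in enumerate(ast):
        if index:
            # A comma starts again from the original element.
            opcodes.append(("RESET",))
        for combinator, compound in steps:
            opcodes.append((COMBINATOR_OPCODES[combinator],))
            for kind, value in compound:
                opcodes.append((kind.upper(), value))
        opcodes.append(("YIELD",))
    return tuple(opcodes)


if __name__ == "__main__":
    # Hand-written AST for "div > p, .wide"
    ast = (
        ((" ", (("type", "div"),)), (">", (("type", "p"),))),
        ((" ", (("class", "wide"),)),),
    )
    for opcode in compile_ast(ast):
        print(opcode)
    print(compile_ast.cache_info())
```

An immutable AST is what makes this kind of memoization straightforward: the compiled sequence can be shared safely between calls.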

Virtual machine

The opcodes are executed by vm.py execute. This function takes a starting element, a sequence of opcodes, and an api. The api is a dict-like object in which every key corresponds to a function that implements an opcode. The default api is minidom_api.py api.
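
The dispatch idea can be sketched with a tiny interpreter over minidom. The signature mirrors the description above (starting element, opcodes, api), but the opcode format, the DESCENDANT and TYPE names, and the handling of YIELD and RESET inside the loop are assumptions; vm.py may organize this differently.

```python
from xml.dom.minidom import parseString


def descendants(nodes, _arg):
    """Expand each node into all of its descendant elements."""
    for node in nodes:
        for child in node.childNodes:
            if child.nodeType == child.ELEMENT_NODE:
                yield child
                yield from descendants([child], _arg)


def by_type(nodes, name):
    """Keep only the elements with the given tag name."""
    return (node for node in nodes if node.tagName == name)


# Hypothetical api: opcode name -> function(nodes, argument).
API = {"DESCENDANT": descendants, "TYPE": by_type}


def execute(element, opcodes, api):
    """Run (name, argument) opcodes starting from `element`."""
    results, current = [], [element]
    for name, arg in opcodes:
        if name == "YIELD":
            # Collect matches, skipping nodes that were already found.
            results.extend(n for n in current if n not in results)
        elif name == "RESET":
            current = [element]
        else:
            current = list(api[name](current, arg))
    return results


if __name__ == "__main__":
    doc = parseString("<html><body><p>a</p><div><p>b</p></div></body></html>")
    opcodes = [("DESCENDANT", None), ("TYPE", "p"), ("YIELD", None)]
    for node in execute(doc, opcodes, API):
        print(node.toxml())
```

Keeping the interpreter loop free of DOM-specific calls is what allows a different api object to target another DOM implementation.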

DOM API

Every function in the api is either a filter (actual filtering of nodes) or a generator (combinator expansion). The only two opcodes that don’t follow this rule are YIELD (return the elements found so far) and RESET (reload the original element node after a CSS comma).
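
The difference between the two kinds of function can be sketched on top of minidom as follows. The function names and the (nodes, argument) calling convention are assumptions for this example, not the contents of minidom_api.py.

```python
from xml.dom.minidom import parseString


def is_element(node):
    return node.nodeType == node.ELEMENT_NODE


def has_class(nodes, name):
    """Filter: keep the elements whose class attribute lists `name`."""
    for node in nodes:
        # getAttribute returns "" when the attribute is missing.
        if name in node.getAttribute("class").split():
            yield node


def children(nodes, _arg=None):
    """Generator for the child combinator: element children of each node."""
    for node in nodes:
        yield from filter(is_element, node.childNodes)


def next_sibling(nodes, _arg=None):
    """Generator for the + combinator: the next element sibling, if any."""
    for node in nodes:
        sibling = node.nextSibling
        # Skip text and comment nodes between elements.
        while sibling is not None and not is_element(sibling):
            sibling = sibling.nextSibling
        if sibling is not None:
            yield sibling


if __name__ == "__main__":
    doc = parseString('<div><p class="wide">a</p> <p>b</p></div>')
    kids = list(children([doc.documentElement]))
    print([n.toxml() for n in has_class(kids, "wide")])
    print([n.toxml() for n in next_sibling(kids[:1])])
```

A filter never produces a node it was not given, while a generator replaces the current node set with related nodes.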

For other DOM implementations, it should be sufficient to write a new api and pass it to execute (or select*) when querying.
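
As an illustration of what such an api might involve, here is a sketch for the standard library's xml.etree.ElementTree. The opcode names and the factory function make_etree_api are hypothetical, and how the resulting object is handed to select or execute is not shown because this document does not specify it. ElementTree elements have no parent pointer, so the sketch builds a parent map first.

```python
import xml.etree.ElementTree as ET


def make_etree_api(root):
    """Build a hypothetical opcode -> function mapping for ElementTree."""
    # ElementTree has no parent links, so record them up front.
    parents = {child: parent for parent in root.iter() for child in parent}

    def descendants(nodes, _arg=None):
        for node in nodes:
            for found in node.iter():
                if found is not node:  # iter() also yields the node itself
                    yield found

    def children(nodes, _arg=None):
        for node in nodes:
            yield from node

    def next_sibling(nodes, _arg=None):
        for node in nodes:
            parent = parents.get(node)
            if parent is None:
                continue  # the root has no siblings
            siblings = list(parent)
            index = siblings.index(node)
            if index + 1 < len(siblings):
                yield siblings[index + 1]

    def by_type(nodes, name):
        return (node for node in nodes if node.tag == name)

    return {
        "DESCENDANT": descendants,
        "CHILD": children,
        "NEXT_SIBLING": next_sibling,
        "TYPE": by_type,
    }


if __name__ == "__main__":
    root = ET.fromstring("<div><p>a</p><p>b</p><ul><li>c</li></ul></div>")
    api = make_etree_api(root)
    print([node.tag for node in api["DESCENDANT"]([root])])
```

The parent map is computed once per document; if the tree is modified afterwards, it has to be rebuilt.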

Code quality and stability

The code is far from complete. It is tested, but there are minor issues (attribute matching doesn’t follow the specs verbatim).

Feel free to contribute.
