
Generate selector schema classes from a YAML config and a DSL script


Selector schema codegen


ssc_codegen is a parser generator that targets several programming languages (HTML parsing is the priority). Parsers are described in YAML configurations that embed a built-in declarative DSL.

It is designed to make parsers portable across programming languages.

Install

pipx (recommended)

pipx install ssc_codegen

pip

pip install ssc_codegen

Supported languages

language | lib            | xpath | css | formatter
python   | bs4            | NO    | YES | black
python   | parsel         | YES   | YES | black
dart     | universal_html | NO    | YES | dart format

User guide

Language features

  • Declarative DSL (Domain-Specific Language): no assignment, arithmetic, or operator precedence
  • Minimalistic syntax for working with selectors, regular expressions and simple string operations
  • Every method takes exactly one input argument, and it is always of a selector-like type
  • Four data types
  • Regular expression syntax follows Python. For maximum compatibility between target languages, prefer portable forms, for example [0-9] instead of \d
  • Empty lines and comments (//) are ignored by the parser
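
The advice about [0-9] can be illustrated in Python itself. In Python 3, \d in a str pattern matches any Unicode decimal digit, whereas engines with ECMAScript-style semantics (such as Dart's RegExp) treat \d as ASCII-only. The sample string below is made up for the illustration:

```python
import re

# Sample text with ASCII digits and Arabic-Indic digits (U+0663, U+0664).
sample = "price 12 or \u0663\u0664"

# In Python 3, \d on str patterns matches any Unicode decimal digit.
unicode_digits = re.findall(r"\d+", sample)

# [0-9] matches ASCII digits only, so it means the same thing everywhere.
ascii_digits = re.findall(r"[0-9]+", sample)

print(len(unicode_digits), len(ascii_digits))
```

The same script therefore extracts different values in different target languages unless the portable character class is used.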

Types description

The scripting language has four data types:

type           | description
SELECTOR       | class instance (Document, Element, Selector) on which css/xpath selectors are called. Always the first argument
SELECTOR_ARRAY | list of all nodes (elements) found from the SELECTOR instance
TEXT           | string
ARRAY          | array of strings
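
One way to picture how the four types flow through a script is a small type checker. This is only a sketch and not part of ssc_codegen: the names T, TRANSITIONS and check_types are hypothetical, and the transition table is an assumption that covers only a handful of directives (in particular, it assumes the "All" selectors produce SELECTOR_ARRAY).

```python
from enum import Enum


class T(Enum):
    SELECTOR = "SELECTOR"
    SELECTOR_ARRAY = "SELECTOR_ARRAY"
    TEXT = "TEXT"
    ARRAY = "ARRAY"


# Hypothetical transition table: (directive, input type) -> output type.
TRANSITIONS = {
    ("css", T.SELECTOR): T.SELECTOR,
    ("cssAll", T.SELECTOR): T.SELECTOR_ARRAY,
    ("text", T.SELECTOR): T.TEXT,
    ("text", T.SELECTOR_ARRAY): T.ARRAY,
    ("re", T.TEXT): T.TEXT,
    ("reAll", T.TEXT): T.ARRAY,
    ("split", T.TEXT): T.ARRAY,
    ("join", T.ARRAY): T.TEXT,
}


def check_types(directives):
    """Return the final type of a pipeline, or raise TypeError on a mismatch."""
    current = T.SELECTOR  # every script starts from a SELECTOR
    for name in directives:
        key = (name, current)
        if key not in TRANSITIONS:
            raise TypeError(f"{name} does not accept {current.value}")
        current = TRANSITIONS[key]
    return current


print(check_types(["cssAll", "text", "join"]).value)
```

A pipeline such as css followed directly by join is rejected, because join expects ARRAY and css yields SELECTOR.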

Recommendations

  • Prefer CSS selectors: they can always be converted to XPath when the target language does not support CSS selectors
  • An XPath-to-CSS converter exists for simple queries, but it comes with no guarantee of correctness. For example, CSS has no analogue of the XPath contains function
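
To show why the CSS-to-XPath direction is the safe one, here is a minimal sketch of such a conversion. It is not the converter used by ssc_codegen; simple_css_to_xpath is a hypothetical helper that handles only tag names with the descendant (space) and child (>) combinators and rejects everything else.

```python
import re


def simple_css_to_xpath(css: str) -> str:
    """Convert a CSS selector made of tag names, descendant (space) and
    child (>) combinators into an equivalent XPath expression."""
    # Reject anything this sketch does not understand (classes, ids, ...).
    if not re.fullmatch(r"[\w\s>-]+", css):
        raise ValueError(f"unsupported selector: {css!r}")
    # Tokens are tag names or the ">" combinator.
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9-]*|>", css)
    xpath = ""
    axis = "//"  # the first step matches anywhere in the document
    for token in tokens:
        if token == ">":
            axis = "/"  # the next tag must be a direct child
        else:
            xpath += axis + token
            axis = "//"  # the default combinator is "descendant"
    return xpath


print(simple_css_to_xpath("div > a"))
```

Each CSS combinator here maps onto exactly one XPath axis, which is what makes this direction mechanical; the reverse direction has no such mapping for many XPath functions.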

Schematic representation of generator operation

img.png

Description of directives

  • Statements are separated by line breaks (\n)
  • All string arguments are written in double quotes (")
  • Spaces are ignored
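
The rules above are enough to sketch a line-based tokenizer. The following is an illustration only, not the parser shipped with ssc_codegen; parse_script and ARG_RE are hypothetical names, and the sketch assumes arguments are either double-quoted strings or integers.

```python
import re

# A double-quoted string (with backslash escapes) or a signed integer.
ARG_RE = re.compile(r'"((?:[^"\\]|\\.)*)"|(-?[0-9]+)')


def parse_script(script: str):
    """Split a DSL script into (operator, [arguments]) tuples."""
    statements = []
    for raw_line in script.split("\n"):
        line = raw_line.strip()
        # Empty lines and // comments are ignored.
        if not line or line.startswith("//"):
            continue
        operator, _, rest = line.partition(" ")
        args = []
        for match in ARG_RE.finditer(rest):
            if match.group(2) is not None:
                args.append(int(match.group(2)))  # integer argument
            else:
                args.append(match.group(1))  # string argument, quotes removed
        statements.append((operator, args))
    return statements


for statement in parse_script('// comment\nxpath "//title"\ntext'):
    print(statement)
```

Escape sequences inside strings are kept verbatim, so a regular expression argument reaches the code generator unchanged.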

Operator | Arguments | Description | Return type | Example
default | "" | Default value returned if an error occurs during parsing. Must be listed first | - | default "empty"
xpath | "" | XPath selector; returns the first match | SELECTOR | xpath "//title"
xpathAll | "" | XPath selector; returns all matches | SELECTOR | xpathAll "//div"
css | "" | CSS selector; returns the first match | SELECTOR | css "title"
cssAll | "" | CSS selector; returns all matches | SELECTOR | cssAll "div > a"
attr | "" | Gets the named attribute of the tag(s). Called after xpath/xpathAll/css/cssAll | TEXT/ARRAY | attr "href"
text | | Gets the text inside the tag. Called after xpath/xpathAll/css/cssAll. Can be called first to convert the whole SELECTOR object to TEXT | TEXT/ARRAY | text
raw | | Gets the raw tag as text. Called after xpath/xpathAll/css/cssAll | TEXT/ARRAY | raw
re | "" | Regular expression; returns the first match. Input must be TEXT | TEXT | re "(\d+)"
reAll | "" | Regular expression; returns all matches. Input must be TEXT | ARRAY | reAll "(\d+)"
reSub | "" "" | Replacement by regular expression. Input must be TEXT | TEXT | reSub "(\d+)" "digit(lol)"
strip | "" | Removes the given string from the LEFT and RIGHT. Input must be TEXT | TEXT | strip "\n"
lstrip | "" | Removes the given string from the LEFT. Input must be TEXT | TEXT | lstrip " "
rstrip | "" | Removes the given string from the RIGHT. Input must be TEXT | TEXT | rstrip " "
format | "" | Formats a string. The substitution point is marked with {{}}. Input must be TEXT | TEXT | format "spam {{}} egg"
split | "" | Splits a string. If count = -1 or is omitted, splits as many times as possible. Input must be TEXT | ARRAY | split ", "
replace | "" "" | String replacement. If count = -1 or is omitted, replaces every occurrence. Input must be TEXT | TEXT | replace "spam" "egg"
limit | | Maximum number of elements | ARRAY | limit 50
index | | Takes the element at the given index. Input must be ARRAY | TEXT | index 1
first | | Alias for index 1 | TEXT | first
last | | Alias for index -1 | TEXT | last
join | "" | Joins an ARRAY into a string. Input must be ARRAY | TEXT | join ", "
ret | | Tells the translator to return a value. Added automatically if not specified in the script | - | ret
noRet | "" | Tells the translator not to return anything. Added for document pre-validation | - | noRet
// ... | | Single-line comment. Ignored by the final code generator | - | // this is comment line
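
To make the semantics of the string directives concrete, here is a small interpreter for a few of them. It is a sketch, not the code that ssc_codegen generates: run_text_pipeline is a hypothetical helper, strip is assumed to behave like Python's str.strip, and re is assumed to return the first capture group.

```python
import re


def run_text_pipeline(value, statements):
    """Apply a sequence of (operator, arguments) directives to a TEXT value."""
    for op, args in statements:
        if op == "re":
            # Assumption: the pattern has one group and its content is returned.
            value = re.search(args[0], value).group(1)
        elif op == "reAll":
            value = re.findall(args[0], value)
        elif op == "reSub":
            value = re.sub(args[0], args[1], value)
        elif op == "strip":
            # Assumption: same behaviour as Python's str.strip.
            value = value.strip(args[0])
        elif op == "split":
            value = value.split(args[0])
        elif op == "join":
            value = args[0].join(value)
        elif op == "format":
            # {{}} is the DSL placeholder for the current value.
            value = args[0].replace("{{}}", value)
        else:
            raise ValueError(f"unsupported directive: {op}")
    return value


steps = [("strip", [" "]), ("split", [", "]), ("join", ["-"])]
print(run_text_pipeline("  a, b, c  ", steps))
```

Each directive consumes the previous result, which is why the table lists a required input type for every operator.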

Example code generation

// extract the page title and format it
xpath "//title"
text
format "Cool title: {{}}"

generated python equivalent code:

from parsel import Selector


def dummy_parse(part: Selector):
    val_0 = part.xpath('//title')
    val_1 = val_0.xpath('/text()').get()
    val_2 = "Cool title: {}".format(val_1)
    return val_2

generated dart equivalent code:

import 'package:html/parser.dart' as html;

dummy_parse(part){
    var val_0 = part.querySelector('title');
    String val_1 = val_0?.text ?? "";
    var val_2 = "Cool title: $val_1";
    return val_2;
}

add default value:

// set default value if parse process is failing
default "spam egg"
xpath "//title"
text
format "Cool title: {{}}"

generated python equivalent code:

from parsel import Selector


def dummy_parse(part: Selector):
    try:
        val_1 = part.xpath('//title')
        val_2 = val_1.xpath('/text()').get()
        val_3 = "Cool title: {}".format(val_2)
        return val_3
    except Exception:
        return "spam egg"

generated dart equivalent code:

import 'package:html/parser.dart' as html;

dummy_parse(html.Document part){
  try{
    var val_0 = part.querySelector('title');
    String val_1 = val_0?.text ?? "";
    var val_2 = "Cool title: $val_1";
    return val_2;
  } catch (e){
    return "spam egg";
  }
}

add assert validator:

// not null check operation
assertCss "head > title"
xpath "//title"
text
format "Cool title: {{}}"

generated python equivalent code:

from parsel import Selector


def dummy_parse(part: Selector):
    assert part.css("head > title")
    val_1 = part.xpath('//title')
    val_2 = val_1.xpath('/text()').get()
    val_3 = "Cool title: {}".format(val_2)
    return val_3

generated dart equivalent code:

import 'package:html/parser.dart' as html;

dummy_parse(html.Document part){
    assert(part.querySelector('head > title') != null);
    var val_0 = part.querySelector('title');
    String val_1 = val_0?.text ?? "";
    var val_2 = "Cool title: $val_1";
    return val_2;
}

Document validation

The following commands pre-validate the input document with assertions. They do not change final or intermediate values.

This DSL has no boolean or null types, so a failed check throws an error such as AssertionError.

These operators accept SELECTOR:

  • assertCss
  • assertXpath

All other operators accept TEXT:

Command | Description | Example
assertEqual | Full string comparison (==), case sensitive | assertEqual "lorem upsum dolor"
assertContains | Checks that TEXT contains the given substring | assertContains "sum"
assertStarts | Checks that TEXT starts with the given substring | assertStarts "lorem"
assertEnds | Checks that TEXT ends with the given substring | assertEnds "dolor"
assertMatch | Checks TEXT against a regular expression | assertMatch "lorem \w+ dolor"
assertCss | Checks that the CSS query finds a match in SELECTOR | assertCss "head > title"
assertXpath | Checks that the XPath query finds a match in SELECTOR | assertXpath "//head/title"
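
The TEXT validators could map onto Python roughly as follows. This is a sketch under assumptions, not the generated code: CHECKS and validate are hypothetical names, and assertMatch is assumed to search anywhere in the string rather than anchor at the start.

```python
import re

# Hypothetical Python equivalents of the TEXT validators.
CHECKS = {
    "assertEqual": lambda text, arg: text == arg,
    "assertContains": lambda text, arg: arg in text,
    "assertStarts": lambda text, arg: text.startswith(arg),
    "assertEnds": lambda text, arg: text.endswith(arg),
    # Assumption: the pattern may match anywhere in the text.
    "assertMatch": lambda text, arg: re.search(arg, text) is not None,
}


def validate(text: str, command: str, arg: str) -> str:
    """Raise AssertionError if the check fails; return text unchanged."""
    if not CHECKS[command](text, arg):
        raise AssertionError(f"{command} {arg!r} failed for {text!r}")
    return text


print(validate("lorem upsum dolor", "assertContains", "sum"))
```

Returning the input unchanged mirrors the statement that validators do not alter intermediate values, so a check can sit anywhere in a pipeline.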

yaml config

An example of the structure of the generated parser class: img_2.png

  • selector - Selector/Document instance, initialized using document
  • _aliases - remapping keys for the view() method
  • _viewKeys - output keys for the view() method
  • _cachedResult - cache of obtained values from the parse() method
  • parse() - launching the parser
  • view() - getting the received values
  • _preValidate() - an optional method that pre-validates the input document according to the rules from the configuration. If a check yields false/null, it throws AssertionError
  • _partDocument() - an optional method of dividing a document into parts using a given selector. Useful, for example, for obtaining elements of the same type (product cards, etc.)
  • _parseA, _parseB, _parseC, ... - automatically generated parser methods for each key (A,B,C) according to the rules from the configuration

For an example configuration file, see examples

Usage pseudocode example:

document = ... // extracted html document
instance = Klass(document)
instance.parse()
print(instance.view())
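
To make the class layout more tangible, here is a hand-written stand-in that follows the member names listed above. It is not output of ssc_codegen: TitleSchema and _parseTitle are hypothetical, the document is a plain string, and a regular expression replaces the real Selector/Document so the sketch runs with the standard library only.

```python
import re


class TitleSchema:
    """Hand-written stand-in for a generated parser class."""

    _aliases = {"title": "page_title"}  # key -> name reported by view()
    _viewKeys = ["title"]               # keys exposed by view()

    def __init__(self, document: str):
        # A generated class would wrap the document in a Selector/Document.
        self.selector = document
        self._cachedResult = {}

    def _preValidate(self):
        # Optional pre-validation: the document must contain a <title> tag.
        assert "<title>" in self.selector

    def _parseTitle(self):
        # One _parse* method exists per configured key.
        return re.search(r"<title>(.*?)</title>", self.selector).group(1)

    def parse(self):
        self._preValidate()
        self._cachedResult["title"] = self._parseTitle()

    def view(self):
        # Apply aliases to the cached values for the selected keys.
        return {
            self._aliases.get(key, key): self._cachedResult[key]
            for key in self._viewKeys
        }


instance = TitleSchema("<html><head><title>Demo</title></head></html>")
instance.parse()
print(instance.view())
```

Keeping parse() and view() separate means the document is processed once and the cached values can be read repeatedly.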

dev

DEV

TODO

  • checksums for generated schemas
  • filter operations (?)
  • constants support
  • support for more languages and libraries
  • codegen optimizations (use of SELECTOR fluent interfaces, one-line code generation)
  • css/xpath analyzer in the pre-generation step
  • css/xpath patches (for example, when CSS selectors in the target language do not support :nth-child)
  • translation of regular expressions, e.g. \d to [0-9]
  • string methods: title, upper, lower, capitalize, or any other useful ones
