Generate selector schema classes from a YAML config and a DSL script

Selector schema codegen

ssc_codegen is a parser generator for various programming languages (primarily for HTML). It uses YAML configurations with a built-in declarative DSL.

It is designed to make parsers portable across programming languages.

Install

pipx (recommended)

pipx install ssc_codegen

pip

pip install ssc_codegen

Supported languages

language	lib	xpath	css	formatter
python	bs4	NO	YES	black
python	parsel	YES	YES	black
dart	universal_html	NO	YES	dart format

User guide

Language features

A declarative DSL (Domain-Specific Language): no assignment, arithmetic, or operator precedence
Minimalistic syntax for working with selectors, regular expressions, and simple string operations
All methods take one argument as input, and it is always of a selector-like type
4 data types
Regular expression syntax follows Python. For maximum compatibility, use explicit character classes, for example [0-9] instead of \d
Empty lines and comments (//) are ignored by the parser
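
The advice about [0-9] versus \d can be checked in Python itself. The snippet below is a minimal sketch using only the standard `re` module; the sample string is made up for illustration. In Python 3, \d on a str pattern matches any Unicode decimal digit, while some other regex engines treat \d as ASCII-only, so the explicit class is the safer choice for portable scripts.

```python
import re

# ASCII digits followed by two Arabic-Indic digits (U+0664, U+0662).
text = "price: 42 / \u0664\u0662"

# \d on a str pattern matches every Unicode decimal digit in Python 3.
unicode_digits = re.findall(r"\d+", text)

# An explicit class matches ASCII digits only, in any regex engine.
ascii_digits = re.findall(r"[0-9]+", text)

print(len(unicode_digits))  # 2
print(ascii_digits)         # ['42']
```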

Types description

This scripting language has 4 data types:

type	Description
SELECTOR	class instance (Document, Element, Selector) on which css/xpath selectors are called. Always the first argument
SELECTOR_ARRAY	list of all nodes (elements) found from the SELECTOR instance
TEXT	string
ARRAY	array of strings
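
One way to picture how the four types flow through a script is a small type checker. This is only a sketch: the `SIGNATURES` table and `check_types` function are hypothetical names, the signatures are inferred from the type descriptions above, and none of it reflects the actual ssc_codegen internals.

```python
# Hypothetical signature table: operator -> {accepted input type: output type}.
# Inferred from the type descriptions; not taken from ssc_codegen itself.
SIGNATURES = {
    "css": {"SELECTOR": "SELECTOR"},
    "cssAll": {"SELECTOR": "SELECTOR_ARRAY"},
    "text": {"SELECTOR": "TEXT", "SELECTOR_ARRAY": "ARRAY"},
    "attr": {"SELECTOR": "TEXT", "SELECTOR_ARRAY": "ARRAY"},
    "re": {"TEXT": "TEXT"},
    "reAll": {"TEXT": "ARRAY"},
    "join": {"ARRAY": "TEXT"},
}


def check_types(operators, start="SELECTOR"):
    """Walk a list of operator names and return the final value type."""
    current = start
    for op in operators:
        accepted = SIGNATURES[op]
        if current not in accepted:
            raise TypeError(f"{op} does not accept {current}")
        current = accepted[current]
    return current


print(check_types(["cssAll", "text", "join"]))  # TEXT
```

Calling `check_types(["css", "re"])` raises `TypeError` in this sketch, because `re` expects TEXT and `css` yields a SELECTOR.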

Recommendations

Prefer CSS selectors: they can be reliably converted to XPath if the target language does not support CSS selectors
An XPath to CSS converter exists for simple queries, but it comes without guarantees. For example, CSS has no equivalent of the XPath contains() function for matching text content
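
To illustrate why the CSS to XPath direction is the dependable one, here is a deliberately tiny converter. It is a sketch under stated assumptions: `simple_css_to_xpath` is a hypothetical helper that handles only tag names with the descendant (space) and child (`>`) combinators, and it is not the converter ssc_codegen uses.

```python
import re


def simple_css_to_xpath(css: str) -> str:
    """Convert a tiny CSS subset (tag names, ' ' and '>' combinators) to XPath."""
    # Normalise "a>b" and "a  >  b" to "a > b" so that split() yields clean tokens.
    tokens = re.sub(r"\s*>\s*", " > ", css.strip()).split()
    xpath = ""
    axis = "//"  # a selector starts by searching all descendants
    for token in tokens:
        if token == ">":
            axis = "/"  # the next tag must be a direct child
            continue
        if not re.fullmatch(r"[A-Za-z][A-Za-z0-9]*", token):
            raise ValueError(f"unsupported selector part: {token!r}")
        xpath += axis + token
        axis = "//"  # plain whitespace means "any descendant"
    return xpath


print(simple_css_to_xpath("div > a"))     # //div/a
print(simple_css_to_xpath("head title"))  # //head//title
```

Every construct in this CSS subset has a direct XPath counterpart, which is why the conversion never fails for supported input. The reverse direction has no such guarantee.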

Schematic representation of generator operation

Description of directives

Statements are separated by line breaks (\n)
All string arguments are written in double quotes (")
Spaces are ignored

Operator	Arguments	Description	Return type value	Example
default	""	Default value if an error occurred during parsing. Listed first	-	default "empty"
xpath	""	xpath selector, returns the first value found	SELECTOR	xpath "//title"
xpathAll	""	xpath selector, returns all values	SELECTOR_ARRAY	xpathAll "//div"
css	""	css selector, returns the first value found	SELECTOR	css "title"
cssAll	""	css selector, returns all values	SELECTOR_ARRAY	cssAll "div > a"
attr	""	get the value(s) of the given tag attribute. Called after xpath/xpathAll/css/cssAll	TEXT/ARRAY	attr "href"
text		get the text inside the tag. Called after xpath/xpathAll/css/cssAll. Can be called first to completely convert a `SELECTOR` object to `TEXT`	TEXT/ARRAY	text
raw		get the raw tag as text. Called after xpath/xpathAll/css/cssAll	TEXT/ARRAY	raw
re	""	regular expression. Returns the first element found. Argument must be TEXT	TEXT	re "(\d+)"
reAll	""	regular expression. Returns all found elements. Argument must be TEXT	ARRAY	reAll "(\d+)"
reSub	"" ""	Replacement by regular expression. Argument must be TEXT	TEXT	reSub "(\d+)" "digit(lol)"
strip	""	Removes the given string from the LEFT and RIGHT. Argument must be TEXT	TEXT	strip "\n"
lstrip	""	Removes the given string from the LEFT. Argument must be TEXT	TEXT	lstrip " "
rstrip	""	Removes the given string from the RIGHT. Argument must be TEXT	TEXT	rstrip " "
format	""	Format string. Specify a substitution argument using the `{{}}` operator. Argument must be TEXT	TEXT	format "spam {{}} egg"
split	""	Splits a string. If count = -1 or is not passed, splits as many times as possible. Argument must be TEXT	ARRAY	split ", "
replace	"" ""	String replacement. If count = -1 or is not passed, replaces all occurrences. Argument must be TEXT	TEXT	replace "spam" "egg"
limit		Maximum number of elements	ARRAY	limit 50
index		Take element by index. Argument must be ARRAY	TEXT	index 1
first		`index 1` alias	TEXT	first
last		`index -1` alias	TEXT	last
join	""	Collects ARRAY into a string. Argument must be ARRAY	TEXT	join ", "
ret		Tell the translator to return a value. Automatically added if not specified in the script		ret
noRet	""	Tell the translator not to return anything. Added for document pre-validation		noRet
//	...	One line comment. Ignored by the final code generator		// this is comment line
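
The string operators in the table map closely onto ordinary string and regex calls. The following is a rough sketch of a tiny interpreter for a few of them; `run_text_ops` is a hypothetical helper written for this illustration. It is not how ssc_codegen works (the tool generates code rather than interpreting scripts), and the exact semantics of each operator here are assumptions based on the descriptions above.

```python
import re


def run_text_ops(script: str, value):
    """Apply a handful of TEXT/ARRAY operators to a plain Python value."""
    for raw_line in script.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("//"):
            continue  # empty lines and comments are ignored
        op, _, rest = line.partition(" ")
        # String arguments are always written in double quotes.
        args = re.findall(r'"((?:[^"\\]|\\.)*)"', rest)
        if op == "re":
            match = re.search(args[0], value)
            # Return the first group if the pattern has one, else the whole match.
            value = match.group(1) if match.re.groups else match.group(0)
        elif op == "reAll":
            value = re.findall(args[0], value)
        elif op == "reSub":
            value = re.sub(args[0], args[1], value)
        elif op == "split":
            value = value.split(args[0])
        elif op == "join":
            value = args[0].join(value)
        elif op == "format":
            value = args[0].replace("{{}}", value)
        else:
            raise ValueError(f"operator not covered by this sketch: {op}")
    return value


script = '''
// pull the numeric id out of a query string
re "id=([0-9]+)"
format "item-{{}}"
'''
print(run_text_ops(script, "id=42&x=1"))  # item-42
```

Note how the value type changes along the way: `split` turns TEXT into ARRAY, and `join` turns it back into TEXT.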

Example code generation

// extract the page title and format it
xpath "//title"
text
format "Cool title: {{}}"

generated python equivalent code:

from parsel import Selector


def dummy_parse(part: Selector):
    val_0 = part.xpath('//title')
    val_1 = val_0.xpath('./text()').get()
    val_2 = "Cool title: {}".format(val_1)
    return val_2

generated dart equivalent code:

import 'package:html/parser.dart' as html;

dummy_parse(part){
    var val_0 = part.querySelector('title');
    String val_1 = val_0?.text ?? "";
    var val_2 = "Cool title: $val_1";
    return val_2;
}

add default value:

// set default value if parse process is failing
default "spam egg"
xpath "//title"
text
format "Cool title: {{}}"

from parsel import Selector


def dummy_parse(part: Selector):
    try:
        val_1 = part.xpath('//title')
        val_2 = val_1.xpath('./text()').get()
        val_3 = "Cool title: {}".format(val_2)
        return val_3
    except Exception:
        return "spam egg"

import 'package:html/parser.dart' as html;

dummy_parse(html.Document part){
  try{
    var val_0 = part.querySelector('title');
    String val_1 = val_0?.text ?? "";
    var val_2 = "Cool title: $val_1";
    return val_2;
  } catch (e){
    return "spam egg";
  }
    
}
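
The effect of `default` in the examples above is to wrap the whole generated body in a try/except that returns the fallback value. The same idea can be sketched as a reusable decorator. `with_default` and `parse_title` are hypothetical names, and the title extraction uses plain string methods so that the snippet runs without parsel.

```python
from functools import wraps


def with_default(default):
    """Return `default` instead of raising when the wrapped parser fails."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                return default
        return wrapper
    return decorator


@with_default("spam egg")
def parse_title(html_text: str) -> str:
    # str.index raises ValueError when the tag is missing.
    start = html_text.index("<title>") + len("<title>")
    end = html_text.index("</title>", start)
    return "Cool title: {}".format(html_text[start:end])


print(parse_title("<title>Hi</title>"))  # Cool title: Hi
print(parse_title("<p>no title</p>"))    # spam egg
```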

add assert validator

// not null check operation
assertCss "head > title"
xpath "//title"
text
format "Cool title: {{}}"

from parsel import Selector


def dummy_parse(part: Selector):
    assert part.css("head > title")
    val_1 = part.xpath('//title')
    val_2 = val_1.xpath('./text()').get()
    val_3 = "Cool title: {}".format(val_2)
    return val_3

import 'package:html/parser.dart' as html;

dummy_parse(html.Document part){
    assert(part.querySelector('head > title') != null);
    var val_0 = part.querySelector('title');
    String val_1 = val_0?.text ?? "";
    var val_2 = "Cool title: $val_1";
    return val_2;
}

Document validation

The following commands are needed to pre-validate the input document using assert and they do not change final and intermediate values.

This DSL has no boolean or null types, so when a check fails it throws an error such as AssertionError.

The operators accept SELECTOR:

assertCss
assertXpath

All other operators accept TEXT:

Command	Description	Example
assertEqual	Full string comparison (`==`) (case sensitive)	assertEqual "lorem ipsum dolor"
assertContains	Comparison by presence of part of a string in `TEXT`	assertContains "sum"
assertStarts	Comparison based on the presence of part of a string at the beginning of `TEXT`	assertStarts "lorem"
assertEnds	Comparison based on the presence of part of a string at the end of `TEXT`	assertEnds "dolor"
assertMatch	Compare `TEXT` by regular expression	assertMatch "lorem \w+ dolor"
assertCss	Checks that the query matches an element in `SELECTOR`.	assertCss "head > title"
assertXpath	Checks that the query matches an element in `SELECTOR`.	assertXpath "//head/title"
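
The TEXT validators behave like plain string checks that raise on failure and otherwise pass the value through unchanged. A minimal sketch follows; the `TEXT_ASSERTS` table and `validate` function are hypothetical names, and using `re.search` for assertMatch is an assumption (the generator may use a different matching mode).

```python
import re

# Hypothetical mapping from validator name to a boolean check on TEXT.
TEXT_ASSERTS = {
    "assertEqual": lambda text, arg: text == arg,
    "assertContains": lambda text, arg: arg in text,
    "assertStarts": lambda text, arg: text.startswith(arg),
    "assertEnds": lambda text, arg: text.endswith(arg),
    # Assumption: the pattern may match anywhere in the text.
    "assertMatch": lambda text, arg: re.search(arg, text) is not None,
}


def validate(text: str, command: str, arg: str) -> str:
    """Raise AssertionError if the check fails; otherwise return text as-is."""
    if not TEXT_ASSERTS[command](text, arg):
        raise AssertionError(f"{command} {arg!r} failed for {text!r}")
    return text


value = validate("lorem ipsum dolor", "assertContains", "sum")
value = validate(value, "assertMatch", r"lorem \w+ dolor")
print(value)  # lorem ipsum dolor
```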

yaml config

An example of the structure of the generated parser class:

selector - Selector/Document instance, initialized using document
_aliases - remapping keys for the view() method
_viewKeys - output keys for the view() method
_cachedResult - cache of obtained values from the parse() method
parse() - launching the parser
view() - getting the received values
_preValidate() - an optional method that pre-validates the input document according to the rules from the configuration. If the result is false/null, it throws AssertionError
_partDocument() - an optional method of dividing a document into parts using a given selector. Useful, for example, for obtaining elements of the same type (product cards, etc.)
_parseA, _parseB, _parseC, ... - automatically generated parser methods for each key (A,B,C) according to the rules from the configuration

For an example configuration file, see examples

Usage pseudocode example:

document = ... // extracted html document
instance = Klass(document)
instance.parse()
print(instance.view())
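
To make the class structure and the usage pseudocode above concrete, here is a hand-written stand-in shaped like the description. It is only a sketch: `TitleSchema` is a hypothetical class, the document is a plain string searched with a regex instead of a real Selector/Document, `_partDocument()` is omitted, and the way `_aliases` and `_viewKeys` interact is an assumption rather than the generator's actual output.

```python
import re


class TitleSchema:
    """Stand-in that mirrors the member names of a generated parser class."""

    _aliases = {"title": "page_title"}  # remaps keys for view()
    _viewKeys = ["page_title"]          # keys that view() outputs

    def __init__(self, document: str):
        # Generated code would wrap the document in a Selector/Document here.
        self.selector = document
        self._cachedResult = {}

    def _preValidate(self):
        assert "<title>" in self.selector

    def _parseTitle(self):
        match = re.search(r"<title>(.*?)</title>", self.selector, re.S)
        return match.group(1).strip() if match else ""

    def parse(self):
        self._preValidate()
        self._cachedResult = {"title": self._parseTitle()}
        return self

    def view(self):
        renamed = {self._aliases.get(k, k): v for k, v in self._cachedResult.items()}
        return {k: v for k, v in renamed.items() if k in self._viewKeys}


instance = TitleSchema("<html><head><title> Demo </title></head></html>")
instance.parse()
print(instance.view())  # {'page_title': 'Demo'}
```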

dev

DEV

TODO

generated schemas checksum
filter operations (?)
constants support
more languages, libs support
codegen optimizations (use of SELECTOR fluent interfaces, one-line code generation)
css/xpath analyzer in the pre-generate step
css/xpath patches (for example, when CSS selectors in the target language do not support :nth-child)
regular expression translation, e.g. \d to [0-9]
string methods: title, upper, lower, capitalize, and other useful ones
