Generate selector schema classes from a YAML config and a DSL script
Selector schema codegen
ssc_codegen is a generator of parsers for various programming languages (HTML parsing is the priority). It uses YAML configurations with a built-in declarative DSL.
It is designed to make parsers easy to port between programming languages.
Install
pipx (recommended)
pipx install ssc_codegen
pip
pip install ssc_codegen
Supported languages
language | lib | xpath | css | formatter |
---|---|---|---|---|
python | bs4 | NO | YES | black |
python | parsel | YES | YES | black |
dart | universal_html | NO | YES | dart format |
User guide
Language features
- DSL (Domain-Specific Language), declarative (no assignment, arithmetic, priority operations)
- Minimalistic syntax for working with selectors, regular expressions and simple string operations
- All methods take one argument as input and it is always a selector-like type
- 4 types
- Regular expression syntax is like in Python. For maximum compatibility, use for example [0-9] instead of \d
- Empty lines and comments (//) are ignored by the parser.
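The regular expression advice can be illustrated in Python, where \d also matches non-ASCII decimal digits while [0-9] does not. This is only a minimal sketch; the sample strings are made up for illustration and are not taken from ssc_codegen.

```python
import re

ascii_text = "price: 42"
# The same number written with Arabic-Indic digits (assumed sample input)
arabic_indic_text = "price: \u0664\u0662"

# In Python 3 str patterns, \d matches any Unicode decimal digit
digits_any = re.findall(r"\d+", arabic_indic_text)

# [0-9] matches ASCII digits only, which is the behavior most engines share
digits_ascii_only = re.findall(r"[0-9]+", arabic_indic_text)
digits_plain = re.findall(r"[0-9]+", ascii_text)

print(digits_any, digits_ascii_only, digits_plain)
```

Other regex engines may treat \d differently, which is why the explicit character class is the safer choice for portable scripts.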
Types description
There are 4 data types for this scripting language
type | Description |
---|---|
SELECTOR | class instance (Document, Element, Selector) from which css/xpath selectors are called. Always the first argument |
SELECTOR_ARRAY | list of all nodes (elements) found from the SELECTOR instance |
TEXT | string |
ARRAY | array of strings |
Recommendations
- Prefer CSS selectors: they are guaranteed to be convertible to XPath (if the target language does not support CSS selectors)
- There is an XPath to CSS converter for simple queries, with no guarantee that the result works. For example, CSS has no analogue of contains from XPath, etc.
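To show why the CSS to XPath direction is the reliable one, here is a minimal sketch of such a conversion for a small CSS subset (tag names, #id, .class, descendant and child combinators). The function name simple_css_to_xpath is hypothetical and this is not the converter used by ssc_codegen.

```python
import re

# Hypothetical helper, not part of ssc_codegen: converts a small CSS subset to XPath.
_COMPOUND = re.compile(r"^([a-zA-Z][\w-]*|\*)?((?:[#.][\w-]+)*)$")


def simple_css_to_xpath(css: str) -> str:
    # Pad the child combinator so that tokens split cleanly on whitespace
    tokens = css.replace(">", " > ").split()
    xpath = ""
    axis = "//"  # descendant step is the default between compound selectors
    for token in tokens:
        if token == ">":
            axis = "/"  # the next compound selector is a direct child
            continue
        match = _COMPOUND.match(token)
        if match is None:
            raise ValueError(f"unsupported selector part: {token!r}")
        tag = match.group(1) or "*"
        predicates = ""
        for kind, name in re.findall(r"([#.])([\w-]+)", match.group(2)):
            if kind == "#":
                predicates += f"[@id='{name}']"
            else:
                # Whole-word class match, the usual XPath idiom for .class
                predicates += (
                    "[contains(concat(' ', normalize-space(@class), ' '), "
                    f"' {name} ')]"
                )
        xpath += f"{axis}{tag}{predicates}"
        axis = "//"
    return xpath


print(simple_css_to_xpath("head > title"))
print(simple_css_to_xpath("div#main a"))
```

Every construct in this subset has a direct XPath counterpart. The reverse direction cannot be covered the same way, because XPath functions such as contains() have no CSS equivalent.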
Schematic representation of generator operation
Description of directives
- Statements are separated by a line break (\n)
- All string arguments are specified with double quotes (")
- Spaces are ignored
Operator | Arguments | Description | Return type value | Example |
---|---|---|---|---|
default | "" | Default value if an error occurred during parsing. Listed first | - | default "empty" |
xpath | "" | xpath selector, returns the first value found | SELECTOR | xpath "//title" |
xpathAll | "" | xpath selector, returns all values | SELECTOR | xpathAll "//div" |
css | "" | css selector, returns the first value found | SELECTOR | css "title" |
cssAll | "" | css selector, returns all values | SELECTOR | cssAll "div > a" |
attr | "" | get the attribute value(s) of the tag(s). Called after xpath/xpathAll/css/cssAll | TEXT/ARRAY | attr "href" |
text | | get the text inside the tag. Called after xpath/xpathAll/css/cssAll. Can be called first to completely convert a SELECTOR object to TEXT | TEXT/ARRAY | text |
raw | | get the raw tag as text. Called after xpath/xpathAll/css/cssAll | TEXT/ARRAY | raw |
re | "" | regular expression. Returns the first element found. Argument must be TEXT | TEXT | re "(\d+)" |
reAll | "" | regular expression. Returns all found elements. Argument must be TEXT | ARRAY | reAll "(\d+)" |
reSub | "" "" | Replacement by regular expression. Argument must be TEXT | TEXT | reSub "(\d+)" "digit(lol)" |
strip | "" | Removes the given string from the LEFT and RIGHT. Argument must be TEXT | TEXT | strip "\n" |
lstrip | "" | Removes the given string from the LEFT. Argument must be TEXT | TEXT | lstrip " " |
rstrip | "" | Removes the given string from the RIGHT. Argument must be TEXT | TEXT | rstrip " " |
format | "" | Format string. Specify a substitution argument using the {{}} operator. Argument must be TEXT | TEXT | format "spam {{}} egg" |
split | "" | Splits a string. If count is -1 or not passed, splits as many times as possible. Argument must be TEXT | ARRAY | split ", " |
replace | "" "" | String replacement. If count is -1 or not passed, replaces all occurrences. Argument must be TEXT | TEXT | replace "spam" "egg" |
limit | | Maximum number of elements | ARRAY | limit 50 |
index | | Take element by index. Argument must be ARRAY | TEXT | index 1 |
first | | index 1 alias | TEXT | first |
last | | index -1 alias | TEXT | last |
join | "" | Collects ARRAY into a string. Argument must be ARRAY | TEXT | join ", " |
ret | | Tell the translator to return a value. Automatically added if not specified in the script | | ret |
noRet | "" | Tell the translator not to return anything. Added for document pre-validation | | noRet |
// | ... | One line comment. Ignored by the final code generator | | // this is comment line |
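The way TEXT and ARRAY values flow from one operator to the next can be sketched with a tiny interpreter for a few of the string operators. This is a hypothetical illustration written for this README: run_text_script is not part of ssc_codegen, it covers only some operators, and it assumes each operator maps onto the closest Python built-in.

```python
import re

# Hypothetical mini-interpreter for a handful of TEXT/ARRAY operators.
# It only mirrors the table of directives; it is not the ssc_codegen translator.
_QUOTED = re.compile(r'"((?:[^"\\]|\\.)*)"')


def run_text_script(script: str, value):
    for raw_line in script.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("//"):
            continue  # empty lines and comments are skipped
        parts = line.split(maxsplit=1)
        op = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        args = _QUOTED.findall(rest)  # double-quoted string arguments
        if op == "re":            # TEXT -> TEXT
            value = re.findall(args[0], value)[0]
        elif op == "reAll":       # TEXT -> ARRAY
            value = re.findall(args[0], value)
        elif op == "reSub":       # TEXT -> TEXT
            value = re.sub(args[0], args[1], value)
        elif op == "split":       # TEXT -> ARRAY
            value = value.split(args[0])
        elif op == "replace":     # TEXT -> TEXT
            value = value.replace(args[0], args[1])
        elif op == "format":      # TEXT -> TEXT, {{}} marks the substitution point
            value = args[0].replace("{{}}", value)
        elif op == "join":        # ARRAY -> TEXT
            value = args[0].join(value)
        elif op == "limit":       # ARRAY -> ARRAY
            value = value[: int(rest)]
        elif op == "first":       # ARRAY -> TEXT
            value = value[0]
        elif op == "last":        # ARRAY -> TEXT
            value = value[-1]
        else:
            raise ValueError(f"unsupported operator: {op}")
    return value


SCRIPT = '''
// collect numbers, keep two, and format them
reAll "([0-9]+)"
limit 2
join ", "
format "ids: {{}}"
'''

print(run_text_script(SCRIPT, "a1 b22 c333"))
```

Each line consumes the value produced by the previous one, so the order of operators has to respect the types: reAll turns TEXT into ARRAY, limit keeps it an ARRAY, and join brings it back to TEXT before format is applied.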
Example code generation
    // extract the page title and format it
    xpath "//title"
    text
    format "Cool title: {{}}"

generated python equivalent code:

    from parsel import Selector

    def dummy_parse(part: Selector):
        val_0 = part.xpath('//title')
        # relative path: text nodes of the selected element
        val_1 = val_0.xpath('./text()').get()
        val_2 = "Cool title: {}".format(val_1)
        return val_2

generated dart equivalent code:

    import 'package:html/parser.dart' as html;

    dummy_parse(part){
        var val_0 = part.querySelector('title');
        String val_1 = val_0?.text ?? "";
        var val_2 = "Cool title: $val_1";
        return val_2;
    }
add default value:

    // set default value if parse process is failing
    default "spam egg"
    xpath "//title"
    text
    format "Cool title: {{}}"

generated python equivalent code:

    from parsel import Selector

    def dummy_parse(part: Selector):
        try:
            val_1 = part.xpath('//title')
            # relative path: text nodes of the selected element
            val_2 = val_1.xpath('./text()').get()
            val_3 = "Cool title: {}".format(val_2)
            return val_3
        except Exception:
            # any parsing error falls back to the default value
            return "spam egg"

generated dart equivalent code:

    import 'package:html/parser.dart' as html;

    dummy_parse(html.Document part){
        try{
            var val_0 = part.querySelector('title');
            String val_1 = val_0?.text ?? "";
            var val_2 = "Cool title: $val_1";
            return val_2;
        } catch (e){
            return "spam egg";
        }
    }
add assert validator:

    // not null check operation
    assertCss "head > title"
    xpath "//title"
    text
    format "Cool title: {{}}"

generated python equivalent code:

    from parsel import Selector

    def dummy_parse(part: Selector):
        # fails with AssertionError if the selector finds nothing
        assert part.css("head > title")
        val_1 = part.xpath('//title')
        # relative path: text nodes of the selected element
        val_2 = val_1.xpath('./text()').get()
        val_3 = "Cool title: {}".format(val_2)
        return val_3

generated dart equivalent code:

    import 'package:html/parser.dart' as html;

    dummy_parse(html.Document part){
        assert(part.querySelector('head > title') != null);
        var val_0 = part.querySelector('title');
        String val_1 = val_0?.text ?? "";
        var val_2 = "Cool title: $val_1";
        return val_2;
    }
Document validation
The following commands pre-validate the input document using assert; they do not change final or intermediate values.
This DSL has no boolean or null types, so if the result is false, an error such as AssertionError is thrown.
The following operators accept SELECTOR:
- assertCss
- assertXpath
All other operators accept TEXT:
Command | Description | Example |
---|---|---|
assertEqual | Full string comparison (==), case sensitive | assertEqual "lorem ipsum dolor" |
assertContains | Checks that TEXT contains the given substring | assertContains "sum" |
assertStarts | Checks that TEXT starts with the given substring | assertStarts "lorem" |
assertEnds | Checks that TEXT ends with the given substring | assertEnds "dolor" |
assertMatch | Checks TEXT against a regular expression | assertMatch "lorem \w+ dolor" |
assertCss | Checks that the CSS query finds an element in SELECTOR | assertCss "head > title" |
assertXpath | Checks that the XPath query finds an element in SELECTOR | assertXpath "//head/title" |
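The TEXT validators can be pictured as plain assertions. The sketch below is an assumption about their behavior rather than the code ssc_codegen generates: the function names are hypothetical, and assert_match is assumed to search anywhere in the string instead of requiring a full match.

```python
import re

# Hypothetical Python equivalents of the TEXT validators.
# Each one raises AssertionError on failure and returns nothing,
# so intermediate values stay unchanged.


def assert_equal(text: str, expected: str) -> None:
    assert text == expected, f"{text!r} != {expected!r}"


def assert_contains(text: str, part: str) -> None:
    assert part in text, f"{part!r} not found in {text!r}"


def assert_starts(text: str, prefix: str) -> None:
    assert text.startswith(prefix), f"{text!r} does not start with {prefix!r}"


def assert_ends(text: str, suffix: str) -> None:
    assert text.endswith(suffix), f"{text!r} does not end with {suffix!r}"


def assert_match(text: str, pattern: str) -> None:
    # Assumption: the pattern may match anywhere in the text
    assert re.search(pattern, text) is not None, f"{pattern!r} does not match"


sample = "lorem ipsum dolor"
assert_equal(sample, "lorem ipsum dolor")
assert_contains(sample, "sum")
assert_starts(sample, "lorem")
assert_ends(sample, "dolor")
assert_match(sample, r"lorem \w+ dolor")
print("all validators passed")
```

Because the helpers return nothing, they can be placed anywhere in a chain of operations without affecting the value being parsed.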
yaml config
An example of the structure of the generated parser class:
- selector - Selector/Document instance, initialized using document
- _aliases - remapping keys for the view() method
- _viewKeys - output keys for the view() method
- _cachedResult - cache of obtained values from the parse() method
- parse() - launching the parser
- view() - getting the received values
- _preValidate() - an optional method of preliminary validation of the input document according to the rules from the configuration. If the result is false/null, it throws AssertionError
- _partDocument() - an optional method of dividing a document into parts using a given selector. Useful, for example, for obtaining elements of the same type (product cards, etc.)
- _parseA, _parseB, _parseC, ... - automatically generated parser methods for each key (A,B,C) according to the rules from the configuration
For an example configuration file, see the examples.
Usage pseudocode example:
document = ... // extracted html document
instance = Klass(document)
instance.parse()
print(instance.view())
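To make the class structure more concrete, here is a hand-written stand-in for a generated class. Everything in it is an assumption made for illustration: the class name TitleSchema, the single key title, the alias page_title, and the use of a plain string with a regular expression in place of a real Selector/Document. The actual generated code may differ.

```python
import re
from typing import Any, Dict, List


class TitleSchema:
    """Hypothetical stand-in for a generated parser class."""

    # Assumed direction: parsed key -> key shown by view()
    _aliases: Dict[str, str] = {"title": "page_title"}
    # Keys that view() exposes
    _viewKeys: List[str] = ["title"]

    def __init__(self, document: str) -> None:
        # A real class would wrap a Selector/Document; a string is used here
        self.selector = document
        self._cachedResult: Dict[str, Any] = {}

    def _preValidate(self) -> None:
        # Optional pre-validation: fail early on unexpected documents
        assert "<title>" in self.selector, "document has no <title>"

    def _parseTitle(self) -> str:
        # One parser method per configured key
        match = re.search(r"<title>(.*?)</title>", self.selector, re.S)
        return match.group(1).strip() if match else ""

    def parse(self) -> None:
        self._preValidate()
        self._cachedResult["title"] = self._parseTitle()

    def view(self) -> Dict[str, Any]:
        # Return cached values under their aliased names
        return {
            self._aliases.get(key, key): self._cachedResult[key]
            for key in self._viewKeys
        }


document = "<html><head><title> Example </title></head></html>"
instance = TitleSchema(document)
instance.parse()
print(instance.view())
```

Keeping parse() and view() separate means the document is parsed once, and the cached values can then be read repeatedly under their aliased keys.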
dev
TODO
- generated schemas checksum
- filter operations (?)
- constants support
- more languages, libs support
- codegen optimizations (usage SELECTOR fluent interfaces, one-line code generation)
- css/xpath analyzer in pre-generate step
- css/xpath patches (for example, if css selectors in the target language do not support the :nth-child operation?)
- translate regex expressions. Eg: \d to [0-9]
- string methods: title, upper, lower, capitalize or any other useful ones