Generate web scraper structures with a DSL-like language based on Python
Selector Schema Codegen
ssc_codegen is a generator for HTML parsers in various programming languages.
Why?
- For convenient development of web scrapers, unofficial API interfaces, CI/CD integration
- Support for API interfaces in various programming languages (currently available: Dart, Python)
- Easy-to-read configuration files
- Automatic documentation of how to use the generated parser, including the signature of the parsed structure
- Portability: generated parsers are not tied to a specific project and can be reused
- Simple syntax similar to jQuery, ORM frameworks, and data serialization libraries
Features
- Declarative style: describe WHAT you want to do, not HOW to program it
- Standardization: the generated code has minimal dependencies
- Ability to rebuild in other programming languages
- CSS, XPath, regex, minimal string formatting operations
- Field validation, CSS/XPath/regex expressions
- Documentation transfer into the generated code
- Conversion of CSS to XPath queries
Install
pipx
pipx install ssc_codegen
pip
pip install ssc_codegen
Usage
See examples
Supported Libraries and Programming Languages
Language | Library | XPath Support | CSS Support | Formatter |
---|---|---|---|---|
Python | bs4 | NO | YES | black |
- | parsel | YES | YES | - |
- | selectolax (modest) | NO | YES | - |
- | scrapy (based on parsel, but the class is initialized with a Response) | YES | YES | - |
Dart | universal_html | NO | YES | dart format |
Recommendations
- For quickly obtaining effective CSS selectors, it is recommended to use any Chromium-based browser and the SelectorGadget extension.
- Prefer CSS selectors: they are guaranteed to convert to XPath.
- For maximum support across most programming languages, use simple queries for the following reasons:
  - Some libraries do not support the full CSS specification. For example, the selector `#product_description+ p` works in `python.parsel` and `javascript`, but not in the `dart.universal_html` and `selectolax` libraries.
  - There is an XPath to CSS converter, but its functionality is not guaranteed. For example, CSS has no equivalent to `contains` from XPath.
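The reason simple CSS queries convert cleanly can be seen in a small hand-written translator. This is a minimal sketch, not the converter shipped with ssc_codegen: the names `css_to_xpath` and `_parse_compound` are hypothetical, and only tag names, `#id`, `.class` and the descendant, `>` and `+` combinators are handled.

```python
import re
from typing import Tuple

# One compound selector: optional tag name followed by any number of #id / .class parts.
_COMPOUND = re.compile(r"^(?P<tag>[A-Za-z][\w-]*|\*)?(?P<rest>(?:[#.][\w-]+)*)$")


def _parse_compound(compound: str) -> Tuple[str, str]:
    """Return (tag, predicates) for a compound selector such as 'div#main.wide'."""
    match = _COMPOUND.match(compound)
    if not compound or match is None:
        raise ValueError(f"unsupported selector part: {compound!r}")
    tag = match.group("tag") or "*"
    predicates = ""
    for kind, name in re.findall(r"([#.])([\w-]+)", match.group("rest")):
        if kind == "#":
            predicates += f"[@id='{name}']"
        else:
            # Match one class token inside a space-separated class attribute.
            predicates += (
                "[contains(concat(' ', normalize-space(@class), ' '), "
                f"' {name} ')]"
            )
    return tag, predicates


def css_to_xpath(css: str) -> str:
    """Translate a simple CSS selector into an XPath expression."""
    # Surround '>' and '+' with spaces so that splitting on whitespace
    # yields compound selectors and combinators as separate tokens.
    tokens = re.sub(r"\s*([>+])\s*", r" \1 ", css.strip()).split()
    xpath = ""
    combinator = " "  # the descendant combinator is the default
    expect_compound = True
    for token in tokens:
        if token in (">", "+"):
            if expect_compound:
                raise ValueError(f"misplaced combinator in {css!r}")
            combinator = token
            expect_compound = True
            continue
        tag, predicates = _parse_compound(token)
        if combinator == ">":
            xpath += f"/{tag}{predicates}"
        elif combinator == "+":
            # Adjacent sibling: the very next element, which must have the given tag.
            self_test = "" if tag == "*" else f"[self::{tag}]"
            xpath += f"/following-sibling::*[1]{self_test}{predicates}"
        else:
            xpath += f"//{tag}{predicates}"
        combinator = " "
        expect_compound = False
    if expect_compound:
        raise ValueError(f"incomplete selector: {css!r}")
    return xpath


print(css_to_xpath("#product_description+ p"))
```

The reverse direction is harder because XPath functions such as `contains` applied to text have no CSS counterpart, which is why starting from CSS is the safer choice.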
How to Read Schema Code
Before reading, make sure you are familiar with:
- CSS selectors
- XPath selectors
- Regular expressions
Shortcuts
Variable notations in the code:
- D() - marks a `Document`/`Element` object
- N() - marks operations with nested structures
- R() - shortcut for `D().raw()`. Useful if you only need operations with regular expressions and strings, not with selectors
Built-in Schemas
ItemSchema
Parses the structure according to the rule `{<key1> = <value1>, <key2> = <value2>, ...}` and returns a hash table.
DictSchema
Parses the structure according to the rule `{<key1> = <value1>, <key2> = <value2>, ...}` and returns a hash table. Its keys and values are initialized with the `__KEY__` and `__VALUE__` magic methods.
ListSchema
Parses the structure according to the rule `[{<key1> = <value1>, <key2> = <value2>, ...}, {<key1> = <value1>, <key2> = <value2>, ...}]` and returns a list of hash tables.
FlattenListSchema
Parses the structure according to the rule `[<item1>, <item2>, ...]` and returns a list of objects.
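To make the four result shapes concrete, the sketch below writes them as plain Python values. All keys and values are invented sample data, and the distinction drawn between ItemSchema (fixed field names) and DictSchema (keys taken from the page) is an assumption, not something the descriptions above state.

```python
# Invented sample data: each value only illustrates the shape a schema returns.

# ItemSchema: one hash table (assumed here to have a fixed set of named fields).
item_result = {"title": "Example Domain", "url": "https://example.com"}

# DictSchema: one hash table (assumed here to take both keys and values from the page).
dict_result = {"og:title": "Example Domain", "og:type": "website"}

# ListSchema: a list of hash tables, one per repeated element.
list_result = [
    {"name": "Book A", "price": "10.00"},
    {"name": "Book B", "price": "12.50"},
]

# FlattenListSchema: a flat list of objects, one per repeated element.
flatten_result = ["/page/1", "/page/2", "/page/3"]
```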
Types
Currently, there are 5 types
TYPE | DESCRIPTION |
---|---|
DOCUMENT | 1 element/object of the document. Always the first argument in the field |
LIST_DOCUMENT | Collection of elements |
STRING | Tag string/attribute/tag text |
LIST_STRING | Collection of strings/attributes/text |
NESTED | Result of a nested schema, as returned by N().sub_parser(...) |
Magic Methods
- `__SPLIT_DOC__` - splits the document into elements for easier parsing
- `__PRE_VALIDATE__` - pre-validation of the document using `assert`. Throws an error if validation fails
- `__KEY__`, `__VALUE__` - magic methods for initializing the `DictSchema` structure
- `__ITEM__` - magic method for initializing the `FlattenListSchema` structure
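The roles of the magic methods can be pictured with a hand-written stand-in. The sketch below is not code emitted by ssc_codegen: it uses the standard-library `xml.etree.ElementTree` on invented, well-formed markup, and the function names `pre_validate`, `split_doc`, `parse_books`, `parse_prices` and `parse_urls` are hypothetical. It also assumes that DictSchema and FlattenListSchema work on the parts produced by `__SPLIT_DOC__`.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed sample markup so the standard-library XML parser can read it.
HTML = """
<html>
  <head><title>Books</title></head>
  <body>
    <div class="book"><a href="/b/1">First</a><span>10</span></div>
    <div class="book"><a href="/b/2">Second</a><span>12</span></div>
  </body>
</html>
"""


def pre_validate(doc: ET.Element) -> None:
    # Role of __PRE_VALIDATE__: fail early if the page is not the expected one.
    assert doc.find(".//title") is not None, "missing <title>"


def split_doc(doc: ET.Element) -> list:
    # Role of __SPLIT_DOC__: cut the document into one element per record.
    return doc.findall(".//div[@class='book']")


def parse_books(markup: str) -> list:
    # ListSchema-like result: one hash table per part.
    doc = ET.fromstring(markup.strip())
    pre_validate(doc)
    return [
        {
            "name": part.find("a").text,
            "url": part.find("a").get("href"),
            "price": part.find("span").text,
        }
        for part in split_doc(doc)
    ]


def parse_prices(markup: str) -> dict:
    # DictSchema-like result: the roles of __KEY__ and __VALUE__ applied to each part.
    doc = ET.fromstring(markup.strip())
    return {part.find("a").text: part.find("span").text for part in split_doc(doc)}


def parse_urls(markup: str) -> list:
    # FlattenListSchema-like result: the role of __ITEM__ applied to each part.
    doc = ET.fromstring(markup.strip())
    return [part.find("a").get("href") for part in split_doc(doc)]


print(parse_books(HTML))
```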
Operators
Method | Accepts | Returns | Example | Description |
---|---|---|---|---|
default | None/str | DOCUMENT | `D().default(None)` | Default value if an error occurs. Must be the first operation |
sub_parser | Schema | - | `N().sub_parser(Books)` | Passes the document/element to another parser object. Returns the obtained result |
css | CSS query | DOCUMENT | `D().css('a')` | Returns the first element matched by the selector |
xpath | XPATH query | DOCUMENT | `D().xpath('//a')` | Returns the first element matched by the selector |
css_all | CSS query | LIST_DOCUMENT | `D().css_all('a')` | Returns all elements matched by the selector |
xpath_all | XPATH query | LIST_DOCUMENT | `D().xpath_all('//a')` | Returns all elements matched by the selector |
raw | - | STRING/LIST_STRING | `D().raw()` | Returns the raw HTML of the document/element. Works with DOCUMENT, LIST_DOCUMENT |
text | - | STRING/LIST_STRING | `D().css('title').text()` | Returns the text from the HTML document/element. Works with DOCUMENT, LIST_DOCUMENT |
attr | ATTR-NAME | STRING/LIST_STRING | `D().css('a').attr('href')` | Returns the attribute from the HTML tag. Works with DOCUMENT, LIST_DOCUMENT |
trim | str | STRING/LIST_STRING | `R().trim('<body>')` | Trims the string from the LEFT and RIGHT. Works with STRING, LIST_STRING |
ltrim | str | STRING/LIST_STRING | `D().css('a').attr('href').ltrim('//')` | Trims the string from the LEFT. Works with STRING, LIST_STRING |
rtrim | str | STRING/LIST_STRING | `D().css('title').text().rtrim(' ')` | Trims the string from the RIGHT. Works with STRING, LIST_STRING |
replace/repl | old, new | STRING/LIST_STRING | `D().css('a').attr('href').repl('//', 'https://')` | Replaces the substring old with new. Works with STRING, LIST_STRING |
format/fmt | template | STRING/LIST_STRING | `D().css('title').text().fmt("title: {{}}")` | Formats the string according to the template. The template must contain the `{{}}` marker. Works with STRING, LIST_STRING |
re | pattern | STRING/LIST_STRING | `D().css('title').text().re('(\w+)')` | Finds the first match of the regex pattern. Works with STRING, LIST_STRING |
re_all | pattern | LIST_STRING | `D().css('title').text().re_all('(\w+)')` | Finds all matches of the regex pattern. Works with STRING |
re_sub | pattern, repl | STRING/LIST_STRING | `D().css('title').text().re_sub('(\w+)', 'wow')` | Replaces the parts of the string that match the regex pattern. Works with STRING, LIST_STRING |
index | int | STRING/DOCUMENT | `D().css_all('a').index(0)` | Takes the element by index. Works with LIST_DOCUMENT, LIST_STRING |
first | - | STRING/DOCUMENT | `D().css_all('a').first` | Alias for index(0) |
last | - | STRING/DOCUMENT | `D().css_all('a').last` | Alias for index(-1), or an implementation of a negative index |
join | sep | STRING | `D().css_all('a').text().join(', ')` | Joins the collection into a string. Works with LIST_STRING |
assert_in | str | NONE | `D().css_all('a').attr('href').assert_in('example.com')` | Checks if the string is in the collection. The checked argument must be LIST_STRING |
assert_re | pattern | NONE | `D().css('a').attr('href').assert_re('example.com')` | Checks if the regex pattern is found. The checked argument must be STRING |
assert_css | CSS query | NONE | `D().assert_css('title')` | Checks the element by CSS. The checked argument must be DOCUMENT |
assert_xpath | XPATH query | NONE | `D().assert_xpath('//title')` | Checks the element by XPath. The checked argument must be DOCUMENT |
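As a rough guide to what the string operators do, the following sketch reproduces several of them with plain Python on invented input. It is a stand-in under stated assumptions, not the code ssc_codegen generates: the helpers `fmt` and `with_default` are hypothetical, operators on a LIST_STRING are assumed to apply to every item, and `re` is assumed to return the first captured group.

```python
import re
from typing import Callable, Optional


def fmt(value: str, template: str) -> str:
    # Stand-in for format/fmt: put the value where the {{}} marker is.
    if "{{}}" not in template:
        raise ValueError("template must contain the {{}} marker")
    return template.replace("{{}}", value)


def with_default(extract: Callable[[], str], default: Optional[str]) -> Optional[str]:
    # Stand-in for default(...): fall back when a later step raises.
    try:
        return extract()
    except Exception:
        return default


# Invented inputs, as if produced by css_all('a').attr('href') and css('title').text().
hrefs = ["//example.com/a", "//example.com/b"]
title = "Books to Scrape 2024"

urls = [href.replace("//", "https://") for href in hrefs]  # repl on each item
first_word = re.search(r"(\w+)", title).group(1)           # re
all_words = re.findall(r"(\w+)", title)                    # re_all
masked = re.sub(r"(\w+)", "wow", title)                    # re_sub
joined = ", ".join(urls)                                   # join
last_url = urls[-1]                                        # last, i.e. index(-1)
labelled = fmt(title, "title: {{}}")                       # fmt

# default(None): the pattern does not match, so the fallback value is used.
missing = with_default(lambda: re.search(r"(\d{5})", title).group(1), None)

# assert_in and assert_re: validation steps that raise when the check fails.
assert "https://example.com/a" in urls
assert re.search(r"example\.com", urls[0]) is not None
```

The trim operators are left out on purpose, since the table does not say whether the argument is treated as a prefix or as a set of characters.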