Skip to main content

generate web scrapers structures by dsl-like language based on python

Project description

Selector Schema Codegen

RU EN

ssc_codegen is a generator for HTML parsers in various programming languages.

Why?

  • For convenient development of web scrapers, unofficial API interfaces, CI/CD integration
  • Support for API interfaces in various programming languages (currently available: Dart, Python)
  • Easy configuration files reading
  • auto documentation how to use it and generate parse structure signature
  • Portability: generated parsers are not tied to a specific project and can be reused
  • Simple syntax similar to jQuery, ORM frameworks, and data serialization libraries

Features

  • Declarative style: describe WHAT you want to do, not HOW to program it
  • Standardization: the generated code has minimal dependencies
  • Ability to rebuild in other programming languages
  • CSS, XPath, regex, minimal string formatting operations
  • Field validation, CSS/XPath/regex expressions
  • Documentation transfer into the generated code
  • Conversion of CSS to XPath queries

Install

pipx

pipx install ssc_codegen

pip

pip install ssc_codegen

Usage

See examples

Supported Libraries and Programming Languages

Language Library XPath Support CSS Support Formatter
Python bs4 NO YES black
- parsel YES YES -
- selectolax (modest) NO YES -
- scrapy (based on parsel, but class init argument - Response) YES YES -
Dart universal_html NO YES dart format

Recommendations

  • For quickly obtaining effective CSS selectors, it is recommended to use any Chromium-based browser and the SelectorGadget extension.
  • Use CSS selectors: they can be guaranteed convert to XPath.
  • For maximum support across most programming languages, use simple queries for the following reasons:
    • Some libraries do not support the full CSS specification. For example, the selector #product_description+ p works in python.parsel and javascript, but not in the dart.universal_html and selectolax libraries.
  • There is an XPath to CSS converter, but its functionality is not guaranteed. For example, CSS has no equivalent to contains from XPath.

How to Read Schema Code

Before reading, make sure you are familiar with:

  • CSS selectors
  • XPath selectors
  • Regular expressions

Shortcuts

Variable notations in the code:

  • D() — mark a Document/Element object
  • N() — mark operations with nested structures
  • R() — shortcut for D().raw(). Useful if you only need operations with regular expressions and strings, not with selectors

Built-in Schemas

ItemSchema

Parses the structure according to the rules {<key1> = <value1>, <key2> = <value2>, ...}, returns a hash table.

DictSchema

Parses the structure according to the rule {<key1> = <value1>, <key2> = <value2>, ...}, returns a hash table.

ListSchema

Parses the structure according to the rule [{<key1> = <value1>, <key2> = <value2>, ...}, {<key1> = <value1>, <key2> = <value2>, ...}], returns a list of hash tables.

FlattenListSchema

Parses the structure according to the rule [<item1>, <item2>, ...], returns a list of objects.

Types

Currently, there are 5 types

TYPE DESCRIPTION
DOCUMENT 1 element/object of the document. Always the first argument in the field
LIST_DOCUMENT Collection of elements
STRING Tag string/attribute/tag text
LIST_STRING Collection of strings/attributes/text
NESTED Collection of strings/attributes/text

Magic Methods

  • __SPLIT_DOC__ - splits the document into elements for easier parsing
  • __PRE_VALIDATE__ - pre-validation of the document using assert. Throws an error if validation fails
  • __KEY__, __VALUE__ - magic methods for initializing DictSchema structure
  • __ITEM__ - magic method for initializing FlattenListSchema structure

Operators

Method Accepts Returns Example Description
default(None/str) None/str DOCUMENT D().default(None) Default value if an error occurs. Must be the first
sub_parser Schema - N().sub_parser(Books) Passes the document/element to another parser object. Returns the obtained result
css CSS query DOCUMENT D().css('a') Returns the first found element of the selector result
xpath XPATH query DOCUMENT D().xpath('//a') Returns the first found element of the selector result
css_all CSS query LIST_DOCUMENT D().css_all('a') Returns all elements of the selector result
xpath_all XPATH query LIST_DOCUMENT D().xpath_all('//a') Returns all elements of the selector result
raw STRING/LIST_STRING D().raw() Returns the raw HTML of the document/element. Works with DOCUMENT, LIST_DOCUMENT
text STRING/LIST_STRING D().css('title').text() Returns the text from the HTML document/element. Works with DOCUMENT, LIST_DOCUMENT
attr ATTR-NAME STRING/LIST_STRING D().css('a').attr('href') Returns the attribute from the HTML tag. Works with DOCUMENT, LIST_DOCUMENT
trim str STRING/LIST_STRING R().trim('<body>') Trims the string from the LEFT and RIGHT. Works with STRING, LIST_STRING
ltrim str STRING/LIST_STRING D().css('a').attr('href').ltrim('//') Trims the string from the LEFT. Works with STRING, LIST_STRING
rtrim str STRING/LIST_STRING D().css('title').rtrim(' ') Trims the string from the RIGHT. Works with STRING, LIST_STRING
replace/repl old, new STRING/LIST_STRING D().css('a').attr('href').repl('//', 'https://') Replaces the string. Works with STRING, LIST_STRING
format/fmt template STRING/LIST_STRING D().css('title').fmt("title: {{}}") Formats the string according to the template. Must have the {{}} marker. Works with STRING, LIST_STRING
re pattern STRING/LIST_STRING D().css('title').re('(\w+)') Finds the first matching result of the regex pattern. Works with STRING, LIST_STRING
re_all pattern LIST_STRING D().css('title').re('(\w+)') Finds all matching results of the regex pattern. Works with STRING
re_sub pattern, repl STRING/LIST_STRING D().css('title').re_sub('(\w+)', 'wow') Replaces the string according to the regex pattern. Works with STRING, LIST_STRING
index int STRING/DOCUMENT D().css_all('a').index(0) Takes the element by index. Works with LIST_DOCUMENT, LIST_STRING
first - D().css_all('a').first Alias for index(0)
last - D().css_all('a').last Alias for index(-1). Or implementation of a negative index
join sep STRING D().css_all('a').text().join(', ') Collects the collection into a string. Works with LIST_STRING
assert_in str NONE D().css_all('a').attr('href').assert_in('example.com') Checks if the string is in the collection. The checked argument must be LIST_STRING
assert_re pattern NONE D().css('a').attr('href').assert_re('example.com') Checks if the regex pattern is found. The checked argument must be STRING
assert_css CSS query NONE D().assert_css('title') Checks the element by CSS. The checked argument must be DOCUMENT
assert_xpath XPATH query NONE D().assert_xpath('//title') Checks the element by XPath. The checked argument must be DOCUMENT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

ssc_codegen-0.3.0.dev3.tar.gz (78.2 kB view details)

Uploaded Source

Built Distribution

ssc_codegen-0.3.0.dev3-py3-none-any.whl (44.4 kB view details)

Uploaded Python 3

File details

Details for the file ssc_codegen-0.3.0.dev3.tar.gz.

File metadata

  • Download URL: ssc_codegen-0.3.0.dev3.tar.gz
  • Upload date:
  • Size: 78.2 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: python-httpx/0.25.2

File hashes

Hashes for ssc_codegen-0.3.0.dev3.tar.gz
Algorithm Hash digest
SHA256 ed3d6c6526f783f45e0cc0ba9a5b951f467729e4312c5f757c9c0de93a5aee88
MD5 4903482acecd38303545d5806ad56302
BLAKE2b-256 8b38a58c337059a72f786629fc2ed0fdd515803e9da76c48ac0730d66b212bb4

See more details on using hashes here.

File details

Details for the file ssc_codegen-0.3.0.dev3-py3-none-any.whl.

File metadata

File hashes

Hashes for ssc_codegen-0.3.0.dev3-py3-none-any.whl
Algorithm Hash digest
SHA256 21939b7971252b56892a6c4aa06ce8a2027b3e2878bb72b45fa2e568dfc5781b
MD5 4980e2efccb36e8f3dc56ad40bbf1366
BLAKE2b-256 e479da1899e8c15bee910a8ef98d555d6da3c80545f1b1394b9494d13c2ca8ca

See more details on using hashes here.

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page