
Generate web scraper structures with a DSL-like language based on Python


Selector Schema Codegen


ssc_codegen is a generator for HTML parsers in various programming languages.

Why?

  • Convenient development of web scrapers and unofficial API interfaces, with CI/CD integration
  • Support for API interfaces in various programming languages (currently available: Dart, Python)
  • Configuration files that are easy to read
  • Automatic documentation: usage notes and the signature of the parsed structure are generated for you
  • Portability: generated parsers are not tied to a specific project and can be reused
  • Simple syntax similar to jQuery, ORM frameworks, and data serialization libraries

Features

  • Declarative style: describe WHAT you want to do, not HOW to program it
  • Standardization: the generated code has minimal dependencies
  • Parsers can be regenerated in other programming languages
  • CSS, XPath, regex, and basic string formatting operations
  • Validation of fields and of CSS/XPath/regex expressions
  • Documentation is carried over into the generated code
  • Conversion of CSS queries to XPath

Install

pipx

pipx install ssc_codegen

pip

pip install ssc_codegen

Usage

See examples

Supported Libraries and Programming Languages

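| Language | Library | XPath Support | CSS Support | Formatter |
|----------|---------|---------------|-------------|-----------|
| Python | bs4 | NO | YES | black |
| - | parsel | YES | YES | - |
| - | selectolax (modest) | NO | YES | - |
| - | scrapy (based on parsel, but the class is initialized with a Response) | YES | YES | - |
| Dart | universal_html | NO | YES | dart format |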

Recommendations

  • To find effective CSS selectors quickly, use any Chromium-based browser with the SelectorGadget extension.
  • Prefer CSS selectors: they are guaranteed to convert to XPath.
  • For the widest support across programming languages, use simple queries:
    • Some libraries do not support the full CSS specification. For example, the selector #product_description+ p works in python.parsel and javascript, but not in the dart.universal_html and selectolax libraries.
  • There is an XPath to CSS converter, but it is not guaranteed to work for every query. For example, CSS has no equivalent of the XPath contains function.
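
To illustrate why simple CSS queries convert to XPath reliably, here is a minimal sketch of such a conversion. It is not the converter shipped with ssc_codegen: the function name simple_css_to_xpath is hypothetical, and it deliberately supports only tag names, #id, .class and the descendant combinator, rejecting anything else.

```python
import re

# Hypothetical helper, NOT the ssc_codegen converter.
# A compound selector is an optional tag followed by #id / .class parts.
_COMPOUND = re.compile(r"(?P<tag>[\w-]*)(?P<rest>(?:[#.][\w-]+)*)")


def simple_css_to_xpath(css: str) -> str:
    """Convert a very small CSS subset to an XPath 1.0 expression."""
    steps = []
    for compound in css.split():  # whitespace = descendant combinator
        match = _COMPOUND.fullmatch(compound)
        if match is None:
            raise ValueError(f"unsupported selector part: {compound!r}")
        step = match["tag"] or "*"
        for prefix, name in re.findall(r"([#.])([\w-]+)", match["rest"]):
            if prefix == "#":
                step += f"[@id='{name}']"
            else:
                # Match one class among several space-separated classes.
                step += (
                    "[contains(concat(' ', normalize-space(@class), ' '), "
                    f"' {name} ')]"
                )
        steps.append(step)
    return "//" + "//".join(steps)


print(simple_css_to_xpath("div#main a.link"))
```

A selector such as #product_description+ p raises ValueError in this sketch, which mirrors the advice above: the simpler the query, the more predictable the conversion.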

How to Read Schema Code

Before reading, make sure you are familiar with:

  • CSS selectors
  • XPath selectors
  • Regular expressions

Shortcuts

Variable notations in the code:

  • D() — mark a Document/Element object
  • N() — mark operations with nested structures
  • R() — shortcut for D().raw(). Useful if you only need operations with regular expressions and strings, not with selectors

Built-in Schemas

ItemSchema

Parses the structure according to the rule {<key1> = <value1>, <key2> = <value2>, ...}, returns a hash table.

DictSchema

Parses the structure according to the rule {<key1> = <value1>, <key2> = <value2>, ...}, returns a hash table. The structure is initialized with the __KEY__ and __VALUE__ magic methods.

ListSchema

Parses the structure according to the rule [{<key1> = <value1>, <key2> = <value2>, ...}, {<key1> = <value1>, <key2> = <value2>, ...}], returns a list of hash tables.

FlattenListSchema

Parses the structure according to the rule [<item1>, <item2>, ...], returns a list of objects.
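
The four result shapes can be compared side by side. The sketch below does not use ssc_codegen: it emulates the shapes with the standard library's xml.etree.ElementTree on a small well-formed snippet. The page content and the field names (title, first_book, name, price) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed markup so the stdlib XML parser can read it.
PAGE = ET.fromstring(
    "<html><head><title>Shop</title></head><body>"
    '<div class="book"><h3>Book A</h3><p>10.00</p></div>'
    '<div class="book"><h3>Book B</h3><p>12.50</p></div>'
    "</body></html>"
)
books = PAGE.findall(".//div[@class='book']")

# ItemSchema-like: one hash table for the whole document.
item = {
    "title": PAGE.findtext(".//title"),
    "first_book": PAGE.findtext(".//h3"),
}

# DictSchema-like: one hash table, keys and values both read from the page.
as_dict = {b.findtext("h3"): b.findtext("p") for b in books}

# ListSchema-like: a list of hash tables, one per repeated element.
as_list = [{"name": b.findtext("h3"), "price": b.findtext("p")} for b in books]

# FlattenListSchema-like: a flat list of values.
as_flat = [b.findtext("h3") for b in books]

print(item, as_dict, as_list, as_flat, sep="\n")
```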

Types

Currently, there are 5 types:

| TYPE | DESCRIPTION |
|------|-------------|
| DOCUMENT | 1 element/object of the document. Always the first argument in the field |
| LIST_DOCUMENT | Collection of elements |
| STRING | Tag string/attribute/tag text |
| LIST_STRING | Collection of strings/attributes/text |
| NESTED | Result of a nested structure, obtained by passing the document/element to another schema (see sub_parser) |

Magic Methods

  • __SPLIT_DOC__ - splits the document into elements for easier parsing
  • __PRE_VALIDATE__ - pre-validation of the document using assert. Throws an error if validation fails
  • __KEY__, __VALUE__ - magic methods for initializing a DictSchema structure
  • __ITEM__ - magic method for initializing a FlattenListSchema structure
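
The order in which these magic methods take effect can be sketched as follows. This is an emulation under stated assumptions, not ssc_codegen code: plain callables stand in for D() expressions, and LinksSchema, run_dict_schema and the sample markup are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET

HTML = (
    "<html><head><title>Demo</title></head><body>"
    '<ul><li><a href="/a">A</a></li><li><a href="/b">B</a></li></ul>'
    "</body></html>"
)


class LinksSchema:
    """Hypothetical stand-in for a DictSchema; callables replace D() chains."""

    # Fail early if the page does not look as expected.
    __PRE_VALIDATE__ = staticmethod(lambda doc: doc.find(".//title") is not None)
    # Cut the document into one element per record.
    __SPLIT_DOC__ = staticmethod(lambda doc: doc.findall(".//li"))
    # Each part yields one key and one value of the resulting hash table.
    __KEY__ = staticmethod(lambda part: part.find("a").text)
    __VALUE__ = staticmethod(lambda part: part.find("a").get("href"))


def run_dict_schema(schema, markup: str) -> dict:
    """Validate first, then split, then build the key/value pairs."""
    doc = ET.fromstring(markup)
    assert schema.__PRE_VALIDATE__(doc), "pre-validation failed"
    parts = schema.__SPLIT_DOC__(doc)
    return {schema.__KEY__(p): schema.__VALUE__(p) for p in parts}


print(run_dict_schema(LinksSchema, HTML))
```

A FlattenListSchema would follow the same flow in this emulation, with a single __ITEM__ callable producing each list element instead of a key/value pair.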

Operators

| Method | Accepts | Returns | Example | Description |
|--------|---------|---------|---------|-------------|
| default | None/str | DOCUMENT | D().default(None) | Default value if an error occurs. Must come first in the chain |
| sub_parser | Schema | - | N().sub_parser(Books) | Passes the document/element to another parser object and returns its result |
| css | CSS query | DOCUMENT | D().css('a') | Returns the first element matched by the selector |
| xpath | XPATH query | DOCUMENT | D().xpath('//a') | Returns the first element matched by the selector |
| css_all | CSS query | LIST_DOCUMENT | D().css_all('a') | Returns all elements matched by the selector |
| xpath_all | XPATH query | LIST_DOCUMENT | D().xpath_all('//a') | Returns all elements matched by the selector |
| raw | | STRING/LIST_STRING | D().raw() | Returns the raw HTML of the document/element. Works with DOCUMENT, LIST_DOCUMENT |
| text | | STRING/LIST_STRING | D().css('title').text() | Returns the text from the HTML document/element. Works with DOCUMENT, LIST_DOCUMENT |
| attr | ATTR-NAME | STRING/LIST_STRING | D().css('a').attr('href') | Returns the attribute from the HTML tag. Works with DOCUMENT, LIST_DOCUMENT |
| trim | str | STRING/LIST_STRING | R().trim('<body>') | Trims the string from the left and right. Works with STRING, LIST_STRING |
| ltrim | str | STRING/LIST_STRING | D().css('a').attr('href').ltrim('//') | Trims the string from the left. Works with STRING, LIST_STRING |
| rtrim | str | STRING/LIST_STRING | D().css('title').text().rtrim(' ') | Trims the string from the right. Works with STRING, LIST_STRING |
| replace/repl | old, new | STRING/LIST_STRING | D().css('a').attr('href').repl('//', 'https://') | Replaces a substring. Works with STRING, LIST_STRING |
| format/fmt | template | STRING/LIST_STRING | D().css('title').text().fmt("title: {{}}") | Formats the string according to the template. The template must contain the {{}} marker. Works with STRING, LIST_STRING |
| re | pattern | STRING/LIST_STRING | D().css('title').text().re('(\w+)') | Finds the first match of the regex pattern. Works with STRING, LIST_STRING |
| re_all | pattern | LIST_STRING | D().css('title').text().re_all('(\w+)') | Finds all matches of the regex pattern. Works with STRING |
| re_sub | pattern, repl | STRING/LIST_STRING | D().css('title').text().re_sub('(\w+)', 'wow') | Replaces the parts of the string that match the regex pattern. Works with STRING, LIST_STRING |
| index | int | STRING/DOCUMENT | D().css_all('a').index(0) | Takes the element by index. Works with LIST_DOCUMENT, LIST_STRING |
| first | | - | D().css_all('a').first | Alias for index(0) |
| last | | - | D().css_all('a').last | Alias for index(-1), or an equivalent implementation where negative indexes are not available |
| join | sep | STRING | D().css_all('a').text().join(', ') | Joins the collection into a string. Works with LIST_STRING |
| assert_in | str | NONE | D().css_all('a').attr('href').assert_in('example.com') | Checks that the string is in the collection. The checked argument must be LIST_STRING |
| assert_re | pattern | NONE | D().css('a').attr('href').assert_re('example.com') | Checks that the regex pattern is found. The checked argument must be STRING |
| assert_css | CSS query | NONE | D().assert_css('title') | Checks that the element matches the CSS query. The checked argument must be DOCUMENT |
| assert_xpath | XPATH query | NONE | D().assert_xpath('//title') | Checks that the element matches the XPath query. The checked argument must be DOCUMENT |
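
The string operators can be read as small pure functions applied in sequence. The sketch below emulates a few of them in plain Python. The function names and the exact semantics are assumptions made for illustration (for example, trimming is emulated with str.strip, which removes a set of characters); the code that ssc_codegen actually generates may differ.

```python
import re

# Hypothetical emulations of STRING operators, not generated ssc_codegen code.


def trim(value: str, chars: str) -> str:
    return value.strip(chars)  # assumption: character-set trimming


def ltrim(value: str, chars: str) -> str:
    return value.lstrip(chars)


def fmt(value: str, template: str) -> str:
    return template.replace("{{}}", value)  # {{}} marks the insertion point


def re_first(value: str, pattern: str) -> str:
    match = re.search(pattern, value)
    if match is None:
        raise ValueError(f"{pattern!r} not found")
    return match[1] if match.groups() else match[0]


def with_default(operation, default):
    """Emulates default(): return a fallback if the chain raises an error."""
    try:
        return operation()
    except Exception:
        return default


href = "//example.com/books"
url = fmt(ltrim(href, "/"), "https://{{}}")

title = trim("  Hello, World  ", " ")
first_word = re_first(title, r"(\w+)")             # like re
all_words = re.findall(r"(\w+)", title)            # like re_all
replaced = re.sub(r"(\w+)", "wow", title)          # like re_sub
joined = ", ".join(all_words)                      # like join
missing = with_default(lambda: re_first(title, r"(\d+)"), None)

print(url, first_word, all_words, replaced, joined, missing)
```

Wrapping the whole chain in one fallback, as with_default does here, matches the table's note that default has to come first: it guards every operation that follows it.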

