Converts a DSL into HTML parser code for web scraping

ssc-codegen

Code generator for web scraping parsers. Describe HTML extraction rules in a declarative KDL 2.0 DSL, then generate ready-to-use parser code for multiple languages and libraries.

.kdl schema --> [kdl parser] --> AST --> [linter] --> [converter] --> output code

Features

  • Declarative DSL based on KDL 2.0 syntax
  • Static type checking and linting before code generation
  • Multiple output targets: Python (bs4, lxml, parsel, selectolax), JavaScript (DOM API)
  • Struct types: item, list, dict, table, flat
  • LLM-friendly: system prompt + linter loop for AI-assisted schema generation

Install

uv tool install ssc_codegen

Quick example

books.kdl:

(list)struct Book {
    @split-doc { css-all ".product-card" }

    title { css ".title"; text }
    price { css ".price"; text; re #"(\d+\.\d+)"#; to-float }
    url   { css "a[href]"; attr "href"; fallback #null }
}

Generate Python parser:

ssc-gen generate books.kdl -t py-bs4 -o ./output
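
To make the field pipelines concrete, here is a minimal Python sketch of what the post-selection steps of the price rule (text, re, to-float) and the fallback on the url rule amount to, as read from the schema above. It is hand-written for illustration and is not the code ssc-gen generates. The function names are hypothetical, and raising an error when the pattern does not match is an assumption.

```python
import re
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

PRICE_RE = re.compile(r"(\d+\.\d+)")  # same pattern as in books.kdl


def parse_price(text: str) -> float:
    """Mimic `text; re ...; to-float` on already-extracted element text."""
    match = PRICE_RE.search(text)
    if match is None:
        # Assumption: a non-matching pattern is treated as an error.
        raise ValueError(f"no price found in {text!r}")
    return float(match.group(1))


def with_fallback(step: Callable[[], T], default: Optional[T] = None) -> Optional[T]:
    """Mimic `fallback #null`: return `default` if the step fails."""
    try:
        return step()
    except (ValueError, KeyError):
        return default


print(parse_price("£51.77"))                 # 51.77
attrs = {}  # stands in for an <a> element with no href attribute
print(with_fallback(lambda: attrs["href"]))  # None
```

The point of the sketch is that each field is a left-to-right chain of small steps, and a fallback replaces the whole chain's result when any step fails.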

Usage

Generate code

# single file
ssc-gen generate schema.kdl -t py-bs4 -o ./output

# all .kdl files in a directory
ssc-gen generate examples/ -t js-pure -o ./output

# with custom package name
ssc-gen generate schema.kdl -t py-bs4 -o ./parsers --package scraper

Targets: py-bs4, py-lxml, py-parsel, py-slax, js-pure
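
If one schema should be generated for every target, the generate command above can be driven from a small script. The following is a sketch only: the helper names and the one-directory-per-target layout are choices of this example, not something ssc-gen prescribes. It uses only the flags shown above.

```python
import subprocess
from typing import List

TARGETS = ["py-bs4", "py-lxml", "py-parsel", "py-slax", "js-pure"]


def generate_cmd(schema: str, target: str, out_root: str) -> List[str]:
    """Build the `ssc-gen generate` argv for one target."""
    # One output directory per target is a layout choice of this sketch.
    return ["ssc-gen", "generate", schema, "-t", target, "-o", f"{out_root}/{target}"]


def generate_all(schema: str, out_root: str = "./output") -> None:
    """Run the generator once per target, stopping at the first failure."""
    for target in TARGETS:
        # check=True raises CalledProcessError on a non-zero exit code.
        subprocess.run(generate_cmd(schema, target, out_root), check=True)


# Print the commands without running them.
for t in TARGETS:
    print(" ".join(generate_cmd("schema.kdl", t, "./output")))
```

Calling generate_all("schema.kdl") runs the commands; it requires ssc-gen to be installed and available in PATH.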

Lint schemas

# human-readable output
ssc-gen check schema.kdl

# JSON output (for LLM pipelines)
ssc-gen check schema.kdl -f json

# check all files in a directory
ssc-gen check examples/

Test schema against HTML

# from file
ssc-gen run examples/booksToScrape.kdl:MainCatalogue -t py-bs4 -i page.html

# from stdin
curl https://books.toscrape.com/ | ssc-gen run examples/booksToScrape.kdl:MainCatalogue -t py-bs4

Health check (verify selectors match elements)

# from file
ssc-gen health examples/booksToScrape.kdl:MainCatalogue -i page.html

# from stdin
curl https://books.toscrape.com/ | ssc-gen health examples/booksToScrape.kdl:MainCatalogue

Documentation

LLM integration

LLM agents can generate and validate .kdl schemas automatically using the linter feedback loop.

In chats (ChatGPT, Claude, etc.)

Use SYSTEM_PROMPT.md as the system prompt. After generation, run ssc-gen check -f json and send any reported errors back to the LLM for correction.
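
The check-and-correct loop described here can be automated. The sketch below is one possible shape, under stated assumptions: ask_llm is a hypothetical callback you supply, exit code 0 is assumed to mean the schema is clean, and the linter report is passed to the LLM as raw text because the JSON layout is not documented here.

```python
import subprocess
from typing import Callable, Tuple

# (exit code, stdout) returned by one linter run
CheckResult = Tuple[int, str]


def run_check(schema_path: str) -> CheckResult:
    """Run `ssc-gen check <schema> -f json` and capture its output."""
    proc = subprocess.run(
        ["ssc-gen", "check", schema_path, "-f", "json"],
        capture_output=True,
        text=True,
    )
    return proc.returncode, proc.stdout


def fix_loop(
    schema_path: str,
    ask_llm: Callable[[str], str],
    check: Callable[[str], CheckResult] = run_check,
    max_rounds: int = 3,
) -> bool:
    """Re-check the schema until the linter passes or rounds run out.

    `ask_llm` (hypothetical) receives the linter report and returns
    corrected schema text, which overwrites the file.
    Assumption: exit code 0 means the schema is clean.
    """
    for _ in range(max_rounds):
        code, report = check(schema_path)
        if code == 0:
            return True
        fixed = ask_llm(report)
        with open(schema_path, "w", encoding="utf-8") as fh:
            fh.write(fixed)
    # Final verdict after the last correction.
    return check(schema_path)[0] == 0
```

The check parameter is injectable so the loop can be exercised without ssc-gen installed. Capping the number of rounds keeps a confused model from looping forever.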

In AI-powered IDEs (Claude Code, Cursor, etc.)

Use the kdl-schema-dsl skill for automatic generation, validation, and iteration.

Development

uv sync                  # install dependencies
uv build --wheel         # build wheel
uv run pytest            # run tests
uv run ruff check ssc_codegen/

Test dependencies

Python tests require only uv sync. JS integration tests additionally need:

npm install      # installs jsdom (dev dependency in package.json)

Node.js must be installed and available as node in PATH. JS tests are automatically skipped if Node.js is not found.
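
For readers unfamiliar with this kind of conditional skip, the sketch below shows the general technique using only the Python standard library. It is not the project's actual test setup (which runs under pytest); the class and test names are hypothetical.

```python
import shutil
import subprocess
import unittest

# True when a `node` executable can be found on PATH.
NODE_AVAILABLE = shutil.which("node") is not None


@unittest.skipUnless(NODE_AVAILABLE, "Node.js not found in PATH")
class JsIntegrationTest(unittest.TestCase):
    def test_node_reports_version(self):
        # Only runs when Node.js is present; otherwise the class is skipped.
        out = subprocess.run(["node", "--version"], capture_output=True, text=True)
        self.assertEqual(out.returncode, 0)
```

A skipped test is reported as skipped rather than failed, so the Python suite stays green on machines without Node.js.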
