
scah (scan HTML)

World's fastest CSS Selector.

CSS selectors meet streaming XML/HTML parsing. Filter StAX events and build targeted DOMs without loading the entire document.

Packages: Crates.io, npm, PyPI

What is scah?

scah is a high-performance parsing library that bridges the gap between SAX/StAX streaming efficiency and DOM convenience. Instead of loading an entire document into memory or manually tracking parser state, you declare what you want with CSS selectors; the library handles the streaming complexity and builds a targeted DOM containing only your selections.

  • Streaming core: Built on StAX; constant memory regardless of document size
  • Familiar API: CSS selectors, including the child (>) and descendant (whitespace) combinators; the adjacent-sibling (+) and general-sibling (~) combinators are coming soon
  • Multi-language: Rust core with Python and TypeScript/JavaScript bindings
  • Composable queries: Chain selections and nest them with closures for structured querying; not only more efficient than flat filtering, but a fundamentally better pattern for extracting hierarchical data relationships
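The streaming idea behind the list above can be sketched without scah at all. The following Python example uses only the standard library's html.parser to pick out a[href] elements from a document that arrives in chunks, keeping just the matches instead of a full DOM. It is an illustration under stated assumptions, not scah's implementation: LinkCollector and collect_links are hypothetical names invented for this sketch.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Hypothetical helper: keeps only <a href=...> matches while streaming."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (text, href) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href")
        if tag == "a" and href is not None:
            self._href, self._text = href, []

    def handle_data(self, data):
        # Text is recorded only while a matching <a> is open.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text), self._href))
            self._href = None


def collect_links(chunks):
    """Feed the document piece by piece; only matches are retained."""
    parser = LinkCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.links


# The document is split at arbitrary points to mimic a stream.
chunks = ['<ul><li><a href="/one">On', 'e</a></li><li><a hr',
          'ef="/two">Two</a></li></ul>']
links = collect_links(chunks)
for text, href in links:
    print(f"{text}: {href}")
```

The hand-written state tracking (_href, _text) is the kind of bookkeeping that a selector-driven API is meant to replace.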

Quick Start

Rust

# Cargo.toml
[dependencies]
scah = "0.0.1"

Basic usage

use scah::{Query, Save, parse};

let html = r#"<ul><li><a href="/one">One</a></li><li><a href="/two">Two</a></li></ul>"#;

let queries = &[Query::all("a[href]", Save::all()).build()];
let store = parse(html, queries);

for a in store.get("a[href]").unwrap() {
    let href = a.attribute(&store, "href").unwrap();
    let text = a.text_content(&store).unwrap_or_default();
    println!("{text}: {href}");
}
// Output:
//   One: /one
//   Two: /two

Structured querying with .then()

Instead of flat filtering, nest queries with closures. Child queries only run within the context of their parent match:

use scah::{Query, Save, parse};

let query = Query::all("main > section", Save::all())
    .then(|section| [
        section.all("> a[href]", Save::all()),
        section.all("div a", Save::all()),
    ])
    .build();

// `html` is a &str holding the document to query.
let store = parse(html, &[query]);

// Access nested results through parent elements
for section in store.get("main > section").unwrap() {
    println!("Section: {}", section.inner_html.unwrap_or(""));

    if let Some(links) = section.get(&store, "> a[href]") {
        for link in links {
            println!("  Direct link: {}", link.attribute(&store, "href").unwrap());
        }
    }
}
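To make the scoping of nested queries concrete, here is a minimal Python sketch that reproduces the same three selections with the standard library's xml.etree.ElementTree instead of scah. It assumes a small, well-formed sample document (doc is made up for this illustration) and uses ElementTree's XPath subset as a stand-in for the CSS selectors: ./a[@href] plays the role of "> a[href]" and .//div//a plays the role of "div a".

```python
import xml.etree.ElementTree as ET

# Illustrative, well-formed sample document (not taken from the scah docs).
doc = """
<main>
  <section>
    <a href="/direct">Direct</a>
    <div><p><a href="/nested">Nested</a></p></div>
  </section>
  <section>
    <div><a href="/only-nested">Only nested</a></div>
  </section>
</main>
"""

root = ET.fromstring(doc)  # root is the <main> element

results = []
# Parent query: "main > section" (direct <section> children of <main>).
for section in root.findall("./section"):
    # Child queries are evaluated relative to each parent match only.
    direct = [a.get("href") for a in section.findall("./a[@href]")]  # "> a[href]"
    nested = [a.get("href") for a in section.findall(".//div//a")]   # "div a"
    results.append({"direct": direct, "nested": nested})

for i, r in enumerate(results, start=1):
    print(f"section {i}: direct={r['direct']} nested={r['nested']}")
```

Each link ends up attached to the section it belongs to, which is the hierarchical relationship a flat "select all links" pass would lose. Unlike scah, this sketch loads the whole document first; it only illustrates the scoping rules.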

Save options

Control what data is captured per selector:

Constructor                 inner_html   text_content   Use case
Save::all()                 Yes          Yes            Full extraction
Save::only_inner_html()     Yes          No             Raw markup only
Save::only_text_content()   No           Yes            Lightweight text scraping
Save::none()                No           No             Structure-only (attributes still saved)
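The table can also be read as two independent boolean flags. The Python sketch below encodes it that way; SaveFlags, PRESETS, and captured_fields are hypothetical names made up for this illustration and are not part of scah's API. It also assumes that attributes are captured for every preset, which the table states explicitly only for Save::none().

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SaveFlags:
    """Hypothetical stand-in for the two capture flags in the table."""
    inner_html: bool
    text_content: bool


# One entry per row of the table, keyed by the Rust constructor name.
PRESETS = {
    "Save::all()": SaveFlags(inner_html=True, text_content=True),
    "Save::only_inner_html()": SaveFlags(inner_html=True, text_content=False),
    "Save::only_text_content()": SaveFlags(inner_html=False, text_content=True),
    "Save::none()": SaveFlags(inner_html=False, text_content=False),
}


def captured_fields(flags):
    """List what a match would carry under the given flags."""
    fields = ["attributes"]  # assumption: attributes are always kept
    if flags.inner_html:
        fields.append("inner_html")
    if flags.text_content:
        fields.append("text_content")
    return fields


for name, flags in PRESETS.items():
    print(f"{name}: {', '.join(captured_fields(flags))}")
```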

Supported CSS selector syntax

Syntax                Example               Status
Tag name              a, div                Working
ID                    #my-id                Working
Class                 .my-class             Working
Descendant            main section a        Working
Child                 main > section        Working
Attribute presence    a[href]               Working
Attribute exact       a[href="url"]         Working
Attribute prefix      a[href^="https"]      Working
Attribute suffix      a[href$=".com"]       Working
Attribute substring   a[href*="example"]    Working
Adjacent sibling      h1 + p                Coming soon
General sibling       h1 ~ p                Coming soon
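The five attribute forms in the table differ only in how the attribute value is compared. The Python sketch below spells out those comparisons as defined by the CSS Selectors specification; attr_matches is a hypothetical helper written for this illustration and says nothing about how scah implements matching.

```python
def attr_matches(attrs, name, op=None, value=""):
    """Check one attribute selector against an element's attribute dict.

    op is None for presence ([name]), or one of "=", "^=", "$=", "*=".
    """
    if name not in attrs:
        return False
    actual = attrs[name]
    if op is None:   # [href]: the attribute merely has to exist
        return True
    if op == "=":    # [href="url"]: exact match
        return actual == value
    # Per the CSS spec, ^=, $= and *= never match an empty value.
    if value == "":
        return False
    if op == "^=":   # [href^="https"]: prefix
        return actual.startswith(value)
    if op == "$=":   # [href$=".com"]: suffix
        return actual.endswith(value)
    if op == "*=":   # [href*="example"]: substring
        return value in actual
    raise ValueError(f"unsupported operator: {op!r}")


link = {"href": "https://example.com"}
print(attr_matches(link, "href"))                   # presence
print(attr_matches(link, "href", "^=", "https"))    # prefix
print(attr_matches(link, "href", "$=", ".org"))     # suffix
```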

📖 Full API documentation: docs.rs/scah

Benchmarks

[Figure: Criterion benchmarks]

Python

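from scah import Query, Save, parse

# `html` is a string holding the document to query.
# The parentheses let the method chain span several lines.
query = (
    Query.all("main > section", Save.all())
    .then(lambda section: [
        section.all("> a[href]", Save.all()),
        section.all("div a", Save.all()),
    ])
    .build()
)

store = parse(html, [query])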

Benchmarks

Real HTML benchmark (html.spec.whatwg.org), selecting all a tags:

[Figure: WHATWG HTML spec benchmark]

Synthetic HTML benchmark, selecting all a tags:

[Figure: synthetic HTML benchmark]

TypeScript / JavaScript

import { Query, parse } from 'scah';

const query = Query.all('main > section', { innerHtml: true, textContent: true })
  .then((p) => [
    p.all('> a[href]', { innerHtml: true, textContent: true }),
    p.all('div a', { innerHtml: true, textContent: true }),
  ])
  .build();

const store = parse(html, [query]);
