scah (scan HTML)
World's fastest CSS Selector.
CSS selectors meet streaming XML/HTML parsing. Filter StAX events and build targeted DOMs without loading the entire document.
What is scah?
scah is a high-performance parsing library that bridges the gap between SAX/StAX streaming efficiency and DOM convenience. Instead of loading an entire document into memory or manually tracking parser state, you declare what you want with CSS selectors; the library handles the streaming complexity and builds a targeted DOM containing only your selections.
- Streaming core: Built on StAX; constant memory regardless of document size
- Familiar API: CSS selectors (including combinators like >, descendant (whitespace), + (coming soon), ~ (coming soon))
- Multi-language: Rust core with Python and TypeScript/JavaScript bindings
- Composable queries: Chain selections and nest them with closures for structured querying; not only more efficient than flat filtering, but a fundamentally better pattern for extracting hierarchical data relationships
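To make the streaming idea concrete, here is a minimal sketch in plain Python that uses only the standard library's html.parser. It is not scah's implementation or API: the AnchorCollector class and the select_links function are hypothetical names chosen for illustration. It shows how a selector such as a[href] can be answered from parser events alone, retaining only the matched elements instead of a full DOM.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (text, href) for every <a> that has an href attribute."""

    def __init__(self):
        super().__init__()
        self.results = []  # finished (text, href) pairs
        self._open = []    # one (matched, href, text_parts) entry per open <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attr_map = dict(attrs)
            # Track every <a> so end tags pair up; only some of them match.
            self._open.append(("href" in attr_map, attr_map.get("href"), []))

    def handle_data(self, data):
        # Text belongs to every <a> that is currently open.
        for _matched, _href, parts in self._open:
            parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._open:
            matched, href, parts = self._open.pop()
            if matched:
                self.results.append(("".join(parts), href))


def select_links(html):
    """Hypothetical helper: stream `html` once and return the a[href] matches."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.results


html = '<ul><li><a href="/one">One</a></li><li><a href="/two">Two</a></li></ul>'
for text, href in select_links(html):
    print(f"{text}: {href}")
```

The collector never stores the ul or li elements; state is limited to the anchors that are currently open plus the finished matches.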
Quick Start
Rust
# Cargo.toml
[dependencies]
scah = "0.0.1"
Basic usage
use scah::{Query, Save, parse};
let html = r#"<ul><li><a href="/one">One</a></li><li><a href="/two">Two</a></li></ul>"#;
let queries = &[Query::all("a[href]", Save::all()).build()];
let store = parse(html, queries);
for a in store.get("a[href]").unwrap() {
    let href = a.attribute(&store, "href").unwrap();
    let text = a.text_content(&store).unwrap_or_default();
    println!("{text}: {href}");
}
// Output:
// One: /one
// Two: /two
Structured querying with .then()
Instead of flat filtering, nest queries with closures. Child queries only run within the context of their parent match:
use scah::{Query, Save, parse};
// `html` is assumed to be a &str holding the document to scan.
let query = Query::all("main > section", Save::all())
    .then(|section| [
        section.all("> a[href]", Save::all()),
        section.all("div a", Save::all()),
    ])
    .build();
let store = parse(html, &[query]);
// Access nested results through parent elements
for section in store.get("main > section").unwrap() {
    println!("Section: {}", section.inner_html.unwrap_or(""));
    if let Some(links) = section.get(&store, "> a[href]") {
        for link in links {
            println!("  Direct link: {}", link.attribute(&store, "href").unwrap());
        }
    }
}
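The scoping rule (child queries only run inside a parent match) can be illustrated independently of scah. The following is a sketch in standard-library Python, not scah's implementation; SectionLinks and direct_links_per_section are hypothetical names. It evaluates `> a[href]` only while the parser is inside a `main > section` match, and groups the results per section.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they are kept off the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SectionLinks(HTMLParser):
    """For each `main > section`, records hrefs of its direct `> a[href]` children."""

    def __init__(self):
        super().__init__()
        self.stack = []             # names of currently open elements
        self.sections = []          # one list of hrefs per matched section
        self._section_depth = None  # stack depth of the open matched section

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        parent = self.stack[-1] if self.stack else None
        if self._section_depth is None:
            # Parent query: a section whose parent is main.
            if tag == "section" and parent == "main":
                self._section_depth = len(self.stack) + 1
                self.sections.append([])
        elif tag == "a" and len(self.stack) == self._section_depth:
            # Child query: only direct children of the matched section.
            attr_map = dict(attrs)
            if "href" in attr_map:
                self.sections[-1].append(attr_map["href"])
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID or tag not in self.stack:
            return
        # Pop up to and including the matching open tag.
        while self.stack.pop() != tag:
            pass
        if self._section_depth is not None and len(self.stack) < self._section_depth:
            self._section_depth = None  # left the matched section


def direct_links_per_section(html):
    parser = SectionLinks()
    parser.feed(html)
    parser.close()
    return parser.sections
```

Links nested deeper inside a section, or outside any matched section, are never tested against the child selector, which is the practical difference from filtering a flat list of all anchors afterwards.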
Save options
Control what data is captured per selector:
| Constructor | inner_html | text_content | Use case |
|---|---|---|---|
| Save::all() | Yes | Yes | Full extraction |
| Save::only_inner_html() | Yes | No | Raw markup only |
| Save::only_text_content() | No | Yes | Lightweight text scraping |
| Save::none() | No | No | Structure-only (attributes still saved) |
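As a rough illustration of what the two captured fields contain, the sketch below uses standard-library Python rather than scah. The capture function and its flag names are hypothetical, and scah's exact whitespace handling may differ. It assumes inner_html is the raw markup between an element's tags and text_content is the concatenated text of its descendants.

```python
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Gathers text nodes and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def capture(inner_markup, inner_html=True, text_content=True):
    """Hypothetical helper: keep only the fields the flags ask for."""
    record = {}
    if inner_html:
        record["inner_html"] = inner_markup
    if text_content:
        parser = _TextOnly()
        parser.feed(inner_markup)
        parser.close()
        record["text_content"] = "".join(parser.parts)
    return record
```

Calling capture with both flags off returns an empty record, which mirrors the structure-only row of the table: nothing is stored for the element's contents.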
Supported CSS selector syntax
| Syntax | Example | Status |
|---|---|---|
| Tag name | a, div | Working |
| ID | #my-id | Working |
| Class | .my-class | Working |
| Descendant | main section a | Working |
| Child | main > section | Working |
| Attribute presence | a[href] | Working |
| Attribute exact | a[href="url"] | Working |
| Attribute prefix | a[href^="https"] | Working |
| Attribute suffix | a[href$=".com"] | Working |
| Attribute substring | a[href*="example"] | Working |
| Adjacent sibling | h1 + p | Coming soon |
| General sibling | h1 ~ p | Coming soon |
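For readers less familiar with the attribute operators in the table, their standard CSS meaning can be sketched in a few lines of Python. This is an illustration of the selector semantics as defined by the CSS Selectors specification, not scah code; attr_matches is a hypothetical helper.

```python
def attr_matches(attrs, name, op=None, value=None):
    """Return True if the attribute dict `attrs` satisfies [name op "value"]."""
    if name not in attrs:
        return False
    if op is None:  # presence: a[href]
        return True
    actual = attrs[name] or ""  # a valueless attribute is treated as ""
    if op == "=":  # exact: a[href="url"]
        return actual == value
    if value == "":
        # Per the CSS spec, ^=, $= and *= with an empty value match nothing.
        return False
    if op == "^=":  # prefix: a[href^="https"]
        return actual.startswith(value)
    if op == "$=":  # suffix: a[href$=".com"]
        return actual.endswith(value)
    if op == "*=":  # substring: a[href*="example"]
        return value in actual
    raise ValueError(f"unsupported operator: {op}")
```

For example, an element with href set to https://example.com satisfies the presence, prefix, suffix, and substring rows of the table, but not a[href="url"].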
📖 Full API documentation: docs.rs/scah
Python
from scah import Query, Save, parse
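# `html` is assumed to be a str holding the document to scan.
# The outer parentheses let the method chain span several lines.
query = (
    Query.all("main > section", Save.all())
    .then(lambda section: [
        section.all("> a[href]", Save.all()),
        section.all("div a", Save.all()),
    ])
    .build()
)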
store = parse(html, [query])
Benchmarks
Real HTML benchmark (html.spec.whatwg.org), selecting all a tags: [figure]
Synthetic HTML benchmark, selecting all a tags: [figure]
TypeScript / JavaScript
import { Query, parse } from 'scah';
const query = Query.all('main > section', { innerHtml: true, textContent: true })
  .then((p) => [
    p.all('> a[href]', { innerHtml: true, textContent: true }),
    p.all('div a', { innerHtml: true, textContent: true }),
  ])
.build();
const store = parse(html, [query]);
File details
Details for the file scah-0.0.1-cp38-abi3-manylinux_2_34_x86_64.whl.
File metadata
- Download URL: scah-0.0.1-cp38-abi3-manylinux_2_34_x86_64.whl
- Upload date:
- Size: 259.9 kB
- Tags: CPython 3.8+, manylinux: glibc 2.34+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.12.6
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0110f597256a6cecd61fdb44c96e4dcb78387ed5d49006fa0b06d5248c636d00 |
| MD5 | 4083adb8fafe2da6752a7c486c8ed210 |
| BLAKE2b-256 | c1ac40d9bb8f54587f44abd71087fab8fb1653e71a1d60246c17b8961e2d0284 |