Python bindings around rust-scraper/scraper with PyO3

scraper-rs

Python bindings for the Rust scraper crate via PyO3. It gives you a lightweight Document/Element API with CSS selectors, XPath (via sxd_html/sxd_xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]

For a runnable sample, see examples/demo.py.

Async usage

The scraper_rs.asyncio module provides async wrappers around the top-level helpers so that scraping work does not block the event loop. The parse wrapper yields to the loop and keeps the Document in the current thread, while the select and xpath wrappers run in a thread pool:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    doc = await scraping_async.parse(html)
    links = await scraping_async.select(html, "a[href]")
    print(doc.select_first(".item").text)  # First
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.).
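
The thread-pool offloading described above can be pictured with the standard library alone. This is a minimal sketch, not the scraper_rs implementation: count_links is a hypothetical stand-in for a blocking parse-and-select call, and asyncio.to_thread is an assumed offloading mechanism.

```python
import asyncio
import re


def count_links(html: str) -> int:
    """Hypothetical blocking helper standing in for a parse-and-select call."""
    return len(re.findall(r"<a\s[^>]*href=", html))


async def count_links_async(html: str) -> int:
    # Run the blocking call in the default thread pool so the event loop
    # can keep servicing other tasks in the meantime.
    return await asyncio.to_thread(count_links, html)


async def main() -> list[int]:
    pages = [
        "<a href='/a'>First</a>",
        "<a href='/a'>First</a><a href='/b'>Second</a>",
    ]
    # Both pages are processed concurrently; gather keeps the input order.
    return await asyncio.gather(*(count_links_async(page) for page in pages))


if __name__ == "__main__":
    print(asyncio.run(main()))  # [1, 2]
```

The same shape applies to any CPU-bound or blocking helper: the coroutine only awaits the worker thread, so other tasks on the loop keep running.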

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Will only find links in the first 100KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
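
To make the interaction between max_size_bytes and truncate_on_limit concrete, here is a pure-Python sketch of the behaviour described above. It is an illustration only, not the library's Rust code: the function name apply_size_limit and the ValueError are assumptions, and the real exception type may differ.

```python
def apply_size_limit(
    html: str, max_size_bytes: int, truncate_on_limit: bool = False
) -> str:
    """Return html unchanged, truncated, or raise, based on its UTF-8 size."""
    data = html.encode("utf-8")
    if len(data) <= max_size_bytes:
        return html
    if not truncate_on_limit:
        # Assumption: the real library may raise a different exception type.
        raise ValueError(f"HTML is {len(data)} bytes; limit is {max_size_bytes}")
    cut = max_size_bytes
    # UTF-8 continuation bytes have the form 0b10xxxxxx. Step back until the
    # cut lands on the first byte of a character, so no character is split.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


if __name__ == "__main__":
    # "é" occupies two bytes, so a 2-byte budget cannot include it.
    print(apply_size_limit("héllo", 2, truncate_on_limit=True))  # h
    print(apply_size_limit("héllo", 3, truncate_on_limit=True))  # hé
```

Because the cut can move backwards, the truncated result may be a few bytes shorter than max_size_bytes.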

API highlights

Document(html: str) / Document.from_html(html) parses once and keeps the DOM.
.select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
.xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
.text returns normalized text; .html returns the original input.
scraper_rs.asyncio exposes async parse/select/xpath wrappers to keep the event loop responsive.
Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
Top-level helpers mirror the class methods: parse(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
Call doc.close() (or use with Document(html) as doc: ...) to free parsed DOM resources when you're done.

Installation

Prebuilt wheels target the abi3 stable ABI (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have the just command runner installed, the repo includes recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Projects Using scraper-rs

silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, and pytest for tests.

Run tests: just test or uv run pytest tests/test_scraper.py
Format/typing: the Rust and Python code bases are small; no formatters are enforced yet.
The PyO3 module name is scraper_rs; the Rust crate is built as a cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.

Release history

0.5.1 - May 15, 2026
0.5.0 - May 15, 2026
0.4.3 - Mar 24, 2026
0.4.2 - Mar 24, 2026
0.4.1 - Mar 17, 2026
0.4.0 - Mar 16, 2026
0.3.3 - Mar 12, 2026
0.3.2 - Feb 22, 2026
0.3.1 - Feb 11, 2026
0.3.0 - Feb 11, 2026
0.2.32 - Feb 10, 2026
0.2.31 - Feb 10, 2026
0.2.29 - Feb 10, 2026
0.2.28 - Feb 9, 2026
0.2.27 - Jan 1, 2026
0.2.22 - Dec 22, 2025
0.2.21 - Dec 20, 2025
0.2.20 - Dec 20, 2025
0.2.19 - Dec 20, 2025
0.2.17 - Dec 20, 2025
0.2.16 - Dec 15, 2025
0.2.14 - Dec 10, 2025
0.2.13 - Dec 10, 2025
0.2.12 - Dec 10, 2025 (this version)
0.2.11 - Dec 10, 2025
0.2.10 - Dec 10, 2025
0.2.9 - Dec 9, 2025
0.2.8 - Dec 9, 2025
0.2.7 - Dec 9, 2025
0.2.6 - Dec 9, 2025
0.2.5 - Dec 9, 2025
0.2.4 - Dec 9, 2025
0.2.3 - Dec 8, 2025
0.2.2 - Dec 8, 2025
0.2.1 - Dec 8, 2025
0.1.3 - Dec 8, 2025

Download files

Source Distribution

scraper_rust-0.2.12.tar.gz (32.0 kB)

Uploaded Dec 10, 2025. Tags: Source

Built Distribution

scraper_rust-0.2.12-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (739.6 kB)

Uploaded Dec 10, 2025. Tags: CPython 3.10+, manylinux: glibc 2.17+ x86-64
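
The wheel file name above encodes the interpreter, ABI, and platform tags. As a rough sketch of how to read it, the hypothetical helper below (split_wheel_filename is not part of scraper_rs or pip) splits a name following the standard {distribution}-{version}-{python tag}-{abi tag}-{platform tag}.whl layout. It assumes no optional build tag is present.

```python
def split_wheel_filename(filename: str) -> dict[str, object]:
    """Split a wheel file name into its tagged components (no build tag)."""
    stem = filename.removesuffix(".whl")
    parts = stem.split("-")
    distribution, version = parts[0], parts[1]
    # The last three components are always python, abi, and platform tags.
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "distribution": distribution,
        "version": version,
        "python_tag": python_tag,
        "abi_tag": abi_tag,
        # A dot separates multiple compatible platform tags.
        "platforms": platform_tag.split("."),
    }


if __name__ == "__main__":
    name = (
        "scraper_rust-0.2.12-cp310-abi3-"
        "manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
    )
    info = split_wheel_filename(name)
    print(info["python_tag"], info["abi_tag"])  # cp310 abi3
    print(info["platforms"])
```

Here cp310 together with abi3 means the wheel works on CPython 3.10 and later, and the two platform tags are equivalent spellings of the glibc 2.17+ x86-64 baseline.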

File details

Details for the file scraper_rust-0.2.12.tar.gz.

File metadata

Download URL: scraper_rust-0.2.12.tar.gz
Upload date: Dec 10, 2025
Size: 32.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.10.2

File hashes

Hashes for scraper_rust-0.2.12.tar.gz
Algorithm	Hash digest
SHA256	`88bdcdaa6d075acec887bfcfe9dc7314496b07a24729dd54df2729aa60529035`
MD5	`10f36e9384ea93852e779272d41c4520`
BLAKE2b-256	`f901e7d144a559b6e834b7838ea092626ca1eb00cea0c4463e4f604af7a0b431`
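
A downloaded file can be checked against the SHA256 digest listed above using only the standard library. The helper name sha256_matches and the chunk size are illustrative choices, and the demo hashes a temporary file rather than the real archive.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_matches(path: Path, expected_hex: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file's SHA256 digest equals expected_hex."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()


if __name__ == "__main__":
    # Demo on a throwaway file; for a real check, pass the path of the
    # downloaded archive and the digest published on the project page.
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "sample.bin"
        sample.write_bytes(b"example payload")
        expected = hashlib.sha256(b"example payload").hexdigest()
        print(sha256_matches(sample, expected))  # True
```

If you install with pip, its --require-hashes mode performs an equivalent check automatically for pinned requirements.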

File details

Details for the file scraper_rust-0.2.12-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

Download URL: scraper_rust-0.2.12-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Upload date: Dec 10, 2025
Size: 739.6 kB
Tags: CPython 3.10+, manylinux: glibc 2.17+ x86-64
Uploaded using Trusted Publishing? No
Uploaded via: maturin/1.10.2

File hashes

Hashes for scraper_rust-0.2.12-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm	Hash digest
SHA256	`230c30119759111f4ebb44bd163be5d6f2cb69720335979b72e4e8b5b5894f75`
MD5	`2a7def4dc856764be28f028a43a28d74`
BLAKE2b-256	`fc9f7c43e286f5529711ec8a151587b4a63880e43c31dc14491e320c61413d40`
