Skip to main content

Python bindings around rust-scraper/scraper with PyO3

Project description

scraper-rs

PyPI - Version Tests PyPI Downloads

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via xee-xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, prettify, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first_item = items[0].select("a[href]")
print([link.attr("href") for link in links_within_first_item])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]
print(prettify(html))

For runnable samples, see examples/demo.py and examples/demo_prettify_url.py. Quick URL prettify demo:

python examples/demo_prettify_url.py https://example.com --max-lines 80

Async usage

The scraper_rs.asyncio module exposes an async-first surface for coroutine code. AsyncDocument stores shareable HTML/text state instead of a thread-affine sync Document, and all selectors are awaitable for consistent async calling style:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    async with await scraping_async.parse(html) as doc:
        items = await doc.select(".item")
        first_link = await items[0].select_first("a[href]")
        print(first_link.text)  # First

    links = await scraping_async.select(html, "a[href]")
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.). AsyncDocument supports async with for automatic cleanup in coroutine code, and AsyncElement / AsyncDocument both expose async .prettify() helpers.

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Will only find links in the first 100KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.

API highlights

  • Document(html: str) / Document.from_html(html) parse HTML for CSS and keep the DOM; XPath parsing is initialized lazily on first XPath query.
  • .select(css)list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .prettify() renders the current DOM as an indented string for readable output/debugging.
  • .text returns normalized text; Document.html is the original input HTML; Element.html is inner HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers plus awaitable AsyncDocument / AsyncElement methods.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Elements also expose .prettify() to format element HTML with indentation.
  • Top-level helpers mirror the class methods: parse(html), prettify(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • Parse-tree helpers parse_document(html) and parse_fragment(html) return html5ever-style dictionaries for callers that need structural node data.
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
  • In async workflows, use async with await scraper_rs.asyncio.parse(html) as doc: ... for automatic AsyncDocument cleanup.

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have just installed, the repo includes helpers: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Projects Using scraper-rs

Development

Requirements: Rust toolchain, Python 3.10+, uv, maturin, pytest, and pytest-asyncio for tests.

  • Run tests: just test or uv run pytest tests/
  • Format code: just fmt (or cargo fmt --all and uv run ruff format)
  • Lint Rust: just lint (or cargo clippy --all-targets --all-features -- -D warnings)
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend the relevant tests, type stubs, docs, and examples accordingly.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scraper_rust-0.5.0.tar.gz (87.3 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

scraper_rust-0.5.0-cp310-abi3-win_amd64.whl (2.4 MB view details)

Uploaded CPython 3.10+Windows x86-64

scraper_rust-0.5.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB view details)

Uploaded CPython 3.10+manylinux: glibc 2.17+ x86-64

scraper_rust-0.5.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.5 MB view details)

Uploaded CPython 3.10+manylinux: glibc 2.17+ ARM64

scraper_rust-0.5.0-cp310-abi3-macosx_11_0_arm64.whl (2.3 MB view details)

Uploaded CPython 3.10+macOS 11.0+ ARM64

File details

Details for the file scraper_rust-0.5.0.tar.gz.

File metadata

  • Download URL: scraper_rust-0.5.0.tar.gz
  • Upload date:
  • Size: 87.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scraper_rust-0.5.0.tar.gz
Algorithm Hash digest
SHA256 41f8a1f3b50ce74f3da3f46606c8dacab003c5e04756918dd85d6fac9c833ec9
MD5 68b77ec288d70a306671cceb72d2209c
BLAKE2b-256 1f0fedebd02b2066cdf62f114db186599fa2158fca3e782b69f0155fc8624cee

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.5.0.tar.gz:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.5.0-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: scraper_rust-0.5.0-cp310-abi3-win_amd64.whl
  • Upload date:
  • Size: 2.4 MB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scraper_rust-0.5.0-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 6b6b4fa66bd6fe9024f772784638d6d388ffe3a3aefc4b114137fc1039f61784
MD5 85ef649939f27cd961801a71f31985ea
BLAKE2b-256 1599aef6bf327cdcf1b4102e2862b067d062c93dc18cd46a3b8cb2e1c97577e6

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.5.0-cp310-abi3-win_amd64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.5.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.5.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 2053ce49fa253ce881d70600373cf4f0919d4b0cd8b8ef206c548f6a6ffd7b6b
MD5 63d407ca0d2b598e2aa44a0b50378d7f
BLAKE2b-256 1286a35502fb06f92841c320ae360a94ba21e4c2d10ac04b65ded6d64aaf1572

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.5.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.5.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.5.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 b3bcbb1026cc133157e6c6ac05946eef1efaffe327f5980f2564b9448838a9d7
MD5 0c4da5413c236104f3305ebb5c52c5d3
BLAKE2b-256 ed57dd7063964058a74f78b688bc16779dcd39396c494a6d833620e3849e75b1

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.5.0-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.5.0-cp310-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.5.0-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 6531d3059e305d3971cda1ce62a5b5c29b5dde712c9967f5f40e8e9a7bdcc6f4
MD5 ebe3ca0fe3c14897abcfd145edd213a9
BLAKE2b-256 9f47905b839bd0e3d50ad9eb417bd357ab8499b955d8208f65ef8f4f0d2fd043

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.5.0-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page