Skip to main content

Python bindings around rust-scraper/scraper with PyO3

Project description

scraper-rs

PyPI - Version Tests PyPI Downloads

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via xee-xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, prettify, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]
print(prettify(html))

For runnable samples, see examples/demo.py and examples/demo_prettify_url.py. Quick URL prettify demo:

python examples/demo_prettify_url.py https://example.com --max-lines 80

Async usage

The scraper_rs.asyncio module exposes an async-first surface for coroutine code. AsyncDocument stores shareable HTML/text state instead of a thread-affine sync Document, and all selectors are awaitable for consistent async calling style:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    async with await scraping_async.parse(html) as doc:
        items = await doc.select(".item")
        first_link = await items[0].select_first("a[href]")
        print(first_link.text)  # First

    links = await scraping_async.select(html, "a[href]")
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.). AsyncDocument supports async with for automatic cleanup in coroutine code, and AsyncElement / AsyncDocument both expose async .prettify() helpers.

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Will only find links in the first 100KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.

API highlights

  • Document(html: str) / Document.from_html(html) parse HTML for CSS and keep the DOM; XPath parsing is initialized lazily on first XPath query.
  • .select(css)list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .prettify() renders the current DOM as an indented string for readable output/debugging.
  • .text returns normalized text; Document.html is the original input HTML; Element.html is inner HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers plus awaitable AsyncDocument / AsyncElement methods.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Elements also expose .prettify() to format element HTML with indentation.
  • Top-level helpers mirror the class methods: parse(html), prettify(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
  • In async workflows, use async with await scraper_rs.asyncio.parse(html) as doc: ... for automatic AsyncDocument cleanup.

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have just installed, the repo includes helpers: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Projects Using scraper-rs

  • silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, pytest, and pytest-asyncio for tests.

  • Run tests: just test or uv run pytest tests/
  • Format code: just fmt (or cargo fmt --all and uv run ruff format)
  • Lint Rust: just lint (or cargo clippy --all-targets --all-features -- -D warnings)
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scraper_rust-0.4.0.tar.gz (81.4 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

scraper_rust-0.4.0-cp314-cp314t-win_amd64.whl (2.3 MB view details)

Uploaded CPython 3.14tWindows x86-64

scraper_rust-0.4.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB view details)

Uploaded CPython 3.14tmanylinux: glibc 2.17+ x86-64

scraper_rust-0.4.0-cp314-cp314t-macosx_11_0_arm64.whl (2.2 MB view details)

Uploaded CPython 3.14tmacOS 11.0+ ARM64

scraper_rust-0.4.0-cp313-cp313t-win_amd64.whl (2.3 MB view details)

Uploaded CPython 3.13tWindows x86-64

scraper_rust-0.4.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB view details)

Uploaded CPython 3.13tmanylinux: glibc 2.17+ x86-64

scraper_rust-0.4.0-cp313-cp313t-macosx_11_0_arm64.whl (2.2 MB view details)

Uploaded CPython 3.13tmacOS 11.0+ ARM64

scraper_rust-0.4.0-cp310-abi3-win_amd64.whl (2.3 MB view details)

Uploaded CPython 3.10+Windows x86-64

scraper_rust-0.4.0-cp310-abi3-macosx_11_0_arm64.whl (2.2 MB view details)

Uploaded CPython 3.10+macOS 11.0+ ARM64

scraper_rust-0.4.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB view details)

Uploaded CPython 3.8manylinux: glibc 2.17+ x86-64

File details

Details for the file scraper_rust-0.4.0.tar.gz.

File metadata

  • Download URL: scraper_rust-0.4.0.tar.gz
  • Upload date:
  • Size: 81.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scraper_rust-0.4.0.tar.gz
Algorithm Hash digest
SHA256 2b91d080d52ad5e00a15ecdfa89fdb4a944d08869e947684949f81ac131b59c2
MD5 4e185a2f11220dd3f11d1611005d4a03
BLAKE2b-256 8fc012d62af764e5808369424bbb29a69022419e20f43f5cc320992a03ba2a77

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.4.0.tar.gz:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.4.0-cp314-cp314t-win_amd64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.4.0-cp314-cp314t-win_amd64.whl
Algorithm Hash digest
SHA256 b57fee679efa3a7023cdcbd462ad47630c3e4aebf06fd54184a6103047705261
MD5 934bcdc15cdfdb4f3a6d4aac415170df
BLAKE2b-256 866c3352ed7f50efaaabb434cfc2df424c5859019c1187a6f17f460d06db391d

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.4.0-cp314-cp314t-win_amd64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.4.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.4.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 55c245568659534eb8faf15958faedcac5adf57e8fc26294024bc0d0b1db6cd3
MD5 c414594a89cc76d908daf98917440797
BLAKE2b-256 3fc7b7702934bba86a9006593de9ff947d90b7cfd983b42fbce9b3168e5b0653

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.4.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.4.0-cp314-cp314t-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.4.0-cp314-cp314t-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 6485e81e88c97a6b8444aaaf019f9ab0532c269a62231add6204544067eaea7f
MD5 758d709dbca480ec5c3ea0d789c3aca5
BLAKE2b-256 803effbec10a40f5af1da03d07a0a6c8e96d2da200d1294f91964d3727d41505

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.4.0-cp314-cp314t-macosx_11_0_arm64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.4.0-cp313-cp313t-win_amd64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.4.0-cp313-cp313t-win_amd64.whl
Algorithm Hash digest
SHA256 3435eb2c83373e9e951af45d6334582e343f93a7b5cfe24a7f0f777dd26714fb
MD5 440b84d23d4bb4797cfa887c5a06f862
BLAKE2b-256 df332e82e8fa82208ea1b4414b0e0035a0937a9d3d0880bfc41365c7a3717534

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.4.0-cp313-cp313t-win_amd64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.4.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.4.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 fb3687bdf68889e9ef6563413934eb10d70931537b19a31f5f22845a89602d79
MD5 1049b811296873be9c4d828e135be244
BLAKE2b-256 3460fc61714b03e3e7ec5c4a59f415f23315fe989b228a080b298ef3afa3df7b

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.4.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.4.0-cp313-cp313t-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.4.0-cp313-cp313t-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 82e22c9179aa07e00c2bac5464ad7a0730d3b411c53350104b88aae445747b2e
MD5 20f8ff4cf9323c47a55b4fecb64ecea1
BLAKE2b-256 1eef7f022edcb2473be1b6ee5a2e71ea33e6669dd22bfbf426a680031d16bb43

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.4.0-cp313-cp313t-macosx_11_0_arm64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.4.0-cp310-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.4.0-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 3cad93d9f6412dbe31d0fd29e96d57d692d02bbe876b9957fca183e16da8ff65
MD5 fd8f286f4a54cdfd1b26bad2575858fd
BLAKE2b-256 132b9667771a2d168a4056894f9e8a826ddd181033273e09e26679b8122155bd

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.4.0-cp310-abi3-win_amd64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.4.0-cp310-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.4.0-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 0af93b9b05ca0d26151a8835f171644e6c587d45572f0b22c2639695448aea28
MD5 4e8906fa2afc329d73bd7c345a866cc3
BLAKE2b-256 398e109273b89d7e8c747c7e4d317ea5abc18e066d96fa5aec89ac2c6131bf0c

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.4.0-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.4.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.4.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 308c2fc64d717be61906e6087459e98f688dac2a7180d3afb6130fc2c0dd4c23
MD5 ae674cf35c066fb78e59acb942d54f72
BLAKE2b-256 fec030c683087890dd87b0799bd0bb573bd03451c828c2568d8f1597209ea9ca

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.4.0-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page