Python bindings around rust-scraper/scraper with PyO3

scraper-rs

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via sxd_html/sxd_xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]

For a runnable sample, see examples/demo.py.

Async usage

The scraper_rs.asyncio module wraps the top-level helpers to keep the event loop responsive. The parse helper yields to the event loop between operations, while select and xpath run in a thread pool:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    doc = await scraping_async.parse(html)
    links = await scraping_async.select(html, "a[href]")
    print(doc.select_first(".item").text)  # First
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.).
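
To illustrate the thread-pool idea in isolation, here is a minimal stdlib-only sketch. It does not show how scraper_rs.asyncio is implemented internally; blocking_select_hrefs and select_hrefs are hypothetical stand-ins built on html.parser, used only to show how a blocking parse-and-select step can be moved off the event loop with asyncio.to_thread.

```python
import asyncio
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Hypothetical stand-in for a blocking parse + select step."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        # Collect the href value of every <a> start tag.
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href" and value)


def blocking_select_hrefs(html: str) -> list[str]:
    """Synchronous work that would otherwise block the event loop."""
    collector = HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


async def select_hrefs(html: str) -> list[str]:
    # asyncio.to_thread runs the blocking call in the default thread pool,
    # so other coroutines keep running while the HTML is processed.
    return await asyncio.to_thread(blocking_select_hrefs, html)


async def main() -> list[list[str]]:
    pages = [
        "<a href='/a'>A</a>",
        "<p><a href='/b'>B</a> <a href='/c'>C</a></p>",
    ]
    # gather preserves the order of the inputs in its result list.
    return await asyncio.gather(*(select_hrefs(page) for page in pages))


results = asyncio.run(main())
print(results)  # [['/a'], ['/b', '/c']]
```

The same shape applies when several documents are processed concurrently: each call is awaited independently, and the event loop stays free for network I/O in the meantime.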

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Will only find links in the first 100KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
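
The two behaviours described above (reject oversized input, or cut it at a safe UTF-8 boundary) can be sketched in pure Python. This is an illustration of the idea, not the library's actual implementation: apply_size_limit and truncate_utf8 are hypothetical names, and the ValueError is an assumption, since the exception type scraper_rs raises is not stated here.

```python
def truncate_utf8(data: bytes, max_size_bytes: int) -> bytes:
    """Cut data to at most max_size_bytes without splitting a UTF-8 character."""
    if len(data) <= max_size_bytes:
        return data
    end = max_size_bytes
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx. If the byte at
    # the cut position is a continuation byte, the cut would land inside a
    # multi-byte character, so step back to that character's first byte.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


def apply_size_limit(html: str, max_size_bytes: int, truncate_on_limit: bool = False) -> str:
    """Return html unchanged if it fits, else reject it or truncate it."""
    raw = html.encode("utf-8")
    if len(raw) <= max_size_bytes:
        return html
    if not truncate_on_limit:
        # Assumption: the real library may raise a different exception type.
        raise ValueError(f"HTML is {len(raw)} bytes; limit is {max_size_bytes}")
    return truncate_utf8(raw, max_size_bytes).decode("utf-8")


# "é" takes two bytes in UTF-8, so a 2-byte limit cannot keep half of it.
print(apply_size_limit("héllo", 2, truncate_on_limit=True))  # h
print(apply_size_limit("héllo", 3, truncate_on_limit=True))  # hé
```

Note that the limit is measured in encoded bytes, not characters, which is why the truncated result can be shorter than max_size_bytes.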

API highlights

  • Document(html: str) / Document.from_html(html) parses once and keeps the DOM.
  • .select(css) → list[Element]; .select_first(css) / .find(css) → first Element | None; .css(css) is an alias for .select(css).
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .text returns normalized text; .html returns the element's HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers to keep the event loop responsive.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Top-level helpers mirror the class methods: parse(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have the just command runner installed, the repo includes helper recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Projects Using scraper-rs

  • silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, and pytest for tests.

  • Run tests: just test or uv run pytest tests/test_scraper.py
  • Format/typing: The codebase is small; formatters are not strictly enforced yet.
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.

Download files

Source Distribution

  • scraper_rust-0.2.21.tar.gz (41.3 kB): Source

Built Distributions

  • scraper_rust-0.2.21-cp314-cp314t-win_amd64.whl (700.4 kB): CPython 3.14t, Windows x86-64
  • scraper_rust-0.2.21-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (837.0 kB): CPython 3.14t, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.2.21-cp314-cp314t-macosx_11_0_arm64.whl (726.6 kB): CPython 3.14t, macOS 11.0+, ARM64
  • scraper_rust-0.2.21-cp313-cp313t-win_amd64.whl (700.3 kB): CPython 3.13t, Windows x86-64
  • scraper_rust-0.2.21-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (836.9 kB): CPython 3.13t, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.2.21-cp313-cp313t-macosx_11_0_arm64.whl (726.4 kB): CPython 3.13t, macOS 11.0+, ARM64
  • scraper_rust-0.2.21-cp310-abi3-win_amd64.whl (703.3 kB): CPython 3.10+, Windows x86-64
  • scraper_rust-0.2.21-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (826.6 kB): CPython 3.10+, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.2.21-cp310-abi3-macosx_11_0_arm64.whl (729.1 kB): CPython 3.10+, macOS 11.0+, ARM64

File details

Details for the file scraper_rust-0.2.21.tar.gz.

File metadata

  • Download URL: scraper_rust-0.2.21.tar.gz
  • Upload date:
  • Size: 41.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scraper_rust-0.2.21.tar.gz
Algorithm Hash digest
SHA256 7dcd1cc8250fc2f7eb2648ccd41e9fcbec995ccebfa1b28f1a8a900eddbf5e13
MD5 5cea464121813df4c0881c0c27f20ed9
BLAKE2b-256 f40bfed2600d2467972a19576deef627d8e4963c2b4c5d8b532105d66952bdb0

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.2.21.tar.gz:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
