Python bindings around rust-scraper/scraper with PyO3

scraper-rs

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via sxd_html/sxd_xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]

For a runnable sample, see examples/demo.py.

Async usage

The scraper_rs.asyncio module wraps the top-level helpers to keep the event loop responsive. parse yields to the event loop between operations, while select/xpath run in a thread pool. Parsed documents and elements are wrapped with awaitable selector methods for nested queries:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    async with await scraping_async.parse(html) as doc:
        items = await doc.select(".item")
        first_link = await items[0].select_first("a[href]")
        print(first_link.text)  # First

    links = await scraping_async.select(html, "a[href]")
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.). Async wrappers expose the underlying sync objects via .document and .element if you need direct access. AsyncDocument supports async with for automatic cleanup in coroutine code.
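
To make the thread-pool point concrete, here is a minimal, standard-library-only sketch of the general pattern: a blocking call is handed to a worker thread with asyncio.to_thread while another coroutine keeps running. blocking_select and heartbeat are hypothetical stand-ins for illustration; this README does not show how scraper_rs.asyncio is implemented, so treat this as an analogy rather than the library's code.

```python
import asyncio
import time


def blocking_select(html: str, needle: str) -> int:
    """Hypothetical stand-in for a blocking sync call; not part of scraper_rs."""
    time.sleep(0.05)  # simulate work that would otherwise stall the event loop
    return html.count(needle)


async def heartbeat(ticks: list) -> None:
    """Record a few timestamps to show the loop is still being serviced."""
    for _ in range(3):
        ticks.append(time.monotonic())
        await asyncio.sleep(0.01)


async def run_demo() -> tuple:
    ticks: list = []
    # The blocking call runs in a worker thread; heartbeat runs on the loop.
    count, _ = await asyncio.gather(
        asyncio.to_thread(blocking_select, "<a></a><a></a>", "<a>"),
        heartbeat(ticks),
    )
    return count, len(ticks)


result = asyncio.run(run_demo())
print(result)  # (2, 3)
```

If blocking_select were called directly inside a coroutine, heartbeat could not run until it returned; offloading it is what lets both make progress.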

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)
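
Because the limit is expressed in bytes rather than characters, non-ASCII text reaches it sooner than len(html) suggests. The sketch below is a standard-library pre-flight check. utf8_size and fits are hypothetical helper names, and it assumes (not confirmed by this README) that the size is measured on the UTF-8 encoding and that the limit is inclusive.

```python
# 1 GiB, the default cap stated above (assumed here to mean 1024**3 bytes).
DEFAULT_MAX_SIZE_BYTES = 1024 ** 3


def utf8_size(html: str) -> int:
    """Size of the text once encoded as UTF-8, in bytes."""
    return len(html.encode("utf-8"))


def fits(html: str, max_size_bytes: int = DEFAULT_MAX_SIZE_BYTES) -> bool:
    """True if the encoded text is within the limit (assumed inclusive)."""
    return utf8_size(html) <= max_size_bytes


print(len("héllo"), utf8_size("héllo"))  # 5 6
print(fits("abcd", max_size_bytes=3))    # False
```

Note that encoding copies the string, so for very large inputs this check has its own memory cost.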

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100 KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Only finds links within the first 100 KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
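
As an illustration of what cutting at a character boundary means, here is a pure-Python sketch. It is not the library's Rust implementation, and truncate_utf8 is a hypothetical name. It relies on a property of UTF-8: continuation bytes always have the bit pattern 10xxxxxx, so stepping back past them lands on the start of a character.

```python
def truncate_utf8(text: str, max_size_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits the limit."""
    if max_size_bytes < 0:
        raise ValueError("max_size_bytes must be non-negative")
    data = text.encode("utf-8")
    if len(data) <= max_size_bytes:
        return text
    cut = max_size_bytes
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte (0b10xxxxxx), the cut falls inside a character,
    # so move the cut back to the start of that character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


print(truncate_utf8("héllo", 2))  # h
print(truncate_utf8("héllo", 3))  # hé
```

In the first call the two-byte "é" would be split by a 2-byte limit, so the result stops after "h" rather than producing invalid UTF-8.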

API highlights

  • Document(html: str) / Document.from_html(html) parses once and keeps the DOM.
  • .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .text returns normalized text; .html returns the element's HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers to keep the event loop responsive.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Top-level helpers mirror the class methods: parse(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
  • In async workflows, use async with await scraper_rs.asyncio.parse(html) as doc: ... for automatic AsyncDocument cleanup.
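
A short sketch of how several of these pieces might fit together. element_record and link_records are hypothetical helper names, not part of scraper_rs; they use only members listed above (.tag, .text, .get(name, default), Document as a context manager, .select(css)). The records are built inside the with block so that nothing touches the DOM after it is closed.

```python
from typing import Any


def element_record(el: Any) -> dict:
    """Flatten an Element-like object into a plain dict of strings."""
    return {
        "tag": el.tag,
        "text": el.text,
        "href": el.get("href", ""),  # documented .get(name, default) helper
    }


def link_records(html: str) -> list:
    """Parse once, collect every a[href] as a record, then free the DOM."""
    # Imported lazily so element_record stays usable without the extension.
    from scraper_rs import Document

    with Document(html) as doc:
        return [element_record(a) for a in doc.select("a[href]")]
```

Calling link_records(html) with the Quick start markup should yield one record per link. Returning plain dicts rather than Element objects keeps callers independent of the document's lifetime.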

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have the just command runner installed, the repo includes helper recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Projects Using scraper-rs

  • silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, and pytest for tests.

  • Run tests: just test or uv run pytest tests/test_scraper.py
  • Format/typing: The codebase is small; formatters are not strictly enforced yet.
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.

Download files

Source Distribution

  • scraper_rust-0.2.29.tar.gz (52.6 kB)

Built Distributions

  • scraper_rust-0.2.29-cp314-cp314t-win_amd64.whl (756.0 kB): CPython 3.14t, Windows x86-64
  • scraper_rust-0.2.29-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (940.8 kB): CPython 3.14t, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.2.29-cp314-cp314t-macosx_11_0_arm64.whl (790.3 kB): CPython 3.14t, macOS 11.0+, ARM64
  • scraper_rust-0.2.29-cp313-cp313t-win_amd64.whl (761.5 kB): CPython 3.13t, Windows x86-64
  • scraper_rust-0.2.29-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (944.0 kB): CPython 3.13t, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.2.29-cp313-cp313t-macosx_11_0_arm64.whl (790.4 kB): CPython 3.13t, macOS 11.0+, ARM64
  • scraper_rust-0.2.29-cp310-abi3-win_amd64.whl (759.5 kB): CPython 3.10+, Windows x86-64
  • scraper_rust-0.2.29-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (931.2 kB): CPython 3.10+, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.2.29-cp310-abi3-macosx_11_0_arm64.whl (795.2 kB): CPython 3.10+, macOS 11.0+, ARM64

