Python bindings around rust-scraper/scraper with PyO3

scraper-rs

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via sxd_html/sxd_xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]

For a runnable sample, see examples/demo.py.

Async usage

The scraper_rs.asyncio module keeps the event loop responsive. parse yields once (asyncio.sleep(0)) before constructing a sync Document in the current thread, while select/xpath helpers run in a thread pool. Parsed documents and elements are wrapped with awaitable selector methods for nested queries:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    async with await scraping_async.parse(html) as doc:
        items = await doc.select(".item")
        first_link = await items[0].select_first("a[href]")
        print(first_link.text)  # First

    links = await scraping_async.select(html, "a[href]")
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.). Async wrappers expose the underlying sync objects via .document and .element if you need direct access. AsyncDocument supports async with for automatic cleanup in coroutine code.
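
The two strategies described in this section (yielding once before doing the work inline, and offloading work to a thread pool) can be sketched with the standard library alone. This is an illustrative sketch, not scraper_rs code: slow_parse, parse_inline, and parse_in_thread are hypothetical names, and slow_parse merely stands in for blocking parse work.

```python
import asyncio


def slow_parse(html: str) -> int:
    # Hypothetical stand-in for blocking parse work: counts opening anchor tags.
    return html.count("<a ")


async def parse_inline(html: str) -> int:
    # Yield to the event loop once so other tasks get a turn,
    # then do the work in the current thread.
    await asyncio.sleep(0)
    return slow_parse(html)


async def parse_in_thread(html: str) -> int:
    # Run the blocking call in the default thread pool;
    # the event loop stays free while it runs.
    return await asyncio.to_thread(slow_parse, html)


async def demo() -> list[int]:
    html = "<div><a href='/a'>First</a><a href='/b'>Second</a></div>"
    # Both strategies produce the same result; only the scheduling differs.
    return await asyncio.gather(parse_inline(html), parse_in_thread(html))


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Yielding once is cheap but still blocks the loop for the duration of the work itself, whereas the thread-pool variant keeps the loop responsive at the cost of a thread hand-off.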

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100 KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Only finds links within the first 100 KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
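
How a size cap and UTF-8-safe truncation can fit together is sketched below in pure Python. This is not the library's implementation: the helper name enforce_limit, the ValueError it raises, and the choice to measure size in UTF-8 encoded bytes are all assumptions made for illustration.

```python
def enforce_limit(html: str, max_size_bytes: int, truncate_on_limit: bool = False) -> str:
    """Hypothetical helper: reject or truncate input larger than max_size_bytes."""
    data = html.encode("utf-8")
    if len(data) <= max_size_bytes:
        return html
    if not truncate_on_limit:
        # Assumption: the real library may raise a different exception type.
        raise ValueError(f"input is {len(data)} bytes, limit is {max_size_bytes}")
    cut = max(max_size_bytes, 0)
    # UTF-8 continuation bytes look like 10xxxxxx. If the first excluded byte
    # is one, the cut falls inside a character, so step back to its start byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


if __name__ == "__main__":
    print(enforce_limit("héllo", 2, truncate_on_limit=True))
```

Stepping back to a start byte means the result can be slightly shorter than the cap, but it always decodes cleanly.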

API highlights

  • Document(html: str) / Document.from_html(html) parse HTML for CSS and keep the DOM; XPath parsing is initialized lazily on first XPath query.
  • .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .text returns normalized text; Document.html is the original input HTML; Element.html is inner HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers to keep the event loop responsive.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Top-level helpers mirror the class methods: parse(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
  • In async workflows, use async with await scraper_rs.asyncio.parse(html) as doc: ... for automatic AsyncDocument cleanup.
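
The cleanup pattern from the last two bullets (close(), with, and async with await parse(...)) relies on Python's context-manager protocols. A minimal sketch of that pattern follows; ParsedResource, AsyncParsedResource, and this parse function are hypothetical stand-ins written for illustration and say nothing about how scraper_rs implements its own classes.

```python
import asyncio


class ParsedResource:
    """Hypothetical stand-in for an object that holds a parsed DOM."""

    def __init__(self, html: str) -> None:
        self._dom = html
        self.closed = False

    def close(self) -> None:
        # Drop the reference so the parsed data can be reclaimed.
        self._dom = None
        self.closed = True

    def __enter__(self) -> "ParsedResource":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self.close()
        return False  # do not swallow exceptions


class AsyncParsedResource:
    """Hypothetical async wrapper exposing the sync object as .document."""

    def __init__(self, inner: ParsedResource) -> None:
        self.document = inner

    async def __aenter__(self) -> "AsyncParsedResource":
        return self

    async def __aexit__(self, exc_type, exc, tb) -> bool:
        self.document.close()
        return False


async def parse(html: str) -> AsyncParsedResource:
    # Yield once, then build the wrapper; the caller enters it with "async with".
    await asyncio.sleep(0)
    return AsyncParsedResource(ParsedResource(html))


def demo_sync() -> bool:
    with ParsedResource("<p>hi</p>") as res:
        assert not res.closed
    return res.closed


async def demo_async() -> bool:
    async with await parse("<p>hi</p>") as doc:
        assert not doc.document.closed
    return doc.document.closed


if __name__ == "__main__":
    print(demo_sync(), asyncio.run(demo_async()))
```

Because parse is a coroutine that returns the wrapper, it has to be awaited first, which is why the form is async with await parse(...) rather than async with parse(...).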

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have the just command runner installed, the repo includes helper recipes: just build (builds a local wheel), just install-wheel (installs the built wheel), and just build_manylinux (builds via the official maturin Docker image).
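
To check which interpreter and ABI a built wheel targets (for example, to confirm the abi3 tag mentioned above), the tags can be read from the file name. Wheel file names follow the layout distribution-version[-build]-python-abi-platform.whl from the Python wheel specification. The helper below is a hypothetical sketch, not part of this project; for production use, the packaging library offers a more thorough parser.

```python
from typing import NamedTuple, Optional


class WheelTags(NamedTuple):
    distribution: str
    version: str
    build: Optional[str]
    python: str
    abi: str
    platform: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Hypothetical helper: split a wheel file name into its tag components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel file name: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 5:
        dist, version, python, abi, platform = parts
        build = None
    elif len(parts) == 6:
        # The optional build tag sits between the version and the Python tag.
        dist, version, build, python, abi, platform = parts
    else:
        raise ValueError(f"unexpected wheel file name layout: {filename!r}")
    return WheelTags(dist, version, build, python, abi, platform)


if __name__ == "__main__":
    print(parse_wheel_filename("scraper_rust-0.2.31-cp310-abi3-win_amd64.whl"))
```

A wheel tagged cp310-abi3 uses the stable ABI and is installable on CPython 3.10 and later, which is why a single such wheel per platform covers several Python versions.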

Projects Using scraper-rs

  • silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, pytest, and pytest-asyncio for tests.

  • Run tests: just test or uv run pytest tests/
  • Format code: just fmt (or cargo fmt --all and uv run ruff format)
  • Lint Rust: just lint (or cargo clippy --all-targets --all-features -- -D warnings)
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.

Download files

Source distribution

  • scraper_rust-0.2.31.tar.gz (62.0 kB)

Built distributions

  • scraper_rust-0.2.31-cp314-cp314t-win_amd64.whl (753.6 kB): CPython 3.14t, Windows x86-64
  • scraper_rust-0.2.31-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (937.3 kB): CPython 3.14t, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.2.31-cp314-cp314t-macosx_11_0_arm64.whl (786.2 kB): CPython 3.14t, macOS 11.0+, ARM64
  • scraper_rust-0.2.31-cp313-cp313t-win_amd64.whl (758.7 kB): CPython 3.13t, Windows x86-64
  • scraper_rust-0.2.31-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (941.0 kB): CPython 3.13t, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.2.31-cp313-cp313t-macosx_11_0_arm64.whl (786.5 kB): CPython 3.13t, macOS 11.0+, ARM64
  • scraper_rust-0.2.31-cp310-abi3-win_amd64.whl (756.3 kB): CPython 3.10+, Windows x86-64
  • scraper_rust-0.2.31-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (928.9 kB): CPython 3.10+, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.2.31-cp310-abi3-macosx_11_0_arm64.whl (791.3 kB): CPython 3.10+, macOS 11.0+, ARM64
