Python bindings around rust-scraper/scraper with PyO3

scraper-rs

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via sxd_html/sxd_xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]

For a runnable sample, see examples/demo.py.

Async usage

The scraper_rs.asyncio module wraps the top-level helpers to keep the event loop responsive. parse yields to the event loop between operations, while select/xpath run in a thread pool. Parsed documents and elements are wrapped with awaitable selector methods for nested queries:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    doc = await scraping_async.parse(html)
    items = await doc.select(".item")
    first_link = await items[0].select_first("a[href]")
    links = await scraping_async.select(html, "a[href]")
    print(first_link.text)  # First
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.). Async wrappers expose the underlying sync objects via .document and .element if you need direct access.
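
As a rough illustration of the thread-pool approach described above, the sketch below offloads a blocking call with asyncio.to_thread while another coroutine keeps running. slow_select and heartbeat are hypothetical stand-ins, not part of scraper_rs, and the library's own wrappers may be implemented differently.

```python
import asyncio
import time


def slow_select(html: str, css: str) -> list[str]:
    """Hypothetical stand-in for a blocking selector call (not scraper_rs)."""
    time.sleep(0.05)  # simulate work that would otherwise block the event loop
    return [f"{css} matched in {len(html)} chars"]


async def select_async(html: str, css: str) -> list[str]:
    # Run the blocking function in the default thread pool executor.
    return await asyncio.to_thread(slow_select, html, css)


async def heartbeat(ticks: list[int]) -> None:
    # Runs on the event loop while the worker thread is busy.
    for i in range(3):
        ticks.append(i)
        await asyncio.sleep(0.01)


async def main() -> tuple[list[str], list[int]]:
    ticks: list[int] = []
    result, _ = await asyncio.gather(
        select_async("<a href='/a'>First</a>", "a[href]"),
        heartbeat(ticks),
    )
    return result, ticks


result, ticks = asyncio.run(main())
print(result, ticks)
```

The heartbeat coroutine completes all of its ticks even though slow_select sleeps synchronously, which is the kind of responsiveness the async wrappers aim for.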

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Will only find links in the first 100KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
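
To make the size guard and the boundary rule concrete, here is a minimal pure-Python sketch of the documented behavior. enforce_size_limit is a hypothetical helper written for illustration; the real check lives in the Rust extension, and the exception type raised by the library is an assumption here.

```python
def enforce_size_limit(html: str, max_size_bytes: int, truncate_on_limit: bool = False) -> str:
    """Hypothetical guard: reject or truncate HTML larger than max_size_bytes."""
    data = html.encode("utf-8")
    if len(data) <= max_size_bytes:
        return html
    if not truncate_on_limit:
        # ValueError is an assumption; the library may raise a different exception.
        raise ValueError(f"HTML is {len(data)} bytes, limit is {max_size_bytes}")
    cut = max_size_bytes
    # UTF-8 continuation bytes have the form 0b10xxxxxx. If the byte at the cut
    # position is one, the cut would split a character, so back up to its start.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


print(enforce_size_limit("héllo", 2, truncate_on_limit=True))
```

In "héllo" the character é occupies two bytes, so a 2-byte limit would land in the middle of it; the loop backs up one byte and the result keeps only the leading "h".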

API highlights

  • Document(html: str) / Document.from_html(html) parses once and keeps the DOM.
  • .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .text returns normalized text; .html returns the element's HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers to keep the event loop responsive.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Top-level helpers mirror the class methods: parse(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
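
The helpers listed above compose naturally into small extraction functions. The sketch below is duck-typed: link_rows is a hypothetical helper that only assumes its argument exposes .select(css) and that the returned elements expose .text and .get(name, default), as described in the list.

```python
def link_rows(doc) -> list[dict]:
    """Collect text/href pairs from any object with a scraper_rs-style .select()."""
    rows = []
    for link in doc.select("a[href]"):
        # .get(name, default) avoids a None check when an attribute is missing.
        rows.append({"text": link.text, "href": link.get("href", "")})
    return rows
```

With the real library this might be called as `with Document(html) as doc: rows = link_rows(doc)`, so the parsed DOM is released as soon as the rows have been collected.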

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have the just command runner installed, the repo includes helper recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Projects Using scraper-rs

  • silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, and pytest for tests.

  • Run tests: just test or uv run pytest tests/test_scraper.py
  • Format/typing: The codebase is small; formatters are not strictly enforced yet.
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.

Download files

Source Distribution

  • scraper_rust-0.2.27.tar.gz (53.4 kB)

Built Distributions

  • scraper_rust-0.2.27-cp314-cp314t-win_amd64.whl (765.2 kB): CPython 3.14t, Windows x86-64
  • scraper_rust-0.2.27-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (947.1 kB): CPython 3.14t, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.2.27-cp314-cp314t-macosx_11_0_arm64.whl (796.8 kB): CPython 3.14t, macOS 11.0+, ARM64
  • scraper_rust-0.2.27-cp313-cp313t-win_amd64.whl (768.7 kB): CPython 3.13t, Windows x86-64
  • scraper_rust-0.2.27-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (950.3 kB): CPython 3.13t, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.2.27-cp313-cp313t-macosx_11_0_arm64.whl (797.9 kB): CPython 3.13t, macOS 11.0+, ARM64
  • scraper_rust-0.2.27-cp310-abi3-win_amd64.whl (769.2 kB): CPython 3.10+, Windows x86-64
  • scraper_rust-0.2.27-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (939.3 kB): CPython 3.10+, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.2.27-cp310-abi3-macosx_11_0_arm64.whl (800.2 kB): CPython 3.10+, macOS 11.0+, ARM64
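
The wheels above fall into two families: abi3 wheels for regular CPython 3.10+ and separate wheels for free-threaded builds (the "t" tags). The sketch below guesses which family fits the running interpreter. wheel_family is a hypothetical helper for illustration; pip performs this matching itself, and only the 3.13t and 3.14t free-threaded tags appear in the list above.

```python
import sys
import sysconfig


def wheel_family() -> str:
    """Guess which family of the listed wheels matches the running interpreter."""
    if sys.implementation.name != "cpython" or sys.version_info < (3, 10):
        return "no matching wheel"
    # Py_GIL_DISABLED is truthy on free-threaded CPython builds (3.13t and later).
    if sysconfig.get_config_var("Py_GIL_DISABLED"):
        return f"cp{sys.version_info.major}{sys.version_info.minor}t"
    return "cp310-abi3"


print(wheel_family())
```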

File details

Details for the file scraper_rust-0.2.27.tar.gz.

File metadata

  • Download URL: scraper_rust-0.2.27.tar.gz
  • Size: 53.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scraper_rust-0.2.27.tar.gz
Algorithm Hash digest
SHA256 5066b5ad8562495c0f2aa7f5034e257dbf8fe10e7d7bc3e1052b8696d9f9c878
MD5 d8e2e21f20d84ee90c7c3cc7a5f0880d
BLAKE2b-256 96d62e91023f3b8b7cc274439e1bdebc89419a200fb0401c90a7ed76815a4df8
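
A downloaded file can be checked against the SHA256 digest listed above with the standard library. This is a minimal sketch; sha256_of and matches_published are hypothetical helper names, and the local file path is whatever location the archive was saved to.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the lowercase hex digest."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        # Read in chunks so large files are never loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: Path, expected_hex: str) -> bool:
    # Normalize whitespace and case before comparing with the published digest.
    return sha256_of(path) == expected_hex.strip().lower()
```

For the source distribution, the call would look like `matches_published(Path("scraper_rust-0.2.27.tar.gz"), "5066b5ad8562495c0f2aa7f5034e257dbf8fe10e7d7bc3e1052b8696d9f9c878")`.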

Provenance

The following attestation bundles were made for scraper_rust-0.2.27.tar.gz:

Publisher: release.yml on RustedBytes/scraper-rs
