Skip to main content

Python bindings around rust-scraper/scraper with PyO3

Project description

scraper-rs

PyPI - Version Tests PyPI Downloads

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via xee-xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, prettify, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first_item = items[0].select("a[href]")
print([link.attr("href") for link in links_within_first_item])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]
print(prettify(html))

For runnable samples, see examples/demo.py and examples/demo_prettify_url.py. Quick URL prettify demo:

python examples/demo_prettify_url.py https://example.com --max-lines 80

Async usage

The scraper_rs.asyncio module exposes an async-first surface for coroutine code. AsyncDocument stores shareable HTML/text state instead of a thread-affine sync Document, and all selectors are awaitable for consistent async calling style:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    async with await scraping_async.parse(html) as doc:
        items = await doc.select(".item")
        first_link = await items[0].select_first("a[href]")
        print(first_link.text)  # First

    links = await scraping_async.select(html, "a[href]")
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.). CPU-bound async parsing and selection work is offloaded through the Rust async core onto Rayon worker threads so concurrent coroutines do not run that work on the event-loop thread. AsyncDocument supports async with for automatic cleanup in coroutine code, and AsyncElement / AsyncDocument both expose async .prettify() helpers.

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Will only find links in the first 100KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.

API highlights

  • Document(html: str) / Document.from_html(html) parse HTML for CSS and keep the DOM; XPath parsing is initialized lazily on first XPath query.
  • .select(css)list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .prettify() renders the current DOM as an indented string for readable output/debugging.
  • .text returns normalized text; Document.html is the original input HTML; Element.html is inner HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers plus awaitable AsyncDocument / AsyncElement methods.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Elements also expose .prettify() to format element HTML with indentation.
  • Top-level helpers mirror the class methods: parse(html), prettify(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • Parse-tree helpers parse_document(html) and parse_fragment(html) return html5ever-style dictionaries for callers that need structural node data.
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
  • In async workflows, use async with await scraper_rs.asyncio.parse(html) as doc: ... for automatic AsyncDocument cleanup.

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have just installed, the repo includes helpers: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Projects Using scraper-rs

Development

Requirements: Rust toolchain, Python 3.10+, uv, maturin, pytest, and pytest-asyncio for tests.

  • Run tests: just test or uv run pytest tests/
  • Format code: just fmt (or cargo fmt --all and uv run ruff format)
  • Lint Rust: just lint (or cargo clippy --all-targets --all-features -- -D warnings)
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend the relevant tests, type stubs, docs, and examples accordingly.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scraper_rust-0.5.1.tar.gz (87.6 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

scraper_rust-0.5.1-cp310-abi3-win_amd64.whl (2.4 MB view details)

Uploaded CPython 3.10+Windows x86-64

scraper_rust-0.5.1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.6 MB view details)

Uploaded CPython 3.10+manylinux: glibc 2.17+ x86-64

scraper_rust-0.5.1-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (2.5 MB view details)

Uploaded CPython 3.10+manylinux: glibc 2.17+ ARM64

scraper_rust-0.5.1-cp310-abi3-macosx_11_0_arm64.whl (2.3 MB view details)

Uploaded CPython 3.10+macOS 11.0+ ARM64

File details

Details for the file scraper_rust-0.5.1.tar.gz.

File metadata

  • Download URL: scraper_rust-0.5.1.tar.gz
  • Upload date:
  • Size: 87.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scraper_rust-0.5.1.tar.gz
Algorithm Hash digest
SHA256 c0ada601fb851b9d9fe48df22643e45e6f480ccfd41a45c9d345b162b624a218
MD5 bd5def816ca7a5eb7381d3f25495d1bf
BLAKE2b-256 91f042c4b3d78411fc1a795387a4622b026cbeaf930b0b33a98c6c14a10389df

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.5.1.tar.gz:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.5.1-cp310-abi3-win_amd64.whl.

File metadata

  • Download URL: scraper_rust-0.5.1-cp310-abi3-win_amd64.whl
  • Upload date:
  • Size: 2.4 MB
  • Tags: CPython 3.10+, Windows x86-64
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for scraper_rust-0.5.1-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 a042c1821bb04d4b752cb8dccfa65e34da1615df0d1f1e17c87113979daf7a18
MD5 6df9ede28418c97e65c368637fee6fd5
BLAKE2b-256 95a939daf63f73997b6ce6cd6df3d530019780985e26f902032a4ecd9c165795

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.5.1-cp310-abi3-win_amd64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.5.1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.5.1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b41bfd172da80ea8627e74fd2f86249e1dd15cf9937bd39e5a044b9a1bd72e56
MD5 0b068ea0b8f1358b93b9e77c8106ec5e
BLAKE2b-256 be678a200b5178626cb7c8dfd977cceffeface9d9f15e8c96e583e3555d8c20a

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.5.1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.5.1-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.5.1-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 5cd61ffa237a4e9075c958ed103e41ab75915868efa569a9348130e5be405161
MD5 ad96a2721fbd99e3ae93a4f2b74e5350
BLAKE2b-256 35d6ecbc5b87ff424d1cad04afa088b85d6edacd2976304bfff4a9af48803a2c

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.5.1-cp310-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.5.1-cp310-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.5.1-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 5154a2400ace68e85dbccbcf36782cc8bb8d67f3e36d641182cb2eef89db206b
MD5 6968641bb8328e1a2527cbd4767334df
BLAKE2b-256 870ed8ffe66b35ea2efae733b25446390c94038b39c8f48a362d1fce23eedd98

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.5.1-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page