
Python bindings around rust-scraper/scraper with PyO3


scraper-rs


Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via xee-xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, prettify, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]
print(prettify(html))

For runnable samples, see examples/demo.py and examples/demo_prettify_url.py. Quick URL prettify demo:

python examples/demo_prettify_url.py https://example.com --max-lines 80

Async usage

The scraper_rs.asyncio module exposes an async-first surface for coroutine code. AsyncDocument stores shareable HTML/text state instead of a thread-affine sync Document, and all selectors are awaitable for consistent async calling style:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    async with await scraping_async.parse(html) as doc:
        items = await doc.select(".item")
        first_link = await items[0].select_first("a[href]")
        print(first_link.text)  # First

    links = await scraping_async.select(html, "a[href]")
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.). AsyncDocument supports async with for automatic cleanup in coroutine code, and AsyncElement / AsyncDocument both expose async .prettify() helpers.

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)
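Because the cap is expressed in bytes, len(html) can undercount when the markup contains non-ASCII characters. The sketch below assumes the limit is compared against the UTF-8 encoded size, which this README does not state explicitly; utf8_size and fits_limit are hypothetical helper names, not part of scraper_rs.

```python
def utf8_size(html: str) -> int:
    """Size of the text in bytes once encoded as UTF-8."""
    return len(html.encode("utf-8"))


def fits_limit(html: str, max_size_bytes: int) -> bool:
    """True when the encoded text is no larger than the byte limit."""
    return utf8_size(html) <= max_size_bytes


# "\u00e9" is a single character (e with acute accent) that takes two bytes in UTF-8.
sample = "<p>h\u00e9llo</p>"
print(len(sample), utf8_size(sample))  # 12 13
```

A check like this can help pick a sensible max_size_bytes before handing text to the parser.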

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100 KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Will only find links in the first 100 KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
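To make the boundary rule concrete, here is a minimal pure-Python sketch of truncating text at a UTF-8 character boundary. It is an assumption about the general technique, not the library's actual Rust implementation, and truncate_utf8 is a hypothetical helper name.

```python
def truncate_utf8(text: str, max_size_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_size_bytes."""
    if max_size_bytes < 0:
        raise ValueError("max_size_bytes must be non-negative")
    data = text.encode("utf-8")
    if len(data) <= max_size_bytes:
        return text
    cut = max_size_bytes
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte (bit pattern 10xxxxxx), the cut splits a character,
    # so move the cut back to the start of that character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


# "\u00e9" needs two bytes, so a 2-byte limit keeps only the leading "a".
print(truncate_utf8("a\u00e9", 2))  # a
```

Backing up to a character start means the result may be slightly shorter than the limit, but it always decodes cleanly.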

API highlights

  • Document(html: str) / Document.from_html(html) parse HTML for CSS and keep the DOM; XPath parsing is initialized lazily on first XPath query.
  • .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .prettify() renders the current DOM as an indented string for readable output/debugging.
  • .text returns normalized text; Document.html is the original input HTML; Element.html is inner HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers plus awaitable AsyncDocument / AsyncElement methods.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Elements also expose .prettify() to format element HTML with indentation.
  • Top-level helpers mirror the class methods: parse(html), prettify(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
  • In async workflows, use async with await scraper_rs.asyncio.parse(html) as doc: ... for automatic AsyncDocument cleanup.

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have the just command runner installed, the repo includes helper recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Development

Requirements: Rust toolchain, Python 3.10+, maturin, pytest, and pytest-asyncio for tests.

  • Run tests: just test or uv run pytest tests/
  • Format code: just fmt (or cargo fmt --all and uv run ruff format)
  • Lint Rust: just lint (or cargo clippy --all-targets --all-features -- -D warnings)
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.
