Python bindings around rust-scraper/scraper with PyO3

scraper-rs

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via sxd_html/sxd_xpath), top-level helper functions, and no parsing work on the Python side.

Quick start

from scraper_rs import Document, first, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]

For a runnable sample, see examples/demo.py.

Async usage

The scraper_rs.asyncio module wraps the top-level helpers to keep the event loop responsive. parse yields to the event loop between operations, while select/xpath run in a thread pool. Parsed documents and elements are wrapped with awaitable selector methods for nested queries:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    async with await scraping_async.parse(html) as doc:
        items = await doc.select(".item")
        first_link = await items[0].select_first("a[href]")
        print(first_link.text)  # First

    links = await scraping_async.select(html, "a[href]")
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.). Async wrappers expose the underlying sync objects via .document and .element if you need direct access. AsyncDocument supports async with for automatic cleanup in coroutine code.
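
The thread-pool offloading described above can be illustrated with the standard library alone. The sketch below is not how scraper_rs is implemented: `extract_hrefs` is a hypothetical blocking stand-in built on `html.parser`, and `asyncio.to_thread` is assumed as the offloading mechanism only because it is the simplest stdlib option.

```python
import asyncio
from html.parser import HTMLParser


class _HrefCollector(HTMLParser):
    """Collects href attributes from <a> tags (stand-in for a real selector engine)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html: str) -> list[str]:
    """Blocking helper: parse the whole string and return every link target."""
    collector = _HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


async def extract_hrefs_async(html: str) -> list[str]:
    """Run the blocking helper in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(extract_hrefs, html)


async def main() -> list[list[str]]:
    pages = [
        "<a href='/a'>First</a>",
        "<a href='/b'>Second</a><a href='/c'>Third</a>",
    ]
    # gather() schedules both jobs concurrently; results keep the input order.
    return await asyncio.gather(*(extract_hrefs_async(page) for page in pages))


results = asyncio.run(main())
print(results)  # [['/a'], ['/b', '/c']]
```

The same shape applies when the blocking call is a real parser: the coroutine only awaits the worker thread, so other tasks on the loop keep running while the parse is in progress.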

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

html = "<div class='item'><a href='/a'>First</a></div>"

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

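from scraper_rs import Document, select

# Sample oversized input (several MB), built only for this example
large_html = "<div class='item'><a href='/a'>First</a></div>" * 100_000

# Parse only the first 100 kB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Will only find links in the first 100 kB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)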

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
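
To make the two behaviours concrete, here is a pure-Python sketch of a size guard that either rejects oversized input or truncates it at a UTF-8 character boundary. It is an illustration under assumptions, not the library's code: `apply_size_limit` is a hypothetical name, and raising `ValueError` is an assumption, since the exception scraper_rs raises is not stated here.

```python
def apply_size_limit(html: str, max_size_bytes: int, truncate_on_limit: bool = False) -> str:
    """Return html unchanged if it fits, otherwise reject or truncate it.

    Hypothetical helper; sizes are measured in UTF-8 encoded bytes.
    """
    data = html.encode("utf-8")
    if len(data) <= max_size_bytes:
        return html
    if not truncate_on_limit:
        # Assumed error type, chosen for this sketch only.
        raise ValueError(f"HTML is {len(data)} bytes, limit is {max_size_bytes}")

    # Step back while the first dropped byte is a UTF-8 continuation byte
    # (bit pattern 10xxxxxx): cutting there would split a multi-byte character.
    cut = max_size_bytes
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


# "é" takes two bytes in UTF-8, so a 2-byte limit cannot keep it whole.
print(apply_size_limit("héllo", 2, truncate_on_limit=True))  # h
print(apply_size_limit("héllo", 3, truncate_on_limit=True))  # hé
```

Backing up to a character boundary means the truncated result can be slightly shorter than the limit, but it always decodes cleanly.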

API highlights

  • Document(html: str) / Document.from_html(html) parses once and keeps the DOM.
  • .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .text returns normalized text; .html returns the element's HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers to keep the event loop responsive.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Top-level helpers mirror the class methods: parse(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or use with Document(html) as doc: ...) to free parsed DOM resources when you're done.
  • In async workflows, use async with await scraper_rs.asyncio.parse(html) as doc: ... for automatic AsyncDocument cleanup.

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rs-*.whl

If you have the just command runner installed, the repo includes helper recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).
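
After installing the wheel, a quick import check can confirm that the extension module is visible to the interpreter you are using. The sketch below relies only on `importlib` from the standard library; `is_importable` is a hypothetical helper name, and `scraper_rs` is the module name given in the Development notes.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if the interpreter can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


if is_importable("scraper_rs"):
    print("scraper_rs is installed")
else:
    print("scraper_rs not found; check that the wheel was installed into this environment")
```

A failed check usually means the wheel went into a different virtual environment than the one running the script.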

Projects Using scraper-rs

  • silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, and pytest for tests.

  • Run tests: just test or uv run pytest tests/test_scraper.py
  • Format/typing: The codebase is small; formatters are not strictly enforced yet.
  • The PyO3 module name is scraper_rs; the Rust crate is built as a cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.

Download files

Source distribution

  • scraper_rust-0.2.28.tar.gz (52.8 kB)

Built distributions

  • scraper_rust-0.2.28-cp314-cp314t-win_amd64.whl (754.0 kB): CPython 3.14t, Windows x86-64
  • scraper_rust-0.2.28-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (939.1 kB): CPython 3.14t, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.2.28-cp314-cp314t-macosx_11_0_arm64.whl (789.0 kB): CPython 3.14t, macOS 11.0+, ARM64
  • scraper_rust-0.2.28-cp313-cp313t-win_amd64.whl (754.5 kB): CPython 3.13t, Windows x86-64
  • scraper_rust-0.2.28-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (939.6 kB): CPython 3.13t, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.2.28-cp313-cp313t-macosx_11_0_arm64.whl (789.2 kB): CPython 3.13t, macOS 11.0+, ARM64
  • scraper_rust-0.2.28-cp310-abi3-win_amd64.whl (755.8 kB): CPython 3.10+, Windows x86-64
  • scraper_rust-0.2.28-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (927.9 kB): CPython 3.10+, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.2.28-cp310-abi3-macosx_11_0_arm64.whl (791.8 kB): CPython 3.10+, macOS 11.0+, ARM64

File details

Details for the file scraper_rust-0.2.28.tar.gz.

File metadata

  • Download URL: scraper_rust-0.2.28.tar.gz
  • Upload date:
  • Size: 52.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scraper_rust-0.2.28.tar.gz:

  • SHA256: 6865d4ad2695876fb37be1810d927ce33b901427b5833052ad6dfc0ccd41c9b1
  • MD5: e436ac53a01d823033ff274ac9e51373
  • BLAKE2b-256: 82a5b3b1dffbaadff33523c82765c0117949b489678695b033d95383e1b4dc9c

Provenance

The following attestation bundles were made for scraper_rust-0.2.28.tar.gz:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
