Skip to main content

Python bindings around rust-scraper/scraper with PyO3

Project description

scraper-rs

PyPI - Version Tests PyPI Downloads

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via xee-xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, prettify, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]
print(prettify(html))

For runnable samples, see examples/demo.py and examples/demo_prettify_url.py. Quick URL prettify demo:

python examples/demo_prettify_url.py https://example.com --max-lines 80

Async usage

The scraper_rs.asyncio module keeps the event loop responsive. parse yields once (asyncio.sleep(0)) before constructing a sync Document in the current thread, while select/xpath helpers run in a thread pool. Parsed documents and elements are wrapped with awaitable selector methods for nested queries:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    async with await scraping_async.parse(html) as doc:
        items = await doc.select(".item")
        first_link = await items[0].select_first("a[href]")
        print(first_link.text)  # First

    links = await scraping_async.select(html, "a[href]")
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.). Async wrappers expose the underlying sync objects via .document and .element if you need direct access. AsyncDocument supports async with for automatic cleanup in coroutine code.

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Will only find links in the first 100KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.

API highlights

  • Document(html: str) / Document.from_html(html) parse HTML for CSS and keep the DOM; XPath parsing is initialized lazily on first XPath query.
  • .select(css)list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .prettify() renders the current DOM as an indented string for readable output/debugging.
  • .text returns normalized text; Document.html is the original input HTML; Element.html is inner HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers to keep the event loop responsive.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Elements also expose .prettify() to format element HTML with indentation.
  • Top-level helpers mirror the class methods: parse(html), prettify(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
  • In async workflows, use async with await scraper_rs.asyncio.parse(html) as doc: ... for automatic AsyncDocument cleanup.

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have just installed, the repo includes helpers: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Projects Using scraper-rs

  • silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, pytest, and pytest-asyncio for tests.

  • Run tests: just test or uv run pytest tests/
  • Format code: just fmt (or cargo fmt --all and uv run ruff format)
  • Lint Rust: just lint (or cargo clippy --all-targets --all-features -- -D warnings)
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scraper_rust-0.3.0.tar.gz (75.9 kB view details)

Uploaded Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

scraper_rust-0.3.0-cp314-cp314t-win_amd64.whl (2.2 MB view details)

Uploaded CPython 3.14tWindows x86-64

scraper_rust-0.3.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB view details)

Uploaded CPython 3.14tmanylinux: glibc 2.17+ x86-64

scraper_rust-0.3.0-cp314-cp314t-macosx_11_0_arm64.whl (2.1 MB view details)

Uploaded CPython 3.14tmacOS 11.0+ ARM64

scraper_rust-0.3.0-cp313-cp313t-win_amd64.whl (2.2 MB view details)

Uploaded CPython 3.13tWindows x86-64

scraper_rust-0.3.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB view details)

Uploaded CPython 3.13tmanylinux: glibc 2.17+ x86-64

scraper_rust-0.3.0-cp313-cp313t-macosx_11_0_arm64.whl (2.1 MB view details)

Uploaded CPython 3.13tmacOS 11.0+ ARM64

scraper_rust-0.3.0-cp310-abi3-win_amd64.whl (2.2 MB view details)

Uploaded CPython 3.10+Windows x86-64

scraper_rust-0.3.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB view details)

Uploaded CPython 3.10+manylinux: glibc 2.17+ x86-64

scraper_rust-0.3.0-cp310-abi3-macosx_11_0_arm64.whl (2.1 MB view details)

Uploaded CPython 3.10+macOS 11.0+ ARM64

File details

Details for the file scraper_rust-0.3.0.tar.gz.

File metadata

  • Download URL: scraper_rust-0.3.0.tar.gz
  • Upload date:
  • Size: 75.9 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scraper_rust-0.3.0.tar.gz
Algorithm Hash digest
SHA256 124dd28504530a62bf90212b5f7e46a504aa148a12e45c8fcfe24f88427f00a6
MD5 715426d0c4449caaff063f65b18482cb
BLAKE2b-256 b2541d9faad662a98cc9d8f989b2cda797d261a89c6896f4d1b09f8072a04d0d

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.3.0.tar.gz:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.3.0-cp314-cp314t-win_amd64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.3.0-cp314-cp314t-win_amd64.whl
Algorithm Hash digest
SHA256 caef5591717399798b938e63e9f7fe64ab8ad03a31725ed4b62809020fcb49f2
MD5 282cabd2b985f0507737afaf3b84111b
BLAKE2b-256 a471d62dd742f0bf70bf1e4b6a99396fcb268e23164c8d2907be0f95e1fc51ed

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.3.0-cp314-cp314t-win_amd64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.3.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.3.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 6990e282953727b5499b6a1e6e785eef5df7d61837f5c05f5d127fdd42e4ee56
MD5 1f05d7fc4e62f72b307d10c89ec44b5b
BLAKE2b-256 62318e3c136082721f943d0b04ddac4e37518231df63daf3e79bae30554a43a5

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.3.0-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.3.0-cp314-cp314t-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.3.0-cp314-cp314t-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 da96ad4e036ccba8ef44e469cb71bd0731c3906d7fddb2d0efc4c10b5b79234c
MD5 528e984bdd5701e96c531640b404aa51
BLAKE2b-256 6a13f2f13b88a4492cdbe0ed1809a9925d2e779f897281c08b72f73386ff2bb0

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.3.0-cp314-cp314t-macosx_11_0_arm64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.3.0-cp313-cp313t-win_amd64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.3.0-cp313-cp313t-win_amd64.whl
Algorithm Hash digest
SHA256 29d85fc30106de4240618447f46ae4fd882c23434cf85324de4504d7d7a73976
MD5 3aede202d329f4d89265f69c89883136
BLAKE2b-256 eb7ac62762af4b1e4fb7c8c4ad098921781e43179b1a1e14fa3d236119640be0

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.3.0-cp313-cp313t-win_amd64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.3.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.3.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 e9d15532c90350c707d87ebfcc4c39e3214bfaae174468477a156a6719f424f9
MD5 e2ca488819ee470a33482a18ea77e70f
BLAKE2b-256 d10d588b44be2ea7b48cb6aecc2ddb4330ad89e5aca9cce0ed14ed969df5d253

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.3.0-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.3.0-cp313-cp313t-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.3.0-cp313-cp313t-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 0aca7551300d62777dcbc538882730c6b070e46a2b6180f2c7b61a74eb764f02
MD5 6e9137de56cac42c0b611db08b3d00c0
BLAKE2b-256 e6350efaa796965054403680ec00f1a4aebe6806e9c827ee3f7ff5e3d1e082f6

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.3.0-cp313-cp313t-macosx_11_0_arm64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.3.0-cp310-abi3-win_amd64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.3.0-cp310-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 9c7e1d1493f5db804b9fd2dbe33ec0d4f377f85a0e5a1c6a5b2eacafe329c8e4
MD5 02127292081e940b0c519b20d0fa3a58
BLAKE2b-256 4f1b1bcf1aa2fcb014f9a594b8fde88831c083f15e123cf8722a4bb36c0c80a9

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.3.0-cp310-abi3-win_amd64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.3.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.3.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 b4db55e856a03171e3c7eb943920530f4bb83aef6b9e9c4139815d14c63e5b8a
MD5 d71e4e0abca203097dbd74c0811aa27e
BLAKE2b-256 ee8bff72cab2fcb81cc44af73c4fff6d59b2fa358e24ed1e628e88c3e1edef20

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.3.0-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file scraper_rust-0.3.0-cp310-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for scraper_rust-0.3.0-cp310-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 36344b37f3cf36cda6cfcebbc9cbd0bed0f44cbe2b095cea5553476ef32fba4c
MD5 eaafe637b8805c9e57f3770bbbae8a10
BLAKE2b-256 e31211532bd62976ebd8b2aa77f0398196757f64e8d5ce97a63dc80bc235a2ea

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.3.0-cp310-abi3-macosx_11_0_arm64.whl:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page