Python bindings around rust-scraper/scraper with PyO3

scraper-rs

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via sxd_html/sxd_xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, prettify, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]
print(prettify(html))

For runnable samples, see examples/demo.py and examples/demo_prettify_url.py. Quick URL prettify demo:

python examples/demo_prettify_url.py https://example.com --max-lines 80

Async usage

The scraper_rs.asyncio module keeps the event loop responsive. Its parse function yields once (asyncio.sleep(0)) and then constructs a sync Document in the current thread, while the select/xpath helpers run in a thread pool. Parsed documents and elements are wrapped so that their selector methods are awaitable, which also covers nested queries:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    async with await scraping_async.parse(html) as doc:
        items = await doc.select(".item")
        first_link = await items[0].select_first("a[href]")
        print(first_link.text)  # First

    links = await scraping_async.select(html, "a[href]")
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.). Async wrappers expose the underlying sync objects via .document and .element if you need direct access. AsyncDocument supports async with for automatic cleanup in coroutine code.
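
The scheduling pattern described in this section (yield once, then build in the current thread; push selector work to a thread pool) can be sketched with the standard library alone. This is only an illustration of the pattern, not the scraper_rs implementation: count_links_sync, parse_like, and select_like are hypothetical stand-ins, and the naive substring count merely takes the place of real selector work.

```python
import asyncio


def count_links_sync(html: str) -> int:
    # Hypothetical stand-in for blocking selector work (not scraper_rs code).
    return html.count("<a ")


async def parse_like(html: str) -> str:
    # Yield to the event loop once, then do the cheap work in this thread.
    await asyncio.sleep(0)
    return html.strip()


async def select_like(html: str) -> int:
    # Run the blocking call in the default thread pool so the loop stays free.
    return await asyncio.to_thread(count_links_sync, html)


async def main() -> list[int]:
    pages = [
        "<p><a href='/a'>First</a></p>",
        "  <a href='/a'>First</a> <a href='/b'>Second</a>  ",
    ]
    docs = await asyncio.gather(*(parse_like(page) for page in pages))
    return await asyncio.gather(*(select_like(doc) for doc in docs))


link_counts = asyncio.run(main())
print(link_counts)
```

Because each call is awaited through asyncio.gather, several pages can be processed concurrently without blocking other coroutines on the same loop.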

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Will only find links in the first 100KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
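
To make the size guard and the boundary rule concrete, here is a minimal pure-Python sketch of the idea. It is not the library's Rust code. The function name apply_size_limit, the use of ValueError for rejected input, and measuring the limit against the UTF-8 encoded length are all assumptions made for illustration.

```python
def apply_size_limit(
    html: str, max_size_bytes: int, truncate_on_limit: bool = False
) -> str:
    """Illustrative size guard; names and exception type are assumptions."""
    data = html.encode("utf-8")
    if len(data) <= max_size_bytes:
        return html
    if not truncate_on_limit:
        # Assumed behavior: reject oversized input instead of parsing it.
        raise ValueError(f"input is {len(data)} bytes, limit is {max_size_bytes}")
    cut = max_size_bytes
    # UTF-8 continuation bytes look like 0b10xxxxxx. If the first excluded
    # byte is one, the cut falls inside a character, so back up to its start.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


# "é" takes two bytes in UTF-8, so a 2-byte limit cannot keep half of it.
print(apply_size_limit("héllo", 2, truncate_on_limit=True))
print(apply_size_limit("héllo", 3, truncate_on_limit=True))
```

Backing up to a character boundary means the truncated result can be slightly shorter than max_size_bytes, but it always decodes cleanly.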

API highlights

  • Document(html: str) / Document.from_html(html) parse HTML for CSS and keep the DOM; XPath parsing is initialized lazily on first XPath query.
  • .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .prettify() renders the current DOM as an indented string for readable output/debugging.
  • .text returns normalized text; Document.html is the original input HTML; Element.html is inner HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers to keep the event loop responsive.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Elements also expose .prettify() to format element HTML with indentation.
  • Top-level helpers mirror the class methods: parse(html), prettify(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
  • In async workflows, use async with await scraper_rs.asyncio.parse(html) as doc: ... for automatic AsyncDocument cleanup.
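
Two of the behaviors listed above, lazy XPath initialization and cleanup through close() or a with block, follow a common pattern that can be sketched with the standard library. LazyDocument below is a hypothetical stand-in built on xml.etree.ElementTree (which needs well-formed markup and supports only a subset of XPath). It is not how scraper_rs is implemented, and the tree_builds counter exists only to make the laziness observable.

```python
import xml.etree.ElementTree as ET


class LazyDocument:
    """Hypothetical stand-in showing lazy setup plus explicit cleanup."""

    def __init__(self, markup: str) -> None:
        self.markup = markup
        self._tree = None      # built on the first xpath() call
        self.tree_builds = 0   # counts how often the tree was built
        self.closed = False

    def xpath(self, path: str) -> list:
        if self._tree is None:
            self._tree = ET.fromstring(self.markup)
            self.tree_builds += 1
        return self._tree.findall(path)

    def close(self) -> None:
        # Drop the parsed tree so its memory can be reclaimed.
        self._tree = None
        self.closed = True

    def __enter__(self) -> "LazyDocument":
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        self.close()


markup = "<div><a href='/a'>First</a><a href='/b'>Second</a></div>"
with LazyDocument(markup) as doc:
    builds_before = doc.tree_builds
    hrefs = [link.get("href") for link in doc.xpath(".//a")]
    doc.xpath(".//a")  # second query reuses the cached tree
    builds_after = doc.tree_builds

print(hrefs, builds_before, builds_after, doc.closed)
```

The with block guarantees that close() runs even if a query raises, which is the main reason to prefer it over calling close() by hand.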

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have the just command runner installed, the repo includes helper recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Projects Using scraper-rs

  • silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, pytest, and pytest-asyncio for tests.

  • Run tests: just test or uv run pytest tests/
  • Format code: just fmt (or cargo fmt --all and uv run ruff format)
  • Lint Rust: just lint (or cargo clippy --all-targets --all-features -- -D warnings)
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.

Download files

Source Distribution

  • scraper_rust-0.2.32.tar.gz (64.5 kB): Source

Built Distributions

  • scraper_rust-0.2.32-cp314-cp314t-win_amd64.whl (760.2 kB): CPython 3.14t, Windows x86-64
  • scraper_rust-0.2.32-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (942.5 kB): CPython 3.14t, manylinux: glibc 2.17+ x86-64
  • scraper_rust-0.2.32-cp314-cp314t-macosx_11_0_arm64.whl (791.9 kB): CPython 3.14t, macOS 11.0+ ARM64
  • scraper_rust-0.2.32-cp313-cp313t-win_amd64.whl (766.8 kB): CPython 3.13t, Windows x86-64
  • scraper_rust-0.2.32-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (947.2 kB): CPython 3.13t, manylinux: glibc 2.17+ x86-64
  • scraper_rust-0.2.32-cp313-cp313t-macosx_11_0_arm64.whl (791.1 kB): CPython 3.13t, macOS 11.0+ ARM64
  • scraper_rust-0.2.32-cp310-abi3-win_amd64.whl (766.0 kB): CPython 3.10+, Windows x86-64
  • scraper_rust-0.2.32-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (934.1 kB): CPython 3.10+, manylinux: glibc 2.17+ x86-64
  • scraper_rust-0.2.32-cp310-abi3-macosx_11_0_arm64.whl (796.1 kB): CPython 3.10+, macOS 11.0+ ARM64

File details

Details for the file scraper_rust-0.2.32.tar.gz.

File metadata

  • Download URL: scraper_rust-0.2.32.tar.gz
  • Upload date:
  • Size: 64.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scraper_rust-0.2.32.tar.gz
Algorithm Hash digest
SHA256 cd0184859b15de39b31021c642bc4906bd6ec8c32cb0b9d82ece1221bacfc0c1
MD5 88f2c0986977caa75d7f2378ec47ba71
BLAKE2b-256 ed19ec2911ed867d1210ee59e4ceb1e249a9f1eb06155fecce275cab6b976cce

Provenance

The following attestation bundles were made for scraper_rust-0.2.32.tar.gz:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
