Python bindings around rust-scraper/scraper with PyO3

scraper-rs

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via xee-xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, prettify, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]
print(prettify(html))

For runnable samples, see examples/demo.py and examples/demo_prettify_url.py. Quick URL prettify demo:

python examples/demo_prettify_url.py https://example.com --max-lines 80

Async usage

The scraper_rs.asyncio module keeps the event loop responsive. parse yields once (asyncio.sleep(0)) before constructing a sync Document in the current thread, while select/xpath helpers run in a thread pool. Parsed documents and elements are wrapped with awaitable selector methods for nested queries:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    async with await scraping_async.parse(html) as doc:
        items = await doc.select(".item")
        first_link = await items[0].select_first("a[href]")
        print(first_link.text)  # First

    links = await scraping_async.select(html, "a[href]")
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.). Async wrappers expose the underlying sync objects via .document and .element if you need direct access. AsyncDocument supports async with for automatic cleanup in coroutine code.
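
The scheduling pattern described in this section can be sketched with the standard library alone. This is not the library's implementation: slow_query, parse_like, and select_like are hypothetical stand-ins. The sketch only illustrates the two ideas named above, yielding once to the event loop before doing work in the current thread, and pushing a blocking call to a worker thread so other tasks keep running.

```python
import asyncio
import time


def slow_query(html: str, needle: str) -> int:
    """Hypothetical stand-in for a blocking selector call."""
    time.sleep(0.05)  # simulate blocking work
    return html.count(needle)


async def parse_like(html: str) -> str:
    """Yield to the event loop once, then do the work in the current thread."""
    await asyncio.sleep(0)
    return html.strip()


async def select_like(html: str, needle: str) -> int:
    """Run the blocking call in a worker thread so the loop stays free."""
    return await asyncio.to_thread(slow_query, html, needle)


async def demo() -> tuple:
    ticks = 0

    async def ticker() -> None:
        # Background task that only makes progress while the loop is responsive.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(ticker())
    doc = await parse_like("  <a></a><a></a>  ")
    count = await select_like(doc, "<a>")
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return count, ticks


count, ticks = asyncio.run(demo())
print(count, ticks)
```

If slow_query were called directly inside the coroutine instead of through asyncio.to_thread, the ticker task would not advance while the query runs.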

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

html = "<div class='item'><a href='/a'>First</a></div>"

doc = Document(html, max_size_bytes=5_000_000)  # reject input larger than 5 MB
links = select(html, "a[href]", max_size_bytes=5_000_000)

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# large_html stands for any HTML string that may exceed the limit.
# Parse only the first 100 kB (100_000 bytes) of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Only finds links within the first 100 kB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
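
To make the boundary rule concrete, here is a minimal pure-Python sketch of size-limited input handling. It is an illustration under stated assumptions, not the crate's code: truncate_utf8 and apply_limit are hypothetical helpers, the size is assumed to be measured in UTF-8 bytes, and ValueError is a placeholder because the exception the library raises is not stated here.

```python
def truncate_utf8(text: str, max_size_bytes: int) -> str:
    """Cut text to at most max_size_bytes of UTF-8 without splitting a character."""
    data = text.encode("utf-8")
    if len(data) <= max_size_bytes:
        return text
    cut = max_size_bytes
    # UTF-8 continuation bytes look like 0b10xxxxxx. If the first excluded
    # byte is one of them, the cut is inside a character, so back up.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


def apply_limit(html: str, max_size_bytes: int, truncate_on_limit: bool = False) -> str:
    """Hypothetical guard: reject oversized input, or truncate it when asked to."""
    if len(html.encode("utf-8")) <= max_size_bytes:
        return html
    if not truncate_on_limit:
        # Placeholder exception type; the library's actual error may differ.
        raise ValueError("input exceeds max_size_bytes")
    return truncate_utf8(html, max_size_bytes)


print(truncate_utf8("héllo", 2))  # "é" is two bytes, so only "h" fits
print(apply_limit("héllo", 3, truncate_on_limit=True))
```

Backing up to the previous character start means the result can be a few bytes shorter than the limit, but it always decodes cleanly.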

API highlights

  • Document(html: str) / Document.from_html(html) parse HTML for CSS and keep the DOM; XPath parsing is initialized lazily on first XPath query.
  • .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .prettify() renders the current DOM as an indented string for readable output/debugging.
  • .text returns normalized text; Document.html is the original input HTML; Element.html is inner HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers to keep the event loop responsive.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Elements also expose .prettify() to format element HTML with indentation.
  • Top-level helpers mirror the class methods: parse(html), prettify(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
  • In async workflows, use async with await scraper_rs.asyncio.parse(html) as doc: ... for automatic AsyncDocument cleanup.

Installation

Prebuilt wheels target abi3 (CPython 3.10+), with separate free-threaded builds published for CPython 3.13t and 3.14t. To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have the just command runner installed, the repo includes helper recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Projects Using scraper-rs

  • silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, pytest, and pytest-asyncio for tests.

  • Run tests: just test or uv run pytest tests/
  • Format code: just fmt (or cargo fmt --all and uv run ruff format)
  • Lint Rust: just lint (or cargo clippy --all-targets --all-features -- -D warnings)
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.

Download files

Source distribution

  • scraper_rust-0.3.1.tar.gz (76.1 kB)

Built distributions

  • scraper_rust-0.3.1-cp314-cp314t-win_amd64.whl (2.2 MB): CPython 3.14t, Windows x86-64
  • scraper_rust-0.3.1-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB): CPython 3.14t, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.3.1-cp314-cp314t-macosx_11_0_arm64.whl (2.1 MB): CPython 3.14t, macOS 11.0+, ARM64
  • scraper_rust-0.3.1-cp313-cp313t-win_amd64.whl (2.2 MB): CPython 3.13t, Windows x86-64
  • scraper_rust-0.3.1-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB): CPython 3.13t, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.3.1-cp313-cp313t-macosx_11_0_arm64.whl (2.1 MB): CPython 3.13t, macOS 11.0+, ARM64
  • scraper_rust-0.3.1-cp310-abi3-win_amd64.whl (2.2 MB): CPython 3.10+, Windows x86-64
  • scraper_rust-0.3.1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.4 MB): CPython 3.10+, manylinux (glibc 2.17+), x86-64
  • scraper_rust-0.3.1-cp310-abi3-macosx_11_0_arm64.whl (2.1 MB): CPython 3.10+, macOS 11.0+, ARM64
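
The wheel file names listed above follow the standard name-version-python-abi-platform layout, so the interpreter and platform each file targets can be read straight from the name. The sketch below is a simplified, hypothetical helper (parse_wheel_name is not part of scraper-rs or pip) that splits a file name into those fields; for production use, a dedicated parser such as the one in the packaging library is the safer choice.

```python
def parse_wheel_name(filename: str) -> dict:
    """Split a wheel file name into its tags (simplified sketch)."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file name")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag sits between the version and the Python tag.
        name, version, _build, python, abi, platform = parts
    elif len(parts) == 5:
        name, version, python, abi, platform = parts
    else:
        raise ValueError("unexpected number of fields")
    return {
        "name": name,
        "version": version,
        "python": python,
        "abi": abi,
        "platforms": platform.split("."),  # several platform tags may be joined by "."
        "stable_abi": abi == "abi3",
        "free_threaded": abi.startswith("cp") and abi.endswith("t"),
    }


info = parse_wheel_name(
    "scraper_rust-0.3.1-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl"
)
print(info)
```

Here the cp310-abi3 pair marks a stable-ABI wheel usable on CPython 3.10 and later, while an ABI tag such as cp314t marks a build for the free-threaded interpreter.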

File details

Details for the file scraper_rust-0.3.1.tar.gz.

File metadata

  • Download URL: scraper_rust-0.3.1.tar.gz
  • Upload date:
  • Size: 76.1 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scraper_rust-0.3.1.tar.gz

  Algorithm    Hash digest
  SHA256       b6d56146951aacf58723a024c3d1a09d7f463bb1fdeb9ddbcaf688ba46a75019
  MD5          9fe5adc2b61fbff3e7c385577b0516b1
  BLAKE2b-256  a0f3a5c63f31de29596f9465dbd342a2d44d8c36404c17d1b73bfe76515e3670
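
A downloaded file can be checked against the published SHA256 digest with the standard library. The sketch below is a generic illustration: sha256_of and matches_digest are hypothetical helper names, and the demo hashes a small temporary file rather than the real archive.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_digest(path: str, expected_hex: str) -> bool:
    """Compare the file's SHA256 with an expected hex digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Demo on a temporary file; for a real download, pass its path and the
# SHA256 value published for that file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello")
try:
    ok = matches_digest(tmp.name, hashlib.sha256(b"hello").hexdigest())
finally:
    os.remove(tmp.name)

print(ok)
```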

Provenance

The following attestation bundles were made for scraper_rust-0.3.1.tar.gz:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
