Python bindings around rust-scraper/scraper with PyO3

Project description

scraper-rs

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via xee-xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, prettify, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]
print(prettify(html))

For runnable samples, see examples/demo.py and examples/demo_prettify_url.py. Quick URL prettify demo:

python examples/demo_prettify_url.py https://example.com --max-lines 80

Async usage

The scraper_rs.asyncio module keeps the event loop responsive. parse yields once (asyncio.sleep(0)) before constructing a sync Document in the current thread, while select/xpath helpers run in a thread pool. Parsed documents and elements are wrapped with awaitable selector methods for nested queries:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    async with await scraping_async.parse(html) as doc:
        items = await doc.select(".item")
        first_link = await items[0].select_first("a[href]")
        print(first_link.text)  # First

    links = await scraping_async.select(html, "a[href]")
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.). Async wrappers expose the underlying sync objects via .document and .element if you need direct access. AsyncDocument supports async with for automatic cleanup in coroutine code.
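
The two strategies described above (yield once before doing work in the current thread, and push blocking queries to a thread pool) can be sketched with the standard library alone. This is an illustration of the pattern, not the scraper_rs.asyncio implementation; count_tags, parse_async, and count_tags_async are hypothetical names, and the string counting is only a stand-in for a real selector call.

```python
import asyncio


def count_tags(html: str, tag: str) -> int:
    # Hypothetical stand-in for a blocking selector call.
    return html.count(f"<{tag}")


async def parse_async(html: str) -> str:
    # Yield to the event loop once, then do the work in the current thread.
    await asyncio.sleep(0)
    return html.strip()


async def count_tags_async(html: str, tag: str) -> int:
    # Run the blocking call in the default thread pool so the loop stays free.
    return await asyncio.to_thread(count_tags, html, tag)


async def demo() -> int:
    doc = await parse_async("  <div><a href='/a'>First</a><a href='/b'>Second</a></div>  ")
    return await count_tags_async(doc, "a")


print(asyncio.run(demo()))  # 2
```

Because parse_async is a coroutine that returns the parsed object, callers have to await it before using the result, which is the same reason the real API is written as async with await parse(html).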

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)
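
When choosing a value for max_size_bytes, keep in mind that a byte limit is not the same as a character count. The sketch below assumes the cap is measured against the UTF-8 encoding of the input (an assumption, suggested by the parameter name and the UTF-8 note further down); utf8_size and exceeds_limit are hypothetical helpers, not part of scraper_rs.

```python
def utf8_size(html: str) -> int:
    # Size of the document once encoded as UTF-8.
    return len(html.encode("utf-8"))


def exceeds_limit(html: str, max_size_bytes: int) -> bool:
    # Pre-check a document against a byte budget before parsing it.
    return utf8_size(html) > max_size_bytes


sample = "<p>na\u00efve caf\u00e9</p>"
print(len(sample), utf8_size(sample))  # 17 19
print(exceeds_limit(sample, 18))       # True
```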

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Will only find links in the first 100KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
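
To see why the boundary matters, here is a minimal pure-Python sketch of byte-limited truncation that never splits a multi-byte character. It illustrates the idea only and is not the library's Rust code; truncate_utf8 is a hypothetical name, and max_bytes is assumed to be non-negative.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    cut = max_bytes
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx. If the first
    # excluded byte is one of them, the cut would split a character, so
    # step back until the cut sits in front of a lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


print(truncate_utf8("h\u00e9llo", 2))  # h
print(truncate_utf8("h\u00e9llo", 3))  # hé
```

A naive data[:2].decode("utf-8") would raise UnicodeDecodeError here, because the two-byte "é" would be cut in half.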

API highlights

  • Document(html: str) / Document.from_html(html) parse HTML for CSS and keep the DOM; XPath parsing is initialized lazily on first XPath query.
  • .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None; .css(css) is an alias for .select(css).
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .prettify() renders the current DOM as an indented string for readable output/debugging.
  • .text returns normalized text; Document.html is the original input HTML; Element.html is inner HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers to keep the event loop responsive.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Elements also expose .prettify() to format element HTML with indentation.
  • Top-level helpers mirror the class methods: parse(html), prettify(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
  • In async workflows, use async with await scraper_rs.asyncio.parse(html) as doc: ... for automatic AsyncDocument cleanup.
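
Two of the behaviours listed above, lazy XPath initialization and explicit cleanup via close() or a with block, follow a common pattern that can be sketched in plain Python. LazyDocument below is a hypothetical stand-in and not scraper_rs.Document; the string splitting is only a placeholder for an expensive second parse.

```python
class LazyDocument:
    """Hypothetical stand-in showing lazy setup plus context-manager cleanup."""

    def __init__(self, html: str) -> None:
        self.html = html
        self._xpath_tree = None  # built only when an XPath-style query arrives
        self.xpath_builds = 0
        self.closed = False

    def _ensure_xpath_tree(self) -> list:
        if self._xpath_tree is None:
            self.xpath_builds += 1
            # Placeholder for the costly XPath-side parse.
            self._xpath_tree = self.html.split("<")
        return self._xpath_tree

    def xpath_count(self, tag: str) -> int:
        # Count opening tags named `tag` using the lazily built structure.
        prefixes = (tag + " ", tag + ">")
        return sum(1 for chunk in self._ensure_xpath_tree() if chunk.startswith(prefixes))

    def close(self) -> None:
        # Drop the parsed structure so its memory can be reclaimed.
        self._xpath_tree = None
        self.closed = True

    def __enter__(self) -> "LazyDocument":
        return self

    def __exit__(self, *exc_info) -> None:
        self.close()


with LazyDocument("<div><a href='/a'>First</a></div>") as doc:
    print(doc.xpath_builds)      # 0
    print(doc.xpath_count("a"))  # 1
    print(doc.xpath_builds)      # 1
print(doc.closed)                # True
```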

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have the just command runner installed, the repo includes recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Projects Using scraper-rs

  • silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, pytest, and pytest-asyncio for tests.

  • Run tests: just test or uv run pytest tests/
  • Format code: just fmt (or cargo fmt --all and uv run ruff format)
  • Lint Rust: just lint (or cargo clippy --all-targets --all-features -- -D warnings)
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

scraper_rust-0.3.3.tar.gz (80.4 kB)

Uploaded: Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

scraper_rust-0.3.3-cp314-cp314t-win_amd64.whl (2.2 MB)

Uploaded: CPython 3.14t, Windows x86-64

scraper_rust-0.3.3-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)

Uploaded: CPython 3.14t, manylinux: glibc 2.17+ x86-64

scraper_rust-0.3.3-cp314-cp314t-macosx_11_0_arm64.whl (2.2 MB)

Uploaded: CPython 3.14t, macOS 11.0+ ARM64

scraper_rust-0.3.3-cp313-cp313t-win_amd64.whl (2.2 MB)

Uploaded: CPython 3.13t, Windows x86-64

scraper_rust-0.3.3-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)

Uploaded: CPython 3.13t, manylinux: glibc 2.17+ x86-64

scraper_rust-0.3.3-cp313-cp313t-macosx_11_0_arm64.whl (2.2 MB)

Uploaded: CPython 3.13t, macOS 11.0+ ARM64

scraper_rust-0.3.3-cp310-abi3-win_amd64.whl (2.2 MB)

Uploaded: CPython 3.10+, Windows x86-64

scraper_rust-0.3.3-cp310-abi3-macosx_11_0_arm64.whl (2.2 MB)

Uploaded: CPython 3.10+, macOS 11.0+ ARM64

scraper_rust-0.3.3-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (2.5 MB)

Uploaded: CPython 3.8, manylinux: glibc 2.17+ x86-64

File details

Details for the file scraper_rust-0.3.3.tar.gz.

File metadata

  • Download URL: scraper_rust-0.3.3.tar.gz
  • Upload date:
  • Size: 80.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for scraper_rust-0.3.3.tar.gz
Algorithm Hash digest
SHA256 88753702f97901fb1586b1b4d16c9f3a06b93b07e2b4ebe52ab14e3ee147f591
MD5 6449051d9e78538657d5de53ee5613b0
BLAKE2b-256 255b9052f3eadcce04f1975a69e5e02e1b8c9e157a458d783972f9511e3f4fcd

See more details on using hashes here.

Provenance

The following attestation bundles were made for scraper_rust-0.3.3.tar.gz:

Publisher: release.yml on RustedBytes/scraper-rs

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

