Python bindings around rust-scraper/scraper with PyO3

scraper-rs

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via sxd_html/sxd_xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]

For a runnable sample, see examples/demo.py.

Async usage

The scraper_rs.asyncio module wraps the top-level helpers to keep the event loop responsive. parse yields to the event loop between operations, while select/xpath run in a thread pool. Parsed documents and elements are wrapped with awaitable selector methods for nested queries:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    doc = await scraping_async.parse(html)
    items = await doc.select(".item")
    first_link = await items[0].select_first("a[href]")
    links = await scraping_async.select(html, "a[href]")
    print(first_link.text)  # First
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.). Async wrappers expose the underlying sync objects via .document and .element if you need direct access.
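
The thread-pool offloading described above follows a common asyncio pattern. The sketch below illustrates that pattern with the standard library only. count_links is a hypothetical stand-in for a blocking parse-and-select call, and nothing here is taken from the scraper_rs.asyncio implementation.

```python
import asyncio


def count_links(html: str) -> int:
    """Hypothetical stand-in for a blocking parse-and-select call."""
    return html.count("<a ")


async def count_links_async(html: str) -> int:
    # asyncio.to_thread (Python 3.9+) runs the blocking call in the default
    # thread pool, so the event loop can service other tasks in the meantime.
    return await asyncio.to_thread(count_links, html)


async def main() -> list[int]:
    pages = [
        "<a href='/a'>First</a>",
        "<a href='/a'>First</a> <a href='/b'>Second</a>",
    ]
    # gather returns results in the same order as its inputs.
    return await asyncio.gather(*(count_links_async(p) for p in pages))


result = asyncio.run(main())
print(result)  # [1, 2]
```

With the real library, the awaitable helpers shown earlier (scraping_async.select and friends) take the place of count_links_async, so no manual offloading should be needed.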

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Will only find links in the first 100KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
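
The two behaviours in this section (reject oversized input, or truncate it at a character boundary) can be sketched in pure Python. This is an illustration of the idea, not the library's code: the function name apply_size_limit is hypothetical, the limit is assumed to count UTF-8 encoded bytes, and ValueError is a placeholder because the exception type scraper_rs raises is not stated here.

```python
def apply_size_limit(html: str, max_size_bytes: int, truncate_on_limit: bool = False) -> str:
    """Return html if it fits the limit; otherwise raise or truncate (illustrative)."""
    if max_size_bytes < 0:
        raise ValueError("max_size_bytes must be non-negative")
    data = html.encode("utf-8")
    if len(data) <= max_size_bytes:
        return html
    if not truncate_on_limit:
        # Placeholder exception type; the real library may raise something else.
        raise ValueError(f"input is {len(data)} bytes, limit is {max_size_bytes}")
    cut = max_size_bytes
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx. If the first
    # excluded byte is one, the cut would split a character, so back up to
    # the start of that character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


# "é" occupies two bytes, so a 2-byte limit cannot include it.
print(apply_size_limit("héllo", 2, truncate_on_limit=True))  # h
print(apply_size_limit("héllo", 3, truncate_on_limit=True))  # hé
```

Backing up to a boundary means the truncated text can be slightly shorter than the limit, but it always decodes cleanly.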

API highlights

  • Document(html: str) / Document.from_html(html) parses once and keeps the DOM.
  • .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .text returns normalized text; .html returns the element's HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers to keep the event loop responsive.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Top-level helpers mirror the class methods: parse(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
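
One way to combine these pieces is to copy the values you need into plain Python objects while the document is open, then let the context manager release the DOM. The helper below is a sketch: collect_links is a hypothetical name, it relies only on the .select, .text, and .attr members listed above, and it assumes .text and .attr return ordinary Python strings, as the Quick start output suggests.

```python
def collect_links(doc, css: str = "a[href]") -> list[tuple[str, str]]:
    """Copy (text, href) pairs out of a document into plain tuples.

    `doc` is anything with a .select(css) method whose results expose
    .text and .attr(name), such as the Document class described above.
    """
    return [(el.text, el.attr("href")) for el in doc.select(css)]


# Usage with the real library (requires scraper_rs to be installed):
#
#     from scraper_rs import Document
#
#     with Document(html) as doc:
#         rows = collect_links(doc)
#     # rows holds only strings, so it does not depend on the closed document.
```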

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rs-*.whl

If you have the just command runner installed, the repo includes recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Projects Using scraper-rs

  • silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, and pytest for tests.

  • Run tests: just test or uv run pytest tests/test_scraper.py
  • Format/typing: The codebase is small; formatters are not strictly enforced yet.
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.
