Python bindings around rust-scraper/scraper with PyO3

scraper-rs

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via sxd_html/sxd_xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]

For a runnable sample, see examples/demo.py.

Async usage

The scraper_rs.asyncio module wraps the top-level helpers to keep the event loop responsive. parse yields to the event loop between operations, while select/xpath run in a thread pool:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    doc = await scraping_async.parse(html)
    links = await scraping_async.select(html, "a[href]")
    print(doc.select_first(".item").text)  # First
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.).
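
The thread-pool offloading described above can be illustrated with the standard library alone. The sketch below is not the scraper_rs.asyncio implementation: count_links is a hypothetical stand-in for any blocking helper, and asyncio.to_thread is just one way to keep the event loop free while several documents are processed concurrently.

```python
import asyncio


def count_links(html: str) -> int:
    """Hypothetical blocking helper standing in for a sync parse/select call."""
    return html.count("<a ")


async def count_links_many(pages: list[str]) -> list[int]:
    # Run each blocking call in the default thread pool so the event loop
    # stays free; gather returns results in the order of the inputs.
    tasks = [asyncio.to_thread(count_links, page) for page in pages]
    return await asyncio.gather(*tasks)


pages = [
    "<a href='/a'>First</a>",
    "<a href='/b'>Second</a><a href='/c'>Third</a>",
    "<p>no links</p>",
]
counts = asyncio.run(count_links_many(pages))
print(counts)  # [1, 2, 0]
```

The same shape should apply when awaiting the module's own select or xpath wrappers inside asyncio.gather, since each call is awaitable.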

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)
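
When choosing a limit, it can help to measure a document the way a byte cap would. The sketch below assumes, based only on the parameter name, that max_size_bytes is compared against the UTF-8 encoded size rather than the character count. utf8_size and fits_within are hypothetical helpers, not part of scraper_rs, and whether the real limit is inclusive is also an assumption.

```python
GIB = 1024 ** 3  # 1 GiB in bytes, the documented default cap


def utf8_size(html: str) -> int:
    """Return the size of html in bytes when encoded as UTF-8."""
    return len(html.encode("utf-8"))


def fits_within(html: str, max_size_bytes: int = GIB) -> bool:
    """Check whether html is at or below a byte limit (assumed inclusive)."""
    return utf8_size(html) <= max_size_bytes


sample = "<p>caf\u00e9</p>"  # "é" written as an escape to avoid ambiguity
print(len(sample))        # 11 characters
print(utf8_size(sample))  # 12 bytes, because "é" takes two bytes in UTF-8
print(fits_within(sample, max_size_bytes=11))  # False
```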

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100 KB (100,000 bytes) of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Finds only links within the first 100 KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
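
To show what cutting at a character boundary means, here is a minimal pure-Python sketch. It is an illustration of the idea, not necessarily how scraper_rs implements it; truncate_utf8 is a hypothetical name.

```python
def truncate_utf8(data: bytes, max_size_bytes: int) -> bytes:
    """Cut data to at most max_size_bytes without splitting a character."""
    if len(data) <= max_size_bytes:
        return data
    end = max_size_bytes
    # UTF-8 continuation bytes look like 0b10xxxxxx. If the first excluded
    # byte is one, the cut falls inside a character, so back up to the
    # start of that character and drop it entirely.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


raw = "h\u00e9llo".encode("utf-8")  # 6 bytes: "é" occupies bytes 1 and 2
print(truncate_utf8(raw, 2).decode("utf-8"))  # h
print(truncate_utf8(raw, 3).decode("utf-8"))  # hé
```

A naive raw[:2] would end in the middle of "é" and fail to decode, which is the encoding error the note refers to.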

API highlights

  • Document(html: str) / Document.from_html(html) parses once and keeps the DOM.
  • .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .text returns normalized text; .html returns the element's HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers to keep the event loop responsive.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Top-level helpers mirror the class methods: parse(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
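
Because Document and Element share .select(css), and each Element offers .text and .attr(name), a small helper can work on either. The sketch below is a hypothetical convenience function, not part of scraper_rs, and relies only on the methods listed above.

```python
from typing import Any


def links_as_rows(root: Any, css: str = "a[href]") -> list[dict[str, Any]]:
    """Collect text and href for every element matching css under root.

    root is any object offering .select(css), such as a Document or an
    Element; each match is expected to provide .text and .attr(name).
    """
    rows = []
    for element in root.select(css):
        rows.append({"text": element.text, "href": element.attr("href")})
    return rows
```

With scraper_rs installed, this would be called as links_as_rows(Document(html)), or on a single Element to restrict the search to that element.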

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rs-*.whl

If you have the just command runner installed, the repo includes helper recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Projects Using scraper-rs

  • silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, and pytest for tests.

  • Run tests: just test or uv run pytest tests/test_scraper.py
  • Format/typing: The codebase is small; formatters are not strictly enforced yet.
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.

Download files

Source distribution:

  • scraper_rust-0.2.17.tar.gz (37.3 kB)

Built distributions:

  • scraper_rust-0.2.17-cp314-cp314t-win_amd64.whl (608.9 kB): CPython 3.14t, Windows x86-64
  • scraper_rust-0.2.17-cp314-cp314t-macosx_11_0_arm64.whl (647.1 kB): CPython 3.14t, macOS 11.0+ ARM64
  • scraper_rust-0.2.17-cp310-abi3-win_amd64.whl (610.3 kB): CPython 3.10+, Windows x86-64
  • scraper_rust-0.2.17-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (733.6 kB): CPython 3.10+, manylinux (glibc 2.17+) x86-64
  • scraper_rust-0.2.17-cp310-abi3-macosx_11_0_arm64.whl (649.2 kB): CPython 3.10+, macOS 11.0+ ARM64