Python bindings around rust-scraper/scraper with PyO3

scraper-rs

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via sxd_html/sxd_xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]

For a runnable sample, see examples/demo.py.

Async usage

The scraper_rs.asyncio module wraps the top-level helpers to keep the event loop responsive. parse yields to the event loop between operations, while select/xpath run in a thread pool:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    doc = await scraping_async.parse(html)
    links = await scraping_async.select(html, "a[href]")
    print(doc.select_first(".item").text)  # First
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.).
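
To make the thread-pool idea concrete, here is a minimal sketch that uses only the standard library. It is not how scraper_rs.asyncio is implemented: LinkCollector, blocking_select_links, and select_links are hypothetical stand-ins, and the stdlib html.parser module takes the place of the Rust parser. It only illustrates how a blocking parse can be moved off the event loop with asyncio.to_thread.

```python
import asyncio
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Stand-in blocking parser from the stdlib (NOT scraper_rs)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every <a> tag that has one.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def blocking_select_links(html):
    """Synchronous work that would stall the event loop if run inline."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


async def select_links(html):
    # Hypothetical wrapper: run the blocking call on the default thread pool.
    return await asyncio.to_thread(blocking_select_links, html)


async def main():
    pages = [
        "<div><a href='/a'>First</a></div>",
        "<div><a href='/b'>Second</a><a href='/c'>Third</a></div>",
    ]
    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(select_links(page) for page in pages))


results = asyncio.run(main())
print(results)
```

While each page is being parsed on a worker thread, the event loop stays free to serve other coroutines, which is the property the async helpers aim for.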

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Will only find links in the first 100KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
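
The interplay of max_size_bytes and truncate_on_limit can be pictured with a small pure-Python sketch. This is an illustration under assumptions, not the library's code: limit_html is a hypothetical helper, the limit is assumed to count UTF-8 bytes, and ValueError is an assumed exception type (check the error scraper_rs actually raises before relying on it).

```python
def limit_html(html: str, max_size_bytes: int, truncate_on_limit: bool = False) -> str:
    """Reject or truncate oversized input (hypothetical helper, not scraper_rs API)."""
    data = html.encode("utf-8")
    if len(data) <= max_size_bytes:
        return html  # within the limit, nothing to do
    if not truncate_on_limit:
        # Assumed exception type, chosen only for this sketch.
        raise ValueError(f"HTML is {len(data)} bytes, limit is {max_size_bytes}")
    cut = max_size_bytes
    # data[cut] is the first byte that would be dropped. If it is a UTF-8
    # continuation byte (bit pattern 10xxxxxx), the cut falls inside a
    # multi-byte character, so step back to that character's first byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


# "é" occupies two bytes in UTF-8, so a 2-byte limit cannot include it.
print(limit_html("héllo", 2, truncate_on_limit=True))
print(limit_html("héllo", 3, truncate_on_limit=True))
```

Stepping back to a character boundary means the truncated result may be a few bytes shorter than the limit, but it always decodes cleanly.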

API highlights

  • Document(html: str) / Document.from_html(html) parses once and keeps the DOM.
  • .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .text returns normalized text; .html returns the element's HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers to keep the event loop responsive.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Top-level helpers mirror the class methods: parse(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
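
The close() / with pairing in the last bullet follows Python's context-manager protocol. The sketch below shows that protocol with a hypothetical stand-in class, ParsedResource, which is not part of scraper_rs; it only demonstrates that close() runs when the block exits, including when an exception is raised inside it.

```python
class ParsedResource:
    """Hypothetical stand-in for an object that owns a parsed DOM."""

    def __init__(self, html):
        self.html = html
        self.closed = False

    def close(self):
        # Drop the held data and mark the resource as released.
        self.html = None
        self.closed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.close()
        return False  # do not swallow exceptions


with ParsedResource("<p>hi</p>") as res:
    length = len(res.html)

try:
    with ParsedResource("<p>oops</p>") as failing:
        raise RuntimeError("selector failed")
except RuntimeError:
    pass

print(length, res.closed, failing.closed)
```

Using the with form is generally the safer choice in long-running scrapers, since cleanup does not depend on reaching an explicit close() call.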

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rs-*.whl

If you have the just command runner installed, the repo includes recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (build via the official maturin Docker image).
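
After installing a wheel, a quick check that the extension module can be found may save some debugging. The snippet below is a small sketch using only the standard library; is_importable is a hypothetical helper name, and scraper_rs is the module name stated in the Development section.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the named top-level module."""
    return importlib.util.find_spec(module_name) is not None


# Reports whether the scraper_rs module is visible to this interpreter.
print("scraper_rs importable:", is_importable("scraper_rs"))
```

If this reports False right after pip install, the wheel was probably installed into a different environment than the one running the check.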

Projects Using scraper-rs

  • silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, and pytest for tests.

  • Run tests: just test or uv run pytest tests/test_scraper.py
  • Format/typing: The codebase is small; formatters are not strictly enforced yet.
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.
