Python bindings around rust-scraper/scraper with PyO3

scraper-rs

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via xee-xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, prettify, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]
print(prettify(html))

For runnable samples, see examples/demo.py and examples/demo_prettify_url.py. Quick URL prettify demo:

python examples/demo_prettify_url.py https://example.com --max-lines 80

Async usage

The scraper_rs.asyncio module keeps the event loop responsive. Its parse function yields to the loop once (asyncio.sleep(0)) and then constructs a sync Document in the current thread, while the select/xpath helpers run in a thread pool. Parsed documents and elements are wrapped so that their selector methods are awaitable, which allows nested queries:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    async with await scraping_async.parse(html) as doc:
        items = await doc.select(".item")
        first_link = await items[0].select_first("a[href]")
        print(first_link.text)  # First

    links = await scraping_async.select(html, "a[href]")
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.). Async wrappers expose the underlying sync objects via .document and .element if you need direct access. AsyncDocument supports async with for automatic cleanup in coroutine code.
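
The two strategies described above (yield once and work in the current thread, or hand the blocking call to a thread pool) can be sketched with the standard library alone. This is an illustration of the general pattern, not the scraper_rs implementation: the _LinkCollector parser and the select_links_sync, select_links, and parse_then_select names are hypothetical stand-ins built on html.parser.

```python
import asyncio
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Hypothetical stand-in for a synchronous parser: collects <a href> values."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def select_links_sync(html: str) -> list:
    """Blocking helper, playing the role of a sync select call."""
    parser = _LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.hrefs


async def select_links(html: str) -> list:
    # Strategy 1: run the blocking work in the default thread pool
    # so other tasks on the event loop keep running.
    return await asyncio.to_thread(select_links_sync, html)


async def parse_then_select(html: str) -> list:
    # Strategy 2: yield to the event loop once, then work in the current thread.
    await asyncio.sleep(0)
    return select_links_sync(html)


async def main() -> list:
    pages = ["<a href='/a'>A</a>", "<a href='/b'>B</a><a href='/c'>C</a>"]
    # gather() returns results in the same order as the awaitables passed in.
    return await asyncio.gather(*(select_links(p) for p in pages))


results = asyncio.run(main())
print(results)  # [['/a'], ['/b', '/c']]
```

The thread-pool variant suits larger inputs, where parsing would otherwise block the loop for a noticeable time. The yield-once variant avoids thread hand-off overhead for small inputs.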

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Will only find links in the first 100KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
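
To make the boundary rule concrete, here is a minimal standard-library sketch of a size guard that either rejects oversized input or cuts it at a valid UTF-8 character boundary. The function names truncate_utf8 and apply_size_limit and the use of ValueError are assumptions for illustration. The library does this work on the Rust side and may raise a different exception.

```python
def truncate_utf8(data: bytes, limit: int) -> bytes:
    """Cut data to at most `limit` bytes without splitting a UTF-8 character."""
    if len(data) <= limit:
        return data
    end = limit
    # Continuation bytes have the form 0b10xxxxxx. Step back until `end`
    # points at the first byte of a character, so the prefix ends cleanly.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


def apply_size_limit(html: str, max_size_bytes: int, truncate_on_limit: bool = False) -> str:
    """Return html unchanged, truncated, or raise, depending on its encoded size."""
    raw = html.encode("utf-8")
    if len(raw) <= max_size_bytes:
        return html
    if not truncate_on_limit:
        # Placeholder exception type; the real library may use another one.
        raise ValueError(f"HTML is {len(raw)} bytes, limit is {max_size_bytes}")
    return truncate_utf8(raw, max_size_bytes).decode("utf-8")


# "é" takes two bytes in UTF-8, so a 2-byte limit cannot include it.
print(apply_size_limit("héllo", 2, truncate_on_limit=True))  # h
print(apply_size_limit("héllo", 3, truncate_on_limit=True))  # hé
```

Because the cut always lands on a character boundary, the truncated bytes decode without errors, though the result may be slightly shorter than the limit.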

API highlights

  • Document(html: str) / Document.from_html(html) parse HTML for CSS and keep the DOM; XPath parsing is initialized lazily on first XPath query.
  • .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .prettify() renders the current DOM as an indented string for readable output/debugging.
  • .text returns normalized text; Document.html is the original input HTML; Element.html is inner HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers to keep the event loop responsive.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Elements also expose .prettify() to format element HTML with indentation.
  • Top-level helpers mirror the class methods: parse(html), prettify(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
  • In async workflows, use async with await scraper_rs.asyncio.parse(html) as doc: ... for automatic AsyncDocument cleanup.
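
Two behaviors from the list above, lazy XPath initialization and explicit cleanup via close() or a with block, follow a common pattern that can be sketched as below. LazyDoc is a hypothetical stand-in built on xml.etree.ElementTree (which needs well-formed markup and supports only a subset of XPath). It is not the scraper_rs Document class and says nothing about how the library manages memory internally.

```python
import xml.etree.ElementTree as ET


class LazyDoc:
    """Hypothetical stand-in showing lazy tree construction and cleanup."""

    def __init__(self, markup: str) -> None:
        self.markup = markup
        self._tree = None  # built only when the first query needs it

    def _ensure_tree(self) -> ET.Element:
        if self.markup is None:
            raise RuntimeError("document is closed")
        if self._tree is None:
            self._tree = ET.fromstring(self.markup)
        return self._tree

    def xpath(self, expr: str) -> list:
        # ElementTree implements only a limited XPath subset.
        return self._ensure_tree().findall(expr)

    def close(self) -> None:
        # Drop references so the parsed tree can be garbage collected.
        self._tree = None
        self.markup = None

    def __enter__(self) -> "LazyDoc":
        return self

    def __exit__(self, *exc_info) -> None:
        self.close()


with LazyDoc("<body><div class='item'><a href='/a'>First</a></div></body>") as doc:
    links = doc.xpath(".//div[@class='item']/a")  # the tree is built here
    print([a.get("href") for a in links])  # ['/a']
# Leaving the with block calls close(), releasing the tree.
```

Deferring the second tree keeps construction cheap for callers that only use one query style, and the context manager makes the release point explicit.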

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have the just command runner installed, the repo includes helper recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Projects Using scraper-rs

  • silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, pytest, and pytest-asyncio for tests.

  • Run tests: just test or uv run pytest tests/
  • Format code: just fmt (or cargo fmt --all and uv run ruff format)
  • Lint Rust: just lint (or cargo clippy --all-targets --all-features -- -D warnings)
  • The PyO3 module name is scraper_rs; the Rust crate is built as a cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.
