Python bindings around rust-scraper/scraper with PyO3

scraper-rs

Python bindings for the Rust scraper crate via PyO3. It provides a lightweight Document/Element API with CSS selectors, XPath (via sxd_html/sxd_xpath), handy helpers, and zero Python-side parsing work.

Quick start

from scraper_rs import Document, first, select, select_first, xpath

html = """
<html><body>
  <div class="item" data-id="1"><a href="/a">First</a></div>
  <div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())        # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)            # First
print(select_first(html, "a[href]").text)     # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]

For a runnable sample, see examples/demo.py.

Async usage

The scraper_rs.asyncio module wraps the top-level helpers to keep the event loop responsive. parse yields to the event loop between operations, while select/xpath run in a thread pool:

import asyncio
from scraper_rs import asyncio as scraping_async

html = "<div class='item'><a href='/a'>First</a></div>"


async def main():
    doc = await scraping_async.parse(html)
    links = await scraping_async.select(html, "a[href]")
    print(doc.select_first(".item").text)  # First
    print([link.attr("href") for link in links])  # ["/a"]


asyncio.run(main())

All async functions accept the same keyword arguments as their sync counterparts (max_size_bytes, truncate_on_limit, etc.).
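
The thread-pool pattern mentioned above can be sketched with the standard library alone. This is not how scraper_rs is implemented: extract_hrefs is a hypothetical stand-in for a blocking parse-and-select call (built here on html.parser), and asyncio.to_thread is assumed as the offloading mechanism.

```python
import asyncio
from html.parser import HTMLParser


class _HrefCollector(HTMLParser):
    """Collects href values from <a> tags (stand-in for a CSS query)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(html: str) -> list[str]:
    """Blocking work: parse the document and collect link targets."""
    collector = _HrefCollector()
    collector.feed(html)
    collector.close()
    return collector.hrefs


async def extract_hrefs_async(html: str) -> list[str]:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(extract_hrefs, html)


async def main() -> list[list[str]]:
    pages = [
        "<div class='item'><a href='/a'>First</a></div>",
        "<div class='item'><a href='/b'>Second</a></div>",
    ]
    # gather() schedules both parses concurrently and preserves input order.
    return await asyncio.gather(*(extract_hrefs_async(p) for p in pages))


results = asyncio.run(main())
print(results)  # [['/a'], ['/b']]
```

The same gather pattern should apply to the scraper_rs.asyncio helpers when several documents need to be queried at once.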

Large documents and memory safety

To avoid runaway allocations, parsing defaults to a 1 GiB cap. Pass max_size_bytes to override:

from scraper_rs import Document, select

doc = Document(html, max_size_bytes=5_000_000)  # 5 MB guard
links = select(html, "a[href]", max_size_bytes=5_000_000)
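
Because the limit is expressed in bytes, a string's character count can understate its size when the HTML contains non-ASCII text. The sketch below illustrates a fail-fast size check in plain Python. The check_size helper and the ValueError it raises are hypothetical; the exception type scraper_rs raises is not stated here, and the sketch assumes the limit is measured against the UTF-8 encoding.

```python
def check_size(html: str, max_size_bytes: int) -> int:
    """Return the UTF-8 size of html, or raise if it exceeds the limit."""
    size = len(html.encode("utf-8"))
    if size > max_size_bytes:
        raise ValueError(f"HTML is {size} bytes; limit is {max_size_bytes}")
    return size


ascii_html = "<p>cafe</p>"
accented_html = "<p>café</p>"

# Same number of characters, different number of bytes.
print(len(ascii_html), check_size(ascii_html, 1_000))        # 11 11
print(len(accented_html), check_size(accented_html, 1_000))  # 11 12
```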

If you want to parse a limited portion of an oversized HTML document instead of rejecting it entirely, use truncate_on_limit=True:

# Parse only the first 100KB of a large HTML document
doc = Document(large_html, max_size_bytes=100_000, truncate_on_limit=True)
links = doc.select("a[href]")  # Will only find links in the first 100KB

# Also works with top-level functions
items = select(large_html, ".item", max_size_bytes=100_000, truncate_on_limit=True)

Note: Truncation happens at valid UTF-8 character boundaries to prevent encoding errors.
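
To illustrate what truncating at a character boundary means, here is a minimal pure-Python sketch. It is not the library's code: truncate_utf8 is a hypothetical helper that backs the cut point up past any UTF-8 continuation bytes so that a multi-byte character is never split.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    cut = max(max_bytes, 0)
    # Continuation bytes have the form 0b10xxxxxx. If the first excluded byte
    # is one of them, the cut would split a character, so move it back.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


sample = "héllo"  # "é" occupies two bytes in UTF-8
print(truncate_utf8(sample, 2))  # h
print(truncate_utf8(sample, 3))  # hé
```

A cut that respects character boundaries can still fall in the middle of a tag or attribute, so the tail of a truncated document may parse differently from the original.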

API highlights

  • Document(html: str) / Document.from_html(html) parses once and keeps the DOM.
  • .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
  • .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
  • .text returns normalized text; .html returns the element's HTML.
  • scraper_rs.asyncio exposes async parse/select/xpath wrappers to keep the event loop responsive.
  • Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
  • Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
  • Top-level helpers mirror the class methods: parse(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
  • max_size_bytes lets you fail fast on oversized HTML; defaults to a 1 GiB limit.
  • truncate_on_limit allows parsing a truncated version (limited to max_size_bytes) of oversized HTML instead of raising an error.
  • Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.

Installation

Built wheels target abi3 (CPython 3.10+). To build locally:

# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rust-*.whl

If you have the just command runner installed, the repo includes recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).

Projects Using scraper-rs

  • silkworm - Async web scraping framework on top of Rust

Development

Requirements: Rust toolchain, Python 3.10+, maturin, and pytest for tests.

  • Run tests: just test or uv run pytest tests/test_scraper.py
  • Format/typing: The codebase is small; formatters are not strictly enforced yet.
  • The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.

Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.

Download files

Source Distribution

  • scraper_rust-0.2.20.tar.gz (41.3 kB): source

Built Distributions

  • scraper_rust-0.2.20-cp314-cp314t-win_amd64.whl (700.3 kB): CPython 3.14t, Windows x86-64
  • scraper_rust-0.2.20-cp314-cp314t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (836.7 kB): CPython 3.14t, manylinux (glibc 2.17+) x86-64
  • scraper_rust-0.2.20-cp314-cp314t-macosx_11_0_arm64.whl (726.7 kB): CPython 3.14t, macOS 11.0+ ARM64
  • scraper_rust-0.2.20-cp313-cp313t-win_amd64.whl (700.2 kB): CPython 3.13t, Windows x86-64
  • scraper_rust-0.2.20-cp313-cp313t-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (836.9 kB): CPython 3.13t, manylinux (glibc 2.17+) x86-64
  • scraper_rust-0.2.20-cp313-cp313t-macosx_11_0_arm64.whl (726.4 kB): CPython 3.13t, macOS 11.0+ ARM64
  • scraper_rust-0.2.20-cp310-abi3-win_amd64.whl (703.2 kB): CPython 3.10+, Windows x86-64
  • scraper_rust-0.2.20-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (826.5 kB): CPython 3.10+, manylinux (glibc 2.17+) x86-64
  • scraper_rust-0.2.20-cp310-abi3-macosx_11_0_arm64.whl (729.3 kB): CPython 3.10+, macOS 11.0+ ARM64
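
The wheel names listed here follow the standard name-version-python-abi-platform layout, which is how the abi3 wheels (one build for CPython 3.10 and later) can be told apart from the free-threaded 3.13t/3.14t builds (ABI tags ending in t). As a rough sketch, the hypothetical helper below splits a filename into those parts; it handles only the simple cases shown here and is not a substitute for a real packaging library.

```python
from typing import NamedTuple


class WheelTags(NamedTuple):
    distribution: str
    version: str
    python_tag: str
    abi_tag: str
    platform_tag: str


def parse_wheel_filename(filename: str) -> WheelTags:
    """Split a wheel filename into its dash-separated components."""
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) == 6:
        # An optional build tag sits between version and python tag; drop it.
        del parts[2]
    if len(parts) != 5:
        raise ValueError(f"unexpected wheel filename: {filename!r}")
    return WheelTags(*parts)


stable = parse_wheel_filename("scraper_rust-0.2.20-cp310-abi3-win_amd64.whl")
free_threaded = parse_wheel_filename(
    "scraper_rust-0.2.20-cp314-cp314t-macosx_11_0_arm64.whl"
)
print(stable.python_tag, stable.abi_tag)                # cp310 abi3
print(free_threaded.python_tag, free_threaded.abi_tag)  # cp314 cp314t
```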
