Python bindings around rust-scraper/scraper with PyO3
scraper-rs
Python bindings for the Rust scraper crate via PyO3. It gives you a lightweight Document/Element API with CSS selectors, XPath (via sxd_html/sxd_xpath), handy helpers, and zero Python-side parsing work.
Quick start
```python
from scraper_rs import Document, first, select, select_first, xpath

html = """
<html><body>
<div class="item" data-id="1"><a href="/a">First</a></div>
<div class="item" data-id="2"><a href="/b">Second</a></div>
</body></html>
"""

doc = Document(html)
print(doc.text)  # "First Second"

items = doc.select(".item")
print(items[0].attr("data-id"))  # "1"
print(items[0].to_dict())  # {"tag": "div", "text": "First", "html": "<a...>", ...}

first_link = doc.select_first("a[href]")  # alias: doc.find(...)
print(first_link.text, first_link.attr("href"))  # First /a
links_within_first = first_link.select("a[href]")
print([link.attr("href") for link in links_within_first])  # ["/a"]

# XPath (element results only)
xpath_items = doc.xpath("//div[@class='item']/a")
print([link.text for link in xpath_items])  # ["First", "Second"]
print(doc.xpath_first("//div[@data-id='1']/a").attr("href"))  # "/a"

# Functional helpers
links = select(html, "a[href]")
print([link.attr("href") for link in links])  # ["/a", "/b"]
print(first(html, "a[href]").text)  # First
print(select_first(html, "a[href]").text)  # First
print([link.text for link in xpath(html, "//div[@class='item']/a")])  # ["First", "Second"]
```
For a runnable sample, see examples/demo.py.
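As a further illustration, the selection calls from the Quick start can be combined into a small extraction helper. This is only a sketch: `rows_from_items` is a hypothetical name, not part of scraper_rs, and it relies solely on `select`, `select_first`, `attr`, and `text` as described in this README. It accepts any Document-like object, so it does not import the package itself.

```python
def rows_from_items(doc) -> list[dict]:
    """Collect id, href, and label for every `.item` in a Document-like object.

    `doc` is expected to be a scraper_rs Document; only .select(),
    .select_first(), .attr(), and .text are used.
    """
    rows = []
    for item in doc.select(".item"):
        # select_first returns None when nothing matches, so guard for it.
        link = item.select_first("a[href]")
        rows.append(
            {
                "id": item.attr("data-id"),
                "href": link.attr("href") if link is not None else None,
                "label": link.text if link is not None else None,
            }
        )
    return rows
```

With the package installed, `rows_from_items(Document(html))` on the Quick start markup should produce one dictionary per `.item` div.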
API highlights
- Document(html: str) / Document.from_html(html) parses once and keeps the DOM.
- .select(css) → list[Element], .select_first(css) / .find(css) → first Element | None, .css(css) is an alias.
- .xpath(expr) / .xpath_first(expr) evaluate XPath expressions that return element nodes.
- .text returns normalized text; .html returns the original input.
- Element exposes .tag, .text, .html, .attrs plus helpers .attr(name), .get(name, default), .to_dict().
- Elements support nested CSS and XPath selection via .select(css), .select_first(css), .find(css), .css(css), .xpath(expr), .xpath_first(expr).
- Top-level helpers mirror the class methods: parse(html), select(html, css), select_first(html, css) / first(html, css), xpath(html, expr), xpath_first(html, expr).
- Call doc.close() (or with Document(html) as doc: ...) to free parsed DOM resources when you're done.
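The context-manager form and the `.get(name, default)` helper listed above can be combined as in the sketch below. Both function names are hypothetical. `make_doc` is assumed to be `scraper_rs.Document`; it is passed in as a parameter so the helper can also be exercised with a stand-in class.

```python
def collect_attr(doc, css, name, default=""):
    """Return attribute `name` for each element matching `css`.

    Uses Element.get(name, default) so that elements lacking the
    attribute contribute `default` instead.
    """
    return [el.get(name, default) for el in doc.select(css)]


def collect_attr_and_close(make_doc, html, css, name, default=""):
    """Parse `html`, run the query, and release the DOM on exit.

    `make_doc` is expected to be scraper_rs.Document (an assumption of
    this sketch); the `with` block releases the parsed DOM when it exits.
    """
    with make_doc(html) as doc:
        # The values are collected into a list before the block exits.
        return collect_attr(doc, css, name, default)
```

For the Quick start markup, `collect_attr_and_close(Document, html, ".item", "data-id")` would be expected to return the two data-id values.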
Installation
Built wheels target abi3 (CPython 3.10+). To build locally:
```shell
# Install maturin (uv is used in this repo, but pip works too)
pip install maturin

# Build a wheel
maturin build --release --compatibility linux

# Install the generated wheel
pip install target/wheels/scraper_rs-*.whl
```
If you have the just command runner installed, the repo includes helper recipes: just build (local wheel), just install-wheel (install the built wheel), and just build_manylinux (via the official maturin Docker image).
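After installing the wheel, a quick check that the module is visible to the current interpreter can save some debugging. The sketch below uses only the standard library; `is_importable` is a hypothetical helper name, and `scraper_rs` is the module name given in this README.

```python
import importlib.util


def is_importable(module_name: str = "scraper_rs") -> bool:
    """Report whether `module_name` can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Prints True once the wheel is installed in the active environment.
print("scraper_rs importable:", is_importable())
```

If this prints False right after a successful pip install, the wheel was probably installed into a different environment than the one running the check.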
Development
Requirements: Rust toolchain, Python 3.10+, maturin, and pytest for tests.
- Run tests: just test or uv run pytest tests/test_scraper.py
- Format/typing: Rust and Python are small; no formatters are enforced yet.
- The PyO3 module name is scraper_rs; the Rust crate is built as cdylib.
Contributions and issues are welcome. If you add public API, please extend tests/test_scraper.py and the example script accordingly.
File details
Details for the file scraper_rust-0.2.4.tar.gz.
File metadata
- Download URL: scraper_rust-0.2.4.tar.gz
- Upload date:
- Size: 24.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | c0ecc2778e974fa07ae96d0a080533bfd7196c69a6d24f02edaa976271d8b839 |
| MD5 | d49c4bcbf84ddf36b5bf94d44f506543 |
| BLAKE2b-256 | 4812e5f97677741f3a3b061639b5ec9a61d3b844b34865d9425b0144a457e8a4 |
File details
Details for the file scraper_rust-0.2.4-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: scraper_rust-0.2.4-cp310-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 734.9 kB
- Tags: CPython 3.10+, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.10.2
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 653edccfcf197ffee01c61c631b746a03a2cf16fcc301b17659c8d4474714525 |
| MD5 | 0b8756db787d0c5ca21b8f1be3acd357 |
| BLAKE2b-256 | f3464aef504166fe219073315725427852f5404b27ecc4da209bf443268738d2 |