# philiprehberger-web-scraper

Lightweight web scraper with rate limiting and CSS selectors.
## Install

```shell
pip install philiprehberger-web-scraper
```
## Usage

```python
from philiprehberger_web_scraper import Scraper

scraper = Scraper(rate_limit=2.0, retry_attempts=3)

# Fetch a single page
page = scraper.get("https://example.com")
titles = page.select_all("h2.title")
link = page.select_one("a.next")
all_links = page.links()

# Extract data
data = []
for el in page.select_all(".product"):
    name = el.select_one(".name").text
    href = el.select_one("a").attr("href")
    print(name, href)
    data.append({"name": name, "href": href})

# Crawl multiple pages
for page in scraper.crawl("https://example.com/blog", max_pages=20):
    for article in page.select_all("article"):
        print(article.select_one("h2").text)

# Export the collected records
Scraper.export_csv(data, "output.csv")
Scraper.export_json(data, "output.json")
```
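The export helpers take a list of record dicts. Their exact behavior isn't documented here, but stdlib-equivalent versions would look roughly like this (a sketch, not the package's actual code; `export_csv`/`export_json` below are re-implementations for illustration):

```python
import csv
import json


def export_csv(rows: list[dict], path: str) -> None:
    """Write a list of dicts to CSV, taking the header from the first row's keys."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)


def export_json(rows: list[dict], path: str) -> None:
    """Write a list of dicts as a JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, indent=2)
```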
## Features
- Built-in rate limiting (token bucket)
- Retry with backoff on 429/5xx errors
- CSS selector API wrapping BeautifulSoup
- Crawl mode with same-domain filtering
- Link and image extraction
- CSV and JSON export helpers
## Options

```python
Scraper(
    rate_limit=2.0,     # max requests per second
    retry_attempts=3,   # retries on failure
    retry_delay=1.0,    # base delay between retries (seconds)
    timeout=30.0,       # request timeout (seconds)
    headers={...},      # custom request headers
)
```
## API

| Function / Class | Description |
|---|---|
| `Scraper(rate_limit, retry_attempts, retry_delay, timeout, headers)` | Web scraper with rate limiting, retry, and CSS selector extraction |
| `Page` | A fetched web page with `select_one()`, `select_all()`, `links()`, `images()`, and `title`/`text` properties |
| `Element` | Wrapper around a parsed element with `text`, `html`, `attr()`, `select_one()`, `select_all()` |
## Development

```shell
pip install -e .
python -m pytest tests/ -v
```
## License

MIT
## File details

Details for the file `philiprehberger_web_scraper-0.1.6.tar.gz`.

### File metadata

- Download URL: philiprehberger_web_scraper-0.1.6.tar.gz
- Upload date:
- Size: 5.2 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `b92c2def7cf103fd46296aa2241b454610867382ea5ce2ce57388b22c7ddbb28` |
| MD5 | `514a38f4d8f80b146715df9a86f68206` |
| BLAKE2b-256 | `0ef478cff5e3f89edc49548eab66de82591ab362e6536c5094451c3370f08aa1` |
## File details

Details for the file `philiprehberger_web_scraper-0.1.6-py3-none-any.whl`.

### File metadata

- Download URL: philiprehberger_web_scraper-0.1.6-py3-none-any.whl
- Upload date:
- Size: 5.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.12.13

### File hashes

| Algorithm | Hash digest |
|---|---|
| SHA256 | `7a2ad24323b0caee6427badd0315e053b7f8baefb10b85f4bd16480cf79db4b7` |
| MD5 | `3b21b2ef7d127cf5897cbc7ccd738341` |
| BLAKE2b-256 | `d1d3c81c7cad001573194ed40b6546cdf759be11546d1202ca9d37d8a4f54862` |