
Project description

philiprehberger-web-scraper


Lightweight web scraper with rate limiting and CSS selectors.

Installation

pip install philiprehberger-web-scraper

Usage

from philiprehberger_web_scraper import Scraper

scraper = Scraper(rate_limit=2.0, retry_attempts=3)

# Fetch a single page
page = scraper.get("https://example.com")
titles = page.select_all("h2.title")
link = page.select_one("a.next")
all_links = page.links()

# Extract data into a list of dicts (used by the export calls below)
data = []
for el in page.select_all(".product"):
    data.append({
        "name": el.select_one(".name").text,
        "url": el.select_one("a").attr("href"),
    })

# Crawl multiple pages
for page in scraper.crawl("https://example.com/blog", max_pages=20):
    for article in page.select_all("article"):
        print(article.select_one("h2").text)

# Export
Scraper.export_csv(data, "output.csv")
Scraper.export_json(data, "output.json")
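
A Page also exposes title and text properties and an images() method (see the API table below). A minimal sketch, assuming title and text return strings and images() yields image URLs:

page = scraper.get("https://example.com")
print(page.title)        # the document <title>
print(page.text[:200])   # first 200 characters of visible text
for src in page.images():
    print(src)           # assumed here to be an image URL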

Retry with Backoff

Transient HTTP errors (429 and 503) are retried automatically with exponential backoff. Configure the number of attempts and base delay:

from philiprehberger_web_scraper import Scraper

scraper = Scraper(retry_attempts=5, retry_delay=2.0)
page = scraper.get("https://example.com/api")

Response Caching

Cache fetched pages to disk so repeated requests for the same URL skip the network entirely:

from philiprehberger_web_scraper import Scraper, ResponseCache

cache = ResponseCache(cache_dir=".scraper_cache")
scraper = Scraper(cache=cache)

page = scraper.get("https://example.com")  # fetches from network
page = scraper.get("https://example.com")  # served from disk cache

cache.clear()  # remove all cached responses
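
Conceptually, a disk-backed cache stores each response in a file keyed by a hash of its URL. The sketch below illustrates that idea only; it is not this package's ResponseCache implementation:

# Conceptual sketch of a disk-backed cache keyed by a URL hash (illustrative only).
import hashlib
import os

class SimpleDiskCache:
    def __init__(self, cache_dir=".scraper_cache"):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, url):
        # One file per URL, named by the SHA-256 of the URL.
        key = hashlib.sha256(url.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, key + ".html")

    def get(self, url):
        path = self._path(url)
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                return f.read()
        return None

    def put(self, url, body):
        with open(self._path(url), "w", encoding="utf-8") as f:
            f.write(body)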

Table Extraction

Pull an HTML table into a list of dicts using extract_table():

from philiprehberger_web_scraper import Scraper, extract_table

scraper = Scraper()
page = scraper.get("https://example.com/data")
rows = extract_table(page, "table#prices")
# [{"Product": "Widget", "Price": "$9.99"}, ...]

Following Paginated Links

Use follow_links() to crawl paginated content by following a CSS-selected link on each page:

from philiprehberger_web_scraper import Scraper

scraper = Scraper()
for page in scraper.follow_links("https://example.com/page/1", "a.next-page", max_pages=10):
    for item in page.select_all(".result"):
        print(item.text)
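
Pages yielded by follow_links() combine naturally with the exporters to collect results across a whole pagination run; the field name and output path below are only illustrative:

from philiprehberger_web_scraper import Scraper

scraper = Scraper(rate_limit=1.0)
results = []
for page in scraper.follow_links("https://example.com/page/1", "a.next-page", max_pages=10):
    for item in page.select_all(".result"):
        results.append({"text": item.text})

Scraper.export_json(results, "results.json")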

Proxy Rotation

Distribute requests across multiple proxies by passing a list of proxy URLs:

from philiprehberger_web_scraper import Scraper

scraper = Scraper(proxies=[
    "http://proxy1:8080",
    "http://proxy2:8080",
    "http://proxy3:8080",
])
page = scraper.get("https://example.com")  # uses proxy1
page = scraper.get("https://example.com/2")  # uses proxy2

API

Function / Class and what it does:

Scraper(rate_limit, retry_attempts, retry_delay, timeout, headers, cache, proxies)
    Web scraper with rate limiting, retry, caching, and proxy rotation
Scraper.get(url)
    Fetch a single page with retry and optional caching
Scraper.get_json(url)
    Fetch JSON from a URL
Scraper.follow_links(start_url, selector, max_pages)
    Follow paginated links matching a CSS selector
Scraper.crawl(start_url, max_pages, same_domain, next_selector)
    Crawl pages starting from a URL
Scraper.export_csv(data, path)
    Export list of dicts to CSV
Scraper.export_json(data, path, indent)
    Export data to JSON
Page
    A fetched web page with select_one(), select_all(), links(), images(), and title/text properties
Element
    Wrapper around a parsed element with text, html, attr(), select_one(), select_all()
ResponseCache(cache_dir)
    Disk-backed response cache with get(), put(), and clear() methods
extract_table(page, selector)
    Extract an HTML table into a list of dicts
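
Scraper.get_json() is listed above but not shown in the earlier examples. A minimal sketch, assuming it returns the parsed JSON body (a dict or list) and using a hypothetical endpoint:

from philiprehberger_web_scraper import Scraper

scraper = Scraper()
data = scraper.get_json("https://example.com/api/items.json")  # hypothetical endpoint
for item in data:
    print(item)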

Development

pip install -e .
python -m pytest tests/ -v

Support

If you find this project useful:

⭐ Star the repo

🐛 Report issues

💡 Suggest features

❤️ Sponsor development

🌐 All Open Source Projects

💻 GitHub Profile

🔗 LinkedIn Profile

License

MIT

Download files

Download the file for your platform.

Source Distribution

philiprehberger_web_scraper-0.2.0.tar.gz (11.8 kB)

Uploaded Source

Built Distribution


philiprehberger_web_scraper-0.2.0-py3-none-any.whl (7.8 kB)

Uploaded Python 3

File details

Details for the file philiprehberger_web_scraper-0.2.0.tar.gz.

File metadata

File hashes

Hashes for philiprehberger_web_scraper-0.2.0.tar.gz

SHA256: cadbfe1914df4c371b5dd09535aa19d19ac9c27d39e19225cfd99e15991dac84
MD5: 5b00b0771528e20c6c189ac0721a8da4
BLAKE2b-256: 547af8c76fc5b699c085c4b48da27ffa2ac162868a17e3a7e24bc3c0e478c8ee


File details

Details for the file philiprehberger_web_scraper-0.2.0-py3-none-any.whl.

File metadata

File hashes

Hashes for philiprehberger_web_scraper-0.2.0-py3-none-any.whl

SHA256: 330c6ecd32d00cab75b2e968293711ab76e3ff71c98abc638bce7517df45279e
MD5: 3261a2aa24f0bb9b83f1e8b651d0257b
BLAKE2b-256: 4b8ace24ace731318658494e5e594791b83dc1e25a842b4b9c3d13ffaa9154ec

